# Number compression for prime number storage

Page 1 of 1 [ 16 posts ]

Hi, finding prime numbers is what I consider great fun, but around 30 seconds or so of exciting waiting eventually netted me 256 MB of data containing the prime status of the first 2^31 - 1 numbers. Basically a bit stream that starts like this: 0011010100, which says 0 and 1 aren't prime, 2 and 3 are, and so forth.
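For reference, the kind of sieve that produces such a bit stream can be sketched in a few lines of Python (names here are illustrative, not from the actual program used):

```python
def prime_bits(limit):
    """Sieve of Eratosthenes: return a '0'/'1' string where
    position i is '1' iff i is prime (for 0 <= i < limit)."""
    is_prime = bytearray([1]) * limit
    is_prime[0:2] = b"\x00\x00"                # 0 and 1 are not prime
    for p in range(2, int(limit ** 0.5) + 1):
        if is_prime[p]:
            for m in range(p * p, limit, p):   # cross off multiples of p
                is_prime[m] = 0
    return "".join("1" if b else "0" for b in is_prime)

print(prime_bits(10))  # the stream above: 0011010100
```

Packed eight flags to a byte, 2^31 such bits come to the 256 MB mentioned above.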

In order to find all the primes like this I would need infinite time and memory, so my idea was that if I reduced memory used to a finite amount and science eventually found the eternal life potion, I might actually succeed. But "jokes" aside, we came up with a few ways to store the primes in a more compressed fashion.

First, dumping the primes as a string in radix 10. About 1 GB for 1 B numbers (or maybe it was 2 B, I don't remember; it took forever to save).

Second, dumping the prime status as bits. 250 MB for 2 B numbers.

Third, dumping the primes themselves as bits. A number stating the amount of primes within a range, then the primes. That allows primes below a certain size in bytes to take up only that much space. Unfortunately, without being heavily optimized, this took up more space than the second method. I haven't finished writing the fully optimized version, but it might still be bad, at least for lower numbers.

Fourth, taking on primes 256 at a time: first we store the amount of primes in that range, then only the last byte of each prime. Primes up to 1 B took up 53 MB, which is a lot less than the 128 MB the second method consumed, so quite a bonus. :3
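A sketch of how this fourth method might look; this is my reading of the description above, not the poster's actual code. Each 256-wide block gets one count byte followed by the low byte of each prime in the block:

```python
def encode_blocks(primes, limit, block=256):
    """For each [start, start + block) range: one byte holding the number
    of primes in the range, then one byte per prime (its offset from the
    block start, which for block=256 is just the prime's low byte)."""
    out = bytearray()
    i = 0
    for start in range(0, limit, block):
        chunk = []
        while i < len(primes) and primes[i] < start + block:
            chunk.append(primes[i] - start)   # low byte of the prime
            i += 1
        out.append(len(chunk))
        out.extend(chunk)
    return bytes(out)

# 10 primes below 30 -> 1 count byte + 10 data bytes
print(len(encode_blocks([2, 3, 5, 7, 11, 13, 17, 19, 23, 29], 256)))  # 11
```

At roughly one byte per prime plus one count byte per block, this matches the ~53 MB figure for primes up to 1 B.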

So, the topic. Anyone got any ideas, no matter how stupid, as to how primes can be stored? The approach is irrelevant, so long as it stores primes efficiently. Methods that only become efficient at larger numbers are of great importance too.

It's unclear what you mean by "store the primes". One very compact method would be to store a program that calculates the primes. That would be very efficient in data space, "storing" a number of primes limited only by the remaining age of the universe, at the cost of retrieval time.

Once stored, do you wish to retrieve primes merely sequentially, by sequence number (i.e. give me the 76,987,346th prime) or by magnitude (i.e. what is the next prime after 65,874,230?).

I suspect you could come up with a technique that accommodates both the latter retrieval techniques by storing a binary tree of blocks of each 10,000 primes, say, with each block saying what its first prime is, and the block itself storing the gap lengths between the primes in that block, in an optimal **Huffman coding**.
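A toy version of that block idea, with tiny blocks, plain integer gaps instead of Huffman codes, and names of my own invention, showing how it supports both lookup styles:

```python
import bisect

def build_blocks(primes, k=4):
    """Group primes into blocks of k: (first prime, gaps to the rest)."""
    blocks = []
    for i in range(0, len(primes), k):
        chunk = primes[i:i + k]
        blocks.append((chunk[0], [b - a for a, b in zip(chunk, chunk[1:])]))
    return blocks

def nth_prime(blocks, n, k=4):
    """Lookup by (0-based) sequence number: index into the right block."""
    first, gaps = blocks[n // k]
    return first + sum(gaps[:n % k])

def next_prime_after(blocks, x):
    """Lookup by magnitude: binary search on block heads, then scan gaps."""
    firsts = [b[0] for b in blocks]
    i = max(bisect.bisect_right(firsts, x) - 1, 0)
    for first, gaps in blocks[i:]:
        p = first
        if p > x:
            return p
        for g in gaps:
            p += g
            if p > x:
                return p
    return None

blocks = build_blocks([2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37], k=4)
print(nth_prime(blocks, 5), next_prime_after(blocks, 14))  # 13 17
```

In a real version the gap list inside each block would be Huffman-coded and the block heads arranged in a search tree, but the two lookup paths are the same.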

_________________

"Striking up conversations with strangers is an autistic person's version of extreme sports." Kamran Nazeer

There is a wikipedia article on Wheel Factorization that you might find useful.

The simplest form of wheel factorization is really easy to explain: store the numbers as bits, but don't bother with the even ones, since they are all multiples of two, and therefore not prime. If you applied just this to your second method, you'd go from 128MB to 64MB, and if you ignored the multiples of 2 or 3, you would get down to about 42MB, and if you ignore multiples of 2, 3 and 5, you get down to about 34MB. You can, in general, ignore multiples of the first n primes, although the complexity of the scheme goes up as n goes up.

If you pick the first 3 primes, 2, 3, and 5, then for every 30 numbers, you only have to store 8 bits, which is convenient for packing bits into bytes.
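A quick sanity check of those size estimates, counting the residues coprime to the wheel modulus (128 MB being the flat one-bit-per-number size for 1 B numbers; the function name is mine):

```python
from math import gcd

def wheel_sizes(base_mb=128):
    """MB needed for 1 B numbers when multiples of the wheel primes are skipped:
    only residues coprime to the wheel modulus need a bit."""
    results = {}
    for wheel in ((2,), (2, 3), (2, 3, 5)):
        modulus = 1
        for p in wheel:
            modulus *= p
        survivors = sum(1 for r in range(modulus) if gcd(r, modulus) == 1)
        results[wheel] = base_mb * survivors / modulus
    return results

print(wheel_sizes())  # {(2,): 64.0, (2, 3): 42.66..., (2, 3, 5): 34.13...}
```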

I don't think your first or third methods can be improved very substantially, but the fourth is rather interesting. There are about 51 million primes < 1 billion, which explains the ~53MB file size. If you apply the 'no multiples of two' idea, you can up your range to 512 numbers at a time, which will cut down on the overhead a bit, but since you're storing the last 8 bits for ~51 million numbers, you only have about 4MB of overhead, so that's just an incremental improvement. There are 'prime gaps', where no primes occur, and they can get arbitrarily large, so it might help if your encoding scheme can say something like 'skip this many blocks', although that also only affects the overhead.

_________________

"A dead thing can go with the stream, but only a living thing can go against it." --G. K. Chesterton

Thanks for the input. Number theory is definitely not my strongest field. I'll look into wheel factorization and Huffman coding.

I haven't decided on exactly how the primes need to be dug out of the data structure. The first idea, with bit storage that increases in size dynamically, allows for really easy binary search once the length of the prime has been found (and since that's just exponent incrementation, it should be fast as well).

I like the idea of cutting out the even numbers, but while less size is good, I'm worried doing a search will require multiplication if multiples of 3 and 5 are removed as well. I'm not sure if the fourth method can really benefit from it though. Since it stores a byte, discarding the last bit would only be a 1/8 improvement of the size.

As for prime gaps, no empty blocks of 256 were found below 1 billion.

What's a better approach in your opinion? The way I'm reading it you're telling me I'm doing something pointless, which should be kinda obvious given the subject. It's solely for creating a data structure capable of handling this. :3

I don't get what your method is. Any chance you could paint the bits for the first occurring chunk? <.< I prefer looking at bits. Or a nice drawing.

As far as I know there's very little gained from standard compression.

The program used for 256 numbers per chunk is available here. https://github.com/Mortal/primes

Bloom


**Quote:**

What's a better approach in your opinion? The way I'm reading it you're telling me I'm doing something pointless, which should be kinda obvious given the subject. It's solely for creating a data structure capable of handling this. :3

Why do people climb Mount Everest? Because it's there.

**Quote:**

I'm not quite sure exactly what you mean, but it might be because the wheel factorization idea isn't quite clear, so I'll explain that just in case.

2*3*5 = 30, so all composites < 30 have a prime factor <= 5. If we have a number of the form 30n + offset, and the offset shares a factor with 30 (that is, it's divisible by 2, 3, or 5), then the whole number is divisible by 2, 3, or 5 as well, so it's composite (for n >= 1; the primes 2, 3, and 5 themselves only show up in the very first chunk). The only offsets left to worry about are those coprime to 30.

So if it's of the form 30n + {1, 7, 11, 13, 17, 19, 23, 29} then it might be prime. So you can store each offset as a bit, and a chunk of 30 numbers as a byte, and you can have the nth byte store info on the nth chunk.

So 31, 37, 41, 43, 47, 53, 59, 61, 67, 71, 73, 79, 83, 89 as primes would be stored as:

byte 1: 11111011

byte 2: 11110111

The zero bits are because 49 and 77 aren't prime.
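The encoding above can be checked mechanically (`OFFSETS` and `chunk_byte` are my names for it):

```python
OFFSETS = [1, 7, 11, 13, 17, 19, 23, 29]   # residues mod 30 coprime to 30

def chunk_byte(n, primes):
    """Bit string for chunk n covering [30n, 30n + 29]:
    bit i is '1' iff 30n + OFFSETS[i] is prime."""
    return "".join("1" if 30 * n + o in primes else "0" for o in OFFSETS)

primes = {31, 37, 41, 43, 47, 53, 59, 61, 67, 71, 73, 79, 83, 89}
print(chunk_byte(1, primes))  # 11111011  (49 = 7*7 gives the zero)
print(chunk_byte(2, primes))  # 11110111  (77 = 7*11 gives the zero)
```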

**Quote:**

I did a bit of math on it, and the prime gaps aren't going to get big very fast.

The approximate density of primes is about 1/ln(n), so if most prime gaps are about that big, then prime gaps around a billion are about 20.7 in length, and to get prime gaps around 256 normally we'd have to go up to e^256, which is about 1.5*10^111.
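Checking the arithmetic (note that e^256 lands near 10^111, far beyond any table anyone would build):

```python
import math

def avg_gap(n):
    """Approximate average prime gap near n, per the prime number theorem."""
    return math.log(n)

print(avg_gap(1e9))     # ~20.7
# Average gaps of 256 would only show up around n = e^256:
print(math.exp(256))    # ~1.5e111
```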

So much less of a problem than I thought.

_________________

"A dead thing can go with the stream, but only a living thing can go against it." --G. K. Chesterton

**Code:** _(listing not preserved; only a closing brace survived)_

NB: I ended up using the library from http://code.google.com/p/primesieve/ which seems reasonably fast, but it meant rearranging the above to work with the callback function when a prime was found.

Compressing the output file with 7-zip made it 21.4% smaller (64.0MB to 50.3MB), so this is probably a sign the data could be packed further.

Going by your figures, the average gap between primes will be over 20 once you are up in the nine-digit primes. Recording gap lengths, Huffman-style, should require an average of less than four bits per prime - a factor of three improvement.

So far as the use of such recorded data, I see it as a very valid thing to use in order to answer a question such as: "What is the 95,324,714th prime?".

Questions such as "What is the first prime after (before) 1,567,434,120?" do not need such a table, as the cost of looking up the table becomes similar to the cost of just testing for primality.

_________________

"Striking up conversations with strangers is an autistic person's version of extreme sports." Kamran Nazeer

Oh yes I am!

<.<

Good taste you have there. Although I can assure you there's no need for hints of a prize, solving something geeky is a prize in itself. ;3

Reminds me, a study friend of mine brought a girl to algorithm and data structure class. If someone was trying to bait me, that would be the place to ask me to go.

My point is that iteration through the data structure should be done without multiplication, because multiplication becomes a burden later on with very large numbers. But perhaps this stuff centered around the number 30 could be done without it by just coding it manually. I'll have to look into that.

Edit: Never mind, doesn't seem to be an issue with that code. Searching should be in constant time. =/

Also, one byte per 30 numbers is still a linear data size. I'm thinking it might be good up to a certain point, after which storing the primes in chunks of 512 or something similar might save space.

Maybe several kinds of compression are required. I guess the standard compression formats could just be stacked on top of whatever compression we come up with here.

I've got no idea if this would really help, I'm not much of a number theorist, but...

First, we only need to write to the file the numbers that are actually prime. Ignore all non-prime numbers.

Second, we'll divide the positive integers into chunks consisting of an arbitrary amount of integers, and divide the file into such chunks, delimiting the file in this same way. For this I'm thinking chunks of 2^16 numbers would be good.

So, for instance, the start of the file will consist of the first chunk of numbers, ranging from 0 to 2^16-1.

Because all integers in that range can be stored in 2 bytes, you're never using more than two bytes per prime.

So say we know some prime numbers 1, 3, 5, 7, 11, 13, 17, 19, 23, 29...

So we store 2 bytes each (obviously not ASCII 1's and 0's)

0000000000000001

0000000000000011

0000000000000101

0000000000000111

0000000000001011

and so forth until we've covered all primes up to 2^16-1

Now we add some sort of delimiter to the file which designates that we are done with 0 to 2^16-1 and have moved on to the next range, 2^16 to 2^17-1. We store two bytes for each prime in this range as well, but with an offset of 2^16.

Next we move to a third segment, which goes from 2*2^16 up to 3*2^16-1, store all primes in that range as two-byte integers with offsets starting at 2*2^16, and so on...

I'm thinking this will save you some space because you are never storing more than two bytes per prime, though you will have some slight overhead from writing a delimiter into the file to mark the boundary between segments. This eliminates the need to store more and more bytes per number as you find larger and larger primes. And as the primes become more and more sparse and you start having multiple segments in a row with no prime number, whatever you use to delimit the segments will be uniform, so you'll end up with huge runs of consecutive delimiters, which would be highly compressible by something like ZIP or Huffman encoding.
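A minimal sketch of this segment scheme. One assumption of mine: a literal delimiter byte pair is ambiguous, since any two-byte value is also a valid offset, so this version writes a per-segment count instead; everything else follows the description.

```python
import struct

def encode_segments(primes, limit, seg=1 << 16):
    """Each segment of 2^16 numbers: a 4-byte prime count, then
    two bytes per prime holding its offset from the segment start."""
    out = bytearray()
    i = 0
    for start in range(0, limit, seg):
        offsets = []
        while i < len(primes) and primes[i] < start + seg:
            offsets.append(primes[i] - start)
            i += 1
        out += struct.pack("<I", len(offsets))                  # segment header
        out += b"".join(struct.pack("<H", o) for o in offsets)  # 2-byte offsets
    return bytes(out)

# 4 primes in one segment: 4-byte count + 4 * 2 bytes = 12 bytes
print(len(encode_segments([2, 3, 5, 7], 1 << 16)))  # 12
```

Empty segments still cost a header each, which is the uniform, highly compressible filler described above.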

Another thought - what if you simply stored a binary file consisting of a very long sequence of 0's and 1's. The first bit would indicate if the number 1 is a prime (it would be 1 because it is a prime), the 2nd bit would indicate if the number 2 is a prime (it would be 1 because 2 is in fact a prime), the 3rd bit would indicate if the number 3 is a prime (it would be a 1 because 3 is in fact a prime), and so on. Again, as you get further and further into large numbers, you'd end up with large chunks of consecutive 0's, which would be highly compressible with Huffman or something along those lines.

The other methods you mentioned were already discussed; however, this last one might be worth something. I hadn't considered how much might be saved later on with the bigger gaps.

Working on something else this moment, though, so it will have to wait a bit.

Nope. The number 1 is not prime. A prime is defined by having exactly two divisors, and 1 only has a single divisor.

_________________

"Striking up conversations with strangers is an autistic person's version of extreme sports." Kamran Nazeer

I realise this is a fairly old post, but I stumbled on this looking for something similar and thought you might be interested in my findings:

Assuming you want really large prime numbers, *and* you want random access to them:

I would encode every b-th prime as the number of "digits" base 256, plus the number of bytes required to store that number, plus a pointer to those bytes (arbitrary-precision arithmetic style), storing these index primes in an array of bins. Each (ib + k)-th prime (k < b) can then be stored and linked from the i-th bin.

Each of these k values can be stored as an arbitrary-precision integer, a delta-encoded prime gap, or a prime gap with entropy encoding, the entropy encoding being based on the probabilities of the difference between the actual gap and a log(n) approximation of it. If the last method is used then the encoding should be at least as good as the log(n) approximation, and theoretically much smaller. This probability can likely be approximated well using prime gap lower and upper bounds (see Wikipedia's "Prime gap" article for good bounds) and the log(n) prime gap approximation.

The n-th prime number can be found by locating (in constant time, one division) the i-th bin corresponding to that n, then decoding each delta and adding it to the prime stored in the bin until ib + k = n. Overall I think this should be O(b), so b shouldn't be too large, but it is constant.

A lookup to see if a number (P) is prime can be done by making upper and lower bound estimates based on a prime number approximation algorithm of your choice, then a binary search within that interval based on the number of digits, then bytes. This should be very fast, probably O(log(log(n))). Once the ith index before P is found, you have to sequentially decode the k primes until you find one after P (not prime), or P itself (is prime). Overall O(b log(log(n))).

The math involved for the probabilities etc. might be time consuming for large values, but that depends on the approximations chosen, and should be at worst (I guess) around O(log(n)log(log(n))). Overall the storage should be (if I'm right) roughly ~ O((n/b) log(n)+b log(log(n))) for n primes, compared to O(n log(n)) otherwise.
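A stripped-down sketch of the bin idea (small integers only, plain gaps instead of entropy coding, all names mine): store every b-th prime in full and the rest as gaps, giving O(b) lookup of the n-th prime.

```python
def build_bins(primes, b=64):
    """Anchor every b-th prime in full; store all consecutive gaps."""
    bins = primes[::b]
    gaps = [primes[i + 1] - primes[i] for i in range(len(primes) - 1)]
    return bins, gaps

def nth(bins, gaps, n, b=64):
    """0-based n-th prime: start at the nearest anchor, sum at most b-1 gaps."""
    q = n // b
    p = bins[q]
    for i in range(q * b, n):
        p += gaps[i]
    return p

primes = [2, 3, 5, 7, 11, 13, 17, 19, 23, 29]
bins, gaps = build_bins(primes, b=4)
print(nth(bins, gaps, 5, b=4))  # 13
```

The full proposal replaces the flat gap list with per-bin encoded deltas and adds the digit-count index for magnitude search, but the anchor-plus-gaps walk is the core of it.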

