#6
Can you explain why standard compression schemes like gzip or zlib are inadequate to the task?

Have you done tests on your actual data with gzip and zlib to see what the resulting size is? If so, what were the results (be specific), and how much more compression do you need or want?

What is the cost of not meeting the desired size goal? Inconvenience? Annoyance? $1/minute in extra connection charges?

Numbers are often much more compact in binary form. An IEEE-754 float is exactly 4 bytes, regardless of what value it represents (NaNs, infinities, denormals, etc.).

Depending on the values of the numbers, you can also get a big improvement by sending deltas instead of complete values. This can introduce a loss of precision if floating-point is involved, but it's worth considering, especially if there's structure to the data. For example, a series of 3D points can be encoded as an initial absolute point, with subsequent points stored relative to that point.

#9
And what were the results when using zip files?

Is the original order of the numbers important? If the numbers have a common element, such as some subset with the same X coordinate, then you can get an instant ~1/3 by sorting into planes, then sending one X value and a series of Y:Z values. The Y:Z values are subject to further compression.

That clarifies things, but you still haven't provided any real-world numbers for the amount of compression you hope to achieve. Nor have you stated what the consequences are for not meeting the desired amount. If it takes 1 millisecond too long to send, is it a hard real-time failure? Is it just annoying? What are the actual consequences? Be specific, even if you have to give a specific range. There's a big difference between millisecond-level crash-and-burn failures (e.g. engine-control systems) and "Oh, it took 30 seconds too long, I guess I'll be 30 seconds late for my haircut appointment."

How many bits or decimal digits is "decent" precision?

A float has about 7 decimal digits of precision (a 24-bit significand, of which 23 bits are stored). If you need less precision, you might get an instant ~1/4 compression simply by truncating the 32-bit float to 24 bits, discarding the 8 LSBs of the mantissa/significand. The remaining 15 stored significand bits (23 - 8) give somewhat less than 5 decimal digits of precision.

What is the target compression level, expressed as a number of bits per original number or coordinate?

I'm not sure if you're being vague because you haven't thought about these things, or because you're not sure what values to use. But if you're trying to solve a problem using computers, then accuracy and precision are important almost everywhere. Not just in specifying the accuracy and precision of numeric data, but accurately and precisely defining the problem that needs to be solved.

The simplest thing is to start with basic C programming, using basic functions like scanf(). That's more than enough to parse numbers from text into binary, at which point you can count the number of bytes, or fwrite() them to a file, or whatever.

Other languages can also read numbers and convert them to binary floating-point representation. Java can do it pretty easily, and you'll have to learn less about the actual layout of memory and about errors like buffer-overrun crashes. Other languages have similar capabilities; Java is just one choice with many different tutorials and books available for it.

I'm unclear about the necessity of Objective-C for this. Is it because you're trying to make a library that can be used by another program? What's the overall strategy, and where does "compress a gazillion numbers" fit into it?

#18
Definitely binary is the way to go.

Have you sorted the x,y,z coordinates and seen how they look?

I would suggest you copy your data into a spreadsheet program and sort by x, y, and z.

If you do find that they are quantized with a large number of points sharing the same ordinate in multiple dimensions, you could build up a tree. I believe this is exactly what chown33 was talking about. It will easily reduce the amount of data you need to transmit per point.

Also have a look at the precision again:

For the X,Y,Z numbers:

-9.2165003 is only -9.2165
-9.2180004 is only -9.218

The longest I could spot was:

-2.0952499, which in turn is only -2.09525

These slight errors are due to the floating point representation.

Your apparatus is not really measuring to the float's full precision, and I think double is overkill. You might be able to save some space by reconsidering the precision you actually require. You could even make the precision variable, determined from the actual data at runtime.

Of course, all this will make compression and decompression costly. So you will have to weigh in that cost vs the cost to just transmit the data.

Also, I would recommend that you do not try to reinvent the wheel here. The only chance you have of beating zlib or gzip is by using the contextual information you have about your data.

#21
Agreed 100%, you should stop thinking about it as a file and start thinking about it as an uncompressed binary data structure that resides in memory. Disk access for these still relatively large files will only slow you down. (The file is just a form of the binary data for human consumption.)

This of course assumes that the downloading code and the viewing code have access to the same memory, which might be a problem if you are using third-party code for your viewer.

NOTE: Assuming the grid size is always 0.00025, you could save yourself some space in the ASCII representation by simply not printing the digits beyond that. -2.0977499 would become -2.09775 with the proper formatter.

You can also save a bit more space by removing the need for a decimal point in each coordinate. Again, assuming that 0.00025 is the grid size, multiply all the coordinates by 100000 so -2.09775 would become -209775.

So you can easily drop 20-30% of the characters in the ASCII representation of the coordinates without loss of data for human viewing, but you would need to keep the multiplier around if it can vary.

Where possible it would still be better to save the indices that generate the grid points instead of the coordinates themselves. Again, assuming -9.2172499 was generated as -36869*0.00025, you would get real savings by saving only the -36869. The example you posted has all the X, Y, and Z coordinates "close" to each other. If this will always be the case, you can pack it in further by storing an offset, as was brought up earlier. Taking the -9 out of each X coordinate would turn -36869 into -869 (i.e. coordinate = -9 + (-869)*0.00025).

Note that since an int and a float are both 32-bit values, you don't gain anything just by switching from floating point to fixed point; you still have to pack the bits further somehow. (As someone else pointed out earlier, if you know you only need 18 bits for each int, you can build your own packed-bits data structure.)

B