Optimising Data

Discussion in 'Mac Programming' started by dmerc, Apr 13, 2011.

  1. dmerc macrumors newbie

    Apr 13, 2011
    Hi Everyone,

    I am new to this forum and newish to programming in Objective C, but have a few application ideas in mind that I want to try and make a start on. One of these is an application that involves compressing and optimising a text file (UTF8 encoded) into an NSData object of a much reduced file size.

    I have read about the GZIP and ZLIB compressions, but these seem to only be the same as a normal zip procedure. What I was hoping to do was convert the text file to a different binary encoding and then compress it, if possible.

    It may be the case that this kind of thing just isn't that straightforward, and if so, feel free to tell me up front. It could save me a lot of wasted time.
    Any help you guys and gals can give me would be really appreciated.
  2. Catfish_Man macrumors 68030


    Sep 13, 2001
    Portland, OR
    There's actually a fascinating ongoing contest to do this extremely well: http://prize.hutter1.net/

    Needless to say, people don't offer 50,000 euro prizes for doing easy things ;) it's pretty much at the cutting edge of certain areas of research.
  3. lee1210 macrumors 68040


    Jan 10, 2005
    Dallas, TX
    A priori do you know any special information about the contents of the file? The structure? Is there a limit to the characters that might appear? Is this "random" data or written English? Do you know anything about the distribution of the characters?

    Compressing data you know something about is a lot easier than having to compress a random stream of bytes. Are you looking for information on how to write your own compression algorithm and implement it, or are you just looking for suggestions of existing methodologies that you could use?

    Could the data be given structure that would allow for better compression? For example, is this a representation of structured data that could be "normalized" into tables like a relational database, and then a "dump" of this database could be compressed? A lot of compression algorithms will use a shorter series of bytes to represent a common longer series of bytes, so it may approximately do this for you, but handling this structure in advance can definitely help.

  4. dmerc thread starter macrumors newbie

    Apr 13, 2011
    Data Type

    Thanks for the replies so far, I will check out the links you posted.

    To answer the question about the data structure. It would be a long list of numerical values (floats) that correspond to coordinates for a point cloud.

    The files are space and/or tab delimited and usually several hundred megabytes in size. I really want to try to compress them to send them efficiently. I was thinking of an encode and decode program but rather than writing my own algorithms just trying to streamline using current ones.

    Hope this makes it a bit clearer what I am trying to achieve.
  5. chown33, Apr 13, 2011
    Last edited: Apr 13, 2011

    chown33 macrumors 604

    Aug 9, 2009
    Sailing beyond the sunset
    Can you explain why standard compression schemes like gzip or zlib are inadequate to the task?

    Have you done tests on your actual data with gzip and zlib to see what the resulting size is? If so, what were the results (be specific), and how much more compression do you need or want?

    What is the cost of not meeting the desired size goal? Inconvenience? Annoyance? $1/minute in extra connection charges?

    Numbers are often much more compact in binary form. An IEEE-754 float is exactly 4 bytes, regardless of what value it represents (NaN's, infinities, denormalized, etc.)

    Depending on the values of the numbers, you can also get a big improvement by sending deltas instead of complete values. Could introduce loss of precision if floating-point is involved, but worth considering, especially if there's a structure to them. For example, a series of 3D points can be encoded as an initial absolute point, and subsequent ones can be relative to that point.
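    For illustration, a minimal sketch of that delta idea in C. It assumes the coordinates have already been quantized to integer grid units (so the round trip is lossless); the values below are made up:

    ```c
    #include <assert.h>
    #include <stdio.h>

    /* In-place delta encoding: keep the first value absolute,
       store every later value as the difference from its predecessor. */
    void delta_encode(int v[], int n) {
        for (int i = n - 1; i > 0; i--)
            v[i] -= v[i - 1];
    }

    void delta_decode(int v[], int n) {
        for (int i = 1; i < n; i++)
            v[i] += v[i - 1];
    }

    int main(void) {
        /* Hypothetical X coordinates, pre-quantized to integer grid units. */
        int x[] = { -36868, -36866, -36868, -36869, -36872, -36869 };
        int n = (int)(sizeof x / sizeof x[0]);

        delta_encode(x, n);
        printf("deltas:");
        for (int i = 1; i < n; i++)
            printf(" %d", x[i]);   /* small numbers: cheap to pack or compress */
        printf("\n");

        delta_decode(x, n);
        assert(x[0] == -36868 && x[n - 1] == -36869);  /* lossless round trip */
        return 0;
    }
    ```

    On integers the transform is exactly reversible; on floats the subtraction itself can round, which is the precision loss mentioned above.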
  6. dmerc thread starter macrumors newbie

    Apr 13, 2011
    Zips or Binary

    To answer chown33's very valid questions :

    I haven't tried any samples of GZip or Zlib yet, mostly due to reading that they are effectively similar to .zip which is very readily available and renders producing another app for it a bit pointless, for my purposes.

    I am trying to create an efficient way of transmitting the data for time critical purposes, which also has a knock on effect of cost for the connection.

    I think the binary route for representing the numbers is really the way I would like to go, perhaps using deltas, but I need to be able to maintain a decent precision so I would like to have this as optional.

    My big problem is that currently I don't know how to convert the numbers to binary, being newish to programming. I am happy to read up about it and am not trying to get a quick fix, but I would like a pointer as to where to look to understand how to do it, ideally in an Obj-C friendly way!

    Thanks again for the help so far.
  7. balamw Moderator


    Staff Member

    Aug 16, 2005
    New England
    If you can post a sample of the kind of stream data you expect, we may be able to help more.

    Binary+zlib is probably a pretty good starting point.

    You probably want to packetize the stream somewhat to make it more robust. If you get corrupt data, you don't want to have to retransmit or throw it all out.

  8. chown33, Apr 13, 2011
    Last edited: Apr 13, 2011

    chown33 macrumors 604

    Aug 9, 2009
    Sailing beyond the sunset
    And what were the results when using zip files?

    Is the original order of the numbers important? If the numbers have a common element, such as some subset with the same X coordinate, then you can get an instant ~1/3 by sorting into planes, then sending one X value and a series of Y:Z values. The Y:Z values are subject to further compression.

    That clarifies things, but you still haven't provided any real-world numbers for the amount of compression you hope to achieve. Nor have you stated what the consequences are for not meeting the desired amount. If it takes 1 millisecond too long to send, is it a hard real-time failure? Is it just annoying? What are the actual consequences? Be specific, even if you have to give a specific range. There's a big difference between millisecond-level crash-and-burn failures (e.g. engine-control systems) and "Oh, it took 30 seconds too long, I guess I'll be 30 seconds late for my haircut appointment."

    How many bits or decimal digits is "decent" precision?

    A float has about 7 decimal digits of precision (23 bits). If you need less precision, you might get an instant ~1/4 in compression simply by truncating the 32-bit float to 24 bits by discarding the LSBs of the mantissa/significand. A 15-bit significand (23 - 8) is somewhat less than 5 decimal digits of precision.
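    A hedged sketch of that truncation in C, assuming IEEE-754 floats (standard on the Mac): zero the low 8 mantissa bits, after which only the top 3 bytes need to be transmitted.

    ```c
    #include <assert.h>
    #include <math.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    /* Zero the 8 least-significant mantissa bits of an IEEE-754 float,
       leaving a 15-bit significand; only the top 3 bytes then need be sent. */
    float truncate24(float f) {
        uint32_t bits;
        memcpy(&bits, &f, sizeof bits);  /* type-pun safely via memcpy */
        bits &= 0xFFFFFF00u;
        memcpy(&f, &bits, sizeof f);
        return f;
    }

    int main(void) {
        float x = -2.1182499f;            /* a value from the posted sample */
        float t = truncate24(x);
        printf("%.7f -> %.7f (error %g)\n", x, t, (double)fabsf(t - x));
        assert(fabsf(t - x) < 1e-4f);     /* roughly 5 decimal digits survive */
        assert(truncate24(1.0f) == 1.0f); /* short mantissas pass through exactly */
        return 0;
    }
    ```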

    What is the target compression level, expressed as a number of bits per original number or coordinate?

    I'm not sure if you're being vague because you haven't thought about these things, or because you're not sure what values to use. But if you're trying to solve a problem using computers, then accuracy and precision are important almost everywhere. Not just in specifying the accuracy and precision of numeric data, but accurately and precisely defining the problem that needs to be solved.

    The simplest thing is to start with basic C programming, using basic functions like scanf(). That's more than enough to parse numbers from text into binary, at which point you can count the number of bytes, or fwrite() them to a file, or whatever.
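    To make that concrete, a small C sketch of the scanf()-style approach, using the line format from the sample posted later in the thread (the field layout is an assumption based on that sample):

    ```c
    #include <assert.h>
    #include <stdio.h>

    /* One point: four floats plus three color bytes = 19 bytes of payload. */
    typedef struct { float x, y, z, intensity; unsigned char rgb[3]; } Point;

    /* Parse one ASCII line "X Y Z intensity R G B"; returns 1 on success. */
    int parse_point(const char *line, Point *p) {
        unsigned r, g, b;
        if (sscanf(line, "%f %f %f %f %u %u %u",
                   &p->x, &p->y, &p->z, &p->intensity, &r, &g, &b) != 7)
            return 0;
        p->rgb[0] = (unsigned char)r;
        p->rgb[1] = (unsigned char)g;
        p->rgb[2] = (unsigned char)b;
        return 1;
    }

    int main(void) {
        const char *line = "-9.2170000 -2.1182499 -0.0927500 0.0460975 177 189 175";
        Point p;
        assert(parse_point(line, &p));
        /* 4*4 + 3 = 19 bytes of binary payload vs. ~55 bytes of ASCII.
           fwrite(&p.x, sizeof(float), 4, f); fwrite(p.rgb, 1, 3, f);
           would emit the record to a file. */
        printf("binary record: %zu bytes\n", 4 * sizeof(float) + 3);
        return 0;
    }
    ```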

    Other languages can also read numbers and convert to binary floating-point representation. Java can do it pretty easily, and you'll have to learn less about the actual layout of memory, and errors like buffer overrun crashers. Other languages have similar capabilities, Java is just one choice that has many different tutorials and books for it.

    I'm unclear about the necessity of Objective-C for this. Is it because you're trying to make a library that can be used by another program? What's the overall strategy, and where does "compress a gazillion numbers" fit into it?
  9. dmi macrumors regular

    Dec 21, 2010
    "point cloud" suggest that the order of the points is not significant, so if you sort the values and encode, say, the differences, that could eliminate some irrelevant information.
    Points in a cloud probably follow some characterizeable distribution, and if you normalize your values to that distribution, you could maximize the entropy of your encoding.
    Points in a cloud may have a limited precision that may differ from the precision of your float representation, so reducing the precision of your representation may eliminate irrelevant information.
    The exact position of individual points within the cloud may not be of as much interest as certain statistical properties of the point cloud.
    If so, you may be able to transmit only those properties of interest, which could result in a very large compression.
  10. dmerc thread starter macrumors newbie

    Apr 13, 2011

    I'll try to answer all the questions put to me as best I can.

    This is an example of some of the data (It is only the numbered parts that are of interest), which is from a laser scanning survey instrument and therefore the measurements / point positions are definitely of importance, hence the need to retain the precision.

    This is the header to the file.
    9001 (This is the number of columns of points)
    2458 (This is the number of rows of points)
    7.960000 -13.215000 -0.011000 (Transformation Parameter)
    0.989940 -0.141451 0.003366 (First line of 3x3 matrix)
    0.141451 0.989945 0.000144 (second line of 3x3 matrix)
    -0.003352 0.000333 0.999994 (third line of 3x3 matrix)
    0.989940 -0.141451 0.003366 0 (First line of 4x4 matrix)
    0.141451 0.989945 0.000144 0 (second line of 4x4 matrix)
    -0.003352 0.000333 0.999994 0 (third line of 4x4 matrix)
    7.960000 -13.215000 -0.011000 1 (fourth line of 4x4 matrix)

    These are the point strings (There will be 9001 x 2458 points):
    -9.2170000 -2.1182499 -0.0927500 0.0460975 177 189 175 (X, Y, Z, Intensity (0-1), R, G, B (0-255))
    -9.2165003 -2.1055000 -0.0882500 0.0445106 179 191 177
    -9.2170000 -2.0977499 -0.0802500 0.0510414 180 193 176
    -9.2165003 -2.0977499 -0.0900000 0.0482490 177 191 174
    -9.2172499 -2.0952499 -0.0795000 0.0500038 179 192 175
    -9.2180004 -2.0955000 -0.0827500 0.0497139 179 192 175
    -9.2172499 -2.0952499 -0.0860000 0.0493172 178 191 174......

    The result of using a .zip on a 198.9MB file was to compress it to 43.8MB. This isn't bad, but I would like to achieve better.

    I am looking to achieve between 5-10% of the original size. If there are ways to reduce it to lower than 5%, then great, but I suspect there aren't.

    A longer transmission period will not affect any engineering process (your 1 millisecond concept) but will affect a human decision-making process, and will also affect the next available time slot to send further data, thereby having a knock-on effect. Hence, whilst it does not have to be transmitted to the millisecond, there are still significant benefits in transmitting it as quickly as possible, although the world will not explode if it is a second late.

    Well this all depends on the type of points being sent. The example given above is a good indication of the decimal size. However, if the points are on a different co-ordinate system the X & Y numbers can be larger, something like 500000.000, 180000.000.

    Point well made. I am perhaps not fully aware of the options for precision. I would ideally like a double to represent each value, but I imagine that would be quite costly size-wise. Floats are the next best option. I also need to reconstruct the file correctly on decompression, with all the spaces and return values correctly placed. I'm not really sure how to achieve this yet. Any suggestions?

    This is great. It gives me somewhere to start out and learn a bit more about the binary angle. And it is compatible with the Cocoa/Xcode route I want to take. Much appreciated.

    I'm programming in Xcode and really only have experience in Obj-C, so I was hoping to keep to that track if possible, rather than meander into loads of different languages, as I am doing this as something more of a hobby, and can't commit thousands of hours to learning many languages.

    Thanks for the help so far guys, and I do appreciate you trying to get a handle on this to help me out. I'll try to be more thorough in future posts. Thanks.
  11. balamw Moderator


    Staff Member

    Aug 16, 2005
    New England
    Without zlib, in binary, using 32 bit floats, this would seem to be (9001 x 2458 x (4 x 32 + 3 x 8)) bits = 400 MiB per file. It would be ~double that using 64 bit doubles instead of floats.

    Was the 198.9 MB file a reduced sample?

    I don't think it's really worth building a color table as your colors only make up a relatively small portion of the data.

  12. lee1210 macrumors 68040


    Jan 10, 2005
    Dallas, TX
    Claude Shannon might have something to say about this:

    In the general case 5% of the original is very optimistic.

    You do have some decent structure to the data, though, so things might go better for you by changing this data before compressing it. For example:
    -9.2170000 -2.1182499 -0.0927500 0.0460975 177 189 175

    This is 55 characters long, taking 55 bytes in ASCII. If each floating point value was stored as a single precision IEEE-754 value (4 bytes), and each RGB component was stored as an unsigned char (single byte) you could get this down to 19 bytes. This is already down to about 35% of the original size, adding in "traditional" compression, even if the rate was only 50%, would get you to 17.5% of the original vs. the ~22% you are getting already. This is still a far cry from 10% of the original, but it's getting there. The compression might be much better than 50%, but hoping for 28.5% compression on binary data to get you to 10% of the original is pretty optimistic.

    There's a problem, though. -2.1182499 has 8 significant figures. This is more than can be accurately represented with a 32-bit float. Bumping to a double means each line will need 35 bytes. Still better than 55, but definitely not as good as 19.

    You might be able to leverage that you only have space, minus, the digits 0-9, '.' and a linebreak in your text, too. This is only 14 characters vs. the 256 that can be represented in ASCII. This means you could represent each character with only 4 bits instead of 8. This would get you down to 28ish bytes per line, which is better than storing the floating point values as double precision. This is close to a 50% reduction in space vs. the original representation; traditional compression estimated at 50% on top of that would put you at 25%, which is pretty close to the 22% you got with regular compression... which probably means the traditional compression already took advantage of this for you.
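    A rough C sketch of that 4-bits-per-character idea. The alphabet here (digits, space, minus, '.', newline) is 14 symbols, which still fits in 4 bits; input is assumed to contain only those characters:

    ```c
    #include <assert.h>
    #include <stdio.h>
    #include <string.h>

    /* The full alphabet of the text format: 14 symbols, each fits in 4 bits. */
    static const char SYMS[] = "0123456789 -.\n";

    static int code_of(char c) {
        const char *p = strchr(SYMS, c);
        return (p && c) ? (int)(p - SYMS) : -1;  /* -1: not in the alphabet */
    }

    /* Pack two symbols per byte; 0xF pads an odd-length tail. */
    size_t pack4(const char *text, unsigned char *out) {
        size_t n = strlen(text), j = 0;
        for (size_t i = 0; i < n; i += 2) {
            int hi = code_of(text[i]);
            int lo = (i + 1 < n) ? code_of(text[i + 1]) : 0xF;
            assert(hi >= 0 && lo >= 0);          /* input assumed valid */
            out[j++] = (unsigned char)((hi << 4) | lo);
        }
        return j;
    }

    size_t unpack4(const unsigned char *in, size_t len, char *out) {
        size_t j = 0;
        for (size_t i = 0; i < len; i++) {
            out[j++] = SYMS[in[i] >> 4];
            if ((in[i] & 0xF) != 0xF)
                out[j++] = SYMS[in[i] & 0xF];
        }
        out[j] = '\0';
        return j;
    }

    int main(void) {
        const char *line = "-9.2170000 -2.1182499 -0.0927500";
        unsigned char packed[64];
        char restored[64];

        size_t bytes = pack4(line, packed);
        printf("%zu chars -> %zu bytes\n", strlen(line), bytes);

        unpack4(packed, bytes, restored);
        assert(strcmp(restored, line) == 0);     /* lossless for this alphabet */
        return 0;
    }
    ```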

    At this point I would say... take a step back. What are the chances that you are going to do better than people that have been doing this for years? I'm not confident that saving 37% by using doubles and unsigned chars for your floating point and fixed point values respectively, then applying traditional compression, will beat simply giving a traditional method your data and letting it figure things out. Have you tried gzip and bzip2? It sounds like "getting it there fastest" is important, not "getting it there using the least amount of time on the circuit", which means processor time compressing and decompressing is as important as the time on the wire. As such, you may want to see what algorithms are available that can be decompressed "streamed", that is, decompressed as the data becomes available. I haven't needed to do this, so I don't know if this is possible with the "traditional" methods out there.

    For one, I would probably let go of the 5-10% hope. It may be possible, but I wouldn't hold my breath.

    On a last note, it may be possible that the X,Y,Z values need to be stored as double precision, but the intensity could have a lower precision (maybe just store the decimal portion in a 20-bit integer). This might get you to about 30 bytes per line without traditional compression. This might help. I think the better you could eliminate unnecessary pieces of data before applying traditional compression the better off you'll be.

  13. balamw Moderator


    Staff Member

    Aug 16, 2005
    New England
    In this sample, it looks like there may be a fair amount of repetition in the coordinates.

    Are they on a grid of some sort? It looks like they may be multiples of 0.00025 with roundoff. Can they be calculated from (and thus stored as) integers? If they can be calculated from int16s you might get a factor of two over using floats.

  14. ChrisA macrumors G4

    Jan 5, 2006
    Redondo Beach, California
    It looks like the x, y, z values can be converted to integers without loss of precision by a suitable linear function. Pick a function that maps the existing data to a minimum range; I bet at one time they were integers. After conversion, apply a bias so the range starts at zero. Next, store them in binary using only as many bits (not bytes) as it takes to cover the range. (I work in telemetry, and you should see all the 3-, 10- and 13-bit numbers we use.) Once you have the data in what I call "packed bits" format, apply the generic zlib compression, but don't expect much from it; packed bits are already very compact if the data are random. In fact, if the data are truly random, I think it is the best you can do.
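    A minimal sketch of the "packed bits" idea in C: an MSB-first bit writer and reader, exercised with made-up 10-bit grid indices (already biased so the range starts at zero):

    ```c
    #include <assert.h>
    #include <stdio.h>
    #include <string.h>

    /* Append the low `nbits` of `value` to a byte buffer, MSB first.
       The buffer must be zeroed beforehand (we only OR bits in). */
    void put_bits(unsigned char *buf, size_t *bitpos, unsigned value, int nbits) {
        for (int i = nbits - 1; i >= 0; i--) {
            if ((value >> i) & 1u)
                buf[*bitpos >> 3] |= (unsigned char)(1u << (7 - (*bitpos & 7)));
            (*bitpos)++;
        }
    }

    unsigned get_bits(const unsigned char *buf, size_t *bitpos, int nbits) {
        unsigned v = 0;
        for (int i = 0; i < nbits; i++) {
            v = (v << 1) | ((buf[*bitpos >> 3] >> (7 - (*bitpos & 7))) & 1u);
            (*bitpos)++;
        }
        return v;
    }

    int main(void) {
        /* Hypothetical biased grid indices; range 0..1023 fits in 10 bits. */
        unsigned vals[] = { 0, 869, 305, 1023, 512 };
        int nbits = 10;
        unsigned char buf[16];
        memset(buf, 0, sizeof buf);

        size_t w = 0;
        for (int i = 0; i < 5; i++) put_bits(buf, &w, vals[i], nbits);
        printf("5 values in %zu bits (%zu bytes)\n", w, (w + 7) / 8);

        size_t r = 0;
        for (int i = 0; i < 5; i++) assert(get_bits(buf, &r, nbits) == vals[i]);
        return 0;
    }
    ```

    Five 10-bit values land in 7 bytes instead of the 20 that five 32-bit ints would take.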

    All that said, I bet zlib does almost as well. It does something not much different when it converts common strings to table index values.

    This all depends on my guessing correctly that the x, y, z numbers all fall onto a 3D grid.
  15. chown33 macrumors 604

    Aug 9, 2009
    Sailing beyond the sunset
    The majority of the data appears to represent a volumetric image, or the rough equivalent of one. That is, a large number of points (possibly in a grid, but maybe not), each of which has coordinates (X,Y,Z), intensity, and RGB color. In short, a bunch of voxels.

    Assuming that the reason for minimizing transmission time is to let the recipient see the entire object sooner, one strategy is to switch to progressive data. That is, transmit the data in a way designed to present a rough approximation soonest. As additional data is received, it occupies positions "between" points already received, and produces a better refinement or more detail. This is also sometimes called "interlaced" data transmission.

    Progressive images are less common now than they used to be, mainly because line speed has increased. However, JPEG still has a progressive format, and some of the better image-production tools can create it.


    I'm also wondering how long it currently takes to send the 44 MB file, and reconstruct it on the receiving end. Is it a few seconds? A few minutes? Many minutes? Hours?

    What is the current line speed? What if you just doubled it?

    What is the human decision that must be made using the data? Is a progressive transmission worthwhile? That is, can a preliminary decision be made using an approximation of the overall "point cloud", or is it required that the complete set of data be used before ANY decision is rendered?

    Is there any subset of the data that can be used without having all the data? If not an "interlaced" data set, how about a complete subset, say 1/8 of the overall data cube. I.e.. cut each of X,Y,Z in half and send the first complete data subset occupying the upper left forward octant of the divided cube. That gives the recipient something to look at and decide on while the other octants are arriving.
  16. Jethrotoe macrumors regular

    May 24, 2009
    Somewhere over there.
    Wow, I'm taking a chance of jumping in here off topic to say that this is the best thread that I have read in the programming forum in terms of analyzing the OP's needs and methods of accomplishing the end result.

    Truly brilliant ideas here. I just learned a LOT on a lot of levels.
  17. SidBala, Apr 14, 2011
    Last edited: Apr 14, 2011

    SidBala macrumors 6502a

    Jun 27, 2010
    Definitely binary is the way to go.

    Have you sorted the x,y,z coordinates and seen how they look?

    I would suggest you copy your data to a spreadsheet program and sort in x, y and z.

    If you do find that they are quantized with a large number of points sharing the same ordinate in multiple dimensions, you could build up a tree. I believe this is exactly what chown33 was talking about. It will easily reduce the amount of data you need to transmit per point.

    Also have a look at the precision again:
    For the X,Y,Z numbers like:

    -9.2165003 is only -9.2165
    -9.2180004 is only -9.218

    the longest I could spot was:
    -2.0952499 which in turn is only -2.09525

    These slight errors are due to the floating point representation.

    Your apparatus is not really measuring to the float's full precision. I think double is overkill. You might be able to save some space by reconsidering the precision you actually require. You can even make it variable precision depending on the actual precision of the data determined at runtime.

    Of course, all this will make compression and decompression costly. So you will have to weigh that cost against the cost of just transmitting the data.

    Also, I would recommend that you do not try to reinvent the wheel here. The only chance you have of beating zlib or gzip is by using the contextual information you have about your data.
  18. dmerc thread starter macrumors newbie

    Apr 13, 2011
    Interlaced Transmission

    Firstly, can I thank everyone participating in this thread so far. I am learning a lot about how to tackle this problem and I think chown33's suggestion about producing an interlaced solution may be the way forwards.

    As some of you have already figured, correctly, the data does reside on a grid. It is effectively a panoramic volumetric image, where the header data states the position of the recording instrument and then every point is in a gridded pattern around that point (Imagine a panoramic photo where every pixel has a depth into the scene).

    Yes, it was a different file of a similar structure that I had to hand.

    I think all the statements about ZLib and GZip doing a pretty good job already are very valid, and so I think the interlaced option is maybe better, as really the intention is to allow visualisation of these points when received and so having a slowly 'in-filling' visualisation of the data may be a good way to go. There are some issues with doing this though.

    1) The gridded pattern must be maintained so that points appear in the correct grid position. This means that for every point not sent you would have to 'fill' the gaps of the grid with zero values for X, Y, Z (I think for intensity it would actually need to be 0.5). However, the RGB fields could be left completely blank, which would save considerable space. This could help compression significantly, as it would only be a single string type, but it would need to be clear which lines (positions in the grid) needed filling.

    2) The data would have to be sent so that every 10th line in the file, for example, was taken for the first send, then every 5th line, etc., until it was complete. It couldn't just be sent from the start line to the end line, as you would only see a very small portion of the panorama, for example a corner, which wouldn't be that useful to the recipient. They would need a sparse representation of the entire field of view (entire grid space) that slowly increased in resolution.
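    One hedged way to generate such an ordering in C, using power-of-two strides instead of the 10-then-5 scheme (the idea is the same: coarse passes first, each pass skipping rows already sent):

    ```c
    #include <assert.h>
    #include <stdio.h>
    #include <stdlib.h>

    /* Fill `order` with a progressive row ordering: every 8th row first,
       then every 4th not yet sent, then every 2nd, then the remainder.
       Returns the number of rows emitted (always == rows). */
    size_t progressive_order(int rows, int order[]) {
        char *sent = calloc((size_t)rows, 1);
        size_t k = 0;
        for (int stride = 8; stride >= 1; stride /= 2)
            for (int r = 0; r < rows; r += stride)
                if (!sent[r]) { sent[r] = 1; order[k++] = r; }
        free(sent);
        return k;
    }

    int main(void) {
        int order[20];
        size_t k = progressive_order(20, order);
        assert(k == 20);
        for (size_t i = 0; i < k; i++)
            printf("%d ", order[i]);   /* coarse rows first, detail rows last */
        printf("\n");
        return 0;
    }
    ```

    Each row appears exactly once, so the receiver can fill its pre-allocated grid as rows arrive, with resolution improving pass by pass.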

    The next step on from this would be to write an openGL based viewer that allowed selection of certain areas prior to sending so the user could be more selective, but that is the next step....

    Perhaps another option...
    It has just occurred to me that what could be done is a 'dummy' file, containing zero values, could be produced on the destination computer of the required length (9001 x 2458) and the decompressor could then overwrite the values of each line as the data was received. Perhaps then the user can write out a file whenever they need to and see as much detail as has been transmitted so far. This could be a good compromise.
    Would this be an issue to achieve? Presumably replacing lines successfully will be dependent on byte length, which may differ between the originating file and the destination file due to the zero values. I need a bit of guidance to understand this element.

    Thanks again everyone. Some really great posts and feedback for this thread. I really appreciate it.
  19. SidBala macrumors 6502a

    Jun 27, 2010
    I would suggest that you keep the data in memory. You can easily hold 400MB of data in the heap.

    Manipulating 400MB or even 100MB files for that matter will be very slow, especially if you want to do the interlacing solution, which will require random access.

    You should just hold it in memory until the user wants to write it out to the file.

    Also you might need to build a mesh to display that cloud in OpenGL. This is quite a complex task; OpenGL will not do this for you. I'm not sure if there are any simple methods to accomplish this.
  20. balamw, Apr 15, 2011
    Last edited: Apr 15, 2011

    balamw Moderator


    Staff Member

    Aug 16, 2005
    New England
    Agreed 100%, you should stop thinking about it as a file and start thinking about it as an uncompressed binary data structure that resides in memory. Disk access for these still relatively large files will only slow you down. (The file is just a form of the binary data for human consumption.)

    This of course assumes that the downloading code and the viewing code have access to the same memory, which might be a problem if you are using third-party code for your viewer.

    NOTE: Assuming the grid size is always 0.00025, you could save yourself some space in the ASCII representation by simply not printing the digits beyond that. -2.0977499 would become -2.09775 with the proper formatter.

    You can also save a bit more space by removing the need for a decimal point in each coordinate. Again, assuming that 0.00025 is the grid size, multiply all the coordinates by 100000 so -2.09775 would become -209775.

    So you can easily drop 20-30% of the characters in the ASCII representation of the coordinates without loss of data, for human viewing, but you would need to keep the multiplier around if it can vary.

    Where possible it would still be better to save the indices that generate the grid point instead of the coordinates themselves. Again, assuming -9.2172499 was generated as -36869*0.00025, you would get real savings by saving only the -36869. The example you posted has all the X, Y, and Z coordinates "close" to each other. If this will always be the case, you can pack it in further by storing an offset, as was brought up earlier: taking out the -9 from each X coordinate would turn -36869 into -869 (i.e. coordinate = -9 + (-869)*0.00025).

    Note that since an int and a float are both 32-bit values, you don't gain anything by the transition from floating to fixed point unless you pack the bits further somehow. (As someone pointed out earlier, if you know you only need 18 bits for each int you can make your own packed-bits data structure.)
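    A small C sketch of this index-plus-multiplier idea, assuming the 0.00025 grid guessed from the sample:

    ```c
    #include <assert.h>
    #include <math.h>
    #include <stdio.h>

    #define GRID 0.00025   /* assumed grid spacing, inferred from the sample */

    long to_index(double coord)   { return lround(coord / GRID); }
    double from_index(long index) { return index * GRID; }

    int main(void) {
        double x = -9.2172499;          /* from the posted data */
        long idx = to_index(x);
        long base = to_index(-9.0);     /* shared offset for nearby points */

        printf("index %ld, relative to -9: %ld\n", idx, idx - base);
        assert(idx == -36869);
        assert(idx - base == -869);     /* small enough for an int16 */
        assert(fabs(from_index(idx) - (-9.21725)) < 1e-12);
        return 0;
    }
    ```

    If the grid assumption holds, the round trip is exact to the grid resolution, and the small relative indices are what you would then bit-pack or compress.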

  21. subsonix macrumors 68040

    Feb 2, 2008
    The first thing to do, though it doesn't really count as compression, is this: if the file contains lots of ASCII values like -9.2165003, save them in binary form. That string is 10 bytes (11 including the separating space); stored as a float it's 4 bytes.
  22. PatrickCocoa macrumors 6502a

    Dec 2, 2008
    Back of the envelope

    Have you done even a back-of-the-envelope calculation comparing the savings of a more efficient way of transmitting data versus the cost of implementing it (your time, mainly)?

    I see this all the time with technical people - jump to the solution without careful consideration of the problem.
  23. Sydde macrumors 68020


    Aug 17, 2009
    One thing about programmers is we tend to think in terms of what we are familiar with. You have to use either a float (4 bytes) or a double (8 bytes) for a true binary representation of a non-integer number. But if your concern is building a file for transmission or storage, why should you be constrained by those sizes? As long as your code knows the format, your FP numbers can be the size you need: just take a double and store six or seven bytes (truncate), whatever minimum precision you can tolerate. Using six-byte abbreviated doubles and bytes for the last 3 values, your lines would be 27 bytes + probably a checksum byte, almost exactly half the original size. Obviously this would only be practical for a file, not memory.
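    A hedged C sketch of that truncated-double idea, assuming IEEE-754 doubles: keep the 6 most-significant bytes, zero-fill on read. Dropping 16 of the 52 mantissa bits leaves roughly 11 decimal digits.

    ```c
    #include <assert.h>
    #include <math.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    /* Keep only the 6 most-significant bytes of an IEEE-754 double. */
    void store6(double d, unsigned char out[6]) {
        uint64_t bits;
        memcpy(&bits, &d, sizeof bits);
        for (int i = 0; i < 6; i++)
            out[i] = (unsigned char)(bits >> (56 - 8 * i));
    }

    /* Rebuild the double, zero-filling the two dropped low bytes. */
    double load6(const unsigned char in[6]) {
        uint64_t bits = 0;
        for (int i = 0; i < 6; i++)
            bits |= (uint64_t)in[i] << (56 - 8 * i);
        double d;
        memcpy(&d, &bits, sizeof d);
        return d;
    }

    int main(void) {
        unsigned char packed[6];
        double x = -2.1182499;          /* from the posted sample */
        store6(x, packed);
        double y = load6(packed);
        printf("%.10f -> %.10f (error %g)\n", x, y, fabs(y - x));
        assert(fabs(y - x) < 1e-9);     /* ~11 decimal digits survive */
        return 0;
    }
    ```

    Extracting bytes through shifts keeps the format endian-independent, so sender and receiver need not share a byte order.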
  24. jared_kipe macrumors 68030


    Dec 8, 2003
    Actually, given the limited set of characters, you could do even better with Huffman coding.

    ESPECIALLY if certain characters were much more likely, hopefully 0 would be the most likely.

    You could either use statistical analysis on the frequency of characters, or make some educated guesses.
    0 is the most common; (-), ( ), and (.) occur on every line in this limited set; (1) and (2) should be common in data ranging from 0-255.
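    For illustration, a compact C sketch that builds Huffman code lengths from guessed symbol frequencies (the counts below are invented; real ones would come from scanning the file):

    ```c
    #include <assert.h>
    #include <stdio.h>

    #define NSYM 14   /* 0-9, space, '-', '.', newline */

    /* Build a Huffman tree from symbol frequencies and report each symbol's
       code length. O(n^2) pairwise merging is fine for 14 symbols. */
    void huffman_lengths(const long freq[NSYM], int len[NSYM]) {
        long f[2 * NSYM];
        int parent[2 * NSYM];
        int total = NSYM;
        for (int i = 0; i < NSYM; i++) { f[i] = freq[i]; parent[i] = -1; }

        for (int step = 0; step < NSYM - 1; step++) {
            int a = -1, b = -1;   /* two lowest-frequency parentless nodes */
            for (int i = 0; i < total; i++) {
                if (parent[i] != -1) continue;
                if (a == -1 || f[i] < f[a]) { b = a; a = i; }
                else if (b == -1 || f[i] < f[b]) b = i;
            }
            f[total] = f[a] + f[b];        /* merge into a new internal node */
            parent[total] = -1;
            parent[a] = parent[b] = total;
            total++;
        }
        for (int i = 0; i < NSYM; i++) {   /* code length = depth of the leaf */
            int d = 0;
            for (int p = parent[i]; p != -1; p = parent[p]) d++;
            len[i] = d;
        }
    }

    int main(void) {
        /* Made-up frequency guesses: '0' dominant, separators on every line. */
        long freq[NSYM] = { 900, 300, 250, 200, 180, 160, 150, 140, 130, 120,
                            400, 350, 380, 60 };
        int len[NSYM];
        huffman_lengths(freq, len);

        for (int i = 0; i < NSYM; i++)
            printf("sym %2d freq %4ld -> %d bits\n", i, freq[i], len[i]);

        /* The most frequent symbol never gets a longer code than the rarest. */
        assert(len[0] <= len[NSYM - 1]);
        return 0;
    }
    ```

    With '0' far more common than the other symbols, its code drops below the flat 4 bits per character, which is exactly where Huffman beats the fixed-width nibble scheme discussed earlier.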
