C: converting 31 digit binary coded decimal to ascii text
Nov 1, 2007, 12:25 PM
I have a task to convert binary coded decimal (BCD) to ascii text. I have to be able to handle 31 digits. An example of decimal 12345 in BCD is X'012345'.
A prototype piece of code I wrote handles up to 7 digits by grabbing each digit, left to right (shifting and ANDing where needed into a byte), and adding it to a LONG INT accumulator. The next time through the loop, the accumulator is multiplied by 10 and the next digit is added. Then I simply used PRINTF to output my value with all needed formatting. Works great for up to the max LONG INT value. (Sometimes I need signed, and sometimes I don't.)
However, my strategy needs to change to handle up to 31 digits. I suspect the best way to go would be to have a char array to hold the 31 digits and optional sign (and null term char), and work from right to left, so I always end up pointing at the sign position in case I need it. I would grab the last digit (the right half of the last input byte), add X'30', and set it into my array at the last position (just before the null term). Then back up a byte in the char array, take the left half of that same input byte, add X'30', and put it into the array. Then back up both pointers by one and repeat.
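For readers following along, the accumulator approach described above might look roughly like this. It is a minimal sketch, not the original prototype: the name `bcd_to_long` and its parameters are hypothetical, and it assumes unsigned packed BCD with two digits per byte, most significant digit first.

```c
#include <stddef.h>

/* Hypothetical helper: fold packed BCD (two digits per byte, most
   significant digit first) into an integer accumulator.
   Returns 0 on success, -1 if a nibble is not a decimal digit.
   The caller must keep nbytes small enough that the value fits in a
   long (4 bytes, i.e. 8 digits, is safe even for a 32-bit long). */
static int bcd_to_long(const unsigned char *bcd, size_t nbytes, long *out)
{
    long acc = 0;

    for (size_t i = 0; i < nbytes; i++) {
        unsigned hi = bcd[i] >> 4;     /* left digit: shift */
        unsigned lo = bcd[i] & 0x0F;   /* right digit: AND */

        if (hi > 9 || lo > 9)
            return -1;
        acc = acc * 10 + (long)hi;     /* multiply by 10, add next digit */
        acc = acc * 10 + (long)lo;
    }
    *out = acc;
    return 0;
}
```

Called on the three bytes X'012345', this stores 12345 in `*out`, which can then be handed to printf with `%ld`.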
(Or, I could initialize my char array to all X'30' and OR the digit in.)
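The right-to-left idea can be sketched as follows. This is an illustration under assumptions, not a definitive implementation: the function name `bcd_to_ascii` is hypothetical, the sign is passed in as a flag because the thread does not say how it is encoded in the input, and an ASCII execution character set is assumed so that ORing a digit into X'30' yields its character.

```c
#include <stddef.h>

/* Hypothetical helper: write the 2*nbytes digits of a packed BCD field
   into out, working right to left. out must hold 2*nbytes + 2 chars
   (sign slot, digits, NUL). Returns a pointer to the first character
   of the result (the sign if negative, otherwise the first digit), or
   NULL if a nibble is not a decimal digit. Leading zeros are kept. */
static char *bcd_to_ascii(const unsigned char *bcd, size_t nbytes,
                          int negative, char *out)
{
    char *p = out + 2 * nbytes + 1;          /* position of the NUL */
    const unsigned char *src = bcd + nbytes; /* one past the last byte */

    *p = '\0';
    while (src > bcd) {
        unsigned char b = *--src;
        unsigned lo = b & 0x0F;
        unsigned hi = b >> 4;

        if (lo > 9 || hi > 9)
            return NULL;
        *--p = (char)(0x30 | lo);   /* right digit of this byte */
        *--p = (char)(0x30 | hi);   /* left digit of this byte */
    }
    if (negative)
        *--p = '-';                 /* p has ended up at the sign slot */
    return p;
}
```

A 31-digit field packed into 16 bytes needs a 34-byte buffer here (32 nibbles, the sign slot, and the terminator). For the digits 0 through 9, ORing into X'30' and adding X'30' give the same result, so either variant from the post works.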
Since the input data is coming in on a file, I don't have to worry about endianness - it's just raw data in a buffer.
The program produces a new data file, with fixed length values that will be used as input into a Load job for an Oracle database.
Thoughts? Performance suggestions? The files I have to process are quite large. The current batch of data is about 750 MB.
Nov 1, 2007, 12:48 PM
How about using a lookup table to map 16 bits to 4 ASCII digits at a time? You could have a different lookup table for the last 4 digits to handle the sign.
b e n
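A rough sketch of the 16-bit lookup idea from this post, for comparison with the digit-by-digit approach. All names (`bcd4_table`, `bcd4_init`, `bcd_to_ascii_lut`) are hypothetical. The separate sign table is left out because the thread does not specify how the sign is encoded. The index is assembled from two bytes explicitly, so host endianness does not matter.

```c
#include <stddef.h>
#include <string.h>

/* 65536 entries of 4 chars each (256 KB). An entry whose first char is
   0 marks a 16-bit value that is not valid BCD. */
static char bcd4_table[65536][4];

/* Fill in the 10000 valid entries, X'0000' through X'9999'. */
static void bcd4_init(void)
{
    memset(bcd4_table, 0, sizeof bcd4_table);
    for (unsigned a = 0; a < 10; a++)
        for (unsigned b = 0; b < 10; b++)
            for (unsigned c = 0; c < 10; c++)
                for (unsigned d = 0; d < 10; d++) {
                    char *e = bcd4_table[(a << 12) | (b << 8) | (c << 4) | d];
                    e[0] = (char)('0' + a);
                    e[1] = (char)('0' + b);
                    e[2] = (char)('0' + c);
                    e[3] = (char)('0' + d);
                }
}

/* Convert nbytes of packed BCD (nbytes must be even) into 2*nbytes
   ASCII digits in out. No NUL is written. Returns 0 on success, -1 on
   an odd length or a non-digit nibble. */
static int bcd_to_ascii_lut(const unsigned char *bcd, size_t nbytes, char *out)
{
    if (nbytes & 1)
        return -1;
    for (size_t i = 0; i < nbytes; i += 2) {
        unsigned idx = ((unsigned)bcd[i] << 8) | bcd[i + 1];
        const char *e = bcd4_table[idx];

        if (e[0] == 0)
            return -1;
        memcpy(out, e, 4);
        out += 4;
    }
    return 0;
}
```

The table doubles as a validity check, since any pair of bytes containing A-F lands on an empty entry. Whether it beats the per-digit version is something only a measurement on the real data would show.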
Nov 1, 2007, 01:31 PM
Wait a sec, you have to do this in real life!?! This totally looks like a HW assignment! But I know you aren't a student (or at least I'm pretty sure, from previous posts).
Anyway, I think your strategy is good. The output should be right-aligned (padded with zeros or spaces, yes?). That's why you want to work right to left, right?
In terms of performance, your approach in C/C++ should be very fast. Faster than the printf version was. You might want to process the values in batches of, let's say, 10000 (or whatever): 1. Read 10000 input values into a buffer. 2. Process the 10000 into an output buffer. 3. Write the 10000 in the buffer out.
Then you could multi-thread it! I'd guess that's where you'd probably get a real speed increase because you'd be reading one batch and writing another batch at the "same time." I guess if somehow you were making very efficient use of the IO bandwidth, during either the read or write operation, then doing them at the "same time" would help.
The in-memory processing should be pretty insignificant compared to the I/O operations.
I'd try it with regular buffered I/O first, though, and see if it's fast enough.
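The read/process/write batching described in this post could be sketched with plain buffered stdio like this. The names `convert_file`, `convert_fn`, and `one_byte_record` are hypothetical, and the example converter is only there to make the sketch complete. A trailing partial record is silently dropped, which a real program would probably want to report.

```c
#include <stdio.h>
#include <stdlib.h>

/* Converts one fixed-length input record into one fixed-length output
   record. */
typedef void (*convert_fn)(const unsigned char *in, char *out);

/* Example converter for the sketch: 1-byte BCD record to 2 digits and
   a newline. It does not validate the nibbles. */
static void one_byte_record(const unsigned char *in, char *out)
{
    out[0] = (char)('0' + (in[0] >> 4));
    out[1] = (char)('0' + (in[0] & 0x0F));
    out[2] = '\n';
}

/* Hypothetical driver: read up to `batch` records of in_len bytes,
   convert each into out_len bytes, then write the whole batch.
   Returns the number of records converted, or -1 on allocation or
   write failure. */
static long convert_file(FILE *in, FILE *out, size_t in_len, size_t out_len,
                         size_t batch, convert_fn convert)
{
    unsigned char *ibuf = malloc(in_len * batch);
    char *obuf = malloc(out_len * batch);
    long total = -1;
    size_t n;

    if (ibuf != NULL && obuf != NULL) {
        total = 0;
        /* fread returns the number of complete records it read. */
        while ((n = fread(ibuf, in_len, batch, in)) > 0) {
            for (size_t i = 0; i < n; i++)
                convert(ibuf + i * in_len, obuf + i * out_len);
            if (fwrite(obuf, out_len, n, out) != n) {
                total = -1;
                break;
            }
            total += (long)n;
        }
    }
    free(ibuf);
    free(obuf);
    return total;
}
```

Both files should be opened in binary mode ("rb" and "wb"), which matters on Windows. The batch size is a tuning knob, and stdio already buffers underneath, so the gain from larger batches is best measured rather than assumed.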
Nov 1, 2007, 02:02 PM
b e n
An indexed lookup table is a good idea too - I could have a 10 character array and use the digit as an index into it. I think just adding X'30' might be faster though - leveraging the hardware character set.
Following up on what you suggested, a 2 byte lookup (X'0000' to X'9999') would work too. I could do some performance tests to compare that to the "add X'30' on a digit-by-digit basis" method.
While 31 digits are the max I will need to process, the norm will be 5 or 7 digits per field, so I don't know that there would be a gain - just due to the setup-logic for getting into a mode of working with quads of digits.
No, not a HW assignment. I wish. This is a commission I took for a local consulting/software company to convert their client's health care data into a format that can be analyzed with their (my client's) own software. My client doesn't have the expertise with the format of the incoming data and I do.
I posted on this some time ago I think. Anyway, at the risk of repeating myself, regular buffered I/O will most likely be fast enough. I've written this conversion once in Ruby already. It took 21 minutes to process a 136 MB file. With C, my estimate is that I can do the same file in less than 40 seconds, which is good enough for all interested parties.
Yes, the numeric data will be right justified and left padded with blanks. There's an optional decimal point I forgot to mention, but it really doesn't have a significant effect on the logic; just another IF statement and a move to the char array when and if it is present. Its presence is implied in the definition of the incoming field; it is not physically present in the data. It will need to be physically present in my output file though.
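A sketch of the output formatting described here: right-justify, left-pad with blanks, and insert the implied decimal point. The function name `format_field` and its parameters are hypothetical. It assumes the digits have already been converted to ASCII, that `scale` (the number of digits after the implied point) comes from the field definition, and that a minus sign goes immediately before the first digit.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: format ndigits ASCII digits into a fixed-width
   field of `width` chars, right-justified and blank-padded, with a
   decimal point inserted `scale` digits from the right (scale 0 means
   no point). Leading zeros are dropped, keeping one integer digit.
   No NUL is written. Returns 0 on success, -1 if the value does not
   fit or scale exceeds ndigits. */
static int format_field(const char *digits, size_t ndigits, size_t scale,
                        int negative, char *out, size_t width)
{
    size_t int_digits, lead = 0, need;
    char *p;

    if (scale > ndigits)
        return -1;
    int_digits = ndigits - scale;
    /* Skip leading zeros, but keep the last integer digit. */
    while (lead + 1 < int_digits && digits[lead] == '0')
        lead++;
    need = (ndigits - lead) + (scale > 0) + (negative != 0);
    if (need > width)
        return -1;

    memset(out, ' ', width);            /* left-pad with blanks */
    p = out + width - need;
    if (negative)
        *p++ = '-';
    memcpy(p, digits + lead, int_digits - lead);
    p += int_digits - lead;
    if (scale > 0) {
        *p++ = '.';                     /* make the implied point physical */
        memcpy(p, digits + int_digits, scale);
    }
    return 0;
}
```

For example, the digits "012345" with a scale of 2 in a 10-character field come out as four blanks followed by "123.45".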
As a side process of this effort, I'll produce a "bad data file" for records that don't conform to the format they should. For example, if a BCD field contains a non-digit (A-F), then the data is bad, and I'll write that record to the bad data file, along with a log of the error in yet another file that describes the exact error (record number, what offset in the record, expected value, found value, yada yada).
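The bad-data check might be sketched like this. The function names and the message wording are hypothetical; the thread only says the log should carry the record number, the offset, and the expected and found values.

```c
#include <stddef.h>
#include <stdio.h>

/* Returns the index of the first nibble that is not 0-9 (index 0 is
   the left nibble of the first byte), or -1 if every nibble is a
   decimal digit. */
static long bcd_first_bad_nibble(const unsigned char *bcd, size_t nbytes)
{
    for (size_t i = 0; i < nbytes; i++) {
        if ((bcd[i] >> 4) > 9)
            return (long)(2 * i);
        if ((bcd[i] & 0x0F) > 9)
            return (long)(2 * i + 1);
    }
    return -1;
}

/* Hypothetical log-line builder: if the field is bad, writes a message
   into msg and returns 1; otherwise leaves msg alone and returns 0.
   field_off is the field's byte offset within the record. */
static int describe_bad_field(char *msg, size_t msglen, long recno,
                              size_t field_off,
                              const unsigned char *bcd, size_t nbytes)
{
    long bad = bcd_first_bad_nibble(bcd, nbytes);
    unsigned char b;
    unsigned nib;

    if (bad < 0)
        return 0;
    b = bcd[bad / 2];
    nib = (bad % 2 == 0) ? (unsigned)(b >> 4) : (unsigned)(b & 0x0F);
    snprintf(msg, msglen,
             "record %ld, offset %lu: expected BCD digit 0-9, found X'%X'",
             recno, (unsigned long)(field_off + (size_t)(bad / 2)), nib);
    return 1;
}
```

The caller would write the message to the error log and the raw record to the bad data file, then move on to the next record.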
Should be fun! It might turn into a commercial product.
Nov 1, 2007, 03:28 PM
If you get a chance and it isn't too much trouble, it would be interesting if you posted the various optimizations you tried and what impact they have on the processing time.
My *guess* is that optimizing the in-memory processing won't have a large impact on the whole process because the I/O is taking up 90% of the total time. But it's just a guess, so it would be interesting to see what really happens.
Nov 1, 2007, 04:00 PM
I can do that. I have some timer macros I got from a C book I can add with conditional #ifdefs. It will be a good exercise for me. Good idea!
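The macros from the book are not shown in the thread, so the following is only a guess at what conditional timer macros of this kind might look like. The names `TIMING`, `TIMER_START`, `TIMER_STOP`, and `busy_sum` are all hypothetical. The sketch uses the standard `clock()` function, which the C standard defines as processor time, so time spent waiting on I/O may not show up in the numbers.

```c
#include <stdio.h>
#include <time.h>

#ifdef TIMING
/* Record a start time in a variable named after the timer. */
#define TIMER_START(name)  clock_t name##_t0 = clock()
/* Report the elapsed processor time since the matching TIMER_START. */
#define TIMER_STOP(name) \
    fprintf(stderr, "%s: %.3f s CPU\n", #name, \
            (double)(clock() - name##_t0) / CLOCKS_PER_SEC)
#else
/* With TIMING undefined, both macros compile to nothing. */
#define TIMER_START(name)  ((void)0)
#define TIMER_STOP(name)   ((void)0)
#endif

/* Example of wrapping a piece of work in the timer macros. */
static unsigned long busy_sum(unsigned long n)
{
    unsigned long s = 0;

    TIMER_START(busy_sum);
    for (unsigned long i = 0; i < n; i++)
        s += i;
    TIMER_STOP(busy_sum);
    return s;
}
```

Compiling with `-DTIMING` (or the equivalent define on the Windows compiler) switches the reporting on. Since the thread's open question is whether I/O dominates, a wall-clock timer around the whole run would be a useful complement to these per-routine figures.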
Nov 1, 2007, 04:19 PM
Sounds like a candidate for vectorization as well. You could read in a group of BCD strings and run one command on all of them at once to convert into a buffer and then move them down the pipeline.
That should make the process even faster on AltiVec/SSE-capable CPUs (and using the Accelerate framework would insulate you from low-level details).
Nov 1, 2007, 04:59 PM
Tell me more about vectorization.
I'll have tons of fixed length records, made up of several fixed length fields.
Some fields are BCD, some are character fields that need to be translated from a non-ASCII character set to ASCII, and some are just binary (ints). In my setup, I'll be determining each field's input type, length, and offset, its output type and length, the output buffer offset for the field, and which conversion routine (via function address) is needed. So, in theory, each row could be processed by multiple processors on a field-by-field basis, and/or each record could be processed by multiple processors. The order of the data in the input file and output file is irrelevant.
However, the kicker may be that while I'm writing this on a Mac (3 GHz with 4 processors), it will run in production under Windows XP.
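The per-field setup described in this post (input offset and length, output offset and length, and a conversion routine selected by function address) could be laid out roughly as below. Every name is hypothetical. `conv_copy` is only a stand-in for the character-set translation, whose table is not given in the thread, and sign handling is left out.

```c
#include <stddef.h>
#include <string.h>

/* A conversion routine: returns 0 on success, -1 on bad data. */
typedef int (*field_fn)(const unsigned char *in, size_t in_len,
                        char *out, size_t out_len);

/* One entry per field, built once during setup. */
struct field_desc {
    size_t in_off, in_len;     /* where the field sits in the input record */
    size_t out_off, out_len;   /* where it goes in the output record */
    field_fn convert;          /* conversion routine for this field */
};

/* Unsigned packed BCD to digits, right-justified, blank-padded. */
static int conv_bcd(const unsigned char *in, size_t in_len,
                    char *out, size_t out_len)
{
    char *p;

    if (out_len < 2 * in_len)
        return -1;
    memset(out, ' ', out_len);
    p = out + out_len - 2 * in_len;
    for (size_t i = 0; i < in_len; i++) {
        unsigned hi = in[i] >> 4, lo = in[i] & 0x0F;

        if (hi > 9 || lo > 9)
            return -1;
        *p++ = (char)('0' + hi);
        *p++ = (char)('0' + lo);
    }
    return 0;
}

/* Stand-in for character translation: copy, left-justified. */
static int conv_copy(const unsigned char *in, size_t in_len,
                     char *out, size_t out_len)
{
    if (out_len < in_len)
        return -1;
    memset(out, ' ', out_len);
    memcpy(out, in, in_len);
    return 0;
}

/* Run every field's converter over one record. */
static int convert_record(const struct field_desc *fields, size_t nfields,
                          const unsigned char *in, char *out)
{
    for (size_t i = 0; i < nfields; i++) {
        const struct field_desc *f = &fields[i];

        if (f->convert(in + f->in_off, f->in_len,
                       out + f->out_off, f->out_len) != 0)
            return -1;
    }
    return 0;
}
```

Because each record reads only its own input bytes and writes only its own output slot, splitting the work across processors by record (or by batch of records) needs no locking in the conversion itself. That part is plain C and should carry over from the Mac to Windows XP unchanged; only the threading layer would differ.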