C++ character conversions EBCDIC -> ASCII

Discussion in 'Mac Programming' started by toddburch, May 13, 2007.

  1. toddburch macrumors 6502a

    Joined:
    Dec 4, 2006
    Location:
    Katy, Texas
    #1
    I've written a C++ app to translate EBCDIC to ASCII. Being fairly new to C++, I spent quite a bit of time on it, but it's now working.

    As part of the process, I wrote a dumping routine to output the incoming data into a hex-dump format display, like this:

    Code:
    00000000  C3F4F4F0 F1F3F0F3 F1F9F525 480CF0F0  F7F0F840 40C240F4 40000000 00914C00  *...........%H......@@.@.@.....L.*
    00000020  00001CD5 D6D560C3 D6E5C5D9 C5C440C3  C8C1D9C7 C5E24000 0C00000C 00000C00  *......`.......@.......@.........*
    
    Above, the first field is the hex offset into the record, then the record itself, then the eyecatcher area on the right, wrapped in "*"s.
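
For readers who want to reproduce that layout, here is a minimal sketch of one way to format a single dump line. It is not the original routine: `dump_line` is a hypothetical name, and the 32-bytes-per-line width, the 4-byte grouping, and the rule that the eyecatcher shows a byte only when it is printable ASCII (a dot otherwise) are all inferred from the sample output above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <string>

// Hypothetical helper: formats up to 32 bytes as one dump line made of
// an 8-digit hex offset, eight 4-byte hex groups, and an eyecatcher in "*"s.
std::string dump_line(std::size_t offset, const unsigned char* data, std::size_t len)
{
    static const char hex[] = "0123456789ABCDEF";
    const std::size_t width = 32;            // bytes per line (assumed from the sample)
    assert(len <= width);

    char off[32];
    std::snprintf(off, sizeof off, "%08lX", static_cast<unsigned long>(offset));

    std::string line(off);
    std::string eye;
    for (std::size_t i = 0; i < width; ++i) {
        if (i % 16 == 0)     line += "  ";   // wider gap before each 16-byte half
        else if (i % 4 == 0) line += ' ';    // single space between 4-byte groups
        if (i < len) {
            unsigned char c = data[i];
            line += hex[c >> 4];
            line += hex[c & 0x0F];
            // Show the raw byte only if it is printable ASCII.
            eye += (c >= 0x20 && c < 0x7F) ? static_cast<char>(c) : '.';
        } else {
            line += "  ";                    // pad a short final line
            eye += ' ';
        }
    }
    line += "  *" + eye + "*";
    return line;
}
```

A caller would walk the record 32 bytes at a time, passing the running offset and the number of bytes left (capped at 32) for each line.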

    Unpacking the data was the toughest part: for instance, taking an incoming character like 0xF2 and turning it into the string "F2". I'll show you how I did it (a cut-down version for simplicity), and would like to get some feedback on how I might have done it either more efficiently, or "better"** by leaning more on C++. The comments in the code and the code itself tell the story of the issue I was having when picking up values of X'80' or larger. I would be most interested in simplifying this conversion process.

    Thanks for your feedback. I've a thick skin, so let me have it!

    Code:
    #include <iostream>
    using namespace std ; 
    
    #define SAMPLE "1 Ring To Rule Them All"  // Sample string to convert 
    #define HEXCHARS "0123456789ABCDEF"       // All valid hex chars 
     
    void to_hex(char *indata, int i) ;  // Function prototype.  -> data, character to convert. 
    
    int main (int argc, char * const argv[]) {
    	int i ; 
    
    	char outdata[((sizeof SAMPLE-1) * 2)+1] ;  // Double length + 1 for null terminator  
    	outdata[sizeof outdata-1] = '\0' ;   // Null terminate the string 
    	
        for (i = 0 ; i < sizeof SAMPLE-1 ; i++ ) {  // Run the entire input string. 
    		int c = SAMPLE[i] ;                   // Pick up a character 
    		// Picking up 0X80 or above causes a negative value.  
    		// So, add 256 to it to make it positive. 
    		if (c < 0) c += 256 ;       // make positive if picking up the byte caused the sign bit to propagate.
    		to_hex( &outdata[i*2], c) ; // Convert the byte just picked up 
    	} 
    	cout << '"' << SAMPLE << '"' << " converted is " << outdata << endl ;  // Normal Text
    	
    	// Now, show the scenario for a high value. 
    	char temp[3] ; 
    	temp[2] = '\0' ;
    	 
    	char z = 0xFF ; 
    	int j = (int) z ;  
    	
    	to_hex( temp , 0xFF ) ;  
    	cout << "The HIGHVALUE is " << 0xFF << ".  As an integer it is " << j << ". Converted it is " << temp << endl ; 	
    	return 0;
    }
    
    // Function to convert a byte to a hex displayable value. 
    void to_hex(char *indata, int byte) { 
    	indata[0] = HEXCHARS[ byte / 16] ;  // get left nibble 
    	indata[1] = HEXCHARS[ byte % 16] ;  // right nibble 
    	return ; 
    } 
    
    I was reading up on the C++ I/O model with its Formatted I/O and the Manipulators for converting the output stream to hex, but these are only for integers, and it seems like it would be more work to do that than what I've got so far.
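
For comparison, the manipulators can be made to work on bytes by casting each byte to an int before streaming it (streaming an unsigned char directly prints it as a character, not a number). A minimal sketch, with `to_hex_stream` as a hypothetical name that is not part of the program above:

```cpp
#include <cstddef>
#include <iomanip>
#include <sstream>
#include <string>

// Hypothetical helper: hex-encodes a buffer using stream manipulators.
std::string to_hex_stream(const unsigned char* data, std::size_t len)
{
    std::ostringstream out;
    // hex, uppercase and the fill character stay in effect once set.
    out << std::hex << std::uppercase << std::setfill('0');
    for (std::size_t i = 0; i < len; ++i) {
        // setw applies only to the next insertion, so it is repeated per byte.
        out << std::setw(2) << static_cast<int>(data[i]);
    }
    return out.str();
}
```

Whether this counts as less work than the table lookup is a matter of taste; it trades the HEXCHARS table for a handful of manipulators.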

    Thanks, Todd

    ** better = me writing less code
     
  2. rand0m3r macrumors regular

    Joined:
    Jun 30, 2006
    #2
    just out of interest, why did you write such a program? there are already C++ libraries (glibmm) that perform character conversions.
     
  3. toddburch thread starter macrumors 6502a

    Joined:
    Dec 4, 2006
    Location:
    Katy, Texas
    #3
    Like I said - I'm new to C++. I'm not even familiar with the STL, much less any other add-on libraries.

    While I am confident character translators are available, I'm not so confident binary data type converters are so readily available. For instance, on z/OS (an IBM mainframe operating system that runs on IBM's zSeries hardware), there are special data types stored in several different binary formats, just as there are under OS X and Windows. Hexadecimal Floating Point (HFP on the mainframe) would be one example. Packed Decimal is another. When a binary file containing these data types is transferred from one platform to another, that data has to be converted, following the rules for transformation defined on the source platform, in order to be used on the new platform. If everything was TEXT based, there would be no issue, or need for specialized type conversions.
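
To make the packed decimal case concrete, here is a rough sketch of a decoder. It assumes the usual layout of two decimal digits per byte with the sign in the low nibble of the last byte (C, A, E and F positive; D and B negative). `unpack_decimal` is a hypothetical name, and the sketch ignores any implied decimal point and does not validate the digit nibbles.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: decodes a packed decimal field into an integer.
// Every nibble is one decimal digit except the last, which is the sign.
long long unpack_decimal(const unsigned char* field, std::size_t len)
{
    assert(len > 0);
    long long value = 0;
    for (std::size_t i = 0; i < len; ++i) {
        value = value * 10 + (field[i] >> 4);          // high nibble is always a digit
        if (i + 1 < len)
            value = value * 10 + (field[i] & 0x0F);    // low nibble, except in the last byte
    }
    const unsigned sign = field[len - 1] & 0x0F;
    return (sign == 0x0D || sign == 0x0B) ? -value : value;
}
```

For example, the three bytes 12 34 5C decode to 12345 and the two bytes 12 3D decode to -123.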

    Does that answer your question?
     
  4. lazydog macrumors 6502a

    Joined:
    Sep 3, 2005
    Location:
    Cramlington, UK
    #4
    Hi
    I think I would modify your code this way:-

    Code:
    for (i = 0 ; i < sizeof SAMPLE-1 ; ++i )
    {
      unsigned char c = SAMPLE[ i ] ;
      *outdata++ = HEXCHARS[ c >> 4 ] ;
      *outdata++ = HEXCHARS[ c & 0xf ] ;
    }
    
    To avoid the c < 0 test I've used unsigned char.
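
A small sketch of why that works, with `byte_value` as a hypothetical helper name. Whether plain char is signed is up to the compiler, so converting a char straight to int can give a negative number for bytes of 0x80 and above, whereas going through unsigned char always yields 0 to 255.

```cpp
// Hypothetical helper: returns the byte value 0..255 of a char,
// whether or not plain char is signed on this platform.
int byte_value(char c)
{
    // The cast reinterprets the same 8 bits as unsigned before widening,
    // so no sign bit can propagate into the int.
    return static_cast<unsigned char>(c);
}
```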

    I've replaced byte % 16 with byte & 0xf and byte / 16 with byte >> 4. I'm pretty sure this is more efficient than using / and %.

    In your for loop you have i++. It's probably a good idea to get in the habit of using ++i here. Not that it makes much difference for the code you posted but in general there is a subtle difference between ++i and i++ (i++ results in a temporary).

    I got rid of to_hex(). It's such a small function anyway. If efficiency is your concern then the overhead of the function call is approaching that of the function itself.

    Hope this helps

    b e n
     
  5. toddburch thread starter macrumors 6502a

    Joined:
    Dec 4, 2006
    Location:
    Katy, Texas
    #5
    unsigned char! That is EXACTLY what I should have been using. And yes, the SHIFT and AND will be much more efficient than the integer and modulus division.

    So, for me, this is the first time to use all three of these features in C++. (Well, I think it's only my 3rd or 4th C++ program to write too...) I use these types of techniques all the time in assembler on the mainframe, but just didn't think to use them in this "high level language"!

    Good job!

    Last thing. You are using *outdata++, while I pass the address of an element in my call-by-reference function call, and then use array index notation for the assignments. I haven't used pointer arithmetic yet in C/C++. Will outdata need to be reset following this loop to point back to the start of the array? How does that work?

    Todd
     
  6. gnasher729 macrumors P6

    Joined:
    Nov 25, 2005
    #6
    Actually, any decent compiler will know how to perform a division and a modulo operation in the quickest possible way. In Xcode, use "Show Assembly Code" from the Build menu and have a look. Then have a look at how it performs a division by, say, ten.

    unsigned int divide_by_ten (unsigned int x) { return x / 10; }

    The assembler code will be a bit surprising. The best rule is to write code in the most readable way. Another good rule for fast code is write code in the same way as everyone else does, because then there is a good chance that the compiler knows the patterns that you use and compiles them in the best possible way.
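
Applied to the hex conversion in this thread, the point can be sketched as below. The four function names are made up for illustration. For unsigned values the two spellings give identical results, and an optimising compiler will typically generate the same instructions for both, although checking the assembly is the only way to be sure for a given compiler.

```cpp
// Hypothetical helpers: two spellings of the same nibble extraction.
unsigned high_nibble_div(unsigned char b)   { return b / 16; }    // readable form
unsigned high_nibble_shift(unsigned char b) { return b >> 4; }    // hand-optimised form
unsigned low_nibble_mod(unsigned char b)    { return b % 16; }
unsigned low_nibble_mask(unsigned char b)   { return b & 0x0F; }
```

The equivalence only holds for non-negative values. With a negative signed int, / and % round toward zero while >> and & operate on the bit pattern, which is one more reason to pick up the byte as unsigned char.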

    Only go to less readable code if (1) performance is critical and (2) you have _measured_ the performance and know what parts of a program are actually slow. Most of the time programs are not slow because code isn't optimised, but because someone does something stupid. Obviously if you have done something stupid, it's not in the parts that you think would be slow, but somewhere else. Profiling the program, for example with Shark on the Macintosh, will find it.
     
  7. pilotError macrumors 68020

    Joined:
    Apr 12, 2006
    Location:
    Long Island
    #7
    Not to go too off topic, but if you're doing a bunch of this type of stuff, we use something called fileport which deals with all those binary flavors pretty well.

    One thing to note: you may want to keep your translation tables external, in a flat file or XML file, so that you can change dictionaries. In some instances you may want to change, say, a curly brace to a blank; in other cases you may want a different character (going from ASCII to EBCDIC).
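
A minimal sketch of the table-driven idea, assuming a 256-entry lookup table indexed by the incoming byte. The names `XlateTable`, `make_partial_table` and `translate` are hypothetical, and the table is deliberately partial: it fills in only the space, hyphen, uppercase letters and digits of EBCDIC and maps everything else to a dot. A real table would be complete and could be loaded from a file instead of being built in code.

```cpp
#include <array>
#include <cstddef>
#include <string>

typedef std::array<unsigned char, 256> XlateTable;

// Hypothetical helper: builds a partial EBCDIC -> ASCII table.
XlateTable make_partial_table()
{
    XlateTable t;
    t.fill('.');                       // anything unmapped shows as a dot
    t[0x40] = ' ';
    t[0x60] = '-';
    for (int i = 0; i < 9; ++i)  t[0xC1 + i] = static_cast<unsigned char>('A' + i);  // A-I
    for (int i = 0; i < 9; ++i)  t[0xD1 + i] = static_cast<unsigned char>('J' + i);  // J-R
    for (int i = 0; i < 8; ++i)  t[0xE2 + i] = static_cast<unsigned char>('S' + i);  // S-Z
    for (int i = 0; i < 10; ++i) t[0xF0 + i] = static_cast<unsigned char>('0' + i);  // 0-9
    return t;
}

// Hypothetical helper: translates a buffer through the table, one lookup per byte.
std::string translate(const XlateTable& t, const unsigned char* data, std::size_t len)
{
    std::string out;
    out.reserve(len);
    for (std::size_t i = 0; i < len; ++i)
        out += static_cast<char>(t[data[i]]);
    return out;
}
```

Because the table is just data, a per-job override is a single assignment before translating, for example setting the entry for an opening curly brace (0xC0 in code page 037, as far as I know) to a blank.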

    Oh the fun of Mainframe interaction! LOL

    http://www.syncsort.com/products/ss/fp/home.htm

    As far as your code, looks fine... The hex dump stuff is very useful, we build pretty much the same things every time we put a new format online.
     
  8. lazydog macrumors 6502a

    Joined:
    Sep 3, 2005
    Location:
    Cramlington, UK
    #8
    Yes you'll need to reset the pointer every time you convert your sample to hex. Something like this:-

    Code:
    static char outdata_bfr[…] ;
    …
    char* outdata = outdata_bfr ;
    for (i = 0 ; i < sizeof SAMPLE-1 ; ++i )
    {
      unsigned char c = SAMPLE[ i ] ;
      *outdata++ = HEXCHARS[ c >> 4 ] ;
      *outdata++ = HEXCHARS[ c & 0xf ] ;
    }
    
    *outdata = '\0' ;
    
    If you want to use a function call to do the conversion then something like this perhaps:-

    Code:
    static char outdata_bfr[…] ;
    …
    char* outdata = outdata_bfr ;
    for (i = 0 ; i < sizeof SAMPLE-1 ; ++i )
        to_hex( outdata, SAMPLE[ i ] ) ;
    …
    …
    void to_hex( char* & indata, unsigned char byte )
    {
      *indata++ = HEXCHARS[ byte >> 4 ] ;
      *indata++ = HEXCHARS[ byte & 0xf ] ;
    }
    
    The advantage of using pointers here is that outdata gets incremented through the loop, so you don't need to recalculate the index each time round the loop like you had in your original loop, i.e. to_hex( &outdata[i*2], c).

    As an aside, I still think having a function call to convert 1 byte into hex is over the top. A function to convert a data buffer into hex is much more useful, e.g.

    Code:
    void to_hex( unsigned char* data, int size, char* hex_bfr ) ;
    
    or even

    Code:
    char* to_hex( unsigned char* data, int size ) ; // Creates and returns  the hex buffer
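
One possible body for the buffer version, as a sketch only. It swaps the returned char* for a std::string, which is my substitution and not what the signatures above say, so that the caller does not have to free anything.

```cpp
#include <cstddef>
#include <string>

// Sketch: converts a whole buffer to hex text, two characters per byte.
std::string to_hex(const unsigned char* data, std::size_t size)
{
    static const char hexchars[] = "0123456789ABCDEF";
    std::string hex;
    hex.reserve(size * 2);                   // one allocation up front
    for (std::size_t i = 0; i < size; ++i) {
        hex += hexchars[data[i] >> 4];       // left nibble
        hex += hexchars[data[i] & 0x0F];     // right nibble
    }
    return hex;
}
```

Taking the input as unsigned char means the sign test from the original loop is not needed anywhere.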
    
    hope this helps!

    b e n
     
  9. toddburch thread starter macrumors 6502a

    Joined:
    Dec 4, 2006
    Location:
    Katy, Texas
    #9
    Dang. A day late and a dollar short. ;)

    Yes, I had considered that, and will do something in that regard, be it a binary loadable table, or an xml file, or even a text-based file that could be parsed at runtime. I'm not sure how much flexibility is needed yet. I'm still in the early stages of writing. Primary objective - get it working. Then, add the fluff.

    Thanks for the code review. The full blown program is a bit more intense than this scaled down sample.

    What do "y'all" do? PM if you want.

    Todd
     
  10. SilentPanda Moderator emeritus

    Joined:
    Oct 8, 2002
    Location:
    The Bamboo Forest
    #10
    Oh EBCDIC how I loathe thee... :p

    I had to make a file on the PC but they stored some of the data in packed decimal... they were storing things like the day in 4 bytes (1 for month, 1 for day, 1 for century, and 1 for year)... why I have no clue aside from the fact that it was probably old code... I was being lazy and I only had to run the code the one time so I just made a conversion table in code to convert (for instance) 15 hex in EBCDIC to whatever the ASCII equivalent was and so on and so forth. At the time I didn't know enough about much of anything so I just found a conversion table online and coded to read that in... ah well.

    Glad to see you got something working though. Even if there are libraries out there that do it, there's nothing wrong with coming up with your own way now and again. Sometimes you can come up with a better way or something that fits your particular situation. If all the code in the world was already written we'd all be out of jobs!
     
  11. toddburch thread starter macrumors 6502a

    Joined:
    Dec 4, 2006
    Location:
    Katy, Texas
    #11
    Gotchya. You're using an extra pointer besides the array reference itself.

    Ok, just like I did Java a couple months ago, now I get to go research keywords like static. Fun, fun!

    Ok, I've been using:

    char *c ;

    and you are using:

    char* c ;

    What's the difference?

    I'll be getting rid of the to_hex() function. My hexprt() function is already doing all the dump formatting, so I can suck to_hex() up into that function. When I do that, it will be just as you describe in your void to_hex() function, but with only two parms (pointer, length), as it writes the dump directly to a particular file.

    Yes, this has all helped quite a bit. Thanks again for taking the time.

    Todd
     
  12. SilentPanda Moderator emeritus

    Joined:
    Oct 8, 2002
    Location:
    The Bamboo Forest
    #12
    I don't believe there is one. It's just preference and habit. Some people do String[] x and others String x[].

    I usually use char* c instead of char *c and String[] x instead of String x[]. Mostly because for me I feel it reads better. Stating what properties my variable is going to have and then what the name is and not mixing the two together.
     
  13. lazydog macrumors 6502a

    Joined:
    Sep 3, 2005
    Location:
    Cramlington, UK
    #13
    Yup, I prefer char* c too but it can lure you into a false sense of security:-

    char* c, d ;

    is not the same as

    char* c ;
    char* d ;

    You need to write:-

    char* c, *d ;

    which is a bit ugly in my opinion!
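
The compiler can confirm this. A small sketch using static_assert (C++11 or later), where the variable names are only for illustration:

```cpp
#include <type_traits>

char* c, d;    // the * binds to c only, so d is a plain char

static_assert(std::is_same<decltype(c), char*>::value, "c is a pointer to char");
static_assert(std::is_same<decltype(d), char>::value,  "d is a plain char, not a pointer");
```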

    b e n
     
  14. toddburch thread starter macrumors 6502a

    Joined:
    Dec 4, 2006
    Location:
    Katy, Texas
    #14
    I've changed my code (not the above snippet) to use pointers. Works great!

    Now that my dumping is working, now I'll create a mechanism to define my record layout, parse the data, do data validation and data type conversions. WOO-HOO!

    C++ classes, here I come!
     
  15. SilentPanda Moderator emeritus

    Joined:
    Oct 8, 2002
    Location:
    The Bamboo Forest
    #15
    Good point! I usually define my variables one at a time though... tends to save me time in the long run. Good point though for those that do.
     
  16. fimac macrumors member

    Joined:
    Jan 18, 2006
    Location:
    Finland
    #16
    This has been a recurring subject during my career -- and I have never decided upon a definitive answer. That said, mostly this week I have been using the "char* c" style, because the star is clearly part of the data-type. Which is nice.

    Code:
    typedef char* char_ptr_t;
    As you noted, this rule breaks when multiple variables are declared at once. Personally, I find declaring one variable per line to be more maintainable, but YMMV :)
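
With the typedef, the pointer is part of the type name, so a multi-variable declaration does what it appears to do. A short sketch (C++11 or later for static_assert; `p` and `q` are just example names):

```cpp
#include <type_traits>

typedef char* char_ptr_t;

char_ptr_t p, q;    // both p and q are char*

static_assert(std::is_same<decltype(p), char*>::value, "p is a pointer to char");
static_assert(std::is_same<decltype(q), char*>::value, "q is a pointer to char");

// One thing to watch: const applies to the pointer itself, not to the chars.
static_assert(std::is_same<const char_ptr_t, char* const>::value,
              "const char_ptr_t is char* const, not const char*");
```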
     
