Encoding problems in Mac

Discussion in 'Mac Programming' started by anandds, Jan 28, 2010.

  1. anandds macrumors newbie

    Joined:
    May 8, 2009
    #1
    Hi all,

    We are facing some encoding problems related to Japanese special characters. The whole thing boils down to the following analysis:

    Create a file named "ホンダ" with the contents ホンダ:

    $ pwd
    /test
    $ ls
    ホンダ
    $ cat ホンダ
    ホンダ
    $ cat ホンダ | od -x
    0000000 83e3 e39b b383 83e3 0a80
    0000012
    $ ls | od -x
    0000000 83e3 e39b b383 82e3 e3bf 9982 000a
    0000015


    The question is: why does the output from 'ls' contain more bytes than the output from 'cat'? It looks like the filenames are encoded differently from the contents of the files.
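
    In case it is useful, here is a rough Python sketch of the same comparison (the path and the assumption that /test contains only this one file are mine, just for illustration):

    import os

    d = "/test"                                 # the test directory from above
    name = os.listdir(d)[0]                     # file name as the filesystem reports it
    with open(os.path.join(d, name), "rb") as f:
        contents = f.read().rstrip(b"\n")       # raw file contents, minus the trailing newline

    name_bytes = name.encode("utf-8")
    print(len(name_bytes), name_bytes.hex(" "))   # 12 bytes (the od output above also counts ls's newline)
    print(len(contents), contents.hex(" "))       # 9 bytes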

    Any help on this would be great

    Thanks,
    Anand
     
  2. wrldwzrd89 macrumors G5

    Joined:
    Jun 6, 2003
    Location:
    Solon, OH
    #2
    Indeed the encoding IS different. The goofy thing is that file names are represented one way to the GUI bits of Mac OS X (Unicode) but another to some, but not all, of the CLI bits (the legacy Mac OS Roman encoding). This obviously can lead to issues reading and writing such files.
     
  3. gnasher729 macrumors P6

    Joined:
    Nov 25, 2005
    #3
    1. "od -t x1" will give the bytes in a sensible order. You got the UTF-8 characters
    e3839b, e383b3, e38380, 0a vs
    e3839b, e383b3, e382bf, e38299, 0a

    2. File names are always converted to canonically decomposed UTF-8. Look those code points up in the Character Viewer and it should be quite obvious. Remember that the same text can have multiple representations in Unicode; the file system uses a canonical representation. This has nothing to do with Japanese text; try the same thing with ÄÖÜäöü and see what happens (the second sketch below shows the same decomposition).
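
    To illustrate the byte-order point in 1., a minimal Python sketch (not output from the poster's machine; it assumes a little-endian Mac, which is why "od -x" shows the pairs swapped):

    # The bytes "cat" produced: composed UTF-8 for ホンダ plus a trailing newline.
    data = bytes.fromhex("e3839be383b3e383800a")

    # od -t x1: one byte per column, in file order.
    print(data.hex(" "))                              # e3 83 9b e3 83 b3 e3 83 80 0a

    # od -x: 16-bit words printed in the machine's (little-endian) byte order,
    # which is why each pair of bytes appears swapped.
    print(" ".join(data[i:i + 2][::-1].hex()
                   for i in range(0, len(data), 2)))  # 83e3 e39b b383 83e3 0a80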

    There is no use of MacRoman in the Mac OS X file system at all.
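
    And to see the decomposition from point 2 directly, a short sketch using Python's unicodedata module (illustrative only, not part of the original thread):

    import unicodedata

    s = "ホンダ"  # three precomposed characters, as typed

    nfc = unicodedata.normalize("NFC", s)  # composed form, what "cat" showed
    nfd = unicodedata.normalize("NFD", s)  # decomposed form, what "ls" showed

    print(nfc.encode("utf-8").hex(" "))    # e3 83 9b e3 83 b3 e3 83 80            (9 bytes)
    print(nfd.encode("utf-8").hex(" "))    # e3 83 9b e3 83 b3 e3 82 bf e3 82 99   (12 bytes)

    # ダ (U+30C0) decomposes into タ (U+30BF) plus the combining dakuten (U+3099),
    # which accounts for the extra bytes in the file name.
    # Ä behaves the same way: it becomes A (U+0041) + combining diaeresis (U+0308).
    print([hex(ord(c)) for c in unicodedata.normalize("NFD", "Ä")])  # ['0x41', '0x308']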
     
