Encoding problems in Mac

Discussion in 'Mac Programming' started by anandds, Jan 28, 2010.

  1. anandds macrumors newbie

    May 8, 2009
    Hi all,

    We are facing some encoding problems related to Japanese special characters. The whole thing boils down to the following analysis:

Create a file named "ホンダ" with contents "ホンダ":

    $ pwd
    $ ls
    $ cat ホンダ
    $ cat ホンダ | od -x
    0000000 83e3 e39b b383 83e3 0a80
    $ ls | od -x
    0000000 83e3 e39b b383 82e3 e3bf 9982 000a

The question is: why does the output from 'ls' produce more bytes than the output from 'cat'? It looks like the filenames are encoded differently than the contents of each file.

Any help on this would be great.
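[Editor's note: the two dumps can be decoded directly to see what is going on. A sketch in Python; keep in mind that `od -x` prints 16-bit little-endian words, so each byte pair appears swapped in the dump, and the hex strings below are the bytes in file order.]

```python
# Bytes reconstructed from the two `od -x` dumps above (byte-swap undone).
cat_bytes = bytes.fromhex("e3839be383b3e383800a")       # file contents + newline
ls_bytes = bytes.fromhex("e3839be383b3e382bfe382990a")  # file name + newline

print(cat_bytes.decode("utf-8"))  # ホンダ (precomposed: 3 code points)
print(ls_bytes.decode("utf-8"))   # ホンダ (decomposed: タ + combining voiced mark)

# Both decode to text that displays identically, but the byte counts differ.
print(len(cat_bytes), len(ls_bytes))  # 10 13
```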

  2. wrldwzrd89 macrumors G5


    Jun 6, 2003
    Solon, OH
    Indeed the encoding IS different. The goofy thing is that file names are represented one way to the GUI bits of Mac OS X (Unicode) but another to some, but not all, of the CLI bits (the legacy Mac OS Roman encoding). This obviously can lead to issues reading and writing such files.
  3. gnasher729 macrumors P6


    Nov 25, 2005
    1. "od -t x1" will give the bytes in a sensible order. You got the UTF-8 byte sequences
    e3839b, e383b3, e38380, 0a vs
    e3839b, e383b3, e382bf, e38299, 0a
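[Editor's note: those two sequences are the precomposed (NFC) and canonically decomposed (NFD) forms of the same string, which is easy to confirm. A sketch using Python's `unicodedata` module; any Unicode library would do.]

```python
import unicodedata

s = "ホンダ"  # with ダ precomposed (U+30C0)

nfc = unicodedata.normalize("NFC", s).encode("utf-8")
nfd = unicodedata.normalize("NFD", s).encode("utf-8")

print(nfc.hex())  # e3839be383b3e38380        -> the bytes `cat` showed
print(nfd.hex())  # e3839be383b3e382bfe38299  -> the bytes `ls` showed
```

In NFD, ダ (U+30C0) splits into タ (U+30BF, `e382bf`) plus the combining voiced sound mark (U+3099, `e38299`), which accounts for the extra bytes.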

    2. File names are always converted to canonically decomposed UTF-8. Look those codes up in Keyboard Viewer and it should be quite obvious. Remember that the same text can have multiple representations in Unicode; the file system uses a canonical representation. This has nothing to do with Japanese text; try the same thing with ÄÖÜäöü and see what happens.
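[Editor's note: the same effect is visible with the umlauts. In this Python sketch, the NFC and NFD strings compare unequal byte-for-byte even though they render identically, and normalization maps one onto the other.]

```python
import unicodedata

precomposed = "ä"        # U+00E4
decomposed = "a\u0308"   # 'a' followed by combining diaeresis (U+0308)

assert precomposed != decomposed  # different code points...
assert unicodedata.normalize("NFD", precomposed) == decomposed  # ...same canonical form

print(precomposed.encode("utf-8").hex())  # c3a4
print(decomposed.encode("utf-8").hex())   # 61cc88
```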

    There is no use of MacRoman in the Mac OS X file system at all.
