
FatherChristmas

macrumors newbie
Original poster
Feb 19, 2010
How can you convert a program written in C (using ASCII) so that it can handle Unicode strings?

I realize there is no simple answer to this question, so I'm posting the source code from two short programs (from Dave Mark's book on C) that I want to convert to handle Unicode.

How can you replace the ASCII code with Unicode/UTF-8 code?

Program 1: (Program to print ASCII characters. I want to extend this to print some Unicode characters.)


Code:
#include <stdio.h>

void	PrintChars( char low, char high );

int main (int argc, const char * argv[]) {
	PrintChars( 32, 47 );
	PrintChars( 48, 57 );
	PrintChars( 58, 64 );
	PrintChars( 65, 90 );
	PrintChars( 91, 96 );
	PrintChars( 97, 122 );
	PrintChars( 123, 126 );
	
	return 0;
}


void	PrintChars( char low, char high ) {
	char	c;
	
	printf( "%d to %d ---> ", low, high );
	
	for ( c = low; c <= high; c++ )
		printf( "%c", c );
	
	printf( "\n" );
}



Program 2: (Program to count the number of words typed into the program. I want to replace the ASCII code with Unicode/UTF-8.)


Code:
#include <stdio.h>
#include <ctype.h> //This is to bring in the declaration of isspace()
#include <stdbool.h> 


#define kMaxLineLength		200
#define kZeroByte			0

void	ReadLine( char *line );
int		CountWords( char *line );

int main (int argc, const char * argv[]) {
	char	line[ kMaxLineLength ];
	int		numWords;
	
	printf( "Type a line of text, please:\n" );
	
	ReadLine( line );
	numWords = CountWords( line );
	
	printf( "\n---- This line has %d word", numWords );
	
	if ( numWords != 1 )
		printf( "s" );
	
	printf( " ----\n%s\n", line );
	
	return 0;
}


void	ReadLine( char *line ) {
    int     numCharsRead = 0;
    
	while ( (*line = getchar()) != '\n' ) {
		line++;
        if ( ++numCharsRead >= kMaxLineLength-1 )
            break;
    }
	
	*line = kZeroByte;
}


int	CountWords( char *line ) {
	int		numWords, inWord;
	
	numWords = 0;
	inWord = false;
	
	while ( *line != kZeroByte ) {
		if ( ! isspace( *line ) ) {
			if ( ! inWord ) {
				numWords++;
				inWord = true;
			}
		}
		else
			inWord = false;
		
		line++;
	}
	
	return numWords;
}

Thanks!
 
But is Unicode not all 2-byte characters? I believe UTF-8 is the one that is ASCII-identical in the lower half, right? And a little more of a pain with its variable-length upper characters? Do I have that right?
 
You need to be more specific about your goals. For low ASCII, UTF-8 has the exact same binary representation as ASCII. So the problem is solved, you win! If you need to store things other than low ASCII, then the fun begins. I would say char is still the right datatype, but you may want to make some sort of struct that you call a Unicode string, etc., just to make it clear that these are not regular C strings. You will not be able to use standard functions, so you'll probably want to re-implement strlen, strcmp, etc. You'll probably want a bytelen method for your new thing, too, since character length and byte length will vary. Basically your PrintChars would go character by character, checking:
Code:
int len = bytelen(myUniStr);
unsigned char key = 0;
int numBytes;
for(int pos = 0; pos < len; pos++) {
  numBytes = 1;
  if((myUniStr.v[pos] & 0x80) == 0) { //Low ASCII, can just print
    printf("%c", myUniStr.v[pos]);
  } else { //Unicode lead byte
    key = (unsigned char)myUniStr.v[pos] >> 4;
    if((key >> 2) != 3) { //Error: not a valid lead byte
      //Do something
    }
    numBytes++; //At least 2
    if((key & 0x2) != 0) {
      numBytes++;
      if(key & 0x1) numBytes++;
    }
    if(pos - 1 + numBytes >= len) { //Not enough bytes to fulfill the promise
      //Do something to handle error
    } else {
      uint32 charval = 0; //Something big enough...
      charval = extractPoint(&myUniStr.v[pos], numBytes);
      //Do something to display the character at this code point...
      pos += numBytes - 1; //Skip the continuation bytes
    }
  }
}

uint32 extractPoint(char *bytes, int numBytes) {
      uint32 tmpVal = 0;
      switch(numBytes) {
        case 2: tmpVal +=  ((bytes[0] >> 2) & 0x7) << 8;
          tmpVal += ((bytes[0] & 0x3) << 6) + (bytes[1] & 0x3F);
          break;
        case 3: tmpVal += ( ((bytes[0] & 0xF) << 4) + ((bytes[1] >> 2) & 0xF) ) << 8;
          tmpVal += ((bytes[1] & 0x3) << 6) + (bytes[2] & 0x3F);
          break;
        case 4: tmpVal += ( ((bytes[0] & 0x7) << 2) + ((bytes[1] >> 4) & 0x3) ) << 16; 
          tmpVal += ( ((bytes[1] & 0xF) << 4) + ((bytes[2] >> 2) & 0xF) ) << 8;
          tmpVal += ((bytes[2] & 0x3) << 6) + (bytes[3] & 0x3F);
          break;
      }
      return tmpVal;
}

I didn't really intend to write all that, but kind of got carried away. Anyhow, it's going to be a challenge to output this stuff in a standard way; at least, I am not aware of the "right" way to output Unicode characters to a terminal using standard C.

-Lee
 
But is Unicode not all 2-byte characters? I believe UTF-8 is the one that is ASCII-identical in the lower half, right? And a little more of a pain with its variable-length upper characters? Do I have that right?

Unicode is over a million code points from hexadecimal 0x00 to 0x10FFFF. Use "Character Viewer" to see them all :D

Unicode is typically stored in UTF-16 format (using 16-bit words) or in UTF-8 format (using 8-bit bytes). In UTF-16, a Unicode code point uses one or two 16-bit words. In UTF-8, a Unicode code point uses from one to four 8-bit bytes. UTF-8 has the advantage that all the Standard C string functions (strcpy, strlen, strchr, etc.) and Standard C++ string functions (std::string) "just work": there are no byte-ordering problems when you store strings externally or transmit them from one program to another, no compatibility problems between MacOS X code and Windows code, and so on.

One thing: You have to completely remove the idea from your brain that "char" can be used for anything except holding the first 128 characters from the Unicode character set (the "Basic Latin" range). So whenever you use anything outside that range, you can't use char, you have to use a string. And strlen() doesn't return the number of Unicode code points; it returns the number of bytes. Just ignore it.

strncpy () is about the only thing that doesn't work properly (and it's a dangerous function to use anyway), because it cuts off a string after a number of bytes, which could be in the middle of a Unicode code point. But if you rely on strncpy, you have lost anyway.

BTW, printf() prints a char* containing UTF-8 just fine. Try

Code:
printf ("%s\n", "ÄËÏÖÜäëïöü");
and it just works. Try

Code:
printf ("%d\n", (int) strlen ("ÄËÏÖÜäëïöü"));
and the result might be surprising if you don't realise that XCode uses UTF-8 for strings anyway. However, you can't print a single byte that isn't a complete character using the %c format. It just doesn't make sense; you can't print half of a letter "Ä".
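
If you want to see the byte-versus-code-point difference directly, here is a minimal sketch (utf8_codepoints is just a name I made up for this example) that counts code points by skipping UTF-8 continuation bytes; it assumes the input is well-formed UTF-8:

Code:
#include <stdio.h>
#include <string.h>

/* Count Unicode code points in a UTF-8 string by skipping continuation
   bytes (bytes matching the bit pattern 10xxxxxx). Assumes well-formed
   UTF-8 input. */
static size_t utf8_codepoints(const char *s)
{
    size_t count = 0;
    for (; *s != '\0'; s++)
        if (((unsigned char)*s & 0xC0) != 0x80)
            count++;
    return count;
}

int main(void)
{
    const char *s = "ÄËÏÖÜäëïöü";
    printf("%zu bytes, %zu code points\n", strlen(s), utf8_codepoints(s));
    return 0;
}

With those ten accented letters, strlen() reports 20 bytes while the loop reports 10 code points.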
 
I did not know the information gnasher729 provided. Wouldn't have spent all that time. Oh well. Also, my extractPoint code won't work quite right b/c of bytes being signed chars. Making it unsigned would be trivial, but it doesn't seem like it matters now.

-Lee
 
Very informative. I knew there was a reason I like @class NSString.

The entrance to the rabbit hole:
http://en.wikipedia.org/wiki/Wide_character

Somewhat dated, but still decent:
http://www.joelonsoftware.com/articles/Unicode.html

Definitive:
http://unicode.org/


Principal C99 type:
wchar_t - the implementation-dependent type of a "wide character" (technically, not tied to a particular code, e.g. Unicode; in practice, most things will use Unicode... except when they don't (aren't standards fun?))

Principal C include file:
#include <wchar.h>

Some string & char functions:
wcslen(), wcsncat(),
iswalpha(), iswspace()

Some I/O functions:
fgetwc(), getwchar(),
fgetws(), fputws(),
fwprintf(), swprintf()


apropos wchar
apropos wide
apropos wide | grep string


Note that wchar_t is NOT the same type (or size) as UniChar, which NSString, CFString, etc. use. The UniChar typename is essentially UTF-16 (unsigned 16-bits), while wchar_t is 32-bits.

Also be sure to see the Unicode docs on composed vs. decomposed sequences (affects accents and other combining marks), and also on the canonical composed and decomposed forms.

And look, there goes a white wabbit wearing a waistcoat.
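
To see a few of those functions in action, here is a minimal sketch of a wide-character word counter, patterned loosely on the program posted above. It also calls setlocale(), which the list above doesn't mention but which is needed so the wide-char I/O picks up the user's (assumed UTF-8) locale. It's an illustration, not a drop-in replacement:

Code:
#include <locale.h>
#include <stdbool.h>
#include <stdio.h>
#include <wchar.h>
#include <wctype.h>

int main(void)
{
    /* Pick up the user's locale (assumed to be a UTF-8 one) so the
       wide-char input functions decode multi-byte input correctly. */
    setlocale(LC_ALL, "");

    wint_t wc;
    int    numWords = 0;
    bool   inWord   = false;

    /* fgetwc() hands back whole characters, not bytes, so iswspace()
       sees real characters even for multi-byte UTF-8 input. */
    while ((wc = fgetwc(stdin)) != WEOF) {
        if (!iswspace(wc)) {
            if (!inWord) {
                numWords++;
                inWord = true;
            }
        } else {
            inWord = false;
        }
    }

    printf("%d word(s)\n", numWords);
    return 0;
}

The structure is the same as the byte-oriented version; the difference is that every classification call operates on whole characters, so iswspace() could just as easily be iswalpha() or any other test that needs to see the real character.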
 
I am guessing Apple is not using C-format strings to store NSString strings. For instance, in File Manager (Core Services), when you provide a string, you also have to provide its length. I would guess the character count (length) of a string object must be one of the instance variables, a little like Pascal, except maybe not neatly fit into a single field.
 
I am guessing Apple is not using C-format strings to store NSString strings. For instance, in File Manager (Core Services), when you provide a string, you also have to provide its length. I would guess the character count (length) of a string object must be one of the instance variables, a little like Pascal, except maybe not neatly fit into a single field.

Actually, to the best of my knowledge, Apple does in fact store it as a C-type string. The great thing about objects is introspection: you don't need to know how long an NSString is, because the NSString knows how long it is ;)
 
Actually, to the best of my knowledge, Apple does in fact store it as a C-type string. The great thing about objects is introspection: you don't need to know how long an NSString is, because the NSString knows how long it is ;)

Well, I have used length from time to time. For instance

Code:
if ( [[justPathetic commonPrefixWithString:soPathetic] length] == [soPathetic length] )
     // justPathetic is a subdirectory of soPathetic
     ; // other code

Naturally, you will observe that this will eventually fail, that one must loop through the path components for it to be reliable. I was lucky enough to have a really good tester who breaks my work with great alacrity, so I found out well before I let it get out.
 
Naturally, you will observe that this will eventually fail, that one must loop through the path components for it to be reliable. I was lucky enough to have a really good tester who breaks my work with great alacrity, so I found out well before I let it get out.

I will observe no such thing, because I have no idea what you're talking about. :cool:
 
The entrance to the rabbit hole:
http://en.wikipedia.org/wiki/Wide_character

Somewhat dated, but still decent:
http://www.joelonsoftware.com/articles/Unicode.html

Definitive:
http://unicode.org/


Principal C99 type:
wchar_t - the implementation-dependent type of a "wide character" (technically, not tied to a particular code, e.g. Unicode; in practice, most things will use Unicode... except when they don't (aren't standards fun?))

Principal C include file:
#include <wchar.h>

Some string & char functions:
wcslen(), wcsncat(),
iswalpha(), iswspace()

Some I/O functions:
fgetwc(), getwchar(),
fgetws(), fputws(),
fwprintf(), swprintf()


apropos wchar
apropos wide
apropos wide | grep string


Note that wchar_t is NOT the same type (or size) as UniChar, which NSString, CFString, etc. use. The UniChar typename is essentially UTF-16 (unsigned 16-bits), while wchar_t is 32-bits.

Also be sure to see the Unicode docs on composed vs. decomposed sequences (affects accents and other combining marks), and also on the canonical composed and decomposed forms.

And look, there goes a white wabbit wearing a waistcoat.

That's why I recommend very, very strongly to use UTF-8. You don't have to rely on implementation-dependent types (I think wchar_t even has different size in XCode depending on whether you compile for PowerPC or Intel, and writing wchar_t to a file and reading it back on another processor will definitely fail with wchar_t), you don't have to learn any new string functions, and so on.
 
That's why I recommend very, very strongly to use UTF-8. You don't have to rely on implementation-dependent types (I think wchar_t even has different size in XCode depending on whether you compile for PowerPC or Intel, and writing wchar_t to a file and reading it back on another processor will definitely fail with wchar_t), you don't have to learn any new string functions, and so on.
From the Unicode specification:
A character encoding scheme consists of a specified character encoding form plus a specification of how the code units are serialized into bytes. The Unicode Standard also specifies the use of an initial byte order mark (BOM) to explicitly differentiate big-endian or little-endian data in some of the Unicode encoding schemes.

An elegant implementation: 0xFEFF is a standard BOM - when you read it in the wrong endian-ness, it is one of the reserved, illegal code points 0xFFFE.
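
As a rough sketch of how you might use that property (the function name utf16_byte_order is invented for this example):

Code:
#include <stdint.h>
#include <stdio.h>

/* Decide the byte order of a UTF-16 stream from its first two bytes.
   Returns 1 for big-endian, 0 for little-endian, -1 if no BOM present. */
static int utf16_byte_order(const uint8_t *buf)
{
    if (buf[0] == 0xFE && buf[1] == 0xFF)
        return 1;   /* 0xFEFF read in order: big-endian */
    if (buf[0] == 0xFF && buf[1] == 0xFE)
        return 0;   /* bytes swapped, i.e. the illegal 0xFFFE: little-endian */
    return -1;      /* no BOM; caller has to guess or be told */
}

int main(void)
{
    const uint8_t be[] = { 0xFE, 0xFF };
    const uint8_t le[] = { 0xFF, 0xFE };
    printf("%d %d\n", utf16_byte_order(be), utf16_byte_order(le));
    return 0;
}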
 
That's why I recommend very, very strongly to use UTF-8. You don't have to rely on implementation-dependent types (I think wchar_t even has different size in XCode depending on whether you compile for PowerPC or Intel, and writing wchar_t to a file and reading it back on another processor will definitely fail with wchar_t), you don't have to learn any new string functions, and so on.

Using UTF-8 is orthogonal to using wchar_t. Going thru wchar_t and associated functions may make things easier. It certainly makes them more flexible without having to go back and modify code.

And I don't believe wchar_t sizes differ between ppc and Intel. There aren't any arch-dependent things I can see (see below for test code). The wchar_t sizes do differ between Windows and Posix, as mentioned in the Wide Character article I previously linked on Wikipedia. Essentially, wchar_t on Windows is UTF-16.


All the wide-char functions work with a locale. This defines both regional or language conventions, such as what character represents a decimal point, as well as what text-encoding to use for I/O.

http://en.wikipedia.org/wiki/Locale

The wchar_t types are automatically encoded to the proper text-encoding, according to the locale. The main point here is that the wchar routines and the locale are responsible for doing all text-encoding, not you or your program.


To see a list of locale names, enter this command in Terminal:
Code:
locale -a

A small sample of the output, which lists the locale names for Russian (i.e. Cyrillic alphabet):
Code:
ru_RU
ru_RU.CP1251
ru_RU.CP866
ru_RU.ISO8859-5
ru_RU.KOI8-R
ru_RU.UTF-8

CP866 and ISO8859-5 are older (mutually incompatible) single-byte encodings (0x00-0xFF), where the low half is effectively ASCII and the high half contains the Cyrillic alphabet along with other things:
http://en.wikipedia.org/wiki/Code_page_866
http://en.wikipedia.org/wiki/ISO/IEC_8859-5

Now, suppose you write your code to use the wchar functions for writing text strings. If you change the locale to ru_RU.CP866, then Cyrillic text is stored and read in that encoding. Change the locale to ru_RU.ISO8859-5, though, and without changing your code at all, it now reads and writes ISO8859-5 codes. Or change the locale to ru_RU.UTF-8 (or any of the other UTF-8 locales), and now all the text is read and written in UTF-8. Your program only works with wchar_t (and strings thereof), but it automatically adjusts to different locales and text encodings.

The "C" locale is the default when no other is specified, and it happens to use UTF-8 as its text encoding on Mac OS X. So if a program were written to use the wide-char functions, and not change the default locale, the program would automatically be able to read and write UTF-8. It would also be adaptable to whatever the user sets the locale to.


Test code.

wchar.c
Code:
#include <wchar.h>
#include <stdio.h>

int main (int argc, const char * argv[]) 
{
	int size = sizeof( wchar_t );
	printf( " .. sizeof( wchar_t ) = %d\n", size );
}

Commands and output
Code:
gcc -arch i386 wchar.c && ./a.out
 .. sizeof( wchar_t ) = 4

gcc -arch ppc wchar.c && ./a.out
 .. sizeof( wchar_t ) = 4

file a.out
a.out: Mach-O executable ppc


All of which is why I strongly recommend using the wide-char functionality. For one thing, it's standardized (C99) and has functions written and maintained by people a lot more knowledgeable about these things.

It's possible to do things without using the wide-char functionality, but it's a small sandbox you're playing in. Get out of the sandbox. See the world.
 
A real world example

One thing: You have to completely remove the idea from your brain that "char" can be used for anything except holding the first 128 characters from the Unicode character set (the "Basic Latin" range). So whenever you use anything outside that range, you can't use char, you have to use a string.

That's why I recommend very, very strongly to use UTF-8. You don't have to rely on implementation-dependent types (I think wchar_t even has different size in XCode depending on whether you compile for PowerPC or Intel, and writing wchar_t to a file and reading it back on another processor will definitely fail with wchar_t), you don't have to learn any new string functions, and so on.

All of which is why I strongly recommend using the wide-char functionality. For one thing, it's standardized (C99) and has functions written and maintained by people a lot more knowledgeable about these things.

Well, it's good to see we're all in agreement here! :D
Anyway, thanks for all the responses, much appreciated, especially gnasher's and chown's, which bring me closer to resolving my issue.

Whether you take the UTF-8 string approach or the wchar_t approach, the issue for me was, and remains, how to apply the information in gnasher's and chown's posts to make C source code Unicode aware. When (and how) do you use multi-byte Unicode strings, as gnasher suggests? And when (and how) do you use the wchar_t approach, as chown recommends? I guess I'm looking for more concrete, practical examples to help me learn how to implement Unicode. How do I do it in my specific program?

I understand some of the Unicode issues raised in this thread, but I'm a beginning programmer so I don't know how to turn these suggestions into workable code.

To be more specific, in my second sample program (a simple program to count the number of words typed into console), the Unicode strings (typed in at run time) do not display correctly. I assume major surgery is required (a total re-writing of the program?) to get it to display the Unicode text correctly. The program DOES count correctly the total number of words typed in, even using multi-byte character (UTF-8) strings. The issue is the character display only.

Here's the program again:
(Program to count the number of words typed into the console. I want to make the program Unicode aware.)
Code:
#include <stdio.h>
#include <ctype.h> //This is to bring in the declaration of isspace()
#include <stdbool.h> 


#define kMaxLineLength		200
#define kZeroByte			0

void	ReadLine( char *line );
int		CountWords( char *line );

int main (int argc, const char * argv[]) {
	char	line[ kMaxLineLength ];
	int		numWords;
	
	printf( "Type a line of text, please:\n" );
	
	ReadLine( line );
	numWords = CountWords( line );
	
	printf( "\n---- This line has %d word", numWords );
	
	if ( numWords != 1 )
		printf( "s" );
	
	printf( " ----\n%s\n", line );
	
	return 0;
}


void	ReadLine( char *line ) {
    int     numCharsRead = 0;
    
	while ( (*line = getchar()) != '\n' ) {
		line++;
        if ( ++numCharsRead >= kMaxLineLength-1 )
            break;
    }
	
	*line = kZeroByte;
}


int	CountWords( char *line ) {
	int		numWords, inWord;
	
	numWords = 0;
	inWord = false;
	
	while ( *line != kZeroByte ) {
		if ( ! isspace( *line ) ) {
			if ( ! inWord ) {
				numWords++;
				inWord = true;
			}
		}
		else
			inWord = false;
		
		line++;
	}
	
	return numWords;
}
So… how do you change this code to make it display correctly? I don't know where to begin. Ideally, I'd love to know how to implement both approaches advocated in this thread. (If they're both applicable to this program, that is.)

Thanks for any clues or code!
 
To be more specific, in my second sample program (a simple program to count the number of words typed into console), the Unicode strings (typed in at run time) do not display correctly. I assume major surgery is required (a total re-writing of the program?) to get it to display the Unicode text correctly. The program DOES count correctly the total number of words typed in, even using multi-byte character (UTF-8) strings. The issue is the character display only.

Copy and paste the entire text of a malfunctioning run into a post, exactly as input and output by the program.


How do you run your program? Terminal, Xcode console, something else?

What OS version? Xcode version?

When I compile and run it in Terminal, and enter accented characters, it works perfectly as-is. I even pasted in some filenames I have that contain Greek, Russian, etc. characters. Those also worked perfectly.

Test run:
Code:
./a.out
Type a line of text, please:
one fish two fish

---- This line has 4 words ----
one fish two fish

./a.out
Type a line of text, please:
übér fish

---- This line has 2 words ----
übér fish

./a.out
Type a line of text, please:
Greek-ΡΗΟΝΥ.txt

---- This line has 1 word ----
Greek-ΡΗΟΝΥ.txt


There are Terminal settings that affect how it displays non-Latin text. The setting name and location vary by OS and Terminal version.

Same thing for Xcode: the text encoding and how it shows non-Latin text in its console vary by Xcode version.


You also have to be more specific about exactly what you mean by "implement Unicode" in your program.

If the input data is UTF-16, then your current code may not work at all. If the input is UTF-8, then it will work as-is with no changes.

The reason for the difference is that UTF-16 has 16 bits per code, and when that's put on a stream of bytes (which all I/O is), it's split into 2 bytes, either big-endian or little-endian. If one of those bytes is 0, as it will be for every character in the range 0x00-0xFF (which covers Latin letters, digits, space, etc.), then your line-reading routine will effectively be putting a string-terminating nul at that position.

If none of that makes sense, it means you should read the Joel On Software URL I previously posted. You can't remain ignorant of issues like encoding, big-endian, little-endian, etc. and hope to understand or address this issue. Not even if someone writes an entire replacement program for you.
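
For a concrete picture of the zero-byte problem, here is a tiny sketch, assuming UTF-16LE input (the array is hand-built just for illustration):

Code:
#include <stdio.h>
#include <string.h>

int main(void)
{
    /* "Hi" encoded as UTF-16LE: each 16-bit code unit is split into two
       bytes, and for Basic Latin the high byte is always 0x00. */
    char utf16le[] = { 'H', 0x00, 'i', 0x00 };

    /* Byte-oriented string code treats the 0x00 after 'H' as the end of
       the string, so it sees only one byte. */
    printf("strlen sees %zu byte(s)\n", strlen(utf16le));
    return 0;
}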
 
The wchar_t types are automatically encoded to the proper text-encoding, according to the locale. The main point here is that the wchar routines and the locale are responsible for doing all text-encoding, not you or your program.

You really, really don't want locale dependent text encoding. That's why you use Unicode: Because it gets rid of all that locale dependent stuff. No codepages in Windows, no scripts like in MacOS 9, a character is just a character and nothing else. With UTF-8, there _is_ no text encoding. It's just plain Unicode.

wchar_t is whatever the compiler and runtime libraries think it is. UTF-8, on the other hand, is always the same, on every operating system, with every compiler, with every processor.
 
You really, really don't want locale dependent text encoding. That's why you use Unicode: Because it gets rid of all that locale dependent stuff. No codepages in Windows, no scripts like in MacOS 9, a character is just a character and nothing else. With UTF-8, there _is_ no text encoding. It's just plain Unicode.

Unicode is a text encoding. A big one, and one trying to be universal, but it's still a text encoding.

You don't have to use locales and wchar_t etc. with locale-dependent text encodings. That's why I suggested looking at the output of the 'locale' command. Every language and region listed has a UTF-8 variant.

To me it comes down to one thing: do you want to write all your own UTF-8-handling code, which must all maintain compatibility with Unicode standards (e.g. composition, decomposition, combining marks, word breaks, collation)? Or do you want to use the C99-standard library functions that give it to you for free? Yeah, I know, some people insist on writing their own strcmp() replacement functions, too.
 
When I compile and run it in Terminal, and enter accented characters, it works perfectly as-is. I even pasted in some filenames I have that contain Greek, Russian, etc. characters. Those also worked perfectly.

Thanks very much for taking the time to test out the code I posted. Your post has raised some interesting questions (for me, at any rate), because, in trying to recreate your success, I stumbled upon this: When I paste in your German letters (as run-time input), the program works as expected (and appears to be Unicode aware). But when I type in the German letters (as run-time input), the German is not displayed correctly. (The "é" does not display in the first line, which is similar to the issue I have when I attempt to type in my non-Latin test characters.)

XCode console (when the German is typed in at run time):
Code:
 Type a line of text, please:
Running…
übr fish

---- This line has 2 words ----
übér fish
XCode console (when the German is pasted in at run time):

Code:
 Type a line of text, please:
übér fish

---- This line has 2 words ----
übér fish

My display issue is only slightly different (perhaps because my Unicode characters are 3 bytes long, but your German Unicode characters are 2 bytes long):

XCode console (when a non-Latin script is typed in at run time):
Code:
 Type a line of text, please:
가 

---- This line has 3 words ----
ㅇㅓㄴㅈㅓㅣㄴㄱㅏ ㅈㅏㄹ ㄷㅗㅣㄹㄲㅓㅇㅣㅓㅣㅇㅣㅗ.
XCode console (when a non-Latin script is pasted in at run time):
Code:
Type a line of text, please:
언젠가 잘 될꺼예요.

---- This line has 3 words ----
언젠가 잘 될꺼예요.
XCode console (when a non-Latin script is typed in at run time, using a different keyboard mapping):

Code:
Type a line of text, please:
요.

---- This line has 3 words ----
ㅇㅓㄴㅈㅔㄴㄱㅏ ㅈㅏㄹ ㄷㅗㅣㄹㄲㅓㅇㅖㅇㅛ.
All of the above is what the console displays, using the same three words. Only the pasted-in version is correctly displayed. If the word-count program were Unicode aware, it should display all three exactly like the pasted-in version, which is what a Unicode-aware program like TextEdit does. Whether pasted in or typed in (using any number of different keyboard mappings), TextEdit will always display correctly the three words I've used in my sample.

I can provide all the other details you ask for, but this initial result may suggest something that takes us in another direction. Or it may help us to focus on which of your suggestions is now worth exploring more. So let me leave out some of the other details for the moment.

I thank you for posting, and now pointing out, the Joel on Unicode link.

Oddly, about 2 weeks ago, I read that very article -- and quite a bit more -- in an attempt to sort out the failures I was having in getting the programs I posted to display correctly a non-Latin script. I do understand about 90% of what you've explained here and appreciate your taking the time to post.

Only after hours and hours, and days and days, of failure did I decide to post my issue here. I tried to implement Unicode by substituting wchar_t for char etc., all without success. Of course, not having any experience writing Unicode-aware programs, I wasn't expecting an easy success.
 
Maybe you should test your program using something other than Xcode console. That's the one thing I see you having trouble with, which I didn't use.

If you don't test on something other than Xcode console, you can't eliminate it as the source of the errors. If you don't eliminate it, and it turns out that it is the source of the errors, you will be chasing a phantom.


Oh, and you haven't identified which Xcode version you're using.
 
Real-World Unicode-Savvy Code

Maybe you should test your program using something other than Xcode console.

By running the same executable in Terminal, not in the Xcode console, I was able to reproduce your success fully. In Terminal, the program now appears to be Unicode-savvy. It works perfectly.

Experimenting with various preference settings (encoding and font, mainly) in Xcode did NOTHING to improve the situation, with one failure after another. Thanks for sending me off in the right direction! Problem solved.

But now, of course, I'm still left wondering how to make programs Unicode aware, which is why I was experimenting with this program in the first place. It seems I've accidentally chosen a program that is (for reasons that are not entirely clear to me) Unicode-capable. Bad luck, really!

If anyone can post some actual lines of C code -- NOT the usual generic syntax that is listed in countless documents all over the Internet -- of a Unicode-capable program, that would be great. Better yet (if it's readily available or not too time-consuming), posting all Unicode-relevant source code for a simple (or complex) Unicode-aware program would be an enormous help -- not just to me, I'm sure.

I'm not asking anyone to sit down and write out a whole Unicode-savvy C program. I'm just asking whether anyone with the time, and interest, might be able to share some practical, real-world Unicode-aware source code.

Thanks!
 
By running the same executable in Terminal, not in the Xcode console, I was able to reproduce your success fully. In Terminal, the program now appears to be Unicode-savvy. It works perfectly.

Experimenting with various preference settings (encoding and font, mainly) in Xcode did NOTHING to improve the situation, with one failure after another. Thanks for sending me off in the right direction! Problem solved.

Maybe. You still haven't identified which version of Xcode you're using, so I still can't tell you anything about possibly fixing the problem in Xcode's console. But if you're fine with that, then I guess I will be, too.


If anyone can post some actual lines of C code -- NOT the usual generic syntax that is listed in countless documents all over the Internet -- of a Unicode-capable program, that would be great. Better yet (if it's readily available or not too time-consuming), posting all Unicode-relevant source code for a simple (or complex) Unicode-aware program would be an enormous help -- not just to me, I'm sure.

I'm not asking anyone to sit down and write out a whole Unicode-savvy C program. I'm just asking whether anyone with the time, and interest, might be able to share some practical, real-world Unicode-aware source code.

I've already posted the names of some of the more fundamental wchar_t-related functions. I suggest googling for one or more of those names in conjunction with combinations of the following:
sample
example
source

For example:
fwprintf example

The following appears near the top of the results:
http://www.opengroup.org/onlinepubs/009695399/functions/wprintf.html

There are examples located in the EXAMPLES section of that page.

If you want non-trivial source, you might have to refine your search, such as limiting it to the sourceforge website, or looking for a particular type of program that uses wchar_t.


As a first exercise, I suggest writing a very simple "get a wide-char and print its hex representation" loop. It shouldn't be that difficult to do. You might want to start by writing it first to call getchar() instead of getwchar(), to make sure it works as expected with known input.

As a second exercise, count the wide-chars read from stdin, and printf() the count when eof is encountered. Again, I advise writing a getchar() version first, then a getwchar() version, and comparing the counts when you give the program inputs like plain ASCII text, accented Latin characters, then Greek, Russian, etc.

You can prepare text files in TextEdit.app and save as plain text in UTF-8 format. Then run your programs with stdin redirected to read your text files.

You should also prepare text files stored as UTF-16 (Save As and choose from TextEdit popup), and compare the results of your getchar() vs. getwchar() programs.
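
As a starting point, here is one possible sketch of the first exercise (wide characters in, hex out), assuming the default UTF-8 environment in Terminal:

Code:
#include <locale.h>
#include <stdio.h>
#include <wchar.h>

int main(void)
{
    /* Use the environment's locale (assumed UTF-8 in Terminal) so
       getwchar() assembles multi-byte input into whole characters. */
    setlocale(LC_ALL, "");

    wint_t wc;
    while ((wc = getwchar()) != WEOF)
        printf("U+%04lX\n", (unsigned long)wc);

    return 0;
}

Swap getwchar() for getchar() (and drop the setlocale call) to get the byte-counting comparison version.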
 
You still haven't identified which version of Xcode you're using, so I still can't tell you anything about possibly fixing the problem in Xcode's console.
Well, okay, sure… Here you are:
Xcode Version 3.2.1 (IDE 1613.0) 64-bit (on OS X 10.6.2)
Any ideas on how to fix the Unicode display issue? Thanks!

If you want non-trivial source, you might have to refine your search, such as limiting it to the sourceforge website, or looking for a particular type of program that uses wchar_t.
Yes, of course, thank you. I found some promising code, including the source for a whole Unicode-aware program. If any of it proves particularly useful, I may post it here.

Thanks, too, for the coding practice suggestions.
 
This isn't directly relevant to C coding, but I thought I'd share this here for people who search the forums for "UTF-8".

I wrote this Excel formula today that converts the decimal value of a Unicode character to its UTF-8 byte sequence. It was fun to invent!

Put the decimal value in cell A1. Put the formula in cell B1. Copy it down the column if you want to put more values in column A and get the UTF-8 forms in column B. It's in hex and will be one to four bytes long.

Formula:
Code:
="0x"&IF(A1<128,DEC2HEX(A1,2),IF(A1<2048,DEC2HEX(192+INT(A1/64),2),IF(A1<65536,DEC2HEX(224+INT(A1/4096),2),DEC2HEX(240+INT(A1/262144),2))))&IF(A1<128,""," 0x"&IF(A1<2048,DEC2HEX(128+MOD(A1,64),2),IF(A1<65536,DEC2HEX(128+MOD(INT(A1/64),64),2),DEC2HEX(128+MOD(INT(A1/4096),64),2))))&IF(A1<2048,""," 0x"&IF(A1<65536,DEC2HEX(128+MOD(A1,64),2),DEC2HEX(128+MOD(INT(A1/64),64),2)))&IF(A1<65536,""," 0x"&DEC2HEX(128+MOD(A1,64),2))

Example: The bold italic partial differential symbol 𝝏 (which I'm sure you use all the time) is decimal 120655 in Unicode. Use this formula and you get its UTF-8 sequence: 0xF0 0x9D 0x9D 0x8F.

To copy the formula, I suggest that you quote this post and steal it from between the code tags.

Here it is with extra linebreaks in case you want to figure out how it works:
Code:
="0x"&
IF(A1<128,DEC2HEX(A1,2),
IF(A1<2048,DEC2HEX(192+INT(A1/64),2),
IF(A1<65536,DEC2HEX(224+INT(A1/4096),2),DEC2HEX(240+INT(A1/262144),2))))&
IF(A1<128,""," 0x"&
IF(A1<2048,DEC2HEX(128+MOD(A1,64),2),
IF(A1<65536,DEC2HEX(128+MOD(INT(A1/64),64),2),DEC2HEX(128+MOD(INT(A1/4096),64),2))))&
IF(A1<2048,""," 0x"&
IF(A1<65536,DEC2HEX(128+MOD(A1,64),2),DEC2HEX(128+MOD(INT(A1/64),64),2)))&
IF(A1<65536,""," 0x"&DEC2HEX(128+MOD(A1,64),2))
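
For anyone who would rather see the same conversion in C than in Excel, here is a minimal sketch (the name utf8_encode is made up for this example); for 120655 it produces the same 0xF0 0x9D 0x9D 0x8F sequence:

Code:
#include <stdio.h>
#include <stdint.h>

/* Encode one Unicode code point as UTF-8. Returns the number of bytes
   written to out (out must hold at least 4 bytes). A sketch only: it
   does not reject surrogates or values above 0x10FFFF. */
static int utf8_encode(uint32_t cp, unsigned char out[4])
{
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    } else if (cp < 0x800) {
        out[0] = 0xC0 | (cp >> 6);
        out[1] = 0x80 | (cp & 0x3F);
        return 2;
    } else if (cp < 0x10000) {
        out[0] = 0xE0 | (cp >> 12);
        out[1] = 0x80 | ((cp >> 6) & 0x3F);
        out[2] = 0x80 | (cp & 0x3F);
        return 3;
    } else {
        out[0] = 0xF0 | (cp >> 18);
        out[1] = 0x80 | ((cp >> 12) & 0x3F);
        out[2] = 0x80 | ((cp >> 6) & 0x3F);
        out[3] = 0x80 | (cp & 0x3F);
        return 4;
    }
}

int main(void)
{
    unsigned char buf[4];
    int n = utf8_encode(120655, buf);   /* U+1D74F, same example as above */
    for (int i = 0; i < n; i++)
        printf("0x%02X ", buf[i]);
    printf("\n");                       /* prints: 0xF0 0x9D 0x9D 0x8F */
    return 0;
}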
 