Character codes in charsets

rendezvouscp

macrumors 68000
Original poster
Aug 20, 2003
1,526
0
Long Beach, California
I've got a few quick questions.

I've always thought it was best to code your 8-bit characters as entities like &tilde;, but I've read that you can just use UTF-8 in your charset meta value and not have to. Even with UTF-8, is it still better to code your characters as entities? I've been playing with WordPress, and using UTF-8 you can do a search on a character like ˜ and get what you're looking for, whereas you won't in most other charsets. What's the best way to go?
-Chase
 

MontyZ

macrumors 6502a
Jan 7, 2005
887
0
If your question is about coding HTML pages, then yes, it's better to use the HTML entity (&tilde;) rather than the actual Unicode character. If you're storing text in a database, though, keep the Unicode as-is and convert it to HTML entities only when it's displayed on a page. In PHP, for example, you can do this with the built-in htmlentities() function.
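To make the "store Unicode, convert on display" idea concrete, here is a minimal sketch in Python rather than PHP. It is not the same thing as htmlentities(): it emits numeric character references (such as &#241;) instead of named entities, and the function name to_html is just something made up for this example.

```python
import html


def to_html(text: str) -> str:
    """Escape markup characters, then replace every non-ASCII
    character with a numeric character reference."""
    # html.escape handles &, <, > and (with quote=True) quote characters.
    escaped = html.escape(text, quote=True)
    # xmlcharrefreplace turns anything outside ASCII into &#NNN;.
    return escaped.encode("ascii", "xmlcharrefreplace").decode("ascii")


# What would sit in the database: the raw Unicode text, untouched.
stored = "ma\u00f1ana <b>"

# Converted only at the moment it is written into a page.
print(to_html(stored))
```

The stored value never changes; only the copy that goes into the page is converted.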
 

Rower_CPU

Moderator emeritus
Oct 5, 2001
11,219
0
San Diego, CA
What's the rationale behind leaving the entities alone when inserting the text into a database? I've made systems that dump unicode text straight in and have also coded entities into MySQL databases and have had no issues with either method.
 

MontyZ

Rower_CPU said:
What's the rationale behind leaving the entities alone when inserting the text into a database? I've made systems that dump unicode text straight in and have also coded entities into MySQL databases and have had no issues with either method.
Leaving the Unicode as-is when writing to the database is simply a way to preserve the original text. This helps in a number of cases, most notably if you want to search the text. If the text is stored in the DB with Unicode converted to HTML entities, accurate searches become a lot harder, because the search term has to be converted the same way before it will match. Also, if you want to eventually use that text in other types of documents, it's better to start with the original than to try to translate the HTML entities back to Unicode.
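A quick way to see the search problem, sketched in Python with plain substring matching standing in for a database search (the sample text and the contains helper are made up for illustration):

```python
import html

original = "caf\u00e9 ma\u00f1ana"

# The same text as it would look if entities were stored instead.
as_entities = original.encode("ascii", "xmlcharrefreplace").decode("ascii")


def contains(haystack: str, needle: str) -> bool:
    """Stand-in for a simple text search."""
    return needle in haystack


# Searching the raw text finds the word; searching the entity form does not,
# unless the search term is converted to entities first.
print(contains(original, "ma\u00f1ana"))
print(contains(as_entities, "ma\u00f1ana"))

# Getting the original back means decoding the entities again.
print(html.unescape(as_entities) == original)
```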

That said, if the only purpose for the text is to display it on a web page and it doesn't need to be searched, then storing it with Unicode converted to HTML entities is just fine, too. I still prefer to store Unicode.

Actually, you can only store Unicode in MySQL if you're using the latest version, so that may be a good reason to convert the Unicode to HTML entities before storing it.
 

Rower_CPU

Makes sense. That's the way I've been doing it all along, I just wasn't sure if there was a security issue or something else involved.

As far as storing unicode in MySQL, when you say "latest" version do you mean 4 or 3? I'm pretty sure I've used 3 with unicode without problem.
 

MontyZ

Rower_CPU said:
As far as storing unicode in MySQL, when you say "latest" version do you mean 4 or 3? I'm pretty sure I've used 3 with unicode without problem.
MySQL v3.x doesn't have support for Unicode, unfortunately. I have a server here still running MySQL 3 and had to convert everything to ISO before storing it because the Unicode was getting mangled. Then I found out it was because Unicode was not supported in older versions of MySQL. Here are the character sets it does support:

latin1 big5 cp1251 cp1257 croat czech danish dec8 dos estonia euc_kr gb2312 gbk german1 greek hebrew hp8 hungarian koi8_ru koi8_ukr latin2 latin5 swe7 usa7 win1250 win1251 win1251ukr ujis sjis tis620
 

Rower_CPU

Interesting...I just checked an older version of a site I had done that allowed users to enter unicode text that was then uploaded into MySQL tables (version 3.23.51 - Marc Liyanage's build for OS X). The text is inserted and displayed back with no problem, everything from accented Latin to Arabic and Korean.

I did see in the MySQL reference manual that MySQL 4 adds support for UTF8, but they reference it as a new unicode character set, which implies to me there was some level of unicode support previously.

http://dev.mysql.com/doc/mysql/en/charset-unicode.html

Regardless, it's good to see better support out there for unicode. :)
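One possible explanation for text surviving on a server without Unicode support, offered as an assumption rather than something confirmed for either setup in this thread: if the page sends UTF-8 and the database simply stores and returns the bytes as if they were a single-byte charset, the text can come back intact even though the server never understood it. A small Python sketch of that byte pass-through, with latin-1 standing in for the server's charset:

```python
# Korean plus accented Latin, written with escapes so the snippet is
# independent of the file's own encoding.
text = "\ud55c\uae00 caf\u00e9"

# What a UTF-8 page would send to the database.
utf8_bytes = text.encode("utf-8")

# A server that assumes a single-byte charset sees one "character" per byte.
as_seen_by_server = utf8_bytes.decode("latin-1")

# Reading the same bytes back and treating them as UTF-8 restores the text.
round_tripped = as_seen_by_server.encode("latin-1").decode("utf-8")
print(round_tripped == text)

# The server's idea of the string length no longer matches the real one.
print(len(text), len(as_seen_by_server))
```

If that is what is going on, display works, but anything the server does per character (lengths, sorting, case-insensitive matching) can be off, and any charset conversion along the way would mangle the text.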
 

MontyZ

Yes, that is interesting. I am using MySQL 3.23.58 on a Red Hat Linux server, and there is no Unicode support at all. I've tried, but it won't work. I guess the builds on the two platforms are different. I know that Red Hat bundles its own build of MySQL with ES 3.0, which is the OS I'm running on my server. So maybe your build was tweaked to allow Unicode. Wish I had that version! But I'm going to be upgrading to MySQL 4.x soon, and Unicode is fully supported in all its flavors in that version.