getSelection() & arabic text (java)

Discussion in 'Mac Programming' started by NoamK, Sep 26, 2007.

  1. NoamK macrumors newbie

    Sep 26, 2007
    I'm kind of new to this (just got my first mac), so maybe the answer is really easy, but I have 2 questions. I'm trying to set a bookmark that looks up selected Arabic text (from a webpage) in an online dictionary like so:


    where this text is the location of a bookmark in Safari.

    1. The text gets mangled, e.g. it becomes something like: %u0627%u0644%u062B%u0627%u0646%u064A
    How do I prevent the text mangling that is going on? I tried unescape(x) in the location, but this gave a weird partly-Arabic-partly-gibberish font. Any ideas?

    2. Maybe a simpler question. How do I get the bookmark to open in a new tab (i.e. by modifying the javascript line above)?

  2. robbieduncan Moderator emeritus


    Jul 24, 2002
    That would appear to be JavaScript, not Java as your title suggests. They are very different languages (basically not related at all).
  3. psingh01 macrumors 65816

    Apr 19, 2004

    1. Does the link with the mangled text actually work? I.e. does it open the dictionary page for your Arabic word? If so, don't worry about it being "mangled", because it looks like it is just being encoded, the same way a space " " is encoded as %20 in a URL.

    2. As far as I know, you can only open a "new window" with JavaScript. It is up to the user to set their browser so that new windows open as tabs. You wouldn't want to create a tab if the user prefers windows :)
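
    A minimal sketch of the "new window" approach, for illustration only. The dictionary URL and the function name lookUpSelection are hypothetical, since the original bookmark code is not shown in the thread, and which encoding the dictionary expects is a separate question. The opener is passed in as a parameter so the URL-building part can be tried outside a browser.

    ```javascript
    // Hypothetical dictionary URL: the real one is not shown in the thread.
    const DICTIONARY_URL = "https://dictionary.example/lookup?word=";

    // Build the lookup URL for the selected text and hand it to an opener.
    // In a browser the opener would be (url) => window.open(url, "_blank").
    function lookUpSelection(selectedText, openUrl) {
      const url = DICTIONARY_URL + encodeURIComponent(selectedText);
      openUrl(url);
      return url;
    }
    ```

    In a bookmarklet this might be called as lookUpSelection(String(window.getSelection()), (u) => window.open(u, "_blank")). Whether the result appears as a tab or a window is still decided by the browser's settings.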
  4. kylos macrumors 6502a


    Nov 8, 2002
    The output you are seeing is a 16-bit Unicode representation of the Arabic text you want to define (%u marks a hexadecimal Unicode code). The dictionary service is expecting an 8-bit Arabic encoding, ISO 8859-6. I'm looking at ways to handle this, but my bookmarklet JavaScript is weak.
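
    To make the %u form concrete, here is a small sketch. Output of this shape is what JavaScript's legacy escape() function produces for non-Latin characters; that the bookmark actually calls escape() is an assumption, since its code is not shown. For comparison, encodeURIComponent() emits percent-encoded UTF-8 bytes instead.

    ```javascript
    // The word from the first post, written with \u escapes so the source stays ASCII.
    const word = "\u0627\u0644\u062B\u0627\u0646\u064A";

    // Legacy escape(): characters above U+00FF become %uXXXX (UTF-16 code units).
    const legacy = escape(word);

    // encodeURIComponent(): characters become percent-encoded UTF-8 bytes.
    const utf8 = encodeURIComponent(word);

    console.log(legacy); // %u0627%u0644%u062B%u0627%u0646%u064A
    console.log(utf8);   // %D8%A7%D9%84%D8%AB%D8%A7%D9%86%D9%8A
    ```

    Neither form is ISO 8859-6, so if the dictionary really does expect that encoding, both would need a further conversion step.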

    In the meantime, a quick explanation of character encodings. ISO 8859 is an 8-bit encoding based on 7-bit ASCII, so it has twice as many characters as ASCII (256 vs. 128). ASCII covers the basic Latin characters, which cover many European languages, minus some specialized characters, notably diphthongs and vowels with accent marks. ISO 8859-1 adds these characters in the additional 128 available positions (-2, -3 and -4 are other European variants). However, 256 characters is still way too few to represent all the world's languages, so further ISO 8859 variants were created: 8859-5 for Cyrillic, 8859-6 for Arabic, etc.
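
    As a rough sketch of what converting to 8859-6 could look like: the Arabic letters in ISO 8859-6 sit at a fixed distance from their Unicode code points (U+0621 maps to byte 0xC1), so subtracting 0x0560 gives the 8-bit value. The function name toIso8859_6Percent is made up for this example, and whether the dictionary accepts URLs in this form is an assumption that would need testing.

    ```javascript
    // Percent-encode a string as ISO 8859-6 bytes (sketch, Arabic letters only).
    function toIso8859_6Percent(text) {
      let out = "";
      for (const ch of text) {
        const cp = ch.codePointAt(0);
        let byte;
        if (/^[A-Za-z0-9_.~-]$/.test(ch)) {
          out += ch;                 // unreserved ASCII can stay as-is in a URL
          continue;
        } else if (cp < 0x80) {
          byte = cp;                 // other ASCII keeps its value
        } else if ((cp >= 0x0621 && cp <= 0x063A) || (cp >= 0x0640 && cp <= 0x0652)) {
          byte = cp - 0x0560;        // Arabic letters and marks: fixed offset
        } else {
          throw new RangeError("No ISO 8859-6 mapping in this sketch for U+" + cp.toString(16));
        }
        out += "%" + byte.toString(16).toUpperCase().padStart(2, "0");
      }
      return out;
    }

    console.log(toIso8859_6Percent("\u0627\u0644\u062B\u0627\u0646\u064A")); // %C7%E4%CB%C7%E6%EA
    ```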

    Still, it can be a pain to make sure you are using the proper character encoding for a sample of text, and some scripts can't be completely covered by ISO 8859 variants (the Arabic variant does not cover the various position-based forms of Arabic characters, and East Asian languages would still be out of luck). That's where Unicode comes in: originally designed as a 16-bit format, it could represent 2^16, or 65,536, characters (and it has since been extended well beyond that), which covers pretty much all languages and scripts in one format and eliminates the need to worry about code pages, etc. In theory, anyway. As you can see from the problems you're having, Unicode only works when everybody uses it.

    Check out the Wikipedia page on character encodings for more information and additional links.
