Perl, RegEx and UTF-8 Strings

Discussion in 'Mac Programming' started by kainjow, Nov 29, 2005.

  1. Moderator emeritus

    kainjow

    Joined:
    Jun 15, 2000
    #1
    I've got a Perl script that simply strips the HTML out of whatever it reads from standard input and then prints the result. However, for UTF-8/Unicode text (I'm still not 100% clear on the difference between these encodings...), the output is all garbled. Anyone have any ideas?

    Here's the Perl code:
    Code:
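    #!/usr/bin/perl
    use strict;
    use warnings;

    # Read all of standard input into one string.
    my $str = "";
    while (my $line = <STDIN>)
    {
    	$str .= $line;
    }
    $str =~ s/<script[^>]*>.*?<\/script>//gsi; # remove <script> blocks
    $str =~ s/<(?:[^>'"]*|(['"]).*?\1)*>//gsi; # remove remaining tags (quoted attributes may contain >)
    print $str;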
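
    On the UTF-8 versus Unicode question: Unicode assigns each character a number (a code point), while UTF-8 and UTF-16 are different ways of writing those numbers as bytes. A minimal Perl sketch of that, assuming the core Encode module is available; the helper name hex_bytes is made up for illustration:

```perl
use strict;
use warnings;
use Encode qw(encode);

# Return the bytes of $text in $encoding as space-separated hex.
# (hex_bytes is a hypothetical helper, not part of any library.)
sub hex_bytes {
    my ($encoding, $text) = @_;
    my $bytes = encode($encoding, $text);
    return join ' ', map { sprintf '%02X', $_ } unpack('C*', $bytes);
}

# One Unicode character: U+00E9 (e with acute accent).
my $text = "\x{e9}";

# Same code point, three different byte sequences.
for my $encoding (qw(UTF-8 UTF-16BE UTF-16LE)) {
    printf "%-8s %s\n", $encoding, hex_bytes($encoding, $text);
}
```

    The same character comes out as C3 A9 in UTF-8, 00 E9 in UTF-16BE and E9 00 in UTF-16LE, which is why a script has to know which encoding its input uses.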
     
  2. Moderator emeritus

    SilentPanda

    Joined:
    Oct 8, 2002
    Location:
    The Bamboo Forest
    #2
    What I would probably do is figure out how to detect if it's non-ASCII and then convert it to ASCII before running the regex on it. I'm not a big Perl guy but that's what I'd do in most any other language.

    And the complimentary link that I will leave on your pillow.
     
  3. macrumors 68030

    superbovine

    Joined:
    Nov 7, 2003
    #3
    In PHP I usually use iconv(); you can get it through Fink if need be.
     
  4. thread starter Moderator emeritus

    kainjow

    Joined:
    Jun 15, 2000
    #4
    OK, here's the problem. I'm now playing with PHP since it's a little easier to use for now. I recreated the script and tested it without doing any regex replacing on the string, and the output is still garbled. So I need to find a way of correctly reading UTF-8/UTF-16 text and writing it back out the same way:
    Code:
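    #!/usr/bin/php
    <?php
    // Read all of standard input and print it back unchanged.
    $html = "";
    // Compare against false so a line containing only "0" does not end the loop.
    while (($line = fgets(STDIN)) !== false) {
    	$html .= $line;
    }
    print $html;
    ?>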
    I've attached a test file you can use to test it. The way I've been testing it is
    Code:
    cat testfile.txt | ./striphtml.php > output.txt
    (I double-checked to make sure cat was working fine, and it outputs correctly.) So I have no clue now. :(
     
  5. thread starter Moderator emeritus

    kainjow

    Joined:
    Jun 15, 2000
    #5
    OK I'm stupid ;) In my script, I had a blank line between the #! line and the <?php line above, which PHP passed straight through to the output. The input was UTF-16 (2 bytes per code unit), so that single 1-byte \n shifted every following byte pair out of alignment, and that was throwing everything off :eek:
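
    To illustrate the misalignment, here is a small Perl sketch (not the original script; the sample string "Hi" is just an assumption) that puts one stray newline byte in front of UTF-16LE data and then reads it back as 16-bit units:

```perl
use strict;
use warnings;
use Encode qw(encode);

# "Hi" as UTF-16LE is the four bytes 48 00 69 00.
my $utf16   = encode('UTF-16LE', "Hi");

# A stray one-byte newline in front, like the blank line in the script.
my $shifted = "\n" . $utf16;

# Read the result as little-endian 16-bit units: every unit now
# pairs up the wrong bytes, and one byte is left over at the end.
my @units = unpack('v*', $shifted);
printf "U+%04X ", $_ for @units;
print "\n";
```

    This prints U+480A U+6900 instead of U+0048 U+0069, so everything after the stray byte is read as the wrong characters.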

    So basically I need to figure out how to work with UTF-16 data. So far I haven't had any luck :( I need to figure out a way for Perl to automatically determine the encoding of the text...
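
    One way this could be approached in Perl, as a rough sketch rather than a tested solution: look for a byte-order mark (BOM) at the start of the data, decode to characters with the core Encode module, run the regexes, and encode back to the same encoding. The function names guess_encoding and strip_html_bytes are hypothetical, and treating BOM-less input as UTF-8 is an assumption (UTF-16 without a BOM would need some other heuristic):

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Guess the encoding from a leading BOM.
# Returns (encoding name, BOM length in bytes).
sub guess_encoding {
    my ($bytes) = @_;
    return ('UTF-8',    3) if $bytes =~ /\A\xEF\xBB\xBF/;
    return ('UTF-16LE', 2) if $bytes =~ /\A\xFF\xFE/;
    return ('UTF-16BE', 2) if $bytes =~ /\A\xFE\xFF/;
    return ('UTF-8',    0);    # assumption: no BOM means UTF-8
}

# Decode raw bytes to characters, strip tags, and re-encode
# in the original encoding, keeping the original BOM (if any).
sub strip_html_bytes {
    my ($bytes) = @_;
    my ($enc, $bom_len) = guess_encoding($bytes);
    my $bom  = substr($bytes, 0, $bom_len);
    my $body = substr($bytes, $bom_len);
    my $str  = decode($enc, $body);
    $str =~ s/<script[^>]*>.*?<\/script>//gsi;    # remove <script> blocks
    $str =~ s/<(?:[^>'"]*|(['"]).*?\1)*>//gs;     # remove remaining tags
    return $bom . encode($enc, $str);
}

# Example: a UTF-16LE sample that starts with a BOM.
my $sample = "\xFF\xFE" . encode('UTF-16LE', "<b>caf\x{e9}</b>\n");
my $clean  = strip_html_bytes($sample);
print length($clean), " bytes after stripping\n";
```

    To use it as a filter, the script would read all of STDIN in binary mode (binmode STDIN; binmode STDOUT;) and print the result of strip_html_bytes.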
     
  6. thread starter Moderator emeritus

    kainjow

    Joined:
    Jun 15, 2000
    #6
    Now I'm getting desperate. I've tried every variation I could find and think of with PHP and Perl to get this to work. Basically it doesn't work properly with UTF-16, but UTF-8 is fine.

    Background: I'm calling this script (doesn't really matter if it's PHP or Perl) from a Cocoa app to work on some text. Perl is the fastest option I've found for regular expressions, which is why I'm using it. I did find some Cocoa code (OgreKit) that works with UTF-16, but it's super slow.

    So any more ideas? :(
     
  7. macrumors 68030

    superbovine

    Joined:
    Nov 7, 2003
    #7
    From Terminal, type "man iconv".

    If you look in the cowzilla CVS, there is an example of converting UTF-8 to ISO-8859-1; you should be able to do UTF-16 the same way without any problems.

    PHP:
    <?php

    //....from cowzilla

    iconv_set_encoding("internal_encoding", $this->feed->charset);
    iconv_set_encoding("output_encoding", $this->feed->charset);
    iconv_set_encoding("input_encoding", $this->feed->charset);

    //...

    //what you need, or something close to it.

    iconv_set_encoding("internal_encoding", "UTF-8");
    iconv_set_encoding("output_encoding", "UTF-16");
    ?>
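
    If the script stays in Perl, a similar conversion can be sketched with the core Encode module instead of iconv. This assumes the UTF-16 input starts with a BOM, which the generic 'UTF-16' decoder needs in order to pick a byte order; utf16_to_utf8 and utf8_to_utf16 are illustrative names, not library functions:

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Convert UTF-16 bytes (with a BOM) to UTF-8 bytes.
sub utf16_to_utf8 {
    my ($bytes) = @_;
    return encode('UTF-8', decode('UTF-16', $bytes));
}

# Convert UTF-8 bytes to UTF-16 bytes.
# Encode writes big-endian with a leading BOM for 'UTF-16'.
sub utf8_to_utf16 {
    my ($bytes) = @_;
    return encode('UTF-16', decode('UTF-8', $bytes));
}

# Round trip on a small sample: BOM (FE FF) followed by "hi".
my $utf16 = "\xFE\xFF\x00h\x00i";
my $utf8  = utf16_to_utf8($utf16);
print "$utf8\n";
```

    The idea would be to convert to UTF-8 on the way in, run the regexes, and convert back to UTF-16 on the way out.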
     
