Perl, RegEx and UTF-8 Strings




kainjow
Nov 29, 2005, 10:08 AM
I've got a Perl script that simply parses out HTML from the standard input, and then outputs the result. However, for UTF-8/Unicode text (still not 100% clear on the difference between these encodings...), the output is all garbled. Anyone have any ideas?

Here's the Perl code:
#!/usr/bin/perl
# Slurp all of standard input into one string.
$str = "";
while ($line = <STDIN>)
{
    $str .= $line;
}
$str =~ s/<script[^>]*>.*?<\/script>//gsi; # remove <script> blocks
$str =~ s/<(?:[^>'"]*|(['"]).*?\1)*>//gs; # remove remaining HTML tags
print $str;



SilentPanda
Nov 29, 2005, 10:16 AM
What I would probably do is figure out how to detect if it's non-ASCII and then convert it to ASCII before running the regex on it. I'm not a big Perl guy but that's what I'd do in most any other language.

And the complimentary link (http://en.wikipedia.org/wiki/UTF-8) that I will leave on your pillow.

superbovine
Nov 29, 2005, 05:16 PM
In PHP I usually use iconv(); you can get it through Fink if need be.

kainjow
Nov 30, 2005, 10:48 AM
OK here's the problem. I'm now playing with PHP since it's a little easier to use for now, so I recreated the script and tested it without doing any regex replacing on the string, and when I output the string, it's still garbled. So I need to find a way of correctly reading UTF-8/16 text and writing it back out the same way:
#!/usr/bin/php
<?php
// Read all of standard input and echo it back unchanged.
$html = "";
while (($line = fgets(STDIN)) !== false)
    $html .= $line;
print $html;
?>
I've attached a test file you can use to test it. The way I've been testing it is cat testfile.txt | ./striphtml.php > output.txt (I double-checked to make sure cat was working fine, and it outputs correctly). So I have no clue now. :(

kainjow
Nov 30, 2005, 01:41 PM
OK I'm stupid ;) In my script, I had a blank line between the #! line and the <?php line above, and PHP echoed that blank line to the output. The input was UTF-16 (2 bytes per char), so that single 1-byte \n character shifted everything after it by one byte, and that was throwing everything off :o
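
To illustrate why one stray byte garbles everything, here is a small Perl sketch. It assumes big-endian UTF-16 without a BOM purely for the demonstration, and the helper name code_units is made up for this example. It reads the bytes back as 16-bit units before and after a single \n is put in front:

```perl
use strict;
use warnings;
use Encode qw(encode);

# Return the 16-bit big-endian code units found in a byte string
# (hypothetical helper; a trailing odd byte is simply ignored by unpack).
sub code_units {
    my ($bytes) = @_;
    return map { sprintf 'U+%04X', $_ } unpack 'n*', $bytes;
}

my $clean   = encode('UTF-16BE', 'hi');   # bytes 00 68 00 69
my $shifted = "\n" . $clean;              # stray one-byte newline in front

print join(' ', code_units($clean)),   "\n";   # U+0068 U+0069
print join(' ', code_units($shifted)), "\n";   # U+0A00 U+6800
```

Every code unit after the extra byte is paired up wrongly, so the whole file reads as unrelated characters.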

So basically I need to figure out how to work with UTF-16 data. So far haven't had any luck :(.. need to figure out a way for Perl to automatically determine the encoding of the text...
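
One possible way to handle the detection, as a minimal sketch rather than a definitive answer: look at the byte-order mark (BOM) at the start of the input, decode with Perl's Encode module, run the regexes on the decoded text, then write the result back in the same encoding. The sub names guess_encoding and strip_html are made up for illustration. Falling back to UTF-8 when there is no BOM is an assumption, and UTF-32 is not handled.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode encode);

# Guess the encoding from a leading BOM.
# Returns (encoding name, BOM length in bytes).
# ASSUMPTION: input without a BOM is treated as UTF-8.
sub guess_encoding {
    my ($bytes) = @_;
    return ('UTF-8',    3) if $bytes =~ /^\xEF\xBB\xBF/;
    return ('UTF-16BE', 2) if $bytes =~ /^\xFE\xFF/;
    return ('UTF-16LE', 2) if $bytes =~ /^\xFF\xFE/;
    return ('UTF-8',    0);
}

# Remove <script> blocks, then any remaining tags, from decoded text.
sub strip_html {
    my ($text) = @_;
    $text =~ s/<script[^>]*>.*?<\/script>//gsi;
    $text =~ s/<(?:[^>'"]*|(['"]).*?\1)*>//gs;
    return $text;
}

# Run as a filter only when executed directly as a script.
unless (caller) {
    binmode STDIN;    # read raw bytes, no newline or encoding translation
    binmode STDOUT;   # write raw bytes
    my $bytes = do { local $/; <STDIN> };
    $bytes = '' unless defined $bytes;

    my ($enc, $bom_len) = guess_encoding($bytes);
    my $bom  = substr($bytes, 0, $bom_len);
    my $body = substr($bytes, $bom_len);

    # Decode to characters, strip, re-encode, and keep the original BOM.
    my $text = decode($enc, $body);
    print $bom, encode($enc, strip_html($text));
}
```

Saved as, say, striphtml.pl (a hypothetical name), it would be used the same way as before: cat testfile.txt | ./striphtml.pl > output.txt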

kainjow
Nov 30, 2005, 05:25 PM
Now I'm getting desperate. I've tried every variation I could find and think of with PHP and Perl to get this to work. Basically it doesn't work properly with UTF-16, but UTF-8 is fine.

Background: I'm calling this script (doesn't really matter if it's PHP or Perl) from a Cocoa app to work on some text. Perl does regular expressions the fastest, that's why I'm using it. But I found some Cocoa code (OgreKit) that works with UTF-16, but it's super slow.

So any more ideas? :(

superbovine
Nov 30, 2005, 10:14 PM
from terminal type "man iconv"

if you look in the cowzilla cvs there is an example of converting UTF-8 to ISO-8859-1, should be able to do UTF-16 without any problems.


<?php

//....from cowzilla
iconv_set_encoding("internal_encoding", $this->feed->charset);
iconv_set_encoding("output_encoding", $this->feed->charset);
iconv_set_encoding("input_encoding", $this->feed->charset);

//...

//what you need, or something close to it.
iconv_set_encoding("internal_encoding", "UTF-8");
iconv_set_encoding("output_encoding", "UTF-16");
?>