macOS Perl, RegEx and UTF-8 Strings

kainjow · Nov 29, 2005

I've got a Perl script that simply parses out HTML from the standard input, and then outputs the result. However, for UTF-8/Unicode text (still not 100% clear on the difference between these encodings...), the output is all garbled. Anyone have any ideas?

Here's the Perl code:

Code:

#!/usr/bin/perl
$str = "";
while ($line=<STDIN>)
{
	$str .= $line;
}
$str =~ s/<script[^>]*>(.*?)<\/script>//gsi; # remove <script>
$str =~ s/<(?:[^>'"]*|(['"]).*?\1)*>//gsi; # remove html (\1 closes the quoted attribute value)
print $str;
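
A minimal sketch of one common way to keep UTF-8 intact through a script like the one above: decode the input bytes into Perl characters before running the regexes, then encode the result back to UTF-8 before printing. It assumes the input really is UTF-8. The helper name strip_html_utf8 is hypothetical, and the sample string is built in memory rather than read from STDIN so the sketch runs on its own.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: takes raw UTF-8 bytes, returns raw UTF-8 bytes
# with <script> blocks and tags removed.
sub strip_html_utf8 {
    my ($bytes) = @_;

    # Turn bytes into characters so the regexes see whole code points.
    my $str = decode('UTF-8', $bytes);

    $str =~ s/<script[^>]*>.*?<\/script>//gsi;    # remove <script> blocks
    $str =~ s/<(?:[^>'"]*|(['"]).*?\1)*>//gsi;    # remove remaining tags

    # Back to bytes for output.
    return encode('UTF-8', $str);
}

# In-memory sample (assumption: stands in for the real STDIN data).
my $sample = encode('UTF-8', "<p>caf\x{e9} <b>\x{65e5}\x{672c}\x{8a9e}</b></p>");
print strip_html_utf8($sample), "\n";
```

With real input, the bytes would come from STDIN read in raw mode (binmode(STDIN, ':raw')) instead of from $sample.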

SilentPanda · Nov 29, 2005

What I would probably do is figure out how to detect if it's non-ASCII and then convert it to ASCII before running the regex on it. I'm not a big Perl guy but that's what I'd do in most any other language.

And the complimentary link that I will leave on your pillow.

superbovine · Nov 29, 2005

In PHP I usually use iconv(); you can get it through Fink if need be.

kainjow · Nov 30, 2005

OK, here's the problem. I'm now playing with PHP since it's a little easier to use for now, so I recreated the script and tested it without doing any regex replacing on the string, and the output is still garbled. So I need to find a way to read UTF-8/16 text and write it back out the same way:

Code:

#!/usr/bin/php
<?php
$html = "";
while ($line = fgets(STDIN))
	$html .= $line;
print $html;
?>

I've attached a test file you can use to test it. The way I've been testing it is

Code:

cat testfile.txt | ./striphtml.php > output.txt

(I double-checked to make sure cat was working fine, and it outputs correctly). So I have no clue now. 🙁

kainjow · Nov 30, 2005

OK I'm stupid 😉 In my script, I had a blank line in between the #! line and the <?php line above, and PHP was outputting it. The input was UTF-16 (2 bytes per char), so that single 1-byte \n character shifted every byte pair after it, and that was throwing everything off 😱

So basically I need to figure out how to work with UTF-16 data. So far haven't had any luck 🙁.. need to figure out a way for Perl to automatically determine the encoding of the text...
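
One possible approach to guessing the encoding, sketched under stated assumptions: look for a byte-order mark (BOM) at the start of the data and fall back to UTF-8 when there is none. The fallback is a guess, not a rule, since UTF-16 text without a BOM would be misread, and UTF-32 is ignored here. The helper names sniff_encoding, decode_input and encode_output are hypothetical.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode encode);

# Guess the encoding from a leading BOM; default to UTF-8 (an assumption).
sub sniff_encoding {
    my ($bytes) = @_;
    return 'UTF-8'    if $bytes =~ /\A\xEF\xBB\xBF/;
    return 'UTF-16LE' if $bytes =~ /\A\xFF\xFE/;
    return 'UTF-16BE' if $bytes =~ /\A\xFE\xFF/;
    return 'UTF-8';
}

# Decode bytes to characters, strip the BOM character, and remember
# the encoding and whether a BOM was present.
sub decode_input {
    my ($bytes) = @_;
    my $enc     = sniff_encoding($bytes);
    my $str     = decode($enc, $bytes);
    my $had_bom = ($str =~ s/\A\x{FEFF}//) ? 1 : 0;
    return ($str, $enc, $had_bom);
}

# Re-encode in the original encoding, restoring the BOM if there was one.
sub encode_output {
    my ($str, $enc, $had_bom) = @_;
    $str = "\x{FEFF}" . $str if $had_bom;
    return encode($enc, $str);
}

# In-memory UTF-16LE sample with a BOM (stands in for the real input).
my $sample = "\xFF\xFE" . encode('UTF-16LE', "<b>h\x{e9}llo</b>\n");
my ($text, $enc, $had_bom) = decode_input($sample);
$text =~ s/<[^>]*>//g;    # simplified tag stripping, for illustration only
my $out = encode_output($text, $enc, $had_bom);
printf "detected %s, %d bytes out\n", $enc, length $out;
```

Because the regexes run on decoded characters and the result is re-encoded as a whole, no stray single-byte characters end up in the UTF-16 output. When printing $out for real, STDOUT should be in raw mode (binmode(STDOUT, ':raw')).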

kainjow · Nov 30, 2005

Now I'm getting desperate. I've tried every variation I could find and think of with PHP and Perl to get this to work. Basically it doesn't work properly with UTF-16, but UTF-8 is fine.

Background: I'm calling this script (doesn't really matter if it's PHP or Perl) from a Cocoa app to work on some text. Perl does regular expressions the fastest, that's why I'm using it. But I found some Cocoa code (OgreKit) that works with UTF-16, but it's super slow.

So any more ideas? 🙁

superbovine · Nov 30, 2005

From Terminal, type "man iconv".

If you look in the cowzilla CVS there is an example of converting UTF-8 to ISO-8859-1; it should be able to do UTF-16 without any problems.

PHP:

<?php

//....from cowzilla

      iconv_set_encoding("internal_encoding", $this->feed->charset);
      iconv_set_encoding("output_encoding", $this->feed->charset);
      iconv_set_encoding("input_encoding", $this->feed->charset);

//...

//what you need, or something close to it.

iconv_set_encoding("internal_encoding", "UTF-8");
iconv_set_encoding("output_encoding", "UTF-16");
?>
