grep, regular expressions, HTML files




wrldwzrd89
Aug 3, 2006, 04:23 PM
I could use a little help here. Alright, let's say I have this here HTML file (http://homepage.mac.com/wrldwzrd89/MacRumors/example.html), and I want to extract some of the things in brackets (which are just placeholders - in the HTML files I'm actually extracting data from there will be actual content where the brackets are) using grep, then send that data to a file.

I'm not entirely sure which regular expressions to use. Also, for a given piece of data I don't want the regular expression to return more than one match.

Oh, ignore all the broken links and images in the linked-to HTML file - they're supposed to be broken.



savar
Aug 3, 2006, 07:49 PM
First off, the best regex tutorial I've ever read:
http://www.regular-expressions.info/tutorial.html

What you're looking for is something like this:
\[.*?\]

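This matches a pair of square brackets with or without stuff in the middle. The ? makes * ungreedy, so it returns the shortest match possible instead of running on to the last ] on the line. The brackets must be escaped, since unescaped brackets delimit a character class in a regex. One caveat: ungreedy quantifiers are a Perl-style feature, and grep's default (POSIX) syntax does not support them. With plain grep, a negated character class such as \[[^]]*] gives the same shortest-match effect.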
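
To make that concrete, here is a rough sketch of how the extraction might look on the command line. It assumes your grep supports the -o option (GNU and BSD grep both do). The function name extract_brackets and the sample line are made up for illustration; they are not taken from your example.html.

```shell
#!/bin/sh
# Hypothetical helper: read text on stdin and print every [bracketed]
# chunk on its own line.
extract_brackets() {
    # -o prints only the matched text rather than the whole line.
    # \[     a literal opening bracket
    # [^]]*  any run of characters that are not a closing bracket
    # ]      a literal closing bracket (it needs no escape outside a class)
    grep -o '\[[^]]*]'
}

# Made-up sample line standing in for a line of the real HTML file.
printf '<td>[Name]</td><td>[Score]</td>\n' | extract_brackets
```

To keep only the first match, pipe the result through head -n 1. To send the data to a file, redirect the output, for example: extract_brackets < example.html > results.txt (results.txt is just a placeholder name).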