extracting information from an <a> tag using regexp in PHP




mrjamin
Apr 15, 2004, 05:12 PM
Ok, here's the dealio

I have a string containing a load of <a href="http://domain.tld" title="detailed description">link text</a> style links, one per line.

How would I extract:

1) the URL
2) the value of title attribute
3) the link text

I figured regular expressions are the way to go, but I'm a little confused about where to start!

Any pointers? I came up with this:

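<?php
// Crude parser: expects one <a href="..." title="...">text</a> link per line.
function extractLink($link) {
    // Break the input into lines.
    $link = explode("\n", trim($link));
    for ($i = 0; isset($link[$i]); $i++) {
        // Break each line on double quotes:
        // [1] = URL, [3] = title attribute, [4] = ">link text</a>"
        $link[$i] = explode("\"", $link[$i]);
        $link[$i]['url'] = substr($link[$i][1], 7);         // strip the leading "http://"
        $link[$i]['description'] = $link[$i][3];            // value of the title attribute
        $link[$i]['title'] = substr($link[$i][4], 1);       // drop the leading ">"
        $link[$i]['title'] = strrev(substr(strrev($link[$i]['title']), 4)); // drop the trailing "</a>"
    }
    for ($i = 0; isset($link[$i]); $i++) {
        foreach ($link[$i] as $key => $value) {
            if (is_numeric($key)) {
                // Discard the raw pieces left over from explode().
                unset($link[$i][$key]);
            } else {
                $link[$i][$key] = htmlentities($value);
            }
        }
    }
    return $link;
}
?>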

It's crude, but it does the job. It would get messed up if there were no title attribute, though.

Thanks in advance,

MrJ



Knox
Apr 16, 2004, 05:16 AM
preg_match("/<a href=\"([^\"]+)\"( title=\"([^\"]+)\"|)>([^\<]+)<\/a>/",
$link, $matches);

That gives an array $matches with the various elements; the first element is always the full link. The pattern matches the link with or without the title attribute.


Array
(
[0] => <a href="http://domain.tld">link text</a>
[1] => http://domain.tld
[2] =>
[3] =>
[4] => link text
)

Array
(
[0] => <a href="http://domain.tld" title="detailed description">link text</a>
[1] => http://domain.tld
[2] => title="detailed description"
[3] => detailed description
[4] => link text
)


Of course you can ignore element 2. There may be a way to get rid of it, but I'm not sure.
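
One possible way to drop that extra element is a non-capturing group, written (?: ... ), made optional with a trailing ?. The following is only a minimal sketch, assuming links in exactly the format shown above; the variable names and sample strings are made up for illustration. Note that the element numbers shift down by one compared with the arrays above.

```php
<?php
// Sample input (assumed): one link with a title attribute, one without.
$withTitle    = '<a href="http://domain.tld" title="detailed description">link text</a>';
$withoutTitle = '<a href="http://domain.tld">link text</a>';

// (?: ... )? groups the optional title attribute without capturing it,
// so only the URL, the title value, and the link text get their own elements.
$pattern = '/<a href="([^"]+)"(?: title="([^"]+)")?>([^<]+)<\/a>/';

preg_match($pattern, $withTitle, $m1);
preg_match($pattern, $withoutTitle, $m2);

print_r($m1); // [1] = URL, [2] = title value, [3] = link text
print_r($m2); // [2] is an empty string here, since there was no title to capture
```

With this variant the title value sits in element 2 and the link text in element 3, so any code that reads the result needs to use those indexes instead.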

I'll explain how it works if you want me to :)

Knox
Apr 16, 2004, 12:46 PM
Oh, come to think of it, if you're wanting to match lots of links then you'll want preg_match_all rather than just preg_match.
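
For the one-link-per-line string from the original question, a rough sketch with preg_match_all might look like the following. It reuses the pattern from above unchanged; the sample input (including the example.org link), the $result array, and its key names are assumptions for illustration only.

```php
<?php
// Sample input (assumed): several links, one per line.
// The second link and its domain are made up for this example.
$links = <<<HTML
<a href="http://domain.tld" title="detailed description">link text</a>
<a href="http://example.org">another link</a>
HTML;

$pattern = '/<a href="([^"]+)"( title="([^"]+)"|)>([^<]+)<\/a>/';

// PREG_SET_ORDER groups the results per link rather than per capture group,
// so each entry of $matches looks like the arrays shown earlier.
preg_match_all($pattern, $links, $matches, PREG_SET_ORDER);

$result = array();
foreach ($matches as $m) {
    $result[] = array(
        'url'   => $m[1],
        'title' => $m[3], // empty string when there is no title attribute
        'text'  => $m[4],
    );
}

print_r($result);
```

PREG_SET_ORDER is used here because it keeps everything belonging to one link together; the default ordering would instead give one array of all URLs, one of all titles, and so on.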