I'm using libxml2 with XPath queries to extract information from some web pages. Everything has gone well so far, but I've run into a problem on one specific page.
This is the part of the page I'm trying to extract data from:
Code:
<table width="252" border="0" cellspacing="0" cellpadding="5">
<tr>
<td width="116" align="left" valign="top">
<h3>Original tittel</h3><br />Coco avant Chanel<br />
<h3>Genre</h3><br />Drama<br />
<h3>Nasjonalitet</h3><br />FRA<br />
<h3>Sensur</h3><br />Tillatt for alle<br />
<h3>Regi</h3><br />Anne Fontaine<br />
<h3>Medvirkende</h3><br />Audrey Tautou, Benoit Poelvoorde, Emmanuelle Devos, Marie Gillain<br />
<h3>Lengde</h3><br />1 t. 50 min.<br />
<h3>Filmbyrå</h3><br />SF Norge<br />
<td width="116" align="left" valign="top">
<h3>Publikum mener </h3>
<table width="100%" border="0" cellspacing="0" cellpadding="0">
<tr>
<td width="20">
<img src="http://www.oslokino.no/template/static/gfx/hoyre_kolonne/terninger/5.gif" alt="Terningkast" width="16" height="16" vspace="4" /></td>
<td align="left" valign="middle"> </td>
</tr>
</table>
What I want to get at is the text that follows each <h3>, sitting between the <br /> tags, but this has proven to be quite difficult.
With the XPath query string:
Code:
@"//td[@width='116' and @align='left' and @valign='top']/h3";
I get this:
Code:
(
{
nodeContent = Genre;
nodeName = h3;
},
{
nodeContent = Nasjonalitet;
nodeName = h3;
},
{
nodeContent = Regi;
nodeName = h3;
},
{
nodeContent = Produsent;
nodeName = h3;
},
{
nodeContent = Medvirkende;
nodeName = h3;
},
{
nodeContent = Musikk;
nodeName = h3;
},
{
nodeContent = Lengde;
nodeName = h3;
},
{
nodeContent = "Publikum mener";
nodeName = h3;
},
{
nodeContent = "Hva mener du?";
nodeName = h3;
}
)
But these headings are really what I would like to have as my nodeName, and the nodeContent would then be the text that follows each one; using Genre as an example, that would be Drama.
Can anybody here help me out with this? I've been reading examples and XPath tutorials for hours now, and I still can't find a way to do this.
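To make the goal concrete, here is a minimal sketch of the pairing described above. It is written in Python with only the standard library rather than against libxml2, and it runs on a shortened, hand-closed copy of the table cell (the closing </td> is added so that the strict XML parser accepts it). The function name pair_headings_with_values is made up for illustration. The point it shows is that the wanted text is not inside any element: it is a sibling of the <h3>, trailing the <br /> that comes right after it. In XPath terms that suggests that something like following-sibling::text()[1], evaluated relative to each h3, might select it with a full XPath 1.0 engine, but I have not tested that on this page.

```python
import xml.etree.ElementTree as ET

# Shortened excerpt of the cell; the closing </td> is added by hand so that
# the strict XML parser in the standard library accepts the fragment.
FRAGMENT = """
<td width="116" align="left" valign="top">
<h3>Original tittel</h3><br />Coco avant Chanel<br />
<h3>Genre</h3><br />Drama<br />
<h3>Nasjonalitet</h3><br />FRA<br />
<h3>Lengde</h3><br />1 t. 50 min.<br />
</td>
""".strip()


def pair_headings_with_values(td):
    """Map each <h3> heading to the text trailing the <br /> right after it."""
    pairs = {}
    children = list(td)
    for i, node in enumerate(children):
        if node.tag != "h3":
            continue
        heading = (node.text or "").strip()
        nxt = children[i + 1] if i + 1 < len(children) else None
        # The value is not a child of any element: ElementTree stores it as
        # the "tail" of the <br /> that immediately follows the heading.
        if nxt is not None and nxt.tag == "br":
            value = (nxt.tail or "").strip()
        else:
            value = ""
        pairs[heading] = value
    return pairs


td = ET.fromstring(FRAGMENT)
result = pair_headings_with_values(td)
for heading, value in result.items():
    print(f"{heading}: {value}")
```

The heading plays the role of nodeName and the trailing text the role of nodeContent. A heading that is not directly followed by a <br />, such as "Publikum mener", would simply map to an empty string here.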