psingh01

macrumors 68000
Original poster
Apr 19, 2004
1,610
659
Disclaimer: Consider me a beginner in XSL.

I want to datamine some information from an HTML page I have. It's a page from a web forum (not MacRumors), and I want to create an XML representation of the posts in a thread. Basically, I want to strip out all the nonessential HTML and extract just the username, date posted, and message for each post.

I have used the tidy utility to make sure the page is well-formed XHTML. What I need help with is setting up the XSL file, particularly the XPath used to match elements within the HTML.

As you can imagine, a forum page has many nested tables. In the example below, I want to pick out the <tr> elements of the innermost table. Any suggestions on how to do that?

Code:
<html>
  <body>
    <center>
      <table>
        <tr>
          <td>
            <table>
              <tr>
                <td>
                  <table>
                    <tr></tr>
                    <tr></tr>
                    <tr></tr>
                    <tr></tr>
                    .....
                  </table>
                </td>
              </tr>
            </table>
          </td>
        </tr>
      </table>
    </center>
  </body>
</html>
 
Not sure if you have any of the XSL done yet, but here's one way to grab those tr elements:
Code:
<xsl:template match="body">
  <!-- One step per nesting level; the sample wraps the outer table in <center> -->
  <xsl:for-each select="center/table/tr/td/table/tr/td/table/tr">
    stuff
  </xsl:for-each>
</xsl:template>
I work with XSL some and have done various things with it. I haven't applied it to XHTML as input, though I have transformed XML into XHTML.
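
A quick way to sanity-check a path like this before wiring it into a stylesheet is to run it against a cut-down copy of the page. The sketch below uses Python's standard `xml.etree.ElementTree`, which supports only a subset of XPath, so it is an illustration and not a replacement for the XSL. The cell text ("post 1" and so on) is made up for the example. It also assumes the markup carries no namespace, as in the sample from the question. If tidy was run with `-asxhtml`, the elements will be in the XHTML namespace and each step would need to be qualified.

```python
import xml.etree.ElementTree as ET

# Cut-down copy of the nested-table layout from the question.
# The cell text is invented purely for illustration.
SAMPLE = """
<html>
  <body>
    <center>
      <table>
        <tr>
          <td>
            <table>
              <tr>
                <td>
                  <table>
                    <tr><td>post 1</td></tr>
                    <tr><td>post 2</td></tr>
                    <tr><td>post 3</td></tr>
                  </table>
                </td>
              </tr>
            </table>
          </td>
        </tr>
      </table>
    </center>
  </body>
</html>
""".strip()

root = ET.fromstring(SAMPLE)  # root is the <html> element

# Explicit path, one step per nesting level, relative to <html>.
explicit_rows = root.findall("body/center/table/tr/td/table/tr/td/table/tr")

# Alternative that does not hard-code the depth: take the rows of every
# table that has no other table nested inside it.
innermost_rows = [
    tr
    for table in root.iter("table")
    if table.find(".//table") is None
    for tr in table.findall("tr")
]

row_text = [tr.findtext("td") for tr in explicit_rows]
print(row_text)
```

The second selection is the more forgiving of the two if the forum ever adds or removes a wrapper table. In XPath 1.0, which XSLT uses, the same idea can be written as `//table[not(.//table)]/tr`.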
 
You could also write a regexp to do this and set up a cron job to spider the thread. For that, you could just search for <tr>.*?</tr>. However, this works best if there's some distinguishing identifier for the specific cell you want (like an id or class, or a phrase like "Title:" that appears in all posts). It's still doable if you need lookarounds, but the regexp gets messy.
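
For comparison, here is a minimal sketch of that regexp idea in Python. The HTML snippets and the "Title:" marker are invented for the example, and real forum markup will differ. It shows the non-greedy row match, filtering on a distinguishing phrase, and why nested tables make the approach messy.

```python
import re

# Invented flat table: two post rows and one navigation row.
html = """<table>
<tr><td class="post">Title: first</td></tr>
<tr>
  <td class="post">Title: second</td>
</tr>
<tr><td class="nav">next page</td></tr>
</table>"""

# Non-greedy match of each row; DOTALL lets "." cross line breaks.
ROW = re.compile(r"<tr>(.*?)</tr>", re.DOTALL)
rows = ROW.findall(html)

# Keep only the rows that carry the distinguishing phrase.
posts = [row for row in rows if "Title:" in row]

# Strip the remaining tags to get the visible text of each post row.
TAG = re.compile(r"<[^>]+>")
post_text = [TAG.sub("", row).strip() for row in posts]
print(post_text)

# With nested tables the match starts at the outer <tr> but stops at the
# first </tr>, which belongs to the inner row, so the capture is lopsided.
nested = "<tr><td><table><tr><td>inner</td></tr></table></td></tr>"
nested_rows = ROW.findall(nested)
print(nested_rows)
```

The last two lines are the reason the pattern alone is not enough for a page built from tables inside tables: the regexp has no notion of which closing tag pairs with which opening tag.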
 
Thanks for the sample. That is exactly what I was looking for. I had a template, but it just wasn't working. I didn't know about the for-each element. Thanks!

I tried something similar to the regexp approach at first, because there are some Google ad comments throughout the HTML that I could key off of. I had a Java program that would search for these comments, grab what I wanted, then move on to the next section. It kind of worked, but it got messy real fast.

This XSL method has turned out to be much simpler to work with (once I got help with the XPath :) ). I've got a nice pipeline going: wget (get raw HTML) -> tidy (clean up to XHTML) -> xalan (apply XSL to reduce it to XML) -> ? do fun stuff here
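
The pipeline could be scripted roughly as follows. This is a sketch under several assumptions, none of which come from the post itself: the file names (`thread.html`, `thread.xhtml`, `thread.xml`) are hypothetical, "xalan" is taken to mean Xalan-Java's command-line class with its jar on the classpath, and the tidy options are one plausible choice. One detail worth handling is that tidy exits with status 1 when it only issued warnings, so that step should not be treated as a failure.

```python
import subprocess
from pathlib import Path


def build_pipeline(url, workdir, stylesheet):
    """Return the three commands as (argv, acceptable exit codes) pairs.

    All file names are hypothetical placeholders for this sketch.
    """
    workdir = Path(workdir)
    raw = workdir / "thread.html"     # raw page as downloaded
    xhtml = workdir / "thread.xhtml"  # tidy's cleaned-up output
    out = workdir / "thread.xml"      # result of the XSL transform
    return [
        # wget: -q is quiet, -O names the output file.
        (["wget", "-q", "-O", str(raw), url], {0}),
        # tidy: exit status 1 means "warnings only", so accept it too.
        (["tidy", "-q", "-asxhtml", "-numeric", "-o", str(xhtml), str(raw)],
         {0, 1}),
        # Assumption: Xalan-Java's command-line entry point.
        (["java", "org.apache.xalan.xslt.Process",
          "-IN", str(xhtml), "-XSL", str(stylesheet), "-OUT", str(out)],
         {0}),
    ]


def run_pipeline(steps):
    """Run each step in order and stop at the first unexpected exit code."""
    for argv, ok_codes in steps:
        result = subprocess.run(argv)
        if result.returncode not in ok_codes:
            raise RuntimeError(
                f"{argv[0]} exited with status {result.returncode}"
            )
```

Calling `run_pipeline(build_pipeline(url, "work", "posts.xsl"))` would then fetch, clean, and transform one thread page, where `posts.xsl` stands in for whatever the stylesheet is actually called.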
 
I guess the forum software you're using doesn't offer an RSS feed of the posts?

(The response I'm expecting is either going to be "nope" or "awww, crap!" ;) )
 
The forum software seems to be state of the art circa 1994, lol, so no RSS feed... but I hadn't thought of it anyway, hehe.

My ultimate aim is to be able to archive the forum in XML so it can be migrated somewhere else, or just plain searched, since the forum's search feature sucks.
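
Once the posts are in XML, searching them takes only a few lines. The sketch below assumes a hypothetical archive layout, a `<thread>` of `<post>` elements each holding `<username>`, `<date>`, and `<message>`, which mirrors the three fields mentioned at the start of the thread. The element names and the sample posts are made up, and the real output of the stylesheet may look different.

```python
import xml.etree.ElementTree as ET

# Hypothetical archive format with invented sample posts.
ARCHIVE = """<thread>
  <post>
    <username>alice</username>
    <date>2007-01-05</date>
    <message>How do I match nested tables?</message>
  </post>
  <post>
    <username>bob</username>
    <date>2007-01-06</date>
    <message>Use an XPath with one step per level.</message>
  </post>
</thread>"""


def search_posts(thread, term):
    """Return posts whose message contains term, ignoring case."""
    term = term.lower()
    return [
        {child.tag: (child.text or "") for child in post}
        for post in thread.iter("post")
        if term in (post.findtext("message") or "").lower()
    ]


hits = search_posts(ET.fromstring(ARCHIVE), "xpath")
for hit in hits:
    print(hit["date"], hit["username"], hit["message"])
```

Each hit comes back as a plain dictionary keyed by element name, which keeps the search independent of whatever the archive is migrated into later.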
 