Applying an XSLT to XHTML

Discussion in 'Web Design and Development' started by psingh01, Mar 20, 2009.

  1. psingh01 macrumors 65816

    Joined:
    Apr 19, 2004
    #1
    Disclaimer: Consider me a beginner in XSL.

    I want to data-mine some information from an HTML page I have. It's a page from a web forum (not macrumors), and I want to create an XML representation of the posts in a thread. So basically I want to strip out all the nonessential HTML and just extract the username, date posted, and message.

    I have used the tidy utility to make sure the page is well-formed XHTML. What I need help with is setting up the XSL file, particularly the XPath used to match elements within the HTML.

    As you can imagine, a forum page has many nested tables. In the example below, I want to pick out the <tr> elements of the innermost table. Any suggestions on how to do that?

    Code:
    <html>
      <body> 
         <center>
            <table>
              <tr>
                <td>
                  <table>
                    <tr>
                      <td>
                        <table>
                           <tr></tr>
                           <tr></tr>
                           <tr></tr>
                           <tr></tr>
                            .....
                          </table>
                       </td>
                     </tr>
                  </table>
               </td>
             </tr>
          </table>
        </center>
      </body>
    </html>
     
  2. angelwatt Moderator emeritus

    Joined:
    Aug 16, 2005
    Location:
    USA
    #2
    Not sure if you have any of the XSL done yet, but for grabbing those tr elements,
    HTML:
    <xsl:template match="body">
      <!-- In the sample above, <center> sits between <body> and the outer table -->
      <xsl:for-each select="center/table/tr/td/table/tr/td/table/tr">
        stuff
      </xsl:for-each>
    </xsl:template>
    I work with XSL some and have done various things with it. I haven't applied it to XHTML, though I have turned XML into XHTML.
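For anyone who wants to check the path outside an XSLT processor, here is a minimal Python sketch using only the standard library's `xml.etree.ElementTree`. The `SAMPLE` string is a hypothetical stand-in for the tidied page (the cell text is made up), and it assumes no default namespace. If the tidied XHTML declares `xmlns="http://www.w3.org/1999/xhtml"`, the element names would need to be namespace-qualified, in ElementTree and in XSLT alike. In the sample markup `<center>` sits between `<body>` and the outermost `<table>`, so the explicit path includes it.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the tidied page, mirroring the nesting in post #1.
SAMPLE = """<html>
  <body>
    <center>
      <table><tr><td>
        <table><tr><td>
          <table>
            <tr><td>post 1</td></tr>
            <tr><td>post 2</td></tr>
            <tr><td>post 3</td></tr>
          </table>
        </td></tr></table>
      </td></tr></table>
    </center>
  </body>
</html>"""

root = ET.fromstring(SAMPLE)  # root is the <html> element

# Explicit path from <html> down to the rows of the innermost table.
explicit_rows = root.findall("body/center/table/tr/td/table/tr/td/table/tr")

# Alternative that does not hard-code the depth:
# keep only tables that have no <table> descendant, then take their rows.
innermost_tables = [t for t in root.iter("table") if t.find(".//table") is None]
innermost_rows = [tr for t in innermost_tables for tr in t.findall("tr")]

row_texts = [tr.findtext("td") for tr in innermost_rows]
```

The second approach does not depend on how deep the nesting goes, which may hold up better if other pages of the forum wrap the posts differently.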
     
  3. memco macrumors 6502

    Joined:
    May 1, 2008
    #3
    You could also create a regexp to do this, and could set up a cron job to spider the thread. For that, you could just search for <tr>.*?</tr>. However, this would work best if there's some distinguishing identifier for the specific cell you want (like an id or class, or a phrase like "Title:" that appears in all posts). It's still doable if you need to use lookarounds, but the regexp gets messy.
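A rough Python sketch of the regexp idea, using the standard `re` module. The markup in `PAGE` and the `class="post"` marker are assumptions made up for illustration; the real forum may offer a different hook or none at all. It also shows why the plain pattern struggles with nested tables: the non-greedy match starts at the outer `<tr>` and stops at the first `</tr>` it meets, which belongs to an inner row.

```python
import re

# Hypothetical markup: post rows carry class="post" (an assumed marker).
PAGE = """<table><tr><td>
<table>
<tr class="post"><td>alice</td><td>first message</td></tr>
<tr class="post"><td>bob</td><td>second message</td></tr>
</table>
</td></tr></table>"""

# The plain pattern from the post; DOTALL lets "." match newlines too.
naive = re.findall(r"<tr>.*?</tr>", PAGE, flags=re.DOTALL)

# Keyed on the distinguishing attribute, each match is exactly one post row.
keyed = re.findall(r'<tr class="post">(.*?)</tr>', PAGE, flags=re.DOTALL)

# Split each matched row into its cell contents.
cells = [re.findall(r"<td>(.*?)</td>", row) for row in keyed]
```

Here `naive` ends up with a single match that begins in the outer row and ends inside the inner table, while `cells` holds one list of cell texts per post.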
     
  4. psingh01 thread starter macrumors 65816

    Joined:
    Apr 19, 2004
    #4
    Thanks for the sample. That is exactly what I was looking for. I had a template, but it just wasn't working. Didn't know about the for-each element. Thanks!

    I tried something similar to the regexp idea at first, because there are some Google ad comments throughout the HTML that I could key off of. I had a Java program that would search for these comments, grab what I wanted, then move on to the next section. It kind of worked, but it got messy real fast.

    This XSL method has turned out to be much simpler to work with (once I got help with the XPath :) ). I've got a nice pipeline going: wget (get raw HTML) -> tidy (clean up to XHTML) -> xalan (apply XSL to further clean up to XML) -> ? do fun stuff here
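The pipeline above could be scripted roughly as follows. This is only a sketch under stated assumptions: the file names (`raw.html`, `page.xhtml`, `posts.xml`) are made up, and "xalan" is assumed to mean Xalan-Java's command-line class `org.apache.xalan.xslt.Process` being available on the Java classpath. The thread does not say which Xalan flavor or which options were actually used.

```python
import os
import subprocess


def build_commands(url, xsl_path, workdir="."):
    """Return the three command lines, each one reading the previous output."""
    raw = os.path.join(workdir, "raw.html")       # hypothetical file names
    xhtml = os.path.join(workdir, "page.xhtml")
    out = os.path.join(workdir, "posts.xml")
    return [
        # wget: fetch the raw HTML quietly into a file
        ["wget", "-q", "-O", raw, url],
        # tidy: convert to well-formed XHTML with numeric entities
        ["tidy", "-q", "-asxhtml", "-numeric", "-o", xhtml, raw],
        # Xalan-Java (assumed): apply the stylesheet to the XHTML
        ["java", "org.apache.xalan.xslt.Process",
         "-IN", xhtml, "-XSL", xsl_path, "-OUT", out],
    ]


def run_pipeline(url, xsl_path, workdir="."):
    """Run each step in order; workdir must already exist."""
    for cmd in build_commands(url, xsl_path, workdir):
        result = subprocess.run(cmd)
        # tidy exits with 1 when it only has warnings, so tolerate that.
        allowed = (0, 1) if cmd[0] == "tidy" else (0,)
        if result.returncode not in allowed:
            raise RuntimeError(f"step failed: {' '.join(cmd)}")
    return os.path.join(workdir, "posts.xml")
```

A call such as `run_pipeline("http://example.com/thread", "posts.xsl")` (both arguments hypothetical) would leave the extracted posts in `posts.xml`, ready for the "fun stuff" step.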
     
  5. notjustjay macrumors 603

    Joined:
    Sep 19, 2003
    Location:
    Canada, eh?
    #5
    I guess the forum software you're using doesn't offer an RSS feed of the posts?

    (The response I'm expecting is either going to be "nope" or "awww, crap!" ;) )
     
  6. psingh01 thread starter macrumors 65816

    Joined:
    Apr 19, 2004
    #6
    The forum software seems to be state of the art circa 1994 lol, so no RSS feed... but I hadn't thought of it anyway hehe.

    My ultimate aim is to be able to archive the forum in XML so it can be migrated somewhere else. Or just to plain search through it, since the forum search feature sucks.
     
