kingsapo

macrumors member
Original poster
Jun 10, 2010
58
4
Hello all! So I have written a program that points to a specified URL, downloads the HTML from the URL and stores it in a variable. Now I am trying to search the HTML variable, and more precisely, search between two HTML tags (I think that's what they're called) and extract everything between them. In this example, I am trying to copy everything from <div class="FP_Up_Item2"> to the closing </div> into its own variable.

HTML:
<div class="FP_Up_Item2">
    
        <div class="FP_Up_Date">
    <b><span style='color:#199600; font-weight:bold;'>01 Aug 2010</span></b> <br />  <span style='color:#a3a3a3;'>out now</span>
    </div>
    
    <div class="FP_Up_ImageWrap"><img src="http://m2.n4g.com/8/GameProfiles/125000/125674_2_cov_med.jpg" /></div>
    <div class="FP_Up_TextWrap" style="width: 300px;">
    <a href="/PC/ReleaseDate-125674.aspx" style="color:#95581f"><b>Alter Ego</b></a> <br /><span style="font-weight: normal; color: #a3a3a3; font-size: 10pt;">(PC)</span><br />
    </div>
    
   <div class='FP_Up_Score'><a href='/PC/ReleaseDate-125674.aspx?ShowReviews=1'>5.9</a></div>

    </div>

So far, I have found something called NSScanner, though I haven't really found a way to implement this yet. I'm pretty sure there's another way to do this, though, so if anyone could give any insight, it would really help! (By the way, as the title implies, the HTML from above is in an NSString variable.)
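
For reference, here is a minimal sketch of how NSScanner can pull out the text between two markers. The helper name SubstringBetween is made up for illustration and is not part of Foundation. It stops at the first occurrence of the end marker, so it only suits elements that contain no nested copy of the closing tag (the FP_Up_Score div, for instance, but not the outer FP_Up_Item2 div).

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper (not a Foundation API): returns the text between the
// first occurrence of startMarker and the next occurrence of endMarker,
// or nil if either marker is missing.
static NSString *SubstringBetween(NSString *html, NSString *startMarker, NSString *endMarker)
{
    NSScanner *scanner = [NSScanner scannerWithString:html];
    // NSScanner skips whitespace and newlines by default; keep them instead.
    [scanner setCharactersToBeSkipped:nil];

    // Advance to the start marker, then step over it.
    [scanner scanUpToString:startMarker intoString:NULL];
    if (![scanner scanString:startMarker intoString:NULL])
        return nil;   // start marker not found

    // Collect everything up to (not including) the end marker.
    NSString *found = nil;
    [scanner scanUpToString:endMarker intoString:&found];
    if ([scanner isAtEnd])
        return nil;   // ran off the end: end marker not found

    return found ? found : @"";   // markers were adjacent
}
```

Calling SubstringBetween(html, @"<div class='FP_Up_Score'>", @"</div>") on the HTML above would be the intended use; the result still contains the inner <a> element.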
 

PhoneyDeveloper

macrumors 68040
Sep 2, 2008
3,114
93
I'm not an expert on this but here's what I think I know. The BEST and most CORRECT way to parse HTML is to use one of the XML parsers. A real parser follows the structure of the markup instead of matching raw text, so it should give valid results.

You can also parse HTML with simple NSString methods, NSScanner or regular expressions. Depending on how reliable your input HTML is, it may be simple or difficult to parse it correctly with any of these string-manipulation methods.

I would look into the xml parsers. Knowing how to do this is a good skill to have.
 

robbieduncan

Moderator emeritus
Jul 24, 2002
25,611
893
Harrogate
The problem with using XML parsers on HTML is that HTML 4.0 documents may not be valid XML. XHTML documents will be. HTML 5 documents will be only if they are written in the XML syntax (XHTML5).
 

idelovski

macrumors regular
Sep 11, 2008
235
0
Jeff LaMarche wrote a small demo application that downloads images from http://www.deviantart.com and there he parses html to extract locations of those images. You can download the project from his blog where you can read the whole story behind it (even comments by the visitors are very informative.)

Anyway, here's the method that extracts the substring between the tags. This version is modified a bit. As I was reading his source I wanted to make the code more obvious and a bit easier to read. You can use that, or you can find the original at his place.

Code:
@interface NSString (search) 
- (NSRange)rangeAfterString:(NSString *)inString1 toStartOfString:(NSString *)inString2;
@end

@implementation NSString(search)

- (NSRange)rangeAfterString:(NSString *)inString1 toStartOfString:(NSString *)inString2
{
   size_t    strLength = [self length];
   size_t    foundLocation = 0;
   
   NSRange   startStrRange = NSMakeRange (0, 0);
   NSRange   endStrRange   = NSMakeRange (strLength, 0);  // if no end string, end here
   NSRange   finalSearchRange;
   NSRange   resultRange;
   
   if (inString1)  {
      startStrRange = [self rangeOfString:inString1 options:0 range:NSMakeRange(0,strLength)];
      if (startStrRange.location == NSNotFound)  {
         return (startStrRange);	// not found
      }
      foundLocation = NSMaxRange (startStrRange);
   }
   
   finalSearchRange = NSMakeRange (foundLocation, strLength - foundLocation);
   
   if (inString2)  {
      endStrRange = [self rangeOfString:inString2 options:0 range:finalSearchRange];
      if (endStrRange.location == NSNotFound)  {
         return (endStrRange);	// not found
      }
   }
   
   size_t  rangeLoc = startStrRange.location + [inString1 length];
   size_t  rangeLen = NSMaxRange (endStrRange) - rangeLoc - [inString2 length];
   
   resultRange = NSMakeRange (rangeLoc, rangeLen);
   
   return (resultRange);
}

@end
 

kingsapo

macrumors member
Original poster
Jun 10, 2010
58
4
idelovski, thanks for the response, it really helped. I think I understand what that person did: they searched for the tag they were looking for and marked its location, then searched for the ending tag after it and marked its location, then grabbed everything between locations a and b. After looking through the program, though, I have failed to find anything that defines what tag, or what text at all, he is telling the program to look for. So I guess I wanted to know, do you know where in the program those tags are defined, or how exactly he managed to search for the image links without defining anything to search for?
 

chown33

Moderator
Staff member
Aug 9, 2009
10,751
8,424
A sea of green
Search all the source in the project for rangeAfterString. You'll find it in one file other than the NSString-search files. That's where it's used:
Code:
NSRange urlRange = [payload rangeAfterString:START_TOKEN toStartOfString:STOP_TOKEN];

Clearly, START_TOKEN and STOP_TOKEN are symbolic names (could be variables or could be #defined). So search all the source in the project for START_TOKEN.


By the way, I don't think this method will work for you. You have nested div elements in your HTML, so searching for the first </div> after finding "<div class="FP_Up_Item2">" won't work. You have to search for the MATCHING </div>, which is a harder task. You could probably write a more complex parser using the rangeAfterString method, but the logic of doing that is at least as complex as learning how to write an NSXMLParserDelegate.

Your best bet is going to be an NSXMLParser. It is a non-validating parser, so DTD compliance won't matter, but it does require well-formed XML: at the first mismatched or unclosed tag it reports an error to the delegate and stops. If the HTML doesn't have perfectly nested elements, it will need to be cleaned up before parsing. With well-formed input, an NSXMLParserDelegate lets you handle the <>-enclosed tags and other text you care about and ignore everything else. Knowing how to do this is a useful skill. Don't expect it to be simple.
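
A rough sketch of such a delegate, under several assumptions: the page is well-formed XML (the parse fails otherwise), the code is compiled with ARC, and the class name ItemTextCollector and the function TextInsideItem are hypothetical names invented here. It tracks nesting depth so that the matching close tag of the FP_Up_Item2 div is found, and collects the text content (not the raw markup) inside it.

```objc
#import <Foundation/Foundation.h>

// Hypothetical delegate: gathers the text inside the first
// <div class="FP_Up_Item2">, nested elements included.
@interface ItemTextCollector : NSObject <NSXMLParserDelegate>
@property (nonatomic, readonly) NSString *text;   // nil until the div is seen
@end

@implementation ItemTextCollector
{
    NSMutableString *_collected;
    NSInteger        _depth;   // 0 while outside the target div
    BOOL             _done;    // YES once the target div has been closed
}

- (NSString *)text { return _collected; }

- (void)parser:(NSXMLParser *)parser didStartElement:(NSString *)elementName
  namespaceURI:(NSString *)namespaceURI qualifiedName:(NSString *)qName
    attributes:(NSDictionary *)attributeDict
{
    if (_depth > 0) {
        _depth++;   // any element nested inside the target div
    } else if (!_done && [elementName isEqualToString:@"div"] &&
               [[attributeDict objectForKey:@"class"] isEqualToString:@"FP_Up_Item2"]) {
        _depth = 1;   // entered the target div
        _collected = [NSMutableString string];
    }
}

- (void)parser:(NSXMLParser *)parser didEndElement:(NSString *)elementName
  namespaceURI:(NSString *)namespaceURI qualifiedName:(NSString *)qName
{
    if (_depth > 0 && --_depth == 0)
        _done = YES;   // matching close tag reached; ignore the rest
}

- (void)parser:(NSXMLParser *)parser foundCharacters:(NSString *)string
{
    if (_depth > 0)
        [_collected appendString:string];
}
@end

// Returns the collected text, or nil if the parse fails or the div is absent.
static NSString *TextInsideItem(NSString *xhtml)
{
    NSData *data = [xhtml dataUsingEncoding:NSUTF8StringEncoding];
    NSXMLParser *parser = [[NSXMLParser alloc] initWithData:data];
    ItemTextCollector *collector = [[ItemTextCollector alloc] init];
    [parser setDelegate:collector];
    return [parser parse] ? collector.text : nil;
}
```

To pick out individual fields such as the date or the score, the same depth bookkeeping could be keyed on the inner class names instead.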
 

idelovski

macrumors regular
Sep 11, 2008
235
0
By the way, I don't think this method will work for you.

You seem to be right. He will need to rework that method so it counts how many nested <div> tags there are, so he can use the right part of the content string between the first tag and its matching closing tag.

Or, switch to xml parser or something similar.
 

idelovski

macrumors regular
Sep 11, 2008
235
0
kingsapo,

maybe you solved this, maybe you didn't. In the meantime I gave it a second thought and came up with this:

Code:
- (NSRange)rangeAfterString:(NSString *)inString
   bySkippingNestedOpenTags:(NSString *)openTagStr
          toStartOfCloseTag:(NSString *)closeTagStr
{
   size_t    strLength = [self length];
   size_t    foundLocation = 0, tagSearchLocation = 0;
   
   int       nestedOpenTagCnt = 0;
   
   NSRange   startStrRange = NSMakeRange (0, 0);
   NSRange   endStrRange   = NSMakeRange (strLength, 0);  // if no end string, end here
   NSRange   closingSearchRange, nestedSearchRange;
   NSRange   resultRange;
   
   if (inString)  {
      startStrRange = [self rangeOfString:inString options:0 range:NSMakeRange(0, strLength)];
      if (startStrRange.location == NSNotFound)
         return (startStrRange);	// not found
      foundLocation = NSMaxRange (startStrRange);
      tagSearchLocation = foundLocation;
      nestedOpenTagCnt = 1;
   }
   
   do  {
      closingSearchRange = NSMakeRange (foundLocation, strLength - foundLocation);
      
      if (closeTagStr)  {
         endStrRange = [self rangeOfString:closeTagStr options:0 range:closingSearchRange];
         if (endStrRange.location == NSNotFound)
            return (endStrRange);	// not found
         nestedOpenTagCnt--;
         foundLocation = endStrRange.location + [closeTagStr length];
      }
      
      if (openTagStr)  {
         // Look for a nested open tag only in front of the close tag just found;
         // an open tag after it belongs to a later element.
         nestedSearchRange = NSMakeRange(tagSearchLocation, endStrRange.location - tagSearchLocation);
         nestedSearchRange = [self rangeOfString:openTagStr options:0 range:nestedSearchRange];
         if (nestedSearchRange.location != NSNotFound)  {
            nestedOpenTagCnt++;	// found a nested open tag
            tagSearchLocation = nestedSearchRange.location + [openTagStr length];
         }
      }
   } while (nestedOpenTagCnt > 0);

   size_t  rangeLoc = startStrRange.location + [inString length];
   size_t  rangeLen = NSMaxRange (endStrRange) - rangeLoc - [closeTagStr length];
   
   resultRange = NSMakeRange (rangeLoc, rangeLen);
   
   return (resultRange);
}

It appears to be working; I just tested it with your example. You can play with it some more and report back.

This is how you use it:

Code:
NSString  *test = @"<div class=\"FP_Up_Item2\">\
<div class=\"FP_Up_Date\">\
<b><span style='color:#199600; font-weight:bold;'>01 Aug 2010</span></b> <br />  <span style='color:#a3a3a3;'>out now</span>\
</div>\
<div class=\"FP_Up_ImageWrap\"><img src=\"http://m2.n4g.com/8/GameProfiles/125000/125674_2_cov_med.jpg\" /></div>\
<div class=\"FP_Up_TextWrap\" style=\"width: 300px;\">\
<a href=\"/PC/ReleaseDate-125674.aspx\" style=\"color:#95581f\"><b>Alter Ego</b></a> <br /><span style=\"font-weight: normal; color: #a3a3a3; font-size: 10pt;\">(PC)</span><br />\
</div>\
<div class='FP_Up_Score'><a href='/PC/ReleaseDate-125674.aspx?ShowReviews=1'>5.9</a></div>\
</div>";


NSRange   urlRange = [test rangeAfterString:@"<div class=\"FP_Up_Item2\">"
                      bySkippingNestedOpenTags:@"<div"
                      toStartOfCloseTag:@"</div>"];
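if (urlRange.location != NSNotFound)  {   // skip when either tag is missing
   NSLog (@"The range: %lu/%lu", (unsigned long)urlRange.location, (unsigned long)urlRange.length);
   NSLog (@"%@", [test substringWithRange:urlRange]);
}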
 

kingsapo

macrumors member
Original poster
Jun 10, 2010
58
4
Alright, I really want to thank everyone for posting all the great replies. I really REALLY wanna thank idelovski for posting a solution! :) Anyway, I have actually taken a break from this project, but when I start back up in a couple days I'll definitely try what idelovski did and report back to you guys! :D
 

kingsapo

macrumors member
Original poster
Jun 10, 2010
58
4
Hello all! Just wanted to post a reply after trying idelovski's solution, which I am happy to say completely works! I want to thank all of you for your great replies, and idelovski for posting a working solution.
 

ranguvar

macrumors 6502
Sep 18, 2009
318
2
The solution you're currently using looks a lot like a primitive use of regular expressions. That's the Cthulhu Way.

You really should use a parser for this. I have a program that does exactly what you want to do. I'm using libxml2 SAX HTML parsing to extract the sections and then the libxml2 DOM features (through this wrapper) to extract the data from the sections.
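
As an illustration only (this is not the program mentioned above, and it uses libxml2's tree-building HTML parser plus XPath rather than SAX), a sketch might look like the following. The helper name FirstTextForXPath is hypothetical. It assumes the project links against libxml2 (-lxml2) and has the libxml2 headers on its search path. libxml2's HTML parser is built to recover from markup that is not well-formed XML.

```objc
#import <Foundation/Foundation.h>
#import <libxml/HTMLparser.h>
#import <libxml/xpath.h>

// Hypothetical helper: returns the text content of the first node matching
// the XPath expression, or nil if parsing fails or nothing matches.
static NSString *FirstTextForXPath(NSString *html, NSString *xpath)
{
    NSData *data = [html dataUsingEncoding:NSUTF8StringEncoding];

    // Parse leniently: recover from errors and keep libxml2 quiet about them.
    htmlDocPtr doc = htmlReadMemory([data bytes], (int)[data length], NULL, "UTF-8",
                                    HTML_PARSE_RECOVER | HTML_PARSE_NOERROR | HTML_PARSE_NOWARNING);
    if (doc == NULL)
        return nil;

    NSString *result = nil;
    xmlXPathContextPtr ctx = xmlXPathNewContext(doc);
    xmlXPathObjectPtr  obj = ctx ? xmlXPathEvalExpression((const xmlChar *)[xpath UTF8String], ctx) : NULL;

    if (obj && obj->nodesetval && obj->nodesetval->nodeNr > 0) {
        // Text of the node and all of its descendants; must be released with xmlFree.
        xmlChar *content = xmlNodeGetContent(obj->nodesetval->nodeTab[0]);
        if (content) {
            result = [NSString stringWithUTF8String:(const char *)content];
            xmlFree(content);
        }
    }

    if (obj) xmlXPathFreeObject(obj);
    if (ctx) xmlXPathFreeContext(ctx);
    xmlFreeDoc(doc);
    return result;
}
```

With the HTML from the first post, an expression such as //div[@class='FP_Up_Score']/a would be the way to ask for the score.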
 

kingsapo

macrumors member
Original poster
Jun 10, 2010
58
4
Thanks ranguvar, I do have one question though. I don't know a lot about HTML, but I'm pretty sure that XML is like a strict version of HTML, where HTML could have extra spaces or be missing closing tags and still work (please correct me if I'm wrong). So I guess my question is, with libxml2, would it work even if the document I'm parsing wasn't perfect XML?
 