Searching An HTML NSString




kingsapo
Aug 14, 2010, 12:34 AM
Hello all! So I have written a program that points to a specified URL, downloads the HTML from the URL and stores it in a variable. Now I am trying to search the HTML variable, and more precisely, search between two HTML tags (I think that's what they're called) and extract everything between them. In this example, I am trying to copy everything from <div class="FP_Up_Item2"> to its matching </div> into its own variable.

<div class="FP_Up_Item2">

<div class="FP_Up_Date">
<b><span style='color:#199600; font-weight:bold;'>01 Aug 2010</span></b> <br /> <span style='color:#a3a3a3;'>out now</span>
</div>

<div class="FP_Up_ImageWrap"><img src="http://m2.n4g.com/8/GameProfiles/125000/125674_2_cov_med.jpg" /></div>
<div class="FP_Up_TextWrap" style="width: 300px;">
<a href="/PC/ReleaseDate-125674.aspx" style="color:#95581f"><b>Alter Ego</b></a> <br /><span style="font-weight: normal; color: #a3a3a3; font-size: 10pt;">(PC)</span><br />
</div>

<div class='FP_Up_Score'><a href='/PC/ReleaseDate-125674.aspx?ShowReviews=1'>5.9</a></div>

</div>

So far, I have found something called NSScanner, though I haven't really found a way to implement this yet. I'm pretty sure there's another way to do this, though, so if anyone could give any insight, it would really help! (By the way, as the title implies, the HTML from above is stored in an NSString variable.)



PhoneyDeveloper
Aug 14, 2010, 09:00 AM
I'm not an expert on this but here's what I think I know. The BEST and most CORRECT way to parse HTML is to use one of the XML parsers. That gives you a semantically correct parse of the input HTML and should give valid results.

You can also parse HTML with simple NSString methods, NSScanner or regular expressions. Depending on how reliable your input HTML is, it may be simple or difficult to parse it correctly with any of these string manipulation methods.
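For reference, here is a minimal sketch of the NSScanner route mentioned above. The helper name TextBetween is made up for illustration, and it only handles the simple case: it stops at the first end tag it sees, so it does not cope with nested elements of the same kind.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper (not from the thread): returns the text between the
// first occurrence of startTag and the next occurrence of endTag, or nil
// if either tag is missing.
static NSString *TextBetween(NSString *html, NSString *startTag, NSString *endTag)
{
    NSScanner *scanner = [NSScanner scannerWithString:html];
    [scanner setCharactersToBeSkipped:nil];   // do not silently drop whitespace

    // Skip everything before the start tag, then consume the tag itself.
    // scanUpToString: returns NO when the tag is at the very beginning,
    // which is fine, so its result is ignored here.
    [scanner scanUpToString:startTag intoString:NULL];
    if (![scanner scanString:startTag intoString:NULL])
        return nil;                           // start tag not found

    // Collect everything up to the end tag and make sure the tag is really there.
    NSString *body = nil;
    [scanner scanUpToString:endTag intoString:&body];
    if (![scanner scanString:endTag intoString:NULL])
        return nil;                           // end tag not found

    return body ? body : @"";                 // empty when the tags are adjacent
}
```

Called as TextBetween(html, @"<b>", @"</b>"), it returns whatever sits between the first <b> and the next </b>.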

I would look into the xml parsers. Knowing how to do this is a good skill to have.

robbieduncan
Aug 14, 2010, 09:14 AM
The problem with using XML parsers on HTML is that HTML 4.0 documents may not be valid XML. XHTML documents will be, and so will HTML 5 documents, but only if they are written in the XML (XHTML) syntax.

idelovski
Aug 14, 2010, 09:51 AM
Jeff LaMarche wrote a small demo application (http://iphonedevelopment.blogspot.com/2010/05/downloading-images-for-table-without.html) that downloads images from www.deviantart.com and there he parses html to extract locations of those images. You can download the project from his blog where you can read the whole story behind it (even comments by the visitors are very informative.)

Anyway, here's the method that extracts the substring between the tags. This version is modified a bit. As I was reading his source I wanted to make the code more obvious and a bit easier to read. You can use this one, or you can find the original on his blog.

@interface NSString (search)
- (NSRange)rangeAfterString:(NSString *)inString1 toStartOfString:(NSString *)inString2;
@end

@implementation NSString (search)

// Returns the range between the end of inString1 and the start of the next inString2.
// A nil inString1 means "from the start", a nil inString2 means "to the end".
// If a string is not found, the returned location is NSNotFound.
- (NSRange)rangeAfterString:(NSString *)inString1 toStartOfString:(NSString *)inString2
{
    NSUInteger strLength = [self length];
    NSUInteger foundLocation = 0;

    NSRange startStrRange = NSMakeRange (0, 0);
    NSRange endStrRange   = NSMakeRange (strLength, 0);  // if no end string, end here
    NSRange finalSearchRange;
    NSRange resultRange;

    if (inString1) {
        startStrRange = [self rangeOfString:inString1 options:0 range:NSMakeRange(0, strLength)];
        if (startStrRange.location == NSNotFound) {
            return (startStrRange);  // not found
        }
        foundLocation = NSMaxRange (startStrRange);
    }

    // Look for the end string only after the start string
    finalSearchRange = NSMakeRange (foundLocation, strLength - foundLocation);

    if (inString2) {
        endStrRange = [self rangeOfString:inString2 options:0 range:finalSearchRange];
        if (endStrRange.location == NSNotFound) {
            return (endStrRange);  // not found
        }
    }

    NSUInteger rangeLoc = startStrRange.location + [inString1 length];
    NSUInteger rangeLen = NSMaxRange (endStrRange) - rangeLoc - [inString2 length];

    resultRange = NSMakeRange (rangeLoc, rangeLen);

    return (resultRange);
}

@end

kingsapo
Aug 14, 2010, 01:44 PM
idelovski, thanks for the response, it really helped. I think I understand what that person did: they searched for the tag they were looking for, marked its location, searched for the ending tag after it, marked its location too, then grabbed everything between locations a and b. After looking through the program, though, I have failed to find anything that defines what tag, or what text at all, he is telling the program to look for. So I guess I wanted to know, do you know where in the program those tags are defined, or how exactly he managed to search for the image links without defining anything to search for?

robbieduncan
Aug 14, 2010, 01:46 PM
The start/end tags would be the parameters to the method supplied.

chown33
Aug 14, 2010, 02:55 PM
Search all the source in the project for rangeAfterString. You'll find it in one file other than the NSString-search files. That's where it's used:
NSRange urlRange = [payload rangeAfterString:START_TOKEN toStartOfString:STOP_TOKEN];


Clearly, START_TOKEN and STOP_TOKEN are symbolic names (could be variables or could be #defined). So search all the source in the project for START_TOKEN.


By the way, I don't think this method will work for you. You have nested div elements in your HTML, so searching for the first </div> after finding "<div class="FP_Up_Item2">" won't work. You have to search for the MATCHING </div>, which is a harder task. You could probably write a more complex parser using the rangeAfterString method, but the logic of doing that is at least as complex as learning how to write an NSXMLParserDelegate.
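To make the nesting problem concrete, here is a small sketch of the "first </div> wins" approach. The helper name NaiveDivBody and the sample markup are illustrative only, not taken from anyone's project.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: everything from the end of openTag up to the FIRST
// "</div>" that follows it, or nil if either piece is missing.
static NSString *NaiveDivBody(NSString *html, NSString *openTag)
{
    NSRange open = [html rangeOfString:openTag];
    if (open.location == NSNotFound)
        return nil;

    NSUInteger from = NSMaxRange(open);
    NSRange close = [html rangeOfString:@"</div>"
                                options:0
                                  range:NSMakeRange(from, [html length] - from)];
    if (close.location == NSNotFound)
        return nil;

    // This is where nesting goes wrong: the first close tag may belong
    // to an inner div rather than to the one we opened.
    return [html substringWithRange:NSMakeRange(from, close.location - from)];
}
```

Given <div class="outer"><div class="inner">A</div>B</div> and the outer open tag, this returns <div class="inner">A and loses the trailing B, because the first </div> closes the inner div.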

Your best bet is probably going to be an NSXMLParser, with one caveat. The HTML may not have perfectly nested elements, so it may or may not be well-formed XML. NSXMLParser doesn't validate against a DTD, so that kind of compliance won't matter, but it does stop and report an error to its NSXMLParserDelegate when the markup isn't well-formed (an unclosed tag, for example). If the page passes that test, you can simply use the delegate to handle <>-enclosed tags and other text. Knowing how to do this is a useful skill. Don't expect it to be simple.
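A rough sketch of what such a delegate could look like, assuming ARC and input that is well-formed XML with a single root element. The class name ItemCollector and the function ItemText are invented for this example. It tracks how deep it is inside nested divs and collects only the text, not the markup.

```objc
#import <Foundation/Foundation.h>

// Hypothetical delegate: gathers the text inside the first
// <div class="FP_Up_Item2">, counting nested <div> elements on the way.
@interface ItemCollector : NSObject <NSXMLParserDelegate>
@property (nonatomic, strong) NSMutableString *text;
@property (nonatomic, assign) NSInteger depth;    // 0 = outside the target div
@property (nonatomic, assign) BOOL finished;
@end

@implementation ItemCollector

- (void)parser:(NSXMLParser *)parser didStartElement:(NSString *)elementName
  namespaceURI:(NSString *)namespaceURI qualifiedName:(NSString *)qName
    attributes:(NSDictionary *)attributeDict
{
    if (self.finished || ![elementName isEqualToString:@"div"])
        return;
    if (self.depth > 0) {
        self.depth = self.depth + 1;              // a div nested in the target
    } else if ([[attributeDict objectForKey:@"class"] isEqualToString:@"FP_Up_Item2"]) {
        self.depth = 1;                           // entered the target div
        self.text = [NSMutableString string];
    }
}

- (void)parser:(NSXMLParser *)parser didEndElement:(NSString *)elementName
  namespaceURI:(NSString *)namespaceURI qualifiedName:(NSString *)qName
{
    if (self.depth > 0 && [elementName isEqualToString:@"div"]) {
        self.depth = self.depth - 1;
        if (self.depth == 0)
            self.finished = YES;                  // the MATCHING </div>
    }
}

- (void)parser:(NSXMLParser *)parser foundCharacters:(NSString *)string
{
    if (self.depth > 0)
        [self.text appendString:string];          // text may arrive in pieces
}

@end

// Returns the collected text, or nil if the target div was never completed.
static NSString *ItemText(NSString *xhtml)
{
    NSData *data = [xhtml dataUsingEncoding:NSUTF8StringEncoding];
    NSXMLParser *parser = [[NSXMLParser alloc] initWithData:data];
    ItemCollector *collector = [[ItemCollector alloc] init];
    [parser setDelegate:collector];
    [parser parse];                               // returns NO on a parse error
    return collector.finished ? collector.text : nil;
}
```

Counting depth in the delegate is what solves the matching-</div> problem here: the parser reports every start and end tag in order, so the code never has to search for a close tag itself.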

idelovski
Aug 14, 2010, 05:12 PM
chown33 wrote: "By the way, I don't think this method will work for you."

You seem to be right. He will need to rework that method so it counts how many nested <div> tags there are, so he can use the right part of the content string between the first tag and its matching closing tag.

Or switch to an XML parser or something similar.

idelovski
Aug 15, 2010, 02:25 PM
kingsapo,

maybe you solved this, maybe you didn't. In the meantime I gave it a second thought and came up with this:

// Like rangeAfterString:toStartOfString:, but every openTagStr found inside the
// element requires one more closeTagStr before the match is complete.
- (NSRange)rangeAfterString:(NSString *)inString
   bySkippingNestedOpenTags:(NSString *)openTagStr
          toStartOfCloseTag:(NSString *)closeTagStr
{
    NSUInteger strLength = [self length];
    NSUInteger foundLocation = 0, tagSearchLocation = 0;

    int nestedOpenTagCnt = 0;

    NSRange startStrRange = NSMakeRange (0, 0);
    NSRange endStrRange   = NSMakeRange (strLength, 0);  // if no end string, end here
    NSRange closingSearchRange, nestedSearchRange;
    NSRange resultRange;

    if (inString) {
        startStrRange = [self rangeOfString:inString options:0 range:NSMakeRange(0, strLength)];
        if (startStrRange.location == NSNotFound)
            return (startStrRange);  // not found
        foundLocation = NSMaxRange (startStrRange);
        tagSearchLocation = foundLocation;
        nestedOpenTagCnt = 1;
    }

    do {
        closingSearchRange = NSMakeRange (foundLocation, strLength - foundLocation);

        if (!closeTagStr)
            break;  // no close tag given: the result runs to the end of the string

        endStrRange = [self rangeOfString:closeTagStr options:0 range:closingSearchRange];
        if (endStrRange.location == NSNotFound)
            return (endStrRange);  // not found
        nestedOpenTagCnt--;
        foundLocation = NSMaxRange (endStrRange);

        if (openTagStr) {
            // Only look for an open tag BEFORE the close tag just found. Searching to
            // the end of the string would count later sibling tags as nested ones.
            nestedSearchRange = NSMakeRange (tagSearchLocation, endStrRange.location - tagSearchLocation);
            nestedSearchRange = [self rangeOfString:openTagStr options:0 range:nestedSearchRange];
            if (nestedSearchRange.location != NSNotFound) {
                nestedOpenTagCnt++;  // nested open tag, so one more close tag is needed
                tagSearchLocation = NSMaxRange (nestedSearchRange);
            }
        }
    } while (nestedOpenTagCnt > 0);

    NSUInteger rangeLoc = startStrRange.location + [inString length];
    NSUInteger rangeLen = NSMaxRange (endStrRange) - rangeLoc - [closeTagStr length];

    resultRange = NSMakeRange (rangeLoc, rangeLen);

    return (resultRange);
}

It appears to be working; I just tested it with your example. You can play with it some more and report back.

This is how you use it:

NSString *test = @"<div class=\"FP_Up_Item2\">\
<div class=\"FP_Up_Date\">\
<b><span style='color:#199600; font-weight:bold;'>01 Aug 2010</span></b> <br /> <span style='color:#a3a3a3;'>out now</span>\
</div>\
<div class=\"FP_Up_ImageWrap\"><img src=\"http://m2.n4g.com/8/GameProfiles/125000/125674_2_cov_med.jpg\" /></div>\
<div class=\"FP_Up_TextWrap\" style=\"width: 300px;\">\
<a href=\"/PC/ReleaseDate-125674.aspx\" style=\"color:#95581f\"><b>Alter Ego</b></a> <br /><span style=\"font-weight: normal; color: #a3a3a3; font-size: 10pt;\">(PC)</span><br />\
</div>\
<div class='FP_Up_Score'><a href='/PC/ReleaseDate-125674.aspx?ShowReviews=1'>5.9</a></div>\
</div>";


NSRange urlRange = [test rangeAfterString:@"<div class=\"FP_Up_Item2\">"
bySkippingNestedOpenTags:@"<div"
toStartOfCloseTag:@"</div>"];
NSLog (@"The range: %lu/%lu", (unsigned long)urlRange.location, (unsigned long)urlRange.length);
NSLog (@"%@", [test substringWithRange:urlRange]);

kingsapo
Aug 15, 2010, 03:28 PM
Alright, I really want to thank everyone for posting all the great replies. I really REALLY wanna thank idelovski for posting a solution! :) Anyway, I have actually taken a break from this project, but when I start back up in a couple days I'll definitely try what idelovski did and report back to you guys! :D

kingsapo
Aug 20, 2010, 07:14 PM
Hello all! Just wanted to post a reply after trying idelovski's solution, which I am happy to say completely works! I want to thank all of you for your great replies, and idelovski for posting a working solution.

ranguvar
Aug 21, 2010, 02:26 AM
The solution you're currently using looks a lot like a primitive use of regular expressions. That's the Cthulhu Way (http://www.codinghorror.com/blog/2009/11/parsing-html-the-cthulhu-way.html).

You really should use a parser for this. I have a program that does exactly what you want to do. I'm using libxml2 SAX HTML parsing to extract the sections and then the libxml2 DOM features (through this (http://cocoawithlove.com/2008/10/using-libxml2-for-parsing-and-xpath.html) wrapper) to extract the data from the sections.
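As a rough illustration of the libxml2 route (not the program described above, and not the linked wrapper), the sketch below uses libxml2's HTML parser, which is built to cope with markup that is not well-formed XML, followed by an XPath query. The helper name FirstMatchText is made up. Building it is assumed to need the libxml2 headers on the include path and linking with -lxml2.

```objc
#import <Foundation/Foundation.h>
#import <libxml/HTMLparser.h>
#import <libxml/xpath.h>

// Hypothetical helper: text content of the first node matching an XPath
// query, or nil if the document cannot be parsed or nothing matches.
static NSString *FirstMatchText(NSString *html, NSString *query)
{
    NSData *data = [html dataUsingEncoding:NSUTF8StringEncoding];

    // HTML_PARSE_RECOVER asks the parser to keep going on sloppy markup;
    // the other two flags just silence its console output.
    htmlDocPtr doc = htmlReadMemory([data bytes], (int)[data length], NULL, "UTF-8",
                                    HTML_PARSE_RECOVER | HTML_PARSE_NOERROR | HTML_PARSE_NOWARNING);
    if (doc == NULL)
        return nil;

    NSString *text = nil;
    xmlXPathContextPtr ctx = xmlXPathNewContext(doc);
    xmlXPathObjectPtr result = NULL;
    if (ctx)
        result = xmlXPathEvalExpression((const xmlChar *)[query UTF8String], ctx);

    if (result && result->nodesetval && result->nodesetval->nodeNr > 0) {
        // Concatenated text of the node and all of its descendants.
        xmlChar *content = xmlNodeGetContent(result->nodesetval->nodeTab[0]);
        if (content) {
            text = [NSString stringWithUTF8String:(const char *)content];
            xmlFree(content);
        }
    }

    // libxml2 objects are plain C and must be released by hand.
    if (result) xmlXPathFreeObject(result);
    if (ctx) xmlXPathFreeContext(ctx);
    xmlFreeDoc(doc);
    return text;
}
```

For the markup in this thread the query would be //div[@class='FP_Up_Item2']. Because the parser builds a tree, the nested divs need no special handling.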

kingsapo
Aug 23, 2010, 05:17 PM
thanks ranguvar, I do have one question though. I don't know a lot about HTML, but I'm pretty sure that XML is like a perfect version of HTML, where HTML could have extra spaces or not have closing tags and have it still work (please correct me if I'm wrong). So I guess my question is, with libxml2, would it work even if the document I'm parsing wasn't perfect XML?