Not sure what environment you are in. sed is always there for me in the darkest hour of need.
$ sed -e 's/<[^>]*>//g' /path/to/file
textutil -convert txt -output foo.txt foo.html
Sorry, I should have made it clear. I need to do this using only Objective-C and Cocoa. It will actually be on the iPhone, but I figured this was more of a general Cocoa question, so I put it in Mac Programming.
+ (NSString *)extractTextFromXML:(NSString *)xml {
    // Will hold just the text
    NSMutableString *text = [NSMutableString string];
    NSUInteger startOfSubstring = 0;
    // Finds first instance of "<"
    NSRange startTagRange = [xml rangeOfString:@"<"];
    while (startTagRange.location != NSNotFound) {
        // Extracts text from last location up to "<"
        NSString *substring = [xml substringWithRange:NSMakeRange(startOfSubstring, startTagRange.location - startOfSubstring)];
        // Removes leading and trailing whitespace from substring
        // (note: this also drops the space between words separated by a tag)
        [text appendString:[substring stringByTrimmingCharactersInSet:[NSCharacterSet whitespaceCharacterSet]]];
        // Searches for ">" from "<" to end of string
        NSRange startTagToEndRange = NSMakeRange(startTagRange.location, [xml length] - startTagRange.location);
        NSRange endTagRange = [xml rangeOfString:@">" options:0 range:startTagToEndRange];
        // If ">" found, then sets next location of substring to after that
        if (endTagRange.location != NSNotFound) {
            startOfSubstring = endTagRange.location + 1;
        }
        // If no ">", then appends rest of string (from "<" on) and returns
        else {
            [text appendString:[xml substringFromIndex:startTagRange.location]];
            return text;
        }
        // Finds next "<" in string
        NSRange endTagToEndRange = NSMakeRange(startOfSubstring, [xml length] - startOfSubstring);
        startTagRange = [xml rangeOfString:@"<" options:0 range:endTagToEndRange];
    }
    // Appends any text after the last tag (or the whole string if there were no tags)
    [text appendString:[[xml substringFromIndex:startOfSubstring] stringByTrimmingCharactersInSet:[NSCharacterSet whitespaceCharacterSet]]];
    return text;
}
I'm doing some similar text processing in an app I'm writing, and I'm using the RegexKit framework (actually the Lite version):
http://regexkit.sourceforge.net/
I love it. It's really easy to use, because it uses categories to add methods to NS*String and NS*Array classes, so you don't have to deal with funky regex objects or anything.
Dave
It's BSD. The RegexKitLite version doesn't actually give you anything other than an extra class, which contains the additional methods on NS*String and NS*Array. I also just add the linker flag "-licucore" to get the RKL to work. It works great. 🙂
Dave
That's all fine and dandy, but you can't use frameworks in an iPhone application, which the OP has said he is intending this to be. It is possible that he could take the source code for that and compile the files he needs directly into the app, of course. The only issue might be licensing. What is the license of RegexKit?
I don't know your application, but if you are reading HTML you have to allow for broken, syntactically invalid HTML. At least make sure you don't go into an infinite loop or crash.
I just worked this out on paper. You can do this in plain old C without using any libraries; it takes about 12 lines of code. I think you guys are working too hard at this. Make a for loop that loops over the string from left to right. When you see a "<", set in_tag to 1.
If in_tag is set, remove the current character from the string. If the character just removed is a ">", reset in_tag. That will work most of the time; you have to watch for escaped angle brackets.
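For what it's worth, here is a rough sketch of that loop in plain C. The function name strip_tags is just something I made up for illustration, and I'm assuming the input is a writable, NUL-terminated string that can be edited in place. Instead of deleting characters one at a time, it copies the characters to keep toward the front of the buffer, which amounts to the same thing. An unterminated "<" swallows the rest of the string, so broken HTML can't send it into an infinite loop, and escaped brackets such as &lt; are left exactly as they are (decoding entities would be a separate step).

```c
#include <stddef.h>

/* Removes everything from '<' through the matching '>' from s, in place.
   Returns s (or NULL if s is NULL). A '>' that appears outside a tag is kept.
   If a tag is never closed, the rest of the string is dropped. */
char *strip_tags(char *s)
{
    if (s == NULL) {
        return NULL;
    }

    int in_tag = 0;   /* 1 while we are between '<' and '>' */
    char *out = s;    /* next position to write a kept character */

    for (const char *in = s; *in != '\0'; ++in) {
        if (*in == '<') {
            in_tag = 1;            /* start of a tag: stop copying */
        } else if (*in == '>' && in_tag) {
            in_tag = 0;            /* end of a tag: resume copying */
        } else if (!in_tag) {
            *out++ = *in;          /* ordinary text: keep it */
        }
    }
    *out = '\0';
    return s;
}
```

To use it, call it on a modifiable buffer (not a string literal), for example a char array holding "<p>Hello <b>world</b></p>", which ends up as "Hello world". Since the output is never longer than the input, no extra memory is needed.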