View Full Version : Remove all HTML tags in a string




ace2600
Aug 23, 2008, 11:34 PM
Hi,

How would I efficiently remove all HTML tags from an NS(Mutable)String?

For example:
<h1>Header</h1><p>Hello world</p>
Would become:
Header Hello world.



wintergreen
Aug 24, 2008, 01:25 AM
Not sure what environment you are in. sed is always there for me in the darkest hour of need.

# cat /path/to/file | sed -e 's/<[^>]*>//g'

hhas
Aug 24, 2008, 03:04 AM
More robust command line solution (10.4+):

textutil -convert txt -output foo.txt foo.html

Also take a look at the TextEdit source at /Developer/Examples/AppKit/TextEdit

ace2600
Aug 24, 2008, 08:44 AM
Sorry, I should have made it clear. I need to do this using only Objective-C and Cocoa. It will actually be on the iPhone, but I figured this was more of a general Cocoa question, so I put it in Mac Programming.

Soulstorm
Aug 24, 2008, 08:55 AM
You will need to implement regular expression support, and learn about regular expressions. I have written a good article about that here (http://soulstorm-creations.com/index.php?option=com_content&task=view&id=49&Itemid=39)
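
As a rough illustration of the regular-expression route, here is a minimal sketch in plain C that applies the same pattern as the sed one-liner earlier in the thread, using the POSIX regex.h API. It is not taken from the linked article; the function name strip_tags_regex is made up for this example, and it assumes a platform where regex.h is available. Like the sed version, it knows nothing about scripts, comments, or entities.

```c
#include <regex.h>
#include <stdlib.h>
#include <string.h>

/* Returns a newly allocated copy of html with every match of <[^>]*>
   removed, or NULL on failure. The caller frees the result.
   strip_tags_regex is a hypothetical name used only for this sketch. */
char *strip_tags_regex(const char *html)
{
    regex_t re;
    if (regcomp(&re, "<[^>]*>", 0) != 0)    /* POSIX basic regex, as in sed */
        return NULL;

    char *out = malloc(strlen(html) + 1);   /* result is never longer than input */
    if (out == NULL) {
        regfree(&re);
        return NULL;
    }

    size_t len = 0;
    const char *p = html;
    regmatch_t m;
    while (regexec(&re, p, 1, &m, 0) == 0) {
        memcpy(out + len, p, (size_t)m.rm_so);  /* keep the text before the tag */
        len += (size_t)m.rm_so;
        p += m.rm_eo;                           /* continue after the tag */
    }

    size_t rest = strlen(p);                    /* text after the last tag */
    memcpy(out + len, p, rest);
    out[len + rest] = '\0';

    regfree(&re);
    return out;
}
```

Called on the example from the first post, this returns "HeaderHello world": the tags are dropped but nothing is inserted between the header and the paragraph, so any spacing has to be handled separately.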

Sayer
Aug 24, 2008, 10:07 AM
Asking Google usually gives better results:

http://www.google.com/search?client=safari&rls=en-us&q=NSString+strip+HTML+tags&ie=UTF-8&oe=UTF-8

Check the very first link.

hhas
Aug 24, 2008, 02:02 PM
TextEdit is written in Cocoa. Look at the source for ideas. Check the WebKit API to see if there's anything useful there. Do a web search (http://www.google.com/search?q=cocoa+convert+html+to+text) for other suggestions. Assuming you want the resulting plain text reasonably formatted, you'll need to use some sort of HTML parser; just stripping tags with a regex won't cut it.

HTH

kainjow
Aug 24, 2008, 03:39 PM
If it's eventually going to end up in an iPhone app, you should post it in the iPhone section. Even though the iPhone uses Cocoa, it's still a very different environment from the Mac, and you are much more limited in what you can do. For example, command-line utilities, the WebKit API, and even NSAttributedString are all good suggestions, but they are all unavailable on the iPhone.

ace2600
Aug 25, 2008, 11:06 PM
Thanks everyone for the help. I looked at the methods mentioned and did more searching, but most did not work on the iPhone platform. Next time I will post something like this to that forum.

I tried using XMLParser first, but it failed often with malformed HTML. I tried a couple of other ways with direct string manipulation. I ended up with the approach below. I had to trim the text between tags because I ended up with lots of whitespace. The code below is definitely not the most efficient and I'm not too proud of it, but it seems to work.

+ (NSString *)extractTextFromXML:(NSString *)xml {
    // Will hold just the text
    NSMutableString *text = [NSMutableString string];
    NSCharacterSet *whitespace = [NSCharacterSet whitespaceCharacterSet];
    NSInteger startOfSubstring = 0;
    // Finds the first instance of "<"
    NSRange startTagRange = [xml rangeOfString:@"<"];
    while (startTagRange.location != NSNotFound) {
        // Extracts the text from the last location up to "<"
        NSString *substring = [xml substringWithRange:NSMakeRange(startOfSubstring, startTagRange.location - startOfSubstring)];
        // Removes leading and trailing whitespace from the substring
        [text appendString:[substring stringByTrimmingCharactersInSet:whitespace]];

        // Searches for ">" from "<" to the end of the string
        NSRange startTagToEndRange = NSMakeRange(startTagRange.location, [xml length] - startTagRange.location);
        NSRange endTagRange = [xml rangeOfString:@">" options:0 range:startTagToEndRange];
        // If there is no ">", appends the rest of the string and returns
        if (endTagRange.location == NSNotFound) {
            [text appendString:[xml substringFromIndex:startTagRange.location]];
            return text;
        }
        // Otherwise the next substring starts just after ">"
        startOfSubstring = endTagRange.location + 1;

        // Finds the next "<" in the string
        NSRange endTagToEndRange = NSMakeRange(startOfSubstring, [xml length] - startOfSubstring);
        startTagRange = [xml rangeOfString:@"<" options:0 range:endTagToEndRange];
    }

    // Appends any text that follows the last tag
    [text appendString:[[xml substringFromIndex:startOfSubstring] stringByTrimmingCharactersInSet:whitespace]];
    return text;
}

lee1210
Aug 25, 2008, 11:24 PM
Erm, what about embedded JavaScript or CSS? What if a tag has a property that contains a > in quotes? Not to be a black cloud, just trying to point things out.

-Lee
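
To make those two objections concrete, here is a minimal sketch in plain C of a scanner that ignores a > inside a quoted attribute value and drops the bodies of script and style elements. All names (strip_markup, raw_end_tag, starts_with_ci) are invented for this example. It is still not an HTML parser: comments are not recognized, and a tag with an unbalanced quote will swallow the text up to the next matching quote.

```c
#include <ctype.h>
#include <stddef.h>

/* Case-insensitive test: does p start with prefix (given in lower case)? */
static int starts_with_ci(const char *p, const char *prefix)
{
    while (*prefix != '\0') {
        if (tolower((unsigned char)*p) != *prefix)
            return 0;
        p++;
        prefix++;
    }
    return 1;
}

/* If p points at a <script ...> or <style ...> start tag, returns the
   prefix of the matching end tag; otherwise returns NULL. */
static const char *raw_end_tag(const char *p)
{
    if (starts_with_ci(p, "<script") && !isalnum((unsigned char)p[7]))
        return "</script";
    if (starts_with_ci(p, "<style") && !isalnum((unsigned char)p[6]))
        return "</style";
    return NULL;
}

/* Strips tags from s in place (hypothetical helper for this sketch). */
void strip_markup(char *s)
{
    char *dst = s;
    const char *src = s;
    while (*src != '\0') {
        if (*src != '<') {              /* ordinary text: keep it */
            *dst++ = *src++;
            continue;
        }
        const char *raw = raw_end_tag(src);

        /* Skip to the '>' that ends this tag, ignoring '>' inside quotes. */
        char quote = '\0';
        for (src++; *src != '\0'; src++) {
            if (quote != '\0') {
                if (*src == quote)
                    quote = '\0';
            } else if (*src == '"' || *src == '\'') {
                quote = *src;
            } else if (*src == '>') {
                src++;
                break;
            }
        }

        /* For script/style, also drop the body up to the end tag. */
        if (raw != NULL) {
            while (*src != '\0' && !starts_with_ci(src, raw))
                src++;
        }
    }
    *dst = '\0';
}
```

With this, a buffer holding `<a title="x > y">link</a>` is reduced to `link`, where a plain "skip to the next >" approach would leave ` y">link` behind.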

davedelong
Aug 26, 2008, 10:08 AM
I'm doing some similar text processing in an app I'm writing, and I'm using the RegexKit framework (actually the Lite version):

http://regexkit.sourceforge.net/

I love it. It's really easy to use, because it uses categories to add methods to NS*String and NS*Array classes, so you don't have to deal with funky regex objects or anything.

Dave

robbieduncan
Aug 26, 2008, 10:22 AM
That's all fine and dandy, but you can't use frameworks in an iPhone application, which the OP has said this is intended to be. He could, of course, take the source code and compile the files he needs directly into the app. The only issue might be licensing. What is the license of RegexKit?

davedelong
Aug 26, 2008, 10:25 AM
It's BSD. The RegexKitLite version doesn't actually give you anything other than an extra class, which contains the additional methods on NS*String and NS*Array. I also just add the linker flag "-licucore" to get the RKL to work. It works great. :)

Dave

robbieduncan
Aug 26, 2008, 10:26 AM
That sounds like it would work fine then :)

ChrisA
Aug 26, 2008, 10:52 AM
I don't know your application but if you are reading HTML you have to allow for broken, syntactically invalid HTML. At least make sure you don't go off in some infinite loop or crash.

I just worked this out on paper. You can do this in plain old C without using any libraries; it takes about 12 lines of code. I think you guys are working too hard at this. Make a for loop that loops over the string from left to right. When you see a "<", set in_tag to 1. If in_tag is set, remove the current character from the string. If the character just removed is a ">", reset in_tag. That will work most of the time; you have to watch for escaped angle brackets.
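
The loop described above might look roughly like the following sketch. The function name strip_tags is made up here, and instead of deleting characters one at a time it copies the characters to keep toward the front of the same buffer, which has the same effect without repeated shifting. It makes a single pass and always terminates, even on unclosed tags, but it has no notion of quoting or of script and style content.

```c
/* Removes everything from each '<' through the next '>' (inclusive),
   in place. An unclosed '<' removes the rest of the string. */
void strip_tags(char *s)
{
    int in_tag = 0;
    char *dst = s;                       /* next position to write to */
    for (const char *src = s; *src != '\0'; src++) {
        if (*src == '<')
            in_tag = 1;                  /* a tag starts here */
        if (!in_tag)
            *dst++ = *src;               /* outside a tag: keep the character */
        else if (*src == '>')
            in_tag = 0;                  /* the tag ends; drop the '>' as well */
    }
    *dst = '\0';
}
```

Applied to a buffer containing `<h1>Header</h1><p>Hello world</p>`, this leaves `HeaderHello world`.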

lee1210
Aug 26, 2008, 11:04 AM
That would be nice, if not for JavaScript, CSS, chevrons in quotes in properties, etc. Just stripping tags is easy. Essentially what the OP needs is a light HTML parser. I'm betting they also want &lt; to show as <, etc.

-Lee
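
For the entity part, a minimal sketch in plain C could look like the following. The function name decode_entities and the choice of entities are assumptions made for this example; it handles only five named entities and ignores numeric references such as &#60;. It should run after the tags have been stripped, so that a decoded < is not mistaken for the start of a tag.

```c
#include <stddef.h>
#include <string.h>

/* Decodes a few named character entities in place (hypothetical helper). */
void decode_entities(char *s)
{
    static const struct { const char *name; char ch; } table[] = {
        { "&lt;", '<' }, { "&gt;", '>' }, { "&amp;", '&' },
        { "&quot;", '"' }, { "&apos;", '\'' },
    };
    const size_t n = sizeof table / sizeof table[0];

    char *dst = s;
    const char *src = s;
    while (*src != '\0') {
        size_t i;
        for (i = 0; i < n; i++) {
            size_t len = strlen(table[i].name);
            if (strncmp(src, table[i].name, len) == 0) {
                *dst++ = table[i].ch;    /* replace the entity with its character */
                src += len;
                break;
            }
        }
        if (i == n)                      /* no entity here: copy the character */
            *dst++ = *src++;
    }
    *dst = '\0';
}
```

Because the input is scanned once, `&amp;lt;` becomes `&lt;` rather than being decoded twice into `<`.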