Remove all HTML tags in a string

Discussion in 'Mac Programming' started by ace2600, Aug 23, 2008.

  1. macrumors member

    Joined:
    Mar 16, 2008
    Location:
    Austin, Texas
    #1
    Hi,

    How would I efficiently remove all HTML tags from an NS(Mutable)String?

    For example:
    Code:
    <h1>Header</h1><p>Hello world</p>
    Would become:
    Code:
    Header Hello world
     
  2. macrumors newbie

    Joined:
    Jun 30, 2007
    #2
    Not sure what environment you are in. sed is always there for me in the darkest hour of need.

    # cat /path/to/file | sed -e 's/<[^>]*>//g'
     
  3. macrumors regular

    Joined:
    Oct 15, 2007
    #3
    More robust command line solution (10.4+):

    Code:
    textutil -convert txt -output foo.txt foo.html
    Also take a look at the TextEdit source at /Developer/Examples/AppKit/TextEdit
     
  4. thread starter macrumors member

    Joined:
    Mar 16, 2008
    Location:
    Austin, Texas
    #4
    Sorry, I should have made it clear. I need to do this only using Objective-C and Cocoa. It will actually be on the iPhone, but I figured this was more of a general Cocoa question so put it in Mac Programming.
     
  5. macrumors 68000

    Soulstorm

    Joined:
    Feb 1, 2005
    #5
    You will need to implement regular expression support, and learn about regular expressions. I have written a good article about that here
     
  6. macrumors 6502a

    Sayer

    Joined:
    Jan 4, 2002
    Location:
    Austin, TX
    #6
  7. macrumors regular

    Joined:
    Oct 15, 2007
    #7
    TextEdit is written in Cocoa. Look at the source for ideas. Check the WebKit API to see if there's anything useful there. Do a web search for other suggestions. Assuming you want the resulting plain text reasonably formatted, you'll need to use some sort of HTML parser; just stripping tags with a regex won't cut it.

    HTH
     
  8. Moderator emeritus

    kainjow

    Joined:
    Jun 15, 2000
    #8
    If it's eventually going to end up in an iPhone app, you should post it in the iPhone section: even though the iPhone uses Cocoa, it's still a very different environment from the Mac, and you are much more limited in what you can do. For example, the suggestions to use command-line utilities, the WebKit API, or even NSAttributedString are all good, but none of them is available on the iPhone.
     
  9. thread starter macrumors member

    Joined:
    Mar 16, 2008
    Location:
    Austin, Texas
    #9
    Thanks everyone for the help. I looked at the methods mentioned and did more searching, but most did not work on the iPhone platform. Next time I will post something like this to that forum.

    I tried using XMLParser first, but it failed often with malformed HTML. I tried a couple other ways with direct string manipulation. I ended up with the approach below. I had to trim the text between tags because I ended up with lots of whitespace. The code below is definitely not the most efficient and I'm not too proud of it, but it seems to work.
    Code:
    + (NSString *)extractTextFromXML:(NSString *)xml {
        // Will hold just the text
        NSMutableString *text = [NSMutableString string];
        NSInteger startOfSubstring = 0;

        // Finds first instance of "<"
        NSRange startTagRange = [xml rangeOfString:@"<"];
        while (startTagRange.location != NSNotFound) {
            // Extracts text from last location up to "<"
            NSString *substring = [xml substringWithRange:NSMakeRange(startOfSubstring, startTagRange.location - startOfSubstring)];
            // Removes whitespace from substring
            [text appendString:[substring stringByTrimmingCharactersInSet:[NSCharacterSet whitespaceCharacterSet]]];

            // Searches for ">" from "<" to end of string
            NSRange startTagToEndRange = NSMakeRange(startTagRange.location, [xml length] - startTagRange.location);
            NSRange endTagRange = [xml rangeOfString:@">" options:NSCaseInsensitiveSearch range:startTagToEndRange];

            // If ">" found, then sets next location of substring to after that
            if (endTagRange.location != NSNotFound) {
                startOfSubstring = endTagRange.location + 1;
            }
            // If no ">", then appends rest of string and returns
            else {
                [text appendString:[xml substringFromIndex:startTagRange.location]];
                return text;
            }

            // Finds next "<" in string
            NSRange endTagToEndRange = NSMakeRange(startOfSubstring, [xml length] - startOfSubstring);
            startTagRange = [xml rangeOfString:@"<" options:NSCaseInsensitiveSearch range:endTagToEndRange];
        }

        // Appends any text left after the last ">" (the loop stops before reaching it)
        [text appendString:[[xml substringFromIndex:startOfSubstring] stringByTrimmingCharactersInSet:[NSCharacterSet whitespaceCharacterSet]]];
        return text;
    }
     
  10. macrumors 68040

    lee1210

    Joined:
    Jan 10, 2005
    Location:
    Dallas, TX
    #10
    Erm, what about embedded JavaScript or CSS? What if a tag has a property that contains a > in quotes? Not to be a black cloud, just trying to point things out.

    -Lee
     
  11. macrumors member

    davedelong

    Joined:
    Sep 9, 2007
    Location:
    Right here.
    #11
    I'm doing some similar text processing in an app I'm writing, and I'm using the RegexKit framework (actually the Lite version):

    http://regexkit.sourceforge.net/

    I love it. It's really easy to use, because it uses categories to add methods to NS*String and NS*Array classes, so you don't have to deal with funky regex objects or anything.

    Dave
     
  12. Moderator emeritus

    robbieduncan

    Joined:
    Jul 24, 2002
    Location:
    London
    #12
    That's all fine and dandy, but you can't use frameworks in an iPhone application, which the OP has said he intends this to be. He could, of course, take the source code and compile the files he needs directly into the app. The only issue might be licensing. What is the license of RegexKit?
     
  13. macrumors member

    davedelong

    Joined:
    Sep 9, 2007
    Location:
    Right here.
    #13
    It's BSD. The RegexKitLite version doesn't actually give you anything other than an extra class, which contains the additional methods on NS*String and NS*Array. I also just add the linker flag "-licucore" to get the RKL to work. It works great. :)

    Dave
     
  14. Moderator emeritus

    robbieduncan

    Joined:
    Jul 24, 2002
    Location:
    London
    #14
    That sounds like it would work fine then :)
     
  15. macrumors G4

    Joined:
    Jan 5, 2006
    Location:
    Redondo Beach, California
    #15
    I don't know your application but if you are reading HTML you have to allow for broken, syntactically invalid HTML. At least make sure you don't go off in some infinite loop or crash.

    I just worked this out on paper. You can do this in plain old C without using any libraries; it takes about 12 lines of code. I think you guys are working too hard at this. Make a for loop that loops over the string from left to right. When you see a "<", set in_tag to 1.
    If in_tag is set, remove the current character from the string. If the character just removed is a ">", reset in_tag. That will work most of the time; you have to watch for escaped angle brackets.
     
  16. macrumors 68040

    lee1210

    Joined:
    Jan 10, 2005
    Location:
    Dallas, TX
    #16
    That would be nice, if not for JavaScript, CSS, chevrons in quotes in properties, etc. Just stripping tags is easy. Essentially what the OP needs is a light HTML parser. I'm betting they also want &lt; to show as <, etc.

    -Lee
     
