Unable to strip excess newline characters from a string or an array :/

Discussion in 'iPhone/iPad Programming' started by chrono1081, Feb 14, 2012.

  1. macrumors 604

    chrono1081

    Joined:
    Jan 26, 2008
    Location:
    Isla Nublar
    #1
    Hi guys,

    I have information I pulled from a webpage. I stripped the HTML from it, so now it is just a string and looks like this (it's long, so this is only part of it):

    Code:
     
         Beginner   
       
       
         Upper Mambo Alley 
         Yes 
         Yes 
       
       
         Lower Mambo Alley 
         Yes 
         Yes 
       
       
         Snow Drop - Beginner's Area 
         Yes 
         Yes 
    
    The problem is, I am trying to get it to look like this:

    Code:
         Upper Mambo Alley 
         Yes 
         Yes 
         Lower Mambo Alley 
         Yes 
         Yes  
         Snow Drop - Beginner's Area 
         Yes 
         Yes 
    
    For some reason this seems like an impossible task in Objective-C. Here are the things I have tried before posting here:

    1. I tried using NSScanner to scan for two consecutive newline characters. No luck.

    2. I tried using stringByReplacingOccurrencesOfString:@"\n\n" withString:@"".

    3. I tried reading the string into an array using NSArray *testContents = [strippedSiteData componentsSeparatedByString:@"\n"];, converting it to an NSMutableArray, and then comparing the contents and removing any array member that was a newline character.

    Nothing seems to be working.

    I am guessing one of two things: either my comparison statement is wrong (where the code says "This doesn't work"), or it's something other than newline characters that is in these strings or in the array created.

    If anyone can give me a heads up to what is wrong it would be greatly appreciated. Here is my code:

    Code:
    #import <Foundation/Foundation.h>
    
    //Function Prototypes
    NSString *stripHTML(NSString *html);
    NSString *removeRandomTags(NSString *html);
    
    
    
    int main (int argc, const char * argv[])
    {
    
        @autoreleasepool {
            
            //Create URL
            NSURL *url = [NSURL URLWithString:@"http://www.blueknob.com/winter/conditions.php"];
            
            //Request information from website
            NSURLRequest *request = [NSURLRequest requestWithURL:url];
            NSError *error = nil;
            NSData *data = [NSURLConnection sendSynchronousRequest:request returningResponse:NULL error:&error];
            
            //Check if data was read
            if(!data)
            {
                NSLog(@"Request failed %@", [error localizedDescription]);
                return 1;
            }
            
            //Convert NSData object to an NSString
            NSString *siteData = [[NSString alloc] initWithData:data encoding:NSUTF8StringEncoding];
            
            //Strip HTML tags and random HTML tags
            NSString *strippedSiteData = [[NSString alloc] initWithString:stripHTML(siteData)];
            strippedSiteData = removeRandomTags(strippedSiteData);
            
            //Check output
            NSLog(@"%@", strippedSiteData);
            
            //Create an array based on data
            NSArray *testContents = [strippedSiteData componentsSeparatedByString:@"\n"];
            NSMutableArray *contents = [NSMutableArray arrayWithArray:testContents];
            
            //Attempt to remove any objects that are only newline characters
            for(int i = 0; i < [contents count]; ++i)
            {
                if([contents objectAtIndex:i] == @"\n") //This doesn't work
                    [contents removeObjectAtIndex:i];
            }
            
            //Print contents
            for(NSString *s in contents)
            {
                NSLog(@"%@", s);
            }
        }
        return 0;
    }
    
    NSString *stripHTML(NSString *html)
    {
        //Scan the string and strip out the HTML from it
        NSScanner *scanner = [NSScanner scannerWithString:html];
        NSString *text = nil;
        
        while([scanner isAtEnd] == NO)
        {
            //Beginning of a tag
            [scanner scanUpToString:@"<" intoString:nil];
            
            //End of a tag
            [scanner scanUpToString:@">" intoString:&text];
            
            //Replace the found tag with a space
            html = [html stringByReplacingOccurrencesOfString:[NSString stringWithFormat:@"%@>", text] withString:@" "];
        }
        
        return html;
    }
    
    NSString *removeRandomTags(NSString *html)
    {
        NSScanner *scanner = [NSScanner scannerWithString:html];
        NSString *text = nil;
        
        while([scanner isAtEnd] == NO)
        {
            //Beginning of a tag
            [scanner scanUpToString:@"&" intoString:nil];
            
            //End of a tag
            [scanner scanUpToString:@";" intoString:&text];
            
            //Replace the found tag with nothing
            html = [html stringByReplacingOccurrencesOfString:[NSString stringWithFormat:@"%@;", text] withString:@""];
        }
        
        return html;
    }
    
     
  2. Moderator emeritus

    robbieduncan

    Joined:
    Jul 24, 2002
    Location:
    London
    #2
    Are you sure they are \n characters? There are other return characters (\r for example).
     
  3. thread starter macrumors 604

    chrono1081

    Joined:
    Jan 26, 2008
    Location:
    Isla Nublar
    #3
    I'm not sure : / I have tried to find ways online to view which characters they are but I haven't been successful in that yet.

    I know there is a way, and I used to do it long ago in C and C++. I just haven't come across it yet (and I forget how to do it).
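
    One way to see which characters are really in the string is to walk it and print the code point of anything that counts as whitespace or a newline. The following is only a sketch; the helper name describeWhitespace is made up for this example and does not appear in the thread:

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper (not from the thread): returns a copy of the input in
// which every whitespace or newline character is shown as its hex code
// point, e.g. <000A> for \n and <000D> for \r.
NSString *describeWhitespace(NSString *input)
{
    NSCharacterSet *blank = [NSCharacterSet whitespaceAndNewlineCharacterSet];
    NSMutableString *result = [NSMutableString string];

    for (NSUInteger i = 0; i < [input length]; i++)
    {
        unichar c = [input characterAtIndex:i];
        if ([blank characterIsMember:c])
            [result appendFormat:@"<%04X>", (unsigned int)c]; // make it visible
        else
            [result appendFormat:@"%C", c];                   // copy unchanged
    }
    return result;
}
```

    Logging describeWhitespace(strippedSiteData) would then show whether the gaps are made of <000A>, <000D>, <0020> (plain spaces), or something else entirely.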
     
  4. Moderator emeritus

    robbieduncan

    Joined:
    Jul 24, 2002
    Location:
    London
    #4
    You see this line:

    Code:
    if([contents objectAtIndex:i] == @"\n") //This doesn't work
    That will never ever work, even if the object is a string containing \n. Why? Because NSString instances are objects, so == compares the pointer addresses, not whether the strings contain the same characters. That's what the NSString isEqualToString: method is for.
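
    A minimal sketch of the array approach with a content comparison instead of ==. Two assumptions are baked in: the helper name nonBlankLines is invented for this example, and a line counts as blank when nothing is left after trimming whitespace. Note that componentsSeparatedByString:@"\n" removes the separators, so no element is ever equal to @"\n"; blank lines arrive as empty or whitespace-only strings (possibly with a stray \r).

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: splits on "\n" and keeps only the lines that still
// have content once surrounding whitespace (including "\r") is trimmed.
NSArray *nonBlankLines(NSString *text)
{
    NSCharacterSet *blank = [NSCharacterSet whitespaceAndNewlineCharacterSet];
    NSMutableArray *kept = [NSMutableArray array];

    for (NSString *line in [text componentsSeparatedByString:@"\n"])
    {
        NSString *trimmed = [line stringByTrimmingCharactersInSet:blank];

        // Compare string contents with isEqualToString:, never with ==.
        if (![trimmed isEqualToString:@""])
            [kept addObject:line]; // keep the original, untrimmed line
    }
    return kept;
}
```

    Building a new array also sidesteps a second problem in the original loop: calling removeObjectAtIndex: while counting upward shifts the remaining elements down, so the element right after each removed one is never examined.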
     
  5. KnightWRX, Feb 14, 2012
    Last edited: Feb 14, 2012

    macrumors Pentium

    KnightWRX

    Joined:
    Jan 28, 2009
    Location:
    Quebec, Canada
    #5
    https://developer.apple.com/library...ularExpression_Class/Reference/Reference.html

    Code:
    
    NSError *error = nil; // filled in by the initializer only if the pattern is invalid
    NSRegularExpression * regexp = [[NSRegularExpression alloc] initWithPattern: @"^\s+$" options: 0 error: &error ];
    NSString * strippedandpurifiedsitedata = [regexp stringByReplacingMatchesInString: strippedSiteData 
               options: 0
               range: NSMakeRange(0, [strippedSiteData length])
               withTemplate: @""];
    
    ?

    The regular expression might not be up to snuff, but that's as easy as it gets.

    BTW, you shouldn't capitalize any characters in your variable names, that only applies to methods following the guidelines.
     
  6. thread starter macrumors 604

    chrono1081

    Joined:
    Jan 26, 2008
    Location:
    Isla Nublar
    #6
    Doh! I didn't even think of that :eek: Thanks for the clarification.

    Thanks for the tips :) I never even heard of NSRegularExpression so I will take a look at it and see how to work with it.
     
  7. KnightWRX, Feb 14, 2012
    Last edited: Feb 14, 2012

    macrumors Pentium

    KnightWRX

    Joined:
    Jan 28, 2009
    Location:
    Quebec, Canada
    #7
    Ok, I think I got a good pattern down with "^\s+$"

    That one should do it. Should being the key word. Regexps are powerful but a pain sometimes. ;)

    Maybe because it's brand new and shiny :

    Apple finally joins the Unix world, a couple of decades late. ;)
     
  8. thread starter macrumors 604

    chrono1081

    Joined:
    Jan 26, 2008
    Location:
    Isla Nublar
    #8
    A pain may be an understatement ;) I am still pretty confused reading through the documentation.

    Am I using this correctly? My string still looks the same :/ I was getting a warning from the compiler that \s was an unrecognized escape sequence (even though it is shown on the page you referred me to), so I figured maybe it needed an extra backslash and added one. I don't know if that may have screwed it up or not.

    Here is my revamped code. (I didn't get a chance to change the variable names yet):

    Code:
    #import <Foundation/Foundation.h>
    
    //Function Prototypes
    NSString *stripHTML(NSString *html);
    NSString *removeRandomTags(NSString *html);
    
    
    
    int main (int argc, const char * argv[])
    {
    
        @autoreleasepool {
            
            //Create URL
            NSURL *url = [NSURL URLWithString:@"http://www.blueknob.com/winter/conditions.php"];
            
            //Request information from website
            NSURLRequest *request = [NSURLRequest requestWithURL:url];
            NSError *error = nil;
            NSData *data = [NSURLConnection sendSynchronousRequest:request returningResponse:NULL error:&error];
            
            //Check if data was read
            if(!data)
            {
                NSLog(@"Request failed %@", [error localizedDescription]);
                return 1;
            }
            
            //Convert NSData object to an NSString
            NSString *siteData = [[NSString alloc] initWithData:data encoding:NSUTF8StringEncoding];
            
            //Strip HTML tags and random HTML tags
            NSString *strippedSiteData = [[NSString alloc] initWithString:stripHTML(siteData)];
            strippedSiteData = removeRandomTags(strippedSiteData);
            
            //Check output
            NSLog(@"%@", strippedSiteData);
            
            //Create an array based on data
            NSArray *testContents = [strippedSiteData componentsSeparatedByString:@"\n"];
            __unused NSMutableArray *contents = [NSMutableArray arrayWithArray:testContents];
    
            NSRegularExpression *regexp = [[NSRegularExpression alloc] initWithPattern: @"^\\s+$" options: 0 error: &error ];
            NSString * strippedandpurifiedsitedata = [regexp stringByReplacingMatchesInString: strippedSiteData 
                                                                                      options: 0
                                                                                        range: NSMakeRange(0, [strippedSiteData length])
                                                                                 withTemplate: @""];
    
            NSLog(@"%@", strippedandpurifiedsitedata);
        }
        return 0;
    }
    
    NSString *stripHTML(NSString *html)
    {
        //Scan the string and strip out the HTML from it
        NSScanner *scanner = [NSScanner scannerWithString:html];
        NSString *text = nil;
        
        while([scanner isAtEnd] == NO)
        {
            //Beginning of a tag
            [scanner scanUpToString:@"<" intoString:nil];
            
            //End of a tag
            [scanner scanUpToString:@">" intoString:&text];
            
            //Replace the found tag with a space
            html = [html stringByReplacingOccurrencesOfString:[NSString stringWithFormat:@"%@>", text] withString:@" "];
        }
        
        return html;
    }
    
    NSString *removeRandomTags(NSString *html)
    {
        NSScanner *scanner = [NSScanner scannerWithString:html];
        NSString *text = nil;
        
        while([scanner isAtEnd] == NO)
        {
            //Beginning of a tag
            [scanner scanUpToString:@"&" intoString:nil];
            
            //End of a tag
            [scanner scanUpToString:@";" intoString:&text];
            
            //Replace the found tag with nothing
            html = [html stringByReplacingOccurrencesOfString:[NSString stringWithFormat:@"%@;", text] withString:@""];
        }
        
        return html;
    }
    
     
  9. macrumors Pentium

    KnightWRX

    Joined:
    Jan 28, 2009
    Location:
    Quebec, Canada
    #9
    I know I used it recently in some code that I got working. Let me try to dig it up instead of just writing it by hand.

    EDIT : found it, unfortunately, mine is used to validate that a string is a representation of a hex number (0xAFF2 for example) :

    Code:
        NSError * error;
        NSRange range = { 0, [self length] };
        NSRegularExpression * regex = [[NSRegularExpression alloc] initWithPattern:@"\\A0x[0-9a-f]+\\z" options: NSRegularExpressionCaseInsensitive error: &error];
    
        if([regex numberOfMatchesInString: self options:0 range: range])
        {
                 ...
        }
    
    That's how I did mine.

    You could start with NSLog()ing the result from [regex numberOfMatchesInString:options:range:] though, that could give you a big clue to see if it's finding anything. Then once you're matching stuff, go with the replace function.
     
  10. Moderator

    dejo

    Staff Member

    Joined:
    Sep 2, 2004
    Location:
    The Centennial State
    #10
    Which guidelines state that? Are you saying
    Code:
    NSString *hostName;
    is bad form?
     
  11. macrumors Pentium

    KnightWRX

    Joined:
    Jan 28, 2009
    Location:
    Quebec, Canada
    #11
    Dunno, I had read that somewhere, but looking through Apple's guidelines, I can't find it, in fact it says the opposite :

    https://developer.apple.com/library...cs.html#//apple_ref/doc/uid/20001281-BBCHBFAH

    I used to capitalize the letters on the 2nd and subsequent words and had stopped because of something I had read. I guess it was wrong information.
     
  12. macrumors 603

    Joined:
    Aug 9, 2009
    #12
    It's hard to decipheranyinherentmeaningwhenmultiplewordsruntogether. So this is bad:
    Code:
    NSString * strippedandpurifiedsitedata;
    
    but either of these is much more readable:
    Code:
    NSString * strippedAndPurifiedSiteData;
    NSString * stripped_and_purified_site_data;
    The latter is a long-standing C convention. Some C programmers get upset by CamelCase variable names (or function names, or anything other than _t in typedefs).

    Short names don't necessarily need the same rules. Example:
    Code:
    NSRegularExpression * regex;
    
    It's when the names get long and consist of many words stuck together that problems arise, e.g.
    Code:
    NSRegularExpression * regularexpression;
    NSRegularExpression * regularimpression;
    NSRegularExpression * regularexpanssion;
    
     
  13. KnightWRX, Feb 14, 2012
    Last edited: Feb 14, 2012

    macrumors Pentium

    KnightWRX

    Joined:
    Jan 28, 2009
    Location:
    Quebec, Canada
    #13
    Figured it out for you. Basically, since the string is a multiline input, we need to tell NSRegularExpression to treat it as a multi-line text rather than a single long line. That's what the NSRegularExpressionOptions are for, when we create the regexp. This code works and does what you want :

    Code:
    int main (int argc, const char * argv[])
    {
    
        @autoreleasepool {
            NSError * error = nil;
            NSString * stringData = [[NSString alloc] initWithContentsOfFile: @"/path/to/file/named/taggeddata.html" encoding: NSUTF8StringEncoding error: &error];
            if(!stringData) // check the return value; error is only meaningful on failure
            {
                NSLog(@"%@", [error description]);
                exit(EXIT_FAILURE);
            }
            
            NSString * strippedStringData = [[NSString alloc] initWithString: stripHTML(stringData)];
            strippedStringData = removeRandomTags(strippedStringData);
            
            NSLog(@"String contains : \n%@", strippedStringData);
            
            NSRegularExpression * regexp = [[NSRegularExpression alloc] initWithPattern: @"^\\s+$" options: NSRegularExpressionAnchorsMatchLines error: &error];
            
            NSLog(@"Found %lu matches", [[regexp matchesInString: strippedStringData options: 0 range: NSMakeRange(0, [strippedStringData length])] count]);
            if([[regexp matchesInString: strippedStringData options: 0 range: NSMakeRange(0, [strippedStringData length])] count])
            {
                NSMutableString * strippedAndPurifiedString = [[NSMutableString alloc] initWithString: strippedStringData];
                [regexp replaceMatchesInString: strippedAndPurifiedString options: 0 range: NSMakeRange(0, [strippedStringData length]) withTemplate: @""];
                NSLog(@"New String contains :\n %@", strippedAndPurifiedString);
            }
            
        }
        return EXIT_SUCCESS;
    }
    We don't specify options during matching, but during the initialization of the regexp, we add in NSRegularExpressionAnchorsMatchLines. Then my earlier regexp of simply "^\s+$" (or @"^\\s+$" in Objective-C notation) works.
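
    To see the effect of the option in isolation, here is a small sketch that counts matches of the same pattern with and without NSRegularExpressionAnchorsMatchLines. The helper name countBlankRuns and the sample string in the note underneath are assumptions made for illustration only. The pattern is written with a double backslash because the compiler consumes one level of escaping in string literals before the regular expression engine ever sees it.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: counts matches of ^\s+$ in text, with or without
// NSRegularExpressionAnchorsMatchLines.
NSUInteger countBlankRuns(NSString *text, BOOL anchorsMatchLines)
{
    NSRegularExpressionOptions options =
        anchorsMatchLines ? NSRegularExpressionAnchorsMatchLines : 0;
    NSError *error = nil;

    // @"^\\s+$" in source code is the five-character pattern ^\s+$ at run time.
    NSRegularExpression *regexp =
        [NSRegularExpression regularExpressionWithPattern:@"^\\s+$"
                                                  options:options
                                                    error:&error];
    if (!regexp)
    {
        NSLog(@"Bad pattern: %@", [error localizedDescription]);
        return 0;
    }

    return [regexp numberOfMatchesInString:text
                                   options:0
                                     range:NSMakeRange(0, [text length])];
}
```

    For the three-line string @"A\n  \nB", countBlankRuns(text, NO) finds 0 matches, because ^ and $ only anchor at the ends of the whole string, while countBlankRuns(text, YES) finds 1, the whitespace-only middle line.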

    I love these things. So powerful, once you got them sorted out.
     
  14. KnightWRX, Feb 14, 2012
    Last edited: Feb 15, 2012

    macrumors Pentium

    KnightWRX

    Joined:
    Jan 28, 2009
    Location:
    Quebec, Canada
    #14
    Rewrote the code without stripHTML() and removeblahblahtags() :

    Code:
    int main (int argc, const char * argv[])
    {
    
        @autoreleasepool {
            NSError * error = nil;
            NSMutableString * stringData = [[NSMutableString alloc] initWithContentsOfFile: @"/path/to/taggeddata.html" encoding: NSUTF8StringEncoding error: &error];
            NSUInteger matches;
            
            if(!stringData) // check the return value; error is only meaningful on failure
            {
                NSLog(@"%@", [error description]);
                exit(EXIT_FAILURE);
            }
            
            NSRegularExpression * regexp = [[NSRegularExpression alloc] initWithPattern: @"</*[a-z]+\\s*[a-z]*=*\"*[0-9a-z]*\"*>" options: NSRegularExpressionCaseInsensitive error: &error];
            matches = [[regexp matchesInString: stringData options: 0 range: NSMakeRange(0, [stringData length])] count];
            if(matches)
            {
                NSLog(@"Matched %lu tags, let's strip 'em", matches);
                [regexp replaceMatchesInString: stringData options: 0 range:  NSMakeRange(0, [stringData length]) withTemplate: @""];
            }
            
            regexp = [[NSRegularExpression alloc] initWithPattern: @"^\\s*$" options: NSRegularExpressionAnchorsMatchLines error: &error];
            matches = [[regexp matchesInString: stringData options: 0 range: NSMakeRange(0, [stringData length])] count];
            
            if(matches)
            {
                NSLog(@"Found %lu matches", matches);
                [regexp replaceMatchesInString: stringData options: 0 range: NSMakeRange(0, [stringData length]) withTemplate: @""];
                NSLog(@"New String contains :\n %@", stringData);
            }
            
        }
        return EXIT_SUCCESS;
    }
    Result :

    Code:
    2012-02-14 20:40:24.163 regexptest[13592:707] Matched 26 tags, let's strip 'em
    2012-02-14 20:40:24.167 regexptest[13592:707] Found 9 matches
    2012-02-14 20:40:24.168 regexptest[13592:707] New String contains :
     
    Beginner
    
    Upper Mambo Alley 
    Yes
    Yes
    
    Lower Mambo Alley 
    Yes 
    Yes
    
    Snow Drop - Beginner's Area 
    Yes
    Yes
    Program ended with exit code: 0
    Doesn't seem too bad. Getting there, still a bit of work to do on that 2nd regular expression. See how powerful this crap is. ;) I basically replaced all your NSScanner stuff and multitude of strings with 1 NSMutableString and 2 RegularExpressions.

    And people say Lion sucks. :rolleyes:.

    EDIT :

    Works perfectly this morning, after removing some "over-thinking" and using NSMutableString's delete method (which is more akin to what you want) rather than replacing matches with empty strings :

    Code:
    #import <Foundation/Foundation.h>
    
    void removeMatchesFromString(NSArray *, NSMutableString *);
    
    int main (int argc, const char * argv[])
    {
    
        @autoreleasepool {
            NSError * error = nil;
            NSMutableString * stringData = [[NSMutableString alloc] initWithContentsOfFile: @"/path/to/taggeddata.html" encoding: NSUTF8StringEncoding error: &error];
            NSArray * matches;
            
            if(!stringData) // check the return value; error is only meaningful on failure
            {
                NSLog(@"%@", [error description]);
                exit(EXIT_FAILURE);
            }
            
            NSRegularExpression * regexp = [[NSRegularExpression alloc] initWithPattern: @"<[\\w\\d\\s\"=/]+>" options: NSRegularExpressionCaseInsensitive error: &error];
            matches = [regexp matchesInString: stringData options: 0 range: NSMakeRange(0, [stringData length])];
            
            NSLog(@"Matched %lu tags, let's strip 'em", [matches count]);
            removeMatchesFromString(matches, stringData);
            NSLog(@"After HTML strip : %@", stringData);
            
            regexp = [[NSRegularExpression alloc] initWithPattern: @"^\\s*$" options: NSRegularExpressionAnchorsMatchLines error: &error];
            matches = [regexp matchesInString: stringData options: 0 range: NSMakeRange(0, [stringData length])];
            
            NSLog(@"Matched %lu garbage, let's strip 'em", [matches count]);
            removeMatchesFromString(matches, stringData);
            NSLog(@"After purification : \n%@", stringData);
        }
        return EXIT_SUCCESS;
    }
    
    void removeMatchesFromString(NSArray * matches, NSMutableString * string)
    {
        NSInteger rangeoffset = 0;
        NSRange range;
        
        for(int i = 0; i < [matches count]; i++)
        {
            range = [[matches objectAtIndex: i] range];
            range.location -= rangeoffset;
            if(!range.length)
                range.length++;
            rangeoffset += range.length;
            [string deleteCharactersInRange: range];
        }
    }
    Code:
    2012-02-15 06:33:34.323 regexptest[14110:707] After purification : 
    Beginner
    Upper Mambo Alley 
    Yes
    Yes
    Lower Mambo Alley 
    Yes 
    Yes
    Snow Drop - Beginner's Area 
    Yes
    Yes
    Program ended with exit code: 0
    A good night's sleep does wonders for the mind. :D
     
  15. macrumors 68000

    Sydde

    Joined:
    Aug 17, 2009
    #15
    I am wondering, if you use a scanner and a character set from +newlineCharacterSet, would that not easily capture any type of returns? Just scan each line (-scanUpToCharactersFromSet:intoString:), append the string into a mutable string, then -scanCharactersFromSet:intoString: into nil to get to the next line.
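
    A rough sketch of that scanner idea, under a few assumptions: the helper name collapseBlankLines is invented here, whitespace-only lines are treated as blank, and every kept line is terminated with a plain \n regardless of which return characters the input used.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper sketching the scanner approach: copies every line that
// has visible content and drops empty or whitespace-only lines.
NSString *collapseBlankLines(NSString *text)
{
    NSCharacterSet *newlines = [NSCharacterSet newlineCharacterSet];
    NSCharacterSet *blank = [NSCharacterSet whitespaceAndNewlineCharacterSet];
    NSScanner *scanner = [NSScanner scannerWithString:text];

    // By default a scanner skips whitespace and newlines before each scan,
    // which would also eat the indentation. Turn that off.
    [scanner setCharactersToBeSkipped:nil];

    NSMutableString *result = [NSMutableString string];
    while (![scanner isAtEnd])
    {
        NSString *line = nil;

        // Read up to the next newline character of any kind (\n, \r, ...).
        if ([scanner scanUpToCharactersFromSet:newlines intoString:&line])
        {
            if ([[line stringByTrimmingCharactersInSet:blank] length] > 0)
                [result appendFormat:@"%@\n", line];
        }

        // Consume the whole run of newline characters before the next line.
        [scanner scanCharactersFromSet:newlines intoString:NULL];
    }
    return result;
}
```

    Because the newline character set covers \r as well as \n, this does not depend on knowing in advance which kind of return the web page used.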
     
