
chrono1081

macrumors G3
Original poster
Jan 26, 2008
8,456
4,164
Isla Nublar
Hi guys,

I have information I pulled from a webpage. I stripped the HTML from it, so now it is just a string that looks like this (it's long, so this is only part of it):

Code:
     Beginner   
   
   
     Upper Mambo Alley 
     Yes 
     Yes 
   
   
     Lower Mambo Alley 
     Yes 
     Yes 
   
   
     Snow Drop - Beginner's Area 
     Yes 
     Yes

The problem is, I am trying to get it to look like this:

Code:
     Upper Mambo Alley 
     Yes 
     Yes 
     Lower Mambo Alley 
     Yes 
     Yes  
     Snow Drop - Beginner's Area 
     Yes 
     Yes

For some reason this seems like an impossible task in Objective-C. Here are the things I have tried before posting here:

1. I tried using NSScanner to scan for two consecutive newline characters. No luck.

2. I tried using stringByReplacingOccurrencesOfString:@"\n\n" withString:@"".

3. I tried reading the string into an array using NSArray *testContents = [strippedSiteData componentsSeparatedByString:@"\n"];, converting it to an NSMutableArray, and then comparing the contents and removing any array member that was a newline character.

Nothing seems to be working.

I am guessing one of two things: either my comparison statement is wrong (where the code says "This doesn't work"), or it's something other than newline characters in these strings or in the array I created.

If anyone can give me a heads-up on what is wrong, it would be greatly appreciated. Here is my code:

Code:
#import <Foundation/Foundation.h>

//Function Prototypes
NSString *stripHTML(NSString *html);
NSString *removeRandomTags(NSString *html);



int main (int argc, const char * argv[])
{

    @autoreleasepool {
        
        //Create URL
        NSURL *url = [NSURL URLWithString:@"http://www.blueknob.com/winter/conditions.php"];
        
        //Request information from website
        NSURLRequest *request = [NSURLRequest requestWithURL:url];
        NSError *error = nil;
        NSData *data = [NSURLConnection sendSynchronousRequest:request returningResponse:NULL error:&error];
        
        //Check if data was read
        if(!data)
        {
            NSLog(@"Request failed %@", [error localizedDescription]);
            return 1;
        }
        
        //Convert NSData object to an NSString
        NSString *siteData = [[NSString alloc] initWithData:data encoding:NSUTF8StringEncoding];
        
        //Strip HTML tags and random HTML tags
        NSString *strippedSiteData = [[NSString alloc] initWithString:stripHTML(siteData)];
        strippedSiteData = removeRandomTags(strippedSiteData);
        
        //Check output
        NSLog(@"%@", strippedSiteData);
        
        //Create an array based on data
        NSArray *testContents = [strippedSiteData componentsSeparatedByString:@"\n"];
        NSMutableArray *contents = [NSMutableArray arrayWithArray:testContents];
        
        //Attempt to remove any objects that are only newline characters
        for(int i = 0; i < [contents count]; ++i)
        {
            if([contents objectAtIndex:i] == @"\n") //This doesn't work
                [contents removeObjectAtIndex:i];
        }
        
        //Print contents
        for(NSString *s in contents)
        {
            NSLog(@"%@", s);
        }
    }
    return 0;
}

NSString *stripHTML(NSString *html)
{
    //Scan the string and strip out the HTML from it
    NSScanner *scanner = [NSScanner scannerWithString:html];
    NSString *text = nil;
    
    while([scanner isAtEnd] == NO)
    {
        //Beginning of a tag
        [scanner scanUpToString:@"<" intoString:nil];
        
        //End of a tag
        [scanner scanUpToString:@">" intoString:&text];
        
        //Replace the found tag with a space
        html = [html stringByReplacingOccurrencesOfString:[NSString stringWithFormat:@"%@>", text] withString:@" "];
    }
    
    return html;
}

NSString *removeRandomTags(NSString *html)
{
    NSScanner *scanner = [NSScanner scannerWithString:html];
    NSString *text = nil;
    
    while([scanner isAtEnd] == NO)
    {
        //Beginning of a tag
        [scanner scanUpToString:@"&" intoString:nil];
        
        //End of a tag
        [scanner scanUpToString:@";" intoString:&text];
        
        //Replace the found tag with nothing
        html = [html stringByReplacingOccurrencesOfString:[NSString stringWithFormat:@"%@;", text] withString:@""];
    }
    
    return html;
}
 

chrono1081

macrumors G3
Original poster
Jan 26, 2008
8,456
4,164
Isla Nublar
Are you sure they are \n characters? There are other return characters (\r for example).

I'm not sure : / I have tried to find ways online to view which characters they are, but I haven't been successful yet.

I know there is a way, and I used to do it long ago in C and C++. I just haven't come across it yet (and I forget how to do it).
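
One way to see which characters are really in the string is to print the numeric code of each one. The sketch below is only an illustration: describeCharacters is a hypothetical helper name, not something from the code above, and it works on UTF-16 units rather than full code points.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper (not from the thread): returns one "U+XXXX" code per
// UTF-16 unit so invisible characters such as \r, \n, or tabs become visible.
static NSString *describeCharacters(NSString *text)
{
    NSMutableArray *codes = [NSMutableArray array];
    for (NSUInteger i = 0; i < [text length]; i++) {
        unichar c = [text characterAtIndex:i];
        [codes addObject:[NSString stringWithFormat:@"U+%04X", (unsigned int)c]];
    }
    return [codes componentsJoinedByString:@" "];
}
```

Calling NSLog(@"%@", describeCharacters(strippedSiteData)); on a short piece of the string should make it obvious what separates the lines: a line feed prints as U+000A, a carriage return as U+000D, a tab as U+0009, and a non-breaking space as U+00A0.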
 

robbieduncan

Moderator emeritus
Jul 24, 2002
25,611
893
Harrogate
You see this line:

Code:
if([contents objectAtIndex:i] == @"\n") //This doesn't work

That will never ever work, even if the object is a string containing \n. Why? Because NSString instances are objects. So you are comparing the pointer addresses. Not that the strings contain the same characters. That's what the NSString isEqualToString: method is for.
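
A minimal sketch of the content comparison, assuming the goal is to drop blank lines from the array. Two things are worth noting about the original loop: componentsSeparatedByString:@"\n" removes the separators, so no element is ever equal to @"\n" (blank lines come back as empty or whitespace-only strings), and removing an object at index i and then incrementing i skips the element that slides into that slot. The helper name nonBlankLines is made up for this example.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: splits on "\n" and keeps only lines that still
// contain something after trimming spaces, tabs, and stray "\r" characters.
static NSArray *nonBlankLines(NSString *text)
{
    NSCharacterSet *blank = [NSCharacterSet whitespaceAndNewlineCharacterSet];
    NSMutableArray *kept = [NSMutableArray array];
    for (NSString *line in [text componentsSeparatedByString:@"\n"]) {
        NSString *trimmed = [line stringByTrimmingCharactersInSet:blank];
        // isEqualToString: compares contents; == would compare pointers.
        if (![trimmed isEqualToString:@""]) {
            [kept addObject:trimmed];
        }
    }
    return kept;
}
```

Building a second array instead of removing from the one being looped over sidesteps the skipped-element problem. Note that this sketch also trims the leading and trailing spaces from the lines it keeps.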
 

KnightWRX

macrumors Pentium
Jan 28, 2009
15,046
4
Quebec, Canada
https://developer.apple.com/library...ularExpression_Class/Reference/Reference.html

Code:
NSError * error = nil;
NSRegularExpression * regexp = [[NSRegularExpression alloc] initWithPattern: @"^\s+$" options: 0 error: &error ];
NSString * strippedandpurifiedsitedata = [regexp stringByReplacingMatchesInString: strippedSiteData 
           options: 0
           range: NSMakeRange(0, [strippedSiteData length])
           withTemplate: @""];

?

The regular expression might not be up to snuff, but that's as easy as it gets.

BTW, you shouldn't capitalize any characters in your variable names, that only applies to methods following the guidelines.
 

chrono1081

macrumors G3
Original poster
Jan 26, 2008
8,456
4,164
Isla Nublar
You see this line:

Code:
if([contents objectAtIndex:i] == @"\n") //This doesn't work

That will never ever work, even if the object is a string containing \n. Why? Because NSString instances are objects. So you are comparing the pointer addresses. Not that the strings contain the same characters. That's what the NSString isEqualToString: method is for.

Doh! I didn't even think of that :eek: Thanks for the clarification.

https://developer.apple.com/library...ularExpression_Class/Reference/Reference.html

Code:
NSError * error = nil;
NSRegularExpression * regexp = [[NSRegularExpression alloc] initWithPattern: @"^\r*\n+" options: 0 error: &error ];
NSString * strippedandpurifiedsitedata = [regexp stringByReplacingMatchesInString: strippedSiteData 
           options: 0
           range: NSMakeRange(0, [strippedSiteData length])
           withTemplate: @""];

?

The regular expression might not be up to snuff, but that's as easy as it gets.

BTW, you shouldn't capitalize any characters in your variable names, that only applies to methods following the guidelines.

Thanks for the tips :) I never even heard of NSRegularExpression so I will take a look at it and see how to work with it.
 

KnightWRX

macrumors Pentium
Jan 28, 2009
15,046
4
Quebec, Canada
Ok, I think I got a good pattern down with "^\s+$"

That one should do it. Should being the key word. Regexps are powerful but a pain sometimes. ;)

Thanks for the tips :) I never even heard of NSRegularExpression

Maybe because it's brand new and shiny :

Availability
Available in Mac OS X v10.7 and later.
Availability
Available in iOS 4.0 and later.

Apple finally joins the Unix world, a couple of decades late. ;)
 

chrono1081

macrumors G3
Original poster
Jan 26, 2008
8,456
4,164
Isla Nublar
Ok, I think I got a good pattern down with "^\s+$"

That one should do it. Should being the key word. Regexps are powerful but a pain sometimes. ;)

A pain may be an understatement ;) I am still pretty confused reading through the documentation.

Am I using this correctly? My string still looks the same :/ The compiler warned me that \s was an unrecognized escape sequence (even though it appears on the page you referred me to), so I figured it needed an extra backslash and added one. I don't know if that may have screwed it up or not.

Here is my revamped code. (I didn't get a chance to change the variable names yet):

Code:
#import <Foundation/Foundation.h>

//Function Prototypes
NSString *stripHTML(NSString *html);
NSString *removeRandomTags(NSString *html);



int main (int argc, const char * argv[])
{

    @autoreleasepool {
        
        //Create URL
        NSURL *url = [NSURL URLWithString:@"http://www.blueknob.com/winter/conditions.php"];
        
        //Request information from website
        NSURLRequest *request = [NSURLRequest requestWithURL:url];
        NSError *error = nil;
        NSData *data = [NSURLConnection sendSynchronousRequest:request returningResponse:NULL error:&error];
        
        //Check if data was read
        if(!data)
        {
            NSLog(@"Request failed %@", [error localizedDescription]);
            return 1;
        }
        
        //Convert NSData object to an NSString
        NSString *siteData = [[NSString alloc] initWithData:data encoding:NSUTF8StringEncoding];
        
        //Strip HTML tags and random HTML tags
        NSString *strippedSiteData = [[NSString alloc] initWithString:stripHTML(siteData)];
        strippedSiteData = removeRandomTags(strippedSiteData);
        
        //Check output
        NSLog(@"%@", strippedSiteData);
        
        //Create an array based on data
        NSArray *testContents = [strippedSiteData componentsSeparatedByString:@"\n"];
        __unused NSMutableArray *contents = [NSMutableArray arrayWithArray:testContents];

        NSRegularExpression *regexp = [[NSRegularExpression alloc] initWithPattern: @"^\\s+$" options: 0 error: &error ];
        NSString * strippedandpurifiedsitedata = [regexp stringByReplacingMatchesInString: strippedSiteData 
                                                                                  options: 0
                                                                                    range: NSMakeRange(0, [strippedSiteData length])
                                                                             withTemplate: @""];

        NSLog(@"%@", strippedandpurifiedsitedata);
    }
    return 0;
}

NSString *stripHTML(NSString *html)
{
    //Scan the string and strip out the HTML from it
    NSScanner *scanner = [NSScanner scannerWithString:html];
    NSString *text = nil;
    
    while([scanner isAtEnd] == NO)
    {
        //Beginning of a tag
        [scanner scanUpToString:@"<" intoString:nil];
        
        //End of a tag
        [scanner scanUpToString:@">" intoString:&text];
        
        //Replace the found tag with a space
        html = [html stringByReplacingOccurrencesOfString:[NSString stringWithFormat:@"%@>", text] withString:@" "];
    }
    
    return html;
}

NSString *removeRandomTags(NSString *html)
{
    NSScanner *scanner = [NSScanner scannerWithString:html];
    NSString *text = nil;
    
    while([scanner isAtEnd] == NO)
    {
        //Beginning of a tag
        [scanner scanUpToString:@"&" intoString:nil];
        
        //End of a tag
        [scanner scanUpToString:@";" intoString:&text];
        
        //Replace the found tag with nothing
        html = [html stringByReplacingOccurrencesOfString:[NSString stringWithFormat:@"%@;", text] withString:@""];
    }
    
    return html;
}
 

KnightWRX

macrumors Pentium
Jan 28, 2009
15,046
4
Quebec, Canada
I know I used it recently in some code that I got working. Let me try to dig it up instead of just writing it by hand.

EDIT : found it. Unfortunately, mine is used to validate that a string represents a hex number (0xAFF2 for example) :

Code:
    NSError * error;
    NSRange range = { 0, [self length] };
    NSRegularExpression * regex = [[NSRegularExpression alloc] initWithPattern:@"\\A0x[0-9a-f]+\\z" options: NSRegularExpressionCaseInsensitive error: &error];

    if([regex numberOfMatchesInString: self options:0 range: range])
    {
             ...
    }

That's how I did mine.

You could start with NSLog()ing the result from [regex numberOfMatchesInString:options:range:] though, that could give you a big clue to see if it's finding anything. Then once you're matching stuff, go with the replace function.
 

chown33

Moderator
Staff member
Aug 9, 2009
10,751
8,423
A sea of green
I used to capitalize the letters on the 2nd and subsequent words and had stopped because of something I had read. I guess it was wrong information.

It's hard to decipheranyinherentmeaningwhenmultiplewordsruntogether. So this is bad:
Code:
NSString * strippedandpurifiedsitedata;
but either of these is much more readable:
Code:
NSString * strippedAndPurifiedSiteData;
NSString * stripped_and_purified_site_data;
The latter is a long-standing C convention. Some C programmers get upset by CamelCase variable names (or function names, or anything other than _t in typedefs).

Short names don't necessarily need the same rules. Example:
Code:
NSRegularExpression * regex;
It's when the names get long and consist of many words stuck together that problems arise, e.g.
Code:
NSRegularExpression * regularexpression;
NSRegularExpression * regularimpression;
NSRegularExpression * regularexpanssion;
 

KnightWRX

macrumors Pentium
Jan 28, 2009
15,046
4
Quebec, Canada
A pain may be an understatement ;) I am still pretty confused reading through the documentation.

Figured it out for you. Basically, since the string is a multiline input, we need to tell NSRegularExpression to treat it as a multi-line text rather than a single long line. That's what the NSRegularExpressionOptions are for, when we create the regexp. This code works and does what you want :

Code:
int main (int argc, const char * argv[])
{

    @autoreleasepool {
        NSError * error = nil;
        NSString * stringData = [[NSString alloc] initWithContentsOfFile: @"/path/to/file/named/taggeddata.html" encoding: NSUTF8StringEncoding error: &error];
        if(!stringData) // test the return value; error is only set on failure
        {
            NSLog(@"%@", [error description]);
            exit(EXIT_FAILURE);
        }
        
        NSString * strippedStringData = [[NSString alloc] initWithString: stripHTML(stringData)];
        strippedStringData = removeRandomTags(strippedStringData);
        
        NSLog(@"String contains : \n%@", strippedStringData);
        
        NSRegularExpression * regexp = [[NSRegularExpression alloc] initWithPattern: @"^\\s+$" options: NSRegularExpressionAnchorsMatchLines error: &error];
        
        NSLog(@"Found %lu matches", [[regexp matchesInString: strippedStringData options: 0 range: NSMakeRange(0, [strippedStringData length])] count]);
        if([[regexp matchesInString: strippedStringData options: 0 range: NSMakeRange(0, [strippedStringData length])] count])
        {
            NSMutableString * strippedAndPurifiedString = [[NSMutableString alloc] initWithString: strippedStringData];
            [regexp replaceMatchesInString: strippedAndPurifiedString options: 0 range: NSMakeRange(0, [strippedStringData length]) withTemplate: @""];
            NSLog(@"New String contains :\n %@", strippedAndPurifiedString);
        }
        
    }
    return EXIT_SUCCESS;
}

We don't specify options during matching; instead, when initializing the regexp, we add NSRegularExpressionAnchorsMatchLines. Then my earlier regexp of simply "^\s+$" (or @"^\\s+$" in Objective-C notation) works.
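
To see the effect of that option in isolation, here is a small sketch that counts matches for the same pattern with and without it. countMatches is a hypothetical helper name used only for this illustration.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: counts how often `pattern` matches in `text`
// when the regular expression is created with the given options.
static NSUInteger countMatches(NSString *pattern,
                               NSRegularExpressionOptions options,
                               NSString *text)
{
    NSError *error = nil;
    NSRegularExpression *regex =
        [NSRegularExpression regularExpressionWithPattern:pattern
                                                  options:options
                                                    error:&error];
    if (!regex) {
        NSLog(@"Bad pattern: %@", error);
        return 0;
    }
    return [regex numberOfMatchesInString:text
                                  options:0
                                    range:NSMakeRange(0, [text length])];
}
```

With the text @"A\n   \nB" and the pattern @"^\\s+$", passing 0 for the options should find nothing, because ^ only matches at the very start of the string. Passing NSRegularExpressionAnchorsMatchLines lets ^ and $ match at each line boundary, so the whitespace-only middle line is found.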

I love these things. So powerful, once you got them sorted out.
 

KnightWRX

macrumors Pentium
Jan 28, 2009
15,046
4
Quebec, Canada
Rewrote the code without stripHTML() and removeblahblahtags() :

Code:
int main (int argc, const char * argv[])
{

    @autoreleasepool {
        NSError * error = nil;
        NSMutableString * stringData = [[NSMutableString alloc] initWithContentsOfFile: @"/path/to/taggeddata.html" encoding: NSUTF8StringEncoding error: &error];
        NSUInteger matches;
        
        if(!stringData) // test the return value; error is only set on failure
        {
            NSLog(@"%@", [error description]);
            exit(EXIT_FAILURE);
        }
        
        NSRegularExpression * regexp = [[NSRegularExpression alloc] initWithPattern: @"</*[a-z]+\\s*[a-z]*=*\"*[0-9a-z]*\"*>" options: NSRegularExpressionCaseInsensitive error: &error];
        matches = [[regexp matchesInString: stringData options: 0 range: NSMakeRange(0, [stringData length])] count];
        if(matches)
        {
            NSLog(@"Matched %lu tags, let's strip 'em", matches);
            [regexp replaceMatchesInString: stringData options: 0 range:  NSMakeRange(0, [stringData length]) withTemplate: @""];
        }
        
        regexp = [[NSRegularExpression alloc] initWithPattern: @"^\\s*$" options: NSRegularExpressionAnchorsMatchLines error: &error];
        matches = [[regexp matchesInString: stringData options: 0 range: NSMakeRange(0, [stringData length])] count];
        
        if(matches)
        {
            NSLog(@"Found %lu matches", matches);
            [regexp replaceMatchesInString: stringData options: 0 range: NSMakeRange(0, [stringData length]) withTemplate: @""];
            NSLog(@"New String contains :\n %@", stringData);
        }
        
    }
    return EXIT_SUCCESS;
}

Result :

Code:
2012-02-14 20:40:24.163 regexptest[13592:707] Matched 26 tags, let's strip 'em
2012-02-14 20:40:24.167 regexptest[13592:707] Found 9 matches
2012-02-14 20:40:24.168 regexptest[13592:707] New String contains :
 
Beginner

Upper Mambo Alley 
Yes
Yes

Lower Mambo Alley 
Yes 
Yes

Snow Drop - Beginner's Area 
Yes
Yes
Program ended with exit code: 0

Doesn't seem too bad. Getting there, still a bit of work to do on that 2nd regular expression. See how powerful this crap is. ;) I basically replaced all your NSScanner stuff and multitude of strings with 1 NSMutableString and 2 RegularExpressions.

And people say Lion sucks. :rolleyes:.

EDIT :

Works perfectly this morning, after removing some "over-thinking" and using NSMutableString's -deleteCharactersInRange: to delete the matches (which is closer to what you want) rather than replacing them with empty strings :

Code:
#import <Foundation/Foundation.h>

void removeMatchesFromString(NSArray *, NSMutableString *);

int main (int argc, const char * argv[])
{

    @autoreleasepool {
        NSError * error = nil;
        NSMutableString * stringData = [[NSMutableString alloc] initWithContentsOfFile: @"/path/to/taggeddata.html" encoding: NSUTF8StringEncoding error: &error];
        NSArray * matches;
        
        if(!stringData) // test the return value; error is only set on failure
        {
            NSLog(@"%@", [error description]);
            exit(EXIT_FAILURE);
        }
        
        NSRegularExpression * regexp = [[NSRegularExpression alloc] initWithPattern: @"<[\\w\\d\\s\"=/]+>" options: NSRegularExpressionCaseInsensitive error: &error];
        matches = [regexp matchesInString: stringData options: 0 range: NSMakeRange(0, [stringData length])];
        
        NSLog(@"Matched %lu tags, let's strip 'em", [matches count]);
        removeMatchesFromString(matches, stringData);
        NSLog(@"After HTML strip : %@", stringData);
        
        regexp = [[NSRegularExpression alloc] initWithPattern: @"^\\s*$" options: NSRegularExpressionAnchorsMatchLines error: &error];
        matches = [regexp matchesInString: stringData options: 0 range: NSMakeRange(0, [stringData length])];
        
        NSLog(@"Matched %lu garbage, let's strip 'em", [matches count]);
        removeMatchesFromString(matches, stringData);
        NSLog(@"After purification : \n%@", stringData);
    }
    return EXIT_SUCCESS;
}

void removeMatchesFromString(NSArray * matches, NSMutableString * string)
{
    NSInteger rangeoffset = 0;
    NSRange range;
    
    for(int i = 0; i < [matches count]; i++)
    {
        range = [[matches objectAtIndex: i] range];
        range.location -= rangeoffset;
        if(!range.length)
            range.length++;
        rangeoffset += range.length;
        [string deleteCharactersInRange: range];
    }
}

Code:
2012-02-15 06:33:34.323 regexptest[14110:707] After purification : 
Beginner
Upper Mambo Alley 
Yes
Yes
Lower Mambo Alley 
Yes 
Yes
Snow Drop - Beginner's Area 
Yes
Yes
Program ended with exit code: 0

A good night's sleep does wonders for the mind. :D
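
An alternative to tracking a running offset, sketched here as an assumption rather than as the code used above, is to delete the matches from the last one to the first. Ranges earlier in the string are not shifted by deletions that happen after them, so no bookkeeping is needed. deleteMatchesBackToFront is a hypothetical name, and unlike removeMatchesFromString() this version leaves zero-length matches alone instead of widening them to one character.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: deletes every match of `regex` from `string`,
// walking the matches back to front so earlier ranges stay valid.
static void deleteMatchesBackToFront(NSRegularExpression *regex,
                                     NSMutableString *string)
{
    NSArray *matches = [regex matchesInString:string
                                      options:0
                                        range:NSMakeRange(0, [string length])];
    for (NSTextCheckingResult *match in [matches reverseObjectEnumerator]) {
        // Deleting a later range never moves an earlier one.
        [string deleteCharactersInRange:[match range]];
    }
}
```

For the blank-line case, a pattern that includes the line break itself would be needed so that the match is never zero-length.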
 

Sydde

macrumors 68030
Aug 17, 2009
2,552
7,050
IOKWARDI
I am wondering, if you use a scanner and a character set from +newlineCharacterSet, would that not easily capture any type of returns? Just scan each line (-scanUpToCharactersFromSet:intoString:), append the string into a mutable string, then -scanCharactersFromSet:intoString: into nil to get to the next line.
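
A rough sketch of that scanner idea, under a couple of assumptions: the scanner's default skipping of whitespace is turned off so every step is explicit, and each kept line is trimmed of spaces and tabs. collapseBlankLines is a hypothetical name for the illustration.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: copies every non-blank line of `text` into a new
// string, one per line, treating \n, \r, and \r\n alike as line breaks.
static NSString *collapseBlankLines(NSString *text)
{
    NSScanner *scanner = [NSScanner scannerWithString:text];
    NSCharacterSet *newlines = [NSCharacterSet newlineCharacterSet];
    NSCharacterSet *spaces = [NSCharacterSet whitespaceCharacterSet];
    NSMutableString *result = [NSMutableString string];

    // Turn off the default skipping so the scanner does nothing implicitly.
    [scanner setCharactersToBeSkipped:nil];

    while (![scanner isAtEnd]) {
        NSString *line = nil;
        // Read up to the next line break; returns NO if we are sitting on one.
        if ([scanner scanUpToCharactersFromSet:newlines intoString:&line]) {
            NSString *trimmed = [line stringByTrimmingCharactersInSet:spaces];
            if ([trimmed length] > 0) {
                [result appendFormat:@"%@\n", trimmed];
            }
        }
        // Swallow one or more consecutive line-break characters.
        [scanner scanCharactersFromSet:newlines intoString:NULL];
    }
    return result;
}
```

Because +newlineCharacterSet covers carriage returns as well as line feeds, this does not depend on knowing which kind of return the web page used.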
 