Unable to strip excess newline characters from a string or an array :/




chrono1081
Feb 14, 2012, 09:37 AM
Hi guys,

I have information I pulled from a webpage. I stripped the HTML from it so now it is just a string and looks like this (it's long so this is only part of it):



Beginner


Upper Mambo Alley
Yes
Yes


Lower Mambo Alley
Yes
Yes


Snow Drop - Beginner's Area
Yes
Yes


The problem is, I am trying to get it to look like this:


Upper Mambo Alley
Yes
Yes
Lower Mambo Alley
Yes
Yes
Snow Drop - Beginner's Area
Yes
Yes


For some reason this seems like an impossible task in Objective-C. Here are the things I have tried before posting here:

1. I tried using NSScanner to scan for two consecutive newline characters. No luck.

2. I tried using [strippedSiteData stringByReplacingOccurrencesOfString:@"\n\n" withString:@""];

3. I tried reading the string into an array using NSArray *testContents = [strippedSiteData componentsSeparatedByString:@"\n"];, converting it to an NSMutableArray and then comparing the contents and removing any array member that was a newline character.

Nothing seems to be working.

I am guessing it is one of two things: either my comparison statement is wrong (where the code says "This doesn't work"), or it's something other than newline characters that is in these strings or in the array created.

If anyone can give me a heads up to what is wrong it would be greatly appreciated. Here is my code:


#import <Foundation/Foundation.h>

//Function Prototypes
NSString *stripHTML(NSString *html);
NSString *removeRandomTags(NSString *html);

int main (int argc, const char * argv[])
{
    @autoreleasepool {

        //Create URL
        NSURL *url = [NSURL URLWithString:@"http://www.blueknob.com/winter/conditions.php"];

        //Request information from website
        NSURLRequest *request = [NSURLRequest requestWithURL:url];
        NSError *error = nil;
        NSData *data = [NSURLConnection sendSynchronousRequest:request returningResponse:NULL error:&error];

        //Check if data was read
        if(!data)
        {
            NSLog(@"Request failed %@", [error localizedDescription]);
            return 1;
        }

        //Convert NSData object to an NSString
        NSString *siteData = [[NSString alloc] initWithData:data encoding:NSUTF8StringEncoding];

        //Strip HTML tags and random HTML tags
        NSString *strippedSiteData = [[NSString alloc] initWithString:stripHTML(siteData)];
        strippedSiteData = removeRandomTags(strippedSiteData);

        //Check output
        NSLog(@"%@", strippedSiteData);

        //Create an array based on data
        NSArray *testContents = [strippedSiteData componentsSeparatedByString:@"\n"];
        NSMutableArray *contents = [NSMutableArray arrayWithArray:testContents];

        //Attempt to remove any objects that are only newline characters
        for(int i = 0; i < [contents count]; ++i)
        {
            if([contents objectAtIndex:i] == @"\n") //This doesn't work
                [contents removeObjectAtIndex:i];
        }

        //Print contents
        for(NSString *s in contents)
        {
            NSLog(@"%@", s);
        }
    }
    return 0;
}

NSString *stripHTML(NSString *html)
{
    //Scan the string and strip out the HTML from it
    NSScanner *scanner = [NSScanner scannerWithString:html];
    NSString *text = nil;

    while([scanner isAtEnd] == NO)
    {
        //Beginning of a tag
        [scanner scanUpToString:@"<" intoString:nil];

        //End of a tag
        [scanner scanUpToString:@">" intoString:&text];

        //Replace the found tag with a space
        html = [html stringByReplacingOccurrencesOfString:[NSString stringWithFormat:@"%@>", text] withString:@" "];
    }

    return html;
}

NSString *removeRandomTags(NSString *html)
{
    NSScanner *scanner = [NSScanner scannerWithString:html];
    NSString *text = nil;

    while([scanner isAtEnd] == NO)
    {
        //Beginning of a tag
        [scanner scanUpToString:@"&" intoString:nil];

        //End of a tag
        [scanner scanUpToString:@";" intoString:&text];

        //Replace the found tag with nothing
        html = [html stringByReplacingOccurrencesOfString:[NSString stringWithFormat:@"%@;", text] withString:@""];
    }

    return html;
}



robbieduncan
Feb 14, 2012, 09:44 AM
Are you sure they are \n characters? There are other return characters (\r for example).

chrono1081
Feb 14, 2012, 09:45 AM
Are you sure they are \n characters? There are other return characters (\r for example).

I'm not sure : / I have tried to find ways online to view which characters they are but I haven't been successful in that yet.

I know there is a way, and I used to do it long ago in C and C++. I just haven't come across it yet (and I forget how to do it).
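One way to see which characters are really in the string is to print each character's code. This is only a minimal sketch: the sample string is made up for illustration and stands in for the real stripped page data.

```objc
#import <Foundation/Foundation.h>

int main(void)
{
    @autoreleasepool {
        // Hypothetical sample; substitute the real stripped string here
        NSString *sample = @"Yes\r\n\t \nYes";

        for (NSUInteger i = 0; i < [sample length]; i++) {
            unichar c = [sample characterAtIndex:i];
            // Printing the code makes invisible characters visible:
            // 0x000A is \n, 0x000D is \r, 0x0009 is a tab, 0x0020 is a space
            NSLog(@"index %lu: U+%04X", (unsigned long)i, (unsigned int)c);
        }
    }
    return 0;
}
```

If codes other than 0x000A show up between the lines (0x000D, tabs, spaces), a search for @"\n\n" alone would not be expected to match.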

robbieduncan
Feb 14, 2012, 09:50 AM
You see this line:

if([contents objectAtIndex:i] == @"\n") //This doesn't work

That will never ever work, even if the object is a string containing \n. Why? Because NSString instances are objects, so you are comparing the pointer addresses, not whether the strings contain the same characters. That's what the NSString isEqualToString: method is for.
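A minimal sketch of the content comparison, assuming a small made-up sample in place of the real page text. Two things in it go beyond the post above and are assumptions about what the original loop needs: componentsSeparatedByString:@"\n" does not produce elements equal to @"\n" (blank lines come back as empty or whitespace-only strings), and removing items from an array while counting forward through it can skip neighbours, so the sketch builds a new array instead.

```objc
#import <Foundation/Foundation.h>

int main(void)
{
    @autoreleasepool {
        // Hypothetical sample standing in for strippedSiteData
        NSString *sample = @"\n\nBeginner\n\n\nUpper Mambo Alley\nYes\nYes\n \n";

        NSCharacterSet *blank = [NSCharacterSet whitespaceAndNewlineCharacterSet];
        NSMutableArray *kept = [NSMutableArray array];

        for (NSString *line in [sample componentsSeparatedByString:@"\n"]) {
            // Trim spaces, tabs and stray \r so whitespace-only lines become empty
            NSString *trimmed = [line stringByTrimmingCharactersInSet:blank];

            // Compare contents with isEqualToString:, not pointers with ==
            if (![trimmed isEqualToString:@""])
                [kept addObject:line];
        }

        NSLog(@"%@", [kept componentsJoinedByString:@"\n"]);
    }
    return 0;
}
```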

KnightWRX
Feb 14, 2012, 09:55 AM
https://developer.apple.com/library/mac/#documentation/Foundation/Reference/NSRegularExpression_Class/Reference/Reference.html



NSError * error = nil;
NSRegularExpression * regexp = [[NSRegularExpression alloc] initWithPattern: @"^\s+$" options: 0 error: &error ];
NSString * strippedandpurifiedsitedata = [regexp stringByReplacingMatchesInString: strippedSiteData
options: 0
range: NSMakeRange(0, [strippedSiteData length])
withTemplate: @""];


?

The regular expression might not be up to snuff, but that's as easy as it gets.

BTW, you shouldn't capitalize any characters in your variable names, that only applies to methods following the guidelines.

chrono1081
Feb 14, 2012, 10:01 AM
You see this line:

if([contents objectAtIndex:i] == @"\n") //This doesn't work

That will never ever work, even if the object is a string containing \n. Why? Because NSString instances are objects, so you are comparing the pointer addresses, not whether the strings contain the same characters. That's what the NSString isEqualToString: method is for.

Doh! I didn't even think of that :o Thanks for the clarification.

https://developer.apple.com/library/mac/#documentation/Foundation/Reference/NSRegularExpression_Class/Reference/Reference.html



NSError * error = nil;
NSRegularExpression * regexp = [[NSRegularExpression alloc] initWithPattern: @"^\r*\n+" options: 0 error: &error ];
NSString * strippedandpurifiedsitedata = [regexp stringByReplacingMatchesInString: strippedSiteData
options: 0
range: NSMakeRange(0, [strippedSiteData length])
withTemplate: @""];


?

The regular expression might not be up to snuff, but that's as easy as it gets.

BTW, you shouldn't capitalize any characters in your variable names, that only applies to methods following the guidelines.

Thanks for the tips :) I never even heard of NSRegularExpression so I will take a look at it and see how to work with it.

KnightWRX
Feb 14, 2012, 10:06 AM
Ok, I think I got a good pattern down with "^\s+$"

That one should do it. Should being the key word. Regexps are powerful but a pain sometimes. ;)

Thanks for the tips :) I never even heard of NSRegularExpression

Maybe because it's brand new and shiny :

Availability
Available in Mac OS X v10.7 and later.
Available in iOS 4.0 and later.

Apple finally joins the Unix world, a couple of decades late. ;)

chrono1081
Feb 14, 2012, 10:22 AM
Ok, I think I got a good pattern down with "^\s+$"

That one should do it. Should being the key word. Regexps are powerful but a pain sometimes. ;)


A pain may be an understatement ;) I am still pretty confused reading through the documentation.

Am I using this correctly? My string still looks the same :/ I was getting a warning from the compiler that \s was an unrecognized escape sequence (even though it is shown on the page you referred me to), so I figured it needed an extra backslash and added one. I don't know if that may have screwed it up or not.

Here is my revamped code. (I didn't get a chance to change the variable names yet):


#import <Foundation/Foundation.h>

//Function Prototypes
NSString *stripHTML(NSString *html);
NSString *removeRandomTags(NSString *html);

int main (int argc, const char * argv[])
{
    @autoreleasepool {

        //Create URL
        NSURL *url = [NSURL URLWithString:@"http://www.blueknob.com/winter/conditions.php"];

        //Request information from website
        NSURLRequest *request = [NSURLRequest requestWithURL:url];
        NSError *error = nil;
        NSData *data = [NSURLConnection sendSynchronousRequest:request returningResponse:NULL error:&error];

        //Check if data was read
        if(!data)
        {
            NSLog(@"Request failed %@", [error localizedDescription]);
            return 1;
        }

        //Convert NSData object to an NSString
        NSString *siteData = [[NSString alloc] initWithData:data encoding:NSUTF8StringEncoding];

        //Strip HTML tags and random HTML tags
        NSString *strippedSiteData = [[NSString alloc] initWithString:stripHTML(siteData)];
        strippedSiteData = removeRandomTags(strippedSiteData);

        //Check output
        NSLog(@"%@", strippedSiteData);

        //Create an array based on data
        NSArray *testContents = [strippedSiteData componentsSeparatedByString:@"\n"];
        __unused NSMutableArray *contents = [NSMutableArray arrayWithArray:testContents];

        NSRegularExpression *regexp = [[NSRegularExpression alloc] initWithPattern: @"^\\s+$" options: 0 error: &error ];
        NSString * strippedandpurifiedsitedata = [regexp stringByReplacingMatchesInString: strippedSiteData
                                                                                  options: 0
                                                                                    range: NSMakeRange(0, [strippedSiteData length])
                                                                             withTemplate: @""];

        NSLog(@"%@", strippedandpurifiedsitedata);
    }
    return 0;
}

NSString *stripHTML(NSString *html)
{
    //Scan the string and strip out the HTML from it
    NSScanner *scanner = [NSScanner scannerWithString:html];
    NSString *text = nil;

    while([scanner isAtEnd] == NO)
    {
        //Beginning of a tag
        [scanner scanUpToString:@"<" intoString:nil];

        //End of a tag
        [scanner scanUpToString:@">" intoString:&text];

        //Replace the found tag with a space
        html = [html stringByReplacingOccurrencesOfString:[NSString stringWithFormat:@"%@>", text] withString:@" "];
    }

    return html;
}

NSString *removeRandomTags(NSString *html)
{
    NSScanner *scanner = [NSScanner scannerWithString:html];
    NSString *text = nil;

    while([scanner isAtEnd] == NO)
    {
        //Beginning of a tag
        [scanner scanUpToString:@"&" intoString:nil];

        //End of a tag
        [scanner scanUpToString:@";" intoString:&text];

        //Replace the found tag with nothing
        html = [html stringByReplacingOccurrencesOfString:[NSString stringWithFormat:@"%@;", text] withString:@""];
    }

    return html;
}

KnightWRX
Feb 14, 2012, 10:25 AM
I know I used it recently in some code that I got working. Let me try to dig it up instead of just writing it by hand.

EDIT : found it. Unfortunately, mine is used to validate that a string is a representation of a hex number (0xAFF2 for example) :

NSError * error;
NSRange range = { 0, [self length] };
NSRegularExpression * regex = [[NSRegularExpression alloc] initWithPattern:@"\\A0x[0-9a-f]+\\z" options: NSRegularExpressionCaseInsensitive error: &error];

if([regex numberOfMatchesInString: self options:0 range: range])
{
...
}


That's how I did mine.

You could start with NSLog()ing the result from [regex numberOfMatchesInString:options:range:] though, that could give you a big clue to see if it's finding anything. Then once you're matching stuff, go with the replace function.
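The suggestion above can be sketched roughly as follows. The sample string is made up for illustration; the pattern and options are the ones from the revamped code earlier in the thread.

```objc
#import <Foundation/Foundation.h>

int main(void)
{
    @autoreleasepool {
        // Hypothetical sample standing in for strippedSiteData
        NSString *sample = @"Upper Mambo Alley\nYes\n\n\nLower Mambo Alley\n";

        NSError *error = nil;
        NSRegularExpression *regex =
            [NSRegularExpression regularExpressionWithPattern:@"^\\s+$"
                                                      options:0
                                                        error:&error];
        if (!regex) {
            // A nil result means the pattern itself was rejected
            NSLog(@"Bad pattern: %@", [error localizedDescription]);
            return 1;
        }

        // Count matches before attempting any replacement
        NSUInteger count = [regex numberOfMatchesInString:sample
                                                  options:0
                                                    range:NSMakeRange(0, [sample length])];
        NSLog(@"Found %lu matches", (unsigned long)count);
    }
    return 0;
}
```

A count of zero would point at the pattern or its options rather than at the replace call.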

dejo
Feb 14, 2012, 10:34 AM
BTW, you shouldn't capitalize any characters in your variable names, that only applies to methods following the guidelines.

Which guidelines state that? Are you saying
NSString *hostName;
is bad form?

KnightWRX
Feb 14, 2012, 11:23 AM
Which guidelines state that? Are you saying
NSString *hostName;
is bad form?

Dunno, I had read that somewhere, but looking through Apple's guidelines, I can't find it, in fact it says the opposite :

https://developer.apple.com/library/mac/#documentation/Cocoa/Conceptual/CodingGuidelines/Articles/NamingBasics.html#//apple_ref/doc/uid/20001281-BBCHBFAH

I used to capitalize the letters on the 2nd and subsequent words and had stopped because of something I had read. I guess it was wrong information.

chown33
Feb 14, 2012, 11:43 AM
I used to capitalize the letters on the 2nd and subsequent words and had stopped because of something I had read. I guess it was wrong information.

It's hard to decipheranyinherentmeaningwhenmultiplewordsruntogether. So this is bad:
NSString * strippedandpurifiedsitedata;

but either of these is much more readable:
NSString * strippedAndPurifiedSiteData;
NSString * stripped_and_purified_site_data;
The latter is a long-standing C convention. Some C programmers get upset by CamelCase variable names (or function names, or anything other than _t in typedefs).

Short names don't necessarily need the same rules. Example:
NSRegularExpression * regex;

It's when the names get long and consist of many words stuck together that problems arise, e.g.
NSRegularExpression * regularexpression;
NSRegularExpression * regularimpression;
NSRegularExpression * regularexpanssion;

KnightWRX
Feb 14, 2012, 07:21 PM
A pain may be an understatement ;) I am still pretty confused reading through the documentation.


Figured it out for you. Basically, since the string is a multiline input, we need to tell NSRegularExpression to treat it as a multi-line text rather than a single long line. That's what the NSRegularExpressionOptions are for, when we create the regexp. This code works and does what you want :

int main (int argc, const char * argv[])
{
    @autoreleasepool {
        NSError * error = nil;
        NSString * stringData = [[NSString alloc] initWithContentsOfFile: @"/path/to/file/named/taggeddata.html" encoding: NSUTF8StringEncoding error: &error];

        //Check the returned object, not the error pointer, to detect failure
        if(!stringData)
        {
            NSLog(@"%@", [error description]);
            exit(EXIT_FAILURE);
        }

        NSString * strippedStringData = [[NSString alloc] initWithString: stripHTML(stringData)];
        strippedStringData = removeRandomTags(strippedStringData);

        NSLog(@"String contains : \n%@", strippedStringData);

        //AnchorsMatchLines lets ^ and $ match at every line, not just at the ends of the string
        NSRegularExpression * regexp = [[NSRegularExpression alloc] initWithPattern: @"^\\s+$" options: NSRegularExpressionAnchorsMatchLines error: &error];

        NSUInteger matchCount = [[regexp matchesInString: strippedStringData options: 0 range: NSMakeRange(0, [strippedStringData length])] count];
        NSLog(@"Found %lu matches", matchCount);
        if(matchCount)
        {
            NSMutableString * strippedAndPurifiedString = [[NSMutableString alloc] initWithString: strippedStringData];
            [regexp replaceMatchesInString: strippedAndPurifiedString options: 0 range: NSMakeRange(0, [strippedAndPurifiedString length]) withTemplate: @""];
            NSLog(@"New String contains :\n %@", strippedAndPurifiedString);
        }
    }
    return EXIT_SUCCESS;
}

We don't specify options during matching, but during the initialization of the regexp, we add in NSRegularExpressionAnchorsMatchLines. Then my earlier regexp of simply "^\s+$" (or @"^\\s+$" in Objective-C notation) works.
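To illustrate the effect of that option, here is a small sketch that counts matches for the same pattern with and without NSRegularExpressionAnchorsMatchLines. The sample string and the helper name countMatches are made up for this example; note the doubled backslash, which the compiler turns into the single backslash the regex engine needs to see.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: number of matches of one pattern under the given options
static NSUInteger countMatches(NSString *text, NSRegularExpressionOptions options)
{
    NSRegularExpression *regex =
        [NSRegularExpression regularExpressionWithPattern:@"^\\s+$"
                                                  options:options
                                                    error:NULL];
    return [regex numberOfMatchesInString:text
                                  options:0
                                    range:NSMakeRange(0, [text length])];
}

int main(void)
{
    @autoreleasepool {
        // Made-up sample with one whitespace-only line in the middle
        NSString *sample = @"Upper Mambo Alley\nYes\n  \nLower Mambo Alley\nYes";

        // Without the option, ^ can only match at the very start of the string
        NSLog(@"single-line anchors: %lu",
              (unsigned long)countMatches(sample, 0));

        // With the option, ^ and $ also match at each line boundary
        NSLog(@"multi-line anchors:  %lu",
              (unsigned long)countMatches(sample, NSRegularExpressionAnchorsMatchLines));
    }
    return 0;
}
```

On this sample, only the second call should find the whitespace-only line.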

I love these things. So powerful, once you got them sorted out.

KnightWRX
Feb 14, 2012, 07:41 PM
Rewrote the code without stripHTML() and removeblahblahtags() :

int main (int argc, const char * argv[])
{
    @autoreleasepool {
        NSError * error = nil;
        NSMutableString * stringData = [[NSMutableString alloc] initWithContentsOfFile: @"/path/to/taggeddata.html" encoding: NSUTF8StringEncoding error: &error];
        NSUInteger matches;

        //Check the returned object, not the error pointer, to detect failure
        if(!stringData)
        {
            NSLog(@"%@", [error description]);
            exit(EXIT_FAILURE);
        }

        //First pass : strip the HTML tags
        NSRegularExpression * regexp = [[NSRegularExpression alloc] initWithPattern: @"</*[a-z]+\\s*[a-z]*=*\"*[0-9a-z]*\"*>" options: NSRegularExpressionCaseInsensitive error: &error];
        matches = [[regexp matchesInString: stringData options: 0 range: NSMakeRange(0, [stringData length])] count];
        if(matches)
        {
            NSLog(@"Matched %lu tags, let's strip 'em", matches);
            [regexp replaceMatchesInString: stringData options: 0 range: NSMakeRange(0, [stringData length]) withTemplate: @""];
        }

        //Second pass : blank out the whitespace-only lines
        regexp = [[NSRegularExpression alloc] initWithPattern: @"^\\s*$" options: NSRegularExpressionAnchorsMatchLines error: &error];
        matches = [[regexp matchesInString: stringData options: 0 range: NSMakeRange(0, [stringData length])] count];

        if(matches)
        {
            NSLog(@"Found %lu matches", matches);
            [regexp replaceMatchesInString: stringData options: 0 range: NSMakeRange(0, [stringData length]) withTemplate: @""];
            NSLog(@"New String contains :\n %@", stringData);
        }
    }
    return EXIT_SUCCESS;
}

Result :


2012-02-14 20:40:24.163 regexptest[13592:707] Matched 26 tags, let's strip 'em
2012-02-14 20:40:24.167 regexptest[13592:707] Found 9 matches
2012-02-14 20:40:24.168 regexptest[13592:707] New String contains :

Beginner

Upper Mambo Alley
Yes
Yes

Lower Mambo Alley
Yes
Yes

Snow Drop - Beginner's Area
Yes
Yes
Program ended with exit code: 0

Doesn't seem too bad. Getting there, still a bit of work to do on that 2nd regular expression. See how powerful this crap is. ;) I basically replaced all your NSScanner stuff and multitude of strings with 1 NSMutableString and 2 RegularExpressions.

And people say Lion sucks. :rolleyes:.

EDIT :

Works perfectly this morning, after removing some "over-thinking" and using NSMutableString's delete method (which is closer to what you want) rather than replacing the matches with an empty template :

#import <Foundation/Foundation.h>

void removeMatchesFromString(NSArray *, NSMutableString *);

int main (int argc, const char * argv[])
{
    @autoreleasepool {
        NSError * error = nil;
        NSMutableString * stringData = [[NSMutableString alloc] initWithContentsOfFile: @"/path/to/taggeddata.html" encoding: NSUTF8StringEncoding error: &error];
        NSArray * matches;

        //Check the returned object, not the error pointer, to detect failure
        if(!stringData)
        {
            NSLog(@"%@", [error description]);
            exit(EXIT_FAILURE);
        }

        //First pass : strip the HTML tags
        NSRegularExpression * regexp = [[NSRegularExpression alloc] initWithPattern: @"<[\\w\\d\\s\"=/]+>" options: NSRegularExpressionCaseInsensitive error: &error];
        matches = [regexp matchesInString: stringData options: 0 range: NSMakeRange(0, [stringData length])];

        NSLog(@"Matched %lu tags, let's strip 'em", [matches count]);
        removeMatchesFromString(matches, stringData);
        NSLog(@"After HTML strip : %@", stringData);

        //Second pass : delete the blank lines
        regexp = [[NSRegularExpression alloc] initWithPattern: @"^\\s*$" options: NSRegularExpressionAnchorsMatchLines error: &error];
        matches = [regexp matchesInString: stringData options: 0 range: NSMakeRange(0, [stringData length])];

        NSLog(@"Matched %lu garbage, let's strip 'em", [matches count]);
        removeMatchesFromString(matches, stringData);
        NSLog(@"After purification : \n%@", stringData);
    }
    return EXIT_SUCCESS;
}

void removeMatchesFromString(NSArray * matches, NSMutableString * string)
{
    //Number of characters deleted so far; later matches shift left by this much
    NSUInteger rangeoffset = 0;
    NSRange range;

    for(NSUInteger i = 0; i < [matches count]; i++)
    {
        range = [[matches objectAtIndex: i] range];
        range.location -= rangeoffset;

        //A zero-length match at the very end leaves nothing to delete
        if(range.location >= [string length])
            break;

        //An empty line matches with length 0 : delete its newline instead
        if(!range.length)
            range.length++;
        rangeoffset += range.length;
        [string deleteCharactersInRange: range];
    }
}

2012-02-15 06:33:34.323 regexptest[14110:707] After purification :
Beginner
Upper Mambo Alley
Yes
Yes
Lower Mambo Alley
Yes
Yes
Snow Drop - Beginner's Area
Yes
Yes
Program ended with exit code: 0
A good night's sleep does wonders for the mind. :D
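As a possible variation on removeMatchesFromString(), deleting the matches from last to first avoids the running offset, because earlier ranges stay valid. This is only a sketch under assumptions: the function name removeMatchesBackToFront and the sample text are made up, and the pattern is swapped for one that consumes each blank line together with its newline (so every match has a non-zero length), which is not the "^\s*$" pattern used above.

```objc
#import <Foundation/Foundation.h>

// Hypothetical variant: delete matched ranges starting from the end of the string
static void removeMatchesBackToFront(NSArray *matches, NSMutableString *string)
{
    for (NSTextCheckingResult *match in [matches reverseObjectEnumerator]) {
        NSRange range = [match range];
        // Ranges before this one are unaffected by the deletion
        if (range.length > 0)
            [string deleteCharactersInRange:range];
    }
}

int main(void)
{
    @autoreleasepool {
        // Made-up sample standing in for the tag-stripped page text
        NSMutableString *text = [NSMutableString stringWithString:
            @"Beginner\n\n\nUpper Mambo Alley\nYes\n \t\nYes\n"];

        NSError *error = nil;
        // A line holding only spaces, tabs or \r, plus its terminating \n
        NSRegularExpression *regex =
            [NSRegularExpression regularExpressionWithPattern:@"^[ \\t\\r]*\\n"
                                                      options:NSRegularExpressionAnchorsMatchLines
                                                        error:&error];
        if (!regex) {
            NSLog(@"Bad pattern: %@", [error localizedDescription]);
            return 1;
        }

        NSArray *matches = [regex matchesInString:text
                                          options:0
                                            range:NSMakeRange(0, [text length])];
        removeMatchesBackToFront(matches, text);
        NSLog(@"%@", text);
    }
    return 0;
}
```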

Sydde
Feb 15, 2012, 01:28 AM
I am wondering, if you use a scanner and a character set from +newlineCharacterSet, would that not easily capture any type of returns? Just scan each line (-scanUpToCharactersFromSet:intoString:), append the string into a mutable string, then -scanCharactersFromSet:intoString: into nil to get to the next line.
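A rough sketch of that scanner idea, assuming a made-up sample string. One detail is an addition to the description above: NSScanner skips whitespace and newlines by default, so the sketch turns that off to keep control of what gets consumed.

```objc
#import <Foundation/Foundation.h>

int main(void)
{
    @autoreleasepool {
        // Hypothetical sample mixing \n and \r\n line endings
        NSString *sample = @"\n\nBeginner\n\r\nUpper Mambo Alley\nYes\nYes\n\n";

        NSCharacterSet *newlines = [NSCharacterSet newlineCharacterSet];
        NSScanner *scanner = [NSScanner scannerWithString:sample];
        // Disable the default skipping so only the explicit scans advance the scanner
        [scanner setCharactersToBeSkipped:nil];

        NSMutableString *result = [NSMutableString string];
        while (![scanner isAtEnd]) {
            NSString *line = nil;
            // Grab everything up to the next run of line terminators
            if ([scanner scanUpToCharactersFromSet:newlines intoString:&line])
                [result appendFormat:@"%@\n", line];

            // Swallow the whole run of line terminators, whatever kind they are
            [scanner scanCharactersFromSet:newlines intoString:NULL];
        }

        NSLog(@"%@", result);
    }
    return 0;
}
```

Lines that contain only spaces or tabs would still be kept by this version; trimming each line before appending would be one way to drop them too.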