help! force NSString to NSUTF8StringEncoding




Oats
Aug 8, 2007, 12:14 AM
I have some string data in an NSData * named "data" that I have loaded from an HTML file, and I want to create an NSString with it, for easy parsing later. So I was trying to use initWithData:encoding: to create an NSString from the data:


NSString *string = [[NSString alloc] initWithData:data
                                         encoding:NSUTF8StringEncoding];
the problem is, my data has a few chars not compatible with NSUTF8StringEncoding so the method returns "nil". How can I force it to use this encoding and ignore illegal characters? Is there a better encoding for HTML data?
Thanks!



Soulstorm
Aug 8, 2007, 01:24 AM
NSData was never meant to hold strings. When loading a file from the HDD with NSData, all you get is a chunk of bytes representing a file on the disk and not just the string contained therein.

So, in order to do what you want, you must use NSString's file-loading methods, like this:

NSString *theString = [NSString stringWithContentsOfFile:@"PathToFile" encoding:NSUTF8StringEncoding error:NULL];
NSString *theString2 = [[NSString alloc] initWithContentsOfFile:@"PathToFile" encoding:NSUTF8StringEncoding error:NULL];

Test it and see if it works.

Nutter
Aug 8, 2007, 06:23 AM
Maybe he's loading the file from the net, not his disk?

In any case, if the file contains any non-string data, you're never going to get a proper NSString out of it.

It's best to keep the data as data. You can parse the data itself for anything, including strings. For example, here's a category on NSData for finding the range of a particular chunk of data:


@implementation NSData (MBSearch)

// Returns the range of the first occurrence of `data` at or after `index`,
// or {NSNotFound, 0} if there is none.
- (NSRange)rangeOfData:(NSData *)data searchFromIndex:(NSUInteger)index
{
    if (data == nil)
        [NSException raise:NSInvalidArgumentException
                    format:@"rangeOfData:searchFromIndex: data must not be nil"];

    NSRange range = NSMakeRange(NSNotFound, 0);
    NSUInteger sourceLength = [self length];
    NSUInteger searchLength = [data length];
    if (searchLength == 0 || sourceLength == 0)
        return range;

    const uint8_t *source = (const uint8_t *)[self bytes];
    const uint8_t *search = (const uint8_t *)[data bytes];
    NSUInteger i, j;

    for (i = index; i < sourceLength; i++)
    {
        if (source[i] != search[0])
            continue;

        // First byte matches; compare the rest without running off the end.
        for (j = 1; j < searchLength && i + j < sourceLength; j++)
        {
            if (source[i + j] != search[j])
                break;
        }

        if (j == searchLength)
        {
            range.location = i;
            range.length = searchLength;
            break;
        }
    }

    return range;
}

@end


Here's how you use this to search for a string in the data:


NSRange range = [data rangeOfData:[@"Hello World!" dataUsingEncoding:NSASCIIStringEncoding] searchFromIndex:0];


I'm not sure if ASCII is the proper encoding for this stuff, but it's worked for me thus far.

Krevnik
Aug 8, 2007, 09:20 AM
In any case, if the file contains any non-string data, you're never going to get a proper NSString out of it.

It's best to keep the data as data. You can parse the data itself for anything, including strings.

The problem is that HTML files are supposed to be pure-text of a particular encoding. I would be very surprised if loading it into an NSString directly didn't work correctly. If there is non-text data (which is kinda hard to say yes or no to, considering the range of values UTF-8 uses), then it isn't a valid HTML file.


Here's how you use this to search for a string in the data:


NSRange range = [data rangeOfData:[@"Hello World!" dataUsingEncoding:NSASCIIStringEncoding] searchFromIndex:0];


I'm not sure if ASCII is the proper encoding for this stuff, but it's worked for me thus far.

With apps needing to be global, ASCII should be avoided at all costs. UTF8 should be used instead, which includes the ASCII set as the English character set. This means ASCII strings can be used as UTF8 English strings without problems, but the encodings should remain as one of the UTF encodings.


My question for Oats at this point is this: Can we get a look at the data and the bytes in question causing the problem?

Oats
Aug 8, 2007, 09:32 AM
In any case, if the file contains any non-string data, you're never going to get a proper NSString out of it.

I am hoping that isn't true... for example, I think that my file contains accented characters or something... isn't there some way to get this into a string of some type for display? The file isn't binary, it is text of some sort, and mostly ASCII.

The NSData search routine you showed doesn't really work for me, because I am not looking for a particular string. What I hope to do is display the results of a "grep" UNIX shell call, using NSTask, which returns the results as NSData.

If I have to, I guess I would rather replace all non-ASCII characters with ?'s just so I could continue and display most of what is there.

The problem is that HTML files are supposed to be pure-text of a particular encoding. I would be very surprised if loading it into an NSString directly didn't work correctly. If there is non-text data (which is kinda hard to say yes or no to, considering the range of values UTF-8 uses), then it isn't a valid HTML file.
It is quite possible that the file is not 100% valid HTML, like 90% of HTML files on the web :-)

Krevnik
Aug 8, 2007, 09:50 AM
I am hoping that isn't true... for example, I think that my file contains accented characters or something... isn't there some way to get this into a string of some type for display? The file isn't binary, it is text of some sort, and mostly ASCII.

The NSData search routine you showed doesn't really work for me, because I am not looking for a particular string. What I hope to do is display the results of a "grep" UNIX shell call, using NSTask, which returns the results as NSData.

If I have to, I guess I would rather replace all non-ASCII characters with ?'s just so I could continue and display most of what is there.

Okay, you didn't mention you were running a task and using NSPipe to get at the results, that changes the game quite a bit. It isn't about HTML being valid anymore, as grep could potentially be the cause of your problems now too.

Is there any way we can see a file dump of the NSData you are getting from the pipe? (Preferably raw, this is something I would need to see in a hex editor)


It is quite possible that the file is not 100% valid HTML, like 90% of HTML files on the web :-)

HTML that is readable by a browser must be pure-text. I don't know of any browser that assumes that someone might spew binary data into their HTML. Browsers on the other hand, have a lot of work in detecting the encoding.

Oats
Aug 8, 2007, 10:32 AM
Okay, you didn't mention you were running a task and using NSPipe to get at the results, that changes the game quite a bit. It isn't about HTML being valid anymore, as grep could potentially be the cause of your problems now too.

Is there any way we can see a file dump of the NSData you are getting from the pipe? (Preferably raw, this is something I would need to see in a hex editor)

It is true, the HTML discussion is really not the point... I am looking for a general solution for ignoring/replacing illegal characters so I can somehow get an NSString from this data. Most of the time, the grep results are UTF8 compatible, but every once in a while, one of the files returns a bad character, and I can't convert to NSString. I am not able to easily dump the NSData right now, but that shouldn't really matter?

grep could be searching binary documents, but usually returns with "Binary file /../.. matches" rather than returning the actual contents of the file as it would with a text file.

Nutter
Aug 8, 2007, 07:21 PM
It is true, the HTML discussion is really not the point...

Yeah, sorry about that. Of course HTML is supposed to be text, I was just being dense.

Krevnik
Aug 8, 2007, 07:59 PM
It is true, the HTML discussion is really not the point... I am looking for a general solution for ignoring/replacing illegal characters so I can somehow get an NSString from this data. Most of the time, the grep results are UTF8 compatible, but every once in a while, one of the files returns a bad character, and I can't convert to NSString. I am not able to easily dump the NSData right now, but that shouldn't really matter?

grep could be searching binary documents, but usually returns with "Binary file /../.. matches" rather than returning the actual contents of the file as it would with a text file.

I would suggest looking at changing your parameters to grep to filter out binary file results completely. I would wager that grep is spewing some bad characters.
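
For what it's worth, here is a minimal sketch of that suggestion, assuming ARC, macOS 10.13 or later, and grep living at /usr/bin/grep. The `-I` flag tells grep to treat binary files as non-matching. The function name `RunGrepSkippingBinaries` and its parameters are made up for illustration and are not from Oats's code.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: runs `grep -I -e pattern -- path` and returns
// whatever grep wrote to standard output, or nil if it could not launch.
static NSData *RunGrepSkippingBinaries(NSString *pattern, NSString *path)
{
    NSTask *task = [[NSTask alloc] init];
    NSPipe *pipe = [NSPipe pipe];

    // Assumption: grep is installed at this path.
    [task setExecutableURL:[NSURL fileURLWithPath:@"/usr/bin/grep"]];
    // -I skips binary files; -e and -- keep odd patterns and paths
    // from being parsed as options.
    [task setArguments:@[@"-I", @"-e", pattern, @"--", path]];
    [task setStandardOutput:pipe];

    NSError *error = nil;
    if (![task launchAndReturnError:&error])
        return nil;

    // Drain the pipe before waiting so a large result cannot fill the
    // pipe buffer and stall grep.
    NSData *output = [[pipe fileHandleForReading] readDataToEndOfFile];
    [task waitUntilExit];
    return output;
}
```

The returned NSData still has to be decoded with some encoding; skipping binary files only removes one source of stray bytes.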

Oats
Aug 9, 2007, 07:02 AM
I would suggest looking at changing your parameters to grep to filter out binary file results completely. I would wager that grep is spewing some bad characters.

thank you for continuing to think about this problem with me. however, grep isn't really the issue for me. i want as much information as grep will give me. and grep doesn't think the file is binary. i just want to have a more robust NSData to NSString conversion that can accept a few accented characters somehow.

is there a predefined encoding I can use that includes ASCII + accented characters?

lets say i want to replace all illegal characters with a "?" character... perhaps I can pre-process the NSData like that, and just replace all characters with a decimal value greater than 127. it is a horrible hack i hope to avoid, but a possibility.
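
A rough sketch of that hack, assuming ARC. The helper name `LossyASCIIStringFromData` is invented for illustration. It copies the data, overwrites every byte above 127 with '?', and decodes what is left as ASCII.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: replaces every byte above 127 with '?' and
// decodes the result as ASCII, so the conversion cannot hit a high byte.
static NSString *LossyASCIIStringFromData(NSData *data)
{
    NSMutableData *copy = [data mutableCopy];
    uint8_t *bytes = (uint8_t *)[copy mutableBytes];
    NSUInteger length = [copy length];

    for (NSUInteger i = 0; i < length; i++) {
        if (bytes[i] > 127)
            bytes[i] = '?';
    }

    return [[NSString alloc] initWithData:copy encoding:NSASCIIStringEncoding];
}
```

Note that a multi-byte UTF-8 character turns into several question marks this way, one per byte, so the accented characters are lost rather than converted.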

gnasher729
Aug 9, 2007, 07:37 AM
HTML that is readable by a browser must be pure-text. I don't know of any browser that assumes that someone might spew binary data into their HTML. Browsers on the other hand, have a lot of work in detecting the encoding.

HTML must be pure text _in some encoding_. The HTML he received might be MacRoman, or some Windows encoding, and if you lie to NSString and tell it that the encoding is UTF8, then things will go wrong. One peculiar difference between MacRoman and UTF8 is that any sequence of bytes is valid MacRoman text (it might not be displayed the way you expected it, but it is always valid), but not every sequence of bytes is valid UTF8.
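
A small sketch of that difference, assuming ARC. The helper names and the sample bytes are made up for illustration: a lone 0xE9 byte sits between ASCII letters, which an 8-bit encoding accepts but UTF-8 does not.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: YES if Foundation can decode `data` in `encoding`.
static BOOL CanDecode(NSData *data, NSStringEncoding encoding)
{
    return [[NSString alloc] initWithData:data encoding:encoding] != nil;
}

// Made-up sample: in UTF-8 the byte 0xE9 announces a three-byte sequence,
// but here it is followed by a plain ASCII '!'.
static NSData *SampleBytes(void)
{
    const uint8_t raw[] = { 'c', 'a', 'f', 0xE9, '!' };
    return [NSData dataWithBytes:raw length:sizeof(raw)];
}
```

With these bytes, `CanDecode(SampleBytes(), NSUTF8StringEncoding)` should come back NO, while the same call with `NSMacOSRomanStringEncoding` should come back YES, whether or not the resulting character is the one the author intended.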

gnasher729
Aug 9, 2007, 07:44 AM
thank you for continuing to think about this problem with me. however, grep isn't really the issue for me. i want as much information as grep will give me. and grep doesn't think the file is binary. i just want to have a more robust NSData to NSString conversion that can accept a few accented characters somehow.

is there a predefined encoding I can use that includes ASCII + accented characters?

lets say i want to replace all illegal characters with a "?" character... perhaps I can pre-process the NSData like that, and just replace all characters with a decimal value greater than 127. it is a horrible hack i hope to avoid, but a possibility.

You need to find out which encoding is used and pass that encoding to NSString. Google for "html encoding". It is either in the Content-type header or as a META element in the actual html.
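
As a rough sketch of the META half of that advice, assuming ARC: the function below looks for a `charset=` declaration in the raw bytes and maps the name to an NSStringEncoding through CoreFoundation's IANA charset lookup. The function name `DeclaredHTMLEncoding` and the 0 "not found" return value are invented for this example, and the parsing is deliberately naive (it does not read HTTP headers or handle spaces around the `=`).

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: returns the encoding named by the first "charset="
// in the data, or 0 if none is declared or the name is not recognized.
static NSStringEncoding DeclaredHTMLEncoding(NSData *data)
{
    // Decode as Latin-1 just to search: it maps single bytes to
    // characters, so stray high bytes do not make this step fail.
    NSString *text = [[NSString alloc] initWithData:data
                                           encoding:NSISOLatin1StringEncoding];
    if (text == nil)
        return 0;

    NSRange marker = [text rangeOfString:@"charset="
                                 options:NSCaseInsensitiveSearch];
    if (marker.location == NSNotFound)
        return 0;

    NSString *rest = [text substringFromIndex:NSMaxRange(marker)];
    NSScanner *scanner = [NSScanner scannerWithString:rest];
    [scanner setCharactersToBeSkipped:nil];

    // Skip an optional opening quote, as in <meta charset="utf-8">.
    NSCharacterSet *quotes =
        [NSCharacterSet characterSetWithCharactersInString:@"\"'"];
    [scanner scanCharactersFromSet:quotes intoString:NULL];

    NSMutableCharacterSet *nameChars =
        [NSMutableCharacterSet alphanumericCharacterSet];
    [nameChars addCharactersInString:@"-_.:"];

    NSString *name = nil;
    if (![scanner scanCharactersFromSet:nameChars intoString:&name])
        return 0;

    CFStringEncoding cfEncoding =
        CFStringConvertIANACharSetNameToEncoding((__bridge CFStringRef)name);
    if (cfEncoding == kCFStringEncodingInvalidId)
        return 0;

    return CFStringConvertEncodingToNSStringEncoding(cfEncoding);
}
```

A declared charset is only a claim by the page's author, so the result is best treated as the first encoding to try rather than a guarantee.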

Krevnik
Aug 9, 2007, 09:41 AM
You need to find out which encoding is used and pass that encoding to NSString. Google for "html encoding". It is either in the Content-type header or as a META element in the actual html.

I think the problem here is that he is grepping on the file or files (of which he isn't being very specific). Grep is a bit of a wildcard here since I have no idea if it is preserving the original encoding, inserting new problematic characters/etc.

Right now, we can throw guesses, but we don't have enough information to give Oats a solid answer. We just have a tiny snippet of code and a partial use case. This isn't enough to debug this particular problem.

Oats, I can't help you further if you don't help me a bit. Let me know more about what the heck you are doing (specifically) to produce the output, so I can at least create a testing environment. What arguments are being passed to grep? Is it a single file being searched? Is it multiple files?

Oats
Aug 9, 2007, 02:54 PM
I think the problem here is that he is grepping on the file or files (of which he isn't being very specific). Grep is a bit of a wildcard here since I have no idea if it is preserving the original encoding, inserting new problematic characters/etc.

I thought I was being very specific, the problem is accent characters, other non-standard ASCII characters, and how to encode these correctly in an NSString. The tip by gnasher729 helps me get halfway there: NSMacOSRomanStringEncoding, or MacRoman. This treats ASCII normally, but recognizes the funny other characters which were showing up in some of my HTML files.

Here is an example of a troublesome character, the funny "i" in:
"Youíre born"

This was supposed to be an apostrophe at some point, and Firefox somehow displays it correctly:
"You’re born"

As some have pointed out, this may not be valid HTML, but I don't care, I have to work with files as they are. At least MacRoman encoding doesn't crap out on the conversion, even if the exact character representation isn't right.

by the way, check this link to see valid encodings, perhaps I should be using some sort of windows encoding, i guess there is no way to know ahead of time which encoding is best, but the most important thing to me right now is that MacRoman at least allows me to get an NSString out of it for further processing.

http://developer.apple.com/documentation/Cocoa/Reference/Foundation/Classes/NSString_Class/Reference/NSString.html#//apple_ref/doc/uid/20000154-BAJJAICE

Krevnik
Aug 9, 2007, 05:04 PM
I thought I was being very specific, the problem is accent characters, other non-standard ASCII characters, and how to encode these correctly in an NSString. The tip by gnasher729 helps me get halfway there: NSMacOSRomanStringEncoding, or MacRoman. This treats ASCII normally, but recognizes the funny other characters which were showing up in some of my HTML files.

Here is an example of a troublesome character, the funny "i" in:
"Youíre born"

This was supposed to be an apostrophe at some point, and Firefox somehow displays it correctly:
"You’re born"


I apologize for missing these details.

As some have pointed out, this may not be valid HTML, but I don't care, I have to work with files as they are. At least MacRoman encoding doesn't crap out on the conversion, even if the exact character representation isn't right.

Well, if it is text, then it is valid HTML, the HTML usually is supposed to include a tag at the very beginning that lets you know the encoding. Not everyone uses it though.


by the way, check this link to see valid encodings, perhaps I should be using some sort of windows encoding, i guess there is no way to know ahead of time which encoding is best, but the most important thing to me right now is that MacRoman at least allows me to get an NSString out of it for further processing.

http://developer.apple.com/documentation/Cocoa/Reference/Foundation/Classes/NSString_Class/Reference/NSString.html#//apple_ref/doc/uid/20000154-BAJJAICE

My recommendation is to have a little more robust string creation that uses something like MacOSRoman as a last resort, and UTF + UTF-8 as a first resort, and ISO Latin somewhere in between. Odds are that this is ISO Latin or some variant of it (which Windows and Linux both use).

At least this way, you can retain as much unicode information as possible and resort back to the 'quick and dirty' if it doesn't work. I wouldn't completely ditch the ability to extract Unicode, as it will at least keep the app globalized.
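
A minimal sketch of that kind of fallback chain, assuming ARC. The function name `StringFromDataTryingEncodings` is invented, and putting Windows-1252 ahead of ISO Latin 1 is my own assumption rather than part of the recommendation above.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: tries each candidate encoding in order and returns
// the first string that decodes. If `usedEncoding` is not NULL it receives
// the encoding that worked.
static NSString *StringFromDataTryingEncodings(NSData *data,
                                               NSStringEncoding *usedEncoding)
{
    // Strictest first: UTF-8 rejects malformed input, while the 8-bit
    // encodings accept almost anything, so they must come later.
    const NSStringEncoding candidates[] = {
        NSUTF8StringEncoding,
        NSWindowsCP1252StringEncoding,  // assumption: common for Windows-authored pages
        NSISOLatin1StringEncoding,
        NSMacOSRomanStringEncoding,     // last resort
    };
    const size_t count = sizeof(candidates) / sizeof(candidates[0]);

    for (size_t i = 0; i < count; i++) {
        NSString *string = [[NSString alloc] initWithData:data
                                                 encoding:candidates[i]];
        if (string != nil) {
            if (usedEncoding != NULL)
                *usedEncoding = candidates[i];
            return string;
        }
    }
    return nil;
}
```

Because the 8-bit encodings rarely fail, whichever one is listed first after UTF-8 will win almost every time, so the order is a guess about where the files came from rather than real detection.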

gnasher729
Aug 9, 2007, 05:53 PM
The tip by gnasher729 helps me get halfway there: NSMacOSRomanStringEncoding, or MacRoman. This treats ASCII normally, but recognizes the funny other characters which were showing up in some of my HTML files.

Here is an example of a troublesome character, the funny "i" in:
"Youíre born"

This was supposed to be an apostrophe at some point, and Firefox somehow displays it correctly:
"You’re born"

It's (most likely) valid HTML, it's definitely not MacRoman, and it is also not UTF8: MacRoman and some Windows encodings are plain 8 bit encodings, that is there are exactly 256 characters. UTF8 encodes the whole of Unicode, that is more than a million possible codepoints. The first 128 are identical in Unicode, ASCII, MacRoman and Windows. But all other characters are encoded in two, three or four bytes, never in one byte. So if you have UTF8, and you interpret it as MacRoman, these "funny" characters never come alone, they always come in twos, threes or fours.

You got a single "funny" character; that means it is not UTF8.
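
To make the byte-level argument concrete, here is a simplified structural check, assuming ARC. The name `HasUTF8Structure` is invented, and the check only looks at the lead-byte and continuation-byte pattern; it does not reject overlong forms or surrogates the way a full validator would. The sample bytes in the note afterwards assume the apostrophe was stored as the single byte 0x92, which is í in MacRoman and ’ in Windows-1252.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: YES if every byte above 127 is part of a
// well-formed lead byte + continuation bytes group.
static BOOL HasUTF8Structure(NSData *data)
{
    const uint8_t *bytes = (const uint8_t *)[data bytes];
    NSUInteger length = [data length];
    NSUInteger i = 0;

    while (i < length) {
        uint8_t lead = bytes[i];
        NSUInteger continuationCount;

        if (lead < 0x80)                continuationCount = 0;  // ASCII
        else if ((lead & 0xE0) == 0xC0) continuationCount = 1;  // 110xxxxx
        else if ((lead & 0xF0) == 0xE0) continuationCount = 2;  // 1110xxxx
        else if ((lead & 0xF8) == 0xF0) continuationCount = 3;  // 11110xxx
        else return NO;  // continuation byte with no lead, or invalid lead

        // The sequence must not run past the end of the data.
        if (continuationCount > length - i - 1)
            return NO;

        // Every following byte in the group must look like 10xxxxxx.
        for (NSUInteger k = 1; k <= continuationCount; k++) {
            if ((bytes[i + k] & 0xC0) != 0x80)
                return NO;
        }
        i += continuationCount + 1;
    }
    return YES;
}
```

Fed the bytes `Y o u 0x92 r e`, this returns NO, because 0x92 is a continuation byte with nothing leading it. Fed `Y o u 0xE2 0x80 0x99 r e`, the three-byte UTF-8 form of ’, it returns YES.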