macOS [java] replacing all non alpha chars in a String with no char

MarkCollette · Dec 5, 2006

Good example of in-place string manipulation, and standard library use. But, may I suggest using two index variables, to reduce redundant copies, and migrating to strncpy or strlcpy. Of course strcpy is completely safe in your example, but it sets a good example to copy&pasters. Oh, and do you mean: if( ! isalpha(s )

MarkCollette · Dec 5, 2006

plinden said:
Not really relevant here, but you could remember it in the future.

When manipulating characters from Strings in Java, if the String is longer than about 300 characters, getting an array using toCharArray and iterating over the array is faster than using charAt. charAt is slightly faster than toCharArray with shorter Strings. (That's since JDK1.4 - with earlier versions the limit was about 15 characters)

This whole discussion reminded me of a change in Java 1.5. Apparently the new Unicode 4 standard adds more characters than would fit in 16 bits. So, Java 1.5 now supports UTF-16. So basically, if you want to be completely correct, you can no longer assume that a char primitive is sufficient to hold a character. In retrospect, everyone else using UTF-8 now sounds pretty smart. The point is, that the optimisation of getting the source char[] is no longer a good idea. If someone had used the simple String.charAt(int) and Character.isLetter(char) approach, then they could simply replace those with String.codePointAt(int) and Character.isLetter(int). Well, to be honest, they'd also have to change the looping to depend on Character.isSupplementaryCodePoint(int). Of course, you could just use the Character methods on the char[], but I think it makes the porting issue less straightforward.

String.codePointAt(int)
Character.isLetter(int)
Character.isSupplementaryCodePoint(int)

gnasher729 · Dec 6, 2006

ChrisA said:
I'll write it in C as compactly as possible you can translate to whatever you like

char s[] = "amajjjfgy, torsrfew";
for (i=0; i++; s){
if(isalpha(s){
strcpy(&s, &s[i+1]);
}
}

Yes I actually do use code like the above in real life. It has uses for example to remove prohibited characters from strings before using them to build an SQL query.

In real life, this has a good chance of crashing. strcpy for overlapping strings invokes undefined behavior. It might work on your particular machine today, but it can crash any time.

MarkCollette · Dec 6, 2006

gnasher729 said:
In real life, this has a good chance of crashing. strcpy for overlapping strings invokes undefined behavior. It might work on your particular machine today, but it can crash any time.

Another reason why Java is better: System.arraycopy guarantees proper behaviour with intra-array copying

bousozoku · Dec 6, 2006

MarkCollette said:
Another reason why Java is better: System.arraycopy guarantees proper behaviour with intra-array copying

It's also a call for sensible programming. Don't do what you shouldn't and things will work correctly.

gnasher729 · Dec 6, 2006

ChrisA said:
I'll write it in C as compactly as possible you can translate to whatever you like

char s[] = "amajjjfgy, torsrfew";
for (i=0; i++; s){
if(isalpha(s){
strcpy(&s, &s[i+1]);
}
}

Yes I actually do use code like the above in real life. It has uses for example to remove prohibited characters from strings before using them to build an SQL query.

It is worse than I thought. Apart from the undefined behavior in the call to strcpy, and the error in the test (it should have been ! isalpha (s )), execution time is quadratic, and it allows SQL code injections.

Try this first:

char* s = malloc (1000001);
memset (s, '*', 1000000);
s [1000000] = '\0';

Your code does one million calls to strcpy, each call copies on the average 500000 bytes, which means a total of 500 billion bytes are copied. Expect to wait for a few minutes.

But it actually runs twice as fast because there is a bug, and that bug is absolutely fatal: If s is a character that you want to reject, then you move all characters starting at s [i+1] forward by one byte, and the character that was previously stored at s [i+1] is not checked. In other words, your code will let any character through if I just put it into the input twice.

Lets say I am a registered user on your website, and I enter username and password (username = gnasher729 and password = abcdef), and you use this to construct a query like

username = "gnasher729" and password = "abcdef".

A hacker that wants to get into my account without knowing my password would enter gnasher729 as the user name. He enters the username X. As the password he enters

x"" or username == ""gnasher729"" and ""A"" == ""A

Your code strips out only the first of two consecutive non-letters, so its result is

x" or username = "gnasher729" and "A" = "A

If you put that into your query, the query looks like

username = "X" and password = "x" or username = "gnasher729" and "A" = "A"

In other words, since "A" = "A" is true, he will get into my account and create all kinds of trouble. Of course he can get into anyones account as long as he can guess the username. Classical SQL code injection.

Search

Search

macOS [java] replacing all non alpha chars in a String with no char

MarkCollette

macrumors 68000

MarkCollette

macrumors 68000

gnasher729

Suspended

MarkCollette

macrumors 68000

bousozoku

Moderator emeritus

gnasher729

Suspended

Our Staff