NSRegularExpression to capture C function arguments?




ArtOfWarfare
Jun 16, 2013, 11:45 AM
I wrote this NSRegularExpression for detecting Core Graphics C functions:

NSString *regexString = @"([_a-zA-Z][_0-9a-zA-Z]*)\\(context(?:,(-?[0-9]*.?[0-9]+))*\\);";
NSRegularExpression *regex = [[NSRegularExpression alloc] initWithPattern:regexString options:0 error:nil];

But it's not picking up individual arguments like I want.

For example, I want this string

CGContextMoveToPoint(context,0,100);

to have the following subranges captured:

CGContextMoveToPoint
0
100

But instead right now it picks up:

CGContextMoveToPoint
0,100

Why is it picking up that middle comma? I set it aside in its own non-capture group explicitly so it wouldn't be picked up and placed in any of the groups.

Here's my code using the regular expression:

[regex enumerateMatchesInString:codeString options:0 range:NSMakeRange(0, codeString.length) usingBlock:^(NSTextCheckingResult *result, NSMatchingFlags flags, BOOL *stop) {
    for (NSUInteger i = 1, n = result.numberOfRanges; i < n; i++) {
        NSLog(@"%@", [codeString substringWithRange:[result rangeAtIndex:i]]);
    }
}];

(It starts with i = 1 because rangeAtIndex:0 always holds the range of the entire matched string, whereas indexes 1 through numberOfRanges - 1 are supposed to hold the ranges of the individual capture groups.)



subsonix
Jun 16, 2013, 12:13 PM
Have you confirmed that the regex works as intended by itself?

ArtOfWarfare
Jun 16, 2013, 12:19 PM
> Have you confirmed that the regex works as intended by itself?

How do you mean? When I give it a small text file full of code it's able to extract all of the CG function calls, it just messes up finding those individual arguments.

subsonix
Jun 16, 2013, 12:25 PM
> How do you mean? When I give it a small text file full of code it's able to extract all of the CG function calls, it just messes up finding those individual arguments.

I looked at what you said here:


> to have the following subranges captured:
>
> CGContextMoveToPoint
> 0
> 100
>
> But instead right now it picks up:
>
> CGContextMoveToPoint
> 0,100


And suspected the regex, but perhaps I missed your point.

ArtOfWarfare
Jun 16, 2013, 12:32 PM
> I looked at what you said here:
>
> And suspected the regex, but perhaps I missed your point.

I'm confused. The regex was in the first code block of my original post. It's the NSString that's passed into NSRegularExpression's initWithPattern method. It's picking up the function calls fine. The issue I'm having is with the individual capture groups it reports. I want it to separately report each argument, not give me a single string with all of the arguments in it.

subsonix
Jun 16, 2013, 12:40 PM
> The issue I'm having is with the individual capture groups it reports. I want it to separately report each argument, not give me a single string with all of the arguments in it.

Yes, and perhaps that is down to the regular expression itself. I just asked whether you have confirmed that it does break the string down into each individual argument.

ArtOfWarfare
Jun 16, 2013, 12:49 PM
> Yes, and perhaps that is down to the regular expression itself. I just asked whether you have confirmed that it does break the string down into each individual argument.

I wrote it thinking it would, but as far as I can tell it doesn't. If it did, everything would work and I wouldn't be asking about it.

(?:,(-?[0-9]*.?[0-9]+))*

I put the inner ()'s around it so that it would be a separate capture group. The outer ()'s have the ?: prefix so that the , won't also get a capture group.

subsonix
Jun 16, 2013, 01:10 PM
> I wrote it thinking it would, but as far as I can tell it doesn't. If it did, everything would work and I wouldn't be asking about it.

I usually test them on their own, for example in the terminal. There could also be a problem elsewhere.


> (?:,(-?[0-9]*.?[0-9]+))*
>
> I put the inner ()'s around it so that it would be a separate capture group. The outer ()'s have the ?: prefix so that the , won't also get a capture group.

Wouldn't you need exactly one more group like the inner group, separated by a comma?

JoshDC
Jun 16, 2013, 05:44 PM
Looks like your first problem is with the dot here:

(-?[0-9]*.?[0-9]+)

I don't think it's matching what you think it does.

I'm also not convinced this will ever work. From a few small tests, it looks like ICU's regular expression engine (the engine behind NSRegularExpression) does not support a variable number of captures from a repeated group. It's explained well – albeit for Java – in this Stack Overflow (http://stackoverflow.com/questions/6939526/java-regex-repeating-capturing-groups) post.

I would strongly recommend looking at approaches other than regular expressions for this kind of task. The expression you're using already requires a fair level of effort to understand, and doesn't handle all valid function calls (whitespace is an obvious omission). You're going to end up with an unwieldy regular expression that still doesn't accurately match all valid calls, while disallowing all invalid ones.

NSScanner might be worth a look, but you'll likely end up requiring a proper parser.
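
To make both points concrete: with the unescaped dot, .? is free to match the comma, so a single pass of the repeated group reads 0 via [0-9]*, the comma via .?, and 100 via [0-9]+, which is where 0,100 comes from. Escaping the dot fixes that, but a repeated capture group then reports only its last iteration (100). One possible workaround, sketched here under the assumption that every argument after context is a plain decimal number, is to capture the whole argument text in one group and run a second regex over that range. The sample string, variable names, and main wrapper are illustrative only.

```objc
#import <Foundation/Foundation.h>

int main(void) {
    @autoreleasepool {
        NSString *code = @"CGContextMoveToPoint(context,0,100);";

        // Pass 1: group 1 is the function name, group 2 is the raw text of
        // all arguments after "context". The dot is escaped (\\.) so it can
        // only match a literal decimal point, never a comma.
        NSString *callPattern =
            @"([_a-zA-Z][_0-9a-zA-Z]*)\\(context((?:,-?[0-9]*\\.?[0-9]+)*)\\);";
        NSRegularExpression *callRegex =
            [NSRegularExpression regularExpressionWithPattern:callPattern options:0 error:NULL];

        // Pass 2: finds each individual number inside the range of group 2.
        NSRegularExpression *numberRegex =
            [NSRegularExpression regularExpressionWithPattern:@"-?[0-9]*\\.?[0-9]+" options:0 error:NULL];

        [callRegex enumerateMatchesInString:code
                                    options:0
                                      range:NSMakeRange(0, code.length)
                                 usingBlock:^(NSTextCheckingResult *call, NSMatchingFlags flags, BOOL *stop) {
            NSLog(@"%@", [code substringWithRange:[call rangeAtIndex:1]]);
            // Restrict the second search to the argument text of this call.
            [numberRegex enumerateMatchesInString:code
                                          options:0
                                            range:[call rangeAtIndex:2]
                                       usingBlock:^(NSTextCheckingResult *number, NSMatchingFlags innerFlags, BOOL *innerStop) {
                NSLog(@"%@", [code substringWithRange:number.range]);
            }];
        }];
    }
    return 0;
}
```

For the sample string this should log the function name followed by each number on its own line. It still shares the limitations mentioned above: whitespace, variables, and nested calls are not handled.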

ArtOfWarfare
Jun 17, 2013, 12:09 AM
> Looks like your first problem is with the dot here:
>
> (-?[0-9]*.?[0-9]+)
>
> I don't think it's matching what you think it does.

D'oh! I forgot . is a metacharacter in regex.

> I'm also not convinced this will ever work. From a few small tests, it looks like ICU's regular expression engine (the engine behind NSRegularExpression) does not support a variable number of captures from a repeated group. It's explained well – albeit for Java – in this Stack Overflow (http://stackoverflow.com/questions/6939526/java-regex-repeating-capturing-groups) post.
>
> I would strongly recommend looking at approaches other than regular expressions for this kind of task. The expression you're using already requires a fair level of effort to understand, and doesn't handle all valid function calls (whitespace is an obvious omission). You're going to end up with an unwieldy regular expression that still doesn't accurately match all valid calls, while disallowing all invalid ones.
>
> NSScanner might be worth a look, but you'll likely end up requiring a proper parser.

I'm looking into NSScanner right now... I feel like this should be a simple enough task for NSScanner + NSRegularExpression to make short work of, and I shouldn't need anything like ParseKit (which I've found to be quite powerful, but the documentation available for it is very sparse and often entirely incorrect).

chown33
Jun 17, 2013, 02:17 PM
> D'oh! I forgot . is a metacharacter in regex.

Google search terms: regex buddy mac os


> I'm looking into NSScanner right now... I feel like this should be a simple enough task for NSScanner + NSRegularExpression to make short work of, and I shouldn't need anything like ParseKit (which I've found to be quite powerful, but the documentation available for it is very sparse and often entirely incorrect).

I don't understand why you'd build a parser from scratch. As already pointed out above, an existing C parser might be a better fit. For example, googling c interpreter finds this C interpreter:
http://code.google.com/p/picoc/

It's entirely in C, has a makefile, and is offered under the BSD license. Even if you don't use it, it can be an example of how to write a C parser.

Pretty much any parser that produces an AST (http://en.wikipedia.org/wiki/Abstract_syntax_tree) would work. And finding one of those for C, even if it's in lex, yacc, bison, etc. would get you farther along than a scanner and reg-ex.

Heed this quote (http://en.wikiquote.org/wiki/Jamie_Zawinski):
Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems.


Also see:
http://www.codinghorror.com/blog/2008/06/regular-expressions-now-you-have-two-problems.html

I worked on a past project that started out using reg-ex for parsing both C and assembly language. It was one of the most complex monstrosities I've ever seen, and its performance was abysmal. It would take anywhere from many seconds to several minutes to parse what seemed like relatively simple things, and it was still incomplete. The reg-ex was eventually scrapped and replaced with a proper parser (lex & yacc), and it became much simpler and far faster.

ArtOfWarfare
Jun 17, 2013, 04:08 PM
> Google search terms: regex buddy mac os

That app doesn't exist for OS X, but I used the lite web version, regexpal.com. It doesn't highlight subexpressions, making it of limited value for this particular issue.

I just found reggy (http://reggyapp.com) which seems like it might be particularly helpful... also, it's free.

> I don't understand why you'd build a parser from scratch.

I thought this task was sufficiently simple that I didn't need a more robust parser than regex. But as I'm progressing, I'm sensing that I was wrong and that I do need something better like you mentioned.

> As already pointed out above, an existing C parser might be a better fit. For example, googling c interpreter finds this C interpreter:
> http://code.google.com/p/picoc/
>
> It's entirely in C, has a makefile, and is offered under the BSD license. Even if you don't use it, it can be an example of how to write a C parser.
>
> Pretty much any parser that produces an AST (http://en.wikipedia.org/wiki/Abstract_syntax_tree) would work. And finding one of those for C, even if it's in lex, yacc, bison, etc. would get you farther along than a scanner and reg-ex.

Thanks, I'll look into that.

subsonix
Jun 17, 2013, 04:34 PM
> Pretty much any parser that produces an AST (http://en.wikipedia.org/wiki/Abstract_syntax_tree) would work. And finding one of those for C, even if it's in lex, yacc, bison, etc. would get you farther along than a scanner and reg-ex.

Syntax trees are powerful in that they represent an expression precisely, including nesting, but they are also unnecessarily complex if they are not strictly needed, IMO.


> I thought this task was sufficiently simple that I didn't need a more robust parser than regex. But as I'm progressing, I'm sensing that I was wrong and that I do need something better like you mentioned.

You could also simplify the regex to match anything between the parentheses, then make sense of it once you have the individual components. Adding support for variables and hex values in the arguments, for example, would mean that anything but C reserved symbols would be legal.
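
A rough sketch of this suggestion, assuming the arguments contain no nested parentheses and no commas inside string literals: capture everything between the parentheses in one group, then split on commas afterwards. The pattern, sample input, and variable names are illustrative rather than taken from the project discussed in this thread.

```objc
#import <Foundation/Foundation.h>

int main(void) {
    @autoreleasepool {
        NSString *code = @"CGContextMoveToPoint(context, 0, 100);";

        // Group 1: function name. Group 2: everything between the parentheses.
        NSString *pattern = @"([_a-zA-Z][_0-9a-zA-Z]*)\\(([^()]*)\\);";
        NSRegularExpression *regex =
            [NSRegularExpression regularExpressionWithPattern:pattern options:0 error:NULL];
        NSCharacterSet *whitespace = [NSCharacterSet whitespaceAndNewlineCharacterSet];

        [regex enumerateMatchesInString:code
                                options:0
                                  range:NSMakeRange(0, code.length)
                             usingBlock:^(NSTextCheckingResult *result, NSMatchingFlags flags, BOOL *stop) {
            NSString *name = [code substringWithRange:[result rangeAtIndex:1]];
            NSString *argumentText = [code substringWithRange:[result rangeAtIndex:2]];
            NSLog(@"%@", name);
            // Split the raw argument text and trim whitespace around each piece.
            for (NSString *argument in [argumentText componentsSeparatedByString:@","]) {
                NSLog(@"%@", [argument stringByTrimmingCharactersInSet:whitespace]);
            }
        }];
    }
    return 0;
}
```

Note that context comes back as the first component here, and a call with an empty argument list yields a single empty string from componentsSeparatedByString:, so real code would probably want to filter that case.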

ArtOfWarfare
Jun 17, 2013, 11:17 PM
First screenshot of my first working prototype (attached to this post)

I got it working with a tiny bit of regex (to remove comments and whitespace) coupled with this NSScanner code:

NSMutableArray *commands = [[NSMutableArray alloc] init];
NSScanner *scanner = [NSScanner scannerWithString:string];

while (![scanner isAtEnd]) {
    // Everything up to the opening parenthesis is the command name.
    NSString *commandName = nil;
    [scanner scanUpToAndOverString:@"(" intoString:&commandName];
    NSMutableArray *arguments = [[NSMutableArray alloc] init];
    // Collect comma-separated arguments until the closing parenthesis.
    while (![scanner scanOverString:@")"]) {
        NSString *argument = nil;
        [scanner scanOverString:@","];
        // Only add the argument if something was scanned; adding nil would throw.
        if ([scanner scanUpToCharactersFromSet:[NSCharacterSet characterSetWithCharactersInString:@",)"] intoString:&argument]) {
            [arguments addObject:argument];
        }
        if ([scanner isAtEnd]) {
            NSLog(@"Error! Command unfinished!");
            break;
        }
    }
    if (![scanner scanOverString:@";"]) {
        NSLog(@"Error! Command not terminated by semicolon!");
        break;
    }
    [commands addObject:[[CGLCommand alloc] initWithString:commandName
                                              andArguments:arguments]];
}

I also added these methods to NSScanner in a category... they're both just convenience methods so that I don't have to have NULLs and repeated strings all over my code.

- (BOOL)scanOverString:(NSString *)string
{
    return [self scanString:string intoString:NULL];
}

- (BOOL)scanUpToAndOverString:(NSString *)endString
                   intoString:(NSString **)string
{
    [self scanUpToString:endString intoString:string];
    return [self scanOverString:endString];
}

I've tested it a bit and it's quite snappy with code of this length. I haven't tested how much code you have to type before it starts getting bogged down.