NSRegularExpression to capture C function arguments?

Discussion in 'Mac Programming' started by ArtOfWarfare, Jun 16, 2013.

  1. ArtOfWarfare, Jun 16, 2013
    Last edited: Jun 16, 2013

    macrumors 604

    ArtOfWarfare

    Joined:
    Nov 26, 2007
    #1
    I wrote this NSRegularExpression for detecting Core Graphics C functions:

    Code:
    NSString *regexString = @"([_a-zA-Z][_0-9a-zA-Z]*)\\(context(?:,(-?[0-9]*.?[0-9]+))*\\);";
    NSRegularExpression *regex = [[NSRegularExpression alloc] initWithPattern:regexString options:0 error:nil];
    But it's not picking up individual arguments like I want.

    For example, I want this string

    Code:
    CGContextMoveToPoint(context,0,100);
    to have the following subranges captured:

    Code:
    CGContextMoveToPoint
    0
    100
    But instead right now it picks up:

    Code:
    CGContextMoveToPoint
    0,100
    Why is it picking up that middle comma? I set it aside in its own non-capture group explicitly so it wouldn't be picked up and placed in any of the groups.

    Here's my code using the regular expression:

    Code:
    [regex enumerateMatchesInString:codeString options:0 range:NSMakeRange(0, codeString.length) usingBlock:^(NSTextCheckingResult *result, NSMatchingFlags flags, BOOL *stop) {
        for (NSUInteger i = 1, n = result.numberOfRanges; i < n; i++) {
            NSLog(@"%@", [codeString substringWithRange:[result rangeAtIndex:i]]);
        }
    }];
    (It starts with i = 1 because rangeAtIndex:0 always has the range of the entire matched string, whereas the 1 through numberOfRanges - 1 are supposed to have the matches for the individual capture groups.)
     
  2. macrumors 68040

    Joined:
    Feb 2, 2008
    #2
    Have you confirmed that the regex works as intended by itself?
     
  3. thread starter macrumors 604

    ArtOfWarfare

    Joined:
    Nov 26, 2007
    #3
    How do you mean? When I give it a small text file full of code it's able to extract all of the CG function calls, it just messes up finding those individual arguments.
     
  4. macrumors 68040

    Joined:
    Feb 2, 2008
    #4
    I looked at what you said here:

    And suspected the regex, but perhaps I missed your point.
     
  5. thread starter macrumors 604

    ArtOfWarfare

    Joined:
    Nov 26, 2007
    #5
    I'm confused. The regex was in the first code block of my original post. It's the NSString that's passed into NSRegularExpression's initWithPattern method. It's picking up the function calls fine. The issue I'm having is with the individual capture groups it reports. I want it to separately report each argument, not give me a single string with all of the arguments in it.
     
  6. macrumors 68040

    Joined:
    Feb 2, 2008
    #6
    Yes, and perhaps that is down to the regular expression itself. I just asked whether you have confirmed that it does break down the string into each individual argument.
     
  7. thread starter macrumors 604

    ArtOfWarfare

    Joined:
    Nov 26, 2007
    #7
    I wrote it thinking it would, but as far as I can tell it doesn't. If it did, everything would work and I wouldn't be asking about it.

    Code:
    (?:,(-?[0-9]*.?[0-9]+))*
    I put the inner ()'s around it so that it would be a separate capture group. The outer ()'s have the ?: prefix so that the , won't also get a capture group.
     
  8. macrumors 68040

    Joined:
    Feb 2, 2008
    #8
    I usually test them in the terminal, for example. There could also be a problem elsewhere.

    Wouldn't you need exactly one more group like the inner group separated by comma?
     
  9. macrumors regular

    Joined:
    Apr 8, 2009
    #9
    Looks like your first problem is with the dot here:

    Code:
    (-?[0-9]*.?[0-9]+)
    I don't think it's matching what you think it does.

    I'm also not convinced this will ever work. From a few small tests it looks like ICU's regular expressions (the engine behind NSRegularExpression) do not support variable capture groups: a group inside a repetition only keeps the last thing it matched. It's explained well (albeit for Java) in this Stack Overflow post.

    I would strongly recommend looking at other approaches than regular expression for this kind of task. The expression you're using already requires a fair level of effort to understand, and doesn't handle all valid function calls (whitespace is an obvious omission). You're going to end up with an unwieldy regular expression that still doesn't accurately match all valid calls, while disallowing all invalid ones.

    NSScanner might be worth a look, but you'll likely end up requiring a proper parser.
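
    To make both points concrete, here is a minimal sketch. The helper name CapturedGroups is made up for illustration; it only wraps firstMatchInString:options:range: and collects the text of each capture group.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper (not part of Foundation): returns the text of capture
// groups 1..n from the first match of `pattern` in `input`, or nil if the
// pattern is invalid or nothing matches.
static NSArray<NSString *> *CapturedGroups(NSString *pattern, NSString *input) {
    NSError *error = nil;
    NSRegularExpression *regex =
        [NSRegularExpression regularExpressionWithPattern:pattern options:0 error:&error];
    if (regex == nil) {
        NSLog(@"Bad pattern: %@", error);
        return nil;
    }
    NSTextCheckingResult *match =
        [regex firstMatchInString:input options:0 range:NSMakeRange(0, input.length)];
    if (match == nil) {
        return nil;
    }
    NSMutableArray<NSString *> *groups = [NSMutableArray array];
    for (NSUInteger i = 1; i < match.numberOfRanges; i++) {
        NSRange range = [match rangeAtIndex:i];
        // A group that took no part in the match reports NSNotFound.
        [groups addObject:(range.location == NSNotFound) ? @"" : [input substringWithRange:range]];
    }
    return groups;
}
```

    Run against CGContextMoveToPoint(context,0,100); with the original pattern, the second group comes back as 0,100 because the unescaped dot is free to match the comma. With the dot escaped as \\. in the NSString, the group repeats once per argument, but there are still only two groups in the result and the second one holds just the last repetition, 100.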
     
  10. thread starter macrumors 604

    ArtOfWarfare

    Joined:
    Nov 26, 2007
    #10
    D'oh! I forgot . is a metacharacter in regex.

    I'm looking into NSScanner right now... I feel like this should be a simple enough task for NSScanner + NSRegularExpression to make short work of, and I shouldn't need anything like ParseKit (which I've found quite powerful, although the documentation available for it is very sparse and often entirely incorrect).
     
  11. macrumors 603

    Joined:
    Aug 9, 2009
    #11
    Google search terms: regex buddy mac os


    I don't understand why you'd build a parser from scratch. As already pointed out above, an existing C parser might be a better fit. For example, googling c interpreter finds this C interpreter:
    It's entirely in C, has a makefile, and is offered under the BSD license. Even if you don't use it, it can be an example of how to write a C parser.

    Pretty much any parser that produces an AST would work. And finding one of those for C, even if it's in lex, yacc, bison, etc. would get you farther along than a scanner and reg-ex.

    Heed this quote:
    Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems.

    Also see:
    http://www.codinghorror.com/blog/2008/06/regular-expressions-now-you-have-two-problems.html

    I've worked on a past project that started out using reg-ex for parsing both C and assembly language. It was one of the most complex monstrosities I've ever seen, and its performance was abysmal. It would take anywhere from many seconds to several minutes to parse what seemed like relatively simple things, and it was still incomplete. The reg-ex was eventually scrapped and replaced with a proper parser (lex & yacc) and it became much simpler and far faster.
     
  12. thread starter macrumors 604

    ArtOfWarfare

    Joined:
    Nov 26, 2007
    #12
    That app doesn't exist in OS X, but I used the lite web version, regexpal.com. It doesn't highlight subexpressions making it of limited value for this particular issue.

    I just found reggy which seems like it might be particularly helpful... also, it's free.

    I thought this task was sufficiently simple that I didn't need a more robust parser than regex. But as I'm progressing, I'm sensing that I was wrong and that I do need something better like you mentioned.

    Thanks, I'll look into that.
     
  13. macrumors 68040

    Joined:
    Feb 2, 2008
    #13
    Syntax trees are powerful in that they represent an expression precisely, including nesting, but they are also unnecessarily complex if they are not strictly needed, IMO.

    You could also simplify the regex to match anything between the parentheses, then make sense of it once you have the individual components. Adding support for variables and hex values in the arguments, for example, would mean that anything but C reserved symbols would be legal.
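
    A rough sketch of that two-step idea, assuming flat argument lists (no nested calls, and no commas or parentheses inside string literals). The helper name ParseCalls is hypothetical:

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: returns one array per call found in `code`,
// laid out as [functionName, argument1, argument2, ...].
// Assumption: argument lists are flat, so "[^()]*" is enough to capture them.
static NSArray<NSArray<NSString *> *> *ParseCalls(NSString *code) {
    // Group 1 is the function name, group 2 is everything between the parentheses.
    NSString *pattern = @"([_a-zA-Z][_0-9a-zA-Z]*)\\s*\\(([^()]*)\\)\\s*;";
    NSRegularExpression *regex =
        [NSRegularExpression regularExpressionWithPattern:pattern options:0 error:NULL];
    NSCharacterSet *whitespace = [NSCharacterSet whitespaceAndNewlineCharacterSet];
    NSMutableArray<NSArray<NSString *> *> *calls = [NSMutableArray array];

    NSArray<NSTextCheckingResult *> *matches =
        [regex matchesInString:code options:0 range:NSMakeRange(0, code.length)];
    for (NSTextCheckingResult *match in matches) {
        NSString *name = [code substringWithRange:[match rangeAtIndex:1]];
        NSString *argumentList = [code substringWithRange:[match rangeAtIndex:2]];
        NSMutableArray<NSString *> *call = [NSMutableArray arrayWithObject:name];
        // Split on commas, then trim; an empty list "()" yields no arguments.
        for (NSString *piece in [argumentList componentsSeparatedByString:@","]) {
            NSString *argument = [piece stringByTrimmingCharactersInSet:whitespace];
            if (argument.length > 0) {
                [call addObject:argument];
            }
        }
        [calls addObject:call];
    }
    return calls;
}
```

    The regex only has to find the call and hand back the raw argument list; splitting happens in ordinary code, so the number of arguments no longer has to be baked into the pattern.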
     
  14. ArtOfWarfare, Jun 17, 2013
    Last edited: Jun 17, 2013

    thread starter macrumors 604

    ArtOfWarfare

    Joined:
    Nov 26, 2007
    #14
    First screenshot of my first working prototype (attached to this post)

    I got it working with a tiny bit of regex (to remove comments and whitespace) coupled with this NSScanner code:

    Code:
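    NSMutableArray *commands = [[NSMutableArray alloc] init];
    NSScanner *scanner = [NSScanner scannerWithString:string];
    NSCharacterSet *argumentEnd = [NSCharacterSet characterSetWithCharactersInString:@",)"];

    while (![scanner isAtEnd]) {
        // Everything up to the opening parenthesis is the command name.
        NSString *commandName = nil;
        if (![scanner scanUpToAndOverString:@"(" intoString:&commandName] || commandName == nil) {
            NSLog(@"Error! Expected a command name followed by '('!");
            break;
        }
        // Collect comma-separated arguments until the closing parenthesis.
        NSMutableArray *arguments = [[NSMutableArray alloc] init];
        BOOL finished = YES;
        while (![scanner scanOverString:@")"]) {
            NSString *argument = nil;
            [scanner scanOverString:@","];
            // A failed scan leaves argument nil, and adding nil to an array throws.
            if (![scanner scanUpToCharactersFromSet:argumentEnd intoString:&argument] || [scanner isAtEnd]) {
                NSLog(@"Error! Command unfinished!");
                finished = NO;
                break;
            }
            [arguments addObject:argument];
        }
        if (!finished) {
            break;
        }
        if (![scanner scanOverString:@";"]) {
            NSLog(@"Error! Command not terminated by semicolon!");
            break;
        }
        [commands addObject:[[CGLCommand alloc] initWithString:commandName
                                                  andArguments:arguments]];
    }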
    I also added these methods to NSScanner in a category... they're both just convenience methods so that I don't have to have NULLs and repeated strings all over my code.

    Code:
    - (BOOL)scanOverString:(NSString *)string
    {
    	return [self scanString:string intoString:NULL];
    }
    
    - (BOOL)scanUpToAndOverString:(NSString *)endString
                       intoString:(NSString **)string
    {
    	[self scanUpToString:endString intoString:string];
    	return [self scanOverString:endString];
    }
    I've tested it a bit and it's quite snappy with code of this length. I haven't tested how much code you have to type before it starts getting bogged down.
     
