Silly awk regular expression question.

Discussion in 'Mac Programming' started by jemo07, Jan 31, 2010.

  1. jemo07 macrumors regular

    Joined:
    May 6, 2006
    Location:
    Madrid, Spain
    #1
    Hi,

    I am running into a small issue while trying to load a file into a DB.
    Basically, I am trying to extract the numbers from a string that contains a lot of garbage.
    I tried a simple awk hack, but it fails. I can't figure out what I am doing wrong. Here is a sample using echo to keep it short:

    echo "$%$&&$··....aaaffff><SPP0022555445DFDDSDFvdbdbd" |awk ' { /[0-9]+/ ; print }'

    I expected "0022555445" to print but instead I get:
    $%$&&?·....aaaffff><SPP0022555445DFDDSDFvdbdbd :(

    Then I thought, wait, maybe I am not selecting correctly. So I ran a simple test that replaces my selection with something else, like this:

    echo "$%$&&$··....aaaffff><SPP0022555445DFDDSDFvdbdbd" |awk ' { sub(/[0-9]+/, "<AAAAA>"); print }'

    Here is what I get
    $%$&&?·....aaaffff><SPP<AAAAA>DFDDSDFvdbdbd :mad:

    As you can see, I am selecting all the digits and replacing them with "<AAAAA>" correctly.

    I am a little rusty and can't figure this one out. Again, all I want is to print all the digits in the range [0-9] from a string.

    Thank you all in advance for helping out.

    Regards,

    Jose
     
  2. chown33 macrumors 604

    Joined:
    Aug 9, 2009
    #2
    I recommend that you read the awk man page.

    http://developer.apple.com/mac/library/DOCUMENTATION/Darwin/Reference/ManPages/man1/awk.1.html

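    The action { /[0-9]+/ ; print } does not mean what you seem to think it means. Inside the braces, the regular expression is just an expression whose result is discarded, so print still outputs the whole line.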

    A line-matching pattern goes outside the {}'s. But then this:
    Code:
    awk ' /[0-9]+/ { print }'
    
    won't do what you say you want, either. Which is why you should Read The Fine Man page.
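
A minimal sketch of the difference between the two placements. The two-line input is a made-up example, not data from the thread, and the variable names are only for illustration.

```shell
# Hypothetical two-line input: only the second line contains a digit.
sample='abc
abc123'

# Pattern inside the braces: the match result is computed and thrown away,
# so print runs for every line.
inside=$(printf '%s\n' "$sample" | awk '{ /[0-9]+/ ; print }')

# Pattern outside the braces: print runs only for lines that match,
# but it still prints the whole line, not just the digits.
outside=$(printf '%s\n' "$sample" | awk '/[0-9]+/ { print }')

printf 'inside:\n%s\n' "$inside"     # both lines
printf 'outside:\n%s\n' "$outside"   # abc123 only
```

Either way the regular expression only decides whether something happens; it never narrows what print outputs.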

    I think you will need the match() builtin function, or possibly the split() builtin function, or both.

    Yet another strategy is to use gsub to replace all non-[0-9] chars with a blank, then strip the whitespace from the resulting string.

    It's also unclear to me what should happen if the input contains multiple digit-sequences separated by non-digits, e.g. "xyzzy987foo42at694". Again, that would be something match() or split() might best be applied to.
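
One possible way to apply match() to the multi-sequence case, as a sketch only. The helper name extract_runs is hypothetical, and printing the runs separated by single spaces is an assumption about the desired output.

```shell
# extract_runs is a hypothetical helper name, not something from the thread.
extract_runs() {
    awk '{
        s = $0
        out = ""
        # match() returns 0 when nothing matches; otherwise it sets
        # RSTART and RLENGTH to the position and length of the match.
        while (match(s, /[0-9]+/)) {
            if (out != "") out = out " "
            out = out substr(s, RSTART, RLENGTH)
            # Drop everything up to the end of this match and search again.
            s = substr(s, RSTART + RLENGTH)
        }
        print out
    }'
}

echo 'xyzzy987foo42at694' | extract_runs                    # 987 42 694
echo 'aaaffff><SPP0022555445DFDDSDFvdbdbd' | extract_runs   # 0022555445
```

A line with no digits at all prints as an empty line with this sketch; whether that is the right behavior depends on what the DB load expects.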
     
  3. cqexbesd macrumors regular

    Joined:
    Jun 4, 2009
    #3
    If you know all the lines have that format, then a dirty, hacky method that relies on it is:

    Code:
    awk -F '[^0-9]+' '{ print $2 }'
    but you are probably better off doing something like:

    Code:
    awk '{ gsub(/[^0-9]+/, " "); print }'
    which will replace each run of non-digit characters on the line with a single space and then print the resulting string. Exactly what you want to use depends on the format of the incoming data and on exactly what you want to get out, of course.
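
To make the two commands above concrete, here is a sketch run against the ASCII part of the sample from post #1 and the multi-sequence string from post #2. The variable names and the extra trimming gsub are additions for illustration, not part of the original commands.

```shell
line='aaaffff><SPP0022555445DFDDSDFvdbdbd'   # ASCII part of the sample from post #1
multi='xyzzy987foo42at694'                   # multi-sequence example from post #2

# Field-splitting approach: runs of non-digits are the separators, so field 1
# is the empty text before the first separator and field 2 is the first digit run.
first_field=$(echo "$line" | awk -F '[^0-9]+' '{ print $2 }')

# gsub approach: each non-digit run becomes one space; the second gsub
# (an addition here) trims the leading and trailing spaces.
spaced=$(echo "$multi" | awk '{ gsub(/[^0-9]+/, " "); gsub(/^ +| +$/, ""); print }')

echo "$first_field"   # 0022555445
echo "$spaced"        # 987 42 694
```

Note that the field-splitting version depends on the line starting with a non-digit: if a line begins with digits, the first run lands in $1 instead of $2.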

    HTH,

    Andrew
     
  4. jemo07 thread starter macrumors regular

    Joined:
    May 6, 2006
    Location:
    Madrid, Spain
    #4
    Andrew, that is what I was looking for. I went the much longer route and created a long nested exclusion with:
    Code:
    awk '{ sub(/[ffa-z\$.................]+/, " "); print }'
    But your code is exactly what I was trying to accomplish.

    Thank you very much! I will be able to get this done a little quicker now, since I can reuse this example in many other ways to parse the same file!

    Jose
     
