grep recognizing CRs rather than LFs

Discussion in 'Mac Programming' started by mysterytramp, Apr 3, 2010.

  1. mysterytramp macrumors 65816

    mysterytramp

    Joined:
    Jul 17, 2008
    Location:
    Maryland
    #1
    (apologies in advance; I'm a bit of a Unix noob)

    I need to process some fairly large text files created on a Mac. It would save several steps if grep recognized a line ending in a carriage return instead of a linefeed, but I don't see anything in the man page to do it.

    Is there some Unix voodoo to pull this off?

    I could tr the CRs to LFs, except tr appears to have a file size limitation.

    mt
     
  2. robbieduncan Moderator emeritus

    robbieduncan

    Joined:
    Jul 24, 2002
    Location:
    London
    #2
  3. mdatwood macrumors 6502a

    Joined:
    Mar 14, 2010
    Location:
    Denver, CO
    #3
    Are you on Windows trying to process a file created on a Mac? Looking at 'man grep' I came across this option (emphasis mine):

    It seems that specifying -U would keep grep from stripping the CRs in your file if you're on Windows.

    Here is a link with lots of methods of changing the LFs if you're on the mac/unix. I prefer sed.
     
  4. chown33 macrumors 604

    Joined:
    Aug 9, 2009
    #4
    Post your command-line.

    tr is a filter, and should be completely oblivious to file size. It simply processes data as the bytes go through.

    Also, can you be more precise about the file size, especially the file size where you see a problem in tr? "Fairly large" is too vague.
     
  5. mysterytramp thread starter macrumors 65816

    mysterytramp

    Joined:
    Jul 17, 2008
    Location:
    Maryland
    #5
    Here's what I've entered:

    testfile.ged is 1.1 MB on disk (1,138,814 bytes)

    workfile.txt is 213 KB on disk (212,144 bytes)

    testfile.ged is a GEDCOM, which is a textual representation of genealogical data. Each line has a tag that describes the information on the rest of the line. Ultimately, I'm looking for "1 NAME" lines to create a list of surnames.

    I'm working on a Mac. grep-ping testfile.ged produced nothing useful until the CRs were replaced with LFs.

    I assumed tr's exit message, "tr: Illegal byte sequence", meant it couldn't handle a file as big as testfile.ged. Looking at testfile.ged, it appears tr didn't like a left curly quote.

    Now does this mean tr can't handle higher ASCII characters?

    mt
     
  6. chown33 macrumors 604

    Joined:
    Aug 9, 2009
    #6
    Not exactly. The problem is that tr is interpreting your input file as UTF-8, and UTF-8 requires certain byte values in a specific order. That is, some byte sequences are invalid (read the Wikipedia page on UTF-8).

    The trick is to tell tr to use an encoding other than UTF-8. That requires setting some environment variables when tr is run.

    I did some poking around, so try this:
    Code:
    LANG=en_US.ISO8859-1 tr '\r' '\n' < testfile.ged > workfile.txt
    
    If that works, post a reply and I'll explain. Or read the docs: 'man locale', 'man tr' (which mentions the dependence on the LC_* and LANG env-vars), and these URLs:

    http://www.opengroup.org/onlinepubs/007908799/xbd/locale.html#tag_005_003

    http://www.opengroup.org/onlinepubs/007908799/xbd/envvar.html
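
A variant of the command above, as a minimal sketch: LC_ALL=C selects the POSIX "C" locale, in which tr treats every byte as a single character. LC_ALL takes precedence over LANG and the other LC_* variables, so it still applies if one of those is already set in the environment. The file names and sample contents are made up for illustration; octal \322 and \323 are the curly double quotes in Mac OS Roman and do not form valid UTF-8 here.

```shell
# Hypothetical sample: three CR-terminated records, one containing
# Mac OS Roman curly quotes (bytes 0xD2 and 0xD3).
printf '0 HEAD\r1 NAME Maria /BROUWER/\r1 NOTE \322quoted\323\r' > sample.ged

# Force the byte-oriented C locale for this one command only.
LC_ALL=C tr '\r' '\n' < sample.ged > sample.txt

# Count the LF-terminated lines that grep and friends will now see.
wc -l < sample.txt
```

Because the variable assignment is a prefix to the command, it affects only that tr invocation and leaves the shell's own locale alone.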


    EDIT:
    If it doesn't work, you can use HexFiend to replace CR (0d) with LF (0a) on the entire file. Then Save or Save As... and it should be ready to go.
    http://www.ridiculousfish.com/hexfiend/
     
  7. mdatwood macrumors 6502a

    Joined:
    Mar 14, 2010
    Location:
    Denver, CO
    #7
  8. mysterytramp thread starter macrumors 65816

    mysterytramp

    Joined:
    Jul 17, 2008
    Location:
    Maryland
    #8
    It indeed worked. Thanks.

    mt
     
  9. mysterytramp thread starter macrumors 65816

    mysterytramp

    Joined:
    Jul 17, 2008
    Location:
    Maryland
    #9
    One more issue ...

    I'm looking for lines that would say:

    I want the data between the slashes. I was using sed to strip out the unnecessary parts to leave the surname. (I'm doing this in AppleScript)

    Code:
    set shellString to "sed 's/1 NAME.* \\///g' < " & temp2FilePosix & " > " & temp3FilePosix
    
    (AppleScript needs the backslash escaped. Here's what gets sent to the command line: sed 's/1 NAME.* \///g')

    A separate sed line strips the final slash.

    This works fine so long as the name line follows the pattern in the example. A few dozen times (in a 52,000-line data file), there's no space between the middle name and the first slash. sed misses it, which fouls up downstream processing. (After sed, the text file is sort'd and uniq'd.)

    1) is there a way to get sed to catch:

    1 NAME Maria/BROUWER/

    That is, no space between 1 NAME Maria and the first slash?

    2) or is there a better way to approach this? Can grep pluck out the data between the slashes?

    Thanks in advance.

    mt
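
On question 1, one possible approach, offered as a sketch rather than a tested fix for the real file: match everything up to the first slash with [^/]* instead of requiring a space before it, capture the text between the first two slashes, and print only the lines where the substitution succeeded. That would also fold the second sed pass into the first. The file name and sample lines below are invented.

```shell
# Invented sample lines: with and without a space before the first slash,
# plus a non-NAME line that should be ignored.
printf '%s\n' '1 NAME Maria Anna /BROUWER/' '1 NAME Maria/BROUWER/' \
    '1 SEX F' '1 NAME Jan /DE VRIES/' > names.txt

# -n suppresses default output; the p flag prints only lines where the
# substitution matched. Using | as the delimiter avoids escaping slashes.
# \( \) captures the surname between the first two slashes.
sed -n 's|^1 NAME[^/]*/\([^/]*\)/.*$|\1|p' names.txt | sort | uniq
```

When the command is built inside an AppleScript string, each backslash in \( \) and \1 would still need doubling.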
     
  10. chown33 macrumors 604

    Joined:
    Aug 9, 2009
    #10
    I'd use awk.

    The matching pattern is $1 == 1 && $2 == "NAME".

    The action looks at $3. If it contains "/", it needs special handling. You could use split() at "/" and then take the array element. Or you could use index() and substr(), or use match() and RSTART and RLENGTH.

    If $3 doesn't contain "/", then print $4 stripped of slashes (e.g. gsub, or match() and RSTART and RLENGTH, or whatever).
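
The split() route could look like the sketch below. It departs slightly from the field-by-field description above: instead of checking $3 and $4 separately, it splits the whole line at "/", so the surname is always the second piece however many given names precede it. The file name and sample lines are invented.

```shell
# Invented sample lines covering both spacings and two non-NAME records.
printf '%s\n' '1 NAME Maria Anna /BROUWER/' '1 NAME Maria/BROUWER/' \
    '1 SEX F' '0 @I2@ INDI' > names.txt

awk '$1 == 1 && $2 == "NAME" {
    # split() returns the number of pieces; a line containing /SURNAME/
    # yields at least 3 (before, between, and after the slashes).
    n = split($0, parts, "/")
    if (n >= 3) print parts[2]
}' names.txt
```

Piping the result through sort and uniq, as in the existing workflow, would then give the surname list.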
     
  11. balamw Moderator

    balamw

    Staff Member

    Joined:
    Aug 16, 2005
    Location:
    New England
    #11
    Seems like a perfect job for perl. Whenever I feel the need to reach for more than one of grep, awk or sed, perl is there.

    B
     
