Mac grep recognizing CRs rather than LFs

mysterytramp

macrumors 65816
Original poster
Jul 17, 2008
1,335
3
Maryland
(apologies in advance; I'm a bit of a Unix noob)

I need to process some fairly large text files created on a Mac. It would save several steps if grep recognized a line ending in a carriage return instead of a linefeed, but I don't see anything in the man page to do it.

Is there some Unix voodoo to pull this off?

I could tr the CRs to LFs, except tr appears to have a file size limitation.

mt
 

mdatwood

macrumors 6502a
Mar 14, 2010
694
233
East Coast, USA
Are you on Windows trying to process a file created on the Mac? Looking at 'man grep', I came across this option (emphasis mine):

-U, --binary
Treat the file(s) as binary. By default, under MS-DOS and MS-Windows, grep guesses the file type by looking at the contents of the first 32KB read from the file. If grep decides the file is a text file, it strips the CR characters from the original file contents (to make regular expressions with ^ and $ work correctly). Specifying -U overrules this guesswork, causing all files to be read and passed to the matching mechanism verbatim; if the file is a text file with CR/LF pairs at the end of each line, this will cause some regular expressions to fail. This option has no effect on platforms other than MS-DOS and MS-Windows.
It seems that specifying -U would keep grep from stripping the CRs in your file if you're on Windows.

Here is a link with lots of methods of converting the line endings if you're on Mac/Unix. I prefer sed.
 

chown33

Moderator
Staff member
Aug 9, 2009
8,488
4,497
Restivus
mysterytramp said:
I could tr the CRs to LFs, except tr appears to have a file size limitation.
Post your command-line.

tr is a filter, and should be completely oblivious to file size. It simply processes data as the bytes go through.

Also, can you be more precise about the file size, especially the file size where you see a problem in tr? "Fairly large" is too vague.
 

mysterytramp

Here's what I've entered:

$ tr '\r' '\n' < testfile.ged > workfile.txt
tr: Illegal byte sequence
$
testfile.ged is 1.1 MB on disk (1,138,814 bytes)

workfile.txt is 213 KB on disk (212,144 bytes)

testfile.ged is a GEDCOM, which is a textual representation of genealogical data. Each line has a tag that describes the information on the rest of the line. Ultimately, I'm looking for "1 NAME" lines to create a list of surnames.

I'm working on a Mac. grep-ping testfile.ged produced nothing useful until the CRs were replaced with LFs.

I assumed tr's exit message, "tr: Illegal byte sequence", meant it couldn't handle a file as big as testfile.ged. Looking at testfile.ged, it appears tr didn't like a left curly quote.

Now does this mean tr can't handle higher ASCII characters?

mt
 

chown33

mysterytramp said:
I assumed tr's exit message, "tr: Illegal byte sequence", meant it couldn't handle a file as big as testfile.ged. Looking at testfile.ged, it appears tr didn't like a left curly quote.

Now does this mean tr can't handle higher ASCII characters?
Not exactly. The problem is that tr is interpreting your input file as UTF-8, and UTF-8 requires certain byte values in a specific order. That is, some byte sequences are invalid (read the Wikipedia page on UTF-8).

The trick is to tell tr to use an encoding other than UTF-8. That requires setting some environment variables when tr is run.

I did some poking around, so try this:
Code:
LANG=en_US.ISO8859-1 tr '\r' '\n' < testfile.ged > workfile.txt
If that works, post a reply and I'll explain. Or read the docs for 'man locale' and 'man tr', which mentions the dependence on the LC_* and LANG env-vars, and these URLs:

http://www.opengroup.org/onlinepubs/007908799/xbd/locale.html#tag_005_003

http://www.opengroup.org/onlinepubs/007908799/xbd/envvar.html
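
For anyone who wants to try the locale trick without the original GEDCOM, here is a minimal sketch. It assumes that forcing the plain "C" locale with LC_ALL is acceptable in place of LANG=en_US.ISO8859-1 (either way tr stops trying to decode UTF-8). The file names sample_cr.ged and sample_lf.txt are made up for the illustration, and the sample contains byte 0xD2, which is the left curly quote in the old Mac Roman encoding and is not valid UTF-8 on its own.

```shell
# Build a small CR-terminated sample (hypothetical file name). \322 and \323
# are octal for 0xD2 and 0xD3, the Mac Roman curly double quotes.
printf '0 HEAD\r1 NAME Maria /BROUWER/\r1 NOTE \322quoted\323\r' > sample_cr.ged

# LC_ALL=C makes tr treat its input as plain bytes, so no byte sequence
# can be rejected as illegal.
LC_ALL=C tr '\r' '\n' < sample_cr.ged > sample_lf.txt

# With LF line endings, grep sees one record per line.
LC_ALL=C grep -c '^1 NAME' sample_lf.txt
```

LC_ALL overrides LANG and every other LC_* variable, which is why it is used here; setting it on the command line affects only that one command.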


EDIT:
If it doesn't work, you can use HexFiend to replace CR (0d) with LF (0a) on the entire file. Then Save or Save As... and it should be ready to go.
http://www.ridiculousfish.com/hexfiend/
 

mysterytramp

One more issue ...

I'm looking for lines that would say:

1 NAME Maria /BROUWER/
I want the data between the slashes. I was using sed to strip out the unnecessary parts to leave the surname. (I'm doing this in AppleScript)

Code:
set shellString to "sed 's/1 NAME.* \\///g' < " & temp2FilePosix & " > " & temp3FilePosix
(AppleScript needs the backslash escaped. Here's what gets sent to the command line: sed 's/1 NAME.* \///g')

A separate sed line strips the final slash.

This works fine so long as the name line follows the pattern in the example. A few dozen times (in a 52,000-line data file), there's no space between the middle name and the first slash. sed misses it, which fouls up downstream processing. (After sed, the text file is sort'd and uniq'd.)

1) is there a way to get sed to catch:

1 NAME Maria/BROUWER/

That is, no space between 1 NAME Maria and the first slash?

2) or is there a better way to approach this? Can grep pluck out the data between the slashes?

Thanks in advance.

mt
 

chown33

mysterytramp said:
This works fine so long as the name line follows the pattern in the example. A few dozen times (in a 52,000-line data file), there's no space between the middle name and the first slash. sed misses it, which fouls up downstream processing. (After sed, the text file is sort'd and uniq'd.)

1) is there a way to get sed to catch:

1 NAME Maria/BROUWER/

That is, no space between 1 NAME Maria and the first slash?

2) or is there a better way to approach this? Can grep pluck out the data between the slashes?
I'd use awk.

The matching pattern is $1 == 1 && $2 == "NAME".

The action looks at $3. If it contains "/", it needs special handling. You could use split() at "/" and then take the array element. Or you could use index() and substr(), or use match() and RSTART and RLENGTH.

If $3 doesn't contain "/", then print $4 stripped of slashes (e.g. gsub, or match() and RSTART and RLENGTH, or whatever).
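
A rough sketch of the match()/RSTART/RLENGTH variant follows. It takes a shortcut that is an assumption rather than exactly what is described above: instead of checking $3 and $4 separately, it runs match() against the whole line, so the space and no-space forms are handled by the same code. The function name surnames and the file names_sample.txt are invented for the example.

```shell
# Sample input (hypothetical file name): both spacing variants plus two
# lines that must be ignored.
printf '1 NAME Maria /BROUWER/\n1 NAME Maria/BROUWER/\n1 SEX F\n2 NAME x /y/\n' > names_sample.txt

# Print the text between the first pair of slashes on each "1 NAME" line,
# then sort and de-duplicate, as in the existing sed pipeline.
surnames() {
    awk '$1 == 1 && $2 == "NAME" {
        # match() sets RSTART and RLENGTH to the first /.../ span on the line.
        if (match($0, /\/[^\/]*\//))
            # Skip the leading slash and drop the trailing one.
            print substr($0, RSTART + 1, RLENGTH - 2)
    }' "$1" | sort | uniq
}

surnames names_sample.txt
```

Since the pattern only fires on lines whose first two fields are 1 and NAME, no separate grep pass should be needed before it, and name lines with no slashes at all are silently skipped.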