macOS grep recognizing CRs rather than LFs

mysterytramp · Apr 3, 2010

(apologies in advance; I'm a bit of a Unix noob)

I need to process some fairly large text files created on a Mac. It would save several steps if grep recognized a line ending in a carriage return instead of a linefeed, but I don't see anything in the man page to do it.

Is there some unix voodoo to pull this off?

I could tr the CRs to LFs, except tr appears to have a file size limitation.

mt

robbieduncan · Apr 3, 2010

You could try using dos2unix instead of tr to see if that works?

Edit: or try the commands suggested on this wiki page: http://en.wikipedia.org/wiki/Newline

mdatwood · Apr 3, 2010

Are you on Windows trying to process a file created on a Mac? Looking at 'man grep', I came across this option (emphasis mine):

-U, --binary
Treat the file(s) as binary. By default, under MS-DOS and MS-Windows, grep guesses the file type by looking at the contents of the first 32KB read from the file. If grep decides the file is a text file, it strips the CR characters from the original file contents (to make regular expressions with ^ and $ work correctly). Specifying -U overrules this guesswork, causing all files to be read and passed to the matching mechanism verbatim; if the file is a text file with CR/LF pairs at the end of each line, this will cause some regular expressions to fail. This option has no effect on platforms other than MS-DOS and MS-Windows.

Seems that specifying -U would keep grep from stripping the CRs in your file if you're on Windows.

Here is a link with lots of methods of changing the line endings if you're on Mac/Unix. I prefer sed.

chown33 · Apr 3, 2010

mysterytramp said:
I could tr the CRs to LFs, except tr appears to have a file size limitation.

Post your command-line.

tr is a filter, and should be completely oblivious to file size. It simply processes data as the bytes go through.

Also, can you be more precise about the file size, especially the file size where you see a problem in tr? "Fairly large" is too vague.
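One way to check whether tr really stopped partway is to compare byte counts. tr swaps one byte for one byte here, so input and output should come out the same size; a smaller output means it quit before the end of the input. A minimal sketch, using a made-up three-line sample (the contents and variable names are only for illustration):

```shell
# Hypothetical sample: three CR-terminated lines standing in for the real file.
sample=$(mktemp)
converted=$(mktemp)
printf '0 HEAD\r1 NAME Maria /BROUWER/\r0 TRLR\r' > "$sample"

# Translate every CR to an LF, one byte in, one byte out.
tr '\r' '\n' < "$sample" > "$converted"

# wc -c pads its output with spaces on some systems, so strip them.
in_bytes=$(wc -c < "$sample" | tr -d ' ')
out_bytes=$(wc -c < "$converted" | tr -d ' ')
echo "input: $in_bytes bytes, output: $out_bytes bytes"

rm -f "$sample" "$converted"
```

If the two numbers differ on your real file, the size of the output tells you roughly where in the input tr gave up.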

mysterytramp · Apr 3, 2010

Here's what I've entered:

$ tr '\r' '\n' < testfile.ged > workfile.txt
tr: Illegal byte sequence
$

testfile.ged is 1.1 MB on disk (1,138,814 bytes)

workfile.txt is 213 KB on disk (212,144 bytes)

testfile.ged is a GEDCOM, which is a textual representation of genealogical data. Each line has a tag that describes the information on the rest of the line. Ultimately, I'm looking for "1 NAME" lines to create a list of surnames.

I'm working on a Mac. grep-ping testfile.ged produced nothing useful until the CRs were replaced with LFs.

I assumed tr's exit message, "tr: Illegal byte sequence", meant it couldn't handle a file as big as testfile.ged. Looking at testfile.ged, it appears tr didn't like a left curly quote.

Now does this mean tr can't handle higher ASCII characters?

mt

chown33 · Apr 3, 2010

mysterytramp said:
I assumed tr's exit message, "tr: Illegal byte sequence", meant it couldn't handle a file as big as testfile.ged. Looking at testfile.ged, it appears tr didn't like a left curly quote.

Now does this mean tr can't handle higher ASCII characters?

Not exactly. The problem is that tr is interpreting your input file as UTF-8, and UTF-8 requires certain byte values in a specific order. That is, some byte sequences are invalid (read the Wikipedia page on UTF-8).

The trick is to tell tr to use an encoding other than UTF-8. That requires setting some environment variables when tr is run.
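To see whether a given file is the problem, you can ask iconv to read it as UTF-8 and check whether it complains. This is only a sketch: is_valid_utf8 is a made-up helper name, and the sample assumes a single-byte Mac encoding where a curly double quote is stored as the byte 0xD2 or 0xD3 (octal 322 and 323), which may not be what your file actually uses.

```shell
# Hypothetical helper: succeeds only if iconv can decode the file as UTF-8.
is_valid_utf8() {
  iconv -f UTF-8 -t UTF-8 "$1" > /dev/null 2>&1
}

good=$(mktemp)
bad=$(mktemp)
printf '1 NOTE plain ASCII text\n' > "$good"
# \322 and \323 are lone high bytes followed by ASCII, which UTF-8 does not allow.
printf '1 NOTE \322quoted\323\n' > "$bad"

if is_valid_utf8 "$good"; then good_result=valid; else good_result=invalid; fi
if is_valid_utf8 "$bad"; then bad_result=valid; else bad_result=invalid; fi
echo "ASCII sample: $good_result, high-byte sample: $bad_result"

rm -f "$good" "$bad"
```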

I did some poking around, so try this:

Code:

LANG=en_US.ISO8859-1 tr '\r' '\n' < testfile.ged > workfile.txt

If that works, post a reply and I'll explain. Or read the docs: 'man locale', 'man tr' (which mentions the dependence on the LC_* and LANG env-vars), and these URLs:

http://www.opengroup.org/onlinepubs/007908799/xbd/locale.html#tag_005_003

http://www.opengroup.org/onlinepubs/007908799/xbd/envvar.html

EDIT:
If it doesn't work, you can use HexFiend to replace CR (0d) with LF (0a) on the entire file. Then Save or Save As... and it should be ready to go.
http://www.ridiculousfish.com/hexfiend/
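
A variation on the tr command above, as a sketch rather than the definitive fix: LC_ALL overrides both LANG and the individual LC_* variables, so setting LC_ALL=C forces byte-by-byte handling even if something else is already set in your environment. Piping straight into grep also skips the intermediate file. The sample data and variable names below are made up; swap in your own file.

```shell
# Hypothetical sample: CR-terminated lines, one containing high bytes
# (\322 and \323) that are not valid UTF-8 on their own.
sample=$(mktemp)
printf '0 HEAD\r1 NAME Maria /BROUWER/\r1 NOTE \322quoted\323\r1 NAME Jan /JANSEN/\r' > "$sample"

# LC_ALL=C makes both tools treat the data as plain bytes.
# tr turns each CR into an LF, then grep sees ordinary lines.
names=$(LC_ALL=C tr '\r' '\n' < "$sample" | LC_ALL=C grep '^1 NAME')
printf '%s\n' "$names"

rm -f "$sample"
```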

mdatwood · Apr 3, 2010

This doesn't address your problem exactly, but as the title says, if you have to deal with text of any kind, there are some minimum things every programmer needs to know.

The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)

mysterytramp · Apr 4, 2010

chown33 said:
I did some poking around, so try this:

Code:

LANG=en_US.ISO8859-1 tr '\r' '\n' < testfile.ged > workfile.txt

If that works, post a reply and I'll explain.

It indeed worked. Thanks.

mt

mysterytramp · Apr 5, 2010

One more issue ...

I'm looking for lines that would say:

1 NAME Maria /BROUWER/

I want the data between the slashes. I was using sed to strip out the unnecessary parts to leave the surname. (I'm doing this in AppleScript)

Code:

set shellString to "sed 's/1 NAME.* \\///g' < " & temp2FilePosix & " > " & temp3FilePosix

(AppleScript needs the backslash escaped. Here's what gets sent to the command line: sed 's/1 NAME.* \///g')

A separate sed line strips the final slash.

This works fine so long as the name line follows the pattern in the example. A few dozen times (in a 52,000-line data file), there's no space between the middle name and the first slash. sed misses it, which fouls up downstream processing. (After sed, the text file is sort'd and uniq'd.)

1) is there a way to get sed to catch:

1 NAME Maria/BROUWER/

That is, no space between 1 NAME Maria and the first slash?

2) or is there a better way to approach this? Can grep pluck out the data between the slashes?

Thanks in advance.

mt
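
On question 1, one possible approach (a sketch, not tried against real GEDCOM data): instead of matching the space, let sed skip everything that isn't a slash, capture what sits between the first pair of slashes, and print only the lines where that worked. That handles both spacings and drops the trailing slash in the same pass. extract_surnames is just a name made up for the example, and the input lines are invented.

```shell
# Hypothetical helper: reads GEDCOM-style lines on stdin, prints surnames.
extract_surnames() {
  # -n with the p flag prints only lines where the substitution succeeded.
  # [^/]* skips up to the first slash, whether or not a space precedes it.
  # \([^/]*\) captures the text between the first and second slash.
  sed -n 's/^1 NAME[^/]*\/\([^/]*\)\/.*/\1/p'
}

surnames=$(printf '1 NAME Maria /BROUWER/\n1 NAME Maria/BROUWER/\n1 SEX F\n1 NAME Jan /JANSEN/\n' \
  | extract_surnames | sort | uniq)
printf '%s\n' "$surnames"
```

Inside an AppleScript string, each backslash in the sed expression would need doubling, as in the original command.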

chown33 · Apr 5, 2010

mysterytramp said:
This works fine so long as the name line follows the pattern in the example. A few dozen times (in a 52,000-line data file), there's no space between the middle name and the first slash. sed misses it, which fouls up downstream processing. (After sed, the text file is sort'd and uniq'd.)

1) is there a way to get sed to catch:

1 NAME Maria/BROUWER/

That is, no space between 1 NAME Maria and the first slash?

2) or is there a better way to approach this? Can grep pluck out the data between the slashes?

I'd use awk.

The matching pattern is $1 == 1 && $2 == "NAME".

The action looks at $3. If it contains "/", it needs special handling. You could use split() at "/" and then take the second array element. Or you could use index() and substr(), or use match() and RSTART and RLENGTH.

If $3 doesn't contain "/", then print $4 stripped of slashes (e.g. gsub, or match() and RSTART and RLENGTH, or whatever).
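
A rough sketch of the split() route. It's a variation on what's described above: it splits the whole line at "/" rather than just $3, so it doesn't matter which field the slashes land in. surnames_awk is a made-up name, and lines with no slashes are simply skipped, which may or may not be what you want.

```shell
# Hypothetical helper: reads GEDCOM-style lines on stdin, prints surnames.
surnames_awk() {
  awk '$1 == 1 && $2 == "NAME" {
    # Split the whole line at "/"; the surname is the second piece.
    n = split($0, parts, "/")
    # Require an opening and a closing slash before printing.
    if (n >= 3) print parts[2]
  }'
}

awk_result=$(printf '1 NAME Maria /BROUWER/\n1 NAME Maria/BROUWER/\n1 SEX F\n1 NAME Jan /JANSEN/\n' \
  | surnames_awk)
printf '%s\n' "$awk_result"
```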

balamw · Apr 5, 2010

chown33 said:
I'd use awk.

Seems like a perfect job for Perl. Whenever I feel the need to reach for more than one of grep, awk, or sed, Perl is there.

B
