Line merging in large text file

Discussion in 'Mac Programming' started by wrldwzrd89, Nov 11, 2011.

  1. wrldwzrd89 macrumors G5

    Joined:
    Jun 6, 2003
    Location:
    Solon, OH
    #1
    So, I've got a large text file with over 6,000 lines in it.

    I've managed to do the first part of what I want, which is to prefix any line that does NOT start with a tab character with a semicolon followed by a space, using some clever regular expressions.

    Now what I'd like to do is this... any line that has been prefixed in the previous step should be merged with the line before it, by deleting the newline character separating them. My Google-fu is failing me on this matter, though.
     
  2. jiminaus, Nov 11, 2011
    Last edited: Nov 11, 2011

    jiminaus macrumors 65816

    Joined:
    Dec 16, 2010
    Location:
    Sydney
    #2
    This works. It may be overly complicated because I don't know sed well.

    Code:
    sed 's/^\([^	]\)/; \1/' | \
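    sed -n '
    # Prefixed line: append it to the record building up in the hold space.
    /^; /H
    # Unprefixed line: swap out the finished record and print it (none before line 1).
    /^; /!{
    	x
    	1!{
    		s/\n//g
    		p
    	}
    }
    # Last line: flush whatever record is still held.
    ${
    	x
    	s/\n//g
    	p
    }
    '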
    
    Note that there's a literal tab character between [^ and ] on the first line.

    Of course, you didn't provide any sample file so I can't actually be sure it'll work for you. :p
     
  3. wrldwzrd89 thread starter macrumors G5

    Joined:
    Jun 6, 2003
    Location:
    Solon, OH
    #3
    I tried pasting the first command from that into Terminal, and it beeped at me :S
    This is the output I get:
    Code:
    sed 's/^\([^]\)/; \1/' | \ < /Users/wrldwzrd89/Desktop/raw_armory.txt > /Users/wrldwzrd89/Desktop/raw_armory2.txt
    sed: 1: "s/^\([^]\)/; \1/": unbalanced brackets ([])
    -bash:  : command not found
    
    Also attached an example of the type of file I'm dealing with.
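
    For anyone hitting the same two errors: the beep and the "unbalanced brackets" message are consistent with the tab being lost during the paste (Terminal treats a pasted tab as a completion request), and the "command not found" line comes from the `\` followed by a space, which bash reads as a command named " ". A minimal sketch that sidesteps both, assuming a POSIX shell; the variable `tab` and the sample_in.txt / sample_out.txt file names are made up for illustration:

```shell
# Hypothetical sample: one tab-indented line, one line that needs the prefix.
printf '\tsword\naxe\n' > sample_in.txt

# Build the tab with printf so it survives copy and paste.
tab=$(printf '\t')

# Redirections go on the sed command itself; no trailing "| \" is needed
# when only the prefixing stage is run.
sed "s/^\([^$tab]\)/; \1/" < sample_in.txt > sample_out.txt

cat sample_out.txt
```

    Only the prefixing stage is shown here; the merge stage would be piped after it in the usual way, with the output redirection moved to the last command in the pipeline.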
     

  4. chown33 macrumors 604

    Joined:
    Aug 9, 2009
    #4
    File: semi-merge.awk
    Code:
    # awk program
    
    # p holds one previous line, assembles merged lines.
    BEGIN { p = "" }
    
    # for each line NOT starting with semicolon.
    # If p holds anything, print it, then store line in p.
    $0 !~ /^;/  { if ( length( p ) > 0 ) print p;  p = $0; }
    
    # for each line starting with semicolon.
    # Append it to p.
    $0 ~ /^;/  { p = p $0; }
    
    # ensures last line stored in p is printed.
    END  { print p }
    Command line:
    Code:
    awk -f semi-merge.awk raw_armory_sample.txt >out.txt
    
     
  5. wrldwzrd89 thread starter macrumors G5

    Joined:
    Jun 6, 2003
    Location:
    Solon, OH
    #5
    Success! This worked. :D
     
  6. dmi macrumors regular

    Joined:
    Dec 21, 2010
    #6
    perl -i.bak -lpe 'BEGIN{$/="\n; ";$\="; "}' raw_armory_sample.txt
     
  7. dmi macrumors regular

    Joined:
    Dec 21, 2010
    #7
    awk 'BEGIN{RS="\n; ";ORS="; "' raw_armory_sample > out.txt
     
  8. chown33 macrumors 604

    Joined:
    Aug 9, 2009
    #8
    Did you test that? It's not even syntactically correct: missing }.
     
  9. dmi, Nov 12, 2011
    Last edited: Nov 12, 2011

    dmi macrumors regular

    Joined:
    Dec 21, 2010
    #9
    I did test it, but when I tested it, it looked like
    awk 'BEGIN{RS="\n; ";ORS="; "}1' raw_armory_sample.txt
    I must have deleted some characters when I added the
    > out.txt
    part, which I admit I did not test; I assumed it would work the same way as in the earlier example.
     
  10. chown33 macrumors 604

    Joined:
    Aug 9, 2009
    #10
    That script now does something, but the output isn't correct. There are no newlines in the output:
    Code:
    wc raw*.txt
          66    1054    5917 raw_armory_sample.txt
    
    awk 'BEGIN{RS="\n; ";ORS="; "}1' raw_armory_sample.txt | wc
           0    1093    5985
    Your perl script works, so maybe leave it at that.
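
    One possible explanation for the missing newlines, offered as an assumption rather than something confirmed in this thread: POSIX leaves the behavior of a multi-character RS unspecified, and some awk implementations use only its first character. In that case RS is effectively a plain newline and every line ending gets replaced by ORS, which would match the line count of 0. A sketch of the same merge that does not rely on RS at all; the merge_in.txt / merge_out.txt file names are made up for illustration:

```shell
# Hypothetical sample: the prefixed line should join the line before it.
printf '\tsword\n; axe\n\tbow\n' > merge_in.txt

# Emit a newline only before lines that do NOT start with "; ",
# so each prefixed line is glued onto the end of the previous line.
awk '
NR > 1 && !/^; / { printf "\n" }
{ printf "%s", $0 }
END { if (NR > 0) printf "\n" }
' merge_in.txt > merge_out.txt

cat merge_out.txt
```

    This keeps the default record separator, so it should behave the same across awk implementations, at the cost of being a little longer than the RS/ORS one-liner.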
     
