Selecting and copying blocks of text

Discussion in 'Mac Programming' started by KevinMSadler, Jul 17, 2011.

  1. KevinMSadler macrumors newbie

    Joined:
    Nov 17, 2007
    #1
    Hi Guys,

    I've had help here before and need some advice again.

    I have a large number of absolutely gigantic txt files and need to extract information from them. There is far too much to do by hand. I have been trying to do it with either shell scripts or AppleScript but cannot work out how.

    The file consists of multiple variations like this:

    ----------------------------------
    line 1
    line 2
    …
    line x
    Solved Position (10 Moves)
    line x+2
    line x+3
    ...
    line z
    ----------------------------------

    This layout is repeated many, many, many times.
    What I need to do is search for "Solved Position (10 Moves)",
    select all the text between the "----------------------------------" lines,
    copy it to another file, and repeat the process until the end of the file (concatenating each time).

    So I will end up with a new file that has all the text associated with every occurrence of "Solved Position (10 Moves)"
    I can do this for whichever number of moves I want.

    I hope someone has some ideas.

    Thanks,
    Kevin
     
  2. jiminaus macrumors 65816

    Joined:
    Dec 16, 2010
    Location:
    Sydney
    #2
    To help clarify your specification...

    So these blocks may or may not have "Solved Position (10 Moves)" and you just want the ones that have that text? Are there multiple blocks with this text, or only a single one per file? If there are multiple, is that text always exactly the same within the same file?

    If the whole file was summarised, could it be summarised like so:
    Code:
    ----------------------------------
    Block 1, line 1
    Block 1, line 2
    ...
    Block 1, line x1
    ----------------------------------
    Block 2, line 1
    ...
    Block 2, line x2
    ----------------------------------
    ...
    ----------------------------------
    Block y, line 1
    ...
    Block y, line xy-1
    Block y, line xy
    ----------------------------------
    
    In particular, is there a single line of dashes between each block, with a line of dashes opening and closing the file, and nothing else?
     
  3. chown33 macrumors 604

    Joined:
    Aug 9, 2009
    #3
    Look at using the awk or perl commands.

    The strategy would be to start collecting lines in an array when the "----" line is first seen. During collection, look for the "Solved Position (N Moves)" line that matches the desired N. Set a flag when this line is seen. Finally, when the next "----" line is seen, check the flag. If it is set, append the array's lines to the output file. Then clear the array and the flag and start collecting again.
     
  4. KevinMSadler thread starter macrumors newbie

    Joined:
    Nov 17, 2007
    #4
    Yes, that summarises it perfectly. Each block is delineated by the lines of dashes. Somewhere within the block is the "Solved Position (10 Moves)" line, which I wish to search for. I want to copy the whole block only if it contains exactly that text, i.e. a block that contains "Solved Position (11 Moves)" should be ignored.

    There could be thousands of blocks containing the text I want so I wish to copy the blocks to a new file.
     
  5. KevinMSadler thread starter macrumors newbie

    Joined:
    Nov 17, 2007
    #5
    Do you have a preferred reference for me to get instructions on what to do?
    Do I do this in an AppleScript or in the Terminal?
     
  6. chown33 macrumors 604

    Joined:
    Aug 9, 2009
    #6
    Please summarize your programming experience and/or skill level.

    What programming languages do you know? Which ones are you proficient in?

    You said you were trying this with shellscripts. Post an example of shell script that you tried, and describe what didn't work.

    Also, exactly how many megabytes or gigabytes is "gigantic"?
     
  7. KevinMSadler thread starter macrumors newbie

    Joined:
    Nov 17, 2007
    #7
    I have not really done anything apart from a little light applescript programming in 20 years. Prior to that I programmed in Basic, 6502 and 6809 assembly language and Pascal. I have never done any windowed programming at all.

    I have not actually managed to create a shell script which will work. I have been looking at grep but could not work out how to count back from the position it finds. I did look at awk but shied away when I read the man page for it!!!

    These files are pure .txt files and vary from 45 to 115MB!
     
  8. chown33 macrumors 604

    Joined:
    Aug 9, 2009
    #8
    Please post a sample consisting of not more than 200 lines. For example, the first 200 lines of one of the files.

    Put it in CODE tags so it doesn't get interpreted as smilies or URLs or anything weird. (CODE tags described)
     
  9. KevinMSadler thread starter macrumors newbie

    Joined:
    Nov 17, 2007
    #9
    Here is an excerpt from one of the files. It is best viewed in a monospace font to allow the text pictures to be visualised properly. The search text is pretty much in the middle of the blocks.

    Thanks for your help with this! Currently reading up on awk - it seems very powerful but a bit turgid reading!!


    Code:
    -----------------------------------------------------------------------------
    Maze Burr, Configuration 57, Mazes 1.1|2.1|3.2|3.4|3.1|6.1, Start Position
    
               [E]
                +--+
                |   
             +  O   
             |  |   
             +--+   
      [D]      [A]      [B]                 [F]
    +--+     +--+--+     +--+             +--+   
       |     |     |     |      180 Deg   |      
       O  +  +--O  +     O     <--------  +--O--+
       |  |     |  |     |                      |
       +--+     +  X  +--+                   +--+
               [C]
             +--+   
                |   
                O  +
                |  |
                +--+
    
    Solved Position (40 Moves)
    
               [E]
                +--+
                |   
             O  +   
             |  |   
             +--+   
      [D]      [A]      [B]                 [F]
    +--+     +--+--+     +--+             +--+   
       |     |     |     |      180 Deg   |      
       +  O  +--+  +     +     <--------  +--+--+
       |  |     |  |     |                      |
       +--+     +  O  O--+                   O--+
               [C]
             +--+   
                |   
                +  +
                |  |
                +--O
    
    -> AE CA CB DC DF DE CD CF AC AD AC EA EB EF AE AB AE CA CB DC DA de DA FD FE 
       FB DF dc DF DE CD cf CD AC AD AC BC BF ab ae
       B(56-69) C(69) E(58-78) F(12-14-45) 
    
    Positions Evaluated .......... 6,394
    Times Maximum depth reached .. 86
    Computation time (h:m:s:ms) .. 00:00:00:000
    -----------------------------------------------------------------------------
    Maze Burr, Configuration 60, Mazes 1.1|2.1|3.2|3.4|3.2|6.2, Start Position
    
               [E]
             +--+   
             |  |   
             +  O   
                |   
                +--+
      [D]      [A]      [B]                 [F]
    +--+     +--+--+     +--+                +--+
       |     |     |     |      180 Deg         |
       O  +  +--O  +     O     <--------  +--O--+
       |  |     |  |     |                |      
       +--+     +  X  +--+                +--+   
               [C]
             +--+   
                |   
                O  +
                |  |
                +--+
    
    Solved Position (27 Moves)
    
               [E]
             +--+   
             |  |   
             +  O   
                |   
                +--+
      [D]      [A]      [B]                 [F]
    +--+     +--+--+     +--+                +--+
       |     |     |     |      180 Deg         |
       +  O  +--+  +     +     <--------  +--+--+
       |  |     |  |     |                |      
       +--+     +  O  O--+                +--O   
               [C]
             +--+   
                |   
                +  +
                |  |
                +--O
    
    -> AE CA CB DC DF DE CD CF AC AD AC EA ED BE BA FB FE FD BF CF CD bc BF ab EB 
       EF ae
       C(69) D(45-47) E(58-69-89) F(23-36-56) 
    
    Positions Evaluated .......... 2,265
    Times Maximum depth reached .. 70
    Computation time (h:m:s:ms) .. 00:00:00:000
    -----------------------------------------------------------------------------
    Maze Burr, Configuration 64, Mazes 1.1|2.1|3.2|3.4|3.4|6.2, Start Position
    
               [E]
             +--+   
                |   
                O  +
                |  |
                +--+
      [D]      [A]      [B]                 [F]
    +--+     +--+--+     +--+                +--+
       |     |     |     |      180 Deg         |
       O  +  +--O  +     O     <--------  +--O--+
       |  |     |  |     |                |      
       +--+     +  X  +--+                +--+   
               [C]
             +--+   
                |   
                O  +
                |  |
                +--+
    
    Solved Position (26 Moves)
    
               [E]
             +--+   
                |   
                +  O
                |  |
                +--+
      [D]      [A]      [B]                 [F]
    +--+     +--+--+     +--+                +--+
       |     |     |     |      180 Deg         |
       +  O  +--+  +     +     <--------  +--+--+
       |  |     |  |     |                |      
       +--+     +  O  O--+                +--O   
               [C]
             +--+   
                |   
                +  +
                |  |
                +--O
    
    -> AE CA CB DC DF DE CD CF AC AD AC EA ED EF BE BA FB FE FD BF CF CD bc BF ab 
       ae
       C(69) D(45-47) E(58-89) F(23-36-56) 
    
    Positions Evaluated .......... 8,298
    Times Maximum depth reached .. 172
    Computation time (h:m:s:ms) .. 00:00:00:000
    -----------------------------------------------------------------------------
    Maze Burr, Configuration 121, Mazes 1.1|2.1|3.4|3.4|3.1|6.1, Start Position
    
               [E]
                +--+
                |   
             +  O   
             |  |   
             +--+   
      [D]      [A]      [B]                 [F]
    +--+     +--+--+     +--+             +--+   
       |     |     |     |      180 Deg   |      
       O  +  +--O  +     O     <--------  +--O--+
       |  |     |  |     |                      |
       +--+     +  X  +--+                   +--+
               [C]
             +--+   
             |  |   
             +  O   
                |   
                +--+
    
    Solved Position (36 Moves)
    
               [E]
                +--+
                |   
             O  +   
             |  |   
             +--+   
      [D]      [A]      [B]                 [F]
    +--+     +--+--+     +--+             +--+   
       |     |     |     |      180 Deg   |      
       +  O  +--+  +     +     <--------  +--+--+
       |  |     |  |     |                      |
       +--+     +  O  O--+                   O--+
               [C]
             +--+   
             |  |   
             +  +   
                |   
                +--O
    
    -> AE CA CB CF AC DC DF AD AC EA EB EF AE AB AE CA DA de DA FD FE FB DF dc DF 
       DE CD cf CD AC AD AC BC BF ab ae
       B(56-69) E(58-78) F(12-14-45) 
    
    Positions Evaluated .......... 8,779
    Times Maximum depth reached .. 92
    Computation time (h:m:s:ms) .. 00:00:00:000
    -----------------------------------------------------------------------------
    Maze Burr, Configuration 124, Mazes 1.1|2.1|3.4|3.4|3.2|6.2, Start Position
    
               [E]
             +--+   
             |  |   
             +  O   
                |   
                +--+
      [D]      [A]      [B]                 [F]
    +--+     +--+--+     +--+                +--+
       |     |     |     |      180 Deg         |
       O  +  +--O  +     O     <--------  +--O--+
       |  |     |  |     |                |      
       +--+     +  X  +--+                +--+   
               [C]
             +--+   
             |  |   
             +  O   
                |   
                +--+
    
    Solved Position (27 Moves)
    
               [E]
             +--+   
             |  |   
             +  O   
                |   
                +--+
      [D]      [A]      [B]                 [F]
    +--+     +--+--+     +--+                +--+
       |     |     |     |      180 Deg         |
       +  O  +--+  +     +     <--------  +--+--+
       |  |     |  |     |                |      
       +--+     +  O  O--+                +--O   
               [C]
             +--+   
             |  |   
             +  +   
                |   
                +--O
    
    -> AE CA CB DC DF DE CD CF AC AD AC EA ED BE BA FB FE FD BF CF CD bc BF ab EB 
       EF ae
       C(14) D(45-47) E(58-69-89) F(23-36-56) 
    
    Positions Evaluated .......... 3,782
    Times Maximum depth reached .. 80
    Computation time (h:m:s:ms) .. 00:00:00:015
    -----------------------------------------------------------------------------
    Maze Burr, Configuration 125, Mazes 1.1|2.1|3.4|3.4|3.3|6.1, Start Position
    
               [E]
                +--+
                |  |
                O  +
                |   
             +--+   
      [D]      [A]      [B]                 [F]
    +--+     +--+--+     +--+             +--+   
       |     |     |     |      180 Deg   |      
       O  +  +--O  +     O     <--------  +--O--+
       |  |     |  |     |                      |
       +--+     +  X  +--+                   +--+
               [C]
             +--+   
             |  |   
             +  O   
                |   
                +--+
    
    Solved Position (39 Moves)
    
               [E]
                +--+
                |  |
                O  +
                |   
             +--+   
      [D]      [A]      [B]                 [F]
    +--+     +--+--+     +--+             +--+   
       |     |     |     |      180 Deg   |      
       +  O  +--+  +     +     <--------  +--+--+
       |  |     |  |     |                      |
       +--+     +  O  O--+                   O--+
               [C]
             +--+   
             |  |   
             +  +   
                |   
                +--O
    
    -> AE CA CB CF AC DC DF AD AC AB DA EA EB de DA FD FE FB DF dc DF DE AD ED EF 
       AE AB AE CA CD cf CD AC AD AC BC BF ab ae
       B(56-69) E(47-58-78) F(12-14-45) 
    
    Positions Evaluated .......... 871
    Times Maximum depth reached .. 47
    Computation time (h:m:s:ms) .. 00:00:00:000
    -----------------------------------------------------------------------------
    Maze Burr, Configuration 128, Mazes 1.1|2.1|3.4|3.4|3.4|6.2, Start Position
    
               [E]
             +--+   
                |   
                O  +
                |  |
                +--+
      [D]      [A]      [B]                 [F]
    +--+     +--+--+     +--+                +--+
       |     |     |     |      180 Deg         |
       O  +  +--O  +     O     <--------  +--O--+
       |  |     |  |     |                |      
       +--+     +  X  +--+                +--+   
               [C]
             +--+   
             |  |   
             +  O   
                |   
                +--+
    
    Solved Position (26 Moves)
    
               [E]
             +--+   
                |   
                +  O
                |  |
                +--+
      [D]      [A]      [B]                 [F]
    +--+     +--+--+     +--+                +--+
       |     |     |     |      180 Deg         |
       +  O  +--+  +     +     <--------  +--+--+
       |  |     |  |     |                |      
       +--+     +  O  O--+                +--O   
               [C]
             +--+   
             |  |   
             +  +   
                |   
                +--O
    
    -> AE CA CB DC DF DE CD CF AC AD AC EA ED EF BE BA FB FE FD BF CF CD bc BF ab 
       ae
       C(14) D(45-47) E(58-89) F(23-36-56) 
    
    Positions Evaluated .......... 18,560
    Times Maximum depth reached .. 105
    Computation time (h:m:s:ms) .. 00:00:00:000
    -----------------------------------------------------------------------------
    Maze Burr, Configuration 135, Mazes 1.1|2.2|3.1|3.1|3.4|6.1, Start Position
    
               [E]
             +--+   
                |   
                O  +
                |  |
                +--+
      [D]      [A]      [B]                 [F]
       +--+  +--+--+  +--+                +--+   
       |     |     |     |      180 Deg   |      
    +  O     +--O  +     O     <--------  +--O--+
    |  |        |  |     |                      |
    +--+        +  X     +--+                +--+
               [C]
                +--+
                |  |
                O  +
                |   
             +--+   
    
    Solved Position (28 Moves)
    
               [E]
             +--+   
                |   
                +  O
                |  |
                +--+
      [D]      [A]      [B]                 [F]
       +--+  +--+--+  O--+                +--O   
       |     |     |     |      180 Deg   |      
    +  O     +--+  +     +     <--------  +--+--+
    |  |        |  |     |                      |
    +--+        +  O     +--+                +--+
               [C]
                +--+
                |  |
                +  O
                |   
             +--+   
    
    -> AE CA CD CF AC BC BA FB FC FD BF EF EB DE DF AD AC AB DA DC ED ea ED EF be 
       BF AB ae
       C(58-78) D(56-69-89) F(56-69-89) 
    
    Positions Evaluated .......... 2,642
    Times Maximum depth reached .. 53
    Computation time (h:m:s:ms) .. 00:00:00:000
    -----------------------------------------------------------------------------
    Maze Burr, Configuration 144, Mazes 1.1|2.2|3.1|3.2|3.4|6.2, Start Position
    
               [E]
             +--+   
                |   
                O  +
                |  |
                +--+
      [D]      [A]      [B]                 [F]
    +--+     +--+--+  +--+                   +--+
    |  |     |     |     |      180 Deg         |
    +  O     +--O  +     O     <--------  +--O--+
       |        |  |     |                |      
       +--+     +  X     +--+             +--+   
               [C]
                +--+
                |  |
                O  +
                |   
             +--+   
    
    Solved Position (25 Moves)
    
               [E]
             +--+   
                |   
                +  O
                |  |
                +--+
      [D]      [A]      [B]                 [F]
    +--+     +--+--+  O--+                   +--+
    |  |     |     |     |      180 Deg         |
    +  +     +--+  +     +     <--------  +--+--+
       |        |  |     |                |      
       +--O     +  O     +--+             +--O   
               [C]
                +--+
                |  |
                +  +
                |   
             O--+   
    
    -> AE CA CD BC BA FB FE FD BF BE CB cf CB AC DC DF AD AC EA ED EF BE BF ab ae
       C(36) D(45-47-78) E(58-89) F(23-36-56) 
    
    Positions Evaluated .......... 640
    Times Maximum depth reached .. 37
    Computation time (h:m:s:ms) .. 00:00:00:000
    -----------------------------------------------------------------------------
    
     
  10. chown33, Jul 17, 2011
    Last edited: Jul 17, 2011

    chown33 macrumors 604

    Joined:
    Aug 9, 2009
    #10
    Thanks for posting the sample. I used it to test the program below.

    File: blocky.awk
    Code:
    # Example command line:
    # ---
    #   awk -f blocky.awk  -v moves=26  file1.txt file2.txt >>found26.txt
    # ---
    # Sets 'moves' to 26, so pattern to find will be "Solved Position (26 Moves)".
    # Searches files file1.txt, file2.txt.
    # Appends found lines to found26.txt.
    
    BEGIN { 
      # If moves isn't assigned a value, it will be 0.
      solved = sprintf( "Solved Position (%d Moves)", moves )
    
      # Print what we're looking for on stderr stream, not stdout.
      print "Looking for: " solved   >>"/dev/stderr"
    
      found = 0
      lineNum = 0
      kept[ 0 ] = "---debug"
    }
    
    ## Matches lines that start with 10 or more consecutive hyphens.
    ## Because I'm too lazy to type in the exact number of actual hyphens.
    /^----------+/  { 
    #  print $0 "Looking for: " solved ", found: " found
    
      if ( found > 0 )  {
        for ( i = 0; i < lineNum; ++i )  { print kept[ i ] }
      }
    
      found = 0
      lineNum = 0
      delete kept  # discard entire array
    }
    
    ## Matches all lines.
    {
      kept[ lineNum++ ] = $0
      found += (index( $0, solved ) > 0);  # adds 0 when solved not present
    }
    
    Example command-lines:
    Code:
    ## Doesn't assign a value to 'moves' variable.
    awk -f ./blocky.awk  sample.txt
    
    
    awk -f ./blocky.awk  -v moves=10  sample.txt
    
    
    awk -f ./blocky.awk  -v moves=26  sample2.txt  >found26.txt
    
    awk -f ./blocky.awk  -v moves=27  sample2.txt  >found27.txt
    
    I stored the posted sample data as "sample2.txt".
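
    A quick way to check a run, sketched under the assumption that the file names match the commands above (same_count is a made-up helper name, not part of awk or grep): every line that matches the search text sits inside a block that gets copied, so the input and the output should contain the same number of such lines. This only holds if the output file was empty before the run and the last block is closed by a dash line.

```shell
# same_count N INPUT OUTPUT   (hypothetical helper, for illustration only)
# Succeeds when INPUT and OUTPUT contain the same number of
# "Solved Position (N Moves)" lines.
same_count() {
  pattern="Solved Position ($1 Moves)"
  # -F treats the pattern as a fixed string, so the parentheses are literal.
  in_count=$(grep -c -F -- "$pattern" "$2")
  out_count=$(grep -c -F -- "$pattern" "$3")
  echo "input: $in_count  output: $out_count"
  [ "$in_count" -eq "$out_count" ]
}

# Example, using the file names from the commands above:
#   same_count 26 sample2.txt found26.txt
```

    If the two numbers differ, the usual suspects would be an output file that already held blocks from an earlier run (because of >>) or a final block with no closing dash line.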
     
  11. jiminaus macrumors 65816

    Joined:
    Dec 16, 2010
    Location:
    Sydney
    #11
    The awk man page is a really concise description for people who already know awk.

    Although the awk installed in Mac OS X isn't the GNU awk, the GNU awk user's guide is a more extensive description. Just be aware that any GNU extensions to awk described in that user's guide won't work, unless you install and use GNU awk of course.

    There are also many gentler awk tutorials on the Internet.
    Awk - A Tutorial and Introduction - by Bruce Barnett
    An Awk Primer

    I don't use awk often enough to remember its syntax and details, so I use these resources when I need to use it.
     
  12. ChOas macrumors regular

    Joined:
    Nov 24, 2006
    Location:
    The Netherlands
    #12
    Or in Perl:

    Code:
    #!/usr/bin/perl
    
    
    local $/ = '-' x 77;
    my $m = shift;
    print grep /Solved Position \($m Moves\)/,<>;
    
    usage: ./perlprogram _number of moves to look for_ _input file(s)_ > _output file_
     
  13. KevinMSadler thread starter macrumors newbie

    Joined:
    Nov 17, 2007
    #13
    Wow Guys!!!

    Thanks so much for the help! I am sure that this will allow me to produce a more organised set of files.

    I will be sure to read up on this subject too.

    Kevin
     
  14. ChOas macrumors regular

    Joined:
    Nov 24, 2006
    Location:
    The Netherlands
    #14
    Just realised you can just do it in one go :)

    Code:
    perl -n -e 'BEGIN{$/ = "-" x 77;}system("echo \"$_\" >> $1.txt") if /Solved Position \((\d+) Moves\)/' inpt.fil
    
    Here inpt.fil is your sample file. This gives:

    Code:
    unknown$ ls -altr *.txt
    total 32
    -rw-------  1 xx users  1473 Jul 19 04:38 40.txt
    -rw-------  1 xx users  1428 Jul 19 04:38 36.txt
    -rw-------  1 xx users  2819 Jul 19 04:38 27.txt
    -rw-------  1 xx users  1438 Jul 19 04:38 39.txt
    -rw-------  1 xx users  2810 Jul 19 04:38 26.txt
    -rw-------  1 xx users  1407 Jul 19 04:38 28.txt
    -rw-------  1 xx users  1398 Jul 19 04:38 25.txt
    
    Where <n>.txt is the collection of moves found for that number.
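
    To cross-check the resulting files, a tally of how many blocks exist per move count can help. This is only a sketch in shell and awk: tally_moves is a hypothetical helper name, and it assumes each block has exactly one "Solved Position (N Moves)" line, as in the posted sample.

```shell
# tally_moves FILE...   (hypothetical helper, for illustration only)
# Prints "MOVES COUNT" pairs, one per line, sorted by move count.
tally_moves() {
  awk '
    match($0, /Solved Position \([0-9]+ Moves\)/) {
      # Pull the number out of the matched text, e.g. "(26 Moves)" -> 26.
      split(substr($0, RSTART, RLENGTH), part, /[()]/)
      count[part[2] + 0]++
    }
    END { for (n in count) print n, count[n] }
  ' "$@" | sort -n
}

# Example:
#   tally_moves inpt.fil
```

    For the sample posted above this should report two blocks each for 26 and 27 moves and one each for 25, 28, 36, 39 and 40, which fits the listing: 26.txt and 27.txt are about twice the size of the others.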
     
  15. KevinMSadler thread starter macrumors newbie

    Joined:
    Nov 17, 2007
    #15
    Now that is fantastic!:D:D
    Managed to sort them all out in one go!

    It's amazing how long it takes to do a 110MB file and I have 180 of them to do!!!:eek:
     
  16. ChOas macrumors regular

    Joined:
    Nov 24, 2006
    Location:
    The Netherlands
    #16
    Does it really take that long?

    This should be pretty efficient. I don't know what kind of machine you are running on, of course...

    Btw, you can just use a wildcard for the filename and let it run through the night if you want to. The script will take any number of files.
     
  17. ChOas macrumors regular

    Joined:
    Nov 24, 2006
    Location:
    The Netherlands
    #17
    Ah.. it actually is NOT that efficient :D

    Use this script:

    Code:
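    #!/usr/bin/perl -w
    
    use strict;
    
    # Use a run of 77 hyphens as the record separator, so each read returns one block.
    local $/ = '-' x 77;
    
    # Maps a move count to the already-open output file handle for it.
    my %cache = ();
    while (<>) {
     if (/Solved Position \((\d+) Moves\)/) {
      # Open each output file only once, in append mode, and reuse the handle.
      unless ($cache{$1}) {
       open $cache{$1}, '>>', "$1.txt" or die "Cannot open $1.txt: $!";
      }
      print {$cache{$1}} $_;
     };
    };
    
    close $cache{$_} for keys %cache;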
    
    It caches the file handles and doesn't fork a shell. Does a 250MB file in about
    4 seconds on my i5 iMac.
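
    For comparison, roughly the same one-pass split can be sketched in awk, wrapped in a shell function. The names split_by_moves and emit_block and the output-directory argument are made up for this illustration; they are not part of the scripts above. Writing the results into a separate directory also keeps a later wildcard run from picking up the generated <n>.txt files as input. Unlike the Perl version, each saved block starts with its opening dash line rather than ending with the closing one.

```shell
# split_by_moves OUTDIR FILE...   (hypothetical helper, for illustration only)
# Appends each block to OUTDIR/<N>.txt, where N is the block's move count.
split_by_moves() {
  outdir=$1
  shift
  mkdir -p "$outdir" || return 1
  awk -v outdir="$outdir" '
    # Append the collected block to the file for its move count, then reset.
    function emit_block() {
      if (n != "") { printf "%s", block >> (outdir "/" n ".txt") }
      block = ""
      n = ""
    }
    # A line of 10 or more hyphens closes the previous block.
    /^----------+/ { emit_block() }
    # Collect every line, including the opening dash line.
    { block = block $0 "\n" }
    # Remember the move count when the "Solved Position" line shows up.
    match($0, /Solved Position \([0-9]+ Moves\)/) {
      split(substr($0, RSTART, RLENGTH), part, /[()]/)
      n = part[2] + 0
    }
    # Covers input that does not end with a dash line.
    END { emit_block() }
  ' "$@"
}

# Example (directory and file names are placeholders):
#   split_by_moves sorted input/*.txt
```

    awk keeps each output file open until it exits, so, like the cached file handles above, every <n>.txt is opened only once per run.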
     
  18. KevinMSadler thread starter macrumors newbie

    Joined:
    Nov 17, 2007
    #18
    I have a 3 year old Core 2 Duo Macbook - the new code seems to be a bit faster but not much. The biggest worry for me if I deliberately do all 18GB in one go will be heat. I did 7 files in a row and had to take the laptop off my lap as it got horrendously hot!

    I now have lots of options - now I have to work on understanding how the code actually works.

    Thanks a lot for all this.
     
