Perl parsing question

Discussion in 'Mac Programming' started by LtRammstein, Feb 4, 2010.

  1. LtRammstein macrumors 6502a

    Joined:
    Jun 20, 2006
    Location:
    Denver, CO
    #1
    Hey all,

    I'm trying to teach myself proper Perl parsing methods, but I'm running into issues, especially with this one.

    I am trying to parse content from a website that is stored in $contents. Specifically, I am looking for 6-digit numbers: I have an array filled with 6-digit elements (for example, 990146). What I want to do is parse the $contents variable line by line and pull out the lines that contain one of those 6-digit numbers.

    What functions/methods/routines should/can I use to do this?

    EDIT: this is what I have so far:

    Code:
    sub ContentParser($$)
    {
        my @ContentArray = "";
        my $content = $_[0];
        my $model = $_[1];
        while(<$_[0]>)
        {
            chomp($_[0]);
            if($_[0] =~ /($model)/)
            {
                print "Item model is: $model\tItem is: $_\n";
            }
        }
    }
    
    Its output is:

    Code:
    Item model is: 991779	Item is: <!DOCTYPE
    Item model is: 991779	Item is: html
    Item model is: 991779	Item is: PUBLIC
    Item model is: 991779	Item is: -//W3C//DTD XHTML 1.0 Transitional//EN
    Item model is: 991779	Item is: http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd>
    Item model is: 991779	Item is: <html
    Item model is: 991779	Item is: xmlns=http://www.w3.org/1999/xhtml>
    Item model is: 991779	Item is: <head
    ^C
    
    I am quite confused about what's going on here. Any help is greatly appreciated.
     
  2. robbieduncan Moderator emeritus

    Joined:
    Jul 24, 2002
    Location:
    London
    #2
    Use split to break the content into lines, then the regex matching operators to work out whether a line contains the data you want.

    split
    regex
     
  3. LtRammstein thread starter macrumors 6502a

    #3
    I've done the regex a little bit, but it's either being too greedy or not greedy enough.

    I'll try the split function and see what it can do. Thanks.
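
On the regex being "too greedy or not greedy enough": for a fixed six-digit model number, the quantifiers are usually not the issue. What tends to matter is stopping the pattern from matching inside a longer run of digits. A minimal sketch, assuming the model numbers are plain digit strings (the sub name `line_has_model` and the sample lines are made up for illustration, not taken from the thread):

```perl
use strict;
use warnings;

# Hypothetical helper: true if $line contains $model as a standalone number,
# i.e. not embedded in a longer run of digits.
sub line_has_model {
    my ($line, $model) = @_;
    # \Q...\E quotes any regex metacharacters in $model;
    # the lookarounds reject a digit immediately before or after it.
    return $line =~ /(?<!\d)\Q$model\E(?!\d)/ ? 1 : 0;
}

# Made-up sample lines, not taken from the real page.
my @samples = (
    '<td>Model 990146</td>',       # standalone number
    '<a href="/item/19901467">',   # same digits inside a longer number
);

for my $line (@samples) {
    printf "%-30s %s\n", $line,
        line_has_model($line, '990146') ? 'match' : 'no match';
}
```

A plain `/$model/` would accept both sample lines; the lookarounds are what reject the second one.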
     
  4. robbieduncan Moderator emeritus

    #4
    Well, I suggested split because you said that the entire content was in a scalar variable. Your while loop is kind of doing that for you, but not quite, as it seems to be splitting on whitespace rather than on newlines...
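
The whitespace splitting is consistent with how Perl treats angle brackets: `<...>` reads a line only when the thing inside is a bareword filehandle or a simple scalar variable. Anything else, such as `<$_[0]>`, is handed to `glob`, which breaks its argument on whitespace. If a `while` loop over lines is preferred to `split`, one option is to open a read-only filehandle on the string itself. A minimal sketch (the sub name `matching_lines` and the sample content are illustrative assumptions, not from the thread):

```perl
use strict;
use warnings;

# Hypothetical helper: return the lines of $content that contain $model.
sub matching_lines {
    my ($content, $model) = @_;
    my @hits;

    # Open an in-memory filehandle on the string (note the reference, \$content),
    # so that <$fh> reads it one line at a time, just like a file.
    open my $fh, '<', \$content or die "Cannot open in-memory handle: $!";
    while (my $line = <$fh>) {
        chomp $line;
        push @hits, $line if $line =~ /\Q$model\E/;
    }
    close $fh;
    return @hits;
}

# Made-up sample content standing in for the downloaded page.
my $page = "<html>\n<td>991779</td>\n<td>123456</td>\n</html>\n";
print "Line: $_\n" for matching_lines($page, '991779');
```

Here `$fh` is a simple scalar, so `<$fh>` is a real line read rather than a glob.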
     
  5. LtRammstein thread starter macrumors 6502a

    #5
    Thanks for the help!

    I think I got it working now.

    Code:
    sub ContentParser($$)
    {
        my $content = $_[0];
        my $model = $_[1];

        # Split the page into lines, then test each line for the model number.
        my @ContentArray = split(/\n/, $content);

        foreach my $line (@ContentArray)
        {
            if($line =~ /($model)/)
            {
                print "Model: $model\t Line: $line\n";
            }
        }

    #   while(<$_[0]>)
    #   {
    #       chomp($_[0]);
    #       $line = split(/\n/);
    #       if($line =~ /($model)/)
    #       {
    #           print "Item model is: $model\tItem is: $_\n";
    #       }
    #   }
    }
    
     
  6. ChOas macrumors regular

    Joined:
    Nov 24, 2006
    Location:
    The Netherlands
    #6
    You could also do something like this:

    Code:
    sub getContentLines {
     my ($content,$model) = @_;
     return grep /$model/, split /\n/, $content;
    };
    
    Which you can then use in your main program like:

    Code:
    my @contentLines = getContentLines($yourPage, '991779');
    
    But then you might as well just do this and skip the whole subroutine:

    Code:
    my @contentLines = grep /$model/, split /\n/, $yourPage;
    
    And if you are looking for multiple models:

    Code:
     my $model = join '|', ('991779','991780','991781');
     my @contentLines = grep /$model/, split /\n/, $yourPage;
    
    Loads of ways :D
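
Since the original question starts from an array of model numbers, the alternation approach can also report which model each line matched. A minimal sketch under that assumption (the sub name `lines_by_model`, the sample page, and the digit-boundary lookarounds are additions for illustration):

```perl
use strict;
use warnings;

# Hypothetical helper: map each model number to the lines that mention it.
sub lines_by_model {
    my ($content, @models) = @_;

    # Build one alternation from the array. quotemeta guards against regex
    # metacharacters, and the parentheses make the lookarounds apply to the
    # whole alternation while also capturing which model matched.
    my $alternatives = join '|', map { quotemeta($_) } @models;
    my $re = qr/(?<!\d)($alternatives)(?!\d)/;

    my %found;
    for my $line (split /\n/, $content) {
        # In list context with /g, a pattern with one capture group
        # returns every captured model found on the line.
        push @{ $found{$_} }, $line for $line =~ /$re/g;
    }
    return %found;
}

# Made-up sample content standing in for the downloaded page.
my $page = "<td>991779</td>\n<td>000000</td>\n<td>991781 and 991779</td>\n";
my %found = lines_by_model($page, '991779', '991780', '991781');
for my $model (sort keys %found) {
    print "Model: $model\t Line: $_\n" for @{ $found{$model} };
}
```

Models that never appear simply have no key in the returned hash.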
     
