Regular expressions not working

Discussion in 'Web Design and Development' started by big_malk, Apr 21, 2010.

  1. big_malk macrumors 6502a

    Joined:
    Aug 7, 2005
    Location:
    Scotland
    #1
    A client of mine has over a million scientific article summaries that need importing into a database. For some reason the only format they can get them in is the text of emails, each email containing over 8,000 articles.
    Seems to me the best way to extract the data is with regular expressions. I've split the articles into an array, and I'm trying to get the data out of them. Here's a sample of one:

    I'm having trouble getting even something as simple as the paper title out of it. Here's what I've tried:

    PHP:
    //Paper:(?P<paper>[-a-zA-Z0-9/\w]*)[\r\n\t\w]*From:
    preg_match_all('Paper:([-/a-zA-Z0-9]*)', $art, $data); // Paper:(?P<paper>[\w\s\d/-]?)From:(?P<from>\w\s\d/-@<>?)Date:
    print_r($data);
    The comments are variations I've tried, but even using '[a-z]*' as the expression returns an empty array. I'm sure I'm missing something very simple here, but I'm stuck. :confused:
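    One way to narrow down a problem like this, as a rough sketch: preg_match_all() returns false when the pattern itself is rejected, and 0 when the pattern is valid but nothing matched. PHP's PCRE functions expect the pattern to be wrapped in delimiters, so a bare 'Paper:(...)' string is rejected before any matching happens. The contents of $art below are invented for illustration, since the real article text isn't shown here.

```php
<?php
// Hypothetical sample text; the real article format is not shown in this thread.
$art = "Paper: hep-th/0001001\nFrom: someone@example.org\n";

// No delimiters: PHP takes the leading "P" as the delimiter and rejects it.
// The @ only hides the warning so the return value can be inspected.
$bad = @preg_match_all('Paper:([-/a-zA-Z0-9]*)', $art, $data);
var_dump($bad);    // bool(false): the pattern failed to compile

// Same expression wrapped in # delimiters, allowing whitespace after the colon.
$good = preg_match_all('#Paper:\s*([-/a-zA-Z0-9]*)#', $art, $data);
var_dump($good);   // int(1)
print_r($data[1]); // the captured identifier
```

    Checking the return value with === false separates "bad pattern" from "no match", which look much the same if only $data is printed.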
     
  2. angelwatt Moderator emeritus

    Joined:
    Aug 16, 2005
    Location:
    USA
    #2
    I don't have time to try out much, but the Paper regex you have there doesn't account for the space after "Paper:".

    Code:
    '^Paper: ([\w\/-]+)$'
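    The ^ represents the start of a line, the $ the end of a line (in PHP's PCRE this needs the m modifier; without it they anchor to the start and end of the whole string).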
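    As a minimal sketch of how that anchored pattern could be run in PHP: the version below adds # delimiters (so the slash needs no escaping) and the m modifier so ^ and $ work per line. The two-article $text is made up for the example.

```php
<?php
// Hypothetical two-article sample; the real data is not shown in this thread.
$text = "Paper: hep-th/0001001\nTitle: First\nPaper: math/0002002\nTitle: Second\n";

// m: ^ and $ match at line boundaries rather than only at the string's ends.
// [\w/-]+ accepts letters, digits, underscore, slash and hyphen.
$count = preg_match_all('#^Paper: ([\w/-]+)$#m', $text, $matches);

// $matches[1] holds the first capture group for every match.
print_r($matches[1]);
```

    If the emails use \r\n line endings, the \r sits just before the line break and blocks the $ from matching; writing \r?$ is one way around that.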
     
  3. big_malk thread starter macrumors 6502a

    Joined:
    Aug 7, 2005
    Location:
    Scotland
    #3
    Thanks, I finally got it working though :)
    I needed slashes (/) around the whole expression as delimiters, plus quite a few more tweaks to cope with all the variation in the formatting of each article. What I have now is possibly the most unreadable code I've ever worked with :)

    Code:
    /Paper:?\s*(?P<paper>[-a-zA-Z0-9\/:\(\)\?_\*\s]*)\s*(From:\s(?P<from>[-a-zA-Z0-9\/\<\>\s@\.\"]*)Date:\s(?P<date>[-a-zA-Z0-9\s:,]*)\([0-9][0-9]?[0-9]?kb\))?(replaced with revised version\s*(?P<revised_date>[-a-zA-Z0-9\s:,]*)\([0-9][0-9]?[0-9]?kb\))?\s*Title:\s*(?P<title>[-a-zA-Z0-9:\/\s,\.&\(\)_\?\<\>\'\"=\*]*)\s*Authors:\s*(?P<authors>[-a-zA-Z0-9\s,\/\'\\\(\)\.&]*)\s*Comments:\s*(?P<comments>[-a-zA-Z0-9\s,\.:;&\/\(\)]*)\s*(Report-no:\s*(?P<report_no>[-a-zA-Z0-9\.]*))?\s*\\\\\s*(?P<summary>[-a-zA-Z0-9:;,\.\/\\\(\)&\*\$\s\^{}\<\>~_]*)\\\\\s*\(\s*(?P<url>[-a-zA-Z0-9:\.\/]*)\s*,?\s*(?P<filesize>[0-9]*)kb\s*\)/
    This may not be the best way to do all the regex, but it works and that's the main thing :)
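    For readability, one option is PCRE's x modifier, which ignores whitespace in the pattern and allows # comments, so each field can sit on its own line. The sketch below covers only three of the fields and uses looser character classes than the expression above, so it is not a drop-in replacement; the sample $art is invented.

```php
<?php
// Hypothetical article text, loosely modelled on the field names in the pattern above.
$art = "Paper: hep-th/0001001\nTitle: An example title\nAuthors: A. Author, B. Writer\n";

// x: whitespace in the pattern is ignored and # starts a comment.
// ~ is the delimiter, so slashes need no escaping.
$pattern = '~
    Paper:?\s*   (?P<paper>   [-\w/]+   ) \s*   # identifier after "Paper:"
    Title:\s*    (?P<title>   [^\r\n]+  ) \s*   # rest of the title line
    Authors:\s*  (?P<authors> [^\r\n]+  )       # rest of the authors line
~x';

if (preg_match($pattern, $art, $m)) {
    // Named groups are available by name as well as by index.
    echo $m['paper'], "\n", $m['title'], "\n", $m['authors'], "\n";
}
```

    Each named group maps to one database column, and adding or loosening a field means touching one line rather than hunting through a single long string.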
     
  4. Bostonaholic macrumors 6502

    Joined:
    Aug 21, 2009
    Location:
    Columbus, Ohio
    #4
    This site will be your friend

    http://www.regexpal.com/

    Put in your sample and your regex. When you have a working regex, refactor until you're comfortable.
     
  5. BollywooD macrumors 6502

    Joined:
    Apr 27, 2005
    Location:
    Surfers Paradise
    #5
    great link!

    thanks:)
     