Regular expressions not working

big_malk · Apr 21, 2010

A client of mine has over a million scientific article summaries that need importing into a database. For some reason the only format they can get them in is the text of emails, with each email holding over 8,000 articles.
It seems to me the best way to extract the data is with regular expressions. I've split the articles into an array and I'm trying to get the data out of each one. Here's a sample:

\\
Paper: astro-ph/0102400
From: Mikhail V. MEDVEDEV <medvedev@cita.utoronto.ca>
Date: Fri, 23 Feb 2001 00:37:07 GMT (18kb)

Title: Self-Interacting Dark Matter with Flavor Mixing
Authors: Mikhail V. Medvedev (CITA)
Comments: 3 pages with 2 eps figures, aipproc.sty. To appear in Proceedings of
the 20th Texas Symposium on Relativistic Astrophysics, Austin, Texas, 2000,
edited by J. Craig Wheeler and Hugo Martel (American Institute of Physics)
Report-no: Poster-14.4
\\
The crisis of the cold dark matter and problems of the self-interacting dark
matter models is resolved by postulating flavor mixing of dark matter
particles. Flavor-mixed particles segregate in the gravitational field to form
dark halos composed of heavy mass eigenstates. Since these particles are mixed
in the interaction basis, elastic collisions convert some of heavy eigenstates
into light ones which leave dense central regions of the halo. This
annihilation-like process will soften dense central cusps of halos. The
proposed model accumulates most of the attractive features of self-interacting
and annihilating dark matter models, but does not suffer from their severe
drawbacks. This model is natural; it does not require fine tuning.
\\ ( http://arXiv.org/abs/astro-ph/0102400 , 18kb)

I'm having trouble getting even something as simple as the Paper: field out of it. Here's what I've tried:

PHP:

//Paper:(?P<paper>[-a-zA-Z0-9/\w]*)[\r\n\t\w]*From:
preg_match_all('Paper:([-/a-zA-Z0-9]*)', $art, $data); // Paper:(?P<paper>[\w\s\d/-]?)From:(?P<from>\w\s\d/-@<>?)Date:
print_r($data);

The comments are variations I've tried, but even using '[a-z]*' as the expression returns an empty array. I'm sure I'm missing something very simple here, but I'm stuck.

angelwatt · Apr 21, 2010

I don't have time to try out much, but the Paper regex you have there doesn't account for the space after Paper:.

Code:

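'/^Paper: ([\w\/-]+)$/m'

The ^ matches the start of a line and the $ the end of a line. In PHP the pattern needs delimiters (the slashes here) and the m modifier for ^ and $ to apply to each line rather than to the whole string.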
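
A minimal sketch of how a line-anchored pattern like this might be run with preg_match_all, assuming the article text is in $art as in the first post. The sample is a trimmed copy of the header quoted above, and the optional \r is an assumption in case the email text uses Windows line endings.

```php
<?php
// Trimmed copy of the sample header from the first post (nowdoc, so the
// backslashes are taken literally).
$art = <<<'TXT'
\\
Paper: astro-ph/0102400
From: Mikhail V. MEDVEDEV <medvedev@cita.utoronto.ca>
Date: Fri, 23 Feb 2001 00:37:07 GMT (18kb)
TXT;

// The slashes are the pattern delimiters. The trailing m makes ^ and $
// match at the start and end of each line instead of the whole string.
// \r? tolerates a carriage return before the newline (an assumption).
$count = preg_match_all('/^Paper: ([\w\/-]+)\r?$/m', $art, $data);

if ($count === false) {
    echo "Regex error\n";
} else {
    // $data[0] holds the full matches, $data[1] the first capture group.
    print_r($data[1]);
}
```

Without the delimiters PHP treats the first character of the string as the delimiter, fails to compile the pattern, and preg_match_all returns false with a warning, leaving $data empty.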

big_malk · Apr 21, 2010

Thanks, I finally got it working though.

I needed slashes (/) around the whole expression, plus quite a few more tweaks to cope with all the variation in the formatting of each article. What I have now is possibly the most unreadable code I've ever worked with:

Code:

/Paper:?\s*(?P<paper>[-a-zA-Z0-9\/:\(\)\?_\*\s]*)\s*(From:\s(?P<from>[-a-zA-Z0-9\/\<\>\s@\.\"]*)Date:\s(?P<date>[-a-zA-Z0-9\s:,]*)\([0-9][0-9]?[0-9]?kb\))?(replaced with revised version\s*(?P<revised_date>[-a-zA-Z0-9\s:,]*)\([0-9][0-9]?[0-9]?kb\))?\s*Title:\s*(?P<title>[-a-zA-Z0-9:\/\s,\.&\(\)_\?\<\>\'\"=\*]*)\s*Authors:\s*(?P<authors>[-a-zA-Z0-9\s,\/\'\\\(\)\.&]*)\s*Comments:\s*(?P<comments>[-a-zA-Z0-9\s,\.:;&\/\(\)]*)\s*(Report-no:\s*(?P<report_no>[-a-zA-Z0-9\.]*))?\s*\\\\\s*(?P<summary>[-a-zA-Z0-9:;,\.\/\\\(\)&\*\$\s\^{}\<\>~_]*)\\\\\s*\(\s*(?P<url>[-a-zA-Z0-9:\.\/]*)\s*,?\s*(?P<filesize>[0-9]*)kb\s*\)/

This may not be the best way to do all the regex, but it works and that's the main thing.
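
For comparison, here is a rough sketch of one possibly more readable alternative, not the approach above: split each record on its backslash separator lines, then read the header one "Label: value" line at a time. The function name parseArticle and the lower-cased field keys are made up for illustration, and the sketch assumes every record follows the layout of the sample in the first post (it does not handle the "replaced with revised version" lines that the long pattern allows for).

```php
<?php
// Hypothetical helper: splits one article record into named fields.
function parseArticle(string $art): array
{
    // Assumes each record has three parts, each introduced by a line that
    // starts with two backslashes: header block, abstract, URL trailer.
    $sep = preg_quote('\\\\', '/');   // pattern for two literal backslashes
    $parts = preg_split("/^{$sep}/m", trim($art));
    // $parts[0] is the (empty) text before the first separator.
    [, $header, $summary, $trailer] = array_pad($parts, 4, '');

    $fields = [];
    $current = null;
    foreach (preg_split('/\r?\n/', trim($header)) as $line) {
        if (preg_match('/^([A-Za-z-]+):\s*(.*)$/', $line, $m)) {
            // A "Label: value" line starts a new field.
            $current = strtolower($m[1]);
            $fields[$current] = $m[2];
        } elseif ($current !== null && trim($line) !== '') {
            // Any other non-blank line continues the previous field,
            // e.g. a wrapped Comments line.
            $fields[$current] .= ' ' . trim($line);
        }
    }

    // Collapse the hard-wrapped abstract into a single line.
    $fields['summary'] = preg_replace('/\s+/', ' ', trim($summary));

    // Trailer looks like: ( http://... , 18kb)
    if (preg_match('/\(\s*(\S+)\s*,\s*(\d+)kb\s*\)/', $trailer, $m)) {
        $fields['url'] = $m[1];
        $fields['filesize'] = (int) $m[2];
    }
    return $fields;
}

// Trimmed copy of the sample record from the first post.
$sample = <<<'TXT'
\\
Paper: astro-ph/0102400
From: Mikhail V. MEDVEDEV <medvedev@cita.utoronto.ca>
Date: Fri, 23 Feb 2001 00:37:07 GMT (18kb)

Title: Self-Interacting Dark Matter with Flavor Mixing
Authors: Mikhail V. Medvedev (CITA)
Comments: 3 pages with 2 eps figures, aipproc.sty. To appear in Proceedings of
the 20th Texas Symposium on Relativistic Astrophysics, Austin, Texas, 2000,
edited by J. Craig Wheeler and Hugo Martel (American Institute of Physics)
Report-no: Poster-14.4
\\
The crisis of the cold dark matter and problems of the self-interacting dark
matter models is resolved by postulating flavor mixing of dark matter
particles.
\\ ( http://arXiv.org/abs/astro-ph/0102400 , 18kb)
TXT;

$record = parseArticle($sample);
print_r($record);
```

The trade-off is that each small pattern only has to describe one line, so a formatting variation in one field should not break the match for the whole record. It would still need testing against the real emails, since an abstract line that happens to begin with two backslashes would be split in the wrong place.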

Bostonaholic · Apr 21, 2010

This site will be your friend

http://www.regexpal.com/

Put in your sample and your regex. When you have a working regex, refactor until you're comfortable.

BollywooD · May 9, 2010

great link!

thanks
