big_malk (original poster)

A client of mine has over a million scientific article summaries that need importing into a database. For some reason the only format they can get them in is the text of emails, each email containing over 8,000 articles.
Seems to me the best way to extract the data is with regular expressions. I've split the articles into an array and I'm trying to get the data out of them. Here's a sample of one:

\\
Paper: astro-ph/0102400
From: Mikhail V. MEDVEDEV <medvedev@cita.utoronto.ca>
Date: Fri, 23 Feb 2001 00:37:07 GMT (18kb)

Title: Self-Interacting Dark Matter with Flavor Mixing
Authors: Mikhail V. Medvedev (CITA)
Comments: 3 pages with 2 eps figures, aipproc.sty. To appear in Proceedings of
the 20th Texas Symposium on Relativistic Astrophysics, Austin, Texas, 2000,
edited by J. Craig Wheeler and Hugo Martel (American Institute of Physics)
Report-no: Poster-14.4
\\
The crisis of the cold dark matter and problems of the self-interacting dark
matter models is resolved by postulating flavor mixing of dark matter
particles. Flavor-mixed particles segregate in the gravitational field to form
dark halos composed of heavy mass eigenstates. Since these particles are mixed
in the interaction basis, elastic collisions convert some of heavy eigenstates
into light ones which leave dense central regions of the halo. This
annihilation-like process will soften dense central cusps of halos. The
proposed model accumulates most of the attractive features of self-interacting
and annihilating dark matter models, but does not suffer from their severe
drawbacks. This model is natural; it does not require fine tuning.
\\ ( http://arXiv.org/abs/astro-ph/0102400 , 18kb)

I'm having trouble getting even something as simple as the paper title out of it. Here's what I've tried:

PHP:
//Paper:(?P<paper>[-a-zA-Z0-9/\w]*)[\r\n\t\w]*From:
preg_match_all('Paper:([-/a-zA-Z0-9]*)', $art, $data); // Paper:(?P<paper>[\w\s\d/-]?)From:(?P<from>\w\s\d/-@<>?)Date:
print_r($data);

The comments are variations I've tried, but even using '[a-z]*' as the expression returns an empty array. I'm sure I'm missing something very simple here... but I'm stuck? :confused:
 
I don't have time to try out much, but the Paper regex you have there doesn't account for the space after the Paper:.

Code:
'^Paper: ([\w\/-]+)$'
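The ^ represents the start of a line and the $ the end of a line, provided the pattern is run with the m (multiline) modifier; without it they anchor to the start and end of the whole string.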
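
A minimal sketch of how that pattern might be used with preg_match. The / delimiters and the trailing m modifier are additions that the pattern above does not show, and $art here is a hypothetical stand-in holding only the first lines of the sample article. It assumes lines end in \n; with \r\n line endings the $ would probably need a \r? in front of it.

```php
<?php
// Hypothetical sample: the first lines of one article from the post above.
// In a double-quoted string "\\\\" is two literal backslashes.
$art = "\\\\\nPaper: astro-ph/0102400\nFrom: Mikhail V. MEDVEDEV <medvedev@cita.utoronto.ca>\n";

// PHP's preg_* functions expect the pattern wrapped in delimiters (here '/').
// The 'm' after the closing delimiter makes ^ and $ match at each line
// boundary instead of only at the start and end of the whole string.
if (preg_match('/^Paper: ([\w\/-]+)$/m', $art, $data) === 1) {
    $paper = $data[1]; // text captured by the first (...) group
} else {
    $paper = null;     // no match, or the pattern failed to compile
}

echo $paper, "\n";
```

With the sample above this should print astro-ph/0102400. preg_match stops at the first match; preg_match_all would be the choice if $art held several articles.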
 
Thanks, I finally got it working though :)
I needed slash delimiters (/) around the whole expression, plus quite a few more tweaks to cope with all the variation in the formatting of each article. What I have now is possibly the most unreadable code I've ever worked with :)

Code:
/Paper:?\s*(?P<paper>[-a-zA-Z0-9\/:\(\)\?_\*\s]*)\s*(From:\s(?P<from>[-a-zA-Z0-9\/\<\>\s@\.\"]*)Date:\s(?P<date>[-a-zA-Z0-9\s:,]*)\([0-9][0-9]?[0-9]?kb\))?(replaced with revised version\s*(?P<revised_date>[-a-zA-Z0-9\s:,]*)\([0-9][0-9]?[0-9]?kb\))?\s*Title:\s*(?P<title>[-a-zA-Z0-9:\/\s,\.&\(\)_\?\<\>\'\"=\*]*)\s*Authors:\s*(?P<authors>[-a-zA-Z0-9\s,\/\'\\\(\)\.&]*)\s*Comments:\s*(?P<comments>[-a-zA-Z0-9\s,\.:;&\/\(\)]*)\s*(Report-no:\s*(?P<report_no>[-a-zA-Z0-9\.]*))?\s*\\\\\s*(?P<summary>[-a-zA-Z0-9:;,\.\/\\\(\)&\*\$\s\^{}\<\>~_]*)\\\\\s*\(\s*(?P<url>[-a-zA-Z0-9:\.\/]*)\s*,?\s*(?P<filesize>[0-9]*)kb\s*\)/

This may not be the best way to do all the regex, but it works and that's the main thing :)
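
For comparison, here is a rough sketch of a different approach from the one above: split each article on its \\ separator lines, then read the header one line at a time. parse_article and the lower-cased field keys are hypothetical names, not code from this thread. It assumes every article follows the layout of the sample (an opening \\, header, \\, summary, and a closing \\ line with the URL) and that no summary line starts with \\ itself. The sample is abridged from the one in the first post.

```php
<?php
// Hypothetical helper: parse one article without a single large regex.
function parse_article(string $art): array
{
    // Build a regex-safe version of the two-backslash separator.
    $sep = preg_quote('\\\\', '/');
    // Split wherever a line starts with "\\": before, header, summary, footer.
    $parts = preg_split('/^' . $sep . '/m', $art);
    if ($parts === false || count($parts) < 4) {
        return []; // layout did not match the assumption above
    }
    [, $header, $summary, $footer] = $parts;

    $fields = [];
    $key = null;
    foreach (preg_split('/\R/', $header) as $line) {
        if (preg_match('/^([A-Za-z-]+):\s*(.*)$/', $line, $m) === 1) {
            // A "Name: value" line starts a new field.
            $key = strtolower($m[1]);
            $fields[$key] = trim($m[2]);
        } elseif ($key !== null && trim($line) !== '') {
            // Any other non-blank line continues the previous field.
            $fields[$key] .= ' ' . trim($line);
        }
    }

    // Collapse the wrapped summary lines into one line of text.
    $fields['summary'] = trim(preg_replace('/\s+/', ' ', $summary));

    // Footer looks like: ( http://... , 18kb)
    if (preg_match('/\(\s*(\S+)\s*,\s*(\d+)kb\s*\)/', $footer, $m) === 1) {
        $fields['url'] = $m[1];
        $fields['filesize'] = (int) $m[2];
    }
    return $fields;
}

// Nowdoc string: backslashes are taken literally, no escaping needed.
$sample = <<<'TXT'
\\
Paper: astro-ph/0102400
From: Mikhail V. MEDVEDEV <medvedev@cita.utoronto.ca>
Date: Fri, 23 Feb 2001 00:37:07 GMT (18kb)

Title: Self-Interacting Dark Matter with Flavor Mixing
Authors: Mikhail V. Medvedev (CITA)
Comments: 3 pages with 2 eps figures, aipproc.sty. To appear in Proceedings of
the 20th Texas Symposium on Relativistic Astrophysics, Austin, Texas, 2000,
edited by J. Craig Wheeler and Hugo Martel (American Institute of Physics)
Report-no: Poster-14.4
\\
This model is natural; it does not require fine tuning.
\\ ( http://arXiv.org/abs/astro-ph/0102400 , 18kb)
TXT;

$fields = parse_article($sample);
print_r($fields);
```

The trade-off is that optional or reordered header lines (Report-no, a "replaced with revised version" line, and so on) need no extra regex branches, but a continuation line that happens to begin with a word followed by a colon would be mistaken for a new field.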
 