A client of mine has over a million scientific article summaries that need to be imported into a database. For some reason, the only format they can get them in is the text of emails, with each email containing over 8,000 articles.
It seems to me the best way to extract the data is with regular expressions. I've split the articles into an array, and I'm trying to get the data out of each one. Here's a sample of one:
\\
Paper: astro-ph/0102400
From: Mikhail V. MEDVEDEV <medvedev@cita.utoronto.ca>
Date: Fri, 23 Feb 2001 00:37:07 GMT (18kb)
Title: Self-Interacting Dark Matter with Flavor Mixing
Authors: Mikhail V. Medvedev (CITA)
Comments: 3 pages with 2 eps figures, aipproc.sty. To appear in Proceedings of
the 20th Texas Symposium on Relativistic Astrophysics, Austin, Texas, 2000,
edited by J. Craig Wheeler and Hugo Martel (American Institute of Physics)
Report-no: Poster-14.4
\\
The crisis of the cold dark matter and problems of the self-interacting dark
matter models is resolved by postulating flavor mixing of dark matter
particles. Flavor-mixed particles segregate in the gravitational field to form
dark halos composed of heavy mass eigenstates. Since these particles are mixed
in the interaction basis, elastic collisions convert some of heavy eigenstates
into light ones which leave dense central regions of the halo. This
annihilation-like process will soften dense central cusps of halos. The
proposed model accumulates most of the attractive features of self-interacting
and annihilating dark matter models, but does not suffer from their severe
drawbacks. This model is natural; it does not require fine tuning.
\\ ( http://arXiv.org/abs/astro-ph/0102400 , 18kb)
I'm having trouble getting even something as simple as the paper title out of it. Here's what I've tried:
PHP:
//Paper:(?P<paper>[-a-zA-Z0-9/\w]*)[\r\n\t\w]*From:
preg_match_all('Paper:([-/a-zA-Z0-9]*)', $art, $data); // Paper:(?P<paper>[\w\s\d/-]?)From:(?P<from>\w\s\d/-@<>?)Date:
print_r($data);
The comments are variations I've tried, but even using '[a-z]*' as the expression returns an empty array. I'm sure I'm missing something very simple here, but I'm stuck.
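One likely culprit is the missing pattern delimiters. PHP's preg_* functions expect the pattern to be wrapped in a delimiter pair such as '/.../' or '~...~'. Without one, 'Paper:(...)' is rejected because 'P' is alphanumeric and cannot be a delimiter, and '[a-z]*' is read as the pattern 'a-z' between bracket delimiters followed by an unknown modifier '*'. In both cases the call fails with a warning instead of matching. A second, smaller issue is that the sample has a space after "Paper:", which the character class does not allow, so the group would capture an empty string even with delimiters in place. The following is a minimal sketch, assuming $art holds one article like the sample above; the trimmed sample text, the '~' delimiter, and the group names 'paper' and 'title' are illustrative choices, not part of the original attempt.

```php
<?php
// Hypothetical input: a trimmed copy of the header from the sample article.
$art = <<<'TXT'
\\
Paper: astro-ph/0102400
From: Mikhail V. MEDVEDEV <medvedev@cita.utoronto.ca>
Date: Fri, 23 Feb 2001 00:37:07 GMT   (18kb)
Title: Self-Interacting Dark Matter with Flavor Mixing
Authors: Mikhail V. Medvedev (CITA)
TXT;

// '~' is used as the delimiter because the data itself contains '/'.
// The 'm' modifier lets ^ and $ match at the start and end of each line.
// \s* skips the whitespace that follows the field name.
preg_match('~^Paper:\s*(?P<paper>[-\w./]+)~m', $art, $paper);
preg_match('~^Title:\s*(?P<title>.+)$~m', $art, $title);

// trim() removes a trailing "\r" if the email text uses CRLF line endings.
echo $paper['paper'], "\n";        // astro-ph/0102400
echo trim($title['title']), "\n";  // Self-Interacting Dark Matter with Flavor Mixing
```

Multi-line fields such as Comments wrap onto continuation lines in the sample, so a single-line pattern like the one for Title would only capture their first line.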