
View Full Version : Parse table from pdf




Hansr
Sep 28, 2010, 04:24 AM
I have to read a PDF periodically for information and need to parse it into a usable format. I was able to get the data into plain text, which is pretty much just the data table I need, but every content item is separated by a single space instead of something more parsable. An example of the output from the PDF is below (and attached, in case it looks bad here).

The problem I have is that the columns are also very "funky": some are just numbers, which are easy to identify, but some of the columns can take these forms:
Item1 Item2 Item3
Item1 Item2
Item1
Item1 Item2 (Item3)
Item1 Item2, Item3

Is there any way of getting through this mess easily?

The first method I thought of was to identify the elements here that could be loaded into a known-items dictionary (loaded from XML) and parse the whole text first, line by line, word combo by word combo (First+\s+Second, First+\s+Second+\s+Third, ..., Second+\s+Third, etc.), putting matches within identifying markers (quotes) so they'd be displayed as:
"Item1 Item2 Item3"
"Item1 Item2"
"Item1"
"Item1 Item2 (Item3)"
"Item1 Item2, Item3"

Then I could use a regex to break up the line, or just replace all the spaces not inside quotes with commas, flag any line that didn't meet the criteria, and then check whether those elements are missing from the dictionary and add them manually (elements are regularly repeated, and there aren't many additions between permutations of the PDF).
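A minimal sketch of that dictionary-marking pass, in Python for brevity since the idea is language-independent (KNOWN_ITEMS and the word-combo limit are placeholder assumptions; the real set would come from the XML dictionary):

```python
# Known multi-word items (would be loaded from XML in practice - these
# example entries are taken from the sample table below).
KNOWN_ITEMS = {"Some Extra, Info", "Some Item", "Category 1 Sub 1"}

def mark_known(line, known=KNOWN_ITEMS, max_words=4):
    """Wrap any run of words matching a dictionary entry in quotes,
    so it survives a later split on spaces."""
    words = line.split(" ")
    out, i = [], 0
    while i < len(words):
        # Try the longest word combo first (First Second Third,
        # then First Second, then First, ...).
        for n in range(min(max_words, len(words) - i), 0, -1):
            combo = " ".join(words[i:i + n])
            if combo in known:
                out.append('"%s"' % combo)
                i += n
                break
        else:
            out.append(words[i])
            i += 1
    return " ".join(out)
```

Anything the dictionary doesn't cover passes through unchanged, which is what lets you flag unrecognized lines afterwards.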

The attached example should cover most of the cases where joining is needed; the affected columns are Asset, Sponsor, ECOL1, ECOL2, ECOL3 and Sub category.

Anyone have any thoughts or suggestions about this? Maybe an alternative, super-easy method?

This is pretty much language-independent, but if anyone has language-specific options I'd most likely go with C++ or C#.


Asset Ticker Sponsor Maturity Rating EL BID OFFER DM DM / EL CPN ECOL1 ECOL2 ECOL3 Sub category CUSIP
Category 1
Super Duper Asset A SDAA Spons1 1/1/10 AA+ 100 95.00 96.00 2500 25.0 L+1000 Some Extra, Info XYZ Some Item Category 1 Sub 1 123456AA1
Super Duper Asset B SDAB Spons2 1/1/10 AA+ 150 99.00 100.00 500 3.33 L+1200 Some ExtraInfo ABC Some (Item) Category 1 Sub 2 123456AB1
Category 2
Normal Asset NA Spons1 1/1/10 BB+/Ba2 500 95.00 96.00 1200 2.4 L+500 Blah DEF AA (Abc abc) Category 2 Sub 1 123456NA1



chown33
Sep 28, 2010, 12:35 PM
Is there any way of getting through this mess easily?
No. Not easily.

Then I could use a regex to break up the line, or just replace all the spaces not inside quotes with commas, flag any line that didn't meet the criteria, and then check whether those elements are missing from the dictionary and add them manually (elements are regularly repeated, and there aren't many additions between permutations of the PDF).

I doubt that regex will play a significant role, except as a building-block for heuristics. The matching appears to be too variable.

There is no consistency at all in what constitutes a delimiter. Sometimes space is a delimiter, and sometimes it isn't, even in the same line, and even at different positions in the same line. And comma may be a delimiter, but only if present. And parens may be a delimiter, or maybe should just be stripped. Or maybe there's no delimiter at all, instead you have to look at a change of case in the middle of a word (Some ExtraInfo). It's going to take some context-sensitive heuristics (rules) to parse this stuff.
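For the change-of-case case specifically, a tiny sketch (Python, using zero-width lookarounds so all-caps tokens like "ABC" are left alone — treat it as one heuristic building block, not a complete answer):

```python
import re

def split_case_change(token):
    """Insert a space at a lowercase-to-uppercase boundary,
    e.g. "ExtraInfo" -> "Extra Info"."""
    return re.sub(r"(?<=[a-z])(?=[A-Z])", " ", token)
```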

As a strategy, you could look for probable "landmark" fields, which have lower ambiguity. These must match the "archetype" for the column. Once landmarks have been identified, you can make better guesses at what lies between the landmarks.

For example:
Asset Ticker Sponsor Maturity Rating EL BID OFFER DM DM / EL CPN ECOL1 ECOL2 ECOL3 Sub category CUSIP
Super Duper Asset A SDAA Spons1 1/1/10 AA+ 100 95.00 96.00 2500 25.0 L+1000 Some Extra, Info XYZ Some Item Category 1 Sub 1 123456AA1
Super Duper Asset B SDAB Spons2 1/1/10 AA+ 150 99.00 100.00 500 3.33 L+1200 Some ExtraInfo ABC Some (Item) Category 1 Sub 2 123456AB1


Fields such as Ticker and Maturity (color-highlighted in the original post) identify possible landmarks. Each is presumably drawn from a relatively small set of potential symbol-strings, in a narrowly prescribed format, and its veracity can be determined by additional checking.

The ticker symbol can be checked against a dictionary of known existing ticker symbols. The Maturity symbol is a date in a prescribed format (or range of formats). A date can be validated numerically (month 1-12, day 1-31, year in a specified range, etc.). So "date" is the archetype for the Maturity column, regardless of the different formats it might have.

Once the landmarks in a line have been identified to an acceptable level of certainty (which may be less than 100% certainty), you can start trying to identify the symbols between the landmarks. For example, there is only one column to the left of Ticker, so everything to the left of the ticker-symbol landmark can only be the Asset name.

Ideally, you want landmarks with higher certainty located between fields of lesser or unknown certainty. Essentially, you're using the landmarks as delimiters, while also using them as data.
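A rough sketch of landmark detection along those lines (Python; the archetype patterns here are guesses inferred from the sample data, not a definitive spec — a real version would also consult the ticker dictionary and validate date ranges):

```python
import re

# Archetype patterns for the low-ambiguity columns. Checked in order,
# most specific first.
ARCHETYPES = {
    "maturity": re.compile(r"^\d{1,2}/\d{1,2}/\d{2,4}$"),
    "number":   re.compile(r"^\d+(\.\d+)?$"),
    "cusip":    re.compile(r"^[0-9A-Z]{9}$"),
}

def find_landmarks(line):
    """Return (token position, archetype name, token) for every token
    that matches a column archetype."""
    hits = []
    for pos, tok in enumerate(line.split()):
        for name, pat in ARCHETYPES.items():
            if pat.match(tok):
                hits.append((pos, name, tok))
                break
    return hits
```

The landmark positions then partition the token list, so everything before the ticker landmark can be assigned to Asset, everything between Maturity and the first number to Rating, and so on.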

In any case, you're going to need some way to handle cases where landmarks are too uncertain, or where all the rules (heuristics) ultimately fail.

The only other approach I can think of is to abandon PDFs as the data source, and go get the original data the PDF was made from. That may not be better or even possible, but without knowing the origin of the PDFs, it's hard to guess at its viability.

Hansr
Sep 29, 2010, 06:52 AM
Thanks for the landmark suggestion, which is in line with what I was thinking of first, i.e. adding separators around known values.

The final logic was to first define the known values for ECOL2 and ECOL3 (3 and 20 items respectively) as known items, parse the whole file for those values, and place them within [].

Then define the following regex:
^(.*?)\s+([A-Z]{5,6})\s+(.*?)\s+(\d+\/\d+\/\d+)\s+(\w+[0-9]*[\+\-]?\/?[abcABC]*[0-9]*[\+\-]?)\s+(\d+\.?\d*)\s+(\d+\.?\d*)\s+(\d+\.?\d*)\s+(\d+\.?\d*)\s+(\d+\.?\d*)\s+(.*?)\s+\+\s+(\d+\.?\d*)\s+(.*?)\s+\[(.*?)\]\s+\[(.*?)\]\s+(.*?)\s+(\w+)\s*$

Then I applied the regex line by line; if it fails, I check whether it's a category line, and if not, I log the line number and content for verification.

Then I add the item/line content to a DB. Works fine.
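For anyone landing here later, the described flow (try the row regex, fall back to a category check, otherwise log the line) looks roughly like this — with a deliberately simplified stand-in regex covering only Asset, Ticker, Maturity and CUSIP, since the full one above is tuned to the real file:

```python
import re

# Much-simplified stand-in for the full row regex above.
ROW = re.compile(r"^(?P<asset>.+?)\s+(?P<ticker>[A-Z]{2,6})\s+.*?"
                 r"(?P<maturity>\d+/\d+/\d+)\s+.*?(?P<cusip>\w{9})\s*$")
CATEGORY = re.compile(r"^Category\s+\d+\s*$")

def parse(lines):
    rows, failures = [], []
    for lineno, line in enumerate(lines, 1):
        m = ROW.match(line)
        if m:
            rows.append(m.groupdict())        # would go to the DB
        elif not CATEGORY.match(line):
            failures.append((lineno, line))   # log for manual verification
    return rows, failures
```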

The only other approach I can think of is to abandon PDFs as the data source, and go get the original data the PDF was made from. That may not be better or even possible, but without knowing the origin of the PDFs, it's hard to guess at its viability.

Wish I could, but legal reasons prevent the PDF provider from supplying it in any format other than an online bitmap or Flash. I hate lawyers restricting logical options.