I have to read a PDF periodically for information and need to parse it into a usable format. I was able to get the data into plain text, which is pretty much just the data table I need, but every content item is separated by a single space instead of something more parsable. An example of the output from the PDF is below (and attached, in case it looks bad here).
The problem is that the columns are also very "funky": some are just numbers, which are easy to identify, but some of the columns can take these forms:
Item1 Item2 Item3
Item1 Item2
Item1
Item1 Item2 (Item3)
Item1 Item2, Item3
Is there any way of getting through this mess easily?
The first method I thought of was to load the known multi-word items into a dictionary (loaded from XML) and parse the whole text first, line by line, word combination by word combination (First+\s+Second, First+\s+Second+\s+Third, ..., Second+\s+Third, ... etc.), wrapping each match in identifying markers such as "" so they'd be displayed as:
"Item1 Item2 Item3"
"Item1 Item2"
"Item1"
"Item1 Item2 (Item3)"
"Item1 Item2, Item3"
Then I could use a regex to break up the line, or just replace all the spaces not inside "" with commas. Any line that didn't meet the criteria would be flagged, so I could check whether its elements are missing from the dictionary and add them manually (elements are regularly repeated, and there are not many additions between versions of the PDF).
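A minimal sketch of that dictionary pass in C++, to make the idea concrete. Everything here is an assumption for illustration: the function names (`splitWords`, `groupKnownItems`, `toCsv`, `hasExpectedColumns`) are hypothetical, the dictionary is a plain `std::set` rather than something loaded from XML, and the matching is a greedy "longest known phrase first" scan rather than trying every word combination.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <set>
#include <sstream>
#include <string>
#include <vector>

// Split a line on whitespace.
std::vector<std::string> splitWords(const std::string& line) {
    std::istringstream in(line);
    std::vector<std::string> words;
    for (std::string w; in >> w;) words.push_back(w);
    return words;
}

// Greedy longest match: at each position, try the longest run of words
// (up to maxWords) that is in the dictionary; otherwise keep one word.
// "known" stands in for the dictionary loaded from XML (assumption).
std::vector<std::string> groupKnownItems(const std::string& line,
                                         const std::set<std::string>& known,
                                         std::size_t maxWords = 4) {
    assert(maxWords >= 1);
    const std::vector<std::string> words = splitWords(line);
    std::vector<std::string> fields;
    std::size_t i = 0;
    while (i < words.size()) {
        std::string best = words[i];
        std::size_t taken = 1;
        for (std::size_t n = std::min(maxWords, words.size() - i); n > 1; --n) {
            std::string candidate = words[i];
            for (std::size_t k = 1; k < n; ++k) candidate += " " + words[i + k];
            if (known.count(candidate) > 0) {
                best = candidate;
                taken = n;
                break;
            }
        }
        fields.push_back(best);
        i += taken;
    }
    return fields;
}

// Emit the fields as one comma-separated line, quoting every field.
std::string toCsv(const std::vector<std::string>& fields) {
    std::string out;
    for (std::size_t i = 0; i < fields.size(); ++i) {
        if (i > 0) out += ",";
        out += "\"" + fields[i] + "\"";
    }
    return out;
}

// A line is flagged for manual review when this returns false.
bool hasExpectedColumns(const std::vector<std::string>& fields,
                        std::size_t expected) {
    return fields.size() == expected;
}
```

With the multi-word items of a row in the dictionary, a data row from the sample should come out as 16 fields (my count of the header columns, treating "DM / EL" as one column). The "Category 1" heading lines come out as 2 fields, so they would be flagged and need to be skipped separately.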
The attached example should cover most of the cases where joining is needed, namely in the columns Asset, Sponsor, ECOL1, ECOL2, ECOL3 and Sub category.
Does anyone have any thoughts or suggestions about this? Maybe an alternative, super easy method?
This is pretty much language independent, but if anyone has any language-specific options, I'd most likely go with C++ or C#.
Code:
Asset Ticker Sponsor Maturity Rating EL BID OFFER DM DM / EL CPN ECOL1 ECOL2 ECOL3 Sub category CUSIP
Category 1
Super Duper Asset A SDAA Spons1 1/1/10 AA+ 100 95.00 96.00 2500 25.0 L+1000 Some Extra, Info XYZ Some Item Category 1 Sub 1 123456AA1
Super Duper Asset B SDAB Spons2 1/1/10 AA+ 150 99.00 100.00 500 3.33 L+1200 Some ExtraInfo ABC Some (Item) Category 1 Sub 2 123456AB1
Category 2
Normal Asset NA Spons1 1/1/10 BB+/Ba2 500 95.00 96.00 1200 2.4 L+500 Blah DEF AA (Abc abc) Category 2 Sub 1 123456NA1