
balamw

Moderator emeritus
Original poster
Aug 16, 2005
19,366
979
New England
I have a PDF file that was created from MS Word 2008 using the usual File->Print->PDF->Save as PDF routine. It renders fine in Preview and Adobe Reader, but apparently gets mangled by other applications that try to extract text from it.

I think this is due to the "/Filter /FlateDecode" compression, so I'd like to make a version of the document without compression that should be easier to extract plain text from.

I found an article on what seems to be the opposite, i.e. compressing a PDF using quartz filters in Preview. http://docs.info.apple.com/article.html?path=Mac/10.5/en/9021.html

Is there something I could do to undo this?

Thanks!

B
 

Blue Velvet

Moderator emeritus
Jul 4, 2004
21,929
265
How many pages in the file? Is it not small enough to just copy and paste text into a Word file and then clean up? Normally, with something like this, I'd try extracting text from within Acrobat Pro to an RTF or Word file.

Three quick suggestions to start:

1. Post the file here if you like and I or someone else can try to extract the copy for you or...

2. Try one of the online converters, see what happens.

http://www.pdftoword.com/
http://www.pdfonline.com/pdf2word/index.asp
http://www.freepdfconvert.com/convert_pdf_to_source.asp
http://www.convertpdftoword.net/
http://www.freepdftoword.org/

3. If you've got access to anything that runs Windows, download and run the Windows trial of Acrobat Pro and see if you can extract the copy that way. There is no trial available for Mac.

Edit: Scratch all that, I've misunderstood the question. :D

Try exporting your PDF to PDF/X, see what happens. What version of OS X are you running?
 

balamw

Moderator emeritus
Original poster
Aug 16, 2005
19,366
979
New England
How many pages in the file? Is it not small enough to just copy and paste text into a Word file and then clean up?

It's only 2 pages, and I control the Word document and the PDF, but not the PDF->text step.

The document is my CV, and I've noticed that sometimes when I upload it to a site, they are unable to extract info from it. Today, I got feedback from two recipients that they were unable to read my submission as it was symbols and gobbledygook. That's not good.

I would still prefer to keep the file as a PDF, precisely because I can't trust Word with formatting, but if the PDF will get mangled it's pointless.

I may have found a way using ghostscript, but it's not installed by default and I don't have enough space on this Mac to get it from source or MacPorts.

EDIT: Just saw your edit. Using Word 2008's "native" Save as PDF gives me the exact same file as the Print route, and Save as PDF from Preview also doesn't help. The PDF/X file is larger, but still uses FlateDecode. Still on 10.6.6 because of the lack of disk space.

B
 

Blue Velvet

Moderator emeritus
Jul 4, 2004
21,929
265
EDIT: Just saw your edit. Using Word 2008's "native" Save as PDF gives me the exact same file as the Print route, and Save as PDF from Preview also doesn't help. The PDF/X file is larger, but still uses FlateDecode. Still on 10.6.6 because of the lack of disk space.


All seems a little curious. I'm not the world's greatest MS Office guru, but what about trying to produce a PDF from the downloadable trial of Word 2011? It seems to have better native support for producing PDFs.

What fonts are you using? What format are they?
 

balamw

Moderator emeritus
Original poster
Aug 16, 2005
19,366
979
New England
All seems a little curious. I'm not the world's greatest MS Office guru, but what about trying to produce a PDF from the downloadable trial of Word 2011? It seems to have better native support for producing PDFs.

I've got to solve my disk space issues before trying that. :p

What fonts are you using? What format are they?
I'm sorry to say this in the design forum, but Calibri/Cambria. Both are TTF and provided by Office 2008.

B
 

Blue Velvet

Moderator emeritus
Jul 4, 2004
21,929
265
I don't know what else to add at this point. Myself, I wouldn't use Word for such an important document; I'd lay it out in InDesign... but then that's what I do almost every day, knowing I can get perfect PDFs from it.

If I had to use Word and these issues were getting in the way of job applications, I'd also be looking at asking these questions in any Microsoft/Office user forums, as well as perhaps Adobe forums... and Planet PDF.

I'll check again if anyone has any answers, because it's the kind of thing that intrigues me. Sorry for not being able to find a solution.
 

covisio

macrumors 6502
Aug 22, 2007
284
20
UK
Though of course, like most people on this forum, I regard InDesign (or any Adobe product) as a vastly superior tool to MS Word when it comes to laying out a page, I do know that there are several good reasons for producing a CV in Word.

I do a bit in recruitment and we use a CV parser to extract raw text from CVs, which then gets saved into a database. While our system will handle PDF without a great deal of issue, it much prefers Word and will extract more reliably.

Most agencies will use a similar system and I would wager if you asked them what file format they would prefer you presented your CV in, most would say Word. After all, the purpose of a CV is to get you some work, right? So make it as easy for them as possible.

In my view CVs should be relatively simple. If you want to show off your design skills, do it in your portfolio. I often see the odd CV which has been 'over designed' and I kind of mentally roll my eyes and think, 'just give me the info I need'. Word is capable of nicely formatting a CV. Not with the typesetting finesse of InDesign perhaps, but well enough.

Anyway, that doesn't answer this question. How about creating a similar file without your personal information but using the same fonts, etc., converting it to PDF using the same method and then posting it up on here for people to have a play with? We may find out what the issue is.
 

balamw

Moderator emeritus
Original poster
Aug 16, 2005
19,366
979
New England
I do a bit in recruitment and we use a CV parser to extract raw text from CVs, which then gets saved into a database. While our system will handle PDF without a great deal of issue, it much prefers Word and will extract more reliably.
That's exactly the problem. Many companies/recruiters... are forcing you to submit via their web portal, and the parsers they use to get the data into the database can't read my PDF. (I've had trouble with them even reading my DOCX, but DOC generally works fine. So I have to keep it in DOCX/DOC/PDF and TXT.) I originally started with it in Pages, and it looked better, but I had to export to one of these formats for submission, eliminating any reason to use Pages.

Anyhow, following on your suggestion, I borrowed a sample resume from here: http://www.freeresumesamples.org/samples/engineer/electricalengineer.asp and followed the same process to generate a PDF from it by copying and pasting the info into my CV and removing my info. I didn't spend any time tweaking it. Just the info and process. PDF is attached.

Almost got ghostscript loaded on my MBP and will try the filter later.

B
 

Attachments

  • SAMPLE EE RESUME.pdf
    60.3 KB · Views: 336

MisterMe

macrumors G4
Jul 17, 2002
10,709
69
USA
That's exactly the problem. ...
This is a long-standing problem in the era of computers. Your vita will be read by a machine. You need enough confidence in yourself to believe that it is not necessary to impress a machine with your typography skills. Computer-based CV systems have traditionally used OCR to read CVs. You would do well to read, process, and follow covisio's advice. Your posted sample is not the worst CV that I have seen. It doesn't have multiple colors and embedded graphics, thank God. However, you use multiple typefaces and type sizes. Bad.

The only recommended concession to art is boldface. Otherwise, you should use a single typeface--something totally non-creative like Times or Times New Roman--in a single type size--say 10 point or 12 point. All to make the job of the OCR easier.
 

covisio

macrumors 6502
Aug 22, 2007
284
20
UK
I've just run your CV through my parser and it's extracted the raw text fine. It has broken up the text into individual words rather than flowing it properly into whole sentences, but this is common with PDFs. Any tool that interprets PDFs will encounter this problem; some deal with it by intelligently reflowing the text after conversion.

It may be this which is causing some online parsers to reject it?

Can't see anything specifically wrong with the file, all the fonts are embedded properly.
 

MisterMe

macrumors G4
Jul 17, 2002
10,709
69
USA
...

Can't see anything specifically wrong with the file, all the fonts are embedded properly.
If you insist on submitting CVs with "creative" typography, then your prospective employers' CV archiving systems will have problems with them.
 

balamw

Moderator emeritus
Original poster
Aug 16, 2005
19,366
979
New England
Let me reiterate that I am an Engineer, not a designer. That is not my CV, but is close enough to illustrate the problem.

This is a long-standing problem in the era of computers. Your vita will be read by a machine. You need enough confidence in yourself to believe that it is not necessary to impress a machine with your typography skills. Computer-based CV systems have traditionally used OCR to read CVs.

Yeah, that's why I keep the document in DOCX and export to DOC/PDF/TXT. The fall back is always TXT. Plain old boring ASCII. I'm not trying to impress the machine, I'm trying to help the human that will hopefully look at it next pull out the essential information while keeping it compact.

Having been a hiring manager for many years, I hated getting CVs from many of the larger job boards that had been so mangled by the system that it was very hard (as a human) to pull out any useful information.

There's no OCR here. I wish it were OCR-type errors. What I've seen at a number of places is that if I upload my CV as PDF for screen scraping/parsing, it pulls out what looks like the compressed text, i.e. Name: x’ökì‹FÜøÎW¥s€Ôƨ÷, which is why I want to try to turn off the compression for the text. The embedded font glyphs can probably stay.

FWIW here's the version "inflated" by ghostscript. It's better, as in I can actually read more of what it is trying to do if I open the file in a text editor, but still doesn't seem to have easily extracted text. Looks like a lot of kerning of individual letters going on right at the top. Grr.
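For anyone who wants to peek inside the streams without installing ghostscript, here is a minimal Python sketch of the same "inflate" idea using only the standard library. It is not what ghostscript does internally. It assumes the streams use plain /FlateDecode (zlib) with no chained filters, and `inflate_streams` and `STREAM_RE` are names made up for this illustration. The inflated content may still hold font-specific glyph codes rather than readable characters, so this only shows what is in the file; it does not by itself make the text extractable.

```python
import re
import zlib

# Assumption: each stream body sits between "stream<EOL>" and "endstream".
STREAM_RE = re.compile(rb"stream\r?\n(.*?)endstream", re.DOTALL)


def inflate_streams(pdf_bytes):
    """Return the inflated bytes of every stream that zlib can decompress."""
    inflated = []
    for match in STREAM_RE.finditer(pdf_bytes):
        try:
            # decompressobj tolerates the end-of-line bytes left before "endstream".
            inflated.append(zlib.decompressobj().decompress(match.group(1)))
        except zlib.error:
            # Not Flate data (other filters, uncompressed streams): skip it.
            continue
    return inflated
```

To try it on a real file, read the PDF in binary mode, e.g. `inflate_streams(open("SAMPLE EE RESUME.pdf", "rb").read())`, and look through the results for the page content operators.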

Thanks to all for your valuable input.

B
 

Attachments

  • SAMPLE EE RESUME.out.pdf
    95.4 KB · Views: 252

MisterMe

macrumors G4
Jul 17, 2002
10,709
69
USA
Let me reiterate that I am an Engineer, not a designer. That is not my CV, but is close enough to illustrate the problem.

...
The only problem that I see with either of your posted sample documents is that each looks like a CV that I would expect to see from a job applicant who doesn't understand the subtle issues associated with typing a CV. If either PDF shows messed-up characters on your system, then the problem is on your system and not within the PDF files.
 