Workflow for digitising paper books - PDF editing

Discussion in 'Mac Apps and Mac App Store' started by annk, Oct 25, 2013.

  1. annk Administrator

    annk

    Staff Member

    Joined:
    Apr 18, 2004
    Location:
    Somewhere over the rainbow
    #1
    I've picked out a lot of my books and have started digitising them. I work at a library that has a book binder department, so I can easily cut off the bindings in a huge, professional machine.

    I have been basing my workflow on the instructions here. But I have some questions and concerns about how to perform certain steps along the way. So I hope some of you have done at least some of these operations, and can advise me.

    My set-up:

    • Fujitsu ScanSnap iX500
    • PDF Nomad
    • Adobe Acrobat Pro XI (license for Windows that came with my ScanSnap, trial version on my iMac)

    After I scan, I do OCR, link the table of contents to the correct pages, and optimise (not necessarily in that order). Or that's the plan at least.

    So far, I have several scanned books in various stages of being OCR'ed, linked (hasn't worked well), and optimised (also something I'm in the middle of testing).

    1. Does the order for OCR/table of contents/optimising matter?
    2. I can't find a way to optimise in anything other than Adobe. Finder doesn't make the files small enough (ebooks are usually very small files), and I can't figure out how to do it in PDF Nomad. But I'd rather not 1) fire up the Asus netbook to use the Windows version of Adobe if I can help it, or 2) pay for the Mac license for same (I will if I have to) - at least if I can find a way to do it with what I have on my Mac now.
    3. I watched the tutorial in PDF Nomad for making an outline from the table of contents (essentially bookmarks that behave as links, if I understand) but even after watching and trying several times, the result is at best inconsistent. Anyone have any luck doing this?
    4. Some of my books are OCR'ed for two languages. Seems to work MOST of the time (using PDF Nomad). Anyone have experience with OCR'ing for two languages?
    5. The page size options in PDF Nomad are extensive, but confusing. Anyone know what the standard sizes are for ebooks you want to read on Kindle, iPad, phones? Do you have to choose, and if so, what's the most flexible choice?

    I realise this is a lot of questions, and I guess more than specific answers (though that would be great) I'm looking for experiences anyone has had with digitising their books and using them as ebooks.



    tl;dr: Anyone have experience digitising their paper books? :p
     
  2. Doctor Q Administrator

    Doctor Q

    Staff Member

    Joined:
    Sep 19, 2002
    Location:
    Los Angeles
    #2
    I have not done what you are doing, but I can make a few suggestions.

    1. Does the order for OCR/table of contents/optimising matter?

      Yes. You need to OCR first, to turn most of the pixels into text, before you optimize to reduce what remains. Table of contents and optimizing can probably be in either order but it's more logical to manipulate the document first and optimize last.

    2. I can't find a way to optimise in anything other than Adobe...

      I haven't used any other PDF app for this, but it would be worth knowing what aspect of the data you need optimized. If you use the "Audit Space Usage" feature of Acrobat Pro, which items does it show to be the big percentages (e.g., content, fonts, graphics)? You can find instructions for Audit Space Usage and for optimizing here.

      Here is an article about reducing image sizes in PDFs using only built-in Apple software.

    3. I watched the tutorial in PDF Nomad for making an outline from the table of contents (essentially bookmarks that behave as links, if I understand) but even after watching and trying several times, the result is at best inconsistent. Anyone have any luck doing this?

      I've never used PDF Nomad but I imagine that its ability to do OCR depends on the quality of the book and the scanning. Have you found that your success in making an outline correlates to how good a scan you can get, so for example a book with large sharp type on a contrasting background (e.g., very black text on a very white background) outlines more easily than one with smaller fuzzier type on poor quality paper? If so, that's the nature of the beast; garbage in, garbage out.

      Have you considered using Apple Pages? If you could get the results of your OCR process into Pages, it has a feature to reduce file sizes and can save to EPUB format.

    4. Some of my books are OCR'ed for two languages. Seems to work MOST of the time (using PDF Nomad). Anyone have experience with OCR'ing for two languages?

      No experience here.

    5. The page size options in PDF Nomad are extensive, but confusing. Anyone know what the standard sizes are for ebooks you want to read on Kindle, iPad, phones? Do you have to choose, and if so, what's the most flexible choice?

      Just to be clear, is your goal to reproduce the look of each book, with page numbers, figures, illustrations, and formatting? In other words, a fixed-format PDF with images of the original pages. Or are you mostly just interested in the text of the book and want to take advantage of the ability to change font sizes and reflow the text in your ebook reader? If you want fixed-format PDFs, you could set the page size to match the original book's dimensions or pick 8.5x11 or A4 to use a standard print size. If you want flowable text then page size should be irrelevant so any standard size should be fine.
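      For the fixed-format case, the arithmetic behind "match the book or pick a standard size" can be sketched as below. This is only an illustration: PDF page sizes are measured in points (72 per inch), and the helper names `mm_to_points` and `closest_standard` are made up for this example, not part of PDF Nomad or any other app mentioned here.

```python
# Minimal sketch: convert a book's trim size to PDF points and see which
# standard print size (A4 or US Letter) has the closest aspect ratio.
# Function names are hypothetical, chosen for this example only.

MM_PER_INCH = 25.4
POINTS_PER_INCH = 72  # PDF user-space units per inch


def mm_to_points(mm: float) -> float:
    """Convert millimetres to PDF points."""
    return mm / MM_PER_INCH * POINTS_PER_INCH


# Standard sizes as (width, height) in points.
STANDARD_SIZES_PT = {
    "A4": (mm_to_points(210), mm_to_points(297)),
    "US Letter": (8.5 * POINTS_PER_INCH, 11 * POINTS_PER_INCH),
}


def closest_standard(width_mm: float, height_mm: float) -> str:
    """Return the standard size whose height/width ratio is nearest the book's."""
    book_ratio = height_mm / width_mm
    return min(
        STANDARD_SIZES_PT,
        key=lambda name: abs(
            STANDARD_SIZES_PT[name][1] / STANDARD_SIZES_PT[name][0] - book_ratio
        ),
    )


# Example: a paperback measured at 130 mm x 200 mm.
print(closest_standard(130, 200))
print(round(mm_to_points(130)), "x", round(mm_to_points(200)), "pt")
```

      Matching the book's own dimensions avoids white margins around the page image; picking the nearest standard size only matters if you also want to print.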

      Have you looked around at the PDF Nomad forums? You could sign up and post questions there too.
     
  3. balamw Moderator

    balamw

    Staff Member

    Joined:
    Aug 16, 2005
    Location:
    New England
    #3
    Sounds like a huge project!

    W.r.t. the size of the finished files, many commercial eBooks are only small because they contain only text and all the scanned graphics are gone. You can see this if you create a PDF from a Word doc vs. print and scan it to PDF.

    Off and on I've proofread works for PGDP, and some have been OCRed to two languages. It's rarely clean enough to accept the results of the OCR without proofreading carefully!

    B
     
  4. annk thread starter Administrator

    annk

    Staff Member

    Joined:
    Apr 18, 2004
    Location:
    Somewhere over the rainbow
    #4
    Hmmm. So I should even OCR books I don't need to be able to search in, in order to have the optimisation work? I didn't think of that.

    Good idea, I will try the "Audit Space Usage" to find this out. I saw the function, but didn't immediately see how it was relevant.


    All my scans have been of good quality; I don't choose books to scan unless they will make a good scan. The success of the outline seems very random. I'm hoping there's some vital step being left out of the tutorial, that someone will let me know about!

    Good idea, thanks!

    Yes, in the case of all non-fiction.

    For novels, then yes, this would be best.

    I'm still trying out the various ways of getting flowable text.

    Yeah, I have spent time digging there, to see if my beginner issues are already discussed. I'll eventually post there, but I figured this forum was bigger, and has the advantage of being Mac-specific!

    Well, I do a few books at a time, and since I'm saving the scans in Dropbox, I can work on it wherever I am, and whenever I have time. So as long as I keep things in order, in a good system of folders that lets me know where I am for each book, it shouldn't seem too huge of a project.

    Most of the books I'm scanning have no or few graphics, unless text is being presented as a graphic image, which I might not realise. I haven't been able to get a PDF under 15 MB yet, whereas ebooks are usually quite small. But it's possible to convert ebook formats with various tools, so doing this might help.

    Hmmm. Thanks.

    As for the non-fiction books, I can always just save them in DP as PDFs, even though they're big. I'll only open them when I need them anyway.

    The fiction would be really nice to get into an ebook format, to be able to have all my books in one (or few) places. I tried Send to Kindle today, but it insists my files that are 15 MB are over 100 MB, and won't send them to my Kindle library.
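    One way to double-check what the files really weigh before uploading is a small script like the sketch below. It only reads sizes from disk; the 100 MB figure is taken from the error message described above and is an assumption, not a documented Send to Kindle limit, and the function names are invented for this example.

```python
# Minimal sketch: report the on-disk size of each PDF in a folder and list
# the ones above a limit. The limit is an assumed value, not a documented one.
from pathlib import Path


def to_mb(size_bytes: int) -> float:
    """Convert bytes to megabytes (1 MB = 1,000,000 bytes)."""
    return size_bytes / 1_000_000


def pdf_sizes(folder: str) -> dict:
    """Map each PDF file name in `folder` to its size in bytes."""
    return {p.name: p.stat().st_size for p in sorted(Path(folder).glob("*.pdf"))}


def over_limit(sizes: dict, limit_mb: float = 100) -> list:
    """Return the names of files whose size exceeds `limit_mb`."""
    return [name for name, size in sizes.items() if to_mb(size) > limit_mb]


# Example with made-up sizes; in practice pass pdf_sizes("path/to/scans").
example = {"novel.pdf": 15_000_000, "atlas.pdf": 120_000_000}
print(over_limit(example))
```

    If the script agrees that a file is around 15 MB, the "over 100 MB" message is coming from the upload tool rather than from the file itself.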
     
  5. SintraWorks macrumors newbie

    Joined:
    Oct 25, 2013
    #5
    Yes it does. You want to OCR first, otherwise you can't automate the table of contents creation in PDF Nomad. You can optimise a bit during OCR by specifying a low output resolution, if you are creating searchable pages but keeping the scanned image. You can get much smaller file sizes, however, by choosing to "replace original with recognised text". That will get rid of the scanned images and replace them with proper text, which should reduce the file size drastically. If you want to keep the original scanned images, and the low-resolution output doesn't create small enough files for you, then you could further apply a Reduce File Size Quartz filter. (Filters are accessible through the Save As dialog.)
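    To see why the output resolution matters so much when the scanned image is kept, here is a rough sketch of the arithmetic. It counts raw pixels only; real PDF sizes depend heavily on image compression, so treat the numbers as relative, not as predictions. The function names and the 6 x 9 inch page are assumptions made for this example.

```python
# Minimal sketch: pixel count of a scanned page grows with the square of the
# resolution, so halving the DPI cuts the raw image data to a quarter.
# Raw, uncompressed figures only; actual PDFs are compressed and much smaller.

def scan_pixels(width_in: float, height_in: float, dpi: int) -> int:
    """Number of pixels in one page image at the given resolution."""
    return round(width_in * dpi) * round(height_in * dpi)


def raw_size_mb(width_in: float, height_in: float, dpi: int,
                pages: int = 1, bytes_per_pixel: int = 1) -> float:
    """Uncompressed size in MB (1 byte per pixel corresponds to 8-bit greyscale)."""
    return scan_pixels(width_in, height_in, dpi) * bytes_per_pixel * pages / 1_000_000


# Example: a 300-page book with 6 x 9 inch pages at three resolutions.
for dpi in (600, 300, 150):
    print(dpi, "dpi:", raw_size_mb(6, 9, dpi, pages=300), "MB raw")
```

    Replacing the page images with recognised text sidesteps this entirely, since plain text for a whole novel is typically far smaller than even one page image.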

    After OCR, you can optimise and create TOC in whatever order you prefer, if still needed.

    See above. You have several options in PDF Nomad.

    Quality of outline creation depends on the quality of the text on the pages, so, in this case, it is dependent on the results of the OCR.

    What specifically is your question here?

    Sounds to me like you'd want A4 or US Letter for iOS devices. Not sure about Kindle.

    Kind Regards,
    António Nunes
    SintraWorks
     
