What text encoding is in this PDF?

Discussion in 'Mac Programming' started by jared_kipe, Jun 19, 2009.

  1. macrumors 68030


    Dec 8, 2003
    Adobe Acrobat Pro cannot export this to .doc or rtf because of "unsupported Type 2 font".

    When you copy and paste text out of it, you just get garbage.

    If you look at the clipboard viewer, it's essentially the same garbage. I've tried putting it in TextWrangler and trying random text encodings, but nothing seems to fix it.

    0*&'F#.5 == youngest
    as an example

    Here is a single page from it.


  2. macrumors 65816


    Nov 30, 2003
    Looks like Brian Herbert has found an effective DRM solution.
  3. Moderator emeritus


    Aug 16, 2005
    You know what's even funnier? Have your Mac speak the PDF: it speaks the garbled form. This doesn't look like an encoding issue; it may be a copy-protection technique to keep you from copying the text out.
  4. thread starter macrumors 68030


    Dec 8, 2003
    Yes, my guess is that the included font is in the odd text encoding as well.

    Thus, if you typed in normal English, it would come out garbled.

    In theory, you could probably do a one-to-one substitution if you were able to get it into a string.

    Say you had a string that you know should read "many decorations", but is actually ";2'0$7#,*)25+*'."

    Then you could replace all the 2s with a's, and so on for every other character.
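
    A minimal sketch of collecting those replacements from one known pair, assuming the scheme really is a fixed character-for-character substitution. The function name buildTable is made up for this illustration; the two strings are the ones quoted above.

```javascript
// Build a decode table from one known (garbled, plain) pair of equal length.
// Throws if the pair cannot come from a fixed character substitution.
function buildTable(garbled, plain) {
  if (garbled.length !== plain.length) {
    throw new Error('pair lengths differ');
  }
  const table = new Map();
  for (let i = 0; i < garbled.length; i++) {
    const from = garbled[i];
    const to = plain[i];
    // The same garbled character must always decode to the same letter.
    if (table.has(from) && table.get(from) !== to) {
      throw new Error('conflicting mapping for "' + from + '"');
    }
    table.set(from, to);
  }
  return table;
}

// The pair quoted in the post above.
const table = buildTable(";2'0$7#,*)25+*'.", 'many decorations');
console.log(table.get('2')); // a
console.log(table.size);     // 13 distinct garbled characters in this pair
```

    One short pair only covers the characters it happens to contain, so more known pairs would be needed to fill in the rest of the table.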

    I actually just did this on the single page, and it seems to fix it just fine. Some care would be needed so you don't accidentally replace letters twice, for example replacing the 2s with a's and then replacing those a's with F's.
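
    The double-replacement hazard can be sketched as follows. The two pairs are the hypothetical ones from the paragraph above (2 becomes a, a becomes F), and the function names are invented for this example.

```javascript
// Hypothetical pairs from the paragraph above: '2' -> 'a' and 'a' -> 'F'.
const pairs = [['2', 'a'], ['a', 'F']];

// Sequential: every pass rewrites the whole string, so the a's produced
// by the first pass are rewritten again by the second pass.
function sequential(text) {
  let out = text;
  for (const [from, to] of pairs) {
    out = out.split(from).join(to);
  }
  return out;
}

// Single pass: each input character is looked up exactly once,
// and the output is never re-examined.
function singlePass(text) {
  const table = new Map(pairs);
  let out = '';
  for (const ch of text) {
    out += table.has(ch) ? table.get(ch) : ch;
  }
  return out;
}

console.log(sequential('2a')); // FF (the first character was replaced twice)
console.log(singlePass('2a')); // aF
```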

    It would probably be safer and easier to do all the replacements at the same time: track down all the one-to-one substitutions, or figure out the text encoding and then convert the file to the new encoding.

    Any ideas how?
  5. Moderator emeritus


    Aug 16, 2005
    I made a quick replacer using JavaScript. You just copy and paste the text from the PDF to the text box then hit the button and it'll show the transformed text. I was only able to work with the characters that existed on the page you provided so there are some missing replacements. You can either improve upon the code or do it in another language you're more comfortable with.

    Just put the following into a file and save with a .html extension.
    <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
      "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
    <html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
    <head>
    <title>Transform text</title>
    <meta http-equiv="content-type" content="text/html; charset=utf-8" />
    <script type="text/javascript">
    function Trans() {
      // Maps each garbled character to the character it displays as.
      var t = {
        '$':' ',
        ' ':'\r'
        // ...one entry per remaining character pair
      };
      var text = '';
      var given = document.getElementById('start').value;
      // Walk the input once, translating one character at a time.
      for (var a=0, b=given.length; a<b; ++a) {
        var ch = given.substring(a,a+1);
        if (t[ch] !== undefined) {
          text += t[ch];
        }
        else {
          alert('Missing transform for: "'+ ch +'"');
        }
      }
      document.getElementById('output').innerHTML = text;
    }
    </script>
    </head>
    <body>
    <p><textarea id="start" cols="55" rows="12"></textarea></p>
    <p><button onclick="Trans();">Transform</button></p>
    <div id="output"></div>
    </body>
    </html>
  6. macrumors 6502a


    Jan 4, 2002
    Austin, TX
    It just looks like someone made a custom font that displays '#' as an 'e' and so on, and some kind of pre-processor converts the normal text into this format. The custom font is embedded in the PDF, and thus you see the correct text displayed even though the source text is gibberish.

    It's really just a simple 1:1 substitution scheme, but effective, as you found out.
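
    To illustrate the idea, here is a rough sketch of what such a pre-processor might do, assuming it simply applies the inverse of the decode table. The eight pairs come from the "youngest" example in the first post; the helper name substitute is invented here, and nothing is known about the real tool.

```javascript
// Decode pairs taken from the "0*&'F#.5 == youngest" example in the first post.
const decode = new Map([
  ['0', 'y'], ['*', 'o'], ['&', 'u'], ["'", 'n'],
  ['F', 'g'], ['#', 'e'], ['.', 's'], ['5', 't'],
]);

// Invert the table to get the mapping an (assumed) pre-processor would apply.
const encode = new Map();
for (const [garbledChar, plainChar] of decode) {
  encode.set(plainChar, garbledChar);
}

// Translate text one character at a time; unknown characters pass through.
function substitute(text, table) {
  let out = '';
  for (const ch of text) {
    out += table.has(ch) ? table.get(ch) : ch;
  }
  return out;
}

const garbled = substitute('youngest', encode);
console.log(garbled);                     // 0*&'F#.5
console.log(substitute(garbled, decode)); // youngest
```

    Because the mapping is one-to-one, the same lookup routine works in both directions; only the table changes.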
  7. thread starter macrumors 68030


    Dec 8, 2003
    That's great! Is there any substitution-like thing that is as elegant in C or Objective-C?

    // myString must be an NSMutableString for in-place replacement.
    NSRange wholeString = NSMakeRange(0, [myString length]);
    [myString replaceOccurrencesOfString: @"2" withString: @"a" options: 0 range: wholeString];

    Over and over. And that is assuming that you rearrange them to avoid collisions.
  8. Moderator emeritus


    Aug 16, 2005
    I've never used Objective-C and only a little C. The easiest way is probably to have two arrays to hold the two character sets, then walk through the file and swap each character as you go. I'd recommend writing out to a new file just to make sure you don't screw up the original.

    If you work straight from the PDF file you'll need something that can handle that part as there's a bunch of extra code in there you wouldn't want to run the code on. That's why I did the copy and paste method. Made it easier.
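
    The two-array approach might look roughly like this in JavaScript under Node.js, which is an assumption since the post has C in mind. The arrays hold only the eight pairs from the "youngest" example, and the function names are invented for the sketch.

```javascript
const fs = require('fs');

// Parallel arrays: garbled[i] is replaced by plain[i].
const garbled = ['0', '*', '&', "'", 'F', '#', '.', '5'];
const plain   = ['y', 'o', 'u', 'n', 'g', 'e', 's', 't'];

// Swap each character as we walk through the text; characters that are
// not in the garbled array are copied unchanged.
function swapChars(text) {
  let out = '';
  for (const ch of text) {
    const i = garbled.indexOf(ch);
    out += i === -1 ? ch : plain[i];
  }
  return out;
}

// Read inPath and write the converted text to outPath.
// The original file is never modified.
function convertFile(inPath, outPath) {
  const text = fs.readFileSync(inPath, 'utf8');
  fs.writeFileSync(outPath, swapChars(text), 'utf8');
}
```

    With the pasted text saved to a file, a call such as convertFile('page.txt', 'page.fixed.txt') would write the converted copy; both file names are placeholders.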
  9. thread starter macrumors 68030


    Dec 8, 2003
    Very true, it would probably be safer and easier to have a loop that goes through each character of the text in order and makes the substitution.

    There would be no collisions this way.

    Using PDFKit I could probably get the text contents of the PDF and make the switch. I've never worked on such a large chunk of data, only relatively short strings, so it should be interesting.

    Does anybody know of a way to abstract it by making a text-encoding profile or something similar, so that a program such as TextWrangler could convert it on the fly?
  10. macrumors 6502a


    Oct 26, 2003
    Richmond, VA
    Can you just OCR it?
  11. thread starter macrumors 68030


    Dec 8, 2003
    That's actually an interesting possibility. I have sophisticated OCR software in my Windows virtual machine for doing just that on books I scan in myself.

    I would need to generate an image of each page, however. I wonder if Preview has an option to export every page as a JPG. I'll look into it.

    EDIT: Looks like Acrobat has the option. Trying now.

    EDIT2: That worked really well.
