What text encoding is in this PDF?




jared_kipe
Jun 19, 2009, 03:33 PM
Adobe Acrobat Pro cannot export this to .doc or .rtf because of an "unsupported Type 2 font".

When you copy and paste text out of it, you just get garbage.

If you look at the clipboard viewer, it's essentially the same garbage. I've tried pasting it into TextWrangler and trying various text encodings, but nothing seems to fix it.

For example, 0*&'F#.5 == youngest

Here is a single page from it.



telecomm
Jun 19, 2009, 04:31 PM
Looks like Brian Herbert has found an effective DRM solution.

angelwatt
Jun 19, 2009, 04:34 PM
You know what's even funnier? Have your Mac speak the PDF. It speaks the garbled form. This doesn't look like an encoding issue; it may be a copy-protection technique to keep you from copying the text out.

jared_kipe
Jun 19, 2009, 05:58 PM
Yes, my guess is that the embedded font uses the same odd text encoding as well.

Thus, if you typed normal English in that font, it would come out garbled.

In theory, you could probably do a one-to-one substitution if you were able to get the text into a string.

Say you had a string that you know comes out to "many decorations", but is actually ";2'0$7#,*)25+*'."

then you could replace all the 2s with a's and so on and so forth.
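
The idea of working backwards from a known string can be sketched as follows. This is a minimal sketch, assuming the scheme really is one-to-one; buildTable is a hypothetical helper name, and the only data used is the "many decorations" pair quoted above.

```javascript
// Build a substitution table from one garbled/plain pair of equal length.
// buildTable is a hypothetical helper, not something from the thread.
function buildTable(garbled, plain) {
  if (garbled.length !== plain.length) {
    throw new Error('the two strings must have the same length');
  }
  var table = {};
  for (var i = 0; i < garbled.length; i++) {
    var from = garbled.charAt(i);
    var to = plain.charAt(i);
    // In a one-to-one scheme a garbled character can only mean one thing.
    if (table.hasOwnProperty(from) && table[from] !== to) {
      throw new Error('conflicting replacements for "' + from + '"');
    }
    table[from] = to;
  }
  return table;
}

// The pair quoted in the post above.
var table = buildTable(";2'0$7#,*)25+*'.", 'many decorations');
// table['2'] is 'a', table['$'] is ' ', table["'"] is 'n'
```

Each known pair only fills in the characters it happens to contain, so more pairs would be needed to cover the whole alphabet.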

I actually just did this on the single page, and it seems to fix it just fine. Some care would need to be taken so that you don't replace letters twice by accident, for example replacing the 2s with a's and then replacing the a's with F's or something like that.

It would probably be safer and easier to do all the replacements at the same time: either track down all the one-to-one substitutions, or figure out the text encoding and then convert the file to the new encoding.

Any ideas how?
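
To make the collision problem concrete, here is a small sketch comparing replacements done one after another with replacements done in a single pass. The rule '2' -> 'a' comes from the example above; the second rule 'a' -> 'z' and both function names are assumptions chosen only for illustration.

```javascript
// '2' -> 'a' is from the example above; 'a' -> 'z' is an assumed second rule.
var pairs = [['2', 'a'], ['a', 'z']];

// One replacement after another: the second rule also rewrites
// the letters produced by the first rule.
function replaceSequentially(text, pairs) {
  for (var i = 0; i < pairs.length; i++) {
    text = text.split(pairs[i][0]).join(pairs[i][1]);
  }
  return text;
}

// Single pass: every input character is looked up exactly once,
// so earlier results are never replaced again.
function replaceInOnePass(text, pairs) {
  var table = {};
  for (var i = 0; i < pairs.length; i++) {
    table[pairs[i][0]] = pairs[i][1];
  }
  var out = '';
  for (var j = 0; j < text.length; j++) {
    var ch = text.charAt(j);
    out += table.hasOwnProperty(ch) ? table[ch] : ch;
  }
  return out;
}

var sequential = replaceSequentially('2a', pairs); // 'zz' (the 2 was replaced twice)
var onePass = replaceInOnePass('2a', pairs);       // 'az'
```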

angelwatt
Jun 19, 2009, 10:03 PM
I made a quick replacer using JavaScript. You just copy and paste the text from the PDF to the text box then hit the button and it'll show the transformed text. I was only able to work with the characters that existed on the page you provided so there are some missing replacements. You can either improve upon the code or do it in another language you're more comfortable with.

Just put the following into a file and save with a .html extension.
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
<title>Transform text</title>
<meta http-equiv="content-type" content="text/html; charset=utf-8" />
<script type="text/javascript">
//<![CDATA[
function Trans() {
    // Lookup table: garbled character copied from the PDF -> intended character.
    var t = {
        // lowercase letters
        '2':'a',
        '/':'b',
        ',':'c',
        '7':'d',
        '#':'e',
        '8':'f',
        'F':'g',
        '"':'h',
        '+':'i',
        '[':'j',
        '3':'k',
        '-':'l',
        ';':'m',
        "'":'n',
        '*':'o',
        '<':'p',
        'd':'q',
        ')':'r',
        '.':'s',
        '5':'t',
        '&':'u',
        'A':'v',
        'I':'w',
        ']':'x',
        '0':'y',
        'a':'z',
        // uppercase letters; entries with an empty key are replacements that are
        // still missing, and they never match because lookups use one character
        'D':'A',
        '?':'B',
        '(':'C',
        '':'D',
        ':':'E',
        '1':'F',
        '9':'G',
        '4':'H',
        'N':'I',
        'B':'J',
        '':'K',
        'E':'L',
        '6':'M',
        'K':'N',
        'L':'O',
        '>':'P',
        '`':'Q',
        'M':'R',
        'X':'S',
        '!':'T',
        '':'U',
        'c':'V',
        '\\':'W',
        'b':'X',
        '':'Y',
        '':'Z',

        // digits
        'n':'9',

        // whitespace and punctuation
        '$':' ',
        ' ':'\r',
        '\n':'\n',
        '_':'—',
        'C':'.',
        'S':',',
        '^':"'",
        'm':'‘',
        '^':'’', // duplicate key: this later '^' entry overrides the one above
        'f':'“',
        'g':'”',
        'Y':'-',
        'k':'!',
        '=':':',
        'e':'…'
    };

    var text = '';
    var given = document.getElementById('start').value;
    // Walk the input once, translating one character at a time.
    for (var a=0, b=given.length; a<b; ++a) {
        var ch = given.charAt(a);
        if (t[ch] !== undefined) {
            text += t[ch];
        }
        else {
            // Report characters that have no replacement yet.
            alert('Missing transform for: "'+ ch +'"');
        }
    }
    document.getElementById('output').innerHTML = text;
}

//]]>
</script>
</head>
<body>
<p><textarea id="start" cols="55" rows="12"></textarea></p>
<p><button onclick="Trans();">Transform</button></p>
<div id="output"></div>
</body>
</html>

Sayer
Jun 20, 2009, 10:36 AM
It just looks like someone made a custom font that displays '#' as an 'e' and so on, and some kind of pre-processor converts the normal text into this format. The custom font is embedded in the PDF, and thus you see the correct text displayed even though the source text is gibberish.

It's really just a simple 1:1 substitution scheme, but effective, as you found out.
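
If that guess is right, the pre-processor would amount to the decoding table turned around. The following is only a sketch of that assumption, not the publisher's actual tool; invert and applyTable are hypothetical names, and the pairs come from the "youngest" example at the top of the thread.

```javascript
// Decoding pairs taken from the "youngest" example in the first post.
var decode = {
  '0': 'y', '*': 'o', '&': 'u', "'": 'n',
  'F': 'g', '#': 'e', '.': 's', '5': 't'
};

// Turn a decoding table around: plain letter -> garbled character.
function invert(table) {
  var inverse = {};
  for (var key in table) {
    if (table.hasOwnProperty(key)) {
      inverse[table[key]] = key;
    }
  }
  return inverse;
}

// Apply a table one character at a time, leaving unknown characters alone.
function applyTable(table, text) {
  var out = '';
  for (var i = 0; i < text.length; i++) {
    var ch = text.charAt(i);
    out += table.hasOwnProperty(ch) ? table[ch] : ch;
  }
  return out;
}

var encode = invert(decode);
var garbledWord = applyTable(encode, 'youngest');  // "0*&'F#.5"
var roundTrip = applyTable(decode, garbledWord);   // 'youngest'
```

Inverting only works cleanly because the scheme is 1:1; if two garbled characters decoded to the same letter, one of them would be lost.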

jared_kipe
Jun 20, 2009, 12:14 PM
That's great! Is there anything similarly elegant for this kind of substitution in C or Objective-C?

NSRange wholeString = NSMakeRange(0, [myString length]);
[myString replaceOccurrencesOfString:@"2" withString:@"a" options:0 range:wholeString];

All I can think of is doing this over and over, once per character, and that assumes you order the replacements to avoid collisions.

angelwatt
Jun 20, 2009, 12:35 PM
I've never used Objective-C, and only a little C. The easiest way is probably to have two arrays holding the two character sets, then walk through the file and swap each character as you go. I'd recommend writing the result out to a new file, just to make sure you don't screw up the original.
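
The two-array approach could look roughly like this. It is a sketch in JavaScript rather than C, assuming Node.js for the file part; swapCharacters, convertFile, inputPath and outputPath are hypothetical names, and only a handful of pairs from the table posted earlier in the thread are listed.

```javascript
// Two parallel arrays: garbled[i] is replaced by plain[i].
// Only a few pairs from the earlier table are listed here.
var garbled = ['2', '#', '*', '5', '$'];
var plain   = ['a', 'e', 'o', 't', ' '];

// Walk through the text once and swap each character as we go.
// Characters that are not in the garbled array are copied unchanged.
function swapCharacters(text) {
  var out = '';
  for (var i = 0; i < text.length; i++) {
    var ch = text.charAt(i);
    var idx = garbled.indexOf(ch);
    out += idx === -1 ? ch : plain[idx];
  }
  return out;
}

// Read one file and write the result to a different file,
// so the original is never modified (Node.js only).
function convertFile(inputPath, outputPath) {
  var fs = require('fs');
  var text = fs.readFileSync(inputPath, 'utf8');
  fs.writeFileSync(outputPath, swapCharacters(text), 'utf8');
}

var sample = swapCharacters('5*#$5*#'); // 'toe toe'
```

Copying unknown characters through unchanged is a design choice for this sketch; the HTML version above alerts on them instead, which is more useful while the table is still incomplete.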

If you work straight from the PDF file, you'll need something that can handle the PDF format, as there's a lot of extra data in there that you wouldn't want to run the substitution on. That's why I used the copy-and-paste method; it made things easier.

jared_kipe
Jun 20, 2009, 04:14 PM
Very true, it would probably be safer and easier to have a loop that goes through each character of the text linearly and makes the substitution.

There would be no collisions this way.

Using PDFKit I could probably get the text contents of the PDF and make the switch. I've never worked on such a large chunk of data, only relatively short strings, so it should be interesting.

Does anybody know of a way to abstract this by making a text-encoding profile or something similar, so that a program such as TextWrangler could convert it on the fly?

GorillaPaws
Jun 20, 2009, 07:42 PM
Can you just OCR it?

jared_kipe
Jun 21, 2009, 12:01 AM
That's actually an interesting possibility. I have sophisticated OCR software in my Windows virtual machine for doing just that on books I scan in myself.

I would need to generate an image of each page, however. I wonder if Preview has an option to export every page as a JPG. I'll look into it.


EDIT: Looks like Acrobat has the option. Trying now.

EDIT2: That worked really well.