Old Jun 19, 2009, 03:33 PM   #1
jared_kipe
macrumors 68030
Join Date: Dec 2003
Location: Seattle
What text encoding is in this PDF?

Adobe Acrobat Pro cannot export this to .doc or rtf because of "unsupported Type 2 font".

When you copy and paste text out of it, you just get garbage.

If you look at the clipboard viewer, it's essentially the same garbage. I've tried putting it in TextWrangler and trying random text encodings, but nothing seems to fix it.

As an example, 0*&'F#.5 == youngest

Here is a single page from it.
Attached Files
File Type: pdf testpage.pdf (57.7 KB, 279 views)
Old Jun 19, 2009, 04:31 PM   #2
telecomm
macrumors 65816
Join Date: Nov 2003
Location: Rome
Looks like Brian Herbert has found an effective DRM solution.
Old Jun 19, 2009, 04:34 PM   #3
angelwatt
Moderator emeritus
Join Date: Aug 2005
Location: USA
You know what's even funnier? Have your Mac speak the PDF: it speaks the garbled form. This doesn't look like an encoding issue; it may be a copy-protection technique to keep you from copying the text out.
Old Jun 19, 2009, 05:58 PM   #4
jared_kipe
Thread Starter
Yes, my guess is that the embedded font uses the same odd text encoding.

Thus, if you typed normal English with it, it would come out garbled.

In theory, you could probably do a one-to-one substitution if you were able to get the text into a string.

Say you had a string that you know comes out to "many decorations", but is actually ";2'0$7#,*)25+*'."

Then you could replace all the 2s with a's, and so on.
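
A known pair like the one above is enough to start a table. Here is a minimal sketch in JavaScript, assuming the garbled string and the readable string line up character for character; the function name deriveTable is made up for illustration.

```javascript
// Build a substitution table from a garbled string and its known plain text.
// Assumption: the two strings correspond position by position.
function deriveTable(garbled, plain) {
  if (garbled.length !== plain.length) {
    throw new Error('Strings must be the same length');
  }
  var table = {};
  for (var i = 0; i < garbled.length; i++) {
    var from = garbled.charAt(i);
    var to = plain.charAt(i);
    // The same garbled character must always map to the same letter.
    if (Object.prototype.hasOwnProperty.call(table, from) && table[from] !== to) {
      throw new Error('Inconsistent mapping for "' + from + '"');
    }
    table[from] = to;
  }
  return table;
}

// The pair quoted in this post.
var table = deriveTable(";2'0$7#,*)25+*'.", 'many decorations');
```

Each additional known pair would add more entries, and the consistency check would flag a pair that does not fit a simple one-to-one scheme.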

I actually just did this on the single page, and it seems to fix it just fine. Some care would be needed so you don't replace letters accidentally, for example by replacing the 2s with a's and then replacing the a's with F's.

It would probably be safer and easier to do all the replacements at the same time: either track down all the one-to-one substitutions, or figure out the text encoding and then convert the file to the new encoding.

Any ideas how?
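
To make the collision concrete, here is a small sketch in JavaScript. It uses the two replacements from the example above (2 to a, then a to F); the helper names are hypothetical, and this is only one way to do the single-pass version.

```javascript
// Two replacements applied in order: 2 -> a, then a -> F.
var pairs = [['2', 'a'], ['a', 'F']];

// Naive approach: run each replacement over the whole string in turn.
function replaceSequentially(text, pairs) {
  var out = text;
  for (var i = 0; i < pairs.length; i++) {
    out = out.split(pairs[i][0]).join(pairs[i][1]);
  }
  return out;
}

// Single pass: look at each original character exactly once.
function replaceInOnePass(text, pairs) {
  var table = {};
  for (var i = 0; i < pairs.length; i++) {
    table[pairs[i][0]] = pairs[i][1];
  }
  var out = '';
  for (var j = 0; j < text.length; j++) {
    var ch = text.charAt(j);
    out += Object.prototype.hasOwnProperty.call(table, ch) ? table[ch] : ch;
  }
  return out;
}

var sequential = replaceSequentially('2a', pairs); // "FF": the new a was replaced again
var onePass = replaceInOnePass('2a', pairs);       // "aF": each character translated once
```

The single-pass version never re-reads its own output, so the order of the pairs stops mattering.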
Old Jun 19, 2009, 10:03 PM   #5
angelwatt
I made a quick replacer using JavaScript. You just copy and paste the text from the PDF into the text box, then hit the button and it'll show the transformed text. I was only able to work with the characters that existed on the page you provided, so some replacements are missing. You can either improve upon the code or do it in another language you're more comfortable with.

Just put the following into a file and save with a .html extension.
[HTML]<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
<title>Transform text</title>
<meta http-equiv="content-type" content="text/html; charset=utf-8" />
<script type="text/javascript">
//<![CDATA[
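function Trans() {
// Lookup table: garbled character -> real character.
// Entries with an empty key are placeholders for letters whose
// garbled character is not known; they never match any input.
var t = {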
'2':'a',
'/':'b',
',':'c',
'7':'d',
'#':'e',
'8':'f',
'F':'g',
'"':'h',
'+':'i',
'[':'j',
'3':'k',
'-':'l',
';':'m',
"'":'n',
'*':'o',
'<':'p',
'd':'q',
')':'r',
'.':'s',
'5':'t',
'&':'u',
'A':'v',
'I':'w',
']':'x',
'0':'y',
'a':'z',
'D':'A',
'?':'B',
'(':'C',
'':'D',
':':'E',
'1':'F',
'9':'G',
'4':'H',
'N':'I',
'B':'J',
'':'K',
'E':'L',
'6':'M',
'K':'N',
'L':'O',
'>':'P',
'`':'Q',
'M':'R',
'X':'S',
'!':'T',
'':'U',
'c':'V',
'\\':'W',
'b':'X',
'':'Y',
'':'Z',

'n':'9',

'$':' ',
' ':'\r',
'\n':'\n',
'_':'—',
'C':'.',
'S':',',
'm':'‘',
'^':'’', // '^' was listed twice; the later entry won, so only it is kept
'f':'“',
'g':'”',
'Y':'-',
'k':'!',
'=':':',
'e':'…'
};

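// Walk the pasted text one character at a time so that each character
// is translated exactly once (no chained replacements).
var text = '';
var missing = [];
var given = document.getElementById('start').value;
for (var a=0, b=given.length; a<b; ++a) {
var ch = given.charAt(a);
if (t.hasOwnProperty(ch)) {
text += t[ch];
}
else if (missing.indexOf(ch) === -1) {
missing.push(ch); // report each unknown character once, after the loop
}
}
if (missing.length > 0) {
alert('Missing transform for: "' + missing.join('", "') + '"');
}
document.getElementById('output').innerHTML = text;
}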

//]]>
</script>
</head>
<body>
<p><textarea id="start" cols="55" rows="12"></textarea></p>
<p><button onclick="Trans();">Transform</button></p>
<div id="output"></div>
</body>
</html>
[/HTML]

Last edited by angelwatt; Jun 20, 2009 at 12:51 PM. Reason: typo
Old Jun 20, 2009, 10:36 AM   #6
Sayer
macrumors 6502a
Join Date: Jan 2002
Location: Austin, TX
It just looks like someone made a custom font that displays '#' as an 'e' and so on, with some kind of pre-processor converting the normal text into this format. The custom font is embedded in the PDF, so you see the correct text displayed even though the source text is gibberish.

It's really just a simple 1:1 substitution scheme, but effective, as you found out.
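
If it really is 1:1, the "pre-processor" direction is just the inverse of the decode table. A rough sketch in JavaScript, using a handful of pairs quoted in this thread; the function names are invented here, and how the PDF was actually produced is unknown.

```javascript
// Partial table (garbled -> readable) built from pairs quoted in this thread.
var decodeTable = { '#': 'e', '2': 'a', '0': 'y', "'": 'n', '$': ' ' };

// Swap keys and values; this only works if the table is truly one-to-one.
function invert(table) {
  var inverse = {};
  for (var key in table) {
    if (Object.prototype.hasOwnProperty.call(table, key)) {
      var value = table[key];
      if (Object.prototype.hasOwnProperty.call(inverse, value)) {
        throw new Error('Table is not one-to-one at "' + value + '"');
      }
      inverse[value] = key;
    }
  }
  return inverse;
}

// Translate character by character; unmapped characters pass through.
function translate(text, table) {
  var out = '';
  for (var i = 0; i < text.length; i++) {
    var ch = text.charAt(i);
    out += Object.prototype.hasOwnProperty.call(table, ch) ? table[ch] : ch;
  }
  return out;
}

var encodeTable = invert(decodeTable);          // hypothetical pre-processor table
var garbled = translate('any', encodeTable);    // readable -> garbled
var restored = translate(garbled, decodeTable); // garbled -> readable again
```

If invert throws on the full table, the scheme is not strictly one-to-one and a plain substitution would lose information.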
Old Jun 20, 2009, 12:14 PM   #7
jared_kipe
Quote:
Originally Posted by angelwatt
I made a quick replacer using JavaScript. You just copy and paste the text from the PDF to the text box then hit the button and it'll show the transformed text. I was only able to work with the characters that existed on the page you provided so there are some missing replacements. You can either improve upon the code or do it in another language you're more comfortable with.

Just put the following into a file and save with a .html extension.
That's great! Is there any substitution mechanism that elegant in C or Objective-C? All I can think of is:

NSRange wholeString = NSMakeRange(0, [myString length]);
[myString replaceOccurrencesOfString: @"2" withString: @"a" options: 0 range: wholeString];

repeated over and over, and that assumes you order the replacements to avoid collisions.
Old Jun 20, 2009, 12:35 PM   #8
angelwatt
Quote:
Originally Posted by jared_kipe
That's great! Is there any substitution mechanism that elegant in C or Objective-C? All I can think of is:

NSRange wholeString = NSMakeRange(0, [myString length]);
[myString replaceOccurrencesOfString: @"2" withString: @"a" options: 0 range: wholeString];

repeated over and over, and that assumes you order the replacements to avoid collisions.
I've never used Objective-C and just a little C. The easiest way is probably to have two arrays to hold the two character sets, then walk through the file and swap each character as you go. I'd recommend writing out to a new file so you don't risk damaging the original.

If you work straight from the PDF file, you'll need something that can extract just the text, because the file contains a lot of other data that you wouldn't want to run the substitution on. That's why I used the copy-and-paste method; it made things easier.
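
The two-array idea could look roughly like this. I'm sketching it in JavaScript since that's what I know; the array contents are just a few pairs from this thread and the function name is made up. The same loop should carry over to C with two char arrays.

```javascript
// Two parallel arrays: the character at index i in garbledSet
// stands for the character at index i in plainSet.
var garbledSet = [';', '2', "'", '0', '$'];
var plainSet   = ['m', 'a', 'n', 'y', ' '];

// Walk the text once and swap each character that appears in fromSet.
function swapCharacters(text, fromSet, toSet) {
  var out = [];
  for (var i = 0; i < text.length; i++) {
    var ch = text.charAt(i);
    var index = fromSet.indexOf(ch);
    out.push(index === -1 ? ch : toSet[index]); // leave unknown characters alone
  }
  return out.join('');
}

var result = swapCharacters(";2'0$;2'", garbledSet, plainSet);
```

The result is built as a new string, so the input is left untouched, which fits the advice about writing to a new file.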
Old Jun 20, 2009, 04:14 PM   #9
jared_kipe
Very true, it would probably be safer and easier to have a loop that goes through each character of the text linearly and makes the substitution.

There would be no collisions this way.

Using PDFKit I could probably get the text contents of the PDF and make the switch. I've never worked on such a large chunk of data, only relatively short strings, so it should be interesting.

Does anybody know of a way to abstract this by making a text-encoding profile or something similar, so that a program such as TextWrangler could convert it on the fly?
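
I don't know of a real profile format that TextWrangler would accept, but the idea can be approximated by keeping the table as data, separate from the code. A sketch in JavaScript, where the JSON layout, the profile name, and makeDecoder are all assumptions made up for illustration:

```javascript
// Hypothetical "profile": the substitution table stored as plain JSON.
// The pairs come from the "0*&'F#.5 == youngest" example in the first post.
var profileJson = JSON.stringify({
  name: 'testpage-font', // made-up profile name
  map: { '0': 'y', '*': 'o', '&': 'u', "'": 'n', 'F': 'g', '#': 'e', '.': 's', '5': 't' }
});

// Turn a profile into a reusable decoder function.
function makeDecoder(json) {
  var map = JSON.parse(json).map;
  return function (text) {
    var out = '';
    var unknown = [];
    for (var i = 0; i < text.length; i++) {
      var ch = text.charAt(i);
      if (Object.prototype.hasOwnProperty.call(map, ch)) {
        out += map[ch];
      } else {
        out += ch; // keep the character so gaps in the profile stay visible
        if (unknown.indexOf(ch) === -1) {
          unknown.push(ch);
        }
      }
    }
    return { text: out, unknown: unknown };
  };
}

var decodeTestpage = makeDecoder(profileJson);
var decoded = decodeTestpage("0*&'F#.5");
```

A different PDF with a different scrambled font would then only need a new JSON file, not new code.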
Old Jun 20, 2009, 07:42 PM   #10
GorillaPaws
macrumors 6502a
Join Date: Oct 2003
Location: Richmond, VA
can you just OCR it?
Old Jun 21, 2009, 12:01 AM   #11
jared_kipe
Quote:
Originally Posted by GorillaPaws
can you just OCR it?
That's actually an interesting possibility. I have sophisticated OCR software in my Windows virtual machine for doing just that on books I scan in myself.

I would need to generate an image of each page, however. I wonder if Preview has an option to export every page as a JPG. I'll look into it.

EDIT: Looks like Acrobat has the option. Trying now.

EDIT2: That worked really well.

Last edited by jared_kipe; Jun 21, 2009 at 11:38 AM.