There are apps out there that will let you edit PDF files. How does that work?
It works by being a total pain to program.
Trust me, I wrote a simple PDF parser.
A PDF loses all the semantics (meaning) of the underlying document. Inside, the PDF literally says things like:
Code:
there is a sequence "the" at coordinates X, Y
there is a sequence "b" at coordinates A, B
there is a sequence "opportuni" at coordinates U, I
So, for example, to be sure that the PDF contains the sentence "There is a nice dog." you have to group all sequences on the page into batches by their height coordinate. You get a batch like ["There" "nice" "dog" "is" "a" "."], made up of sequences that sit at the same or roughly the same height. That would translate into "the same line", right?
Then you sort them by the other coordinate, from left to right. Your thought-to-be-a-line batch now looks like ["There" "is" "a" "nice" "dog" "."], but you still don't know for sure whether it's really one line inside one paragraph. Imagine the page is the two-column type, where you first read the left column top-down, then the right column: sequences from both columns sit at the same height. So you have to do another check.
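Here is a rough sketch of those two sorting steps in Python. It is not what any real PDF library does, just the idea. The Fragment class, the group_into_lines function and the y_tolerance value are all made up for illustration, and it assumes the default PDF coordinate system where the height coordinate grows upward.

```python
from dataclasses import dataclass


@dataclass
class Fragment:
    """One text sequence as found in the page (hypothetical structure)."""
    text: str
    x: float  # left edge of the sequence
    y: float  # height coordinate of the sequence


def group_into_lines(fragments, y_tolerance=2.0):
    """Batch fragments by height, then sort each batch left to right.

    y_tolerance is an assumed "roughly the same height" threshold.
    """
    lines = []
    # Assumption: y grows upward, so the top of the page has the largest y.
    for frag in sorted(fragments, key=lambda f: -f.y):
        # Compare against the first fragment of the current batch.
        if lines and abs(lines[-1][0].y - frag.y) <= y_tolerance:
            lines[-1].append(frag)
        else:
            lines.append([frag])
    # Second step: inside each batch, order by the other coordinate.
    return [sorted(line, key=lambda f: f.x) for line in lines]


page = [
    Fragment("nice", 40, 700.0),
    Fragment("There", 10, 700.5),
    Fragment("dog", 60, 699.8),
    Fragment("Next", 10, 680.0),
]
for line in group_into_lines(page):
    print([f.text for f in line])
```

Note that this still happily merges the left and right column of a two-column page into one "line", which is exactly the problem described above.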
For every sequence, compute its width from its length and the properties of the font, then add that to its left-right coordinate. What you get is where the sequence ends.
If the very next sequence starts roughly a single space to the right of that, they must be from the same paragraph, with a space between them. If it starts immediately, it is probably even the same word. If it starts way further right, it must be the two-column type or maybe something else, and good luck parsing that.
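The width-and-gap check could look roughly like the sketch below. Everything in it is an assumption for illustration: the function names, the char_widths table (glyph widths in thousandths of the font size, which is how PDF fonts commonly store them), the 500 default width and the slack factor. A real parser has to pull the widths out of the font and deal with a lot more than this.

```python
def fragment_width(text, char_widths, font_size, default_width=500):
    """Width of a sequence: sum of glyph widths scaled by the font size.

    char_widths maps a character to its width in 1/1000 of the font size
    (assumed table; characters not in it get default_width).
    """
    return sum(char_widths.get(ch, default_width) for ch in text) * font_size / 1000.0


def classify_gap(end_x, next_start_x, space_width, slack=0.5):
    """Decide what the distance between two sequences means."""
    gap = next_start_x - end_x
    if gap < space_width * slack:
        return "same word"   # starts immediately after the previous one
    if gap <= space_width * (1 + slack):
        return "space"       # about one space to the right
    return "far"             # another column, a table, or something else


def join_line(fragments, char_widths, font_size):
    """Join (text, x) pairs that are already sorted left to right."""
    space_width = fragment_width(" ", char_widths, font_size)
    pieces = []
    prev_end = None
    for text, x in fragments:
        if prev_end is not None:
            kind = classify_gap(prev_end, x, space_width)
            if kind == "space":
                pieces.append(" ")
            elif kind == "far":
                pieces.append(" | ")  # arbitrary marker for "not the same paragraph"
        pieces.append(text)
        prev_end = x + fragment_width(text, char_widths, font_size)
    return "".join(pieces)


line = [("opportuni", 0), ("ty", 45), ("knocks", 60), ("Other", 200)]
print(join_line(line, char_widths={}, font_size=10))
```

With an empty width table every character is 5 units wide at font size 10, so "opportuni" ends at 45, "ty" starts right there and gets glued on, "knocks" starts one space later, and "Other" is far enough away to be treated as a different column.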
You get the idea, right?