PDF: optimised for a different problem

July 31, 2008

An extremely interesting post by Jim King (PDF architect and Senior Principal Scientist at Adobe), explaining why getting text back out of PDF is a pain. In his words:

To extract text from PDF documents is a rather difficult and a highly technical task…

We’ve had plenty of problems with PDF in our data mining efforts. Leaving aside the basic problem that most of the data we care about can’t be encoded in PDF at all, extracting the text has been troublesome too. Jim explains some of the reasons why and concludes:

The PDF design is very tailored to the creator being able to quite directly and without ambiguity, specify the exact output desired. That is a strong virtue for PDF and the price of more difficult text extraction is a price worth paying for that design.

PDF is not designed primarily for text. In PDF, typography trumps text.
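
To make the point concrete, here is a minimal sketch of the kind of guesswork an extractor is left with. It assumes the page arrives as individually positioned glyphs with no space characters and no guaranteed drawing order. The `Glyph` type, the `rebuild_lines` function and both thresholds are hypothetical, invented for illustration; they are not taken from Jim's post or from any PDF library.

```python
from dataclasses import dataclass


@dataclass
class Glyph:
    """One positioned character (hypothetical type, for illustration only)."""
    char: str
    x: float      # left edge, in points
    y: float      # baseline, in points
    width: float  # advance width, in points


def rebuild_lines(glyphs, gap_ratio=0.3, line_tolerance=2.0):
    """Guess reading-order text from positioned glyphs.

    gap_ratio and line_tolerance are illustrative thresholds (assumptions),
    not values taken from any specification.
    """
    # Group glyphs into lines by baseline, top of the page first
    # (in PDF's default coordinate system y grows upwards).
    lines = []
    for g in sorted(glyphs, key=lambda g: -g.y):
        if lines and abs(lines[-1][0].y - g.y) <= line_tolerance:
            lines[-1].append(g)
        else:
            lines.append([g])

    out = []
    for line in lines:
        # Drawing order is not reading order, so sort left to right.
        line.sort(key=lambda g: g.x)
        text = line[0].char
        for prev, cur in zip(line, line[1:]):
            gap = cur.x - (prev.x + prev.width)
            # No space was drawn; infer one from the horizontal gap.
            if gap > gap_ratio * prev.width:
                text += " "
            text += cur.char
        out.append(text)
    return out
```

Even this toy version has to guess at line breaks and word boundaries, and it ignores rotated text, columns, ligatures and fonts with no usable character mapping, which is roughly where the real pain starts.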
