Why AI still struggles with PDF files
PDFs were never designed to be machine-readable

Feb 24, 2026
09:38 am

What's the story

Artificial Intelligence (AI) has made significant strides in recent years, but it still struggles with one of the most common file formats: PDF. The problem was highlighted when tech entrepreneur Luke Igel and his friends tried to navigate through 20,000 pages of documents related to Jeffrey Epstein's estate. They found the PDF viewer difficult to use, and searching for specific information was challenging due to poor OCR quality.

Parsing challenges

'Unsexy failures' of AI

Despite the rapid progress in AI's ability to solve complex software and physics problems, PDF remains a major challenge. Edwin Chen, CEO of data company Surge, has called it one of AI's "unsexy failures" that limit its real-world applicability. He discovered that even advanced models tasked with extracting information from a PDF often summarize it instead or confuse footnotes with body text.

Extraction difficulties

Why PDFs are so difficult for AI

Extracting information from PDFs is more complicated than it seems. This is because PDFs were never designed to be machine-readable. They were developed by Adobe in the early '90s as a way to share documents while preserving their exact visual appearance, first on paper and later on screens. Unlike formats like HTML that represent text in logical order, PDF consists of character codes, coordinates, and other instructions for rendering an image of a page.
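To make the point about coordinates concrete, here is a minimal sketch in Python. It does not parse a real PDF: the `TextRun` type, the `page_runs` data, and both helper functions are hypothetical, chosen only to illustrate how text stored as positioned fragments can come out scrambled when read in storage order.

```python
from typing import List, NamedTuple


class TextRun(NamedTuple):
    """Hypothetical positioned text fragment, loosely modeled on what a PDF page stores."""
    x: float   # horizontal position in points from the left edge
    y: float   # vertical position in points from the bottom edge (y grows upward)
    text: str


# Assumed example data: the producer wrote the second line before finishing the first.
page_runs: List[TextRun] = [
    TextRun(72, 700, "PDFs were never"),
    TextRun(72, 686, "machine-readable."),
    TextRun(170, 700, "designed to be"),
]


def in_stream_order(runs: List[TextRun]) -> str:
    """Join fragments in the order they are stored, ignoring position."""
    return " ".join(run.text for run in runs)


def in_visual_order(runs: List[TextRun]) -> str:
    """Join fragments top to bottom, then left to right, using their coordinates."""
    ordered = sorted(runs, key=lambda run: (-run.y, run.x))
    return " ".join(run.text for run in ordered)


print(in_stream_order(page_runs))
print(in_visual_order(page_runs))
```

Sorting by position recovers the sentence here only because the toy page has a single column; real pages with columns, footnotes, and tables need more than a coordinate sort.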

OCR challenges

OCR can convert images to text but struggles with layouts

Optical character recognition (OCR) can convert these images of words back into machine-readable text, but it struggles with the multi-column layouts common in academic papers, where text read straight across the page comes out jumbled. When given a PDF, an AI assistant like ChatGPT tries various tools for extraction, but the results are often uneven due to the inherent difficulties of the format.
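The multi-column problem can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the `ocr_lines` data stands in for OCR output, and the fixed `split_x` boundary is a shortcut, since a real system would have to detect where the columns are.

```python
from typing import List, Tuple

# (x, y, text) for one recognized line; y grows upward, as in PDF page space.
OcrLine = Tuple[float, float, str]

# Hypothetical OCR output for a page with two columns of two lines each.
ocr_lines: List[OcrLine] = [
    (50, 700, "Left column, line 1."),
    (320, 700, "Right column, line 1."),
    (50, 680, "Left column, line 2."),
    (320, 680, "Right column, line 2."),
]


def read_row_wise(lines: List[OcrLine]) -> List[str]:
    """Read straight across the page, top to bottom; this interleaves the columns."""
    ordered = sorted(lines, key=lambda line: (-line[1], line[0]))
    return [text for _, _, text in ordered]


def read_column_wise(lines: List[OcrLine], split_x: float = 300.0) -> List[str]:
    """Read the left column top to bottom, then the right column.

    split_x is an assumed, hard-coded column boundary, not a detected one.
    """
    left = sorted((line for line in lines if line[0] < split_x), key=lambda line: -line[1])
    right = sorted((line for line in lines if line[0] >= split_x), key=lambda line: -line[1])
    return [text for _, _, text in left + right]


print(read_row_wise(ocr_lines))
print(read_column_wise(ocr_lines))
```

The row-wise reading alternates between columns on every line, which is the kind of jumbled output described above; the column-wise reading works only because the boundary was supplied by hand.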

Training issues

Models rarely trained on PDFs

Another problem is that models are rarely trained on PDFs. This is changing as AI developers look for high-quality data, and PDFs provide a lot of it: government reports, textbooks, and academic papers are all commonly published in PDF format. Researchers at the Allen Institute for AI last year suggested that "PDF documents have the potential to provide trillions of novel high-quality tokens for training language models."
