Date: 2026-02-24

Artificial intelligence (AI) has made significant strides in recent years, but it still struggles with one of the most common file formats: PDF. The problem was highlighted when tech entrepreneur Luke Igel and his friends tried to navigate through 20,000 pages of documents related to Jeffrey Epstein's estate. They found the PDF viewer difficult to use, and searching for specific information was challenging due to poor OCR quality.

Parsing challenges
'Unsexy failures' of AI

Despite rapid progress in AI's ability to solve complex software and physics problems, PDFs remain a major challenge. Edwin Chen, CEO of data company Surge, has called PDF parsing one of AI's "unsexy failures" that limit its real-world applicability. He found that even advanced models tasked with extracting information from a PDF often summarize it instead or confuse footnotes with body text.

Extraction difficulties
Why PDFs are so difficult for AI

Extracting information from PDFs is more complicated than it seems because PDFs were never designed to be machine-readable. Adobe developed the format in the early 1990s as a way to share documents while preserving their exact visual appearance, first on paper and later on screens. Unlike formats such as HTML, which represent text in logical order, a PDF consists of character codes, coordinates, and other instructions for rendering an image of a page.
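To make the difference concrete, here is a minimal Python sketch of a page stored as positioned pieces of text rather than as a logical sequence. The `TextRun` class, the sample coordinates, and both helper functions are hypothetical and only illustrate the idea; they are not how any particular PDF library represents a page. For simplicity, `y` is measured from the top of the page, whereas real PDFs by default measure from the bottom-left corner.

```python
from dataclasses import dataclass


@dataclass
class TextRun:
    """One positioned piece of text, loosely modeled on what a PDF stores."""
    text: str
    x: float  # distance from the left edge of the page, in points
    y: float  # distance from the top edge (a simplification, see above)


def stored_order_text(runs):
    # Join the runs in the order they appear in the file.
    return " ".join(run.text for run in runs)


def reading_order_text(runs):
    # Sort top-to-bottom, then left-to-right. This is only adequate for a
    # single-column page with cleanly separated lines.
    ordered = sorted(runs, key=lambda run: (run.y, run.x))
    return " ".join(run.text for run in ordered)


# Hypothetical page: the last line happens to be drawn first.
page_runs = [
    TextRun("machine-readable.", x=72, y=120),
    TextRun("PDFs were never", x=72, y=100),
    TextRun("designed to be", x=180, y=100),
]

print(stored_order_text(page_runs))   # file order, scrambled for a reader
print(reading_order_text(page_runs))  # order recovered from coordinates
```

Nothing in the stored data says which run comes first for a human reader, so an extractor has to infer the order from positions alone.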

OCR challenges
OCR can convert images to text but struggles with layouts

Optical character recognition (OCR) can convert these images of words back into machine-readable text, but it struggles with multi-column layouts often found in academic papers. This leads to jumbled outputs. When given a PDF, an AI assistant like ChatGPT tries various tools for extraction, but the results are often uneven due to the inherent difficulties of the format.
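The jumbling on multi-column pages can be reproduced with a small Python sketch. It assumes recognized lines arrive with page coordinates and compares a naive top-to-bottom ordering with one that handles each column separately. The `OcrLine` class, the sample positions, and the fixed `split_x` boundary are illustrative assumptions; real layout analysis has to detect column boundaries rather than being told where they are.

```python
from dataclasses import dataclass


@dataclass
class OcrLine:
    """A recognized line of text with the position of its left edge."""
    text: str
    x: float  # distance from the left edge of the page
    y: float  # distance from the top edge of the page


def naive_order(lines):
    # Read strictly top-to-bottom, then left-to-right, ignoring columns.
    return [line.text for line in sorted(lines, key=lambda l: (l.y, l.x))]


def column_aware_order(lines, split_x):
    # Assumes exactly two columns separated at split_x (hypothetical value).
    left = [line for line in lines if line.x < split_x]
    right = [line for line in lines if line.x >= split_x]
    return naive_order(left) + naive_order(right)


# Hypothetical two-column page, as in many academic papers.
two_column_page = [
    OcrLine("Column A, line 1", x=50, y=100),
    OcrLine("Column B, line 1", x=320, y=100),
    OcrLine("Column A, line 2", x=50, y=115),
    OcrLine("Column B, line 2", x=320, y=115),
]

print(naive_order(two_column_page))              # columns interleaved
print(column_aware_order(two_column_page, 300))  # column A, then column B
```

The naive ordering alternates between the two columns on every line, which is the kind of jumbled output described above.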

