Why AI Models Struggle with Office Files

The formats were designed for desktop printers. The models were designed for text. The mismatch explains more about AI's real limitations than any benchmark.

April 7, 2026 · 10 min read

Upload a Word document to an AI chatbot. Ask it to change a heading, reformat a table, and adjust the margins. Download the result. The fonts are wrong. The table borders disappeared. An image shifted to the wrong page. Ask it to fix those problems and you get a new set of problems. You're now three round trips deep, and the document looks worse than when you started.

This is not an edge case. It's the default experience. The same AI models that pass bar exams, write production-grade code, and synthesize research papers fall apart on a task that sounds trivially simple: edit a document and keep the formatting intact.

The explanation is not that AI is bad at documents. It's that the file formats themselves were never designed to be read or edited by anything other than the software that made them. The problem starts with what a .docx file actually is.

What a .docx File Actually Is

Most people think of a .docx file as a document. It is not. It's a compressed archive containing 10 to 20 separate files. Your actual words live in one file. Formatting rules sit in another. Font information in a third. Document properties, relationship maps, and a package manifest each get their own file too. Open a simple one-page letter in Word and you're looking at over a dozen files behind the scenes before the first paragraph.
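The claim is easy to verify with nothing but a standard library. Here's a minimal sketch in Python: it builds a tiny stand-in for a .docx package in memory (a real file saved by Word contains many more parts, but the shape is identical) and then lists what's inside, exactly as you could with a genuine .docx.

```python
import io
import zipfile

# Build a minimal stand-in for a .docx package in memory. A real file
# saved by Word contains many more parts, but the structure is the same:
# an ordinary ZIP archive with the document's pieces as separate files.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("[Content_Types].xml", "<Types/>")            # package manifest
    archive.writestr("_rels/.rels", "<Relationships/>")            # relationship map
    archive.writestr("word/document.xml", "<w:document/>")         # your actual words
    archive.writestr("word/styles.xml", "<w:styles/>")             # formatting rules
    archive.writestr("word/fontTable.xml", "<w:fonts/>")           # font information
    archive.writestr("docProps/core.xml", "<cp:coreProperties/>")  # metadata

# The "document" is really a list of files. For a genuine .docx,
# pass its path to zipfile.ZipFile and call namelist() the same way.
with zipfile.ZipFile(buffer) as archive:
    for name in archive.namelist():
        print(name)
```

Rename any .docx to .zip and open it, and you'll see the same layout: the words in one part, everything that governs their appearance scattered across the rest.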

The code inside those files is verbose in a way that's hard to appreciate without seeing it. In October 2025, The Document Foundation published a detailed analysis comparing how different formats represent the same content. They used Shakespeare's Hamlet as a test case, plain text with no formatting at all. The original text is 5,566 lines. Saved as a .docx in Microsoft Word, the underlying code expands to over 60,000 lines. That's roughly an 11x expansion for content with zero formatting. Every paragraph gets wrapped in repeated structural tags, tracking codes, and style references. The actual words are buried inside layers of format scaffolding that only Microsoft Word knows how to read efficiently.
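To see where the expansion comes from, here is roughly what a single unformatted line of Hamlet looks like inside word/document.xml. The fragment is simplified and the revision-tracking IDs are invented for illustration, but the shape is faithful: one line of text wrapped in paragraph properties, run properties, and editing-session markers.

```xml
<w:p w14:paraId="3A5F2C1B" w14:textId="77777777"
     w:rsidR="00B24F5D" w:rsidRDefault="00B24F5D">
  <w:pPr>
    <w:pStyle w:val="Normal"/>
    <w:rPr>
      <w:rFonts w:ascii="Calibri" w:hAnsi="Calibri"/>
    </w:rPr>
  </w:pPr>
  <w:r>
    <w:rPr>
      <w:rFonts w:ascii="Calibri" w:hAnsi="Calibri"/>
    </w:rPr>
    <w:t>To be, or not to be, that is the question:</w:t>
  </w:r>
</w:p>
```

Multiply that wrapper by every line in the play and the 11x expansion stops being surprising.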

For a human opening the file in Word, none of this matters. Word handles the complexity behind the scenes. But for an AI model trying to understand and modify the document, all of it matters. AI models process text through a finite window of attention. Every line of format code that fills that window is a line that can't be spent understanding what you actually asked for. It's like asking someone to edit a book, but handing them the book wrapped in ten times its weight in packaging instructions.

Spreadsheets and Slides Are Worse

Excel files take the complexity further with a design choice that borders on adversarial for automated processing. The text in your cells and the grid of cells that holds it are stored in two completely separate files inside the archive. If a cell contains "Q1 Revenue," the spreadsheet data file doesn't actually contain that text. It contains a reference number that points to a separate list where the text is stored. To reconstruct what a spreadsheet actually says, you need to cross-reference two different files by position. There's no direct link between them. It's like a book where the chapters are in one volume and the chapter titles are in another, matched only by page number.
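A sketch of that cross-referencing, using simplified stand-ins for the two real parts of an .xlsx archive. The sheet part stores no text at all, only index numbers (the `t="s"` attribute marks a shared-string cell); the text itself lives in a separate shared-strings part, and recovering what the spreadsheet says means joining the two by position:

```python
import xml.etree.ElementTree as ET

NS = {"m": "http://schemas.openxmlformats.org/spreadsheetml/2006/main"}

# Simplified sheet part: cells hold index numbers, not text.
SHEET_XML = """<worksheet xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main">
  <sheetData>
    <row r="1">
      <c r="A1" t="s"><v>0</v></c>
      <c r="B1" t="s"><v>1</v></c>
    </row>
  </sheetData>
</worksheet>"""

# Simplified shared-strings part: the text, stored by position.
SHARED_STRINGS_XML = """<sst xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main">
  <si><t>Q1 Revenue</t></si>
  <si><t>Q2 Revenue</t></si>
</sst>"""

# Build the lookup table from the shared-strings part...
strings = [si.find("m:t", NS).text
           for si in ET.fromstring(SHARED_STRINGS_XML).findall("m:si", NS)]

# ...then resolve each cell's index against it to recover the text.
cells = {}
sheet_ns = "{http://schemas.openxmlformats.org/spreadsheetml/2006/main}"
for c in ET.fromstring(SHEET_XML).iter(sheet_ns + "c"):
    idx = int(c.find("m:v", NS).text)
    cells[c.get("r")] = strings[idx]

print(cells)  # {'A1': 'Q1 Revenue', 'B1': 'Q2 Revenue'}
```

Nothing in the sheet part alone tells you what cell A1 says; lose or reorder the shared-strings part and every label in the workbook becomes a meaningless integer.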

It gets stranger. Dates aren't stored as dates. They're stored as plain numbers counting the days since January 1, 1900, carrying forward a leap-year bug from 1980s-era Lotus 1-2-3 that was kept for backward compatibility. A date that reads "March 15, 2026" in the spreadsheet appears as the number 46,096 in the underlying file. Empty cells are stored explicitly too, meaning a spreadsheet with ten rows of actual data can contain thousands of blank cell entries underneath. The format stores what isn't there alongside what is.
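The conversion itself is small but full of history. A sketch in Python: the standard trick anchors the count at December 30, 1899 rather than January 1, 1900, which absorbs both the off-by-one epoch and the phantom February 29, 1900 for any date after February 1900.

```python
from datetime import date, timedelta

def excel_serial_to_date(serial: int) -> date:
    """Convert an Excel date serial number to a real date.

    Excel counts days from an epoch of January 1, 1900 (serial 1), but it
    also believes the nonexistent February 29, 1900 was a real day, a bug
    inherited from Lotus 1-2-3. Anchoring at December 30, 1899 compensates
    for both quirks for any date from March 1900 onward.
    """
    return date(1899, 12, 30) + timedelta(days=serial)

print(excel_serial_to_date(46096))  # 2026-03-15
```

A model reading the raw file sees only `46096`; whether that's a date, a price, or a row count depends on formatting rules stored in yet another part of the archive.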

PowerPoint files are where the format complexity reaches its peak. Each slide is a separate file inside the archive. The colors used in your slides live in yet another file, and they aren't stored as simple color values. They're stored as references to a theme, with mathematical modifiers applied on top. The positioning system uses a unit of measurement where 914,400 units equal one inch. There are nearly 150 named colors in the specification, each with its own label. A single shape on a slide can pull its fill color from the theme file, its font from a separate file, and its position from a third file. The Document Foundation's analysis found that even the internal structure of a PowerPoint file differs unnecessarily from Word and Excel files, with no technical justification for the inconsistency.
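The 914,400-per-inch unit is the English Metric Unit (EMU) from the OOXML drawing specification. The constants below are the standard derived values; the helper function is just an illustration of the arithmetic every tool (and every AI-generated script) has to get right before a shape lands where you put it.

```python
# English Metric Units (EMUs), the positioning unit in OOXML drawing code.
EMU_PER_INCH = 914_400
EMU_PER_POINT = 12_700   # 914,400 / 72
EMU_PER_CM = 360_000     # 914,400 / 2.54

def inches_to_emu(inches: float) -> int:
    """Convert inches to the integer EMU values stored in slide XML."""
    return round(inches * EMU_PER_INCH)

# A shape 1 inch from the left edge and 2.5 inches down is stored as:
print(inches_to_emu(1))    # 914400
print(inches_to_emu(2.5))  # 2286000
```

The unit was chosen so that inches, points, and centimeters all divide evenly into it, which is elegant for software and hostile for anything trying to read coordinates like `x="2286000"` at a glance.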

|  | Word (.docx) | Excel (.xlsx) | PowerPoint (.pptx) |
| --- | --- | --- | --- |
| What it actually is | Compressed archive of 10-20+ files | Compressed archive of 10-20+ files | Compressed archive of 20-50+ files |
| Where your content lives | One main file among many | Split across two separate files | Separate file for every slide |
| How it measures things | Points and half-points | Character widths and points | 914,400 units per inch |
| How it handles colors | Color codes plus theme references | Color codes plus theme references | ~150 named colors, theme math, multiple color systems |
| Other files each piece depends on | 3-5 (styles, fonts, relationships) | 3-5 (string tables, styles, references) | 5-10+ (themes, layouts, masters, relationships) |
Based on the ECMA-376 Office Open XML specification. File counts vary by document complexity.

What This Means for AI

When you upload a PowerPoint file to an AI chatbot, the model gets one of two things. Either it sees the raw code from inside the archive, where thousands of lines of format instructions surround every actual word of content. Or it gets a stripped-down text extraction that throws away all the formatting, layout, and visual structure, leaving the model with no way to preserve how the document looks. Neither version gives the model what it needs to do reliable work.

Research backs this up. A 2025 study by Chroma tested 18 major AI models and found that performance degrades consistently as the amount of input text grows. Even on simple tasks like finding a specific piece of information in a document, accuracy drops as the input gets longer. The study also found something counterintuitive: AI models perform worse on structured, organized content than on randomly shuffled text. The internal structure of format code, with its nested layers and cross-references, actively confuses the model's ability to focus on what matters.

Office file formats are almost purpose-built to trigger this problem. The 11x overhead from a simple Word document means that for every line of actual content the model needs to think about, ten more lines of format packaging compete for its attention. Scale that to a 40-slide presentation with charts, custom colors, and embedded images, and the ratio gets significantly worse. The model spends its capacity processing format structure instead of understanding what you want. Reliability doesn't degrade gracefully. It falls off a cliff as document complexity increases.

The Reconstruction Problem

AI chatbots don't edit documents the way you'd expect. They reconstruct them from scratch. When you ask a chatbot to change a heading in a Word file, it doesn't open the file the way Word does. It writes a small program to unpack the archive, find the heading in the code, change it, repackage everything, and produce a brand new file. Every edit is a full demolition and rebuild of the entire document.
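The demolition-and-rebuild cycle can be sketched in a few lines. This is a deliberately naive illustration, not any vendor's actual pipeline: `rebuild_with_edit` is a hypothetical helper that unpacks the archive, swaps one string in one part, and repackages everything else byte for byte.

```python
import io
import zipfile

def rebuild_with_edit(src: bytes, part: str, old: str, new: str) -> bytes:
    """Rebuild an entire OOXML package just to change one string.

    There is no in-place edit: every part is read out of the old archive
    and written into a new one, with the single changed part swapped in.
    Each part copied is a chance for something to be lost in transit.
    """
    out = io.BytesIO()
    with zipfile.ZipFile(io.BytesIO(src)) as old_zip, \
         zipfile.ZipFile(out, "w", zipfile.ZIP_DEFLATED) as new_zip:
        for name in old_zip.namelist():
            data = old_zip.read(name)
            if name == part:
                data = data.replace(old.encode(), new.encode())
            new_zip.writestr(name, data)
    return out.getvalue()

# Tiny stand-in package (a real .docx has many more parts).
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as z:
    z.writestr("word/document.xml", "<w:t>Draft Report</w:t>")
    z.writestr("word/styles.xml", "<w:styles/>")

edited = rebuild_with_edit(buf.getvalue(), "word/document.xml",
                           "Draft Report", "Final Report")
with zipfile.ZipFile(io.BytesIO(edited)) as z:
    print(z.read("word/document.xml").decode())  # <w:t>Final Report</w:t>
```

Note that even this sketch edits the XML with a raw string replacement. A correct tool has to parse and re-serialize every part it touches, and the re-serialization step is exactly where fonts, borders, and theme-derived colors quietly change.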

The software tools available for this are designed to build documents step by step, setting one property at a time. That's a fundamentally different process from how AI models work. The model has to write a correct sequence of instructions, run them, check the output, and start over if something broke. Each cycle is a chance for something to go wrong. A font that looked fine in the original might not survive the rebuild. A table border might double up or disappear. A color that was defined through the theme system might come back as a flat value that doesn't match the rest of the document.

This isn't speculation. OpenAI's own user forums have documented these issues extensively through 2025 and 2026. Users report broken Word downloads, PowerPoint files that can't be opened at all, formatting loss on every round trip, and processing sessions timing out halfway through. These aren't rare bugs. They're the predictable result of an approach where every interaction requires a complete file rebuild.

Microsoft's own response to this problem is telling. In late 2024, they released MarkItDown, a free tool whose entire purpose is to strip the format noise out of Office files and convert them to plain text before feeding them to AI. It supports Word, PowerPoint, Excel, and other formats. Its existence is an implicit acknowledgment from the company that created these formats that the formats themselves are the bottleneck for AI processing.

Why This Matters for Enterprise

For an individual editing a simple memo, the reliability problem is annoying. You try a few times, manually fix what the AI got wrong, and move on. For an enterprise trying to build AI-assisted document workflows at scale, the same problem becomes a blocker.

Enterprise documents are complex by definition. Legal contracts have precise formatting requirements where a misplaced clause or a changed font carries real consequences. Financial models depend on formula chains, conditional formatting, and cross-sheet references that break silently when reconstructed. Board presentations use branded templates with exact color specifications, chart styles, and layout grids that must match corporate identity standards. These are the documents where AI assistance would deliver the most value, and they're exactly the documents where the rebuild-from-scratch approach fails most consistently.

The reliability curve points in the wrong direction. Simple documents that a person could edit in two minutes survive the AI round trip reasonably well. Complex documents that would actually benefit from AI assistance, the ones that take hours of manual formatting work, are the ones most likely to come back broken. The technology works on the easy things and breaks on the hard things. For any enterprise evaluating AI document workflows, that inversion is the central problem.

The Workarounds and Their Limits

The industry has developed a set of workarounds, and they're worth acknowledging because they're practical and sometimes they're exactly right. Convert your document to Markdown or plain text before sending it to an AI. Ask for spreadsheet data in CSV rather than raw XLSX. Use specialized document AI tools that flatten the format before processing. These approaches work for specific situations.

But they all trade something. Markdown strips formatting, layout, and visual structure. CSV strips formulas, merged cells, conditional formatting, and multi-sheet relationships. Specialized tools handle narrow tasks but can't generalize across the full range of document operations. The workarounds exist because the format is the problem. They don't solve the problem. They route around it. And they break down at exactly the point where the user needs the AI to work with the document as it actually exists, not a simplified version of it.

There's a common thread. Every workaround starts by acknowledging that the AI can't work with the Office format directly. The solution is always to remove the format, do the work, and then hope the format can be reassembled. That's not a workflow. It's a prayer.

A Different Starting Point

The question most people ask is how to make AI better at handling legacy formats. More capable models, smarter parsing, better code generation. Those improvements will come, and they'll help at the margins. But they don't address the structural issue. The formats themselves were designed for a world where software ran locally, files lived on hard drives, and the idea of an AI agent editing a document didn't exist. No amount of model improvement changes the fact that a PowerPoint file scatters its data across dozens of separate files, each with its own rules for colors, positioning, and structure.

The deeper question is whether the format should be the bottleneck at all. When the file format is designed for direct, targeted edits rather than full reconstruction, the entire problem chain described in this post disappears. The AI reads and writes the same thing the user sees. No unpacking, no repackaging, no writing code to glue it all back together. The model's full attention goes to the actual content, not the packaging around it. Edits are precise, not demolition-and-rebuild. We wrote about this in more detail in Why We're Building a New File Format. The short version: the format is the ceiling. Everything built on top of a legacy format inherits its limits, including AI.

The bottleneck in AI-assisted document work is not the model. Models improve every quarter. The bottleneck is thirty years of accumulated format complexity, compressed into archives and scattered across dozens of internal files, processed through a rebuild-from-scratch approach that was never designed for this. That's not a problem you solve with a better prompt or a faster model. It's a problem you solve by changing what the model works with.