Which table format do LLMs understand best?

(improvingagents.com)

108 points | by oidar 3 days ago

56 comments

  • jcheng 17 minutes ago

    I was curious enough to have Codex create a similar benchmark: https://github.com/jcheng5/table-formats

    With 1000 rows and 100 samples and markdown-kv, I got these scores:

    - gpt-4.1-nano: 52%

    - gpt-4.1-mini: 72%

    - gpt-4.1: 93%

    - gpt-5: 100%

    I was so surprised by gpt-5 getting 100% that I ran it again with 1000 samples. It got 999 correct, and one wrong.

    To reproduce it yourself, clone the repo, add a .env file with OPENAI_API_KEY, `uv sync`, and then run:

        uv run inspect eval evals/table_formats_eval.py@table_formats_markdown_kv --model openai/gpt-5 --limit 100
    • jcheng 4 minutes ago

      gpt-5 also got 100/100 for both CSV and JSON.

          uv run inspect eval evals/table_formats_eval.py@table_formats_csv --model openai/gpt-5 --limit 100
          uv run inspect eval evals/table_formats_eval.py@table_formats_json --model openai/gpt-5 --limit 100
  • Sharlin 6 hours ago

    > where accuracy is paramount

    > accuracy: 60%

    Not to mention that the least poorly performing format is probably the stupidest way to encode tabular data, beating even XML. But I guess that’s the new normal because we’re trying to shoehorn conversational AI models to every use case rather than, say, training finetunes that are better at particular tasks. (Yes, of course you can’t train finetunes when the model is a proprietary black box on someone else’s computer.) Something about hammers and nails…

    • mattcollins 43 minutes ago

      I'm the person who ran the test.

      To explain the 60% a bit more...

      Accuracy is hard to distinguish from 100% with tiny amounts of input data and gradually decreases as you increase the amount of input data.

      For this test, I intentionally chose an input data set large enough that the LLM would score in the region of 50% accuracy (with variation between formats) in order to maximise the discriminative power of the test.

    • gpt5 an hour ago

      Isn't the best performing (markdown tables) and the worst (pipe delimited tables) basically the same format?

      • simonw 14 minutes ago

        The best performing isn't markdown tables, it's markdown key/value pairs:

          ## Record 1
          
          ```
          id: 1
          name: Charlie A0
          age: 56
          city: New York
          department: Operations
          salary: 67896
          years_experience: 7
          project_count: 1
          ```
        
        Which makes sense to me because the problem with formats like CSV and regular markdown tables is that it is too easy for the model to mistakenly associate a value in a row with the wrong header.

        Explicit key/value formats like this or YAML or JSON objects make that a lot less likely.
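
        A minimal sketch of how records could be rendered in this key/value layout. The function name `to_markdown_kv` and the sample record are illustrative assumptions, not the benchmark's actual code:

```python
FENCE = "`" * 3  # three backticks, built indirectly so this example stays one fenced block


def to_markdown_kv(records):
    """Render a list of dicts as markdown key/value blocks, one block per record."""
    blocks = []
    for index, record in enumerate(records, start=1):
        # Each record gets a heading followed by a fenced block of "key: value" lines.
        lines = [f"## Record {index}", "", FENCE]
        lines += [f"{key}: {value}" for key, value in record.items()]
        lines.append(FENCE)
        blocks.append("\n".join(lines))
    return "\n\n".join(blocks)


if __name__ == "__main__":
    sample = [{"id": 1, "name": "Charlie A0", "age": 56, "city": "New York"}]
    print(to_markdown_kv(sample))
```

        Every value sits on the same line as its key, so the model never has to count columns to find the matching header.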

    • mritchie712 5 hours ago

      they used GPT-4.1 nano, results would be quite different with sonnet or gpt5.

      • fnordpiglet 3 hours ago

        I was looking for the frontier curve where they tested their benchmark across different models since this sort of behavior is highly parameter, architecture, training, and fine tuning sensitive. It’s a practically useful question so I was really disappointed when a) they didn’t publish their code so you could test yourself, b) they didn’t do even a cursory examination of other models and sizes.

      • lyu07282 4 hours ago

        Or just regular gpt-4.1, it's a quite capable model.

  • xnx 6 hours ago

    Title says "LLMs" (plural) but they only tested one

    > We only tested OpenAI’s GPT-4.1 nano.

    • picardo 5 hours ago

      This should be higher. While the research question is interesting, the sample size makes the conclusion highly suspect. I'd like to see more research on this.

    • cwyers 3 hours ago

      And not even a commonly used one. Gemini Flash or o4-mini would have been a much better choice if they wanted a cheap model

  • cjonas 6 hours ago

    The test really needed to be run on multiple data sizes (50, 100, 500, 1000, 5000). The more token efficient formats would probably eventually overtake the token heavy ones due to context pollution. All this test really says is what performs best for 1 particular model at one particular context length.

  • Ciantic an hour ago

    This is a bit of a silly way to use LLMs to process tabular data. In reality, you'd ask it to write functions and execute them. First you'd ask it to create a type definition from the table, then ask it to create functions to process the data.

    "Write a function to find years of experience by name? Return just the number, e.g. '12'."

    It works much better, and it can single-shot many of the processing requirements just from type definitions it can infer from the data.

    This way it's easier to stick to tabular formats that have easy reading libraries, like with TypeScript/JavaScript JSON, and with Python, maybe CSV...
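
    As a rough illustration of that workflow, here is the kind of code one might expect back, assuming a CSV input. The `Employee` type, the function names, and the second sample row are made up for this sketch:

```python
import csv
import io
from dataclasses import dataclass


@dataclass
class Employee:
    """Type definition inferred from the table's columns (hypothetical subset)."""
    id: int
    name: str
    years_experience: int


# Tiny illustrative data set; the second row is invented for the example.
SAMPLE_CSV = """id,name,years_experience
1,Charlie A0,7
2,Dana B1,12
"""


def load_employees(text):
    """Parse CSV text into typed Employee records."""
    reader = csv.DictReader(io.StringIO(text))
    return [
        Employee(int(row["id"]), row["name"], int(row["years_experience"]))
        for row in reader
    ]


def years_of_experience(employees, name):
    """Return years of experience for the first employee with this name, else None."""
    for employee in employees:
        if employee.name == name:
            return employee.years_experience
    return None


if __name__ == "__main__":
    print(years_of_experience(load_employees(SAMPLE_CSV), "Dana B1"))
```

    The lookup itself is then done by ordinary code, so the model only has to get the function right once rather than read every row correctly.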

  • lowbloodsugar 3 minutes ago

    In mice.

    Or in this case gpt-4.1-nano

  • skyfantom 21 minutes ago

    Super surprised, I would expect CSV to beat all the others. And Markdown KV is something I'm hearing about for the first time.

  • sega_sai 5 hours ago

    Bizarre conclusions when on average all the formats perform poorly with average accuracy of 50%. Sure 60% is better than 40% but they are both unusable if you actually care about numbers...

    • mattcollins 22 minutes ago

      I'm the person who ran the test.

      To hopefully clarify a bit...

      I intentionally chose input data large enough that the LLM would be scoring in the region of 50% accuracy in order to maximise the discriminative power of the test.

    • zaidf 2 hours ago

      I've been stunned by how many smart people talk so casually about LLMs becoming better at math. Do they just forget that a calculator that is wrong 1% of the time is a de facto calculator that doesn't work and should not be used?

      • westoncb 2 hours ago

        Doing math is not the same as calculating. LLMs can be very useful in doing math; for calculating they are the wrong tool (and even there they can be very useful, but you ask them to use calculating tools, not to do the calculations themselves—both Claude and ChatGPT are set up to do this).

        If you're curious, check out how mathematicians like Robert Ghrist or Terence Tao are using LLMs for math research, both have written about it online repeatedly (along with an increasing number of other researchers).

        Apart from assisting with research, their ability on e.g. math olympiad problems is periodically measured and objectively rapidly improving, so this isn't just a matter of opinion.

      • xnx 2 hours ago

        > I've been stunned by how many smart people talk so casually about LLMs becoming better at math

        Could they be referring to this?

        "Advanced version of Gemini with Deep Think officially achieves gold-medal standard at the International Mathematical Olympiad" https://deepmind.google/discover/blog/advanced-version-of-ge...

      • magicalhippo 2 hours ago

        The best math lecturers I had at university sucked at mental calculations. Some almost screwed up 2+2 on the blackboard.

        Yes LLMs suck at calculating stuff. However they can manipulate equations and such, and sometimes impressively so.

      • crazygringo an hour ago

        You realize that when typing into a calculator, you probably hit a wrong key more than 1% of the time? Which is why you always type important calculations twice?

        I've been stunned by how many smart people talk so casually about how because LLMs aren't perfect, they therefore have no value. Do they just forget that nothing in the world is perfect, and the values of things are measured in degrees?

    • zeitgeistcowboy 4 hours ago

      My sentiments exactly. All the formats were so poorly read that they are all effectively useless.

  • sails 27 minutes ago

    I’d be interested in testing different data formats when using the structured outputs api

  • brap 6 hours ago

    I wonder how this compares to a more agentic approach where the LLM composes SQL queries to answer the questions, for example.

    • thom an hour ago

      Well, ironically you then have the issue of how to present your database schema (including important things like the values in some categorical fields) to the LLM and in what format, so you never really escape this issue.

    • jitl 5 hours ago

      Yeah I mean for many real world scale datasets you don’t want to blow the whole context window on a massive markdown file. Instead you can provide a tool that presents the data as a SQLite database. In my testing Claude code seems very capable of answering questions via SQLite queries or even `head` and `grep` on CSV files.
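
      A minimal sketch of such a tool using Python's built-in `sqlite3` module. The table layout, the helper names `build_database` and `run_query`, and the second sample row are assumptions for illustration only:

```python
import sqlite3


def build_database(rows):
    """Load rows into an in-memory SQLite database the model can query."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE employees ("
        "id INTEGER PRIMARY KEY, name TEXT, department TEXT, salary INTEGER)"
    )
    conn.executemany("INSERT INTO employees VALUES (?, ?, ?, ?)", rows)
    conn.commit()
    return conn


def run_query(conn, sql, params=()):
    """Tool entry point: run a query written by the model and return all rows."""
    return conn.execute(sql, params).fetchall()


# Illustrative data; the second row is invented for the example.
ROWS = [
    (1, "Charlie A0", "Operations", 67896),
    (2, "Dana B1", "Engineering", 81000),
]

if __name__ == "__main__":
    conn = build_database(ROWS)
    print(run_query(conn, "SELECT salary FROM employees WHERE name = ?", ("Charlie A0",)))
```

      Only the query and its (usually small) result pass through the context window, not the whole data set.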

      • bwestergard 2 hours ago

        But the result from the SQL query is going to be... a table. So at some point, tables need to go into context, and we need to know how well LLMs can incorporate those tables.

    • efitz 6 hours ago

      This was exactly my thought. Rather than feed the table directly to the LLM, build agents that extract the data and have the LLM act on the extracted data items. Then it’s a preference issue.

      The author didn’t see much more than 60% accuracy which is not very useful for many (most?) real world tasks.

      • coeneedell 4 hours ago

        “Agents that extract the data” Are we really reinventing data frame readers to have an LLM in the critical path?

  • freehorse 5 hours ago

    Tbh I am more interested in processing data and formatting it to tabular forms than extracting data from tabular forms. One of the main uses I see in LLMs is structuring unstructured/semistructured data. I may occasionally feed a table to an LLM and ask such kinds of questions when I feel lazy, but I see no serious application of this as compared with using whatever language/library to process the data from the table (whether using an llm or not in the whole process). The point of having structured data is exactly this. But much more often I feed data to an llm and ask it to create a table.

  • SweetSoftPillow 21 minutes ago

    Misleading title, just one LLM was tested.

  • mingtianzhang 3 hours ago

    The current OCR approach typically relies on a Vision-Language Model (VLM) to convert a table into a JSON structure. However, a table inherently has a 2D spatial structure, while Large Language Models (LLMs) are optimized for processing 1D sequential text. This creates a fundamental mismatch between the data representation and the model’s input format.

    Most existing pipelines address this by preprocessing the table into a linearized 1D string before passing it to the LLM — a question-agnostic step that may lose structural information.

    Instead, one could retain the original table form and, when a question is asked, feed both the question and the original table (as an image) directly into the VLM. This approach allows the model to reason over the data in its native 2D domain, providing a more natural and potentially more accurate solution.

    • fragmede 3 hours ago

      Yeah, I wonder how PNG would fare in this contest.

  • grey-area 2 hours ago

    They don’t understand any table formats; as shown by these results.

    They can transform information in tables but information is lost due to that lack of understanding.

  • ComputerGuru 5 hours ago

    Inputs were not long enough to properly see either of the true wins in terms of reduced token counts for terser formats or their benefits in terms of avoiding stuffing the context window thereby potentially reducing accuracy. The test really needs to be conducted across multiple dimensions!

  • dctoedt 5 hours ago

    KSON? (I'm a complete ignoramus in this area but recently read about KSON in a piece posted here at HN.)

    https://ochagavia.nl/blog/configuration-files-are-user-inter...

    https://news.ycombinator.com/item?id=45291858 (135 comments)

  • dcre 3 hours ago

    Only testing GPT-4.1-nano makes this basically useless. Most people are almost certainly using GPT-5 mini or better. This very poor analysis is like an LLM literacy test for readers.

    • grey-area 2 hours ago

      Please go away and do the work for us and let us know what amazing accuracy you got with whatever version you think is better.

      Anything below 100% is actually pretty useless when it comes to stats.

      • simonw 2 hours ago

        If you want 100% accuracy from these kinds of tasks with LLMs you can get it today, but you need to provide the LLM with the ability to run Python code and tell it to use something like Pandas.

        You can confirm it's doing the right thing by reviewing the code it wrote.
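
        For a sense of what there is to review, a sketch of the sort of short script the model might write. It uses the standard-library `csv` module rather than Pandas to stay dependency-free, and the function name and sample rows are assumptions:

```python
import csv
import io
from collections import defaultdict
from statistics import mean

# Illustrative data; only the first row's values come from the thread's example record.
SAMPLE_CSV = """name,department,salary
Charlie A0,Operations,67896
Dana B1,Engineering,80000
Eli C2,Engineering,82000
"""


def mean_salary_by_department(csv_text):
    """Group salaries by department and return the mean for each group."""
    salaries = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        salaries[row["department"]].append(int(row["salary"]))
    return {department: mean(values) for department, values in salaries.items()}


if __name__ == "__main__":
    print(mean_salary_by_department(SAMPLE_CSV))
```

        The script is deterministic, so a reviewer can check the grouping logic once instead of spot-checking individual answers.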

      • dcre 2 hours ago

        Simon is right about using code execution, but many tables one might look at outside of formal data work are small enough for LLMs to be very reliable at, so this format question is practically relevant. I wish they had tested better models.

  • nightshift1 2 days ago

    I am not an expert on the subject, but I suggest that you can also save context space by using shorter XML element names (like f instead of function, c instead of class, etc.). Just add a legend at the top or bottom to explain what each abbreviation means; LLMs can figure out the mapping without issues. I use this approach when generating project structure maps with Tree-sitter. I did a quick comparison and didn't notice much degradation with Claude, so the context space you save may make it worthwhile. I would be interested to see a proper comparison.

    • 1aurent29 6 hours ago

      Common enough words like `function` and `class` are generally encoded as a single token by the tokenizer and may provide a slightly better context to the LLM. For openai you can test this stuff at https://platform.openai.com/tokenizer

    • Yiin 6 hours ago

      If both f and function use 1 token, are you really saving anything?

  • fancyfredbot 5 hours ago

    This is an interesting theoretical exercise but please for the love of god don't actually use an LLM to search tabular data. This is a solved problem. Free software does this with 100% accuracy and insane efficiency.

    • ModernMech 5 hours ago

      This is a really eye-popping example. Because here we have input text that is fully structured and perfectly unambiguous (it was carefully designed that way!) and yet the LLM can't get all the information out of it. Yet people are using these tools to summarize unstructured text, assuming the summary will capture the most salient points. Well how is the LLM supposed to be good for that task, if it can't even summarize the dang XML document? They keep telling me this thing is more expert than all the experts combined.

  • lmeyerov 6 hours ago

    That's a cool concept - would be curious about a more common setup for agentic data analysis (ex: for using in Claude Code) like:

    * Multiple tasks vs 1

    * O3/o3-mini + 4o/4o-mini instead of nano

    * Extra credit: Inside a fixed cost/length reasoning loop

    Ex: does the md-kv benefit disappear with smarter models that you'd typically use, and thus just become a 2-3x cost?

  • rcarmo 6 hours ago

    Hmmm. I’ve been using YAML data for tables for a while now, and had pretty good results.

  • veryrealsid 4 hours ago

    I'm surprised by the accuracy, in practice, I feel like I generally have a lot better results

    • mattcollins an hour ago

      I'm the person who ran the test.

      The context I used in the test was pretty large. You'll see much better (near 100%) accuracy if you're using smaller amounts of context.

      [I chose the context size so that the LLM would be scoring in the ballpark of 50% accuracy (with variation between formats) to maximise the discriminative power of the test.]

    • coeneedell 4 hours ago

      Do you measure your results in a repeatable way? In a way where your hypotheses about accuracy are falsifiable? Or do they just “feel” right?

  • ggm 3 days ago

    I find this extremely surprising. I would have expected dict structures to have higher semantic context associated with them.

  • xnx 2 hours ago

    Great idea. Very limited execution. If they release the source data and question set, I'll repeat with more LLMs to flesh out the findings.

  • reidgreer 2 days ago

    interesting. I'm curious how this compares across different model families.

  • secwang 5 hours ago

    maybe org table