  • johnnymakes 6 hours ago

    Hey HN. I'm Johnny, founder of Exabase.

    M-1 is our first-generation memory engine. We evaluated it against LongMemEval, the most comprehensive public benchmark for conversational memory retrieval: 500 questions, ~115k tokens of history, relevant information scattered across sessions and buried in noise.

    M-1 scored 96.4% at top-50 retrieval, the highest reported score, with consistent performance across all top-k values. The most interesting part is that we did it with Gemini 3 Flash, while every other system on the leaderboard used Gemini 3 Pro.
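    For anyone who wants the mechanics behind a "top-k" number, here's a minimal sketch of one common way to score retrieval at a given k. The retrieve function and field names (session_id, evidence_sessions) are illustrative placeholders, not our harness or the benchmark's exact schema:

        # Sketch of recall@k on LongMemEval-style data.
        # `retrieve` and the field names are illustrative placeholders,
        # not our actual harness or the benchmark's exact schema.
        def recall_at_k(questions, retrieve, k=50):
            hits = 0
            for q in questions:
                top_k = retrieve(q["question"], k=k)  # ranked chunks
                got = {c["session_id"] for c in top_k}
                need = set(q["evidence_sessions"])
                if need <= got:  # all evidence sessions were retrieved
                    hits += 1
            return hits / len(questions)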

    A bigger model can compensate for weaker retrieval by absorbing a larger, noisier context, but at a higher inference cost. We deliberately chose a smaller model to isolate retrieval quality from model capability and to reflect real production use. That makes the result Pareto optimal: better accuracy at lower cost, which is exactly what we're solving for.

    To keep the results in the spirit of real production use, we used a single generic prompt for our answerer, stripping out the question-specific prompt language we observed in other benchmark runners. The methodology, prompt, and full results JSON are all linked in the research post.
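    To make "single generic prompt" concrete, here's a rough sketch of the shape, not our exact wording (that's in the post). GENERIC_ANSWER_PROMPT and llm.complete are placeholder names for illustration:

        # Illustrative shape of a single generic answerer prompt.
        # GENERIC_ANSWER_PROMPT and `llm.complete` are placeholder names,
        # not our exact prompt or client API (those are in the post).
        GENERIC_ANSWER_PROMPT = (
            "You are answering a question about a user's prior "
            "conversations.\n"
            "Use only the retrieved context below. If the answer is not "
            "there, say you don't know.\n\n"
            "Context:\n{context}\n\n"
            "Question: {question}\n"
            "Answer:"
        )

        def answer(llm, question, retrieved_chunks):
            context = "\n\n".join(c["text"] for c in retrieved_chunks)
            prompt = GENERIC_ANSWER_PROMPT.format(
                context=context, question=question)
            return llm.complete(prompt)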

    The research post also discusses the evaluation ceiling we hit at this accuracy level: the benchmark itself contains errors that create a noise floor, and we reported a few of them upstream to the benchmark creator.

    Happy to discuss the architecture, methodology, or how we think about memory retrieval differently!