  • johnnymakes 6 hours ago

    Hey HN. I'm Johnny, founder of Exabase.

    M-1 is our first-generation memory engine. We evaluated it against LongMemEval, the most comprehensive public benchmark for conversational memory retrieval: 500 questions, ~115k tokens of history, relevant information scattered across sessions and buried in noise.

    M-1 scored 96.4% at top-50 retrieval, the highest reported score, with consistent performance across all top-k values. The most interesting part is that we did it with Gemini 3 Flash, while every other system on the leaderboard used Gemini 3 Pro.
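    For anyone who wants the mechanics behind a "top-k" number, here's a minimal sketch of one common way to score retrieval at a given k. The retrieve function and field names (session_id, evidence_sessions) are illustrative placeholders, not our harness or the benchmark's exact schema:

        # Sketch of recall@k on LongMemEval-style data.
        # `retrieve` and the field names are illustrative placeholders,
        # not our actual harness or the benchmark's exact schema.
        def recall_at_k(questions, retrieve, k=50):
            hits = 0
            for q in questions:
                top_k = retrieve(q["question"], k=k)  # ranked chunks
                got = {c["session_id"] for c in top_k}
                need = set(q["evidence_sessions"])
                if need <= got:  # all evidence sessions were retrieved
                    hits += 1
            return hits / len(questions)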

    A bigger model can compensate for weaker retrieval by absorbing a larger, noisier context, but at a higher inference cost. We deliberately chose a smaller model to isolate retrieval quality from model capability and to reflect real production use. That makes the result Pareto optimal: better accuracy at lower cost, which is exactly what we're solving for.

    To keep the results in the spirit of real production use, we used a single generic prompt for our answerer, stripping out the question-specific prompt language we observed in other benchmark runners. The methodology, prompt, and full results JSON are all linked in the research post.
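    To make "single generic prompt" concrete, here's a rough sketch of the shape, not our exact wording (that's in the post). GENERIC_ANSWER_PROMPT and llm.complete are placeholder names for illustration:

        # Illustrative shape of a single generic answerer prompt.
        # GENERIC_ANSWER_PROMPT and `llm.complete` are placeholder names,
        # not our exact prompt or client API (those are in the post).
        GENERIC_ANSWER_PROMPT = (
            "You are answering a question about a user's prior "
            "conversations.\n"
            "Use only the retrieved context below. If the answer is not "
            "there, say you don't know.\n\n"
            "Context:\n{context}\n\n"
            "Question: {question}\n"
            "Answer:"
        )

        def answer(llm, question, retrieved_chunks):
            context = "\n\n".join(c["text"] for c in retrieved_chunks)
            prompt = GENERIC_ANSWER_PROMPT.format(
                context=context, question=question)
            return llm.complete(prompt)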

    The research post also discusses the evaluation ceiling we hit at this accuracy level: the benchmark itself contains errors that create a noise floor, and we reported a few of them upstream to the benchmark creator.

    Happy to discuss the architecture, methodology, or how we think about memory retrieval differently!