VOYGR team here. We built this because we kept running into the same problem: LLMs confidently recommending places that turned out to be closed, fabricated, or in the wrong neighborhood. We wanted to measure how bad it actually is.
Setup: 345 prompts across 50+ cities, 5 task types (discovery, place details, navigation, booking, sharing), each run across ChatGPT, Gemini, Claude, and Perplexity with search ON and OFF. 2,415 total evaluated responses. Every recommended place was verified against Google Search and Maps.
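For context, a place-verification check like the one in our pipeline can be sketched against Google's Places "Find Place" endpoint, whose `business_status` field distinguishes operational from permanently closed places. This is a minimal illustration, not our actual evaluation code, and the label names (`not_found`, `permanently_closed`, etc.) are our own:

```python
import urllib.parse

PLACES_ENDPOINT = "https://maps.googleapis.com/maps/api/place/findplacefromtext/json"

def build_lookup_url(place_name: str, address: str, api_key: str) -> str:
    # Find Place text query; request business_status so we can tell
    # OPERATIONAL apart from CLOSED_PERMANENTLY.
    params = {
        "input": f"{place_name}, {address}",
        "inputtype": "textquery",
        "fields": "name,business_status,formatted_address",
        "key": api_key,
    }
    return f"{PLACES_ENDPOINT}?{urllib.parse.urlencode(params)}"

def verdict(response: dict) -> str:
    """Collapse a Find Place JSON response into a benchmark label."""
    candidates = response.get("candidates", [])
    if not candidates:
        return "not_found"  # no match: likely fabricated
    status = candidates[0].get("business_status", "UNKNOWN")
    return {
        "OPERATIONAL": "open",
        "CLOSED_TEMPORARILY": "temporarily_closed",
        "CLOSED_PERMANENTLY": "permanently_closed",
    }.get(status, "unknown")
```

An empty `candidates` list is the "fabricated place" case; `CLOSED_PERMANENTLY` is the case every model missed in finding 2 below.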
What surprised us:
1. Search makes booking tasks worse. Enabling web search improved discovery by ~8 points but hurt transactional tasks. Claude and Gemini both lost 5+ points on "help me book a table" prompts. Models switched from giving step-by-step advice to quoting search snippets.
2. Every model confidently books you a table at closed restaurants. We tested a permanently closed Buenos Aires restaurant. All 7 configs gave booking guidance and seating tips. Even search-equipped models didn't catch it.
3. The real gap is constraint matching. Models find real places but ignore parts of the prompt: price range, neighborhood, cuisine type. Ask for "affordable rooftop bars in Gangnam" and you get champagne lounges with $30 cocktails. On constraint matching, the gap between the best and worst provider is 16 points.
The full methodology is in the report. We're planning to open-source the benchmark repo (all 345 prompts, evaluation pipeline, and raw results) in the coming weeks.
We built a *Business Validation API* for AI developers and agents to catch these failures before they reach production. Pass in a place name and address from any LLM response and get back existence verification and operating status: the exact checks that would have caught the closed and fabricated places in this benchmark. Link is in the report if you want to try it.
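In an agent, the validation result becomes a one-line gate before a recommendation is surfaced. A hypothetical sketch — the response shape and field names (`exists`, `operating_status`) are illustrative, not our documented API:

```python
def safe_to_surface(validation: dict) -> bool:
    """Gate an LLM place recommendation on a validation response.

    Assumes a hypothetical response shape like:
        {"exists": true, "operating_status": "open"}
    Anything unverified or not currently operating gets filtered.
    """
    return (
        validation.get("exists") is True
        and validation.get("operating_status") == "open"
    )
```

The point is the fail-closed default: a place that can't be verified is dropped rather than passed through to the user.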
Happy to answer questions about methodology or anything else.