Case study

Mini-FRED: Deterministic finance Q&A benchmark

Offline evaluation with database-grounded truth. Shows how reliability improves across versions.

This is a benchmark scorecard, not a live system claim.
Vero's MVES (Minimum Viable Evaluation Suite) is a deterministic, end-to-end evaluation harness for data agents. It turns natural-language questions into structured expectations, runs your agent against a local, versioned data warehouse, and scores answers with transparent, reproducible checks. No hidden APIs, no flaky metrics: just clear pass/fail signals, failure breakdowns, and reports you can track over time.
Mini-FRED is a five-year collection of FRED series used to answer natural-language finance questions with deterministic transforms: point (raw value at a date), YoY (year-over-year % change), MoM (month-over-month % change), MA (moving average over N periods), max (maximum in a window), and min (minimum in a window).
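For concreteness, each of these transforms is a small deterministic function. A minimal sketch over a toy monthly series (made-up values, not real FRED data) might look like:

```python
from datetime import date

# Toy monthly series (illustrative values, not real FRED data):
# Jan 2023 .. Mar 2024, value = 100 + index.
dates = [date(2023 + (m - 1) // 12, (m - 1) % 12 + 1, 1) for m in range(1, 16)]
series = {d: 100.0 + i for i, d in enumerate(dates)}

def point(d):
    """Raw value at a date."""
    return series[d]

def _pct(curr, prev):
    return (curr - prev) / prev * 100.0

def mom(d):
    """Month-over-month % change vs the previous observation."""
    i = dates.index(d)
    return _pct(series[d], series[dates[i - 1]])

def yoy(d):
    """Year-over-year % change vs the same month one year earlier."""
    return _pct(series[d], series[date(d.year - 1, d.month, d.day)])

def ma(d, n):
    """Trailing moving average over the n observations ending at d."""
    i = dates.index(d)
    return sum(series[x] for x in dates[i - n + 1 : i + 1]) / n

def window_extreme(start, end, fn=max):
    """Max (or min, with fn=min) of the series over [start, end]."""
    return fn(series[x] for x in dates if start <= x <= end)
```

Because every transform is a pure function of the cached series, the expected answer for any question is computable ahead of time, which is what makes the benchmark deterministic.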

Executive summary

  • Reliability improved from 65.9% (v1) to 78.2% (v5), a +12.3 point lift.
  • Primary remaining issue: Transform confusion (intent parsing drives the wrong computation path).
  • Next improvements focus on rules-first transforms, a stricter output contract, and clarifying questions for ambiguous phrasing.
Progression (pass rate by version)

  v1  65.9%
  v2  69.8%
  v3  69.8%
  v4  66.8%
  v5  78.2%

Offline MVES benchmark • 560 questions

Mini-FRED Agent v1

Baseline deterministic agent scored against DuckDB-grounded truth.

Benchmark Index (proxy)

657.4

Success Rate

65.9%

Critical flags (assertion)

441

Pass / Fail

369 passed

191 failed

Top failure modes

  • Wrong computation path · 181
  • Transform confusion · 142
  • Output numeric formatting · 60

Baseline parser + deterministic DuckDB truth checks.

BenchmarkIndex = round((passRate * 10) - (criticalFailures/totalCases) * 2, 1)
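The proxy formula can be sanity-checked directly; the snippet below reproduces the v1 and v5 index values shown on the scorecards:

```python
def benchmark_index(pass_rate_pct, critical_failures, total_cases):
    """Proxy score: 10x the pass rate (in %), minus a small critical-failure penalty."""
    return round(pass_rate_pct * 10 - (critical_failures / total_cases) * 2, 1)

print(benchmark_index(65.9, 441, 560))  # v1 -> 657.4
print(benchmark_index(78.2, 301, 560))  # v5 -> 780.9
```

Note that the pass rate dominates the score; the critical-failure penalty only nudges the index by a point or two at most.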


Executive takeaway: Baseline reliability with clear transform errors.

What changed

  • Baseline parser + deterministic DuckDB truth checks
  • Single-pass transform selection
  • No intent normalization yet

Primary remaining issue

Wrong computation path (misidentifies the requested transform, so correct data but wrong calculation).


Mini-FRED Agent v2

Improved parsing to reduce date handling and extraction errors.

Benchmark Index (proxy)

696.8

Success Rate

69.8%

Critical flags (assertion)

339

Pass / Fail

391 passed

169 failed

Top failure modes

  • Wrong computation path · 159
  • Transform confusion · 120
  • Output numeric formatting · 60

Improved date parsing + stricter value extraction.



Executive takeaway: Higher success rate; computation path still dominant.

What changed

  • Improved date parsing + stricter value extraction
  • More explicit transform coercion
  • Cleaner numeric extraction fallback

Primary remaining issue

Transform confusion remains (better parsing, but still picks YoY vs MoM incorrectly in noisy phrasing).


Mini-FRED Agent v3

Retrieval + parsing refinements to stabilize answer formatting.

Benchmark Index (proxy)

696.8

Success Rate

69.8%

Critical flags (assertion)

339

Pass / Fail

391 passed

169 failed

Top failure modes

  • Wrong computation path · 159
  • Transform confusion · 120
  • Output numeric formatting · 60

Retrieval + parsing refinements; stabilized outputs.



Executive takeaway: Stability improved, but core failure types persist.

What changed

  • Retrieval + parsing refinements; stabilized outputs
  • Reduced ambiguous transform collisions
  • More consistent series selection

Primary remaining issue

Wrong computation path persists (formatting stabilized, but intent errors still dominate).


Mini-FRED Agent v4

Guardrails added for windows, dates, and refusal correctness.

Benchmark Index (proxy)

666.6

Success Rate

66.8%

Critical flags (assertion)

394

Pass / Fail

374 passed

186 failed

Top failure modes

  • Wrong computation path · 176
  • Transform confusion · 120
  • Output numeric formatting · 78

Window/date guardrails; more refusal correctness.



Executive takeaway: Guardrails helped, but wrong paths still frequent.

What changed

  • Window/date guardrails; more refusal correctness
  • Refusal criteria made explicit
  • Tighter date-window matching

Primary remaining issue

Edge-case date parsing (window and moving-average handling improved, but some ambiguous date prompts still fail).


Mini-FRED Agent v5

Local Phi-4 intent normalization reduces transform ambiguity.

Benchmark Index (proxy)

780.9

Success Rate

78.2%

Critical flags (assertion)

301

Pass / Fail

438 passed

122 failed

Top failure modes

  • Transform confusion · 128
  • Wrong computation path · 119
  • Output numeric formatting · 54

Local Phi-4 intent normalization for transform detection.



Executive takeaway: Best reliability so far; transform confusion remains.

What changed

  • Added local Phi-4 intent normalization for transform detection
  • Reduced transform ambiguity on edge phrasing
  • More consistent transform labels downstream
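A minimal sketch of what intent normalization could look like, assuming the local model is exposed as a plain str -> str callable (the prompt wording and label set below are illustrative, not the production prompt):

```python
import re

ALLOWED = ("point", "yoy", "mom", "ma", "max", "min")

PROMPT = (
    "Classify this finance question as exactly one transform label from: "
    "point, yoy, mom, ma, max, min. Reply with the label only.\n"
    "Question: {q}"
)

def normalize_intent(question, llm, fallback="point"):
    """llm is any str -> str callable (e.g. a wrapper around a local Phi-4).
    Coerce its free-text reply into the allowed label set; fall back when
    no known label appears in the reply."""
    reply = re.sub(r"[^a-z]+", " ", llm(PROMPT.format(q=question)).lower())
    tokens = reply.split()
    for label in ALLOWED:
        if label in tokens:
            return label
    return fallback
```

Coercing the reply into a closed label set is what makes the downstream transform labels consistent even when the model answers with extra words or punctuation.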

Primary remaining issue

LLM still misses some nuance (Phi-4 helps, but certain change-vs-level cues still misfire).

Planned improvements roadmap

What's next

Potential upgrades focused on transform clarity and output contracts.




Planned changes

  • Rules-first transform lexicon (map phrases like "annual swing" -> YoY, "month-to-month" -> MoM)
  • Output contract: enforce {series_id, transform, date/window, value} and a single numeric
  • Confidence gate: ask a clarifying question when date/window/transform is ambiguous
  • Add more transforms step by step later (avg, median, CAGR, z-score)
  • Expand eval coverage for ambiguous phrasing + adversarial paraphrases
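The rules-first lexicon and confidence gate could be sketched as follows; the patterns and labels here are hypothetical examples, not the planned lexicon:

```python
import re

# Hypothetical phrase lexicon; these rules run before any model call.
LEXICON = [
    (r"year[- ]over[- ]year|annual swing|\byoy\b", "yoy"),
    (r"month[- ](over|to)[- ]month|\bmom\b", "mom"),
    (r"moving average|\d+[- ](month|period) average", "ma"),
    (r"\b(highest|peak|maximum)\b", "max"),
    (r"\b(lowest|trough|minimum)\b", "min"),
]

def rules_first_transform(question):
    """Return a transform label, or None when the phrasing is ambiguous
    or unmatched -- the cue to ask a clarifying question (or fall back
    to the LLM normalizer)."""
    q = question.lower()
    hits = {label for pattern, label in LEXICON if re.search(pattern, q)}
    return hits.pop() if len(hits) == 1 else None
```

Returning None on zero or multiple lexicon hits is the confidence gate: only an unambiguous single match bypasses the clarifying question.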

Focus

Transform ambiguity + output contract discipline

Methodology

How the benchmark is computed

Benchmark Index is an internal proxy score, not ROI.

DuckDB computes ground truth from the cached Mini-FRED series; the agent parses intent and selects a transform; evaluation checks the reported value against the truth within tolerance.
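The scoring step can be illustrated with a minimal grader. The case/answer schema and failure labels below are assumptions that mirror the failure modes reported above, and the truth value is illustrative, not a real FRED figure:

```python
import math

def grade(case, answer, rel_tol=1e-4, abs_tol=1e-9):
    """Deterministic pass/fail against DB-computed truth, with a coarse
    failure-mode label (hypothetical schema, not the harness's exact one)."""
    if answer.get("transform") != case["transform"]:
        return "fail: transform confusion"
    if not math.isclose(answer.get("value", float("nan")), case["truth"],
                        rel_tol=rel_tol, abs_tol=abs_tol):
        return "fail: wrong value"
    return "pass"

# Example case: CPIAUCSL is a real FRED series ID; the truth value is made up.
case = {"series_id": "CPIAUCSL", "transform": "yoy",
        "date": "2024-01-01", "truth": 3.09}
```

Because both the truth and the tolerance check are deterministic, re-running the suite on the same warehouse snapshot always reproduces the same pass/fail breakdown.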