The hidden cost of "ask anything" analytics
We built an analytics agent that converts natural language to SQL. It works remarkably well, most of the time. Then we discovered a fundamental problem.
The variability problem
When a user asks "show me page views," our agent can generate valid SQL in many different ways. Each interpretation is defensible. The problem isn't that any one version is wrong; it's that the interpretation varies between invocations.
Same question, three different answers
```sql
-- Interpretation 1: count raw pageview events
SELECT COUNT(*)
FROM events
WHERE type = 'pageview';

-- Interpretation 2: sum a pre-aggregated analytics table
SELECT SUM(pageviews)
FROM analytics
WHERE date = TODAY;

-- Interpretation 3: count distinct sessions in the GA4 export
SELECT COUNT(DISTINCT session_id)
FROM ga4_events
WHERE event = 'page_view';
```
For a data engineer, this is annoying. For a marketing manager who can't read SQL? It's a serious problem.
Why this matters for non-technical users
The core issue is an asymmetry in the ability to evaluate output quality. The most technical users can verify the SQL; those without that know-how see a chart and a number, and the AI's confidence masks the underlying uncertainty.
- Same question → different SQL over time → incomparable trends
- Multiple valid tables (raw events vs. facts) → inconsistent choice
- COUNT vs. SUM vs. COUNT DISTINCT → fundamentally different questions
Verified metrics
Our solution is to constrain what the agent can do. Instead of generating SQL from scratch, it selects from a library of pre-defined, human-verified metrics. The critical property is determinism: given the same metric and parameters, you get identical SQL every time.
Raw SQL generation:
- Different SQL every time
- AI guesses which tables to use
- Calculation logic varies
- No way to audit what happened

Verified metrics:
- Same SQL every time, guaranteed
- Human-verified table selection
- Documented calculation logic
- Full explanation of assumptions
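To make the determinism guarantee concrete, here is a minimal sketch of what a metric library can look like. The `Metric` dataclass, `render_sql()` helper, and the example metric are illustrative assumptions, not our production schema; the point is that no model sits in the rendering path.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Metric:
    name: str
    intent: str                      # the description we embed for retrieval
    sql_template: str                # human-verified, parameterized SQL
    allowed_params: tuple[str, ...]  # the only knobs the agent may turn

METRICS = {
    "daily_page_views": Metric(
        name="daily_page_views",
        intent="Total page view events per day",
        sql_template=(
            "SELECT day, COUNT(*) AS page_views\n"
            "FROM events\n"
            "WHERE type = 'pageview' AND day BETWEEN {start!r} AND {end!r}\n"
            "GROUP BY day ORDER BY day"
        ),
        allowed_params=("start", "end"),
    ),
}

def render_sql(metric_name: str, **params: str) -> str:
    """Deterministic rendering: same metric + same params -> identical SQL.
    (Production code would use bound parameters, not string formatting.)"""
    metric = METRICS[metric_name]
    if set(params) != set(metric.allowed_params):
        raise ValueError(f"expected parameters {metric.allowed_params}")
    return metric.sql_template.format(**params)
```

Calling `render_sql("daily_page_views", start="2024-06-01", end="2024-06-30")` returns the same string on every invocation, which is exactly the property raw generation can't offer.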
Semantic retrieval with LLM validation
Finding the right metric is a two-stage process.
Stage 1: Embedding similarity
We generate embeddings for user queries and metric intents, then use pgvector to find candidates. "Top pages", "most viewed content", and "popular articles" all match the same metric.
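In code, Stage 1 can look something like the sketch below. The `metrics` table, its `intent_embedding` column, and the `embed()` helper are assumptions for illustration; the sketch uses psycopg 3 against a Postgres instance with the pgvector extension installed.

```python
import psycopg

def embed(text: str) -> list[float]:
    raise NotImplementedError  # call your embedding model here

def candidate_metrics(conn: psycopg.Connection, user_query: str, k: int = 5):
    """Return the k metrics whose stored intent is closest to the query."""
    vec = "[" + ",".join(str(x) for x in embed(user_query)) + "]"
    with conn.cursor() as cur:
        cur.execute(
            # <=> is pgvector's cosine-distance operator
            "SELECT name, intent, intent_embedding <=> %s::vector AS distance "
            "FROM metrics ORDER BY distance LIMIT %s",
            (vec, k),
        )
        return cur.fetchall()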
Stage 2: LLM validation
Embedding similarity alone produces false positives. We pass candidates through an LLM that evaluates whether each metric can actually answer the query. About 15% of candidates get filtered out here: 15% fewer wrong answers.
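Stage 2 can be as simple as a yes/no judgment per candidate. The `llm()` callable and the prompt format below are illustrative stand-ins for whatever model client you use:

```python
from typing import Callable

def validate_candidates(
    llm: Callable[[str], str],
    user_query: str,
    candidates: list[tuple[str, str, float]],  # (name, intent, distance)
) -> list[str]:
    """Keep only the metrics the model judges can answer the question."""
    kept = []
    for name, intent, _distance in candidates:
        prompt = (
            "Can this metric answer the user's question? Reply YES or NO.\n"
            f"Question: {user_query}\n"
            f"Metric: {name}: {intent}"
        )
        if llm(prompt).strip().upper().startswith("YES"):
            kept.append(name)
    return kept
```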
Metric detection
We can't pre-define metrics for every possible question. The system needs to handle novel queries and learn what metrics are missing.
When no established metric matches, the agent falls back to writing raw SQL so the user still gets an answer. We capture each of those fallbacks as an opportunity: common patterns are promoted into new metrics, and edge cases are reviewed by a human. The system learns from real usage without sacrificing consistency.
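Wired together, the fallback path might look like this sketch, where `find_verified_sql()`, `generate_raw_sql()`, and the `metric_opportunities` table are all hypothetical names:

```python
def find_verified_sql(conn, llm, user_query: str) -> str | None:
    raise NotImplementedError  # stages 1 and 2, plus parameter extraction

def generate_raw_sql(llm, user_query: str) -> str:
    raise NotImplementedError  # the unconstrained text-to-SQL path

def answer(conn, llm, user_query: str) -> str:
    sql = find_verified_sql(conn, llm, user_query)
    if sql is not None:
        return sql  # deterministic, auditable path
    raw_sql = generate_raw_sql(llm, user_query)  # responsive but unverified
    # Record the miss so recurring patterns can be promoted into new
    # human-reviewed metrics.
    with conn.cursor() as cur:
        cur.execute(
            "INSERT INTO metric_opportunities (question, generated_sql) "
            "VALUES (%s, %s)",
            (user_query, raw_sql),
        )
    conn.commit()
    return raw_sql
```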
What we've learned
Validation is essential
Embedding similarity alone was never going to be enough. The LLM validation stage filters out roughly 15% of candidate metrics that looked similar but couldn't actually answer the question, and we keep refining it.
Coverage grows fast
After a few weeks of detection, the metric library covered the vast majority of common queries. The long tail exists, but it's smaller than we assumed.
Users notice inconsistency
Before this, we'd get tickets like "why did my conversion rate change 20% overnight?" The answer was usually different SQL. Those tickets are gone.
Know your users
This approach makes sense for non-technical users. For data engineers exploring unfamiliar schemas, raw generation might be right. Context matters.
What's next
We're still iterating. Can we use opportunity data to draft new metrics automatically? Can we build metrics that combine data across integrations? Can we surface metric documentation better so users understand exactly what they're looking at?
The core question, whether AI analytics tools should generate SQL freely or select from verified queries, doesn't have a universal answer. For our users, reliable, human-validated answers matter more than maximum flexibility. Constraining the system lets us apply AI exactly where it helps: understanding the question rather than inventing the calculation.