LLM judge and regression harness
Testing Helix requires more than happy-path assertions. We run a layered judge system that compares model output, SQL traces and rendered charts against expected behaviour.
Golden workflows
Product managers author YAML workflows that capture the intent of key journeys. Each workflow contains:
- Input prompts with guard-rail metadata (tenant, datasets, feature flags).
- Expected semantic outcomes, such as "trend direction should be upward" or "the query must filter to premium customers".
- Hooks that export intermediate SQL and dataframe samples.
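A golden workflow might look like the following sketch. The field names (`guardrails`, `expect`, `hooks`, and so on) are illustrative assumptions, not the harness's actual schema:

```yaml
# Hypothetical golden workflow; field names are illustrative, not the real schema.
name: premium-revenue-trend
input:
  prompt: "Show monthly revenue for premium customers"
  guardrails:
    tenant: acme-corp
    datasets: [billing.revenue]
    feature_flags: [new_chart_renderer]
expect:
  - kind: semantic
    rule: "trend direction should be upward"
  - kind: semantic
    rule: "the query must filter to premium customers"
hooks:
  - export: sql
  - export: dataframe_sample
```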
Judge models
After executing a workflow we collect artifacts and ask a mixture of LLMs to grade them. The primary judge looks at textual explanations, while a lightweight verifier checks structural rules.
{
  "prompt": "Did the plan restrict results to premium customers?",
  "context": {
    "sql": "...",
    "explain": "...",
    "samples": [{"segment": "premium", "count": 428}]
  },
  "decision": "pass",
  "rationale": "WHERE clause contains membership_tier = 'premium'"
}
Judges vote independently. A quorum failure triggers automated triage that files an issue with the prompt, decisions and logs attached.
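The vote-and-triage step can be sketched as follows; the `Verdict` type, the quorum threshold, and the triage hook are assumptions for illustration, not the harness's real API:

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Verdict:
    """One judge's independent grade for a workflow run (illustrative shape)."""
    judge: str
    decision: str  # "pass" or "fail"
    rationale: str

def quorum(verdicts: List[Verdict], threshold: float = 0.5) -> bool:
    """The run passes only if more than `threshold` of judges voted pass."""
    passes = sum(1 for v in verdicts if v.decision == "pass")
    return passes / len(verdicts) > threshold

def evaluate(verdicts: List[Verdict],
             file_issue: Callable[[List[Verdict]], None]) -> bool:
    # Judges voted independently; a quorum failure triggers automated triage.
    if quorum(verdicts):
        return True
    file_issue(verdicts)  # the real harness attaches prompt, decisions and logs
    return False
```

The deciding threshold here is a simple majority; the actual quorum rule may weigh the primary judge and the structural verifier differently.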
Replay and drift analysis
We replay a subset of workflows against nightly builds. The harness stores judge decisions in Postgres so we can graph regression trends and spot drift.
When the judge disagrees with historical baselines, we fall back to deterministic assertions or request a human review, depending on severity.
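The drift check amounts to comparing a workflow's recent pass rate against its stored baseline and routing by severity. A minimal sketch, assuming illustrative tolerances and outcome names (the real harness reads both series from Postgres):

```python
from statistics import mean

def detect_drift(history: list[bool], recent: list[bool],
                 tolerance: float = 0.1) -> str:
    """Compare recent judge decisions with the historical baseline.

    Returns "ok", "deterministic_fallback", or "human_review"; the
    severity routing and thresholds here are assumptions.
    """
    baseline = mean(1.0 if p else 0.0 for p in history)
    current = mean(1.0 if p else 0.0 for p in recent)
    drift = baseline - current
    if drift <= tolerance:
        return "ok"
    # Mild drift: fall back to deterministic assertions.
    # Severe drift (more than twice the tolerance): escalate to a human.
    return "human_review" if drift > 2 * tolerance else "deterministic_fallback"
```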
Continuous improvement loop
Failures feed our training datasets. We extract minimal counterexamples and enrich them with grounded context before adding them to the augmentation pipeline.
The harness also measures latency, token usage and query cost, ensuring that improvements do not regress operational budgets.
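The budget check itself is a straightforward threshold comparison. A sketch, where the metric names (`latency_ms`, `tokens`, `query_cost_usd`) are illustrative assumptions:

```python
def check_budgets(metrics: dict[str, float],
                  budgets: dict[str, float]) -> list[str]:
    """Return the names of metrics that exceed their operational budgets.

    Metric names are illustrative; the real harness defines its own keys.
    """
    return [name for name, limit in budgets.items()
            if metrics.get(name, 0.0) > limit]

violations = check_budgets(
    {"latency_ms": 950, "tokens": 12_000, "query_cost_usd": 0.04},
    {"latency_ms": 1_000, "tokens": 10_000, "query_cost_usd": 0.05},
)
# Here only the token budget is exceeded, so the run is flagged on that metric.
```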
Extending the judge
Teams can add domain-specific verifiers via plugins. These plugins run as Python callables that return structured verdicts, which the judge aggregator merges with the LLM assessments.
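A verifier plugin can be as small as a callable that inspects run artifacts and returns a structured verdict. A minimal sketch, assuming a hypothetical registry and verdict shape (the harness's real plugin API may differ):

```python
from typing import Callable, Dict, List

# Hypothetical plugin registry; the real harness's registration API may differ.
VERIFIERS: Dict[str, Callable[[dict], dict]] = {}

def verifier(name: str):
    """Decorator that registers a domain-specific verifier plugin."""
    def register(fn: Callable[[dict], dict]) -> Callable[[dict], dict]:
        VERIFIERS[name] = fn
        return fn
    return register

@verifier("premium_filter")
def check_premium_filter(artifacts: dict) -> dict:
    # Structural rule: the SQL must restrict results to premium customers.
    ok = "membership_tier = 'premium'" in artifacts.get("sql", "")
    return {
        "decision": "pass" if ok else "fail",
        "rationale": ("WHERE clause filters on membership_tier"
                      if ok else "no premium filter found"),
    }

def run_verifiers(artifacts: dict) -> List[dict]:
    """Collect structured verdicts for the aggregator to merge with LLM votes."""
    return [{"verifier": name, **fn(artifacts)}
            for name, fn in VERIFIERS.items()]
```

Returning plain dicts keeps plugin verdicts in the same shape as the JSON verdicts the LLM judges emit, so the aggregator can merge both streams uniformly.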
Upcoming improvements include visual diffing of rendered charts and reinforcement signals that prioritize flaky workflows.