# Domain Experts as Eval Builders

The biggest unlock in [[AI Verification]] is giving domain experts the tools to write [[Evals]]. The people who know what "correct" looks like in cardiology, contract law, or chemical engineering are the ones who should be defining the tests. Not ML engineers. Not prompt hackers.

[[Large Language Model - LLMs]] are general. Verification is specific. That mismatch is the opportunity.

## How Verification and Evals Actually Work

![[Screenshot 2026-02-08 at 00.53.23.png]]

Three core approaches, each suited to different contributions:

**1. Test Case Libraries (Structured Evals)**

The simplest contribution. An expert writes input-output pairs: "Given this patient history, the model should flag X." Hundreds of these create a benchmark. The model runs against them, gets a score. Think of it like an exam bank written by practitioners, not academics.

This is where most domain experts start. Low technical lift. High impact. The test cases encode years of tacit knowledge that no training dataset captures.

**2. Rubric-Based Judging**

More nuanced than pass/fail. The expert defines scoring criteria: "A good radiology report should mention laterality, comparison to priors, and clinical correlation." An [[LLM-as-Judge]] or human reviewer scores against the rubric.

The expert contribution here is the rubric itself. What separates a dangerous answer from a merely mediocre one? Only someone who has made real decisions in the domain can draw that line. The rubric is the moat.

**3. Red Teaming and Adversarial Testing**

Experts probe for failure modes that generalists miss. A petroleum engineer knows the edge cases where a model might confuse upstream and midstream calculations. A lawyer knows the jurisdictional traps. They craft inputs designed to break the model in domain-specific ways.

This is the highest-value contribution. [[SelfCheckGPT]] and [[Semantic Entropy]] catch generic hallucinations.
Only a domain expert catches the plausible-sounding hallucination that would actually get someone hurt.

![[Screenshot 2026-02-08 at 00.53.35.png]]

## What the Contribution Stack Looks Like

![[Screenshot 2026-02-08 at 00.53.50.png]]

A domain expert building evals typically operates across three layers:

**Ground truth curation.** Collecting and labeling examples of correct, incorrect, and ambiguous outputs. This feeds [[RAG-based verification]] systems and NLI-based checkers. The data itself becomes a defensible asset.

**Threshold calibration.** Deciding what confidence score triggers human review. A 90% threshold might be fine for internal summaries but reckless for medical dosing. This maps directly to the hierarchical verification model: high-stakes outputs get full review, low-stakes outputs get automated checks. See [[Human-in-the-Loop Systems]] for why this threshold decision matters commercially.

**Feedback loops.** Every time an expert corrects a model output, that correction trains the next version. This is the data flywheel that creates [[AI era Defensibility]]. The expert becomes part of the system, not a one-time consultant.

## Why This Matters Now

The [[AI Capex Super-Cycle]] is flooding infrastructure with compute. Generation is essentially solved. The constraint has moved to: can you trust what comes out? That trust layer is built by people who have spent decades in their fields, not by more parameters.

An estimated 100M+ knowledge workers spend 4.3 hours per week verifying AI outputs. The companies that help domain experts encode their expertise into scalable eval systems will capture that $2.2 trillion verification gap.

The playbook: find domains with high verification costs and clear experts. Give those experts simple tooling. Let them build the eval libraries. The library becomes the product.
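The pieces above — expert-written test cases, a rubric, and a confidence threshold that routes low scores to human review — can be sketched as a minimal eval harness. Everything here is illustrative: the `TestCase` and `Rubric` types, the toy medical example, and the `model` and `judge` callables are hypothetical stand-ins (in practice the judge would be an [[LLM-as-Judge]] call), not any real library's API.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class TestCase:
    # One expert-written input-output pair (a structured eval).
    prompt: str
    must_flag: list[str]      # facts the output is required to mention

@dataclass
class Rubric:
    # Expert-defined scoring criteria for rubric-based judging.
    criteria: list[str]
    review_threshold: float   # below this score, route to human review

def run_eval(model: Callable[[str], str], cases: list[TestCase],
             rubric: Rubric, judge: Callable[[str, str], float]) -> dict:
    """Score a model against expert test cases; route weak outputs to review."""
    passed, needs_review = 0, []
    for case in cases:
        output = model(case.prompt)
        # Structured check: every required flag must appear in the output.
        if all(flag.lower() in output.lower() for flag in case.must_flag):
            passed += 1
        # Rubric check: the judge scores 0.0-1.0 against the criteria.
        score = judge(output, "; ".join(rubric.criteria))
        if score < rubric.review_threshold:
            needs_review.append(case.prompt)  # threshold calibration in action
    return {"pass_rate": passed / len(cases), "needs_review": needs_review}

# Toy stand-ins so the harness runs end to end.
cases = [TestCase("Patient on warfarin reports new NSAID use.",
                  must_flag=["bleeding risk"])]
rubric = Rubric(criteria=["mentions the drug interaction", "advises follow-up"],
                review_threshold=0.8)
fake_model = lambda p: "Flag bleeding risk from the warfarin-NSAID interaction."
fake_judge = lambda out, crit: 0.9 if "interaction" in out else 0.4

result = run_eval(fake_model, cases, rubric, fake_judge)
```

The expert's contribution lives entirely in the data (`cases`, `rubric.criteria`) and the calibration (`review_threshold`); the harness itself is commodity code. That split is the whole thesis: the library is the product.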
---

Links:

- [[AI Verification]]
- [[Evals]]
- [[Where Domain Evals Matter Most]]
- [[How AI Verification Tools Actually Work - A Technical Deep Dive]]
- [[LLM-as-Judge]]
- [[Human-in-the-Loop Systems]]
- [[Industrial AI MOC]]
- [[domain specific sense-making]]
- [[Bottleneck Business]]
- [[First Principles and Mental Models MoC]]

---

Tags: #deeptech #kp #firstprinciple