# Where Domain Evals Matter Most

Not all domains need expert-built [[Evals]] equally. The value of a domain expert writing verification tests scales with two forces: the cost of being wrong and the difficulty of checking. A marketing team using AI to draft social posts can eyeball the output. A process engineer verifying whether an AI's recommended operating parameters will cause a thermal runaway cannot. That gap is where the money lives.

# The Eval Prioritization Grid

Two axes define where [[Domain Experts as Eval Builders]] create the most value:

![[Pasted image 20260212105554.png]]

**Axis 1: Penalty for failure.** What happens when the AI is wrong and nobody catches it? This ranges from mild embarrassment to regulatory fines, physical harm, or death. The higher the penalty, the higher the [[Willingness to pay]] for verification.

**Axis 2: Verification opacity.** How hard is it for a non-expert to spot errors? Some AI outputs are obviously wrong to anyone. Others require years of domain training to evaluate. High opacity means the expert's judgment is irreplaceable: no amount of generic [[SelfCheckGPT]] or [[Semantic Entropy]] catches the error.

Plot any domain on these two axes and you get four quadrants:

![[Screenshot 2026-02-12 at 10.56.15.png]]

- **High penalty, high opacity: "Expert evals are existential."** Medical diagnostics, pharmaceutical development, structural engineering, process safety in O&G, legal advice in regulated industries. Wrong answers kill people, trigger lawsuits, or cause environmental disasters. A generalist reviewer cannot tell whether a dosage recommendation or a [[Safety Integrity Level - SIL]] calculation is correct. This is where expert evals command premium pricing and create durable moats. See [[False Alarm Problem in F&G]] for a concrete example: only someone who knows the difference between steam and smoke signatures can write the eval.
- **High penalty, low opacity: "Evals are valuable but contestable."** Financial reporting, tax compliance, insurance underwriting. Errors are expensive but often caught by existing audit processes. The AI output is checkable by trained but non-specialist reviewers. Expert evals add speed and consistency, but the moat is weaker because the verification skill is more widely distributed.
- **Low penalty, high opacity: "Expert evals are a luxury."** Academic research assistance, niche creative domains, specialized translations. Errors matter to quality but rarely create catastrophic outcomes. Experts add real value, but willingness to pay is limited because the downside of failure is reputational, not existential.
- **Low penalty, low opacity: "Generic evals suffice."** Content generation, summarization, internal communications. Anyone can check the output. Errors are easily caught and cheaply fixed. Expert-built evals are overkill. Standard [[LLM-as-Judge]] or [[RAG-based verification]] handles it.

# Scoring a Domain

![[Screenshot 2026-02-12 at 10.56.45.png]]

For any niche, run it through five questions:

1. **What breaks when the AI is wrong?** Score 1-5. Equipment damage, patient harm, regulatory penalty, financial loss, or just embarrassment. The [[Compliance Automation in F&G]] space scores a 5: wrong answers mean explosions or regulatory shutdown. Content marketing scores a 1.
2. **Can a smart generalist spot the error?** Score 1-5. If reviewing the output requires board certification, a PE license, or 10+ years of domain experience, score high. [[Bespoke Engineering in Industrial AI]] exists precisely because domain-specific event labelling requires a process engineer who understands the specific chemistry. That expertise translates directly into eval quality.
3. **How many AI interactions happen daily in this domain?** Volume matters. A domain with 10,000 daily AI queries generates enough data for the eval flywheel to spin.
Low-volume domains make the economics harder. This is the data gravity argument from [[Industrial AI Unit Economics]]: more usage, better evals, better models, more usage.
4. **How scarce are the domain experts?** Scarce experts mean higher eval value per person. There are millions of marketers but only a few thousand process safety engineers qualified to assess [[IEC 61511]] compliance. Scarcity creates pricing power.
5. **Does regulation mandate verification?** Regulated domains have forced buyers. Healthcare (FDA), aviation (FAA), financial services (SEC/FCA), process safety ([[IEC 61508]]). Regulation converts "nice to have" evals into "required to operate" evals. This is the difference between optional and mandatory spend.

## The Top Tier Niches

Domains that score 4-5 across most dimensions:

![[Screenshot 2026-02-12 at 10.59.41.png]]

- **Clinical medicine and diagnostics.** Highest penalty. Maximum opacity. Heavily regulated. Scarce specialists. Every AI-assisted diagnosis needs evals written by clinicians who know what a subtle presentation of sepsis looks like versus a benign fever. The eval library becomes a regulated medical device in itself.
- **Process safety and industrial operations.** The [[F&G Safety Opportunity MOC]] space. $84M/facility/year in downtime costs from false alarms alone. Only experienced plant operators can distinguish real hazards from sensor noise. Evals here encode decades of operational intuition. See [[Human-in-the-Loop Systems]] for why the threshold calibration decision is both a safety and a commercial question.
- **Pharmaceutical R&D.** Drug interaction checks, dosage calculations, clinical trial protocol review. Errors cascade into patient harm and billion-dollar liability. The expert pool is tiny and expensive. Eval sets become proprietary assets that compound with each trial reviewed.
- **Legal and regulatory compliance.** Jurisdictional complexity creates verification opacity.
A contract clause that's valid in Delaware might be unenforceable in Germany. Only practitioners with jurisdiction-specific experience can write meaningful evals. Regulatory penalties provide the willingness to pay.
- **Financial risk and audit.** Material misstatement in financial reporting triggers SEC action and shareholder lawsuits. CPAs and auditors bring verification instincts honed over thousands of engagements. The eval library maps to specific accounting standards (GAAP, IFRS), creating structural lock-in.

## The Moat Mechanics

The [[Technical Moat Assessment Framework]] applies directly. Individual eval questions are easy to replicate. A library of 10,000 domain-specific test cases, calibrated against real-world outcomes, validated by recognized experts, and continuously updated with production feedback is not.

![[Screenshot 2026-02-12 at 11.00.26.png]]

The defensibility stack:

- **Data flywheel.** More usage generates more edge cases, which generate better evals. This is [[Wright's Law]] applied to verification quality.
- **Expert network.** The community of domain experts contributing evals is itself a moat. Recruiting and retaining them creates a talent barrier competitors can't buy their way past.
- **Regulatory entrenchment.** Once a set of evals is referenced in regulatory guidance or industry standards, switching costs become structural. The eval set effectively becomes infrastructure.

The playbook from [[Deployment Velocity]] applies: measure how quickly a new domain can be onboarded with expert evals. If it takes 6 months and a domain hire per vertical, you scale linearly. If the tooling lets a domain expert self-serve eval creation in days, you scale exponentially.
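The grid and the five-question rubric can be sketched as a small scoring function. This is a toy illustration, not anything from the source: the `threshold`, the "at least 4 of 5 dimensions" reading of "most," and the example scores for the two domains are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class DomainScore:
    """Scores 1-5 for each of the five rubric questions."""
    penalty: int          # Q1: what breaks when the AI is wrong?
    opacity: int          # Q2: can a smart generalist spot the error? (high = no)
    volume: int           # Q3: daily AI interaction volume
    expert_scarcity: int  # Q4: how scarce are qualified experts?
    regulation: int       # Q5: does regulation mandate verification?


def quadrant(score: DomainScore, threshold: int = 3) -> str:
    """Place a domain in the two-axis prioritization grid.

    The two axes are penalty for failure and verification opacity;
    the cutoff of 3 is an assumed midpoint on the 1-5 scale.
    """
    high_penalty = score.penalty > threshold
    high_opacity = score.opacity > threshold
    if high_penalty and high_opacity:
        return "Expert evals are existential"
    if high_penalty:
        return "Evals are valuable but contestable"
    if high_opacity:
        return "Expert evals are a luxury"
    return "Generic evals suffice"


def is_top_tier(score: DomainScore) -> bool:
    """Top-tier niches score 4-5 across most dimensions (here: >= 4 of 5)."""
    dims = [score.penalty, score.opacity, score.volume,
            score.expert_scarcity, score.regulation]
    return sum(d >= 4 for d in dims) >= 4


# Hypothetical example scores, for illustration only.
f_and_g = DomainScore(penalty=5, opacity=5, volume=4,
                      expert_scarcity=5, regulation=5)
content_marketing = DomainScore(penalty=1, opacity=1, volume=5,
                                expert_scarcity=1, regulation=1)

print(quadrant(f_and_g))            # Expert evals are existential
print(quadrant(content_marketing))  # Generic evals suffice
```

The point of the sketch is that the quadrant placement uses only the first two questions; questions 3-5 determine whether a high-penalty, high-opacity domain is also economically attractive enough to count as top tier.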
---

Links:

- [[Domain Experts as Eval Builders]]
- [[AI Verification]]
- [[Evals]]
- [[How AI Verification Tools Actually Work - A Technical Deep Dive]]
- [[False Alarm Problem in F&G]]
- [[Compliance Automation in F&G]]
- [[Safety Integrity Level - SIL]]
- [[Industrial AI Unit Economics]]
- [[Deployment Velocity]]
- [[Technical Moat Assessment Framework]]
- [[Willingness to pay]]
- [[Bottleneck Business]]
- [[Human-in-the-Loop Systems]]
- [[Bespoke Engineering in Industrial AI]]
- [[F&G Safety Opportunity MOC]]

---

Tags: #deeptech #kp #firstprinciple #investing