# LLM-as-Judge

Use one model to evaluate another. The setup: a "judge" LLM reviews outputs from a primary model, scores them for accuracy, relevance, and coherence, and flags problems. It scales far better than human review.

Datadog's research shows prompt design matters as much as model choice: structured-output prompts and explicit reasoning chains improve judge reliability.

The catch: judges can inherit biases. If the judge is too similar to the primary model, it may miss the same errors, and overfitting happens when both are trained on similar data. Best practice: use a different model family as judge, or ensemble multiple judges. Cross-model disagreement is a useful signal.

This approach sits between fully automated checks (fast, limited) and human review (slow, expensive). It's a good fit for medium-stakes verification where you need scale but can't afford to miss everything.

---

Links:

- [[AI Verification]]
- [[Semantic Entropy]]
- [[SelfCheckGPT]]
- [[AI Agents Stack]]

---

#deeptech #kp
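A minimal sketch of the structured-output judge plus ensemble-disagreement idea above. `call_llm`, the judge model names, and the canned JSON replies are all hypothetical stand-ins for a real model API, so the sketch runs self-contained; the point is the shape of the prompt, the JSON parsing, and the disagreement flag, not any specific provider.

```python
# Sketch: LLM-as-judge with structured output and a two-judge ensemble.
# `call_llm` is a hypothetical stand-in for a real API client; it returns
# canned JSON here so the example is self-contained.
import json
from statistics import mean

JUDGE_PROMPT = """Score the answer for accuracy, relevance, and coherence (1-5).
Think step by step, then reply with JSON only:
{{"reasoning": "...", "accuracy": n, "relevance": n, "coherence": n}}

Question: {question}
Answer: {answer}"""

def call_llm(model: str, prompt: str) -> str:
    # Hypothetical; swap in a real client call for `model` in practice.
    canned = {
        "judge-a": '{"reasoning": "Mostly correct.", "accuracy": 4, "relevance": 5, "coherence": 4}',
        "judge-b": '{"reasoning": "Key claim unsupported.", "accuracy": 2, "relevance": 4, "coherence": 4}',
    }
    return canned[model]

def judge(model: str, question: str, answer: str) -> dict:
    raw = call_llm(model, JUDGE_PROMPT.format(question=question, answer=answer))
    return json.loads(raw)  # structured output -> parseable, auditable scores

def ensemble(question: str, answer: str, judges=("judge-a", "judge-b"), gap=2):
    verdicts = [judge(m, question, answer) for m in judges]
    per_dim, flags = {}, []
    for dim in ("accuracy", "relevance", "coherence"):
        vals = [v[dim] for v in verdicts]
        per_dim[dim] = mean(vals)
        if max(vals) - min(vals) >= gap:  # cross-model disagreement signal
            flags.append(dim)
    return per_dim, flags

scores, disagreements = ensemble("Q", "A")
print(scores, disagreements)
```

With the canned replies, the two judges split on accuracy (4 vs. 2), so that dimension gets flagged for closer (e.g. human) review, which is exactly the disagreement-as-signal pattern the note describes.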