**Created**: November 11, 2025
**For**: Understanding the mechanics behind AI verification
**Context**: Technical implementation details for evaluating verification tools

- [[#The Verification Problem|The Verification Problem]]
    - [[#The Verification Problem#What We're Trying to Solve|What We're Trying to Solve]]
    - [[#The Verification Problem#The Core Challenge|The Core Challenge]]
- [[#RAG-Based Verification|RAG-Based Verification]]
    - [[#RAG-Based Verification#How It Works|How It Works]]
    - [[#RAG-Based Verification#Technical Components|Technical Components]]
    - [[#RAG-Based Verification#Verification Mechanism|Verification Mechanism]]
    - [[#RAG-Based Verification#Performance Metrics|Performance Metrics]]
- [[#NLI-Based Verification|NLI-Based Verification]]
    - [[#NLI-Based Verification#Natural Language Inference|Natural Language Inference]]
    - [[#NLI-Based Verification#How It's Used for Verification|How It's Used for Verification]]
    - [[#NLI-Based Verification#Technical Implementation|Technical Implementation]]
    - [[#NLI-Based Verification#NLI Models Used|NLI Models Used]]
    - [[#NLI-Based Verification#Performance|Performance]]
- [[#Multi-Agent Verification (Hebbia Approach)|Multi-Agent Verification (Hebbia Approach)]]
    - [[#Multi-Agent Verification (Hebbia Approach)#The "Agent Swarm" Architecture|The "Agent Swarm" Architecture]]
    - [[#Multi-Agent Verification (Hebbia Approach)#Key Technical Innovations|Key Technical Innovations]]
    - [[#Multi-Agent Verification (Hebbia Approach)#Hebbia's Verification Mechanisms|Hebbia's Verification Mechanisms]]
    - [[#Multi-Agent Verification (Hebbia Approach)#Performance|Performance]]
- [[#Ensemble Methods|Ensemble Methods]]
    - [[#Ensemble Methods#The Principle|The Principle]]
    - [[#Ensemble Methods#Technical Implementation|Technical Implementation]]
    - [[#Ensemble Methods#Component Methods|Component Methods]]
    - [[#Ensemble Methods#Performance|Performance]]
- [[#Comparative Analysis|Comparative Analysis]]
    - [[#Comparative Analysis#Method Comparison Matrix|Method Comparison Matrix]]
    - [[#Comparative Analysis#Cost-Benefit Analysis|Cost-Benefit Analysis]]
    - [[#Comparative Analysis#Decision Framework|Decision Framework]]
- [[#Key Takeaways|Key Takeaways]]
    - [[#Key Takeaways#Technical Principles|Technical Principles]]
    - [[#Key Takeaways#Implementation Insights|Implementation Insights]]
    - [[#Key Takeaways#Future Directions|Future Directions]]
- [[#Further Reading|Further Reading]]

## The Verification Problem

### What We're Trying to Solve

```
Input:      LLM generates text
Problem:    Is it factually accurate?
Constraint: Can't retrain the model every time
Goal:       Verify outputs faster/cheaper than humans reading everything
```

### The Core Challenge

**LLMs have no "ground truth" detector built in.** They are trained to predict the next token, not to verify factual accuracy. This creates three verification challenges:

1. **Hallucination**: The model generates plausible but false information
2. **Source Attribution**: The model can't show where information came from
3. **Consistency**: The same question can produce different answers

---

## RAG-Based Verification

### How It Works

**RAG (Retrieval-Augmented Generation)** = fetch relevant docs → give them to the LLM → generate an answer

```
User Query
    │
    ▼
1. Convert to Vector       ← Embedding model converts text to numbers
    │
    ▼
2. Search Vector DB        ← Find semantically similar documents (semantic search)
    │
    ▼
3. Retrieve Top N Docs     ← Usually N = 3-10
    │
    ▼
4. Inject into Prompt      ← "Given these docs: [docs], answer: [query]"
    │
    ▼
5. LLM Generates Answer    ← With context from retrieved docs
    │
    ▼
6. Return with Citations   ← Shows which docs were used
```
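In code, the loop above can be sketched in a few lines. This is a minimal illustration rather than any particular product's implementation; `embed`, `vector_db`, and `llm` are hypothetical stand-ins for an embedding model, a vector store client, and a chat model.

```python
# Minimal sketch of steps 1-6 above. embed(), vector_db, and llm are
# hypothetical stand-ins, not a specific library's API.
def answer_with_citations(query, embed, vector_db, llm, top_k=5):
    # Steps 1-3: embed the query and retrieve the top-k most similar chunks
    query_vec = embed(query)
    docs = vector_db.search(query_vec, top_k=top_k)

    # Step 4: inject the retrieved chunks into the prompt
    context = "\n\n".join(f"[{i + 1}] {doc.text}" for i, doc in enumerate(docs))
    prompt = (
        "Answer using ONLY the sources below and cite them as [1], [2], ...\n\n"
        f"Sources:\n{context}\n\nQuestion: {query}"
    )

    # Steps 5-6: generate and return the answer together with the sources used
    answer = llm.generate(prompt)
    return {"answer": answer, "sources": [doc.id for doc in docs]}
```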
The most common verification approach: fetch relevant documents, inject them into the prompt, and generate with citations.

![[Screenshot 2025-11-24 at 01.00.22.png]]

Standard RAG workflow: the most common verification method, but it still requires humans to manually check sources.

### Technical Components

**1. Embedding Model**
- Converts text to dense vectors (typically 768-1536 dimensions)
- Examples: `text-embedding-ada-002` (OpenAI), `all-MiniLM-L6-v2` (open source)
- **Purpose**: Enable semantic search (find similar meaning, not just keywords)

**2. Vector Database**
- Stores document embeddings for fast retrieval
- Examples: Pinecone, Weaviate, Chroma, FAISS
- **Search**: Uses cosine similarity or dot product to find nearest neighbors

**3. Chunking Strategy**
- Documents are split into smaller pieces (chunks)
- **Challenge**: Too small = missing context; too large = irrelevant info
- **Typical size**: 256-512 tokens per chunk

**4. Retrieval Methods**

```python
# Illustrative pseudocode: embed, vector_db, bm25_search, semantic_search,
# keyword_search, and rerank are placeholders for real components.

# Dense retrieval (semantic)
query_embedding = embed(query)
results = vector_db.search(query_embedding, top_k=5)

# Sparse retrieval (keyword)
results = bm25_search(query, documents)

# Hybrid (best of both)
dense_results = semantic_search(query)
sparse_results = keyword_search(query)
final_results = rerank(dense_results + sparse_results)
```

### Verification Mechanism

RAG verifies through **source grounding**:

```
User: "What's the capital of France?"

Without RAG:
  LLM: "The capital of France is Paris."
  Problem: Could be hallucinated, no source

With RAG:
  1. Retrieve: [doc1: "France, capital Paris", doc2: "Paris, French capital"]
  2. Generate: "The capital of France is Paris."
  3. Cite: [Source: doc1, doc2]
  Benefit: Human can check sources
```

### Performance Metrics

**Standard RAG** (like ChatGPT with web search):
- Accuracy: ~68% on complex queries
- Speed: 2-5 seconds per query
- Cost: $0.001-0.01 per query

**Limitations**:
1. **Retrieval Fails**: Relevant info not found → hallucination
2. **Context Loss**: Chunks miss important context
3. **Ranking Errors**: Wrong docs ranked higher
4. **Citation Burden**: A human still has to verify the cited sources

---

## NLI-Based Verification

### Natural Language Inference

**NLI Task**: Given two sentences, a premise and a hypothesis, determine their relationship:
- **Entailment**: The premise supports the hypothesis
- **Contradiction**: The premise contradicts the hypothesis
- **Neutral**: The relationship can't be determined from the premise

### How It's Used for Verification

```
Premise (Source):        "The company reported $10M revenue in Q3"
Hypothesis (LLM Output): "Q3 revenue was $10 million"
NLI Model: ENTAILMENT ✓

vs.

Premise (Source):        "The company reported $10M revenue in Q3"
Hypothesis (LLM Output): "Q3 revenue was $12 million"
NLI Model: CONTRADICTION ✗
```

Use inference models to check whether generated text is logically supported by the source documents.

![[Screenshot 2025-11-24 at 01.02.12.png]]
![[Screenshot 2025-11-24 at 01.02.20.png]]

Two NLI approaches: SelfCheckGPT uses consistency across samples; RefChecker compares knowledge triplets against sources.
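As a concrete sketch of a single entailment check, the following uses an off-the-shelf MNLI model from Hugging Face. The model choice (`roberta-large-mnli`) is an assumption, and label order differs between checkpoints, so the labels are read from `model.config.id2label` rather than hard-coded.

```python
# Sketch of one NLI check with an off-the-shelf MNLI model.
# Assumes transformers + torch are installed; label order varies by
# checkpoint, so always read it from model.config.id2label.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL = "roberta-large-mnli"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL)

def nli_label(premise, hypothesis):
    inputs = tokenizer(premise, hypothesis, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    label_id = int(torch.softmax(logits, dim=-1)[0].argmax())
    return model.config.id2label[label_id]  # e.g. CONTRADICTION / NEUTRAL / ENTAILMENT

print(nli_label(
    "The company reported $10M revenue in Q3",  # premise (source)
    "Q3 revenue was $12 million",               # hypothesis (LLM output)
))  # likely CONTRADICTION for this pair
```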
### Technical Implementation

**Method 1: SelfCheckGPT NLI**

```python
# Illustrative sketch: llm, split_sentences, nli_model, and
# flag_as_hallucination are placeholders for real components.
from statistics import mean

# Step 1: Generate the main response plus several stochastic samples
response = llm.generate(prompt)
samples = [llm.generate(prompt) for _ in range(20)]

# Step 2: Split the main response into sentences
sentences = split_sentences(response)

# Step 3: Check each sentence for consistency with the samples
for sentence in sentences:
    # Use an NLI model (e.g., DeBERTa) to check whether the sentence
    # is entailed by each sampled response
    consistency_scores = [
        nli_model(premise=sample, hypothesis=sentence) for sample in samples
    ]

    # Low consistency across samples → likely hallucination
    if mean(consistency_scores) < 0.35:
        flag_as_hallucination(sentence)
```

**Key Insight**: If the LLM actually knows something, it should state it consistently across multiple generations. Inconsistency = uncertainty = likely hallucination.

**Method 2: RefChecker (Amazon)**

```
LLM Response
    │
    ▼
Extract Knowledge Triplets   ← <subject, predicate, object>
                               Example: <Apple, revenue, $394B in 2023>
    │
    ▼
Compare with Source Docs     ← Using an NLI model
    │
    ▼
Classify Each Triplet:
  • Entailment    (supported)
  • Contradiction (wrong)
  • Neutral       (can't verify)
    │
    ▼
Report Hallucinations
```

### NLI Models Used

**Common Choices**:
1. **DeBERTaV3**: Fine-tuned on NLI datasets, ~90% accuracy
2. **RoBERTa-NLI**: Lightweight, fast inference
3. **T5-based**: TRUE/TrueTeacher models

**Training Data**:
- SNLI (Stanford Natural Language Inference)
- MNLI (Multi-Genre NLI)
- DocNLI (Document-level NLI)

### Performance

**SelfCheckGPT**:
- Calibration: A score of 80% means roughly 80% of flagged sentences are actually hallucinations
- Speed: ~1 second per sentence (20 samples = ~20 seconds)
- Cost: 20x generation cost

**RefChecker**:
- Precision: ~85-90% on detecting hallucinations
- Recall: ~75-80% (misses some hallucinations)
- Speed: ~2-3 seconds per response
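To make the RefChecker-style flow concrete, here is a rough sketch (not Amazon's actual implementation): extract `<subject, predicate, object>` triplets from the response with an LLM prompt, then run each triplet through the same kind of NLI check as above. `llm` and `extract_triplets` are hypothetical helpers, and `nli_label` is the function from the earlier NLI example.

```python
# Rough sketch of a RefChecker-style triplet check (not Amazon's code).
# Assumes an `llm` client for extraction and the nli_label() helper
# defined in the earlier NLI example.

def extract_triplets(llm, response):
    """Hypothetical helper: ask an LLM for <subject, predicate, object> triplets."""
    raw = llm.generate(
        "List the factual claims in this text as subject | predicate | object, "
        f"one per line:\n{response}"
    )
    return [tuple(part.strip() for part in line.split("|"))
            for line in raw.splitlines() if line.count("|") == 2]

def check_response(llm, response, source_docs):
    report = {"ENTAILMENT": [], "CONTRADICTION": [], "NEUTRAL": []}
    for subject, predicate, obj in extract_triplets(llm, response):
        claim = f"{subject} {predicate} {obj}"
        labels = [nli_label(doc, claim) for doc in source_docs]
        if "ENTAILMENT" in labels:        # supported by at least one source
            report["ENTAILMENT"].append(claim)
        elif "CONTRADICTION" in labels:   # contradicted by a source
            report["CONTRADICTION"].append(claim)
        else:                             # can't verify from the sources
            report["NEUTRAL"].append(claim)
    return report
```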
---

## Multi-Agent Verification (Hebbia Approach)

### The "Agent Swarm" Architecture

Unlike single-model RAG, Hebbia uses multiple specialized agents working in parallel.

```
User Query: "Compare termination clauses in 100 contracts"
        │
        ▼
Orchestrator Agent          ← Breaks down the complex query into sub-tasks:
  ├──► Task 1: Retrieve all 100 contracts
  ├──► Task 2: Identify termination clauses in each
  ├──► Task 3: Extract key terms from each clause
  ├──► Task 4: Categorize by clause type
  └──► Task 5: Generate comparison matrix
        │
        ▼
Specialized Agents          ← Each handles one task type; agents work in parallel:
  ├─ ReadAgent    (o1 for reasoning)
  ├─ ExtractAgent (GPT-4o for extraction)
  └─ CompareAgent (GPT-4o for synthesis)
        │
        ▼
OutputAgent                 ← Formats the final result (structured output)
        │
        ▼
Matrix Grid Interface       ← Spreadsheet view with citations
```

Advanced platforms use specialized agents working in parallel, with transparent verification at each step.

![[Screenshot 2025-11-24 at 01.03.06.png]]

Multi-agent architecture: task decomposition → specialized agents work in parallel → each verifies its own output → aggregated with full citations.

### Key Technical Innovations

**1. Task Decomposition**

```python
# Illustrative sketch: orchestrator_llm is a placeholder for the planning model.
def decompose_query(complex_query):
    # Use an LLM to break the query down into atomic steps
    decomposition = orchestrator_llm.call(
        prompt=f"Break this into atomic steps: {complex_query}"
    )
    # Example output:
    # Step 1: Retrieve contracts mentioning "termination"
    # Step 2: For each contract, extract the termination clause
    # Step 3: For each clause, identify: notice period, conditions, penalties
    # Step 4: Create comparison table
    return decomposition.steps
```

**2. Dynamic Model Routing**

```python
def route_task(task):
    if task.requires_deep_reasoning:
        return o1_model        # OpenAI o1 for complex reasoning
    elif task.requires_vision:
        return gpt4o_vision    # For charts, tables
    elif task.is_simple_extraction:
        return gpt4o_mini      # Cheaper for simple tasks
    else:
        return gpt4o           # General purpose
```

**3. "Infinite Context Window"**

Not actually infinite, but a clever workaround (sketched in code after this list):

```
Traditional LLM: 128K token limit → Can't process 100 documents at once

Hebbia approach:
1. Index all 100 documents in a vector DB
2. For each sub-task, retrieve only the relevant chunks
3. Each agent works on <128K tokens
4. Combine results across agents
5. Result: Effectively "infinite" by never loading everything at once
```

**4. Verification Through Transparency**

Every agent step is visible in the Matrix grid view:

| Contract   | Termination Clause (source) | Notice Period (extracted) |
| ---------- | --------------------------- | ------------------------- |
| Contract-1 | [📄 p. 12]                  | 30 days [📄 p. 12]        |
| Contract-2 | [📄 p. 8]                   | 60 days [📄 p. 8]         |
| Contract-3 | [📄 p. 15]                  | 90 days [📄 p. 15]        |

Click on [📄 p. 12] → see the exact source text. The user can verify: "Did the AI extract correctly?"
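A rough sketch of the per-sub-task retrieval pattern from innovation 3 above — an illustration of the idea, not Hebbia's code. `embed`, `vector_db`, `agent_for`, and `combine` are hypothetical stand-ins.

```python
# Sketch of the "effectively infinite context" pattern: never load every
# document at once; retrieve only what each sub-task needs.
# embed(), vector_db, agent_for(), and combine() are hypothetical stand-ins.
MAX_CONTEXT_TOKENS = 128_000

def run_subtasks(subtasks, vector_db, embed):
    results = []
    for task in subtasks:
        # Retrieve only the chunks relevant to this sub-task
        chunks = vector_db.search(embed(task.description), top_k=20)

        # Keep each agent under the model's context limit
        context, budget = [], MAX_CONTEXT_TOKENS
        for chunk in chunks:
            if chunk.num_tokens > budget:
                break
            context.append(chunk)
            budget -= chunk.num_tokens

        # Route to a specialized agent (ReadAgent, ExtractAgent, ...) and
        # keep its cited output
        results.append(agent_for(task).run(task, context))

    return combine(results)  # aggregate per-task results across agents
```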
### Hebbia's Verification Mechanisms

**1. Source Tracking**
- Every cell in Matrix links to the exact source
- Hover → see the verbatim text from the document
- Citations at token level, not document level

**2. Agent Consensus**

```python
# For critical extractions, run multiple agents and compare their outputs.
# The agents and flag_for_human_review are placeholders.
def consensus_extract(clause, agents):
    results = [agent.extract(clause) for agent in agents]

    # Verify consistency
    if len(set(results)) == 1:
        return results[0]                  # all agents agree
    return flag_for_human_review(results)  # disagreement → human review
```

**3. Iterative Refinement**

```
User:   "Extract revenue"
Agent:  Returns $50M
User:   "That's wrong, check again"
Agent:  Re-runs with adjusted prompt → Returns $53.2M
System: Learns from the correction
```

### Performance

**Hebbia Matrix**:
- Accuracy: 92% (vs 68% for standard RAG)
- Speed: Seconds for thousands of documents
- Verification: Built-in source links
- Cost: High (multiple model calls) but justified by accuracy

**Why It Works Better**:
1. **Specialization**: Each agent is optimized for its task
2. **Parallel Processing**: Faster than sequential execution
3. **Better Retrieval**: Advanced indexing beats standard RAG
4. **Transparency**: Every step is visible, which makes verification easier

---

## Ensemble Methods

### The Principle

Run multiple verification methods and vote on the result.

```
Input: LLM output to verify

Method 1: NLI Check         → 85% confidence "Correct"
Method 2: NER Check         → 90% confidence "Correct"
Method 3: Span Detection    → 70% confidence "Hallucination"
Method 4: Consistency Check → 80% confidence "Correct"

Ensemble vote: 3 "Correct", 1 "Hallucination"
→ Final: "Correct" (but flagged for review due to dissent)
```

Combine multiple verification methods and vote on the result for higher accuracy.

![[Screenshot 2025-11-24 at 01.04.01.png]]

Ensemble method: run multiple verification techniques, combine their signals with gradient boosting, and gain roughly 5-10% accuracy over single methods.

### Technical Implementation

**Production System Example** (from research):

```python
# Illustrative sketch: the load_* functions are placeholders for real model loaders.
class HallucinationDetector:
    def __init__(self):
        self.ner_model = load_ner_model()  # Named Entity Recognition
        self.nli_model = load_nli_model()  # Natural Language Inference
        self.sbd_model = load_sbd_model()  # Span-Based Detection
        self.gbdt = load_gbdt_model()      # Gradient Boosting Decision Tree

    def detect(self, response, source_docs):
        # Run all component methods
        ner_score = self.check_entities(response, source_docs)
        nli_score = self.check_entailment(response, source_docs)
        sbd_score = self.check_spans(response, source_docs)

        # Combine the scores with the GBDT (probability of hallucination)
        features = [ner_score, nli_score, sbd_score]
        hallucination_prob = self.gbdt.predict_proba([features])[0][1]

        if hallucination_prob > 0.7:
            return "HALLUCINATION"
        elif hallucination_prob < 0.3:
            return "VERIFIED"
        else:
            return "UNCERTAIN"
```

### Component Methods

**1. Named Entity Recognition (NER)**

```python
# Check whether entities in the response also appear in the sources
source_entities = ner_model.extract(source_docs)
response_entities = ner_model.extract(response)

hallucinated_entities = []
for entity in response_entities:
    if entity not in source_entities:
        hallucinated_entities.append(entity)

# Guard against empty responses to avoid division by zero
score = 1 - len(hallucinated_entities) / max(len(response_entities), 1)
```

**2. Span-Based Detection (SBD)**

```python
# Mark each span in the response as verified or not
spans = split_into_spans(response)

verified_spans = []
for span in spans:
    # Check if the span appears in the sources (with some fuzzy matching)
    if fuzzy_match(span, source_docs, threshold=0.8):
        verified_spans.append(span)

coverage = len(verified_spans) / len(spans)
```

**3. Gradient Boosting Decision Trees**

```python
from sklearn.ensemble import GradientBoostingClassifier

# Train on labeled data
X_train = [
    [ner_score1, nli_score1, sbd_score1],  # Example 1
    [ner_score2, nli_score2, sbd_score2],  # Example 2
    # ... more labeled examples
]
y_train = [0, 1]  # ... one label per example: 0 = verified, 1 = hallucination

gbdt = GradientBoostingClassifier()
gbdt.fit(X_train, y_train)

# Use for new predictions (scikit-learn expects a 2D array)
new_scores = [ner_score, nli_score, sbd_score]
prediction = gbdt.predict([new_scores])[0]
```
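The GBDT combiner above needs labeled training data. As a simpler baseline, the voting logic from "The Principle" at the top of this section can be sketched directly — a minimal illustration in which each check is a hypothetical function returning a `(label, confidence)` pair.

```python
# Minimal majority-vote ensemble, per "The Principle" above.
# Each check is a hypothetical callable returning (label, confidence),
# where label is "CORRECT" or "HALLUCINATION".
from collections import Counter

def ensemble_vote(checks, response, source_docs):
    votes = [check(response, source_docs) for check in checks]
    counts = Counter(label for label, _confidence in votes)
    verdict, agreeing = counts.most_common(1)[0]
    return {
        "verdict": verdict,
        # Any dissenting method → flag the output for human review
        "flag_for_review": agreeing < len(votes),
        "votes": votes,
    }
```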
### Performance

**Ensemble vs Single Methods**:
- Accuracy: +5-10% over the best single method
- Robustness: Handles edge cases better
- Latency: 2-3x slower (multiple models)
- Cost: Higher (running multiple methods)

**When to Use**:
- High-stakes applications (legal, medical)
- When single-method confidence is low
- Production systems with strict accuracy requirements

---

## Comparative Analysis

### Method Comparison Matrix

![[Screenshot 2025-11-24 at 01.04.41.png]]

### Cost-Benefit Analysis

**Example: Legal Document Review**

**Traditional Human Review**:
- Speed: 2-3 hours per document
- Cost: $400-600 per document ($200/hr lawyer)
- Accuracy: 95-98%
- Verification: N/A (the human is the ground truth)

**Standard RAG**:
- Speed: 2-3 minutes per document
- Cost: $0.50 per document
- Accuracy: 68%
- Verification Cost: $200 (1 hour of human review)
- **Total**: $200.50 per document, ~1 hour total
- **ROI**: Limited — cheaper on paper, but the hour of human verification erodes most of the speed and cost gain

**Multi-Agent (Hebbia)**:
- Speed: 30 seconds per document
- Cost: $5 per document
- Accuracy: 92%
- Verification Cost: $50 (15-minute spot-check)
- **Total**: $55 per document, ~15 minutes total
- **ROI**: **~90% time savings, ~86% cost savings**

### Decision Framework

**Choose Standard RAG when**:
- Low-stakes applications
- Simple queries
- Speed > accuracy
- Budget constrained

**Choose NLI-Based when**:
- Need hallucination detection
- Can tolerate latency
- Post-processing is acceptable
- Have labeled training data

**Choose Multi-Agent when**:
- Complex analytical tasks
- High-stakes decisions
- Need transparency
- ROI justifies the cost

**Choose Ensemble when**:
- Highest accuracy required
- Have labeled data for training
- Latency acceptable
- High-stakes + production system

---

## Key Takeaways

### Technical Principles

1. **No Silver Bullet**: All methods have trade-offs
2. **Verification ≠ Elimination**: Verification reduces but doesn't eliminate hallucinations
3. **Transparency Matters**: Verifiable systems need source tracking
4. **Context is King**: Better retrieval → better verification

![[Screenshot 2025-11-24 at 01.05.37.png]]

### Implementation Insights

1. **Start Simple**: RAG is good enough for ~80% of use cases
2. **Measure First**: Track accuracy before optimizing
3. **Build Iteratively**: Add verification layers as needed
4. **User Feedback**: Critical for improving accuracy over time

### Future Directions

1. **Real-time Verification**: Verify during generation, not after
2. **Uncertainty Quantification**: The model should know when it doesn't know
3. **Formal Verification**: Provable correctness for critical applications
4. **Self-Improving**: Systems that learn from corrections
---

## Further Reading

**Papers**:
- RAG: "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks" (Lewis et al., 2020)
- SelfCheckGPT: "SelfCheckGPT: Zero-Resource Black-Box Hallucination Detection" (Manakul et al., 2023)
- RefChecker: Amazon Science blog, 2024
- Multi-Agent: Hebbia blog posts on the Matrix architecture

**Tools**:
- LangChain (RAG framework)
- LlamaIndex (RAG framework)
- Haystack (NLP framework with verification)
- Hebbia Matrix (production multi-agent system)

**Benchmarks**:
- HaluEval: Hallucination evaluation dataset
- RAGTruth: Human-labeled RAG outputs
- BEIR: Information retrieval benchmark

---

_For questions or discussion: this is a living document based on November 2025 research._