
Research program

Limits of Intelligence — orchestration vs raw capacity

We study how decomposition, routing, verification, and retrieval stacks beat raw parameter scale on cost, latency, and stability while matching quality.

Data vintage: Oct 2025

Scope at a glance

Position
Intelligence scales with orchestration as much as with parameters.
Scope
Decomposition depth, routing, verifiers, conformal calibration, selective retrieval, latency engineering.
Products
PageMind & Emplo implement these stacks with traceable evidence and audit hooks.

Abstract

Orchestration layers—step decomposition, uncertainty-aware routing, verifier-guided decoding, retrieval/memory, and plan–act–verify loops—match or beat larger monolithic models on quality while improving €/task, seconds per task, and stability. We publish a 2024–Oct 2025 evidence synthesis, a decomposition-depth versus capability model with error-cascade and repair analysis, ablation deltas across routing and verifiers, and product integrations that cut cost and variance without losing auditability.

Contributions

  • Evidence map (2024–Oct 2025): Consolidates quantitative studies showing Pareto gains from routing, cascades, test-time compute scaling, verifier stacks, conformal calibration, and selective retrieval.
  • Theory with practice: Links decomposition depth d, base capability C, and error-propagation to predict regions where orchestration wins.
  • Method ablations: Summaries for multi-sampling, verifier choice, speculative decoding, conformal filters, and retrieval gating on accuracy, variance, €/task, and seconds per task.
  • Product mapping: Shows how PageMind and Emplo exploit glossary memories, verifier stacks, retry bins, selective retrieval, and structured decoding to reduce cost and latency while improving auditability.
  • Open research agenda: Testable questions tied to OVC-1: repair budgets, multilingual calibration, and formal win-region characterisations.

State of the field (2024–Oct 2025)

Benchmarking has matured around cost-quality planes and latency trade-offs. RouterBench and FrugalGPT quantify cross-model price dispersion; compute-optimal inference work reframes scaling as spending more compute at inference time. Verifier stacks deliver gains but also show failure modes at scale; conformal prediction and calibration tuning keep routers honest. Selective retrieval beats always-on RAG, while verification-aware planning and speculative decoding reclaim orchestration overhead.

Routing & cascades

RouterBench and FrugalGPT establish cost-quality frontiers; learned policies and cascades match GPT-4-level quality at up to 98% lower cost.

RouterBench (Mar 2024) · FrugalGPT (Dec 2024)

Test-time compute

Compute-optimal inference studies show smaller models with tree search beat 14× larger baselines under matched FLOPs.

Wu et al., Snell et al. (ICLR 2025)

Verifier stacks

Process reward models, outcome models, and automated supervision increase accuracy with fewer samples but require robust ranking strategies.

NeurIPS 2024–2025

Uncertainty & calibration

Conformal prediction has been adapted to LLMs to guarantee coverage; calibration tuning improves gating signals for cascades.

NeurIPS 2024 · ACL 2024

Selective retrieval

RAFT and self-routing RAG reduce retrieval calls by ~29% while raising accuracy by ~5 pp, cutting tokens and spend.

arXiv 2024–2025

Planner–executor–verifier

Verification-aware plans encode checks that trigger rollback; verification hooks beat monolithic act-only agents.

arXiv 2024–2025

Routing policies in practice

  • Static tiers by content type (cheap generalist → expensive specialist).
  • Learned routers predicting P(correct) × utility with spend caps (see the sketch after this list).
  • Conformal routers guaranteeing coverage at target rejection rates (α = 0.1, 0.2).
  • Post-hoc acceptance using verifiers with bounded retries.
  • Cascade routing that unifies ex-ante gating and post-hoc acceptance.
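A minimal sketch of how the learned-router and bounded-retry patterns above compose. The callables `predict_p_correct`, `verify`, and the per-tier `generate` are assumed to be supplied by the caller; the thresholds and prices are illustrative, not recommendations.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Tier:
    name: str
    cost_per_call: float                     # €/call, illustrative
    generate: Callable[[str], str]           # model call for this tier

def route_and_accept(
    task: str,
    tiers: list[Tier],                                   # ordered cheap -> expensive
    predict_p_correct: Callable[[str, Tier], float],     # learned router score
    verify: Callable[[str, str], bool],                  # post-hoc acceptance check
    task_value: float = 0.10,                            # € value of a correct answer
    spend_cap: float = 0.05,                             # hard €/task cap
    max_retries: int = 2,                                # bounded repair budget per tier
) -> tuple[str | None, float]:
    """Ex-ante gating on P(correct) x value minus cost, then post-hoc
    acceptance with bounded retries; abstain once the spend cap is reached."""
    spent = 0.0
    for tier in tiers:
        p = predict_p_correct(task, tier)
        if p * task_value - tier.cost_per_call <= 0.0:   # negative expected utility
            continue                                     # skip tier, try a stronger one
        for _ in range(max_retries + 1):
            if spent + tier.cost_per_call > spend_cap:
                return None, spent                       # abstain within budget
            answer = tier.generate(task)
            spent += tier.cost_per_call
            if verify(task, answer):                     # accepted by the verifier
                return answer, spent
        # verifier kept rejecting: escalate to the next, more capable tier
    return None, spent                                   # abstain / route to a human
```

Under this sketch a rejected answer consumes budget but never exceeds the cap, which keeps per-task spend predictable even when the verifier is strict.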

Retrieval & memory patterns

  • Glossary memory — enforce terminology and unit normalisation.
  • Trace memory — attach per-row evidence for audit trails.
  • Vector memory — dense retrieval for long-tail entities.
  • Selective RAG — route to retrieval only when it beats long-context reading (see the sketch after this list).
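A minimal sketch of the selective-RAG gate, assuming caller-supplied `gate_score`, `retrieve`, and `generate` functions; the 0.5 threshold is an assumption to be tuned on a dev set.

```python
from typing import Callable

def answer_with_selective_rag(
    question: str,
    context: str,
    gate_score: Callable[[str, str], float],   # estimated retrieval need in [0, 1]
    retrieve: Callable[[str], list[str]],      # vector / glossary lookup
    generate: Callable[[str, str], str],       # LLM call: (question, evidence) -> answer
    gate_threshold: float = 0.5,               # tuned on a dev set
) -> str:
    """Call retrieval only when the gate predicts long-context reading alone
    will miss; otherwise skip the retrieval call and its token cost entirely."""
    if gate_score(question, context) >= gate_threshold:
        evidence = "\n".join(retrieve(question))   # pay retrieval tokens only here
        return generate(question, context + "\n" + evidence)
    return generate(question, context)             # long-context path, no retrieval spend
```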

Anti-patterns to avoid

  • Always-on RAG pipelines that ignore latency/price budgets.
  • Pipelines sensitive to spurious features without corrective retrieval.
  • Uncalibrated retrieval scoring with no abstention channel (see the calibration sketch after this list).
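One way to supply that abstention channel is split-conformal calibration of the accept threshold. A minimal sketch, assuming a held-out calibration set of scorer outputs on known-correct items is available; the example scores are placeholders.

```python
import math

def conformal_accept_threshold(cal_scores: list[float], alpha: float = 0.1) -> float:
    """Split-conformal calibration for an accept/abstain gate.

    cal_scores: scorer outputs on a held-out calibration set, restricted to
    items whose answers were correct (higher score = more confident).
    The returned threshold keeps the fraction of correct answers that get
    wrongly abstained on at roughly alpha, up to the finite-sample correction
    and assuming exchangeability between calibration and test items.
    """
    n = len(cal_scores)
    # Nonconformity: a low score on a correct item means high nonconformity.
    nonconformity = sorted(1.0 - s for s in cal_scores)
    # Finite-sample quantile index: ceil((n + 1) * (1 - alpha)), capped at n.
    k = min(n, math.ceil((n + 1) * (1.0 - alpha)))
    q_hat = nonconformity[k - 1]
    return 1.0 - q_hat          # accept when score >= threshold, else abstain

# Placeholder calibration scores; gate a new output scored at 0.82.
threshold = conformal_accept_threshold([0.91, 0.85, 0.77, 0.88, 0.95, 0.70], alpha=0.1)
print(threshold, "accept" if 0.82 >= threshold else "abstain")
```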

Quantitative frontiers

Pareto frontier schematic — orchestration pushes €/task and seconds/task down while maintaining quality parity.
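The frontier itself is straightforward to recompute from per-stack measurements. A minimal sketch follows; the stack names and numbers are illustrative placeholders, not measured results.

```python
def pareto_frontier(points: list[tuple[str, float, float]]) -> list[tuple[str, float, float]]:
    """Return stacks not dominated on (cost, quality): a stack is dominated if
    some other stack is cheaper-or-equal and better-or-equal with at least one
    strict improvement. Each point is (name, euro_per_task, quality)."""
    frontier = []
    for name, cost, quality in points:
        dominated = any(
            (c <= cost and q >= quality) and (c < cost or q > quality)
            for _, c, q in points
        )
        if not dominated:
            frontier.append((name, cost, quality))
    return sorted(frontier, key=lambda p: p[1])   # cheapest first

# Illustrative numbers only, not measurements.
stacks = [
    ("monolith-34B", 0.040, 0.87),
    ("orchestrated-8B", 0.012, 0.88),
    ("base-8B", 0.006, 0.72),
]
print(pareto_frontier(stacks))
```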

Decomposition depth vs quality

Decomposition depth line chart: step depth (d) on the X axis, exact/soft match accuracy on the Y axis, one line per stack.

| Step depth (d) | Base 8B model | Base 34B model | Orchestrated 8B |
|---|---|---|---|
| 1 | 72 | 79 | 78 |
| 2 | 78 | 84 | 84 |
| 3 | 82 | 86 | 88 |
| 4 | 83 | 87 | 90 |

Decomposition depth widens the win-region for orchestrated 8B stacks versus larger monoliths; report paired accuracy with confidence intervals.
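A minimal sketch of the error-cascade and repair model behind this chart: per-step accuracy compounds with depth d, and a verifier-triggered retry raises the effective per-step accuracy before compounding. The per-step accuracies, catch rate, and retry budget below are assumptions for illustration, not fitted values.

```python
def step_success_with_repair(p_step: float, p_catch: float, retries: int) -> float:
    """P(one step ends up correct) when a failed attempt is retried only if the
    verifier catches the failure (probability p_catch), up to `retries` times."""
    success = p_step
    for _ in range(retries):
        success = p_step + (1.0 - p_step) * p_catch * success
    return success

def pipeline_quality(p_step: float, depth: int,
                     p_catch: float = 0.0, retries: int = 0) -> float:
    """Quality of a depth-d decomposition, assuming independent steps, so an
    unrepaired error cascade compounds as p_step ** depth."""
    return step_success_with_repair(p_step, p_catch, retries) ** depth

# Assumed numbers for illustration only (not fitted to the chart above):
# an 8B stack at 0.93 per step with a verifier (p_catch=0.8, one retry)
# against a 34B monolith at 0.96 per step and no repair loop.
for d in (1, 2, 3, 4):
    print(d,
          round(pipeline_quality(0.93, d, p_catch=0.8, retries=1), 3),
          round(pipeline_quality(0.96, d), 3))
```

With these assumed numbers the orchestrated stack's advantage grows with depth, which is the qualitative win-region behaviour the chart reports; fitted values would come from the paired measurements described above.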

Benchmark summary

| Task class | Models or stack | Quality / win-rate | Variance | Latency | €/task | Tokens/task | Source |
|---|---|---|---|---|---|---|---|
| Multi-LLM routing | RouterBench routers vs single LLMs | Comparable accuracy; 2–5× cost spread | varies |  | 2–5× |  | RouterBench (Mar 2024) |
| API cascade | FrugalGPT cascade → GPT-4 | Matches GPT-4; −98% cost |  |  | −98% |  | TMLR (Dec 2024) |
| Math reasoning | Llemma-7B + tree search vs Llemma-34B | 7B + search > 34B under matched FLOPs |  |  | + budgeted | + samples | ICLR (Apr 2025) |
| Latency optimisation | Draft-&-Verify, cascade-speculative | Comparable quality |  | ≈2–3× faster |  |  | ACL & NeurIPS (2024) |
| Mixture-of-agents | Open-model ensembles vs GPT-4 Omni | 65.1% vs 57.5% judged win-rate |  |  |  |  | arXiv (Jun 2024); ICLR (Jan 2025) |
| Selective retrieval | Self-routing RAG | +5.1 pp accuracy; −29% retrievals |  |  |  | ↓ context | arXiv (Apr 2025) |

Component ablations

| Component | Swap / ablation | ΔQuality | ΔVariance | Cost / latency impact | Source |
|---|---|---|---|---|---|
| Multi-sample (k) | 1 → k vote | ↑ (task-dependent) |  | + tokens, + seconds | ICLR (Apr 2025) |
| Early-stop self-consistency | Off → on | ≈ quality |  | −34–84% samples | Findings-ACL (Nov 2024) |
| Verifier choice | ORM → PRM/OVM | ↑ accuracy |  | − samples | NeurIPS & ACL (2024–2025) |
| Conformal filter | Off → on | Coverage guarantee | ↓ FP variance | + small seconds | NeurIPS (Oct 2024) |
| Selective RAG | Always → gated | ↑ accuracy |  | ↓ tokens | arXiv (Mar–Jun 2024) |
| Speculative decode | Disable → enable | ≈ quality |  | ≈2–3× faster | ACL & NeurIPS (2024) |
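The early-stop self-consistency row corresponds to a simple loop: draw samples one at a time and stop once a single answer holds a clear majority instead of always spending the full k. A minimal sketch, with the sampling callable, stopping share, and sample caps as assumptions.

```python
from collections import Counter
from typing import Callable

def early_stop_self_consistency(
    sample_answer: Callable[[], str],   # one stochastic model call
    max_samples: int = 16,
    min_samples: int = 3,
    stop_share: float = 0.7,            # stop once the leader holds this vote share
) -> tuple[str, int]:
    """Majority vote that stops early when one answer clearly dominates,
    trading a small quality risk for a large cut in sampled tokens."""
    votes: Counter[str] = Counter()
    for k in range(1, max_samples + 1):
        votes[sample_answer()] += 1
        leader, count = votes.most_common(1)[0]
        if k >= min_samples and count / k >= stop_share:
            return leader, k            # early exit: saved max_samples - k calls
    return votes.most_common(1)[0][0], max_samples
```

The stopping share trades sampled tokens against the small chance of committing to an early but wrong majority, which is consistent with the roughly unchanged quality at 34–84% fewer samples reported in the table.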

Operational guidance

Variance control

Seed variance on reasoning benchmarks is high; run ≥30 seeds with confidence intervals. Combine self-consistency with verifier-guided re-ranking to stabilise acceptance. Track p95/p99 latency when speculative decoding is enabled so orchestration overhead does not erode service-level objectives.
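A minimal sketch of that measurement discipline: a percentile-bootstrap confidence interval on paired per-seed accuracy deltas plus nearest-rank p95/p99 latency. The input lists are placeholders standing in for harness output.

```python
import random
import statistics

def bootstrap_ci(deltas: list[float], iters: int = 10_000,
                 level: float = 0.95) -> tuple[float, float]:
    """Percentile bootstrap CI for the mean paired delta across >= 30 seeds."""
    means = []
    for _ in range(iters):
        resample = [random.choice(deltas) for _ in deltas]
        means.append(statistics.fmean(resample))
    means.sort()
    lo = means[int((1 - level) / 2 * iters)]
    hi = means[int((1 + level) / 2 * iters) - 1]
    return lo, hi

def percentile(values: list[float], q: float) -> float:
    """Simple nearest-rank percentile, e.g. q=0.95 for p95 latency."""
    ordered = sorted(values)
    return ordered[min(len(ordered) - 1, int(q * len(ordered)))]

# Paired per-seed accuracy deltas (orchestrated minus monolith) and per-request
# latencies would come from the benchmark harness; placeholder values here.
deltas = [0.04, 0.06, 0.03, 0.05, 0.07, 0.02] * 5          # 30 seeds
latencies_s = [1.2, 1.4, 1.1, 2.9, 1.3, 1.5, 3.4, 1.2]
print("mean delta CI:", bootstrap_ci(deltas))
print("p95:", percentile(latencies_s, 0.95), "p99:", percentile(latencies_s, 0.99))
```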

Publishing posture

  • Publish cost, latency, and win-rate deltas with confidence bounds; surface negative results.
  • Package RouterBench-style CSVs so partners can re-plot cost-quality frontiers.
  • Report repair budgets (max retries) and abstention policies to keep cascades predictable.

Open problems

  • Quantify repair budgets that keep cascades efficient without runaway retries.
  • Multilingual calibration for routers and verifiers under conformal guarantees.
  • Formal win-region characterisation for decomposition depth × capability (d, C).
  • Robust ranking when verifiers are imperfect at scale.
  • Programmatic guarantees for selective retrieval triggering.

Product integration

PageMind and Emplo use glossary memories, trace memories, and selective retrieval to attach evidence to every output. Verifier stacks gate publish decisions; conformal filters govern abstention; regression monitors with WECO rules trigger rollback. Cost savings accrue through cascades that pick cheaper models for simple attributes while escalating to expensive models only where acceptance targets demand it.
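For the regression monitor, a minimal sketch of two Western Electric (WECO) run rules applied to a daily acceptance-rate series; this illustrates the rule logic only and is not the product implementation. The control mean and sigma are assumed inputs.

```python
def weco_alarms(values: list[float], mean: float, sigma: float) -> list[str]:
    """Two of the Western Electric (WECO) run rules as a regression monitor:
    rule 1 = one point beyond 3 sigma, rule 4 = eight consecutive points on
    the same side of the centre line."""
    alarms = []
    for i, v in enumerate(values):
        if abs(v - mean) > 3 * sigma:
            alarms.append(f"rule1 @ {i}")
    side = [1 if v > mean else -1 for v in values]
    run = 0
    for i in range(1, len(side)):
        run = run + 1 if side[i] == side[i - 1] else 0
        if run >= 7:                      # eight consecutive points, same side
            alarms.append(f"rule4 @ {i}")
            run = 0
    return alarms

# Example: daily acceptance rates against an assumed control mean and sigma.
if weco_alarms([0.95, 0.96, 0.91, 0.94], mean=0.95, sigma=0.01):
    pass  # trigger rollback and escalate to the more capable tier
```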

Data vintage: Oct 2025 · Last updated 01 Oct 2025