Research program
Limits of Intelligence — orchestration vs raw capacity
We study how decomposition, routing, verification, and retrieval stacks beat raw parameter scale on cost, latency, and stability while matching quality.
Data vintage: Oct 2025
Scope at a glance
- Position: Intelligence scales with orchestration as much as with parameters.
- Scope: Decomposition depth, routing, verifiers, conformal calibration, selective retrieval, latency engineering.
- Products: PageMind & Emplo implement these stacks with traceable evidence and audit hooks.
Abstract
Orchestration layers—step decomposition, uncertainty-aware routing, verifier-guided decoding, retrieval/memory, and plan–act–verify loops—match or beat larger monolithic models on quality while improving €/task, seconds per task, and stability. We publish a 2024–Oct 2025 evidence synthesis, a decomposition-depth versus capability model with error-cascade and repair analysis, ablation deltas across routing and verifiers, and product integrations that cut cost and variance without losing auditability.
Contributions
- Evidence map (2024–Oct 2025): Consolidates quantitative studies showing Pareto gains from routing, cascades, test-time compute scaling, verifier stacks, conformal calibration, and selective retrieval.
- Theory with practice: Links decomposition depth d, base capability C, and error propagation to predict the regions where orchestration wins.
- Method ablations: Summaries for multi-sampling, verifier choice, speculative decoding, conformal filters, and retrieval gating on accuracy, variance, €/task, and seconds per task.
- Product mapping: Shows how PageMind and Emplo exploit glossary memories, verifier stacks, retry bins, selective retrieval, and structured decoding to reduce cost and latency while improving auditability.
- Open research agenda: Testable questions tied to OVC-1: repair budgets, multilingual calibration, and formal win-region characterisations.
State of the field (2024–Oct 2025)
Benchmarking matured around cost-quality planes and latency trade-offs. RouterBench and FrugalGPT quantify cross-model price dispersion; compute-optimal inference work reframes scaling as spend-more-at-inference. Verifier stacks raise accuracy but reveal ranking flaws at scale; conformal prediction and calibration tuning keep routers honest. Selective retrieval beats always-on RAG, while verification-aware planning and speculative decoding reclaim orchestration overhead.
Routing & cascades
RouterBench and FrugalGPT establish cost-quality frontiers; learned policies and cascades match GPT-4-level quality at up to 98% lower cost.
RouterBench (Mar 2024) · FrugalGPT (Dec 2024)
Test-time compute
Compute-optimal inference studies show smaller models with tree search beat 14× larger baselines under matched FLOPs.
Wu et al., Snell et al. (ICLR 2025)
Verifier stacks
Process reward models, outcome models, and automated supervision increase accuracy with fewer samples but require robust ranking strategies.
NeurIPS 2024–2025
Uncertainty & calibration
Conformal prediction adapted to LLMs guarantees coverage; calibration tuning improves gating signals for cascades (see the sketch below).
NeurIPS 2024 · ACL 2024
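A minimal split-conformal calibration sketch, assuming held-out nonconformity scores from a router's confidence signal; the quantile construction is the standard split-conformal recipe, not any single paper's method.

```python
import math

def conformal_threshold(calibration_scores: list[float], alpha: float = 0.1) -> float:
    """Split-conformal acceptance threshold over held-out nonconformity scores.

    Accepting new items whose score is <= this threshold targets marginal
    coverage of at least 1 - alpha, assuming exchangeability with the
    calibration set.
    """
    n = len(calibration_scores)
    # Finite-sample corrected rank, clamped so small calibration sets still work.
    rank = min(n, math.ceil((n + 1) * (1 - alpha)))
    return sorted(calibration_scores)[rank - 1]

# Usage with hypothetical scores and the rejection rates quoted below (alpha = 0.1, 0.2):
# threshold = conformal_threshold(held_out_scores, alpha=0.1)
```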
Selective retrieval
RAFT and self-routing RAG reduce retrieval calls by ~29% while raising accuracy by ~5 pp, cutting tokens and spend.
arXiv 2024–2025
Planner–executor–verifier
Verification-aware plans encode checks that trigger rollback; planner–executor–verifier loops beat monolithic act-only agents.
arXiv 2024–2025
Routing policies in practice
- Static tiers by content type (cheap generalist → expensive specialist).
- Learned routers predicting P(correct) × utility with spend caps.
- Conformal routers guaranteeing coverage at target rejection rates (α = 0.1, 0.2).
- Post-hoc acceptance using verifiers with bounded retries.
- Cascade routing that unifies ex-ante gating and post-hoc acceptance (sketched below).
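The sketch below illustrates the cascade pattern from the last three bullets: an ex-ante gate on a predicted P(correct), post-hoc verifier acceptance, and a bounded retry budget. The callables `cheap_model`, `strong_model`, `predict_p_correct`, and `verifier` are hypothetical placeholders, not APIs from RouterBench, FrugalGPT, or our products.

```python
from dataclasses import dataclass

@dataclass
class RouteResult:
    answer: str
    tier: str
    cost: float

def route(item,
          cheap_model, strong_model,     # callables: item -> (answer, cost)
          predict_p_correct,             # callable: item -> float in [0, 1]
          verifier,                      # callable: (item, answer) -> bool
          gate_threshold: float = 0.8,   # ex-ante gate on predicted correctness
          max_retries: int = 2) -> RouteResult:
    spent = 0.0
    # Ex-ante gating: send easy-looking items to the cheap tier first.
    if predict_p_correct(item) >= gate_threshold:
        for _ in range(max_retries + 1):        # bounded retries keep spend predictable
            answer, cost = cheap_model(item)
            spent += cost
            if verifier(item, answer):          # post-hoc acceptance
                return RouteResult(answer, "cheap", spent)
    # Escalate once the gate or the verifier rejects the cheap tier.
    answer, cost = strong_model(item)
    spent += cost
    return RouteResult(answer, "strong", spent)
```

Making the retry budget explicit is what keeps worst-case spend bounded; the repair-budget questions under Open problems ask how large that budget can grow before the cascade stops paying for itself.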
Retrieval & memory patterns
- Glossary memory — enforce terminology and unit normalisation.
- Trace memory — attach per-row evidence for audit trails.
- Vector memory — dense retrieval for long-tail entities.
- Selective RAG — route to retrieval only when it beats long-context reading (see the sketch below).
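A minimal sketch of the selective-RAG gate in the last item, assuming hypothetical callables `self_confidence`, `retrieve`, and `answer`; it illustrates the gating idea, not the self-routing RAG implementation.

```python
def answer_with_selective_rag(question,
                              self_confidence,   # question -> float in [0, 1]
                              retrieve,          # question -> (passages, token_cost)
                              answer,            # (question, passages) -> str
                              confidence_gate: float = 0.7,
                              token_budget: int = 4000) -> str:
    if self_confidence(question) >= confidence_gate:
        # Parametric knowledge looks sufficient: skip retrieval, save tokens and latency.
        return answer(question, passages=[])
    passages, token_cost = retrieve(question)
    if token_cost > token_budget:
        # Retrieval would blow the budget: fall back to answering without passages
        # (or abstain, depending on policy).
        return answer(question, passages=[])
    return answer(question, passages=passages)
```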
Anti-patterns to avoid
- Always-on RAG pipelines that ignore latency/price budgets.
- Pipelines sensitive to spurious features without corrective retrieval.
- Uncalibrated retrieval scoring with no abstention channel.
Quantitative frontiers
Pareto frontier schematic — orchestration pushes €/task and seconds/task down while maintaining quality parity.
Decomposition depth vs quality
Decomposition depth widens the win-region of orchestrated 8B stacks over larger monoliths; report paired accuracy with confidence intervals.
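A simple error-cascade model makes the win-region intuition concrete. It rests on illustrative assumptions (independent steps, a verifier that reliably flags failed steps) rather than on results from the cited studies.

```latex
% d      = decomposition depth (number of steps)
% p(C)   = per-step success probability of the orchestrated base model
% r      = repair budget: verified retries allowed per step
% P_mono = end-to-end success probability of the larger monolithic model
\[
  P_{\text{chain}}(d, p) = p^{\,d}
  \qquad \text{(no verification: errors cascade)}
\]
\[
  P_{\text{orch}}(d, p, r) = \bigl(1 - (1 - p)^{\,r+1}\bigr)^{d}
  \qquad \text{(verify-and-repair, } r \text{ retries per step)}
\]
\[
  \text{win region:}\quad P_{\text{orch}}\bigl(d, p(C), r\bigr) \ge P_{\text{mono}}
  \quad \text{at lower €/task.}
\]
```

Deeper decomposition multiplies more factors but simplifies each step, so the win-region hinges on how quickly p(C) approaches 1 as steps get simpler; the paired-accuracy plots with confidence intervals locate that boundary empirically.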
Benchmark summary
| Task class | Models or stack | Quality / win-rate | Variance | Latency | €/task | Tokens/task | Source |
|---|---|---|---|---|---|---|---|
| Multi-LLM routing | RouterBench routers vs single LLMs | Comparable accuracy; 2–5× cost spread | — | — | varies 2–5× | — | RouterBench (Mar 2024) |
| API cascade | FrugalGPT cascade → GPT-4 | Matches GPT-4; −98% cost | — | — | −98% | — | TMLR (Dec 2024) |
| Math reasoning | Llemma-7B + tree search vs Llemma-34B | 7B + search > 34B under matched FLOPs | — | ↑ (budgeted) | — | ↑ (extra samples) | ICLR (Apr 2025) |
| Latency optimisation | Draft-&-Verify, cascade-speculative | Comparable quality | — | ≈2–3× faster | ↓ | — | ACL & NeurIPS (2024) |
| Mixture-of-agents | Open-model ensembles vs GPT-4 Omni | 65.1% vs 57.5% judged win-rate | — | — | — | — | arXiv (Jun 2024); ICLR (Jan 2025) |
| Selective retrieval | Self-routing RAG | +5.1 pp accuracy; −29% retrievals | — | ↓ | ↓ | ↓ context | arXiv (Apr 2025) |
Component ablations
| Component | Swap / ablation | ΔQuality | ΔVariance | Cost / latency impact | Source |
|---|---|---|---|---|---|
| Multi-sample (k) | 1 → k vote | ↑ (task-dependent) | ↓ | + tokens, + seconds | ICLR (Apr 2025) |
| Early-stop self-consistency | Off → on | ≈ quality | ≈ | −34–84% samples | Findings-ACL (Nov 2024) |
| Verifier choice | ORM → PRM/OVM | ↑ accuracy | ↓ | − samples | NeurIPS & ACL (2024–2025) |
| Conformal filter | Off → on | Coverage guarantee | ↓ FP variance | + small seconds | NeurIPS (Oct 2024) |
| Selective RAG | Always → gated | ↑ accuracy | — | ↓ tokens | arXiv (Mar–Jun 2024) |
| Speculative decode | Disable → enable | ≈ quality | — | ≈2–3× faster | ACL & NeurIPS (2024) |
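As a concrete reading of the early-stop row above, a minimal sketch of early-stopping self-consistency using a generic vote-margin rule; the Findings-ACL method has its own stopping criterion, so treat this as an assumption-laden illustration.

```python
from collections import Counter

def early_stop_self_consistency(sample_answer,        # callable: () -> str, one model sample
                                max_samples: int = 16,
                                min_samples: int = 3,
                                stop_vote_share: float = 0.6) -> str:
    """Draw samples one at a time; stop once the leading answer holds a clear
    majority, otherwise fall back to a plain vote over max_samples."""
    votes = Counter()
    for drawn in range(1, max_samples + 1):
        votes[sample_answer()] += 1
        top_answer, top_count = votes.most_common(1)[0]
        if drawn >= min_samples and top_count / drawn >= stop_vote_share:
            return top_answer                          # early exit saves samples
    return votes.most_common(1)[0][0]
```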
Operational guidance
Variance control
Seed variance on reasoning benchmarks is high; run ≥30 seeds and report confidence intervals. Combine self-consistency with verifier-guided re-ranking to stabilise acceptance. Track p95/p99 latency when speculative decoding is enabled so orchestration overhead does not erode service-level objectives.
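A minimal reporting helper matching this protocol, with assumed inputs (per-seed accuracies and per-request latencies in seconds); a bootstrap interval would be the safer choice if fewer than ~30 seeds are available.

```python
import statistics

def summarise_runs(accuracies: list[float], latencies_s: list[float]) -> dict:
    """Summarise seeded runs: mean accuracy with a ~95% CI plus tail latency."""
    n = len(accuracies)
    mean_acc = statistics.mean(accuracies)
    # Normal-approximation 95% interval across seeds.
    half_width = 1.96 * statistics.stdev(accuracies) / n ** 0.5
    ordered = sorted(latencies_s)
    m = len(ordered)
    return {
        "accuracy_mean": mean_acc,
        "accuracy_ci95": (mean_acc - half_width, mean_acc + half_width),
        "latency_p95_s": ordered[min(m - 1, int(0.95 * m))],
        "latency_p99_s": ordered[min(m - 1, int(0.99 * m))],
    }
```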
Publishing posture
- Publish cost, latency, and win-rate deltas with confidence bounds; surface negative results.
- Package RouterBench-style CSVs so partners can re-plot cost-quality frontiers.
- Report repair budgets (max retries) and abstention policies to keep cascades predictable.
Open problems
- Quantify repair budgets that keep cascades efficient without runaway retries.
- Multilingual calibration for routers and verifiers under conformal guarantees.
- Formal win-region characterisation for decomposition depth × capability (d, C).
- Robust ranking when verifiers are imperfect at scale.
- Programmatic guarantees for selective retrieval triggering.
Product integration
PageMind and Emplo use glossary memories, trace memories, and selective retrieval to attach evidence to every output. Verifier stacks gate publish decisions; conformal filters govern abstention; regression monitors with WECO rules trigger rollback. Cost savings accrue through cascades that pick cheaper models for simple attributes while escalating to expensive models only where acceptance targets demand it.
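A minimal sketch of a WECO-style regression check of the kind described above, with hypothetical baseline numbers; it implements only two of the classic Western Electric rules and is not the PageMind or Emplo monitor.

```python
def weco_alarm(metric_history: list[float], baseline_mean: float, sigma: float) -> bool:
    """Two classic Western Electric checks on a rolling quality metric
    (e.g. per-batch acceptance rate): Rule 1, latest point beyond 3 sigma;
    Rule 4, eight consecutive points on the same side of the baseline mean."""
    if not metric_history:
        return False
    if abs(metric_history[-1] - baseline_mean) > 3 * sigma:             # Rule 1
        return True
    last8 = metric_history[-8:]
    if len(last8) == 8 and (all(v < baseline_mean for v in last8)
                            or all(v > baseline_mean for v in last8)):  # Rule 4
        return True
    return False

# Hypothetical usage: trigger rollback when the acceptance rate drifts.
# if weco_alarm(acceptance_rates, baseline_mean=0.92, sigma=0.015):
#     rollback_to_previous_release()
```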
Data vintage: Oct 2025 · Last updated 01 Oct 2025
