Research program
Limits of Intelligence — orchestration vs raw capacity
We study how decomposition, routing, verification, and retrieval stacks beat raw parameter scale on cost, latency, and stability while matching quality.
Data vintage: Oct 2025
Scope at a glance
- Position: Intelligence scales with orchestration as much as with parameters.
- Scope: Decomposition depth, routing, verifiers, conformal calibration, selective retrieval, latency engineering.
- Products: PageMind & Emplo implement these stacks with traceable evidence and audit hooks.
Abstract
Orchestration layers—step decomposition, uncertainty-aware routing, verifier-guided decoding, retrieval/memory, and plan–act–verify loops—match or beat larger monolithic models on quality while improving €/task, seconds per task, and stability. We publish a 2024–Oct 2025 evidence synthesis, a decomposition-depth versus capability model with error-cascade and repair analysis, ablation deltas across routing and verifiers, and product integrations that cut cost and variance without losing auditability.
Contributions
- Evidence map (2024–Oct 2025): Consolidates quantitative studies showing Pareto gains from routing, cascades, test-time compute scaling, verifier stacks, conformal calibration, and selective retrieval.
- Theory with practice: Links decomposition depth d, base capability C, and error propagation to predict the regions where orchestration wins.
- Method ablations: Summaries for multi-sampling, verifier choice, speculative decoding, conformal filters, and retrieval gating on accuracy, variance, €/task, and seconds per task.
- Product mapping: Shows how PageMind and Emplo exploit glossary memories, verifier stacks, retry bins, selective retrieval, and structured decoding to reduce cost and latency while improving auditability.
- Open research agenda: Testable questions tied to OVC-1: repair budgets, multilingual calibration, and formal win-region characterisations.
State of the field (2024–Oct 2025)
Benchmarking matured around cost-quality planes and latency trade-offs. RouterBench and FrugalGPT quantify cross-model price dispersion; compute-optimal inference work reframes scaling as spend-more-at-inference. Verifier stacks raise accuracy but reveal ranking flaws at scale; conformal prediction and calibration tuning keep routers honest. Selective retrieval beats always-on RAG, while verification-aware planning and speculative decoding reclaim orchestration overhead.
Routing & cascades
RouterBench and FrugalGPT establish cost-quality frontiers; learned policies and cascades match GPT-4-level quality at up to 98% lower cost.
RouterBench (Mar 2024) · FrugalGPT (Dec 2024)
Test-time compute
Compute-optimal inference studies show smaller models with tree search beat 14× larger baselines under matched FLOPs.
Wu et al., Snell et al. (ICLR 2025)
Verifier stacks
Process reward models, outcome models, and automated supervision increase accuracy with fewer samples but require robust ranking strategies.
NeurIPS 2024–2025
Uncertainty & calibration
Conformal prediction adapted to LLMs guarantees coverage; calibration tuning improves gating signals for cascades (see the sketch below).
NeurIPS 2024 · ACL 2024
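A minimal split-conformal calibration sketch, assuming held-out nonconformity scores from a router's confidence signal; the quantile construction is the standard split-conformal recipe, not any single paper's method.

```python
import math

def conformal_threshold(calibration_scores: list[float], alpha: float = 0.1) -> float:
    """Split-conformal acceptance threshold over held-out nonconformity scores.

    Accepting new items whose score is <= this threshold targets marginal
    coverage of at least 1 - alpha, assuming exchangeability with the
    calibration set.
    """
    n = len(calibration_scores)
    # Finite-sample corrected rank, clamped so small calibration sets still work.
    rank = min(n, math.ceil((n + 1) * (1 - alpha)))
    return sorted(calibration_scores)[rank - 1]

# Usage with hypothetical scores and the rejection rates quoted below (alpha = 0.1, 0.2):
# threshold = conformal_threshold(held_out_scores, alpha=0.1)
```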
Selective retrieval
RAFT and self-routing RAG reduce retrieval calls by ~29% while raising accuracy by ~5 pp, cutting tokens and spend.
arXiv 2024–2025
Planner–executor–verifier
Verification-aware plans encode checks that trigger rollback; planner–executor–verifier loops beat monolithic act-only agents.
arXiv 2024–2025
Routing policies in practice
- Static tiers by content type (cheap generalist → expensive specialist).
- Learned routers predicting P(correct) × utility with spend caps.
- Conformal routers guaranteeing coverage at target rejection rates (α = 0.1, 0.2).
- Post-hoc acceptance using verifiers with bounded retries.
- Cascade routing that unifies ex-ante gating and post-hoc acceptance (sketched below).
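The sketch below illustrates the cascade pattern from the last three bullets: an ex-ante gate on a predicted P(correct), post-hoc verifier acceptance, and a bounded retry budget. The callables `cheap_model`, `strong_model`, `predict_p_correct`, and `verifier` are hypothetical placeholders, not APIs from RouterBench, FrugalGPT, or our products.

```python
from dataclasses import dataclass

@dataclass
class RouteResult:
    answer: str
    tier: str
    cost: float

def route(item,
          cheap_model, strong_model,     # callables: item -> (answer, cost)
          predict_p_correct,             # callable: item -> float in [0, 1]
          verifier,                      # callable: (item, answer) -> bool
          gate_threshold: float = 0.8,   # ex-ante gate on predicted correctness
          max_retries: int = 2) -> RouteResult:
    spent = 0.0
    # Ex-ante gating: send easy-looking items to the cheap tier first.
    if predict_p_correct(item) >= gate_threshold:
        for _ in range(max_retries + 1):        # bounded retries keep spend predictable
            answer, cost = cheap_model(item)
            spent += cost
            if verifier(item, answer):          # post-hoc acceptance
                return RouteResult(answer, "cheap", spent)
    # Escalate once the gate or the verifier rejects the cheap tier.
    answer, cost = strong_model(item)
    spent += cost
    return RouteResult(answer, "strong", spent)
```

Making the retry budget explicit is what keeps worst-case spend bounded; the repair-budget questions under Open problems ask how large that budget can grow before the cascade stops paying for itself.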
Retrieval & memory patterns
- Glossary memory — enforce terminology and unit normalisation.
- Trace memory — attach per-row evidence for audit trails.
- Vector memory — dense retrieval for long-tail entities.
- Selective RAG — route to retrieval only when it beats long-context reading (see the sketch below).
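A minimal sketch of the selective-RAG gate in the last item, assuming hypothetical callables `self_confidence`, `retrieve`, and `answer`; it illustrates the gating idea, not the self-routing RAG implementation.

```python
def answer_with_selective_rag(question,
                              self_confidence,   # question -> float in [0, 1]
                              retrieve,          # question -> (passages, token_cost)
                              answer,            # (question, passages) -> str
                              confidence_gate: float = 0.7,
                              token_budget: int = 4000) -> str:
    if self_confidence(question) >= confidence_gate:
        # Parametric knowledge looks sufficient: skip retrieval, save tokens and latency.
        return answer(question, passages=[])
    passages, token_cost = retrieve(question)
    if token_cost > token_budget:
        # Retrieval would blow the budget: fall back to answering without passages
        # (or abstain, depending on policy).
        return answer(question, passages=[])
    return answer(question, passages=passages)
```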
Anti-patterns to avoid
- Always-on RAG pipelines that ignore latency/price budgets.
- Pipelines sensitive to spurious features without corrective retrieval.
- Uncalibrated retrieval scoring with no abstention channel.
Quantitative frontiers
Pareto frontier schematic — orchestration pushes €/task and seconds/task down while maintaining quality parity.
Decomposition depth vs quality
Decomposition depth widens the win-region of orchestrated 8B stacks over larger monoliths; report paired accuracy with confidence intervals.
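A simple error-cascade model makes the win-region intuition concrete. It rests on illustrative assumptions (independent steps, a verifier that reliably flags failed steps) rather than on results from the cited studies.

```latex
% d      = decomposition depth (number of steps)
% p(C)   = per-step success probability of the orchestrated base model
% r      = repair budget: verified retries allowed per step
% P_mono = end-to-end success probability of the larger monolithic model
\[
  P_{\text{chain}}(d, p) = p^{\,d}
  \qquad \text{(no verification: errors cascade)}
\]
\[
  P_{\text{orch}}(d, p, r) = \bigl(1 - (1 - p)^{\,r+1}\bigr)^{d}
  \qquad \text{(verify-and-repair, } r \text{ retries per step)}
\]
\[
  \text{win region:}\quad P_{\text{orch}}\bigl(d, p(C), r\bigr) \ge P_{\text{mono}}
  \quad \text{at lower €/task.}
\]
```

Deeper decomposition multiplies more factors but simplifies each step, so the win-region hinges on how quickly p(C) approaches 1 as steps get simpler; the paired-accuracy plots with confidence intervals locate that boundary empirically.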
Benchmark summary
| Task class | Models or stack | Quality / win-rate | Variance | Latency | €/task | Tokens/task | Source |
|---|---|---|---|---|---|---|---|
| Multi-LLM routing | RouterBench routers vs single LLMs | Comparable accuracy; 2–5× cost spread | — | — | varies 2–5× | — | RouterBench (Mar 2024) |
| API cascade | FrugalGPT cascade → GPT-4 | Matches GPT-4; −98% cost | — | — | −98% | — | TMLR (Dec 2024) |
| Math reasoning | Llemma-7B + tree search vs Llemma-34B | 7B + search > 34B under matched FLOPs | — | ↑ (budgeted) | — | ↑ (extra samples) | ICLR (Apr 2025) |
| Latency optimisation | Draft-&-Verify, cascade-speculative | Comparable quality | — | ≈2–3× faster | ↓ | — | ACL & NeurIPS (2024) |
| Mixture-of-agents | Open-model ensembles vs GPT-4 Omni | 65.1% vs 57.5% judged win-rate | — | — | — | — | arXiv (Jun 2024); ICLR (Jan 2025) |
| Selective retrieval | Self-routing RAG | +5.1 pp accuracy; −29% retrievals | — | ↓ | ↓ | ↓ context | arXiv (Apr 2025) |
Component ablations
| Component | Swap / ablation | ΔQuality | ΔVariance | Cost / latency impact | Source |
|---|---|---|---|---|---|
| Multi-sample (k) | 1 → k vote | ↑ (task-dependent) | ↓ | + tokens, + seconds | ICLR (Apr 2025) |
| Early-stop self-consistency | Off → on | ≈ quality | ≈ | −34–84% samples | Findings-ACL (Nov 2024) |
| Verifier choice | ORM → PRM/OVM | ↑ accuracy | ↓ | − samples | NeurIPS & ACL (2024–2025) |
| Conformal filter | Off → on | Coverage guarantee | ↓ FP variance | + small seconds | NeurIPS (Oct 2024) |
| Selective RAG | Always → gated | ↑ accuracy | — | ↓ tokens | arXiv (Mar–Jun 2024) |
| Speculative decode | Disable → enable | ≈ quality | — | ≈2–3× faster | ACL & NeurIPS (2024) |
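As a concrete reading of the early-stop row above, a minimal sketch of early-stopping self-consistency using a generic vote-margin rule; the Findings-ACL method has its own stopping criterion, so treat this as an assumption-laden illustration.

```python
from collections import Counter

def early_stop_self_consistency(sample_answer,        # callable: () -> str, one model sample
                                max_samples: int = 16,
                                min_samples: int = 3,
                                stop_vote_share: float = 0.6) -> str:
    """Draw samples one at a time; stop once the leading answer holds a clear
    majority, otherwise fall back to a plain vote over max_samples."""
    votes = Counter()
    for drawn in range(1, max_samples + 1):
        votes[sample_answer()] += 1
        top_answer, top_count = votes.most_common(1)[0]
        if drawn >= min_samples and top_count / drawn >= stop_vote_share:
            return top_answer                          # early exit saves samples
    return votes.most_common(1)[0][0]
```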
Operational guidance
Variance control
Seed variance on reasoning benchmarks is high; run ≥30 seeds and report confidence intervals. Combine self-consistency with verifier-guided re-ranking to stabilise acceptance. Track p95/p99 latency when speculative decoding is enabled so orchestration overhead does not erode service-level objectives.
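A minimal reporting helper matching this protocol, with assumed inputs (per-seed accuracies and per-request latencies in seconds); a bootstrap interval would be the safer choice if fewer than ~30 seeds are available.

```python
import statistics

def summarise_runs(accuracies: list[float], latencies_s: list[float]) -> dict:
    """Summarise seeded runs: mean accuracy with a ~95% CI plus tail latency."""
    n = len(accuracies)
    mean_acc = statistics.mean(accuracies)
    # Normal-approximation 95% interval across seeds.
    half_width = 1.96 * statistics.stdev(accuracies) / n ** 0.5
    ordered = sorted(latencies_s)
    m = len(ordered)
    return {
        "accuracy_mean": mean_acc,
        "accuracy_ci95": (mean_acc - half_width, mean_acc + half_width),
        "latency_p95_s": ordered[min(m - 1, int(0.95 * m))],
        "latency_p99_s": ordered[min(m - 1, int(0.99 * m))],
    }
```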
Publishing posture
- Publish cost, latency, and win-rate deltas with confidence bounds; surface negative results.
- Package RouterBench-style CSVs so partners can re-plot cost-quality frontiers.
- Report repair budgets (max retries) and abstention policies to keep cascades predictable.
Open problems
- Quantify repair budgets that keep cascades efficient without runaway retries.
- Multilingual calibration for routers and verifiers under conformal guarantees.
- Formal win-region characterisation for decomposition depth × capability (d, C).
- Robust ranking when verifiers are imperfect at scale.
- Programmatic guarantees for selective retrieval triggering.
Product integration
PageMind and Emplo use glossary memories, trace memories, and selective retrieval to attach evidence to every output. Verifier stacks gate publish decisions; conformal filters govern abstention; regression monitors with WECO rules trigger rollback. Cost savings accrue through cascades that pick cheaper models for simple attributes while escalating to expensive models only where acceptance targets demand it.
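A minimal sketch of a WECO-style regression check of the kind described above, with hypothetical baseline numbers; it implements only two of the classic Western Electric rules and is not the PageMind or Emplo monitor.

```python
def weco_alarm(metric_history: list[float], baseline_mean: float, sigma: float) -> bool:
    """Two classic Western Electric checks on a rolling quality metric
    (e.g. per-batch acceptance rate): Rule 1, latest point beyond 3 sigma;
    Rule 4, eight consecutive points on the same side of the baseline mean."""
    if not metric_history:
        return False
    if abs(metric_history[-1] - baseline_mean) > 3 * sigma:             # Rule 1
        return True
    last8 = metric_history[-8:]
    if len(last8) == 8 and (all(v < baseline_mean for v in last8)
                            or all(v > baseline_mean for v in last8)):  # Rule 4
        return True
    return False

# Hypothetical usage: trigger rollback when the acceptance rate drifts.
# if weco_alarm(acceptance_rates, baseline_mean=0.92, sigma=0.015):
#     rollback_to_previous_release()
```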
Data vintage: Oct 2025 · Last updated 01 Oct 2025
