Research program
Agentic Decision Systems — from assist to autonomous execution
Agents are decision systems: they plan, act, verify, repair, and resume under explicit autonomy gates, oversight-minute budgets, and audit trails.
Data vintage: Oct 2025
Scope at a glance
- Loop
- Plan → Act → Verify → Repair → Resume under audit.
- Gates
- Decision accuracy, plan minimality, repair rate, escalation %, oversight minutes/100 tasks.
- Products
- PageMind & Emplo implement numeric autonomy ladders, drift alarms, and SAGA rollbacks.
Abstract
Enterprise agents succeed when treated as decision systems with world models, plans, verifiers, repair loops, and auditable traces. Autonomy scales up a ladder—Assist → Approve → Auto-with-review → Auto—bounded by gates on accuracy, oversight minutes, escalation rate, unit cost, and incidents avoided. Production deployments at Klarna, Intercom, Walmart, LILT, and others show "Auto-with-review" is commercially viable for bounded workloads, while compliance-heavy flows remain at "Approve."
Definition: the decision-system stack
- World model: structured state over entities, policies, constraints, and affordances.
- Planner: produces minimal, constraint-satisfying plans (graph of tool calls/sub-goals).
- Actor: executes schema-checked tool calls or controlled computer-use with rate limits.
- Verifier: checks pre/post conditions; supports LTL/model checking for compliance.
- Memory: episodic traces with drift detection and freshness metadata.
- Loop: plan → act → verify → repair → resume with escalation paths and rollback gates.
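This loop composes as plain control flow. Below is a minimal sketch, assuming hypothetical `plan`, `execute`, `verify`, and `repair` callables; the `Step` shape, the names, and the repair budget are illustrative, not a product API.

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Step:
    tool: str
    params: dict
    done: bool = False

def run_episode(goal: str,
                plan: Callable[[str], list],
                execute: Callable[[Step], dict],
                verify: Callable[[Step, dict], bool],
                repair: Callable[[Step, dict], Optional[Step]],
                max_repairs: int = 3) -> str:
    """Plan -> act -> verify -> repair -> resume; escalate when repairs run out."""
    for step in plan(goal):
        budget = max_repairs
        while not step.done:
            post_state = execute(step)            # schema-checked tool call
            if verify(step, post_state):          # pre/post-condition check
                step.done = True                  # resume with the next step
            elif budget > 0:
                budget -= 1
                fixed = repair(step, post_state)  # amended or compensating step
                if fixed is None:
                    return "escalate"             # repair gave up: human takes over
                step = fixed
            else:
                return "escalate"                 # repair budget exhausted
    return "commit"                               # all post-conditions held
```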
Deployments & pilots
| Task class | Current autonomy | Decision accuracy / service KPIs | Oversight / escalation | Source |
|---|---|---|---|---|
| Support triage/resolution (retail fintech) | Auto-with-review | CSAT parity; repeat inquiries −25%; TTR < 2 min; ~⅔ chats handled | ≈33% escalated | Klarna (Aug 2024) |
| Support triage/resolution (HR SaaS) | Auto-with-review | AI resolution 82%; CSAT 85–90% | ≈11% escalated | Intercom Fin (Oct 2024) |
| IT support triage | Approve | 53% deflection; average resolution time −26.63% | 47% not deflected | Freshservice (2024 aggregate) |
| Catalog ops attribute extraction | Auto-with-review | F1 95.6–97.9; online CTR +2.16%; ATC +1.42%; GMV +0.38% | Exception sampling only | Walmart (May 2024) |
| Localization (enterprise) | Approve | +17.5% accuracy; −20% cost | Editor time proxy ↓ | LILT × Miro (2024) |
Reality check: frontier models still underperform humans on WebArena (~14% agent vs. 78% human success), WorkArena++ (~2% vs. 94%), and OSWorld (~29–38% success), so we reserve "Auto" for constrained surfaces where verifiers and guardrails close the known failure modes.
Benchmark landscape
| Suite | Task type | Input / modality | Eval metrics | Strengths | Caveats |
|---|---|---|---|---|---|
| WebArena | Realistic web tasks (e-comm, forum, CMS, dev) | Browser control; text + vision; tool APIs | Task success, step accuracy | Execution-based; support-like web ops | Early agents far below human; limited auth flows |
| WorkArena++ | Office/enterprise multi-app tasks | Browser + SaaS UIs | Success, efficiency | Targets business workflows & compositional planning | Very low SOTA success; still simulated |
| OSWorld | Real OS apps + web (369 tasks) | Desktop + web | Success, execution-based | Closest to real computer use | Setup complexity; lab sandbox |
| BFCL V4 | Tool/function calling | Structured function calls | Call accuracy, cost, latency | Enterprise tool-use predictivity | Abstracts away UI dynamics |
| GAIA | Real-world Qs requiring tools/browse | Tool use + web | Answer accuracy (human 92% vs GPT-4 ~15%) | Stress-tests general assistantship | Not fully execution-based |
Autonomy ladder targets
Numeric gates keep autonomy honest. We publish autonomy requirements per domain so stakeholders know when the ladder can advance and what data support the move.
Catalog ops
- F1 ≥ 95% for Assist → Approve
- F1 ≥ 97% and ≤10 oversight minutes/100 for Approve → Auto-with-review
- F1 ≥ 99% and ≤1% exceptions for Auto-with-review → Auto
Walmart's production paper reports F1 of 95.6–97.9 plus online lifts, which supports tapering oversight at the Auto-with-review rung.
Localization
- Acceptance ≥ 60–70% for Assist → Approve
- Acceptance ≥ 85% and ≤40 minutes/100 for Approve → Auto-with-review
- Auto only for low-risk content with 0 major errors and rollback < 0.5%
LILT customer outcomes (+17.5% accuracy, −20% cost) support the "Approve" rung; high-risk content remains human-first.
Application packets
- Template compliance ≥ 98% for Assist → Approve
- Compliance ≥ 99.5% and ≤30 minutes/100 for Approve → Auto-with-review
- Auto not targeted unless regulator-cleared; if so ≥99.9% compliance
Employment use is high-risk under the EU AI Act, whose triggers demand provable oversight.
Illustrative ladder: oversight minutes fall as decision accuracy and repair rate rise; use as guardrails before graduating autonomy.
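These gates reduce to a pure function over observed metrics, which makes graduation decisions reviewable in code as well as on dashboards. A minimal sketch for the catalog-ops rungs above; the thresholds come from the published gates, while the function name and signature are illustrative.

```python
def catalog_rung(f1: float, oversight_min_per_100: float,
                 exception_rate: float) -> str:
    """Return the highest catalog-ops rung the published gates permit."""
    if f1 >= 0.99 and exception_rate <= 0.01:
        return "auto"
    if f1 >= 0.97 and oversight_min_per_100 <= 10:
        return "auto_with_review"
    if f1 >= 0.95:
        return "approve"
    return "assist"

# Walmart's reported range (F1 95.6-97.9) spans two rungs under these gates
assert catalog_rung(0.956, 12, 0.02) == "approve"
assert catalog_rung(0.979, 8, 0.02) == "auto_with_review"
```

The localization and application-packet ladders follow the same shape with their own thresholds.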
Decision audit anatomy
OpenTelemetry-compatible traces capture every decision. We hash-chain the audit store with write-once retention, logging policy version IDs, redaction policies, and reviewer decisions so regulators and partners can replay and inspect any run. Each trace records:
- Inputs (goal, constraints, policy version, PII flags).
- Plan graph + minimality estimate + LTL spec hash.
- Actions (tool name/version, parameters, idempotency key, pre/post state).
- Verifier outcomes (type checks, post-conditions, groundedness).
- Repairs (diff, compensations, SAGA correlators).
- Oversight minutes, cost, latency, reviewer decisions.
- Immutable outcome log with rollback trace.
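A minimal sketch of the hash-chained, write-once append path, assuming a SHA-256 chain over canonicalized JSON; the field names follow the anatomy above, while the record shape and chaining scheme are illustrative rather than the production store format.

```python
import hashlib
import json
import time

def append_record(chain: list, record: dict) -> dict:
    """Append a decision trace to a hash-chained audit log.

    Each entry commits to its predecessor's hash, so any retroactive edit
    breaks the chain on replay. SHA-256 over sorted JSON is an illustrative
    assumption.
    """
    prev_hash = chain[-1]["hash"] if chain else "0" * 64
    body = {"ts": time.time(), "prev": prev_hash, **record}
    digest = hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()
    body["hash"] = digest
    chain.append(body)
    return body

log: list = []
append_record(log, {"policy_version": "v12", "action": "update_attribute",
                    "verifier": "post_conditions_ok", "reviewer": None})
```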
Decision Audit Flow
Constraints
Every decision references guardrails, policies, and task-level boundaries before execution.
Source trace
Audit logs preserve prompts, retrieved context, tool calls, and remediation steps for each output.
Decision-flow with recovery
- Goal + constraints intake with policy + PII tagging.
- Plan compiled with LTL checks and static analysis.
- Execute actions via tool APIs or controlled computer-use.
- Verify post-conditions; run read-after-write checks.
- Repair or escalate via exception channels when checks fail.
- Commit trace to audit store; sample reviews feed evaluation buffers.
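Step 5's repairs lean on the SAGA correlators logged in the trace: each forward action carries a semantic undo, and a failed post-condition unwinds completed steps in reverse. A minimal sketch, with all three callables as illustrative stand-ins:

```python
from typing import Callable, Sequence

def run_saga(actions: Sequence[Callable[[], dict]],
             compensations: Sequence[Callable[[], None]],
             verify: Callable[[int, dict], bool]) -> str:
    """Run ordered actions; on a failed post-condition, execute the
    compensations for completed steps in reverse order (SAGA rollback)."""
    completed: list = []
    for i, act in enumerate(actions):
        result = act()                 # forward action (tool call)
        if not verify(i, result):      # read-after-write / post-condition check
            for j in reversed(completed):
                compensations[j]()     # semantic undo, newest first
            return "rolled_back"
        completed.append(i)
    return "committed"
```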
Integration with inAi products
PageMind operates at Auto-with-review for catalog batches and low-risk localization while staying at Approve for high-impact support actions. Acceptance thresholds (accuracy ≥ 98%, minimality ≥ 80%, repair ≥ 70%) keep oversight under 15 minutes/100 tasks. Emplo maintains NYC Local Law 144 audit artifacts—data lineage, impact-ratio dashboards, annual audits, and notice workflows—and uses WECO rules to halt automations on adverse-impact drift.
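A minimal sketch of those acceptance thresholds as a declarative gate; the numbers come from the figures above, while the dict shape and check function are illustrative assumptions.

```python
# PageMind acceptance thresholds as a declarative gate; numbers per the
# paragraph above, dict shape and helper are illustrative assumptions.
PAGEMIND_GATES = {
    "accuracy": 0.98,
    "minimality": 0.80,
    "repair_rate": 0.70,
    "max_oversight_min_per_100": 15,
}

def within_gates(metrics: dict) -> bool:
    """True when a batch's metrics satisfy every acceptance threshold."""
    return (metrics["accuracy"] >= PAGEMIND_GATES["accuracy"]
            and metrics["minimality"] >= PAGEMIND_GATES["minimality"]
            and metrics["repair_rate"] >= PAGEMIND_GATES["repair_rate"]
            and metrics["oversight_min_per_100"]
                <= PAGEMIND_GATES["max_oversight_min_per_100"])
```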
Drift & recovery
- p-chart on violation rate per 1,000 outputs with WECO Rules 1–4.
- Traffic-aligned acceptance sampling and low-confidence human review.
- Regression detection on catalog coverage/pricing with near-instant rollback.
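A minimal sketch of the p-chart alarm under WECO Rule 1 (one subgroup beyond the 3-sigma limit); the baseline violation rate and subgroup size are illustrative assumptions, not production parameters.

```python
import math

def p_chart_alarm(violations: int, n: int = 1000, p_bar: float = 0.004) -> bool:
    """WECO Rule 1 on a p-chart: alarm when the subgroup violation rate
    exceeds the 3-sigma upper control limit.

    p_bar (baseline violation rate) and the subgroup size are illustrative;
    in production they come from a trailing window of audited outputs.
    """
    sigma = math.sqrt(p_bar * (1 - p_bar) / n)
    ucl = p_bar + 3 * sigma
    return violations / n > ucl

# e.g. 11 violations in 1,000 outputs against a 0.4% baseline trips the alarm
assert p_chart_alarm(11) is True
```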
Auto-repair playbook
- Rollback to previous known-good config or model snapshot.
- Quarantine bad segments and re-run with stricter guardrails.
- Raise verifier thresholds or add dual-source corroboration.
- Intensify sampling temporarily until metrics stabilize.
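A minimal sketch of the playbook as an ordered escalation that runs until drift metrics re-enter control limits; the step identifiers and the `healthy` probe are illustrative assumptions.

```python
from typing import Callable

# Remediations in the order listed above, applied one at a time until the
# p-chart returns inside control limits. Step names are illustrative.
PLAYBOOK = [
    "rollback_to_last_good_snapshot",
    "quarantine_and_rerun_with_strict_guardrails",
    "raise_verifier_thresholds",
    "intensify_sampling",
]

def run_playbook(apply_step: Callable[[str], None],
                 healthy: Callable[[], bool]) -> str:
    """Apply remediations in order; report the one that restored health."""
    for step in PLAYBOOK:
        apply_step(step)              # execute the remediation
        if healthy():                 # e.g. p_chart_alarm() no longer firing
            return step
    return "escalate_to_human"        # playbook exhausted: page an operator
```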
Data vintage: Oct 2025 · Last updated 01 Oct 2025
