
Research program

AI and Business Operations — where the work actually moves

Quantifies the migration of autonomy from assist → supervise → run across retail, candidate operations, and support. Focus metrics: €/accepted item, oversight minutes, constraint-violation rates, and data-protection posture.

Data vintage: Oct 2025

Scope at a glance

Domains
Retail/catalog, candidate operations, support operations.
Economics
€/accepted item, throughput at fixed budget, oversight minutes/100 tasks.
Guardrails
Constraint-violation rate, data-protection posture, rollback criteria.

Abstract

Real work is moving. Retail attribute extraction, catalog publishing, guardrailed support automation, and candidate operations now run at automation rungs chosen to balance economics and risk. We publish acceptance gates, sampling plans, drift monitors, rollback criteria, and data-protection posture so partners know when autonomy is justified and what keeps it safe.

Assist → Supervise → Run

Move up the ladder when cost per accepted item falls and throughput at fixed budget rises without breaching constraint-violation or oversight budgets. Guardrails (policy checks, glossary enforcement, read-after-write verification) and a zero-retention data posture are mandatory for regulated workloads.
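The promotion rule can be sketched as a single gate. The metric names and budget thresholds below are illustrative placeholders, not the program's published figures:

```python
from dataclasses import dataclass

@dataclass
class RungMetrics:
    cost_per_accepted: float      # € per accepted item
    throughput: float             # accepted items per fixed-budget period
    violation_rate: float         # constraint violations per 1 000 outputs
    oversight_min_per_100: float  # human minutes per 100 tasks

def ready_to_promote(current: RungMetrics, candidate: RungMetrics,
                     max_violations: float = 1.0,
                     oversight_budget: float = 20.0) -> bool:
    """Promote (assist -> supervise -> run) only when unit economics
    improve AND neither the constraint nor the oversight budget is breached."""
    return (candidate.cost_per_accepted < current.cost_per_accepted
            and candidate.throughput > current.throughput
            and candidate.violation_rate <= max_violations
            and candidate.oversight_min_per_100 <= oversight_budget)
```

The `max_violations` and `oversight_budget` defaults are hypothetical; in practice each workload would carry its own budgets.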

Task ladders by domain

| Task | Autonomy (Oct 2025) | Acceptance gate | Unit-economics delta | Status |
| --- | --- | --- | --- | --- |
| Retail — Attribute extraction (simple) | Auto | Precision ≥ 95%, recall ≥ 92%; LLM-judge pass; low-confidence → human | Model cascade saves ~70% cost; hit gates in last sampled lot | Production (Instacart) |
| Retail — Attribute extraction (complex numeric) | Auto-with-review | Dual-source corroboration; tight interval correctness; exception review | Multi-modal improves recall by 10 pp; oversight falls via sampling | Production (Instacart) |
| Retail — Catalog publish | Auto | Zero high-severity violations; WECO signal → rollback | Sub-minute latency with versioned rollback | Production (Uber INCA) |
| Candidate ops — Resume shortlist | Auto-with-review | LL 144 bias audit ≤ 12 months; public summary; candidate notice | Oversight falls via impact-ratio dashboards; legal risk bounded | Production (NYC LL 144 governed) |
| Support — Triage / routing | Auto | Accuracy ≥ 90%; confidence-gated escalations | 120 hours/week saved at Gelato; latency down | Production (Vertex AI) |
| Support — Guarded Q&A | Auto-with-review → Auto | 0 high-severity policy fails; groundedness pass | −90% hallucinations; −99% severe issues | Production (DoorDash) |
| Support — End-to-end chat | Auto (with handoff) | CSAT ≥ human; resolution time ≤ target | 2/3 chats handled; resolution 11 → 2 minutes | Production (Klarna) |

Unit economics visualization

Cost per accepted item — autonomy waterfall

| Rung | Domain | Task | €/accepted item | Notes |
| --- | --- | --- | --- | --- |
| Assist | Retail | Attribute extraction (simple) | €1.00 | Manual ops baseline |
| Supervise | Retail | Attribute extraction (simple) | €0.32 | Cheap model cascade; sampling; low-confidence human review |
| Run | Retail | Attribute extraction (simple) | €0.16 | Auto with monitor; periodic sampling only |
| Run | Support | Triage / routing | €0.20 | Accuracy ≥ 90%; 120 hrs/wk saved |
| Run | Support | End-to-end chat | €0.20 | 2/3 chats handled; 11 → 2 min resolution |
| Supervise | Retail | Catalog publish | €0.18 | Regression detectors; rollback on anomaly |

€/accepted item falls as we move from Assist to Run; the stacked costs in each bar cover model, retrieval, guardrail, oversight, and rework spend.
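The stacked bars reduce to a simple identity: per-attempt spend across the five components, divided by the acceptance rate. A minimal sketch; the component split below is hypothetical, and only the ~€0.32 supervise total comes from the chart:

```python
def cost_per_accepted_item(model: float, retrieval: float, guardrail: float,
                           oversight: float, rework: float,
                           acceptance_rate: float) -> float:
    """Stacked per-attempt spend divided by the fraction of items accepted."""
    assert 0 < acceptance_rate <= 1
    return (model + retrieval + guardrail + oversight + rework) / acceptance_rate

# Hypothetical Supervise-rung split roughly reproducing the €0.32 total.
supervise = cost_per_accepted_item(model=0.10, retrieval=0.04, guardrail=0.03,
                                   oversight=0.09, rework=0.04,
                                   acceptance_rate=0.94)
```

Dividing by the acceptance rate is what makes rework and rejected output show up in the unit price rather than hiding in throughput.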

Domain playbooks

Retail / Catalog

  • Acceptance sampling with attribute-specific precision/recall targets.
  • Low-confidence routing to humans; cheaper models only when attribute gates hold.
  • Glossary enforcement and per-row evidence for audits; EU workloads pinned to EU endpoints with zero data retention.
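The acceptance-sampling bullet can be made concrete as a single-sampling plan search: find the smallest lot sample `n` and acceptance number `c` that hold producer risk at the AQL and consumer risk at the LTPD. A minimal sketch; the AQL/LTPD values in the test are hypothetical, not the program's gates:

```python
from math import comb

def binom_cdf(c: int, n: int, p: float) -> float:
    """P(X <= c) for X ~ Binomial(n, p)."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(c + 1))

def sampling_plan(aql: float, ltpd: float,
                  alpha: float = 0.05, beta: float = 0.10,
                  n_max: int = 2000):
    """Smallest single-sampling plan (n, c): accept the lot when at most c of
    n sampled items are defective. Producer risk alpha at the AQL, consumer
    risk beta at the LTPD (classic attribute acceptance sampling)."""
    for n in range(1, n_max + 1):
        for c in range(n + 1):
            if binom_cdf(c, n, ltpd) > beta:
                break  # raising c only raises the accept probability further
            if binom_cdf(c, n, aql) >= 1 - alpha:
                return n, c
    raise ValueError("no plan within n_max")
```

In production you would read (n, c) from standard tables instead, but the search makes the producer/consumer trade-off explicit.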

Candidate Operations

  • Embed LL 144 artefacts: data lineage, impact-ratio dashboards, independent audits, public notices.
  • Use p-charts on adverse-impact rate; WECO rules pause automations on drift.
  • Log model, prompt, and features per decision for audit replay.

Support Operations

  • Confidence routing plus acceptance sampling; guardrails escalate severe policy or grounding failures.
  • Cheap shallow checks first, LLM-judge only on failures (DoorDash pattern).
  • Define defect taxonomy and fix at cheapest stage (retrieval, prompt, constraint, glossary, model).
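The "cheap shallow checks first" bullet can be sketched as a two-stage gate in which the expensive judge is invoked only on shallow failures. All callables and gate contents here are illustrative:

```python
def guardrail_first(answer, cheap_checks, llm_judge) -> bool:
    """Run cheap deterministic checks first; pay for the LLM judge only
    when a shallow check fails (the pattern described above)."""
    if all(check(answer) for check in cheap_checks):
        return True               # every shallow gate passed; judge never called
    return llm_judge(answer)      # near-fail: expensive second opinion

# Hypothetical shallow gates: a length cap and a banned-term scan.
cheap = [lambda a: len(a) < 2000,
         lambda a: "password" not in a.lower()]
```

Because most outputs pass the shallow gates, judge spend concentrates on the small near-fail slice, which is where the cost saving comes from.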

Migration checklist

  1. Instrument €/accepted item and throughput at fixed budget before moving up the ladder.
  2. Use confidence routing so only low-confidence items and acceptance samples reach humans.
  3. Adopt NIST attribute acceptance sampling (producer/consumer risk) aligned to workload tolerance.
  4. Run guardrail-first evaluation pipelines, reserving the LLM-judge for near-fails to cut evaluation cost.
  5. Publish failure taxonomy and fix loops so issues close in the cheapest stage.
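Item 2's confidence routing, plus the acceptance sample from item 3, can be sketched as one routing function. The threshold and sample rate are placeholders, not calibrated values:

```python
import random

def route(confidence: float,
          review_threshold: float = 0.85,
          sample_rate: float = 0.02,
          rng=None) -> str:
    """Send low-confidence items to humans; divert a small random
    acceptance sample of high-confidence items for auditing."""
    rng = rng or random.Random()
    if confidence < review_threshold:
        return "human_review"
    if rng.random() < sample_rate:
        return "acceptance_sample"
    return "auto_accept"
```

With this shape, humans see exactly two streams: items the model is unsure about, and a thin audit sample that keeps the acceptance gates honest.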

Drift & recovery

  • p-chart on violation rate per 1 000 outputs with WECO rules (Rules 1–4).
  • Traffic-aligned acceptance sampling and low-confidence human review (Instacart PARSE).
  • Regression detection on catalog coverage/pricing/availability with near-instant rollback (Uber INCA).
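The p-chart monitoring in the first bullet can be sketched as a one-sided WECO scan (one-sided because only upward drift in the violation rate matters here); the baseline rate in the test is illustrative:

```python
from math import sqrt

def weco_alarms(counts, lot_size, p_bar):
    """Scan a p-chart of violation counts per lot for WECO Rule 1-4 signals.
    Upper side only; the first matching rule wins per point."""
    sigma = sqrt(p_bar * (1 - p_bar) / lot_size)
    p = [c / lot_size for c in counts]
    alarms = []
    for i, pi in enumerate(p):
        if pi > p_bar + 3 * sigma:                                    # Rule 1
            alarms.append((i, "rule1"))
        elif i >= 2 and sum(x > p_bar + 2 * sigma for x in p[i-2:i+1]) >= 2:
            alarms.append((i, "rule2"))                               # 2 of 3
        elif i >= 4 and sum(x > p_bar + sigma for x in p[i-4:i+1]) >= 4:
            alarms.append((i, "rule3"))                               # 4 of 5
        elif i >= 7 and all(x > p_bar for x in p[i-7:i+1]):
            alarms.append((i, "rule4"))                               # 8 in a row
    return alarms
```

Fed a series like the September spike (4, 6, 4, 18, 23, 7, 5 violations per 1 000), the scan flags the two spike weeks and the trailing 2-of-3 window.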

Auto-recovery playbook

  1. Rollback to previous known-good configuration or model snapshot.
  2. Quarantine bad segments (taxonomy node, language, retailer) and re-run with stricter guardrails.
  3. Raise verifier thresholds or add dual-source corroboration; adjust cascades.
  4. Increase sampling temporarily until metrics stabilise, then return to baseline plan.
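The four steps can be sketched as an ordered runbook that stops as soon as the chart is back in band; every name and number here is an illustrative skeleton, not the production tooling:

```python
def run_recovery(back_in_band, steps):
    """Execute playbook steps in order, stopping once the violation
    p-chart is back within its control band."""
    executed = []
    for name, action in steps:
        action()
        executed.append(name)
        if back_in_band():
            break
    return executed

# Hypothetical drill: rollback alone is not enough; quarantine closes it out.
state = {"violations_per_1000": 23.0}
steps = [
    ("rollback",   lambda: state.update(violations_per_1000=12.0)),
    ("quarantine", lambda: state.update(violations_per_1000=6.0)),
    ("tighten",    lambda: None),
    ("oversample", lambda: None),
]
actions = run_recovery(lambda: state["violations_per_1000"] <= 7.0, steps)
```

Encoding the ladder as data rather than branching logic makes it easy to audit which steps actually ran during an incident.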

WECO p-chart — high-severity violations per 1 000 outputs

| Week (2025) | Violations per 1 000 outputs | Annotation |
| --- | --- | --- |
| 09-01 | 4.0 | baseline |
| 09-08 | 6.0 | baseline |
| 09-15 | 4.0 | baseline |
| 09-22 | 18.0 | WECO Rule 2 trigger |
| 09-23 | 23.0 | Rollback + sampling increase |
| 09-30 | 7.0 | Guardrail retuned |
| 10-07 | 5.0 | Return to steady state |
WECO p-chart catches drift spikes; rollback and sampling increase bring violations back within band.

Data-protection posture (EU workloads)

  • Regional processing: pin jobs to EU endpoints so all ML processing stays in-region.
  • Zero-data-retention: configure Vertex AI and similar platforms not to retain prompts/outputs.
  • EU AI Act posture: maintain technical documentation, training data sources, evaluation results, and risk controls for audits.

PageMind and Emplo ship per-row source trace, action logs, approver identity, prompt/model versioning, region pinning, and privacy posture exports to satisfy LL 144, EU AI Act, and enterprise compliance reviews.

Data vintage: Oct 2025 · Last updated 01 Oct 2025