Research program
AI and Business Operations — where the work actually moves
Quantifies autonomy migration from assist → supervise → run across retail, candidate operations, and support. Focuses on €/accepted item, oversight minutes, constraint-violation rates, and data protection.
Data vintage: Oct 2025
Scope at a glance
- Domains: retail/catalog, candidate operations, support operations.
- Economics: €/accepted item, throughput at fixed budget, oversight minutes/100 tasks.
- Guardrails: constraint-violation rate, data-protection posture, rollback criteria.
Abstract
Real work is moving up the autonomy ladder. Retail attribute extraction, catalog publishing, guardrailed support automation, and candidate operations now run on automation rungs that balance economics and risk. We publish acceptance gates, sampling plans, drift monitors, rollback criteria, and data-protection posture so partners know when autonomy is justified and what keeps it safe.
Assist → Supervise → Run
Move up the ladder when cost per accepted item falls and throughput at fixed budget rises without violating constraint rates or oversight budgets. Guardrails (policy checks, glossary enforcement, read-after-write) and zero-retention data posture are mandatory in regulated workloads.
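The promotion rule above can be sketched as a single gate check. A minimal sketch; the metric names and thresholds are illustrative, not the program's published values:

```python
from dataclasses import dataclass

@dataclass
class LadderMetrics:
    cost_per_accepted_eur: float   # €/accepted item at this rung
    throughput_at_budget: float    # accepted items per day at a fixed € budget
    violation_rate: float          # constraint violations per 1 000 outputs
    oversight_min_per_100: float   # human oversight minutes per 100 tasks

def may_promote(current: LadderMetrics, candidate: LadderMetrics,
                max_violation_rate: float, oversight_budget: float) -> bool:
    """Move up the ladder only if economics improve AND guardrails hold."""
    return (candidate.cost_per_accepted_eur < current.cost_per_accepted_eur
            and candidate.throughput_at_budget > current.throughput_at_budget
            and candidate.violation_rate <= max_violation_rate
            and candidate.oversight_min_per_100 <= oversight_budget)
```

Both sides of the rule matter: cheaper-but-dirtier output (better economics, worse violation rate) must not be promoted.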
Task ladders by domain
| Task | Autonomy (Oct 2025) | Acceptance gate | Unit-economics delta | Status |
|---|---|---|---|---|
| Retail — Attribute extraction (simple) | Auto | Precision ≥ 95%, recall ≥ 92%; LLM-judge pass; low-confidence → human | Model cascade saves ~70% cost; hit gates in last sampled lot | Production (Instacart) |
| Retail — Attribute extraction (complex numeric) | Auto-with-review | Dual-source corroboration; tight interval correctness; exception review | Multi-modal improves recall by 10 pp; oversight falls via sampling | Production (Instacart) |
| Retail — Catalog publish | Auto | Zero high-severity violations; WECO signal → rollback | Sub-minute latency with versioned rollback | Production (Uber INCA) |
| Candidate ops — Resume shortlist | Auto-with-review | LL 144 bias audit ≤ 12 months; public summary; candidate notice | Oversight falls via impact-ratio dashboards; legal risk bounded | Production (NYC LL 144 governed) |
| Support — Triage / routing | Auto | Accuracy ≥ 90%; confidence-gated escalations | 120 hours/week saved at Gelato; latency down | Production (Vertex AI) |
| Support — Guarded Q&A | Auto-with-review → Auto | 0 high-severity policy fails; groundedness pass | −90% hallucinations; −99% severe issues | Production (DoorDash) |
| Support — End-to-end chat | Auto (with handoff) | CSAT ≥ human; resolution time ≤ target | 2/3 chats handled; resolution 11 → 2 minutes | Production (Klarna) |
Unit economics visualization
[Chart] Cost per accepted item — autonomy waterfall. Panels: Assist · Retail (attribute extraction, simple); Supervise · Retail (attribute extraction, simple); Run · Retail (attribute extraction, simple); Run · Support (triage/routing); Run · Support (end-to-end chat); Supervise · Retail (catalog publish).
Domain playbooks
Retail / Catalog
- Acceptance sampling with attribute-specific precision/recall targets.
- Low-confidence routing to humans; cheaper models only when attribute gates hold.
- Glossary enforcement and per-row evidence for audits; EU workloads pinned to EU endpoints with zero data retention.
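The low-confidence routing plus acceptance-sampling split above can be sketched as follows; the threshold, sample rate, and field names are illustrative, not Instacart's published configuration:

```python
import random

def route(items, conf_threshold=0.85, sample_rate=0.02, rng=None):
    """Split extractions into human-review vs auto-accept lanes.

    Low-confidence rows always go to humans; a small random acceptance
    sample of high-confidence rows is also pulled so the precision/recall
    gates can be measured on production traffic.
    """
    rng = rng or random.Random(0)
    human, auto = [], []
    for item in items:
        if item["confidence"] < conf_threshold or rng.random() < sample_rate:
            human.append(item)
        else:
            auto.append(item)
    return human, auto
```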
Candidate Operations
- Embed LL 144 artefacts: data lineage, impact-ratio dashboards, independent audits, public notices.
- Use p-charts on adverse-impact rate; WECO rules pause automations on drift.
- Log model, prompt, and features per decision for audit replay.
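The impact-ratio dashboards can be fed by a calculation along these lines — each group's selection rate over the most-selected group's rate, the common adverse-impact convention. A sketch with illustrative group labels; LL 144 specifies the exact categories and intersections to report:

```python
def impact_ratios(selected_by_group, total_by_group):
    """Impact ratio per group: selection rate divided by the highest
    group's selection rate. Ratios well below 1.0 flag adverse impact."""
    rates = {g: selected_by_group[g] / total_by_group[g] for g in total_by_group}
    top = max(rates.values())
    return {g: r / top for g, r in rates.items()}
```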
Support Operations
- Confidence routing plus acceptance sampling; guardrails escalate severe policy or grounding failures.
- Cheap shallow checks first, LLM-judge only on failures (DoorDash pattern).
- Define defect taxonomy and fix at cheapest stage (retrieval, prompt, constraint, glossary, model).
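The guardrail-first pattern — cheap shallow checks first, the expensive LLM judge reserved for failures — can be sketched as below. `shallow_checks` and `llm_judge` are placeholder callables, not a real API:

```python
def evaluate(answer, shallow_checks, llm_judge):
    """Guardrail-first evaluation: run cheap deterministic checks first,
    and invoke the costly LLM judge only when a shallow check fails."""
    failures = [name for name, check in shallow_checks if not check(answer)]
    if not failures:
        return {"verdict": "pass", "judged": False}
    verdict = llm_judge(answer, failures)  # expensive call, reserved for near-fails
    return {"verdict": verdict, "judged": True, "failed_checks": failures}
```

The design choice is economic: most outputs pass the shallow layer, so judge spend concentrates on the small slice that needs it.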
Migration checklist
- Instrument €/accepted item and throughput at fixed budget before moving up the ladder.
- Use confidence routing so only low-confidence items and acceptance samples reach humans.
- Adopt NIST attribute acceptance sampling (producer/consumer risk) aligned to workload tolerance.
- Guardrail-first evaluation pipelines reduce cost by reserving LLM-judge for near-fails.
- Publish failure taxonomy and fix loops so issues close in the cheapest stage.
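A single-sampling plan under producer/consumer risk can be derived by direct binomial search, as sketched below — a simplified stand-in for published sampling tables, with `aql` and `ltpd` as fraction defective:

```python
from math import comb

def binom_cdf(c, n, p):
    """P(X <= c) for X ~ Binomial(n, p)."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(c + 1))

def sampling_plan(aql, ltpd, alpha=0.05, beta=0.10, n_max=2000):
    """Smallest single-sampling plan (n, c): inspect n items, accept the lot
    when defects <= c.  Producer risk: P(reject | p = AQL) <= alpha.
    Consumer risk: P(accept | p = LTPD) <= beta."""
    for n in range(1, n_max + 1):
        for c in range(n + 1):
            if (binom_cdf(c, n, aql) >= 1 - alpha
                    and binom_cdf(c, n, ltpd) <= beta):
                return n, c
    raise ValueError("no plan within n_max")
```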
Drift & recovery
- p-chart on violation rate per 1 000 outputs with WECO rules (Rules 1–4).
- Traffic-aligned acceptance sampling and low-confidence human review (Instacart PARSE).
- Regression detection on catalog coverage/pricing/availability with near-instant rollback (Uber INCA).
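The WECO rules can be sketched over standardized p-chart points, where each point is z = (p̂ − p̄)/√(p̄(1 − p̄)/n). The sketch below checks the rising side only, which is what matters on a violation-rate chart:

```python
def weco_signal(zs):
    """Western Electric rules on standardized p-chart points (z-scores).
    Returns the first rule (1-4) that fires, else None."""
    for i, z in enumerate(zs):
        if abs(z) > 3:                                        # Rule 1: beyond 3 sigma
            return 1
        w = zs[max(0, i - 2):i + 1]
        if len(w) == 3 and sum(1 for v in w if v > 2) >= 2:   # Rule 2: 2 of 3 beyond 2 sigma
            return 2
        w = zs[max(0, i - 4):i + 1]
        if len(w) == 5 and sum(1 for v in w if v > 1) >= 4:   # Rule 3: 4 of 5 beyond 1 sigma
            return 3
        w = zs[max(0, i - 7):i + 1]
        if len(w) == 8 and all(v > 0 for v in w):             # Rule 4: 8 on one side of center
            return 4
    return None
```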
Auto-recovery playbook
- Rollback to previous known-good configuration or model snapshot.
- Quarantine bad segments (taxonomy node, language, retailer) and re-run with stricter guardrails.
- Raise verifier thresholds or add dual-source corroboration; adjust cascades.
- Increase sampling temporarily until metrics stabilise, then return to baseline plan.
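The playbook above can be sketched as a signal-to-actions mapping; the severity labels, snapshot name, and threshold/sampling bumps are illustrative:

```python
def recovery_actions(signal):
    """Map a drift signal to the playbook's ordered actions."""
    actions = []
    if signal["severity"] == "high":
        actions.append(("rollback", signal["last_good_snapshot"]))
    if signal.get("bad_segments"):
        actions.append(("quarantine", signal["bad_segments"]))
        actions.append(("rerun_with_stricter_guardrails", signal["bad_segments"]))
    actions.append(("raise_verifier_threshold", 0.05))  # illustrative bump
    actions.append(("boost_sampling", 2.0))             # 2x sampling until stable
    return actions
```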
[Chart] WECO p-chart — high-severity violations per 1 000 outputs.
Data-protection posture (EU workloads)
- Regional ML processing: pin jobs to EU endpoints; ensure ML processing stays in-region.
- Zero-data-retention: configure Vertex AI and similar platforms not to retain prompts/outputs.
- EU AI Act posture: maintain technical documentation, training data sources, evaluation results, and risk controls for audits.
- PageMind and Emplo ship per-row source traces, action logs, approver identity, prompt/model versioning, region pinning, and privacy-posture exports to satisfy LL 144, the EU AI Act, and enterprise compliance reviews.
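A posture check along these lines can gate deployments before they ship. The field names below are an illustrative schema, not a real platform's configuration API:

```python
REQUIRED_EU_POSTURE = {
    "region": lambda v: v.startswith("europe-"),   # e.g. an europe-west4 endpoint
    "data_retention": lambda v: v == "zero",
    "audit_docs": lambda v: {"technical_documentation",
                             "training_data_sources",
                             "evaluation_results"} <= set(v),
}

def posture_gaps(config):
    """Return the EU-workload posture requirements a deployment config fails."""
    return [k for k, ok in REQUIRED_EU_POSTURE.items()
            if k not in config or not ok(config[k])]
```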
Last updated 01 Oct 2025
