
Research program

AI for Knowledge Creation — discovery, synthesis, replication

Discovery, synthesis, and replication tooling that keeps signal above the noise. Provenance-first graphs, verification gates, and conformal publish/hold decisions keep outputs auditable.

Data vintage: Oct 2025

Scope at a glance

Stance
Publish proof. Hold when uncertainty exceeds conformal thresholds.
Focus
Provenance graphs, novelty scoring, contradiction detection, replication packs, uncertainty gates.
Deliverable
PageMind & Emplo ship provenance-backed synthesis with replication packs and abstention logs.

Abstract

Open indexes now surface hundreds of millions of research objects. Without provenance and verification, LLM-assisted reading amplifies both findable science and slop. We integrate provenance graphs, citation-intent gates, contradiction sentinels, and replication packs so outputs ship only when conformal thresholds bound residual risk; otherwise we hold and escalate.

State of the field (2024–Oct 2025)

1.1 Provenance & citation integrity

  • Provenance graphs spanning Works, Authors, Venues, References, Claims, Datasets, and Code nodes, enriched with citation-intent labels.
  • Citation integrity debt tracked via DOI resolvability, title/year/venue matching, and scite role frequencies (see the resolver sketch after this list).
  • Citation-intent models (FINECITE) outperform zero-shot LLMs; we gate references and label intent mixes before drafting.
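
A minimal sketch of the DOI-resolvability and metadata-match check behind the integrity-debt score, using the public Crossref REST API. The `Reference` record, the similarity measure, and the weights are illustrative assumptions, not the production scorer; scite role frequencies are folded in separately.

```python
import requests
from dataclasses import dataclass
from difflib import SequenceMatcher

@dataclass
class Reference:
    doi: str
    title: str
    year: int
    venue: str

def crossref_metadata(doi: str) -> dict | None:
    """Resolve a DOI against the public Crossref REST API; None if unresolvable."""
    resp = requests.get(f"https://api.crossref.org/works/{doi}", timeout=10)
    return resp.json()["message"] if resp.status_code == 200 else None

def integrity_score(ref: Reference) -> float:
    """Score one reference in [0, 1] from DOI resolvability and metadata match."""
    meta = crossref_metadata(ref.doi)
    if meta is None:
        return 0.0  # an unresolvable DOI is maximal integrity debt
    title = (meta.get("title") or [""])[0]
    venue = (meta.get("container-title") or [""])[0]
    year = (meta.get("issued", {}).get("date-parts") or [[None]])[0][0]
    title_sim = SequenceMatcher(None, ref.title.lower(), title.lower()).ratio()
    venue_sim = SequenceMatcher(None, ref.venue.lower(), venue.lower()).ratio()
    year_ok = 1.0 if year == ref.year else 0.0
    # Illustrative weights; the real scorer also folds in scite role frequencies.
    return 0.5 * title_sim + 0.25 * year_ok + 0.25 * venue_sim
```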

1.2 Novelty estimation

  • SchNovel introduces RAG-Novelty, improving over embedding-only baselines on 15k paper pairs.
  • GraphMind builds hierarchical contextual graphs and constrains novelty scoring with structural neighbours.
  • Bibliographic coupling and co-citation remain first-principles baselines we log alongside neural scores.
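
As a companion to the neural scores, a minimal sketch of the bibliographic-coupling and co-citation baselines we log; `references` is a hypothetical mapping from work ID to the set of works it cites.

```python
from collections import Counter
from itertools import combinations

def bibliographic_coupling(references: dict[str, set[str]]) -> Counter:
    """Count shared references for every paper pair (coupling strength)."""
    coupling = Counter()
    for a, b in combinations(references, 2):
        shared = len(references[a] & references[b])
        if shared:
            coupling[(a, b)] = shared
    return coupling

def co_citation(references: dict[str, set[str]]) -> Counter:
    """Count how often two papers are cited together by the same citing paper."""
    co_cited = Counter()
    for cited in references.values():
        for pair in combinations(sorted(cited), 2):
            co_cited[pair] += 1
    return co_cited

# Low coupling / co-citation density around a candidate is a cheap novelty
# signal logged alongside GraphMind and SchNovel scores.
refs = {"w1": {"r1", "r2"}, "w2": {"r2", "r3"}, "w3": {"r1", "r2"}}
print(bibliographic_coupling(refs)[("w1", "w3")])  # -> 2
```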

1.3 Contradiction detection & verification

  • SPOT (83 papers) shows that paper-level verification remains difficult (precision 6.1%, recall 21.1%).
  • PRISMM-Bench (262 multimodal inconsistencies) spans identify, remedy, and pair-match tasks; the best models reach 26–54% accuracy.
  • CliniFact provides 1,970 clinical claim/evidence pairs for domain-specific contradiction mining.
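
A minimal sketch of a sentence-level contradiction sentinel, assuming an off-the-shelf NLI cross-encoder from Hugging Face (`roberta-large-mnli` as an illustrative choice); paper-level aggregation, the part SPOT shows is hard, is left to the surrounding pipeline.

```python
from transformers import pipeline

# Off-the-shelf NLI cross-encoder; the model choice and threshold are illustrative.
nli = pipeline("text-classification", model="roberta-large-mnli")

def contradicts(claim_a: str, claim_b: str, threshold: float = 0.9) -> bool:
    """Flag a claim pair when the NLI model scores CONTRADICTION above threshold.

    NLI is asymmetric, so both directions are checked; flagged pairs become
    contradiction edges in the provenance graph and go to reviewers, never
    auto-resolved.
    """
    for premise, hypothesis in ((claim_a, claim_b), (claim_b, claim_a)):
        out = nli({"text": premise, "text_pair": hypothesis})
        result = out[0] if isinstance(out, list) else out
        if result["label"] == "CONTRADICTION" and result["score"] >= threshold:
            return True
    return False
```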

1.4 Replication workflows & packaging

  • Follow PRISMA 2020 for review documentation and flow diagrams; attach Datasheets and Model Cards to each pack.
  • Target ≤ 50 MB replication packs with deterministic scripts, hashes, and environment locks; log time-to-reproduce (a manifest sketch follows this list).
  • EuroSys/SIGMOD artifact programs report roughly 72% reproducible and 41% reusable artefacts; we treat this as our baseline for success.
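
A minimal sketch of the packaging step: hash every artefact, record totals, and fail the build if the 50 MB cap is exceeded. The directory layout and required-file names are illustrative assumptions.

```python
import hashlib
import json
from pathlib import Path

MAX_PACK_BYTES = 50 * 1024 * 1024  # ≤ 50 MB target for replication packs

def sha256(path: Path) -> str:
    """Content hash so reviewers can verify artefacts byte-for-byte."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def build_manifest(pack_dir: Path) -> dict:
    """Walk the pack, hash every file, and enforce the size cap."""
    files = [p for p in pack_dir.rglob("*") if p.is_file()]
    total = sum(p.stat().st_size for p in files)
    if total > MAX_PACK_BYTES:
        raise ValueError(f"pack is {total / 1e6:.1f} MB; cap is 50 MB")
    return {
        "total_bytes": total,
        "artefacts": {str(p.relative_to(pack_dir)): sha256(p) for p in files},
        # Datasheet, Model Card, env lockfile, and deterministic entry script
        # are expected at these illustrative paths.
        "required": ["DATASHEET.md", "MODEL_CARD.md", "environment.lock", "run.sh"],
    }

if __name__ == "__main__":
    manifest = build_manifest(Path("pack/"))
    Path("pack/manifest.json").write_text(json.dumps(manifest, indent=2))
```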

1.5 Human-time saved (with risk controls)

  • LLM ensembles for abstract screening (JAMIA 2025) deliver 41.8% workload reduction at 100% sensitivity.
  • Update workflows and weakly-supervised active learning report WSS@95 gains and high recall with pseudo-labelling (WSS@95 is defined in the sketch after this list).
  • Conformal risk control supplies miscoverage bounds on long-form outputs; thresholds decide publish vs hold.
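
For reference, WSS@95 (work saved over sampling at 95% recall) can be computed directly from screening decisions ranked by model score; this is a minimal sketch, and the input format is an assumption.

```python
import math

def wss_at_recall(ranked_relevance: list[bool], target_recall: float = 0.95) -> float:
    """Work saved over sampling at a target recall (WSS@95 when target_recall=0.95).

    ranked_relevance: relevance labels ordered by model score, best first.
    Returns (N - n_read) / N - (1 - target_recall), where n_read is how many
    records a reviewer must screen, in ranked order, to hit the target recall.
    """
    n = len(ranked_relevance)
    needed = math.ceil(target_recall * sum(ranked_relevance))
    found, n_read = 0, n  # if recall is never reached, no work is saved
    for i, relevant in enumerate(ranked_relevance, start=1):
        found += relevant
        if found >= needed:
            n_read = i
            break
    return (n - n_read) / n - (1 - target_recall)

# Example: 95 of 100 relevant records found within the first 200 of 1,000
# ranked records gives WSS@95 = (1000 - 200) / 1000 - 0.05 = 0.75.
```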

1.6 Uncertainty-aware publish/hold

Conformal Risk Control (CRC) and conformal tail-risk control provide miscoverage-bounded abstention for text outputs. We calibrate on held-out splits, log every abstention, and escalate when citation integrity scores fall below threshold.
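
A minimal sketch of the calibration behind the publish/hold gate, in the monotone conformal-risk-control setting. The per-output risk score and the 0/1 loss (the output contained an unsupported or contradicted claim) are illustrative assumptions; the production gate controls richer losses, but the thresholding has the same shape.

```python
import numpy as np

def crc_threshold(cal_scores: np.ndarray, cal_losses: np.ndarray,
                  alpha: float, loss_bound: float = 1.0) -> float:
    """Pick the most permissive publish threshold whose CRC-adjusted risk <= alpha.

    cal_scores: per-output risk scores on a held-out calibration split (higher = riskier).
    cal_losses: observed loss in [0, loss_bound] per calibration output, e.g. 1.0
                if it contained an unsupported or contradicted claim, else 0.0.
    Publishing rule at threshold t: publish iff score <= t; held outputs incur
    zero loss, so realised risk is monotone non-decreasing in t.
    """
    n = len(cal_scores)
    best = -np.inf  # most conservative fallback: hold everything
    for t in np.unique(np.concatenate(([-np.inf], cal_scores))):
        risk = float((cal_losses * (cal_scores <= t)).sum()) / n
        # Finite-sample correction from conformal risk control (Angelopoulos et al.).
        if (n * risk + loss_bound) / (n + 1) <= alpha:
            best = t
    return best

def publish(score: float, threshold: float) -> bool:
    """At inference time: publish when the output's risk score clears the gate."""
    return score <= threshold
```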

1.7 Spam pressure & integrity threats

Paper mills and AI-generated survey floods raise the baseline for filtering. Provenance-first pipelines with integrity scoring and contradiction sentinels are the defence. We treat unresolved integrity debt as a blocker.

Visual evidence

Provenance-aware literature map — nodes link papers, claims, datasets, code, and contradiction edges so reviewers see integrity debt at a glance.

Verification frontiers (Oct 2025)

  • SPOT precision: 6.1% (paper-level precision on confirmed errors)
  • SPOT recall: 21.1% (paper-level recall on confirmed errors)
  • PRISMM identify: 54% (reviewer-flagged multimodal inconsistencies, identify task)
  • PRISMM remediate: 41% (reviewer-flagged inconsistencies, remedy task)

Verification benchmarks remain low. We publish SPOT precision/recall and PRISMM task scores to anchor abstention policies.

Replication ecology — pack outcomes

Stacked bars by pack-size bucket (≤ 10 MB, 10–25 MB, 25–50 MB), each bar normalised to 100% and split into Reproduced, Reusable, and Failed / blocked outcomes.
Replication packs target ≤ 50 MB with reproducibility > 60%; we log outcomes by pack size bucket and feed back into pack templates.

Method map (verification-first)

  1. Data ingestion & provenance graph. Ingest OpenAlex, Crossref, arXiv, OpenReview, Semantic Scholar, scite. Create Work, Venue, Author, Reference, Citance, Claim, Dataset, and Code nodes with resolver status and hashes.
  2. Conflict & claim layer. Extract claims from text, tables, and figures; align citations; mine contradictions using SPOT/PRISMM and domain suites like CliniFact.
  3. Constraint-aware synthesis. Draft only from integrity-clean sources; enforce glossary terms; require neighbourhood reading (bibliographic coupling, co-citation, embedding neighbours).
  4. Verification & risk control. Run contradiction sentinels and conformal risk control (CRC/CTR) to produce publish / abstain / escalate decisions with logged rationales.
  5. Replication packs. Ship ≤ 50 MB packs with redacted data, env lockfile, deterministic script, Datasheet, and Model Card. Log reproduction success and time.
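
A minimal sketch of the node schema from step 1, using networkx; the ingestion `record` is a hypothetical, source-agnostic dict (field names are not any one API's schema), and `resolver_ok` would come from the integrity check described in section 1.1.

```python
import hashlib
import networkx as nx

def add_work(graph: nx.MultiDiGraph, record: dict, resolver_ok: bool) -> str:
    """Add one work with author, venue, and reference edges to the provenance graph.

    `record` is a hypothetical ingestion record normalised from OpenAlex, Crossref,
    arXiv, OpenReview, Semantic Scholar, or scite; the keys below are illustrative.
    """
    work_id = record["doi"] or record["id"]
    graph.add_node(work_id, kind="work", title=record["title"],
                   resolver_ok=resolver_ok,
                   content_hash=hashlib.sha256(
                       repr(sorted(record.items())).encode()).hexdigest())
    graph.add_node(record["venue"], kind="venue")
    graph.add_edge(work_id, record["venue"], kind="published_in")
    for author in record["authors"]:
        graph.add_node(author, kind="author")
        graph.add_edge(author, work_id, kind="wrote")
    for cited in record["references"]:
        graph.add_node(cited, kind="work")
        # Citation-intent label is filled later by the intent gate (section 1.1).
        graph.add_edge(work_id, cited, kind="cites", intent=None)
    return work_id

provenance = nx.MultiDiGraph()  # Claim, Dataset, and Code nodes are added by later steps
```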

Publish-or-hold policy

Outputs ship only if citation integrity scores clear thresholds, contradictions are absent, and conformal risk stays below the user-set α. Otherwise, we hold, escalate, and log rationale with timestamped reviewer approvals.
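
Operationally the policy reduces to a small decision function over the audit record attached to each draft; a minimal sketch, with illustrative field and threshold names, and the logged rationale elided.

```python
from dataclasses import dataclass

@dataclass
class OutputAudit:
    integrity_score: float   # minimum citation-integrity score across references
    contradictions: int      # unresolved contradiction edges touching the draft
    risk_score: float        # nonconformity score fed to the conformal gate
    models_agree: bool       # dual-model agreement on key claims

def decide(audit: OutputAudit, integrity_floor: float, conformal_threshold: float) -> str:
    """Return publish / hold / escalate for one draft."""
    if audit.integrity_score < integrity_floor:
        return "hold"        # unresolved citation integrity debt blocks publishing
    if audit.contradictions > 0 or not audit.models_agree:
        return "escalate"    # human review with timestamped approval required
    if audit.risk_score > conformal_threshold:
        return "hold"        # conformal risk exceeds the user-set alpha budget
    return "publish"
```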

Benchmark table (selected, 2024–Oct 2025)

| Task | Novelty proxy | Contradiction P/R | Replication success | Corpus size | Human-time saved | Source |
|---|---|---|---|---|---|---|
| Paper-error detection (SPOT) | | 0.061 / 0.211 | | 83 papers, 91 confirmed errors | | May 2025, preprint |
| Reviewer-flag inconsistencies (PRISMM) | | Task scores 0.26–0.54 | | 262 inconsistencies, 242 papers | | Oct 2025, preprint |
| Novelty (GraphMind) | Graph-aware novelty acc 0.50–0.69 | | | 3,063 papers | | May 2025, preprint |
| Novelty (SchNovel) | Pairwise novelty accuracy vs emb-only | | | 15k paper pairs | | Jul 2025, peer-reviewed |
| Citation intent (FINECITE) | | | | 4 public datasets | | Jul 2025, peer-reviewed |
| SR screening (JAMIA ensembles) | | | | 119,695 records | 41.8% at 100% sensitivity; 99.1% max at lower sensitivity | May 2025, peer-reviewed |
| SR pipeline (TrialMind) | | | | 100 systematic reviews, 2,220 studies | +71.4% recall; −44.2% screening time | Aug 2025, peer-reviewed |
| Replication norms (EuroSys AE) | | | 75 "Results Reproduced"; ~58% participation | Multi-year AE records | | Aug 2025, whitepaper |

Integration with inAi products

5.1 PageMind (research & QA)

  • Build provenance graphs with contradiction edges and intent labels; filter navigation via glossary enforcement.
  • Draft only from integrity-clean sources; embed citance snippets; enforce intent mix quotas (background ≤ 40%; checked in the sketch after this list).
  • Run sentence- and paper-level contradiction checks (SPOT) plus PRISMM before publishing figure-heavy sections.
  • Ship ≤ 50 MB replication packs with hashes, deterministic scripts, Datasheet/Model Card, and time-to-reproduce logs.
  • Gate publish-or-hold via CRC/CTR thresholds; escalate when two models disagree or intent consistency breaks.
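
A minimal sketch of the intent-mix quota referenced in the second bullet, assuming each gated reference carries a label from the citation-intent encoder; the 40% background cap mirrors the quota above, and the label set is illustrative.

```python
from collections import Counter

def intent_mix_ok(intents: list[str], background_cap: float = 0.40) -> bool:
    """Check a drafted section's citation-intent mix against the quota.

    intents: one label per cited reference, e.g. "background", "method",
    "comparison", "result" (label names are illustrative).
    """
    if not intents:
        return False  # a draft with no gated references never ships
    counts = Counter(intents)
    return counts["background"] / len(intents) <= background_cap

# Example: 5 background citations out of 12 (~42%) breaches the 40% cap.
print(intent_mix_ok(["background"] * 5 + ["method"] * 4 + ["result"] * 3))  # False
```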

5.2 Emplo (evidence-grounded business packets)

  • Require source trace for claims about roles, companies, or revenues.
  • Enforce citation integrity checks (DOI/URL resolve, metadata match); flag unsupported claims under CRC.

Figures & tables to include

  • F1: Provenance-aware literature map (above).
  • F2: Verification bars (above) with 95% CIs where reported.
  • F3: Replication ecology stacked bars (above) showing reproduce/reuse/fail buckets.
  • T1: Benchmark table (above) with novelty, contradiction, replication, time saved, corpus size.
  • T2: Method map diagram (described in the Method map section above).

Open problems (KCR-1 invites)

  • Reliable novelty at small scale—compare GraphMind, SchNovel, neighbourhood density, and human scoring for n < 50 candidates.
  • Contradiction mining at paper scale—push SPOT/PRISMM recall > 50% without collapsing precision; add multimodal sentinels.
  • Minimal replication ecology—determine smallest pack that keeps ≥ 80% reproduction across labs using ARI/AE baselines.
  • Conformal thresholds for publish/hold—study CRC vs tail-risk control for long-form synthesis and abstention budgets.
  • Auditable citation chains through transformations—preserve intent and provenance across translation/summarisation.
  • Spam filtration at scale—detect paper-mill and AI-survey floods without suppressing legitimate rapid reviews.

Integration notes

  • Adopt: GraphMind features as constraints; SchNovel for evaluation only.
  • Adopt: Citation-intent encoders for gating references and labelling intent mix; Crossref and scite for integrity scoring.
  • Adopt: Verification gates using SPOT and PRISMM; log all abstentions with CRC/CTR thresholds.
  • Adopt: Perfect-sensitivity screening with WSS@95 reporting; replication norms from ARI/EuroSys.
  • Revise: Weakly-supervised screening limited to curator-approved domains.
  • Reject: Auto-publish without PRISMM checks or with unresolved citation integrity debt.

Limits & failure controls

  • Hallucination traps (unanswerable prompts, synthetic DOIs) plus CRC/CTR gating.
  • Citation integrity enforcement via Crossref resolve + title/year/venue matching; block mismatches.
  • Contradiction sentinels on sentence and paper level; run SPOT/PRISMM before publish.
  • Red-team prompts to force unsupported claims; require dual-model agreement + human spot checks.
  • Packaging caps (≤ 50 MB, hashed artefacts, deterministic logs); audit trails by email only.

Data vintage: Oct 2025 · Last updated 01 Oct 2025