ICML 2026 · AI for Science

XDomainBench

Diagnosing reasoning collapse in high-dimensional scientific knowledge composition with a controllable interactive benchmark.

Zhiren Gong^1,2, Tiantong Wu¹, Jiaming Zhang¹, Fuyao Zhang¹, Che Wang¹, Yurong Hao¹, Yikun Hou^1,4, Foo Ping¹, Yilei Zhao¹, Fei Huang⁵, Chau Yuen³, Wei Yang Bryan Lim¹

¹ College of Computing and Data Science, Nanyang Technological University ² Interdisciplinary Graduate Programme, Nanyang Technological University ³ School of Electrical and Electronic Engineering, Nanyang Technological University ⁴ Department of Mathematics and Mathematical Statistics, Umea University ⁵ Alibaba Group

Paper ▶ Watch tutorial GitHub Hugging Face

Domain taxonomy and benchmark scale — Figure. Domain taxonomy and scale across 20 domains.

8,598

interactive sessions

20

scientific domains

4

task categories

1→4

composition orders

6

interdisciplinary categories

Abstract

Large Language Models are increasingly used for knowledge synthesis, but their compositional generalization in scientific knowledge remains under-characterized. Existing benchmarks mostly focus on single-turn restricted settings and fail to capture capability boundaries in realistic interactive workflows.

XDomainBench introduces a diagnostic benchmark for interactive interdisciplinary scientific reasoning. It formalizes composition order and mixture structure, covering 8,598 interactive sessions across 20 domains and 4 task categories, with realistic trajectory patterns over difficulty and domain-mixture dynamics.

Large-scale evaluation reveals systematic reasoning collapse as composition order increases, driven by two root causes: direct difficulty growth from domain composition and indirect interaction-amplified failures such as error accumulation, reasoning breaks, and domain confusion.

Takeaway. XDomainBench reframes interdisciplinary reasoning evaluation from isolated QA to controllable multi-turn composition stress testing.

Motivation & Positioning

Why New Benchmarking Is Needed

Real AI4S reasoning is multi-turn and cross-domain. Most prior benchmarks evaluate isolated queries and cannot separate whether failures come from missing knowledge or from fragile joint reasoning under composition.

What XDomainBench Adds

Controllable composition space from single-domain to high-order cross-domain sessions.
Trajectory-level diagnostics beyond aggregate accuracy.
Standardized history-aware evaluation protocol for reproducible comparisons.

Overview of the XDomainBench framework — Figure. Overview of the framework across generation, control signals, and evaluation.

Takeaway. The key novelty is diagnostic controllability: it separates knowledge coverage limits from compositional reasoning fragility.

Benchmark Design

Design framework of XDomainBench — Figure. Construction framework with composition and trajectory controls.

Three Design Principles

Interdisciplinary richness: composition order k in {1,2,3,4} and realistic domain combinations.
Content complexity richness: controlled difficulty trajectories and domain-mixture trajectories.
Diagnosability: interpretable failure signatures at turn and session levels.

Construction Pipeline

Stage 1 selects interdisciplinary domain sets. Stage 2 generates sessions under trajectory targets with validation. Stage 3 performs detailed checks and human quality review.

Distribution of patterns and task types — Figure. Pattern/task distribution supporting controlled complexity coverage.

Insight: Why Composition Selection Matters

The benchmark does not only increase domain count; it controls which domain combinations appear and how they evolve by turn. This avoids easy synthetic mixtures and makes failures more faithful to real interdisciplinary workflows.

A key outcome in the paper is that higher-order composition introduces integrative reasoning overhead even at early turns, before long-horizon drift effects begin to dominate.

Quantitative Footprint

Full split includes 8,598 sessions, 52,582 turns, and an average of 6.12 turns per session, giving enough trajectory depth to expose interaction-amplified failures.

Takeaway. Benchmark construction jointly controls domain order, trajectory complexity, and validation gates, enabling mechanism-level analysis rather than score-only comparison.

Dataset & Evaluation Protocol

Open Release Content

dataset/full_dataset: full benchmark for final reporting.
dataset/small_dataset: compact benchmark for rapid iteration.
evaluation/: unified evaluator with history enabled by default.
Multi-provider model access via LiteLLM-compatible model identifiers.

Metrics

Turn-level metrics: Recall and F1.
Session-level metric: SessionSuccess@tau, marking a session successful if correctness rate exceeds a fixed threshold.

Scale Summary

Split	JSON files	Scenarios	Turns
full_dataset	64	8,598	52,582
small_dataset	64	1,137	6,659

Task Types

Multiple Choice, Factual QA, Reasoning, and Code, with normalization in aggregated analyses.

Example multi-turn session — Figure. Example interdisciplinary interaction session used in benchmark construction.

Composition Order Distribution (Full)

Order k	Sessions	Share
k=1	1,346	15.7%
k=2	3,749	43.6%
k=3	2,105	24.5%
k=4	1,398	16.3%

Task-Type Distribution (Full Turns)

Task type	Turns	Share
Reasoning	25,070	47.7%
Multiple choice	15,481	29.4%
Factual	11,818	22.5%
Code	213	0.4%

Trajectory Patterns (Full Sessions)

Difficulty pattern	Sessions
Gradual increase	2,567
Fluctuate	2,305
Stable	1,858
Gradual decrease	1,502
Spike	366

For multi-domain sessions only (k>=2), mixture patterns are: Stable 5,032, Gradual shift 1,361, Fluctuate 859.

Main Experimental Results

Multi-metric scaling across composition orders — Figure. Multi-metric scaling patterns across composition orders.

Observed Scaling Pattern

Performance declines as composition order increases, and this drop becomes steeper at higher k. The paper further shows that model families behave differently: small models are more sensitive to composition stress, while MoE models retain stronger session-level robustness in many settings.

Nonlinear degradation: Large-model deterministic mean S@t drops from 38.7 to 27.1 as k goes from 1 to 4.
Architecture gap: MoE models keep the strongest S@t at each k, but still exhibit substantial decay under higher composition.
Metric robustness: Judge-based semantic scoring follows the same monotonic collapse trend as token-based metrics.
Early-turn burden: Turn-1 comparisons indicate direct composition overhead before long-session error propagation starts.

Large Mean S@t: 38.7 → 27.1 Small Mean S@t: 32.3 → 25.9 MoE Mean S@t: 82.1 → 42.0

See full main table below for complete per-model values (deterministic + open-world, across k=1..4).

Observed Scaling Pattern (Full Main Table)

Model	Deterministic Problems												Open-world Problems
Model	D-k1 R	D-k1 F1	D-k1 S@t	D-k2 R	D-k2 F1	D-k2 S@t	D-k3 R	D-k3 F1	D-k3 S@t	D-k4 R	D-k4 F1	D-k4 S@t	O-k1 R	O-k1 F1	O-k1 S@t	O-k2 R	O-k2 F1	O-k2 S@t	O-k3 R	O-k3 F1	O-k3 S@t	O-k4 R	O-k4 F1	O-k4 S@t
GPT-5.2	30.8	22.0	42.5	25.9	11.4	28.8	22.8	8.2	25.2	18.0	4.8	26.9	25.9	9.3	29.6	21.7	7.0	28.2	28.0	5.8	21.4	13.0	5.8	35.3
Claude-4.5 Sonnet	30.4	22.8	48.5	28.6	14.0	31.3	21.0	7.3	26.7	19.3	4.5	24.4	30.2	8.0	40.7	23.4	10.3	21.4	13.5	5.5	21.4	13.3	4.0	14.7
Claude-4.5 Haiku	30.6	23.0	41.8	29.7	13.7	31.1	26.9	12.3	30.7	23.0	9.5	37.7	19.4	7.0	40.7	22.4	11.0	21.4	32.6	13.8	23.2	19.1	11.5	35.3
Gemini-2.5 Flash	27.1	20.4	36.6	21.4	5.3	24.3	18.4	3.0	25.2	13.8	3.1	21.3	18.8	6.9	33.3	22.1	4.1	34.2	13.5	3.3	21.4	12.4	3.7	34.5
Gemini-2.0 Flash	27.2	12.9	26.9	22.4	5.6	25.8	20.7	5.0	24.3	15.0	2.8	20.3	26.8	1.8	22.2	16.2	2.8	24.8	12.6	4.6	16.1	11.9	2.3	15.4
Qwen2.5-72B	26.8	17.8	35.8	27.0	12.8	27.3	25.0	11.5	28.2	19.9	8.9	32.3	31.3	8.4	25.9	21.9	11.2	20.5	16.1	10.2	21.4	13.2	10.3	32.4
Large Mean	28.8	19.8	38.7	25.8	10.5	28.1	22.5	7.9	26.7	18.2	5.6	27.1	25.4	6.9	32.1	21.3	7.7	25.1	19.4	7.2	20.8	13.8	6.3	27.9
GPT-5-mini	23.1	12.8	25.2	25.3	14.4	21.3	22.4	2.2	20.0	11.4	1.3	20.0	22.0	2.8	10.0	23.1	3.4	20.0	3.2	3.3	10.0	3.2	3.3	10.0
Qwen2.5-14B	27.2	19.1	39.6	25.5	11.3	26.6	23.3	9.1	27.7	18.5	7.6	29.9	18.2	3.9	18.5	27.1	10.4	25.6	28.1	9.3	10.7	11.6	7.5	20.6
Qwen2.5-7B	26.4	18.4	31.3	25.8	11.5	26.8	23.3	9.8	28.7	18.9	7.3	28.1	50.2	3.7	18.5	22.6	8.6	23.9	11.1	7.6	8.9	14.5	6.8	17.6
Llama-3.1-8B	26.2	18.8	35.8	24.7	11.4	23.8	23.3	9.9	28.2	16.5	8.1	25.1	15.2	4.1	14.8	23.9	9.8	29.9	28.9	10.3	17.9	19.0	7.9	29.4
Llama-3.2-3B	26.8	16.3	31.3	24.5	10.5	23.8	23.2	8.7	28.7	20.9	5.5	29.9	19.6	3.4	25.9	22.8	8.9	25.6	47.1	6.9	23.2	14.3	5.1	26.5
Gemma-2-2B-IT	27.3	18.6	30.6	26.3	12.1	27.6	20.6	7.4	21.8	18.1	4.3	22.6	18.4	4.6	7.4	20.7	8.5	18.8	45.0	7.4	17.9	9.5	2.8	17.6
Small Mean	26.2	17.3	32.3	25.3	11.9	25.0	22.7	7.9	25.9	17.4	5.7	25.9	23.9	3.8	15.9	23.4	8.3	24.0	27.2	7.5	14.8	12.0	5.6	20.3
Qwen3-Next-80B	63.8	65.7	89.6	46.3	40.1	51.1	41.7	36.4	52.0	29.5	26.3	34.5	10.8	15.0	40.7	29.8	27.9	28.2	26.3	23.9	46.4	25.7	24.1	41.2
Mixtral-8x7B	56.4	22.4	74.6	51.9	18.0	46.4	51.0	18.9	48.5	41.0	19.5	49.4	24.1	21.3	40.7	41.9	25.7	35.9	42.0	25.5	53.6	39.9	23.7	55.9
MoE Mean	60.1	44.0	82.1	49.1	29.1	48.8	46.4	27.6	50.2	35.2	22.9	42.0	17.5	18.1	40.7	35.9	26.8	32.0	34.1	24.7	50.0	32.8	23.9	48.5

Full values from the main overall scaling table in the paper (R/F1/S@t for deterministic and open-world settings).

Insight: Deterministic vs Open-world Regimes

Deterministic tasks expose clearer compositional scaling laws, while open-world settings add semantic variance. Even under this variance, degradation with increasing k remains directionally consistent, indicating the collapse is not a token-overlap artifact.

Insight: Why This Matters for AI4S

In realistic scientific workflows, higher-order composition is unavoidable. The results imply that improvements in standalone reasoning are insufficient unless models also mitigate interaction-level drift across turns.

Turn-1 breakdown by composition order and task type — Figure. Turn-1 breakdown: composition effect and difficulty/performance relation by model type.

Takeaway. Performance degradation with higher composition order is systematic and model-family dependent, with small models most sensitive and MoE variants comparatively resilient.

Mechanism Analysis

Mechanism summary left — Figure. Pattern distributions and pattern-to-collapse associations.

Mechanism summary right — Figure. Pattern trajectories showing how difficulty/mixture dynamics trigger failures.

Direct Mechanism

Immediate compositional overhead appears at turn 1 as domains are combined, before long-session error propagation. This reflects insufficient capability in high-dimensional joint reasoning.

Indirect Mechanism

Interaction patterns amplify failures over turns, causing error accumulation, reasoning breaks, and domain confusion, ultimately increasing collapse probability.

Core Insight Across Results

The paper's key message is not only that scores drop with larger k, but that two mechanisms interact: domain composition increases immediate reasoning burden, while trajectory dynamics in multi-turn sessions amplify mistakes. This explains why collapse is often sharper in realistic interactive settings than in one-shot evaluations.

Takeaway. Collapse arises from coupled effects: direct compositional burden at early turns and trajectory-amplified error propagation over interaction history.

Authors & Institutions

Zhiren Gong^1,2 · Tiantong Wu¹ · Jiaming Zhang¹ · Fuyao Zhang¹ · Che Wang¹ · Yurong Hao¹ · Yikun Hou^1,4 · Foo Ping¹ · Yilei Zhao¹ · Fei Huang⁵ · Chau Yuen³ · Wei Yang Bryan Lim¹

¹ College of Computing and Data Science, NTU Singapore ² Interdisciplinary Graduate Programme, NTU Singapore ³ School of EEE, NTU Singapore ⁴ Umea University, Sweden ⁵ Alibaba Group, China

Contact

For any questions, collaborations, or issues with benchmark usage, please feel free to contact zhiren001@e.ntu.edu.sg.

BibTeX

@inproceedings{gong2026xdomainbench,
  title     = {{XD}omainBench: Diagnosing Reasoning Collapse in High-Dimensional Scientific Knowledge Composition},
  author    = {Gong, Zhiren and Wu, Tiantong and Zhang, Jiaming and Zhang, Fuyao and Wang, Che and Hao, Yurong and Hou, Yikun and Foo, Ping and Zhao, Yilei and Huang, Fei and Yuen, Chau and Lim, Wei Yang Bryan},
  booktitle = {Forty-third International Conference on Machine Learning},
  year      = {2026},
  url       = {https://openreview.net/forum?id=U8x5SYtT5b}
}