ICML 2026 · AI for Science

XDomainBench

Diagnosing reasoning collapse in high-dimensional scientific knowledge composition with a controllable interactive benchmark.

Zhiren Gong1,2, Tiantong Wu1, Jiaming Zhang1, Fuyao Zhang1, Che Wang1, Yurong Hao1, Yikun Hou1,4, Foo Ping1, Yilei Zhao1, Fei Huang5, Chau Yuen3, Wei Yang Bryan Lim1

1 College of Computing and Data Science, Nanyang Technological University 2 Interdisciplinary Graduate Programme, Nanyang Technological University 3 School of Electrical and Electronic Engineering, Nanyang Technological University 4 Department of Mathematics and Mathematical Statistics, Umea University 5 Alibaba Group

Domain taxonomy and benchmark scale
Figure. Domain taxonomy and scale across 20 domains.

8,598

interactive sessions

20

scientific domains

4

task categories

1→4

composition orders

6

interdisciplinary categories

Abstract

Large Language Models are increasingly used for knowledge synthesis, but their compositional generalization in scientific knowledge remains under-characterized. Existing benchmarks mostly focus on single-turn restricted settings and fail to capture capability boundaries in realistic interactive workflows.

XDomainBench introduces a diagnostic benchmark for interactive interdisciplinary scientific reasoning. It formalizes composition order and mixture structure, covering 8,598 interactive sessions across 20 domains and 4 task categories, with realistic trajectory patterns over difficulty and domain-mixture dynamics.

Large-scale evaluation reveals systematic reasoning collapse as composition order increases, driven by two root causes: direct difficulty growth from domain composition and indirect interaction-amplified failures such as error accumulation, reasoning breaks, and domain confusion.

Takeaway. XDomainBench reframes interdisciplinary reasoning evaluation from isolated QA to controllable multi-turn composition stress testing.

Motivation & Positioning

Why New Benchmarking Is Needed

Real AI4S reasoning is multi-turn and cross-domain. Most prior benchmarks evaluate isolated queries and cannot separate whether failures come from missing knowledge or from fragile joint reasoning under composition.

What XDomainBench Adds

  • Controllable composition space from single-domain to high-order cross-domain sessions.
  • Trajectory-level diagnostics beyond aggregate accuracy.
  • Standardized history-aware evaluation protocol for reproducible comparisons.
Overview of the XDomainBench framework
Figure. Overview of the framework across generation, control signals, and evaluation.

Takeaway. The key novelty is diagnostic controllability: it separates knowledge coverage limits from compositional reasoning fragility.

Benchmark Design

Design framework of XDomainBench
Figure. Construction framework with composition and trajectory controls.

Three Design Principles

  • Interdisciplinary richness: composition order k in {1,2,3,4} and realistic domain combinations.
  • Content complexity richness: controlled difficulty trajectories and domain-mixture trajectories.
  • Diagnosability: interpretable failure signatures at turn and session levels.

Construction Pipeline

Stage 1 selects interdisciplinary domain sets. Stage 2 generates sessions under trajectory targets with validation. Stage 3 performs detailed checks and human quality review.

Distribution of patterns and task types
Figure. Pattern/task distribution supporting controlled complexity coverage.

Insight: Why Composition Selection Matters

The benchmark does not only increase domain count; it controls which domain combinations appear and how they evolve by turn. This avoids easy synthetic mixtures and makes failures more faithful to real interdisciplinary workflows.

A key outcome in the paper is that higher-order composition introduces integrative reasoning overhead even at early turns, before long-horizon drift effects begin to dominate.

Quantitative Footprint

Full split includes 8,598 sessions, 52,582 turns, and an average of 6.12 turns per session, giving enough trajectory depth to expose interaction-amplified failures.

Takeaway. Benchmark construction jointly controls domain order, trajectory complexity, and validation gates, enabling mechanism-level analysis rather than score-only comparison.

Dataset & Evaluation Protocol

Open Release Content

  • dataset/full_dataset: full benchmark for final reporting.
  • dataset/small_dataset: compact benchmark for rapid iteration.
  • evaluation/: unified evaluator with history enabled by default.
  • Multi-provider model access via LiteLLM-compatible model identifiers.

Metrics

Turn-level metrics: Recall and F1.
Session-level metric: SessionSuccess@tau, marking a session successful if correctness rate exceeds a fixed threshold.

Scale Summary

SplitJSON filesScenariosTurns
full_dataset648,59852,582
small_dataset641,1376,659

Task Types

Multiple Choice, Factual QA, Reasoning, and Code, with normalization in aggregated analyses.

Example multi-turn session
Figure. Example interdisciplinary interaction session used in benchmark construction.

Composition Order Distribution (Full)

Order kSessionsShare
k=11,34615.7%
k=23,74943.6%
k=32,10524.5%
k=41,39816.3%

Task-Type Distribution (Full Turns)

Task typeTurnsShare
Reasoning25,07047.7%
Multiple choice15,48129.4%
Factual11,81822.5%
Code2130.4%

Trajectory Patterns (Full Sessions)

Difficulty patternSessions
Gradual increase2,567
Fluctuate2,305
Stable1,858
Gradual decrease1,502
Spike366

For multi-domain sessions only (k>=2), mixture patterns are: Stable 5,032, Gradual shift 1,361, Fluctuate 859.

Main Experimental Results

Multi-metric scaling across composition orders
Figure. Multi-metric scaling patterns across composition orders.

Observed Scaling Pattern

Performance declines as composition order increases, and this drop becomes steeper at higher k. The paper further shows that model families behave differently: small models are more sensitive to composition stress, while MoE models retain stronger session-level robustness in many settings.

  • Nonlinear degradation: Large-model deterministic mean S@t drops from 38.7 to 27.1 as k goes from 1 to 4.
  • Architecture gap: MoE models keep the strongest S@t at each k, but still exhibit substantial decay under higher composition.
  • Metric robustness: Judge-based semantic scoring follows the same monotonic collapse trend as token-based metrics.
  • Early-turn burden: Turn-1 comparisons indicate direct composition overhead before long-session error propagation starts.
Large Mean S@t: 38.7 → 27.1 Small Mean S@t: 32.3 → 25.9 MoE Mean S@t: 82.1 → 42.0

See full main table below for complete per-model values (deterministic + open-world, across k=1..4).

Observed Scaling Pattern (Full Main Table)

Model Deterministic Problems Open-world Problems
D-k1 RD-k1 F1D-k1 S@t D-k2 RD-k2 F1D-k2 S@t D-k3 RD-k3 F1D-k3 S@t D-k4 RD-k4 F1D-k4 S@t O-k1 RO-k1 F1O-k1 S@t O-k2 RO-k2 F1O-k2 S@t O-k3 RO-k3 F1O-k3 S@t O-k4 RO-k4 F1O-k4 S@t
GPT-5.230.822.042.525.911.428.822.88.225.218.04.826.925.99.329.621.77.028.228.05.821.413.05.835.3
Claude-4.5 Sonnet30.422.848.528.614.031.321.07.326.719.34.524.430.28.040.723.410.321.413.55.521.413.34.014.7
Claude-4.5 Haiku30.623.041.829.713.731.126.912.330.723.09.537.719.47.040.722.411.021.432.613.823.219.111.535.3
Gemini-2.5 Flash27.120.436.621.45.324.318.43.025.213.83.121.318.86.933.322.14.134.213.53.321.412.43.734.5
Gemini-2.0 Flash27.212.926.922.45.625.820.75.024.315.02.820.326.81.822.216.22.824.812.64.616.111.92.315.4
Qwen2.5-72B26.817.835.827.012.827.325.011.528.219.98.932.331.38.425.921.911.220.516.110.221.413.210.332.4
Large Mean28.819.838.725.810.528.122.57.926.718.25.627.125.46.932.121.37.725.119.47.220.813.86.327.9
GPT-5-mini23.112.825.225.314.421.322.42.220.011.41.320.022.02.810.023.13.420.03.23.310.03.23.310.0
Qwen2.5-14B27.219.139.625.511.326.623.39.127.718.57.629.918.23.918.527.110.425.628.19.310.711.67.520.6
Qwen2.5-7B26.418.431.325.811.526.823.39.828.718.97.328.150.23.718.522.68.623.911.17.68.914.56.817.6
Llama-3.1-8B26.218.835.824.711.423.823.39.928.216.58.125.115.24.114.823.99.829.928.910.317.919.07.929.4
Llama-3.2-3B26.816.331.324.510.523.823.28.728.720.95.529.919.63.425.922.88.925.647.16.923.214.35.126.5
Gemma-2-2B-IT27.318.630.626.312.127.620.67.421.818.14.322.618.44.67.420.78.518.845.07.417.99.52.817.6
Small Mean26.217.332.325.311.925.022.77.925.917.45.725.923.93.815.923.48.324.027.27.514.812.05.620.3
Qwen3-Next-80B63.865.789.646.340.151.141.736.452.029.526.334.510.815.040.729.827.928.226.323.946.425.724.141.2
Mixtral-8x7B56.422.474.651.918.046.451.018.948.541.019.549.424.121.340.741.925.735.942.025.553.639.923.755.9
MoE Mean60.144.082.149.129.148.846.427.650.235.222.942.017.518.140.735.926.832.034.124.750.032.823.948.5

Full values from the main overall scaling table in the paper (R/F1/S@t for deterministic and open-world settings).

Insight: Deterministic vs Open-world Regimes

Deterministic tasks expose clearer compositional scaling laws, while open-world settings add semantic variance. Even under this variance, degradation with increasing k remains directionally consistent, indicating the collapse is not a token-overlap artifact.

Insight: Why This Matters for AI4S

In realistic scientific workflows, higher-order composition is unavoidable. The results imply that improvements in standalone reasoning are insufficient unless models also mitigate interaction-level drift across turns.

Turn-1 breakdown by composition order and task type
Figure. Turn-1 breakdown: composition effect and difficulty/performance relation by model type.

Takeaway. Performance degradation with higher composition order is systematic and model-family dependent, with small models most sensitive and MoE variants comparatively resilient.

Mechanism Analysis

Mechanism summary left
Figure. Pattern distributions and pattern-to-collapse associations.
Mechanism summary right
Figure. Pattern trajectories showing how difficulty/mixture dynamics trigger failures.

Direct Mechanism

Immediate compositional overhead appears at turn 1 as domains are combined, before long-session error propagation. This reflects insufficient capability in high-dimensional joint reasoning.

Indirect Mechanism

Interaction patterns amplify failures over turns, causing error accumulation, reasoning breaks, and domain confusion, ultimately increasing collapse probability.

Core Insight Across Results

The paper's key message is not only that scores drop with larger k, but that two mechanisms interact: domain composition increases immediate reasoning burden, while trajectory dynamics in multi-turn sessions amplify mistakes. This explains why collapse is often sharper in realistic interactive settings than in one-shot evaluations.

Takeaway. Collapse arises from coupled effects: direct compositional burden at early turns and trajectory-amplified error propagation over interaction history.

Authors & Institutions

Zhiren Gong1,2 · Tiantong Wu1 · Jiaming Zhang1 · Fuyao Zhang1 · Che Wang1 · Yurong Hao1 · Yikun Hou1,4 · Foo Ping1 · Yilei Zhao1 · Fei Huang5 · Chau Yuen3 · Wei Yang Bryan Lim1

1 College of Computing and Data Science, NTU Singapore 2 Interdisciplinary Graduate Programme, NTU Singapore 3 School of EEE, NTU Singapore 4 Umea University, Sweden 5 Alibaba Group, China

Resources

Contact

For any questions, collaborations, or issues with benchmark usage, please feel free to contact zhiren001@e.ntu.edu.sg.

BibTeX

@inproceedings{gong2026xdomainbench,
  title     = {{XD}omainBench: Diagnosing Reasoning Collapse in High-Dimensional Scientific Knowledge Composition},
  author    = {Gong, Zhiren and Wu, Tiantong and Zhang, Jiaming and Zhang, Fuyao and Wang, Che and Hao, Yurong and Hou, Yikun and Foo, Ping and Zhao, Yilei and Huang, Fei and Yuen, Chau and Lim, Wei Yang Bryan},
  booktitle = {Forty-third International Conference on Machine Learning},
  year      = {2026},
  url       = {https://openreview.net/forum?id=U8x5SYtT5b}
}