ICML 2026 Accepted · Inference-Time Specialization

SubspacePath Pruner

A scenario-level pruning framework that couples representation subspaces in embedding space with sparse executable pathways in parameter space.

Zhiren Gong1,2, Yikun Hou1,4, Fan Wu1, Che Wang1, Fuyao Zhang1, Tiantong Wu1, Yurong Hao1, Jiaming Zhang1, Yiyang Duan1, Tiantong Wang1, Fei Huang5, Chau Yuen3, Wei Yang Bryan Lim1

1 College of Computing and Data Science, Nanyang Technological University 2 Interdisciplinary Graduate Programme, Nanyang Technological University 3 School of Electrical and Electronic Engineering, Nanyang Technological University 4 Department of Mathematics and Mathematical Statistics, Umea University 5 Alibaba Group

Moderate pruning on Qwen2.5-14B reaches 47.8 / 44.1 / 31.3 Recall (Selected/OOD/Cross) versus dense 40.9 / 37.2 / 22.8, while online compilation remains within 0.027-0.068s.

Figure. Model-average robustness and efficiency profile across domain and dataset shifts.

  • Backbones evaluated: 4
  • Dataset splits/tests: 6
  • Qwen2.5-14B Recall (moderate vs dense): 47.8 / 44.1 / 31.3 vs 40.9 / 37.2 / 22.8
  • Speedup range across four backbones: Light 1.38-3.22x · Aggressive 0.88-1.51x
  • Online compilation time range: 0.027-0.068s

Abstract

We study practical inference-time specialization: given a frozen LLM and a deployment scenario, compile a reusable budget-bounded subnetwork without scenario-specific supervised training.

The core hypothesis is subspace-pathway coupling: inputs aligned in similar representation subspaces tend to activate sparse and consistent head-level pathways. SubspacePath operationalizes this with two modules: Domain-Basis Synthesis (DBS) and Probe-based Scenario Pruning (PSP).

Across XDomainBench splits and cross-dataset benchmarks (CommonsenseQA, Natural Questions, ARC), SubspacePath improves robustness-efficiency trade-offs under moderate and aggressive pruning.

Motivation & Core Insight

Problem

Global static pruning criteria often fail under scenario shifts, while router-heavy approaches add runtime complexity. We need pruning that is specialized, interpretable, and deployment-friendly.

Key Insight

Embedding-level domain axes can act as stable coordinates, and probe signals can map those coordinates to executable head pathways. This turns pruning from generic compression into scenario-conditioned compilation.

What Actually Improves

The paper reports that gains are largest in OOD and cross-domain settings, where scenario-conditioned masks reduce pathway interference. The effect is structural (better pathway organization), not merely a by-product of removing parameters.

Figure. End-to-end SubspacePath pipeline: DBS builds domain axes and PSP compiles scenario masks.

Method

Method Overview (No Training During Deployment)

SubspacePath separates work into an offline preparation stage and a lightweight online compilation stage. The online stage uses only scenario-start input and does not run optimization.

Stage A: DBS (Domain-Basis Synthesis)

  • Build input-only domain pools from training-side selected-domain data.
  • Project embeddings to a compact shared space and synthesize domain axes.
  • Select a stable axis subset that balances separation and coverage.
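
The three DBS steps above can be sketched as follows. This is a minimal illustration, not the paper's procedure: it assumes a fixed random Gaussian projection as the "compact shared space", unit-normalized centered domain centroids as the "domain axes", and a greedy low-overlap rule as the "stable subset" criterion (it covers separation only, not coverage). The names `synthesize_domain_axes` and `select_axes` are hypothetical.

```python
import numpy as np

def synthesize_domain_axes(pools, proj_dim=8, seed=0):
    """pools: dict mapping domain name -> (n_i, d) array of input embeddings."""
    rng = np.random.default_rng(seed)
    d = next(iter(pools.values())).shape[1]
    # Assumption: a fixed random Gaussian map stands in for the shared projection.
    proj = rng.standard_normal((d, proj_dim)) / np.sqrt(proj_dim)
    centroids = {name: (x @ proj).mean(axis=0) for name, x in pools.items()}
    global_mean = np.mean(list(centroids.values()), axis=0)
    axes = {}
    for name, c in centroids.items():
        v = c - global_mean                       # direction that separates this domain
        axes[name] = v / (np.linalg.norm(v) + 1e-12)
    return proj, axes

def select_axes(axes, k):
    """Greedily keep k axes with low mutual cosine similarity (assumed criterion)."""
    names = sorted(axes)
    chosen = [names[0]]
    while len(chosen) < min(k, len(names)):
        rest = [n for n in names if n not in chosen]
        # Pick the candidate whose worst-case overlap with chosen axes is smallest.
        best = min(rest, key=lambda n: max(abs(float(axes[n] @ axes[c])) for c in chosen))
        chosen.append(best)
    return chosen
```

Because the pools are input-only, nothing in this sketch needs labels or gradients; any real implementation would substitute the paper's own projection and selection rules.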

Stage B: PSP (Probe-based Scenario Pruning)

  • Train layer-wise lightweight probes to read axis relevance from residual signals.
  • Cache domain-head importance and an always-keep whitelist of backbone heads.
  • At scenario start, estimate domain mixture and compile one reusable head mask under budget.
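
As a rough illustration of the probe step, the sketch below fits one linear probe per layer by ridge regression and turns the averaged probe outputs into a domain mixture with a softmax. The probe form, the ridge objective, the layer averaging, and the softmax are all assumptions made for illustration; `fit_ridge_probe` and `estimate_domain_mixture` are hypothetical names, not the paper's API.

```python
import numpy as np

def fit_ridge_probe(residuals, axis_targets, l2=1e-2):
    """Fit a linear probe from residual features to axis-relevance targets.

    residuals: (n, d) pooled residual-stream vectors for one layer.
    axis_targets: (n, k) relevance of each input to each of k domain axes.
    Returns a (d, k) weight matrix (closed-form ridge solution).
    """
    d = residuals.shape[1]
    gram = residuals.T @ residuals + l2 * np.eye(d)
    return np.linalg.solve(gram, residuals.T @ axis_targets)

def estimate_domain_mixture(probes, layer_residuals, temperature=1.0):
    """Average per-layer probe readouts and normalize them into a mixture.

    probes: list of (d, k) weight matrices, one per layer.
    layer_residuals: list of (d,) pooled vectors from the scenario-start input.
    """
    logits = np.mean([h @ w for w, h in zip(probes, layer_residuals)], axis=0)
    z = (logits - logits.max()) / temperature   # subtract max for numerical stability
    p = np.exp(z)
    return p / p.sum()
```

Only a forward pass is needed at scenario start, which is consistent with the online stage running no optimization.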

Why This Works

  • DBS gives a stable semantic coordinate system instead of ad-hoc global ranking.
  • PSP turns semantic alignment into executable pathway scoring.
  • Whitelist + budgeted mask keeps general capability while removing scenario-conflicting heads.
  • The resulting mask is reused over turns, so overhead is low and deployment-friendly.

Step 1 · Offline Preparation

Construct domain pools, synthesize DBS axes, and train PSP probes on input-only data.

Output: reusable semantic basis + probe toolkit + cached head importance.

Step 2 · Scenario Compilation

Read scenario-start input, infer domain mixture, combine with cached importance, and compile budgeted head mask.

Output: one scenario-level executable mask (`m_s`) with whitelist preserved.
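
A minimal sketch of this compilation step, assuming heads are indexed by a flat integer, head scores are a mixture-weighted sum of cached per-domain importance, and the budget is a maximum number of kept heads that includes the whitelist. These choices and the name `compile_head_mask` are illustrative assumptions; the paper's exact scoring rule may differ.

```python
def compile_head_mask(mixture, domain_head_importance, whitelist, budget):
    """Return a 0/1 keep-mask over heads (a stand-in for the mask m_s).

    mixture: list of domain weights inferred from the scenario-start input.
    domain_head_importance: one list of per-head scores per domain (cached offline).
    whitelist: set of head indices that are always kept.
    budget: maximum number of heads to keep, whitelist included.
    """
    if len(whitelist) > budget:
        raise ValueError("budget is smaller than the whitelist")
    n_heads = len(domain_head_importance[0])
    # Mixture-weighted importance of every head for this scenario.
    scores = [
        sum(w * imp[h] for w, imp in zip(mixture, domain_head_importance))
        for h in range(n_heads)
    ]
    keep = set(whitelist)
    candidates = sorted(
        (h for h in range(n_heads) if h not in keep),
        key=lambda h: scores[h],
        reverse=True,
    )
    # Fill the remaining budget with the highest-scoring non-whitelisted heads.
    keep.update(candidates[: budget - len(keep)])
    return [1 if h in keep else 0 for h in range(n_heads)]
```

The step is a weighted sum followed by a sort, so its cost scales with the number of heads rather than with model size.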

Step 3 · Multi-turn Reuse

Reuse the compiled mask for subsequent turns to avoid repeated optimization and keep per-turn overhead low.

Result: stable, efficient specialization for coherent multi-turn scenarios.
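
The reuse pattern can be sketched as a small session wrapper that compiles the mask on the first turn and passes the same mask to every later turn. `ScenarioSession`, `compile_mask`, and `run_turn` are hypothetical names; the two callables stand in for whatever compilation and masked-inference routines a deployment provides.

```python
class ScenarioSession:
    """Compile a mask once per scenario, then reuse it for every turn."""

    def __init__(self, compile_mask, run_turn):
        self._compile_mask = compile_mask   # callable: scenario-start input -> mask
        self._run_turn = run_turn           # callable: (input, mask) -> model output
        self._mask = None
        self.compile_calls = 0

    def step(self, user_input):
        if self._mask is None:
            # Only the scenario-start input triggers compilation.
            self._mask = self._compile_mask(user_input)
            self.compile_calls += 1
        return self._run_turn(user_input, self._mask)
```

Starting a new scenario would mean creating a new session, so a stale mask is never carried across scenarios.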


Main Results

LLaMA-2-13B (Moderate)

43.0 / 32.5 / 20.2 Recall (Selected/OOD/Cross)

Dense baseline: 29.6 / 26.1 / 18.4

LLaMA-2-13B (Aggressive)

34.7 / 30.4 / 22.9 Recall

Cross-domain Recall (22.9) remains above the dense baseline (18.4) under aggressive pruning.

Qwen2.5-14B (Moderate)

47.8 / 44.1 / 31.3 Recall

Strong cross-dataset retention on NQ and ARC.

Figure. Trade-off curve on Selected split (LLaMA-2-13B).
Figure. Trade-off curve on OOD split (LLaMA-2-13B).
Figure. Trade-off curve on Cross-domain split (LLaMA-2-13B).

Performance Interpretation

Gains are not only from removing heads. Scenario-conditioned masks suppress domain-conflicting pathways and preserve axis-coupled heads, reducing latent competition in residual aggregation.

This effect is strongest in distribution-shifted settings, where static global ranking methods are more brittle under scenario mismatch.
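
To make "suppressing a pathway in residual aggregation" concrete, the sketch below uses the standard multi-head attention layout, in which per-head outputs are concatenated and multiplied by an output projection. It shows that gating a head to zero is equivalent to deleting that head's rows from the projection. Whether SubspacePath applies its masks exactly this way is an assumption; both function names are hypothetical.

```python
import numpy as np

def masked_attention_output(head_outputs, w_o, head_mask):
    """Gate per-head outputs before the output projection.

    head_outputs: (n_heads, d_head); w_o: (n_heads * d_head, d_model);
    head_mask: (n_heads,) of 0/1 values.
    """
    mask = np.asarray(head_mask, dtype=head_outputs.dtype)
    gated = head_outputs * mask[:, None]    # masked heads contribute zeros
    return gated.reshape(-1) @ w_o          # concatenate heads, then project

def prune_output_projection(w_o, head_mask, d_head):
    """Drop the rows of w_o that belong to masked heads."""
    keep_rows = np.repeat(np.asarray(head_mask, dtype=bool), d_head)
    return w_o[keep_rows]
```

The first form is a soft gate that leaves tensor shapes unchanged; the second physically shrinks the projection, which is the form that can reduce memory and compute.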

Main Result Table (LLaMA-2-13B, Full Baselines)


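Method | Moderate Pruning Recall: Sel | OOD | Cross | CSQA | NQ | ARC | Aggressive Pruning Recall: Sel | OOD | Cross | CSQA | NQ | ARC
Dense | 29.6 | 26.1 | 18.4 | 22.27 | 30.25 | 23.87 | 29.6 | 26.1 | 18.4 | 22.27 | 30.25 | 23.87
DaSS | 33.82 | 28.62 | 19.89 | 20.10 | 18.80 | 19.22 | 35.27 | 27.84 | 15.40 | 18.13 | 12.42 | 16.29
Wanda | 26.87 | 27.26 | 15.43 | 17.63 | 17.27 | 14.38 | 18.12 | 26.57 | 8.77 | 13.70 | 8.09 | 9.50
LLM-Pr. | 29.34 | 27.81 | 18.92 | 21.08 | 27.63 | 23.09 | 27.30 | 28.15 | 17.36 | 21.20 | 19.12 | 21.01
RIA | 27.06 | 27.03 | 15.12 | 17.98 | 17.24 | 14.33 | 18.94 | 26.69 | 8.52 | 12.48 | 8.06 | 9.75
Probe Pr. | 29.02 | 27.81 | 19.08 | 21.75 | 27.64 | 23.32 | 15.19 | 27.84 | 11.54 | 21.65 | 11.34 | 6.96
Ours-SubspacePath | 43.00 | 32.50 | 20.20 | 19.43 | 33.66 | 22.91 | 34.70 | 30.40 | 22.90 | 18.40 | 24.89 | 21.56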

Complete baseline-inclusive recall table for LLaMA-2-13B, matching the paper's moderate/aggressive comparison block.
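
As a reading aid, the per-column difference between the SubspacePath row and the Dense row can be computed directly from the table values. The snippet below is plain arithmetic on the reported numbers; the variable and function names are illustrative.

```python
COLUMNS = ["Sel", "OOD", "Cross", "CSQA", "NQ", "ARC"]

# Values copied from the LLaMA-2-13B table above.
DENSE = dict(zip(COLUMNS, [29.6, 26.1, 18.4, 22.27, 30.25, 23.87]))
OURS_MODERATE = dict(zip(COLUMNS, [43.00, 32.50, 20.20, 19.43, 33.66, 22.91]))
OURS_AGGRESSIVE = dict(zip(COLUMNS, [34.70, 30.40, 22.90, 18.40, 24.89, 21.56]))

def recall_delta(ours, dense):
    """Recall difference (ours minus dense) per column, rounded to 2 decimals."""
    return {col: round(ours[col] - dense[col], 2) for col in dense}

for label, row in [("moderate", OURS_MODERATE), ("aggressive", OURS_AGGRESSIVE)]:
    print(label, recall_delta(row, DENSE))
```

Under moderate pruning the differences are +13.4 (Sel), +6.4 (OOD), +1.8 (Cross), and +3.41 (NQ), while CSQA and ARC sit 2.84 and 0.96 points below dense. Under aggressive pruning the three XDomainBench splits stay above dense and the three cross-dataset columns fall below it.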

Main Result Table (More Backbones, Moderate Pruning)


Backbone | XDB Selected (Dense / Ours) | XDB OOD (Dense / Ours) | XDB Cross (Dense / Ours) | NQ (Dense / Ours) | Speedup (Light / Agg.)
LLaMA-2-13B | 29.6 / 43.0 | 26.1 / 32.5 | 18.4 / 20.2 | 30.25 / 33.66 | 3.22 / 1.51
Qwen2.5-7B | 40.8 / 46.4 | 33.6 / 41.0 | 22.9 / 27.2 | 17.51 / 19.51 | 2.01 / 0.88
Qwen2.5-14B | 40.9 / 47.8 | 37.2 / 44.1 | 22.8 / 31.3 | 19.72 / 27.21 | 1.38 / 1.29

Aggregated from the main and appendix experiment tables; values shown as representative dense-vs-ours comparisons.

Efficiency & Ablation

Efficiency (Matched to Main Paper Table)

  • LLaMA-2-13B light pruning memory: 13.0 -> 12.1 GB, speedup 1.26x (XDomainBench avg).
  • LLaMA-2-13B aggressive pruning memory: 13.0 -> 11.7 GB, speedup 1.24x (XDomainBench avg).
  • ARC reaches strongest acceleration: 2.21x at light pruning.
  • Online compilation remains low-latency: 0.027s-0.068s across tested backbones.

These numbers are directly aligned with the efficiency metrics and pruning-time/speedup tables in the paper.
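
For scale, the two LLaMA-2-13B memory figures above translate into relative reductions as follows. This is simple arithmetic on the reported values; `relative_reduction` is an illustrative helper.

```python
def relative_reduction(before_gb, after_gb):
    """Fraction of memory saved when going from before_gb to after_gb."""
    return (before_gb - after_gb) / before_gb

light = relative_reduction(13.0, 12.1)     # 13.0 -> 12.1 GB
heavier = relative_reduction(13.0, 11.7)   # 13.0 -> 11.7 GB
print(f"{light:.1%} and {heavier:.1%}")
```

The savings come to roughly 6.9% and 10.0% of the dense footprint, so the memory effect is modest compared with the reported speedups.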

Ablation (Coupling Components)

Ablation | XDB Sel. | XDB OOD | XDB Cross | CSQA | NQ | ARC
Dense | 29.6 | 26.1 | 18.4 | 22.27 | 30.25 | 23.87
Full (DBS+PSP) | 34.7 | 30.4 | 22.9 | 19.43 | 33.66 | 25.62
w/o DBS selection | 1.8 | 1.4 | 0.9 | 0.82 | 0.40 | 1.16
w/o whitelist | 22.4 | 0.4 | 20.2 | 0.10 | 0.12 | 0.23
w/o multi-domain mixing | 29.5 | 23.5 | 19.0 | 7.07 | 10.68 | 17.04

Ablation table values are taken from the coupling ablation in the experiment section.

Pruning Time and Speedup Summary

Model | Pruning Time (s) | Speedup (Light / Aggressive)
LLaMA-3.1-8B + Ours | 0.039 | 1.41 / 1.35
LLaMA-2-13B + Ours | 0.060 | 3.22 / 1.51
Qwen2.5-7B + Ours | 0.027 | 2.01 / 0.88
Qwen2.5-14B + Ours | 0.068 | 1.38 / 1.29

Efficiency Insight

The strongest practical property is the offline/online separation: heavy computation is amortized offline, while online compilation stays below 0.1s. This matches multi-turn deployment, where one mask is reused throughout a scenario.
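
The amortization argument can be made concrete with a one-line calculation: a one-off compilation cost spread over the turns of a scenario. The compile time below is the largest value reported on this page; the turn count of 10 is a hypothetical example, not a figure from the paper.

```python
def per_turn_overhead(compile_seconds, turns):
    """Compilation cost per turn when one mask is reused for `turns` turns."""
    if turns < 1:
        raise ValueError("a scenario needs at least one turn")
    return compile_seconds / turns

worst_case_compile = 0.068   # seconds, largest reported compilation time
example_turns = 10           # hypothetical scenario length
print(f"{per_turn_overhead(worst_case_compile, example_turns) * 1000:.1f} ms per turn")
```

Even at the slowest reported compilation time, a ten-turn scenario would carry under 7 ms of compilation overhead per turn.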

Reported speedups are backend-sensitive, so the paper separates raw speedup from retention. This is why some settings still emphasize retention gains even when wall-clock speedup is moderate.

Case Studies

Figure. Cross-dataset case (Natural Questions): better relevance and reduced drift.
Figure. OOD multi-turn biology case: pruned pathway preserves coherent reasoning.
Figure. Cross-domain case: scenario mask helps control mixed-domain interference.

Authors & Affiliations

Zhiren Gong1,2, Yikun Hou1,4, Fan Wu1, Che Wang1, Fuyao Zhang1, Tiantong Wu1, Yurong Hao1, Jiaming Zhang1, Yiyang Duan1, Tiantong Wang1, Fei Huang5, Chau Yuen3, Wei Yang Bryan Lim1

1 College of Computing and Data Science, NTU Singapore 2 Interdisciplinary Graduate Programme, NTU Singapore 3 School of EEE, NTU Singapore 4 Umea University, Sweden 5 Alibaba Group, China

Resources

Paper

ICML 2026 accepted paper. Public URL can be inserted once camera-ready page is available.

Code

Repository link can be added after internal release.

Experimental Assets

All figures and tables in this page are rendered from the paper assets in slm_agent/papers.

BibTeX

@inproceedings{gong2026subspacepathpruner,
  title   = {SubspacePath Pruner: Inference-time Pruning via Probe-based Representation-Parameter Coupling},
  author  = {Gong, Zhiren and Hou, Yikun and Wu, Fan and Wang, Che and Zhang, Fuyao and Wu, Tiantong and Hao, Yurong and Zhang, Jiaming and Duan, Yiyang and Wang, Tiantong and Huang, Fei and Yuen, Chau and Lim, Wei Yang Bryan},
  booktitle = {Forty-third International Conference on Machine Learning},
  year    = {2026}
}