ICML 2026 Accepted · Inference-Time Specialization

SubspacePath Pruner

A scenario-level pruning framework that couples representation subspaces in embedding space with sparse executable pathways in parameter space.

Zhiren Gong¹,², Yikun Hou¹,⁴, Fan Wu¹, Che Wang¹, Fuyao Zhang¹, Tiantong Wu¹, Yurong Hao¹, Jiaming Zhang¹, Yiyang Duan¹, Tiantong Wang¹, Fei Huang⁵, Chau Yuen³, Wei Yang Bryan Lim¹

¹ College of Computing and Data Science, Nanyang Technological University ² Interdisciplinary Graduate Programme, Nanyang Technological University ³ School of Electrical and Electronic Engineering, Nanyang Technological University ⁴ Department of Mathematics and Mathematical Statistics, Umeå University ⁵ Alibaba Group

Moderate pruning on Qwen2.5-14B reaches 47.8 / 44.1 / 31.3 Recall (Selected/OOD/Cross) versus dense 40.9 / 37.2 / 22.8, while online compilation remains within 0.027-0.068s.


Figure. Model-average radar chart: robustness and efficiency profile across domain and dataset shifts.

Backbones evaluated: 4
Dataset splits/tests: 6
Qwen2.5-14B Recall (moderate vs dense): 47.8 / 44.1 / 31.3 vs 40.9 / 37.2 / 22.8
Speedup range across four backbones: Light 1.38-3.22x · Aggressive 0.88-1.51x
Online compilation time range: 0.027-0.068s

Abstract

We study practical inference-time specialization: given a frozen LLM and a deployment scenario, compile a reusable budget-bounded subnetwork without scenario-specific supervised training.

The core hypothesis is subspace-pathway coupling: inputs aligned in similar representation subspaces tend to activate sparse and consistent head-level pathways. SubspacePath operationalizes this with two modules: Domain-Basis Synthesis (DBS) and Probe-based Scenario Pruning (PSP).

Across XDomainBench splits and cross-dataset benchmarks (CommonsenseQA, Natural Questions, ARC), SubspacePath improves robustness-efficiency trade-offs under moderate and aggressive pruning.

Motivation & Core Insight

Problem

Global static pruning criteria often fail under scenario shifts, while router-heavy approaches add runtime complexity. We need pruning that is specialized, interpretable, and deployment-friendly.

Key Insight

Embedding-level domain axes can act as stable coordinates, and probe signals can map those coordinates to executable head pathways. This turns pruning from generic compression into scenario-conditioned compilation.

Figure. End-to-end SubspacePath pipeline: DBS builds domain axes and PSP compiles scenario masks.

What Actually Improves

Gains are strongest in OOD and cross-domain settings because scenario-conditioned masks reduce pathway interference. The effect comes from better pathway organization, not only parameter removal.

OOD Stability

Domain-axis conditioning keeps execution focused on heads that remain semantically aligned under shift.
Compared with global pruning, the mask is less sensitive to activation drift when input style changes.
This is why OOD recall rises consistently across backbones at moderate pruning levels.

Cross-Domain Robustness

Mixed-domain prompts trigger less interference because conflicting pathways are suppressed early.
Whitelist preservation keeps shared general reasoning capacity while specialized heads are selectively routed.
Result: better retention on Cross/NQ/ARC while still delivering deployment-friendly speedups.

Method

Method Overview (No Training During Deployment)

SubspacePath separates work into an offline preparation stage and a lightweight online compilation stage. The online stage uses only scenario-start input and does not run optimization.

Stage A: DBS (Domain-Basis Synthesis)

Build input-only domain pools from training-side selected-domain data.
Project embeddings to a compact shared space and synthesize domain axes.
Select a stable axis subset that balances separation and coverage.
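
The three DBS steps above can be sketched roughly as follows. This is a minimal illustration, not the released implementation: it assumes each domain axis is the unit-normalized mean of that domain's already-projected embeddings, uses pool size as a stand-in for coverage, and uses a pairwise cosine threshold as a stand-in for separation. The names `synthesize_axes`, `select_axes`, and `max_cos` are hypothetical.

```python
import math
from typing import Dict, List

Vector = List[float]

def normalize(v: Vector) -> Vector:
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)

def cosine(a: Vector, b: Vector) -> float:
    """Dot product; equals cosine similarity because inputs are unit vectors."""
    return sum(x * y for x, y in zip(a, b))

def synthesize_axes(pools: Dict[str, List[Vector]]) -> Dict[str, Vector]:
    """Assumption: one axis per domain, the normalized mean of its pool."""
    axes = {}
    for domain, vectors in pools.items():
        dim = len(vectors[0])
        mean = [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]
        axes[domain] = normalize(mean)
    return axes

def select_axes(axes: Dict[str, Vector],
                pools: Dict[str, List[Vector]],
                max_cos: float = 0.5) -> List[str]:
    """Greedy selection: visit domains by pool size (coverage proxy) and keep
    an axis only if it is nearly orthogonal to every axis already kept."""
    order = sorted(axes, key=lambda d: (-len(pools[d]), d))
    kept: List[str] = []
    for d in order:
        if all(abs(cosine(axes[d], axes[k])) <= max_cos for k in kept):
            kept.append(d)
    return kept

# Toy input-only pools in a 3-dimensional shared space (made-up numbers).
pools = {
    "biology": [[1.0, 0.1, 0.0], [0.9, 0.0, 0.1], [1.0, 0.0, 0.0]],
    "law": [[0.0, 1.0, 0.1], [0.1, 0.9, 0.0]],
    "medicine": [[0.9, 0.2, 0.0], [1.0, 0.1, 0.1]],  # nearly parallel to biology
}
axes = synthesize_axes(pools)
selected = select_axes(axes, pools)
print(selected)  # the near-duplicate "medicine" axis is dropped
```

In this toy setup the medicine axis is rejected because it is almost collinear with the biology axis, which is the kind of redundancy a separation criterion is meant to filter out.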

Stage B: PSP (Probe-based Scenario Pruning)

Train layer-wise lightweight probes to read axis relevance from residual signals.
Cache domain-head importance and an always-keep whitelist of backbone heads.
At scenario start, estimate domain mixture and compile one reusable head mask under budget.
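
As a hedged sketch of the probe step, the snippet below assumes each layer has a linear probe that maps a residual-stream vector to one logit per domain axis, and that the scenario's domain mixture is the softmax of the layer-averaged logits. The probe form, the averaging rule, and the function names are illustrative assumptions; the page does not specify them.

```python
import math
from typing import List

def probe_logits(weights: List[List[float]], residual: List[float]) -> List[float]:
    """Linear probe: one weight row per domain axis, dotted with the residual."""
    return [sum(w * r for w, r in zip(row, residual)) for row in weights]

def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def estimate_mixture(layer_probes: List[List[List[float]]],
                     layer_residuals: List[List[float]]) -> List[float]:
    """Average probe logits over layers, then normalize into a domain mixture."""
    per_layer = [probe_logits(w, r) for w, r in zip(layer_probes, layer_residuals)]
    n_axes = len(per_layer[0])
    mean_logits = [sum(layer[a] for layer in per_layer) / len(per_layer)
                   for a in range(n_axes)]
    return softmax(mean_logits)

# Two layers, two domain axes, 3-dimensional residuals (toy numbers).
layer_probes = [
    [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]],
    [[0.5, 0.0, 0.5], [0.0, 0.5, 0.5]],
]
layer_residuals = [[2.0, 0.0, 0.0], [2.0, 0.0, 1.0]]
mixture = estimate_mixture(layer_probes, layer_residuals)
print([round(p, 3) for p in mixture])
```

The mixture sums to one, so it can be used directly as weights over the cached per-domain head importance.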

Why This Works

DBS gives a stable semantic coordinate system instead of ad-hoc global ranking.
PSP turns semantic alignment into executable pathway scoring.
Whitelist + budgeted mask keeps general capability while removing scenario-conflicting heads.
The resulting mask is reused over turns, so overhead is low and deployment-friendly.

Step 1 · Offline Preparation

Construct domain pools, synthesize DBS axes, and train PSP probes on input-only data.

Output: reusable semantic basis + probe toolkit + cached head importance.

Step 2 · Scenario Compilation

Read the scenario-start input, infer the domain mixture, combine it with cached head importance, and compile a budgeted head mask.

Output: one scenario-level executable mask (`m_s`) with whitelist preserved.
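
One way to picture the compilation step is a mixture-weighted top-k selection over heads. The sketch below assumes the score of a head is the mixture-weighted sum of its cached per-domain importance, that the budget is a count of heads to keep, and that whitelist heads are kept unconditionally. The function `compile_mask` and all numbers are hypothetical.

```python
from typing import Dict, Set, Tuple

Head = Tuple[int, int]  # (layer index, head index)

def compile_mask(mixture: Dict[str, float],
                 importance: Dict[str, Dict[Head, float]],
                 whitelist: Set[Head],
                 budget: int) -> Dict[Head, int]:
    """Return a 0/1 mask keeping at most `budget` heads, whitelist included."""
    if len(whitelist) > budget:
        raise ValueError("budget must be large enough to cover the whitelist")
    heads = set(whitelist)
    for scores in importance.values():
        heads.update(scores)
    # Assumed scoring rule: mixture-weighted sum of cached domain-head importance.
    score = {
        h: sum(w * importance.get(d, {}).get(h, 0.0) for d, w in mixture.items())
        for h in heads
    }
    # Rank non-whitelist heads by score; ties are broken by head index.
    candidates = sorted((h for h in heads if h not in whitelist),
                        key=lambda h: (-score[h], h))
    kept = set(whitelist) | set(candidates[: budget - len(whitelist)])
    return {h: int(h in kept) for h in sorted(heads)}

importance = {
    "biology": {(0, 0): 0.9, (0, 1): 0.1, (1, 0): 0.4, (1, 1): 0.2},
    "law": {(0, 0): 0.1, (0, 1): 0.8, (1, 0): 0.3, (1, 1): 0.2},
}
mixture = {"biology": 0.8, "law": 0.2}
whitelist = {(1, 1)}
m_s = compile_mask(mixture, importance, whitelist, budget=3)
print(m_s)
```

With a biology-heavy mixture, the law-specific head `(0, 1)` is the one pruned, while the whitelisted head `(1, 1)` survives despite its low score.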

Step 3 · Multi-turn Reuse

Reuse the compiled mask for subsequent turns to avoid recompiling and keep per-turn overhead low.

Result: stable, efficient specialization for coherent multi-turn scenarios.
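
The reuse pattern amounts to compiling once per scenario and looking the mask up on later turns. The sketch below is a minimal cache under that assumption; `ScenarioMaskCache`, the scenario identifier, and the toy compile function are all hypothetical and stand in for the real compilation step.

```python
from typing import Callable, Dict, Hashable, List

class ScenarioMaskCache:
    """Compile a head mask on the first turn of a scenario, then reuse it."""

    def __init__(self, compile_fn: Callable[[str], List[int]]) -> None:
        self._compile_fn = compile_fn
        self._masks: Dict[Hashable, List[int]] = {}
        self.compile_calls = 0  # counts how often compilation actually ran

    def mask_for(self, scenario_id: Hashable, turn_input: str) -> List[int]:
        # Only the scenario-start input triggers compilation; later turns
        # hit the cache and ignore their own input for masking purposes.
        if scenario_id not in self._masks:
            self._masks[scenario_id] = self._compile_fn(turn_input)
            self.compile_calls += 1
        return self._masks[scenario_id]

def toy_compile(first_input: str) -> List[int]:
    """Stand-in for scenario compilation (purely illustrative)."""
    return [1, 0, 1] if "cell" in first_input else [1, 1, 0]

cache = ScenarioMaskCache(toy_compile)
turns = ["How does a cell divide?", "And what about meiosis?", "Summarize both."]
masks = [cache.mask_for("chat-1", text) for text in turns]
print(masks, cache.compile_calls)
```

All three turns share the mask compiled from the first input, so compilation cost is paid once per scenario rather than once per turn.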

Method Evidence

Images are supporting evidence; method logic is primary.

Figure. 3D embedding projection: selected domain subspaces are geometrically separable.

Figure. Orthogonality matrix: DBS-selected axes achieve higher orthogonality with sufficient semantic coverage.

Figure. Subspace-pathway coupling mechanism: subspace alignment patterns map to executable pathway selection behavior.

Figure. Domain-head importance cross section: domain-specific head importance supports pathway compilation.

Main Results

LLaMA-2-13B (Moderate)

43.0 / 32.5 / 20.2 Recall (Selected/OOD/Cross)

Dense baseline: 29.6 / 26.1 / 18.4

LLaMA-2-13B (Aggressive)

34.7 / 30.4 / 22.9 Recall

Cross-domain remains above dense under heavier pruning.

Qwen2.5-14B (Moderate)

47.8 / 44.1 / 31.3 Recall

Strong cross-dataset retention on NQ and ARC.

Figure. Trade-off curve on Selected split (LLaMA-2-13B).

Figure. Trade-off curve on OOD split (LLaMA-2-13B).

Figure. Trade-off curve on Cross-domain split (LLaMA-2-13B).

Trade-off Insight Across Splits

Selected split: SubspacePath sustains the highest recall as sparsity increases, showing stronger in-domain head prioritization.
OOD split: the margin versus generic pruning widens, indicating better resistance to domain mismatch.
Cross-domain split: gains remain under heavier pruning because mask compilation reduces cross-topic pathway collision.

Performance Interpretation

Gains are not only from removing heads. Scenario-conditioned masks suppress domain-conflicting pathways and preserve axis-coupled heads, reducing latent competition in residual aggregation.
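
To make "residual aggregation" concrete, the sketch below assumes the attention update to the residual stream is a sum of per-head outputs (after output projection), so a binary head mask simply decides which terms enter the sum. This is a simplified view for illustration; `masked_residual_update` is a hypothetical helper and the vectors are toy values.

```python
from typing import List

def masked_residual_update(residual: List[float],
                           head_outputs: List[List[float]],
                           mask: List[int]) -> List[float]:
    """Add only the kept heads' outputs to the residual stream."""
    out = list(residual)
    for keep, head_out in zip(mask, head_outputs):
        if not keep:
            continue  # a pruned head is skipped, so it adds nothing to the sum
        out = [o + h for o, h in zip(out, head_out)]
    return out

residual = [1.0, 0.0]
head_outputs = [
    [0.5, 0.5],   # head 0: kept
    [-0.5, 1.0],  # head 1: pruned, its conflicting contribution is removed
    [0.25, 0.0],  # head 2: kept
]
updated = masked_residual_update(residual, head_outputs, mask=[1, 0, 1])
print(updated)  # [1.75, 0.5]
```

In the toy example, head 1 pushes the first coordinate in the opposite direction from heads 0 and 2; masking it removes that competing term from the aggregate.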

This effect is strongest in distribution-shifted settings, where static global ranking methods are more brittle under scenario mismatch.

Main Result Table (LLaMA-2-13B, Full Baselines)


Recall under moderate (Mod.) and aggressive (Agg.) pruning.

Method	Mod. Sel	Mod. OOD	Mod. Cross	Mod. CSQA	Mod. NQ	Mod. ARC	Agg. Sel	Agg. OOD	Agg. Cross	Agg. CSQA	Agg. NQ	Agg. ARC
Dense	29.6	26.1	18.4	22.27	30.25	23.87	29.6	26.1	18.4	22.27	30.25	23.87
DaSS	33.82	28.62	19.89	20.10	18.80	19.22	35.27	27.84	15.40	18.13	12.42	16.29
Wanda	26.87	27.26	15.43	17.63	17.27	14.38	18.12	26.57	8.77	13.70	8.09	9.50
LLM-Pr.	29.34	27.81	18.92	21.08	27.63	23.09	27.30	28.15	17.36	21.20	19.12	21.01
RIA	27.06	27.03	15.12	17.98	17.24	14.33	18.94	26.69	8.52	12.48	8.06	9.75
Probe Pr.	29.02	27.81	19.08	21.75	27.64	23.32	15.19	27.84	11.54	21.65	11.34	6.96
Ours-SubspacePath	43.00	32.50	20.20	19.43	33.66	22.91	34.70	30.40	22.90	18.40	24.89	21.56

Main Result Table (More Backbones, Moderate Pruning)


Backbone	XDB Selected (Dense/Ours)	XDB OOD (Dense/Ours)	XDB Cross (Dense/Ours)	NQ (Dense/Ours)	Speedup (Light/Agg.)
LLaMA-2-13B	29.6 / 43.0	26.1 / 32.5	18.4 / 20.2	30.25 / 33.66	3.22 / 1.51
Qwen2.5-7B	40.8 / 46.4	33.6 / 41.0	22.9 / 27.2	17.51 / 19.51	2.01 / 0.88
Qwen2.5-14B	40.9 / 47.8	37.2 / 44.1	22.8 / 31.3	19.72 / 27.21	1.38 / 1.29
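
For readers who want the margins rather than the raw pairs, the short script below computes absolute (Recall points) and relative gains over dense from the rows of the table above. The numbers are copied from that table; the helper `gain` and the rounding choices are illustrative.

```python
from typing import Dict, Tuple

# (dense, ours) Recall pairs under moderate pruning, copied from the table.
table: Dict[str, Dict[str, Tuple[float, float]]] = {
    "LLaMA-2-13B": {"Selected": (29.6, 43.0), "OOD": (26.1, 32.5),
                    "Cross": (18.4, 20.2), "NQ": (30.25, 33.66)},
    "Qwen2.5-7B": {"Selected": (40.8, 46.4), "OOD": (33.6, 41.0),
                   "Cross": (22.9, 27.2), "NQ": (17.51, 19.51)},
    "Qwen2.5-14B": {"Selected": (40.9, 47.8), "OOD": (37.2, 44.1),
                    "Cross": (22.8, 31.3), "NQ": (19.72, 27.21)},
}

def gain(dense: float, ours: float) -> Tuple[float, float]:
    """Return (absolute gain in points, relative gain in percent)."""
    delta = ours - dense
    return round(delta, 2), round(100.0 * delta / dense, 1)

gains = {
    backbone: {split: gain(*pair) for split, pair in splits.items()}
    for backbone, splits in table.items()
}
for backbone, splits in gains.items():
    print(backbone, splits)
```

Every listed cell shows a positive gain over dense; the largest absolute margin in this table is LLaMA-2-13B on the Selected split (13.4 points).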

Table-Level Interpretation

The consistent pattern is that moderate pruning gives the best overall robustness-efficiency point, while aggressive pruning still preserves competitive recall in OOD and cross-domain settings. This indicates SubspacePath primarily reorganizes executable pathways rather than relying on fragile one-shot compression.

Efficiency & Ablation

Efficiency (Matched to Main Paper Table)

LLaMA-2-13B light pruning memory: 13.0 -> 12.1 GB, speedup 1.26x (XDomainBench avg).
LLaMA-2-13B heavy pruning memory: 13.0 -> 11.7 GB, speedup 1.24x (XDomainBench avg).
ARC reaches strongest acceleration: 2.21x at light pruning.
Online compilation remains low-latency: 0.027s-0.068s across tested backbones.
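
The memory figures above translate into relative savings as follows. This is plain arithmetic on the reported LLaMA-2-13B numbers; the helper name `memory_reduction` is illustrative.

```python
def memory_reduction(dense_gb: float, pruned_gb: float) -> float:
    """Fraction of memory saved relative to the dense model."""
    return (dense_gb - pruned_gb) / dense_gb

light = memory_reduction(13.0, 12.1)  # light pruning: 13.0 -> 12.1 GB
heavy = memory_reduction(13.0, 11.7)  # heavy pruning: 13.0 -> 11.7 GB
print(f"light: {light:.1%}, heavy: {heavy:.1%}")  # light: 6.9%, heavy: 10.0%
```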

Ablation (Coupling Components)

Ablation	XDB Sel.	XDB OOD	XDB Cross	CSQA	NQ	ARC
Dense	29.6	26.1	18.4	22.27	30.25	23.87
Full (DBS+PSP)	34.7	30.4	22.9	19.43	33.66	25.62
w/o DBS selection	1.8	1.4	0.9	0.82	0.40	1.16
w/o whitelist	22.4	0.4	20.2	0.10	0.12	0.23
w/o multi-domain mixing	29.5	23.5	19.0	7.07	10.68	17.04

Pruning Time and Speedup Summary

Model	Pruning Time (s)	Speedup (Light / Aggressive)
LLaMA-3.1-8B + Ours	0.039	1.41 / 1.35
LLaMA-2-13B + Ours	0.060	3.22 / 1.51
Qwen2.5-7B + Ours	0.027	2.01 / 0.88
Qwen2.5-14B + Ours	0.068	1.38 / 1.29

Efficiency Insight

The strongest practical property is the offline/online separation: heavy computation is amortized offline, while online compilation stays sub-0.1s. This directly matches multi-turn deployment where one mask is reused through a scenario.
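
A quick way to see the amortization argument is to spread the one-time compilation cost over the turns of a scenario. The compile times below come from the pruning-time table; the turn count of 10 is a hypothetical scenario length chosen only for illustration.

```python
from typing import Dict

def per_turn_overhead(compile_seconds: float, turns: int) -> float:
    """Compilation cost amortized over the turns that reuse the mask."""
    if turns < 1:
        raise ValueError("a scenario needs at least one turn")
    return compile_seconds / turns

# Reported online compilation times in seconds.
compile_times: Dict[str, float] = {
    "LLaMA-3.1-8B": 0.039,
    "LLaMA-2-13B": 0.060,
    "Qwen2.5-7B": 0.027,
    "Qwen2.5-14B": 0.068,
}
turns = 10  # hypothetical scenario length
overhead_ms = {
    model: 1000.0 * per_turn_overhead(seconds, turns)
    for model, seconds in compile_times.items()
}
for model, ms in overhead_ms.items():
    print(f"{model}: {ms:.1f} ms per turn")
```

Under this assumption the amortized cost stays below 7 ms per turn for every listed backbone, and it shrinks further as scenarios get longer.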

Reported speedups are backend-sensitive, and one setting (Qwen2.5-7B under aggressive pruning, 0.88x) is slower than dense, so the paper separates raw speedup from retention. This is why some settings still emphasize retention gains even when wall-clock speedup is moderate.

Case Studies

Case-Level Insights

Natural Questions

The compiled mask reduces off-topic continuation and keeps retrieval-oriented reasoning concise, which improves relevance under dataset shift.

OOD Biology Multi-turn

Subspace-conditioned pathways maintain multi-turn coherence: later responses stay aligned with earlier factual context instead of drifting into generic templates.

Cross-domain QA

In mixed philosophy/sociology prompts, interference control is visible as cleaner reasoning transitions between concepts that would otherwise activate competing heads.

Figure. Cross-dataset case (Natural Questions): better relevance and reduced drift.

Figure. OOD multi-turn biology case: pruned pathway preserves coherent reasoning.

Figure. Cross-domain case (philosophy and sociology): scenario mask helps control mixed-domain interference.

Authors & Affiliations

Zhiren Gong¹,², Yikun Hou¹,⁴, Fan Wu¹, Che Wang¹, Fuyao Zhang¹, Tiantong Wu¹, Yurong Hao¹, Jiaming Zhang¹, Yiyang Duan¹, Tiantong Wang¹, Fei Huang⁵, Chau Yuen³, Wei Yang Bryan Lim¹

¹ College of Computing and Data Science, NTU Singapore ² Interdisciplinary Graduate Programme, NTU Singapore ³ School of EEE, NTU Singapore ⁴ Umeå University, Sweden ⁵ Alibaba Group, China

Resources

Paper

OpenReview page for the ICML 2026 publication.

Code

Official implementation: GitHub repository.

Tutorial

A narrated, animated ~7-minute video tour covering the problem, the DBS + PSP method, and the results, built from the paper's own figures.

BibTeX

@inproceedings{gong2026subspacepathpruner,
  title   = {SubspacePath Pruner: Inference-time Pruning via Probe-based Representation-Parameter Coupling},
  author  = {Gong, Zhiren and Hou, Yikun and Wu, Fan and Wang, Che and Zhang, Fuyao and Wu, Tiantong and Hao, Yurong and Zhang, Jiaming and Duan, Yiyang and Wang, Tiantong and Huang, Fei and Yuen, Chau and Lim, Wei Yang Bryan},
  booktitle = {Forty-third International Conference on Machine Learning},
  year    = {2026}
}

Contact

For collaboration or questions about this project, contact zhiren001@e.ntu.edu.sg.