ICML 2026 Accepted · Inference-Time Specialization

SubspacePath Pruner

A scenario-level pruning framework that couples representation subspaces in embedding space with sparse executable pathways in parameter space.

Zhiren Gong1,2, Yikun Hou1,4, Fan Wu1, Che Wang1, Fuyao Zhang1, Tiantong Wu1, Yurong Hao1, Jiaming Zhang1, Yiyang Duan1, Tiantong Wang1, Fei Huang5, Chau Yuen3, Wei Yang Bryan Lim1

1 College of Computing and Data Science, Nanyang Technological University 2 Interdisciplinary Graduate Programme, Nanyang Technological University 3 School of Electrical and Electronic Engineering, Nanyang Technological University 4 Department of Mathematics and Mathematical Statistics, Umea University 5 Alibaba Group

Moderate pruning on Qwen2.5-14B reaches 47.8 / 44.1 / 31.3 Recall (Selected/OOD/Cross) versus 40.9 / 37.2 / 22.8 for the dense model, while online compilation takes 0.027-0.068s across the tested backbones.

Figure. Model-average robustness and efficiency profile across domain and dataset shifts (radar chart).

  • Backbones evaluated: 4
  • Dataset splits/tests: 6
  • Qwen2.5-14B Recall (moderate vs dense): 47.8 / 44.1 / 31.3 vs 40.9 / 37.2 / 22.8
  • Speedup range across four backbones: Light 1.38-3.22x · Agg. 0.88-1.51x
  • Online compilation time range: 0.027-0.068s

Abstract

We study practical inference-time specialization: given a frozen LLM and a deployment scenario, compile a reusable budget-bounded subnetwork without scenario-specific supervised training.

The core hypothesis is subspace-pathway coupling: inputs aligned in similar representation subspaces tend to activate sparse and consistent head-level pathways. SubspacePath operationalizes this with two modules: Domain-Basis Synthesis (DBS) and Probe-based Scenario Pruning (PSP).
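One way to make the coupling hypothesis concrete is to check whether two inputs share their most important heads. The sketch below is illustrative only: it assumes a per-input head-score array (layers x heads) is already available from some attribution method, and it uses Jaccard overlap of the top-k heads as a stand-in for "pathway consistency". The function names and the choice of Jaccard overlap are assumptions, not the paper's metric.

```python
import numpy as np

def top_k_heads(head_scores: np.ndarray, k: int) -> set:
    """Return flattened (layer * n_heads + head) indices of the k highest-scoring heads."""
    flat = head_scores.ravel()
    return set(np.argsort(flat)[-k:].tolist())

def pathway_overlap(scores_a: np.ndarray, scores_b: np.ndarray, k: int) -> float:
    """Jaccard overlap between the top-k head sets of two inputs (1.0 = identical pathways)."""
    heads_a = top_k_heads(scores_a, k)
    heads_b = top_k_heads(scores_b, k)
    return len(heads_a & heads_b) / len(heads_a | heads_b)

# Toy scores for a model with 4 layers and 8 heads per layer (hypothetical values).
scores = np.arange(32, dtype=float).reshape(4, 8)
same_pathway = pathway_overlap(scores, scores, k=5)       # identical inputs
disjoint_pathway = pathway_overlap(scores, -scores, k=5)  # reversed ranking
```

Under the hypothesis, inputs from the same representation subspace should score closer to `same_pathway` than to `disjoint_pathway`.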

Across XDomainBench splits and cross-dataset benchmarks (CommonsenseQA, Natural Questions, ARC), SubspacePath improves robustness-efficiency trade-offs under moderate and aggressive pruning.

Motivation & Core Insight

Problem

Global static pruning criteria often fail under scenario shifts, while router-heavy approaches add runtime complexity. We need pruning that is specialized, interpretable, and deployment-friendly.

Key Insight

Embedding-level domain axes can act as stable coordinates, and probe signals can map those coordinates to executable head pathways. This turns pruning from generic compression into scenario-conditioned compilation.

Figure. End-to-end SubspacePath pipeline: DBS builds domain axes and PSP compiles scenario masks.

What Actually Improves

Gains are strongest in OOD and cross-domain settings because scenario-conditioned masks reduce pathway interference. The effect comes from better pathway organization, not only parameter removal.

OOD Stability

  • Domain-axis conditioning keeps execution focused on heads that remain semantically aligned under shift.
  • Compared with global pruning, the mask is less sensitive to activation drift when input style changes.
  • This is why OOD recall rises consistently across backbones at moderate pruning levels.

Cross-Domain Robustness

  • Mixed-domain prompts trigger less interference because conflicting pathways are suppressed early.
  • Whitelist preservation keeps shared general reasoning capacity while specialized heads are selectively routed.
  • Result: better retention on Cross/NQ/ARC while still delivering deployment-friendly speedups.

Method

Method Overview (No Training During Deployment)

SubspacePath separates work into an offline preparation stage and a lightweight online compilation stage. The online stage uses only scenario-start input and does not run optimization.

Stage A: DBS (Domain-Basis Synthesis)

  • Build input-only domain pools from training-side selected-domain data.
  • Project embeddings to a compact shared space and synthesize domain axes.
  • Select a stable axis subset that balances separation and coverage.
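The three DBS steps above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the paper's procedure: the shared space is obtained with plain PCA (via SVD), each domain axis is the normalized centroid of that domain's centered embeddings, and "stable subset" selection is replaced by a simple greedy redundancy filter on cosine similarity. The names `synthesize_axes`, `select_axes`, and `max_cos` are hypothetical, and the filter only enforces separation; it does not model coverage.

```python
import numpy as np

def synthesize_axes(pools: dict, dim: int) -> dict:
    """pools maps domain name -> (n_i, D) embeddings. Returns one unit axis per domain."""
    all_emb = np.vstack(list(pools.values()))
    mean = all_emb.mean(axis=0)
    # PCA via SVD: rows of vt are principal directions of the centered embeddings.
    _, _, vt = np.linalg.svd(all_emb - mean, full_matrices=False)
    proj = vt[:dim].T  # (D, dim) projection into the compact shared space
    axes = {}
    for name, emb in pools.items():
        centroid = (emb - mean).mean(axis=0) @ proj
        axes[name] = centroid / (np.linalg.norm(centroid) + 1e-12)
    return axes

def select_axes(axes: dict, max_cos: float) -> list:
    """Greedy filter: keep an axis only if it is not nearly collinear with a kept axis."""
    kept = []
    for name in sorted(axes):
        if all(abs(axes[name] @ axes[k]) < max_cos for k in kept):
            kept.append(name)
    return kept

# Toy pools: domains a, b, c are distinct; d duplicates a and should be filtered out.
pools = {
    "a": np.array([[3.0, 0, 0, 0], [3.0, 0, 0, 0]]),
    "b": np.array([[0, 3.0, 0, 0], [0, 3.0, 0, 0]]),
    "c": np.array([[0, 0, 3.0, 0], [0, 0, 3.0, 0]]),
    "d": np.array([[3.0, 0, 0, 0], [3.0, 0, 0, 0]]),
}
axes = synthesize_axes(pools, dim=2)
selected = select_axes(axes, max_cos=0.9)
```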

Stage B: PSP (Probe-based Scenario Pruning)

  • Train layer-wise lightweight probes to read axis relevance from residual signals.
  • Cache domain-head importance and an always-keep whitelist of backbone heads.
  • At scenario start, estimate domain mixture and compile one reusable head mask under budget.
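The last bullet, compiling one head mask under a budget, can be sketched as a weighted top-k selection. The snippet assumes the domain mixture is a probability vector, the cached importance is a (domains x heads) array, and the budget is the total number of heads to keep, with whitelist heads counted against it. These conventions and the name `compile_head_mask` are assumptions for illustration; the paper's exact scoring rule may differ.

```python
import numpy as np

def compile_head_mask(mixture, importance, whitelist, budget):
    """Return a boolean keep-mask over heads.

    mixture:    (D,) non-negative domain weights summing to 1
    importance: (D, H) cached domain-head importance
    whitelist:  head indices that are always kept
    budget:     total number of heads to keep, whitelist included
    """
    mixture = np.asarray(mixture, dtype=float)
    importance = np.asarray(importance, dtype=float)
    scores = mixture @ importance  # (H,) mixture-weighted head importance
    mask = np.zeros(importance.shape[1], dtype=bool)
    keep_always = np.array(sorted(set(whitelist)), dtype=int)
    mask[keep_always] = True
    remaining = budget - keep_always.size
    if remaining < 0:
        raise ValueError("budget is smaller than the whitelist")
    if remaining > 0:
        candidates = np.flatnonzero(~mask)
        ranked = candidates[np.argsort(-scores[candidates], kind="stable")]
        mask[ranked[:remaining]] = True
    return mask

# Toy example: 2 domains, 4 heads, head 3 whitelisted, keep 2 heads in total.
mask = compile_head_mask(
    mixture=[0.75, 0.25],
    importance=[[0.9, 0.1, 0.2, 0.0], [0.0, 0.8, 0.1, 0.3]],
    whitelist=[3],
    budget=2,
)
```

Here head 3 survives because of the whitelist, and head 0 takes the one remaining slot because it has the highest mixture-weighted score.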

Why This Works

  • DBS gives a stable semantic coordinate system instead of ad-hoc global ranking.
  • PSP turns semantic alignment into executable pathway scoring.
  • Whitelist + budgeted mask keeps general capability while removing scenario-conflicting heads.
  • The resulting mask is reused over turns, so overhead is low and deployment-friendly.
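To see why a head mask can translate into real savings, note that zeroing a head's output is numerically equivalent to slicing that head out of the output projection. The sketch below checks this equivalence on random data. It assumes per-head attention outputs are already computed and uses illustrative names (`head_outputs`, `w_o`); a real backend would also skip the query/key/value computation for pruned heads, which this sketch does not show.

```python
import numpy as np

def masked_attention_output(head_outputs, w_o, mask):
    """Zero pruned heads, then apply the output projection.

    head_outputs: (T, H, d_h) per-head outputs, w_o: (H * d_h, d_model), mask: (H,) bool.
    """
    t, h, d_h = head_outputs.shape
    masked = head_outputs * mask[None, :, None]
    return masked.reshape(t, h * d_h) @ w_o

def pruned_attention_output(head_outputs, w_o, mask):
    """Physically drop pruned heads: slice head outputs and the matching rows of w_o."""
    t, h, d_h = head_outputs.shape
    kept = np.flatnonzero(mask)
    w_kept = w_o.reshape(h, d_h, -1)[kept].reshape(kept.size * d_h, -1)
    return head_outputs[:, kept, :].reshape(t, kept.size * d_h) @ w_kept

rng = np.random.default_rng(0)
head_outputs = rng.normal(size=(5, 4, 3))  # 5 tokens, 4 heads, head dim 3
w_o = rng.normal(size=(12, 6))             # projects 4 * 3 features to d_model = 6
mask = np.array([True, False, True, False])
same = np.allclose(
    masked_attention_output(head_outputs, w_o, mask),
    pruned_attention_output(head_outputs, w_o, mask),
)
```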

Step 1 · Offline Preparation

Construct domain pools, synthesize DBS axes, and train PSP probes on input-only data.

Output: reusable semantic basis + probe toolkit + cached head importance.
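As a rough picture of what a "lightweight probe" could look like, the sketch below fits one logistic-regression probe per layer that predicts whether an input is relevant to a given domain axis from a residual-stream feature vector. The probe family, the binary relevance labels, and the names `train_probe` and `train_layer_probes` are assumptions; the source only states that probes are layer-wise, lightweight, and trained on input-only data.

```python
import numpy as np

def _sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_probe(features, labels, lr=0.5, steps=500):
    """Fit a logistic-regression probe with gradient descent.

    features: (N, D) residual features, labels: (N,) in {0, 1}. Returns (w, b).
    """
    n, d = features.shape
    w, b = np.zeros(d), 0.0
    for _ in range(steps):
        grad = _sigmoid(features @ w + b) - labels  # gradient of the logistic loss wrt logits
        w -= lr * features.T @ grad / n
        b -= lr * grad.mean()
    return w, b

def probe_score(features, w, b):
    """Probability that each input is relevant to the probed axis."""
    return _sigmoid(features @ w + b)

def train_layer_probes(residuals_by_layer, labels):
    """One independent probe per layer, all sharing the same labels."""
    return [train_probe(x, labels) for x in residuals_by_layer]

# Toy data: relevance is decided by the sign of the first feature.
x = np.array([[2.0, 0.0], [1.5, 0.5], [-2.0, 0.0], [-1.5, -0.5]])
y = np.array([1.0, 1.0, 0.0, 0.0])
probes = train_layer_probes([x, 2.0 * x], y)
scores = probe_score(x, *probes[0])
```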

Step 2 · Scenario Compilation

Read the scenario-start input, infer the domain mixture, combine it with the cached importance, and compile a budgeted head mask.

Output: one scenario-level executable mask (`m_s`) with whitelist preserved.
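One possible way to write the compilation step down is given below. Apart from `m_s`, all notation is assumed for illustration: z_d is the probe score for domain axis d on the scenario-start input, tau is a temperature, I is the cached domain-head importance, W is the whitelist, and B is the head budget. The softmax mixture and the linear combination are plausible choices, not formulas confirmed by the source.

```latex
\pi_d = \frac{\exp(z_d / \tau)}{\sum_{d'=1}^{D} \exp(z_{d'} / \tau)},
\qquad
s_h = \sum_{d=1}^{D} \pi_d \, I_{d,h},
\qquad
m_s[h] =
\begin{cases}
1 & \text{if } h \in W \ \text{or}\ h \in \operatorname{TopK}_{B - |W|}\bigl(\{ s_{h'} : h' \notin W \}\bigr), \\
0 & \text{otherwise.}
\end{cases}
```

Under this reading, the mask keeps exactly B heads: every whitelisted head plus the B - |W| highest-scoring remaining heads.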

Step 3 · Multi-turn Reuse

Reuse the compiled mask for subsequent turns to avoid repeated optimization and keep per-turn overhead low.

Result: stable, efficient specialization for coherent multi-turn scenarios.
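The reuse pattern amounts to compiling once and caching the result for the rest of the scenario. The sketch below shows that control flow with a hypothetical `ScenarioSession` wrapper; the compile function is injected, so nothing here depends on how the mask is actually computed.

```python
from typing import Callable, Optional, Sequence, Tuple

class ScenarioSession:
    """Compile a head mask on the first turn and reuse it for every later turn."""

    def __init__(self, compile_mask: Callable[[str], Sequence[bool]]):
        self._compile_mask = compile_mask
        self._mask: Optional[Tuple[bool, ...]] = None
        self.compile_calls = 0  # counts how often the compiler actually ran

    def mask_for_turn(self, prompt: str) -> Tuple[bool, ...]:
        if self._mask is None:
            # Only the scenario-start input is used to compile the mask.
            self._mask = tuple(self._compile_mask(prompt))
            self.compile_calls += 1
        return self._mask

# Stand-in compiler for the example: keeps heads 0 and 2 regardless of input.
session = ScenarioSession(lambda prompt: [True, False, True, False])
masks = [session.mask_for_turn(p) for p in ["turn 1", "turn 2", "turn 3"]]
```

After three turns, `session.compile_calls` is still 1, which is the property that keeps per-turn overhead low.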

Main Results

LLaMA-2-13B (Moderate)

43.0 / 32.5 / 20.2 Recall (Selected/OOD/Cross)

Dense baseline: 29.6 / 26.1 / 18.4

LLaMA-2-13B (Aggressive)

34.7 / 30.4 / 22.9 Recall

Cross-domain remains above dense under heavier pruning.

Qwen2.5-14B (Moderate)

47.8 / 44.1 / 31.3 Recall

Strong cross-dataset retention on NQ and ARC.

Figure. Trade-off curve on Selected split (LLaMA-2-13B).
Figure. Trade-off curve on OOD split (LLaMA-2-13B).
Figure. Trade-off curve on Cross-domain split (LLaMA-2-13B).

Trade-off Insight Across Splits

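  • Selected split: SubspacePath holds the highest recall at moderate pruning (43.0 versus 33.82 for DaSS, the strongest baseline) and stays within 0.6 points of DaSS under aggressive pruning (34.7 versus 35.27), showing stronger in-domain head prioritization.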
  • OOD split: the margin versus generic pruning widens, indicating better resistance to domain mismatch.
  • Cross-domain split: gains remain under heavier pruning because mask compilation reduces cross-topic pathway collision.

Performance Interpretation

Gains are not only from removing heads. Scenario-conditioned masks suppress domain-conflicting pathways and preserve axis-coupled heads, reducing latent competition in residual aggregation.

This effect is strongest on distribution-shifted settings, where static global ranking methods are more brittle to scenario mismatch.

Main Result Table (LLaMA-2-13B, Full Baselines)

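Recall under moderate (Mod.) and aggressive (Agg.) pruning.

Method | Mod. Sel | Mod. OOD | Mod. Cross | Mod. CSQA | Mod. NQ | Mod. ARC | Agg. Sel | Agg. OOD | Agg. Cross | Agg. CSQA | Agg. NQ | Agg. ARC
Dense | 29.6 | 26.1 | 18.4 | 22.27 | 30.25 | 23.87 | 29.6 | 26.1 | 18.4 | 22.27 | 30.25 | 23.87
DaSS | 33.82 | 28.62 | 19.89 | 20.10 | 18.80 | 19.22 | 35.27 | 27.84 | 15.40 | 18.13 | 12.42 | 16.29
Wanda | 26.87 | 27.26 | 15.43 | 17.63 | 17.27 | 14.38 | 18.12 | 26.57 | 8.77 | 13.70 | 8.09 | 9.50
LLM-Pr. | 29.34 | 27.81 | 18.92 | 21.08 | 27.63 | 23.09 | 27.30 | 28.15 | 17.36 | 21.20 | 19.12 | 21.01
RIA | 27.06 | 27.03 | 15.12 | 17.98 | 17.24 | 14.33 | 18.94 | 26.69 | 8.52 | 12.48 | 8.06 | 9.75
Probe Pr. | 29.02 | 27.81 | 19.08 | 21.75 | 27.64 | 23.32 | 15.19 | 27.84 | 11.54 | 21.65 | 11.34 | 6.96
Ours-SubspacePath | 43.00 | 32.50 | 20.20 | 19.43 | 33.66 | 22.91 | 34.70 | 30.40 | 22.90 | 18.40 | 24.89 | 21.56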
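The table is easier to read as differences from the dense model. The short script below recomputes those differences from the Dense and Ours-SubspacePath rows above; the values are copied from the table, and the helper name `delta_vs_dense` is only for illustration.

```python
COLUMNS = ["Sel", "OOD", "Cross", "CSQA", "NQ", "ARC"]

# Rows copied from the LLaMA-2-13B main result table.
DENSE = dict(zip(COLUMNS, [29.6, 26.1, 18.4, 22.27, 30.25, 23.87]))
OURS_MODERATE = dict(zip(COLUMNS, [43.00, 32.50, 20.20, 19.43, 33.66, 22.91]))
OURS_AGGRESSIVE = dict(zip(COLUMNS, [34.70, 30.40, 22.90, 18.40, 24.89, 21.56]))

def delta_vs_dense(ours: dict) -> dict:
    """Recall difference (ours minus dense) per column, in points."""
    return {col: round(ours[col] - DENSE[col], 2) for col in COLUMNS}

moderate_delta = delta_vs_dense(OURS_MODERATE)
aggressive_delta = delta_vs_dense(OURS_AGGRESSIVE)

for name, delta in [("moderate", moderate_delta), ("aggressive", aggressive_delta)]:
    print(name, delta)
```

On these numbers the three XDomainBench splits improve over dense at both pruning levels, NQ improves only at moderate pruning, and CSQA and ARC sit below dense in both settings.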

Main Result Table (More Backbones, Moderate Pruning)

Backbone | XDB Selected (Dense/Ours) | XDB OOD (Dense/Ours) | XDB Cross (Dense/Ours) | NQ (Dense/Ours) | Speedup (Light/Agg.)
LLaMA-2-13B | 29.6 / 43.0 | 26.1 / 32.5 | 18.4 / 20.2 | 30.25 / 33.66 | 3.22 / 1.51
Qwen2.5-7B | 40.8 / 46.4 | 33.6 / 41.0 | 22.9 / 27.2 | 17.51 / 19.51 | 2.01 / 0.88
Qwen2.5-14B | 40.9 / 47.8 | 37.2 / 44.1 | 22.8 / 31.3 | 19.72 / 27.21 | 1.38 / 1.29

Table-Level Interpretation

The consistent pattern is that moderate pruning gives the best overall robustness-efficiency point, while aggressive pruning still preserves competitive recall in OOD and cross-domain settings. This indicates SubspacePath primarily reorganizes executable pathways rather than relying on fragile one-shot compression.

Efficiency & Ablation

Efficiency (Matched to Main Paper Table)

  • LLaMA-2-13B light pruning memory: 13.0 -> 12.1 GB, speedup 1.26x (XDomainBench avg).
  • LLaMA-2-13B aggressive pruning memory: 13.0 -> 11.7 GB, speedup 1.24x (XDomainBench avg).
  • ARC reaches strongest acceleration: 2.21x at light pruning.
  • Online compilation remains low-latency: 0.027s-0.068s across tested backbones.
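The memory figures above can be restated as relative reductions with a one-line calculation. The helper below is only arithmetic on the reported numbers; the function name is illustrative.

```python
def memory_reduction_pct(before_gb: float, after_gb: float) -> float:
    """Relative memory reduction in percent."""
    return 100.0 * (before_gb - after_gb) / before_gb

light = round(memory_reduction_pct(13.0, 12.1), 1)  # LLaMA-2-13B, 13.0 -> 12.1 GB
heavy = round(memory_reduction_pct(13.0, 11.7), 1)  # LLaMA-2-13B, 13.0 -> 11.7 GB
print(f"light: {light}%  heavy: {heavy}%")
```

This works out to roughly a 6.9% reduction for the 12.1 GB setting and 10.0% for the 11.7 GB setting.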

Ablation (Coupling Components)

Ablation | XDB Sel. | XDB OOD | XDB Cross | CSQA | NQ | ARC
Dense | 29.6 | 26.1 | 18.4 | 22.27 | 30.25 | 23.87
Full (DBS+PSP) | 34.7 | 30.4 | 22.9 | 19.43 | 33.66 | 25.62
w/o DBS selection | 1.8 | 1.4 | 0.9 | 0.82 | 0.40 | 1.16
w/o whitelist | 22.4 | 0.4 | 20.2 | 0.10 | 0.12 | 0.23
w/o multi-domain mixing | 29.5 | 23.5 | 19.0 | 7.07 | 10.68 | 17.04

Pruning Time and Speedup Summary

Model | Pruning Time (s) | Speedup (Light / Aggressive)
LLaMA-3.1-8B + Ours | 0.039 | 1.41 / 1.35
LLaMA-2-13B + Ours | 0.060 | 3.22 / 1.51
Qwen2.5-7B + Ours | 0.027 | 2.01 / 0.88
Qwen2.5-14B + Ours | 0.068 | 1.38 / 1.29

Efficiency Insight

The strongest practical property is the offline/online separation: heavy computation is amortized offline, while online compilation stays sub-0.1s. This directly matches multi-turn deployment where one mask is reused through a scenario.

Reported speedups are backend-sensitive, so the paper separates raw speedup from retention. This is why some settings still emphasize retention gains even when wall-clock speedup is moderate.
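The amortization argument can be made concrete with a small calculation. The compile time below is the largest value reported above (0.068s); the turn count of 10 is a hypothetical scenario length chosen only for illustration.

```python
def amortized_compile_ms(compile_seconds: float, turns: int) -> float:
    """Compilation cost per turn, in milliseconds, when one mask serves all turns."""
    if turns < 1:
        raise ValueError("a scenario needs at least one turn")
    return 1000.0 * compile_seconds / turns

single_turn = amortized_compile_ms(0.068, 1)   # the whole cost lands on one turn
ten_turns = amortized_compile_ms(0.068, 10)    # hypothetical 10-turn scenario
```

With these inputs the per-turn compilation overhead drops from 68 ms to about 6.8 ms, before counting any speedup from the pruned model itself.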

Case Studies

Case-Level Insights

Natural Questions

The compiled mask reduces off-topic continuation and keeps retrieval-oriented reasoning concise, which improves relevance under dataset shift.

OOD Biology Multi-turn

Subspace-conditioned pathways maintain multi-turn coherence: later responses stay aligned with earlier factual context instead of drifting into generic templates.

Cross-domain QA

In mixed philosophy/sociology prompts, interference control is visible as cleaner reasoning transitions between concepts that would otherwise activate competing heads.

Figure. Cross-dataset case (Natural Questions): better relevance and reduced drift.
Figure. OOD multi-turn biology case: pruned pathway preserves coherent reasoning.
Figure. Cross-domain case: scenario mask helps control mixed-domain interference.

Resources

Tutorial

A narrated, animated video tour of about 7 minutes covering the problem, the DBS + PSP method, and the results, built from the paper's own figures.

BibTeX

@inproceedings{gong2026subspacepathpruner,
  title   = {SubspacePath Pruner: Inference-time Pruning via Probe-based Representation-Parameter Coupling},
  author  = {Gong, Zhiren and Hou, Yikun and Wu, Fan and Wang, Che and Zhang, Fuyao and Wu, Tiantong and Hao, Yurong and Zhang, Jiaming and Duan, Yiyang and Wang, Tiantong and Huang, Fei and Yuen, Chau and Lim, Wei Yang Bryan},
  booktitle = {Forty-third International Conference on Machine Learning},
  year    = {2026}
}