Mechanistic interpretability · self-repair · circuit discovery

Conditional Co-Ablation

Recovering the self-repair backups a transformer circuit falls back on — the redundancy a first-order score is built to miss.

Zhiren Gong, Zihao Zeng, Chau Yuen, Wei Yang Bryan Lim
Nanyang Technological University, Singapore
Backup recovery ROC-AUC: single ablation 0.33 → CoAx (ours) 0.91
CoAx teaser
One score, five payoffs (GPT-2-small). (a) the backups hide in the first-order blind spot · (b) CoAx ranks them at the top · (c) they compensate for the primaries · (d) knockout matches the oracle while a first-order top-up overshoots · (e) repair-aware pruning wins from 124M to 7B.

A component's importance is not a property it carries alone. In a redundant circuit it becomes visible only relative to what has been removed — and that single shift is what CoAx measures, turning a blind spot into a signal.

01
The problem

Self-repair hides the components that matter

A behavior is localized to a circuit, and each component is scored by the effect of ablating it in isolation. That first-order weight is only meaningful if importance is additive. It stops being additive the moment a circuit is redundant.

Two wrong conclusions from one measurement

Remove a primary head and a dormant backup takes over. The output barely moves, so the primary reads as unimportant — the model quietly repaired the damage. And the backup was silent on the intact model, so it reads as unimportant too.

A single-ablation score therefore misreads both sides of the redundancy — and every tool built on it (attribution, knockout, pruning) inherits the error.
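A toy calculation makes the two misreadings concrete. The sketch below is illustrative only: the three head names, the margin values (3.0 and 2.5), and the `answer_margin` function are assumptions invented for the example, not measurements from any model.

```python
def answer_margin(ablated: frozenset) -> float:
    """Toy answer margin of a three-head circuit, given the set of ablated heads."""
    primary = 0.0 if "primary" in ablated else 3.0
    # Hypothetical self-repair: the backup is dormant while the primary runs
    # and wakes up (writing 2.5) only once the primary has been removed.
    backup_gain = 2.5 if "primary" in ablated else 0.0
    backup = 0.0 if "backup" in ablated else backup_gain
    # The "dead" head never writes anything, whether ablated or not.
    return primary + backup


clean = answer_margin(frozenset())


def drop(*heads: str) -> float:
    """Loss of margin when the given heads are ablated together."""
    return clean - answer_margin(frozenset(heads))


# First-order (single-ablation) score of each head.
first_order = {h: drop(h) for h in ("primary", "backup", "dead")}

# What an additive reading predicts for the pair, versus what actually happens.
additive_prediction = first_order["primary"] + first_order["backup"]
joint = drop("primary", "backup")

print(first_order)
print("additive prediction:", additive_prediction, "actual joint drop:", joint)
```

In this toy the primary and the backup each look close to harmless alone, and the backup is indistinguishable from the dead head, yet removing both erases the whole margin. The sum of first-order drops therefore underestimates the joint effect.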

Self-repair schematic
Ablating the primary path wakes a dormant backup that re-routes the answer.

First-order scoring can't tell a dormant backup from a dead head — both are silent. So we change the question.

02
The idea

Ask what grows once the primary is gone

Instead of a unit's effect in isolation, CoAx measures its effect after a primary seed S has been ablated, in the output distribution's own (Fisher) metric. The score is the growth of that ablation energy under conditioning:

$$\operatorname{comp}_u(S)\;=\;\underbrace{\mathcal{E}\!\big(\delta z_u \mid S\big)}_{\text{effect once }S\text{ is gone}}\;-\;\underbrace{\mathcal{E}\!\big(\delta z_u \mid \varnothing\big)}_{\text{effect alone (first order)}}$$
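As a minimal sketch of how this score could be evaluated, assume the energy is the quadratic form of the logit shift under the softmax Fisher metric, which in logit coordinates is diag(p) − p pᵀ and therefore equals the variance of the shift under p. The page does not spell out the exact form of the energy or the base point at which it is taken, so both choices below, along with the function names and the toy logits, are assumptions.

```python
import numpy as np


def softmax(z: np.ndarray) -> np.ndarray:
    z = z - z.max()  # stabilise the exponentials
    e = np.exp(z)
    return e / e.sum()


def fisher_energy(base_logits: np.ndarray, shifted_logits: np.ndarray) -> float:
    """dz^T (diag(p) - p p^T) dz at the base distribution p.

    This is the variance of the logit shift dz under p, so a constant
    shift of every logit (which leaves the distribution unchanged) costs nothing.
    """
    p = softmax(base_logits)
    dz = shifted_logits - base_logits
    mean = p @ dz
    return float(p @ (dz - mean) ** 2)


def comp(z_clean, z_u, z_seed, z_seed_and_u) -> float:
    """Conditional growth: energy of ablating u after S, minus energy of ablating u alone."""
    conditional = fisher_energy(z_seed, z_seed_and_u)
    first_order = fisher_energy(z_clean, z_u)
    return conditional - first_order


# Toy logits over a three-token vocabulary (made-up numbers).
z_clean = np.array([2.0, 0.0, 0.0])
z_seed = np.array([1.8, 0.0, 0.0])             # seed ablated, a backup has taken over
z_seed_and_backup = np.array([0.0, 0.0, 0.0])  # seed and backup both ablated

# Dormant backup: ablating it alone changes nothing, ablating it after S changes a lot.
backup_score = comp(z_clean, z_clean.copy(), z_seed, z_seed_and_backup)
# Dead head: ablating it changes nothing in either condition.
dead_score = comp(z_clean, z_clean.copy(), z_seed, z_seed.copy())

print(backup_score, dead_score)
```

Under these assumptions the energy is centered by construction, so only shifts that actually move the output distribution contribute to the score.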

Blind-spot-proof

A dormant backup and a dead head look identical on the clean model. CoAx separates them: the backup's effect grows once its primary is gone; the dead head's does not.

Cheap & label-free

No gradients, no task labels — one clean pass, one seed-ablated pass, one pass per candidate: O(#heads) forwards.

A completion, not a rewrite

The seed can come from any first-order method; CoAx completes that circuit by returning the backups it hides.
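The pass structure described in these three points can be sketched on a toy model. Everything here is hypothetical: `ToyModel`, its four heads, and the variance-under-softmax energy stand in for a real transformer and for whatever energy CoAx actually uses. The sketch also recomputes the first-order term itself, which costs a second forward per candidate; if that term is reused from the first-order method that produced the seed, the count falls back to one pass per candidate. Either way the cost is linear in the number of heads.

```python
import numpy as np

VOCAB = 4


class ToyModel:
    """Four-head toy with one dormant backup; counts forward passes."""

    def __init__(self) -> None:
        self.forwards = 0

    def logits(self, ablated: frozenset = frozenset()) -> np.ndarray:
        self.forwards += 1
        z = np.zeros(VOCAB)
        if "primary" not in ablated:
            z[0] += 3.0
        elif "backup" not in ablated:
            z[0] += 2.5  # the backup writes only when the primary is gone
        if "other" not in ablated:
            z[1] += 1.0
        return z  # the "dead" head never contributes


def energy(base: np.ndarray, shifted: np.ndarray) -> float:
    """Variance of the logit shift under softmax(base); an assumed stand-in energy."""
    e = np.exp(base - base.max())
    p = e / e.sum()
    dz = shifted - base
    return float(p @ (dz - p @ dz) ** 2)


def conditional_scores(model: ToyModel, seed: frozenset, candidates: list) -> dict:
    z_clean = model.logits()      # one clean pass
    z_seed = model.logits(seed)   # one seed-ablated pass
    scores = {}
    for u in candidates:
        z_u = model.logits(frozenset({u}))  # first-order term (reusable if already known)
        z_su = model.logits(seed | {u})     # candidate ablated on top of the seed
        scores[u] = energy(z_seed, z_su) - energy(z_clean, z_u)
    return scores


model = ToyModel()
scores = conditional_scores(model, frozenset({"primary"}), ["backup", "dead", "other"])
top = max(scores, key=scores.get)
print(top, scores, model.forwards)
```

The dormant backup and the dead head both score zero at first order in this toy, but only the backup's energy grows once the seed is ablated, which is what puts it at the top of the ranking.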

Synergy module and circuit re-wiring
What conditioning exploits. The pairwise interaction between heads is a whole matrix (b): the name-movers and their backups form a bright off-diagonal block — the self-repair module a per-head score (a) cannot see. Intact (c) the primaries write the answer while the backups stay dormant; ablate the primaries (d) and the backups wake and re-route to the logit.
Synergy matrix
The pairwise synergy over the IOI heads — the boxed name-mover / backup block.
Head map
The 12×12 head grid: first-order saliency vs. the CoAx score — the same inversion, in space.

The structure is there in the second-order signal. Does conditioning actually surface the documented backups?

03
Discovery

From the blind spot to the top of the ranking

On the GPT-2-small IOI circuit — the one with head-level backup ground truth — conditioning on the three name-mover primaries and ranking the other 141 heads by conditional growth recovers the eight documented backups that single-ablation saliency ranks below chance.

| score | order | backup ROC-AUC |
|---|---|---|
| single ablation | 1st | 0.33 |
| attribution patching (AtP) | 1st | 0.60 |
| EAP-IG | 1st | 0.70 |
| AtP* GradDrop | 1st | 0.82 |
| CoAx | 2nd | 0.91 |

Every additive score falls short, including those built for self-repair. The gap is not a matter of a smarter gradient; it lies in the node-additive form itself, in which a non-additive substitution cannot be expressed. The accompanying plot shows the same result, head by head.

Legend: ordinary head · name-mover (seed) · documented backup. Live plot from the released code, 48 IOI prompts.
Blind-spot separation
The same eight backups, two scores. Every head placed by first-order saliency (left) and by the CoAx score (right): the documented backups sit near the bottom on the left (AUC 0.33) and jump to the top on the right (AUC 0.91).
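For readers unfamiliar with the metric, ROC-AUC here can be read as the probability that a randomly chosen backup outscores a randomly chosen non-backup, so a value below 0.5 means the score ranks backups lower than chance would. The sketch below computes it from pairwise comparisons; the six scores and labels are invented for illustration and are not the page's data.

```python
def roc_auc(scores, is_positive) -> float:
    """Fraction of (positive, negative) pairs the score orders correctly; ties count half."""
    pos = [s for s, y in zip(scores, is_positive) if y]
    neg = [s for s, y in zip(scores, is_positive) if not y]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Six hypothetical heads, two of them labelled as backups.
is_backup = [False, False, True, False, True, False]

first_order_scores = [0.9, 0.4, 0.05, 0.02, 0.0, 0.6]  # backups are nearly silent
conditional_scores = [0.1, 0.0, 1.2, 0.05, 0.8, 0.2]   # backups grow under conditioning

auc_first_order = roc_auc(first_order_scores, is_backup)
auc_conditional = roc_auc(conditional_scores, is_backup)
print(auc_first_order, auc_conditional)
```

The same labels give a below-chance AUC under one score and a perfect one under the other, which is the kind of inversion the two panels show.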

A high rank is necessary, not sufficient. Are the surfaced heads mechanistically backups?

04
Mechanism

They wake up — and the wake-up is causal

The recovered heads behave like backups on three independent, label-free signals, and a counterfactual patch closes the causal loop.

Wake-up curves and patching
Progressively ablating the primaries. The backups grow in output norm (a) and in causal effect on the answer (b) while matched random heads stay flat; the answer's direct logit attribution hands off to them (c); and freezing their dormant activations removes 55% of the self-repair (d) — freezing random heads removes none.
One backup head
Head [10,6], traced end to end.

One head, four views

A single documented backup, [10,6], is silent on the clean model, yet at the prediction position it already reads the indirect-object name — the defining name-mover behavior.

Once the primaries are ablated it becomes load-bearing, and CoAx ranks it among the top backups. Invisible first-order, structurally a name-mover, causal only under conditioning.

The recovered circuit, three views — the primary write-path (orange) and the dormant backup route (blue) CoAx adds in parallel.

Functional circuit
Head-level re-wiring
Token×layer information routes

If they are load-bearing, they should change the tools that depend on them.

05
Closing the loop · downstream applications

The blind spot corrupts the tools built on it — CoAx repairs them

Attribution, capability knockout, and structured pruning all rank components by the same node-additive score, so each inherits the same error wherever a circuit is redundant. Given the recovered backups, which are obtained in one label-free pass, all three are repaired.

Attribution recovered: 1.76, the effect self-repair masked (vs 0.22 from the primaries alone)

Knockout accuracy: 0.70, matches the 0.72 documented-backup oracle; a first-order top-up overshoots

Pruned perplexity @50%: 80.6, vs gradient-Taylor 201, from 124M to 7B

Capability knockout · a knockout needs the backups

[Interactive panel: IOI answer margin (% of intact) and task accuracy (% correct) per ablation state; both read 100% in the state captured here.]

Repair-aware pruning · the removal order that keeps backups

Pruning across scales
WikiText-2 perplexity vs. heads pruned, 124M–7B. Re-measuring the score as heads are removed gives a repair-aware order (blue) that stays nearest the dense model at every scale, dominating random, magnitude, Wanda, Taylor, and the static co-ablation order.
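The difference between a static removal order and a re-measured one can be sketched on a toy circuit. This is an illustration under stated assumptions, not the released pruning code: the `quality` function, the four heads, and their contributions are invented, and plain re-measured ablation damage stands in for the score that the actual method re-measures.

```python
HEADS = ["primary", "backup", "minor_a", "minor_b"]


def quality(removed: frozenset) -> float:
    """Toy model quality after removing a set of heads (higher is better)."""
    if "primary" not in removed:
        q = 3.0
    elif "backup" not in removed:
        q = 2.75  # self-repair: the backup covers most of the primary's role
    else:
        q = 0.0
    if "minor_a" not in removed:
        q += 0.5
    if "minor_b" not in removed:
        q += 1.0
    return q


def damage(head: str, removed: frozenset) -> float:
    """Quality lost by removing `head` on top of what is already removed."""
    return quality(removed) - quality(removed | {head})


def static_order(heads: list) -> list:
    """Score every head once on the dense model, then remove in ascending order."""
    return sorted(heads, key=lambda h: damage(h, frozenset()))


def repair_aware_order(heads: list) -> list:
    """Re-measure every remaining head after each removal; always remove the cheapest."""
    removed, order, remaining = frozenset(), [], list(heads)
    while remaining:
        cheapest = min(remaining, key=lambda h: damage(h, removed))
        order.append(cheapest)
        remaining.remove(cheapest)
        removed = removed | {cheapest}
    return order


def loss_after(order: list, k: int) -> float:
    """Quality lost once the first k heads of an order have been removed."""
    return quality(frozenset()) - quality(frozenset(order[:k]))


static = static_order(HEADS)
aware = repair_aware_order(HEADS)
print(static, loss_after(static, 2))
print(aware, loss_after(aware, 2))
```

At 50% sparsity the static order has removed both the backup and the primary it covers, because each looked cheap on the dense model. The re-measured order sees the primary become expensive as soon as its backup is gone and keeps it until last.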

One circuit, one model — does the phenomenon hold more broadly?

06
Generalization & scope

Label-free, across scale and architecture

The same label-free pipeline replicates the backup signature across the GPT-2 family and completes a second redundant circuit — induction — on eight further models spanning six architecture families.

Generalization
Two axes of generalization. (a) the discovered IOI backups wake up under primary ablation across GPT-2 small/medium/large; (b) label-free induction completion drops the log-probability far more than matched-random on eight models; (c) a stronger "+own" control confirms the recovered set is load-bearing.

Where it applies — and where it doesn't

CoAx wins on the output-movement circuits whose heads share aligned write directions. Its scope has clear limits. Where redundancy is instead shared among co-firing heads, input-side co-activation already finds them and CoAx is complementary. On the MLP-dominated greater-than circuit the head-level signal does not transfer, which is a property of the circuit rather than of the score.

Cross-model geometry
Output- vs. input-side geometry across twelve models. Blue = CoAx's co-ablation beats input-side co-activation at recovering that circuit's redundancy; orange = co-activation wins.
Blind spot across GPT-2
The first-order blind spot recurs across the whole GPT-2 family.
Cross-architecture induction
Induction completion transfers across six architecture families.
Under the hood

What makes the score work — and what it costs

Two design choices carry it — conditioning and centering — and the price is forward passes only: no gradients, no labels.

Method analysis
Data-efficient (strong at 32 prompts); centering is the load-bearing ingredient; and the score is invariant to whether a backup writes the answer direction.
Quality vs. cost
Quality vs. compute: the conditional route reaches the second-order signal at O(#heads), far left of the explicit pairwise wall.
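A quick count shows the scale of that gap for the GPT-2-small setting described earlier (a 12×12 head grid with three name-mover seeds). The conditional count follows the page's own accounting of one clean pass, one seed-ablated pass, and one pass per candidate. The pairwise count assumes one joint-ablation forward per unordered head pair, which is one plausible reading of the explicit pairwise route rather than a figure stated on the page.

```python
from math import comb

n_heads = 12 * 12                 # 144 attention heads in the 12x12 grid
seed_size = 3                     # the three name-mover primaries
candidates = n_heads - seed_size  # the other 141 heads

# Conditional route: clean pass + seed-ablated pass + one pass per candidate.
conditional_forwards = 1 + 1 + candidates

# Explicit pairwise route (assumed): one joint ablation per unordered pair of heads.
pairwise_forwards = comb(n_heads, 2)

print(conditional_forwards, pairwise_forwards, pairwise_forwards / conditional_forwards)
```

Under these assumptions the pairwise matrix costs 72 times as many forwards on this model, and the ratio grows linearly with the number of heads.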
Resources

Paper, code, video, and citation

BibTeX
@inproceedings{gong2026coax,
  title     = {Conditional Co-Ablation: Recovering Self-Repair Backups in Transformer Circuits},
  author    = {Gong, Zhiren and Zeng, Zihao and Yuen, Chau and Lim, Wei Yang Bryan},
  year      = {2026},
  note      = {Project page: https://gongzhiren.github.io/Conditional-Co-Ablation-website}
}