Conditional Co-Ablation (CoAx) — Recovering Self-Repair Backups in Transformer Circuits

01

The problem

Self-repair hides the components that matter

A behavior is localized to a circuit, and each component is scored by the effect of ablating it in isolation. That first-order weight is only meaningful if importance is additive. It stops being additive the moment a circuit is redundant.

Two wrong conclusions from one measurement

Remove a primary head and a dormant backup takes over. The output barely moves, so the primary reads as unimportant — the model quietly repaired the damage. And the backup was silent on the intact model, so it reads as unimportant too.

A single-ablation score therefore misreads both sides of the redundancy — and every tool built on it (attribution, knockout, pruning) inherits the error.

Self-repair schematic — Ablating the primary path wakes a dormant backup that re-routes the answer.

First-order scoring can't tell a dormant backup from a dead head — both are silent. So we change the question.

02

The idea

Ask what grows once the primary is gone

Instead of a unit's effect in isolation, CoAx measures its effect after a primary seed S has been ablated, in the output distribution's own (Fisher) metric. The score is the growth of that ablation energy under conditioning:

$$\operatorname{comp}_u(S)\;=\;\underbrace{\mathcal{E}\!\big(\delta z_u \mid S\big)}_{\text{effect once }S\text{ is gone}}\;-\;\underbrace{\mathcal{E}\!\big(\delta z_u \mid \varnothing\big)}_{\text{effect alone (first order)}}$$
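One way to make the energy term concrete, as a sketch only: assume (this is not stated above) that $\mathcal{E}$ is the quadratic form of a logit shift $\delta z$ under the Fisher information of a softmax output $p$, and that $\delta z_u \mid S$ is the logit change from additionally ablating $u$ on the $S$-ablated model, with $p$ evaluated there. Under those assumptions:

$$\mathcal{E}(\delta z)\;=\;\delta z^{\top}\big(\operatorname{diag}(p)-p\,p^{\top}\big)\,\delta z\;=\;\sum_i p_i\,\delta z_i^{2}-\Big(\sum_i p_i\,\delta z_i\Big)^{2}\;=\;\operatorname{Var}_{i\sim p}\big(\delta z_i\big)$$

Read this way, the energy is the variance of the logit shift under the output distribution, so a shift that adds the same constant to every logit carries zero energy, and a unit that leaves the logits unchanged scores exactly zero in both terms.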

Blind-spot-proof

A dormant backup and a dead head look identical on the clean model. CoAx separates them: the backup's effect grows once its primary is gone; the dead head's does not.

Cheap & label-free

No gradients, no task labels — one clean pass, one seed-ablated pass, one pass per candidate: O(#heads) forwards.

A completion, not a rewrite

The seed can come from any first-order method; CoAx completes that circuit by returning the backups it hides.
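The three properties above can be illustrated on a toy model. This is a minimal sketch, not the released implementation: `toy_forward`, its four head names, and the variance-under-softmax form of the energy are all assumptions made for illustration, and the loop spends two forwards per candidate for clarity, which keeps the count at O(#heads).

```python
import numpy as np

ANSWER = np.array([1.0, 0.0, 0.0])      # toy "answer" logit direction
DISTRACTOR = np.array([0.0, 0.5, 0.0])  # unrelated direction written by "other"

def toy_forward(ablated):
    """Hypothetical 4-head model: returns logits over 3 tokens."""
    z = np.zeros(3)
    if "primary" not in ablated:
        z += 2.0 * ANSWER
    elif "backup" not in ablated:
        z += 1.8 * ANSWER   # self-repair: the backup writes only when the primary is gone
    if "other" not in ablated:
        z += DISTRACTOR
    return z                # the "dead" head never writes anything

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def fisher_energy(delta_z, p):
    """Assumed energy: variance of the logit shift under the distribution p."""
    mean = np.dot(p, delta_z)
    return float(np.dot(p, (delta_z - mean) ** 2))

def coax_scores(forward, heads, seed):
    """Conditional growth per candidate: energy given the seed minus energy alone."""
    seed = frozenset(seed)
    z_clean = forward(frozenset())   # one clean pass
    z_seed = forward(seed)           # one seed-ablated pass
    p_clean, p_seed = softmax(z_clean), softmax(z_seed)
    scores = {}
    for u in heads:
        if u in seed:
            continue
        alone = fisher_energy(forward(frozenset({u})) - z_clean, p_clean)
        given_seed = fisher_energy(forward(seed | {u}) - z_seed, p_seed)
        scores[u] = given_seed - alone
    return scores

scores = coax_scores(toy_forward, ["primary", "backup", "dead", "other"], seed={"primary"})
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking, scores)
```

In this toy setup the backup and the dead head are both silent on the clean model, yet only the backup's score grows under conditioning; the dead head scores exactly zero.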

Synergy module and circuit re-wiring — **What conditioning exploits.** The pairwise interaction between heads is a whole matrix (b): the name-movers and their backups form a bright off-diagonal block — the **self-repair module** a per-head score (a) cannot see. Intact (c) the primaries write the answer while the backups stay dormant; ablate the primaries (d) and the backups wake and re-route to the logit.

Synergy matrix — The pairwise synergy over the IOI heads — the boxed name-mover / backup block.

Head map — The 12×12 head grid: first-order saliency vs. the CoAx score — the same inversion, in space.

The structure is there in the second-order signal. Does conditioning actually surface the documented backups?

03

Discovery

From the blind spot to the top of the ranking

On the GPT-2-small IOI circuit — the one with head-level backup ground truth — conditioning on the three name-mover primaries and ranking the other 141 heads by conditional growth recovers the eight documented backups that single-ablation saliency ranks below chance.

score	order	backup ROC-AUC
single ablation	1st	0.33
attribution patching (AtP)	1st	0.60
EAP-IG	1st	0.70
AtP* GradDrop	1st	0.82
CoAx	2nd	0.91
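For readers unfamiliar with the metric, here is a small sketch of how a backup ROC-AUC can be computed from per-head scores. The six scores and labels below are made up for illustration and are not the paper's data; the function uses the standard pairwise (Mann-Whitney) definition of AUC.

```python
import numpy as np

def roc_auc(scores, is_backup):
    """Probability that a random backup outscores a random non-backup (ties count 1/2)."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(is_backup, dtype=bool)
    pos, neg = scores[labels], scores[~labels]
    if len(pos) == 0 or len(neg) == 0:
        raise ValueError("need at least one backup and one non-backup")
    wins = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return float((wins + 0.5 * ties) / (len(pos) * len(neg)))

# Hypothetical scores for six heads; the last two are labeled as backups.
labels = [0, 0, 0, 0, 1, 1]
first_order = [0.9, 0.4, 0.3, 0.2, 0.05, 0.01]  # silent backups sink to the bottom
conditional = [0.1, 0.0, 0.2, 0.1, 0.8, 0.7]    # backups grow once the seed is gone

print(roc_auc(first_order, labels))  # 0.0
print(roc_auc(conditional, labels))  # 1.0
```

An AUC of 0.5 is chance, so a value below 0.5 means the score ranks the backups systematically below the other heads.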

Every additive score falls short, including those built for self-repair. The gap is not a matter of a smarter gradient: a node-additive score cannot express a non-additive substitution, in which one head's contribution depends on whether another head is present. The accompanying plot shows the same result, head by head.

Legend: ordinary head · name-mover (seed) · ★ documented backup. Live plot from the released code, 48 IOI prompts.

Blind-spot separation — **The same eight backups, two scores.** Every head placed by first-order saliency (left) and by the CoAx score (right): the documented backups sit near the bottom on the left (AUC 0.33) and jump to the top on the right (AUC 0.91).

A high rank is necessary, not sufficient. Are the surfaced heads mechanistically backups?

04

Mechanism

They wake up — and the wake-up is causal

The recovered heads behave like backups on three independent, label-free signals, and a counterfactual patch closes the causal loop.

Wake-up curves and patching — **Progressively ablating the primaries.** The backups grow in output norm (a) and in causal effect on the answer (b) while matched random heads stay flat; the answer's direct logit attribution hands off to them (c); and freezing their dormant activations removes **55%** of the self-repair (d) — freezing random heads removes none.

One backup head — Head [10,6], traced end to end.

One head, four views

A single documented backup, [10,6], is silent on the clean model, yet at the prediction position it already reads the indirect-object name — the defining name-mover behavior.

Once the primaries are ablated it becomes load-bearing, and CoAx ranks it among the top backups. Invisible first-order, structurally a name-mover, causal only under conditioning.

The recovered circuit, three views — the primary write-path (orange) and the dormant backup route (blue) CoAx adds in parallel.

If they are load-bearing, they should change the tools that depend on them.

05

Closing the loop · downstream applications

The blind spot corrupts the tools built on it — CoAx repairs them

Attribution, capability knockout, and structured pruning all rank components by the same node-additive score, so each inherits the same error wherever a circuit is redundant. Fed the recovered backups — one label-free pass — all three are repaired.

1.76 · Attribution recovered: the effect self-repair masked (vs 0.22 from the primaries alone)

0.70 · Knockout accuracy: matches the 0.72 documented-backup oracle; a first-order top-up overshoots

80.6 · Pruned perplexity @50%: vs gradient-Taylor 201, from 124M to 7B

Capability knockout · a knockout needs the backups. Interactive panel reporting IOI answer margin (% of intact) and task accuracy (% correct) for each ablation state; both read 100% in the initial state.

Repair-aware pruning · the removal order that keeps backups

Pruning across scales — **WikiText-2 perplexity vs. heads pruned, 124M–7B.** Re-measuring the score as heads are removed gives a repair-aware order (blue) that stays nearest the dense model at every scale, dominating random, magnitude, Wanda, Taylor, and the static co-ablation order.
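The difference between a static order and a re-measured one can be sketched on a toy loss. Everything here is hypothetical: `toy_loss`, the four head names, and the use of plain loss increase as the score (rather than the CoAx score) are assumptions chosen to show the mechanism, not the paper's pruning procedure.

```python
def toy_loss(removed):
    """Hypothetical loss: A and B are a redundant pair, C and D are independent."""
    removed = set(removed)
    loss = 0.0
    if {"A", "B"} <= removed:   # either head alone carries the behavior
        loss += 10.0
    if "C" in removed:
        loss += 1.0
    if "D" in removed:
        loss += 0.5
    return loss

def static_order(heads, loss_fn):
    """Score each head once on the intact model, then remove in that fixed order."""
    base = loss_fn(())
    return sorted(heads, key=lambda h: loss_fn((h,)) - base)

def repair_aware_order(heads, loss_fn):
    """Re-measure every remaining head after each removal."""
    removed, remaining = [], list(heads)
    while remaining:
        base = loss_fn(removed)
        nxt = min(remaining, key=lambda h: loss_fn(removed + [h]) - base)
        removed.append(nxt)
        remaining.remove(nxt)
    return removed

heads = ["A", "B", "C", "D"]
static = static_order(heads, toy_loss)
dynamic = repair_aware_order(heads, toy_loss)
print(static, toy_loss(static[:2]))    # ['A', 'B', 'D', 'C'] 10.0
print(dynamic, toy_loss(dynamic[:2]))  # ['A', 'D', 'C', 'B'] 0.5
```

The static order removes both members of the redundant pair first because each looks free in isolation; the re-measured order sees that B became load-bearing once A was gone and keeps it until last. The naive loop above costs a quadratic number of evaluations in the head count.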

One circuit, one model — does the phenomenon hold more broadly?

06

Generalization & scope

Label-free, across scale and architecture

The same label-free pipeline replicates the backup signature across the GPT-2 family and completes a second redundant circuit — induction — on eight further models spanning six architecture families.

Generalization — **Two axes of generalization.** (a) the discovered IOI backups wake up under primary ablation across GPT-2 small/medium/large; (b) label-free induction completion drops the log-probability far more than matched-random on eight models; (c) a stronger "+own" control confirms the recovered set is load-bearing.

Where it applies — and where it doesn't

CoAx wins on output-movement circuits whose heads share aligned write directions. Its scope has limits. Where redundancy is instead shared among co-firing heads, input-side co-activation already finds them, and CoAx is complementary. On the MLP-dominated greater-than circuit, the head-level signal does not transfer, which is a property of the circuit rather than of the score.

Cross-model geometry — **Output- vs. input-side geometry across twelve models.** Blue = CoAx's co-ablation beats input-side co-activation at recovering that circuit's redundancy; orange = co-activation wins.

Blind spot across GPT-2 — The first-order blind spot recurs across the whole GPT-2 family.

Cross-architecture induction — Induction completion transfers across six architecture families.

Under the hood

What makes the score work — and what it costs

Two design choices carry it — conditioning and centering — and the price is forward passes only: no gradients, no labels.

Method analysis — Data-efficient (strong at 32 prompts); centering is the load-bearing ingredient; and the score is invariant to whether a backup writes the answer direction.

Quality vs. cost — Quality vs. compute: the conditional route reaches the second-order signal at O(#heads), far left of the explicit pairwise wall.
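As a rough worked count for the GPT-2-small setting described earlier (a 12×12 head grid with three name-mover primaries as the seed), the sketch below compares forwards per evaluation batch. The pairwise figure counts only joint ablations of unordered head pairs, which is an assumption about what the explicit pairwise route would need at minimum.

```python
from math import comb

n_heads = 12 * 12                 # GPT-2 small head grid
n_seed = 3                        # name-mover primaries used as the seed
n_candidates = n_heads - n_seed   # heads ranked by conditional growth

# one clean pass + one seed-ablated pass + one pass per candidate
conditional_forwards = 1 + 1 + n_candidates
# one joint ablation per unordered pair of heads
pairwise_forwards = comb(n_heads, 2)

print(n_candidates, conditional_forwards, pairwise_forwards)  # 141 143 10296
```

The conditional route grows linearly with the number of heads, while the pairwise count grows quadratically.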

Resources

Paper, code, video, and citation

Paper: arXiv preprint
Code: reference implementation + one-command reproduction
Tutorial: a narrated ~4½-minute video tour
BibTeX: cite this work

BibTeX

@inproceedings{gong2026coax,
  title     = {Conditional Co-Ablation: Recovering Self-Repair Backups in Transformer Circuits},
  author    = {Gong, Zhiren and Zeng, Zihao and Yuen, Chau and Lim, Wei Yang Bryan},
  year      = {2026},
  note      = {Project page: https://gongzhiren.github.io/Conditional-Co-Ablation-website}
}