1
00:00:00,600 --> 00:00:13,176
Let's understand Conditional Co-Ablation, or CoAx: a method for finding the hidden backup parts of a neural network's reasoning. We'll build it up from the very beginning, assuming no background in interpretability.

2
00:00:13,516 --> 00:00:21,604
Along the way we'll see why the standard way of measuring importance has a blind spot — and how one change to the question fixes it.

3
00:00:22,354 --> 00:00:28,258
First, what are we even looking at? A language model reads text and predicts the next word.

4
00:00:28,598 --> 00:00:39,134
Mechanistic interpretability is the effort to reverse-engineer how it does that — not just what it answers, but the internal computation that produces the answer.

5
00:00:39,474 --> 00:00:50,346
The model is built from small, repeated parts. The ones we'll focus on are attention heads — each a tiny unit that moves and mixes information between words.

6
00:00:50,686 --> 00:01:02,398
For any one task, only a handful of these heads actually do the work. That small, connected set of components is called a circuit — the model's wiring for a single behavior.

7
00:01:03,148 --> 00:01:09,196
How do researchers find a circuit? With a simple, powerful tool: ablation.

8
00:01:09,536 --> 00:01:18,440
You switch off one component and watch how much the answer changes. A big drop in the model's confidence in the correct answer means the component mattered; a small drop means it didn't.

9
00:01:18,780 --> 00:01:28,764
Do this for every head, rank them by their effect, and the important ones light up. This single-ablation score is the workhorse of the whole field.
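The single-ablation procedure just described can be sketched in a few lines. This is a minimal illustration, not the code used in the video: `toy_output` is a hypothetical stand-in for a real forward pass, and the head names and contribution values are invented for the example.

```python
from typing import Callable, Dict, FrozenSet, List

def toy_output(ablated: FrozenSet[str]) -> float:
    """Hypothetical forward pass: score for the correct answer with some heads switched off."""
    contributions = {"head_a": 2.0, "head_b": 0.5, "head_c": 0.0}  # made-up values
    return sum(v for name, v in contributions.items() if name not in ablated)

def single_ablation_scores(
    output_fn: Callable[[FrozenSet[str]], float], heads: List[str]
) -> Dict[str, float]:
    """Importance of each head = drop in the output when that head alone is ablated."""
    baseline = output_fn(frozenset())
    return {h: baseline - output_fn(frozenset({h})) for h in heads}

heads = ["head_a", "head_b", "head_c"]
scores = single_ablation_scores(toy_output, heads)
ranking = sorted(heads, key=lambda h: scores[h], reverse=True)
print(ranking)  # ['head_a', 'head_b', 'head_c']
```

In this toy model every head contributes independently, which is exactly the assumption that self-repair breaks.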

10
00:01:29,104 --> 00:01:39,736
Here's a concrete circuit. On a name-copying task, a few heads read the correct name and write it to the output. Ablate them, and the model should fail.

11
00:01:40,486 --> 00:01:46,774
But something strange happens. You delete an important head — and the answer barely changes.

12
00:01:47,114 --> 00:01:57,866
The reason is self-repair. The moment a key component is removed, another component that was sitting silent wakes up, takes over its job, and re-routes the answer.

13
00:01:58,206 --> 00:02:08,502
This is redundancy. The network keeps dormant backups and heals itself under damage — a little like a power grid rerouting around a failed line.

14
00:02:08,842 --> 00:02:17,578
Now watch what that does to our measurement. The primary head, when deleted, looks unimportant — because the backup hid the damage.

15
00:02:17,918 --> 00:02:28,646
And the backup looks unimportant too, because in the intact model it was silent. A single-ablation score misreads both sides of the redundancy at once.

16
00:02:29,396 --> 00:02:38,252
The effect of deleting a group is not the sum of the effects of deleting its parts. This is a genuine blind spot, and it is not just a curiosity.

17
00:02:38,592 --> 00:02:49,464
Interpretability isn't only about understanding. It's used to attribute behavior to components, to knock out unwanted capabilities, and to prune models down to a smaller size.

18
00:02:49,804 --> 00:02:58,372
Every one of those tools is built on the importance score. When the score is blind to backups, all of them inherit the same blindness.

19
00:02:58,712 --> 00:03:08,456
For safety this is the sharp edge: try to remove a capability by deleting the heads that seem to carry it, and a dormant backup can quietly restore it.

20
00:03:09,206 --> 00:03:13,982
So CoAx changes the question. Don't ask how important a unit is on its own.

21
00:03:14,322 --> 00:03:21,330
Ask a conditional one: once the main circuit is already removed, how much does this unit's effect grow?

22
00:03:21,670 --> 00:03:34,270
A true backup does almost nothing by itself — its effect alone is near zero. But once its partner is gone, its effect becomes large. That growth is the CoAx score.

23
00:03:34,610 --> 00:03:48,386
It needs no labels and no gradients — only forward passes that measure the change in the output. Silent alone, load-bearing once its primary is gone: that is exactly the signature of a backup.
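One way to read the conditional question as arithmetic: measure a unit's ablation effect on the intact model, measure it again with the primaries already removed, and take the difference. The sketch below assumes that reading; the exact formula in the paper may differ. `toy_output` is a hypothetical model with one primary head, one dormant backup, and one unrelated bystander, and all values are invented.

```python
from typing import FrozenSet

PRIMARY = "primary"

def toy_output(ablated: FrozenSet[str]) -> float:
    """Hypothetical forward pass with self-repair: the backup fires only if the primary is gone."""
    score = 0.0
    if PRIMARY not in ablated:
        score += 2.0   # primary does the job
    elif "backup" not in ablated:
        score += 1.5   # backup wakes up and partially restores the answer
    if "bystander" not in ablated:
        score += 0.25  # unrelated head with a small, constant contribution
    return score

def effect(unit: str, already_ablated: FrozenSet[str]) -> float:
    """Drop in the output from ablating `unit` on top of `already_ablated`."""
    return toy_output(already_ablated) - toy_output(already_ablated | {unit})

def conditional_growth(unit: str, primaries: FrozenSet[str]) -> float:
    """Assumed score: effect with the primaries removed minus effect on the intact model."""
    return effect(unit, primaries) - effect(unit, frozenset())

primaries = frozenset({PRIMARY})
growth = {u: conditional_growth(u, primaries) for u in ["backup", "bystander"]}
print(growth)  # {'backup': 1.5, 'bystander': 0.0}
```

The backup scores high because its effect grows from zero to 1.5, while the bystander scores zero because its effect is the same in both conditions. Only forward passes are used.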

24
00:03:49,136 --> 00:03:54,872
For the simplest case — a pair of heads — there is a companion signal called synergy.

25
00:03:55,212 --> 00:04:07,356
Ablate each head alone and little happens. Ablate them together, and the effect is far larger than the sum of the two individual effects. That gap flags a mutually compensating pair.
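The gap described here can be written as the joint effect minus the sum of the individual effects. The following sketch assumes that definition of synergy and uses a hypothetical `toy_output` in which `head_a` and `head_b` can each do the job alone; the head names and numbers are invented for illustration.

```python
from itertools import combinations
from typing import Dict, FrozenSet, Iterable, Tuple

def toy_output(ablated: FrozenSet[str]) -> float:
    """Hypothetical forward pass: head_a and head_b compensate for each other."""
    score = 0.0
    if "head_a" not in ablated or "head_b" not in ablated:
        score += 2.0   # either head of the pair is enough
    if "head_c" not in ablated:
        score += 0.25  # independent head
    return score

def effect(units: Iterable[str]) -> float:
    """Drop in the output when all of `units` are ablated together."""
    return toy_output(frozenset()) - toy_output(frozenset(units))

def synergy(a: str, b: str) -> float:
    """Joint ablation effect minus the sum of the two single-ablation effects."""
    return effect({a, b}) - (effect({a}) + effect({b}))

heads = ["head_a", "head_b", "head_c"]
pair_synergy: Dict[Tuple[str, str], float] = {
    pair: synergy(*pair) for pair in combinations(heads, 2)
}
print(pair_synergy)
# {('head_a', 'head_b'): 2.0, ('head_a', 'head_c'): 0.0, ('head_b', 'head_c'): 0.0}
```

Only the compensating pair gets a nonzero value; pairs whose effects simply add up score zero.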

26
00:04:07,696 --> 00:04:19,720
Measure this across every pair of heads, and a hidden structure appears: the main heads and their backups, wired in parallel — a module a one-head-at-a-time score can never see.

27
00:04:20,470 --> 00:04:27,886
Now, the evidence. We start where we actually have ground truth — a circuit with eight documented backup heads.

28
00:04:28,226 --> 00:04:39,098
We rank all the heads by each method, and check how well the known backups rise to the top. A score of one is a perfect ranking; a half is random guessing.
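The narration does not name the ranking metric. One common metric with the stated properties (1.0 for a perfect ranking, 0.5 for random guessing) is the area under the ROC curve, so the sketch below assumes that choice. The scores and head labels are invented, and `ranking_auc` is a hypothetical helper name.

```python
from typing import Dict, Set

def ranking_auc(scores: Dict[str, float], positives: Set[str]) -> float:
    """Fraction of (known backup, other head) pairs in which the backup scores higher.

    Ties count as half a win. This equals the area under the ROC curve.
    """
    pos = [s for h, s in scores.items() if h in positives]
    neg = [s for h, s in scores.items() if h not in positives]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative head")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

known_backups = {"h1", "h2"}  # made-up ground truth
good_scores = {"h1": 0.9, "h2": 0.8, "h3": 0.1, "h4": 0.0}  # backups ranked on top
bad_scores = {"h1": 0.0, "h2": 0.1, "h3": 0.8, "h4": 0.9}   # backups ranked last

print(ranking_auc(good_scores, known_backups))  # 1.0
print(ranking_auc(bad_scores, known_backups))   # 0.0
```

A value below 0.5 means the method ranks the known backups lower than a random ordering would.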

29
00:04:39,438 --> 00:04:49,614
Single-ablation scoring sits below chance, at 0.33 — worse than a coin flip, because self-repair actively hides the backups.

30
00:04:49,954 --> 00:05:02,458
Even gradient-based methods, built to be smarter, top out around 0.82. CoAx reaches 0.91 — moving the backups from the blind spot to the top of the ranking.

31
00:05:03,208 --> 00:05:08,416
But a high score isn't proof. Are these heads really backups? We test them causally.

32
00:05:08,756 --> 00:05:19,244
As we remove the main heads one by one, the heads CoAx found wake up — their activity climbs and they start writing the correct answer — while random heads stay flat.

33
00:05:19,584 --> 00:05:31,344
And the decisive test: freeze these backups so they cannot respond, and 55 percent of the self-repair disappears. They don't just correlate with the healing — they cause it.

34
00:05:32,094 --> 00:05:38,022
Because the backups are load-bearing, recovering them repairs the downstream tools that were blind to them.

35
00:05:38,362 --> 00:05:51,154
Take attribution — measuring how much a circuit really drives the answer. The primaries alone seem to move it by only 0.22, because self-repair masks the rest.

36
00:05:51,494 --> 00:06:00,710
Add the recovered backups, and the true effect reappears, several times larger. The redundancy had been hiding most of the real signal.

37
00:06:01,050 --> 00:06:16,386
Capability knockout tells the same story: adding CoAx's backups reaches 0.70, close to the 0.72 you get from the documented backups, while a first-order top-up overshoots and damages other behavior.

38
00:06:16,726 --> 00:06:28,702
And a repair-aware pruning order, which re-measures importance as heads are removed, stays closest to the full model, from a 124-million-parameter network up to a 7-billion-parameter one.

39
00:06:29,452 --> 00:06:42,748
Is this one lucky circuit, or a general phenomenon? The same label-free method transfers to a second circuit — induction — across eight further models and six architecture families.

40
00:06:43,088 --> 00:06:58,280
And it is honest about its limits. Where redundancy is shared among heads that fire together, simpler input-side methods can find it too; and on circuits dominated by feed-forward layers, this head-level signal does not transfer.

41
00:06:59,030 --> 00:07:08,798
Step back, and the lesson is about measurement itself. A component's importance is not a fixed property — it is conditional on what else is present.

42
00:07:09,138 --> 00:07:17,682
The parts that matter most are often the ones that stay silent until they are needed. Measure them in isolation, and you will miss them every time.

43
00:07:18,022 --> 00:07:28,270
CoAx recovers exactly those hidden parts — label-free and gradient-free — and hands back a truer picture of how a model actually computes.

44
00:07:28,610 --> 00:07:34,562
The code, the paper, and an interactive page are linked on screen. Thanks for watching.