Lead: Dr. Nadim Saad (Silicon Valley) in collaboration with Dr. David Bau’s Lab (Boston).
Who it’s for: Students who want hands-on, reproducible research in mechanistic interpretability of LLMs.


What is “interpretability”?

Modern AI systems work through layers of learned representations. Interpretability asks: what is stored where, and why does a model produce this output?
We focus on mechanistic interpretability—identifying features, circuits, and causal pathways inside models, then testing our understanding with interventions (e.g., patching activations). It’s useful for:

  • Reliability & debugging: find the feature/circuit causing a failure and fix it precisely.
  • Safety & alignment: detect and reduce harmful or unintended behavior.
  • Science of ML: learn the algorithms that emerge inside networks.

NEURAI’s focus

We work primarily on transformer LLMs, adapting ideas from vision (e.g., Network/GAN Dissection) and modern LLM tooling (activation patching, sparse/cross-coders, influence functions, representation steering). Projects favor small, testable artifacts: a clear question → a minimal experiment → a replicable repo.


Figure 1: The steps of reverse engineering neural networks. (1) Decomposing a network into simpler components; this decomposition might not necessarily use architecturally-defined bases, such as individual neurons or layers. (2) Hypothesizing about the functional roles of some or all components. (3) Validating whether our hypotheses are correct, creating a cycle in which we iteratively refine our decompositions and hypotheses to improve our understanding of the network.

From Sharkey et al. (2025), Open Problems in Mechanistic Interpretability, arXiv:2501.16496.


Current research (people + projects)

Qiaochu Liu — Influence Functions for Mechanistic Interpretability of LLMs
Goal. Trace a model’s prediction back to the most influential training data and internal features.
What this looks like. Scaling influence-function ideas to LLMs; approximations that work on practical budgets; case studies on bias/safety behaviors.
Outcome. Tools that say “this subset of data and these features most affected that output”—and let us test fixes.
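
Below is a minimal, self-contained sketch of the gradient-dot-product (TracIn-style) approximation this line of work builds on, using a toy linear model rather than an LLM. The function names and toy data are illustrative only, and full influence functions would also approximate an inverse-Hessian term.

```python
# A minimal sketch of gradient-based influence scoring (TracIn-style),
# using a toy linear model so the example is self-contained. Names and
# data here are illustrative, not part of any specific codebase.
import torch
import torch.nn as nn

def flat_grad(loss, params):
    """Flatten the gradient of `loss` w.r.t. `params` into one vector."""
    grads = torch.autograd.grad(loss, params)
    return torch.cat([g.reshape(-1) for g in grads])

# Toy setup: a linear "model" and a handful of training/test points.
torch.manual_seed(0)
model = nn.Linear(8, 2)
params = [p for p in model.parameters() if p.requires_grad]
loss_fn = nn.CrossEntropyLoss()

train_x, train_y = torch.randn(16, 8), torch.randint(0, 2, (16,))
test_x, test_y = torch.randn(1, 8), torch.randint(0, 2, (1,))

# Gradient of the test loss at the current checkpoint.
test_grad = flat_grad(loss_fn(model(test_x), test_y), params)

# Influence of each training example ~ dot product of its gradient with
# the test gradient (a first-order approximation; proper influence
# functions also involve an inverse-Hessian term).
scores = []
for i in range(len(train_x)):
    g = flat_grad(loss_fn(model(train_x[i:i+1]), train_y[i:i+1]), params)
    scores.append(torch.dot(g, test_grad).item())

# Training points with the largest scores most pushed the model toward
# its prediction on the test point at this checkpoint.
top = sorted(range(len(scores)), key=lambda i: -scores[i])[:3]
print("most influential training indices:", top)
```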

Rohan Kathuria — Interpreting Feature Development with Crosscoders
Goal. Watch features form and evolve across layers or between model variants (e.g., base → chat).
What this looks like. Train crosscoders to map representations across layers/models; compare, track drift, and identify monosemantic features; connect to sparse autoencoders.
Outcome. A “feature timeline” that shows how capabilities emerge—and where regressions sneak in.
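
As a toy illustration of the idea (not the project’s actual code), the sketch below trains a tiny crosscoder: a single sparse code that jointly reconstructs activations from two sources, standing in for two layers or two model variants. All dimensions, names, and hyperparameters are assumptions made for the example.

```python
# A minimal crosscoder sketch: one shared sparse code that reconstructs
# activations from two sources (e.g., two layers, or base vs. chat model).
import torch
import torch.nn as nn

class TinyCrosscoder(nn.Module):
    def __init__(self, d_a, d_b, n_features):
        super().__init__()
        self.enc_a = nn.Linear(d_a, n_features)
        self.enc_b = nn.Linear(d_b, n_features)
        self.dec_a = nn.Linear(n_features, d_a, bias=False)
        self.dec_b = nn.Linear(n_features, d_b, bias=False)

    def forward(self, act_a, act_b):
        # Shared code: encode both activation spaces, sum, and apply ReLU
        # so the same sparse features explain both sources.
        code = torch.relu(self.enc_a(act_a) + self.enc_b(act_b))
        return self.dec_a(code), self.dec_b(code), code

# Toy training loop on random "activations" standing in for cached
# residual-stream activations from two layers or models.
torch.manual_seed(0)
act_a, act_b = torch.randn(1024, 64), torch.randn(1024, 128)
model = TinyCrosscoder(64, 128, 512)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
l1_coeff = 1e-3  # sparsity pressure on the shared code

for step in range(200):
    rec_a, rec_b, code = model(act_a, act_b)
    loss = ((rec_a - act_a) ** 2).mean() + ((rec_b - act_b) ** 2).mean() \
           + l1_coeff * code.abs().mean()
    opt.zero_grad(); loss.backward(); opt.step()

# Features whose decoder weights are large in only one space are candidate
# layer- or model-specific features; shared ones help track drift.
```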

Ali Shehral — Controlling Emergent Misalignment with Persona Vectors
Goal. Use activation-space directions (“persona vectors”) to measure/steer behaviors tied to risky traits.
What this looks like. Build detectors, then steer (at inference) or regularize (during fine-tuning); evaluate trade-offs (safety vs. capability).
Outcome. Practical knobs to reduce misaligned behavior without indiscriminately degrading model quality.
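
A minimal sketch of the core recipe, with random tensors standing in for cached activations: estimate a direction as a difference of mean activations on contrasting prompts, use projections onto it as a detector, and add a scaled copy of it at inference to steer. The names and hook-free setup are simplifications; real use would attach hooks to a specific layer of a specific model.

```python
# A minimal difference-of-means "persona vector" sketch. Activations here
# are random stand-ins; in practice they come from a chosen layer's
# residual stream on trait vs. neutral prompts.
import torch

torch.manual_seed(0)
d_model = 256

# Pretend these are cached activations for prompts that do vs. do not
# exhibit the trait of interest.
acts_trait = torch.randn(200, d_model) + 0.5
acts_neutral = torch.randn(200, d_model)

# Persona vector: difference of mean activations, normalized.
persona_vec = acts_trait.mean(0) - acts_neutral.mean(0)
persona_vec = persona_vec / persona_vec.norm()

def steer(hidden, direction, alpha):
    """Nudge hidden states along `direction`; alpha < 0 suppresses it."""
    return hidden + alpha * direction

# Detector: project new activations onto the direction and threshold.
new_acts = torch.randn(4, d_model)
scores = new_acts @ persona_vec
print("trait scores:", scores.tolist())

# Suppress the trait at inference by steering against the direction, then
# continue the forward pass on the edited activations.
edited = steer(new_acts, persona_vec, alpha=-2.0)
```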


Methods we use

  • Activation / Logit Patching: swap activations from one run/layer into another to test causality (“did this circuit cause that behavior?”); see the sketch after this list.
  • Sparse Dictionary Learning: compress activations into human-readable features; align features across layers/models to see how they evolve.
  • Influence Functions & Data Attribution: estimate which training examples most affected a prediction; great for bias/bug hunting.
  • Representation Steering (incl. Persona Vectors): nudge internal activations along a direction to amplify/suppress behaviors—then measure side effects.
  • Data Lineage & Evaluation: track where data came from and how it was filtered; connect dataset slices to model quirks.
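
For concreteness, here is a minimal activation-patching sketch using PyTorch forward hooks on a toy MLP. In practice the same pattern is applied to a transformer’s residual stream or attention heads; the module indices and inputs below are purely illustrative.

```python
# A minimal activation-patching sketch with PyTorch forward hooks.
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4))
clean_x, corrupt_x = torch.randn(1, 16), torch.randn(1, 16)

# 1) Cache the "clean" activation at the component we want to test.
cache = {}
def save_hook(module, inputs, output):
    cache["act"] = output.detach()

h = model[0].register_forward_hook(save_hook)
clean_logits = model(clean_x)
h.remove()

# 2) Re-run on the corrupted input, but patch in the clean activation
# (returning a value from a forward hook replaces the module's output).
def patch_hook(module, inputs, output):
    return cache["act"]

h = model[0].register_forward_hook(patch_hook)
patched_logits = model(corrupt_x)
h.remove()

corrupt_logits = model(corrupt_x)

# 3) If patching this one activation moves the corrupted output back toward
# the clean output, that component is causally implicated in the behavior.
print("clean:  ", clean_logits)
print("corrupt:", corrupt_logits)
print("patched:", patched_logits)
```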

Figure 2: Three ideas underlying the sparse dictionary learning (SDL) paradigm in mechanistic interpretability. (Left) The linear representation hypothesis states that the map from ‘concepts’ to neural activations is linear. (Middle) Superposition is the hypothesis that models represent many more concepts than they have dimensions by representing them both sparsely and linearly in activation spaces. (Right) SDL attempts to recover an overcomplete basis of concepts represented in superposition in activation space.

From Sharkey et al. (2025), Open Problems in Mechanistic Interpretability, arXiv:2501.16496.
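
To make the SDL idea in Figure 2 concrete, the sketch below trains a tiny sparse autoencoder: an overcomplete dictionary (more features than activation dimensions) whose sparse, non-negative codes linearly reconstruct the activations. The dimensions, penalty, and training loop are illustrative assumptions, not a recommended recipe.

```python
# A minimal sparse-autoencoder sketch of the SDL paradigm: learn an
# overcomplete feature dictionary whose sparse codes reconstruct
# activations. Random tensors stand in for cached LLM activations.
import torch
import torch.nn as nn

torch.manual_seed(0)
d_act, n_features = 64, 1024          # overcomplete: n_features >> d_act
acts = torch.randn(4096, d_act)       # stand-in for cached activations

encoder = nn.Linear(d_act, n_features)
decoder = nn.Linear(n_features, d_act, bias=False)
opt = torch.optim.Adam(
    list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3
)
l1_coeff = 1e-3                        # sparsity penalty on the codes

for step in range(300):
    codes = torch.relu(encoder(acts))  # sparse feature activations
    recon = decoder(codes)             # linear reconstruction
    loss = ((recon - acts) ** 2).mean() + l1_coeff * codes.abs().mean()
    opt.zero_grad(); loss.backward(); opt.step()

# Each decoder column is a candidate feature direction; the sparse codes
# indicate which features are active on which inputs.
```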


Why this matters

Understanding models at the mechanism level lets us:

  • fix failures surgically,
  • build safer systems, and
  • turn “it works” into “we know why it works.”

If that sounds like your kind of fun, join our Talks series—we break models (gently), learn, and ship small proofs that scale.

Problem areas

Figure 3: A summary of problem areas for applications of mechanistic interpretability.

From Sharkey et al. (2025), Open Problems in Mechanistic Interpretability, arXiv:2501.16496.


Recent activity

  • NDIF x NEURAI Workshop (interpretability methods & demos) — see recap above.
  • Presentations on papers in mechanistic interpretability.
  • Ongoing collaboration with Bau Lab on feature discovery and LLM circuits.

Learn More

David Bau / NDIF

Community Resources and Papers

Contact Dr. Nadim Saad at n.saad@northeastern.edu with any specific queries.