Nan Li

Surreal

2026-06-09T00:00:00+00:00

Defended my PhD yesterday. Got the diploma. It still doesn’t feel real.

Preparing the presentation and the defense, even knowing you’re not going to fail, was still daunting and nerve-racking. It feels a bit unfair when you realize the world already rewards the loud ones, for whom such events are just another opportunity to shine. My proposal: let introverts use an AI voice-over. You should be able to just stand there and nod while a confident voice explains your thesis.

I decided not to ask my mom to watch online. It would have been 10:30pm her time. She should be deep asleep by then, and even if not, my presentation would surely put her there.

AI sycophancy: what the new Science paper shares with our cross-lingual study

2026-04-07T00:00:00+00:00

Cheng et al. published a study in Science last week on AI sycophancy. They fed 11 LLMs posts from Reddit’s r/AmITheAsshole and found models affirm user behavior 49% more often than the human crowd consensus. A follow-up with 2,400 participants showed that interacting with sycophantic AI made people more entrenched and less willing to apologize, while rating the AI as trustworthy and objective.

We’ve been working on a related problem. Our paper (January 2026, under review at ACL) also uses AITA data and finds the same leniency: 12 of 13 models judge more leniently than the Reddit baseline (Cohen’s d > 1.6), and all 13 do so on a Chinese moral dilemma dataset. Two independent groups, same data source, same finding.
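For readers who want the effect size spelled out, here is a minimal sketch of Cohen's d with a pooled standard deviation. The scores are made up for illustration and the paper's exact computation may differ; `cohens_d` is a hypothetical helper name.

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(a, b):
    """Cohen's d: mean difference divided by the pooled sample standard deviation."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / sqrt(pooled_var)


# Hypothetical per-post leniency scores (not data from either paper).
model_scores = [0.82, 0.78, 0.85, 0.80, 0.79]
reddit_scores = [0.60, 0.66, 0.58, 0.63, 0.61]
print(round(cohens_d(model_scores, reddit_scores), 2))
```

A d above 1 means the two distributions are separated by more than one pooled standard deviation.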

The studies go in different directions from there.

What happens to the user vs. what happens in the model

Cheng et al. ask what happens to the user. Sycophancy measurably shifts how people reason about their own conflicts. We didn’t study that.

We ask what’s going on inside the model. Standard evaluation tells you a model behaves differently in English vs. Chinese, but not whether the gap comes from how it reads the input or how it reasons. We tested mismatched conditions (English story + Chinese chain-of-thought, and vice versa) to pull these apart. Reasoning language drives about 2x the behavioral shift that input language does (7.2 pp vs. 3.5 pp, p < 0.001).
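The mismatched conditions form a 2×2 design, and the two effects can be read off as main effects. A minimal sketch, assuming each cell holds a verdict rate in percent; the cell values are invented and `main_effects` is a hypothetical helper, not the paper's analysis code.

```python
def main_effects(rate):
    """rate[(input_lang, reasoning_lang)] -> verdict rate in percent.

    Returns (input-language effect, reasoning-language effect) in
    percentage points, each averaged over the other factor.
    """
    input_effect = (
        (rate[("zh", "en")] + rate[("zh", "zh")])
        - (rate[("en", "en")] + rate[("en", "zh")])
    ) / 2
    reasoning_effect = (
        (rate[("en", "zh")] + rate[("zh", "zh")])
        - (rate[("en", "en")] + rate[("zh", "en")])
    ) / 2
    return input_effect, reasoning_effect


# Invented rates for illustration only.
rates = {
    ("en", "en"): 30.0,  # English story, English chain-of-thought
    ("en", "zh"): 40.0,  # English story, Chinese chain-of-thought
    ("zh", "en"): 34.0,  # Chinese story, English chain-of-thought
    ("zh", "zh"): 44.0,  # Chinese story, Chinese chain-of-thought
}
print(main_effects(rates))
```

Without the two mismatched cells, the input and reasoning effects are confounded and cannot be separated.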

In our Moral Foundations Theory analysis, the main pattern is calibration drift: models change how harshly they judge, but moral priority rankings stay mostly stable (mean Spearman rho = 0.88). But “mostly” is doing work there. Some models shift which moral dimensions they weigh, and models that look stable on Western data shift on Chinese data.
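To make the rank-stability number concrete, here is a minimal sketch of Spearman's rho between two moral-priority profiles. The five foundation weights are invented, the tie-free formula is an assumption made for brevity, and `ranks` and `spearman_rho` are hypothetical helpers.

```python
def ranks(values):
    """Rank values from largest (rank 1) to smallest; assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman_rho(x, y):
    """Spearman rank correlation via the tie-free shortcut formula."""
    n = len(x)
    d_squared = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return 1 - 6 * d_squared / (n * (n * n - 1))


# Invented weights over care, fairness, loyalty, authority, sanctity.
english_profile = [0.35, 0.30, 0.15, 0.12, 0.08]
chinese_profile = [0.30, 0.33, 0.11, 0.17, 0.09]
print(spearman_rho(english_profile, chinese_profile))
```

In this toy case two adjacent pairs swap places and rho still comes out high, which is why a high mean rho can coexist with real shifts in individual models.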

44% of the models we tested appear stable under English-only evaluation but show hidden context-dependency cross-lingually. Monolingual benchmarking misses this.

Two halves of one problem

The two papers split the problem down the middle. The Stanford group (Cheng et al.) treats the model as a black box: sycophantic, yes, but that’s where their model-side analysis ends. They focus on what that sycophancy does to the person on the other end, and with 2,400 participants they nail it. People who interact with sycophantic AI become more entrenched and less willing to repair relationships.

We open the box. We decompose the sources of leniency, characterize per-model variation, and identify which models fail silently across languages. But we stop at the model boundary. No user study (due to resource constraints).

Read them together and you get both halves. We show that models aren’t just generically lenient – they reason with specific moral patterns that vary by language and shift in ways monolingual benchmarks can’t detect. Stanford shows that whatever the model produces, it sticks.

The gap between them

Cheng et al. show sycophantic AI makes users more self-serving in their conclusions. But does it also reshape how they reason? We find that LLMs weigh moral dimensions differently from humans, not just in severity but in which considerations matter most. If the model’s moral fingerprint transfers to the user, that’s a different problem than generic leniency.

The experiment to close the loop: have participants discuss moral dilemmas with chatbots, then evaluate new dilemmas on their own. Measure whether their moral foundation profiles drift toward the chatbot’s pattern. The fingerprinting method from our paper works on human judgments too, so this is directly testable.

Our paper: arXiv:2601.10257 (Li, Kang, De Bie), code and datasets included.

Stanford study: Cheng et al., Science 2026

Five heads in layer 12: what a learned KB encoder actually learns

2026-04-02T00:00:00+00:00

Part 1 showed that adding KG vectors to an LLM’s hidden state fails regardless of alignment. What works: train a small encoder to project KB entries into the model’s attention key-value space as extra tokens, as in KBLAM (Feng et al., ICLR 2025). The LLM stays frozen; the encoder learns which directions actually change the output.
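A minimal sketch of what "extra tokens in key-value space" means for a single attention head and a single query. It assumes plain concatenation of KB keys and values in front of the sequence; KBLAM's actual attention variant and our encoder are more involved, and `attend` and `attend_with_kb` are hypothetical names.

```python
from math import exp, sqrt


def attend(query, keys, values):
    """Scaled dot-product attention for one query vector."""
    scale = sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    top = max(scores)  # subtract the max for numerical stability
    weights = [exp(s - top) for s in scores]
    total = sum(weights)
    weights = [w / total for w in weights]
    width = len(values[0])
    output = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(width)]
    return output, weights


def attend_with_kb(query, keys, values, kb_keys, kb_values):
    """Prepend encoder-produced KB keys/values so the query can attend to them."""
    return attend(query, kb_keys + keys, kb_values + values)


# Toy 2-d example: the first KB key is aligned with the query.
query = [1.0, 0.0]
seq_keys, seq_values = [[0.0, 1.0], [0.5, 0.5]], [[0.0, 1.0], [0.5, 0.5]]
kb_keys, kb_values = [[3.0, 0.0], [-3.0, 0.0]], [[1.0, 0.0], [0.0, 1.0]]
output, weights = attend_with_kb(query, seq_keys, seq_values, kb_keys, kb_values)
print([round(w, 3) for w in weights])
```

The frozen model's own weights never change; only the extra keys and values, produced by the trained encoder, decide where attention goes.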

We reproduced this on Pythia-1.4B and then asked where the signal goes.

The encoder works, and the shuffled-KB test proves it

Pythia gets 14% on factual probes by itself. With the encoder providing correct KB: 96%.

The convincing number is the shuffled-KB result: 5%, below the 14% clean baseline. When given the wrong entity’s fact in the right format, the model follows it. The injection is live, not decorative.
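One way to build such a control, sketched under the assumption that the KB is a mapping from entity to fact and that facts are distinct. The post does not say how the shuffle was implemented; `shuffle_kb` is a hypothetical helper that guarantees no entity keeps its own fact.

```python
import random


def shuffle_kb(kb, seed=0):
    """Give every entity another entity's fact: same format, wrong content.

    A random ordering followed by a cyclic shift of one guarantees that no
    entity keeps its own fact (needs at least two entries).
    """
    order = list(kb)
    random.Random(seed).shuffle(order)
    return {order[i]: kb[order[(i + 1) % len(order)]] for i in range(len(order))}


# Toy KB for illustration.
kb = {"France": "Paris", "Japan": "Tokyo", "Kenya": "Nairobi", "Peru": "Lima"}
print(shuffle_kb(kb))
```

If accuracy under the shuffled KB merely fell back to the clean baseline, the model could be ignoring the injection; falling below it means the model follows the wrong fact.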

Left: factual encoder, 14% baseline jumps to 96% with correct KB, drops to 5% with shuffled KB (below clean). Right: counterfactual encoder on facts Pythia has never seen, 0% baseline rises to 98%. Both show the model actively reads and follows the injected signal.

The counterfactual result is the stronger test: these are facts Pythia’s parametric memory assigns near-zero probability. The encoder writes new facts into a frozen model.

The signal concentrates in layer 12; head identity doesn’t matter

The encoder injects into 64 head-layer slots (4 layers × 16 heads). We added a learnable on/off gate per head with a sparsity penalty (Hard Concrete L0; Louizos et al., 2018) and trained two encoders with different penalty strengths.

Both converged to the same answer: zero out layers 0, 6, 18. Keep only layer 12.

Two independently trained sparse encoders both concentrate all signal in layer 12. The active heads are completely disjoint, {0,1,5,7,14} vs {10,12}, yet achieve comparable accuracy. Layer identity is consistent across solutions; head identity is arbitrary.

Layer 12 is where Pythia’s relation-probing accuracy peaks and roughly where causal tracing methods (ROME, Knowledge Neurons) localize factual recall. The encoder discovered this on its own.

Sparse vs all-heads: no difference

You might expect fewer heads to mean less interference. Each head independently attends to all KB triples including distractors, so inactive heads import noise into the residual stream. The math predicts an optimal head count well below 64 (cross-head interference scales as H^2 while signal scales as H, via a Hanson-Wright-style argument).

We tested this by training an all-heads encoder from scratch (penalty disabled, all 64 heads free) and sweeping KB size. No difference. All three configurations (64, 5, 2 heads) overlap within bootstrap CIs at every KB size. Restrict the encoder to 5 heads and it routes the signal more strongly through those 5. Give it 64 and it spreads out. The bottleneck is the encoder’s mapping quality, not head count.
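For reference, a minimal sketch of a percentile bootstrap CI on accuracy and an overlap check between two configurations. The per-question outcomes are invented, and the resample count and 95% level are assumptions rather than our exact settings.

```python
import random


def bootstrap_ci(correct, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for accuracy from a list of 0/1 outcomes."""
    rng = random.Random(seed)
    n = len(correct)
    means = sorted(sum(rng.choices(correct, k=n)) / n for _ in range(n_boot))
    low = means[int((alpha / 2) * n_boot)]
    high = means[int((1 - alpha / 2) * n_boot) - 1]
    return low, high


def intervals_overlap(a, b):
    """True if two (low, high) intervals share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


# Invented outcomes for two head configurations at one KB size.
all_heads = [1] * 90 + [0] * 10
five_heads = [1] * 88 + [0] * 12
ci_all = bootstrap_ci(all_heads, seed=1)
ci_five = bootstrap_ci(five_heads, seed=2)
print(ci_all, ci_five, intervals_overlap(ci_all, ci_five))
```

Overlapping intervals are a coarse check, not a formal test of equality, but they are enough to say the data give no evidence of a head-count effect.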

Practical takeaway: train end-to-end, inject into the right layer, and don’t bother optimizing over head selection.

Right zip code, wrong address: why you can’t just add knowledge to an LLM

2026-04-01T00:00:00+00:00

I spent the last couple of weeks reading up on how people combine knowledge graphs with LLMs. Knowledge graphs store facts as triples like (France, capital, Paris), structured and updatable. LLMs know a lot but hallucinate and go stale. Plenty of work tries to bridge the two: you can convert triples to text and put them in the prompt (KAPING), fine-tune with projected KG embeddings (ConceptFormer, TEA-GLM), inject KG entries as extra attention key-value pairs (KBLAM), or just edit the model’s weights (ROME, MEMIT).

I kept coming back to a simpler question: what if you just take a KG vector for “France,” project it into the LLM’s activation space, and add it to the hidden state? No fine-tuning, no weight editing, just vector addition on a frozen model.

It doesn’t work. We tried 300 configurations on Pythia-1.4B with factual fill-in-the-blank questions from LAMA T-REx (baseline: 20% Hit@1). Five injection methods, five layers, four strengths, three positions. Every one degraded performance. Best result: ~14%. The KG vectors performed no better than random noise of the same magnitude.
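A minimal sketch of the sweep's shape, the injection itself, and the metric. The post gives only the counts per axis, so every label below is a placeholder, and `inject` and `hit_at_1` are hypothetical helpers rather than our experiment code.

```python
from itertools import product

# Placeholder labels: only the counts (5, 5, 4, 3) come from the post.
methods = [f"method_{i}" for i in range(5)]
layers = [f"layer_choice_{i}" for i in range(5)]
strengths = [f"strength_{i}" for i in range(4)]
positions = [f"position_{i}" for i in range(3)]
configs = list(product(methods, layers, strengths, positions))
print(len(configs))


def inject(hidden, kg_vector, strength):
    """Plain vector addition on a hidden state; the model stays frozen."""
    return [h + strength * v for h, v in zip(hidden, kg_vector)]


def hit_at_1(top_predictions, answers):
    """Fraction of questions whose top-ranked prediction is the gold answer."""
    hits = sum(p == a for p, a in zip(top_predictions, answers))
    return hits / len(answers)
```

Each configuration is scored with Hit@1 against the clean baseline, so "degraded" means every one of the 300 cells came out lower than the model with nothing injected.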

KG embeddings are noise to the LLM, but the LLM knows the facts anyway

We measured representational similarity (CKA) between KG embeddings and Pythia’s hidden states. TransE, RotatE, and sentence embeddings all look identical to random noise at every layer. Three fundamentally different external representations, same result.
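A minimal sketch of linear CKA in plain Python, assuming the linear variant (the post does not say which kernel was used). Rows are examples and columns are features; the function names are hypothetical.

```python
def center(X):
    """Subtract each column's mean."""
    n = len(X)
    means = [sum(col) / n for col in zip(*X)]
    return [[x - m for x, m in zip(row, means)] for row in X]


def cross_frobenius_sq(X, Y):
    """Squared Frobenius norm of X^T Y, summed over all column pairs."""
    return sum(
        sum(a * b for a, b in zip(x_col, y_col)) ** 2
        for x_col in zip(*X)
        for y_col in zip(*Y)
    )


def linear_cka(X, Y):
    """Linear CKA: ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F) on centered data."""
    X, Y = center(X), center(Y)
    denominator = (cross_frobenius_sq(X, X) * cross_frobenius_sq(Y, Y)) ** 0.5
    return cross_frobenius_sq(X, Y) / denominator


# Toy check: a representation compared with a rescaled copy of itself.
X = [[1.0, 2.0], [2.0, 0.5], [3.0, 4.0], [0.0, 1.0]]
scaled = [[3.0 * x for x in row] for row in X]
print(linear_cka(X, scaled))
```

Linear CKA ignores rotations and uniform rescaling, so a score at the random baseline cannot be blamed on a mere change of basis; it says nothing, though, about alignments a nonlinear map could recover.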

Left: all external embeddings overlap with the random baseline; the lines are inseparable. Right: Pythia’s inter-layer CKA exceeds 0.95 through layers 4–22, confirming the metric can detect real similarity. KG embeddings aren’t slightly misaligned. They’re orthogonal.

The interesting part: the LLM does encode relational structure internally. A linear probe on Pythia’s hidden states classifies relation type (capital of? native language of?) at 65% on a 15-way task (chance: 6.7%). The facts are there. The geometry is completely different.

Alignment is easy to learn, and useless

We trained an MLP projector: KG vector → Pythia activation space. With 10 examples we hit 0.93 cosine similarity to Pythia’s own representations. The alignment problem is trivially solvable. Injecting these aligned vectors still doesn’t improve accuracy.

We tracked why. An injected perturbation, whether carefully aligned or random noise, gets amplified (~1.85×) and rotated until its direction is essentially random by the final layer.
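A minimal sketch of one way to measure this, assuming you have hidden states from a clean run and a perturbed run at each layer from the injection point onward. The toy vectors are invented and `track_perturbation` is a hypothetical helper, not our measurement code.

```python
from math import sqrt


def norm(v):
    return sqrt(sum(x * x for x in v))


def cosine(a, b):
    return sum(x * y for x, y in zip(a, b)) / (norm(a) * norm(b))


def track_perturbation(clean_states, perturbed_states):
    """Per layer: (growth in size, cosine with the originally injected direction).

    Index 0 is the injection layer, so the first entry is always (1.0, 1.0).
    """
    deltas = [
        [p - c for p, c in zip(perturbed, clean)]
        for perturbed, clean in zip(perturbed_states, clean_states)
    ]
    first = deltas[0]
    return [(norm(d) / norm(first), cosine(d, first)) for d in deltas]


# Toy 2-d states over three layers: the difference grows and rotates away.
clean = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
perturbed = [[1.0, 0.0], [1.0, 1.0], [0.0, 2.0]]
for growth, direction in track_perturbation(clean, perturbed):
    print(round(growth, 3), round(direction, 3))
```

Running the same measurement on an aligned vector and on random noise of equal norm is what lets the two trajectories be compared layer by layer.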

Perturbation magnitude grows (left) and direction randomizes (right) through the network. The aligned vector (blue) and random noise (orange) reach nearly the same endpoint by layer 23. The transformer treats them identically; alignment washes out.

Alignment buys tolerance, not benefit. The model can handle an aligned perturbation at higher strength before collapsing, but accuracy never rises above the clean baseline. Both aligned and random perturbations end up in the same place by the output layer.

A static KG vector, the same “France” whether you’re asking about its capital or its language, can’t target the specific internal features the model reads for each prediction. Right zip code, wrong address.

Part 2: a learned encoder gets 96% on a frozen LLM, and we trace where in the model the signal goes.

It’s done

2026-03-21T00:00:00+00:00

Submitted my PhD thesis three days ago. For the first time in years, I’m enjoying some guilt-free time off. It feels unreal.