Writing

Technical explorations at the intersection of physics and artificial intelligence

Recent Posts

Training Dynamics of Transformer Attention Heads

15 minute read

A time-dependent study of W_QK statistics across training checkpoints in the Pythia model suite: how spectral structure, stable rank, and head diversity evolve during pretraining.

Singular Value Structure of Transformer Attention Heads

20 minute read

An empirical study of the singular value spectra of W_QK matrices across transformer architectures: spectral distributions, participation ratios, and what they reveal about learned attention geometry.