Training Dynamics of Transformer Attention Heads
Published:
A time-dependent study of W_QK statistics across training checkpoints in the Pythia model suite: how spectral structure, stable rank, and head diversity evolve during pretraining.
Published:
An empirical study of the singular value spectra of W_QK matrices across transformer architectures: spectral distributions, participation ratios, and what they reveal about learned attention geometry.
Published:
A physics-informed neural network (PINN) that solves the Korteweg-de Vries (KdV) equation, with boundary conditions specified via the inverse scattering transform.
Published:
An empirical study of the statistical distributions of W_Q, W_K, and W_QK weight matrices across transformer architectures.