
import { EntryBody } from "../components/Entry/EntryBody";

export const notes_on_grokking: { [id: string]: any } = {

    id: "notes_on_grokking",
    title: <>Notes on Grokking</>,
    date: "s23",

    Body: (
        <EntryBody
        paragraphs={[

<div className="font-mono">


</div>,

<div className="font-mono">
----------------------- Progress Measures for Grokking via Mechanistic Interpretability; Nanda

</div>,

<div className="font-mono">


</div>,

<div className="font-mono">
Abstract:

</div>,

<div className="font-mono">


</div>,

<div className="font-mono">
- one approach to understanding emergence is to find continuous progress measures that underlie seemingly discontinuous qualitative changes

</div>,

<div className="font-mono">
- mechanistic interpretability: reverse-engineering learned behaviors into their individual components

</div>,

<div className="font-mono">
- grokking, rather than being a sudden shift, arises from the gradual amplification of structured mechanisms encoded in the weights, followed by the later removal of memorizing components

</div>,

<div className="font-mono">


</div>,

<div className="font-mono">
Mapping a human algorithm (Fourier transforms and trigonometric identities to convert addition to rotation about a circle) to components of the network. It is argued the network _learns_ the algorithm.

</div>,
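
/*
A minimal sketch of the "addition as rotation" idea noted above, written by hand rather than read off any trained network. It shows how angle-addition identities turn (a + b) mod P into an argmax over cosines. The function name addModViaTrig and the frequency set FREQS are assumptions for illustration only; P = 113 matches the setup in these notes.

```typescript
// Hypothetical illustration: modular addition via rotations on a circle.
// P matches the notes; the frequency set is an arbitrary assumption.
const P = 113;
const FREQS: number[] = [1, 5, 17];

function addModViaTrig(a: number, b: number, p: number = P, freqs: number[] = FREQS): number {
  const logits: number[] = new Array(p).fill(0);
  for (const k of freqs) {
    const w = (2 * Math.PI * k) / p;
    // Represent a and b as points on the unit circle at frequency k.
    const ca = Math.cos(w * a);
    const sa = Math.sin(w * a);
    const cb = Math.cos(w * b);
    const sb = Math.sin(w * b);
    // Angle-addition identities give cos(w(a+b)) and sin(w(a+b)).
    const cSum = ca * cb - sa * sb;
    const sSum = sa * cb + ca * sb;
    // cos(w(a+b-c)) = cos(w(a+b))cos(wc) + sin(w(a+b))sin(wc),
    // which is largest when c equals (a + b) mod p.
    for (let c = 0; c < p; c++) {
      logits[c] += cSum * Math.cos(w * c) + sSum * Math.sin(w * c);
    }
  }
  // Read off the answer as the argmax over candidate outputs c.
  let best = 0;
  for (let c = 1; c < p; c++) {
    if (logits[c] > logits[best]) best = c;
  }
  return best;
}
```

Calling addModViaTrig(100, 50) should agree with (100 + 50) % 113. Summing several frequencies only sharpens the peak; with prime P a single nonzero frequency would already suffice.
*/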

<div className="font-mono">


</div>,

<div className="font-mono">
training can be split into three phases:

</div>,

<div className="font-mono">


</div>,

<div className="font-mono">
- memorization of the training data;

</div>,

<div className="font-mono">
- circuit formation, where the network learns a mechanism that generalizes;

</div>,

<div className="font-mono">
- cleanup, where weight decay removes the memorization components

</div>,

<div className="font-mono">


</div>,

<div className="font-mono">
Phase Changes ... AlphaZero quickly learns many human chess concepts between 10k and 30k training steps and reinvents human opening theory between 25k and 60k training steps

</div>,

<div className="font-mono">


</div>,

<div className="font-mono">
Set Up:

</div>,

<div className="font-mono">


</div>,

<div className="font-mono">
We train transformers to perform addition mod P. The input to the model is of the form “a b =”, where a and b are encoded as P-dimensional one-hot vectors, and = is a special token above which we read the output c. In our mainline experiment, we take P = 113 and use a one-layer ReLU transformer, token embeddings with d = 128, learned positional embeddings, 4 attention heads of dimension d/4 = 32, and n = 512 hidden units in the MLP. In other experiments, we vary the depth and dimension of the model. We did not use LayerNorm or tie our embed/unembed matrices.

</div>,

<div className="font-mono">


</div>,

<div className="font-mono">
Our mainline dataset consists of 30% of the entire set of possible inputs (that is, 30% of the 113 · 113 pairs of numbers mod P). We use full batch gradient descent using the AdamW optimizer (Loshchilov & Hutter, 2017) with learning rate γ = 0.001 and weight decay parameter λ = 1. We perform 40,000 epochs of training. As there are only 113 · 113 possible pairs, we evaluate test loss and accuracy on all pairs of inputs not used for training.

</div>,
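
/*
A rough sketch of the data split described above: enumerate all 113 · 113 pairs, shuffle, keep 30% for training, and evaluate on everything else. The Example type, makeSplit, the seeded mulberry32-style generator, and the seed value are assumptions for illustration; the paper's own code may build the split differently.

```typescript
type Example = { a: number; b: number; c: number };

// Small seeded PRNG (mulberry32-style) so the split is reproducible.
function seededRandom(seed: number): () => number {
  let s = seed >>> 0;
  return () => {
    s = (s + 0x6d2b79f5) >>> 0;
    let t = s;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

function makeSplit(
  p: number,
  trainFraction: number,
  seed: number
): { train: Example[]; test: Example[] } {
  // Enumerate every input pair with its label c = (a + b) mod p.
  const all: Example[] = [];
  for (let a = 0; a < p; a++) {
    for (let b = 0; b < p; b++) {
      all.push({ a, b, c: (a + b) % p });
    }
  }
  // Fisher-Yates shuffle driven by the seeded generator.
  const rand = seededRandom(seed);
  for (let i = all.length - 1; i > 0; i--) {
    const j = Math.floor(rand() * (i + 1));
    const tmp = all[i];
    all[i] = all[j];
    all[j] = tmp;
  }
  // The first trainFraction of the shuffled pairs is train; the rest is test.
  const nTrain = Math.floor(trainFraction * all.length);
  return { train: all.slice(0, nTrain), test: all.slice(nTrain) };
}

const split = makeSplit(113, 0.3, 0);
```

Because the test set is simply the complement of the training set, train and test together cover every possible input exactly once.
*/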

<div className="font-mono">


</div>,

<div className="font-mono">
----------------------- Grokking of Hierarchical Structure in Vanilla Transformers; Shikhar Murty

</div>,

<div className="font-mono">


</div>,

<div className="font-mono">
Abstract:

</div>,

<div className="font-mono">
optimal (transformer) depth for grokking can be identified using the tree-structuredness metric of Murty et al. (2023)

</div>,

<div className="font-mono">
... strong evidence that, with extended training, vanilla transformers discover and use hierarchical structure.

</div>,

<div className="font-mono">


</div>,

<div className="font-mono">
evaluate generalization in models trained on ambiguous tasks in which training data is consistent with both a “hierarchical rule” as well as a “non-hierarchical rule”

</div>,

<div className="font-mono">
datasets from "Does syntax need to grow on trees? Sources of hierarchical inductive bias in sequence-to-sequence networks" (https://github.com/tommccoy1/rnn-hierarchical-biases/tree/master/data)

</div>,

<div className="font-mono">
Generalization performance improves even after in-domain accuracies have saturated, showing structural grokking. (At the point where you would early-stop because in-domain accuracy has saturated, continuing training still improves generalization quite a bit.)

</div>,

<div className="font-mono">
U-shaped scaling ... is this because of the dataset size? Can we replicate this experiment with random data?

</div>,

<div className="font-mono">
the “optimal” model learns the most tree-structured solution compared to both deep and shallow models

</div>,

<div className="font-mono">
we also do not study the effect of training data size on structural grokking, and do not investigate whether transformers learn to grok hierarchical structure in low data regimes

</div>,

<div className="font-mono">
all datasets here are based on context-free grammars ... we believe constructing similar generalization benchmarks on real language data is a good avenue for future work

</div>,

<div className="font-mono">


</div>,

<div className="font-mono">
----------------------- Explaining grokking through circuit efficiency; Vikrant Varma, Rohin Shah

</div>,

<div className="font-mono">


</div>,

<div className="font-mono">
Abstract:

</div>,

<div className="font-mono">
memorising circuits become more inefficient with larger training datasets while generalising circuits do not, suggesting there is a critical dataset size at which memorisation and generalisation are equally efficient

</div>,

<div className="font-mono">


</div>,

<div className="font-mono">
three key properties are sufficient for grokking:

</div>,

<div className="font-mono">
(1) 𝐶gen generalises well while 𝐶mem does not,

</div>,

<div className="font-mono">
(2) 𝐶gen is more efficient than 𝐶mem

</div>,

<div className="font-mono">
(3) 𝐶gen is learned more slowly than 𝐶mem

</div>,

<div className="font-mono">


</div>,

<div className="font-mono">
4.1.

</div>,

<div className="font-mono">
classifier efficiency is monotonically non-increasing in dataset size

</div>,

<div className="font-mono">
Effect of weight decay on 𝐷crit. Since 𝐷crit is determined only by the relative efficiencies of 𝐶gen and 𝐶mem, and none of these depends on the exact value of weight decay (just on weight decay being present at all), our theory predicts that 𝐷crit should not change as a function of weight decay.

</div>,

<div className="font-mono">


</div>,

<div className="font-mono">
(P1) Efficiency. We predict (Section 4.1) that memorisation efficiency decreases with increasing train dataset size, while generalisation efficiency stays constant.

</div>,
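
/*
A toy sketch of how prediction (P1) implies a critical dataset size. Both curves below are made-up assumptions, not values from the paper: memNorm grows linearly with the number of training examples, genNorm stays constant, and "efficiency" is read as needing less parameter norm to produce the same logits. The names memNorm, genNorm, and criticalDatasetSize are hypothetical.

```typescript
// Toy illustration only: both norm curves are invented, not fitted to any data.
// Assumed: a generalising circuit needs the same parameter norm at any dataset size.
const genNorm = (_d: number): number => 40;
// Assumed: a memorising circuit needs more parameter norm for each extra example.
const memNorm = (d: number): number => d / 20;

// Smallest dataset size at which memorisation needs at least as much
// parameter norm as generalisation, or null if none is found up to maxD.
function criticalDatasetSize(maxD: number): number | null {
  for (let d = 1; d <= maxD; d++) {
    if (memNorm(d) >= genNorm(d)) return d;
  }
  return null;
}

const dCrit = criticalDatasetSize(10000);
```

Under these assumed curves, weight decay would favour the memorising circuit below dCrit and the generalising circuit above it. Note that the strength of weight decay appears nowhere in the computation, which mirrors the claim in the notes that 𝐷crit should not depend on it.
*/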

<div className="font-mono">


</div>,

<div className="font-mono">
grokking has been observed even when weight decay is not present ... We speculate there is at least one other effect that has a similar regularising effect favouring 𝐶gen over 𝐶mem, such as the implicit regularisation of gradient descent (Lyu and Li, 2019; Smith and Le, 2017; Soudry et al., 2018; Wang et al., 2021), and that the speed of the transition from 𝐶mem to 𝐶gen is based on the sum of these effects and the effect from weight decay.

</div>,

<div className="font-mono">


</div>,

<div className="font-mono">
Future work. Within grokking, several interesting puzzles are still left unexplained. Why does the time taken to grok rise super-exponentially as dataset size decreases? How does the random initialisation interact with efficiency to determine which circuits are found by gradient descent? What causes generalising circuits to develop slower?

</div>,

      ]}
    />
  ),
};
