
import { EntryBody } from "../components/Entry/EntryBody";

export const now_generating_neuron: { [id: string]: any } = {

    id: "now_generating_neuron",
    title: <>Is a now-generating neuron in a language model possible?</>,
    date: "18 Oct 2024",

    Body: (
        <EntryBody
        paragraphs={[

<div className="font-mono">


</div>,

<div className="font-mono">
Today a colleague mentioned an experiment Anthropic ran on Claude in which they amped up the internal feature that fired on mentions of the Golden Gate Bridge, and the resulting generations of the modified language model were always about the bridge, even if the topic was apparently unrelated. For simplicity let us assume that this was done on a single neuron.

</div>,

<div className="font-mono">


</div>,

<div className="font-mono">
When a language model is used in a conversational manner, part of the text is the prompt or text written by the interlocutor, and part of the text is what the model itself is generating. Could there possibly be a neuron that fires only when the text is being generated by the model, and not when it is being "read" or force-decoded? Is there a now-generating or a now-force-decoding neuron in current language models? Can any current training scheme favour the emergence of such a mechanism?

</div>,

<div className="font-mono">


</div>,

<div className="font-mono">
What is the difference between generation and force-decoding? In either case, the hidden states are computed in the same manner: starting from the token embeddings, sequential transformer blocks are applied (grabbing information from the hidden states of previous tokens through attention) until a probability distribution over all the tokens is produced (through the softmax on the logits). When generating, a sampling algorithm then draws the next token from that probability distribution in some way (for example, greedy decoding picks the highest probability token, temperature reshapes the distribution before sampling, etc.). When force-decoding, the next token is simply taken from an existing text. The only difference between the two modes is where the next token comes from.

</div>,

<div className="font-mono">


</div>,

<div className="font-mono">
When force-decoding, the probability distribution does not matter, because, as it were, the tokens have been sampled from _another_ probability distribution, that of the original text. When generating, the distribution matters, as the tokens are being sampled from it. Could a model _access_ this information? Could it compare the token it is processing with the probability distribution it produced at the previous position, and estimate how likely it is that the current token was sampled from that distribution?

</div>,

<div className="font-mono">


</div>,

<div className="font-mono">
It seems hard to believe that a now-generating neuron can exist, primarily because during training _everything_ is force-decoding. Maybe under the RLHF paradigm, where the answers being scored are sampled from the model itself? But even then the signal may just come from the prompt and the answer following different probability distributions, and have nothing to do with the answer being generated or not (?). Unsure about this last case. But an interesting question, maybe.

</div>,

      ]}
    />
  ),
};
