
The whale opened 2026 with a bang! They dropped two really cool, high-impact papers that tackle the training stability of LLMs while also expanding on some important ideas like memory modules and stable residual connections. The paper "Conditional Memory via Scalable Lookup: A New Axis of Sparsity for Large Language Models" particularly resonated with me, because it advances an idea I have long been an advocate for: disentangling raw, general, transferable reasoning from static knowledge/memory. The paper takes a meaningful step in that direction, although its main goal is better training stability and earlier learning of advanced features/reasoning, so that compute can be allocated efficiently to dynamic knowledge rather than static knowledge such as names of people, locations, etc. In other words, it separates "static pattern storage" from "dynamic computation" structurally, rather than forcing everything through the same dense Transformer computation.

The paper introduces the "Engram Module," a learnable memory module that gives the model a dedicated place to store factual and lexical patterns (rare names, phrases, domain terms, local collocations). In practice, this lets us train compute-efficient reasoning models that still handle those factual and lexical patterns efficiently.


Engram Architecture Overview

At selected layers of the transformer, Engram adds a module that does two things at each position $t$ (sketched in code right after this list):

  1. Retrieval: Look at the suffix N-gram ending at position $t$ (like the last 2–3 tokens), map it to indices in huge embedding tables using deterministic hashing, retrieve the corresponding vectors, and concatenate them into a memory vector $e$.
  2. Fusion: Use the current hidden state $h_t$ as a context-aware "query" to decide how much to trust the retrieved memory (because hash collisions/polysemy can inject noise), then lightly refine it (a small causal convolution), and add it back via a residual connection.
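
To make these two steps concrete, here is a minimal PyTorch sketch of what such a module could look like. Everything below (the class name `EngramSketch`, the single rolling hash, the table size, and the sigmoid gate) is my own simplification for illustration; the paper's actual design uses multi-head hashing over its own tables and its own fusion details.

```python
import torch
import torch.nn as nn

class EngramSketch(nn.Module):
    """Toy sketch of the two-step Engram idea: hashed N-gram lookup + gated fusion.
    Table size, hashing, and gating are illustrative, not the paper's exact design."""

    def __init__(self, d_model: int, table_size: int = 2**20, n: int = 2):
        super().__init__()
        self.n = n
        self.table_size = table_size
        # Embedding table indexed by the hashed suffix N-gram.
        self.table = nn.Embedding(table_size, d_model)
        # Context-aware gate: the hidden state decides how much to trust the lookup.
        self.gate = nn.Linear(2 * d_model, d_model)
        # Light local refinement: a small depthwise causal convolution.
        self.conv = nn.Conv1d(d_model, d_model, kernel_size=3, padding=2, groups=d_model)

    def _hash(self, ngrams: torch.Tensor) -> torch.Tensor:
        # Deterministic polynomial hash of the suffix N-gram -> index into the table.
        idx = torch.zeros_like(ngrams[..., 0])
        for k in range(ngrams.shape[-1]):
            idx = (idx * 1000003 + ngrams[..., k]) % self.table_size
        return idx

    def forward(self, token_ids: torch.Tensor, h: torch.Tensor) -> torch.Tensor:
        # token_ids: (batch, seq) canonical IDs; h: (batch, seq, d_model) hidden states.
        B, T = token_ids.shape
        # 1) Retrieval: build the suffix N-gram ending at each position t (pad the left edge).
        pad = token_ids.new_zeros(B, self.n - 1)
        padded = torch.cat([pad, token_ids], dim=1)
        ngrams = torch.stack([padded[:, i:i + T] for i in range(self.n)], dim=-1)
        e = self.table(self._hash(ngrams))                    # (B, T, d_model)
        # 2) Fusion: gate the memory with the current hidden state, refine, add residually.
        g = torch.sigmoid(self.gate(torch.cat([h, e], dim=-1)))
        m = (g * e).transpose(1, 2)                           # (B, d_model, T)
        m = self.conv(m)[..., :T].transpose(1, 2)             # keep first T outputs -> causal
        return h + m

# Quick shape check with dummy inputs.
mod = EngramSketch(d_model=64, table_size=2**16, n=2)
out = mod(torch.randint(0, 1000, (2, 16)), torch.randn(2, 16, 64))
print(out.shape)  # torch.Size([2, 16, 64])
```

The gate is the important part: because hash collisions and polysemy can pull in the wrong memory vector, the hidden state $h_t$ decides, feature by feature, how much of the retrieved vector actually gets added back.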

Now let's go through the architecture in more detail, understand its core components, and see why it works (with some examples along the way).

Tokenizer Compression: Make IDs "Semantically Denser"

The way tokenizers for LLMs are trained (Byte Pair Encoding) can leave similar words or tokens with different token IDs and hence different embedding vectors: two surface forms like "apple" vs. "Apple" get two different token IDs and therefore two unrelated embeddings. Engram compresses these via a precomputed surjection from raw token IDs to canonical IDs.

This shrinks the vocabulary of the standard tokenizer by mapping similar tokens to the same canonical ID (the paper reports ~23% compression for a 128k tokenizer).

Concrete example: Tokens like "A", "a", " a", and accented variants all get merged into one canonical representative, which increases how often "the same meaning" token shows up in N-grams and improves lookup quality.
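
As a toy illustration of how such a precomputed surjection could be built, the sketch below collapses case, leading spaces, and accents. The specific normalization rules here are my own guess at the flavor of the mapping, not the paper's exact recipe.

```python
import unicodedata

def canonicalize(token: str) -> str:
    # Strip leading spaces, accents, and case so that surface variants
    # of the "same" token collapse onto one canonical form.
    t = token.lstrip()
    t = unicodedata.normalize("NFKD", t)
    t = "".join(c for c in t if not unicodedata.combining(c))
    return t.lower()

# Hypothetical slice of a 128k BPE vocabulary.
vocab = ["A", "a", " a", "á", "Apple", " apple", "apple"]

# Precompute the surjection: raw token ID -> canonical ID.
canon_to_id = {}
raw_to_canon = []
for tok in vocab:
    key = canonicalize(tok)
    canon_to_id.setdefault(key, len(canon_to_id))
    raw_to_canon.append(canon_to_id[key])

print(raw_to_canon)                                   # [0, 0, 0, 0, 1, 1, 1]
print(len(canon_to_id), "canonical IDs for", len(vocab), "raw tokens")
```

Since the map is precomputed once over the vocabulary, applying it at runtime is just an array lookup.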

Sparse Retrieval via Multi-Head Hashing

For each position $t$, we form the suffix N-gram (the paper uses bigrams, $n=2$, and trigrams, $n=3$). We will show a detailed example with both bigrams and trigrams later, but for now let's say we are dealing with bigrams ($n=2$).

Then the suffix bigram at position $t$ is $g(t, 2) = (x_{t-1}, x_t)$.
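
As a quick sanity check (with made-up token IDs), the suffix bigram and trigram at a given position look like this:

```python
# Suffix N-grams over a toy sequence of canonical token IDs (values are made up).
x = [17, 403, 88, 9021, 55]          # x_1 ... x_5

def suffix_ngram(x, t, n):
    """g(t, n): the n tokens ending at position t (1-indexed, as in the text)."""
    return tuple(x[t - n:t])

print(suffix_ngram(x, 4, 2))   # bigram  (x_3, x_4)      -> (88, 9021)
print(suffix_ngram(x, 4, 3))   # trigram (x_2, x_3, x_4) -> (403, 88, 9021)
```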

Now the next question becomes: there is a huge number of possible bigrams over the canonical IDs (not truly infinite, but $V^2$ ordered pairs, where $V$ is the number of canonical IDs). Do we have to learn a separate embedding for each of those combinations, or is there a smart trick we can use to learn the weights associated with each bigram in the Engram embedding table?
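
To get a feel for the scale, plug in the numbers from above: ~23% compression of a 128k vocabulary leaves roughly $9.86 \times 10^4$ canonical IDs, so a naive one-row-per-bigram table would need

$$
\left(9.86 \times 10^{4}\right)^{2} \approx 9.7 \times 10^{9} \ \text{rows},
$$

and at an assumed (for illustration) embedding width of $d = 512$ that is on the order of $5 \times 10^{12}$ parameters for bigrams alone, which is clearly not something we can materialize, let alone learn directly.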

To solve this, we can fuse two concepts that make learning the Engram module weights feasible: