Developing a more robust understanding of how neural networks work is critical both to improving them and to leveraging them for novel applications. This paper (https://lnkd.in/gXQ988Wt) sheds light on one such aspect. If you map this to neuroscience language:
High-curvature subspaces resemble dynamic cortical networks, responsible for flexible, context-sensitive reasoning, whereas low-curvature ones resemble long-term memory storage, which are stable, low-energy patterns that can persist without change.
In effect, the “AI brain” shows a functional separation between memory circuits and reasoning circuits, emergent from optimization dynamics.
So, what are these optimization dynamics?
The crux of the article lies in this sentence segment: “decomposing the models’ weight matrices into components, ordered from ‘high curvature’ to ‘low curvature’”.
What exactly is decomposition? Well, something similar to PCA.
Every neural network, including a large language model (LLM), contains weight matrices that connect neurons between layers. These matrices determine how input activations are transformed at each step (i.e., how information flows).
“Decomposing” those matrices means representing the internal structure of the network (or a layer) in terms of distinct modes or directions of variation. Each mode corresponds to a particular pattern in parameter space that affects how the model behaves.
By analyzing these modes, we can identify which directions correspond to sensitive, information-rich aspects (reasoning) and which correspond to stable, memorization-oriented aspects.
As mentioned earlier, this is analogous to Principal Component Analysis (PCA) in statistics, except that instead of variance, we are measuring curvature (sensitivity of loss to parameter changes).
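To make the PCA analogy concrete, here is a minimal numpy sketch (illustrative only, not the paper's procedure) that decomposes a toy weight matrix into ordered rank-1 modes via SVD; the paper orders components by curvature instead of singular value:

```python
import numpy as np

# Toy weight matrix for one layer (shape and values are made up)
rng = np.random.default_rng(0)
W = rng.normal(size=(6, 4))

# SVD expresses W as a sum of rank-1 "modes": W = sum_i s_i * u_i v_i^T
U, s, Vt = np.linalg.svd(W, full_matrices=False)

# Reconstructing from all modes recovers W exactly
W_rec = (U * s) @ Vt
assert np.allclose(W, W_rec)

# Each mode is a direction in parameter space; here they are ordered by
# singular value (strength), the analogue of variance in PCA.
for i, sv in enumerate(s):
    print(f"mode {i}: strength {sv:.3f}")
```

The singular values come out sorted from largest to smallest, which is the PCA-style ordering; the paper's contribution is to rank components by loss curvature rather than by this kind of magnitude.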
What does “curvature” mean in this context?
Curvature refers to the second derivative (Hessian) of the loss function with respect to model parameters. An intuitive way to think about it is that in high curvature directions, loss changes rapidly if you move a little along that direction.
These correspond to important, delicate structures in the model, which typically include parts that encode reasoning or compositional transformations.
In low curvature directions, the loss hardly changes when you move along that direction. These directions tend to store redundant or memorized patterns that don’t affect general reasoning much.
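A toy quadratic loss makes the high- vs low-curvature contrast concrete. The Hessian below is hypothetical; its eigenvalues are the curvatures along its eigenvectors, and equal-sized steps along those eigenvectors change the loss very differently:

```python
import numpy as np

# Toy quadratic loss L(w) = 0.5 * w^T H w with a fixed, made-up Hessian H.
# The eigenvalues of H are the curvatures along its eigenvectors.
H = np.diag([100.0, 0.01])           # one high-, one low-curvature direction
eigvals, eigvecs = np.linalg.eigh(H) # eigenvalues sorted ascending

def loss(w):
    return 0.5 * w @ H @ w

w0 = np.zeros(2)
step = 0.1
# A step along the high-curvature eigenvector changes the loss a lot...
high = loss(w0 + step * eigvecs[:, 1])  # curvature 100
# ...the same-sized step along the low-curvature one barely changes it.
low = loss(w0 + step * eigvecs[:, 0])   # curvature 0.01
print(high, low)
```

The two loss changes differ by a factor of 10,000, exactly the ratio of the two curvatures: sensitive directions (reasoning-like) versus flat directions (memorization-like).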
Computing the full curvature (the Hessian matrix) for an LLM is infeasible: for a model with billions of parameters, the Hessian would have billions-squared entries. K-FAC (Kronecker-Factored Approximate Curvature) provides an efficient approximation.
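A rough sketch of the Kronecker-factored idea for a single linear layer, assuming the standard K-FAC setup (an input-activation statistic A and an output-gradient statistic G; all sizes and data here are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_out, batch = 64, 32, 256

# a: the layer's inputs; g: gradients of the loss w.r.t. the layer's outputs
a = rng.normal(size=(batch, n_in))
g = rng.normal(size=(batch, n_out))

# K-FAC approximates the layer's curvature block as a Kronecker product
# A ⊗ G, with the two small factors estimated from batch statistics.
A = a.T @ a / batch        # (n_in, n_in)
G = g.T @ g / batch        # (n_out, n_out)

# Eigenvalues of A ⊗ G are all products of eigenvalues of A and G, so
# curvature directions come cheaply without forming the full block.
eA = np.linalg.eigvalsh(A)
eG = np.linalg.eigvalsh(G)
curvatures = np.sort(np.outer(eA, eG).ravel())[::-1]

full = (n_in * n_out) ** 2           # entries in the exact curvature block
approx = n_in**2 + n_out**2          # entries K-FAC actually stores
print(full, approx)                  # 4194304 vs 5120
```

Even at this toy scale the storage drops by roughly three orders of magnitude, which is what makes curvature-ordered decompositions tractable for large models.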
So the key findings were:
Memory (rote factual recall, arithmetic, narrow tasks) and reasoning (generalization, logical problem solving) appear to live in different parts of the model architecture.
Low-curvature components: strongly connected to memorization of facts.
High-curvature components: more involved in reasoning and solving new problems.
The separation suggests that one could target the memorization pathways without damaging the reasoning capacity of the model.
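As a hedged illustration of that idea (using singular values as a stand-in ranking, not the paper's K-FAC curvature), one can zero out the lowest-ranked components of a toy weight matrix while leaving the top-ranked ones untouched:

```python
import numpy as np

rng = np.random.default_rng(1)
W = rng.normal(size=(8, 8))   # toy weight matrix, values are made up

# Stand-in ranking: singular values as a proxy for curvature
# (the paper ranks components by K-FAC curvature instead).
U, s, Vt = np.linalg.svd(W)

# "Ablate" the lowest-ranked half of the components, the way one might
# target memorization pathways while sparing the top (reasoning) ones.
k = len(s) // 2
s_edited = s.copy()
s_edited[k:] = 0.0
W_edited = (U * s_edited) @ Vt

# The top component passes through the edited matrix unchanged:
top_in = Vt[0]   # input direction of the highest-ranked component
assert np.allclose(W @ top_in, W_edited @ top_in)
```

The assertion holds because each rank-1 component acts independently on its own input direction, so deleting the low-ranked ones leaves the top-ranked behavior exactly intact; this is the sense in which targeted editing could spare reasoning capacity.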