Spectral Insights into Data-Oblivious Critical Layers in Large Language Models

Dartmouth College
Figure: Framework overview. Spectral analysis of a pre-fine-tuned model predicts how its layers will change during the next training stage.

Abstract

Understanding how feature representations evolve across layers in large language models (LLMs) is key to improving their interpretability and robustness. While recent studies have identified critical layers linked to specific functions or behaviors, these efforts typically rely on data-dependent analyses of fine-tuned models, limiting their use to post-hoc settings. In contrast, we introduce a data-oblivious approach to identify intrinsic critical layers in pre-fine-tuned LLMs by analyzing representation dynamics via Centered Kernel Alignment (CKA). We show that layers with significant shifts in representation space are also those most affected during fine-tuning—a pattern that holds consistently across tasks for a given model. Our spectral analysis further reveals that these shifts are driven by changes in the top principal components, which encode semantic transitions from rationales to conclusions. We further apply these findings to two practical scenarios: efficient domain adaptation, where fine-tuning critical layers leads to greater loss reduction compared to non-critical layers; and backdoor defense, where freezing them reduces attack success rates by up to 40%.

Introduction

LLMs are remarkably good at generating and understanding text, yet we still know little about how their internal layers process information. Previous work typically identifies "important" layers only after a model has been fine-tuned on a particular dataset, making these findings inherently post-hoc and dataset-specific. But are critical layers an intrinsic property of the model, independent of specific data? If so, can we predict a model's future training behavior from its current state alone?

To investigate these questions, we adopt a different approach: we analyze off-the-shelf (pre-fine-tuned) models and show that certain layers are intrinsically easier to adapt during subsequent fine-tuning. We further demonstrate that each layer's representation dynamics reliably predict its behavior in subsequent training steps, regardless of the dataset used.

Data-oblivious Critical Layers & Representation Dynamics

Critical Layers Identified during Supervised Fine-Tuning

We identify the critical layers during Supervised Fine-Tuning (SFT) by substituting each layer of the fine-tuned model with its counterpart from the pre-fine-tuned model, and then measuring the resulting loss change for each layer. A high value indicates that the layer is more sensitive to the fine-tuning steps.
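To make the probe concrete, here is a minimal sketch of the layer-substitution measurement, assuming two HuggingFace Llama-style causal LMs whose transformer blocks live in `model.model.layers`; the attribute path, batch format, and function name are our assumptions, not the paper's code:

```python
import torch

@torch.no_grad()
def layerwise_loss_reduction(pretrained, finetuned, batch):
    """For each transformer block, swap the fine-tuned block back to its
    pre-fine-tuned counterpart and record how much the SFT loss rises;
    a larger rise means fine-tuning changed that layer more."""
    finetuned.eval()
    base_loss = finetuned(**batch, labels=batch["input_ids"]).loss.item()
    deltas = []
    for i in range(len(finetuned.model.layers)):
        kept = finetuned.model.layers[i]
        finetuned.model.layers[i] = pretrained.model.layers[i]  # substitute
        swapped = finetuned(**batch, labels=batch["input_ids"]).loss.item()
        finetuned.model.layers[i] = kept                        # restore
        deltas.append(swapped - base_loss)
    return deltas
```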

Figure: Llama-2-7B's loss reduction by layer on the Dolly dataset.
Figure: Llama-2-7B's loss reduction by layer on the OpenBookQA dataset.

We find that the same model shows a very similar loss-reduction pattern across different datasets (high values in the middle layers, low values in the last layer), which indicates that the critical layers are determined by the pre-fine-tuned model and are independent of the fine-tuning dataset.

Representation Dynamics of the Pre-fine-tuned Models

Centered Kernel Alignment (CKA) is a popular metric for measuring the similarity between two representation spaces. We use it to quantify how the representation space changes between layer $\ell$ and its neighboring layers, summarized by the average CKA value $\delta^{\ell}$. A smaller $\delta^{\ell}$ indicates a greater representation shift at layer $\ell$ relative to its neighbors; the layers with the largest shifts are called change-point layers.
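For reference, linear CKA for centered representation matrices $X$ and $Y$ is $\|Y^{\top}X\|_F^2 / (\|X^{\top}X\|_F \, \|Y^{\top}Y\|_F)$. Below is a minimal sketch of this metric and of $\delta^{\ell}$; the neighborhood window and function names are our assumptions (the paper may use a different kernel or neighbor definition):

```python
import torch

def linear_cka(X, Y):
    """Linear CKA between two representation matrices (n_samples x dim);
    returns a similarity in [0, 1]."""
    X = X - X.mean(dim=0, keepdim=True)   # center each feature
    Y = Y - Y.mean(dim=0, keepdim=True)
    hsic = torch.linalg.norm(Y.T @ X) ** 2        # ||Y^T X||_F^2
    return (hsic / (torch.linalg.norm(X.T @ X) *
                    torch.linalg.norm(Y.T @ Y))).item()

def delta(reps, l, window=1):
    """Average CKA between layer l and its neighbors within `window`;
    a small value marks a change-point layer. `reps` is a list of
    (n_samples x dim) hidden-state matrices, one per layer."""
    idx = [l + k for k in range(-window, window + 1)
           if k != 0 and 0 <= l + k < len(reps)]
    return sum(linear_cka(reps[l], reps[j]) for j in idx) / len(idx)
```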

Figure: Llama-3.1-8B's CKA similarity and average CKA $\delta^{\ell}$ by layer on the BoolQ and Dolly datasets.

We also observe that these CKA patterns and the change-point layers are independent of the data used to compute CKA, and are instead determined by the pre-fine-tuned model state.

Connecting Critical Layers with Representation Dynamics

The figures above also reveal an interesting phenomenon: the average CKA $\delta^{\ell}$ is negatively correlated with the per-layer loss reduction after SFT. This trend holds across different models and datasets, with consistently strong negative correlation coefficients.

Table: Correlation coefficients between the average CKA $\delta^{\ell}$ and per-layer loss reduction across models and datasets.

Takeaway
During the SFT stage, layers exhibiting greater representation-space shifts prior to fine-tuning tend to undergo more significant modifications than layers with minimal shifts.

Spectral Analysis of Representations

Having established that Representation Dynamics correlate with subsequent training behavior, and knowing that CKA is closely related to spectral properties, we investigate two key questions:

  • Q1: How do Principal Components explain the representation dynamics?
  • Q2: What semantic information is encoded in the Principal Components?

Principal Components Explaining Representation Dynamics

We analyze how representation spaces evolve across layers by examining changes in their principal components, measured with Canonical Correlation Analysis (CCA).
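One plausible way to compute these per-layer alignment scores is to project each layer's representations onto their top-K principal components and measure canonical correlations between adjacent layers. The sketch below follows that reading; the exact pipeline in the paper may differ:

```python
import numpy as np
from sklearn.cross_decomposition import CCA
from sklearn.decomposition import PCA

def topk_cca(reps_a, reps_b, k=3):
    """Average canonical correlation between the top-k principal-component
    scores of two layers' representations (n_samples x dim)."""
    za = PCA(n_components=k).fit_transform(reps_a)  # top-k PC scores, layer a
    zb = PCA(n_components=k).fit_transform(reps_b)  # top-k PC scores, layer b
    ua, ub = CCA(n_components=k).fit(za, zb).transform(za, zb)
    # Correlation of each canonical variate pair, then averaged.
    corrs = [np.corrcoef(ua[:, i], ub[:, i])[0, 1] for i in range(k)]
    return float(np.mean(corrs))
```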

Figure: Average CCA values of the top-K principal components across layers in the LLaMA2-7B-Chat model on the Dolly dataset.

We observe a strong alignment between the average CCA values of the top-3 principal components and the average CKA patterns observed previously, especially at the change-point layers. This indicates that the top-3 principal components drive the representation shifts identified in the previous analysis.

Semantic Information in Principal Components

We also investigate the semantic information encoded in the principal components by removing them at change-point layers and observing how the model's output changes.
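A minimal sketch of this intervention, assuming a HuggingFace Llama-style model where a decoder block's forward output is a tuple whose first element holds the hidden states; `change_point` and the hook mechanics here are illustrative assumptions:

```python
import torch

def remove_top_pcs(hidden, k=3):
    """Project the top-k principal components out of a token matrix
    (n_tokens x dim), keeping the residual representation."""
    centered = hidden - hidden.mean(dim=0, keepdim=True)
    _, _, Vh = torch.linalg.svd(centered, full_matrices=False)
    top = Vh[:k]                              # top-k principal directions
    return hidden - (hidden @ top.T) @ top    # subtract their projection

def make_hook(k=3):
    # Forward hook that edits a block's hidden states in place; returning
    # a value from a forward hook replaces the module's output.
    def hook(module, inputs, output):
        h = output[0]
        flat = h.reshape(-1, h.shape[-1]).float()
        edited = remove_top_pcs(flat, k).to(h.dtype).reshape(h.shape)
        return (edited,) + tuple(output[1:])
    return hook

# handle = model.model.layers[change_point].register_forward_hook(make_hook(3))
# ... generate and compare outputs, then: handle.remove()
```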

Figure: Case study of the effects of removing different principal components at various layers.

We find that the top-3 principal components at critical layers (which cause representation shifts) play a key role in summarizing rationales to derive conclusions during reasoning. In contrast, other principal components are primarily associated with formatting and template-related aspects.

Applications of Critical Layers

Finally, we present two key applications of identifying data-oblivious critical layers:

Efficient Domain Adaptation

Critical layers can be leveraged for efficient domain adaptation when resource constraints restrict fine-tuning to a subset of layers. We find that fine-tuning only the critical layers reduces the loss faster than fine-tuning only non-critical layers, as in the sketch below.
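A minimal sketch of this selective fine-tuning, again assuming a Llama-style `model.model.layers` layout; `critical_layers` is a hypothetical list of layer indices obtained from the $\delta^{\ell}$ analysis:

```python
def set_trainable_layers(model, layer_ids):
    """Freeze all parameters, then unfreeze only the chosen transformer
    blocks (e.g., the data-oblivious critical layers)."""
    for p in model.parameters():
        p.requires_grad = False
    for i in layer_ids:
        for p in model.model.layers[i].parameters():
            p.requires_grad = True

# Example: adapt only the critical layers, then run the usual SFT loop.
# set_trainable_layers(model, critical_layers)
```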

Figure: Test loss curves for fine-tuning LLaMA-2-7B-Chat on the Dolly and OpenBookQA datasets, training only the critical layers, only the non-critical layers, or the full model.

Targeted Defense Against Backdoor Attacks

Model robustness can be improved by preventing harmful updates from modifying the critical layers. We find that freezing the critical layers during fine-tuning significantly reduces attack success rates under backdoor attacks.
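The defense is simply the complement of the adaptation recipe above: keep the critical layers frozen while the rest of the model trains on the (possibly poisoned) data. Reusing the hypothetical `set_trainable_layers` helper from the previous sketch:

```python
# Defense: freeze the critical layers and let every other block adapt.
# `model` and `critical_layers` are assumed to be defined as above.
n_layers = len(model.model.layers)
non_critical = [i for i in range(n_layers) if i not in set(critical_layers)]
set_trainable_layers(model, non_critical)  # critical layers stay frozen
```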

Figure: Backdoor attack success rates with and without freezing the critical layers.

Freezing the critical layers reduces attack success rates by up to 40%, and the defense remains effective across different trigger types.

BibTeX

@misc{liu2025spectralinsightsdataobliviouscritical,
      title={Spectral Insights into Data-Oblivious Critical Layers in Large Language Models}, 
      author={Xuyuan Liu and Lei Hsiung and Yaoqing Yang and Yujun Yan},
      year={2025},
      eprint={2506.00382},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2506.00382}, 
}