The difficulty with long context is an essential problem of the RNN layer: unlike the self-attention mechanism, the RNN layer must compress the context into a fixed-size hidden state. As a compression heuristic, the update rule needs to discover the underlying structure and relationships among thousands or even millions of tokens. The researchers first observed that self-supervised learning can compress a large training set into the weights of a model such as an LLM, which often shows a deep understanding of the semantic connections among its training data, which is exactly what they need. Inspired by this, the researchers designed a new class of sequence modeling layers in which the hidden state is itself a model and the update rule is a step of self-supervised learning.
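To make the idea concrete, here is a minimal sketch of such a layer: the hidden state is the weight matrix of a linear model, and each incoming token triggers one gradient step on a self-supervised reconstruction loss. The corruption scheme, learning rate, and function names are assumptions for illustration, not the paper's exact formulation.

```python
import numpy as np

def ttt_style_forward(tokens, lr=0.1):
    """Sketch of a layer whose hidden state is a model: W (the hidden state)
    parameterizes a linear map, and the update rule is one gradient step on a
    self-supervised reconstruction loss per token."""
    d = tokens.shape[1]
    W = np.zeros((d, d))          # hidden state = parameters of a linear model
    outputs = []
    for x in tokens:              # process the sequence token by token
        x_corrupt = 0.5 * x       # toy corrupted view (an illustrative assumption)
        # gradient of 0.5 * ||W @ x_corrupt - x||^2 with respect to W
        grad = np.outer(W @ x_corrupt - x, x_corrupt)
        W = W - lr * grad         # update rule = one step of (test-time) training
        outputs.append(W @ x)     # read the output from the updated model
    return np.stack(outputs)

seq = np.random.randn(8, 4)       # 8 tokens of dimension 4
out = ttt_style_forward(seq)
print(out.shape)                  # (8, 4)
```

Note that the memory footprint stays fixed at the size of W no matter how long the sequence grows, which is the compression property the paragraph describes.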

Since the process of updating the hidden state on a test sequence is equivalent to training the model at test time, this new class of layers is called test-time training (TTT) layers. The researchers introduce two simple instantiations: TTT-Linear and TTT-MLP, in which the hidden state is a linear model and a two-layer MLP, respectively. TTT layers can be integrated into any network architecture and optimized end to end, just like RNN layers and self-attention. TTT layers are already very efficient in FLOPs, and the researchers go a step further and propose two innovations to keep them efficient in actual wall-clock runtime as well.
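The second instantiation can be sketched the same way: here the hidden state is the parameter pair of a two-layer MLP, updated by manual backpropagation through a reconstruction loss. Shapes, the corruption, and the function name are hypothetical choices for illustration only.

```python
import numpy as np

def ttt_mlp_step(state, x, lr=0.1):
    """One step of a TTT-MLP-style layer: the hidden state is (W1, W2), the
    weights of a two-layer ReLU MLP, trained on a self-supervised loss."""
    W1, W2 = state
    x_corrupt = 0.5 * x                      # toy corrupted view (assumption)
    h = np.maximum(W1 @ x_corrupt, 0.0)      # hidden activations
    err = W2 @ h - x                         # reconstruction error
    # backprop through both layers for loss 0.5 * ||W2 relu(W1 x') - x||^2
    gW2 = np.outer(err, h)
    gh = (W2.T @ err) * (h > 0)
    gW1 = np.outer(gh, x_corrupt)
    state = (W1 - lr * gW1, W2 - lr * gW2)   # updated hidden state
    y = state[1] @ np.maximum(state[0] @ x, 0.0)  # output from updated model
    return state, y

d, hdim = 4, 8
rng = np.random.default_rng(0)
state = (0.1 * rng.standard_normal((hdim, d)),
         0.1 * rng.standard_normal((d, hdim)))
state, y = ttt_mlp_step(state, rng.standard_normal(d))
print(y.shape)                               # (4,)
```

The extra layer gives the hidden state more capacity than TTT-Linear at the cost of more FLOPs per update, which is the trade-off between the two instantiations.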