Rotary Position Embeddings (RoPE) have proven effective at encoding positional information in transformer-based language models. However, these models fail to generalize past the sequence length they were trained on. We present YaRN (Yet another RoPE extensioN method), a compute-efficient method for extending the context window of such models, requiring 10x fewer tokens and 2.5x fewer training steps than previous approaches. Using YaRN, we show that LLaMA models can effectively utilize and extrapolate to context lengths far longer than their original pre-training would allow, while also surpassing the previous state-of-the-art in context window extension. In addition, we demonstrate that YaRN can extrapolate beyond the limited context of a fine-tuning dataset. We release checkpoints of Llama 2 7B/13B fine-tuned using YaRN with 64k and 128k context windows at https://github.com/jquesnelle/yarn.
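
To make the setting concrete, the sketch below shows a minimal NumPy implementation of RoPE with a simple position-scaling knob, illustrating the general family of RoPE-based context-window extension that YaRN belongs to. This is an assumption-laden illustration, not YaRN itself: the `scale` parameter here implements plain linear position interpolation, whereas YaRN applies a more refined per-frequency interpolation scheme; the function names `rope_angles` and `apply_rope` are hypothetical.

```python
# Illustrative sketch only: minimal RoPE with a position-scaling knob.
# This is NOT the YaRN interpolation rule; `scale` here is simple linear
# position interpolation, shown to convey the general idea of mapping
# positions beyond the training length back into the trained range.
import numpy as np

def rope_angles(seq_len: int, head_dim: int, base: float = 10000.0,
                scale: float = 1.0) -> np.ndarray:
    """Return rotation angles of shape (seq_len, head_dim // 2).

    scale > 1 compresses positions so that sequences longer than the
    original training length reuse the angle range seen during training.
    """
    inv_freq = 1.0 / (base ** (np.arange(0, head_dim, 2) / head_dim))
    positions = np.arange(seq_len) / scale  # compress positions by `scale`
    return np.outer(positions, inv_freq)

def apply_rope(x: np.ndarray, angles: np.ndarray) -> np.ndarray:
    """Rotate channel pairs of x (seq_len, head_dim) by the given angles."""
    x1, x2 = x[:, 0::2], x[:, 1::2]
    cos, sin = np.cos(angles), np.sin(angles)
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

# Example: a model trained with 4k positions queried at 16k positions
# by stretching the position range with scale=4.
q = np.random.randn(16384, 64)
q_rot = apply_rope(q, rope_angles(seq_len=16384, head_dim=64, scale=4.0))
```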