Research Papers
The foundational and breakthrough papers driving the open-source LLM revolution.
Mixtral of Experts
Jiang et al. (Mistral AI) · 2024-01-08
Sparse Mixture-of-Experts model using 8 expert networks with top-2 routing. Matches or outperforms Llama 2 70B while using only 13B active parameters per token.
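To make the top-2 routing idea concrete, here is a minimal pure-Python sketch. It assumes the router scores arrive as a plain list of logits (in a real model they would come from a learned gating layer), and the experts are hypothetical toy functions rather than real feed-forward networks. The gate weights are a softmax over the two selected logits only, so just two of the eight experts run for each token.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def top2_route(router_logits):
    """Return [(expert_index, weight), ...] for the two highest-scoring experts."""
    ranked = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)
    chosen = ranked[:2]
    # Normalize over the two selected logits only, not over all experts.
    weights = softmax([router_logits[i] for i in chosen])
    return list(zip(chosen, weights))

def moe_forward(x, router_logits, experts):
    """Run only the two selected experts and mix their outputs by gate weight."""
    routes = top2_route(router_logits)
    outputs = [(w, experts[i](x)) for i, w in routes]
    return [sum(w * y[d] for w, y in outputs) for d in range(len(x))]

# Hypothetical toy experts: expert i simply scales its input by (i + 1).
def make_expert(i):
    return lambda x: [(i + 1) * v for v in x]

experts = [make_expert(i) for i in range(8)]
logits = [0.1, 2.0, -1.0, 2.0, 0.0, 0.0, 0.0, 0.0]  # assumed router scores
routes = top2_route(logits)
y = moe_forward([1.0, 2.0], logits, experts)
```

The other six experts are never called for this token, which is why the active parameter count is much smaller than the total.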
Efficient Memory Management for Large Language Model Serving with PagedAttention
Kwon et al. (UC Berkeley) · 2023-09-12
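Applied virtual-memory paging concepts to KV cache management in LLM serving. PagedAttention stores the cache in fixed-size blocks that need not be contiguous, which nearly eliminates memory fragmentation, enables a 2-4x throughput improvement over prior serving systems, and forms the basis of vLLM.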
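The paging analogy can be sketched as a per-sequence block table that maps logical token positions to physical cache blocks. The class below is a hypothetical illustration, not vLLM's API: the names `PagedKVCache`, `append_token`, and `free` are assumptions, and it tracks only block bookkeeping, not the key/value tensors themselves.

```python
class PagedKVCache:
    """Toy block allocator: each sequence owns a list of fixed-size blocks."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))  # physical block ids
        self.block_tables = {}                      # seq_id -> [physical block id, ...]
        self.seq_lens = {}                          # seq_id -> tokens stored so far

    def append_token(self, seq_id):
        """Reserve a slot for one new token; return (physical_block, offset)."""
        table = self.block_tables.setdefault(seq_id, [])
        length = self.seq_lens.get(seq_id, 0)
        # Allocate a new block only when the last one is full (or none exists),
        # so at most one partially filled block is wasted per sequence.
        if length % self.block_size == 0:
            if not self.free_blocks:
                raise MemoryError("no free KV blocks")
            table.append(self.free_blocks.pop(0))
        self.seq_lens[seq_id] = length + 1
        return table[length // self.block_size], length % self.block_size

    def free(self, seq_id):
        """Return all of a finished sequence's blocks to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)

cache = PagedKVCache(num_blocks=4, block_size=2)
slots_a = [cache.append_token("a") for _ in range(3)]
slot_b1 = cache.append_token("b")
cache.free("a")                      # blocks from "a" become reusable at once
slot_b2 = cache.append_token("b")
slot_b3 = cache.append_token("b")
```

Because blocks are allocated on demand and need not be adjacent, a sequence never has to reserve its maximum possible length up front.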
LLaMA: Open and Efficient Foundation Language Models
Touvron et al. (Meta) · 2023-02-27
Demonstrated that smaller models trained on more tokens can match larger models. Catalyzed the open-source LLM movement by releasing weights from 7B to 65B parameters.
FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness
Dao et al. · 2022-05-27
Proposed an IO-aware attention algorithm that uses tiling to substantially reduce reads and writes between GPU HBM and on-chip SRAM. Reported wall-clock speedups of roughly 2-4x and enabled longer sequences while still computing exact attention, with no approximation.
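The numerical trick that lets tiling stay exact is a running ("online") softmax: each tile of keys and values updates a running maximum, a running normalizer, and an accumulated output. The sketch below shows this for a single query in pure Python. It illustrates the arithmetic only and does not model HBM or SRAM traffic; the function names and the tile size are assumptions chosen for illustration.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def naive_attention(q, keys, values):
    """Reference: materialize every score, then apply softmax and mix values."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [dot(q, k) * scale for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    z = sum(weights)
    return [sum(w * v[j] for w, v in zip(weights, values)) / z
            for j in range(len(values[0]))]

def tiled_attention(q, keys, values, tile=2):
    """Same result, but only one tile of scores exists at any time."""
    scale = 1.0 / math.sqrt(len(q))
    running_max = float("-inf")
    running_sum = 0.0
    acc = [0.0] * len(values[0])
    for start in range(0, len(keys), tile):
        k_tile = keys[start:start + tile]
        v_tile = values[start:start + tile]
        scores = [dot(q, k) * scale for k in k_tile]
        new_max = max(running_max, max(scores))
        # Rescale earlier partial results to the new maximum
        # (the factor is 0.0 on the first tile, when nothing is accumulated yet).
        correction = math.exp(running_max - new_max)
        weights = [math.exp(s - new_max) for s in scores]
        running_sum = running_sum * correction + sum(weights)
        acc = [a * correction + sum(w * v[j] for w, v in zip(weights, v_tile))
               for j, a in enumerate(acc)]
        running_max = new_max
    return [a / running_sum for a in acc]

q = [0.5, -1.0, 2.0]
keys = [[1.0, 0.0, 1.0], [0.0, 1.0, 0.0], [2.0, 1.0, -1.0],
        [0.5, 0.5, 0.5], [-1.0, 2.0, 3.0]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0], [9.0, 10.0]]
ref = naive_attention(q, keys, values)
out = tiled_attention(q, keys, values, tile=2)
```

The tiled version never holds the full score vector, which is the property that allows the full attention matrix to stay out of slow memory.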
LoRA: Low-Rank Adaptation of Large Language Models
Hu et al. (Microsoft) · 2021-06-17
Introduced low-rank decomposition for parameter-efficient fine-tuning. Freezes the pretrained weights and injects trainable rank-decomposition matrices, reducing the number of trainable parameters by up to 10,000x.
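As a rough illustration of the low-rank update, the sketch below computes h = W x + (alpha / r) * B (A x) with plain Python lists, where W is frozen and only A and B would be trained. The helper names, the tiny matrices, and the layer sizes passed to `trainable_params` are assumptions for illustration. B starts at zero, so the adapted layer initially behaves exactly like the pretrained one.

```python
def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def lora_forward(W, A, B, x, alpha):
    """h = W x + (alpha / r) * B (A x); W is frozen, A and B are trainable."""
    r = len(A)                      # rank = number of rows in A
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    return [b + (alpha / r) * u for b, u in zip(base, update)]

def trainable_params(d, k, r):
    """Parameters to train for one d x k matrix: full fine-tuning vs. low-rank."""
    return d * k, r * (d + k)

W = [[1.0, 2.0, 0.0],
     [0.0, 1.0, -1.0]]              # frozen pretrained weight, d=2, k=3
A = [[0.1, 0.2, 0.3]]               # r x k, trainable
B = [[0.0],
     [0.0]]                         # d x r, trainable, zero-initialized
x = [1.0, 1.0, 1.0]

h_init = lora_forward(W, A, B, x, alpha=2.0)        # equals W x at initialization
B_trained = [[1.0], [-1.0]]                          # hypothetical values after training
h_tuned = lora_forward(W, A, B_trained, x, alpha=2.0)

full, low_rank = trainable_params(d=4096, k=4096, r=8)  # assumed layer size and rank
```

For the assumed 4096 x 4096 matrix with rank 8, the low-rank factors hold 256 times fewer trainable parameters than the full matrix; the exact saving in practice depends on the rank and on which matrices are adapted.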
Attention Is All You Need
Vaswani et al. · 2017-06-12
Introduced the Transformer architecture based entirely on self-attention mechanisms, replacing recurrence and convolutions. The foundational architecture behind nearly all modern LLMs.