The Open Source LLM Stack
A curated list of the latest open-source LLMs, inference engines and optimizations, agentic frameworks, and the research powering it all.
Latest Open Source LLMs
GLM-5.2
2026-06
- Parameters: 753B (40B active)
- Architecture: MoE
- Context window: 1M
- Modality: Text
- License: MIT
- Recommended Hardware: 8× H200 or 8× B200
MiniMax-M3
2026-06
- Parameters: 428B (23B active)
- Architecture: MoE
- Context window: 1M
- Modality: Text, Image, Video
- License: MiniMax Community License
- Recommended Hardware: 8× H200 or 4× B200
DeepSeek-V4-Pro
2026-04
- Parameters: 1.6T (49B active)
- Architecture: MoE
- Context window: 1M
- Modality: Text
- License: MIT
- Recommended Hardware: 8× H200 or 8× B200
Gemma 4 31B IT
2026-04
- Parameters: 30.7B
- Architecture: Dense
- Context window: 256K
- Modality: Text, Image
- License: Apache 2.0
- Recommended Hardware: 2× H200 or 1× MI325X
Kimi-K2.6
2026-04
- Parameters: 1T (32B active)
- Architecture: MoE
- Context window: 256K
- Modality: Text, Image, Video
- License: Modified MIT
- Recommended Hardware: 8× H200 or 8× B300
MiMo-V2.5-Pro
2026-04
- Parameters: 1.02T (42B active)
- Architecture: MoE
- Context window: 1M
- Modality: Text
- License: MIT
- Recommended Hardware: 2-node 8× H200
MiMo-V2.5
2026-04
- Parameters: 310B (15B active)
- Architecture: MoE
- Context window: 1M
- Modality: Text, Image, Video, Audio
- License: MIT
- Recommended Hardware: 8× H100 or 4× B200
Qwen3.6-27B
2026-04
- Parameters: 27B
- Architecture: Dense
- Context window: 256K
- Modality: Text, Image, Video
- License: Apache 2.0
- Recommended Hardware: 1× H100 or 1× H200
Qwen3.6-35B-A3B
2026-04
- Parameters: 35B (3B active)
- Architecture: MoE
- Context window: 256K
- Modality: Text, Image, Video
- License: Apache 2.0
- Recommended Hardware: 1× H100 or 1× H200
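The parameter counts and hardware pairings above can be sanity-checked with rough arithmetic: weight memory is approximately the total parameter count times the bytes per parameter. The sketch below is illustrative only. The function names are hypothetical, the 141 GB per H200 figure is an assumption to verify against your own hardware, and the precision each listing assumes is not stated on this page. KV cache, activations, and runtime overhead are ignored.

```python
def weight_memory_gb(total_params: float, bits_per_param: float) -> float:
    """Approximate memory for the weights alone, in GB (1e9 bytes)."""
    return total_params * bits_per_param / 8 / 1e9


def fits_in_memory(weights_gb: float, num_gpus: int, gb_per_gpu: float) -> bool:
    """True if the weights alone fit; KV cache and activations need extra room."""
    return weights_gb <= num_gpus * gb_per_gpu


# Dense example: 30.7B parameters at 16 bits per weight.
gemma_bf16 = weight_memory_gb(30.7e9, 16)

# MoE example: all 753B parameters must be resident even though only
# 40B are active per token, so memory follows the total count.
glm_fp8 = weight_memory_gb(753e9, 8)

# Assumed capacity: 141 GB per H200 (check your own hardware).
print(f"{gemma_bf16:.1f} GB, fits on 2x H200: {fits_in_memory(gemma_bf16, 2, 141)}")
print(f"{glm_fp8:.1f} GB, fits on 8x H200: {fits_in_memory(glm_fp8, 8, 141)}")
```

For MoE models the active parameter count mainly affects compute per token, while the total count drives how many GPUs are needed to hold the weights.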
Inference Optimizations
AWQ
Activation-aware weight quantization that protects salient weight channels based on activation magnitude. Achieves better quality than naive quantization at 4-bit precision.
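To make the idea concrete, here is a toy sketch of activation-aware scaling on a single group of four weights. It is a simplification under stated assumptions, not the AWQ implementation: the helper names are hypothetical, the exponent `alpha` is fixed at 0.5 rather than searched, and the activation magnitudes come from one made-up sample instead of a calibration set.

```python
def quantize_rtn(weights, bits=4):
    """Symmetric round-to-nearest quantization of one weight group."""
    qmax = 2 ** (bits - 1) - 1                      # 7 for 4-bit
    scale = max(abs(w) for w in weights) / qmax or 1.0
    q = [max(-qmax - 1, min(qmax, round(w / scale))) for w in weights]
    return q, scale


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


weights = [0.24, -0.70, 0.04, 0.33]   # one output row, four input channels
x = [8.0, 0.5, 0.4, 0.6]              # channel 0 carries large activations
exact = dot(weights, x)

# Baseline: quantize the weights directly.
q, scale = quantize_rtn(weights)
err_rtn = abs(dot([v * scale for v in q], x) - exact)

# Activation-aware: scale each channel by s = |x|**alpha before quantizing,
# and fold 1/s into the activations so the exact product is unchanged.
alpha = 0.5                            # assumed fixed here for illustration
s = [abs(xi) ** alpha for xi in x]
q_awq, scale_awq = quantize_rtn([w * si for w, si in zip(weights, s)])
x_scaled = [xi / si for xi, si in zip(x, s)]
err_awq = abs(dot([v * scale_awq for v in q_awq], x_scaled) - exact)

print(f"round-to-nearest error:  {err_rtn:.4f}")
print(f"activation-aware error: {err_awq:.4f}")
```

In this example the salient channel gets a finer effective quantization step, so the output error drops even though both variants use the same 4-bit grid.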
GGUF Quantization
File format and quantization scheme for llama.cpp. Supports mixed-precision quantization (Q2-Q8) enabling models to run on CPUs and Apple Silicon with configurable quality-speed tradeoffs.
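The quality-size tradeoff can be estimated with simple arithmetic. The sketch below assumes a block layout like llama.cpp's Q4_0 and Q8_0 types, where each block of 32 weights shares one 16-bit scale. The function names are hypothetical, and real GGUF files mix quantization types per tensor, so actual file sizes will differ.

```python
def bits_per_weight(weight_bits: int, block_size: int = 32, scale_bits: int = 16) -> float:
    """Effective bits per weight when each block stores one shared scale."""
    return (block_size * weight_bits + scale_bits) / block_size


def approx_size_gb(n_params: float, bpw: float) -> float:
    """Approximate size of the quantized weights in GB (1e9 bytes)."""
    return n_params * bpw / 8 / 1e9


for name, bits in [("4-bit blocks", 4), ("8-bit blocks", 8)]:
    bpw = bits_per_weight(bits)
    # 27e9 is used only as an example parameter count.
    print(f"{name}: {bpw} bits/weight, about {approx_size_gb(27e9, bpw):.1f} GB for 27B parameters")
```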
GPTQ
Post-training quantization method that compresses model weights to 4-bit or 3-bit precision using approximate second-order information. Enables running large models on consumer GPUs.
PagedAttention
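Virtual-memory-inspired KV cache management that stores each sequence's cache in fixed-size blocks mapped through a block table. This removes external fragmentation and limits waste to the last partially filled block of each sequence, which allows much larger batch sizes and higher throughput.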
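A minimal sketch of the bookkeeping behind this idea follows. It only tracks which physical blocks each sequence owns and stores no tensors. The class and method names are hypothetical and do not correspond to any inference engine's API.

```python
class PagedKVCache:
    """Toy block allocator: each sequence owns a table of physical block ids."""

    def __init__(self, num_blocks: int, block_size: int = 16):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))  # pool of physical blocks
        self.block_tables = {}                      # seq_id -> [block ids]
        self.seq_lens = {}                          # seq_id -> tokens stored

    def append_token(self, seq_id: str) -> None:
        """Reserve room for one more token, allocating a block only when needed."""
        n = self.seq_lens.get(seq_id, 0)
        table = self.block_tables.setdefault(seq_id, [])
        if n == len(table) * self.block_size:       # last block full, or none yet
            if not self.free_blocks:
                raise MemoryError("no free KV blocks")
            table.append(self.free_blocks.pop())
        self.seq_lens[seq_id] = n + 1

    def free(self, seq_id: str) -> None:
        """Return all of a finished sequence's blocks to the pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)

    def wasted_slots(self) -> int:
        """Allocated but unused token slots (only in each last block)."""
        return sum(len(t) * self.block_size - self.seq_lens[s]
                   for s, t in self.block_tables.items())


cache = PagedKVCache(num_blocks=8, block_size=16)
for _ in range(20):
    cache.append_token("a")        # 20 tokens need 2 blocks
for _ in range(5):
    cache.append_token("b")        # 5 tokens need 1 block
waste = cache.wasted_slots()
cache.free("a")                    # blocks become reusable immediately
print(waste, len(cache.free_blocks))
```

Because blocks are allocated on demand, no sequence reserves memory for its maximum possible length up front, and waste per sequence stays below one block.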
Speculative Decoding
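Uses a smaller draft model to propose several candidate tokens, which the larger model then verifies in a single parallel pass. Speedups of 2-3x are commonly reported, and because rejected candidates are replaced with the larger model's own choice, output quality is unchanged.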
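The accept-or-correct loop can be sketched with two toy stand-in "models" that are plain deterministic functions. This shows only the greedy variant. Sampling-based versions use a probabilistic acceptance rule instead. The verification here is written as sequential calls for clarity, whereas a real engine scores all draft positions in one batched forward pass. All function names are hypothetical.

```python
def target_next(tokens):
    """Stand-in for the large model: a deterministic next-token rule."""
    return (tokens[-1] * 3 + len(tokens)) % 11


def draft_next(tokens):
    """Stand-in for the small model: agrees with the target most of the time."""
    guess = (tokens[-1] * 3 + len(tokens)) % 11
    return guess if len(tokens) % 4 else (guess + 1) % 11


def greedy_decode(prompt, n_new):
    out = list(prompt)
    for _ in range(n_new):
        out.append(target_next(out))
    return out


def speculative_decode(prompt, n_new, k=4):
    out = list(prompt)
    target_passes = 0
    while len(out) - len(prompt) < n_new:
        # 1. The draft model proposes k tokens autoregressively (cheap).
        draft = []
        for _ in range(k):
            draft.append(draft_next(out + draft))
        # 2. The target model checks every proposed position (one pass).
        target_passes += 1
        accepted = []
        for i in range(k):
            expected = target_next(out + draft[:i])
            if draft[i] != expected:
                accepted.append(expected)   # correct the first mismatch, stop
                break
            accepted.append(draft[i])
        else:
            accepted.append(target_next(out + draft))  # bonus token
        out.extend(accepted)
    return out[:len(prompt) + n_new], target_passes


prompt = [1, 2]
spec_out, passes = speculative_decode(prompt, n_new=12)
print(spec_out == greedy_decode(prompt, 12), passes)
```

Every emitted token is one the target model would have chosen itself, so the result matches plain greedy decoding while needing fewer target passes than generated tokens.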
FlashAttention
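IO-aware exact attention algorithm that reduces reads and writes to GPU high-bandwidth memory through tiling and recomputation. Reported speedups are in the 2-4x range over standard attention, and memory use grows linearly rather than quadratically with sequence length, which enables longer contexts without approximation.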
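What makes tiling exact is an online softmax: each tile updates a running maximum, a running denominator, and a running output, rescaling earlier partial results as needed. The pure-Python sketch below shows that rescaling for a single query vector. It is not the fused GPU kernel, which also tiles over queries and keeps tiles in on-chip memory. The function names are hypothetical.

```python
import math


def naive_attention(q, keys, values):
    """Reference: materialize all scores, then softmax, then weighted sum."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(len(values[0]))]


def tiled_attention(q, keys, values, tile=2):
    """Process keys/values tile by tile without storing all scores."""
    scale = 1.0 / math.sqrt(len(q))
    m = float("-inf")                  # running max of scores seen so far
    denom = 0.0                        # running softmax denominator
    acc = [0.0] * len(values[0])       # running unnormalized output
    for start in range(0, len(keys), tile):
        k_tile = keys[start:start + tile]
        v_tile = values[start:start + tile]
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in k_tile]
        m_new = max(m, max(scores))
        correction = math.exp(m - m_new)   # 0.0 on the first tile
        p = [math.exp(s - m_new) for s in scores]
        denom = denom * correction + sum(p)
        acc = [a * correction + sum(pi * v[d] for pi, v in zip(p, v_tile))
               for d, a in enumerate(acc)]
        m = m_new
    return [a / denom for a in acc]


q = [0.5, -1.0, 2.0]
keys = [[1.0, 0.0, 0.5], [0.2, 0.3, -0.1], [-1.0, 2.0, 0.7],
        [0.0, 0.0, 1.5], [3.0, -0.5, 0.2]]
values = [[1.0, 2.0], [0.0, -1.0], [4.0, 0.5], [-2.0, 3.0], [0.3, 0.3]]
ref = naive_attention(q, keys, values)
out = tiled_attention(q, keys, values, tile=2)
print(max(abs(a - b) for a, b in zip(ref, out)))
```

The two results agree up to floating-point rounding, which is why the tiled form counts as exact attention rather than an approximation.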
Why Optimizations Matter
Running large language models at scale requires clever engineering. From quantization that shrinks model weights to attention mechanisms that reduce memory usage, these techniques make it possible to serve powerful models on commodity hardware.
Hottest Agentic Frameworks
AutoGen
Microsoft
Multi-agent conversation framework enabling agents to chat with each other to solve tasks. Supports customizable agents, human participation, and diverse conversation patterns.
CrewAI
CrewAI
Role-based multi-agent framework where you define agents with specific roles, goals, and backstories. Agents collaborate to complete complex tasks through delegation and tool use.
LangGraph
LangChain
Framework for building stateful, multi-agent applications as graphs. Supports cycles, persistence, human-in-the-loop, and streaming. The agent orchestration layer of the LangChain ecosystem.
OpenAI Agents SDK
OpenAI
Lightweight Python SDK for building agentic AI apps. Features agent handoffs, guardrails, tracing, and tool integration with a minimal abstraction layer over the OpenAI API.
Pydantic AI
Pydantic
Agent framework built on Pydantic for type-safe AI applications. Offers structured outputs, dependency injection, and a model-agnostic design, from the team behind Pydantic.
smolagents
Hugging Face
Minimalist agent library focused on code agents that write and execute Python. Simple API with tool calling, multi-step reasoning, and tight Hugging Face Hub integration.