Tracking the open-source LLM ecosystem

The Open Source LLM Stack

A curated list of the latest open-source LLMs, inference engines and optimizations, agentic frameworks, and the research powering it all.

Models

Latest Open Source LLMs

GLM-5.2

2026-06

Parameters: 753B (40B active)
Architecture: MoE
Context window: 1M
Modality: Text
License: MIT
Recommended Hardware: 8× H200, 8× B200

MiniMax-M3

2026-06

Parameters: 428B (23B active)
Architecture: MoE
Context window: 1M
Modality: Text, Image, Video
License: MiniMax Community License
Recommended Hardware: 8× H200, 4× B200

DeepSeek-V4-Pro

2026-04

Parameters: 1.6T (49B active)
Architecture: MoE
Context window: 1M
Modality: Text
License: MIT
Recommended Hardware: 8× H200, 8× B200

Gemma 4 31B IT

2026-04

Parameters: 30.7B
Architecture: Dense
Context window: 256K
Modality: Text, Image
License: Apache 2.0
Recommended Hardware: 2× H200, 1× MI325X

Kimi-K2.6

2026-04

Parameters: 1T (32B active)
Architecture: MoE
Context window: 256K
Modality: Text, Image, Video
License: Modified MIT
Recommended Hardware: 8× H200, 8× B300

MiMo-V2.5-Pro

2026-04

Parameters: 1.02T (42B active)
Architecture: MoE
Context window: 1M
Modality: Text
License: MIT
Recommended Hardware: 2-node 8× H200

MiMo-V2.5

2026-04

Parameters: 310B (15B active)
Architecture: MoE
Context window: 1M
Modality: Text, Image, Video, Audio
License: MIT
Recommended Hardware: 8× H100, 4× B200

Qwen3.6-27B

2026-04

Parameters: 27B
Architecture: Dense
Context window: 256K
Modality: Text, Image, Video
License: Apache 2.0
Recommended Hardware: 1× H100, 1× H200

Qwen3.6-35B-A3B

2026-04

Parameters: 35B (3B active)
Architecture: MoE
Context window: 256K
Modality: Text, Image, Video
License: Apache 2.0
Recommended Hardware: 1× H100, 1× H200

Optimizations

Inference Optimizations

Quantization

AWQ

Activation-aware weight quantization that protects salient weight channels based on activation magnitude. Achieves better quality than naive quantization at 4-bit precision.
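As a rough illustration of the idea above, the following pure-Python sketch scales each input channel by its mean activation magnitude raised to a power `alpha` before round-to-nearest quantization, then undoes the scaling. The function names, the toy 1×2 weight matrix, the calibration inputs, and the small `alpha` grid are all assumptions made for this example; real AWQ implementations work group-wise on GPU tensors and differ in many details.

```python
def quantize_row(row, bits=4):
    """Symmetric round-to-nearest quantization of one row; returns dequantized values."""
    qmax = 2 ** (bits - 1) - 1
    step = max(abs(w) for w in row) / qmax or 1.0  # avoid a zero step for all-zero rows
    return [max(-qmax, min(qmax, round(w / step))) * step for w in row]


def output_error(W, W_hat, X):
    """Mean absolute difference between W @ x and W_hat @ x over calibration inputs."""
    total = 0.0
    for x in X:
        for row, row_hat in zip(W, W_hat):
            y = sum(w * xi for w, xi in zip(row, x))
            y_hat = sum(w * xi for w, xi in zip(row_hat, x))
            total += abs(y - y_hat)
    return total / len(X)


def awq_like_quantize(W, X, bits=4, alphas=(0.0, 0.25, 0.5, 0.75, 1.0)):
    """Search a per-channel scale s = act_mag ** alpha that minimizes output error."""
    n_in = len(W[0])
    act_mag = [sum(abs(x[j]) for x in X) / len(X) for j in range(n_in)]
    best = None
    for alpha in alphas:
        s = [max(m, 1e-8) ** alpha for m in act_mag]
        scaled = [[w * sj for w, sj in zip(row, s)] for row in W]
        deq = [quantize_row(row, bits) for row in scaled]
        # Dividing by s here is equivalent to scaling the activations by 1/s at runtime.
        W_hat = [[w / sj for w, sj in zip(row, s)] for row in deq]
        err = output_error(W, W_hat, X)
        if best is None or err < best[0]:
            best = (err, alpha, W_hat)
    return best


# Hypothetical toy data: channel 0 has a small weight but sees large activations.
W = [[0.3, 1.0]]
X = [[10.0, 0.1], [-10.0, 0.1]]
naive_err = output_error(W, [quantize_row(r) for r in W], X)
awq_err, best_alpha, _ = awq_like_quantize(W, X)
print(f"naive error={naive_err:.4f}  scaled error={awq_err:.4f}  alpha={best_alpha}")
```

Because `alpha = 0.0` reproduces plain round-to-nearest, the search can never do worse than the naive baseline on the calibration inputs; on this toy data the scaled variant spends its precision on the channel that carries the large activations.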

Quantization

GGUF Quantization

File format and quantization scheme for llama.cpp. Supports mixed-precision quantization (Q2-Q8) enabling models to run on CPUs and Apple Silicon with configurable quality-speed tradeoffs.
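To make the block-wise scheme concrete, here is a small sketch modeled on llama.cpp's `Q8_0` type: weights are split into blocks of 32, and each block stores one half-precision scale followed by 32 signed 8-bit values. The function names and sample weights are assumptions for illustration, and the sketch covers only the per-block payload, not the GGUF container or its metadata.

```python
import math
import struct

BLOCK = 32                 # weights per block, as in Q8_0
FMT = "<e32b"              # one float16 scale + 32 int8 quants
BLOCK_BYTES = struct.calcsize(FMT)


def quantize_q8_0_like(weights):
    """Pack weights into blocks of (fp16 scale, 32 x int8)."""
    assert len(weights) % BLOCK == 0, "pad the tensor to a multiple of the block size"
    out = bytearray()
    for i in range(0, len(weights), BLOCK):
        block = weights[i:i + BLOCK]
        amax = max(abs(w) for w in block)
        d = amax / 127 if amax > 0 else 0.0      # per-block scale
        inv = 1.0 / d if d > 0 else 0.0
        qs = [round(w * inv) for w in block]     # values land in [-127, 127]
        out += struct.pack(FMT, d, *qs)
    return bytes(out)


def dequantize_q8_0_like(data):
    """Recover approximate weights: each value is quant * block scale."""
    out = []
    for off in range(0, len(data), BLOCK_BYTES):
        d, *qs = struct.unpack_from(FMT, data, off)
        out.extend(q * d for q in qs)
    return out


# Hypothetical sample tensor with 64 weights (two blocks).
weights = [0.1 * math.sin(i) for i in range(64)]
packed = quantize_q8_0_like(weights)
restored = dequantize_q8_0_like(packed)
max_err = max(abs(a - b) for a, b in zip(weights, restored))
bits_per_weight = 8 * len(packed) / len(weights)
print(f"{len(packed)} bytes, {bits_per_weight} bits/weight, max error {max_err:.6f}")
```

The per-block scale is why an "8-bit" type costs slightly more than 8 bits per weight; lower-bit types trade more reconstruction error for smaller blocks.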

Quantization

GPTQ

Post-training quantization method that compresses model weights to 4-bit or 3-bit precision using approximate second-order information. Enables running large models on consumer GPUs.
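The "second-order information" mentioned above is a Hessian built from calibration inputs. The sketch below quantizes one weight row greedily and, after each weight is rounded, pushes the rounding error onto the not-yet-quantized weights using the inverse Hessian. It uses explicit inverse updates for readability, whereas GPTQ itself relies on a Cholesky-based, blocked formulation; the function names, the fixed integer grid, the damping default, and the toy data are assumptions for this example.

```python
def invert(A):
    """Gauss-Jordan inverse of a small square matrix given as a list of lists."""
    n = len(A)
    M = [list(map(float, row)) + [float(i == j) for j in range(n)]
         for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))  # partial pivoting
        M[c], M[p] = M[p], M[c]
        piv = M[c][c]
        M[c] = [v / piv for v in M[c]]
        for r in range(n):
            if r != c:
                f = M[r][c]
                M[r] = [a - f * b for a, b in zip(M[r], M[c])]
    return [row[n:] for row in M]


def gptq_like_row(w, X, step, percdamp=0.01):
    """Quantize one weight row to multiples of `step` with error compensation."""
    n = len(w)
    H = [[sum(x[i] * x[j] for x in X) for j in range(n)] for i in range(n)]
    damp = percdamp * sum(H[i][i] for i in range(n)) / n  # keeps H well conditioned
    for i in range(n):
        H[i][i] += damp
    Hinv = invert(H)
    w = list(w)
    q = [0.0] * n
    for i in range(n):
        q[i] = round(w[i] / step) * step
        err = (w[i] - q[i]) / Hinv[i][i]
        for j in range(i + 1, n):
            w[j] -= err * Hinv[i][j]          # compensate remaining weights
        d, col, row = Hinv[i][i], [r[i] for r in Hinv], list(Hinv[i])
        for r in range(i + 1, n):             # drop weight i from the inverse Hessian
            for c in range(i + 1, n):
                Hinv[r][c] -= col[r] * row[c] / d
    return q


def squared_error(w, q, X):
    return sum((sum((a - b) * xi for a, b, xi in zip(w, q, x))) ** 2 for x in X)


# Hypothetical toy problem: two weights, three calibration inputs, integer grid.
w = [0.4, 0.4]
X = [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
rtn_q = [float(round(v)) for v in w]
gptq_q = gptq_like_row(w, X, step=1.0)
rtn_err, gptq_err = squared_error(w, rtn_q, X), squared_error(w, gptq_q, X)
print(f"round-to-nearest {rtn_q} err={rtn_err:.2f}  compensated {gptq_q} err={gptq_err:.2f}")
```

On this toy problem plain rounding sends both weights to zero, while the compensated pass rounds the second weight up to absorb the error made on the first, which lowers the output error on the calibration inputs.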

Memory

PagedAttention

Virtual memory-inspired KV cache management that stores the cache in fixed-size blocks, largely eliminating memory fragmentation. Keeps KV cache waste near zero, which allows much larger batch sizes and higher throughput.
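The bookkeeping behind this can be sketched as a block table, much like a page table in an operating system. The class below is a minimal, hypothetical allocator (its name, methods, and block size are assumptions, not any engine's actual API): each sequence maps its logical token positions onto physical blocks that are allocated only when needed.

```python
class PagedKVCache:
    """Toy block-table allocator for KV cache slots."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free = list(range(num_blocks))  # physical block ids not in use
        self.tables = {}                     # seq_id -> list of physical block ids
        self.lengths = {}                    # seq_id -> number of tokens stored

    def append_token(self, seq_id):
        """Reserve a slot for one more token; returns (physical_block, offset)."""
        table = self.tables.setdefault(seq_id, [])
        n = self.lengths.setdefault(seq_id, 0)
        if n == len(table) * self.block_size:  # current blocks are full
            if not self.free:
                raise MemoryError("no free KV blocks")
            table.append(self.free.pop())
        self.lengths[seq_id] = n + 1
        return table[n // self.block_size], n % self.block_size

    def release(self, seq_id):
        """Return all blocks of a finished sequence to the free pool."""
        self.free.extend(self.tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)

    def wasted_slots(self):
        """Reserved but unused slots; at most block_size - 1 per sequence."""
        return sum(len(t) * self.block_size - self.lengths[s]
                   for s, t in self.tables.items())


cache = PagedKVCache(num_blocks=8, block_size=16)
for _ in range(20):
    cache.append_token("a")   # 20 tokens need 2 blocks
for _ in range(5):
    cache.append_token("b")   # 5 tokens need 1 block
print(cache.tables, "wasted slots:", cache.wasted_slots())
```

Waste is confined to the last, partially filled block of each sequence, rather than to a contiguous region pre-reserved for the maximum possible length.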

Decoding

Speculative Decoding

A smaller draft model proposes candidate tokens, which the larger target model verifies in a single parallel pass. Typically achieves 2-3x faster decoding while leaving the target model's output distribution unchanged.
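A minimal sketch of the propose-and-verify loop follows, using the greedy variant: drafted tokens are accepted as long as they match the target model's own greedy choice, so the output is identical to decoding with the target alone. The sampling-based variant uses a rejection step instead and is not shown. The two "models" are hypothetical toy functions, and the draft is deliberately wrong at every third position.

```python
def greedy_decode(model, prompt, n_new):
    """Baseline: one model call per generated token."""
    seq = list(prompt)
    for _ in range(n_new):
        seq.append(model(seq))
    return seq


def speculative_decode(target, draft, prompt, n_new, k=4):
    """Greedy speculative decoding; returns (tokens, number of target passes)."""
    seq = list(prompt)
    goal = len(prompt) + n_new
    target_passes = 0
    while len(seq) < goal:
        # 1. The cheap draft model proposes k tokens autoregressively.
        proposal = []
        for _ in range(k):
            proposal.append(draft(seq + proposal))
        # 2. The target scores all k + 1 positions. A real engine does this in
        #    one batched forward pass; here it is simulated with a loop.
        target_passes += 1
        verified = [target(seq + proposal[:i]) for i in range(k + 1)]
        # 3. Keep the longest matching prefix, then the target's own token.
        n_ok = 0
        while n_ok < k and proposal[n_ok] == verified[n_ok]:
            n_ok += 1
        seq.extend(proposal[:n_ok])
        seq.append(verified[n_ok])
    return seq[:goal], target_passes


def target(seq):
    """Hypothetical 'large' model: a deterministic toy next-token rule."""
    return (seq[-1] * 3 + len(seq)) % 10


def draft(seq):
    """Hypothetical 'small' model: agrees with the target except at every 3rd position."""
    guess = target(seq)
    return guess if len(seq) % 3 else (guess + 1) % 10


prompt = [1]
reference = greedy_decode(target, prompt, 12)
spec_out, passes = speculative_decode(target, draft, prompt, 12)
print(spec_out == reference, "target passes:", passes, "vs 12 for plain decoding")
```

Each target pass yields at least one token even when the first drafted token is rejected, so the number of target passes never exceeds that of plain decoding; the actual wall-clock gain depends on how often the draft agrees and how cheap it is.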

Attention

FlashAttention

IO-aware exact attention algorithm that reduces memory reads/writes through tiling and recomputation. Provides a 2-4x speedup over standard attention and enables longer context lengths without approximation.
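The tiling idea rests on an "online" softmax: keys and values are processed block by block while a running maximum, a running normalizer, and a running weighted sum are rescaled as needed, so the full score vector never has to be stored. The sketch below shows this for a single query in pure Python and checks it against the straightforward computation. It illustrates only the forward-pass arithmetic, not the GPU memory hierarchy or the backward-pass recomputation; the names and sample data are assumptions.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def naive_attention(q, K, V):
    """Reference: materialize every score, then softmax, then the weighted sum."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [dot(q, k) * scale for k in K]
    m = max(scores)
    w = [math.exp(s - m) for s in scores]
    z = sum(w)
    return [sum(wi * v[d] for wi, v in zip(w, V)) / z for d in range(len(V[0]))]


def tiled_attention(q, K, V, block=2):
    """Same result, but visits K/V one block at a time with O(block) extra state."""
    scale = 1.0 / math.sqrt(len(q))
    m = float("-inf")            # running max of scores seen so far
    l = 0.0                      # running softmax normalizer
    acc = [0.0] * len(V[0])      # running unnormalized output
    for start in range(0, len(K), block):
        Kb, Vb = K[start:start + block], V[start:start + block]
        scores = [dot(q, k) * scale for k in Kb]
        m_new = max(m, max(scores))
        rescale = math.exp(m - m_new)   # 0.0 on the first block, since m is -inf
        p = [math.exp(s - m_new) for s in scores]
        l = l * rescale + sum(p)
        acc = [a * rescale + sum(pi * v[d] for pi, v in zip(p, Vb))
               for d, a in enumerate(acc)]
        m = m_new
    return [a / l for a in acc]


# Hypothetical sample data: one query, five keys/values of dimension 2.
q = [0.5, -1.0]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 2.0], [2.0, -1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0], [9.0, 10.0]]
ref = naive_attention(q, K, V)
out = tiled_attention(q, K, V, block=2)
print(ref, out)
```

The two results agree up to floating-point rounding, which is what "exact" and "without approximation" mean here: tiling changes the order of computation, not the answer.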

Why Optimizations Matter

Running large language models at scale requires clever engineering. From quantization that shrinks model weights to attention mechanisms that reduce memory usage, these techniques make it possible to serve powerful models on commodity hardware.

4-8x memory reduction with quantization

2-4x throughput gain with FlashAttention

24x higher batch size with PagedAttention

2-3x faster decoding with speculative decoding
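One plausible reading of the 4-8x quantization figure above is simple arithmetic on bits per weight: going from 16-bit to 4-bit weights is a 4x reduction, and from 32-bit to 4-bit is 8x. The helper below is a back-of-the-envelope sketch under that assumption, using the 27B parameter count listed for Qwen3.6-27B; it counts weights only and ignores the KV cache, activations, and per-block quantization overhead.

```python
def weight_memory_gb(n_params, bits_per_weight):
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


n_params = 27e9  # parameter count listed for Qwen3.6-27B
for bits in (32, 16, 8, 4):
    print(f"{bits:>2}-bit weights: {weight_memory_gb(n_params, bits):6.1f} GB")

reduction_from_fp16 = weight_memory_gb(n_params, 16) / weight_memory_gb(n_params, 4)
reduction_from_fp32 = weight_memory_gb(n_params, 32) / weight_memory_gb(n_params, 4)
print(f"4-bit vs 16-bit: {reduction_from_fp16:.0f}x, 4-bit vs 32-bit: {reduction_from_fp32:.0f}x")
```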

Agents

Hottest Agentic Frameworks

AutoGen

Microsoft

40k+

Multi-agent conversation framework enabling agents to chat with each other to solve tasks. Supports customizable agents, human participation, and diverse conversation patterns.

conversational, multi-agent, Microsoft, Python

CrewAI

25k+

Role-based multi-agent framework where you define agents with specific roles, goals, and backstories. Agents collaborate to complete complex tasks through delegation and tool use.

role-based, multi-agent, collaboration, Python

LangGraph

LangChain

10k+

Framework for building stateful, multi-agent applications as graphs. Supports cycles, persistence, human-in-the-loop, and streaming. The agent orchestration layer of the LangChain ecosystem.

graph-based, stateful, multi-agent, streaming

OpenAI Agents SDK

OpenAI

15k+

Lightweight Python SDK for building agentic AI apps. Features agent handoffs, guardrails, tracing, and tool integration with a minimal abstraction layer over the OpenAI API.

handoffs, guardrails, tracing, Python

Pydantic AI

Pydantic

10k+

Agent framework built on Pydantic for type-safe AI applications. Offers structured outputs, dependency injection, and a model-agnostic design, from the team behind Pydantic.

type-safe, structured-output, model-agnostic, Python

smolagents

Hugging Face

15k+

Minimalist agent library focused on code agents that write and execute Python. Simple API with tool calling, multi-step reasoning, and tight Hugging Face Hub integration.

code-agents, minimalist, Hugging-Face, Python
