Tracking the open-source LLM ecosystem

The Open Source LLM Stack

A curated list of the latest open-source LLMs, inference engines and optimizations, agentic frameworks, and the research powering it all.

Models

Latest Open Source LLMs

Zhipu AI

GLM-5.2

2026-06

Parameters: 753B (40B active)
Architecture: MoE
Context window: 1M
Modality: Text
License: MIT
Recommended Hardware: 8× H200, 8× B200

MiniMax

MiniMax-M3

2026-06

Parameters: 428B (23B active)
Architecture: MoE
Context window: 1M
Modality: Text, Image, Video
License: MiniMax Community License
Recommended Hardware: 8× H200, 4× B200

DeepSeek

DeepSeek-V4-Pro

2026-04

Parameters: 1.6T (49B active)
Architecture: MoE
Context window: 1M
Modality: Text
License: MIT
Recommended Hardware: 8× H200, 8× B200

Google

Gemma 4 31B IT

2026-04

Parameters: 30.7B
Architecture: Dense
Context window: 256K
Modality: Text, Image
License: Apache 2.0
Recommended Hardware: 2× H200, 1× MI325X

Moonshot AI

Kimi-K2.6

2026-04

Parameters: 1T (32B active)
Architecture: MoE
Context window: 256K
Modality: Text, Image, Video
License: Modified MIT
Recommended Hardware: 8× H200, 8× B300

Xiaomi

MiMo-V2.5-Pro

2026-04

Parameters: 1.02T (42B active)
Architecture: MoE
Context window: 1M
Modality: Text
License: MIT
Recommended Hardware: 2-node 8× H200

Xiaomi

MiMo-V2.5

2026-04

Parameters: 310B (15B active)
Architecture: MoE
Context window: 1M
Modality: Text, Image, Video, Audio
License: MIT
Recommended Hardware: 8× H100, 4× B200

Alibaba

Qwen3.6-27B

2026-04

Parameters: 27B
Architecture: Dense
Context window: 256K
Modality: Text, Image, Video
License: Apache 2.0
Recommended Hardware: 1× H100, 1× H200

Alibaba

Qwen3.6-35B-A3B

2026-04

Parameters: 35B (3B active)
Architecture: MoE
Context window: 256K
Modality: Text, Image, Video
License: Apache 2.0
Recommended Hardware: 1× H100, 1× H200

Optimizations

Inference Optimizations

Quantization

AWQ

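Activation-aware Weight Quantization. Uses activation magnitudes to identify the small set of salient weight channels, then scales those channels before quantizing so they lose less precision. At 4-bit precision this generally preserves quality better than plain round-to-nearest quantization.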
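The idea of protecting salient channels can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not the AWQ library's implementation: the helper names (`quantize_rtn`, `output_error`, `search_scales`), the per-row symmetric quantizer, and the 21-point grid over the exponent `alpha` are all choices made for this example. Each input channel gets a scale derived from its mean activation magnitude raised to `alpha`; weights are multiplied by the scale before rounding and divided by it afterwards, which is equivalent to dividing the activations by the same scale.

```python
import random

def quantize_rtn(row, bits=4):
    """Symmetric round-to-nearest quantization of one weight row (returned dequantized)."""
    qmax = 2 ** (bits - 1) - 1
    step = max(abs(w) for w in row) / qmax or 1.0  # avoid a zero step for all-zero rows
    return [max(-qmax, min(qmax, round(w / step))) * step for w in row]

def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def output_error(W, X, scales, bits=4):
    """Mean squared output error after quantizing W with per-input-channel scales."""
    Wq = []
    for row in W:
        scaled = [w * s for w, s in zip(row, scales)]        # scale salient channels up
        deq = quantize_rtn(scaled, bits)
        Wq.append([q / s for q, s in zip(deq, scales)])      # fold 1/scale back in
    total = 0.0
    for x in X:
        ref, got = matvec(W, x), matvec(Wq, x)
        total += sum((a - b) ** 2 for a, b in zip(ref, got))
    return total / len(X)

def search_scales(W, X, bits=4, grid=20):
    """Grid-search alpha in [0, 1]; alpha = 0 is plain round-to-nearest."""
    n = len(W[0])
    act = [max(sum(abs(x[j]) for x in X) / len(X), 1e-8) for j in range(n)]
    best = None
    for k in range(grid + 1):
        alpha = k / grid
        scales = [a ** alpha for a in act]
        err = output_error(W, X, scales, bits)
        if best is None or err < best[0]:
            best = (err, alpha, scales)
    return best

# Hypothetical calibration data: channel 0 carries much larger activations.
random.seed(0)
W = [[random.gauss(0, 1) for _ in range(8)] for _ in range(4)]
X = [[random.gauss(0, 1) * (10 if j == 0 else 1) for j in range(8)] for _ in range(16)]
naive_err = output_error(W, X, [1.0] * 8)
best_err, best_alpha, _ = search_scales(W, X)
print(f"round-to-nearest error: {naive_err:.4f}, searched error: {best_err:.4f}, alpha={best_alpha}")
```

Because `alpha = 0` reproduces unscaled round-to-nearest, the searched error can never be worse than the naive baseline on the calibration data; how much better it gets depends on how skewed the activations are.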

Quantization

GGUF Quantization

File format and family of quantization schemes used by llama.cpp. Supports quantization levels from roughly 2 to 8 bits per weight (Q2 through Q8), so models can run on CPUs and Apple Silicon with a configurable tradeoff between quality, speed, and memory.
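As a rough illustration of block-wise quantization, the sketch below quantizes a block of 32 weights to 8-bit integers with one shared scale, loosely following the layout of llama.cpp's Q8_0 type. The function names are made up for this example, the scale is kept as a Python float rather than a 16-bit value, and the K-quant variants use more elaborate layouts that are not shown here.

```python
def quantize_block_q8(block):
    """Quantize one block of floats to signed 8-bit integers plus a shared scale."""
    d = max(abs(x) for x in block) / 127
    if d == 0:
        return 0.0, [0] * len(block)
    return d, [round(x / d) for x in block]

def dequantize_block(d, qs):
    """Reconstruct approximate floats from the scale and the integer codes."""
    return [d * q for q in qs]

def bits_per_weight(weight_bits, block_size=32, scale_bits=16):
    """Effective storage cost once the per-block scale is counted."""
    return (block_size * weight_bits + scale_bits) / block_size

block = [0.5, -1.0, 0.25, 0.0] * 8           # 32 example weights
d, qs = quantize_block_q8(block)
restored = dequantize_block(d, qs)
max_err = max(abs(a - b) for a, b in zip(block, restored))
print(f"scale={d:.6f}, max round-trip error={max_err:.6f}")
print(bits_per_weight(8), bits_per_weight(4))
```

The per-block scale is why an "8-bit" or "4-bit" format costs slightly more than its nominal width: with 32 weights per block and a 16-bit scale, the overhead is half a bit per weight.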

Quantization

GPTQ

Post-training quantization method that compresses model weights to 4-bit or 3-bit precision using approximate second-order information. Enables running large models on consumer GPUs.
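To see why 4-bit or 3-bit weights matter for consumer GPUs, a back-of-the-envelope estimate of weight memory is enough. The helper below is an illustrative calculation only (the function name is hypothetical): it ignores quantization metadata such as scales and zero points, as well as the KV cache and activations, so real footprints are somewhat higher.

```python
def weight_memory_gb(n_params, bits_per_weight):
    """Approximate memory for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9

# Example: a 27B-parameter dense model at different weight precisions.
for bits in (16, 8, 4, 3):
    print(f"{bits:>2}-bit: {weight_memory_gb(27e9, bits):.1f} GB")
```

Going from 16-bit to 4-bit weights cuts the estimate by 4x, which is the difference between needing a data-center GPU and fitting on a single high-memory consumer card.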

Memory

PagedAttention

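Virtual-memory-inspired KV cache management that stores the cache in fixed-size blocks instead of one contiguous region per sequence. This largely eliminates memory fragmentation and keeps wasted KV cache space near zero, which allows much larger batch sizes and higher throughput.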
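The bookkeeping behind this idea can be sketched as a small block allocator. This is a simplified model for illustration, not vLLM's implementation: the class and method names are invented here, and it tracks only which physical block each token's keys and values would live in, not the tensors themselves.

```python
class PagedKVCache:
    """Toy block-table allocator: each sequence maps logical blocks to physical blocks."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}   # seq_id -> list of physical block ids
        self.seq_lens = {}       # seq_id -> number of tokens stored

    def append_token(self, seq_id):
        """Reserve a slot for one new token; allocate a block only when needed."""
        n = self.seq_lens.get(seq_id, 0)
        table = self.block_tables.setdefault(seq_id, [])
        if n % self.block_size == 0:          # no block yet, or the last one is full
            if not self.free_blocks:
                raise MemoryError("no free KV blocks")
            table.append(self.free_blocks.pop())
        self.seq_lens[seq_id] = n + 1
        return table[n // self.block_size], n % self.block_size

    def free(self, seq_id):
        """Return all of a finished sequence's blocks to the pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)

    def wasted_slots(self):
        """Slots that are allocated but unused (at most block_size - 1 per sequence)."""
        return sum(len(t) * self.block_size - self.seq_lens[s]
                   for s, t in self.block_tables.items())

cache = PagedKVCache(num_blocks=8, block_size=4)
for _ in range(5):
    cache.append_token("a")
for _ in range(3):
    cache.append_token("b")
print(cache.block_tables, "wasted slots:", cache.wasted_slots())
```

Because memory is handed out one block at a time, the only waste is the unfilled tail of each sequence's last block, and blocks freed by a finished request are immediately reusable by any other request.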

Decoding

Speculative Decoding

Uses a smaller draft model to generate candidate tokens that the larger model verifies in parallel. Typically achieves 2-3x faster decoding, depending on how often the drafts are accepted, while leaving the larger model's output distribution unchanged.
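The draft-then-verify loop can be sketched for the greedy case, where "no quality loss" simply means the output matches what the large model would have produced on its own. Everything here is a stand-in: `target_next` and `draft_next` are hypothetical deterministic functions rather than real models, and the verification loop would be a single batched forward pass in a real system. Sampling-based variants use a rejection-sampling acceptance rule that is not shown.

```python
def target_next(tokens):
    """Hypothetical large model: next token is a deterministic function of the context."""
    return (sum(tokens) * 31 + len(tokens)) % 50

def draft_next(tokens):
    """Hypothetical small model: agrees with the target except on some contexts."""
    guess = target_next(tokens)
    return guess if len(tokens) % 4 else (guess + 1) % 50

def greedy_decode(prompt, n_new):
    tokens = list(prompt)
    for _ in range(n_new):
        tokens.append(target_next(tokens))
    return tokens

def speculative_decode(prompt, n_new, k=4):
    tokens = list(prompt)
    target_passes = 0
    while len(tokens) - len(prompt) < n_new:
        # 1. The draft model proposes k tokens autoregressively.
        draft = []
        for _ in range(k):
            draft.append(draft_next(tokens + draft))
        # 2. The target checks every proposed position (one pass in a real system).
        target_passes += 1
        accepted = []
        for i in range(k):
            expected = target_next(tokens + draft[:i])
            if draft[i] == expected:
                accepted.append(draft[i])
            else:
                accepted.append(expected)   # replace the first wrong token and stop
                break
        else:
            accepted.append(target_next(tokens + draft))  # all accepted: one bonus token
        tokens.extend(accepted)
    return tokens[:len(prompt) + n_new], target_passes

out, passes = speculative_decode([1, 2, 3], n_new=20)
print("matches greedy decoding:", out == greedy_decode([1, 2, 3], 20))
print("target passes:", passes, "for 20 new tokens")
```

Every round yields at least one token from the target, so the worst case matches ordinary decoding in target passes; the speedup comes from rounds where several drafted tokens are accepted at once.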

Attention

FlashAttention

IO-aware exact attention algorithm that reduces reads and writes to GPU memory through tiling and recomputation. Provides a 2-4x speedup over standard attention implementations and enables longer context lengths without approximation.
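The reason tiling can stay exact is the online softmax trick: a running maximum and running normalizer let each block of keys and values be processed once, without ever materializing the full score vector. The sketch below shows this for a single query in plain Python. It is only a numerical illustration with made-up function names; it does not model GPU memory traffic, and the recomputation used in the backward pass is not shown.

```python
import math

def naive_attention(q, K, V):
    """Reference: compute all scores, softmax them, then take the weighted sum of V."""
    scale = 1 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in K]
    m = max(scores)
    w = [math.exp(s - m) for s in scores]
    z = sum(w)
    return [sum(wi * v[j] for wi, v in zip(w, V)) / z for j in range(len(V[0]))]

def tiled_attention(q, K, V, block=4):
    """Process K/V in blocks, keeping a running max, normalizer, and output accumulator."""
    scale = 1 / math.sqrt(len(q))
    m = -math.inf                 # running maximum of the scores seen so far
    l = 0.0                       # running softmax denominator
    acc = [0.0] * len(V[0])       # running unnormalized output
    for start in range(0, len(K), block):
        Kb, Vb = K[start:start + block], V[start:start + block]
        s = [scale * sum(a * b for a, b in zip(q, k)) for k in Kb]
        m_new = max(m, max(s))
        corr = math.exp(m - m_new)            # rescale earlier partial results
        p = [math.exp(si - m_new) for si in s]
        l = l * corr + sum(p)
        acc = [a * corr + sum(pi * v[j] for pi, v in zip(p, Vb))
               for j, a in enumerate(acc)]
        m = m_new
    return [a / l for a in acc]

q = [0.1, -0.2, 0.3, 0.4]
K = [[math.sin(i + j) for j in range(4)] for i in range(10)]
V = [[math.cos(i * j) for j in range(3)] for i in range(10)]
diff = max(abs(a - b) for a, b in zip(naive_attention(q, K, V), tiled_attention(q, K, V)))
print(f"max difference between naive and tiled: {diff:.2e}")
```

The two functions agree up to floating-point rounding, which is what "exact" means here: the tiled version changes the order of operations and the memory access pattern, not the result.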

Why Optimizations Matter

Running large language models at scale requires clever engineering. From quantization that shrinks model weights to attention mechanisms that reduce memory usage, these techniques make it possible to serve powerful models on commodity hardware.

4-8x: Memory reduction with quantization
2-4x: Throughput gain with FlashAttention
24x: Higher batch with PagedAttention
2-3x: Faster decode with speculative decoding

Agents

Hottest Agentic Frameworks