GitHub - hanzoai/engine: Hanzo AI native inference engine - Rust-based LLM & embedding engine for foundational models

Fast, flexible LLM inference.

Latest

Anthropic Messages API: hanzo serve now exposes an Anthropic-compatible POST /v1/messages endpoint (streaming, tool use, and Claude Code harness support) alongside the OpenAI-compatible /v1 API. Examples
Agentic runtime: web search, local Python code execution with model feedback, session management, and custom tool hooks. Guide
Gemma 4: full multimodal support for text, image, video, and audio input. Guide | Video setup
MXFP4 ISQ quantization: MXFP4 with optimized decode kernels for faster, smaller models. Quantization docs
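
A minimal sketch of a client request to the Messages endpoint, using only the Python standard library. It assumes a server started with hanzo serve on localhost:1234, that the model name "default" is accepted as in the Python SDK example, and that the request body follows Anthropic's public Messages format. The helper names, the placeholder API key, and the anthropic-version value are illustrative assumptions and are not confirmed by this README.

```python
import json
import urllib.request

# Assumption: `hanzo serve` is listening on its default local port.
BASE_URL = "http://localhost:1234"


def build_messages_request(prompt, model="default", max_tokens=256):
    """Build (but do not send) a POST /v1/messages request."""
    body = {
        "model": model,
        "max_tokens": max_tokens,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"{BASE_URL}/v1/messages",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "content-type": "application/json",
            # Headers that Anthropic's hosted API expects; whether a local
            # server checks them is an assumption.
            "x-api-key": "not-needed-locally",
            "anthropic-version": "2023-06-01",
        },
        method="POST",
    )


def send(request):
    """Send the request and return the decoded JSON response."""
    with urllib.request.urlopen(request) as resp:
        return json.load(resp)


def first_text(response):
    """Return the first text block of a Messages-format response, if any."""
    for block in response.get("content", []):
        if block.get("type") == "text":
            return block["text"]
    return None
```

With a server running, `first_text(send(build_messages_request("Hello!")))` would return the reply text, provided the response follows the Messages format.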

Why hanzo?

Any Hugging Face model, zero config: Just hanzo run -m user/model. Architecture, quantization format, and chat template are auto-detected.
True multimodality: Text, vision, video, and audio input, plus speech generation, image generation, and embeddings, in one engine.
Smart quantization: --quant automatically selects the best quantization format for the requested level, using a prebuilt UQFF if one is published and applying ISQ otherwise. Docs
OpenAI + Anthropic compatible serving: The same hanzo serve process exposes OpenAI-compatible /v1 endpoints and an Anthropic-compatible Messages endpoint.
Built-in web UI: Served at /ui by default. Shows reasoning, code execution, plots, and files inline. Edit any message and the new branch runs with its own Python state. Pass --no-ui to disable.
Hardware-aware: hanzo tune benchmarks your system and picks optimal quantization + device mapping.
Flexible SDKs: Python package and Rust crate to build your projects.
Native agentic support: built-in agentic loop with web search, local Python code execution with model feedback, session management, and custom tool hooks.

Quick Start

Install

Linux/macOS:

curl --proto '=https' --tlsv1.2 -sSf https://raw.githubusercontent.com/hanzoai/engine/master/install.sh | sh

Windows (PowerShell):

irm https://raw.githubusercontent.com/hanzoai/engine/master/install.ps1 | iex

Manual installation & other platforms

Run Your First Model

# Interactive chat
hanzo run -m Qwen/Qwen3-4B

# One-shot prompt (no interactive session)
hanzo run -m Qwen/Qwen3-4B -i "What is the capital of France?"

# One-shot with an image
hanzo run -m google/gemma-4-E4B-it --image photo.jpg -i "Describe this image"

# Agentic REPL: search + code execution from the terminal
hanzo run --agent -m Qwen/Qwen3-4B

# Start an API server with the built-in web UI
hanzo serve -m google/gemma-4-E4B-it

For the server command, visit http://localhost:1234/ui for the web chat interface. OpenAI-compatible clients use http://localhost:1234/v1; Anthropic-compatible clients use http://localhost:1234.
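
As a hedged illustration of the OpenAI-compatible side, the sketch below builds a chat request with the Python standard library. It assumes the server implements the standard chat/completions route under /v1 and accepts the model name "default" used in the Python SDK example; the function names are hypothetical.

```python
import json
import urllib.request

# Base URL for OpenAI-compatible clients, as stated above.
OPENAI_BASE_URL = "http://localhost:1234/v1"


def build_chat_request(prompt, model="default", max_tokens=256):
    """Build (but do not send) a chat completion request."""
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        url=f"{OPENAI_BASE_URL}/chat/completions",  # assumed standard route
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt):
    """Send the prompt to a running server and return the reply text."""
    with urllib.request.urlopen(build_chat_request(prompt)) as resp:
        reply = json.load(resp)
    return reply["choices"][0]["message"]["content"]
```

Calling `chat("What is the capital of France?")` requires the server from the previous command to be running; `build_chat_request` can be inspected without one.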

The `hanzo` CLI

The CLI is designed to be zero-config: just point it at a model and go.

Auto-detection: Automatically detects model architecture, quantization format, and chat template
All-in-one: Single binary for chat, server, benchmarks, and web UI (run, serve, bench)
Hardware tuning: Run hanzo tune to automatically benchmark and configure optimal settings for your hardware
Format-agnostic: Works with Hugging Face models, GGUF files, and UQFF quantizations seamlessly

# Auto-tune for your hardware and emit a config file
hanzo tune -m Qwen/Qwen3-4B --emit-config config.toml

# Run using the generated config
hanzo from-config -f config.toml

# Diagnose system issues (CUDA, Metal, HuggingFace connectivity)
hanzo doctor

Full CLI documentation
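
The tune-then-run workflow above can also be scripted. The following sketch wraps the two commands shown in this section with Python's subprocess module; it uses only the flags that appear above, and the function names are hypothetical.

```python
import shutil
import subprocess


def tune_command(model_id, config_path="config.toml"):
    """Argument list for `hanzo tune`, mirroring the command above."""
    return ["hanzo", "tune", "-m", model_id, "--emit-config", config_path]


def from_config_command(config_path="config.toml"):
    """Argument list for `hanzo from-config`, mirroring the command above."""
    return ["hanzo", "from-config", "-f", config_path]


def tune_then_run(model_id, config_path="config.toml"):
    """Benchmark the hardware, then launch from the emitted config."""
    if shutil.which("hanzo") is None:
        raise RuntimeError("hanzo CLI not found on PATH")
    # check=True raises CalledProcessError if either step fails.
    subprocess.run(tune_command(model_id, config_path), check=True)
    subprocess.run(from_config_command(config_path), check=True)
```

Passing the arguments as a list avoids shell quoting issues with model IDs or paths.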

What Makes It Fast

Performance

Continuous batching is enabled by default on all devices.
CUDA with FlashAttention V2/V3, Metal, multi-GPU tensor parallelism
PagedAttention for high-throughput continuous batching on CUDA or Apple Silicon; prefix caching (including multimodal)
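
To make the prefix-caching idea concrete, here is a toy sketch in Python. It is not the engine's implementation: the block size, the hashing scheme, and every name are assumptions chosen for illustration. Tokens are grouped into fixed-size blocks, each block is keyed by a hash chained with its predecessor's key, and a new request can reuse the leading blocks whose keys are already cached.

```python
import hashlib

BLOCK_SIZE = 4  # hypothetical value; real engines typically use larger blocks


def block_keys(token_ids, block_size=BLOCK_SIZE):
    """Return one chained hash key per full block of tokens."""
    keys = []
    prev = b""
    full = len(token_ids) - len(token_ids) % block_size
    for start in range(0, full, block_size):
        block = token_ids[start:start + block_size]
        # Chaining with the previous key makes a block's key depend on
        # the entire prefix before it, not just its own tokens.
        digest = hashlib.sha256(prev + repr(block).encode()).digest()
        keys.append(digest)
        prev = digest
    return keys


class PrefixCache:
    """Tracks which prefix blocks have been seen (values omitted for brevity)."""

    def __init__(self):
        self._blocks = set()

    def cached_prefix_len(self, token_ids):
        """Number of leading tokens whose blocks are already cached."""
        hits = 0
        for key in block_keys(token_ids):
            if key not in self._blocks:
                break
            hits += 1
        return hits * BLOCK_SIZE

    def insert(self, token_ids):
        """Record every full block of this sequence as cached."""
        self._blocks.update(block_keys(token_ids))
```

After inserting a 12-token prompt, a second prompt that shares its first 8 tokens reports 8 reusable tokens, so only the remainder would need a fresh prefill in this simplified model.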

Quantization (full docs)

In-situ quantization (ISQ) of any Hugging Face model
GGUF (2-8 bit), GPTQ, AWQ, HQQ, FP8, BNB support
⭐ Per-layer topology: Fine-tune quantization per layer for optimal quality/speed
⭐ Auto-select fastest quant method for your hardware
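
As a rough guide to what different bit widths mean for memory, the arithmetic below estimates weight storage as parameters × bits / 8. This is a back-of-the-envelope sketch: it ignores per-block scales, metadata, the KV cache, and activations, and the 4-billion parameter count is an assumed round figure rather than a measured one.

```python
def weight_bytes(num_params, bits_per_weight):
    """Approximate weight storage in bytes, ignoring scales and metadata."""
    return num_params * bits_per_weight / 8


def gib(num_bytes):
    """Convert bytes to GiB (2**30 bytes)."""
    return num_bytes / 2**30


if __name__ == "__main__":
    params = 4_000_000_000  # assumed round figure for a "4B" model
    for bits in (16, 8, 4, 2):
        print(f"{bits:>2}-bit: {gib(weight_bytes(params, bits)):.2f} GiB")
```

Under these assumptions the estimate falls from about 7.45 GiB at 16 bits to about 1.86 GiB at 4 bits.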

Flexibility

LoRA & X-LoRA with weight merging
AnyMoE: Create mixture-of-experts on any base model
Multiple models: Load/unload at runtime

Agentic Features

Integrated tool calling with grammar enforcement and strict schema mode
⭐ Server-side agentic loop: auto-execute tools and feed results back
⭐ Python code execution: persistent Jupyter-like sessions with matplotlib capture and multimodal feedback
⭐ Web search integration with embedding-based ranking
⭐ Tool dispatch URL: POST tool calls to your own endpoint
⭐ MCP client: Connect to external tools via Process, HTTP, or WebSocket
Python/Rust tool callbacks for custom execution
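
The idea of an agentic loop (call the model, execute any requested tools, feed the results back, repeat) can be sketched in a few lines of Python. This is a toy illustration, not the server's implementation: the model is a stub, the get_weather tool is hypothetical, and the message shapes follow the OpenAI-style tool-calling convention as an assumption.

```python
import json


def get_weather(city):
    """Hypothetical tool; a real one would call an external service."""
    return {"city": city, "forecast": "sunny"}


TOOLS = {"get_weather": get_weather}


def fake_model(messages):
    """Stand-in for the LLM: requests a tool once, then answers."""
    if not any(m["role"] == "tool" for m in messages):
        return {
            "role": "assistant",
            "content": None,
            "tool_calls": [{
                "id": "call_1",
                "type": "function",
                "function": {
                    "name": "get_weather",
                    "arguments": json.dumps({"city": "Paris"}),
                },
            }],
        }
    result = json.loads(messages[-1]["content"])
    return {
        "role": "assistant",
        "content": f"It is {result['forecast']} in {result['city']}.",
    }


def agentic_loop(model, messages, max_rounds=5):
    """Call the model, run requested tools, feed results back, repeat."""
    messages = list(messages)
    for _ in range(max_rounds):
        reply = model(messages)
        messages.append(reply)
        calls = reply.get("tool_calls") or []
        if not calls:
            return reply["content"], messages
        for call in calls:
            fn = TOOLS[call["function"]["name"]]
            args = json.loads(call["function"]["arguments"])
            messages.append({
                "role": "tool",
                "tool_call_id": call["id"],
                "content": json.dumps(fn(**args)),
            })
    raise RuntimeError("tool loop did not finish within max_rounds")
```

The max_rounds cap is a design choice for the sketch: it keeps a model that never stops requesting tools from looping forever.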

Full feature documentation

Supported Models

Text Models

Granite 4.0
SmolLM 3
DeepSeek V3
GPT-OSS
DeepSeek V2
Qwen 3 Next
Qwen 3 MoE
Phi 3.5 MoE
Qwen 3
GLM 4
GLM-4.7-Flash
GLM-4.7 (MoE)
Gemma 2
Qwen 2
Starcoder 2
Phi 3
Mixtral
Phi 2
Gemma
Llama
Mistral

Multimodal Models

Qwen 3.5
Qwen 3.5 MoE
Qwen 3-VL
Qwen 3-VL MoE
Gemma 3n
Llama 4
Gemma 3
Mistral 3
Phi 4 multimodal
Qwen 2.5-VL
MiniCPM-O
Llama 3.2 Vision
Qwen 2-VL
Idefics 3
Idefics 2
LLaVA Next
LLaVA
Phi 3V

Speech Models

Voxtral (ASR/speech-to-text)
Dia

Image Generation Models

FLUX

Embedding Models

Embedding Gemma
Qwen 3 Embedding

Request a new model | Full compatibility tables

Python SDK

pip install hanzo  # or hanzo-cuda, hanzo-metal, hanzo-mkl, hanzo-accelerate

from hanzo import Runner, Which, ChatCompletionRequest

runner = Runner(
    which=Which.Plain(model_id="Qwen/Qwen3-4B"),
    in_situ_quant="4",
)

res = runner.send_chat_completion_request(
    ChatCompletionRequest(
        model="default",
        messages=[{"role": "user", "content": "Hello!"}],
        max_tokens=256,
    )
)
print(res.choices[0].message.content)

Python SDK | Installation | Examples | Cookbook

Rust SDK

cargo add hanzo

use anyhow::Result;
use hanzo::{IsqType, TextMessageRole, TextMessages, MultimodalModelBuilder};

#[tokio::main]
async fn main() -> Result<()> {
    let model = MultimodalModelBuilder::new("google/gemma-4-E4B-it")
        .with_isq(IsqType::Q4K)
        .with_logging()
        .build()
        .await?;

    let messages = TextMessages::new().add_message(
        TextMessageRole::User,
        "Hello!",
    );

    let response = model.send_chat_request(messages).await?;

    println!("{:?}", response.choices[0].message.content);

    Ok(())
}

API Docs | Crate | Examples

Docker

For quick containerized deployment:

docker pull ghcr.io/hanzoai/engine:latest
docker run --gpus all -p 1234:1234 ghcr.io/hanzoai/engine:latest \
  serve -m Qwen/Qwen3-4B

Docker images

For production use, we recommend installing the CLI directly for maximum flexibility.

Documentation

See the full documentation for complete details.

Quick Links:

CLI Reference - All commands and options
HTTP API - OpenAI-compatible and Anthropic-compatible endpoints
Quantization - ISQ, GGUF, GPTQ, and more
Device Mapping - Multi-GPU and CPU offloading
MCP Integration - Connecting to external tools via MCP
Troubleshooting - Common issues and solutions
Configuration - Environment variables for configuration

Contributing

Contributions welcome! Please open an issue to discuss new features or report bugs. If you want to add a new model, please contact us via an issue and we can coordinate.

Credits

This project would not be possible without the excellent work at Hanzo. Thank you to all contributors!

hanzo is not affiliated with Mistral AI.
