Archaeologist·Field Notes from ggerganov/llama.cpp
Vol. I · Field Notes

ggerganov/llama.cpp

Scripts that ship with llama.cpp

9 May 2026 · a modest project
From the Field
The de facto standard for local LLM inference, though these field notes cover only the scripts that ship with it.
Verdict: Reach for it
Reach for it when

You need to run, convert, or benchmark LLMs locally on CPU or consumer GPU with minimal dependencies.

Look elsewhere when

You want a polished GUI, cloud API, or production-grade serving with multi-user auth.

In context

It's like Ollama but lower-level, faster, and more hackable — you build and script everything yourself.

Complexity: ●● Medium
Read time: ~30 minutes
Dependencies: 0 total

What using it looks like

Drawn from the project's README

# Use a local model file
llama-cli -m my_model.gguf

# Or download and run a model directly from Hugging Face
llama-cli -hf ggml-org/gemma-3-1b-it-GGUF

# Launch OpenAI-compatible API server
llama-server -hf ggml-org/gemma-3-1b-it-GGUF
Fig. 1 — usage examples from the README

What this is

As told for the tourist


llama.cpp is a program that lets you run powerful AI language models—the same kind that power ChatGPT—directly on your own computer, without needing an internet connection or paying anyone. Think of it as a tiny engine that can bring a smart chatbot to life right inside your laptop or desktop.

What Can You Do With It?

You could use this to have a private conversation with an AI assistant that never sends your data anywhere. For example, you can open your terminal and type:

llama-cli -m my_model.gguf

Then just start chatting: "Hi, who are you?" and it'll reply like a friendly helper. You can also launch your own personal AI server that works like OpenAI's API, so other apps on your computer can talk to it:

llama-server -hf ggml-org/gemma-3-1b-it-GGUF

This means you could build a writing assistant, a coding buddy, or a study tutor that runs entirely offline. The README shows you can even download models directly from Hugging Face (a big online library of AI models) with a simple command like llama-cli -hf ggml-org/gemma-3-1b-it-GGUF.
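To make "other apps on your computer can talk to it" concrete, here is a minimal Python sketch of a client, assuming llama-server is running on its default port (8080) and exposing the OpenAI-style chat completions route; treat the exact payload shape as an assumption rather than a spec:

import requests

# Minimal sketch: assumes llama-server is listening on localhost:8080
# and serving the OpenAI-compatible /v1/chat/completions endpoint.
resp = requests.post(
    "http://localhost:8080/v1/chat/completions",
    json={"messages": [{"role": "user", "content": "Hi, who are you?"}]},
    timeout=120,
)
print(resp.json()["choices"][0]["message"]["content"])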


How It Works (No Jargon)

1. The Model is Like a Cookbook

An AI language model is basically a giant collection of patterns—like a cookbook with millions of recipes for how words fit together. When you ask it a question, it's like flipping through the cookbook to find the most likely next word, then the next, then the next. llama.cpp reads this cookbook file (usually a .gguf file) and uses it to generate responses.
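In toy Python, that next-word loop looks roughly like this; predict_scores is a made-up stand-in for the model, not anything from llama.cpp's actual API:

# Toy sketch of autoregressive generation. predict_scores stands in for
# the model: given the words so far, it returns a dict of candidate
# next words mapped to scores.
def generate(predict_scores, prompt_words, max_new_words=20):
    words = list(prompt_words)
    for _ in range(max_new_words):
        scores = predict_scores(words)            # "flip through the cookbook"
        next_word = max(scores, key=scores.get)   # pick the most likely next word
        words.append(next_word)                   # then repeat, with it included
    return " ".join(words)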

2. Your Computer is the Kitchen

Running a model requires a lot of math—think of it as following thousands of recipes simultaneously. llama.cpp is designed to use your computer's brain (the CPU) and, if available, its graphics card (GPU) to do this math really fast. It's like having a super-efficient kitchen that can cook thousands of dishes at once without burning anything.

3. The "Quantization" Trick

The cookbook is usually huge—like 100 billion recipes. llama.cpp can shrink it down by rounding off tiny details (like using "1.5 cups" instead of "1.5234 cups"). This makes the cookbook smaller and faster to use, while still keeping the food tasting almost the same. That's why you can run these models on a regular laptop instead of needing a supercomputer.
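As a toy illustration of that rounding idea (real llama.cpp formats such as Q8_0 or Q4_K work block by block and are considerably cleverer, so take this as the principle only):

# Map floats onto 256 integer steps (8 bits) and back: smaller to
# store, and close enough to the originals for most purposes.
def quantize(weights):
    scale = max(abs(w) for w in weights) / 127.0
    return [round(w / scale) for w in weights], scale

def dequantize(q_weights, scale):
    return [q * scale for q in q_weights]

qs, scale = quantize([0.12, -1.5234, 0.98])
print(dequantize(qs, scale))  # roughly [0.12, -1.5234, 0.98], at a fraction of the size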

What's Cool About It?

The coolest thing is that it's written in C/C++, which is like building a race car engine instead of a family sedan. Most AI tools are written in Python, which is easier to write but slower. llama.cpp is incredibly fast and lightweight—it can run on everything from a beefy gaming PC to a tiny Raspberry Pi. It also supports "multimodal" models now, meaning it can look at pictures and describe them, not just chat.

Who Should Care?

Reach for this if you're curious about AI but don't want to send your private conversations to a cloud server, or if you're a developer who wants to build apps with a free, local AI brain. Skip it if you just want to use ChatGPT in your browser—that's already perfect for casual use. But if you like tinkering, learning, or keeping your data private, llama.cpp is your new best friend.

Start Here

A recommended reading path through the code


  1. Reveals the core abstraction of mapping Python types to GBNF grammar rules, which is central to constrained text generation in llama.cpp.

  2. Shows how JSON Schema definitions are converted to grammar rules, complementing the Pydantic approach and illustrating the key data transformation pipeline (a simplified sketch follows this list).

  3. Demonstrates the model conversion pipeline from legacy formats to GGUF, highlighting quantization and tensor handling—a critical architectural flow.

  4. Provides shared utilities for tensor summarization and validation, revealing common patterns used across conversion scripts.

  5. Illustrates how tensor metadata is inspected, giving insight into the data structures that underpin model loading and conversion.
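To make the first two stops concrete, here is a heavily simplified sketch of the recursive-descent idea behind them: walk a schema, emit one GBNF-style rule per node. It handles only flat objects with string and integer fields, and the rule text is illustrative, not a faithful copy of the real converters' output:

# Heavily simplified sketch of JSON Schema -> GBNF conversion. Only
# flat objects with string/integer fields; the real scripts handle
# far more. The emitted rule syntax is illustrative.
def schema_to_rules(name, schema, rules):
    t = schema["type"]
    if t == "string":
        rules[name] = '"\\"" [^"]* "\\""'
    elif t == "integer":
        rules[name] = "[0-9]+"
    elif t == "object":
        parts = []
        for i, (key, sub) in enumerate(schema["properties"].items()):
            child = f"{name}-{key}"
            schema_to_rules(child, sub, rules)  # recurse into each field
            sep = '"," ' if i else ""
            parts.append(f'{sep}"\\"{key}\\":" {child}')
        rules[name] = '"{" ' + " ".join(parts) + ' "}"'
    return rules

rules = schema_to_rules("root", {
    "type": "object",
    "properties": {"name": {"type": "string"}, "age": {"type": "integer"}},
}, {})
for lhs, rhs in rules.items():
    print(lhs, "::=", rhs)

The real converters layer in arrays, nested objects, optional fields, and proper escaping, but the shape of the recursion is the same.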

What's inside

7 sections of the codebase

Read Next

Where to go from here

📰
Article · 2024

Running LLMs Locally with llama.cpp

Simon Willison

A clear, hands-on walkthrough of installing and using llama.cpp for local inference, perfect for beginners.

📺
Video · 2024

llama.cpp: Run LLMs on Your CPU (Tutorial)

Fireship

A fast-paced, visually engaging intro that demystifies local LLM inference and llama.cpp's role.

📰
Article · 2023

What is GGUF and Why Does It Matter?

Hugging Face Blog

Explains the GGUF format that llama.cpp scripts convert to, making the conversion pipeline understandable.

📰
Article · 2024

A Beginner's Guide to Structured Output from LLMs

LangChain Blog

Introduces the concept of grammar-constrained generation that llama.cpp's GBNF scripts implement.

Sibling Projects

Codebases that occupy adjacent space

⚙️ llama.cpp · 🦙 Ollama · 🚀 vLLM · 💬 FastChat · 🐍 llama-cpp-python


Words You'll Hear

Definitions for the terms used throughout these notes

Checkpoint

concept

A saved snapshot of a model's weights at a specific point during training. These files are the input to the conversion scripts.

concurrent.futures

library

A Python library for running tasks in parallel using thread or process pools. It's used to speed up model conversion by processing multiple tensors simultaneously.
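The pattern, with a hypothetical convert_tensor worker standing in for the real per-tensor work:

from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for the real per-tensor conversion work.
def convert_tensor(name):
    return f"{name}: converted"

names = ["tok_embeddings.weight", "layers.0.attention.wq.weight"]
with ThreadPoolExecutor(max_workers=8) as pool:
    for result in pool.map(convert_tensor, names):
        print(result)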

Context-free grammar

concept

A set of rules that define how to build strings from a language, where each rule can be applied regardless of surrounding context. It's used here to make LLMs output valid JSON or other structured formats.

Cosine similarity

concept

A measure of how similar two vectors (lists of numbers) are, ranging from -1 to 1. It's used here to quickly check if model outputs are close to expected values.
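Spelled out in plain Python (the real checks would use NumPy or PyTorch; this is just the definition made explicit):

import math

# Cosine similarity: dot product divided by the product of the lengths.
def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norms

print(cosine_similarity([1.0, 2.0, 3.0], [1.1, 1.9, 3.2]))  # ~0.999, i.e. very similar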

Factory method

pattern

A design pattern that provides an interface for creating objects in a superclass, but lets subclasses alter the type of objects created. It's partially used in the DataType hierarchy.

GBNF

concept

A grammar format used by llama.cpp to define rules for what tokens an LLM can generate at each step. It's a variant of Backus-Naur Form (BNF) that constrains output to follow a specific structure.

GGUF

concept

A file format for storing quantized large language models, designed for efficient loading and inference. It replaces older formats like GGML and is the standard format for llama.cpp.

Hugging Face

library

A platform and library ecosystem for sharing, storing, and using machine learning models. Scripts here interact with it to download or upload model files.

JSON Schema

concept

A standard format for describing the structure of JSON data, including types, required fields, and constraints. It's used as input to generate grammar rules for LLM output.

LLM inference engine

concept

A software system that runs a trained large language model to generate text, rather than training it. It takes input prompts and produces output tokens efficiently.

Logits

concept

Raw numerical scores output by an LLM for each possible token before they are converted into probabilities. Comparing logits helps check if two model versions produce similar outputs.

Normalized Mean Squared Error (NMSE)

concept

A metric that measures the average squared difference between predicted and actual values, normalized by the variance of the actual values. It's a precise way to validate model output accuracy.
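In plain Python, using the variance normalization described above (other normalizations exist; this mirrors the definition given here):

def nmse(expected, actual):
    n = len(expected)
    mse = sum((e - a) ** 2 for e, a in zip(expected, actual)) / n
    mean = sum(expected) / n
    variance = sum((e - mean) ** 2 for e in expected) / n
    return mse / variance  # 0 means identical; small values mean close outputs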

Null Object pattern

pattern

A design pattern that uses a special object representing 'no value' to avoid null checks. It's implicitly used when handling optional fields in grammar generation.

pickle

library

A Python module for serializing (saving) objects to files. It's used here to load older model checkpoint files.

Pipeline architecture

pattern

A software design where data flows through a series of processing stages, each performing a specific transformation. The conversion scripts follow this pattern with separate workers for each step.

Pydantic

library

A Python library for data validation using type annotations. It's used here to define model schemas that are converted into grammar rules.

PyTorch

library

A popular Python library for building and training neural networks. It's used here to run original models and generate reference outputs for validation.

Quantization

concept

A technique that reduces the precision of a model's numerical weights (e.g., from 16-bit to 8-bit) to make it smaller and faster to run, with minimal loss in quality. It's like compressing a high-resolution image into a smaller file.

Recursive descent

pattern

A technique where a function calls itself to process nested structures, like walking through a tree of JSON schema definitions. It's the core approach in the grammar generators.

safetensors

library

A safe file format for storing tensor data, designed to avoid security issues with pickle. It's a modern alternative for saving model weights.

SentencePiece

library

A tokenizer library that breaks text into tokens without needing a predefined word list. It's used by LLaMA models to convert text into token IDs.

Strategy pattern

pattern

A design pattern that lets you define a family of algorithms and make them interchangeable. It's mentioned as a missing pattern that would improve the converter's architecture.

Tensor

concept

A multi-dimensional array of numbers, like a matrix but with more dimensions. Model weights and activations are stored as tensors.

Token

concept

A small unit of text that an LLM processes, like a word, part of a word, or punctuation mark. Models predict one token at a time to generate responses.

Visitor pattern

pattern

A design pattern where you define an operation to be performed on elements of a structure without changing the elements' classes. Here, it's used to walk through schema definitions and generate grammar rules.

ggerganov/llama.cpp · Archaeologist