Archaeologist·Field Notes from ggerganov/llama.cpp
Vol. I · Field Notes

ggerganov/llama.cpp

Scripts that ship with llama.cpp

9 May 2026 · a modest project
From the Field
The de facto standard for local LLM inference, though these field notes cover only the scripts that ship with it.
Verdict: Reach for it
Reach for it when

You need to run, convert, or benchmark LLMs locally on CPU or consumer GPU with minimal dependencies.

Look elsewhere when

You want a polished GUI, cloud API, or production-grade serving with multi-user auth.

In context

It's like Ollama but lower-level, faster, and more hackable — you build and script everything yourself.

Complexity: ●● Medium
Read time: ~30 minutes
Dependencies: 0 total

What using it looks like

Drawn from the project's README

# Use a local model file
llama-cli -m my_model.gguf

# Or download and run a model directly from Hugging Face
llama-cli -hf ggml-org/gemma-3-1b-it-GGUF

# Launch OpenAI-compatible API server
llama-server -hf ggml-org/gemma-3-1b-it-GGUF
Fig. 1 — usage examples from the README

What this is

As told for the tourist


llama.cpp is a program that lets you run powerful AI language models—the same kind that power ChatGPT—directly on your own computer, without needing an internet connection or paying anyone. Think of it as a tiny engine that can bring a smart chatbot to life right inside your laptop or desktop.

What Can You Do With It?

You could use this to have a private conversation with an AI assistant that never sends your data anywhere. For example, you can open your terminal and type:

llama-cli -m my_model.gguf

Then just start chatting: "Hi, who are you?" and it'll reply like a friendly helper. You can also launch your own personal AI server that works like OpenAI's API, so other apps on your computer can talk to it:

llama-server -hf ggml-org/gemma-3-1b-it-GGUF

This means you could build a writing assistant, a coding buddy, or a study tutor that runs entirely offline. The README shows you can even download models directly from Hugging Face (a big online library of AI models) with a simple command like llama-cli -hf ggml-org/gemma-3-1b-it-GGUF.
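To make "other apps on your computer can talk to it" concrete, here is a minimal Python sketch of a client, assuming llama-server is running on its default port (8080) and exposing the OpenAI-style chat completions route; treat the exact payload shape as an assumption rather than a spec:

import requests

# Minimal sketch: assumes llama-server is listening on localhost:8080
# and serving the OpenAI-compatible /v1/chat/completions endpoint.
resp = requests.post(
    "http://localhost:8080/v1/chat/completions",
    json={"messages": [{"role": "user", "content": "Hi, who are you?"}]},
    timeout=120,
)
print(resp.json()["choices"][0]["message"]["content"])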


How It Works (No Jargon)

1. The Model is Like a Cookbook

An AI language model is basically a giant collection of patterns—like a cookbook with millions of recipes for how words fit together. When you ask it a question, it's like flipping through the cookbook to find the most likely next word, then the next, then the next. llama.cpp reads this cookbook file (usually a .gguf file) and uses it to generate responses.
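In toy Python, that next-word loop looks roughly like this; predict_scores is a made-up stand-in for the model, not anything from llama.cpp's actual API:

# Toy sketch of autoregressive generation. predict_scores stands in for
# the model: given the words so far, it returns a dict of candidate
# next words mapped to scores.
def generate(predict_scores, prompt_words, max_new_words=20):
    words = list(prompt_words)
    for _ in range(max_new_words):
        scores = predict_scores(words)            # "flip through the cookbook"
        next_word = max(scores, key=scores.get)   # pick the most likely next word
        words.append(next_word)                   # then repeat, with it included
    return " ".join(words)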

2. Your Computer is the Kitchen

Running a model requires a lot of math—think of it as following thousands of recipes simultaneously. llama.cpp is designed to use your computer's brain (the CPU) and, if available, its graphics card (GPU) to do this math really fast. It's like having a super-efficient kitchen that can cook thousands of dishes at once without burning anything.

3. The "Quantization" Trick

The cookbook is usually huge—like 100 billion recipes. llama.cpp can shrink it down by rounding off tiny details (like using "1.5 cups" instead of "1.5234 cups"). This makes the cookbook smaller and faster to use, while still keeping the food tasting almost the same. That's why you can run these models on a regular laptop instead of needing a supercomputer.
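As a toy illustration of that rounding idea (real llama.cpp formats such as Q8_0 or Q4_K work block by block and are considerably cleverer, so take this as the principle only):

# Map floats onto 256 integer steps (8 bits) and back: smaller to
# store, and close enough to the originals for most purposes.
def quantize(weights):
    scale = max(abs(w) for w in weights) / 127.0
    return [round(w / scale) for w in weights], scale

def dequantize(q_weights, scale):
    return [q * scale for q in q_weights]

qs, scale = quantize([0.12, -1.5234, 0.98])
print(dequantize(qs, scale))  # roughly [0.12, -1.5234, 0.98], at a fraction of the size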

What's Cool About It?

The coolest thing is that it's written in C/C++, which is like building a race car engine instead of a family sedan. Most AI tools are written in Python, which is easier to write but slower. llama.cpp is incredibly fast and lightweight—it can run on everything from a beefy gaming PC to a tiny Raspberry Pi. It also supports "multimodal" models now, meaning it can look at pictures and describe them, not just chat.

Who Should Care?

Reach for this if you're curious about AI but don't want to send your private conversations to a cloud server, or if you're a developer who wants to build apps with a free, local AI brain. Skip it if you just want to use ChatGPT in your browser—that's already perfect for casual use. But if you like tinkering, learning, or keeping your data private, llama.cpp is your new best friend.

Start Here

A recommended reading path through the code


  1. Reveals the core abstraction of mapping Python types to GBNF grammar rules, which is central to constrained text generation in llama.cpp.

  2. Shows how JSON Schema definitions are converted to grammar rules, complementing the Pydantic approach and illustrating the key data transformation pipeline (a simplified sketch follows this list).

  3. Demonstrates the model conversion pipeline from legacy formats to GGUF, highlighting quantization and tensor handling—a critical architectural flow.

  4. Provides shared utilities for tensor summarization and validation, revealing common patterns used across conversion scripts.

  5. Illustrates how tensor metadata is inspected, giving insight into the data structures that underpin model loading and conversion.
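To make the first two stops concrete, here is a heavily simplified sketch of the recursive-descent idea behind them: walk a schema, emit one GBNF-style rule per node. It handles only flat objects with string and integer fields, and the rule text is illustrative, not a faithful copy of the real converters' output:

# Heavily simplified sketch of JSON Schema -> GBNF conversion. Only
# flat objects with string/integer fields; the real scripts handle
# far more. The emitted rule syntax is illustrative.
def schema_to_rules(name, schema, rules):
    t = schema["type"]
    if t == "string":
        rules[name] = '"\\"" [^"]* "\\""'
    elif t == "integer":
        rules[name] = "[0-9]+"
    elif t == "object":
        parts = []
        for i, (key, sub) in enumerate(schema["properties"].items()):
            child = f"{name}-{key}"
            schema_to_rules(child, sub, rules)  # recurse into each field
            sep = '"," ' if i else ""
            parts.append(f'{sep}"\\"{key}\\":" {child}')
        rules[name] = '"{" ' + " ".join(parts) + ' "}"'
    return rules

rules = schema_to_rules("root", {
    "type": "object",
    "properties": {"name": {"type": "string"}, "age": {"type": "integer"}},
}, {})
for lhs, rhs in rules.items():
    print(lhs, "::=", rhs)

The real converters layer in arrays, nested objects, optional fields, and proper escaping, but the shape of the recursion is the same.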

What's inside

7 sections of the codebase

Read Next

Where to go from here

📰
Article · 2024

Running LLMs Locally with llama.cpp

Simon Willison

A clear, hands-on walkthrough of installing and using llama.cpp for local inference, perfect for beginners.

📺
Video · 2024

llama.cpp: Run LLMs on Your CPU (Tutorial)

Fireship

A fast-paced, visually engaging intro that demystifies local LLM inference and llama.cpp's role.

📰
Article · 2023

What is GGUF and Why Does It Matter?

Hugging Face Blog

Explains the GGUF format that llama.cpp scripts convert to, making the conversion pipeline understandable.

📰
Article · 2024

A Beginner's Guide to Structured Output from LLMs

LangChain Blog

Introduces the concept of grammar-constrained generation that llama.cpp's GBNF scripts implement.

Sibling Projects

Codebases that occupy adjacent space

⚙️ llama.cpp · 🦙 Ollama · 🚀 vLLM · 💬 FastChat · 🐍 llama-cpp-python


Words You'll Hear

Definitions for the terms used throughout these notes

Checkpoint

concept

A saved snapshot of a model's weights at a specific point during training. These files are the input to the conversion scripts.

concurrent.futures

library

A Python library for running tasks in parallel using thread or process pools. It's used to speed up model conversion by processing multiple tensors simultaneously.
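The pattern, with a hypothetical convert_tensor worker standing in for the real per-tensor work:

from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for the real per-tensor conversion work.
def convert_tensor(name):
    return f"{name}: converted"

names = ["tok_embeddings.weight", "layers.0.attention.wq.weight"]
with ThreadPoolExecutor(max_workers=8) as pool:
    for result in pool.map(convert_tensor, names):
        print(result)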

Context-free grammar

concept

A set of rules that define how to build strings from a language, where each rule can be applied regardless of surrounding context. It's used here to make LLMs output valid JSON or other structured formats.

Cosine similarity

concept

A measure of how similar two vectors (lists of numbers) are, ranging from -1 to 1. It's used here to quickly check if model outputs are close to expected values.
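Spelled out in plain Python (the real checks would use NumPy or PyTorch; this is just the definition made explicit):

import math

# Cosine similarity: dot product divided by the product of the lengths.
def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norms

print(cosine_similarity([1.0, 2.0, 3.0], [1.1, 1.9, 3.2]))  # ~0.999, i.e. very similar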

Factory method

pattern

A design pattern that provides an interface for creating objects in a superclass, but lets subclasses alter the type of objects created. It's partially used in the DataType hierarchy.

GBNF

concept

A grammar format used by llama.cpp to define rules for what tokens an LLM can generate at each step. It's a variant of Backus-Naur Form (BNF) that constrains output to follow a specific structure.

GGUF

concept

A file format for storing quantized large language models, designed for efficient loading and inference. It replaces older formats like GGML and is the standard format for llama.cpp.

Hugging Face

library

A platform and library ecosystem for sharing, storing, and using machine learning models. Scripts here interact with it to download or upload model files.

JSON Schema

concept

A standard format for describing the structure of JSON data, including types, required fields, and constraints. It's used as input to generate grammar rules for LLM output.

LLM inference engine

concept

A software system that runs a trained large language model to generate text, rather than training it. It takes input prompts and produces output tokens efficiently.

Logits

concept

Raw numerical scores output by an LLM for each possible token before they are converted into probabilities. Comparing logits helps check if two model versions produce similar outputs.

Normalized Mean Squared Error (NMSE)

concept

A metric that measures the average squared difference between predicted and actual values, normalized by the variance of the actual values. It's a precise way to validate model output accuracy.
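In plain Python, using the variance normalization described above (other normalizations exist; this mirrors the definition given here):

def nmse(expected, actual):
    n = len(expected)
    mse = sum((e - a) ** 2 for e, a in zip(expected, actual)) / n
    mean = sum(expected) / n
    variance = sum((e - mean) ** 2 for e in expected) / n
    return mse / variance  # 0 means identical; small values mean close outputs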

Null Object pattern

pattern

A design pattern that uses a special object representing 'no value' to avoid null checks. It's implicitly used when handling optional fields in grammar generation.

pickle

library

A Python module for serializing (saving) objects to files. It's used here to load older model checkpoint files.

Pipeline architecture

pattern

A software design where data flows through a series of processing stages, each performing a specific transformation. The conversion scripts follow this pattern with separate workers for each step.

Pydantic

library

A Python library for data validation using type annotations. It's used here to define model schemas that are converted into grammar rules.

PyTorch

library

A popular Python library for building and training neural networks. It's used here to run original models and generate reference outputs for validation.

Quantization

concept

A technique that reduces the precision of a model's numerical weights (e.g., from 16-bit to 8-bit) to make it smaller and faster to run, with minimal loss in quality. It's like compressing a high-resolution image into a smaller file.

Recursive descent

pattern

A technique where a function calls itself to process nested structures, like walking through a tree of JSON schema definitions. It's the core approach in the grammar generators.

safetensors

library

A safe file format for storing tensor data, designed to avoid security issues with pickle. It's a modern alternative for saving model weights.

SentencePiece

library

A tokenizer library that breaks text into tokens without needing a predefined word list. It's used by LLaMA models to convert text into token IDs.

Strategy pattern

pattern

A design pattern that lets you define a family of algorithms and make them interchangeable. It's mentioned as a missing pattern that would improve the converter's architecture.

Tensor

concept

A multi-dimensional array of numbers, like a matrix but with more dimensions. Model weights and activations are stored as tensors.

Token

concept

A small unit of text that an LLM processes, like a word, part of a word, or punctuation mark. Models predict one token at a time to generate responses.

Visitor pattern

pattern

A design pattern where you define an operation to be performed on elements of a structure without changing the elements' classes. Here, it's used to walk through schema definitions and generate grammar rules.

ggerganov/llama.cpp · Archaeologist