Local LLMs 101
What it must have felt like to experience the sheer joy of local large language models back in the 1500s.
A practical foundation for understanding and locally running Large Language Models (LLMs)
This guide was made possible by the fantastic public information sharing of Ahmad Osman. Thank you, Ahmad, for your significant efforts to educate the masses about the importance of accumulating GPUs and running local LLMs.
This guide builds a quick but deep foundation for how LLMs work and how to reason about them when you run them on your own hardware, covering topics like:
- How models actually generate text
- What tokens, weights, and context really mean
- How and why VRAM limits inference performance
- How quantization and formats work
- What commonly goes wrong
- How to think about local inference
Inference: What LLMs actually do
Inference means running a trained model to generate output.
During inference:
- The model does not learn.
- All weights stay frozen; no parameters are updated.
- The model only performs math on existing parameters.
At every step, the model answers this question:
Given everything so far, what token will most likely come next?
This happens one token at a time.
The sequence
A model always operates on a sequence of tokens with this flow:
[input tokens] → predict next token → append → repeat
Each step uses:
- Your prompt.
- All generated tokens from earlier steps.
- The model’s weights.
- The key/value cache (detailed later).
This repeating loop is the entire generation process.
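To make the loop concrete, here is a minimal sketch of manual token-by-token inference with Hugging Face Transformers and greedy selection; "gpt2" is only a stand-in for whatever model you actually run.

```python
# Minimal sketch of the inference loop: predict, append, repeat.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()  # inference only: nothing is learned, weights stay fixed

ids = tokenizer("The capital of France is", return_tensors="pt").input_ids

with torch.no_grad():
    for _ in range(10):                          # generate 10 tokens
        logits = model(ids).logits               # scores for every vocabulary token
        next_id = logits[:, -1, :].argmax(dim=-1, keepdim=True)  # most likely next token
        ids = torch.cat([ids, next_id], dim=-1)  # append and repeat
        # (a real runtime reuses the KV cache here instead of recomputing the whole sequence)

print(tokenizer.decode(ids[0]))
```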
Tokens (not words)
Important: tokens ≠ words
Tokens represent the units of text a model actually sees.
Tokens can represent:
- Whole words
- Parts of words
- Punctuation
- Whitespace
- Unicode fragments
Token examples (varies by tokenizer):
- “hello” → 1 token (sometimes 2–3)
- “internationalization” → 5–8 tokens
- “ ChatGPT” (note the leading space) → often its own token
Tokenizers
A tokenizer converts text ↔ token IDs.
Common types:
- Byte Pair Encoding (BPE) – GPT-style
- WordPiece – BERT-style
- SentencePiece – LLaMA / Mistral-style
Inside the model, tokens are just integers.
- Humans see text.
- Models see numbers.
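A quick way to see the text ↔ integer mapping for yourself is to run a tokenizer directly; the example below assumes the Hugging Face "gpt2" tokenizer, and exact token counts will differ with other tokenizers.

```python
# Text ↔ token IDs with a BPE tokenizer ("gpt2" used purely as an example).
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")

ids = tok.encode(" ChatGPT loves internationalization")
print(ids)                             # the integers the model actually sees
print(tok.convert_ids_to_tokens(ids))  # sub-word pieces, not whole words
print(tok.decode(ids))                 # round-trips back to the original text
```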
Context window
The context window defines the number of tokens a model can “see” at once.
Examples:
- 2K tokens
- 8K tokens
- 32K+
- 128K+
Longer context means:
- More memory usage
- Larger KV cache
- Slower decoding
VRAM capacity limits context length, not the model file itself.
Models, weights, and parameters
Parameters (weights)
When we talk about parameters, we refer to a model’s learned numerical values.
They encode:
- Language structure
- Patterns
- Statistical associations
- Learned reasoning behaviors
Another term for these parameters is weights.
Examples
- Granite 4 3B → ~3 billion parameters
- Ministral-3 14B → ~14 billion parameters
Larger models:
- Require more VRAM
- Are generally more capable
- Are slower and more expensive to run
When people say:
This model knows things.
What they really mean:
The weights encode statistical patterns learned during training.
What makes up a model?
More than just a file full of weights, a complete model includes:
- Architecture (transformer structure)
- Weights (learned parameters)
- Tokenizer
- Configuration (layer counts, dimensions, special tokens)
- Chat template (for chat/instruct models)
- License metadata
Missing any of these often causes:
- Gibberish output.
- Broken formatting.
- Refusal to follow instructions.
The Transformer
Almost every modern LLM uses a decoder-only transformer architecture.
The architecture is optimized for predicting the next token in a sequence.
What transformers do
Transformers use self-attention to:
- Process sequences of tokens
- Look backward at previous tokens
- Decide which tokens matter most
- Predict probabilities for the next token
Transformer layers
Each model has many stacked layers:
- 24 layers (small)
- 32 layers (7B-class)
- 80+ layers (70B+)
Every generated token passes through all layers, every time.
You can offload layers to CPUs, GPUs, or TPUs, including hybrid combinations, but VRAM capacity remains the primary performance bottleneck.
Inside one transformer layer
Each layer contains:
Self-attention
Self-attention determines how relevant each previous token is to the current prediction.
Variants include:
- Multi-Head Attention
- Grouped Query Attention (GQA)
- Multi-Query Attention (MQA) → saves memory and bandwidth
Multilayer Perceptron (MLP)
An MLP adds non-linearity and expressive power.
An MLP typically consists of:
- Two linear layers.
- Gaussian Error Linear Unit (GELU) or Swish-Gated Linear Unit (SwiGLU) activation.
Positional encoding
Positional encoding tells the model the location of tokens in the sequence.
Contemporary models use Rotary Positional Embeddings (RoPE).
Without positional information, the model would have no notion of word order.
Residual connections + LayerNorm
These stabilize deep networks and prevent vanishing or exploding gradients during training.
Transformer summary
- Decoder-only
- Token-by-token prediction
- Self-attention over all prior tokens
- Stacked many times
- Outputs logits → probabilities → next token
Mathematically:
f_θ(sequence) → P(next_token)
Everything else is repetition.
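As a tiny worked example of the logits → probabilities → next token step, here is a toy softmax over an 8-token vocabulary; the numbers are random and merely stand in for what the final layer would emit.

```python
# Toy illustration of f_θ(sequence) → P(next_token): logits → softmax → sample.
import torch
import torch.nn.functional as F

logits = torch.randn(8)                 # stand-in for the last layer's output over a tiny vocabulary
probs = F.softmax(logits, dim=-1)       # probabilities over the vocabulary (sums to 1)
next_token = torch.multinomial(probs, num_samples=1)  # pick one token ID
print(probs.tolist(), next_token.item())
```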
Generation: How text appears
Generation is always stepwise.
- Predict probabilities.
- Choose a token.
- Append it to the sequence.
- Update KV cache.
- Repeat.
This continues until:
- End-of-sequence token.
- Max tokens reached.
- You stop generation.
No “thinking ahead” actually occurs, only repeated prediction.
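This same loop, with its stop conditions, is what a library call such as Transformers' generate() runs for you; the sketch below again uses "gpt2" as a placeholder model.

```python
# Library-level generation with the usual stop conditions ("gpt2" is a placeholder).
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tok("Once upon a time", return_tensors="pt")
out = model.generate(
    **inputs,
    max_new_tokens=50,              # stop condition: max tokens reached
    eos_token_id=tok.eos_token_id,  # stop condition: end-of-sequence token
    use_cache=True,                 # reuse the KV cache between steps
)
print(tok.decode(out[0], skip_special_tokens=True))
```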
Decoding: Choosing the next token
The model outputs probabilities; you choose how to sample from them.
The following sections list common decoding strategies.
Greedy
- Always pick the top token
- Deterministic
- Often robotic
Temperature
- Controls randomness
- Lower = more focused
- Higher = more creative
Top-k
- Sample only from top K tokens
Top-p (nucleus sampling)
- Sample from smallest set whose cumulative probability ≥ p
Other controls
- Repetition penalty
- Frequency penalty
- No-repeat n-grams
- Deterministic seeds
Decoding heavily affects output quality.
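Here is an illustrative sampling step that combines temperature, top-k, and top-p on a raw logits vector. It is a sketch of the general idea, not the exact code of any particular runtime.

```python
# One sampling step over a raw logits vector: temperature, then top-k, then top-p.
import torch
import torch.nn.functional as F

def sample(logits, temperature=0.8, top_k=50, top_p=0.9):
    logits = logits / temperature                     # lower temperature → sharper distribution
    if top_k > 0:                                     # keep only the top-k logits
        kth = torch.topk(logits, top_k).values[-1]
        logits[logits < kth] = float("-inf")
    probs = F.softmax(logits, dim=-1)
    if top_p < 1.0:                                   # nucleus: smallest set with cumulative prob ≥ p
        sorted_probs, sorted_idx = torch.sort(probs, descending=True)
        cumulative = torch.cumsum(sorted_probs, dim=-1)
        cutoff = cumulative > top_p
        cutoff[1:] = cutoff[:-1].clone()              # keep the token that crosses the threshold
        cutoff[0] = False
        probs[sorted_idx[cutoff]] = 0.0
        probs = probs / probs.sum()                   # renormalize what remains
    return torch.multinomial(probs, num_samples=1).item()

next_id = sample(torch.randn(32000))                  # toy vocabulary of 32k tokens
print(next_id)
```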
Key/Value Cache (Session Memory)
The key/value cache stores attention states for previous tokens.
This is the model’s working memory during inference.
Important facts:
- Grows linearly with context length.
- Stored per layer, per head.
- Dominates VRAM usage for long contexts.
Rule of thumb (7B-class model):
- ~0.5 MB per token.
- 4K tokens ≈ ~2 GB KV cache.
Many runtimes support 4-bit and 8-bit key/value cache quantization to reduce memory usage.
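The rule of thumb above falls out of simple arithmetic. The sketch below assumes a hypothetical 7B-class model with 32 layers, a hidden size of 4096, and an FP16 cache; real models vary, and grouped-query attention shrinks these numbers.

```python
# Back-of-the-envelope KV cache sizing (assumed: 32 layers, hidden size 4096, FP16).
layers = 32
hidden = 4096
bytes_per_value = 2                                   # FP16
per_token = layers * 2 * hidden * bytes_per_value     # one key + one value vector per layer
print(per_token / 1e6, "MB per token")                # ≈ 0.5 MB
print(per_token * 4096 / 1e9, "GB for a 4K context")  # ≈ 2 GB
```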
VRAM: The real bottleneck
VRAM must hold all of the following:
- Model weights
- KV cache
- Runtime overhead
Weight memory (approximate)
| Precision | Memory |
|---|---|
| FP16 | ~2 bytes / parameter |
| 8-bit | ~1 byte / parameter |
| 4-bit | ~0.5 bytes / parameter |
Examples:
- 7B FP16 → ~14 GB
- 7B 4-bit → ~3.5 GB
You also need to factor in 10–30% overhead.
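A quick calculation matching the table above; the 20% overhead here is just one assumption inside that 10–30% range.

```python
# Rough weight-memory estimate: parameters × bytes per parameter × overhead.
def weight_gb(params_billion, bytes_per_param, overhead=0.20):
    return params_billion * bytes_per_param * (1 + overhead)

print(weight_gb(7, 2.0))   # 7B at FP16  → ~16.8 GB including overhead
print(weight_gb(7, 0.5))   # 7B at 4-bit → ~4.2 GB including overhead
```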
CPU vs GPU
- You should strongly prefer GPUs.
- CPUs can (slowly) run quantized models.
- CPU offloading = massive slowdown.
Real local inference with acceptable performance requires:
- CUDA (NVIDIA).
- ROCm (AMD).
- Metal (Apple Silicon).
Quantization
Quantization reduces numerical precision to save memory.
Common types:
- FP16 / BF16 is full quality.
- INT8 is moderate compression.
- INT4 / NF4 is aggressive compression.
4-bit quantization turns out to be the sweet spot for most users:
- Significant VRAM savings.
- Minor quality loss for most tasks.
Degradation appears first in:
- Math.
- Logic.
- Complex code generation.
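One common way to get 4-bit weights on an NVIDIA GPU is Transformers plus bitsandbytes; the model name below is a placeholder, and the exact options depend on your installed versions.

```python
# Loading a causal LM in 4-bit (NF4) via Transformers + bitsandbytes.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # NF4: the "aggressive" 4-bit variant
    bnb_4bit_compute_dtype=torch.bfloat16,  # do the math in BF16
)
model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-Instruct-v0.2",   # placeholder: any causal LM repo
    quantization_config=bnb,
    device_map="auto",                      # spread across available GPUs/CPU
)
```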
Model formats & runtimes
The following sections cover common model formats and runtimes.
PyTorch + safetensors
- Standard format.
- Secure (no pickle).
- Flexible.
- Used by Hugging Face.
GGUF (llama.cpp format)
- Optimized for quantization.
- CPU/GPU portable.
- Great fit for local and edge devices.
- Used by llama.cpp and Ollama.
Others
- MLX
- ONNX
- TensorRT-LLM
- MLC
Avoid legacy .bin files (pickle risk).
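For comparison with the safetensors path, here is a minimal sketch of loading a GGUF file with llama-cpp-python; the file path and parameter values are illustrative.

```python
# Running a quantized GGUF model via llama-cpp-python (path and settings are examples).
from llama_cpp import Llama

llm = Llama(
    model_path="models/model-q4_k_m.gguf",  # a 4-bit GGUF you downloaded
    n_ctx=4096,                             # context window to allocate
    n_gpu_layers=-1,                        # offload all layers to the GPU if possible
)
out = llm("Q: What is a token?\nA:", max_tokens=64)
print(out["choices"][0]["text"])
```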
Serving local models
Common server options:
- llama.cpp (OpenAI-compatible)
- Ollama
- vLLM (high-throughput, production quality)
- ExLlama V3 (fast GPU inference)
- Local scripts
- FastAPI / Flask wrappers
Remember: local does not mean offline; it means self-hosted.
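Because most of these servers expose an OpenAI-compatible endpoint, a standard client works against them; the port and model name below are assumptions about your setup.

```python
# Querying a local OpenAI-compatible server (e.g. llama.cpp's server or vLLM).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
resp = client.chat.completions.create(
    model="local-model",   # whatever name your server exposes
    messages=[{"role": "user", "content": "Explain the KV cache in one sentence."}],
    temperature=0.7,
)
print(resp.choices[0].message.content)
```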
Chat vs base models
Base models
- Not instruction-tuned.
- Require few-shot prompting.
- Easily produce nonsense in chat.
Chat / instruct models
- Fine-tuned for dialogue.
- Require chat templates.
- Wrong template = garbage output.
Always use the correct template:
- apply_chat_template() in Transformers.
- Built-in templates in llama.cpp / vLLM.
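For example, with Transformers the tokenizer applies the model's own template for you; the checkpoint name is a placeholder for any instruct-tuned model that ships one.

```python
# Wrapping messages in the model's expected chat format (placeholder checkpoint).
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")
messages = [{"role": "user", "content": "Write a haiku about VRAM."}]
prompt = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)   # the exact wrapped string the model was trained to expect
```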
Fine-tuning (briefly)
Most users do not need full fine-tuning.
Common approaches:
- LoRA / QLoRA
- Lightweight adapter layers.
- Minimal VRAM.
- Can merge or swap.
Often better alternatives:
- Prompt engineering.
- Few-shot examples.
- Retrieval-Augmented Generation (RAG).
- Agents with skills and tooling.
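If you do reach for LoRA, the peft library is the usual entry point. The sketch below uses "gpt2" and typical hyperparameters purely as assumptions, not a recipe from this guide; target modules are model-specific.

```python
# Attaching LoRA adapters to a frozen base model with peft.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("gpt2")   # placeholder base model
lora = LoraConfig(
    r=16,                        # adapter rank: small matrices added alongside existing layers
    lora_alpha=32,
    target_modules=["c_attn"],   # which linear layers get adapters (model-specific)
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()   # only the adapters train; base weights stay frozen
```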
Common problems
Out of memory (OOM)
- Model too large.
- Context too long.
- Use quantization or reduce context.
Gibberish output
- Wrong chat template.
- Base model used as chat model.
- Temperature too high.
Slow performance
- CPU offloading.
- Missing FlashAttention.
- Wrong drivers.
Unsafe models
- Avoid random .bin files.
- Avoid trust_remote_code unless necessary.
- Prefer safetensors.
Why run LLMs locally?
Advantages
- Full control over decoding and prompts.
- No per-token billing.
- Privacy.
- No network latency.
- Deep customization.
Challenges
- Hardware limits.
- Driver complexity.
- Ecosystem fragmentation.
- More operational work.
Practical local LLM checklist
- Pick a chat-tuned model.
- Size it for your VRAM.
- Choose quantization.
- Install runtime.
- Verify memory fit.
- Use correct chat template.
- Tune decoding.
- Benchmark your task.
- Serve via local API if needed.
Glossary
- Token: smallest unit of text processed by the model.
- Context window: max visible tokens.
- KV cache: attention memory.
- Quantization: lower-precision weights.
- RoPE: rotary positional embeddings.
- GQA/MQA: efficient attention variants.
- Decoding: token selection strategy.
- RAG: retrieval-augmented generation.
Final takeaway
Local LLMs are not magic.
They are:
- Memory math.
- Token sequencing.
- Correct formatting.
- Hardware constraints.
- Sampling decisions.
Understand these concepts and you can:
- Run nearly any modern model locally.
- Diagnose failures quickly.
- Reason about performance.
- Build reliable local AI systems.
Once you understand the sequence, you understand the model.