Track Overview
Specialisation B
Go deep on the model layer: fine-tuning open-source LLMs, building rigorous eval frameworks, and serving models in production. This track is for engineers who want to work at AI labs, research teams, or companies that run their own models rather than using third-party APIs.
Skills You Will Build
- HuggingFace ecosystem: datasets, transformers, hub, evaluate
- QLoRA fine-tuning with Unsloth — 2x faster, 70% less VRAM
- PEFT LoRA — train 0.1% of parameters, get 80% of the quality gain
- vLLM for high-throughput PagedAttention-based model serving
- GGUF quantisation for local CPU deployment with llama.cpp
- Rigorous LLM evaluation: domain accuracy, fluency, cost comparison
HuggingFace Ecosystem
Infrastructure
pip install transformers datasets huggingface_hub accelerate evaluate trl
from datasets import load_dataset, Dataset
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline
# Load any open-source model
model_id = "meta-llama/Llama-3.2-3B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
model_id, torch_dtype="auto", device_map="auto"
)
# Inference pipeline
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)
result = pipe("Explain DPDK mempool:", max_new_tokens=200)
print(result[0]["generated_text"])
# Build a fine-tuning dataset from domain text
training_data = [
{"prompt": "What is DPDK?",
"completion": "DPDK is a set of libraries and drivers for fast packet processing..."},
{"prompt": "How does rte_ring work?",
"completion": "rte_ring is a lock-free FIFO queue implementation in DPDK..."},
# ... 100-1000 examples
]
# Format for instruction fine-tuning (simplified chat-style template; the model's native template is shown below)
def format_prompt(example):
return {"text": f"<|user|>\n{example['prompt']}<|assistant|>\n{example['completion']}"}
dataset = Dataset.from_list(training_data).map(format_prompt)
dataset.push_to_hub("your-username/domain-qa-dataset")
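The template above is a simplified stand-in. If you want the model's real chat format, the tokenizer can build it for you — a minimal sketch using the tokenizer and training_data loaded earlier (the helper name is illustrative):
# Alternative: let the tokenizer apply the model's native chat template
def format_with_template(example):
    messages = [
        {"role": "user", "content": example["prompt"]},
        {"role": "assistant", "content": example["completion"]},
    ]
    return {"text": tokenizer.apply_chat_template(messages, tokenize=False)}
dataset = Dataset.from_list(training_data).map(format_with_template)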
Fine-tuning with Unsloth QLoRA
Core Skill
pip install unsloth
from unsloth import FastLanguageModel
import torch
# Load model in 4-bit quantised form — fits in 8GB VRAM
model, tokenizer = FastLanguageModel.from_pretrained(
model_name="unsloth/llama-3.2-3b-instruct-bnb-4bit",
max_seq_length=2048,
dtype=None,
load_in_4bit=True,
)
# Add LoRA adapters — only ~1% of parameters are trainable
model = FastLanguageModel.get_peft_model(
model,
r=16, # LoRA rank — higher = more capacity, more params
target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
"gate_proj", "up_proj", "down_proj"],
lora_alpha=16,
lora_dropout=0,
bias="none",
use_gradient_checkpointing=True,
)
model.print_trainable_parameters()
# trainable params: 41,943,040 || all params: 3,254,702,080 || 1.29%
# Training with SFTTrainer (TRL library)
from trl import SFTTrainer
from transformers import TrainingArguments
trainer = SFTTrainer(
model=model,
tokenizer=tokenizer,
train_dataset=dataset,  # the formatted dataset built in the previous section
dataset_text_field="text",
max_seq_length=2048,
args=TrainingArguments(
per_device_train_batch_size=2,
gradient_accumulation_steps=4,
num_train_epochs=3,
learning_rate=2e-4,
fp16=not torch.cuda.is_bf16_supported(),
bf16=torch.cuda.is_bf16_supported(),
logging_steps=10,
output_dir="./output",
),
)
trainer.train()
# Save LoRA adapter and push to hub
model.save_pretrained("lora_adapter")
model.push_to_hub("your-username/domain-llama-lora")
# Merge adapter into base model for standalone deployment
merged = model.merge_and_unload()
merged.save_pretrained("merged_model")
PEFT & LoRA — The Math That Matters
Concepts
LoRA adds small trainable matrices to frozen model weights. Instead of training W (d x d, billions of params), you train A (d x r) and B (r x d), where r is typically 8-64. The effective weight update is A times B, which is added to the frozen W.
# LoRA: W_update = A x B, where rank r << d
# At r=16 on a 7B model: 0.1% of parameters trained
# Quality: typically 80-95% of full fine-tune quality at 1% the cost
from peft import LoraConfig, get_peft_model, TaskType
# Manual PEFT config (without Unsloth)
config = LoraConfig(
task_type=TaskType.CAUSAL_LM,
r=16,
lora_alpha=32, # scaling factor — usually 2x r
target_modules=["q_proj", "v_proj"],
lora_dropout=0.05,
bias="none",
)
peft_model = get_peft_model(base_model, config)
peft_model.print_trainable_parameters()
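To sanity-check the number that print_trainable_parameters reports, the arithmetic is simple. A back-of-envelope sketch, assuming a Llama-2-7B-style model with square q/v projections (dimensions are illustrative):
# LoRA parameter count for the config above (r=16, q_proj and v_proj only)
d_model = 4096           # hidden size of a ~7B Llama-style model
n_layers = 32
r = 16
modules_per_layer = 2    # q_proj and v_proj
# Each adapted matrix contributes A (d x r) plus B (r x d) parameters
lora_params = n_layers * modules_per_layer * 2 * d_model * r
print(f"{lora_params:,} trainable params")          # 8,388,608
print(f"fraction of 7B = {lora_params / 7e9:.2%}")  # ~0.12%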
# Convert to GGUF for local CPU deployment (llama.cpp)
# 1. Save merged model: merged.save_pretrained("merged")
# 2. Convert: python llama.cpp/convert_hf_to_gguf.py merged --outfile model.gguf
# 3. Quantise: ./llama.cpp/quantize model.gguf model.q4_k_m.gguf Q4_K_M
# 4. Run: ./llama.cpp/main -m model.q4_k_m.gguf -p "What is DPDK?"
# Quantisation comparison (approximate sizes for a 7-8B model; a 3B model is roughly half):
# Q8_0: ~8GB, highest quality local
# Q4_K_M: ~4GB, good quality/size balance — recommended default
# Q3_K_M: ~3GB, noticeable quality drop
# Q2_K: ~2GB, significant quality loss, emergency only
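To verify the quantised file outside the llama.cpp CLI, llama-cpp-python can load it directly — a minimal sketch, assuming pip install llama-cpp-python and the model.q4_k_m.gguf file produced above:
from llama_cpp import Llama

# Load the quantised GGUF and run a quick CPU-only sanity check
llm = Llama(model_path="model.q4_k_m.gguf", n_ctx=2048)
out = llm("What is DPDK?", max_tokens=200)
print(out["choices"][0]["text"])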
vLLM — Production Model Serving
Serving
pip install vllm
# Start server (OpenAI-compatible API)
# vllm serve your-username/domain-llama --port 8000 --max-model-len 4096
# Python client — same as OpenAI API
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")
response = client.chat.completions.create(
model="your-username/domain-llama",
messages=[{"role": "user", "content": "How does rte_mempool_create work?"}],
max_tokens=512
)
print(response.choices[0].message.content)
# vLLM advantages over transformers.generate():
# - PagedAttention: manages KV cache in pages, 3-5x higher throughput
# - Continuous batching: batches requests dynamically, no padding waste
# - CUDA kernels: fused attention, faster than naive PyTorch
# - OpenAI-compatible: drop-in replacement for OpenAI API calls
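For offline batch evaluation you can skip the HTTP server and use vLLM's Python API, which applies the same continuous batching — a minimal sketch (the model ID is a placeholder):
from vllm import LLM, SamplingParams

# All prompts are scheduled together via continuous batching
llm = LLM(model="your-username/domain-llama", max_model_len=4096)
params = SamplingParams(temperature=0.2, max_tokens=256)
outputs = llm.generate(["What is DPDK?", "How does rte_ring work?"], params)
for out in outputs:
    print(out.outputs[0].text)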
# Benchmark your fine-tuned model vs base model vs Claude-3-Haiku
import anthropic

def eval_model(questions: list[str], answers: list[str]) -> dict:
    """Compare fine-tuned vs base vs Claude on domain Q&A."""
    results = {"fine_tuned": [], "base": [], "claude": []}
    for q, expected in zip(questions, answers):
        # Fine-tuned (via vLLM)
        ft_resp = client.chat.completions.create(
            model="fine-tuned", messages=[{"role": "user", "content": q}], max_tokens=256
        )
        # Claude as baseline
        cl_resp = anthropic.Anthropic().messages.create(
            model="claude-3-haiku-20240307", max_tokens=256,
            messages=[{"role": "user", "content": q}]
        )
        # Base model: send the same request to a second vLLM server running the
        # un-tuned checkpoint and append its score to results["base"]
        # Score both with an LLM judge (a judge_answer sketch follows below)
        ft_score = judge_answer(q, expected, ft_resp.choices[0].message.content)
        cl_score = judge_answer(q, expected, cl_resp.content[0].text)
        results["fine_tuned"].append(ft_score)
        results["claude"].append(cl_score)
    # Average only the slots that were actually filled
    return {k: sum(v) / len(v) for k, v in results.items() if v}
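The judge_answer helper is left to you. A minimal sketch, assuming Claude-3-Haiku as the judge and a 0-10 accuracy scale (both choices are illustrative):
import anthropic

def judge_answer(question: str, expected: str, actual: str) -> float:
    """Hypothetical LLM judge: score a candidate answer 0-10 against a reference."""
    judge_prompt = (
        f"Question: {question}\n"
        f"Reference answer: {expected}\n"
        f"Candidate answer: {actual}\n"
        "Rate the candidate's factual accuracy against the reference from 0 to 10. "
        "Reply with the number only."
    )
    resp = anthropic.Anthropic().messages.create(
        model="claude-3-haiku-20240307", max_tokens=5,
        messages=[{"role": "user", "content": judge_prompt}],
    )
    return float(resp.content[0].text.strip())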
Capstone Project
Fine-tune a 3B parameter open-source model on a domain-specific dataset from your professional area (DPDK documentation, network engineering, telecom). Evaluate it rigorously and serve it in production.
Requirements
- Build a dataset of 200+ Q&A pairs from your domain
- Fine-tune using Unsloth QLoRA on Google Colab T4 (free) or local GPU
- Serve the merged model with vLLM on a low-cost cloud GPU instance
- Evaluation report on 50 domain questions: fine-tuned vs base model vs Claude-3-Haiku
- Metrics: accuracy (LLM-judged), latency, cost per query — see the timing sketch below
- GGUF version for local deployment — verify it runs on CPU
Push dataset, training code, evaluation results, and model to HuggingFace Hub.
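For the latency and cost-per-query numbers, a simple timing harness against the vLLM endpoint is enough. A sketch (the model name and cost figures are placeholders to fill in):
import time

def timed_query(q: str) -> tuple[str, float]:
    """Return the model's answer and wall-clock latency in seconds."""
    start = time.perf_counter()
    resp = client.chat.completions.create(
        model="fine-tuned", messages=[{"role": "user", "content": q}], max_tokens=256
    )
    return resp.choices[0].message.content, time.perf_counter() - start

# Cost per query:
#   API model:          input_tokens * input_price + output_tokens * output_price
#   Self-hosted model:  instance hourly cost / queries served per hour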
MASTERY CHECKLIST
- Can load any HuggingFace model and run inference with the pipeline API
- Can build a formatted instruction fine-tuning dataset and push to HuggingFace Hub
- Can fine-tune a 3B model with Unsloth QLoRA in 4-bit on a single GPU
- Can explain LoRA: low-rank matrices A and B trained, roughly 0.1-1% of parameters
- Can merge LoRA adapters into base model: merge_and_unload() then save
- Can convert a merged model to GGUF and quantise to Q4_K_M
- Can serve a fine-tuned model with vLLM: OpenAI-compatible API on port 8000
- Can build an eval harness comparing fine-tuned vs base model vs Claude on 50 questions
- Capstone: evaluation report showing improvement on domain questions published to GitHub
When complete: move to Part 9 — Portfolio and Launch.