Part 8 — Specialisation  ·  Track B of 4
Track B — Applied ML / LLM Engineer
Fine-tune models, build rigorous evals, and work at the model layer
⏱ 2–3 Weeks 🔴 Advanced 🔧 Unsloth · HuggingFace · PEFT · vLLM

🎯 Track Overview

Specialisation B

Go deep on the model layer: fine-tuning open-source LLMs, building rigorous eval frameworks, and serving models in production. This track is for engineers who want to work at AI labs, research teams, or companies that run their own models rather than using third-party APIs.

Skills You Will Build

  • HuggingFace ecosystem: datasets, transformers, hub, evaluate
  • QLoRA fine-tuning with Unsloth — 2x faster, 70% less VRAM
  • PEFT LoRA — train 0.1% of parameters, get 80% of the quality gain
  • vLLM for high-throughput PagedAttention-based model serving
  • GGUF quantisation for local CPU deployment with llama.cpp
  • Rigorous LLM evaluation: domain accuracy, fluency, cost comparison

🤗 HuggingFace Ecosystem

Infrastructure
pip install transformers datasets huggingface_hub accelerate evaluate trl

from datasets import load_dataset, Dataset
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline

# Load any open-source model
model_id = "meta-llama/Llama-3.2-3B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)

# Inference pipeline
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)
result = pipe("Explain DPDK mempool:", max_new_tokens=200)
print(result[0]["generated_text"])

# Build a fine-tuning dataset from domain text
training_data = [
    {"prompt": "What is DPDK?",
     "completion": "DPDK is a set of libraries and drivers for fast packet processing..."},
    {"prompt": "How does rte_ring work?",
     "completion": "rte_ring is a lock-free FIFO queue implementation in DPDK..."},
    # ... 100-1000 examples
]

# Format for instruction fine-tuning using the model's own chat template
def format_prompt(example):
    messages = [{"role": "user", "content": example["prompt"]},
                {"role": "assistant", "content": example["completion"]}]
    return {"text": tokenizer.apply_chat_template(messages, tokenize=False)}

dataset = Dataset.from_list(training_data).map(format_prompt)
dataset.push_to_hub("your-username/domain-qa-dataset")

Fine-tuning with Unsloth QLoRA

Core Skill
pip install unsloth

from unsloth import FastLanguageModel
import torch

# Load model in 4-bit quantised form — fits in 8GB VRAM
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/llama-3.2-3b-instruct-bnb-4bit",
    max_seq_length=2048,
    dtype=None,
    load_in_4bit=True,
)

# Add LoRA adapters — only ~1% of parameters end up trainable with this config
model = FastLanguageModel.get_peft_model(
    model,
    r=16,              # LoRA rank — higher = more capacity, more params
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_alpha=16,
    lora_dropout=0,
    bias="none",
    use_gradient_checkpointing=True,
)
model.print_trainable_parameters()
# trainable params: 41,943,040 || all params: 3,254,702,080 || 1.29%

# Training with SFTTrainer (TRL library)
from trl import SFTTrainer
from transformers import TrainingArguments

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,   # the chat-formatted dataset built above
    dataset_text_field="text",
    max_seq_length=2048,
    args=TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        num_train_epochs=3,
        learning_rate=2e-4,
        fp16=not torch.cuda.is_bf16_supported(),
        bf16=torch.cuda.is_bf16_supported(),
        logging_steps=10,
        output_dir="./output",
    ),
)
trainer.train()

# Save LoRA adapter and push to hub
model.save_pretrained("lora_adapter")
model.push_to_hub("your-username/domain-llama-lora")

# Merge adapter into the base weights for standalone deployment
# (Unsloth writes a merged 16-bit checkpoint directly)
model.save_pretrained_merged("merged_model", tokenizer, save_method="merged_16bit")
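
Before moving on, it is worth sanity-checking the adapter directly in the training session. A minimal sketch, assuming the Unsloth model and tokenizer loaded above are still in memory and a CUDA device is available:

# Quick sanity check: generate from the fine-tuned model before deploying
FastLanguageModel.for_inference(model)   # switch Unsloth into inference mode

prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "What is DPDK?"}],
    tokenize=False, add_generation_prompt=True,
)
inputs = tokenizer([prompt], return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
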

📊 PEFT & LoRA — The Math That Matters

Concepts

LoRA adds small trainable matrices to frozen model weights. Instead of training W (d x d, billions of params), you train A (d x r) and B (r x d) where r is typically 8-64. The effective weight update is A times B, which is added to the frozen W.
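
To make the arithmetic concrete, here is a back-of-the-envelope sketch. The dimensions are illustrative (roughly a 7B-class model with hidden size 4096), not the exact configuration of any specific checkpoint:

# Rough LoRA parameter count (illustrative numbers)
d = 4096                      # hidden size of a ~7B model
r = 16                        # LoRA rank
n_layers = 32
modules_per_layer = 2         # q_proj and v_proj only, as in the config below

frozen_per_module = d * d     # the frozen W: ~16.8M params, never updated
lora_per_module = d * r + r * d   # A (d x r) + B (r x d): ~131k params

total_lora = lora_per_module * modules_per_layer * n_layers
print(f"trainable LoRA params: {total_lora:,}")   # ~8.4M, roughly 0.1% of 7B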

# LoRA: W_update = A x B, where rank r << d
# At r=16 targeting only q_proj and v_proj on a 7B model: ~0.1% of parameters trained
# Quality: typically 80-95% of a full fine-tune at a fraction of the GPU memory and cost

from peft import LoraConfig, get_peft_model, TaskType

# Manual PEFT config (without Unsloth)
config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16,
    lora_alpha=32,           # scaling factor — usually 2x r
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    bias="none",
)
peft_model = get_peft_model(base_model, config)   # base_model: the AutoModelForCausalLM loaded earlier
peft_model.print_trainable_parameters()

# Convert to GGUF for local CPU deployment (llama.cpp)
# 1. Start from the merged checkpoint saved above (./merged_model)
# 2. Convert:  python llama.cpp/convert_hf_to_gguf.py merged_model --outfile model.gguf
# 3. Quantise: ./llama.cpp/llama-quantize model.gguf model.q4_k_m.gguf Q4_K_M
#              (the binary is named ./quantize in older llama.cpp builds)
# 4. Run:      ./llama.cpp/llama-cli -m model.q4_k_m.gguf -p "What is DPDK?"
#              (named ./main in older builds)

# Quantisation comparison (file sizes shown for a ~7-8B model; a 3B model is roughly half):
# Q8_0:   ~8GB,  highest-quality local option
# Q4_K_M: ~4GB,  good quality/size balance — recommended default
# Q3_K_M: ~3GB,  noticeable quality drop
# Q2_K:   ~2GB,  significant quality loss, emergency only
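
To verify the quantised file really runs on CPU from Python (the capstone requirement), a minimal sketch using the llama-cpp-python bindings. This is an extra dependency not covered above, and the model path is the file produced in step 3:

# pip install llama-cpp-python      (CPU build by default)
from llama_cpp import Llama

llm = Llama(model_path="model.q4_k_m.gguf", n_ctx=2048)   # loads entirely on CPU
out = llm("What is DPDK?", max_tokens=128)
print(out["choices"][0]["text"])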

🚀 vLLM — Production Model Serving

Serving
pip install vllm

# Start server (OpenAI-compatible API)
# vllm serve your-username/domain-llama --port 8000 --max-model-len 4096

# Python client — same as OpenAI API
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")
response = client.chat.completions.create(
    model="your-username/domain-llama",
    messages=[{"role": "user", "content": "How does rte_mempool_create work?"}],
    max_tokens=512
)
print(response.choices[0].message.content)

# vLLM advantages over transformers.generate():
# - PagedAttention: manages KV cache in pages, 3-5x higher throughput
# - Continuous batching: batches requests dynamically, no padding waste
# - CUDA kernels: fused attention, faster than naive PyTorch
# - OpenAI-compatible: drop-in replacement for OpenAI API calls
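
# The same engine can also be used offline (no server), e.g. for bulk eval runs.
# A rough sketch: the model name is the placeholder used above, and the whole
# prompt list is scheduled through continuous batching in one call.
from vllm import LLM, SamplingParams

llm = LLM(model="your-username/domain-llama", max_model_len=4096)
sampling = SamplingParams(temperature=0.0, max_tokens=128)
for out in llm.generate(["What is DPDK?", "How does rte_ring work?"], sampling):
    print(out.outputs[0].text)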

# Benchmark your fine-tuned model vs the base model vs Claude 3 Haiku
import anthropic

anthropic_client = anthropic.Anthropic()

def eval_model(questions: list[str], answers: list[str]) -> dict:
    """Compare the fine-tuned model against Claude on domain Q&A.
    Run the same loop against the base model (served on a second vLLM port)
    to get the third column of the report."""
    results = {"fine_tuned": [], "claude": []}

    for q, expected in zip(questions, answers):
        # Fine-tuned model (via the vLLM server started above)
        ft_resp = client.chat.completions.create(
            model="your-username/domain-llama",
            messages=[{"role": "user", "content": q}], max_tokens=256
        )
        # Claude as the baseline
        cl_resp = anthropic_client.messages.create(
            model="claude-3-haiku-20240307", max_tokens=256,
            messages=[{"role": "user", "content": q}]
        )
        # Score both with an LLM judge (judge_answer is sketched below)
        ft_score = judge_answer(q, expected, ft_resp.choices[0].message.content)
        cl_score = judge_answer(q, expected, cl_resp.content[0].text)
        results["fine_tuned"].append(ft_score)
        results["claude"].append(cl_score)

    return {k: sum(v) / len(v) for k, v in results.items()}
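
The loop above assumes a judge_answer helper. One minimal way to implement it, sketched here with Claude as the judge; the 0-5 rubric and prompt wording are illustrative choices, not a fixed recipe:

def judge_answer(question: str, expected: str, actual: str) -> float:
    """Score a candidate answer against the reference, 0.0-1.0, via an LLM judge."""
    prompt = (
        f"Question: {question}\n\n"
        f"Reference answer: {expected}\n\n"
        f"Candidate answer: {actual}\n\n"
        "Rate the candidate's factual agreement with the reference from 0 to 5. "
        "Reply with a single integer only."
    )
    resp = anthropic_client.messages.create(   # reuses the client created above
        model="claude-3-haiku-20240307", max_tokens=5,
        messages=[{"role": "user", "content": prompt}],
    )
    try:
        return int(resp.content[0].text.strip()) / 5
    except ValueError:
        return 0.0   # unparseable judge output scores as a miss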
🛠 Capstone: Domain Fine-Tuned Model with Evaluation Report · 2–3 weeks

Fine-tune a 3B parameter open-source model on a domain-specific dataset from your professional area (DPDK documentation, network engineering, telecom). Evaluate it rigorously and serve it in production.

Requirements

  • Build a dataset of 200+ Q&A pairs from your domain
  • Fine-tune using Unsloth QLoRA on Google Colab T4 (free) or local GPU
  • Serve the merged model with vLLM on a low-cost cloud GPU instance (vLLM needs a GPU; the GGUF build covers cheap CPU-only VMs)
  • Evaluation report on 50 domain questions: fine-tuned vs base model vs Claude-3-Haiku
  • Metrics: accuracy (LLM-judged), latency, cost per query (a rough measurement sketch follows below)
  • GGUF version for local deployment — verify it runs on CPU

Push dataset, training code, evaluation results, and model to HuggingFace Hub.
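
For the latency and cost-per-query columns, a rough measurement sketch; the helper name and the token-based cost arithmetic are illustrative, not prescribed by the capstone:

import time

def measure(call, n: int = 20) -> dict:
    """Run call() n times; return mean latency and total tokens for cost estimates."""
    latencies, tokens = [], 0
    for _ in range(n):
        start = time.perf_counter()
        resp = call()   # any OpenAI-style chat.completions.create(...) call
        latencies.append(time.perf_counter() - start)
        tokens += resp.usage.prompt_tokens + resp.usage.completion_tokens
    return {"mean_latency_s": sum(latencies) / n, "total_tokens": tokens}

# API models: cost per query ≈ tokens per query x provider price per token
# Self-hosted vLLM: cost per query ≈ GPU cost per hour ÷ sustained queries per hour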

MASTERY CHECKLIST

When complete: move to Part 9 — Portfolio and Launch.
