Track Overview
Specialisation B
Go deep on the model layer: fine-tuning open-source LLMs, building rigorous eval frameworks, and serving models in production. This track is for engineers who want to work at AI labs, research teams, or companies that run their own models rather than using third-party APIs.
Skills You Will Build
- HuggingFace ecosystem: datasets, transformers, hub, evaluate
- QLoRA fine-tuning with Unsloth — 2x faster, 70% less VRAM
- PEFT LoRA — train 0.1% of parameters, get 80% of the quality gain
- vLLM for high-throughput PagedAttention-based model serving
- GGUF quantisation for local CPU deployment with llama.cpp
- Rigorous LLM evaluation: domain accuracy, fluency, cost comparison
HuggingFace Ecosystem
Infrastructure
pip install transformers datasets huggingface_hub accelerate evaluate trl
from datasets import load_dataset, Dataset
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline
# Load any open-source model
model_id = "meta-llama/Llama-3.2-3B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
model_id, torch_dtype="auto", device_map="auto"
)
# Inference pipeline
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)
result = pipe("Explain DPDK mempool:", max_new_tokens=200)
print(result[0]["generated_text"])
# Build a fine-tuning dataset from domain text
training_data = [
{"prompt": "What is DPDK?",
"completion": "DPDK is a set of libraries and drivers for fast packet processing..."},
{"prompt": "How does rte_ring work?",
"completion": "rte_ring is a lock-free FIFO queue implementation in DPDK..."},
# ... 100-1000 examples
]
# Format for instruction fine-tuning (simplified chat-style template; the model's native template is shown below)
def format_prompt(example):
return {"text": f"<|user|>\n{example['prompt']}<|assistant|>\n{example['completion']}"}
dataset = Dataset.from_list(training_data).map(format_prompt)
dataset.push_to_hub("your-username/domain-qa-dataset")
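The template above is a simplified stand-in. If you want the model's real chat format, the tokenizer can build it for you — a minimal sketch using the tokenizer and training_data loaded earlier (the helper name is illustrative):
# Alternative: let the tokenizer apply the model's native chat template
def format_with_template(example):
    messages = [
        {"role": "user", "content": example["prompt"]},
        {"role": "assistant", "content": example["completion"]},
    ]
    return {"text": tokenizer.apply_chat_template(messages, tokenize=False)}
dataset = Dataset.from_list(training_data).map(format_with_template)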
Fine-tuning with Unsloth QLoRA
Core Skill
pip install unsloth
from unsloth import FastLanguageModel
import torch
# Load model in 4-bit quantised form — fits in 8GB VRAM
model, tokenizer = FastLanguageModel.from_pretrained(
model_name="unsloth/llama-3.2-3b-instruct-bnb-4bit",
max_seq_length=2048,
dtype=None,
load_in_4bit=True,
)
# Add LoRA adapters — only ~1% of parameters are trainable
model = FastLanguageModel.get_peft_model(
model,
r=16, # LoRA rank — higher = more capacity, more params
target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
"gate_proj", "up_proj", "down_proj"],
lora_alpha=16,
lora_dropout=0,
bias="none",
use_gradient_checkpointing=True,
)
model.print_trainable_parameters()
# trainable params: 41,943,040 || all params: 3,254,702,080 || 1.29%
# Training with SFTTrainer (TRL library)
from trl import SFTTrainer
from transformers import TrainingArguments
trainer = SFTTrainer(
model=model,
tokenizer=tokenizer,
train_dataset=dataset,  # the formatted dataset built in the previous section
dataset_text_field="text",
max_seq_length=2048,
args=TrainingArguments(
per_device_train_batch_size=2,
gradient_accumulation_steps=4,
num_train_epochs=3,
learning_rate=2e-4,
fp16=not torch.cuda.is_bf16_supported(),
bf16=torch.cuda.is_bf16_supported(),
logging_steps=10,
output_dir="./output",
),
)
trainer.train()
# Save LoRA adapter and push to hub
model.save_pretrained("lora_adapter")
model.push_to_hub("your-username/domain-llama-lora")
# Merge adapter into base model for standalone deployment
merged = model.merge_and_unload()
merged.save_pretrained("merged_model")
PEFT & LoRA — The Math That Matters
Concepts
LoRA adds small trainable matrices to frozen model weights. Instead of training W (d x d, billions of params), you train A (d x r) and B (r x d), where r is typically 8-64. The effective weight update is A times B, which is added to the frozen W.
# LoRA: W_update = A x B, where rank r << d
# At r=16 on a 7B model: 0.1% of parameters trained
# Quality: typically 80-95% of full fine-tune quality at 1% the cost
from peft import LoraConfig, get_peft_model, TaskType
# Manual PEFT config (without Unsloth)
config = LoraConfig(
task_type=TaskType.CAUSAL_LM,
r=16,
lora_alpha=32, # scaling factor — usually 2x r
target_modules=["q_proj", "v_proj"],
lora_dropout=0.05,
bias="none",
)
peft_model = get_peft_model(base_model, config)
peft_model.print_trainable_parameters()
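To sanity-check the number that print_trainable_parameters reports, the arithmetic is simple. A back-of-envelope sketch, assuming a Llama-2-7B-style model with square q/v projections (dimensions are illustrative):
# LoRA parameter count for the config above (r=16, q_proj and v_proj only)
d_model = 4096           # hidden size of a ~7B Llama-style model
n_layers = 32
r = 16
modules_per_layer = 2    # q_proj and v_proj
# Each adapted matrix contributes A (d x r) plus B (r x d) parameters
lora_params = n_layers * modules_per_layer * 2 * d_model * r
print(f"{lora_params:,} trainable params")          # 8,388,608
print(f"fraction of 7B = {lora_params / 7e9:.2%}")  # ~0.12%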
# Convert to GGUF for local CPU deployment (llama.cpp)
# 1. Save merged model: merged.save_pretrained("merged")
# 2. Convert: python llama.cpp/convert_hf_to_gguf.py merged --outfile model.gguf
# 3. Quantise: ./llama.cpp/quantize model.gguf model.q4_k_m.gguf Q4_K_M
# 4. Run: ./llama.cpp/main -m model.q4_k_m.gguf -p "What is DPDK?"
# Quantisation comparison (approximate sizes for a 7-8B model; a 3B model is roughly half):
# Q8_0: ~8GB, highest quality local
# Q4_K_M: ~4GB, good quality/size balance — recommended default
# Q3_K_M: ~3GB, noticeable quality drop
# Q2_K: ~2GB, significant quality loss, emergency only
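To verify the quantised file outside the llama.cpp CLI, llama-cpp-python can load it directly — a minimal sketch, assuming pip install llama-cpp-python and the model.q4_k_m.gguf file produced above:
from llama_cpp import Llama

# Load the quantised GGUF and run a quick CPU-only sanity check
llm = Llama(model_path="model.q4_k_m.gguf", n_ctx=2048)
out = llm("What is DPDK?", max_tokens=200)
print(out["choices"][0]["text"])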
vLLM — Production Model Serving
Serving
pip install vllm
# Start server (OpenAI-compatible API)
# vllm serve your-username/domain-llama --port 8000 --max-model-len 4096
# Python client — same as OpenAI API
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")
response = client.chat.completions.create(
model="your-username/domain-llama",
messages=[{"role": "user", "content": "How does rte_mempool_create work?"}],
max_tokens=512
)
print(response.choices[0].message.content)
# vLLM advantages over transformers.generate():
# - PagedAttention: manages KV cache in pages, 3-5x higher throughput
# - Continuous batching: batches requests dynamically, no padding waste
# - CUDA kernels: fused attention, faster than naive PyTorch
# - OpenAI-compatible: drop-in replacement for OpenAI API calls
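For offline batch evaluation you can skip the HTTP server and use vLLM's Python API, which applies the same continuous batching — a minimal sketch (the model ID is a placeholder):
from vllm import LLM, SamplingParams

# All prompts are scheduled together via continuous batching
llm = LLM(model="your-username/domain-llama", max_model_len=4096)
params = SamplingParams(temperature=0.2, max_tokens=256)
outputs = llm.generate(["What is DPDK?", "How does rte_ring work?"], params)
for out in outputs:
    print(out.outputs[0].text)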
# Benchmark your fine-tuned model vs base model vs Claude-3-Haiku
import anthropic

def eval_model(questions: list[str], answers: list[str]) -> dict:
    """Compare fine-tuned vs base vs Claude on domain Q&A."""
    results = {"fine_tuned": [], "base": [], "claude": []}
    for q, expected in zip(questions, answers):
        # Fine-tuned (via vLLM)
        ft_resp = client.chat.completions.create(
            model="fine-tuned", messages=[{"role": "user", "content": q}], max_tokens=256
        )
        # Claude as baseline
        cl_resp = anthropic.Anthropic().messages.create(
            model="claude-3-haiku-20240307", max_tokens=256,
            messages=[{"role": "user", "content": q}]
        )
        # Base model: send the same request to a second vLLM server running the
        # un-tuned checkpoint and append its score to results["base"]
        # Score both with an LLM judge (a judge_answer sketch follows below)
        ft_score = judge_answer(q, expected, ft_resp.choices[0].message.content)
        cl_score = judge_answer(q, expected, cl_resp.content[0].text)
        results["fine_tuned"].append(ft_score)
        results["claude"].append(cl_score)
    # Average only the slots that were actually filled
    return {k: sum(v) / len(v) for k, v in results.items() if v}
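The judge_answer helper is left to you. A minimal sketch, assuming Claude-3-Haiku as the judge and a 0-10 accuracy scale (both choices are illustrative):
import anthropic

def judge_answer(question: str, expected: str, actual: str) -> float:
    """Hypothetical LLM judge: score a candidate answer 0-10 against a reference."""
    judge_prompt = (
        f"Question: {question}\n"
        f"Reference answer: {expected}\n"
        f"Candidate answer: {actual}\n"
        "Rate the candidate's factual accuracy against the reference from 0 to 10. "
        "Reply with the number only."
    )
    resp = anthropic.Anthropic().messages.create(
        model="claude-3-haiku-20240307", max_tokens=5,
        messages=[{"role": "user", "content": judge_prompt}],
    )
    return float(resp.content[0].text.strip())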
Capstone Project
Fine-tune a 3B parameter open-source model on a domain-specific dataset from your professional area (DPDK documentation, network engineering, telecom). Evaluate it rigorously and serve it in production.
Requirements
- Build a dataset of 200+ Q&A pairs from your domain
- Fine-tune using Unsloth QLoRA on Google Colab T4 (free) or local GPU
- Serve the merged model with vLLM on a low-cost cloud GPU instance
- Evaluation report on 50 domain questions: fine-tuned vs base model vs Claude-3-Haiku
- Metrics: accuracy (LLM-judged), latency, cost per query — see the timing sketch below
- GGUF version for local deployment — verify it runs on CPU
Push dataset, training code, evaluation results, and model to HuggingFace Hub.
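For the latency and cost-per-query numbers, a simple timing harness against the vLLM endpoint is enough. A sketch (the model name and cost figures are placeholders to fill in):
import time

def timed_query(q: str) -> tuple[str, float]:
    """Return the model's answer and wall-clock latency in seconds."""
    start = time.perf_counter()
    resp = client.chat.completions.create(
        model="fine-tuned", messages=[{"role": "user", "content": q}], max_tokens=256
    )
    return resp.choices[0].message.content, time.perf_counter() - start

# Cost per query:
#   API model:          input_tokens * input_price + output_tokens * output_price
#   Self-hosted model:  instance hourly cost / queries served per hour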
MASTERY CHECKLIST
- Can load any HuggingFace model and run inference with the pipeline API
- Can build a formatted instruction fine-tuning dataset and push to HuggingFace Hub
- Can fine-tune a 3B model with Unsloth QLoRA in 4-bit on a single GPU
- Can explain LoRA: low-rank matrices A and B trained, roughly 0.1-1% of parameters
- Can merge LoRA adapters into base model: merge_and_unload() then save
- Can convert a merged model to GGUF and quantise to Q4_K_M
- Can serve a fine-tuned model with vLLM: OpenAI-compatible API on port 8000
- Can build an eval harness comparing fine-tuned vs base model vs Claude on 50 questions
- Capstone: evaluation report showing improvement on domain questions published to GitHub
When complete: move to Part 9 — Portfolio and Launch.