Fine-Tuning Llama 3.2–3B on IPCC Climate Reports

Timeframe: Feb 2025 – Apr 2025
Stack: PyTorch · Hugging Face (Transformers, PEFT) · Nanotron · vLLM · Milvus · LangChain · Singularity · SLURM · A100 GPUs

TL;DR

I fine‑tuned LLaMA‑3B on IPCC climate reports with aggressive memory optimizations (QLoRA + BF16 + gradient checkpointing), then scaled training across 2×A100 using data, tensor, and pipeline parallelism. Finally, I benchmarked a RAG system against the fine‑tuned model on climate‑domain queries.

Highlights

  • 4‑bit quantization + BF16 + LoRA (r=8, α=32) with gradient checkpointing enabled micro‑batch 32 on a single A100.
  • Distributed training across 2×A100 combined DP/TP/PP to halve epoch time (337 → 169 min) and cut per‑GPU memory by ~48%.
  • End‑to‑end packaging: modular Trainer + SLURM workflows for QLoRA and multi‑GPU runs; evaluated perplexity and RAG time‑per‑request.

Problem & Dataset

Climate literature updates faster than general‑purpose LLMs adapt. I curated a corpus of IPCC reports and recent climate/AI publications (past ~5 years) from PDFs, extracted/cleaned text, and split 90/10 for train/test. The goal: produce a stronger climate‑domain model and compare it to a retrieval‑augmented setup.

Approach

(1) Memory‑Efficient Fine‑Tuning (Single‑GPU)

  • QLoRA: 4‑bit quantization with LoRA adapters (r=8, α=32)
  • Mixed precision: BF16 for speed/stability
  • Gradient accumulation & checkpointing to increase effective batch size
  • Outcome: micro‑batch 32 on a single A100 while maintaining training stability

(2) Multi‑GPU Scaling (2×A100)

  • Data Parallelism (DP) to shard batches
  • Tensor Parallelism (TP) to split large weight matrices
  • Pipeline Parallelism (PP) to place layer stages on separate GPUs with micro‑batching
  • Outcome: ~2× throughput (337 → 169 min per epoch) and ~48% per‑GPU memory reduction

(3) Retrieval‑Augmented Generation (RAG)

  • Indexing: chunked documents (200–400 tokens), embeddings, vector DB (Milvus/FAISS)
  • Retrieval: top‑k ANN search (HNSW / IVF‑PQ) + optional reranking
  • Generation: combine retrieved context with the query and pass to LLM
  • Metric: average time per request and qualitative answer comparison vs fine‑tuned LLaMA‑3B

Results

Training

  • Perplexity (QLoRA single‑GPU): 7.996
  • Perplexity (multi‑GPU runs): 14.36–22.97 (early‑stoped comparative runs under heavier parallelism)
  • Epoch time: 337 → 169 minutes (2×A100, DP+TP+PP)
  • Memory: ~48% reduction per GPU from baseline

Inference (RAG)

  • Lower time‑per‑request variance on knowledge‑dense queries
  • Improved factual grounding on climate‑specific prompts with explicit citations to retrieved text

Engineering & Reproducibility

  • Trainer scripts: modular configs for PEFT/quantization/precision and parallel modes (DP/TP/PP)
  • SLURM jobs: easy submit templates for single‑GPU QLoRA and multi‑GPU distributed runs
  • Containers: Singularity images for reproducible environments on HPC
  • Logging/Eval: perplexity on held‑out 10%; latency collection for RAG vs fine‑tuned

Short Code/Config Sketch (illustrative)

# Single‑GPU QLoRA
sbatch jobs/qlora_single_a100.SBATCH   --config configs/qlora_llama3b_ipcc.yaml

# 2×A100 DP+TP+PP
sbatch jobs/dist_2x_a100.SBATCH   --config configs/distributed_llama3b_dp_tp_pp.yaml

# RAG
python rag/index.py --conf configs/rag_ipcc.yaml
python rag/serve.py --conf configs/rag_ipcc.yaml