CoBaLI – Continuous Batching for LLM Inference

Timeframe: Aug 2025 – Nov 2025
Stack: C++17 · CUDA C · CMake · llama.cpp · Qwen2.5-0.5B-Instruct (GGUF) · NVIDIA RTX 4070 · Nsight Systems · Nsight Compute

Overview

Developed a C++/CUDA inference engine on top of llama.cpp (used as an immutable black-box backend) for Qwen2.5-0.5B-Instruct on a single RTX 4070. The engine demonstrates that significant throughput gains on commodity GPUs are possible through scheduling alone, without modifying any model kernels.

Architecture

  • Host-side engine (C++) → CUDA selection kernel → llama.cpp black-box backend
  • Manages KV-cache slots and request state entirely on the host, with only a minimal CUDA kernel for token selection
  • All model computation (attention, matmul, RoPE, quantization) left untouched in llama.cpp
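The host-side slot bookkeeping can be sketched roughly as follows. This is an illustrative, hedged sketch, not the project's actual code: the type `SlotManager` and its methods are hypothetical names, and the only assumption taken from the text is that up to 16 KV-cache slots (backend sequence IDs) are tracked on the host.

```cpp
#include <array>
#include <optional>

// Hypothetical sketch of host-side KV-cache slot bookkeeping: the engine
// tracks which of 16 fixed cache slots (backend sequence IDs) are in use,
// without touching any of llama.cpp's kernels.
constexpr int kMaxSlots = 16;  // matches the "up to 16 active sequences" limit

struct SlotManager {
    std::array<bool, kMaxSlots> in_use{};  // one flag per KV-cache slot

    // Claim a free slot for an incoming request; nullopt if all are busy.
    std::optional<int> acquire() {
        for (int i = 0; i < kMaxSlots; ++i) {
            if (!in_use[i]) { in_use[i] = true; return i; }
        }
        return std::nullopt;
    }

    // Release a slot when its request finishes (EOS or length limit),
    // making room for the next waiting request.
    void release(int slot) { in_use[slot] = false; }
};
```

Keeping this state on the host is what lets the scheduler admit and retire requests between decode steps without any GPU-side synchronization.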

Execution Modes

  1. Sequential Serving (baseline): processes one request at a time
  2. Continuous Batching: dynamically packs active requests into a single llama batch each decode step, keeping up to 16 sequences in flight
  3. Continuous Batching + Chunked Prefill: splits long prefills at configurable chunk sizes, overlapping prefill and decode for maximum GPU utilization
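The per-step packing in mode 2 can be sketched as below. This is a hedged sketch under assumptions, not the engine's real implementation: `Request`, `BatchEntry`, and `pack_decode_batch` are illustrative names, standing in for the logic that builds one flat llama.cpp-style batch per decode iteration.

```cpp
#include <cstddef>
#include <vector>

// Illustrative continuous-batching step: each iteration gathers one pending
// token from every unfinished sequence into a single flat batch, so the GPU
// runs one large decode call instead of many per-request calls.
struct Request {
    int seq_id;  // KV-cache slot / sequence ID
    int pos;     // next position in this sequence's KV cache
    bool done;   // finished (EOS or max length)
};

struct BatchEntry {
    int token;   // token ID to decode (sampled in the previous step)
    int seq_id;
    int pos;
};

// Build one decode batch from all unfinished requests; next_tokens[i] holds
// the token sampled for reqs[i] in the previous step.
std::vector<BatchEntry> pack_decode_batch(std::vector<Request>& reqs,
                                          const std::vector<int>& next_tokens) {
    std::vector<BatchEntry> batch;
    for (std::size_t i = 0; i < reqs.size(); ++i) {
        if (reqs[i].done) continue;
        batch.push_back({next_tokens[i], reqs[i].seq_id, reqs[i].pos});
        reqs[i].pos += 1;  // this token now occupies one KV-cache cell
    }
    return batch;  // handed to the backend as a single decode call
}
```

Because finished sequences are simply skipped and new ones can join at any step, the batch composition changes continuously, which is what keeps the GPU saturated between requests.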

Results

  Mode                            | Time (186 prompts) | Speedup
  --------------------------------|--------------------|--------
  Sequential                      | 154.17 s           | 1.0×
  Continuous Batching             | 32.6 s             | 4.7×
  CB + Prefill Split (chunk=256)  | 24.69 s            | 6.2×
  • Optimal prefill chunk size is 256 tokens — smaller (128) adds overhead, larger (512) lets long prefills block other requests
  • Nsight Systems profiling shows continuous mode reduces mean inter-kernel gap from 170 µs to 10 µs, densifying the launch pattern and eliminating GPU idle time
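The chunk-size trade-off above comes down to how a long prompt is sliced before being fed to the backend. A minimal sketch, with `split_prefill` as a hypothetical helper name, assuming only what the text states (fixed-size slices, 256 tokens in the tuned configuration):

```cpp
#include <algorithm>
#include <utility>
#include <vector>

// Hedged sketch of chunked prefill: a long prompt is fed to the backend in
// fixed-size slices so decode steps of other sequences can interleave
// between chunks. Smaller chunks mean more scheduling points but more
// per-call overhead; larger chunks block other requests for longer.
std::vector<std::pair<int, int>> split_prefill(int prompt_len, int chunk = 256) {
    std::vector<std::pair<int, int>> chunks;  // (start, length) per slice
    for (int start = 0; start < prompt_len; start += chunk) {
        chunks.push_back({start, std::min(chunk, prompt_len - start)});
    }
    return chunks;  // each slice is one prefill call, with decodes in between
}
```

With chunk=256, a 600-token prompt becomes three prefill calls (256, 256, 88 tokens), giving the scheduler two opportunities to run decode steps for other sequences in between.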