CoBaLI – Continuous Batching for LLM Inference
Timeframe: Aug 2025 – Nov 2025
Stack: C++17 · CUDA C · CMake · llama.cpp · Qwen2.5-0.5B-Instruct (GGUF) · NVIDIA RTX 4070 · Nsight Systems · Nsight Compute
Overview
Developed a C++/CUDA inference engine on top of llama.cpp (used as an immutable black-box backend) for Qwen2.5-0.5B-Instruct on a single RTX 4070. The engine demonstrates that significant throughput gains on commodity GPUs are achievable through scheduling alone, without modifying any model kernels.
Architecture
- Host-side engine (C++) → CUDA selection kernel → llama.cpp black-box backend
- KV-cache slot allocation and request state are managed entirely on the host; the only custom device code is a minimal CUDA selection kernel
- All model computation (attention, matmul, RoPE, quantization) left untouched in llama.cpp
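A minimal sketch of the host-side slot bookkeeping described above. All names here (`Slot`, `SlotManager`, `kMaxSlots`) are hypothetical, since the engine's actual types are not part of this summary; the point is that each active request owns one KV-cache slot (one llama.cpp sequence id), capped at 16:

```cpp
#include <array>
#include <cstdint>
#include <optional>

// Hypothetical sketch of host-side KV-cache slot bookkeeping.
// One slot per active request; a slot id doubles as the sequence id.
constexpr int kMaxSlots = 16; // matches the "up to 16 sequences" limit

struct Slot {
    bool    in_use = false;
    int32_t n_past = 0; // tokens already in the KV cache for this sequence
};

class SlotManager {
public:
    // Returns a free slot id, or std::nullopt when all slots are busy
    // (the request then waits in the admission queue).
    std::optional<int> acquire() {
        for (int i = 0; i < kMaxSlots; ++i) {
            if (!slots_[i].in_use) {
                slots_[i] = Slot{true, 0};
                return i;
            }
        }
        return std::nullopt;
    }

    // Frees the slot when its request finishes (EOS or length limit),
    // making room for the next queued request on the very next step.
    void release(int id) { slots_[id] = Slot{}; }

    Slot& at(int id) { return slots_[id]; }

private:
    std::array<Slot, kMaxSlots> slots_{};
};
```

Keeping this state on the host is what lets admission and eviction decisions happen every decode step without touching the model kernels.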
Execution Modes
- Sequential Serving (baseline): processes one request at a time
- Continuous Batching: dynamically packs active requests into a single llama.cpp batch each decode step, keeping up to 16 sequences in flight
- Continuous Batching + Chunked Prefill: splits long prefills at a configurable chunk size, overlapping prefill and decode to maximize GPU utilization
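The per-step packing policy can be sketched as below. This is an illustrative model of the scheduling idea, not the engine's actual code, and every name (`Request`, `pack_step`, `kChunk`) is hypothetical: each step, every decoding sequence contributes exactly one token, and at most one pending prefill contributes a slice of up to `kChunk` tokens, so long prompts never starve in-flight decodes:

```cpp
#include <vector>

constexpr int kChunk = 256; // prefill chunk size (the best-performing setting)

struct Request {
    int  prompt_len = 0;     // prompt tokens still waiting to be prefilled
    bool decoding   = false; // true once prefill has completed
};

// One scheduling step: returns how many tokens were packed into the batch.
// Decode tokens go first (one per active sequence); then at most one
// chunk of one pending prefill is appended.
int pack_step(std::vector<Request>& reqs) {
    int n_tokens = 0;
    for (auto& r : reqs) {
        if (r.decoding) ++n_tokens; // one new token per decoding sequence
    }
    for (auto& r : reqs) {
        if (!r.decoding && r.prompt_len > 0) {
            const int take = r.prompt_len < kChunk ? r.prompt_len : kChunk;
            r.prompt_len -= take;
            n_tokens     += take;
            if (r.prompt_len == 0) r.decoding = true; // prefill finished
            break; // only one prefill chunk per step
        }
    }
    return n_tokens;
}
```

With a 600-token prompt arriving alongside two decoding sequences, the steps pack 258, 258, 90, then 3 tokens: the prefill is spread over three steps while both decodes keep advancing every step.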
Results
| Mode | Time (186 prompts) | Speedup |
|---|---|---|
| Sequential | 154.17s | 1.0× |
| Continuous Batching | 32.6s | 4.7× |
| CB + Prefill Split (chunk=256) | 24.69s | 6.2× |
- The optimal prefill chunk size is 256 tokens: smaller chunks (128) add scheduling overhead, while larger ones (512) let long prefills block other requests
- Nsight Systems profiling shows that continuous mode reduces the mean inter-kernel gap from 170 µs to 10 µs, densifying the launch pattern and all but eliminating GPU idle time
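As a quick sanity check on the table, the speedup column follows directly from the wall-clock times (this helper is just a worked example over the numbers reported above):

```cpp
#include <cmath>

// Speedup = baseline wall-clock time / mode wall-clock time,
// rounded to one decimal place as in the results table.
double speedup(double baseline_s, double mode_s) {
    return std::round(baseline_s / mode_s * 10.0) / 10.0;
}
```

For example, 154.17 s / 32.6 s rounds to 4.7×, and 154.17 s / 24.69 s rounds to 6.2×, matching the table.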
