CoBaLI – Continuous Batching for LLM Inference
Timeframe: Aug 2025 – Nov 2025
Stack: C++17 · CUDA C · CMake · llama.cpp · Qwen2.5-0.5B-Instruct (GGUF) · NVIDIA RTX 4070 · Nsight Systems · Nsight Compute
Overview
Developed a C++/CUDA inference engine on top of llama.cpp (used as an immutable black-box backend) for Qwen2.5-0.5B-Instruct on a single RTX 4070. The engine demonstrates that significant throughput gains on commodity GPUs are achievable through scheduling alone, without modifying any model kernels.
Architecture
- Host-side engine (C++) → CUDA selection kernel → llama.cpp black-box backend
- KV-cache slot allocation and request state are managed entirely on the host; the only custom device code is a minimal CUDA selection kernel
- All model computation (attention, matmul, RoPE, quantization) left untouched in llama.cpp
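A minimal sketch of the host-side slot bookkeeping described above. All names here (`Slot`, `SlotManager`, `kMaxSlots`) are hypothetical, since the engine's actual types are not part of this summary; the point is that each active request owns one KV-cache slot (one llama.cpp sequence id), capped at 16:

```cpp
#include <array>
#include <cstdint>
#include <optional>

// Hypothetical sketch of host-side KV-cache slot bookkeeping.
// One slot per active request; a slot id doubles as the sequence id.
constexpr int kMaxSlots = 16; // matches the "up to 16 sequences" limit

struct Slot {
    bool    in_use = false;
    int32_t n_past = 0; // tokens already in the KV cache for this sequence
};

class SlotManager {
public:
    // Returns a free slot id, or std::nullopt when all slots are busy
    // (the request then waits in the admission queue).
    std::optional<int> acquire() {
        for (int i = 0; i < kMaxSlots; ++i) {
            if (!slots_[i].in_use) {
                slots_[i] = Slot{true, 0};
                return i;
            }
        }
        return std::nullopt;
    }

    // Frees the slot when its request finishes (EOS or length limit),
    // making room for the next queued request on the very next step.
    void release(int id) { slots_[id] = Slot{}; }

    Slot& at(int id) { return slots_[id]; }

private:
    std::array<Slot, kMaxSlots> slots_{};
};
```

Keeping this state on the host is what lets admission and eviction decisions happen every decode step without touching the model kernels.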
Execution Modes
- Sequential Serving (baseline): processes one request at a time
- Continuous Batching: dynamically packs active requests into a single llama.cpp batch each decode step, keeping up to 16 sequences in flight
- Continuous Batching + Chunked Prefill: splits long prefills at a configurable chunk size, overlapping prefill and decode to maximize GPU utilization
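The per-step packing policy can be sketched as below. This is an illustrative model of the scheduling idea, not the engine's actual code, and every name (`Request`, `pack_step`, `kChunk`) is hypothetical: each step, every decoding sequence contributes exactly one token, and at most one pending prefill contributes a slice of up to `kChunk` tokens, so long prompts never starve in-flight decodes:

```cpp
#include <vector>

constexpr int kChunk = 256; // prefill chunk size (the best-performing setting)

struct Request {
    int  prompt_len = 0;     // prompt tokens still waiting to be prefilled
    bool decoding   = false; // true once prefill has completed
};

// One scheduling step: returns how many tokens were packed into the batch.
// Decode tokens go first (one per active sequence); then at most one
// chunk of one pending prefill is appended.
int pack_step(std::vector<Request>& reqs) {
    int n_tokens = 0;
    for (auto& r : reqs) {
        if (r.decoding) ++n_tokens; // one new token per decoding sequence
    }
    for (auto& r : reqs) {
        if (!r.decoding && r.prompt_len > 0) {
            const int take = r.prompt_len < kChunk ? r.prompt_len : kChunk;
            r.prompt_len -= take;
            n_tokens     += take;
            if (r.prompt_len == 0) r.decoding = true; // prefill finished
            break; // only one prefill chunk per step
        }
    }
    return n_tokens;
}
```

With a 600-token prompt arriving alongside two decoding sequences, the steps pack 258, 258, 90, then 3 tokens: the prefill is spread over three steps while both decodes keep advancing every step.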
Results
| Mode | Time (186 prompts) | Speedup |
|---|---|---|
| Sequential | 154.17s | 1.0× |
| Continuous Batching | 32.6s | 4.7× |
| CB + Prefill Split (chunk=256) | 24.69s | 6.2× |
- The optimal prefill chunk size is 256 tokens: smaller chunks (128) add scheduling overhead, while larger ones (512) let long prefills block other requests
- Nsight Systems profiling shows that continuous mode reduces the mean inter-kernel gap from 170 µs to 10 µs, densifying the launch pattern and all but eliminating GPU idle time
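As a quick sanity check on the table, the speedup column follows directly from the wall-clock times (this helper is just a worked example over the numbers reported above):

```cpp
#include <cmath>

// Speedup = baseline wall-clock time / mode wall-clock time,
// rounded to one decimal place as in the results table.
double speedup(double baseline_s, double mode_s) {
    return std::round(baseline_s / mode_s * 10.0) / 10.0;
}
```

For example, 154.17 s / 32.6 s rounds to 4.7×, and 154.17 s / 24.69 s rounds to 6.2×, matching the table.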
