Beyond the Wrapper: Building High-Throughput Reasoning Agents with Async Kernels

Hi everyone :waving_hand:

I wanted to share a recent deep dive I wrote on agent systems and RL-based reasoning, based on implementing ReTool-style rollouts myself and then studying how more mature stacks (e.g. AI2’s DR-Tulu) handle the same problems.

The short version:

Most mainstream “agent frameworks” implicitly assume synchronous execution. That assumption breaks down once you introduce any of the following (a toy sketch of the failure mode follows the list):

  • interleaved tool use

  • delayed rewards

  • GRPO-style multi-sample rollouts

  • long reasoning traces
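
To make the blocking failure mode concrete, here's a toy synchronous loop (all names are stand-ins, not any particular framework's API). Every tool round-trip stalls decoding, and with N rollouts the waits serialize:

```python
import time

def run_tool(code: str) -> str:
    """Stand-in for a sandboxed interpreter round-trip."""
    time.sleep(0.5)  # network + sandbox latency
    return "42"

def decode(history: str) -> str:
    """Stand-in for one GPU decode call."""
    return "<tool>1 + 41</tool>" if "<result>" not in history else "Answer: 42"

def sync_rollout(prompt: str, max_turns: int = 4) -> str:
    history = prompt
    for _ in range(max_turns):
        out = decode(history)                # GPU busy
        if "<tool>" not in out:
            return history + out
        code = out.split("<tool>")[1].split("</tool>")[0]
        result = run_tool(code)              # GPU idle: decoding blocked on I/O
        history += out + f"<result>{result}</result>"
    return history

# N rollouts pay N * (tool latency) of extra wall-clock time, serialized.
print(sync_rollout("Compute 1 + 41. "))
```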

In the post, I walk through:

  • a manual KV-cache + interpreter-masking implementation, and why it inevitably hits memory / blocking limits (masking sketch after this list)

  • why userland generation loops can’t saturate GPUs

  • how async reactor patterns (used in modern RL training stacks) decouple tool I/O from decoding (toy sketch after this list)

  • why scheduling, not prompting, becomes the dominant concern at scale
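
On the interpreter-masking point, here's the core trick in miniature (a hypothetical helper, not the post's exact code): tool output stays in the context, because it conditions later reasoning, but it is excluded from the loss, since the model never produced those tokens:

```python
import torch

IGNORE_INDEX = -100  # ignored by torch.nn.CrossEntropyLoss by default

def mask_interpreter_spans(
    input_ids: torch.Tensor, tool_spans: list[tuple[int, int]]
) -> torch.Tensor:
    """Labels for the SFT/RL loss: interpreter output remains visible
    in the context but contributes no gradient."""
    labels = input_ids.clone()
    for start, end in tool_spans:
        labels[start:end] = IGNORE_INDEX
    return labels

ids = torch.arange(10)                        # toy token ids
print(mask_interpreter_spans(ids, [(4, 7)]))  # positions 4..6 become -100
```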
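
And the reactor pattern in miniature, with a stub engine standing in for something like vLLM's AsyncLLMEngine: a rollout awaiting tool I/O yields the event loop, so the other rollouts keep the decoder fed instead of idling with it:

```python
import asyncio

async def run_tool(code: str) -> str:
    await asyncio.sleep(0.5)   # sandbox latency, now off the decode path
    return "42"

async def decode(history: str) -> str:
    """Stub for an async engine step (e.g. vLLM's AsyncLLMEngine)."""
    await asyncio.sleep(0.01)
    return "<tool>1 + 41</tool>" if "<result>" not in history else "Answer: 42"

async def rollout(prompt: str) -> str:
    history = prompt
    while True:
        out = await decode(history)          # cooperative: others decode too
        if "<tool>" not in out:
            return history + out
        code = out.split("<tool>")[1].split("</tool>")[0]
        result = await run_tool(code)        # suspends only this rollout
        history += out + f"<result>{result}</result>"

async def main():
    prompts = [f"Task {i}: compute 1 + 41. " for i in range(64)]
    done = await asyncio.gather(*(rollout(p) for p in prompts))
    print(f"{len(done)} rollouts finished; tool waits overlapped")

asyncio.run(main())
```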

This isn’t meant as a critique of any specific framework — more an attempt to articulate where the abstraction boundary starts to hurt, and why many research systems drop below the “agent wrapper” layer entirely.

Blog post:
:link: Beyond the Wrapper: Building High-Throughput Reasoning Agents with Async Kernels

I’d love to hear thoughts from folks working on:

  • agent infra

  • vLLM / inference engines

  • RLHF / GRPO-style training

  • async environments & sandboxes

Thanks for reading!


Async kernels solve scheduling. They do not solve state.

If an agent can suspend on tool I/O, delayed rewards, or multi-sample rollouts, you need a durable state plane to resume deterministically without replaying the full history.

The minimal unit is an append-only event log plus an artifact store, with bounded retrieval for reinjection. Otherwise pause/resume becomes ad hoc prompt concatenation, which is exactly where VRAM and throughput collapse.
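
Minimal sketch of that state plane (illustrative names, not any specific library):

```python
import json
import time

class EventLog:
    """Append-only event log: resume = seek + bounded tail,
    never a full-history replay into the prompt."""

    def __init__(self, path: str):
        self.path = path

    def append(self, kind: str, payload: dict) -> None:
        record = {"ts": time.time(), "kind": kind, "payload": payload}
        with open(self.path, "a") as f:
            f.write(json.dumps(record) + "\n")   # events are never mutated

    def tail(self, n: int) -> list[dict]:
        """Bounded retrieval for reinjection; large artifacts live in the
        artifact store, the log holds only references to them."""
        with open(self.path) as f:
            return [json.loads(line) for line in f.readlines()[-n:]]
```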


@PimpCat-AU Thanks for the feedback! You are absolutely right that for long-running, durable agents (like Temporal workflows), you need an append-only event log to handle server failures.

However, the context here is high-throughput RL training (GRPO), not production service orchestration.

Re: “Throughput Collapse”: in a naive implementation, you’d be right. Concatenating history would force a re-prefill of attention over the entire prefix on every resume.
But this architecture leverages vLLM’s Automatic Prefix Caching (APC). Because the “concatenated” history matches KV blocks already resident in VRAM (via the PagedAttention block table), the engine skips prefill for everything except the newly appended tokens.

The “resume” isn’t a re-computation; it’s a pointer snap. We get the developer ergonomics of “ad hoc concatenation” with the physical performance of a continuous batch.
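
Concretely, it's just the engine flag plus naive concatenation (the model name and the `<result>` tag here are illustrative):

```python
from vllm import LLM, SamplingParams

# enable_prefix_caching=True turns on Automatic Prefix Caching: KV blocks
# whose prefix matches stay resident and are reused instead of re-prefilled.
llm = LLM(model="Qwen/Qwen2.5-7B-Instruct", enable_prefix_caching=True)
params = SamplingParams(max_tokens=256)

history = "Compute 1 + 41 with the interpreter. "
turn = llm.generate([history], params)[0].outputs[0].text

# "Resume" after the tool call: the concatenated history matches the cached
# blocks, so prefill only touches the newly appended result tokens.
history += turn + "<result>42</result>"
turn = llm.generate([history], params)[0].outputs[0].text
```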
