Beyond the Wrapper: Building High-Throughput Reasoning Agents with Async Kernels

Hi everyone :waving_hand:

I wanted to share a recent deep dive I wrote on agent systems and RL-based reasoning, based on implementing ReTool-style rollouts myself and then studying how more mature stacks (e.g. AI2’s DR-Tulu) handle the same problems.

The short version:

Most mainstream “agent frameworks” implicitly assume synchronous execution. That assumption breaks down once you introduce any of the following (a toy sketch of the failure mode follows the list):

  • interleaved tool use

  • delayed rewards

  • GRPO-style multi-sample rollouts

  • long reasoning traces
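
To make the blocking failure mode concrete, here's a toy synchronous loop (all names are stand-ins, not any particular framework's API). Every tool round-trip stalls decoding, and with N rollouts the waits serialize:

```python
import time

def run_tool(code: str) -> str:
    """Stand-in for a sandboxed interpreter round-trip."""
    time.sleep(0.5)  # network + sandbox latency
    return "42"

def decode(history: str) -> str:
    """Stand-in for one GPU decode call."""
    return "<tool>1 + 41</tool>" if "<result>" not in history else "Answer: 42"

def sync_rollout(prompt: str, max_turns: int = 4) -> str:
    history = prompt
    for _ in range(max_turns):
        out = decode(history)                # GPU busy
        if "<tool>" not in out:
            return history + out
        code = out.split("<tool>")[1].split("</tool>")[0]
        result = run_tool(code)              # GPU idle: decoding blocked on I/O
        history += out + f"<result>{result}</result>"
    return history

# N rollouts pay N * (tool latency) of extra wall-clock time, serialized.
print(sync_rollout("Compute 1 + 41. "))
```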

In the post, I walk through:

  • a manual KV-cache + interpreter-masking implementation, and why it inevitably hits memory / blocking limits (masking sketch after this list)

  • why userland generation loops can’t saturate GPUs

  • how async reactor patterns (used in modern RL training stacks) decouple tool I/O from decoding (toy sketch after this list)

  • why scheduling, not prompting, becomes the dominant concern at scale
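
On the interpreter-masking point, here's the core trick in miniature (a hypothetical helper, not the post's exact code): tool output stays in the context, because it conditions later reasoning, but it is excluded from the loss, since the model never produced those tokens:

```python
import torch

IGNORE_INDEX = -100  # ignored by torch.nn.CrossEntropyLoss by default

def mask_interpreter_spans(
    input_ids: torch.Tensor, tool_spans: list[tuple[int, int]]
) -> torch.Tensor:
    """Labels for the SFT/RL loss: interpreter output remains visible
    in the context but contributes no gradient."""
    labels = input_ids.clone()
    for start, end in tool_spans:
        labels[start:end] = IGNORE_INDEX
    return labels

ids = torch.arange(10)                        # toy token ids
print(mask_interpreter_spans(ids, [(4, 7)]))  # positions 4..6 become -100
```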
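
And the reactor pattern in miniature, with a stub engine standing in for something like vLLM's AsyncLLMEngine: a rollout awaiting tool I/O yields the event loop, so the other rollouts keep the decoder fed instead of idling with it:

```python
import asyncio

async def run_tool(code: str) -> str:
    await asyncio.sleep(0.5)   # sandbox latency, now off the decode path
    return "42"

async def decode(history: str) -> str:
    """Stub for an async engine step (e.g. vLLM's AsyncLLMEngine)."""
    await asyncio.sleep(0.01)
    return "<tool>1 + 41</tool>" if "<result>" not in history else "Answer: 42"

async def rollout(prompt: str) -> str:
    history = prompt
    while True:
        out = await decode(history)          # cooperative: others decode too
        if "<tool>" not in out:
            return history + out
        code = out.split("<tool>")[1].split("</tool>")[0]
        result = await run_tool(code)        # suspends only this rollout
        history += out + f"<result>{result}</result>"

async def main():
    prompts = [f"Task {i}: compute 1 + 41. " for i in range(64)]
    done = await asyncio.gather(*(rollout(p) for p in prompts))
    print(f"{len(done)} rollouts finished; tool waits overlapped")

asyncio.run(main())
```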

This isn’t meant as a critique of any specific framework — more an attempt to articulate where the abstraction boundary starts to hurt, and why many research systems drop below the “agent wrapper” layer entirely.

Blog post:
:link: Beyond the Wrapper: Building High-Throughput Reasoning Agents with Async Kernels

I’d love to hear thoughts from folks working on:

  • agent infra

  • vLLM / inference engines

  • RLHF / GRPO-style training

  • async environments & sandboxes

Thanks for reading!


Async kernels solve scheduling. They do not solve state.

If an agent can suspend on tool I/O, delayed rewards, or multi-sample rollouts, you need a durable state plane to resume deterministically without replaying the full history.

The minimal unit is an append-only event log plus an artifact store, with bounded retrieval for reinjection. Otherwise pause/resume becomes ad hoc prompt concatenation, which is exactly where VRAM and throughput collapse.
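
Minimal sketch of that state plane (illustrative names, not any specific library):

```python
import json
import time

class EventLog:
    """Append-only event log: resume = seek + bounded tail,
    never a full-history replay into the prompt."""

    def __init__(self, path: str):
        self.path = path

    def append(self, kind: str, payload: dict) -> None:
        record = {"ts": time.time(), "kind": kind, "payload": payload}
        with open(self.path, "a") as f:
            f.write(json.dumps(record) + "\n")   # events are never mutated

    def tail(self, n: int) -> list[dict]:
        """Bounded retrieval for reinjection; large artifacts live in the
        artifact store, the log holds only references to them."""
        with open(self.path) as f:
            return [json.loads(line) for line in f.readlines()[-n:]]
```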


@PimpCat-AU Thanks for the feedback! You are absolutely right that for long-running, durable agents (like Temporal workflows), you need an append-only event log to handle server failures.

However, the context here is high-throughput RL training (GRPO), not production service orchestration.

Re: “Throughput Collapse”: in a naive implementation, you’d be right. Concatenating history would force a re-prefill of attention over the entire prefix on every resume.
But this architecture leverages vLLM’s Automatic Prefix Caching (APC). Because the “concatenated” history matches KV blocks already resident in VRAM (via the PagedAttention block table), the engine skips prefill for everything except the newly appended tokens.

The “resume” isn’t a re-computation; it’s a pointer snap. We get the developer ergonomics of “ad hoc concatenation” with the physical performance of a continuous batch.
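
Concretely, it's just the engine flag plus naive concatenation (the model name and the `<result>` tag here are illustrative):

```python
from vllm import LLM, SamplingParams

# enable_prefix_caching=True turns on Automatic Prefix Caching: KV blocks
# whose prefix matches stay resident and are reused instead of re-prefilled.
llm = LLM(model="Qwen/Qwen2.5-7B-Instruct", enable_prefix_caching=True)
params = SamplingParams(max_tokens=256)

history = "Compute 1 + 41 with the interpreter. "
turn = llm.generate([history], params)[0].outputs[0].text

# "Resume" after the tool call: the concatenated history matches the cached
# blocks, so prefill only touches the newly appended result tokens.
history += turn + "<result>42</result>"
turn = llm.generate([history], params)[0].outputs[0].text
```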
