Hi everyone!
I wanted to share a recent deep dive I wrote on agent systems and RL-based reasoning, based on implementing ReTool-style rollouts myself and then studying how more mature stacks (e.g. AI2’s DR-Tulu) handle the same problems.
The short version:
Most mainstream “agent frameworks” implicitly assume synchronous execution. That assumption breaks down once you introduce any of the following (a toy sketch of the failure mode follows this list):
- interleaved tool use
- delayed rewards
- GRPO-style multi-sample rollouts
- long reasoning traces
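To make the failure mode concrete, here is a minimal, purely illustrative sketch of the synchronous assumption. Nothing in it is a real framework API: `generate` and `run_tool` are fake stand-ins, with `sleep` calls standing in for decode time and tool latency.

```python
import time

# Toy stand-ins -- not a real engine or framework API.
def generate(trace: str) -> str:
    """Pretend decode step: emit a tool call first, then a final answer."""
    time.sleep(0.05)  # stands in for GPU decode time
    if "<tool_result>" in trace:
        return "final answer"
    return "<tool_call>search(...)</tool_call>"

def run_tool(call: str) -> str:
    """Pretend tool execution: slow and I/O-bound."""
    time.sleep(1.0)  # stands in for interpreter / search / sandbox latency
    return "tool output"

def sync_rollout(prompt: str) -> str:
    trace = prompt
    while True:
        chunk = generate(trace)          # decode
        trace += chunk
        if "<tool_call>" not in chunk:
            return trace                 # no tool requested -> rollout done
        # The problem: the loop (and the GPU batch slot behind it) blocks
        # for the full tool latency before decoding can resume.
        trace += "<tool_result>" + run_tool(chunk) + "</tool_result>"

if __name__ == "__main__":
    start = time.time()
    for _ in range(4):
        sync_rollout("question: ")
    # ~4 x (decode + tool latency), because nothing overlaps.
    print(f"4 sequential rollouts: {time.time() - start:.1f}s")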
In the post, I walk through:
- a manual KV-cache + interpreter-masking implementation (and why it inevitably hits memory / blocking limits)
- why userland generation loops can’t saturate GPUs
- how async reactor patterns (used in modern RL training stacks) decouple tool I/O from decoding (a minimal sketch of this follows the list)
- why scheduling, not prompting, becomes the dominant concern at scale
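For contrast, here is the same toy rollout restructured as an async, reactor-style loop. This is again a sketch under the same fake stand-ins, not any framework’s actual API: the only structural change is that tool I/O is awaited, so the event loop keeps other rollouts decoding while one waits.

```python
import asyncio
import time

# Same toy stand-ins as above, now awaitable -- still not a real engine API.
async def generate(trace: str) -> str:
    await asyncio.sleep(0.05)  # decode step; a real engine batches these
    if "<tool_result>" in trace:
        return "final answer"
    return "<tool_call>search(...)</tool_call>"

async def run_tool(call: str) -> str:
    await asyncio.sleep(1.0)   # slow tool I/O, now non-blocking
    return "tool output"

async def rollout(prompt: str) -> str:
    trace = prompt
    while True:
        chunk = await generate(trace)
        trace += chunk
        if "<tool_call>" not in chunk:
            return trace
        # Awaiting the tool yields control to the event loop: other
        # rollouts keep decoding while this one waits on I/O.
        trace += "<tool_result>" + await run_tool(chunk) + "</tool_result>"

async def main() -> None:
    start = time.time()
    await asyncio.gather(*(rollout("question: ") for _ in range(4)))
    # Wall time ~= one rollout, because tool waits overlap.
    print(f"4 concurrent rollouts: {time.time() - start:.1f}s")

if __name__ == "__main__":
    asyncio.run(main())
```

The toy numbers make the point: four concurrent rollouts finish in roughly the time of one, because tool waits overlap instead of serializing. A real stack gets the decode-side analogue from continuous batching in the inference engine (e.g. vLLM), which is exactly the layer a userland generation loop can’t reach.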
This isn’t meant as a critique of any specific framework — more an attempt to articulate where the abstraction boundary starts to hurt, and why many research systems drop below the “agent wrapper” layer entirely.
Blog post: **Beyond the Wrapper: Building High-Throughput Reasoning Agents with Async Kernels**
I’d love to hear thoughts from folks working on:
- agent infra
- vLLM / inference engines
- RLHF / GRPO-style training
- async environments & sandboxes
Thanks for reading!