Question about K2/V2 cache computation in prefill vs generation

#10
by kernelpool - opened

I'm trying to understand the caching behavior in modeling_iquestloopcoder.py and noticed a difference in how K2/V2 are computed between the prefill and generation paths:

Prefill (_forward_loop, lines 1072-1074):

hidden_states, gate_mean = decoder_layer.forward_loop2_mixed(...)
if use_cache and loop_idx == 2:
    hidden_states_normed = decoder_layer.input_layernorm(hidden_states)
    _, k2, v2 = decoder_layer.self_attn.get_qkv(hidden_states_normed, position_ids)

Here hidden_states is the layer OUTPUT (after attention + MLP).

Generation (forward_decode_loop2, line 630):

q2, k2, v2 = self.get_qkv(hidden_states, position_ids)

Here hidden_states is the layer INPUT (before attention).

Is the prefill behavior intentional, or should K2/V2 be computed from the same source in both paths? I ran tests comparing both approaches against full recomputation: INPUT-based K2 matches exactly, while OUTPUT-based K2 differs slightly. The practical impact seems minimal since the gates strongly favor global attention (~87%), but I'd like to know whether the difference is intentional or an oversight.
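
For concreteness, a minimal sketch of that kind of comparison could look like the following (assuming the layer exposes input_layernorm and self_attn.get_qkv as in the snippets above; layer_input and layer_output are placeholder names for the hidden states before and after forward_loop2_mixed, and both are normalized the same way so that only the choice of source differs):

import torch

@torch.no_grad()
def compare_k2_sources(decoder_layer, layer_input, layer_output, position_ids, atol=1e-5):
    """Compare K2/V2 derived from the layer input (decode-style source)
    against K2/V2 derived from the layer output (prefill-style source)."""
    # Decode-style source: hidden states entering the layer.
    normed_in = decoder_layer.input_layernorm(layer_input)
    _, k2_in, v2_in = decoder_layer.self_attn.get_qkv(normed_in, position_ids)

    # Prefill-style source: hidden states after attention + MLP
    # (i.e. the output of forward_loop2_mixed).
    normed_out = decoder_layer.input_layernorm(layer_output)
    _, k2_out, v2_out = decoder_layer.self_attn.get_qkv(normed_out, position_ids)

    return {
        "k2_allclose": torch.allclose(k2_in, k2_out, atol=atol),
        "k2_max_abs_diff": (k2_in - k2_out).abs().max().item(),
        "v2_max_abs_diff": (v2_in - v2_out).abs().max().item(),
    }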

IQuest org
edited 1 day ago

Sorry for the delayed reply. We have since refactored the code to improve clarity.

Great, thanks for the clarification!

kernelpool changed discussion status to closed
