vllm `--reasoning-parser` option needs correction
The current model card shows:
```bash
$ vllm serve LGAI-EXAONE/K-EXAONE-236B-A23B \
    --reasoning-parser deepseek_v3 \
    --enable-auto-tool-choice \
    --tool-call-parser hermes
```
However, to handle `reasoning_content` properly, `--reasoning-parser` should be changed to `deepseek_r1`. When it is set to `deepseek_v3`, the reasoning text gets mixed into the `content` output.
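For a quick sanity check, a request like the one below (a minimal sketch, assuming the server is listening on `localhost:8000` and `jq` is available) shows where the reasoning ends up: with `deepseek_r1` the chain of thought lands in `reasoning_content`, while the current setting leaks it into `content`.

```bash
# Hypothetical smoke test against a locally served model; the prompt and port
# are placeholders, and the response fields follow vLLM's OpenAI-compatible API.
curl -s http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
          "model": "LGAI-EXAONE/K-EXAONE-236B-A23B",
          "messages": [{"role": "user", "content": "What is 17 * 24?"}]
        }' \
    | jq '.choices[0].message | {reasoning_content, content}'
```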
Additionally, since this model requires at least 8x H100 GPUs, it would be helpful to explicitly specify the `--tensor-parallel-size` parameter:
```bash
$ vllm serve LGAI-EXAONE/K-EXAONE-236B-A23B \
    --tensor-parallel-size 8 \
    --reasoning-parser deepseek_r1 \
    --enable-auto-tool-choice \
    --tool-call-parser hermes
```
Hello, @likejazz. Thank you for your interest!
The `deepseek_v3` reasoning parser was recently updated to accept the `enable_thinking` kwarg from requests, in addition to its original thinking flags.
This change can be found in this commit.
This update has already been applied to our fork of vLLM, but it may not be included if you copied the `exaone_moe` modeling files into your own branch/repository.
If you need to continue using your own branch, we recommend updating the `deepseek_v3` parser code there to use the `enable_thinking` keyword.
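For reference, a per-request toggle would look roughly like the following (a sketch only: it assumes the OpenAI-compatible endpoint and that the updated parser reads `enable_thinking` passed through `chat_template_kwargs`).

```bash
# Sketch: disable thinking for a single request. The chat_template_kwargs field
# is part of vLLM's OpenAI-compatible chat API; whether the parser honors
# enable_thinking depends on running the updated deepseek_v3 parser code.
curl -s http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
          "model": "LGAI-EXAONE/K-EXAONE-236B-A23B",
          "messages": [{"role": "user", "content": "Give a one-line summary of vLLM."}],
          "chat_template_kwargs": {"enable_thinking": false}
        }'
```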
You can use `--reasoning-parser deepseek_r1` instead, as you mentioned, but this may lead to unexpected behavior.
If the `</think>` token is not generated (e.g., due to early stopping), the entire output will be placed in `reasoning_content` rather than `content`.
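If you still go with `deepseek_r1`, a client-side fallback such as the one below (a sketch, not official guidance) can at least surface the text when that happens.

```bash
# Sketch: if content comes back null because </think> was never emitted,
# fall back to reasoning_content so the response text is not silently dropped.
curl -s http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
          "model": "LGAI-EXAONE/K-EXAONE-236B-A23B",
          "messages": [{"role": "user", "content": "Hello!"}]
        }' \
    | jq -r '.choices[0].message | .content // .reasoning_content'
```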
For this reason, we recommend using our fork to serve the K-EXAONE model, or updating the `deepseek_v3` reasoning parser code in your repository.
Moreover, as noted in the deployment section, the K-EXAONE model can be served with a 256K context on 4 x H200 GPUs.
We agree that it would be more helpful to explicitly specify the `--tensor-parallel-size 4` option.
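For example, a 4-GPU deployment command could look like the following (a sketch; the `--max-model-len` value of 262144 for the 256K context is our assumption here and should match the deployment section).

```bash
# Sketch of a 4 x H200 deployment; flag values other than the model name and
# tensor parallel size are assumptions to be checked against the model card.
$ vllm serve LGAI-EXAONE/K-EXAONE-236B-A23B \
    --tensor-parallel-size 4 \
    --max-model-len 262144 \
    --reasoning-parser deepseek_v3 \
    --enable-auto-tool-choice \
    --tool-call-parser hermes
```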
We will update the deployment section accordingly. Thank you for your feedback!