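DeepSeek's latest paper appears to use a generative model to do reinforcement learning; it feels a bit like pulling yourself up by your own bootstraps | acshame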

13:38 · Apr 4, 2025 · Fri

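DeepSeek's latest paper appears to use a generative model to do reinforcement learning; it feels a bit like pulling yourself up by your own bootstraps.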
https://arxiv.org/abs/2504.02495
arXiv.org

Inference-Time Scaling for Generalist Reward Modeling

Reinforcement learning (RL) has been widely adopted in post-training for large language models (LLMs) at scale. Recently, the incentivization of reasoning capabilities in LLMs from RL indicates...