13:38 · Apr 4, 2025 · Fri
DeepSeek's latest paper. It looks like they are using a generative model to do reinforcement learning, which feels a bit like trying to lift yourself off the ground by stepping on your own foot. https://arxiv.org/abs/2504.02495
arXiv.org: Inference-Time Scaling for Generalist Reward Modeling
Reinforcement learning (RL) has been widely adopted in post-training for large language models (LLMs) at scale. Recently, the incentivization of reasoning capabilities in LLMs from RL indicates...