acshame
21:38 · April 4, 2025 · Friday
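DeepSeek's latest paper appears to use a generative model to do reinforcement learning, which feels a bit like pulling yourself up by your own bootstraps.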
https://arxiv.org/abs/2504.02495
arXiv.org
Inference-Time Scaling for Generalist Reward Modeling
Reinforcement learning (RL) has been widely adopted in post-training for large language models (LLMs) at scale. Recently, the incentivization of reasoning capabilities in LLMs from RL indicates...