Scaling LLM Reinforcement Learning with Prolonged Training: ProRL v2
By Ainvest
Wednesday, August 13, 2025, 5:38 PM ET
NVIDIA Research has developed ProRL v2, an evolution of Prolonged Reinforcement Learning (ProRL), to test the effects of extended RL training on large language models (LLMs). ProRL v2 achieves state-of-the-art performance through advanced algorithms, rigorous regularization, and broad domain coverage. The technique tests whether RL scaling goes beyond exploiting knowledge the model already possesses, pushing it into genuinely new territory. Its design emphasizes extended training, stability and robustness, fully verifiable rewards, and brevity enforcement.
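To make the "fully verifiable rewards" pillar concrete, here is a minimal sketch of what a verifiable reward can look like for a math task: the reward is computed programmatically from the model's final answer rather than by a learned reward model. The extraction pattern, function name, and scoring values are illustrative assumptions, not ProRL v2's actual reward code.

```python
import re

def verifiable_math_reward(response: str, ground_truth: str) -> float:
    """Binary, programmatically checkable reward: 1.0 if the model's final
    boxed answer matches the reference answer exactly, else 0.0.
    Illustrative sketch only; real verifiers normalize equivalent forms."""
    match = re.search(r"\\boxed\{([^}]*)\}", response)
    if match is None:
        return 0.0  # no parseable final answer, no reward
    answer = match.group(1).strip()
    return 1.0 if answer == ground_truth.strip() else 0.0
```

In practice, verifiers for other domains (code, logic puzzles) use unit tests or exact-match checks instead; the point of the sketch is only the rule-based, checkable nature of the signal.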
ProRL v2 builds on proximal policy optimization with a REINFORCE++ baseline, using Clip-Higher to encourage exploration and Dynamic Sampling to reduce noise and improve learning efficiency. It adds several refinements: a scheduled cosine length penalty that rewards concise outputs, and KL-regularized trust regions whose reference policy is periodically reset to the current best checkpoint, which helps prevent overfitting and keeps long training runs stable. Together, these techniques sustain improvement over prolonged training and elicit novel solutions from LLMs.
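The sketch below shows, under stated assumptions, how these pieces might fit together in code: a PPO-style surrogate with an asymmetric (Clip-Higher) clipping range, a KL penalty toward a periodically reset reference policy, and a cosine length penalty applied to rewards. The function names, hyperparameter values, and the exact penalty schedule are hypothetical; this is not NVIDIA's implementation.

```python
import math
import torch

def prorl_v2_loss_sketch(
    logprobs,        # (T,) log pi_theta(a_t | s_t) for the sampled tokens
    old_logprobs,    # (T,) log-probs under the behavior policy that generated the rollout
    ref_logprobs,    # (T,) log-probs under the reference policy (reset to the best checkpoint periodically)
    advantages,      # (T,) REINFORCE++-style advantages (reward minus a baseline)
    eps_low=0.2,     # standard lower clip bound (hypothetical value)
    eps_high=0.28,   # Clip-Higher: a looser upper bound to encourage exploration (hypothetical value)
    kl_coef=0.01,    # weight of the KL-regularized trust region (hypothetical value)
):
    """PPO-style clipped surrogate with an asymmetric clipping range plus a
    KL penalty toward a reference policy. Illustrative sketch only."""
    ratio = torch.exp(logprobs - old_logprobs)
    # Asymmetric clipping: the upper bound (1 + eps_high) is wider than the lower
    # bound (1 - eps_low), so low-probability tokens can be upweighted more.
    clipped = torch.clamp(ratio, 1.0 - eps_low, 1.0 + eps_high)
    policy_loss = -torch.min(ratio * advantages, clipped * advantages).mean()
    # Unbiased, non-negative KL estimate of KL(pi_theta || pi_ref).
    kl = (torch.exp(ref_logprobs - logprobs) - (ref_logprobs - logprobs) - 1.0).mean()
    return policy_loss + kl_coef * kl

def cosine_length_penalty(reward, length, max_length, min_scale=0.5):
    """Cosine length penalty (sketch): a response keeps more of its reward when
    short, decaying on a cosine toward min_scale at max_length. The 'scheduled'
    aspect (turning the penalty on or tightening it over training) is omitted."""
    scale = min_scale + (1.0 - min_scale) * 0.5 * (1.0 + math.cos(math.pi * length / max_length))
    return reward * scale
```

The asymmetric range is the usual motivation for Clip-Higher: a symmetric clip caps how much rare, exploratory tokens can be reinforced, which tends to collapse entropy over long runs; loosening only the upper bound preserves exploration without destabilizing the lower side.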
The release of ProRL v2 coincides with ongoing geopolitical tensions and regulatory changes in the tech industry. Chinese authorities have recently advised firms to avoid NVIDIA's H20 chips due to security concerns and domestic semiconductor goals [1]. This regulatory environment adds complexity to the market for AI hardware, with domestic alternatives lagging in AI versatility but still sought in non-government sectors [2]. Despite these challenges, ProRL v2 represents a significant advancement in the field of reinforcement learning and AI capabilities.
References:
[1] https://www.ainvest.com/news/china-urges-firms-avoid-nvidia-h20-chips-security-concerns-push-domestic-semiconductors-2508/
[2] https://developer.nvidia.com/blog/scaling-llm-reinforcement-learning-with-prolonged-training-using-prorl-v2/

