Description
Feature request
While reading Skywork's blog post on their OR1 model, I came across an interesting modification: they add entropy as an extra regularization term, but with a dynamic weight that is adjusted so the entropy tends toward a specific target value.
The relevant part of the blog is here:
https://capricious-hydrogen-41c.notion.site/Skywork-Open-Reasoner-Series-1d0bc9ae823a80459b46c149e4f51680?pvs=25#1d1bc9ae823a801592a0c3891ea5328f
While I haven't tested this myself, it looks like a promising enhancement to the GRPOTrainer and hopefully wouldn't be very hard to add.
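To make the idea concrete, here is a minimal sketch of what a dynamically weighted entropy term could look like. It is not taken from the Skywork blog or from TRL: the class `AdaptiveEntropyCoef`, the function `regularized_loss`, the fixed step size, and the clamping bounds are all hypothetical, and the blog's actual update rule may differ. It only assumes that the weight is raised when the measured entropy is below the target and lowered otherwise.

```python
from dataclasses import dataclass


@dataclass
class AdaptiveEntropyCoef:
    """Hypothetical controller that nudges the entropy weight toward a target entropy."""

    target_entropy: float   # desired mean policy entropy (assumed hyperparameter)
    coef: float = 0.0       # current weight on the entropy term
    step: float = 0.005     # fixed adjustment per update (assumed)
    coef_min: float = 0.0   # lower clamp (assumed)
    coef_max: float = 1.0   # upper clamp (assumed)

    def update(self, measured_entropy: float) -> float:
        # Entropy below target: raise the weight to push entropy back up.
        # Entropy at or above target: lower the weight so the bonus fades out.
        if measured_entropy < self.target_entropy:
            self.coef += self.step
        else:
            self.coef -= self.step
        # Keep the weight inside the allowed range.
        self.coef = min(self.coef_max, max(self.coef_min, self.coef))
        return self.coef


def regularized_loss(policy_loss: float, entropy: float, coef: float) -> float:
    # The entropy term is subtracted, so higher entropy lowers the total loss.
    return policy_loss - coef * entropy
```

In a training loop, `update` would be called once per step with the batch's mean entropy, and the returned weight would be passed to the loss. Where exactly this would hook into the GRPOTrainer is left open here.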
Motivation
Just thought I'd point out an interesting enhancement from a post that is likely to be overlooked. I'm not associated with Skywork in any way and haven't tested this, but it makes sense in theory.
Your contribution
Unlikely I'll be able to help, sorry.