
“By evaluating the student’s predictions against the next-token choices made by the teacher, we produce an on-policy reward signal that allows the student to rapidly improve the quality of its multi-token predictions,” they added.
At inference time, the system uses a confidence-adaptive (ConfAdapt) decoding strategy that dynamically determines how many tokens to emit per pass. When the model is highly confident, it outputs larger chunks. When uncertainty rises, it falls back to smaller steps, preserving accuracy while maintaining the speed gains.
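The idea behind confidence-adaptive decoding can be sketched in a few lines. The snippet below is an illustrative toy, not the paper's implementation: `propose_block` is a hypothetical stand-in for a multi-token prediction model that returns candidate tokens with per-token confidence scores, and the threshold value is an assumption for demonstration.

```python
import random

def propose_block(prefix, block_size):
    """Hypothetical stand-in for a multi-token predictor: proposes a block
    of candidate tokens, each paired with a confidence score."""
    return [(f"tok{len(prefix) + i}", random.uniform(0.5, 1.0))
            for i in range(block_size)]

def confadapt_decode(prefix, max_new_tokens=12, block_size=4, threshold=0.9):
    """Confidence-adaptive decoding sketch: accept the longest confident
    prefix of each proposed block; fall back to one token when unsure."""
    out = list(prefix)
    while len(out) - len(prefix) < max_new_tokens:
        block = propose_block(out, block_size)
        accepted = []
        for tok, conf in block:
            if conf < threshold:
                break              # uncertainty rose: stop accepting here
            accepted.append(tok)
        if not accepted:           # low confidence: emit a single token
            accepted = [block[0][0]]
        out.extend(accepted)
    return out[len(prefix):]

random.seed(0)
generated = confadapt_decode(["<bos>"])
```

With a high threshold the loop degenerates to one token per pass, matching standard autoregressive decoding; lowering it trades accuracy for larger emitted chunks, which is the speed/accuracy dial the article describes.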
In experiments on the GSM8K math reasoning benchmark, an 8B-parameter model achieved more than 3x acceleration with less than a 3% drop in accuracy. A smaller 4B-parameter model reached similar speedups, though with a larger 7% drop in accuracy. More aggressive configurations pushed acceleration to 5x, though at steeper accuracy costs.

