¹University of Science and Technology of China  ²Shanghai Innovation Institute  ³ByteDance
⁴Wuhan University  ⁵Southeast University
Reinforcement Learning with Verifiable Rewards (RLVR) has recently emerged as a promising framework for improving reasoning abilities in Large Language Models (LLMs). However, policies optimized with binary verification are prone to overlooking potentially valuable exploration in reasoning trajectories. Given the heavy annotation cost of golden Process Reward Models (PRMs), recent works attempt to use auxiliary signals, such as entropy and likelihood collected from the logit space, for reward shaping of process tokens. In this work, we offer a novel perspective on shaping RLVR with flow rewards derived from the latent space, and propose RLFR, where flow fields of model latents are constructed from off-policy high-quality data and on-policy rejection-sampling data, and the velocity deviations of policy latents within them are quantified to serve as a reward signal. RLFR first demonstrates that a well-established flow field can be a sound environment for reward signal collection, highlighting that the expressive latent space remains largely underexplored. Moreover, RLFR can compress arbitrary off-policy expert data into a reference for constituting reward signals, and we show that the flow reward favors practical execution tokens over connective tokens. Experiments on both language and multimodal reasoning benchmarks demonstrate the reliability of flow rewards, suggesting a promising paradigm for reward shaping with auxiliary signals.
(a) Policies optimized with RLVR are resistant to reward hacking, but prone to overlooking potentially valuable exploration in reasoning trajectories. (b) Auxiliary signals such as entropy and likelihood collected from the logit space are used for reward shaping of process tokens, where the risk of self-policy rewarding is non-negligible. (c) RLFR shows that a well-established flow field can be a sound environment for reward utilization.
Our approach offers a novel perspective on shaping RLVR with flow rewards derived from the latent space, thereby extending RLVR with latent reward utilization and highlighting the highly expressive yet much underexplored latent space together with the sound flow environment.
RLFR constructs flow fields of policy latents from off-policy high-quality data and on-policy rejection-sampling data, within which the velocity deviations of policy latents are quantified to serve as a reward signal.
Instead of using the predicted velocity to reverse the forward process for distribution generation, the accuracy of velocity prediction serves as a sensible metric for evaluating whether current samples lie within the data distribution formed by the flow. We further provide a timestep debiasing approach, showing that larger timesteps with less noise are favorable for velocity evaluation in flow rewards.
We use advantage shaping to keep the method flexible across different RLVR algorithms. Noisy fluctuations of flow rewards are discarded and only substantial deviations are preserved. The flow field is updated online with rejection-sampling data throughout policy optimization, where the selection metrics remain manageable for directing which data constitute the flow reference used in reward calculation.
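A minimal sketch of this online update, assuming the flow network is a small velocity-prediction head over token hidden states and that `latents` are hidden states of verified-correct (rejection-sampled) rollouts; `FlowHead` and `flow_matching_update` are illustrative names, and conditioning on subsequent-token latents is omitted for brevity.

```python
import torch
import torch.nn as nn

class FlowHead(nn.Module):
    """Illustrative velocity-prediction head over token latents (hypothetical, not the released code)."""
    def __init__(self, dim: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim + 1, 4 * dim), nn.SiLU(), nn.Linear(4 * dim, dim))

    def forward(self, a_t: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
        # a_t: (N, dim) noised latents; t: (N, 1) timestep condition.
        return self.net(torch.cat([a_t, t], dim=-1))

def flow_matching_update(flow: FlowHead, opt: torch.optim.Optimizer, latents: torch.Tensor) -> float:
    """One flow-matching step on latents of verified-correct (rejection-sampled) rollouts.

    latents: (N, dim) hidden states treated as data samples.
    """
    eps = torch.randn_like(latents)                        # noise endpoint
    t = torch.rand(latents.size(0), 1, device=latents.device)
    a_t = t * latents + (1.0 - t) * eps                    # linear interpolation between noise and data
    target_v = latents - eps                               # velocity of the linear path
    loss = (flow(a_t, t) - target_v).pow(2).mean()         # conditional flow-matching loss
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```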
The flow rewards are derived from velocity deviations of policy latents under the reference flow field.
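A plausible instantiation, assuming a standard rectified-flow (linear-interpolation) path and an $\ell_2$ velocity deviation as the reward signal; the exact formulation in the paper may differ:

$$r^{\mathbf{v}_\phi}_{k} = -\left\| \mathbf{v}_{\phi}\!\left(\dot{\mathbf{a}}_{k,t},\, t\right) - \left(\dot{\mathbf{a}}_{k} - \epsilon\right) \right\|_2^2, \qquad \dot{\mathbf{a}}_{k,t} = t\,\dot{\mathbf{a}}_{k} + (1 - t)\,\epsilon,$$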
where $\dot{\mathbf{a}}_k$ denotes the latents of token $\mathbf{a}_k$, and $\dot{\mathbf{a}}_{k,t}$ is the linear interpolation between $\dot{\mathbf{a}}_{k}$ and noise $\epsilon$. The flow network $\mathbf{v}_{\phi}$ is first pre-trained on off-policy high-quality data to establish the reference distribution for an offline start, and is then updated online alongside the policy.
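A minimal sketch of turning this velocity deviation into a per-token reward, reusing the hypothetical `FlowHead` above and restricting evaluation to larger (less noisy) timesteps in the spirit of the timestep debiasing; the timestep values and averaging choices are illustrative.

```python
import torch

@torch.no_grad()
def flow_reward(flow, latents: torch.Tensor, timesteps=(0.7, 0.8, 0.9)) -> torch.Tensor:
    """Per-token flow reward: negative velocity-prediction error of policy latents.

    latents: (seq_len, dim) policy latents for one rollout.
    Returns: (seq_len,) rewards; higher when latents sit inside the reference flow field.
    """
    rewards = torch.zeros(latents.size(0), device=latents.device)
    for t_val in timesteps:                                # timestep debiasing: larger t, less noise
        eps = torch.randn_like(latents)
        t = torch.full((latents.size(0), 1), t_val, device=latents.device)
        a_t = t * latents + (1.0 - t) * eps                # interpolate close to the data endpoint
        target_v = latents - eps
        err = (flow(a_t, t) - target_v).pow(2).mean(dim=-1)
        rewards += -err                                    # smaller deviation -> larger reward
    return rewards / len(timesteps)
```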
We shape the advantage term $\hat{A}_{o}$ of RLVR for each token $\mathbf{a}_{k}$ by accumulating the post-processed flow rewards $r^{\mathbf{v}_\phi}_{k}$.
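One plausible form consistent with the description below, where the scaling coefficient $\alpha$ and the per-timestep, per-layer subscripts $(t, l)$ are introduced here for illustration; the paper's exact formulation may differ:

$$\hat{A}_{k} = \hat{A}_{o} + \alpha \cdot \operatorname{minmax\text{-}norm}\!\left( \frac{1}{|\mathcal{T}|\,|\mathcal{L}|} \sum_{t \in \mathcal{T}} \sum_{l \in \mathcal{L}} \mathbb{I}\!\left[\, \lvert r^{\mathbf{v}_\phi}_{k,t,l} \rvert > \eta \,\right] r^{\mathbf{v}_\phi}_{k,t,l} \right),$$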
where $\mathbb{I}[\cdot]$ is the indicator function that discards noisy fluctuations in flow rewards and preserves only substantial deviations above $\eta$. The $\mathrm{minmax}$-$\mathrm{norm}$ is performed within the sequence to regularize values to $[-1,1]$. $\mathcal{T}$ and $\mathcal{L}$ are the collections of timesteps and layers used to calculate the velocity deviations, and in practice we condition on $\hat{\mathbf{a}}_{k+1}$ to establish context dependence for the flow rewards.
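A minimal sketch of this post-processing for one sequence, assuming the per-token flow rewards are already averaged over $\mathcal{T}$ and $\mathcal{L}$; the threshold `eta` and the signed min-max normalization to $[-1, 1]$ follow the description above, while the remaining names are illustrative.

```python
import torch

def shape_advantages(adv: torch.Tensor, flow_rewards: torch.Tensor, eta: float = 0.1) -> torch.Tensor:
    """Shape RLVR advantages with post-processed flow rewards for one sequence.

    adv:          (seq_len,) verifiable-reward advantages (hat{A}_o broadcast over tokens).
    flow_rewards: (seq_len,) per-token flow rewards, averaged over timesteps and layers.
    """
    # Keep only substantial deviations; zero out noisy fluctuations below eta.
    kept = torch.where(flow_rewards.abs() > eta, flow_rewards, torch.zeros_like(flow_rewards))
    # Min-max normalize within the sequence, then map to [-1, 1].
    lo, hi = kept.min(), kept.max()
    if (hi - lo) < 1e-8:
        return adv                                          # no substantial deviations in this sequence
    normed = 2.0 * (kept - lo) / (hi - lo) - 1.0
    return adv + normed
```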
@article{zhang2025rlfr,
title={RLFR: Extending Reinforcement Learning for LLMs with Flow Environment},
author={Zhang, Jinghao and Zheng, Naishan and Li, Ruilin and Cheng, Dongzhou and Liang, Zheming and Zhao, Feng and Wang, Jiaqi},
journal={arXiv preprint arXiv:2510.10201},
year={2025}
}