¹University of Science and Technology of China  ²Shanghai Innovation Institute  ³ByteDance
⁴Wuhan University  ⁵Southeast University
Reinforcement Learning with Verifiable Rewards (RLVR) has recently emerged as a promising framework for improving reasoning abilities in Large Language Models (LLMs). However, policies optimized with binary verification are prone to overlooking potentially valuable exploration in reasoning trajectories. Given the heavy annotation cost of golden Process Reward Models (PRMs), recent works attempt to use auxiliary signals, such as entropy and likelihood collected from the logit space, for reward shaping of process tokens. In this work, we offer a novel perspective on shaping RLVR with flow rewards derived from the latent space, and propose RLFR, where flow fields of model latents are constructed from off-policy high-quality data and on-policy rejection-sampling data, and the velocity deviations of policy latents within them are quantified to serve as a reward signal. RLFR first demonstrates that a well-established flow field can be a sound environment for reward signal collection, highlighting that the expressive latent space remains largely underexplored. Moreover, RLFR can compress arbitrary off-policy expert data into a reference for constituting reward signals, and we show that the flow reward favors practical execution tokens over connective tokens. Experiments on both language and multimodal reasoning benchmarks demonstrate the reliability of flow rewards, suggesting a promising paradigm for reward shaping with auxiliary signals.
(a) Policies optimized with RLVR are resistant to reward hacking, but prone to overlooking potentially valuable exploration in reasoning trajectories. (b) Auxiliary signals such as entropy and likelihood collected from the logit space are used for reward shaping of process tokens, where the risk of self-policy rewarding is non-negligible. (c) RLFR shows that a well-established flow field can be a sound environment for reward utilization.
Our approach offers a novel perspective on shaping RLVR with flow rewards derived from the latent space, thereby extending RLVR with latent reward utilization and highlighting the highly expressive yet much underexplored latent space together with the sound flow environment.
RLFR constructs flow fields of policy latents from off-policy high-quality data and on-policy rejection-sampling data, within which the velocity deviations of policy latents are quantified to serve as a reward signal.
Instead of using the predicted velocity to reverse the forward process for distribution generation, the accuracy of velocity prediction serves as a sensible metric for evaluating whether current samples lie within the data distribution formed by the flow. We further provide a timestep debiasing approach, showing that larger timesteps with less noise are favorable for velocity evaluation in flow rewards.
We use advantage shaping to keep the method flexible across different RLVR algorithms. Noisy fluctuations of flow rewards are discarded and only substantial deviations are preserved. The flow field is updated online with rejection-sampling data throughout policy optimization, where the selection metrics remain manageable for directing which data constitute the flow reference used in reward calculation.
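A minimal sketch of this online update, assuming the flow network is a small velocity-prediction head over token hidden states and that `latents` are hidden states of verified-correct (rejection-sampled) rollouts; `FlowHead` and `flow_matching_update` are illustrative names, and conditioning on subsequent-token latents is omitted for brevity.

```python
import torch
import torch.nn as nn

class FlowHead(nn.Module):
    """Illustrative velocity-prediction head over token latents (hypothetical, not the released code)."""
    def __init__(self, dim: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim + 1, 4 * dim), nn.SiLU(), nn.Linear(4 * dim, dim))

    def forward(self, a_t: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
        # a_t: (N, dim) noised latents; t: (N, 1) timestep condition.
        return self.net(torch.cat([a_t, t], dim=-1))

def flow_matching_update(flow: FlowHead, opt: torch.optim.Optimizer, latents: torch.Tensor) -> float:
    """One flow-matching step on latents of verified-correct (rejection-sampled) rollouts.

    latents: (N, dim) hidden states treated as data samples.
    """
    eps = torch.randn_like(latents)                        # noise endpoint
    t = torch.rand(latents.size(0), 1, device=latents.device)
    a_t = t * latents + (1.0 - t) * eps                    # linear interpolation between noise and data
    target_v = latents - eps                               # velocity of the linear path
    loss = (flow(a_t, t) - target_v).pow(2).mean()         # conditional flow-matching loss
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```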
The flow rewards are derived from velocity deviations of policy latents under the reference flow field.
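A plausible instantiation, assuming a standard rectified-flow (linear-interpolation) path and an $\ell_2$ velocity deviation as the reward signal; the exact formulation in the paper may differ:

$$r^{\mathbf{v}_\phi}_{k} = -\left\| \mathbf{v}_{\phi}\!\left(\dot{\mathbf{a}}_{k,t},\, t\right) - \left(\dot{\mathbf{a}}_{k} - \epsilon\right) \right\|_2^2, \qquad \dot{\mathbf{a}}_{k,t} = t\,\dot{\mathbf{a}}_{k} + (1 - t)\,\epsilon,$$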
where $\dot{\mathbf{a}}_k$ denotes the latents of token $\mathbf{a}_k$, and $\dot{\mathbf{a}}_{k,t}$ is the linear interpolation between $\dot{\mathbf{a}}_{k}$ and noise $\epsilon$. The flow network $\mathbf{v}_{\phi}$ is first pre-trained on off-policy high-quality data to establish the reference distribution for an offline start, and is then updated online alongside the policy.
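A minimal sketch of turning this velocity deviation into a per-token reward, reusing the hypothetical `FlowHead` above and restricting evaluation to larger (less noisy) timesteps in the spirit of the timestep debiasing; the timestep values and averaging choices are illustrative.

```python
import torch

@torch.no_grad()
def flow_reward(flow, latents: torch.Tensor, timesteps=(0.7, 0.8, 0.9)) -> torch.Tensor:
    """Per-token flow reward: negative velocity-prediction error of policy latents.

    latents: (seq_len, dim) policy latents for one rollout.
    Returns: (seq_len,) rewards; higher when latents sit inside the reference flow field.
    """
    rewards = torch.zeros(latents.size(0), device=latents.device)
    for t_val in timesteps:                                # timestep debiasing: larger t, less noise
        eps = torch.randn_like(latents)
        t = torch.full((latents.size(0), 1), t_val, device=latents.device)
        a_t = t * latents + (1.0 - t) * eps                # interpolate close to the data endpoint
        target_v = latents - eps
        err = (flow(a_t, t) - target_v).pow(2).mean(dim=-1)
        rewards += -err                                    # smaller deviation -> larger reward
    return rewards / len(timesteps)
```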
We shape the advantage term $\hat{A}_{o}$ of RLVR for each token $\mathbf{a}_{k}$ by accumulating the post-processed flow rewards $r^{\mathbf{v}_\phi}_{k}$.
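One plausible form consistent with the description below, where the scaling coefficient $\alpha$ and the per-timestep, per-layer subscripts $(t, l)$ are introduced here for illustration; the paper's exact formulation may differ:

$$\hat{A}_{k} = \hat{A}_{o} + \alpha \cdot \operatorname{minmax\text{-}norm}\!\left( \frac{1}{|\mathcal{T}|\,|\mathcal{L}|} \sum_{t \in \mathcal{T}} \sum_{l \in \mathcal{L}} \mathbb{I}\!\left[\, \lvert r^{\mathbf{v}_\phi}_{k,t,l} \rvert > \eta \,\right] r^{\mathbf{v}_\phi}_{k,t,l} \right),$$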
where $\mathbb{I}[\cdot]$ is the indicator function that discards noisy fluctuations in flow rewards and preserves only substantial deviations above $\eta$. The $\mathrm{minmax}$-$\mathrm{norm}$ is performed within the sequence to regularize values to $[-1,1]$. $\mathcal{T}$ and $\mathcal{L}$ are the collections of timesteps and layers used to calculate the velocity deviations, and in practice we condition on $\hat{\mathbf{a}}_{k+1}$ to establish context dependence for the flow rewards.
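A minimal sketch of this post-processing for one sequence, assuming the per-token flow rewards are already averaged over $\mathcal{T}$ and $\mathcal{L}$; the threshold `eta` and the signed min-max normalization to $[-1, 1]$ follow the description above, while the remaining names are illustrative.

```python
import torch

def shape_advantages(adv: torch.Tensor, flow_rewards: torch.Tensor, eta: float = 0.1) -> torch.Tensor:
    """Shape RLVR advantages with post-processed flow rewards for one sequence.

    adv:          (seq_len,) verifiable-reward advantages (hat{A}_o broadcast over tokens).
    flow_rewards: (seq_len,) per-token flow rewards, averaged over timesteps and layers.
    """
    # Keep only substantial deviations; zero out noisy fluctuations below eta.
    kept = torch.where(flow_rewards.abs() > eta, flow_rewards, torch.zeros_like(flow_rewards))
    # Min-max normalize within the sequence, then map to [-1, 1].
    lo, hi = kept.min(), kept.max()
    if (hi - lo) < 1e-8:
        return adv                                          # no substantial deviations in this sequence
    normed = 2.0 * (kept - lo) / (hi - lo) - 1.0
    return adv + normed
```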
@article{zhang2025rlfr,
title={RLFR: Extending Reinforcement Learning for LLMs with Flow Environment},
author={Zhang, Jinghao and Zheng, Naishan and Li, Ruilin and Cheng, Dongzhou and Liang, Zheming and Zhao, Feng and Wang, Jiaqi},
journal={arXiv preprint arXiv:2510.10201},
year={2025}
}