RLFR: Extending Reinforcement Learning for LLMs with Flow Environment

1University of Science and Technology of China    2Shanghai Innovation Institute    3ByteDance

4Wuhan University    5Southeast University

Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) has recently emerged as a promising framework for improving reasoning abilities in Large Language Models (LLMs). However, a policy optimized with binary verification is prone to overlooking potentially valuable exploration in reasoning trajectories. In view of the heavy annotation cost of golden Process Reward Models (PRMs), recent works attempt to use auxiliary signals for reward shaping of process tokens, such as entropy and likelihood collected from the logit space. In this work, we offer a novel perspective on shaping RLVR with flow rewards derived from the latent space, and propose RLFR, where flow fields of model latents are constructed from either off-policy high-quality data or on-policy rejection-sampling data, and the velocity deviations of policy latents within them are quantified to serve as a reward signal. RLFR first demonstrates that a well-established flow field can be a sound environment for reward signal collection, highlighting that the expressive latent space remains much underexplored. Moreover, RLFR is able to compress any off-policy expert data as a reference for constituting reward signals, and we show that flow rewards prefer tokens that practically execute the question over connection tokens. Experiments on both language and multimodal reasoning benchmarks demonstrate the reliability of flow rewards, suggesting a promising paradigm for reward shaping with auxiliary signals.

Overview

[Teaser figure]

(a) A policy optimized with RLVR is resistant to reward hacking, but prone to overlooking potentially valuable explorations in reasoning trajectories. (b) Auxiliary signals are used for reward shaping of process tokens, such as entropy and likelihood collected from the logit space, where self-policy rewarding risks are non-negligible. (c) RLFR shows that a well-established flow field can be a sound environment for reward utilization.

Our approach offers a novel perspective on shaping RLVR with flow rewards derived from the latent space, and thus extends RLVR with latent reward utilization, highlighting the much-underexplored yet highly expressive latent space and the sound flow environment.

  • Flow Environment Rewarding: A well-established flow field can be a sound reward environment
  • Expert Reference: Constituting rewards with expert data collected from anywhere
  • Theoretical Analysis: Proven connection between velocity deviation and probability likelihood

Method

RLFR

RLFR constructs flow fields of policy latents from either off-policy high-quality data or on-policy rejection-sampling data, and quantifies the velocity deviations of policy latents within these fields to serve as a reward signal.

1. Flow Reward from Velocity Deviation

Instead of using the predicted velocity to reverse the forward process for distribution generation, the accuracy of velocity prediction can serve as a sensible metric to evaluate whether current samples lie within the data distribution formed by the flow. We further provide a timestep debiasing approach, suggesting that larger timesteps with less noise are favorable for velocity evaluation in flow rewards.
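As a rough illustration of the timestep debiasing (a sketch under our own naming, not the released implementation), the sampled timesteps can simply be restricted toward the low-noise end of the interpolation path; the lower bound `t_min` is a placeholder value:

```python
import torch

def sample_debiased_timesteps(batch_size: int, t_min: float = 0.5, device: str = "cpu"):
    """Debiased timesteps t ~ U[t_min, 1] instead of t ~ U[0, 1]:
    larger t means less noise mixed into the latent, so the velocity
    deviation reflects the latent itself rather than the random noise draw."""
    return t_min + (1.0 - t_min) * torch.rand(batch_size, device=device)
```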

2. Extending RLVR with Flow Reward

We use advantage shaping to make the method flexible for different RLVR algorithms. Noisy fluctuations of the flow rewards are discarded and only the substantial deviations are preserved. The flow is updated online with rejection-sampling data throughout policy optimization, where the selection metrics remain manageable for directing what constitutes the flow reference used in reward calculation.
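A minimal sketch of one online flow update on latents from rejection-sampled (verified-correct) rollouts, assuming a velocity network `velocity_net(x_t, t)`; the names and interface are our assumptions rather than the paper's code:

```python
import torch

def flow_matching_update(velocity_net, optimizer, latents):
    """One flow-matching step on token latents of rejection-sampled rollouts,
    keeping the reference flow field in sync with the evolving policy.
    latents: [batch, hidden_dim]."""
    eps = torch.randn_like(latents)                          # noise endpoint
    t = torch.rand(latents.shape[0], device=latents.device)  # uniform timesteps for training
    x_t = t[:, None] * latents + (1.0 - t[:, None]) * eps    # interpolate noise -> data
    v_target = latents - eps                                 # rectified-flow velocity target
    loss = ((velocity_net(x_t, t) - v_target) ** 2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```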

Flow Reward

The flow rewards are derived from velocity deviations of policy latents under the reference flow field:

$$\mathcal{R}_{\text{FM}}^{\phi}(\dot{\mathbf{a}}_k; t,\tau) = \| \mathbf{v}_{\phi}(\dot{\mathbf{a}}_{k,t}, t) - (\dot{\mathbf{a}}_{k,1} - \epsilon) \|^2, \quad \text{where} \; t,\tau \sim \mathcal{U}[0, 1], \;\epsilon \sim \mathcal N(0,I)$$

where $\dot{\mathbf{a}}_k$ denotes the latent of token $\mathbf{a}_k$, and $\dot{\mathbf{a}}_{k,t}$ is the linear interpolation between $\dot{\mathbf{a}}_{k}$ and the noise $\epsilon$. The flow network $\mathbf{v}_{\phi}$ is first pre-trained on off-policy high-quality data to establish the reference distribution for an offline start, and then updated online with the policy.
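A minimal sketch of the per-token reward above, averaging the squared velocity deviation over a few debiased timesteps (a single hidden layer is used here for simplicity; the `velocity_net` interface and the default values are our assumptions):

```python
import torch

@torch.no_grad()
def flow_reward(velocity_net, token_latents, n_timesteps=4, t_min=0.5):
    """Raw flow reward R_FM per token: squared deviation between the predicted
    velocity and the flow target (latent - noise), averaged over timesteps.
    token_latents: [seq_len, hidden_dim] latents of one sampled response."""
    seq_len = token_latents.shape[0]
    rewards = torch.zeros(seq_len, device=token_latents.device)
    for _ in range(n_timesteps):
        # Debiased timesteps favoring the low-noise end (see the sampler above).
        t = t_min + (1.0 - t_min) * torch.rand(seq_len, device=token_latents.device)
        eps = torch.randn_like(token_latents)
        x_t = t[:, None] * token_latents + (1.0 - t[:, None]) * eps
        v_target = token_latents - eps
        rewards += ((velocity_net(x_t, t) - v_target) ** 2).mean(dim=-1)
    return rewards / n_timesteps
```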

Extending RLVR with Flow Reward

We shape the advantage term $\hat{A}_{o}$ of RLVR for each token $\mathbf{a}_{k}$ by accumulating the post-processed flow rewards $r^{\mathbf{v}_{\phi}}_{k}$:

$$\hat{A}_{k} = \sum_{s=k}^{|\mathbf{a}|}\gamma^{s-k}\,r^{\mathbf{v}_{\phi}}_{s} + \hat{A}_{o}, \qquad r^{\mathbf{v}_{\phi}}_{k} = -\beta \cdot \hat{r}^{\mathbf{v}_{\phi}}_{k}\,\mathbb{I}\big[|\hat{r}^{\mathbf{v}_{\phi}}_{k}|>\eta\big], \qquad \text{where} \; \hat{r}^{\mathbf{v}_{\phi}}_{k}= \mathrm{minmax}\big(\{\mathcal{R}_{\text{FM}}^{\phi}(\dot{\mathbf{a}}_k);\mathcal{T},\mathcal{L}\}_{k=1}^{|\mathbf{a}|}\big),$$

where $\mathbb{I}[\cdot]$ is the indicator function that discards noisy fluctuations in the flow rewards and preserves only substantial deviations above $\eta$. The min-max normalization is performed within the sequence to regularize values to $[-1,1]$. $\mathcal{T}$ and $\mathcal{L}$ are the collections of timesteps and layers used to calculate the velocity deviations, and in practice we condition on $\hat{\mathbf{a}}_{k+1}$ to establish context dependence for the flow rewards.
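The following sketch shows how the shaping above could be assembled for one response, under placeholder values for $\beta$, $\eta$, and $\gamma$ (the paper's settings are not reproduced here):

```python
import torch

def shape_advantages(flow_rewards, outcome_adv, beta=0.05, eta=0.2, gamma=1.0):
    """Shape per-token advantages with post-processed flow rewards.
    flow_rewards: [seq_len] raw velocity deviations R_FM for one response.
    outcome_adv:  scalar outcome-level advantage A_o from the base RLVR algorithm."""
    # Min-max normalize within the sequence to [-1, 1].
    lo, hi = flow_rewards.min(), flow_rewards.max()
    r_hat = 2.0 * (flow_rewards - lo) / (hi - lo + 1e-8) - 1.0
    # Zero out noisy fluctuations (|r_hat| <= eta); penalize large deviations,
    # reward comparatively small ones.
    r = -beta * r_hat * (r_hat.abs() > eta).float()
    # Discounted suffix accumulation of flow rewards plus the outcome advantage.
    shaped = torch.zeros_like(r)
    running = torch.zeros((), device=r.device)
    for k in range(len(r) - 1, -1, -1):
        running = r[k] + gamma * running
        shaped[k] = running + outcome_adv
    return shaped
```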

Benchmark Results

Language Results

[Figure: language reasoning benchmark results]

Multimodal Results

[Figure: multimodal reasoning benchmark results]

Takeaways

  • Flow rewards prefer tokens that practically execute the question, and suppress tokens with empty content such as connection tokens
  • High entropy in the logit space yields larger velocity deviations in the latent space, attributable to ambiguous hidden states that correspond to a large set of candidate tokens
  • Flow rewards rely on efficient context dependence compressed within the hidden states, rather than individual token-level denotations, for context comprehension

Citation

@article{zhang2025rlfr,
  title={RLFR: Extending Reinforcement Learning for LLMs with Flow Environment},
  author={Zhang, Jinghao and Zheng, Naishan and Li, Ruilin and Cheng, Dongzhou and Liang, Zheming and Zhao, Feng and Wang, Jiaqi},
  journal={arXiv preprint arXiv:2510.10201},
  year={2025}
}