What an environment packages
Every environment bundles three things behind oneload_environment() entrypoint:
A dataset
The prompts your model practices on, with optional gold answers. Authored as
input/output records. See Datasets.An interaction model
How the model engages: a single prompt and response, or a multi-turn
exchange. This is the environment class you subclass.
A reward
score_response looks at an answer and returns a RewardResult score. This
score is the teacher.output is the gold completion Flash trains on, appended
after the environment’s initial episode so system prompts and tool transcripts
stay part of the example. See
Datasets for the output shapes
(a scalar answer, or a message trajectory for multi-turn and tool-use imitation).
The one thing you write
You writeenvironment.py; Flash owns the rest of the loop.
| You write | Flash runs |
|---|---|
The dataset, interaction, and reward in environment.py | The training loop: sampling prompts, generating model attempts (rollouts), applying the algorithm |
| (nothing) | Managed compute, batching, checkpointing, and auto-retry |
| (nothing) | A versioned Environments Hub and per-token serving |
Where it sits in the training loop
The environment supplies two of the loop’s steps:The model produces a rollout
The current model attempts an answer, following your environment’s
interaction model (single response, or several turns).
Single-turn and multi-turn
The interaction model is set by which base class you subclass:EnvironmentSingleTurn: prompt in, completion out, reward computed. Most tasks start here. If your prompt messages do not include asystemmessage, the SDK prepends the run’s prompt text as one.EnvironmentMultiTurn: for conversations or tool use, where the model takes several steps before the whole sequence (its trajectory) is scored. You implement the episode hooks —start_episode(opening messages),step_episode(react to each action and decide whether the episode continues),max_episode_turns(the bound), andscore_episode(reward the trajectory). See Multi-turn environments for the full loop, an action protocol, and the stateless-step pattern.
environment.py
thinking = true, response_text is the answer text by default; it also
exposes the separated reasoning trace and raw output when a reward needs them
(see Environments).
From local file to managed run
An environment is authored locally but has to be reachable by id when training runs on managed infrastructure. The lifecycle is four steps:Author it locally
Write
environment.py with a load_environment() that returns a Freesolo
environment.Publish it
flash env push --name my-env . packages the folder and uploads it to the
managed Environments Hub, which versions it and prints an id
(your-org/my-env).Reference it by id
Put that id in your config’s
[environment] id. Optional
[environment.params] are passed to load_environment(**params).Next
Build an environment
The full SDK: author a dataset and reward, then publish it.
Datasets
Package task records and data files inside an environment.
Explore the directory
Where the environment lives on disk and what Flash reads.
How Flash works
The full run loop the environment plugs into.