Scaffold one
environment.py, a
tiny dataset/train.jsonl, two configs (configs/sft.toml for SFT and
configs/rl.toml for GRPO), and a TRAINING.md playbook for the coding agent you
point at the project. The starter load_environment() returns a Freesolo
EnvironmentSingleTurn with a sample dataset and reward.
environment.py
- Dataset: the prompts (and any gold answers) your model trains and is evaluated on.
- Reward:
score_responsereturns aRewardResult. GRPO optimizes that score; SFT trains directly on your dataset answers instead. See how Flash works.
EnvironmentMultiTurn
and implement the episode hooks — see Multi-turn environments below.
Structure the package
A published environment is a small folder. Flash packages it, imports itsenvironment.py entrypoint on the worker, and calls
load_environment(**params), which must return a Freesolo SDK environment. The
folder can be named anything as long as it contains environment.py at its
root.
.py upload still works for tiny smoke tests, but push real environments
as folders. The --name value is normalized to a lowercase hyphen slug, and the
published id uses your Freesolo org namespace, for example your-org/math.
Keep all imports either from sibling files in the folder you publish, from the
freesolo SDK, or from third-party packages you declare under
[environment].pip (see below).
Declare third-party dependencies
If yourenvironment.py imports a package that isn’t part of the freesolo SDK,
list it under [environment].pip so the worker installs it before importing your
environment:
pyproject.toml,
requirements.txt, or lockfile beside environment.py may be useful for local
development, but Flash does not use those files to install packages for a
managed training run. Put runtime packages in the training config that references
the published environment.
Only list packages your environment imports. Flash manages the training stack
itself, so do not pin packages such as torch, trl, vllm, peft, or
bitsandbytes here unless your environment directly imports them.
For example, an environment whose reward calls a judge model should declare the
client library and the secret it reads. Here the judge is reached through an
OpenAI-compatible client (such as OpenRouter), but the same pattern applies to
any provider:
pymongo, but the same pattern applies to any database:
Use the SDK
Your environment code imports from thefreesolo package. Managed training
workers already have that SDK available when they import your published
environment. Your local Python environment does not get it automatically from
the flash CLI, so install it locally if you run or test environment.py
directly, or if you use a command such as flash train --cost that imports the
environment to count training rows:
EnvironmentSingleTurn:
environment.py
input/output row shape (Task
records). load_task_examples(...) exposes each
record’s input/output as example.input/example.output, with the raw row
available as example.record.
For SFT, the SDK builds the training conversation as the environment’s
start_episode(example, prompt_text) plus sft_completion(example). The default
sft_completion converts example.output into completion messages — see
Datasets for the scalar and
message-shaped output forms. Override sft_completion only when the gold
completion needs to be synthesized from other fields.
Single-turn environments usually implement build_prompt_messages. The SDK’s
default start_episode calls it and prepends the run’s prompt text as a
system message when your messages do not already include one, so local eval,
SFT, and GRPO see the same policy prompt. Multi-turn environments own
start_episode; include any task-specific initial system message there. On
managed runs Flash applies the same guarantee to every episode’s opening
messages, single- and multi-turn: if start_episode returns no system message
(or an empty one), the run’s prompt text is inserted as the system message.
When a run uses thinking = true, score_response receives answer text by
default. The response_text value is still string-compatible and also exposes
response_text.completion, response_text.thinking, and response_text.raw for
rewards that intentionally inspect the reasoning trace.
Multi-turn environments
Some tasks aren’t resolved in a single response: a conversation, a tool-using agent, anything where the environment reacts to what the model just did and the model acts again. For these, subclassEnvironmentMultiTurn and implement the
episode hooks. GRPO rolls out the whole episode, then scores the finished
trajectory.
Single-turn (EnvironmentSingleTurn) | Multi-turn (EnvironmentMultiTurn) |
|---|---|
build_prompt_messages → one prompt | start_episode → opening messages |
score_response → reward one reply | step_episode → react to each action; max_episode_turns → bound the episode; score_episode / score_episodes → reward trajectories |
The episode loop
During training the trainer drives your hooks in a loop and scores the result:start_episode(example, prompt_text)returns the initial messages (e.g. a system prompt plus the first user message).max_episode_turns(example)returns the maximum number of assistant actions Flash may sample before forcing the episode terminal. The value is per example.step_episode(example, messages, assistant_response)is the core.messagesalready includes the assistant action being stepped as its last element. Return anEnvironmentStepResult:done=False, messages=(...)keeps the episode open and injects the environment’s response (a tool result, the next user turn);done=True, final_response_text=...ends it.final_response_textoverrides the last assistant message as the text passed to scoring.
score_episode(example, episode)grades one completed transcript (episode.messages) and returns aRewardResult, exactly likescore_responsebut over the whole trajectory.score_episodes(example, episodes)is the batch wrapper Flash calls for a GRPO group. You normally do not implement it; the SDK default runs your singularscore_episodefor each item, usingmax_score_concurrency.
Full multi-turn contract
This is the worker-facing surface Flash expects:EnvironmentStepResult.messages should contain only new environment-side
messages to append after the assistant action. Do not repeat the assistant
message Flash just sampled. metadata is appended to the completed episode’s
metadata under steps, so it is available to score_episode and logs. Keep
durable episode state in the transcript rather than in hidden mutable
attributes.
An action protocol the model can emit as text
The trained policy is an open model, not a tool-use API, so define a plain-text protocol it can produce and yourstep_episode can parse — for example, a
tool-call block versus a final reply:
environment.py
SFT targets for multi-turn
SFT does not execute the multi-turn loop. It builds one supervised row as:sft_completion(example) turns example.output into the target
messages — a scalar becomes one assistant message, and a message list (or
{"messages": [...]}) becomes the full transcript (see
Datasets).
Flash uses completion-only loss for SFT. It masks the token prefix rendered from
start_episode(...) and trains on every token in sft_completion(...). For a
multi-turn target transcript, that means any interleaved assistant, user, and
tool/environment messages in output are part of the supervised completion. If
a multi-turn environment has only scalar outputs, SFT collapses to one assistant
turn per row and does not run step_episode; provide full target transcripts or
override sft_completion when you need multi-turn SFT.
Keep step_episode stateless
The SDK doesn’t hand you a per-episode scratch object, and GRPO runs many
rollouts of the same example concurrently — so don’t stash mutable world-state
on the env keyed by example. If your task has state (an API or service that
changes as tools run), make the transcript the source of truth and rebuild
state by replaying messages inside step_episode / score_episode. Replay is
reproducible as long as your service is deterministic given the same calls (seed
any ids; avoid wall-clock-dependent logic in the reward).
Score behaviours, not a script
Open-ended tasks rarely have one correct transcript, so don’t grade by imitation. Score the behaviours you care about — did it take the resolving action, follow required ordering (e.g. authenticate before a privileged call), stay within limits — and return that blend fromscore_episode. Keeping the reward in a
plain function (importable without the trainer) makes it unit-testable.
Configure for multi-turn
For multi-turn runs, the episode budget comes frommax_episode_turns. What
matters in [train]:
max_tokensis per assistant action (one tool call or one reply), not per episode — keep it modest.max_lengthmust hold the system prompt plus the whole running transcript (every turn, every tool result), so set it well above a single-turn task.group_sizeis the number of full episodes rolled out per prompt for the GRPO advantage estimate.
Publish it
Training runs on managed infrastructure, so the environment has to be reachable by id. Publish your local environment folder to the managed Environments Hub:your-org/math). Put that in your config’s [environment] id.
Use an existing environment
If you already have a published env id (yours or one shared with you), reference it directly. You can also reference a GitHub environment source with agithub:owner/repo@ref:path ref or a GitHub URL.
-o to choose the
destination and -f to overwrite an existing output:
List what you have
./environment.py or
folders under environments/.
Delete a Hub environment
Delete only targets managed Hub ids of the formnamespace/name; GitHub refs
and local paths are not Hub records and cannot be deleted this way.
Pass parameters
If yourload_environment(**kwargs) accepts arguments, set them under
[environment.params]:
Use API keys
If your environment needs an external service, read the key fromos.environ in
environment.py:
environment.py
.env or .env.local:
[environment.params].
If a declared secret is missing when you submit, flash train fails before the
run starts.