Environment model - Freesolo Docs

An environment is a small Python module that packages everything Flash needs to teach and grade your model: the data it practices on, how it interacts, and how its answers are scored. It is the single source of truth for what the model learns and what counts as good.

What an environment packages

Every environment bundles three things behind one load_environment() entrypoint:

A dataset

The prompts your model practices on, with optional gold answers. Authored as input/output records. See Datasets.

An interaction model

How the model engages: a single prompt and response, or a multi-turn exchange. This is the environment class you subclass.

A reward

score_response looks at an answer and returns a RewardResult score. This score is the teacher.

The same environment drives SFT (learn from gold answers), GRPO (learn from reward scores), and eval: you swap one line in the config and the environment stays put. For SFT, each row’s output is the gold completion Flash trains on, appended after the environment’s initial episode so system prompts and tool transcripts stay part of the example. See Datasets for the output shapes (a scalar answer, or a message trajectory for multi-turn and tool-use imitation).
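To make the SFT behaviour above concrete, here is a minimal sketch of how a gold completion could be appended after an environment's opening messages. The helper name build_sft_messages and the message-dict shape are illustrative assumptions, not part of the Freesolo SDK. The real output shapes are described in Datasets.

```python
from typing import Union

Message = dict[str, str]


def build_sft_messages(
    initial_episode: list[Message],
    output: Union[str, list[Message]],
) -> list[Message]:
    """Hypothetical helper: append a gold completion after the opening messages."""
    if isinstance(output, str):
        # Scalar answer: treat it as a single gold assistant turn.
        gold = [{"role": "assistant", "content": output}]
    else:
        # Message trajectory: keep every gold turn in its original order.
        gold = list(output)
    # The opening messages (system prompt, user turn) stay part of the example.
    return list(initial_episode) + gold


initial = [
    {"role": "system", "content": "Answer with a single number."},
    {"role": "user", "content": "What is 2 + 2?"},
]
example = build_sft_messages(initial, "4")
```

In this sketch the system prompt and user turn come first and the gold answer is the final assistant message, which mirrors the "appended after the initial episode" behaviour described above.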

The one thing you write

You write environment.py; Flash owns the rest of the loop.

You write	Flash runs
The dataset, interaction, and reward in `environment.py`	The training loop: sampling prompts, generating model attempts (rollouts), applying the algorithm
(nothing)	Managed compute, batching, checkpointing, and auto-retry
(nothing)	A versioned Environments Hub and per-token serving

The quality of the environment’s dataset and reward sets the ceiling on what training can achieve. For the full split of responsibilities, see How Flash works.

Where it sits in the training loop

The environment plugs into three steps of the loop: it supplies the prompt, shapes the rollout, and scores the result.

Flash samples a prompt

The training loop pulls a prompt from your environment’s dataset.

The model produces a rollout

The current model attempts an answer, following your environment’s interaction model (single response, or several turns).

Your reward scores it

score_response returns a RewardResult. That score is the signal the algorithm learns from.

See How Flash works for the algorithm side of the loop.
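The three steps above can be sketched in plain Python. This is an illustration only: RewardResult here is a stand-in dataclass carrying the two fields used on this page, fake_model is a placeholder for the current model, and run_step is a hypothetical name. Flash's actual loop also batches rollouts and applies the training algorithm.

```python
import random
from dataclasses import dataclass


@dataclass
class RewardResult:
    # Stand-in for the SDK class; only the two fields used on this page.
    score: float
    threshold: float


dataset = [
    {"input": "What is 2 + 2?", "output": "4"},
    {"input": "What is 3 + 3?", "output": "6"},
]


def fake_model(prompt: str) -> str:
    # Placeholder for the current model: it always gives the same answer.
    return "The answer is 4."


def score_response(example: dict, response_text: str) -> RewardResult:
    # Same substring rule as the minimal environment on this page.
    expected = str(example["output"]).strip()
    hit = bool(expected) and expected in response_text
    return RewardResult(score=1.0 if hit else 0.0, threshold=1.0)


def run_step(rng: random.Random) -> RewardResult:
    example = rng.choice(dataset)            # 1. sample a prompt
    rollout = fake_model(example["input"])   # 2. the model produces a rollout
    return score_response(example, rollout)  # 3. the reward scores it


rng = random.Random(0)
results = [run_step(rng) for _ in range(4)]
```

Each entry in results is the per-rollout signal an algorithm such as GRPO would learn from. The sketch stops before the update step, which Flash owns.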

Single-turn and multi-turn

The interaction model is set by which base class you subclass:

EnvironmentSingleTurn: prompt in, completion out, reward computed. Most tasks start here. If your prompt messages do not include a system message, the SDK prepends the run’s prompt text as one.
EnvironmentMultiTurn: for conversations or tool use, where the model takes several steps before the whole sequence (its trajectory) is scored. You implement four episode hooks: start_episode (opening messages), step_episode (react to each action and decide whether the episode continues), max_episode_turns (the bound), and score_episode (reward the trajectory). See Multi-turn environments for the full loop, an action protocol, and the stateless-step pattern.

A minimal single-turn environment:

environment.py

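from freesolo.datasets import TaskExample
from freesolo.environments import EnvironmentSingleTurn, RewardResult

class CustomEnv(EnvironmentSingleTurn):
    # One inline record: "input" is the prompt, "output" is the gold answer.
    dataset = [{"input": "What is 2 + 2?", "output": "4"}]

    def build_prompt_messages(self, example: TaskExample, prompt_text: str):
        # Send the record's input as a single user message.
        return [{"role": "user", "content": example.input}]

    def score_response(self, example: TaskExample, response_text: str) -> RewardResult:
        # Full credit when the gold answer appears anywhere in the response.
        # Substring matching is lenient: "14" also contains "4".
        expected = str(example.output or "").strip()
        matched = bool(expected) and expected in response_text
        return RewardResult(score=1.0 if matched else 0.0, threshold=1.0)

def load_environment(**kwargs) -> CustomEnv:
    # Entrypoint Flash calls; this minimal version ignores any params.
    return CustomEnv()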
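
For comparison, the four multi-turn hooks named above might fit together as in the sketch below. It does not subclass EnvironmentMultiTurn, and every signature here (argument lists, return types, the run_episode driver, the binary_search_policy stand-in for the model) is an assumption made for illustration. The real hook contracts are documented in Multi-turn environments.

```python
class GuessNumberEpisode:
    """Hypothetical multi-turn task: the model guesses a secret number."""

    def __init__(self, secret: int):
        self.secret = secret

    def max_episode_turns(self) -> int:
        return 5  # upper bound on model actions per episode

    def start_episode(self) -> list[dict]:
        return [{"role": "user", "content": "Guess a number from 1 to 10."}]

    def step_episode(self, action: str) -> tuple[str, bool]:
        """React to one model action and return (reply, done)."""
        try:
            guess = int(action.strip())
        except ValueError:
            return "Please reply with a number.", False
        if guess == self.secret:
            return "correct", True
        return ("higher" if guess < self.secret else "lower"), False

    def score_episode(self, messages: list[dict]) -> float:
        # Reward the whole trajectory: 1.0 if any reply confirmed the guess.
        solved = any(
            m["role"] == "user" and m["content"] == "correct" for m in messages
        )
        return 1.0 if solved else 0.0


def run_episode(env, policy) -> tuple[list[dict], float]:
    """Hypothetical driver: alternate model actions and environment replies."""
    messages = env.start_episode()
    for _ in range(env.max_episode_turns()):
        action = policy(messages)
        messages.append({"role": "assistant", "content": action})
        reply, done = env.step_episode(action)
        messages.append({"role": "user", "content": reply})
        if done:
            break
    return messages, env.score_episode(messages)


def binary_search_policy(messages: list[dict]) -> str:
    # Stand-in for the model: replay past feedback, then bisect the range.
    low, high = 1, 10
    for guess_msg, reply_msg in zip(messages[1::2], messages[2::2]):
        guess = int(guess_msg["content"])
        if reply_msg["content"] == "higher":
            low = guess + 1
        elif reply_msg["content"] == "lower":
            high = guess - 1
    return str((low + high) // 2)


messages, score = run_episode(GuessNumberEpisode(secret=7), binary_search_policy)
```

The point of the sketch is the division of labour: step_episode decides whether the episode continues, max_episode_turns bounds it, and score_episode sees the full trajectory rather than a single response.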

See Environments for the full SDK: prompt builders, loading dataset files, parameters, and secrets. With thinking = true, response_text is the answer text by default; it also exposes the separated reasoning trace and raw output when a reward needs them (see Environments).

From local file to managed run

An environment is authored locally but has to be reachable by id when training runs on managed infrastructure. The lifecycle is four steps:

Author it locally

Write environment.py with a load_environment() that returns a Freesolo environment.

Publish it

`flash env push --name my-env .` packages the folder and uploads it to the managed Environments Hub, which versions it and prints an id (for example, your-org/my-env).

Reference it by id

Put that id in your config’s [environment] id. Optional [environment.params] are passed to load_environment(**params).

Flash loads it at run time

The worker installs your published environment, imports environment.py, and calls load_environment(**params) to drive the run.
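
The hand-off from config to load_environment(**params) can be sketched as follows. The config is shown as a Python dict that mirrors the [environment] and [environment.params] tables. The parameter names difficulty and max_examples and the SketchEnv class are hypothetical, chosen only to show how keyword arguments flow through.

```python
# Mirrors a config containing:
#   [environment]
#   id = "your-org/my-env"
#   [environment.params]
#   difficulty = "easy"     # hypothetical parameter
#   max_examples = 2        # hypothetical parameter
config = {
    "environment": {
        "id": "your-org/my-env",
        "params": {"difficulty": "easy", "max_examples": 2},
    }
}


class SketchEnv:
    """Stand-in for a Freesolo environment class."""

    def __init__(self, difficulty: str = "hard", max_examples: int = 100):
        self.difficulty = difficulty
        self.max_examples = max_examples


def load_environment(**kwargs) -> SketchEnv:
    # Forward the params to the constructor so the config can tune the run.
    return SketchEnv(**kwargs)


# [environment.params] is optional, so default to no params.
params = config["environment"].get("params", {})
env = load_environment(**params)
```

Giving every parameter a default keeps load_environment() callable with no params, which matters because [environment.params] is optional.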

Next

Build an environment

The full SDK: author a dataset and reward, then publish it.

Datasets

Package task records and data files inside an environment.

Explore the directory

Where the environment lives on disk and what Flash reads.

How Flash works

The full run loop the environment plugs into.
