Environments - Freesolo Docs

An environment is the task: the data your model sees and the reward, a numeric score your code assigns to each model output. Flash loads environments through the Freesolo environment SDK. You author one locally, publish it to Freesolo’s managed Environments Hub, and reference it in config by its Freesolo environment id.

[environment]
id = "your-org/your-env"

Scaffold one

flash env setup

This scaffolds a starter project in the current directory: environment.py, a tiny dataset/train.jsonl, two configs (configs/sft.toml for SFT and configs/rl.toml for GRPO), and a TRAINING.md playbook for the coding agent you point at the project. The starter load_environment() returns a Freesolo EnvironmentSingleTurn with a sample dataset and reward.

environment.py

import json
from pathlib import Path

from freesolo.datasets import TaskExample
from freesolo.environments import EnvironmentSingleTurn, RewardResult

DEFAULT_DATASET_PATH = Path(__file__).parent / "dataset" / "train.jsonl"


def load_jsonl(path):
    with Path(path).open() as f:
        return [json.loads(line) for line in f if line.strip()]


class StarterEnv(EnvironmentSingleTurn):
    dataset = load_jsonl(DEFAULT_DATASET_PATH)

    def build_prompt_messages(self, example: TaskExample, prompt_text: str):
        return [{"role": "user", "content": example.input}]

    def score_response(self, example: TaskExample, response_text: str) -> RewardResult:
        expected = str(example.output or "").strip()
        score = 1.0 if expected and expected in response_text else 0.0
        return RewardResult(score=score, threshold=1.0)


def load_environment(dataset_path: str | None = None, **kwargs) -> StarterEnv:
    env = StarterEnv()
    if dataset_path:
        env.dataset = load_jsonl(dataset_path)
    return env

Replace the dataset and reward with your task:

Dataset: the prompts (and any gold answers) your model trains and is evaluated on.
Reward: score_response returns a RewardResult. GRPO optimizes that score; SFT trains directly on your dataset answers instead. See how Flash works.
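
As a hedged illustration of swapping in your own reward, the sketch below writes a whitespace- and case-tolerant match as a plain function. The names normalize and exact_match_score are hypothetical and not part of the freesolo SDK; inside score_response you would wrap the returned float in RewardResult(score=..., threshold=1.0) the same way the starter does.

```python
import re


def normalize(text: str) -> str:
    """Lowercase, trim, and collapse runs of whitespace."""
    return re.sub(r"\s+", " ", text.strip().lower())


def exact_match_score(expected: str, response_text: str) -> float:
    """Return 1.0 when the normalized gold answer appears in the response, else 0.0."""
    gold = normalize(expected)
    if not gold:
        return 0.0  # no gold answer: nothing to reward
    return 1.0 if gold in normalize(response_text) else 0.0
```

Keeping the check in a standalone function like this means it can be tested without importing the trainer or the SDK.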

For multi-turn tasks (conversations, tool use, agents), use EnvironmentMultiTurn and implement the episode hooks — see Multi-turn environments below.

Structure the package

A published environment is a small folder. Flash packages it, imports its environment.py entrypoint on the worker, and calls load_environment(**params), which must return a Freesolo SDK environment. The folder can be named anything as long as it contains environment.py at its root.

math/
  environment.py          # defines load_environment()
  helpers.py              # optional Python helpers
  dataset/
    train.jsonl           # input/output records
    eval.jsonl

Publish the folder:

flash env push --name math math

Uploading a single .py file still works for tiny smoke tests, but push real environments as folders. The --name value is normalized to a lowercase, hyphenated slug, and the published id uses your Freesolo org namespace, for example your-org/math. Every import must come from one of three places: sibling files in the folder you publish, the freesolo SDK, or third-party packages you declare under [environment].pip (see below).
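
The train.jsonl and eval.jsonl files in the layout above hold one JSON object per line with input and output fields. A minimal sketch of writing and re-reading such a file using only the standard library; the example rows and the write_jsonl / read_jsonl helpers are made up for illustration.

```python
import json
import tempfile
from pathlib import Path


def write_jsonl(path: Path, rows: list[dict]) -> None:
    """Write one JSON object per line, creating parent folders as needed."""
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row, ensure_ascii=False) + "\n")


def read_jsonl(path: Path) -> list[dict]:
    """Read the file back, skipping blank lines (same idea as the starter load_jsonl)."""
    with path.open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Illustrative rows only; replace with your task's prompts and gold answers.
rows = [
    {"input": "What is 2 + 2?", "output": "4"},
    {"input": "What is 3 * 5?", "output": "15"},
]

with tempfile.TemporaryDirectory() as tmp:
    train_path = Path(tmp) / "math" / "dataset" / "train.jsonl"
    write_jsonl(train_path, rows)
    loaded = read_jsonl(train_path)
```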

Declare third-party dependencies

If your environment.py imports a package that isn’t part of the freesolo SDK, list it under [environment].pip so the worker installs it before importing your environment:

[environment]
id = "your-org/your-env"
pip = ["rapidfuzz>=3.0", "pydantic"]

This is the supported worker dependency mechanism. A pyproject.toml, requirements.txt, or lockfile beside environment.py may be useful for local development, but Flash does not use those files to install packages for a managed training run. Put runtime packages in the training config that references the published environment.

Only list packages your environment imports. Flash manages the training stack itself, so do not pin packages such as torch, trl, vllm, peft, or bitsandbytes here unless your environment directly imports them.

For example, an environment whose reward calls a judge model should declare the client library and the secret it reads. Here the judge is reached through an OpenAI-compatible client (such as OpenRouter), but the same pattern applies to any provider:

[environment]
id = "your-org/search-reward"
pip = ["openai>=1.0.0"]
secrets = ["JUDGE_API_KEY"]

If that same reward also queries a database, declare its client library and the exact environment variable name your code reads. Here the database is MongoDB via pymongo, but the same pattern applies to any database:

[environment]
id = "your-org/search-reward"
pip = ["openai>=1.0.0", "pymongo>=4.17.0"]
secrets = ["JUDGE_API_KEY", "DATABASE_URL"]

Do not declare unused packages or secrets. If the active reward only calls the judge model and no longer queries the database, leave out the database client and its secret.

Use the SDK

Your environment code imports from the freesolo package. Managed training workers already have that SDK available when they import your published environment. Your local Python environment does not get it automatically from the flash CLI, so install it locally if you run or test environment.py directly, or if you use a command such as flash train --cost that imports the environment to count training rows:

uv pip install freesolo

Use the public imports below in environment code:

from freesolo.datasets import TaskExample
from freesolo.datasets.records import load_task_examples
from freesolo.environments import (
    EnvironmentEpisode,
    EnvironmentMultiTurn,
    EnvironmentStepResult,
    EnvironmentSingleTurn,
    RewardResult,
)

For a single-turn task, subclass EnvironmentSingleTurn:

environment.py

from pathlib import Path

from freesolo.datasets import TaskExample
from freesolo.datasets.records import load_task_examples
from freesolo.environments import EnvironmentSingleTurn, RewardResult

ROOT = Path(__file__).parent


class MathEnv(EnvironmentSingleTurn):
    def __init__(self, *, split: str = "train") -> None:
        self.dataset = load_task_examples(ROOT / "dataset" / f"{split}.jsonl")

    def build_prompt_messages(self, example: TaskExample, prompt_text: str):
        return [{"role": "user", "content": example.input}]

    def score_response(self, example: TaskExample, response_text: str) -> RewardResult:
        expected = str(example.output or "").strip()
        score = 1.0 if expected and expected in response_text else 0.0
        return RewardResult(score=score, threshold=1.0)


def load_environment(split: str = "train", **kwargs) -> MathEnv:
    return MathEnv(split=split)

The dataset file uses the input/output row shape (Task records). load_task_examples(...) exposes each record’s input/output as example.input/example.output, with the raw row available as example.record.

For SFT, the SDK builds the training conversation as the environment’s start_episode(example, prompt_text) plus sft_completion(example). The default sft_completion converts example.output into completion messages; see Datasets for the scalar and message-shaped output forms. Override sft_completion only when the gold completion needs to be synthesized from other fields.

Single-turn environments usually implement build_prompt_messages. The SDK’s default start_episode calls it and prepends the run’s prompt text as a system message when your messages do not already include one, so local eval, SFT, and GRPO see the same policy prompt. Multi-turn environments own start_episode; include any task-specific initial system message there. On managed runs Flash applies the same guarantee to every episode’s opening messages, single- and multi-turn: if start_episode returns no system message (or an empty one), the run’s prompt text is inserted as the system message.

When a run uses thinking = true, score_response receives answer text by default. The response_text value is still string-compatible and also exposes response_text.completion, response_text.thinking, and response_text.raw for rewards that intentionally inspect the reasoning trace.
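
The system-message rule described above can be sketched as a plain function. This illustrates the documented behaviour and is not the SDK's source: with_system_prompt is a hypothetical name, and treating a whitespace-only system message as empty, or leaving the messages untouched when the prompt text is empty, are assumptions.

```python
def with_system_prompt(messages: list[dict], prompt_text: str) -> list[dict]:
    """Return messages with prompt_text as the system message when none is present.

    A system message with empty (or whitespace-only, by assumption) content
    counts as missing and is replaced.
    """
    has_system = any(
        m.get("role") == "system" and str(m.get("content") or "").strip()
        for m in messages
    )
    if has_system or not prompt_text:
        return list(messages)  # the environment's own system message wins
    # Drop empty system messages, then prepend the run's prompt text.
    kept = [m for m in messages if m.get("role") != "system"]
    return [{"role": "system", "content": prompt_text}, *kept]
```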

Multi-turn environments

Some tasks aren’t resolved in a single response: a conversation, a tool-using agent, anything where the environment reacts to what the model just did and the model acts again. For these, subclass EnvironmentMultiTurn and implement the episode hooks. GRPO rolls out the whole episode, then scores the finished trajectory.

Single-turn (`EnvironmentSingleTurn`)	Multi-turn (`EnvironmentMultiTurn`)
`build_prompt_messages` → one prompt	`start_episode` → opening messages
`score_response` → reward one reply	`step_episode` → react to each action; `max_episode_turns` → bound the episode; `score_episode` / `score_episodes` → reward trajectories

The episode loop

During training the trainer drives your hooks in a loop and scores the result:

messages = env.start_episode(example, prompt_text)      # opening system/user messages
response_text = ""
for _ in range(env.max_episode_turns(example)):
    action = policy.sample(messages)                    # one assistant message
    messages.append({"role": "assistant", "content": action})
    result = env.step_episode(example, messages, action)
    if result.final_response_text is not None:
        response_text = result.final_response_text
    else:
        response_text = action
    messages.extend(result.messages)                    # the env's response (tool result, next user turn)
    if result.done:
        break
episode = EnvironmentEpisode(messages=tuple(messages), response_text=response_text)
reward = env.score_episode(example, episode)
# Flash batches completed rollouts by calling env.score_episodes(example, episodes).

start_episode(example, prompt_text) returns the initial messages (e.g. a system prompt plus the first user message).
max_episode_turns(example) returns the maximum number of assistant actions Flash may sample before forcing the episode terminal. The value is per example.
step_episode(example, messages, assistant_response) is the core. messages already includes the assistant action being stepped as its last element. Return an EnvironmentStepResult:
- done=False, messages=(...) keeps the episode open and injects the environment’s response (a tool result, the next user turn);
- done=True, final_response_text=... ends it. final_response_text overrides the last assistant message as the text passed to scoring.
score_episode(example, episode) grades one completed transcript (episode.messages) and returns a RewardResult, exactly like score_response but over the whole trajectory.
score_episodes(example, episodes) is the batch wrapper Flash calls for a GRPO group. You normally do not implement it; the SDK default runs your singular score_episode for each item, using max_score_concurrency.
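
To exercise these hooks without a trainer, a scripted list of actions can stand in for policy sampling. The sketch below mirrors the loop above using duck-typed stand-ins: ToyEnv, StepResult, and rollout are hypothetical names for illustration and do not come from the freesolo SDK, and the example is a plain dict rather than a TaskExample.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class StepResult:
    """Stand-in carrying the fields the loop reads from a step result."""
    done: bool = True
    messages: tuple = ()
    final_response_text: Optional[str] = None


class ToyEnv:
    """Toy environment: the episode ends when the assistant says 'done'."""

    def start_episode(self, example, prompt_text):
        return [
            {"role": "system", "content": prompt_text},
            {"role": "user", "content": example["input"]},
        ]

    def max_episode_turns(self, example):
        return 3

    def step_episode(self, example, messages, assistant_response):
        if assistant_response.strip().lower() == "done":
            return StepResult(done=True, final_response_text=assistant_response)
        reply = {"role": "user", "content": "Keep going."}
        return StepResult(done=False, messages=(reply,))


def rollout(env, example, prompt_text, scripted_actions):
    """Drive one episode, taking assistant actions from a fixed script."""
    messages = list(env.start_episode(example, prompt_text))
    response_text = ""
    actions = iter(scripted_actions)
    for _ in range(env.max_episode_turns(example)):
        action = next(actions, None)
        if action is None:
            break  # script exhausted before the turn budget
        messages.append({"role": "assistant", "content": action})
        result = env.step_episode(example, messages, action)
        if result.final_response_text is not None:
            response_text = result.final_response_text
        else:
            response_text = action
        messages.extend(result.messages)  # environment-side messages only
        if result.done:
            break
    return messages, response_text


messages, response_text = rollout(
    ToyEnv(), {"input": "Count to two."}, "Be brief.", ["one", "two", "done"]
)
```

Because the script is fixed, the resulting transcript is deterministic, which makes it a convenient fixture for unit-testing score_episode.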

Full multi-turn contract

This is the worker-facing surface Flash expects:

from dataclasses import dataclass, field
from typing import Any

from freesolo.datasets import TaskExample
from freesolo.environments import RewardResult

ChatMessage = dict[str, Any]


class EnvironmentMultiTurn:
    def start_episode(
        self,
        example: TaskExample,
        prompt_text: str,
    ) -> list[ChatMessage]: ...

    def max_episode_turns(self, example: TaskExample) -> int: ...

    def step_episode(
        self,
        example: TaskExample,
        messages: list[ChatMessage],
        assistant_response: str,
    ) -> "EnvironmentStepResult": ...

    def score_episode(
        self,
        example: TaskExample,
        episode: "EnvironmentEpisode",
    ) -> RewardResult: ...

    def score_episodes(
        self,
        example: TaskExample,
        episodes: list["EnvironmentEpisode"],
    ) -> list[RewardResult]: ...


@dataclass(frozen=True)
class EnvironmentStepResult:
    done: bool = True
    messages: tuple[ChatMessage, ...] = ()
    final_response_text: str | None = None
    metadata: dict[str, Any] = field(default_factory=dict)


@dataclass(frozen=True)
class EnvironmentTurn:
    role: str
    content: str
    name: str | None = None
    metadata: dict[str, Any] = field(default_factory=dict)


@dataclass(frozen=True)
class EnvironmentEpisode:
    messages: tuple[ChatMessage, ...]
    response_text: str
    turns: tuple[EnvironmentTurn, ...] = ()
    metadata: dict[str, Any] = field(default_factory=dict)
    latency_ms: int | None = None
    total_tokens: int | None = None

EnvironmentStepResult.messages should contain only new environment-side messages to append after the assistant action. Do not repeat the assistant message Flash just sampled. metadata is appended to the completed episode’s metadata under steps, so it is available to score_episode and logs. Keep durable episode state in the transcript rather than in hidden mutable attributes.

An action protocol the model can emit as text

The trained policy is an open model, not a tool-use API, so define a plain-text protocol it can produce and your step_episode can parse — for example, a tool-call block versus a final reply:

environment.py

import json
import re

from freesolo.datasets import TaskExample
from freesolo.environments import (
    EnvironmentEpisode,
    EnvironmentMultiTurn,
    EnvironmentStepResult,
    RewardResult,
)

TOOL_CALL = re.compile(r"<tool_call>\s*(.*?)\s*</tool_call>", re.DOTALL)

def run_tool(name: str, args: dict) -> dict:
    ...  # your task-specific tool or service

class AgentEnv(EnvironmentMultiTurn):
    def __init__(self, *, split: str = "train") -> None:
        self.dataset = [{"input": "Refund my last order.", "output": "refund_order"}]

    def start_episode(self, example: TaskExample, prompt_text: str):
        return [
            {"role": "system", "content": "Call tools with <tool_call>{...}</tool_call>; reply in plain text when done."},
            {"role": "user", "content": example.input},
        ]

    def max_episode_turns(self, example: TaskExample) -> int:
        return 8

    def step_episode(self, example, messages, assistant_response):
        blocks = TOOL_CALL.findall(assistant_response)
        if blocks:  # a TOOL turn: run the calls, feed results back, stay open
            results = []
            for block in blocks:
                try:
                    results.append(run_tool(**json.loads(block)))
                except (json.JSONDecodeError, TypeError) as exc:
                    # Malformed call: report it to the model instead of crashing the rollout.
                    results.append({"error": str(exc)})
            return EnvironmentStepResult(
                done=False,
                messages=({"role": "user", "content": f"<tool_result>{json.dumps(results)}</tool_result>"},),
            )
        # a REPLY turn: no tool call -> the model is done
        return EnvironmentStepResult(done=True, final_response_text=assistant_response)

    def score_episode(self, example: TaskExample, episode: EnvironmentEpisode) -> RewardResult:
        # Collect the names of every tool the assistant called in the transcript.
        used = []
        for m in episode.messages:
            if m["role"] != "assistant":
                continue
            for b in TOOL_CALL.findall(m["content"]):
                try:
                    used.append(json.loads(b)["name"])
                except (json.JSONDecodeError, KeyError, TypeError):
                    continue  # ignore malformed calls when scoring
        hit = str(example.output) in used
        return RewardResult(score=1.0 if hit else 0.0, threshold=1.0)

def load_environment(split: str = "train", **kwargs) -> AgentEnv:
    return AgentEnv(split=split)

SFT targets for multi-turn

SFT does not execute the multi-turn loop. It builds one supervised row as:

prompt_messages = env.start_episode(example, prompt_text)
target_messages = env.sft_completion(example)
training_text = chat_template(prompt_messages + target_messages)

The default sft_completion(example) turns example.output into the target messages: a scalar becomes one assistant message, and a message list (or {"messages": [...]}) becomes the full transcript (see Datasets).

Flash uses completion-only loss for SFT. It masks the token prefix rendered from start_episode(...) and trains on every token in sft_completion(...). For a multi-turn target transcript, that means any interleaved assistant, user, and tool/environment messages in output are part of the supervised completion. If a multi-turn environment has only scalar outputs, SFT collapses to one assistant turn per row and does not run step_episode; provide full target transcripts or override sft_completion when you need multi-turn SFT.
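
The mapping from example.output to target messages described above can be sketched as follows. This is an illustration of the documented forms, not the SDK's implementation: to_completion_messages is a hypothetical name, and how the real default treats other shapes (for example a missing output) is not covered here.

```python
def to_completion_messages(output) -> list[dict]:
    """Convert a dataset output value into SFT target messages.

    - {"messages": [...]} is unwrapped
    - a message list is used as the full transcript
    - anything else (a scalar) becomes one assistant message
    """
    if isinstance(output, dict) and isinstance(output.get("messages"), list):
        return list(output["messages"])
    if isinstance(output, list):
        return list(output)
    return [{"role": "assistant", "content": str(output)}]


# A scalar answer yields a single supervised assistant turn.
single = to_completion_messages("4")

# A message-shaped output supervises the whole transcript, tool results included.
transcript = to_completion_messages(
    [
        {"role": "assistant", "content": "<tool_call>{\"name\": \"refund_order\"}</tool_call>"},
        {"role": "user", "content": "<tool_result>[{\"ok\": true}]</tool_result>"},
        {"role": "assistant", "content": "Your refund is on its way."},
    ]
)
```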

Keep `step_episode` stateless

The SDK doesn’t hand you a per-episode scratch object, and GRPO runs many rollouts of the same example concurrently — so don’t stash mutable world-state on the env keyed by example. If your task has state (an API or service that changes as tools run), make the transcript the source of truth and rebuild state by replaying messages inside step_episode / score_episode. Replay is reproducible as long as your service is deterministic given the same calls (seed any ids; avoid wall-clock-dependent logic in the reward).
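
A minimal sketch of the replay idea, assuming the same <tool_call> text protocol as the AgentEnv example and a call shape of {"name": ..., "args": {...}}. The replay_state function, the orders table, and the refund_order semantics are all hypothetical; the point is only that state is derived from the transcript on every call rather than stored on the environment.

```python
import json
import re

TOOL_CALL = re.compile(r"<tool_call>\s*(.*?)\s*</tool_call>", re.DOTALL)


def replay_state(messages: list[dict]) -> dict:
    """Rebuild the world state by replaying every assistant tool call in order."""
    # Hypothetical initial world; a real task would derive this from the example.
    state = {"orders": {"A1": "paid"}, "calls": []}
    for message in messages:
        if message.get("role") != "assistant":
            continue
        for block in TOOL_CALL.findall(message.get("content", "")):
            try:
                call = json.loads(block)
            except json.JSONDecodeError:
                continue  # malformed calls change nothing
            if not isinstance(call, dict):
                continue
            state["calls"].append(call.get("name"))
            if call.get("name") == "refund_order":
                args = call.get("args")
                order_id = args.get("order_id") if isinstance(args, dict) else None
                if state["orders"].get(order_id) == "paid":
                    state["orders"][order_id] = "refunded"
    return state


transcript = [
    {"role": "system", "content": "Call tools with <tool_call>{...}</tool_call>."},
    {"role": "user", "content": "Refund my last order."},
    {
        "role": "assistant",
        "content": '<tool_call>{"name": "refund_order", "args": {"order_id": "A1"}}</tool_call>',
    },
]
state = replay_state(transcript)
```

Calling replay_state twice on the same transcript gives the same result, so concurrent rollouts of one example cannot interfere with each other.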

Score behaviours, not a script

Open-ended tasks rarely have one correct transcript, so don’t grade by imitation. Score the behaviours you care about — did it take the resolving action, follow required ordering (e.g. authenticate before a privileged call), stay within limits — and return that blend from score_episode. Keeping the reward in a plain function (importable without the trainer) makes it unit-testable.
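
One way to express such a blend is a plain function over the tool names extracted from the transcript. The sketch below is an assumption-laden example: score_behaviours, the tool names authenticate and refund_order, the call limit, and the 6/3/1 weighting are all illustrative choices, not Flash defaults.

```python
def score_behaviours(tool_names: list[str], *, max_calls: int = 6) -> float:
    """Blend three behaviour checks into a score in [0, 1].

    Weights (in tenths): 6 for the resolving action, 3 for correct ordering,
    1 for staying within the call limit.
    """
    resolved = "refund_order" in tool_names
    if resolved:
        # Ordering: authenticate must appear before the first privileged call.
        first_refund = tool_names.index("refund_order")
        ordered = "authenticate" in tool_names[:first_refund]
    else:
        ordered = False
    within_limit = len(tool_names) <= max_calls
    # Integer weights avoid floating-point drift when summing.
    return (6 * resolved + 3 * ordered + 1 * within_limit) / 10
```

Inside score_episode, the returned value would be passed to RewardResult as the score; the function itself imports nothing from the trainer, so it can be tested in isolation.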

Configure for multi-turn

For multi-turn runs, the episode budget comes from max_episode_turns. What matters in [train]:

max_tokens is per assistant action (one tool call or one reply), not per episode — keep it modest.
max_length must hold the system prompt plus the whole running transcript (every turn, every tool result), so set it well above a single-turn task.
group_size is the number of full episodes rolled out per prompt for the GRPO advantage estimate.
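
The relationship between these settings can be checked with rough arithmetic. The estimator below is a back-of-the-envelope sketch: estimate_max_length and every parameter other than max_tokens are hypothetical, the token counts are made-up inputs, and chat-template overhead is only covered by the headroom percentage.

```python
def estimate_max_length(
    system_prompt_tokens: int,
    first_user_tokens: int,
    max_turns: int,
    max_tokens: int,
    env_reply_tokens: int,
    headroom_pct: int = 10,
) -> int:
    """Upper-bound the transcript length of one episode, in tokens."""
    per_turn = max_tokens + env_reply_tokens  # one action plus the env's reply
    total = system_prompt_tokens + first_user_tokens + max_turns * per_turn
    # Integer ceiling of total * (1 + headroom_pct / 100).
    return (total * (100 + headroom_pct) + 99) // 100


# 8 turns of up to 256 action tokens, each followed by roughly 300 tokens of tool output.
budget = estimate_max_length(
    system_prompt_tokens=200,
    first_user_tokens=100,
    max_turns=8,
    max_tokens=256,
    env_reply_tokens=300,
)
```

With these illustrative numbers the transcript can reach 4,748 tokens before headroom, which is why max_length for a multi-turn run needs to sit well above the per-action max_tokens.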

Publish it

Training runs on managed infrastructure, so the environment has to be reachable by id. Publish your local environment folder to the managed Environments Hub:

flash env push --name math math

The command prints the published id (for example, your-org/math). Put that id in your config’s [environment] id.

Use an existing environment

If you already have a published env id (yours or one shared with you), reference it directly. You can also reference a GitHub environment source with a github:owner/repo@ref:path ref or a GitHub URL.

[environment]
id = "your-org/your-env"

If you want the source locally for editing or inspection, pull the whole environment into a directory:

flash env pull your-org/your-env

Pull one file by adding its path inside the environment. Use -o to choose the destination and -f to overwrite an existing output:

flash env pull your-org/your-env dataset/train.jsonl -o train.jsonl

List what you have

flash env list

This lists the local environment sources you can publish, such as ./environment.py or folders under environments/.

Delete a Hub environment

flash env delete only targets managed Hub ids of the form namespace/name; GitHub refs and local paths are not Hub records and cannot be deleted this way.

flash env delete your-org/your-env
flash env delete your-org/your-env -y   # skip the confirmation prompt

Pass parameters

If your load_environment(**kwargs) accepts arguments, set them under [environment.params]:

[environment]
id = "your-org/your-env"

[environment.params]
difficulty = "hard"
num_examples = 500
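
Values under [environment.params] arrive as keyword arguments to load_environment. Below is a hedged sketch of the filtering such a function might do before building the environment; select_rows, the difficulty field on each row, and the seed parameter are hypothetical and only illustrate how the two parameters above could be consumed.

```python
import random
from typing import Optional


def select_rows(
    rows: list[dict],
    difficulty: Optional[str] = None,
    num_examples: Optional[int] = None,
    seed: int = 0,
) -> list[dict]:
    """Filter rows by difficulty, then take a reproducible sample of num_examples."""
    if difficulty is not None:
        rows = [r for r in rows if r.get("difficulty") == difficulty]
    if num_examples is not None and num_examples < len(rows):
        rows = random.Random(seed).sample(rows, num_examples)
    return list(rows)


# Hypothetical in-memory rows; a real environment would load them from dataset/*.jsonl.
ROWS = [
    {"input": f"Question {i}", "output": str(i), "difficulty": "hard" if i % 2 else "easy"}
    for i in range(2000)
]

# load_environment(difficulty="hard", num_examples=500, **kwargs) could pass its
# arguments straight through and assign the result to the environment's dataset.
selected = select_rows(ROWS, difficulty="hard", num_examples=500)
```

A fixed seed keeps the sampled subset identical across runs, so two runs with the same params train on the same rows.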

Use API keys

If your environment needs an external service, read the key from os.environ in environment.py:

environment.py

import os

SERVICE_API_KEY = os.environ["SERVICE_API_KEY"]

Then declare the environment variable names in your training config:

[environment]
id = "your-org/your-env"
secrets = ["SERVICE_API_KEY"]

Set the value in your shell before submitting:

export SERVICE_API_KEY="..."
flash train configs/sft.toml

You can also put local development values in .env or .env.local:

SERVICE_API_KEY=...

Flash sends declared secret values to the worker out-of-band. Secret values are not stored in the TOML config, run spec, status JSON, logs, or environment artifact. Do not put API keys in [environment.params]. If a declared secret is missing when you submit, flash train fails before the run starts.
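
Because a missing secret stops the run at submission, it can be convenient to check locally first. The helper below is a small sketch and not part of the flash CLI: missing_secrets is a hypothetical name, and treating an empty value as missing is an assumption.

```python
import os


def missing_secrets(declared: list[str], environ=None) -> list[str]:
    """Return the declared secret names that are unset or empty."""
    env = os.environ if environ is None else environ
    return [name for name in declared if not env.get(name)]


# Pass a mapping explicitly for testing; omit it to check the real process environment.
problems = missing_secrets(
    ["SERVICE_API_KEY", "DATABASE_URL"],
    {"SERVICE_API_KEY": "example-value"},
)
```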
