# Environments

> Define the task your model trains on and is graded by.

An [**environment**](/environment-model) defines the task: the data your model
sees and the reward, which is a numeric score your code assigns to each model
output. Flash loads environments through the Freesolo environment SDK. You
author one locally, publish it to Freesolo's managed Environments Hub, and
reference it in your config by its Freesolo environment id.

```toml theme={null}
[environment]
id = "your-org/your-env"
```

## Scaffold one

```bash theme={null}
flash env setup
```

This scaffolds a starter project in the current directory: `environment.py`, a
tiny `dataset/train.jsonl`, two configs (`configs/sft.toml` for SFT and
`configs/rl.toml` for GRPO), and a `TRAINING.md` playbook for the coding agent you
point at the project. The starter `load_environment()` returns a Freesolo
`EnvironmentSingleTurn` with a sample dataset and reward.

```python environment.py theme={null}
import json
from pathlib import Path

from freesolo.datasets import TaskExample
from freesolo.environments import EnvironmentSingleTurn, RewardResult

DEFAULT_DATASET_PATH = Path(__file__).parent / "dataset" / "train.jsonl"


def load_jsonl(path):
    with Path(path).open() as f:
        return [json.loads(line) for line in f if line.strip()]


class StarterEnv(EnvironmentSingleTurn):
    dataset = load_jsonl(DEFAULT_DATASET_PATH)

    def build_prompt_messages(self, example: TaskExample, prompt_text: str):
        return [{"role": "user", "content": example.input}]

    def score_response(self, example: TaskExample, response_text: str) -> RewardResult:
        expected = str(example.output or "").strip()
        score = 1.0 if expected and expected in response_text else 0.0
        return RewardResult(score=score, threshold=1.0)


def load_environment(dataset_path: str | None = None, **kwargs) -> StarterEnv:
    env = StarterEnv()
    if dataset_path:
        env.dataset = load_jsonl(dataset_path)
    return env
```

Replace the dataset and reward with your task:

* [**Dataset**](/guides/datasets): the prompts (and any gold answers) your model trains and is
  evaluated on.
* **Reward**: `score_response` returns a `RewardResult`. GRPO optimizes that
  score; SFT trains directly on your dataset answers instead. See [how Flash
  works](/how-flash-works).
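
The starter reward uses substring containment, so a gold answer of `1` would
also match a response containing `12`. The sketch below shows one stricter
alternative in plain Python. It assumes the prompt asks the model to finish with
a line such as `Answer: <value>`; that format and the helper names
`extract_final_answer` and `exact_match_score` are hypothetical and not part of
the SDK. Inside `score_response` you would wrap the returned float in a
`RewardResult`.

```python
import re

# Hypothetical answer format for this sketch: the model ends with "Answer: <value>".
ANSWER_LINE = re.compile(r"answer\s*:\s*(.+)", re.IGNORECASE)


def extract_final_answer(response_text: str) -> str:
    """Return the text after the last 'Answer:' marker, or the whole response."""
    matches = ANSWER_LINE.findall(response_text)
    candidate = matches[-1] if matches else response_text
    # Drop surrounding whitespace and a trailing full stop.
    return candidate.strip().rstrip(".").strip()


def exact_match_score(expected: str, response_text: str) -> float:
    """1.0 only when the extracted answer equals the gold answer (case-insensitive)."""
    gold = str(expected or "").strip()
    if not gold:
        return 0.0
    answer = extract_final_answer(response_text)
    return 1.0 if answer.casefold() == gold.casefold() else 0.0
```

Keeping the comparison in a plain function makes it easy to test locally before
publishing.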

For multi-turn tasks (conversations, tool use, agents), use `EnvironmentMultiTurn`
and implement the episode hooks — see [Multi-turn environments](#multi-turn-environments) below.

## Structure the package

A published environment is a small folder. Flash packages it, imports its
`environment.py` entrypoint on the worker, and calls
`load_environment(**params)`, which must return a Freesolo SDK environment. The
folder can be named anything as long as it contains `environment.py` at its
root.

```text theme={null}
math/
  environment.py          # defines load_environment()
  helpers.py              # optional Python helpers
  dataset/
    train.jsonl           # input/output records
    eval.jsonl
```

Publish the folder:

```bash theme={null}
flash env push --name math math
```

Uploading a single `.py` file still works for tiny smoke tests, but push real
environments as folders. The `--name` value is normalized to a lowercase hyphen
slug, and the published id uses your Freesolo org namespace, for example
`your-org/math`.

Import only from sibling files in the folder you publish, from the `freesolo`
SDK, or from third-party packages you declare under `[environment].pip` (see
below).

### Declare third-party dependencies

If your `environment.py` imports a package that isn't part of the `freesolo` SDK,
list it under `[environment].pip` so the worker installs it before importing your
environment:

```toml theme={null}
[environment]
id = "your-org/your-env"
pip = ["rapidfuzz>=3.0", "pydantic"]
```

This is the supported mechanism for installing worker dependencies. A
`pyproject.toml`, `requirements.txt`, or lockfile beside `environment.py` may be
useful for local development, but Flash does not use those files to install
packages for a managed training run. Declare runtime packages in the
[training config](/reference/configuration) that references the published
environment.

Only list packages your environment imports. Flash manages the training stack
itself, so do not pin packages such as `torch`, `trl`, `vllm`, `peft`, or
`bitsandbytes` here unless your environment directly imports them.

For example, an environment whose reward calls a judge model should declare the
client library and the secret it reads. Here the judge is reached through an
OpenAI-compatible client (such as OpenRouter), but the same pattern applies to
any provider:

```toml theme={null}
[environment]
id = "your-org/search-reward"
pip = ["openai>=1.0.0"]
secrets = ["JUDGE_API_KEY"]
```

If that same reward also queries a database, declare its client library and the
exact environment variable name your code reads. Here the database is MongoDB via
`pymongo`, but the same pattern applies to any database:

```toml theme={null}
[environment]
id = "your-org/search-reward"
pip = ["openai>=1.0.0", "pymongo>=4.17.0"]
secrets = ["JUDGE_API_KEY", "DATABASE_URL"]
```

Do not declare unused packages or secrets. If the active reward only calls the
judge model and no longer queries the database, leave out the database client and
its secret.

## Use the SDK

Your environment code imports from the `freesolo` package. Managed training
workers already have that SDK available when they import your published
environment. Your local Python environment does not get it automatically from
the `flash` CLI, so install it locally if you run or test `environment.py`
directly, or if you use a command such as `flash train --cost` that imports the
environment to count training rows:

```bash theme={null}
uv pip install freesolo
```

Use the public imports below in environment code:

```python theme={null}
from freesolo.datasets import TaskExample
from freesolo.datasets.records import load_task_examples
from freesolo.environments import (
    EnvironmentEpisode,
    EnvironmentMultiTurn,
    EnvironmentStepResult,
    EnvironmentSingleTurn,
    RewardResult,
)
```

For a single-turn task, subclass `EnvironmentSingleTurn`:

```python environment.py theme={null}
from pathlib import Path

from freesolo.datasets import TaskExample
from freesolo.datasets.records import load_task_examples
from freesolo.environments import EnvironmentSingleTurn, RewardResult

ROOT = Path(__file__).parent


class MathEnv(EnvironmentSingleTurn):
    def __init__(self, *, split: str = "train") -> None:
        self.dataset = load_task_examples(ROOT / "dataset" / f"{split}.jsonl")

    def build_prompt_messages(self, example: TaskExample, prompt_text: str):
        return [{"role": "user", "content": example.input}]

    def score_response(self, example: TaskExample, response_text: str) -> RewardResult:
        expected = str(example.output or "").strip()
        score = 1.0 if expected and expected in response_text else 0.0
        return RewardResult(score=score, threshold=1.0)


def load_environment(split: str = "train", **kwargs) -> MathEnv:
    return MathEnv(split=split)
```

The dataset file uses the `input`/`output` row shape ([Task
records](/guides/datasets#task-records)). `load_task_examples(...)` exposes each
record's `input`/`output` as `example.input`/`example.output`, with the raw row
available as `example.record`.
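
As a rough illustration of that row shape, the sketch below writes two
`input`/`output` records as JSON Lines and reads them back with a basic check.
It uses only the standard library rather than `load_task_examples`, and the
helper names `write_jsonl` and `check_rows` are hypothetical. Treating `input`
as the only required key is an assumption of this sketch.

```python
import json
import tempfile
from pathlib import Path

rows = [
    {"input": "What is 2 + 3?", "output": "5"},
    {"input": "What is 7 * 6?", "output": "42"},
]


def write_jsonl(path: Path, records: list) -> None:
    """Write one JSON object per line."""
    with path.open("w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


def check_rows(path: Path) -> list:
    """Load the file and fail early on rows that lack an `input` key."""
    loaded = []
    with path.open(encoding="utf-8") as f:
        for line_number, line in enumerate(f, start=1):
            if not line.strip():
                continue  # tolerate blank lines, like the starter load_jsonl
            record = json.loads(line)
            if "input" not in record:
                raise ValueError(f"line {line_number}: missing 'input'")
            loaded.append(record)
    return loaded


# Round-trip the rows through a temporary train.jsonl file.
with tempfile.TemporaryDirectory() as tmp:
    dataset_path = Path(tmp) / "train.jsonl"
    write_jsonl(dataset_path, rows)
    checked = check_rows(dataset_path)
```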

For SFT, the SDK builds the training conversation as the environment's
`start_episode(example, prompt_text)` plus `sft_completion(example)`. The default
`sft_completion` converts `example.output` into completion messages — see
[Datasets](/guides/datasets#message-shaped-sft-targets) for the scalar and
message-shaped output forms. Override `sft_completion` only when the gold
completion needs to be synthesized from other fields.

Single-turn environments usually implement `build_prompt_messages`. The SDK's
default `start_episode` calls it and prepends the run's prompt text as a
`system` message when your messages do not already include one, so local eval,
SFT, and GRPO see the same policy prompt. Multi-turn environments own
`start_episode`; include any task-specific initial system message there. On
managed runs Flash applies the same guarantee to every episode's opening
messages, single- and multi-turn: if `start_episode` returns no system message
(or an empty one), the run's prompt text is inserted as the system message.
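
The system-message guarantee described above can be pictured with a small
helper. This is a sketch of the described behaviour, not Flash's actual code:
the function name `ensure_system_message` is hypothetical, and dropping an empty
system message before inserting the run prompt is an assumption.

```python
def ensure_system_message(messages: list, prompt_text: str) -> list:
    """Return messages with the run's prompt text as system message when none is set."""
    has_system = any(
        m.get("role") == "system" and str(m.get("content") or "").strip()
        for m in messages
    )
    if has_system:
        return list(messages)  # the environment's own system message wins
    # Assumption: empty system messages are dropped, then the run prompt goes first.
    kept = [m for m in messages if m.get("role") != "system"]
    return [{"role": "system", "content": prompt_text}, *kept]
```

For example, `ensure_system_message([{"role": "user", "content": "hi"}], "Be brief.")`
returns the system message followed by the original user message, while a list
that already starts with a non-empty system message comes back unchanged.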

When a run uses `thinking = true`, `score_response` receives answer text by
default. The `response_text` value is still string-compatible and also exposes
`response_text.completion`, `response_text.thinking`, and `response_text.raw` for
rewards that intentionally inspect the reasoning trace.
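
A reward that looks at the reasoning trace can be written so it also accepts a
plain string. The sketch below is illustrative only: `ResponseText` is a
stand-in class built for this example (the SDK's real type may differ, and the
way `raw` is assembled here is an assumption), and the 0.5 penalty for a long
trace is an arbitrary choice.

```python
class ResponseText(str):
    """Stand-in for this sketch only: a str that also carries the reasoning trace."""

    def __new__(cls, completion: str, thinking: str = ""):
        obj = super().__new__(cls, completion)
        obj.completion = completion
        obj.thinking = thinking
        # Assumption: raw is the trace followed by the answer text.
        obj.raw = f"{thinking}\n{completion}" if thinking else completion
        return obj


def score_answer_with_brevity(expected: str, response_text: str,
                              max_thinking_chars: int = 2000) -> float:
    """Full credit for a correct answer, reduced when the trace is very long."""
    # String operations see the answer text only, not the reasoning trace.
    correct = bool(expected) and expected in response_text
    if not correct:
        return 0.0
    # getattr keeps the function usable with plain strings in local tests.
    thinking = getattr(response_text, "thinking", "") or ""
    return 1.0 if len(thinking) <= max_thinking_chars else 0.5
```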

## Multi-turn environments

Some tasks aren't resolved in a single response: a conversation, a tool-using
agent, anything where the **environment reacts to what the model just did** and
the model acts again. For these, subclass `EnvironmentMultiTurn` and implement the
**episode hooks**. GRPO rolls out the whole episode, then scores the finished
trajectory.

| Single-turn (`EnvironmentSingleTurn`) | Multi-turn (`EnvironmentMultiTurn`)                                                                                                      |
| ------------------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------- |
| `build_prompt_messages` → one prompt  | `start_episode` → opening messages                                                                                                       |
| `score_response` → reward one reply   | `step_episode` → react to each action; `max_episode_turns` → bound the episode; `score_episode` / `score_episodes` → reward trajectories |

### The episode loop

During training the trainer drives your hooks in a loop and scores the result:

```text theme={null}
messages = env.start_episode(example, prompt_text)      # opening system/user messages
response_text = ""
for _ in range(env.max_episode_turns(example)):
    action = policy.sample(messages)                    # one assistant message
    messages.append({"role": "assistant", "content": action})
    result = env.step_episode(example, messages, action)
    if result.final_response_text is not None:
        response_text = result.final_response_text
    else:
        response_text = action
    messages.extend(result.messages)                    # the env's response (tool result, next user turn)
    if result.done:
        break
episode = EnvironmentEpisode(messages=tuple(messages), response_text=response_text)
reward = env.score_episode(example, episode)
# Flash batches completed rollouts by calling env.score_episodes(example, episodes).
```

* **`start_episode(example, prompt_text)`** returns the initial messages (e.g. a
  system prompt plus the first user message).
* **`max_episode_turns(example)`** returns the maximum number of assistant
  actions Flash may sample before forcing the episode terminal. The value is
  per example.
* **`step_episode(example, messages, assistant_response)`** is the core.
  `messages` already includes the assistant action being stepped as its last
  element. Return an `EnvironmentStepResult`:
  * `done=False, messages=(...)` keeps the episode open and injects the
    environment's response (a tool result, the next user turn);
  * `done=True, final_response_text=...` ends it. `final_response_text`
    overrides the last assistant message as the text passed to scoring.
* **`score_episode(example, episode)`** grades one completed transcript
  (`episode.messages`) and returns a `RewardResult`, exactly like
  `score_response` but over the whole trajectory.
* **`score_episodes(example, episodes)`** is the batch wrapper Flash calls for a
  GRPO group. You normally do not implement it; the SDK default runs your
  singular `score_episode` for each item, using `max_score_concurrency`.
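
To see the hook order without a trainer, the loop can be simulated locally with
a scripted policy. The sketch below does not import the SDK: `StepResult` is a
stand-in for `EnvironmentStepResult`, and `CountdownEnv`, `rollout`, and the
scripted policy are hypothetical names used only for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class StepResult:  # stand-in for EnvironmentStepResult
    done: bool = True
    messages: tuple = ()
    final_response_text: Optional[str] = None


class CountdownEnv:
    """Toy environment: the episode ends when the model replies 'done'."""

    def start_episode(self, example, prompt_text):
        return [{"role": "system", "content": prompt_text},
                {"role": "user", "content": example["input"]}]

    def max_episode_turns(self, example):
        return 3

    def step_episode(self, example, messages, assistant_response):
        if assistant_response.strip() == "done":
            return StepResult(done=True, final_response_text="done")
        reply = {"role": "user", "content": "Not finished, try again."}
        return StepResult(done=False, messages=(reply,))


def rollout(env, example, prompt_text, policy):
    """Drive the hooks in the order the loop above describes."""
    messages = env.start_episode(example, prompt_text)
    response_text = ""
    for _ in range(env.max_episode_turns(example)):
        action = policy(messages)  # one assistant message
        messages.append({"role": "assistant", "content": action})
        result = env.step_episode(example, messages, action)
        if result.final_response_text is not None:
            response_text = result.final_response_text
        else:
            response_text = action
        messages.extend(result.messages)  # environment-side messages only
        if result.done:
            break
    return messages, response_text


# A scripted "policy" that stalls once, then finishes.
scripted = iter(["thinking...", "done"])
transcript, final_text = rollout(
    CountdownEnv(), {"input": "Say done."}, "Be brief.", lambda messages: next(scripted)
)
```

The resulting transcript alternates system, user, assistant, user, assistant,
and `final_text` holds the text that would be passed to scoring.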

### Full multi-turn contract

This is the worker-facing surface Flash expects:

```python theme={null}
from dataclasses import dataclass, field
from typing import Any

from freesolo.datasets import TaskExample
from freesolo.environments import RewardResult

ChatMessage = dict[str, Any]


class EnvironmentMultiTurn:
    def start_episode(
        self,
        example: TaskExample,
        prompt_text: str,
    ) -> list[ChatMessage]: ...

    def max_episode_turns(self, example: TaskExample) -> int: ...

    def step_episode(
        self,
        example: TaskExample,
        messages: list[ChatMessage],
        assistant_response: str,
    ) -> "EnvironmentStepResult": ...

    def score_episode(
        self,
        example: TaskExample,
        episode: "EnvironmentEpisode",
    ) -> RewardResult: ...

    def score_episodes(
        self,
        example: TaskExample,
        episodes: list["EnvironmentEpisode"],
    ) -> list[RewardResult]: ...


@dataclass(frozen=True)
class EnvironmentStepResult:
    done: bool = True
    messages: tuple[ChatMessage, ...] = ()
    final_response_text: str | None = None
    metadata: dict[str, Any] = field(default_factory=dict)


@dataclass(frozen=True)
class EnvironmentTurn:
    role: str
    content: str
    name: str | None = None
    metadata: dict[str, Any] = field(default_factory=dict)


@dataclass(frozen=True)
class EnvironmentEpisode:
    messages: tuple[ChatMessage, ...]
    response_text: str
    turns: tuple[EnvironmentTurn, ...] = ()
    metadata: dict[str, Any] = field(default_factory=dict)
    latency_ms: int | None = None
    total_tokens: int | None = None
```

`EnvironmentStepResult.messages` should contain only new environment-side
messages to append after the assistant action. Do not repeat the assistant
message Flash just sampled. `metadata` is appended to the completed episode's
metadata under `steps`, so it is available to `score_episode` and logs. Keep
durable episode state in the transcript rather than in hidden mutable
attributes.

### An action protocol the model can emit as text

The trained policy is an open model, not a tool-use API, so define a plain-text
protocol it can produce and your `step_episode` can parse — for example, a
tool-call block versus a final reply:

```python environment.py theme={null}
import json
import re

from freesolo.datasets import TaskExample
from freesolo.environments import (
    EnvironmentEpisode,
    EnvironmentMultiTurn,
    EnvironmentStepResult,
    RewardResult,
)

TOOL_CALL = re.compile(r"<tool_call>\s*(.*?)\s*</tool_call>", re.DOTALL)

def run_tool(name: str, args: dict) -> dict:
    ...  # your task-specific tool or service

class AgentEnv(EnvironmentMultiTurn):
    def __init__(self, *, split: str = "train") -> None:
        self.dataset = [{"input": "Refund my last order.", "output": "refund_order"}]

    def start_episode(self, example: TaskExample, prompt_text: str):
        return [
            {"role": "system", "content": "Call tools with <tool_call>{...}</tool_call>; reply in plain text when done."},
            {"role": "user", "content": example.input},
        ]

    def max_episode_turns(self, example: TaskExample) -> int:
        return 8

    def step_episode(self, example, messages, assistant_response):
        blocks = TOOL_CALL.findall(assistant_response)
        if blocks:  # a TOOL turn: run the calls, feed results back, stay open
            results = [run_tool(**json.loads(b)) for b in blocks]
            return EnvironmentStepResult(
                done=False,
                messages=({"role": "user", "content": f"<tool_result>{json.dumps(results)}</tool_result>"},),
            )
        # a REPLY turn: no tool call -> the model is done
        return EnvironmentStepResult(done=True, final_response_text=assistant_response)

    def score_episode(self, example: TaskExample, episode: EnvironmentEpisode) -> RewardResult:
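        # Collect the tool names the model called, looking only at assistant messages.
        used = [
            json.loads(block)["name"]
            for message in episode.messages
            if message["role"] == "assistant"
            for block in TOOL_CALL.findall(message["content"])
        ]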
        hit = str(example.output) in used
        return RewardResult(score=1.0 if hit else 0.0, threshold=1.0)

def load_environment(split: str = "train", **kwargs) -> AgentEnv:
    return AgentEnv(split=split)
```
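
The example above passes every block straight to `json.loads`, so a malformed
block from the model would raise inside `step_episode`. One possible hardening
is sketched below. The helper name `parse_tool_calls` is hypothetical, and the
`{"name": ..., "args": {...}}` payload shape is inferred from the
`run_tool(name, args)` signature in the example.

```python
import json
import re

TOOL_CALL = re.compile(r"<tool_call>\s*(.*?)\s*</tool_call>", re.DOTALL)


def parse_tool_calls(assistant_response: str) -> tuple:
    """Split tool-call blocks into well-formed calls and readable error strings."""
    calls, errors = [], []
    for block in TOOL_CALL.findall(assistant_response):
        try:
            payload = json.loads(block)
        except json.JSONDecodeError as exc:
            errors.append(f"invalid JSON: {exc.msg}")
            continue
        if not isinstance(payload, dict) or not isinstance(payload.get("name"), str):
            errors.append("tool call must be an object with a string 'name'")
            continue
        args = payload.get("args", {})
        if not isinstance(args, dict):
            errors.append("'args' must be an object")
            continue
        calls.append({"name": payload["name"], "args": args})
    return calls, errors
```

A `step_episode` built on this could return the error strings as an
environment-side message with `done=False`, so the model gets a chance to retry
instead of the rollout failing. Whether to allow retries is a design choice for
your task.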

### SFT targets for multi-turn

SFT does not execute the multi-turn loop. It builds one supervised row as:

```text theme={null}
prompt_messages = env.start_episode(example, prompt_text)
target_messages = env.sft_completion(example)
training_text = chat_template(prompt_messages + target_messages)
```

The default `sft_completion(example)` turns `example.output` into the target
messages — a scalar becomes one assistant message, and a message list (or
`{"messages": [...]}`) becomes the full transcript (see
[Datasets](/guides/datasets#message-shaped-sft-targets)).
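
The mapping described above can be approximated in a few lines. This is a
sketch of the documented behaviour, not the SDK's implementation: the function
name `completion_messages` is hypothetical, and returning an empty list for a
missing output is an assumption.

```python
def completion_messages(output) -> list:
    """Approximate the described mapping from `example.output` to target messages."""
    if output is None:
        return []  # assumption: no gold answer means no target messages
    if isinstance(output, dict) and isinstance(output.get("messages"), list):
        return list(output["messages"])  # {"messages": [...]} form
    if isinstance(output, list):
        return list(output)  # message-list form: the full target transcript
    return [{"role": "assistant", "content": str(output)}]  # scalar form
```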

Flash uses completion-only loss for SFT. It masks the token prefix rendered from
`start_episode(...)` and trains on every token in `sft_completion(...)`. For a
multi-turn target transcript, that means any interleaved assistant, user, and
tool/environment messages in `output` are part of the supervised completion. If
a multi-turn environment has only scalar outputs, SFT collapses to one assistant
turn per row and does not run `step_episode`; provide full target transcripts or
override `sft_completion` when you need multi-turn SFT.
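
Completion-only loss is commonly implemented by giving prompt positions a label
that the loss function ignores. The sketch below uses `-100`, the default
`ignore_index` of PyTorch's cross-entropy loss; whether Flash uses exactly this
value internally is an assumption, and the token ids are made up for
illustration.

```python
IGNORE_INDEX = -100  # default ignore_index of PyTorch's cross-entropy loss


def completion_only_labels(prompt_ids: list, completion_ids: list) -> tuple:
    """Return (input_ids, labels) where prompt positions are excluded from the loss."""
    input_ids = list(prompt_ids) + list(completion_ids)
    # Prompt tokens are masked; every completion token is a training target.
    labels = [IGNORE_INDEX] * len(prompt_ids) + list(completion_ids)
    return input_ids, labels


# Three prompt tokens (from start_episode) and two completion tokens (from sft_completion).
input_ids, labels = completion_only_labels([11, 12, 13], [21, 22])
```

With a multi-turn target transcript, the completion ids would cover every
message in the target, including user and tool messages, which is why those
tokens are also trained on.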

### Keep `step_episode` stateless

The SDK doesn't hand you a per-episode scratch object, and GRPO runs **many
rollouts of the same example concurrently** — so don't stash mutable world-state
on the env keyed by example. If your task has state (an API or service that
changes as tools run), make the **transcript the source of truth** and rebuild
state by replaying `messages` inside `step_episode` / `score_episode`. Replay is
reproducible as long as your service is deterministic given the same calls (seed
any ids; avoid wall-clock-dependent logic in the reward).
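
A minimal sketch of replay-based state, assuming the `<tool_call>` protocol from
the earlier example and a hypothetical order service with a `refund_order` tool
that takes an `order_id` argument. The function name `replay_orders` and the
order statuses are illustrative.

```python
import json
import re

TOOL_CALL = re.compile(r"<tool_call>\s*(.*?)\s*</tool_call>", re.DOTALL)


def replay_orders(messages: list, initial_orders: dict) -> dict:
    """Rebuild order statuses by replaying every assistant tool call in order."""
    orders = dict(initial_orders)  # copy, so concurrent rollouts never share state
    for message in messages:
        if message.get("role") != "assistant":
            continue
        for block in TOOL_CALL.findall(message.get("content") or ""):
            try:
                call = json.loads(block)
            except json.JSONDecodeError:
                continue  # malformed calls change nothing
            args = call.get("args") if isinstance(call, dict) else None
            if not isinstance(args, dict):
                continue
            order_id = str(args.get("order_id"))
            if call.get("name") == "refund_order" and orders.get(order_id) == "paid":
                orders[order_id] = "refunded"
    return orders
```

Because the state is derived from `messages` alone, calling the function twice
on the same transcript gives the same result, and nothing is stored on the
environment instance between rollouts.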

### Score behaviours, not a script

Open-ended tasks rarely have one correct transcript, so don't grade by imitation.
Score the behaviours you care about — did it take the resolving action, follow
required ordering (e.g. authenticate before a privileged call), stay within
limits — and return that blend from `score_episode`. Keeping the reward in a
plain function (importable without the trainer) makes it unit-testable.
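
One way to blend such checks is sketched below as a plain function over the
list of tool names the model called. The tool names `authenticate` and
`refund_order`, the call limit, and the 6/3/1 weighting are all illustrative
assumptions; tune them to your task.

```python
def score_behaviours(tool_names: list, max_calls: int = 4) -> float:
    """Blend three behaviour checks into one score in [0, 1]."""
    # Did the model take the resolving action at all?
    resolved = "refund_order" in tool_names
    # Ordering: authenticate must come before the first privileged call.
    ordered = (
        resolved
        and "authenticate" in tool_names
        and tool_names.index("authenticate") < tool_names.index("refund_order")
    )
    # Limits only earn credit when the task was resolved, so doing nothing scores 0.
    within_limit = resolved and len(tool_names) <= max_calls
    # Integer weights avoid floating-point drift when summing.
    return (6 * resolved + 3 * ordered + 1 * within_limit) / 10
```

`score_episode` would extract the tool names from `episode.messages`, call this
function, and wrap the result in a `RewardResult`.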

### Configure for multi-turn

For multi-turn runs, the episode budget comes from `max_episode_turns`. The
`[train]` settings that matter are:

* **`max_tokens`** is *per assistant action* (one tool call or one reply), not per
  episode — keep it modest.
* **`max_length`** must hold the system prompt **plus the whole running transcript**
  (every turn, every tool result), so set it well above a single-turn task.
* **`group_size`** is the number of full episodes rolled out per prompt for the
  GRPO advantage estimate.
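
A rough way to size `max_length` is to add up the worst-case transcript. The
sketch below is back-of-the-envelope arithmetic with made-up token counts, and
it ignores chat-template overhead; the function name `min_max_length` is
hypothetical.

```python
def min_max_length(system_tokens: int, first_user_tokens: int, turns: int,
                   max_tokens: int, env_reply_tokens: int) -> int:
    """Worst-case transcript size: opening messages plus every action and env reply."""
    return system_tokens + first_user_tokens + turns * (max_tokens + env_reply_tokens)


# Illustrative numbers: 8 turns, 256 tokens per action, tool results up to 300 tokens.
required = min_max_length(
    system_tokens=200, first_user_tokens=100, turns=8, max_tokens=256, env_reply_tokens=300
)
# 200 + 100 + 8 * (256 + 300) = 4748
```

Under these assumptions `max_length` would need to be at least 4748, plus some
headroom for template tokens.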

## Publish it

Training runs on managed infrastructure, so the environment has to be reachable by id.
Publish your local environment folder to the managed Environments Hub:

```bash theme={null}
flash env push --name math math
```

The command prints the published id (for example `your-org/math`). Set that
value as `id` under `[environment]` in your config.

## Use an existing environment

If you already have a published env id (yours or one shared with you), reference
it directly. You can also reference a GitHub environment source with a
`github:owner/repo@ref:path` ref or a GitHub URL.

```toml theme={null}
[environment]
id = "your-org/your-env"
```

If you want the source locally for editing or inspection, pull the whole
environment into a directory:

```bash theme={null}
flash env pull your-org/your-env
```

To pull a single file, add its path inside the environment after the id. Use
`-o` to choose the destination and `-f` to overwrite an existing output:

```bash theme={null}
flash env pull your-org/your-env dataset/train.jsonl -o train.jsonl
```

## List what you have

```bash theme={null}
flash env list
```

This lists the local environment sources you can publish, such as
`./environment.py` or folders under `environments/`.

## Delete a Hub environment

`flash env delete` only targets managed Hub ids of the form `namespace/name`;
GitHub refs and local paths are not Hub records and cannot be deleted this way.

```bash theme={null}
flash env delete your-org/your-env
flash env delete your-org/your-env -y   # skip the confirmation prompt
```

## Pass parameters

If your `load_environment` accepts keyword arguments, set them under
`[environment.params]`. Flash passes them through as
`load_environment(**params)`:

```toml theme={null}
[environment]
id = "your-org/your-env"

[environment.params]
difficulty = "hard"
num_examples = 500
```
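
On the Python side, those values arrive as keyword arguments. The sketch below
shows how a parameterised loader might filter and cap its rows. It is plain
Python rather than a full environment: `select_rows` stands in for the body of
`load_environment`, and the `difficulty` field on each row is a hypothetical
dataset column.

```python
from typing import Optional

ALL_ROWS = [
    {"input": "1 + 1", "output": "2", "difficulty": "easy"},
    {"input": "12 * 12", "output": "144", "difficulty": "hard"},
    {"input": "17 * 23", "output": "391", "difficulty": "hard"},
]


def select_rows(difficulty: str = "easy", num_examples: Optional[int] = None,
                **kwargs) -> list:
    """Filter and cap rows the way a parameterised load_environment might."""
    rows = [row for row in ALL_ROWS if row["difficulty"] == difficulty]
    return rows if num_examples is None else rows[:num_examples]


# The [environment.params] table arrives as keyword arguments.
params = {"difficulty": "hard", "num_examples": 1}
selected = select_rows(**params)
```

Giving every parameter a default keeps the environment loadable when a config
omits `[environment.params]`, and accepting `**kwargs` tolerates parameters the
function does not use.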

## Use API keys

If your environment needs an external service, read the key from `os.environ` in
`environment.py`:

```python environment.py theme={null}
import os

SERVICE_API_KEY = os.environ["SERVICE_API_KEY"]
```

Then declare the environment variable names in your training config:

```toml theme={null}
[environment]
id = "your-org/your-env"
secrets = ["SERVICE_API_KEY"]
```

Set the value in your shell before submitting:

```bash theme={null}
export SERVICE_API_KEY="..."
flash train configs/sft.toml
```

You can also put local development values in `.env` or `.env.local`:

```bash theme={null}
SERVICE_API_KEY=...
```

Flash sends declared secret values to the worker out-of-band. Secret values are
not stored in the TOML config, run spec, status JSON, logs, or environment
artifact. Do not put API keys in `[environment.params]`.
If a declared secret is missing when you submit, `flash train` fails before the
run starts.
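
If you want a similar check in your own scripts or tests, the sketch below
reports which declared names are unset in the current process. It is a local
convenience only, not how `flash train` performs its check, and it does not read
`.env` or `.env.local` files. The function name `missing_secrets` is
hypothetical.

```python
import os


def missing_secrets(declared: list, environ=os.environ) -> list:
    """Return declared secret names that are unset or empty in the environment."""
    return [name for name in declared if not environ.get(name)]


# Names copied from the `secrets` list in the training config.
declared_secrets = ["SERVICE_API_KEY"]
unset = missing_secrets(declared_secrets, environ={})  # simulate an empty shell
```

Passing `environ` explicitly keeps the function easy to test without touching
real credentials.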
