# Datasets

> Package task records and data files with Freesolo environments.

Datasets live inside the [environment](/environment-model). A [Flash config](/reference/configuration) points at one published
environment id, and the environment's `load_environment()` function returns the
dataset, prompt builder, and reward logic that Flash uses for [SFT, GRPO](/guides/training#choose-sft-or-grpo), and
local validation.

```toml theme={null}
[environment]
id = "your-org/your-env"
```

## Task records

Author dataset rows with `input` and `output`, plus an optional `metadata` dict:

| Dataset key | Description                                                                                                                      |
| ----------- | -------------------------------------------------------------------------------------------------------------------------------- |
| `input`     | Prompt text for the model.                                                                                                       |
| `output`    | Target answer or gold completion. For SFT this can be a scalar answer, `{ "messages": [...] }`, or a bare list of chat messages. |
| `metadata`  | Optional dict preserved on `example.metadata` and available to scoring.                                                          |

<Note>
  Which column the model learns from depends on the algorithm. **SFT** trains
  directly on `output`, the gold answer. **GRPO** (RL) uses only `input`: the
  model generates its own answers from the prompt and learns from your reward,
  so `output` is optional there and is read only if your `score_response` uses
  it as a reference.
</Note>

`load_task_examples(...)` accepts a local file path or an iterable of records.

* **File formats:** `.jsonl`, `.json`, `.csv`, `.txt`, or `.bson`.
* **Field mapping:** `input` -> `example.input`, `output` -> `example.output`, `metadata` -> `example.metadata`.
* **Original row:** the untouched record stays available as `example.record`.

```jsonl dataset/train.jsonl theme={null}
{"input":"What is 2 + 2?","output":"4"}
{"input":"What is 3 + 5?","output":"8"}
```

Each row must contain `input` and may contain `output` and `metadata`; alternate
prompt or target key names are not accepted. Records are canonicalized to
exactly `input`/`output`/`metadata`.
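
The row shape described above can be checked before a run with a few lines of standard-library Python. This is a minimal sketch, not part of the Freesolo SDK: `lint_jsonl_lines` is a hypothetical helper, and skipping blank lines and accepting a null `metadata` are assumptions about how lenient the loader is.

```python
import json

CANONICAL_KEYS = {"input", "output", "metadata"}


def lint_jsonl_lines(lines):
    """Return (line_number, problem) pairs for rows that break the canonical shape."""
    problems = []
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # assumption: blank lines are harmless
        try:
            row = json.loads(line)
        except json.JSONDecodeError as error:
            problems.append((number, f"invalid JSON: {error.msg}"))
            continue
        if not isinstance(row, dict):
            problems.append((number, "row is not a JSON object"))
            continue
        if "input" not in row:
            problems.append((number, "missing `input`"))
        metadata = row.get("metadata")
        if metadata is not None and not isinstance(metadata, dict):
            problems.append((number, "`metadata` is not a dict"))
        # Any other top-level key is outside the canonical record.
        extras = sorted(set(row) - CANONICAL_KEYS)
        if extras:
            problems.append((number, f"non-canonical keys: {extras}"))
    return problems


sample = [
    '{"input":"What is 2 + 2?","output":"4"}',
    '{"prompt":"What is 3 + 5?","output":"8"}',
]
for number, problem in lint_jsonl_lines(sample):
    print(f"line {number}: {problem}")
```

To lint a real file, pass an open file handle, for example `lint_jsonl_lines(open("dataset/train.jsonl", encoding="utf-8"))`.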

<Warning>
  When Flash builds your training records it keeps only
  `input`/`output`/`metadata` and **silently drops** every other top-level key
  before the row reaches a training worker. Anything your scorer needs beyond
  the gold `output` string (a puzzle's `initial_board`, the `oracle_ids` a
  retrieval must return, unit tests to check code against, a grading rubric) has
  to live under `metadata`, or it is gone with no runtime warning.
</Warning>
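
One way to avoid losing fields is to fold every extra top-level key into `metadata` while authoring the dataset. The helper below is a hypothetical authoring-time sketch (`fold_extras_into_metadata` is not a Freesolo function); it only rearranges plain dicts before you write them to JSONL.

```python
import json

CANONICAL_KEYS = {"input", "output", "metadata"}


def fold_extras_into_metadata(row: dict) -> dict:
    """Return a copy of row with non-canonical top-level keys moved under metadata."""
    metadata = dict(row.get("metadata") or {})
    for key, value in row.items():
        if key in CANONICAL_KEYS:
            continue
        # Refuse to overwrite a metadata value the author set explicitly.
        if key in metadata:
            raise ValueError(f"{key!r} exists both top-level and in metadata")
        metadata[key] = value
    cleaned = {key: row[key] for key in ("input", "output") if key in row}
    if metadata:
        cleaned["metadata"] = metadata
    return cleaned


row = {
    "input": "Solve the puzzle.",
    "output": "done",
    "initial_board": [[0, 1], [1, 0]],
    "metadata": {"difficulty": "easy"},
}
print(json.dumps(fold_extras_into_metadata(row)))
```

After folding, the scorer would read the value from `example.metadata` rather than from a top-level key.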

### Message-shaped SFT targets

For SFT, `output` is the gold completion appended after the environment's
prompt messages. A scalar output becomes one assistant message. If you need to
teach a multi-turn trajectory or native tool calling, set `output` to
`{"messages": [...]}` or to a bare list of chat messages. Flash preserves those
assistant, tool-call, tool-result, and reply messages when it builds the SFT
example.

```jsonl dataset/train.jsonl theme={null}
{"input":"Refund my last order.","output":{"messages":[{"role":"assistant","content":null,"tool_calls":[{"id":"call_refund","type":"function","function":{"name":"refund_order","arguments":"{\"order\":\"last\"}"}}]},{"role":"tool","tool_call_id":"call_refund","content":"{\"ok\":true}"},{"role":"assistant","content":"Your last order has been refunded."}]}}
```
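
Rows like the one above are easy to get wrong by hand because `arguments` and the tool result are JSON strings nested inside JSON. A small generator script can do the escaping for you. This is an illustrative sketch only: `tool_call_row` is a hypothetical helper, and the message layout simply mirrors the example row.

```python
import json


def tool_call_row(prompt, call_id, tool_name, arguments, tool_result, reply):
    """Build one SFT row whose output is a trajectory with a single tool call."""
    messages = [
        {
            "role": "assistant",
            "content": None,
            "tool_calls": [
                {
                    "id": call_id,
                    "type": "function",
                    # `arguments` is a JSON-encoded string, not a nested object.
                    "function": {"name": tool_name, "arguments": json.dumps(arguments)},
                }
            ],
        },
        # The tool result refers back to the call through the same id.
        {"role": "tool", "tool_call_id": call_id, "content": json.dumps(tool_result)},
        {"role": "assistant", "content": reply},
    ]
    return {"input": prompt, "output": {"messages": messages}}


row = tool_call_row(
    prompt="Refund my last order.",
    call_id="call_refund",
    tool_name="refund_order",
    arguments={"order": "last"},
    tool_result={"ok": True},
    reply="Your last order has been refunded.",
)
print(json.dumps(row))  # one line, ready to append to dataset/train.jsonl
```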

Use `Environment.sft_completion(example)` if your environment needs to
synthesize or transform the gold completion before SFT.

### Validate thinking-model SFT targets

SFT on a thinking model (`thinking = true`) expects each gold completion to
literally contain a `<think>...</think>` block. Catch missing blocks locally
before submitting a run:

```python theme={null}
from freesolo.datasets import load_dataset, warn_missing_think_tags

dataset = load_dataset("dataset/train.jsonl")
missing = warn_missing_think_tags(dataset.examples)  # UserWarning + list of offending ids
```

`warn_missing_think_tags` returns the ids of the offending examples and emits a
`UserWarning` naming the first few. Unlabeled records (no `output`) are skipped.
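
If you want to run a similar check on raw rows without the SDK installed, it can be approximated with a regular expression. The sketch below is not the implementation of `warn_missing_think_tags`; `rows_missing_think` is a hypothetical helper, and treating a message-shaped `output` as the concatenation of its assistant messages is an assumption.

```python
import re

# A literal <think>...</think> block, possibly spanning several lines.
THINK_BLOCK = re.compile(r"<think>.*?</think>", re.DOTALL)


def completion_text(output) -> str:
    """Flatten a scalar or message-shaped output into one string for inspection."""
    if isinstance(output, dict):
        output = output.get("messages", [])
    if isinstance(output, list):
        return "\n".join(
            str(message.get("content") or "")
            for message in output
            if message.get("role") == "assistant"
        )
    return str(output)


def rows_missing_think(rows):
    """Return indexes of labeled rows whose completion lacks a think block."""
    missing = []
    for index, row in enumerate(rows):
        if row.get("output") is None:
            continue  # unlabeled rows are skipped
        if not THINK_BLOCK.search(completion_text(row["output"])):
            missing.append(index)
    return missing


rows = [
    {"input": "What is 2 + 2?", "output": "<think>2 + 2 = 4</think>4"},
    {"input": "What is 3 + 5?", "output": "8"},
    {"input": "What is 1 + 1?"},
]
print(rows_missing_think(rows))
```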

## Load sidecars

Resolve packaged files relative to `__file__` so the same path works locally
and when the environment runs on a worker.

```python environment.py theme={null}
from pathlib import Path
from freesolo.datasets.records import load_task_examples
from freesolo.environments import EnvironmentSingleTurn

ROOT = Path(__file__).parent


class MathEnv(EnvironmentSingleTurn):
    def __init__(self, *, split: str = "train") -> None:
        # read a packaged dataset file relative to environment.py
        self.dataset = load_task_examples(ROOT / "dataset" / f"{split}.jsonl")
```

The rest of the env class (`build_prompt_messages`, `score_response`) is covered
in [Environments](/guides/environments#use-the-sdk).

Then select the split from your Flash config:

```toml theme={null}
[environment]
id = "your-org/math"

[environment.params]
split = "eval"
```

<Note>
  `[environment.params]` values are passed to your `load_environment(**kwargs)`.
  Flash itself also honors `split`: for an environment packaged with dataset
  files, `split = "eval"` selects `dataset/eval.jsonl` (or `dataset/eval.json`)
  as the dataset Flash trains on, for both SFT targets and GRPO problem
  selection. If the environment packages a default train split but the
  requested split file does not exist, the run fails at load time instead of
  silently falling back to `train.jsonl`. An explicit `dataset_path` param takes
  precedence over `split`.
</Note>
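
The resolution rules in the note can be mirrored inside your own environment so that local behavior matches what Flash does. The function below is a sketch under stated assumptions, not Flash's code: `resolve_dataset_file` is a hypothetical helper, trying `.jsonl` before `.json` is a guess at the precedence, and `dataset_path` is returned exactly as given.

```python
from pathlib import Path
from typing import Optional


def resolve_dataset_file(
    root: Path, split: str = "train", dataset_path: Optional[str] = None
) -> Path:
    """Pick the dataset file for a split, failing loudly if it is missing."""
    # An explicit dataset_path takes precedence over split.
    if dataset_path is not None:
        return Path(dataset_path)
    # Assumed order: prefer <split>.jsonl, then <split>.json.
    for suffix in (".jsonl", ".json"):
        candidate = root / "dataset" / f"{split}{suffix}"
        if candidate.is_file():
            return candidate
    # No silent fallback to train.jsonl.
    raise FileNotFoundError(f"no dataset file for split {split!r} under {root / 'dataset'}")
```

Inside `environment.py` you would call it with `Path(__file__).parent` as `root` and the `split` value received from `[environment.params]`.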

## What gets uploaded

For a local environment directory, `flash env push` includes:

* `environment.py`, always at the artifact root.
* Sibling Python helper files.
* The sidecar directory named `dataset`.
* Common sibling data files such as `.jsonl`, `.json`, `.csv`, `.txt`,
  `.md`, `.parquet`, `.tsv`, `.yaml`, and `.yml`.

Workspace metadata, cache directories, virtualenvs, and version-control files
are skipped. Keep the artifact small: environment uploads are capped at 64 MB
compressed and 256 MB uncompressed. For large corpora, keep the data in an
external store and pass the identifier or URL through `[environment.params]`.
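
Before pushing, you can get a rough estimate of where a directory stands against those caps. This is only an approximation under several assumptions: the real archive format and the exact file selection used by `flash env push` are not documented here, so the sketch uses a gzip-compressed tar of every file outside a few skipped directories, and it treats 1 MB as 1024 × 1024 bytes. `artifact_sizes` and `within_caps` are hypothetical helpers.

```python
import io
import tarfile
from pathlib import Path

MAX_COMPRESSED = 64 * 1024 * 1024     # assumed meaning of "64 MB"
MAX_UNCOMPRESSED = 256 * 1024 * 1024  # assumed meaning of "256 MB"
SKIP_DIRS = {".git", ".venv", "venv", "__pycache__"}  # illustrative skip list


def artifact_sizes(env_dir: Path) -> tuple[int, int]:
    """Return (uncompressed_bytes, gzip_tar_bytes) for files under env_dir."""
    files = [
        path
        for path in sorted(env_dir.rglob("*"))
        if path.is_file()
        and not SKIP_DIRS.intersection(path.relative_to(env_dir).parts)
    ]
    uncompressed = sum(path.stat().st_size for path in files)
    # Build the archive in memory; fine for artifacts near the documented caps.
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:gz") as archive:
        for path in files:
            archive.add(path, arcname=str(path.relative_to(env_dir)))
    return uncompressed, buffer.getbuffer().nbytes


def within_caps(env_dir: Path) -> bool:
    """True if both estimated sizes fit under the documented upload caps."""
    uncompressed, compressed = artifact_sizes(env_dir)
    return compressed <= MAX_COMPRESSED and uncompressed <= MAX_UNCOMPRESSED
```

Calling `within_caps(Path("."))` from the environment directory gives a quick go/no-go signal; treat a result close to either limit as a reason to move data to an external store.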
