Configuration reference - Freesolo Docs

A run is described by a TOML config passed to `flash train`. The only required fields are `model` and `environment.id`. `algorithm` defaults to `sft`, and everything else has a sensible default. Validate any config locally with `flash train config.toml --dry-run`.

model = "Qwen/Qwen3.5-4B"
algorithm = "sft"

[environment]
id = "your-org/your-env"

[train]
epochs = 3
max_examples = 1000
lora_rank = 32

Top level

Key	Type	Default	Description
`model`	string	(required)	Base model to train a LoRA adapter on, e.g. `Qwen/Qwen3.5-4B`. See `flash models`.
`algorithm`	string	`sft`	`sft` (supervised): learns by imitating example completions you provide. `grpo` (RL): learns from a reward you define. See SFT vs GRPO.
`thinking`	bool	`false`	Enable reasoning mode (thinking-capable models only). The reasoning trace shares the `max_tokens` budget with the answer, so raise `max_tokens` (and `max_length`) when enabling it — see Troubleshooting.
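Because the reasoning trace and the answer draw from the same `max_tokens` budget, a long trace can leave little room for the answer. The arithmetic can be sketched as follows; the helper name and the token counts are illustrative assumptions, not recipe defaults.

```python
def remaining_answer_tokens(max_tokens: int, reasoning_tokens: int) -> int:
    """Tokens left for the answer once the reasoning trace has used its share."""
    return max(0, max_tokens - reasoning_tokens)


# Illustrative numbers only, not recipe defaults.
print(remaining_answer_tokens(1024, 900))  # prints 124
print(remaining_answer_tokens(4096, 900))  # prints 3196
```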

`[environment]`

Key	Type	Default	Description
`id`	string	(required)	Published Freesolo environment id, `your-org/your-env`.
`params`	table	`{}`	Keyword arguments passed to the env’s `load_environment(**kwargs)`. Flash also honors `split` for packaged env datasets: it selects `dataset/<split>.jsonl`, and a missing split file is an error rather than a silent fallback to `train.jsonl`. See Datasets.
`pip`	list[string]	`[]`	Extra third-party pip requirements the worker installs before importing your environment (an escape hatch for deps your `environment.py` imports).
`secrets`	list[string]	`[]`	Environment variable names to forward to the worker as runtime secrets. Values are read from your shell, `.env`, or `.env.local` at submit time.
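The `split` rule in the `params` row can be pictured with a small sketch. The function below is a hypothetical illustration of the documented behavior (select `dataset/<split>.jsonl`, fail if it is missing); it is not Flash's code, and `env_root` is an assumed name for the packaged environment directory.

```python
from pathlib import Path


def resolve_split_file(env_root: Path, split: str) -> Path:
    """Return dataset/<split>.jsonl under env_root, or raise if it is missing.

    There is deliberately no fallback to train.jsonl.
    """
    path = env_root / "dataset" / f"{split}.jsonl"
    if not path.is_file():
        raise FileNotFoundError(f"no dataset file for split {split!r}: {path}")
    return path
```

Failing loudly here means a misspelled split name cannot quietly train on the wrong data.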

[environment]
id = "your-org/your-env"
pip = ["openai>=1.0.0"]
secrets = ["SERVICE_API_KEY"]

Environment code reads the secret normally:

import os

service_api_key = os.environ["SERVICE_API_KEY"]

`[environment].pip` is the worker install path for reward and environment dependencies. Do not rely on `pyproject.toml`, `requirements.txt`, or lockfiles inside the published environment artifact for managed training installs. Those files may describe your local development environment, but runtime packages belong in this config list.

Never put secret values in `[environment.params]`; it becomes part of the run spec. Every name you list under `[environment].secrets` must resolve to a value in your shell, `.env`, or `.env.local` when you submit. Reserved platform secret names such as `FREESOLO_API_KEY`, `HF_TOKEN`, `GITHUB_TOKEN`, `RUN_ID`, and `HF_REPO` are owned by Flash.
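A local pre-submit check for secret names might look like the sketch below. It is an assumption-laden illustration, not Flash's resolver: the helper names are hypothetical, the `.env` parser handles only simple `KEY=VALUE` lines, the precedence between the shell and the two files is assumed, and the reserved set contains only the names this page lists as examples.

```python
import os
from pathlib import Path

# Only the reserved names mentioned on this page; the real list may be longer.
RESERVED = {"FREESOLO_API_KEY", "HF_TOKEN", "GITHUB_TOKEN", "RUN_ID", "HF_REPO"}


def read_dotenv(path: Path) -> dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values: dict[str, str] = {}
    if not path.is_file():
        return values
    for line in path.read_text().splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip("\"'")
    return values


def check_secret_names(names, environ=None, dotenv_dir=Path(".")):
    """Return one problem string per name that is reserved or has no value."""
    environ = os.environ if environ is None else environ
    available: dict[str, str] = {}
    # Assumed precedence: .env, then .env.local, then the shell.
    for filename in (".env", ".env.local"):
        available.update(read_dotenv(dotenv_dir / filename))
    available.update(environ)
    problems = []
    for name in names:
        if name in RESERVED:
            problems.append(f"{name}: reserved platform secret name")
        elif not available.get(name):
            problems.append(f"{name}: no value found")
    return problems
```

For example, `check_secret_names(["SERVICE_API_KEY"])` returns an empty list when `SERVICE_API_KEY` is set in the shell or in one of the two files in the current directory.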

`[train]`

Key	Type	Default	Description
`lora_rank`	int	`32`	LoRA rank; higher means a larger adapter with more capacity and cost. Use the default unless you have a reason to increase it; Flash validates deployability for the selected base model.
`lora_alpha`	int	`64`	LoRA alpha.
`learning_rate`	float	recipe default	Optimizer learning rate.
`batch_size`	int	recipe default	Training batch size.
`max_length`	int	recipe default	Max sequence length.
`save_every`	int	recipe default	Checkpoint cadence, in steps. Every save is uploaded. If a save fires while a previous upload is still in flight, it is queued (newest wins) and uploaded next, and the final checkpoint is flushed at the end of training. Saves are not skipped on a slow uplink.
`init_from_adapter`	string	-	Warm-start from an existing adapter. Use the run id that `flash status` prints, e.g. `init_from_adapter = "<run-id>"`. To warm-start from a saved checkpoint instead of the run-level adapter, use the exact short ref listed by `flash checkpoints`: `"<run-id>/step-N"`. Longer storage refs are rejected in config; use the short refs printed by Flash.

SFT-specific

Key	Type	Description
`epochs`	int	Number of epochs (alternative to GRPO `steps`).
`max_steps`	int	Cap on optimizer steps (`0` = no cap).
`max_examples`	int	Truncate the training dataset to N examples (`0` = no cap).

GRPO-specific

Key	Type	Description
`steps`	int	Number of training steps (default 150).
`group_size`	int	Completions sampled per prompt.
`temperature`	float	Sampling temperature for rollouts.
`max_tokens`	int	Max tokens per rollout completion.
`kl_penalty_coef`	float	Strength of the penalty that keeps the trained model from drifting too far from the base model.
`advantage_clip`	float	Recorded for recipe parity but currently a no-op (TRL clips the importance ratio, not the advantage value).
`thinking_length_penalty_coef`	float	Penalty on reasoning length.
`stop_sequences`	list[string]	Stop sequences for generation.

Managed infrastructure

Flash chooses the training resources, retry path, wall-clock limits, checkpointing, and run artifact storage. Those platform-managed fields may appear in resolved specs and run status for observability, but they are not config knobs.

`[worker_env]` (advanced)

`[worker_env]` passes non-secret string values into the worker environment and is serialized into the run spec and artifacts. Use it only for harmless labels or feature flags. Secret-looking keys and values are rejected; runtime secrets belong in `[environment].secrets`.

[worker_env]
FEATURE_FLAG = "enabled"

`[wandb]`

Optional Weights & Biases logging labels. These values are non-secret; set `WANDB_API_KEY` in your local environment when submitting a run to enable logging to your own W&B account.

Key	Type	Description
`project`	string	W&B project name.
`run_name`	string	W&B run name.

Overrides & composition

Any value can be set at submit time without editing the file. Use `--set` (repeatable, dotted keys) to override individual values, and `--config` to deep-merge an extra TOML file:

flash train config.toml --set train.steps=300 --set train.lora_rank=16
flash train base.toml --config overlay.toml      # deep-merge extra TOML