How Flash works - Freesolo Docs

If you’ve heard of “post-training” or “fine-tuning” but never shipped one, this page builds the intuition: the few concepts that matter, and how Flash runs the whole managed loop.

What post-training does

A base model like Qwen3.5 arrives from pre-training already fluent, but generic: it has read a lot of the internet and is good at everything in general, nothing in particular. Post-training is the step that shapes that general model into one that’s reliably good at your task, using your data and your definition of a good answer. Flash is a managed post-training service. You describe the task and pick a base model, Flash fine-tunes it on managed infrastructure, and then serves the result behind an OpenAI-compatible API. Reach for post-training when you want consistent behavior, a smaller and cheaper model that matches a big one on a narrow task, or behavior that prompting alone does not reliably produce. More on when to use it below.

The training loop

Every post-training run is the same loop with three moving parts: a base model, an environment (your task plus how to score it), and a training algorithm that improves the model from that score.

The model attempts a prompt

Flash takes a prompt from your environment and the current model produces an answer.

The environment scores it

Your environment grades the answer, either against a known good answer or with a reward function.

The algorithm updates the model

The training algorithm nudges the model’s weights toward answers that score higher.

Repeat

Over many steps the model gets measurably better at the task. The output is a small adapter you can deploy.

Core concepts

Base models and LoRA adapters

Flash trains a LoRA adapter: a small set of extra weights layered on top of the frozen base model. This is parameter-efficient fine-tuning, and it has three practical payoffs:

Cheap and fast to train, because you’re updating a tiny fraction of the parameters.
Small to store and move (megabytes, not gigabytes).
Efficient to serve, because many adapters that share a base model can be served by the same managed service.

Pick the base model with one line in your config. Browse the catalog with flash models, or see Supported models.

Environments: your task, as code

An environment is the task, expressed as code. It bundles two things:

A dataset: the prompts your model practices on (and any gold answers).
A reward: a scoring function that looks at an answer and returns a score.

The environment is the single source of truth for what the model practices on and how it’s graded. Flash uses Freesolo environments. You author one locally, publish it to the managed Environments Hub, and reference it from your config by Freesolo id. See the Environment model for the mental model, or Environments and Datasets to build one.

Two ways to teach: SFT and GRPO

There are two ways to turn that environment into a better model, and you choose between them with one line of config.

SFT: learn by imitation

Supervised fine-tuning. You show the model good answers and it learns to reproduce them. Best when you already have examples of the behavior you want.

GRPO: learn by practice

Reinforcement learning. The model generates attempts, your reward scores them, and the model is pushed toward the higher-scoring ones. Best when good output is easier to score than to write out by hand.

If you can hand the model a stack of ideal answers, start with SFT. If you can’t write the answers but you can tell a good one from a bad one, that scoring function is exactly what GRPO needs. See Training.

Rewards and rollouts (GRPO)

In GRPO, a rollout is one attempt the model generates for a prompt. For each prompt, GRPO samples a group of rollouts (the group_size), scores each one with your environment, and reinforces the rollouts that beat the group’s average. The reward is the score your environment returns. The reward is the teacher. If it reliably separates good answers from bad ones, GRPO can optimize toward it; if the task is so hard that every rollout scores zero, there’s no signal to learn from, so start with a model and task where some attempts succeed.

Serving the result

A trained adapter isn’t useful until you can call it. flash deploy registers your adapter with Freesolo’s managed serving. You talk to it over an OpenAI-compatible API, and serving is billed per token. See Deploy & chat.

Is post-training right for your task?

Post-training shines when the task is narrow and you can define success. Good signs:

You have examples of the behavior you want (favor SFT), or a way to score outputs even when you can’t write them (favor GRPO).
You want a small, cheap model to reliably do one job instead of paying for a frontier model on every call.
Prompting gets you close but not consistent enough.

If you don’t yet have data or a way to grade answers, start there: the quality of your environment sets the ceiling on what training can achieve.

Quickstart

Train, deploy, and chat with your first model in a few minutes.

Training in depth

SFT vs GRPO, config options, monitoring, and cost.

Build an environment

Turn your task into a dataset and a reward.

Supported models

The base models you can fine-tune and serve, with sizes and prices.

​What post-training does

​The training loop

​Core concepts

​Base models and LoRA adapters

​Environments: your task, as code

​Two ways to teach: SFT and GRPO