What post-training does
A base model like Qwen3.5 arrives from pre-training already fluent, but generic: it has read a lot of the internet and is good at everything in general, nothing in particular. Post-training is the step that shapes that general model into one that’s reliably good at your task, using your data and your definition of a good answer. Flash is a managed post-training service. You describe the task and pick a base model, Flash fine-tunes it on managed infrastructure, and then serves the result behind an OpenAI-compatible API. Reach for post-training when you want consistent behavior, a smaller and cheaper model that matches a big one on a narrow task, or behavior that prompting alone does not reliably produce. More on when to use it below.The training loop
Every post-training run is the same loop with three moving parts: a base model, an environment (your task plus how to score it), and a training algorithm that improves the model from that score.The model attempts a prompt
Flash takes a prompt from your environment and the current model produces an
answer.
The environment scores it
Your environment grades the answer, either against a known good answer or
with a reward function.
The algorithm updates the model
The training algorithm nudges the model’s weights toward answers that score
higher.
Core concepts
Base models and LoRA adapters
Flash trains a LoRA adapter: a small set of extra weights layered on top of the frozen base model. This is parameter-efficient fine-tuning, and it has three practical payoffs:- Cheap and fast to train, because you’re updating a tiny fraction of the parameters.
- Small to store and move (megabytes, not gigabytes).
- Efficient to serve, because many adapters that share a base model can be served by the same managed service.
flash models, or see Supported models.
Environments: your task, as code
An environment is the task, expressed as code. It bundles two things:- A dataset: the prompts your model practices on (and any gold answers).
- A reward: a scoring function that looks at an answer and returns a score.
Two ways to teach: SFT and GRPO
There are two ways to turn that environment into a better model, and you choose between them with one line of config.SFT: learn by imitation
Supervised fine-tuning. You show the model good answers and it learns to
reproduce them. Best when you already have examples of the behavior you
want.
GRPO: learn by practice
Reinforcement learning. The model generates attempts, your reward scores
them, and the model is pushed toward the higher-scoring ones. Best when good
output is easier to score than to write out by hand.
Rewards and rollouts (GRPO)
In GRPO, a rollout is one attempt the model generates for a prompt. For each prompt, GRPO samples a group of rollouts (thegroup_size), scores each one
with your environment, and reinforces the rollouts that beat the group’s
average. The reward is the score your environment returns.
The reward is the teacher. If it reliably separates good answers from bad ones,
GRPO can optimize toward it; if the task is so hard that every rollout scores
zero, there’s no signal to learn from, so start with a model and task where some
attempts succeed.
Serving the result
A trained adapter isn’t useful until you can call it.flash deploy registers
your adapter with Freesolo’s managed serving. You talk to it over
an OpenAI-compatible API, and serving is billed per token. See
Deploy & chat.
Is post-training right for your task?
Post-training shines when the task is narrow and you can define success. Good signs:- You have examples of the behavior you want (favor SFT), or a way to score outputs even when you can’t write them (favor GRPO).
- You want a small, cheap model to reliably do one job instead of paying for a frontier model on every call.
- Prompting gets you close but not consistent enough.
Next
Quickstart
Train, deploy, and chat with your first model in a few minutes.
Training in depth
SFT vs GRPO, config options, monitoring, and cost.
Build an environment
Turn your task into a dataset and a reward.
Supported models
The base models you can fine-tune and serve, with sizes and prices.