Deploy & chat - Freesolo Docs

Once a training run reaches done, deploy its adapter and chat with it from the CLI or any OpenAI-compatible client. flash deploy registers your adapter with Freesolo’s managed serving service; after that you send requests to it and pay per token (see Billing).

Deploy

flash deploy <run-id>

Preview what a deploy would do without creating it:

flash deploy <run-id> --dry-run

Skip the server-side post-deploy smoke check. Registration alone does not guarantee the adapter serves, so use this only when you’ll verify separately:

flash deploy <run-id> --no-verify

Deploy a specific checkpoint

Every training step that saved an adapter is independently deployable. List a run’s deployable checkpoints:

flash checkpoints <run-id>

Then serve a specific step instead of the final adapter:

flash deploy <run-id>/step-<N>

You can deploy a checkpoint while a run is still training, or after a GRPO run stops with useful intermediate steps. Checkpoint deployments attach serving metadata to the run while preserving the run’s training state.

A run that never finalized (cancelled or preempted mid-training) has no final adapter, so a plain flash deploy <run-id> fails with an error that lists the saved checkpoint steps and the exact flash deploy <run-id>/step-N command to deploy one of them instead.
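When sweeping over several checkpoints, it can help to generate the deploy commands rather than type them. The following is a minimal sketch, not part of the flash CLI: the helper name checkpoint_deploy_commands, the run id "run-abc", and the step numbers are all hypothetical, and the only thing it assumes about the CLI is the flash deploy <run-id>/step-<N> form shown above.

```python
import shlex
from typing import Iterable, List


def checkpoint_deploy_commands(run_id: str, steps: Iterable[int]) -> List[List[str]]:
    """Build one `flash deploy <run-id>/step-<N>` argv list per checkpoint step.

    Hypothetical helper: it only formats commands and never runs them.
    """
    return [["flash", "deploy", f"{run_id}/step-{step}"] for step in steps]


if __name__ == "__main__":
    # Placeholder run id and steps; take real values from `flash checkpoints <run-id>`.
    for argv in checkpoint_deploy_commands("run-abc", [100, 200]):
        # Print a shell-safe command line. To actually deploy, pass argv to
        # subprocess.run(argv, check=True) instead.
        print(shlex.join(argv))
```

Run the generated commands one at a time and evaluate between them, since each deploy changes which checkpoint the run id serves.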

Billing

Serving is billed per token. Prompt and completion tokens have per-model rates, listed in Supported models. Cached prompt tokens bill at the model’s cached-input rate, which is lower than the uncached prompt rate.

Prefix caching is always on: when a request’s prompt shares a leading prefix with a recent one (a shared system prompt, or the growing context of a multi-turn chat), the serving engine reuses that prefix from cache instead of recomputing it. Only the input prefix can be cached, so completion tokens always bill at the full per-token rate. When a request hits the prefix cache, the response reports how many prompt tokens were served from cache in usage.prompt_tokens_details.cached_tokens.
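To make the arithmetic concrete, here is a rough sketch of estimating one request's cost from its usage block. It rests on assumptions that are not stated on this page: that rates are quoted per million tokens, and that cached_tokens is counted inside prompt_tokens (the usual OpenAI-style convention). The Rates class and request_cost function are hypothetical names, and the rates you pass in should come from Supported models.

```python
from dataclasses import dataclass
from typing import Any, Mapping


@dataclass(frozen=True)
class Rates:
    """Per-million-token prices (assumed unit); values are supplied by the caller."""
    prompt: float          # uncached prompt tokens
    cached_prompt: float   # prompt tokens served from the prefix cache
    completion: float      # completion tokens (never cached)


def request_cost(usage: Mapping[str, Any], rates: Rates) -> float:
    """Estimate the cost of one request from an OpenAI-style `usage` mapping."""
    prompt_tokens = usage.get("prompt_tokens", 0)
    completion_tokens = usage.get("completion_tokens", 0)
    # The details block may be absent when nothing was served from cache.
    details = usage.get("prompt_tokens_details") or {}
    cached = details.get("cached_tokens") or 0
    # Assumption: cached tokens are a subset of prompt_tokens.
    uncached = prompt_tokens - cached
    total = (
        uncached * rates.prompt
        + cached * rates.cached_prompt
        + completion_tokens * rates.completion
    )
    return total / 1_000_000
```

With the OpenAI SDK, resp.usage.model_dump() should give a mapping of this shape; treat the result as an estimate rather than an invoice.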

Chat from the CLI

flash chat <run-id> -m "Summarize the plot of Hamlet in two sentences."

-m/--message is required; --system, --max-tokens (default 512), and --temperature (default 0.0) are optional — see the CLI reference for the full flag list.

flash chat <run-id> -m "Write a haiku about mountain weather" --temperature 0.8 --max-tokens 128

--system is transient — it applies to that one request and is not stored with the deployment. Use it to probe the adapter with the same system prompt it was trained with:

flash chat <run-id> -m "What is 6*7?" --system "Answer with just the number."

Manage deployments

flash deployments        # list active deployments and endpoints
flash undeploy <run-id>  # deregister the adapter from serving

Use it from your own code

Deployments are OpenAI-compatible, so you can point any OpenAI SDK at the endpoint from flash deployments. The model is your run id:

from openai import OpenAI

# Authenticate with your Freesolo API key (the same one `flash login` uses).
# The endpoint checks that your org owns the adapter.
client = OpenAI(
    base_url="https://<your-endpoint>/v1",   # from `flash deployments`
    api_key="<your-freesolo-api-key>",
)

resp = client.chat.completions.create(
    model="<run-id>",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(resp.choices[0].message.content)

Call it with the freesolo SDK

The freesolo SDK ships a small hosted-LoRA client if you’d rather not pull in an OpenAI SDK:

from freesolo.utils.hosting import HostedLoraClient, chat_completion_with_lora

client = HostedLoraClient(hosting_url="https://<your-endpoint>")
resp = client.chat_completion("<run-id>", [{"role": "user", "content": "Hello!"}])

# or the one-shot module helper
resp = chat_completion_with_lora(
    "<run-id>",
    [{"role": "user", "content": "Hello!"}],
    hosting_url="https://<your-endpoint>",
)

Every call (chat_completion, generate, and the module-level chat_completion_with_lora / generate_with_lora helpers) accepts a per-call timeout_seconds override, so you can give a single slow request a longer budget, or a tighter one, without changing the client-wide timeout set at construction:

client.chat_completion("<run-id>", messages, timeout_seconds=120.0)

Pin and verify the serving checkpoint

Redeploying an adapter id (for example sweeping over checkpoint steps with flash deploy <run-id>/step-N) replaces what that id serves. When requests and redeploys can race, the inference endpoints support two headers so you always know — and can require — which checkpoint answered:

X-Freesolo-Checkpoint (response): echoes the checkpoint that actually served the request. The header is omitted entirely when the adapter has no checkpoint (a plain final-adapter deployment).
X-Freesolo-Expected-Checkpoint (request): pins the request to an expected checkpoint. If the adapter is currently serving a different one (e.g. a parallel sweep redeployed another step in between), the request fails with 409 Conflict — naming the checkpoint that is live — instead of silently generating from the wrong checkpoint. The value is compared after stripping whitespace; send an empty value to require “no checkpoint”.

Both headers work on streaming and non-streaming requests.

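# `client` is the OpenAI client from "Use it from your own code".
resp = client.chat.completions.with_raw_response.create(
    model="<run-id>",
    messages=[{"role": "user", "content": "Hello!"}],
    extra_headers={"X-Freesolo-Expected-Checkpoint": "<checkpoint>"},
)
print(resp.headers.get("X-Freesolo-Checkpoint"))  # checkpoint that served the request
completion = resp.parse()  # the regular chat completion object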
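The comparison rule described above (whitespace stripped, empty value meaning "no checkpoint") can also be mirrored on the client to double-check the response header. This is a sketch of that rule as described on this page, not the server's actual implementation; checkpoint_matches and verify_response are hypothetical helper names.

```python
from typing import Mapping, Optional

RESPONSE_HEADER = "X-Freesolo-Checkpoint"


def checkpoint_matches(expected: str, served: Optional[str]) -> bool:
    """Compare an expected checkpoint with the one that served the request.

    Both sides are stripped of surrounding whitespace. An empty `expected`
    means "no checkpoint", which matches a missing response header (None).
    """
    return expected.strip() == (served or "").strip()


def verify_response(headers: Mapping[str, str], expected: str) -> bool:
    """Check a response's checkpoint header against the expected value.

    A plain dict lookup is case-sensitive; HTTP client header objects are
    usually case-insensitive, so either works with the canonical name.
    """
    return checkpoint_matches(expected, headers.get(RESPONSE_HEADER))
```

For example, verify_response(resp.headers, "<checkpoint>") can guard an evaluation loop so that results are only recorded when the intended checkpoint answered.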

Export to your own HuggingFace repo

To take a trained adapter out of Freesolo’s managed storage, copy it into a HuggingFace repo you own:

flash export --adapter-id <run-id> --repository <owner>/<name>

--adapter-id and --repository are required; --api-key (defaults to HF_TOKEN) and --public are optional. See the CLI reference for the full flag list.
