Once a training run reaches the done state, you can deploy its adapter and chat with it from the CLI or any OpenAI-compatible client. flash deploy registers your adapter with Freesolo’s managed serving service; you then send requests to it and pay per token (see Billing).

Deploy

flash deploy <run-id>
Preview what a deploy would do without creating it:
flash deploy <run-id> --dry-run
Skip the server-side post-deploy smoke check. Registration alone does not guarantee the adapter serves, so use this only when you’ll verify separately:
flash deploy <run-id> --no-verify

Deploy a specific checkpoint

Every training step that saved an adapter is independently deployable. List a run’s deployable checkpoints:
flash checkpoints <run-id>
Then serve a specific step instead of the final adapter:
flash deploy <run-id>/step-<N>
You can deploy a checkpoint while a run is still training, or after a GRPO run stops with useful intermediate steps. Checkpoint deployments attach serving metadata to the run while preserving the run’s training state.
A run that never finalized (cancelled or preempted mid-training) has no final adapter, so a plain flash deploy <run-id> fails. The error lists the saved checkpoint steps and the exact flash deploy <run-id>/step-N command to deploy one of them instead.
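If you want to compare several checkpoints with the same prompt, the deploy-then-chat sequence can be scripted. The sketch below only builds the command lines from the `flash deploy <run-id>/step-<N>` and `flash chat <run-id> -m ...` forms shown on this page; the helper names `deploy_target` and `sweep_commands`, the run id `run-123`, and the step numbers are hypothetical placeholders, and the sketch assumes each deploy has finished before the following chat runs.

```python
from typing import Iterable, List


def deploy_target(run_id: str, step: int) -> str:
    """Return the `<run-id>/step-<N>` target form used by `flash deploy`."""
    return f"{run_id}/step-{step}"


def sweep_commands(run_id: str, steps: Iterable[int], prompt: str) -> List[List[str]]:
    """Build deploy-then-chat argv lists for each checkpoint step, in order."""
    commands: List[List[str]] = []
    for step in steps:
        # Serve this checkpoint under the run id...
        commands.append(["flash", "deploy", deploy_target(run_id, step)])
        # ...then probe whatever the run id currently serves.
        commands.append(["flash", "chat", run_id, "-m", prompt])
    return commands


if __name__ == "__main__":
    # Placeholder run id and steps; take real ones from `flash checkpoints <run-id>`.
    for argv in sweep_commands("run-123", [100, 200], "What is 6*7?"):
        # Print only; to execute, pass argv to subprocess.run(argv, check=True).
        print(" ".join(argv))
```

Building argv lists rather than shell strings avoids quoting problems when the prompt contains spaces or special characters.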

Billing

Serving is billed per token. Prompt and completion tokens have per-model rates, and cached prompt tokens use the model’s cached-input rate, which is lower than the uncached prompt rate. Prices are listed in Supported models.
Prefix caching is always on: when a request’s prompt shares a leading prefix with a recent one (a shared system prompt, or the growing context of a multi-turn chat), the serving engine reuses that prefix from cache instead of recomputing it. Only the input prefix can be cached, so completion tokens always bill at the full per-token rate.
When a request hits the prefix cache, the response reports how many prompt tokens were served from cache in usage.prompt_tokens_details.cached_tokens.
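To make the arithmetic concrete, here is a minimal sketch of how one request's cost could be estimated from its token counts. It rests on assumptions this page does not confirm: that the reported prompt token count includes the cached tokens (the usual OpenAI-style convention), and that rates are quoted per one million tokens. The function name `estimate_cost` and the rates in the example are made up for illustration; use the real rates from Supported models.

```python
def estimate_cost(
    prompt_tokens: int,
    cached_tokens: int,
    completion_tokens: int,
    *,
    input_rate: float,
    cached_input_rate: float,
    output_rate: float,
) -> float:
    """Estimate one request's cost; rates are assumed to be per 1M tokens."""
    # Assumption: prompt_tokens already includes cached_tokens.
    if not 0 <= cached_tokens <= prompt_tokens:
        raise ValueError("cached_tokens must be between 0 and prompt_tokens")
    uncached_tokens = prompt_tokens - cached_tokens
    total = (
        uncached_tokens * input_rate
        + cached_tokens * cached_input_rate
        + completion_tokens * output_rate  # completions are never cached
    )
    return total / 1_000_000


if __name__ == "__main__":
    # Illustrative rates only, not Freesolo prices.
    cost = estimate_cost(
        1000, 600, 200,
        input_rate=1.00, cached_input_rate=0.25, output_rate=2.00,
    )
    print(f"{cost:.6f}")
```

With these illustrative numbers, 400 uncached prompt tokens, 600 cached prompt tokens, and 200 completion tokens contribute 400, 150, and 400 rate-units respectively before dividing by one million.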

Chat from the CLI

flash chat <run-id> -m "Summarize the plot of Hamlet in two sentences."
-m/--message is required; --system, --max-tokens (default 512), and --temperature (default 0.0) are optional — see the CLI reference for the full flag list.
flash chat <run-id> -m "Write a haiku about mountain weather" --temperature 0.8 --max-tokens 128
--system is transient — it applies to that one request and is not stored with the deployment. Use it to probe the adapter with the same system prompt it was trained with:
flash chat <run-id> -m "What is 6*7?" --system "Answer with just the number."

Manage deployments

flash deployments        # list active deployments and endpoints
flash undeploy <run-id>  # deregister the adapter from serving

Use it from your own code

Deployments are OpenAI-compatible, so you can point any OpenAI SDK at the endpoint from flash deployments. The model is your run id:
from openai import OpenAI

client = OpenAI(
    base_url="https://<your-endpoint>/v1",   # from `flash deployments`
    # Your Freesolo API key (the same one `flash login` uses).
    # The endpoint checks that your org owns the adapter.
    api_key="<your-freesolo-api-key>",
)

resp = client.chat.completions.create(
    model="<run-id>",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(resp.choices[0].message.content)
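The Billing section says a response reports cached prompt tokens in usage.prompt_tokens_details.cached_tokens when the request hits the prefix cache. Since that wording suggests the field may be missing otherwise, a defensive reader such as the sketch below may be useful. The helper name `cached_prompt_tokens` is hypothetical, and the `SimpleNamespace` objects only stand in for a real SDK response so the example runs without a network call.

```python
from types import SimpleNamespace
from typing import Any


def cached_prompt_tokens(response: Any) -> int:
    """Return the cached prompt token count, or 0 when it is not reported."""
    usage = getattr(response, "usage", None)
    details = getattr(usage, "prompt_tokens_details", None)
    cached = getattr(details, "cached_tokens", None)
    return int(cached) if cached is not None else 0


if __name__ == "__main__":
    # Stand-ins for SDK responses: one cache hit, one with no details reported.
    hit = SimpleNamespace(
        usage=SimpleNamespace(prompt_tokens_details=SimpleNamespace(cached_tokens=128))
    )
    miss = SimpleNamespace(usage=SimpleNamespace(prompt_tokens_details=None))
    print(cached_prompt_tokens(hit), cached_prompt_tokens(miss))
```

With the OpenAI Python SDK you would pass the `resp` object from the call above directly to this helper.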

Call it with the freesolo SDK

The freesolo SDK ships a small hosted-LoRA client if you’d rather not pull in an OpenAI SDK:
from freesolo.utils.hosting import HostedLoraClient, chat_completion_with_lora

client = HostedLoraClient(hosting_url="https://<your-endpoint>")
resp = client.chat_completion("<run-id>", [{"role": "user", "content": "Hello!"}])

# or the one-shot module helper
resp = chat_completion_with_lora(
    "<run-id>",
    [{"role": "user", "content": "Hello!"}],
    hosting_url="https://<your-endpoint>",
)
Every call (chat_completion, generate, and the module-level chat_completion_with_lora / generate_with_lora helpers) accepts a per-call timeout_seconds override, so you can give a single request a longer or tighter budget without changing the client-wide timeout set at construction:
client.chat_completion("<run-id>", messages, timeout_seconds=120.0)

Pin and verify the serving checkpoint

Redeploying an adapter id (for example sweeping over checkpoint steps with flash deploy <run-id>/step-N) replaces what that id serves. Because requests and redeploys can race, the inference endpoints support two headers that let you see, and require, which checkpoint answered:
  • X-Freesolo-Checkpoint (response): echoes the checkpoint that actually served the request. The header is omitted entirely when the adapter has no checkpoint (a plain final-adapter deployment).
  • X-Freesolo-Expected-Checkpoint (request): pins the request to an expected checkpoint. If the adapter is currently serving a different one (e.g. a parallel sweep redeployed another step in between), the request fails with 409 Conflict — naming the checkpoint that is live — instead of silently generating from the wrong checkpoint. The value is compared after stripping whitespace; send an empty value to require “no checkpoint”.
Both headers work on streaming and non-streaming requests.
resp = client.chat.completions.with_raw_response.create(
    model="<run-id>",
    messages=[{"role": "user", "content": "Hello!"}],
    extra_headers={"X-Freesolo-Expected-Checkpoint": "<checkpoint>"},
)
print(resp.headers.get("X-Freesolo-Checkpoint"))
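The header rules above (whitespace is stripped, an empty value means "no checkpoint", and the response header is omitted for a plain final-adapter deployment) can be wrapped in small helpers for client-side double-checking. This is a sketch under assumptions: the helper names are hypothetical, and the checkpoint value `step-200` is a made-up placeholder, since this page does not specify the format of checkpoint identifiers.

```python
from typing import Dict, Mapping, Optional

CHECKPOINT_HEADER = "X-Freesolo-Checkpoint"
EXPECTED_HEADER = "X-Freesolo-Expected-Checkpoint"


def pin_headers(expected: Optional[str]) -> Dict[str, str]:
    """Build request headers pinning a call; None or "" requires no checkpoint."""
    return {EXPECTED_HEADER: (expected or "").strip()}


def served_checkpoint(headers: Mapping[str, str]) -> str:
    """Return the checkpoint that answered, or "" when the header is omitted."""
    # A plain dict needs this exact key casing; HTTP client header objects
    # are typically case-insensitive.
    return (headers.get(CHECKPOINT_HEADER) or "").strip()


def answered_by(headers: Mapping[str, str], expected: Optional[str]) -> bool:
    """Check that the response came from the expected checkpoint."""
    return served_checkpoint(headers) == (expected or "").strip()


if __name__ == "__main__":
    # "step-200" is a placeholder value, not a documented identifier format.
    print(pin_headers(" step-200 "))
    print(answered_by({CHECKPOINT_HEADER: "step-200"}, "step-200"))
    print(answered_by({}, None))
```

With the OpenAI Python SDK, pass `pin_headers(...)` as `extra_headers` and `resp.headers` to `answered_by`. A 409 Conflict response is raised by that SDK as `openai.ConflictError`, which a sweep script can catch to retry or skip the request.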

Export to your own HuggingFace repo

To take a trained adapter out of Freesolo’s managed storage, copy it into a HuggingFace repo you own:
flash export --adapter-id <run-id> --repository <owner>/<name>
--adapter-id and --repository are required; --api-key (defaults to HF_TOKEN) and --public are optional. See the CLI reference for the full flag list.