done, deploy its adapter and chat with it, from the CLI or
any OpenAI-compatible client.
flash deploy registers your adapter with Freesolo’s managed serving
service. Register the adapter, send requests to it, and pay per token
(see Billing).
Deploy
Deploy a specific checkpoint
Every training step that saved an adapter is independently deployable. List a run’s deployable checkpoints:A run that never finalized (cancelled or preempted mid-training) has no final
adapter, so a plain
flash deploy <run-id> fails with an error that lists the
saved checkpoint steps and the exact flash deploy <run-id>/step-N command
to deploy one of them instead.Billing
Serving is billed per token. Prompt and completion tokens have per-model rates, and cached prompt tokens use the model’s cached-input rate. Prices are listed in Supported models. Cached prompt tokens are cheaper than uncached prompt tokens. Prefix caching is always on: when a request’s prompt shares a leading prefix with a recent one — a shared system prompt, or the growing context of a multi-turn chat — the serving engine reuses that prefix from cache instead of recomputing it. Only the input prefix can be cached, so completion tokens always bill at the full per-token rate. When a request hits the prefix cache, the response reports how many prompt tokens were served from cache inusage.prompt_tokens_details.cached_tokens.
Chat from the CLI
-m/--message is required; --system, --max-tokens (default 512), and
--temperature (default 0.0) are optional — see the
CLI reference for the full flag list.
--system is transient — it applies to that one request and is not stored with
the deployment. Use it to probe the adapter with the same system prompt it was
trained with:
Manage deployments
Use it from your own code
Deployments are OpenAI-compatible, so you can point any OpenAI SDK at the endpoint fromflash deployments. The model is your run id:
Call it with the freesolo SDK
Thefreesolo SDK ships a small hosted-LoRA client if you’d rather not pull in
an OpenAI SDK:
chat_completion, generate, and the module-level
chat_completion_with_lora / generate_with_lora helpers — accepts a per-call
timeout_seconds override, so a single slow request can be given its own
budget (or a tight one) without changing the client-wide timeout set at
construction:
Pin and verify the serving checkpoint
Redeploying an adapter id (for example sweeping over checkpoint steps withflash deploy <run-id>/step-N) replaces what that id serves. When requests
and redeploys can race, the inference endpoints support two headers so you
always know — and can require — which checkpoint answered:
X-Freesolo-Checkpoint(response): echoes the checkpoint that actually served the request. The header is omitted entirely when the adapter has no checkpoint (a plain final-adapter deployment).X-Freesolo-Expected-Checkpoint(request): pins the request to an expected checkpoint. If the adapter is currently serving a different one (e.g. a parallel sweep redeployed another step in between), the request fails with 409 Conflict — naming the checkpoint that is live — instead of silently generating from the wrong checkpoint. The value is compared after stripping whitespace; send an empty value to require “no checkpoint”.
Export to your own HuggingFace repo
To take a trained adapter out of Freesolo’s managed storage, copy it into a HuggingFace repo you own:--adapter-id and --repository are required; --api-key (defaults to
HF_TOKEN) and --public are optional. See the
CLI reference for the full flag list.