# Deploy & chat

> Serve a trained adapter and talk to it over an OpenAI-compatible API.

Once a [training run](/guides/training) reaches `done`, deploy its adapter and chat with it, from the CLI or
any OpenAI-compatible client.

`flash deploy` registers your adapter with Freesolo's **managed serving
service**. Once it is registered, you send requests to it and pay per token
(see [Billing](#billing)).

## Deploy

```bash theme={null}
flash deploy <run-id>
```

Preview what a deploy would do without creating it:

```bash theme={null}
flash deploy <run-id> --dry-run
```

Pass `--no-verify` to skip the server-side post-deploy smoke check. Registration
alone does not guarantee the adapter serves, so use this only when you'll verify
separately:

```bash theme={null}
flash deploy <run-id> --no-verify
```

## Deploy a specific checkpoint

Every training step that saved an adapter is independently deployable. List a
run's deployable checkpoints:

```bash theme={null}
flash checkpoints <run-id>
```

Then serve a specific step instead of the final adapter:

```bash theme={null}
flash deploy <run-id>/step-<N>
```

You can deploy a checkpoint while a run is still training, or after a GRPO run
stops with useful intermediate steps. Checkpoint deployments attach serving
metadata to the run while preserving the run's training state.
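
If you script a sweep over checkpoints, the deploy target is just the run id plus a `/step-<N>` suffix. The helper below is a minimal sketch, not part of the CLI or SDK: `deploy_command` is a hypothetical name, the run id and step numbers are placeholders, and combining `--dry-run` with a checkpoint target is an assumption.

```python
import shlex


def deploy_command(run_id, step=None, dry_run=False):
    """Build the argv for `flash deploy`, optionally targeting one checkpoint step."""
    target = run_id if step is None else f"{run_id}/step-{step}"
    argv = ["flash", "deploy", target]
    if dry_run:
        # Assumption: --dry-run is accepted for checkpoint targets too.
        argv.append("--dry-run")
    return argv


# Placeholder run id and steps; take real ones from `flash checkpoints <run-id>`.
for step in (100, 200, 300):
    print(shlex.join(deploy_command("my-run", step, dry_run=True)))
```

Each list can be passed to `subprocess.run(argv, check=True)` to run the deploy for real.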

<Note>
  A run that never finalized (cancelled or preempted mid-training) has no final
  adapter, so a plain `flash deploy <run-id>` fails with an error that lists the
  saved checkpoint steps and the exact `flash deploy <run-id>/step-N` command
  to deploy one of them instead.
</Note>

## Billing

Serving is **billed per token**. Prompt and completion tokens have per-model
rates, and cached prompt tokens use the model's cached-input rate. Prices are
listed in [Supported models](/reference/models#serving-prices).

**Cached prompt tokens are cheaper than uncached prompt tokens.** Prefix caching
is always on: when a request's prompt shares a leading prefix with a recent one
— a shared system prompt, or the growing context of a multi-turn chat — the
serving engine reuses that prefix from cache instead of recomputing it. Only the
input prefix can be cached, so completion tokens always bill at the full
per-token rate. When a request hits the prefix cache, the response reports how
many prompt tokens were served from cache in
`usage.prompt_tokens_details.cached_tokens`.
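
To see how the three rates combine, here is a rough sketch that estimates the cost of one request from its `usage` block. It assumes cached tokens are counted inside `prompt_tokens` (as in the OpenAI usage schema), and the rates are made-up per-token numbers, not Freesolo prices. `estimate_cost` is a hypothetical helper.

```python
def estimate_cost(usage, prompt_rate, cached_rate, completion_rate):
    """Estimate the cost of one request from its `usage` dict (rates are per token).

    Assumption: `cached_tokens` is a subset of `prompt_tokens`.
    """
    details = usage.get("prompt_tokens_details") or {}
    cached = details.get("cached_tokens") or 0
    uncached = usage["prompt_tokens"] - cached
    return (
        uncached * prompt_rate                          # uncached prompt tokens
        + cached * cached_rate                          # prefix-cache hits
        + usage["completion_tokens"] * completion_rate  # never cached
    )


# Hypothetical usage block and rates, for illustration only.
usage = {
    "prompt_tokens": 1000,
    "completion_tokens": 200,
    "prompt_tokens_details": {"cached_tokens": 800},
}
cost = estimate_cost(usage, prompt_rate=2e-6, cached_rate=5e-7, completion_rate=8e-6)
print(f"estimated cost: {cost:.6f}")
```

With the OpenAI Python SDK, `resp.usage.model_dump()` should give a dict of this shape. If a request misses the cache and the details field is absent or null, the helper bills every prompt token at the uncached rate.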

## Chat from the CLI

```bash theme={null}
flash chat <run-id> -m "Summarize the plot of Hamlet in two sentences."
```

`-m`/`--message` is required; `--system`, `--max-tokens` (default 512), and
`--temperature` (default 0.0) are optional — see the
[CLI reference](/reference/cli#serving) for the full flag list.

```bash theme={null}
flash chat <run-id> -m "Write a haiku about mountain weather" --temperature 0.8 --max-tokens 128
```

`--system` is transient — it applies to that one request and is not stored with
the deployment. Use it to probe the adapter with the same system prompt it was
trained with:

```bash theme={null}
flash chat <run-id> -m "What is 6*7?" --system "Answer with just the number."
```

## Manage deployments

```bash theme={null}
flash deployments        # list active deployments and endpoints
flash undeploy <run-id>  # deregister the adapter from serving
```

## Use it from your own code

Deployments are OpenAI-compatible, so you can point any OpenAI SDK at the
endpoint from `flash deployments`. The `model` is your run id:

```python theme={null}
from openai import OpenAI

# api_key is your Freesolo API key (the same one `flash login` uses).
# The endpoint checks that your org owns the adapter.
client = OpenAI(
    base_url="https://<your-endpoint>/v1",   # from `flash deployments`
    api_key="<your-freesolo-api-key>",
)

resp = client.chat.completions.create(
    model="<run-id>",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(resp.choices[0].message.content)
```

### Call it with the freesolo SDK

The `freesolo` SDK ships a small hosted-LoRA client if you'd rather not pull in
an OpenAI SDK:

```python theme={null}
from freesolo.utils.hosting import HostedLoraClient, chat_completion_with_lora

client = HostedLoraClient(hosting_url="https://<your-endpoint>")
resp = client.chat_completion("<run-id>", [{"role": "user", "content": "Hello!"}])

# or the one-shot module helper
resp = chat_completion_with_lora(
    "<run-id>",
    [{"role": "user", "content": "Hello!"}],
    hosting_url="https://<your-endpoint>",
)
```

Every call (`chat_completion`, `generate`, and the module-level
`chat_completion_with_lora` / `generate_with_lora` helpers) accepts a per-call
`timeout_seconds` override, so a single request can get a longer or tighter
budget without changing the client-wide timeout set at construction:

```python theme={null}
client.chat_completion("<run-id>", messages, timeout_seconds=120.0)
```

### Pin and verify the serving checkpoint

Redeploying an adapter id (for example sweeping over checkpoint steps with
`flash deploy <run-id>/step-N`) replaces what that id serves. Because requests
and redeploys can race, the inference endpoints support two headers that let
you see, and require, which checkpoint answered:

* **`X-Freesolo-Checkpoint`** (response): echoes the checkpoint that actually
  served the request. The header is omitted entirely when the adapter has no
  checkpoint (a plain final-adapter deployment).
* **`X-Freesolo-Expected-Checkpoint`** (request): pins the request to an
  expected checkpoint. If the adapter is currently serving a different one
  (e.g. a parallel sweep redeployed another step in between), the request fails
  with **409 Conflict** — naming the checkpoint that is live — instead of
  silently generating from the wrong checkpoint. The value is compared after
  stripping whitespace; send an empty value to require "no checkpoint".

Both headers work on streaming and non-streaming requests.

```python theme={null}
resp = client.chat.completions.with_raw_response.create(
    model="<run-id>",
    messages=[{"role": "user", "content": "Hello!"}],
    extra_headers={"X-Freesolo-Expected-Checkpoint": "<checkpoint>"},
)
print(resp.headers.get("X-Freesolo-Checkpoint"))
```
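
If you also want a client-side check on the response header, the rules above can be mirrored in a small helper. This is a sketch under assumptions: `served_expected_checkpoint` is a hypothetical name, the checkpoint strings in the usage lines are placeholders rather than the real value format, and stripping whitespace on the response side is a defensive choice, not documented behavior.

```python
CHECKPOINT_HEADER = "X-Freesolo-Checkpoint"


def served_expected_checkpoint(response_headers, expected):
    """Return True if the response came from the expected checkpoint.

    A missing header means "no checkpoint" (a plain final-adapter
    deployment), which matches an empty expectation.
    """
    served = (response_headers.get(CHECKPOINT_HEADER) or "").strip()
    return served == (expected or "").strip()


# Placeholder checkpoint values, for illustration only.
print(served_expected_checkpoint({CHECKPOINT_HEADER: "ckpt-a"}, " ckpt-a "))
print(served_expected_checkpoint({}, ""))
print(served_expected_checkpoint({}, "ckpt-a"))
```

With the OpenAI Python SDK, `resp.headers` from a raw response can be passed in directly. A 409 from the pin header should surface as `openai.ConflictError`. The SDK's default retry policy appears to include 409 responses, so consider constructing the client with `max_retries=0` if you want the conflict reported immediately.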

## Export to your own HuggingFace repo

To take a trained adapter out of Freesolo's managed storage, copy it into a
HuggingFace repo you own:

```bash theme={null}
flash export --adapter-id <run-id> --repository <owner>/<name>
```

`--adapter-id` and `--repository` are required; `--api-key` (defaults to
`HF_TOKEN`) and `--public` are optional. See the
[CLI reference](/reference/cli#export) for the full flag list.
