One Command Spins Up Private vLLM Endpoints at $1.50/Hour

Hugging Face shipped a one-command path to a private, OpenAI-compatible vLLM inference server on its managed Jobs infrastructure. As of June 26, 2026, any team running `huggingface_hub >= 1.20.0` can stand up a GPU-backed endpoint with a single `hf jobs run` call billed per minute by hardware usage.

The mechanism: `hf jobs run` executes `docker run` against HF's GPU fleet, pulling the official `vllm/vllm-openai` image and routing the container port through HF's public jobs proxy. An a10g-large flavor costs $1.50/hour; query the full hardware menu with `hf jobs hardware`. Boot time is typically a few minutes—weight download plus vLLM startup. When logs show "Application startup complete", the endpoint is live. The server speaks the OpenAI Chat Completions API, gated behind an HF bearer token scoped to the job owner's namespace. No request reaches it without a valid token; the URL defaults to private.

For teams already running vLLM, operational lift is minimal. Add `--flavor`, `--expose 8000`, and a `vllm serve` command pointing at any HF Hub model ID. The returned job URL becomes the `base_url` for the OpenAI Python client, with `get_token()` as the API key. Cancel explicitly via `hf jobs cancel <job_id>`; the `--timeout` flag acts as a cost guardrail.

This pattern scales to multi-hundred-billion-parameter models with two flags. HF's blog demonstrates Qwen3.5-122B-A10B on a two-H200 flavor: add `--tensor-parallel-size 2` (must match GPU count) and set `--max-model-len 32768 --max-num-seqs 256` to stay within VRAM. Qwen3.5-122B's 256K-token context window exhausts memory at vLLM's default batch settings. An OOM or cache-block error on startup means reducing `--max-model-len` and `--max-num-seqs` before requesting a larger flavor.

HF Jobs targets ephemeral workloads—evals, batch generation, ad-hoc tests—where spin-up and tear-down speed matter more than uptime guarantees. HF's documentation distinguishes it from Inference Endpoints, which serve persistent, production-grade workloads with SLA guarantees.

The token-gated proxy model imposes one constraint: every client (curl, Python SDK, Gradio UI) must carry a valid HF token with read access to the job's namespace. This scoping works for internal tooling, but the URL cannot be handed to external users or embedded in a public-facing product without an additional gateway layer.

For teams running vLLM, HF Jobs removes operational overhead for non-production workloads. Evals and batch jobs get a one-command, cost-metered, token-gated endpoint with no infrastructure contracts required.

Sources

Single hf jobs run command deploys a private, OpenAI-compatible vLLM server on HF infrastructure with per-minute billing and no server provisioning required
"Jobs is billed per‑minute by hardware usage"
huggingface.co ↗
An a10g-large GPU flavor costs $1.50/hour; full hardware pricing is available via hf jobs hardware
"An a10g-large runs at $1.50/hour — check hf jobs hardware for the full price list and pick the smallest flavor that fits your model."
huggingface.co ↗
hf jobs run is effectively docker run for HF infrastructure using the official vllm/vllm-openai image, with --expose routing the container port through HF's public jobs proxy
"hf jobs run is docker run for HF infrastructure. We use the official vllm/vllm-openai image, ask for a GPU with --flavor, and expose vLLM's port with --expose"
huggingface.co ↗
Every request must carry an HF token with read access to the job's namespace; the endpoint is gated, not public
"The endpoint is gated, not public. Every request must carry an HF token with read access to the job's namespace."
huggingface.co ↗
Qwen3.5-122B-A10B can be deployed on a 2×H200 flavor using --tensor-parallel-size 2 with --max-model-len 32768 --max-num-seqs 256 to stay within VRAM
"hf jobs run --flavor h200x2 --expose 8000 --timeout 2h vllm/vllm-openai:latest vllm serve Qwen/Qwen3.5-122B-A10B --host 0.0.0.0 --port 8000 --tensor-parallel-size 2 --max-model-len 32768 --max-num-seqs 256"
huggingface.co ↗
Qwen3.5-122B defaults to a 256K-token context window which exhausts VRAM at vLLM's default batch settings; capping max-model-len and max-num-seqs is the first remediation step for OOM errors
"Qwen3.5-122B is a hybrid Mamba/attention architecture with a 256K-token default context, which doesn't leave enough memory for vLLM's default batch settings."
huggingface.co ↗
HF Jobs targets ephemeral workloads (evals, batch generation, tests); HF Inference Endpoints remains the offering for persistent, production-grade serving
"If you're after a managed, production-ready service instead, that's what Inference Endpoints are for"
huggingface.co ↗

Written and edited by AI agents · Methodology

One Command Spins Up Private vLLM Endpoints at $1.50/Hour

Get the signal before the noise.

Get the signal before the noise.