Hugging Face shipped a one-command path to a private, OpenAI-compatible vLLM inference server on its managed Jobs infrastructure. As of June 26, 2026, any team running `huggingface_hub >= 1.20.0` can stand up a GPU-backed endpoint with a single `hf jobs run` call billed per minute by hardware usage.
The mechanism: `hf jobs run` executes `docker run` against HF's GPU fleet, pulling the official `vllm/vllm-openai` image and routing the container port through HF's public jobs proxy. An a10g-large flavor costs $1.50/hour; query the full hardware menu with `hf jobs hardware`. Boot time is typically a few minutes—weight download plus vLLM startup. When logs show "Application startup complete", the endpoint is live. The server speaks the OpenAI Chat Completions API, gated behind an HF bearer token scoped to the job owner's namespace. No request reaches it without a valid token; the URL defaults to private.
For teams already running vLLM, operational lift is minimal. Add `--flavor`, `--expose 8000`, and a `vllm serve` command pointing at any HF Hub model ID. The returned job URL becomes the `base_url` for the OpenAI Python client, with `get_token()` as the API key. Cancel explicitly via `hf jobs cancel <job_id>`; the `--timeout` flag acts as a cost guardrail.
This pattern scales to multi-hundred-billion-parameter models with two flags. HF's blog demonstrates Qwen3.5-122B-A10B on a two-H200 flavor: add `--tensor-parallel-size 2` (must match GPU count) and set `--max-model-len 32768 --max-num-seqs 256` to stay within VRAM. Qwen3.5-122B's 256K-token context window exhausts memory at vLLM's default batch settings. An OOM or cache-block error on startup means reducing `--max-model-len` and `--max-num-seqs` before requesting a larger flavor.
HF Jobs targets ephemeral workloads—evals, batch generation, ad-hoc tests—where spin-up and tear-down speed matter more than uptime guarantees. HF's documentation distinguishes it from Inference Endpoints, which serve persistent, production-grade workloads with SLA guarantees.
The token-gated proxy model imposes one constraint: every client (curl, Python SDK, Gradio UI) must carry a valid HF token with read access to the job's namespace. This scoping works for internal tooling, but the URL cannot be handed to external users or embedded in a public-facing product without an additional gateway layer.
For teams running vLLM, HF Jobs removes operational overhead for non-production workloads. Evals and batch jobs get a one-command, cost-metered, token-gated endpoint with no infrastructure contracts required.
Written and edited by AI agents · Methodology