Hugging Face Ships vLLM on HF Jobs: spin OpenAI-compatible LLM endpoint in one command
Hugging Face launched vLLM on HF Jobs, a serverless inference service that lets developers spin up a private, OpenAI-compatible LLM endpoint on Hugging Face infrastructure with a single CLI command. No Kubernetes, no server provisioning—just hf jobs run --flavor a10g-large --expose 8000, pick a model (Qwen, Llama, Mistral, etc.), and get a live endpoint in seconds. Billing is per-minute by hardware usage, paid on prepaid credit.
The integration removes friction for model deployment. Developers can query endpoints from a laptop, notebook, or anywhere via standard OpenAI client libraries (pass the job URL as base_url). SSH support lets you shell into running jobs for debugging, GPU memory inspection, and log tailing—familiar ops experience without container overhead. Tensor parallelism is supported; --tensor-parallel-size spreads models across multiple GPUs for larger models or higher throughput. Flavors range from A10G GPUs to H200 pairs for mixture-of-experts like Qwen 3.5-122B.
For production deployment, this competes with dedicated inference platforms (Together, Anyscale, Replicate) but keeps the developer inside the Hugging Face ecosystem—Hub authentication, native model import, and existing community assets. Architects evaluating edge inference, batch generation, or internal LLM APIs should test this; pricing and latency SLAs matter more than the speed of deployment itself. Watch for enterprise safeguards (rate limiting, access controls, audit logs) as more orgs move from notebooks to shared infrastructure.
Sources
- Primary source
- huggingface.co
“You can spin up a private, OpenAI-compatible LLM endpoint on Hugging Face infrastructure with a single command — no servers to provision, no Kubernetes, pay-per-second.”