HuggingFace Jobs launches one-command vLLM deployment on H100 and A100
HuggingFace's Jobs platform added one-command vLLM deployment this week, letting practitioners launch inference servers on H100 or A100 hardware without writing YAML or managing containers.
HuggingFace Jobs now deploys vLLM inference servers with a single terminal command. The new CLI flag --vllm spins up a production-ready endpoint on rented H100 or A100 hardware, skipping the usual container-config and YAML-writing steps that slow down model deployment.
The feature targets practitioners who want to serve open-weight models—Llama, Qwen, Mistral, Gemma—at scale without managing Kubernetes or cloud-provider dashboards. A typical invocation looks like hf jobs create --vllm meta-llama/Llama-3.1-70B-Instruct --gpu h100:1, which provisions a single H100 instance, pulls the model weights from the Hub, and returns an OpenAI-compatible API endpoint. The server auto-scales request batching and supports streaming, function calling, and multi-GPU tensor parallelism when more than one accelerator is specified.
What stands out
- Zero-config deployment. No Dockerfile, no
docker-compose.yml, no Helm chart. The CLI reads model metadata from the HuggingFace card, picks vLLM's optimal quantization and attention backend, and starts serving. - OpenAI-drop-in compatibility. The endpoint exposes
/v1/completionsand/v1/chat/completionsroutes that match OpenAI's schema, so existing client code—LangChain, LlamaIndex, custom Python scripts—works without modification. - Per-minute billing. Jobs charges only for active inference time, not idle uptime. An H100 instance costs roughly $3.60/hour when serving requests; the meter stops when traffic drops to zero.






