Trooper.AI provides a fully automated vLLM deployment template that installs, configures, and runs an OpenAI-compatible inference server on your GPU server using systemd.
The goal: a ready-to-use inference endpoint with no manual tuning.

You only control a small set of public parameters. These parameters can be set via environment variables before running the installer:
| Variable | Description |
|---|---|
| `TOKEN` | API key for authentication |
| `modelname` | HuggingFace model path |
| `hf_token` | HuggingFace token (for gated models) |
| `commandline_args` | Optional extra vLLM CLI arguments |
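For example, the parameters could be exported from a small launcher script (a sketch; the installer invocation is hypothetical, only the variable names come from the table above):

```python
import os

# Public template parameters (names from the table above; the values
# here are illustrative).
params = {
    "TOKEN": "my-secret-api-key",                # API key for authentication
    "modelname": "Qwen/Qwen3-14B",               # HuggingFace model path
    "hf_token": "hf_xxxxxxxx",                   # only needed for gated models
    "commandline_args": "--max-model-len 8192",  # optional extra vLLM flags
}

# Merge into the environment the installer will see.
env = {**os.environ, **params}
# e.g. subprocess.run(["bash", "install.sh"], env=env, check=True)
# (installer name hypothetical -- substitute the actual script)
```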
During installation, the template:

- Detects the GPU architecture (Volta, Ampere, Ada, Hopper, Blackwell)
- Detects the VRAM size
- Selects the optimal precision automatically
- Uses an FP16 KV cache for stability
- Tunes runtime parameters for the detected hardware
- Installs vLLM with CUDA support
- Creates a systemd service, `vllm-server.service`
- Starts a persistent OpenAI-compatible API server on a secure HTTPS endpoint

No manual tuning is required.
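As an illustrative sketch of this kind of selection logic (not the template's actual code), precision can be derived from the GPU's compute capability, since BF16 requires Ampere (compute capability 8.0) or newer:

```python
# Illustrative only: maps NVIDIA compute capability to an inference dtype.
# BF16 needs compute capability >= 8.0 (Ampere, Ada, Hopper, Blackwell);
# older generations such as Volta (7.0) fall back to FP16.
def select_dtype(compute_capability: float) -> str:
    return "bfloat16" if compute_capability >= 8.0 else "float16"

print(select_dtype(7.0))  # Volta  -> float16
print(select_dtype(9.0))  # Hopper -> bfloat16
```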
Base URL:

```
http://YOUR_SERVER:PORT/v1
```

Endpoints:

- `/v1/models`
- `/v1/completions`
- `/v1/chat/completions`

Authentication header:

```
Authorization: Bearer YOUR_TOKEN_FROM_CONFIG
```
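As a quick connectivity check, the `/v1/models` endpoint can be queried with that header (a standard-library sketch; the base URL and token are placeholders):

```python
import urllib.request

BASE_URL = "http://YOUR_SERVER:PORT/v1"  # placeholder base URL
API_KEY = "YOUR_TOKEN_FROM_CONFIG"       # the TOKEN set at install time

# Build an authenticated GET request for the model list.
req = urllib.request.Request(
    f"{BASE_URL}/models",
    headers={"Authorization": f"Bearer {API_KEY}"},
)

# Uncomment to run against a live server:
# import json
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp))

print(req.full_url)  # -> http://YOUR_SERVER:PORT/v1/models
```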
```python
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_API_KEY",
    base_url="https://YOUR_SERVER_ENDPOINT.apps.trooper.ai/v1"
)

resp = client.chat.completions.create(
    model="Qwen/Qwen3-14B",
    messages=[
        {"role": "user", "content": "Hello, what is vLLM?"}
    ],
    max_tokens=200
)

print(resp.choices[0].message.content)
```
```javascript
import OpenAI from "openai";

const client = new OpenAI({
  apiKey: "YOUR_API_KEY",
  baseURL: "https://YOUR_SERVER_ENDPOINT.apps.trooper.ai/v1"
});

const completion = await client.chat.completions.create({
  model: "Qwen/Qwen3-14B",
  messages: [
    { role: "user", content: "Hello from Node.js" }
  ],
  max_tokens: 200
});

console.log(completion.choices[0].message.content);
```
```php
<?php
$ch = curl_init("https://YOUR_SERVER_ENDPOINT.apps.trooper.ai/v1/chat/completions");

$data = [
    "model" => "Qwen/Qwen3-14B",
    "messages" => [
        ["role" => "user", "content" => "Hello from PHP"]
    ],
    "max_tokens" => 200
];

curl_setopt_array($ch, [
    CURLOPT_POST => true,
    CURLOPT_RETURNTRANSFER => true,
    CURLOPT_HTTPHEADER => [
        "Authorization: Bearer YOUR_API_KEY",
        "Content-Type: application/json"
    ],
    CURLOPT_POSTFIELDS => json_encode($data)
]);

$response = curl_exec($ch);
curl_close($ch);

echo $response;
```
```python
# Reuses the `client` from the Python example above.
resp = client.chat.completions.create(
    model="Qwen/Qwen3-14B",
    messages=[{"role": "user", "content": "Explain transformers"}],
    stream=True
)

for chunk in resp:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
```
```javascript
// Reuses the `client` from the Node.js example above.
const stream = await client.chat.completions.create({
  model: "Qwen/Qwen3-14B",
  messages: [{ role: "user", content: "Explain transformers" }],
  stream: true
});

for await (const chunk of stream) {
  process.stdout.write(chunk.choices[0].delta?.content || "");
}
```
The Trooper.AI vLLM template gives you a ready-to-run, OpenAI-compatible inference server: you only choose the model and API key, and everything else is optimized automatically.
For advanced tuning, multi-GPU, or custom presets, contact Trooper.AI support.