Trooper.AI provides a fully automated vLLM deployment template that installs, configures, and runs an OpenAI-compatible inference server on your GPU server using systemd.
The goal: a production-ready LLM endpoint with no manual tuning. The template handles installation, hardware detection, and service setup automatically; you only control a small set of public parameters.
You can run a wide range of HuggingFace large language models with vLLM. Make sure enough GPU VRAM is free: the model weights plus the context size, multiplied by the number of concurrent users, must fit into memory.
Trooper.AI automatically selects optimal precision per GPU architecture.
VRAM calculation: Model weights + ~25% KV-Cache buffer.
VRAM can be shared across multiple GPUs via Tensor Parallelism (--tensor-parallel-size N).
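The rule of thumb above (weights + ~25% KV-cache buffer, divided across tensor-parallel GPUs) can be sketched as a quick estimator. This is a rough heuristic for sizing, not the exact allocator math vLLM uses:

```python
def estimate_vram_gb(params_billion: float, bytes_per_param: float, tp_size: int = 1) -> float:
    """Rough per-GPU VRAM estimate: weights + ~25% KV-cache buffer.

    bytes_per_param: 2 for BF16/FP16, 1 for FP8.
    tp_size: number of GPUs sharing the model via tensor parallelism.
    """
    weights_gb = params_billion * bytes_per_param  # 1B params ~= 1 GB per byte/param
    total_gb = weights_gb * 1.25                   # add ~25% KV-cache buffer
    return total_gb / tp_size                      # tensor parallelism splits VRAM linearly

# Llama-3.1-70B in FP8 (1 byte/param):
print(round(estimate_vram_gb(70, 1), 1))              # 87.5 GB total -> matches "~90 GB"
print(round(estimate_vram_gb(70, 1, tp_size=2), 1))   # 43.8 GB per GPU with 2-way TP
```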
| Model | Parameters | Precision | Min. VRAM Total | GPU Configuration | GPUs |
|---|---|---|---|---|---|
| Qwen/Qwen3-4B | 4B | BF16 | ~8 GB | 1× V100 16GB / RTX 4070 Ti Super | 1 |
| Qwen/Qwen3-8B | 8B | BF16 | ~20 GB | 1× RTX 3090 / RTX 4090 (24 GB) | 1 |
| Qwen/Qwen3-32B | 32B | FP8 | ~40 GB | 1× A100 40GB or 2× RTX 4090 (2×24 GB) | 1–2 |
| meta-llama/Llama-3.1-8B-Instruct | 8B | FP8 | ~20 GB | 1× RTX 3090 / RTX 4090 (24 GB) | 1 |
| meta-llama/Llama-3.1-70B-Instruct | 70B | FP8 | ~90 GB | 1× RTX Pro 6000 Blackwell (96 GB) or 2× A100 (2×40 GB) | 1–2 |
Note: FP8 is used on Ada/Hopper architectures (RTX 40-series, A100, H100) for maximum throughput. Trooper.AI automatically selects the optimal precision for your GPU. Multi-GPU setups use Tensor Parallelism; VRAM scales linearly across GPUs.
These parameters can be set via environment variables before running the installer.
| Variable | Description |
|---|---|
| `TOKEN` | API key for authentication |
| `modelname` | HuggingFace model path |
| `hf_token` | HuggingFace token (for gated models) |
| `commandline_args` | Optional extra vLLM CLI arguments |
The template automatically:

- Detects the GPU architecture (Volta, Ampere, Ada, Hopper, Blackwell)
- Detects the VRAM size
- Selects the optimal precision for the detected hardware
- Uses an FP16 KV cache for stability
- Tunes runtime parameters for the detected hardware
- Installs vLLM with CUDA support
- Creates a systemd service: `vllm-server.service`
- Starts a persistent OpenAI-compatible API server on a secure HTTPS endpoint

No manual tuning is required.
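Because the server runs as the `vllm-server.service` unit, standard systemd tooling applies. A few illustrative commands to run on the GPU server itself:

```shell
# Check whether the vLLM server is running
systemctl status vllm-server.service

# Follow the server logs live
journalctl -u vllm-server.service -f

# Restart the service (e.g. after changing configuration)
sudo systemctl restart vllm-server.service
```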
Base URL:
http://YOUR_SERVER:PORT/v1
Endpoints:
- `/v1/models`
- `/v1/completions`
- `/v1/chat/completions`

Authentication header:
Authorization: Bearer YOUR_TOKEN_FROM_CONFIG
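A quick smoke test of the endpoint and token can be done with `curl` (the hostname and token are placeholders from this page's examples):

```shell
# List the models served by the instance
curl https://YOUR_SERVER_ENDPOINT.apps.trooper.ai/v1/models \
  -H "Authorization: Bearer YOUR_TOKEN_FROM_CONFIG"
```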
Python:

```python
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_API_KEY",
    base_url="https://YOUR_SERVER_ENDPOINT.apps.trooper.ai/v1"
)

resp = client.chat.completions.create(
    model="Qwen/Qwen3-14B",
    messages=[
        {"role": "user", "content": "Hello, what is vLLM?"}
    ],
    max_tokens=200
)
print(resp.choices[0].message.content)
```
Node.js:

```javascript
import OpenAI from "openai";

const client = new OpenAI({
  apiKey: "YOUR_API_KEY",
  baseURL: "https://YOUR_SERVER_ENDPOINT.apps.trooper.ai/v1"
});

const completion = await client.chat.completions.create({
  model: "Qwen/Qwen3-14B",
  messages: [
    { role: "user", content: "Hello from Node.js" }
  ],
  max_tokens: 200
});
console.log(completion.choices[0].message.content);
```
PHP:

```php
<?php
$ch = curl_init("https://YOUR_SERVER_ENDPOINT.apps.trooper.ai/v1/chat/completions");
$data = [
    "model" => "Qwen/Qwen3-14B",
    "messages" => [
        ["role" => "user", "content" => "Hello from PHP"]
    ],
    "max_tokens" => 200
];
curl_setopt_array($ch, [
    CURLOPT_POST => true,
    CURLOPT_RETURNTRANSFER => true,
    CURLOPT_HTTPHEADER => [
        "Authorization: Bearer YOUR_API_KEY",
        "Content-Type: application/json"
    ],
    CURLOPT_POSTFIELDS => json_encode($data)
]);
$response = curl_exec($ch);
curl_close($ch);
echo $response;
```
Streaming (Python, using the `client` from above):

```python
resp = client.chat.completions.create(
    model="Qwen/Qwen3-14B",
    messages=[{"role": "user", "content": "Explain transformers"}],
    stream=True
)
for chunk in resp:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
```
Streaming (Node.js, using the `client` from above):

```javascript
const stream = await client.chat.completions.create({
  model: "Qwen/Qwen3-14B",
  messages: [{ role: "user", content: "Explain transformers" }],
  stream: true
});
for await (const chunk of stream) {
  process.stdout.write(chunk.choices[0].delta?.content || "");
}
```
Trooper.AI vLLM servers are designed for persistent, OpenAI-compatible production inference. The template relies on automatic hardware detection and a systemd-managed service, which avoids manual installation and tuning work. In short, the Trooper.AI vLLM template gives you a ready-to-use inference endpoint: you only choose the model and API key, and everything else is optimized automatically.
For advanced tuning, multi-GPU, or custom presets, contact Trooper.AI support.