vLLM OpenAI-Compatible Server

Trooper.AI provides a fully automated vLLM deployment template that installs, configures, and runs an OpenAI-compatible inference server on your GPU server using systemd.

The goal: a production-ready, OpenAI-compatible LLM endpoint with no manual tuning.

The template automatically detects your GPU, selects the optimal precision, tunes batching and memory use, installs vLLM, and runs it as a persistent systemd service.

You only control a small set of public parameters.


Model Size & GPU Requirements

You can run a wide range of large language models from HuggingFace with vLLM. Ensure sufficient free GPU VRAM: the card must hold the model weights plus a KV cache that grows with both context size and the number of concurrent users.

Trooper.AI automatically selects optimal precision per GPU architecture.

VRAM calculation: Model weights + ~25% KV-Cache buffer.
VRAM can be shared across multiple GPUs via Tensor Parallelism (--tensor-parallel-size N).

Model                             | Parameters | Precision | Min. Total VRAM | GPU Configuration                                        | GPUs
Qwen/Qwen3-4B                     | 4B         | BF16      | ~8 GB           | 1× V100 16 GB / RTX 4070 Ti Super                        | 1
Qwen/Qwen3-8B                     | 8B         | BF16      | ~20 GB          | 1× RTX 3090 / RTX 4090 (24 GB)                           | 1
Qwen/Qwen3-32B                    | 32B        | FP8       | ~40 GB          | 1× A100 40 GB or 2× RTX 4090 (2× 24 GB)                  | 1–2
meta-llama/Llama-3.1-8B-Instruct  | 8B         | FP8       | ~20 GB          | 1× RTX 3090 / RTX 4090 (24 GB)                           | 1
meta-llama/Llama-3.1-70B-Instruct | 70B        | FP8       | ~90 GB          | 1× RTX Pro 6000 Blackwell (96 GB) or 2× A100 (2× 40 GB)  | 1–2

Note: FP8 is used on Ada and Hopper architectures (RTX 40-series, H100) for maximum throughput. Trooper.AI automatically selects the optimal precision for your GPU.
Multi-GPU setups use Tensor Parallelism; available VRAM scales linearly with the number of GPUs.
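
As a rough illustration (not part of the template), the rule of thumb above can be expressed in Python. The helper name and the 25% default are assumptions taken from this page, not an official formula:

```python
def estimate_vram_gb(params_b, bytes_per_param, kv_buffer=0.25, num_gpus=1):
    """Rule of thumb: model weights + ~25% KV-cache buffer,
    split across GPUs via tensor parallelism (--tensor-parallel-size N)."""
    weights_gb = params_b * bytes_per_param   # e.g. 8B params * 2 bytes (BF16) = 16 GB
    total_gb = weights_gb * (1 + kv_buffer)   # add the ~25% KV-cache buffer
    return total_gb / num_gpus                # per-GPU share under tensor parallelism

# Qwen3-8B in BF16 (2 bytes per parameter) on one GPU:
print(estimate_vram_gb(8, 2))   # 20.0 -> matches the ~20 GB row above
```

The same estimate explains the multi-GPU rows: a 70B FP8 model (~1 byte per parameter) split across two GPUs needs roughly half of its ~87.5 GB total on each card.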


Public Parameters

These parameters can be set via environment variables before running the installer.

Variable         | Description
TOKEN            | API key for authentication
modelname        | HuggingFace model path
hf_token         | HuggingFace token (for gated models)
commandline_args | Optional extra vLLM CLI arguments
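
For example, the variables above could be exported in the shell session that runs the installer. All values below are placeholders, not defaults:

```shell
# Illustrative values only -- substitute your own token and model.
export TOKEN="sk-example-api-key"               # API key clients must send
export modelname="Qwen/Qwen3-8B"                # HuggingFace model path
export hf_token="hf_xxx"                        # only needed for gated models
export commandline_args="--max-model-len 8192"  # optional extra vLLM flags
# ...then run the installer in the same shell session.
```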

What the Template Does

  1. Detects GPU architecture (Volta, Ampere, Ada, Hopper, Blackwell)

  2. Detects VRAM size

  3. Selects optimal precision automatically:

    • FP8 > BF16 > FP16
  4. Uses FP16 KV cache for stability

  5. Tunes:

    • max concurrent sequences
    • batched token size
    • memory utilization
  6. Installs vLLM with CUDA

  7. Creates a systemd service:

    Code
    vllm-server.service
    
  8. Starts a persistent OpenAI-compatible API server on a secure HTTPS endpoint.

No manual tuning is required.
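
For orientation, a generated unit could look something like the sketch below. The installer writes the real file, so every path, flag, and value here is an assumption; only `vllm serve` itself is vLLM's standard entrypoint:

```ini
# /etc/systemd/system/vllm-server.service -- illustrative sketch only
[Unit]
Description=vLLM OpenAI-compatible server
After=network-online.target

[Service]
# Paths, port, and env file below are assumptions, not the template's actual values.
EnvironmentFile=/etc/vllm-server.env
ExecStart=/opt/vllm/bin/vllm serve ${modelname} --port 8000 --api-key ${TOKEN}
Restart=always

[Install]
WantedBy=multi-user.target
```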


API Endpoints

Base URL:

Code
https://YOUR_SERVER:PORT/v1

Endpoints:

  • /v1/models
  • /v1/completions
  • /v1/chat/completions

Authentication header:

Code
Authorization: Bearer YOUR_TOKEN_FROM_CONFIG

Python Client Example

python
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_API_KEY",
    base_url="https://YOUR_SERVER_ENDPOINT.apps.trooper.ai/v1"
)

resp = client.chat.completions.create(
    model="Qwen/Qwen3-14B",
    messages=[
        {"role": "user", "content": "Hello, what is vLLM?"}
    ],
    max_tokens=200
)

print(resp.choices[0].message.content)

Node.js Client Example

javascript
import OpenAI from "openai";

const client = new OpenAI({
  apiKey: "YOUR_API_KEY",
  baseURL: "https://YOUR_SERVER_ENDPOINT.apps.trooper.ai/v1"
});

const completion = await client.chat.completions.create({
  model: "Qwen/Qwen3-14B",
  messages: [
    { role: "user", content: "Hello from Node.js" }
  ],
  max_tokens: 200
});

console.log(completion.choices[0].message.content);

PHP Client Example

php
<?php

$ch = curl_init("https://YOUR_SERVER_ENDPOINT.apps.trooper.ai/v1/chat/completions");

$data = [
  "model" => "Qwen/Qwen3-14B",
  "messages" => [
    ["role" => "user", "content" => "Hello from PHP"]
  ],
  "max_tokens" => 200
];

curl_setopt_array($ch, [
  CURLOPT_POST => true,
  CURLOPT_RETURNTRANSFER => true,
  CURLOPT_HTTPHEADER => [
    "Authorization: Bearer YOUR_API_KEY",
    "Content-Type: application/json"
  ],
  CURLOPT_POSTFIELDS => json_encode($data)
]);

$response = curl_exec($ch);
curl_close($ch);

echo $response;

Streaming Examples

Python Streaming

python
resp = client.chat.completions.create(
    model="Qwen/Qwen3-14B",
    messages=[{"role":"user","content":"Explain transformers"}],
    stream=True
)

for chunk in resp:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)

Node.js Streaming

javascript
const stream = await client.chat.completions.create({
  model: "Qwen/Qwen3-14B",
  messages: [{ role: "user", content: "Explain transformers" }],
  stream: true
});

for await (const chunk of stream) {
  process.stdout.write(chunk.choices[0].delta?.content || "");
}

Use Cases

Trooper.AI vLLM servers are designed for:

  • SaaS AI backends
  • Chatbots
  • Code assistants
  • RAG systems
  • Multi-user inference servers
  • High throughput batch inference
  • GPU rental environments

Performance Philosophy

Trooper.AI uses:

  • Automatic architecture tuning
  • Automatic precision selection
  • VRAM-aware batching
  • Stable KV cache configuration

This avoids:

  • GPU misconfiguration
  • Precision crashes
  • VRAM fragmentation
  • Context instability

Summary

The Trooper.AI vLLM template gives you:

  • OpenAI-compatible API
  • Automatic GPU optimization
  • Production-safe defaults
  • Minimal configuration
  • Maximum throughput

You only choose the model and API key.

Everything else is optimized automatically.


Support

For advanced tuning, multi-GPU, or custom presets, contact Trooper.AI support.