Introduction

Running Ministral-3 on vLLM is surprisingly powerful. The model is fast, creative, and capable of producing high-quality responses even under heavy workloads.

But once you move from simple chat prompts to structured outputs, automation, or programmatic use, things quickly get complicated.

During the development of an AI-driven Jeopardy game with 12 simultaneous players and hundreds of model calls, we encountered several practical issues:

Markdown appearing in plain-text responses
JSON outputs becoming deeply nested
Token budgets getting exceeded
Small formatting differences breaking downstream logic

This guide summarizes the practical lessons learned while solving these issues, along with concrete patterns you can reuse in your own projects.

The Real-World Scenario

Our benchmark project simulates a full Jeopardy game where:

The host AI generates categories and clues
Multiple AI contestants answer simultaneously
Another AI judge evaluates correctness
Scores are tracked across rounds

A single game run can easily exceed 800 API calls.

This environment exposed edge cases that rarely appear in simple demos — making it a great test bed for understanding how Ministral behaves under real production workloads.

1. Serving Ministral-3 on vLLM

Ministral models do not use the standard HuggingFace tokenizer configuration.

This means the launch command must explicitly enable the Mistral tokenizer format.

bash

vllm serve mistralai/Ministral-3-14B-Instruct-2512 \
  --tokenizer_mode mistral \
  --config_format mistral \
  --load_format mistral

If your application relies on function calling, add the tool flags:

bash

--enable-auto-tool-choice
--tool-call-parser mistral

Important limitation

Unlike some other models, Ministral does not support chat_template_kwargs.

If you send a request like this:

json

{
  "chat_template_kwargs": {
    "enable_thinking": false
  }
}

vLLM returns:

Code

HTTP 400: chat_template is not supported for Mistral tokenizers

That means features such as explicit “thinking mode” toggling (used with models like Qwen or DeepSeek) are simply not available.

Fortunately, this is rarely needed because Ministral already produces concise outputs by default.

2. Temperature: The Single Most Important Parameter

The official vLLM documentation consistently uses the following value with Ministral-3:

Code

temperature = 0.15

At first glance this seems extremely low. However, it turns out to be critical for structured tasks.

What happens at higher temperatures

Using the OpenAI-style default:

javascript

temperature: 0.7

the model becomes overly creative with structure.

A simple request like:

json

{ "expertise": "2-3 topics they know best" }

may return something like:

json

{
  "expertise": [
    {
      "category": "Gourmet Pizza Alchemy",
      "detail": "Can transform random ingredients into Michelin-star pizza"
    },
    {
      "category": "Sumo Wrestling Physics",
      "detail": "Understands body mechanics and center-of-gravity combat"
    }
  ]
}

While technically valid JSON, it is not what the schema asked for.

The result:

JSON becomes larger than expected
max_tokens limits are exceeded
Responses get truncated
Parsing fails

Why `0.15` works better

At low temperature the model becomes structurally disciplined.

javascript

temperature: 0.15

Benefits:

JSON follows schemas precisely
token usage becomes predictable
nested structures almost disappear
responses stay within token limits

Even creative text generation remains strong — the model simply stops improvising with structure.

Recommendation: Use temperature: 0.15 as your default for Ministral-3.

3. Getting Clean JSON Responses

Producing machine-readable JSON from LLMs is harder than it sounds.

Ministral tends to interpret schema fields semantically instead of structurally, which leads to deeply nested output.

The naive approach

A prompt like:

Code

Return JSON with these fields.

often produces verbose structures.

Example request:

json

{ "expertise": "2-3 topics they know best" }

Typical response:

json

{
  "expertise": [
    {
      "category": "Ancient Roman Engineering",
      "detail": "Knows aqueduct systems in surprising detail"
    },
    {
      "category": "Pizza Dough Chemistry",
      "detail": "Obsessed with yeast fermentation dynamics"
    }
  ]
}

This consumes three times the expected tokens.

The reliable fix: Two-layer prompting

The most reliable solution combines two instructions.

Layer 1 — System instruction

Code

Respond with ONLY valid JSON.
No markdown, no explanation, no text before or after the JSON.
Keep values as short plain strings — never use nested objects or arrays.

Layer 2 — Schema constraint

Right next to the schema definition:

Code

Every value MUST be a short plain string — NO arrays, NO nested objects.

Combined with temperature 0.15, this produces predictable flat JSON.

Token budgeting

Even when constrained, Ministral tends to produce longer values than other models.

Example observation from our benchmark:

Model	Tokens needed
GPT-4o	~512
Qwen	~512
Ministral-3	~1024

A safe rule:

Budget 1.5–2× the tokens for JSON outputs.

Defensive JSON parsing

Even with perfect prompts, models occasionally generate malformed JSON.

A good strategy is to add defensive parsing layers.

1) Extract JSON from markdown

javascript

function extractJSON(raw, shape) {
  var text = raw.replace(/^```(?:json)?\s*/i, '').replace(/\s*```$/i, '').trim();
  if (shape === 'array') {
    var m = text.match(/\[[\s\S]*\]/);
    if (m) text = m[0];
  } else {
    var m = text.match(/\{[\s\S]*\}/);
    if (m) text = m[0];
  }
  return text;
}

2) Repair truncated output

Track open brackets and close them automatically:

javascript

var stack = [];
var inStr = false, esc = false;

for (var i = 0; i < text.length; i++) {
  var ch = text[i];

  if (esc) { esc = false; continue; }
  if (ch === '\\') { esc = true; continue; }
  if (ch === '"') { inStr = !inStr; continue; }
  if (inStr) continue;

  if (ch === '{') stack.push('}');
  else if (ch === '[') stack.push(']');
  else if (ch === '}' || ch === ']') stack.pop();
}

text = text.replace(/,\s*$/, '');

while (stack.length > 0)
  text += stack.pop();

3) Flatten nested values

If the model still returns nested structures:

javascript

if (Array.isArray(value)) {
  flat = value.map(function(item) {
    if (typeof item === 'string') return item;
    if (typeof item === 'object') return Object.values(item).join(' — ');
    return String(item);
  }).join(', ');
}

4) Retry failed requests

A simple retry loop dramatically increases reliability.

Because Ministral behaves consistently at low temperature, retries usually succeed.

Recommended:

Code

2–3 retry attempts

4. Getting Clean Plain-Text Responses

Ministral loves formatting.

Even when you ask for plain text, it tends to produce:

bold text
italic emphasis
headings
inline code formatting

This happens because the model ships with a built-in system prompt encouraging rich markdown formatting.

Why this matters

Many pipelines rely on simple string checks.

Example:

javascript

verdict.toUpperCase().startsWith('CORRECT')

But if the model returns:

Code

**CORRECT**

the check fails.

Solution: Always strip markdown

The safest approach is to normalize all outputs before processing.

javascript

function stripMarkdown(text) {
  if (!text) return text;

  var s = text.replace(/\*\*([^*]+)\*\*/g, '$1');
  s = s.replace(/__([^_]+)__/g, '$1');
  s = s.replace(/\*([^*]+)\*/g, '$1');
  s = s.replace(/^#{1,6}\s+/gm, '');
  s = s.replace(/`([^`]+)`/g, '$1');
  s = s.replace(/^```[a-z]*\s*$/gm, '');

  return s.trim();
}

Apply this to every model response, not just Ministral.

It avoids model-specific branching and keeps pipelines consistent.

5. Centralizing Model-Specific Behavior

If your system supports multiple model families (Mistral, Qwen, DeepSeek, Llama, etc.), the most maintainable design is to centralize model behavior in one place.

Example:

javascript

function buildModelProfile(modelName) {
  var lower = modelName.toLowerCase();
  var isMistral = lower.includes('mistral') || lower.includes('ministral');

  return {
    family: isMistral ? 'Mistral' : 'Generic',

    jsonSystemInstruction: isMistral
      ? 'Respond with ONLY valid JSON. No markdown. Keep values as short plain strings.'
      : 'You output only valid JSON. No markdown fences, no explanation.',

    jsonSchemaHint: isMistral
      ? ' Every value MUST be a short plain string — NO arrays, NO nested objects.'
      : '',

    jsonTemperature: isMistral ? 0.15 : 0.7,
    defaultTemperature: isMistral ? 0.15 : 0.7,

    plainTextInstruction: ' Do not use markdown formatting.'
  };
}

This allows the rest of your system to remain model-agnostic.

Adding a new model later becomes trivial.

Ministral-3 Cheat Sheet

Setting	Recommended Value	Reason
tokenizer_mode	mistral	Required for correct tokenizer
config_format	mistral	Required
load_format	mistral	Required
chat_template_kwargs	Do not send	Not supported
temperature	0.15	Prevents structural hallucination
JSON instruction	Explicit flat values	Avoid nested objects
max_tokens	1.5–2× typical	Model is verbose
Markdown stripping	Always	Prevent formatting errors
JSON retries	2–3 attempts	Reliable recovery

Final Thoughts

Ministral-3 performs extremely well when properly tuned.

Once you:

lower the temperature
constrain JSON structures
normalize markdown output
add defensive parsing

the model becomes remarkably predictable and production-ready.

In our Jeopardy benchmark, this setup supported:

12 concurrent AI contestants
800+ API calls per session
2,000+ tokens/sec throughput
consistent structured output

All running locally on Trooper.AI GPU infrastructure.

Get started

Try the full vLLM deployment template here: vLLM OpenAI-Compatible Server

Tuning Ministral-3 on vLLM

Introduction

The Real-World Scenario

1. Serving Ministral-3 on vLLM

Important limitation

2. Temperature: The Single Most Important Parameter

What happens at higher temperatures

Why `0.15` works better

3. Getting Clean JSON Responses

The naive approach

The reliable fix: Two-layer prompting

Layer 1 — System instruction

Layer 2 — Schema constraint

Token budgeting

Defensive JSON parsing

1) Extract JSON from markdown

2) Repair truncated output

3) Flatten nested values

4) Retry failed requests

4. Getting Clean Plain-Text Responses

Why this matters

Solution: Always strip markdown

5. Centralizing Model-Specific Behavior

Ministral-3 Cheat Sheet

Final Thoughts

Get started

WHAT ARE YOU WAITING FOR?

Rent a GPU Server for AI

Tuning Ministral-3 on vLLM

Introduction

The Real-World Scenario

1. Serving Ministral-3 on vLLM

Important limitation

2. Temperature: The Single Most Important Parameter

What happens at higher temperatures

Why 0.15 works better

3. Getting Clean JSON Responses

The naive approach

The reliable fix: Two-layer prompting

Layer 1 — System instruction

Layer 2 — Schema constraint

Token budgeting

Defensive JSON parsing

1) Extract JSON from markdown

2) Repair truncated output

3) Flatten nested values

4) Retry failed requests

4. Getting Clean Plain-Text Responses

Why this matters

Solution: Always strip markdown

5. Centralizing Model-Specific Behavior

Ministral-3 Cheat Sheet

Final Thoughts

Get started

WHAT ARE YOU WAITING FOR?

Rent a GPU Server for AI

Why `0.15` works better