Developer

Build on the GPU you already own.

BareMetalRT exposes a local inference API powered by NVIDIA TensorRT-LLM — the production data-center engine — running natively on Windows. Point your code at your own RTX card. No cloud round-trip, no per-token billing, no Linux, no WSL. Your data never leaves the machine.

Windows-native. Every example on this page runs in PowerShell on Windows 10/11 against a stock install. No Docker, no compatibility layer. If a snippet mentions curl, it's the real curl.exe shipped with Windows.

System requirements

BareMetalRT is a native Windows runtime on NVIDIA RTX hardware. There is no CPU, AMD, or Apple-Silicon path — the engine is TensorRT-LLM.

| Component | Requirement | Notes |
| --- | --- | --- |
| Operating system | Windows 10 or 11 (64-bit) | No Linux, macOS, WSL, or Docker. |
| GPU | NVIDIA RTX, 20-series or newer | Ampere / Ada / Blackwell. No CPU or AMD fallback. |
| VRAM | 8 GB minimum | Decides which models you can run; see tiers below. |
| Driver | 545 or newer | Required for CUDA 12.4. Newer is fine; update before installing. |
| System RAM | 16 GB recommended | 8 GB works for the smallest models. |
| Disk | ~15 GB + models | Pinned runtime + NVIDIA SDKs, plus 1–8 GB per model you download. |

Two NVIDIA SDKs (CUDA Toolkit and TensorRT) are also required — they're a one-time manual download covered in step 2 of Install.

VRAM tiers — what runs on your card

The catalog tags every model with its tier, and the app hides models that won't fit your card.

Install BareMetalRT Windows

One installer puts the inference daemon on your GPU and pins its own private Python runtime. You supply two NVIDIA SDKs — CUDA and TensorRT — that NVIDIA's license does not let us redistribute; everything else is downloaded for you.

Update your NVIDIA driver first. BareMetalRT needs a recent NVIDIA driver — 580 or newer. Newer is fine — grab the latest Game Ready or Studio driver or update through the NVIDIA App / GeForce Experience. See System requirements for the full hardware list.

1. Confirm your card fits a model tier

Your card's VRAM decides which models you can run — check the VRAM tiers above. The catalog tags every model with its tier and the app hides models that won't fit, so this is a quick sanity check before you install.

2. Download & run the installer

Grab the latest BareMetalRT-Setup.exe from the releases page and run it. That's the whole install. A private, version-pinned Python build, the matched PyTorch + TensorRT-LLM wheels, MPI, and the VC++ redistributable are all bundled. No separate CUDA Toolkit or TensorRT install is required — the installer runs a quick NVIDIA-driver check and sets up everything else for you.

Don't pip-install your own torch or tensorrt. The daemon runs against a bundled, version-pinned Python runtime — the Python and C++ TensorRT-LLM versions must match. Installing your own is the most common way to break inference. If something stops working after a manual change, reinstall from the .exe to restore the pinned set.

3. Open the app & connect this GPU to your account

When setup finishes, the app opens on your machine at http://localhost:8080/app — the full BareMetalRT runs locally, on your own hardware. The first time, it shows “Connect this GPU to your account.” Click it, sign in once, and the box links to your account. That's the only sign-in — after this the app runs locally with your account attached, no login, and nothing leaves your machine. (Want to reach this box from another device? That goes through the web portal at baremetalrt.ai/app, which relays back here.)

The daemon runs in the background — you don't keep a window open. Setup installs a startup task, so BareMetalRT launches automatically when Windows starts and keeps serving in the background. It lives in the system tray (the BareMetalRT icon by the clock): right-click → Open Dashboard reopens the local app at any time, and Restart / Quit control the daemon. You can also browse to http://localhost:8080/app directly. If it isn't running, launch BareMetalRT from the Start menu (or its desktop shortcut) to start the background task again.

4. Add a Hugging Face token (for gated models)

Some models — anything gated on Hugging Face — need a read token. Paste yours into the app under settings; it's stored per-user and encrypted. Ungated catalog models work without one.

5. Verify it works

Load a small model (Qwen3 0.6B is the fastest first run) and send a prompt. A clean install answers a question like “What is the capital of France?” with “Paris” in well under a second per token. If you get a streamed reply, you're done.

Voice is on by default. Speech-to-text and text-to-speech install automatically — no install-time opt-in. To turn it off, use the toggle in the app's settings.

Get to know the stack

Six ways to drive the engine. Pick the one that fits how you already build.

REST API Live

Native streaming endpoint

A token-streaming /api/chat SSE endpoint served by the daemon on your box (and proxied through the cloud relay when you're signed in). Works today.

OpenAI compatible Live

Drop-in /v1

A /v1/chat/completions surface so every tool built for the OpenAI API points at your RTX GPU unchanged. Live through the relay with an API key.

Python

openai SDK

Use the official openai package with a custom base_url. The same three lines you already wrote — different host.

JavaScript / TypeScript

fetch & SSE

No SDK required. Hit the endpoint with fetch and read the Server-Sent Events stream straight off the response body.

CLI Preview

bmrt

Load, list, and serve models from PowerShell. Scriptable for CI, agents, and scheduled tasks. Ships with the v1.0 headless tooling.

API keys Live

Bearer auth

Mint bmrt_… keys in Account Settings to reach your GPU through the cloud relay from anywhere. Revoke any time.

Super quick start

Ask your own GPU a question. The native /api/chat endpoint streams tokens back as SSE — this works on a stock install right now.

# Stream a reply straight off your local daemon (port 8080)
curl.exe -N http://localhost:8080/api/chat `
  -H "Content-Type: application/json" `
  -d '{ "message": "Who are you, and what can you do?", "max_tokens": 256 }'

# → data: {"token": "I", "token_id": 40, "time_ms": 9.1}
# → data: {"token": "'m", ...}
# → data: {"done": true, "total_tokens": 128, "stop_reason": "eos"}

# pip install requests
import json, requests

r = requests.post(
    "http://localhost:8080/api/chat",
    json={"message": "Who are you, and what can you do?", "max_tokens": 256},
    stream=True,
)
for line in r.iter_lines():
    if not line or not line.startswith(b"data: "):
        continue
    evt = json.loads(line[6:])
    if evt.get("done"):
        break
    print(evt.get("token", ""), end="", flush=True)

// Node 18+ (ES module): fetch and top-level await are built in
const res = await fetch("http://localhost:8080/api/chat", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({ message: "Who are you, and what can you do?", max_tokens: 256 }),
});

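const reader = res.body.getReader();
const dec = new TextDecoder();
let buf = "";
let finished = false;
while (!finished) {
  const { value, done } = await reader.read();
  if (done) break;
  buf += dec.decode(value, { stream: true });   // a frame can span two reads
  const lines = buf.split("\n");
  buf = lines.pop();                             // keep the partial last line for the next read
  for (const line of lines) {
    if (!line.startsWith("data: ")) continue;
    const evt = JSON.parse(line.slice(6));
    if (evt.done) { finished = true; break; }    // stop the outer loop too
    process.stdout.write(evt.token ?? "");
  }
}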

Request body

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| message | string | (none) | The user turn to respond to. Required. |
| max_tokens | integer | 2048 | Max tokens to generate. Capped at 4096 per request. |
| history | array | [] | Prior turns as { "role": "user" or "assistant", "content": "…" }. |
| style | string | null | Optional sampling override: focused, balanced, or expressive. Validated against the active model's tier. See Sampling & styles. |
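As a rough illustration of the fields above, the helper below assembles a request body for a multi-turn call. The function name build_chat_body is hypothetical, and clamping max_tokens on the client is a choice made for this sketch (the table only says the server caps it at 4096).

```python
MAX_TOKENS_CAP = 4096                                  # per-request cap from the table above
STYLES = {"focused", "balanced", "expressive"}

def build_chat_body(message, history=None, max_tokens=2048, style=None):
    """Assemble an /api/chat body from the documented fields (hypothetical helper)."""
    if not message:
        raise ValueError("message is required")
    if style is not None and style not in STYLES:
        raise ValueError(f"unknown style: {style!r}")
    # Clamp locally so the request never asks for more than the documented cap
    body = {"message": message, "max_tokens": min(max_tokens, MAX_TOKENS_CAP)}
    if history:
        # Keep only the two documented keys of each prior turn
        body["history"] = [{"role": t["role"], "content": t["content"]} for t in history]
    if style is not None:
        body["style"] = style
    return body

body = build_chat_body(
    "And what about Germany?",
    history=[
        {"role": "user", "content": "What is the capital of France?"},
        {"role": "assistant", "content": "Paris."},
    ],
    max_tokens=10000,
    style="focused",
)
print(body["max_tokens"], len(body["history"]))        # 4096 2
```

The resulting dict can be passed as the json= argument of requests.post in the Python example above.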

Stream events

Each SSE frame is a data: line carrying one JSON object. Token frames carry token, token_id, and time_ms; the final frame carries done: true with total_tokens, truncated, and stop_reason.
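A minimal sketch of consuming those frames, assuming each one arrives as a single data: line, as in the examples above. The helper name collect_stream is hypothetical; it takes any iterable of byte lines, so it can be exercised offline.

```python
import json
from typing import Iterable, Tuple

def collect_stream(lines: Iterable[bytes]) -> Tuple[str, dict]:
    """Join token frames into text and return it together with the final done frame."""
    parts = []
    final = {}
    for raw in lines:
        if not raw.startswith(b"data: "):
            continue                                   # skip blank separators and non-data lines
        evt = json.loads(raw[len(b"data: "):])
        if evt.get("done"):
            final = evt                                # carries total_tokens, truncated, stop_reason
            break
        parts.append(evt.get("token", ""))
    return "".join(parts), final

# Offline demo with frames shaped like the documented ones
frames = [
    b'data: {"token": "Par", "token_id": 1, "time_ms": 9.1}',
    b"",
    b'data: {"token": "is", "token_id": 2, "time_ms": 8.7}',
    b'data: {"done": true, "total_tokens": 2, "truncated": false, "stop_reason": "eos"}',
]
text, final = collect_stream(frames)
print(text, final["stop_reason"])                      # Paris eos
```

With requests, the same function should accept r.iter_lines() from a stream=True response.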

Sampling & styles

Every model ships with a sampling preset tuned to its size, so you get coherent output by default. The optional style override trades determinism for variety — within a range that's safe for that model.

| style | Behavior | Good for |
| --- | --- | --- |
| focused | Greedy / deterministic: same prompt, same answer. | Extraction, classification, code, tool calls. |
| balanced | Light sampling: some variation, still tight. | General chat and Q&A. |
| expressive | More varied, creative sampling. | Brainstorming, writing, longer-form replies. |

Pass it on /api/chat, or omit it to use the model's default:

curl.exe -N http://localhost:8080/api/chat `
  -H "Content-Type: application/json" `
  -d '{ "message": "Extract the invoice total as a number.", "style": "focused" }'

Styles are tier-gated. A model's catalog default sets its maximum safe style. You can always step down toward focused, but stepping up past what a small model handles is rejected with 400 style_not_allowed (for example, expressive on a 1.5B model produces token salad, so it isn't offered). Query the styles a given model allows in the tuner field of /api/models.
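One way a client might cope with tier gating is to step down a style at a time when a request is rejected. The sketch below assumes the rejection arrives as HTTP 400 with the string style_not_allowed somewhere in the response body; the exact error format is not documented here. The send callable is a hypothetical stand-in for your HTTP call that returns (status, body_text).

```python
STYLE_ORDER = ["focused", "balanced", "expressive"]    # least to most varied

def send_with_style_fallback(send, message, style):
    """Try `style`, stepping down toward focused whenever the daemon rejects it."""
    start = STYLE_ORDER.index(style)
    for candidate in reversed(STYLE_ORDER[: start + 1]):
        status, text = send({"message": message, "style": candidate})
        if status == 400 and "style_not_allowed" in text:
            continue                                   # too expressive for this model: step down
        return candidate, status, text
    raise RuntimeError("every style was rejected")

# Offline demo: a fake transport that behaves like a small model
def fake_send(body):
    if body["style"] == "expressive":
        return 400, '{"error": "style_not_allowed"}'   # assumed error shape
    return 200, "ok"

print(send_with_style_fallback(fake_send, "Brainstorm names", "expressive"))
```

Reading the allowed styles from /api/models up front avoids the extra round-trips; the fallback is only a safety net.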

On the OpenAI /v1 surface, temperature and top_p are accepted but not applied — sampling is owned by the model's tier. Use the native /api/chat style override above to influence it.

Authentication Live

Two separate things: signing in to your account (to manage nodes, keys, and team), and API keys (to call your GPU programmatically). Local calls to localhost need neither — it's your machine.

Signing in

You can sign in to the app three ways:

Connecting a GitHub repo is separate from signing in. Linking a repository for the model to read uses its own consent with a broader scope and is covered under MCP & integrations — signing in with GitHub never grants access to your code.

API keys

To reach your GPU from elsewhere through the cloud relay, use an API key. Generate one in Account Settings. Keys are shown once, prefixed bmrt_, and scoped to inference and mesh. Pass it as a Bearer token:

# Reach your own GPU through the relay from anywhere
curl.exe -N https://baremetalrt.ai/api/chat `
  -H "Authorization: Bearer bmrt_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx" `
  -H "Content-Type: application/json" `
  -d '{ "message": "Hello from my laptop!", "max_tokens": 128 }'
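Because keys are shown only once, a common pattern is to keep the key out of source code and pick the target at run time. The sketch below assumes the key is stored in an environment variable named BMRT_API_KEY (a hypothetical name, not something the daemon reads) and falls back to the keyless local endpoint when it is unset.

```python
import os

RELAY_URL = "https://baremetalrt.ai/api/chat"
LOCAL_URL = "http://localhost:8080/api/chat"

def chat_target(env=None):
    """Return (url, headers): the relay with Bearer auth if a key is set, else keyless localhost."""
    env = os.environ if env is None else env
    headers = {"Content-Type": "application/json"}
    key = env.get("BMRT_API_KEY", "").strip()          # hypothetical variable name
    if not key:
        return LOCAL_URL, headers
    if not key.startswith("bmrt_"):
        raise ValueError("expected a key with the bmrt_ prefix")
    headers["Authorization"] = f"Bearer {key}"
    return RELAY_URL, headers

url, headers = chat_target({"BMRT_API_KEY": "bmrt_example"})
print(url, sorted(headers))
```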

Enterprise SSO / Identity Available

Let your team sign in with your organization's identity provider, map IdP groups to roles, and auto-provision/deprovision users. Off by default — local and demo use are unchanged.

Admins configure it under SSO / Identity settings. For your security team, the full architecture & token-validation details are on the SSO architecture & security page →

Team seats Available

The Team plan is billed per seat. You buy a number of seats and assign them to people; entitlement is enforced online, so an assignment or revocation takes effect on the member's next request.

A member without an assigned seat can still sign in to reach billing, but inference (chat and the /v1 endpoints) and API-key creation return 402 until a seat is assigned. Manage everything from the Seats console →

Compliance & security Available

What a security team needs to know before approving BareMetalRT on company hardware: where data lives, how access is controlled and revoked, and what is recorded. The short version — inference and your documents stay on the GPU you own, and the controls below are enforced on your machine, not ours.

Identity & access control

Audit log

The daemon keeps a tamper-evident, append-only audit log on the GPU host: a hash-chained JSONL record (each entry carries the SHA-256 of the previous one, so deletion or edits are detectable). It captures security-relevant on-box events — tool calls and approvals/denials, connector connect/remove, model loads, configuration changes, and sign-in success/failure — with secrets redacted at write time, rotation at 16 MB, and a queryable interface with CSV export.

Scope. The audit log records on-box activity (tool use, model and config operations, local auth). Identity-provider provisioning events (a user added or deactivated in your IdP) live in your IdP's own logs and in the account state on the orchestrator (active/revoked flags) — they are not duplicated into the daemon's event log. Off-box forwarding to Splunk (HEC) or syslog is supported and off until you configure it.
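To make the hash-chain idea concrete, here is an illustrative verifier. It is not the daemon's actual log format: the prev_hash field name, the all-zero genesis value, the event names, and hashing the previous raw line are all assumptions made for this sketch.

```python
import hashlib
import json

GENESIS = "0" * 64                                     # assumed prev_hash of the first entry

def append_entry(log_lines, event):
    """Append an event whose prev_hash is the SHA-256 of the previous raw line."""
    prev = hashlib.sha256(log_lines[-1].encode()).hexdigest() if log_lines else GENESIS
    log_lines.append(json.dumps({"prev_hash": prev, **event}, sort_keys=True))

def first_broken_link(log_lines):
    """Return the index of the first entry whose prev_hash does not match, or None."""
    expected = GENESIS
    for i, line in enumerate(log_lines):
        if json.loads(line).get("prev_hash") != expected:
            return i
        expected = hashlib.sha256(line.encode()).hexdigest()
    return None

log = []
append_entry(log, {"event": "model_load", "model": "qwen3-4b-int4"})
append_entry(log, {"event": "tool_call", "tool": "fetch"})
append_entry(log, {"event": "sign_in", "ok": True})
print(first_broken_link(log))                          # None: the chain is intact

tampered = log[:1] + log[2:]                           # silently delete the middle entry
print(first_broken_link(tampered))                     # 1: the entry after the gap no longer links
```

Editing an entry in place is caught the same way, because the following entry's prev_hash stops matching.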

Data residency & air-gap

Privacy & offline

The model runs on your GPU, not ours. That's the whole point — your prompts and the model's replies stay on hardware you own.

Air-gap mode Available

For classified, regulated, or disconnected sites. With BMRT_AIRGAP=1 set, the daemon makes zero unsolicited outbound internet connections. Off by default — normal installs keep the in-app update banner.

For your security team, the no-egress guarantee, exactly what air-gap mode disables, and a self-serve verification procedure (firewall / packet capture / netstat, with a control test) are on the air-gap & no-egress attestation page →

What you can build

The same primitives you'd reach for against a hosted API — running on hardware you control.

01 Live

Chat & text generation

Token-by-token streaming with paged KV-cache and fused attention kernels. Multi-turn via history. Concurrent users on a single GPU.

02 Live

Model management

List, download, build, load, and unload models over the REST API while the daemon stays up. Hot-swap without a restart.

03 Live

Multi-GPU mesh

Split a model too big for one card across multiple RTX GPUs over plain Ethernet — tensor parallelism over our network transport. Set X-GPU-Mode: tp2.

04 Live

Tool calling & agents

A built-in MCP host and agent loop — connect tools from the one-click catalog and the model calls them mid-chat, with approval prompts for anything side-effecting. See MCP & integrations.

05 Preview

Structured output

Force responses to a JSON Schema with grammar-constrained sampling — valid JSON every time, no retry loop. Landing with the v1.0 tool-calling track.

06 Live

GPU telemetry

Real-time VRAM, temperature, utilization, and power off /api/gpu-metrics — wire your own dashboards and autoscalers.

OpenAI compatibility Live

LM Studio, Ollama, and friends ship an OpenAI-shaped endpoint so existing tools just work. So do we — pointed at NVIDIA's data-center engine instead of llama.cpp.

Every OpenAI client works by changing one line — the base_url. Use the relay host with your API key to reach your GPU from anywhere, or the local daemon for keyless, on-box calls:

# pip install openai — only base_url + api_key change
from openai import OpenAI

client = OpenAI(
    base_url="https://baremetalrt.ai/v1",        # relay → your GPU (needs a key)
    api_key="bmrt_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx",  # mint one in Account Settings
)
# or, on the box itself (latest daemon): base_url="http://localhost:8080/v1", api_key="bmrt_local"

resp = client.chat.completions.create(
    model="qwen3-4b-int4",
    messages=[{"role": "user", "content": "Who are you, and what can you do?"}],
    stream=True,
)
for chunk in resp:
    print(chunk.choices[0].delta.content or "", end="")

// npm install openai
import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "https://baremetalrt.ai/v1",
  apiKey: "bmrt_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx",
});

const stream = await client.chat.completions.create({
  model: "qwen3-4b-int4",
  messages: [{ role: "user", content: "Who are you, and what can you do?" }],
  stream: true,
});
for await (const part of stream) process.stdout.write(part.choices[0]?.delta?.content ?? "");

curl.exe https://baremetalrt.ai/v1/chat/completions `
  -H "Authorization: Bearer bmrt_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx" `
  -H "Content-Type: application/json" `
  -d '{ "model": "qwen3-4b-int4", "messages": [{"role":"user","content":"Hi"}] }'

How it works. The /v1 surface is a thin translation layer over the native /api/chat stream: your OpenAI messages[] map to a prompt, and the token stream maps back to chat.completion.chunk objects. Sampling is owned by each model's catalog tier, so temperature and top_p are accepted but not applied. The relay path is live now; the keyless localhost:8080/v1 ships with the next daemon update.

No SDK to learn. BareMetalRT doesn't ship its own client library: you use the standard openai (or anthropic) package you already know, pointed at our endpoint. It installs in your project's environment, not the daemon's bundled runtime. The two never share a dependency tree, so the "don't pip-install into the daemon" rule from Install doesn't apply to your app. Change one base_url and every OpenAI-shaped tool, framework, and SDK works.
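The translation described under "How it works" can be pictured with a small sketch. This is illustrative only, not the daemon's code: it wraps one native frame in an OpenAI-style chat.completion.chunk, and the mapping of truncated to finish_reason "length" and the placeholder completion id are assumptions.

```python
import time

def to_chunk(frame, model, completion_id="chatcmpl-local", created=None):
    """Wrap one native /api/chat frame in an OpenAI-style chat.completion.chunk."""
    created = int(time.time()) if created is None else created
    if frame.get("done"):
        delta = {}
        finish = "length" if frame.get("truncated") else "stop"   # assumed mapping
    else:
        delta = {"content": frame.get("token", "")}
        finish = None
    return {
        "id": completion_id,                           # placeholder id for the sketch
        "object": "chat.completion.chunk",
        "created": created,
        "model": model,
        "choices": [{"index": 0, "delta": delta, "finish_reason": finish}],
    }

token_chunk = to_chunk({"token": "Hi", "token_id": 40, "time_ms": 9.1}, "qwen3-4b-int4", created=0)
last_chunk = to_chunk({"done": True, "total_tokens": 1, "stop_reason": "eos"}, "qwen3-4b-int4", created=0)
print(token_chunk["choices"][0]["delta"], last_chunk["choices"][0]["finish_reason"])
```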

Anthropic compatibility Live

Built your app on the Claude SDK? Point it here too. The /v1/messages endpoint speaks the Anthropic Messages wire format — named SSE events and all — so the anthropic SDK works unchanged.

The Anthropic SDK authenticates with x-api-key, which the relay accepts as your bmrt_ key:

# pip install anthropic — only base_url + api_key change
from anthropic import Anthropic

client = Anthropic(
    base_url="https://baremetalrt.ai",
    api_key="bmrt_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx",
)

with client.messages.stream(
    model="qwen3-1.7b",
    max_tokens=256,
    messages=[{"role": "user", "content": "Who are you, and what can you do?"}],
) as stream:
    for text in stream.text_stream:
        print(text, end="", flush=True)

curl.exe https://baremetalrt.ai/v1/messages `
  -H "x-api-key: bmrt_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx" `
  -H "anthropic-version: 2023-06-01" `
  -H "Content-Type: application/json" `
  -d '{ "model": "qwen3-1.7b", "max_tokens": 256,
       "messages": [{"role":"user","content":"Hi"}] }'

Same engine, two dialects. Responses come back as Anthropic message objects with a content block array and stop_reason (end_turn / max_tokens). system is read from the top-level field; tool blocks aren't generated yet (see Tool use).

Vision & multimodal Live

Attach an image and ask about it — photos, screenshots, charts, and documents, read on your own GPU. Load a vision-language model, then send images alongside your prompt.

Vision-language models in the catalog include Qwen3-VL (2B / 4B / 8B), Qwen2-VL 2B, Pixtral 12B, Phi-4 Multimodal, Gemma 3 (4B / 12B / 27B), and Llama 3.2 Vision. Load one the usual way (see Managing models); the daemon cold-starts it with image support enabled.

Send images on the native /api/chat endpoint with an images array — each entry is a base64 string or a data: URL (PNG, JPEG, GIF, or WebP):

# POST /api/chat — ask about an image
{
  "message": "What's in this image?",
  "images": [
    "data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAA..."
  ],
  "max_tokens": 512
}

Notes. Images are decoded and processed entirely on the GPU host; nothing is uploaded to a cloud vision service. Multiple images in the images array are folded into the same prompt. Only models with vision capability accept images; sending images to a text-only model has no effect.
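A small sketch of preparing the images array from raw bytes, using the data: URL form. The helper names are hypothetical, and sniffing the format from magic bytes is just one convenient way to pick the MIME type for the four accepted formats.

```python
import base64

MIME_BY_MAGIC = {
    b"\x89PNG\r\n\x1a\n": "image/png",
    b"\xff\xd8\xff": "image/jpeg",
    b"GIF87a": "image/gif",
    b"GIF89a": "image/gif",
}

def to_data_url(data: bytes) -> str:
    """Encode image bytes as a data: URL, detecting the type from its leading bytes."""
    for magic, mime in MIME_BY_MAGIC.items():
        if data.startswith(magic):
            break
    else:
        # WebP is a RIFF container with the tag WEBP at offset 8
        if data[:4] == b"RIFF" and data[8:12] == b"WEBP":
            mime = "image/webp"
        else:
            raise ValueError("not a PNG, JPEG, GIF, or WebP image")
    return f"data:{mime};base64,{base64.b64encode(data).decode('ascii')}"

def image_chat_body(message, images, max_tokens=512):
    """Build an /api/chat body with one data: URL per image."""
    return {"message": message, "images": [to_data_url(b) for b in images], "max_tokens": max_tokens}

png_header = b"\x89PNG\r\n\x1a\n" + b"\x00" * 4       # just enough bytes for the demo
body = image_chat_body("What's in this image?", [png_header])
print(body["images"][0][:33])                          # data:image/png;base64,iVBORw0KGgo
```

In practice the bytes would come from open(path, "rb").read() on a real image file.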

Tool use & function calling Experimental

The OpenAI-compatible tool-calling surface: pass tools, the model returns tool_calls, you run them and feed the results back. This is the shape we're building toward.

Live in the app today — /v1 passthrough is rolling out. Agentic tool-calling works now through the native /api/chat agent loop and the one-click MCP catalog (see MCP & integrations) — connect a tool and the model calls it. The OpenAI-shaped tools / tool_calls passthrough on /v1/chat/completions documented below is landing next; design against this shape now.

You describe each function with a JSON-Schema parameter spec. When the model decides to call one, the reply carries a tool_calls array and finish_reason: "tool_calls" instead of prose — identical to the OpenAI shape.

curl.exe https://baremetalrt.ai/v1/chat/completions `
  -H "Authorization: Bearer bmrt_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx" `
  -H "Content-Type: application/json" `
  -d '{
    "model": "qwen3-4b-int4",
    "messages": [{"role":"user","content":"What is the weather in Paris?"}],
    "tools": [{
      "type": "function",
      "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city",
        "parameters": {
          "type": "object",
          "properties": { "city": {"type":"string"} },
          "required": ["city"]
        }
      }
    }]
  }'

# → the model returns a tool call instead of text:
# "finish_reason": "tool_calls",
# "message": { "role":"assistant", "tool_calls": [
#   { "id":"call_1", "type":"function",
#     "function": { "name":"get_weather", "arguments":"{\"city\":\"Paris\"}" } } ] }

# pip install openai: the standard tool loop works unchanged
from openai import OpenAI
import json

def get_weather(city):
    # Stand-in for your own implementation; replace with a real lookup
    return {"city": city, "summary": "placeholder"}

client = OpenAI(base_url="https://baremetalrt.ai/v1", api_key="bmrt_…")
tools = [{"type": "function", "function": {
    "name": "get_weather",
    "parameters": {"type": "object",
        "properties": {"city": {"type": "string"}}, "required": ["city"]}}}]

messages = [{"role": "user", "content": "Weather in Paris?"}]
r = client.chat.completions.create(model="qwen3-4b-int4", messages=messages, tools=tools)

msg = r.choices[0].message
if not msg.tool_calls:                                # the model answered in prose; nothing to run
    raise SystemExit(msg.content)
call = msg.tool_calls[0]
args = json.loads(call.function.arguments)          # {"city": "Paris"}
result = get_weather(**args)                          # you run the function

# feed the call + result back, then ask for the final answer
messages += [msg,
    {"role": "tool", "tool_call_id": call.id, "content": json.dumps(result)}]
final = client.chat.completions.create(model="qwen3-4b-int4", messages=messages, tools=tools)
print(final.choices[0].message.content)

Native vs. constrained (planned). Models with a tool-trained chat template (Qwen3, Llama 3.1/3.2, Ministral) will emit calls natively; for the rest, grammar-constrained decoding forces a syntactically valid call. Either path returns the same OpenAI tool_calls shape, so your code doesn't branch on the model.

Workflows Live

A repeatable process that runs the same way every time. Build it as an ordered funnel — Input → Action → AI → Output — and the daemon walks the funnel in order on your GPU host, so control flow is enforced, not improvised by the model.

Define the process with named fields (goal, process_owner, inputs, outputs), pick the model it runs on (it loads automatically before the run), and add ordered nodes. Each node is one of four types:

Each node's result is piped into later nodes with {{nodeId}} placeholders — e.g. an ai step references an earlier input as {{n1}} — so every run is repeatable and auditable.
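The placeholder piping can be sketched as follows. This is an illustration of the idea rather than the daemon's implementation: the node keys id, type, and template and the execute callback are hypothetical names chosen for the example.

```python
import re

PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")

def render(template: str, results: dict) -> str:
    """Replace each {{nodeId}} with that node's earlier result."""
    def sub(match):
        node_id = match.group(1)
        if node_id not in results:
            raise KeyError(f"node {node_id!r} has not produced a result yet")
        return str(results[node_id])
    return PLACEHOLDER.sub(sub, template)

def run_funnel(nodes, execute):
    """Walk the nodes strictly in order, feeding earlier results into later templates."""
    results = {}
    for node in nodes:
        prompt = render(node["template"], results)
        results[node["id"]] = execute(node, prompt)
    return results

nodes = [
    {"id": "n1", "type": "input", "template": "invoice #42"},
    {"id": "n2", "type": "ai", "template": "Summarize: {{n1}}"},
]
# Stand-in executor: inputs pass through, ai steps are tagged instead of calling a model
results = run_funnel(nodes, lambda node, prompt: prompt if node["type"] == "input" else f"[ai] {prompt}")
print(results["n2"])                                   # [ai] Summarize: invoice #42
```

Because a placeholder can only refer to a node that has already run, forward references fail loudly instead of producing an empty prompt.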

GET /api/workflows
POST /api/workflows — create
GET /api/workflows/_tools — tools available to action nodes
POST /api/workflows/{id}/run
POST /api/workflows/{id} — update
DELETE /api/workflows/{id}

Agents Live

Give the model a goal and let it work — a headless agent loop that plans, calls tools, and reports back, without a node-by-node script.

Start a run with a goal and a permission_mode: readonly refuses any side-effecting tool, while autonomous pre-authorizes the run's tool calls. The run records a tool_log of every invocation and a final_result; poll it for status (pending / running / completed / failed / cancelled) or cancel it mid-flight.

GET /api/agent/runs
POST /api/agent/runs { "goal": "…", "permission_mode": "readonly" }
GET /api/agent/runs/{id} — status, tool log, result
POST /api/agent/runs/{id}/cancel
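Polling a run until it finishes might look like the sketch below. It assumes the run JSON exposes the state under a status key (the key name is an assumption; the state values are the ones listed above). get_run is a hypothetical callable wrapping GET /api/agent/runs/{id}, injected so the loop can be tested without a daemon.

```python
import time

TERMINAL = {"completed", "failed", "cancelled"}

def wait_for_run(get_run, run_id, interval_s=2.0, timeout_s=600.0, sleep=time.sleep):
    """Poll until the run reaches a terminal state, then return its final JSON."""
    deadline = time.monotonic() + timeout_s
    while True:
        run = get_run(run_id)
        if run["status"] in TERMINAL:
            return run
        if time.monotonic() >= deadline:
            raise TimeoutError(f"run {run_id} still {run['status']} after {timeout_s}s")
        sleep(interval_s)

# Offline demo: a fake daemon that finishes on the third poll
states = iter([
    {"status": "pending"},
    {"status": "running"},
    {"status": "completed", "final_result": "3 files summarized", "tool_log": []},
])
run = wait_for_run(lambda _id: next(states), "run-1", sleep=lambda s: None)
print(run["status"], run["final_result"])              # completed 3 files summarized
```

On a timeout, a caller could follow up with POST /api/agent/runs/{id}/cancel rather than leaving the run in flight.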

Routines Live

Run an agent or a workflow on a schedule — the same goal, every morning or every N hours, unattended.

A routine wraps either a goal-based agent run or a saved workflow_id, with a schedule_type of daily (a local at time like "09:00") or interval (every_hours). Routines carry an enabled flag so you can pause one without deleting it, and you can fire any routine immediately, off-schedule.

GET /api/agent/routines
POST /api/agent/routines — create a schedule
POST /api/agent/routines/{id}/toggle — enable / pause
POST /api/agent/routines/{id}/run — run now
DELETE /api/agent/routines/{id}
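A sketch of assembling the create-routine body from the fields named above. The exact JSON layout the endpoint expects is an assumption; the helper routine_body is hypothetical and only enforces the either/or choices the text describes.

```python
import re

def routine_body(*, goal=None, workflow_id=None, at=None, every_hours=None, enabled=True):
    """Build a routine payload: one target (goal or workflow) and one schedule (daily or interval)."""
    if (goal is None) == (workflow_id is None):
        raise ValueError("pass exactly one of goal or workflow_id")
    if (at is None) == (every_hours is None):
        raise ValueError("pass exactly one of at or every_hours")
    body = {"enabled": enabled}
    body.update({"goal": goal} if goal is not None else {"workflow_id": workflow_id})
    if at is not None:
        if not re.fullmatch(r"([01]\d|2[0-3]):[0-5]\d", at):
            raise ValueError("at must be a 24-hour HH:MM time")
        body.update({"schedule_type": "daily", "at": at})
    else:
        if every_hours <= 0:
            raise ValueError("every_hours must be positive")
        body.update({"schedule_type": "interval", "every_hours": every_hours})
    return body

daily = routine_body(goal="Summarize overnight alerts", at="09:00")
hourly = routine_body(workflow_id="wf-123", every_hours=6, enabled=False)   # hypothetical id
print(daily["schedule_type"], hourly["schedule_type"])                      # daily interval
```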

Structured output Experimental

Constrain the reply to a JSON Schema with response_format — valid JSON every time, no parse-and-retry loop. OpenAI-compatible shape.

Not live yet. The daemon currently ignores response_format. You can coax JSON with a prompt, but nothing enforces the schema, so there's no guarantee. Grammar-constrained sampling (xgrammar) lands with the v1.0 tool-calling track and makes the schema a hard constraint. The shape below is what you'll build against.

curl.exe https://baremetalrt.ai/v1/chat/completions `
  -H "Authorization: Bearer bmrt_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx" `
  -H "Content-Type: application/json" `
  -d '{
    "model": "qwen3-4b-int4",
    "messages": [{"role":"user","content":"Extract the person from: Alice is 30."}],
    "response_format": {
      "type": "json_schema",
      "json_schema": {
        "name": "person",
        "strict": true,
        "schema": {
          "type": "object",
          "properties": { "name": {"type":"string"}, "age": {"type":"integer"} },
          "required": ["name","age"],
          "additionalProperties": false
        }
      }
    }
  }'

# → content is guaranteed to parse against the schema:
# { "name": "Alice", "age": 30 }

from openai import OpenAI
import json

client = OpenAI(base_url="https://baremetalrt.ai/v1", api_key="bmrt_…")

r = client.chat.completions.create(
    model="qwen3-4b-int4",
    messages=[{"role": "user", "content": "Extract the person from: Alice is 30."}],
    response_format={
        "type": "json_schema",
        "json_schema": {"name": "person", "strict": True, "schema": {
            "type": "object",
            "properties": {"name": {"type": "string"}, "age": {"type": "integer"}},
            "required": ["name", "age"],
            "additionalProperties": False}},      # matches the curl example; needed for strict mode
    },
)
person = json.loads(r.choices[0].message.content)    # always valid: {"name": "Alice", "age": 30}

Model still matters. Schema enforcement guarantees valid JSON, but coherent values need a capable model: sub-7B models may satisfy a required field with a weak guess. Check the model card before relying on it for hard extraction.

MCP & integrations Live

The daemon is a full Model Context Protocol host. Connect a tool and the agent can call it mid-conversation — files, web, databases, your enterprise systems — with the model deciding when, and the daemon running the round-trip. Everything executes on your GPU host: tokens and data flow tool ↔ local agent, never through our cloud.

One-click integrations catalog

Open the app → Skills → Integrations to browse the catalog and connect a tool in one click — or see the full list of 300+ integrations across 14 categories and every major industry. Bundled connectors are stdlib-only and need no Node — they work on a clean Windows box. Read-only tools run silently; anything side-effecting (a write, a send, a create) pauses for your approval before it runs.

Bundled No setup

Keyless, on-box

Weather (Open-Meteo), Fetch (read any web page, SSRF-guarded), Files (a folder you choose; writes approval-gated), and SQLite (read-only queries). No token, no Node.

Official Needs Node

npx servers

GitHub, Slack, Postgres, and a Browser (Playwright) via the official @modelcontextprotocol servers. Install Node.js and they light up; add a token where the service needs one.

Enterprise Live

Your data, your model

Microsoft 365 (Outlook, OneDrive, SharePoint, Teams, Excel), Databricks, Snowflake (read-only SQL), and Salesforce (SOQL + records). Token/OAuth stays on your GPU host — proprietary data never touches a cloud LLM.

Bring your own

Any MCP server

Paste a remote Streamable-HTTP URL (with an auth header) or a local launch command, and the daemon connects it like any other. Plug a local Git repo for read-only code tools in one click.

Enterprise integrations ship gated. Databricks/Snowflake need a workspace URL + access token; Microsoft 365 and Salesforce use an on-device OAuth flow against an app you register. Until configured they show as Needs setup — the consent and credentials live on your machine, never on our relay.

Connect a GitHub repo

From the composer, Connect GitHub runs a repo-scoped OAuth flow (separate consent from signing in) and lists the repositories you can access. When you pick one, the daemon clones the repo onto your GPU host and attaches a read-only Git MCP server so the model can read the code. The OAuth token is held encrypted by the relay (so it can hand it to your daemon to clone). The requested scope is repo; GitHub's repo scope itself permits writes, but the token is used only to read and clone, never to push or open issues. Like sign-in, it activates when GITHUB_CLIENT_ID / GITHUB_CLIENT_SECRET are configured.

Drive it from the API

The same agent loop is on the native endpoint: send agent: true to /api/chat and the daemon runs generate → call tool → feed result → continue, streaming tool_call and tool_result events alongside tokens. Tools resolve against the built-ins plus every connected MCP server (namespaced mcp__<server>__<tool>).

# Ask the agent something its connected tools can answer
curl.exe -N http://localhost:8080/api/chat `
  -H "Content-Type: application/json" `
  -d '{ "message": "What is the weather in Tokyo? Use your tools.", "agent": true }'

# → data: {"tool_call": {"name": "mcp__weather__get_weather", "args": {"location": "Tokyo"}}}
# → data: {"tool_result": {"name": "mcp__weather__get_weather", "output": "Weather in Tokyo: …"}}
# → data: {"token": "The"} … then the model's written answer

Skills. Alongside integrations (the tools), the app ships 29 ready-made Skills: reusable recipe cards (code review, changelog drafting, SQL help) that steer the agent toward a named job and declare the integrations they need. Enable one and its instructions are injected for that session.
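When consuming the agent stream from the API, the mcp__<server>__<tool> namespacing in tool_call events can be unpacked to see which connected server handled a call. A minimal sketch, assuming built-in tools carry no mcp__ prefix; split_tool_name is a hypothetical helper.

```python
def split_tool_name(name: str):
    """Return (server, tool); server is None for tools without the mcp__ prefix."""
    parts = name.split("__", 2)
    if len(parts) == 3 and parts[0] == "mcp":
        return parts[1], parts[2]
    return None, name

def summarize_tool_calls(events):
    """Collect (server, tool, args) from the tool_call events in a parsed stream."""
    calls = []
    for evt in events:
        call = evt.get("tool_call")
        if call:
            server, tool = split_tool_name(call["name"])
            calls.append((server, tool, call.get("args", {})))
    return calls

events = [
    {"tool_call": {"name": "mcp__weather__get_weather", "args": {"location": "Tokyo"}}},
    {"tool_result": {"name": "mcp__weather__get_weather", "output": "Weather in Tokyo: …"}},
    {"token": "The"},
]
print(summarize_tool_calls(events))    # [('weather', 'get_weather', {'location': 'Tokyo'})]
```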

Embeddings Live

Turn text into vectors for semantic search, RAG, clustering, and dedup. /v1/embeddings follows the OpenAI shape, served by all-MiniLM-L6-v2 (384-dim, Apache 2.0) running on your GPU alongside the chat model.

from openai import OpenAI

client = OpenAI(
    base_url="https://baremetalrt.ai/v1",
    api_key="bmrt_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx",
)

resp = client.embeddings.create(
    model="all-MiniLM-L6-v2",
    input=["the cat sat on the mat", "a feline rested on the rug"],
)
print(len(resp.data[0].embedding))   # 384
# cosine similarity of the two vectors is high: they mean the same thing

curl.exe https://baremetalrt.ai/v1/embeddings `
  -H "Authorization: Bearer bmrt_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx" `
  -H "Content-Type: application/json" `
  -d '{ "model": "all-MiniLM-L6-v2", "input": "the cat sat on the mat" }'

Returns the standard OpenAI { "object": "list", "data": [{ "embedding": [...] }] } shape, so vector stores and RAG frameworks that target OpenAI embeddings work unchanged. Live on the relay, verified end-to-end returning correct 384-dim L2-normalized vectors on RTX hardware.

Retrieval (RAG) Live

Ground answers in your own documents — entirely on your GPU. Use the built-in knowledge bases (no code), or build your own pipeline on the embeddings endpoint. Either way the corpus, the index, and the model all stay on the machine.

Built-in knowledge bases Live

No code required. In the chat box at the bottom of the Dashboard, open the 📚 knowledge-base menu to create a base, attach your documents with the paperclip, and click the ingest button (📥). Select that base and every answer is grounded in your own text with bracketed [n] citations. The chunker, the embeddings (all-MiniLM-L6-v2), the flat vector index, and the source text all live on your GPU host — nothing is sent to a cloud index or a third-party service.

The app drives a small daemon API you can call directly to manage bases and ingest from your own tooling — text is extracted client-side, so the daemon takes no parser dependency:

GET /api/rag/collections
POST /api/rag/collections { "name": "Company handbook" }
POST /api/rag/{name}/ingest { "docs": [{ "source": "handbook.md", "text": "…" }] }
GET /api/rag/{name}/search?q=…&k=5 — inspect what would be retrieved
DELETE /api/rag/collections/{name}

To ground a chat, add collection to the /api/chat body — the daemon retrieves the top-k chunks for the message and folds them into the prompt with their sources. Behind the scenes: documents are split into ~1600-character chunks, embedded once at ingest, and ranked at query time by cosine similarity (a single matmul over L2-normalized vectors — a flat scan that stays fast into the hundreds of thousands of chunks, no ANN index required).
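The chunk-then-flat-scan approach described above can be sketched in a few lines of standard-library Python. This is an illustration, not the daemon's code: fixed-width slicing is an assumption about how the ~1600-character chunks are cut, and the toy 3-dimensional vectors stand in for real 384-dim embeddings.

```python
import math

def chunk(text, size=1600):
    """Split text into consecutive pieces of at most `size` characters."""
    return [text[i:i + size] for i in range(0, len(text), size)]

def normalize(v):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    n = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / n for x in v]

def top_k(query_vec, chunk_vecs, k=5):
    """Flat scan: score every chunk against the query and return the k best (index, score)."""
    q = normalize(query_vec)
    scores = [sum(a * b for a, b in zip(q, normalize(v))) for v in chunk_vecs]
    order = sorted(range(len(scores)), key=scores.__getitem__, reverse=True)
    return [(i, scores[i]) for i in order[:k]]

pieces = chunk("a" * 4000)
print([len(p) for p in pieces])                        # [1600, 1600, 800]

hits = top_k([1, 0, 0], [[0, 1, 0], [2, 0, 0], [1, 1, 0]], k=2)
print([i for i, _ in hits])                            # [1, 2]
```

With NumPy the scoring loop collapses into the single matrix product the paragraph mentions.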

Local-only by construction. The vectors and chunk text are written under %APPDATA%/BareMetalRT/rag/<collection>/ on the GPU host. There is no external vector database and no network call in the retrieval path — the privacy promise is the architecture, not a setting.

Or build your own pipeline

Prefer your own vector store or framework? Call the embeddings endpoint directly and keep the index wherever you like:

# pip install openai numpy — fully local RAG in ~20 lines
from openai import OpenAI
import numpy as np

client = OpenAI(base_url="https://baremetalrt.ai/v1", api_key="bmrt_…")

# 1. Your knowledge base, split into chunks
docs = [
    "BareMetalRT runs TensorRT-LLM natively on Windows.",
    "Tensor parallelism splits one model across two RTX GPUs over standard networking.",
    "Voice mode ships on by default and can be toggled off in settings.",
]

def embed(texts):
    r = client.embeddings.create(model="all-MiniLM-L6-v2", input=texts)
    return np.array([d.embedding for d in r.data])

doc_vecs = embed(docs)                                  # embed the corpus once

# 2. At query time, embed the question and rank by cosine similarity
query = "How do I run a model too big for one card?"
q = embed([query])[0]
scores = doc_vecs @ q                                   # vectors are L2-normalized → dot = cosine
top = [docs[i] for i in scores.argsort()[::-1][:2]]     # top-2 chunks

# 3. Stuff the retrieved context into the chat prompt
context = "\n".join(top)
resp = client.chat.completions.create(
    model="qwen3-4b-int4",
    messages=[
        {"role": "system", "content": f"Answer using only this context:\n{context}"},
        {"role": "user", "content": query},
    ],
)
print(resp.choices[0].message.content)
Documents from PDFs. The app's "chat with documents" feature extracts the text layer from PDFs and text files via POST /api/upload (text-layer PDFs, .txt, .md; scanned/image-only PDFs are rejected with a clear message — OCR is on the roadmap). For a developer pipeline, extract text your own way, chunk it, and feed it through the loop above. Because /v1/embeddings is the OpenAI shape, drop-in vector stores (FAISS, Chroma, pgvector) and frameworks (LangChain, LlamaIndex) work too — point them at the relay.

IDE & tools Live

Anything that speaks OpenAI speaks BareMetalRT. Point your editor or agent framework at the relay and your own RTX GPU answers — no GPT-4 bill, no code leaving your machine.

Cursor

Settings → Models → enable Override OpenAI Base URL. Set the base URL to https://baremetalrt.ai/v1, paste your bmrt_ key as the OpenAI API key, and add a custom model named after one in your catalog (e.g. qwen3-1.7b). Cursor verifies the key against /v1/models — which is live — so the green check lights up.

Continue (VS Code / JetBrains)

Drop this into ~/.continue/config.json:

{
  "models": [{
    "title": "BareMetalRT",
    "provider": "openai",
    "model": "qwen3-1.7b",
    "apiBase": "https://baremetalrt.ai/v1",
    "apiKey": "bmrt_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
  }]
}

Everything else

The same two values — base URL https://baremetalrt.ai/v1 and a bmrt_ key — work in any tool with a configurable OpenAI endpoint: the openai Python/JS SDKs, LangChain (ChatOpenAI(base_url=…)), LlamaIndex, Vercel AI SDK, Open WebUI, and friends. For on-box, keyless use, swap in http://localhost:8080/v1 with the latest daemon.

REST API reference

Served by the daemon on http://localhost:8080, and proxied through https://baremetalrt.ai when you authenticate with an API key.

Inference Live

POST /api/chat

Stream a completion as Server-Sent Events. Body and event shapes documented under Super quick start. Send header X-GPU-Mode: tp2 to route across a two-GPU mesh. Add collection to the body to ground the answer in a knowledge base.

Models Live

GET /api/models

List catalog models with download state, VRAM fit, and loadability for this node.

POST /api/models/{id}/pull
POST /api/models/{id}/build
POST /api/models/{id}/load
GET /api/models/{id}/status
POST /api/unload
POST /api/models/{id}/delete

Download, hot-load, and unload models without restarting the daemon. Most of the catalog runs on the PyTorch backend (PyExecutor) and loads straight from the Hugging Face checkpoint: pull, then load, with no build step. /build applies only to the legacy TensorRT path (the TP=2 mesh model), which compiles an engine before it can load. Poll /api/models/{id}/status for live download and build progress.
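The order of calls can be sketched as follows. This is an assumed client-side helper, not part of BareMetalRT: lifecycle_steps and poll_until are hypothetical names, and because the fields of the status response are not documented in this section, the polling helper takes the "is it done yet?" test as a caller-supplied function instead of guessing a field name.

```python
import time


def lifecycle_steps(model_id, needs_build=False):
    """Return the (method, path) calls that take a model from catalog to GPU."""
    steps = [("POST", f"/api/models/{model_id}/pull")]
    if needs_build:  # legacy TensorRT path only, per the text above
        steps.append(("POST", f"/api/models/{model_id}/build"))
    steps.append(("POST", f"/api/models/{model_id}/load"))
    return steps


def poll_until(fetch_status, is_done, interval_s=2.0, timeout_s=1800.0,
               sleep=time.sleep):
    """Call fetch_status() until is_done(status) is true or the timeout passes.

    fetch_status: callable returning the decoded /api/models/{id}/status body.
    is_done: callable deciding readiness; its logic depends on the real
             response shape, which this sketch does not assume.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        status = fetch_status()
        if is_done(status):
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError("model did not become ready in time")
        sleep(interval_s)
```

A caller would issue each step in order and run poll_until between them, so that load is only attempted once the download (and any build) has finished.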

Knowledge bases (RAG) Live

GET /api/rag/collections
POST /api/rag/collections
DELETE /api/rag/collections/{name}
POST /api/rag/{name}/ingest
GET /api/rag/{name}/search

Create local knowledge bases, ingest already-extracted document text (chunked + embedded on the GPU host), and retrieve the top-k chunks. Pass collection in the /api/chat body to ground an answer in a base. The vectors and source text never leave the machine. See Retrieval (RAG).

Telemetry Live

GET /api/status
GET /api/gpu-metrics

Daemon readiness and live GPU stats:

# GET /api/gpu-metrics
{
  "vram_used_mb": 8880,
  "vram_total_mb": 12282,
  "temperature_c": 45,
  "gpu_util_pct": 10,
  "power_w": 85
}
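
As a small worked example, free VRAM can be derived from the two memory fields in that response. The helper names below are illustrative, and the headroom check is only a rough client-side estimate; the daemon performs its own check after a model loads, as described under Voice & VRAM.

```python
def free_vram_mb(metrics):
    """Free VRAM in MB, computed from a /api/gpu-metrics response body."""
    return metrics["vram_total_mb"] - metrics["vram_used_mb"]


def has_headroom(metrics, needed_mb):
    """True if at least `needed_mb` of VRAM is currently free."""
    return free_vram_mb(metrics) >= needed_mb


# The sample response shown above
sample = {
    "vram_used_mb": 8880,
    "vram_total_mb": 12282,
    "temperature_c": 45,
    "gpu_util_pct": 10,
    "power_w": 85,
}

print(free_vram_mb(sample))        # 3402
print(has_headroom(sample, 4096))  # False: less than ~4 GB is free
```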

Keys Live

POST /api/keys
GET /api/keys
DELETE /api/keys/{id}

Create, list, and revoke API keys for relay access. Manage them visually in Account Settings.

Errors & status codes

Errors follow the OpenAI shape — an error object with message, type, and sometimes code. In a non-streaming call the HTTP status carries the failure; in a streaming call the failure arrives as an error frame and the stream then closes.

Status | code | When
401 | invalid_api_key | Missing or revoked bmrt_ key on a relay call.
503 | (none) | No GPU node connected: your daemon is offline or still loading.
503 | kv_exhausted | Engine at capacity (paged-KV pool full). Back off and retry.
503 | (none) | Context window full: start a new conversation or trim history.
400 | style_not_allowed | Requested a tuner style above the active model's tier.
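
One way to act on the "back off and retry" advice is exponential backoff with jitter, retrying only the kv_exhausted case. This is a sketch with assumed defaults (attempt count, base delay, and cap are arbitrary choices, not values published by BareMetalRT), and the function names are invented for the example.

```python
import random

RETRYABLE_CODES = {"kv_exhausted"}


def is_retryable(status, body):
    """True only for the 503 backpressure error the table says to retry."""
    error = body.get("error") if isinstance(body, dict) else None
    code = error.get("code") if isinstance(error, dict) else None
    return status == 503 and code in RETRYABLE_CODES


def backoff_delays(attempts=5, base_s=0.5, cap_s=8.0, rng=random.random):
    """Yield one delay per attempt: exponential growth, capped, with full jitter."""
    for attempt in range(attempts):
        yield rng() * min(cap_s, base_s * 2 ** attempt)
```

A caller would sleep for each yielded delay between attempts and stop as soon as a request succeeds. The other 503 cases carry no code, so this sketch leaves them to the caller: a full context window will not clear on retry, while an offline daemon needs to be started first.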

Example bodies

# 401 — no/invalid key (relay)
{ "error": { "message": "Missing or invalid API key…",
            "type": "invalid_request_error", "code": "invalid_api_key" } }

# 503 — backpressure (engine at capacity)
{ "error": { "message": "Server at capacity, retry shortly",
            "type": "server_error", "code": "kv_exhausted" } }

# streaming — error arrives as a frame, then the stream closes
data: {"...chunk...", "error": {"message": "No GPU node connected", "type": "server_error"}}
On the native /api/chat endpoint the same failures arrive flatter — data: {"error": "…"} followed by data: {"done": true}. The /v1 shim reshapes these into the OpenAI error object above.
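
A client reading either stream might handle these frames roughly as follows. This is a sketch under assumptions: it accepts both the flat native error string and the OpenAI-style error object, treats {"done": true} as the end of the stream, and passes every other frame through untouched because the token frame shape is documented elsewhere. StreamError and iter_frames are names made up for this example.

```python
import json


class StreamError(RuntimeError):
    """Raised when the stream delivers an error frame."""


def iter_frames(lines):
    """Yield decoded JSON frames from SSE lines until a done frame arrives."""
    for raw in lines:
        line = raw.decode("utf-8") if isinstance(raw, bytes) else raw
        line = line.strip()
        if not line.startswith("data:"):
            continue  # blank separators and other SSE fields
        payload = line[len("data:"):].strip()
        try:
            frame = json.loads(payload)
        except json.JSONDecodeError:
            continue  # not JSON, nothing to interpret
        if not isinstance(frame, dict):
            continue
        if "error" in frame:
            err = frame["error"]
            # Native frames carry a string; the /v1 shim carries an object.
            message = err.get("message", "") if isinstance(err, dict) else str(err)
            raise StreamError(message)
        if frame.get("done"):
            return
        yield frame
```

Any iterable of lines works as input, for example the response object returned by urllib.request.urlopen, which yields raw bytes lines.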

Rate limits & concurrency

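Your GPU, your rules. There is no per-token metering, no monthly quota, and no request cap. The only ceiling is the hardware: when the engine's paged-KV pool fills up, requests return 503 kv_exhausted until capacity frees, so back off and retry.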

API stability & versioning

What's safe to build on today, and how we signal change. The badge on each section is the contract.

Badge | Means | Build on it?
Live | Implemented, served today, shape is stable. | Yes, production-safe.
Preview | Shipping soon; the surface exists but may change. | Prototype against it; pin a daemon version.
Experimental | Documents a planned shape; not enforced yet. | Design against it; don't depend on it.
The daemon reports its build at GET /api/diagnostics/executor (version field). Log it with your integration so you can correlate behavior to a specific release.

Headless deployment Live

BareMetalRT runs as a background process with no window and no GUI — it's how the hosted demo serves traffic right now. Launch it with a port and it exposes the full REST API (and the /v1 surface) on your LAN. Ideal for a spare workstation or a home server.

# start the daemon headless — no window, no display attached
PS> baremetalrt.exe --port 8080
   → REST API + /v1 on http://0.0.0.0:8080

# everything is driven over HTTP from there — load a model, then chat
PS> curl.exe -X POST http://localhost:8080/api/models/qwen3-1.7b/load
PS> curl.exe http://localhost:8080/api/status
Auto-start: register the daemon as a logon/boot task with Windows Task Scheduler (the installer does this for you) so the box rejoins the mesh after a reboot with no one logged in.

bmrt CLI Experimental

Not live yet. The bmrt command is planned for the v1.0 headless tooling — pure sugar over the HTTP endpoints above, which already work today. Until it ships, drive everything with curl against the REST API. The commands below are the planned surface.

A scriptable wrapper for load / list / serve from PowerShell — for CI, agents, and scheduled tasks:

bmrt serve --port 8080          # start the daemon headless
bmrt pull qwen3-1.7b            # download a catalog model
bmrt load qwen3-1.7b            # load it onto the GPU
bmrt ps                         # list loaded models + VRAM
bmrt unload                     # free the GPU

Model catalog

Open-weight models validated on real consumer RTX hardware. The catalog grows every week as new architectures pass the on-box battery.

Model ID | Params | Quant | GPUs | Notes
qwen3-0.6b | 0.6B | FP16 | 1 | Instant response
qwen3-1.7b | 1.7B | FP16 | 1 | Tool calling
qwen3-4b-int4 | 4B | W4A16-AWQ | 1 (8 GB+) | Long context · Ampere+
ds-r1-distill-1.5b | 1.5B | FP16 | 1 | Reasoning
llama-3.2-1b-instruct | 1B | FP16 | 1 | Sub-10 ms/token
llama-3.2-3b-instruct | 3B | FP16 | 1 |

Query the live catalog for the current list and per-node fit:

curl.exe http://localhost:8080/api/models

Managing models & storage

A model occupies two separate resources — GPU VRAM while it's loaded, and disk while it's downloaded. Unloading and deleting are different operations for exactly that reason.

Action | Endpoint | Frees | Use when
Unload | POST /api/unload | GPU VRAM (files stay on disk) | Switching models, or reclaiming VRAM (e.g. for voice headroom). Reloads instantly.
Delete | POST /api/models/{id}/delete | Disk: weights and built engine | Reclaiming storage. A re-download is needed to use it again.

Downloaded weights and any built engines live in the install's data directory (models\ and engine_cache\ under Program Files\BareMetalRT). Catalog models pull from Hugging Face the first time you load them and stay local after that — roughly 1–8 GB each, depending on size and quantization.

In the app: the storage manager lists every downloaded model with its size and a delete button — the visual equivalent of the endpoints above, and the safe way to free disk without hand-editing engine_cache.

Voice & VRAM Live

Voice is built in and on by default. It runs its own models — speech-to-text, voice-activity detection, and text-to-speech — that share your GPU with the chat model, so VRAM is what decides how much voice you get.

Stage | Model | Notes
Speech-to-text | Whisper large-v3-turbo | Runs on the GPU via faster-whisper; CPU is configurable for tight cards.
Voice activity | Silero VAD | Tiny; detects when you start and stop speaking.
Text-to-speech | Orpheus-3B + SNAC | Full-quality neural voice (~4 GB). Steps down to a low-VRAM TTS, then to browser speech, as the card gets tighter.

The voice stack reserves roughly 4 GB of free VRAM on top of your chat model. After a model loads, the daemon checks actual free VRAM and only enables voice if it fits — so on a tight 8 GB card running a 4B model, voice may drop to the low-VRAM tier or stay off. Want full voice? Run a smaller chat model, or unload to free headroom.

Controls. Voice is on by default; turn it off in the app's Settings. The mic only appears for models flagged voice-capable. Programmatically, toggle with POST /api/voice/enable and read state from GET /api/voice/status — e.g. { "enabled": true, "model_voice_capable": true, "ready": true }.

FAQ

The questions a first install actually runs into — install order, version pitfalls, and what runs where.

Why do I have to install CUDA and TensorRT myself? Can't the installer bundle them?

No — NVIDIA's license does not let us redistribute the CUDA Toolkit or the TensorRT SDK. Every other piece of the runtime (a private Python build, the PyTorch and TensorRT-LLM Python wheels, MPI, the VC++ redistributable) is downloaded for you. These two SDKs are the only parts you fetch from NVIDIA directly, with a free Developer account.

Which versions do I install? The download page offers CUDA 13 and TensorRT 11.

Use CUDA Toolkit 12.x (12.4–12.9) and TensorRT 10.15 or 10.16, not the latest 13.x / 11.x that NVIDIA's pages default to. BareMetalRT ships a version-pinned runtime (torch 2.6.0+cu124, tensorrt-cu12 10.15.1.29); CUDA 13 and TensorRT 11 use a different ABI, so inference won't start. Grab a 12.x toolkit from the CUDA Toolkit Archive and a TensorRT 10.x GA Windows ZIP for CUDA 12. (Newer CUDA/TensorRT and Blackwell support are on our roadmap; see below.)
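
If you script your setup, the supported ranges above can be expressed as a small pre-flight check. This is an assumed helper, not something the installer or daemon exposes: the function names are illustrative, and obtaining the version strings (for example from the output of nvcc --version) is left outside the sketch.

```python
def parse_version(text):
    """Turn '12.4.1' into (12, 4); a missing minor component counts as 0."""
    parts = [int(p) for p in text.strip().split(".")[:2]]
    while len(parts) < 2:
        parts.append(0)
    return tuple(parts)


def cuda_supported(version):
    """CUDA Toolkit 12.4 through 12.9, per the answer above."""
    return (12, 4) <= parse_version(version) <= (12, 9)


def tensorrt_supported(version):
    """TensorRT 10.15 or 10.16, per the answer above."""
    return parse_version(version) in {(10, 15), (10, 16)}


print(cuda_supported("12.6"))            # True
print(cuda_supported("13.0"))            # False
print(tensorrt_supported("10.15.1.29"))  # True
```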

I installed CUDA 13 or TensorRT 11 and the daemon won't load a model. How do I fix it?

Uninstall the 13.x CUDA Toolkit (or just install a 12.x one alongside it) and replace TensorRT 11 with a 10.15/10.16 build extracted to C:\TensorRT\. Then restart the daemon — it re-detects the SDKs on startup. If you also pip-installed a newer torch or tensorrt into the bundled runtime, reinstall from the .exe to restore the pinned set.

Which NVIDIA driver do I need?

Driver 545 or newer (required by CUDA 12.4). Newer drivers are fine and recommended — update through the NVIDIA driver page, the NVIDIA App, or GeForce Experience before installing.

Can I install BareMetalRT before the NVIDIA SDKs?

Yes. The installer notes this on its prerequisite screen — install BareMetalRT first, add the CUDA Toolkit and TensorRT afterward, and the daemon detects them automatically on its next start. You just can't load a model until both are present.

Why can't I just pip install a newer torch or tensorrt?

The Python and C++ sides of TensorRT-LLM are an ABI-locked set — the torch build, the TensorRT wheel, and the engine runtime all have to match. Upgrading one in the bundled runtime is the most common way to break inference. If something stops working after a manual change, reinstall from the .exe.

What GPU and how much VRAM do I need?

An NVIDIA RTX card, 20-series or newer. 8 GB runs the small models and 4B-class int4, with voice included where it fits (on a tight 8 GB card running a 4B model, voice may step down to the low-VRAM tier or stay off); 12 GB+ adds headroom for larger Qwen3 and int4 models; two matching cards can split one larger model with tensor parallelism. There is no CPU, AMD, or Apple-Silicon fallback: the engine is TensorRT-LLM.

Does my data leave the machine?

No. Inference runs entirely on your GPU. When you sign in, the cloud relay only proxies requests to your daemon so you can reach it remotely with an API key — your prompts and model outputs aren't stored server-side, and on-box use (http://localhost:8080) never touches the relay at all.

Do I need a Hugging Face token?

Only for gated models. Paste a read token into the app's settings (stored per-user, encrypted) and gated checkpoints download under your account. Ungated catalog models need no token.

Building something on BareMetalRT? We read every issue: open one on GitHub or email us.