Build on the GPU you already own.
BareMetalRT exposes a local inference API powered by NVIDIA TensorRT-LLM — the production data-center engine — running natively on Windows. Point your code at your own RTX card. No cloud round-trip, no per-token billing, no Linux, no WSL. Your data never leaves the machine.
In Windows PowerShell, curl is an alias for Invoke-WebRequest, so the examples call curl.exe: it's the real curl shipped with Windows.
System requirements
BareMetalRT is a native Windows runtime on NVIDIA RTX hardware. There is no CPU, AMD, or Apple-Silicon path — the engine is TensorRT-LLM.
| Component | Requirement | Notes |
|---|---|---|
| Operating system | Windows 10 or 11 (64-bit) | No Linux, macOS, WSL, or Docker. |
| GPU | NVIDIA RTX, 20-series or newer | Ampere / Ada / Blackwell. No CPU or AMD fallback. |
| VRAM | 8 GB minimum | Decides which models you can run — see tiers below. |
| Driver | 545 or newer | Required for CUDA 12.4. Newer is fine; update before installing. |
| System RAM | 16 GB recommended | 8 GB works for the smallest models. |
| Disk | ~15 GB + models | Pinned runtime + NVIDIA SDKs, plus 1–8 GB per model you download. |
Two NVIDIA SDKs (CUDA Toolkit and TensorRT) are also required — they're a one-time manual download covered in step 2 of Install.
VRAM tiers — what runs on your card
- 8 GB (e.g. RTX 3060 Ti, 4060) — Qwen3 0.6B / 1.7B, DeepSeek-R1-Distill 1.5B, and 4B-class models in int4. Voice included.
- 12 GB+ (e.g. RTX 3060 12 GB, 4070) — larger Qwen3 and int4 models with more context headroom.
- Two matching cards — split one larger model across both GPUs with tensor parallelism (same architecture on both, e.g. two 3090s).
The catalog tags every model with its tier, and the app hides models that won't fit your card.
Install BareMetalRT Windows
One installer puts the inference daemon on your GPU and pins its own private Python runtime. You supply two NVIDIA SDKs — CUDA and TensorRT — that NVIDIA's license does not let us redistribute; everything else is downloaded for you.
1. Confirm your card fits a model tier
Your card's VRAM decides which models you can run — check the VRAM tiers above. The catalog tags every model with its tier and the app hides models that won't fit, so this is a quick sanity check before you install.
2. Download & run the installer
Grab the latest BareMetalRT-Setup.exe from the releases page and run it. That's the whole install. A private, version-pinned Python build, the matched PyTorch + TensorRT-LLM wheels, MPI, and the VC++ redistributable are all bundled. No separate CUDA Toolkit or TensorRT install is required — the installer runs a quick NVIDIA-driver check and sets up everything else for you.
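Don't pip-install packages into the daemon's bundled Python runtime. If the pinned packages get changed, re-run BareMetalRT-Setup.exe to restore the pinned set.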
3. Open the app & connect this GPU to your account
When setup finishes, the app opens on your machine at http://localhost:8080/app — the full BareMetalRT runs locally, on your own hardware. The first time, it shows “Connect this GPU to your account.” Click it, sign in once, and the box links to your account. That's the only sign-in — after this the app runs locally with your account attached, no login, and nothing leaves your machine. (Want to reach this box from another device? That goes through the web portal at baremetalrt.ai/app, which relays back here.)
After the first sign-in, open http://localhost:8080/app directly. If it isn't running, launch BareMetalRT from the Start menu (or its desktop shortcut) to start the background task again.
4. Add a Hugging Face token (for gated models)
Some models — anything gated on Hugging Face — need a read token. Paste yours into the app under settings; it's stored per-user and encrypted. Ungated catalog models work without one.
5. Verify it works
Load a small model (Qwen3 0.6B is the fastest first run) and send a prompt. A clean install answers a question like “What is the capital of France?” with “Paris” in well under a second per token. If you get a streamed reply, you're done.
Get to know the stack
Six ways to drive the engine. Pick the one that fits how you already build.
Native streaming endpoint
A token-streaming /api/chat SSE endpoint served by the daemon on your box (and proxied through the cloud relay when you're signed in). Works today.
Drop-in /v1
A /v1/chat/completions surface so every tool built for the OpenAI API points at your RTX GPU unchanged. Live through the relay with an API key.
openai SDK
Use the official openai package with a custom base_url. The same three lines you already wrote — different host.
fetch & SSE
No SDK required. Hit the endpoint with fetch and read the Server-Sent Events stream straight off the response body.
bmrt
Load, list, and serve models from PowerShell. Scriptable for CI, agents, and scheduled tasks. Ships with the v1.0 headless tooling.
Bearer auth
Mint bmrt_… keys in Account Settings to reach your GPU through the cloud relay from anywhere. Revoke any time.
Super quick start
Ask your own GPU a question. The native /api/chat endpoint streams tokens back as SSE — this works on a stock install right now.
```powershell
# Stream a reply straight off your local daemon (port 8080)
curl.exe -N http://localhost:8080/api/chat `
  -H "Content-Type: application/json" `
  -d '{ "message": "Who are you, and what can you do?", "max_tokens": 256 }'
# → data: {"token": "I", "token_id": 40, "time_ms": 9.1}
# → data: {"token": "'m", ...}
# → data: {"done": true, "total_tokens": 128, "stop_reason": "eos"}
```

```python
# pip install requests
import json, requests

r = requests.post(
    "http://localhost:8080/api/chat",
    json={"message": "Who are you, and what can you do?", "max_tokens": 256},
    stream=True,
)
for line in r.iter_lines():
    if not line or not line.startswith(b"data: "):
        continue
    evt = json.loads(line[6:])
    if evt.get("done"):
        break
    print(evt.get("token", ""), end="", flush=True)
```

```javascript
const res = await fetch("http://localhost:8080/api/chat", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({ message: "Who are you, and what can you do?", max_tokens: 256 }),
});
const reader = res.body.getReader();
const dec = new TextDecoder();
let buf = "";          // holds a partial line that spans two network chunks
let finished = false;  // set by the done frame so the outer loop stops too
while (!finished) {
  const { value, done } = await reader.read();
  if (done) break;
  buf += dec.decode(value, { stream: true });
  const lines = buf.split("\n");
  buf = lines.pop();   // keep the trailing partial line for the next read
  for (const line of lines) {
    if (!line.startsWith("data: ")) continue;
    const evt = JSON.parse(line.slice(6));
    if (evt.done) { finished = true; break; }
    process.stdout.write(evt.token ?? "");
  }
}
```
Request body
| Field | Type | Default | Description |
|---|---|---|---|
| message | string | — | The user turn to respond to. Required. |
| max_tokens | integer | 2048 | Max tokens to generate. Capped at 4096 per request. |
| history | array | [] | Prior turns as { "role": "user" | "assistant", "content": "…" }. |
| style | string | null | Optional sampling override — focused, balanced, or expressive. Validated against the active model's tier. See Sampling & styles. |
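As a rough sketch of how the fields in this table fit together, the helper below assembles a request body in Python. The function name `build_chat_body` is hypothetical, and clamping `max_tokens` on the client is an assumption: the table states the 4096 cap but not whether the daemon clamps or rejects larger values.

```python
MAX_TOKENS_CAP = 4096        # per-request cap from the table
DEFAULT_MAX_TOKENS = 2048    # documented default
STYLES = {"focused", "balanced", "expressive"}


def build_chat_body(message, history=None, max_tokens=DEFAULT_MAX_TOKENS, style=None):
    """Assemble a JSON-serializable body for POST /api/chat (hypothetical helper)."""
    if not message:
        raise ValueError("message is required")
    if style is not None and style not in STYLES:
        raise ValueError(f"unknown style: {style!r}")
    for turn in history or []:
        if turn.get("role") not in ("user", "assistant"):
            raise ValueError(f"bad role in history: {turn.get('role')!r}")
    body = {
        "message": message,
        # Assumption: clamp locally instead of relying on server behavior
        "max_tokens": min(max_tokens, MAX_TOKENS_CAP),
        "history": list(history or []),
    }
    if style is not None:
        body["style"] = style  # omit the field to keep the model's default preset
    return body
```

Pass the result as the `json=` argument of `requests.post`, as in the quick-start example.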
Stream events
Each SSE frame is a data: line carrying one JSON object. Token frames carry token, token_id, and time_ms; the final frame carries done: true with total_tokens, truncated, and stop_reason.
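The frame layout described above can be folded into a final string plus the closing frame. This is a minimal sketch that works on any iterable of raw lines; `collect_stream` is a hypothetical name, and it assumes one JSON object per `data:` line as stated.

```python
import json


def collect_stream(lines):
    """Fold raw SSE lines (bytes or str) into (text, final_frame)."""
    parts = []
    final = None
    for raw in lines:
        line = raw.decode("utf-8") if isinstance(raw, bytes) else raw
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators and non-data fields
        evt = json.loads(line[len("data:"):].strip())
        if evt.get("done"):
            final = evt  # carries total_tokens, truncated, stop_reason
            break
        parts.append(evt.get("token", ""))
    return "".join(parts), final
```

With `requests`, `collect_stream(r.iter_lines())` returns the whole reply once the stream ends; check `final["truncated"]` to see whether the reply was cut short.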
Sampling & styles
Every model ships with a sampling preset tuned to its size, so you get coherent output by default. The optional style override trades determinism for variety — within a range that's safe for that model.
| style | Behavior | Good for |
|---|---|---|
| focused | Greedy / deterministic — same prompt, same answer. | Extraction, classification, code, tool calls. |
| balanced | Light sampling — some variation, still tight. | General chat and Q&A. |
| expressive | More varied, creative sampling. | Brainstorming, writing, longer-form replies. |
Pass it on /api/chat, or omit it to use the model's default:
```powershell
curl.exe -N http://localhost:8080/api/chat `
  -H "Content-Type: application/json" `
  -d '{ "message": "Extract the invoice total as a number.", "style": "focused" }'
```
You can always step down to focused, but stepping up past what a small model handles is rejected with 400 style_not_allowed. For example, expressive on a 1.5B model produces token salad, so it isn't offered. Query the styles a given model allows in the tuner field of /api/models.
On the OpenAI /v1 surface, temperature and top_p are accepted but not applied — sampling is owned by the model's tier. Use the native /api/chat style override above to influence it.
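One way to avoid a 400 style_not_allowed is to pick the style on the client before sending. The sketch below assumes you have already read the allowed style names for a model out of the tuner field of /api/models into a plain list; the exact shape of that field is not documented here, and `pick_style` is a hypothetical helper.

```python
STYLE_ORDER = ["focused", "balanced", "expressive"]  # least to most varied


def pick_style(requested, allowed):
    """Return the most varied allowed style that does not exceed `requested`."""
    if requested not in STYLE_ORDER:
        raise ValueError(f"unknown style: {requested!r}")
    allowed_set = set(allowed)
    # Walk down from the requested style toward focused
    upto = STYLE_ORDER.index(requested) + 1
    for style in reversed(STYLE_ORDER[:upto]):
        if style in allowed_set:
            return style
    return None  # nothing matched: omit the field and use the model default
```

Stepping down rather than failing keeps a request usable across models of different sizes.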
Authentication Live
Two separate things: signing in to your account (to manage nodes, keys, and team), and API keys (to call your GPU programmatically). Local calls to localhost need neither — it's your machine.
Signing in
You can sign in to the app three ways:
- Email & password — the default. Passwords are stored salted and hashed (scrypt).
- Sign in with GitHub: OAuth, as an alternative to a password. It requests identity only (read:user, user:email), with no access to your code, and links to your account by GitHub identity, falling back to your email if you already have an account. Available on the hosted app; on a self-hosted relay it appears once GITHUB_CLIENT_ID / GITHUB_CLIENT_SECRET are configured.
- Enterprise SSO: your organization's IdP via OIDC or SAML, with SCIM provisioning and role mapping. See Enterprise SSO / Identity.
API keys
To reach your GPU from elsewhere through the cloud relay, use an API key. Generate one in Account Settings. Keys are shown once, prefixed bmrt_, and scoped to inference and mesh. Pass it as a Bearer token:
```powershell
# Reach your own GPU through the relay from anywhere
curl.exe -N https://baremetalrt.ai/api/chat `
  -H "Authorization: Bearer bmrt_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx" `
  -H "Content-Type: application/json" `
  -d '{ "message": "Hello from my laptop!", "max_tokens": 128 }'
```
Enterprise SSO / Identity Available
Let your team sign in with your organization's identity provider, map IdP groups to roles, and auto-provision/deprovision users. Off by default — local and demo use are unchanged.
- OIDC (Authorization Code + PKCE) covers Okta, Microsoft Entra ID, Auth0, PingOne, Keycloak and ADFS, plus SAML 2.0 for orgs that mandate it.
- SCIM 2.0 auto-provisioning — when IT deactivates a user in the IdP, their sessions and API keys are revoked immediately.
- Role-based access (admin / user / viewer) mapped from IdP group claims.
- On-prem & air-gapped: the orchestrator is the relying party and talks to your IdP — no dependency on us.
Admins configure it under SSO / Identity settings. For your security team, the full architecture & token-validation details are on the SSO architecture & security page →
Team seats Available
The Team plan is billed per seat. You buy a number of seats and assign them to people; entitlement is enforced online, so an assignment or revocation takes effect on the member's next request.
- Buy seats — start a Team subscription and set the seat count in Checkout (Stripe quantity = seats; monthly or annual, with a free trial). Change the count any time; a downgrade that leaves more people assigned than paid-for moves the most-recently-assigned members to over quota until you free seats.
- Assign & revoke — invite teammates by email from the Seats console. They're seated automatically if a seat is free, otherwise invited until one opens. The buyer (org owner) always holds a seat and can always sign in to manage billing.
- Auto-sync with your IdP — with SCIM enabled, provisioning a user grants a seat (if available) and deprovisioning frees it, so your roster and your bill follow your directory.
- Shared chat guests don't use a seat. The 1 + 4 sharing model is separate: each seat holder can invite up to four chat-only guests (no connectors, agents, or API keys) onto their GPU. Guests never consume a paid seat — they upgrade to their own seat for full access.
A member without an assigned seat can still sign in to reach billing, but inference (chat and the /v1 endpoints) and API-key creation return 402 until a seat is assigned. Manage everything from the Seats console →
Compliance & security Available
What a security team needs to know before approving BareMetalRT on company hardware: where data lives, how access is controlled and revoked, and what is recorded. The short version — inference and your documents stay on the GPU you own, and the controls below are enforced on your machine, not ours.
Identity & access control
- Enterprise SSO — OIDC (Authorization Code + PKCE) and SAML 2.0 against your own IdP, with JWKS-validated ID tokens and signed SAML assertions. See Enterprise SSO / Identity.
- Role-based access: users are mapped to admin, user, or viewer from your IdP group claims. Administrative settings (SSO/identity configuration, billing, deployment-wide banners) require the admin role.
- Immediate revocation: when IT deactivates a user (via SCIM active: false, an admin action, or account delete), all of that user's sessions and API keys are revoked synchronously, and the next request on a stale token is rejected. Sessions and keys are stored hashed, never in plaintext.
- SCIM 2.0 provisioning: standard /scim/v2/Users and /scim/v2/Groups endpoints, bearer-token authenticated, so your IdP provisions and deprovisions automatically.
Audit log
The daemon keeps a tamper-evident, append-only audit log on the GPU host: a hash-chained JSONL record (each entry carries the SHA-256 of the previous one, so deletion or edits are detectable). It captures security-relevant on-box events — tool calls and approvals/denials, connector connect/remove, model loads, configuration changes, and sign-in success/failure — with secrets redacted at write time, rotation at 16 MB, and a queryable interface with CSV export.
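To illustrate why a hash chain makes edits and deletions detectable, here is a minimal sketch of the technique. It is not the daemon's actual record format: the `prev` and `event` field names, the all-zero genesis value, and the canonical-JSON hashing are assumptions made for the example.

```python
import hashlib
import json

GENESIS = "0" * 64  # assumed placeholder "previous hash" for the first entry


def _digest(entry):
    # Canonical JSON so the same entry always produces the same hash
    blob = json.dumps(entry, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(blob.encode("utf-8")).hexdigest()


def append_entry(log_lines, event):
    """Append one JSONL record that carries the SHA-256 of the previous record."""
    prev = _digest(json.loads(log_lines[-1])) if log_lines else GENESIS
    entry = {"prev": prev, "event": event}
    log_lines.append(json.dumps(entry, sort_keys=True, separators=(",", ":")))


def verify_chain(log_lines):
    """Return the index of the first record whose link is broken, or None."""
    prev = GENESIS
    for i, line in enumerate(log_lines):
        entry = json.loads(line)
        if entry.get("prev") != prev:
            return i
        prev = _digest(entry)
    return None
```

Editing or removing a record changes the hash its successor expects, so `verify_chain` reports the following index. In a scheme like this one, the last record has no successor, so its hash would have to be recorded somewhere else to detect tampering at the tail.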
Data residency & air-gap
- Inference stays on your GPU. Prompts and completions are generated on the daemon; the cloud relay only forwards bytes between your client and your daemon and does not persist prompt or completion content. On-box calls to http://localhost:8080 never touch the relay at all.
- Your documents never leave the box. Knowledge-base chunks, embeddings, and the vector index are written under %APPDATA%/BareMetalRT/rag/ on the host. There is no external vector database and no network call in the retrieval path. See Retrieval (RAG).
- No conversation telemetry. Chat content is never phoned home; the server runs no third-party analytics on conversations.
- Offline, signed licensing. Licensed and air-gapped deployments verify an Ed25519-signed license entirely offline against a public key embedded in the daemon — no phone-home and no online activation. A disconnected appliance keeps running for the life of its license.
Privacy & offline
The model runs on your GPU, not ours. That's the whole point — your prompts and the model's replies stay on hardware you own.
- On-box calls never leave the machine. Hit http://localhost:8080 and the request goes straight to your daemon; it never touches our relay or any cloud service.
- The relay only forwards to your own GPU. When you sign in and use an API key, the cloud relay proxies the request to your daemon so you can reach it from a phone or laptop. It doesn't store your prompts or the model's output, and nothing is used for training.
- Works fully offline. Once BareMetalRT is installed and your models are downloaded, local inference needs no internet at all. Only signing in, downloading models, and remote (relay) access require a connection.
- No conversation telemetry. Chat content is never phoned home.
Air-gap mode Available
For classified, regulated, or disconnected sites. With BMRT_AIRGAP=1 set, the daemon makes zero unsolicited outbound internet connections. Off by default — normal installs keep the in-app update banner.
- No phone-home. The in-app update check — the daemon's only autonomous outbound internet call — is fully disabled: no background poll, no on-demand check, no installer download.
- Offline by construction. Licensing is verified locally (Ed25519, no activation server), fleet/cluster traffic stays on your LAN, and nothing reaches us.
- Updates on your terms. Apply the signed installer through your normal media-transfer process; the daemon never downloads or launches anything on its own.
For your security team, the no-egress guarantee, exactly what air-gap mode disables, and a self-serve verification procedure (firewall / packet capture / netstat, with a control test) are on the air-gap & no-egress attestation page →
What you can build
The same primitives you'd reach for against a hosted API — running on hardware you control.
Chat & text generation
Token-by-token streaming with paged KV-cache and fused attention kernels. Multi-turn via history. Concurrent users on a single GPU.
Model management
List, download, build, load, and unload models over the REST API while the daemon stays up. Hot-swap without a restart.
Multi-GPU mesh
Split a model too big for one card across multiple RTX GPUs over plain Ethernet — tensor parallelism over our network transport. Set X-GPU-Mode: tp2.
Tool calling & agents
A built-in MCP host and agent loop — connect tools from the one-click catalog and the model calls them mid-chat, with approval prompts for anything side-effecting. See MCP & integrations.
Structured output
Force responses to a JSON Schema with grammar-constrained sampling — valid JSON every time, no retry loop. Landing with the v1.0 tool-calling track.
GPU telemetry
Real-time VRAM, temperature, utilization, and power off /api/gpu-metrics — wire your own dashboards and autoscalers.
OpenAI compatibility Live
LM Studio, Ollama, and friends ship an OpenAI-shaped endpoint so existing tools just work. So do we — pointed at NVIDIA's data-center engine instead of llama.cpp.
Every OpenAI client works by changing one line — the base_url. Use the relay host with your API key to reach your GPU from anywhere, or the local daemon for keyless, on-box calls:
```python
# pip install openai; only base_url + api_key change
from openai import OpenAI

client = OpenAI(
    base_url="https://baremetalrt.ai/v1",             # relay → your GPU (needs a key)
    api_key="bmrt_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx",  # mint one in Account Settings
)
# or, on the box itself (latest daemon):
#   base_url="http://localhost:8080/v1", api_key="bmrt_local"

resp = client.chat.completions.create(
    model="qwen3-4b-int4",
    messages=[{"role": "user", "content": "Who are you, and what can you do?"}],
    stream=True,
)
for chunk in resp:
    print(chunk.choices[0].delta.content or "", end="")
```

```javascript
// npm install openai
import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "https://baremetalrt.ai/v1",
  apiKey: "bmrt_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx",
});
const stream = await client.chat.completions.create({
  model: "qwen3-4b-int4",
  messages: [{ role: "user", content: "Who are you, and what can you do?" }],
  stream: true,
});
for await (const part of stream)
  process.stdout.write(part.choices[0]?.delta?.content ?? "");
```

```powershell
curl.exe https://baremetalrt.ai/v1/chat/completions `
  -H "Authorization: Bearer bmrt_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx" `
  -H "Content-Type: application/json" `
  -d '{ "model": "qwen3-4b-int4", "messages": [{"role":"user","content":"Hi"}] }'
```
The /v1 surface is a thin translation layer over the native /api/chat stream: your OpenAI messages[] map to a prompt, and the token stream maps back to chat.completion.chunk objects. Sampling is owned by each model's catalog tier, so temperature and top_p are accepted but not applied. The relay path is live now; the keyless localhost:8080/v1 ships with the next daemon update.
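To make the translation idea concrete, the sketch below shows one plausible mapping in each direction. It is an illustration only, not the daemon's code: splitting messages[] into a final `message` plus `history`, dropping system turns, and mapping `truncated` to a finish_reason of "length" are all assumptions, and both function names are hypothetical.

```python
def split_messages(messages):
    """OpenAI messages[] -> (message, history) in the native /api/chat shape."""
    if not messages or messages[-1]["role"] != "user":
        raise ValueError("expected the final message to be a user turn")
    # Assumption: only user/assistant turns carry over; system turns are dropped here
    history = [
        {"role": m["role"], "content": m["content"]}
        for m in messages[:-1]
        if m["role"] in ("user", "assistant")
    ]
    return messages[-1]["content"], history


def frame_to_chunk(frame, model, chunk_id="chatcmpl-local", created=0):
    """Native stream frame -> OpenAI-style chat.completion.chunk dict."""
    if frame.get("done"):
        delta = {}
        # Assumption: a truncated reply corresponds to finish_reason "length"
        finish = "length" if frame.get("truncated") else "stop"
    else:
        delta = {"content": frame.get("token", "")}
        finish = None
    return {
        "id": chunk_id,
        "object": "chat.completion.chunk",
        "created": created,
        "model": model,
        "choices": [{"index": 0, "delta": delta, "finish_reason": finish}],
    }
```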
Use the openai (or anthropic) package you already know, pointed at our endpoint. It installs in your project's environment, not the daemon's bundled runtime: the two never share a dependency tree, so the "don't pip-install into the daemon" rule from Install doesn't apply to your app. Change one base_url and every OpenAI-shaped tool, framework, and SDK works.
Anthropic compatibility Live
Built your app on the Claude SDK? Point it here too. The /v1/messages endpoint speaks the Anthropic Messages wire format — named SSE events and all — so the anthropic SDK works unchanged.
The Anthropic SDK authenticates with x-api-key, which the relay accepts as your bmrt_ key:
```python
# pip install anthropic; only base_url + api_key change
from anthropic import Anthropic

client = Anthropic(
    base_url="https://baremetalrt.ai",
    api_key="bmrt_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx",
)
with client.messages.stream(
    model="qwen3-1.7b",
    max_tokens=256,
    messages=[{"role": "user", "content": "Who are you, and what can you do?"}],
) as stream:
    for text in stream.text_stream:
        print(text, end="", flush=True)
```

```powershell
curl.exe https://baremetalrt.ai/v1/messages `
  -H "x-api-key: bmrt_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx" `
  -H "anthropic-version: 2023-06-01" `
  -H "Content-Type: application/json" `
  -d '{ "model": "qwen3-1.7b", "max_tokens": 256, "messages": [{"role":"user","content":"Hi"}] }'
```
Responses are message objects with a content block array and stop_reason (end_turn / max_tokens). system is read from the top-level field; tool blocks aren't generated yet (see Tool use).
Vision & multimodal Live
Attach an image and ask about it — photos, screenshots, charts, and documents, read on your own GPU. Load a vision-language model, then send images alongside your prompt.
Vision-language models in the catalog include Qwen3-VL (2B / 4B / 8B), Qwen2-VL 2B, Pixtral 12B, Phi-4 Multimodal, Gemma 3 (4B / 12B / 27B), and Llama 3.2 Vision. Load one the usual way (see Managing models); the daemon cold-starts it with image support enabled.
Send images on the native /api/chat endpoint with an images array — each entry is a base64 string or a data: URL (PNG, JPEG, GIF, or WebP):
```
# POST /api/chat: ask about an image
{
  "message": "What's in this image?",
  "images": [ "data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAA..." ],
  "max_tokens": 512
}
```
Multiple entries in the images array are folded into the same prompt. Only models with vision capability accept images; sending images to a text-only model has no effect.
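A small sketch of preparing the images array from raw file bytes follows. It sniffs the four accepted formats by their magic bytes and emits data: URLs; the helper names are hypothetical, and the request fields are the ones shown above.

```python
import base64

# Leading magic bytes for three of the accepted formats
MIME_BY_MAGIC = [
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"\xff\xd8\xff", "image/jpeg"),
    (b"GIF87a", "image/gif"),
    (b"GIF89a", "image/gif"),
]


def sniff_mime(data):
    """Guess the MIME type from magic bytes; raise on anything else."""
    for magic, mime in MIME_BY_MAGIC:
        if data.startswith(magic):
            return mime
    # WebP is a RIFF container with "WEBP" at byte offset 8
    if data[:4] == b"RIFF" and data[8:12] == b"WEBP":
        return "image/webp"
    raise ValueError("unsupported image format")


def to_data_url(data):
    """Encode image bytes as a data: URL."""
    encoded = base64.b64encode(data).decode("ascii")
    return f"data:{sniff_mime(data)};base64,{encoded}"


def build_vision_body(message, image_blobs, max_tokens=512):
    """Body for POST /api/chat with one or more images attached."""
    return {
        "message": message,
        "images": [to_data_url(blob) for blob in image_blobs],
        "max_tokens": max_tokens,
    }
```

Read each file with `open(path, "rb").read()` and pass the bytes in `image_blobs`.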
Tool use & function calling Experimental
The OpenAI-compatible tool-calling surface: pass tools, the model returns tool_calls, you run them and feed the results back. This is the shape we're building toward.
/v1 passthrough is rolling out. Agentic tool-calling works now through the native /api/chat agent loop and the one-click MCP catalog (see MCP & integrations) — connect a tool and the model calls it. The OpenAI-shaped tools / tool_calls passthrough on /v1/chat/completions documented below is landing next; design against this shape now.
You describe each function with a JSON-Schema parameter spec. When the model decides to call one, the reply carries a tool_calls array and finish_reason: "tool_calls" instead of prose — identical to the OpenAI shape.
```powershell
curl.exe https://baremetalrt.ai/v1/chat/completions `
  -H "Authorization: Bearer bmrt_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx" `
  -H "Content-Type: application/json" `
  -d '{
    "model": "qwen3-4b-int4",
    "messages": [{"role":"user","content":"What is the weather in Paris?"}],
    "tools": [{
      "type": "function",
      "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city",
        "parameters": {
          "type": "object",
          "properties": { "city": {"type":"string"} },
          "required": ["city"]
        }
      }
    }]
  }'
# → the model returns a tool call instead of text:
#   "finish_reason": "tool_calls",
#   "message": { "role":"assistant", "tool_calls": [
#     { "id":"call_1", "type":"function",
#       "function": { "name":"get_weather", "arguments":"{\"city\":\"Paris\"}" } } ] }
```

```python
# pip install openai; the standard tool loop works unchanged
from openai import OpenAI
import json

client = OpenAI(base_url="https://baremetalrt.ai/v1", api_key="bmrt_…")


def get_weather(city):
    # Placeholder so the example runs; swap in your real lookup
    return {"city": city, "forecast": "unknown"}


tools = [{"type": "function", "function": {
    "name": "get_weather",
    "parameters": {"type": "object",
                   "properties": {"city": {"type": "string"}},
                   "required": ["city"]}}}]
messages = [{"role": "user", "content": "Weather in Paris?"}]

r = client.chat.completions.create(model="qwen3-4b-int4", messages=messages, tools=tools)
call = r.choices[0].message.tool_calls[0]
args = json.loads(call.function.arguments)   # {"city": "Paris"}
result = get_weather(**args)                 # you run the function

# feed the call + result back, then ask for the final answer
messages += [r.choices[0].message,
             {"role": "tool", "tool_call_id": call.id, "content": json.dumps(result)}]
final = client.chat.completions.create(model="qwen3-4b-int4", messages=messages, tools=tools)
print(final.choices[0].message.content)
```
Every model returns the same tool_calls shape, so your code doesn't branch on the model.
Workflows Live
A repeatable process that runs the same way every time. Build it as an ordered funnel — Input → Action → AI → Output — and the daemon walks the funnel in order on your GPU host, so control flow is enforced, not improvised by the model.
Define the process with named fields (goal, process_owner, inputs, outputs), pick the model it runs on (it loads automatically before the run), and add ordered nodes. Each node is one of four types:
- input: collect a starting value.
- action: call a tool (MCP connector or built-in) with fixed params. In read-only mode, side-effecting tools are refused.
- ai: a model reasoning step driven by an instruction.
- output: capture the final result.
Each node's result is piped into later nodes with {{nodeId}} placeholders — e.g. an ai step references an earlier input as {{n1}} — so every run is repeatable and auditable.
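The placeholder piping can be pictured with a short sketch. The node dictionaries (`id`, `type`, `template`) and the caller-supplied `handlers` are hypothetical shapes chosen for the example, not the daemon's workflow schema; only the {{nodeId}} substitution and the in-order walk come from the description above.

```python
import re

PLACEHOLDER = re.compile(r"\{\{(\w+)\}\}")  # matches {{n1}}, {{n2}}, ...


def render(template, results):
    """Replace each {{nodeId}} with the result of an earlier node."""
    def substitute(match):
        node_id = match.group(1)
        if node_id not in results:
            raise KeyError(f"node {node_id!r} has not produced a result yet")
        return str(results[node_id])
    return PLACEHOLDER.sub(substitute, template)


def run_funnel(nodes, handlers):
    """Walk nodes strictly in order, piping earlier results into later ones."""
    results = {}
    for node in nodes:
        text = render(node.get("template", ""), results)
        results[node["id"]] = handlers[node["type"]](text)
    return results
```

Because a node can only reference results that already exist, a forward reference fails loudly instead of producing a partially filled prompt.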
Agents Live
Give the model a goal and let it work — a headless agent loop that plans, calls tools, and reports back, without a node-by-node script.
Start a run with a goal and a permission_mode: readonly refuses any side-effecting tool, while autonomous pre-authorizes the run's tool calls. The run records a tool_log of every invocation and a final_result; poll it for status (pending / running / completed / failed / cancelled) or cancel it mid-flight.
{ "goal": "…", "permission_mode": "readonly" }
Routines Live
Run an agent or a workflow on a schedule — the same goal, every morning or every N hours, unattended.
A routine wraps either a goal-based agent run or a saved workflow_id, with a schedule_type of daily (a local at time like "09:00") or interval (every_hours). Routines carry an enabled flag so you can pause one without deleting it, and you can fire any routine immediately, off-schedule.
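As a sketch of how the two schedule types could resolve to a concrete time, the function below computes the next run from a routine dictionary. The field names (`schedule_type`, `at`, `every_hours`, `enabled`) follow the description above; treating a first interval run as due immediately and using naive local datetimes are assumptions, and `next_run` is a hypothetical helper rather than the daemon's scheduler.

```python
from datetime import datetime, timedelta


def next_run(routine, now, last_run=None):
    """Return the next datetime a routine is due, or None if it is paused."""
    if not routine.get("enabled", True):
        return None
    kind = routine["schedule_type"]
    if kind == "daily":
        hour, minute = map(int, routine["at"].split(":"))
        candidate = now.replace(hour=hour, minute=minute, second=0, microsecond=0)
        # Already past today's slot: roll over to tomorrow
        return candidate if candidate > now else candidate + timedelta(days=1)
    if kind == "interval":
        if last_run is None:
            return now  # assumption: a routine that never ran is due immediately
        return max(now, last_run + timedelta(hours=routine["every_hours"]))
    raise ValueError(f"unknown schedule_type: {kind!r}")
```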
Structured output Experimental
Constrain the reply to a JSON Schema with response_format — valid JSON every time, no parse-and-retry loop. OpenAI-compatible shape.
The daemon does not yet enforce response_format: you can coax JSON with a prompt, but nothing enforces the schema, so there's no guarantee. Grammar-constrained sampling (xgrammar) lands with the v1.0 tool-calling track and makes the schema a hard constraint. The shape below is what you'll build against.
```powershell
curl.exe https://baremetalrt.ai/v1/chat/completions `
  -H "Authorization: Bearer bmrt_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx" `
  -H "Content-Type: application/json" `
  -d '{
    "model": "qwen3-4b-int4",
    "messages": [{"role":"user","content":"Extract the person from: Alice is 30."}],
    "response_format": {
      "type": "json_schema",
      "json_schema": {
        "name": "person",
        "strict": true,
        "schema": {
          "type": "object",
          "properties": { "name": {"type":"string"}, "age": {"type":"integer"} },
          "required": ["name","age"],
          "additionalProperties": false
        }
      }
    }
  }'
# → content is guaranteed to parse against the schema:
#   { "name": "Alice", "age": 30 }
```

```python
from openai import OpenAI
import json

client = OpenAI(base_url="https://baremetalrt.ai/v1", api_key="bmrt_…")
r = client.chat.completions.create(
    model="qwen3-4b-int4",
    messages=[{"role": "user", "content": "Extract the person from: Alice is 30."}],
    response_format={
        "type": "json_schema",
        "json_schema": {"name": "person", "strict": True, "schema": {
            "type": "object",
            "properties": {"name": {"type": "string"}, "age": {"type": "integer"}},
            "required": ["name", "age"],
            "additionalProperties": False,  # matches the curl example above
        }},
    },
)
person = json.loads(r.choices[0].message.content)
# always valid: {"name": "Alice", "age": 30}
```
MCP & integrations Live
The daemon is a full Model Context Protocol host. Connect a tool and the agent can call it mid-conversation — files, web, databases, your enterprise systems — with the model deciding when, and the daemon running the round-trip. Everything executes on your GPU host: tokens and data flow tool ↔ local agent, never through our cloud.
One-click integrations catalog
Open the app → Skills → Integrations to browse the catalog and connect a tool in one click — or see the full list of 300+ integrations across 14 categories and every major industry. Bundled connectors are stdlib-only and need no Node — they work on a clean Windows box. Read-only tools run silently; anything side-effecting (a write, a send, a create) pauses for your approval before it runs.
Keyless, on-box
Weather (Open-Meteo), Fetch (read any web page, SSRF-guarded), Files (a folder you choose; writes approval-gated), and SQLite (read-only queries). No token, no Node.
npx servers
GitHub, Slack, Postgres, and a Browser (Playwright) via the official @modelcontextprotocol servers. Install Node.js and they light up; add a token where the service needs one.
Your data, your model
Microsoft 365 (Outlook, OneDrive, SharePoint, Teams, Excel), Databricks, Snowflake (read-only SQL), and Salesforce (SOQL + records). Token/OAuth stays on your GPU host — proprietary data never touches a cloud LLM.
Any MCP server
Paste a remote Streamable-HTTP URL (with an auth header) or a local launch command, and the daemon connects it like any other. Plug a local Git repo for read-only code tools in one click.
Connect a GitHub repo
From the composer, Connect GitHub runs a repo-scoped OAuth flow (separate consent from signing in), lists the repositories you can access, and on pick the daemon clones the repo onto your GPU host and attaches a read-only Git MCP server so the model can read the code. The OAuth token is held encrypted by the relay (so it can hand it to your daemon to clone) and the requested scope (repo) is read/clone access — it is never used to push or open issues. Like sign-in, it activates when GITHUB_CLIENT_ID / GITHUB_CLIENT_SECRET are configured.
Drive it from the API
The same agent loop is on the native endpoint: send agent: true to /api/chat and the daemon runs generate → call tool → feed result → continue, streaming tool_call and tool_result events alongside tokens. Tools resolve against the built-ins plus every connected MCP server (namespaced mcp__<server>__<tool>).
```powershell
# Ask the agent something its connected tools can answer
curl.exe -N http://localhost:8080/api/chat `
  -H "Content-Type: application/json" `
  -d '{ "message": "What is the weather in Tokyo? Use your tools.", "agent": true }'
# → data: {"tool_call": {"name": "mcp__weather__get_weather", "args": {"location": "Tokyo"}}}
# → data: {"tool_result": {"name": "mcp__weather__get_weather", "output": "Weather in Tokyo: …"}}
# → data: {"token": "The"} … then the model's written answer
```
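When consuming the agent stream from your own code, it helps to sort frames by kind and to split the namespaced tool name. The sketch below does both; the helper names are hypothetical, and it assumes server names contain no double underscore, which the mcp__<server>__<tool> convention implies but does not state.

```python
def split_tool_name(name):
    """'mcp__weather__get_weather' -> ('weather', 'get_weather').

    Names without the mcp__ prefix are treated as built-ins: (None, name).
    """
    prefix = "mcp__"
    if name.startswith(prefix):
        server, sep, tool = name[len(prefix):].partition("__")
        if sep and server and tool:
            return server, tool
    return None, name


def classify(evt):
    """Label one decoded stream frame by the key it carries."""
    for kind in ("tool_call", "tool_result"):
        if kind in evt:
            return kind
    if evt.get("done"):
        return "done"
    if "token" in evt:
        return "token"
    return "unknown"
```

A client can route tool_call and tool_result frames to a side panel or log while appending token frames to the visible reply.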
Embeddings Live
Turn text into vectors for semantic search, RAG, clustering, and dedup. /v1/embeddings follows the OpenAI shape, served by all-MiniLM-L6-v2 (384-dim, Apache 2.0) running on your GPU alongside the chat model.
```python
from openai import OpenAI

client = OpenAI(
    base_url="https://baremetalrt.ai/v1",
    api_key="bmrt_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx",
)
resp = client.embeddings.create(
    model="all-MiniLM-L6-v2",
    input=["the cat sat on the mat", "a feline rested on the rug"],
)
print(len(resp.data[0].embedding))  # 384
# cosine similarity of the two vectors is high: they mean the same thing
```

```powershell
curl.exe https://baremetalrt.ai/v1/embeddings `
  -H "Authorization: Bearer bmrt_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx" `
  -H "Content-Type: application/json" `
  -d '{ "model": "all-MiniLM-L6-v2", "input": "the cat sat on the mat" }'
```
Responses follow the OpenAI { "object": "list", "data": [{ "embedding": [...] }] } shape, so vector stores and RAG frameworks that target OpenAI embeddings work unchanged. Live on the relay, and verified end-to-end returning correct 384-dim L2-normalized vectors on RTX hardware.
Retrieval (RAG) Live
Ground answers in your own documents — entirely on your GPU. Use the built-in knowledge bases (no code), or build your own pipeline on the embeddings endpoint. Either way the corpus, the index, and the model all stay on the machine.
Built-in knowledge bases Live
No code required. In the chat box at the bottom of the Dashboard, open the 📚 knowledge-base menu to create a base, attach your documents with the paperclip, and click the ingest button (📥). Select that base and every answer is grounded in your own text with bracketed [n] citations. The chunker, the embeddings (all-MiniLM-L6-v2), the flat vector index, and the source text all live on your GPU host — nothing is sent to a cloud index or a third-party service.
The app drives a small daemon API you can call directly to manage bases and ingest from your own tooling — text is extracted client-side, so the daemon takes no parser dependency:
# create a base (request body)
{ "name": "Company handbook" }
# ingest already-extracted text into a base (request body)
{ "docs": [{ "source": "handbook.md", "text": "…" }] }
To ground a chat, add collection to the /api/chat body: the daemon retrieves the top-k chunks for the message and folds them into the prompt with their sources. Behind the scenes, documents are split into ~1600-character chunks, embedded once at ingest, and ranked at query time by cosine similarity. Ranking is a single matmul over L2-normalized vectors, a flat scan that stays fast into the hundreds of thousands of chunks, so no ANN index is required.
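The retrieve-and-fold step can be pictured with a few lines of plain Python. This is an illustration of the idea, not the daemon's code: the function names, the prompt wording, and the `[n]` numbering scheme are assumptions, and the two-dimensional vectors are toy values standing in for 384-dim embeddings.

```python
def top_k(query_vec, chunk_vecs, k=2):
    """Flat scan: score every chunk by dot product (equal to cosine for unit vectors)."""
    scores = [sum(q * c for q, c in zip(query_vec, vec)) for vec in chunk_vecs]
    return sorted(range(len(scores)), key=scores.__getitem__, reverse=True)[:k]

def fold_into_prompt(message, chunks, ranked):
    """Number the retrieved chunks so the answer can cite them as [n]."""
    sources = "\n".join(
        f"[{n}] {chunks[i]['source']}: {chunks[i]['text']}"
        for n, i in enumerate(ranked, start=1)
    )
    return f"Answer from these sources and cite them as [n].\n{sources}\n\nQuestion: {message}"

# Toy corpus and unit vectors (illustrative only).
chunks = [
    {"source": "handbook.md", "text": "Employees accrue 1.5 vacation days per month."},
    {"source": "handbook.md", "text": "Expense reports are due by the 5th."},
    {"source": "it-policy.md", "text": "Laptops are refreshed every three years."},
]
chunk_vecs = [[0.8, 0.6], [0.0, 1.0], [0.6, 0.8]]
query_vec = [1.0, 0.0]

ranked = top_k(query_vec, chunk_vecs, k=2)
prompt = fold_into_prompt("How much vacation do I get?", chunks, ranked)
print(ranked)
print(prompt)
```

A flat scan like this touches every vector on each query, which is why it needs no index build step and stays exact.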
Each base is stored under %APPDATA%/BareMetalRT/rag/<collection>/ on the GPU host. There is no external vector database and no network call in the retrieval path. The privacy promise is the architecture, not a setting.
Or build your own pipeline
Prefer your own vector store or framework? Call the embeddings endpoint directly and keep the index wherever you like:
# pip install openai numpy
# Fully local RAG in ~20 lines
from openai import OpenAI
import numpy as np

client = OpenAI(base_url="https://baremetalrt.ai/v1", api_key="bmrt_…")

# 1. Your knowledge base, split into chunks
docs = [
    "BareMetalRT runs TensorRT-LLM natively on Windows.",
    "Tensor parallelism splits one model across two RTX GPUs over standard networking.",
    "Voice mode ships on by default and can be toggled off in settings.",
]

def embed(texts):
    r = client.embeddings.create(model="all-MiniLM-L6-v2", input=texts)
    return np.array([d.embedding for d in r.data])

doc_vecs = embed(docs)  # embed the corpus once

# 2. At query time, embed the question and rank by cosine similarity
query = "How do I run a model too big for one card?"
q = embed([query])[0]
scores = doc_vecs @ q  # vectors are L2-normalized → dot = cosine
top = [docs[i] for i in scores.argsort()[::-1][:2]]  # top-2 chunks

# 3. Stuff the retrieved context into the chat prompt
context = "\n".join(top)
resp = client.chat.completions.create(
    model="qwen3-4b-int4",
    messages=[
        {"role": "system", "content": f"Answer using only this context:\n{context}"},
        {"role": "user", "content": query},
    ],
)
print(resp.choices[0].message.content)
Documents attached in the app go through POST /api/upload (text-layer PDFs, .txt, .md; scanned/image-only PDFs are rejected with a clear message, and OCR is on the roadmap). For a developer pipeline, extract text your own way, chunk it, and feed it through the loop above. Because /v1/embeddings is the OpenAI shape, drop-in vector stores (FAISS, Chroma, pgvector) and frameworks (LangChain, LlamaIndex) work too: point them at the relay.
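For the "chunk it" step in a developer pipeline, a simple character-budget splitter is often enough to start with. The sketch below packs paragraphs into chunks of at most 1600 characters, mirroring the chunk size mentioned for the built-in knowledge bases. The splitting rules here (paragraph boundaries first, hard split for oversized paragraphs) are an assumption, not a description of the daemon's chunker, and `chunk_text` is a hypothetical name.

```python
def chunk_text(text, max_chars=1600):
    """Split text into chunks of at most max_chars, preferring paragraph breaks."""
    chunks, current = [], ""
    for para in text.split("\n\n"):
        para = para.strip()
        if not para:
            continue
        # Hard-split any single paragraph that is longer than one chunk.
        while len(para) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(para[:max_chars])
            para = para[max_chars:]
        candidate = f"{current}\n\n{para}" if current else para
        if len(candidate) <= max_chars:
            current = candidate      # paragraph still fits in the open chunk
        else:
            chunks.append(current)   # close the open chunk and start a new one
            current = para
    if current:
        chunks.append(current)
    return chunks

# Five synthetic paragraphs of 711 characters each.
doc = "\n\n".join(f"Section {i}. " + "x" * 700 for i in range(5))
chunks = chunk_text(doc)
print([len(c) for c in chunks])
```

Each resulting chunk can be passed to the `embed` helper from the example above. Keeping the source filename next to each chunk makes it easy to show citations later.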
IDE & tools Live
Anything that speaks OpenAI speaks BareMetalRT. Point your editor or agent framework at the relay and your own RTX GPU answers — no GPT-4 bill, no code leaving your machine.
Cursor
Settings → Models → enable Override OpenAI Base URL. Set the base URL to https://baremetalrt.ai/v1, paste your bmrt_ key as the OpenAI API key, and add a custom model named after one in your catalog (e.g. qwen3-1.7b). Cursor verifies the key against /v1/models — which is live — so the green check lights up.
Continue (VS Code / JetBrains)
Drop this into ~/.continue/config.json:
{
"models": [{
"title": "BareMetalRT",
"provider": "openai",
"model": "qwen3-1.7b",
"apiBase": "https://baremetalrt.ai/v1",
"apiKey": "bmrt_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
}]
}
Everything else
The same two values — base URL https://baremetalrt.ai/v1 and a bmrt_ key — work in any tool with a configurable OpenAI endpoint: the openai Python/JS SDKs, LangChain (ChatOpenAI(base_url=…)), LlamaIndex, Vercel AI SDK, Open WebUI, and friends. For on-box, keyless use, swap in http://localhost:8080/v1 with the latest daemon.
REST API reference
Served by the daemon on http://localhost:8080, and proxied through https://baremetalrt.ai when you authenticate with an API key.
Inference Live
Stream a completion as Server-Sent Events. Body and event shapes documented under Super quick start. Send header X-GPU-Mode: tp2 to route across a two-GPU mesh. Add collection to the body to ground the answer in a knowledge base.
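As a sketch of how the header and the body field combine, the helper below builds (but does not send) a request for /api/chat using only the standard library. The function name `build_chat_request` is hypothetical, and whether `collection` expects a base's name or an identifier is an assumption; the value used here is just the example name from the knowledge-base section.

```python
import json
import urllib.request

def build_chat_request(message, collection=None, tp2=False,
                       base_url="http://localhost:8080"):
    """Build a POST /api/chat request; nothing is sent until urlopen() is called."""
    body = {"message": message}
    if collection is not None:
        body["collection"] = collection      # ground the answer in a knowledge base
    headers = {"Content-Type": "application/json"}
    if tp2:
        headers["X-GPU-Mode"] = "tp2"        # route across a two-GPU mesh
    return urllib.request.Request(
        f"{base_url}/api/chat",
        data=json.dumps(body).encode("utf-8"),
        headers=headers,
        method="POST",
    )

req = build_chat_request("What is our leave policy?",
                         collection="Company handbook", tp2=True)
print(req.get_method(), req.full_url)
print(req.header_items())
print(json.loads(req.data))
```

Passing the request to `urllib.request.urlopen(req)` and iterating the response line by line would then yield the Server-Sent Events frames.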
Models Live
List catalog models with download state, VRAM fit, and loadability for this node.
Download, hot-load, and unload models without restarting the daemon. Most of the catalog runs on the PyTorch backend (PyExecutor) and loads straight from the HuggingFace checkpoint — pull then load, no build step. /build applies only to the legacy TensorRT path (the TP=2 mesh model), which compiles an engine before it can load. Poll /api/models/{id}/status for live download and build progress.
Knowledge bases (RAG) Live
Create local knowledge bases, ingest already-extracted document text (chunked + embedded on the GPU host), and retrieve the top-k chunks. Pass collection in the /api/chat body to ground an answer in a base. The vectors and source text never leave the machine. See Retrieval (RAG).
Telemetry Live
Daemon readiness and live GPU stats:
# GET /api/gpu-metrics
{ "vram_used_mb": 8880, "vram_total_mb": 12282, "temperature_c": 45, "gpu_util_pct": 10, "power_w": 85 }
Keys Live
Create, list, and revoke API keys for relay access. Manage them visually in Account Settings.
Errors & status codes
Errors follow the OpenAI shape — an error object with message, type, and sometimes code. In a non-streaming call the HTTP status carries the failure; in a streaming call the failure arrives as an error frame and the stream then closes.
| Status | code | When |
|---|---|---|
| 401 | invalid_api_key | Missing or revoked bmrt_ key on a relay call. |
| 503 | — | No GPU node connected — your daemon is offline or still loading. |
| 503 | kv_exhausted | Engine at capacity (paged-KV pool full). Back off and retry. |
| 503 | — | Context window full — start a new conversation or trim history. |
| 400 | style_not_allowed | Requested a tuner style above the active model's tier. |
Example bodies
# 401: no/invalid key (relay)
{ "error": { "message": "Missing or invalid API key…", "type": "invalid_request_error", "code": "invalid_api_key" } }
# 503: backpressure (engine at capacity)
{ "error": { "message": "Server at capacity, retry shortly", "type": "server_error", "code": "kv_exhausted" } }
# streaming: error arrives as a frame, then the stream closes
data: {"...chunk...", "error": {"message": "No GPU node connected", "type": "server_error"}}
On the native /api/chat endpoint the same failures arrive flatter: data: {"error": "…"} followed by data: {"done": true}. The /v1 shim reshapes these into the OpenAI error object above.
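A client that talks to both surfaces can fold the two error shapes into one. The sketch below assumes the payload has already been decoded from JSON and returns a `(message, code)` pair. The name `normalize_error` is hypothetical, and the sample messages are illustrative.

```python
def normalize_error(payload):
    """Return (message, code) from either error shape, or None when there is no error."""
    err = payload.get("error")
    if err is None:
        return None
    if isinstance(err, str):
        # Native /api/chat frame: {"error": "..."} with no machine-readable code.
        return err, None
    # OpenAI-shaped object: {"error": {"message": ..., "type": ..., "code": ...}}
    return err.get("message", ""), err.get("code")

openai_style = {"error": {"message": "Server at capacity, retry shortly",
                          "type": "server_error", "code": "kv_exhausted"}}
native_style = {"error": "No GPU node connected"}
token_frame = {"token": "The"}

print(normalize_error(openai_style))
print(normalize_error(native_style))
print(normalize_error(token_frame))
```

Since the native frames carry no `code`, callers that need to tell backpressure apart from other failures may prefer the /v1 surface.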
Rate limits & concurrency
Your GPU, your rules. There is no per-token metering, no monthly quota, and no request cap — the only ceiling is the silicon.
- No quotas. Self-hosted inference is unmetered. The relay authenticates with your key but does not bill or throttle by volume.
- Concurrency is real. A single GPU serves many simultaneous chats — paged KV-cache and continuous batching keep per-user latency flat until the KV pool fills.
- Backpressure, not degradation. When the pool is exhausted the engine returns 503 kv_exhausted immediately rather than silently slowing everyone down. Treat it like a 429: back off briefly and retry.
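One common way to "back off briefly and retry" is exponential backoff with jitter. The sketch below is a generic retry wrapper, not part of BareMetalRT: `EngineBusy` is a hypothetical exception your own request function would raise when it sees 503 with code kv_exhausted, and the delay values are arbitrary starting points.

```python
import random
import time

class EngineBusy(Exception):
    """Hypothetical: raised by the caller's request function on 503 kv_exhausted."""

def with_backoff(call, retries=5, base_delay=0.25, max_delay=4.0, sleep=time.sleep):
    """Run call(), retrying on EngineBusy with exponentially growing, jittered delays."""
    for attempt in range(retries + 1):
        try:
            return call()
        except EngineBusy:
            if attempt == retries:
                raise  # out of retries: surface the failure to the caller
            delay = min(max_delay, base_delay * 2 ** attempt)
            sleep(delay * random.uniform(0.5, 1.0))  # jitter spreads out retrying clients

# Demo with a fake request that is busy twice, then succeeds.
attempts = {"n": 0}

def flaky():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise EngineBusy("kv_exhausted")
    return "ok"

waits = []
result = with_backoff(flaky, sleep=waits.append)  # record delays instead of sleeping
print(result, attempts["n"], len(waits))
```

Injecting `sleep` keeps the wrapper testable without real waiting. In production code the default `time.sleep` applies.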
API stability & versioning
What's safe to build on today, and how we signal change. The badge on each section is the contract.
| Badge | Means | Build on it? |
|---|---|---|
| Live | Implemented, served today, shape is stable. | Yes — production-safe. |
| Preview | Shipping soon; the surface exists but may change. | Prototype against it; pin a daemon version. |
| Experimental | Documents a planned shape; not enforced yet. | Design against it; don't depend on it. |
- Two surfaces. The native /api/* endpoints are our own; the /v1/* (OpenAI) and /v1/messages (Anthropic) surfaces track those vendors' wire formats so existing clients work unchanged.
- Additive by default. We add fields and endpoints without breaking existing ones. New optional request fields are safe to ignore; unknown fields you send are ignored, not rejected.
- Breaking changes are announced. Any breaking change to a Live endpoint is called out in the release notes, tied to the daemon version that introduces it. Pin a version if you need to upgrade deliberately.
The daemon version is reported by GET /api/diagnostics/executor (version field). Log it with your integration so you can correlate behavior to a specific release.
Headless deployment Live
BareMetalRT runs as a background process with no window and no GUI — it's how the hosted demo serves traffic right now. Launch it with a port and it exposes the full REST API (and the /v1 surface) on your LAN. Ideal for a spare workstation or a home server.
# start the daemon headless: no window, no display attached
PS> baremetalrt.exe --port 8080
→ REST API + /v1 on http://0.0.0.0:8080
# everything is driven over HTTP from there: load a model, then chat
PS> curl.exe -X POST http://localhost:8080/api/models/qwen3-1.7b/load
PS> curl.exe http://localhost:8080/api/status
bmrt CLI Experimental
The bmrt command is planned for the v1.0 headless tooling. It is pure sugar over the HTTP endpoints above, which already work today. Until it ships, drive everything with curl against the REST API. The commands below are the planned surface.
A scriptable wrapper for load / list / serve from PowerShell — for CI, agents, and scheduled tasks:
bmrt serve --port 8080   # start the daemon headless
bmrt pull qwen3-1.7b     # download a catalog model
bmrt load qwen3-1.7b     # load it onto the GPU
bmrt ps                  # list loaded models + VRAM
bmrt unload              # free the GPU
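Until the CLI ships, the "sugar over HTTP" idea can be approximated with a small lookup from command to endpoint. The mapping below is a guess at how the planned commands line up with endpoints that appear in this document (the load and unload paths, and the catalog listing); it is not the CLI's actual design. Commands whose endpoint is not spelled out here, such as pull, are left unmapped on purpose.

```python
def bmrt_to_http(argv):
    """Translate a planned bmrt command line into (method, path) for the REST API."""
    cmd, *args = argv
    if cmd == "load":
        (model_id,) = args                      # exactly one model id expected
        return "POST", f"/api/models/{model_id}/load"
    if cmd == "unload":
        return "POST", "/api/unload"
    if cmd == "list":
        return "GET", "/api/models"             # closest documented listing endpoint
    raise NotImplementedError(f"no endpoint documented here for 'bmrt {cmd}'")

for line in (["load", "qwen3-1.7b"], ["unload"], ["list"]):
    print("bmrt", " ".join(line), "->", *bmrt_to_http(line))
```

Pairing the returned method and path with `http://localhost:8080` gives the same calls as the curl examples in the headless section.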
Model catalog
Open-weight models validated on real consumer RTX hardware. The catalog grows every week as new architectures pass the on-box battery.
| Model ID | Params | Quant | GPUs | Notes |
|---|---|---|---|---|
| qwen3-0.6b | 0.6B | FP16 | 1 | Instant response |
| qwen3-1.7b | 1.7B | FP16 | 1 | Tool calling |
| qwen3-4b-int4 | 4B | W4A16-AWQ | 1 (8 GB+) | Long context · Ampere+ |
| ds-r1-distill-1.5b | 1.5B | FP16 | 1 | Reasoning |
| llama-3.2-1b-instruct | 1B | FP16 | 1 | Sub-10 ms/token |
| llama-3.2-3b-instruct | 3B | FP16 | 1 | — |
Query the live catalog for the current list and per-node fit:
curl.exe http://localhost:8080/api/models
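For a rough sense of why the catalog entries land in the tiers they do, weight size can be estimated from parameter count and precision: FP16 stores 2 bytes per parameter and 4-bit weights about 0.5. The sketch below applies that arithmetic to the nominal sizes in the table. It is a lower bound only, since it ignores quantization scales, the KV cache, activations, and the gap between nominal and exact parameter counts.

```python
# Approximate bytes per parameter for the quantization labels used in the catalog table.
BYTES_PER_PARAM = {"FP16": 2.0, "W4A16-AWQ": 0.5}

def weight_gb(params_billion, quant):
    """Estimated weight size in GB: (params * 1e9) * bytes_per_param / 1e9."""
    return params_billion * BYTES_PER_PARAM[quant]

catalog = [
    ("qwen3-0.6b", 0.6, "FP16"),
    ("qwen3-1.7b", 1.7, "FP16"),
    ("qwen3-4b-int4", 4.0, "W4A16-AWQ"),
    ("ds-r1-distill-1.5b", 1.5, "FP16"),
    ("llama-3.2-1b-instruct", 1.0, "FP16"),
    ("llama-3.2-3b-instruct", 3.0, "FP16"),
]

for model_id, params, quant in catalog:
    print(f"{model_id:24s} ~{weight_gb(params, quant):.1f} GB of weights")
```

The int4 4B model comes out smaller than the FP16 1.7B model, which is why a 4B-class model can share an 8 GB card. The per-node fit reported by /api/models remains the authority.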
Managing models & storage
A model occupies two separate resources — GPU VRAM while it's loaded, and disk while it's downloaded. Unloading and deleting are different operations for exactly that reason.
| Action | Endpoint | Frees | Use when |
|---|---|---|---|
| Unload | POST /api/unload | GPU VRAM (files stay on disk) | Switching models, or reclaiming VRAM (e.g. for voice headroom). Reloads instantly. |
| Delete | POST /api/models/{id}/delete | Disk — weights and built engine | Reclaiming storage. A re-download is needed to use it again. |
Downloaded weights and any built engines live in the install's data directory (models\ and engine_cache\ under Program Files\BareMetalRT). Catalog models pull from Hugging Face the first time you load them and stay local after that — roughly 1–8 GB each, depending on size and quantization.
engine_cache.
Voice & VRAM Live
Voice is built in and on by default. It runs its own models — speech-to-text, voice-activity detection, and text-to-speech — that share your GPU with the chat model, so VRAM is what decides how much voice you get.
| Stage | Model | Notes |
|---|---|---|
| Speech-to-text | Whisper large-v3-turbo | Runs on the GPU via faster-whisper; CPU is configurable for tight cards. |
| Voice activity | Silero VAD | Tiny — detects when you start and stop speaking. |
| Text-to-speech | Orpheus-3B + SNAC | Full-quality neural voice (~4 GB). Steps down to a low-VRAM TTS, then to browser speech, as the card gets tighter. |
The voice stack reserves roughly 4 GB of free VRAM on top of your chat model. After a model loads, the daemon checks actual free VRAM and only enables voice if it fits — so on a tight 8 GB card running a 4B model, voice may drop to the low-VRAM tier or stay off. Want full voice? Run a smaller chat model, or unload to free headroom.
Toggle voice with POST /api/voice/enable and read state from GET /api/voice/status, e.g. { "enabled": true, "model_voice_capable": true, "ready": true }.
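To anticipate whether full voice will fit before loading a model, you can compare free VRAM from /api/gpu-metrics against the roughly 4 GB reserve described above. This is a back-of-the-envelope sketch: the exact threshold the daemon uses is not documented here, so `VOICE_RESERVE_MB` is an assumption, and `full_voice_fits` is a hypothetical helper.

```python
VOICE_RESERVE_MB = 4 * 1024  # assumption: "roughly 4 GB" taken as 4096 MB

def full_voice_fits(metrics, reserve_mb=VOICE_RESERVE_MB):
    """Return (fits, free_mb) from a /api/gpu-metrics style dict."""
    free_mb = metrics["vram_total_mb"] - metrics["vram_used_mb"]
    return free_mb >= reserve_mb, free_mb

# Values from the Telemetry example: a 12 GB card with a model already loaded.
loaded = {"vram_used_mb": 8880, "vram_total_mb": 12282}
fits, free_mb = full_voice_fits(loaded)
print(fits, free_mb)

# Same card with a smaller chat model resident (illustrative number).
lighter = {"vram_used_mb": 4200, "vram_total_mb": 12282}
print(full_voice_fits(lighter))
```

GET /api/voice/status remains the source of truth, since the daemon makes its decision from the actual free VRAM after the model loads.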
FAQ
The questions a first install actually runs into — install order, version pitfalls, and what runs where.
Why do I have to install CUDA and TensorRT myself? Can't the installer bundle them?
No — NVIDIA's license does not let us redistribute the CUDA Toolkit or the TensorRT SDK. Every other piece of the runtime (a private Python build, the PyTorch and TensorRT-LLM Python wheels, MPI, the VC++ redistributable) is downloaded for you. These two SDKs are the only parts you fetch from NVIDIA directly, with a free Developer account.
Which versions do I install? The download page offers CUDA 13 and TensorRT 11.
Use CUDA Toolkit 12.x (12.4–12.9) and TensorRT 10.15 or 10.16, not the latest 13.x / 11.x that NVIDIA's pages default to. BareMetalRT ships a version-pinned runtime (torch 2.6.0+cu124, tensorrt-cu12 10.15.1.29); CUDA 13 and TensorRT 11 use a different ABI, so inference won't start with them. Grab a 12.x toolkit from the CUDA Toolkit Archive and a TensorRT 10.x GA Windows ZIP for CUDA 12. (Newer CUDA/TensorRT and Blackwell support are on our roadmap; see below.)
I installed CUDA 13 or TensorRT 11 and the daemon won't load a model. How do I fix it?
Uninstall the 13.x CUDA Toolkit (or just install a 12.x one alongside it) and replace TensorRT 11 with a 10.15/10.16 build extracted to C:\TensorRT\. Then restart the daemon — it re-detects the SDKs on startup. If you also pip-installed a newer torch or tensorrt into the bundled runtime, reinstall from the .exe to restore the pinned set.
Which NVIDIA driver do I need?
Driver 545 or newer (required by CUDA 12.4). Newer drivers are fine and recommended — update through the NVIDIA driver page, the NVIDIA App, or GeForce Experience before installing.
Can I install BareMetalRT before the NVIDIA SDKs?
Yes. The installer notes this on its prerequisite screen — install BareMetalRT first, add the CUDA Toolkit and TensorRT afterward, and the daemon detects them automatically on its next start. You just can't load a model until both are present.
Why can't I just pip install a newer torch or tensorrt?
The Python and C++ sides of TensorRT-LLM are an ABI-locked set — the torch build, the TensorRT wheel, and the engine runtime all have to match. Upgrading one in the bundled runtime is the most common way to break inference. If something stops working after a manual change, reinstall from the .exe.
What GPU and how much VRAM do I need?
An NVIDIA RTX card, 20-series or newer. 8 GB runs the small models and 4B-class int4 (voice included); 12 GB+ adds headroom for larger Qwen3 and int4 models; two matching cards can split one larger model with tensor parallelism. There is no CPU, AMD, or Apple-Silicon fallback — the engine is TensorRT-LLM.
Does my data leave the machine?
No. Inference runs entirely on your GPU. When you sign in, the cloud relay only proxies requests to your daemon so you can reach it remotely with an API key — your prompts and model outputs aren't stored server-side, and on-box use (http://localhost:8080) never touches the relay at all.
Do I need a Hugging Face token?
Only for gated models. Paste a read token into the app's settings (stored per-user, encrypted) and gated checkpoints download under your account. Ungated catalog models need no token.