The fully integrated Agent runtime built for your Windows RTX PC.
Unlimited tokens · no rate limits · air-gapped & private · no cloud bill.
irm https://baremetalrt.ai/install.ps1 | iex
It's the real app — scroll and click around. Or open it full screen →
Not just a model server — a full agent runtime
BareMetalRT runs agents, not just chats. Bring your own model, plug in 318 integrations, run 29 ready-made skills, and chain them into workflows — all on infrastructure you control. Private by default — from a single team to the whole organization.
Run Qwen, Llama, DeepSeek, Mistral, Gemma and Phi — or drop in your own open weights. No lock-in, no per-token fees, nothing leaves your machine.
Browse the model catalog →Most run without Node.js — AWS, Azure, Datadog, Snowflake, ServiceNow, Microsoft 365, Box, Workday, Splunk, Okta. One click to connect; read-only by default, opt-in to write.
Browse all integrations →Ready-made agent recipes across Development, Writing, Research and Productivity — with new ones added every week.
Browse all skills →Chain your integrations into reusable, schedulable procedures the agent runs for you — each run scoped to just the tools it needs, and remembered so it gets better over time.
In the app — private by defaultYour agent learns durable facts from your chats, keeps them in a local markdown vault you can audit and edit, and recalls them by meaning when they’re relevant — so a small local model punches above its context window.
In the app — never leaves your machineConnect your stack, govern access, and scale from one GPU to a fleet — without sending data to the cloud.
Your inbox, your notes, your files — automated privately on hardware you already own.
Most local AI tools run on llama.cpp — a general-purpose backend built for portability across any chip. BareMetalRT runs TensorRT-LLM: NVIDIA's production inference engine, built for RTX hardware, ported natively to Windows for the first time.
TensorRT-LLM is what AWS, Azure, and Google run to serve frontier models at scale. It compiles models to your exact RTX card at install time — fused attention kernels, paged KV-cache, in-flight batching, optimized GEMM, and context windows up to 512K tokens on a single box for whole codebases and long documents. The same machinery that lets one GPU serve several people at once — true in-flight batching runs concurrent chats in parallel on the card instead of one-at-a-time in a queue. How many run at once scales with your GPU's VRAM. No general-purpose backend. No performance left on the table.
Models too large for one card split across multiple RTX GPUs over ordinary Ethernet — mix different cards of the same RTX generation, no NVLink, no InfiniBand, no Linux. We replaced NCCL with a custom network transport. A 14 GB model across a desktop and a laptop. Working today.
NVIDIA discontinued TensorRT-LLM on Windows. We maintain the only working Windows port — no WSL, no Docker, no compatibility layers. Drop-in OpenAI and Anthropic APIs, so every tool that talks to GPT or Claude can point at your own RTX GPU instead.
| Engine | Inference backend | Platform | Data-center engine on Windows | Compiles to your GPU | Concurrent users | Multi-GPU mesh over your network |
|---|---|---|---|---|---|---|
| BareMetalRT | NVIDIA TensorRT-LLM | Windows | Yes | Yes | 4–8 | Yes |
| LM Studio | llama.cpp · MLX | Windows · Mac · Linux | No | No | 1 | No |
| Ollama | llama.cpp | Windows · Mac · Linux | No | No | ~4 | No |
| Jan | llama.cpp | Windows · Mac · Linux | No | No | 1 | No |
| GPT4All | llama.cpp | Windows · Mac · Linux | No | No | 1 | No |
| Windows ML | ONNX Runtime · DirectML | Windows | No | No | 1 | No |
| vLLM | PagedAttention · CUDA | Linux | No | No | many | No |
| ExLlamaV2 / TabbyAPI | ExLlamaV2 | Windows · Linux | No | No | a few | No |
Every consumer-facing local engine wraps llama.cpp. BareMetalRT is the only one running NVIDIA's production data-center engine — and the only one that splits a model across multiple RTX GPUs over ordinary Ethernet.
Menlo Research benchmarked NVIDIA TensorRT-LLM against llama.cpp — the backend every other consumer tool runs — on the same RTX hardware. TensorRT-LLM won across the board, and the lead widens as the card gets stronger.
We built a ground-up network transport to replace NCCL — tensor parallelism across mismatched GPUs over a commodity network, no specialized fabric required. At home, a 4070 in your desktop and a 4060 in your laptop run a single model together; in a business, the same mesh pools the RTX workstations you already own. Different models, different VRAM, same RTX generation. No NVLink. No InfiniBand. No Linux.
NVIDIA's collective library, NCCL, needs an NVLink-class fabric and matched GPUs. We replaced it with our own transport that runs over the ordinary network you already have, so GPUs in different machines can work on a single model together. No NVLink, no InfiniBand, no special switch — just the LAN in your home or office.
Two different GPU models, two different VRAM sizes, two different machines — same RTX generation — split a single model across all of them at inference time. NCCL demands identical cards; we don't. The sharding strategy is computed at session start from each card's available VRAM, so a 12 GB desktop and an 8 GB laptop each hold exactly the share they can fit. A model too large for either GPU alone runs across both — the first time this has worked in a consumer product.
Splitting a model across mismatched consumer cards can quietly corrupt results if it's done carelessly. We compute in full precision, so every card stays consistent: a 4070 and a 4060 working together produce the same answer each would on its own. No quality penalty for pooling GPUs across machines.
Frontier labs — Meta, Mistral, Google DeepMind — target Linux and H100 clusters. Consumer apps — LM Studio, Ollama, Jan — run llama.cpp on a single GPU. Nobody else sits at the intersection: NVIDIA's production inference stack, running natively on consumer Windows, pooling multiple GPUs over a commodity network. Each piece exists in isolation; the combination is what's new.
Open models downloaded once and run locally — from lightweight instant-response models to frontier-class reasoning models split across your GPU cluster. The catalog grows every week as new architectures are validated on-hardware.








































Sign in from any device — and the GPU you own answers, whether it sits in your study or a rack on-prem. The relay is a dumb encrypted pipe: your prompt reaches your hardware, but the model, the compute, and your conversation history never leave a machine you control. Private by default, cloud-convenient, $0 per token.
Sign in at baremetalrt.ai from any device and your GPU responds — no shared WiFi, no VPN, no carrying the hardware with you. Works the same from an office desk or a workstation across the building.
The relay only forwards TLS-encrypted bytes. Weights, compute, and your chat history — AES-256 encrypted on your own device — stay on hardware you control; no third party ever sees your tokens.
No port forwarding, no static IP, no inbound firewall holes. The daemon dials out to the relay, so your network stays closed to the outside.
Like a Plex server quietly streaming your library to every screen in the house, BareMetalRT runs as a silent, headless service on your most powerful PC — and every phone, laptop, and app on your network draws intelligence from it. No window to keep open, no GUI to babysit. It serves an OpenAI- and Anthropic-compatible API, so existing tools just work.
Runs as a daemon with no display attached — autostart on boot, ideal for a dedicated workstation or a shared office PC.
Drop-in /v1 endpoints on your LAN — OpenAI and Anthropic. Existing SDKs and tools work unchanged; just point them at the new base URL.
Load, unload, and list models straight from the terminal. Scriptable and automatable for CI, agents, and cron jobs.
Cloud AI is fast to start and slow to trust. Other local tools leave RTX performance on the table because they don't run NVIDIA's production stack. BareMetalRT is the only option that's both fully private and built on TensorRT-LLM — on the RTX GPU you already own.
An organization across a GPU fleet, a dev team on a workstation, a startup on a single box — or one person on a laptop. Same engine, same privacy, no per-seat pricing, no data leaving your network. Anyone with an NVIDIA RTX GPU can run it.
Every token runs on hardware you control; nothing leaves your network. Chat history is AES-256 encrypted on your device, credentials encrypted at rest, traffic TLS in transit. Your chats stay yours at home — and because there are no third-party subprocessors and it runs fully air-gapped, the architecture fits the requirements behind HIPAA and SOC 2 environments.
Drop-in /v1 endpoints for both the OpenAI and Anthropic APIs — point
anything that speaks GPT or Claude at your own GPU, a personal script or your team's
whole stack. No SDK changes, no prompt migration.
Start on one card. Add a second PC and split a larger model across them over your network. Add a fleet and serve your whole organization — same engine from a gaming PC to a rack.
A flat price per seat — no token meter, run it as hard as you like. Start free at home, bring your team for $50 a seat (each seat shares with up to 4), and scale to the enterprise with SSO and support. The engine runs entirely on hardware you own; nothing leaves your network. See plans & pricing →
Self-serve in minutes — no sales call. Enterprise rollout? Email a founder directly.
The streaming era ran on content delivery networks — edge infrastructure that pushed data close to people instead of routing everything back to one origin. Intelligence is following the same arc. Centralized hyperscale data centers are becoming the mainframe of this era — the wrong shape for AI that should live at the edge, on the hardware people already own. LLMDN is that delivery network for intelligence: idle RTX cards become nodes that contribute compute and together run mixture-of-experts models no single machine could ever hold.
Compute lived in a single building. Everyone dialed back to the mainframe.
The web scaled by pushing content to the edge — close to every user, not one origin.
Intelligence moves to the edge — model compute on the GPUs people already own.
Like a CDN serving from the nearest edge, LLMDN routes each request to whatever GPU is online and closest. Idle RTX cards become serving nodes — no central facility deciding who gets compute.
Every idle moment becomes useful compute. Your hardware contributes to the network even when you're not at it — and draws on the whole mesh when you need more than one card can give.
Mixture-of-experts models too big for any one card, sharded across many consumer GPUs over the same network mesh you just saw — scaled to a thousand nodes. No hyperscale facility, no NVLink.