Every release of Bare Metal RT, newest first. Voice, quantized models, an
OpenAI- and Anthropic-compatible API, and tensor parallelism across consumer
GPUs — all delivered in public betas since launch.
Jun 22v0.13.18
A private, on-box code editor — write code with a local model
Introducing the local code editor: a new Code button opens a full editor
in your browser, served from your own node. It pairs a Monaco editor (the same engine as VS Code)
with a file explorer and an AI agent that reads, edits, and runs your code using your node's local
model — nothing leaves your machine. It has a proper IDE layout (activity bar, tabs, breadcrumb,
minimap, status bar) finished in the Bare Metal RT look. Best on focused, single-file
tasks today; larger multi-file work benefits from a bigger model and GPU. The Code button appears when a coding model is loaded on the node. Shown in local mode.
Jun 22v0.13.15
More reliable int4 loads and tool calls on consumer GPUs
Fixed a separate load-warmup hang where AutoAWQ int4 models could stall after
loading weights on consumer Ada/Ampere GPUs (the kernel auto-tuner is now skipped on those cards, where it
isn't needed). Local coding models also call tools more consistently — tool-calling turns are
decoded deterministically, so an agent reliably acts instead of occasionally replying in prose.
Jun 22v0.13.14
Int4 code models load again, plus more reliable local coding agents
A recent engine update had dropped support for AutoAWQ int4 checkpoints, so
int4 code models — including Qwen2.5-Coder-7B and the rest of the int4 catalog —
failed to load with an "unsupported quantization" error. The engine now ships a unified build with
AutoAWQ restored (FP8 support preserved), so those models load and run again. This release also makes
tool calls from small local models far more reliable — the server normalizes slightly-off
tool-call formats into standard calls and honors required tool calls, so agentic coding flows work —
and the model-loading bar now tracks real load stages with a pinned GPU-usage readout.
Jun 22v0.13.12
Model loading and serving is now reliable instead of fingers-crossed
Loads now retry instead of failing on a transient hiccup (a VRAM blip, a slow
import, a flaky download), and GPU memory is no longer leaked between loads — a worker that
times out or crashes is reaped, so the next load doesn't fail for lack of VRAM. The node also reports
its real state: after an unload or unexpected drop it shows idle instead of a stale
ready, a crashed worker self-heals, a broken tokenizer surfaces as a real error, and a corrupt
model registry is backed up and fails loud instead of silently wiping your downloads. Experimental, off
by default: a new native serving path with rock-solid model swapping, opt-in via
BMRT_NATIVE_RESIDENCY=1.
Jun 22v0.13.11
Installer branding refresh
A new premium logo across the installer — the glowing mark with the
"Bare Metal AI" wordmark and "Your PC is the data center" tagline — and a brighter app icon that's
actually visible at taskbar size. A few lines on the System Check page that ran off the right edge now
wrap correctly.
Jun 22v0.13.10
No "No GPU connected" flash right after linking a machine
The daemon takes a couple of seconds to register and detect the GPU after you link it;
the app used to paint "No GPU connected" until its next 10-second poll, so it looked broken until you
refreshed. It now polls quickly until the GPU comes up, so the card fills in on its own within a
second or two.
Jun 22v0.13.9
Cleaner machine linking
An unlinked machine no longer opens (or repeatedly re-opens) stray cloud browser
tabs — the on-box app shows a single "Connect this GPU" screen, and that's the one place you link
from. The claim token refreshes on demand so signing in works no matter how long you take (a
5-minute expiry used to break it), and a linked machine can't be claimed twice — a deliberate
re-link replaces its key in place instead of stacking duplicates.
Jun 22v0.13.8
"Connect this GPU" screen redesigned to match sign-in
Linking a new machine now uses the same premium two-pane layout as the login
screen — the branded panel (pulsing mark, trust points) beside a clean call to action —
instead of the previous plain prompt.
Jun 22v0.13.7
New machines can register again
Fixed a regression that could stop the sign-in step from linking a
brand-new GPU to your account. Connecting a fresh machine works as expected again.
Jun 22v0.13.6
Your account, shown on your machine
The local app now shows the real account this machine is linked to — it
previously showed a placeholder. There's still no separate login: the machine's own
credential is the identity.
Jun 21v0.13.5
Updates that can't strand your machine
If an update is interrupted partway, the app now relaunches itself
automatically instead of getting stuck — a failed update self-heals.
Jun 21v0.13.4
Local app looks exactly like the web app
Fixed missing fonts and icons so the on-machine app renders
pixel-for-pixel with the web version.
Jun 21v0.13.3
Run the whole app locally — sign-in optional
The desktop app now serves the full chat experience straight from your own
machine, with no account required — your conversations and models never leave your PC.
This release also adds an enterprise managed mode so IT can enforce model and sharing
policies on each device.
Jun 21v0.13.2
Network access locked down by default
New installs now require authentication on the control API by default. Local
and on-device traffic stays friction-free, and multi-GPU fleets keep working out of the box through
trusted-peer handling — everything else needs an API key.
Jun 20v0.13.1
FP8 models for RTX 40- and 50-series GPUs
Added 26 new model cards in an FP8 precision lane — roughly half the memory
of full precision at near-full quality, with faster math on Ada and Blackwell tensor cores. Covers
Qwen, Llama, DeepSeek distills, Phi-4, Gemma 3, and Mistral Small, with a new FP8 filter to
surface them on a compatible GPU.
Jun 20v0.13.0
Redesigned installer and simpler setup
A refreshed install wizard matching the rest of the product, plus a streamlined
System Check that only asks for an up-to-date NVIDIA driver — no separate CUDA or
TensorRT downloads, the app ships everything it needs. Installs no longer flash a console window.
Jun 20v0.12.31
Trigger BI dashboard refreshes from your connectors
Business-intelligence connectors can now kick off refresh and run actions,
not just read data.
Jun 20v0.12.30
New AI integrations category
Added a dedicated AI category to the integrations catalog, a read/write badge on
each connector so you can see at a glance what it can do, and a spotlight for featured integrations.
The catalog now spans 318 connectors.
Jun 20v0.12.29
More integrations — 312 connectors
A fifth wave added 17 more integrations (now 312 total), plus clearer
read vs read & write badges so you can see at a glance what each connector
is allowed to do.
Jun 20v0.12.28
FP8 inference for Ada & Blackwell GPUs
The engine can now run dense FP8 checkpoints — roughly half the
memory of FP16 with comparable output quality — on RTX 40-series (Ada) and 50-series
(Blackwell) hardware. The first FP8 models follow as they're validated on that hardware.
Jun 20v0.12.27
More integrations (wave 4)
Added 12 more connectors and enabled write access on 5 more, bringing the
catalog to 295.
Jun 20v0.12.26
Two-way connectors & more CRMs
Connectors can now read and write, not just read — plus new
Close, Keap, and Copper CRM integrations.
Jun 20v0.12.25
18 new integrations
A big batch of connectors spanning databases, ads, social, CRM, finance,
and DevOps tools.
Jun 20v0.12.24
Smoother installs & quieter updates
Fixed an installer dependency step that could fail, and stopped a stray
console window from briefly appearing during a background update.
Jun 20v0.12.23
15 business-data connectors
Added 15 BI and data integrations, growing the connector catalog to 262.
Jun 19v0.12.22
More reliable installs on the latest GPU stack
Hardened the installer so a fresh install or update always sets up a GPU
compute stack that matches the engine. Previously, if the step that installs PyTorch was
interrupted, a machine could be left in a mismatched state where the engine couldn't load
any model. The installer now puts the correct build in place up front, removing that window.
Jun 19v0.12.21
Quieter background updates
When the app updates itself, the installer now runs fully in the background
— no stray console window sitting open on screen for the length of the install. The update
applies silently and the app relaunches itself when it's done, exactly as before.
Jun 19v0.12.20
Choose which network interface the control API listens on
The local control API can now bind to a specific host or interface instead of
only localhost — set it to a private LAN address to serve a fleet, or keep it pinned to
127.0.0.1 to stay strictly on-box. A built-in safeguard refuses to expose the API on a
public interface without an explicit API-key lock, so widening access is always a deliberate, secured
choice. Aimed at enterprise and multi-node deployments; single-machine installs are unchanged.
Jun 19v0.12.19
Broken nodes are kept out of the fleet automatically
If a node's inference engine is in a bad or mismatched state — for example
after a partial update left its GPU libraries out of sync — the fleet now detects it and fences
it off so chats are never routed to a machine that can't actually serve them. The unhealthy node
reports its status, refuses model loads with a clear error instead of a silent hang, and the
orchestrator skips it when picking a node. It rejoins on its own once its engine is healthy again.
Jun 19v0.12.18
Popular bfloat16 community 4-bit models now load
A wide set of AutoAWQ int4 checkpoints saved in bfloat16 — including
the well-traveled DeepSeek-R1-Distill AWQ models — previously failed to load and could
leave GPU memory stuck. They now load and run cleanly. The DeepSeek-R1-Distill catalog entries
(7B, Llama-8B, 14B, 32B) point at the canonical community AWQ repos; 7B and Llama-8B are validated
coherent at roughly 75 tokens/sec on an RTX 4070 SUPER.
Jun 19v0.12.17
GPU memory no longer gets stuck after an unexpected shutdown
If the app ever closed unexpectedly — a crash, a forced quit, or a power
event — the model could keep holding onto GPU memory in the background, and the next model
would fail to load with an out-of-memory error until you rebooted. The app now reliably releases
all GPU memory the moment it exits, for any reason, so the next model always loads cleanly.
Recommended update for every node.
Jun 19v0.12.16
Check for updates on demand — or turn auto-update off
The app now has a Software Update button in Settings to check for and install
the latest release whenever you want, plus an Auto-update on/off switch. Turning auto-update
off stops all background update checks — a one-click, UI-native way to keep a node fully
offline (handy for air-gapped sites), while you can still update manually at any time.
Jun 19v0.12.15
Installer now shows the current branding
The Windows installer and app icon now display the current Bare Metal AI
logo. A build-pipeline issue had been regenerating the installer artwork from an older design
on every build, so refreshed branding never reached the shipped installer; the setup screens and
desktop icon now match the current brand. No functional change.
Jun 19v0.12.14
Thousands more open-source models, no conversion step
BareMetalRT now loads AutoAWQ int4 checkpoints directly — the most
common community 4-bit format on Hugging Face — so a huge range of pre-quantized models run
on your GPU without any re-quantization or extra conversion. Validated on coding and
general-purpose models (Qwen2.5-Coder-7B, Hermes-3-8B); existing models are unaffected.
Jun 19v0.12.13
In-app updates now install reliably
The one-click in-app updater could download and verify a new release but fail
to finish installing it on some setups — the app would close but not come back, needing a
manual reinstall. Fixed: the signed installer now runs fully detached from the running app,
so it can replace files and relaunch the app on its own every time. Signature verification and
the consent prompt are unchanged.
Jun 18v0.12.12
Reliability: no more stalled chats on tight GPUs
On smaller cards (e.g. 8 GB), a long prompt could exhaust the GPU's
KV cache mid-reply and leave the chat spinning on “thinking” with no feedback.
Three fixes land here: the usable context window is now sized to your GPU so a request
can't oversubscribe memory; a stalled generation now surfaces a clear message in seconds
instead of a multi-minute hang; and voice no longer co-loads when it would starve the chat
model of the memory it needs to run. The GPU card also now shows the
effective context window for the loaded model, so the usable size is honest per card.
Jun 18v0.12.11
Air-gap mode
A new BMRT_AIRGAP setting puts the daemon into a strict
no-egress posture for classified, regulated, or disconnected sites. With it on, the
daemon makes zero unsolicited outbound internet connections: the in-app update check
(the only autonomous phone-home) is fully disabled — no background poll, no
on-demand check, no installer download — so nothing reaches out without an operator
asking. Updates are applied the air-gapped way: drop in the signed installer and run it.
Licensing was already offline (Ed25519, no phone-home), and fleet traffic stays on your
LAN. Off by default — normal installs keep the in-app update banner.
Jun 18Web
New models & engine update
The inference engine was updated to TensorRT-LLM 1.3.0rc18, broadening
the range of model architectures we can run. New vision-language models are now in the
catalog as Preview — Qwen3-VL (2B / 4B / 8B), Phi-4 Multimodal, and
Pixtral — attach an image and ask about it. Wider coverage, including
mixture-of-experts models and Blackwell (sm120) GPUs, is in progress and
moves from Preview to Available as we validate it on hardware.
Jun 18Web
Enterprise SSO / Identity
Sign in with your organization's identity provider —
OIDC
(Okta, Microsoft Entra ID, Auth0, PingOne, Keycloak, ADFS) and
SAML 2.0, with
SCIM 2.0 auto-provisioning/deprovisioning and
role-based access
(admin / user / viewer) mapped from your IdP groups. Authorization Code + PKCE,
JWKS-validated tokens, signed SAML assertions; the orchestrator is the relying party,
so it works on-prem and air-gapped against your own IdP.
Off by default —
the local and demo experience is unchanged.
Architecture & security →
Jun 18v0.12.10
Team plans, seat sharing & guest controls
New per-seat
Team plans (with a 14-day free trial) and
seat sharing
— each seat can invite up to four teammates to chat on its GPU. This build adds the
daemon-side enforcement: shared guests are
chat-only (no connectors, agents, or API
keys), enforced locally on your node. Also ships offline, signed license verification
(Ed25519 — no phone-home) for licensed and air-gapped deployments, and a
Share Access shortcut in the tray. See
plans & pricing →
Jun 18v0.12.9
Private knowledge bases (RAG)
Chat over your own documents — entirely on the GPU you own. Create a
knowledge base in the composer, add your files, and the model answers grounded in
your text with bracketed
[n] citations. The chunker, the embeddings
(
all-MiniLM-L6-v2), the vector index, and the source text all stay on your
host — no cloud index, no third-party service in the retrieval path. New
/api/rag/* endpoints, and any chat can be grounded by passing a
collection.
Retrieval docs →
Jun 18v0.12.8
Installer polish
A cosmetic installer refresh: the setup wizard now carries the current Bare Metal AI logo and wordmark, the license page shows an up-to-date revision date, and the dependency-setup console runs minimized so it no longer pops up and steals focus at the end of installation (progress still streams to the taskbar window and install-deps.log). No runtime changes — identical engine and daemon to v0.12.7.
Jun 17v0.12.7
Reliability & security hardening
A focused fix-and-harden release. Workflows, Agents and Routines are now fully wired in the hosted app — creating, scheduling and running them works end-to-end (previously these tabs could silently fail to save). Security hardening across the board: an opt-in API-key lock for the local control API so only the machine owner can change it over a network, a launcher allowlist for connector servers that blocks arbitrary process execution, salted password hashing (scrypt) with transparent upgrade on next sign-in, stricter session-revocation handling, and full request isolation on the OpenAI-compatible /v1 endpoint so concurrent calls never cross streams. Plus correctness fixes: mid-run cancellation is honored immediately, agent run and workflow files are written atomically, and connector SQL read-only guards are tighter.
Jun 17v0.12.6
230+ integrations — every system of record, every industry, on your hardware
The connector catalog jumps to 230+ integrations across 14 categories, with deep first-party coverage for the systems enterprises actually run: Workday, SAP, NetSuite, Dynamics 365, ServiceNow, Splunk, CrowdStrike, Okta, Microsoft Entra ID, Power BI, Snowflake and many more — plus vertical packs for finance (SEC EDGAR, Plaid, OpenFIGI), healthcare (Epic, Oracle Health/Cerner, and any FHIR R4 server), government & defense (Esri ArcGIS, SAM.gov, USAspending), and legal & insurance (Clio, iManage, Guidewire). GitHub, Slack, Jira, HubSpot and Notion are now first-party bundled — no Node.js required. Every connector runs as a stdlib server on your GPU host: credentials and data never leave the machine, reads are silent, and anything side-effecting pauses for your approval. A new industry filter narrows the catalog to your vertical in one click.
Jun 17v0.12.5
AWS, Azure & Google Cloud can now act — opt-in write
The three hyperscaler connectors join the read/write family. Connect AWS, Azure or Google Cloud read-only by default, or opt into read & write to let an agent provision and control resources: create an S3/GCS bucket and upload objects, start/stop EC2 and Compute instances, create an Azure resource group, start/stop VMs and tag resources. Writes are curated and conservative — no destructive deletes — and every one is side-effecting, so it pauses for your approval before it runs (auto-approved only inside an autonomous agent run). Real request signing (SigV4 for AWS, OAuth bearer for Azure/GCP) on your GPU host; keys never leave the machine. 26 connectors can now act.
Jun 17v0.12.4
Workflows, rebuilt — deterministic funnels you can actually trust
Workflows are no longer a prompt in disguise. Build a process as a funnel of steps that runs the same way every time: Input → deterministic tool Actions → AI reasoning nodes → Output. The engine walks the funnel in order and pipes each step's result into the next with {{n1}} placeholders — control flow is enforced on your GPU host, not improvised by the model, so every run is repeatable and auditable. Define the process properly with named fields (goal, owner, inputs, outputs), pick the model it runs on (it loads automatically before the run), and read-only mode refuses any side-effecting action. New funnel editor with add/reorder nodes, a per-tool parameter picker, and a built-in docs panel.
Jun 17v0.12.3
Connectors can act — read/write across the catalog, plus 5 new connectors
Connectors can now do things, not just read them. Five new ones ship — SAP (OData), Tableau, DocuSign, WhatsApp and Microsoft Teams — bringing the catalog to 102. And 23 connectors now let you choose access when you connect them: read-only (the safe default) or read & write — create a ServiceNow incident, reply to a Zendesk ticket, post to Teams, trigger a dbt or Airflow job, send a document for signature, and more. Every write is side-effecting, so it pauses for your approval before it runs (auto-approved only inside an autonomous agent run). Everything still runs on your GPU host with your own tokens.
Jun 17v0.12.1
Workflows — chain your integrations into agentic procedures
A new Workflows tab: name a procedure, write its steps in plain language, and pick which integrations it may use. Running it launches an agent scoped to only those integrations — so it stays focused even with 100+ connected — and every run is recorded so the workflow improves over time. Schedule any workflow to run daily on the existing Routines engine. It's the glue between your tools and your tasks: describe the job, the agent does it, entirely on your GPU host.
Jun 16v0.12.0
100+ one-click integrations — connect your whole stack, privately
The integrations catalog jumps to over 100 — 97 live and growing. New bundled connectors cover the cloud (AWS, Azure, Google Cloud), the data stack (Snowflake, Databricks, dbt, Fivetran, Airbyte, Airflow, Elasticsearch, Metabase), engineering (Datadog, Grafana, PagerDuty, Jenkins, Bitbucket, Azure DevOps, Snyk, SonarQube), and the enterprise (ServiceNow, Confluence, Box, Zendesk, Microsoft 365 incl. SharePoint, Salesforce) — plus CRM, support and productivity tools. Each runs on your GPU host as a read-only connector: you add your own token, and your credentials and data never leave your machine for our cloud.
Jun 16v0.11.2
Bring your own model — plus 42 built-in skills
Paste any Hugging Face model link and BareMetalRT now reads the model's card, checks it runs on your hardware, and adds it to your catalog — no waiting for us to add it. Supported architectures load straight onto the local PyTorch path and chat, all on your own GPU. This release also bundles 42 ready-made skills — reusable instruction recipes like code reviewer, commit-message writer, SQL helper, changelog drafter and research assistant — that you can hand to the agent in one click, alongside the growing one-click integrations catalog.
Jun 16v0.11.1
~4× faster responses on your GPU
The local PyTorch inference path now captures each decode step into a CUDA graph, so the GPU no longer sits idle between token launches. On an RTX 4070 SUPER, Qwen3-4B (4-bit) generation jumps from ~19 to ~79 tokens/sec — about 4× faster — closing most of the gap to the older compiled-engine backend. No setup or config change; responses just come back quicker.
Jun 15v0.11.0
Agent Fleet — multi-tool agents, headless runs & Routines
Agents can now use several tools in a single turn, with read-only tools running in parallel for faster answers. A new Agents tab lets you launch headless agent runs — give a goal and a permission mode (read-only, or autonomous to use side-effecting tools) and watch them work in a read-only fleet view; runs execute on your own GPU, one at a time on the engine, and are stored locally. Routines schedule recurring agent runs — every few hours or daily at a set time — that fire on your machine, always-on with no cloud bill. Everything stays on your hardware.
Jun 14v0.10.0–0.10.1
Multi-GPU from a clean install & smarter gated-model downloads
Cross-machine tensor parallelism — running a single model split across two machines' GPUs — now works straight from the installer with no manual setup, so you can run models too big for one card. Validated end-to-end across two RTX GPUs: a clean install on each box loads and chats coherently over the network. The catalog also flags gated models your Hugging Face account can't download yet and prompts you to accept the license in one click, instead of only finding out when a download fails.
Jun 13v0.9.34–42
Hardening pass: security, stability & correctness
A deep stress-test followed by a fix-and-verify loop run until no critical bug remained. Patched two cross-site-scripting holes in chat rendering; a crashed model now auto-recovers instead of erroring out; model loads are serialized to stop out-of-memory crashes; downloads now really stop on Pause/Cancel; and a Mistral 7B bug that returned blank replies is fixed. Voice recovers gracefully instead of going silently dead, non-English and emoji replies render correctly, and the installer is now a one-command, self-checking build. Validated end-to-end on an RTX 3060 Ti.
Jun 12v0.9.26–31
Faster model switching & rock-solid VRAM
Near-instant model swaps via a pre-warmed standby pool, fully headless background workers (no stray console windows), runtime free-VRAM gating that prevents out-of-memory crashes when voice is resident, and the new bmrt command-line tool — all bundled in the installer.
Jun 12v0.9.25–27
Voice quality overhaul
int8 Orpheus voice delivers crisp audio on 8 GB GPUs, and Kokoro-82M TTS keeps low-VRAM nodes responsive.
Jun 11v0.9.24
OpenAI- & Anthropic-compatible API
Point your existing tools at your own GPU. A drop-in /v1 API for both OpenAI and Anthropic clients, plus fast embeddings served by a persistent worker.
Jun 9v0.9.15
Long-context support
Fused FMHA attention and chunked prefill unlock long prompts without blowing past memory.
Jun 9v0.9.12
int4 quantized model catalog
Run larger models on less VRAM with W4A16-AWQ quantization (Qwen3-4B), tuned for both Ada and Ampere RTX cards.
Jun 8v0.9.7–10
Voice mode, default-on
On-device Whisper speech-to-text and a remote voice bridge — talk to your GPU from anywhere, with audio that never leaves your hardware.
Jun 7v0.9.0–5
New PyTorch inference backend
The PyExecutor backend opens up a much broader catalog: Qwen 3 (0.6B / 1.7B / 4B) and DeepSeek-R1-Distill reasoning, all on a single GPU.
Jun 5v0.8.2–6
Production inference engine
Paged KV-cache attention, multi-user concurrency (serve a whole household from one GPU), and per-model tuning controls.
Jun 2v0.7.18–19
Live transport health
Real-time multi-GPU transport status and peer-latency monitoring surfaced directly in the web UI.
May 28v0.7.16
Per-user Hugging Face auth
Bring your own Hugging Face token, encrypted per user, to pull gated models.
May 18v0.7.8–12
Multi-GPU performance
2× faster generation via incremental KV-cache, async multi-GPU sync for tighter latency, and an 8K default context window.
May 16–17v0.7.0–3
Mistral 7B across two GPUs
Tensor parallelism over ordinary networking — run a 14 GB model split across two consumer cards on a home network.
Apr 17–21v0.6.x
Multi-GPU engine orchestration
Server-driven tensor-parallel engine builds and VRAM / KV-cache stability work for heterogeneous consumer GPUs.
Apr 2v0.4.0
First public beta
Data-center-grade inference on the GPU you already own — no cloud, no API fees.
Released continuously. Full per-build notes live on GitHub.