Changelog

We ship continuously.

Every release of Bare Metal RT, newest first. Voice, quantized models, an OpenAI- and Anthropic-compatible API, and tensor parallelism across consumer GPUs — all delivered in public betas since launch.

128+Releases shipped

10Weeks since first beta

v0.13.19Latest build

June 2026

Jun 22v0.13.18

A private, on-box code editor — write code with a local model

Introducing the local code editor: a new Code button opens a full editor in your browser, served from your own node. It pairs a Monaco editor (the same engine as VS Code) with a file explorer and an AI agent that reads, edits, and runs your code using your node's local model — nothing leaves your machine. It has a proper IDE layout (activity bar, tabs, breadcrumb, minimap, status bar) finished in the Bare Metal RT look. Best on focused, single-file tasks today; larger multi-file work benefits from a bigger model and GPU. The Code button appears when a coding model is loaded on the node. Shown in local mode.

Jun 22v0.13.15

More reliable int4 loads and tool calls on consumer GPUs

Fixed a separate load-warmup hang where AutoAWQ int4 models could stall after loading weights on consumer Ada/Ampere GPUs (the kernel auto-tuner is now skipped on those cards, where it isn't needed). Local coding models also call tools more consistently — tool-calling turns are decoded deterministically, so an agent reliably acts instead of occasionally replying in prose.

Jun 22v0.13.14

Int4 code models load again, plus more reliable local coding agents

A recent engine update had dropped support for AutoAWQ int4 checkpoints, so int4 code models — including Qwen2.5-Coder-7B and the rest of the int4 catalog — failed to load with an "unsupported quantization" error. The engine now ships a unified build with AutoAWQ restored (FP8 support preserved), so those models load and run again. This release also makes tool calls from small local models far more reliable — the server normalizes slightly-off tool-call formats into standard calls and honors required tool calls, so agentic coding flows work — and the model-loading bar now tracks real load stages with a pinned GPU-usage readout.

Jun 22v0.13.12

Model loading and serving is now reliable instead of fingers-crossed

Loads now retry instead of failing on a transient hiccup (a VRAM blip, a slow import, a flaky download), and GPU memory is no longer leaked between loads — a worker that times out or crashes is reaped, so the next load doesn't fail for lack of VRAM. The node also reports its real state: after an unload or unexpected drop it shows idle instead of a stale ready, a crashed worker self-heals, a broken tokenizer surfaces as a real error, and a corrupt model registry is backed up and fails loud instead of silently wiping your downloads. Experimental, off by default: a new native serving path with rock-solid model swapping, opt-in via BMRT_NATIVE_RESIDENCY=1.

Jun 22v0.13.11

Installer branding refresh

A new premium logo across the installer — the glowing mark with the "Bare Metal AI" wordmark and "Your PC is the data center" tagline — and a brighter app icon that's actually visible at taskbar size. A few lines on the System Check page that ran off the right edge now wrap correctly.

Jun 22v0.13.10

No "No GPU connected" flash right after linking a machine

The daemon takes a couple of seconds to register and detect the GPU after you link it; the app used to paint "No GPU connected" until its next 10-second poll, so it looked broken until you refreshed. It now polls quickly until the GPU comes up, so the card fills in on its own within a second or two.

Jun 22v0.13.9

Cleaner machine linking

An unlinked machine no longer opens (or repeatedly re-opens) stray cloud browser tabs — the on-box app shows a single "Connect this GPU" screen, and that's the one place you link from. The claim token refreshes on demand so signing in works no matter how long you take (a 5-minute expiry used to break it), and a linked machine can't be claimed twice — a deliberate re-link replaces its key in place instead of stacking duplicates.

Jun 22v0.13.8

"Connect this GPU" screen redesigned to match sign-in

Linking a new machine now uses the same premium two-pane layout as the login screen — the branded panel (pulsing mark, trust points) beside a clean call to action — instead of the previous plain prompt.

Jun 22v0.13.7

New machines can register again

Fixed a regression that could stop the sign-in step from linking a brand-new GPU to your account. Connecting a fresh machine works as expected again.

Jun 22v0.13.6

Your account, shown on your machine

The local app now shows the real account this machine is linked to — it previously showed a placeholder. There's still no separate login: the machine's own credential is the identity.

Jun 21v0.13.5

Updates that can't strand your machine

If an update is interrupted partway, the app now relaunches itself automatically instead of getting stuck — a failed update self-heals.

Jun 21v0.13.4

Local app looks exactly like the web app

Fixed missing fonts and icons so the on-machine app renders pixel-for-pixel with the web version.

Jun 21v0.13.3

Run the whole app locally — sign-in optional

The desktop app now serves the full chat experience straight from your own machine, with no account required — your conversations and models never leave your PC. This release also adds an enterprise managed mode so IT can enforce model and sharing policies on each device.

Jun 21v0.13.2

Network access locked down by default

New installs now require authentication on the control API by default. Local and on-device traffic stays friction-free, and multi-GPU fleets keep working out of the box through trusted-peer handling — everything else needs an API key.

Jun 20v0.13.1

FP8 models for RTX 40- and 50-series GPUs

Added 26 new model cards in an FP8 precision lane — roughly half the memory of full precision at near-full quality, with faster math on Ada and Blackwell tensor cores. Covers Qwen, Llama, DeepSeek distills, Phi-4, Gemma 3, and Mistral Small, with a new FP8 filter to surface them on a compatible GPU.

Jun 20v0.13.0

Redesigned installer and simpler setup

A refreshed install wizard matching the rest of the product, plus a streamlined System Check that only asks for an up-to-date NVIDIA driver — no separate CUDA or TensorRT downloads, the app ships everything it needs. Installs no longer flash a console window.

Jun 20v0.12.31

Trigger BI dashboard refreshes from your connectors

Business-intelligence connectors can now kick off refresh and run actions, not just read data.

Jun 20v0.12.30

New AI integrations category

Added a dedicated AI category to the integrations catalog, a read/write badge on each connector so you can see at a glance what it can do, and a spotlight for featured integrations. The catalog now spans 318 connectors.

Jun 20v0.12.29

More integrations — 312 connectors

A fifth wave added 17 more integrations (now 312 total), plus clearer read vs read & write badges so you can see at a glance what each connector is allowed to do.

Jun 20v0.12.28

FP8 inference for Ada & Blackwell GPUs

The engine can now run dense FP8 checkpoints — roughly half the memory of FP16 with comparable output quality — on RTX 40-series (Ada) and 50-series (Blackwell) hardware. The first FP8 models follow as they're validated on that hardware.

Jun 20v0.12.27

More integrations (wave 4)

Added 12 more connectors and enabled write access on 5 more, bringing the catalog to 295.

Jun 20v0.12.26

Two-way connectors & more CRMs

Connectors can now read and write, not just read — plus new Close, Keap, and Copper CRM integrations.

Jun 20v0.12.25

18 new integrations

A big batch of connectors spanning databases, ads, social, CRM, finance, and DevOps tools.

Jun 20v0.12.24

Smoother installs & quieter updates

Fixed an installer dependency step that could fail, and stopped a stray console window from briefly appearing during a background update.

Jun 20v0.12.23

15 business-data connectors

Added 15 BI and data integrations, growing the connector catalog to 262.

Jun 19v0.12.22

More reliable installs on the latest GPU stack

Hardened the installer so a fresh install or update always sets up a GPU compute stack that matches the engine. Previously, if the step that installs PyTorch was interrupted, a machine could be left in a mismatched state where the engine couldn't load any model. The installer now puts the correct build in place up front, removing that window.

Jun 19v0.12.21

Quieter background updates

When the app updates itself, the installer now runs fully in the background — no stray console window sitting open on screen for the length of the install. The update applies silently and the app relaunches itself when it's done, exactly as before.

Jun 19v0.12.20

Choose which network interface the control API listens on

The local control API can now bind to a specific host or interface instead of only localhost — set it to a private LAN address to serve a fleet, or keep it pinned to 127.0.0.1 to stay strictly on-box. A built-in safeguard refuses to expose the API on a public interface without an explicit API-key lock, so widening access is always a deliberate, secured choice. Aimed at enterprise and multi-node deployments; single-machine installs are unchanged.

Jun 19v0.12.19

Broken nodes are kept out of the fleet automatically

If a node's inference engine is in a bad or mismatched state — for example after a partial update left its GPU libraries out of sync — the fleet now detects it and fences it off so chats are never routed to a machine that can't actually serve them. The unhealthy node reports its status, refuses model loads with a clear error instead of a silent hang, and the orchestrator skips it when picking a node. It rejoins on its own once its engine is healthy again.

Jun 19v0.12.18

Popular bfloat16 community 4-bit models now load

A wide set of AutoAWQ int4 checkpoints saved in bfloat16 — including the well-traveled DeepSeek-R1-Distill AWQ models — previously failed to load and could leave GPU memory stuck. They now load and run cleanly. The DeepSeek-R1-Distill catalog entries (7B, Llama-8B, 14B, 32B) point at the canonical community AWQ repos; 7B and Llama-8B are validated coherent at roughly 75 tokens/sec on an RTX 4070 SUPER.

Jun 19v0.12.17

GPU memory no longer gets stuck after an unexpected shutdown

If the app ever closed unexpectedly — a crash, a forced quit, or a power event — the model could keep holding onto GPU memory in the background, and the next model would fail to load with an out-of-memory error until you rebooted. The app now reliably releases all GPU memory the moment it exits, for any reason, so the next model always loads cleanly. Recommended update for every node.

Jun 19v0.12.16

Check for updates on demand — or turn auto-update off

The app now has a Software Update button in Settings to check for and install the latest release whenever you want, plus an Auto-update on/off switch. Turning auto-update off stops all background update checks — a one-click, UI-native way to keep a node fully offline (handy for air-gapped sites), while you can still update manually at any time.

Jun 19v0.12.15

Installer now shows the current branding

The Windows installer and app icon now display the current Bare Metal AI logo. A build-pipeline issue had been regenerating the installer artwork from an older design on every build, so refreshed branding never reached the shipped installer; the setup screens and desktop icon now match the current brand. No functional change.

Jun 19v0.12.14

Thousands more open-source models, no conversion step

BareMetalRT now loads AutoAWQ int4 checkpoints directly — the most common community 4-bit format on Hugging Face — so a huge range of pre-quantized models run on your GPU without any re-quantization or extra conversion. Validated on coding and general-purpose models (Qwen2.5-Coder-7B, Hermes-3-8B); existing models are unaffected.

Jun 19v0.12.13

In-app updates now install reliably

The one-click in-app updater could download and verify a new release but fail to finish installing it on some setups — the app would close but not come back, needing a manual reinstall. Fixed: the signed installer now runs fully detached from the running app, so it can replace files and relaunch the app on its own every time. Signature verification and the consent prompt are unchanged.

Jun 18v0.12.12

Reliability: no more stalled chats on tight GPUs

On smaller cards (e.g. 8 GB), a long prompt could exhaust the GPU's KV cache mid-reply and leave the chat spinning on “thinking” with no feedback. Three fixes land here: the usable context window is now sized to your GPU so a request can't oversubscribe memory; a stalled generation now surfaces a clear message in seconds instead of a multi-minute hang; and voice no longer co-loads when it would starve the chat model of the memory it needs to run. The GPU card also now shows the effective context window for the loaded model, so the usable size is honest per card.

Jun 18v0.12.11

Air-gap mode

A new BMRT_AIRGAP setting puts the daemon into a strict no-egress posture for classified, regulated, or disconnected sites. With it on, the daemon makes zero unsolicited outbound internet connections: the in-app update check (the only autonomous phone-home) is fully disabled — no background poll, no on-demand check, no installer download — so nothing reaches out without an operator asking. Updates are applied the air-gapped way: drop in the signed installer and run it. Licensing was already offline (Ed25519, no phone-home), and fleet traffic stays on your LAN. Off by default — normal installs keep the in-app update banner.

Jun 18Web

New models & engine update

The inference engine was updated to TensorRT-LLM 1.3.0rc18, broadening the range of model architectures we can run. New vision-language models are now in the catalog as Preview — Qwen3-VL (2B / 4B / 8B), Phi-4 Multimodal, and Pixtral — attach an image and ask about it. Wider coverage, including mixture-of-experts models and Blackwell (sm120) GPUs, is in progress and moves from Preview to Available as we validate it on hardware.

Jun 18Web

Enterprise SSO / Identity

Sign in with your organization's identity provider — OIDC (Okta, Microsoft Entra ID, Auth0, PingOne, Keycloak, ADFS) and SAML 2.0, with SCIM 2.0 auto-provisioning/deprovisioning and role-based access (admin / user / viewer) mapped from your IdP groups. Authorization Code + PKCE, JWKS-validated tokens, signed SAML assertions; the orchestrator is the relying party, so it works on-prem and air-gapped against your own IdP. Off by default — the local and demo experience is unchanged. Architecture & security →

Jun 18v0.12.10

Team plans, seat sharing & guest controls

New per-seat Team plans (with a 14-day free trial) and seat sharing — each seat can invite up to four teammates to chat on its GPU. This build adds the daemon-side enforcement: shared guests are chat-only (no connectors, agents, or API keys), enforced locally on your node. Also ships offline, signed license verification (Ed25519 — no phone-home) for licensed and air-gapped deployments, and a Share Access shortcut in the tray. See plans & pricing →

Jun 18v0.12.9

Private knowledge bases (RAG)

Chat over your own documents — entirely on the GPU you own. Create a knowledge base in the composer, add your files, and the model answers grounded in your text with bracketed [n] citations. The chunker, the embeddings (all-MiniLM-L6-v2), the vector index, and the source text all stay on your host — no cloud index, no third-party service in the retrieval path. New /api/rag/* endpoints, and any chat can be grounded by passing a collection. Retrieval docs →

Jun 18v0.12.8

Installer polish

A cosmetic installer refresh: the setup wizard now carries the current Bare Metal AI logo and wordmark, the license page shows an up-to-date revision date, and the dependency-setup console runs minimized so it no longer pops up and steals focus at the end of installation (progress still streams to the taskbar window and install-deps.log). No runtime changes — identical engine and daemon to v0.12.7.

Jun 17v0.12.7

Reliability & security hardening

A focused fix-and-harden release. Workflows, Agents and Routines are now fully wired in the hosted app — creating, scheduling and running them works end-to-end (previously these tabs could silently fail to save). Security hardening across the board: an opt-in API-key lock for the local control API so only the machine owner can change it over a network, a launcher allowlist for connector servers that blocks arbitrary process execution, salted password hashing (scrypt) with transparent upgrade on next sign-in, stricter session-revocation handling, and full request isolation on the OpenAI-compatible /v1 endpoint so concurrent calls never cross streams. Plus correctness fixes: mid-run cancellation is honored immediately, agent run and workflow files are written atomically, and connector SQL read-only guards are tighter.

Jun 17v0.12.6

230+ integrations — every system of record, every industry, on your hardware

The connector catalog jumps to 230+ integrations across 14 categories, with deep first-party coverage for the systems enterprises actually run: Workday, SAP, NetSuite, Dynamics 365, ServiceNow, Splunk, CrowdStrike, Okta, Microsoft Entra ID, Power BI, Snowflake and many more — plus vertical packs for finance (SEC EDGAR, Plaid, OpenFIGI), healthcare (Epic, Oracle Health/Cerner, and any FHIR R4 server), government & defense (Esri ArcGIS, SAM.gov, USAspending), and legal & insurance (Clio, iManage, Guidewire). GitHub, Slack, Jira, HubSpot and Notion are now first-party bundled — no Node.js required. Every connector runs as a stdlib server on your GPU host: credentials and data never leave the machine, reads are silent, and anything side-effecting pauses for your approval. A new industry filter narrows the catalog to your vertical in one click.

Jun 17v0.12.5

AWS, Azure & Google Cloud can now act — opt-in write

The three hyperscaler connectors join the read/write family. Connect AWS, Azure or Google Cloud read-only by default, or opt into read & write to let an agent provision and control resources: create an S3/GCS bucket and upload objects, start/stop EC2 and Compute instances, create an Azure resource group, start/stop VMs and tag resources. Writes are curated and conservative — no destructive deletes — and every one is side-effecting, so it pauses for your approval before it runs (auto-approved only inside an autonomous agent run). Real request signing (SigV4 for AWS, OAuth bearer for Azure/GCP) on your GPU host; keys never leave the machine. 26 connectors can now act.

Jun 17v0.12.4

Workflows, rebuilt — deterministic funnels you can actually trust

Workflows are no longer a prompt in disguise. Build a process as a funnel of steps that runs the same way every time: Input → deterministic tool Actions → AI reasoning nodes → Output. The engine walks the funnel in order and pipes each step's result into the next with {{n1}} placeholders — control flow is enforced on your GPU host, not improvised by the model, so every run is repeatable and auditable. Define the process properly with named fields (goal, owner, inputs, outputs), pick the model it runs on (it loads automatically before the run), and read-only mode refuses any side-effecting action. New funnel editor with add/reorder nodes, a per-tool parameter picker, and a built-in docs panel.

Jun 17v0.12.3

Connectors can act — read/write across the catalog, plus 5 new connectors

Connectors can now do things, not just read them. Five new ones ship — SAP (OData), Tableau, DocuSign, WhatsApp and Microsoft Teams — bringing the catalog to 102. And 23 connectors now let you choose access when you connect them: read-only (the safe default) or read & write — create a ServiceNow incident, reply to a Zendesk ticket, post to Teams, trigger a dbt or Airflow job, send a document for signature, and more. Every write is side-effecting, so it pauses for your approval before it runs (auto-approved only inside an autonomous agent run). Everything still runs on your GPU host with your own tokens.

Jun 17v0.12.1

Workflows — chain your integrations into agentic procedures

A new Workflows tab: name a procedure, write its steps in plain language, and pick which integrations it may use. Running it launches an agent scoped to only those integrations — so it stays focused even with 100+ connected — and every run is recorded so the workflow improves over time. Schedule any workflow to run daily on the existing Routines engine. It's the glue between your tools and your tasks: describe the job, the agent does it, entirely on your GPU host.

Jun 16v0.12.0

100+ one-click integrations — connect your whole stack, privately

The integrations catalog jumps to over 100 — 97 live and growing. New bundled connectors cover the cloud (AWS, Azure, Google Cloud), the data stack (Snowflake, Databricks, dbt, Fivetran, Airbyte, Airflow, Elasticsearch, Metabase), engineering (Datadog, Grafana, PagerDuty, Jenkins, Bitbucket, Azure DevOps, Snyk, SonarQube), and the enterprise (ServiceNow, Confluence, Box, Zendesk, Microsoft 365 incl. SharePoint, Salesforce) — plus CRM, support and productivity tools. Each runs on your GPU host as a read-only connector: you add your own token, and your credentials and data never leave your machine for our cloud.

Jun 16v0.11.2

Bring your own model — plus 42 built-in skills

Paste any Hugging Face model link and BareMetalRT now reads the model's card, checks it runs on your hardware, and adds it to your catalog — no waiting for us to add it. Supported architectures load straight onto the local PyTorch path and chat, all on your own GPU. This release also bundles 42 ready-made skills — reusable instruction recipes like code reviewer, commit-message writer, SQL helper, changelog drafter and research assistant — that you can hand to the agent in one click, alongside the growing one-click integrations catalog.

Jun 16v0.11.1

~4× faster responses on your GPU

The local PyTorch inference path now captures each decode step into a CUDA graph, so the GPU no longer sits idle between token launches. On an RTX 4070 SUPER, Qwen3-4B (4-bit) generation jumps from ~19 to ~79 tokens/sec — about 4× faster — closing most of the gap to the older compiled-engine backend. No setup or config change; responses just come back quicker.

Jun 15v0.11.0

Agent Fleet — multi-tool agents, headless runs & Routines

Agents can now use several tools in a single turn, with read-only tools running in parallel for faster answers. A new Agents tab lets you launch headless agent runs — give a goal and a permission mode (read-only, or autonomous to use side-effecting tools) and watch them work in a read-only fleet view; runs execute on your own GPU, one at a time on the engine, and are stored locally. Routines schedule recurring agent runs — every few hours or daily at a set time — that fire on your machine, always-on with no cloud bill. Everything stays on your hardware.

Jun 14v0.10.0–0.10.1

Multi-GPU from a clean install & smarter gated-model downloads

Cross-machine tensor parallelism — running a single model split across two machines' GPUs — now works straight from the installer with no manual setup, so you can run models too big for one card. Validated end-to-end across two RTX GPUs: a clean install on each box loads and chats coherently over the network. The catalog also flags gated models your Hugging Face account can't download yet and prompts you to accept the license in one click, instead of only finding out when a download fails.

Jun 13v0.9.34–42

Hardening pass: security, stability & correctness

A deep stress-test followed by a fix-and-verify loop run until no critical bug remained. Patched two cross-site-scripting holes in chat rendering; a crashed model now auto-recovers instead of erroring out; model loads are serialized to stop out-of-memory crashes; downloads now really stop on Pause/Cancel; and a Mistral 7B bug that returned blank replies is fixed. Voice recovers gracefully instead of going silently dead, non-English and emoji replies render correctly, and the installer is now a one-command, self-checking build. Validated end-to-end on an RTX 3060 Ti.

Jun 12v0.9.26–31

Faster model switching & rock-solid VRAM

Near-instant model swaps via a pre-warmed standby pool, fully headless background workers (no stray console windows), runtime free-VRAM gating that prevents out-of-memory crashes when voice is resident, and the new bmrt command-line tool — all bundled in the installer.

Jun 12v0.9.25–27

Voice quality overhaul

int8 Orpheus voice delivers crisp audio on 8 GB GPUs, and Kokoro-82M TTS keeps low-VRAM nodes responsive.

Jun 11v0.9.24

OpenAI- & Anthropic-compatible API

Point your existing tools at your own GPU. A drop-in /v1 API for both OpenAI and Anthropic clients, plus fast embeddings served by a persistent worker.

Jun 9v0.9.15

Long-context support

Fused FMHA attention and chunked prefill unlock long prompts without blowing past memory.

Jun 9v0.9.12

int4 quantized model catalog

Run larger models on less VRAM with W4A16-AWQ quantization (Qwen3-4B), tuned for both Ada and Ampere RTX cards.

Jun 8v0.9.7–10

Voice mode, default-on

On-device Whisper speech-to-text and a remote voice bridge — talk to your GPU from anywhere, with audio that never leaves your hardware.

Jun 7v0.9.0–5

New PyTorch inference backend

The PyExecutor backend opens up a much broader catalog: Qwen 3 (0.6B / 1.7B / 4B) and DeepSeek-R1-Distill reasoning, all on a single GPU.

Jun 5v0.8.2–6

Production inference engine

Paged KV-cache attention, multi-user concurrency (serve a whole household from one GPU), and per-model tuning controls.

Jun 2v0.7.18–19

Live transport health

Real-time multi-GPU transport status and peer-latency monitoring surfaced directly in the web UI.

May 2026

May 28v0.7.16

Per-user Hugging Face auth

Bring your own Hugging Face token, encrypted per user, to pull gated models.

May 18v0.7.8–12

Multi-GPU performance

2× faster generation via incremental KV-cache, async multi-GPU sync for tighter latency, and an 8K default context window.

May 16–17v0.7.0–3

Mistral 7B across two GPUs

Tensor parallelism over ordinary networking — run a 14 GB model split across two consumer cards on a home network.

April 2026

Apr 17–21v0.6.x

Multi-GPU engine orchestration

Server-driven tensor-parallel engine builds and VRAM / KV-cache stability work for heterogeneous consumer GPUs.

Apr 2v0.4.0

First public beta

Data-center-grade inference on the GPU you already own — no cloud, no API fees.

Download the latest All releases on GitHub →

Released continuously. Full per-build notes live on GitHub.