Enterprise

A private AI workforce on the GPUs you already own.

Bare Metal AI runs agents — not just chat — entirely on hardware you control. Point any OpenAI- or Anthropic-compatible agent framework at your own GPUs and let it work over your data: on-prem, air-gapped, nothing leaving your network. It's the fastest inference engine on RTX hardware, and a transport that turns ordinary cards into a cluster so you can run models a single GPU can't hold. Same engine from one workstation to a thousand-GPU fleet.

See plans & pricing → Start free

Free pilot — on hardware you own

Agents on your data — fully air-gapped

Up to 1.7× faster than llama.cpp on RTX

Unlimited usage — flat per seat, no token meter

Why it wins

Own the agent, the data, and the bill.

Agents that never phone home

Tool-using, autonomous models run entirely on hardware you control — reading your documents, hitting your internal APIs, automating real work — with nothing leaving your network. AES-256 encrypted on device, credentials encrypted at rest, TLS in transit, no third-party subprocessors. The air-gapped architecture fits the requirements behind HIPAA and SOC 2.

Drop in any agent stack

OpenAI- and Anthropic-compatible /v1 endpoints mean LangChain, the OpenAI and Anthropic SDKs, or your own agent loop point at your GPUs with no SDK changes and no prompt migration. The cloud agents you've already built just run locally.

Bigger models, cheaper cards

Our tensor-parallel transport runs over ordinary networking — no NVLink, no InfiniBand — so several consumer RTX cards can run a model too large for any one of them, in-house and private. Single-GPU is where raw throughput peaks; multi-GPU is how you reach models that wouldn't otherwise run on your hardware at all.

Performance

More capability per card.

Every consumer local-AI tool wraps llama.cpp. Bare Metal AI runs NVIDIA TensorRT-LLM with paged KV-cache, in-flight batching, and optimized GEMM — so a single card serves more tokens per second and your agents finish their work sooner. More throughput per GPU is fewer GPUs for the same work. Independently benchmarked by Menlo Research, single-GPU vs llama.cpp:

1.7×

RTX 4090 — 171 vs 100 tok/s

1.6×

RTX 3090 — 144 vs 89 tok/s

1.7×

4090 eGPU / Thunderbolt — 105 vs 62 tok/s

1.3×

RTX 4070 Laptop — 52 vs 40 tok/s

Per-GPU throughput vs. llama.cpp — the backend behind Ollama, LM Studio, and Jan. Source: Menlo Research — Benchmarking NVIDIA TensorRT-LLM →

The math

Frontier capability, without the frontier bill.

Cloud AI charges by the token and the seat — so the more your team uses it, the more it costs, forever, and your data leaves the building to earn that bill. Owning the engine flips it: a flat price per seat, unlimited usage, on hardware you control.

Rent it — cloud AI

Per-token billing that grows with every agent run
Per-seat and per-token — you pay on both meters
Cost scales against you as adoption rises
Your prompts and data leave your network
Rate limits and shared-tenant throttling

Own it — Bare Metal AI

Flat price per seat — usage is unlimited, no token meter
Each seat shares with up to 4 teammates
Cost flat as usage rises — agents run all day at no marginal price
Data never leaves your hardware
The card is yours — no limits, no throttling

The crossover comes fast: heavy agentic workloads bill the most in the cloud and cost nothing extra in-house. We'll model your break-even against current cloud spend in the demo.

Who it's for

Built for the teams that can't send data to the cloud.

Regulated & air-gapped

Healthcare, finance, legal, and government teams that need AI behind their own firewall, with no third-party subprocessor in the data path.

Proprietary code & IP

Engineering orgs that won't paste source, designs, or trade secrets into someone else's API — run coding and research agents entirely in-house.

Agentic automation

Back-office and operations work — document processing, retrieval, internal copilots — run by local agents over systems that never touch the internet.

Cost-bound AI at scale

Teams watching per-token cloud bills climb with usage. Move to a flat per-seat cost with unlimited usage on hardware you may already own, and stop metering your own people.

The platform

Connect your stack. Govern every action. Keep the data.

The agent is the engine; the platform is everything it reaches and everything you control. It ships today with 318 one-click integrations and grows along four lines — private knowledge, governance, reach, and deep enterprise connectors — every one of them running on your own hardware, with your data and credentials never leaving the building.

Private knowledge (RAG)

In preview

Build a knowledge base from your documents, contracts, and wikis. The agent answers with citations — and the chunker, the embeddings, the vector index, and the source text all stay on your GPU host. No cloud index, no third-party service in the retrieval path.

Built-in local vector store — no cloud, no third-party index — in preview
Cited answers grounded in your own corpus — in preview
Ingest folders, SharePoint, Drive, Confluence — on the roadmap

Request early access →

Governance & control

Core live

Every action is governed today — read-only by default, and every side-effecting tool call pauses for human approval before it runs. The compliance trail stays on-prem, like everything else.

Approval-gated writes & read-only default — live
Data & credentials never leave your hardware; air-gap-capable — live · no-egress attestation →
Tamper-evident audit log of every tool call & approval — in preview
Enterprise SSO (OIDC + SAML), SCIM provisioning & role-based access (Okta, Entra ID, Auth0, PingOne, Keycloak, ADFS) — available · architecture & security →
Data Processing Agreement (DPA) — available on request ([email protected])

Reach & channels

Telegram live

Reach the agent where your team already works. @mention it in Slack or Teams, or message it from your phone — allowlist-gated pairing, no inbound ports to open.

Telegram channel — shipping today
Slack & Microsoft Teams — on the roadmap
WhatsApp & Discord — on the roadmap

Connect everything

318 integrations

318 live integrations on the open Model Context Protocol, plus deep enterprise connectors on the same data-resident pattern as our Databricks and Snowflake servers — token and data stay on the host.

AWS, Azure, Google Cloud, Datadog, ServiceNow, Confluence, Box — live
Microsoft 365 (incl. SharePoint), Salesforce, Databricks, Snowflake, dbt — live
SAP, Tableau, DocuSign, WhatsApp, Teams — live (read-only or read & write)
Read-only by default; opt into write per connector, every change gated by approval. Bring your own remote MCP server, too

A selection of the catalog — 318 integrations across 14 categories and 7 industry verticals, all data-resident. Browse the full catalog →

Industries

Built for the teams that can't send data out.

The catalog spans the systems regulated and data-sensitive industries actually run on — every one on the same data-resident pattern: credentials and data stay on your hardware, and every write is approval-gated. A selection by vertical:

Financial services

Private analysis over trading, risk, and customer data. Snowflake, Databricks, SAP, and market-data connectors — without prompts or positions leaving your network.

Healthcare & life sciences

PHI-safe agents over clinical and claims systems — FHIR, EHR, and document stores — running on-prem, so protected data never reaches a third party.

Government & defense

Air-gap-capable deployments for controlled and classified environments. No inbound ports, no external subprocessors, and a full audit trail.

Legal

Contract, matter, and discovery analysis over privileged documents that never leave the firm. Document-management and e-signature connectors included.

Insurance

Claims triage, policy analysis, and underwriting support over your core systems — on hardware you control, with every write gated for human review.

Security & IT

SOC agents over your SIEM, identity, and endpoint telemetry — Splunk, CrowdStrike, Okta, Entra — triaging and correlating without shipping logs to a vendor cloud.

Don't see your stack? Any OpenAPI or Model Context Protocol service connects, and you can bring your own remote MCP server. Tell us what you run →

Pricing

Per seat. Unlimited usage. Private.

A flat price per seat — no token meter, run it as hard as you like. Self-serve Team is $50/seat/month; Enterprise is volume-priced with SSO, governance, and support. Full tiers, the free Home plan, and self-serve checkout live on the pricing page.

See plans & pricing → Start free

Volume pricing

Per-seat list price drops with volume; Enterprise is quoted on your seat count, with SCIM provisioning so the bill tracks your active roster automatically.

Support & SLA

Annual support with response-time guarantees, a named team, and a direct line to the engineers who build the engine.

Deployment services

Forward-deployed engineers to stand up the daemon, models, and transport on your fleet and integrate it into your stack.

FAQ

The questions enterprises ask first.

How does per-seat pricing work — and how is it different from cloud?

You pay a flat price per seat, and usage is unlimited — no token meter. Cloud AI charges per seat and per token, so the bill grows the more you use it; ours doesn't move. Each paid seat can also share access with up to 4 teammates (chat), and Enterprise seats provision automatically from your identity provider via SCIM. Everything runs on your hardware — your data never leaves the building.

Does any data leave our network?

No. Inference runs entirely on your hardware and the runtime can be fully air-gapped. There are no third-party subprocessors. Chat history is AES-256 encrypted on device, credentials are encrypted at rest, and traffic is TLS in transit.

Can we run agents, or is this just chat?

Agents. Because we expose OpenAI- and Anthropic-compatible /v1 endpoints, agent frameworks like LangChain or your own tool-calling loop point at your GPUs unchanged. Tool-using, autonomous models run over your internal data and APIs — entirely on hardware you control.

Will it work with our existing AI code?

Yes. We expose OpenAI- and Anthropic-compatible /v1 endpoints, so anything already pointed at GPT or Claude can point at your own GPUs with no SDK changes or prompt migration.

What hardware do we need, and how does multi-GPU work?

Any NVIDIA RTX GPU. Single-GPU is where per-card throughput is highest. When a model is too large for one card, our network transport splits it across machines — no NVLink or InfiniBand — so you can run it at all, privately, on hardware you own. Multi-GPU buys you model size and capacity; for raw speed per request, a single capable card leads.

How do we buy for a larger team?

Self-serve: pick a plan and subscribe — seats scale up in your admin console, billed per seat. For enterprise volume, SSO/SCIM, and a support SLA, email us your seat count and we'll send pricing the same day. No demo gauntlet.

Where this goes

A private agentic workforce.

Chat was the demo. The arc is fleets of capable open agents — the Hermes-class, tool-using models — running your real work autonomously: reading your documents, operating your internal systems, and answering to no one outside your walls. We're building the engine that makes that fast and affordable on hardware you already own. Today it's the fastest local inference and a transport that runs models a single card can't. Next is the agent platform on top of it.

Get started

Start in minutes — or have us deploy it for you.

Most teams self-serve: download the free Home plan or start a Team trial, no sales call. For an enterprise rollout (SSO, SCIM, audit, volume pricing) — or a managed appliance, a box we provision and ship that you own outright — tell us what you need and a founder replies, usually within a day.

See plans & pricing → Start free

Prefer email? [email protected]