Deploy In Your Own AWS / EKS Infrastructure¶
This is the self-hosted path for teams that want agent-bom inside their own
AWS account, VPC, EKS cluster, IAM boundary, and databases.
Use this path when you want one operator-controlled system for:
- scheduled scans and discovery
- endpoint fleet inventory
- selected live MCP proxy enforcement
- central gateway policy management and shared remote MCP traffic
- API, UI, findings, graph, and remediation in your own infra
The recommended rollout is:
- stand up the control plane
- add scheduled scans and fleet sync
- use that for MCP inventory and granted surface area
- add proxy or gateway only where live runtime enforcement is actually needed
Recommended defaults for this path:
- control plane: API + UI + Postgres
- inventory first: scans + fleet
- shared remote MCPs: gateway
- workload-local inline enforcement: selected sidecar proxy
- node-wide runtime coverage: optional monitor only when your platform team explicitly wants that tradeoff
- advanced storage: add ClickHouse or Snowflake only when the default Postgres-first path is no longer enough
Operator defaults at a glance¶
| Concern | Recommended default | Change when |
|---|---|---|
| control-plane backend | Postgres | retained analytics or warehouse-native requirements justify ClickHouse or selected Snowflake parity surfaces |
| ingress + auth | same-origin ingress with OIDC or SAML in front of the control plane | you already run a different enterprise ingress/auth split and intentionally want to diverge |
| runtime rollout | scans + fleet first, then selected proxy or gateway | live runtime enforcement is worth the extra operational surface |
| remote MCP traffic | gateway | the traffic is actually local stdio or workload-local sidecar traffic instead |
| workload-local enforcement | selected proxy sidecars or endpoint-local wrappers | you do not need inline runtime enforcement on that workload yet |
| node-wide coverage | keep the monitor off | your platform team explicitly accepts a DaemonSet tradeoff for node-wide runtime visibility |
| graph operations | investigate by snapshot, page, search, and blast radius | you have benchmarked your expected tenant size and know a wider graph window is safe |
| production sizing | start from the packaged production values and benchmark before broad rollout | your endpoint/runtime volume exceeds the published pilot and initial-enterprise guidance |
If you want the narrower pilot shape first, start with Focused EKS MCP Pilot. If you want the broader rollout that also covers developer endpoints, pair this page with Endpoint Fleet.
If you need the fastest packaged control-plane demo before standing up Postgres, there is now a single-node SQLite preset at eks-control-plane-sqlite-pilot-values.yaml. Use it only for pilots and demos; multi-replica EKS still belongs on Postgres.
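A minimal pilot install sketch, assuming the preset values file sits where the chart examples document it; adjust the path, namespace, and release name to your own layout:

```bash
# Pilot/demo only: single-node control plane backed by SQLite, no Postgres required.
helm install agent-bom deploy/helm/agent-bom \
  -n agent-bom-pilot --create-namespace \
  -f eks-control-plane-sqlite-pilot-values.yaml
```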
For sizing bands, graph investigation boundaries, and the shipped load-test harness, see Performance, Sizing, and Benchmarks.
What This EKS Shape Is Optimized For¶
This deployment model is built for teams that want:
- all findings, graph state, audit data, and remediation views inside their own VPC
- read-only cloud and cluster discovery where enforcement is not required
- selected inline enforcement only for the MCP workloads that actually need it
- low-latency runtime inspection without routing every request through a shared monolith
- enterprise auth, least privilege, and tenant boundaries that map cleanly to platform controls
- predictable cost with stateless control-plane pods and scale-out scan jobs
Best-In-Class EKS Shape¶
The best current EKS rollout is not "put everything behind one service." It is a split between:
- a control plane for auth, graph, findings, fleet, audit, and remediation
- inventory paths for scans and fleet ingest
- runtime paths for proxy and gateway, deployed selectively where they are needed
Both proxy and gateway are core agent-bom product surfaces. The question
is not whether they exist in the product; it is where you deploy them for your
actual MCP traffic and operating model.
Deployment topology¶
| Surface | Runs where | Talks to | Why it exists |
|---|---|---|---|
| Browser UI | operator browser | ingress -> API/UI | review findings, graph, remediation, fleet, audit, and policy |
| API + UI + workers | EKS or self-hosted compute | Postgres, scan jobs, fleet sync, proxy/gateway audit | control-plane state, orchestration, graph, audit, remediation |
| Scan jobs + CI | EKS CronJobs, CI runners, one-off jobs | API + stores | discovery, CVEs, IaC, cloud, MCP config, skills |
| Fleet sync | employee endpoints or collectors | API `/v1/fleet/sync` | endpoint inventory without requiring runtime rollout first |
| Proxy | selected endpoints or sidecars | local MCPs + API audit/policy | inline workload-local MCP inspection |
| Gateway | shared cluster service | remote MCPs + API audit/policy | shared remote MCP traffic plane |
| Postgres | RDS or self-managed | API/UI/workers | transactional control-plane truth |
| ClickHouse / S3 / OTEL | optional adjacent services | control plane | analytics, archive, exports |
Deployment truth: the browser drives workflows, the API owns control-plane state, workers do scans, and proxy plus gateway are peer runtime surfaces, not a required serial chain. For the role split, see the Self-Hosted Product Architecture.
Runtime MCP flow¶
| Runtime path | Starts from | Ends at | Best fit |
|---|---|---|---|
| Proxy path | editor, endpoint, or sidecar workload | local or workload-local MCP | stdio MCPs, sidecars, workload-local enforcement |
| Gateway path | shared remote MCP client | remote MCP over HTTP/SSE | central policy and shared remote MCP traffic |
| Optional monitor path | node-wide daemon on selected clusters | per-node runtime coverage | only when an operator explicitly wants node-wide runtime monitoring |
| Inventory path | scan jobs or fleet sync | API + Postgres | inventory, provenance, findings, and graph without runtime rollout |
- Local stdio or workload-local MCPs use `agent-bom proxy` as the inline runtime path (sketched after this list).
- Shared remote MCPs can go directly to `agent-bom gateway serve` without a local proxy hop.
- Both runtime surfaces pull policy from the control plane and push audit to `/v1/proxy/audit`.
- Runtime detections, optional visual leak checks, and tenant-scoped limits happen on the enforcement surface that handled the call.
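As a rough illustration of those two entry points, the commands below are a hedged sketch only: the subcommand names come from this page, but every flag is a hypothetical placeholder rather than a confirmed CLI option.

```bash
# Workload-local path: wrap a local stdio MCP with the inline proxy.
# Flags are illustrative assumptions, not confirmed options.
agent-bom proxy --upstream "npx my-mcp-server" --control-plane "https://agent-bom.internal.example"

# Shared remote path: run the gateway as a shared cluster service in front of remote MCPs.
agent-bom gateway serve --control-plane "https://agent-bom.internal.example"
```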
Rollout profiles¶
| Profile | Deploy first | Add later |
|---|---|---|
| Inventory-first | API + UI + Postgres + scans + fleet sync | proxy, gateway, ClickHouse |
| Runtime on selected workloads | inventory-first plus proxy | gateway for shared remote MCPs |
| Shared remote MCP control | inventory-first plus gateway | local proxy where stdio or sidecar enforcement is still needed |
| Full self-hosted platform | control plane + scans + fleet + selected proxy + selected gateway | ClickHouse, Snowflake, stricter platform controls |
What each profile makes visible¶
| Profile | What operators can already see | What is added later |
|---|---|---|
| Inventory-first | endpoints, agents, MCP servers, transports, command or URL, declared tools, auth mode, credential-backed env vars, package and vuln context | live runtime calls, inline blocks, runtime policy events |
| Runtime on selected workloads | everything in inventory-first plus local runtime evidence for the chosen workloads | shared remote relay and central gateway-only surfaces |
| Shared remote MCP control | everything in inventory-first plus shared upstream inventory and gateway policy/audit | workload-local proxy evidence where stdio or sidecar inspection is required |
| Full self-hosted platform | one correlated plane across scans, fleet, gateway, proxy, graph, findings, and audit | longer-retention analytics and stricter platform controls |
Which Agent-BOM Surface Runs Where¶
| Surface | Where it runs | Why you deploy it |
|---|---|---|
| API + UI | in-cluster or on self-hosted compute behind your ingress | one operator plane for findings, graph, fleet, audit, gateway, and remediation |
| Scan | CronJob, CI runner, or one-off job | Kubernetes, container, package, MCP, cloud, and inventory scanning |
| Fleet | pushed into the control plane from endpoints or collectors | persisted workstation and collector inventory in /fleet |
| Proxy / runtime | next to the MCP workloads you want inline enforcement on | live JSON-RPC inspection, allow/warn/deny, audit push |
| Gateway | central service in-cluster | shared remote MCP traffic plane, policy distribution, audit, and rate limiting |
| MCP server | wherever you expose agent-bom itself as a tool server | assistant-facing tool access, separate from the proxy path |
The important boundary is that agent-bom proxy is the inline runtime path,
while the gateway is the central policy and shared remote MCP surface. One does
not replace the other.
For the concrete gateway startup path against discovered fleet MCPs, see Gateway Auto-Discovery From the Control Plane. For screenshot and OCR rollout on runtime paths, see Visual Leak Detection.
What Stays In Your Infrastructure¶
For this model, the sensitive operator surfaces stay inside your environment unless you explicitly wire external destinations:
- API and dashboard traffic
- fleet inventory
- proxy audit logs
- Postgres and optional ClickHouse
- Kubernetes discovery through your service account and IRSA role
- cloud discovery through your own IAM credentials
- OIDC, API-key, audit-HMAC, and ingress policy
Potential egress still depends on operator choice:
- vulnerability database refresh
- enrichment lookups
- explicit exports such as SARIF upload, OTLP, SIEM, or webhooks
Current Capabilities By Surface¶
These are the current deployable capabilities this EKS model supports:
| Surface | Current capabilities |
|---|---|
| Control plane | API + UI, remediation, graph, findings, gateway, fleet, audit review, compliance evidence, health and auth introspection |
| Scan | package, image, IaC, Kubernetes, MCP, cloud, and inventory scanning via CronJob, CI, or one-off runs |
| Fleet | endpoint and collector inventory persistence, state review, trust/lifecycle tracking |
| Proxy / runtime | MCP policy evaluation, undeclared-tool blocking, credential detection, audit push, local or sidecar deployment |
| Gateway | central policy authoring, distribution, and evaluation surface for proxies, plus optional shared remote MCP traffic |
| Storage | Postgres-backed control-plane state, optional ClickHouse analytics, optional S3-backed backups/exports |
This is the important product boundary: customers can deploy one or all of these surfaces in their own infrastructure without shipping their core operator data to a vendor-hosted control plane.
What You Actually Deploy¶
These are the maintained building blocks for this model:
- recommended full self-hosted entrypoint: scripts/deploy/install-eks-reference.sh
- control plane: deploy/helm/agent-bom
- AWS baseline module: deploy/terraform/aws/baseline
- advanced local Compose references, not the primary production path: deploy/docker-compose.platform.yml and deploy/docker-compose.runtime.yml
- sidecar examples: deploy/k8s/sidecar-example.yaml and deploy/k8s/proxy-sidecar-pilot.yaml
- Postgres bootstrap: deploy/supabase/postgres/init.sql
- ClickHouse bootstrap: deploy/supabase/clickhouse/init.sql
- production values example: eks-production-values.yaml
- focused pilot values example: eks-mcp-pilot-values.yaml
If you want one official answer to "what is the full deployment path?", use the reference installer first. Drop to the raw Helm examples only when you intentionally want to manage the AWS baseline, secrets, and values layering yourself.
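A minimal invocation sketch for that reference path; read the script header first, because any variables it expects beyond the region are defined in the script itself, not here.

```bash
# Full reference path: AWS baseline plus chart install, driven by the packaged installer.
export AWS_REGION="<your-aws-region>"
bash scripts/deploy/install-eks-reference.sh
```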
For teams that want Terraform to own the AWS baseline around the chart, use the Terraform AWS Baseline module for RDS, IRSA, backup bucket, and Secrets Manager ownership, then let Helm own the in-cluster workloads.
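When Terraform owns that baseline, the flow is the standard init/plan/apply cycle against the packaged module; the tfvars file name below is a placeholder for your own variable file.

```bash
# Stand up RDS, IRSA roles, the backup bucket, and Secrets Manager entries first,
# then let Helm own the in-cluster workloads.
terraform -chdir=deploy/terraform/aws/baseline init
terraform -chdir=deploy/terraform/aws/baseline plan -var-file=prod.tfvars
terraform -chdir=deploy/terraform/aws/baseline apply -var-file=prod.tfvars
```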
For decommissioning, use the packaged reverse path instead of ad hoc helm uninstall
plus cloud cleanup:
```bash
export AWS_REGION="<your-aws-region>"
agent-bom teardown \
  --cluster-name corp-ai \
  --region "$AWS_REGION" \
  --namespace agent-bom \
  --release agent-bom \
  --dry-run
```
That helper only removes product-owned agent-bom surfaces. It does not
delete the EKS cluster, ingress controller, VPC, or other platform-owned
infrastructure.
When the chart is removed, packaged Helm pre/post-delete hooks also clean up product-owned in-cluster leftovers such as generated ExternalSecret target secrets, CronJobs, Jobs, and PVCs before Terraform destroys the AWS baseline.
Recommended Topology¶
Use two layers.
1. Control plane¶
- enable the packaged API + UI control plane
- back it with Postgres (a wiring sketch follows this list)
- add ClickHouse only when you want event-scale analytics
- keep ingress same-origin unless you have a concrete reason to split hosts
- use OIDC or SAML for user access
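A hedged wiring sketch for this layer: create the control-plane secret out of band, then point the chart at it through `controlPlane.api.envFrom`. The `DATABASE_URL` key name and the exact `envFrom` schema are assumptions based on standard Kubernetes conventions; `AGENT_BOM_AUDIT_HMAC_KEY` is the documented audit key.

```bash
# Control-plane secrets belong in your own secret store; a plain Secret is shown for brevity.
kubectl create namespace agent-bom --dry-run=client -o yaml | kubectl apply -f -
kubectl -n agent-bom create secret generic agent-bom-control-plane \
  --from-literal=DATABASE_URL="postgresql://agent_bom:<password>@<rds-endpoint>:5432/agent_bom" \
  --from-literal=AGENT_BOM_AUDIT_HMAC_KEY="$(openssl rand -hex 32)"

# envFrom is assumed to be passed through verbatim to the API pod spec.
helm upgrade --install agent-bom deploy/helm/agent-bom \
  -n agent-bom \
  --set controlPlane.enabled=true \
  --set controlPlane.ingress.enabled=true \
  --set-json 'controlPlane.api.envFrom=[{"secretRef":{"name":"agent-bom-control-plane"}}]'
```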
2. Discovery and enforcement¶
- run scheduled scan jobs for Kubernetes, MCP, package, and cloud discovery
- use fleet sync for laptops and workstations
- run `agent-bom proxy` only beside the MCP workloads that need inline runtime enforcement
- let proxies pull gateway policy from the control plane and push audit back
That keeps scan, fleet, runtime enforcement, and gateway policy aligned without pretending every workload needs the same enforcement model.
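For the fleet-sync leg, endpoints or collectors push inventory to the documented `/v1/fleet/sync` endpoint. The request below is illustrative only: the bearer-token header and the JSON body are hypothetical shapes, not the real client schema.

```bash
# Illustrative only; the real payload comes from the fleet client, not this sketch.
curl -sS -X POST "https://agent-bom.internal.example/v1/fleet/sync" \
  -H "Authorization: Bearer ${AGENT_BOM_API_KEY}" \
  -H "Content-Type: application/json" \
  -d '{"endpoint": "laptop-1234", "mcp_servers": []}'
```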
Helm Knobs That Matter¶
| Value | Why it matters |
|---|---|
| `controlPlane.enabled` | packages the API + dashboard in-cluster |
| `controlPlane.ingress.enabled` | routes `/` to the UI and `/v1`, `/health`, `/docs`, `/ws` to the API |
| `controlPlane.api.envFrom` | loads the Postgres URL, auth settings, audit HMAC, and other control-plane secrets |
| `controlPlane.ui.env` | keeps same-origin routing honest with `NEXT_PUBLIC_API_URL=""` or sets an explicit API URL |
| `serviceAccount.annotations` | shared IRSA/workload-identity annotations inherited by scanner, gateway, and backup service accounts unless you override them per component |
| `scanner.serviceAccount.annotations` | attach a distinct IRSA role to the scanner CronJob when cluster discovery should use a different IAM role than the shared runtime SA |
| `gateway.serviceAccount.annotations` | attach a distinct IRSA role to the gateway when it needs separate cloud access |
| `controlPlane.backup.serviceAccount.annotations` | attach a distinct IRSA role to the Postgres backup CronJob |
| `scanner.extraArgs` | enables `--k8s-mcp`, `--introspect`, `--enforce`, and other operator choices |
| `scanner.allNamespaces` | expands cluster scan scope |
| `controlPlane.api.autoscaling.*` | autoscales the API deployment |
| `controlPlane.ui.autoscaling.*` | autoscales the UI deployment |
| `topologySpread.*` | spreads API and UI pods across zones and nodes |
| `controlPlane.externalSecrets.*` | maps secrets from your external-secrets provider |
| `controlPlane.observability.prometheusRule.*` | packages alerts for API, scanner, OIDC, and proxy backlog |
| `controlPlane.backup.*` | packages the Postgres backup job when you are ready to wire S3 and KMS |
Example:
```bash
helm install agent-bom deploy/helm/agent-bom \
  -n agent-bom --create-namespace \
  --set controlPlane.enabled=true \
  --set controlPlane.ingress.enabled=true \
  --set serviceAccount.annotations."eks\.amazonaws\.com/role-arn"=arn:aws:iam::REPLACE_ME_ACCOUNT_ID:role/REPLACE_ME_AGENT_BOM_DISCOVERY_ROLE \
  --set scanner.allNamespaces=true \
  --set-json 'scanner.extraArgs=["--k8s-mcp","--k8s-all-namespaces","--introspect","--enforce","--preset","enterprise"]'
```
That gives you:
- packaged API + UI
- cluster-wide discovery
- MCP-oriented scheduled scans
- a clean bridge to selected proxy sidecars and gateway policy
Runtime, Proxy, Gateway, Scan, and Fleet Together¶
This is the most common source of confusion in self-hosted rollouts:
- Scan finds and analyzes what is deployed.
- Fleet persists endpoint and collector inventory into the control plane.
- Proxy / runtime inspects and enforces live MCP traffic for selected workloads.
- Gateway stores and serves the policies that proxies use.
- API + UI is where operators review all of the above together.
The rollout order should normally be:
- control plane
- scheduled scan jobs
- fleet sync
- selected proxy sidecars
- stricter gateway-backed enforcement
Recommended Production Defaults¶
- use Postgres, not SQLite, for the control plane
- use Alembic for long-lived Postgres-backed deployments
- keep the proxy and API internal to your VPC unless exposure is intentional
- attach discovery jobs to IRSA instead of static cloud keys
- keep discovery roles read-only unless a specific workflow truly requires write access
- set a persistent `AGENT_BOM_AUDIT_HMAC_KEY` and require it for proxy audit sign-off
- set `AGENT_BOM_RATE_LIMIT_KEY` and `AGENT_BOM_RATE_LIMIT_KEY_LAST_ROTATED` for multi-replica control planes (see the sketch after this list)
- split external secrets by rotation cadence
- enable the packaged PrometheusRule and Grafana dashboard only when your cluster already runs Prometheus Operator and Grafana sidecar discovery
- wire backup destinations explicitly before enabling the packaged backup CronJob
- use topology spread for multi-AZ EKS
- start with audit-only policy outcomes where rollout risk is unclear, then move to deny
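A sketch for the key items above, assuming you deliver them through the same secret surface the control plane already loads; the secret name and the date format for the rotation marker are assumptions, and most teams would instead map these via `controlPlane.externalSecrets.*` from Secrets Manager.

```bash
# Key names come from the list above; everything else here is an assumption.
kubectl -n agent-bom create secret generic agent-bom-runtime-keys \
  --from-literal=AGENT_BOM_AUDIT_HMAC_KEY="$(openssl rand -hex 32)" \
  --from-literal=AGENT_BOM_RATE_LIMIT_KEY="$(openssl rand -hex 32)" \
  --from-literal=AGENT_BOM_RATE_LIMIT_KEY_LAST_ROTATED="$(date -u +%Y-%m-%d)"
```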
Why This Is Not A Monolith¶
The control plane stores and visualizes state. The scanner discovers. The fleet surface ingests endpoint inventory. The proxy enforces live MCP traffic. The gateway distributes policy. Those are aligned surfaces, but they are not one process pretending to be every enterprise service at once.
That split is what makes the deployment:
- secure: least privilege and clearer trust boundaries
- performant: enforcement stays close to the workload
- cheap: heavy scan work can scale independently from the API/UI
- manageable: each surface can roll out on its own lifecycle
- accurate: one shared graph and policy model keeps outputs consistent across surfaces
Run database migrations explicitly:
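A hedged sketch, assuming the migrations are Alembic-based (per the production defaults above) and can be run from a pod that already carries the control-plane code and database credentials; the workload name is a placeholder.

```bash
# Placeholder workload name; run from wherever the control-plane code and DB URL are available.
kubectl -n agent-bom exec deploy/agent-bom-api -- alembic upgrade head
```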
If the database was previously bootstrapped from init.sql, stamp the baseline
once before future upgrades:
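Again a hedged Alembic sketch; stamping at `head` assumes the `init.sql` schema matches the latest migration, otherwise stamp the specific baseline revision the chart documents.

```bash
# Mark the existing schema as already migrated so future upgrades start from the baseline.
kubectl -n agent-bom exec deploy/agent-bom-api -- alembic stamp head
```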
What You Still Own¶
This is a real self-hosted packaging path, but not every enterprise primitive is abstracted into the chart.
You still own:
- Postgres, optional ClickHouse, and secret storage
- ingress controller, cert-manager, and network perimeter specifics
- HPA, failover, and operator runbooks
- platform-specific logging and SIEM wiring
- workload-by-workload decisions about where proxy sidecars belong
For the narrower rollout, see Focused EKS MCP Pilot. For the packaged control plane details, see Packaged API + UI Control Plane.