Performance, Sizing, and Benchmarks¶
This guide is the operator-facing follow-up to the packaged Helm control plane
and enterprise pilot docs. It does not promise fixed throughput numbers the
repo has not published. It gives a defensible starting point for sizing,
autoscaling, and validating a self-hosted agent-bom deployment in EKS.
What this guide covers¶
- control-plane API and UI sizing
- endpoint-fleet and proxy-audit growth boundaries
- when to keep analytics in `Postgres` and when to turn on `ClickHouse`
- what to load test before calling the deployment production-ready
Reference topology¶
```mermaid
flowchart LR
  subgraph EndpointFleet["Endpoint fleet"]
    Endpoints["Employee laptops<br/>agent-bom agents --push-url ..."]
    EndpointProxy["Local agent-bom proxy<br/>(selected employees)"]
  end
  subgraph EKS["EKS control plane"]
    Ingress["Ingress / TLS"]
    API["API Deployment<br/>2+ replicas"]
    UI["UI Deployment<br/>2+ replicas"]
    Scanner["Scanner CronJob"]
    RuntimeProxy["Selected MCP sidecars"]
  end
  subgraph Data["Persistence"]
    PG["Postgres"]
    CH["ClickHouse optional"]
  end
  Endpoints --> API
  EndpointProxy --> API
  RuntimeProxy --> API
  Scanner --> API
  Ingress --> UI
  Ingress --> API
  API --> PG
  API --> CH
```
Starting point by deployment size¶
These are starting bands, not SLA guarantees. Use them to choose the right backend and Helm values before running your own load tests.
| Shape | Typical use | Recommended path |
|---|---|---|
| Small team | up to ~250 endpoints, selected proxy sidecars, low audit retention needs | Postgres only, 2 API replicas, 2 UI replicas, no ClickHouse |
| Initial enterprise pilot | ~250-2,000 endpoints, scheduled endpoint sync, selected runtime sidecars, retained proxy audit | Postgres, packaged HPA, topology spread, internal ingress, add ClickHouse only if audit/query latency grows |
| Broader rollout | 2,000+ endpoints or sustained high-volume proxy audit/event analytics | Postgres for transactional state plus ClickHouse for analytics, explicit retention policy, benchmark before rollout |
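For the pilot band, a minimal Helm values sketch might look like the following. Apart from `controlPlane.api.env`, which the packaged example uses, the key paths here are illustrative assumptions; confirm the exact names against the chart's `values.yaml`.

```yaml
# Hypothetical values sketch for the "initial enterprise pilot" band.
# Key names under controlPlane.* and clickhouse.* are assumptions.
controlPlane:
  api:
    replicas: 2        # packaged default floor
  ui:
    replicas: 2
clickhouse:
  enabled: false       # turn on only once audit/query latency grows
```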
Practical switch points¶
These are planning thresholds, not product-enforced limits.
| Signal | Stay on Postgres only | Add ClickHouse |
|---|---|---|
| Endpoint count | up to ~2,000 endpoints | above ~2,000 endpoints with retained analytics needs |
| Scan cadence | low to moderate scheduled scans/day | sustained high-volume scheduled scans across many tenants |
| Runtime / proxy audit volume | recent operational visibility | long retention, heavy trend queries, event-style analytics |
| Dashboard/query shape | mostly transactional reads | trend-heavy or fleet-wide historical analytics |
The product intent is:

- `Postgres` remains the transactional control-plane source of truth
- `ClickHouse` is the scale-out answer for analytics-heavy workloads
Control-plane sizing guidance¶
Start from the packaged production example. That example already enables:

- API and UI `HPA`
- topology spread
- `external-secrets`
- `cert-manager`-friendly ingress annotations
- restricted ingress policy
API¶
Current chart defaults in `values.yaml`:

- replicas: `2`
- requests: `100m` CPU / `256Mi` memory
- limits: `1000m` CPU / `1Gi` memory
- `HPA` packaged but off by default
Recommended production operator posture:
- keep `2` replicas as the minimum steady-state floor
- enable the shipped API `HPA` before larger endpoint or proxy rollouts (see the sketch below)
- raise requests before raising limits when sustained CPU stays above `70%`
- keep the API same-origin behind one ingress hostname unless you have a strong reason to split UI and API
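A minimal sketch of that posture, assuming the chart exposes the packaged HPA under an `hpa` key (the key path and field names are assumptions; verify them in the chart's `values.yaml`):

```yaml
# Hypothetical sketch: raise the request floor, then enable the packaged HPA.
controlPlane:
  api:
    replicas: 2                # keep the steady-state floor
    resources:
      requests:
        cpu: 250m              # raised from the 100m default before touching limits
        memory: 256Mi
      limits:
        cpu: 1000m
        memory: 1Gi
    hpa:                       # assumed key name for the packaged API HPA
      enabled: true
      minReplicas: 2
      maxReplicas: 6
      targetCPUUtilizationPercentage: 70   # matches the 70% guidance above
```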
UI¶
Current chart defaults:
- replicas: `2`
- requests: `100m` CPU / `128Mi` memory
- limits: `500m` CPU / `512Mi` memory
Recommended posture:
- keep `2` replicas for zone and node failure tolerance
- enable the shipped UI `HPA` for heavier internal usage or dashboard-heavy SOC workflows (see the sketch below)
- scale UI based on browser concurrency, not endpoint count alone
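A matching UI sketch, with the same caveat that the `hpa` key path is an assumption rather than a confirmed chart value:

```yaml
# Hypothetical sketch: UI floor plus packaged HPA for dashboard-heavy usage.
controlPlane:
  ui:
    replicas: 2        # zone and node failure tolerance
    hpa:               # assumed key name; verify in values.yaml
      enabled: true
      minReplicas: 2
      maxReplicas: 4   # size to concurrent browser sessions, not endpoint count
```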
Scanner CronJob¶
The scanner is batch-oriented. The main knobs are:
- schedule frequency
- namespace scope
- `--introspect`
- `--enforce`
If scans begin overlapping, do not immediately add more control-plane replicas. First widen the CronJob interval, split scope, or reduce introspection scope for the rollout phase.
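For example, widening the interval and narrowing scope before scaling out might look like this sketch. The `scanner` key path and `extraArgs` field are assumptions about the chart's shape; only the `--introspect` and `--enforce` flags are confirmed above.

```yaml
# Hypothetical sketch: widen the scan interval before adding replicas.
controlPlane:
  scanner:                    # assumed key path for the packaged CronJob
    schedule: "0 */6 * * *"   # every 6 hours instead of a tighter cadence
    extraArgs:
      - "--introspect"        # narrow or drop during the rollout phase
      # - "--enforce"         # enable once scan windows stop overlapping
```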
For the full execution model, including scheduler leader election, tenant quota
enforcement, and the split between recurring schedules and the packaged scanner
CronJob, see Worker and Scheduler
Concurrency.
Postgres vs ClickHouse¶
Use Postgres as the source of truth for:
- fleet and policy state
- RBAC and tenant-scoped reads
- audit and control-plane writes
- scheduler coordination
Turn on ClickHouse when analytics starts behaving like analytics rather than
control-plane state:
- large retained proxy audit history
- heavy dashboard trend queries
- many endpoints with long retention windows
- event-style workloads where write buffering matters more than relational reads
Do not move the transactional control plane out of Postgres just because
analytics volume grows. The intended shape is:
- `Postgres` for the control plane
- `ClickHouse` for optional analytics scale-out (see the values sketch below)
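In Helm terms that split might look like the sketch below; both top-level keys are assumptions about the chart layout rather than documented values.

```yaml
# Hypothetical sketch: Postgres stays the source of truth,
# ClickHouse is layered on for analytics only.
postgresql:
  enabled: true        # transactional control-plane state
clickhouse:
  enabled: true        # retained proxy audit and trend analytics
```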
Operational thresholds to watch¶
Watch these before widening rollout:
- API `p95` latency on authenticated reads and writes (see the alert sketch below)
- Postgres CPU, memory, and connection pressure
- scanner run duration versus schedule interval
- proxy audit ingest rate and retained volume
- UI response time under simultaneous SOC usage
If any of these fail first:
- API CPU saturation: increase API requests and enable or widen the API `HPA`
- Postgres pressure: tune connection pool, size the database, and move analytics-heavy reads toward `ClickHouse`
- overlapping scans: lengthen schedule or split scope
- audit query drag: add retention boundaries and move analytics to `ClickHouse`
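One way to watch the first signal is a standard Prometheus `histogram_quantile` alert. The sketch below assumes you scrape the API with Prometheus and that it exposes a conventional HTTP duration histogram; the metric and job names are placeholders, not published metric names.

```yaml
# Hypothetical PrometheusRule sketch: page when API p95 breaches the
# production envelope. Metric and job names are placeholders.
groups:
  - name: agent-bom-api-latency
    rules:
      - alert: AgentBomApiP95High
        expr: |
          histogram_quantile(0.95,
            sum(rate(http_request_duration_seconds_bucket{job="agent-bom-api"}[5m])) by (le)
          ) > 0.5
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "agent-bom API p95 latency above 500 ms for 10 minutes"
```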
Security graph cardinality and windowing¶
The security graph is one of the main operator surfaces, but it should be read as a persisted, filterable investigation view, not as a promise that every tenant should dump full topology into one browser canvas.
Operationally important boundaries:
- graph snapshots can contain many more nodes and edges than one page should render at once
- UI pagination, focus mode, search, and blast-radius drilldown are part of the scaling model, not optional decoration
- the operator workflow should narrow by tenant, snapshot, attack path, source, or asset family before expecting one giant all-entity canvas to stay useful
Recommended posture for larger tenants:
- keep graph investigation windowed by snapshot and page
- start from blast radius, focused graph, or asset search before expanding
- treat full-topology rendering as an exception path, not the default view
- benchmark the control plane with your expected snapshot size before calling the graph production-ready at large scale
What the product claims today:
- persisted graph snapshots
- stable node identifiers in the operator detail view
- explicit snapshot metadata, scope, and timestamps
- filterable graph investigation flows
- paginated agent-node selectors through `GET /v1/graph/agents`
- bounded fleet list windows through `GET /v1/fleet?limit=...&offset=...`
What it does not yet publish as a hard contract:
- universal node/edge ceilings that apply to every tenant shape
- a CI-enforced multi-tenant graph load harness
Tenant quotas¶
For multi-tenant or shared internal deployments, set explicit per-tenant quotas on the API instead of relying only on request rate limits.
Available environment variables:
- `AGENT_BOM_API_MAX_ACTIVE_SCAN_JOBS_PER_TENANT`
- `AGENT_BOM_API_MAX_RETAINED_JOBS_PER_TENANT`
- `AGENT_BOM_API_MAX_FLEET_AGENTS_PER_TENANT`
- `AGENT_BOM_API_MAX_SCHEDULES_PER_TENANT`
These are enforced on:
- scan creation
- scheduled scan creation
- pushed scan-result ingestion
- fleet sync
- schedule creation
The packaged production example already shows how to inject these through
`controlPlane.api.env`.
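A minimal sketch of that injection, assuming the chart accepts a standard Kubernetes-style `env` list (the quota values below are placeholders, not recommendations):

```yaml
# Hypothetical values sketch: per-tenant quotas via controlPlane.api.env.
controlPlane:
  api:
    env:
      - name: AGENT_BOM_API_MAX_ACTIVE_SCAN_JOBS_PER_TENANT
        value: "5"
      - name: AGENT_BOM_API_MAX_RETAINED_JOBS_PER_TENANT
        value: "200"
      - name: AGENT_BOM_API_MAX_FLEET_AGENTS_PER_TENANT
        value: "2500"
      - name: AGENT_BOM_API_MAX_SCHEDULES_PER_TENANT
        value: "10"
```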
Recommended starting posture:
- keep active scan jobs low enough that one tenant cannot saturate the API
- keep retained jobs bounded to match your retention expectations
- keep fleet-agent quotas aligned to your actual endpoint/runtime rollout size
- keep schedule quotas low unless you deliberately expose scheduled scans to many tenants
For large tenants with thousands of agents, do not drive dashboards from a full fleet dump. Use server-side fleet search and pagination, graph agent selectors, and node-context drilldown. The UI should show a readable working set first, then let the operator expand by search, lifecycle state, environment, or graph neighborhood. Treat the exact fleet size as an operator sizing input, not a fixed product ceiling.
Benchmark before production rollout¶
The repo now ships a small benchmark harness in the source tree.
Use that harness to validate:
- authenticated control-plane read paths
- graph overview and graph search operator paths
- proxy audit ingest write paths
- scaling behavior after enabling `HPA`
- the point where analytics should move to `ClickHouse`
Current boundary:
- the harness is shipped and operator-usable now
- the harness is not yet a full multi-tenant performance gate in CI
- published benchmark numbers should come from your environment until the repo ships broader reference runs
Minimal benchmark checklist¶
Run these in your own environment before broader rollout:
- Baseline the API with the control-plane script at low and medium concurrency.
- Baseline graph overview/search with the graph script against a representative snapshot.
- Baseline proxy audit ingest with realistic batch sizes.
- Repeat all runs with `HPA` enabled.
- Repeat with `ClickHouse` disabled, then enabled if your rollout expects high retained analytics volume.
- Record your chosen replica floor, resource requests, and retention policy in your operator runbook.
Target operator latency envelope¶
These are the current target thresholds for self-hosted rollouts using the bundled k6 harness:
| Flow | Target (pilot) | Target (production) |
|---|---|---|
| Graph overview `GET /v1/graph?limit=100` | < 300 ms p95 | < 500 ms p95 |
| Graph search `GET /v1/graph/search?q=agent&limit=25` | < 250 ms p95 | < 400 ms p95 |
| Fleet read `GET /v1/fleet?limit=25` | < 200 ms p95 | < 350 ms p95 |
| Fleet stats `GET /v1/fleet/stats` | < 150 ms p95 | < 300 ms p95 |
| Proxy audit ingest `POST /v1/proxy/audit` | < 300 ms p95 | < 500 ms p95 |
These are not universal guarantees for every tenant shape. They are the thresholds operators should validate before calling a rollout production-ready.
Graph write batching¶
Graph persistence uses a single batch-size knob that applies to both the SQLite and Postgres backends.
Both backends persist the same logical graph objects through the same
`save_graph(graph)` contract: nodes, search rows, edges, attack paths, and
interaction risks. The backend implementations differ internally, but the
operator tuning model stays consistent.
The write path uses lazy row generation and bounded batches. Memory is therefore tied to the configured batch window instead of total graph size. Raising the batch size can improve write throughput, while lowering it can reduce per-batch memory and parameter pressure.
This setting controls graph writes only. Graph read latency is governed by snapshot windowing, pagination, search indexes, materialized drilldowns, and UI virtualization.
Publish your own benchmark numbers¶
agent-bom deliberately does not claim a universal scans/day SLA. Real
throughput depends on:
- scan shape
- endpoint mix
- retention window
- whether analytics stays in Postgres or moves to ClickHouse
What the project does ship is:
- a benchmark harness
- packaged HPA/PDB controls
- explicit backend guidance
That keeps the deployment story honest while still giving operators a concrete starting point.
Current boundary¶
agent-bom now ships the packaged control plane, pilot path, and operator
defaults. What it still does not claim is:
- one universal throughput number for every scan and tenant shape
- one browser graph view that should render every tenant's full topology at once
- CI-proven multi-tenant graph/load guarantees beyond the shipped focused benchmark harness
- fixed throughput guarantees
- a bundled benchmark certification suite
- a managed control plane that hides storage or scaling choices from the buyer
Those choices stay explicit, which is why the self-hosted story remains honest.