Kubernetes on AWS · open source

Your AWS data-transfer bill is a black box. Tollwing names the pod paying it, and gets cross-AZ right.

Tollwing is an open-source eBPF agent that meters every byte of pod TCP traffic in-kernel (UDP and QUIC are one flag away: -udp), attributes it to the pod that sent or received it, and prices it across 9 AWS billing paths, live, in dollars. Of the metered bytes, the few whose billing path can’t be proven are booked Unknown, never guessed; bytes it doesn’t meter are absent, never estimated. Every dollar is metered bytes × a dated rate. No app changes.

Run make demo · 60s · no cloud account · Star on GitHub

  ① THE DIFFERENTIATOR: cross-AZ cost that post-DNAT-only tools miss

     cart [shop/us-east-1a]  ──1.0 GiB──▶  checkout (ClusterIP) [shop/us-east-1b]

        billing path     attributed to     cost
        cross_az         checkout          $0.01   ◀── correct
        ───────────────────────────────────────
        total                              $0.01

  Post-DNAT-only tools (Kubecost / OpenCost) see only the rewritten
  destination IP, so they bill this to cart, or miss it.

A cent a gigabyte, charged in each direction, sounds like nothing. At Kafka RF=3 replication scale it is tens of thousands of dollars a month, billed to a pod you cannot currently see.

Real output from make demo. Pure Go: no cloud account, no cluster, no kernel. Every dollar is bytes × a dated rate, re-derived independently by an oracle (make sim).

Apache-2.0 · read every line
Designed to 0.1–0.5% of one core · in-kernel aggregation →
No app changes, no sidecars, no code edits
Every number oracle-tested · run make sim

Works with the tools you already run: Prometheus, Grafana, Kubernetes, Helm

The black box

You get the total. You never get the culprit.

Your AWS bill lumps every byte that crosses an availability zone, hits a NAT gateway, or leaves the VPC into one opaque line: EC2-Other, DataTransfer. It tells you the number is big. It never tells you which workload spent it. Your tags describe how things were provisioned, not how they actually talk at runtime. So when the data-transfer line jumps, you are guessing.

A Kafka cluster with RF=3 replication crosses two AZ boundaries on every produced byte. That can be tens of thousands of dollars a month. Datadog spreads it across workloads by traffic share; Kubecost’s heuristic buckets default pod-to-pod RFC1918 traffic to in-zone, i.e. free (issue #2464); AWS shows bytes, not dollars. As of July 2026, no tool we know of meters that flow to the pod pair, prices it by billing path, and shows you the dollars.
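To put a number on "tens of thousands", here is a back-of-the-envelope sketch in Go. The $0.01/GB-each-way rate and the two AZ crossings per produced byte come from the text above; the 200 MB/s ingest rate and the 30-day month are assumptions picked for illustration, not measurements from any real cluster.

```go
package main

import "fmt"

const (
	ratePerGBEachWay = 0.01 // USD per GB, billed on the sending and the receiving side
	crossingsPerByte = 2    // RF=3 with replicas in three zones: two cross-AZ copies
	secondsPerMonth  = 30 * 24 * 60 * 60 // assumption: 30-day month
)

// monthlyReplicationCost estimates the monthly cross-AZ replication cost in USD
// for a cluster ingesting produceMBps megabytes per second (hypothetical input).
func monthlyReplicationCost(produceMBps float64) float64 {
	gbPerMonth := produceMBps / 1000 * secondsPerMonth
	// Each produced GB crosses an AZ boundary twice, and each crossing
	// is billed in both directions.
	return gbPerMonth * crossingsPerByte * ratePerGBEachWay * 2
}

func main() {
	fmt.Printf("$%.0f/month\n", monthlyReplicationCost(200))
}
```

At the assumed 200 MB/s this comes to roughly $20,700 a month. This is an estimate for sizing the problem, which is exactly the kind of number the agent replaces with metered bytes.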

You can see the bill is high. You cannot see which pod is paying it. There is a structural reason post-DNAT tools get this wrong, and here it is.

Honest by design

It refuses to guess.

Heuristic tools fill the gaps they can’t see with estimates. That is how a charge lands on the wrong pod. Tollwing does the opposite. Every dollar is bytes × a dated rate: traceable, never estimated. It counts each cross-AZ interaction’s cost exactly once, with no double-counting. And when a single agent genuinely cannot see a flow’s zone, it marks that leg Unknown instead of inventing a number. An independent oracle then re-derives every figure in make sim.

Cost numbers are honest and traceable

Never invent or double-count a dollar. Every figure resolves to bytes × a dated rate.

Accurate attribution over convenient approximation

Refuse to guess. When a leg can’t be seen, mark it Unknown instead of fabricating it.

That is why the demo bills cross-AZ to the right service exactly once.

Why post-DNAT tools bill the wrong pod

We see the connection twice.

When a pod calls a Kubernetes Service, kube-proxy rewrites the destination mid-flight (DNAT), like a mail forwarder swapping the address on the envelope. Any tool that reads the envelope after forwarding sees the wrong recipient, and often the wrong availability zone. That is how the cross-AZ charge lands on the wrong pod, or vanishes.

Post-DNAT-only tools (Kubecost / OpenCost)

cart · us-east-1a

kube-proxy DNAT: original destination lost

checkout · us-east-1b

sees one hop → bills cart, or misses it

Tollwing

cart · us-east-1a

① connect4, on the calling node: captures the ClusterIP the pod meant to reach

② the backend-node agent: records where the connection landed, in a zone it knows

checkout · us-east-1b

cross-AZ captured exactly once, attributed to the checkout Service, correctly · $0.01/GB each way

We see the connection twice: what it meant to reach, and where it actually landed.

For the curious →

On the calling node, a cgroup/connect4 eBPF hook captures the pre-DNAT destination, the ClusterIP the pod meant to reach. A ClusterIP has no zone, so that agent leaves the zone Unknown rather than guessing it (P5). The agent on the backend node sees where the connection actually landed, in a zone it knows, and prices the cross-AZ movement exactly once, attributed to the Service (P4, no double-count). In-kernel PERCPU maps aggregate the bytes, so there is no per-packet userspace handoff. The two-phase capture is DEC-003; leaving the dialer leg Unknown rather than guessing it is DEC-010.
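A minimal Go sketch of how the two views might be combined, assuming the backend-node agent can resolve both the caller's zone and its own. The Leg and Charge types and the merge function are hypothetical and are not Tollwing's real data model. The sketch also ignores regions, so two different zones are always treated as cross_az.

```go
package main

import "fmt"

// Leg is a hypothetical record of what one agent saw for one connection.
type Leg struct {
	Service string // dialer leg only: the Service behind the dialed ClusterIP
	SrcZone string // "" means Unknown
	DstZone string // "" means Unknown (a ClusterIP has no zone)
	Bytes   uint64
}

// Charge is the single priced record produced for the connection.
type Charge struct {
	Path         string
	AttributedTo string
	Bytes        uint64
}

// merge takes the Service name from the dialer leg and the zones and bytes
// from the backend leg. Only the backend leg is priced, so the cross-AZ
// movement is counted once.
func merge(dialer, backend Leg) Charge {
	c := Charge{Path: "unknown", AttributedTo: dialer.Service, Bytes: backend.Bytes}
	switch {
	case backend.SrcZone == "" || backend.DstZone == "":
		// A zone can't be proven: leave the path Unknown rather than guess.
	case backend.SrcZone == backend.DstZone:
		c.Path = "same_zone"
	default:
		c.Path = "cross_az"
	}
	return c
}

func main() {
	// Dialer leg: knows the intended Service, but the ClusterIP has no zone.
	dialer := Leg{Service: "checkout", SrcZone: "us-east-1a", Bytes: 1 << 30}
	// Backend leg: knows where the connection actually landed.
	backend := Leg{SrcZone: "us-east-1a", DstZone: "us-east-1b", Bytes: 1 << 30}
	fmt.Printf("%+v\n", merge(dialer, backend))
}
```

The dialer leg never contributes bytes to the price, which is one simple way to keep the same gigabyte from being billed twice.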

Verify, don’t trust

See the right answer on your laptop in 60 seconds.

git clone https://github.com/tollwing/tollwing
cd tollwing
make demo

✓ Runs the real cost engine: the same classification and dated-rate math the agent feeds in production, not a mock.
✓ No AWS account, no cluster, no kernel. Clone to answer in ~60s; the engine itself runs in milliseconds.
✓ Every dollar is oracle-tested. Run make sim and the oracle re-derives each number independently. Open the tests and check the math yourself.

② the breadth: per-pod cost by billing path

billing path          example flow       data       cost
────────────────────────────────────────────────────────
same_zone             api → cache        1.0 GiB    $0.00
cross_az              cart → checkout    1.0 GiB    $0.01
cross_region          api → replica      1.0 GiB    $0.01
nat_gateway           worker → nat       2.0 GiB    $0.27
internet_egress       api → internet     5.0 GiB    $0.45
vpc_peering           api → peer-db      1.0 GiB    $0.01
transit_gateway       api → tgw-peer     1.0 GiB    $0.01
vpc_endpoint          api → s3-endpoint  1.0 GiB    $0.01
cloud_service_public  api → s3-public    1.0 GiB    $0.00

Each row is an independent scenario, priced by the same engine, billing only the direction the provider bills: cross-region and transit gateway charge the sending side only, and the NAT row stacks $0.045/GB processing on the egress it fronts. NAT vs VPC-endpoint is the boring, huge fix: $0.045/GB → $0.01/GB.
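The stacking can be checked by hand. The Go sketch below re-derives four of the rows using integer micro-dollars, under two assumptions: 1 GB is treated as 2^30 bytes (which is what makes the rows above come out exactly), and the internet-egress rate of $0.09 is back-derived from the table row (5.0 GiB for $0.45) rather than taken from a rate card. The map and function names are illustrative.

```go
package main

import "fmt"

const gib = 1 << 30

// Rate components in micro-dollars per GiB. A path with two components
// stacks them, as the NAT row does.
var microUSDPerGiB = map[string][]uint64{
	"cross_az":        {10_000},
	"internet_egress": {90_000}, // back-derived from the table row
	"nat_gateway":     {45_000, 90_000}, // processing + the egress it fronts
	"vpc_endpoint":    {10_000},
}

// priceMicroUSD returns bytes × rate summed over the path's components.
// ok is false for a path with no known rate, so the caller can't get a guess.
func priceMicroUSD(path string, bytes uint64) (total uint64, ok bool) {
	parts, ok := microUSDPerGiB[path]
	if !ok {
		return 0, false
	}
	for _, r := range parts {
		total += bytes * r / gib
	}
	return total, true
}

func main() {
	rows := []struct {
		path string
		gibs uint64
	}{
		{"cross_az", 1},
		{"nat_gateway", 2},
		{"internet_egress", 5},
		{"vpc_endpoint", 1},
	}
	for _, row := range rows {
		c, _ := priceMicroUSD(row.path, row.gibs*gib)
		fmt.Printf("%-16s %d GiB  $%.2f\n", row.path, row.gibs, float64(c)/1e6)
	}
}
```

Integer micro-dollars keep the arithmetic exact at this scale; a real engine would also need the dated-rate lookup and the direction rules described above, which this sketch leaves out.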

On a real cluster the difference is stark: where OpenCost reports $0 and Kubecost defaults the ClusterIP leg to same-zone/free with no service field (kubecost#2464, closed unfixed), Tollwing attributes cross_az to the dialed Service.

Run make demo

Free tier: the agent on one AWS cluster, live in your Grafana. That’s the whole attribution engine, not a teaser.

From demo to live dollars in three steps.

make demo

Read a per-pod, per-billing-path cost report locally. No cloud account, no cluster, no kernel.

helm install tollwing-agent

Deploy the eBPF agent as a DaemonSet. Read-only. No app changes, no sidecars, designed to a 0.1–0.5%-of-one-core overhead budget (what’s measured so far). Auto-detects provider + region via IMDS.

helm install tollwing-agent ./deploy/helm/tollwing-agent

Pin --set agent.provider / agent.region only if IMDS is blocked.

Open Grafana

Live per-pod, per-namespace dollars across all 9 AWS billing paths. The agent exposes tollwing_* Prometheus metrics on :9990/metrics, your Prometheus scrapes them, and the included 23-panel dashboard renders them. No control-plane server required for the live single-cluster view.

Read-only observability. Apache-2.0. Verify every number.

Safe by construction

Yes, it runs in your kernel. Here’s why that’s safe.

Verifier-checked before it loads

Every Tollwing eBPF program is checked by the Linux kernel verifier before it runs. Unlike a kernel module, a verified program cannot crash your kernel, loop forever, or read arbitrary memory. At worst it produces incorrect data, never an outage.

Read-only

The agent attaches cgroup/connect4 and sock_ops hooks to count bytes. It never modifies, drops, or redirects a single packet.

Honest overhead, with the mechanism named

We aggregate in-kernel with PERCPU maps, so there is no per-packet userspace handoff. The agent is designed to a 0.1–0.5%-of-one-core overhead budget, and we won’t quote a measured number until a reproducible benchmark ships. We will not tell you it is zero. We publish what has been measured so far, and what hasn’t, so you can measure it on your own nodes instead of taking our word for it.

It’s Apache-2.0, so your security team can read exactly what runs before you deploy it. Read the architecture →

See the bill by pod. Find the expensive conversation. Cut the waste.

See the bill by pod & namespace

Every metered byte classified deterministically across 9 AWS billing paths (same-zone, cross-AZ, cross-region, internet egress, NAT gateway, VPC peering, transit gateway, VPC endpoint, cloud-service public endpoint), in live dollars, not flow counts. Anything unprovable lands in an explicit Unknown bucket, never in a guess.

Find the expensive conversation

e.g. kafka-broker-2 → kafka-broker-0 cross-AZ replication, or spark-exec reaching S3 over a NAT gateway ($0.045/GB) instead of a VPC endpoint ($0.01/GB).

Drop it into your stack

tollwing_* Prometheus metrics your Grafana reads directly, the included 23-panel dashboard, and a standalone FOCUS-aligned JSON cost-export sidecar for external cost tooling. No new backend to run.

Scale it when it pays off

When one cluster proves it, Tollwing Enterprise adds the control plane: long-term history, a multi-cluster view, CLI + REST API, a one-page Cost Savings Report, and alerting.

Everyone tells you the bill is high. Tollwing tells you which pod, talking to which service, over which of 9 billing paths, in dollars.

As of July 2026, nothing else we can find ships all three at once: per-pod resolution, 9-way billing-path classification, and dollars derived from metered bytes. Here is the honest version of that table.

	Per-pod network dollars	Pre-DNAT Service intent	Billing-path granularity	Cost math	Open source
Tollwing	✓ pod + conversation	✓ connect4	✓ 9-way + explicit Unknown	bytes × dated rate	✓ Apache-2.0
Kubecost / OpenCost	per pod via conntrack	post-DNAT blind spot (#2464)	3 heuristic buckets	cloud/K8s cost model	OpenCost Apache-2.0
Datadog CCM + CNM	workload-level, no per-pod	not Service-intent capture	4 CUR transfer types	top-down bill spread	SaaS
AWS Container Network Observability	bytes, not dollars	no	per-workload cross-AZ + external	no dollars	AWS service

Kubecost, OpenCost, Datadog, and AWS Container Network Observability are excellent at their own jobs; this table is about one narrow slice. Datadog CCM + CNM allocates real bill lines down to the workload by traffic share — a top-down spread across 4 CUR transfer types that requires CNM on every host and tracks no individual pod. Tollwing meters each flow bottom-up, pre-DNAT, and prices it per path: different question, different answer. Kubecost meters per pod but into 3 heuristic buckets, and loses ClusterIP intent (kubecost#2464, closed unfixed). AWS Container Network Observability shows per-workload bytes, not dollars.

Comparison as of 2026-07-02, from each vendor’s public docs. If it is wrong or goes stale, open an issue — we would rather correct this table than defend it.

Free vs Enterprise

The full attribution engine is free, and Apache-2.0, forever.

Everything that attributes the cost is free and Apache-2.0: the eBPF agent, the 9-way per-pod classifier, the pre-DNAT service-intent capture, exact-once cross-AZ pricing, the cost engine, the service-dependency graph, the demo, and the FOCUS-aligned JSON cost-export sidecar. It runs on its own: the agent exposes Prometheus metrics your Grafana reads directly, no control-plane server required. That is the whole attribution engine, not a teaser.

“Forever” is not a slogan here; it is a versioned contract: OPEN-CORE.md. What shipped free stays free, the public tree stays Apache-2.0, accuracy and honesty fixes are always free, and the free agent contains no license code at all — there is nothing in it to unlock.

Community

Free

For seeing per-pod network cost live on one AWS cluster, in your own Grafana.

9-way per-pod classifier + the eBPF agent
Pre-DNAT service intent + exact-once cross-AZ pricing
Service-dependency graph attribution
Prometheus metrics + 23-panel Grafana + FOCUS-aligned JSON cost export
Terraform cost estimator + pure-Go proof suite (make demo, make sim)
Single cluster, AWS, your Prometheus retention

Run make demo

Tollwing Enterprise · early access

Contact


Self-hosted, license-gated. The control plane on top of the same agent: store it, scale it across clusters, alert on it, act on it.

Control-plane server: long-term history, REST API, CLI, Cost Savings Report
Multi-cluster aggregation
CUR reconciliation to your actual discounted rates
Alerts, anomaly detection, recommendations, what-if
Approval-gated auto-remediation
GCP/Azure, SSO/RBAC (early access), multi-tenant
Signed offline license, no phone-home

Talk about Enterprise

The free agent is AWS-only and single-cluster on purpose: AWS’s billing-path complexity is where the hidden cross-AZ and NAT spend actually lives, so prove the value on one cluster first, in your own Grafana, and scale only if it pays off. Enterprise adds the control plane on top of the same agent: history beyond your Prometheus retention, a view across clusters, and the tooling to alert and act. If you never upgrade, the agent keeps working. There’s no hosted dependency and no phone-home, ever, and OPEN-CORE.md commits that the boundary only ever moves toward free.

Want a second set of eyes on a network-cost bill?

No form, no nurture sequence. Email the maintainer for Enterprise, design-partner access, or a pre-install read on whether Tollwing is likely to find anything useful.

hello@tollwing.com

Tollwing ships with a governance constitution, a public ADR log (the pre-DNAT capture is DEC-003; the open-core boundary is DEC-013), and a CI governance gate, because we hold the cost logic to a high bar and you should be able to see exactly how. Constitution · Open-core boundary · Architecture · ADR log