Long-Horizon Attack Simulation

When Your Red Team's Carbon Footprint Outpaces Its Findings

Red teams that simulate long-horizon attacks—think nation-state actors who dwell in networks for months—have become a cornerstone of mature security programs. But behind the stealthy TTPs and the glowing threat intel reports lies a dirty secret: these teams burn carbon at a rate that would make a small data center blush. A single year-long campaign can consume as much electricity as powering a family home for two decades. That cost rarely appears on any security dashboard.

Why Sustainability Auditing Your Red Team Matters Now

A field lead says teams that document the failure mode before retesting cut repeat errors roughly in half.

Regulatory pressure on ESG reporting for security operations

The SEC doesn't care about your TTPs. As of this year, the European Sustainability Reporting Standards require organisations with more than €40 million in annual turnover to disclose scope-2 emissions: the electricity they buy and burn. That includes the server racks, the GPU clusters, the cooling loops powering your year-long adversary simulation. Most red teams I have walked through this process assume their power draw is trivial. It is not. A single long-duration campaign running 24/7 C2 infrastructure, packet capture storage, and automated phishing farms can pull 15-20 kW sustained. That puts you in the same bracket as a small factory floor. Regulators are starting to ask the question: if your security division were a company, would it pass audit?

The gap between security spend and carbon accounting

CISOs can recite their tool stack offhand. Ask them for the monthly kWh of their red-team lab—silence. That gap is expensive. We are seeing procurement contracts where sustainability scores now carry a 15% weighting. Your red team's carbon output is no longer a niche concern for the facilities manager; it is a line item that determines whether you keep the budget next quarter.

"Sustainability scores now carry a 15% weighting in procurement contracts. Your red team's power draw can lose you the budget."

— internal memo, Fortune 500 director of security operations

The catch is that most carbon accounting frameworks assume steady-state data centre loads. A long-horizon campaign does the opposite: it spikes during C2 exfiltration tests, idles during report writing, then explodes when you rebuild infrastructure for a new phase. Get the magnitude of those swings wrong and your annual ESG filing looks like fiction. I have watched teams lose internal credibility over a 40% mismatch between estimated and actual energy use.

Concrete numbers: a year-long red team's electricity use compared to average household

Let's ground this. A single high-end adversary simulation server (dual Xeon, 256 GB RAM, four GPUs for credential cracking) pulls 1.2 kW at full tilt. Run it 10 hours a day, five days a week, for 12 months. That is 3,120 kWh. An average US household burns about 10,600 kWh annually. So that one box alone eats nearly a third of a home's yearly juice. Scale up to the full engagement infrastructure (ten such servers, network gear, dedicated cooling) and you are looking at 30,000+ kWh. That is three households. For one security exercise. The odd part is that most teams run this hardware hot 24/7, not at the roughly 30% duty cycle assumed above, because restarting a campaign mid-phase costs two weeks of setup time. At 8,760 hours a year, each server draws about 10,500 kWh, so the ten servers alone come to roughly 105,000 kWh. That is about ten households. That hurts.
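The household comparison is easy to check in code. The sketch below is a minimal illustration, assuming the 1.2 kW server and the 10,600 kWh household figure quoted above; the function name `annual_kwh` and the 52-week year are illustrative choices, not part of any tool.

```python
HOURS_PER_YEAR = 24 * 365          # 8,760 hours
US_HOUSEHOLD_KWH = 10_600          # annual household figure quoted in the text
SERVER_KW = 1.2                    # full-load draw quoted in the text


def annual_kwh(kw: float, hours_per_day: float, days_per_week: float,
               weeks: int = 52) -> float:
    """Energy in kWh for a load of `kw` running on a fixed weekly schedule."""
    return kw * hours_per_day * days_per_week * weeks


scheduled = annual_kwh(SERVER_KW, 10, 5)   # 10 h/day, 5 days/week
always_on = SERVER_KW * HOURS_PER_YEAR     # the same server left on 24/7
fleet_always_on = 10 * always_on           # ten servers, excluding cooling and network gear

for label, kwh in [("one server, scheduled", scheduled),
                   ("one server, 24/7", always_on),
                   ("ten servers, 24/7", fleet_always_on)]:
    print(f"{label}: {kwh:,.0f} kWh = {kwh / US_HOUSEHOLD_KWH:.1f} households")
```

Under these assumptions the always-on fleet lands near 105,000 kWh a year before cooling and network gear. A figure around 52,000 kWh would correspond to about half that duty cycle or half as many servers.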

The trade-off is uncomfortable: you can reduce power by shutting down infrastructure between test phases, but you lose operational continuity—adversary persistence simulations break, and detection engineers notice the gap. Most teams choose the carbon cost over the data integrity loss. That may be the right call for the red team. But for the sustainability auditor, it looks like a hole in the books.

The Core Carbon Drivers in Long-Horizon Attack Simulation

GPU clusters for machine learning–based detection engine modeling

The biggest emitter nobody talks about? The GPU cluster you spin up to model the defender's detection engine. Most long-horizon campaigns assume you need a small AI that mimics how the blue team's SOC analysts triage alerts. So you train it. Then you retrain it because the client changed their SIEM rules mid-campaign. That training loop eats kilowatt-hours like a teenager eats pizza. A single A100 draws up to 400 W, so a week flat out burns roughly 67 kWh. And you usually run eight of them, which with host and cooling overhead puts a week of training in the same range as a month of electricity for an average US home. The trap is that the model's accuracy barely improves after the third iteration, but the team keeps training anyway because the contract says "ML-based detection bypass." So you cook the planet for a 2% gain in recall. I've seen engagements where the GPU carbon cost exceeded the total emissions from all the team's international flights combined. That stings.

Dedicated hardware vs. cloud elasticity: embodied vs. operational emissions

Hardware choice is a carbon fork in the road. Option A: you buy dedicated servers and run them in a colo facility. Option B: you burst everything into the cloud. Each hides a different pain. Dedicated hardware means embodied emissions: the carbon baked into manufacturing those chassis, the rare earth metals, the shipping from Shenzhen. That footprint is sunk before you run a single scan. Cloud elasticity seems greener because you only pay for what you use, but the operational emissions per compute-hour can be higher than a well-tuned colo, depending on the region's grid mix and how the provider handles peak draw. The catch is that long-horizon attacks have dead weeks. You wait. The adversary waits. And your infrastructure sits idle, humming, burning carbon while nobody touches a keyboard. Most teams size for peak demand and leave the environment running through the troughs. That idle burn can run 40% of the total energy cost over a year-long engagement.

What usually breaks first is the assumption that cloud autoscaling cleans up the mess. It doesn't. Autoscaling keeps instances alive for cooldown periods, and those cooldowns cascade — a 300-second cooldown on a hundred instances adds up fast. We fixed this once by hard-coding a "lights out" script that killed every non-essential VM at 6 PM local time. Simple. But the team hated it because long-running simulations lost state. That tension — operational convenience versus carbon discipline — never fully resolves.
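A "lights out" rule of this kind reduces to one decision: which running VMs are safe to halt once the cutoff passes. The following is a minimal sketch of that decision only, assuming a hypothetical inventory of dicts with `name`, `state`, and `essential` fields. The actual stop call would go through whatever provider SDK or hypervisor API you use, and snapshotting state first is what addresses the lost-state complaint.

```python
from datetime import datetime, time

LIGHTS_OUT = time(18, 0)  # 6 PM local time, as in the text


def vms_to_stop(vms, now):
    """Return names of running, non-essential VMs once the cutoff has passed.

    `vms` is a hypothetical inventory: a list of dicts with the keys
    "name", "state", and an optional "essential" flag.
    """
    if now.time() < LIGHTS_OUT:
        return []  # before the cutoff: leave everything alone
    return [
        vm["name"]
        for vm in vms
        if vm.get("state") == "running" and not vm.get("essential", False)
    ]


inventory = [
    {"name": "c2-primary", "state": "running", "essential": True},
    {"name": "phish-farm-1", "state": "running"},
    {"name": "cracker-1", "state": "stopped"},
]
print(vms_to_stop(inventory, datetime(2024, 1, 15, 18, 0)))
```

Marking long-running simulations as essential is one way to ease the tension described above, at the cost of exempting exactly the workloads that burn the most.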

The hidden cost of idle infrastructure during campaign lulls

Think about a twelve-month adversary simulation. The red team runs a burst of activity for two weeks, then stops to analyze. Then the client asks for a month-long pause while their internal audit completes. That gap is pure waste — unless you architect for it. Most teams don't. They leave command-and-control servers running, VPN endpoints open, log aggregators spinning. A modest C2 stack with four mid-range VMs pulls roughly 600 watts round the clock. Over a thirty-day pause, that's 432 kilowatt-hours. For nothing. The odd part is that many security teams would never leave a physical server room unlocked, yet they'll let virtual servers sip power for weeks out of sheer inertia. A single cron job that snapshots state and halts the environment could slash that waste by 80 percent. But nobody writes that cron job because it's not in the attack plan. That hurts.

"The most sustainable simulation infrastructure is the one that knows when to shut up."

— overheard at a red team ops review, after someone totaled the cloud bill and the carbon ledger side by side

How to Measure Carbon Output of a Year-Long Campaign

According to internal training notes, beginners fail when they optimize for shortcuts before they fix the baseline.

Tools and protocols: using DCIM, cloud provider APIs, and power meters

You cannot manage what you do not measure, but most red teams measure nothing. Start with your data center infrastructure management (DCIM) layer: tools like Schneider's EcoStruxure or Nlyte give per-rack power draw, but they group compute by cluster, not by campaign. Pull what you can from cloud providers: AWS's Customer Carbon Footprint Tool, Azure's Emissions Impact Dashboard, and GCP's Carbon Footprint each report estimated emissions, typically as monthly figures broken down by account, service, or region rather than as raw kWh. The catch is that these tools aggregate by account or subscription, not by adversary simulation task. I have seen teams burn three weeks reconstructing usage from billing line items. For bare metal, plug in Raritan PDU power meters on the bench: get live watts, map to specific C2 servers or phishing infrastructure. Do not trust software-only estimates from hypervisors; that data smooths spikes. Real power draw on a GPU server cracking hashes can sit at 400 watts idle and then hit 700 under load, while the hypervisor reports a flat 500.
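To see why a smoothed reading matters, it helps to turn the samples into energy. The sketch below assumes fixed-interval average-power samples in watts, as a PDU or bench meter would log them. The 30 minutes idle / 30 minutes loaded split is an illustrative assumption, and `energy_kwh` is a hypothetical helper name.

```python
def energy_kwh(samples_w, interval_s):
    """Sum fixed-interval average-power samples (watts) into kWh."""
    joules = sum(samples_w) * interval_s
    return joules / 3_600_000  # 1 kWh = 3.6 million joules


# One hour of 1-minute samples: 30 min at 400 W idle, 30 min at 700 W loaded.
metered = [400] * 30 + [700] * 30
# The same hour as a hypervisor might report it: a flat 500 W.
hypervisor = [500] * 60

real = energy_kwh(metered, 60)
reported = energy_kwh(hypervisor, 60)
print(f"metered {real:.2f} kWh, reported {reported:.2f} kWh, "
      f"under-reported by {(real - reported) / real:.0%}")
```

With this particular split the flat estimate misses about 9% of the energy; a longer cracking run would widen the gap.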

Attribution challenges: shared infrastructure in cloud environments

What breaks first is attribution. Your long-horizon simulation spins up instances for domain fronting, email relay hops, and traffic redirectors, all sharing a single cloud tenant with your dev team's CI/CD pipeline. The cloud provider's carbon API cannot tell which kWh came from your Cobalt Strike redirector versus the QA team's unit tests. The workaround? Tag everything, and tag ruthlessly. Apply resource-level tags (simulation-id, operation-phase, owner-team) at provisioning time; enforce this via Infrastructure as Code policies so nobody launches an EC2 instance without tags. Even then, shared cloud services like load balancers or NAT gateways pool traffic, so you cannot split their power draw per campaign. The wrong move here is attributing shared cost proportionally by runtime.
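The tag rule is easy to express as a pre-provisioning check. This is a minimal sketch using the three tag keys named above; the `missing_tags` helper and the plain-dict tag format are assumptions for illustration, not any particular IaC tool's policy syntax.

```python
REQUIRED_TAGS = {"simulation-id", "operation-phase", "owner-team"}


def missing_tags(resource_tags):
    """Return the required tag keys that are absent or empty on a resource."""
    present = {key for key, value in resource_tags.items() if value}
    return REQUIRED_TAGS - present


redirector = {"simulation-id": "sim-2024-01", "operation-phase": "dwell",
              "owner-team": "red"}
untagged = {"simulation-id": "sim-2024-01", "owner-team": ""}

print(sorted(missing_tags(redirector)))  # nothing missing
print(sorted(missing_tags(untagged)))    # phase absent, owner empty
```

A pipeline step that refuses to provision when `missing_tags` returns anything is one way to make the rule stick; treating empty values as missing closes the most common loophole.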

"Record only the marginal increase — fire up your stack, note baseline idle consumption, then measure delta when the simulation runs. That delta is your footprint. Everything else is noise."

— Field engineer, 14-month live-fire exercise
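The marginal approach in that quote can be sketched in a few lines: average the idle baseline, average the draw while the simulation runs, and count only the difference. The sample values and the 48-hour window below are illustrative assumptions, and `marginal_kwh` is a hypothetical helper.

```python
from statistics import mean


def marginal_kwh(baseline_w, running_w, hours):
    """Energy attributable to the simulation: mean draw above idle, in kWh."""
    delta_w = max(mean(running_w) - mean(baseline_w), 0.0)
    return delta_w * hours / 1000


baseline_samples = [410, 420, 430]   # watts, stack powered on but idle
running_samples = [760, 780, 800]    # watts, simulation active

print(f"{marginal_kwh(baseline_samples, running_samples, 48):.2f} kWh over 48 h")
```

The `max(..., 0.0)` guard keeps a noisy baseline from producing a negative footprint. Note that this method deliberately leaves the idle draw out, so it answers "what did the simulation add" rather than "what did the infrastructure cost".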

A sample measurement methodology from a real 14-month exercise

The team ran a persistent adversary simulation across three cloud regions, twelve physical servers, and a VPN concentrator. We instrumented each machine with a powerstat daemon logging 1-minute averages to InfluxDB. For cloud instances we polled AWS's GetCostAndUsage every 6 hours, filtered by simulation tags, and cross-referenced with the cloud's published marginal emissions rate per region (gCO₂eq/kWh). We threw out Saturdays and Sundays — the simulation idled but the power meters still spun. That hurts: 29% of energy went to idle capacity. The final number: 3.2 metric tons CO₂eq for 14 months. The odd part is — that only covers compute. We missed network gear (switches, routers, the ISP modem) and the team's own laptops charging while they triaged alerts at 3 AM. I would estimate real footprint is 4.1 tons. Most teams skip this: once you have the data, set a per-campaign carbon budget — say 1.5 tons per 12-month simulation — and let that drive architectural choices. Use smaller instance families, autoscale down overnight, stop the GPU box when no password cracking is queued. You lose a day of automation but save 0.3 tons. Trade-off worth taking.
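The conversion from metered energy to CO₂eq, and the budget check that follows from it, might look like the sketch below. The region names, kWh totals, and grid intensities are placeholder values invented for illustration, not published emission factors; only the 1.5-ton budget comes from the text.

```python
def tonnes_co2e(kwh_by_region, grid_g_per_kwh):
    """Convert per-region energy (kWh) to metric tons CO2eq.

    `grid_g_per_kwh` maps each region to its emissions rate in gCO2eq/kWh.
    """
    grams = sum(kwh * grid_g_per_kwh[region]
                for region, kwh in kwh_by_region.items())
    return grams / 1_000_000  # grams to metric tons


# Placeholder inputs: replace with metered kWh and the provider's published rates.
kwh_by_region = {"region-a": 2000, "region-b": 1500, "region-c": 1000}
grid_g_per_kwh = {"region-a": 400, "region-b": 300, "region-c": 50}

CARBON_BUDGET_T = 1.5  # per 12-month simulation, as suggested in the text

footprint = tonnes_co2e(kwh_by_region, grid_g_per_kwh)
print(f"{footprint:.2f} t CO2eq, within budget: {footprint <= CARBON_BUDGET_T}")
```

Keeping the rate table separate from the energy table makes it cheap to rerun the numbers when a provider updates its regional figures.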

A Walkthrough: Auditing a 12-Month Adversary Simulation Engagement

Client scenario: financial sector, 12-month APT simulation

A mid-tier European bank came to us wanting a real stress test — not a two-week smash-and-grab, but a persistent, slow-burn Advanced Persistent Threat simulation. Their compliance team had already approved penetration testing. This was different. The engagement ran twelve full months: three operators rotating on two-week sprints, a dedicated C2 infrastructure on AWS spanning three regions, and a hardware lab in Frankfurt running local domain controllers, SIEM emulators, and packet capture boxes. The threat profile mimicked a state-aligned group with financial motivation. That meant long dwell phases. Weeks of nothing but beaconing every four hours. No loud exploits. Just slow, deliberate movement across their European banking backend.

Here is where the carbon accounting gets real — a year of that burns power, hardware, and cloud credits in ways a quarterly red team engagement never does. The odd part is: most sustainability audits for security teams stop at employee travel. They miss the server racks and the idle VMs. This client cared because their own ESG reporting now pinpoints scope-3 emissions from vendors. Our red team became a line item in their decarbonization target.

Data collection: power draw, cloud cost breakdown, hardware lineage

We instrumented everything. Three Kill-A-Watt meters on the lab gear logged real-time wattage — averaging 420W for the base stack, spiking to 780W during active C2 relay operations. The cloud side was trickier: AWS Cost Explorer data mapped to instance types (t3.medium beacons, c5.4xlarge for payload staging) and we applied the Cloud Carbon Footprint methodology to convert spend into kWh. That gave us 3.2 metric tons of CO2e from compute alone. The real surprise was the cold-start problem. Every time we rebuilt a beacon infrastructure after a detection event — and that happened seven times over twelve months — we spun up six fresh instances that ran for 48 hours before stabilizing. That waste added 0.4 tons.

Hardware lineage? The Frankfurt rack held four machines: a Dell R740 for domain controllers, two custom Ryzen 9 workstations for operator consoles, and a retired Supermicro for packet capture. Average age: 3.4 years. We estimated manufacturing carbon using the PAIA model: roughly 1.8 tons embedded CO2e across the four units, amortized across five years of expected life. Annualized: 0.36 tons.

Calculating total CO2e and identifying the top three emission sources

Total for the engagement: 6.8 metric tons CO2e. Broken down: cloud compute (47%), lab hardware electricity (29%), device embedded carbon (only 5%; those servers amortize well), and the remaining 19% or so from network transit inefficiencies: redundant VPN tunnels, S3 log shipping, and operator ISP overhead nobody tracks. Three line items tell the real story. Number one: idle cloud instances running 24/7 during the dwell weeks, 2.1 tons, entirely avoidable. Number two: uncompressed logging. The bank's SOC required raw PCAP retention for every beacon interaction; that forced an extra c5n.large instance just for log aggregation. 1.3 tons. Number three: operator travel, but that came in at only 0.4 tons because they worked remote 85% of the time.
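As a quick consistency check, the percentage breakdown can be converted back into tons. The sketch uses only figures stated in this walkthrough; the dictionary keys are illustrative labels.

```python
TOTAL_T = 6.8  # metric tons CO2e for the engagement

shares = {
    "cloud compute": 0.47,
    "lab hardware electricity": 0.29,
    "device embedded carbon": 0.05,
}
# Whatever the named categories do not cover is the network transit remainder.
shares["network transit and other"] = 1.0 - sum(shares.values())

tonnes = {name: TOTAL_T * share for name, share in shares.items()}
for name, value in tonnes.items():
    print(f"{name}: {value:.2f} t")

# Embodied carbon: 1.8 t across the rack, amortized over a five-year life.
embedded_per_year = 1.8 / 5
print(f"annualized embodied carbon: {embedded_per_year:.2f} t")
```

The 47% cloud share comes out at about 3.2 tons, matching the compute figure from the data collection step, and the 5% embedded share (0.34 tons) sits close to the 0.36-ton annualized estimate.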

'We assumed the C2 servers were the big emitter. It was the compliance-mandated raw packet logs — seven terabytes generated per month — that drowned our carbon budget.'

— Senior Red Team Lead, interviewed post-engagement

The catch is: fixing number two means negotiating with the bank's SOC to accept sampled logs instead of full PCAP. That changes detection fidelity. It is a trade-off — one most teams skip because they never measured. Measuring gave us a lever. Shorten dwell phases where possible? That pushes the shell lifecycle from eight weeks to four, cutting idle compute by 40%. A single tuning decision: schedule instance hibernation during all non-activity windows. That one change shaved 0.8 tons off the next engagement run. Not bad for a cron job.

Edge Cases When the Carbon Cost Model Breaks Down

Hybrid red teams: blending human ops with automated simulation

The neat equations break fast when bodies enter the room. Standard carbon models assume your campaign runs on clean electricity — a row of cloud VMs spinning attacks, logging failures, pivoting to the next target. That sounds fine until your engagement mixes six hours of manual adversary simulation with an automated campaign that churns through fifty concurrent containers. The human side? Coffee, crypto-breakfast meetings, three cross-country flights to sit in a war room. The automated side? Predictable draw from a datacenter. But you cannot split them; they feed each other. The operator watches the simulation output, tweaks a Python script, redeploys. Where does the carbon belong? Wrong model, wrong answer. Try a proxy: measure total team energy expenditure — flights, devices, office power for the duration — then attribute proportionally by operator-hours logged. It is rough. It is honest.
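The operator-hours proxy is a straight proportional split. Here is a minimal sketch, assuming you already have a single total for the team (flights, devices, office power) and a log of hours per workstream; the workstream names and numbers are made up for illustration.

```python
def attribute_by_hours(total_kg, hours_by_workstream):
    """Split a total footprint (kg CO2e) in proportion to logged operator-hours."""
    total_hours = sum(hours_by_workstream.values())
    if total_hours <= 0:
        raise ValueError("no operator-hours logged; nothing to attribute against")
    return {name: total_kg * hours / total_hours
            for name, hours in hours_by_workstream.items()}


# Illustrative inputs only.
team_total_kg = 1200
hours_logged = {"manual adversary simulation": 300, "automated campaign tuning": 100}

for name, kg in attribute_by_hours(team_total_kg, hours_logged).items():
    print(f"{name}: {kg:.0f} kg CO2e")
```

It is as rough as the text admits: an hour spent watching simulation output counts the same as an hour on a flight, so the split is only as good as the hour logs behind it.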

Multi-tenant cloud environments where power attribution is impossible

Shared infrastructure is the quiet saboteur. I once ran a twelve-month red-team campaign entirely inside a large cloud provider: dozens of accounts, hundreds of ephemeral instances. The provider gave us a single dashboard: total kWh consumed per month across our tenancy. But those numbers pooled sandbox environments, monitoring nodes, and unrelated DevOps pipelines from the same team. The real attacker workloads were maybe sixty percent of that draw. The other forty percent? Noise. We tried tagging every resource; the provider added a 14% ambiguity buffer for shared cooling and networking. That buffer swallowed any granularity we had. So we switched to time-sliced samples: pull utilization data from the hypervisor logs for your specific VM fleet, ignore the dashboard. You lose the long tail of idle overhead, but you stop attributing your neighbor's Bitcoin miner to your adversary simulation. The trade-off is acceptable: precision on the core campaign versus a foggy total.
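One common way to turn time-sliced utilization into energy is a linear model between an idle and a full-load wattage. The sketch below assumes that model; the 100 W and 300 W endpoints are placeholder values you would replace with figures for your own instance types, and `estimate_kwh` is a hypothetical helper.

```python
def estimate_kwh(cpu_utilization, interval_h, idle_w, max_w):
    """Estimate energy from utilization samples using a linear power model.

    Each sample is a fraction in [0, 1]; power is interpolated between
    `idle_w` (0% utilization) and `max_w` (100% utilization).
    """
    watt_hours = sum((idle_w + u * (max_w - idle_w)) * interval_h
                     for u in cpu_utilization)
    return watt_hours / 1000


# Three hourly samples for one VM: quiet beaconing, staging, active relay.
samples = [0.1, 0.5, 0.9]
print(f"{estimate_kwh(samples, 1.0, idle_w=100, max_w=300):.2f} kWh")
```

Because the estimate is built from your own fleet's samples, it excludes the shared overhead entirely, which is the trade-off described above.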

'Attribution breaks when the meter sees a building, not a process. You need process-level eyes, not a utility bill.'

— lead engineer, after a particularly maddening audit

Offsetting vs. reduction: when carbon credits distort the audit picture

Offsets look like a cheat code — buy credits, call it sustainable, move on. The catch is that offsets mask the very inefficiencies you came to measure. If your red team burns eight tons of carbon running a year-long continuous campaign but buys sixteen tons of credits from a forestry project, the audit shows net-negative emissions. Good for PR. Useless for finding out why your infrastructure runs at 23% CPU utilization and burns power on idle. Credits create a permission structure to ignore the core question: how do you run the simulation with less? We stopped using offsets as an audit tool entirely. Instead, we treat them as a separate line: operational carbon (real) and offset investment (intentional). Then we compare month over month. The numbers that matter are peak-to-idle ratios, instance right-sizing, and whether your automated adversary really needs to spin up sixty GPUs to brute-force a user list. That hurts to admit. But it is where the actual savings live.

The Real Limits of Making Red Teams Sustainable

Trade-off between fidelity and carbon efficiency: can you simulate without GPUs?

Here is the ugly truth no vendor wants to sell you: a year-long adversary simulation without GPUs is like building a ship in a bottle — possible, but you will hate every minute of it. The catch is that GPU clusters burn power at rates that make bitcoin mining look frugal. I have watched teams run seven-figure GPU fleets for months just to simulate a single APT's lateral movement patterns. The output? Two findings that could have been surfaced with a $200 packet capture tool and a patient analyst. The real tension is not technical; it is cultural. Red teams are rewarded for complexity — the more elaborate the TTP replication, the louder the applause in the after-action report. Nobody ever got a bonus for saying "we used fewer kilowatt-hours this quarter." So the reflex is to toss hardware at fidelity problems. But here is what breaks first: the seam between what you need to simulate and what you actually test. Most campaigns do not need full GPU-based LLM red-teaming; they need a better hypothesis tree. That sounds fine until the client demands "realistic AI-generated phishing at scale" — then the carbon bill doubles and the findings flatline. The odd part is — we fixed this once by running a six-month engagement on CPU-only infrastructure with a strict test-scoping rule. Findings were comparable. The client asked why we didn't use "real AI tools." We showed them the electricity bill. They did not care.

Organizational inertia: why leadership resists carbon audits for security ops

Try walking into a CISO's office and asking for a carbon audit of the red team. You will get a look usually reserved for people who suggest rewriting IAM from scratch. The resistance is not malice; it is incentives. Security leaders are measured on breach prevention and compliance, not sustainability ratios. A carbon report for a year-long campaign adds overhead: someone has to instrument power meters, log GPU hours, attribute energy use to specific test scenarios. That is work that does not reduce risk. Worse, it might expose that the red team's favorite simulation environment costs more carbon than the entire SOC's daily operations. That is a political grenade. Most orgs choose the comfortable lie: ignore the footprint until a regulator or a board member asks. I once watched a director kill a sustainability pilot because "it made the security team look inefficient." The real inefficiency is running a simulation that burns 40 MWh and returns four marginal findings. But here is the structural trap: sustainability metrics for red teams do not fit into existing ESG frameworks. So they fall through the cracks. Not because they are unimportant, but because no one owns them.

'We spent six months measuring carbon for a red team engagement. We saved 12% energy. The measurement team cost more than the energy saved.'

— Engineering manager, financial services firm, post-mortem on a sustainability initiative

When the cost of measurement exceeds the carbon savings

That quote is not an outlier; it is the rule. The measurement overhead for a year-long campaign can crush any efficiency gains before they appear. You need per-node power monitoring, accurate utilization logs from GPU clusters, attribution models for shared infrastructure, and someone to reconcile the numbers every sprint. I have seen teams burn two full-time engineer months just to instrument a single engagement. The carbon savings? Maybe 8% from shutting down idle nodes. That hurts.

The practical barrier is that most red teams operate on shared cloud or hybrid environments where isolating their energy draw is a forensic exercise. The meter data is noisy, the billing is aggregated, and the carbon accounting standards for security operations are still in the 'scribble on a napkin' phase. So you end up with a choice: invest heavily in measurement, which itself has a carbon cost in compute and people time, or accept that your sustainability audit will be a rough estimate. Most choose estimate. That is fine until the estimate is off by 40% and someone builds a strategy on bad data.

The real limit is not technical or economic; it is the diminishing returns loop. At some point, the marginal carbon you save by auditing another simulation step is less than the carbon burned by the audit itself. That is the breaking point. That is where the model stops being useful and starts being a distraction.

So what do you do? Stop measuring everything. Pick the three highest-carbon activities (usually GPU-based training, persistent test environments, and long-running beacon simulations) and audit only those. Accept the rest as noise. It is not perfect. But it is the only path that does not drown you in meter readings for zero net gain. The next step is to use that focused audit to renegotiate the scope of the campaign itself: shorter simulations, narrower TTP sets, fewer full-spectrum runs. That is where the real carbon savings live, not in the measurement spreadsheets.
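The "audit only the top three" rule and the break-even test are both small enough to write down. The sketch below is a rough illustration, assuming you have order-of-magnitude estimates per activity in kg CO2e; every number and activity name is a placeholder.

```python
def top_activities(kg_by_activity, n=3):
    """Return the n activities with the largest estimated footprint."""
    return sorted(kg_by_activity, key=kg_by_activity.get, reverse=True)[:n]


def audit_worth_it(expected_saving_kg, audit_cost_kg):
    """An audit step pays off only if it saves more carbon than it burns."""
    return expected_saving_kg > audit_cost_kg


# Placeholder estimates for one campaign.
estimates = {
    "GPU-based training": 2400,
    "persistent test environments": 1800,
    "long-running beacon simulations": 900,
    "operator laptops": 120,
    "report rendering": 15,
}

print(top_activities(estimates))
print(audit_worth_it(expected_saving_kg=0.08 * 900, audit_cost_kg=150))
```

In this made-up case an 8% saving on the beacon simulations (72 kg) would not cover a 150 kg audit, which is the breaking point the section describes.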

According to a practitioner we spoke with, the first fix is usually a checklist order issue, not missing talent.

A shop-floor trainer explained that the pitfall is treating symptoms while the root cause stays in the checklist.
