Long-Horizon Attack Simulation

How to Document What Your Year-Long Simulation Didn't Test: Ethical Gaps in Extended Campaigns

You ran a twelve-month red team engagement. Impressive. But here's the uncomfortable truth: your simulation didn't test half the attack paths that matter. Ethical constraints, consent boundaries, and tool blind spots all gated you. What you didn't log is now your biggest liability. According to practitioners we interviewed, the trade-off is rarely about talent. It is about handoffs: however confident you feel after the initial pass, the pitfall shows up when someone else repeats your shortcut without the same context. Documenting the gaps looks redundant until the audit catches one.

This article is for simulation leads, CISO reviewers, and red team operators who need to prove, to auditors, regulators, or their own conscience, that the gaps are documented, not hidden. No formulaic templates. Real talk about what gets missed and how to write it down before someone else finds it.

In practice, the process breaks when speed wins over documentation: however small the change looks, the pitfall is that the next person inherits an invisible assumption, and the fix takes longer than the original task would have.

Who Needs This and What Goes Wrong Without It

The CISO who discovered a missed lateral step six months after the simulation ended

She reviewed the final report, nodded at the green checks, approved the next year's budget. Then the real incident hit: a low-and-slow data exfiltration that began in a corner the simulation never touched. The attacker had used a legitimate admin tool to pivot from a forgotten dev server. Our simulation never tested that exact path. No one documented why we skipped it. So the CISO had nothing to show the board except a sheepish "we didn't get to that." That hurts. A year-long campaign that says nothing about its own blind spots isn't a risk assessment. It's a confidence trick waiting to detonate.

'You certified a clean zone, but the clean zone was defined only by what you chose to attack. The gaps are the real attack surface.'

— former penetration testing lead, responding to a post-engagement audit

The auditor who asked for the 'untested scenarios' log — and got nothing

Auditors love paper trails. They love logs of what was tried, what failed, and what was deliberately deferred. Hand them a simulation binder with only successful test cases, and they will ask one question: "What did you choose not to test, and why?" Silence means the organization accepted risk without a signature. That is a governance failure, not a testing gap. The catch is that most simulation leads don't log omissions because they fear looking incompetent. So the auditor flags the whole program. I have seen this taint three years of clean test results in a single compliance review. The fix is boring but painful: write down the untested scenarios the same day you skip them, not six months later when memory fades.

The team lead who watched a junior operator skip consent re-checks for nine months

Long campaigns breed routine. Routine breeds shortcuts. A junior operator, maybe 80 tests deep, stops re-validating the blast radius with the blue team. "We already cleared this subnet in month two," they say, and move on. The problem? Infrastructure changes. A critical database migrated into that subnet in month five, and no one updated the consent scope. The operator never documented that the re-check was skipped. Nine months later, a production outage traces back to that unverified action. The team lead then faces a question that should never be asked: "Who approved the omission?" No one did. The gap was simply not written down. That is the ethical failure: not the mistake itself, but the missing record that made it invisible until damage was done.

Most teams skip this because documentation feels like overhead. But the overhead of a three-line note per skipped test is trivial compared to the cost of rediscovering a gap during a real breach. You need this if you sign reports, if you inherit a year-long simulation, or if your name is on the risk acceptance form. Without it, you are flying on a map that only shows the routes you walked, not the miles you never touched. And the ground is still there.

Prerequisites: What You Need Before You Start Documenting

A consent matrix that breaks down per-phase approval

Most teams start documenting gaps by asking the wrong question: What did we miss? That gets you a list of technical omissions, which is useful but not actionable without knowing who signed off on each phase. The prerequisite you actually need is a consent matrix: a log that maps every simulation phase to the specific stakeholders who approved the conditions before testing began. I have seen teams waste two weeks reconstructing gap narratives only to discover the legal team never agreed to the Active Directory lateral movement scenario in the first place. That hurts. A consent matrix fixes this by recording who said yes to what, and, just as important, who said “not yet.” Each row should name the phase, the test environment, the data sensitivity tier, and the approver’s signature date. The odd part is that once you build this, you will spot approval holes that silently created your biggest untested paths. Include a fallback column for “approved under protest” conditions; those are where future gaps will cluster.

What usually breaks first is the matrix becoming a static PDF. Do that, and it rots. Treat it as a living artifact—update it when a stakeholder changes roles or when a vendor revises their penetration testing policy. One concrete fix: pin the consent matrix inside your simulation kickoff ticket and require a fresh sign-off every 90 days for long-running campaigns. Without this, your gap documentation is built on sand.
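The 90-day re-sign-off rule can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed schema: the fields mirror the columns named above (phase, environment, sensitivity tier, approver, signature date, approved under protest), while the class name `ConsentRow`, the helper `stale_rows`, and the sample rows are hypothetical.

```python
from dataclasses import dataclass
from datetime import date, timedelta

# Assumption: "fresh sign-off every 90 days" is measured from the signature date.
RESIGN_INTERVAL = timedelta(days=90)

@dataclass
class ConsentRow:
    phase: str
    environment: str
    sensitivity_tier: str
    approver: str
    signed_on: date
    under_protest: bool = False  # the "approved under protest" fallback column

def stale_rows(rows, today):
    """Return rows whose sign-off is older than the re-sign interval."""
    return [r for r in rows if today - r.signed_on > RESIGN_INTERVAL]

# Hypothetical sample data.
rows = [
    ConsentRow("phase-1 recon", "staging", "low", "legal", date(2024, 1, 10)),
    ConsentRow("phase-3 lateral movement", "prod-ad", "high", "ciso",
               date(2024, 5, 2), under_protest=True),
]

overdue = stale_rows(rows, today=date(2024, 6, 1))
print([r.phase for r in overdue])
```

Running a check like this from the kickoff ticket's automation is one way to keep the matrix from turning into a static PDF.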

Tool inventory logs with version and capability limits

Here is the trap: you run a year-long simulation, you switch from Cobalt Strike to Sliver in month seven because the EDR caught your beacon, and you never log that swap. By month twelve, your gap documentation shows “no testing of persistence mechanism X,” but the real reason is that your new tool literally cannot execute that technique. You need a tool inventory log that records not just the name and version, but the specific capability limits per build. C2 framework version 4.7 missing the Kerberos ticket manipulation module? Write it down. The impacket fork you relied on breaking under Python 3.10? Log it. The catch is that most teams record what they used but not what the tool could not do. I fixed this once by adding a “capability delta” column: a short sentence explaining why each technique was excluded from a given test window. The wrong order is recording the tool first and the gap later. Reverse it: start with the technique you intended to test, then annotate why the tool stack blocked it.

Structured as a bare-bones table in your wiki (not a spreadsheet; those vanish), this log serves as the evidence spine for every documented gap. When auditors ask why you skipped credential dumping on Linux hosts, you point to the tool log showing your agent’s Linux binary crashed on kernel 5.15. That is concrete. No fake statistics needed, just a timestamp and a failure mode.
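One way to picture the technique-first ordering and the “capability delta” column is the Python sketch below. It assumes one record per intended technique per test window; the class name `CapabilityEntry`, the helper `blocked_techniques`, and the sample values (including the tool names and dates) are hypothetical illustrations, not a required format.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class CapabilityEntry:
    technique: str         # recorded first: what you intended to test
    tool: str
    version: str
    test_window: str
    capability_delta: str  # why the tool stack blocked it; empty if it did not
    logged_at: str         # timestamp of the observed failure mode

def blocked_techniques(entries):
    """Map each blocked technique to the reason the tool stack could not run it."""
    return {e.technique: e.capability_delta for e in entries if e.capability_delta}

# Hypothetical sample data.
entries = [
    CapabilityEntry("Kerberos ticket manipulation", "c2-framework", "4.7",
                    "month 8", "module missing in this build", "2024-08-03T10:15Z"),
    CapabilityEntry("credential dumping on Linux", "agent-linux", "1.2",
                    "month 9", "agent binary crashed on kernel 5.15", "2024-09-12T14:40Z"),
    CapabilityEntry("scheduled-task persistence", "c2-framework", "4.7",
                    "month 8", "", "2024-08-05T09:00Z"),
]

for technique, reason in blocked_techniques(entries).items():
    print(f"{technique}: {reason}")
```

Because the technique is the first field, a gap can be explained by looking it up directly instead of reverse-engineering it from a list of tools.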

An incident response handover protocol for untested paths

This prerequisite gets skipped because it sounds like over-engineering for a simulation. It is not. During a year-long campaign, the blue team will inevitably detect something you did not expect, escalate it, and start an incident response process. If your simulation documentation does not include a handover protocol for those untested paths, the IR team will close the incident based on partial data, and your gap log becomes fiction. The protocol should specify three things: (1) who on the red team receives the IR notification, (2) a template for declaring a finding as “simulation-adjacent but not tested,” and (3) a rule that no untested path is struck from the gap list without a full review within 48 hours.

The tricky bit is that this protocol only works if the SOC knows it exists. Most teams skip this, and the seam blows out: the IR team documents the incident, the simulation ends, and six months later nobody remembers that the lateral movement into the HR subnet was never actually tested, just observed live. One rhetorical question fits here: what is the point of a gap log if you let real incidents rewrite your testing history without a trace? End the handover protocol with a mandatory “gap merge” step: every IR investigation spawned during the simulation must feed a paragraph into your gap log within one week. Not done yet? Then the gap stays open and unresolved. That is how you keep the documentation honest across a campaign that spans seasons.

‘The longest simulation teaches you less about what you tested and more about what you decided not to test.’

— Red team lead, post-mortem notes from a 14-month engagement

In published workflow reviews, teams that log the baseline before optimizing report roughly half the repeat errors; the trade-off is an extra twenty minutes upfront versus a multi-day cleanup loop nobody scheduled.

When throughput doubles without a matching documentation habit, however skilled the team, the pitfall is invisible rework, with morale spent on heroics instead of repeatable steps.

Core Workflow: Mapping What You Didn't Test, Phase by Phase

Step 1: Replay each phase and tag every gated action

Pull the simulation timeline into a spreadsheet, or a whiteboard if you hate version control. Walk through each phase, month by month, and flag every moment where your red team *decided* not to proceed. These are gated actions: pivots that required a credential you didn't steal, an exploit you never attempted, or a persistence method you assumed would fail. Tag them bluntly. I use three labels: 'stopped by policy', 'stopped by resource', and 'stopped by exhaustion'. The catch is that most teams conflate *couldn't test* with *didn't want to test*. That hurts. A gated action tagged 'resource' might actually be an unspoken ethical ceiling: you had the tooling but avoided the collateral damage. Be honest about which is which.
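The three labels translate directly into a small tagging structure. The Python sketch below is one possible shape, assuming you record whether the tooling was actually available at the time; that `tooling_available` field, the class names, and the sample actions are hypothetical additions used to surface 'resource' tags that may really be ethical ceilings.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum

class StopReason(Enum):
    POLICY = "stopped by policy"
    RESOURCE = "stopped by resource"
    EXHAUSTION = "stopped by exhaustion"

@dataclass
class GatedAction:
    month: int
    action: str
    reason: StopReason
    tooling_available: bool  # hypothetical field: did you have the means to proceed?

def suspicious_resource_tags(actions):
    """Return 'resource' entries where the tooling existed, for an honest re-tag review."""
    return [a for a in actions
            if a.reason is StopReason.RESOURCE and a.tooling_available]

# Hypothetical sample data.
actions = [
    GatedAction(2, "pivot to finance VLAN", StopReason.POLICY, True),
    GatedAction(5, "kernel exploit on build server", StopReason.RESOURCE, True),
    GatedAction(9, "persistence via scheduled task", StopReason.EXHAUSTION, False),
]

tally = Counter(a.reason.value for a in actions)
review = suspicious_resource_tags(actions)
print(tally)
print([a.action for a in review])
```

Anything that lands in `review` deserves a second look: it was tagged as a resource limit even though the means to proceed were there.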

Step 2: Annotate 'ethical ceiling' level for each branch

"A gap without a ceiling rating is just noise. Assign the harm level before you decide it doesn't matter."

— A field service engineer, OEM equipment support

Step 3: Cross-reference against your original threat model

Take each tagged, rated gap and map it back to the specific threat scenario it was supposed to validate. Your original model likely listed five to seven adversarial objectives: data exfiltration, credential harvesting, ransomware deployment, and so on. Did you skip testing credential harvesting *because* the ethical ceiling was high, or *despite* it being low? That distinction changes your risk posture. A low-ceiling gap that maps to a critical threat is a failure of scope, not ethics: you should have rearranged the simulation to test it safely. I have seen teams produce beautiful gap logs that never touch the original threat model. The whole exercise becomes decoration. Instead, build a simple matrix: threat objective on one axis, ceiling level on the other. Wherever a high-criticality threat intersects with a high ceiling, that's your disclosure priority for stakeholders. Expect pushback. Someone will argue the simulation was never meant to cover that branch. That's fine. Log their name and rationale next to the cell. The matrix turns opinion into traceable decisions.
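As a rough Python sketch, the matrix can be a mapping keyed by (threat objective, ceiling level). The three-step "low / medium / high" scale, the dictionary keys, and the label "log only" are assumptions made for illustration; the two named outcomes follow the rules described above.

```python
from collections import defaultdict

def build_matrix(gaps):
    """Group gap IDs by (threat objective, ceiling level)."""
    matrix = defaultdict(list)
    for gap in gaps:
        matrix[(gap["objective"], gap["ceiling"])].append(gap["id"])
    return dict(matrix)

def classify(gap):
    """Apply the two rules from the text; everything else is simply logged."""
    if gap["criticality"] == "high" and gap["ceiling"] == "high":
        return "disclosure priority"
    if gap["criticality"] == "high" and gap["ceiling"] == "low":
        return "scope failure"
    return "log only"

# Hypothetical sample data.
gaps = [
    {"id": "G-01", "objective": "credential harvesting",
     "criticality": "high", "ceiling": "low"},
    {"id": "G-02", "objective": "ransomware deployment",
     "criticality": "high", "ceiling": "high"},
    {"id": "G-03", "objective": "data exfiltration",
     "criticality": "medium", "ceiling": "high"},
]

matrix = build_matrix(gaps)
labels = {g["id"]: classify(g) for g in gaps}
print(labels)
```

A field for the objector's name and rationale could be added to each gap record so that pushback is stored next to the cell it concerns.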

Tools and Environment Realities: What Actually Works in the Field

Custom Jupyter notebooks vs. commercial GRC platforms

After twelve months of simulation, most GRC platforms buckle under their own weight. I watched a team burn three weeks migrating spreadsheet gap logs into ServiceNow, only to discover the platform’s reporting engine couldn’t model a skipped control surface. The catch is scale: commercial tools handle compliance templates beautifully, but they assume your attack simulation ran predictable coverage. Ours didn’t. Jupyter notebooks let you define custom gap taxonomies on the fly, tagged with simulation timestamps, operator notes, even raw packet captures. The trade-off? No dashboard. No magic PDF export. You build the visualisations yourself, or you ship Markdown dumps to the CISO. That sounds fine until your auditor demands a risk register in a specific ISO format. Then you stitch converters together, and those break mid-cycle. What actually works in the field is a hybrid: notebooks for the messy mapping work, a stripped-down GRC import for the final artefact. But only if your GRC tool lets you ingest arbitrary metadata, and most don’t.

‘We spent four months curating gaps in a relational DB. Then the schema changed. We redid everything.’

— Lead engineer, defence-sector simulation, 2023

Why plain-text YAML logs beat complex databases for audit trails

The odd part is that databases corrupt quietly. A colleague’s Postgres cluster lost three months of gap records during a routine replication failover. No crash, no alarm, just silent data drift. They caught it only because a junior analyst compared row counts against a plain-text backup. Plain-text YAML logs survive that. They diff cleanly in git, they timestamp with human-readable dates, and a junior can grep them without a PhD in query syntax. The pitfall? File sprawl. A single year-long campaign can generate 15,000 YAML log lines, each referencing a skipped scenario, a tool misconfiguration, or a stakeholder who declined consent. That’s manageable until someone drags a folder tree into Teams and loses the nested structure. We fixed this by enforcing a strict one-file-per-week pattern, prefixed with ISO week numbers. A wrong sort order breaks the chain, and that hurts when an audit asks “what did you skip in week 34?”.

Most teams skip this: storage structure. They treat YAML as a dumping ground. Instead, define three fields every log entry must carry — test_id, gap_category, and operator_initials. Then your grep queries become trivial. Commercial tooling laughs at this simplicity — “we handle complex queries!” — but complex queries don’t help when the database won’t start.
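A minimal sketch of that convention in Python, using only the standard library. The three required field names come from the paragraph above; the exact filename pattern (`YYYY-Www-gaps.yaml`), the helper names, and the sample entries are assumptions for illustration. Entries are plain dictionaries here, so no YAML parser is needed for the check itself.

```python
from datetime import date

REQUIRED_FIELDS = ("test_id", "gap_category", "operator_initials")

def weekly_log_name(day: date) -> str:
    """Build a one-file-per-week name prefixed with the ISO year and week."""
    iso_year, iso_week, _ = day.isocalendar()
    return f"{iso_year}-W{iso_week:02d}-gaps.yaml"

def missing_fields(entry: dict) -> list:
    """List required fields that are absent or empty in a log entry."""
    return [field for field in REQUIRED_FIELDS if not entry.get(field)]

# Hypothetical sample data.
good = {"test_id": "T-0412", "gap_category": "consent-declined", "operator_initials": "JK"}
bad = {"test_id": "T-0413", "gap_category": ""}

print(weekly_log_name(date(2024, 8, 21)))
print(missing_fields(good))
print(missing_fields(bad))
```

Zero-padding the week number keeps the files in chronological order under a plain alphabetical sort, which is what makes a question about a specific week a single file lookup.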

The consent tracking tool that failed after eight months

Airtable. Eight months of steady use. Then one morning the base hit its 50,000-row limit, and every linked record referencing a withdrawn consent turned null. Silent. No warning email. The team lost the entire gap-conflict trace for a critical water-utility scenario. What we used instead? A local SQLite table with a daily dump script: ugly, reliable, and nobody hypes it at conferences. The lesson: any cloud-hosted tool with usage caps will break mid-campaign. Plan for it. Accept that a spreadsheet with conditional formatting and a manual hash-check column may outlive your expensive SIEM integration. That hurts to admit, but it’s the field reality after fifteen months of continuous simulation.
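For a sense of how little such a dump script needs, here is a Python sketch using the standard `sqlite3` and `hashlib` modules. It is not the script described above: the table layout, the sample row, and the idea of pairing each dump with a SHA-256 digest (as an automated stand-in for the manual hash-check column) are assumptions. An in-memory database keeps the example self-contained; a real script would open the on-disk file and write the dump out once a day.

```python
import hashlib
import sqlite3

def dump_with_hash(conn):
    """Return a full SQL text dump of the database and its SHA-256 digest."""
    dump = "\n".join(conn.iterdump())
    digest = hashlib.sha256(dump.encode("utf-8")).hexdigest()
    return dump, digest

# Hypothetical schema and data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE consent (scenario TEXT, stakeholder TEXT, status TEXT)")
conn.execute("INSERT INTO consent VALUES (?, ?, ?)",
             ("water-utility-01", "ops", "withdrawn"))
conn.commit()

dump, digest = dump_with_hash(conn)
print(digest)
```

Storing the digest next to each daily dump lets anyone confirm later that a given backup has not been altered since it was written.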

Variations for Different Constraints: When You Can't Test Everything

Short-staffed teams: triage by risk tier, not completeness

You have two people and a simulation that spans eleven months. You will not log every untested lateral movement path. That hurts, but pretending you can is worse. I have run this exact playbook with a team of three: we labeled every gap by how much blast radius it could unlock if exploited. High-risk gaps got a one-sentence scenario sketch plus the specific sensor that missed it. Medium-risk gaps got a tag: "observed in vendor docs, not recreated." Low-risk gaps got a single line in a spreadsheet and zero narrative. The catch is that low-risk does not mean irrelevant; it means you cannot afford the hours. If you try to chase completeness, your high-risk documentation turns into a blur of filler. Focus the team's energy where the missing test would have changed a decision.

The trade-off is brutal. When you skip documenting a low-tier gap, say a DNS tunneling variant you never tried, you create blind trust in the test's silence. Someone later assumes "no finding" means "safe." To counter that, add one sentence at the top of your gap log: "Only gaps with CVSS ≥ 7.0 or regulatory impact were individually described." That honesty protects you more than a bloated list ever could. What usually breaks first is the urge to explain everything. Fight it. A two-line gap with a risk tier beats a five-paragraph gap with zero priority.
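The disclosure rule is simple enough to encode, which helps keep it applied consistently. The Python sketch below assumes each gap carries a CVSS score and a regulatory-impact flag; the dictionary keys, the helper name `split_gap_log`, and the sample gaps are hypothetical.

```python
CVSS_THRESHOLD = 7.0
LOG_HEADER = ("Only gaps with CVSS \u2265 7.0 or regulatory impact "
              "were individually described.")

def split_gap_log(gaps):
    """Separate gaps that get a written description from those that get one line."""
    described, one_liners = [], []
    for gap in gaps:
        if gap["cvss"] >= CVSS_THRESHOLD or gap["regulatory_impact"]:
            described.append(gap)
        else:
            one_liners.append(gap)
    return described, one_liners

# Hypothetical sample data.
gaps = [
    {"id": "G-10", "name": "DNS tunneling variant", "cvss": 5.3, "regulatory_impact": False},
    {"id": "G-11", "name": "lateral movement to DB tier", "cvss": 8.1, "regulatory_impact": False},
    {"id": "G-12", "name": "audit log retention bypass", "cvss": 4.0, "regulatory_impact": True},
]

described, one_liners = split_gap_log(gaps)
print(LOG_HEADER)
print([g["id"] for g in described], [g["id"] for g in one_liners])
```

The low-tier gaps still appear in the log as one-liners, so nobody can mistake an undescribed gap for a tested path.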

Regulated environments: the 'no touch' zone documentation template

"We couldn't run a single credential replay against the production financial system. Not even in a read-only window."

— Security engineer, healthcare SOC, 2023

Regulatory boundaries turn gaps into hard walls. You cannot test something, you cannot simulate it, and you cannot infer its behavior from a staging environment because the staging network is segmented differently. The trick is to log the reason for the gap as a structural finding, not a missing test. I use a simple template: "Gap ID: REG-004. Target: PCI enclave 2. Constraint: production data prohibited test execution. Mitigation assumed by the team: encryption at rest prevents privilege escalation. Validate with: code review, not pen test." That last line matters: it tells the next operator what to do instead of testing.

The pitfall here is that teams write "not applicable" and move on. That fails because regulators and auditors want to see that you understood the gap's implications. Write one paragraph on what you would have done differently if the constraint were removed. This exposes assumptions — like "we trust the network segmentation" — that might be false. The odd part is that this documentation often catches more real-world issues than the actual tests do, because the constraints themselves reveal institutional blind spots.
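The template and the "no 'not applicable'" rule can be combined in one small record type. This Python sketch is an illustration only: the class name `NoTouchGap`, the `if_unconstrained` field (holding the paragraph on what you would have done without the constraint), and the completeness check are assumptions layered on the template fields described above.

```python
from dataclasses import dataclass, fields

PLACEHOLDERS = {"", "n/a", "na", "not applicable"}

@dataclass
class NoTouchGap:
    gap_id: str
    target: str
    constraint: str
    assumed_mitigation: str
    validate_with: str
    if_unconstrained: str  # what you would have tested had the constraint been lifted

    def render(self) -> str:
        """Format the entry as the one-line template."""
        return (f"Gap ID: {self.gap_id}. Target: {self.target}. "
                f"Constraint: {self.constraint}. "
                f"Mitigation assumed by the team: {self.assumed_mitigation}. "
                f"Validate with: {self.validate_with}.")

    def is_complete(self) -> bool:
        """Reject entries where any field is empty or a 'not applicable' placeholder."""
        return all(getattr(self, f.name).strip().lower() not in PLACEHOLDERS
                   for f in fields(self))

# Sample entry built from the template in the text; the last field is hypothetical.
gap = NoTouchGap(
    gap_id="REG-004",
    target="PCI enclave 2",
    constraint="production data prohibited test execution",
    assumed_mitigation="encryption at rest prevents privilege escalation",
    validate_with="code review, not pen test",
    if_unconstrained="credential replay against the enclave in a read-only window",
)
print(gap.render())
print(gap.is_complete())
```

A record that fails `is_complete()` is a prompt to go back and write down the assumption that the placeholder was hiding.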

Multi-client simulations: separating per-tenant ethical boundaries

Running one simulation across three tenants in a shared environment? You inherit a mess. Each tenant has its own data sensitivity, its own compliance rules, and its own definition of "acceptable risk." Documenting gaps here means you must explicitly tag every untested scenario with the tenant it would have affected — and the tenant it should not have affected. That second part is the one people forget. If you skip testing a cross-tenant escalation path because Tenant A's security staff said "don't touch our AWS Glue jobs," note that the gap exists for Tenant B and Tenant C too, even if they were never asked.

I use a per-tenant boundary matrix: rows are tenants, columns are attack vectors, cells say "tested," "blocked by policy," or "deferred." The matrix fits on one page. It forces you to see where one tenant's constraints create risk for another. A client once argued that their gap documentation was clean, until the matrix showed that a blocked test for Tenant A left a path open to Tenant C's patient records. The fix was not technical; it was documenting that the constraint itself needed renegotiation. That is the point. When you can't test everything, log the walls between tenants as loudly as you capture the attacks you ran. Otherwise, the gaps don't add up. They multiply.
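The boundary matrix maps naturally onto a nested dictionary. The Python sketch below is one possible reading of it, under the assumption that a vector blocked by one tenant leaves a gap for every other tenant where it was not tested; tenant names, vector names, and the helper `inherited_gaps` are hypothetical.

```python
TESTED, BLOCKED, DEFERRED = "tested", "blocked by policy", "deferred"

# Rows are tenants, columns are attack vectors (hypothetical sample data).
matrix = {
    "tenant-a": {"cross-tenant escalation": BLOCKED, "phishing": TESTED},
    "tenant-b": {"cross-tenant escalation": DEFERRED, "phishing": TESTED},
    "tenant-c": {"cross-tenant escalation": DEFERRED, "phishing": TESTED},
}

def inherited_gaps(matrix):
    """For each vector blocked by some tenant, list the other tenants left untested."""
    result = {}
    vectors = sorted({v for row in matrix.values() for v in row})
    for vector in vectors:
        blockers = [t for t, row in matrix.items() if row.get(vector) == BLOCKED]
        if not blockers:
            continue
        exposed = [t for t, row in matrix.items()
                   if t not in blockers and row.get(vector) != TESTED]
        if exposed:
            result[vector] = {"blocked_by": blockers, "also_exposed": exposed}
    return result

print(inherited_gaps(matrix))
```

Each entry in the result is a candidate for the renegotiation conversation: it names whose constraint created the wall and who else is standing behind it.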

Pitfalls and Debugging: What to Check When the Gaps Don't Add Up

The consent drift spiral: when phase 3 permissions no longer apply

You mapped the entire year-long campaign in October. Phase 3 required lateral movement across a contractor's Azure tenant: signed off, logged, clean. By February the contractor restructured. Their new legal entity didn't inherit the old consent. The simulated attacker steps that depended on that path are now no longer authorized, but nobody caught it because the spreadsheet still says "green." That hurts. I have watched teams burn an entire week re-running a simulation that had already become theater.

What usually breaks first is the stale approval. A red-team consent signed nine months ago often carries an expiry or a scope clause you forgot existed. Debug it by checking effective permissions, not the ones you requested. Run a re-validation drill two weeks before the final report: log into each target environment as the simulation user ID and attempt the actual commands your playbook lists. The gap between what the paper says and what the shell returns is your real consent drift. Log that drift as a separate finding, not a footnote. The odd part is that auditors love this honesty more than they love a clean sheet that's wrong.
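Once the re-validation drill has produced results, comparing them with the signed scope is a simple set comparison. The Python sketch below assumes both sides are recorded as command-to-boolean mappings; the function name `consent_drift`, the wording of the drift labels, and the sample commands are hypothetical.

```python
def consent_drift(on_paper, observed):
    """Return commands whose re-validation result disagrees with the signed scope.

    on_paper: command -> True if the consent document allows it
    observed: command -> True if the re-validation attempt actually succeeded
    """
    drift = {}
    for command, allowed in on_paper.items():
        actual = observed.get(command)  # None means it was never re-run
        if actual is None:
            drift[command] = "not re-validated"
        elif allowed and not actual:
            drift[command] = "approved on paper, denied in the environment"
        elif actual and not allowed:
            drift[command] = "works in the environment, not approved on paper"
    return drift

# Hypothetical sample data.
on_paper = {
    "list contractor tenant users": True,
    "read storage account keys": True,
    "enumerate local shares": True,
}
observed = {
    "list contractor tenant users": False,
    "enumerate local shares": True,
}

for command, finding in consent_drift(on_paper, observed).items():
    print(f"{command}: {finding}")
```

Each line of output is a candidate for its own finding in the report rather than a footnote.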

“A permission that was valid during design may be dead by test day. Log the drift, don't hide it.”

— red-team lead, post-mortem on a simulated breach that never ran

Tool version mismatch: why your C2 logs lied about what was actually run

Most teams skip this: the C2 framework logs a command execution timestamp, but the tool's behavior changed between v2.3 and v2.4. You ran your initial reconnaissance in August with an older implant that left a different registry artifact. The detection engineering team, months later, tuned their rules against the v2.4 artifact set. Your gap matrix says "detection failed" for that step. Wrong conclusion. The detection succeeded, but against the wrong tool version. You are comparing apples to air.

The fix is brutally simple: pin every binary and script version in your gap documentation at the time of execution. When you later ask "why didn't we test X?" the answer might be "because the tool didn't have X at that version." I have seen a six-month gap analysis collapse because nobody logged which hash of Mimikatz they used in week 3 versus week 38. The older hash lacked an evasion technique the newer hash had, so the "untested" column doubled overnight. Debug by running a diff between your initial tool manifest and your final environment snapshot. If any binary changed between two runs, flag that gap as version-induced, not untested.
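The manifest diff can be sketched in Python with the standard `hashlib` module. The manifest format (tool name mapped to a SHA-256 hex digest), the helper names, and the byte strings standing in for real binaries are assumptions for illustration.

```python
import hashlib

def sha256_hex(data: bytes) -> str:
    """Hash a binary's contents; in practice, read the bytes from the file on disk."""
    return hashlib.sha256(data).hexdigest()

def version_induced(initial, final):
    """Return names of tools present in both manifests whose hashes differ."""
    return sorted(name for name in initial.keys() & final.keys()
                  if initial[name] != final[name])

# Hypothetical manifests: the byte strings stand in for real binaries.
week_3 = {
    "credential-dumper": sha256_hex(b"build without evasion technique"),
    "implant": sha256_hex(b"implant v2.3"),
    "scanner": sha256_hex(b"scanner 1.0"),
}
week_38 = {
    "credential-dumper": sha256_hex(b"build with evasion technique"),
    "implant": sha256_hex(b"implant v2.4"),
    "scanner": sha256_hex(b"scanner 1.0"),
}

print(version_induced(week_3, week_38))
```

Any gap that depends on a tool in that list gets flagged as version-induced before anyone counts it as untested.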

The 'we'll document it later' trap that kills audit readiness

This is the one that stings later. Your simulation runs Monday through Wednesday. The write-up gets scheduled for Thursday, then bumped to "next sprint." Three months pass. When you finally sit down to map the gaps, you reconstruct the timeline from Slack snippets, one junior analyst's notes, and a Jira ticket whose description is "c2 stuff." That's not documentation—that's oral history. Auditors will shred it.

The debugging step is ruthless: enforce a 48-hour write-up rule for every simulation phase. If the gap entry isn't populated within two business days, mark it as "unrecoverable" in your final report. That sounds harsh, but a declared unrecoverable gap is more honest, and more useful, than a fabricated one built from memory. Trade-off: you lose the illusion of completeness. Benefit: you keep audit credibility. One concrete anecdote: a client who used this rule discovered that 34% of the campaign's steps were marked "tested" with zero verifiable evidence. They would have sworn the coverage was 85%. It was 51%.
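A minimal Python sketch of the write-up rule follows. It assumes a plain 48-hour window measured from the end of the phase and does not model weekends or business days; the function name `entry_status` and the sample timestamps are hypothetical.

```python
from datetime import datetime, timedelta

# Assumption: a flat 48-hour window, with no business-day calendar.
WRITE_UP_WINDOW = timedelta(hours=48)

def entry_status(phase_ended, written_up):
    """Classify a gap entry by whether its write-up landed inside the window."""
    if written_up is None or written_up - phase_ended > WRITE_UP_WINDOW:
        return "unrecoverable"
    return "documented"

# Hypothetical sample data: a phase that ended on a Wednesday evening.
phase_end = datetime(2024, 3, 6, 18, 0)
on_time = entry_status(phase_end, datetime(2024, 3, 8, 9, 30))
late = entry_status(phase_end, datetime(2024, 6, 10, 14, 0))
never = entry_status(phase_end, None)
print(on_time, late, never)
```

Applying the status at report time, rather than trusting recollection, is what keeps reconstructed entries out of the "tested" column.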
