Amplify, Obfuscate, Simulate: The 3 Legitimate Uses of Synthetic Data in SOC Tooling (Synthetic Users Is Not One of Them)
There's a disturbing trend in security AI: synthetic users replacing real customer research. For an AI MacGyver, synthetic data is duct tape. Brilliant for amplifying rare signals, obfuscating for privacy, and simulating for agent testing. But you don't do brain surgery with duct tape. And you don't replace handshakes with synthetic users. The duct tape is for machines. The handshake is for humans.
There's a trend quietly spreading across product teams — including security product teams — that should alarm anyone who cares about shipping software that actually works.
It's the rise of the synthetic user.
Teams are spinning up LLM-simulated personas to replace customer research. "Interview" a synthetic SOC analyst. Validate a PRD against a synthetic CISO. Run a whole discovery sprint against an imaginary user that always has time to talk and always says something plausible.
This is a disaster waiting to happen. And I'm going to keep saying it until people stop.
Every real AI MacGyver I know builds from what's in the room. Paper clip. Duct tape. Swiss Army knife. Not because those are the ideal tools — because they're the tools reality gave you. The discipline is knowing which improvisation holds and which one collapses under load. Synthetic data is duct tape. Brilliant for filling the holes when the real part isn't available. Disastrous when you reach for it in place of the real thing.
If you want to build something that works for humans, go talk to humans. Watch an analyst work. Sit in a SOC at 2 a.m. during an incident. Ride along during shift changeover. Find out that the person you're building for checks MFA logs before endpoint telemetry, not after, and that they do it because their last SOC lead told them to, and nobody has updated the runbook in three years. That is the kind of insight that ships products. No LLM is going to tell you that. No simulated persona is going to have a cigarette outside the building and complain to you about the alert fatigue that's making them think about leaving the industry.
If the consumer of your output is a human making a decision, you need real humans in your loop. Not synthetic approximations of what a human might have said.
I've been saying this loudly for months. In March 2024, I published "Why We Don't Use Synthetic Data (And Neither Should You)". The hard line. No nuance.
Then someone challenged me.
The Challenge
I was on a call recently, on my usual tear about synthetic data, when someone pushed back: "Aren't there legitimate uses of synthetic data for SOC tools?"
And I had to admit they had a point.
(That's what AI MacGyvers do. We adjust when reality doesn't match the plan. The blanket "no synthetic data" rule was my plan. Someone showed me where reality didn't match, and I adjusted. The duct tape has its place. I just had to figure out where.)
I had to stop and reconsider. I was using "synthetic data" as shorthand for the whole category, when what I actually meant — and what the March 2024 piece was actually about — was synthetic users and synthetic-for-humans. I still stand by that, now more than ever. But I was wrong to extend the blanket to cases where a machine is the consumer rather than a person.
So I went back and rethought it. Here's where I landed, and I think it's cleaner than where I started.
The organizing principle is this:
Synthetic data is for machines, not humans.
If the consumer of the synthetic data is a human making a decision about what other humans want from a new experience, synthetic data is the wrong choice. It gives you a plausible-sounding answer that wasn't tested against reality. Unfortunately, humans are wired to accept plausible answers as true; that's how you end up sinking $2M and nine months into building a product that no one alive and breathing wants to buy or use.
But when the consumer is a machine — a detection model, an agent, a guardrail, an evaluator — synthetic data stops being a lie and starts being a tool. Machines don't care whether the world is real. They care whether the signal is statistically similar to what they'll see in production.
Within that, I now see three legitimate lanes for synthetic data in SOC tooling. I call them Amplify, Obfuscate, Simulate. All three have the same shape: you're short on something that reality can't give you — examples, permission, or safe ground.
1. Amplify the Signal
The machine needs more examples of the rare thing than reality provides.
This is the "bigger needle in the haystack" problem. Supervised detection models trained on real SOC telemetry face a brutal class imbalance. The positive class — the actual attack — is 0.001% of traffic. The model learns to predict "benign" every single time and still hits 99.999% accuracy. Technically correct. Operationally useless.
The fix is to synthesize more positives. More lateral movement patterns. More C2 callbacks. More rare TTPs that the model would otherwise see once per quarter. You're not lying to the model — you're giving it enough copies of the truth for gradient descent to actually learn the boundary.
The same logic applies to reinforcement learning loops. If your agent has to wait for a real production incident to learn from, you're going to be training for a very long time. Synthetic incident streams at known severities let the agent rack up thousands of learning episodes in the time it would take to see two real ones.
The key word here is amplify, not fabricate. You start from a real signal and you scale it up. Ideally, you are using real attacks as the foundation, not inventing new attack classes out of thin air. Otherwise you will just amplify a fictional signal and train your model to be extra vigilant for conditions that will never occur.
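A minimal sketch of what "amplify, not fabricate" looks like in practice: take the handful of real attack feature vectors you do have and oversample them with small perturbations, rather than inventing new attack classes. The feature matrix and jitter scale here are hypothetical illustrations, not a production recipe — real pipelines would use purpose-built oversamplers (e.g., SMOTE-style methods) and domain-aware augmentation.

```python
import numpy as np

rng = np.random.default_rng(42)

def amplify_positives(X_attack, n_copies=50, jitter=0.05):
    """Oversample real attack feature vectors by adding small
    Gaussian noise, so the minority class is large enough for
    gradient descent to learn a boundary. Starts from real
    signal; does not invent new attack classes."""
    reps = np.repeat(X_attack, n_copies, axis=0)
    noise = rng.normal(0.0, jitter * X_attack.std(axis=0), reps.shape)
    return reps + noise

# Toy example: 3 real attack samples, 4 numeric features each.
X_attack = rng.normal(size=(3, 4))
X_synth = amplify_positives(X_attack)
print(X_synth.shape)  # (150, 4)
```

The point of the jitter is to keep the synthetic positives statistically close to the real ones: you are thickening the needle, not planting a new one.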
2. Obfuscate the Data
You have the real thing, but you can't use it.
This is the privacy and compliance lane. Cross-customer threat intelligence. MSSP platforms training models on behalf of dozens of clients whose logs legally cannot touch each other. Research papers on detection evasion where publishing the real environment would get someone fired. Training a SOC analyst with realistic incidents when handing them production data would violate six compliance frameworks.
In all these cases, you have the data. You just can't move it, share it, publish it, or show it to the people who need to learn from it.
Think about what an Arctic Wolf or an eSentire actually does at the platform layer. They can't pool customer A's logs with customer B's logs to train a better detection model — the contracts alone would kill them, and the compliance review would kill them twice. But a synthetic training set that preserves the statistical shape of cross-customer attack patterns, without any individual customer's specific IOCs, hostnames, or PII? That's how platform-scale threat intel gets built at all. Without it, every customer is an island, and every model is undertrained.
Synthetic obfuscation — preserving statistical properties without leaking specifics — is the pragmatic answer. It's not new. It's what differential privacy has been doing for a decade. The word "synthetic" makes it sound more exotic than it is. It's statistical fidelity without identifying specifics. Call it what it is.
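To make "statistical fidelity without identifying specifics" concrete, here is a deliberately simple sketch: fit the mean and covariance of real telemetry features and sample fresh records from that distribution. Everything here is a toy assumption — real obfuscation pipelines use formal differential-privacy mechanisms, not a bare Gaussian fit — but the shape of the idea is the same: the synthetic set keeps the correlations, none of the rows.

```python
import numpy as np

rng = np.random.default_rng(7)

def synthesize_obfuscated(X_real, n_samples=1000):
    """Fit mean and covariance of real telemetry features, then
    sample new records from that distribution. Preserves the
    statistical shape (scale, correlations) while containing no
    actual log rows, hostnames, or IOCs."""
    mu = X_real.mean(axis=0)
    cov = np.cov(X_real, rowvar=False)
    return rng.multivariate_normal(mu, cov, size=n_samples)

# Toy "real" telemetry: 500 records, 3 numeric features.
X_real = rng.normal(loc=[1.0, 5.0, -2.0],
                    scale=[0.5, 2.0, 1.0], size=(500, 3))
X_synth = synthesize_obfuscated(X_real)
print(X_synth.shape)  # (1000, 3)
```

The per-feature means and covariances of `X_synth` track `X_real` to within sampling noise, which is exactly the property a cross-customer training set needs.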
It's a real, bounded, defensible use case. I was wrong to dismiss it.
3. Simulate the Environment
The machine needs to fail safely, or be tested repeatedly, in a world that behaves like production but isn't.
This is where synthetic data matters most for anyone building agentic SOC systems.
You cannot stress-test a judge-in-the-loop against adversarial prompt injection using real production logs. You cannot chaos-test an autonomous playbook by injecting corrupted timestamps and contradictory signals into your actual customer's telemetry. You cannot calibrate the thresholds on a Playbook Autonomy Score by waiting for enough real incidents at enough known severities to accumulate in the wild — that would take five years.
You need a synthetic environment. One where the agent thinks it's production, acts like it's production, experiences consequences like it's production — and then gets reset. The agent doesn't know it's in a sandbox. For an agent, it's all real. That's the whole point.
Every industry with high-consequence failures has figured this out. Surgical residents train in simulators before they touch a patient. Nuclear reactor operators run drills against synthetic meltdowns, which is how you avoid another Chernobyl. Expensive robots get pushed to failure in digital twins because breaking a real one sets back the clock by months. Push, break, fix, reset. That's the loop.
Security agents deserve the same treatment, and for the same reason: the blast radius of a confident agent that's wrong is enormous, and you don't get to discover that in production.
This is also where regression testing lives. Every time you swap a model (Haiku to Opus), update a system prompt, or retrain a detector, you need to know you didn't break anything. A synthetic eval set with known-correct verdicts, run through Phoenix or Arize on every change, is table stakes for shipping agentic systems responsibly. Anyone who tells you otherwise is going to ship a regression into production and not find out until a customer does.
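The regression harness described above can be sketched in a few lines. Everything here is hypothetical — `classify_alert`, the eval records, and the verdict labels are stand-ins for your real agent call and your real synthetic eval set — but the pattern is the point: fixed inputs, known-correct verdicts, a hard gate on every change.

```python
# Minimal regression harness: a synthetic eval set with
# known-correct verdicts, run on every model/prompt change.

EVAL_SET = [
    ({"src": "10.0.0.5", "event": "mfa_bypass_attempt"}, "escalate"),
    ({"src": "10.0.0.9", "event": "scheduled_backup"},   "benign"),
    ({"src": "10.0.0.7", "event": "c2_beacon_pattern"},  "escalate"),
]

def classify_alert(alert):
    # Stub: in production this would call the agent under test.
    event = alert["event"]
    return "escalate" if "mfa" in event or "c2" in event else "benign"

def run_regression(eval_set, min_accuracy=1.0):
    """Fail loudly if the agent's verdicts drift from the
    known-correct labels after a model or prompt change."""
    correct = sum(classify_alert(a) == verdict for a, verdict in eval_set)
    accuracy = correct / len(eval_set)
    assert accuracy >= min_accuracy, f"Regression: accuracy {accuracy:.2f}"
    return accuracy

print(run_regression(EVAL_SET))  # 1.0
```

Wire a harness like this into CI so that a model swap or prompt edit cannot merge until the synthetic eval set passes.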
The Line
Amplify, Obfuscate, Simulate. Three lanes. One principle underneath.
Amplify when reality doesn't give the machine enough of the rare thing.
Obfuscate when you can't move or publish the real thing.
Simulate when the machine needs to fail safely or be tested repeatedly.
Those are the three legitimate lanes. Everything else — especially the growing trend of synthetic users and simulated personas for product discovery — is innovation theater dressed up as research. It will produce products that are elegant, plausible, and completely unusable by the humans you forgot to talk to.
Synthetic data is AI MacGyver's duct tape, paper clips, and a Swiss Army knife. It fills the holes. It holds things together when the real part isn't available. That's legitimate. That's AI MacGyver energy.
But you don't do brain surgery with duct tape.
You don't build a relationship with a paper clip.
You don't replace a handshake with a Swiss Army knife.
The handshake is for humans. That's the real human connection — the empathy, the looking somebody in the eye, the "I got your back." That's the MacGyver energy that matters most, and it cannot be synthesized. It cannot be simulated. It cannot be LLM'd into existence.
The duct tape is for machines. The handshake is for humans. Know which is which, and you'll ship things that work.
I used to say "no synthetic data, period." Someone challenged me. I changed my position. I think that's what growth looks like. The world moves too fast right now for any of us to lock in a hard rule and stop listening.
What did I miss? I'm still learning. Tell me where I'm wrong — or where I'm still too lenient.
-Greg
P.S. Want to become an AI MacGyver?
The discipline of adjusting when reality doesn't match the plan — knowing when to use synthetic data and when to walk away from it — that's exactly what we teach in the UX for AI Professional Certification.
We cover the frameworks for shipping AI products that work: UX-led framing and validation strategies, human-in-the-loop agentic systems design, and the AI product discipline that separates the 15% that ship from the 85% that fail.
Cohort 1 sold out in weeks. Cohort 2 opens soon.
→ Get on the waitlist before it fills up: https://uxforai.com/c/certification