You are at best fighting a delaying action. You cannot even hold back the tide. We are losing.
Losing means that our current approach to defense is insufficient to stop the existing threats and adversaries we are facing. I believe that our defenses have been, and will continue to be, overrun, and that we are in constant catch-up mode.
A starting point
Before we start:
- I see immense value in the infosec profession and those in it, I truly love my work
- I’m not going to the dark side and I’m not calling for stronger laws
- I’m no Luddite and I believe there is a way out (also not trying to sell you anything)
Now, go read Dan Geer’s posting on people in the loop, then that golden oldie from Spafford on solving the wrong problem, followed by Mike Murray’s 2009 Hardest Career (or Geer’s why it’s the most challenging) and finally HDMoore’s Law. Then wrap it up with a review of anything tagged #cyber, #APT or #breach.
Rather than linking and referencing every line, I’ll summarise the points from the listed articles to support my argument:
- Technology evolves at an alarming rate and to secure it you have to stay in front of it; unfortunately there are probably new technologies you don’t even know of, as users adopt new tech at stupefying rates, and the technology that mattered yesterday may be irrelevant tomorrow;
- We are always applying imperfect defenses to protect a fundamentally flawed system, the proverbial wrong cure for an unacknowledged disease;
- The pool of things we have to defend is growing at geometric and sometimes exponential rates (there is no linear), but even worse these things have complexity as both a planned and an emergent property. The threats we protect against are continuously improving in capability while growing in number;
- Our capacity to create and move data grows in leaps and bounds but our capacity to protect it does not; and
- Our defenses are only tested to defend against the weakest attackers and our compliance driven approach focuses on only doing enough.
(My apologies to the authors of these august pieces if I misinterpreted, bastardized or did your works any harm – the failing is in me, not in your excellent writing).
Rochambeau
You will lose this fight because we are defenders in a knock-down, drag-out fight that we treat as a slow cold war. We are few and our opponents are legion: from skilled hackers sponsored by foreign nations to splinter-cell-styled anti-sec groups to millions upon millions of malware-infested computers. They are smarter than most of us (either in general or in specific domains that count), have critical knowledge before you do and can move faster, with better automation driving purpose-built tools. They will always have more resources to throw at the problem, whether it’s actual money, bodies, time or computational power; they are not constrained by budgets, management decisions or project timelines. In some cases they don’t even have to vet decisions with anyone; sometimes they’re just cruel, random algorithms.
We cannot win because you operate alone, responsible for the security of thousands of systems, millions of lines of code across a network with more interconnects than you can count, shepherding users who (irrationally) resist you at every turn, adopting technologies faster than you can track. The tools and resources you rely upon are flawed and incomplete but present to you a view that suggests otherwise, giving you false confidence. The resources you are given grow at a linear rate, if at all; but worse, you are defending against the last threat, not the current one. You probably cannot easily answer the question “what happened?” and you almost certainly cannot answer the question “what’s going on right now?”. You are constrained by rules, employers and operational realities. All they need is one small hole; all you need is a perfect defense and therefore you will lose.
This losing fight happens against a background of more and more of our society being cast into silicon; more of our business processes and decision making happening in software. Our data grows at ever increasing rates and data begets data, brutally compounding as we go. Yet our investment in security does not keep pace with our investment in technology, the colleges and universities pump out many times more engineers and developers each year than they do security practitioners. This is sad, because while we’re not action heroes, we are (like many other quiet professions) defending our economy and our lifestyles against great harm.
The Root of it
We are here because we have outdated beliefs about the scale and capabilities of our adversaries. We’re here because we still use tools and methods intended to handle small volume threats, relying on conceptual archetypes that don’t match reality. Most importantly we are here because we are attempting to think like the multitude of adversaries and beat all of them at all of their own games, yet we are poor emulation platforms at best.
We are all individually losing and it will remain so if we stay on this path of current thinking and practices; if we fail to acknowledge the scope, scale and speed of our adversaries. The cloud (or whatever the technology du jour is) will not save us; it will certainly allow complex feats of engineering and give birth to powerful tools to help the fight, but it is still technology built on the same foundations and governed by the same security practices, the same business thinking that led us here. Even the most type-safe language with the most robust framework is still owned by a business person who will always prioritize everything above security (even after they’re bleeding from a recent breach).
Now, Bathwater
Place no ultimate reliance on your penetration testers or auditors because they’re most likely less skilled and less aggressive than your actual adversaries. Even if they find nothing, the test wasn’t exhaustive and cannot prove you’re secure; all it can tell you is that you’re perhaps protected against what you were tested for. Do not consider achieving compliance objectives or the greening of your red-yellow-green risk matrices as being secure, for those are ultimately exercises in compromise. Perhaps they are acceptable compromises, but still compromises in the face of adversaries that need only the smallest slip.
Recognize that every single defense fails given time, but even more importantly that every defense is inherently incomplete, at best 1% shy of good enough. You cannot rely on single technologies even when using defense-in-depth.
If security is the hardest and most challenging career, then it stands to reason that most of the non-security people you will deal with, even the most profoundly deep engineers and shrewdest business people, will not understand the problem in its entirety. Don’t expect the people you’re charged with protecting to make rational decisions even after your best educational efforts.
A plan, maybe
The following formulation is rough and should sound somewhat familiar to students of new school/big data thinking.
Start with raw data collection: set yourself an arbitrary target of gathering 1,000 data points per user per day (or 10,000 per system) and then invest towards achieving that goal; make sure to think about increasing the diversity of your data sets as well. Data points come in the form of user actions, system events, file changes, transactions, email flow, security decisions, literally anything that leaves a trace or causes a change in data, anything that could be used to formulate a metric. The initial number of data points is arbitrary, just a starting point to be tuned up or down. Now that you have the data, build out analysis capabilities, automating everything you can and streamlining everything you can’t. Note that I’m not talking about SIEM, that’s too narrow; you need non-security data points and thinking in the mix too.
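To make the counting concrete, here is a minimal sketch in Python of what tracking that target could look like; the event records, field names and the 1,000-per-user figure are illustrative assumptions rather than a prescribed schema, so count whatever traces your environment actually produces.

    # Minimal sketch: count data points and source diversity per user per day.
    # The event format and the 1,000 target are assumptions for illustration.
    from collections import Counter, defaultdict
    from datetime import date

    TARGET_PER_USER_PER_DAY = 1000  # arbitrary starting point, tune up or down

    # Anything that leaves a trace is a data point: a log-in, a file change,
    # an email, a proxy request, a security decision.
    events = [
        {"user": "alice", "day": date(2012, 5, 1), "source": "proxy"},
        {"user": "alice", "day": date(2012, 5, 1), "source": "authentication"},
        {"user": "bob",   "day": date(2012, 5, 1), "source": "email_gateway"},
    ]

    def coverage(events):
        """Data points and distinct sources per (user, day), plus % of target."""
        counts = Counter()
        sources = defaultdict(set)
        for e in events:
            key = (e["user"], e["day"])
            counts[key] += 1
            sources[key].add(e["source"])
        return {
            key: {
                "data_points": n,
                "source_diversity": len(sources[key]),
                "pct_of_target": round(100.0 * n / TARGET_PER_USER_PER_DAY, 2),
            }
            for key, n in counts.items()
        }

    for (user, day), stats in coverage(events).items():
        print(user, day, stats)

The point of the per-target percentage is simply to make under-collection visible so the next investment can be directed at the gaps.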
Whatever the data tells you, treat it as a hygiene problem or as infection control, not risk management. If your population is vulnerable to a certain problem or exhibiting unwanted behaviour, set up a program that uses multiple techniques to reliably eliminate it and use data to prove progress, augmenting the techniques as you experience drop-off in effectiveness on your journey to 100%. Of course, this is really just another form of risk management, just using different language to evoke support and calibration to match the scale of the problem.
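As a rough illustration of the hygiene view (all numbers below are invented), the figure you report is simply the percentage of the population free of the problem and how far each remediation push moves it towards 100%:

    # Minimal sketch: track a hygiene program's progress towards 100%.
    # Snapshots are hypothetical: (label, population size, systems still affected).
    snapshots = [
        ("baseline",          5000, 1200),
        ("after patch push",  5000,  400),
        ("after config push", 5000,   90),
    ]

    previous = None
    for label, population, affected in snapshots:
        clean_pct = 100.0 * (population - affected) / population
        delta = "" if previous is None else f"  change: {clean_pct - previous:+.1f} pts"
        print(f"{label:<18} clean: {clean_pct:5.1f}%{delta}")
        previous = clean_pct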
There will be some problems that you don’t need data to identify and some that you don’t have data to prove yet. For the former you can still run hygiene programs and use data to prove a change occurred; for the latter, simply continue to expand your data collection and analysis program. Initially you’ll be collecting data with limited direction but over time you’ll develop hypotheses and determine the data you need to support them.
With your data collection program in full swing, start creating and inserting security instrumentation into every business and IT process. Your next objective is to direct investments towards increasing coverage of security instrumentation. Use this instrumentation to increase the metrics for driving your hygiene/infection control programs and use each program to push more instrumentation. One of the processes that must be instrumented is your own data collection process; tracking and reporting on its performance will help you understand if you’re properly equipped to analyze the flow of information. As your program matures, focus on increasing the data points collected per dollar and the data points analysed per dollar (these are metrics that could work across sectors).
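A trivial sketch of those two program-level metrics follows; the volumes and costs are made-up placeholders, and how you split dollars between collection and analysis is your call:

    # Minimal sketch: data points collected per dollar and analysed per dollar.
    # All figures below are placeholders, not benchmarks.
    def per_dollar(data_points, dollars):
        return data_points / dollars if dollars else 0.0

    quarters = [
        # (quarter, points collected, points analysed, collection $, analysis $)
        ("Q1", 40_000_000, 10_000_000, 50_000, 80_000),
        ("Q2", 90_000_000, 35_000_000, 55_000, 85_000),
    ]

    for quarter, collected, analysed, collect_cost, analyse_cost in quarters:
        print(quarter,
              f"collected/$: {per_dollar(collected, collect_cost):,.0f}",
              f"analysed/$: {per_dollar(analysed, analyse_cost):,.0f}")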
Compared to the classic risk management approach, which attempts to balance likelihood and impact against cost, a hygiene-centric program focuses on strengthening the population to be resistant against threats: you are either resistant or you’re not. This approach would permit collaboration across diverse groups, focusing on shared strategies for improving health and eliminating systemic problems. More importantly, it provides supporting data for scaling to meet ever increasing complexity and evolving threats. The best part about this approach is you’re finally in a position to provide evidence that you are secure against specific threats, and that’s actually something the business can understand.
While data, even lots of it, won’t allow us to get entirely ahead of the adversaries, it will allow us to respond faster, immunizing parts of the population not yet affected, sacrificing (but eventually curing) a few front runners for all our benefit. These unlucky front runners could signal to the rest of us and enable early detection, hopefully accelerating our response through automation or efficiency. We’re starting to see some of that in our anti-malware solutions but it’s limited and disconnected; we need this collective defense to be pervasive, as free as possible, cross technology, cross vendor, cross industry and cross border.
<Fin>
Perhaps hard to believe, but this is not a pessimistic piece and no FUD is intended; while there perhaps is some fear, there is no uncertainty or doubt, because you know what we’re doing isn’t working and needs to be better. There certainly are answers, there is certainly a way forward, which by necessity leverages all that we have done before. This is an optimistic piece (although the optimism is poorly conveyed) because I think we are, as Geer so eloquently puts it, “… at a cross roads, an inflection point” and now is a great time to build powerful new techniques and practices for our profession.
These ideas are not mine alone and the formulation of this text is entirely dependent on the bigger thinking in the articles I referenced above. Hopefully this gives you a new way to talk about the scale and scope of the problem, a way to build a supporting argument for capabilities that can operate at the scale we need to win.
(Many thanks to those that reviewed the first draft, I don’t name you because I didn’t ask your permission but feel free to take credit.)
Addendum
Where to start? Some examples would be nice, no?
Here are a few metrics (and their potential sources), the outcome of collecting data points. Some I’ve used in the past (most created with the help of unnamed colleagues), some I plan to use. Few of these will tell you that something is absolutely wrong; all of them will give you hints necessitating further investigation. Ultimately it’s up to you to decide what question you’re trying to answer and which is the appropriate metric to collect data points for:
- Time to patch – average time between a patch being available and being applied; smaller is better, higher signals inefficiencies or other hygiene problems (see the sketch after this list) – possible sources: patch management systems, vendor bulletins, vulnerability scanners
- Time to detect – average time between a vulnerability being known and being discovered on your systems; smaller is better, higher signals insufficient visibility or poor change management – possible sources: vulnerability scanners, vendor bulletins
- Time to respond – time between a security issue being discovered and resources solving the problem – possible sources: ticketing system, vulnerability managers
- System currency – what percentage of systems carry up-to-date security definitions? less than 100% means your defenses aren’t going to stop that new problem – possible sources: client management systems, anti-X management consoles
- Time to currency – how long does it take for 100% of your population to update to the latest security definition? more than a day may mean you can’t get out ahead of fast moving malware – possible sources: client management systems, anti-X management consoles
- Population size – number of systems that have X configuration or Y software – possible sources: client management systems, vulnerability and configuration management
- Vulnerability resilience/Vulnerable population – how many of your systems are immune or vulnerable to the current top ten vulnerabilities? or any vulnerability? – possible sources: client configuration management, vulnerability management
- Average vulnerabilities per host – on average, how healthy is your population? upticks may signal a need for more maintenance – possible sources: vulnerability management platform
- Vulnerability growth rate versus review rate – is your team detecting more vulnerabilities faster than they can review them? you may be under resourced or experiencing a rapid influx of new/unpatched systems – possible sources: ticketing system, vulnerability management platform
- Infection spread rates – how many systems are infected now versus an hour ago? if that number ticks up, infections are spreading faster than you are cleaning them – possible sources: anti-x management console
- Matched detection – what percentage of vulnerabilities or infections were confirmed by two different instruments (two vuln scanners, two anti-x detectors, a vulnerability scanner and a patch manager)? Matches suggest correct functioning but mismatches hint at early deficiencies in the tech – possible sources: vulnerability scanners, anti-X platforms, patch management
- Unknown binaries – the number of executables in your environment that you’ve never seen before – possible sources: anti-x, host intrusion detection systems
- Failure rates – sometimes processes crash or just stop, but many crashes in specific processes (repeated or across the population) may be evidence of breach activity – possible sources: system logs
- Restart rate – system restarts are normal, but too many may indicate an emerging security problem, and not enough may indicate security upgrades aren’t happening – possible sources: uptime monitors, DHCP servers, system logs
- Configuration mismatch rate – how often do systems depart from specified good/safe configurations? – possible sources: configuration management, system logs, help desk tickets
- Configuration mismatch density – how many different configuration rules does the average system break? the higher the number the less consistent your defences might be – possible sources: configuration management
- Average password age and length – passwords are usually set by users; if they’re changing too often you may have an access problem, and if they’re not changing often enough (more than X days, or only on the last possible day) you may have an awareness issue. Similarly, if average password length is at the minimum permitted by policy, it may signal users don’t understand their responsibilities; if the average is below the minimum then your authentication system rules aren’t being enforced properly – possible sources: authentication directories, system logs, password testing tools
- Directory health – how many machine/user objects have all the data you need about them (location, purpose, owner)? Also, how many have been inactive for a long time? Lots of inactive systems are indicative of a failure in systems maintenance – possible sources: directory servers
- Directory change rate – what is the average number of transactions in your directory system on a daily basis? a higher rate may signal cowboy IT or churn due to an emerging security problem – possible sources: directory server logs
- Time to clear quarantine – how long does an infected file sit in quarantine before someone looks at it (or a non-compliant system sit in a quarantine network)? a big number may suggest you’re under resourced or experiencing higher infection rates – possible sources: anti-x management consoles, network access control
- Access error rates per logged in user – how many times per hour do your systems respond with access denied? if that number spikes someone may be probing your environment – possible sources: LDAP/RADIUS/Auth sources
- Groups per user – how many permission groups does the average user belong to? high numbers indicate access creep and poor maintenance; low numbers indicate not enough access decision making – possible sources: directories, application configurations, databases
- Tickets per user – how often are users asking for help; an increase could signal a systemic issue, a failed security technology or an emerging security issue – possible sources: help desk systems
- Access changes per user – this is a measure of churn, if changes are too frequent it may signal your authorisation architecture isn’t working as expected and perhaps users are experiencing access creep – possible sources: service desks, directory server logs, application logs
- New web sites visited – most users tend to visit the same websites on a daily basis, significant departures from that may be signs of C&C communications or drive-by downloads – possible sources: proxies and web filters
- Connections to botnet C&C’s – how many of your computers are trying to phone home to known botnet command and control systems? this could be a sign your anti-X isn’t working well – possible sources: intrusion detection, Malware RBLs, web filters, anti-X management consoles
- Downloads and uploads per user – are users suddenly moving more data than usual? This may be indicative of an intruder stealing information – possible sources: netflow, system logs
- Transaction rates – how many transactions are your systems processing within an hour? a spike may correlate to some business event or be a sign of unwanted activity – possible sources: ERP systems, application logs
- Unapproved or rejected transactions – it’s not unusual for some transactions to have problems but lots of transaction issues may signal an emerging security issue – possible sources: ERP systems, application logs
- Email attachment rates – do emails carry attachments more often than normal (in either direction)? Could be a sign of malware blooms – possible sources: mail gateways, spam filters
- Email rejection/bounce rates – how many incorrectly addressed emails are arriving at your organization on an hourly basis? Spikes could indicate unfocused phishing attempts – possible sources: mail gateways, spam filters
- Email block rates – is your mail system blocking more emails than usual? could be a sign of incoming malware – possible sources: mail gateways, spam filters
- Log-in velocity and Log-in failures per user – how many log-in attempts (and failures) do you have on average per hour, tracked by time and day? a spike may indicate someone is trying the doors – possible sources: directory servers, applications, server logs, voicemail managers
- Application errors – some application errors are normal but an increase may signal probing attempts – possible sources: application logs
- New connections – look for systems that the network has never seen before, or systems talking to each other that have never done so before; this may indicate a breach or poor production promotion/change management – possible sources: netflow collectors, firewall logs
- Dormant systems – old systems are often unattended and unpatched; look for systems with few data changes, limited log-ins and consistently low CPU utilization – possible sources: systems monitoring, virtualization servers, backup tape logs, directories
- Projects without security approval – How many projects go into production without security involvement? Every one of those projects is a potential Typhoid Mary, so find and immunize them – possible sources: help desk, project management office, time sheets, finance
- Changes without security approval – How many changes occur without a security review? Every one of those changes could have unintended consequences on the population you care for – possible sources: help desk, change approval registers, network and system monitoring, host intrusion detection
- Average security dollars per project – How many security person-hours (or dollars) are spent per project? The less contact security has with a project, the more likely it is to have hygiene problems – possible sources: project office, finance, time sheets
- Hours per security solution – is your team spending enough time with the tools you do have? if low, this may signal under resourcing or potentially that the tools aren’t providing value; if high, you may be inefficient in your resource usage – possible sources: system logs, time sheets
- Hours on response – Are your team members spending too much time investigating incidents? this may be a resourcing issue or a call for more automation – possible sources: time sheets
- Lines of code committed versus reviewed – how fast is the code base for an app growing versus how many lines of code your appsec team is actually reviewing? your coverage may be too low to catch problems – possible sources: versioning system, code repositories
- Application vulnerability velocity – is your appsec team filing bugs faster or slower than the engineers can address them? if faster you may need a different defensive approach; if slower, make sure they’re actually being closed out – possible sources: bug trackers
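To show how little machinery some of these need, here is a minimal sketch in Python computing two of them, time to patch and system currency; the record formats, hostnames and version strings are invented, and in practice the inputs would come from your patch management systems and anti-X consoles:

    # Minimal sketch: compute "time to patch" and "system currency" from
    # hypothetical records (real inputs would come from patch management
    # systems and anti-X management consoles).
    from datetime import date
    from statistics import mean

    # (patch id, date the vendor released it, date we finished applying it)
    patch_records = [
        ("patch-001", date(2012, 3, 13), date(2012, 3, 20)),
        ("patch-002", date(2012, 4, 10), date(2012, 5, 2)),
    ]

    def time_to_patch(records):
        """Average days between a patch being available and being applied."""
        return mean((applied - released).days for _, released, applied in records)

    # Security definition version per host versus the latest available version.
    definitions = {"host-01": "5.120", "host-02": "5.120", "host-03": "5.118"}
    LATEST = "5.120"

    def system_currency(defs, latest):
        """Percentage of systems carrying the latest security definitions."""
        return 100.0 * sum(1 for v in defs.values() if v == latest) / len(defs)

    print(f"time to patch: {time_to_patch(patch_records):.1f} days")
    print(f"system currency: {system_currency(definitions, LATEST):.1f}%")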
These are probably not enough and almost certainly don’t think big enough. There’s lots of data out there already – netflow, server logs, management consoles, application databases, ERP systems – so go mine it; have a look at your existing solutions, some may have reports that provide these metrics already. Also look towards your peers in the industry as well as your suppliers and customers, and establish metric and data sharing relationships with them. Check out Securosis and Metricon/securitymetrics.org; there are good ideas for metrics and data points there. Once you have the data, the real work begins as you analyse it, flag issues and then respond; the performance of that process (speed, coverage etc…) should also be measured (and then optimized by calculating per-dollar-invested relationships).