Identifying, Containing, and Responding to Incidents

by | Incident Response

ALERT! ALERT! This is not a drill! You have an alert that was identified as legit, was pressure-tested, checked, and escalated to containment. Something is truly up! Now what?

Identifying Incidents

To be clear, the first step, identifying an incident, is far from trivial. Many incidents go undetected for weeks or even months. On the other hand, sometimes “events” that look like security incidents are not security events at all: hundreds of events in the normal course of doing business can be mistaken for a security incident.

An application may modify a configuration file, and you get an alert: is it an incident or a normal function? A file server is slow: incident or just a heavy traffic load? Access to the Internet is spotty: incident or ISP errors? A file modification alert has been issued: malware or normal use? An email with a threat alert: is it mere spam or a sign of things to come? Unfortunately, all of these will need to be looked at through your cybersecurity glasses, even if they end up being little more than routine bugs.

Attempting to catalog all this, much less make sense of it all, is not something humans can manage on their own.

Enter SIEM. Security Information and Event Management systems ingest all this information (logs, threat feeds, etc.), correlate it, and issue alerts about abnormal events in your environment.

What’s not to love? Nothing. Except that not all SIEMs are alike, and, like all relationships, you need to put work into it to get the benefits out of it.

First, you should do your due diligence and select the right SIEM for your environment. They come in all sorts of sizes and flavors, including MDR/SIEM-as-a-service options. Second, depending on which system you deploy, you may have to fine-tune it. Make your SIEM too paranoid, and every alert issued becomes a ticket to be investigated. But tune it to “apathy” and you won’t be notified until your data center has been pounded into fine silicon dust.
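To make the tuning trade-off concrete, here is a minimal Python sketch. It assumes a hypothetical SIEM that assigns each event a risk score from 0 to 100 and opens a ticket whenever the score meets a threshold. The scores, the `tickets_opened` helper, and the three threshold values are all invented for illustration and do not reflect any particular product.

```python
# Hypothetical risk scores (0-100) that a SIEM might assign to one day's events.
# All values are invented for illustration.
EVENT_SCORES = [5, 12, 18, 25, 33, 41, 47, 58, 62, 71, 88, 95]


def tickets_opened(scores, threshold):
    """Return the scores that would become tickets at this threshold."""
    return [score for score in scores if score >= threshold]


# Compare a "paranoid", a "balanced", and an "apathetic" setting.
for label, threshold in [("paranoid", 10), ("balanced", 60), ("apathetic", 99)]:
    count = len(tickets_opened(EVENT_SCORES, threshold))
    print(f"{label:>9} (threshold {threshold}): {count} tickets")
```

With these made-up numbers, the paranoid setting turns almost every event into a ticket, while the apathetic one stays silent even for the highest-scoring event. Real tuning involves far more than a single threshold, but the shape of the trade-off is the same.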

Once tuned, a good SIEM will be your best friend and cybersecurity partner. You’ll be working hand-in-hand in processing the alerts, creating tickets, documenting trends, correlating with threat intelligence, and escalating to containment and treatment. All this beautifully orchestrated work will also be feeding into your cyber documentation tools, which your reporting workflows can use to notify authorities, incident information sharing organizations, wikis, and late-night glee clubs.

The catch, of course, is that reaching this level of incident-detection bliss requires not only the right SIEM, properly configured, and the right systems-management practices (e.g., documentation, normalization, synchronization), but also the right defense-in-depth strategy to begin with. That strategy, you will recall, is the one that resulted in the deployment of the right controls being monitored in the first place.

And how did you get to the right defense-in-depth strategy? You got there because you did all the hard work during the asset, threat, vulnerabilities, and environments phases.

Containing Incidents

Once you’ve got your incident response plan in place, all you need is an incident! If you’re lucky, you’ll wait for a nice, long time, giving you all the training opportunities your team needs. But if you’re not… What’s next?

  • Step one: Do not panic! You’ve got this!
  • Step two: Identification. What are you dealing with? What are your detective controls telling you? Trust your controls! Don’t second-guess them.
  • Step three: Analysis. This is elbow-grease work, requiring unique skills and specialized tools. Your team of experts will be asking a lot of questions: What is the payload? What is the vector? What does the threat feed tell us? How did it get in? What is the payload doing? How is it doing it? Why is it doing it?
  • Step four: Action. Once you know who, how, and what, you need to be ready to act. Do you want to maintain the evidence for possible future actions (for example, prosecution)? Or do you want to get back to operations as soon as possible, with little regard to evidence preservation? That is, of course, a business decision and can only be taken at the board level. Do you want to tolerate the infection while you develop workarounds? These are decisions that you should have thought out ahead of time, and with board approval, and practiced with both your incident team and your top-level management.

Let’s look more closely at step four, action. Action essentially means containment. To properly contain the incident, you’ll need the results of step three, analysis. You need to know who, how, what, and, if possible, why.

  • Who: Who is behind the attack? Is this an accident? Is this an insider? Is this an act of terrorism? What do the threat feeds say about who’s behind it? What is the motive? Understanding the who will help you develop methods of containment and treatment.
  • How: What is the attack vector? Malware, and if so, which kind? Is this the result of a virus? A Trojan? A worm? Did someone plug in an infected USB somewhere? Did someone click a link in an email? Or was this a “drive-by shooting,” such as malware dropped on a computer while someone was just browsing? Or was the breach perhaps the result of stolen credentials? It may be one or more of these things or none of them. There could be some entirely other, more unusual method that the attacker used to compromise your systems. Your team will need to identify the how in order to contain it and potentially identify ways to preserve forensic data.
  • What: What part of your system was hit? A single endpoint or multiple? Servers? Applications? Networks? Your forensic people will need to do a detailed end-point analysis on the affected systems to collect evidence. This will include any kind of tracks that the attacker left behind, including a bit-by-bit copy of what is in the volatile memory (not just the permanent storage [e.g., hard disk, SSD]). They will need to establish the incident timeline: When did the attack start, how did it propagate, what did it leave behind, what are the effects? Part of this process involves highly technical work that may include isolating the malware on a system for observation, or reverse engineering the malware to understand what the code is doing. Once all this is understood, then all systems across the company will need to be forensically verified that they have not been infected as well.
  • Why? What made you a target? Are you just a random victim of greed? Are you a political target? Are you collateral damage of a much larger attack? Roadkill on the information highway? Know this, and you may be able to learn a lot more about your adversary and build better future defenses.
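The incident timeline mentioned under "What" can be sketched in a few lines of Python. This is only an illustration: the hosts, timestamps, and observations below are invented, and `build_timeline` and `first_seen_per_host` are hypothetical helper names. Real forensic timelines are assembled from many evidence sources with specialized tools.

```python
from datetime import datetime

# Hypothetical evidence records: (ISO timestamp, host, observation).
RAW_EVENTS = [
    ("2024-03-02T09:15:00", "fileserver-1", "mass file modification alert"),
    ("2024-03-01T22:40:00", "laptop-17", "phishing link clicked"),
    ("2024-03-02T01:05:00", "laptop-17", "unknown binary executed"),
    ("2024-03-02T03:30:00", "fileserver-1", "login with stolen credentials"),
]


def build_timeline(events):
    """Parse the timestamps and return the events in chronological order."""
    parsed = [(datetime.fromisoformat(ts), host, note) for ts, host, note in events]
    return sorted(parsed, key=lambda event: event[0])


def first_seen_per_host(timeline):
    """Map each host to its earliest observation, in order of propagation."""
    first = {}
    for when, host, _ in timeline:
        first.setdefault(host, when)  # keep only the first (earliest) sighting
    return first


timeline = build_timeline(RAW_EVENTS)
for when, host, note in timeline:
    print(when.isoformat(), host, note)
```

Sorting the evidence answers "when did the attack start," and the first sighting per host gives a rough picture of how it propagated.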

It bears repeating that in the incident containment phase, knowing what not to do is as important as knowing what to do. If your company does not have the resident expertise, a panicked executive or IT person being flooded by alerts may be tempted to pull the plug on the infected computer, try to use some I-got-it-from-the-Internet-malware-removal-tool, or, worse yet, log in as the administrator to investigate it or call a friend in the FBI.

This is why it is so critical to have a plan in place before the incident and to practice, practice, practice! Knowing what to do and how to do it gets you halfway toward resolution.

Treating Incidents

When it comes to treatment, I have bad news and bad news. First, the bad news.

Sometimes, you may have to live with the disease. You may need to live with it because critical business considerations prohibit you from isolating or rebuilding one or more systems. Or you may need to keep the infected systems running so that the attackers don’t realize that you are aware of them while shielding more valuable assets or sending Delta Force over to kick in their doors.

Similarly, there may be upstream or downstream dependencies that require you to get creative. For example, because of business requirements of nonstop use, you may need to restore the infected system onto brand-new hardware and then reinsert it into production while taking the infected system out. That can prove about as easy as replacing an aircraft engine while it’s in flight. It’s doable but very, very windy.

Other times, you may be required to preserve the infected systems for evidence. That’s all fine and good if you have one server that’s infected; it’s a whole other story if you have 50 of them sneezing and wheezing with a virus. Complicating things further is virtualization. These days, you can have multiple virtual computers running on a physical computer. Despite ironclad assurances from virtualization vendors that there can never be cross-contamination across virtual machine boundaries, I remain a skeptic. I am a firm believer that if one person can build it, another can break it. So I would not be surprised if a virus was created that could jump the virtualization boundary, despite whatever assurances you’ve been given.

Once the evidence has been preserved, requirements have been met, all the t’s crossed and all the i’s dotted, next comes the fun part of coordinating the treatment of one or more infected systems. Not only do you need the IR team on the same page, you need the whole organization to know which systems may need to go offline, for how long, what the impact may be, and so on and so forth. Depending on the number of systems, this can be an expensive, resource-intensive exercise. Plus, you’ll make a whole new set of friends when you announce that all the passwords have been changed.

Which brings us to the bad news.

The only way to ensure that an infected system has been cured is to go to “bare metal.” This means that you need to completely and thoroughly wipe the infected system and rebuild it bottom-up from clean installation media. There is no sugarcoating this. If you need to be 99.99 percent sure that the infection is gone, the only way to do it is by “burning the box” and reinstalling everything new.

Why only 99.99 percent sure, even after a clean wipe? Because word on the street has it that some agencies have developed malware that infects the actual hardware itself. Allegedly, this type of malware can infect the BIOS (basic input/output system) of a computer or certain EPROM (erasable programmable read-only memory) chips, such that even after you wipe the computer, the malware can reinstall itself. As unlikely as this may sound today, do keep it in the back of your mind. That way, you won’t be surprised if the unlikely happens!

Of course, even wiping a single system is a time-consuming exercise. Not only do you need to have the clean installation media at hand, you also need to account for configuration settings, clean backup availability, data synchronization when the system gets reinserted into production, and so on and so forth. Now, multiply this by the number of infected systems.

Let’s end with a tiny bit of good news: this moment is where your disaster recovery planning will shine. Having clean, recent, whole-image air-gapped backups for each system will go a long way toward a rapid recovery.

Incident Recovery

With the incident treated, you are ready to bring your systems back online. There are several questions that need answering here, bridging the incident-treatment process and final incident recovery.

First, long before any incident, you must have considered whether you will need to preserve any evidence. If you are required to do forensic preservation, then you cannot wipe the infected equipment and rebuild. You’ll need replacement gear. Moreover, the documentation needs and requirements for evidence preservation are extensive. Your team will need to be familiar with all the laws and requirements for evidence handling, chain of custody documentation, proper evidence gathering, and so on. Your best bet is to retain a forensic firm to help you. They will interface with your attorney and your incident response team and make sure that all the evidence is taken care of properly.
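One small, mechanical piece of evidence handling is proving that a file or disk image has not changed since it was collected. A common way to do that is to record a cryptographic hash at every hand-off. The Python sketch below uses the standard library's `hashlib`; the `custody_entry` helper and its field names are hypothetical, and your forensic firm and attorney will dictate what a real chain-of-custody record must contain.

```python
import hashlib
import os
import tempfile
from datetime import datetime, timezone


def sha256_of_file(path, chunk_size=1024 * 1024):
    """Hash a file in chunks so large disk images need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def custody_entry(path, handler, action):
    """Build one chain-of-custody record (field names are hypothetical)."""
    return {
        "evidence": path,
        "sha256": sha256_of_file(path),
        "handler": handler,
        "action": action,
        "recorded_at_utc": datetime.now(timezone.utc).isoformat(),
    }


# Demonstration with a throwaway file standing in for a disk image.
with tempfile.NamedTemporaryFile(delete=False) as sample:
    sample.write(b"pretend this is a disk image")
    sample_path = sample.name

entry = custody_entry(sample_path, handler="analyst-1", action="collected")
print(entry["sha256"], entry["action"])
os.remove(sample_path)
```

If the hash recorded at collection matches the hash computed later, the copy is bit-for-bit unchanged; if it differs, something altered the evidence along the way.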

Second, your incident recovery strategy hinges on your previously determined values of MTD (maximum tolerable downtime), RPO (recovery-point objective), and RTO (recovery-time objective). You have documented all of this before any incident, and they have governed your disaster recovery strategy. Here and now is where all this will be tested.
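Here is a rough Python sketch of how those three targets can be checked against an actual recovery. It makes simplifying assumptions: one system, a single outage, the data-loss window measured from the last good backup to the start of the outage, and the same downtime figure compared against both RTO and MTD. The `recovery_report` function and all dates and targets are invented for illustration.

```python
from datetime import datetime, timedelta


def recovery_report(last_good_backup, outage_start, service_restored, rpo, rto, mtd):
    """Compare one recovery against its RPO, RTO, and MTD targets."""
    data_loss_window = outage_start - last_good_backup  # work lost since the backup
    downtime = service_restored - outage_start          # how long the system was out
    return {
        "data_loss_window": data_loss_window,
        "downtime": downtime,
        "rpo_met": data_loss_window <= rpo,
        "rto_met": downtime <= rto,
        "mtd_met": downtime <= mtd,
    }


# Invented example: 10 hours of data lost, 18 hours of downtime.
report = recovery_report(
    last_good_backup=datetime(2024, 3, 1, 23, 0),
    outage_start=datetime(2024, 3, 2, 9, 0),
    service_restored=datetime(2024, 3, 3, 3, 0),
    rpo=timedelta(hours=24),
    rto=timedelta(hours=12),
    mtd=timedelta(hours=48),
)
for key, value in report.items():
    print(f"{key}: {value}")
```

In this made-up case the RPO and MTD are met but the RTO is missed, which is exactly the kind of finding that belongs in the post-incident review.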

As with all things, recovery from an incident will fall on a spectrum. There will be companies whose MTD is practically zero and will therefore have hot failover sites, air-gapped (i.e., non-Internet accessible) backup and archives, and multi-timeline mirrors. (This refers to a mirroring strategy and equipment that creates isolated versions of mission-critical systems over predefined timelines—for example a real-time mirror, a day-old mirror, a week-old mirror, a month-old mirror, and so on.) Those companies will be using an army of resources to recover, because that’s what it’s worth to them and frequently is what is required of them (think utilities, national security installations, water purification, air traffic control, etc.).
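The value of multi-timeline mirrors shows up when you have to decide which copy to restore from: you want the freshest mirror that still predates the compromise. The Python sketch below illustrates that selection. The tier names, the lag values, and the `pick_clean_mirror` helper are hypothetical, and it assumes you already have a trustworthy estimate of when the compromise began.

```python
from datetime import datetime, timedelta

# Hypothetical mirror tiers and how far each one lags behind production.
MIRROR_LAGS = {
    "real-time": timedelta(0),
    "day-old": timedelta(days=1),
    "week-old": timedelta(days=7),
    "month-old": timedelta(days=30),
}


def pick_clean_mirror(mirror_lags, now, compromise_time):
    """Return the freshest mirror whose contents predate the compromise, or None."""
    snapshot_times = {name: now - lag for name, lag in mirror_lags.items()}
    clean = {name: t for name, t in snapshot_times.items() if t < compromise_time}
    if not clean:
        return None  # every mirror was taken after the compromise began
    return max(clean, key=clean.get)


now = datetime(2024, 3, 10, 12, 0)
estimated_compromise = datetime(2024, 3, 7, 8, 0)
print(pick_clean_mirror(MIRROR_LAGS, now, estimated_compromise))  # week-old
```

In this example the real-time and day-old mirrors already contain the attacker's changes, so the week-old mirror is the newest one that can be trusted.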

Then, there are the majority of companies that can afford an MTD of hours or days. Their strategy is more appropriate to that timeline, meaning that the team has some runway to be able to restore systems to production. That is not to say that there is a company out there that doesn’t want the absolute minimum MTD. This is to say that the majority of companies are smart enough to not overpay for disaster recovery when they can tolerate a few days’ worth of an outage.

There may be a question about whether those smarts are developed from proper risk management–centric thinking or from “Are you crazy? I am not paying all this money for backup!” reactions. Either way, most IT departments have in place a budget-appropriate solution that has also been implicitly accepted as recovery-appropriate.

Lastly, there are (hopefully) a minority of companies that have not thought through any of this. For them, incident recovery is…challenging. Some may never recover and go out of business (consider malware that destroys all data in the absence of a backup). Others will survive but pay a dear price in both time and money and get only a partial recovery. Either way, if you are reading this blog and have realized that this paragraph describes your company, you have a lot of work to do, and you’d best hurry up!

In our case, I am confident that you have evaluated and vetted your disaster recovery strategies and solutions as appropriate to your risk appetite, cybersecurity exposure, and so on, so we’ll accept that your environment falls squarely into the smart group of companies! While your team is restoring your production environment, you are making sure that they are following a pre-rehearsed checklist that affirms proper and normal operation of all systems, following the communications and notifications protocols. All departments should be working together in incident response harmony. At the same time, you are making exhaustive notes on the incident that will help you in your post-incident review and action plan.

Post-Incident Review

There will be incidents and there will be INCIDENTS! Both will require your careful review. People call post-incident review many different things (lessons learned, postmortem, etc.), but no matter what you choose to call it, the review process is essential to the evolution of your cybersecurity program and team skills. To begin with, you need an answer to a whole list of questions, starting with:

Who’s responsible for this outrage? You should be able to answer this with confidence. And yet the question is more complex than it seems. It’s not about merely assigning a name to the attacker; it is about building a complete profile. Sometimes getting a name will be impossible. You will still be able to glean quite a bit from the tools used, origin, threat intelligence feeds, and plain old Internet searching. Try to build as good a profile as possible. Understanding your attackers, their motives, and their methods will help you prepare for the next attack.

Just the facts, ma’am! After the who, you need to document the how, and you need to do it in excruciating detail. You need to know exactly what happened, when it happened, how it happened, what vulnerabilities were exploited, how it was discovered, which alerts were triggered, and which were not. You need all the facts about the incident, in as much detail as possible. This will help you tune your defenses and identify holes in your control deployment.

Knowing now what you didn’t know then, what do you need to change? This is the introspective component of the exercise. You need to look inward and understand how your organization reacted to the incident. Look at everything and leave no stone unturned. How did the incident response team act? Were procedures followed? Were internal communications effective? Was the response time adequate? Were the response actions timely and adequate? Did you meet MTD, RPO, and RTO targets? How did the executive team support the effort? How did the staff react? How did the Privacy/IT/Cybersecurity partnership perform? Once these questions are answered, you can collectively sit down and tune your incident response plan so that next time you can perform better.

How do you stop this from happening again? Having understood the organization’s and team’s performance, plus all about the incident itself, you need to plug the hole that allowed the incident to occur in the first place. This sounds a lot easier than it is. If the breach was due to a phishing incident, for example, your options may be limited to better email sandbox tuning and targeted cybersecurity awareness training. Neither of those will plug the hole completely. They may narrow it, but there is no guarantee that another member of the staff will not fall victim to such an attack. In which case, you’re going back to layered defenses (sandboxing, role authorities, monitoring, training, etc.).

Do It All Over Again!

Incident-response planning isn’t one-and-done, ever. It changes over time, with your environment, the market, and the technology. It also changes because you change: you learn, you grow, you adapt.

The good news is that as you learn, grow, and adapt, you can apply your knowledge, experience, and adaptation to your privacy, cybersecurity, and incident-response plans. You get a second chance! And a third, and a fourth…

Change is the only constant here, and it represents a wonderful opportunity to keep your privacy and cybersecurity program constantly tuned. Embrace this and take advantage of it. The alternative is catastrophic: obsolete privacy, cybersecurity, and incident-response programs are just as bad as having none at all.

