How to handle security incidents like a big enterprise
Incident response is one of the most important processes inside an information security team. The objective of the incident response team is to approach a security incident in a coordinated and organised manner, aiming to stop it or at least reduce its impact in the early phases.
Moving from a traditional security team in a big company, where everyone masters the security lingo, to a smaller, less mature business can be quite challenging. Security-specific terms and definitions might not be known by the majority of people, who come from different parts of the business. Countless times I have seen people unprepared or not knowing what to do in an incident scenario.
When a security incident happens, the team must follow a clear plan in order to move fast enough to contain and recover from it. Here is a guide to the 6 steps of a complete incident response.
Phases of an incident response
An incident response can be broken down into 6 simple steps: Preparation, Identification, Containment, Eradication, Recovery and Post-Incident Lessons Learned.
Breaking the plan down into these steps, or phases, gives a clear view of what needs to be done in each one. Preparation holds all the actions taken before engaging the incident: who needs to be notified or involved, which playbooks and tools are needed, and so on. Identification holds the actions needed to work out what happened, based on what was collected previously. Containment holds the actions that stop the spread of whatever caused the incident. Eradication covers the steps necessary to stop the incident itself. Recovery covers what needs to be done to restore the affected services, if any. Finally, lessons learned holds all the documentation and knowledge of what happened, so that something like it does not happen again.
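To make the ordering concrete, here is a minimal Python sketch of the six phases as an ordered enumeration. The names `Phase` and `next_phase` are made up for this example and are not part of any standard or tool. It only models the default order; in practice phases can overlap or be revisited.

```python
from enum import IntEnum
from typing import Optional


class Phase(IntEnum):
    """The six phases, numbered in the order this guide presents them."""
    PREPARATION = 1
    IDENTIFICATION = 2
    CONTAINMENT = 3
    ERADICATION = 4
    RECOVERY = 5
    LESSONS_LEARNED = 6


def next_phase(current: Phase) -> Optional[Phase]:
    """Return the phase that normally follows `current`, or None after the last one."""
    if current is Phase.LESSONS_LEARNED:
        return None
    return Phase(current + 1)


# Walk the default order from start to finish.
phase: Optional[Phase] = Phase.PREPARATION
while phase is not None:
    print(phase.value, phase.name)
    phase = next_phase(phase)
```

Using an integer-backed enum keeps the phase numbers and names in one place, which is handy when tagging notes or tickets with the phase they belong to.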
So let’s dig a little bit deeper into each phase:
Phase 1
The first phase of the incident response is preparation. This is probably the most important phase and needs to be taken seriously: its results will dictate the speed and effectiveness of the following phases.
At this point we still don’t know for sure whether we have an incident on our hands, or its extent. In case it is a real incident, the team needs to prepare with all the information it can gather in order to succeed. Here are some actions that need to be performed in this phase:
- Grab the incident response plan, policies and procedures (assuming the company/team has them). Without strong, well-defined guidelines it will be extremely hard for the incident responders to handle the incident.
- Collect logs and threat intelligence feeds. This capability is really important to have in place prior to an incident: feeds and logs provide the context of the environment during the incident and help build a timeline of events. The logs give the first glance of what is happening and how big it is, and in case of a real incident they might be used as evidence as well.
- Start the communication strategy defined in the incident response plan: notify who needs to be notified, and call all relevant contacts that need to be present or can help investigate the incident. Incident response plans usually have levels of communication. For example, the first level covers the analysts and the specialists of specific resources who help identify the incident, the second level reaches a broader audience including some senior members of the team and maybe information security executives, and the third level, in the last phases, reaches the entire company, the executive board or even the PR department for a public statement.
- Document everything. From the first phase onwards, document as much as possible. This will be added to the knowledge base at the end, and the information in the report might prevent another incident of the same kind from happening later or in a different department. Even if it does not prevent another incident from materialising, it will definitely help to handle it more efficiently.
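As an illustration of how collected logs help build a timeline of events, here is a minimal Python sketch that merges events from several sources and orders them by time. The `LogEvent` type, the `build_timeline` helper and the sample events are all invented for this example; real logs would first need to be parsed into a common shape, with timestamps normalised to one time zone.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Iterable, List


@dataclass(frozen=True)
class LogEvent:
    timestamp: datetime  # timezone-aware, so events from different sources compare safely
    source: str
    message: str


def build_timeline(*sources: Iterable[LogEvent]) -> List[LogEvent]:
    """Merge events from every log source into one list ordered by time."""
    merged = [event for source in sources for event in source]
    return sorted(merged, key=lambda event: event.timestamp)


# Hypothetical sample events, invented for this example.
firewall = [
    LogEvent(datetime(2024, 1, 10, 9, 5, tzinfo=timezone.utc), "firewall", "outbound connection blocked"),
]
auth = [
    LogEvent(datetime(2024, 1, 10, 9, 1, tzinfo=timezone.utc), "auth", "failed login"),
    LogEvent(datetime(2024, 1, 10, 9, 7, tzinfo=timezone.utc), "auth", "password reset"),
]

for event in build_timeline(firewall, auth):
    print(event.timestamp.isoformat(), event.source, event.message)
```

Keeping the source name on every event makes it easier to trace each line of the timeline back to the original log if it is later needed as evidence.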
Before moving to the next phase, keep in mind that getting a good understanding of the big picture sometimes requires going back to this step to collect more information about newly discovered resources that could be involved.
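The levels of communication mentioned in the list above can be written down as data, so nobody has to remember who gets told at which point. The Python sketch below is only an illustration: the level names, the audience labels and the assumption that each level also includes the audiences of the levels below it are mine, and a real plan would hold actual contacts.

```python
from enum import IntEnum
from typing import Dict, List


class CommsLevel(IntEnum):
    RESPONDERS = 1           # analysts and specialists of specific resources
    SECURITY_LEADERSHIP = 2  # broader audience, senior members, security executives
    COMPANY_WIDE = 3         # entire company, executive board, PR


# Hypothetical audience labels; a real plan would list names and contact details.
AUDIENCES: Dict[CommsLevel, List[str]] = {
    CommsLevel.RESPONDERS: ["analysts", "resource specialists"],
    CommsLevel.SECURITY_LEADERSHIP: ["senior team members", "security executives"],
    CommsLevel.COMPANY_WIDE: ["entire company", "executive board", "PR department"],
}


def audiences_up_to(level: CommsLevel) -> List[str]:
    """Everyone to inform once the incident reaches `level` (assumed cumulative)."""
    result: List[str] = []
    for candidate in CommsLevel:
        if candidate <= level:
            result.extend(AUDIENCES[candidate])
    return result


print(audiences_up_to(CommsLevel.SECURITY_LEADERSHIP))
```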
Phase 2
Moving on, the second phase is identification. Here we determine whether we have a real incident on our hands or just a false alarm. This is where the information collected in the first step starts to pay off: if the analysis of the events was successful, it is safe to call whether the incident actually materialised.
In case of a false positive, it is a good time to tune the alerting mechanisms with the collected information, as part of the never-ending task of reducing the false positive rate. As the old saying goes, if everything is important, nothing is important. If the incident is real, however, a few options arise to deal with it: monitor it to collect more events, declare a real incident and escalate following the predefined procedures, or treat it as a recurrent incident based on a previous incident report.
Before the end of this phase, try to answer questions such as: how critical could this incident be, which resources are affected, and what could be its source and attack vector?
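The decision described in this phase can be sketched as a small Python function. The function name, its three yes/no inputs and the order in which the checks are applied are assumptions made for this example. A real team would base the call on its own procedures and on far richer information.

```python
from enum import Enum


class TriageOutcome(Enum):
    TUNE_ALERTING = "false positive: tune the alerting mechanisms"
    FOLLOW_PREVIOUS_REPORT = "recurrent incident: follow the previous incident report"
    MONITOR = "keep monitoring to collect more events"
    ESCALATE = "declare a real incident and escalate"


def triage(is_real: bool, seen_before: bool, enough_events: bool) -> TriageOutcome:
    """Pick one of the options listed for the identification phase.

    The precedence of the checks is an assumption of this sketch.
    """
    if not is_real:
        return TriageOutcome.TUNE_ALERTING
    if seen_before:
        return TriageOutcome.FOLLOW_PREVIOUS_REPORT
    if not enough_events:
        return TriageOutcome.MONITOR
    return TriageOutcome.ESCALATE


print(triage(is_real=True, seen_before=False, enough_events=True).value)
```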
Phase 3
In phase 3 we start actually facing the incident. If we got this far, it means we have a live incident on our hands, and the first thing to do is contain it. The goal here is to prevent the affected resources from communicating with clean ones and spreading the problem.
Each kind of incident needs its own set of actions. For example, a DDoS attack has different characteristics from a malware infection. Suggested actions will come from previous incident reports, the knowledge base or the incident response plan itself.
Some important actions that generally need to be implemented are isolating the affected resources so they can’t communicate with anything in your network or on the public internet, revoking credentials, and temporarily removing relevant access and roles. But be careful with the actions performed to contain the incident: sometimes blocking a resource can cause an even bigger outage, so these actions need to be really well thought through.
And one more thing: avoid wiping or destroying any resource at this point in time. Powering off a machine or wiping a resource will destroy valuable evidence that can give the team clues about how the incident started and how it behaved inside the environment. Losing that would be disastrous, leaving the door open for the same issue to happen again in the future.
With the incident contained to certain resources we can move to the next phase.
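The containment actions above can be turned into a plan that is reviewed before anything is executed. The Python sketch below is an illustration under stated assumptions: the action names, the `critical` flag and the rule that isolating a critical resource needs a human review first are all invented for this example. It deliberately contains no wipe or power-off action, so evidence is preserved.

```python
from dataclasses import dataclass
from typing import Iterable, List

# Hypothetical action names; none of them wipes or powers off a resource.
CONTAINMENT_ACTIONS = ("isolate_network", "revoke_credentials", "suspend_access_roles")


@dataclass(frozen=True)
class Resource:
    name: str
    critical: bool = False  # True if blocking it could cause a bigger outage


@dataclass(frozen=True)
class ContainmentStep:
    resource: str
    action: str
    needs_review: bool


def plan_containment(resources: Iterable[Resource]) -> List[ContainmentStep]:
    """Build the list of containment steps without executing any of them."""
    steps: List[ContainmentStep] = []
    for resource in resources:
        for action in CONTAINMENT_ACTIONS:
            # Isolating a critical resource may cause an outage, so flag it for review.
            needs_review = resource.critical and action == "isolate_network"
            steps.append(ContainmentStep(resource.name, action, needs_review))
    return steps


# Hypothetical affected resources.
plan = plan_containment([Resource("laptop-17"), Resource("payments-db", critical=True)])
for step in plan:
    print(step.resource, step.action, "REVIEW FIRST" if step.needs_review else "ok")
```

Separating planning from execution gives the team a moment to think through the side effects of each step before it is carried out.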
Phase 4
Here is where we stop the incident: phase 4 is eradication. The goal is to pinpoint the root cause and fix it, always documenting each step. After finding and completely removing the root cause, it is time to wrap things up: run scans whenever possible, review active policies, snapshot the machines involved in case further forensic analysis is needed, and do a coordinated shutdown.
Depending on the extent of the incident, phase 5 may happen at the same time as phase 4 or not happen at all.
Phase 5
Phase 5 is recovery which, if needed, happens almost at the same time as phase 4. Some critical resources that cannot be powered off will need a fully recovered copy before the infected resources are shut down. Sometimes this step could happen before or together with the containment phase, in case a resource cannot be fully isolated.
The tasks here are restoring resources from backups and snapshots of a known good state, patching systems that have pending patches, rebuilding what is necessary, and employing all your monitoring capabilities to confirm that the incident is really gone and to detect new attempts or variations in a follow-up attack.
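Picking a snapshot from a known good state can be sketched as follows. This Python example assumes, for illustration only, that a snapshot counts as known good if it was taken before the first sign of compromise on the incident timeline. The function name and the dates are made up, and a real restore should still be scanned and verified before going live.

```python
from datetime import datetime, timezone
from typing import Optional, Sequence


def latest_known_good(snapshots: Sequence[datetime], first_compromise: datetime) -> Optional[datetime]:
    """Return the newest snapshot taken strictly before the first sign of compromise."""
    candidates = [taken_at for taken_at in snapshots if taken_at < first_compromise]
    return max(candidates) if candidates else None


# Hypothetical snapshot times and compromise time, invented for this example.
snapshots = [
    datetime(2024, 1, 8, 2, 0, tzinfo=timezone.utc),
    datetime(2024, 1, 9, 2, 0, tzinfo=timezone.utc),
    datetime(2024, 1, 10, 2, 0, tzinfo=timezone.utc),
]
first_compromise = datetime(2024, 1, 9, 14, 30, tzinfo=timezone.utc)

print(latest_known_good(snapshots, first_compromise))
```

If the function returns nothing, no snapshot predates the compromise and the resource has to be rebuilt instead of restored.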
Phase 6
And after all that, we’ve reached the last step: time for lessons learned. After things have cooled off, it is time to sit down and review all the documentation, produce a proper incident report and notify all parties that need to be notified, including government or regulatory agencies, which is mandatory in some countries. At this point everything is back online and operations have fully recovered, so it is time to build the knowledge base entry for this incident and capture what was learned from it: how it was resolved, what could be improved, new controls or preventative measures that could be added, and new training material where it applies.
And we are done. No one wants to be breached or to engage in an incident response, but in the connected world we are living in, this is inevitable. Having a solid plan is the difference between a successful recovery and a catastrophic outcome with brand damage, financial consequences, bad press and even the loss of jobs or the end of operations.