Incident response and recovery best practices for industrial control systems
Introduction
In collaboration with the North American Electric Reliability Corporation (NERC), the Federal Energy Regulatory Commission (FERC) developed the 2020 Cyber Planning for Response and Recovery Study (CYPRES). The report describes cyberthreats to electric utilities, along with incident response and recovery best practices for industrial control systems. The study includes expert observations from the sponsoring entities and interviews with eight utility organizations about their approach to incident response.
About the study
The study's roots go back to 2014, when FERC and NERC initiated a joint staff review to assess plans for restoration and recovery in the event of utility outages or blackouts. This assessment led to the creation of the Review of Restoration and Recovery Plans report in 2016.
This report included recommendations for improving system restoration and cyber incident response and recovery planning and readiness. Among them was a recommendation to conduct a study to understand plan improvements and identify best practices.
Now that study is complete, and the findings have been reported.
The report process
FERC and NERC used the National Institute of Standards and Technology (NIST) Special Publication 800-61 Revision 2, Computer Security Incident Handling Guide to prepare the report. The NIST publication offers guidelines on how to respond to incidents appropriately.
Teams conducted on-site visits, interviewing electric utility employees responsible for incident response. The interviews used the Cyber Kill Chain® intrusion process to structure the discussions.
The interviewers asked participants questions regarding their approaches to the various phases of an intrusion in their incident response plans.
Other query topics:
- Corporate networks and the operational technology (OT) networks that control power systems.
- How they apply NERC’s Critical Infrastructure Protection (CIP) Reliability Standards, specifically CIP-008-5 (Incident Reporting and Response Planning), in their plans.
Defining the components
The study provides some context on components that are necessary to understand the report’s findings.
- Cyber network design: Describes interconnected digital elements such as computers, routers, switches, and firewalls. The report also notes that corporate networks, operational technology networks and industrial control systems have different levels of security.
- Exercises: Opportunities for organizations to test systems and train staff to ensure incident response plans mitigate risk properly.
- Cybersecurity event: An observable occurrence in a system or network. Events don’t always have adverse consequences.
- Cybersecurity incident: Unlike an event, an incident is a malicious act or suspicious event that compromises or disrupts a system or its operation.
- Incident Response and Recovery (IRR) plan: Details how a utility responds to a cybersecurity incident and establishes clear procedures for handling it. IRR plans generally define their scope, what counts as a security incident, roles and reporting requirements.
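To make the event/incident distinction concrete, here is a minimal Python sketch, assuming each observed event has already been judged on two yes/no questions. The `ObservedEvent` class, its fields and `is_incident` are hypothetical names for illustration, not part of the CYPRES report or any NERC tool; real programs apply their own, more detailed definitions.

```python
from dataclasses import dataclass


@dataclass
class ObservedEvent:
    """A cybersecurity event: any observable occurrence in a system or network."""
    description: str
    malicious_or_suspicious: bool  # hypothetical field: analyst judgment of intent
    compromises_or_disrupts: bool  # hypothetical field: impact on a system or its operation


def is_incident(event: ObservedEvent) -> bool:
    """An event becomes an incident only if it is malicious or suspicious
    AND it compromises or disrupts a system or its operation."""
    return event.malicious_or_suspicious and event.compromises_or_disrupts


# Usage: a routine failed login is an event, not an incident.
routine = ObservedEvent("single failed VPN login", False, False)
intrusion = ObservedEvent("unknown binary running on HMI workstation", True, True)
print(is_incident(routine))    # False
print(is_incident(intrusion))  # True
```

Keeping both conditions explicit mirrors the definitions above: suspicion alone does not make an event an incident.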
Observations based on the NIST phases
The NIST incident response life cycle has four phases:
- Preparation
- Detection and Analysis
- Containment and Eradication
- Post-Incident Activity
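One way to picture the life cycle is as a small state machine. The sketch below is an illustration, not NIST's or the study's model: it uses the phase names listed above and assumes the loops shown in the NIST life cycle diagram, where containment work can send responders back to detection and analysis, and post-incident lessons feed back into preparation. `Phase`, `TRANSITIONS` and `advance` are hypothetical names.

```python
from enum import Enum


class Phase(Enum):
    PREPARATION = "Preparation"
    DETECTION_AND_ANALYSIS = "Detection and Analysis"
    CONTAINMENT_AND_ERADICATION = "Containment and Eradication"
    POST_INCIDENT_ACTIVITY = "Post-Incident Activity"


# Allowed moves between phases (an assumption modeled on the NIST diagram):
# containment can loop back to detection when new compromised hosts turn up,
# and lessons learned feed back into preparation.
TRANSITIONS = {
    Phase.PREPARATION: {Phase.DETECTION_AND_ANALYSIS},
    Phase.DETECTION_AND_ANALYSIS: {Phase.CONTAINMENT_AND_ERADICATION},
    Phase.CONTAINMENT_AND_ERADICATION: {
        Phase.DETECTION_AND_ANALYSIS,
        Phase.POST_INCIDENT_ACTIVITY,
    },
    Phase.POST_INCIDENT_ACTIVITY: {Phase.PREPARATION},
}


def advance(current: Phase, nxt: Phase) -> Phase:
    """Move to the next phase, rejecting transitions the cycle does not allow."""
    if nxt not in TRANSITIONS[current]:
        raise ValueError(f"cannot go from {current.value} to {nxt.value}")
    return nxt


# Usage: one incident that needed a second round of analysis before closing.
phase = Phase.PREPARATION
for step in (
    Phase.DETECTION_AND_ANALYSIS,
    Phase.CONTAINMENT_AND_ERADICATION,
    Phase.DETECTION_AND_ANALYSIS,
    Phase.CONTAINMENT_AND_ERADICATION,
    Phase.POST_INCIDENT_ACTIVITY,
    Phase.PREPARATION,
):
    phase = advance(phase, step)
print(phase.value)  # Preparation
```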
The team used these phases to organize the observations from its interviews.
Preparation
Preparation covers both building an incident response capability and preventing incidents by keeping systems and networks appropriately secure. It includes:
- Appropriate staffing for IRR duties with roles and responsibilities
- Adequate procedures and tools for investigation
- Team continuity assurances
- Implementing training and professional development that helps employees be cyber-aware
Preparation observations
Overall, the team found that interviewees approached preparation differently. Many of the differences relate to each organization’s size, structure, functions and geographic footprint. Smaller entities prepare quite differently from larger ones, in part because their infrastructure is smaller.
Some participants used the Electricity Subsector Coordinating Council’s (ESCC) Cyber Mutual Assistance Program as a support system to augment personnel during an incident.
Most entities' defensive posture included tools for proactively monitoring, managing and mitigating malicious activity. They also use virtualization for quicker recovery, and they maintain robust training programs to address the human element of preparedness.
They promote professional development through certifications, conferences, workshops and online training. Developing existing personnel is a necessity for the utilities, as they find it very challenging to find specialists with industrial control systems security expertise.
In general, the organizations held quarterly cyber awareness training and annual cyber exercises, some of them targeted drills on spear phishing and social engineering. They also stay current on relevant news and threats through NERC alerts and other sources.
To achieve optimal preparation, the study authors noted the following:
- Well-defined roles that promote accountability
- Access to technology and automated tools
- Well-trained team members who are always honing their skills
- Incorporation of lessons learned from past cyber-events and simulations into their IRR plan
Detection and Analysis
The next phase covers how an organization detects an incident and analyzes it. Detection typically relies on security tools that perform scans. Analysis determines the incident’s scope and what is affected, and a team is deployed immediately to set priorities and plan containment.
Detection and Analysis observations
Participants described using network scanning for detection. The scanning covers multiple operating systems, platforms and industrial control systems, and can find software flaws, missing patches, malware and misconfigurations.
These security tools also establish a baseline and assist with configuration management. Organizations use them in two ways: fully automated, or operated by IT personnel. In most cases, the two approaches complement each other.
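Baseline-based configuration management comes down to comparing each device's current settings with an approved snapshot. The Python sketch below assumes settings can be exported as simple key/value pairs; `config_drift` and the example setting names are hypothetical, and commercial scanners do this far more thoroughly.

```python
def config_drift(baseline: dict[str, str], current: dict[str, str]) -> dict[str, list[str]]:
    """Compare a device's current settings with its approved baseline.

    Returns the setting names that were added, removed or changed,
    each list sorted so reports are stable between runs.
    """
    added = sorted(k for k in current if k not in baseline)
    removed = sorted(k for k in baseline if k not in current)
    changed = sorted(k for k in baseline if k in current and baseline[k] != current[k])
    return {"added": added, "removed": removed, "changed": changed}


# Usage with hypothetical settings exported from a substation device.
baseline = {"firmware": "2.1.4", "telnet": "disabled", "ntp_server": "10.0.0.5"}
current = {"firmware": "2.1.4", "telnet": "enabled", "snmp_community": "public"}
drift = config_drift(baseline, current)
print(drift)  # {'added': ['snmp_community'], 'removed': ['ntp_server'], 'changed': ['telnet']}
```

Any non-empty list is a lead for an analyst: here a re-enabled insecure service, a missing time source and an unexpected management interface.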
Responses to detection varied by organization and depended on each one’s own definitions of “suspicious,” “event” and “incident.” Responses also varied with the affected trust zone. Flowcharts and decision trees proved useful for deciding how to proceed.
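A decision tree like the ones participants described might be sketched as follows. The classifications mirror the terms above, but the trust-zone names (`corporate`, `ot`, `ics`), their ordering and the resulting actions are assumptions for illustration; each utility's IRR plan defines its own.

```python
from enum import Enum


class Classification(Enum):
    BENIGN = "benign"
    SUSPICIOUS = "suspicious"
    EVENT = "event"
    INCIDENT = "incident"


# Hypothetical trust zones, ranked from least (1) to most (3) critical.
ZONE_CRITICALITY = {"corporate": 1, "ot": 2, "ics": 3}


def triage(classification: Classification, zone: str) -> str:
    """Walk a simple decision tree from classification and trust zone to an action."""
    if classification is Classification.BENIGN:
        return "log and close"
    if classification is Classification.SUSPICIOUS:
        return "monitor and gather more data"
    criticality = ZONE_CRITICALITY[zone]
    if classification is Classification.EVENT:
        # Events in OT or ICS zones get immediate analyst attention.
        return "escalate to analyst" if criticality >= 2 else "queue for analyst review"
    # Incidents: the most critical zone brings operations staff in before containment.
    if criticality == 3:
        return "activate IRR plan with operations staff"
    return "activate IRR plan"


print(triage(Classification.EVENT, "corporate"))  # queue for analyst review
print(triage(Classification.INCIDENT, "ics"))     # activate IRR plan with operations staff
```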
As with preparation, personnel training is critical to a healthy detection and analysis response.
Containment and Eradication
Containment aims to limit an incident’s scope or impact. Once an event is deemed an incident, containment can begin, but it is not always the best option: with certain malware, containment could actually make the problem worse.
Eradication involves removing the threat from systems and attempting to return them to a normal, functioning state. It’s not an easy proposition, as responders may not have identified every compromised system.
With both containment and eradication, clear communication amongst stakeholders is necessary.
Containment and Eradication observations
Again, there were differences in how participants handled containment and eradication. Their IRR plans often contained many options for response depending on severity. Containment strategies can often affect operations, so participants conduct tests and training to ensure they make the best response for the situation. IRR plans must take into consideration that containment could trigger predefined destructive actions by malware.
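The trade-offs above can be expressed as a simple selection function. This is a hypothetical sketch: the severity scale, the `IncidentAssessment` fields and the recommended actions are illustrative assumptions, not guidance from the study. The first check reflects the report's caution that containment could trigger destructive malware.

```python
from dataclasses import dataclass


@dataclass
class IncidentAssessment:
    severity: int                  # hypothetical scale: 1 (low) to 4 (critical)
    may_trigger_destruction: bool  # analysts suspect isolation could set off a malware payload
    affects_operations: bool       # isolating the host would interrupt the process it controls


def choose_containment(a: IncidentAssessment) -> str:
    """Pick a containment option, treating 'do not isolate yet' as a valid choice."""
    if a.may_trigger_destruction:
        # Containment itself could make things worse: observe and prepare first.
        return "monitor covertly; stage backups and manual operations before isolating"
    if a.severity >= 3 and not a.affects_operations:
        return "isolate affected hosts immediately"
    if a.severity >= 3:
        # Isolation would disrupt operations, so coordinate its timing.
        return "coordinate with operations, then isolate in a planned window"
    return "restrict access and increase monitoring"


print(choose_containment(IncidentAssessment(4, False, True)))
# coordinate with operations, then isolate in a planned window
```

Ordering the checks this way makes the "containment could backfire" case win over severity, which is the point participants raised about destructive malware.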
For eradication, the entities noted that their analysis of the attacker, including how long the attacker had been in the system, influences what they do next. They stated that their IRR plans don’t necessarily cover this in detail but offer a set of conditions to consider.
During this phase, organizations also collect evidence and continue analysis as part of their IRR plan. This lets them forensically establish how long the attack lasted, along with other specifics.
Post-Incident Activity
In the Post-Incident Activity phase, it’s time to recover, document and analyze further. Recovery means restoring systems to normal operation, and lessons-learned reports are a common output.
Post-Incident Activity observations
The study’s participants spoke about lessons-learned reports and how they use them to change and adapt their IRR plans. These reports can come from actual incidents or from training exercises.
Other observations
Beyond the NIST framework, the teams had other observations around incident response policies. Here are some of the most important:
- IRR plan organization varies. Some entities have a separate incident response process for each network. Larger entities tend to have several, but the researchers observed that entities with a single plan update and use it more frequently. For larger utilities that need more than one IRR plan, a best practice is to give all of them the same structure.
- All participants had a formal process for updating their IRR plan. When new threats become known, they meet and review to see if the current plan addresses them. However, they don’t usually make sweeping changes unless the risk is high.
- Participants agreed that reporting incidents and sharing information is critical for the industry to be better prepared.
- All participating organizations have a strong commitment to cyber-training all employees and ensuring awareness across their staff, no matter their department or title.
- The most impactful IRR plans have a clear communication path. Communication is critical to successful incident response. Maintaining secure communication is another important consideration.
What can utility companies take away from the CYPRES report?
Incident response steps within your IRR plan are crucial for addressing threats, and every utility needs an effective, tested IRR plan. The report doesn’t define one “perfect” IRR plan but acknowledges that each participant’s plan had successful ingredients. These observations and practices can help any utility make its IRR plan stronger and more resilient.
Sources
Cyber Planning for Response and Recovery Study (CYPRES), FERC
Report on the FERC-NERC-Regional Entity Joint Review of Restoration and Recovery Plans, FERC
Computer Security Incident Handling Guide, NIST
The Cyber Kill Chain, Lockheed Martin
Cyber Mutual Assistance, ESCC