Using Human Factors to Improve Process Safety
Following the recommendations of the Baker report into the BP Texas City incident, two new American Petroleum Institute (API) recommended practices were published. One dealt with fatigue risk management; the other, the subject of this paper, dealt with process safety performance measurement. The API document ‘ANSI/API RECOMMENDED PRACTICE 754 – Process Safety Performance Indicators for the Refining and Petrochemical Industries’ was published in its first edition in April 2010.
The document deals with an issue that has long persisted in the refining and chemical industries, namely the use of occupational safety performance as an indication of the overall safety of a facility. That’s not to say that occupational safety is not important: these programs ensure that employees have a safe environment in which to work and go home in the same condition in which they arrived. However, reducing slips, trips and falls has minimal impact on the likelihood of potentially significant industrial incidents such as the BP Texas City refinery incident.
So this raises the big question: what do we measure to have a meaningful impact on process safety?
Many of us are familiar with the safety triangle shown below. It derives from the original work of Herbert Heinrich in the early 20th century and rests on the premise that there is a large ratio of minor events to every major event.
To illustrate this further, a 2003 study by ConocoPhillips Marine produced a similar chart (below).
So the premise of API RP 754 is that if we track incidents at all levels, we can start to look at the underlying issues that can ultimately result in the one big event nobody wants to have.
API RP 754 takes this one step further to say that although individual events may not be significant, multiple circumstances that occur together can result in something bigger. This can be illustrated by the Swiss Cheese Model proposed by James Reason shown below.
In this model, any one breach of a protective barrier (an incident) is normally made safe by the other protective barriers. However, because these barriers have weaknesses of their own, occasionally multiple breaches line up and result in some form of harm.
To assess the effectiveness and performance of the individual barriers, the API document proposes four tiers of performance indicators, as illustrated below.
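The arithmetic behind the model can be sketched in a few lines. This is a minimal illustration, not a risk assessment method: it assumes the barriers fail independently of one another (which real barriers often do not), and every probability below is a made-up placeholder. The function name `probability_all_barriers_fail` is hypothetical.

```python
from math import prod


def probability_all_barriers_fail(failure_probabilities):
    """Chance that every barrier fails on the same demand, assuming independence."""
    probabilities = list(failure_probabilities)
    for p in probabilities:
        if not 0.0 <= p <= 1.0:
            raise ValueError("each probability must be between 0 and 1")
    return prod(probabilities)


# Hypothetical per-demand failure probabilities (illustrative values only).
barriers = {
    "equipment design": 0.01,
    "relief device": 0.01,
    "safety instrumented system": 0.1,
    "operator response": 0.1,
}

healthy = probability_all_barriers_fail(barriers.values())

# Weaken a single barrier, for example a fatigued or overloaded operator.
degraded = {**barriers, "operator response": 0.5}
weakened = probability_all_barriers_fail(degraded.values())

print(f"all barriers as designed: {healthy:.1e}")
print(f"operator barrier degraded: {weakened:.1e}")
```

Under these assumptions, weakening one barrier by a factor of five raises the chance of all the holes lining up by the same factor, which is why the condition of each individual barrier is worth measuring.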
The top of the pyramid represents lagging indicators – i.e. after the horse has bolted – and the bottom of the pyramid leading indicators – i.e. measures that can predict potential failure.
The API document discusses the various tiers in some detail; they can be summarized as follows:
• Tier 1 – Serious consequence of material loss of primary containment (LOPC)
• Tier 2 – Lesser consequence of LOPC
• Tier 3 – Activation of safety system that does not lead to a Tier 1 or 2 event
• Tier 4 – Effectiveness of process safety management systems
So where does the human factors element come into play? Typically, Tier 1 and Tier 2 incidents result in incident investigations and often root cause analysis. These may conclude that the incident was caused by human error, but companies often do not drill down to understand what contributed to the error. For example, was the operator fatigued? Was the operator overloaded by alarms? Was shift handover ineffective, leaving the operator unaware of a situation? All these questions should be asked and the answers used to improve the systems that address them. Moreover, some of these factors can be measured and could therefore be used as performance indicators; because they are generally a function of the performance of a management system, they fall under Tier 4.
When we look at the layers of protection employed to protect a plant we normally consider:
• Equipment design – vessel and materials designed for the full range of operating conditions
• Physical protection – passive devices such as relief valves and rupture disks
• Safety Instrumented Systems (SIS) – automatic systems, often PLC or relay based logic.
• Basic Process Control System (BPCS) – DCS or SCADA logic
• Control Room Operator
When we design for safety we start at the top and apply sufficient layers of protection to reduce the danger to a level that is acceptable. We then apply additional controls to mitigate the consequence of any event that may occur. You might ask: why, then, do we have any incidents at all? In theory everything seems good, but equipment properties change over time through the effects of corrosion or simply wear. Relief valves and rupture disks work 99% of the time, but occasionally they get stuck or someone forgets to unisolate them after maintenance. Safety instrumented systems, regardless of how well they are designed, occasionally fail or are subject to a set of conditions for which they were not designed. The BPCS, even though very reliable in practice, can fail and can be affected by human changes to its programming, for example to shutdown limits or overrides. There are also costs associated with all layers: it may not be economically viable to build a reactor that can withstand tremendous pressures, and there may not be a relief valve or rupture disk that can relieve a two-phase flow resulting from a runaway reaction. So we are often left with the operator as an active last line of defense.
The UK’s Health and Safety Executive has proposed 10 key topics that influence the ability of the operator to respond to abnormal events. These are illustrated in the modified Swiss Cheese Model below.
The 10 topics are:
• Managing Human Failure
• Human Factors in Design
• Procedures
• Staffing
• Training and Competence
• Maintenance, Inspection and Testing
• Safety Critical Communications
• Fatigue and Shiftwork
• Organizational Change
• Organizational Culture
Each topic is dealt with in turn, with some suggested performance indicators where appropriate.
Managing Human Failures
Although it seems odd to talk about managing human failures, the intent is to accept that humans will make errors, and so we need to identify ahead of time what those errors might be and decide how to protect against them. This is most often done through risk assessments that incorporate potential human errors (mistyping, activating the wrong switch, etc.) and influencing factors (being rushed, poor ergonomic design, etc.). A performance indicator for this could be the percentage or number of risk assessments completed, or the results of continuous improvement audits.
Human Factors in Design
There are several aspects of this that have a direct impact on the effectiveness of the operator:
• Control Room Design
• Human Machine Interface (HMI) Design
• Alarm Management System Performance
• Environmental Conditions
For each of these elements there are well-established guidelines and best practices against which performance can be measured. For example, the performance of the alarm management system can be judged against the metrics suggested in EEMUA 191 or IEC 62682. Similarly, the design of the control room can be judged against the recommendations of ISO 11064.
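As one example of turning such a guideline into a number, the sketch below computes the average alarm rate presented to a single operator position over a shift. It is illustrative only: the function name, the sample alarm log and the target value are assumptions, and the target should be replaced with the figure from whichever guideline the site has adopted.

```python
from datetime import datetime, timedelta

# Assumed benchmark, not taken from this paper: replace with the value
# from the guideline the site has adopted.
TARGET_ALARMS_PER_10_MIN = 1.0


def average_alarms_per_10_min(alarm_times, period_start, period_end):
    """Average number of alarms per 10-minute interval for one operator position."""
    duration = period_end - period_start
    if duration <= timedelta(0):
        raise ValueError("period_end must be after period_start")
    # Count only alarms that were annunciated inside the review period.
    in_period = [t for t in alarm_times if period_start <= t < period_end]
    intervals = duration / timedelta(minutes=10)
    return len(in_period) / intervals


# Hypothetical 12-hour shift with one alarm every 8 minutes (90 alarms).
shift_start = datetime(2024, 1, 1, 6, 0)
shift_end = shift_start + timedelta(hours=12)
alarm_log = [shift_start + timedelta(minutes=8 * i) for i in range(90)]

rate = average_alarms_per_10_min(alarm_log, shift_start, shift_end)
meets_target = rate <= TARGET_ALARMS_PER_10_MIN
print(f"average alarms per 10 minutes: {rate:.2f}, meets target: {meets_target}")
```

An average hides alarm floods, so in practice a peak-rate measure would normally be tracked alongside it.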
Procedures
It is well accepted that accurate and up-to-date procedures, written in a manner that is easily understood and followed, are essential to ensure consistently safe operations.
This is discussed as an example of a Tier 4 indicator in API RP 754 with a measure of ‘Percent of process safety required operations and maintenance procedures reviewed or revised as scheduled.’
Staffing
Staffing is a tricky topic, but there are methods, such as that offered by UCDS, that help ensure staffing levels for normal operating conditions are sustainable and balanced. Similarly, it is important to confirm, through formal work team design, that supervision is appropriate for the type and size of work team employed. The DOT’s PHMSA Control Room Management (CRM) regulations require that workload be assessed annually, and a requirement to quantify workload is also part of the API RP 755 guidelines.
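A percentage indicator of this kind is straightforward to compute once review due dates and completion dates are recorded. The sketch below is one possible reading of the measure quoted above, not a definition taken from the API document: the record layout, the names `ProcedureReview` and `percent_reviewed_as_scheduled`, and the choice to count a review as on schedule only when it was completed on or before its due date are all assumptions.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class ProcedureReview:
    procedure_id: str
    due: date
    completed: Optional[date] = None  # None means the review has not been done


def percent_reviewed_as_scheduled(reviews, as_of):
    """Percent of reviews due on or before `as_of` that were completed by their due date."""
    due_so_far = [r for r in reviews if r.due <= as_of]
    if not due_so_far:
        return None  # nothing was due yet, so the indicator is undefined
    on_time = sum(
        1 for r in due_so_far if r.completed is not None and r.completed <= r.due
    )
    return 100.0 * on_time / len(due_so_far)


# Hypothetical review records.
reviews = [
    ProcedureReview("OP-001", date(2024, 3, 31), date(2024, 3, 15)),  # on time
    ProcedureReview("OP-002", date(2024, 3, 31), date(2024, 4, 20)),  # late
    ProcedureReview("MN-001", date(2024, 6, 30), date(2024, 6, 30)),  # on time
    ProcedureReview("MN-002", date(2024, 6, 30), date(2024, 5, 2)),   # on time
    ProcedureReview("MN-003", date(2024, 12, 31)),                    # not yet due
]

indicator = percent_reviewed_as_scheduled(reviews, as_of=date(2024, 6, 30))
print(f"procedures reviewed as scheduled: {indicator:.1f}%")
```

Reviews that are not yet due are left out of the denominator so that the indicator is not penalized for work that is still on schedule.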
Training and Competence
Effective training is one of the areas we find generally lacking in the industry. The onboarding process is often good, but training becomes less effective as operators move through the progression.
Competence is also seldom used as a basis for worker selection and promotion, which leaves workers unsuited to, and poorly prepared for, their new roles.
Training is discussed as an example of a Tier 4 indicator in API RP 754 with a measure of ‘Percent of process safety required training completed with skills verification.’
The API document also recommends performance indicators relating to Emergency Response Drills that could be expanded to include tabletop scenario drills and other short term formalized training exercises.
Maintenance, Inspection and Testing
In part this is discussed as an example of a Tier 4 indicator in API RP 754 under the headings of:
• Safety Critical Equipment Inspection
• Safety Critical Equipment Deficiency Management
Measures for the performance of each of these systems, such as percentage of inspections and corrective actions completed on time, are suggested.
Safety Critical Communications
There are two types of operator communications that are considered safety critical:
• Shift Handover
• Work Permit
We have found that although work permit systems are often well developed and well thought out, they have a significant impact on the workload of operators.
Shift Handover is another area of weakness for many companies. In most cases we see informal processes with little structure. An effective Shift Handover is the only way to realistically transfer situation awareness from the outgoing to the oncoming crew.
Measures for the performance of each of these systems, such as accurate completion of shift handover logs and permits, are suggested.
Fatigue and Shiftwork
Operator fatigue is a significant area of concern in the industry and, despite new guidelines in the form of API RP 755, is still, in our opinion, poorly managed. Many companies have fatigue policies, but these often deal only with hours-of-service limits and an exception process, which is only part of the story. Proper education and training, not just an annual CBT, is essential, along with the provision of adequate fatigue countermeasures.
API RP 754 suggests some performance indicators for this, including:
• Percentage of overtime
• Number of open shifts
• Number of extended shifts
• Number of consecutive shifts worked
• Number of exceptions
However, we also recommend the tracking of other measures such as calculated fatigue and risk indices.
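Several of the listed indicators can be derived directly from shift records. The sketch below shows one way to do so under stated assumptions: the `Shift` record and `fatigue_indicators` function are hypothetical, "extended" is taken to mean more than 12 hours worked (a placeholder threshold), and consecutive shifts are approximated as consecutive calendar days. Open shifts and exceptions are left out because they would come from the scheduling system rather than from worked hours.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class Shift:
    operator: str
    day: date
    scheduled_hours: float
    worked_hours: float


def fatigue_indicators(shifts, extended_threshold_hours=12.0):
    """Summarize overtime, extended shifts and consecutive days worked."""
    scheduled = sum(s.scheduled_hours for s in shifts)
    overtime = sum(max(0.0, s.worked_hours - s.scheduled_hours) for s in shifts)
    extended = sum(1 for s in shifts if s.worked_hours > extended_threshold_hours)

    # Longest run of consecutive calendar days worked, per operator.
    days_by_operator = {}
    for s in shifts:
        days_by_operator.setdefault(s.operator, set()).add(s.day)
    longest_run = {}
    for operator, days in days_by_operator.items():
        best = run = 0
        previous = None
        for d in sorted(days):
            follows_on = previous is not None and d - previous == timedelta(days=1)
            run = run + 1 if follows_on else 1
            best = max(best, run)
            previous = d
        longest_run[operator] = best

    return {
        "percent_overtime": 100.0 * overtime / scheduled if scheduled else 0.0,
        "extended_shifts": extended,
        "max_consecutive_shifts": longest_run,
    }


# Hypothetical roster: operator A works five days in a row, one of them a
# 14-hour shift; operator B works 1, 2 and 4 January.
roster = [Shift("A", date(2024, 1, d), 12.0, 12.0) for d in (1, 2, 4, 5)]
roster.append(Shift("A", date(2024, 1, 3), 12.0, 14.0))
roster += [Shift("B", date(2024, 1, d), 12.0, 12.0) for d in (1, 2, 4)]

summary = fatigue_indicators(roster)
print(summary)
```

Counts like these describe exposure only; they do not replace the calculated fatigue and risk indices recommended above.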
Organizational Change
This is an area that is often overlooked, but the impact of organizational changes can be significant. The most obvious impact is when the number of operators changes; however, equally important are any other changes to the organization that affect how operations are run. For example, a change in supervisory structure or even a new control location can have a significant impact. It is therefore recommended that a formal management of organizational change (MOOC) process be developed and used as an integral part of the staffing strategy.
The performance of this system might be measured by the percentage of MOOCs that were performed accurately with all recommendations completed.
Organizational Culture
A company’s organizational culture can have a significant impact on the way things are done. For example, if mistakes are punished rather than used as an opportunity for organizational learning, this may modify the behavior of an operator in response to an abnormal situation. Performance indicators are difficult to define in this area, but there are assessment tools such as the ‘Safety Climate Survey Tool’ that can help in understanding the current situation and the impact of continuous improvement or of changes in organizational leadership.
To summarize, it is well understood that the ability of an operator to detect, diagnose and respond to abnormal situations can have a direct impact on a facility’s process safety performance. Understanding any deficiencies in the tools and management systems that affect the operator is an important part of reducing the small-impact incidents that on occasion escalate to significant industrial accidents. This paper hopefully provides an insight into how an understanding of the 10 key human factors topics can help in the development and use of a complete process safety performance program.
The overarching principle at UCDS is to help reduce the number and significance of industrial incidents caused by a poor understanding of how we as humans work. We have developed a team of experts who help clients through the minefield of folklore, standards and best practices.