Adequately Address Abnormal Operations


Disclaimer: This article was originally published in Chemical Engineering Progress by Ian Nimmo in 1995. While the content reflects the technology and practices of that time, the core principles and insights remain relevant and valuable in today’s industry.

 

The U.S. petrochemical industry alone could save up to $10 billion/year by avoiding or at least better dealing with abnormal situations.

Many process plants have proven procedures for dealing with emergencies. However, between normal operation and real emergencies is a gray area that few facilities effectively address. Most companies are aware of the risk of operator overload during such abnormal situations. Often, though, the only real response has been to improve control system alarm management so that operators do not face numerous, confusing alarms. This, however, is not enough, according to in-depth surveys that we have conducted at several plants worldwide.

 

Abnormal situation management (ASM) is a safety issue, and safety long has been a top priority to companies in the chemical process industries (CPI). I worked at ICI in the U.K. for over 20 years, and I know it gives highest priority to safety. And I see a similar emphasis in the many leading companies in the U.S. that I have visited. The OSHA 29 CFR 1910.119 Process Safety Management Standards will further reinforce this. Yet, ASM remains a problem in the global CPI.

 

Both management and the work force are struggling with this issue. They may not call it ASM, but I can guarantee that it is an issue for them. The difficulty in dealing with ASM is compounded by a lack of specific methodologies and tools, as well as metrics against which to gauge progress. (However, a recent joint government-industry initiative is directed at just this; see sidebar.)

 

To investigate and identify root causes of abnormal operations and to pinpoint best practices for preventing these situations or at least handling them most effectively, we formed a team and conducted surveys around the world, including in the U.S.A., Canada, the U.K., Europe, and Japan. We visited a variety of facilities, including gas processing plants, oil refineries, a coker, ethylene plants, polyethylene units, and steam generating stations, as well as transportation and storage facilities.

 

The team identified eight key issues:

 

  • Lack of management leadership;
  • The significant role of human errors;
  • Inadequate design of the work environment;
  • Absence of procedures for dealing with abnormal operations (as opposed to emergencies);
  • Loss of valuable information from earlier minor incidents;
  • The potential economic return;
  • Transferability of good ASM performance to other plants; and
  • The importance of teamwork and job design.

 

We’ll look at each of these in more detail, as well as what is involved in assessing the ASM at a site.

Lack of Management Leadership

The number one problem in ASM is that managers are not providing leadership and direction. This is difficult for a supplier to tell his customers, but my hero, Winston Churchill, stated: “You cannot ask us to take sides against the obvious facts of the situation.” What W.E. Deming said about quality also applies to ASM: “The… problem has started with management condoning inadequate systems. The aim of leadership is to help people and machines to do a better job. Management’s job is to improve systems, what else?”

 

It is up to management to make sure that the sources of incidents are determined and that a strategy is adopted to reduce and eliminate them; then, metrics can be derived to monitor progress. This represents a fundamental change and requires committed senior management taking the initiative and becoming champions of the ASM program and eventually evangelists to the work force.

 

The Significant Role of Human Errors

Our survey found that human errors cause many abnormal operations (typically 40%) and that batch plants pose higher risks. It also showed that the response to such situations too often is a passing of the buck, which hinders finding the true root cause and can lead to escalation of the problem.

 

Our findings of the significant role of human errors square with those cited by the Chemical Manufacturers Association (CMA): Historically, managers… have found human errors to be significant factors in almost every quality problem, production outage, or accident at their facilities. One study of 190 accidents in chemical facilities found the top four causes were insufficient knowledge (34%), design errors (32%), procedure errors (24%), and operator errors (16%). A study of accidents in the petrochemical and refining units identified the following causes: equipment and design failures (41%), operator and maintenance errors (41%), inadequate or improper procedures (11%), inadequate or improper inspections (5%), and miscellaneous causes (2%). In systems where a high degree of hardware redundancy minimizes consequences of single component failures, human errors may comprise over 90% of the system failure probability (1).

 

At one plant we surveyed, there were 240 preventable plant shutdowns recorded for one year. The operations manager largely blamed these on equipment reliability problems, while the engineering manager thought they were the result of operating errors. The result was that no one owned the problem and no solution was being pursued.

 

After a review of several plant documents and interviews with a wide range of plant personnel, the ASM team established the following causes at the plant:

 

  • Failure to follow procedures, 40%;
  • Mechanical and instrument problems, 31%;
  • Inadequate methodology or procedures, 23%; and
  • Internal process events, 6%.

 

The high percentage of problems caused by human error (63%) suggests that this plant has a high probability of a large and possibly catastrophic incident. This percentage is significantly greater than the level of the overall survey, in which approximately 40% of problems were caused by human errors. Of course, even 40% is far too high and needs to be dramatically reduced.
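
The 63% figure can be reproduced from the cause breakdown listed above. The short Python sketch below is only an illustration: it assumes that the two procedure-related categories are the ones counted as human error, because those are the two that add up to the quoted percentage. The variable and function names are made up for this example.

```python
# Cause breakdown reported for the surveyed plant (percent of preventable shutdowns).
causes = {
    "Failure to follow procedures": 40,
    "Mechanical and instrument problems": 31,
    "Inadequate methodology or procedures": 23,
    "Internal process events": 6,
}

# Assumption: these two categories are the ones treated as human error,
# since together they give the 63% quoted in the text.
HUMAN_ERROR_CATEGORIES = {
    "Failure to follow procedures",
    "Inadequate methodology or procedures",
}


def human_error_share(breakdown, human_categories):
    """Return the percentage of incidents that fall in human-error categories."""
    total = sum(breakdown.values())
    human = sum(v for k, v in breakdown.items() if k in human_categories)
    return 100 * human / total


share = human_error_share(causes, HUMAN_ERROR_CATEGORIES)
print(f"Human-error share: {share:.0f}%")
```

The same function can be applied to any plant's own cause breakdown to compare it with the roughly 40% level found across the overall survey.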

 

Yet, if someone had suggested to this plant’s personnel before the ASM study that most of their problems stem from poor human performance analysis and a lack of definition during process hazard analysis, I do not think they would have accepted that statement and the need for internal cultural changes that it implies.

 

Other plants within this company did not have the same problems; mechanical/instrument reliability problems accounted for around 50% of abnormal situations, and process events caused 30-40% of errors. Why such a difference? The plant with the high human-error contribution was a batch process plant, and the site study team observed characteristics that are unique to batch operations:

 

  1. Poor understanding of processing conditions for different products is a major contributing factor to abnormal situations.
  2. Frequent product changes lead to contamination, workarounds, and inconsistent production runs.
  3. As complexity is added to equipment, better training techniques and different methods of following procedures are required but not necessarily provided.

 

The other plants operated continuous processes, with little in the way of operator intervention. Once at target, the processes can run for months and sometimes years at predetermined grades, with operator intervention being initiated only by process and equipment-deviation alarms.

 

Hence, trying to develop and implement company standards may not always be appropriate unless they take into consideration the specific work situations. What is normal on a batch process will not be normal for a continuous one, and vice versa; abnormal conditions also will be unique. It is important for each plant’s personnel to have a clear understanding of normal and abnormal conditions related to their specific operation.

 

The design of the control room can affect a team’s ability to perform to the standards expected. Control rooms tend to evolve over many years, and numerous mistakes are introduced, especially during instrumentation revamping projects. Too often, ergonomics is ignored completely; control rooms often are lucky even to get an investment in new lighting.

 

Inadequate Design of the Work Environment

Today, thanks largely to advances in instrumentation and control, most plants run better and safer (though we must never forget that programmable electronic systems pose their own peculiar risks (2)); they also have fewer, well-trained staff. Some might argue that fewer people mean fewer human errors and, hence, fewer incidents. Maybe not. Staffing levels should be based on a combination of task analysis and reliability analysis focused on improving human performance. These should be verified in an additional stage of the process hazard analysis.

 

CMA notes: “The vast majority (80-85%) of human errors primarily result from the design of the work situation (the tasks, the equipment, and environment), which managers directly control” (3).

Until recently, little attention has been given to understanding the issues regarding performance during normal vs. abnormal situations.

 

Absence of Procedures for Abnormal Situations

The ASM study team determined that operators work within a simple framework that has three main areas: normal, abnormal, and emergency operations. (See Figure 1.)

 

The diagram shows that the operator is driven by management goals that start with “Keep Normal.” The operator’s task is to prevent and react to deviations. This is done by monitoring, testing, and responding to process and equipment alarms. The goals include safety, environmental, quality, economic, and productivity targets.

 

As an event occurs, the operations goals are modified dynamically and automatically to “Return to Normal.” Success, however, depends on response time and the actions taken. On some occasions, operators may have to manually intervene.

 

If the incident escalates, the goals again change: the operator may sacrifice lower priorities to achieve “Bring to Safe State.” The operator often is supplemented by an automatic shutdown system and other safety devices. Many processes still require a considerable amount of manual intervention during this phase. The operator frequently is faced with weighing unit shutdown against plant shutdown; the consequences are balanced against goals, risks, and operator/supervisor judgment.

 

In worst-case scenarios, the containment systems may not be adequate, and the operator’s goals again change to “Minimize Impact.” This involves implementing emergency response procedures, which may include first aid, firefighting, and evacuation.
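
The progression of goals described above can be sketched as a small decision function. This is only an illustration of the framework as described in the text, not a reproduction of Figure 1; the names `OperatorGoal` and `goal_for`, and the three yes/no inputs, are assumptions made for the example.

```python
from enum import IntEnum


class OperatorGoal(IntEnum):
    """The four goals named in the text, ordered by severity."""
    KEEP_NORMAL = 0
    RETURN_TO_NORMAL = 1
    BRING_TO_SAFE_STATE = 2
    MINIMIZE_IMPACT = 3


def goal_for(event_active: bool, escalating: bool, containment_lost: bool) -> OperatorGoal:
    """Pick the governing goal; the most severe condition always wins.

    The three flags are a simplification assumed for this sketch. In a real
    plant, each would be a judgment made by operators and supervisors.
    """
    if containment_lost:
        return OperatorGoal.MINIMIZE_IMPACT
    if escalating:
        return OperatorGoal.BRING_TO_SAFE_STATE
    if event_active:
        return OperatorGoal.RETURN_TO_NORMAL
    return OperatorGoal.KEEP_NORMAL


# A disturbance that has started but not yet escalated:
print(goal_for(event_active=True, escalating=False, containment_lost=False).name)
```

Checking the most severe condition first mirrors the point that lower priorities are sacrificed as an incident escalates.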

 

Our studies have revealed that plants typically have well-defined normal operating procedures; very basic abnormal operating procedures, such as for shutdown; and very good emergency planning and response procedures. We, however, have seen very little in the way of procedures for “Return to Normal” and operating under abnormal conditions. We also have found that little or no technology exists for coping when between normal and out-of-control operation; diagnosis and recovery can be difficult because of process dynamics and the need for speedy response. Many operators have stated that controls and procedures are inadequate during this difficult operation.

 

Loss of Valuable Information from Earlier Minor Incidents

Previous events often can provide insights on an abnormal situation. Unfortunately, such institutional knowledge frequently is buried, not shared, or not used effectively. A set of circumstances occurs and causes an incident; months or years later, the same or a related event recurs, but the people involved are unaware of the lessons learned in the past. Many times, correlations between the two events are discovered only during an incident investigation.

 

This problem is more pronounced today because of large and rapid changes in the work force. Engineers often spend only one or two years at a given plant. Capable operators can achieve promotion more rapidly. Outsourcing of maintenance hampers a plant staff’s familiarity with equipment, long-term life-cycle models, as well as reliability and integrity issues.

Other useful knowledge and information often are hidden in operators’ logbooks, individuals’ notes, and incident investigation reports. Frequently, details on only the costly or near-miss incidents get widely circulated; the minor incidents often are not reported to all personnel.

 

Potential Economic Return

Our plant surveys show that some companies are getting a significant payback from good ASM practices. Unfortunately, though, given today’s business pressures, some of these practices may be difficult on the surface to justify. Yet, in the infrequent times when things really go wrong, they are of inestimable value and can justify their existence many times over, even over years of normal operation.

 

For instance, one site that we visited provided an extra person on each shift. That person, who could do any of the process operations, helped during shift rotation to train operators having new work responsibility and handling new technology. In addition, during upsets, that person kept an overview of the whole operation, to ensure that the proper person was following the right procedures and that no steps were missed. We witnessed the person monitoring a minor disturbance and checking for procedural errors. That person later led team members in strategizing and evaluating consequences using what-if scenarios. This proactive diagnosis prepared the operations team for any future consequence and often eliminated potential escalation of problems. Yet, this person was cut because of economic pressures and a lack of understanding by decision-makers of the significant but hard-to-quantify value of the position.

 

Other practices, however, such as designing for abnormal, as well as normal, operation, have clear and understandable benefits.

 

Abnormal situations are defined here as the development of nonoptimal conditions that the automatic control equipment cannot cope with and that, thus, require human intervention. Most such situations are quickly and efficiently dealt with by plant personnel. Some, though, result in poor-quality product that must be discarded or reprocessed, schedule delays, decreased process efficiency, and other real operational costs. A small percentage of abnormal situations mandate a process shutdown, leading to interruption of business and disturbances in upstream and downstream business operations. And a tiny fraction cause significant equipment damage, release of undesirable materials into the environment, and even human injury or death.

 

The ASM consortium (see sidebar) believes that the current cost of such disruptions exceeds $16 billion/year for the U.S. petrochemical industry alone; clearly, for the CPI as a whole, the figure is far higher. This estimate does not include important but indirect costs, such as environmental damage, human injury, and the impact on quality of life and quality of employment in and near plants. Several elements contribute to this cost:

 

Damage to process equipment, surrounding communities, and the environment. Plant damage figures associated with accidents are relatively easy to identify. According to insurance industry figures, there have been over 550 major accidents at U.S. petrochemical plants (each involving damage exceeding $500,000) over the last five years, with total equipment damage costs of $12.9 billion. Data from petrochemical and insurance industry sources indicate that the total cost of smaller incidents is at least the same order of magnitude as the cost of the larger ones. We estimate the cost just for petrochemical incidents to the U.S. economy as $3.8 billion/year.

 

Claims for death or injuries. Compared to many heavy industries, the CPI has an exemplary safety record. Lost-time accidents, therefore, represent a comparatively minor economic impact, and costs due to injuries resulting from abnormal situations are less significant still. While the potential for extremely severe impact, though extremely unlikely, always is present, we estimate the economic impact of this factor to be relatively insignificant on an actuarial basis.

 

Loss of production from damaged process equipment and other operational impacts of accidents. The insurance analysis cited above puts the costs of business interruption at 1.5 to 3.5 times that of plant damage. And these costs are becoming much more significant as industry continues to consolidate production into fewer, more efficient facilities. Even in the recent period of relative overcapacity, the impact of lost production has been significant because of today’s complex feedstock supply and production routing relationships. Assuming a currently conservative ratio of loss of production to equipment damage of 2.5:1, the annual cost of loss of production to the U.S. petrochemical industry is about $9.5 billion/year.

 

Losses due to inefficiencies not caused by or resulting in equipment damage. Many processes can achieve very high levels of efficiency with advanced process control. Unfortunately, advanced control techniques and the complexities of modern processes have taxed the capabilities of plant personnel to respond effectively to disruptions. Processes, therefore, rarely run for sustained periods at their designed maximum levels of efficiency. Sometimes, plants deliberately are run at lower-than-maximum levels of efficiency to provide a safety margin for operations personnel. Because the capital equipment, operational staff, raw materials, and most other process costs already are paid for, any increase in efficiency would directly impact a plant’s bottom line. Efficiency gains often are expressed in the petrochemical industry as a percentage of the cost of feed. Industry sources believe that a 3% gain is readily attainable; this represents $2.7 billion/year of additional earnings just for these plants.

 

Insurance, training, and other operational costs. The current level of petrochemical losses is reflected in insurance coverage that costs more, has higher deductibles, and is more difficult to obtain, even for companies with loss-free records. The costs of training plant personnel to efficiently operate today’s complex processes are higher than they would be if the process control systems provided more support for operators, particularly during infrequent activities such as startup or shutdown. While these costs are significant, they are difficult to estimate and so are excluded from this analysis. Also excluded are costs of complying with environmental, and health and safety regulations.

 

We estimate the total cost to the U.S. petrochemical industry of process upsets to be at least $16 billion/year, equal to the current earnings of the entire industry. Preventable process, people, and operational causes, we reckon, are responsible for 64% of the total, or about $10 billion/year, which, given trickle-down effects, results in a $20 billion/year net impact on the overall U.S. economy (4).
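
The headline figures can be checked against the component estimates quoted in this section. The Python sketch below is a reader's cross-check, not the consortium's own calculation: it assumes the $16 billion total is simply the sum of the three quantified elements (equipment damage, lost production, and forgone efficiency gains), which is how the quoted numbers happen to add up. All variable names are made up for the example.

```python
# Figures quoted in the text, in billions of U.S. dollars per year.
equipment_damage = 3.8          # petrochemical incident damage
loss_to_damage_ratio = 2.5      # "currently conservative" lost-production ratio
efficiency_gain = 2.7           # earnings from a 3% efficiency gain
preventable_fraction = 0.64     # share attributed to preventable causes

# Lost production follows from the stated 2.5:1 ratio.
lost_production = equipment_damage * loss_to_damage_ratio

# Assumption: the total is the sum of the three quantified elements.
total_cost = equipment_damage + lost_production + efficiency_gain
preventable_cost = preventable_fraction * total_cost

print(f"Lost production:  ${lost_production:.1f} billion/year")
print(f"Total cost:       ${total_cost:.1f} billion/year")
print(f"Preventable part: ${preventable_cost:.1f} billion/year")
```

Under these assumptions the components reproduce the $9.5 billion, $16 billion, and roughly $10 billion figures given in the text. The $20 billion net-impact figure depends on a trickle-down multiplier that the article does not spell out, so it is not modeled here.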

 

Transferability of Good ASM Performance to Other Plants

Effective ASM techniques have wide applicability. So, the ASM team has developed an evaluation and feedback mechanism that compares a plant’s performance to best practices for preventing and responding to abnormal situations. We assess training; incident investigation and corrective-action processes; shift rotation and its interaction with training; design of the control system; control-room ergonomics; and communications from site to site, plant to plant, discipline to discipline, and person to person. The study also looks at integration of technology, learning and knowledge capture, improving human performance, and removal of identified human-factor problems.

 

The Importance of Teamwork and Job Design

It usually is not the initial event that causes a catastrophic or economically disastrous incident, but more often insufficient time to respond, poor diagnosis, lack of knowledge, incorrect action, or poor use of resources, leading to overcommitment of individuals. Many recent incidents started quite innocently as a plant shutdown due to, say, an environmental disturbance such as an electrical storm. During restart, the operators were overwhelmed by the workload introduced by nonessential alarm reporting, fast-changing process conditions, and intermittent equipment failures. These factors, rather than the electrical storm, caused the actual problems associated with the event.

 

This underscores that successful ASM requires strong teamwork and good communication throughout the plant.

 

The work environment can play a big part in the operator’s success in managing abnormal situations. The plant’s organizational structure needs to provide excellent problem-solving capability, speed of judgment, broad participation, cohesion and consensus, flexibility, as well as individual and group productivity.

 

The nature of the shift system can be extremely important. At one plant with better ASM performance, a workable rotation system allowed operators to continuously improve their overall experience and perform multiple jobs. There was a strong emphasis on teaming skills and joint problem solving, as well as effective leadership by the shift team leader. At many sites we visited, however, the shift system used penalized the work force and often caused fatigue. The handover training needs to be flexible to enable every operator to competently learn new skills and ultimately perform his or her duties as well as the best operator at the plant.

 

Teamwork and empowerment of all personnel (not just operators) need to be a specific goal of the plant. Often, we found that management had a stronger and more unified link with operations than with maintenance and other groups. The maintenance and engineering groups many times have a very good relationship with operations but are hindered by their lack of empowerment.

 

Overcoming the lack of teamwork and poor communication among various plant work groups is a major factor in preventing and resolving abnormal situations. Teaming encourages and enforces ownership of problems. The total environment (from management leadership style to work facilities, job satisfaction, and benefits) contributes to high individual ownership and personal productivity. A revolving administrative role for operators (allowing each in turn to prepare reports, fill in timesheets, and deal with other daily issues normally handled by a shift supervisor) can give each of them exposure to the management arena, as well as a sense of ownership of the shift team.

 

We did see excellent collaboration between cross-functional groups at individual plants, but are concerned by the lack of collaboration between sites, especially for plants with identical processes and similar equipment. While the competitive spirit enforced by management is healthy, lack of accountability across the site has a negative effect on site productivity.

 

Assessing a Plant’s ASM

We have found that a team can get a good measure of a plant’s ASM in a matter of a few days. The team needs, however, to understand a plant’s organization, processes, and operations to effectively conduct a survey. This invariably requires that a particular person at the plant be made the site-visit coordinator. Prior to the visit, the team should obtain and review relevant plant documents, such as:

 

  • Process flow diagrams
  • Plant structure (major processes and interconnections)
  • Incident response
  • Hazard and operability (HAZOP) reports
  • Operating manuals
  • Training manuals
  • Operating displays

 

Then, the team should work out a preliminary schedule and agenda with the coordinator.

 

The ASM activities within a plant involve many different individuals from various job classes, including board operators, field operators, supervisors, instrument and control engineers, operating engineers, safety engineers, maintenance engineers and technicians, training supervisors, and applications and system developers. The site coordinator should identify the specific individuals to interview.

 

Discussions usually are more productive if these individuals think about the following issues before the team meets them:

 

  1. The role of plant operations and staff;
  2. The nature of abnormal situations, and tools and capabilities needed to improve management of situations;
  3. The adequacy of the existing distributed control system (DCS) implementation;
  4. The general control-room environment (ventilation, lighting, noise, access, congestion, and so on);
  5. The specific workspace that each individual operator has in the control room (ergonomics of the keyboard, position of the video-display unit, space for books, and the like); and
  6. Limitations in control-room communication.

 

After an introductory meeting, we then typically spend an hour confidentially interviewing each individual. We identify the individual’s role, responsibilities, and impact on ASM. In addition, we seek that person’s perspective on the strengths and weaknesses of current ASM practices and supporting technologies. To allay concerns of some individuals, we deliberately emphasize that our goal is not to replace people with increased automation but to identify solution concepts and best practices that will result in enhanced human-system performance.

 

We next observe operations from the control room and review plant documentation.

 

The team then writes a report summarizing its observations and the critical issues identified, as well as specific recommendations for improving ASM. After getting feedback on this report from the plant, the team revises it accordingly.

 

Defining What Is Abnormal

The first step in ASM is to define what really is abnormal. The second step is to ensure that everyone understands the difference between normal and abnormal, and the root causes of abnormal events. The third step is to be aware of current practices that support ASM, and the procedures, practices, and techniques used to respond to abnormal conditions.

 

Site studies frequently uncover issues associated with communications. Human-factor issues are common problems because industry has evolved to meet immediate needs rather than changing according to a structured design. Lack of integration of equipment still causes many difficulties, even after computer integrated manufacturing, with its islands of technology, has come and gone. The most common problems, however, are associated with the design of the work situation (that is, the tasks, equipment, and environment).

 

Interviews with plant operations personnel inevitably reveal that individual perceptions of the nature and causes of abnormal situations vary. These diverse opinions reflect a general lack of industry-wide understanding of the sources of abnormal situations, and their impact on plant productivity. A significant result of the varied opinions is the development of multiple uncoordinated initiatives to address the symptoms of a problem. Very few plant personnel can give a clear definition of typical abnormal situations, although these can be easily extracted from incident and quality reports, and the operators’ own logs.

A common thread, though, relates to the ineffectiveness of control systems during abnormal conditions and the need for operators to manually intervene. Operators often state that procedures addressing abnormal conditions do not exist. Even in those plants that do have formal procedures, operators frequently note that the time-critical nature of abnormal situations makes it impractical or impossible to find and review procedures when they are needed.

 

Determining the Root Causes

Personnel interviewed in site studies generally could not clearly identify the root cause of recorded incidents. By reviewing over two years of incident reports, and operator and maintenance log records, combined with personnel interviews, the ASM team initially ascribed 30% of incidents to equipment failure. Process problems, such as operating beyond design limits, process design flaws, tower flooding, and the like, represented about 20% of the root causes. The rest were attributed to people and work context factors.

 

After careful examination, however, the ASM team concluded that over 50% of the equipment failures resulted from some form of human error. Typical errors include design flaws, procurement mistakes, incorrect maintenance, failure to follow procedures, poor management of change, and operating equipment outside specified limits.
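
To see what this reattribution implies, the initial breakdown can be adjusted with a few lines of arithmetic. The sketch below is an illustration only. It treats the “over 50%” finding as exactly 50%, so the result is a lower bound, and the combined human-related figure it produces is a derived number that does not appear in the survey itself. The dictionary keys are labels invented for the example.

```python
# Initial attribution of root causes from the text (percent of incidents).
initial = {"equipment": 30.0, "process": 20.0}
# "The rest" is attributed to people and work context factors.
initial["people_and_work_context"] = 100.0 - sum(initial.values())

# Assumption: use exactly 50% as a lower bound for "over 50%" of
# equipment failures that trace back to some form of human error.
HUMAN_FRACTION_OF_EQUIPMENT = 0.5

moved = initial["equipment"] * HUMAN_FRACTION_OF_EQUIPMENT
adjusted = {
    "equipment": initial["equipment"] - moved,
    "process": initial["process"],
    "human_related": initial["people_and_work_context"] + moved,
}

for cause, pct in adjusted.items():
    print(f"{cause}: at least {pct:.0f}%" if cause == "human_related" else f"{cause}: {pct:.0f}%")
```

Even with this conservative reading, the human-related share rises from half of all incidents to roughly two-thirds, which is why the first attribution of a root cause should not be taken at face value.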

 

Preventing Abnormal Situations

Based on its numerous site visits, our ASM team has identified several practices that can help avoid abnormal conditions.

 

Systems designed for normal and abnormal operations. Sites that have invested design dollars into consideration of abnormal operations have gained significant savings and often have prevented escalation of problems.

 

Some ethylene-cracker furnace systems that we saw provide a good example. The operator had little interaction during normal operations. During a plant disturbance, the advanced control soon became unstable and defaulted to conventional control. This meant that the operator had to take charge. On many systems designed for normal operation only, the operator would have to do mass-balance calculations, and some alignment of the automatic controls, such as adjusting setpoints and establishing control by manual interaction. Failure to react correctly could cause expensive damage to the furnace, but the operator had little guidance. In contrast, a system designed for abnormal operation would anticipate the operator’s knowledge needs and complete the calculations automatically for the operator. The system would prompt the operator with procedural instructions and help plan and anticipate the correct actions.

 

Effective operating teams. It was easy to distinguish between plants with an effective team and others in which each member acted individually. In the latter, during a disturbance, operators often isolated themselves and did not anticipate the effects of upstream and downstream processes. One plant we visited had an effective team that solved problems together during disturbances and developed what-if scenarios after any disturbance, regardless of how small, making sure that all members of the operating team and support groups understood what had just happened and what could happen. During times of quiet running, they rehearsed situations and continually refined operating instructions. They used the skills within the group to ensure the best solution was always found first. This often meant switching duties to put the more skilled or knowledgeable people where they could be most effective. The team did not just write logs and leave them for the following shift to read. Instead, they played back the day’s events and trained the next shift to ensure that a good handover was established.

 

A good preventative maintenance program. Such a program can eliminate many disturbances and break the circle of fault, resolution, and new problem introduced through stress during plant startup and shutdown.

 

A mechanism for the shift team to spot potential problems. This requires good working relationships between field operators, panel operators, and maintenance personnel. A good field operator relies on hearing, sight, and, often, intuition, to spot telltale signs of trouble. By working with other members of the plant team, the field operator frequently can identify and prevent potential problems.

 

Adequate and easily accessible written procedures. Many incidents are caused either because procedures are poor or because they are not followed. Regular use and continual improvement of procedures can instill a confidence and awareness that will remove many abnormal situations. But procedures sitting on the shelf among thousands of pages of similar data will not be used. The operators need information available in real time and right at the control system console. Providing information on a separate computer in the corner of the control room is not the answer because, with all the extra workload in coping with an abnormal situation, operators do not have the luxury of going to that computer. The operators need the information integrated into their view of the process.

 

Use of advanced control programs to eliminate difficult and time-consuming repetitive operations. Many plants do not take full advantage of the capabilities their control systems provide or give sufficient attention to adding software solutions for such tasks.

 

Effective incident-reporting, loss-prevention, and learning-experience mechanisms. A costly or potentially dangerous incident normally is dealt with very efficiently, and the follow-up actions are completed and reviewed. In contrast, smaller incidents usually are inadequately documented, and little is done in the way of reviewing the effectiveness of the solution. The incident investigation often does not review the original HAZOP notes, and rarely recommends and implements an update to the HAZOP documentation. (This may change as OSHA PSM regulation requires the HAZOP to be updated.) Yet, many small incidents, which have the potential to escalate if not resolved correctly, repeat themselves in each time period.

 

On-the-job training and role-play during steady-state operation. It is very effective to stimulate operators and allow them to exercise their knowledge. Many plants run for years and some operations are very rare. We currently have challenging problems in many plants because personnel rotation is more frequent than plant startups and shutdowns. We have seen plants with operations teams that have never experienced a shutdown (and obviously a startup) on the equipment they are controlling. Dynamic plant simulators are excellent for role playing, but usually cannot be cost justified if only normal operations are considered.

 

Efficient communications. ASM requires speedy and accurate communications. Every plant we visited had major problems with communications among control rooms and field operators. Radio systems have many blind spots, plus excessive noise during abnormal conditions, and the extra bodies talking in the control room then make communication almost impossible. In one case, a field operator calmly reported that an oil storage vessel had split, and oil was filling the bund wall. The control room operator could not understand the message but, from the tone of the remote operator's voice and the length of the message, concluded that everything was under control. As levels continued to fall, a supervisor went to investigate and found the operator desperately trying to contain the breach in the vessel. Before you harshly judge the console operator, consider that I have asked every console operator I have met around the world one common question after a series of communications: What did they say? I have discovered that most of the time the console operators have no idea but anticipate the answer based on length of sentences, key words, and the expected answer.

 

Solid, integrated equipment solutions. The studies often found that designers had gotten around equipment limitations by providing supporting equipment that was not integrated and, indeed, frequently was standalone. Unfortunately, in many cases, this included the safety instrumented system (SIS), which maintains information in the form of computer-type flags indicating the root cause of a failure. Without an integration plan, the SIS offers no direct feedback for operations personnel; instead, a technician must access the system by a programming panel that may be in a different location than the operator. So, the operator must notify the technician and wait for information. Critical alarms that can help the operator determine first-out alarms and the consequence and impact of the situation appear on a hardwired annunciator panel. Designers sometimes provide a dedicated personal-computer-based alarm annunciator panel to provide first-out display. These alarms, however, often are repeats of the DCS alarms and, because of poor ergonomic design, can cause conflict during acceptance and categorization. The DCS alarms are analyzed alongside these systems, but differences in time stamps and information missing because of sampling rates make nonsense of the diagnosis or autopsy process.
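
To illustrate why time-stamp differences matter when looking for the first-out alarm, here is a minimal sketch in Python. It is not taken from any plant or vendor system: the tag names, the event records, and the idea that each system's clock offset is known in advance are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AlarmEvent:
    source: str       # system that logged the event, e.g. "DCS" or "SIS"
    tag: str          # alarm identifier (illustrative names only)
    timestamp: float  # seconds, on the source system's own clock


def order_on_common_clock(events, clock_offsets):
    """Sort events after shifting each time stamp onto a common clock.

    clock_offsets maps a source name to the seconds to add to its
    time stamps (an assumed, known correction). The first event in
    the returned list is the candidate first-out alarm.
    """
    def common_time(event):
        return event.timestamp + clock_offsets.get(event.source, 0.0)

    return sorted(events, key=common_time)


# Made-up events from two systems that are not integrated.
events = [
    AlarmEvent("DCS", "COMPRESSOR_TRIP", 100.0),
    AlarmEvent("SIS", "LUBE_OIL_PRESS_LOW", 100.5),
    AlarmEvent("DCS", "DISCHARGE_FLOW_LOW", 101.2),
]

# Assumption: the SIS clock runs 1.0 s ahead of the DCS clock.
ordered = order_on_common_clock(events, {"SIS": -1.0})
naive = sorted(events, key=lambda event: event.timestamp)

print("first-out with clock correction:", ordered[0].tag)
print("first-out from raw time stamps: ", naive[0].tag)
```

In this made-up case the raw time stamps point to the compressor trip as the initiating event, while the corrected ones point to the low lube-oil pressure. Where the offsets are not known, or events fall between samples, no amount of sorting can recover the true order.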

 

Good ergonomics. Poor lighting, glare, excessive noise, nonstandard color coding, different keyboard designs, lack of workspace for manuals, and poor radio and phone design can hinder an operator's performance. In the worst cases, they can prompt human errors. Eliminating these and many other ergonomic problems can vastly reduce many preventable abnormal situations. Ergonomic improvements go beyond the control room and should be adopted throughout a facility. One plant we visited required an outside operator to cycle process units. It was very easy for the operator to lose concentration and stroke the wrong valve and often the wrong unit. Design of computer displays can play an important role here. One facility had two streams and differentiated them by calling one A and the other B. The schematics were almost identical apart from the odd prefix to the letter. During an abnormal condition, as many different displays are called, it is very easy to get confused and to adjust the wrong stream, especially when multiple consoles are being utilized.

 

Use of metrics to ensure progress. The only way to ensure progress is to monitor and evaluate the effectiveness of an ASM program. The site studies identified some potential inputs, such as incident reports. To be useful, these must be factual and not protect people because of internal politics. If the root cause is identified as a simple human error, it should be recorded and classified as such. Hence, the first step in developing metrics is to define types of abnormal situations, such as human error and equipment failure. Then, it is essential that all incidents, major and minor, are reported and categorized; this means that all the facts must be established at the time of the incidents. It also is important to capture the direct and associated costs of each incident. This information is required to justify investment to head off infrequent abnormal situations. As already mentioned, the techniques and equipment purchased for abnormal situation management may not offer a good return on investment for normal operation, but we must not lose sight that they often are installed as insurance.
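
As a rough illustration of the record keeping described above, the following Python sketch tallies incidents by category and totals their direct and associated costs. The field names, categories, and figures are invented for the example; a real site would define its own classification scheme.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Incident:
    """One reported abnormal situation (all fields are illustrative)."""
    category: str           # e.g. "human error" or "equipment failure"
    severity: str           # "minor" or "major"; both must be reported
    direct_cost: float      # e.g. lost production, repairs
    associated_cost: float  # e.g. investigation time, rework


def summarize(incidents):
    """Return {category: (incident count, total cost)}."""
    counts = defaultdict(int)
    costs = defaultdict(float)
    for incident in incidents:
        counts[incident.category] += 1
        costs[incident.category] += incident.direct_cost + incident.associated_cost
    return {category: (counts[category], costs[category]) for category in counts}


# Made-up records; minor incidents are logged alongside major ones.
incident_log = [
    Incident("human error", "minor", 2_000.0, 500.0),
    Incident("equipment failure", "major", 150_000.0, 20_000.0),
    Incident("human error", "minor", 1_000.0, 250.0),
]

summary = summarize(incident_log)
for category, (count, total) in sorted(summary.items()):
    print(f"{category}: {count} incident(s), total cost ${total:,.0f}")
```

Totals of this kind, tracked per period, are one possible input when justifying equipment that pays back only during abnormal situations.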

 

Managing and Responding to Abnormal Situations

It is one thing to prevent an abnormal situation and quite another to respond correctly to one. Prevention is not time-critical; it is proactive and can eliminate many common problems. Responding to abnormal situations, on the other hand, can be very complex and time critical. Normal operating systems, like control-room communications, easily can break down because of factors such as increased noise levels, loss of resources, a need to evacuate areas, and so on commonly associated with an abnormal situation.

 

Many plants, however, have adopted techniques that support and help plant personnel during these events. Again, these systems often are in place by default, and the benefits they deliver frequently are not reinforced by management. To be more effective, these tools must be put in place by design, and their value needs to be acknowledged and reinforced. A well-designed system for dealing with abnormal events will include:

 

A highly skilled operation team. During the plant studies, we witnessed several different approaches to operator job design. Some did not rotate personnel and generally put non-computer-competent personnel out in the field. This produced an experienced console operator and a brainless robot in the field who just followed radio instructions. The most effective team seen was one in which all staff had the same goals and had achieved competence at every job by an effective rotation system. This staff was motivated by the challenge of hands-on experience on the most-complex or active job, as well as regular respite from the more-repeatable, simple operations.

 

Good procedures for dealing with abnormal situations. We already have stressed the value of these. But the procedures on their own are not enough. Operators should anticipate and rehearse potential escalation of abnormal situations, so that they are comfortable and confident in how to use the procedures. Training and practicing different scenarios are essential to making the procedures effective.

Designation of a coordinator during all upset conditions. Such a person provides oversight and guidance during upsets and serves as the point person for decision-making.

 

Clear understanding of the role of manual intervention. Manual control only should be used when all other alternatives are not available. Operators need to be controlling the whole plant, not one single loop. Manual control should be supported by operating procedures and good tools, including automated guidance via the process control system.

 

Effective alarm-filtering and -suppression techniques. One of the main frustrations of today’s operators is the number and frequency of alarms during abnormal operations. The main reason for this is thoughtlessness during design. The designer only thinks of the positive aspect of the alarm. It works excellently during normal operation but may be an embarrassment during plant startup or shutdown. Today, many DCS suppliers offer techniques to suppress, re-range, silence, or program alarms so that they suit the characteristics and mode of process operation.
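
The idea of tailoring alarms to the mode of process operation can be sketched in a few lines of Python. This is only an illustration, not any DCS supplier's mechanism: the mode names, alarm tags, and configuration table are hypothetical.

```python
from enum import Enum


class Mode(Enum):
    NORMAL = "normal"
    STARTUP = "startup"
    SHUTDOWN = "shutdown"


# Hypothetical configuration: alarm tag -> modes in which it is meaningful.
ALARM_ACTIVE_MODES = {
    "FEED_FLOW_LOW": {Mode.NORMAL},
    "REACTOR_TEMP_LOW": {Mode.NORMAL},
    "REACTOR_PRESS_HIGH": {Mode.NORMAL, Mode.STARTUP, Mode.SHUTDOWN},
}


def filter_alarms(raised_tags, mode):
    """Return the raised alarms that should be shown in the given mode.

    Tags missing from the configuration are treated as active in every
    mode, so an unconfigured alarm is never silently suppressed.
    """
    all_modes = set(Mode)
    return [
        tag for tag in raised_tags
        if mode in ALARM_ACTIVE_MODES.get(tag, all_modes)
    ]


raised = ["FEED_FLOW_LOW", "REACTOR_PRESS_HIGH", "UNLISTED_ALARM"]
print(filter_alarms(raised, Mode.STARTUP))
print(filter_alarms(raised, Mode.NORMAL))
```

During startup, the low-feed-flow alarm is expected and is filtered out, while the high-pressure alarm remains visible in every mode. Defaulting unknown tags to "always shown" is a deliberate choice in this sketch: a gap in the configuration then produces a nuisance alarm rather than a hidden one.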

 

Adequate situation-assessment techniques. Expert systems are becoming very popular because of their ability to use model-based and heuristic causal reasoning to predict and prevent process and equipment failures. More-complex solutions involve dynamic simulators and fault trees. This new technology may be the first step to changing the control and operations from reactive to predictive and preventative (5).

Good communication with inter-connected facilities. This was an area where we invariably found poor techniques and the potential for improvement. One plant had four main external connections. One was for services such as steam, electricity, and instrument air; the second for raw materials; the third for a recovery system that took waste product and recycled the product; and the fourth for finished product to storage facilities. A disturbance in any of these connections had repercussions on all connected plants. Yet, during abnormal conditions, the workload made it very difficult for the operator to break off diagnosis and correction to inform the other units. An automated solution or some mechanism for the operator to launch predefined messages would have been ideal in this situation.
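
A mechanism for launching predefined messages might look something like the following Python sketch. The four connection names follow the example above; the unit names, message wording, and function are assumptions for illustration, and actual delivery (radio, DCS message, telephone) is left out.

```python
# Hypothetical mapping: external connection -> connected units to inform.
CONNECTED_UNITS = {
    "services": ["Utilities Control", "Neighboring Unit 1"],
    "raw materials": ["Feed Supply Unit"],
    "recovery": ["Recovery Unit"],
    "finished product": ["Storage and Shipping"],
}

# Predefined wording, so the operator only selects a connection and a status.
MESSAGE_TEMPLATE = (
    "Notice from {plant}: disturbance on the {connection} connection. "
    "Status: {status}. Review the effect on your unit."
)


def build_notifications(plant, connection, status):
    """Return {unit: message} for every unit tied to the given connection."""
    message = MESSAGE_TEMPLATE.format(
        plant=plant, connection=connection, status=status
    )
    return {unit: message for unit in CONNECTED_UNITS.get(connection, [])}


notifications = build_notifications(
    "Plant X", "raw materials", "feed rate reduced"
)
for unit, text in notifications.items():
    print(f"To {unit}: {text}")
```

The point of the design is that the operator makes one selection and returns to diagnosis, rather than composing and repeating a message for each connected unit.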

 

 

The Future

Once we have fixed and enhanced the existing management systems, we can contemplate addressing the missing technology area that lies between Return to Normal Operation and Bring to a Safe State. The site study can deliver significant benefits as we contemplate changing the culture from a reactive control system designed for normal and emergency situations to a predictive and preventative one that devotes significant design time to abnormal situations.

 

For the next generation technology to deliver its promises, we must first encourage CEOs to become evangelists for safety and ASM, to remove inadequate systems and practices, to insist on root-cause analysis and ASM metrics, and to understand the issues relating to human performance, ergonomic design, and elimination of human errors.

 

At the plant level, integration of operator equipment and removal of nonessential equipment is a priority. Every plant must have a human performance-improvement program that incorporates human-reliability analysis. We need human-factor expertise, and this can be achieved by developing this new discipline and by making all personnel competent in human-factor assessment. Once we achieve this, suppliers can provide technology to bridge the knowledge gap and provide the benefits of computer technology:

 

  • Fast processing of multiple systems;
  • Information and data context sensitivity;
  • Predictive diagnosis, analysis, and state estimation;
  • Multi-disciplined information retrieval and communications systems;
  • Complex calculations, and rationalization of multi-function, multitasking operations; and
  • Development of operations and control strategies based on plant objectives.

 

The future will allow complex models to be used for multiple applications, which, in turn, will provide very cost-effective return on investment.

 

Within the next five years, we should see field operators with hand-held devices that allow them to diagnose problems for themselves and bring their knowledge of the process up to the same level as the control room console operator.

 

Supervisors currently have lost their view of the process, and their process knowledge has become very poor compared to the console operator. They would gain from the return of the big overview panel that had been a fixture in control rooms; technology soon may permit a programmable version with projected images and touch-sensitive navigation and zooming.  

 

 

 

Literature Cited

Cochran, E., and R. Duncan, [Human] Supervisory Control & Decision Support: State-of-the-Art, presented at Intelligent Systems in Process Engineering Conference, Snowmass, CO (July 1995).

Crowe, E.R., and C.A. Vassiliadis, Artificial Intelligence: Starting to Realize Its Practical Promise, Chem. Eng. Progress, 91 (1), pp. 22-31 (Jan. 1995).

Lorenzo, D.K., A Manager's Guide to Reducing Human Error, Chemical Manufacturers Association, Washington, DC, p. 1 (1990).

Lorenzo, D.K., A Manager's Guide to Reducing Human Error, Chemical Manufacturers Association, Washington, DC, p. 11 (1990).

Nimmo, I., Extend HAZOP to Computer Control Systems, Chem. Eng. Progress, 90 (10), pp. 32-44 (Oct. 1994).

 

[Sidebar:]

Consortium Targets ASM

A government-industry initiative launched last November aims to cut the total costs of preventable abnormal situations in CPI plants by developing better ways to inform operators of problems, improved tools to help operators deal with them, and enhanced methods to prevent abnormal situations in the first place.

 

The almost-$20-million program is jointly funded by the U.S. Government's National Institute of Standards and Technology, and the ASM Joint Research and Development Consortium. This group consists of Amoco, BP, Chevron, Exxon, Mobil, Novacor, Shell, and Texaco; software developers Applied Training Resources and Gensym; and Honeywell, which administers the program.

 

The goal of the consortium's program is to reduce the level of preventable losses (about $10 billion/year at U.S. petrochemical operations alone) by 90%. The consortium also intends to cut by at least 10% the losses due to abrupt equipment failure, lightning strikes, and other unpreventable upsets in which resulting damages can be better mitigated by operations personnel; these losses total another $500 million/year for domestic petrochemical producers.

 

A prototype system, called AEGIS (for Abnormal Event Guidance and Information System), will incorporate technical innovations in software, and system architecture and customization. In the event of an abnormal situation, AEGIS will assist plant personnel in restoring the process to normal operation and minimizing the severity of possible accidents.

 

Used here with express permission of the American Institute of Chemical Engineers. 

Copyright ©1995 AICHE All rights reserved.