Asset Reliability Transformation
A detailed process that guides an organization through the establishment, definition, and implementation of a reliability and performance improvement program.
Understand what is most important to the business so that every activity is aligned with the goals of the business. Assess the current state, evaluate the current strengths and weaknesses, establish (conservative) goals, establish a business case, and gain enthusiastic senior management support.
Establish the team to lead the implementation. Develop the plan to implement the strategy. Develop the first pass of the maintenance strategy.
Identify and develop leaders. Create awareness and enthusiasm in every person within the organization. Clarify roles and responsibilities. Educate, upskill, and certify. Encourage everyone to contribute and constantly communicate the achievements (and failures).
Stabilize and regain control of the maintenance department in order to establish the fundamentals and minimize reactive maintenance.
Ensure that every step between the original project concept and the acquisition and transportation of equipment into the plant guarantees high reliability and performance and thus the lowest total cost of ownership.
Ensure there is one way to do everything, and everyone does it that one way: workflows, procedures, standards, and management of change.
Proactively care for all assets so they deliver optimum reliability and performance: clean, smooth, tight, and well lubricated.
Acquire and analyze data so that opportunities for improvement are identified and all decisions are aligned with the goals of the business: CBM, KPIs, OEE, economics, etc.
Learn from failures, poor performance, and near misses so that root causes are eliminated and failures are not repeated. (EOL=End of Life)
Continually seek to improve every aspect of the reliability and performance improvement strategy.
The ART process is broken into 10 key phases. The VALUE, STRATEGY, and PEOPLE phases establish the fundamentals: set targets, develop a plan, and establish a culture of reliability. The CONTROL phase overcomes the reactive maintenance barrier. DISCIPLINE, CARE, ANALYTICS, and OPTIMIZE form the cycle of reliability: we do everything one way, care for our assets, make data-driven decisions, and constantly improve. With the addition of the ACQUIRE and EOL phases, we have the entire life cycle: from the acquisition of the asset to the disposal of the asset.
Step within a Phase
The phases are broken up into 65 major steps. While some steps are limited in scope, many define significant enterprises: establishing planning and scheduling, performing condition monitoring, and so on.
Recommended Practice within a Step
In order to provide definable tasks and greater resolution into what is required, we have defined multiple “recommended practices” within each step. The recommended practices provide the guidance necessary for you to be successful.
The process of closely examining the goals of the organization so that it is clearly understood how a reliability improvement initiative can add value to the organization. Every task performed should help the organization achieve its goals – there should be a clear line of sight between a person’s role and the business goals.
Business Process Review
Asset Reliability Practitioner
An accredited training and certification program developed by Mobius Institute for practitioners and leaders involved with reliability improvement.
ARP Reliability Advocate
A training and certification program intended for people who need detailed awareness of how to improve reliability of physical assets in an organization.
ARP Reliability Engineer
A training and certification program intended for reliability engineers who must deal with the technical aspects of reliability improvement, including reliability data analysis, maintenance strategy development, condition monitoring, and precision maintenance.
ARP Reliability Program Leader
A training and certification program intended for senior reliability engineers or managers who must develop the business case and then define the overall strategy and implement the program. They must deal with the economics, strategy, and organizational culture.
Asset Criticality Ranking
The process of developing a criticality score based on an asset’s consequence of failure, an asset’s likelihood of developing a fault condition, and the likelihood of detecting the onset of failure. The score is used to rank the assets from most critical to least critical so that activities can be prioritized.
Criticality analysis can help us set priorities, so it is first used in the VALUE phase.
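As a minimal sketch of how such a ranking might be computed, the Python below combines the three factors for each asset and sorts the result. The 1-to-10 scales, the choice to multiply the factors, and the asset names and scores are all assumptions made for illustration; the definition above only says that the three factors produce a score.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Asset:
    name: str
    consequence: int        # 1 (minor) .. 10 (severe) consequence of failure
    likelihood: int         # 1 (rare) .. 10 (frequent) chance of a fault developing
    detect_difficulty: int  # 1 (easily detected) .. 10 (almost never detected)


def criticality(asset: Asset) -> int:
    """Combine the three factors into one score (assumed rule: multiply)."""
    return asset.consequence * asset.likelihood * asset.detect_difficulty


def rank_assets(assets: list[Asset]) -> list[tuple[str, int]]:
    """Return (name, score) pairs ordered from most to least critical."""
    scored = [(a.name, criticality(a)) for a in assets]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical assets and scores, for illustration only.
assets = [
    Asset("EXHAUST FAN #1", consequence=8, likelihood=4, detect_difficulty=3),
    Asset("COOLING PUMP #2", consequence=5, likelihood=6, detect_difficulty=2),
    Asset("CONVEYOR DRIVE", consequence=9, likelihood=3, detect_difficulty=7),
]
ranking = rank_assets(assets)
for name, score in ranking:
    print(f"{name}: {score}")
```

Note that detection is scored as difficulty, so an asset whose faults are hard to detect ranks higher than an otherwise identical asset whose faults are easy to detect.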
Asset Opportunity Ranking
Whereas the asset criticality ranking generates a list of assets ranked from those that pose the greatest risk to those that pose the least risk, an asset opportunity ranking generates a list of assets ranked from those that present the greatest opportunity to produce a higher volume of quality product (or the provision of improved service) to those that present the least opportunity.
You will never read about the asset opportunity ranking in any other book. But the author believes that everyone involved with reliability and performance improvement must not only think about minimizing risk of failure; they must also think about maximizing the performance of the organization.
For example, you may have a piece of equipment that is quite reliable (i.e., it rarely fails), but unless it is adjusted correctly the quality of the product can be affected. By applying the appropriate brainpower, we may find ways to ensure that it always produces first-pass high quality and thus increases production output (reduces waste) and improves customer satisfaction.
Opportunity analysis can also help us set priorities, so it is first used in the VALUE phase.
MAINTENANCE STRATEGY DEVELOPMENT
Reliability Centered Maintenance
A detailed process that follows seven main steps in order to develop a detailed maintenance strategy. The strategy may comprise hidden failure finding tasks, condition-based maintenance tasks, interval-based maintenance tasks, or the decision not to take any proactive/preventive steps to prevent an asset from failing or a given failure mode from occurring. A proactive RCM process will also identify actions that can be taken to reduce the likelihood of failure.
RCM can be used to develop the maintenance strategy, which begins in the STRATEGY phase.
Failure Modes and Effects Analysis
FMEA is a process where each and every failure mode of a piece of equipment is analyzed in order to assess the risk of failure and determine what action must be taken. During the design process, FMEA is used to identify opportunities to improve the design. As part of a reliability improvement strategy, FMEA is used to design the maintenance strategy, and it should be used to identify proactive steps that can be taken to reduce the risks of failure.
FMEA can be used to develop the maintenance strategy, which begins in the STRATEGY phase.
Failure Modes, Effects and Criticality Analysis
For all intents and purposes, FMECA is the same as FMEA.
Risk Priority Number
The risk priority number is used in association with the FMECA process. It is like having an asset criticality ranking for each and every failure mode. For each failure mode we create a score of the severity of the failure, the likelihood of the failure mode occurring, and the likelihood of detecting the onset of failure. Each element is given a score from 0 to 1 and they are multiplied together. An RPN of 1 means that there will be terrible consequences if failure occurs, the onset of failure is almost certain, and it is highly unlikely that we will detect the onset of failure.
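The calculation described above can be sketched in a few lines of Python. The function and parameter names are hypothetical, as are the example scores; the only rules taken from the text are that each element is scored from 0 to 1 and that the three are multiplied. The detection element is named `non_detection` here because, per the text, a score of 1 means the onset of failure is highly unlikely to be detected.

```python
def risk_priority_number(severity: float, occurrence: float, non_detection: float) -> float:
    """Multiply the three 0-to-1 scores for one failure mode.

    severity:      0 (no consequence) .. 1 (terrible consequences)
    occurrence:    0 (will not occur) .. 1 (almost certain to occur)
    non_detection: 0 (always detected) .. 1 (highly unlikely to be detected)
    """
    scores = {"severity": severity, "occurrence": occurrence, "non_detection": non_detection}
    for name, value in scores.items():
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be between 0 and 1, got {value}")
    return severity * occurrence * non_detection


# Hypothetical failure modes for one asset, for illustration only.
bearing_wear = risk_priority_number(severity=0.8, occurrence=0.5, non_detection=0.25)
worst_case = risk_priority_number(severity=1.0, occurrence=1.0, non_detection=1.0)
print(f"bearing wear RPN: {bearing_wear:.2f}")
print(f"worst case RPN:   {worst_case:.2f}")
```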
Planned Maintenance Optimization
A process of reviewing every existing PM in order to determine whether it should be performed in the future. A PM may be eliminated because it is found to be redundant, unnecessary, and/or harmful to the equipment. In a proactive environment, PMO can be used to identify the PMs that should be performed; however, at that point it overlaps with RCM.
PMO can be used to establish the maintenance strategy; however, it is particularly effective when seeking to reduce the volume of reactive maintenance, and thus it is in the CONTROL phase.
Planned Maintenance task
A maintenance task that has (hopefully) been well defined, which is scheduled in order to repair or replace a component or piece of equipment, or to perform an inspection or condition monitoring test.
Interval-Based Maintenance
You may not often come across IBM; however, it is used in ARP and ART training and documentation to refer to interval-based maintenance: the repair or replacement tasks performed at a predefined interval. The interval may be defined by equipment running hours, number of days, distance traveled (for vehicles, trains, mining equipment, etc.), production cycles, or some other interval. The assumption is that it has been proven that the maintenance task will always be required after that interval, and it is typically also proven that the need for the maintenance task cannot be determined more cost-effectively with a condition monitoring test or inspection.
The term “preventive maintenance” is commonly used to mean the same thing; however, as discussed below, preventive maintenance can often include interval-based maintenance and condition-based maintenance.
We identify the need to perform interval-based maintenance tasks when the maintenance strategy is developed. That process begins in the STRATEGY phase.
Run-to-Fail
The strategy which recognizes that the costs associated with condition monitoring or interval-based maintenance cannot be justified because of the low criticality of the equipment or the low likelihood of a specific failure mode. Sometimes the most cost-effective strategy is to eliminate costly preventive/proactive tasks and accept that the equipment may fail without warning.
We identify the need to adopt Run-to-Fail as a strategy when the maintenance strategy is developed. That process begins in the STRATEGY phase.
Hidden Failure Finding Task
If it is possible that a component or piece of equipment may fail without detection (i.e., we do not know that failure has occurred), but that component or piece of equipment may be called upon to operate in the future (particularly during an emergency situation), then a test or inspection may be performed in order to detect the “hidden failure” before the consequences of that failure will be experienced.
We identify the hidden failure finding tasks when the maintenance strategy is developed. That process begins in the STRATEGY phase.
Standard Operating Procedure
Clearly documented procedures on exactly how to operate a piece of equipment: starting, running, and stopping. This ensures that the equipment is not overly stressed, that it achieves optimal performance, and that everyone operates it in the same way.
Standard Operating Procedures are established in the DISCIPLINE phase and utilized in the CARE phase.
Integrity Operating Window
Equipment should have upper and lower limits specified which define the range within which it should operate to ensure the safe and reliable operation of the equipment. These limits define the integrity operating window.
Integrity Operating Windows are a part of the ANALYTICS phase.
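A minimal sketch of checking readings against an integrity operating window is shown below. The class name, the monitored parameter, the limits, and the readings are all hypothetical; real windows would come from design data and operating experience.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OperatingWindow:
    parameter: str
    lower: float  # lower limit of safe and reliable operation
    upper: float  # upper limit of safe and reliable operation

    def contains(self, value: float) -> bool:
        """True when the value lies inside the window (limits included)."""
        return self.lower <= value <= self.upper


def excursions(window: OperatingWindow, readings: list[float]) -> list[float]:
    """Return the readings that fall outside the window, in their original order."""
    return [r for r in readings if not window.contains(r)]


# Hypothetical window and readings, for illustration only.
window = OperatingWindow("discharge pressure (bar)", lower=4.0, upper=6.5)
readings = [5.1, 6.4, 6.9, 3.8, 5.0]
out_of_window = excursions(window, readings)
print(f"{window.parameter}: {len(out_of_window)} excursion(s): {out_of_window}")
```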
Bill of Materials
The bill of materials is a list of all of the components and parts required to build, maintain, or repair a piece of equipment.
A Bill of Materials is a core ingredient in the DISCIPLINE phase.
Management of Change
The phrase “management of change” (and “change management”) is often confused with “culture change.” In our context, it is the process of ensuring that if any procedures are changed, if the design of the equipment or system is changed, or if there are any other changes that other people involved with the purchase, engineering, maintenance, or operation of the equipment should know about, then the changes are documented. There should be a management system in place so that changes are approved, documented, and communicated.
Management of Change is a key part of the DISCIPLINE phase.
Operator Driven Reliability
Also known as “operator care,” the aim is to engage operators in basic maintenance tasks (cleaning, inspections, etc.) and basic condition monitoring tasks (vibration overall readings, ultrasound tests, temperature readings, etc.). The aims are to reduce the burden on uniquely skilled maintenance technicians, to provide a heads-up to the condition monitoring team, and to give the operators an opportunity to contribute to the reliability of the equipment they operate.
ODR is a part of the CARE phase.
Root Cause Analysis
The most basic definition of root cause analysis is the determination of the root cause(s) of an undesirable situation (more on this point in a moment). A number of techniques can be used, such as “fishbone” (Ishikawa) diagrams, “5 whys,” and fault tree analysis.
The author prefers to expand the definition, although it may then overlap with FRACAS, discussed below.
The author prefers to think of RCA as the process of determining why an item (piece of equipment, component, etc.) either catastrophically failed, functionally failed, failed to perform at the desired level (including all forms of waste), or resulted in a safety or environmental incident or “near miss.” The RCA process should answer the following questions: what is the nature of the failure, how important is it that the failure is prevented from reoccurring, what is the root cause of the failure, and what proactive task(s) can be performed so that the failure is not repeated. The next steps should be economic analysis to determine the viability of implementing the task(s) and, if viable, the implementation of the task(s) and verification that the task was effective at eliminating future failures.
RCA is a key reliability improvement tool and is the main focus of the EOL phase.
Root Cause Failure Analysis
For the most part, RCFA is the same as RCA. In some instances, RCFA relates to the use of RCA after equipment failure occurs whereas RCA is applied whenever any undesirable situation (failure, safety or environmental incident, near miss, poor performance, waste in all its forms, etc.) must be understood and eliminated.
Failure Reporting, Analysis, and Corrective Action System
The way in which I have described RCA above could also be referred to as a FRACAS process. As you can see by the definition of the acronym, FRACAS begins with a failure reporting process, involves the analysis of the failure and determination of the corrective action, and then ensures that the corrective action has been taken.
FRACAS is a part of the EOL phase.
Reliability, Availability, Maintainability, and Safety analysis
During the design and equipment selection process, an analysis should be performed to ensure the equipment will provide the greatest reliability, will deliver optimal availability, is easy and the least expensive to maintain, and will not result in safety incidents. One may add to this the goals of maximizing energy efficiency and having the least impact on the environment.
RAMS should be performed as part of the ACQUIRE phase.
Quality Assurance and Quality Control
In a maintenance and reliability context, QA/QC ensures that checks are performed in order to achieve the highest level of quality. Those checks may be performed during the acceptance testing process, after repairs or overhauls have been performed, and when the equipment is installed. We must not assume that the equipment is in optimal condition; we must prove that it is.
QA/QC is utilized in the ACQUIRE phase, and it is an important part of the DISCIPLINE phase as we ensure that all work is performed to the highest standard.
Acceptance Testing
When new equipment is purchased, or when equipment is repaired or overhauled (by external contractors or by internal maintenance staff), tests should be performed to ensure that the equipment is in tip-top condition. When dealing with external OEMs, vendors, and contractors, agreements should be established that define parameters such as vibration limits, performance characteristics, noise emissions, and more that must be met in order for the purchase to be completed.
Acceptance testing is utilized in the ACQUIRE phase.
MAINTENANCE AND CM METRICS
P–F Interval
The P–F interval is commonly used to describe the time between when a fault is detectable (i.e., the failure may potentially occur) and when the asset will functionally fail. If the P–F interval can be estimated for the major failure modes, it will be possible to establish the appropriate condition monitoring test interval.
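The sketch below shows one way an estimated P–F interval might be turned into a test interval. It assumes a commonly quoted rule of thumb, which the text does not prescribe: test at no more than half the P–F interval, so that at least one test falls between the point where the fault becomes detectable and functional failure. The function name, the divisor parameter, and the P–F estimates are hypothetical.

```python
def cm_test_interval(pf_interval_days: float, divisor: float = 2.0) -> float:
    """Return a test interval equal to the P-F interval divided by `divisor`.

    A divisor of 2 is the assumed rule of thumb; a larger divisor gives more
    tests inside the P-F interval at a higher monitoring cost.
    """
    if pf_interval_days <= 0:
        raise ValueError("P-F interval must be positive")
    if divisor < 1:
        raise ValueError("divisor must be at least 1")
    return pf_interval_days / divisor


# Hypothetical P-F estimates (days) for the major failure modes of one asset.
pf_estimates = {"bearing wear": 90, "misalignment": 60, "lubricant degradation": 120}

# If one test is relied on to catch all of these failure modes,
# the shortest P-F interval governs the test interval.
route_interval = min(cm_test_interval(pf) for pf in pf_estimates.values())
print(f"test at least every {route_interval:.0f} days")
```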
Time to Fail
This metric can be used in two broadly different ways. First, it is synonymous with the P–F interval described above. Second, it can be used to describe the time between when the equipment became operational and when it failed, as is the case with the metric MTBF.
Mean Time Between Failures
The MTBF is the average time that an asset operates before it fails. Over a period of time, if the lifespans (the time between the equipment becoming operational and its functional or catastrophic failure) are recorded and those lifespans are averaged, then we will have the mean (average) time between failures. We usually refer to the service the asset performs (for example, EXHAUST FAN #1) rather than a specific component (e.g., electric motor) with a specific serial number.
Technically it should only be applied to “non-repairable” equipment, that is, equipment that is replaced and not restored back to “full” health.
Beware of the use of MTBF, as over a long period of time the MTBF may not adequately indicate the benefits of recent improvements to reliability. The MTBF can also hide the fact that equipment will have multiple failure modes: some that result in short lifespans, others that result in longer lifespans.
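To illustrate both the calculation and the second warning, the sketch below averages a set of lifespans and then repeats the calculation per failure mode. The failure history is invented for illustration, and the helper name `mtbf` is hypothetical.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical failure history for one service position (e.g. EXHAUST FAN #1):
# (operating hours before failure, failure mode). All values are invented.
history = [
    (4200, "bearing wear"),
    (3900, "bearing wear"),
    (800, "belt failure"),
    (4500, "bearing wear"),
    (950, "belt failure"),
]


def mtbf(lifespans_hours: list[float]) -> float:
    """Mean of the recorded lifespans, in hours."""
    if not lifespans_hours:
        raise ValueError("at least one recorded lifespan is required")
    return mean(lifespans_hours)


overall_mtbf = mtbf([hours for hours, _ in history])

# Group the same lifespans by failure mode to see what the overall figure hides.
by_mode = defaultdict(list)
for hours, mode in history:
    by_mode[mode].append(hours)
mtbf_by_mode = {mode: mtbf(hours) for mode, hours in by_mode.items()}

print(f"overall MTBF: {overall_mtbf} h")
for mode, value in mtbf_by_mode.items():
    print(f"{mode}: {value} h")
```

In this invented data set the overall figure sits between two very different populations of failures, which is exactly the situation the warning above describes.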
Mean Time To Repair
After a failure, it naturally takes time to repair/restore or replace a piece of equipment and put it back into service. If we measure and average those times, we will have the mean time to repair. As long as we are performing that work with the highest level of precision, our goal will be to reduce the mean time to repair.
Key Performance Indicator
Key performance indicators are metrics that enable us to measure the improvements we (hopefully) achieve as we change our processes, provide extra training, improve reliability, and so on. KPIs can measure the efficiency of the planning process, the production output of the plant, and many other parameters. The best results are achieved when KPIs are used to tell us if the changes we are making are delivering value to the organization and to identify where there are opportunities for improvement, rather than being used to punish people if they do not achieve the desired results.
KPIs are introduced in the VALUE phase and are a key part of the ANALYTICS phase.
Overall Equipment Effectiveness
The OEE is typically used to measure the performance of a process. It is the combination of three factors: the availability of the equipment, the production rate, and the quality of the output. If each metric is measured from 0 to 1 (1 being the ideal state) and they are multiplied together and then multiplied by 100, we will have a percentage value from 0% to 100%.
An OEE of 100% indicates that the equipment is always available to be operated, it runs at the desired rate (e.g., 100 widgets per hour), and all of the product produced is of high quality and can thus be delivered to the customer. In many cases, an increase of the OEE by 1% can represent an improvement in revenue of many hundreds of thousands of dollars.
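The calculation can be sketched as follows. The function name and the example values are hypothetical; the rule itself (three factors from 0 to 1, multiplied together and then by 100) is taken from the description above.

```python
def oee_percent(availability: float, rate: float, quality: float) -> float:
    """Return OEE as a percentage from three factors, each between 0 and 1."""
    factors = {"availability": availability, "rate": rate, "quality": quality}
    for name, value in factors.items():
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be between 0 and 1, got {value}")
    return availability * rate * quality * 100.0


# Hypothetical line: available 90% of the time, running at 95% of the
# desired rate, with 98% of output meeting the quality standard.
example_oee = oee_percent(availability=0.90, rate=0.95, quality=0.98)
print(f"OEE: {example_oee:.2f}%")
```

Because the factors are multiplied, three individually respectable figures still yield an OEE well below any one of them.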
Computerized Maintenance Management System
This is the key software system used to, as a minimum, manage work orders. It will typically also be used to manage the entire work management workflow and even spare parts inventory. It should be possible to extract data from the CMMS in order to assess the reliability of equipment, the criticality of equipment, the effectiveness of the planning and scheduling process, and many other key pieces of information.
The CMMS is introduced in the CONTROL phase and is a key part of the DISCIPLINE phase. Data is acquired from the CMMS in the ACQUIRE phase.
Enterprise Asset Management system
An EAM system will perform the function of the CMMS and many other functions, possibly including: asset life cycle management, labor management, MRO management, contract management, analytics and reporting, and financial management. As you can see, it goes well beyond the maintenance and reliability departments.
Maintenance, Repair, and Operations
MRO represents everything associated with keeping equipment running. This naturally involves maintenance, repair, testing, spares, tools, and more. From an accounting point of view, it often represents all of the costs associated with maintaining and running the plant but not the costs associated with the products being produced or the service being provided.
(Sometimes you will see MRO represented as Maintenance, Repair, and Overhaul when it is purely focused on the maintenance department.)
Condition Monitoring
The use of measurement technologies such as vibration analysis, infrared thermography, and ultrasound analysis to assess whether a piece of equipment shows any signs of the onset of failure or if it is not being operated correctly. Tests may be performed on rotating machinery, electrical equipment, structures, and process equipment (such as steam traps, compressed air systems, etc.).
In the context of ART and ARP, we like to include visual and other inspections as part of condition monitoring because the goal is to assess the health of the equipment.
Condition monitoring is introduced in the CONTROL phase, and it lives in the ANALYTICS phase.
Condition-Based Maintenance
The strategy which results in repair or restoration maintenance tasks only being performed when the condition of the equipment indicates the need for that task. If the strategy is not followed, it is common for equipment to be replaced simply because of its age (which often results in premature failure).
The application of the CBM strategy is developed as part of the maintenance strategy. The development of the maintenance strategy begins in the STRATEGY phase.
Predictive Maintenance
While this term has been traditionally used interchangeably with condition-based maintenance (or just condition monitoring), now it is associated with the application of artificial intelligence in order to determine the nature and severity of fault conditions, and optionally to also predict when corrective maintenance is required, and even to generate the work order so the work can be performed.
Vibration Analysis
The collection of vibration readings, typically via magnetically mounted accelerometers on the bearings of rotating machinery, to detect the onset of failure and diagnose the nature and severity of the fault condition.
Ultrasound Analysis
The collection of ultrasound readings (sound above our range of hearing), via contact or airborne measurements, to detect impacting, turbulence, or friction in mechanical/process systems or arcing, corona, or tracking in electrical systems.
Infrared analysis, a.k.a. thermography
The measurement of infrared (thermal) radiation, typically via thermal imaging cameras or spot radiometers, from mechanical/process and electrical systems in order to detect changes in temperature that may indicate that a fault condition exists.
Oil Analysis
The collection and analysis of oil samples in order to assess the condition of the lubricant (to ensure that it has the correct chemical and other properties such as viscosity, acidity, alkalinity, etc.) and to detect if contamination has occurred. In some cases it can indicate if the equipment has begun to fail.
Wear Particle Analysis, a.k.a. ferrography
The collection and analysis of oil samples in order to assess whether the lubricant has been contaminated (dirt, coal, fibers, etc.) or if physical damage such as mechanical wear has occurred, which will result in metallic particles appearing in the oil. The sample is treated and inspected under a microscope, as the particles are too big for traditional oil analysis laboratory instrumentation.
Motor Current Analysis (Motor Circuit Analysis)
Motor current analysis involves the analysis of the current flowing through one or three phases in order to determine if there is a mechanical or electrical fault: most commonly broken or cracked rotor bars or end rings, or rotor eccentricity.
Motor circuit analysis involves the testing of the motor when it is offline by analyzing the resistance, capacitance, and inductance of the stator as the rotor is slowly turned. Faults in the rotor, stator, insulation, and connections can be detected.
Motor Current Signature Analysis
Sometimes MCA is referred to as MCSA.
Motor Signature Analysis
Motor signature analysis typically involves the analysis of the voltage and current on all three phases of an induction motor. The current analysis is described above, and the voltage analysis provides an indication of the quality of the power supply to the motor.
Non-Destructive Testing
Tests that are performed specifically to detect cracking, corrosion, and other conditions that may indicate that a component may fail. The aim of the test is to leave the component in good working order should the tests reveal that no faults exist. NDT is often applied to pressure vessels, valves, piping, crane hooks and structures, and other assets that pose a serious risk to safety and the environment.
Continuous Improvement
The process of constantly seeking new and improved ways to perform every task within the organization.
Continuous Improvement is the focus of the OPTIMIZE phase.
Total Productive Maintenance
TPM aims to improve the maintenance process via autonomous maintenance, planned maintenance, quality improvement, new equipment management, continuous improvements, and improvements related to health, safety, and the environment. TPM is often synonymous with operator driven reliability, although ODR should be considered a subset.
We would consider TPM to be a subset of the ART process as we seek to go beyond maintenance. ODR is part of the CARE phase.
Total Quality Management
Total quality management is typically an organization-wide initiative that seeks to encourage employees to find ways to improve the quality of service to customers. Typically, the primary output is the delivery of first-pass high-quality product.
Quality is critical to reliability and performance improvement. Product quality improvement is the focus of the CARE and ANALYTICS phases, and improvements are identified in the OPTIMIZE phase.
Sort, Set in order, Shine, Standardize, Sustain
5S is related to TPM, Six Sigma, and Lean. The basic process involves looking at a work area, which may include the maintenance workshop and the operating area of the plant, and ensuring that everything can be performed as efficiently as possible. Everything should be organized, clean, and standardized. Everyone should be trained and motivated to perform tasks in such a way that time is not wasted searching, moving items, and walking around the workplace accessing tools and other items. A place for everything and everything in its place. ART utilizes the 5S process.
5S is introduced in the CONTROL phase and is a key part of the CARE phase.
Six Sigma
Originally, Six Sigma was based on the basic goal that there should be less than six standard deviations of variation in a production process. To put it in English, “where 99.99966% of all opportunities to produce some feature of a part are statistically expected to be free of defects” (Wikipedia). In more recent times, Six Sigma relates more generally to a method that provides tools to improve the capability of an organization’s business processes, whether that be in the maintenance department or anywhere else.
When following the Six Sigma approach, one will follow the methodology: Define, Measure, Analyze, Improve, and Control (DMAIC). TPM, Lean, 5S, TQM, and Six Sigma are all closely related – they all seek to eliminate waste and achieve high quality so that a business can be more profitable and customers remain satisfied.
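To put the quoted 99.99966% figure in more familiar terms, the sketch below converts it to defects per million opportunities. The second calculation rests on an assumption that is not stated in the text: the widely cited convention that the figure corresponds to the one-sided tail of a normal distribution beyond 4.5 standard deviations (six sigma less an allowance of 1.5 sigma for long-term drift). The function name is hypothetical.

```python
from statistics import NormalDist


def defects_per_million(defect_free_percent: float) -> float:
    """Convert a defect-free percentage into defects per million opportunities."""
    if not 0.0 <= defect_free_percent <= 100.0:
        raise ValueError("percentage must be between 0 and 100")
    return (100.0 - defect_free_percent) / 100.0 * 1_000_000


# The figure quoted in the text.
dpmo_from_quote = defects_per_million(99.99966)

# Assumption: one-sided normal tail beyond 6 - 1.5 = 4.5 standard deviations.
dpmo_from_normal = (1.0 - NormalDist().cdf(4.5)) * 1_000_000

print(f"from the quoted percentage: {dpmo_from_quote:.1f} per million")
print(f"from the normal tail:       {dpmo_from_normal:.1f} per million")
```

Both routes arrive at roughly 3.4 defects per million opportunities.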
Artificial Intelligence
This is the general term given to a computer system that can learn and make decisions.
Machine Learning
Machine learning is a process whereby a system uses data to learn about a process, a piece of equipment, failure mechanisms, and so on. Machine learning is a subset of artificial intelligence.
For example, if the system were used to observe the performance of a piece of equipment under different weather conditions, machine learning would enable it to observe the weather and predict equipment performance.
Data Analytics
Data analytics is the more general term whereby data is utilized to gain insight into a process, piece of equipment, failure mechanisms, and so on. When utilizing data analytics, a person performs the calculations in order to learn.
For example, if a person analyzed data that showed how the performance of a piece of equipment changed as the weather changed and then performed statistical analysis on that data to “correlate” the changes in weather to the equipment performance, we would call that data analytics.
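The weather example can be sketched as a simple correlation calculation. The data below is invented, the variable names are hypothetical, and Pearson's correlation coefficient is only one of several statistics a person might choose for this kind of analysis.

```python
from math import sqrt


def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient between two equally long series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return covariance / (spread_x * spread_y)


# Invented observations: ambient temperature (deg C) and the output of a
# piece of equipment (units per hour) recorded on the same days.
ambient_temp_c = [10, 15, 20, 25, 30, 35]
units_per_hour = [102, 100, 99, 96, 93, 90]

r = pearson(ambient_temp_c, units_per_hour)
print(f"correlation between temperature and output: {r:.2f}")
```

A value close to -1, as this invented data produces, would suggest that output falls as temperature rises; it does not by itself show that temperature is the cause.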
Predictive Maintenance
Predictive maintenance utilizes condition monitoring data and machine learning technology to predict future equipment health. Additional comments are made in the condition monitoring section.
Prescriptive Maintenance
Prescriptive maintenance takes predictive maintenance one extra step. Rather than only predicting the future health of equipment based on condition monitoring data, it will determine what corrective maintenance is required and take the necessary action (i.e., generate a work order).
Condition monitoring standards
ISO 18436
A series of standards that define how practitioners must be trained (18436-3) and certified (18436-1) in vibration analysis (18436-2), ultrasound analysis (18436-8), oil and wear particle analysis (18436-4), infrared analysis (18436-7), and other less frequently used technologies.
The condition monitoring training and certification covered in the PEOPLE phase follows these ISO standards.
A standard that defines how certification bodies must ensure that certifications related to personnel must be fair and independent.
Accreditation
The process whereby a government-appointed agency assesses and periodically audits a certification body and acknowledges that the organization is acting in a fair and independent way according to the standards. ANSI performs this role in the United States. JAS-ANZ performs the role in Australia and New Zealand. UKAS performs the role in the UK.
The Mobius Institute condition monitoring and asset reliability transformation training and certification are accredited by JAS-ANZ.
Asset management standards
ISO 55000
A series of three standards that describe how an organization can be assessed in order to determine whether it is maximizing the value of its physical assets. An organization can also be certified as ISO 55000 compliant in the same way that an organization can be certified as ISO 9000 compliant.
ISO 55000 does not provide guidance on how an organization can achieve maximum value from its physical assets.
As discussed in detail below, the ART process can help an organization achieve ISO 55000.