December 2005

Why the traditional reliability prediction models do not work - is there an alternative?



Introduction

 

While it is generally believed that reliability prediction methods should be used to aid product design and development, the integrity and auditability of the traditional prediction methods have proven questionable: the models do not predict field failures, cannot be used for comparative purposes, and present misleading trends and relations.

 

This paper presents a historical overview of reliability predictions for electronics, discusses the traditional reliability prediction approaches, and then presents an effective alternative that is becoming widely accepted.

 

 

 

Historical Perspective on the Traditional Reliability Prediction Models

 

Stemming from a perceived need during World War II to place a figure of merit on a system's reliability, the U.S. government procurement agencies sought standardization of requirement specifications and of the prediction process. The view was that, without standardization, each supplier could develop its own predictions based on its own data, and it would be difficult to evaluate system predictions against requirements based on components from different suppliers, or to compare competing designs for the same component or system.

 

Reliability prediction and assessment specifications can be traced to November 1956, with the publication of the RCA release TR-1100, "Reliability Stress Analysis for Electronic Equipment," which presented models for computing the rates of component failures. This publication was followed by the "RADC Reliability Notebook" in October 1959 and by the publication of a military reliability prediction handbook format known as MIL-HDBK-217.

 

In MIL-HDBK-217A, a single-point failure rate of 0.4 failures per million hours was listed for all monolithic integrated circuits, regardless of the environment, the application, the materials, the architecture, the device power, the manufacturing processes, or the manufacturer. This single-valued failure rate was an indication that accuracy and science were of less concern than standardization.

 

The advent of more complex microelectronic devices continuously pushed MIL-HDBK-217 beyond reasonable limits, as was seen, for example, in the inability of the MIL-HDBK-217B models to address 64K or 256K RAM. In fact, when the RAM model was used for the 64K capability common at that time, the resulting computed mean time between failures (MTBF) was 13 seconds, many orders of magnitude below the MTBF experienced in real life. As a result of such incidents, new versions of MIL-HDBK-217 appeared about every seven years to "band-aid" the problems.
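
As a back-of-the-envelope check on that figure (this calculation is illustrative and not from the paper), a constant-failure-rate model relates MTBF and failure rate by MTBF = 1/lambda, so a 13-second MTBF implies roughly 277 failures per hour for a single memory device:

    # Illustrative check, not from the paper: what failure rate does a 13-second
    # MTBF imply under the handbook's constant-failure-rate assumption?
    mtbf_hours = 13.0 / 3600.0                # 13 seconds expressed in hours
    failure_rate_per_hour = 1.0 / mtbf_hours  # MTBF = 1 / lambda
    print(round(failure_rate_per_hour))       # -> about 277 failures per hour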

 

Today, the U.S. government and the military, as well as various U.S. and European manufacturers of electronic components, printed wiring and circuit boards, and electronic equipment and systems, still subscribe in some manner to the traditional reliability prediction techniques (e.g., MIL-HDBK-217 and its progeny [1]), although sometimes unknowingly.

 

However, studies conducted by the National Institute of Standards and Technology (NIST), Bell Northern Research, the U.S. Army, Boeing, Honeywell, Delco, and Ford Motor Co. have made it clear that the approach has been damaging to the industry and that a change is needed.

 

Problems with the Traditional Approach to Reliability Prediction

 

Problems that arise with the traditional reliability prediction methods, and some of the reasons these problems exist, are described below.

 

(1) Up-to-date collection of the pertinent reliability data needed for the traditional reliability prediction approaches is a major undertaking, especially when manufacturers make yearly improvements. Most of the data used by the traditional models is out of date. For example, the connector models in MIL-HDBK-217 have not been updated for at least 10 years and were formulated from data that is 20 years old.

 

Nevertheless, reliance on even a single outdated or poorly conceived reliability prediction approach can prove costly in system design and development. For example, the use of military allocation documents (JIAWG), which apply the MIL-HDBK-217 approach up front in the design process, initially led to design decisions that capped the junction temperature of the F-22 Advanced Tactical Fighter electronics at 60°C and of the Comanche light helicopter at 65°C. Boeing noted that "the system segment specification normal cooling requirements were in place due to military electronic packaging reliability allocations and the backup temperature limits to provide stable electronic component performance. The validity of the junction temperature relationship to reliability is constantly in question and under attack as it lacks solid foundational data."

 

For the Comanche, cooling temperatures as low as -40°C at the electronics' rails were at one time required to obtain the specified junction temperatures, even though the resulting temperature cycles were known to precipitate standing water as well as many unique failure mechanisms. Slight changes were made in these programs when the problems surfaced, but the schedule costs cannot be recovered.

 

(2) In general, equipment removals and part failures are not equal. Often, field-removed parts re-test as operational (called re-test OK, fault-not-found, or could-not-duplicate), and the true cause of "failure" is never determined. Because the focus of reliability engineering has been on probabilistic assessment of field data rather than on failure analysis, it has generally been perceived to be cheaper for a supplier to replace a failed subsystem (such as a circuit card) and ignore how the card failed.

 

(3) Many assembly failures are not component-related but are due to errors in socketing, calibration, or instrument reading, or to improper interconnection of components during higher-level assembly. Today, reliability-limiting items are much more likely to lie in the system design (such as misapplication of a component, inadequate timing analysis, lack of transient control, or stress-margin oversights) than in a manufacturing or design defect in the device.

 

(4) Failure of a component is not always due to a component-intrinsic mechanism; it can be caused by (i) an inadvertent overstress event after installation, (ii) latent damage during storage, handling, or installation after shipment, (iii) improper assembly into a system, or (iv) selection of the wrong component for the system by either the installer or the designer. Variable stress environments can also make a model inadequate for predicting field failures. For example, one Westinghouse fire-control radar has been used in a fighter aircraft, in a bomber, and on the top mast of a ship, each with its unique configuration, packaging, reliability, and maintenance requirements.

 

(5) Electronics do not fail at a constant rate, as the models predict. The models were originally used to characterize device reliability because the early data were tainted by equipment accidents, repair blunders, inadequate failure reporting, reporting of mixed-age equipment, defective records of equipment operating times, and mixed operational environmental conditions. The totality of these effects conspired to produce what appeared to be an approximately constant hazard rate. Furthermore, early devices had several intrinsic failure mechanisms that manifested themselves as subpopulations of infant-mortality and wear-out failures, whose mixture resembled a constant failure rate. These assumptions of a constant failure rate do not hold for present-day devices.
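
The contrast can be made concrete with the standard hazard-rate expressions. The sketch below is a minimal illustration, not from the paper, and its parameter values are arbitrary: the exponential model assumes a hazard rate that never changes, while a Weibull model with shape parameter less than one reproduces the decreasing, infant-mortality behavior described above (and a shape parameter greater than one reproduces wear-out).

    # Minimal illustration (arbitrary parameters): constant vs. time-dependent hazard rate.
    def exponential_hazard(t, lam):
        """Constant hazard rate assumed by the traditional prediction models."""
        return lam

    def weibull_hazard(t, beta, eta):
        """Weibull hazard rate: decreasing for beta < 1, increasing for beta > 1."""
        return (beta / eta) * (t / eta) ** (beta - 1.0)

    for t in (100.0, 10_000.0):  # hours
        print(t, exponential_hazard(t, lam=1e-6), weibull_hazard(t, beta=0.5, eta=50_000.0))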

 

(6) The reliability prediction models are based on industry-average failure rates, which are neither vendor- nor device-specific. For example, failures may come from defects caused by uncontrolled fabrication methods, some of which were unknown and some of which were simply too expensive to control (i.e., the manufacturer accepted a yield loss rather than spending more money to control fabrication). In such cases, the failure was not representative of the field failures upon which the reliability prediction was based.

 

(7) The reliability prediction can be based on an inappropriate statistical model. For example, a failure was detected in a lot of radio-frequency amplifiers at Westinghouse in which the insulation of a wire had been rubbed off against the package during thermal cycling, shorting the amplifier. X-ray inspection during failure analysis confirmed the problem. The existence of a pattern failure (as opposed to a random failure) under the given conditions proved that the original MIL-HDBK-217 modeling assumptions were in error and that an improvement in design, quality, or inspection was required.

 

(8) The traditional reliability prediction approaches can produce highly variable assessments. As one example, the predicted reliability for a memory board with 70 64K DRAMs in a "ground benign" environment at 40°C varied from 700 FITs to 4,240,460 FITs, depending on which prediction handbook was used. Overly optimistic predictions may prove fatal. Overly pessimistic predictions can increase the cost of a system (e.g., through excessive testing or a redundancy requirement), or delay or even terminate deployment. Thus, these methods should not be used for preliminary assessments, baselining, or initial design trade-offs.
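
To put that spread in perspective, a FIT is one failure per 10^9 device-hours, so under a constant-failure-rate assumption MTBF = 10^9/FIT hours. The conversion below is a simple illustration, not from the paper:

    # Convert the two quoted predictions from FITs to MTBF (illustrative arithmetic).
    HOURS_PER_YEAR = 8760.0

    def fit_to_mtbf_hours(fit):
        """1 FIT = 1 failure per 1e9 device-hours, so MTBF = 1e9 / FIT hours."""
        return 1e9 / fit

    for fit in (700.0, 4_240_460.0):
        mtbf_h = fit_to_mtbf_hours(fit)
        print(f"{fit:>12,.0f} FIT -> {mtbf_h:>12,.0f} h ({mtbf_h / HOURS_PER_YEAR:.1f} years)")
    # 700 FIT corresponds to an MTBF of about 163 years; 4,240,460 FIT to about 10 days.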

 

An Alternative Approach: Physics-of-Failure

 

In Japan, Taiwan, Singapore, and Malaysia, as well as at the U.K. Ministry of Defence and many of the leading U.S. commercial electronics companies, the traditional methods of reliability prediction have been abandoned. Instead, these organizations use reliability assessment techniques based on root-cause analysis of failure mechanisms, failure modes, and the stresses that cause failure. This approach, called physics-of-failure, has proven effective in the prevention, detection, and correction of failures associated with the design, manufacture, and operation of a product.

 

The physics-of-failure (PoF) approach to electronic products is founded on the fact that failure mechanisms are governed by fundamental mechanical, electrical, thermal, and chemical processes. By understanding the possible failure mechanisms, potential problems in new and existing technologies can be identified and solved before they occur.

 

The PoF approach begins in the first stages of design (see Figure 1). A designer defines the product requirements based on the customer's needs and the supplier's capabilities. These requirements can include the product's functional, physical, testability, maintainability, safety, and serviceability characteristics. At the same time, the service environment is identified, first broadly as aerospace, automotive, business office, storage, or the like, and then more specifically as a series of defined temperature, humidity, vibration, shock, or other conditions. The conditions are either measured or specified by the customer. From this information, the designer, usually with the aid of a computer, can model the thermal, mechanical, electrical, and electrochemical stresses acting on the product.
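
As a simple illustration of this kind of stress modeling (a minimal sketch with arbitrary values, not taken from the paper or from Figure 1), the steady-state junction temperature of a component can be estimated from the ambient temperature, the power dissipation, and the junction-to-ambient thermal resistance:

    # Minimal thermal-stress sketch (arbitrary values): one-resistor junction temperature model.
    def junction_temperature(t_ambient_c, power_w, theta_ja_c_per_w):
        """Tj = Ta + P * theta_ja."""
        return t_ambient_c + power_w * theta_ja_c_per_w

    # Example: a 2 W device with theta_ja = 25 C/W in a 40 C enclosure
    print(junction_temperature(40.0, 2.0, 25.0))  # -> 90.0 C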

 

Next, the stress analysis is combined with knowledge of the stress response of the chosen materials and structures to identify where failure might occur (failure sites), what form it might take (failure modes), and how it might take place (failure mechanisms). Failure is generally caused by one of four types of stress: mechanical, electrical, thermal, or chemical; and it generally results either from a single overstress or from the accumulation of damage over time under lower-level stresses. Once the potential failure mechanisms have been identified, the specific failure-mechanism models are employed.
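
One widely used wear-out model of this kind is fatigue under temperature cycling. The sketch below is a minimal illustration with hypothetical constants (they are not from the paper and are not calibrated to any material): a Coffin-Manson-style relation estimates cycles to failure for a given temperature swing, and Miner's rule accumulates fractional damage across different cycling conditions.

    # Minimal damage-accumulation sketch (hypothetical constants, not calibrated).
    def cycles_to_failure(delta_t_c, coeff=1.0e9, exponent=2.5):
        """Coffin-Manson-style estimate: cycles to failure for a temperature swing (C)."""
        return coeff * delta_t_c ** (-exponent)

    def miners_damage(conditions):
        """conditions: list of (applied_cycles, delta_t_c); failure predicted when damage >= 1."""
        return sum(n / cycles_to_failure(dt) for n, dt in conditions)

    # Example: mixed field and test cycling
    damage = miners_damage([(5_000, 40.0), (200, 100.0)])
    print(damage, "-> failure expected" if damage >= 1.0 else "-> margin remains")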

 

The reliability assessment consists of calculating the time to failure for each potential failure mechanism and then, using the principle that a chain is only as strong as its weakest link, identifying the dominant failure sites and mechanisms as those with the shortest times to failure. The information from this assessment can be used to determine whether a product will survive its intended application life, or to redesign the product for increased robustness against the dominant failure mechanisms. The physics-of-failure approach is also used to qualify design and manufacturing processes, to ensure that the nominal design and manufacturing specifications meet or exceed reliability targets.
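
A minimal sketch of the weakest-link step follows (the mechanism names and times to failure are hypothetical, not from the paper): once a time to failure has been estimated for each potential mechanism, the dominant mechanism is simply the one with the shortest time to failure.

    # Rank hypothetical failure mechanisms by estimated time to failure (hours).
    def rank_mechanisms(times_to_failure_hours):
        """Return (dominant_mechanism, hours) and the full ranking, shortest first."""
        ranking = sorted(times_to_failure_hours.items(), key=lambda kv: kv[1])
        return ranking[0], ranking

    estimates = {
        "solder-joint fatigue": 62_000.0,
        "electromigration": 180_000.0,
        "corrosion": 95_000.0,
    }
    dominant, ranking = rank_mechanisms(estimates)
    print("dominant:", dominant)   # the weakest link drives the assessment
    print("ranking:", ranking)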

 

Computer software has been developed by organizations such as Philips and the CALCE EPRC at the University of Maryland to conduct physics-of-failure analysis at the component level, and numerous organizations have PoF software for use at the circuit-card level. These software tools make design, qualification planning, and reliability assessment manageable and timely.

 

Summary Comments

 

The physics-of-failure approach has been used quite successfully for decades in the design of mechanical, civil, and aerospace structures. The approach is almost mandatory for buildings and bridges, because the sample size is usually one, affording little opportunity to test the completed product or to rely on reliability growth. Instead, the product must work properly the first time, even though it often relies on unique materials and architectures placed in unique environments.

 

Today, the PoF approach is demanded (1) by suppliers, to measure how well they are doing and to determine what kind of reliability assurances they can give a customer, and (2) by customers, to determine that their suppliers know what they are doing and are likely to deliver what is desired. In addition, PoF is used by both groups to assess and minimize risk. This knowledge is essential, because a supplier whose product fails in the field loses the customer's confidence and often the repeat business, while a customer who buys a faulty product endangers its own business and possibly the safety of its customers.

 

As for the U.S. military, the U.S. Army has found the problems with the traditional reliability prediction techniques so serious that it has canceled the use of MIL-HDBK-217 in Army specifications. Instead, it has developed Military Acquisition Handbook-179A, which recommends best commercial practices, including physics-of-failure.


 

Michael Pecht
CALCE Electronic Packaging Research Center
University of Maryland
College Park, MD


 

Reference

1. The traditional approach to predicting reliability is common to various international handbooks [MIL-HDBK-217 1991; TR-TSY-000332 1988; HRDS 1995; CNET 1983; SN 29500 1986], all derived from some predecessor of MIL-HDBK-217.

 
