Ake Malhammar | May 2005

Fan Failures and Unavailability


 


Calculator: Unavailability and Fan Fails

Introduction

Some problems are associated with so many uncertainties that it is doubtful whether a computed result is any better than a crude estimate. The subject of this article is a typical example. Yet even if fundamental uncertainties make a calculation method highly approximate, it can still reveal valuable overview information.

Paradoxically, a good expert can be defined as someone who knows why he does not know. No one can explain this better than Tony Kordyban; read his comments about room temperature. The major uncertainty in that case is that equipment rooms often house devices with a wide spectrum of specifications. How potential fan failures affect the unavailability is another fuzzy problem. The difference in this case, however, is that the impact of each parameter can be described with reasonably simple correlations. Putting them all together in one method can reflect general tendencies even if the result is numerically inaccurate.


Figure 1 - This equipment is fully functional even if 2 fans fail, provided that the ambient temperature is below 30 °C. What is the unavailability?

The problem

Figure 1 shows a subrack cooled by four fans. For a typical telecom application, the maximum room temperature would be specified as 50 °C. Suppose that the PCB temperature increases by 10 °C if a fan fails. The system would then still function safely as long as the ambient temperature stays below 40 °C. Room temperatures at that level are, however, not common. A system failure will therefore occur only if two unlikely events coincide: a fan failure and a high room temperature. In that unfortunate event, the downtime depends on how fast the fan can be replaced.
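The arithmetic behind this example and the Figure 1 caption can be written down in a few lines. The sketch below is only an illustration. It assumes that every failed fan raises the PCB temperature by the same 10 °C, which is consistent with the 30 °C limit quoted for two failed fans, and the function name `safe_ambient_limit` is hypothetical.

```python
def safe_ambient_limit(max_room_c, rise_per_failed_fan_c, failed_fans):
    """Highest ambient temperature (°C) at which the system still functions.

    Assumes each failed fan adds the same PCB temperature rise, which is
    a simplification made for this illustration.
    """
    if failed_fans < 0:
        raise ValueError("failed_fans must be non-negative")
    return max_room_c - rise_per_failed_fan_c * failed_fans


# Values from the text: 50 °C specification, 10 °C rise per failed fan.
for n in range(3):
    print(n, "failed fan(s):", safe_ambient_limit(50.0, 10.0, n), "°C")
```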

Everybody wants to avoid downtime, and electrical engineers spend a lot of time trying to do so. On the system level there is nevertheless a regrettable tendency to take the easy way out and specify full functionality at maximum room temperature even if one, or sometimes two, fans fail. This requirement is pushed onto the thermal engineers, who upsize the fans and get scolded by their managers for increasing both cost and noise. There is definite potential for improvement here.
 


Figure 2 - Lifetime and MTBF as defined by the bathtub curve.

Fan reliability

There are two aspects of fan reliability: lifetime and failure intensity. The latter is often represented by its inverse, the MTBF (mean time between failures). Lifetime and MTBF are both measured in hours and can therefore easily be confused. When they are defined as in the bathtub curve of Figure 2, the difference is obvious, but things are not always that clear. There are two distinct cases.

The first case is a straightforward interpretation of Figure 2. It applies when the lifetime of the fans is of the same order as the lifetime of the equipment. In that case lifetime has no impact on failure intensity until the fans begin to wear out and need to be replaced (mostly because the lubricant has evaporated). The failure intensity is caused by sporadic collapses of weak components and a large number of other effects, including insect attacks.

The second case is when the lifetime of the fans is much shorter than the lifetime of the equipment and the strategy is to replace faulty fans as failures occur. After some time the fans will form a population with a large spread in age. Sporadic failures will therefore appear not only as component collapses but also as wear-out. In other words, the bathtub curve for the fan system has lost its tail.

The lifetime of high-quality fans at room temperature is currently of the order of 10 to 20 years. The first case is therefore probably the most common. It should also be noted that the lifetime of a fan is defined as the time at which 90% of a large population is still functional.


 
Figure 3 - MTBF as a function of temperature: realistic or pure guesswork?

The problem is to find realistic values for MTBF. Some manufacturers claim it is enormous. Others indicate values of the order of >35 years. There is also a temperature impact. Basing a curve like the one in Figure 3 on a sequence of experience values for various temperature levels is next to impossible. (Do not confuse this curve with the temperature dependence of the lifetime, for which the manufacturers can now provide decent data.)

An additional complication is that any external speed controls also contribute to the failure intensity. Fortunately, there is a physical principle that describes how material degradation varies with temperature. It can be formulated as an exponential function with two characteristic parameters: the Arrhenius function. This equation is by no means exact for all electronic components, but it is the best that can be done in this context. Given two MTBF-temperature pairs, it is therefore possible to create a diagram of the Figure 3 type. It is of course apparent that predictions based on data of this quality cannot be anything other than approximations.
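As an illustration of how two MTBF-temperature pairs determine such a curve, the following sketch fits the two-parameter form MTBF(T) = c · exp(b / T), with T in kelvin, through two points. The parameter names `c` and `b`, the function names and the two sample pairs are assumptions made for this example. They are not data from the article or from any manufacturer.

```python
import math

KELVIN_OFFSET = 273.15


def fit_arrhenius(t1_c, mtbf1, t2_c, mtbf2):
    """Return (c, b) such that MTBF(T) = c * exp(b / T_kelvin)
    passes through both given (temperature in °C, MTBF) pairs."""
    if t1_c == t2_c:
        raise ValueError("the two temperatures must differ")
    t1_k = t1_c + KELVIN_OFFSET
    t2_k = t2_c + KELVIN_OFFSET
    b = math.log(mtbf1 / mtbf2) / (1.0 / t1_k - 1.0 / t2_k)
    c = mtbf1 / math.exp(b / t1_k)
    return c, b


def mtbf_at(temp_c, c, b):
    """Evaluate the fitted curve at temp_c (°C)."""
    return c * math.exp(b / (temp_c + KELVIN_OFFSET))


# Illustrative pairs only: 40 years at 25 °C and 10 years at 60 °C.
c, b = fit_arrhenius(25.0, 40.0, 60.0, 10.0)
for temp_c in (25, 40, 60):
    print(f"{temp_c} °C: MTBF of about {mtbf_at(temp_c, c, b):.1f} years")
```

The fitted curve reproduces the two input pairs exactly. Everything between and beyond them is only as good as the Arrhenius assumption itself.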


 
 
Figure 4 - A temperature duration curve. It shows the relative time for which the temperature is above a certain level.

Temperature duration

Another important parameter is the temperature duration, Figure 4. The curve shows the relative time for which the ambient temperature is above a certain level. Diagrams of this type are physically relevant only for specific locations. The ones found in specifications are worst-case assumptions. In addition, they are often specific only for the extremes (of the type <5% run time in the temperature range 45 to 50 °C). Creating a temperature duration diagram therefore often involves an element of guessing. The one shown in Figure 4 bears some resemblance to actual data for non-air-conditioned premises but is completely off target for well-controlled environments.
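Where a temperature log for a specific location does exist, a duration curve can be derived from it directly. The sketch below assumes equally spaced samples, so that the share of samples above a level equals the relative time above it. The sample values and function names are hypothetical.

```python
def fraction_above(samples_c, level_c):
    """Relative time for which the temperature is above level_c,
    assuming equally spaced temperature samples."""
    if not samples_c:
        raise ValueError("at least one sample is required")
    return sum(1 for t in samples_c if t > level_c) / len(samples_c)


def duration_curve(samples_c, levels_c):
    """Return (level, fraction of time above level) for each level."""
    return [(level, fraction_above(samples_c, level)) for level in levels_c]


# Hypothetical logged room temperatures in °C.
log_c = [21.0, 23.5, 26.0, 24.0, 31.0, 36.5, 41.0, 46.0, 33.0, 27.5]
for level, fraction in duration_curve(log_c, [20, 30, 40, 45]):
    print(f"above {level} °C: {fraction:.0%} of the time")
```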


Figure 5 - The fan speed is often temperature controlled.

Fan speed control

An additional complication is that the fan speed is often temperature controlled, as shown in Figure 5. There is no uncertainty in the control curve itself, so this is the least uncertain factor in a fan unavailability calculation. There is, however, a problem with the sensor location. If both the sensor and the fans are placed in the outlet air, and that air is stratified because of non-uniform heat dissipation that jumps from one PCB to another, that uncertainty enters into the fan failure prediction. A further problem is that the failure intensity is probably speed dependent. The default value used in the referenced calculator, 75% at half speed, is a pure guess.

It should also be noted that the fan speed has an impact on the exhaust air temperature rise. The volumetric flow is approximately proportional to the fan speed. A nominal air temperature rise of 10 °C therefore changes to 20 °C when the fans run at half speed. If the fans are placed in the outlet air, this will naturally affect the failure intensity.
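The two relations in this section can be sketched as follows. The linear control ramp and its breakpoints (half speed up to 30 °C, full speed from 45 °C) are assumptions chosen for illustration, not the curve of Figure 5. The inverse proportionality between air temperature rise and fan speed follows from the text, assuming constant heat dissipation.

```python
def fan_speed_fraction(sensor_temp_c, low_c=30.0, high_c=45.0, min_speed=0.5):
    """Hypothetical control curve: min_speed up to low_c, then a linear
    ramp to full speed at high_c."""
    if sensor_temp_c <= low_c:
        return min_speed
    if sensor_temp_c >= high_c:
        return 1.0
    ramp = (sensor_temp_c - low_c) / (high_c - low_c)
    return min_speed + (1.0 - min_speed) * ramp


def exhaust_rise(nominal_rise_c, speed_fraction):
    """Air temperature rise at reduced speed, with flow proportional to
    speed and constant heat dissipation."""
    if not 0.0 < speed_fraction <= 1.0:
        raise ValueError("speed_fraction must be in (0, 1]")
    return nominal_rise_c / speed_fraction


# Example from the text: a 10 °C nominal rise becomes 20 °C at half speed.
print(exhaust_rise(10.0, 0.5))
```

For fans placed in the outlet air, the fan temperature would then be the ambient temperature plus this rise.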

Discussion

The calculation procedure is quite simple. It is based on a step-by-step integration of the temperature duration curve. The fan speed, the exhaust air temperature, the fan temperature and the MTBF are determined for each step. The result is the average fan temperature and the long-term MTBF. Combined with the probability that the room temperature is above the safe-function threshold and with the repair time, this gives the downtime.
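A minimal sketch of such a step-by-step integration is shown below. It is not the code of the referenced calculator. It assumes that the duration curve has been divided into bins of (time fraction, ambient temperature) and that the long-term MTBF is the reciprocal of the time-weighted failure intensity. The fan temperature model and the MTBF model are passed in as functions, and the ones used here are hypothetical stand-ins.

```python
import math


def integrate_duration_curve(bins, fan_temp_of, mtbf_of):
    """Step through a temperature duration curve.

    bins        -- list of (time_fraction, ambient_c); fractions sum to 1
    fan_temp_of -- function: ambient °C -> fan temperature °C
    mtbf_of     -- function: fan temperature °C -> MTBF (any time unit)

    Returns (average fan temperature, long-term MTBF).
    """
    total = sum(fraction for fraction, _ in bins)
    if not math.isclose(total, 1.0, abs_tol=1e-9):
        raise ValueError("time fractions must sum to 1")
    avg_fan_temp = 0.0
    avg_failure_rate = 0.0
    for fraction, ambient_c in bins:
        fan_temp = fan_temp_of(ambient_c)
        avg_fan_temp += fraction * fan_temp
        avg_failure_rate += fraction / mtbf_of(fan_temp)
    return avg_fan_temp, 1.0 / avg_failure_rate


def fan_temp_of(ambient_c):
    # Hypothetical: fans in the outlet air, fixed 10 °C air temperature rise.
    return ambient_c + 10.0


def mtbf_of(fan_temp_c):
    # Hypothetical exponential stand-in for a fitted MTBF curve, in years.
    return 40.0 * math.exp(-0.04 * (fan_temp_c - 25.0))


# Hypothetical bins: 70% of the time at 25 °C, 25% at 35 °C, 5% at 45 °C.
bins = [(0.70, 25.0), (0.25, 35.0), (0.05, 45.0)]
avg_temp, long_term_mtbf = integrate_duration_curve(bins, fan_temp_of, mtbf_of)
print(f"average fan temperature: {avg_temp:.1f} °C")
print(f"long-term MTBF: {long_term_mtbf:.1f} years")
```

Averaging the failure intensity rather than the MTBF itself means that the hot periods, short as they are, carry their full weight in the result.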

It could be of interest to look at some results from the included calculator. The default values yield an unavailability of ~6 min/year. In view of all the uncertainties, could it be 6 hours? Probably not, but anyone is free to guess. If the repair time is decreased from 10 to 5 hours, the downtime changes proportionally. The impact of this parameter is often overlooked. It is actually just as important as MTBF.
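The proportionality to repair time follows from the structure of the estimate. The sketch below assumes the simplest possible form: expected fan failures per year, times the probability that the room is above the safe threshold, times the repair time. Multiplying by the number of fans and all input values are assumptions for illustration. They are not the calculator's defaults and are not meant to reproduce the ~6 min/year figure.

```python
def downtime_minutes_per_year(n_fans, mtbf_years, prob_above_threshold,
                              repair_hours):
    """Rough expected downtime in minutes per year.

    Assumes independent fan failures, a system that is down only while
    a fan is failed AND the room is above the safe threshold, and a
    downtime equal to the repair time for each such event.
    """
    failures_per_year = n_fans / mtbf_years
    critical_failures_per_year = failures_per_year * prob_above_threshold
    return critical_failures_per_year * repair_hours * 60.0


# Hypothetical inputs: 4 fans, 35-year MTBF, 3% of the time too hot.
slow = downtime_minutes_per_year(4, 35.0, 0.03, repair_hours=10.0)
fast = downtime_minutes_per_year(4, 35.0, 0.03, repair_hours=5.0)
print(f"10 h repair: {slow:.2f} min/year, 5 h repair: {fast:.2f} min/year")
```

In this form, halving the repair time has exactly the same effect as doubling the MTBF.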

Another parameter of interest is the fan location. There are several aspects to this subject, but one argument against placing fans in the exhaust air is that it increases the failure intensity. The calculator can be used to estimate the order of that effect. Even if the result is not numerically exact, the relative result should be fairly relevant. For the default values, the predicted effect is a 40% increase.

It is also easy to simulate a system that is fully functional for worst-case conditions with 1 fan failed. Setting the temperature level for 1 fan fail to 50 °C simulates that. The result is a decrease in unavailability by a factor of 30,000, which is truly dramatic. But is it worth the effort? Could it not be sufficient to specify 48 °C? Every degree that does not need to be cooled is valuable, and it is always the last ones that are the most expensive.

In spite of all the uncertainties involved, these examples show that a calculator of this type does have value as a sensitivity analyser. Since the parameters of the problem interact in a complex manner, the results should also be better than purely intuitive conclusions. Another advantage is that it exposes the complexity of the problem to those who might think otherwise. The disadvantage, of course, is that the numerical result cannot be trusted, which could lead to erroneous conclusions when it is used by someone who does not understand the background.

About Ake Malhammar

Ake obtained his Master of Science degree in 1970 at KTH (Royal Institute of Technology), Stockholm. He then continued his studies and financed them with various heat transfer engineering activities such as deep freezing of hamburgers, nuclear power plant cooling and teaching. His Ph.D. degree was awarded in 1986 for a thesis about frost growth on finned surfaces. From that year until December 2000 he was employed at Ericsson as a heat transfer expert. He is currently establishing himself as an independent consultant.

Having one foot in the university world and the other in industry, Ake has dedicated himself to applying heat transfer theory to the requirements of the electronics industry. He has developed and contributed considerably to several front-end design methods, he holds several patents, and he regularly lectures on thermal design for electronics.

Ake Malhammar
Frigus Primore
http://www.frigprim.com/
