Thermal design is one of the most challenging aspects of computer system design. This is especially true when designing for "fault tolerance" and "high availability" in high-speed multiprocessor-based systems. The inherent challenges of fault tolerance bring demanding design requirements, and headaches as well. The term fault tolerance describes computer systems containing redundant hardware and software features that enable the system as a whole to tolerate a critical failure without affecting the availability of the system.
Fault tolerance/high availability is a desirable, and often necessary, feature for certain niche markets and customers within the computer industry. These markets include telecommunications, gaming, airline reservations, enhanced emergency services (911), ATM networks, banking, and exchange trading. All of these markets share one thing in common: they cannot tolerate downtime. They are called critical online-transaction businesses, since downtime correlates directly with severe loss of revenue and/or the non-availability of computing resources, which can have dire human consequences.
Fault tolerance, in basic terms, is achieved by building two computers into one. The two are linked by hardware and software cross bridges that permit them to act as one via parallel partner processing. The same convention holds for the thermal design engineer, who must abide by the same rule as the power, analog, logic, media storage, and software designers.
That rule is:
"No single point of failure in any sub-system can result in
the loss of functionality or availability of the system."
This governing rule of fault tolerance poses a daunting challenge for the thermal designer because it eliminates from the design process various degrees of freedom commonly available to engineers designing non-fault-tolerant systems. Specifically, the cooling system itself must be fault tolerant. High availability systems are also governed by this rule, although to a somewhat lesser extent.
In general, primary cooling for these types of computer systems is achieved via forced air convection. Liquid-cooled systems exist but are limited to specific applications and are therefore not the focus of this article. The markets previously identified as critical online-transaction typically demand cooling systems that are simple in design, customer replaceable/upgradeable, and robust in terms of redundancy. Furthermore, cooling system designs must accommodate various human interactive events, such as hot-plug air mover replacement, field-upgrade heat sink attachment, etc.
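The basic sizing question behind any forced-convection design is how much air must move through the cabinet to carry the heat load at an acceptable temperature rise. The sketch below shows the first-order energy balance; the cabinet power and temperature rise used are invented illustrations, not figures from any system discussed here.

```python
# Sketch: first-order airflow sizing for forced-convection cooling.
# Illustrative values only; real designs use measured impedances and margins.

def required_airflow_cfm(power_watts, delta_t_c):
    """Volumetric flow needed to carry `power_watts` of heat with an
    air temperature rise of `delta_t_c` degrees C (sea-level air assumed).

    Energy balance: Q = P / (rho * cp * dT), converted from m^3/s to CFM.
    """
    rho = 1.16   # kg/m^3, air density at roughly 30 C
    cp = 1007.0  # J/(kg*K), specific heat of air
    m3_per_s = power_watts / (rho * cp * delta_t_c)
    return m3_per_s * 2118.88  # 1 m^3/s = 2118.88 CFM

# A hypothetical 1200 W cabinet with a 10 C allowed air temperature rise:
cfm = required_airflow_cfm(1200, 10)
print(f"{cfm:.0f} CFM")  # on the order of a couple hundred CFM
```

The result is the flow that must actually reach the heat load; obstructions of the kind described below reduce delivered flow well below what the air movers supply in free air.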
The thermal designer is also challenged by the complications involved in meeting the bus-length and timing requirements of the duplicated, parallel processor/logic/chip sets present throughout the system. With the implementation of higher-powered ASICs and CPU chip/module designs, component placement and attention to airflow delivery are of paramount concern. More specifically, localized thermal influences of one CPU on another arise because redundant pairs must reside in close proximity to each other for timing purposes. These concerns, coupled with manufacturing efforts to make cooling system components turn-key items, ensure a never-ending obstacle course of challenges through which the thermal designer must navigate.
In a fault tolerant/high availability design, thermal design engineers often make their most critical decisions at the initial stages of product concept design, when the governing boundary conditions of the thermal design are, for the most part, determined. Such boundary conditions include:
- General airflow delivery scheme (vertical, front-to-back, pressurize/evacuate, etc.).
- Placement and location of critical elements and components throughout the system.
- System environment requirements (computer room, open office, central office, etc.).
The choice of system environment is critical because, by default, it sets thermal boundary conditions such as maximum ambient temperature and audible noise levels. Once the selection is made, these boundary conditions are non-negotiable, because the parameters are governed by external agencies and standards such as OSHA, NEBS, and ISO.
Thermal engineers have their greatest impact on the system architecture at these early stages of product design. At this time, negotiation skills and tactics become as meaningful as engineering skills. Thermal engineers must be effective negotiators when it comes to guiding and persuading other engineering groups (power, logic, I/O, packaging, manufacturing, etc.) to make decisions and choose designs in harmony with the thermal design strategy. Thermal engineers often serve as design "watch dogs" because they must always be aware of the ongoing design choices others make.
Those choices may have a dramatic impact on the viability of the thermal design. An example may be as seemingly minor as the card guide selection for a PCB. To the packaging engineer this may seem a natural choice, but to the thermal engineer it may disrupt the airflow across the leading edge of that PCB. In fault tolerant designs, the redundancy of hardware throughout the system ensures an uphill battle when lobbying for proper thermal management design features.
Fault tolerant designs are, by nature, high-density power and packaging designs. A typical fault tolerant design can have up to four times the interconnect of a non-fault-tolerant design of similar size and volume. Interconnect is a major obstacle for the thermal designer because of the high pin counts, signal routing, and cable density required to tie parallel logic processes together. High-density interconnects yield solid backplanes, bulky connectors, and cable counts that can reach into the hundreds per cabinet.
Furthermore, the advent of 3.0 volt semiconductors demands power distribution strategies that often require high-amperage cable capacity, which to the thermal designer translates into even bulkier cables. This in turn creates major obstructions to airflow, making it difficult to provide a sufficient volumetric flow rate of air, let alone manage the air distribution profile.
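The connection between low supply voltage and bulky cabling follows directly from Ohm's law: at fixed power, halving the voltage doubles the current, and the copper cross-section needed to hold a given voltage-drop budget grows with current. The sketch below quantifies this with invented load, run-length, and drop-budget numbers chosen only to make the trend visible.

```python
# Sketch: why 3.0 V power distribution means bulky cables.
# At fixed power, current scales as 1/V, and the copper cross-section
# needed for a fixed voltage-drop budget scales with current.
# All numbers below are illustrative, not from any specific design.

RHO_CU = 1.72e-8  # ohm*m, resistivity of copper near room temperature

def min_cable_area_mm2(power_w, volts, length_m, max_drop_v):
    """Minimum copper cross-section (mm^2) for a round-trip drop budget."""
    current = power_w / volts
    # R = rho * L / A  =>  A = rho * L * I / V_drop  (L = round-trip length)
    area_m2 = RHO_CU * (2 * length_m) * current / max_drop_v
    return area_m2 * 1e6

# The same hypothetical 500 W load over a 2 m run with 0.1 V allowed drop:
a_48v = min_cable_area_mm2(500, 48.0, 2.0, 0.1)
a_3v = min_cable_area_mm2(500, 3.0, 2.0, 0.1)
print(f"48 V feed: {a_48v:.1f} mm^2, 3.0 V feed: {a_3v:.1f} mm^2")
```

The 3.0 V feed needs sixteen times the copper of a 48 V feed for the same load and drop budget, which is exactly the cable bulk, and the airflow blockage, described above.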
Fault tolerant cooling system designs are unique precisely because they must themselves be fault tolerant. At the system level this means that proper thermal management must be maintained even in the event of air mover failure or removal. This is most commonly addressed by compartmentalizing the air movers into CRUs (customer replaceable units), each of which is treated as a single point of failure. CRUs often contain multiple air movers.
In Stratus designs (the author was formerly employed by Stratus. --Ed.), each system/CRU in the field is tied to the on-board maintenance system, which is itself tied to field monitoring servers located at Stratus' home base. These systems have a unique "phone home" capability. If a field failure of any kind occurs, the maintenance system detects and isolates it. The system calls home to Stratus and identifies the failure at the CRU level. A replacement CRU is then shipped overnight to the affected site, where the customer carries out the hot-plug replacement of the failed CRU.
Cooling CRUs are usually duplicated throughout the system to provide functionality in case any cooling CRU fails or is removed from the system. CRU removal is a most challenging design problem, since it usually introduces the thermal effects of a major violation of the cooling chimney, defined as "the primary cabinet or rack cross section through which air travels." Chimney violations are further complicated by a multitude of compounding thermal events that can occur in the field, including coinciding CRU removals and CRU removals during periods of high ambient temperature.
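A first-order way to reason about air mover redundancy is to treat parallel fans as additive flow sources at a fixed operating pressure, and to use the first fan affinity law (flow roughly proportional to RPM) to estimate how hard the survivors must work when one fan stops. The sketch below makes those simplifying assumptions explicit; it ignores plenum interactions and backflow through a stopped fan, both of which matter in a real chassis.

```python
# Sketch: flow loss and recovery when one of N parallel air movers fails.
# Assumptions (first-order only): parallel fans add flow at a fixed
# operating pressure, and flow scales linearly with RPM (fan affinity law).
# Ignores backflow through the stopped fan and plenum interactions.

def surviving_flow_fraction(n_fans, n_failed):
    """Fraction of nominal flow left with `n_failed` of `n_fans` stopped."""
    return (n_fans - n_failed) / n_fans

def rpm_scale_to_recover(n_fans, n_failed):
    """Speed-up factor the survivors need to restore nominal total flow."""
    return n_fans / (n_fans - n_failed)

# A hypothetical 4-fan cooling CRU losing one fan:
print(surviving_flow_fraction(4, 1))   # 0.75 of nominal flow remains
print(rpm_scale_to_recover(4, 1))      # survivors must run ~33% faster
```

Estimates like these are what drive the choice between simply over-sizing the air movers and adding failure-triggered speed control.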
After product concepts are defined and chosen, thermal modeling begins. In the last few years modeling capabilities have improved dramatically. The introduction of new devices for airflow measurement, thermal measurement, and associated data acquisition, combined with highly refined software tools such as Flotherm, has greatly enhanced the ability of thermal engineers to produce detailed, accurate models of complex thermal systems. This results in reduced development time and cost. (See Figures 1 and 2.)
Modeling takes place in two forms, physical and CAD. The two are not mutually exclusive; they rely heavily on each other for iterative feedback to refine their respective solutions to acceptable accuracy. As mentioned, thermal modeling is an iterative process. Software tools such as Flotherm often require critical input values, such as loss coefficients and pressure drop estimates, which for the most part cannot be calculated or derived from pre-established data. Therefore, the need exists to build physical representations of enclosures, orifices, etc., and to measure those values experimentally.
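The loss coefficients measured on a physical mockup feed directly into the classic operating-point calculation: the system impedance curve dP = K * Q^2 is intersected with the air mover's pressure-flow curve. The miniature version below uses a made-up linearized fan curve and an invented loss coefficient purely to illustrate the calculation.

```python
# Sketch: finding the operating point from a measured loss coefficient.
# The system impedance curve is dP = K * Q^2; the operating point is
# where it crosses the fan's pressure-flow curve. Both curves here are
# invented illustrations, not data for any real fan or enclosure.

def fan_pressure(q_cfm):
    """Hypothetical fan curve: static pressure (in. H2O) vs. flow (CFM)."""
    return max(0.0, 0.60 - 0.004 * q_cfm)  # linearized for the sketch

def operating_point(k_loss, lo=0.0, hi=200.0, iters=60):
    """Bisect for the flow where fan pressure equals system impedance."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if fan_pressure(mid) - k_loss * mid**2 > 0:
            lo = mid  # fan pressure still exceeds losses: more flow possible
        else:
            hi = mid
    return 0.5 * (lo + hi)

# K as it might be measured on a physical mockup (illustrative value):
q = operating_point(k_loss=4e-5)
print(f"operating point ~ {q:.1f} CFM")
```

Each physical measurement of K tightens the CAD model, and each CAD run suggests the next mockup change, which is the iterative dependency described below.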
This iterative dependency can be time consuming and often frustrating. Nonetheless, it is inescapable. To construct a useful and accurate thermal model, the thermal engineer must rely on both forms of modeling to achieve the desired result. It is also worth mentioning that thermal models frequently serve another highly useful purpose: they are often the first, and only, physical representation of a product in development and are therefore sought by marketing and management as "show and tell" items. This is extremely important because thermal models can serve as platforms from which the thermal engineer further negotiates for appropriate thermal design features.
The Challenge of High-Powered CPUs
As previously stated, technology trends are driving CPU designs toward higher-powered multi-chip modules. High-powered multi-chip modules, such as Intel's Pentium, Deschutes, and Merced, as well as RISC-based processors from HP, have become or are becoming standard throughout the industry. Thermally managing these devices grows increasingly difficult as demand for processor speed continues to rise. The relationship between processor speed and power dissipation ensures ever-increasing power dissipation levels, and this speed-power relationship guarantees thermal design challenges for years to come. (See Figures 3, 4 and 5.)
Figures 3, 4 and 5 depict the trend in power dissipation for a variety of processor families (AMD, Intel and HP).
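The speed-power relationship behind these trends is the first-order dynamic power equation for CMOS logic, P = C * V^2 * f (switched capacitance times supply voltage squared times clock frequency). The capacitance and voltage values below are invented solely to show the trend, not data for any real processor.

```python
# Sketch: the first-order CMOS speed-power relationship, P = C * V^2 * f.
# C and V below are invented illustrations, not real CPU parameters.

def dynamic_power_w(c_farads, v_volts, f_hz):
    """First-order dynamic power dissipation of a CMOS device."""
    return c_farads * v_volts**2 * f_hz

# Doubling clock frequency at fixed voltage doubles power:
p1 = dynamic_power_w(50e-9, 2.0, 300e6)  # 50 nF effective, 2.0 V, 300 MHz
p2 = dynamic_power_w(50e-9, 2.0, 600e6)  # same device at 600 MHz
print(p1, p2)
```

The quadratic voltage term is why the industry's move to 3.0 V and lower supplies helps contain power even as frequencies climb; the linear frequency term is why power still rises with every speed grade.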
Heat sink designs for these devices have shifted the paradigm in the thermal solutions marketplace. Conventional thermal technologies can no longer support high heat flux devices such as Merced, which is expected to dissipate 130 watts in its first version release (running at 700 MHz). Industry rumors put Merced follow-on "kicker" products at over 1 GHz, with power dissipation levels approaching 200 watts. For the most part, thermal design goals for high-powered CPUs will require application-specific and exotic heat sink solutions, including narrow channel, closed-loop heat pipe, inverted impingement, and various other forms of dedicated air mover-heat sink integrated solutions. (See Figures 6, 7, 8, and 9.)
Figures 6, 7, 8 and 9 depict examples of dedicated air mover, heat sink and heat pipe designs. The designs shown have an effective heatsink-to-ambient resistance of less than 0.5 °C/watt. Shown are designs offered by ChipCoolers, Heat Technology Inc., Advanced Thermal Solutions, and Thermacore. (For clarity, most are shown without the air mover.)
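To see why a sub-0.5 °C/W sink-to-ambient resistance matters at these power levels, the series thermal-resistance stack gives junction temperature as Tj = Ta + P * (theta_jc + theta_cs + theta_sa). The junction-to-case and interface resistances below are assumed values for illustration only.

```python
# Sketch: junction temperature from a series thermal resistance stack,
#   Tj = Ta + P * (theta_jc + theta_cs + theta_sa)
# The theta_jc and theta_cs values are assumptions for illustration.

def junction_temp_c(t_ambient, power_w, theta_jc, theta_cs, theta_sa):
    """Steady-state junction temperature (C) for a series thermal path."""
    return t_ambient + power_w * (theta_jc + theta_cs + theta_sa)

# A hypothetical 130 W module in 35 C inlet air, with a 0.5 C/W sink:
tj = junction_temp_c(35.0, 130.0, theta_jc=0.10, theta_cs=0.05, theta_sa=0.50)
print(f"Tj ~ {tj:.0f} C")
```

At 130 W every hundredth of a °C/W in the stack moves the junction more than a degree, which is why the interface materials discussed in the next section matter as much as the sink itself.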
Noteworthy Technology Improvements
In addition to the aforementioned modeling and software tools, technology improvements in air movers and thermal interface materials have given thermal designers new ammunition for their design arsenal. Cold-flow, phase-change, and other new low-impedance thermal interface materials allow thermal designers to depend on low-resistance thermal-mechanical joints for heat sink attachment. Such innovation at the joint level is welcome, since the joint is an often overlooked yet most critical part of a heat sink design.
Air mover technology has improved to the point where new air movers are replacing older, more traditional designs. For example, Papst's new DV series units offer a two- to three-fold improvement in volumetric flow rate delivered, as well as in the ability to overcome impedance. The DV series units have the same form factor (size) as the commonplace 6.75" diameter tube-axial fan, a workhorse air mover found throughout the telco computer industry. New products like these will enable thermal designers to push the envelope of forced convection cooling even further. (See Figures 10 and 11.)
Figures 10 and 11 show the new Papst DV series air mover, as well as a performance curve comparison of the DV versus the more conventional 6 series fans.
As we prepare to turn the calendar on a new century, the future of thermal design holds great promise, specifically in its role in system-level computer design. This is especially true in the marketplace for fault tolerant and high availability systems. Technology and packaging trends all but guarantee a never-ending set of thermal design challenges that must be met if the high availability marketplace is to keep pace with the explosion of new semiconductor development and its associated appetite for power.
All of these factors ensure that thermal design will continue to play an ever-growing part in the direction of computer system design and product definition.
Lucent Technologies CNS