Hot and Cold Spares

The impact that a failed component has on a system or network depends largely on the pre-disaster preparation and on the recovery strategies used. Hot and cold spares represent a strategy for recovering from failed components.

Hot Spare and Hot Swapping

Hot spares gives system administrators the ability to quickly recover from component failureanother mechanism to deal with component failure. In a common use, a hot spare enables a RAID system to automatically failover to a spare hard drive should one of the other drives in the RAID array fail. A hot spare does not require any manual interventionrather, a redundant drive resides in the system at all times, just waiting to take over if another drive fails. The hot spare drive will take over automatically, leaving the failed drive to be removed at a later time. Even though hot-spare technology adds an extra level of protection to your system, after a drive has failed and the hot spare has been used, the situation should be remedied as soon as possible.

Hot swapping is the ability to replace a failed component while the system is running. Perhaps the most commonly identified hot-swap component is the hard drive. In certain RAID configurations, when a hard drive crashes, hot swapping allows you simply to take the failed drive out of the server and install a new one.

The benefits of hot swapping are very clear in that it allows a failed component to be recognized and replaced without compromising system availability. Depending on the system's configuration, the new hardware will normally be recognized automatically by both the current hardware and the operating system. Nowadays, most internal and external RAID subsystems support the hot-swapping feature. Some hot-swappable components include power supplies and hard disks.

Cold Spare and Cold Swapping

The term cold spare refers to a component, such as a hard disk, that resides within a computer system but requires manual intervention in case of component failure. A hot spare will engage automatically, but a cold spare might require configuration settings or some other action to engage it. A cold spare configuration will typically require a reboot of the system.

The term cold spare has also been used to refer to a redundant component that is stored outside the actual system but is kept in case of component failure. To replace the failed component with a cold spare, the system would need to be powered down.

Cold swapping refers to replacing components only after the system is completely powered off. This strategy is by far the least attractive for servers because the services provided by the server will be unavailable for the duration of the cold-swap procedure. Modern systems have come a long way to ensure that cold swapping is a rare occurrence. For some situations and for some components, however, cold swapping is the only method to replace a failed component. The only real defense against having to shut down the server is to have redundant components residing in the system.