Redundant Switch Fault

On Saturday evening (September 3, 2009) we experienced a short general outage caused by our DNS and SQL cache pool servers rebooting. Normally this wouldn’t be a problem, but the PowerDNS Recursor package didn’t start at boot time, and a lack of DNS meant a general lack of anything useful taking place.

It was odd that some equipment rebooted while other equipment didn’t, so after a lot of thinking, we decided that one of the APC 30A redundant switches we use was probably faulty, since this isn’t the first time we’ve seen this. We pulled the switch from service (causing another reboot, although if we were right we didn’t want it putting the system at risk any longer) and opened the cover. Inside we found relay contacts with pitting and arcing:

[Photos: DCP_2588, DCP_2590, DCP_2591]

The other relays we opened up weren’t nearly as bad, but they still exhibited discoloration and pitting on the contacts. The one in the pictures was loaded between 10 and 15 amps, and it’s supposed to be rated for 30 amps (or 24 derated). Because this is the second failure we’ve had with this device, we’ve decided to remove the remaining ones from service, as they are likely to suffer the same fate in the future.

We apologize for the recent bumps in the normally smooth operation you’ve come to expect from us. We understand that your mail and DNS services are important to you, and they are important to us too, since we use the same services for our own mail. As such, a discount/credit will be forthcoming.