Unexpected downtime owing to hardware failure

Posted: Wed Jun 24, 2015 5:57 pm
by embleton
Unfortunately, a RAM chip failed in the server last night, causing the machine to reboot randomly and to break the RAID1 mirror each time it rebooted. The fault was investigated and a bank of DDR3 chips was identified as the culprit.

An order was placed with a supplier yesterday afternoon. All the DDR3 memory banks on the machine concerned were replaced, including the faulty one, and the memory was upgraded to 12GB at the same time.

The system was successfully restored to full operation after approximately 24 hours. We apologise for the downtime caused by this unexpected failure on the web server, but it couldn't be helped.

Re: Unexpected downtime owing to hardware failure

Posted: Mon Jun 29, 2015 8:29 am
by embleton
This hardware fault caused the RAID1 mirror to report a state change on the Intel X58 chipset RAID controller. To resolve this, first identify the hard drive that is reporting the state change by unplugging each drive in turn until the system boots.

Once this is done, run a disk check on the good data drive to clear the state change in Windows/Linux. Then reconnect the other drive and, in the RAID controller utility that is accessible whilst the system is booting, delete the single disk in the array that is reported as being in error.

Reboot the computer and it should prompt you to rebuild the RAID1 mirror with the spare disk that has become available. There is no need to replace the hard disk that is incorrectly reporting a fault, as a state change is usually caused by another hardware failure, such as faulty memory or loose SATA cables.

A reported state change will usually result in the computer not booting whilst both drives are connected in the RAID1 mirror: the symptom is that the computer will continually reboot itself and never start up correctly. Don't panic, just follow the instructions above to get it to boot from the good drive, and remember you should really have full backups of the system even if you are running a RAID1 mirror!
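
For anyone on Linux who wants to confirm that the mirror has finished rebuilding after following the steps above, here is a rough sketch in Python. It assumes the RAID1 array is assembled by mdadm and therefore shows up in /proc/mdstat, which may not be true of every setup; on Windows you would use the RAID vendor's own utility instead. The function name mirror_status is just something made up for this example.

```python
import re
from pathlib import Path

# Header line of an array, e.g. "md126 : active raid1 sda[1] sdb[0]".
HEADER_RE = re.compile(r"^(md\S+)\s*:")
# Member status on the following line, e.g. "[2/1] [U_]" (expected/active).
STATUS_RE = re.compile(r"\[(\d+)/(\d+)\]\s+\[([U_]+)\]")
# Progress line printed while a replaced disk is being rebuilt.
RECOVERY_RE = re.compile(r"\brecovery\s*=")


def mirror_status(mdstat_text):
    """Map each md array to 'healthy', 'degraded' or 'rebuilding'.

    Hypothetical helper: arrays without a member status line
    (for example metadata containers) are left out of the result.
    """
    result = {}
    current = None
    for line in mdstat_text.splitlines():
        if not line.strip():
            current = None  # a blank line ends the current array block
            continue
        header = HEADER_RE.match(line)
        if header:
            current = header.group(1)
            continue
        if current is None:
            continue
        status = STATUS_RE.search(line)
        if status:
            expected = int(status.group(1))
            active = int(status.group(2))
            result[current] = "healthy" if active == expected else "degraded"
        elif RECOVERY_RE.search(line) and result.get(current) == "degraded":
            result[current] = "rebuilding"
    return result


if __name__ == "__main__":
    mdstat = Path("/proc/mdstat")
    if mdstat.exists():
        for name, state in mirror_status(mdstat.read_text()).items():
            print(f"{name}: {state}")
    else:
        print("/proc/mdstat not found; is this a Linux md system?")
```

If an array is reported as degraded with no rebuild in progress, the mirror is still running on a single disk, so go back over the steps above before trusting it.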