Another RAID failure
There must be something in the air. I’ve had another RAID failure. This time, it was a hardware RAID, specifically a seven-year-old DPT controller (DPT was subsequently bought by Adaptec).
The “Windows consultant” called me first, saying that he had come in and found the machine beeping, and realized this must be a drive failure. He also said that the backup had failed, and gave the usual apologetic “I’m not a Unix guy” (funny, though – he runs his own website on a Linux box). I understood the concern, but as I pointed out to him, this isn’t an OS issue at all – the RAID is OS-independent. The failed backup did worry me, though, because that could indicate something more serious, like a controller or motherboard problem.
This is too important a system to leave to chance, so I cleared my schedule and drove down to the site. It’s not that I don’t trust the Windows guy, but I didn’t want information filtered through a telephone – either from him or from me to him. Too easy to make an awful, irretrievable mistake.
Upon arrival, I ran “dptmgr” and confirmed that ID 3 did indeed show as failed. I also looked at the Microlite Edge printout and could see that the backup failure was in just one file – a Hard Read Error 6. It happened to be a log file, so if that was all it was, I wasn’t too concerned. However, an error like that can come from two places – either real read errors from the array, or file system inconsistency, where the inode contains pointers to impossible blocks. I explained to the customer that the failed drive by itself wouldn’t cause real read errors – the RAID reconstructs the missing data. Therefore, if these really were bad reads, we had a very serious problem.
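To make the “RAID reconstructs the missing data” point concrete, here is a minimal Python sketch. It assumes a parity-based layout in the RAID 5 style; the actual RAID level of this array isn’t stated here, and a real controller does this in firmware. The `xor_blocks` helper and the sample block contents are made up for illustration.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte position by byte position."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# One stripe spread over three data drives plus a parity block
# (hypothetical layout and contents, for illustration only).
data = [b"ACCT", b"LOGS", b"DATA"]
parity = xor_blocks(data)

# Simulate losing the drive that held data[1]: XOR of the surviving
# data blocks and the parity block yields the missing block.
survivors = [data[0], data[2], parity]
rebuilt = xor_blocks(survivors)
```

Under that assumption, `rebuilt` comes back identical to the lost block, which is why one dead drive alone shouldn’t produce read errors. It also shows why a second failure is fatal: with two blocks of the stripe missing, there is nothing left to XOR against.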
However, nothing in the system logs (messages) showed any disk read errors, so file system damage looked like the more likely cause. This would most likely be related to the RAID failure – the disk might not have failed instantly, and may have caused some corruption as it died. If it truly was confined to that one file, we’d be fortunate indeed. I ran an “fsck -ofull” (SCO system) and sure enough, it identified problems with the same file Microlite BackupEDGE had complained about, and was able to clear everything out and give us back a good filesystem. That was a relief.
Now, of course, we needed to fix the failed drive. We had a bit of low comedy there – the last time I had seen the cabinet the drives were in was seven years ago, and I don’t think the Windows guy had ever seen it. We couldn’t figure out how to open it to get at the drives! But that wasn’t what really bothered me. It was the replacement drives he had that had me worried. When we had originally installed these drives, we had tagged each drive with a paper sticky tag giving its ID. The drive he was proposing to replace the failed one with had such a tag on it, making me suspect that it was a bad drive previously removed from this box. However, we had nothing else – it’s hard to find SCSI-3 drives off the shelf nowadays, so after finally figuring out how to get the old drive out, we put in the replacement and started the rebuild process. Based on the percentage counter, I knew it would take close to three hours for a rebuild. There’s no reason the system couldn’t be used while rebuilding, but the customer and the Windows guy said they’d prefer to just wait. I went along, and we went for a long lunch.
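The rebuild estimate is nothing more than linear extrapolation from the percentage counter. A rough Python sketch follows; the function name and the sample figures (5% done after 9 minutes) are hypothetical, chosen only to land near the three-hour mark, and it assumes the rebuild proceeds at a constant rate.

```python
def estimate_rebuild_minutes(percent_done, minutes_elapsed):
    """Linearly extrapolate total and remaining rebuild time from a percentage counter."""
    if not 0 < percent_done <= 100:
        raise ValueError("percent_done must be greater than 0 and at most 100")
    total = minutes_elapsed * 100.0 / percent_done
    return total, total - minutes_elapsed


# Hypothetical reading: the counter shows 5% after 9 minutes.
total, remaining = estimate_rebuild_minutes(percent_done=5, minutes_elapsed=9)
```

With those sample numbers the total works out to 180 minutes. Real rebuilds rarely run at a perfectly steady rate, especially if the system is in use, so treat the result as a ballpark figure.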
Shortly after we came back, the rebuild failed. I wasn’t overly surprised. By now, we had found new drives, which were on their way by FedEx, but there was little more we could do that day. I told the customer to let people back on, but to warn them that there was a small possibility of losing whatever they posted that day (if we lost another drive, we’d be dead). I left.
The next morning, I called the customer again. He said that the backup had failed again. I asked for specifics, but was told there was no printout. I checked the Edge logs, and it looked to me like the backup had been interrupted partway through the verify. I asked if the database was “up” this morning (we shut down the database before the backup and restart it when it is done). I was told no, the Windows guy had rebooted the machine this morning because the database wasn’t running. I wish people wouldn’t reboot machines – it’s simple to start the database, and I just can’t stand the Windows “reboot fixes everything” mentality. Anyway, I could tell from the logs what had happened – because the RAID was running degraded, the backup was much slower. It just hadn’t finished its verify by the time the workday started – and to make it worse, some people had come in early because they had lost so much time the day before. Since the backup job hadn’t finished, it hadn’t restarted the database. I couldn’t be 100% certain that the verify would have passed, but the backup itself had no errors, and the verify was OK up to the point of the reboot. I explained that much to the customer, and we reset the backup to start earlier as a temporary fix.
The new drive should be there tomorrow. Unless something really unfortunate happens, we should get this back in shape then.
*Originally published at APLawrence.com
A.P. Lawrence provides SCO Unix and Linux consulting services http://www.pcunix.com