SME Server Software RAID Failure, GRUB 0x10 Error
An SME customer called this morning saying that his system had apparently stopped working (web pages and mail were unavailable) and therefore he had rebooted.
Unfortunately, the grub boot would start to load the SME kernel and then fail with a 0x10 message. This was an “E-Machine”, which was a choice I remember being unhappy about when it was first installed, but this customer is very price conscious and ignored my advice that better hardware would be smarter. Oh well.
As I had nothing better to do (yeah, right), I hopped in my car and drove down to RI to see this first hand. I should have looked up the error before getting in my car, but it was early and I hadn’t had enough coffee yet. If I HAD looked it up, I would have quickly found this (from http://linux.derkeiler.com/Newsgroups/comp.os.linux.setup/2003-08/0074.html):
"Internal error". This code is generated by the sector read routine of the
LILO boot loader whenever an internal inconsistency is detected. This might
be caused by corrupt files. Reinstall IPCop or recreate the boot media.
"Illegal command". This shouldn't happen, but if it does, it may indicate
an attempt to access a disk which is not supported by the BIOS.
"Address mark not found". This usually indicates a media problem. Try
again several times.
"Write-protected disk". This should only occur on write operations.
"Sector not found". This typically indicates bad disk parameters in the
IPCop PC's BIOS. If you are booting from a large IDE disk, you should check
whether the IPCop PC's can handle the disk.
"Change line active". This should be a transient error. Try booting a
"Invalid initialization". The BIOS failed to properly initialize the disk
controller. You should control the BIOS setup parameters. A warm boot might
"DMA overrun". This shouldn't happen. Try booting again.
"Invalid media". This shouldn't happen and might be caused by a media
error. Try booting again.
"CRC error". A media error has been detected. Try booting several times,
and if all else fails, replace the media.
"ECC correction successful". A read error occurred but was corrected. LILO
does not recognize this condition and aborts the load process anyway. A
second load attempt should succeed.
"Controller error". This shouldn't happen.
"Seek failure". This might be a media problem. Try booting again.
"Disk timeout". The disk or the drive isn't ready. Either the media is bad
or the disk isn't spinning. If you're booting from a floppy, you might not
have closed the drive door. Otherwise, trying to boot again might help.
"BIOS error". This shouldn't happen. Try booting again.
Well, I felt it had to be hardware, so that would have just confirmed it, and I did feel that it was going to be easier to track this down on-site than trying to work with the client over the phone. Providence isn’t very far away, so..
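The list quoted above gives the messages without their hex codes, so here is a minimal lookup sketch. It assumes the standard LILO disk error codes (which mirror the BIOS INT 13h status codes); the code values and the 0x09 entry do not appear in the excerpt and come from the LILO documentation as I recall it. The names LILO_DISK_ERRORS and describe_disk_error are made up for illustration. If the 0x10 I saw maps the same way, it would be the "CRC error" entry, which fits a media problem.

```python
# Assumed mapping of LILO disk error codes to messages. The hex values are
# not in the quoted excerpt; they are taken from the LILO documentation
# from memory and should be checked against it.
LILO_DISK_ERRORS = {
    0x00: "Internal error",
    0x01: "Illegal command",
    0x02: "Address mark not found",
    0x03: "Write-protected disk",
    0x04: "Sector not found",
    0x06: "Change line active",
    0x07: "Invalid initialization",
    0x08: "DMA overrun",
    0x09: "DMA attempt across 64k boundary",  # not present in the excerpt
    0x0C: "Invalid media",
    0x10: "CRC error",
    0x11: "ECC correction successful",
    0x20: "Controller error",
    0x40: "Seek failure",
    0x80: "Disk timeout",
    0xBB: "BIOS error",
}


def describe_disk_error(code):
    """Return the message for a code given as an int or a hex string like '0x10'."""
    if isinstance(code, str):
        code = int(code, 16)  # base 16 accepts an optional '0x' prefix
    return LILO_DISK_ERRORS.get(code, "Unknown error code")


print(describe_disk_error("0x10"))
```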
When I arrived on site, I just quickly confirmed that the symptoms were as told to me. Too many times I have had someone tell me one thing and found something entirely different when on-site, but this time the error was accurately reported. Still lacking sufficient coffee, I sat down at a Windows machine and tried to call up Google.
Well, duh! The SME is the gateway to the Internet! No gateway, no Internet, no Google. I shook my head in amazement and called Mitel support. In a very few minutes, I had one of the regular engineers on the phone. I explained that I would have looked this up myself if I had turned on my brain before getting in my car, and he laughed at me and did the search for me. In a few seconds, he told me it was most likely hardware.
I asked the customer for last night’s DVD (we run Microlite Edge for backup here) but it wouldn’t boot. That surprised me at the time, though later I found out why. I then asked for the boot recovery floppy we had created when the system was installed. That wouldn’t boot either, which was upsetting. Finally, I asked if he had a recent Desktop Backup – he said yes, but when we tried to find it on his Windows machines, there was none.
Oh boy. Just the way I wanted things to work out – no backups, hardware boot error. Good thing it’s only a 40 person office. Yes, I’m being sarcastic.
Fearing the worst, I inserted the SME install CD and rebooted. To my surprise, it saw the existing installation and offered to upgrade it. What the heck – I let it try, and it completed successfully. But the same 0x10 boot error came up. So, I booted that CD again, and this time when it got to the point of offering to upgrade, I did an ALT-F2 and had a shell prompt where I did a “cd /mnt/sysimage” and took a look around. All data was apparently intact, which meant that whatever hardware issue we had might be isolated to the boot files. I also now realized why the Edge DVD didn’t boot: this is a software raid system, which Edge can’t handle at the present time. We never told Edge to attempt a bootable backup because it can’t.
But knowing that it was RAID gave me hope. Examining /proc/mdstat showed me:
Personalities : [raid1]
read_ahead 1024 sectors
md2 : active raid1 hda3
      262016 blocks [2/1] [_U]
md1 : active raid1 hda2
      119684160 blocks [2/1] [_U]
md0 : active raid1 hda1
      102208 blocks [2/1] [_U]
unused devices: <none>
The Mitel engineer explained that it should be showing [UU] for each line, and that the [_U] indicated a raid problem. At that point, I decided we should shut down the machine and open it up.
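That check is easy to automate. What follows is only a rough sketch, assuming the /proc/mdstat layout shown above, where each array's status appears in brackets as a run of U (member up) and _ (member missing). The function name degraded_arrays and the SAMPLE text are mine, for illustration only.

```python
import re


def degraded_arrays(mdstat_text):
    """Return {array_name: status} for arrays whose status has a missing member."""
    degraded = {}
    current = None
    for line in mdstat_text.splitlines():
        # An array entry starts with a name such as "md0 :".
        header = re.match(r"(md\d+)\s*:", line)
        if header:
            current = header.group(1)
        # The status field contains only U and _ characters, e.g. [UU] or [_U].
        status = re.search(r"\[([U_]+)\]", line)
        if status and current and "_" in status.group(1):
            degraded[current] = status.group(1)
    return degraded


# Sample text copied from the output above.
SAMPLE = """Personalities : [raid1]
read_ahead 1024 sectors
md2 : active raid1 hda3
      262016 blocks [2/1] [_U]
md1 : active raid1 hda2
      119684160 blocks [2/1] [_U]
md0 : active raid1 hda1
      102208 blocks [2/1] [_U]
unused devices: <none>
"""

for name, status in degraded_arrays(SAMPLE).items():
    print(name, "is degraded:", status)
```

On a live system you would pass in the contents of /proc/mdstat instead of SAMPLE and have a scheduled job mail someone whenever the result is not empty.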
When we did that, I could immediately feel that the master IDE drive was much hotter than the slave. The slave was warm, the master was uncomfortably hot. Touching the top of it with my finger made me feel I could blister my skin if I left it there long – it was that hot. I removed it, changed the jumper on the slave to make it the master, put the cable back, and buttoned the machine up. To my relief, it rebooted.
That’s not a guarantee with RAID. If the hardware problem had caused data corruption prior to failing completely, the corruption would have been mirrored to the slave. Fortunately that was not the case here.
So we were back up – short one hard drive, but up and running. I asked the Mitel engineer if I needed to reinstall blades because of the “upgrade”, but he explained that it wouldn’t overwrite newer files.
I then took a look at the Edge backup files – the backup had been failing for the past 10 days. I chastised the customer for not alerting me to that problem but I realize that he’s a busy guy and probably had other things on his mind. I left the system doing a Desktop Backup and advised the customer that they really should consider better hardware for such a critical system.
*Originally published at APLawrence.com
A.P. Lawrence provides SCO Unix and Linux consulting services http://www.pcunix.com