Real Life DBA: Dealing with a SQL Server Cluster Failover
I had just gotten out of a short meeting and went to my desk to see what important e-mail messages I might have missed. In my inbox was a series of messages from ELM Enterprise Manager 3.0, the software we use to monitor our SQL Server event logs. The messages had arrived about 30 minutes earlier, while I was away from my desk.
Instantly, my eyes focused on the name of the server the messages came from: our production SQL Server cluster, the one that runs the database for our ERP package. In other words, the most important SQL Server I manage. It handles over a billion dollars' worth of sales every year and generally has over 100 active connections. Oh crap!
I quickly read through the messages. Right away, I discovered that SQL Server had failed over from the primary cluster node to the secondary cluster node. I then started Cluster Administrator and verified that this was true.
The SQL Server cluster was running Windows 2000 Advanced Server SP2 and SQL Server 7.0 SP3. The cluster was dedicated to SQL Server, and this was the first failover in the seven-plus months since the cluster had been installed.
My next step was to run to our company's Help Desk staff, who share the same room, and ask if they were getting any phone calls. Yes, a whole mess of them. I quickly told them that the SQL Server cluster had failed over and that they should tell users to exit their ERP software session and then log back on. The software is not smart enough to retry a broken connection, so users are down for a short time while they restart it.
Once I had informed the Help Desk staff, I went looking for the ERP Manager so he could inform all the appropriate parties. Then I ran back to my desk to find out what had happened.
My first step was to verify, from Enterprise Manager, that the SQL Server cluster was indeed up and running. It was. Fortunately, the failover process worked as advertised, and users were now able to get back to work after a 1-2 minute delay.
Since I didn't know what caused the failover, my first goal was to identify its cause so I could prevent it from happening again. Depending on the nature of the problem, the cluster might try to fail back, and if that happened, I wanted the primary node to work.
The most obvious place to start was the cluster's Event Log, so that is where I went. To my surprise, there was no immediate indication of what caused the problem. I noticed in the log that once the MSSQLSERVER service had failed, the Cluster Service tried to restart it, which is expected behavior. It tried to restart the service three times, failing each time. After the third try, the Cluster Service initiated a failover, which went smoothly.
I looked at the logs very carefully but found no obvious reason why the MSSQLSERVER service had failed. The only thing that looked at all suspicious was that the failover happened while the primary node was in the middle of a transaction log backup.
Next, I checked the SQL Server logs themselves, hoping to find more clues. There weren’t any. It was as if the MSSQLSERVER service died on its own, with no apparent cause. My stress level went up instantly. This would not be an easy fix.
Since there were no obvious clues, and I needed to find the cause of this problem right now, I decided to call Microsoft Technical Support. We have a company contract, so I didn’t need to get any special permission to make the call.
As you might expect, I got a first-tier SQL Server support specialist on the phone and described the problem. I don't really like dealing with first-tier support; they invariably can't help and have to bump the call to the next support level.
In any event, the support tech said he would send me an e-mail containing a utility that would locate and then compress all of the necessary hardware, software, and log information on my SQL Server cluster. I was not familiar with this particular tool, but when I got it, I ran it and sent the results to the support tech as soon as I could. Then I began waiting. He told me he would get back to me later in the afternoon, once he had a chance to review the logs.
While I was waiting, I did some research of my own on the Internet but couldn't find anything very useful. When I still hadn't heard from the Microsoft support tech, I called him. He said he wasn't finished, but he speculated that the failover occurred for one of these reasons:
The cluster didn’t have enough RAM.
The CPU was so busy that it caused the “Is Alive” cluster request to fail.
The database was corrupted.
After hearing the support tech's theories, I actually got a little upset. While I don't know everything there is to know about SQL Server clusters, I knew enough to say that the likelihood of any of these being the cause of the cluster failover was very small. Why? First, each node in the cluster has 4GB of RAM, and I had never seen SQL Server use even 3GB of it, ever. Second, the CPU was not all that busy, and if the "Is Alive" request had failed, it would have shown in the logs; it did not. And third, we had run DBCC CHECKDB on the database in question about six hours before the failover occurred, and it had shown no signs of problems. I explained this to the support tech, who told me he would look at the logs some more and get back to me the next day. Based on the conversation, I really doubted the tech had spent much time looking at the logs at this point.
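For reference, the consistency check we run is just a standard DBCC CHECKDB; a minimal sketch, with the database name being a stand-in for your own:

```sql
-- Check the allocation and structural integrity of every object
-- in the ERP database. 'ERPDB' is an illustrative name only.
-- NO_INFOMSGS suppresses the routine informational messages so
-- only real errors surface in the output.
DBCC CHECKDB ('ERPDB') WITH NO_INFOMSGS
```

A clean run (no errors returned) six hours before a crash is fairly strong evidence against database corruption as the cause, which is why that theory was easy to rule out.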
I was sitting at my desk, waiting to hear back from the Microsoft support tech, when my computer made a noise letting me know I had new e-mail. I checked my mail, and there was a series of messages from our alert software. Our production SQL Server cluster had just failed over again. Double crap!
This time, I didn't know what was going to happen. If the first node was still "bad," then a failover from the second node back to the first might mean I had a dead ERP system.
I immediately brought up Cluster Administrator to see what was happening. At first glance, all of the SQL Server cluster resources were pending. In other words, the failover had started but was not done. I froze as I watched the pending status, hoping it would change to online. After what was one of the longest minutes of my life, the resources changed from pending to online. At least I had one piece of good news: the failover had succeeded, and the SQL Server was back online. But for how long? I still hadn't found the cause of the initial failover, and I didn't know how to prevent another one.
Like the day before, I notified the proper people and then started looking at the error logs. This time, I found something useful: a reference to an application popup message:
Application popup: sqlservr.exe – Application Error : The instruction at “0x4115d3c0” referenced memory at “0x4115d3c0”. The required data was not placed into memory because of an I/O error status of “0xc000026e”.
Although this was not all that useful by itself, it helped more than the day before. I also reviewed the SQL Server logs; they contained many messages this time, but none of them pointed a conclusive finger at the cause of the problem. One log that was interesting was the DMP file created when the application error message above was logged. In other words, when the MSSQLSERVER service died, a DMP file was created (one wasn't created the day before). What I found interesting was that a particular stored procedure, written in-house for a report, appeared over 200 times in this DMP file. I didn't know what this meant, but when I next contacted Microsoft, I emphasized it and asked what it meant.
I called the Microsoft support tech and described what had just happened and what I had found. He wasn't of much use, only asking me to send the new log and DMP files to him. After doing this, since I had nothing else to do, I did some more research on the Internet, looking up the various error messages I had found in the logs.
I didn't really find anything useful on the Internet, but what I did find got me thinking. The previous week, I had made a single change on our production SQL Server. Previously, the server had been set so that it would not run parallel queries. We had turned this feature off about a year and a half earlier because of a known bug that caused problems when parallelism was turned on. Supposedly this bug was fixed in SP3 of SQL Server 7.0, but even though I had installed SP3 on this cluster, I had never turned parallelism back on. I was just being conservative.
So why had I turned it on the previous week? A report developer had come to me and asked me to analyze why a particular query she had written was so slow on the production server (which had parallelism off) but fast on the development server (which had parallelism on). Well, as you might expect, it was because of the parallelism setting. The difference between the two servers was significant: the query took only 40 seconds on the development server but 13 minutes on the production server (and what was worse, the development server was a smaller box than the production server).
Because of this report's poor performance, I decided it was perhaps time to turn parallelism back on for the production server. Parallelism had been on for the development server for over a year, and we had never seen the issues there that we had under previous service packs, so I assumed the bug had been fixed. I also conferred with my boss, and he agreed that turning it on made sense.
There was nothing specific in the logs pointing the finger at parallelism as the cause of the failover, but to be conservative, I decided to turn parallelism back off on the production server. There was little downside to doing this, other than slowing some reports, so I made the decision on my own. Besides, this was the only thing that had changed on the server in months, and, well, you just never know.
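For anyone wanting to make the same change, server-wide parallelism is controlled through sp_configure; a sketch of how I'd toggle it (a value of 1 disables parallel plans, 0 lets SQL Server use all available CPUs):

```sql
-- 'max degree of parallelism' is an advanced option, so it must
-- be exposed before it can be changed.
EXEC sp_configure 'show advanced options', 1
RECONFIGURE

-- Setting it to 1 disables parallel query plans server-wide;
-- setting it back to 0 would re-enable them on all CPUs.
EXEC sp_configure 'max degree of parallelism', 1
RECONFIGURE
```

The change takes effect immediately, with no service restart needed, which is part of why it was such a low-risk thing to flip back on my own.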
I called the support tech again, and he still wasn't of much use. At this point, I asked that the support call be escalated to the next level of support, as he wasn't helping me in a timely fashion. The support tech didn't have any problem with this; he said he would escalate it and that the person taking over the call would contact me. I soon received an e-mail from the new support tech saying he needed some time to review the logs and would contact me the next day.
I called the newly assigned Microsoft support tech and asked what he had found in the logs. Like me, he saw the 200-plus instances of this particular stored procedure in the DMP file. He explained that when parallelism is turned on, the Query Optimizer tries to determine whether a query can be broken down into multiple threads so that it can execute on multiple CPUs at the same time, speeding performance. Based on the DMP file, it appeared that for some reason, when this particular query ran, instead of creating only enough threads to match the number of CPUs in the server (which should have been four), over 200 threads were created, which in turn crashed the server. Obviously, this was a new bug.
To help confirm this, I did some research on the particular report that ran the offending stored procedure. I found that this report had run on both Monday and Tuesday at the exact same time the SQL Server failed over each day. Based on the evidence, it appeared we had found the cause of the failovers.
So the original parallelism bug we were familiar with had indeed been fixed in SP3 and was not the cause of the problem, but turning parallelism on had brought this new bug to the surface. One question I had was why we had never seen this problem on the development server, which had always had parallelism turned on. Our best guess is that the development server is rarely very busy, while the production server is almost always busy, so the bug must rear its ugly head only when a particular set of circumstances occurs (and we were lucky enough to experience it two days in a row).
So What Do We Do Now?
We have several options to deal with the bug. One option is to move to SQL Server 2000. Unfortunately, we can’t do that because our ERP package currently doesn’t support SQL Server 2000. It may be 6-12 months before this can happen.
Another option is to turn parallelism back on for the server but disable it for this particular query using the MAXDOP query hint. This, of course, assumes that we have no other queries that might cause a repeat of the bug we found.
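The per-query override would look something like this; the table and columns are illustrative stand-ins for the in-house report query, but the OPTION (MAXDOP) hint itself is standard:

```sql
-- Hypothetical version of the in-house report query.
-- OPTION (MAXDOP 1) forces a serial execution plan for this one
-- statement, even when the server-wide 'max degree of parallelism'
-- setting allows parallel plans.
SELECT CustomerID, SUM(SaleAmount) AS TotalSales
FROM SalesHistory
GROUP BY CustomerID
OPTION (MAXDOP 1)
```

The appeal of this approach is that it is surgical: only the one query known to trigger the bug gives up parallelism, while every other query keeps the performance benefit. Its weakness is exactly the assumption noted above, that no other query will ever hit the same code path.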
We could also work with Microsoft to help isolate the bug so they can potentially fix it. The problem with this option is that our server is a production server, and I can’t use it for experiments.
The last option is the one we chose: turn off parallelism on the production server and live with any poor query performance we get from not using it. We made this choice because we are very conservative, and it is more important to have our system up and running 24/7 than to run every query at the highest possible speed.
Let me know what you think about this case study, and perhaps how you might have handled it differently. Send your comments to firstname.lastname@example.org.
Brad M. McGehee is a full-time DBA with a large manufacturing company, and the publisher of http://www.SQL-Server-Performance.Com, a website specializing in SQL Server performance tuning and clustering.
He is an MVP, MCSE+I, MCSD, and a former MCT.
Brad also runs another website, http://www.WorldClassGear.com, which provides independent gear reviews for backpackers, trekkers, and adventure travelers.