By: Edwin Sarmiento
Problem
The 2-node Windows Server Failover Cluster (WSFC) running my SQL Server failover clustered instance suddenly went offline. It turns out that my quorum disk and the standby node in the cluster both went offline at the same time. I could not connect to the WSFC nor to my SQL Server failover clustered instance. What do I need to do to bring my SQL Server failover clustered instance back online?
Solution
Since a SQL Server failover clustered instance runs on top of a WSFC, whether it stays online or not is dictated by the cluster quorum configuration. To better understand this behavior, we need to understand what the quorum is for. I kind of think of a cluster quorum as "majority votes win." When there is a majority of votes, a decision can be made to "do something." In a WSFC, a quorum determines whether or not the cluster stays online. If there is no quorum (or majority of votes), the cluster will not stay online. A more detailed discussion of a cluster quorum is available in this TechNet article.
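The "majority votes win" idea can be sketched in a few lines of PowerShell. This is only an illustration of the arithmetic: Test-QuorumMajority is a hypothetical helper, not a cmdlet from the FailoverClusters module, and it deliberately considers nothing beyond a simple vote count.

```powershell
# Hypothetical helper for illustration only (not part of the FailoverClusters module).
# Returns $true when the votes currently online form a strict majority of all votes.
function Test-QuorumMajority {
    param(
        [Parameter(Mandatory)][int]$VotesOnline,
        [Parameter(Mandatory)][int]$TotalVotes
    )
    # A strict majority means more than half of the total votes are online.
    return (($VotesOnline * 2) -gt $TotalVotes)
}

# Example: a cluster with three votes in total.
Test-QuorumMajority -VotesOnline 2 -TotalVotes 3   # True: quorum is held
Test-QuorumMajority -VotesOnline 1 -TotalVotes 3   # False: quorum is lost
```

With three votes in total, losing any two of them at the same time leaves a single vote, which is not a majority.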
By default, all nodes in a failover cluster have a vote. In this particular configuration, the quorum disk and the standby node, both of which have votes, have gone offline, causing the cluster to lose quorum since it only has 1 out of 3 votes. And since the WSFC has gone offline, it takes the SQL Server failover clustered instance offline with it. Before we can bring the SQL Server failover clustered instance online, we need to bring the WSFC online first. This has to be done by force starting the WSFC without quorum. The goal is to bring the WSFC online as quickly as we possibly can so that we can bring the SQL Server failover clustered instance online. This process can be done either by using the Failover Cluster Manager console or Windows PowerShell. However, I don't recommend using the Failover Cluster Manager console for this particular task as it will just cause more delay in bringing the WSFC online. The Failover Cluster Manager console will attempt to connect to the WSFC on the active node that you are currently logged on to, and you'll probably spend at least 5 minutes waiting before it tells you that it could not connect to the cluster.
I strongly recommend using Windows PowerShell to perform this task. Make sure that you are a member of the local Administrators group on all of the cluster nodes and that you open the Windows PowerShell console with the Run As Administrator option. Depending on the server operating system version, you may need to import the FailoverClusters PowerShell module. Windows Server 2012 and higher include Windows PowerShell V3, which automatically loads modules when you call feature-specific cmdlets. Follow the steps below to perform this task.
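As a rough sketch of those prerequisites, the snippet below checks that the console is elevated and loads the FailoverClusters module explicitly. It assumes the Failover Clustering PowerShell tools are installed on the node. The explicit Import-Module is only needed on older operating system versions, but it is harmless on newer ones.

```powershell
# Check that this console was opened with the Run As Administrator option.
$identity  = [Security.Principal.WindowsIdentity]::GetCurrent()
$principal = New-Object Security.Principal.WindowsPrincipal($identity)
$isAdmin   = $principal.IsInRole([Security.Principal.WindowsBuiltInRole]::Administrator)

if (-not $isAdmin) {
    Write-Warning "This console is not elevated. Reopen it with Run As Administrator."
}

# Load the failover clustering cmdlets explicitly.
# PowerShell V3 and higher would load the module automatically on first use.
Import-Module FailoverClusters

# Confirm that the cmdlet needed for the force start is available.
Get-Command -Name Start-ClusterNode -Module FailoverClusters
```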
- Verify that the Cluster Service is not running on the current active node.
- Use the Start-ClusterNode PowerShell cmdlet, passing the -FixQuorum parameter.
The Start-ClusterNode PowerShell cmdlet will start the Cluster Service on the specified node. The -FixQuorum parameter forces the cluster node to start even if quorum has not been achieved. In this case, quorum cannot be achieved because you only have 1 out of the 3 possible votes in the cluster. In the example below, I am currently logged in to the cluster node WS-CLUSTER1 and would like to start the Cluster Service on that node.
Start-ClusterNode -Name "WS-CLUSTER1" -FixQuorum
Once the PowerShell command has been executed, you can now use the Failover Cluster Manager console to connect to the WSFC. Note that it warns you that the WSFC is in a ForcedQuorum state.
- Set the NodeWeight property of the cluster node to guarantee that it is a voting member of the quorum.
Once the WSFC has been brought online, make sure that the cluster node is guaranteed as a voting member. This can be done by using the Get-ClusterNode PowerShell cmdlet, setting the NodeWeight property equal to 1.
(Get-ClusterNode -Name "WS-CLUSTER1").NodeWeight = 1
You won't see any output after running this command. However, you can verify if the settings were applied by running the Get-ClusterNode PowerShell cmdlet and displaying the State and NodeWeight properties.
Get-ClusterNode -Name "WS-CLUSTER1" | Select-Object NodeName, State, NodeWeight
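For the first step, verifying that the Cluster Service is not running is as simple as opening up the Services console and checking the status of the Cluster Service. If it is running, stop the service before force starting the node.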
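The same Cluster Service check can also be done from the PowerShell console instead of the Services console. A minimal sketch, assuming it is run locally on the active node in an elevated session (ClusSvc is the service name of the Cluster Service):

```powershell
# Look up the Cluster Service on the local node and show its current status.
$clusterService = Get-Service -Name ClusSvc
$clusterService | Select-Object Name, DisplayName, Status

# If the service is running, stop it before force starting the node.
if ($clusterService.Status -eq 'Running') {
    Stop-Service -Name ClusSvc
}
```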
Once the WSFC is online, the SQL Server failover clustered instance is automatically brought online. You can opt to change the cluster quorum settings to temporarily use a file share witness while you fix the quorum disk and attempt to bring the other cluster node online. By following the outlined steps, you can quickly bring your SQL Server failover clustered instance back online and meet your recovery time objective (RTO).
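One possible way to confirm that the clustered roles are back online and to switch temporarily to a file share witness is sketched below. The share path \\FILESERVER1\ClusterWitness is a hypothetical placeholder. The sketch assumes the share already exists and that the cluster has been granted access to it.

```powershell
# List the clustered roles with their owner node and state to confirm
# that the SQL Server failover clustered instance is online.
Get-ClusterGroup | Select-Object Name, OwnerNode, State

# Hypothetical file share used as a temporary witness; replace with a real share.
$witnessShare = '\\FILESERVER1\ClusterWitness'

# Change the quorum configuration to node and file share majority.
Set-ClusterQuorum -NodeAndFileShareMajority $witnessShare

# Display the resulting quorum configuration.
Get-ClusterQuorum
```

Once the quorum disk is repaired and the standby node has rejoined the cluster, the quorum configuration can be changed back in the same way.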
A word of caution: Avoid the temptation to troubleshoot the issue or investigate the root cause while bringing the SQL Server failover clustered instance online. As engineers, we almost always want to solve a particular issue immediately. The goal in every disaster recovery situation is to bring the system back online as quickly as we possibly can to meet our recovery objective. You can leave the investigation and troubleshooting until after the SQL Server failover clustered instance is brought online, the applications can connect to the databases and the users are happy.
Next Steps
Check out the following items:
- Clustering tips
- Edwin Sarmiento's tips since 2008
About the author
This author pledges the content of this article is based on professional experience and not AI generated.