Force Start a Windows Server Failover Cluster without a Quorum to bring a SQL Server Failover Clustered Instance Online



Problem

The 2-node Windows Server Failover Cluster (WSFC) running my SQL Server failover clustered instance suddenly went offline. It turns out that my quorum disk and the standby node in the cluster both went offline at the same time. I could not connect to the WSFC or to my SQL Server failover clustered instance. What do I need to do to bring my SQL Server failover clustered instance back online?

Solution

Since a SQL Server failover clustered instance runs on top of a WSFC, whether it stays online or not is dictated by the cluster quorum configuration. To better understand this behavior, we need to understand what the quorum is for. I think of a cluster quorum as "majority votes win." When there is a majority of votes, a decision can be made to "do something." In a WSFC, the quorum determines whether or not the cluster stays online. If there is no quorum (that is, no majority of votes), the cluster will not stay online. A more detailed discussion of a cluster quorum is available in this TechNet article.

By default, all nodes in a failover cluster will have a vote. In this particular configuration, the quorum disk and the standby node - both of which have votes - have gone offline, thereby causing the cluster to lose quorum since it only has 1 out of 3 votes. And since the WSFC has gone offline, it takes the SQL Server failover clustered instance offline with it.

Before we can even bring the SQL Server failover clustered instance online, we need to bring the WSFC online first. This has to be done by force starting the WSFC without the quorum. The goal is to bring the WSFC online as quickly as we possibly can so that we can bring the SQL Server failover clustered instance online. This process can be done either by using the Failover Cluster Manager console or Windows PowerShell. However, I don't recommend using the Failover Cluster Manager console to perform this particular task as it will just cause more delay in bringing the WSFC online. The Failover Cluster Manager console will attempt to connect to the WSFC instance on the active node that you are currently logged on to. You'll probably spend at least 5 minutes waiting before it tells you that it could not connect to the cluster.
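To make the vote arithmetic in this scenario concrete, here is a minimal sketch in PowerShell. The function name Test-QuorumMajority is hypothetical and is not part of the FailoverClusters module; it only assumes that quorum means a strict majority of all configured votes, as described above.

```powershell
# Hypothetical helper (not a FailoverClusters cmdlet): returns $true when the
# votes that are currently online form a strict majority of all configured votes.
function Test-QuorumMajority {
    param(
        [Parameter(Mandatory)] [int] $TotalVotes,
        [Parameter(Mandatory)] [int] $OnlineVotes
    )

    # A majority is more than half of all configured votes.
    $votesNeeded = [math]::Floor($TotalVotes / 2) + 1
    return ($OnlineVotes -ge $votesNeeded)
}

# Scenario from this tip: 2 nodes + 1 quorum disk = 3 votes, only 1 still online.
Test-QuorumMajority -TotalVotes 3 -OnlineVotes 1   # False: 2 votes are needed

# If either the quorum disk or the standby node had stayed online:
Test-QuorumMajority -TotalVotes 3 -OnlineVotes 2   # True
```

With only 1 of the 3 votes available, the check fails, which is why the cluster has to be force started.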


I strongly recommend using Windows PowerShell to perform this task. Make sure that you are a member of the local Administrators group on all of the cluster nodes and that you open the Windows PowerShell console with the Run As Administrator option. Depending on the server operating system version, you may need to import the FailoverClusters PowerShell module first. Windows Server 2012 and higher include Windows PowerShell 3.0 or later, which automatically loads a module when you call one of its cmdlets. Follow the steps below to perform this task.
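The prerequisites above can be checked from the console before you start. This is only a sketch: the function names Test-IsElevated and Initialize-ClusterSession are hypothetical helpers, and it assumes the Failover Clustering PowerShell tools are installed on the node.

```powershell
# Hypothetical helper: $true when the current PowerShell session is elevated
# (that is, it was started with the Run As Administrator option).
function Test-IsElevated {
    $identity  = [Security.Principal.WindowsIdentity]::GetCurrent()
    $principal = New-Object Security.Principal.WindowsPrincipal($identity)
    return $principal.IsInRole([Security.Principal.WindowsBuiltInRole]::Administrator)
}

# Hypothetical helper: stops early if the session is not elevated, then loads
# the FailoverClusters module explicitly. The explicit import is harmless on
# systems where modules load automatically.
function Initialize-ClusterSession {
    if (-not (Test-IsElevated)) {
        throw 'Open the PowerShell console with the Run As Administrator option.'
    }
    Import-Module FailoverClusters -ErrorAction Stop
}

# Usage, on the surviving cluster node:
#   Initialize-ClusterSession
```

Failing fast here avoids discovering a permissions problem halfway through the recovery.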

  1. Verify that the Cluster Service is not running on the current active node.

    This is as simple as opening up the Services console and checking whether the Cluster Service is running. If it is, stop the service.

  2. Use the Start-ClusterNode PowerShell cmdlet, passing the -FixQuorum parameter.

    The Start-ClusterNode PowerShell cmdlet will start the Cluster Service on the current node. The -FixQuorum parameter will force the cluster node to start even though quorum has not been achieved. In this case, quorum cannot be achieved because you only have 1 out of the 3 possible votes in the cluster. In the example below, I am currently logged in to the cluster node WS-CLUSTER1 and would like to start the Cluster Service on that node.

    Start-ClusterNode -Name "WS-CLUSTER1" -FixQuorum

    Once the PowerShell command has been executed, you can use the Failover Cluster Manager console to connect to the WSFC. Note that it warns you that the WSFC is in a ForcedQuorum state.

  3. Set the NodeWeight property of the cluster node to guarantee that it is a voting member of the quorum.

    Once the WSFC has been brought online, make sure that the cluster node is guaranteed as a voting member. This can be done by using the Get-ClusterNode PowerShell cmdlet, setting the NodeWeight property equal to 1.

    (Get-ClusterNode -Name "WS-CLUSTER1").NodeWeight = 1
    



    You won't see any output after running this command. However, you can verify that the setting was applied by running the Get-ClusterNode PowerShell cmdlet and displaying the State and NodeWeight properties.

    Get-ClusterNode -Name "WS-CLUSTER1" | Select-Object NodeName, State, NodeWeight
    



Once the WSFC is online, the SQL Server failover clustered instance is automatically brought online. You can opt to change the cluster quorum settings to temporarily use a file share witness while you fix the quorum disk and attempt to bring the other cluster node online. By following the outlined steps, you can quickly bring your SQL Server failover clustered instance back online and meet your recovery time objective (RTO).
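For repeatability, the steps above could be wrapped in a single function. The following is a sketch under stated assumptions, not a tested runbook: the function name Start-ClusterNodeWithoutQuorum is hypothetical, the node name defaults to the local computer name, and the Cluster Service is assumed to use its usual service name, ClusSvc. The file share path in the usage comment is a placeholder.

```powershell
# Hypothetical wrapper around the three steps in this tip. Run it in an elevated
# console on the surviving node. The function name is illustrative only.
function Start-ClusterNodeWithoutQuorum {
    param(
        # Assumption: the surviving node is the machine you are logged on to.
        [string] $NodeName = $env:COMPUTERNAME
    )

    Import-Module FailoverClusters -ErrorAction Stop

    # Step 1: make sure the Cluster Service (assumed service name: ClusSvc) is stopped.
    $service = Get-Service -Name ClusSvc
    if ($service.Status -ne 'Stopped') {
        Stop-Service -Name ClusSvc -Force
    }

    # Step 2: force start the Cluster Service on this node without quorum.
    Start-ClusterNode -Name $NodeName -FixQuorum

    # Step 3: guarantee that this node is a voting member of the quorum.
    (Get-ClusterNode -Name $NodeName).NodeWeight = 1

    # Show the result so that State and NodeWeight can be checked.
    Get-ClusterNode -Name $NodeName | Select-Object NodeName, State, NodeWeight
}

# Usage:
#   Start-ClusterNodeWithoutQuorum -NodeName 'WS-CLUSTER1'
#
# Optional follow-up: temporarily switch to a file share witness.
# The share path below is a placeholder, not a real server.
#   Set-ClusterQuorum -NodeAndFileShareMajority '\\FILESERVER\ClusterWitness'
```

Keeping the witness change as a separate, manual command is deliberate: it alters the quorum configuration, so it is better done once the instance is back online and you have confirmed the share is reachable.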

A word of caution: Avoid the temptation to troubleshoot the issue or investigate the root cause while bringing the SQL Server failover clustered instance online. As engineers, we almost always want to solve a particular issue immediately. The goal in every disaster recovery situation is to bring the system back online as quickly as we possibly can to meet our recovery objective. You can leave the investigation and troubleshooting until after the SQL Server failover clustered instance is brought online, the applications can connect to the databases, and the users are happy.




About the author
Edwin M Sarmiento is a Microsoft SQL Server MVP and Microsoft Certified Master from Ottawa, Canada specializing in high availability, disaster recovery and system infrastructures.

This author pledges the content of this article is based on professional experience and not AI generated.




Comments For This Article




Monday, August 28, 2023 - 11:59:42 PM - jimmyafflick
I was trying to run the command below. I am getting an error like "start-clusternode the system cannot find the file specified". How can I fix this error? Please let me know.
Start-ClusterNode –Name "WS-CLUSTER1" -FixQuorum
When I checked the cluster service, it was not running. When I try to start the service, I get an error like "windows could not start cluster service on local server". I guess it needs to be running. Could you please help me fix this?
I am looking forward to hearing from you.

Friday, April 22, 2022 - 4:13:31 AM - Taavi Tiitsmaa
Thanks, super helpful

Thursday, November 19, 2020 - 12:13:28 PM - Steve Jones
In your step 3 above, you have:
(Get-ClusterNode –Name "WS-CLUSTER1").NodeWeight = 1

You can't SET anything with a get- command, right?? Wouldn't this be more proper:

(Get-ClusterNode –Name "WS-CLUSTER1") | set-ClusterNode -NodeWeight 1

I'm not a SQL guy, but came here doing research for an Exchange cluster, and I think this is closer to what you mean?

Friday, October 13, 2017 - 4:00:03 PM - bass_player

 

In an Availability Group secondary replica, the SQL Server service should not be offline because it is independent of the WSFC. There could be a more serious issue. I would suggest opening a case with Microsoft to investigate further.


Friday, October 13, 2017 - 2:23:04 PM - Hugh

It's an Availability Group.  Yes, I would expect that the availability group's secondary replica to be off-line but I didn't expect the service to be stopped.  There was no System/Application event logged that would explain why the OS stopped the service.  Only that quorum was lost when WSFC and the fileshare witness lost communication to the secondary replica.

I appreciate the input.

 


Friday, October 13, 2017 - 11:37:55 AM - bass_player

 

Is this a SQL Server failover clustered instance (FCI) or Availability Group? If it's a FCI, the SQL Server service should be set to Manual, not Automatic. The WSFC controls the SQL Server service, hence, why it is set to Manual and not Automatic. If it is set to Automatic, there must be something wrong with the installation or configuration.

If this is an Availability Group, the SQL Server service is set to Automatic, not Manual. Only the Availability Group is controlled by the WSFC, the SQL Server service is managed by the OS. 


Friday, October 13, 2017 - 10:03:23 AM - Hugh

The service was set to AUTOMATIC start but was not running when I looked in Services. In other words, it did not restart as I would have expected even though, on the RECOVERY tab of the service properties, the "Restart service after" was set to 3 minutes.   So I had to start it manually.

I've subsequently added, on the RECOVERY tab, to "Restart The Service" on the First Failure setting to force a restart.

 


Thursday, October 12, 2017 - 11:21:42 AM - bass_player

 

If the WSFC goes offline, every resource running on top of it goes offline as well. The WSFC controls the SQL Server service, hence, why it is set to Manual and not Automatic. If the SQL Server service was initially running before the WSFC went offline, it will be changed to Stopped once the WSFC goes offline. That's because the WSFC that is responsible for starting and stopping it is no longer available.


Wednesday, October 11, 2017 - 3:37:55 PM - Hugh

"By default, all nodes in a failover cluster will have a vote. In this particular configuration, the quorum disk and the standby node - both of which have votes - have gone offline, thereby, causing the cluster to lose quorum since it only has 1 out of 3 votes. And since the WSFC has gone offline, it takes the SQL Server failover clustered instance offline with it." 

Edwin,

I had a similar incident where the error log recorded that the WSFC and quorum disk went off-line.  Instead of the failover instance going off-line the SQL service actually stopped.  Why would that happen?


Monday, March 16, 2015 - 11:01:37 AM - Aleksandr

Thank you!

Your post helped me bring my file server cluster online, even though the only node that was online didn't have a vote!!!


Wednesday, December 10, 2014 - 1:48:44 PM - bass_player

It's very tempting to immediately solve a problem during a disaster. However, we need to think in terms of what the business goals are. That's why I highlight the importance of recovery objectives and service level agreements in any HA/DR solution. The first 5 modules of my online course were made available to everyone because I feel that IT professionals need to know why they do what they do with regard to HA/DR.


Wednesday, December 10, 2014 - 12:57:10 PM - SonomaRik

liked the presentation.  Not sure if this is the universal method but when I get a chance will practice it on our practice nodes:

 

however, much I do appreciate the "don't troubleshoot" I could envision this happening many times in a row, or immediately after you bring it up on-line.

 

Suggestions?  Mine is to bring it up, then positively troubleshoot but warn the customers that it may happen, to lose connectivity, very soon to rectify.

 

we are talking a relatively low level two node cluster, albeit, ANY app is 'mission critical' [right]

 


Monday, August 25, 2014 - 6:56:07 PM - bass_player

Hi Daylinda,

That really depends on the version of Windows Server that you are running. Windows Server 2012 R2 introduced the concept of dynamic witness where the NodeWeight value of the nodes can be dynamically adjusted based on the configuration. Have a look at the new failover clustering features in Windows Server 2012 R2 in this TechNet article.


Saturday, August 23, 2014 - 2:03:48 AM - Daylinda Perry

Hi Edwin, when the issue is resolved e.g. Other 2 voting partners are back online....do we need to reset the value for NodeWeight on WS-CLUSTER1? Or does that get automatically overwritten?














