Recover WSFC using Forced Quorum for SQL Server AlwaysOn Availability Group


By:   |   Updated: 2017-06-15   |   Comments (13)   |   Related: More > Availability Groups


Problem

I have a SQL Server Availability Group (AG) on a Windows Server Failover Cluster (WSFC) with 3 nodes using Node Majority quorum configuration. In an unforeseen circumstance, 2 nodes went completely offline at the same time causing the WSFC to go down.

This caused the AG to go into a Resolving state and all the databases in the AG and the AG Listener to be inaccessible. How can I recover the WSFC cluster node to start without a quorum and later fully re-establish HA for AlwaysOn AG when the 2 nodes are up?

Solution

We can bring up the WSFC on minority elements that are required for a quorum using the forced quorum method.

To outline the solution per the problem description, we will assume our infrastructure configuration as below.

SQL Server AlwaysOn Configuration - Description: WSFC contains 3 SQL Servers SQLP1 (Primary), SQLP2 (Secondary) and SQLP3 (Secondary) as per the diagram below in a SQL Server AlwaysOn AG.

In this scenario, SQLP1 and SQLP3 are unavailable. WSFC goes down and AG is inaccessible. In this situation, when you try to connect to the WSFC from the Failover Cluster Manager, it will give you the error below.

Connect to WSFC Error

Since SQLP2 is still up and running, you can connect to SQLP2 from SSMS. From here, the AG Dashboard lists SQLP2 in the Availability replica, but AGTest is in a Resolving state, meaning none of the databases in the AG will be accessible.

AG in Resolving state

Recover WSFC using Force Quorum

The recovery steps to bring up the single node SQLP2 for the WSFC and subsequently re-establish HA for the AG can be summarized as per the flow chart below.

Steps to recovery WSFC using Force Quorum

Step 1 – Login to SQLP2

All recovery steps and commands can be performed remotely, but typically logging in to the server allows important checks such as making sure the server is actually up and available. Sometimes, a ping command could return a reply, but the Windows Server OS could be completely hung.

In this tip, we will login to server SQLP2 to execute the next step.

Step 2 – Shutdown cluster service

The cluster service on SQLP2 can be shutdown either from the Services Microsoft Management Console or a command prompt. We will use the latter and launch a command prompt running as Administrator to shut down the cluster service.

C:\>Net stop clussvc
Stopping Cluster Service

Step 3 - Bring up cluster service with Forcequorum

Once the cluster service on SQLP2 is stopped, start up the cluster service again with a force quorum.

C:\>Net start clussvc /forcequorum
Start-up Cluster Service with force quorum

Step 4 - Failover AG to SQLP2 with Allow Data Loss

You will now be able to connect to the WSFC using Failover Cluster Manager, but the AG will still be in a Resolving state. This means the AG Listener and AG databases are still inaccessible.

 When the WSFC is up, the user databases in the AG will now appear in the AG Dashboard, but in a Not Synchronizing state.

AG still inaccessible

We will need to perform a forced AG failover to the SQL Server replica which is still up and running using the with allow data loss option. This step is still required even when the node was the primary SQL Server replica prior to the disaster.

The T-SQL command below will be executed on SQLP2.

USE master
GO
PRINT @@SERVERNAME
GO
ALTER AVAILABILITY GROUP AGTest FORCE_FAILOVER_ALLOW_DATA_LOSS;
GO
Perform Forced AG Failover

Step 5 - AG should be up and running

Refresh the AG Dashboard and it should now show SQLP2 as Primary and running. This will allow applications to connect to the AG listener and the workload to resume as you would in a normal production environment.

AG Online

Step 6 - When SQLP1 and SQLP3 is up and running, resume AG data movement

When the 2 servers SQLP1 and SQLP3 are fixed and running, they will re-join the WSFC automatically. As long as AG is not removed from the 2 other SQL Server instances, they will be able to re-join into the AG as well. But all database synchronization for the AG will need to be manually resumed.

Resuming of data movement can be done either from SSMS or T-SQL.

Resume data movement from SSMS

Alternatively, the T-SQL to resume data movement for the AG is as below:

ALTER DATABASE [AGplaceHolder] SET HADR RESUME;
GO

Once the data movement is resumed and caught up the AG Dashboard will show the Synchronization State as Synchronized.

AG Databases is Synchronized

These same steps need to be performed on each user database in the AG and on all the secondary replicas to resume the data movement.

Conclusion

This tip shows how you can perform a force quorum to bring up WSFC online on a single alive node. The same steps are applicable to force a quorum in a single-site or multi-site cluster.

Most importantly, be aware that log truncation will be delayed on a given primary database while any of its secondary databases is suspended. Therefore, if the outage period is prolonged, consider removing the failed replica from the AG to avoid running out of disk space due to log truncation delay.

Next Steps


Last Updated: 2017-06-15


get scripts

next tip button



About the author
MSSQLTips author Simon Liew Simon Liew is an independent SQL Server Consultant in Sydney, Australia. He is a Microsoft Certified Master for SQL Server 2008 and holds a Master’s Degree in Distributed Computing.

View all my tips
Related Resources





Comments For This Article




Thursday, February 07, 2019 - 2:22:00 PM - Amar Back To Top (78987)

 A very timely solution for a prod outage. Thank you for this clearly laid out post Simon!


Friday, January 11, 2019 - 9:13:13 AM - Wilner C Back To Top (78716)

 YES!! 100% what I was looking for. It is a bit unclear however. For example: Under Limitations and restrictions, it says the following: 

  • Changing the service account that is used by SQL Server or SQL Server Agent must be performed from the active node of the SQL Server cluster.

Does this mean  I have to change it on the primary, failover to a secondary node change it there and so on?

Secondly, it says below regarding windows server 2008. What exactly do they mean by that? and How can that be accomplished? and do you know if this would apply for 2012 R2?

  • When running on Windows Server 2008 (in a non-default configuration using Domain groups), changing the service account that is used by SQL Server or SQL Server Agent requires SQL Server Configuration Manager to stop SQL Server by taking the resource groups offline.


Friday, January 11, 2019 - 12:57:43 AM - Simon Liew Back To Top (78706)

Hi Wilner,

I'm assuming the service account you're referring here is the SQL Server service account, and the SQL Server instance is participating in an Availability Group. 

The process to change the service account first need to comply with changing a standalone SQL instance service account. This must be done from SQL Configuration Manger, never from the Services mmc. Also, you may want to note the section "Limitations and Restrictions" for "Clustered servers"

https://docs.microsoft.com/en-us/sql/database-engine/configure-windows/scm-services-change-the-service-startup-account?view=sql-server-2017

 

Then you would need to make sure the new login account has permissions to connect to each other

https://docs.microsoft.com/en-us/sql/database-engine/database-mirroring/set-up-login-accounts-database-mirroring-always-on-availability?view=sql-server-2017

There are also other considerations such as the endpoint owner (which defaulted to login used to establish Availability Group initially). 

https://docs.microsoft.com/en-us/sql/database-engine/database-mirroring/database-mirroring-allow-network-access-windows-authentication?view=sql-server-2017

Hope these helps.


Tuesday, January 08, 2019 - 2:32:56 PM - Wilner C Back To Top (78676)

 Hi Simon, 

This question is definitely under the wrong subject, however, i wanted to know if you could point me to some best practice procedures on service account password change for a 3 node alwayson availability group.  Thanks in advance. .


Saturday, November 03, 2018 - 5:35:21 AM - Simon Liew Back To Top (78145)

Wilner,

1. I was referring to primary data centre. Apology for not being specific

2. I've done DR test and do not suspend data movement prior to DR or at the end of DR test. Business knows the DR exercise is going on and they do not enter transactions during the period. Not suspending data movement has not impacted the test in my scenario.

Hope this helps.


Thursday, November 01, 2018 - 12:40:34 PM - Wilner C Back To Top (78129)

 Hi Simon,

One last question for you.   You said the following:

"In the Services mmc, there should be a Cluster Service on each of the node participating in the cluster. Prior to re-establishing the connectivity between primary and secondary site, you can stop the Cluster service on the primary node"

 

1. When you say to stop the cluster on the primary node, do you mean at the DR site? because prior to cutting connectivity, I would fail over to the DR site where the primary replica would now reside.  Do you mean stop the cluster srvice there or do you mean on the seconadry Nodes at the Primary data center?

2.  Is there at any point where data movement needs to be suspended in this operation?  I know we have to resume data movment at the end prior to failing back but do I suspend data movment at the DR or the Primary Data Center at anytime?

 

Thanks!


Friday, October 26, 2018 - 10:13:41 AM - Wilner C Back To Top (78065)

 Awesome! Thank you so much Simon.  I appreciate your help. 

 


Thursday, October 25, 2018 - 7:38:11 PM - Simon Liew Back To Top (78060)

Wilner,

In the Services mmc, there should be a Cluster Service on each of the node participating in the cluster. Prior to re-establishing the connectivity between primary and secondary site, you can stop the Cluster service on the primary node. Once network is reestablished, you can start the Cluster Service on the primary node one by this. This prevents a race condition where the cluster might get the wrong last known state prior to shutdown.

Checks on the cluster such as making sure the quorom type is still the same i.e. node with file share witness, etc. And also all nodes are up and available contributing to quorum.


Thursday, October 25, 2018 - 9:26:04 AM - Wilner C Back To Top (78055)

 Simon,

 

Thank you for the clarity. Thank you as well for that article.  I have been reading it over and over and It's starting to make sense to me. I do have one last question hopefully you can help me with.  You have been a great help already. 

 

When you say try to allow one node to rejoin at a time..How is this done since they will rejoin automatically? 

Secondly can you expound on the comment below with greater details and how this is done. I am a bit lost here.

- Check windows cluster manager to make sure all cluster is healthy. Sometimes it might require you to adjust the NodeWeight before bringing other nodes online to avoid cluster going into offline again.

 

Thank you greatly!

 


Wednesday, October 24, 2018 - 9:34:37 PM - Simon Liew Back To Top (78050)

Hi Wilner,

I'll start with half way through the steps you've provided

- Failover AG to DR

- Cut the connectivity between 2 sites

- Start cluster in forced quorum mode at DR

- (Fail back to primary site) When connectivity between sites are reestablished, the primary servers should automatically re-join back to cluster. But try to allow one node to re-join cluster at a time

- Check windows cluster manager to make sure all cluster is healthy. Sometimes it might require you to adjust the NodeWeight before bringing other nodes online to avoid cluster going into offline again

- Resume all AG databases data movement

- Fail back AG to primary site

 

Purpose of forced quorum is to disable quorum checks until the cluster achieves a majority of votes again. Once majority votes are achieved, the cluster will automatically transitions to a normal quorum operation. To recover from a forced quorum situation, you startup one node at a time without the forced quorum option. There are checks on cluster and AG which needs to be done even after all nodes in the cluster seem to have re-join the cluster successfully. For example, AG synchronisation might not always resume successfully, etc.

I found an article below which should be useful for your scenario. Hope this helps

https://docs.microsoft.com/en-us/sql/sql-server/failover-clusters/windows/wsfc-disaster-recovery-through-forced-quorum-sql-server?view=sql-server-2017


Wednesday, October 24, 2018 - 3:05:25 PM - Wilner C Back To Top (78045)

I appreciate you taking the time to answer Simon.  I guess i probably did not provide enough details.  

 

Node                site                          Quorom_vote    

srv-1                 primary                   1

srv-2                 primary                   1

fileshare                                           1

drsrv-1             secondary               1

drsrv-2             secondary               1

 

 

I have syncronus replication between the hosts on the primary with automatic failover

and asycnhronous with manual to the secondary site.

To test this, we will be cutting the line between the Primary site and the DR site. 

Failing over AG's to DR works fine, but when we cut the line between the two sites the WFC fails with lost quorom message. Which is correct sine three votes out of 4 is not majority. So then I have to start the cluster in forced quorom mode. When failing back to primary site I once again need to take down the cluster and start in normal mode.

Do you have any reference for the details steps? I hope the scenario made it a littel clearer. 

 

Thanks again

 

 

 


Wednesday, October 24, 2018 - 7:53:28 AM - Simon Liew Back To Top (78043)

Hi Wilner,

You can perform a force quorum to bring up WSFC online on a single alive node on the primary or secondary site. The same steps are applicable to force a quorum in a single-site or multi-site cluster.


Tuesday, October 23, 2018 - 3:01:29 PM - Wilner C Back To Top (78036)

 Thank you for sharing this.  Does this work in a set up where a secondary site is available and the primary site isn't?

 



download





Recommended Reading

Connect to SQL Server Availability Group replica with SSMS when Readable Secondary is Read-intent only

Configure SQL Server AlwaysOn Availability Group on a Multi-Subnet Cluster

Fix SQL Server AlwaysOn Availability Group Error: 1408 Joining database on secondary replica resulted in an error

What is SQL Server AlwaysOn?

Add a SQL Server Database to an Existing Always On Availability Group








get free sql tips
agree to terms


Learn more about SQL Server tools