Learn more about SQL Server tools

mssqltips logo
giveaway
 

Tutorials          DBA          Dev          BI          Career          Categories          Webcasts          Whitepapers          Today's Tip          Join

Tutorials      DBA      Dev      BI      Categories      Webcasts

DBA    Dev    BI    Categories

 

Recover WSFC using Forced Quorum for SQL Server AlwaysOn Availability Group


By:   |   Read Comments   |   Related Tips: More > AlwaysOn Availability

Problem

I have a SQL Server Availability Group (AG) on a Windows Server Failover Cluster (WSFC) with 3 nodes using Node Majority quorum configuration. In an unforeseen circumstance, 2 nodes went completely offline at the same time causing the WSFC to go down.

This caused the AG to go into a Resolving state and all the databases in the AG and the AG Listener to be inaccessible. How can I recover the WSFC cluster node to start without a quorum and later fully re-establish HA for AlwaysOn AG when the 2 nodes are up?

Solution

We can bring up the WSFC on minority elements that are required for a quorum using the forced quorum method.

To outline the solution per the problem description, we will assume our infrastructure configuration as below.

SQL Server AlwaysOn Configuration - Description: WSFC contains 3 SQL Servers SQLP1 (Primary), SQLP2 (Secondary) and SQLP3 (Secondary) as per the diagram below in a SQL Server AlwaysOn AG.

In this scenario, SQLP1 and SQLP3 are unavailable. WSFC goes down and AG is inaccessible. In this situation, when you try to connect to the WSFC from the Failover Cluster Manager, it will give you the error below.

Connect to WSFC Error

Since SQLP2 is still up and running, you can connect to SQLP2 from SSMS. From here, the AG Dashboard lists SQLP2 in the Availability replica, but AGTest is in a Resolving state, meaning none of the databases in the AG will be accessible.

AG in Resolving state

Recover WSFC using Force Quorum

The recovery steps to bring up the single node SQLP2 for the WSFC and subsequently re-establish HA for the AG can be summarized as per the flow chart below.

Steps to recovery WSFC using Force Quorum

Step 1 – Login to SQLP2

All recovery steps and commands can be performed remotely, but typically logging in to the server allows important checks such as making sure the server is actually up and available. Sometimes, a ping command could return a reply, but the Windows Server OS could be completely hung.

In this tip, we will login to server SQLP2 to execute the next step.

Step 2 – Shutdown cluster service

The cluster service on SQLP2 can be shutdown either from the Services Microsoft Management Console or a command prompt. We will use the latter and launch a command prompt running as Administrator to shut down the cluster service.

C:\>Net stop clussvc
Stopping Cluster Service

Step 3 - Bring up cluster service with Forcequorum

Once the cluster service on SQLP2 is stopped, start up the cluster service again with a force quorum.

C:\>Net start clussvc /forcequorum
Start-up Cluster Service with force quorum

Step 4 - Failover AG to SQLP2 with Allow Data Loss

You will now be able to connect to the WSFC using Failover Cluster Manager, but the AG will still be in a Resolving state. This means the AG Listener and AG databases are still inaccessible.

 When the WSFC is up, the user databases in the AG will now appear in the AG Dashboard, but in a Not Synchronizing state.

AG still inaccessible

We will need to perform a forced AG failover to the SQL Server replica which is still up and running using the with allow data loss option. This step is still required even when the node was the primary SQL Server replica prior to the disaster.

The T-SQL command below will be executed on SQLP2.

USE master
GO
PRINT @@SERVERNAME
GO
ALTER AVAILABILITY GROUP AGTest FORCE_FAILOVER_ALLOW_DATA_LOSS;
GO
Perform Forced AG Failover

Step 5 - AG should be up and running

Refresh the AG Dashboard and it should now show SQLP2 as Primary and running. This will allow applications to connect to the AG listener and the workload to resume as you would in a normal production environment.

AG Online

Step 6 - When SQLP1 and SQLP3 is up and running, resume AG data movement

When the 2 servers SQLP1 and SQLP3 are fixed and running, they will re-join the WSFC automatically. As long as AG is not removed from the 2 other SQL Server instances, they will be able to re-join into the AG as well. But all database synchronization for the AG will need to be manually resumed.

Resuming of data movement can be done either from SSMS or T-SQL.

Resume data movement from SSMS

Alternatively, the T-SQL to resume data movement for the AG is as below:

ALTER DATABASE [AGplaceHolder] SET HADR RESUME;
GO

Once the data movement is resumed and caught up the AG Dashboard will show the Synchronization State as Synchronized.

AG Databases is Synchronized

These same steps need to be performed on each user database in the AG and on all the secondary replicas to resume the data movement.

Conclusion

This tip shows how you can perform a force quorum to bring up WSFC online on a single alive node. The same steps are applicable to force a quorum in a single-site or multi-site cluster.

Most importantly, be aware that log truncation will be delayed on a given primary database while any of its secondary databases is suspended. Therefore, if the outage period is prolonged, consider removing the failed replica from the AG to avoid running out of disk space due to log truncation delay.

Next Steps


Last Update:






About the author
MSSQLTips author Simon Liew Simon Liew is an independent SQL Server Consultant in Sydney, Australia. He is a Microsoft Certified Master for SQL Server 2008 and holds a Master’s Degree in Distributed Computing.

View all my tips
Related Resources





More SQL Server Solutions











Post a comment or let the author know this tip helped.

All comments are reviewed, so stay on subject or we may delete your comment. Note: your email address is not published. Required fields are marked with an asterisk (*).

*Name    *Email    Notify for updates 


SQL tips:

*Enter Code refresh code     



Learn more about SQL Server tools