SQL Server Availability Group Maintenance with Distributed Availability Groups


By:   |   Updated: 2021-10-19   |   Comments   |   Related: More > Availability Groups


Problem

In a previous tip on Configure Distributed Availability Groups for SQL Server Disaster Recovery with Failover Clustering, we have seen how to configure a Distributed Availability Group as a disaster recovery strategy for SQL Server failover clustered instances. What do I need to do on the Distributed Availability Group in case I need to perform maintenance on the replicas of the primary Availability Group?

Solution

There are cases when you need to perform a manual failover between the SQL Server failover clustered instances (FCIs) in an Availability Group (AG), for example, replacing or upgrading the shared storage. Keep in mind that only manual failover is supported when SQL Server FCIs are configured as AG replicas.

However, when you have a Distributed AG as part of your disaster recovery strategy, you need to update the configuration whenever you perform a manual failover on the primary or secondary AG. Refer to the diagram below configured in the previous tip.

windows server failover cluster

Below are the details of the implementation for the configuration:

  Production DR
WSFC OS: Windows Server 2019 OS: Windows Server 2019
  Nodes: TDPRD021, TDPRD022, TDPRD027, and TDPRD028 Nodes: TDPRD031 and TDPRD032
  Cluster Name Object (CNO): TDPRDCLS024 Cluster Name Object (CNO): TDPRDCLS034
  IP Subnet: 172.16.0.0/16 IP Subnet: 192.168.0.0/24
SQL Server Instance 1 Virtual Network Name (VNN): TDPRDSQLCLS026 TDPRD031
SQL Server Instance 2 Virtual Network Name (VNN) : TDPRDSQLCLS029\INST29 TDPRD032
SQL Server service account TESTDOMAIN\sqlservice TESTDOMAIN\sqlservice
Availability Group Name: TDPRDSQLCLSAG25 Name: TDPRDSQLAG35
  Listener: TDPRDSQLCLSAG25 Listener: TDPRDSQLAG35
  Distributed Availability Group Name: DistAG_DC1_DC2

If you need to replace the shared storage (or perform any maintenance) on TDPRDSQLCLS026, you need to move the primary AG (TDPRDSQLCLSAG25) to TDPRDSQLCLS029\INST29 before doing any maintenance. This will make sure that the databases are still highly available while performing maintenance on TDPRDSQLCLS026.

Here’s a high-level overview of the steps for your reference.

  1. Failover the primary AG (TDPRDSQLCLSAG25) to the secondary replica (TDPRDSQLCLS029\INST29)
  2. Update the Distributed AG (DistAG_DC1_DC2) on the primary AG (TDPRDSQLCLSAG25)
  3. Update the Distributed AG (DistAG_DC1_DC2) on the secondary AG (TDPRDSQLAG35)

Step #1: Failover the primary AG (TDPRDSQLCLSAG25) to the secondary replica (TDPRDSQLCLS029\INST29)

Refer to this tip on how to manually failover an AG. Make sure that the secondary replica is failover ready before proceeding to avoid data loss.

Step #2: Update the Distributed AG (DistAG_DC1_DC2) on the primary AG (TDPRDSQLCLSAG25)

Since the SQL Server VNN – not instance name nor AG listener name - is used for the LISTENER_URL parameter when creating the Distributed AG involving the primary AG, the secondary AG (TDPRDSQLAG35) will no longer be able to communicate with the primary AG (TDPRDSQLCLSAG25) after performing a manual failover to the secondary replica (TDPRDSQLCLS029\INST29). That’s because the SQL Server VNN only exists within the context of the SQL Server FCI, not the AG. The secondary AG (TDPRDSQLAG35) will immediately report an error stating that it could not connect to the primary AG (TDPRDSQLCLSAG25). SQL Server will log an error message on the primary replica (TDPRD031) of the secondary AG (TDPRDSQLAG35) similar to the one below.

A connection timeout has occurred while attempting to establish a connection to availability replica 'TDPRDSQLCLSAG25' with id [1A9B6186-F8CB-4260-B74C-B3EB2EAE1A3D]. Either a networking or firewall issue exists, or the endpoint address provided for the replica is not the database mirroring endpoint of the host server instance. 

The dashboard of the secondary AG (TDPRDSQLAG35) will immediately report the same. Don’t be alarmed.

availability group status

NOTE: A common mistake when you get an alert about the health of an AG is to assume that there is something wrong with it without considering whether or not it is a part of a Distributed AG. I’ve worked on cases where DBAs and sysadmins would troubleshoot an AG for hours – even days - without realizing there’s nothing wrong with it. Start with looking at the bigger picture first before digging deeper. In this case, the AG being a part of a Distributed AG is what caused the error, not the AG itself. This issue resolve itself once the Distributed AG has been updated with the new LISTENER_URL parameter value.

Use the T-SQL script below to update the Distributed AG (DistAG_DC1_DC2) on the primary AG (TDPRDSQLCLSAG25). Be sure you are connected to the primary replica of the primary AG.

NOTE: Use SQLCMD mode when running the T-SQL scripts to make sure you are connected to the correct SQL Server instance.

--Run this on the primary replica of the primary Availability Group
:CONNECT TDPRDSQLCLS029\INST29
USE [master]
GO
ALTER AVAILABILITY GROUP [DistAG_DC1_DC2]  
MODIFY AVAILABILITY GROUP ON  
 'TDPRDSQLCLSAG25' WITH    
    (   
        LISTENER_URL = 'TCP://TDPRDSQLCLS029.TESTDOMAIN.COM:5022'
    )

I highlighted the new LISTENER_URL parameter value as the SQL Server VNN of TDPRDSQLCLS029\INST29.

Step #3: Update the Distributed AG (DistAG_DC1_DC2) on the secondary AG (TDPRDSQLAG35)

Once the Distributed AG has been updated on the primary AG, you can now proceed to update the Distributed AG on the secondary AG. Use the T-SQL script below to update the Distributed AG (DistAG_DC1_DC2) on the secondary AG (TDPRDSQLAG35). Be sure you are connected to the primary replica of the secondary AG.

--Run this on the primary replica of the secondary Availability Group
:CONNECT TDPRD031
USE [master]
GO
ALTER AVAILABILITY GROUP [DistAG_DC1_DC2]  
MODIFY AVAILABILITY GROUP ON  
 'TDPRDSQLCLSAG25' WITH    
    (   
        LISTENER_URL = 'TCP://TDPRDSQLCLS029.TESTDOMAIN.COM:5022'
    )

Once the Distributed AG has been updated with the new LISTENER_URL parameter value of the primary AG, the secondary AG will automatically resume connectivity and data replication. SQL Server will log a message on the primary replica (TDPRD031) of the secondary AG (TDPRDSQLAG35) similar to the one below.

A connection for availability group 'DistAG_DC1_DC2' from availability replica 'TDPRDSQLAG35' with id  [8038CFF8-5031-31AE-4666-DE6172735957] to 'TDPRDSQLCLSAG25' with id [1A9B6186-F8CB-4260-B74C-B3EB2EAE1A3D] has been successfully established.  This is an informational message only. No user action is required. 

The dashboard on the secondary AG will also report a healthy status once the connectivity is resumed. No need to do anything after this.

availability group status

This process needs to be done every time you perform a manual failover between AG replicas involving SQL Server FCIs. Be sure to include this in your change management process to meet your DR recovery objectives.

Next Steps





get scripts

next tip button



About the author
MSSQLTips author Edwin Sarmiento Edwin M Sarmiento is a Microsoft SQL Server MVP and Microsoft Certified Master from Ottawa, Canada specializing in high availability, disaster recovery and system infrastructures.

View all my tips


Article Last Updated: 2021-10-19

Comments For This Article





download














get free sql tips
agree to terms