SQL Server Networking High Availability and Disaster Recovery Best Practices Webcast Q and A
We hosted a webcast on "SQL Server and Networking High Availability/Disaster Recovery Best Practices" on July 16th. We received several great questions we didn't have time to answer, so we wanted to take this opportunity to answer the questions via a tip. Below are all the questions that were submitted, and our answers. If you missed the webcast, you can view it in the archives.
Q: What happens to incomplete transactions when a cluster node fails, moving all of the clustered resources on to another node in a SQL Server failover clustered instance?
I always tell SQL Server DBAs that a failover cluster is no different from a standalone instance to draw a parallel with what they already know. A standalone instance has all of the SQL Server databases on disk, similar to how a failover clustered instance works, except that the disks are shared across all of the nodes in the cluster. A failover - the process of moving clustered resources to one of the nodes in a cluster whether intentional or accidental - is no different from restarting the SQL Server service. When the SQL Server service is restarted, recovery runs on each of the databases (this is just one of a series of steps that occur during this process, but I'm simply discussing this part in the context of a cluster failover.) This process looks at the transactions in the log file and rolls forward any committed transactions (the redo phase) and rolls back any uncommitted/incomplete transactions (the undo phase). Take a look at this article on Understanding Logging and Recovery in SQL Server to get a better understanding of this process.
Q: Would the domain administrator add the Create Computer Objects permission in Active Directory just for the computers needed or always on the folder that contains the computers?
Computer objects (or any objects, in general) in Active Directory are created within a container. By default, they will be created in the Computers container. However, you can create custom containers or Organizational Units (OUs) for administrative or structural purposes. Think of OUs as folders in your file system. If you want to create or save files in that folder, you would need to be granted the appropriate permissions on the folder-level. Because a cluster network name is a virtual computer object in Active Directory, the person creating a cluster needs to be granted Create Computer Objects permission in the OU where the cluster will be created. Refer to this TechNet article on the minimum permissions required to create the cluster network name in Active Directory.
In the past, the Create Computer Objects permission needs to be granted on the OU where the potential cluster nodes reside. New in Windows Server 2012 is the ability to create a Windows Server Failover Cluster in an Active Directory organizational unit (OU) that is different than the ones that the potential cluster nodes are in. This feature made it easy for cluster administrators to create new Windows Server Failover Clusters in a restrictive Active Directory environment. For more information on how this can be done during installation, refer to the tip on Installing, Configuring and Managing Windows Server Failover Clusters using PowerShell Part 1.
As a side note, Windows Server 2012 R2 introduced the option to deploy a Windows Server Failover Cluster (WSFC) without dependency in Active Directory. This reduces the common permission issues that most deployments encounter. It still creates the corresponding DNS entries for the virtual network names. However, if you are using Kerberos authentication with your SQL Server instance, this is not a recommended implementation. I do hope to see this feature improved in future versions of Windows Server.
Q: If there are only 2 nodes in the cluster using a file share witness for quorum, and the file share witness goes away, will the Availability Group hang?
Whether the Availability Group stays online or not is dictated by the cluster quorum configuration. When an Availability Group is created, a corresponding cluster resource group is created in the WSFC. To better understand the behavior, we need to understand what the quorum is for. I kind of think of a cluster quorum as "majority votes win." When there is a majority of votes, a decision can be made to "do something." In a WSFC, a quorum determines whether or not the cluster stays online. If there is no quorum (or majority of votes,) the cluster will not stay online. A more detailed discussion of a cluster quorum is available in this TechNet article.
Let's take the scenario of a 2-node WSFC with a file share witness as a quorum configuration hosting an Availability Group. If the file share witness goes away, we've only lost 1 out of 3 votes in the quorum and we still have majority. Hence, the WSFC stays online. And if the WSFC stays online, the Availability Group stays online. The reason I mention this is because the Availability Group is dependent on the WSFC for this functionality. In essence, we really should be concerned with the WSFC. I have seen a lot of customer environments where the cluster has lost the file share witness and nobody even knows about it. High availability is at risk when this happens because if one of the nodes suddenly becomes unavailable, there is no more quorum to keep the WSFC online since we now lost 2 out of 3 votes. And if there is no more quorum, the cluster will be taken offline. This is a really interesting scenario to be in because even if the SQL Server instance that runs the Availability Group is online, the Availability Group will be brought offline because the WSFC is offline. Recovery Time Objective is compromised because in order to bring the databases online, you will need to force the WSFC to start without a quorum and bring the Availability Group online. Make sure to monitor your WSFC because an online Availability Group gives a false sense of security should something happen to your environment.
Now, in the webcast, I only talked about the cluster quorum in the context of Windows Server 2012 and earlier because I assumed that most customers are still on these platforms. In Windows Server 2012 R2, the quorum configuration has been improved. The default behavior of WSFC is that of a dynamic witness or dynamic quorum. This means that the number of active votes are automatically adjusted based on the number of available voting members. In the case of the 2-node cluster with a file share witness, the quorum starts off with 3 votes for a quorum. When the file share witness goes offline, the quorum votes are adjusted from 3 to 2. If one of the cluster nodes go down, the cluster still remains online. This concept of the "last man standing" has been introduced in Windows Server 2012, but you still have to manually configure quorum votes in your cluster. This makes it worthwhile to upgrade to Windows Server 2012 R2 just for the high availability options.
Q: Can you suggest any article on setting up NIC cards for simple two node cluster with two NICs on each node. Do we still need heartbeat?
I've been having this conversation with customers ever since Windows Server 2008 came out. In the past, you need to have 2 network interface cards (NIC) to provision a cluster. Windows Server 2008 and higher gave you the option to have just a single NIC to provision a cluster. In fact, the Failover Cluster Validation Wizard will only flag this as a warning and not a blocking error. This means it is technically supported to just have a single NIC for the nodes of your cluster. Now, that doesn't mean it's OK. Understand that the reason we build clusters is for high availability. If the network adapter fails, your cluster is gone. Here is an article about having multiple network adapters configured in a teamed pair. And that's really a great idea. But here's my take on it: even if we have multiple NICs per node that are teamed up, how sure are we that the network switches are redundant and highly available? I've seen DR exercises where only the servers are tested, but not the underlying network architecture. Only when the network switches themselves fail, do they realize that they are not at all highly available. I still recommend having a dedicated network for the heartbeat communication (think high availability and Quality of Service or QoS) and if the customer can guarantee that the network layer - switches, routers, etc. - is highly available, then I'll be happy with a NIC teaming implementation (thus, my reason for having a dedicated network for the heartbeat.) Taking it a step further, you can use the heartbeat network in an Availability Group for the log records replication.
Bottom line, you need to go back to the main reason why you need a WSFC, a failover clustered instance or an Availability Group. Use this as your rationale to decide whether or not to have single or multiple NICs for your cluster nodes.
Q: What are your thoughts on SAN disk storage drives not matching between 2 nodes (Active/Passive)?
For me, anything that will impact availability needs to be evaluated carefully. As stated in the answer to the previous question, the goal of having a WSFC, a failover clustered instance or an Availability Group is for high availability. If SAN disk storage drives do not match between nodes in a WSFC, how will that impact my availability? And I am not just talking about the drives themselves. I'm referring to the whole I/O stack - from the disks, to the connections, to the firmware, etc. Any incompatibilities and mismatch between the nodes of your cluster will impact the failover time. What's worse is that since your databases are stored on the shared storage, you risk getting them corrupted due to potential I/O issues. Now, if you can guarantee that the configuration is supported by your vendor, tested it extensively, documented as part of your disaster recovery plan, and meets your Recovery Time Objective (this is your ultimate goal) then, you can go ahead and implement it.
Check out the following items
- SQL Server and Networking High Availability/Disaster Recovery Best Practices Webcast in the archives.
- High Availability, Disaster Recovery, Clustering and Availability Groups tips.
- Edwin Sarmiento's tips since 2008.
Last Updated: 2014-07-28
About the author
View all my tips