Registry Check Pointing in a Windows Cluster to Bring SQL Server Online
The SQL Server service for our cluster could not be brought online during a recent failover due to registry not found error(s). How can we correct this issue?
Sometimes the SQL Server service cannot be brought online when there is a failover to another node in the cluster. When you check the error logs, you may have noticed that the SQL Server service could not be brought online due to certain registry entries. Here is an example of such error messages:
[sqsrvres] GetRegKeyAccessMask: Could not get registry access mask for
registry key Software\Microsoft\Microsoft SQL
Server\MSSQL10.<InstanceName>\Replication (status 2)).
[sqsrvres] Worker Thread (358930): Failed to retrieve the Replication registry key (last error = 0).
[sqsrvres] StartResourceService: Failed to start MSSQL$I03 service. CurrentState: 1
[sqsrvres] OnlineThread: ResUtilsStartResourceService failed (status 435)
[sqsrvres] OnlineThread: Error 435 bringing resource online.
This issue can be resolved and the technique to resolve this issue is known as registry check-pointing.
SQL Server is a highly registry aware application. If any registry information (keys/values) changed for some unknown reason, these changes can have serious impacts to SQL Server running on that cluster node.
SQL Server maintains registry information of all its cluster resources and a sample is shown in Figure 1.
To review the registry, click on Start | Run and type “regedit” and then navigate to the path shown in Figure 1.
Figure 1 - Cluster Resources in the Registry
Underneath the “Resources” folder in “regedit”, there are many folders with some sort of GUID as the name. These are the resources in the cluster and each resource has its own GUID for self identification purpose. If you expand those folders you can see other parameters related to that resource.
How Do You Know the Issue is With Cluster Check-Points?
This is a very valid question. Before you start anything with cluster check-pointing, you need to make sure no registry keys are missing. You can run the command below on any node of the cluster at the command prompt with administrator privileges.
Cluster res /checkpoints > c:\checkpoints_output.txt
Figure 2 – List all the check-points which are enabled in the cluster
Alternatively you can send the output to a text file for later analysis.
Sample output of the above command.
Figure 3 – List of enabled check-points in a cluster
This output file lists all the check-points which are enabled on every instance of the cluster. For example, if you have three SQL Server instances installed on the cluster, the above output would list all of the check-points for all three instances. If you identify check-points for an instance and it is not included or has missing check-points, this is the issue you need to fix.
How to Enable Missing Check-Points for Clustered SQL Server Instances
To fix the missing check-points, you need to enable the missing check-points for a particular SQL Server instance. It is a simple process, but you do need downtime because you need to failover the cluster to test it.
Assume, in the cluster you have three SQL Server instances running;
- SQLInstance00 (Default)
As per Figure 3, it is clear that there are no check-points enabled for SQLInstance01 because it is not listed in the output. If you want to fix that, follow the below steps:
1. Open command prompt with Administrator permissions on the active node of the SQL Server instance currently having issues (i.e. SQLInstance01)
2. Take the SQL Server service offline
3. Type the following commands in command prompt and press enter. You can execute all these at once or one by one.
cluster res "SQL Network Name (SQLInstance01)" /addcheckpoints:"SOFTWARE\Microsoft\Microsoft SQL Server\MSSQL10.SQLInstance01\Cluster" cluster res "SQL Network Name (SQLInstance01)" /addcheckpoints:"SOFTWARE\Microsoft\Microsoft SQL Server\MSSQL10.SQLInstance01\MSSQLServer" cluster res "SQL Network Name (SQLInstance01)" /addcheckpoints:"SOFTWARE\Microsoft\Microsoft SQL Server\MSSQL10.SQLInstance01\Replication" cluster res "SQL Network Name (SQLInstance01)" /addcheckpoints:"SOFTWARE\Microsoft\Microsoft SQL Server\MSSQL10.SQLInstance01\Providers" cluster res "SQL Network Name (SQLInstance01)" /addcheckpoints:"SOFTWARE\Microsoft\Microsoft SQL Server\MSSQL10.SQLInstance01\SQLServerSCP" cluster res "SQL Network Name (SQLInstance01)" /addcheckpoints:"SOFTWARE\Microsoft\Microsoft SQL Server\MSSQL10.SQLInstance01\CPE" cluster res "SQL Network Name (SQLInstance01)" /addcheckpoints:"SOFTWARE\Microsoft\Microsoft SQL Server\MSSQL10.SQLInstance01\SQLServerAgent"
Note: Each command line has the following identical syntax. This syntax is different based on the SQL Server version. Below is an example for SQL Server 2008.
cluster . resource "SQL Network Name(<InstanceName>)" /addcheckpoints:"HKEY_LOCAL_MACHINE\Software\Microsoft\Microsoft SQL Server\MSSQL.x\MSSQLSERVER"
Words in bold are the places you need to replace with the appropriate values for your cluster. (SQL Network Name(<InstanceName>), MSSQL.x)
The first value in the command above is the SQL Server resource name which you need to find. To find the resource name of any SQL Server instance in a cluster, follow the below steps:
3.1 Open Failover Cluster Manager (FCM)
3.2 Expand the cluster and select the particular cluster in left pane of FCM. Then it will show the resources of the selected SQL Server instance in the right pane of FCM.
3.2 Right click on the SQL Server instance name and click Properties. Refer to Figure 4.
Figure 4 - Get Properties for SQL Server Name
3.3 In the Properties window of the SQL Server instance, copy the Resource Name. Refer to Figure 5.
Figure 5 - Properties of SQL Server Instance
Note: This is the resource name you need to paste in as the first value of the command in step 3 (i.e. Resource Name).
4. If you do not get any errors and enabling the registry check-pointing is successful, bring the SQL Server online. If you done this while SQL Server is online, then ignore step #4.
5. To confirm, run "Cluster res /checkpoints" to see that check-points are enabled for all three instances.
6. Failover the instance to a different node, and this time you should be able to failover without any issues.
I had an opportunity to troubleshoot such an issue with the help of Microsoft Engineers and they identified the issue is due to registry check-pointing. I am sharing the knowledge I gathered with the community. It is always better to troubleshoot and do some homework yourself before attempting to do anything on production system. It is critical to backup the registry of the clustered node that you are enabling the check-points before making any changes. If something goes wrong you have a backup to restore the registry to its original state. Please read the references I mention in this tip to get a full understanding of the problem. If you have any doubt about this process and testing did not yield the required output, please be cautious about proceeding in production.
- Check out these resources:
Last Updated: 2015-01-19
About the author
View all my tips