Channel: Ask the Core Team

Keeping backups of Cluster Logs


In a previous blog, Understanding the Cluster Debug Log in 2008, we explained how Cluster logging in Windows Server 2008 Failover Clustering and beyond changed from earlier versions, and showed how the size of the log can be adjusted to keep the recommended 72 hours’ worth of data. Just to recap:

It is generally recommended that your CLUSTER.LOG have at least 72 hours’ worth of continuous data retention. This is so that if you have a failure occur after you went home on Friday, you still have the data you need to troubleshoot the issue on Monday morning.

What if you wanted to get information from further back (a week, a month, etc.)? One way to do this is to increase the size of the log with the /SIZE: switch. However, increasing the size to hold, say, a month of data could consume gigabytes of space and produce massive text files that are hard to go through. Have you ever tried to open a 1-gigabyte text file with Notepad?

Here is a way to keep the file at a smaller size while keeping backups that can be referred back to at any time. First, determine what size is needed to hold 24 hours’ worth of data, so that a Cluster Log can be generated for every day. The next thing to consider is where you want to store the files: locally or on a network share. What if you wanted to do this for multiple Clusters? Let’s say you determined that the log size needs to be 200 MB and you are going to put the files on a server (JOHNMARLIN).

The previously mentioned blog has you run the command Cluster Log /Size:200 to set the proper size based on the data needed. I do this for all my Clusters. I then go out to my JOHNMARLIN server and create a share for each Cluster (TXCLUSTER, NCCLUSTER, etc.). Now I just have to go to one node in each of the Clusters to set things up.

On the node where you are doing the task, go into Control Panel - Region and Language and change the Short date format to yyyy-MM-dd.

image

On this node, you could create a CLUSTERLOG folder off the root of Drive C:. In this C:\CLUSTERLOG directory, create a batch file called Get-Logs.bat that has the following commands:

REM Drop any existing J: mapping, then map J: to this Cluster's share
net use j: /d
net use j: \\johnmarlin\txcluster
REM Create a folder named for today's date (this is why the short date
REM format was changed to yyyy-MM-dd above)
md j:\%date%
REM Generate the cluster log on every node and copy them all locally
cluster log /gen /copy:"c:\clusterlog"
REM Copy the logs into the dated folder on the share, then disconnect
copy c:\clusterlog\*.log j:\%date%\*.log
net use j: /d

I used Drive Letter J:, but you can use any available letter. So what the batch file will do when run today (June 18, 2012) is: 

1. It will create a folder on the share named by the date

a. 2012-06-18

2. It will generate the Cluster Log on every node

3. It will copy the cluster logs from all nodes to the local c:\clusterlog folder and tag the Node Name as part of the filename

a. TXCLUSTER-node1_cluster.log
b. TXCLUSTER-node2_cluster.log
c. TXCLUSTER-node3_cluster.log
d. TXCLUSTER-node4_cluster.log

4. It will copy the cluster logs from this c:\clusterlog folder to the dated folder on the share, keeping the same names

a. \2012-06-18\TXCLUSTER-node1_cluster.log
b. \2012-06-18\TXCLUSTER-node2_cluster.log
c. \2012-06-18\TXCLUSTER-node3_cluster.log
d. \2012-06-18\TXCLUSTER-node4_cluster.log

When it runs the next day:

1. It will create a folder on the share named by the date

a. 2012-06-19

2. It will generate the Cluster Log on every node

3. It will copy the cluster logs from all nodes to the local c:\clusterlog folder and tag the Node Name as part of the filename

a. TXCLUSTER-node1_cluster.log
b. TXCLUSTER-node2_cluster.log
c. TXCLUSTER-node3_cluster.log
d. TXCLUSTER-node4_cluster.log

4. It will copy the cluster logs from this c:\clusterlog folder to the dated folder on the share, keeping the same names

a. \2012-06-19\TXCLUSTER-node1_cluster.log
b. \2012-06-19\TXCLUSTER-node2_cluster.log
c. \2012-06-19\TXCLUSTER-node3_cluster.log
d. \2012-06-19\TXCLUSTER-node4_cluster.log

Each day it runs, it creates the next dated folder and files. This way, you have an easily sorted folder structure where you can go to any day you want and get the file you need from whichever node you need.

The next thing to do is set up a Scheduled Task to run each day so it creates the files for you. This way, you do not have to remember to do it. From the Administrative Tools, open up Task Scheduler and select Create Task. You can then use the below information to create the task.

A. General Tab

i. For the Name, call it something like Cluster Daily Log Backups
ii. Make sure to use an account that has admin rights on this node, on the Cluster, and on the network share
iii. select Run whether user is logged in or not

image

B. Triggers Tab

i. Set whatever time you want it to run. One thing to keep in mind is that the Cluster Log timestamps are in GMT, so account for that when deciding when to have the logs created
ii. Select it to run daily and recur for 352 days
iii. Make sure it is Enabled

image

C. Actions Tab

i. Program/Script will be CMD.EXE
ii. Add Arguments will be /C C:\CLUSTERLOG\Get-Logs.bat

image

D. Conditions Tab

i. You don't really need to change anything here unless you want to

E. Settings Tab

i. Check Allow task to be run on demand
ii. Check Run task as soon as possible after scheduled start is missed

image

So now you have your task that will do this for you. You can now just sit back and relax knowing that you will have a Cluster Log generated for every node every day.
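If you prefer the command line over the GUI, the same task can be created with schtasks. This is just a sketch; the start time and the CONTOSO\ClusterAdmin account are example values you would replace with your own (the /RP * switch prompts for the account's password):

REM Create the daily task that runs the log collection batch file
schtasks /Create /TN "Cluster Daily Log Backups" /TR "cmd.exe /C C:\CLUSTERLOG\Get-Logs.bat" /SC DAILY /ST 01:00 /RU CONTOSO\ClusterAdmin /RP *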

There are a couple of caveats to take into consideration. If the account you are using has its password changed in the domain, you will have to change it on the task as well. The task will stop running after 352 days, so if you want more, you would have to create it again. But you will have nearly a year's worth of Cluster Logs when it is done.

There are other ways of doing this. You could use scripting and the PowerShell command:

Get-ClusterLog -Destination
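As a minimal sketch of that approach (run on one node of the Cluster; the share path is the example one used above, and Get-ClusterLog writes one <node>_cluster.log per node into the destination folder):

# Create today's dated folder on the share and generate every node's log into it
Import-Module FailoverClusters
$dest = "\\johnmarlin\txcluster\$(Get-Date -Format yyyy-MM-dd)"
New-Item -Path $dest -ItemType Directory -Force | Out-Null
Get-ClusterLog -Destination $dest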

You could also use other methods than the batch file. This is just one of the ways of doing it.

 

Happy Clustering !!!

John Marlin
Senior Support Escalation Engineer
Microsoft Enterprise Platforms Support


Having a problem with nodes being removed from active Failover Cluster membership?


Welcome to the AskCore blog. Today, we are going to talk about nodes being removed from active Failover Cluster membership randomly. If you are having problems with a node being removed from membership, you are seeing events like this logged in your System Event Log:

image

This event will be logged on all nodes in the Cluster except for the node that was removed. The reason it is logged is that one of the nodes in the Cluster marked that node as down and then notified all of the other nodes of the event. When the nodes are notified, they discontinue and tear down their heartbeat connections to the downed node.

What caused the node to be marked down?

All nodes in a Windows 2008 or 2008 R2 Failover Cluster talk to each other over the networks that are set to Allow cluster network communication on this network. The nodes will send out heartbeat packets across these networks to all of the other nodes. These packets are supposed to be received by the other nodes and then a response is sent back. Each node in the Cluster has its own heartbeats that it is going to monitor to ensure the network is up and the other nodes are up. The example below should help clarify this:

image

If any one of these packets is not returned, then that specific heartbeat is considered failed. For example, W2K8-R2-NODE2 sends a request and receives a response from W2K8-R2-NODE1, so it determines that the network and the node are up. If W2K8-R2-NODE1 sends a request to W2K8-R2-NODE2 and does not get the response, it is considered a lost heartbeat, and W2K8-R2-NODE1 keeps track of it. This missed response can cause W2K8-R2-NODE1 to show the network as down until another heartbeat request is received.

By default, Cluster nodes have a limit of 5 failures in 5 seconds before the connection is marked down. So if W2K8-R2-NODE1 does not receive the response 5 times in the time period, it considers that particular route to W2K8-R2-NODE2 to be down.  If other routes are still considered to be up, W2K8-R2-NODE2 will remain as an active member.

If all routes are marked down for W2K8-R2-NODE2, it is removed from active Failover Cluster membership and the Event 1135 that you see in the first section is logged. On W2K8-R2-NODE2, the Cluster Service is terminated and then restarted so it can try to rejoin the Cluster.

For more information on how we handle specific routes going down with 3 or more nodes, please reference the “Partitioned” Cluster Networks blog written by Jeff Hughes.

Now that we know how the heartbeat process works, what are some of the known causes for the process to fail?

1. Actual network hardware failures. If the packet is lost on the wire somewhere between the nodes, then the heartbeats will fail. A network trace from both nodes involved will reveal this.

2. The profile for your network connections could possibly be bouncing from Domain to Public and back to Domain again. During the transition of these changes, network I/O can be blocked. You can check to see if this is the case by looking at the Network Profile Operational log. You can find this log by opening the Event Viewer and navigating to: Applications and Services Logs\Microsoft\Windows\NetworkProfile\Operational. Look at the events in this log on the node that was mentioned in the Event ID: 1135 and see if the profile was changing at this time. If so, please check out the KB article “The network location profile changes from "Domain" to "Public" in Windows 7 or in Windows Server 2008 R2”.

3. You have IPv6 enabled on the servers, but have the following two rules disabled for Inbound and Outbound in the Windows Firewall:

  • Core Networking - Neighbor Discovery Advertisement
  • Core Networking - Neighbor Discovery Solicitation

4. Anti-virus software could be interfering with this process also. If you suspect this, test by disabling or uninstalling the software. Do this at your own risk because you will be unprotected from viruses at this point.

5. Latency on your network could also cause this to happen. The packets may not be lost between the nodes, but they may not get to the nodes fast enough before the timeout period expires.

6. IPv6 is the default protocol that Failover Clustering will use for its heartbeats. The heartbeat itself is a UDP unicast network packet that communicates over Port 3343. If there are switches, firewalls, or routers not configured properly to allow this traffic through, you can see issues like this.

7. IPsec security policy refreshes can also cause this problem. The specific issue is that during an IPSec group policy update all IPsec Security Associations (SAs) are torn down by Windows Firewall with Advanced Security (WFAS). While this is happening, all network connectivity is blocked. When re-negotiating the Security Associations if there are delays in performing authentication with Active Directory, these delays (where all network communication is blocked) will also block cluster heartbeats from getting through and cause cluster health monitoring to detect nodes as down if they do not respond within the 5 second threshold.

8. Old or out-of-date network card drivers and/or firmware. At times, a simple misconfiguration of the network card or switch can also cause loss of heartbeats.

9. If you are running on VMWare, you may be experiencing packet loss. The following blog talks about this in a little more detail, including how to tell if this is the issue, and points you to the VMWare article on the settings to change.

Nodes being removed from Failover Cluster membership on VMWare ESX?
http://blogs.technet.com/b/askcore/archive/2013/06/03/nodes-being-removed-from-failover-cluster-membership-on-vmware-esx.aspx

These are the most common reasons that these events are logged, but there could be other reasons also. The point of this blog was to give you some insight into the process and also give ideas of what to look for. Some will raise the following values to their maximum values to try and get this problem to stop.

 

Parameter              Default              Range
--------------------   ------------------   -----------------------
SameSubnetDelay        1000 milliseconds    250 - 2000 milliseconds
CrossSubnetDelay       1000 milliseconds    250 - 4000 milliseconds
SameSubnetThreshold    5                    3 - 10
CrossSubnetThreshold   5                    3 - 10

Increasing these values to their maximums may make the events and node removals go away, but it just masks the problem; it does not fix anything. The best thing to do is find the root cause of the heartbeat failures and get it fixed. The only real need for increasing these values is in a multi-site scenario where nodes reside in different locations and network latency cannot be overcome.
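If you do decide to adjust them for a multi-site cluster, these settings are common cluster properties that can be viewed and changed from PowerShell. A small sketch (2008 R2 and later; the values shown are simply the maximums from the table above):

# View the current heartbeat delay and threshold settings
Import-Module FailoverClusters
Get-Cluster | Format-List *SubnetDelay, *SubnetThreshold

# Raise the cross-subnet values for a stretched cluster
$cluster = Get-Cluster
$cluster.CrossSubnetDelay = 4000
$cluster.CrossSubnetThreshold = 10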

I hope that this post helps you!

Thanks,
James Burrage
Senior Support Escalation Engineer
Windows High Availability Group

Windows Server 2012 Failover Cluster Sessions at TechEd


Start getting familiar with Windows Server 2012 Failover Clustering by viewing the sessions that were delivered at TechEd 2012. 

These sessions are posted online now for your viewing pleasure.  For those not familiar with what TechEd is, I offer this.

TechEd is Microsoft's premier technology conference for IT professionals and developers, offering the most comprehensive technical education across Microsoft's current and soon-to-be-released suite of products, solutions, tools, and services. TechEd offers hands-on learning, deep product exploration and countless opportunities to build relationships with a community of industry and Microsoft experts that will help your work for years to come.

For over 20 years industry professionals have found TechEd to be the best opportunity to stay aligned with Microsoft’s current technologies and new product opportunities. You and your colleagues come to TechEd to discuss critical technology issues, gain practical advice, and network with Microsoft and industry experts. Whether you are an IT Professional or a Developer, TechEd has much to offer you.

This year’s North America event included over 11,000 customers, partners, speakers, and staff. 

Each session lasts about an hour and 15 minutes and includes the PowerPoint deck and the recorded presentation, which you can view online or download to view at a later time.  The formats in which you can view the sessions are:

  • MP3 = audio only
  • Mid Quality WMV = lo-band, mobile
  • High Quality MP4 = iPad, PC
  • Mid Quality MP4 = WP7, HTML5
  • MP4 = iPod, Zune HD
  • High Quality WMV = PC, Xbox, MCE

The sessions from the Clustering team at TechEd North America were:

WSV324 - Building a Highly Available Failover Cluster Solution with Windows Server 2012 from the Ground UP

Windows Server 2012 delivers innovative new capabilities that enable you to build dynamic availability solutions in which workloads, networks, and storage become more flexible, efficient, and available than ever before. This session covers creating a Windows Server 2012 highly available Failover Cluster leveraging the new technologies in Windows Server 2012. This session walks through a demo leveraging a highly available Space, encrypting data with shared BitLocker disks, asymmetrical storage configurations with CSV I/O redirection… from the bottom up to a highly available solution.

WSV430 - Cluster Shared Volumes Reborn in Windows Server 2012: Deep Dive

This session takes a deep technical dive into the new Cluster Shared Volumes (CSV) architecture and new features coming in Windows Server 2012. CSV is now a full-blown clustered file system, and all of the challenges of the past have been addressed, along with many enhancements. This is an in-depth session that covers the CSV architecture, CSV backup integration, and integration with a wealth of new features that enhance CSV and its performance.

WSV411 - Guest Clustering and VM Monitoring in Windows Server 2012

In Windows Server 2012 there will be new ways to monitor application health state and have recovery inside of a virtual machine. This session details the new VM Monitoring feature in Windows Server 2012 as well as discusses Guest Clustering and changes in Windows Server 2012 (such as virtual FC), along with pros and cons of when to use each.

WSV322 - Update Management in Windows Server 2012: Revealing Cluster-Aware Updating and the New Generation of WSUS

Today, patch management is a required component of any security strategy. In Windows Server 2012, the new Cluster-Aware Updating (CAU) feature delivers Continuous Availability through automated self-updating of failover clusters. In Windows Server 2012, Windows Server Update Services (WSUS) has evolved to become a Server Role with exciting new capabilities. This session introduces CAU with a discussion of its GUI, cmdlets, remote-updating and self-updating capabilities. And then we proceed to highlight the main functionalities of WSUS in Windows Server 2012 including the security enhancements, patch deployment automation, and new Windows PowerShell cmdlets to perform maintenance, manage and deploy updates

VIR401 - Hyper-V High-Availability and Mobility: Designing the Infrastructure for Your Private Cloud

Private Cloud Technical Evangelist Symon Perriman leads this session discussing Windows Server 2012 and Windows Server 2008 R2 Hyper-V and Failover Clustering design, infrastructure planning and deployment considerations for your highly-available datacenter or Private Cloud. Do you know the pros and cons of how different virtualization solutions can provide continual availability? Do you know how Microsoft System Center 2012 can move the solution closer to a Private Cloud implementation? This session covers licensing, hardware, validation, deployment, upgrades, host clustering, guest clustering, disaster recovery, multi-site clustering, System Center Virtual Machine Manager 2008 and 2012, and offers a wealth of best practices. Prior clustering and Hyper-V knowledge recommended.

The sessions from the Clustering team at TechEd Europe were:

WSV324 - Building a Highly Available Failover Cluster Solution with Windows Server 2012 from the Ground UP

Windows Server 2012 delivers innovative new capabilities that enable you to build dynamic availability solutions in which workloads, networks, and storage become more flexible, efficient, and available than ever before. This session will cover creating a Windows Server 2012 highly available Failover Cluster leveraging the new technologies in Windows Server 2012. This session will walk through a demo leveraging a highly available Space, encrypting data with shared BitLocker disks, asymmetrical storage configurations with CSV I/O redirection… from the bottom up to a highly available solution.

WSV430 - Cluster Shared Volumes Reborn in Windows Server 2012: Deep Dive

This session will take a deep technical dive into the new Cluster Shared Volumes architecture and new features coming in Windows Server 2012. CSV is now a full-blown clustered file system, and all of the challenges of the past have been addressed, along with many enhancements. This will be an in-depth session covering the CSV architecture, CSV backup integration, and integration with a wealth of new features that enhance CSV and its performance.

All other TechEd content regarding Windows Server 2012 and Windows 8 can be viewed at the same locations:

North America

Europe

I hope that you can take some time to get to know the new products and features that are coming and get as excited about it as Microsoft is.

Happy Clustering !!

John Marlin
Senior Support Escalation Engineer
Microsoft Enterprise Platforms Support

CNO Blog Series: Increasing Awareness around the Cluster Name Object (CNO)


I am starting a 'CNO Blog Series', which will consist of blogs written by the CORE team cluster engineers and will focus primarily on the Cluster Name Object (CNO). The CNO is the computer object in Active Directory associated with the Cluster Name; it is used as a common identity in the cluster. If you have been working with Failover Clusters since Windows Server 2008, you should be very familiar with the CNO and the role it plays in the cluster security model. Looking over the CORE Team blog site, some blogs focusing primarily on the CNO have already been written.

With the release of Windows Server 2012, there have been several enhancements added to the Failover Clustering feature that provide for better integration with Active Directory. The Product Team blog (http://blogs.msdn.com/b/clustering/), has a post that discusses creating Windows Server 2012 Failover Clusters in more restrictive Active Directory environments. That blog discusses some of the changes that have been made in the product that directly involve the CNO.

On to today's blog - increasing awareness around the Cluster Name Object (CNO)….

Beginning with Windows Server 2008, when a cluster is created, the computer object associated with the CNO, unless pre-staged in some other container, is placed by default in the Computers container. Windows Server 2012 Failover Clusters give cluster administrators more control over the computer object representing the CNO. The Product Group's blog mentioned earlier details new functionality in Windows Server 2012, which includes:

  • Using Distinguished Names when creating the cluster to manually control CNO placement
  • New default behavior where a CNO is placed in the same container as the computer objects for the nodes in the cluster
  • The Virtual Computer Objects (VCOs) created by a CNO are placed in the same container as the CNO

Having more control over cluster computer object placement, while desirable, requires a bit more 'awareness' on the part of a cluster administrator. This 'awareness' involves knowing that a CNO placed in a non-default location may not have the rights it needs for other cluster operations, such as creating other cluster computer objects (VCOs). The first indication of a problem may be when a Role is made highly available in the cluster and that Role requires a Client Access Point (CAP). After the Role creation process completes, and the Network Name associated with the CAP attempts to come Online, it fails with an Event ID 1194.

Log Name: System
Source: Microsoft-Windows-Failover-Clustering
Event ID: 1194
Level: Error

This event reports a computer object associated with a cluster Network Name resource could not be created. The error message itself provides good troubleshooting guidance to help resolve the issue -

clip_image002

In this case, it is simply a matter of modifying the security on the AD container so the CNO is allowed to Create Computer Objects. Once this setting is in place, the Network Name comes online without issue. Additionally, the CNO is given another critical right: the right to change the password for any VCO it creates.
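For reference, granting that right from an elevated command prompt looks something like the sketch below; the OU, domain, and CNO account name (note the trailing $) are examples:

REM Allow the CNO to create child computer objects in the Clusters OU
dsacls "OU=Clusters,DC=contoso,DC=com" /G "CONTOSO\HPVCLU03$":CC;computer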

If Active Directory is properly configured (more on that in a bit), the VCO, along with the CNO, can be also protected from accidental deletion.

image

Protecting Cluster Computer Objects

A call often handled by our support engineers involves the accidental, or semi-intentional, deletion of the computer objects associated with Failover Clusters. There are a variety of reasons this happens, but we will not go into those here. Suffice it to say, things function more smoothly if the computer objects associated with a cluster are protected.

I mentioned new functionality in Windows Server 2012 Failover Clusters where cluster objects are strategically placed in targeted Active Directory containers (OUs) automatically. Using this methodology also makes it easier to discern which objects are associated with a Failover Cluster. As you can see in this screenshot of a custom OU (Clusters) that I created in my domain, the objects associated with the cluster carry the description of Failover cluster virtual network name account. The cluster nodes, which are located in the same OU, are traditional computer objects, which do not carry this description.

clip_image006

Examining the properties of one of these accounts using the Attribute Editor, one can see it is clearly an attribute (Description field) of the computer object.

clip_image008

Properly protecting cluster computer objects (from accidental deletion) requires Domain Administrator intervention. This can be either a 'proactive' or a 'reactive' intervention. A proactive intervention requires a Domain Administrator set a Deny ACE (Access Control Entry) for Delete all child objects for the Everyone group on the container where the cluster computer objects will be located.

clip_image010

A reactive intervention occurs after a CNO is placed in the designated container. At this point, the Domain Administrator has a choice. He can either:

1. Set the Deny ACE for Delete all child objects on the container, or

2. Check the Protect object from accidental deletion checkbox on the CNO computer object (which would then set the correct Deny ACE on the container)

image

Let us step through a scenario from a recent case I worked for one of our customers deploying a new Windows Server 2012 Failover Cluster.

Customer Case Study

In this case, a customer was deploying a 2-Node Windows Server 2012 Hyper-V Failover Cluster dedicated to supporting virtualized workloads. The cluster creation process was completed without issue and the Cluster Core Resources group could move freely between the nodes without any resource failures. The customer had already created four highly available virtual machines, some of which were already in production. The customer wanted to test live migration for the virtual machines. When he attempted to execute a live migration for a virtual machine, it failed immediately on the source cluster node. He attempted a quick migration and that succeeded.

Reviewing the cluster logs obtained from the customer, the live migration error appeared in the cluster log of the source cluster node. The live migration failure was registered with an error code of 1326.

00001274.00001c24::2012/09/18-17:50:16.301 ERR [RES] Virtual Machine <Virtual Machine MRS1SAPPBW31>: Live migration of 'Virtual Machine MRS1SAPPBW31' failed.

00001274.00001c24::2012/09/18-17:50:16.301 ERR [RHS] Resource Virtual Machine MRS1SAPPBW31 has cancelled offline with error code 1326.

00000aa8.00001cf4::2012/09/18-17:50:16.301 INFO [RCM] HandleMonitorReply: OFFLINERESOURCE for 'Virtual Machine MRS1SAPPBW31', gen(0) result 0/1326.

The error code resolved to - 'The user name or password is incorrect'.

Examining the rest of the cluster log indicated the CNO could not log on to the domain controller to obtain necessary tokens. This failure was also causing a failure registering with DNS (customer is using Microsoft dynamic DNS).

00001228.00001a20::2012/09/18-17:43:00.466 WARN [RES] Network Name: [NNLIB] LogonUserEx fails for user HPVCLU03$: 1326 (useSecondaryPassword: 0)

00001228.00001a20::2012/09/18-17:43:00.550 WARN [RES] Network Name: [NNLIB] LogonUserEx fails for user HPVCLU03$: 1326 (useSecondaryPassword: 1)

00001228.00001a20::2012/09/18-17:43:00.550 INFO [RES] Network Name: [NNLIB] Logon failed for user HPVCLU03$ (Error 1326), DC \\<FQDN_of_DC_here>

00001228.00001a20::2012/09/18-17:43:00.550 INFO [RES] Network Name <Cluster Name>: Identity: Obtaining Windows Token for Name: HPVCLU03, SamName: HPVCLU03$, Type: Singleton, Result: 1326, LastDC: \\<FQDN_of _DC_here>

00001228.00001a20::2012/09/18-17:43:00.550 INFO [RES] Network Name <Cluster Name>: Identity: Slow Operation, FinishWithReply: 1326

00001228.00001a20::2012/09/18-17:43:00.550 INFO [RES] Network Name <Cluster Name>: Identity: InternalReplyHandler with event: 1326

00001228.00001a20::2012/09/18-17:43:00.550 INFO [RES] Network Name <Cluster Name>: Identity: End of Slow Operation, state: Error/Idle, prevWorkState: Idle

00001228.00001a8c::2012/09/18-17:43:00.550 WARN [RES] Network Name <Cluster Name>: Identity: Get Token Request, currently doesn't have a token!

00001228.00001a8c::2012/09/18-17:43:00.550 INFO [RES] Network Name: [NN] got sync reply: 0

00001228.00001e0c::2012/09/18-17:43:00.550 ERR [RES] Network Name <Cluster Name>: Dns: Obtaining token threw exception, error 6

00001228.00001e0c::2012/09/18-17:43:00.550 ERR [RES] Network Name <Cluster Name>: Dns: Failed DNS registration with error 6 for Name: HPVCLU03 (Type: Singleton)

Examination of the DNS zone verified there was no A-Record for the cluster name.

At this point, we logged into the domain controller the cluster was communicating with and tried to locate the CNO using the Active Directory Users and Computers (ADUC) snap-in. When the computer object was not found in the Computers container, a full search of Active Directory revealed it was located in a nested OU structure four levels deep. As it turns out, it was located with the cluster node computer accounts, which is the expected new behavior beginning with Windows Server 2012 Failover Clusters, as previously described. It was clear, however, that the cluster administrator was not aware of this new behavior.

At this point, it appeared to be a case of the CNO account password being out of synch in the domain. I had the customer execute the following process:

  1. Temporarily move the CNO account into the Computers container
  2. Log into one of the cluster nodes with a domain account that had the Reset Password right in the domain
  3. Simulate failures for the cluster Network Name resource until it was in a permanent failed state
  4. Once the resource was in a Failed state, right-click on the resource, choose More Actions and then click Repair
  5. The previous action caused the password for the CNO to be reset in the domain

After executing the procedure, the cluster name came back online, and the customer noticed an automatic registration in DNS. He then executed a live migration for a virtual machine and it worked flawlessly. He also checked and verified the dNSHostName attribute on the computer object was now correctly populated. Issue resolved. Case closed.

Moral of the story - Not only do cluster administrators need to become familiar with the new functionality in Windows Server 2012 Failover Clusters (and there are many), but they should also realize that the CNO can have impact in areas that are not necessarily obvious.

Thanks, and come back again soon.

Chuck Timon
Senior Support Escalation Engineer
Microsoft Enterprise Platforms Support
High Availability\Virtualization Team

Just when you thought… (Part 1)


Just when you thought you had things figured out - in the words of the legendary Bob Dylan, "the times they are a-changin." With the release of Windows Server 2012, Microsoft introduces a load of new features, which, in some cases, translates into doing some of the same things in different ways. Up to now, highly available virtualized workloads meant multi-node Hyper-V Failover Clusters configured with Cluster Shared Volumes (CSV) hosting virtual machines. In Windows Server 2012 Hyper-V, the rules have changed. Now, virtual machine files can be stored on SMB 3.0 file shares hosted in standalone Windows Server 2012 File Servers, or in Windows Server 2012 Scale-Out File Servers.

This multi-part blog will walk through a new scenario, one that we may start seeing more and more as IT Professionals realize they can capitalize on their high-speed networking infrastructure investment while at the same time saving themselves a little money. The scenario involves both Windows Server 2012 Hyper-V Failover Clusters and Windows Server 2012 Scale-Out File Servers.

In this multi-part blog, I will cover the following:

  • Setting up a Windows Server 2012 Hyper-V Failover Cluster with no shared storage
  • Setting up a Windows Server 2012 Failover Cluster with the Scale-Out File Services Role
  • Configuring an SMB Share that supports Application Data with Continuous Availability in the Scale-Out File Server
  • Deploying virtual machines in the Hyper-V Failover Cluster while using the Scale-Out File Server SMB 3.0 shares to host the virtual machine files

To demonstrate the scenario, I created a 3-Node Windows Server 2012 Hyper-V Failover Cluster with no shared storage and a 2-Node Windows Server 2012 Failover Cluster connected to iSCSI storage to provide the shared storage for the Scale-Out File Server Role.

Create a 3-Node Windows Server 2012 Hyper-V Failover Cluster

First, create the 3-Node Hyper-V Failover Cluster. Since the cluster will not be connected to storage, and since it is always a 'best practice' from a Quorum calculation perspective to keep the number of votes in the cluster at an odd number, I chose a 3-Node cluster. I could have just as easily configured a 2-Node cluster and manually modified the Quorum Model to Node and File Share Majority. To support this Quorum Model, the Scale-Out File Server could be configured with a General Purpose file share to support the File Share Witness resource.

Recommendation: Since the cluster is not connected to storage, you do not have to run the storage tests in the cluster validation process.

In the interest of highlighting some of the other new features in Windows Server 2012 Failover Clustering, I created the cluster using a Distinguished Name format, which provides greater control over the placement of cluster computer objects in a custom Organizational Unit (OU) I created in Active Directory. It is recommended that you configure the OU to protect the Failover Cluster computer objects from 'accidental' deletion prior to creating the cluster. To accomplish this, implement a custom Access Control Entry (ACE) on the OU to deny Everyone the right to Delete all child objects.

clip_image002

This configuration on the container automatically checks the Protect object from accidental deletion checkbox on cluster computer objects when they are created.

clip_image004

Specify a Distinguished Name for the Cluster Name when creating the cluster (Create Cluster Wizard).

clip_image006
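The same thing can be done with PowerShell. A minimal sketch, using hypothetical node names and OU path:

# Create the cluster with a distinguished name so the CNO is created in the
# custom OU; -NoStorage matches this no-shared-storage scenario
Import-Module FailoverClusters
New-Cluster -Name "CN=HVCLUSTER,OU=Clusters,DC=contoso,DC=com" -Node HV-N1,HV-N2,HV-N3 -NoStorage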

The Create Cluster report reflects the Active Directory path (container) where the CNO computer object is located.

clip_image008

Create a 2-Node Windows Server 2012 Scale-Out File Server

Configure a 2-Node Windows Server 2012 Failover Cluster to provide Scale-Out File Services to the virtual machines hosted by the 3-Node Hyper-V Failover Cluster.

Note: To read about Scale-Out File Services access the TechNet content here - http://technet.microsoft.com/en-us/library/hh831349.aspx

The Scale-Out File Services cluster requires storage to support the Cluster Shared Volumes (CSV) that will host the virtual machine files. To ensure the entire configuration is supported, run a complete cluster validation process, including the storage tests, before creating the cluster. Be sure to create the cluster with sufficient storage to support a Node and Disk Majority Quorum Model (Witness disk required) and the CSV volumes to host the virtual machine files.

 

Note: While a single CSV volume supports multiple virtual machines, a 'best practice' is to place virtual machines across several CSV volumes to distribute the I/O to the backend storage. Additionally, consider enabling CSV caching (scenario dependent). To find out more about CSV Caching, review the Product Team blog on the topic - http://blogs.msdn.com/b/clustering/archive/2012/03/22/10286676.aspx

clip_image010

With the cluster up and running, configure the Scale-Out File Server Role by following these steps:

  1. In Failover Cluster Manager, in the left-hand pane, right-click on Roles and choose Configure Role to start the High Availability Wizard
  2. Review the Before You Begin screen and click Next
  3. In the Select Role screen, choose File Server and click Next
  4. For the File Server Type, choose Scale-Out File server for application data and click Next
  5. Provide a properly formatted NetBIOS name for the Client Access Point and click Next
  6. Review the Confirmation screen information and click Next
  7. Verify the wizard completes and the Role comes Online properly in Failover Cluster Manager
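Equivalently, the Role can be created with a single PowerShell cmdlet. A sketch, assuming a Client Access Point name of SOFS1:

# Create the Scale-Out File Server role on the current cluster
Import-Module FailoverClusters
Add-ClusterScaleOutFileServerRole -Name SOFS1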

A properly configured Scale-Out File Server Role should look something like this -

clip_image012

What happens if the Scale-Out File Server Role fails to start? Check the Cluster Events and you may find an Event ID: 1194 indicating a Network Name Resource failure occurred.

clip_image014

The Event Details section provides information for proper corrective action. In this case, since we are placing the cluster computer objects in a custom OU, we need to give the Scale-Out File Server CNO the right to Create Computer Objects. Once this is accomplished, and Active Directory replication has occurred, the Scale-Out File Server Role should start properly. Verify the Role comes online on all nodes in the cluster.

To review what we have accomplished:

  • Active Directory is configured properly to protect the accidental deletion of cluster computer objects
  • A 3-Node Hyper-V Failover Cluster has been created and validated
  • A 2-Node Scale-Out File Server Failover Cluster has been created and validated
  • The Scale-Out File Server CNO permissions have been properly configured on a custom OU

Well CORE Blog fans, that wraps it up for Part 1. Stayed tuned for Part 2 where we will:

  • Configure SMB 3.0 shares on the Scale-Out File Server
  • Configure highly available virtual machines in the Hyper-V Failover Cluster using the SMB shares on the Scale-Out File Server Cluster
  • Demonstrate Live Migration of virtual machines in the Hyper-V Failover Cluster

Thanks, and come back soon.

Chuck Timon
Senior Support Escalation Engineer
Microsoft Enterprise Platforms Support
High Availability\Virtualization Team

Just when you thought… (Part 2)


In Part 1, I covered configuring the Hyper-V Failover Cluster and the Scale-Out File Server solution. In Part 2, I will cover:

  • Creating the file shares in the Scale-Out File Server
  • Creating a virtual machine to use the SMB 3.0 shares in the Scale-Out File Server
  • Verifying we can Live Migrate the virtual machines in the Hyper-V Failover Cluster

Creating the File Share

Execute the following steps to create a file share in the Scale-Out File Server:

  1. In Failover Cluster Manager, right-click on the Scale-Out File Server role in the center pane and choose Add File Share. This starts the New Share Wizard
  2. In the Select Profile screen, choose SMB Share - Applications and click Next
  3. For the Share Location, choose one of the CSV Volumes and click Next
  4. Provide a Share Name, verify the path information and click Next
  5. In the Other Settings screen, Enable Continuous Availability is checked by default. Click Next
    Note: Some selections are greyed-out. This is because they are not supported for this share profile in a Failover Cluster
  6. In the Permissions screen, click Customize Permissions. In the Advanced Security Settings screen, note the default NTFS and Share permissions and then proceed to add the Hyper-V Failover Cluster Nodes Computer Accounts to the NTFS permissions for the share and ensure they have Full Control. If the permissions listing does not include the cluster administrator(s), add it and give the account (or Security Group) Full Control. Click Apply when finished

Complete configuring the file shares.

clip_image002
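Shares like these can also be created with PowerShell. The sketch below uses hypothetical share, path, scope, and account names (the computer accounts end in $); the folder must already exist on the CSV volume, and the matching NTFS permissions still need to be granted as described in step 6:

# Create a continuously available share scoped to the Scale-Out File Server name
New-SmbShare -Name VMStore1 -Path C:\ClusterStorage\Volume1\Shares\VMStore1 -ScopeName SOFS1 -ContinuouslyAvailable $true -FullAccess "CONTOSO\HV-N1$","CONTOSO\HV-N2$","CONTOSO\HV-N3$"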

As a test, connect to each of the shares from the Hyper-V Failover Cluster and verify you can write to each location before proceeding to the next step.

Creating a Virtual Machine to use an SMB 3.0 Share

Execute the following steps to create a new virtual machine:

  1. On one of the nodes in the Hyper-V Cluster, open Failover Cluster Manager
  2. In the left-hand pane, click on Roles and then in the right-hand Actions pane click on Virtual Machines and choose New Virtual Machine
  3. Choose one of the cluster nodes to be the target for the virtual machine and click OK
  4. This starts the New Virtual Machine Wizard. Review the Before You Begin screen and click Next
  5. In the Specify Name and Location screen, provide a name for the virtual machine, enter a UNC path to a share on the Scale-Out File Server, and then click Next

    clip_image004
  6. Configure memory settings and click Next
  7. Configure network settings and click Next
  8. In the Connect Virtual Hard Disk screen, make a selection and click Next
  9. Review the Summary screen and click Finish
  10. Verify the process completes successfully and click Finish

Testing Live Migration

Once all the virtual machines are created, you may want to test Live Migration. Depending on how many simultaneous live migrations you want to support, you may have to modify the Live Migration settings on each of the Hyper-V Failover Cluster nodes. The default is to allow two simultaneous live migrations. Here is a little PowerShell script you can run to take care of the settings for all the nodes in the cluster -

# Prompt once for credentials, then raise the simultaneous Live Migration
# limit to six on every node of the Hyper-V cluster
$Cred = Get-Credential

Invoke-Command -ComputerName Fabrikam-N21,Fabrikam-N22,Fabrikam-N23 -Credential $Cred -ScriptBlock { Set-VMHost -MaximumVirtualMachineMigrations 6 }

In my cluster, I have all the virtual machines running on the same node -

clip_image006

I will use a new feature in Windows Server 2012 Failover Clusters, multi-select, and select all of the virtual machines and live migrate them to another node in the cluster -

clip_image008

Since there are only four virtual machines and the maximum number of live migrations is equal to six, all will migrate.

clip_image010

If I were to rerun my script and change the maximum back to two, then two migrations would be queued until at least one of the in-progress migrations completes.

clip_image012

You can use the Get-SmbSession PowerShell cmdlet on any node in the Scale-Out File Server to determine the number of sessions. For illustration purposes, I have all virtual machines running on the same Hyper-V Failover Cluster node (Fabrikam-N21) and the CSV volumes are running on the same node in the Scale-Out File Server (Fabrikam-N1) -

clip_image014
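If you want a quick per-client summary instead of the raw session list, here is a small sketch:

# Count SMB sessions per connecting computer
Get-SmbSession | Group-Object ClientComputerName | Select-Object Count, Name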

Distributing the virtual machines across the multi-node Hyper-V Failover Cluster (Fabrikam-N21, Fabrikam-N22, and Fabrikam-N23) is reflected on the Scale-Out File Server -
image

Finally, I re-distribute the CSV volumes across the Scale-Out File Server nodes as shown here -

clip_image018

This is reflected in the Get-SmbSession PowerShell cmdlet output -

clip_image020

Thanks, and come back again soon.

Chuck Timon
Senior Support Escalation Engineer
Microsoft Enterprise Platforms Support
High Availability\Virtualization Team

Error in Failover Cluster Manager after install of KB2750149


On Tuesday, January 8, the below recommended fix was released and made available on Windows Update for the .NET Framework 4.5.

2750149
An update is available for the .NET Framework 4.5 in Windows 8, Windows RT and Windows Server 2012
http://support.microsoft.com/kb/2750149/EN-US

When this update is installed on a Windows Server 2012 cluster node, you will receive the below error when you select Roles or Nodes from within Failover Cluster Manager.

A weak event was created and it lives on the wrong object, there is a very high chance this will fail, please review and make changes on your code to prevent this issue

You can still manage your Cluster from a node that does not have this fix, or through PowerShell.  The recommendation at this time is to remove this fix, and to not install it in the first place.

Microsoft is aware of the issue.  Once the cause has been identified and a resolution available, this blog will be updated to reflect the resolution.

 

==========
UPDATE - January 23, 2013
==========

This issue has been resolved.  After installing the KB2750149 patch for .NET Framework, please also install the below patch.

KB2803748
Failover Cluster Management snap-in crashes after you install update 2750149 on a Windows Server 2012-based failover cluster
http://support.microsoft.com/kb/2803748

 

John Marlin
Senior Support Escalation Engineer
Microsoft Enterprise Platforms Support

Removing a Mount Point Disk from a Cluster Group


Hello everyone. Today’s post covers the steps to follow should you ever have to remove a ‘Physical Disk’ resource from a clustered service or application where that disk is configured as a mount point. Removing a disk from a group may be needed if the application no longer requires the storage and you need to use the disk in some other group or decommission it entirely.

First, I want to talk a little about dependencies and their function. Resource dependencies are created between resources in a cluster group to determine the order in which those resources are taken online and offline. Take, for instance, a SQL Server resource that depends on the ‘Physical Disk’ resources where SQL’s data is stored. A dependency should be established so that the SQL Server resource is dependent on the disk resource. This dependency makes sure the disk comes online first, and then SQL Server. The same applies when taking the group offline, except in reverse: the SQL Server resource goes offline before the disk does. Obviously, we would not want SQL Server to attempt to start until all the disks it uses are online first.

Once resources in a cluster group are linked with dependencies, you have to be careful when deleting resources out of a group. If you don’t remove dependencies properly, you may end up inadvertently removing other resources as well.

In this example, I have two resources in a cluster group, Resource A and Resource B. I establish dependencies between them so that Resource B is dependent on Resource A.

clip_image001

This means that when the group is brought online, Resource B will not be brought online until Resource A is online.

Here is what the dependency report looks like.

clip_image002

Now that these two resources are linked, you have to be careful when deleting these resources. If I were to delete Resource A from the group without first removing the dependency, BOTH resources will get removed.

clip_image003

At this point, a pop-up will appear warning that a removal of this resource could affect applications using this resource.

clip_image004

If you click ‘Yes’, both resources get removed.

clip_image005

Lesson learned. If the resource you are deleting is dependent on any other resources, remove the dependency first.

Now we get to the main point of this post. The above process is fine for deleting resources from a cluster group unless the resource you are deleting is configured as a ‘Physical Disk’ resource and it is a mount point disk. The process differs slightly, and you must follow it or you could find yourself unintentionally moving every resource in the group into ‘Available Storage’.

First, let’s cover the proper way to remove a mount point disk from a cluster group. In this example, I have a plain File Server group with a Network Name, IP Address, File Server, and three disks: a root disk (Disk X:) and two mount points using folders on the root of X:, called X:\MountPointA and X:\MountPointB.

clip_image006

Since I don’t have any shares located on X:\MountPointB, I want to remove that disk so I can use it in some other application. The FIRST thing I need to do is take the resource offline.

clip_image007

Then I can right-click the resource and click ‘Remove from GroupName’

clip_image008

When you remove a ‘Physical Disk’ resource from a cluster group, it doesn’t actually remove the cluster resource altogether; it moves the disk resource into the ‘Available Storage’ group.  This is so that you can reallocate the resource to another group if needed.

As you can see, the resource now shows in ‘Available Storage’

clip_image009

At this point, you can remove the mount point configuration, or change it to a lettered drive so you can use for some other application.

Now let’s go over what can happen if you don’t take the mount point offline before removing it. The main reason in going over this is to show you how to recover so that there’s no adverse impact to the Cluster.

In this example, I am removing the same resource from the File Server group following the same process above WITHOUT taking it offline first. First, I verified there were no dependencies on the ‘DiskX:\MountPointB’ resource.

clip_image010

Now here’s where it gets fun. After I attempted to remove the mount point, ALL of my resources disappeared from the group!

clip_image011

Time to panic? No, all is not lost. What happened is that because we had a mount point configured, and a mount point is not usable unless there’s a root disk, ALL of the resources moved to ‘Available Storage’ because the rest of the resources DO have dependencies.

It may appear that all of the resources disappeared. In the UI, the ‘Available Storage’ group only shows ‘Physical Disk’ resources, so if any other resources get put into that group, they don’t show up. However, if we run a command line to display all resources and their groups, we can see that the resources are still there.
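One way to do this is with the cluster.exe command line (2008/2008 R2); a quick sketch:

C:\> cluster res

The output lists every resource with its group, owning node, and status, including the non-disk resources that landed in ‘Available Storage’.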

clip_image012

To get the resources back into the right group, just move the disks back to the original File Server group. Right-click the disk, select More actions, and then select Move this resource to another service or application. The same dependency tree will cause the resources to move back.

clip_image013

Now we have all our resources back and we can follow the correct process of taking the mount point disk offline BEFORE removing it.

Cheers!

Jeff Hughes
Senior Support Escalation Engineer
Microsoft Enterprise Platforms Support


Replacing a Shared Disk on a 2008 Failover Cluster


Several months ago, I posted a blog on adding a new disk to an existing cluster. Another question we get asked a lot is “How do I replace a disk?”

In this blog, I’ll walk through the process of replacing a 1GB disk with a 2GB disk. This process is similar to how you would go about doing a SAN migration where you are replacing all of your shared disks with storage from a new SAN.

The preferred way of getting a larger cluster disk is to use the built-in capability of most SANs to dynamically expand a LUN, then use an OS utility like DiskPart or Disk Manager to extend the size of the disk. If that’s not feasible, if you simply want to replace a LUN with a larger one, or, as I mentioned, as part of a SAN migration, this process works well.
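For reference, the DiskPart route after the SAN administrator has expanded the LUN looks something like this sketch (volume 3 is an example; select the volume that lives on the expanded LUN):

C:\> diskpart
DISKPART> list volume
DISKPART> select volume 3
DISKPART> extend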

The first thing we need to do is present the new disk to the cluster. The nuts and bolts of how to do that are outside the scope of this post, so just ask your SAN administrator for a new LUN and present it to all nodes of the cluster. Since, by default in Server 2008, new LUNs are left offline, there’s no risk in presenting a new LUN to all nodes at the same time. The figure below shows what Disk Manager looks like after my new disk has been presented.

image

Figure 1.

Note how the new disk ‘Disk 9’ is in an ‘Offline’ state. In order to prepare it to be the replacement disk for an existing disk, we need to do the following.

  • Online the disk
  • Initialize the disk (MBR or GPT)
  • Create a new volume
  • Format as either FAT32 or NTFS

Note: You do NOT need to assign the new drive a drive letter during the format process.

image

Figure 2.

Figure 2 now shows ‘Disk 9’ as Online and formatted with an NTFS partition. At this point, we can now go into Failover Cluster Manager to complete the rest of the replacement.

The screenshot below shows a File Server group with ‘Cluster Disk X:’ of size 1GB. This is the disk that I am going to replace with the new 2GB disk from above.

image

Figure 3.

Failover Cluster Manager has a built-in ‘Repair’ function that allows replacing a failed disk with a new disk. Since we’re not really replacing a failed disk but a working one, we need to put that disk into an ‘Offline’ state so that the ‘Repair’ function will be enabled.

Figure 4.

Now right-click the disk resource, ‘More actions…’, ‘Repair’. This will launch the ‘Repair a Disk Resource’ window.

image

Figure 5.

Figure 5 shows the disk that we presented and created in Figure 2. Select that disk and click [OK].

Now bring the resource online. You’ll see in Figure 6 that the disk now shows as 2GB. We essentially swapped one disk for another without having to worry about resource dependencies. If the drive letter needs to be changed to match the old drive letter, do so now.

image

Figure 6.

So now that we’ve replaced the 1GB disk with the 2GB disk, what happened to the old disk? When you used the ‘Repair’ function, the old disk got removed from under the control of the cluster. The final step in the replacement is to bring the old disk back into the cluster so that we can bring it online and move the data from the old disk to the new.

To add the disk back in, from Failover Cluster Manager, go to the ‘Storage’ group. In the right-most column, in the ‘Actions’ pane, click on ‘Add a disk’

image

Figure 7.

Figure 8 shows the disk we just removed from the cluster. Select this disk and click [OK].

image

Figure 8.

This disk now shows up in ‘Available Storage’ (Figure 9).

image

Figure 9.

The final steps in the replacement are to assign this disk a drive letter so that it’s exposed to the OS, and then to move your data from the old disk to the new.

image

Figure 10.

image

Now that ‘Cluster Disk 7’ (the old disk) shows as online and has a drive letter (D:), you can use your favorite data copy method to move the data from the old disk to the new disk. If you are no longer going to use the old LUN, you can simply delete this resource from Failover Cluster Manager and unpresent that LUN from all nodes of the cluster. That finishes up the clean-up process. You can also just leave the disk in ‘Available Storage’, format it, and have it ready for some other ‘Service or application’ cluster group to use in the future.

Hope you find this blog useful especially for those SAN migrations.

Jeff Hughes
Senior Support Escalation Engineer
Microsoft Enterprise Platforms Support

 

Stop 0x50 on Windows 2008 R2 Failover Cluster


Greetings Cluster fans,

 

John Marlin back for another go at it.  I wanted to write something about what we have been seeing involving the use of Cluster Shared Volumes (CSV) and File Server resources on the same Cluster.

 

There have been multiple instances we have seen regarding a Stop 0x00000050 on Cluster Servers that point to CSVFILTER.SYS as being the culprit. 

 

CSVFILTER.SYS is a filter driver used by Failover Clustering to filter metadata I/O writes to a Cluster Shared Volume.  If a metadata write comes from the node that “owns” the volume (the coordinator node), CSVFILTER.SYS allows the direct I/O write.  If the node is not the coordinator, CSVFILTER.SYS redirects the I/O over the network to the node that is the coordinator.

 

Since it is a filter driver, it will attach itself to all drives in the Cluster.  The stop error occurs because CSVFILTER sees SMB I/O that it does not want to see.

 

Here are three different scenarios where you can get a Stop 0x00000050 error.

 

1. A Cluster that has File Server resources only (no Hyper-V VMs) with Cluster Shared Volumes enabled.

2. File Server resources or shares that are located on the Cluster Shared Volumes.

3. A Cluster that has both Hyper-V VMs on Cluster Shared Volumes and File Server resources on non-CSV drives.

 

Scenario 1

==========

A Cluster that has File Server resources only (no Hyper-V VMs) with Cluster Shared Volumes enabled. 

 

When you are in this scenario, there is no need for Cluster Shared Volumes to be enabled.  To resolve this, you should disable CSV so that CSVFILTER.SYS is no longer in play. 

 

To do this, run PowerShell from the Administrative Tools and issue this command:

 

# Disable Cluster Shared Volumes for this cluster
Import-Module FailoverClusters
Get-Cluster | ForEach-Object { $_.EnableSharedVolumes = "Disabled" }

 

This will disable Cluster Shared Volumes and you will no longer receive the stop errors.  In this type of configuration, there is no need to have Cluster Shared Volumes enabled since they are not being used anyway.

 

Scenario 2

==========

File Server resources or shares that are located on the Cluster Shared Volumes.

 

When you enable Cluster Shared volumes, you will receive this dialog box:

image

 

As it states, you do not want any kind of user or application data on these volumes.  The key point in the box above is “may result in unpredictable behavior, including data corruption or data loss”, and we all know that data integrity needs to be there.

 

So if you are keeping user or application data on a CSV drive, get it off, or bad things can happen. This is not a valid or supported configuration.

 

Scenario 3

==========

A Cluster that has both Hyper-V VMs on Cluster Shared Volumes and File Server resources on non-CSV drives.

 

In this configuration, you have all the highly available virtual machines on CSV drives and separate groups for File Servers on non-CSV drives.  As mentioned at the beginning of this post, CSVFILTER.SYS attaches itself to all drives, including these non-CSV drives.  This is where you would need a workaround, and there are two options to consider.

 

The first is to create a virtual machine that hosts the File Server resource and shares.  Add this VM to the Cluster on a drive that you can convert to a Cluster Shared Volume.  This option takes some work and a little bit of time.

 

The second option is to detach CSVFILTER.SYS from the non-CSV drives.  This one is the easiest and quickest to do, but it is a little kludgy.  For example, say your non-CSV disk was the Z: drive.  To detach it, the command would be:

 

Fltmc detach csvfilter z:

 

This would remove CSVFILTER.SYS as a filter on the drive.  The caveat to this is that if you restart the Cluster Service, reboot the machine, or simply move the group to another node, CSVFILTER.SYS may attach itself again. 

 

To get around this, you would want to create a batch file with the above command and place it on the Z: drive.  You would then create a Generic Application resource with this batch file, have the File Server resources depend on this Generic Application resource, and have the Generic Application resource depend on the Drive Z: resource.  This way, no matter what happens, the disk comes online, CSVFILTER is told to detach, and the File Server resources do what they do.
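As a rough sketch of that wiring using the Failover Clustering PowerShell module (the resource, group, and batch file names here are hypothetical; substitute your own):

PS C:\> Add-ClusterResource -Name "Detach CSVFilter" -Group "FileServer Group" -ResourceType "Generic Application"
PS C:\> Get-ClusterResource "Detach CSVFilter" | Set-ClusterParameter CommandLine "z:\detach.cmd"
PS C:\> Add-ClusterResourceDependency -Resource "Detach CSVFilter" -Provider "Cluster Disk Z"
PS C:\> Add-ClusterResourceDependency -Resource "File Server Z" -Provider "Detach CSVFilter"

One thing to keep in mind with this design: a Generic Application resource stays online only while the process it launched is still running, so the batch file generally needs to keep running after it issues the Fltmc command rather than exit immediately.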

 

No more stop errors. Is it kludgy?  Yes.  Does it do the job?  Yes.

 

Microsoft is looking into this further.  There are no guarantees that a fix will be created at this point.  For now, we must utilize the workarounds mentioned above.

 

Happy Clustering !!

 

John Marlin

Senior Support Escalation Engineer

Microsoft Enterprise Platforms Support

Cluster Shared Volumes (CSV) in redirected access mode after installing McAfee VSE 8.7 Patch 5 or 8.8 Patch 1


There is an issue with Cluster Shared Volumes and McAfee VirusScan Enterprise that I wanted to pass along.  When installing McAfee VSE 8.7 Patch 5 or 8.8 Patch 1, the CSV drives will go into redirected mode and will not go out of it.

The reason for this is that the McAfee filter driver (mfehidk.sys) is using decimal points in the altitude to help in identifying upgrade scenarios for their product.  The Cluster CSV filter only accepts whole numbers and puts the drives in redirected access mode when it sees this decimal value.

When seeing this, if you run FLTMC from an administrative command prompt, you may see something similar to:

C:\> fltmc

Filter Name    Num Instances      Altitude    Frame
------------------------------------------------------
CSVFilter            2            404900        0
mfehidk                           329998.99   <Legacy>
mfehidk              2            321300.00     0

If you were to generate a Cluster Log, you would see entries like the ones below identifying that it cannot read the altitude value properly.

INFO [DCM] FsFilterCanUseDirectIO is called for \\?\Volume{188c44f1-9cd0-11df-926b-a4ca2baf36ff}\
ERR  mscs::FilterSnooper::CanUseDirectIO: BadFormat(5917)' because of 'non-digit found'
INFO [DCM] PostOnline. CanUseDirectIO for C2V1 => false

McAfee has released the following document giving a temporary workaround.

Cluster Shared Volumes (CSV) status becomes Online (Redirected access)
https://kc.mcafee.com/corporate/index?page=content&id=KB73596

Microsoft is aware of the problem and is currently working on a fix.  When the fix is available, this post will be updated and a new KB article will be created with the fix.

John Marlin
Senior Support Escalation Engineer
Microsoft Enterprise Platforms Support

Understanding the DiskRunChkdsk parameter in Windows 2008 and 2008R2 Failover Clusters


My name is Sean Dwyer and I am a Support Escalation Engineer with the Microsoft CORE team.

 

I’d like to share a quick tip for Windows Server Cluster administrators.

 

There may come a time, for whatever reason, that a Cluster managed volume is flagged as dirty and you will see an event ID message indicating that CHKDSK needs to run against the volume.  Just for a little background, the NTFS File System is monitoring the drive/partition at all times.  If it detects corruption, it will flip a bit on the volume and mark it as dirty.  During the online process of a Clustered drive, it will check for the existence of this bit and spawn CHKDSK if it sees it.  You can check, at any time, to see if a volume is dirty with the CHKNTFS command.

 

C:\> chkntfs z:

The type of the file system is NTFS.

Z: is not dirty.

 

C:\> chkntfs z:

The type of the file system is NTFS.

Z: is dirty.

 

In a best case scenario, you can take the volume out of production, run CHKDSK on the volume if needed (refer to: http://technet.microsoft.com/en-us/library/cc772587.aspx), and then put the volume back into production.

 

In most situations though, the volume that needs attention is a heavily utilized production volume, and it will be extremely disruptive to have it offline for any length of time.

 

For example, a recent case I was involved with had a 14 TB* (see note 1 below) volume that was being flagged for CHKDSK to run on it about once a month. The volume had about 9 TB of data on it. Apart from the concern of why the volume was continually being flagged as corrupt, the length of time that CHKDSK took to run on the volume was extremely painful for the customer’s business. When it ran initially, it took roughly 80 hours to complete a run on the volume.

 

It may be necessary to temporarily configure a problem volume to block CHKDSK from running against it while troubleshooting continues to determine why the volume is being flagged for CHKDSK to run.

 

I stress the word temporary here.

 

Turning off the health monitoring tool for the file system as a permanent solution could only lead to more downtime in the future.  You may end up on the phone with one of the File Systems experts on my team, such as Robert Mitchell.

 

Ok – so let’s talk specifics about temporarily blocking CHKDSK from doing work on a Cluster volume.

 

Say we have determined that we need to suspend CHKDSK from running on a problem volume. For you old school Cluster admins, the first command parameter that probably jumps to mind is SKIPCHKDSK.

 

This works just fine for Windows 2003 Server Clusters, but will NOT work for Windows 2008 and 2008R2 Failover Clusters.

 

If SKIPCHKDSK is used for a Clustered volume, it will be ignored when the disk is next brought online and CHKDSK will be run. In a situation where the volume is 18 TB, the volume will remain unavailable for use until CHKDSK finishes* (see note 2 below).

 

The correct way to configure a volume to block CHKDSK from running on it is to use the DiskRunChkdsk parameter.  Keep in mind that these two parameters we are discussing only apply to the Cluster environment.  If the machine is restarted, the OS may prompt for CHKDSK to run on the affected volumes.

 

For information on how to configure the OS to ignore the dirty bit, refer to:

 

KB158675

How to Cancel CHKDSK After It Has Been Scheduled

http://support.microsoft.com/default.aspx?scid=kb;EN-US;158675

 

Before walking through an example of setting the DiskRunChkdsk parameter, I first must explain what the values mean.  In Windows 2003 Server Clusters, the SKIPCHKDSK parameter was either 0x0 (disabled) or 0x1 (enabled).  In Windows 2008 and 2008R2 Failover Clusters, there are different settings and what each one checks varies.

 

DiskRunChkDsk <0x0>: This is the default setting for all Failover Clusters. This policy will check the volume to see if the dirty bit is set and it will perform a Normal check of the file system. The Normal check is similar to running the DIR command at the root. If the dirty bit is set or if the Normal check returns a STATUS_FILE_CORRUPT_ERROR or STATUS_DISK_CORRUPT_ERROR, CHKDSK will be started in Verbose mode (Chkdsk /x /f).

 

DiskRunChkDsk <0x1>: This setting will check the volume to see if the dirty bit is set and it will perform a Verbose check. A Verbose check will scan the volume by traversing from the volume root and checking all the files of the file system. If the dirty bit is set or if the Verbose check returns a STATUS_FILE_CORRUPT_ERROR, CHKDSK will be started in normal mode (Chkdsk /x /f).

 

DiskRunChkDsk <0x2>: This setting will run CHKDSK in Verbose mode (Chkdsk /x /f) on the volume every time it is mounted.

 

DiskRunChkDsk <0x3>: This setting will check the volume to see if the dirty bit is set and it will perform a Normal check of the file system. The Normal check is similar to running the DIR command at the root. If the dirty bit is set or if the Normal check returns a STATUS_DISK_CORRUPT_ERROR, CHKDSK will be started in Verbose mode (Chkdsk /x /f), otherwise CHKDSK will be started in read only mode (Chkdsk without any switches).

 

DiskRunChkDsk <0x4>: This setting doesn’t perform any checks at all.

 

DiskRunChkDsk <0x5>: This setting will check the volume to see if the dirty bit is set and it will perform a Verbose check (scan the volume by traversing from the volume root and checking all the files) of the file system. If a problem is found, CHKDSK will not be started and the volume will not be brought online.

 

So now that we know what the various switches do, to have CHKDSK never run during an online operation of the disk, we want to set DiskRunChkdsk to 0x4.

 

Here are the steps you can run through to accomplish this task.

 

Step 1: Determine the resource name as seen by Cluster

clip_image002

 

Step 2: Open either an Administrative command prompt or Windows PowerShell Modules and run the command:

 

C:\> cluster res "Cluster Disk 8" /priv DiskRunChkdsk=4

 

or

PS C:\> Get-ClusterResource "Cluster Disk 8" | Set-ClusterParameter DiskRunChkdsk 4

 

Note: For the setting to take effect, the disk must be brought offline and back online.  Otherwise, it is simply stored until the next time the disk is taken offline and back online.
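If you prefer to cycle the disk from PowerShell, an equivalent for the same example resource would be:

PS C:\> Stop-ClusterResource "Cluster Disk 8"
PS C:\> Start-ClusterResource "Cluster Disk 8"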

 

Step 3: Bring the disk offline, then online again.

clip_image005 clip_image002[1]

 

Step 4: Verify the setting is applied

clip_image006

 

or

 

PS C:\> Get-ClusterResource "Cluster Disk 8" | Get-ClusterParameter DiskRunChkdsk

 

Object            Name             Value

------            ----             -----

Cluster Disk 8    DiskRunChkDsk    4 

 

Step 5: Actively start troubleshooting what could cause the volume to end up flagged dirty and needing CHKDSK.

 

Footnotes:

 

Note 1: It’s not suggested to run with volumes this large. In my experience, once they exceed 2 TB in size, they rapidly become an administrative liability, especially in a situation where CHKDSK has to run against the volume. We strongly suggest that mount points be used to carve up larger volumes like this into more administratively friendly chunks. CHKDSK runs against mount points just fine, too.

 

Note 2: While it’s not recommended to interrupt CHKDSK while it’s running, an admin is not locked into having to let CHKDSK finish once it starts. The process can be terminated if absolutely required. However, we cannot guarantee that the end result will be positive. If the process is interrupted during the “magic moment” when CHKDSK is making changes, the results may be worse than the initial reason for the volume being flagged as corrupt.

 

Additional reading material related to the components and tools mentioned in this post:

 

KB947021

How to configure volume mount points on a server cluster in Windows Server 2008

http://support.microsoft.com/default.aspx?scid=kb;EN-US;947021

 

The shared disk on Windows Server 2008 cluster fails to come online

http://support.microsoft.com/default.aspx?scid=kb;en-US;2517696

 

FSUTIL utility; marking a volume dirty for testing

http://technet.microsoft.com/en-us/library/bb490641.aspx
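As an aside, if you want to test the DiskRunChkdsk behavior in a lab, FSUTIL can deliberately flag a volume as dirty and confirm the flag afterward (a sketch using a hypothetical Z: test volume; do not do this to a production disk):

C:\> fsutil dirty set z:
C:\> fsutil dirty query z: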

 

In summary: try to keep your production volumes’ size under control, be aware that command line switches may not persist through all versions of a product, and continue being successful with Windows Server 2008!

 

I hope this post has been helpful!

 

Sean Dwyer

Support Escalation Engineer

Windows CORE Team

FIXED: Cluster Shared Volumes (CSV) in redirected access mode after installing McAfee VSE 8.7 Patch 5 or 8.8 Patch 1


A fix for the above titled problem has been released.  If you are running into this problem, please download and install the following fix on all Clusters running McAfee and wanting the updates they provide.

2674551
Redirected mode is enabled unexpectedly in a Cluster Shared Volume when you are running a third-party application in a Windows Server 2008 R2-based cluster
http://support.microsoft.com/default.aspx?scid=kb;EN-US;2674551

===============

Below is the information from the original post:

There is an issue with Cluster Shared Volumes and McAfee VirusScan Enterprise that I wanted to pass along. When installing McAfee VSE 8.7 Patch 5 or 8.8 Patch 1, the CSV drives will go into redirected mode and will not go out of it.

The reason for this is that the McAfee filter driver (mfehidk.sys) is using decimal points in the altitude to help in identifying upgrade scenarios for their product. The Cluster CSV filter only accepts whole numbers and puts the drives in redirected access mode when it sees this decimal value.

When seeing this, if you run FLTMC from an administrative command prompt, you may see something similar to:

C:\> fltmc

Filter Name   Num Instances   Altitude   Frame
------------------------------------------------------
CSVFilter          2          404900       0
mfehidk                       329998.99  <Legacy>
mfehidk            2          321300.00    0

If you were to generate a Cluster Log, you would see entries like the ones below identifying that it cannot read the altitude value properly.

INFO [DCM] FsFilterCanUseDirectIO is called for \\?\Volume{188c44f1-9cd0-11df-926b-a4ca2baf36ff}\
ERR mscs::FilterSnooper::CanUseDirectIO: BadFormat(5917)' because of 'non-digit found'
INFO [DCM] PostOnline. CanUseDirectIO for C2V1 => false

McAfee has released the following document giving a temporary workaround.

Cluster Shared Volumes (CSV) status becomes Online (Redirected access)
https://kc.mcafee.com/corporate/index?page=content&id=KB73596

Microsoft is aware of the problem and is currently working on a fix. When the fix is available, this post will be updated and a new KB article will be created with the fix.

John Marlin
Senior Support Escalation Engineer
Microsoft Enterprise Platforms Support

Why is the CNO in a Failed State?


Hello, my name is Steven Graves and I am a Support Escalation Engineer (SEE) in the Platforms Support group here at Microsoft. One of the technologies I support is Windows Server Failover Clustering. I’d like to take a couple of minutes to share some information on an issue I previously worked on. The customer wanted to create an Exchange 2010 DAG, which would be the first Windows Server 2008 R2 cluster in their environment, and they were having issues bringing the CNO online after the cluster was created. The customer domain was originally Windows 2003, and they had to add a 2008 R2 DC and update the schema in order to install the Exchange 2010 DAG.

For starters, since I knew the CNO was not coming online after creating the cluster, I had the customer destroy the cluster, pre-stage a new computer object for the CNO, and then create a new cluster based on the name of the new CNO. After the cluster was created, I noticed that the computer object was still disabled in AD and saw the following error messages in the cluster log.

clip_image001

00000e80.00000d5c::2012/03/14-16:07:33.149 INFO [RES] Network Name <Cluster Name>: Trying to find computer account W2K8R2Cluster object GUID(cae8b3dcc60aa040bbcef250634427bb) on any available domain controller.
00000e80.00000d5c::2012/03/14-16:07:33.306 WARN [RES] Network Name <Cluster Name>: Search for existing computer account failed. status 8007052E
00000e80.00000d5c::2012/03/14-16:07:33.352 WARN [RES] Network Name <Cluster Name>: Couldn't get information from DC \\Info-dc3.infoimage.com. status 5
00000e80.00000d5c::2012/03/14-16:07:33.352 INFO [RES] Network Name <Cluster Name>: Trying to find object cae8b3dcc60aa040bbcef250634427bb on a PDC.
00000e80.00000d5c::2012/03/14-16:07:33.462 WARN [RES] Network Name <Cluster Name>: Couldn't get information about PDC. status 5
00000e80.00000d5c::2012/03/14-16:07:33.462 INFO [RES] Network Name <Cluster Name>: Unable to find object cae8b3dcc60aa040bbcef250634427bb on a PDC.
00000e80.00000d5c::2012/03/14-16:07:33.462 INFO [RES] Network Name <Cluster Name>: GetComputerObjectViaGUIDEx() failed, Status 8007052E.

clip_image002

Status 5 translates to ‘Access is denied’.

In the System Event log, you will see an ID 1207 that should be in sync with the time in the cluster log. The main thing to focus on is the “Unable to get the Computer Object using GUID” message.

Log Name: System
Source: Microsoft-Windows-FailoverClustering
Date: 3/14/2012 9:07:10 AM
Event ID: 1207
Task Category: Network Name Resource
Level: Error
Keywords:
User: SYSTEM
Computer: W2K8R2Cluster.Corp.com
Description:
Cluster network name resource 'Cluster Name' cannot be brought online. The computer object associated with the resource could not be updated in domain 'infoimage.com' for the following reason:
Unable to get Computer Object using GUID.
The text for the associated error code is: Logon failure: unknown user name or bad password.

At this point, I’m pretty convinced there are some issues with the GPOs on the domain controllers but I still need to do my due diligence in troubleshooting the issue with the Cluster Network Name in a failed state.

Since I pre-staged the CNO and it was still disabled after creating a new cluster, this gave me more evidence indicating an issue with the DC. I created a new OU and blocked inheritance in order to prevent any GPOs from being applied to the Node(s). I refreshed the GPOs on the Node(s) and confirmed there were no GPOs applied by running Gpresult /V from an Administrative CMD Prompt, but the Cluster Network Name still failed to come online. I was convinced there was some issue with GPOs on the DC, but I was not sure where to start looking.

Next, I verified the permissions in AD on the CNO and, to be on the safe side, I granted the CNO Full Control to the object and also confirmed that the CNO has the correct permissions to the OU (READ permissions on the OU should be sufficient rights to access the OU and get to the computer object). Despite this, the Cluster Network Name failed to come online.

I moved on to check the DNS Host A record for the CNO, not really thinking this was the issue but more or less making sure everything was in order. I came to find out a Host A record was not created for the Cluster Network Name because they did not have Dynamic Updates enabled for DNS. I created the Host A record and checked off “Allow any authenticated user to update the DNS records with the same owner name.” From the warnings in the cluster log, I already knew the node was able to resolve the DC name but couldn’t get information from DC \\W2K8R2-DC.Corp.com, so it was not a name resolution issue when trying to access the DC.

At this point, I have gone through all the normal troubleshooting steps that generally resolve the ID 1207 and the CNO in a failed state from the cluster perspective. Now it’s time to engage Directory Services to take a deeper look at the DC configuration. After some time reviewing the Domain Controller configuration and GPOs, the DS engineer narrowed it down to permission issues in the “Access this computer from the network” policy. The default permissions are pictured below.

Access this computer from the network - This user right determines which users and groups are allowed to connect to the computer over the network. Since "Everyone" and "Authenticated Users" were missing from the settings, this meant that no computer would be able to access the domain controller.

clip_image003

The picture above shows the default permissions for the “Access this computer from the network” policy.

The DS engineer modified the “Access this computer from the network” policy in the Default Domain Controllers policy by adding Authenticated Users, refreshed GPOs by running GPUpdate /force, ran RSOP.msc to confirm the GPO is applied, and the CNO came online.

Steven Graves
Support Escalation Engineer
Microsoft Enterprise Platforms Support

Keeping backups of Cluster Logs


In a previous blog, Understanding the Cluster Debug Log in 2008, you were given the information on how Cluster logging in Windows 2008 Failover Clustering and beyond has changed from the earlier versions. In this blog, you were shown how the size of the log can be manipulated in order to keep a recommended 72 hours’ worth of data. Just to recap:

‘It is generally recommended that your CLUSTER.LOG have at least 72 hours’ worth of continuous data retention. This is so that if you have a failure occur after you went home on Friday, you still have the data you need to troubleshoot the issue on Monday morning.’

What if you wanted to get information from further back (i.e. a week, a month, etc)? One of the ways you could do this is to increase the size of the log with the /SIZE: switch. However, increasing the size for say a month could get you into gigabytes of space being used and text files being massive and hard to go through. Have you ever tried to open a 1 gigabyte text file with Notepad?

Here is a way that you can keep the file at a smaller size and keep backups that can be referred back to at any time. First, you must determine what size is needed to hold 24 hours’ worth of data. This way, you can have a Cluster Log generated for every day. The next thing to consider is where you want to store the files, local or network share. What if you wanted to do this for multiple Clusters? Let’s say that you figured that you need the log size to be at 200meg and you are going to put it on a server (JOHNMARLIN).

The previous blog mentioned will have you run the command Cluster Log /Size:200 to set the proper size based on the data needed. I do this for all my Clusters. I then go out to my JOHNMARLIN server and create a share for each Cluster (TXCLUSTER, NCCLUSTER, etc). Now I just have to go to one node in each of the Clusters to set things up.

On the node you are doing the task on, go into Control Panel - Region and Language and change the Short date to yyyy-MM-dd.

image

On this node, you could create a CLUSTERLOG folder off the root of Drive C:. In this C:\CLUSTERLOG directory, create a batch file called Get-Logs.bat that has the following commands:

Net use j: /d
Net use j: \\johnmarlin\txcluster
Md j:\%date%
Cluster log /gen /copy:"c:\clusterlog"
Copy c:\clusterlog\*.log j:\%date%\*.log
Net use j: /d

I used Drive Letter J:, but you can use any available letter. So what the batch file will do when run today (June 18, 2012) is: 

1. It will create a folder on the share named by the date

a. 2012-06-18

2. It will generate the Cluster Log on every node

3. It will copy the cluster logs from all nodes to the local c:\clusterlog folder and tag the Node Name as part of the filename

a. TXCLUSTER-node1_cluster.log
b. TXCLUSTER-node2_cluster.log
c. TXCLUSTER-node3_cluster.log
d. TXCLUSTER-node4_cluster.log

4. It will copy the cluster logs from this c:\clusterlog folder to the share folder with the date, keeping the same name

a. \2012-06-18\TXCLUSTER-node1_cluster.log
b. \2012-06-18\TXCLUSTER-node2_cluster.log
c. \2012-06-18\TXCLUSTER-node3_cluster.log
d. \2012-06-18\TXCLUSTER-node4_cluster.log

When it runs the next day:

1. It will create a folder on the share named by the date

a. 2012-06-19

2. It will generate the Cluster Log on every node

3. It will copy the cluster logs from all nodes to the local c:\clusterlog folder and tag the Node Name as part of the filename

a. TXCLUSTER-node1_cluster.log
b. TXCLUSTER-node2_cluster.log
c. TXCLUSTER-node3_cluster.log
d. TXCLUSTER-node4_cluster.log

4. It will copy the cluster logs from this c:\clusterlog folder to the share folder with the date keeping the same name

a. \2012-06-19\TXCLUSTER-node1_cluster.log
b. \2012-06-19\TXCLUSTER-node2_cluster.log
c. \2012-06-19\TXCLUSTER-node3_cluster.log
d. \2012-06-19\TXCLUSTER-node4_cluster.log

Each day it runs, it creates the next dated folder and files. This way, you have an easily sorted folder structure that you can go to any day you want and get the file you need from whichever node you need.

The next thing to do is set up a Scheduled Task to run each day so it creates the files for you. This way, you do not have to remember to do it. From the Administrative Tools, open up Task Scheduler and select Create Task. You can then use the below information to create the task.

A. General Tab

i. For the Name, call it something like Cluster Daily Log Backups
ii. Make sure to use an account that has admin rights to this node, to the Cluster, and to the network share
iii. Select Run whether user is logged on or not
iv. You will also need to select Run with highest privileges

image

B. Triggers Tab

i. Set whatever time you want it to run.  One thing to keep in mind is that the Cluster Log is in GMT time, so account for it when deciding when to have them created
ii. Select it to run daily and recur for 352 days
iii. Make sure it is Enabled

image

C. Actions Tab

i. Program/Script will be CMD.EXE
ii. Add Arguments will be /C C:\CLUSTERLOG\Get-Logs.bat

image

D. Conditions Tab

i. You don't really need to change anything unless you want to

E. Settings Tab

i. Check Allow task to be run on demand
ii. Check Run task as soon as possible after scheduled start is missed

image

So now you have your task that will do this for you. You can now just sit back and relax knowing that you will have a Cluster Log generated for every node every day.

There are a couple caveats to this that you must take into consideration. If the account you are using has its password changed on the domain, you will have to change it on the task as well.  It will stop running after 352 days, so if you want more, you would have to create it again.  But you will have a year's worth of Cluster Logs when it is done. 

There are other ways of doing this. You could use scripting and the PowerShell command:

Get-ClusterLog –Destination

You could also use other methods than the batch file. This is just one of the ways of doing it.
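For instance, a minimal PowerShell sketch of the same daily collection (assuming the \\johnmarlin\txcluster share from the batch file example) could look like this:

PS C:\> Import-Module FailoverClusters
PS C:\> $folder = "\\johnmarlin\txcluster\$(Get-Date -Format yyyy-MM-dd)"
PS C:\> New-Item -ItemType Directory -Path $folder
PS C:\> Get-ClusterLog -Destination $folder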

 

Happy Clustering !!!

John Marlin
Senior Support Escalation Engineer
Microsoft Enterprise Platforms Support


Having a problem with nodes being removed from active Failover Cluster membership?


Welcome to the AskCore blog. Today, we are going to talk about nodes being removed from active Failover Cluster membership randomly. If you are having problems with a node being removed from membership, you are seeing events like this logged in your System Event Log:

image

This event will be logged on all nodes in the Cluster except for the node that was removed. The reason for this event is that one of the nodes in the Cluster marked that node as down. It then notified all of the other nodes of the event. When the nodes are notified, they discontinue and tear down their heartbeat connections to the downed node.

What caused the node to be marked down?

All nodes in a Windows 2008 or 2008 R2 Failover Cluster talk to each other over the networks that are set to Allow cluster network communication on this network. The nodes will send out heartbeat packets across these networks to all of the other nodes. These packets are supposed to be received by the other nodes and then a response is sent back. Each node in the Cluster has its own heartbeats that it is going to monitor to ensure the network is up and the other nodes are up. The example below should help clarify this:

image

If any one of these packets is not returned, then the specific heartbeat is considered failed. For example, W2K8-R2-NODE2 sends a request and receives a response from W2K8-R2-NODE1 to a heartbeat packet, so it determines the network and the node are up.  If W2K8-R2-NODE1 sends a request to W2K8-R2-NODE2 and does not get the response, it is considered a lost heartbeat and W2K8-R2-NODE1 keeps track of it.  This missed response can cause W2K8-R2-NODE1 to show the network as down until another heartbeat response is received.

By default, Cluster nodes have a limit of 5 failures in 5 seconds before the connection is marked down. So if W2K8-R2-NODE1 does not receive the response 5 times in the time period, it considers that particular route to W2K8-R2-NODE2 to be down.  If other routes are still considered to be up, W2K8-R2-NODE2 will remain as an active member.

If all routes are marked down for W2K8-R2-NODE2, it is removed from active Failover Cluster membership and the Event 1135 that you see in the first section is logged. On W2K8-R2-NODE2, the Cluster Service is terminated and then restarted so it can try to rejoin the Cluster.

For more information on how we handle specific routes going down with 3 or more nodes, please reference the “Partitioned” Cluster Networks blog that was written by Jeff Hughes.

Now that we know how the heartbeat process works, here are some of the known causes for the process to fail.

1. Actual network hardware failures. If the packet is lost on the wire somewhere between the nodes, then the heartbeats will fail. A network trace from both nodes involved will reveal this.

2. The profile for your network connections could possibly be bouncing from Domain to Public and back to Domain again. During the transition of these changes, network I/O can be blocked. You can check to see if this is the case by looking at the Network Profile Operational log. You can find this log by opening the Event Viewer and navigating to: Applications and Services Logs\Microsoft\Windows\NetworkProfile\Operational. Look at the events in this log on the node that was mentioned in the Event ID: 1135 and see if the profile was changing at this time. If so, please check out the KB article “The network location profile changes from "Domain" to "Public" in Windows 7 or in Windows Server 2008 R2”.

3. You have IPv6 enabled on the servers, but have the following two rules disabled for Inbound and Outbound in the Windows Firewall:

  • Core Networking - Neighbor Discovery Advertisement
  • Core Networking - Neighbor Discovery Solicitation

4. Anti-virus software could be interfering with this process also. If you suspect this, test by disabling or uninstalling the software. Do this at your own risk because you will be unprotected from viruses at this point.

5. Latency on your network could also cause this to happen. The packets may not be lost between the nodes, but they may not get to the nodes fast enough before the timeout period expires.

6. IPv6 is the default protocol that Failover Clustering will use for its heartbeats. The heartbeat itself is a UDP unicast network packet that communicates over Port 3343. If there are switches, firewalls, or routers not configured properly to allow this traffic through, you can see issues like this.

7. IPsec security policy refreshes can also cause this problem. The specific issue is that during an IPSec group policy update all IPsec Security Associations (SAs) are torn down by Windows Firewall with Advanced Security (WFAS). While this is happening, all network connectivity is blocked. When re-negotiating the Security Associations if there are delays in performing authentication with Active Directory, these delays (where all network communication is blocked) will also block cluster heartbeats from getting through and cause cluster health monitoring to detect nodes as down if they do not respond within the 5 second threshold.

8. Old or out of date network card drivers and/or firmware.  At times, a simple misconfiguration of the network card or switch can also cause loss of heartbeats.

9. If you are running on VMware, you may be experiencing packet loss.  The following blog talks about this in a little more detail, including how to tell if this is the issue, and points you to the VMware article on the settings to change.

Nodes being removed from Failover Cluster membership on VMWare ESX?
http://blogs.technet.com/b/askcore/archive/2013/06/03/nodes-being-removed-from-failover-cluster-membership-on-vmware-esx.aspx

These are the most common reasons that these events are logged, but there could be other reasons also. The point of this blog was to give you some insight into the process and also give ideas of what to look for. Some will raise the following values to their maximum values to try and get this problem to stop.

 

Parameter              Default             Range
--------------------   -----------------   ---------------------
SameSubnetDelay        1000 milliseconds   250-2000 milliseconds
CrossSubnetDelay       1000 milliseconds   250-4000 milliseconds
SameSubnetThreshold    5                   3-10
CrossSubnetThreshold   5                   3-10

Increasing these values to their maximum may make the event and node removal go away, but it just masks the problem. It does not fix anything. The best thing to do is find out the root cause of the heartbeat failures and get it fixed. The only real need for increasing these values is in a multi-site scenario where nodes reside in different locations and network latency cannot be overcome.
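If you do need to adjust them for a multi-site cluster, these heartbeat settings are common properties on the cluster object and can be read and set from PowerShell; a quick sketch:

PS C:\> Get-Cluster | Format-List *Subnet*
PS C:\> (Get-Cluster).CrossSubnetDelay = 2000
PS C:\> (Get-Cluster).CrossSubnetThreshold = 10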

I hope that this post helps you!

Thanks,
James Burrage
Senior Support Escalation Engineer
Windows High Availability Group

Windows Server 2012 Failover Cluster Sessions at TechEd


Start getting familiar with Windows Server 2012 Failover Clustering by viewing the sessions that were delivered at TechEd 2012. 

These sessions are posted online now for your viewing pleasure.  For those not familiar with what TechEd is, I offer this.

TechEd is Microsoft's premier technology conference for IT professionals and developers, offering the most comprehensive technical education across Microsoft's current and soon-to-be-released suite of products, solutions, tools, and services. TechEd offers hands-on learning, deep product exploration and countless opportunities to build relationships with a community of industry and Microsoft experts that will help your work for years to come.

For over 20 years industry professionals have found TechEd to be the best opportunity to stay aligned with Microsoft’s current technologies and new product opportunities. You and your colleagues come to TechEd to discuss critical technology issues, gain practical advice, and network with Microsoft and industry experts. Whether you are an IT Professional or a Developer, TechEd has much to offer you.

This year’s North America event included over 11,000 customers, partners, speakers, and staff. 

Each session lasts about 1 hour and 15 minutes and includes the PowerPoint deck and presentation that you can view online or download to view at a later time.  The formats in which you can view the sessions are:

MP3 = audio only

Mid Quality WMV = lo-band, mobile

High Quality MP4 = iPad, PC

Mid Quality MP4 = WP7, HTML5

MP4 = iPod, Zune HD

High Quality WMV = PC, Xbox, MCE

The sessions from the Clustering team at TechEd North America were:

WSV324 - Building a Highly Available Failover Cluster Solution with Windows Server 2012 from the Ground UP

Windows Server 2012 delivers innovative new capabilities that enable you to build dynamic availability solutions in which workloads, networks, and storage become more flexible, efficient, and available than ever before. This session covers creating a Windows Server 2012 highly available Failover Cluster leveraging the new technologies in Windows Server 2012. This session walks through a demo leveraging a highly available Space, encrypting data with shared BitLocker disks, asymmetrical storage configurations with CSV I/O redirection… from the bottom up to a highly available solution.

WSV430 - Cluster Shared Volumes Reborn in Windows Server 2012: Deep Dive

This session takes a deep technical dive into the new Cluster Shared Volumes (CSV) architecture and new features coming in Windows Server 2012. CSV is now a full-blown clustered file system, and all of the challenges of the past have been addressed, along with many enhancements. This is an in-depth session that covers the CSV architecture, CSV backup integration, and integration with a wealth of new features that enhance CSV and its performance.

WSV411 - Guest Clustering and VM Monitoring in Windows Server 2012

In Windows Server 2012 there will be new ways to monitor application health state and have recovery inside of a virtual machine. This session details the new VM Monitoring feature in Windows Server 2012 as well as discusses Guest Clustering and changes in Windows Server 2012 (such as virtual FC), along with pros and cons of when to use each.

WSV322 - Update Management in Windows Server 2012: Revealing Cluster-Aware Updating and the New Generation of WSUS

Today, patch management is a required component of any security strategy. In Windows Server 2012, the new Cluster-Aware Updating (CAU) feature delivers Continuous Availability through automated self-updating of failover clusters. In Windows Server 2012, Windows Server Update Services (WSUS) has evolved to become a Server Role with exciting new capabilities. This session introduces CAU with a discussion of its GUI, cmdlets, remote-updating and self-updating capabilities. And then we proceed to highlight the main functionalities of WSUS in Windows Server 2012, including the security enhancements, patch deployment automation, and new Windows PowerShell cmdlets to perform maintenance, manage, and deploy updates.

VIR401 - Hyper-V High-Availability and Mobility: Designing the Infrastructure for Your Private Cloud

Private Cloud Technical Evangelist Symon Perriman leads this session discussing Windows Server 2012 and Windows Server 2008 R2 Hyper-V and Failover Clustering design, infrastructure planning and deployment considerations for your highly-available datacenter or Private Cloud. Do you know the pros and cons of how different virtualization solutions can provide continual availability? Do you know how Microsoft System Center 2012 can move the solution closer to a Private Cloud implementation? This session covers licensing, hardware, validation, deployment, upgrades, host clustering, guest clustering, disaster recovery, multi-site clustering, System Center Virtual Machine Manager 2008 and 2012, and offers a wealth of best practices. Prior clustering and Hyper-V knowledge recommended.

The sessions from the Clustering team at TechEd Europe were:

WSV324 - Building a Highly Available Failover Cluster Solution with Windows Server 2012 from the Ground UP

Windows Server 2012 delivers innovative new capabilities that enable you to build dynamic availability solutions in which workloads, networks, and storage become more flexible, efficient, and available than ever before. This session will cover creating a Windows Server 2012 highly available Failover Cluster leveraging the new technologies in Windows Server 2012. This session will walk through a demo leveraging a highly available Space, encrypting data with shared BitLocker disks, asymmetrical storage configurations with CSV I/O redirection… from the bottom up to a highly available solution.

WSV430 - Cluster Shared Volumes Reborn in Windows Server 2012: Deep Dive

This session will do a deep technical dive into the new Cluster Shared Volumes architecture and new features coming in Windows Server 2012. CSV is now a full blown clustered file system, and all of the challenges of the past have been addressed, along with many enhancements. This will be an in-depth session that will cover the CSV architecture, CSV backup integration, and integration with a wealth of new features that enhance CSV and its performance.

All other TechEd content regarding Windows Server 2012 and Windows 8 can be viewed at the same locations:

North America

Europe

I hope that you can take some time to get to know the new products and features that are coming and get as excited about it as Microsoft is.

Happy Clustering !!

John Marlin
Senior Support Escalation Engineer
Microsoft Enterprise Platforms Support

CNO Blog Series: Increasing Awareness around the Cluster Name Object (CNO)


I am starting a 'CNO Blog Series', which will consist of blogs written by the CORE team cluster engineers and will focus primarily on the Cluster Name Object (CNO). The CNO is the computer object in Active Directory associated with the Cluster Name; it is used as a common identity in the cluster. If you have been working with Failover Clusters since Windows Server 2008, you should be very familiar with the CNO and the role it plays with respect to the cluster security model. Looking over the CORE Team blog site, there have already been some blogs written that focus primarily on the CNO.

With the release of Windows Server 2012, there have been several enhancements added to the Failover Clustering feature that provide for better integration with Active Directory. The Product Team blog (http://blogs.msdn.com/b/clustering/), has a post that discusses creating Windows Server 2012 Failover Clusters in more restrictive Active Directory environments. That blog discusses some of the changes that have been made in the product that directly involve the CNO.

On to today's blog - increasing awareness around the Cluster Name Object (CNO)….

Beginning with Windows Server 2008, when a cluster is created, the computer object associated with the CNO, unless pre-staged in some other container, is placed, by default, in the Computers container. Windows Server 2012 Failover Clusters give cluster administrators more control over the computer object representing the CNO. The Product Group's blog mentioned earlier details new functionality in Windows Server 2012, which includes:

  • Using Distinguished Names when creating the cluster to manually control CNO placement
  • New default behavior where a CNO is placed in the same container as the computer objects for the nodes in the cluster
  • The Virtual Computer Objects (VCOs) created by a CNO are placed in the same container as the CNO

Having more control over cluster computer object(s) placement, while desirable, requires a bit more 'awareness' on the part of a cluster administrator. This 'awareness' involves knowing that the CNO, when placed in a non-default location, may not have the rights it needs for other cluster operations, such as creating other cluster computer objects (VCOs). The first indication of a problem may be when a Role is made highly available in the cluster and that Role requires a Client Access Point (CAP). After the Role creation process completes, and the Network Name associated with the CAP attempts to come Online, it fails with an Event ID 1194.

Log Name: System
Source: Microsoft-Windows-FailoverClustering
Event ID: 1194
Level: Error

This event reports that a computer object associated with a cluster Network Name resource could not be created. The error message itself provides good troubleshooting guidance to help resolve the issue -

clip_image002

In this case, it is simply a matter of modifying the security on the AD container so the CNO is allowed to Create Computer Objects. Once this setting is in place, the Network Name comes online without issue. Additionally, the CNO is also given another critical right, the right to change the password for any VCO it creates.

If Active Directory is properly configured (more on that in a bit), the VCO, along with the CNO, can be also protected from accidental deletion.

image

Protecting Cluster Computer Objects

A call often handled by our support engineers involves the accidental, or semi-intentional, deletion of the computer objects associated with Failover Clusters. There are a variety of reasons this happens, but we will not go into those here. Suffice it to say, things function more smoothly if the computer objects associated with a cluster are protected.

I mentioned new functionality in Windows Server 2012 Failover Clusters where cluster objects are strategically placed in targeted Active Directory containers (OUs) automatically. Using this methodology also makes it easier to discern which objects are associated with a Failover Cluster. As you can see in this screenshot of a custom OU (Clusters) that I created in my domain, the objects associated with the cluster carry the description of Failover cluster virtual network name account. The cluster nodes, which are located in the same OU, are traditional computer objects, which do not carry this description.

clip_image006

Examining the properties of one of these accounts using the Attribute Editor, one can see it is clearly an attribute (Description field) of the computer object.

clip_image008

Properly protecting cluster computer objects (from accidental deletion) requires Domain Administrator intervention. This can be either a 'proactive' or a 'reactive' intervention. A proactive intervention requires that a Domain Administrator set a Deny ACE (Access Control Entry) for Delete all child objects for the Everyone group on the container where the cluster computer objects will be located.

clip_image010

A reactive intervention occurs after a CNO is placed in the designated container. At this point, the Domain Administrator has a choice. He can either:

1. Set the Deny ACE for Delete all child objects on the container, or

2. Check the Protect object from accidental deletion checkbox on the CNO computer object (which would then set the correct Deny ACE on the container)

image

Let us step through a scenario from a recent case I worked for one of our customers deploying a new Windows Server 2012 Failover Cluster.

Customer Case Study

In this case, a customer was deploying a 2-Node Windows Server 2012 Hyper-V Failover Cluster dedicated to supporting virtualized workloads. The cluster creation process was completed without issue and the Cluster Core Resources group could move freely between the nodes without any resource failures. The customer had already created four highly available virtual machines, some of which were already in production. The customer wanted to test live migration for the virtual machines. When he attempted to execute a live migration for a virtual machine, it failed immediately on the source cluster node. He attempted a quick migration and that succeeded.

Reviewing the cluster logs obtained from the customer, the live migration error appeared in the cluster log of the source cluster node. The live migration failure was registered with an error code of 1326.

00001274.00001c24::2012/09/18-17:50:16.301 ERR [RES] Virtual Machine <Virtual Machine MRS1SAPPBW31>: Live migration of 'Virtual Machine MRS1SAPPBW31' failed.

00001274.00001c24::2012/09/18-17:50:16.301 ERR [RHS] Resource Virtual Machine MRS1SAPPBW31 has cancelled offline with error code 1326.

00000aa8.00001cf4::2012/09/18-17:50:16.301 INFO [RCM] HandleMonitorReply: OFFLINERESOURCE for 'Virtual Machine MRS1SAPPBW31', gen(0) result 0/1326.

The error code resolved to - 'The user name or password is incorrect'.

Examining the rest of the cluster log indicated the CNO could not log on to the domain controller to obtain necessary tokens. This failure was also causing a failure registering with DNS (customer is using Microsoft dynamic DNS).

00001228.00001a20::2012/09/18-17:43:00.466 WARN [RES] Network Name: [NNLIB] LogonUserEx fails for user HPVCLU03$: 1326 (useSecondaryPassword: 0)

00001228.00001a20::2012/09/18-17:43:00.550 WARN [RES] Network Name: [NNLIB] LogonUserEx fails for user HPVCLU03$: 1326 (useSecondaryPassword: 1)

00001228.00001a20::2012/09/18-17:43:00.550 INFO [RES] Network Name: [NNLIB] Logon failed for user HPVCLU03$ (Error 1326), DC \\<FQDN_of_DC_here>

00001228.00001a20::2012/09/18-17:43:00.550 INFO [RES] Network Name <Cluster Name>: Identity: Obtaining Windows Token for Name: HPVCLU03, SamName: HPVCLU03$, Type: Singleton, Result: 1326, LastDC: \\<FQDN_of _DC_here>

00001228.00001a20::2012/09/18-17:43:00.550 INFO [RES] Network Name <Cluster Name>: Identity: Slow Operation, FinishWithReply: 1326

00001228.00001a20::2012/09/18-17:43:00.550 INFO [RES] Network Name <Cluster Name>: Identity: InternalReplyHandler with event: 1326

00001228.00001a20::2012/09/18-17:43:00.550 INFO [RES] Network Name <Cluster Name>: Identity: End of Slow Operation, state: Error/Idle, prevWorkState: Idle

00001228.00001a8c::2012/09/18-17:43:00.550 WARN [RES] Network Name <Cluster Name>: Identity: Get Token Request, currently doesn't have a token!

00001228.00001a8c::2012/09/18-17:43:00.550 INFO [RES] Network Name: [NN] got sync reply: 0

00001228.00001e0c::2012/09/18-17:43:00.550 ERR [RES] Network Name <Cluster Name>: Dns: Obtaining token threw exception, error 6

00001228.00001e0c::2012/09/18-17:43:00.550 ERR [RES] Network Name <Cluster Name>: Dns: Failed DNS registration with error 6 for Name: HPVCLU03 (Type: Singleton)

Examination of the DNS zone verified there was no A-Record for the cluster name.

At this point, we logged into the domain controller the cluster was communicating with and tried to locate the CNO using the Active Directory Users and Computers (ADUC) snap-in. When the computer object was not found in the Computers container, a full search of active directory revealed it was located in a nested OU structure four levels deep. Coincidentally, it was located with the cluster node computer accounts, which is the expected new behavior beginning with Windows Server 2012 Failover Clusters as previously described. It was clear to me; however, the cluster administrator was not aware of this new behavior.

At this point, it appeared to be a case of the CNO account password being out of sync in the domain. I had the customer execute the following process:

  1. Temporarily move the CNO account into the Computers container
  2. Log into one of the cluster nodes with a domain account that had the Reset Password right in the domain
  3. Simulate failures for the cluster Network Name resource until it was in a permanent failed state
  4. Once the resource was in a Failed state, right-click on the resource, choose More Actions and then click Repair
  5. The previous action caused the password for the CNO to be reset in the domain

After executing the procedure, the cluster name came back online, and the customer noticed an automatic registration in DNS. He then executed a live migration for a virtual machine and it worked flawlessly. He also checked and verified the dNSHostName attribute on the computer object was now correctly populated. Issue resolved. Case closed.

Moral of the story - Not only do cluster administrators need to become familiar with the new features in Windows Server 2012 Failover Clusters (and there are many), but they should also realize that the CNO can have impact in areas that are not necessarily obvious.

Thanks, and come back again soon.

Chuck Timon
Senior Support Escalation Engineer
Microsoft Enterprise Platforms Support
High Availability\Virtualization Team

Just when you thought…..(Part 1)


Just when you thought you had things figured out - in the words of the legendary Bob Dylan, "the times they are a-changin." With the release of Windows Server 2012, Microsoft introduces a load of new features, which, in some cases, translates into doing some of the same things in different ways. Up to now, highly available virtualized workloads meant multi-node Hyper-V Failover Clusters configured with Cluster Shared Volumes (CSV) hosting virtual machines. In Windows Server 2012 Hyper-V, the rules have changed. Now, virtual machine files can be stored on SMB 3.0 file shares hosted in standalone Windows Server 2012 File Servers, or in Windows Server 2012 Scale-Out File Servers.

This multi-part blog will walk through a new scenario, one that we may start seeing more and more as IT Professionals realize they can capitalize on their high-speed networking infrastructure investment while at the same time saving themselves a little money. The scenario involves both Windows Server 2012 Hyper-V Failover Clusters and Windows Server 2012 Scale-Out File Servers.

In this multi-part blog, I will cover the following:

  • Setting up a Windows Server 2012 Hyper-V Failover Cluster with no shared storage
  • Setting up a Windows Server 2012 Failover Cluster with the Scale-Out File Services Role
  • Configuring an SMB Share that supports Application Data with Continuous Availability in the Scale-Out File Server
  • Deploying virtual machines in the Hyper-V Failover Cluster while using the Scale-Out File Server SMB 3.0 shares to host the virtual machine files

To demonstrate the scenario, I created a 3-Node Windows Server 2012 Hyper-V Failover Cluster with no shared storage and a 2-Node Windows Server 2012 Failover Cluster connected to iSCSI storage to provide the shared storage for the Scale-Out File Server Role.

Create a 3-Node Windows Server 2012 Hyper-V Failover Cluster

First, create the 3-Node Hyper-V Failover Cluster. Since the cluster will not be connected to storage, and it is always a 'best practice' from a Quorum calculation perspective to keep the number of votes in the cluster at an odd number, I chose a 3-Node cluster. I could have just as easily configured a 2-Node cluster and manually modified the Quorum Model to Node and File Share Witness. To support this Quorum Model, the Scale-Out File Server could be configured with a General Purpose file share to support the File Share Witness resource.
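For reference, if you went the 2-Node route, switching the quorum configuration to Node and File Share Majority is a one-liner in PowerShell (the witness share path here is hypothetical):

PS C:\> Set-ClusterQuorum -NodeAndFileShareMajority "\\SOFS\Witness"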

Recommendation: Since the cluster is not connected to storage, you do not have to run the storage tests in the cluster validation process.

In the interest of highlighting some of the other new features in Windows Server 2012 Failover Clustering, I created the cluster using a Distinguished Name format which provides greater control over the placement of cluster computer objects in a custom Organization Unit (OU) I created in Active Directory. It is recommended that you configure the OU to protect the Failover Cluster computer objects from 'accidental' deletion prior to creating the cluster. To accomplish this, implement a custom Access Control Entry (ACE) on the OU to deny Everyone the right to Delete all child objects.

clip_image002

This configuration on the container automatically checks the Protect object from accidental deletion checkbox on cluster computer objects when they are created.

clip_image004

Specify a Distinguished Name for the Cluster Name when creating the cluster (Create Cluster Wizard).

clip_image006

The Create Cluster report reflects the Active Directory path (container) where the CNO computer object is located.

clip_image008
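As a point of reference, the same Distinguished Name approach works from PowerShell by passing the DN to New-Cluster (the cluster name, OU, and node names below are examples, not the actual lab values):

PS C:\> New-Cluster -Name "CN=HVCluster,OU=Clusters,DC=contoso,DC=com" -Node Node1,Node2,Node3 -NoStorage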

Create a 2-Node Windows Server 2012 Scale-Out File Server

Configure a 2-Node Windows Server 2012 Failover Cluster to provide Scale-Out File Services to the virtual machines hosted by the 3-Node Hyper-V Failover Cluster.

Note: To read about Scale-Out File Services access the TechNet content here - http://technet.microsoft.com/en-us/library/hh831349.aspx

The Scale-Out File Services cluster requires storage to support the Cluster Shared Volumes (CSV) that will host the virtual machine files. To ensure the entire configuration is supported, run a complete cluster validation process, including the storage tests, before creating the cluster. Be sure to create the cluster with sufficient storage to support a Node and Disk Majority Quorum Model (Witness disk required) and the CSV volumes to host the virtual machine files.

 

Note: While a single CSV volume supports multiple virtual machines, a 'best practice' is to place virtual machines across several CSV volumes to distribute the I/O to the backend storage. Additionally, consider enabling CSV caching (scenario dependent). To find out more about CSV Caching, review the Product Team blog on the topic - http://blogs.msdn.com/b/clustering/archive/2012/03/22/10286676.aspx

clip_image010

With the cluster up and running, configure the Scale-Out File Server Role by following these steps:

  1. In Failover Cluster Manager, in the left-hand pane, right-click on Roles and choose Configure Role to start the High Availability Wizard
  2. Review the Before You Begin screen and click Next
  3. In the Select Role screen, choose File Server and click Next
  4. For the File Server Type, choose Scale-Out File server for application data and click Next
  5. Provide a properly formatted NetBIOS name for the Client Access Point and click Next
  6. Review the Confirmation screen information and click Next
  7. Verify the wizard completes and the Role comes Online properly in Failover Cluster Manager
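
If you prefer PowerShell, the role can be configured with a single cmdlet; a sketch, assuming a hypothetical Client Access Point name of SOFS:

# Configure the Scale-Out File Server role; -Name becomes the Client Access Point
Add-ClusterScaleOutFileServerRole -Name SOFS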

A properly configured Scale-Out File Server Role should look something like this -

[Screenshot: Scale-Out File Server Role online in Failover Cluster Manager]

What happens if the Scale-Out File Server Role fails to start? Check the Cluster Events and you may find an Event ID: 1194 indicating a Network Name Resource failure occurred.

[Screenshot: Event ID 1194 - Network Name resource failure in Cluster Events]

The Event Details section provides information for proper corrective action. In this case, since we are placing the cluster computer objects in a custom OU, we need to give the Scale-Out File Server CNO the right to Create Computer Objects. Once this is accomplished, and Active Directory replication has occurred, the Scale-Out File Server Role should start properly. Verify the Role comes online on all nodes in the cluster.
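
One way to grant that right from an elevated prompt, assuming the same hypothetical OU and a Scale-Out File Server CNO named SOFS in a FABRIKAM domain (note the trailing $ on the computer account; verify the syntax with dsacls /?):

# Grant the SOFS computer account the right to create computer objects in the OU
dsacls "OU=Clusters,DC=fabrikam,DC=com" /G "FABRIKAM\SOFS$:CC;computer"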

To review what we have accomplished:

  • Active Directory is configured properly to protect cluster computer objects from accidental deletion
  • A 3-Node Hyper-V Failover Cluster has been created and validated
  • A 2-Node Scale-Out File Server Failover Cluster has been created and validated
  • The Scale-Out File Server CNO permissions have been properly configured on a custom OU

Well CORE Blog fans, that wraps it up for Part 1. Stay tuned for Part 2 where we will:

  • Configure SMB 3.0 shares on the Scale-Out File Server
  • Configure highly available virtual machines in the Hyper-V Failover Cluster using the SMB shares on the Scale-Out File Server Cluster
  • Demonstrate Live Migration of virtual machines in the Hyper-V Failover Cluster

Thanks, and come back soon.

Chuck Timon
Senior Support Escalation Engineer
Microsoft Enterprise Platforms Support
High Availability\Virtualization Team

Just when you thought… (Part 2)


In Part 1, I covered configuring the Hyper-V Failover Cluster and the Scale-Out File Server solution. In Part 2, I will cover:

  • Creating the file shares in the Scale-Out File Server
  • Creating a virtual machine to use the SMB 3.0 shares in the Scale-Out File Server
  • Verifying we can Live Migrate the virtual machines in the Hyper-V Failover Cluster

Creating the File Share

Execute the following steps to create a file share in the Scale-Out File Server:

  1. In Failover Cluster Manager, right-click on the Scale-Out File Server role in the center pane and choose Add File Share. This starts the New Share Wizard
  2. In the Select Profile screen, choose SMB Share - Applications and click Next
  3. For the Share Location, choose one of the CSV Volumes and click Next
  4. Provide a Share Name, verify the path information and click Next
  5. In the Other Settings screen, Enable Continuous Availability is checked by default. Click Next
    Note: Some selections are greyed-out. This is because they are not supported for this share profile in a Failover Cluster
  6. In the Permissions screen, click Customize Permissions. In the Advanced Security Settings screen, note the default NTFS and Share permissions, then add the Computer Accounts for the Hyper-V Failover Cluster nodes to the NTFS permissions for the share and ensure they have Full Control. If the listing does not include the cluster administrator(s), add the account (or a Security Group) and grant it Full Control as well. Click Apply when finished

Complete configuring the file shares.
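
The shares can also be created with the SmbShare cmdlets from a Scale-Out File Server node; a sketch in which the share name, path, scope, and accounts are all assumptions:

# Create a continuously available share scoped to the SOFS Client Access Point
New-SmbShare -Name VMS1 -Path C:\ClusterStorage\Volume1\Shares\VMS1 -ScopeName SOFS -ContinuouslyAvailable $true -FullAccess "FABRIKAM\Fabrikam-N21$","FABRIKAM\Fabrikam-N22$","FABRIKAM\Fabrikam-N23$","FABRIKAM\Domain Admins"

# Mirror the share permissions onto the NTFS ACL of the underlying folder
Set-SmbPathAcl -ShareName VMS1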

[Screenshot: completed file shares listed under the Scale-Out File Server role]

As a test, connect to each of the shares from the Hyper-V Failover Cluster and verify you can write to each location before proceeding to the next step.
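
A quick way to run that test from one of the Hyper-V nodes (the share path is the same assumption used above):

# Write and then remove a test file to confirm access to the share
"write test" | Out-File \\SOFS\VMS1\WriteTest.txt
Remove-Item \\SOFS\VMS1\WriteTest.txt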

Creating a Virtual Machine to use an SMB 3.0 Share

Execute the following steps to create a new virtual machine (a PowerShell sketch follows the list):

  1. On one of the nodes in the Hyper-V Cluster, open Failover Cluster Manager
  2. In the left-hand pane, click on Roles and then in the right-hand Actions pane click on Virtual Machines and choose New Virtual Machine
  3. Choose one of the cluster nodes to be the target for the virtual machine and click OK
  4. This starts the New Virtual Machine Wizard. Review the Before You Begin screen and click Next
  5. In the Specify Name and Location screen, provide a name for the virtual machine, enter a UNC path to a share on the Scale-Out File Server and then click Next

    [Screenshot: Specify Name and Location screen with a UNC path to the Scale-Out File Server share]
  6. Configure memory settings and click Next
  7. Configure network settings and click Next
  8. In the Connect Virtual Hard Disk screen, make a selection and click Next
  9. Review the Summary screen and click Finish
  10. Verify the process completes successfully and click Finish
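
The equivalent from PowerShell might look like the following; the VM name, memory size, and share path are assumptions:

# Create the VM with its files on the SMB 3.0 share, then make it highly available
New-VM -Name VM1 -MemoryStartupBytes 1024MB -Path \\SOFS\VMS1 -NewVHDPath \\SOFS\VMS1\VM1\VM1.vhdx -NewVHDSizeBytes 40GB
Add-ClusterVirtualMachineRole -VMName VM1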

Testing Live Migration

Once all the virtual machines are created, you may want to test Live Migration. Depending on how many simultaneous live migrations you want to support, you may have to modify the Live Migration settings on each of the Hyper-V Failover Cluster nodes. The default is to allow two simultaneous live migrations. Here is a little PowerShell script you can run to take care of the settings for all the nodes in the cluster -

# Prompt once for credentials valid on all Hyper-V nodes
$Cred = Get-Credential

# Raise the simultaneous Live Migration limit to six on each node
Invoke-Command -ComputerName Fabrikam-N21,Fabrikam-N22,Fabrikam-N23 -Credential $Cred -ScriptBlock {Set-VMHost -MaximumVirtualMachineMigrations 6}

In my cluster, I have all the virtual machines running on the same node -

[Screenshot: all virtual machines running on the same Hyper-V cluster node]

I will use a new feature in Windows Server 2012 Failover Clusters, multi-select, and select all of the virtual machines and live migrate them to another node in the cluster -
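
The same mass move can be scripted; a sketch, assuming Fabrikam-N22 is the target node:

# Queue a live migration of every VM group to another node; -Wait 0 returns immediately
Get-ClusterGroup | Where-Object { $_.GroupType -eq "VirtualMachine" } | Move-ClusterVirtualMachineRole -Node Fabrikam-N22 -Wait 0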

[Screenshot: multi-select of virtual machines for Live Migration in Failover Cluster Manager]

Since there are only four virtual machines and the maximum number of simultaneous live migrations is set to six, all of them migrate at once.

[Screenshot: all four live migrations in progress]

If I rerun my script and set the limit back to two, two of the migrations are queued until at least one of the in-progress migrations completes.

[Screenshot: two live migrations queued behind the two in progress]

You can use the Get-SmbSession PowerShell cmdlet on any node in the Scale-Out File Server to determine the number of sessions. For illustration purposes, I have all virtual machines running on the same Hyper-V Failover Cluster node (Fabrikam-N21) and the CSV volumes are running on the same node in the Scale-Out File Server (Fabrikam-N1) -
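
For example, to see which clients are connected and how many files each has open:

# Summarize SMB sessions on this Scale-Out File Server node
Get-SmbSession | Select-Object ClientComputerName, ClientUserName, NumOpens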

[Screenshot: Get-SmbSession output showing all sessions landing on Fabrikam-N1]

Distributing the virtual machines across the multi-node Hyper-V Failover Cluster (Fabrikam-N21, Fabrikam-N22, and Fabrikam-N23) is reflected on the Scale-Out File Server -

[Screenshot: Get-SmbSession output after the virtual machines are distributed across nodes]

Finally, I re-distribute the CSV volumes across the Scale-Out File Server nodes as shown here -

[Screenshot: CSV volumes re-distributed across the Scale-Out File Server nodes]
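
Moving a CSV volume between nodes is a one-liner; a sketch assuming a second Scale-Out File Server node named Fabrikam-N2 and a hypothetical disk name:

# Move ownership of a CSV volume to the other Scale-Out File Server node
Move-ClusterSharedVolume -Name "Cluster Disk 2" -Node Fabrikam-N2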

This is reflected in the Get-SmbSession PowerShell cmdlet output -

[Screenshot: Get-SmbSession output reflecting the re-distributed CSV volumes]

Thanks, and come back again soon.

Chuck Timon
Senior Support Escalation Engineer
Microsoft Enterprise Platforms Support
High Availability\Virtualization Team
