Alert : DFSR - Replication service stopped replication on volume.
Description: Our monitoring agent has detected DFS Replication service stopped replication on volume #:. This occurs when a DFSR JET database is not shut down cleanly and Auto Recovery is disabled. Impact : Replication will be stopped.
Type: Critical Non-Impact Alerts
Category: Active Directory
Knowledge Base Details

Summary

The purpose of this KB is to describe a new behavior in the DFSR service and the new DFSR Event ID 2213. It covers the actions an administrator should take and how to reduce the number of 2213 events logged on a DFSR server. It also covers best practices regarding Auto Recovery from dirty shutdown in light of the latest DFSR patch.

Microsoft introduced a new behavior to the DFS Replication service for Windows Server 2008 R2 via the hotfix published in KB 2663685. After installing KB 2663685 or a later version of DFSRS.EXE on Windows Server 2008 R2, the DFSR service no longer performs automatic recovery of the Extensible Storage Engine (ESE) database when the database suffers a dirty shutdown. When the new behavior is triggered, Event ID 2213 is logged in the DFSR event log. A DFSR administrator must manually resume replication when DFSR detects a dirty shutdown.

Windows Server 2012 uses this behavior by default.

To resolve the alert:
1) Back up all replicated folders on the volume named in the event.
2) Run the WMIC command to recover the DFSR ESE database from the error state. To get the command, open the properties of Event ID 2213, go to Recovery Steps, copy the WMIC command from step 2, and paste it unchanged into an elevated command prompt.

 

Once the recovery steps are complete, DFSR logs informational Event ID 2214 on the server. You can use this event as the closing condition in the monitoring system.

More information

The DFSR service maintains one ESE database per volume on volumes that host a replicated folder. DFSR uses this database to store metadata about each file and folder in the replicated folder. The integrity of the database must be maintained to ensure proper service function. 

When DFSR is notified that the service must shut down, it begins to commit all outstanding changes to the ESE database. A dirty shutdown occurs when the DFSR service cannot commit all pending changes to the DFSR ESE database before the service is terminated. Upon startup, the DFSR service checks the integrity of the database.

Dirty shutdown recovery can cause high backlogs, which in turn may cause replication conflicts. In some cases, prior to the fix published in KB 2780453, the winning file may not be the version that the end user wants. The change to stop replication on dirty shutdown was intended as a safeguard: it gives administrators the opportunity to back up the data, capturing changes made since the last backup, before resuming replication.

As of KB 2780453, it is no longer necessary to pause replication on dirty shutdown. Windows Server 2012 includes the fix from KB 2780453 in the default media.

Best practices

Best practices for auto-recovery based on server role, OS and patch level

Role                    Windows Server 2008 R2   Windows Server 2008 R2 with KB 2780453 installed   Windows Server 2012
DC                      On                       On                                                 On
Cluster Node            On                       On                                                 On
Writeable DFSR Server   Off                      On                                                 On
Read Only DFSR Server   On                       On                                                 On

How to disable the Stop Replication on Auto Recovery behavior

To have DFSR perform auto-recovery when a dirty database shutdown is detected, edit the following registry value after KB 2780453 is installed on Windows Server 2008 R2. You can deploy this change on all versions of Windows Server 2012. If the value does not exist, create it.

Key: HKLM\System\CurrentControlSet\Services\DFSR\Parameters
Value: StopReplicationOnAutoRecovery
Type: Dword
Data: 0
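
As one way to script this change, the sketch below builds the equivalent `reg add` command line as a string so it can be reviewed before it is run from an elevated command prompt. The helper name `reg_add_command` is hypothetical. The key, value name, type, and data come from the listing above; the `/f` switch (overwrite without prompting) is an added assumption about how you want to deploy the change.

```python
def reg_add_command(key: str, name: str, reg_type: str, data: str) -> str:
    """Return a `reg add` command line for one registry value (hypothetical helper)."""
    # /v = value name, /t = value type, /d = data, /f = overwrite without prompting
    return f'reg add "{key}" /v {name} /t {reg_type} /d {data} /f'


DFSR_PARAMETERS = r"HKLM\System\CurrentControlSet\Services\DFSR\Parameters"

# Data 0 lets DFSR auto-recover after a dirty shutdown instead of stopping replication.
command = reg_add_command(
    DFSR_PARAMETERS, "StopReplicationOnAutoRecovery", "REG_DWORD", "0"
)
print(command)
```

The function only produces text. Running the resulting command is left to the administrator or to whatever deployment tool is already in use.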

How to Resume Replication after Event 2213 is logged

To resume replication after it has been paused, an administrator must run a WMIC command. The exact command to run is provided in the text of Event ID 2213.

Step 1: Event ID 2213 is logged on your DFSR server.

Event Type: Warning
Event Source: DFSR 
Event Category: Disk 
Event ID: 2213 
Description: 
"The DFS Replication service stopped replication on volume C:. This occurs when a DFSR JET database is not shut down cleanly and Auto Recovery is disabled. To resolve this issue, back up the files in the affected replicated folders, and then use the ResumeReplication WMI method to resume replication. 

Additional Information: 
Volume: C: 
GUID: E18D8280-2379-11E2-A5A0-806E6F6E6963



Recovery Steps

  1. Back up the files in all replicated folders on the volume. Failure to do so may result in data loss due to unexpected conflict resolution during the recovery of the replicated folders.
  2. To resume the replication for this volume, use the WMI method ResumeReplication of the DfsrVolumeConfig class. For example, from an elevated command prompt, type the following command: 

    wmic /namespace:\\root\microsoftdfs path dfsrVolumeConfig where volumeGuid="E18D8280-2379-11E2-A5A0-806E6F6E6963" call ResumeReplication

Step 2: Copy the WMIC command from step 2 in Event ID 2213 and paste it into an elevated command prompt. The results of a successful run of the command will look like this:

wmic /namespace:\\root\microsoftdfs path dfsrVolumeConfig where volumeGuid="F1CF316E-6A40-11E2-A826-00155D41C919" call ResumeReplication

Executing(\\WW2008R2DC1\root\microsoftdfs:DfsrVolumeConfig.VolumeGuid="F1CF316E-6A40-11E2-A826-00155D41C919")->ResumeReplication()
Method execution successful.
Out Parameters:
instance of __PARAMETERS
{
    ReturnValue = 0;
};

Note for PowerShell users: You will need to wrap the where clause in single quotation marks to run the WMIC command from PowerShell:

wmic /namespace:\\root\microsoftdfs path dfsrVolumeConfig where 'volumeGuid="F1CF316E-6A40-11E2-A826-00155D41C919"' call ResumeReplication
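
If the GUID is taken from a monitoring alert rather than copied from the event text, the command above can be assembled programmatically. The following is a minimal sketch, not an official tool: the function names `resume_replication_command` and `resume_succeeded` are hypothetical, and treating `ReturnValue = 0` as success is an assumption based only on the sample output shown in Step 2.

```python
import re

# Standard 8-4-4-4-12 hexadecimal GUID layout, as shown in Event ID 2213.
GUID_PATTERN = re.compile(r"^[0-9A-Fa-f]{8}(-[0-9A-Fa-f]{4}){3}-[0-9A-Fa-f]{12}$")


def resume_replication_command(volume_guid: str, powershell: bool = False) -> str:
    """Build the ResumeReplication WMIC command for one volume GUID."""
    guid = volume_guid.strip()
    if not GUID_PATTERN.match(guid):
        raise ValueError(f"not a volume GUID: {volume_guid!r}")
    where = f'volumeGuid="{guid}"'
    if powershell:
        # PowerShell needs the where clause wrapped in single quotation marks.
        where = f"'{where}'"
    return (
        r"wmic /namespace:\\root\microsoftdfs path dfsrVolumeConfig "
        f"where {where} call ResumeReplication"
    )


def resume_succeeded(wmic_output: str) -> bool:
    """Return True when the captured WMIC output reports ReturnValue = 0."""
    match = re.search(r"ReturnValue\s*=\s*(\d+)", wmic_output)
    return match is not None and match.group(1) == "0"


print(resume_replication_command("F1CF316E-6A40-11E2-A826-00155D41C919"))
```

Validating the GUID before building the command avoids pasting a truncated or malformed identifier into an elevated prompt.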


Step 3: Check that Event IDs 2212 and 2214 have been logged on the server on which you ran the resume replication command.

Event Type: Warning
Event Source: DFSR 
Event Category: Disk 
Event ID: 2212 
Description: 
"The DFS Replication service has detected an unexpected shutdown on volume E:. This can occur if the service terminated abnormally (due to a power loss, for example) or an error occurred on the volume. The service has automatically initiated a recovery process. The service will rebuild the database if it determines it cannot reliably recover. No user action is required.

Additional Information: 
Volume: E: 
GUID: F1CF316E-6A40-11E2-A826-00155D41C919"



Event Type: Warning
Event Source: DFSR 
Event Category: Disk 
Event ID: 2214
Description:
"The DFS Replication service successfully recovered from an unexpected shutdown on volume E:. This can occur if the service terminated abnormally (due to a power loss, for example) or an error occurred on the volume. No user action is required.

Additional Information: 
Volume: E: 
GUID: F1CF316E-6A40-11E2-A826-00155D41C919"
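
The event sequence described in Steps 1 to 3 can be turned into a simple open/close rule for a monitoring system. The sketch below is illustrative only: the function name `dfsr_alert_state` and the state labels are hypothetical, and the input is assumed to be the DFSR event IDs for a single volume, already filtered and sorted in chronological order.

```python
def dfsr_alert_state(event_ids):
    """Classify one volume from its DFSR event IDs in chronological order."""
    state = "ok"
    for event_id in event_ids:
        if event_id == 2213:
            state = "stopped"      # replication stopped, manual resume needed
        elif event_id == 2212:
            state = "recovering"   # recovery from dirty shutdown has started
        elif event_id == 2214:
            state = "ok"           # recovery finished, alert can be closed
    return state


print(dfsr_alert_state([2213, 2212, 2214]))
```

Only the most recent relevant event decides the state, so a new 2213 after an earlier 2214 reopens the alert.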



Steps to reduce the chances of having a dirty shutdown

In the Windows operating system, a service has 30 seconds to shut down once it receives a shutdown notification. After the 30 seconds expire, the Service Control Manager (SCM) forces the service to shut down. A busy DFSR hub server may need more than 30 seconds to commit outstanding changes to the database. If the DFSR service does not commit all changes within the 30 seconds allotted by the Service Control Manager, it is forcibly closed, which forces a dirty shutdown recovery.

Power outages or any other hard reboot of a DFSR server can cause a dirty shutdown recovery to occur. To reduce the chances of this occurring, make sure your DFSR servers are connected to a UPS so that they can shut down gracefully.

Extending Service Shutdown Times 

On DFSR servers that need more than 30 seconds to shut down you can use the WaitToKillServiceTimeout value to extend the amount of time allowed for all services to shut down. 

Typical symptoms that a DFSR server needs more time to shut down are that the server logs events 2212 and 2214 on most reboots or restarts of the service or, when Auto Recovery from dirty shutdown is disabled, that event 2213 is logged on every reboot or restart of the DFSR service.

Path: HKLM\SYSTEM\CurrentControlSet\Control
Value: WaitToKillServiceTimeout
Type: String
Data: 300000  

Note: This value is in milliseconds. The example provides 5 minutes of shutdown time. The value can be increased or decreased as needed. 

This value affects all services, not just DFSR. It is recommended to set this value to the lowest setting that still allows DFSR enough time to shut down cleanly. You can determine how long your DFSR service needs to shut down using the following process:
  1. Add the registry value WaitToKillServiceTimeout with a setting of 300000 milliseconds (5 minutes). Reboot the server to enable the setting. (Important: See the note below about installing KB 2549760.)
  2. Monitor the next few reboots of the server for DFSR events 1006 (DFSR is stopping) and 1008 (DFSR stopped), and note the time elapsed between events 1006 and 1008.
  3. Adjust the time allowed for shutdown by WaitToKillServiceTimeout so that it is closer to the actual time DFSR needs to shut down cleanly.
Notes Regarding WaitToKillServiceTimeOut:
  • Rebooting the server or restarting DFSR a few times in a row will not give you a good sample of the amount of time DFSR needs to shut down. You must allow the service time to run to accumulate pending database transactions.
  • WaitToKillServiceTimeout has a maximum value of 1 hour. If the setting exceeds one hour, SCM will use the default setting of 30 seconds for service shutdown.
  • To ensure proper function of SCM with regard to WaitToKillServiceTimeout, make sure that KB 2549760 is installed on Windows Server 2008 R2.
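
The tuning process above amounts to a small calculation: take the longest observed gap between events 1006 and 1008, add some headroom, and keep the result between the 30-second default and the 1-hour maximum. The sketch below assumes the event timestamps have already been collected as pairs. The function name `suggest_timeout_ms` and the 1.5x `margin` are hypothetical choices for illustration, not values from Microsoft.

```python
import math
from datetime import datetime

DEFAULT_MS = 30_000     # default service shutdown time (30 seconds)
MAX_MS = 3_600_000      # WaitToKillServiceTimeout maximum (1 hour)


def suggest_timeout_ms(shutdowns, margin=1.5):
    """Suggest a WaitToKillServiceTimeout value in milliseconds.

    shutdowns: list of (time of event 1006, time of event 1008) pairs.
    margin: assumed headroom factor applied to the longest observed shutdown.
    """
    if not shutdowns:
        raise ValueError("at least one observed shutdown is required")
    longest = max(
        (stopped - stopping).total_seconds() for stopping, stopped in shutdowns
    )
    # Round up to whole seconds, then clamp between the default and the maximum.
    suggested = math.ceil(longest * margin) * 1000
    return min(max(suggested, DEFAULT_MS), MAX_MS)


observed = [
    (datetime(2013, 5, 1, 10, 0, 0), datetime(2013, 5, 1, 10, 1, 20)),  # 80 s
    (datetime(2013, 5, 8, 22, 0, 0), datetime(2013, 5, 8, 22, 0, 40)),  # 40 s
]
print(suggest_timeout_ms(observed))  # 120000
```

Because the registry value is of type String, convert the result with `str()` before writing it. The sample timestamps are made up for the example.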

 

Reference Link : 

http://support.microsoft.com/kb/2846759