Microsoft introduced a new built-in monitoring system called Managed Availability in Exchange 2013, which automatically takes recovery actions for unhealthy services within the Exchange organization.
Microsoft has been operating a cloud version of Exchange since 2007 and has put all of that operational knowledge into Managed Availability. Managed Availability is a cloud-trained system that is based on the end-user experience and built around recovery-oriented computing.
Managed Availability doesn’t mean you no longer have to monitor your on-premises or hybrid Exchange environment; in fact, it’s just the opposite. However, the long and complex PowerShell cmdlets used to monitor Exchange (which we will look at in more detail later) are not the most effective way to do so.
Exchange 2013, or more precisely the Microsoft Exchange Diagnostics service (EDS), collects a lot of performance data by default: over 3,000 performance counters, kept for seven days. Data is collected in the folder %Exchange Install Path%\Logging\Diagnostics\PerformanceLogsToBeProcessed and merged into the daily performance log on a regular basis by the Microsoft Exchange Diagnostics service. The daily logs are stored under %Exchange Install Path%\Logging\Diagnostics\DailyPerformanceLogs as .blg files, the binary log format used by PerfMon. Managed Availability uses these files, among others, to track the health of system components. By default, the performance data is kept for 7 days or until 5 GB of data is reached. You can change these settings in the file Microsoft.Exchange.Diagnostics.Service.exe.config, located in the bin directory of your Exchange installation path:
<add Name="DailyPerformanceLogs" LogDataLoss="True" MaxSize="5120" MaxSizeDatacenter="2048" MaxAge="7.00:00:00" CheckInterval="08:00:00" />
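To make the retention settings easier to read, here is a minimal Python sketch that parses the element quoted above and extracts the size and age limits. The attribute values are copied from that line (with straight quotes so the XML is well-formed). The helper names `parse_timespan` and `read_retention` are made up for this illustration, and treating MaxSize as megabytes is an assumption based on the 5 GB default mentioned above.

```python
import xml.etree.ElementTree as ET
from datetime import timedelta

# The element from the config file, typed with straight quotes.
CONFIG_LINE = (
    '<add Name="DailyPerformanceLogs" LogDataLoss="True" MaxSize="5120" '
    'MaxSizeDatacenter="2048" MaxAge="7.00:00:00" CheckInterval="08:00:00" />'
)


def parse_timespan(value: str) -> timedelta:
    """Parse a TimeSpan string of the form [d.]hh:mm:ss (whole seconds only)."""
    days = 0
    # An optional day count is separated from the hours by a dot.
    if "." in value.split(":")[0]:
        day_part, value = value.split(".", 1)
        days = int(day_part)
    hours, minutes, seconds = (int(part) for part in value.split(":"))
    return timedelta(days=days, hours=hours, minutes=minutes, seconds=seconds)


def read_retention(xml_line: str) -> dict:
    """Return the retention settings of one <add .../> element (hypothetical helper)."""
    element = ET.fromstring(xml_line)
    return {
        "name": element.get("Name"),
        # Assumption: MaxSize is expressed in MB (5120 MB = 5 GB).
        "max_size_mb": int(element.get("MaxSize")),
        "max_age": parse_timespan(element.get("MaxAge")),
        "check_interval": parse_timespan(element.get("CheckInterval")),
    }


if __name__ == "__main__":
    for key, value in read_retention(CONFIG_LINE).items():
        print(f"{key}: {value}")
```

A script like this only reads the values; changing them still means editing the config file itself.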
Managed Availability has multiple HealthSet models that are responsible for different services, such as:
- Client Protocols: OWA/ECP, ActiveSync, IMAP/POP, UM, Outlook, Compliance
- Storage: DataProtection, Clustering, PublicFolders, SiteMailbox, Store
- Mail Flow: FrontEndTransport, HubTransport, MailboxTransport, Deployment
- Migration: MigrationMonitor, MRS
- Fabric: Diskspace, MailboxSpace, ActiveDirectory, UserThrottling
The main constituents of Managed Availability are Probes, Monitors, and Responders.
- Probes run every few minutes against different services, check their health, and collect data from the server. These results flow into the Monitoring component of Managed Availability. An Exchange 2013 multi-role server runs hundreds of Probes, and in most cases these Probes are not directly discoverable: most of them are defined within the Exchange program code and cannot be changed. For example, customers reported that the AutoDiscoverSelfTestProbe failed when the ExternalUrl for the EWS virtual directory wasn’t set, and there was no way to change the probe settings; Microsoft resolved this issue in Cumulative Update 6. Each Probe writes an informational event to the Microsoft.Exchange.ActiveMonitoring\ProbeResult crimson channel with one of the following result types:
- 1 = Timeout
- 2 = Poisoned
- 3 = Succeeded
- 4 = Failed
- 5 = Quarantined
- 6 = Rejected
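When you export probe results and only see the numeric codes, a small lookup table helps. The following Python sketch maps the six codes listed above to their names and tallies a list of codes. The class name `ProbeResultType` and the function `summarize` are illustrative choices, not identifiers taken from Exchange.

```python
from collections import Counter
from enum import IntEnum


class ProbeResultType(IntEnum):
    """Result type codes as listed above (the class name is hypothetical)."""
    TIMEOUT = 1
    POISONED = 2
    SUCCEEDED = 3
    FAILED = 4
    QUARANTINED = 5
    REJECTED = 6


def summarize(result_codes):
    """Count how often each result type occurs in a sequence of numeric codes."""
    return Counter(ProbeResultType(code).name for code in result_codes)


if __name__ == "__main__":
    # Made-up sample codes, only to show the call.
    print(summarize([3, 3, 4, 1]))
```

An unknown code raises a `ValueError`, which is deliberate in this sketch: it is better to notice an unexpected value than to silently mislabel it.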
Probes are divided into three categories:
- Recurring Probes: system-performed tests of the end-to-end user experience, such as OWA connectivity.
- Notifications: perform their own monitoring outside the health manager framework and write probe results directly. For example, the MSExchangeDAGMgmt service logs a probe result without Managed Availability.
- Checks: collect data from performance counters and log events if the defined thresholds are exceeded or not met.
- Monitors are the central part of Managed Availability. All collected server data is examined to determine if action needs to be taken based on a predefined rule set within the Monitors. Nearly all Monitors collect three types of data:
- Direct notifications: Monitors become Unhealthy if a direct notification, for example from a service, changes the Monitor state
- Probe results: Monitors become Unhealthy if a Probe fails
- Performance counters: Monitors become Unhealthy if a performance counter is higher or lower than the defined threshold
- Depending on the issue, a Monitor can either initiate a Responder or escalate the issue via an event log entry. A Monitor can be in one of the following states:
- Healthy: all collected Probe data is within the normal range
- Unhealthy: an issue was detected, and either a recovery process was started or the issue was escalated
- Degraded: the Monitor has been in an unhealthy state for less than 60 seconds
- Disabled: the Monitor was manually disabled by an administrator
- Unavailable: the Microsoft Exchange Health service doesn’t get a query response from the Monitor
- Repairing: set to inform Managed Availability (or monitoring software) that corrective actions are in progress
Many Monitors require multiple probe failures before they become Unhealthy; these high thresholds keep Managed Availability and its Responders from taking unnecessary recovery actions. For problems that require manual intervention, take a look at the Microsoft.Exchange.ManagedAvailability\Monitoring crimson channel.
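The interplay of failure thresholds and the Healthy, Degraded, and Unhealthy states can be pictured with a toy model. The Python sketch below is not how Exchange implements Monitors; the class `SketchMonitor`, the threshold of three consecutive failures, and the rule that a single success resets the Monitor are all assumptions made for illustration. Only the 60-second Degraded window comes from the state list above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SketchMonitor:
    """Toy model of a Monitor; names and defaults are hypothetical."""
    failure_threshold: int = 3       # assumed number of consecutive failures
    degraded_window: float = 60.0    # seconds spent in Degraded before Unhealthy
    consecutive_failures: int = 0
    unhealthy_since: Optional[float] = None

    def record(self, succeeded: bool, now: float) -> None:
        """Feed one probe result, with `now` given in seconds."""
        if succeeded:
            # Simplification: one success clears the failure streak.
            self.consecutive_failures = 0
            self.unhealthy_since = None
            return
        self.consecutive_failures += 1
        threshold_reached = self.consecutive_failures >= self.failure_threshold
        if threshold_reached and self.unhealthy_since is None:
            self.unhealthy_since = now

    def state(self, now: float) -> str:
        """Return Healthy, Degraded, or Unhealthy for the given time."""
        if self.unhealthy_since is None:
            return "Healthy"
        if now - self.unhealthy_since < self.degraded_window:
            return "Degraded"
        return "Unhealthy"


if __name__ == "__main__":
    monitor = SketchMonitor()
    for t in (0, 10, 20):
        monitor.record(False, t)
        print(t, monitor.state(t))
    print(90, monitor.state(90))
```

The point of the threshold is visible in the first two iterations: isolated probe failures leave the Monitor Healthy, so no Responder would be triggered by a single hiccup.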
- Responders carry out the recovery actions triggered by an Unhealthy Monitor, such as resetting an IIS application pool, initiating a database failover, or restarting a server. Managed Availability uses the following Responder types:
- Restart Responder: terminates and restarts a service
- Reset AppPool Responder: recycles an IIS AppPool
- Failover Responder: performs a mailbox database or server failover
- Bugcheck Responder: initiates a bug check of the server (forcing a reboot)
- Offline Responder: takes a protocol on a server, such as mapi/http, out of service so that it rejects client requests
- Online Responder: puts a protocol on a server, such as mapi/http, back into production so that it accepts client requests
- Escalate Responder: writes an event log entry to inform an administrator
- Specialized Component Responder: specialized Responders that are unique to their component
If you would like to see all recovery actions taken by the Managed Availability Responders, view the Microsoft.Exchange.ManagedAvailability\RecoveryActionResults crimson channel.
This concludes part one of this article. In the second part, we will take a more practical approach to Managed Availability. Using PowerShell, we will show you how to retrieve useful information from the massive amounts of data that Managed Availability collects about your environment.