Problem Detection and Analysis

What is Problem Detection and Analysis?

Dynatrace uses Davis for problem detection and analysis. Davis is an AI causation engine that automatically detects performance anomalies in applications, services, and infrastructure. You can then configure alerting and notification rules to notify the relevant teams when something is wrong.

Problem Detection and Analysis Cycle:

  1. Problem is created
  2. Alerting profile is triggered
  3. Notification is sent out

Options/Features

Dynatrace measures incoming traffic levels against defined thresholds to determine when a detected slowdown or error rate increase justifies the generation of a new problem event.

Dynatrace uses two types of thresholds: automated baselines and built-in static thresholds.

  • Automated baselines: Multidimensional baselining automatically detects individual reference values that adapt over time. These values are then used to cope with dynamic changes in your application or service response times, error rates, and load.
  • Built-in static thresholds: These are used for all infrastructure events, such as high CPU, low disk space, or low memory. This is a more straightforward approach with less of a learning period.
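
The difference between the two threshold types can be illustrated with a minimal Python sketch. It is not Davis's actual algorithm: the exponentially weighted reference value, the alpha smoothing factor, the 50% tolerance, and the sample data are all assumptions chosen for illustration.

```python
def static_breaches(samples, threshold):
    """Flag samples above a fixed threshold that never changes."""
    return [value > threshold for value in samples]


def adaptive_breaches(samples, alpha=0.2, tolerance=0.5):
    """Flag samples that exceed a reference value which adapts over time.

    The reference is an exponentially weighted moving average (an assumed
    stand-in for a learned baseline). A sample breaches when it is more
    than `tolerance` (here 50%) above the current reference.
    """
    reference = samples[0]
    flags = []
    for value in samples:
        flags.append(value > reference * (1 + tolerance))
        # Update the reference after the check so it follows gradual change.
        reference = alpha * value + (1 - alpha) * reference
    return flags


# Response times (ms) that drift upward slowly, then spike once.
response_times = [100, 105, 110, 120, 130, 140, 150, 400]
static_flags = static_breaches(response_times, threshold=125)
adaptive_flags = adaptive_breaches(response_times)
```

With this data, the fixed threshold flags every sample from the fifth one onward, while the adapting reference tolerates the gradual drift and flags only the final spike.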

After a problem has been detected, Dynatrace offers the impact analysis and the business impact analysis on the problems overview page, which allows you to analyze the consequences.

Recommendation

Dynatrace recommends letting Davis monitor your environment and automatically determine baselines for RUM and Service data.

  • Special cases (Key requests, SLA tracking, etc.) exist to give you complete control
  • Keep Static Thresholds available for infrastructure entities so that they can be customized to fit your environment

Baseline or Static Threshold for anomaly detection?

Understand which key metrics fall under which threshold type by default:

  • Automatic Baselines include Application and Service Layer metrics
  • Built-in Static Thresholds include Process, Host, and Datacenter metrics
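
As a quick reference, the default mapping above can be written as a small Python lookup. The enum, dictionary, and function names are hypothetical and exist only for this sketch. They are not part of any Dynatrace API.

```python
from enum import Enum


class Detection(Enum):
    AUTOMATED_BASELINE = "automated baseline"
    STATIC_THRESHOLD = "built-in static threshold"


# Default detection type per layer, as listed in the bullets above.
DEFAULT_DETECTION = {
    "application": Detection.AUTOMATED_BASELINE,
    "service": Detection.AUTOMATED_BASELINE,
    "process": Detection.STATIC_THRESHOLD,
    "host": Detection.STATIC_THRESHOLD,
    "datacenter": Detection.STATIC_THRESHOLD,
}


def default_detection(layer: str) -> Detection:
    """Return the default detection type for a layer name (case-insensitive)."""
    return DEFAULT_DETECTION[layer.strip().lower()]
```

For example, default_detection("Service") returns Detection.AUTOMATED_BASELINE, and default_detection("host") returns Detection.STATIC_THRESHOLD.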

Follow the structure of Smartscape, where the different entities are analyzed and mapped out. You can change these thresholds, but it is recommended to let Davis determine the automated baselines.

Setup and Configuration

Configuration

Metric breach configuration

Create a custom event:

  1. In the navigation menu, go to Manage > Settings > Anomaly detection > Metric events
  2. Select Add metric event
  3. Select the metric for your event
    • Metric Key - enter name of metric
    • Metric selector - enter query
  4. Select the type of aggregation
  5. (Optional) Add rule-based entity filters by selecting Advanced entity settings and choosing an entity type from the dropdown
  6. Define the Monitoring strategy
    1. Choose a strategy
      • Auto-adaptive baseline: Dynatrace calculates the threshold automatically
      • Static threshold: the threshold doesn’t change
    2. Fill in the settings for the chosen strategy
      • Auto-adaptive: you decide how many times the signal fluctuation is added to the baseline
      • Static: you decide the threshold value
    3. Define the sliding window for comparison, which sets how often the threshold must be violated within a certain time window to raise an event (the dropdown lets you choose Alert or Do not alert)
  7. Select the timeframe of the preview. You can preview alerts over 12 hours, one day, or seven days to see how effective your configuration is
  8. Select a title for the event
  9. Create a meaningful event message in the Event description
  10. Select Create custom event for alerting to save the event
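
The monitoring strategy settings from step 6 can be sketched in Python. This is an illustration under stated assumptions, not Dynatrace's implementation: the function names, the reading of the sliding window as "at least N violating samples among the most recent M samples", and the sample values are all hypothetical.

```python
from collections import deque


def adaptive_threshold(baseline, fluctuation, multiplier):
    """Auto-adaptive idea: baseline plus N times the signal fluctuation."""
    return baseline + multiplier * fluctuation


def first_alert_index(samples, threshold, violating, window):
    """Return the index at which an event would be raised, or None.

    An event is raised once at least `violating` of the most recent
    `window` samples exceed `threshold` (assumed sliding-window rule).
    """
    recent = deque(maxlen=window)  # keeps only the last `window` results
    for index, value in enumerate(samples):
        recent.append(value > threshold)
        if sum(recent) >= violating:
            return index
    return None


cpu = [40, 95, 50, 96, 97, 98, 60]

# Static strategy: the threshold value is entered directly.
static_alert = first_alert_index(cpu, threshold=90, violating=3, window=5)

# Auto-adaptive strategy: the threshold is derived from the baseline.
adaptive_alert = first_alert_index(
    cpu,
    adaptive_threshold(baseline=45, fluctuation=10, multiplier=5),
    violating=3,
    window=5,
)
```

In this example the static threshold of 90 raises an event at the fifth sample (index 4), while the derived threshold of 95 needs one more violating sample and raises it at index 5. A single spike never raises an event because it cannot reach three violations within the window.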

For more information, see the following pages in the Dynatrace Documentation:

  • Auto-adaptive baselining for custom metric events
  • Static thresholds for custom metric events

Create an Alerting Profile

For this example, pay particular attention to step 5: you will use the event filter to select the custom event you just created, which tells the alerting profile to match on that specific event.

For further information on alerting profiles, see Alerting Profiles.

  1. Go to Settings > Alerting > Alerting profiles
  2. In the Create new alerting profile field, type a name for the new profile
  3. Define the management zone filter
    • This causes the alerting profile to only evaluate data coming from the specified management zone
    • The default is set to All, but a filter should be applied in most cases, which reduces the profile scope to the scope of your teams’ responsibility
    • Management zones can overlap (multiple filters will be applied if a problem is detected on a service that is defined within multiple management zones)
  4. Define the severity-level rules
    • You can specify up to 20 severity rules
    • Rules are combined by the OR logic, so an event fulfilling any of the rules triggers a notification
    • The following criteria can be used when applying filters:
      • The severity level
      • How long the problem is open before an alert is sent out
      • (Optional) Monitored entities that have any or all of the specified tags
    • The criteria within a rule are combined by the AND logic, so all of them must be fulfilled for the rule to be invoked
  5. Define the event filter
    • Select Add event filter
    • Select Custom from the dropdown
    • Enter the Title of the custom event you just created
    • Criteria can be inverted using the negate option (this turns begins with into does not begin with)
    • Rules are combined by the following logic:
      • Rules that contain negated criteria are grouped by the AND logic
      • All other rules are grouped by the OR logic
      • The two groups (negated and non-negated) are grouped by the AND logic
  6. Review the summary of criteria for the new alerting profile
  7. Select Save changes
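
The combination logic described in steps 4 and 5 can be sketched in Python. The class names, field names, and severity labels are hypothetical, and only a "begins with" title filter is modeled. Treating an empty group of non-negated filters as "no restriction" is also an assumption of this sketch.

```python
from dataclasses import dataclass


@dataclass
class SeverityRule:
    severity: str                   # illustrative label, e.g. "PERFORMANCE"
    min_open_minutes: int = 0       # how long the problem must be open
    tags: frozenset = frozenset()   # optional tag criterion
    require_all_tags: bool = False  # True = all tags, False = any tag

    def matches(self, problem):
        # Criteria inside one rule are combined by AND.
        if problem["severity"] != self.severity:
            return False
        if problem["open_minutes"] < self.min_open_minutes:
            return False
        if not self.tags:
            return True
        entity_tags = set(problem["tags"])
        if self.require_all_tags:
            return self.tags <= entity_tags
        return bool(self.tags & entity_tags)


@dataclass
class EventFilter:
    title_begins_with: str
    negate: bool = False  # turns "begins with" into "does not begin with"

    def matches(self, problem):
        hit = problem["title"].startswith(self.title_begins_with)
        return not hit if self.negate else hit


def profile_matches(problem, severity_rules, event_filters):
    # Severity rules are combined by OR: one matching rule is enough.
    if not any(rule.matches(problem) for rule in severity_rules):
        return False
    negated = [f for f in event_filters if f.negate]
    plain = [f for f in event_filters if not f.negate]
    # Negated filters: AND. Other filters: OR. The two groups: AND.
    negated_ok = all(f.matches(problem) for f in negated)
    plain_ok = any(f.matches(problem) for f in plain) if plain else True
    return negated_ok and plain_ok


problem = {"title": "Checkout latency high", "severity": "PERFORMANCE",
           "open_minutes": 12, "tags": ["team-a"]}
rules = [SeverityRule("PERFORMANCE", min_open_minutes=10),
         SeverityRule("AVAILABILITY")]
filters = [EventFilter("Checkout"), EventFilter("Test", negate=True)]
matched = profile_matches(problem, rules, filters)
```

Here the problem matches: the first severity rule is satisfied, the title begins with "Checkout", and it does not begin with "Test". A problem titled "Test run failed" would be rejected by the negated filter even though a severity rule matches.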

Add a Problem Notification:

  1. Go to Manage > Settings > Integration > Problem notifications
  2. Select the Notification type from the dropdown (the following steps apply to a webhook notification)
  3. Define the Display name of the notification
  4. In the Webhook URL bar, enter your webhook endpoint
  5. Decide whether you want to Accept any SSL certificate, and whether or not you want to Call webhook if new events merge into existing problems
  6. Add any custom HTTP headers you want to pass to your webhook
  7. Select the Alerting profile from the dropdown
    1. This is the alerting profile you created in the previous section
  8. Select Send a test notification
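
To check what arrives at the webhook endpoint, a minimal receiver can be sketched with the Python standard library. The JSON keys State, ProblemID, and ProblemTitle are assumptions: they are only present if the payload configured for the webhook emits them under these names. The port is likewise hypothetical.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def summarize_notification(body: bytes) -> str:
    """Turn a JSON notification body into a one-line summary.

    The keys read here are assumed names. Adjust them to match the
    payload your webhook integration is configured to send.
    """
    payload = json.loads(body)
    state = payload.get("State", "UNKNOWN")
    title = payload.get("ProblemTitle", "(no title)")
    problem_id = payload.get("ProblemID", "?")
    return f"[{state}] {problem_id}: {title}"


class NotificationHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        try:
            summary = summarize_notification(self.rfile.read(length))
        except (ValueError, AttributeError):
            self.send_response(400)  # body was not a JSON object
            self.end_headers()
            return
        print(summary)
        self.send_response(200)
        self.end_headers()


def serve(port: int = 8080) -> None:
    """Run the receiver until interrupted (hypothetical port)."""
    HTTPServer(("", port), NotificationHandler).serve_forever()
```

Calling serve() starts the receiver. Selecting Send a test notification in step 8 should then produce one printed summary line, provided the Webhook URL points at this process.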

Usage

Problem status

Dynatrace provides a Problems page that lists all problems, whether open or closed.

  1. To get to the Problems page, click the red box with a number in it in the upper-right corner
    • This number indicates the number of problems that are currently open
    • Alternatively, this page can be accessed by selecting Problems from the navigation menu. You can also find problems within individual entity pages.
  2. There are list options for filtering by Status, Severity, Impact level, and Maintenance
  3. You can also use the search bar for extra filtering options, such as tags, alerting profiles, and text

Opening a problem:

  1. Click on the blue title of any problem and it will bring you to the Problem overview page
  2. This page provides three helpful sections
    • The top section provides the number of Applications, Services, and/or Infrastructure components that are affected by the problem
    • The section on the left provides the Impact analysis, which includes details about the direct consequences of the problem
      • This may also include the business impact
    • The final section on the right provides the Root cause analysis, which includes details about the underlying root of the problem
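
The Status, Severity, and Impact level filters on the Problems page have a rough programmatic counterpart. The sketch below only builds a query URL and sends nothing. It assumes the Problems API v2 endpoint /api/v2/problems and its problemSelector parameter with status, severityLevel, and impactLevel criteria, so verify these against the API documentation for your environment. The function name and base URL are hypothetical.

```python
from urllib.parse import urlencode


def problems_query_url(base_url, status=None, severity=None, impact=None):
    """Build a problems query URL mirroring the Status/Severity/Impact filters.

    Endpoint path and selector syntax are assumptions to verify against
    your environment's API documentation.
    """
    criteria = []
    if status:
        criteria.append(f'status("{status}")')
    if severity:
        criteria.append(f'severityLevel("{severity}")')
    if impact:
        criteria.append(f'impactLevel("{impact}")')
    url = f"{base_url.rstrip('/')}/api/v2/problems"
    if criteria:
        # Criteria are joined with commas and URL-encoded as one parameter.
        url += "?" + urlencode({"problemSelector": ",".join(criteria)})
    return url


open_problems_url = problems_query_url("https://example.invalid/", status="open")
```

Sending the request additionally requires an API token, typically passed in an Authorization header. That part is omitted here because it depends on how your environment is set up.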

FAQ

How can I find the root cause of a problem?

  • Dynatrace offers a visual resolution path that lets you replay the sequence of detected events that led up to, or are correlated with, a problem

How do I access the problems overview page?

  • Select Problems from the navigation menu, and then select the problem you want to analyze. The problem can also be selected on the individual entity page that is being affected.

Is there a time limit for conducting problem analysis?

  • All detailed problem analysis and organization must be performed within 14 days of the problem detection notification, as the transaction storage is limited.

How will I know if a problem is closed?

  • The problem will feature a gray icon once it is closed