Guiding Principles

Enterprise Monitoring Specific Guiding Principles

Monitoring Principle 1

Know before they know; Pre-empt service degradation and potential outages

Description: Identify system degradation and potential outages before users are impacted, through proactive notifications when pre-defined thresholds indicative of service problems are breached.

Rationale: Degraded application performance and outages are disruptive to the University, costly if they go unnoticed, and frustrating for users and stakeholders alike. Proactive monitoring of pre-defined metrics, evaluated against baseline performance thresholds at each layer of application infrastructure, enables the monitoring system to automatically notify service owner(s), thereby identifying problems before they become system degradations or outages. The benefit of this principle is that service teams receive sufficient warning of problems to remediate issues before end users become aware of them or service degrades significantly.

Implications: Service owners establish metrics and correlations, at each layer of application infrastructure, that are indicative of service degradation. Threshold alerts are automated to recognize these event patterns and notify the service team.
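As an illustration of a threshold alert evaluated against a baseline, the following is a minimal sketch, not a prescribed implementation. The Threshold class, its field names, and the rule "baseline mean plus three standard deviations" are all assumptions made for the example; each service team would define its own metrics and limits.

```python
from dataclasses import dataclass
from statistics import mean, stdev
from typing import List, Optional


@dataclass
class Threshold:
    """Hypothetical threshold definition for one metric at one layer."""
    layer: str             # e.g. "network", "database", "application"
    metric: str            # e.g. "query_latency_ms"
    baseline: List[float]  # historical samples considered normal
    sigmas: float = 3.0    # assumed tolerance above the baseline mean

    def limit(self) -> float:
        # Limit = baseline mean plus `sigmas` sample standard deviations.
        return mean(self.baseline) + self.sigmas * stdev(self.baseline)


def check_breach(threshold: Threshold, value: float) -> Optional[str]:
    """Return an alert message if `value` exceeds the limit, else None."""
    limit = threshold.limit()
    if value > limit:
        return (f"ALERT [{threshold.layer}] {threshold.metric}="
                f"{value:.1f} exceeds limit {limit:.1f}")
    return None


# Usage: a hypothetical database latency metric.
db_latency = Threshold("database", "query_latency_ms",
                       baseline=[100, 110, 90, 105, 95])
normal = check_breach(db_latency, 120.0)    # within the limit, no alert
degraded = check_breach(db_latency, 150.0)  # breaches the limit
```

In practice the returned message would be routed to the service team's notification channel; that routing is outside the scope of this sketch.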

Monitoring Principle 2

Single-pane with Multi-lens view of monitoring metrics

Description: Enterprise Monitoring dashboards provide service owners with a centralized location to view and analyze real-time, performance-related metrics. Purpose-built dashboards, constructed over a common set of performance metrics, provide multiple channels of support for the enterprise’s diverse needs. Through this visualization of key performance indicators (KPIs), service owners can quickly assess service health at any given point in time.

Rationale: IT and business leaders want and need visibility into the positive and negative impact of IT solution performance on the organization. Access to comprehensive real-time performance metrics, through easy-to-read dashboards, provides the essential insight needed to support decision-making and management of University capabilities.

Implications: Each service needs to establish KPIs at all layers of application infrastructure.

Monitoring Principle 3

Reduction in Mean Time to Resolution

Description: Enterprise Monitoring collects a significant amount of performance data from diverse sources such as applications, networks, servers, storage, and databases. This data must be of sufficient quality and structure to be correlated, de-duplicated, and enriched in order to interrogate specific events, holistically analyze performance, send meaningful alerts, sensibly visualize KPIs, and aid faster issue resolution.

Rationale: Correlation of events and root cause diagnosis help determine where the actual problem lies so that an issue can be resolved quickly. The goal is to provide more rapid root cause analysis following an incident, an improved ability to monitor performance trends, and crisper justification for expansions (or contractions) of various services.

Implications: To correlate events, definitions must be established for how data is to be analyzed for each combination of correlated events. Organizing collected data by layered topology and event timing may improve correlation accuracy. As new layers are added to the infrastructure, the tool should be able to auto-collect new data and adjust correlation accordingly.
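To make the idea of de-duplicating and correlating events by topology layer and timing more concrete, here is a small sketch under stated assumptions. The Event fields, the numeric layer ordering (0 = network up to 3 = application), the 60-second window, and the heuristic that the lowest layer in a group is the probable root cause are all hypothetical choices for illustration; real correlation definitions would be set per service.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    source: str       # hypothetical component name
    layer: int        # assumed topology: 0 = network ... 3 = application
    kind: str         # event type, e.g. "link_down"
    timestamp: float  # seconds


def deduplicate(events):
    """Keep only the earliest event for each (source, kind) pair."""
    seen = {}
    for ev in sorted(events, key=lambda e: e.timestamp):
        seen.setdefault((ev.source, ev.kind), ev)
    return list(seen.values())


def correlate(events, window=60.0):
    """Group events that occur within `window` seconds of the first event
    in a group. Each result is (probable_root, group), where the probable
    root is the event at the lowest topology layer (an assumed heuristic)."""
    groups = []
    for ev in sorted(events, key=lambda e: e.timestamp):
        if groups and ev.timestamp - groups[-1][0].timestamp <= window:
            groups[-1].append(ev)
        else:
            groups.append([ev])
    return [(min(g, key=lambda e: e.layer), g) for g in groups]


# Usage with made-up events.
raw = [
    Event("sw1", 0, "link_down", 10.0),
    Event("sw1", 0, "link_down", 12.0),   # duplicate of the first event
    Event("db1", 2, "timeout", 20.0),
    Event("app1", 3, "http_500", 25.0),
    Event("app2", 3, "http_500", 500.0),  # unrelated, much later
]
unique = deduplicate(raw)
incidents = correlate(unique)
```

Here the three events near t=10 to t=25 form one incident attributed to the network-layer event on sw1, while the later application event forms its own incident.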

Monitoring Principle 4

Improve operational excellence through performance metrics

Description: A service’s operational performance is to be measured by a set of predetermined key performance indicators that are measured consistently over time. Service owners will use these metrics to measure degradation in service and to correlate this data with end-user experience. Intersections of poor user experience and underperforming KPIs will indicate areas for improvement.

Rationale: Areas for improvement should be identified from metrics, not perceived performance.
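The intersection of poor user experience and underperforming KPIs described above can be sketched as a simple filter. This is only an illustration: the function name, the service names, the availability and survey-score figures, and the target values are all invented for the example.

```python
def improvement_areas(kpi_by_service, ux_by_service, kpi_target, ux_target):
    """Return services whose KPI and user-experience score are both
    below their targets (only services present in both inputs)."""
    common = kpi_by_service.keys() & ux_by_service.keys()
    return sorted(
        svc for svc in common
        if kpi_by_service[svc] < kpi_target and ux_by_service[svc] < ux_target
    )


# Usage with made-up figures: availability (%) and survey score (1-5).
availability = {"email": 99.9, "lms": 97.0, "portal": 98.0}
survey_score = {"email": 4.5, "lms": 2.8, "portal": 4.1}
areas = improvement_areas(availability, survey_score,
                          kpi_target=99.0, ux_target=3.5)
```

With these figures only "lms" is flagged: "portal" misses its availability target but users still rate it above the assumed survey threshold.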

Implications: Yale has declared certain platforms as strategic platforms when combined with other platforms. In the case of Workday, Force.com has been declared a strategic platform. This means that not only should Yale consider using this platform when additional functionality needs to be built, but that the platform tools are also available for use.

Monitoring Principle 5

Cohesive Enterprise Monitoring fabric

Description: To accurately monitor a full-stack environment, from network router through application performance, multiple monitoring solutions will be required. It is important to ensure that all monitoring solutions contribute relevant metrics, do not duplicate the collection of metrics, and integrate with existing solutions.

Rationale: Consolidating IT monitoring tools reduces cost and complexity while improving overall management capabilities.

Implications: Duplication of tools collecting similar or irrelevant metrics inflates the overall cost of Enterprise Monitoring, diminishing the funds available to extend licensing of approved tools and solutions.
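One way to spot duplicated collection is to compare an inventory of what each tool gathers. The sketch below assumes such an inventory exists as a mapping from tool name to the (target, metric) pairs it collects; the tool names, targets, and metric names are hypothetical.

```python
from collections import defaultdict


def find_duplicate_metrics(inventory):
    """Given {tool: set of (target, metric)}, return the pairs collected
    by more than one tool, mapped to the sorted list of those tools."""
    collectors = defaultdict(list)
    for tool, pairs in inventory.items():
        for pair in pairs:
            collectors[pair].append(tool)
    return {pair: sorted(tools)
            for pair, tools in collectors.items() if len(tools) > 1}


# Usage with a made-up inventory of two tools.
inventory = {
    "tool_a": {("web01", "cpu_percent"), ("web01", "http_latency_ms")},
    "tool_b": {("web01", "cpu_percent"), ("db01", "disk_io")},
}
duplicates = find_duplicate_metrics(inventory)
```

In this example only the CPU metric on web01 is reported, since it is the one pair collected by both tools.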