Shift from monitoring to observability

Written on
19 November 2024
by
Erik Kruithof
Principal Consultant
This blog post is the first in a trilogy in which we look at full-stack observability of your IT environment. In this first post we concentrate on monitoring versus observability in general. In the second post we will examine the observability solution from Elastic, and in the third and final post we focus on our own PeopleSoft Manager Performance & Health solution, which is built on the Elastic Stack.

Obtain full-stack observability into your IT environment with Elastic and Blis Digital.

Observability: An origin story

Observability has gained popularity recently because it addresses the specific needs of modern, complex architectures: it accelerates incident response, improves reliability, and ultimately enhances the user experience. While this may sound cool, at its core observability is simply the ability to see, or monitor, what is going on inside an application. So what makes observability different from the monitoring we have been doing since the dawn of the Information Age? To answer that question, we need to look back at how monitoring evolved in our IT world.

Availability Monitoring

Our story starts in the 90s, when the World Wide Web began to transition from altruistic knowledge sharing to making money through e-commerce. As more and more money was put into internet-based companies, outages and failures became more and more costly, so knowing whether your website was available became a necessity. This resulted in the concept of availability monitoring. Availability (or uptime) monitoring is an automated way of checking (pinging) whether a service such as a website or an application is available. Unavailability would usually result in an email, SMS or other type of message sent to an administrator to resolve whatever the problem was.
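To make the idea concrete, here is a minimal availability check sketched in Python. The health-check URL and the alert routine are hypothetical placeholders; a real uptime monitor would run on a schedule and page an administrator instead of printing.

```python
# Minimal availability (uptime) check: a sketch, not a production monitor.
import urllib.request

def check_availability(url: str, timeout: float = 5.0) -> bool:
    """Return True if the URL answers without an error."""
    try:
        with urllib.request.urlopen(url, timeout=timeout):
            return True
    except OSError:  # covers URLError, HTTPError and timeouts
        return False

def alert(message: str) -> None:
    # In a real setup this would send an email, SMS or chat message.
    print(f"ALERT: {message}")

if __name__ == "__main__":
    url = "https://example.com/health"  # hypothetical health endpoint
    if not check_availability(url):
        alert(f"{url} appears to be down")
```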

System Monitoring (SM)

Because only being informed that a website or application is unavailable will not tell you much about what happened, the need to see what was going on from the inside became more apparent with the start of the new decade. This resulted in the rise of system monitoring: at first through simple scripts that checked system internals against thresholds, and later with the help of specialised system monitoring tools. Those monitoring tools would usually send an email or alert to an administrator whenever a threshold was breached.

Currently, system monitoring is an umbrella category of software that enables organizations to manage, operate, and monitor IT systems in a centralized manner. Nagios was one of the first well-known system monitoring tools to be widely adopted across industries.
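The kind of script the early days produced is easy to picture. The sketch below checks disk usage against a threshold; the 90% threshold and the printed alert are illustrative assumptions, not taken from any particular tool.

```python
# A threshold check of the kind early system-monitoring scripts performed.
import shutil

DISK_USAGE_THRESHOLD = 0.90  # alert when a filesystem is more than 90% full

def disk_usage_ratio(path: str = "/") -> float:
    usage = shutil.disk_usage(path)
    return usage.used / usage.total

if __name__ == "__main__":
    ratio = disk_usage_ratio("/")
    if ratio > DISK_USAGE_THRESHOLD:
        # A real tool would page an administrator; here we just print.
        print(f"ALERT: disk usage at {ratio:.0%} exceeds threshold")
```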

Real User Monitoring (RUM)

Knowing what is going on inside systems is nice, but it will not tell you how your users perceive your service. In the late 2000s the focus shifted from monitoring systems to monitoring users. The ability to monitor transactions from real users of a service gave great insight into what users were actually experiencing. Real User Monitoring passively collects data from real users in real time, making it possible to optimize a service based on real data. A service like Pingdom quickly became popular because it also offered website performance insights at a time when user experience was becoming crucial for businesses.
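As an illustration of the analysis side of RUM, the sketch below aggregates page-load timings as they might be reported by real browsers. The numbers are made up, and a real RUM pipeline collects and stores far more than this.

```python
# Sketch: summarise real-user page-load timings so a service can be
# optimized on real data. The values are hypothetical milliseconds.
from statistics import median, quantiles

page_load_ms = [312, 287, 1450, 298, 905, 334, 2210, 276, 410, 389]

p95 = quantiles(page_load_ms, n=20, method="inclusive")[-1]  # 95th percentile
print(f"median={median(page_load_ms):.0f} ms, p95={p95:.0f} ms")
```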

Application Performance Monitoring (APM)

Now we are in the mid-2010s and have our users at the table as well, but there is still a void between the users and the systems: our application sits between them as a black box that comes in many different shapes. By tracing and timing calls within the application we can introspect the path those calls take and see where time is spent. Dynatrace is one of the well-known pioneers in the Application Performance Monitoring space.
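To give a feel for what an APM agent does under the hood, here is a toy Python sketch that wraps function calls, records how long each takes, and keeps track of the call path. Real agents do this automatically and with far more detail; the functions below are invented examples.

```python
# Toy tracing: wrap calls, time them, and print the call path.
import functools
import time

_call_stack = []  # names of the spans currently open

def traced(func):
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        _call_stack.append(func.__name__)
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            duration_ms = (time.perf_counter() - start) * 1000
            print(f"{' > '.join(_call_stack)}: {duration_ms:.1f} ms")
            _call_stack.pop()
    return wrapper

@traced
def query_database():
    time.sleep(0.05)  # stand-in for a slow SQL query

@traced
def handle_request():
    query_database()
    time.sleep(0.01)  # stand-in for rendering a response

handle_request()
```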

Today’s Application Performance Monitoring has evolved, as Gartner defines it, into a suite of monitoring software comprising:

  • Digital Experience Monitoring (DEM)
  • Application Discovery, Tracing and Diagnostics (ADTD)
  • Purpose-built Artificial Intelligence for IT Operations (AIOps).

Digital Experience Monitoring (DEM)

Digital Experience Monitoring comprises software tools that support the optimization of the operational experience and behaviour of a digital agent, human or machine, as it interacts with enterprise applications and services. The primary techniques for Digital Experience Monitoring are synthetic monitoring, which actively emulates user interactions, and real user monitoring.
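A synthetic check can be as simple as scripting a short user journey and timing each step, as in the sketch below. The URLs are hypothetical placeholders rather than a real journey.

```python
# Bare-bones synthetic monitoring: actively emulate a user journey step
# by step and time each step. URLs are hypothetical placeholders.
import time
import urllib.request

JOURNEY = [
    ("open home page", "https://example.com/"),
    ("open product page", "https://example.com/products/42"),
]

for step_name, url in JOURNEY:
    start = time.perf_counter()
    try:
        with urllib.request.urlopen(url, timeout=5) as response:
            ok = response.status == 200
    except OSError:
        ok = False
    elapsed_ms = (time.perf_counter() - start) * 1000
    print(f"{step_name}: {'OK' if ok else 'FAILED'} in {elapsed_ms:.0f} ms")
```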

Application Discovery, Tracing, and Diagnostics (ADTD)

Application Discovery, Tracing, and Diagnostics tools seek to understand the relationships between applications using methods such as Bytecode Instrumentation (BCI) or profiling, for example by injecting bytecode into a Java class at run time.
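Showing actual Java bytecode instrumentation would go beyond a short example, so the sketch below uses a Python analogue of the same idea: wrapping an existing function at run time, without touching its source, so that every call can be discovered and timed.

```python
# Python analogue of runtime instrumentation: patch a library function
# so every call through it is observed, much as a Java agent rewrites
# a class at load time.
import time
import urllib.request

_original_urlopen = urllib.request.urlopen

def _instrumented_urlopen(*args, **kwargs):
    target = args[0] if args else kwargs.get("url")
    start = time.perf_counter()
    try:
        return _original_urlopen(*args, **kwargs)
    finally:
        elapsed_ms = (time.perf_counter() - start) * 1000
        print(f"urlopen({target!r}) took {elapsed_ms:.1f} ms")

# "Instrument" the library at run time.
urllib.request.urlopen = _instrumented_urlopen
```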

Artificial Intelligence for IT Operations (AIOps)

AIOps combines big data and machine learning to automate IT operations processes, including event correlation, anomaly detection and causality determination.
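As a tiny taste of what anomaly detection means in practice, the sketch below flags metric samples that fall far outside a recent baseline. The CPU series is made up, and real AIOps platforms apply far richer models to far more data.

```python
# Simple z-score anomaly detection over a made-up CPU metric.
from statistics import mean, stdev

cpu_percent = [22, 25, 21, 24, 23, 26, 22, 24, 87, 23]  # hypothetical samples

baseline = cpu_percent[:8]
mu, sigma = mean(baseline), stdev(baseline)

for i, value in enumerate(cpu_percent[8:], start=8):
    z = (value - mu) / sigma
    if abs(z) > 3:
        print(f"anomaly at sample {i}: {value}% (z-score {z:.1f})")
```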

Shift towards observability

While Application Performance Monitoring filled the gap between our systems and users, it was not the answer to the sheer growth in the complexity of our IT landscape in the early 2020s. The rise of microservices, containerization, distributed systems and cloud computing meant that traditional monitoring tools were no longer sufficient. This new complexity required a shift from reactive monitoring to a more proactive approach, and this is where observability entered the room. Observability captures the whole picture of a system’s health by centrally combining metrics, logs, and traces, and new tools like Elasticsearch, Logstash and Kibana (ELK) were developed to address these needs.
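To illustrate the glue that makes this correlation possible, the sketch below emits structured log events that all carry the same trace id, so a central store such as the Elastic Stack could tie them together with related metrics and traces. The field names are illustrative, not a fixed schema.

```python
# Sketch: structured log events that share a trace id, ready to be
# shipped to a central store and correlated with metrics and traces.
import json
import time
import uuid

def log_event(message: str, trace_id: str, **fields) -> None:
    record = {
        "@timestamp": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "message": message,
        "trace.id": trace_id,
        **fields,
    }
    print(json.dumps(record))  # in practice shipped to a central store

trace_id = uuid.uuid4().hex
log_event("order received", trace_id, service="checkout")
log_event("payment authorized", trace_id, service="payments", duration_ms=42)
```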