Introduction to Instrumentation and Observability in Distributed Systems — Part 1

ChandraSekhar Saripaka · Published in The Startup · 7 min read · Dec 2, 2020

Observable Spark Job

A distributed system is a system whose components/services are located on different networked hosts and coordinate their actions by passing messages to one another. The components interact with one another to achieve a common goal. You will see most of the use cases in enterprise data systems and gaming systems, which are bound by time synchronization. Many of them implement distributed consensus algorithms, which is what makes them truly distributed.

How can we create an observer pattern for the hundreds or thousands of nodes in a cluster?

Quite often it is complex to work with distributed systems, and it is even more complicated to measure and observe across all the components. Measurement of metrics in distributed systems, e.g., memory, CPU, storage, network and I/O, plays a significant role in observability, which is the basis for scaling distributed systems. One also has to understand the correlation among the services, the components, the hosts and the underlying hardware.

In distributed systems, services are hosted across multiple nodes; reliability depends on the performance of these hosts and services and on the coordination between the services. We present here a naive but systematic way of designing observability and instrumentation in distributed systems.

Theory

Here is a structural layout of how I see the metrics that are responsible for the functioning of a service and an application.

Metrics Layout

For example, in the context of the Hadoop ecosystem, ZooKeeper is used as a coordinator or observer to determine the master in a master-slave architecture. ZooKeeper needs to operate at less than 5 to 30 ms latency for every transaction. The components that use ZooKeeper need to advertise their status, and the workers/clients use the status in ZooKeeper to determine the primary host.

To identify this kind of failure scenario in ZooKeeper (which is journal-based), one needs to be aware of disk latency, RPC throughput and latency.
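As an illustration, here is a minimal probe, a sketch rather than production code, that polls ZooKeeper's "mntr" four-letter command and prints its latency counters. It assumes a server on localhost:2181 and, on recent ZooKeeper versions, that mntr is included in 4lw.commands.whitelist.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.net.Socket;

// Minimal sketch: poll ZooKeeper's "mntr" four-letter command and surface its
// latency counters. Assumes ZooKeeper on localhost:2181 and that "mntr" is
// whitelisted (4lw.commands.whitelist) on newer ZooKeeper versions.
public class ZkLatencyProbe {
    public static void main(String[] args) throws Exception {
        try (Socket socket = new Socket("localhost", 2181);
             PrintWriter out = new PrintWriter(socket.getOutputStream(), true);
             BufferedReader in = new BufferedReader(
                     new InputStreamReader(socket.getInputStream()))) {
            out.print("mntr");
            out.flush();
            String line;
            while ((line = in.readLine()) != null) {
                // Lines look like "zk_avg_latency<TAB>3"; keep only the latency counters.
                if (line.startsWith("zk_avg_latency") || line.startsWith("zk_max_latency")) {
                    System.out.println(line);
                }
            }
        }
    }
}
```

A probe like this can feed a simple threshold rule, for example flagging the ensemble when zk_avg_latency drifts above the latency band it is expected to operate in.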

Is there a ready-made tool that can detect and prevent the degradation of a service?

The answer is no; you have to design your own. Make your code observable and feed the metrics back into the same component so that it can operate on the services, with the goals of autoscaling, preventing unplanned failover, and offering high throughput and low latency even at high volumes.

We can approach the concept of observability using different methods. I once again refer to the USE method, popularized by Brendan Gregg.

It can be summarized as

FOR EVERY RESOURCE, CHECK UTILIZATION, SATURATION, AND ERRORS.

More concisely, if you build a Utilization Index, a Saturation Index and an Error Index, those become the indicators of service performance.

UtilizationIndex: The average time that the resource was busy serving work. Building a numerical index across all the hosts serving this work gives the Utilization Index for the service.

SaturationIndex: The degree to which the resource has extra work that it cannot yet service, often queued.

ErrorIndex: The count of error events per second. The vast majority of service failures are clearly represented through errors in the service log, for example the number of API requests that fail or are served with unusually high latency.

We take this as the basis for determining the bottlenecks of any service, system or host.
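To make these indices concrete, here is a minimal sketch of one possible roll-up across the hosts serving a service. The field names, the averaging scheme and the use of Java records (Java 16+) are illustrative choices, not a fixed formula.

```java
import java.util.List;

// Minimal sketch of rolling up per-host measurements into the three indices
// for a service. Field names and the aggregation scheme are assumptions.
public class UseIndices {

    // Hypothetical per-host sample.
    public record HostSample(double utilization,  // fraction of time busy, 0.0..1.0
                             double queueDepth,   // work waiting to be served
                             double errorsPerSec) {}

    public static double utilizationIndex(List<HostSample> samples) {
        return samples.stream().mapToDouble(HostSample::utilization).average().orElse(0.0);
    }

    public static double saturationIndex(List<HostSample> samples) {
        return samples.stream().mapToDouble(HostSample::queueDepth).average().orElse(0.0);
    }

    public static double errorIndex(List<HostSample> samples) {
        return samples.stream().mapToDouble(HostSample::errorsPerSec).sum();
    }

    public static void main(String[] args) {
        List<HostSample> samples = List.of(
                new HostSample(0.72, 3, 0.1),
                new HostSample(0.91, 15, 2.5));
        System.out.printf("utilization=%.2f saturation=%.1f errors/s=%.1f%n",
                utilizationIndex(samples), saturationIndex(samples), errorIndex(samples));
    }
}
```

Whether saturation is better captured by queue depth, queue wait time or something else depends on the resource being measured; the point is that each index condenses many host-level signals into one service-level indicator.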

When there is no automated resolution, an operator needs to intervene: debug the component, find the root cause of the issue and handle the support events, usually by resorting to an operations handbook written by the component's creator. If the software does not provide an operations handbook, the support person has to go through the logs of all the components involved in the failure event, remediate each one of them and act by building a decision tree. The impact of the issue then has to be determined from the component/service impact matrix. This process is called impact analysis; it determines the service-level effects and surfaces where the issue originated. All of this is a tedious process that deserves the utmost care, and post-remediation it is followed up with a root cause analysis.

Traditional Analysis:

Crude Approach to Issue Remediation.

Root cause analysis (RCA) needs to be automated, or at least drafted, for repeated issues; otherwise the toil piles up.

So there is a great need for a framework here: a way to codify an RCA so that a manual finding converges into an automated analysis.

Rudimentary approach to creating signals and flagging alerts for remediation.
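As a strawman of what codifying such a run book could look like, the sketch below turns a few hypothetical signals into remediation actions via ordered rules, which is effectively a tiny decision tree. The signal names, thresholds and actions are made up for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Predicate;

// Strawman: a run book codified as ordered rules that map metric signals to
// remediation actions. Signal names, thresholds and actions are illustrative.
public class RunBookRules {

    public record Signals(double diskLatencyMs, double heapUsedPct, double errorRate) {}

    // Ordered rules: the first matching rule decides the action.
    private static final Map<Predicate<Signals>, String> RULES = new LinkedHashMap<>();
    static {
        RULES.put(s -> s.diskLatencyMs() > 30, "PAGE_OPERATOR: journal disk latency breach");
        RULES.put(s -> s.heapUsedPct() > 90, "RESTART_SERVICE: heap exhaustion imminent");
        RULES.put(s -> s.errorRate() > 1.0, "OPEN_TICKET: elevated error rate");
    }

    public static String decide(Signals s) {
        return RULES.entrySet().stream()
                .filter(e -> e.getKey().test(s))
                .map(Map.Entry::getValue)
                .findFirst()
                .orElse("NO_ACTION");
    }

    public static void main(String[] args) {
        System.out.println(decide(new Signals(42.0, 55.0, 0.2))); // -> PAGE_OPERATOR: ...
    }
}
```

Every manual RCA that recurs can be folded into such a rule set, which is how the toil mentioned above gets drained over time.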

Building Blocks

The fundamental change every host needs here is to have the necessary metric exporters for the individual host processes and for the host itself. But one needs to know what has to be exported.

There are a bunch of metrics exporters, such as:

1. Node exporters.
2. Metricbeat, which covers the system beat events.
3. Packetbeat, to understand what is going through the network for throughput and latency measurement.
4. Your choice of any software that can be configured with the predefined metrics that need to be exported.
5. Alternatively, have a look at Brendan Gregg's BPF Performance Tools if you are Linux-savvy.

(The metrics below are listed with Java processes in mind; a minimal sketch of collecting a few of them follows the list.)

1. Heap memory.
2. Native threads and their corresponding usage of CPU cores.
3. Inbound and outbound TCP connections and their effect on CPU cores.
4. Native threads and their corresponding usage of open files.
5. Maximum open processes by an application.
6. Stack size tuning: if GC is not properly tuned and large buffers of data are being written to the stack, you see out-of-memory errors because the stack cannot be extended any further. This depends on the nature of the application and whether its memory management is poorly written.
7. Control checks for disk utilization and log rolling.
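Below is a minimal sketch of reading a few of the metrics above from inside a Java process using the standard java.lang.management MXBeans; how the values are shipped out (for example via JMX and an exporter) is left out for brevity, and the file-descriptor counters are only available on Unix-like JVMs.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.lang.management.ThreadMXBean;
import com.sun.management.UnixOperatingSystemMXBean;

// Minimal sketch: read a few JVM/process metrics from inside the process via
// the standard MXBeans. The instanceof check guards the Unix-only counters.
public class JvmMetricsProbe {
    public static void main(String[] args) {
        MemoryMXBean memory = ManagementFactory.getMemoryMXBean();
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();

        System.out.println("heap used (bytes): " + memory.getHeapMemoryUsage().getUsed());
        System.out.println("heap max  (bytes): " + memory.getHeapMemoryUsage().getMax());
        System.out.println("live threads     : " + threads.getThreadCount());

        if (ManagementFactory.getOperatingSystemMXBean()
                instanceof UnixOperatingSystemMXBean os) {
            System.out.println("open fds : " + os.getOpenFileDescriptorCount());
            System.out.println("max fds  : " + os.getMaxFileDescriptorCount());
        }
    }
}
```

Comparing the open file descriptor count against the maximum is exactly the kind of ulimit check, discussed next, that needs to be monitored continuously rather than verified once at provisioning time.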

The user initiating the service needs to be configured with the proper soft and hard ulimits. As system operators, we all know this, but what we need is continuous monitoring of these limits in resource-intensive systems. You may have seen issues where a service goes into a hung state because of some underlying background process on the container or virtual machine. Unless the service exposes observability on its processes, and the systems produce machine-level metrics that can be correlated with it, the service will not be able to identify this.

How do we make sense of all of these metrics across the hosts, services and networks?
Who would want to use these metrics?
How do we derive value out of them?

Operators need this kind of metrics data to see whether a threshold has been breached, to generate alerts and to remediate accordingly for a host running these services. Performance engineers analyse the same data and derive common patterns from it.

The third kind of consumer is the software bots that need to perform this remediation by following a defined set of run books with guided instructions. This is where Robotic Process Automation (RPA) comes in handy.

Feedback-based Systems or Services:

A service can be impacted by an abnormality in any of the metrics of one of its dependent components. Even in these days of modern infrastructure, failover of services to another machine happens because proactive feedback is not enabled. An abnormality in any hardware device can equally cause the service to fail, so how can we predetermine something like this? I call this Operational Predictive Intelligence.

A service should get feedback from its metrics and be able to self-adjust to the outside context of its clients. A service should act like a control system that can calibrate itself to external factors, and software developers need to code that intelligence into the service through meters and signals that form a feedback loop.

For example,

Think of a Spark/MR job that can auto-adjust its parameters based on its previous runs, to give better SLAs while using fewer resources.
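As a hedged illustration of that idea, the sketch below nudges the next run's executor count based on the previous run's duration versus its SLA. The RunRecord type and the 25% scaling rule are assumptions made for illustration, not Spark APIs.

```java
// Hypothetical sketch of a feedback loop for a batch job: compare the last
// run's duration against the SLA and nudge the resource request for the next
// run. RunRecord and the scaling rule are assumptions, not Spark APIs.
public class JobAutoTuner {

    public record RunRecord(long durationMs, int executors) {}

    public static int nextExecutorCount(RunRecord lastRun, long slaMs,
                                        int minExecutors, int maxExecutors) {
        int executors = lastRun.executors();
        if (lastRun.durationMs() > slaMs) {
            // Missed the SLA: scale up by roughly 25%.
            executors = Math.min(maxExecutors, executors + Math.max(1, executors / 4));
        } else if (lastRun.durationMs() < slaMs / 2) {
            // Finished well under the SLA: give back roughly 25% of the resources.
            executors = Math.max(minExecutors, executors - Math.max(1, executors / 4));
        }
        return executors;
    }

    public static void main(String[] args) {
        RunRecord last = new RunRecord(95 * 60_000L, 40); // a 95-minute run on 40 executors
        System.out.println("executors for next run: " + nextExecutorCount(last, 60 * 60_000L, 10, 100));
    }
}
```

In a real pipeline the previous run's metrics would come from the job's own history (for example the duration and resource usage recorded by the scheduler), and the chosen parameters would be written back into the next submission's configuration.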

Similarly, think of a game that can recalibrate its moves based on the level and generate personalized complexity for each player.

Turning the software service into a Control System.

In the present context of distributed systems, no such control systems are built. When a service tends to fail over, the root cause of the failure has to be determined, and it should become part of the run book and of the feedback loop, so that the system does not fail again when it sees a similar kind of issue. A precise measurement of the infrastructure, and of the cost we are incurring for these services, also needs to be recorded.

The majority of cloud providers offer these options, but they come at a cost. Beyond what the cloud-native services provide, detailed monitoring, proactive feedback and self-remediation are what make a service competent enough to serve a large volume of clients. However, if an application hosted in the cloud has to scale by itself without proper feedback of metrics, one ends up spinning up more instances and burning more budget to run the services.

I will cover more topics, such as feedback-loop systems, observable architectures and USE-method indices, in my next posts.

ChandraSekhar Saripaka · The Startup · Software Engineer with expertise in Big Data and Distributed Systems.