Cluster Monitoring Overview

Introduction

A SwiftStack cluster has many moving parts, with many daemons across many nodes all working together. It is therefore important to be able to tell what's going on inside the cluster when diagnosing issues, tuning performance, or planning capacity changes. The SwiftStack Controller tracks not only server-level metrics such as CPU utilization, load, memory consumption, and disk usage and utilization, but also hundreds of Swift-specific metrics that show what the different daemons are doing on each server. This helps the operator answer questions such as: "What's the volume of object replication on node8?", "How long is it taking?", "Are there errors? If so, when did they happen?"

The SwiftStack Controller collects and stores monitoring data for over 500 metrics for each node in your SwiftStack cluster. A subset of these is reported on the cluster detail page of the SwiftStack Controller console, so you can get a bird's-eye view of your cluster's performance, with options to drill down into specific metrics for tuning and troubleshooting.

Metrics Collection and Storage

Metrics are collected at 30-second intervals. This high-resolution data is stored for 15 days, long enough to identify and troubleshoot any issues. Data older than 15 days is stored at one data point per 6 minutes, and data older than 3 months (93 days) is stored at one data point per hour. As the resolution drops, the excess data points are dropped, not averaged. Data is dropped entirely after 3 years (1095 days).

Metrics are displayed at the highest common resolution for the time period requested. For example, a last-month graph will show data at one data-point per 6 minutes, and a last-year graph will show data at one data-point per hour.
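To make the schedule concrete, here is a minimal sketch of how the retention tiers and the "highest common resolution" rule described above could be modeled. This is an illustration of the policy, not the Controller's actual implementation:

    # Minimal sketch (not the actual Controller implementation) of the
    # retention schedule described above: a sample's age determines the
    # coarsest data-point interval still available, and a graph uses the
    # resolution available for the oldest point in its window.
    from datetime import timedelta

    # (maximum age, data-point interval) -- values taken from the text above
    RETENTION_TIERS = [
        (timedelta(days=15),   timedelta(seconds=30)),   # high-resolution data
        (timedelta(days=93),   timedelta(minutes=6)),    # older than 15 days
        (timedelta(days=1095), timedelta(hours=1)),      # older than 3 months
    ]

    def resolution_for_age(age):
        """Return the data-point interval available for a sample of this age."""
        for max_age, interval in RETENTION_TIERS:
            if age <= max_age:
                return interval
        return None  # dropped entirely after 3 years

    def display_resolution(window):
        """Highest common resolution for a graph covering the last `window`."""
        return resolution_for_age(window)

    print(display_resolution(timedelta(days=30)))   # 0:06:00 -> 6-minute points
    print(display_resolution(timedelta(days=365)))  # 1:00:00 -> hourly points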

What to Look For

StatsD was designed to deeply instrument application code: metrics are sent in real time by the code that just noticed or did something. To avoid the problems inherent in middleware-based monitoring and after-the-fact log processing, we integrated the sending of StatsD metrics into Swift itself. With this change, Swift currently reports 124 metrics across 15 Swift daemons, which are collected and reported through the SwiftStack Controller.
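The mechanism itself is simple: each daemon fires a small UDP datagram at a StatsD endpoint the moment it does or observes something. Below is a minimal sketch of that pattern; the collector address and metric names are illustrative assumptions, not exact Swift metric names:

    # Minimal sketch of StatsD-style instrumentation: the instrumented code
    # emits a small UDP datagram in real time. Metric names here are
    # illustrative, not the exact names Swift reports.
    import socket
    import time

    STATSD_ADDR = ("127.0.0.1", 8125)  # assumed local StatsD/collector address
    _sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

    def timing(metric, duration_ms):
        """Send a StatsD timing sample, e.g. 'proxy-server.GET.timing:42.0|ms'."""
        _sock.sendto(f"{metric}:{duration_ms:.1f}|ms".encode(), STATSD_ADDR)

    def increment(metric):
        """Send a StatsD counter increment, e.g. 'object-server.PUT.errors:1|c'."""
        _sock.sendto(f"{metric}:1|c".encode(), STATSD_ADDR)

    # The daemon reports on its own activity as it happens:
    start = time.time()
    # ... handle a request ...
    timing("proxy-server.GET.timing", (time.time() - start) * 1000)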

When analyzing SwiftStack monitoring data, the following are the key metrics to keep an eye on:

CPU Utilization

Load isn't a particularly interesting metric, but one basic machine stat that is important is CPU utilization. Proxy servers in particular are prone to becoming CPU-bound and bottlenecking your throughput. This can happen if you under-provision your proxy servers or a heavy workload comes their way. It can also happen with a large number of requests per second but relatively low volumes of data transfer (for example, lots of HEADs, or lots of PUTs and GETs for small objects). With a workload like that you will see a high request rate and your proxy servers can become CPU-bound, so watching CPU on proxy servers is particularly important.

The same goes for the back-end storage nodes, but it's generally less of an issue because they will usually become I/O-bound before they become CPU-bound. The possible exception is an SSD-backed account/container server, where I/O capacity and latencies are good enough that the CPU has a chance of becoming the bottleneck.
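As a rough illustration of what "CPU utilization" means here, the sketch below derives it from two readings of /proc/stat. This is a Linux-specific sketch for illustration only and is not necessarily how the Controller collects the metric:

    # Minimal sketch: overall CPU utilization from two readings of /proc/stat.
    import time

    def read_cpu_times():
        """Return (idle, total) jiffies from the aggregate 'cpu' line."""
        with open("/proc/stat") as f:
            fields = [float(x) for x in f.readline().split()[1:]]
        idle = fields[3] + fields[4]    # idle + iowait
        return idle, sum(fields)

    def cpu_utilization(interval=1.0):
        """Percentage of non-idle CPU time over `interval` seconds."""
        idle1, total1 = read_cpu_times()
        time.sleep(interval)
        idle2, total2 = read_cpu_times()
        busy = (total2 - total1) - (idle2 - idle1)
        return 100.0 * busy / (total2 - total1)

    print(f"CPU utilization: {cpu_utilization():.1f}%")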

I/O Utilization

Another useful set of metrics is I/O utilization. These can be tracked per drive, but rolling them up into node and cluster views can be really handy. With per-drive stats you can see any hot spots that show up. This may happen if requests pile up on a particular account or container; in that case, you will see a huge number of concurrent PUTs to objects within the same container reflected in the per-disk I/O.
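The sketch below shows one way per-drive utilization can be read from /proc/diskstats and rolled up into a node-level view. It is Linux-specific and illustrative; the device filter is deliberately crude and this is not the Controller's actual collection code:

    # Minimal sketch: per-drive I/O utilization from /proc/diskstats, with a
    # simple node-level rollup (hottest drive) to surface hot spots.
    import time

    def read_io_ticks():
        """Return {device: milliseconds spent doing I/O} for whole disks."""
        ticks = {}
        with open("/proc/diskstats") as f:
            for line in f:
                fields = line.split()
                name = fields[2]
                # crude filter: skip partitions and loop/ram devices
                if name.startswith(("loop", "ram")) or name[-1].isdigit():
                    continue
                ticks[name] = int(fields[12])  # time spent doing I/Os (ms)
        return ticks

    def io_utilization(interval=1.0):
        """Per-drive busy percentage over `interval`, plus the hottest drive."""
        before = read_io_ticks()
        time.sleep(interval)
        after = read_io_ticks()
        per_drive = {d: 100.0 * (after[d] - before[d]) / (interval * 1000)
                     for d in after if d in before}
        hottest = max(per_drive.values(), default=0.0)
        return per_drive, hottest

    drives, hottest = io_utilization()
    for dev, pct in sorted(drives.items()):
        print(f"{dev}: {pct:.0f}% busy")
    print(f"hottest drive: {hottest:.0f}% busy")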

Conclusion

As a Swift operator you may not be anywhere near the client, but these metrics give you a window into the latency clients are observing, whether they're getting a lot of 404s, and whether 500s suddenly start popping up. In other words, you can detect problems within the Swift cluster as clients see them.
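As a small illustration of that idea, the sketch below turns per-interval proxy status-code counts into a simple client-facing alert. The thresholds and counter names are hypothetical examples, not a SwiftStack feature:

    # Minimal sketch: flag a sudden rise in 404s or 5xx responses from
    # per-interval status-code counts (illustrative thresholds).
    def error_rates(status_counts):
        """status_counts: {'2xx': n, '404': n, '5xx': n, ...} for one interval."""
        total = sum(status_counts.values()) or 1
        return {code: count / total for code, count in status_counts.items()}

    def client_facing_alerts(status_counts, not_found_limit=0.10, server_err_limit=0.01):
        rates = error_rates(status_counts)
        alerts = []
        if rates.get("404", 0) > not_found_limit:
            alerts.append(f"404 rate {rates['404']:.0%} exceeds {not_found_limit:.0%}")
        if rates.get("5xx", 0) > server_err_limit:
            alerts.append(f"5xx rate {rates['5xx']:.0%} exceeds {server_err_limit:.0%}")
        return alerts

    print(client_facing_alerts({"2xx": 950, "404": 30, "5xx": 20}))
    # -> ['5xx rate 2% exceeds 1%']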

You really want to catch problems before clients see them, which is the point of capturing all of these internal metrics, but they are also a handy way for operators to gauge the client experience of the Swift cluster.