Cluster Tuning

The SwiftStack Controller pre-configures many default settings and includes automatic tuning features for your cluster. However, there are some operational situations that may require cluster tuning.

With SwiftStack, it's extremely easy to manage and change those tuning settings. Just go to the tuning section of the Controller and adjust the account, container, object, and proxy tuning settings. The SwiftStack Controller will then automatically push out the updated configuration settings and reload the Swift processes, all without any client downtime. This ensures that you are getting the most out of your hardware and are optimizing performance for your use case.

For any additional questions about how or when to tune your cluster, contact SwiftStack support.

This section contains some of the most common tuning settings in a SwiftStack Cluster.

Tuning is available on the Tune tab of the Manage Cluster page.

../../_images/manage-tabs.png

Settings Managed by the SwiftStack Controller

../../_images/tune-proxy-workers.png

On the left is a navigation pane to help in finding the right setting.

Worker Settings

One of the most important tunables that Swift has is the number of workers for each of its primary servers. account_servers, container_servers, object_servers, and proxy_servers all have a setting in their config files called workers.

The node can make its own guess at a good number of workers. Enable this by checking the "auto" box next to the workers setting (this is the default for new clusters), or control these settings yourself as described below.

Proxy Server Workers

  • Proxy workers (default auto)

One thing to know about Swift is that each worker accepts sockets off the main binding point for its server. The proxy server is the easiest to understand because it binds a single IP and port combination, and when a new request comes in it gets accepted by one of the workers. Each worker is a separate UNIX process, and within a single worker up to 1,024 requests are juggled simultaneously using Eventlet. (Eventlet is a green-thread based library for high concurrency in Python.) This means that if you expect 10,000 concurrent requests to come into your cluster in production, you can do the math on how many worker processes are needed: because each proxy server worker can only juggle 1,024 simultaneous requests, you need at least 10 workers.

Keep in mind that the proxy server workload tends to be CPU-bound: it's not doing any disk I/O, just shuffling network data. Because of the way Eventlet works, each worker can only use up to one CPU core. So if you've got a lot of cores in your proxy server (and that's a good idea, by the way), you need a good number of workers to be able to saturate them.

A good starting point would be to have one proxy server process per CPU core.
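In practice the Controller writes this out as the workers option in the proxy server's configuration file, so you normally never touch it by hand. As a rough sketch of what that setting looks like underneath, assuming an 8-core proxy node (the value is illustrative, not a recommendation):

    # proxy-server.conf -- managed by the SwiftStack Controller, shown for reference
    [DEFAULT]
    # roughly one worker per CPU core on an assumed 8-core proxy node
    workers = 8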

Account and Container Workers

  • Account workers (default auto)
  • Container workers (default auto)

The account, container, and object servers are not only juggling network traffic from the proxy, but they're also accessing data on disk. For the account and container servers that means interacting with SQLite databases, and in the case of the object server – the actual objects. For all of those, you need enough workers so that blocking filesystem operations – reads and writes, for instance – don't end up starving the Eventlet event loop.

This is because asynchronous network I/O works great, but asynchronous file I/O on pretty much every platform is terrible. So it's important to remember that read and write system calls on Linux can block.

The number you should set varies widely depending on your hardware and your network configuration. We won't give you a lot of specific numbers, because what should drive your decisions is the metrics you observe in testing. A good starting point, however, is to run 1-2 processes per spindle or one per CPU core.
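Under the hood this is again the workers option, this time in the account and container server configuration files that the Controller manages. An illustrative sketch, assuming a node with 12 CPU cores:

    # account-server.conf and container-server.conf -- shown for reference
    [DEFAULT]
    # roughly one worker per CPU core (or 1-2 per spindle); tune against observed metrics
    workers = 12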

Object Workers

  • Object workers (default auto)
  • Object servers per port (default 4)

For object servers, you might need to be sensitive to the request latency that your clients observe as you put load into the cluster. If you have too few object servers and your disks are starting to get saturated, you'd expect clients to see inconsistent latency. Say a client is trying to pull out a 3-gigabyte file: if other concurrent connections are hitting the disk hard, some reads and writes will block, and other connections handled by green threads in the same object server worker that your large file stream is coming from will end up blocking the stream you're pulling.

In Swift versions before 1.9, this should be addressed by setting the number of Object Workers to 'auto' or to 1 to 2 times the number of cores, including Hyper-Threading if applicable. Starting in Swift version 1.9, a more robust configuration is available: start with Object servers per port set to the default (4), and decrease the value for higher densities (>84 disks per node) or increase it for lower densities (~24 disks per node).
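These map to the workers and servers_per_port options in the object server configuration that the Controller manages. A sketch with illustrative values; note that when servers_per_port is enabled, the object server is bound per disk port and the classic workers pool no longer applies:

    # object-server.conf -- shown for reference, values illustrative
    [DEFAULT]
    # pre-1.9 style: a single pool of workers, roughly 1-2 per core
    workers = 16
    # 1.9+ style: a small pool of servers per object-server port (per disk)
    servers_per_port = 4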

Background Daemons and their Settings

One of the common concerns that Swift operators have when they first start is the amount of resources consumed by the background processes. When folks start up their Swift cluster and put a bunch of data in it we often hear, “Oh my gosh look at all of the disk I/O traffic, there’s read I/O all over the place and there’s only a little trickle of write I/O! What’s going on here? I’m not even running swift-bench or any other benchmarking tool!”

What's happening is that the background daemons are doing their job.

In addition to the servers, there are other background daemons that run in Swift, specifically auditors, replicators, updaters, and the reaper.

However, if that level of background work is too high and you want to rein in or tame your background daemons, it is important to know how they work and which settings can be configured.

Auditors

Auditors are responsible for ensuring the integrity of the stored data. They constantly check it, catch bit-rot, and flag other ways the data can go stale. Even with three replicas, you don't want to just let the data sit there or it will rot: sector failures are not uncommon.

For the auditors, there's an "Auditor Interval" setting that determines how often they run.

Then for the object auditor, there are additional "files per second", "bytes per second", and "zero byte files per second" settings. These can be used to rate-limit the auditing process, which in turn limits its load on the I/O subsystem and CPU.

  • Account auditor interval (default 1500)
  • Container auditor interval (default 1500)
  • Object auditor files per second (default 20)
  • Object auditor bytes per second (default 10000000)
  • Object auditor zero byte files per second (default 50)
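These settings correspond to the auditor sections of the account, container, and object server configuration files. A sketch using the default values listed above (the Controller manages these files; shown only for reference):

    # account-server.conf
    [account-auditor]
    interval = 1500

    # container-server.conf
    [container-auditor]
    interval = 1500

    # object-server.conf
    [object-auditor]
    files_per_second = 20
    bytes_per_second = 10000000
    zero_byte_files_per_second = 50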

Replicators

Replicators are the eventual consistency guarantors of Swift and will synchronize missing data within the cluster.

For the replicators, there are two primary settings. First is "concurrency", the number of simultaneous outstanding requests each replicator has. If that's too high, you will see more rsync traffic and more request traffic as one server asks another, "What version of this do you have?" Then there's "run pause", the period of time the replicator sleeps between its looping runs.

  • Account replicator concurrency (default 8)
  • Account replicator run pause (default 30)
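In the underlying configuration these appear in the replicator sections; the container and object replicators take analogous settings. A sketch using the account replicator defaults above:

    # account-server.conf -- shown for reference
    [account-replicator]
    # simultaneous outstanding requests
    concurrency = 8
    # seconds to sleep between replication runs
    run_pause = 30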

Reapers

The reaper is the account reaper: when an account is deleted, it's deleted asynchronously. When a user issues a delete, the account is marked with a kind of soft-delete. The account reaper is responsible for coming through and actually removing the objects in the containers within the account, and then eventually removing the account record itself. That's an expensive operation depending on the size of the data involved, so it's done asynchronously.

Then for the reaper there's interval (the frequency of the runs) and concurrency (the number of items to process in parallel). There's a trade-off here: with higher concurrency you get more done and achieve higher throughput, but you also generate higher load. If you want to lower the load, you can lower the concurrency, but then cleaning up a given workload takes longer from start to finish. That's the trade-off you have to make as an operator.

  • Account reaper interval (default 1800)
  • Account reaper concurrency (default 25)
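The equivalent section in the account server configuration looks roughly like this, using the defaults above:

    # account-server.conf -- shown for reference
    [account-reaper]
    # seconds between runs
    interval = 1800
    # number of items processed in parallel
    concurrency = 25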

Updaters

The updaters are responsible for bubbling up the object and byte count from objects into containers and from containers into accounts.

For the updaters there are three variables you can adjust: interval (discussed above), concurrency (also discussed above), and slowdown, another rate-limiting variable that defines how long to sleep between each update. This allows a bit more fine-grained control for slowing down the updaters.

  • Container updater interval (default 300)
  • Container updater concurrency (default 4)
  • Container updater slowdown (default 0.01)
  • Object updater interval (default 300)
  • Object updater concurrency (default 4)
  • Object updater slowdown (default 0.01)
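These correspond to the updater sections of the container and object server configuration files. A sketch using the defaults above:

    # container-server.conf
    [container-updater]
    interval = 300
    concurrency = 4
    slowdown = 0.01

    # object-server.conf
    [object-updater]
    interval = 300
    concurrency = 4
    slowdown = 0.01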

Chunk Sizes

One last set of important settings in Swift is the chunk sizes.

When the proxy server receives a request, it reads data from the client in chunks. You want your chunk sizes to be large enough that your system calls into the kernel get a decent amount of data back without incurring the overhead of repeatedly crossing into and out of kernel mode. We're not going to give you any hard numbers, but you can tune this and watch how the CPU utilization of your proxy servers changes under a given load. If you drop the client chunk size down to something like 1,024 bytes, or anything lower than the MTU of your jumbo-frame network, you might see an increase in CPU utilization. If your clients are coming in from the Internet, their traffic is chunked outside your control, so again there's tuning to be done there.

So, when should you use larger chunk size? What are the trade-offs between smaller vs. larger chunk sizes?

A decent rule of thumb is to start with something sizable and then, if that causes trouble, drop it down. For example, with a fairly large size like 64 kilobytes, you're asking the kernel "Give me 64k of data" while the client may be trickling data in more slowly than that. The read system call will not block, because you're using non-blocking network I/O, but you can see higher latency between asking for a chunk and being able to deal with it.

In some cases a call could take so long that you think there's a timeout or the client just went quiet on you, when in fact it's dribbling data in so slowly that the kernel's network buffer hasn't filled up enough to satisfy your read request. So you don't want to set it too high. On the other hand, if you set it too low, you just pay the overhead of issuing a system call into and out of the kernel for each little chunk of data, without allowing the kernel to accumulate data and hand it back to you in one batch. That's the trade-off you need to negotiate.

And again, your testing with your workload is really going to be what needs to inform how you set chunk sizes for your cluster.

For the proxy server, there's the client chunk size, which is the chunk size the proxy server uses when reading data from the client. Then there's the object chunk size, which is the chunk size used when dealing with the object server. The two can be tuned independently, which is useful when the proxy server sits between a public access network and a separate internal storage network. In that case the traffic coming into the proxy server is not jumbo-framed, because it's coming in off the Internet, but you can bump up the object chunk size because the connection from the proxy server to the storage nodes is a nice fat LAN link.

  • Proxy server object chunk size (default 8192)
  • Proxy server client chunk size (default 8192)
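In the proxy server configuration these are the client_chunk_size and object_chunk_size options. A sketch using the defaults above; on a deployment with jumbo frames on the internal storage network, object_chunk_size is the one you might raise:

    # proxy-server.conf -- shown for reference
    [app:proxy-server]
    # bytes read from the client per system call
    client_chunk_size = 8192
    # bytes per chunk when talking to the object servers
    object_chunk_size = 8192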

Lastly, there is the disk chunk size for data moving to and from disk. The object server's disk chunk size and network chunk size raise the same concern as on the proxy server: you don't want data crossing into and out of the kernel more often than you have to.

  • Object server network chunk size (default 65536)
  • Object server disk chunk size (default 65536)
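And the object server equivalents, again using the defaults above:

    # object-server.conf -- shown for reference
    [app:object-server]
    # bytes per chunk on the wire
    network_chunk_size = 65536
    # bytes per chunk when reading from or writing to disk
    disk_chunk_size = 65536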

Settings Managed Externally

There are also tuning settings outside of SwiftStack that an operator needs to be aware of, which can significantly affect the performance of a cluster. These include:

Max Connections Setting for rsync

The max connections setting for rsync may in some cases be a limiting factor on replication traffic.
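rsync is configured outside of Swift, in rsyncd.conf on the storage nodes. A minimal sketch of the relevant option, assuming a module named [object] (the module name, path, and value are illustrative and will differ in your deployment):

    # /etc/rsyncd.conf -- illustrative
    [object]
    path = /srv/node
    read only = false
    # raise this if replication is being throttled by refused rsync connections
    max connections = 8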

Jumbo Frames

Enabling jumbo frames allows the cluster to get better throughput on high-bandwidth network connections; even on a 1-gigabit network you will see higher throughput with jumbo frames. All of the networking equipment in the chain also needs to be jumbo-frame aware, capable, and enabled. The tools for testing that configuration are outside the scope of this book.
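How jumbo frames are enabled depends on your distribution and network setup. As one illustration only, on a Debian-style system the MTU can be raised on the storage interface (the interface name and addresses here are assumptions):

    # /etc/network/interfaces -- illustrative; eth1 assumed to be the storage network interface
    iface eth1 inet static
        address 192.168.10.11
        netmask 255.255.255.0
        mtu 9000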

ip_conntrack_max

If you’re using iptables on the systems or any modules that involve connection tracking, then you absolutely need to bump up ip_conntrack_max.
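On modern kernels the sysctl is named net.netfilter.nf_conntrack_max (older kernels used net.ipv4.netfilter.ip_conntrack_max). An illustrative setting; size the value to your expected connection volume:

    # /etc/sysctl.conf -- illustrative value
    net.netfilter.nf_conntrack_max = 262144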