Events

An event is a system condition that may be of interest to an administrator. Individual events are often too low-level to be useful, and reporting each one would be noisy. For easier management, SwiftStack aggregates events into Alerts.

The SwiftStack Controller currently supports the following events:

Scope     Event Code  Description
Cluster   E101        Config Push Failed
Cluster   E102        Account Push Failed
Cluster   E103        Ring Empty
Cluster   I104        Ring No Longer Empty
Node      W201        Low Available Disk Space
Node      E202        Node Unreachable
Node      I203        Node Reachable
Node      E204        Node Upgrade Failing
Node      I205        Node Upgrade No Longer Failing
Node      E206        Format Devices Failed
Node      E207        Swift Service Health Problem
Node      I208        Swift Service Health Okay
Node      I209        Sufficient Available Disk Space
Node      E213        Node Configuration Out of Date
Node      I214        Node Configuration Okay
Node      W215        Node Automated Provisioning Failed
Node      I216        Node Automated Provisioning Okay
Node      E217        Node Network Interface Missing
Node      I218        Node Network Interface Okay
Node      E219        Node Config Deploy Impossible
Node      I220        Node Config Deploy Possible
Node      E222        Node Connectivity Failed
Node      I221        Node Connectivity Ok
Node      E223        ProxyFS Service Health Problem
Node      I224        ProxyFS Service Health Okay
Node      E225        KMIP Connectivity Failed
Node      I226        KMIP Connectivity Ok
Gateway   E210        Gateway Service Health Problem
Gateway   I211        Gateway Service Health Okay
Gateway   E212        Gateway Configuration Push Failed
Device    E301        Device Missing
Device    I302        Device Came Back
Device    E303        Device Failed

SwiftStack Controller On-Premises supports these additional events:

Scope        Event Code  Description
Controller   E401        Recovery Host Unreachable
Controller   I402        Recovery Host Reachable
Controller   E403        Recovery Host Version Shear
Controller   I404        Recovery Host Version Fix
Controller   E405        No Recent Tar Backup Of Configs/DB
Controller   I406        Sufficiently Recent Tar Backup
Controller   E407        No Recent Rsync Of Cluster Metrics
Controller   I408        Sufficiently Recent Rsync
Controller   E409        Low Disk Space On Controller
Controller   I410        Sufficient Disk Space On Controller
Controller   E411        Controller Utilization Failing
Controller   I412        Controller Utilization Running

In addition to converting events to Alerts, the SwiftStack Controller On-Premises can emit events using standard protocols: syslog (e.g. for integration with tools such as Splunk) and SNMP v2c traps.

Sending Events Via Syslog

To send events via syslog, configure the syslog target host and port, choose whether to use UDP (the default) or TCP, and select which facility to use. Configure these values on the Controller Networking configuration page and click "Save Changes".
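
Before relying on the integration, you can confirm that events actually arrive at the target; a quick sketch, assuming the default UDP transport and a hypothetical target port of 514:

    # On the syslog target host: watch for incoming syslog traffic
    # (substitute the port you configured on the Controller)
    sudo tcpdump -A -n udp port 514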

Sending Events Via SNMP

To send events via UDP SNMP v2c traps, configure the target hostname or IP address, target port number, and an SNMP v2c community string on the Controller Networking configuration page and click "Save Changes".

The SwiftStack SNMP MIBs are also available for download.
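
To verify trap delivery on the target host, a minimal sketch using the net-snmp tools (an assumption; any trap receiver will do). Note that snmptrapd may require your community string to be authorized in snmptrapd.conf:

    # In /etc/snmp/snmptrapd.conf, authorize your community string:
    #   authCommunity log <your-community-string>

    # Run a foreground trap receiver (default UDP port 162), logging
    # decoded traps to stdout; -m +ALL loads all installed MIBs,
    # including the SwiftStack MIBs if you have installed them
    sudo snmptrapd -f -Lo -m +ALL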

Cluster Events

E101 Config Push Failed

After making changes to the cluster configuration (which were saved by the SwiftStack Controller but not yet deployed), an administrator attempted to "push the config" (deploy the configuration changes to the cluster), but the operation failed. This is usually due to one or more nodes being down or unreachable while the Controller believes them to be active and enabled.

Troubleshooting: Look for nodes with errors on the "Nodes" page for the cluster. If there are down nodes in the list, and they will be down for an extended period of time, you may disable the nodes. If the down nodes will be back shortly (e.g. the server has been powered down for maintenance or hardware replacement), you should wait for the node to come back online before pushing a config to the cluster. If there are no down nodes and config pushes still fail, please contact Technical Support.

Note

Disabling nodes will remove them from the Swift ring(s) and trigger replication to restore full replica count within the cluster for every object, container, and/or account on the disabled nodes.

E102 Account Push Failed
A distribution of the SwiftStack Auth user database to the cluster nodes failed. Failure modes and troubleshooting procedures are identical to those for E101 Config Push Failed above.
E103 Ring Empty
One or more rings no longer have any devices. This may be due to the removal of all devices previously allocated to the ring, or the disabling of all nodes with devices allocated to it.

Note

The failure or removal of devices previously allocated to an affected ring may result in data loss.

Troubleshooting: Look for rings without devices on the "Deploy" page for the cluster and add devices to each of the affected rings. Wait for replication to move data to the new devices before removing devices previously in the ring or disabling nodes that were previously in the ring.

I104 Ring No Longer Empty
A previous E103 condition has been resolved. Wait for replication to move data to the new devices before removing devices previously in the ring or disabling nodes that were previously in the ring.

Node Events

W201 Low Available Disk Space

One or more partitions on the node have too little free disk space. The threshold value for this event is currently hard-coded at 10 percent. The event description will contain the threshold value and path of each partition.

Troubleshooting: SwiftStack recommends keeping at least 10% disk space free on your storage devices. This helps prevent devices from accidentally filling up under "bursty" workloads and helps provide time to add capacity. However, you should not rely on seeing this event before adding capacity, because your rate of ingest and the lead time for hardware procurement and stand-up may require you to add capacity earlier. You should add capacity to the cluster, reduce or stop your ingest, or delete data that is no longer needed.
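
If you want to check this from a shell on the node, a minimal sketch that mirrors the hard-coded 10% threshold:

    # List mount points with 90% or more space used (i.e. under 10% free)
    df -P | awk 'NR > 1 && $5+0 >= 90 {print $6, $5}'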

Note

It is possible to provision a Swift cluster such that a subset of drives fills up "too early". For example, consider a three-zone cluster with Zone 1 having 100 TB, Zone 2 having 200 TB, and Zone 3 having 200 TB. Because each of the 3 replicas is placed in a different zone, the devices in Zone 1 will fill up before any devices in Zone 2 or Zone 3. Using more zones allows the partition placement algorithm to "smooth out" imbalances in zone capacity.

E202 Node Unreachable

The management/monitoring connection between the node and the SwiftStack Controller is not working. Provided nothing is wrong other than the network connection between the node and the Controller, the Swift cluster itself continues to function normally while the node cannot contact the Controller.

Troubleshooting: Possible reasons for this event include a network partition between the SwiftStack Controller and the node, a power failure on the node, a hardware failure in the node, a lock-up or hang of the node, a problem with the SwiftStack agent software on the node, or a problem with the Controller itself (which would likely cause E202 events for a number of other nodes as well). If you are unable to locate the cause of the communication failure, please contact Technical Support.
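
Basic reachability checks from another machine can help narrow down the cause; a sketch, with node1.example.com as a placeholder for the affected node:

    # Is the node answering on the network at all?
    ping -c 3 node1.example.com

    # If it pings, can you log in to inspect it (power, load, agent logs)?
    ssh node1.example.com uptime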

Note

While an enabled node is unreachable, no config or SwiftStack Auth user database can be pushed to the cluster, and any gradual capacity adjustment for devices in the cluster will be unable to make progress. Swift API clients may see degraded request latency while one or more nodes are down. See E101 Config Push Failed above for information on disabling nodes that will be unavailable for an extended period of time.

I203 Node Reachable
A previous E202 condition has been resolved.
E204 Node Upgrade Failing
After a controller software update, the node is failing to update its SwiftStack agent software. Please contact Technical Support.
I205 Node Upgrade No Longer Failing
A previous E204 condition has been resolved.
E206 Format Devices Failed
The node has been instructed to format one or more drives, but the operation failed.
E207 Swift Service Health Problem

One or more Swift daemons are either not running or not functioning properly. The SwiftStack agent will attempt to restart Swift daemons should they unexpectedly exit. However, if that fails, this event will be triggered.

Troubleshooting: Log into the node and run ssstop; ssstart as root. If that does not resolve the issue, please contact Technical Support.
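
You can also inspect the daemons directly; a sketch using Swift's own swift-init tool, assuming it is available on the node's PATH:

    # Report the status of all Swift daemons on this node
    swift-init all status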

I208 Swift Service Health Okay
A previous E207 condition has been resolved.
I209 Sufficient Available Disk Space
A previous W201 condition has been resolved.
E213 Node Configuration Out of Date

A node was attempting to run Swift services with an outdated cluster configuration. To protect the rest of the cluster, all Swift services were disabled and this event was triggered.

Troubleshooting: Push a fresh config to the cluster. If that fails or otherwise does not resolve the issue, please contact Technical Support.

I214 Node Configuration Okay
A previous E213 condition has been resolved.
W215 Node Automated Provisioning Failed
A node that was undergoing Automated Provisioning has failed.
I216 Node Automated Provisioning Okay
A node that had previously failed Automated Provisioning has restarted.

Note

Automated Provisioning for nodes is currently in beta.

E217 Node Network Interface Missing

A previously configured network interface on a node was not detected during monitoring of the node.

Troubleshooting: Verify the missing network interface is properly configured.

I218 Node Network Interface Okay
A previous E217 condition has been resolved.
E219 Node Config Deploy Impossible

Deploying a config to this node will fail. Currently, this occurs only because of incompatible packages on the node.

Troubleshooting: This alert occurs when the swiftstack-swift-services package cannot be installed. Identify and resolve the packaging conflict(s) causing this issue.
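
A dry-run install can surface the conflicting packages; a sketch, assuming an RPM-based node (use the apt equivalent on Debian-family systems):

    # Attempt the install without committing changes, and read the
    # dependency/conflict errors that are reported
    sudo yum install --assumeno swiftstack-swift-services

    # Debian/Ubuntu equivalent: simulate the install
    sudo apt-get install -s swiftstack-swift-services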

I220 Node Config Deploy Possible
A previous E219 condition has been resolved.
E222 Node Connectivity Failed

A node cannot reach all of the required services on a peer node.

Troubleshooting: This alert occurs when network connectivity between nodes is interrupted, or when required services are not running on the peer. Identify and resolve network issues, firewall issues, or services that are not running on peers.
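
From the affected node, you can probe a peer's service ports directly; a sketch assuming current default Swift ports (older clusters may use 6000-6002, and your cluster's configuration may differ), with peer-node.example.com as a placeholder:

    # Check the object (6200), container (6201), and account (6202)
    # server ports, plus rsync (873)
    for port in 6200 6201 6202 873; do
        nc -zv peer-node.example.com "$port"
    done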

I221 Node Connectivity Ok
A previous E222 condition has been resolved.
E225 KMIP Connectivity Failed

A node is configured to use a KMIP server for encryption secrets, but cannot retrieve those secrets.

Troubleshooting: This alert occurs when network connectivity between the node and the KMIP server is interrupted, or when the KMIP server is temporarily down. Identify and resolve network issues, firewall issues, or KMIP server problems.
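
KMIP runs over TLS (standard port 5696), so a TLS handshake test from the node is a quick first check; kmip.example.com is a placeholder:

    # A successful handshake prints the server's certificate chain;
    # a timeout or refusal points to network or server problems
    openssl s_client -connect kmip.example.com:5696 </dev/null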

I226 KMIP Connectivity Ok
A previous E225 condition has been resolved.

Gateway Events

E210 Gateway Service Health Problem

One or more SwiftStack Gateway services are either not running or not functioning properly. The SwiftStack agent will attempt to restart services should they unexpectedly exit. However, if that fails, this event will be triggered.

Troubleshooting: Log into the gateway and run ssdiag to get more specific information. Restarting the gateway service (stop ss-gateway; start ss-gateway) and/or restarting NFS and Samba services may help. If that does not resolve the issue, please contact Technical Support.
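
To check the file-serving side specifically, you can query the NFS and Samba services on the gateway; a sketch, assuming the standard client tools are installed:

    # Verify the NFS server is answering and exporting shares
    showmount -e localhost

    # Verify Samba is up by listing its shares (-N skips authentication)
    smbclient -N -L localhost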

I211 Gateway Service Health Okay
A previous E210 condition has been resolved.
E212 Gateway Configuration Push Failed

A configuration push to a single SwiftStack Gateway failed.

Troubleshooting: Ensure the Gateway's management agent is able to communicate with the Controller (all Gateway management pages check and display this). Try the configuration push again, and if that does not resolve the issue, please contact Technical Support.


Device Events

E301 Device Missing

The node cannot detect the presence of a drive which is expected to be present and available, or a drive which should be mounted is not. Note that if the cluster is configured to automatically unmount failing devices, a drive unmounted in response to an E303 event will also generate an E301 event.

Troubleshooting: Low-level hardware problems (RAID/JBOD controller, SAS expander, etc.) can sometimes cause devices to consistently "disappear". If a device is merely unmounted, you may remount it using the SwiftStack Controller. But if the device continuously generates this event, look for a drive or other hardware problem.
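
From a shell on the node, you can check whether the kernel still sees the device at all; a minimal sketch:

    # List known block devices and their mount points
    lsblk

    # Look for recent device-removal or I/O error messages
    dmesg | tail -n 50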

I302 Device Came Back
A previous E301 condition has been resolved.
E303 Device Failed

A drive has reported a failure via SMART, or has been administratively disabled. The event description will include the SMART metrics which were considered failing by the drive firmware. There is currently no corresponding event for when this condition has been resolved, but one will be added in the future.

Troubleshooting: Examine the kernel log (/var/log/kern.log) for corroborating error messages for the block device or filesystem. If the log indicates XFS filesystem issues, unmount the device with sdt unmount dXX, run xfs_repair /dev/XXX, and then remount the device with sdt mount dXX. If the log indicates other errors, or if xfs_repair runs into problems, the device is probably failing and should be replaced or removed from the ring using the SwiftStack Controller. If there are no apparent problems from the kernel's perspective, a SMART metric may have temporarily reported "bad". In this case, unmount the device (if the cluster is not configured to automatically unmount failing devices), then re-mount it; this will clear the failure. If this event fires again, the drive may be failing or otherwise violating the drive firmware's expectations despite the lack of kernel error messages, and should be replaced or removed from the ring(s).
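
The repair flow above, gathered into one sequence; it keeps the document's dXX and /dev/XXX placeholders, and adds smartctl (from smartmontools, if installed) to inspect the SMART metrics named in the event:

    # Inspect SMART health and attributes for the suspect drive
    sudo smartctl -a /dev/XXX

    # If the kernel log shows XFS issues: unmount, repair, remount
    sdt unmount dXX
    sudo xfs_repair /dev/XXX
    sdt mount dXX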


Controller Events

E401 Recovery Host Unreachable

A recovery backup host (see Setting Up A Recovery Controller) cannot be contacted over the VPN.

Troubleshooting: Similar to troubleshooting for E202 (Node Unreachable). Possible reasons for this event include a network partition between the two machines, a power failure or hardware failure on the backup, a lock-up or hang of the backup, or a problem with the SwiftStack agent software on the backup. If you are unable to locate the cause of the communication failure, please contact Technical Support.

I402 Recovery Host Reachable
A previous E401 condition has been resolved.
E403 Recovery Host Version Shear

A SwiftStack Controller recovery host is successfully communicating with its primary, but the recovery host is running an older version of the SwiftStack Controller software than its primary.

Troubleshooting: Upgrade the recovery host. In general, you should always upgrade your recovery hosts first, before upgrading your primary Controller. This allows live testing of the upgrade process without risk to production resources. It also means that any database migrations will be run first on the recovery hosts and then on the primary. This ensures that the recovery host will have a database schema at least as current as the primary, reducing the risk that the primary will produce a backup tar file including a database dump with a schema which is too recent for the recovery host to interpret.

I404 Recovery Host Version Fix
A previous E403 condition has been resolved.
E405 No Recent Tar Backup Of Configs/DB

The SwiftStack Controller has not successfully generated a .tar file backup of its configuration and database within its configured backup interval.

Troubleshooting: Verify the backup-related settings on the Backup Settings Page. If saving backups into a Swift cluster (highly recommended), you can test your configuration using the "Verify Swift Credentials" button on that page. Try running the backup_ssc script manually and see whether it succeeds. Look for local backup tar files in the BACKUP_LOCAL_PATH directory (usually /opt/ss/var/lib/ss-backup/...). Look for backup files in your Swift cluster in the account corresponding to BACKUP_SWIFT_ACCOUNT.
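
The manual checks above, as commands on the Controller (assuming backup_ssc is on the PATH; the local path is deployment-specific, as noted):

    # Run a backup by hand and watch for errors
    backup_ssc

    # List the most recent local backup tar files
    ls -lt /opt/ss/var/lib/ss-backup/ | head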

I406 Sufficiently Recent Tar Backup
A previous E405 condition has been resolved, or the system can confirm that no E405 conditions exist.
E407 No Recent Rsync Of Cluster Metrics

One or more recovery backup hosts have been configured, but the primary Controller has not rsynced the cluster metrics data to the given recovery host sufficiently recently. If a recovery were required now, recent cluster metrics data would be lost.

Troubleshooting: The recovery host must be up and running, with a working sshd daemon and rsync installed. The primary Controller must have the rsync command in the crontab file (in /opt/ss/etc/cron.d/recovery), and that rsync command must run successfully from the command line.
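
A few of these checks expressed as commands, with recovery.example.com as a placeholder for the recovery host:

    # Is sshd reachable on the recovery host?
    nc -zv recovery.example.com 22

    # Inspect the rsync job the primary Controller is expected to run,
    # then try running that command by hand
    cat /opt/ss/etc/cron.d/recovery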

I408 Sufficiently Recent Rsync
A previous E407 condition has been resolved, or the system can confirm that no E407 conditions exist for the given host.
E409 Low Disk Space On Controller

The SwiftStack Controller has low disk space on one or more of its mount points, which is likely to cause erratic behavior if not resolved soon.

Troubleshooting: The df command will help pinpoint the issue.
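
For example:

    # Show usage and free space per mount point
    df -h

    # On a nearly full filesystem, find the largest top-level directories
    # (-x stays on one filesystem)
    sudo du -xsh /* 2>/dev/null | sort -h | tail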

I410 Sufficient Disk Space On Controller
A previous E409 condition has been resolved, or the system can confirm that no E409 conditions exist.
E411 Controller Utilization Failing

The SwiftStack Controller has not successfully run its utilization aggregation job in over a day. Utilization API results will have gaps if this is not corrected within a few days.

Troubleshooting: Check that the ss-crond service is running and that aggregate_utilization_data is configured in /opt/ss/etc/cron.d/ssman. Check /opt/ss/var/log/ssaggregate_utilization_data.log, as well as /var/log/messages.
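
Those checks as commands, using the paths given above (the upstart-style status command is an assumption, matching the stop/start syntax used elsewhere in this document):

    # Is the cron service running, and is the job configured?
    status ss-crond
    grep aggregate_utilization_data /opt/ss/etc/cron.d/ssman

    # Look for recent errors from the aggregation job
    tail -n 50 /opt/ss/var/log/ssaggregate_utilization_data.log
    sudo tail -n 50 /var/log/messages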

I412 Controller Utilization Running
A previous E411 condition has been resolved, or the system can confirm that no E411 conditions exist.