Events¶
An event is a system condition which may be of interest to an administrator. Events are often too low-level to be useful, and reporting each one would be noisy. For easier management, SwiftStack aggregates events into Alerts.
The SwiftStack Controller currently supports the following events:
| Scope | Event Code | Description |
|---|---|---|
| Cluster | E101 | Config Push Failed |
| Cluster | E102 | Account Push Failed |
| Cluster | E103 | Ring Empty |
| Cluster | I104 | Ring No Longer Empty |
| Node | W201 | Low Available Disk Space |
| Node | E202 | Node Unreachable |
| Node | I203 | Node Reachable |
| Node | E204 | Node Upgrade Failing |
| Node | I205 | Node Upgrade No Longer Failing |
| Node | E206 | Format Devices Failed |
| Node | E207 | Swift Service Health Problem |
| Node | I208 | Swift Service Health Okay |
| Node | I209 | Sufficient Available Disk Space |
| Node | E213 | Node Configuration Out of Date |
| Node | I214 | Node Configuration Okay |
| Node | W215 | Node Automated Provisioning Failed |
| Node | I216 | Node Automated Provisioning Okay |
| Node | E217 | Node Network Interface Missing |
| Node | I218 | Node Network Interface Okay |
| Node | E219 | Node Config Deploy Impossible |
| Node | I220 | Node Config Deploy Possible |
| Node | E222 | Node Connectivity Failed |
| Node | I221 | Node Connectivity Ok |
| Node | E223 | ProxyFS Service Health Problem |
| Node | I224 | ProxyFS Service Health Okay |
| Node | E225 | KMIP Connectivity Failed |
| Node | I226 | KMIP Connectivity Ok |
| Gateway | E210 | Gateway Service Health Problem |
| Gateway | I211 | Gateway Service Health Okay |
| Gateway | E212 | Gateway Configuration Push Failed |
| Device | E301 | Device Missing |
| Device | I302 | Device Came Back |
| Device | E303 | Device Failed |
SwiftStack Controller On-Premises supports these additional events:
| Scope | Event Code | Description |
|---|---|---|
| Controller | E401 | Recovery Host Unreachable |
| Controller | I402 | Recovery Host Reachable |
| Controller | E403 | Recovery Host Version Shear |
| Controller | I404 | Recovery Host Version Fix |
| Controller | E405 | No Recent Tar Backup Of Configs/DB |
| Controller | I406 | Sufficiently Recent Tar Backup |
| Controller | E407 | No Recent Rsync Of Cluster Metrics |
| Controller | I408 | Sufficiently Recent Rsync |
| Controller | E409 | Low Disk Space On Controller |
| Controller | I410 | Sufficient Disk Space On Controller |
| Controller | E411 | Controller Utilization Failing |
| Controller | I412 | Controller Utilization Running |
In addition to converting events to Alerts, the SwiftStack Controller On-Premises can emit events using standard protocols. The Controller supports reporting events using syslog (e.g. for integration with tools such as Splunk) and SNMP v2c Traps.
Sending Events Via Syslog¶
To send events via syslog, configure the syslog target host and port, choose whether to use UDP (the default) or TCP, and select which facility to use. Configure these values on the Controller Networking configuration page and click "Save Changes".
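As a sketch of how to verify delivery end to end, the snippet below computes the RFC 3164 priority value that appears in each message's `<PRI>` prefix and listens for raw syslog datagrams over UDP. The port and facility shown are assumptions for illustration; substitute the values configured on the Controller Networking page.

```python
import socket

def syslog_pri(facility, severity):
    """RFC 3164: the <PRI> prefix of a syslog message is facility * 8 + severity."""
    return facility * 8 + severity

def listen_for_events(port=514, count=1):
    """Print `count` raw syslog datagrams received on the given UDP port.
    Port 514 is the syslog default; use the port configured on the Controller."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind(("0.0.0.0", port))
    for _ in range(count):
        data, addr = sock.recvfrom(4096)
        print(addr[0], data.decode(errors="replace"))

# For example, an event sent at local0.warning would arrive with a <132>
# prefix, since facility local0 is 16 and severity warning is 4.
```

Running `listen_for_events()` on the target host while triggering a test event confirms the configured host, port, and transport in one step.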
Sending Events Via SNMP¶
To send events via UDP SNMP v2c traps, configure the target hostname or IP address, the target port number, and an SNMP v2c community string on the Controller Networking configuration page and click "Save Changes".
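Trap reception is normally handled by an SNMP manager such as `snmptrapd`, but a quick sanity check that trap datagrams are arriving at all can be done with a plain UDP socket. Every SNMP message is BER-encoded and therefore begins with an ASN.1 SEQUENCE tag (`0x30`). The port below is the standard SNMP trap port and is an assumption here; match it to your Controller settings.

```python
import socket

SEQUENCE_TAG = 0x30  # ASN.1 BER SEQUENCE tag; every SNMP message begins with it

def looks_like_snmp(datagram):
    """Cheap plausibility check that a datagram is an SNMP message."""
    return len(datagram) > 0 and datagram[0] == SEQUENCE_TAG

def wait_for_trap(port=162, timeout=60.0):
    """Wait for one UDP datagram on the trap port and report whether it
    plausibly is an SNMP trap. Decoding the varbinds (and validating the
    community string) is left to a real SNMP tool."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.settimeout(timeout)
    sock.bind(("0.0.0.0", port))
    data, addr = sock.recvfrom(65535)
    return addr[0], looks_like_snmp(data)
```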
The SwiftStack SNMP MIBs are available for download.
Cluster Events¶
E101 Config Push Failed
After making changes to the cluster configuration (which were saved by the SwiftStack Controller but not yet deployed), an administrator attempted to "push the config" (deploy the configuration changes to the cluster), but the operation failed. This is usually due to one or more nodes being down or unreachable while the Controller believes them to be active and enabled.
Troubleshooting: Look for nodes with errors on the "Nodes" page for the cluster. If there are down nodes in the list, and they will be down for an extended period of time, you may disable the nodes. If the down nodes will be back shortly (e.g. the server has been powered down for maintenance or hardware replacement), you should wait for the node to come back online before pushing a config to the cluster. If there are no down nodes and config pushes still fail, please contact Technical Support.
Note
Disabling nodes will remove them from the Swift ring(s) and trigger replication to restore full replica count within the cluster for every object, container, and/or account on the disabled nodes.
E102 Account Push Failed
A distribution of the SwiftStack Auth user database to the cluster nodes failed. Failure modes and troubleshooting procedures are identical to those for E101 Config Push Failed above.

E103 Ring Empty
One or more rings no longer have devices. This may be due to removing all devices previously allocated, or disabling all nodes that have devices allocated to the ring.
Note
The failure or removal of devices previously allocated to an affected ring may result in data loss.
Troubleshooting: Look for rings without devices on the "Deploy" page for the cluster and add devices to each of the affected rings. Wait for replication to move data to the new devices before removing devices previously in the ring or disabling nodes that were previously in the ring.
I104 Ring No Longer Empty
A previous E103 condition has been resolved. Wait for replication to move data to the new devices before removing devices previously in the ring or disabling nodes that were previously in the ring.
Node Events¶
W201 Low Available Disk Space
One or more partitions on the node have too little free disk space. The threshold value for this event is currently hard-coded at 10 percent. The event description will contain the threshold value and the path of each affected partition.
Troubleshooting: SwiftStack recommends keeping at least 10% disk space free on your storage devices. This helps prevent devices from accidentally filling up under "bursty" workloads and helps provide time to add capacity. However, you should not rely on seeing this event before adding capacity, because your rate of ingest and the lead time for hardware procurement and stand-up may require you to add capacity earlier. You should add capacity to the cluster, reduce or stop your ingest, or delete data that is no longer needed.
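The same check can be sketched locally against the documented 10% threshold. The paths you pass in are placeholders; on a SwiftStack node the device mount points would typically live under a directory such as `/srv/node` (an assumption here, not a documented path).

```python
import shutil

FREE_THRESHOLD_PCT = 10.0  # mirrors the hard-coded W201 threshold

def free_space_pct(path):
    """Percentage of free space on the filesystem containing `path`."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.free / usage.total

def low_space_mounts(paths):
    """Return the paths whose filesystems are below the free-space threshold."""
    return [p for p in paths if free_space_pct(p) < FREE_THRESHOLD_PCT]
```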
Note
It is possible to provision a Swift cluster such that a subset of drives will fill up "too early". For example, consider a three-zone cluster with Zone 1 having 100 TB and Zones 2 and 3 having 200 TB each. Because with 3 replicas each zone will get one replica, the devices in Zone 1 will fill up before any devices in Zone 2 or Zone 3. Using more zones allows the partition placement algorithm to "smooth out" imbalances in zone capacity.
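The arithmetic behind the note can be made explicit. With three replicas and exactly three zones, each zone holds one full copy of every object, so the cluster's unique-data capacity is bounded by the smallest zone rather than by total raw capacity:

```python
# Zone capacities in TB from the example above: Zone 1 is half the size
# of Zones 2 and 3.
zones_tb = [100, 200, 200]
replicas = 3

raw_tb = sum(zones_tb)               # 500 TB of raw disk
ideal_unique_tb = raw_tb / replicas  # ~166.7 TB if placement were unconstrained
actual_unique_tb = min(zones_tb)     # 100 TB: Zone 1 fills up first

print(raw_tb, round(ideal_unique_tb, 1), actual_unique_tb)
```

With more zones than replicas, the placement algorithm is free to spread replicas unevenly across zones, which is why adding zones smooths out this imbalance.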
E202 Node Unreachable
The management/monitoring connection between the node and the SwiftStack Controller is not working. Assuming nothing other than the network connection between the node and the Controller is wrong, the correct functioning of the Swift cluster itself is not impacted if the node cannot contact the Controller.
Troubleshooting: Possible reasons for this event include a network partition between the SwiftStack Controller and the node, a power failure to the node, a hardware failure in the node, a lock-up or hang of the node, a problem with the SwiftStack agent software on the node, or a problem with the Controller itself (which is likely to have caused a number of E202 events for other nodes as well). If you are unable to locate the cause of the communication failure, please contact Technical Support.
Note
While an enabled node is unreachable, no config or SwiftStack Auth user database may be pushed to the cluster. Also, any gradual capacity adjustment for any devices in the cluster will not be able to make progress. Swift API clients may see degraded request latency while one or more nodes are down. See E101 Config Push Failed above for information on disabling nodes which will be unavailable for an extended period of time.
I203 Node Reachable
A previous E202 condition has been resolved.

E204 Node Upgrade Failing
After a Controller software update, the node is failing to update its SwiftStack agent software. Please contact Technical Support.
I205 Node Upgrade No Longer Failing
A previous E204 condition has been resolved.

E206 Format Devices Failed
The node has been instructed to format one or more drives, but the operation failed.
E207 Swift Service Health Problem
One or more Swift daemons are either not running or not functioning properly. The SwiftStack agent will attempt to restart Swift daemons should they unexpectedly exit. However, if that fails, this event will be triggered.
Troubleshooting: Log into the node and run `start ssstop; start ssstart` as root. If that does not resolve the issue, please contact Technical Support.

I208 Swift Service Health Okay
A previous E207 condition has been resolved.

I209 Sufficient Available Disk Space
A previous W201 condition has been resolved.

E213 Node Configuration Out of Date
A node was attempting to run Swift services with an outdated cluster configuration. To protect the rest of the cluster, all Swift services were disabled and this event was triggered.
Troubleshooting: Push a fresh config to the cluster. If that fails or otherwise does not resolve the issue, please contact Technical Support.
I214 Node Configuration Okay
A previous E213 condition has been resolved.

W215 Node Automated Provisioning Failed
A node that was undergoing Automated Provisioning has failed.

I216 Node Automated Provisioning Okay
A node that had previously failed Automated Provisioning has restarted.
Note
Automated Provisioning for nodes is currently in beta.
E217 Node Network Interface Missing
A previously configured network interface on a node was not detected during monitoring of the node.
Troubleshooting: Verify the missing network interface is properly configured.
I218 Node Network Interface Okay
A previous E217 condition has been resolved.

E219 Node Config Deploy Impossible
Trying to deploy a config to this node will fail. Currently this only occurs because of incompatible packages on the node.
Troubleshooting: This alert occurs when the package `swiftstack-swift-services` is unable to install. Identify and resolve the packaging conflict(s) causing this issue.

I220 Node Config Deploy Possible
A previous E219 condition has been resolved.

E222 Node Connectivity Failure
A node cannot reach all of the required services of its peers.

Troubleshooting: This alert occurs when there is a network connectivity interruption between nodes, or no services are running on the other node. Identify and resolve network issues, firewall issues, or services not running on peers.
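One quick way to reproduce this check by hand is a TCP connect to each Swift service port on the peer. The ports below (6200-6202 for the object, container, and account servers) are common Swift defaults and are assumptions here; substitute the ports your cluster actually uses.

```python
import socket

def port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def check_peer(host, ports=(6200, 6201, 6202)):
    """Map each Swift service port on a peer node to its reachability.
    6200-6202 are common Swift defaults, not a documented SwiftStack value."""
    return {port: port_open(host, port) for port in ports}
```

A `False` for any port points at a firewall rule, a network problem, or a Swift service that is not running on the peer.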
I221 Node Connectivity Ok
A previous E222 condition has been resolved.

E225 KMIP Connectivity Failure
A node is configured to use a KMIP server for encryption secrets, but cannot retrieve those secrets.

Troubleshooting: This alert occurs when there is a network connectivity interruption between the nodes and the KMIP server, or the KMIP server is temporarily down. Identify and resolve network issues, firewall issues, or KMIP server problems.
I226 KMIP Connectivity Ok
A previous E225 condition has been resolved.
Gateway Events¶
E210 Gateway Service Health Problem
One or more SwiftStack Gateway services are either not running or not functioning properly. The SwiftStack agent will attempt to restart services should they unexpectedly exit. However, if that fails, this event will be triggered.
Troubleshooting: Log into the gateway and run `ssdiag` to get more specific information. Restarting the gateway service (`stop ss-gateway; start ss-gateway`) and/or restarting the NFS and Samba services may help. If that does not resolve the issue, please contact Technical Support.

I211 Gateway Service Health Okay
A previous E210 condition has been resolved.

E212 Gateway Configuration Push Failed
A configuration push to a single SwiftStack Gateway failed.
Troubleshooting: Ensure the Gateway's management agent is able to communicate with the controller (all Gateway management pages check and display this). Try the configuration push again and if that does not resolve the issue, please contact Technical Support.
Device Events¶
E301 Device Missing
The node cannot detect the presence of a drive which is expected to be present and available, or a drive which should be mounted is not. Note that a drive unmounted because an E303 event was generated and the cluster was configured to automatically unmount failing devices will also generate an E301 event.

Troubleshooting: Low-level hardware problems (RAID/JBOD controller, SAS expander, etc.) can sometimes cause devices to consistently "disappear". If a device is just unmounted, you may remount it using the SwiftStack Controller. But if the device is continuously generating this event, then you should look for a drive or other hardware problem.
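To confirm from the node itself whether a device's mount point is actually mounted, `/proc/mounts` can be consulted directly. A minimal Linux-only sketch; the example path is a placeholder, not a documented SwiftStack location:

```python
import os

def mounted_points():
    """Set of currently mounted paths, read from /proc/mounts (Linux only)."""
    with open("/proc/mounts") as f:
        return {line.split()[1] for line in f if line.strip()}

def is_mounted(path):
    """True if `path` is exactly a current mount point."""
    return os.path.realpath(path) in mounted_points()

# Example (placeholder path): is_mounted("/srv/node/d5")
```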
I302 Device Came Back
A previous E301 condition has been resolved.

E303 Device Failed
A drive has reported a failure via SMART, or has been administratively disabled. The event description will include the SMART metrics which were considered failing by the drive firmware. There is currently no corresponding event for when this condition has been resolved, but one will be added in the future.
Troubleshooting: Examine the kernel log, `/var/log/kern.log`, for corroborating error messages for the block device or filesystem. If the log indicates XFS filesystem issues, unmount the device with `sdt unmount dXX`, run `xfs_repair /dev/XXX`, then remount the device with `sdt mount dXX`. If the log indicates other errors or if `xfs_repair` runs into problems, then the device is probably failing and should be replaced or removed from the ring using the SwiftStack Controller. If there are no apparent problems from the kernel's perspective, then a SMART metric may have temporarily reported "bad". In this case, unmount the device (if the cluster is not configured to automatically unmount failing devices), then re-mount the device. This will clear the failure. If this event fires again, the drive may be failing or otherwise violating the drive firmware's expectations despite the lack of kernel error messages. In this case the drive should be replaced or removed from the ring(s).
Controller Events¶
E401 Recovery Host Unreachable
A recovery backup host (see Setting Up A Recovery Controller) cannot be contacted over the VPN.
Troubleshooting: Similar to the troubleshooting for E202 (Node Unreachable). Possible reasons for this event include a network partition between the two machines, a power failure or hardware failure on the backup, a lock-up or hang of the backup, or a problem with the SwiftStack agent software on the backup. If you are unable to locate the cause of the communication failure, please contact Technical Support.
I402 Recovery Host Reachable
A previous E401 condition has been resolved.

E403 Recovery Host Version Shear
A SwiftStack Controller recovery host is successfully communicating with its primary, but the recovery host is running an older version of the SwiftStack Controller software than its primary.
Troubleshooting: Upgrade the recovery host. In general, you should always upgrade your recovery hosts first, before upgrading your primary Controller. This allows live testing of the upgrade process without risk to production resources. It also means that any database migrations will be run first on the recovery hosts and then on the primary. This ensures that the recovery host will have a database schema at least as current as the primary, reducing the risk that the primary will produce a backup tar file including a database dump with a schema which is too recent for the recovery host to interpret.
I404 Recovery Host Version Fix
A previous E403 condition has been resolved.

E405 No Recent Tar Backup Of Configs/DB
The SwiftStack Controller has not successfully generated a .tar file backup of its configuration and database information within its configured backup interval.
Troubleshooting: Verify the backup-related settings on the Backup Settings Page. If saving backups into a Swift cluster (highly recommended), you can test your configuration using the "Verify Swift Credentials" button on that page. Try running the `backup_ssc` script manually and see whether it succeeds. Look for local backup tar files in the `BACKUP_LOCAL_PATH` directory (usually `/opt/ss/var/lib/ss-backup/...`). Look for backup files in your Swift cluster in the account corresponding to `BACKUP_SWIFT_ACCOUNT`.

I406 Sufficiently Recent Tar Backup
A previous E405 condition has been resolved, or the system can confirm that no E405 conditions exist.

E407 No Recent Rsync Of Cluster Metrics
One or more recovery backup hosts have been configured, but the primary Controller has not successfully rsynced the cluster metrics data to the given recovery host sufficiently recently. In case of an emergency, cluster metrics data would be lost.
Troubleshooting: The recovery host must be up and running, with a working sshd daemon and rsync installed. The primary Controller must have the rsync command in the crontab file (in `/opt/ss/etc/cron.d/recovery`), and that rsync command must run successfully from the command line.

I408 Sufficiently Recent Rsync
A previous E407 condition has been resolved, or the system can confirm that no E407 conditions exist for the given host.

E409 Low Disk Space On Controller
The SwiftStack Controller has low disk space on one or more of its mount points, which is likely to cause erratic behavior if not resolved soon.
Troubleshooting: The `df` command will help pinpoint the issue.

I410 Sufficient Disk Space On Controller
A previous E409 condition has been resolved, or the system can confirm that no E409 conditions exist.

E411 Controller Utilization Failing
The SwiftStack Controller has not successfully run its utilization aggregation job in over a day. Utilization API results will have gaps if this is not corrected within a few days.
Troubleshooting: Check that the `ss-crond` service is running and that `aggregate_utilization_data` is configured in `/opt/ss/etc/cron.d/ssman`. Check `/opt/ss/var/log/ssaggregate_utilization_data.log`, as well as `/var/log/messages`.

I412 Controller Utilization Running
A previous E411 condition has been resolved, or the system can confirm that no E411 conditions exist.