March 14, 2016: BbLearn System Stability

On Monday morning, automated health checks failed on several BbLearn application nodes. When a node fails its health check, it is taken out of traffic rotation until it recovers.
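The rotation behavior described above can be sketched roughly as follows. This is an illustrative model only, not the actual load balancer configuration: the `Node` class, the `apply_health_check` and `system_behavior` helpers, and the 5-second timeout are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    in_rotation: bool = True


def apply_health_check(node, response_seconds, timeout_seconds=5.0):
    """Update rotation status from one health check result.

    response_seconds is None when the check got no answer at all.
    The 5-second timeout is a hypothetical value, not the real setting.
    """
    healthy = response_seconds is not None and response_seconds <= timeout_seconds
    node.in_rotation = healthy
    return node.in_rotation


def system_behavior(nodes):
    """Summarize what users would likely see (assumed mapping)."""
    active = [n for n in nodes if n.in_rotation]
    if not active:
        return "System offline"
    if len(active) < len(nodes):
        return "System slow"
    return "System normal"


nodes = [Node(f"bbapp{i}a") for i in range(1, 8)]

# Two nodes answer too slowly and are marked down; the rest stay in rotation.
for node in nodes:
    slow = node.name in ("bbapp2a", "bbapp4a")
    apply_health_check(node, response_seconds=30.0 if slow else 0.2)
print(system_behavior(nodes))
```

In this model a node returns to rotation as soon as a later check succeeds, which matches the short two-to-three-minute windows seen during the incident.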

Users may have experienced slowness or errors during the following windows:

Start   | End     | Notes                                                           | System Behavior
--------|---------|-----------------------------------------------------------------|----------------
8:07 AM | 8:09 AM | Nodes automatically marked down (bbapp2a, bbapp4a)              | System slow
8:27 AM | 8:29 AM | Nodes automatically marked down (bbapp1a, bbapp4a, bbapp7a)     | System slow
8:35 AM | 8:38 AM | All nodes automatically marked down (bbapp1a through bbapp7a)   | System offline

Support staff are investigating these micro outages. ESYS reported that several systems experienced "extreme" slowness during this time, due to NetApp storage maintenance.

Emergency Maintenance

Further investigation determined that the micro outages had left the BbLearn cluster in an unstable state.

The BbLearn cluster uses ActiveMQ as a messaging service between nodes, and relies on a controller node (normally bbapp1a) to coordinate this. If the cluster determines that the current controller has failed, another node will be automatically selected to become the new controller.

During the NetApp storage maintenance, several systems experienced slowness that caused status checks to time out. This triggered an automatic ActiveMQ controller migration. Because bbapp1a was in fact still running, it kept its locks on several essential files and prevented other nodes from fully becoming the new controller.
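The failure mode above can be illustrated with a toy model of lock-based controller election. This is a simplified sketch and does not reflect the real ActiveMQ or BbLearn implementation: `SharedLock` and `fail_over` are hypothetical names, and an in-memory object stands in for the locked files.

```python
class SharedLock:
    """Stand-in for a lock on shared storage; only one node may hold it."""

    def __init__(self):
        self.holder = None

    def try_acquire(self, node):
        # Succeeds only if the lock is free or already held by this node.
        if self.holder is None or self.holder == node:
            self.holder = node
            return True
        return False

    def release(self, node):
        # Only the current holder can release the lock.
        if self.holder == node:
            self.holder = None


def fail_over(lock, candidate):
    """Try to promote a candidate node to controller."""
    return "controller" if lock.try_acquire(candidate) else "stuck"


lock = SharedLock()
lock.try_acquire("bbapp1a")  # bbapp1a is the normal controller

# A status check times out, so the cluster selects bbapp2a. But bbapp1a is
# still running and never released the lock, so the promotion cannot finish.
print(fail_over(lock, "bbapp2a"))

# Once bbapp1a is stopped or restarted, the lock is released and the
# promotion can complete.
lock.release("bbapp1a")
print(fail_over(lock, "bbapp2a"))
```

In this model a timed-out status check is not proof that the controller has stopped, so a candidate can be selected while the original holder still owns the lock.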

Unrelated to the above, two nodes (bbapp6a, bbapp7a) were offline for the RHEL 6 migration. This left five nodes in the cluster.

As of 9:07 PM, three nodes were attempting to become the ActiveMQ controller (bbapp1a, bbapp2a, bbapp5a). Support staff consulted with vendor support to determine best practices for stabilizing the system, and launched an emergency maintenance window:

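Start    | End      | Notes                                                   | System Behavior
---------|----------|---------------------------------------------------------|----------------
9:29 PM  | 9:52 PM  | Manual restart of faulted nodes (bbapp2a, bbapp5a)      | System slow
10:50 PM | 11:21 PM | Manual restart of all nodes (bbapp1a through bbapp5a)   | System offline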

This stabilized the system. Support staff will continue the RHEL 6 migration as scheduled.