February 23, 2017: Health check failure (bbprd3)

Between 2:26 PM and 2:49 PM, some users may have experienced minor issues caused by instability on a single application node (bbprd3). In particular, affected users may have seen unusually slow page loads and failures in the Math Image Editor Service.

Automated monitoring systems detected multiple health check failures for bbprd3 and marked the node offline at 2:36 PM. When a node is marked offline, user traffic is automatically routed to the other available nodes.
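
As an illustration only, the behavior described above can be sketched roughly as follows. This is not the actual monitoring or load balancer configuration. The failure threshold of 3, the round-robin routing, and the node names bbprd1 and bbprd2 are assumptions made for the example; the report states only that "multiple" health checks failed on bbprd3.

```python
from dataclasses import dataclass

# Assumed value: the report says only that "multiple" health checks failed.
FAILURE_THRESHOLD = 3


@dataclass
class Node:
    name: str
    consecutive_failures: int = 0
    online: bool = True


class Pool:
    """Hypothetical pool of application nodes behind a load balancer."""

    def __init__(self, names, failure_threshold=FAILURE_THRESHOLD):
        self.nodes = {name: Node(name) for name in names}
        self.failure_threshold = failure_threshold
        self._next = 0  # round-robin counter (routing policy is assumed)

    def record_health_check(self, name, healthy):
        node = self.nodes[name]
        if healthy:
            # A passing check resets the counter. It does not bring an
            # offline node back; in this sketch that is a manual step.
            node.consecutive_failures = 0
            return
        node.consecutive_failures += 1
        if node.consecutive_failures >= self.failure_threshold:
            node.online = False

    def route(self):
        # Only nodes still marked online receive user traffic.
        online = [n.name for n in self.nodes.values() if n.online]
        if not online:
            raise RuntimeError("no nodes available")
        choice = online[self._next % len(online)]
        self._next += 1
        return choice


# bbprd1 and bbprd2 are made-up names; the report mentions only bbprd3.
pool = Pool(["bbprd1", "bbprd2", "bbprd3"])
for _ in range(FAILURE_THRESHOLD):
    pool.record_health_check("bbprd3", healthy=False)
print(pool.nodes["bbprd3"].online)
print({pool.route() for _ in range(6)})
```

Once bbprd3 crosses the assumed threshold, it no longer appears among the routed nodes, which corresponds to the 2:36 PM entry in the timeline.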

Investigation of this issue is ongoing.

Timeline

Time     Notes
2:26 PM  bbprd3 begins reporting garbage collection warnings (concurrent sweep failures)
2:28 PM  worker threads begin to get stuck
2:33 PM  bbprd3 reports that 90% of its worker threads have been stuck for 300+ seconds
2:36 PM  load balancer automatically marks bbprd3 offline
2:43 PM  support staff take bbprd3 out of rotation
2:44 PM  bbprd3 automatically restarts due to OutOfMemoryError
2:49 PM  bbprd3 resumes normal operation (still out of rotation)