Distributed Learning Technologies

February 23, 2017: Health check failure (bbprd3)

Between 2:26 PM and 2:49 PM, users may have experienced minor issues caused by instability on a single application node (bbprd3). In particular, some users may have seen unusually slow page loads and failures in the Math Image Editor Service.

Automated monitoring systems detected multiple health check failures for bbprd3 and marked the node offline at 2:36 PM. When a node is marked offline, user traffic is automatically routed to the other available nodes.
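The health check behavior described above can be illustrated with a minimal sketch. This is not the actual load balancer configuration: the failure threshold, the `Node` class, the function names, and the second node name (`node-a`) are all assumptions made for illustration. The sketch only shows the general idea of marking a node offline after repeated failed checks and routing traffic to the remaining nodes.

```python
from dataclasses import dataclass

# Hypothetical threshold; the real value is not stated in this report.
FAILURE_THRESHOLD = 3


@dataclass
class Node:
    name: str
    consecutive_failures: int = 0
    online: bool = True


def record_health_check(node: Node, healthy: bool,
                        threshold: int = FAILURE_THRESHOLD) -> None:
    """Update a node's state from one health check result."""
    if healthy:
        # A passing check resets the failure counter. In this sketch,
        # bringing an offline node back online is left to operators.
        node.consecutive_failures = 0
        return
    node.consecutive_failures += 1
    if node.consecutive_failures >= threshold:
        node.online = False


def routable_nodes(nodes: list[Node]) -> list[str]:
    """Return the names of nodes that may receive user traffic."""
    return [n.name for n in nodes if n.online]
```

For example, after three consecutive failed checks against `Node("bbprd3")`, `routable_nodes` would return only the other nodes in the list.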

Investigation of this issue is ongoing.

Timeline

Incident Timeline
Time     Notes
2:26 PM  bbprd3 begins reporting garbage collection warnings (concurrent sweep failures)
2:28 PM  worker threads begin to get stuck
2:33 PM  bbprd3 reports that 90% of its worker threads have been stuck for 300+ seconds
2:36 PM  load balancer automatically marks bbprd3 offline
2:43 PM  support staff take bbprd3 out of rotation
2:44 PM  bbprd3 automatically restarts due to OutOfMemoryError
2:49 PM  bbprd3 resumes normal operation (still out of rotation)
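
The 2:33 PM entry reports a share of worker threads stuck for 300+ seconds. As a rough illustration of how such a figure could be computed, the sketch below takes a list of per-thread busy durations and returns the fraction at or above the limit. The function name and input format are hypothetical; only the 300-second figure comes from the timeline, and how bbprd3 actually produces this metric is not described in this report.

```python
# 300 seconds is taken from the 2:33 PM timeline entry.
STUCK_AFTER_SECONDS = 300


def stuck_fraction(busy_seconds: list[float],
                   stuck_after: float = STUCK_AFTER_SECONDS) -> float:
    """Return the fraction of worker threads busy for at least stuck_after seconds."""
    if not busy_seconds:
        return 0.0
    stuck = sum(1 for seconds in busy_seconds if seconds >= stuck_after)
    return stuck / len(busy_seconds)
```

With ten worker threads, nine of which have been busy for 300 seconds or longer, `stuck_fraction` returns 0.9, matching the 90% figure in the timeline.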