June 2015: BBLearn Outages Following the Oracle 12c Upgrade

BBLearn's database backend was upgraded to Oracle 12c during the June 11th maintenance window; starting that evening, the system experienced multiple intermittent outages.

Thursday, 6/11

Node bbapp5a was intermittently unresponsive between 9:41 PM and 11:52 PM.

Friday, 6/12

System continued to have intermittent outages throughout the day.

Node bbapp5a became unresponsive at 10:54 AM, followed by bbapp2a at 11:09 AM; the issue had spread to other nodes by 12:00 noon. Admins performed a rolling restart at approximately 2:56 PM.

Saturday, 6/13

System continued to have intermittent outages throughout the day.

Node bbapp5a became unresponsive at 9:28 AM, and was restarted at 11:45 AM.

Sunday, 6/14

System continued to have intermittent outages throughout the day.

Monday, 6/15

System continued to have intermittent outages throughout the day.

Node bbapp3a experienced a micro-outage from 1:56 AM to 2:02 AM and was restarted at 6:52 AM. With more support staff on duty, a full investigation into the weekend's issues began. Node stability issues and intermittent outages were observed throughout the day.

Application and database admins applied a hotfix and performed a partial system restart at 2:56 PM, and are monitoring for further incidents.

Tuesday, 6/16

System continued to have intermittent outages throughout the day.

Node bbapp3a was found unresponsive, and was restarted at 7:15 AM.

Around 5:30 PM, multiple nodes lost their connection to the database backend and became unresponsive. Admins performed an emergency restart to apply two configuration changes, and BBLearn was back online at approximately 6:30 PM. An emergency RFC detailing the specifics of the incident will be filed.

Thursday, 6/18

Emergency maintenance from 12:00 noon to 1:00 PM.