Server issues affecting the Qwilr app

Incident Report for Qwilr

Postmortem

On Tuesday, June 11th, at approximately 10:30am AEST, Qwilr experienced serious site reliability issues. Many users experienced failures when using the application and when delivering content to their customers.

Qwilr’s engineering team investigated the issue and observed CPU spikes on some of our webserver instances, but nothing that should have caused the 502 and 504 errors customers reported. Eventually we observed that some of our NodeJS Docker Pods (we run in Kubernetes) were hitting 100% CPU. Further investigation showed that these processes were taking up to 30 minutes to process a single request.

The cause turned out to be a very large payload sent to our API, which took up to 30 minutes to process. Part of the problem was code that was designed to run fast for small payloads but did not cope with a payload of this size: it filled the memory allocated to the Pod and drove the CPU to 100%.

On top of this, as a consequence of recently moving our infrastructure from Rackspace to AWS, our Kubernetes Pods lacked readiness checks, which ensure that traffic is not routed to a Pod while it is unresponsive. As a result, requests to these Pods timed out and returned 502 or 504 errors.
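To illustrate what a readiness check decides, here is a minimal TypeScript sketch. It is not Qwilr's actual implementation: the `HealthSnapshot` fields, the thresholds, and the `readinessResponse` function are all hypothetical names chosen for this example. The idea is that an HTTP readiness endpoint returns a non-2xx status when the process looks overloaded, so Kubernetes stops routing traffic to that Pod until it recovers.

```typescript
// Hypothetical snapshot of process health; field names are illustrative only.
interface HealthSnapshot {
  eventLoopLagMs: number; // how late a timer fired, a rough proxy for a busy event loop
  heapUsedBytes: number;  // memory currently in use
  heapLimitBytes: number; // memory available to the process
}

interface ProbeResponse {
  status: 200 | 503;
  body: string;
}

// Assumed thresholds; real values would need tuning per service.
const MAX_LAG_MS = 500;
const MAX_HEAP_RATIO = 0.9;

// Decide what a readiness endpoint (for example a hypothetical GET /ready) should answer.
// Kubernetes HTTP probes treat a status of 200-399 as success and anything else as failure.
function readinessResponse(health: HealthSnapshot): ProbeResponse {
  if (health.eventLoopLagMs > MAX_LAG_MS) {
    return { status: 503, body: "event loop busy" };
  }
  if (health.heapUsedBytes / health.heapLimitBytes > MAX_HEAP_RATIO) {
    return { status: 503, body: "memory pressure" };
  }
  return { status: 200, body: "ok" };
}
```

Note that a single-threaded NodeJS process pinned at 100% CPU may not answer the probe at all. In that case the probe times out, which Kubernetes also counts as a failure, so even a trivial endpoint provides the protection described above.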

By 6pm AEST on the 11th, we had deployed a code fix that resolves the root cause, so that these Pods can process such a large payload in approximately 1/10th of the time. We also set up a readiness check to make our system more robust. In addition, we are working with our API customers to find a sensible limit on payload sizes.
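A payload size limit of the kind mentioned above could be enforced before any expensive processing begins. The following TypeScript sketch shows one way to do that. The 1 MiB limit, the `checkPayloadSize` function, and the result type are assumptions made for illustration; the actual limit is still being worked out with API customers.

```typescript
// Hypothetical limit for illustration only; the real limit has not been decided.
const MAX_PAYLOAD_BYTES = 1024 * 1024; // 1 MiB

type SizeCheck =
  | { ok: true }
  | { ok: false; status: 413; message: string };

// Reject oversized request bodies up front, before any parsing or processing.
// HTTP 413 is the standard status code for a request body that is too large.
function checkPayloadSize(
  bodyBytes: number,
  maxBytes: number = MAX_PAYLOAD_BYTES
): SizeCheck {
  if (bodyBytes > maxBytes) {
    return {
      ok: false,
      status: 413,
      message: `Payload of ${bodyBytes} bytes exceeds the limit of ${maxBytes} bytes`,
    };
  }
  return { ok: true };
}
```

Rejecting early with a clear error lets the API client split or shrink its request, instead of tying up a Pod on a request that may never complete within the timeout.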

With these changes in place, we are confident that our system is more stable and resilient for the future.

Posted Jun 19, 2019 - 15:22 AEST

Resolved

The server issue has been resolved.

Posted Jun 11, 2019 - 12:32 AEST

Update

We are continuing to work on a fix for this issue.

Posted Jun 11, 2019 - 12:32 AEST

Identified

We're currently experiencing server issues, and we're investigating. You may see errors loading the app or slow load times.

Posted Jun 11, 2019 - 12:07 AEST

Monitoring

The server issue has been resolved. We will continue to monitor the situation.

Posted Jun 11, 2019 - 11:33 AEST

Identified

The server issue has been resolved. We will continue to monitor the situation.

Posted Jun 11, 2019 - 11:32 AEST

Investigating

It looks like we're currently having some server issues and we're investigating the cause. You may see errors loading the app or slow load times.

Posted Jun 11, 2019 - 10:46 AEST

This incident affected: Qwilr App.