On Tuesday, June 11th at approximately 10:30am AEST, Qwilr experienced a serious site reliability incident: many users saw failures when using the application and when delivering content to their customers.
Qwilr’s engineering team investigated and observed CPU spikes on some of our webserver instances, but nothing that on its own explained the 502 and 504 errors customers reported. We then found that some of our NodeJS Docker Pods (we run on Kubernetes) were pinned at 100% CPU, and further investigation showed that these processes were taking up to 30 minutes to process a single request.
The root cause turned out to be a very large payload sent to our API. The code handling it was designed to run fast for small payloads but degraded badly at this size: a single request took up to 30 minutes to process, filled the memory allocated to the Pod, and drove the CPU to 100%.
Compounding this, as a consequence of our recent infrastructure move from Rackspace to AWS, our Kubernetes Pods lacked readiness checks to stop traffic being routed to them while unresponsive. Requests sent to these Pods therefore timed out and returned 502 or 504 errors.
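A readiness check of the kind described here can be sketched as a Kubernetes readiness probe. This is an illustrative fragment, not our actual configuration: the container name, image, health endpoint, and port are assumptions. Until the probe succeeds, Kubernetes keeps the Pod out of the Service's endpoints, so no traffic reaches an unresponsive process.

```yaml
# Illustrative readiness probe for a NodeJS container.
# Names, paths, and ports below are placeholders, not Qwilr's real config.
containers:
  - name: api-server
    image: example/node-api:latest
    readinessProbe:
      httpGet:
        path: /healthz   # hypothetical health endpoint served by the app
        port: 3000
      initialDelaySeconds: 5   # wait before the first probe
      periodSeconds: 10        # probe every 10 seconds
      timeoutSeconds: 2        # a slow response counts as a failure
      failureThreshold: 3      # after 3 failures, stop routing traffic here
```

With a probe like this in place, a Pod stuck at 100% CPU would fail its health checks and be removed from load balancing instead of accumulating requests that end in 502s and 504s.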
By 6pm AEST on the 11th, we had deployed a code fix addressing the root cause, so that these Pods can now process such a large payload in roughly a tenth of the time, and we added a readiness check to make our system more robust. We are also working with our API customers to agree on a sensible limit for payload sizes.
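The payload-size limit mentioned above can be sketched in NodeJS as a simple guard that rejects oversized request bodies before any expensive processing begins. The function name and the 1 MiB cap are illustrative assumptions, not Qwilr's actual limit or implementation:

```javascript
// Minimal sketch: reject API payloads above a size cap up front,
// so one huge request cannot pin a Pod's CPU for many minutes.
// MAX_PAYLOAD_BYTES and checkPayloadSize are hypothetical names.
const MAX_PAYLOAD_BYTES = 1 * 1024 * 1024; // 1 MiB — illustrative cap

function checkPayloadSize(body) {
  // Buffer.byteLength measures the encoded size, not the character count.
  const size = Buffer.byteLength(body, "utf8");
  if (size > MAX_PAYLOAD_BYTES) {
    // 413 is the standard "Payload Too Large" HTTP status.
    return { ok: false, status: 413, error: "Payload Too Large" };
  }
  return { ok: true, status: 200 };
}

console.log(checkPayloadSize("x".repeat(100)));
console.log(checkPayloadSize("x".repeat(2 * 1024 * 1024)));
```

Returning a clear 413 early keeps the failure visible to the API caller, instead of surfacing later as an opaque 502 or 504 from a saturated Pod.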
As a result of the changes made in response to this incident, we are confident that our system is now more stable and resilient.