Assemble outage
Incident Report for Assemble
Postmortem

Cause

A critical part of Assemble's infrastructure is a Redis in-memory storage cluster. Assemble uses it for session storage, for caching to improve performance, and for holding the list of jobs that the Assemble queue workers need to process. At 12:50 we were automatically notified of problems connecting to Assemble and simultaneously received error logs showing that the cache database was out of memory.

Amazon Web Services do not offer any automatic scaling of memory capacity of Redis clusters, so within a few minutes, the infrastructure team manually initiated a doubling in the size of the cache database and monitored as Amazon worked to change the instance size. Assemble came back up at 13:07 and once it did, the development team were able to identify the cause of the huge and rapid increase in cache memory usage.

This was caused by a very large number of modifications in Assemble, made in a short space of time, that generated (and continue to generate) millions of jobs for our queue workers to process. These jobs rapidly used up all the remaining memory in the cache database, which then prevented Assemble from creating new sessions when users tried to access the website.

Mitigation

For the time being, the cache database will stay at double its previous size, and additional monitoring will be set up to detect slower increases in memory usage over time that take us outside our usual comfortable margins. Further, the specific type of job that was allowed to fill all the memory will be moved off the cache server and into its own dedicated queue, so that if large-scale modifications like this are made again, they will not affect other parts of the infrastructure.
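
As an illustration only, a check for slow memory growth of the kind described above could look roughly like the following sketch. It is not the monitoring Assemble actually runs. It assumes that memory samples are collected periodically (for example from the `used_memory` field that Redis reports in `INFO memory`) as `(unix_timestamp, used_bytes)` pairs, and the function name `check_memory`, the 75% usage threshold and the 24-hour projection window are all hypothetical values chosen for the example.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass
class MemoryAlert:
    kind: str    # "usage" (already high) or "trend" (growing towards full)
    detail: str


def check_memory(
    samples: Sequence[Tuple[float, int]],  # (unix timestamp, used bytes), oldest first
    max_bytes: int,                        # memory available to the cache database
    usage_threshold: float = 0.75,         # hypothetical "comfortable margin"
    min_hours_to_full: float = 24.0,       # hypothetical projection window
) -> List[MemoryAlert]:
    """Return alerts for high usage and for growth that would fill memory soon."""
    alerts: List[MemoryAlert] = []
    if not samples:
        return alerts

    latest_t, latest_used = samples[-1]

    # Absolute check: usage is already outside the comfortable margin.
    if latest_used / max_bytes >= usage_threshold:
        alerts.append(MemoryAlert("usage", f"{latest_used / max_bytes:.0%} of memory in use"))

    # Trend check: a simple linear projection from the first to the last sample.
    if len(samples) >= 2:
        first_t, first_used = samples[0]
        elapsed = latest_t - first_t
        if elapsed > 0:
            rate = (latest_used - first_used) / elapsed  # bytes per second
            if rate > 0:
                hours_to_full = (max_bytes - latest_used) / rate / 3600
                if hours_to_full <= min_hours_to_full:
                    alerts.append(
                        MemoryAlert("trend", f"projected full in {hours_to_full:.1f} hours")
                    )
    return alerts
```

The trend check is what would catch a slower increase: usage can still be well below the absolute threshold while the projection shows that memory would run out within the window.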

The infrastructure team will also investigate splitting up the session, cache and queue databases into their own Redis clusters so that a rapid growth in one area does not impact the others.
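
To make the idea concrete, the split could be expressed as one endpoint per workload, as in the sketch below. This is an assumption-laden illustration rather than a decided design: the host names are placeholders, the `RedisEndpoint` type and `endpoint_for` helper are hypothetical, and the eviction policies shown (`noeviction` and `allkeys-lru` are standard Redis `maxmemory-policy` values) are one plausible choice, not a confirmed configuration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RedisEndpoint:
    host: str
    port: int
    maxmemory_policy: str  # Redis maxmemory-policy intended for this cluster


# Hypothetical endpoints: one cluster per workload, so that rapid growth in
# one area cannot exhaust the memory used by the others.
ENDPOINTS = {
    # Sessions must not be silently dropped (assumed policy).
    "session": RedisEndpoint("sessions.redis.example.internal", 6379, "noeviction"),
    # Cached values can be recomputed, so old entries may be evicted.
    "cache": RedisEndpoint("cache.redis.example.internal", 6379, "allkeys-lru"),
    # Queued jobs must not be evicted (assumed policy).
    "queue": RedisEndpoint("queue.redis.example.internal", 6379, "noeviction"),
}


def endpoint_for(workload: str) -> RedisEndpoint:
    """Look up the cluster that a given workload should connect to."""
    try:
        return ENDPOINTS[workload]
    except KeyError:
        raise ValueError(f"unknown workload: {workload!r}") from None
```

With a layout like this, a flood of queued jobs would fill only the queue cluster, and session creation would continue to work against its own cluster.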

Other

An incident was not automatically created here in Statuspage. We will investigate why, and ensure that all the monitoring systems we have in place are correctly hooked up to Statuspage so that they supply automatic notifications to interested parties and allow Assemble Support to update incidents as they take place.

Posted Nov 17, 2022 - 15:16 GMT

Resolved
Assemble API is down
Posted Nov 17, 2022 - 12:50 GMT