Today at 8:38pm Pacific we experienced a critical outage that brought down all of our servers on both our primary and failover production environments. This outage did not result in any data loss but did disable all dashboard and API access.
At Doppler, we typically do batched rollouts to production at night, when the risk is lowest. To help prevent outages we use staged releases, where the migration scripts are run first and then our new code is released to our clusters. The problem we faced today is that this approach creates an edge case when a migration deletes a database column that active code still uses. In our latest rollout, we had a migration script that deleted a column in one of our tables. This column was not used in the soon-to-be-released code but was still used by our active deployments. When the migration script ran, all our active deployments immediately crashed because the ORM (object-relational mapper) expected a column that no longer existed.
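The failure mode described above can be reproduced in miniature. The sketch below is only an illustration under stated assumptions: it uses Python's built-in `sqlite3` as a stand-in database, and the `projects` table, the `legacy_flag` column, and both queries are hypothetical names, not our actual schema or ORM.

```python
import sqlite3

# Hypothetical queries: the running release still selects the column,
# the pending release does not.
OLD_CODE_QUERY = "SELECT id, name, legacy_flag FROM projects"
NEW_CODE_QUERY = "SELECT id, name FROM projects"


def make_db() -> sqlite3.Connection:
    """Create an in-memory database with a hypothetical 'projects' table."""
    conn = sqlite3.connect(":memory:", isolation_level=None)  # autocommit mode
    conn.execute(
        "CREATE TABLE projects (id INTEGER PRIMARY KEY, name TEXT, legacy_flag INTEGER)"
    )
    conn.execute("INSERT INTO projects (name, legacy_flag) VALUES ('demo', 1)")
    return conn


def drop_legacy_flag(conn: sqlite3.Connection) -> None:
    """Simulate the migration by rebuilding the table without the column."""
    conn.execute("CREATE TABLE projects_new AS SELECT id, name FROM projects")
    conn.execute("DROP TABLE projects")
    conn.execute("ALTER TABLE projects_new RENAME TO projects")


def query_succeeds(conn: sqlite3.Connection, sql: str) -> bool:
    """Return True if the query runs, False if the database rejects it."""
    try:
        conn.execute(sql).fetchall()
        return True
    except sqlite3.OperationalError:
        return False
```

Running `drop_legacy_flag` before the old release is retired makes `OLD_CODE_QUERY` fail while `NEW_CODE_QUERY` keeps working, which mirrors the window in which our active deployments crashed.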
Migration Scripts
We are enacting a new policy where columns cannot be deleted while any active deployment relies on them. Under the new model, a column can only be deleted once none of our deployments are using it. We expect this to take the form of two rollouts: the first transitions our deployments off the column, and the second removes the column.
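One way such a policy could be checked before a migration runs is sketched below. This is a minimal illustration, not our actual tooling: the `Deployment` type, the idea of each release declaring the columns it uses, and the `users.legacy_token` column in the usage note are all assumptions made for the example.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Deployment:
    """A hypothetical record of one active release and the columns it touches."""
    name: str
    columns_used: frozenset  # entries like "table.column"


def blocking_deployments(column: str, active: list[Deployment]) -> list[str]:
    """Names of active deployments that still rely on the column."""
    return [d.name for d in active if column in d.columns_used]


def can_drop_column(column: str, active: list[Deployment]) -> bool:
    """A column may be dropped only when no active deployment uses it."""
    return not blocking_deployments(column, active)
```

With a first rollout that ships a release no longer using `users.legacy_token`, the check keeps returning `False` for as long as any older release is still active, and only returns `True` once the older releases are gone. The second rollout, which removes the column, would then be allowed.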
Hardening Our Deployments
One of the coolest parts of building tools for developers is that we get to dogfood our own product. But it does come with its own struggles, such as Doppler relying on Doppler. Circular dependencies can be dangerous. To help prevent that, we create encrypted snapshots of our secrets in our Docker images during the build phase. This is done so that in the event an outage occurs, we can bring ourselves back up. The problem with the current approach is the timing of when those images are built.
Currently, those images are built after the migration scripts have run. We will be shifting to a new model where all images are built before running migrations, so that we have a guarantee we can access our secrets. The longest deployment step during a rollout is building our images. By building our images before running migrations, we get the added benefit of dramatically reducing the time between when the migration is run and when our updated code is released.
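The new ordering can be sketched as a small rollout driver. This is a simplified illustration under assumptions: `run_rollout`, `RolloutAborted`, and the three step callables are hypothetical names, and the real pipeline involves more than three steps.

```python
from typing import Callable, List


class RolloutAborted(Exception):
    """Raised when a step fails; records which step and what had finished."""

    def __init__(self, failed_step: str, completed: List[str]) -> None:
        super().__init__(
            f"rollout aborted at {failed_step!r}; completed steps: {completed}"
        )
        self.failed_step = failed_step
        self.completed = list(completed)


def run_rollout(
    build_images: Callable[[], None],
    run_migrations: Callable[[], None],
    release_code: Callable[[], None],
) -> List[str]:
    """Run the steps in the new order: build, then migrate, then release."""
    steps = [
        ("build_images", build_images),      # slowest step, snapshots secrets
        ("run_migrations", run_migrations),  # only runs once images exist
        ("release_code", release_code),      # follows the migration closely
    ]
    completed: List[str] = []
    for name, step in steps:
        try:
            step()
        except Exception as exc:
            # Stop immediately so later steps never run after a failure.
            raise RolloutAborted(name, completed) from exc
        completed.append(name)
    return completed
```

Because the image build comes first, a failed build aborts the rollout before any migration touches the database, and the slow build no longer sits between the migration and the code release.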
We want to apologize to all of our customers for this unacceptable outage. We take your trust in our uptime very seriously. We have learned from this incident and will continue to do everything in our power to prevent future outages.