Hello,

As some of you may have noticed, we just had about 25 minutes of downtime due to the update to Lemmy 0.19.10.

Lemmy release notes: https://join-lemmy.org/news/2025-03-19_-_Lemmy_Release_v0.19.10_and_Developer_AMA

This won’t fix YouTube thumbnails for us, as YouTube banned all IPs belonging to our hosting provider.

We had intended to apply this update without downtime. We wanted it in particular for the database migration that allows marking PMs as removed, which helps us deal with the recent spam waves.

Although this update contains database migrations, we expected to be able to apply them in the background before updating the running software, as the database schema is backwards compatible between the two versions. Unfortunately, once we started the migrations, the site started going down.

In the first few minutes we assumed that the migrations contained in this upgrade were unexpectedly blocking more than intended but were still making progress, but it turned out that nothing was actually happening on the database side. Our database had deadlocked due to what appeared to be an orphaned transaction, which did not go away even after we killed all Lemmy containers other than the one running the migrations.

While the orphaned transaction stayed open, the pending schema migration was waiting for it to either commit or roll back, so nothing was moving anymore. And because the orphaned transaction itself never moved either, everything behind it started to die. We're not entirely sure why the original transaction got stuck in the first place: it was started about 30 seconds before the schema migration query, so the migration running at the same time shouldn't have been able to break it.
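For context, here is a minimal sketch of the kind of check that reveals such a situation, using psycopg2 against pg_stat_activity. The connection parameters are placeholders, and this is not necessarily the exact query we ran: a session whose transaction has been open for a long time and that shows up in pg_blocking_pids() of the migration backend is the stuck transaction.

```python
# Sketch: list non-idle sessions, how long their transactions have been open,
# and which other backends (if any) are blocking them.
# Connection parameters are placeholders for whatever your environment uses.
import psycopg2

conn = psycopg2.connect(host="localhost", dbname="lemmy", user="postgres")
conn.autocommit = True  # read-only diagnostics, no transaction needed

with conn.cursor() as cur:
    cur.execute(
        """
        SELECT pid,
               state,
               now() - xact_start    AS xact_age,
               wait_event_type,
               pg_blocking_pids(pid) AS blocked_by,
               left(query, 80)       AS current_query
        FROM pg_stat_activity
        WHERE state <> 'idle'
        ORDER BY xact_start NULLS LAST;
        """
    )
    for row in cur.fetchall():
        print(row)

conn.close()
```

pg_blocking_pids() (PostgreSQL 9.6+) is convenient here because it reports lock-level blocking directly, without having to join pg_locks by hand.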

Lemmy also has a “replaceable” schema, which is applied separately from the regular database schema migrations and is re-run every time any DB migration is applied. We unfortunately did not account for this replaceable schema in our planning; otherwise we would have realized that it would likely have a larger impact on the overall migration.

After we identified that the database had deadlocked, we resorted to restarting our postgres container and then running the migrations again. Once the database was restarted, everything was back online in less than 30 seconds, including first running the remaining migrations and then starting up all containers again.
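For reference, a more targeted alternative to a full postgres restart is to terminate only the offending backend once its pid is known; during the incident we went with the restart. A sketch of that option, with a placeholder pid and the same placeholder connection details as above:

```python
# Sketch: terminate a single stuck backend instead of restarting postgres.
# BLOCKING_PID is whatever pid the pg_stat_activity check above identified;
# this is purely illustrative, not what we ran during the incident.
import psycopg2

BLOCKING_PID = 12345  # placeholder

conn = psycopg2.connect(host="localhost", dbname="lemmy", user="postgres")
conn.autocommit = True

with conn.cursor() as cur:
    # pg_terminate_backend() disconnects the session; its open transaction
    # is rolled back and any locks it held are released.
    cur.execute("SELECT pg_terminate_backend(%s);", (BLOCKING_PID,))
    print("terminated:", cur.fetchone()[0])

conn.close()
```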

When we tested this process on our test instance prior to deploying it to the Lemmy.World production environment, we did not run into this issue. Everything worked fine with the backend services still running Lemmy 0.19.9 while the database had already been upgraded to the Lemmy 0.19.10 schema, but the major difference there was the lack of user activity during the migration.
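One way to bring such a test closer to production conditions is to keep some artificial write activity running against the test database while the migration is applied. A rough sketch of that idea, with hypothetical connection details and a hypothetical table, not something from our actual test setup:

```python
# Sketch: generate concurrent write activity against a test database while a
# migration runs in another terminal, so lock conflicts have a chance to show up.
# Table and column names are hypothetical placeholders.
import threading
import time

import psycopg2

STOP = threading.Event()

def chatter(worker_id: int) -> None:
    conn = psycopg2.connect(host="test-db", dbname="lemmy", user="lemmy")
    conn.autocommit = True
    with conn.cursor() as cur:
        while not STOP.is_set():
            # Any cheap write touching tables the migration also touches.
            cur.execute("UPDATE person SET updated = now() WHERE id = %s;", (worker_id,))
            time.sleep(0.05)
    conn.close()

threads = [threading.Thread(target=chatter, args=(i,)) for i in range(1, 9)]
for t in threads:
    t.start()

# ... apply the migration in another terminal while this keeps the DB busy ...
time.sleep(60)
STOP.set()
for t in threads:
    t.join()
```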

Our takeaway from this is to always plan for downtime for Lemmy updates that include database migrations, as it does not appear to be possible to apply them “safely” while live, even when they seem small enough to be theoretically doable without downtime.