Back to overview

a and b were briefly unavailable (multiple times!)

Feb 11 at 01:56am HST
Affected services
a.triplepat.com
b.triplepat.com

Resolved
Feb 11 at 01:56am HST

a and b servers were briefly unavailable, twice! Once on 7 Feb 2025 and a second time on 11 Feb 2025. The a server even had a third outage on 10 Feb 2025! This did not affect any users for two reasons:

  1. every server needs to go down before the check-in service becomes unavailable, and
  2. we're not launched.

Throughout these incidents, the servers triplepat.com, c, and d remained stably up.

Cause

A bad nginx config was (repeatedly) deployed while we were adding SMTP reverse-proxy functionality to it. Our deployment process broke down, like it should, but then we flailed around a bit and accidentally allowed the deployment process to get to b.triplepat.com too, causing more servers to go down.

We treated the first occurrence like a one-off, and then when in reoccurred we took structural action.

Events leading up

An engineer asked an LLM to help them with nginx configs and it hallucinated bad advice for SMTP reverse-proxying. Because they needed the advice, they didn't realize that the responses were hallucinations, and pushed the config. (First incident, bad configs were rolled back)

On second try, they still didn't realize how large the hallucination was. It turned out to be not just a wrong directive or two, but an entirely wrong-headed approach. The LLMs were too eager to please. This led to the second and third incident.

Immediate fix

Each time things were fixed by deploying a non-broken config. A little manual work was done to try and speed the process along by the eager engineer. As one might expect, this ended up making the problem worse.

The lesson remains: trust the tools, they almost always make things happen as reliably as things can.

Permanent fix

We added a step in our deployment process that asks a separate nginx invocation to parse the config before attempting to restart the remote nginx server with the new config. Deployment now correctly crashes and halts on the attempted deployment of configs that nginx won't parse, and does so before attempting to restart the remote server with the bad config.

We also took the opportunity to add the same protection to other configured services.