Incidents | Triple Pat
Incidents reported on the status page for Triple Pat
https://status.triplepat.com/nl

triplepat.com is down
https://status.triplepat.com/nl/incident/545733
Tue, 15 Apr 2025 15:37:17 -0000
triplepat.com recovered.

triplepat.com is down
https://status.triplepat.com/nl/incident/545733
Tue, 15 Apr 2025 15:31:15 -0000
triplepat.com went down.

triplepat.com is down
https://status.triplepat.com/nl/incident/544911
Mon, 14 Apr 2025 14:10:00 -0000
The root cause is a failed nginx deployment due to what looks like a race condition and/or an overly picky health check. We are auditing the health checks (a sketch of a more tolerant check appears at the bottom of this page). Redeploying the exact same config worked, so it's clear that this failure has something to do with either races or ephemeral machine state.

triplepat.com is down
https://status.triplepat.com/nl/incident/544911
Mon, 14 Apr 2025 09:47:16 -0000
triplepat.com recovered.

triplepat.com is down
https://status.triplepat.com/nl/incident/544911
Mon, 14 Apr 2025 09:40:59 -0000
triplepat.com went down.

d.triplepat.com is down
https://status.triplepat.com/nl/incident/530727
Mon, 31 Mar 2025 10:14:00 -0000
Tilaa has network issues more frequently than we (or they!) would like. This was one of them. Our every-node-is-a-master-node redundancy system meant that everything still worked fine, however.

b.triplepat.com is down
https://status.triplepat.com/nl/incident/536573
Mon, 31 Mar 2025 10:12:00 -0000
The machine b.triplepat.com became unavailable and eventually rebooted due to [a "power event" and corresponding Google Cloud outage in its datacenter](https://status.cloud.google.com/incidents/N3Dw7nbJ7rk7qwrtwh7X). When it came back up, the containers did not all start successfully at boot time. It looks like the internal DNS for the docker-compose network didn't come up successfully, which meant that the nginx config could not resolve the internal names it uses for redirects, which meant that nginx refused to start. (Another point for https://isitdns.com)

The power event in GCP: not our fault. The DNS not working after boot: at least a little bit our fault, and certainly something we want to remediate. No user check-ins were affected, because all of the other servers were up and running fine the whole time. No data loss. Resiliency works! The only TODO item is to make sure our systems reboot cleanly.

b.triplepat.com is down
https://status.triplepat.com/nl/incident/536573
Mon, 31 Mar 2025 05:50:39 -0000
b.triplepat.com recovered.
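The "reboot cleanly" TODO above is mostly an ordering problem: nginx shouldn't be allowed to die permanently just because the compose network's resolver isn't ready yet when the machine comes back up. Below is a minimal sketch of one way to handle that, not our actual tooling: a Python wrapper used as the nginx container's entrypoint. The upstream names, timeouts, and entrypoint arrangement are illustrative assumptions.

```python
#!/usr/bin/env python3
"""Wait for docker-compose internal DNS before handing off to nginx.

Sketch only: run as the nginx container's entrypoint so that a slow internal
resolver at boot delays nginx instead of killing it. The hostnames below are
placeholders, not our real service names.
"""
import os
import socket
import sys
import time

UPSTREAMS = ["app", "api"]   # hypothetical upstream names on the compose network
DEADLINE_SECONDS = 120
RETRY_SECONDS = 2


def wait_for_dns(names, deadline, retry_delay):
    """Block until every name resolves, or give up after `deadline` seconds."""
    start = time.monotonic()
    pending = set(names)
    while pending and time.monotonic() - start < deadline:
        for name in list(pending):
            try:
                socket.getaddrinfo(name, None)
                pending.discard(name)
            except socket.gaierror:
                pass  # not resolvable yet; the compose DNS may still be starting
        if pending:
            time.sleep(retry_delay)
    return not pending


if __name__ == "__main__":
    if not wait_for_dns(UPSTREAMS, DEADLINE_SECONDS, RETRY_SECONDS):
        print("internal DNS never came up; refusing to exec nginx", file=sys.stderr)
        sys.exit(1)
    # Hand off to nginx in the foreground, as the stock image's entrypoint does.
    os.execvp("nginx", ["nginx", "-g", "daemon off;"])
```

Combined with a restart policy such as `restart: unless-stopped` on the nginx service, a transient DNS hiccup at boot becomes a delay rather than an outage.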
b.triplepat.com is down
https://status.triplepat.com/nl/incident/536573
Sat, 29 Mar 2025 19:57:28 -0000
b.triplepat.com went down.

d.triplepat.com is down
https://status.triplepat.com/nl/incident/530727
Wed, 19 Mar 2025 10:40:47 -0000
d.triplepat.com recovered.

d.triplepat.com is down
https://status.triplepat.com/nl/incident/530727
Wed, 19 Mar 2025 10:33:55 -0000
d.triplepat.com went down.

a.triplepat.com was down after an OS upgrade + reboot
https://status.triplepat.com/nl/incident/519873
Fri, 28 Feb 2025 14:23:00 -0000
We tried to upgrade the OS and reboot the a server yesterday to see whether that was a safe and quick operation. It was not. We delayed bringing it back online because we haven't launched yet and were in the middle of a major code change.

a.triplepat.com was down after an OS upgrade + reboot
https://status.triplepat.com/nl/incident/519873
Fri, 28 Feb 2025 14:18:07 -0000
a.triplepat.com recovered.

a.triplepat.com was down after an OS upgrade + reboot
https://status.triplepat.com/nl/incident/519873
Thu, 27 Feb 2025 10:20:33 -0000
a.triplepat.com went down.

d server was unreachable repeatedly
https://status.triplepat.com/nl/incident/518114
Mon, 17 Feb 2025 12:03:00 -0000
[Tilaa cloud](https://www.tilaa.com/) (where the d server is hosted) had some network issues that made [d.triplepat.com](https://d.triplepat.com) inaccessible. It was scheduled maintenance that ended up taking down more things than they expected. https://status.tilaa.com/incidents/6hfj589ys0w1 and https://status.tilaa.com/incidents/4b2nb4tdw7mc are the corresponding issues on Tilaa's incident page. It's unclear which of the two was the root cause here, as our incident started between them. But I suspect it had the same root cause as the second link, because the timestamps on the first end around when people would have gone home, and the timestamps on the second begin when people would have started work.

a and b were briefly unavailable (multiple times!)
https://status.triplepat.com/nl/incident/518112
Tue, 11 Feb 2025 11:56:00 -0000
The [a](https://a.triplepat.com) and [b](https://b.triplepat.com) servers were briefly unavailable, twice: once on 7 Feb 2025 and again on 11 Feb 2025. The a server even had a third outage, on 10 Feb 2025! This did not affect any users, for two reasons:

1. every server needs to go down before the check-in service becomes unavailable, and
2. we're not launched.

Throughout these incidents, the servers [triplepat.com](https://triplepat.com), [c](https://c.triplepat.com), and [d](https://d.triplepat.com) remained stably up.
# Cause

A bad nginx config was (repeatedly) deployed while we were adding SMTP reverse-proxy functionality to it. Our deployment process broke, as it should have, but then we flailed around a bit and accidentally let the deployment reach b.triplepat.com too, taking more servers down. We treated the first occurrence as a one-off; when it recurred, we took structural action.

# Events leading up

An engineer asked an LLM to help them with nginx configs, and it hallucinated bad advice for SMTP reverse-proxying. Because they needed the advice, they didn't realize the responses were hallucinations, and pushed the config. (First incident; the bad configs were rolled back.) On the second try, they still didn't realize how *large* the hallucination was. It turned out to be not just a wrong directive or two but an entirely wrong-headed approach. The LLMs were too eager to please. This led to the second and third incidents.

# Immediate fix

Each time, things were fixed by deploying a non-broken config. The eager engineer did a little manual work to try to speed the process along; as one might expect, this made the problem worse. The lesson remains: trust the tooling, it almost always does things as reliably as they can be done.

# Permanent fix

We added a step to our deployment process that has a separate nginx invocation parse the new config before we attempt to restart the remote nginx server with it. Deployment now halts on any config that nginx won't parse, and does so before the remote server is ever touched. We also took the opportunity to add the same protection to other configured services.
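A minimal sketch of what that pre-flight parse can look like, assuming a Python-based deploy step; the paths and target host are placeholders, and this is not our actual deploy tooling. `nginx -t -c <file>` only parses and tests the config, it never touches a running server.

```python
#!/usr/bin/env python3
"""Pre-flight parse of an nginx config before it is shipped anywhere.

Sketch of the deploy-time check described above. Paths and the remote host
are illustrative. Assumes the deploy machine can parse the config (same nginx
version, same include layout as production).
"""
import subprocess
import sys

CANDIDATE_CONFIG = "/tmp/nginx.conf.candidate"   # hypothetical rendered config
REMOTE = "b.triplepat.com"                       # example target from this incident


def nginx_config_parses(path: str) -> bool:
    """Return True iff a local nginx invocation accepts the config."""
    result = subprocess.run(
        ["nginx", "-t", "-c", path],
        capture_output=True,
        text=True,
    )
    if result.returncode != 0:
        # nginx prints parse errors on stderr; surface them and refuse to deploy.
        sys.stderr.write(result.stderr)
    return result.returncode == 0


def deploy(path: str, host: str) -> None:
    """Copy the validated config to the host and reload nginx there (sketch)."""
    subprocess.run(["scp", path, f"{host}:/etc/nginx/nginx.conf"], check=True)
    subprocess.run(["ssh", host, "sudo", "nginx", "-t"], check=True)  # belt and braces
    subprocess.run(["ssh", host, "sudo", "systemctl", "reload", "nginx"], check=True)


if __name__ == "__main__":
    if not nginx_config_parses(CANDIDATE_CONFIG):
        sys.exit("refusing to deploy: nginx will not parse this config")
    deploy(CANDIDATE_CONFIG, REMOTE)
```

Keeping the parse step on the deploy machine means a config that nginx rejects never reaches a running server at all.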
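Separately, the 14 Apr triplepat.com incident above blames a race and/or an overly picky health check during deployment. One way to make such a check less race-prone is to require the endpoint to stay unhealthy for a whole retry budget before failing the deploy, rather than failing on the first refused connection. A minimal sketch, with an illustrative URL, attempt budget, and timeouts that are assumptions rather than our real values:

```python
#!/usr/bin/env python3
"""A deploy-time health check that tolerates startup races (sketch only)."""
import time
import urllib.error
import urllib.request

HEALTH_URL = "https://triplepat.com/"   # illustrative; a dedicated health endpoint is more usual
ATTEMPTS = 10
DELAY_SECONDS = 3
PER_REQUEST_TIMEOUT = 5


def healthy(url: str, timeout: float) -> bool:
    """One probe: True iff the request completes with a non-error status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 400
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or HTTP error: count as unhealthy for now.
        return False


def wait_until_healthy(url: str) -> bool:
    """Probe repeatedly; only give up after ATTEMPTS consecutive failures."""
    for attempt in range(1, ATTEMPTS + 1):
        if healthy(url, PER_REQUEST_TIMEOUT):
            return True
        print(f"attempt {attempt}/{ATTEMPTS}: not healthy yet, retrying")
        time.sleep(DELAY_SECONDS)
    return False


if __name__ == "__main__":
    if not wait_until_healthy(HEALTH_URL):
        raise SystemExit("deploy health check failed: endpoint never became healthy")
```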