Incidents | Triple Pat
Incidents reported on the status page for Triple Pat
https://status.triplepat.com/nl

triplepat.com is down
https://status.triplepat.com/nl/incident/545733
Tue, 15 Apr 2025 15:37:17 -0000
triplepat.com recovered.

triplepat.com is down
https://status.triplepat.com/nl/incident/545733
Tue, 15 Apr 2025 15:31:15 -0000
triplepat.com went down.

triplepat.com is down
https://status.triplepat.com/nl/incident/544911
Mon, 14 Apr 2025 14:10:00 -0000
The root cause is a failed nginx deployment due to what looks like a race condition and/or an overly picky health check. We are auditing the health checks (a sketch of a more tolerant check appears at the bottom of this page). Redeploying the exact same config worked, so it's clear that this failure has something to do with either races or ephemeral machine state.

triplepat.com is down
https://status.triplepat.com/nl/incident/544911
Mon, 14 Apr 2025 09:47:16 -0000
triplepat.com recovered.

triplepat.com is down
https://status.triplepat.com/nl/incident/544911
Mon, 14 Apr 2025 09:40:59 -0000
triplepat.com went down.

d.triplepat.com is down
https://status.triplepat.com/nl/incident/530727
Mon, 31 Mar 2025 10:14:00 -0000
Tilaa has network issues more frequently than we (or they!) would like. This was one of them. Our every-node-is-a-master-node redundancy system meant that everything still worked fine, however.

b.triplepat.com is down
https://status.triplepat.com/nl/incident/536573
Mon, 31 Mar 2025 10:12:00 -0000
The machine b.triplepat.com became unavailable and eventually rebooted due to [a "power event" and corresponding Google Cloud outage in its datacenter](https://status.cloud.google.com/incidents/N3Dw7nbJ7rk7qwrtwh7X). When it came back up, the containers did not all start successfully at boot time. It looks like the internal DNS for the docker-compose network didn't come up successfully, which meant that the nginx config could not resolve the internal names it uses for redirects, which meant that nginx refused to start. (Another point for https://isitdns.com)

The power event in GCP: not our fault. The DNS not working after boot: at least a little bit our fault, and certainly something we want to remediate. No user check-ins were affected, because all of the other servers were up and running fine the whole time. No data loss. Resiliency works! The only TODO item is to make sure our systems reboot cleanly.

b.triplepat.com is down
https://status.triplepat.com/nl/incident/536573
Mon, 31 Mar 2025 05:50:39 -0000
b.triplepat.com recovered.
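The "reboot cleanly" TODO above is mostly an ordering problem: nginx shouldn't be allowed to die permanently just because the compose network's resolver isn't ready yet when the machine comes back up. Below is a minimal sketch of one way to handle that, not our actual tooling: a Python wrapper used as the nginx container's entrypoint. The upstream names, timeouts, and entrypoint arrangement are illustrative assumptions.

```python
#!/usr/bin/env python3
"""Wait for docker-compose internal DNS before handing off to nginx.

Sketch only: run as the nginx container's entrypoint so that a slow internal
resolver at boot delays nginx instead of killing it. The hostnames below are
placeholders, not our real service names.
"""
import os
import socket
import sys
import time

UPSTREAMS = ["app", "api"]   # hypothetical upstream names on the compose network
DEADLINE_SECONDS = 120
RETRY_SECONDS = 2


def wait_for_dns(names, deadline, retry_delay):
    """Block until every name resolves, or give up after `deadline` seconds."""
    start = time.monotonic()
    pending = set(names)
    while pending and time.monotonic() - start < deadline:
        for name in list(pending):
            try:
                socket.getaddrinfo(name, None)
                pending.discard(name)
            except socket.gaierror:
                pass  # not resolvable yet; the compose DNS may still be starting
        if pending:
            time.sleep(retry_delay)
    return not pending


if __name__ == "__main__":
    if not wait_for_dns(UPSTREAMS, DEADLINE_SECONDS, RETRY_SECONDS):
        print("internal DNS never came up; refusing to exec nginx", file=sys.stderr)
        sys.exit(1)
    # Hand off to nginx in the foreground, as the stock image's entrypoint does.
    os.execvp("nginx", ["nginx", "-g", "daemon off;"])
```

Combined with a restart policy such as `restart: unless-stopped` on the nginx service, a transient DNS hiccup at boot becomes a delay rather than an outage.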
b.triplepat.com is down
https://status.triplepat.com/nl/incident/536573
Sat, 29 Mar 2025 19:57:28 -0000
b.triplepat.com went down.

d.triplepat.com is down
https://status.triplepat.com/nl/incident/530727
Wed, 19 Mar 2025 10:40:47 -0000
d.triplepat.com recovered.

d.triplepat.com is down
https://status.triplepat.com/nl/incident/530727
Wed, 19 Mar 2025 10:33:55 -0000
d.triplepat.com went down.

a.triplepat.com was down after an OS upgrade + reboot
https://status.triplepat.com/nl/incident/519873
Fri, 28 Feb 2025 14:23:00 -0000
We tried to upgrade the OS and reboot the a server yesterday to see whether that was a safe and quick operation. It was not. We delayed bringing it back online because we haven't launched yet and were in the middle of a major code change.

a.triplepat.com was down after an OS upgrade + reboot
https://status.triplepat.com/nl/incident/519873
Fri, 28 Feb 2025 14:18:07 -0000
a.triplepat.com recovered.

a.triplepat.com was down after an OS upgrade + reboot
https://status.triplepat.com/nl/incident/519873
Thu, 27 Feb 2025 10:20:33 -0000
a.triplepat.com went down.

d server was unreachable repeatedly
https://status.triplepat.com/nl/incident/518114
Mon, 17 Feb 2025 12:03:00 -0000
[Tilaa cloud](https://www.tilaa.com/) (where the d server is hosted) had some network issues that made [d.triplepat.com](https://d.triplepat.com) inaccessible. It was scheduled maintenance that ended up taking down more things than they expected. https://status.tilaa.com/incidents/6hfj589ys0w1 and https://status.tilaa.com/incidents/4b2nb4tdw7mc are the corresponding issues on Tilaa's incident page. It's unclear which of the two was the root cause here, as our incident started between them. But I suspect it had the same root cause as the second link, because the timestamps on the first end around when people would have gone home, and the timestamps on the second begin when people would have started work.

a and b were briefly unavailable (multiple times!)
https://status.triplepat.com/nl/incident/518112
Tue, 11 Feb 2025 11:56:00 -0000
The [a](https://a.triplepat.com) and [b](https://b.triplepat.com) servers were briefly unavailable, twice: once on 7 Feb 2025 and again on 11 Feb 2025. The a server even had a third outage, on 10 Feb 2025! This did not affect any users, for two reasons:

1. every server needs to go down before the check-in service becomes unavailable, and
2. we're not launched.

Throughout these incidents, the servers [triplepat.com](https://triplepat.com), [c](https://c.triplepat.com), and [d](https://d.triplepat.com) remained stably up.
# Cause

A bad nginx config was (repeatedly) deployed while we were adding SMTP reverse-proxy functionality to it. Our deployment process broke, as it should have, but then we flailed around a bit and accidentally let the deployment reach b.triplepat.com too, taking more servers down. We treated the first occurrence as a one-off; when it recurred, we took structural action.

# Events leading up

An engineer asked an LLM to help them with nginx configs, and it hallucinated bad advice for SMTP reverse-proxying. Because they needed the advice, they didn't realize the responses were hallucinations, and pushed the config. (First incident; the bad configs were rolled back.) On the second try, they still didn't realize how *large* the hallucination was. It turned out to be not just a wrong directive or two but an entirely wrong-headed approach. The LLMs were too eager to please. This led to the second and third incidents.

# Immediate fix

Each time, things were fixed by deploying a non-broken config. The eager engineer did a little manual work to try to speed the process along; as one might expect, this made the problem worse. The lesson remains: trust the tooling, it almost always does things as reliably as they can be done.

# Permanent fix

We added a step to our deployment process that has a separate nginx invocation parse the new config before we attempt to restart the remote nginx server with it. Deployment now halts on any config that nginx won't parse, and does so before the remote server is ever touched. We also took the opportunity to add the same protection to other configured services.
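A minimal sketch of what that pre-flight parse can look like, assuming a Python-based deploy step; the paths and target host are placeholders, and this is not our actual deploy tooling. `nginx -t -c <file>` only parses and tests the config, it never touches a running server.

```python
#!/usr/bin/env python3
"""Pre-flight parse of an nginx config before it is shipped anywhere.

Sketch of the deploy-time check described above. Paths and the remote host
are illustrative. Assumes the deploy machine can parse the config (same nginx
version, same include layout as production).
"""
import subprocess
import sys

CANDIDATE_CONFIG = "/tmp/nginx.conf.candidate"   # hypothetical rendered config
REMOTE = "b.triplepat.com"                       # example target from this incident


def nginx_config_parses(path: str) -> bool:
    """Return True iff a local nginx invocation accepts the config."""
    result = subprocess.run(
        ["nginx", "-t", "-c", path],
        capture_output=True,
        text=True,
    )
    if result.returncode != 0:
        # nginx prints parse errors on stderr; surface them and refuse to deploy.
        sys.stderr.write(result.stderr)
    return result.returncode == 0


def deploy(path: str, host: str) -> None:
    """Copy the validated config to the host and reload nginx there (sketch)."""
    subprocess.run(["scp", path, f"{host}:/etc/nginx/nginx.conf"], check=True)
    subprocess.run(["ssh", host, "sudo", "nginx", "-t"], check=True)  # belt and braces
    subprocess.run(["ssh", host, "sudo", "systemctl", "reload", "nginx"], check=True)


if __name__ == "__main__":
    if not nginx_config_parses(CANDIDATE_CONFIG):
        sys.exit("refusing to deploy: nginx will not parse this config")
    deploy(CANDIDATE_CONFIG, REMOTE)
```

Keeping the parse step on the deploy machine means a config that nginx rejects never reaches a running server at all.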
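Separately, the 14 Apr triplepat.com incident above blames a race and/or an overly picky health check during deployment. One way to make such a check less race-prone is to require the endpoint to stay unhealthy for a whole retry budget before failing the deploy, rather than failing on the first refused connection. A minimal sketch, with an illustrative URL, attempt budget, and timeouts that are assumptions rather than our real values:

```python
#!/usr/bin/env python3
"""A deploy-time health check that tolerates startup races (sketch only)."""
import time
import urllib.error
import urllib.request

HEALTH_URL = "https://triplepat.com/"   # illustrative; a dedicated health endpoint is more usual
ATTEMPTS = 10
DELAY_SECONDS = 3
PER_REQUEST_TIMEOUT = 5


def healthy(url: str, timeout: float) -> bool:
    """One probe: True iff the request completes with a non-error status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 400
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or HTTP error: count as unhealthy for now.
        return False


def wait_until_healthy(url: str) -> bool:
    """Probe repeatedly; only give up after ATTEMPTS consecutive failures."""
    for attempt in range(1, ATTEMPTS + 1):
        if healthy(url, PER_REQUEST_TIMEOUT):
            return True
        print(f"attempt {attempt}/{ATTEMPTS}: not healthy yet, retrying")
        time.sleep(DELAY_SECONDS)
    return False


if __name__ == "__main__":
    if not wait_until_healthy(HEALTH_URL):
        raise SystemExit("deploy health check failed: endpoint never became healthy")
```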