2024/11/29

From Mew

Problem

Nobody can post or follow on TootCat, though the page seems to reload and remote updates are coming in. When trying to post, a "500" error appears in the lower right corner.

Diagnosis

  • 10:27 EST ...and yet! It turns out that, by a curious coincidence, the root volume had filled up. (This is not where we keep the database or media files.) Once Dan removed a bunch of old, unneeded snapshot files, everything started working again. Hetzner had also reported that things were recovering on their end, though, so it's hard to say whether the cleanup would have worked before that point.
    • Dan is tweaking the garbage-collection routines to catch this sort of thing in the future.
  • 09:55 EST I've confirmed that the cloud consoles are no longer accessible (disabled "to help us recover"), so this is definitely their issue.
    • ...at least for now; if things don't start working again once they're done, then we'll have to resume investigating.
  • 09:38 EST New posts from outside are also not coming in (last new message was last night at 19:48 EST) -- which would be consistent with a blockage between the back-end server and the rest of the internet. So we can see the server's current state, but it isn't communicating with the rest of fedi. I can access most of the cpanel, including moderation, but for some reason the admin panel also returns an error; I'm not sure how that fits with my theory.
  • 08:50 EST Tentatively, the web servers and the database are fine -- but I need to discuss this with our web engineer (Dan) to confirm my understanding and work out a fix. I don't want to go making things worse with a poorly conceived kluge that doesn't actually work. It looks like the load-balancer/firewall is the only thing affected, but I don't have a firm enough grasp of our architecture to be sure of that, or of how to work around it temporarily.
    • Note: I figured this out before they disabled the cloud consoles.
  • 08:14 EST The problem may be a service outage at Hetzner. We're checking possibilities.
    • Hetzner originally listed this as "API failure (datacenters & locations)" but later changed it to "Failures on API and Console".
    • Updates from Hetzner:
      • 2024-11-29 15:09 UTC+0 The recovery is in progress. The systems are recovering slowly right now.
      • 2024-11-29 14:08 UTC+0 It is currently not possible to access the Cloud Console. This is intended to help us recover.
      • 2024-11-29 12:41 UTC+0 Currently it is not possible to create or modify resources in the Hetzner Cloud. Already running resources are not affected.
        • ...except it looks like our firewall/load-balancer is out of commission. -W.
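
The full root volume found at 10:27 EST is the sort of thing a small periodic check could flag before it causes errors. What follows is only a minimal sketch of such a check, not Dan's actual garbage-collection routines: the function names, the 90% threshold, and the choice of "/" as the path to watch are all assumptions made for illustration.

```python
import shutil


def volume_usage_percent(path="/"):
    """Return how full the volume containing `path` is, as a percentage."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    # Note: this is used/total, which can differ slightly from what `df`
    # reports on filesystems that reserve blocks for root.
    return 100.0 * usage.used / usage.total


def check_volume(path="/", warn_at=90.0):
    """Return (is_over_threshold, percent_used) for the volume at `path`.

    `warn_at` is a hypothetical threshold; pick whatever suits the server.
    """
    percent = volume_usage_percent(path)
    return percent >= warn_at, percent


if __name__ == "__main__":
    over, percent = check_volume("/")
    status = "WARNING" if over else "ok"
    print(f"{status}: root volume is {percent:.1f}% full")
```

Run from cron or a systemd timer, something like this could send a warning well before the volume is completely full; how the alert is delivered is left open here.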