Difference between revisions of "2024/11/29"

From Mew
 
(6 intermediate revisions by the same user not shown)
Line 2: Line 2:
 
Nobody can post or follow on TootCat, though the page seems to reload and remote updates are coming in. When trying to post, a "500" error appears in the lower right corner.
 
 
==Diagnosis==
 
* '''08:50 EST''' Tentatively, the web servers and the database are fine -- but I need to discuss this with our web engineer (Dan) to confirm my understanding and work out a fix. I don't want to go making things worse with a poorly-conceived kluge that doesn't actually work.
+
* '''10:27 EST''' ...and yet! It turns out that, by a curious coincidence, the root volume had filled up. (This is ''not'' where we keep the database or media files.) Once Dan removed a bunch of old, unneeded snapshot files, everything started working again (although Hetzner had also reported that things were recovering on their end; hard to say if this would have worked before that point).
 +
** Dan is tweaking the garbage-collection routines to catch this sort of thing in the future.
 +
* '''09:55 EST''' I've confirmed that the cloud consoles are no longer accessible (disabled "to help us recover"), so this is definitely their issue.
 +
** ...at least for now; if things don't start working again once they're done, then we'll have to resume investigating.
 +
* '''09:38 EST''' New posts from outside are also not coming in (last new message was last night at 19:48 EST) -- which would be consistent with a blockage between the back-end server and the rest of the internet. So we can see the server's current state, but it isn't communicating with the rest of fedi. I can access most of the cpanel, including moderation, but for some reason the [https://toot.cat/admin/dashboard admin panel] also returns an error; I'm not sure how that fits with my theory.
 +
* '''08:50 EST''' Tentatively, the web servers and the database are fine -- but I need to discuss this with our web engineer (Dan) to confirm my understanding and work out a fix. I don't want to go making things worse with a poorly-conceived kluge that doesn't actually work. It looks like the load-balancer/firewall is the only thing affected, but I don't have a firm enough grasp of our architecture to be sure of that, or of how to work around it temporarily.
 +
**  '''Note''': I figured this out before they disabled the cloud consoles.
 
* '''08:14 EST''' The problem may be a [https://status.hetzner.com/incident/c7a683c0-8216-45ef-87dc-1c8574ba714d service outage at Hetzner]. We're checking possibilities.
 
 
** This was originally "API failure (datacenters & locations)" but later changed to "Failures on API and Console".
 
 +
** Updates from Hetzner:
 +
*** '''2024-11-29 15:09 UTC+0''' The recovery is in progress. The systems are recovering slowly right now.
 +
*** '''2024-11-29 14:08 UTC+0''' It is currently not possible to access the Cloud Console. This is intended to help us recover.
 +
*** '''2024-11-29 12:41 UTC+0''' Currently it is not possible to create or modify resources in the Hetzner Cloud. Already running resources are not affected.
 +
**** ...except it looks like our firewall/load-balancer is out of commission. -W.

Latest revision as of 15:44, 29 November 2024

Problem

Nobody can post or follow on TootCat, though the page seems to reload and remote updates are coming in. When trying to post, a "500" error appears in the lower right corner.

Diagnosis

  • 10:27 EST ...and yet! It turns out that, by a curious coincidence, the root volume had filled up. (This is not where we keep the database or media files.) Once Dan removed a bunch of old, unneeded snapshot files, everything started working again (although Hetzner had also reported that things were recovering on their end; hard to say if this would have worked before that point).
    • Dan is tweaking the garbage-collection routines to catch this sort of thing in the future.
  • 09:55 EST I've confirmed that the cloud consoles are no longer accessible (disabled "to help us recover"), so this is definitely their issue.
    • ...at least for now; if things don't start working again once they're done, then we'll have to resume investigating.
  • 09:38 EST New posts from outside are also not coming in (last new message was last night at 19:48 EST) -- which would be consistent with a blockage between the back-end server and the rest of the internet. So we can see the server's current state, but it isn't communicating with the rest of fedi. I can access most of the cpanel, including moderation, but for some reason the admin panel also returns an error; I'm not sure how that fits with my theory.
  • 08:50 EST Tentatively, the web servers and the database are fine -- but I need to discuss this with our web engineer (Dan) to confirm my understanding and work out a fix. I don't want to go making things worse with a poorly-conceived kluge that doesn't actually work. It looks like the load-balancer/firewall is the only thing affected, but I don't have a firm enough grasp of our architecture to be sure of that, or of how to work around it temporarily.
    • Note: I figured this out before they disabled the cloud consoles.
  • 08:14 EST The problem may be a service outage at Hetzner. We're checking possibilities.
    • This was originally "API failure (datacenters & locations)" but later changed to "Failures on API and Console".
    • Updates from Hetzner:
      • 2024-11-29 15:09 UTC+0 The recovery is in progress. The systems are recovering slowly right now.
      • 2024-11-29 14:08 UTC+0 It is currently not possible to access the Cloud Console. This is intended to help us recover.
      • 2024-11-29 12:41 UTC+0 Currently it is not possible to create or modify resources in the Hetzner Cloud. Already running resources are not affected.
        • ...except it looks like our firewall/load-balancer is out of commission. -W.
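
The 10:27 entry traces the failure to a full root volume. As a rough illustration of the kind of check that could flag this earlier, here is a minimal sketch. It is not Dan's garbage-collection change, which isn't described in this log; the function names, the "/" path and the 90% threshold are all assumptions made for the example.

```python
import shutil


def is_over_threshold(used: int, total: int, threshold: float = 0.90) -> bool:
    """Return True when used/total is at or above the threshold (0.0 to 1.0)."""
    if total <= 0:
        raise ValueError("total must be positive")
    return used / total >= threshold


def usage_fraction(path: str = "/") -> float:
    """Return the fraction of the filesystem holding `path` that is in use."""
    # shutil.disk_usage returns a named tuple (total, used, free), in bytes.
    usage = shutil.disk_usage(path)
    return usage.used / usage.total


if __name__ == "__main__":
    # Hypothetical values: "/" as the root volume, 90% as the warning level.
    usage = shutil.disk_usage("/")
    if is_over_threshold(usage.used, usage.total, 0.90):
        print(f"WARNING: root volume is {usage_fraction('/'):.0%} full")
    else:
        print(f"root volume OK ({usage_fraction('/'):.0%} used)")
```

Run from cron or a similar scheduler, a check like this would only warn; deciding which snapshot files are safe to remove is a separate step.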