Slrpnk.net admin here.
The failure seems to have been in the main firewall. If it had been the server itself, we could easily have restored it from the backups on another machine. But as it stands, remote access is entirely cut off.
There is usually another person with hardware access, but they are on summer holidays. This seemed like an acceptable risk at the time…
An off-site backup would have been nice of course, but due to the costs involved in running a Lemmy instance of that size on a rented server, it would not have been a great option either.
I have plans to add a KVM to the main firewall via a secondary connection, but even that might not have helped in this case. I’ll know more when I have physical access again.
Is it run out of a private residence? How could it happen if it’s in a real data center…?
It is run from a private residence in the DIY punk spirit (and this also allows us to run off a local solar PV system), but more or less the same would happen if you rented rack-space in a “real” data-center. Only if you rent a managed server or VPS is someone else responsible for fixing such issues, and that comes at a significantly higher cost at the scale we operate at (slrpnk is part of a bigger project that also hosts other services).
I’ve done a lot of SysAdmin and DCOps work in the past, so I thought I’d give you some plausible suggestions (I haven’t dug deep into the Lemmy DB side or the DNS/federation parts of the stack, so I’m not sure all of this is practical).
Scenario 1 - Preserve and merge when access is restored

Setup

Spin up two VMs/VPS (or one with enough grunt for two Lemmy servers). Call them robak.slrpnk.net and slrpnk.net and point DNS appropriately.

Pull federated content from other instances and place it on robak, set as read-only.

Sync important comms to the (new) slrpnk.net without content.

Allow users to sign up, vetting them as far as possible (all mods). Keep a list of those that are vetted (call it vetted.list). Inform all users that any non-vetted users will have their content dropped when access is restored.

Merge!

Once access is restored, ensure that the (old) slrpnk.net is set to read-only.

Schedule a maintenance window (announce more time than you are likely to need).

During the maintenance window, put the (new) slrpnk.net into R/O, or just block external access.
Scenario 2 - Server is in DC or Admin able to facilitate access
Appreciate the answer and the detail. Good luck getting it all resolved.