Outage - core01.ffm1.de
Incident Report for aurologic GmbH
Postmortem

On July 23, 2024, at 22:23:33, we experienced a power outage in cabinet RR.A.R02.C05. This cabinet houses a 2U ATS connected to both the A-Feed and the B-Feed, along with a single PDU connected to the ATS. The servers in this cabinet each have a single PSU, which makes the ATS necessary for power redundancy.

For historical reasons dating back to January 2023, when we moved into Tornado Datacenter FRA01, our old network cabinet is also cross-connected to cabinet RR.A.R02.C05 for power. The old network cabinet, RR.A.R03.NC01, houses a Juniper QFX virtual chassis that provides connectivity for about 120 10G links, broken out onto numerous MTP cassettes and MTP cables (40G).

These links connect dozens of customer servers with dedicated 10 Gbps capacity to the Juniper QFX virtual chassis, which also serves as the announcement originator for numerous prefixes announced at FFM1 (Tornado Datacenter FRA01).

Following the power outage of cabinet RR.A.R02.C05, power in RR.A.R03.NC01 was also lost, impacting the operation of the Juniper QFX VC.

Monitoring alerts quickly informed us about the ongoing outage. An on-site technician, who was racking new customer servers at the time, was called and temporarily powered the network cabinet from the generator power of Rack-Room A, which is not UPS-protected. This allowed us to boot the affected equipment and quickly restore operation.

Operation of core01.ffm1.de (QFX) was impacted between 22:26 and 22:50 CEST. The cabinet is now also cross-connected to a second cabinet until we migrate the prefixes hosted on that device to core01.ffm1.de (Arista gear, different cabinet). This migration has long been planned but was delayed due to time and priority constraints. We will give it higher priority and plan to take a second network cabinet with core02.ffm1.de (also Arista) into operation by September 2024.

Operation of cabinet RR.A.R02.C05 was restored at 22:56:21, after on-site personnel switched both fuses back on. Tornado Datacenter event logs for the A-Feed UPS of UV01 in Rack-Room A, recorded at the time of the outage, suggest that a high short-circuit current may have tripped both RR.A.R02.C05 fuses, causing the UPS to transfer into Bypass mode and turn off the inverter automatically. This event did not lead to a power cut for the other connected cabinets (RR.A.R01.C01-C10 and RR.A.R02.C01-C10), which continued normal operation without noticeable impact on A-Feed availability. It appears the UPS vendor ensured a seamless load transfer when the inverter drops off due to a short circuit.
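
For reference, the durations implied by the timestamps in this postmortem can be worked out with a few lines of Python. This is a minimal sketch, not part of our tooling: it assumes that every time of day quoted above is CEST (only the QFX impact window is explicitly labelled as such), and the helper name `at` is purely illustrative.

```python
from datetime import datetime, timedelta, timezone

# CEST is UTC+2; a fixed offset is sufficient for a single summer date.
CEST = timezone(timedelta(hours=2), "CEST")


def at(hms: str) -> datetime:
    """Parse HH:MM or HH:MM:SS on the incident date, assumed to be CEST."""
    fmt = "%H:%M:%S" if hms.count(":") == 2 else "%H:%M"
    t = datetime.strptime(hms, fmt).time()
    return datetime(2024, 7, 23, t.hour, t.minute, t.second, tzinfo=CEST)


# QFX impact window as stated in the report.
qfx_impact = at("22:50") - at("22:26")

# Power loss in RR.A.R02.C05 until both fuses were switched back on.
cabinet_outage = at("22:56:21") - at("22:23:33")

# Start of the QFX impact expressed in UTC, the zone used by the update timestamps.
qfx_start_utc = at("22:26").astimezone(timezone.utc)

print(qfx_impact)                            # 0:24:00
print(cabinet_outage)                        # 0:32:48
print(qfx_start_utc.strftime("%H:%M UTC"))   # 20:26 UTC
```

Under that assumption, the QFX impact lasted 24 minutes and cabinet RR.A.R02.C05 was without power for just under 33 minutes.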

Posted Jul 23, 2024 - 22:04 UTC

Resolved
Issue is resolved in both racks. Old network rack is now connected to another rack as well. Migration of the QFX stack is already planned.
Posted Jul 23, 2024 - 21:13 UTC
Update
Power for C05 is restored and we're actively looking for the cause. A temporary connection has been made for the old network rack onto another feed. A thermal check of the fuses didn't reveal an issue.
Posted Jul 23, 2024 - 20:53 UTC
Update
The cabinet is back online, but we have another one down: RR.A.R02.C05, which is without power on the ATS. The old network cabinet was cross-connected to RR.A.R02.C05, which is why it was impacted. On-site personnel are currently looking into C05, as the fuse is tripping randomly, which suggests a failed PSU.
Posted Jul 23, 2024 - 20:51 UTC
Update
It appears the cabinet has lost power, possibly due to an electrical issue with one of the devices. Customers running on the new Arista router are fine, but the old Juniper QFX stack is down. We're currently working on providing temporary power.
Posted Jul 23, 2024 - 20:41 UTC
Identified
We're currently dealing with an outage of the old core01.ffm1.de router (Juniper QFX). Personnel are already on-site and looking into the issue.
Posted Jul 23, 2024 - 20:37 UTC
This incident affected: Network Infrastructure (Core Network).