Connectivity Loss - core01.ffm3
Incident Report for aurologic GmbH
Postmortem

On June 7th, we experienced an outage of router core01.ffm3.de at the Interxion (Digital Realty) FRA16 site. This equipment, consisting of two Juniper QFX Virtual Chassis members, provides connectivity to 10/40Gbps subscribers, including IP transit and Layer 2 transport.

In addition to serving downstream customers, this equipment also connects to our infrastructure hosting websites such as aurologic.com, my.aurologic.com, and our API. This makes core01.ffm3.de a critical component of our operations, having provided network connectivity reliably for several years without significant issues.

We are currently undertaking an internal project to move our websites to a multi-datacenter anycast infrastructure. This will eliminate the single point of failure at core01.ffm3.de by distributing incoming requests across different load balancers at Tornado Datacenter, Interxion, and Equinix. Given the business-critical nature of API connectivity for our customers, we strive for high availability, which we regrettably failed to achieve in this instance.
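
For illustration, one common way to implement this pattern is to run a health check next to each site's load balancer and have a local BGP speaker announce the shared anycast prefix only while the check passes. The sketch below is a minimal example of that idea, not our production tooling: it assumes a speaker such as ExaBGP that reads routing commands from a helper process's stdout, and the prefix and health-check URL are placeholders.

    #!/usr/bin/env python3
    # Minimal anycast health-check sketch (illustrative only; the prefix,
    # health URL, and next-hop are placeholders, not production values).
    # Intended to run under a BGP speaker such as ExaBGP, which reads
    # routing commands from this process's stdout.
    import time
    import urllib.request

    PREFIX = "192.0.2.10/32"                     # shared anycast service address (example)
    HEALTH_URL = "http://127.0.0.1:8080/health"  # local load balancer check (example)

    def healthy() -> bool:
        try:
            with urllib.request.urlopen(HEALTH_URL, timeout=2) as resp:
                return resp.status == 200
        except OSError:
            return False

    announced = False
    while True:
        ok = healthy()
        if ok and not announced:
            # Local load balancer is healthy: attract traffic to this site.
            print(f"announce route {PREFIX} next-hop self", flush=True)
            announced = True
        elif not ok and announced:
            # Local load balancer failed: withdraw so traffic shifts to
            # the remaining sites automatically.
            print(f"withdraw route {PREFIX} next-hop self", flush=True)
            announced = False
        time.sleep(5)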

When the equipment became unreachable on June 7, 2024, we promptly dispatched one of our operations engineers to the Interxion (Digital Realty) site, achieving a response time of about 30 minutes. This is significantly better than the average two-hour response time offered by the data center operator for remote hands requests.

Upon arrival, our engineer found both devices in an inoperable state, preventing any restoration of routing functionality. We decided to perform a hard reboot of both devices, which temporarily restored normal operation.

However, the routing functionality failed again shortly afterward when the RPD daemon on the equipment logged an assert() condition, indicating a software bug in Juniper JunOS. Restarting the RPD daemon three more times led to repeated crashes as BGP sessions resumed.
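
Each of those restarts had to be performed by hand. The generic pattern a process supervisor would apply is a restart loop with backoff and a retry cap, as in the minimal sketch below (the daemon command is a placeholder, not JunOS tooling). Note that in this incident even automatic restarts would have crash-looped, because the triggering route was re-advertised each time the BGP sessions came back up.

    #!/usr/bin/env python3
    # Generic daemon watchdog sketch: restart a crashed process with
    # exponential backoff and a retry cap. The command is a placeholder;
    # this illustrates the pattern, it is not JunOS tooling.
    import subprocess
    import time

    CMD = ["/usr/sbin/example-routing-daemon"]  # placeholder daemon
    MAX_RESTARTS = 5

    backoff = 1.0
    for attempt in range(MAX_RESTARTS):
        proc = subprocess.run(CMD)
        if proc.returncode == 0:
            break  # clean exit, nothing to do
        print(f"daemon crashed (rc={proc.returncode}), "
              f"restarting in {backoff:.0f}s (attempt {attempt + 1})")
        time.sleep(backoff)
        backoff = min(backoff * 2, 60.0)
    else:
        print("giving up: daemon keeps crashing, paging an operator")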

Upon analyzing the RPD core dump, we identified the exact cause: a known bug in Juniper JunOS related to certain route attributes. Our analysis suggests that a downstream customer likely triggered the issue. In our opinion, handling this condition with an assert() that crashes the RPD is poor software engineering practice; logging the problematic route instead of crashing and disrupting dozens of BGP sessions would have been the better approach. Additionally, Juniper JunOS does not automatically restart the RPD daemon after a crash, so manual intervention is required to restore functionality. Such behavior is not acceptable for enterprise-grade equipment managing network infrastructure.
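
To make the critique concrete: RFC 7606 ("Revised Error Handling for BGP UPDATE Messages") standardizes a treat-as-withdraw approach for many attribute errors, which contains the damage to the offending route instead of the whole daemon. The sketch below contrasts the two failure modes in Python; it is purely illustrative (rpd is not written in Python, and the validation and install helpers are hypothetical stand-ins).

    import logging

    log = logging.getLogger("routing")

    # attribute_is_valid() and install() are hypothetical stand-ins for
    # the real attribute parser and RIB/FIB installation paths.
    def attribute_is_valid(route) -> bool:
        return isinstance(route.get("attrs"), dict)

    def install(route):
        log.info("installed %s", route.get("prefix"))

    # Crash-on-bad-input style, as in the rpd behavior we hit: one
    # malformed attribute aborts the entire daemon and every BGP
    # session it serves.
    def process_update_fragile(route):
        assert attribute_is_valid(route), "malformed route attribute"
        install(route)

    # Defensive style: validate, log the offending route, discard it,
    # and keep the daemon (and all other sessions) running.
    def process_update_robust(route):
        if not attribute_is_valid(route):
            log.error("discarding route with malformed attribute: %r", route)
            return
        install(route)

    if __name__ == "__main__":
        logging.basicConfig(level=logging.INFO)
        bad = {"prefix": "198.51.100.0/24", "attrs": None}
        process_update_robust(bad)   # logs the route and carries on
        process_update_fragile(bad)  # raises AssertionError: daemon dies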

We decided to patch the network devices with the latest Juniper JunOS version. This process was challenging due to access issues arising from our security protocols. We are already working on an internal project to separate out-of-band access onto a dedicated infrastructure, and we will use this incident as a learning opportunity to expedite this process.
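
The point of that separation is that management access must never depend on the production data plane it is meant to rescue. One simple safeguard, sketched below with assumed placeholder hostnames, is to continuously probe each device's dedicated out-of-band SSH endpoint so a broken management path is noticed before an incident rather than during one.

    #!/usr/bin/env python3
    # Out-of-band reachability probe sketch: verify that each device's
    # dedicated management endpoint answers on SSH, independently of the
    # production network. Hostnames are placeholders, not real systems.
    import socket

    OOB_ENDPOINTS = {
        "core01.ffm3-fpc0": ("oob-core01-fpc0.example.net", 22),
        "core01.ffm3-fpc1": ("oob-core01-fpc1.example.net", 22),
    }

    def reachable(host: str, port: int, timeout: float = 3.0) -> bool:
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return True
        except OSError:
            return False

    for name, (host, port) in OOB_ENDPOINTS.items():
        state = "ok" if reachable(host, port) else "UNREACHABLE"
        print(f"{name} ({host}:{port}): {state}")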

Once both devices were patched, we regained stable operational status, restoring connectivity for our customers and our own infrastructure.

Posted Jun 08, 2024 - 11:31 UTC

Resolved
The issue has been resolved; both devices are now running newer firmware. We will issue a post-mortem later on, especially as this was a long-running incident with severe impact on our operations. Customers may open a ticket to discuss anything else. Thanks for your understanding.
Posted Jun 07, 2024 - 13:17 UTC
Update
Pushing bundle xxx to fpc1; apologies for Juniper being a total failure. We're still working on it.
Posted Jun 07, 2024 - 12:44 UTC
Update
We had severe issues getting an SSH connection to patch the firmware, owing to our security concept. The firmware upgrade is in progress.
Posted Jun 07, 2024 - 12:04 UTC
Update
The devices crashed again, and we see signs of a Juniper JunOS bug in the logs. We're performing an emergency change and upgrade now.
Posted Jun 07, 2024 - 11:42 UTC
Monitoring
Connectivity has been restored. We're checking the equipment and customer connections.
Posted Jun 07, 2024 - 11:38 UTC
Update
Both devices were in an inoperable state, causing control-plane and Layer 3 functionality to get stuck. A remote reboot would not have been possible. We are currently waiting for a clean reboot of both devices and are confident the issue will be resolved afterwards.
Posted Jun 07, 2024 - 11:33 UTC
Update
The technician is on-site and will shortly be at our cabinet; we'll keep you posted once we know more.
Posted Jun 07, 2024 - 11:10 UTC
Update
The technician reports he will be on-site at 13:06 (CEST). We'll keep you posted here.
Posted Jun 07, 2024 - 10:50 UTC
Update
A technician has been dispatched to the site and will arrive in approximately 30 minutes. We will update this incident as soon as new information is available.
Posted Jun 07, 2024 - 10:28 UTC
Investigating
We're currently experiencing a connectivity loss on our router core01.ffm3 (Interxion). We are investigating the cause of the issue and will update this incident as soon as new information is available.
Posted Jun 07, 2024 - 10:17 UTC
This incident affected: Network Infrastructure (Core Network).