On June 7, 2024, we experienced an outage of router core01.ffm3.de at the Interxion (Digital Realty) FRA16 site. This equipment, a Virtual Chassis of two Juniper QFX members, provides connectivity to 10/40Gbps subscribers, including IP transit and Layer 2 transport.
In addition to serving downstream customers, this equipment also connects to our infrastructure hosting websites such as aurologic.com, my.aurologic.com, and our API. This makes core01.ffm3.de a critical component of our operations, having provided network connectivity reliably for several years without significant issues.
We are currently undertaking an internal project to move our websites to a multi-datacenter anycast infrastructure. This will eliminate the single point of failure at core01.ffm3.de by distributing incoming requests across different load balancers at Tornado Datacenter, Interxion, and Equinix. Given the business-critical nature of API connectivity for our customers, we strive for high availability, which we regrettably failed to achieve in this instance.
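As a rough illustration of the failover principle behind such an anycast setup, the sketch below shows the per-site decision logic that is typically paired with a BGP speaker: the service prefix is announced only while the local load balancer passes its health checks, so traffic shifts to the remaining sites when one fails. All names and thresholds here are hypothetical, not our actual configuration.

```python
def decide_announcement(recent_checks: list[bool], min_healthy: int = 3) -> str:
    """Announce the anycast prefix only if the last `min_healthy`
    health checks against the local load balancer all succeeded;
    otherwise withdraw it so traffic shifts to the other sites.

    Illustrative sketch: in practice the announce/withdraw action
    would be carried out by a local routing daemon.
    """
    if len(recent_checks) >= min_healthy and all(recent_checks[-min_healthy:]):
        return "announce"
    return "withdraw"
```

For example, `decide_announcement([True, True, True])` keeps the prefix announced, while a single recent failed check withdraws it until the site proves healthy again.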
When the equipment became unreachable on June 7, 2024, we promptly dispatched one of our operations engineers to the Interxion (Digital Realty) site, achieving a response time of about 30 minutes. This is significantly better than the average two-hour response time offered by the data center operator for remote hands requests.
Upon arrival, our engineer found both devices in an inoperable state, preventing any restoration of routing functionality. We decided to perform a hard reboot of both devices, which temporarily restored normal operation.
However, the routing functionality failed again shortly afterward when the RPD daemon on the equipment logged an assert() condition, indicating a software bug in Juniper JunOS. Restarting the RPD daemon three more times led to repeated crashes as BGP sessions resumed.
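The restart loop we performed by hand is essentially a supervisor with crash-loop detection. A generic sketch of that pattern follows; this is hypothetical illustration code, not anything running on JunOS, and `start_process`/`is_alive` are injected stand-ins for real process management.

```python
import time


def supervise(start_process, is_alive, max_restarts: int = 3,
              backoff_base: float = 2.0, sleep=time.sleep) -> int:
    """Restart a crashed daemon with exponential backoff, giving up
    after `max_restarts` consecutive crashes. A crash loop usually
    points at a deterministic trigger (in our case, a route that
    reappeared with every BGP session re-establishment), so restarting
    forever only churns sessions. Returns the number of restarts
    performed. `start_process` and `is_alive` are injected callables
    so the sketch stays testable."""
    restarts = 0
    start_process()
    while restarts < max_restarts:
        if is_alive():
            return restarts  # daemon settled; supervisor done (simplified)
        restarts += 1
        sleep(backoff_base ** restarts)  # back off before the next attempt
        start_process()
    return restarts
```

Even a supervisor this simple would have restored forwarding faster than paging an engineer for each crash, which is part of why the lack of automatic RPD restart surprised us.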
Upon analyzing the RPD core dump, we identified the exact cause as a known bug in Juniper JunOS related to certain route attributes. Our analysis suggests that a downstream customer likely triggered the condition. In our opinion, handling it with an assert() that crashes RPD is poor software engineering practice: logging the problematic route and dropping it would have avoided disrupting dozens of BGP sessions. Additionally, JunOS does not automatically restart the RPD daemon after a crash, so manual intervention is required to restore functionality. Such behavior is not acceptable for enterprise-grade equipment managing network infrastructure.
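The log-and-drop behaviour we would have preferred treats a malformed route as bad input data rather than as an internal invariant violation. A minimal sketch of that distinction, using a hypothetical parser and data model (nothing here reflects Juniper's internals):

```python
import logging

logger = logging.getLogger("route-parser")


def process_updates(routes, validate):
    """Process a batch of route updates, logging and dropping any
    route whose attributes fail validation instead of asserting and
    taking the whole daemon down along with every BGP session it
    holds. `routes` and `validate` are illustrative stand-ins."""
    accepted = []
    for route in routes:
        try:
            validate(route)
        except ValueError as err:
            # Malformed attributes are external input to reject, not a
            # programming error: record the offender and keep going.
            logger.warning("dropping malformed route %r: %s", route, err)
            continue
        accepted.append(route)
    return accepted
```

With this approach a single bad announcement costs one logged, dropped route instead of a daemon crash and a full session flap for every peer.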
We decided to patch the network devices with the latest Juniper JunOS version. This process was challenging due to access issues arising from our security protocols. We are already working on an internal project to separate out-of-band access onto a dedicated infrastructure, and we will use this incident as a learning opportunity to expedite this process.
Once both devices were patched, we regained stable operational status, restoring connectivity for our customers and our own infrastructure.