core01.ffm3.de - forwarding-engine segfault
Incident Report for aurologic GmbH
Postmortem

On 12 July 2024 at 20:47 CEST, we configured EVPN-VXLAN transport for a new customer at Tornado Datacenter FRA01, connecting third-party ports over our network. This is a standard procedure that uses predefined configuration templates, already in use on multiple other devices, to bring up transport interfaces.

Following the configuration change, forwarding-engine crashed on both Juniper QFX Virtual Chassis members of router core01.ffm3.de. This is the daemon that also handles interface states on Juniper JunOS. As a result of the crash, interfaces towards customers and other equipment flapped, causing BGP sessions to reset for a short period of time.

This event led to minor unavailability for customers connected to this router, as well as for our own infrastructure. The daemon restarted automatically and a core dump was created. This appears to be a bug in Juniper JunOS that was introduced with the previous patch in June, which fixed a crash of the rpd daemon that occurred whenever a bad route was received.

This is quite unfortunate, and it is the second time we have had an issue with this router due to software quality problems caused by bad Juniper Networks QA and the rubbish quality of their products. Following the crash, we will stop provisioning new transport customers on this equipment, plan an upgrade and, ultimately, replace both devices with Arista Networks equipment, which has proven to be A) more stable, B) so far bug-free in EVPN-VXLAN configurations and C) less expensive.

TL;DR: don't buy Juniper Networks gear. It will make your day worse, and the company is now even owned by a printer supplier, which says it all.

Posted Jul 12, 2024 - 20:18 UTC

Resolved
Upon configuring EVPN-VXLAN on core01.ffm3.de for a new transport customer, forwarding-engine segfaulted and caused device links to flap twice in a row. It appears to be a bug in Juniper JunOS. We have stopped configuring the devices and will move the customer onto other equipment.
Posted Jul 12, 2024 - 19:03 UTC
This incident affected: Network Infrastructure (Core Network).