Between 22:15 and 22:35, we experienced a network disturbance at Interxion Frankfurt which led to partial connectivity issues. We are currently analysing what happened.
Update: According to our first analysis, the edge/core router pair at Interxion dropped all BGP sessions, which led to immediate connectivity issues. One of the sessions to RETN is still flapping; we have turned off announcements for now.
Update: It seems a loop between the two sites caused the disturbance. We have also identified flapping sessions at our FFM2 (Interwerk) site.
Update: The above outage was an unfortunate coincidence of several infrastructure failures at the same time. The first was a DDoS attack against one of our customers at Interwerk. This attack arrived at our Interxion site with around 14 Gbit/s. Unfortunately, while the attack was ongoing, an incorrect next-hop was set towards the DDoS filters - possibly due to a bug in the BGP software we use on our anti-DDoS infrastructure. As a result, all traffic was sent back to Core-Backbone, which then sent it back to us. This caused a flapping BGP session and several LACP-related flaps, completing the chaos: a large amount of traffic was looping between our edge and core routers, overloading the routers' control planes.
Due to that, the session flap affected both sites and caused high CPU usage on all routers within our network, leading to BGP flaps on several sessions. As a first reaction, we have implemented a fix on the network equipment at Interxion which corrects the next-hop, even if it is set completely incorrectly. Next week, we will update the BGP daemon to avoid such a situation entirely in the future.
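For illustration, the next-hop correction described above could look roughly like the following routing-policy filter. This is only a minimal sketch assuming a BIRD-style BGP daemon; the filter name and all addresses are hypothetical placeholders, not our actual configuration:

```
# Hypothetical sketch of a next-hop sanity filter (BIRD-style syntax).
# 192.0.2.1 stands in for the wrongly set next hop pointing back at the
# upstream; 198.51.100.10 stands in for the DDoS scrubbing appliance.
filter fix_scrubber_nexthop
{
    # If a route's next hop points back towards the upstream, rewrite it
    # to the scrubbing appliance so traffic cannot loop between the
    # edge and core routers.
    if bgp_next_hop = 192.0.2.1 then {
        bgp_next_hop = 198.51.100.10;
    }
    accept;
}
```

Applying such a filter on import at the edge ensures that even a misbehaving anti-DDoS announcement cannot redirect traffic back into the backbone.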
We are deeply sorry for any inconvenience caused. If you have further questions, feel free to open a ticket.
kvm25 has been offline since 04:34, as either the network controller or the RAID controller has failed:
"POST Error: 101-I/O ROM Error"
This indicates a hardware issue with one of the PCIe cards. We are working on a replacement.
Update: We are replacing the whole box, as all attempts to get it working again have failed. We will keep you updated.
Update: The problem was resolved.
Friday 19th April 2019
No incidents reported
Thursday 18th April 2019
No incidents reported
Wednesday 17th April 2019
Outage Access Switch - rackB5-1.2.4-asw1-ffm3
We had a short outage of rackB5-1.2.4-asw1-ffm3 as it became unresponsive after a regular configuration change. The device has been rebooted; total downtime was around 10 minutes.
We do not expect the issue to recur; if it does, we will swap the switch for a spare unit.