So, you’ve surely seen some interesting tidbits in the previous section, things you haven’t seen in other configurations on the Internet. I will outline why these are present in this configuration, based on the failure scenario I present below:

Complete and total loss of spine connections on a single leaf switch – First I’ll outline the ONLY reasons why a single leaf switch would lose all of its spine uplinks:

  1. Total and absolute failure of the entire leaf switch
  2. The 40GbE GEM card has failed, but the rest of the switch remains operational
  3. An isolated ASIC failure affecting only the GEM module
  4. Someone falls through a single cable tray in your data center, taking out all the connections you placed in a single tray
  5. Total and complete failure of all 40GbE QSFP+ modules, at the same time
  6. Total loss of power to either the leaf switch or to all spine switches
  7. All three line cards, in three different spine switches, at the same time, suffer the same failure
  8. Someone reloaded the spine switches at the same time
  9. Someone made a configuration change and hosed your environment

OK, now, let’s make one thing clear: NO one, and I mean no one, can prevent any issue which starts with “Someone”; you can’t fix stupid. If you lose power to both of your 9396PX power supplies or to the 3+ PSUs in the 9508 spine switches, I think your problem is much larger than you care to believe. Let’s see, we now have just 5 scenarios left.

If your leaf switch just dies, well, you know. Down to four! Yes, a GEM card can fail, I’ve seen it. It isn’t common and is usually related to an issue which will down the entire switch anyway, but we’ll keep that one in our hat. Failure of all the connected QSFP+ modules at the same time? I’ll call BS on this; if all of those QSFP+ modules have failed, your switch is on the train towards absolute failure anyway.

Isolated ASIC failure? So uncommon I feel stupid mentioning it. All three line cards in the spine failing at the same time? Yeah, right. So, in all, the only real valid scenario we’re looking to handle is a GEM card failure which doesn’t also mean your switch is dead. Please note, however, that I am only providing this as proof of concept and I don’t think anyone should allow their environment to operate in a degraded state. If your environment’s operating status is important to you, consider a different choice of leaf switch for greater redundancy, a cold or warm backup switch, or at least 24x7x4 Cisco Smartnet.

When you have a leaf switch suffering from a failure of all the spine uplinks, your best course of action, on a vPC enabled VTEP, is to down the vPC itself on the single leaf switch experiencing the failure. This is where the tracking objects against the IP route, and the tracking list which groups them for use within the event manager, come into play. Once all the links have gone down, which the tracking list detects with a boolean AND over the removal of each BGP host address from the routing table, the event manager applet named “spine down” initiates and shuts down the vPC, loopback0, and the NVE interface, in that order.
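
The trigger logic above can be modeled in a few lines of Python. This is only a sketch of the decision, not NX-OS configuration: the spine names, the `spine_down_actions` helper, and the action strings are hypothetical stand-ins for the track objects, the track list, and the applet actions from the previous section.

```python
# Order in which the "spine down" applet shuts things down, per the text above.
SHUTDOWN_ORDER = ["vPC", "loopback0", "NVE"]


def all_spines_lost(route_present):
    """Boolean AND across tracked objects.

    route_present maps a (hypothetical) spine name to True when that spine's
    BGP host route is still in the routing table. The result is True only
    when every tracked route is gone; an empty mapping never triggers.
    """
    return bool(route_present) and all(
        not present for present in route_present.values()
    )


def spine_down_actions(route_present):
    """Return the shutdown steps, or an empty list if any uplink survives."""
    if all_spines_lost(route_present):
        return [f"shutdown {name}" for name in SHUTDOWN_ORDER]
    return []


# One surviving uplink is enough to leave the leaf alone.
print(spine_down_actions({"spine1": True, "spine2": False, "spine3": False}))
# All three routes withdrawn: the applet fires.
print(spine_down_actions({"spine1": False, "spine2": False, "spine3": False}))
```

The point of the AND is that losing one or two uplinks does nothing; only the loss of every tracked route takes the vPC member out of service.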

When all the links return to operation, there is a 12-second delay, configured for our environment to allow the BGP peers to reach the established state, and then the next event manager applet named “spine up” initiates, basically just “un-shutting” the interfaces in the exact same order. The source-interface hold-down-timer in the NVE interface configuration brings the NVE interface UP but keeps the loopback0 interface down long enough to ensure EVPN updates have been received and the vPC port-channels come to full UP/UP status. If this didn’t happen, and the loopback0 and port-channels came up way too soon before the NVE interface, we’d blackhole traffic from the hosts towards the fabric. If the NVE and loopback0 interfaces come up too long before the port-channels, you’ll blackhole traffic in the network-to-access direction; thus, timing is critical and will vary per environment, so testing is required.
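
To make the two timing failure modes concrete, here is a deliberately simplified Python model. It is a sketch under my own assumptions, not a measurement tool: the `blackhole_risks` helper, the `tolerance` parameter, and the example timestamps are all hypothetical, and real behavior depends on EVPN convergence in your fabric.

```python
def blackhole_risks(t_nve, t_loopback, t_port_channels, tolerance=0.0):
    """Classify blackhole risk from interface come-up times.

    All times are seconds after the spine uplinks return. `tolerance` is an
    assumed gap the environment can absorb without dropping traffic.
    """
    risks = []
    # loopback0 and the port-channels are up well before the NVE interface:
    # hosts can send, but the VTEP cannot yet forward into the fabric.
    if max(t_loopback, t_port_channels) + tolerance < t_nve:
        risks.append("host-to-fabric")
    # NVE and loopback0 are up well before the port-channels: the VTEP is
    # reachable from the fabric, but the hosts behind it are not.
    if max(t_nve, t_loopback) + tolerance < t_port_channels:
        risks.append("network-to-access")
    return risks


# Hypothetical timings, with a 5-second tolerance.
print(blackhole_risks(t_nve=30, t_loopback=12, t_port_channels=13, tolerance=5))
print(blackhole_risks(t_nve=12, t_loopback=13, t_port_channels=60, tolerance=5))
print(blackhole_risks(t_nve=12, t_loopback=20, t_port_channels=22, tolerance=5))
```

Only the third set of timings comes back clean, which is why the delay and hold-down values need to be tested per environment rather than copied.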

A lot of stuff, right? This is all done to prevent the source interface of the NVE VTEP device from coming up before the port-channels towards end hosts come up, which keeps the VTEP from advertising itself into the EVPN database and blackholing INBOUND traffic.

You might be thinking: Why not just create an L3 link and form an OSPF adjacency between the two switches to allow the failed switch to continue to receive EVPN updates and prevent blackholing? Well, here are my reasons:

  1. Switchport density and cost per port – If it costs you $30,000 for a single switch of 48 10GbE ports, not including Smartnet or professional services, you’re over $600/port, and you and I both know you’re not just going to use ONE link in the Underlay, you’ll use at least two. Really expensive fix.
  2. Suboptimal routing – Let’s be real here, your traffic will now take an additional hop because your switch is on the way out
  3. Confusing information in EVPN database for next-hop reachability – Because the switch with the failed spine uplinks still has a path and is still receiving EVPN updates, you’ll see it show up as a route-distinguisher in the database, creating confusion
  4. It doesn’t serve appropriate justice to a compromised switch – Come on, the switch has failed, while not completely, it is probably toast and should be downed to trigger immediate resolution of the issue, instead of using bubble gum to plug a leak in your infrastructure. The best solution is to bring down the vPC member completely, force an absolute failover to the remaining operational switch, prevent suboptimal routing, and prevent confusion in troubleshooting.
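
The cost argument in the first point is simple arithmetic, sketched here in Python. The prices are the example figures from the list above; the helper names are hypothetical, and counting one port on each vPC peer per link is my reading of the setup.

```python
def cost_per_port(switch_price, port_count):
    """List price of the switch spread evenly across its ports."""
    return switch_price / port_count


def inter_switch_link_cost(switch_price, port_count, links, switches=2):
    """Port cost of dedicating `links` L3 links between the vPC peers.

    Each link burns one port on each of the `switches` it connects.
    """
    return cost_per_port(switch_price, port_count) * links * switches


print(cost_per_port(30_000, 48))                      # 625.0
print(inter_switch_link_cost(30_000, 48, links=2))    # 2500.0
```

That is four host-facing ports gone for a path which only matters while the switch is already failing.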

I can’t stress this enough: Engineering anything other than just failing this non-border vPC enabled leaf switch, in the event it is the only switch that has lost all of its (at least 3) spine connections, is either an attempt at designing a fix for stupid, or a sign you’re far too focused on why your leaf switch has failed while ignoring the power outage in your entire data center because you lost main power and someone forgot to put diesel in the generator tanks. Part 3 will include more EVPN goodness, stay tuned!

