Scalable network infrastructure in Layer 3 with BGP

Growth Spurt

Configuring the Hosts

What is crucial for IP fabrics is that each host in the setup speaks BGP; in practical terms, each host becomes a router itself. BGP is based on the idea of routing announcements: A host uses such an announcement to specify which target networks can be reached through it.

For this to work, each participating host needs at least one local network and one local IP address at which it can be reached on that network – this is the IP address that other hosts will use later to communicate with the server in question. In technical jargon, these networks, which mostly comprise four IP addresses (subnet mask /30) taken from the local IPv4 address space, are typically known as transfer networks.
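To make the /30 arithmetic concrete, here is a minimal sketch with Python's standard ipaddress module. The prefix 10.0.0.0/30 and the assignment of the two usable addresses to host and switch are assumptions chosen for illustration, not values from a real setup.

```python
import ipaddress

# Hypothetical transfer network from private IPv4 space (an assumption).
transfer_net = ipaddress.ip_network("10.0.0.0/30")

# A /30 contains four addresses in total ...
total_addresses = transfer_net.num_addresses

# ... of which two are usable; hosts() skips the network and broadcast address.
usable = list(transfer_net.hosts())

# Assumed assignment: first usable address for the host, second for the switch port.
host_ip, switch_ip = usable

print(f"{transfer_net}: {total_addresses} addresses, usable: {host_ip}, {switch_ip}")
```

With only two usable addresses, each transfer network is effectively a point-to-point link between one host port and one switch port.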

The target IP – strictly speaking also a separate network with a single IP address, and thus a subnet mask of /32 – can reside on any network interface of the server, including the loopback interface lo. The server uses BGP to announce that the network with the target IP can be reached via the described transfer network. The admin has the choice between two BGP solutions that have established themselves on Linux: the veteran Quagga [3] or the more lightweight Bird [4] (Figure 2). Which of the two solutions you choose is ultimately a matter of personal taste. Both Quagga and Bird can be easily rolled out with today's automation solutions.

Figure 2: Bird supports BGP and is a lightweight alternative to Quagga.

The architecture of the two services differs considerably: Quagga comprises several components and an interactive BGP shell, which you can use to send commands directly to the service. Bird also offers a command-line tool for querying an active instance of the service (birdc); all told, though, it is significantly less complex than Quagga. Both services are suitable for IP fabrics because only a minimal set of BGP commands is used. If you have not worked with either of the two services before, you will probably find it easier to get started with Bird.

Redundancy plays an important role in such a setup, and it is easy to implement with BGP. After all, the number of transfer networks per host is not limited, and it makes sense to define one for each physical network port. The server then uses BGP to announce a number of paths to the destination IP on the network, matching the number of ports you configured. This arrangement is smart and efficient and allows you to use all network interface cards (NICs) at the same time without any limitations for as long as they work. If a NIC fails, the BGP announcement of the respective path is dropped, leaving just the remaining working paths.
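The announce-and-withdraw behavior can be modeled in a few lines of Python. This is a simplified sketch, not how Quagga or Bird work internally: the RouteTable class, the target IP 192.0.2.10/32, and the next-hop addresses are all hypothetical.

```python
import ipaddress
from dataclasses import dataclass, field


@dataclass
class RouteTable:
    """Toy routing table: maps each announced prefix to its set of next hops."""
    paths: dict = field(default_factory=dict)

    def announce(self, prefix: str, next_hop: str) -> None:
        net = ipaddress.ip_network(prefix)
        self.paths.setdefault(net, set()).add(ipaddress.ip_address(next_hop))

    def withdraw(self, prefix: str, next_hop: str) -> None:
        net = ipaddress.ip_network(prefix)
        hops = self.paths.get(net, set())
        hops.discard(ipaddress.ip_address(next_hop))
        if not hops:
            # Last path gone: the prefix is no longer reachable.
            self.paths.pop(net, None)

    def next_hops(self, prefix: str) -> list:
        return sorted(self.paths.get(ipaddress.ip_network(prefix), set()))


table = RouteTable()

# The host announces its /32 target IP once per transfer network (one per NIC).
table.announce("192.0.2.10/32", "10.0.0.1")  # via NIC 1 (hypothetical)
table.announce("192.0.2.10/32", "10.0.1.1")  # via NIC 2 (hypothetical)
paths_before = table.next_hops("192.0.2.10/32")

# NIC 1 fails: its announcement is dropped, the other path stays in place.
table.withdraw("192.0.2.10/32", "10.0.0.1")
paths_after = table.next_hops("192.0.2.10/32")

print(len(paths_before), "paths before the failure,", len(paths_after), "after")
```

The point of the model is that a failure removes exactly one entry; nothing else in the table has to change for traffic to keep flowing over the second port.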

Switch Helpers

As described, BGP is based on the principle that routers use the protocol to exchange routing information about the hosts on the network. For the setup to work in practice, the switch needs to be actively involved: It must speak BGP and act as a peering partner for the hosts in the scope of the BGP protocol.

The switch plays two central roles: On the one hand, it collects the incoming BGP announcements from the servers connected to it and maintains a central routing table that it provides to all other routers on the network – both hosts and other switches. On the other hand, the switch acts as a physical router: The individual hosts send packets to the address that they learned via BGP and which is more or less the default gateway in this case. The second usable IP address on the transfer network is good for this purpose; the address is configured on the switch for the respective port. The switch changes the target MAC address for these packets on the basis of routing information, reduces their time-to-live (TTL) by 1, and forwards them to the target computer.
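The forwarding step described above (look up the route, rewrite the target MAC address, reduce the TTL by 1) can be sketched as follows. The Packet class, the route entries, the MAC addresses, and the port names are hypothetical; the lookup picks the most specific matching prefix, which is how IP routing generally resolves overlapping routes.

```python
import ipaddress
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Packet:
    dst_ip: str
    dst_mac: str
    ttl: int


# Hypothetical routing table: prefix -> (next-hop MAC address, outgoing port).
ROUTES = {
    ipaddress.ip_network("192.0.2.10/32"): ("52:54:00:00:00:0a", "swp1"),
    ipaddress.ip_network("192.0.2.0/24"): ("52:54:00:00:00:fe", "swp48"),
}


def forward(packet: Packet):
    """Return (rewritten packet, outgoing port), or None if the packet is dropped."""
    if packet.ttl <= 1:
        return None  # TTL would reach zero, so the packet is discarded
    dst = ipaddress.ip_address(packet.dst_ip)
    matches = [net for net in ROUTES if dst in net]
    if not matches:
        return None  # no route to the destination
    best = max(matches, key=lambda net: net.prefixlen)  # most specific prefix wins
    mac, port = ROUTES[best]
    # Rewrite the target MAC address and reduce the TTL by 1.
    return replace(packet, dst_mac=mac, ttl=packet.ttl - 1), port


result = forward(Packet(dst_ip="192.0.2.10", dst_mac="52:54:00:00:00:01", ttl=64))
print(result)
```

Note that the IP addresses in the packet stay untouched; only the Layer 2 destination and the TTL change from hop to hop.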

External traffic plays a special role: in this case, traffic that is not addressed to one of the local transfer networks. The switches use BGP to pass this traffic directly to the gateways set up especially for this purpose and, hence, known as border gateways. In contrast to a classic tree architecture, an IP fabric can have any number of these gateways. The packets ultimately take paths based on the routing information from BGP and not on the basis of the existing Layer 2 network.
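As a rough illustration of how traffic could be classified as external and spread across several border gateways, consider the following sketch. The network lists and gateway addresses are hypothetical, and the hash-based selection is an assumption about one common way to distribute flows over equal paths; the text does not prescribe a particular method.

```python
import ipaddress
import zlib

# Hypothetical local transfer networks and border gateways (assumptions).
LOCAL_NETS = [
    ipaddress.ip_network("10.0.0.0/30"),
    ipaddress.ip_network("10.0.0.4/30"),
]
BORDER_GATEWAYS = ["10.255.0.1", "10.255.0.2", "10.255.0.3"]


def is_external(dst: str) -> bool:
    """Traffic counts as external if no local network contains the destination."""
    addr = ipaddress.ip_address(dst)
    return not any(addr in net for net in LOCAL_NETS)


def pick_gateway(src: str, dst: str) -> str:
    """Pick a border gateway per flow, so packets of one flow take the same path."""
    key = f"{src}->{dst}".encode()
    return BORDER_GATEWAYS[zlib.crc32(key) % len(BORDER_GATEWAYS)]


for dst in ("10.0.0.5", "203.0.113.7"):
    if is_external(dst):
        print(dst, "is external, sent via", pick_gateway("10.0.0.1", dst))
    else:
        print(dst, "is local")
```

Adding a gateway in this model just means adding an entry to the list, which mirrors the claim that the number of border gateways is not fixed by the topology.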

What applies to border gateways also applies to the number of switches in the setup: Additional switches can be connected to any existing switch at any time, and the switches then seamlessly integrate into the existing BGP setup. Although somewhat unorthodox, such ad hoc extensions are possible without any disadvantages. In practice, however, the approach is typically more structured. The following example therefore points to best practices and also cites concrete examples of usable hardware.

Top of Rack, Core, Leaf, Spine

The kind of IP fabric designs presented here go by many alternative names. Manufacturers like to talk about leaf-spine architectures, or top-of-rack (ToR) switches and core switches. They almost always mean the same thing. Assuming good planning, IP fabrics let you achieve a far higher number of ports and genuine scalability. Commonly, each rack is assigned a separate switch, with the switches often mounted at the top of the rack (hence the name "top of rack").

Cross-cabling of multiple racks can make sense. For example, if you have two racks side by side, each with a Mellanox SN2410 switch (48x25Gbps Ethernet), you could first connect the servers in the racks with the ToR switch in the same rack and then with the ToR switch in the rack next door to achieve redundancy. If the 48 ports are not enough, you can install several of these devices.

For the switch layer above the ToR switches, you need more powerful hardware: You might want to deploy the Mellanox SN2700, which offers 32 genuine 100Gbps Ethernet ports (Figure 3) that you can split up using break-out cables (see the box "Tricky Bit: The Switches"). By connecting the ToR switches from various racks to one or multiple core switches of this model, you can give the individual racks high-performance uplinks. After all, it is the core switches that are connected directly to the core routers facing the Internet and thus enable external connectivity.
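When planning uplinks, a quick back-of-the-envelope calculation helps. The sketch below uses the port counts from the text (48x25Gbps per ToR switch, 32x100Gbps per core switch); the number of uplinks per ToR switch and the number of core switches are assumptions chosen purely for illustration.

```python
def oversubscription(downlinks: int, downlink_gbps: int,
                     uplinks: int, uplink_gbps: int) -> float:
    """Ratio of server-facing bandwidth to uplink bandwidth on one switch."""
    return (downlinks * downlink_gbps) / (uplinks * uplink_gbps)


# From the text: 48 server ports at 25Gbps per ToR switch,
# 32 ports at 100Gbps per core switch.
TOR_PORTS, TOR_PORT_GBPS = 48, 25
CORE_PORTS, CORE_PORT_GBPS = 32, 100

# Assumptions: four 100Gbps uplinks per ToR switch, split evenly
# across two core switches.
UPLINKS_PER_TOR = 4
CORE_SWITCHES = 2

tor_ratio = oversubscription(TOR_PORTS, TOR_PORT_GBPS,
                             UPLINKS_PER_TOR, CORE_PORT_GBPS)

# Upper bound: ignores core ports reserved for the core routers.
uplinks_per_core = UPLINKS_PER_TOR // CORE_SWITCHES
tors_per_core = CORE_PORTS // uplinks_per_core

print(f"ToR oversubscription: {tor_ratio:.1f}:1")
print(f"ToR switches per core switch (upper bound): {tors_per_core}")
```

Under these assumptions, the servers in a rack could generate three times the bandwidth the uplinks can carry; whether that ratio is acceptable depends on your traffic patterns.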

Figure 3: The Mellanox SN2700 offers 32 genuine 100Gbps Ethernet ports and is suitable as a core switch for connecting multiple top-of-rack switches.

Tricky Bit: The Switches

It is no coincidence that I mentioned Spectrum series switches by Mellanox twice. They offer a feature that is especially relevant in the context of IP fabrics: You can run Cumulus Linux on them, a special version of Debian GNU/Linux that has been specifically customized for operation on network devices.

Cumulus brings some neat benefits to this setup: If you implement BGP on the hosts using Quagga or Bird, you can do the same on your Cumulus switches (Figure 4). Quagga is included, and Bird can be installed with apt-get, although the manufacturer does not provide any support for this component. Thanks to the Debian base, the switches thus integrate seamlessly into an existing automation solution. Like the other hosts in the setup, the switch also gets its configuration via Ansible.

Figure 4: Cumulus on switches offers many advantages: You can resort to a real Linux distribution in which each port is a separate interface.

Of course, you can set up an IP fabric like the one I described with third-party network hardware and without Cumulus Linux. The industry, including the two kings of the hill, Cisco and Juniper, has finally taken note of the desire to use switches as routers. Their operating systems, NX-OS and Junos OS, respectively, now offer corresponding features, but an expensive additional license is typically required. Additionally, NX-OS and Junos OS will not integrate so easily with an existing automation environment. If you have the choice, you should at least evaluate Cumulus.

Another thing in favor of Cumulus: If the switches on the network are only simple servers with Linux, any good system administrator can manage them, which is not true of Cisco or Juniper, for which you need a specialist. In some ways, IP fabrics thus also drive the need for IT specialists to change and become all-rounders. Sys admins without previous knowledge of networks can look forward to a steep learning curve because BGP is fairly complex, but generally, if you can use BGP on Linux, you will get by with Cumulus, Bird, or Quagga.

A setup planned in this way achieves maximum flexibility: Additional racks can always be connected to the existing core switches, and if you run out of ports, you can add an additional core switch. If you want, you can even add a third layer of switches during operation, which, especially in light of the imminent availability of 200Gbps Ethernet switches, could be a worthwhile alternative. Undoubtedly, as long as some physical connection is available from a new switch down to the existing switch network, new switches can be added with no worries.

All of this works without downtime: If you briefly turn off a switch to replace it with a more powerful model, the other BGP routes remain unaffected, and the network continues to function normally. Adding further layers of switches during operation does not require a maintenance window either.
