Configure High Availability on SD-WAN Edge

This section discusses the high availability deployments and configuration supported on SD-WAN Edge.

How SD-WAN Edge High Availability (HA) Works

The high availability solution ensures continued traffic flow in case of failures. The SD-WAN Edge is the data plane component that is deployed at an end user’s branch location. SD-WAN Edges configured in High Availability (HA) mode are mirror images of each other and appear on the SASE Orchestrator as a single SD-WAN Edge.

In a high availability configuration, SD-WAN Edges are deployed at the branch site in pairs of Active and Standby roles. Configurations are mirrored across both these Edges. The Active and Standby Edges exchange heartbeats using a failover link established over a wired WAN connection. If the Standby Edge loses connectivity with the Active Edge for a defined period, the Standby Edge assumes the identity of the Active Edge and takes over the traffic load. The failover has minimal impact on the traffic flow.

The Orchestrator communicates only with the Active Edge. Any changes made to the Active Edge using the Orchestrator are synchronized with the Standby Edge using the failover link.

Failure Scenarios

The following are some common scenarios that can trigger a failover from an Active to a Standby Edge:

WAN link failure: When a WAN link on the Active Edge fails, a failover action is triggered. The SASE Orchestrator generates the “High Availability Going Active” event. This means that the Standby Edge takes over as Active because the peer’s WAN interface is down.
LAN link failure: When a LAN link on the Active Edge fails, a failover action is triggered. The SASE Orchestrator generates the “High Availability Going Active” event. This means that the Standby Edge takes over as Active because the peer’s LAN interface is down.
Edge crash, reboot, or unresponsive Edge functions: When the Active Edge crashes, reboots, or is unresponsive, the Standby Edge does not receive any heartbeat messages. The SASE Orchestrator generates the “High Availability Going Active” event and the Standby Edge takes over as Active.
Service restart: Configuration changes that trigger a service restart cause a failover. The service restart happens after the configuration changes are applied to the Standby Edge and Active Edge. For a list of changes that cause a service restart, see Arista VeloCloud SD-WAN Edge Configuration Changes That Can Trigger an Edge Service Restart.

Note: HA Edges should be deployed within an isolated broadcast domain. During failover scenarios, to ensure a seamless transition of the Active role to the Standby Edge, it is crucial that the Standby Edge does not receive any incoming packets on the HA interface.

High Availability Deployment Models

The High Availability feature supports the following deployment models:

Standard HA: In this model, the Active and Standby Edges have the same configurations and symmetric connections, that is, both Edges are connected to the same WAN links. All ports on the Active Edge are open for receiving and sending traffic, whereas all ports except GE1 on the Standby Edge are blocked. The GE1 interface is used to exchange heartbeats between the Active and Standby Edges. See Standard HA.
Enhanced HA: In this model, the Active and Standby Edges have the same configurations but asymmetric connections, that is, the two Edges are connected to different WAN links. The GE1 interface is used to exchange heartbeats between the Active and Standby Edges. The Active Edge can leverage the WAN link connected to the Standby Edge to send or receive traffic. It forwards the traffic through the GE1 interface to the Standby Edge, which in turn sends the traffic through the WAN link. See Enhanced HA.
Mixed-mode HA: This model is a combination of both Standard and Enhanced HA deployments on the same site. In this model, the Active and Standby Edges have the same configurations. The connections can be both symmetric and asymmetric. See Mixed-Mode HA.
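
The distinction between the three models comes down to whether the WAN links on the two Edges are the same, different, or a mix of both. The following Python sketch illustrates that classification. It is a minimal illustration, not the logic the Edge or Orchestrator actually runs; the function name and the link labels ("ISP1", "ISP2", "MPLS") are hypothetical.

```python
def classify_ha_model(active_wan_links, standby_wan_links):
    """Return the HA deployment model implied by each Edge's WAN links.

    Link labels are hypothetical names used only for this sketch.
    """
    active = set(active_wan_links)
    standby = set(standby_wan_links)
    if active == standby:
        # Symmetric: both Edges are connected to the same WAN links.
        return "Standard HA"
    if active.isdisjoint(standby):
        # Asymmetric: the Edges are connected to different WAN links.
        return "Enhanced HA"
    # Some links are shared and some are individual.
    return "Mixed-mode HA"


print(classify_ha_model({"ISP1", "ISP2"}, {"ISP1", "ISP2"}))
print(classify_ha_model({"ISP1"}, {"ISP2"}))
print(classify_ha_model({"ISP2", "MPLS"}, {"ISP2"}))
```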

The HA options are supported on the following SD-WAN Edge platforms: 510, 510N, 520, 520v, 540, 610, 610N, 620, 620N, 640, 640N, 680, 680N, 840, 2000, 3400, 3800, 3810, and any Virtual Edge.

CAUTION: HA is supported only between identical SD-WAN Edge platform models. For more information on the Edge platform models, see Arista Documentation.

Warning: Mixing Wi-Fi capable and non-Wi-Fi capable Edges in High Availability deployments is not supported. While the Edge models 510N, 610N, 620N, 640N, and 680N appear identical to their Wi-Fi capable counterparts, deploying a Wi-Fi capable Edge and a non-Wi-Fi capable Edge of the same model (for example, an Edge 640 and an Edge 640N) as a High-Availability pair is not supported. Customers should ensure that the Edges deployed as a High Availability pair are of the same type: both Wi-Fi capable, or both non-Wi-Fi capable.
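
The platform rules above can be checked before a pair is deployed. The following Python sketch is a planning aid under stated assumptions: the model list is copied from this section, the function name is hypothetical, and Virtual Edges are left out because this section does not define how they are identified.

```python
# Hardware models listed in this section as supporting HA.
SUPPORTED_MODELS = {
    "510", "510N", "520", "520v", "540", "610", "610N", "620", "620N",
    "640", "640N", "680", "680N", "840", "2000", "3400", "3800", "3810",
}


def is_supported_ha_pair(model_a, model_b):
    """True when both Edges are the same supported hardware model.

    Comparing the full model string also rejects a Wi-Fi capable Edge
    paired with its non-Wi-Fi counterpart (for example "640" and "640N").
    """
    return model_a == model_b and model_a in SUPPORTED_MODELS


print(is_supported_ha_pair("640", "640"))    # same model
print(is_supported_ha_pair("640", "640N"))   # Wi-Fi and non-Wi-Fi mixed
```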

Standard HA

This section describes Standard HA.

Topology Overview for Standard HA

The following figure shows a conceptual overview of Standard HA.

The Edges, one Active and one Standby, are connected by L1 ports to establish a failover link. The Standby SD-WAN Edge blocks all ports except the L1 port for the failover link.

Prerequisites for Standard HA

The LAN side switches in the following configuration descriptions must be STP capable and configured with STP.
In addition, SD-WAN Edge LAN and WAN ports must be connected to different L2 switches. If it is necessary to connect the ports to the same switch, then the LAN and WAN ports must be isolated.
The two SD-WAN Edges must have mirrored physical WAN and LAN connections.

Deployment Types for Standard HA

Standard HA has two possible deployment types:

Deployment Type 1: High Availability (HA) using L2 switches
Deployment Type 2: High Availability (HA) using L2 and L3 switches

The following sections describe these two deployment types.

Deployment Type 1: HA using L2 switches

The following figure shows the network connections using only L2 switches.

W1 and W2 are WAN connections used to connect to the L2 switch to provide WAN connectivity to both ISPs. The L1 link connects the two SD-WAN Edges and is used for ‘keep-alive’ and communication between the SD-WAN Edges for HA support. The SD-WAN Edge’s LAN connections are used to connect to the access layer L2 switches.

Considerations for HA Deployment using L2 switches

The same ISP link must be connected to the same port on both Edges.
Use the L2 switch to make the same ISP link available to both Edges.
The Standby SD-WAN Edge does not interfere with any traffic by blocking all its ports except the failover link (L1 port).
Session information is synchronized between the Active and Standby SD-WAN Edges through the failover link.
If the Active Edge detects the loss of a LAN link, it fails over to the Standby Edge, provided that the Standby Edge has an active LAN link.

Deployment Type 2: HA using L2 and L3 Switches

The following figure shows the network connections using L2 and L3 switches.

The SD-WAN Edge WAN connections (W1 and W2) are used to connect to L2 switches to provide a WAN connection to ISP1 and ISP2 respectively. The L1 connections on the SD-WAN Edge are connected to provide a failover link for HA support. The Edge LAN connections are used to connect L2 Switches, which have several end-user devices connected.

Considerations for HA Deployment using L2 and L3 switches

HSRP/VRRP is required on the L3 switch pair.
The SD-WAN Edge's static route points to the L3 switches’ HSRP VIP as the next hop to reach the end stations behind L2 switches.
The same ISP link must be connected to the same port on both SD-WAN Edges. The L2 switch must make the same ISP link available to both Edges.
The Standby SD-WAN Edge does not interfere with any traffic by blocking all of its ports except the failover link (L1 port).
The session information is synchronized between the Active and Standby SD-WAN Edges through the failover link.
The HA pair also does a failover from Active to Standby on detecting the L1 loss of LAN / WAN links.
- If Active and Standby have the same number of LAN links which are up, but Standby has more WAN links up, then a switchover to Standby will occur.
- If the Standby Edge has more LAN links up and has at least one WAN link up, then a failover to the Standby will occur. In this situation, it is assumed that the Standby Edge has more users on the LAN side than the Active Edge, and that the Standby will allow more LAN side users to connect to the WAN, given that there is some WAN connectivity available.
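
The two link-count rules above can be expressed as a small decision function. The following Python sketch covers only those two rules; the class and function names are hypothetical, and the actual Edge may weigh additional inputs that this section does not describe.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LinkState:
    """Number of links currently up on one Edge (hypothetical structure)."""
    lan_up: int
    wan_up: int


def standby_should_take_over(active: LinkState, standby: LinkState) -> bool:
    """Apply the two link-count rules from this section."""
    if standby.lan_up > active.lan_up:
        # More LAN links on the Standby: fail over if it has any WAN link up.
        return standby.wan_up >= 1
    if standby.lan_up == active.lan_up:
        # Same LAN count: fail over only if the Standby has more WAN links up.
        return standby.wan_up > active.wan_up
    return False


# Same LAN count, Standby has more WAN links up: switchover.
print(standby_should_take_over(LinkState(2, 1), LinkState(2, 2)))
# Standby has more LAN links but no WAN connectivity: no failover.
print(standby_should_take_over(LinkState(1, 2), LinkState(2, 0)))
```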

Enhanced HA

This section discusses Enhanced HA. Enhanced HA eliminates the need for L2 switches on the WAN side of the Edges. For LAN-side settings, refer to the Standard HA documentation. This option is chosen when the Active Edge detects that the WAN link(s) connected to the Standby Edge differ from the link(s) connected to itself.

The following figure shows a conceptual overview of Enhanced HA.

The Edges, one Active and one Standby, are connected by using an HA link to establish a failover link. The Active Edge establishes overlay tunnels on both WAN links (connected to itself and the Standby Edge) through the HA link.

Note: The two SD-WAN Edges should not have mirrored physical WAN connections. For example, if the Active Edge has GE2 as the WAN link, then the Standby Edge cannot have GE2 as its WAN link.

In order to leverage the WAN link connected to the Standby Edge, the Active Edge establishes the overlay tunnel through the HA link. The LAN-side traffic is forwarded to the Internet through the HA link. The business policy for the branch defines the traffic distribution across the overlay tunnels.

Enhanced HA Support for LTE Interface

Long-Term Evolution (LTE) is a standard for wireless broadband communication for mobile devices and data terminals, based on the GSM/EDGE and UMTS/HSPA technologies. It increases the capacity and speed using a different radio interface together with core network improvements. SD-WAN supports LTE in 510 and 610 Edge models which have two SIM slots.

Starting with the 4.2 release, the LTE link/CELL interface is counted in the HA election. Internally, a lesser weight is provided for CELL links than wired links. So depending on the number of wired links connected to each Edge in the eHA pair, the Edge with the LTE link can either be the Active or the Standby Edge. Here are some use cases for eHA with LTE interface.

Use case 1: 1-Wired link on Active Edge and 1-LTE link on Standby Edge

Figure 5. Wired Link on Active Edge and 1-LTE link on Standby Edge

The figure illustrates the topology of Enhanced HA support for LTE Interface on a Standby Edge. In this example, there are two Edges, one Active (SD-WAN Edge 1) and one Standby (SD-WAN Edge 2), that are connected by an HA cable to establish a failover link. The Edge with the wired WAN link is preferred as the Active Edge. The Standby Edge uses an LTE link for tunnel establishment. The LTE link on the Standby Edge can be used as an active, backup, or hot-standby link, based on the Edge configuration. The Active Edge establishes overlay tunnels on the WAN link connected to itself and, through the HA link, on the LTE link on the Standby Edge. If the Active Edge fails, the Standby Edge continues to forward the LAN-side traffic through the LTE link.

Use case 2: 1-Wired and 1-LTE link on Active Edge and 1-Wired link on Standby Edge

Figure 6. Wired and 1-LTE link on Active Edge and 1-Wired link on Standby Edge

The figure illustrates the topology of Enhanced HA support for LTE Interface on an Active Edge. In this example, the SD-WAN Edge 1 with one wired link and one LTE link acts as an Active Edge, and SD-WAN Edge 2 with one wired link acts as Standby Edge. If the wired WAN link on the Active Edge goes down, the Standby Edge would take over as Active and the LTE link would be used in eHA mode.
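
The two use cases above follow from the fact that CELL links carry a lesser weight than wired links in the HA election. The following Python sketch reproduces that outcome with made-up weights. The values 2 and 1 are assumptions chosen only so that a wired link outweighs a CELL link; the actual internal weights are not documented here. The sketch also assumes both Edges have the same number of LAN interfaces.

```python
WIRED_WEIGHT = 2  # hypothetical value, not the Edge's internal weight
CELL_WEIGHT = 1   # hypothetical value; the source only says it is lower


def wan_score(wired_up, cell_up):
    """Weighted count of the WAN links that are up on one Edge."""
    return wired_up * WIRED_WEIGHT + cell_up * CELL_WEIGHT


def preferred_active(edge1, edge2):
    """Each argument is a (wired_up, cell_up) tuple; returns the preferred Edge."""
    score1, score2 = wan_score(*edge1), wan_score(*edge2)
    if score1 == score2:
        return "tie"
    return "edge1" if score1 > score2 else "edge2"


# Use case 1: one wired link on Edge 1, one LTE link on Edge 2.
print(preferred_active((1, 0), (0, 1)))
# Use case 2: wired plus LTE on Edge 1, one wired link on Edge 2.
print(preferred_active((1, 1), (1, 0)))
# Use case 2 after the wired link on Edge 1 goes down.
print(preferred_active((0, 1), (1, 0)))
```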

Supported Topologies

HA requires that both Edges in the pair be the same model. Enhanced HA support for LTE is available in the following topologies:

510- 510 LTE HA pair
610- 610 LTE HA pair
510 LTE- 510 LTE HA pair
610 LTE- 610 LTE HA pair

Note: Inserting LTE SIM in Active Edge when Standby Edge has an LTE SIM on CELL interface is not supported for 510-LTE pairs and 610-LTE pairs topologies.

Limitations

LTE Dual SIM Single Standby (DSSS) is not supported with eHA LTE.
USB modems on the Standby Edge are not supported in eHA mode.

Troubleshooting Enhanced HA support for LTE

You can troubleshoot the Enhanced HA support for LTE Interface feature by running the following remote diagnostic tests on an Edge:

LTE Modem Information- Run this test on a selected Edge interface to collect diagnostic details such as Modem information, Connection information, Location information, Signal information, and Status information for the internal LTE modem.
The below screen shows the output for an Edge's CELL1 interface where there is no SIM card attached, while exhibiting the expected fields for this diagnostic.

Figure 7. LTE Modem Information
Reset USB Modem- Run this test on a selected Edge interface to reset a malfunctioning USB modem connected to the given interface.
Note: Not all USB modems support this type of remote reset.

Figure 8. Reset USB Modem

Mixed-Mode HA

The Mixed-mode HA deployment model is a combination of Standard HA and Enhanced HA deployments.

In this deployment model you can have both shared interfaces and individual interfaces.

Let us consider a scenario where the private network is unable to communicate with the Orchestrator or the Controller.

Figure 9. Private Network is Unable to Communicate with the Orchestrator or the Controller

In this topology, the Active and Standby Edges exchange heartbeat messages, synchronize configuration updates, and other information over the GE1 interface. Both SD-WAN Edges have mirrored LAN and WAN connections over the GE2, GE3, and GE5 interfaces, which is similar to the Standard HA deployment model. However, the Active Edge is connected to the private network using the GE4 WAN link. This is similar to the Enhanced HA deployment model. All ports on the Active Edge are kept open to send and receive traffic. On the Standby Edge, all ports except GE1 are blocked.

When the MPLS network is unable to communicate with the Orchestrator or the Controller, the site would still have connectivity to the Orchestrator or the Gateway and would be able to build public overlays.

Now let us consider a scenario when both private and public networks are unable to communicate with the Orchestrator or Controller.

Figure 10. Private and Public Networks are Unable to Communicate with the Orchestrator or Controller

In this topology, ISP1 is connected only to the Standby Edge using the GE6 WAN link, and ISP2 is connected to both the Active and Standby Edges using the GE5 WAN link. All ports on the Active Edge are kept open to send and receive traffic. On the Standby Edge, all ports except GE1 and GE6 are blocked. The Active Edge leverages the GE6 WAN link on the Standby Edge to send traffic to the public network (ISP1) through GE1.

Split-Brain Condition

When the HA link is disconnected or when the Active and Standby Edges fail to communicate with each other, both Edges assume the Active role. As a result, both Edges start responding to ARP requests on their LAN interfaces. This causes LAN traffic to be forwarded to both Edges, which could result in spanning tree loops on the LAN.

Typically, LAN switches connected to the HA Edge pair LAN ports run the Spanning Tree Protocol to prevent loops in the network. In such a condition, the switch would block traffic to one or both Edges. However, doing so would cause a total loss of traffic through the Edge pair.

Important: On an Enhanced HA deployment (where there is no Layer 2 Switch connected to the Edge's WAN interfaces), connectivity to the Primary Gateway is a requirement for split-brain detection. More details on the split-brain detection functionality can be found in the section Split-Brain Detection and Prevention.

Split-Brain Detection and Prevention

This section discusses the mechanisms used to detect and prevent a split-brain state in an Edge deployment using a high availability topology.

There are two mechanisms for detecting and preventing a split-brain condition in a high availability deployment (where both HA Edges become Active).

The first mechanism involves sending layer 2 broadcast heartbeats between the two HA Edges when the HA heartbeat link between the devices is lost. A layer 2 broadcast (EtherType 0x9999) heartbeat is sent from the Active Edge on all its WAN interfaces in an effort to find the Standby Edge in that broadcast network. When the Standby Edge receives this packet, it interprets the packet as an indication to maintain its current Standby state. This mechanism is used by a Legacy High Availability deployment where both HA Edges have their WAN ports connected to the same layer 2 switch.

The second mechanism used to detect and prevent split-brain conditions leverages the Primary Gateway used by the HA Edges. This mechanism is the sole means of detecting and preventing split-brain in an Enhanced High Availability deployment as this topology does not connect both HA Edges to an upstream layer 2 switch.

The Gateway has a pre-existing connection to the Active Edge (VCE1). In a split-brain condition, the Standby Edge (VCE2) changes state to Active and tries to establish a tunnel with the Gateway (VCG). The Gateway sends a response back to the Standby Edge (VCE2) instructing it to move to the Standby state, and does not allow the tunnel to be established. The Gateway keeps its tunnels only with the Active Edge. The sequence of events is as follows:

As soon as the HA link fails, the VCE2 moves to the Active state and enables the LAN/WAN ports, and tries to establish tunnels with the Primary Gateway. If the VCE1 still has tunnels, the Primary Gateway instructs the VCE2 to revert to the Standby state and thus the VCE2 blocks its LAN ports. Only the LAN interfaces remain blocked (as long as the HA cable is down). As illustrated in the following figure, the Gateway signals VCE2 to go into the Standby state. This will logically prevent the split-brain scenario from occurring.

Note:

A failover from Active to Standby in a split-brain scenario is not the same as a normal failover. It could take a few extra milliseconds or seconds to converge.
When configuring WAN interface settings for an Edge, if you select PPPoE from the Addressing Type field, the Edge cannot send heartbeat packets by broadcast from a WAN interface so configured.

Figure 11. Split-Brain Detection and Prevention
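
The Gateway-based sequence described above can be modeled as a small state exchange. The following Python sketch is a conceptual model only: the class names, method names, and the "ACCEPT" / "GO_STANDBY" reply values are hypothetical and do not represent the actual protocol between the Edge and the Gateway.

```python
class Gateway:
    """Tracks which Edge of the HA pair currently holds tunnels."""

    def __init__(self):
        self.active_edge = None

    def request_tunnel(self, edge_name):
        # Accept if no Edge holds tunnels, or the requester already does.
        if self.active_edge in (None, edge_name):
            self.active_edge = edge_name
            return "ACCEPT"
        # Another Edge still has tunnels: tell the requester to stand by.
        return "GO_STANDBY"

    def tunnels_lost(self, edge_name):
        if self.active_edge == edge_name:
            self.active_edge = None


class Edge:
    def __init__(self, name, role):
        self.name = name
        self.role = role
        self.lan_blocked = role == "STANDBY"

    def on_ha_link_down(self, gateway):
        """Standby behavior when the HA link fails."""
        if self.role != "STANDBY":
            return
        # Move to Active, open the ports, and try to build tunnels.
        self.role, self.lan_blocked = "ACTIVE", False
        if gateway.request_tunnel(self.name) == "GO_STANDBY":
            # The peer is still Active: revert and block the LAN ports.
            self.role, self.lan_blocked = "STANDBY", True


vcg = Gateway()
vcg.request_tunnel("VCE1")           # VCE1 is the existing Active Edge
vce2 = Edge("VCE2", "STANDBY")
vce2.on_ha_link_down(vcg)
print(vce2.role, vce2.lan_blocked)   # VCE2 is pushed back to Standby
```

If VCE1 had actually failed and lost its tunnels, the same request would be accepted and VCE2 would remain Active.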

Note: Beginning in Release 5.2.0, the HA Failover Detection Time Multiplier feature can be used to set a longer High Availability failover threshold. The timer represents how long a Standby Edge will wait for a heartbeat packet from the Active Edge before becoming active. In some instances, where a lower model Edge is under high traffic load, the Active Edge's heartbeat packet may take longer than the default threshold time to be delivered to the Standby Edge. As a result the Standby Edge triggers a failover and is promoted to Active, resulting in a Split-Brain state.

Setting the HA Failover Detection Time Multiplier to a value higher than the default can lessen the risk of a Split-Brain state in this scenario. The default value is 700 milliseconds (ms), and this value can be increased up to a value of 7000 ms. For additional information, see Activate High Availability.
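
The effect of a longer threshold can be illustrated with a short Python sketch. The 700 ms default and 7000 ms upper limit come from this section; treating 700 ms as the lower bound, and the function names themselves, are assumptions made for the example.

```python
DEFAULT_THRESHOLD_MS = 700   # default failover threshold from this section
MAX_THRESHOLD_MS = 7000      # highest value the threshold can be raised to


def validate_threshold(threshold_ms):
    """Reject thresholds outside 700 to 7000 ms (lower bound is assumed)."""
    if not DEFAULT_THRESHOLD_MS <= threshold_ms <= MAX_THRESHOLD_MS:
        raise ValueError(f"threshold must be 700-7000 ms, got {threshold_ms}")
    return threshold_ms


def standby_declares_failover(ms_since_last_heartbeat,
                              threshold_ms=DEFAULT_THRESHOLD_MS):
    """True when the Standby has waited longer than the threshold."""
    return ms_since_last_heartbeat > validate_threshold(threshold_ms)


# A heartbeat delayed by 900 ms under load triggers a failover by default...
print(standby_declares_failover(900))
# ...but not when the threshold has been raised to 1400 ms.
print(standby_declares_failover(900, threshold_ms=1400))
```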

Support for BGP Over HA Link

When a pair of Edges is configured in a High Availability topology, the Active SD-WAN Edge exchanges BGP routes over the HA link. Where Enhanced HA is used, BGP on the Active Edge establishes neighborship with a peer connected only to the Standby Edge’s WAN link.

Beginning with SD-WAN Release 5.1.0, a site deployed in High Availability with BGP configured automatically synchronizes local routes between the Active and Standby Edges and uses these routes for forwarding on the Active Edge, while also ensuring that the route table is immediately available after an HA failover. This results in improved failover times as the routes are already available on the Standby Edge when it is promoted to Active.

Note: To fully optimize HA failovers where BGP is used in Standard and Enhanced HA topologies, it is strongly recommended to also activate the BGP Graceful Restart feature. Information about this feature is found in the High Availability Graceful Switchover with BGP Graceful Restart topic.

High Availability Graceful Switchover with BGP Graceful Restart

To use the BGP Graceful Restart feature, a customer site must have the following.

A site deployed with a High Availability topology. This can be either Active/Standby or VRRP with 3rd party router. BGP Graceful Restart does not have any effect on a standalone Edge site, only on sites using HA.
The customer enterprise must have BGP configured as the routing protocol.

Important: To fully optimize the benefits of BGP Graceful Restart it is strongly recommended that Distributed Cost Calculation (DCC) is also activated for the customer enterprise. With DCC activated, preference and advertisement decisions are local to the Edge and the Edge synchronizes from Active to Standby as soon as it learns the routes from the routing process. DCC's value is not limited to HA sites, and for additional information on this feature, see the topic SD-WAN Routing Overview and the topic Configure Distributed Cost Calculation in the Arista VeloCloud SD-WAN Operator Guide.

For a site deployed in a High Availability topology where BGP is also used, an HA failover can be both slow and disruptive to customer traffic because the peer Edges have deleted all the routes on a failover. In Release 5.1.0 and later, the BGP Graceful Restart feature is added for HA deployments which ensures faster and less disruptive HA failovers.

BGP Graceful Restart with Graceful Switchover ensures faster Edge restarts and HA failovers by having the neighboring BGP devices participate in the restart to ensure that no route changes occur in the network for the duration of the restart. Without BGP Graceful Restart, the peer Edge deletes all routes once the TCP session terminates between BGP peers and these routes need to be rebuilt post Edge restart or HA failover. BGP Graceful Restart changes this behavior by ensuring that peer Edges retain routes as long as a new session is established within a configurable restart timer.

Note: BGP Graceful Restart is for sites deployed in High-Availability only. This feature is not yet available for sites deployed with a single, standalone Edge even if it uses the BGP routing protocol.

Limitations/Known Behaviors

BGP Graceful Restart and HA Graceful Switchover are segment agnostic: when activated on one segment (for example, the Global Segment), these settings are applied to all other segments on a customer site. This means that the Edge will synchronize routes on other segments and hold stale routes during an HA failover.

Configuring BGP Graceful Restart

Configuring BGP Graceful Restart is a two-part process: the first part is done in the BGP configuration section, and the second part in the High Availability configuration section. The steps are:

Activate BGP Graceful Restart on Configure > Device > BGP.
1. In the Customer portal, select either Configure > Profile or Configure > Edges depending on your preferences. The screenshots show the steps for a single HA Edge.
2. Select the Device icon next to an Edge, or select the link to the Edge, and then select the Device tab.
3. Scroll down to the Routing & NAT section and open up the BGP section for the Edge or Profile.
  
  Figure 12. Expand BGP
4. In the BGP section check the box for Graceful Restart.
  
  Figure 13. Enable Graceful Restart
5. Once the box is checked, two additional parameters appear related to Enable Graceful Restart: Restart Time, and Stalepath Time:
  1. Restart Time represents the maximum time the route processor (RP) waits for the RP peer to begin talking before expiring route entries. The default time for this parameter is 120 seconds and can be manually configured within a range of 1 to 600 seconds.
  2. Stalepath Time represents the maximum time routes are retained after a restart (HA failover). Updated routes from a route processor peer are expected to have been received by this time. The default time for this parameter is 300 seconds and can be manually configured within a range of 1 to 3600 seconds.
6. Once you have activated BGP Graceful Restart and are satisfied with the two secondary settings, move to the High Availability section.
Activate Graceful Switchover on Configure > Device > High Availability.
1. From the BGP section, scroll down to the High Availability section.
  
  Figure 14. High Availability
2. In the High Availability section the option to check the box for Graceful Switchover is now available as a result of BGP Graceful Restart being activated.
3. Check the box for Graceful Switchover.
4. Nothing further is required in the High Availability section and there are no secondary parameters for Graceful Switchover.
Scroll down to the bottom of the Configure > Device page and select Save Changes in the bottom right corner. This applies the configuration changes made above.
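
The settings configured in the steps above have documented defaults, ranges, and one dependency (Graceful Switchover becomes available only after BGP Graceful Restart is activated). The following Python sketch validates a planned configuration against those rules before it is entered in the Orchestrator. The class and field names are hypothetical and do not correspond to an Orchestrator API.

```python
from dataclasses import dataclass


@dataclass
class GracefulRestartPlan:
    """Planned values for the BGP and High Availability sections."""
    graceful_restart: bool = False
    restart_time_s: int = 120      # default 120, range 1 to 600 seconds
    stalepath_time_s: int = 300    # default 300, range 1 to 3600 seconds
    graceful_switchover: bool = False

    def validate(self):
        """Return a list of problems; an empty list means the plan is valid."""
        problems = []
        if self.graceful_restart:
            if not 1 <= self.restart_time_s <= 600:
                problems.append("Restart Time must be 1 to 600 seconds")
            if not 1 <= self.stalepath_time_s <= 3600:
                problems.append("Stalepath Time must be 1 to 3600 seconds")
        if self.graceful_switchover and not self.graceful_restart:
            problems.append("Graceful Switchover requires Graceful Restart")
        return problems


print(GracefulRestartPlan(graceful_restart=True,
                          graceful_switchover=True).validate())
print(GracefulRestartPlan(graceful_switchover=True).validate())
```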

Selection Criteria to Determine Active and Standby Status

This section discusses the selection criteria used to determine Active and Standby Status.

Check for the Edge that has the higher number of (L2 and L3) LAN interfaces. The Edge with the higher number of LAN interfaces is chosen as the Active one. Note that the interface used for the HA link is not counted as a LAN interface.
If both Edges have the same number of LAN interfaces, the Edge with the higher number of WAN interfaces is chosen as the Active one.
Note: There is no preemption if the two Edges have the same number of LAN and WAN interfaces.
Additional Support Matrix:
- Static/DHCP/PPPoE links are supported.
- Multiple WAN links each tagged with a separate VLAN ID on a single interface (e.g. Sub-Interfaces) are supported.
- USB modems are not recommended on HA. The interface will not be used when present in the Standby Edge.
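
The selection criteria at the start of this section (LAN interface count first, then WAN interface count, with no preemption on a tie) can be sketched in Python as follows. The function name and the dictionary layout are hypothetical; the sketch returns the current Active Edge unchanged on a tie, and returns None when there is no incumbent, because this section does not say how an initial tie is resolved.

```python
def elect_active(edge_a, edge_b, current_active=None):
    """Pick the Active Edge from two dicts with "name", "lan", and "wan" keys.

    The "lan" count must exclude the interface used for the HA link.
    """
    # Tuples compare element by element: LAN count first, then WAN count.
    key_a = (edge_a["lan"], edge_a["wan"])
    key_b = (edge_b["lan"], edge_b["wan"])
    if key_a > key_b:
        return edge_a["name"]
    if key_b > key_a:
        return edge_b["name"]
    # Same LAN and WAN counts: no preemption, the current Active stays.
    return current_active


edge1 = {"name": "Edge1", "lan": 2, "wan": 1}
edge2 = {"name": "Edge2", "lan": 2, "wan": 2}
print(elect_active(edge1, edge2, current_active="Edge1"))
```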

VLAN-tagged Traffic Over HA Link

This section discusses the VLAN-tagged Traffic over an HA Link.

Internet traffic from ISP2 is VLAN tagged.
Customer will have separate VLANs for Enterprise traffic versus DIA traffic.
The WAN link on the Standby has sub-interfaces to carry Internet traffic.
Multi segments

Figure 15. VLAN-tagged Traffic Over HA Link

Configure High Availability (HA)

To configure High Availability, configure the Active and Standby Edges.

Deploying High Availability on VMware ESXi

You can deploy the SD-WAN HA on VMware ESXi using the supported topologies.

While deploying HA on VMware ESXi, consider the following limitations:

ESXi vSwitch Caveats

Upstream failures are not propagated by the vSwitch that is directly connected to a virtual SD-WAN VNF. For example, if a physical adapter goes down, the VeloCloud Edges see the link as up and do not fail over.
vSwitches do not allow specific VLANs to be configured on a port group. If more than one VLAN is required, then VLAN 4095 must be configured. This allows all VLANs on the port group.
Note: This is not applicable to br-HA Link, which does not require VLANs.
The virtual Edge, when working as HA, changes its original assigned MAC Address. In order to allow the virtual Edge to receive frames with a MAC Address that is different from the one originally assigned, set the MAC address changes option on the virtual switch to Accept.
To allow the virtual Edge to receive traffic in the br-HA Link with multiple destination MAC Addresses, change the security settings on the port group/virtual switch to allow it to run in Promiscuous mode.

Note: For additional information on MAC address changes and Promiscuous mode operation, refer to the topic Unique MAC LAN and WAN Address.

Limitations of SD-WAN High Availability

There is no generic way of failure detection that will work on all the hardware, virtual, and uCPE platforms.

You can enable the Loss of Signal (LoS) detection to determine the HA Failover. For more information, see HA LoS Detection on Routed Interfaces.

VeloCloud SD-WAN supports the following topologies while deploying HA on VMware ESXi:

Topology 1: Legacy HA with WAN links

The following image illustrates a topology with legacy HA along with WAN links that have been uplinked using a single physical adapter and one routed LAN or trunked LAN through a single physical adapter.

Topology 2: Enhanced HA with WAN Links

The following topology shows enhanced HA with three WAN links.

Topology 3: Enhanced HA with Subinterfaces

The following image shows Enhanced HA with subinterfaces on the WAN interfaces with VLAN ID as 4095 on port group.

Figure 18. Enhanced HA with Subinterfaces

HA LoS Detection on Routed Interfaces

The HA Loss of Signal (LoS) detection enables an Edge to detect reachability failures in HA deployments on routed Interfaces.

When HA is enabled on an Edge, the number of LAN and WAN Interfaces connected to the Edge is detected, and this count is used to decide whether to perform an HA failover.

When Edges in HA mode are deployed on ESXi, the LAN and WAN vNICs of the Edge are uplinked through single or multiple physical NICs. If one of the physical NICs is down, the Interface count computed by HA will not be different from the Edge vNICs. The vSwitch connections remain intact, preventing the HA Failover.

By enabling LoS detection on a routed Interface, it is possible to detect a Loss of Signal and trigger a failover. LoS detection is based on ARP monitoring of the next hop for routed Interfaces. It is done only on the Active Edge and only for Interfaces that are UP.

If an Interface is physically up but LoS is detected, then the Interface will be considered down and the relevant action, that is HA Failover, will be taken based on active and standby Interface count. LoS detection is done only on parent Interface and not on its sub Interfaces as the underlying physical link is common for both. When the Interface misses three consecutive ARP responses with the configured probe interval, it is considered to be down with LoS.
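
The three-consecutive-miss rule can be illustrated with a simple counter. The following Python sketch models only that rule; the class and method names are hypothetical, and the allowed probe intervals are the options listed under Enable LoS Detection. It does not send ARP packets.

```python
ALLOWED_INTERVALS_S = (1, 3, 5, 10)  # ARP Probe Interval options, in seconds
MISSES_FOR_LOS = 3                   # consecutive missed ARP responses


class LosMonitor:
    """Tracks ARP probe results for one routed parent Interface."""

    def __init__(self, probe_interval_s=3):
        if probe_interval_s not in ALLOWED_INTERVALS_S:
            raise ValueError("probe interval must be 1, 3, 5, or 10 seconds")
        self.probe_interval_s = probe_interval_s
        self.consecutive_misses = 0

    def record_probe(self, got_response):
        """Record one probe result; return True if the Interface is in LoS."""
        if got_response:
            self.consecutive_misses = 0   # any response clears the count
        else:
            self.consecutive_misses += 1
        return self.consecutive_misses >= MISSES_FOR_LOS


monitor = LosMonitor()  # default 3-second probe interval
results = [monitor.record_probe(r)
           for r in (False, False, True, False, False, False)]
print(results)
```

In this run, the response on the third probe resets the counter, so LoS is declared only after the sixth probe. With the default interval, three consecutive misses span roughly nine seconds of probing.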

Limitations of LoS

LoS detection works only for routed Interfaces as the Edge does not know the next hop in a switched Interface. LoS detection is not supported for PPPoE Interfaces and statically configured Interfaces without default Gateway provided.
LoS detection is not supported for Interfaces that are UP only on the Standby Edge.
LoS probing is not done on the Interfaces of the Standby Edge. Hence, any Interface connectivity change on the Standby Edge cannot be detected.
In a legacy HA deployment, all the Interfaces on the Standby Edge are blocked. As LoS monitoring uses ARP probing to detect the liveness of a link, the connectivity state of links on the Standby Edge cannot be ascertained because the Interfaces are blocked and the ARP packets cannot go through.

Enable LoS Detection

In the SD-WAN settings of the Enterprise portal, select Configure > Edges.
Select the Device icon next to an Edge, or select the link to an Edge and then select the Device tab.
On the Device tab, scroll down to the Interface Settings section, which displays the Interfaces available in the selected Edge.
Select the Edit option for an Interface to view and modify the settings.
Select the Override Interface checkbox to modify the configuration settings for the selected Interface.
In the L2 Settings section, select the Enable LoS Detection checkbox to enable Loss of Signal (LoS) detection by using ARP monitoring.
Select the ARP Probe Interval from the drop-down list. The available options are 1, 3, 5, and 10 seconds, and the default value is 3 seconds. LoS is detected on the Interface based on this probe interval: when the Interface misses 3 consecutive ARP responses, it is considered to be down due to LoS.
Configure the other settings as required and select Update.

Figure 19. Edit Interface Settings
Select Save Changes on the Device tab.

For additional information on the other settings of the Interface, see Configure Interface Settings for Profiles.

To view the LoS detection events, see Monitor Events for LoS Detection.

Monitor Events for LoS Detection

You can view the events related to the LoS Detection on a routed Interface of a virtual Edge.

In the Enterprise portal, select Monitor > Events .

To view the events related to LoS Detection, you can use the filter option. Select the drop-down arrow next to the Search option and choose to filter either by the Event or by the Message column.

The following events occur during LoS detection:

LoS detected on peer's Interface <Interface name>
LoS no longer seen on Interface <Interface name>

Unique MAC LAN and WAN Address

Starting with Release 4.3.0, virtual Edges support a unique MAC address feature on a High Availability interface.

Instead of generating a common or shared virtual MAC address when in HA, this feature uses the physical MAC address for hardware Edges and the assigned MAC address for virtual Edges.

Important: For a customer enterprise using HA Edges and vSwitches, configure MAC learning on all vSwitches where possible. MAC learning is available on vSphere version 6.7 and later. If MAC learning is configured on all vSwitches, Unique MAC Address is not required. However, if the vSwitches do not have MAC learning configured, Unique MAC Address is required on the HA Edge.

Prerequisites

This section discusses HA requirements that must be met before configuring an SD-WAN Edge as a Standby.

The two SD-WAN Edges must be the same model.
Only one SD-WAN Edge should be provisioned on the VeloCloud Orchestrator.
The Standby SD-WAN Edge must not have an existing configuration on it.
Do not use a 169.254.2.x address for the management interface.
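As a pre-flight aid, the prerequisites above can be written as a small checklist function. This is a sketch under stated assumptions: the function name and parameters are hypothetical, nothing here queries the Orchestrator, and "169.254.2.x" is interpreted as the 169.254.2.0/24 range.

```python
import ipaddress
from typing import List

# Assumption: "169.254.2.x" is read as the 169.254.2.0/24 network.
RESERVED_HA_NET = ipaddress.ip_network("169.254.2.0/24")


def check_ha_prerequisites(
    active_model: str,
    standby_model: str,
    standby_has_config: bool,
    management_ip: str,
    edges_provisioned: int,
) -> List[str]:
    """Return a list of unmet prerequisites (empty if all are met)."""
    problems = []
    if active_model != standby_model:
        problems.append("The two SD-WAN Edges must be the same model.")
    if edges_provisioned != 1:
        problems.append("Only one SD-WAN Edge should be provisioned.")
    if standby_has_config:
        problems.append("The Standby Edge must not have a configuration.")
    if ipaddress.ip_address(management_ip) in RESERVED_HA_NET:
        problems.append("Management interface must not use 169.254.2.x.")
    return problems
```

An empty result means the values supplied by the caller satisfy all four listed prerequisites; it does not verify anything on the devices themselves.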

Activate High Availability

You can activate High Availability (HA) on a pair of Edges to ensure redundancy.

In the SD-WAN service of the Enterprise portal, select Configure > Edges .
Select the SD-WAN Edge from the list and select the Device tab.
Scroll down to the High Availability section and select Active Standby Pair.

Figure 20. High Availability
Select Save Changes at the bottom of the Device window.
By default, the HA interface to connect the pair is selected as follows:
- For Edges 520, 520v, and 540: The LAN1 port is used as HA interface and DPDK is not enabled on these platforms.
- For Edges 510, 610, 620, 640, 680, 840, 2000, 3400, and 3800: The GE1 port is used as HA interface and DPDK is enabled on these platforms.
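The default HA interface selection listed above is effectively a lookup by Edge model. The sketch below captures it as a Python table. The names `HaDefault`, `DEFAULT_HA`, and `default_ha_interface` are hypothetical; only the model numbers, port names, and DPDK notes come from the list above.

```python
from typing import NamedTuple


class HaDefault(NamedTuple):
    interface: str      # default HA port for the platform
    dpdk_enabled: bool  # whether DPDK is enabled on the platform


_LAN1_MODELS = ("520", "520v", "540")
_GE1_MODELS = ("510", "610", "620", "640", "680", "840", "2000", "3400", "3800")

DEFAULT_HA = {m: HaDefault("LAN1", False) for m in _LAN1_MODELS}
DEFAULT_HA.update({m: HaDefault("GE1", True) for m in _GE1_MODELS})


def default_ha_interface(model: str) -> HaDefault:
    """Look up the default HA port for an Edge model number."""
    try:
        return DEFAULT_HA[model]
    except KeyError:
        # Models not in the list above are deliberately not guessed.
        raise ValueError(f"no default HA interface listed for {model!r}") from None
```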

Configure a Non-Default High Availability Interface

The above HA interfaces are the default interfaces for their respective platforms and are selected automatically. Beginning with Release 5.2.0, you can also configure a different interface as the HA interface by using the HA Interface drop-down option.

Any Edge 1G/10G Ethernet/SFP port that does not have WAN-Overlay enabled can be selected as the HA interface.

Both HA Edges must be upgraded to Release 5.2.0 or later prior to using a non-default interface for HA traffic. Until both HA Edges are using Release 5.2.0, they must be configured to use the default GE1 as their HA interface. Only after both HA Edges are upgraded to Release 5.2.0 can a user configure the HA Edges to use an interface other than GE1 as the HA interface.

Configuring a non-default HA Interface can only be performed when HA is not enabled for that site. This means you can configure it prior to enabling HA for a site. However, if you want to change the HA Interface on a site where HA is already enabled, you must first disable HA, then change the HA Interface, and then re-enable HA.
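The conditions for selecting a non-default HA interface can be combined into one check: both Edges on Release 5.2.0 or later, HA currently disabled for the site, and no WAN-Overlay on the chosen port. The Python sketch below is illustrative only; the function names and the dotted version-string format are assumptions.

```python
from typing import Tuple

MIN_RELEASE = (5, 2, 0)


def parse_release(version: str) -> Tuple[int, ...]:
    """Turn a dotted release string such as '5.2.4' into a comparable tuple."""
    parts = [int(p) for p in version.split(".")]
    parts += [0] * (3 - len(parts))  # pad '5.2' to (5, 2, 0)
    return tuple(parts)


def can_set_non_default_ha_interface(
    active_release: str,
    standby_release: str,
    ha_enabled: bool,
    port_has_wan_overlay: bool,
) -> bool:
    """Return True if a non-default HA interface may be configured now."""
    both_upgraded = (
        parse_release(active_release) >= MIN_RELEASE
        and parse_release(standby_release) >= MIN_RELEASE
    )
    # HA must be disabled while the HA interface is changed.
    return both_upgraded and not ha_enabled and not port_has_wan_overlay
```

If the check fails only because HA is enabled, the sequence described above applies: disable HA, change the HA Interface, and then re-enable HA.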

Important: At a High Availability (HA) site that uses a non-default HA Interface, replacing the Standby Edge with a different Edge can cause activation issues if the new Edge has a factory image earlier than version 5.2.0. VeloCloud now supports factory images starting with the 5.2.4 MR. An Edge with a factory image earlier than version 5.2.0 can upgrade automatically to the 5.2.4 MR image by connecting to a DHCP-enabled Internet connection, contacting the VeloCloud-hosted Maestro server, and downloading the latest applicable factory image. The Edge platforms 610, 610-LTE, 620, 640, 680, 3400, 3800, and 3810 are upgraded to the 5.2.4 MR image by default. To ensure a successful activation when replacing an Edge that has a factory image earlier than 5.2.0:

Connect the Edge to a DHCP-enabled Internet connection.
Wait a few minutes for the Edge to contact the VeloCloud Maestro server, download the 5.2.4 MR factory image, and complete the upgrade.

Figure 21. Connect the Edge
Connect the HA interface cable on the upgraded Edge and proceed with HA activation.
This process is particularly relevant for customers deploying High Availability with a non-GE1 interface. It also applies to RMA devices for the platforms listed above: they are upgraded to the 5.2.4 MR image if they have a factory image earlier than version 5.2.0.

Alternatively, if no DHCP-enabled Internet connection is available, follow these steps:

Disable HA.
Reconfigure the HA Interface to its default value (GE1 or LAN1) on the UI, and relocate the HA Interface cable to the default HA Edge interface.
Integrate the replacement Edge into the HA topology of the site.
Re-enable HA and allow the replacement Edge to complete the activation process, assuming the role of the Standby Edge.
Disable HA.
Reconfigure the HA Interface to its alternative value on the UI, and relocate the HA Interface cable back to the alternative location on the HA Edges.
Re-enable HA to finalize the replacement process.

Configure a Unique LAN and WAN MAC Address

By default, High Availability uses a common virtual MAC address to support seamless failover between devices. If you need to use a unique MAC address in certain virtual environments instead of a common or shared virtual MAC address, you can select the Deploy with Unique LAN MAC Address and/or Deploy with Unique WAN MAC Address checkboxes, which are both deactivated by default. These options use the physical MAC address for hardware Edges and the assigned MAC address for virtual Edges. The LAN, Routed LAN, and WAN ports use physical MAC addresses when both options are enabled.

You can activate or deactivate the Deploy with Unique LAN MAC Address and/or Deploy with Unique WAN MAC Address option only at the time you enable High Availability by choosing Active Standby Pair. Once High Availability is enabled, these options can no longer be changed directly.

If you need to activate or deactivate the option, follow these steps:

Disconnect the Standby Edge's WAN and LAN links, leaving only the HA link connected to the Active Edge. If it is a Virtual Edge, disable the virtual NICs that correspond to the WAN and LAN links, leaving only the HA interface NIC connected.
In the High Availability section, select None.
Select Save Changes at the top of the Device window.
Enable High Availability again, and then select or clear the Deploy with Unique LAN MAC Address and/or Deploy with Unique WAN MAC Address checkbox to activate or deactivate the option.
Once the HA status becomes High Availability Ready on the Orchestrator UI, reconnect the LAN and WAN cables of the Standby Edge. If using Virtual Edges, re-enable the virtual NICs.

Advanced Options: HA Failover Detection Time Multiplier

Beginning with Release 5.2.0, a user can manually configure the time threshold after which the Active Edge is marked as non-responsive, which triggers a failover to the Standby Edge. On some Edge platforms, an Edge may experience enough traffic to delay sending the heartbeat response that tells the Standby Edge it is still functioning. If this delay exceeds the default 700 millisecond threshold, the Standby Edge becomes active, which results in an Active-Active (Split-Brain) state. With this feature, the user can increase the time threshold before the Active Edge is declared down and a failover is triggered, which helps prevent a potential split-brain state.

The value is changed under the Advanced Options section, where a user configures the HA Failover Detection Time Multiplier. This multiplier is a number that is multiplied by 100 milliseconds (ms). The default value is 7 (700 ms), and it can be configured up to 70 (7000 ms).
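The arithmetic behind the multiplier is simple enough to show directly. The Python sketch below converts a multiplier into a detection time; the function name is hypothetical, and the lower bound of 1 is a placeholder assumption because the text states only the default (7) and the maximum (70).

```python
HEARTBEAT_UNIT_MS = 100   # each multiplier step is 100 ms, per the text
DEFAULT_MULTIPLIER = 7    # 700 ms
MAX_MULTIPLIER = 70       # 7000 ms
MIN_MULTIPLIER = 1        # assumption: the minimum is not stated in the text


def failover_detection_time_ms(multiplier: int = DEFAULT_MULTIPLIER) -> int:
    """Return the failover detection time in milliseconds."""
    if not MIN_MULTIPLIER <= multiplier <= MAX_MULTIPLIER:
        raise ValueError(
            f"multiplier must be between {MIN_MULTIPLIER} and {MAX_MULTIPLIER}"
        )
    return multiplier * HEARTBEAT_UNIT_MS
```

For example, a multiplier of 20 corresponds to a 2000 ms threshold before the Active Edge is declared down.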

Wait for the SD-WAN Edge to Assume Active

After the High Availability feature is enabled on the Orchestrator, wait for the existing SD-WAN Edge to assume an Active role, and wait for the Orchestrator Events to display High Availability Going Active.

Figure 23. High Availability Going Active

Connect the Standby SD-WAN Edge to the Active Edge

Power on the Standby SD-WAN Edge without any network connections.
After it boots up, connect the LAN1/GE1 interface (as indicated on the Device tab) to the same interface on the Active SD-WAN Edge.
Wait for the Active SD-WAN Edge to detect and activate the Standby SD-WAN Edge automatically. The SASE Orchestrator Events page displays HA Standby Activated when the SASE Orchestrator successfully activates the Standby SD-WAN Edge.

Figure 24. HA Standby Activated

The Standby Edge begins to synchronize with the active SD-WAN Edge and reboots automatically during the process.
Note: It may take up to 10 minutes for the Standby SD-WAN Edge to sync with the Active Edge and upgrade its software.

Connect LAN and WAN Interfaces on Standby SD-WAN Edge

Connect the LAN and WAN interfaces on the standby SD-WAN Edge mirroring the network connectivity on the Active Edge.

The Orchestrator Events will display Standby device software update completed. The HA State in the Monitor > Edges page appears green when ready.

Deactivate High Availability (HA)

This section discusses deactivating a High Availability site and making it a Standalone site, one using a single Edge.

If you want a site configured with High Availability to instead work as a Standalone site with a single Edge, do the following:

In the SD-WAN service of the Enterprise portal, select Configure > Edges .
Select the SD-WAN Edge from the list and select the Device tab.
Scroll down to the High Availability section and select None.

Figure 26. High Availability
Select Save Changes at the top of the Device window.
Note: When High Availability is deactivated on a pair of Edges, the following events are expected to occur:
1. The existing Active Edge becomes the Standalone Edge for this site with no disruption in customer traffic. You can use the GE1 interface on the new Standalone Edge for a different purpose as it is no longer needed for HA.
2. The Standby Edge is deactivated. This means the configuration is cleared from the Edge while retaining the existing Edge software version (the Edge is NOT factory reset). Once the Edge is completely deactivated, you can then remove all cables from the former Standby Edge and re-purpose it to another deployment.
Important: If the Standby Edge is removed from the HA deployment prior to deactivating HA, you would need to perform a separate Edge deactivation or factory reset for that Edge to make it usable in a different location because you cannot activate an Edge to a new location if there is an existing configuration on the Edge.

Note: If the Standby Edge remains connected to the now Standalone Edge through the HA cable after HA is deactivated and is then rebooted, it may try to acquire certain configurations from the Standalone Edge. In that case, the former Standby Edge must be deactivated again or factory reset before it can be used at another location.

HA Event Details

This section describes HA events.

Table 1. HA Event Details
HA Event	Description
HA_GOING_ACTIVE	A standby SD-WAN Edge is taking over as Active because it has not heard a heartbeat from the peer.
HA_STANDBY_ACTIVATED	When a new Standby is detected by the Active, the Active tries to activate the Edge by sending this event to the Orchestrator. On a successful response, the Active will sync the configurations and sync data.
HA_FAILED	Typically happens after the HA pair has formed and the Active SD-WAN Edge no longer hears from the Standby SD-WAN Edge. For example, if the Standby SD-WAN Edge reboots, you will receive this message.
HA_READY	Means the Active SD-WAN Edge now hears from the Standby SD-WAN Edge. Once the Standby SD-WAN Edge comes back up and reestablishes the heartbeat, then you will receive this message.
HA_TERMINATED	When the HA configuration is deactivated, and it is successfully applied on the Edges, this Event is generated.
HA_ACTIVATION_FAILURE	If the Orchestrator is unable to verify the HA activation, it generates this Event. Examples include: the Orchestrator is unable to generate a certificate; HA has been deactivated (rare).
VCO_IDENTIFIED_HA_FAILOVER	Event message reads: `Edge HA Failover Detected` The Orchestrator has detected that a High Availability failover has occurred on the Edge.
VCO_IDENTIFIED_HA_FAILURE	Event message reads: `Edge HA Failure Detected` The Orchestrator has detected that the Standby Edge has gone down. This event will include the serial number of the Edge.
HA_UPDATE_FAILOVER_TIME	Event message reads: `Updating HA Failover time from ####ms to ####ms` A user changed the HA failover time, that is, how long an HA Edge waits to receive a heartbeat from the Active Edge before it fails over. Increasing this value can prevent an Active-Active "Split Brain" state for HA Edges under high load. This is done through the HA Failover Detection Time Multiplier located at Configure > Edge > Device > High Availability on the Orchestrator.
HA_RESET_FAILOVER_TIME	Event message reads: `Updating HA Failover time from ####ms to ####ms` When an HA Edge's system has been stable for 60 seconds, the process reduces the failover threshold time by 50%.
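
When reviewing exported events, it can help to pull the old and new values out of the `Updating HA Failover time from ####ms to ####ms` message shared by HA_UPDATE_FAILOVER_TIME and HA_RESET_FAILOVER_TIME. The Python sketch below does that with a regular expression. It assumes each `####` placeholder is a plain decimal number of milliseconds, and the function name is hypothetical.

```python
import re
from typing import Optional, Tuple

# Assumption: "####" in the documented message stands for decimal digits.
_FAILOVER_TIME_RE = re.compile(
    r"Updating HA Failover time from (\d+)ms to (\d+)ms"
)


def parse_failover_time_change(message: str) -> Optional[Tuple[int, int]]:
    """Return (old_ms, new_ms) from an event message, or None if no match."""
    match = _FAILOVER_TIME_RE.search(message)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))
```

A result in which the new value is half the old one would be consistent with the 50% reduction described for HA_RESET_FAILOVER_TIME.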

VeloCloud SD-WAN 5.2 - Administration Guide - Configure High Availability on SD-WAN Edge