
Configure Orchestrator Disaster Recovery

This section provides disaster recovery (DR) instructions for Orchestrator.

Orchestrator Disaster Recovery Overview

The Orchestrator Disaster Recovery (DR) feature prevents the loss of stored data and resumes Orchestrator services in the event of system or network failure.

Orchestrator DR involves setting up an active/standby Orchestrator pair with data replication and a manually triggered failover mechanism.
  • The recovery time objective (RTO), therefore, is dependent on explicit action by the operator to trigger promotion of the standby.
  • The recovery point objective (RPO), however, is essentially zero, regardless of the recovery time, because all configuration is instantaneously replicated. Monitoring data that would have been collected during the outage is cached on the Edges and Gateways pending promotion of the standby.
Note: DR is mandatory. For licensing and pricing, contact the Arista Sales team for support.

Active/Standby Pair

In an Orchestrator DR deployment, two identical Orchestrator systems are configured as an active/standby pair. The operator can view the state of DR readiness through the web UI on either server. Edges and Gateways are aware of both Orchestrators, and while they receive configuration changes only from the active Orchestrator, they periodically send DR heartbeats to both systems to report their view of both servers and to query the DR system status. When the operator triggers a failover, the Edges and Gateways are informed of the change in their next DR heartbeat.

DR States

From the view of an operator, and of the Edges and Gateways, an Orchestrator is in one of four DR states:

Table 1. DR States
Standalone: No DR configured.
Active: DR configured, acting as the primary Orchestrator server.
Standby: DR configured, acting as an inactive replica Orchestrator server.
Zombie: DR formerly configured and active, but no longer acting as the active or standby.

Run-time Operation

When DR is configured, the standby server runs in a limited mode, blocking all API calls except those related to the DR status and the DR heartbeats. When the operator invokes a failover, the standby is promoted to become fully operational as a Standalone server. The server that was formerly active is automatically transitioned to a Zombie state if it is responsive and visible from the promoted standby. In the Zombie state, management configuration services are blocked, and any contact from Edges and Gateways that have not transitioned to the new active Orchestrator is redirected to the promoted server.

Figure 1. Run-time Operation
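The run-time behavior above can be summarized per state. The following is a minimal sketch in shell; the function name and description strings are our own illustration, and only the four state names come from Table 1:

```shell
#!/bin/sh
# Illustrative only: map each DR state from Table 1 to a one-line
# summary of its operator-facing behavior, paraphrasing this section.
describe_dr_state() {
  case "$1" in
    Standalone) echo "No DR configured" ;;
    Active)     echo "Primary server; serves configuration to Edges and Gateways" ;;
    Standby)    echo "Inactive replica; API blocked except DR status and heartbeats" ;;
    Zombie)     echo "Formerly active; redirects clients to the promoted server" ;;
    *)          echo "Unknown state: $1" ;;
  esac
}

describe_dr_state Standby   # prints the Standby summary
```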

Set Up Orchestrator Replication

Two installed Orchestrator instances are required to initiate replication.
  • The selected standby is put into a STANDBY_CANDIDATE state, enabling it to be configured by the active server.
  • The active server is then given the address and credentials of the standby and it enters the ACTIVE_CONFIGURING state.

When a STANDBY_CONFIG_RQST is sent from the active server to the standby, the two servers synchronize through the state transitions.

The two Orchestrators on which Disaster Recovery (DR) will be established must have the same time. Before you initiate Orchestrator replication, check the following NTP configurations:
  • The Gateway time zone must be set to Etc/UTC. Use the following command to view the configured time zone.
    vcadmin@vcg1-example:~$ cat /etc/timezone
    Etc/UTC

    If the time zone is incorrect, use the following commands to update the time zone.

    echo "Etc/UTC" | sudo tee /etc/timezone
    sudo dpkg-reconfigure --frontend noninteractive tzdata
  • The NTP offset must be less than or equal to 15 milliseconds. Use the following command to view the NTP offset.
    vcadmin@vcg1-example:~$ sudo ntpq -p
         remote           refid      st t when poll reach   delay   offset  jitter
    ==============================================================================
    *ntp1-us1.prod.v 74.120.81.219    3 u  474 1024  377   10.171   -1.183   1.033
     ntp1-eu1-old.pr .INIT.          16 u    - 1024    0    0.000    0.000   0.000

    If the offset is incorrect, use the following commands to update the NTP offset.

    sudo systemctl stop ntp
    sudo ntpdate <server>
    sudo systemctl start ntp
  • By default, a list of NTP servers is configured in the /etc/ntpd.conf file. The Orchestrators on which DR is to be established must have Internet access to reach the default NTP servers so that time stays in sync on both Orchestrators. Customers can also use a local NTP server running in their environment to sync time.
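The offset check above can also be scripted. The following is a sketch of our own; it assumes only the standard `ntpq -p` column layout, where the ninth column is the offset in milliseconds and the selected peer is marked with `*`:

```shell
#!/bin/sh
# Sketch: read `ntpq -p` output on stdin and check that the selected
# peer's offset (column 9, milliseconds) is within the 15 ms limit.
check_ntp_offset() {
  awk '/^\*/ {
    off = $9
    if (off < 0) off = -off
    if (off <= 15.0) print "OK: offset " $9 " ms"
    else             print "FAIL: offset " $9 " ms exceeds 15 ms"
  }'
}

# Example with the captured output shown above (offset -1.183 ms):
printf '%s\n' \
  '*ntp1-us1.prod.v 74.120.81.219  3 u  474 1024  377  10.171  -1.183  1.033' \
  | check_ntp_offset    # prints: OK: offset -1.183 ms
```

In production you would pipe the live output into the function: `sudo ntpq -p | check_ntp_offset`.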
Note: Before you set up your Standby Orchestrator to begin the replication process, you must set the network.public.address system property.

Set Up the Standby Orchestrator

To set up Orchestrator replication, perform the following steps:

  1. Select Replication from the Navigation panel to display the Orchestrator Replication screen.
    Figure 2. Orchestrator Replication
  2. Enable the Standby Orchestrator by selecting the Standby (Replication Role) radio button.
    Figure 3. Replication Role
  3. Select the Enable for Standby button.
    The Prepare this Orchestrator for Standby Role dialog appears.
    Figure 4. Enable for Standby
  4. Select the Enable for Standby button again.
    A success message appears across the top of the screen, indicating that the Orchestrator has been enabled for Standby and will restart in Standby mode.
  5. Select OK.
    Figure 5. Standby Orchestrator

After the Standby Orchestrator has been configured for replication, configure the Active Orchestrator. For additional information, see Set Up the Active Orchestrator.

Set Up the Active Orchestrator

To configure the second Orchestrator to be the Active Orchestrator:

  1. Select Replication from the Navigation panel.
    The Orchestrator Replication screen appears.
  2. Choose the Active Replication Role.
  3. Type in the Standby Orchestrator Address and the Standby Orchestrator UUID.
    The Orchestrator Address and UUID are displayed in the Standby Orchestrator screen.
    Figure 6. Orchestrator Replication
  4. Type in the username and password for the Orchestrator Superuser to be used for replication.
    Note:
    • This Superuser should already exist on both systems.
    • Starting with the 4.5 release, the special character "<" is no longer supported in passwords. Users who included "<" in a password in a previous release must remove it before saving any changes on this page.
  5. Select the Make Active button.
    The Active Orchestrator screen appears, showing the current state.
    Figure 7. Active Orchestrator

    When configuration is complete, both Orchestrators (Standby and Active) will be in sync.

     

    Standby Orchestrator in Sync

    Figure 8. Standby Orchestrator in Sync

    You can select the toggle history link to view the status of each state.

    Figure 9. Standby Orchestrator

     

    Active Orchestrator in Sync

    Figure 10. Active Orchestrator in Sync

Test Failover

The following testing failover scenarios are forced failovers for example purposes. You can perform these actions in the Available Actions area of the Active and Standby screens.

Promote a Standby Orchestrator

This section discusses how to promote a Standby Orchestrator.

To promote a Standby Orchestrator:
  1. Select the unlock link.
  2. Select the Promote Standby button in the Available Actions area on the Standby Orchestrator screen.
    Figure 11. Available Actions Tab

    The following dialog box appears, indicating that when you promote your Standby Orchestrator, administrators will no longer be able to manage the SASE Orchestrator using the previously Active Orchestrator.

    Figure 12. Standby Orchestrator Dialog Box
  3. Select the OK button to promote the Standby Orchestrator.

    Another message dialog box appears to verify your request to promote the Standby Orchestrator. This message will appear only if the Standby Orchestrator perceives the Active Orchestrator to be in good health, meaning the Standby is communicating with the Active and duplicating data.

  4. Select OK to promote the Orchestrator.
    Figure 13. Active Orchestrator Dialog Box

    A final dialog box appears indicating that the Orchestrator is no longer a Standby and will restart in Standalone mode.

    Figure 14. Standalone Mode Orchestrator Dialog Box

    When you promote a Standby Orchestrator, it restarts in Standalone mode.

    If the Standby can communicate with the formerly Active Orchestrator, it will instruct that Orchestrator to enter a Zombie state. In Zombie state, the Orchestrator communicates with its clients (edges, gateways, UI/API) that it is no longer active, and that they must communicate with the newly promoted Orchestrator. If the promoted Standby cannot communicate with the formerly Active Orchestrator, the operator should, if possible, manually demote the formerly Active Orchestrator.

    Figure 15. Quiesced Orchestrator

Return to Standalone Mode

To return the Zombie to Standalone mode, click the Return to Standalone Mode button in the Available Actions area on the Active Orchestrator or Standby Orchestrator screens.

Figure 16. Available Actions for Orchestrator
Note: The Orchestrator can be returned to Standalone mode from the Zombie state after the time specified in the system property vco.disasterRecovery.zombie.expirySeconds, which defaults to 1800 seconds (30 minutes).

Troubleshooting SASE Orchestrator DR

This section discusses the failure states of the system. These are also listed in the UI, along with a more detailed description of the failure. Additional information is available in the log.

Recoverable Failures

The following errors are recoverable failures that can occur after SASE Orchestrator DR reaches an in-sync state. If the problem causing these failures is corrected, SASE Orchestrator DR will automatically return to normal operation.
  • FAILURE_SYNCING_FILES
  • FAILURE_GET_STANDBY_STATUS
  • FAILURE_MYSQL_ACTIVE_STATUS
  • FAILURE_MYSQL_STANDBY_STATUS

Unrecoverable Failures

The following failures can occur during configuration of the SASE Orchestrator DR. SASE Orchestrator DR will not automatically recover from these failures.
  • FAILURE_ACTIVE_CONFIGURING
  • FAILURE_LAUNCHING_STANDBY
  • FAILURE_STANDBY_CONFIGURING
  • FAILURE_COPYING_DB
  • FAILURE_COPYING_FILES
  • FAILURE_SYNC_CONFIGURING
  • FAILURE_GET_STANDBY_CONFIG
  • FAILURE_STANDBY_CANDIDATE
  • FAILURE_STANDBY_UNCONFIG
  • FAILURE_STANDBY_PROMOTION
  • FAILURE_ACTIVE_DEMOTION
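Because these codes also appear in the log, a quick triage can be scripted. The following is a sketch of our own; the log is read from stdin because its location varies by deployment:

```shell
#!/bin/sh
# Sketch: pull DR failure codes out of a log stream and label each as
# recoverable or unrecoverable, per the two lists above.
classify_dr_failures() {
  grep -oE 'FAILURE_[A-Z_]+' | sort -u | while read -r code; do
    case "$code" in
      FAILURE_SYNCING_FILES | FAILURE_GET_STANDBY_STATUS | FAILURE_MYSQL_ACTIVE_STATUS | FAILURE_MYSQL_STANDBY_STATUS)
        echo "$code: recoverable (clears once the cause is fixed)" ;;
      FAILURE_*)
        echo "$code: unrecoverable (will not clear automatically)" ;;
    esac
  done
}

# Example:
printf 'dr: FAILURE_SYNCING_FILES during file sync\n' | classify_dr_failures
```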

Replication

The VeloCloud Orchestrator Disaster Recovery (DR) feature prevents the loss of stored data and resumes VeloCloud Orchestrator services in the event of system or network failure.

VeloCloud Orchestrator DR involves setting up an active/standby VeloCloud Orchestrator pair with data replication and a manually triggered failover mechanism.
  • The Recovery Time Objective (RTO), therefore, is dependent on explicit action by the operator to trigger promotion of the standby.
  • The Recovery Point Objective (RPO), however, is essentially zero, regardless of the recovery time, because all configuration is instantaneously replicated. Monitoring data that would have been collected during the outage is cached on the Edges and Gateways pending promotion of the standby.
Note: DR is mandatory. For licensing and pricing, contact the Arista sales team for support.

Active/Standby Pair

In a VeloCloud Orchestrator DR deployment, two identical VeloCloud Orchestrator systems are configured as an active/standby pair. The operator can view the state of DR readiness through the web UI on either server. Edges and Gateways are aware of both VeloCloud Orchestrators, and while they receive configuration changes only from the active VeloCloud Orchestrator, they periodically send DR heartbeats to both systems to report their view of both servers and to query the DR system status. When the operator triggers a failover, the Edges and Gateways are informed of the change in their next DR heartbeat.

DR States

From the view of an operator, and the Edges and Gateways, a VeloCloud Orchestrator has one of the following four DR states:

Table 2. DR States
Standalone: No DR configured.
Active: DR configured, acting as the primary VeloCloud Orchestrator server.
Standby: DR configured, acting as an inactive replica VeloCloud Orchestrator server.
Zombie: DR formerly configured and active, but no longer acting as the active or standby.

Run-time Operation

When DR is configured, the standby server runs in a limited mode, blocking all API calls except those related to the DR status and the DR heartbeats. When the operator invokes a failover, the standby is promoted to become fully operational as a Standalone server. The server that was formerly active is automatically transitioned to a Zombie state if it is responsive and visible from the promoted standby. In the Zombie state, management configuration services are blocked, and any contact from Edges and Gateways that have not transitioned to the new active VeloCloud Orchestrator is redirected to the promoted server.

Figure 17. Run-time Operation

Set Up VeloCloud Orchestrator Replication

Two installed VeloCloud Orchestrator instances are required to initiate replication.
  • The selected standby is put into a STANDBY_CANDIDATE state, enabling it to be configured by the active server.
  • The active server is then given the address and credentials of the standby and it enters the ACTIVE_CONFIGURING state.

When a STANDBY_CONFIG_RQST is made from active to standby, the two servers synchronize through the state transitions.

The two Orchestrators on which Disaster Recovery (DR) will be established must have the same time. Before you initiate VeloCloud Orchestrator replication, check the following NTP configurations:
  • The Gateway time zone must be set to Etc/UTC. Use the following command to view the configured time zone.
    vcadmin@vcg1-example:~$ cat /etc/timezone
    Etc/UTC

    If the time zone is incorrect, use the following commands to update the time zone.

    echo "Etc/UTC" | sudo tee /etc/timezone
    sudo dpkg-reconfigure --frontend noninteractive tzdata
  • The NTP offset must be less than or equal to 15 milliseconds. Use the following command to view the NTP offset.
    vcadmin@vcg1-example:~$ sudo ntpq -p
         remote           refid      st t when poll reach   delay   offset  jitter
    ==============================================================================
    *ntp1-us1.prod.v 74.120.81.219    3 u  474 1024  377   10.171   -1.183   1.033
     ntp1-eu1-old.pr .INIT.          16 u    - 1024    0    0.000    0.000   0.000

    If the offset is incorrect, use the following commands to update the NTP offset.

    sudo systemctl stop ntp
    sudo ntpdate <server>
    sudo systemctl start ntp
  • By default, a list of NTP servers is configured in the /etc/ntpd.conf file. The Orchestrators on which DR is to be established must have Internet access to reach the default NTP servers so that time stays in sync on both Orchestrators. Customers can also use a local NTP server running in their environment to sync time.

Set Up the Standby Orchestrator

To set up the Standby Orchestrator, perform the following steps:

  1. In the SD-WAN service of the Enterprise Portal, select the Orchestrator tab, and then from the left pane select the Replication button to display the Orchestrator Replication screen.
  2. Activate the Standby Orchestrator by selecting the Standby (Replication Role) radio button.
  3. Select the Enable for Standby button.
    Figure 18. Standby Orchestrator

    The Standby Orchestrator page appears.

  4. Enter the manual configuration parameters and select the Update configuration info button.

    After the Standby Orchestrator has been configured for replication, configure the Active Orchestrator according to the instructions below.

Set Up the Active Orchestrator

To set up the Active Orchestrator, select the Replication Role as Active and configure the following:

Figure 19. Orchestrator Replication

 

Table 3. Orchestrator Replication Fields
Select Replication Role: Select the Active radio button for the replication role.
Standby Orchestrator Address: Enter the primary Standby Orchestrator IP address.
Standby Orchestrator Address (IPv6): Enter the Standby Orchestrator IPv6 address.
Standby Orchestrator Secondary Address: Enter the address of the standby Orchestrator's secondary interface. This address is used for replication if the standby is promoted to active. You can enter an IPv4, IPv6, or FQDN address here.
Standby Orchestrator UUID: Enter the UUID of the standby Orchestrator.
Configuration Mode: Select the Auto Configure Standby or Manually Configure Standby radio button, based on the requirement. When configured manually, paste the string value from the ACTIVE VCO to STANDBY_WAIT.
Superuser Username: Enter the username for the Orchestrator Superuser.
Standby Orchestrator Superuser Password: Enter the password for the Orchestrator Superuser.
Note: Starting with the 4.5 release, the special character "<" is no longer supported in passwords. Users who included "<" in a password in a previous release must remove it before saving any changes on this page.

Select the Enable for Active button to activate the replication role.

When configuration is complete, both Orchestrators (Standby and Active) are in sync.

Standby Orchestrator in Sync

Figure 20. Configuration Status of Orchestrator

Active Orchestrator in Sync

Figure 21. Active Orchestrator Status

Test Failover

The following testing failover scenarios are forced failovers for example purposes. You can perform these actions in the Available Actions area of the Active and Standby screens.

Promote a Standby Orchestrator

This section discusses how to promote a Standby Orchestrator.

To promote a Standby Orchestrator, perform the following steps:

  1. Select the unlock link.
  2. Select the Promote Standby button in the Available Actions area on the Standby Orchestrator screen.
    Figure 22. Available Actions

    The following dialog box appears, indicating that when you promote your Standby Orchestrator, administrators will no longer be able to manage the VeloCloud Orchestrator using the previously Active Orchestrator.

    Figure 23. Promote Standby Orchestrator
  3. Select the Promote Standby button to promote the Standby Orchestrator.
  4. Select Force Promote Standby to promote the Orchestrator.
    Figure 24. Force Promote Standby Orchestrator

    A final dialog box appears indicating that the Orchestrator is no longer a Standby and restarts in Standalone mode.

    Figure 25. Orchestrator Removed Status

When you promote a Standby Orchestrator, it restarts in Standalone mode.

If the Standby can communicate with the formerly Active Orchestrator, it instructs that Orchestrator to enter a Zombie state. In Zombie state, the Orchestrator communicates with its clients (edges, gateways, UI/API) that it is no longer active, and that they must communicate with the newly promoted Orchestrator. If the promoted Standby cannot communicate with the formerly Active Orchestrator, the operator should, if possible, manually demote the formerly Active Orchestrator.

Figure 26. Quiesced Orchestrator

Return to Standalone Mode

To return the Zombie to standalone mode, select the Return to Standalone Mode button in the Available Actions area on the Active Orchestrator or Standby Orchestrator screens.

Figure 27. Return to Standalone Mode
Note: The Orchestrator can be returned to Standalone mode from the Zombie state after the time specified in the system property vco.disasterRecovery.zombie.expirySeconds, which defaults to 1800 seconds (30 minutes).

Troubleshooting VeloCloud Orchestrator DR

This section describes the failure states of the system. These are also listed in the UI, along with a more detailed description of the failure. Additional information is available in the Arista log.

Recoverable Failures

The following errors are recoverable failures that can occur after VeloCloud Orchestrator DR reaches an in-sync state. If the problem causing these failures is corrected, VeloCloud Orchestrator DR automatically returns to normal operation.
  • FAILURE_SYNCING_FILES
  • FAILURE_GET_STANDBY_STATUS
  • FAILURE_MYSQL_ACTIVE_STATUS
  • FAILURE_MYSQL_STANDBY_STATUS

Unrecoverable Failures

The following failures can occur during configuration of the VeloCloud Orchestrator DR. VeloCloud Orchestrator DR does not automatically recover from these failures.
  • FAILURE_ACTIVE_CONFIGURING
  • FAILURE_LAUNCHING_STANDBY
  • FAILURE_STANDBY_CONFIGURING
  • FAILURE_COPYING_DB
  • FAILURE_COPYING_FILES
  • FAILURE_SYNC_CONFIGURING
  • FAILURE_GET_STANDBY_CONFIG
  • FAILURE_STANDBY_CANDIDATE
  • FAILURE_STANDBY_UNCONFIG
  • FAILURE_STANDBY_PROMOTION
  • FAILURE_ACTIVE_DEMOTION