VeloCloud Orchestrator Deployment and Monitoring Guide

The VeloCloud Orchestrator Deployment and Monitoring Guide provides guidance on how to install, run, and monitor the Orchestrator.

Overview

The VeloCloud SD-WAN Orchestrator Deployment and Monitoring Guide provides guidance on how to install, run, and monitor the Edge Cloud Orchestrator.

The Orchestrator Deployment and Monitoring Guide provides the following information:
  • Installing the Orchestrator
  • Setting Up Disaster Recovery
  • Upgrading the Orchestrator
  • Backing Up the Orchestrator Application Data
  • Monitoring the Orchestrator Application
  • Tuning various system properties (depending on the scale of the deployment)

Installing Orchestrator

This section discusses the Orchestrator installation.

Prerequisites

This section discusses the prerequisites that must be met before installing the Orchestrator.

Instance Requirements

Arista recommends installing the Orchestrator and Gateway applications as virtual machines, that is, as guest instances on an existing hypervisor.

The Orchestrator requires the following minimal guest instance specifications:
  • 8 Intel vCPUs at 2.5 GHz or higher
    Note: Arista recommends using Intel Xeon processors. Similar Intel or AMD processors with the same or greater CPU frequency are also acceptable.
  • 64 GB of memory
  • Required Minimum IOPS: 5,000 IOPS
  • Four SSD-based persistent volumes, expandable through LVM if necessary:
    • 192 GB x 1 - Root
    • 1 TB x 1 - Store
    • 500 GB x 1 - Store2
    • 1 TB x 1 - Store3
  • 1 Gbps NIC
  • Ubuntu x64 server VM compatibility
  • Single public IP address made available through NAT
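
The minimums above can be checked against a planned guest instance before deployment. The following is a minimal sketch, not part of the product: the dictionary keys and the check_instance function are hypothetical names chosen for illustration, and 1 TB is assumed to mean 1000 GB.

```python
# Minimum guest instance specifications, copied from the list above.
# The key names are illustrative assumptions, not Orchestrator settings.
MINIMUMS = {
    "vcpus": 8,
    "cpu_ghz": 2.5,
    "memory_gb": 64,
    "iops": 5000,
    "nic_gbps": 1,
}

# Required persistent volume sizes in GB (assuming 1 TB = 1000 GB).
MIN_VOLUMES_GB = {"root": 192, "store": 1000, "store2": 500, "store3": 1000}


def check_instance(spec, volumes_gb):
    """Return a list of shortfalls; an empty list means the plan meets the minimums."""
    problems = []
    for key, minimum in MINIMUMS.items():
        actual = spec.get(key, 0)
        if actual < minimum:
            problems.append(f"{key}: {actual} is below the minimum of {minimum}")
    for name, minimum in MIN_VOLUMES_GB.items():
        actual = volumes_gb.get(name, 0)
        if actual < minimum:
            problems.append(f"volume {name}: {actual} GB is below the minimum of {minimum} GB")
    return problems
```

For example, a plan with 32 GB of memory would produce a single entry that names memory_gb, while a plan that meets every minimum returns an empty list.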

Upstream Firewall Configuration

The upstream firewall needs to be configured to allow inbound HTTP (TCP/80) as well as HTTPS (TCP/443). If a stateful firewall is in place, established outbound connections should also be allowed to facilitate upgrades and security updates.

External Services

The Orchestrator relies on several external services. Before proceeding with an installation, ensure you have available licenses for each of the services.

Google Maps

Google Maps displays Edges and data centers on a map. You do not need a Google account to use this functionality. However, the Orchestrator instance must have Internet access for the service to be available.

The service is limited to 25,000 map loads each day for more than 90 consecutive days. VeloCloud does not anticipate exceeding these limits for nominal use of the Orchestrator. For additional information, see Configure System Properties for Google Maps.

Twilio

VeloCloud uses Twilio for SMS-based alerting to enterprise customers and notifies them of Edge or link outage events. An account needs to be created and funded at http://www.twilio.com.

The account can be provisioned in the Orchestrator through a system property on the Operator Portal's System Properties page. See Configure System Properties for Twilio for additional information.

MaxMind

MaxMind provides geolocation services and automatically detects Edge and Gateway locations and ISP names based on IP address. If this service is deactivated, you must update the geolocation information manually. The account can be provisioned in the Orchestrator through the Operator Portal's System Properties page. See Configure System Properties for MaxMind for additional information.

Installation Procedures

This section discusses Orchestrator installation.

Preparing the Cloud-init

This section discusses how to use the cloud-init package to handle the early initialization of instances.

About cloud-init

Cloud-init is a Linux package responsible for handling the early initialization of instances. If it is available in the distribution, it allows many common parameters of the instance to be configured directly after installation. This creates a fully functional instance with a configuration based on a series of inputs.

Cloud-init behavior can be configured with user data. Provide the user data at instance launch time, typically by attaching a secondary disk in ISO format that cloud-init searches for at first boot. This disk contains all early configuration data to apply at that time.

The Orchestrator supports cloud-init, and all essential configurations can be packaged in an ISO image.

Create the Cloud-init Metadata File

The final installation configuration options are set with a pair of cloud-init configuration files. The first installation configuration file contains the metadata. Create this file with a text editor and name it meta-data. This file provides information that identifies the instance of Orchestrator to be installed. The instance-id can be any identifying name, and the local-hostname should be a host name that follows your site standards, for example:

instance-id: vco01
local-hostname: vco-01

Additionally, you can specify network interface information if the network does not have a DHCP configuration, for example:

instance-id: vco01
local-hostname: vco-01
network-interfaces: |
  auto eth0
  iface eth0 inet static
  address 10.0.1.2
  network 10.0.1.0
  netmask 255.255.255.0
  broadcast 10.0.1.255
  gateway 10.0.1.1

Create the Cloud-init User-data File

The second installation configuration option file contains the user data. This file provides information about users on the system. Create it with a text editor and name it user-data. This file enables access to the installation of Orchestrator. The following provides an example of the user-data file:

#cloud-config
password: Velocloud123
chpasswd: {expire: False}
ssh_pwauth: True
ssh_authorized_keys:
  - ssh-rsa AAA...SDvz <email address>
  - ssh-rsa AAB...QTuo <email address>
vco:
  super_users:
    list: |
      <email address>:password1
    remove_default_users: True
  system_properties:
    list: |
      mail.smtp.port:34
      mail.smtp.host:smtp.yourdomain.com
      service.maxmind.enable:True
      service.maxmind.license:todo_license
      service.maxmind.userid:todo_user
      service.twilio.phoneNumber:222123123
      network.public.address:222123123
write_files:
  - path: /etc/nginx/velocloud/ssl/server.crt
    permissions: '0644'
    content: "-----BEGIN CERTIFICATE-----\nMI….ow==\n-----END CERTIFICATE-----\n"
  - path: /etc/nginx/velocloud/ssl/server.key
    permissions: '0600'
    content: "-----BEGIN RSA PRIVATE KEY-----\nMII...D/JQ==\n-----END RSA PRIVATE KEY-----\n"
  - path: /etc/nginx/velocloud/ssl/velocloudCA.crt

This user-data file enables the default user, vcadmin, to log in either with a password or with an SSH key. You can use both methods, but this is not required. Password login is enabled by the password and chpasswd lines.
  • The password line contains the plain-text password for the vcadmin user.
  • The chpasswd line turns off password expiration to prevent the first login from immediately prompting for a change of password. This is optional.

Note: If you set a password, it is recommended that you change it when you first log in because the password has been stored in a plain text file.

The ssh_pwauth line enables SSH password login. The ssh_authorized_keys line begins a block of one or more authorized keys. Each public SSH key listed on the ssh-rsa lines is added to the vcadmin ~/.ssh/authorized_keys file.

In this example, two keys are listed, and each key has been truncated. In a real file, the entire public key must be listed. Note that the ssh-rsa lines must be preceded by two spaces, followed by a hyphen, followed by another space.

The vco section specifies configured Orchestrator services.

The super_users section contains a list of Super Operator accounts and their corresponding passwords.

The system_properties section lets you customize Orchestrator System Properties. See System Properties for details regarding system properties configuration.

The write_files section lets you replace files on the system. By default, Orchestrator web services are configured with a self-signed SSL certificate. To provide a different SSL certificate, replace the server.crt and server.key files in the /etc/nginx/velocloud/ssl/ folder with user-supplied files, as the above example does.

Note: The server.key file must be unencrypted. Otherwise, the service fails to start without the key password.

Create an ISO File

Once you have completed your files, package them into an ISO image. Use the ISO image as a virtual configuration CD with the virtual machine. This ISO image, called vco01-cidata.iso, is created with the following command on a Linux system:

genisoimage -output vco01-cidata.iso -volid cidata -joliet -rock user-data meta-data

Transfer the newly created ISO image to the datastore on the host running VeloCloud.
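
The file preparation described in this section can be scripted. The following is a minimal sketch under stated assumptions: the function names are hypothetical, the user-data text is supplied by the caller, and the genisoimage arguments are copied from the command above. It only covers the DHCP case, where meta-data holds the instance-id and local-hostname.

```python
from pathlib import Path


def render_meta_data(instance_id, hostname):
    """Build the meta-data file content for a DHCP-configured instance."""
    return f"instance-id: {instance_id}\nlocal-hostname: {hostname}\n"


def genisoimage_command(iso_name):
    """Return the ISO build command from this guide as an argument list."""
    return ["genisoimage", "-output", iso_name, "-volid", "cidata",
            "-joliet", "-rock", "user-data", "meta-data"]


def write_seed_files(directory, instance_id, hostname, user_data_text):
    """Write meta-data and user-data into directory and return its path."""
    if not user_data_text.startswith("#cloud-config"):
        raise ValueError("user-data must begin with the #cloud-config line")
    directory = Path(directory)
    directory.mkdir(parents=True, exist_ok=True)
    (directory / "meta-data").write_text(render_meta_data(instance_id, hostname))
    (directory / "user-data").write_text(user_data_text)
    return directory
```

After write_seed_files returns, the ISO could be built with subprocess.run(genisoimage_command("vco01-cidata.iso"), cwd=directory, check=True) on a Linux system where genisoimage is installed.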

Install on VMware

VMware vSphere provides a means of deploying and managing virtual machine resources. This section explains how to run the Orchestrator using the VMware vSphere Client.

Deploy OVA Template
Note: This procedure assumes familiarity with VMware vSphere and does not refer to any specific version of VMware vSphere.
  1. Log into the vSphere Client.
  2. Select File > Deploy OVF Template.
  3. Respond to the prompts with information specific to your deployment.
    Table 1. OVF Field Descriptions
    Field | Description
    Source | Type a URL or navigate to the OVA package location.
    OVF template details | Verify that you pointed to the correct OVA template for this installation.
    Name and location | Name of the virtual machine.
    Storage | Select the location to store the virtual machine files.
    Provisioning | Select the provisioning type. Thin provisioning is recommended for database and binary log volumes.
    Network mapping | Select the network for each virtual machine to use.
    Important: Clear the Power On After Deployment check box. Selecting it starts the virtual machine immediately, but the virtual machine should be started only after the cloud-init ISO has been attached.
  4. Select Finish.
    Note: Depending on your network speed, this deployment can take several minutes or more.

Attach ISO Image as a CD/DVD to Virtual Machine
  1. Right-click the newly-added Orchestrator VM and select Edit Settings.
  2. From the Virtual Machine Properties window, select CD/DVD Drive.
  3. Select the Use an ISO image option.
  4. Browse to find the ISO image you created earlier and then select it. The ISO can be found in the datastore that you uploaded it to, in the folder that you created.
  5. Select Connect on Power On.
  6. Select OK to exit the Properties screen.

Run the Orchestrator Virtual Machine
  1. To start the Orchestrator virtual machine, highlight it and then select the Power On button.
  2. Select the Console tab to watch as the virtual machine boots up.
    Note: If you configured Orchestrator as described here, log into the virtual machine with the user name vcadmin and password you defined when you created the cloud-init ISO.

Install on KVM

This section explains how to run the Orchestrator using libvirt. This deployment was tested on an Ubuntu 18.04 LTS instance.

Images
For KVM deployment, VMware provides the Orchestrator in four qcow images.
  • ROOTFS
  • STORE
  • STORE2
  • STORE3

The images are thin provisioned on deployment.

Start by copying the images to the KVM server. In addition, you must copy the cloud-init ISO built as described in the previous section.

XML Sample
Note: The sample XML assumes that the images are in the /images/vco folder. Edit the XML if your images are stored elsewhere.
<domain type='kvm' id='49'>
  <name>vco</name>
  <uuid>b0ff25bc-72b8-6ccb-e777-fdc0f4733e05</uuid>
  <memory unit='KiB'>12388608</memory>
  <currentMemory unit='KiB'>12388608</currentMemory>
  <vcpu>2</vcpu>
  <resource>
    <partition>/machine</partition>
  </resource>
  <os>
    <type>hvm</type>
  </os>
  <features>
    <acpi/>
    <apic/>
    <pae/>
  </features>
  <cpu mode='custom' match='exact'>
    <model fallback='allow'>SandyBridge</model>
    <vendor>Intel</vendor>
    <feature policy='require' name='vme'/>
    <feature policy='require' name='dtes64'/>
    <feature policy='require' name='invpcid'/>
    <feature policy='require' name='vmx'/>
    <feature policy='require' name='erms'/>
    <feature policy='require' name='xtpr'/>
    <feature policy='require' name='smep'/>
    <feature policy='require' name='pbe'/>
    <feature policy='require' name='est'/>
    <feature policy='require' name='monitor'/>
    <feature policy='require' name='smx'/>
    <feature policy='require' name='abm'/>
    <feature policy='require' name='tm'/>
    <feature policy='require' name='acpi'/>
    <feature policy='require' name='fma'/>
    <feature policy='require' name='osxsave'/>
    <feature policy='require' name='ht'/>
    <feature policy='require' name='dca'/>
    <feature policy='require' name='pdcm'/>
    <feature policy='require' name='pdpe1gb'/>
    <feature policy='require' name='fsgsbase'/>
    <feature policy='require' name='f16c'/>
    <feature policy='require' name='ds'/>
    <feature policy='require' name='tm2'/>
    <feature policy='require' name='avx2'/>
    <feature policy='require' name='ss'/>
    <feature policy='require' name='bmi1'/>
    <feature policy='require' name='bmi2'/>
    <feature policy='require' name='pcid'/>
    <feature policy='require' name='ds_cpl'/>
    <feature policy='require' name='movbe'/>
    <feature policy='require' name='rdrand'/>
  </cpu>
  <clock offset='utc'/>
  <on_poweroff>destroy</on_poweroff>
  <on_reboot>restart</on_reboot>
  <on_crash>restart</on_crash>
  <devices>
    <emulator>/usr/bin/kvm-spice</emulator>
    <disk type='file' device='disk'>
      <driver name='qemu' type='qcow2'/>
      <source file='/images/vco/rootfs.qcow2'/> 
      <target dev='hda' bus='ide'/>
      <alias name='ide0-0-0'/>
      <address type='drive' controller='0' bus='0' target='0' unit='0'/>
    </disk>
    <disk type='file' device='disk'>
      <driver name='qemu' type='qcow2'/>
      <source file='/images/vco/store.qcow2'/>
      <target dev='hdb' bus='ide'/>
      <alias name='ide0-0-1'/>
      <address type='drive' controller='0' bus='0' target='0' unit='1'/>
    </disk>
    <disk type='file' device='disk'>
      <driver name='qemu' type='qcow2'/>
      <source file='/images/vco/store2.qcow2'/>
      <target dev='hdc' bus='ide'/>
      <alias name='ide0-0-2'/>
      <address type='drive' controller='0' bus='1' target='0' unit='0'/>
    </disk>
    <disk type='file' device='disk'> 
      <driver name='qemu' type='qcow2' /> 
      <source file='/images/vco/store3.qcow2' /> 
      <target dev='hdd' bus='ide' />
      <alias name='ide0-0-3' /> 
      <address type='drive' controller='0' bus='1' target='0' unit='1' /> 
    </disk>
    <disk type='file' device='cdrom'>
      <driver name='qemu' type='raw'/>
      <source file='/images/vco/seed.iso'/>
      <target dev='sdb' bus='sata'/>
      <readonly/>
      <alias name='sata1-0-0'/>
      <address type='drive' controller='1' bus='0' target='0' unit='0'/>
    </disk>
    <controller type='usb' index='0'>
      <alias name='usb0'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x01' function='0x2'/>
    </controller>
    <controller type='pci' index='0' model='pci-root'>
      <alias name='pci.0'/>
    </controller>
    <controller type='ide' index='0'>
      <alias name='ide0'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x01' function='0x1'/>
    </controller>
    <interface type='direct'>
      <source dev='eth0' mode='vepa'/>
    </interface>
    <serial type='pty'>
      <source path='/dev/pts/3'/>
      <target port='0'/>
      <alias name='serial0'/>
    </serial>
    <console type='pty' tty='/dev/pts/3'>
      <source path='/dev/pts/3'/>
      <target type='serial' port='0'/>
      <alias name='serial0'/>
    </console>
    <memballoon model='virtio'>
      <alias name='balloon0'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x03' function='0x0'/>
    </memballoon>
  </devices>
  <seclabel type='none' />
<!--  <seclabel type='dynamic' model='apparmor' relabel='yes'/> -->
</domain>

Create the VM

Create the VM using the standard virsh commands:

virsh define vco.xml
virsh start vco
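
Before running virsh define, it can help to confirm that every disk in the domain XML points at the intended image file. The following is a minimal sketch using only the Python standard library; the function names are hypothetical, and the whitespace check is simply a heuristic for catching mistyped paths.

```python
import xml.etree.ElementTree as ET


def disk_sources(domain_xml):
    """Return (device, file) pairs for each file-backed disk in a libvirt domain XML string."""
    root = ET.fromstring(domain_xml)
    sources = []
    for disk in root.findall("./devices/disk"):
        source = disk.find("source")
        if source is not None and source.get("file"):
            sources.append((disk.get("device"), source.get("file")))
    return sources


def suspicious_paths(sources):
    """Return the image paths that contain a space, which usually indicates a typo."""
    return [path for _, path in sources if " " in path]
```

For example, reading vco.xml into a string and passing it to disk_sources lists the four qcow2 images and the cloud-init ISO, and suspicious_paths flags any path such as '/ images/vco/store.qcow2' that contains a stray space.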

Install on AWS

This section discusses how to install Orchestrator on AWS.

Minimum Instance Requirements

See Instance Requirements, the first section of Installing Orchestrator, and select an AWS instance type that matches these requirements. Both the CPU and memory requirements must be satisfied. For example, use c4.2xlarge or larger, or r4.2xlarge or larger.

Request an AMI Image

Request an AMI ID from VeloCloud. The AMI is shared with the customer account, so have the Amazon AWS account ID ready when requesting AMI access.

Installation
  1. Launch the EC2 instance in AWS cloud.

    Example: http://docs.aws.amazon.com/efs/latest/ug/gs-step-one-create-ec2-resources.html

  2. Configure the security group to allow inbound HTTP (TCP/80) as well as HTTPS (TCP/443).
  3. After the instance is launched, point the web browser to the Operator login URL: https://<name>/operator.

Initial Configuration Tasks

Complete the following initial configuration tasks:
  • Configuring system properties
  • Setting up initial operator profile
  • Setting up operator accounts
  • Creating gateways
  • Setting up gateway pools
  • Creating customer account/partner account

Install an SSL Certificate

This section discusses how to install an SSL certificate.

To install an SSL certificate:

  1. Log in to the Orchestrator CLI console through SSH. If you configured the Orchestrator as described here, you should be able to log in to the virtual machine with the user name vcadmin and the password that you defined when you created the cloud-init ISO.
  2. Generate the Orchestrator private key.
    Note: Do not encrypt the key. It must remain unencrypted on the Orchestrator system.
    openssl genrsa -out server.key 2048
  3. Generate a certificate request. Customize -subj according to your organization information.
    openssl req -new -key server.key -out server.csr -subj "/C=US/ST=California/L=Mountain View/O=Velocloud Networks Inc./OU=Development/CN=vco.velocloud.net"
    Table 2. Field Descriptions
    Field | Description
    C | country
    ST | state
    L | locality (city)
    O | company
    OU | department (optional)
    CN | Orchestrator fully qualified domain name
  4. Send the server.csr to a Certificate Authority for signing. You should get back the SSL certificate server.crt. Ensure that it is in PEM format.
  5. Install the certificate. This step requires root access. Orchestrator SSL certificates are located in /etc/nginx/velocloud/ssl/.
    cp server.key server.crt /etc/nginx/velocloud/ssl/
    chmod 600 /etc/nginx/velocloud/ssl/server.key
  6. Restart nginx.
    systemctl restart nginx
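
The -subj value used in the certificate request can be assembled from the fields in Table 2. The following is a minimal sketch; build_subject and csr_command are hypothetical helper names, and the sketch assumes that no field value contains a "/" character, which would need escaping.

```python
# Field order used in the example -subj string in this section.
SUBJECT_ORDER = ["C", "ST", "L", "O", "OU", "CN"]


def build_subject(fields):
    """Join the provided fields into an openssl -subj string, skipping empty ones."""
    return "".join(f"/{key}={fields[key]}" for key in SUBJECT_ORDER if fields.get(key))


def csr_command(subject, key_path="server.key", csr_path="server.csr"):
    """Return the certificate request command from step 3 as an argument list."""
    return ["openssl", "req", "-new", "-key", key_path, "-out", csr_path, "-subj", subject]
```

Passing the command as an argument list, for example to subprocess.run, avoids shell quoting problems with values such as "Mountain View" that contain spaces. Because OU is optional, omitting it from the dictionary simply drops it from the subject.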

Configure System Properties

This section discusses how to configure System Properties, which provide a mechanism to control the system-wide behavior of the VeloCloud SD-WAN.

System Properties can be set initially using the cloud-init config file. For additional information, see Preparing the Cloud-init. The following properties need to be configured to ensure proper operation of the service.

System Name

Enter the fully qualified domain name of the Orchestrator in the network.public.address system property.

Google Maps
Google Maps displays edges and data centers on a map. Maps may fail to display without a license key. The Orchestrator continues to function properly, but browser maps are not available in this case.
  1. Log in to https://console.developers.google.com.
  2. Create a new project, if one is not already created.
  3. Locate the Enable API button. Select the Google Maps APIs and enable both Google Maps JavaScript API and Google Maps Geolocation API.
  4. On the left side of the screen, select the Credentials link.
  5. On the Credentials page, select Create Credentials, then select API key. Create an API key.
  6. Set the service.client.googleMapsApi.key system property to the API key.
  7. Set service.client.googleMapsApi.enable to true.

Twilio
Twilio provides an optional messaging service that allows you to receive VeloCloud alerts via SMS. The account details can be entered into VeloCloud through the Operator Portal System Properties page. The properties are called:
  • service.twilio.enable - allows the service to be deactivated in the event that no Internet access is available to the VeloCloud
  • service.twilio.accountSid
  • service.twilio.authToken
  • service.twilio.phoneNumber - in (nnn)nnn-nnnn format

Obtain the service at https://www.twilio.com.
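
The service.twilio.phoneNumber format can be produced from a number written in another style. The following is a minimal sketch; format_twilio_number is a hypothetical helper, and it assumes a 10-digit number, as the (nnn)nnn-nnnn pattern implies.

```python
import re


def format_twilio_number(raw):
    """Reformat a 10-digit phone number as (nnn)nnn-nnnn."""
    digits = re.sub(r"\D", "", raw)  # keep digits only
    if len(digits) != 10:
        raise ValueError(f"expected 10 digits, got {len(digits)}")
    return f"({digits[:3]}){digits[3:6]}-{digits[6:]}"
```

For example, format_twilio_number("415 555 0123") returns "(415)555-0123".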

MaxMind
MaxMind is a geolocation service. It is used to automatically detect Edge and Gateway locations and ISP names based on an IP address. If this service is deactivated, then geolocation information will need to be updated manually. The account details can be entered into the VeloCloud through the Operator Portal's System Properties page. You can configure:
  • service.maxmind.enable - allows the service to be deactivated in the event that no Internet access is available to the VeloCloud
  • service.maxmind.userid - holds the user identification supplied by MaxMind during the account creation
  • service.maxmind.license - holds the license key supplied by MaxMind

Obtain the license at: https://www.maxmind.com/en/geoip-api-web-services.

Email
Email services can be used for both sending the Edge activation messages as well as for alarms and notifications. It is not required, but it is strongly recommended that you configure this as part of VeloCloud operations. The following system properties are available to configure the external email service used by the Orchestrator:
  • mail.smtp.auth.pass - SMTP user password.
  • mail.smtp.auth.user - SMTP user for authentication.
  • mail.smtp.host - relay server for email originated from the VeloCloud.
  • mail.smtp.port - SMTP port.
  • mail.smtp.secureConnection - use SSL for SMTP traffic.
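
When these properties are set through cloud-init, the user-data example earlier in this guide lists them as name:value lines under system_properties. The following is a minimal sketch that renders such lines from a dictionary; render_property_lines is a hypothetical helper, and the name:value layout is inferred from that example rather than from a formal specification.

```python
def render_property_lines(properties):
    """Render system properties as name:value lines, one property per line."""
    lines = []
    for name, value in properties.items():
        text = str(value)
        if "\n" in text:
            raise ValueError(f"value for {name} must fit on a single line")
        lines.append(f"{name}:{text}")
    return "\n".join(lines)
```

For example, passing {"mail.smtp.host": "smtp.yourdomain.com", "mail.smtp.port": 34} yields two lines that can be indented and pasted under the list: | key of the system_properties section.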

Upgrading Orchestrator

This section discusses how to upgrade the Orchestrator.

To upgrade the Orchestrator:

  1. Upload the image to the SD-WAN Orchestrator system using any file transfer tool available in your infrastructure, for example, scp. Copy the image to the following location on the system: /var/lib/velocloud/software_update/vco_update.tar.
  2. Connect to the SD-WAN Orchestrator console and run:
    sudo /opt/vc/bin/vco_software_update
    Note: If you configured the SD-WAN Orchestrator as described here, you should be able to log into the virtual machine with the user name vcadmin and the password that you defined when you created your cloud-init configuration files.

    For instructions on how to upgrade the SD-WAN Orchestrator with DR deployment, see Upgrade the DR Setup.

Expanding Disk Size

All storage volumes are configured as LVM devices. They can be resized online, provided that the underlying virtualization technology supports online disk expansion. Disks are expanded automatically via cloud-init when the VM boots.

To expand disks after boot:

  1. Log in to the Orchestrator system console.
  2. Identify the physical disks that support the database volume.
    vgs -o +devices store

    Example

    root@vco:~# vgs -o +devices store
      VG    #PV #LV #SN Attr   VSize   VFree   Devices
      store   1   1   0 wz--n- 500.00g 125.00g /dev/sdb(0)
  3. Identify the physical disk attachment.
    lshw -class volume

    Example

    In this example, /dev/sdb is attached to scsi@2:0.1.0 (Host: scsi2 Channel: 00 Id: 01 Lun: 00).

    root@vco:~# lshw -class volume
      *-volume
          description: EXT4 volume
          vendor: Linux
          physical id: 1
          bus info: scsi@2:0.0.0,1
          logical name: /dev/sda1
          logical name: /
          version: 1.0
          serial: 9d212247-77c4-4f98-a5c2-7f8470fa2da8
          size: 10239MiB
          capacity: 10239MiB
          capabilities: primary bootable journaled extended_attributes large_files huge_files dir_nlink recover extents ext4 ext2 initialized
          configuration: created=2016-02-22 20:49:38 filesystem=ext4 label=cloudimg-rootfs lastmountpoint=/ modified=2016-02-22 21:18:58 mount.fstype=ext4 mount.options=rw,relatime,data=ordered mounted=2016-10-06 23:22:04 state=mounted
      *-disk:1
          description: SCSI Disk
          physical id: 0.1.0
          bus info: scsi@2:0.1.0
          logical name: /dev/sdb
          serial: v5V2zm-Lvbh-Mfx3-W8ki-COI9-DAtP-RXndhu
          size: 500GiB
          capacity: 500GiB
          capabilities: lvm2
          configuration: sectorsize=512
      *-disk:2
          description: SCSI Disk
          physical id: 0.2.0
          bus info: scsi@2:0.2.0
          logical name: /dev/sdc
          serial: fTQFJ2-giAV-WsXL-1Wha-V305-oQkV-qqS3SA
          size: 100GiB
          capacity: 100GiB
          capabilities: lvm2
          configuration: sectorsize=512
  4. On the hypervisor host, locate the disk attached to the VM using bus information. Example: SCSI(0:1)
  5. Extend the virtual disk.
  6. View the disk input/output statistics. The statistics are displayed twice, at an interval of 10 seconds.
    sar -d -p 10 2
    Note: This step is optional.
  7. Log in to the Orchestrator system console again.
  8. View detailed device utilization statistics, which provide insight into individual storage device performance.
    iostat -d -x
    Note: This step is optional.
  9. Log in to the Orchestrator system console again.
  10. Re-scan the block device for the resized physical volume. Example:
    echo 1 > /sys/block/$DEVICE/device/rescan
  11. Resize the LVM physical disk.
    pvresize /dev/sdb
  12. Determine the amount of free space in the database volume group.
    root@vco:~# vgdisplay store |grep Free 
    Free PE / Size 34560 / 135.00 GiB
  13. Extend the database logical volume. Replace # with the number of gigabytes to add.
    lvextend -r -L+#G /dev/store/data

    Example:

    root@vco1:~# lvextend -r -L+1G /dev/store/data
    Size of logical volume store/data changed from 400.00 GiB (102400 extents) to 401.00 GiB (102656 extents).
    Logical volume store/data successfully resized.
    resize2fs 1.44.1 (24-Mar-2018)
    Filesystem at /dev/mapper/store-data is mounted on /store; on-line resizing required
    old_desc_blocks = 50, new_desc_blocks = 51
    The filesystem on /dev/mapper/store-data is now 105119744 (4k) blocks long.
  14. View the new size of the volume.
    df -h /dev/store/data

    Example:

    root@vco:~# df -h /dev/store/data
    Filesystem              Size  Used Avail Use% Mounted on
    /dev/mapper/store-data  379G  1.2G  359G   1% /store
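
Steps 12 and 13 can be tied together by reading the free space from the vgdisplay output and refusing to extend beyond it. The following is a minimal sketch; parse_free_space and lvextend_command are hypothetical helper names, and the parser assumes the "Free PE / Size" line format shown in step 12.

```python
import re

# Matches a line such as "Free PE / Size 34560 / 135.00 GiB".
# Some LVM versions prefix a rounded size with "<", so that is allowed too.
FREE_LINE = re.compile(r"Free\s+PE\s*/\s*Size\s+(\d+)\s*/\s*<?([\d.]+)\s*(\w+)")


def parse_free_space(vgdisplay_output):
    """Return (free_extents, free_size, unit) from vgdisplay output."""
    match = FREE_LINE.search(vgdisplay_output)
    if match is None:
        raise ValueError("no 'Free PE / Size' line found")
    return int(match.group(1)), float(match.group(2)), match.group(3)


def lvextend_command(grow_gib, free_gib, lv_path="/dev/store/data"):
    """Build the lvextend command from step 13, rejecting requests larger than the free space."""
    if grow_gib <= 0 or grow_gib > free_gib:
        raise ValueError(f"cannot grow by {grow_gib} GiB with {free_gib} GiB free")
    return ["lvextend", "-r", f"-L+{grow_gib}G", lv_path]
```

In the step 12 example, 34560 free extents correspond to 135.00 GiB, so a request to add 1 GiB passes the check and produces the same command as the step 13 example.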

System Properties

VeloCloud provides System Properties to configure various features and options available in the Orchestrator portal.

In the Operator portal, navigate to the System Properties page, which lists the available pre-defined system properties. See List of System Properties, which lists some of the system properties that you can modify as an Operator.

Figure 1. System Properties

To configure the system properties:

  1. Select New System Property to add a new property.
  2. In the New System Property window, configure the following parameters:
    Figure 2. Adding a New System Property
    Table 3. New System Property Option Descriptions
    Option | Description
    Name | Enter the Name for the new system property.
    Data Type | Choose the required Data Type from the drop-down menu.
    Value | Enter the Value for the property according to the data type.
    Value is Password | Select Yes or No as required.
    Value is Read-only | Select Yes or No as required.
    Description | Enter the Description for the new system property.
  3. Select Save Changes.

    You can use the Search field to find a specific system property.

    Note: It is recommended to contact Arista Support before making changes to the system properties.

List of System Properties

As an Operator, you can add or modify the values of the system properties.

The following tables describe some of these system properties:
  • Alert Emails
  • Alerts
  • Bastion Orchestrator Configuration
  • Certificate Authority
  • Customer Configuration
  • Data Retention
  • Edges
  • Edge Activation
  • Edge Management
  • Enhanced Firewall Services
  • LAN-Side NAT Rules
  • Monitoring
  • Notifications
  • Password Reset and Lockout
  • Rate Limiting APIs
  • Remote Diagnostics
  • Security Service Edge
  • Segmentation
  • Self-service Password Reset
  • Syslog Forwarding
  • TACACS Services
  • Two-factor Authentication
  • Tunnel Parameters for Edges
  • VNF Configuration
  • VPN
  • Warning Banner
  • Zscaler
Table 4. Alert Emails
System Property Description
vco.alert.mail.to When an alert is triggered, a notification is sent immediately to the list of Email addresses provided in the Value field of this system property. You can enter multiple Email IDs separated by commas.

If the property does not contain any value, then the notification is not sent.

The notification is meant to alert Arista support / operations personnel of impending issues before notifying the customer.

vco.alert.mail.cc When alert emails are sent to any customer, a copy is sent to the Email addresses provided in the Value field of this system property. You can enter multiple Email IDs separated by commas.
mail.* There are multiple system properties available to control the Alert Emails. You can define the Email parameters like SMTP properties, username, password, and so on.
Table 5. Alerts
System Property Description
vco.alert.enable Globally activates or deactivates the generation of alerts for both Operators and Enterprise customers.
vco.enterprise.alert.enable Globally activates or deactivates the generation of alerts for Enterprise customers.
vco.operator.alert.enable Globally activates or deactivates the generation of alerts for Operators.
Table 6. Bastion Orchestrator Configuration
System Property Description
session.options.enableBastionOrchestrator Enables the Bastion Orchestrator feature.

For more information, see Bastion Orchestrator Configuration Guide.

vco.bastion.private.enable Enables the Orchestrator to be the Private Orchestrator of the Bastion pair.
vco.bastion.public.enable Enables the Orchestrator to be the Public Orchestrator of the Bastion pair.
Table 7. Certificate Authority
System Property Description
edge.certificate.renewal.window This optional system property allows the Operator to define one or more maintenance windows during which the Edge certificate renewal is enabled. Certificates scheduled for renewal outside of the windows will be deferred until the current time falls within one of the enabled windows.

Enable System Property:

To enable this system property, type "true" for "enabled" in the first part of the Value text area in the Modify System Property dialog box. An example of the first part of this system property when it is enabled is shown below.

Operators can define multiple windows to restrict the days and hours of the day during which Edge renewals are enabled. Each window can be defined by a day, or a list of days (separated by a comma), and a start and end time. Start and end times can be specified relative to an Edge's local time zone, or relative to UTC. See image below for an example.

Figure 3. Modify System Property

Note: If attributes are not present, the default is false.
When defining window attributes, adhere to the following:
  • Use IANA time zones, not PDT or PST (e.g. America/Los_Angeles).
  • Use UTC for days (e.g. SAT, SUN).
    • Separated by comma.
    • Days in three letters in English.
    • Not case sensitive.
  • Use 24-hour (military) time format only (HH:MM) for start times (e.g. 01:30) and end times (e.g. 05:30).

If the above-mentioned values are missing, the attribute defaults in each window definition are as follows:

  • If enabled is missing, the default value = false.
  • If timezone is missing, the default = 'local'.
  • If either 'days' or the start and end times are missing, the defaults are as follows:
    • If 'days' is missing, the start/end is applied to each day of the week (Mon, Tue, Wed, Thu, Fri, Sat, Sun).
    • If the start and end times are missing, then any time in the specified day will match (start = 00:00 and end = 23:59).
    • NOTE: One of either 'days' or the start and end times must be present. However, if one of them is missing, the defaults indicated above apply.

Deactivate System Property

This system property is deactivated by default, which means the certificate will automatically renew after it expires. "Enabled" will be set to "false" in the first part of the Value text area in the Modify System Property dialog box. An example of this property when it is deactivated is shown below.

{
"enabled": false,
"windows": [
{

NOTE: This system property requires that PKI be enabled.

gateway.certificate.renewal.window This optional system property allows the Operator to define one or more maintenance windows during which the Gateway certificate renewal is enabled. Certificates scheduled for renewal outside of the windows will be deferred until the current time falls within one of the enabled windows.

Enable System Property:

To enable this system property, type "true" for "enabled" in the first part of the Value text area in the Modify System Property dialog box. See image below for an example.

Operators can define multiple windows to restrict the days and hours of the day during which Gateway renewals are enabled. Each window can be defined by a day, or a list of days (separated by a comma), and a start and end time. Start and end times can be specified relative to a Gateway's local time zone, or relative to UTC. See image below for an example.

Figure 4. Modify System Property

Note: If attributes are not present, the default is false.
When defining window attributes, adhere to the following:
  • Use IANA time zones, not PDT or PST (e.g. America/Los_Angeles).
  • Use UTC for days (e.g. SAT, SUN).
    • Separated by comma.
    • Days in three letters in English.
    • Not case sensitive.
  • Use 24-hour (military) time format only (HH:MM) for start times (e.g. 01:30) and end times (e.g. 05:30).
If the above-mentioned values are missing, the attribute defaults in each window definition are as follows:
  • If enabled is missing, the default value = false.
  • If timezone is missing, the default = 'local'.
  • If either 'days' or the start and end times are missing, the defaults are as follows:
    • If 'days' is missing, the start/end is applied to each day of the week (Mon, Tue, Wed, Thu, Fri, Sat, Sun).
    • If the start and end times are missing, then any time in the specified day will match (start = 00:00 and end = 23:59).
    • NOTE: One of either 'days' or the start and end times must be present. However, if one of them is missing, the defaults indicated above apply.

Deactivate System Property

This system property is deactivated by default, which means the certificate will automatically renew after it expires. "Enabled" will be set to "false" in the first part of the Value text area in the Modify System Property dialog box. An example of this property when it is deactivated is shown below.

{

"enabled": false,

"windows": [

{
Note: This system property requires that PKI be enabled.
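The window rules for edge.certificate.renewal.window and gateway.certificate.renewal.window can be illustrated with a short Python sketch. It is not the Orchestrator's implementation. It assumes that each window is a JSON object whose keys are literally named days, start, and end, and that the caller already supplies the current time in the window's time zone (the device's local time or UTC). The function names are hypothetical. The sketch only applies the documented defaults for missing days and times, and it does not handle windows that span midnight.

```python
from datetime import datetime, time

# Three-letter English day names, indexed like datetime.weekday() (Monday == 0).
ALL_DAYS = ("MON", "TUE", "WED", "THU", "FRI", "SAT", "SUN")


def window_matches(window: dict, now: datetime) -> bool:
    """Return True if `now` falls inside a single window definition.

    `now` must already be expressed in the window's time zone.
    The key names "days", "start" and "end" are assumptions.
    """
    # 'days' is a comma-separated, case-insensitive list; default is every day.
    days_value = window.get("days")
    if days_value:
        days = {day.strip().upper() for day in days_value.split(",")}
    else:
        days = set(ALL_DAYS)

    # Missing start/end times mean the whole day matches (00:00 to 23:59).
    start = time.fromisoformat(window.get("start", "00:00"))
    end = time.fromisoformat(window.get("end", "23:59"))

    # Compare at minute resolution, since windows are given as HH:MM.
    current = now.time().replace(second=0, microsecond=0)
    return ALL_DAYS[now.weekday()] in days and start <= current <= end


def in_any_window(windows: list, now: datetime) -> bool:
    """Renewal is eligible when at least one window matches."""
    return any(window_matches(window, now) for window in windows)
```

For example, a window of {"days": "SAT, SUN", "start": "01:30", "end": "05:30"} matches 02:00 on a Saturday but not 06:00 on the same day.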
Table 8. Customer Configuration
System Property Description
session.options.enableServiceLicenses This system property allows Operator users to manage Service Configuration under Global Settings > Customer Configuration, and is set to True by default.
Table 9. Data Retention
System Property Description
retention.highResFlows.days This system property enables Operators to configure high resolution flow stats data retention anywhere between 1 and 90 days.
retention.lowResFlows.months This system property enables Operators to configure low resolution flow stats data retention anywhere between 1 and 365 days.
session.options.maxFlowstatsRetentionDays This property enables Operators to query more than two weeks of flow stats data.
retentionWeeks.enterpriseEvents Enterprise events retention period (-1 sets retention to the maximum time period allowed)
retentionWeeks.operatorEvents Operator events retention period (-1 sets retention to the maximum time period allowed)
retentionWeeks.proxyEvents Proxy events retention period (-1 sets retention to the maximum time period allowed)
retentionWeeks.firewallLogs Firewall logs retention period (-1 sets retention to the maximum time period allowed)
retention.linkstats.days Link stats retention period (-1 sets retention to the maximum time period allowed)
retention.linkquality.days Link quality events retention period (-1 sets retention to the maximum time period allowed)
retention.healthstats.days Edge health stats retention period (-1 sets retention to the maximum time period allowed)
retention.pathstats.days Path stats retention period (-1 sets retention to the maximum time period allowed)
Table 10. Data Retention Periods
SD-WAN Data Date Retention Period
Enterprise Events 1 year
Enterprise Alerts 1 year
Operator Events 1 year
Enterprise Proxy Events 1 year
Link Stats 1 year
Link QoE 1 year
Path Stats 2 weeks
Flow Stats (Low Resolution) 1 year – 1 hour rollup
Flow Stats (High Resolution) 2 weeks – 5 minute rollup
Edge Health Stats 1 year
Table 11. Edges
System Property Description
edge.offline.limit.sec If the Orchestrator does not detect a heartbeat from an Edge for the specified duration, then the state of the Edge is moved to OFFLINE mode.
edge.link.unstable.limit.sec When the Orchestrator does not receive link statistics for a link for the specified duration, the link is moved to UNSTABLE mode.
edge.link.disconnected.limit.sec When the Orchestrator does not receive link statistics for a link for the specified duration, the link is disconnected.
edge.deadbeat.limit.days If an Edge is not active for the specified number of days, then the Edge is not considered for generating Alerts.
vco.operator.alert.edgeLinkEvent.enable Globally activates or deactivates Operator Alerts for Edge Link events.
vco.operator.alert.edgeLiveness.enable Globally activates or deactivates Operator Alerts for Edge Liveness events.
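As a rough illustration of how the time limits in this table relate to Edge and link states, the following Python sketch classifies an Edge or a link from the time elapsed since the last heartbeat or link statistics. It is a sketch under stated assumptions, not the Orchestrator's logic: the function names and the "STABLE" and "DISCONNECTED" labels are hypothetical, the comparison at the exact limit is assumed to be inclusive, and the limit values are passed in by the caller because their defaults are not stated here.

```python
def edge_is_offline(seconds_since_heartbeat: float, offline_limit_sec: float) -> bool:
    """edge.offline.limit.sec: no heartbeat for this long moves the Edge to OFFLINE."""
    return seconds_since_heartbeat >= offline_limit_sec


def link_state(seconds_since_stats: float,
               unstable_limit_sec: float,
               disconnected_limit_sec: float) -> str:
    """Classify a link from the time since its last link statistics.

    unstable_limit_sec     -> edge.link.unstable.limit.sec
    disconnected_limit_sec -> edge.link.disconnected.limit.sec
    """
    # Check the larger limit first so a long silence is reported as disconnected.
    if seconds_since_stats >= disconnected_limit_sec:
        return "DISCONNECTED"
    if seconds_since_stats >= unstable_limit_sec:
        return "UNSTABLE"
    return "STABLE"
```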
Table 12. Edge Activation
System Property Description
edge.activation.key.encode.enable Base64 encodes the activation URL parameters to obscure values when the Edge Activation Email is sent to the Site Contact.
edge.activation.trustedIssuerReset.enable Resets the trusted certificate issuer list of the Edge to contain only the Orchestrator Certificate Authority. All TLS traffic from the Edge is restricted by the new issuer list.
network.public.certificate.issuer When edge.activation.trustedIssuerReset.enable is set to True, set this property to the PEM encoding of the issuer of the Orchestrator server certificate. This adds the server certificate issuer to the trusted issuers of the Edge, in addition to the Orchestrator Certificate Authority.
Table 13. Edge Management
System Property Description
edge.link.show.limit.sec Sets the Edge Link Down Limit value for each Edge.
Table 14. Enhanced Firewall Services
System Property Description
ntics.public.address Specifies the hostname that is used to access the NSX Threat Intelligence Cloud Service (NTICS).
gsm.public.address Specifies the Public address of Global Services Manager (GSM).
gsm.authentication.key Specifies the mTLS key to authenticate with GSM.
gsm.authentication.cert Specifies the mTLS certificate to authenticate with GSM.
gsm.authentication.passphrase Specifies the mTLS passphrase to authenticate with GSM.
Table 15. LAN-Side NAT Rules
System Property Description
session.options.enableLansidePortRules Allows configuration of the Inside Port and Outside Port parameters under the Device Settings tab > Routing and NAT > LAN-Side NAT Rules for an Edge or Profile.
Table 16. Monitoring
System Property Description
vco.monitor.enable Globally activates or deactivates monitoring of Enterprise and Operator entity states. Setting the Value to False prevents Orchestrator from changing entity states and triggering alerts.
vco.enterprise.monitor.enable Globally activates or deactivates monitoring of Enterprise entity states.
vco.operator.monitor.enable Globally activates or deactivates monitoring of Operator entity states.
Table 17. Live Data
System Property Description
edge.liveData.enterFlowLiveMode.delay.seconds How long the Edge waits before giving up on capturing the flow count configured by edge.liveData.enterFlowLiveMode.flow.count. The default value is five seconds. The allowed range is 5-59 seconds. Invalid input defaults to zero seconds.
edge.liveData.enterFlowLiveMode.flow.count How many flows the Edge returns if that count is reached within the time configured by edge.liveData.enterFlowLiveMode.delay.seconds. The default value is 1000. The allowed range is 1000-4999 total flows. Invalid input defaults to one flow.
Table 18. Notifications
System Property Description
vco.notification.enable Globally activates or deactivates the delivery of Alert notifications to both Operators and Enterprises.
vco.enterprise.notification.enable Globally activates or deactivates the delivery of Alert notifications to the Enterprises.
vco.operator.notification.enable Globally activates or deactivates the delivery of Alert notifications to the Operator.
Table 19. Object Groups
System Property Description
vco.object.groups.max.count.per.enterprise Maximum allowed number of object groups per Enterprise. The default value is 2000.
vco.object.groups.max.count.per.edge Maximum allowed number of object group associations per Edge and its Profile. The default value is 1000.
Table 20. Password Reset and Lockout
System Property Description
vco.enterprise.resetPassword.token.expirySeconds Duration of time, after which the password reset link for an enterprise user expires.
vco.enterprise.authentication.passwordPolicy Defines the password strength, history, and expiration policy for customer users.

Edit the JSON template in the Value field to define the following:

strength
  • minlength: Minimum password character length. The default minimum password length is 8 characters.
  • maxlength: Maximum password character length. The default maximum password length is 32 characters.
  • requireNumber: The password must contain at least one numeric character. Numeric requirement is enabled by default.
  • requireLower: The password must contain at least one lowercase character. Lowercase requirement is enabled by default.
  • requireUpper: The password must contain at least one uppercase character. Uppercase requirement is not enabled by default.
  • requireSpecial: The password must contain at least one special character (for example, _@!). The special character requirement is not enabled by default.
  • excludeTop: Password must not match a list of the most used passwords. Default value is 1000, representing the top 1000 most used passwords, and is configurable to a maximum of 10,000 of the most used passwords.
  • maxRepeatingCharacters: Password must not include more than a configurable number of repeated characters. For example, if maxRepeatingCharacters is set to ‘2’, the Orchestrator rejects any password with 3 or more repeated characters, like “Passwordaaa”. The default value of -1 signifies that this feature is not enabled.
  • maxSequenceCharacters: Password must not include more than a configurable number of sequential characters. For example, if maxSequenceCharacters is set to ‘3’, the Orchestrator rejects any password in which 4 or more characters are sequential, like “Password1234”. The default value of -1 signifies that this feature is not enabled.
  • disallowUsernameCharacters: Password must not match a configurable portion of the user's ID. For example, if disallowUsernameCharacters is set to 5 and a user attempts to configure a new password that includes any five-character string matching a section of that user’s username (such as ‘usern’ or ‘serna’), the Orchestrator rejects the new password. The default value of -1 signifies that this feature is not enabled.
  • variationValidationCharacters: New password must vary from the old password by a configurable number of characters. The Orchestrator uses the Levenshtein distance between two words to determine the variation between the new and old password. The Levenshtein distance is the minimum number of single-character edits (insertions, deletions, or substitutions) required to change one word into another.
  • If variationValidationCharacters is set to 4, then the Levenshtein distance between the new and old password must be 4 or greater. In other words, the new password must have 4 or more variations from the old password. For example, if the old password used was "kitten" and the new password is "sitting", the Levenshtein distance for these is 3, since it requires only three edits to change kitten into sitting:
    • kitten → sitten (substitution of "s" for "k")
    • sitten → sittin (substitution of "i" for "e")
    • sittin → sitting (insertion of "g" at the end).

Since the new password only varies by 3 characters from the old, “sitting” would be rejected as a new password to replace “kitten”. The default value of -1 signifies that this feature is not enabled.

expiry:
  • enable: Set this to true to enable automatic expiry of customer user passwords.
  • days: Enter the number of days that a customer password may be used before forced expiration.
history:
  • enable: Set this to true to enable recording of customer users' previous passwords.
  • count: Enter the number of previous passwords to be saved in the history. When a customer user tries to change the password, the system does not allow the user to enter a password that is already saved in the history.
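The variationValidationCharacters rule can be checked by hand with a small Levenshtein routine. The following Python sketch is one standard dynamic-programming formulation, shown only to make the "kitten" and "sitting" example concrete. It is not taken from the Orchestrator, and the helper name varies_enough is hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    or substitutions needed to turn `a` into `b`."""
    # previous[j] holds the distance between the processed prefix of `a`
    # and the first j characters of `b`.
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or keep when equal)
            ))
        previous = current
    return previous[-1]


def varies_enough(old: str, new: str, variation_validation_characters: int) -> bool:
    """Apply the rule as described: the distance must be at least the setting.
    A negative setting (-1) means the check is not enabled."""
    if variation_validation_characters < 0:
        return True
    return levenshtein(old, new) >= variation_validation_characters
```

With a setting of 4, varies_enough("kitten", "sitting", 4) is False, because the distance between the two words is 3.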
enterprise.user.lockout.defaultAttempts Number of times the Enterprise user can attempt to log in. If the login fails for the specified number of times, the account is locked.
enterprise.user.lockout.defaultDurationSeconds Duration of time, in seconds, in which the Enterprise user account is locked.

For example, if set to 300, the Enterprise user account will get locked if four incorrect login attempts are made within 300 seconds. If set to 60, the Enterprise user account will get locked if four incorrect attempts are made within one minute.

Note: The number of attempts is configurable via the enterprise.user.lockout.defaultAttempts system property.
enterprise.user.lockout.enabled Activates or deactivates the lockout option for the enterprise login failures.
vco.operator.resetPassword.token.expirySeconds Duration of time, after which the password reset link for an Operator user expires.
vco.operator.authentication.passwordPolicy Defines the password strength, history, and expiration policy for Operator users.

Edit the JSON template in the Value field to define the following:

strength
  • minlength: Minimum password character length. The default minimum password length is 8 characters.
  • maxlength: Maximum password character length. The default maximum password length is 32 characters.
  • requireNumber: The password must contain at least one numeric character. Numeric requirement is enabled by default.
  • requireLower: The password must contain at least one lowercase character. Lowercase requirement is enabled by default.
  • requireUpper: The password must contain at least one uppercase character. Uppercase requirement is not enabled by default.
  • requireSpecial: The password must contain at least one special character (for example, _@!). The special character requirement is not enabled by default.
    Note: Starting from the 4.5 release, the use of the special character "<" in the password is no longer supported. In cases where users have already used "<" in their passwords in previous releases, they must remove it to save any changes on the page.
  • excludeTop: Password must not match a list of the most used passwords. Default value is 1000, representing the top 1000 most used passwords, and is configurable to a maximum of 10,000 of the most used passwords.
  • maxRepeatingCharacters: Password must not include more than a configurable number of repeated characters. For example, if maxRepeatingCharacters is set to ‘2’, the Orchestrator rejects any password with 3 or more repeated characters, like “Passwordaaa”. The default value of -1 signifies that this feature is not enabled.
  • maxSequenceCharacters: Password must not include more than a configurable number of sequential characters. For example, if maxSequenceCharacters is set to ‘3’, the Orchestrator rejects any password in which 4 or more characters are sequential, like “Password1234”. The default value of -1 signifies that this feature is not enabled.
  • disallowUsernameCharacters: Password must not match a configurable portion of the user's ID. For example, if disallowUsernameCharacters is set to 5 and a user attempts to configure a new password that includes any five-character string matching a section of that user’s username (such as ‘usern’ or ‘serna’), the Orchestrator rejects the new password. The default value of -1 signifies that this feature is not enabled.
  • variationValidationCharacters: New password must vary from the old password by a configurable number of characters. The Orchestrator uses the Levenshtein distance between two words to determine the variation between the new and old password. The Levenshtein distance is the minimum number of single-character edits (insertions, deletions, or substitutions) required to change one word into another.
  • If variationValidationCharacters is set to 4, then the Levenshtein distance between the new and old password must be 4 or greater. In other words, the new password must have 4 or more variations from the old password. For example, if the old password used was "kitten" and the new password is "sitting", the Levenshtein distance for these is 3, since it requires only three edits to change kitten into sitting:
    • kitten → sitten (substitution of "s" for "k")
    • sitten → sittin (substitution of "i" for "e")
    • sittin → sitting (insertion of "g" at the end).

Since the new password only varies by 3 characters from the old, “sitting” would be rejected as a new password to replace “kitten”. The default value of -1 signifies that this feature is not enabled.

expiry:
  • enable: Set this to true to enable automatic expiry of Operator user passwords.
  • days: Enter the number of days that an Operator password may be used before forced expiration.
history:
  • enable: Set this to true to enable recording of Operator users' previous passwords.
  • count: Enter the number of previous passwords to be saved in the history. When an Operator user tries to change the password, the system does not allow the user to enter a password that is already saved in the history.
operator.user.lockout.defaultAttempts Number of times the Operator user can attempt to log in. If the login fails for the specified number of times, the account is locked.
operator.user.lockout.defaultDurationSeconds Duration of time, in seconds, in which an Operator user account is locked.

For example, if set to 300, the Operator user account will get locked if four incorrect login attempts are made within 300 seconds. If set to 60, the Operator user account will get locked if four incorrect attempts are made within one minute.

Note: The number of attempts is configurable via the operator.user.lockout.defaultAttempts system property.
operator.user.lockout.enabled Activates or deactivates the lockout option for the Operator login failures.
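The maxRepeatingCharacters, maxSequenceCharacters, and disallowUsernameCharacters rules described for the two passwordPolicy properties can be approximated offline with the following Python sketch. The function names are hypothetical and the behavior is an interpretation of the descriptions above, not the Orchestrator's code. In particular, it assumes that "sequential" means ascending consecutive characters (such as 1234 or abcd) and that the username comparison ignores case.

```python
def has_repeat_run(password: str, max_repeating: int) -> bool:
    """True if one character repeats more than max_repeating times in a row."""
    if max_repeating < 0:  # -1 means the check is not enabled
        return False
    run, previous = 0, None
    for char in password:
        run = run + 1 if char == previous else 1
        if run > max_repeating:
            return True
        previous = char
    return False


def has_sequence_run(password: str, max_sequence: int) -> bool:
    """True if more than max_sequence characters in a row are ascending
    and consecutive (assumption: 'sequential' means code point + 1)."""
    if max_sequence < 0:
        return False
    run, previous = 0, None
    for char in password:
        consecutive = previous is not None and ord(char) == ord(previous) + 1
        run = run + 1 if consecutive else 1
        if run > max_sequence:
            return True
        previous = char
    return False


def contains_username_fragment(password: str, username: str, length: int) -> bool:
    """True if any `length`-character slice of the username occurs in the
    password (assumption: compared case-insensitively)."""
    if length <= 0:
        return False
    lowered_password, lowered_user = password.lower(), username.lower()
    return any(
        lowered_user[i:i + length] in lowered_password
        for i in range(len(lowered_user) - length + 1)
    )
```

With the example values from the descriptions, has_repeat_run("Passwordaaa", 2) and has_sequence_run("Password1234", 3) both report a violation.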
Table 21. Rate Limiting APIs
System Property Description
vco.api.rateLimit.enabled Allows Operator Super users to activate or deactivate the rate limiting feature at the system level. By default, the value is False.
Note: The rate-limiter is not enabled in earnest, that is, it will not reject API requests that exceed the configured limits, unless the vco.api.rateLimit.mode.logOnly setting is deactivated.
vco.api.rateLimit.mode.logOnly Allows an Operator Super user to run the rate limiter in LOG_ONLY mode. When the value is set to True and a rate limit is exceeded, the Orchestrator only logs the error and fires the respective metrics, allowing clients to make requests without rate limiting.

When the value is set to False, API requests are restricted according to the defined policies and HTTP 429 is returned.

vco.api.rateLimit.rules.global Defines a set of globally applicable policies used by the rate-limiter, as a JSON array. By default, the value is an empty array.

Each type of user (Operator, Partner, and Customer) can make up to 500 requests for every 5 seconds. The number of requests is subject to change based on the behavior pattern of the rate limited requests.

The JSON array consists of the following parameters:

Types: The type objects represent different contexts in which the rate limits are applied. The following are the different type objects that are available:
  • SYSTEM: Specifies a global limit shared by all the users.
  • OPERATOR_USER: A limit that can be set in general for all the Operator users.
  • ENTERPRISE_USER: A limit that can be set in general for all the Enterprise users.
  • MSP_USER: A limit that can be set in general for all the MSP users.
  • ENTERPRISE: A limit that can be shared between all users of an Enterprise and is applicable to all the Enterprises in the network.
  • PROXY: A limit that can be shared between all users of a Proxy and is applicable to all proxies.
Policies: Add rules to the policies to apply to the requests that match, by configuring the following parameters:
  • Match: Enter the type of requests to be matched:
    • ALL: Rate-limit all requests matching one of the type objects.
    • METHOD: Rate-limit all requests matching the specified method name.
    • METHOD_PREFIX: Rate-limit all requests matching the specified method group.
  • Rules: Enter the values for the following parameters:
    • maxConcurrent: Number of jobs that can be performed at the same time.
    • reservoir: Number of jobs that can be performed before the limiter stops performing jobs.
    • reservoirRefreshAmount: Value to set the reservoir to when reservoirRefreshInterval is in use.
    • reservoirRefreshInterval: Interval, in milliseconds, at which the reservoir value is automatically reset to the value of reservoirRefreshAmount. The reservoirRefreshInterval value should be a multiple of 250 (5000 for Clustering).

Enabled: Each type limit can be activated or deactivated by including the enabled key in APIRateLimiterTypeObject. By default, the value of enabled is True, even if the key is not included. You need to include "enabled": false key to deactivate the individual type limits.

The following example shows a sample JSON file with default values:

[
  {
    "type": "OPERATOR_USER",
    "policies": [
      {
        "match": {
          "type": "ALL"
        },
        "rules": {
          "reservoir": 500,
          "reservoirRefreshAmount": 500,
          "reservoirRefreshInterval": 5000
        }
      }
    ]
  },
  {
    "type": "MSP_USER",
    "policies": [
      {
        "match": {
          "type": "ALL"
        },
        "rules": {
          "reservoir": 500,
          "reservoirRefreshAmount": 500,
          "reservoirRefreshInterval": 5000
        }
      }
    ]
  },
  {
    "type": "ENTERPRISE_USER",
    "policies": [
      {
        "match": {
          "type": "ALL"
        },
        "rules": {
          "reservoir": 500,
          "reservoirRefreshAmount": 500,
          "reservoirRefreshInterval": 5000
        }
      }
    ]
  }
]
Note: It is recommended not to change the default values of the configuration parameters.
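Before a rules array is pasted into vco.api.rateLimit.rules.global, it can be sanity-checked offline. The following Python sketch checks only what the parameter description states: the known type names, the known match types, and the rule that reservoirRefreshInterval should be a multiple of 250 (5000 for Clustering). The function names check_rules and type_is_enabled are hypothetical, and the Orchestrator may apply further validation that is not covered here.

```python
import json

# Names as they appear in the parameter list and the sample JSON.
KNOWN_TYPES = {"SYSTEM", "OPERATOR_USER", "ENTERPRISE_USER",
               "MSP_USER", "ENTERPRISE", "PROXY"}
KNOWN_MATCHES = {"ALL", "METHOD", "METHOD_PREFIX"}


def check_rules(raw_json: str, clustered: bool = False) -> list:
    """Return a list of human-readable problems found in a rules array."""
    problems = []
    # The interval should be a multiple of 250 ms (5000 ms for Clustering).
    step = 5000 if clustered else 250
    for index, entry in enumerate(json.loads(raw_json)):
        if entry.get("type") not in KNOWN_TYPES:
            problems.append(f"entry {index}: unknown type {entry.get('type')!r}")
        for policy in entry.get("policies", []):
            match_type = policy.get("match", {}).get("type")
            if match_type not in KNOWN_MATCHES:
                problems.append(f"entry {index}: unknown match type {match_type!r}")
            interval = policy.get("rules", {}).get("reservoirRefreshInterval")
            if interval is not None and interval % step != 0:
                problems.append(
                    f"entry {index}: reservoirRefreshInterval {interval} "
                    f"is not a multiple of {step}"
                )
    return problems


def type_is_enabled(entry: dict) -> bool:
    """A type limit counts as enabled unless it carries "enabled": false."""
    return entry.get("enabled", True)
```

An empty list from check_rules means that none of these basic checks failed.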
vco.api.rateLimit.rules.enterprise.default Comprises the default set of Enterprise-specific policies applied to newly created Customers. The Customer-specific properties are stored in the Enterprise property vco.api.rateLimit.rules.enterprise.
vco.api.rateLimit.rules.enterpriseProxy.default Comprises the default set of Enterprise-specific policies applied to newly created Partners. The Partner-specific properties are stored in the Enterprise proxy property vco.api.rateLimit.rules.enterpriseProxy.
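To make the interaction of the rate-limiting settings in this table easier to picture, the following Python sketch simulates a single reservoir (500 requests, reset to 500 every 5000 milliseconds, as in the sample rules) together with the vco.api.rateLimit.enabled and vco.api.rateLimit.mode.logOnly switches. It is a simplified model under stated assumptions, not the Orchestrator's limiter: the class, the function, and the result labels are hypothetical, maxConcurrent is ignored, and time is passed in explicitly so the behavior is easy to follow.

```python
class Reservoir:
    """Counts down from `reservoir` and is reset to `refresh_amount`
    once every `refresh_interval_ms` milliseconds."""

    def __init__(self, reservoir: int, refresh_amount: int, refresh_interval_ms: int):
        self.remaining = reservoir
        self.refresh_amount = refresh_amount
        self.refresh_interval_ms = refresh_interval_ms
        self.last_refresh_ms = 0

    def try_request(self, now_ms: int) -> bool:
        """Consume one unit if available; return False when the reservoir is empty."""
        elapsed = now_ms - self.last_refresh_ms
        if elapsed >= self.refresh_interval_ms:
            # Skip ahead by whole intervals and reset the reservoir.
            intervals = elapsed // self.refresh_interval_ms
            self.last_refresh_ms += intervals * self.refresh_interval_ms
            self.remaining = self.refresh_amount
        if self.remaining > 0:
            self.remaining -= 1
            return True
        return False


def handle_request(bucket: Reservoir, now_ms: int, enabled: bool, log_only: bool) -> str:
    """Outcome of one API request under the three settings (labels are illustrative)."""
    if not enabled:
        return "ALLOWED"          # rate limiting is switched off
    if bucket.try_request(now_ms):
        return "ALLOWED"          # still within the reservoir
    if log_only:
        return "ALLOWED_LOGGED"   # limit exceeded, but only logged
    return "REJECTED_429"         # limit exceeded and enforced
```

In this model, the 501st request inside a 5-second interval is the first one that is logged or rejected, and the count starts again once the interval has elapsed.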
Table 22. Remote Diagnostics
System Property Description
network.public.address Specifies the browser origin address/DNS hostname that is used to access the Orchestrator UI.
network.portal.websocket.address Sets an alternate DNS hostname/address for accessing the Orchestrator UI from a browser, if the browser address is not the same as the value of the network.public.address system property.

As remote diagnostics now uses a WebSocket connection, to ensure web security, the browser origin address that is used to access the Orchestrator UI is validated for incoming requests. In most cases, this address is the same as the network.public.address system property. In rare scenarios, the Orchestrator UI can be accessed using another DNS hostname/address that is different from the value set in the network.public.address system property. In such cases, you can set this system property to the alternate DNS hostname/address. By default, this value is not set.

session.options.websocket.portal.idle.timeout Sets the total amount of time (in seconds) that the browser WebSocket connection stays active in an idle state. By default, the browser WebSocket connection is active for 300 seconds in an idle state.
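The origin validation described for network.public.address and network.portal.websocket.address can be pictured with the following Python sketch. It is only an illustration of the idea: the function name and the example hostnames in the usage note are hypothetical, and the sketch assumes that the check compares the hostname of the browser's Origin header against the configured values. The actual validation performed by the Orchestrator may differ.

```python
from typing import Optional
from urllib.parse import urlsplit


def origin_allowed(origin_header: str,
                   public_address: str,
                   websocket_address: Optional[str] = None) -> bool:
    """Accept a WebSocket request only when the Origin hostname matches
    network.public.address or, if set, network.portal.websocket.address."""
    host = urlsplit(origin_header).hostname  # None when no hostname is present
    if host is None:
        return False
    allowed = {public_address.lower()}
    if websocket_address:  # not set by default
        allowed.add(websocket_address.lower())
    return host.lower() in allowed
```

For example, with a public address of vco.example.com and no alternate address, an Origin of https://alt.example.net is refused. Setting the alternate address to alt.example.net allows it.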
Table 23. Security Service Edge
System Property Description
session.options.enableSseService Activates or deactivates the Security Service Edge (SSE) feature for Enterprise users.
Table 24. Segmentation
System Property Description
enterprise.capability.enableSegmentation Activates or deactivates the segmentation capability for Enterprise users.
enterprise.segments.system.maximum Specifies the maximum number of segments allowed for any Enterprise user. Ensure that you change the value of this system property to 128 if you want to enable 128 segments on Orchestrator for an Enterprise user.
enterprise.segments.maximum Specifies the default value for the maximum number of segments allowed for a new or existing Enterprise user. The default value for any Enterprise user is 16.
Note: This value must be less than or equal to the number defined in the system property, enterprise.segments.system.maximum.
It is not recommended for you to change the value of this system property if you want to enable 128 segments for an Enterprise user. Instead, you can enable Customer Capabilities in the Customer Configuration page to configure the required number of segments.
enterprise.subinterfaces.maximum Specifies the maximum number of sub-interfaces that can be configured for an Enterprise user. The default value is 32.
enterprise.vlans.maximum Specifies the maximum number of VLANs that can be configured for an Enterprise user. The default value is 32.
session.options.enableAsyncAPI When the segment scale is increased to 128 segments for any Enterprise user, to prevent UI timeouts, you can enable Async APIs support on the UI by using this system property. The default value is true.
session.options.asyncPollingMilliSeconds Specifies the Polling interval for Async APIs on the UI. The default value is 5000 milliseconds.
session.options.asyncPollingMaxCount Specifies the maximum number of calls to get Status API from the UI. The default value is 10.
vco.enterprise.events.configuration.diff.enable Activates or deactivates configuration diff event logging. Whenever the number of segments for an Enterprise user is greater than 4, the configuration diff event logging will be deactivated. You can enable configuration diff event logging using this system property.
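The two Async API polling properties in this table bound how long the UI waits for a result: with the defaults, it makes at most 10 status calls spaced 5000 milliseconds apart. The following Python sketch shows that pattern in general terms. The function name and the "PENDING" status value are hypothetical, and get_status stands in for whatever call returns the job status.

```python
import time
from typing import Callable, Optional


def poll_until_done(get_status: Callable[[], str],
                    interval_ms: int = 5000,   # session.options.asyncPollingMilliSeconds
                    max_count: int = 10,       # session.options.asyncPollingMaxCount
                    sleep: Callable[[float], None] = time.sleep) -> Optional[str]:
    """Call get_status up to max_count times, waiting interval_ms between calls.

    Returns the first status other than "PENDING", or None if every
    call still reported "PENDING".
    """
    for attempt in range(max_count):
        status = get_status()
        if status != "PENDING":
            return status
        if attempt < max_count - 1:  # no need to wait after the final call
            sleep(interval_ms / 1000)
    return None
```

With the default values, this loop gives up after 10 calls and 9 waits, that is, after about 45 seconds plus the time taken by the calls themselves.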
Table 25. Self-service Password Reset
System Property Description
vco.enterprise.resetPassword.twoFactor.mode Defines the mode for the second level for password reset authentication, for all the Enterprise users. Currently, only the SMS mode is supported.
vco.enterprise.resetPassword.twoFactor.required Activates or deactivates the two-factor authentication for password reset of Enterprise users.
vco.enterprise.selfResetPassword.enabled Activates or deactivates self-service password reset for Enterprise users.
vco.enterprise.selfResetPassword.token.expirySeconds Duration of time, after which the self-service password reset link for an Enterprise user expires.
vco.operator.resetPassword.twoFactor.required Activates or deactivates the two-factor authentication for password reset of Operator users.
vco.operator.selfResetPassword.enabled Activates or deactivates self-service password reset for Operator users.
vco.operator.selfResetPassword.token.expirySeconds Duration of time, after which the self-service password reset link for an Operator user expires.
Table 26. Syslog Forwarding
System Property Description
log.syslog.backend Backend service syslog integration configuration.
log.syslog.portal Portal service syslog integration configuration.
log.syslog.upload Upload service syslog integration configuration.
log.syslog.lastFetchedCRL.backend Stores the last updated CRL as a PEM-formatted string for service syslog. The value is updated regularly.
log.syslog.lastFetchedCRL.portal Stores the last updated CRL as a PEM-formatted string for service syslog. The value is updated regularly.
log.syslog.lastFetchedCRL.upload Stores the last updated CRL as a PEM-formatted string for service syslog. The value is updated regularly.
Table 27. TACACS Services
System Property Description
session.options.enableTACACS Activates or deactivates the TACACS services for Enterprise users.
Table 28. Two-factor Authentication
System Property Description
vco.enterprise.authentication.twoFactor.enable Activates or deactivates the two-factor authentication for Enterprise users.
vco.enterprise.authentication.twoFactor.mode Defines the mode for the second level authentication for Enterprise users. Currently, only SMS is supported as the second level authentication mode.
vco.enterprise.authentication.twoFactor.require Defines the two-factor authentication as mandatory for Enterprise users.
vco.operator.authentication.twoFactor.enable Activates or deactivates the two-factor authentication for Operator users.
vco.operator.authentication.twoFactor.mode Defines the mode for the second level authentication for Operator users. Currently, only SMS is supported as the second level authentication mode.
vco.operator.authentication.twoFactor.require Defines the two-factor authentication as mandatory for Operator users.
Table 29. Tunnel Parameters for Edges
System Property Description
session.options.enableNsdPkiIPv6Config Activates Certificate Authentication mode and IPv6 Local Identification Type.
Table 30. VNF Configuration
System Property Description
edge.vnf.extraImageInfos Defines the properties of a VNF Image.
You can enter the following information for a VNF Image, in JSON format in the Value field:
[
  {
    "vendor": "Vendor Name",
    "version": "VNF Image Version",
    "checksum": "VNF Checksum Value",
    "checksumType": "VNF Checksum Type"
  }
]
Example of a JSON file for Check Point Firewall Image:
[
  {
    "vendor": "checkPoint",
    "version": "r80.40_no_workaround_46",
    "checksum": "bc9b06376cdbf210cad8202d728f1602b79cfd7d",
    "checksumType": "sha-1"
  }
]
Example of a JSON file for Fortinet Firewall Image:
[
  {
    "vendor": "fortinet",
    "version": "624",
    "checksum": "6d9e2939b8a4a02de499528c745d76bf75f9821f",
    "checksumType": "sha-1"
  }
]
edge.vnf.metric.record.limit Defines the number of records to be stored in the database.
enterprise.capability.edgeVnfs.enable Allows VNF deployment on supported Edge models.
enterprise.capability.edgeVnfs.securityVnf.checkPoint Activates Check Point Networks Firewall VNF.
enterprise.capability.edgeVnfs.securityVnf.fortinet Activates Fortinet Networks Firewall VNF.
enterprise.capability.edgeVnfs.securityVnf.paloAlto Activates Palo Alto Networks Firewall VNF.
session.options.enableVnf Activates VNF feature.
vco.operator.alert.edgeVnfEvent.enable Activates or deactivates Operator alerts for Edge VNF events globally.
vco.operator.alert.edgeVnfInsertionEvent.enable Activates or deactivates Operator alerts for Edge VNF Insertion events globally.
edge.vnf.extraImageInfos. Allows selection of the Check Point VNF image.
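
Before pasting a value into the edge.vnf.extraImageInfos property, it can help to check that the JSON is well formed and carries the four fields shown in the examples above. The following Python sketch is illustrative only and is not part of the Orchestrator. The function name validate_vnf_image_infos is hypothetical, and the 40-character check assumes that a sha-1 checksum is written as a hexadecimal string, as in the examples.

```python
from __future__ import annotations

import json
import re

# Field names taken from the JSON template for edge.vnf.extraImageInfos.
REQUIRED_KEYS = ("vendor", "version", "checksum", "checksumType")


def validate_vnf_image_infos(value: str) -> list[str]:
    """Return a list of problems found in the JSON value (empty if none)."""
    try:
        entries = json.loads(value)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc}"]
    if not isinstance(entries, list):
        return ["top-level value must be a JSON array"]

    problems = []
    for i, entry in enumerate(entries):
        if not isinstance(entry, dict):
            problems.append(f"entry {i}: must be a JSON object")
            continue
        for key in REQUIRED_KEYS:
            if not isinstance(entry.get(key), str) or not entry[key]:
                problems.append(f"entry {i}: missing or empty '{key}'")
        # Assumption: a sha-1 checksum is 40 hexadecimal characters.
        checksum = entry.get("checksum")
        if entry.get("checksumType") == "sha-1" and isinstance(checksum, str) and checksum:
            if not re.fullmatch(r"[0-9a-fA-F]{40}", checksum):
                problems.append(f"entry {i}: checksum is not 40 hex characters")
    return problems
```

An empty list means that the value has the expected shape. It does not confirm that the checksum matches the actual VNF image.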
Table 31. VPN
System Property Description
vpn.disconnect.wait.sec The time interval for the system to wait before disconnecting a VPN tunnel.
vpn.reconnect.wait.sec The time interval for the system to wait before reconnecting a VPN tunnel.
Table 32. Warning Banner
System Property Description
login.warning.banner.message This optional system property allows the Operator to configure and display a Security Administrator-specified advisory notice and consent warning message regarding the use of Orchestrator. The warning message is displayed in the Orchestrator prior to user login.

For instructions about how to configure this system property, see the topic Configure Advisory Notice and Consent Warning Message for SD-WAN Orchestrator.

Table 33. Zscaler
System Property Description
session.options.enableZscalerProfileAutomation Enables configuration of Zscaler settings at the Profile level.

Configure Orchestrator Disaster Recovery

This section provides disaster recovery (DR) instructions for Orchestrator.

Orchestrator Disaster Recovery Overview

The Orchestrator Disaster Recovery (DR) feature prevents the loss of stored data and resumes Orchestrator services in the event of system or network failure.

Orchestrator DR involves setting up an active/standby Orchestrator pair with data replication and a manually-triggered failover mechanism.
  • The recovery time objective (RTO), therefore, is dependent on explicit action by the operator to trigger promotion of the standby.
  • The recovery point objective (RPO), however, is essentially zero, regardless of the recovery time, because all configuration is instantaneously replicated. Monitoring data that would have been collected during the outage is cached on the Edges and Gateways pending promotion of the standby.
Note: DR is mandatory. For licensing and pricing, contact the Arista sales team for support.

Active/Standby Pair

In an Orchestrator DR deployment, two identical Orchestrator systems are configured as an active/standby pair. The operator can view the state of DR readiness through the web UI on either of the servers. Edges and Gateways are aware of both Orchestrators, and while they receive configuration changes only from the active Orchestrator, they periodically send DR heartbeats to both systems to report their view of both servers and to query the DR system status. When the operator triggers a failover, the Edges and Gateways are informed of the change in their next DR heartbeat.

DR States

From the view of an operator, and of the Edges and Gateways, an Orchestrator has one of four DR states:

Table 34. DR State Descriptions
DR State Description
Standalone No DR configured.
Active DR configured, acting as the primary Orchestrator server.
Standby DR configured, acting as an inactive replica Orchestrator server.
Zombie DR formerly configured and active but no longer acting as the active or standby.

Run-time Operation

When DR is configured, the standby server runs in a limited mode, blocking all API calls except those related to the DR status and the DR heartbeats. When the operator invokes a failover, the standby is promoted to become fully operational as a Standalone server. The server that was formerly active is automatically transitioned to a Zombie state if it is responsive and visible from the promoted standby. In the Zombie state, management configuration services are blocked, and any contact from Edges and Gateways that have not transitioned to the new active Orchestrator is redirected to the promoted server.
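
The failover behavior described above can be summarized as a small state model. The following Python sketch is a simplified illustration and does not represent the Orchestrator implementation. The names DRState and promote_standby are hypothetical.

```python
from enum import Enum


class DRState(Enum):
    """The four DR states listed in the DR State Descriptions table."""
    STANDALONE = "Standalone"
    ACTIVE = "Active"
    STANDBY = "Standby"
    ZOMBIE = "Zombie"


def promote_standby(standby: DRState, active: DRState, active_reachable: bool):
    """Return the (promoted, former_active) states after a failover."""
    if standby is not DRState.STANDBY:
        raise ValueError("only a Standby Orchestrator can be promoted")
    # The promoted standby becomes fully operational as a Standalone server.
    promoted = DRState.STANDALONE
    # The former active becomes a Zombie only if the promoted standby can
    # reach it. Otherwise its state is unchanged in this model.
    former_active = DRState.ZOMBIE if active_reachable else active
    return promoted, former_active
```

For example, promote_standby(DRState.STANDBY, DRState.ACTIVE, True) returns the pair Standalone and Zombie. With active_reachable set to False, the former active server keeps its Active state in this model.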

Figure 5. Example Topology

Set Up Orchestrator Replication

Two installed Orchestrator instances are required to initiate replication.
  • The selected standby is put into a STANDBY_CANDIDATE state, enabling it to be configured by the active server.
  • The active server is then given the address and credentials of the standby and it enters the ACTIVE_CONFIGURING state.

When a STANDBY_CONFIG_RQST is created from Active to Standby, the two servers synchronize through the state transitions.

The two Orchestrators on which Disaster Recovery (DR) will be established must have the same time. Before you initiate Orchestrator replication, check the following NTP configurations:
  • The Gateway time zone must be set to Etc/UTC. Use the following command to view the NTP time zone.
    vcadmin@vcg1-example:~$ cat /etc/timezone
    Etc/UTC
    vcadmin@vcg1-example:~$

    If the time zone is incorrect, use the following commands to update the time zone.

    echo "Etc/UTC" | sudo tee /etc/timezone
    sudo dpkg-reconfigure --frontend noninteractive tzdata
  • The NTP offset must be less than or equal to 15 milliseconds. Use the following command to view the NTP offset.
    vcadmin@vcg1-example:~$ sudo ntpq -p
         remote           refid      st t when poll reach   delay   offset  jitter
    ==============================================================================
    *ntp1-us1.prod.v 74.120.81.219    3 u  474 1024  377   10.171   -1.183   1.033
     ntp1-eu1-old.pr .INIT.          16 u    - 1024    0    0.000    0.000   0.000
    vcadmin@vcg1-example:~$

    If the offset is incorrect, use the following commands to update the NTP offset.

    sudo systemctl stop ntp
    sudo ntpdate <server>
    sudo systemctl start ntp
  • By default, a list of NTP servers is configured in the /etc/ntpd.conf file. The Orchestrators on which DR is to be established must have Internet access to reach the default NTP servers and keep the time in sync on both Orchestrators. Customers can also use a local NTP server running in their environment to sync time.
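
The offset check above can also be scripted. The following Python sketch parses text in the format printed by ntpq -p and compares the offset of the peer marked with an asterisk (the peer currently selected for synchronization) against the 15-millisecond limit. It is illustrative only. The function names are hypothetical, and the sketch assumes the ten-column layout shown in the sample output above.

```python
MAX_OFFSET_MS = 15.0  # limit stated in the NTP offset requirement


def selected_peer_offset_ms(ntpq_output: str):
    """Return the offset (ms) of the peer marked '*', or None if there is none."""
    for line in ntpq_output.splitlines():
        line = line.strip()
        if line.startswith("*"):
            fields = line.split()
            # Columns: remote refid st t when poll reach delay offset jitter
            return float(fields[-2])
    return None


def offset_within_limit(ntpq_output: str, limit_ms: float = MAX_OFFSET_MS) -> bool:
    """True if a selected peer exists and its absolute offset is within the limit."""
    offset = selected_peer_offset_ms(ntpq_output)
    return offset is not None and abs(offset) <= limit_ms


# Sample copied from the ntpq -p output shown in this section.
SAMPLE = """\
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
*ntp1-us1.prod.v 74.120.81.219    3 u  474 1024  377   10.171   -1.183   1.033
 ntp1-eu1-old.pr .INIT.          16 u    - 1024    0    0.000    0.000   0.000
"""

print(offset_within_limit(SAMPLE))
```

The sketch treats the absence of a selected peer as a failure, because in that case no offset can be confirmed.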
Note: Before you set up your Standby Orchestrator to begin the Replication process, you must enable the network.public.address system property.

Set Up the Standby Orchestrator

To set up Orchestrator replication, perform the following steps:

  1. Select Replication from the Navigation panel to display the Orchestrator Replication screen.
    Figure 6. Orchestrator Replication
  2. Enable the Standby Orchestrator by selecting Standby (Replication Role).
    Figure 7. Enabling Standby Orchestrator
  3. Select Enable for Standby. The Prepare this Orchestrator for Standby Role dialog displays.
    Figure 8. Preparing Orchestrator for Standby Role
  4. Select Enable for Standby again. The Orchestrator Success message displays across the top of the screen indicating that the Orchestrator has been enabled for Standby, and that the Orchestrator restarts in Standby mode.
  5. Select OK.
    Figure 9. Configuring Standby Orchestrator

    After configuring the Standby Orchestrator for replication, configure the Active Orchestrator.

Set Up the Active Orchestrator

Configure the second Orchestrator to be the Active Orchestrator:

  1. Select Replication from the Navigation panel. The Orchestrator Replication screen appears.
  2. Choose the Active Replication Role.
  3. Type in the Standby Orchestrator Address and the Standby Orchestrator UUID. The Orchestrator address and UUID are displayed on the Standby Orchestrator screen.
    Figure 10. Active Orchestrator Replication
  4. Enter the username and password for the Orchestrator Superuser to be used for replication.
    Note:
    • This Superuser should already exist on both systems.
    • Starting from the 4.5 release, the use of the special character "<" in the password is no longer supported. In cases where users have already used "<" in their passwords in previous releases, they must remove it to save any changes on the page.
  5. Select Make Active. The Active Orchestrator screen displays showing a status of the current state.
    Figure 11. Active Orchestrator Replication Settings

    When configuration is complete, both Orchestrators (Standby and Active) will be in sync.

Standby Orchestrator in Sync
Figure 12. Standby Orchestrator

You can select the toggle history link to view the status of each state.

Figure 13. Standby Orchestrator Status
Active Orchestrator in Sync
Figure 14. Synchronizing Orchestrator

Test Failover

The following testing failover scenarios are forced failovers for example purposes. You can perform these actions in the Available Actions area of the Active and Standby screens.

Promote a Standby Orchestrator

This section discusses how to promote a Standby Orchestrator.

To promote a Standby Orchestrator, perform the following steps:

  1. Select the unlock link.
  2. Select the Promote Standby button in the Available Actions area on the Standby Orchestrator screen.
    Figure 15. Available Actions

    The following dialog box appears, indicating that when you promote your Standby Orchestrator, administrators will no longer be able to manage the Orchestrator using the previously Active Orchestrator.

    Figure 16. Promoting the Orchestrator
  3. Select OK to promote the Standby Orchestrator. Another message dialog box appears to verify your request to promote the Standby Orchestrator. This message appears only if the Standby Orchestrator perceives the Active Orchestrator to be in good health, meaning the Standby is communicating with the Active and duplicating data.
  4. Select OK to promote the Orchestrator.
    Figure 17. Promoting to Standby

    A final dialog box appears indicating that the Orchestrator is no longer a Standby and will restart in Standalone mode.

    Figure 18. Removing the Standby Orchestrator

    When you promote a Standby Orchestrator, it restarts in Standalone mode.

    If the Standby can communicate with the formerly Active Orchestrator, it instructs the Orchestrator to enter a Zombie state. In Zombie state, the Orchestrator communicates with its clients (edges, gateways, UI/API) that it is no longer active, and that they must communicate with the newly promoted Orchestrator. If the promoted Standby cannot communicate with the formerly Active Orchestrator, the operator should, if possible, manually demote the formerly Active Orchestrator.

    Figure 19. Quiesced Orchestrator

Return to Standalone Mode

To return the Zombie to standalone mode, select Return to Standalone Mode in the Available Actions area on the Active Orchestrator or Standby Orchestrator screens.

Figure 20. Available Actions
Note: The Orchestrator can be returned to the Standalone mode from the Zombie state after the time specified in the system property vco.disasterRecovery.zombie.expirySeconds which defaults to 1800 seconds.

Troubleshooting Orchestrator DR

This section describes the failure states of the system. These are also listed in the UI, along with a more detailed description of the failure. Additional information is available in the VeloCloud log.

Recoverable Failures

The following errors are recoverable failures that can occur after Orchestrator DR reaches an in sync state. If the problem causing these failures is corrected, Orchestrator DR automatically returns to normal operation.
  • FAILURE_SYNCING_FILES
  • FAILURE_GET_STANDBY_STATUS
  • FAILURE_MYSQL_ACTIVE_STATUS
  • FAILURE_MYSQL_STANDBY_STATUS

Unrecoverable Failures

The following failures can occur during configuration of the Orchestrator DR. Orchestrator DR does not automatically recover from these failures.
  • FAILURE_ACTIVE_CONFIGURING
  • FAILURE_LAUNCHING_STANDBY
  • FAILURE_STANDBY_CONFIGURING
  • FAILURE_COPYING_DB
  • FAILURE_COPYING_FILES
  • FAILURE_SYNC_CONFIGURING
  • FAILURE_GET_STANDBY_CONFIG
  • FAILURE_STANDBY_CANDIDATE
  • FAILURE_STANDBY_UNCONFIG
  • FAILURE_STANDBY_PROMOTION
  • FAILURE_ACTIVE_DEMOTION

Replication

The Orchestrator Disaster Recovery (DR) feature prevents the loss of stored data and resumes Orchestrator services in the event of system or network failure.

Orchestrator DR involves setting up an active/standby Orchestrator pair with data replication and a manually-triggered failover mechanism.
  • The Recovery Time Objective (RTO), therefore, is dependent on explicit action by the operator to trigger promotion of the standby.
  • The Recovery Point Objective (RPO), however, is essentially zero, regardless of the recovery time, because all configuration is instantaneously replicated. Monitoring data that would have been collected during the outage is cached on the Edges and Gateways pending promotion of the standby.
Note: DR is mandatory. For licensing and pricing, contact the Arista sales team for support.

Active/Standby Pair

In an Orchestrator DR deployment, two identical Orchestrator systems are configured as an active/standby pair. The operator can view the state of DR readiness through the web UI on either of the servers. Edges and Gateways are aware of both Orchestrators, and while they receive configuration changes only from the active Orchestrator, they periodically send DR heartbeats to both systems to report their view of both servers and to query the DR system status. When the operator triggers a failover, the Edges and Gateways are informed of the change in their next DR heartbeat.

DR States

From the view of an operator, and of the Edges and Gateways, an Orchestrator has one of the following four DR states:

Table 35. DR State Descriptions
DR State Description
Standalone No DR configured.
Active DR configured, acting as the primary Orchestrator server.
Standby DR configured, acting as an inactive replica Orchestrator server.
Zombie DR formerly configured and active but no longer acting as the active or standby.

Run-time Operation

When DR is configured, the standby server runs in a limited mode, blocking all API calls except those related to the DR status and the DR heartbeats. When the operator invokes a failover, the standby is promoted to become fully operational as a Standalone server. The server that was formerly active is automatically transitioned to a Zombie state if it is responsive and visible from the promoted standby. In the Zombie state, management configuration services are blocked, and any contact from Edges and Gateways that have not transitioned to the new active Orchestrator is redirected to the promoted server.

Figure 21. Run-time Operation

Set Up Orchestrator Replication

Two installed Orchestrator instances are required to initiate replication.
  • The selected standby is put into a STANDBY_CANDIDATE state, enabling it to be configured by the active server.
  • The active server is then given the address and credentials of the standby and it enters the ACTIVE_CONFIGURING state.

When a STANDBY_CONFIG_RQST is made from active to standby, the two servers synchronize through the state transitions.

The two Orchestrators on which Disaster Recovery (DR) will be established must have the same time. Before you initiate Orchestrator replication, check the following NTP configurations:
  • The Gateway time zone must be set to Etc/UTC. Use the following command to view the NTP time zone.
    vcadmin@vcg1-example:~$ cat /etc/timezone
    Etc/UTC
    vcadmin@vcg1-example:~$

    If the time zone is incorrect, use the following commands to update the time zone.

    echo "Etc/UTC" | sudo tee /etc/timezone
    sudo dpkg-reconfigure --frontend noninteractive tzdata
  • The NTP offset must be less than or equal to 15 milliseconds. Use the following command to view the NTP offset.
    vcadmin@vcg1-example:~$ sudo ntpq -p
         remote           refid      st t when poll reach   delay   offset  jitter
    ==============================================================================
    *ntp1-us1.prod.v 74.120.81.219    3 u  474 1024  377   10.171   -1.183   1.033
     ntp1-eu1-old.pr .INIT.          16 u    - 1024    0    0.000    0.000   0.000
    vcadmin@vcg1-example:~$

    If the offset is incorrect, use the following commands to update the NTP offset.

    sudo systemctl stop ntp
    sudo ntpdate <server>
    sudo systemctl start ntp
  • By default, a list of NTP servers is configured in the /etc/ntpd.conf file. The Orchestrators on which DR is to be established must have Internet access to reach the default NTP servers and keep the time in sync on both Orchestrators. Customers can also use a local NTP server running in their environment to sync time.

Set Up the Standby Orchestrator

To set up the Standby Orchestrator, perform the following steps:
  1. In the SD-WAN service of the Enterprise Portal, select the Orchestrator tab, and then select Replication in the left pane to display the Orchestrator Replication screen.
  2. Activate the Standby Orchestrator by selecting the Standby (Replication Role) radio button.
  3. Select the Enable for Standby button.
    Figure 22. Standby Orchestrator

    The Standby Orchestrator page appears.

  4. Enter the manual configuration parameters and select the Update configuration info button.

    After the Standby Orchestrator has been configured for replication, configure the Active Orchestrator according to the instructions below.

Set Up the Active Orchestrator

To set up the Active Orchestrator, select the Replication Role as Active and configure the following:

Figure 23. Replication Role as Active
Table 36. Set Up the Active Orchestrator Option Descriptions
Option Description
Select Replication Role Select the Active radio button for the replication role.
Standby Orchestrator Address Enter the primary Standby Orchestrator IP Address.
Standby Orchestrator Address (IPv6) Enter the Standby Orchestrator IPv6 Address.
Standby Orchestrator Secondary Address Enter the address of the standby Orchestrator's secondary interface. This address is used for replication if the standby is promoted to active. You can enter an IPv4 address, an IPv6 address, or an FQDN here.
Standby Orchestrator UUID Enter the UUID of the standby Orchestrator.
Configuration Mode Select the Auto Configure Standby or Manually Configure Standby radio button based on the requirement. When configured manually, paste a string value from ACTIVE VCO to STANDBY_WAIT.
Superuser Username Enter the display name for the Orchestrator Superuser.
Standby Orchestrator Superuser Password Enter the password for the Orchestrator Superuser.
Note: Starting from the 4.5 release, the use of the special character "<" in the password is no longer supported. In cases where users have already used "<" in their passwords in previous releases, they must remove it to save any changes on the page.
  • Select the Enable for Active button to activate the replication role.

When configuration is complete, both Orchestrators (Standby and Active) are in sync.

Standby Orchestrator in Sync

Figure 24. Standby Orchestrator in Sync

Active Orchestrator in Sync

Figure 25. Active Orchestrator in Sync

Test Failover

The following failover test scenarios are forced failovers for example purposes. You can perform these actions in the Available Actions area of the Active and Standby screens.

Promote a Standby Orchestrator

This section discusses how to promote a Standby Orchestrator.

To promote a Standby Orchestrator, perform the following steps:
  1. Select the unlock link.
  2. Select the Promote Standby button in the Available Actions area on the Standby Orchestrator screen.
    Figure 26. Available Actions

    The following dialog box appears, indicating that when you promote your Standby Orchestrator, administrators will no longer be able to manage the Orchestrator using the previously Active Orchestrator.

    Figure 27. Promote Standby
  3. Select the Promote Standby button to promote the Standby Orchestrator.
  4. Select Force Promote Standby to promote the Orchestrator.
    Figure 28. Force Promote Standby

    A final dialog box appears indicating that the Orchestrator is no longer a Standby and restarts in Standalone mode.

    Figure 29. Restart in Standalone Mode Notice

When you promote a Standby Orchestrator, it restarts in Standalone mode.

If the Standby can communicate with the formerly Active Orchestrator, it instructs that Orchestrator to enter a Zombie state. In Zombie state, the Orchestrator communicates with its clients (edges, gateways, UI/API) that it is no longer active, and that they must communicate with the newly promoted Orchestrator. If the promoted Standby cannot communicate with the formerly Active Orchestrator, the operator should, if possible, manually demote the formerly Active Orchestrator.

Figure 30. Zombie State

Return to Standalone Mode

To return the Zombie to standalone mode, select the Return to Standalone Mode button in the Available Actions area on the Active Orchestrator or Standby Orchestrator screens.

Figure 31. Available Actions
Note: The Orchestrator can be returned to the Standalone mode from the Zombie state after the time specified in the system property vco.disasterRecovery.zombie.expirySeconds, which defaults to 1800 seconds.
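
As a worked example of the expiry rule in the note above, the following Python sketch computes the earliest time at which a Zombie can be returned to Standalone mode. It is illustrative only. The function name is hypothetical, and the time at which the Orchestrator entered the Zombie state is assumed to be known to the operator.

```python
from datetime import datetime, timedelta, timezone

# Default value of vco.disasterRecovery.zombie.expirySeconds.
DEFAULT_ZOMBIE_EXPIRY_SECONDS = 1800


def earliest_return_to_standalone(
    zombie_since: datetime,
    expiry_seconds: int = DEFAULT_ZOMBIE_EXPIRY_SECONDS,
) -> datetime:
    """Return the earliest time the Zombie can go back to Standalone mode."""
    return zombie_since + timedelta(seconds=expiry_seconds)


# Hypothetical example: the Orchestrator entered the Zombie state at 12:00 UTC.
entered = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
print(earliest_return_to_standalone(entered))
```

With the default value, the result is 30 minutes after the Orchestrator entered the Zombie state.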

Troubleshooting Orchestrator DR

This section discusses the failure states of the system. These are also listed in the UI, along with a more detailed description of the failure. Additional information is available in the Arista log.

Recoverable Failures

The following errors are recoverable failures that can occur after Orchestrator DR reaches an in sync state. If the problem causing these failures is corrected, Orchestrator DR automatically returns to normal operation.
  • FAILURE_SYNCING_FILES
  • FAILURE_GET_STANDBY_STATUS
  • FAILURE_MYSQL_ACTIVE_STATUS
  • FAILURE_MYSQL_STANDBY_STATUS

Unrecoverable Failures

The following failures can occur during configuration of the Orchestrator DR. Orchestrator DR does not automatically recover from these failures.
  • FAILURE_ACTIVE_CONFIGURING
  • FAILURE_LAUNCHING_STANDBY
  • FAILURE_STANDBY_CONFIGURING
  • FAILURE_COPYING_DB
  • FAILURE_COPYING_FILES
  • FAILURE_SYNC_CONFIGURING
  • FAILURE_GET_STANDBY_CONFIG
  • FAILURE_STANDBY_CANDIDATE
  • FAILURE_STANDBY_UNCONFIG
  • FAILURE_STANDBY_PROMOTION
  • FAILURE_ACTIVE_DEMOTION
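
When DR failure states are handled in monitoring scripts, the two lists above can be encoded as lookup sets. The following Python sketch is illustrative only. The function name recovers_automatically is hypothetical, and how a script obtains the current failure state is outside its scope.

```python
# Failure states that clear on their own once the underlying problem is fixed.
RECOVERABLE = frozenset({
    "FAILURE_SYNCING_FILES",
    "FAILURE_GET_STANDBY_STATUS",
    "FAILURE_MYSQL_ACTIVE_STATUS",
    "FAILURE_MYSQL_STANDBY_STATUS",
})

# Failure states from which Orchestrator DR does not recover automatically.
UNRECOVERABLE = frozenset({
    "FAILURE_ACTIVE_CONFIGURING",
    "FAILURE_LAUNCHING_STANDBY",
    "FAILURE_STANDBY_CONFIGURING",
    "FAILURE_COPYING_DB",
    "FAILURE_COPYING_FILES",
    "FAILURE_SYNC_CONFIGURING",
    "FAILURE_GET_STANDBY_CONFIG",
    "FAILURE_STANDBY_CANDIDATE",
    "FAILURE_STANDBY_UNCONFIG",
    "FAILURE_STANDBY_PROMOTION",
    "FAILURE_ACTIVE_DEMOTION",
})


def recovers_automatically(state: str) -> bool:
    """True if DR returns to normal once the cause of the failure is corrected."""
    if state in RECOVERABLE:
        return True
    if state in UNRECOVERABLE:
        return False
    raise ValueError(f"unknown DR failure state: {state}")
```

A state that is not in either list raises an error rather than being silently classified.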

Upgrade Orchestrator

This section discusses how to upgrade the Orchestrator.

Orchestrator Upgrade Overview

The following steps are required to upgrade an Orchestrator.
  1. Prepare for the Orchestrator Upgrade.
  2. Send Upgrade Announcement.
  3. Proceed with the Orchestrator upgrade.
  4. Complete the Orchestrator Upgrade.

Upgrade an Orchestrator

This section discusses how to upgrade an Orchestrator.

Step 1: Prepare for the Orchestrator Upgrade

Contact Arista Support team to prepare for the Orchestrator upgrade.

To upgrade Orchestrator:

Arista Support assists you with your upgrade. Collect the following information prior to contacting Support.
  • Provide the current and target Orchestrator versions, for example, current version 2.5.2 GA-20180430 and target version 3.3.2 p2.
    Note: The current version can be found in the top-right corner of the Orchestrator by selecting the Help link and choosing About.
  • Provide a screenshot of the replication dashboard of the Orchestrator.
Figure 32. Orchestrator Dashboard
  • Hypervisor type and version (e.g. vSphere 6.7)
  • Commands from the Orchestrator:
    Note: Commands must be run as root (e.g. ‘sudo <command>’ or ‘sudo -i’).
    • Run the script /opt/vc/scripts/vco_upgrade_check.sh to check:
      • LVM layout
      • Memory Information
      • CPU Information
      • Kernel Parameters
      • Some system properties
      • SSH configurations
      • MySQL schema and database sizes
      • File_store locations and sizes
    • Copy of /var/log
      • tar -czf /store/log-`date +%Y%m%d`.tar.gz --newer-mtime="36 hours ago" /var/log
    • From the Standby Orchestrator:
      • sudo mysql --defaults-extra-file=/etc/mysql/velocloud.cnf velocloud -e 'SHOW SLAVE STATUS \G'
    • From the Active Orchestrator:
      • sudo mysql --defaults-extra-file=/etc/mysql/velocloud.cnf velocloud -e 'SHOW MASTER STATUS \G'
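
Before sending the replication status to Support, it can be useful to review it locally. The following Python sketch parses the vertical key and value format that the mysql client prints for SHOW SLAVE STATUS \G and checks the two standard MySQL thread fields, Slave_IO_Running and Slave_SQL_Running. It is illustrative only. The function names are hypothetical, and the sample text is a shortened, made-up excerpt rather than output captured from an Orchestrator.

```python
def parse_vertical_output(text: str) -> dict:
    """Parse 'key: value' lines from mysql vertical output into a dict."""
    fields = {}
    for line in text.splitlines():
        # Skip the '*** 1. row ***' separator and lines without a key.
        if line.lstrip().startswith("*") or ":" not in line:
            continue
        key, _, value = line.partition(":")
        fields[key.strip()] = value.strip()
    return fields


def replication_threads_running(status: dict) -> bool:
    """True if both replication threads report 'Yes'."""
    return (
        status.get("Slave_IO_Running") == "Yes"
        and status.get("Slave_SQL_Running") == "Yes"
    )


# Made-up excerpt for illustration; real output contains many more fields.
SAMPLE = """\
*************************** 1. row ***************************
             Slave_IO_Running: Yes
            Slave_SQL_Running: Yes
        Seconds_Behind_Master: 0
"""

status = parse_vertical_output(SAMPLE)
print(replication_threads_running(status), status["Seconds_Behind_Master"])
```

This check complements, and does not replace, the DR status shown on the Replication screen.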

Step 2: Send Upgrade Announcement

The Upgrade Announcement area enables you to configure and send a message about an upcoming upgrade. This message is displayed to all users the next time they log in to the Orchestrator.

  1. From the Orchestrator, select Orchestrator Upgrade from the navigation panel.
  2. In the Upgrade Announcement area, type in your message in the Banner Message text box.
    Figure 33. Configuring the Upgrade Announcement
  3. Select Announce Orchestrator Upgrade. A popup message appears indicating successful creation of your announcement and your banner message displays at the top of the Orchestrator.
    Figure 34. Removing the Upgrade Announcement

    (Optional) You can remove the announcement from the Orchestrator by selecting Unannounce Orchestrator Upgrade. A message appears indicating you successfully unannounced the Orchestrator upgrade. The announcement displayed at the top of the Orchestrator disappears.

Step 3: Before Proceeding with the Orchestrator Upgrade

This section provides important information to consider prior to upgrading the Orchestrator, as well as how the image-based upgrade works. Contact Arista Support to assist you with the 5.4 to 6.0 upgrade.

Note: The Orchestrator OS, database, and several other dependent components currently in use have reached their end of life, and will no longer be supported.
Note: The benefit to upgrading to the 6.0 release is better security due to components with active LTS.
Note: Starting from the 6.0.0 release, existing events data is migrated from MySQL to ClickHouse, and all new events data is stored in ClickHouse for a duration of 1 year.
Consider the Following When Upgrading to the 6.0 Release
  • This upgrade work does not modify any existing APIs.
  • Just like other releases, there are schema changes with the 6.0 release. However, these changes will not impact the upgrade process.
The OS-specific changes for the Orchestrator virtual appliance include the following:
  • The OS version is changing from Ubuntu 18.04 to 22.04.
  • An image-based upgrade is used instead of a Debian-based upgrade.
Important Notes for Upgrading from 5.4 to 6.0
With the 6.0 release, the Orchestrator is adopting an image-based upgrade approach, which will introduce the following important differences compared to previous upgrades.
  • Any non-supported binaries installed on top of Orchestrator will be removed. These can include the off-the-shelf monitoring applications, remote access applications, etc.
  • Back up any configurations if you want to continue using them. After the upgrade, you must reinstall them manually and configure them accordingly.
  • For a successful upgrade, a mandatory system-level reboot of the Orchestrator is required.
  • After a successful upgrade, the Orchestrator does not support rolling back to the previous release. Therefore, ensure you have backups of the entire system, including /store, /store2, /store3, and so forth, before upgrading.
  • At least 30GB of free space is required on the physical disk before upgrading the Orchestrator from 5.4.0 to 6.0.0.
Image-based Upgrade Process
This section discusses how the image-based upgrade process works.
  • An Ubuntu 22.04-based VCO image is prepared with all required binaries, with LVM partitions “/” and “/var/”.
  • The “/” and “/var/” LVM partitions are snapshotted to represent the new image rootfs.
  • These snapshots are packaged with upgrade scripts as shown in the below diagram to serve two primary functions:
    • Transferring specific configurations, notably those associated with mysql, nginx, ssh, and their respective keys, from the existing system to the new snapshots.
    • Adjusting the boot configuration to ensure the system boots using the new LVM partitions, thus ensuring the upgrade is complete and effective.
  • As seen in the above diagram, the image-based upgrade replaces the old file system with a new one. As mentioned, this might result in some unsupported files and packages being lost. Contact Arista Support before the upgrade to ensure a safe and successful upgrade.
Best Practices/Recommendations:
Listed below are some upgrade best practices:
  • From the System Properties page in the Orchestrator, make a note of the value of the edge.heartbeat.spread.factor system property. Then, change the heartbeat spread factor to a relatively high value for a large Orchestrator (e.g. 20, 40, 60). This will help reduce the sudden spike of the resource utilization (CPU, IO) on the system. Make sure to verify that all Gateways and Edges are in a connected state before restoring the previous edge.heartbeat.spread.factor value from the System Property page in the Orchestrator.
  • Leave the demoted Orchestrator up for a few hours before complete shutdown or decommission.
  • Freeze configuration modifications to avoid any additional configuration changes until the upgrade process is completed.

Step 4: Proceed with the Orchestrator Upgrade

Contact Arista Support for assistance with the Orchestrator upgrade.

Step 5: Complete the Orchestrator Upgrade

After you have completed the Orchestrator upgrade, select Complete Orchestrator Upgrade. This re-enables the application of the configuration updates of Edges at the global level.

To verify that the status of the upgrade is complete, run the following command to display the correct version number for all the packages:
dpkg -l | grep vco
When you are logged in as an Operator, the same version number should display at the bottom right corner of the Orchestrator.
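
The version check above can be scripted by parsing the dpkg -l output. The following Python sketch collects the version of every installed package whose name contains vco and reports whether they all agree. It is illustrative only. The function names are hypothetical, and the package names in the sample are invented placeholders, not the actual Orchestrator package names. Unlike grep vco, which matches anywhere on the line, this sketch matches the package name only.

```python
def vco_package_versions(dpkg_output: str) -> dict:
    """Map each installed package whose name contains 'vco' to its version."""
    versions = {}
    for line in dpkg_output.splitlines():
        fields = line.split()
        # Installed packages are listed as: ii <name> <version> <arch> <description>
        if len(fields) >= 3 and fields[0] == "ii" and "vco" in fields[1]:
            versions[fields[1]] = fields[2]
    return versions


def single_version(versions: dict):
    """Return the common version if all packages agree, otherwise None."""
    distinct = set(versions.values())
    return distinct.pop() if len(distinct) == 1 else None


# Invented placeholder package names and versions, for illustration only.
SAMPLE = """\
ii  vco-example-backend   6.0.0-123  amd64  hypothetical package
ii  vco-example-ui        6.0.0-123  amd64  hypothetical package
ii  openssl               3.0.2      amd64  unrelated package
"""

print(single_version(vco_package_versions(SAMPLE)))
```

A result of None indicates either that no matching packages were found or that the packages report different versions, and both cases warrant a closer look.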

Orchestrator Disaster Recovery

This section discusses how to set up and upgrade disaster recovery in the Orchestrator.

Set Up Disaster Recovery

To set up disaster recovery in the Orchestrator:

  1. Install a new Orchestrator whose version matches the product version of the current Active Orchestrator.
  2. Set the following properties on the Active and Standby Orchestrator, if necessary:
    1. Set vco.disasterRecovery.transientErrorToleranceSecs to a non-zero value (it defaults to 900 seconds in version 3.3 and later, but to zero in earlier versions). This prevents any transient errors from resulting in an Edge/Gateway management plane update.
    2. Set vco.disasterRecovery.mysqlExpireLogsDays (default is 1 day). This is the amount of time the Active Orchestrator keeps the MySQL binlog data.
  3. Set up the network.public.address property on the Active and Standby Orchestrators to the address contacted by the Edges (Heartbeats).
  4. Set up DR by following the usual DR Setup procedure that is described in Orchestrator Disaster Recovery.
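
The property checks in steps 2 and 3 can be expressed as a simple pre-check. The following Python sketch takes a dictionary of system property values that the operator has already read from the System Properties page and lists any that do not meet the conditions above. It is illustrative only. The function name dr_property_issues is hypothetical, and the sketch does not query the Orchestrator.

```python
TOLERANCE_KEY = "vco.disasterRecovery.transientErrorToleranceSecs"
PUBLIC_ADDRESS_KEY = "network.public.address"


def dr_property_issues(props: dict) -> list:
    """Return human-readable issues found in the supplied property values."""
    issues = []

    tolerance = props.get(TOLERANCE_KEY)
    if tolerance is None:
        issues.append(f"{TOLERANCE_KEY}: value not provided")
    elif int(tolerance) == 0:
        issues.append(f"{TOLERANCE_KEY}: must be set to a non-zero value")

    # The address should be the one contacted by the Edges for heartbeats.
    if not props.get(PUBLIC_ADDRESS_KEY):
        issues.append(f"{PUBLIC_ADDRESS_KEY}: not set")

    return issues
```

Run the check against the values from both the Active and the Standby Orchestrator, because the properties must be set on each.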

Upgrade the DR Setup

To upgrade a DR-enabled Orchestrator pair, follow the steps below:
Note: If the Orchestrator upgrade is from 2.X to 3.2.X, run dr-standby-schema.sh on the Standby before starting the upgrade.
  1. Prepare for the Upgrade. For instructions, go to Step 1: Prepare for the Orchestrator Upgrade of the section titled, Upgrade an Orchestrator with DR Deployment.
  2. Proceed with the Orchestrator upgrade. For instructions, go to Step 4: Proceed with the Orchestrator Upgrade of the section titled, Upgrade an Orchestrator with DR Deployment.

Troubleshooting Orchestrator

This section discusses Orchestrator troubleshooting.

Orchestrator Diagnostics Overview

The Orchestrator Diagnostics bundle is a collection of diagnostic information that Support and Engineering require to troubleshoot the Orchestrator. For on-premises Orchestrator installations, Operators can collect the Orchestrator Diagnostic bundle from the Orchestrator UI and provide it to the Arista Support team for offline analysis and troubleshooting.

SD-WAN Orchestrator Diagnostics includes the following two tabs:
  • Diagnostic Bundles Tab: Request and download a diagnostic bundle. This information can be found in the Arista SD-WAN Orchestrator Deployment and Monitoring Guide. See the section titled, "Diagnostic Bundle Tab."
  • Database Statistics Tab: Provides a read-only access view of some of the information from a diagnostic bundle. This information can be found in the Arista SD-WAN Orchestrator Deployment and Monitoring Guide. See the section titled, "Database Statistics Tab."

Diagnostics Bundle Tab

Users can request and download a diagnostic bundle in the Diagnostics Bundle tab.

Columns in the Diagnostics Bundle Tab

The Orchestrator Diagnostics table grid includes the following columns:

Table 37. Orchestrator Diagnostics Table Description
Column Name | Description
Request Status | One of two statuses: Complete or In Progress. The In Progress status appears until the bundle is ready for download.
Reason for Generation | The specific reason given for generating a diagnostic bundle. Select the Request Diagnostic Bundle button to include a description of the bundle.
User | The individual logged into the Orchestrator.
Generated | The date and time when the diagnostic bundle request was sent.
Cleanup Date | The date when the bundle will be automatically deleted. The default is three months after the Generated date. To extend the period, select the Cleanup Date link located under the Cleanup Date column. For additional information, see Update the Cleanup Date.
Request a Diagnostic Bundle

To request a diagnostic bundle:

  1. From the Orchestrator navigation panel, select Diagnostics.
    Figure 35. Diagnostics Screen
  2. From the Diagnostic Bundles tab, select the Request Diagnostic Bundle button.
  3. In the Request Diagnostic Bundle dialog, enter the reason for the request in the appropriate area.
    Figure 36. Request Diagnostic Bundle
  4. Select Submit. The bundle request you created displays in the grid area of the Diagnostic Bundle screen with an In Progress status.
  5. Refresh your screen to check the status of the diagnostic bundle request. When the bundle is ready for download, a Complete status appears.
Download a Diagnostic Bundle

To download a diagnostic bundle:

  1. Select a diagnostic bundle you want to download.
  2. Select the Actions button, and choose Download Diagnostic Bundle. You can also select the Complete link to download the diagnostics bundle.
The diagnostics bundle downloads.
Update the Cleanup Date

The Cleanup date represents the date when the generated bundle will be automatically deleted, which by default is three months after the Generated date. You can change the Cleanup date or choose to keep the bundle indefinitely.

To update the Cleanup date:

  1. From the Cleanup Date column, select the Cleanup Date link of your chosen Diagnostic Bundle.
  2. From the Update Cleanup Date dialog, select the Calendar icon to change the date.
    Figure 37. Calendar Settings
  3. You can also choose to keep the bundle indefinitely by checking the Keep Forever check box.
    Figure 38. Update Cleanup Date
  4. Select OK.

    The Orchestrator Diagnostics table grid updates to reflect the changes to the Cleanup Date.

    Figure 39. Table Grid Updates

Database Statistics Tab

The Database Statistics tab provides a read-only access view of some of the information from a diagnostic bundle.

If you require additional information, go to the Diagnostic Bundles tab, request a diagnostic bundle, and download it locally. For additional information, see Request a Diagnostic Bundle.

The Database Statistics tab displays the following sections: Database Sizes, Database Table Statistics, Database Storage Info, Database Process List, Database Status Variable, Database System Variable, and Database Engine Status.
Figure 40. Orchestrator Database Statistics
Table 38. Orchestrator Database Statistics Field Descriptions
Field | Description
Database Sizes | Sizes of the Orchestrator databases.
Database Table Statistics | Statistical details of all tables in the Orchestrator database.
Database Storage Info | Storage details of the mounted locations.
Database Process List | The top 20 records of long-running SQL queries.
Database Status Variable | The status variables of the MySQL server.
Database System Variable | System variables of the MySQL server.
Database Engine Status | The InnoDB engine status of the MySQL server.

System Metrics Monitoring

This section discusses System Metrics Monitoring on the Orchestrator.

Orchestrator System Metrics Monitoring Overview

The Orchestrator comes with a built-in system metrics monitoring stack, which includes a metrics collector and a time-series database. With the monitoring stack, you can easily check the health condition and the system load for the Orchestrator.

To enable the monitoring stack, run the following command on the Orchestrator:
sudo /opt/vc/scripts/vco_observability_manager.sh enable
To check the status of the monitoring stack, run:
sudo /opt/vc/scripts/vco_observability_manager.sh status
To deactivate the monitoring stack, run:
sudo /opt/vc/scripts/vco_observability_manager.sh disable

The Metrics Collector

Telegraf is used as the Orchestrator system metrics collector, which includes plugins to collect system metrics. The following metrics are enabled by default.
Table 39. Metric Descriptions
Metric Name | Description
inputs.cpu | Metrics about CPU usage.
inputs.mem | Metrics about memory usage.
inputs.net | Metrics about network interfaces.
inputs.system | Metrics about system load and uptime.
inputs.processes | The number of processes grouped by status.
inputs.disk | Metrics about disk usage.
inputs.diskio | Metrics about disk IO by device.
inputs.procstat | CPU and memory usage for specific processes.
inputs.nginx | Nginx's basic status information (ngx_http_stub_status_module).
inputs.mysql | Statistic data from the MySQL server.
inputs.clickhouse | Metrics from one or many ClickHouse servers.
inputs.redis | Metrics from one or many Redis servers.
inputs.filecount | The number and total size of files in specified directories.
inputs.ntpq | Standard NTP query metrics (requires ntpq executable).
inputs.x509_cert | Metrics from an SSL certificate.
To activate more metrics or deactivate some enabled metrics, edit the Telegraf configuration file on the Orchestrator and then restart Telegraf:
  • sudo vi /etc/telegraf/telegraf.d/system_metrics_input.conf
  • sudo systemctl restart telegraf

The Time-series Database

Prometheus is used to store the system metrics collected by Telegraf. The metrics data will be kept in the database for three weeks at the most. By default, Prometheus listens on port 9090. If you have an external monitoring tool, provide the Prometheus database as a source, so that you can view the Orchestrator system metrics on your monitoring UI.

Rate Limiting API Requests

When there are too many API requests sent at a time, it affects the performance of the system. You can enable Rate Limiting, which enforces a limit on the number of API requests sent by each user.

The Orchestrator makes use of certain defence mechanisms that curb API abuse and provide system stability. API requests that exceed the allowed request limits are blocked and returned with HTTP 429 (Too Many Requests). The client must then wait for a cool-down period before making requests again.

The following types of Rate-Limiters are deployed on Orchestrator:
  • Leaky bucket limiter – Smooths the burst of requests and only allows a pre-defined number of requests. This limiter takes care of limiting the number of requests allowed in a given time window.
  • Concurrency limiter – Limits the number of requests that run in parallel, because concurrent requests compete for resources and may result in long-running queries.
The following are the major reasons that lead to rate limiting of the API requests:
  • Large number of active or concurrent requests.
  • Sudden spikes in request volume.
  • Requests that result in long-running queries on the Orchestrator, which hold system resources for a long time and are eventually dropped.
Developers that rely on the API can adopt the following measures to improve the stability of their code when the VCO rate-limiting capability is enabled.
  • Handle HTTP 429 response code when requests exceed rate limits.
  • The penalty time duration is 5000 ms when the rate limiter reaches the maximum allowed requests in a given period. If blocked, the clients are expected to have a cool down period of 5000 ms before making requests again. The requests made during the cool down period of 5000 ms will still be rate limited.
  • Use shorter time intervals for time-series APIs so that requests do not expire due to long-running queries.
  • Prefer batch query methods to those that query individual Customers or Edges whenever possible.
Note: Operator Super users configure Rate limits discretely based on the environment. For any queries on relevant policies, contact your Operator.
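
The first two measures above (handling HTTP 429 and waiting out the 5000 ms cool-down) can be sketched as a small retry wrapper. This is an illustration under stated assumptions, not a definitive client: request_with_cooldown, COOLDOWN_SECS, MAX_ATTEMPTS, and the vco_status helper shown in the comment are hypothetical names, and the wrapper expects a command that prints only the HTTP status code.

```shell
# Cool-down after an HTTP 429, matching the 5000 ms penalty described above.
COOLDOWN_SECS=${COOLDOWN_SECS:-5}
# Upper bound on attempts so that a persistently limited client gives up.
MAX_ATTEMPTS=${MAX_ATTEMPTS:-3}

# Run "$@" (a command that prints an HTTP status code) until the status is
# not 429 or MAX_ATTEMPTS is reached. Prints the last status code and
# returns non-zero if the request is still rate limited.
request_with_cooldown() {
    attempt=1
    while :; do
        status=$("$@")
        if [ "$status" != "429" ]; then
            printf '%s\n' "$status"
            return 0
        fi
        if [ "$attempt" -ge "$MAX_ATTEMPTS" ]; then
            printf '%s\n' "$status"
            return 1
        fi
        sleep "$COOLDOWN_SECS"
        attempt=$((attempt + 1))
    done
}

# Hypothetical request command (URL and body file are placeholders):
#   vco_status() {
#       curl -s -o /dev/null -w '%{http_code}' -X POST \
#            -H 'Content-Type: application/json' -d @body.json "$VCO_API_URL"
#   }
#   request_with_cooldown vco_status
```

Requests are retried one at a time, so the wrapper never adds concurrent load while the client is being limited.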

Configure Rate Limiting Policies using System Properties

You can use the following system properties to enable Rate Limiting and define the default set of policies:
  • vco.api.rateLimit.enabled
  • vco.api.rateLimit.mode.logOnly
  • vco.api.rateLimit.rules.global
  • vco.api.rateLimit.rules.enterprise.default
  • vco.api.rateLimit.rules.enterpriseProxy.default

For additional information on the system properties, see List of System Properties.

Configure Rate Limiting Policies using APIs

It is recommended to configure the rate limiter policies as global rules using the system properties, as this approach produces the best possible API performance, facilitates troubleshooting, and ensures a consistent user experience across all Partners and Customers. In rare cases, however, Operators may determine that global policies are too lax for a particular tenant or user. For such cases, VMware supports the following operator-only APIs to set policies for specific partners and enterprises.
  • enterpriseProxy/insertOrUpdateEnterpriseProxyRateLimits – Used to configure Partner-specific policies.
  • enterprise/insertOrUpdateEnterpriseRateLimits – Used to configure Customer-specific policies.

For additional information on the APIs, see VeloCloud API Guide.

Enterprise Deployment and Operations for Orchestrator

This section provides information about the available options to monitor, back up, and upgrade Enterprise On-Premises deployments in Day One and Day Two operations scenarios.

Overview

Even though the enterprise on-premises model has some unique advantages and features, there are considerations that the service provider or customer managing the solution must understand. Some of these considerations are as follows:
  • Isolation of the solution: The Arista Cloud Operations team does not have access to apply hotfixes and upgrades.
  • Restrictions on change management limit the frequency of patching and upgrades.
  • Inadequate or insufficient solution monitoring: This situation may happen due to a lack of personnel capable of managing the infrastructure, resulting in functional issues, slower resolution of problems, and customer dissatisfaction.

This approach always requires a significant investment in people and time to manage, operate, and patch properly. The table below outlines some of the elements that must be considered when managing a system on-premises.

Table 40. Elements to Consider
System | Description | VeloCloud Hosted Responsibility | On-Premises Responsibility
SD-WAN Orchestration | Application QoS and link steering policy | Yes | Yes
SD-WAN Orchestration | Security policy for apps and SD-WAN appliances | Yes | Yes
SD-WAN Orchestration | SD-WAN appliance provisioning and troubleshooting | Yes | Yes
SD-WAN Orchestration | Handling of SD-WAN alerting & events | Yes | Yes
SD-WAN Orchestration | Link performance and capacity monitoring | Yes | Yes
Hypervisor | Monitoring / alerting | No | Yes
Hypervisor | Compute and memory resourcing | No | Yes
Hypervisor | Virtual networking and storage | No | Yes
Hypervisor | Backup | No | Yes
Hypervisor | Replication | No | Yes
Infrastructure | CPU, memory, compute | No | Yes
Infrastructure | Switching and routing | No | Yes
Infrastructure | Monitoring & management systems | No | Yes
Infrastructure | Capacity planning | No | Yes
Infrastructure | Software upgrades/patching | No | Yes
Infrastructure | Troubleshooting application/infrastructure issues | No | Yes
Backup and Infrastructure DR | Backup infrastructure | No | Yes
Backup and Infrastructure DR | Regular testing of backup regime | No | Yes
Backup and Infrastructure DR | DR infrastructure | No | Yes
Backup and Infrastructure DR | DR testing | No | Yes

See Day One Operations and Day Two Operations to continue your deployment.

Day One Operations

Deactivating the Cloud-init on the Orchestrator

The data-source contains two sections: meta-data and user-data. Meta-data includes the instance ID and should not change during the lifetime of the instance, while user-data is a configuration applied on the first boot (for the instance ID in meta-data).

It is not recommended to purge the cloud-init file with the command apt purge cloud-init (this procedure does not cause issues in the VeloCloud SD-WAN Controller). Purging the cloud-init file also erases some essential Orchestrator tools and scripts such as upgrade and backup scripts. If the purge command was used, you can restore the files using the following commands:

  1. Go to the folder /opt/vcrepo/pool/main/v/vco-tools.
  2. Install the Orchestrator tool package from the folder: sudo dpkg -i vco-tools_3.4.1-R341-20200423-GA-69c0f688bf.deb.

    The vco-tools package name may change depending on your release. Please check the correct file name with the command ls vco-tools.

Configuring the NTP Timezone

  1. The Orchestrator and Gateway timezone must be set to Etc/UTC.
    vcadmin@vco1-example:~$ cat /etc/timezone
    Etc/UTC
    vcadmin@vco1-example:~$
  2. If the timezone is incorrect, it can be corrected by executing the following commands:
    echo "Etc/UTC" | sudo tee /etc/timezone
    sudo dpkg-reconfigure --frontend noninteractive tzdata

The expectation is that the NTP offset is <= 15 milliseconds.

  1. Use the following command to check the NTP Offset:
    vcadmin@vco1-example:~$ sudo ntpq -p
         remote           refid      st t when poll reach   delay   offset  jitter
    *ntp1-us1.prod.v 74.120.81.219    3 u  474 1024  377   10.171   -1.183   1.033
     ntp1-eu1-old.pr .INIT.          16 u    - 1024    0    0.000    0.000   0.000
    vcadmin@vco1-example:~$
  2. If the offset is incorrect, it can be corrected by executing the following commands:
    sudo service ntp stop
    sudo ntpdate <server>
    sudo service ntp start
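
The timezone and offset checks in this section can be automated. The sketch below assumes the standard ntpq -p column layout, in which the offset is the ninth field (in milliseconds) and the currently selected peer is marked with a leading asterisk. The function names are illustrative only.

```shell
# Succeed when the given file (default /etc/timezone) contains Etc/UTC.
timezone_is_utc() {
    [ "$(cat "${1:-/etc/timezone}")" = "Etc/UTC" ]
}

# Read `ntpq -p` output on stdin. Succeed when the selected peer (the line
# that starts with "*") has an absolute offset within the given limit in
# milliseconds (default 15). Fail when no peer is selected.
ntp_offset_ok() {
    awk -v limit="${1:-15}" '
        /^\*/ {
            off = $9 + 0
            if (off < 0) off = -off
            found = 1
            ok = (off <= limit + 0)
        }
        END { if (found && ok) exit 0; exit 1 }
    '
}

# Typical use on the Orchestrator or Gateway:
#   timezone_is_utc || echo "timezone is not Etc/UTC"
#   sudo ntpq -p | ntp_offset_ok 15 || echo "NTP offset exceeds 15 ms"
```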

Orchestrator Storage

When the Orchestrator is initially deployed, the following partitions are created: /, /store, /store2, and /store3 (version 4.0 and onward). The partitions are created with default sizes. Follow the instructions in Increasing Storage in the Orchestrator for guidance in modifying the default sizes to match the design.

Additional Tasks

The Orchestrator requires further configuration after implementation using the following steps:
  1. Configure System Properties.
  2. Set up the initial Operator Profile.
  3. Set up Operator accounts.
  4. Create Gateways.
  5. Setup Orchestrator.
  6. Create the customer account/partner account.

    Detailed instructions can be found in Install Orchestrator.

Day Two Operations

Orchestrator Backup

This section describes the available mechanisms to periodically back up the Orchestrator database to recover from Operator errors or catastrophic failure of both the Active and Standby Orchestrator.

Remember that the Disaster Recovery (DR) feature is the preferred recovery method. It provides a Recovery Point Objective of nearly zero, as all configurations on the Active Orchestrator are instantly replicated. For more details on the Disaster Recovery feature, see the Orchestrator Disaster Recovery section.

Backup Using the Embedded Script

The Orchestrator provides a built-in configuration backup mechanism to periodically back up the configuration to recover from Operator errors or catastrophic failure of both the Active and Standby Orchestrator. The mechanism is script-driven and is located at /opt/vc/scripts/db_backup.sh.

The script essentially takes a database dump of the configuration data and events, while excluding some of the large monitoring tables during the database dump process. Once the script is executed, backup files are created in the local directory path provided as input to the above script.

Best Practices
  • Mount a remote location and configure the backup script to write to it. The remote location should have the same storage as /store if flows are also being backed up.
  • Before using the Backup Script, check the Disaster Recovery (DR) replication status from the Orchestrator replication page. They should be in sync, and no errors should be present.
  • In addition, execute a MySQL query and check the replication lag.
    • SHOW SLAVE STATUS \G
    • In the output of the above query, look at the field Seconds_Behind_Master. Ideally, it should be zero, but under 10 would be sufficient as well.
  • For large Orchestrators, it is recommended to use the Standby for the Backup script execution. There will be no difference in the Backup that is generated from either Orchestrator.

Caveats
  • The script only takes a backup of the configuration; flow stats or events are not included.
  • Restoring the configuration requires assistance from the Support/Engineering team.
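
The replication-lag check described in the best practices above can be scripted as a pre-backup gate. This is a minimal sketch, assuming the vertical (\G) output format of SHOW SLAVE STATUS, in which the lag appears on a line of the form Seconds_Behind_Master: N. The function name and the default limit of 10 seconds are illustrative.

```shell
# Read `SHOW SLAVE STATUS \G` output on stdin. Succeed when
# Seconds_Behind_Master is a number below the limit (default 10).
# Fail when the field is missing or not numeric (for example, NULL).
replication_lag_ok() {
    awk -v limit="${1:-10}" '
        $1 == "Seconds_Behind_Master:" { found = 1; val = $2 }
        END {
            if (!found || val !~ /^[0-9]+$/) exit 1
            if (val + 0 < limit + 0) exit 0
            exit 1
        }
    '
}

# Typical use on the Standby Orchestrator:
#   sudo mysql --defaults-extra-file=/etc/mysql/velocloud.cnf velocloud \
#        -e 'SHOW SLAVE STATUS \G' | replication_lag_ok || echo "replication lag too high"
```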

The Backup consists of two .gz files, one containing the database schema definition and the other containing the actual data without the definition. The administrator should ensure that the backup directory location has enough disk space for the Backup.

Frequently Asked Questions

  • How long does the Script take to run?

    The duration of the Backup depends on the scale of the actual customer configuration. Since the monitoring tables are excluded from the Backup operation, it is expected that the configuration Backup operation will complete quickly. For a large Orchestrator with thousands of Edges and a large number of historical events, it could take up to an hour, while a smaller Orchestrator should complete within a few minutes.

  • What is the recommended frequency to run the Backup script?

    Depending on the size and time it takes to complete the initial backup, the Backup operation frequency can be determined. The Backup operation should be scheduled to run during off-peak hours to reduce the impact on Orchestrator resources.

  • What if the root file system doesn't have enough space for the backup?

    It is recommended that other mounted volumes are used to store the backup. Note, it is not a best practice to use the root filesystem for the backup.

  • How does one verify if the Backup operation completed successfully?

    The script stdout and stderr should be sufficient to determine the success or failure of the Backup operation. If the script invocation is automated, the exit code can determine the Backup operation's success or failure.

  • How is the configuration recovered?

    Currently, Arista requires that the customer work with Arista Support to recover the configuration data. Arista Support will help to recover the customer's configuration. Customers should refrain from making any additional configuration changes until the configuration is restored.

  • What is the exact impact of executing this Script?

    Even though a backup of the configuration should have little impact on performance, there will be an increase in resource utilization for the MySQL process. It is recommended that the Backup be run during off-peak hours.

  • Are any configuration changes allowed during the run of the Backup operation?

    It is safe to make configuration changes while the Backup operation is running. However, to ensure up-to-date backups, it is recommended that no configuration operations are done while the Backup is running.

  • Can the configuration be restored on the original Orchestrator, or does it require a new Orchestrator?

    Yes, the configuration can, and ideally should, be restored on the same Orchestrator if it is available. This will ensure that the monitoring data is utilized after the Restore operation is completed. If the original Orchestrator cannot be recovered and the Standby Orchestrator is down, the configuration can be restored on a new Orchestrator. In this instance, the monitoring data will be lost.

  • What actions should be taken in case the configuration needs to be restored to a new Orchestrator?

    Please contact Arista Support for the recommended set of actions on the new Orchestrator as the steps vary depending on the actual deployment.

  • Do Edges have to re-register on the newly restored Orchestrator?

    No, Edges are not required to register on the new Orchestrator, as all needed information is preserved as part of the Backup.
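
As noted in the questions above, an automated invocation can rely on the exit code of the backup script. The wrapper below is a hedged sketch: run_backup is a hypothetical helper, and because the exact arguments of db_backup.sh are not listed in this section, the wrapper simply runs whatever command it is given.

```shell
# Run the given backup command and report the result based on its exit code.
# Returns the same exit code so that a scheduler or monitor can act on it.
run_backup() {
    if "$@"; then
        echo "backup succeeded"
    else
        rc=$?
        echo "backup failed with exit code $rc" >&2
        return "$rc"
    fi
}

# Hypothetical use from an off-peak scheduled job (arguments depend on
# your release; a mounted volume other than the root filesystem is assumed):
#   run_backup /opt/vc/scripts/db_backup.sh <arguments>
```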

Orchestrator Disaster Recovery

The Orchestrator Disaster Recovery (DR) feature prevents the loss of stored data and resumes Orchestrator services in the event of system or network failure. Orchestrator DR involves setting up an Active/Standby Orchestrator pair with data replication and a manually-triggered failover mechanism.

Note: DR is mandatory. For licensing and pricing, contact the Arista SD-WAN Sales team for support.

See Set Up Orchestrator Disaster Recovery for detailed instructions.

Upgrade Procedure for the Orchestrator

  1. Arista Support assists with the upgrade. Collect the following information before contacting Arista Support.
  2. Provide the current and target Orchestrator versions, for example, current version 3.4.2 and target version 3.4.3.
    Note: The current version can be found at the top right corner of the Orchestrator by selecting the Help link and choosing About.
  3. Provide a screenshot of the replication dashboard of the Orchestrator.
    Figure 41. Replication Dashboard
  4. Provide the hypervisor type and version (for example, vSphere 6.7).
  5. Provide the output of the following commands from the Orchestrator. Commands must be run as root (for example, 'sudo <command>' or 'sudo -i').
    • LVM layout
      • pvdisplay -v
      • vgdisplay -v
      • lvdisplay -v
      • df -h
      • cat /etc/fstab
    • Memory information
      • free -m
      • cat /proc/meminfo
      • ps -ef
      • top -b -n 2
    • CPU Information
      • cat /proc/cpuinfo
    • Copy of /var/log
      • tar -czf /store/log-`date +%Y%M%S`.tar.gz --newer-mtime="36 hours ago" /var/log
    • From the Standby Orchestrator:
      • sudo mysql --defaults-extra-file=/etc/mysql/velocloud.cnf velocloud -e 'SHOW SLAVE STATUS \G'
    • From the Active Orchestrator:
      • sudo mysql --defaults-extra-file=/etc/mysql/velocloud.cnf velocloud -e 'SHOW MASTER STATUS \G'
  6. Contact VeloCloud SD-WAN Support with the above-mentioned information for assistance with the Orchestrator upgrade.

Monitoring

One of the customer's responsibilities on enterprise On-Prem deployments is to monitor the solution. Monitoring gives customers the visibility required to be one step ahead of possible issues.

SD-WAN Controller Monitoring

You can monitor the status and usage data of Controllers available in the Operator portal.

The procedure is as follows:

  1. In the Operator portal, select Gateways.
  2. The Gateways page displays the list of available Controllers.
  3. Select the link to a Gateway. The details of the selected Controller display.
  4. Select the Monitor tab to view the usage data of the selected Controller.

    The Monitor tab of the selected Controller displays the following details:

    Figure 42. Monitor

    You can choose a specific period to view the Controller details for the selected duration at the top of the page.

    The page displays a graphical representation of usage details of the following parameters for the period of selected time duration, along with the minimum, maximum, and average values.

     
    Usage | Description
    CPU Percentage | Percentage of usage of CPU
    Memory Usage | Percentage of usage of memory
    Flow Counts | Count of traffic flow
    Handoff Queue Drops | Count of packets dropped due to queued handoff
    Tunnel Count | Count of tunnel sessions

    SD-WAN Gateway Controller Recommended Values to Monitor

    The following list shows values that should be monitored and their thresholds. The list is given as a starting point and is not exhaustive. Some deployments may require assessing additional components such as flows, packet loss, etc.

    Whenever a warning threshold is reached, it is recommended to review the current device scale configuration and add more resources if required. When a critical alarm is triggered, it is crucial to contact Arista Support representatives to check the solution and provide further advice.

     
    Service Check | Service Check Description | Warn Threshold | Critical Threshold
    CPU Load | Check System Load. | 60 | 80
    Memory | Checks the memory utilization buffer, cache, and used memory. | 70 | 80
    Tunnels | Number of tunnels from connected Edges. Note: A sudden loss of all tunnels or an abnormally low quantity should also be a concern. | 60% of max Scale | 80% of max Scale
    Handoff Drops | Due to the busy nature of traffic through a Controller, occasional drops are expected. Consistent drops in specific queues may indicate a capacity problem. | - | -
    Disk Space | Current disk utilization | 40% Free | 20% Free
    Controller NTP | Check for Time offset | Offset of 5 Seconds | Offset of 10 Seconds

Orchestrator Integration with Monitoring Stacks

The Orchestrator comes with a built-in system metrics monitoring stack, which can attach to an external metrics collector and a time-series database. With the monitoring stack, you can quickly check the health condition and the system load for the Orchestrator.

Before getting started, set up a time-series database and a dashboard/alerting agent. After this is complete, you can enable Telegraf in the Orchestrator.

To enable the monitoring stack, run the following command on the Orchestrator:
sudo /opt/vc/scripts/vco_observability_manager.sh enable

To check the status of the monitoring stack, run:
sudo /opt/vc/scripts/vco_observability_manager.sh status

To deactivate the monitoring stack, run:
sudo /opt/vc/scripts/vco_observability_manager.sh disable
Figure 43. Monitoring Stacks Topology

The Metrics Collector - Telegraf is used as the Orchestrator system metrics collector, with plugins to collect different system metrics. The following metrics are enabled by default.

Table 41. Metrics Collector
Metric Name | Description | Supported in Version
inputs.cpu | Metrics about CPU usage. | 3.4/4.0
inputs.mem | Metrics about memory usage. | 3.4/4.0
inputs.net | Metrics about network interfaces. | 4.0
inputs.system | Metrics about system load and uptime. | 4.0
inputs.processes | The number of processes grouped by status. | 4.0
inputs.disk | Metrics about disk usage. | 4.0
inputs.diskio | Metrics about disk IO by device. | 4.0
inputs.procstat | CPU and memory usage for specific processes. | 4.0
inputs.nginx | Nginx's basic status information (ngx_http_stub_status_module). | 4.0
inputs.mysql | Statistic data from the MySQL server. | 3.4/4.0
inputs.redis | Metrics from one or many Redis servers. | 3.4/4.0
inputs.statsd | API and system metrics. | 3.4/4.0 (additional metrics are included in 4.0)
inputs.filecount | The number and the total size of files in specified directories. | 4.0
inputs.ntpq | Standard NTP query metrics (requires ntpq executable). | 4.0
inputs.x509_cert | Metrics from an SSL certificate. | 4.0

To activate more metrics or deactivate some enabled metrics, you can edit the Telegraf configuration file on the Orchestrator using the following commands:

sudo vi /etc/telegraf/telegraf.d/system_metrics_input.conf
sudo systemctl restart telegraf
  • Time-series Database - A time-series database (TSDB) is a database optimized for time-series data. It can be used to store the system metrics collected by Telegraf.
  • Dashboard and Alerting Agent - Allows you to query, visualize, alert on, and explore the data stored in the TSDB. The following image provides an example of a dashboard created to monitor the solution using Telegraf, a TSDB, and a dashboard engine.
    Figure 44. Dashboard

Follow the instructions below to set up the time-series database.

  1. Add the iptables entry to allow external monitoring systems to access the Telegraf port. The source IP address should be specified for security reasons.
    For example, if the IP address of the external monitoring system is 191.168.0.200, add the following line to /etc/iptables/rules.v4:
    -A INPUT -p tcp -m tcp --source 191.168.0.200 --dport 9273 -m comment --comment "allow telegraf port" -j ACCEPT
    Figure 45. Adding ports
  2. Restart iptables.

    (Orchestrator 3.4.x)

    sudo service iptables-persistent restart

    (Orchestrator 4.x)

    sudo systemctl restart netfilter-persistent 
  3. Verify that the iptables entry was updated.
    Figure 46. IP Tables
  4. Add the time-series database details in the Telegraf configuration. Create an output configuration file. For example, Prometheus uses the following:
    /etc/telegraf/telegraf.d/prometheus_out.conf
    Figure 47. Prometheus Example
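
The content of the output file is shown only in the figure, so the following is a sketch rather than the documented configuration. It assumes Telegraf's prometheus_client output plugin listening on port 9273, which matches the port opened in the iptables rule above. The helper name write_prometheus_output_conf is hypothetical.

```shell
# Write a minimal Telegraf output configuration to the given path.
# Assumption: the prometheus_client output plugin exposes the collected
# metrics for scraping on port 9273.
write_prometheus_output_conf() {
    cat > "$1" <<'EOF'
# Expose collected metrics for scraping by an external Prometheus server.
[[outputs.prometheus_client]]
  listen = ":9273"
EOF
}

# Typical use on the Orchestrator:
#   write_prometheus_output_conf /tmp/prometheus_out.conf
#   sudo cp /tmp/prometheus_out.conf /etc/telegraf/telegraf.d/prometheus_out.conf
#   sudo systemctl restart telegraf
```

The external Prometheus server would then scrape the Orchestrator address on port 9273.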

Monitor Values and Thresholds

The following list shows values that should be monitored and their thresholds. The list is given as a starting point and is not exhaustive. Some deployments may require assessing additional components such as database transactions, automatic backups, etc.

Whenever a warning threshold is reached, it is recommended to review the current device scale configuration and add more resources if required. When a critical alarm is triggered, it is crucial to contact the Arista Support representatives to check the solution and give further advice.
Table 42. Service Checks
Service Check Service Check Description Warn Threshold Critical Threshold
CPU Load Check System Load – Telegraf input plugin: inputs.cpu. 60 70
Memory Checks the memory utilization buffer, cache, and used memory – Telegraf input plugin: inputs.mem. 70 80
Disk Usage Disk Utilization in the different Orchestrator partitions, /, /store, /store2 and /store3 (version 4.0 and onwards) – Telegraf input plugin: inputs.disk (version 4.0 and onwards). 40% Free 20% Free
MySQL Server Checks MySQL connections – Telegraf input plugin: inputs.mysql.   Above 80% of the maximum connections defined in the MySQL configuration (/etc/mysql/my.cnf)
Orchestrator Time Check for Time offset – Telegraf input plugin: inputs.ntpq (version 4.0 and onwards). Offset of 5 Seconds Offset of 10 Seconds
Orchestrator SSL Certificate Checks Certificate Expiration – Telegraf input plugin: inputs.x509_cert (version 4.0 and onwards). 60 Days 30 Days
Orchestrator Internet (not applicable for MPLS only topologies) Check for Internet access. Response time > 5 secs Response time > 10 secs
Orchestrator HTTP Make sure HTTP on localhost is responding.   The localhost is not responding.
Orchestrator Total Cert Count Check Total – Example MySQL queries:

SELECT count(id) FROM VELOCLOUD_EDGE_CERTIFICATE WHERE validFrom <= NOW() AND validTo >= NOW();
SELECT count(id) FROM VELOCLOUD_GATEWAY_CERTIFICATE WHERE validFrom <= NOW() AND validTo >= NOW();

CRL When Total Cert count exceeds 5000
DR Replication Status Confirm the Standby Orchestrator is up-to-date. Review that the DR Orchestrator is no more than 1000 seconds behind the Active Orchestrator.

Seconds_Behind_Master, from the MySQL command: SHOW SLAVE STATUS \G

DR Replication Edge Gateway delta Confirm that Edges and Gateways can talk to the DR Orchestrator.

Different values between the Active and the Standby Orchestrators can be due to a difference in the timezone in Edges and Gateways.

The same number of Edges talking with the Active Orchestrator should be able to reach the Standby Orchestrator. This value can be checked on the "replication" tab or via the API.
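
The warn and critical thresholds above can be evaluated with a small helper once a metric value has been read from the monitoring system. This is a sketch under stated assumptions: classify is a hypothetical name, it applies to checks where a higher value is worse (such as CPU Load at 60/70), and treating a value equal to a threshold as having reached it is an assumption.

```shell
# Print OK, WARN, or CRITICAL for a value where higher is worse.
# Arguments: value, warn threshold, critical threshold.
classify() {
    awk -v v="$1" -v w="$2" -v c="$3" 'BEGIN {
        if (v + 0 >= c + 0) print "CRITICAL"
        else if (v + 0 >= w + 0) print "WARN"
        else print "OK"
    }'
}

# Examples:
#   classify 65 60 70     (CPU Load of 65 against the 60/70 thresholds)
#   classify 85 60 80     (disk: 15% free is 85% used; 40%/20% free is 60%/80% used)
```

Checks expressed as free space can be converted to used space first, as in the second example, so that one comparison direction covers both kinds of threshold.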

API Best Practices

Orchestrator powers the management plane in the VeloCloud SD-WAN solution. It offers a broad range of configuration, monitoring, and troubleshooting functionality to service providers and enterprises. The main web service with which users interact to exercise this functionality is called the Orchestrator Portal.

Orchestrator Portal: The Orchestrator Portal allows network administrators (or scripts and applications acting on their behalf) to manage network and device configuration and to query the current or historical network and device state. API clients may interact with the Portal via a JSON-RPC interface or a REST-like interface. All of the methods described in this document can be invoked using either interface. No Portal functionality is restricted exclusively to JSON-RPC clients or to REST-like ones.

Both interfaces accept exclusively HTTP POST requests. Both also expect request bodies to contain JSON-formatted content. Consistent with RFC 2616, clients should formally assert that this is the case using the Content-Type request header, e.g., Content-Type: application/json.
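
To illustrate the request shape described above, the following Python sketch builds (but does not send) a JSON-RPC POST request. It assumes a JSON-RPC 2.0 envelope. The base URL https://vco.example.com/portal/ and the empty params object are placeholders for illustration only, and authentication is omitted. Consult the API reference for the actual endpoint and parameters.

```python
import json
import urllib.request


def build_jsonrpc_request(base_url, method, params, request_id=1):
    """Build an HTTP POST request carrying a JSON-RPC 2.0 call.

    base_url is a placeholder; the real Portal endpoint is not given here.
    """
    body = {
        "jsonrpc": "2.0",   # JSON-RPC 2.0 envelope (assumed version)
        "method": method,   # e.g. "monitoring/getAggregateEdgeLinkMetrics"
        "params": params,
        "id": request_id,
    }
    return urllib.request.Request(
        url=base_url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    req = build_jsonrpc_request(
        "https://vco.example.com/portal/",          # hypothetical URL
        "monitoring/getAggregateEdgeLinkMetrics",
        {},                                         # placeholder parameters
    )
    print(req.get_method(), req.full_url)
    print(req.data.decode("utf-8"))
```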

More information about the VeloCloud SD-WAN API can be found here:

https://code.Arista.com/apis/1000/velocloud-sdwan-vco-api

Best Practices for enterprises and service providers using APIs: Consider the following best practices while using APIs:
  • Wherever possible, prefer aggregate API calls to enterprise-specific ones. For example, a single call to monitoring/getAggregateEdgeLinkMetrics can retrieve transport statistics across all Edges at once.
  • VeloCloud requests that clients limit the number of API calls in flight at any given time to a maximum of 2 to 4. If a user feels there is a compelling reason to parallelize API calls further, Arista requests that they contact Arista Support to discuss alternative solutions.
  • Arista does not recommend polling the API for stats data more often than every 10 minutes. New stats data arrives at the Orchestrator every 5 minutes, but because of jitter in reporting and processing, clients that poll every 5 minutes might observe "false-positive" cases in which new stats are not yet reflected in the API results. Request intervals of 10 minutes or longer give the most reliable results.
  • Avoid querying the same information twice.
  • Pause (sleep) between consecutive API calls.
  • For complex software automations, run your scripts, evaluate their CPU and memory impact, and then adjust as required.
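
Several of the practices above (no duplicate queries, a pause between calls, and a polling interval of at least 10 minutes) can be combined in a small client-side helper. The following Python sketch is one possible shape. The function name, the call_api callback, and the one-second pause are assumptions for illustration; the guide does not specify a pause length.

```python
import time

POLL_INTERVAL_SECONDS = 600  # 10 minutes, per the guidance above
PAUSE_BETWEEN_CALLS = 1.0    # assumed value; the guide gives no number


def poll_once(call_api, methods, pause=PAUSE_BETWEEN_CALLS, sleep=time.sleep):
    """Call each distinct method once, sequentially, pausing between calls.

    call_api is a caller-supplied function (hypothetical here) that performs
    the actual API request for a method name and returns its result.
    Calls run one at a time, so at most one call is in flight.
    """
    results = {}
    for method in methods:
        if method in results:
            continue          # avoid querying the same information twice
        if results:
            sleep(pause)      # pause between calls, not before the first
        results[method] = call_api(method)
    return results


if __name__ == "__main__":
    # Stand-in for a real API client; it only echoes the method name.
    fake_api = lambda method: {"method": method, "rows": []}
    wanted = ["monitoring/getAggregateEdgeLinkMetrics"]
    print(poll_once(fake_api, wanted, pause=0))
    # A long-running collector would repeat poll_once() and then call
    # time.sleep(POLL_INTERVAL_SECONDS) between rounds.
```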

Orchestrator Syslog Configuration

The VeloCloud Orchestrator Syslog capability can be configured independently for the following Orchestrator processes:
  • Portal: The Portal process runs as an internal HTTP server downstream from NGINX. The Portal service handles incoming API requests, either from the Orchestrator web interface or from an HTTP/SDK client, primarily in a synchronous fashion. These requests allow authenticated users to configure, monitor, and manage the various services provided by the Orchestrator.

    This log is useful for AAA activities because it records all actions taken by users in the Orchestrator.

    Log files: /var/log/portal/velocloud.log (Logs all info, warn, and error logs)

  • Upload: The Upload process runs as an internal HTTP server downstream from NGINX. The Upload service handles incoming requests from Edges and Gateways, either synchronously or asynchronously. These requests primarily consist of activations, heartbeats, flow statistics, link statistics, and routing information sent by Edges and Gateways.

    Log files: /var/log/upload/velocloud.log (Logs all info, warn, and error logs)

  • Backend: Job runner that primarily runs scheduled or queued jobs. Scheduled jobs consist of cleanup, rollup, or status update activities. Queued jobs consist of processing link and flow statistics.

    Log files: /var/log/backend/velocloud.log (Logs all info, warn, and error logs)

Use the following steps to configure Orchestrator Syslog:

  1. In the Orchestrator, navigate to System Properties and enter log.syslog in the search bar.
  2. Change the enable:false value to true for one or more of the servers. Set the host IP address and port according to your implementation.
    Figure 48. Modify System Property
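
Before pointing the Orchestrator at a syslog collector, it can help to confirm that the collector receives messages on the chosen host and port. The following Python sketch sends one test message from any machine with network access to the collector, using the standard library's SysLogHandler. It assumes the collector listens on UDP, which this guide does not state. The function name, logger name, and the example address are illustrative.

```python
import logging
import logging.handlers


def send_test_message(host, port, text="orchestrator syslog connectivity test"):
    """Send a single syslog message to host:port over UDP."""
    # SysLogHandler uses UDP by default when given a (host, port) tuple.
    handler = logging.handlers.SysLogHandler(address=(host, port))
    logger = logging.getLogger("syslog-connectivity-test")
    logger.setLevel(logging.INFO)
    logger.propagate = False  # keep the test message out of other handlers
    logger.addHandler(handler)
    try:
        logger.info(text)
    finally:
        logger.removeHandler(handler)
        handler.close()


if __name__ == "__main__":
    # Placeholder address from the documentation range; replace with the
    # host IP and port entered in the system property.
    send_test_message("192.0.2.10", 514)
```

If the message appears in the collector's log, the host and port values are reachable from the sending machine.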

Increasing Storage in the Orchestrator

For detailed instructions to increase the Storage in the Orchestrator, see the topics Install SD-WAN Orchestrator and Expand Disk Size (Arista).

Best Practices
  • Ensure the same LVM distribution applies to the Standby Orchestrator.
  • It is not recommended to reduce the size of the volumes once increased. Use thin provisioning instead.
  • In release 3.4, when increasing the disk size, the following distribution may be used:
    • “/” volume: This volume is used for the operating system. Production Orchestrators are usually set to 140 GB and show 40% to 60% usage.
    • /store and /store2: The proportion applied in production Orchestrators is close to 85% for /store and 15% for /store2.

Use the guidelines in the table below for release 4.x and onwards.

Table 43. Storage
Instance Size          /store   /store2   /store3   /var/log
Small (5000 Edges)     2 TB     500 GB    8 TB      100 GB
Medium (10000 Edges)   2 TB     500 GB    12 TB     125 GB
Large (15000 Edges)    2 TB     500 GB    16 TB     150 GB
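
The sizing table can also be expressed as a lookup. The following Python sketch returns the storage guideline for a given number of Edges. It assumes the Edge counts in the table are upper bounds for each instance size, which the table does not state explicitly. The function and variable names are illustrative.

```python
# Rows copied from Table 43. Treating each Edge count as the upper bound
# of its tier is an assumption made for this sketch.
STORAGE_TIERS = [
    ("Small", 5000, {"/store": "2 TB", "/store2": "500 GB",
                     "/store3": "8 TB", "/var/log": "100 GB"}),
    ("Medium", 10000, {"/store": "2 TB", "/store2": "500 GB",
                       "/store3": "12 TB", "/var/log": "125 GB"}),
    ("Large", 15000, {"/store": "2 TB", "/store2": "500 GB",
                      "/store3": "16 TB", "/var/log": "150 GB"}),
]


def recommend_storage(edge_count):
    """Return (tier name, volume sizes) for the smallest tier that fits."""
    for name, max_edges, volumes in STORAGE_TIERS:
        if edge_count <= max_edges:
            return name, volumes
    raise ValueError("edge count exceeds the largest tier in the table")


if __name__ == "__main__":
    tier, volumes = recommend_storage(7500)
    print(tier, volumes)
```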

Managing Certificates in the Orchestrator

Orchestrator uses a built-in certificate server to manage the overall PKI lifecycle of all Edges and SD-WAN Controllers. X.509 certificates are issued to the devices in the network.

Detailed instructions to configure the CA can be found in the official VeloCloud SD-WAN Operator documentation, under Install Orchestrator and Install an SSL Certificate.

Certificates issued by the CA are used only for the authentication of the following:
  • Management plane TLS 1.2 tunnels between the Orchestrator and the Edges or SD-WAN Controllers.
  • Control and Data plane IKEv2/IPsec tunnels between SD-WAN Edges, and between Edges and SD-WAN Controllers.

Certificate Revocation List

On Controllers with PKI enabled, revoked certificates are stored in a Certificate Revocation List (CRL). If this list grows too long, generally due to an issue with the Orchestrator Certificate Authority, Controller performance degrades. The CRL should contain fewer than 4,000 entries. The following command, run on a Controller, counts the entries in the CRL:

vcadmin@vcg1-example:~$ openssl crl -in /etc/vc-public/vco-ca-crl.pem -text | grep 'Serial Number' | wc -l
14
vcadmin@vcg1-example:~$

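
The same count can be automated and compared against the 4,000-entry guideline. The following Python sketch parses the text output of the openssl command shown above. The function names are illustrative, and read_crl_text() assumes that openssl is installed and that the CRL is at the path used in the example.

```python
import subprocess

CRL_ENTRY_LIMIT = 4000  # guideline stated in this section


def count_revoked(crl_text):
    """Count revoked certificates in 'openssl crl -text' output."""
    return sum(1 for line in crl_text.splitlines() if "Serial Number" in line)


def crl_within_limit(crl_text, limit=CRL_ENTRY_LIMIT):
    """Return True when the CRL has fewer entries than the limit."""
    return count_revoked(crl_text) < limit


def read_crl_text(path="/etc/vc-public/vco-ca-crl.pem"):
    """Run the same openssl command as the example above and return its output."""
    result = subprocess.run(
        ["openssl", "crl", "-in", path, "-text"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


if __name__ == "__main__":
    text = read_crl_text()
    print(count_revoked(text), "entries; within limit:", crl_within_limit(text))
```
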
Support Interaction

Our Customer Support organization provides 24x7x365 world-class technical assistance and personalized guidance to VeloCloud SD-WAN customers.

This section provides some guidelines to interact with the Arista Support team.
  • Diagnostic Bundles

    While investigating an incident, you can create a diagnostic bundle of the Orchestrator and SD-WAN Controller. The resulting file helps the Arista Support team analyze the events around an issue.

    Figure 49. Gateway Diagnostic Bundles
    Figure 50. Request Diagnostic Bundles
  • Share Access with Support

    On occasion assistance from Arista Support representatives for the Orchestrator and SD-WAN Controllers may be required.

    Some common ways to grant access are:
    • Remote sessions with Support: The customer either grants remote control of the SSH jump server or follows the Support representative's instructions.
    • Support account in the Orchestrator: Creating an account for the Support team in the Orchestrator lets the team gather logs without customer interaction.
    • Bastion Host: SSH permissions and keys can be configured so that Support engineers can access the on-premises Orchestrator and SD-WAN Controller through a Bastion Host.

    When contacting Arista SD-WAN Support to assist triaging an issue, include the data described in the table below.

    Additional information can be found in the following link: https://www.arista.com/en/support/customer-support/velocloud.

Table 44. Arista SD-WAN Support Required Information
Required                                     Suggested
Partner Case Number                          Issue Start/Stop
Partner Return Email/Phone                   Impacted Flow SRC/DST IP
Orchestrator URL                             Impacted Flow SRC/DST Port
Customer Name in Orchestrator                Flow Path (E2E, E2GW, Direct)
Customer Impact (High/Med/Low)               SD-WAN Gateway Name(s)
Edge Name(s)                                 Link to PCAP in the Orchestrator
Link to Diagnostic Bundle in Orchestrator
Short Problem Statement
Analysis & Requested Assistance