Monitor Gateways using CLI

You can use CLI commands to check the status of Gateways in an Orchestrator.

Monitor System Health

You can use CLI commands to check the status of the Gateways, the software version, CPU and memory usage, and other information.

Monitor Gateway Activation State

Use the following command to check if the Gateway is activated on an Orchestrator.

In the following example, the Gateway is activated:

vcadmin@vcg1-example1:~$ /opt/vc/bin/is_activated.py
True
vcadmin@vcg1-example1:~$

In the following example, the Gateway is deactivated:

vcadmin@vcg1-example1:~$ /opt/vc/bin/is_activated.py
False
vcadmin@vcg1-example1:~$
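
The activation check lends itself to simple scripted polling. The following is a minimal sketch; the script path comes from the example above, and the injectable cmd parameter is only there to keep the sketch testable:

```python
#!/usr/bin/env python3
"""Sketch: report the activation state of a Gateway by polling is_activated.py."""
import subprocess


def is_activated(cmd=("/opt/vc/bin/is_activated.py",)):
    """Return True when the activation check prints 'True'."""
    out = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return out.stdout.strip() == "True"
```

A monitoring agent can call is_activated() on a schedule and alert when the result changes.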

View Activated Orchestrator Name

Use the following command to locate the Orchestrator for the Gateway, provided the Gateway is activated.

vcadmin@vcg1-example1:~$ /opt/vc/bin/getpolicy.py 
managementPlane.data.managementPlaneProxy.primary
"vco1-example1.velocloud.net"
vcadmin@vcg1-example1:~$

View Software Version

The various VeloCloud processes in the system report version numbers, which should all be identical.

Review the version numbers using the following commands:

root@NY-GATEWAY-1:~# /opt/vc/sbin/gwd -v
VCG Info
========
Version:          4.2.0
Build rev:        R420-20201216-GA-0bcea3f6f0
Build Date:       2020-12-16_23-23-33
Build Hash:       0bcea3f6f0e6b8c21260187bb2d953e4cefd7f27
root@NY-GATEWAY-1:~# /opt/vc/sbin/natd -v
NATd Info
========
Version:          4.2.0
Build rev:        R420-20201216-GA-0bcea3f6f0
Build Date:       2020-12-16_23-23-33
Build Hash:       0bcea3f6f0e6b8c21260187bb2d953e4cefd7f27
root@NY-GATEWAY-1:~# /opt/vc/sbin/mgd -v
VeloCloud gateway 4.2.0 build R420-20201216-GA-0bcea3f6f0
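
Because the versions should match, a quick consistency check can parse each command's output for the build revision. The following is a sketch; the helper names are illustrative, and the revision format is taken from the sample output above:

```python
#!/usr/bin/env python3
"""Sketch: verify that gwd, natd, and mgd report the same build revision."""
import re


def extract_build_rev(text):
    """Pull a build revision such as R420-20201216-GA-0bcea3f6f0 from -v output."""
    m = re.search(r"R\d+-\d+-\w+-[0-9a-f]+", text)
    return m.group(0) if m else None


def versions_consistent(outputs):
    """True when every output contains the same, non-empty build revision."""
    revs = {extract_build_rev(o) for o in outputs}
    return len(revs) == 1 and None not in revs
```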

View NTP Time Zone

Use the following command to view the NTP time zone. The Gateway time zone must be set to Etc/UTC.

vcadmin@vcg1-example:~$ cat /etc/timezone
Etc/UTC
vcadmin@vcg1-example:~$

If the time zone is incorrect, use the following commands to update the time zone.

echo "Etc/UTC" | sudo tee /etc/timezone
sudo dpkg-reconfigure --frontend noninteractive tzdata

View NTP Offset

Use the following command to view the NTP offset, which must be less than or equal to 15 milliseconds.

vcadmin@vcg1-example:~$ sudo ntpq -p
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
*ntp1-us1.prod.v 74.120.81.219    3 u  474 1024  377   10.171   -1.183   1.033
 ntp1-eu1-old.pr .INIT.          16 u    - 1024    0    0.000    0.000   0.000
vcadmin@vcg1-example:~$

If the offset is out of range, use the following commands to resynchronize the clock.

sudo systemctl stop ntp
sudo ntpdate <server>
sudo systemctl start ntp

Monitor Disk Usage

Use the following command to check the available disk space. Ensure that the disk has at least 16 GB of free space to store critical files such as logs and cores.

vcadmin@vcg1-example:~$ sudo df -kh --total | grep total | awk '{print $4}'
77G
vcadmin@vcg1-example:~$

The common places for disk usage to build up are /var/log, /velocloud/core, and /tmp.

Note: Each session creates a temporary file named /velocloud/vctcpdump.XXXX with a fixed size of 1 MB. This file is deleted when the session ends.
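
The free-space check can also be scripted with the standard library. The following is a sketch; the 16 GB floor mirrors the recommendation above, and the mount point to watch is an assumption to adjust per deployment:

```python
#!/usr/bin/env python3
"""Sketch: warn when free disk space falls below the recommended 16 GB."""
import shutil

MIN_FREE_BYTES = 16 * 1024**3  # 16 GB recommended free space for logs and cores


def free_space_ok(path="/", min_free=MIN_FREE_BYTES):
    """Return (ok, free_bytes) for the filesystem holding 'path'."""
    free = shutil.disk_usage(path).free
    return free >= min_free, free


if __name__ == "__main__":
    # /var/log, /velocloud/core, and /tmp are the usual growth points.
    ok, free = free_space_ok("/")
    print("%s - %.1f GB free" % ("OK" if ok else "WARNING", free / 1024**3))
```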

Monitor CPU Usage

The Gateway processes bursts of traffic, so bursts of high CPU usage are expected. Monitor the Gateway for CPU cores pinned at 100%. Note, however, that the DPDK cores run in poll mode for performance reasons and are expected to run close to 100% CPU at high throughput.

You can monitor a Gateway with thresholds that provide warning or critical states which indicate potential issues prior to impacting services. The following table lists the threshold values and recommended actions.

Table 1. Thresholds
Threshold State | Threshold Value (DP Core / Non-DP Core) | Recommended Corrective Action

Warning | 95% / 80%
If the threshold value is crossed consistently for 5 minutes:
  • Check per-process CPU usage.
  • Monitor for 10 more minutes.

If the threshold continues to be crossed:
  • Collect a Gateway diagnostic bundle.
  • Open a support case with Arista.

Critical | 98% / 90%
If the threshold value is crossed consistently for 5 minutes:
  • Monitor for possible critical packet drops, which can indicate over capacity.

If the issue is observed for one hour:
  • If over capacity is observed over a 5-minute interval, add Gateway capacity and rebalance to avoid capacity-related service impact.
    Note: Before rebalancing the Gateway, confirm that the capacity metrics are within the recommended limits. For more information on capacity metrics, see Capacity of Gateway Components.
The following is an example Python script for monitoring the CPU usage:
Note: You can also use Telegraf to monitor the CPU usage. For more information, see Monitor Gateways using Telegraf.
#! /usr/bin/env python3
"""
Check for CPUs spinning at 100%
"""
import re
import collections
import time
import sys
import json
import os
import subprocess

re_cpu = re.compile(r"^cpu\d+\s")
CPUStat = collections.namedtuple('CPUStat', ['user', 'nice', 'sys', 'idle'])


def get_stats():
    ret = {}
    with open("/proc/stat") as stats:
        for s in stats:
            if not re_cpu.search(s):
                continue
            s = s.split()
            ret[s[0]] = CPUStat(*[int(v) for v in s[1:5]])
    return ret


def verify_dpdk_support():
    if os.path.isfile('/opt/vc/etc/dpdk.json'):
        with open("/opt/vc/etc/dpdk.json") as data:
            d = json.load(data)
        return d.get("status") == "Supported"
    return False


def another_verify_dpdk_support():
    if os.path.isfile('/opt/vc/bin/debug.py'):
        f = subprocess.check_output(
            ["/opt/vc/bin/debug.py", "--dpdk_ports_dump"], text=True)
        return len([r.split() for r in f.split('\n')]) > 1
    return False


dpdk_status = verify_dpdk_support() or another_verify_dpdk_support()

if __name__ == "__main__":
    try:
        stat1 = get_stats()
        time.sleep(3)
        stat2 = get_stats()
    except Exception:
        print("UNKNOWN - failed to get CPU stat: %s" % str(sys.exc_info()[1]))
        sys.exit(3)
    # A CPU whose idle counter did not advance during the sample is spinning.
    busy_cpu_set = [cpu for cpu in stat1 if (
        stat2[cpu].idle - stat1[cpu].idle) == 0]
    if not busy_cpu_set:
        print("OK - no spinning CPUs")
        sys.exit(0)
    if dpdk_status:
        # cpu1 runs the DPDK poll-mode driver, so 100% usage there is expected.
        if busy_cpu_set == ["cpu1"]:
            print("OK - no spinning CPUs")
            sys.exit(0)
        if "cpu1" in busy_cpu_set:
            busy_cpu_set.remove("cpu1")
        print("CRITICAL - %s is at 100%%" % ",".join(busy_cpu_set))
        sys.exit(2)
    else:
        print("CRITICAL - %s is at 100%%" % ",".join(busy_cpu_set))
        sys.exit(2)

Monitor Memory Usage

The main process (gwd) has its memory monitored by vc_process_monitor, which ensures that it never consumes more than 75% of available memory. As a result, monitoring of total system memory uses a warning threshold of 80% and a critical threshold of 90%.

You can monitor a Gateway with thresholds that provide warning or critical states which indicate potential issues prior to impacting services. The following table lists the threshold values and recommended actions.

Table 2. Thresholds
Threshold State | Threshold Value | Recommended Corrective Action

Warning | 80%
If memory usage crosses the warning threshold:
  • Collect a Gateway diagnostic bundle.
  • Check per-process memory usage.

Continue monitoring actively and check for increasing utilization.

Critical | 90%
If memory usage crosses the critical threshold:
  • Monitor for possible critical packet drops, which can indicate over capacity.

If the issue is observed again:
  • If over capacity is observed over a 5-minute interval, add Gateway capacity and rebalance to avoid capacity-related service impact.
    Note: Before rebalancing the Gateway, confirm that the capacity metrics are within the recommended limits. For more information on capacity metrics, see Capacity of Gateway Components.
The following is an example Python script for monitoring the memory usage:
Note: You can also use Telegraf to monitor the memory usage. For more information, see Monitor Gateways using Telegraf.
#!/usr/bin/env python3

from optparse import OptionParser
import sys

# Parse command-line options:
parser = OptionParser(
    usage="%prog -w <warning threshold>% -c <critical threshold>% [ -h ]")
parser.add_option("-w", "--warning",
                  action="store", type="string", dest="warn_threshold",
                  help="Warning threshold in absolute (MB) or percentage")
parser.add_option("-c", "--critical",
                  action="store", type="string", dest="crit_threshold",
                  help="Critical threshold in absolute (MB) or percentage")
(options, args) = parser.parse_args()


def read_meminfo():
    meminfo = {}
    with open('/proc/meminfo') as f:
        for line in f:
            if not line:
                continue
            (name, value) = line.split()[0:2]
            meminfo[name.strip().rstrip(':')] = int(value)
    return meminfo


if __name__ == '__main__':
    if not options.crit_threshold:
        print("UNKNOWN: Missing critical threshold value.")
        sys.exit(3)
    if not options.warn_threshold:
        print("UNKNOWN: Missing warning threshold value.")
        sys.exit(3)

    is_warn_pct = options.warn_threshold.endswith('%')
    warn_threshold = int(options.warn_threshold.rstrip('%'))

    is_crit_pct = options.crit_threshold.endswith('%')
    crit_threshold = int(options.crit_threshold.rstrip('%'))

    # The thresholds apply to FREE memory, so critical must be below warning.
    if crit_threshold >= warn_threshold:
        print("UNKNOWN: Critical threshold must be lower than the warning threshold.")
        sys.exit(3)

    meminfo = read_meminfo()
    memTotal = meminfo["MemTotal"]
    memFree = meminfo["MemFree"] + meminfo["Buffers"] + meminfo["Cached"]
    memFreePct = 100.0 * memFree / memTotal
    status_line = "Free memory is at %2.0f%% (%d MB free out of %d MB total)" % (
        memFreePct, memFree // 1024, memTotal // 1024)
    if (is_crit_pct and memFreePct <= crit_threshold) or \
            (not is_crit_pct and memFree // 1024 <= crit_threshold):
        print("CRITICAL: " + status_line)
        sys.exit(2)
    if (is_warn_pct and memFreePct <= warn_threshold) or \
            (not is_warn_pct and memFree // 1024 <= warn_threshold):
        print("WARNING: " + status_line)
        sys.exit(1)
    print("OK: " + status_line)
    sys.exit(0)

Monitor VeloCloud SD-WAN Services

Use the CLI commands to monitor the SD-WAN processes, sessions, and components.

Monitor VeloCloud SD-WAN Processes

The VeloCloud SD-WAN processes described in the Components section must be running to ensure proper functionality of the system. Use the Linux pgrep command to identify a process; the invocation is slightly different for Python processes, which require pgrep -f to match the full command line. If the process is running, its PID (integer process ID) is returned. If it is not running, the command returns empty output.

vc_procmon

Use the following command to check if vc_procmon is running on the system.

vcadmin@vcg1-example:~$ pgrep -f vc_procmon
14711
vcadmin@vcg1-example:~$

Other Processes

Use the following commands to check for other processes.

vcadmin@vcg1-example:~$ pgrep mgd
14725
vcadmin@vcg1-example:~$ pgrep gwd
15143
vcadmin@vcg1-example:~$ pgrep natd
15095
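
Where pgrep is unavailable to a monitoring agent, the same check can be done by scanning /proc directly. The following is a sketch of that approach; the process names to look for follow the examples above:

```python
#!/usr/bin/env python3
"""Sketch: find processes by name by scanning /proc, similar to pgrep -f."""
import os


def find_pids(name):
    """Return the PIDs of processes whose command line contains 'name'."""
    pids = []
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue
        try:
            with open("/proc/%s/cmdline" % entry, "rb") as f:
                cmdline = f.read().replace(b"\0", b" ").decode(errors="replace")
        except OSError:
            continue  # process exited while scanning
        if name in cmdline:
            pids.append(int(entry))
    return pids
```

An empty list from find_pids("gwd") would indicate the daemon is down, mirroring the empty output of pgrep.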

To recover the processes, restart the VeloCloud SD-WAN Process Monitor using the following command; this restarts all the other processes.

sudo service vc_process_monitor restart

Use the following command to restart the routing protocol daemons.

/usr/sbin/frr.init {start|stop|restart} [daemon ...]

Monitor Certificate Revocation List

On Gateways with PKI enabled, revoked certificates are stored in a Certificate Revocation List (CRL). If this list grows too long, generally due to an issue with the Certificate Authority of the Orchestrator, Gateway performance is impacted. The CRL should contain fewer than 4000 entries.

Use the following command to check the CRL entries.

vcadmin@vcg1-example:~$ openssl crl -in /etc/vc-public/vco-ca-crl.pem -text | grep 'Serial Number' | wc -l 
14
vcadmin@vcg1-example:~$
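
The same count can be computed in a monitoring script from the openssl text output. The following is a sketch, assuming the 'Serial Number' lines produced by openssl crl -text as shown above:

```python
#!/usr/bin/env python3
"""Sketch: count revoked certificates in the text dump of a CRL."""

CRL_MAX_ENTRIES = 4000  # recommended upper bound from the guidance above


def count_crl_entries(crl_text):
    """Each revoked certificate appears as a 'Serial Number' line."""
    return sum(1 for line in crl_text.splitlines() if "Serial Number" in line)


def crl_ok(crl_text, limit=CRL_MAX_ENTRIES):
    return count_crl_entries(crl_text) < limit
```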

Monitor ICMP Status

If you configure a Gateway as a Partner Gateway with static routing, and the ICMP responder is configured to track the reachability of those routes, the debug.py command indicates the UP or DOWN states:

vcadmin@vcg1-example:~$ sudo /opt/vc/bin/debug.py --icmp_monitor
{
  "icmpProbe": {
    "cTag": 0, 
    "destinationIp": "0.0.0.0", 
    "enabled": false, 
    "frequencySeconds": 0, 
    "probeFail": 0, 
    "probeType": "NONE", 
    "probesSent": 0, 
    "respRcvd": 0, 
    "sTag": 0, 
    "state": "DOWN", 
    "stateDown": 0, 
    "stateUp": 0, 
    "threshold": 0
  }, 
  "icmpResponder": {
    "enabled": false, 
    "ipAddress": "0.0.0.0", 
    "mode": "CONDITIONAL", 
    "reqRcvd": 0, 
    "respSent": 0, 
    "state": "DOWN"
  }
}
vcadmin@vcg1-example:~$

When the ICMP responder is enabled, the DOWN state means that there are no Edges connected to the Gateway.

Monitor BGP Sessions

The debug.py command option bgp_view_summary provides information about the state of BGP neighbors and the prefixes learned, to verify that BGP is up and exchanging prefixes. Use the following command to verify that BGP neighborships are established.

vcadmin@vcg1-example:~$ /opt/vc/bin/debug.py --bgp_view_summary | grep Established | wc -l
6
vcadmin@vcg1-1:~$

If the BGP sessions are down, check whether the Gateway is properly connected to the Orchestrator.

Monitor Core Files

When a service crashes on the Gateway, a core file is generated. As soon as possible after a core file is generated, retrieve a diagnostic bundle from the Orchestrator to capture the core file and the associated logs for Arista Support.

The following example illustrates a Python script to check for recent core files:

#! /usr/bin/env python
import subprocess
import traceback
import os
import os.path
import glob
import datetime
import time
import sys
import re
import json
import commands
from pynag.Plugins import PluginHelper, ok, warning, critical, unknown
helper = PluginHelper()
helper.parse_arguments()


def diag_check():
    # Match any diag upload from the 21st century, not just years 201x.
    re_diag = re.compile(r"^.*\s+Uploading diag-20[0-9][0-9]-.*", re.M)
    cmd = 'grep "Uploading diag-20[0-9][0-9]" /var/log/mgd.log'
    p1 = subprocess.Popen([cmd], stdout=subprocess.PIPE,
                          stderr=subprocess.PIPE, shell=True)
    stdout_value, stderr_value = p1.communicate()
    return bool(re_diag.search(stdout_value))


def vco_vcg_version():
    with open("/opt/vc/.gateway.info") as data:
        d = json.loads((data.read()))
    vcg = d["gatewayInfo"]["name"]
    # build_number=d["gatewayInfo"]["buildNumber"]
    status, output = commands.getstatusoutput(
        "sudo /opt/vc/sbin/gwd -v 2>&1 | grep rev")
    if status == 0:
        build_number = output.split()[2].rstrip('\n')
    vco = d["configuration"]["managementPlane"]["data"]["managementPlaneProxy"]["primary"]
    return vcg, build_number, vco


status_file = "/tmp/coredump_status_file"
warning_file = "/tmp/warning_file"
crashlist_file = "/tmp/crashlist.txt"
if not os.path.isfile(status_file):
    os.system("touch /tmp/coredump_status_file")
    os.system("chown nagios:nagios /tmp/coredump_status_file")

if not os.path.isfile(warning_file):
    os.system("touch /tmp/warning_file")
    os.system("chown nagios:nagios /tmp/warning_file")

if not os.path.isfile(crashlist_file):
    os.system("touch /tmp/crashlist.txt")
    os.system("chown nagios:nagios /tmp/crashlist.txt")

command = "cat /tmp/coredump_status_file"
command1 = "cat /tmp/warning_file"
files = ["crashlist.txt", "warning_file",
         "coredump_status_file", "coredump_message"]
for item in files:
    if os.path.isfile("/tmp/"+item):
        st = os.stat("/tmp/"+item)
        if st.st_uid == 0:
            commands.getstatusoutput("sudo chown nagios:nagios /tmp/"+item)

status, output = commands.getstatusoutput(command)
if output == "1":
    status_message = ""
    os.system("chown nagios:nagios /tmp/coredump_message")
    with open("/tmp/coredump_message", "r") as data:
        for line in data.readlines():
            status_message += line
    mtime = os.path.getmtime("/tmp/coredump_status_file")
    cur_time = time.time()
    if int(cur_time) - int(mtime) >= 300:
        os.system('echo -n "0" > /tmp/coredump_status_file')
    helper.status(critical)
    helper.add_summary(status_message)
    helper.exit()
    sys.exit(0)

status_message = ""
newcore = 0
try:
    crashlistpath = '/tmp/crashlist.txt'
    if not os.path.isfile(crashlistpath):
        os.system("find /velocloud/core/ -name '*core.tgz' > /tmp/crashlist.txt")

    with open(crashlistpath, "a+") as f:
        oldcrashlist = f.read()
        corelist = glob.glob("/velocloud/core/*core.tgz")
        corecount = len(corelist)
        if corecount > 0:
            for line in corelist:
                file_modified = datetime.datetime.fromtimestamp(
                    os.path.getmtime(line))
                if datetime.datetime.now() - file_modified > datetime.timedelta(hours=42*24):
                    os.remove(line)
                if not line in oldcrashlist:
                    newcore += 1
                    status_message += '\n' + "Core:" + \
                        str(newcore) + " " + line.rsplit('/', 1)[1] + " "
                    f.write(line+'\n')
                    cmd1 = "tar -xvf " + \
                        line.rstrip(
                            '\n') + " -C /tmp  --wildcards --no-anchored '*.txt' "
                    crash = subprocess.Popen(
                        cmd1, shell=True, stdout=subprocess.PIPE)
                    crash.wait()
                    for line1 in crash.stdout:
                        btcmd = "awk '/^Thread 1 /,/^----/' /tmp/" + \
                            line1.rstrip(
                                '\n') + " | egrep '^#' | sed 's/ 0x0.* in //' | sed 's/ (.*/ /'"
                        bt = subprocess.Popen(
                            btcmd, shell=True, stdout=subprocess.PIPE)
                        status_message += '\n' + bt.communicate()[0]
        else:
            helper.status(ok)
            status_message = "No Core file"
            f.close()

except Exception as e:
    traceback.print_exc()
    helper.exit(summary="Nagios check could not complete",
                long_output=str(e), exit_code=unknown, perfdata='')

if corecount and not newcore:
    helper.status(ok)
    status_message = str(corecount) + " old core file found in /velocloud/core"
    os.system('echo -n "0" > /tmp/coredump_status_file')

elif newcore > 0:
    output = vco_vcg_version()
    vcg_data = "%s;    VCG_Build_Number:%s;     VCO:%s\n" % (output)
    status_message = vcg_data + str(newcore) + " New Core\n" + status_message
    with open("/tmp/coredump_message", "w") as data:
        data.writelines(status_message)
    os.system('echo -n "1" > /tmp/warning_file')
    os.system('echo -n "1" > /tmp/coredump_status_file')
    helper.status(critical)
    helper.add_summary(status_message)
    helper.exit()
    sys.exit(0)

status, output_warn = commands.getstatusoutput(command1)
if output_warn == "1":
    helper.status(warning)
    status_message = "Please generate gateway diag bundle from the VCO if required"
    result = diag_check()
    if not result:
        if not os.path.isfile("/tmp/coredump_start_time"):
            os.system("touch /tmp/coredump_start_time")
            os.system("chown nagios:nagios /tmp/coredump_start_time")
            start_time = time.time()
            with open("/tmp/coredump_start_time", "w") as data:
                data.write(str(start_time))
        end_time = time.time()
        status, start_time = commands.getstatusoutput("cat /tmp/coredump_start_time")
        total_time = end_time - float(start_time)
        # Consider the crash handled after three hours (10800 seconds).
        if total_time > 10800:
            result = True
    if result:
        os.system('echo -n "0" > /tmp/warning_file')
        if os.path.isfile("/tmp/coredump_start_time"):
            os.remove("/tmp/coredump_start_time")
        helper.status(warning)
        status_message = "Please generate the diag bundle for the last crash. If it has already been generated, ignore this message."

helper.add_summary(status_message)
helper.exit()

Capacity of Gateway Components

Use the CLI commands to check the capacity of the Gateway components and verify that the configured values do not exceed the supported values to ensure seamless performance.

For additional information on the capacity of different components and the supported values for each component, refer to the VeloCloud SD-WAN Performance and Scale Datasheet published at the Partner Connect Portal. To access the datasheet, you must log into the Partner Connect Portal using your Partner credentials (username and password).

Monitor Packet Processing Queue

The packet processing engine on the Gateway involves multiple stages, with a packet processing queue between each stage. Due to the bursty nature of traffic through a Gateway, occasional packet buildup in the packet forwarding queues is expected. However, a consistently high queue length in certain queues indicates a capacity problem.

The following example shows the output of the debug.py command used to view the handoff queues.

The output has been truncated to display only the first and last entries for conciseness. You can exclude the -v option from the command to view the output in tabular format.

vcadmin@vcg1-example:~$ /opt/vc/bin/debug.py -v --handoff
{
  "handoffq": [
    {
      "deq": 12126489784,
      "drops": 0,
      "enq": 12126482089,
      "name": "vc_queue_net_sch",
      "qlength": 0,
      "qlimit": 4096,
      "sleeping": 1475174572,
      "tid": 1502,
      "wmark": 1280,
      "wmark_1min": 385,
      "wmark_5min": 450,
      "wokenup": 1164664965
    }, 
    … 
    {
      "deq": 767292,
      "drops": 0,
      "enq": 767272,
      "name": "vc_queue_ip_common_bh_1",
      "qlength": 0,
      "qlimit": 16384,
      "sleeping": 596612,
      "tid": 1512,
      "wmark": 53,
      "wmark_1min": 3,
      "wmark_5min": 3,
      "wokenup": 596209
    }
  ]
}
vcadmin@vcg1-example:~$

Note the values of qlength and wmark.

The qlength column indicates the number of packets currently buffered in the queue. The wmark column indicates the maximum depth the queue has ever reached, which shows how close the Gateway has come to dropping packets. The impact and remediation for these depend largely on the queue being monitored.

You should monitor both the critical and non-critical queues.
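
Because debug.py can emit the handoff data as JSON, queue health checks are easy to automate. The following is a sketch that flags queues whose recent watermark approaches the queue limit; the 80% ratio is an illustrative assumption, not a documented threshold:

```python
#!/usr/bin/env python3
"""Sketch: flag handoff queues whose 5-minute watermark nears the queue limit."""
import json


def deep_queues(handoff_json, ratio=0.8):
    """Return names of queues where wmark_5min exceeds ratio * qlimit."""
    data = json.loads(handoff_json)
    return [q["name"] for q in data["handoffq"]
            if q["qlimit"] and q["wmark_5min"] > ratio * q["qlimit"]]
```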

Netif and Per-Core Queues

The high watermark level in these queues indicates that the packet processing rate is lower than the incoming rate.

The following example shows output of the dispcnt -p netif -s len -s wmark -d vcgw.com command.

# dispcnt -p netif -s len -s wmark -d vcgw.com

Wed Jul 20 09:03:39 2022
netif_queue0_len                         = 0           	0	/s
netif_queue0_wmark                       = 0           	0	/s
netif_queue0_wmark_1min                  = 0           	0	/s
netif_queue0_wmark_1s                    = 0           	0	/s
netif_queue0_wmark_5min                  = 0           	0	/s
netif_queue1_len                         = 0           	0	/s
netif_queue1_wmark                       = 45          	22	/s
netif_queue1_wmark_1min                  = 3           	1	/s
netif_queue1_wmark_1s                    = 1           	0	/s
netif_queue1_wmark_5min                  = 3           	1	/s

In the output, the first column is the count and the second column is the rate per second. On the first iteration, the first column displays the total (lifetime) count since the counter started, and the second column displays that total divided by the two-second sample interval. Because the data refreshes every two seconds, subsequent iterations show the count since the last sample in the first column and the per-second rate in the second column.

The following example shows output of the dispcnt -p per_core -s len -s wmark -d vcgw.com command.

# dispcnt -p per_core -s len -s wmark -d vcgw.com

Wed Jul 20 09:11:05 2022
per_core_queue0_len                      = 0           	0	/s
per_core_queue0_wmark                    = 1476        	738	/s
per_core_queue0_wmark_1min               = 91          	45	/s
per_core_queue0_wmark_1s                 = 34          	17	/s
per_core_queue0_wmark_5min               = 346         	173	/s
per_core_queue1_len                      = 0           	0	/s
per_core_queue1_wmark                    = 1216        	608	/s
per_core_queue1_wmark_1min               = 47          	23	/s
per_core_queue1_wmark_1s                 = 27          	13	/s
per_core_queue1_wmark_5min               = 64          	32	/s

View Non-Critical Queues

High queue lengths in the non-critical queues are less common and less likely to impact customers.

The following are the non-critical queues that can be monitored.

vc_queue_vcmp_init – This queue carries VCMP tunnel initiation messages for new tunnel setup. The Gateway throttles incoming tunnel requests to the maximum rate it can handle without disrupting existing traffic, based on the available cores. As a result, high queue length is expected in this queue on a Gateway with many tunnels.

Packet buildup in this queue should occur in large bursts following a specific event, such as a Gateway restart or transit interruption, and there should be no drops during normal operation.

vc_queue_vcmp_ctrl_0 and vc_queue_vcmp_ctrl_1 – These queues carry VCMP tunnel management control messages received on existing tunnels, including route updates, path state updates, heartbeats, statistics, QoS sync, and tunnel information.

Almost all control messages, such as route updates, have built-in retry mechanisms to account for these drops.

vc_queue_ike – This queue processes IKE protocol messages to manage keys and other encryption session state.

This traffic is generally low volume, so packet buildup is unlikely here. If drops occur, IKE messages are retried.

Monitor Throughput Performance

While the handoff queues are the ideal way to monitor from a capacity perspective, it can also be useful to monitor throughput.

For many providers, the monitoring of throughput occurs on the Hypervisor and is outside the scope of the Gateway.

For providers who want to monitor on the Gateway, the following example illustrates how to get the RX and TX byte counts, to make delta calculations over a period to measure the throughput.

Note: By default, DPDK is enabled.
vcadmin@vcg34-1:~$ sudo /opt/vc/bin/getcntr -c dpdk_eth0_pstat_ibytes -d vcgw.com
1895744358
vcadmin@vcg34-1:~$ sudo /opt/vc/bin/getcntr -c dpdk_eth0_pstat_obytes -d vcgw.com
1865866321
vcadmin@vcg34-1:~$ sudo /opt/vc/bin/getcntr -c dpdk_eth1_pstat_ibytes -d vcgw.com
33233362
vcadmin@vcg34-1:~$ sudo /opt/vc/bin/getcntr -c dpdk_eth1_pstat_obytes -d vcgw.com
29843320
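
The delta calculation described above can be sketched as follows. The counter names follow the examples; the read_counter wrapper around getcntr is an assumption based on the invocations shown, and the conversion arithmetic is kept separate so it is testable on its own:

```python
#!/usr/bin/env python3
"""Sketch: estimate interface throughput from two samples of getcntr byte counters."""
import subprocess


def read_counter(counter, cmd_prefix=("sudo", "/opt/vc/bin/getcntr", "-c")):
    """Read one counter value, e.g. dpdk_eth0_pstat_ibytes (untested wrapper)."""
    out = subprocess.run(cmd_prefix + (counter, "-d", "vcgw.com"),
                         capture_output=True, text=True, check=True)
    return int(out.stdout.strip())


def throughput_mbps(bytes_before, bytes_after, interval_s):
    """Convert a byte-count delta over an interval into megabits per second."""
    return (bytes_after - bytes_before) * 8 / (interval_s * 1_000_000)
```

Sampling each counter twice with a fixed sleep in between, then feeding the deltas to throughput_mbps, yields per-interface RX and TX rates.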

The actual throughput capacity might vary based on the number of connected Edges, encryption mix, and average packet size. The handoff queues provide a clear picture of the Gateway performance relative to its capacity.

For supported value of maximum throughput, refer to the VeloCloud SD-WAN Performance and Scale Datasheet published at the Partner Connect Portal. To access the datasheet, you must log into the Partner Connect Portal using your Partner credentials (username and password).

View Connected Edges

The following example shows the number of connected Edges. It is recommended to keep the tunnel count below the supported value to reduce the CPU load and the recovery time following a restart.

vcadmin@vcg1-example:~$ /opt/vc/bin/debug.py --list_edges 2
{
  "vceCount": 156
}
vcadmin@vcg1-example:~$

If the number of connected Edges approaches the supported value for a Gateway, move customers to alternate Gateways to reduce the Edge count. If the number of connected Edges exceeds the supported value, treat this movement of Edges to alternate Gateways as critical.

Note: A single VeloCloud Gateway deployed with 4 cores supports a maximum of 2000 connected Edges. A Gateway deployed with 8 cores supports a maximum of 4000 connected Edges. You need more Gateways in your pool for horizontal scale if you reach this limit.

Monitor Tunnel Count

You can monitor tunnel count with thresholds that provide warning or critical states which indicate potential issues prior to impacting services.

The following table lists the threshold values of tunnel count and recommended actions.

Table 3. Threshold Values
Threshold State: Warning/Critical

Threshold Value:
  • Gateway with 4 cores and 32 GB of RAM: 3000 tunnels (with or without certificate)
  • Gateway with 8 cores and 32 GB of RAM: 6000 tunnels (with or without certificate)

Recommended Corrective Action:
  1. If the tunnel count crosses the warning or critical threshold:
    1. Collect a diagnostic bundle.
    2. Check the stale tunnel count thresholds and perform the corresponding actions listed for stale tunnels.
    3. If stale tunnels are within the warning threshold, quiesce the Gateway, add a new Gateway to the same Gateway pool, and load balance new Edge connections to another Gateway.
    4. If memory usage crosses the critical threshold, restart the services on the Gateway.
  2. If stale tunnels are observed beyond the critical threshold:
    1. Collect a core dump and a diagnostic bundle.
    2. Restart the services on the Gateway.
    3. Open a support case with Arista.

The following table lists the threshold values of stale tunnel count and recommended actions.

Table 4. Threshold Values
Warning: 10% of the total tunnel count for a duration of 300 seconds
Critical: 25% of the total tunnel count for a duration of 300 seconds

Recommended Corrective Action:
  1. If the stale tunnel count crosses the warning or critical threshold, collect a diagnostic bundle.
  2. If the stale tunnel count crosses the critical threshold, open a support case with Arista.
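
The stale tunnel thresholds can be expressed as a simple classifier. The following is a sketch; the 10% warning and 25% critical levels come from the table above, while sampling over the 300-second window is left to the caller:

```python
#!/usr/bin/env python3
"""Sketch: classify a stale tunnel sample against the warning/critical thresholds."""


def stale_tunnel_state(stale_count, total_count, warn_pct=10.0, crit_pct=25.0):
    """Return OK, WARNING, or CRITICAL for one sample of tunnel counts."""
    if total_count == 0:
        return "OK"
    pct = 100.0 * stale_count / total_count
    if pct >= crit_pct:
        return "CRITICAL"
    if pct >= warn_pct:
        return "WARNING"
    return "OK"
```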

Monitor Path Stability

You can monitor the status of unstable tunnel count to determine the path stability.

The following table lists the threshold values of unstable tunnel count and recommended actions.

Table 5. Threshold States
Warning: 25% of total tunnels in an unstable state for 5 minutes
Critical: 25% of total tunnels in an unstable state for 10 minutes

Recommended Corrective Action:
  1. If the unstable tunnel count crosses the warning or critical threshold:
    • Collect a diagnostic bundle.
    • Load balance new Edges to a different Gateway.
  2. If the unstable tunnel count crosses the critical threshold:
    • Open a high-priority support case with Arista and attach the diagnostic bundle.

View BGP-enabled VRFs

The following example shows the number of BGP-enabled VRFs.

vcadmin@vcg1-example:~$ /opt/vc/bin/debug.py --vrf | grep "my_asn" | wc -l
0
vcadmin@vcg1-example:~$

If the number of BGP-enabled VRFs exceeds the maximum supported value, customers should be moved to alternate Gateways to reduce the number.

For supported values, refer to the VeloCloud SD-WAN Performance and Scale Datasheet published at the Partner Connect Portal. To access the datasheet, you must log into the Partner Connect Portal using your Partner credentials (username and password).

View Gateway Routes

The following example shows the number of routes.
vcadmin@vcg1-example:~$ sudo /opt/vc/bin/getcntr -c memb.mod_gw_route_t.obj_cnt -d gwd-mem 
8262
vcadmin@vcg1-example:~$

If the number of route entries exceeds the maximum supported value, customers should be moved to alternate Gateways to reduce the route count.

For supported values, refer to the VeloCloud SD-WAN Performance and Scale Datasheet published at the Partner Connect Portal. To access the datasheet, you must log into the Partner Connect Portal using your Partner credentials (username and password).

View Gateway Flows

The number of flows supported by a Gateway is determined by the system memory. The maximum supported flow count is written to the log during startup.

The following example shows the log of maximum supported flows:

ERROR  [MAIN] gwd_get_max_flow_supported:35 Flow Admission: GWD 
Max flow supported: 1929780 soft limit:1157820 hard limit:1736730
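If the log line is still available, the three limits can be extracted with standard text tools. The sketch below parses the sample line shown above; on a real Gateway the line would be grepped out of the gwd log rather than hard-coded.

```shell
#!/bin/sh
# Extract the flow limits from the startup log line. The line below is the
# sample from the documentation, hard-coded for illustration.
line='ERROR  [MAIN] gwd_get_max_flow_supported:35 Flow Admission: GWD Max flow supported: 1929780 soft limit:1157820 hard limit:1736730'

max=$(echo "$line"  | sed -n 's/.*Max flow supported: \([0-9]*\).*/\1/p')
soft=$(echo "$line" | sed -n 's/.*soft limit:\([0-9]*\).*/\1/p')
hard=$(echo "$line" | sed -n 's/.*hard limit:\([0-9]*\).*/\1/p')

echo "max=$max soft=$soft hard=$hard"
```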

If logs have rolled over, use the following table as reference:

Table 6. Maximum Supported Flows
Gateway Memory (GB)   Max Number of Flows   Critical Number of Flows (90% of max)
4                     245760                221184
8                     491520                442368
16                    983040                884736
32                    1966080               1769472
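The values in Table 6 follow a simple pattern: the maximum flow count scales linearly with memory, at 61,440 flows per GB (a factor inferred from the table, not a documented constant), and the critical level is 90% of the maximum. A quick sketch to recompute the table:

```shell
#!/bin/sh
# Recompute Table 6. The 61,440 flows/GB factor is inferred from the
# published table values, not taken from product documentation.
max_flows()      { echo $(( $1 * 61440 )); }
critical_flows() { echo $(( $1 * 61440 * 9 / 10 )); }

for gb in 4 8 16 32; do
    printf '%2s GB: max=%s critical=%s\n' "$gb" "$(max_flows "$gb")" "$(critical_flows "$gb")"
done
```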

If the flow count reaches the critical limit, the system should be investigated for a possible flow leak.

Use the following command to view the current number of flow objects in the system:

vcadmin@vcg1-example:~$ sudo /opt/vc/bin/getcntr -c memb.mod_mp_flow_t.obj_cnt -d gwd-mem

If the flows are determined to be invalid, a diagnostic bundle should be generated before restarting the Gateway service to clear the stale flows. If the flows are determined to be valid, then the customers should be moved to alternate Gateways to reduce the flow count.

The following table lists the threshold values and recommended actions for flow count.

Table 7. Threshold States
Threshold State   Threshold Value
Warning           50% of 1.9 million flows
Critical          75% of 1.9 million flows

Recommended Corrective Action
  1. If the total flow count crosses the warning or critical threshold:
    • Collect a diagnostic bundle.
    • Check the stale flow count thresholds and perform the corresponding actions listed for stale flows.
    • If the stale flow count is within the warning threshold, check the top consumers from the flow table.
    • If a DoS attack is suspected, disable any peer that is creating a lot of rogue flows.
    • Check if any Enterprise is consuming most of the flow entries. This information can be used to load balance Edges in the Enterprise.
    • If memory usage crosses the critical threshold, perform the actions specified for the memory metrics.
  2. If the flow count crosses the critical threshold:
    • Open a high-priority support case with Arista, along with the diagnostic bundle.
    • Restart the services on the Gateway.
    • Use the following command to check the peers with a high flow count: /opt/vc/bin/vc_top_peers.sh -t flow

The following table lists the threshold values and recommended actions for stale flow count.

Table 8. Threshold States
Threshold State   Threshold Value
Warning           10%
Critical          25%

Recommended Corrective Action
  1. If the stale flow count crosses the warning or critical threshold:
    • Collect a diagnostic bundle.
    • Check if a small set of Edges is contributing to these stale flows.
  2. If the stale flow count crosses the critical threshold:
    • Open a high-priority support case with Arista, along with the diagnostic bundle.
    • Restart the services on the Gateway.
  3. If the same issue occurs multiple times on the same Gateway or is observed on different Gateways, mark the already created support case as critical.

View NAT Entries

If the number of free NAT entries is critically low, the system should be investigated for a possible leak.

vcadmin@vcg1-example:~$ sudo /opt/vc/bin/getcntr -c natd.nat_shmem_free_entries -d vcgwnat.com
993408
vcadmin@vcg1-example:~$

Reboot the Gateway to clear all assigned NAT entries. Restarting the services has no effect on NAT entries.
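One way to spot a leak is to sample the free-entry counter periodically and flag a sustained decrease. A minimal sketch, using invented sample values in place of two real getcntr readings of natd.nat_shmem_free_entries:

```shell
#!/bin/sh
# Leak check sketch for NAT entries: compare two samples of the free-entry
# counter. The values below are invented for illustration; on a Gateway each
# sample would come from:
#   sudo /opt/vc/bin/getcntr -c natd.nat_shmem_free_entries -d vcgwnat.com
free_before=993408
free_after=901120

if [ "$free_after" -lt "$free_before" ]; then
    echo "free NAT entries dropped by $(( free_before - free_after )); investigate for a leak"
else
    echo "free NAT entries stable"
fi
```

A single decrease is not conclusive; a leak shows up as a steady downward trend across many samples while traffic load stays roughly constant.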

For supported values, refer to the VeloCloud SD-WAN Performance and Scale Datasheet published at the Partner Connect Portal. To access the datasheet, you must log into the Partner Connect Portal using your Partner credentials (username and password).

The following table lists the threshold values of NAT entries.

Table 9. Threshold States
Threshold State   Threshold Value
Warning           50% of 900K NAT entries
Critical          75% of 900K NAT entries

Recommended Corrective Action
  1. If the total NAT count crosses the warning or critical threshold:
    • Collect a diagnostic bundle.
    • Check the stale NAT count thresholds and take the corresponding actions listed for the stale NAT count.
    • If the stale NAT count is within the warning threshold, check the top consumers from the NAT table.
    • If a DoS attack is suspected, disable any peer that is creating a lot of NAT entries.
    • Check if any Enterprise is consuming most of the NAT entries. This information can be used to load balance Edges in the Enterprise.
    • If all tenants are using NAT entries more or less equally and memory usage crosses the critical threshold, restart the services on the Gateway.
  2. If the NAT count crosses the critical threshold:
    • Open a high-priority support case with Arista, along with the diagnostic bundle.
    • Restart the NAT services on the Gateway and check if the issue is fixed. If not, restart all Gateway services.
  3. Run the following command to check the peers with a high NAT count: /opt/vc/bin/vc_top_peers.sh -t nat

The following table lists the threshold values of stale NAT entries.

Table 10. Threshold Values of Stale NAT Entries
Threshold State   Threshold Value
Warning           10%
Critical          25%

Recommended Corrective Action
  1. If the stale NAT count crosses the warning or critical threshold:
    • Collect a diagnostic bundle.
    • Check if a small set of Edges is contributing to these stale NAT entries.
  2. If the stale NAT count crosses the critical threshold:
    • Open a high-priority support case with Arista, along with the diagnostic bundle and the output of /opt/vc/bin/debug.py --stale_nat_dump.
    • Restart the NAT services on the Gateway and check if the issue is fixed. If not, restart all Gateway services.
  3. If the same issue occurs multiple times on the same Gateway or is observed on different Gateways, mark the already created support case as critical.

Monitor Over Capacity Drops

Admission Control is a mechanism by which incoming data packets are dropped when the system is over capacity. This throttling helps ensure that the system has enough resources to process the packets it has already buffered. Admission control is applied only to data packets.

To check if there are any over capacity drops, use the following commands:

root@spperf-gateway-1:~# dispcnt -s over_capacity_drop

over_capacity_drop = 1461980	0	/s
root@gateway-1:~# dispcnt -s over_capacity_drop -d vcgw.com

Fri Dec 17 11:12:25 2021
over_capacity_drop                       = 0           	0	/s
root@gateway-1:~# dispcnt  -s natd.shmem_oom -s natd.port_assign_fail -d vcgwnat.com

Fri Dec 17 11:12:44 2021
natd.port_assign_fail                    = 0           	0	/s
natd.shmem_oom                           = 0           	0	/s
root@gateway-1:~# dispcnt -p netif -s tx_drop -s rx_drop -d vcgw.com

Fri Dec 17 11:13:04 2021
netif_eth0_rx_dropped                    = 0           	0	/s
netif_eth0_tx_dropped                    = 0           	0	/s
netif_eth1_rx_dropped                    = 0           	0	/s
netif_eth1_tx_dropped                    = 0           	0	/s

To monitor the capacity of flows, run the following command:

root@gateway-1:~# dispcnt -s flow_admisison_limit_hit

To monitor over capacity issues on NAT entries, run the following command:

root@gateway-1:~# dispcnt -s natd.shmem_oom -s natd.port_assign_fail -d vcgwnat.com

Fri Dec 17 11:12:44 2021
natd.port_assign_fail                    = 0           	0	/s
natd.shmem_oom                           = 0           	0	/s

The following table lists the threshold values and recommended actions for overcapacity drops.

Table 11. Threshold Values for Overcapacity Drops
Threshold State: Warning
Threshold Value: 500 drops per 30 seconds (absolute count)

A warning alert is triggered when the drops remain above the threshold value consistently for 5 minutes.

When the drops cross the warning threshold:
  • Collect a Gateway diagnostic bundle.
  • Check if a CPU-intensive system event is causing the packet drops.
  • Check the flow metrics against the following limits:
    • Maximum flows: 1.9M
    • NAT entries: 960K
    • Routes: 1M for a shared Gateway, 100K for a single enterprise
  • If throughput consistently bursts to 2 Gbps every 60 minutes or less, quiesce the Gateway and add a new Gateway to increase the capacity.
  • If any of the scale metrics has reached 90% of its maximum limit, quiesce the Gateway and add a new Gateway to increase the capacity.

Threshold State: Critical
Threshold Value: 1000 drops per 30 seconds (absolute count)

A critical alert is triggered when the drops remain above the threshold value consistently for 5 minutes.

When the drops cross the critical threshold:
  • Collect a Gateway diagnostic bundle.
  • Monitor throughput continuously for the next 15 minutes.

If the drops do not stabilize:
  • Quiesce the Gateway and add a new Gateway to increase the capacity.
  • Check for top talker Enterprises to move to the new Gateway.
  • At this stage, the Gateway is already impacting user experience. After identifying the top talker Enterprises, rebalance the Edges immediately.
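The thresholds above apply to the per-30-second delta of the over_capacity_drop counter, not its absolute value. A sketch of the classification, using invented counter samples in place of two dispcnt readings taken 30 seconds apart:

```shell
#!/bin/sh
# Classify the over_capacity_drop delta per 30-second window against the
# Table 11 thresholds (warning 500, critical 1000). The sample readings
# are invented; on a Gateway they would come from successive runs of
# dispcnt -s over_capacity_drop.
classify_drops() {
    delta=$(( $2 - $1 ))   # drops during the 30-second window
    if [ "$delta" -ge 1000 ]; then
        echo "critical"    # sustained 5 min -> critical alert
    elif [ "$delta" -ge 500 ]; then
        echo "warning"     # sustained 5 min -> warning alert
    else
        echo "ok"
    fi
}

classify_drops 1461980 1462700   # 720 drops in the 30-second window
```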

Monitor Latency Threshold for Paths

Whenever the latency threshold values are changed for an Edge, all the tunnels to the corresponding Gateway inherit the same threshold values. The debug command debug.py -v --path can be used to check the values.

Below is the sample output:

"pi_info": {
      "connected": 2,
      "num_ha_takeover": 0,
      "priv_ip": "169.254.129.4",
      "profile": "0/0",
      "qoe_latency_threshold": {
        "trans_red_latency_ms": 100,
        "trans_yellow_latency_ms": 100,
        "video_red_latency_ms": 50,
        "video_yellow_latency_ms": 10,
        "voice_red_latency_ms": 62,
        "voice_yellow_latency_ms": 22
      }

The threshold values are not synced with other Edges or Hubs.
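Individual threshold values can be pulled out of this output with standard text tools. The sketch below parses a trimmed copy of the sample above; on a Gateway the output of debug.py -v --path would be piped in instead of the here-doc.

```shell
#!/bin/sh
# Extract one latency threshold from debug.py --path style output.
# The here-doc is a trimmed copy of the documented sample output.
voice_red=$(sed -n 's/.*"voice_red_latency_ms": \([0-9]*\).*/\1/p' <<'EOF'
"qoe_latency_threshold": {
  "trans_red_latency_ms": 100,
  "voice_red_latency_ms": 62
}
EOF
)
echo "voice_red_latency_ms=$voice_red"
```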