Monitor Gateways using CLI
You can use CLI commands to check the status of Gateways in an Orchestrator.
Monitor System Health
You can use CLI commands to check the status of the Gateways, the software version, CPU and memory usage, and other information.
Monitor Gateway Activation State
Use the following command to check if the Gateway is activated on an Orchestrator.
In the following example, the Gateway is activated:
vcadmin@vcg1-example1:~$ /opt/vc/bin/is_activated.py
True
vcadmin@vcg1-example1:~$
In the following example, the Gateway is deactivated:
vcadmin@vcg1-example1:~$ /opt/vc/bin/is_activated.py
False
vcadmin@vcg1-example1:~$
View Activated Orchestrator Name
Use the following command to locate the Orchestrator for the Gateway, provided the Gateway is activated.
vcadmin@vcg1-example1:~$ /opt/vc/bin/getpolicy.py
managementPlane.data.managementPlaneProxy.primary
"vco1-example1.velocloud.net"
vcadmin@vcg1-example1:~$
View Software Version
The various VeloCloud processes in the system display version numbers which should all be identical.
Review the version numbers using the following commands:
root@NY-GATEWAY-1:~# /opt/vc/sbin/gwd -v
VCG Info
========
Version: 4.2.0
Build rev: R420-20201216-GA-0bcea3f6f0
Build Date: 2020-12-16_23-23-33
Build Hash: 0bcea3f6f0e6b8c21260187bb2d953e4cefd7f27
root@NY-GATEWAY-1:~# /opt/vc/sbin/natd -v
NATd Info
========
Version: 4.2.0
Build rev: R420-20201216-GA-0bcea3f6f0
Build Date: 2020-12-16_23-23-33
Build Hash: 0bcea3f6f0e6b8c21260187bb2d953e4cefd7f27
root@NY-GATEWAY-1:~# /opt/vc/sbin/mgd -v
VeloCloud gateway 4.2.0 build R420-20201216-GA-0bcea3f6f0
View NTP Time Zone
Use the following command to view the NTP time zone. The Gateway time zone must be set to Etc/UTC.
vcadmin@vcg1-example:~$ cat /etc/timezone
Etc/UTC
vcadmin@vcg1-example:~$
If the time zone is incorrect, use the following commands to update the time zone.
echo "Etc/UTC" | sudo tee /etc/timezone
sudo dpkg-reconfigure --frontend noninteractive tzdata
View NTP Offset
Use the following command to view the NTP offset, which must be less than or equal to 15 milliseconds.
vcadmin@vcg1-example:~$ sudo ntpq -p
remote refid st t when poll reach delay offset jitter
==============================================================================
*ntp1-us1.prod.v 74.120.81.219 3 u 474 1024 377 10.171 -1.183 1.033
ntp1-eu1-old.pr .INIT. 16 u - 1024 0 0.000 0.000 0.000
vcadmin@vcg1-example:~$
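This check can also be scripted. The following is a minimal sketch (not part of the Gateway software) that parses the offset column of sudo ntpq -p for the currently selected peer (the line marked with *) and flags an offset larger than 15 milliseconds:
#!/usr/bin/env python
# Minimal sketch (hypothetical helper, not shipped with the Gateway):
# flag an NTP offset above 15 ms on the selected peer.
import subprocess
import sys

OFFSET_LIMIT_MS = 15.0

output = subprocess.check_output(["sudo", "ntpq", "-p"], universal_newlines=True)
offset_ms = None
for line in output.splitlines():
    if line.startswith("*"):
        # Columns: remote refid st t when poll reach delay offset jitter
        offset_ms = abs(float(line.split()[8]))
        break

if offset_ms is None:
    print("WARNING - no selected NTP peer")
    sys.exit(1)
if offset_ms > OFFSET_LIMIT_MS:
    print("CRITICAL - NTP offset %.3f ms exceeds %.1f ms" % (offset_ms, OFFSET_LIMIT_MS))
    sys.exit(2)
print("OK - NTP offset %.3f ms" % offset_ms)
sys.exit(0)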
If the offset is out of range, use the following commands to stop NTP, step the clock, and restart NTP.
sudo systemctl stop ntp
sudo ntpdate <server>
sudo systemctl start ntp
Monitor Disk Usage
Use the following command to check the available disk space. Ensure that the disk has at least 16 GB of free space to store critical files such as logs and cores.
vcadmin@vcg1-example:~$ sudo df -kh --total | grep total | awk '{print $4}'
77G
vcadmin@vcg1-example:~$
The common places for disk usage to build up are /var/log, /velocloud/core, and /tmp.
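To see where the space is going, the following minimal sketch (not part of the Gateway software) reports the free space on the root filesystem against the 16 GB requirement above and the size of each of these directories; adjust the path if /velocloud is a separate mount:
#!/usr/bin/env python
# Minimal sketch (not shipped with the Gateway): warn when free disk space
# drops below 16 GB and report the usual growth areas.
import os
import subprocess

MIN_FREE_BYTES = 16 * 1024 ** 3  # 16 GB of free space for logs and cores

st = os.statvfs("/")
free_bytes = st.f_bavail * st.f_frsize
print("Free space on /: %.1f GB" % (free_bytes / float(1024 ** 3)))
if free_bytes < MIN_FREE_BYTES:
    print("WARNING - less than 16 GB free")

# The common places for disk usage to build up, per the text above.
for path in ("/var/log", "/velocloud/core", "/tmp"):
    if os.path.isdir(path):
        print(subprocess.check_output(["sudo", "du", "-sh", path],
                                      universal_newlines=True).strip())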
Monitor CPU Usage
The Gateway processes bursts of traffic, so bursts of high CPU usage are expected. Monitor the Gateway for CPU cores pegged at 100%. Note, however, that the DPDK cores run in poll mode for performance reasons and are expected to run close to 100% CPU at high throughput.
You can monitor a Gateway with thresholds that provide warning or critical states which indicate potential issues prior to impacting services. The following table lists the threshold values and recommended actions.
| Threshold State | Threshold Value (DP Core) | Threshold Value (Non-DP Core) | Recommended Corrective Action |
|---|---|---|---|
| Warning | 95% | 80% | If the threshold value is crossed consistently for 5 minutes: |
| Critical | 98% | 90% | If the threshold value is crossed consistently for 5 minutes: If the issue is observed for one hour: |
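The following example illustrates a Python script that checks for CPU cores spinning at 100%, while allowing for the DPDK poll-mode core: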
#! /usr/bin/env python
"""
Check for CPUs spinning at 100%
"""
import re
import collections
import time
import sys
import json
import os
import subprocess

re_cpu = re.compile(r"^cpu\d+\s")
CPUStat = collections.namedtuple('CPUStat', ['user', 'nice', 'sys', 'idle'])


def get_stats():
    # Read the per-CPU jiffies counters from /proc/stat.
    stats = open("/proc/stat").readlines()
    ret = {}
    for s in stats:
        if not re_cpu.search(s):
            continue
        s = s.split()
        ret[s[0]] = CPUStat(*[int(v) for v in s[1:5]])
    return ret


def verify_dpdk_support():
    if os.path.isfile('/opt/vc/etc/dpdk.json'):
        with open("/opt/vc/etc/dpdk.json") as data:
            d = json.loads(data.read())
        if "status" in d.keys():
            return d['status'] == "Supported"
    return False


def another_verify_dpdk_support():
    if os.path.isfile('/opt/vc/bin/debug.py'):
        f = subprocess.check_output(
            ["/opt/vc/bin/debug.py", "--dpdk_ports_dump"])
        x = [r.split() for r in f.split('\n')]
        if len(x) <= 1:
            return False
        else:
            return True
    else:
        return False


dpdk_status = verify_dpdk_support() or another_verify_dpdk_support()

if __name__ == "__main__":
    try:
        stat1 = get_stats()
        time.sleep(3)
        stat2 = get_stats()
    except:
        print "UNKNOWN - failed to get CPU stat: %s" % str(sys.exc_info()[1])
        sys.exit(3)
    # A CPU whose idle counter did not advance during the sample is spinning.
    busy_cpu_set = [cpu for cpu in stat1 if (
        stat2[cpu].idle - stat1[cpu].idle) == 0]
    if not busy_cpu_set:
        print "OK - no spinning CPUs"
        sys.exit(0)
    if dpdk_status == True:
        # cpu1 is the DPDK poll-mode core and is expected to run at 100%.
        if "cpu1" in busy_cpu_set and len(busy_cpu_set) == 1:
            print "OK - no spinning CPUs"
            sys.exit(0)
        elif "cpu1" in busy_cpu_set:
            busy_cpu_set.remove('cpu1')
            print "CRITICAL - %s is at 100%%" % (",".join(busy_cpu_set))
            sys.exit(2)
        else:
            print "CRITICAL - %s is at 100%%" % (",".join(busy_cpu_set))
            sys.exit(2)
    else:
        print "CRITICAL - %s is at 100%%" % (",".join(busy_cpu_set))
        sys.exit(2)
Monitor Memory Usage
The main process (gwd) has its memory monitored by vc_process_monitor, which ensures that it never consumes more than 75% of available memory. As a result, monitoring for total system memory uses a warning threshold of 80% and critical threshold of 90%.
You can monitor a Gateway with thresholds that provide warning or critical states which indicate potential issues prior to impacting services. The following table lists the threshold values and recommended actions.
| Threshold State | Threshold Value | Recommended Corrective Action |
|---|---|---|
| Warning | 80% | If the memory crosses the warning threshold: Continue monitoring actively and check for increasing utilization. |
| Critical | 90% | If the memory crosses the critical threshold: If the issue is observed again: Note: Before rebalancing the Gateway, confirm that the capacity metrics are within the recommended limit. For more information on capacity metrics, see Capacity of Gateway Components. |
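The following example illustrates a Python script that checks free memory against warning and critical thresholds: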
#!/usr/bin/env python
from optparse import OptionParser
import sys

# Parse commandline options:
parser = OptionParser(
    usage="%prog -w <warning threshold>% -c <critical threshold>% [ -h ]")
parser.add_option("-w", "--warning",
                  action="store", type="string", dest="warn_threshold",
                  help="Warning threshold in absolute (MB) or percentage")
parser.add_option("-c", "--critical",
                  action="store", type="string", dest="crit_threshold",
                  help="Critical threshold in absolute (MB) or percentage")
(options, args) = parser.parse_args()


def read_meminfo():
    meminfo = {}
    for line in open('/proc/meminfo'):
        if not line:
            continue
        (name, value) = line.split()[0:2]
        meminfo[name.strip().rstrip(':')] = int(value)
    return meminfo


if __name__ == '__main__':
    if not options.crit_threshold:
        print "UNKNOWN: Missing critical threshold value."
        sys.exit(3)
    if not options.warn_threshold:
        print "UNKNOWN: Missing warning threshold value."
        sys.exit(3)
    is_warn_pct = options.warn_threshold.endswith('%')
    if is_warn_pct:
        warn_threshold = int(options.warn_threshold[0:-1])
    else:
        warn_threshold = int(options.warn_threshold)
    is_crit_pct = options.crit_threshold.endswith('%')
    if is_crit_pct:
        crit_threshold = int(options.crit_threshold[0:-1])
    else:
        crit_threshold = int(options.crit_threshold)
    # Thresholds apply to free memory, so the critical value must be lower
    # than the warning value.
    if crit_threshold >= warn_threshold:
        print "UNKNOWN: Critical percentage can't be equal to or bigger than warning percentage."
        sys.exit(3)
    meminfo = read_meminfo()
    memTotal = meminfo["MemTotal"]
    memFree = meminfo["MemFree"] + meminfo["Buffers"] + meminfo["Cached"]
    memFreePct = 100.0 * memFree / memTotal
    if (is_crit_pct and memFreePct <= crit_threshold) or (not is_crit_pct and memFree / 1024 <= crit_threshold):
        print "CRITICAL: Free memory is at %2.0f %% ( %d MB free out of %d MB total)" % (memFreePct, memFree / 1024, memTotal / 1024)
        sys.exit(2)
    if (is_warn_pct and memFreePct <= warn_threshold) or (not is_warn_pct and memFree / 1024 <= warn_threshold):
        print "WARNING: Free memory is at %2.0f %% ( %d MB free out of %d MB total)" % (memFreePct, memFree / 1024, memTotal / 1024)
        sys.exit(1)
    else:
        print "OK: Free memory is at %2.0f %% ( %d MB free out of %d MB total)" % (memFreePct, memFree / 1024, memTotal / 1024)
        sys.exit(0)
Monitor VeloCloud SD-WAN Services
Use the CLI commands to monitor the SD-WAN processes, sessions, and components.
Monitor VeloCloud SD-WAN Processes
The VeloCloud SD-WAN processes described in the Components section should be running to ensure proper functionality of the system. The Linux command pgrep can be used to identify each process; the format is slightly different for Python processes. If the process is running, a PID (integer process ID) is returned. If it is not running, the command returns empty output.
vc_procmon
Use the following command to check if vc_procmon is running on the system.
vcadmin@vcg1-example:~$ pgrep -f vc_procmon
14711
vcadmin@vcg1-example:~$
Other Processes
Use the following commands to check for other processes.
vcadmin@vcg1-example:~$ pgrep mgd
14725
vcadmin@vcg1-example:~$ pgrep gwd
15143
vcadmin@vcg1-example:~$ pgrep natd
15095
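These checks can be combined into a single monitoring probe. The following is a minimal sketch (not part of the Gateway software) that reports any of the expected processes that are not running; the process list matches the examples above:
#!/usr/bin/env python
# Minimal sketch (not shipped with the Gateway): verify the SD-WAN processes are running.
import os
import subprocess
import sys

# vc_procmon is a Python process, so it is matched on the full command line (-f).
checks = [("vc_procmon", ["pgrep", "-f", "vc_procmon"]),
          ("mgd", ["pgrep", "mgd"]),
          ("gwd", ["pgrep", "gwd"]),
          ("natd", ["pgrep", "natd"])]

devnull = open(os.devnull, "w")
# pgrep exits non-zero and prints nothing when no process matches.
missing = [name for name, cmd in checks
           if subprocess.call(cmd, stdout=devnull) != 0]

if missing:
    print("CRITICAL - not running: %s" % ", ".join(missing))
    sys.exit(2)
print("OK - all SD-WAN processes are running")
sys.exit(0)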
To recover the processes, restart the VeloCloud SD-WAN Process Monitor, which restarts all other processes. Use the following command to restart VeloCloud SD-WAN Process Monitor.
sudo service vc_process_monitor restart
Use the following command to restart routing protocol daemons.
/usr/sbin/frr.init {start|stop|restart} [daemon ...]
Monitor Certificate Revocation List
On Gateways with PKI enabled, the revoked certificates are stored in a Certificate Revocation List (CRL). If this list grows too long, generally due to an issue with the Certificate Authority of the Orchestrator, the performance of the Gateway is impacted. The CRL should contain fewer than 4000 entries.
Use the following command to check the CRL entries.
vcadmin@vcg1-example:~$ openssl crl -in /etc/vc-public/vco-ca-crl.pem -text | grep 'Serial Number' | wc -l
14
vcadmin@vcg1-example:~$
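A scripted version of this check might look like the following minimal sketch (not part of the Gateway software), which counts entries the same way as the command above and flags a CRL of 4000 or more entries:
#!/usr/bin/env python
# Minimal sketch (not shipped with the Gateway): alert when the CRL grows too long.
import subprocess
import sys

CRL_LIMIT = 4000  # the CRL should contain fewer than 4000 entries
CRL_PATH = "/etc/vc-public/vco-ca-crl.pem"

text = subprocess.check_output(["openssl", "crl", "-in", CRL_PATH, "-text"],
                               universal_newlines=True)
entries = text.count("Serial Number")
if entries >= CRL_LIMIT:
    print("CRITICAL - CRL has %d entries (limit %d)" % (entries, CRL_LIMIT))
    sys.exit(2)
print("OK - CRL has %d entries" % entries)
sys.exit(0)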
Monitor ICMP Status
If you configure a Gateway as a Partner Gateway with static routing, and the ICMP responder is configured to track the reachability of those routes, the debug.py command indicates the UP or DOWN states:
vcadmin@vcg1-example:~$ sudo /opt/vc/bin/debug.py --icmp_monitor
{
"icmpProbe": {
"cTag": 0,
"destinationIp": "0.0.0.0",
"enabled": false,
"frequencySeconds": 0,
"probeFail": 0,
"probeType": "NONE",
"probesSent": 0,
"respRcvd": 0,
"sTag": 0,
"state": "DOWN",
"stateDown": 0,
"stateUp": 0,
"threshold": 0
},
"icmpResponder": {
"enabled": false,
"ipAddress": "0.0.0.0",
"mode": "CONDITIONAL",
"reqRcvd": 0,
"respSent": 0,
"state": "DOWN"
}
}
vcadmin@vcg1-example:~$
When the ICMP responder is enabled, the DOWN state means that there are no Edges connected to the Gateway.
Monitor BGP Sessions
The debug.py command with the --bgp_view_summary option provides information about the state of BGP neighbors and the prefixes learned, which you can use to verify that BGP is UP and exchanging prefixes. Use the following command to verify that BGP neighborships are established.
vcadmin@vcg1-example:~$ /opt/vc/bin/debug.py --bgp_view_summary | grep Established | wc -l
6
vcadmin@vcg1-1:~$
If the BGP sessions are down, check whether the Gateway is properly connected to the Orchestrator.
Monitor Core Files
When a service crashes on the Gateway, a core file is generated. A diagnostic bundle should be generated from the Orchestrator and retrieved as soon as possible after the core file is created, in order to download the core file and provide the associated logs to Arista Support.
The following example illustrates a Python script to check for recent core files:
#! /usr/bin/env python
# Nagios-style check for new core files under /velocloud/core.
import subprocess
import traceback
import os
import os.path
import glob
import datetime
import time
import sys
import re
import commands
import json
from pynag.Plugins import PluginHelper, ok, warning, critical, unknown
from subprocess import Popen, PIPE

helper = PluginHelper()
helper.parse_arguments()


def diag_check():
    # Returns True if a diagnostic bundle upload is recorded in mgd.log.
    regex_patern = "^.*\s+Uploading diag-201[0-9]-.*"
    re_nat = re.compile(regex_patern)
    cmd = 'grep "Uploading diag-201[0-9]" /var/log/mgd.log'
    p1 = subprocess.Popen([cmd], stdout=subprocess.PIPE,
                          stderr=subprocess.PIPE, shell=True)
    stdout_value, stderr_value = p1.communicate()
    m = re_nat.search(stdout_value)
    if m:
        return True
    else:
        return False


def vco_vcg_version():
    # Returns the Gateway name, build number, and the Orchestrator it reports to.
    with open("/opt/vc/.gateway.info") as data:
        d = json.loads(data.read())
    vcg = d["gatewayInfo"]["name"]
    # build_number=d["gatewayInfo"]["buildNumber"]
    status, output = commands.getstatusoutput(
        "sudo /opt/vc/sbin/gwd -v 2>&1 | grep rev")
    if status == 0:
        build_number = output.split()[2].rstrip('\n')
    vco = d["configuration"]["managementPlane"]["data"]["managementPlaneProxy"]["primary"]
    return vcg, build_number, vco


# Seed the state files used between runs and make sure nagios owns them.
status_file = "/tmp/coredump_status_file"
warning_file = "/tmp/warning_file"
if not os.path.isfile(status_file) and not os.access(status_file, os.R_OK):
    os.system("touch /tmp/coredump_status_file")
    os.system("chown nagios:nagios /tmp/coredump_status_file")
if not os.path.isfile(warning_file) and not os.access(warning_file, os.R_OK):
    os.system("touch /tmp/warning_file")
    os.system("chown nagios:nagios /tmp/warning_file")
if not os.path.isfile("/tmp/crashlist.txt") and not os.access("/tmp/crashlist.txt", os.R_OK):
    os.system("touch /tmp/crashlist.txt")
    os.system("chown nagios:nagios /tmp/crashlist.txt")

command = "cat /tmp/coredump_status_file"
command1 = "cat /tmp/warning_file"
files = ["crashlist.txt", "warning_file",
         "coredump_status_file", "coredump_message"]
for item in files:
    if os.path.isfile("/tmp/" + item):
        st = os.stat("/tmp/" + item)
        if st.st_uid == 0:
            commands.getstatusoutput("sudo chown nagios:nagios /tmp/" + item)

# A previous run already raised a critical; repeat it until the hold time expires.
status, output = commands.getstatusoutput(command)
if output == "1":
    status_message = ""
    os.system("chown nagios:nagios /tmp/coredump_message")
    with open("/tmp/coredump_message", "r") as data:
        for line in data.readlines():
            status_message += line
    mtime = os.path.getmtime("/tmp/coredump_status_file")
    cur_time = time.time()
    if int(cur_time) - int(mtime) >= 300:
        os.system('echo -n "0" > /tmp/coredump_status_file')
    helper.status(critical)
    helper.add_summary(status_message)
    helper.exit()
    sys.exit(0)

status_message = ""
newcore = 0
try:
    crashlistpath = '/tmp/crashlist.txt'
    cmd = "stat -c '%Y %n' /velocloud/core/*core.tgz"
    if not os.path.isfile(crashlistpath) and not os.access(crashlistpath, os.R_OK):
        os.system("find /velocloud/core/ -name *core.tgz > /tmp/crashlist.txt")
    with open(crashlistpath, "a+") as f:
        oldcrashlist = f.read()
        corelist = glob.glob("/velocloud/core/*core.tgz")
        corecount = len(corelist)
        if corecount > 0:
            for line in corelist:
                file_modified = datetime.datetime.fromtimestamp(
                    os.path.getmtime(line))
                # Age out cores older than 42 days.
                if datetime.datetime.now() - file_modified > datetime.timedelta(hours=42 * 24):
                    os.remove(line)
                if not line in oldcrashlist:
                    # New core: record it and extract the backtrace summary.
                    newcore += 1
                    status_message += '\n' + "Core:" + \
                        str(newcore) + " " + line.rsplit('/', 1)[1] + " "
                    f.write(line + '\n')
                    cmd1 = "tar -xvf " + \
                        line.rstrip('\n') + \
                        " -C /tmp --wildcards --no-anchored '*.txt' "
                    crash = subprocess.Popen(
                        cmd1, shell=True, stdout=subprocess.PIPE)
                    crash.wait()
                    for line1 in crash.stdout:
                        btcmd = "awk '/^Thread 1 /,/^----/' /tmp/" + \
                            line1.rstrip('\n') + \
                            " | egrep '^#' | sed 's/ 0x0.* in //' | sed 's/ (.*/ /'"
                        bt = subprocess.Popen(
                            btcmd, shell=True, stdout=subprocess.PIPE)
                        status_message += '\n' + bt.communicate()[0]
        else:
            helper.status(ok)
            status_message = "No Core file"
        f.close()
except Exception as e:
    traceback.print_exc()
    helper.exit(summary="Nagios check could not complete",
                long_output=str(e), exit_code=unknown, perfdata='')

if corecount and not newcore:
    helper.status(ok)
    status_message = str(corecount) + " old core file found in /velocloud/core"
    os.system('echo -n "0" > /tmp/coredump_status_file')
elif newcore > 0:
    output = vco_vcg_version()
    vcg_data = "%s; VCG_Build_Number:%s; VCO:%s\n" % (output)
    status_message = vcg_data + str(newcore) + " New Core\n" + status_message
    with open("/tmp/coredump_message", "w") as data:
        data.writelines(status_message)
    os.system('echo -n "1" > /tmp/warning_file')
    os.system('echo -n "1" > /tmp/coredump_status_file')
    helper.status(critical)
    helper.add_summary(status_message)
    helper.exit()
    sys.exit(0)

# Remind the operator to pull a diagnostic bundle until one is seen or 3 hours pass.
status, output_warn = commands.getstatusoutput(command1)
if output_warn == "1":
    helper.status(warning)
    status_message = "Please generate gateway diag bundle from the VCO if required"
    result = diag_check()
    if result == False:
        if not os.path.isfile("/tmp/coredump_start_time"):
            os.system("touch /tmp/coredump_start_time")
            os.system("chown nagios:nagios /tmp/coredump_start_time")
            start_time = time.time()
            with open("/tmp/coredump_start_time", "w") as data:
                data.write(str(start_time))
        end_time = time.time()
        cmd = "cat /tmp/coredump_start_time"
        status, start_time = commands.getstatusoutput(cmd)
        total_time = end_time - float(start_time)
        if total_time > 10800:
            result = True
    if result == True:
        os.system('echo -n "0" > /tmp/warning_file')
        os.remove("/tmp/coredump_start_time")
        helper.status(warning)
        status_message = "Please generate the diagbundle for the last crash. if it is taken already, please ignore this message"

helper.add_summary(status_message)
helper.exit()
Capacity of Gateway Components
Use the CLI commands to check the capacity of the Gateway components and verify that the configured values do not exceed the supported values to ensure seamless performance.
For additional information on the capacity of different components and the supported values for each component, refer to the VeloCloud SD-WAN Performance and Scale Datasheet published at the Partner Connect Portal. To access the datasheet, you must log into the Partner Connect Portal using your Partner credentials (username and password).
Monitor Packet Processing Queue
The packet processing engine on the Gateway involves multiple stages, with a packet processing queue between each stage. Due to the bursty nature of traffic through a Gateway, occasional packet buildup in the packet forwarding queues is expected. However, a consistently high queue length in certain queues indicates a capacity problem.
The following example shows the output of the debug.py command used to view the handoff queues. The output has been truncated to show only the first and last entries for conciseness. You can exclude the -v option to view the output in tabular format.
vcadmin@vcg1-example:~$ /opt/vc/bin/debug.py -v --handoff
{
"handoffq": [
{
"deq": 12126489784,
"drops": 0,
"enq": 12126482089,
"name": "vc_queue_net_sch",
"qlength": 0,
"qlimit": 4096,
"sleeping": 1475174572,
"tid": 1502,
"wmark": 1280,
"wmark_1min": 385,
"wmark_5min": 450,
"wokenup": 1164664965
},
…
{
"deq": 767292,
"drops": 0,
"enq": 767272,
"name": "vc_queue_ip_common_bh_1",
"qlength": 0,
"qlimit": 16384,
"sleeping": 596612,
"tid": 1512,
"wmark": 53,
"wmark_1min": 3,
"wmark_5min": 3,
"wokenup": 596209
}
]
}
vcadmin@vcg1-example:~$
Note the values of qlength and wmark. The qlength column indicates the number of packets currently buffered in the queue. The wmark column indicates the maximum depth the queue has ever reached, which shows how close the Gateway has come to dropping packets. The impact and remediation for these depend largely on the queue being monitored.
You should monitor both the critical and non-critical queues.
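One way to automate this is to parse the JSON output of debug.py --handoff and flag queues whose high watermark approaches the queue limit. The following is a minimal sketch (not part of the Gateway software); the 80% ratio is an illustrative threshold, not a documented value:
#!/usr/bin/env python
# Minimal sketch (not shipped with the Gateway): flag handoff queues whose
# high watermark (wmark) is close to the queue limit (qlimit).
import json
import subprocess

WMARK_RATIO = 0.8  # illustrative threshold, not a documented value

out = subprocess.check_output(["/opt/vc/bin/debug.py", "-v", "--handoff"],
                              universal_newlines=True)
for q in json.loads(out)["handoffq"]:
    if q["qlimit"] and float(q["wmark"]) / q["qlimit"] >= WMARK_RATIO:
        print("%s: wmark %d of qlimit %d (drops=%d)"
              % (q["name"], q["wmark"], q["qlimit"], q["drops"]))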
Netif and Per-Core Queues
The high watermark level in these queues indicates that the packet processing rate is lower than the incoming rate.
The following example shows output of the dispcnt -p netif -s len -s wmark -d vcgw.com command.
# dispcnt -p netif -s len -s wmark -d vcgw.com
Wed Jul 20 09:03:39 2022
netif_queue0_len = 0 0 /s
netif_queue0_wmark = 0 0 /s
netif_queue0_wmark_1min = 0 0 /s
netif_queue0_wmark_1s = 0 0 /s
netif_queue0_wmark_5min = 0 0 /s
netif_queue1_len = 0 0 /s
netif_queue1_wmark = 45 22 /s
netif_queue1_wmark_1min = 3 1 /s
netif_queue1_wmark_1s = 1 0 /s
netif_queue1_wmark_5min = 3 1 /s
In the output, the value in the first column is the current count and the second column is the rate per second. On the first iteration, the first column displays the total (lifetime) count since the counter started, and the second column displays that total divided by two. Because the data refreshes every two seconds, subsequent iterations display the count since the last sample in the first column and the per-second rate in the second column.
The following example shows output of the dispcnt -p per_core -s len -s wmark -d vcgw.com command.
# dispcnt -p per_core -s len -s wmark -d vcgw.com
Wed Jul 20 09:11:05 2022
per_core_queue0_len = 0 0 /s
per_core_queue0_wmark = 1476 738 /s
per_core_queue0_wmark_1min = 91 45 /s
per_core_queue0_wmark_1s = 34 17 /s
per_core_queue0_wmark_5min = 346 173 /s
per_core_queue1_len = 0 0 /s
per_core_queue1_wmark = 1216 608 /s
per_core_queue1_wmark_1min = 47 23 /s
per_core_queue1_wmark_1s = 27 13 /s
per_core_queue1_wmark_5min = 64 32 /s
View Non-Critical Queues
High queue lengths in the non-critical queues are less common and less likely to impact customers.
The following are the non-critical queues that can be monitored.
vc_queue_vcmp_init – This queue carries VCMP tunnel initiation messages for new tunnel setup. The Gateway throttles incoming tunnel requests to the maximum rate at which they can be handled without disrupting existing traffic, based on the available cores. As a result, a high queue length is expected in this queue on a Gateway with many tunnels.
Packet buildup in this queue should come in large bursts following a specific event, such as a Gateway restart or a transit interruption, and there should be no drops during normal operation.
vc_queue_vcmp_ctrl_0 and vc_queue_vcmp_ctrl_1 – These queues carry VCMP tunnel management control messages received on the existing tunnels. This includes messages such as route updates, path state updates, heartbeats, statistics, QoS Sync, and tunnel information.
Almost all control messages, such as route updates, have built-in retry mechanisms to account for these drops.
vc_queue_ike – This queue processes IKE protocol messages to manage keys and other state of encryption sessions.
This is generally low-volume traffic, and packet buildup is unlikely here. If drops occur, IKE messages are retried.
Monitor Throughput Performance
While the handoff queues are the ideal way to monitor from a capacity perspective, it can also be useful to monitor throughput.
For many providers, the monitoring of throughput occurs on the Hypervisor and is outside the scope of the Gateway.
For providers who want to monitor on the Gateway, the following example illustrates how to get the RX and TX byte counts; delta calculations over a period of time yield the throughput (see the sketch after the output).
vcadmin@vcg34-1:~$ sudo /opt/vc/bin/getcntr -c dpdk_eth0_pstat_ibytes -d vcgw.com
1895744358
vcadmin@vcg34-1:~$ sudo /opt/vc/bin/getcntr -c dpdk_eth0_pstat_obytes -d vcgw.com
1865866321
vcadmin@vcg34-1:~$ sudo /opt/vc/bin/getcntr -c dpdk_eth1_pstat_ibytes -d vcgw.com
33233362
vcadmin@vcg34-1:~$ sudo /opt/vc/bin/getcntr -c dpdk_eth1_pstat_obytes -d vcgw.com
29843320
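The following minimal sketch (not part of the Gateway software) shows the delta calculation: it samples the byte counters above twice, a fixed interval apart, and converts the difference to megabits per second:
#!/usr/bin/env python
# Minimal sketch (not shipped with the Gateway): estimate throughput by sampling
# the dpdk byte counters twice over a fixed interval.
import subprocess
import time

INTERVAL = 10  # seconds between samples
COUNTERS = ["dpdk_eth0_pstat_ibytes", "dpdk_eth0_pstat_obytes",
            "dpdk_eth1_pstat_ibytes", "dpdk_eth1_pstat_obytes"]

def read_counter(name):
    out = subprocess.check_output(
        ["sudo", "/opt/vc/bin/getcntr", "-c", name, "-d", "vcgw.com"],
        universal_newlines=True)
    return int(out.strip())

first = dict((c, read_counter(c)) for c in COUNTERS)
time.sleep(INTERVAL)
for c in COUNTERS:
    delta = read_counter(c) - first[c]
    print("%s: %.2f Mbps" % (c, delta * 8.0 / INTERVAL / 1e6))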
The actual throughput capacity might vary based on the number of connected Edges, encryption mix, and average packet size. The handoff queues provide a clear picture of the Gateway performance relative to its capacity.
For supported value of maximum throughput, refer to the VeloCloud SD-WAN Performance and Scale Datasheet published at the Partner Connect Portal. To access the datasheet, you must log into the Partner Connect Portal using your Partner credentials (username and password).
View Connected Edges
The following example shows the number of connected Edges. It is recommended to keep the tunnel count below the supported value to reduce the CPU load and the recovery time that follows a restart.
vcadmin@vcg1-example:~$ /opt/vc/bin/debug.py --list_edges 2
{
"vceCount": 156
}
vcadmin@vcg1-example:~$
If the number of connected Edges approaches the supported value for a Gateway, then customers should be moved to alternate Gateways to reduce the Edge count. If the number of connected Edges exceeds the supported value for a Gateway, then this movement of Edges to alternate Gateways should be treated as critical.
Monitor Tunnel Count
You can monitor tunnel count with thresholds that provide warning or critical states which indicate potential issues prior to impacting services.
The following table lists the threshold values of tunnel count and recommended actions.
| Threshold State | Gateway Specification | Tunnel Count (With Certificate) | Tunnel Count (Without Certificate) | Recommended Corrective Action |
|---|---|---|---|---|
| Warning/Critical | Gateway with 4 Cores and 32 GB of RAM | 3000 | 3000 | |
| Warning/Critical | Gateway with 8 Cores and 32 GB of RAM | 6000 | 6000 | |
The following table lists the threshold values of stale tunnel count and recommended actions.
| Threshold State | Threshold Value | Recommended Corrective Action |
|---|---|---|
| Warning | 10% of total tunnel count for a duration of 300 seconds | |
| Critical | 25% of total tunnel count for a duration of 300 seconds | |
Monitor Path Stability
You can monitor the status of unstable tunnel count to determine the path stability.
The following table lists the threshold values of unstable tunnel count and recommended actions.
| Threshold State | Threshold Value | Recommended Corrective Action |
|---|---|---|
| Warning | 25% of total tunnels in unstable state for 5 minutes | |
| Critical | 25% of total tunnels in unstable state for 10 minutes | |
View BGP-enabled VRFs
The following example shows the number of BGP-enabled VRFs.
vcadmin@vcg1-example:~$ /opt/vc/bin/debug.py --vrf | grep "my_asn" | wc -l
0
vcadmin@vcg1-example:~$
If the number of BGP-enabled VRFs exceeds the maximum supported value, customers should be moved to alternate Gateways to reduce the number.
For supported values, refer to the VeloCloud SD-WAN Performance and Scale Datasheet published at the Partner Connect Portal. To access the datasheet, you must log into the Partner Connect Portal using your Partner credentials (username and password).
View Gateway Routes
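The following example shows the current number of route entries on the Gateway: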
vcadmin@vcg1-example:~$ sudo /opt/vc/bin/getcntr -c memb.mod_gw_route_t.obj_cnt -d gwd-mem
8262
vcadmin@vcg1-example:~$
If the number of route entries exceeds the maximum supported value, customers should be moved to alternate Gateways to reduce the route count.
For supported values, refer to the VeloCloud SD-WAN Performance and Scale Datasheet published at the Partner Connect Portal. To access the datasheet, you must log into the Partner Connect Portal using your Partner credentials (username and password).
View Gateway Flows
The number of flows supported by a Gateway is determined by the system memory. The supported flow limits are written to a log file during startup.
The following example shows the log of maximum supported flows:
ERROR [MAIN] gwd_get_max_flow_supported:35 Flow Admission: GWD
Max flow supported: 1929780 soft limit:1157820 hard limit:1736730
If logs have rolled over, use the following table as reference:
| Gateway Memory (GB) | Max Number of Flows | Critical Number of Flows (90% of max flows) |
|---|---|---|
| 4 | 245760 | 221184 |
| 8 | 491520 | 442368 |
| 16 | 983040 | 884736 |
| 32 | 1966080 | 1769472 |
If the flow count reaches the critical limit, the system should be investigated for a possible flow leak.
Current flow objects in the system are as follows:
vcadmin@vcg1-example:~$ sudo /opt/vc/bin/getcntr -c memb.mod_mp_flow_t.obj_cnt -d gwd-mem
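A scripted version of this check might compare the current flow count against the critical number from the table above. The following is a minimal sketch (not part of the Gateway software); the critical limit shown assumes a 32 GB Gateway and should be adjusted per the table:
#!/usr/bin/env python
# Minimal sketch (not shipped with the Gateway): compare the current flow count
# against a critical limit (here the 32 GB row of the table above).
import subprocess
import sys

CRITICAL_FLOWS = 1769472  # 90% of max flows for a 32 GB Gateway; adjust per the table

out = subprocess.check_output(
    ["sudo", "/opt/vc/bin/getcntr", "-c", "memb.mod_mp_flow_t.obj_cnt", "-d", "gwd-mem"],
    universal_newlines=True)
flows = int(out.strip())
if flows >= CRITICAL_FLOWS:
    print("CRITICAL - %d flows (limit %d); investigate a possible flow leak"
          % (flows, CRITICAL_FLOWS))
    sys.exit(2)
print("OK - %d flows" % flows)
sys.exit(0)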
If the flows are determined to be invalid, a diagnostic bundle should be generated before restarting the Gateway service to clear the stale flows. If the flows are determined to be valid, then the customers should be moved to alternate Gateways to reduce the flow count.
The following table lists the threshold values and recommended actions for flow count.
| Threshold State | Threshold Value | Recommended Corrective Action |
|---|---|---|
| Warning | 50% of 1.9 Million flows | |
| Critical | 75% of 1.9 Million flows | |
The following table lists the threshold values and recommended actions for stale flow count.
| Threshold State | Threshold Value | Recommended Corrective Action |
|---|---|---|
| Warning | 10% | |
| Critical | 25% | |
View NAT Entries
If the number of free NAT entries is critically low, the system should be investigated for a possible leak.
vcadmin@vcg1-example:~$ sudo /opt/vc/bin/getcntr -c natd.nat_shmem_free_entries -d vcgwnat.com
993408
vcadmin@vcg1-example:~$
Reboot the Gateway to clear all assigned NAT entries. Restarting the services has no effect on NAT entries.
For supported values, refer to the VeloCloud SD-WAN Performance and Scale Datasheet published at the Partner Connect Portal. To access the datasheet, you must log into the Partner Connect Portal using your Partner credentials (username and password).
The following table lists the threshold values of NAT entries.
| Threshold State | Threshold Value | Recommended Corrective Action |
|---|---|---|
| Warning | 50% of 900K NAT entries | |
| Critical | 75% of 900K NAT entries | |
The following table lists the threshold values of stale NAT entries.
| Threshold State | Threshold Value | Recommended Corrective Action |
|---|---|---|
| Warning | 10% | |
| Critical | 25% | |
Monitor Over Capacity Drops
Admission control is a mechanism by which incoming data packets are dropped when the system is over capacity. This throttling helps ensure that the system has enough resources to process the packets that are already buffered. Admission control is applied only to data packets.
To check if there are any over capacity drops, use the following commands:
root@spperf-gateway-1:~# dispcnt -s over_capacity_drop
over_capacity_drop = 1461980 0 /s
root@gateway-1:~# dispcnt -s over_capacity_drop -d vcgw.com
Fri Dec 17 11:12:25 2021
over_capacity_drop = 0 0 /s
root@gateway-1:~# dispcnt -s natd.shmem_oom -s natd.port_assign_fail -d vcgwnat.com
Fri Dec 17 11:12:44 2021
natd.port_assign_fail = 0 0 /s
natd.shmem_oom = 0 0 /s
root@gateway-1:~# dispcnt -p netif -s tx_drop -s rx_drop -d vcgw.com
Fri Dec 17 11:13:04 2021
netif_eth0_rx_dropped = 0 0 /s
netif_eth0_tx_dropped = 0 0 /s
netif_eth1_rx_dropped = 0 0 /s
netif_eth1_tx_dropped = 0 0 /s
To monitor the capacity of flows, run the following command:
root@gateway-1:~# dispcnt -s flow_admisison_limit_hit
The following table lists the threshold values and recommended actions for overcapacity drops.
| Threshold State | Threshold Value | Recommended Corrective Action |
|---|---|---|
| Warning | 500 drops per 30 seconds (absolute count). When the drops remain above the threshold value consistently for 5 minutes, a warning alert is triggered. | When the drops cross the warning threshold: |
| Critical | 1000 drops per 30 seconds (absolute count). When the drops remain above the threshold value consistently for 5 minutes, a critical alert is triggered. | When the drops cross the critical threshold: If the drops do not stabilize: |
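The threshold logic in the table can be applied to two readings of the over_capacity_drop counter taken 30 seconds apart (for example, from the dispcnt output shown earlier). The following is a minimal sketch (not part of the Gateway software; the script name and manual sampling are assumptions) that classifies the delta:
#!/usr/bin/env python
# Minimal sketch (hypothetical helper, not shipped with the Gateway): classify
# over-capacity drops per 30-second window against the table above.
# Usage: over_capacity_check.py <count_at_t0> <count_30s_later>
import sys

WARNING_DROPS = 500    # per 30-second window
CRITICAL_DROPS = 1000  # per 30-second window

before, after = int(sys.argv[1]), int(sys.argv[2])
delta = after - before
if delta >= CRITICAL_DROPS:
    print("CRITICAL - %d over-capacity drops in 30s" % delta)
    sys.exit(2)
if delta >= WARNING_DROPS:
    print("WARNING - %d over-capacity drops in 30s" % delta)
    sys.exit(1)
print("OK - %d over-capacity drops in 30s" % delta)
sys.exit(0)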
Monitor Latency Threshold for Paths
Whenever the latency threshold values are changed for an Edge, all the tunnels to the corresponding Gateway inherit the same threshold values. The debug command debug.py -v --path can be used to check the values.
Below is the sample output:
"pi_info": {
"connected": 2,
"num_ha_takeover": 0,
"priv_ip": "169.254.129.4",
"profile": "0/0",
"qoe_latency_threshold": {
"trans_red_latency_ms": 100,
"trans_yellow_latency_ms": 100,
"video_red_latency_ms": 50,
"video_yellow_latency_ms": 10,
"voice_red_latency_ms": 62,
"voice_yellow_latency_ms": 22
}
The threshold values are not synced with other Edges or Hubs.
