Telemetry Collector

A DANZ Monitoring Fabric (DMF) consists of a pair of Controllers, switches, and managed appliances. The Telemetry Collector feature centrally retrieves the infrastructure metrics (interface counters, CPU usage, etc.) associated with all these devices from the Controllers through their REST API.

Deployment

Figure 1. DMF Deployment

The diagram above shows a DMF deployment with an Active/Standby Controller cluster. In this environment, each Controller collects metrics from all the devices it manages and from the Controller nodes. Each Controller establishes a gNMI connection to every device and to the other Controller in the fabric to collect telemetry streams. gNMI is a gRPC-based protocol for configuring network devices and accessing their state. The REST API exposes this information in the /api/v1/data/controller/telemetry/data subtree. The Controller exposes metrics for CPU, memory, and disk utilization, interface counters, and sensor states of all devices in the fabric. All metrics are collected every 10 seconds, except sensor metrics, which are collected every minute.
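As an illustration of reading this subtree, the following Python sketch builds (but does not send) a GET request for the telemetry data path. Only the path comes from this document. The HTTPS port 8443, the cookie name session_cookie, the hostname, and the function name build_telemetry_request are assumptions made for the example and should be checked against the Controller's API Schema browser.

```python
import urllib.request

# Subtree named in the text above.
TELEMETRY_DATA_PATH = "/api/v1/data/controller/telemetry/data"


def build_telemetry_request(controller, session_token, port=8443):
    """Return an unsent GET request for the telemetry data subtree.

    Assumptions (not stated in this document): the REST API listens on
    HTTPS port 8443 and accepts a session token in a cookie named
    session_cookie.
    """
    url = f"https://{controller}:{port}{TELEMETRY_DATA_PATH}"
    request = urllib.request.Request(url, method="GET")
    request.add_header("Cookie", f"session_cookie={session_token}")
    return request


if __name__ == "__main__":
    # Hypothetical hostname and token, for illustration only.
    req = build_telemetry_request("dmf-controller.example.com", "TOKEN")
    print(req.get_method(), req.full_url)
```

Sending the request (for example with urllib.request.urlopen) requires a reachable Controller and a user with the telemetry read privilege described under Configuration.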

Configuration

No additional configuration is necessary on the DMF Controller to enable the metric collection. However, to read these metrics, the user must be an admin user or configured with the privilege, category:TELEMETRY. To set the telemetry permission for a custom group and associate the user with this group, use the following commands:

dmf-controller(config)# group group_name
dmf-controller(config)# permission category:TELEMETRY privilege read-only
dmf-controller(config)# associate user username 

Connection Status

The Controller uses the gNMI protocol to collect telemetry data from all devices. DMF reports the status of these connections in both the REST API and the CLI.

REST API

The REST API subtree /api/v1/data/controller/telemetry/connection reports the telemetry connection state. The API Schema browser of the GUI provides more details.

CLI

The following show commands display the connection details. All these commands support filtering the output with a device name.

show telemetry connection device-name | all 
show telemetry connection device-name | all details
show telemetry connection device-name | all last-failure

The show telemetry connection device-name | all command shows the state and the latest state change of the connection between the Controller and the devices.

dmf-controller# show telemetry connection all
# Name  State Last state change
-|-----|-----|------------------------------|
1 c2    ready 2023-11-10 13:10:02.718000 UTC
2 core2 ready 2023-11-10 13:22:56.311000 UTC
<snip>
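When this output is captured by a script, it can be turned into records for monitoring. The following Python sketch parses rows in the format shown above. It assumes device names contain no spaces and that timestamps are always in UTC; the function names parse_connections and not_ready are made up for this example.

```python
from datetime import datetime, timezone

# Sample text in the format of the output shown above.
SAMPLE = """\
# Name  State Last state change
-|-----|-----|------------------------------|
1 c2    ready 2023-11-10 13:10:02.718000 UTC
2 core2 ready 2023-11-10 13:22:56.311000 UTC
<snip>
"""


def parse_connections(text):
    """Parse rows of 'show telemetry connection all' into dicts."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        # Data rows have six fields and start with a row number; this
        # skips the header, the rule line, and <snip>.
        if len(fields) != 6 or not fields[0].isdigit():
            continue
        _, name, state, date, time, tz = fields
        if tz != "UTC":
            continue  # assumption: timestamps are reported in UTC
        changed = datetime.strptime(f"{date} {time}", "%Y-%m-%d %H:%M:%S.%f")
        rows.append({
            "name": name,
            "state": state,
            "last_change": changed.replace(tzinfo=timezone.utc),
        })
    return rows


def not_ready(rows):
    """Return the names of devices whose connection is not 'ready'."""
    return [row["name"] for row in rows if row["state"] != "ready"]


if __name__ == "__main__":
    for row in parse_connections(SAMPLE):
        print(row["name"], row["state"], row["last_change"].isoformat())
```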

The show telemetry connection device-name | all details command additionally displays the target address, the connection type, and the time of the last message received on each connection.

dmf-controller# show telemetry connection all details
# Name  State Last state change              Target              Connection type Last message time
-|-----|-----|------------------------------|-------------------|---------------|------------------------------|
1 core2 ready 2023-11-10 13:22:56.311000 UTC 10.243.255.102:6030 clear-text      2023-11-10 13:59:35.437000 UTC
<snip>

The show telemetry connection device-name | all last-failure command displays more details about a connection failure. The output shows the time of the latest failure and its potential cause. If the connection is still in the failed state, the output also shows when the next reconnection will be attempted.

dmf-controller# show telemetry connection all last-failure
# Name  Fail time                      Fail type       Root cause                Next retry in
-|-----|------------------------------|---------------|-------------------------|-------------|
1 core2 2023-11-10 13:19:34.237000 UTC unavailable     UNAVAILABLE: io exception 0
<snip>

Limitations

  • Software interfaces (for example, loopback, bond, and management) do not report counters for broadcast and unicast packets.
  • The reported interface names are the raw physical interface names (e.g., et1) rather than the user-configured names associated with the role of an interface (e.g., filter1).
  • Resetting the interface counter does not affect the counter values stored at the /telemetry/data path. The value monotonically increases and corresponds to the total count since the device was last powered up. This value only gets reset when rebooting the device.
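Because the stored counters only increase until the device reboots, a consumer typically derives rates from the difference between two consecutive samples. The following Python sketch shows one way to do this. The function name counter_rate is made up for the example, and the default interval of 10 seconds matches the collection frequency described under Deployment.

```python
def counter_rate(prev_value, curr_value, interval_seconds=10.0):
    """Return the per-second rate between two samples of a counter.

    Returns None when the counter went backwards. Per the limitation
    above, the value only resets when the device reboots, so such a
    sample pair spans a reboot and is not usable for a rate.
    """
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    if curr_value < prev_value:
        return None
    return (curr_value - prev_value) / interval_seconds


if __name__ == "__main__":
    # Two hypothetical in-octets samples taken 10 seconds apart.
    print(counter_rate(1_000, 6_000))  # 500.0 octets per second
    # A smaller current value indicates a reboot between samples.
    print(counter_rate(6_000, 10))     # None
```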

Usage Notes

  • DMF uses the configured name of a managed device (e.g., switch, recorder node, etc.) on the Controller as the value of the key name for the node device for all the metrics corresponding to it. In the case of a Controller, DMF uses the configured hostname as the key. Thus, these names must be unique in a specific DMF deployment.
  • Metrics may not be collected from a device for a short period. This data gap may occur when the device reboots or when the Controllers experience a failover event.
  • If the gNMI connection between the Controller and the device is interrupted, the Controller attempts a new connection after 2 minutes. The retry timeout for a subsequent connection attempt increases exponentially and can go up to 120 minutes. Upon a successful reconnection, this timeout value resets to 2 minutes.
  • gNMI warning messages might appear in the floodlight log during certain events, e.g., when a device is first added or is reloading. These messages can be ignored.
  • This feature enables an OpenConfig agent on switches running EOS to collect telemetry.
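The reconnection behavior described in the notes above can be sketched as a capped exponential backoff. The text gives only the initial value (2 minutes) and the cap (120 minutes); the doubling factor used below is an assumption, and the function name retry_delays is made up for the example.

```python
def retry_delays(attempts, initial=2, maximum=120, factor=2):
    """Return the wait, in minutes, before each reconnection attempt.

    initial and maximum come from the usage note above. factor=2 is an
    assumption; the document only says the timeout grows exponentially.
    """
    delays = []
    delay = initial
    for _ in range(attempts):
        delays.append(delay)
        # Grow the delay, but never beyond the documented cap.
        delay = min(delay * factor, maximum)
    return delays


if __name__ == "__main__":
    print(retry_delays(8))  # [2, 4, 8, 16, 32, 64, 120, 120]
```

After a successful reconnection the schedule starts over at the initial value, which corresponds to calling retry_delays again from the beginning.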

Telemetry Availability

As a DMF fabric consists of different types of devices, the metrics of each vary. The following outlines the metrics collected from each device type by the Controller and typically made available over its REST API. However, some specific platforms or hardware might not report a particular metric. For brevity, the following list mentions the leaves that can correspond to a metric. For more details, use the API Schema browser of the GUI.

telemetry
+-- data
    +-- device
        +-- interface
        |   +-- oper-status             Ctrl, SWL, EOS, SN, RN
        |   +-- counters
        |       +-- in-octets           Ctrl, SWL, EOS, SN, RN
        |       +-- in-pkts             Ctrl, SWL, EOS, SN, RN
        |       +-- in-unicast-pkts     Ctrl, SWL, EOS, SN, RN
        |       +-- in-broadcast-pkts   Ctrl, SWL, EOS, SN, RN
        |       +-- in-multicast-pkts   Ctrl, SWL, EOS, SN, RN
        |       +-- in-discards         Ctrl, SWL, EOS, SN, RN
        |       +-- in-errors           Ctrl, SWL, EOS, SN, RN
        |       +-- in-fcs-errors       Ctrl, SWL, EOS, SN, RN
        |       +-- out-octets          Ctrl, SWL, EOS, SN, RN
        |       +-- out-pkts            Ctrl, SWL, EOS, SN, RN
        |       +-- out-unicast-pkts    Ctrl, SWL, EOS, SN, RN
        |       +-- out-broadcast-pkts  Ctrl, SWL, EOS, SN, RN
        |       +-- out-multicast-pkts  Ctrl, SWL, EOS, SN, RN
        |       +-- out-discards        Ctrl, SWL, EOS, SN, RN
        |       +-- out-errors          Ctrl, SWL, EOS, SN, RN
        +-- cpu
        |   +-- utilization             Ctrl, SWL, EOS, SN, RN
        +-- memory
        |   +-- total                   Ctrl, SWL, SN, RN
        |   +-- available               Ctrl, SWL, EOS, SN, RN
        |   +-- utilized                Ctrl, SWL, EOS, SN, RN
        +-- sensor
        |   +-- fan
        |   |   +-- oper-status         Ctrl, SWL, EOS, SN, RN
        |   |   +-- rpm                 Ctrl, SWL, EOS, SN, RN
        |   |   +-- speed               SWL
        |   +-- power-supply
        |   |   +-- oper-status         Ctrl, SWL, EOS, SN, RN
        |   |   +-- capacity            EOS
        |   |   +-- input-current       Ctrl, SWL, EOS, SN, RN
        |   |   +-- output-current      SWL, EOS
        |   |   +-- input-voltage       Ctrl, SWL, EOS, SN, RN
        |   |   +-- output-voltage      SWL, EOS
        |   |   +-- input-power         Ctrl, SWL, SN, RN
        |   |   +-- output-power        SWL, EOS
        |   +-- thermal
        |       +-- oper-status         Ctrl, SWL, SN, RN
        |       +-- temperature         Ctrl, SWL, EOS, SN, RN
        +-- mount-point
        |   +-- size                    Ctrl, SWL, SN, RN
        |   +-- available               Ctrl, SWL, SN, RN
        |   +-- utilized                Ctrl, SWL, SN, RN
        |   +-- usage-percentage        Ctrl, SWL, SN, RN
        +-- control-group
            +-- memory                  Ctrl, SWL, SN, RN
            +-- cpu                     Ctrl, SWL, SN, RN

* Ctrl = Controller, SWL = A switch running Switch Light OS,
EOS = A switch running Arista EOS, SN = Service Node, RN = Recorder Node