Prometheus Endpoint Support for Infrastructure Metrics

Prometheus is an open source monitoring and alerting toolkit. It collects and stores metrics from different sources in a time-series database. Prometheus offers a powerful query language, which allows users to analyze and visualize the collected data in real-time. With its robust alerting system, Prometheus can also notify users of potential issues which helps with their timely resolution.

Starting with the DMF 8.5.0 release, the Prometheus server can scrape metrics from a DMF (DANZ Monitoring Fabric) deployment. The DMF Controller exposes interface counters, CPU usage, memory usage, sensor states, and disk usage statistics from all the devices including the Controllers from a single Prometheus endpoint.

Deployment

The aforementioned diagram shows a DMF deployment with an active/standby Controller cluster. In this environment, each Controller collects metrics from all the devices it manages as well as from both the Controller nodes. It then exposes them via the telemetry endpoint /api/v1/metrics/prometheus. This is an authenticated endpoint that listens on port 8443 and supports both Prometheus and OpenMetrics exposition formats.

Even though each Controller is capable of serving the telemetry information, it is recommended to use cluster's virtual IP (VIP) in the Prometheus configuration in order to achieve seamless continuity in the event of a Controller failover.

Configuration

No additional configuration is necessary on the DMF Controller to enable the metric collection. However, to allow the Prometheus service to access the new HTTP telemetry endpoint, a user access-token needs to be created. An admin user can choose an existing user or create a dedicated user with correct privileges and generate a token in order for Prometheus to fetch metrics from a fabric. The following sections describe the necessary configurations.

Permission

The group that the user belongs to needs to have sufficient permission to query the Prometheus endpoint. The following table summarizes the behavior.


Group	Behavior
`admin`	An access-token generated for a user in the `admin` group will have access to the telemetry endpoint.
`read-only`	An access-token for a user in the `read-only` group will not have access to the telemetry endpoint.
Any custom group	An access-token for a user in a group with TELEMETRY or DEFAULT permission but not with DEFAULT/SENSITIVE permission will have access.

To set telemetry permission for a custom group, use the following commands:

dmf-controller(config)# group group_name
dmf-controller(config)# permission category:TELEMETRY privilege read-only
dmf-controller(config)# associate user username

Access Token

Generate an access token for the user using the following command.

dmf-controller(config)# user username
dmf-controller(config-user)# access-token descriptive name for the access-token
access-token : ZxyHXL0QyOhDUogT8wjZj7ouSiVtWNB3

Prometheus Service

The following configuration needs to be added to the Prometheus server to fetch metrics from the Controller's /api/v1/metrics/prometheus endpoint periodically.

scrape_configs:
- job_name: <job_name>
scheme: https
authorization:
type: Bearer
credentials: <access-token>
metrics_path: /api/v1/metrics/prometheus
scrape_interval: <interval>
static_configs:
- targets:
- <vip>:8443
tls_config:
insecure_skip_verify: true

The table below depicts the recommended configurations for this feature.


Configuration	Value
Credential	The corresponding access token created on the Controller
Scrape Interval	The minimum supported interval is 10s
Target	Use the VIP of the DMF Controller cluster
TLS	If a self-signed certificate is used on the Controller, add `insecure_skip_verify: true`

Please refer to the configuration guidelines of the specific Prometheus version you are using in the production.

Limitations

Every device does not support every metric. Check the Metrics Summary section for more details.
Software interfaces (for example, loopback, bond, and management) do not report counters for broadcast and unicast packets.
The reported interface names are the raw physical interface name (e.g., et1) rather than the user configured name associated with the role of an interface (e.g., filter1).
Resetting the interface counter does not have any effect on the counter values reported by the telemetry. The value is monotonically increasing and corresponds to the total count since the device was last powered up. This value only gets reset when the device is rebooted.

Notes

The configured name of a managed device (e.g., switch, recorder node etc.) on the Controller is used as the value of the device_name label for all the metrics corresponding to it. In the case of a Controller, the configured hostname is used in the device_name label. Thus, these names are expected to be unique in a specific DMF deployment.
It is possible that no metrics are collected from a device for a short time period. This may happen when the device is rebooting or when the Controllers experience a failover event.
Prometheus will add additional metrics, e.g., scrape_duration_seconds, scrape_samples_post_metric_relabeling, scrape_samples_scraped, and scrape_series_added. These metrics do not collect any data from the DMF fabric and can be ignored.
A metric representing a status that can take a fixed set of values is represented as a StateSet metric. Each possible state is reported as a separate metric. The current state is reported with value 1 and the other states with value 0. The state itself is reported as a label with the same name as that of the metric. For example, device_psu_oper_status will display multiple metrics for the operational status of a PSU (Power Supply Unit). If a PSU, psu1 is in the failed state, then the metric device_psu_oper_status{name="psu1", device_psu_oper_status="failed"} will report value 1. At the same time, we will see the metric device_psu_oper_status{name="psu1", device_psu_oper_status="ok"} with value 0 for another state ok.
Internally, all the metrics are fetched at 10 sec frequency except the ones associated with the sensors. These are currently collected at every minute.

Troubleshooting

If no metric is collected or no change in the metrics is visible on Prometheus for a few minutes, please follow the following troubleshooting steps:

The Controller cluster is reachable over its VIP and there is an elected active Controller.
You can retrieve the telemetry states in Prometheus exposition format using the token you created. Use command curl -k -H "Authorization: Bearer <token>" https://<vip>:8443/api/v1/metrics/prometheus
The telemetry connection to the devices is active by running show telemetry connection command. Check the Telemetry Collector section of the DANZ Monitoring Fabric User Guide for more details.

Resources

Appendix

Metrics Summary

The following table shows the details of each metric exposed by the DMF fabric. The supported metric column shows that if a metric is generally supported on the device type. However, some specific platforms or hardware might not report a specific metric.

Metric	Description	Supported device type
Metric	Description	Ctrl	SWL	EOS	SN	RN
device_cgroup_cpu_percentage	The normalized CPU utilization percentage by a control group	Y	Y	N	Y	Y
device_cgroup_memory_bytes	The memory used by a control group in bytes	Y	Y	N	Y	Y
device_config_info	The informational metrics of a managed device	N	Y	Y	Y	Y
device_cpu_utilization_percentage	The percentage utilization of a CPU on a device of the fabric	Y	Y	Y	Y	Y
device_fan_oper_status	The current operational status of a fan	Y	Y	Y	Y	Y
device_fan_rpm	The current rate of rotation of a fan	Y	Y	Y	Y	Y
device_fan_speed_percentage	The percentage of the max rotation capacity that a fan is currently rotating	N	Y	N	N	N
device_interface_in_broadcast_packets_total	The total number of broadcast packets received on the interface of a device	Y	Y	Y	Y	Y
device_interface_in_discards_packets_total	The number of discarded inbound packets by the interface of a device even though no errors had been detected	Y	Y	Y	Y	Y
device_interface_in_errors_packets_total	The number of inbound packets discarded at the interface of a device for errors	Y	Y	Y	Y	Y
device_interface_in_fcs_errors_packets_total	The number of received packets on the interface of a device with erroneous frame check sequence (FCS)	Y	Y	Y	Y	Y
device_interface_in_multicast_packets_total	The total number of multicast packets transmitted from the interface of a device	Y	Y	Y	Y	Y
device_interface_in_octets_total	The total number of octets received on the interface of a device	Y	Y	Y	Y	Y
device_interface_in_packets_total	The total number of packets received on the interface of a device	Y	Y	Y	Y	Y
device_interface_in_unicast_packets_total	The total number of unicast packets received on the interface of a device	Y	Y	Y	Y	Y
device_interface_operational_status	The operational state of an interface of a device	Y	Y	Y	Y	Y
device_interface_out_broadcast_packets_total	The total number of broadcast packets transmitted from the interface of a device	Y	Y	Y	Y	Y
device_interface_out_discards_packets_total	The number of discarded outbound packets at the interface of a device even though no errors had been detected.	Y	Y	Y	Y	Y
device_interface_out_errors_packets_total	The the number of outbound packets that could not be transmitted by the interface of a device because of errors	Y	Y	Y	Y	Y
device_interface_out_multicast_packets_total	The total number of multicast packets transmitted from the interface of a device	Y	Y	Y	Y	Y
device_interface_out_octets_total	The total number of octets transmitted from the interface of a device	Y	Y	Y	Y	Y
device_interface_out_packets_total	The total number of packets transmitted from the interface of a device	Y	Y	Y	Y	Y
device_interface_out_unicast_packets_total	The total number of unicast packets transmitted from the interface of a device	Y	Y	Y	Y	Y
device_memory_available_bytes	The current available memory at a device of the fabric	Y	Y	Y	Y	Y
device_memory_total_bytes	The total memory of this device	Y	Y	N	Y	Y
device_memory_utilized_bytes	The current memory utilization of a device of the fabric	Y	Y	Y	Y	Y
device_mount_point_space_available_megabytes	The amount of unused space on the filesystem	Y	Y	N	Y	Y
device_mount_point_space_usage_percentage	The used space in percentage	Y	Y	N	Y	Y
device_mount_point_space_utilized_megabytes	The amount of space currently in use on the filesystem	Y	Y	N	Y	Y
device_mount_point_total_space_megabytes	The total size of the initialized filesystem.	Y	Y	N	Y	Y
device_psu_capacity_watts	The maximum power capacity of a power supply	N	N	Y	N	N
device_psu_input_current_amps	The input current drawn by a power supply	Y	Y	Y	Y	Y
device_psu_input_power_watts	The input power to a power supply	Y	Y	N	Y	Y
device_psu_input_voltage_volts	The input voltage to a power supply	Y	Y	Y	Y	Y
device_psu_oper_status	The current operational status of a power supply unit	Y	Y	Y	Y	Y
device_psu_output_current_amps	The output current supplied by a power supply	N	Y	Y	N	N
device_psu_output_power_watts	The output power of a power supply	N	Y	Y	N	N
device_psu_output_voltage_volts	The output voltage supplied by a power supply	N	Y	Y	N	N
device_thermal_oper_status	The current operational status of a thermal sensor	Y	Y	N	Y	Y
device_thermal_temperature_celsius	The current temperature reported by a thermal sensor	Y	Y	Y	Y	Y

* Ctrl = Controller, SWL = A switch running Switch Light

EOS = A switch running Arista EOS, SN = Service Node, RN = Recorder Node.

DANZ Monitoring Fabric Deployment Guide