Table of Contents
– Arista’s Approach
– Arista EOS: CloudVision
– Network-Wide Core Service
– EOS CloudVision: Provisioning
– CloudVision Change Management
– Network-Wide Rollback
– Telemetry Use-Cases
– Workload Mobility
– Partner Use Cases
– Next Steps
– Use-Case Worksheet
Executive Summary
Architecting a fault-tolerant and resilient network fabric is only one part of the challenge facing network managers and operations teams today. It is simply not good enough to build a scalable, fault-tolerant network. Typical data centers can range from tens to hundreds, if not thousands, of networking devices.
Operational tasks alone at this scale have created an endless set of challenges faced by the implementation and operations teams. Often the cost of these challenges outweighs the benefits of introducing a new vendor or emerging technology.
To meet these challenges, network operations teams have glued together a basket full of reactive software tools in order to manage, monitor and alarm on a day-by-day, minute-by-minute basis. Such tools usually consist of either homegrown scripts or 3rd party tools that have little or no integration with each other. The end result is that every network issue ultimately requires human intervention in order to provision, manage or troubleshoot any infrastructure. This model adds to the complexity and sharply increases the OpEx costs of running infrastructures of any size. Additionally, it is counterproductive to the original business intent, which led to the purchase of these tools in order to drive down OpEx costs.
The industry is full of stories describing tremendous OpEx savings achieved by the largest (private or public) cloud companies using automation techniques. Naturally, companies of other sizes and in other markets want to achieve the same significant savings in their own environments.
However, most of these customers are handcuffed in moving forward with automation solutions, as they simply:
- Do not know which solutions to deploy — or:
- Do not know how to go about doing it — or:
- Do not have the budget — or:
- Do not have the resources to engage down this journey.
Businesses need to be able to simplify data center operations. The main reason they need to do this is to enable “business agility & flexibility” for workloads and workflows. This flexibility helps them drive their businesses at a more rapid pace.
The advantage that Arista brings to the data center is not only a reduction in CapEx costs with the artistry of its leading product portfolio, but also access to the benefits of a software driven cloud solution by leveraging its open and programmable Extensible Operating System (EOS) to drive down OpEx costs. Arista EOS CloudVision is an extension of EOS focused on giving customers a turnkey solution to cloud-like automation and visibility.
Arista’s Approach to Network Automation
There are several approaches to network automation, with the primary approaches summarized as follows:
Figure 1: Arista supports a variety of approaches to network automation
Do-It-Yourself (D.I.Y.) Automation– D.I.Y. solutions are typically deployed by ‘cloud titans’, such as Microsoft, Facebook, etc., who are building massive data center infrastructures. Often cloud titans have a 10x appetite for deploying infrastructure (server, database, network) compared to the largest enterprises. For them, data center automation is necessary to their business model as a means to compete in the marketplace. Such large organizations also have different application profiles, and their applications are designed to account for infrastructure failure. As a result, cloud titans often employ large software teams to write custom scripts to automate their infrastructure. Arista helps such customers by providing open tools like EOS SDK and eAPI, with unrestricted access to the kernel, so that they can deploy their own scripts and fully customize their network.
DevOps Model– This model is typically deployed by large Service Providers or larger enterprises as they embark on the automation journey. The approach takes prebuilt 3rd party automation frameworks that are typically already being used by the compute teams, such as Puppet, Chef, Ansible, CFEngine, etc., to consolidate provisioning tools, and applies those frameworks to the network infrastructure to drive down OpEx costs. Such customers are large enough to have resource pools available to write custom scripts to achieve some of the automation gains in their environment. They are more invested in the automation approach, with committed resources, budget and vision to achieve the OpEx reduction goals. Arista helps such customers by providing open tools like EOS SDK and eAPI, with unrestricted access to the kernel, so that they can deploy their own scripts or 3rd party tools like Puppet, Chef, etc. and fully customize their network.
Turnkey Solution– Few tools exist in the marketplace today to guide customers down the path to network automation. Arista’s CloudVision provides a turnkey solution: its portal allows customers to provision, manage and gain more visibility into an infrastructure, without preventing them from taking on more complex automation scripts in the future. This software is designed to help customers of all sizes, in particular small, mid-sized and large enterprises across every vertical, who are looking to reduce OpEx by applying the lessons learned and problems solved by the cloud providers in network automation. As mentioned before, other enterprises have the need and desire to embark on an automation journey, but do not have the time, skillset, or resources to do so. This is where CloudVision comes in.
Figure 2: Software Services with CloudVision
CloudVision builds on one of Arista’s core strengths: the Extensible Operating System’s (EOS) innovative state database model, called “SysDB”. SysDB acts as a broker between the many processes that make up EOS, enabling the scale, feature velocity and robustness that EOS is known for. SysDB holds the entire state for that switch, e.g., configuration, topology, protocol state, monitoring counter details, etc.
Arista EOS: CloudVision
Most provisioning and monitoring tools are reactive and manage each box individually, i.e. a box-by-box solution. There is no holistic view of the whole network.
CloudVision changes that operational model by taking a drastically different approach. It is built on the following four characteristics:
EOS Network Wide Service– Takes per-switch state and aggregates it to provide a consolidated, network-wide view of state.
Single Touch Point– Enables cloud mobility by providing an open, single touch point for any 3rd party controllers, orchestration services, or network services.
Turnkey Automation Solution– Provides a portal with turnkey workflows for automation and visibility.
Streaming Analytics and Telemetry– Improved network-wide visibility through granular state-streaming, an analytics engine, and telemetry visualization.
The rest of this white paper will focus on the various use cases where CloudVision can be deployed.
EOS CloudVision: Network-Wide Core Service
Figure 3: Arista EOS Architecture with NetDB
Why is a network-wide consolidated view key? Why do we need yet another model to manage the network? As an industry, we could keep managing the network for the next 25 years the way it has been managed for the last 25+ years: box-by-box, provisioned individually, managed separately, with no automation, no correlation to the health of the infrastructure, and no real OpEx savings. However, history has proven that this model does not work. So why has this not been done before?
Network operating software architecture has not allowed that to happen. With NetDB, EOS now extends to a distributed architecture, and CloudVision is able to aggregate and consolidate the network state into a network-wide view. This gives the operator full, proactive visibility of what is going on in the network at all times, without requiring administrators to log into every network device one by one. With CloudVision, administrators can view the status of their network today, yesterday, or a month ago to get a historical perspective on troubleshooting, traffic flows, and congestion.
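The aggregation idea described above can be illustrated with a small Python sketch. It is not CloudVision or NetDB code: the `NetworkState` class, its method names, and the sample state paths are hypothetical, and the only assumption is that each switch reports timestamped state changes to a central store.

```python
from bisect import bisect_right
from collections import defaultdict


class NetworkState:
    """Toy network-wide store of timestamped per-switch state (hypothetical)."""

    def __init__(self):
        # (switch, path) -> list of (timestamp, value), kept sorted by timestamp
        self._history = defaultdict(list)

    def update(self, switch, path, timestamp, value):
        """Record one state change reported by a switch."""
        entries = self._history[(switch, path)]
        entries.append((timestamp, value))
        entries.sort(key=lambda item: item[0])

    def value_at(self, switch, path, timestamp):
        """Return the value in effect at `timestamp`, or None if unknown."""
        entries = self._history.get((switch, path), [])
        times = [t for t, _ in entries]
        index = bisect_right(times, timestamp)
        return entries[index - 1][1] if index else None

    def view_at(self, path, timestamp):
        """Network-wide view of one path across every switch at a point in time."""
        switches = sorted({s for s, p in self._history if p == path})
        return {s: self.value_at(s, path, timestamp) for s in switches}


state = NetworkState()
state.update("leaf1", "Ethernet1/status", 100, "up")
state.update("leaf2", "Ethernet1/status", 100, "up")
state.update("leaf1", "Ethernet1/status", 250, "down")

print(state.view_at("Ethernet1/status", 200))  # view before leaf1 changed
print(state.view_at("Ethernet1/status", 300))  # view after leaf1 changed
```

Keeping every change, rather than only the latest value, is what makes the "today, yesterday, a month ago" style of query possible in this toy model.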
When a switch is managed, there are typically 3 things that can be changed on it - configuration, image or a script. Today, most customers may have tools to combine 1 or at best 2 of these items within the same tool. However, to be fully automated, what is needed is a tool that consolidates these management tasks into one simplified tool.
Today, most networks have multiple devices from the same networking vendor, which they have deployed over time and are running different code versions. Traditional inventory management tools are used to monitor platform versions and code versions.
However, these tools are not able to provide a service dashboard that actually shows whether there are any critical proactive software alerts, manages the lifecycle of software upgrades, or provides security reports. CloudVision solves that problem by providing a graphical dashboard for a variety of switch metrics, such as Layer 1, Layer 2 and Layer 3 information. CloudVision’s automatic Bug Visibility service also delivers proactive bug alerts and patches, which will greatly reduce OpEx for administrators.
When monitoring tools are deployed, they typically are purchased and deployed in pairs i.e. Active/Backup model or Active/Active model.
CloudVision changes that monolithic way of deploying management software by leveraging a fully distributed in-memory and archived Hadoop database, which allows it to be deployed in highly available clusters. This allows the infrastructure to scale as the network (i.e. number of devices) grows.
EOS CloudVision: Provisioning Use Cases
Even today, with the advent of so much automation, most network device provisioning, software upgrades, and configuration changes are still being done manually.
To resolve the above problem, Arista was the first in the networking industry to deliver Zero Touch Provisioning (ZTP). ZTP allows the customer to take a switch out of the box, rack it, and automatically provision it with a machine-generated configuration, officially approved image, or script without any human intervention – similar to how an IP Phone configures itself, or how a wireless access point configures itself with no manual intervention.
However, there was no turnkey way to orchestrate the ZTP process using a network wide view. CloudVision’s ‘Network Provisioning’ portal process allows the end user to create a logical network design diagram view to ensure devices are being provisioned with a data center leaf/spine topology view or any other network topology view, which represents the final deployment design. With a logical hierarchical inheritance, administrators can assign switches to different containers which in turn have the ability to apply settings like switch configuration, image version, and device labeling.
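The hierarchical inheritance described above can be sketched as a tree in which each container contributes settings and a child overrides its parent. This is an illustrative model only, not the portal's implementation; the `Container` class, the setting keys, and the placeholder image names are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Container:
    """A node in a hypothetical provisioning hierarchy."""
    name: str
    parent: Optional["Container"] = None
    settings: Dict[str, str] = field(default_factory=dict)

    def effective_settings(self) -> Dict[str, str]:
        """Merge settings from the root down; children override parents."""
        inherited = self.parent.effective_settings() if self.parent else {}
        return {**inherited, **self.settings}


# Placeholder names and versions, for illustration only.
tenant = Container("Tenant", settings={"image": "EOS-A", "ntp": "10.0.0.1"})
dc1 = Container("DC1", parent=tenant, settings={"label": "dc1"})
leafs = Container("DC1-Leafs", parent=dc1, settings={"image": "EOS-B"})

print(leafs.effective_settings())
```

A switch assigned to `DC1-Leafs` in this model picks up the NTP server from the top level and the label from its data center, while its own container decides the image.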
When network switches are managed, typically a configuration, image and script are used to provision and manage change controls for that switch. The ‘Network Provisioning’ portal allows a customer to perform all three actions at the same time in a network-wide view.
To take ZTP a step further, CloudVision allows administrators to deploy not only brand new switches in remote locations without requiring an engineer to manually configure them, but replacements as well. Zero Touch Replacement (ZTR) allows the replacement for a switch that has failed or been decommissioned to inherit the configuration and settings of the existing switch, without requiring all settings to be applied from scratch. Once again, the flexibility of the EOS single binary image makes it easy to move switch settings from one switch to another.
CloudVision now allows the dynamic creation of configuration via Configlet Builder, a way to programmatically create device configurations so that administrators do not have to manually create each configuration for every switch. Using a user interface (UI) and a Python engine, administrators can create their own “wizard”-like prompts to create configurations for any EOS feature, which can then be applied to a switch or a container full of switches.
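To make the idea of programmatically generated configuration concrete, here is a minimal Python sketch. It does not use the Configlet Builder API, which is not described here; `build_configlet`, the template, and the input field names are hypothetical, and the CLI lines are shown only as an example of generated text.

```python
from string import Template

# Hypothetical template; the CLI lines are illustrative interface
# configuration, not output of any Arista tool.
INTERFACE_TEMPLATE = Template(
    "interface $name\n"
    "   description $description\n"
    "   switchport access vlan $vlan\n"
)


def build_configlet(hostname, ports):
    """Render one configuration snippet from structured inputs."""
    sections = [f"hostname {hostname}\n"]
    for port in ports:
        sections.append(INTERFACE_TEMPLATE.substitute(port))
    return "!\n".join(sections)


# The answers a "wizard" prompt might collect, as plain data.
answers = {
    "leaf1": [{"name": "Ethernet1", "description": "server-a", "vlan": 10}],
    "leaf2": [{"name": "Ethernet1", "description": "server-b", "vlan": 20}],
}

configs = {host: build_configlet(host, ports) for host, ports in answers.items()}
print(configs["leaf1"])
```

The same function produces one configuration per switch from a single set of structured answers, which is the saving the paragraph above describes.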
ZTP solutions were first born out of the need for automating the initial deployment of a switch in the infrastructure i.e. a ‘day zero’ process. To obtain OpEx cost reductions of managing the asset during the life cycle of its deployment in the data center, CloudVision is expanding the scope of ‘Zero Touch’ to a broader perspective to help automate ongoing changes over the lifecycle of the network devices. Customers are enabled to use a turnkey portal-based ZTP and ZTR solution to provision the device initially and throughout its lifecycle.
EOS CloudVision: Change Management Use Cases
Typically, enterprise customers perform change controls outside production hours and request a change control window. When the change control window starts, the engineer performing the change will carry out pre-change control procedures, e.g. capturing switch interface status, VLAN status, IP routing status, multicast status, ACLs, QoS configuration, etc., using a number of show commands. These scripts may be run on a single device or on a larger set of devices, depending on the size of the change. Once the change has been completed, the engineer will most probably run exactly the same scripts again. The reason these scripts are run is to ensure that the delta produced by the change is as expected. The only way to ensure this delta is accurate is for the engineer to manually compare the pre- and post-change status. If the change impacts a large number of devices, it is not possible to manually ensure 100% accuracy, and there is a reliance on sample-based confirmation, which substantially increases the risk of the change. Depending on the device or the complexity of the change, verifying the change manually can take an hour per device.
To eliminate this manual and potentially error-prone process, and to substantially increase the accuracy of the change control, CloudVision presents change control status via a new ‘Snapshot Mode’. This unique approach leverages the underlying database principles of SysDB. Capturing various fields or states of the database before and after the change, and reporting them per device or per set of devices (referred to as ‘Config Containers’), drastically reduces the time needed to perform the change control.
CloudVision is designed to automate proactive changes as well as identify ad hoc or non-standard changes. With a compliance checking capability, CloudVision can detect and report on any network devices that deviate from the gold standard configuration or image versions. Sometimes device-specific deviations are intentional and necessary; sometimes they are unauthorized. Regardless of the reason, the network operator should have an up-to-date and accurate compliance view of all the devices in the network inventory. With CloudVision, compliance and delta checks are a few clicks away.
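The before-and-after comparison can be pictured as a dictionary diff applied to every device in a set. The sketch below is an assumption about how such a comparison might be modeled, not a description of Snapshot Mode internals; the function names and the sample state keys are hypothetical.

```python
def diff_snapshots(before, after):
    """Return added, removed, and changed keys between two state snapshots."""
    added = {k: after[k] for k in after.keys() - before.keys()}
    removed = {k: before[k] for k in before.keys() - after.keys()}
    changed = {
        k: (before[k], after[k])
        for k in before.keys() & after.keys()
        if before[k] != after[k]
    }
    return {"added": added, "removed": removed, "changed": changed}


def diff_container(before_by_device, after_by_device):
    """Apply the diff per device and keep only devices whose state changed."""
    report = {}
    for device in sorted(before_by_device.keys() | after_by_device.keys()):
        delta = diff_snapshots(
            before_by_device.get(device, {}), after_by_device.get(device, {})
        )
        if any(delta.values()):
            report[device] = delta
    return report


# Hypothetical pre- and post-change state for a two-switch container.
before = {
    "leaf1": {"Ethernet1": "up", "vlan10": "active"},
    "leaf2": {"Ethernet1": "up"},
}
after = {
    "leaf1": {"Ethernet1": "down", "vlan10": "active", "vlan20": "active"},
    "leaf2": {"Ethernet1": "up"},
}

print(diff_container(before, after))
```

Because the comparison covers every captured key on every device, it avoids the sample-based confirmation that manual checks fall back on.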
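A compliance check of this kind amounts to comparing each device's running configuration with the gold standard and reporting the differences. The following sketch uses Python's standard `difflib` module; the `compliance_report` function and the sample configuration lines are hypothetical and only illustrate the comparison, not CloudVision's own checker.

```python
import difflib


def compliance_report(golden, running_by_device):
    """Map each non-compliant device to a unified diff against the golden config."""
    report = {}
    for device, running in running_by_device.items():
        diff = list(
            difflib.unified_diff(
                golden.splitlines(),
                running.splitlines(),
                fromfile="golden",
                tofile=device,
                lineterm="",
            )
        )
        if diff:  # identical configurations produce an empty diff
            report[device] = diff
    return report


# Placeholder configuration text for illustration.
golden = "ntp server 10.0.0.1\nlogging host 10.0.0.2"
running = {
    "leaf1": "ntp server 10.0.0.1\nlogging host 10.0.0.2",
    "leaf2": "ntp server 10.0.0.1",
}

for device, diff in compliance_report(golden, running).items():
    print(device)
    print("\n".join(diff))
```

Whether a reported deviation is intentional or unauthorized remains a decision for the operator; the report only makes the deviation visible.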
CloudVision takes the change control to the next level, by introducing pre-integrated Smart System Upgrade (SSU).
SSU allows the customer to take the whole or selected parts of the network out of service easily without impacting application traffic, thereby reducing OpEx again.
Typically, if a change control has not gone smoothly, the engineer(s) performing the change have to roll back their change, i.e. configuration, image or script. They have to do it methodically, step-by-step and device-by-device.
For the first time, a network-wide rollback is available. It is very easy for the user to simply revert to a previous state of the network.
Figure 4: CloudVision Snapshot Views
EOS CloudVision: Network-Wide Rollback Use Cases
Building on top of the Snapshot feature, where CloudVision keeps a revision history of each configlet on each switch, Network Wide Rollback brings this concept to maintenance windows, so that the before and after states of a change can be compared.
All enterprise networks have maintenance windows in order to make changes to adjust to business needs. However, any time a maintenance window or change happens, there may be a need to roll back to a previous configuration for unforeseen reasons. Just as virtualization provides the ability to take a snapshot and roll back to a previous date, CloudVision now brings this concept to the networking world.
Figure 5: CloudVision Network Rollback
One issue with traditional network operating systems is the inability to easily move between different revisions of code or configuration. Network engineers in the past have used notepad files or spreadsheets to get through their maintenance windows. CloudVision now offers an easier approach: leveraging Arista’s eAPI allows for a quicker change between two different states on one, some, or all switches in your network.
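As a rough illustration of driving the same change on one, some, or all switches through eAPI, the sketch below builds the JSON-RPC request bodies without sending them. eAPI accepts JSON-RPC 2.0 requests over HTTP(S) using the `runCmds` method; everything else here is an assumption made for the example, including the hostnames, the checkpoint file name, and the use of `configure replace`, which should be verified against the EOS documentation for your release.

```python
import json


def build_eapi_request(commands, request_id="1"):
    """Build a JSON-RPC 2.0 body for eAPI's runCmds method."""
    return {
        "jsonrpc": "2.0",
        "method": "runCmds",
        "params": {"version": 1, "cmds": list(commands), "format": "json"},
        "id": request_id,
    }


def plan_rollback(switches, checkpoint):
    """Return one (url, body) pair per switch; nothing is sent over the network."""
    # Assumption: the checkpoint file exists on each switch and
    # `configure replace` is appropriate for the EOS release in use.
    commands = ["enable", f"configure replace {checkpoint}"]
    body = json.dumps(build_eapi_request(commands))
    return [(f"https://{switch}/command-api", body) for switch in switches]


# Hypothetical hostnames and checkpoint name.
plan = plan_rollback(["leaf1.example.net", "leaf2.example.net"], "flash:pre-change.cfg")
for url, body in plan:
    print(url, body)
```

Separating the planning step from the sending step lets an operator review exactly what would be pushed to each switch before anything changes.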
CloudVision’s change management framework can integrate into northbound 3rd parties as well. By leveraging the well-known objects documented in our API Command Guide, operators can use these open APIs to further customize CloudVision into existing management infrastructure and other 3rd party management platforms.
EOS CloudVision: Telemetry Use Cases
In a cloud network spine-leaf design, it is key to be able to track workflows. Without visibility, network operators are driving blind when determining outage causes or planning capacity. Arista EOS has a long history of network telemetry tools, called Tracers, which provide visibility into the devices, the topology, and even the workloads. These tools have been a strong foundation to ensure visibility and reduced mean time to resolution (MTTR) when troubleshooting a spine-leaf architecture.
Traditional visibility tools are built on SNMP polling-based approaches that gather state every few minutes, thus only providing a very limited view of the network state. With NetDB, Arista EOS devices store all real-time state in one common database and then aggregate that state from all devices into a network-wide view. By collecting every state change on the network, Arista customers will have access to both real-time and historic views of the network in one place and at a level of granularity never before achievable.
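The difference between polling and collecting every state change can be shown with a tiny simulation. The event list and the five-minute interval below are made-up values chosen for illustration, and `polled_view` is a hypothetical helper, not part of any SNMP or CloudVision tooling.

```python
def polled_view(events, interval, end_time):
    """Sample the latest known value every `interval` seconds, as a poller would."""
    samples = []
    for poll_time in range(0, end_time + 1, interval):
        current = None
        for timestamp, value in events:  # events are assumed sorted by time
            if timestamp <= poll_time:
                current = value
        samples.append((poll_time, current))
    return samples


# Hypothetical link-state changes as (seconds, state): a 20-second flap.
events = [(0, "up"), (130, "down"), (150, "up")]

polled = polled_view(events, interval=300, end_time=600)
streamed = list(events)  # streaming keeps every state change

print(polled)    # every sample reports "up"; the flap is invisible
print(streamed)  # the "down" event at 130 seconds is preserved
```

In this example the poller reports a healthy link at every sample, while the streamed record retains the brief outage that an operator would want to investigate.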
To leverage this rich network data, the CloudVision® platform is now enhanced to provide both the analytics engine and telemetry visualization for this network-wide state. On the backend, a scalable state repository built on open-source HBase runs an analytics engine to track trends, correlate data across devices and layers, and detect anomalies. On the front-end, new telemetry apps for the CloudVision Portal, including the Workstream Analytics Viewer, provide simplified visualization of network-wide state for faster time to resolution.
Figure 6: CloudVision’s Workstream Analytics Viewer
As demand on a network increases with the onset of server virtualization, consolidation, IP storage, Hadoop, there will be times of congestion on the network. When there is congestion on the network, Arista switches have a feature called ‘LANZ’ (Latency ANalyZer) which can highlight proactively when there was congestion and the impact of the latency. However, this is by box and not holistic for the network.
CloudVision helps the network operator to manage the health and congestion network wide and to report any hot spots there may be on a specific port or link. This allows the operator to quickly move workloads and workflows to less demanding resources on the network.
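Moving from per-box congestion data to a network-wide view is essentially an aggregation across switches. The sketch below assumes congestion events have already been collected as simple records; the record fields, the sample data, and `rank_hot_spots` are hypothetical and do not reflect the actual LANZ data format.

```python
from collections import Counter


def rank_hot_spots(congestion_events, top=3):
    """Count congestion events per (switch, port) and return the busiest ports."""
    counts = Counter((e["switch"], e["port"]) for e in congestion_events)
    return counts.most_common(top)


# Made-up congestion records gathered from several switches.
congestion_events = [
    {"switch": "leaf1", "port": "Ethernet1"},
    {"switch": "leaf2", "port": "Ethernet5"},
    {"switch": "leaf1", "port": "Ethernet1"},
    {"switch": "spine1", "port": "Ethernet10"},
    {"switch": "leaf1", "port": "Ethernet1"},
    {"switch": "leaf2", "port": "Ethernet5"},
]

for (switch, port), count in rank_hot_spots(congestion_events, top=2):
    print(switch, port, count)
```

Ranking across the whole network, rather than per switch, is what points the operator at the ports where moving workloads would help most.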
Most organizations are moving to a more virtualized infrastructure and have typically virtualized up to 80% to 90% of the servers in their compute environments using hypervisor-based solutions. However, if there is a production issue, it is very difficult to correlate the virtualized compute world to the physical compute and network world. Arista has had a feature called ‘VMTracer’ for a number of years that provides this correlation quickly from the command line of that switch.
CloudVision makes it very easy to search for, find, and see the state of any virtual machine. This will help to reduce the operating challenges and to make it easier to run a tightly integrated overlay and underlay network.
Arista has integrated its Tap Aggregation Manager GUI into a single view within the CloudVision portal, thus giving the customer a common starting point to manage their tap aggregation network.
EOS CloudVision: Workload Mobility Use Cases
There are various ways to implement VXLAN control plane functionality using multicast or BGP. However, these approaches are not plug-n-play solutions and are not open to integration with the array of SDN controllers.
CloudVision provides a simplified approach to deploy VXLAN overlays for mobility within the data center. Using open standards based APIs like OVSDB or JSON, CloudVision is the platform for integration with Arista’s ecosystem of orchestration, overlay controller, and service delivery partners like VMware NSX or OpenStack.
Network visibility is key to a network monitoring tool set for quick identification of issues. Traditional network monitoring tools are unable to provide visibility into overlay VXLAN tunnels.
CloudVision bridges this gap in network visibility and provides a topological view of the overlay network that helps with troubleshooting and monitoring the environment.
CloudVision increases the visibility of the overlay network by stitching the virtual hosts and tunnel to the physical network ports and providing a centralized view within the portal.
This allows the network designers and operators to make changes in the network with a higher degree of confidence.
CloudVision can be deployed in a clustered high availability mode with 3 servers in active/active state. In the event of a single server outage, there is no disruption in the network, and the overlay network is uninterrupted, thereby maintaining high availability.
EOS CloudVision: Partner Use Cases
Most SDN controllers are focused on the overlay network itself and are not tightly coupled with the underlay network.
CloudVision provides that openness to serve as a central integration point to all 3rd party controllers, such as VMware NSX, Microsoft, Nuage, etc. CloudVision also provides a more scalable solution as it does not require the controller to talk to every single network device. Instead, the SDN controller simply talks to CloudVision’s central integration point, which will then communicate the overlay information to the rest of the VTEP devices.
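The scaling argument above, one controller session instead of one per device, can be sketched as a simple broker. This is a conceptual model only; `IntegrationPoint`, its methods, and the sample MAC and IP values are hypothetical and say nothing about the protocols CloudVision actually uses toward controllers or VTEPs.

```python
class IntegrationPoint:
    """Hypothetical broker between one controller and many VTEPs."""

    def __init__(self):
        self.vtep_tables = {}  # VTEP name -> {MAC address: remote VTEP IP}

    def register_vtep(self, name):
        """Add a VTEP that should receive overlay updates."""
        self.vtep_tables[name] = {}

    def publish(self, mac, vtep_ip):
        """Accept one update from the controller and relay it to every VTEP."""
        for table in self.vtep_tables.values():
            table[mac] = vtep_ip
        return len(self.vtep_tables)  # number of VTEPs updated


broker = IntegrationPoint()
for vtep in ("leaf1", "leaf2", "leaf3"):
    broker.register_vtep(vtep)

# The controller makes a single call, regardless of how many VTEPs exist.
updated = broker.publish("00:1c:73:00:00:01", "10.1.1.1")
print(updated, broker.vtep_tables["leaf3"])
```

Adding a fourth or a four-hundredth VTEP changes the broker's fan-out but not the number of sessions the controller has to maintain.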
In addition to supporting OpenStack integration, CloudVision is fully open to supporting any customized controller that the customer may want to deploy. This provides the customer the choice of not being locked into any single overlay vendor.
CloudVision is the platform for integration with other best of breed solutions, including OpenStack integration (Rackspace, RedHat, etc), overlay controller integration (VMware NSX, Microsoft, Nuage, BYOC, etc), flexible compute integration (HP, Dell, etc), application services (L4-7) integration (Palo Alto Networks, F5, Checkpoint, etc), workflow tool integration (ServiceNow, HP OneView, etc), telemetry tool integration (Splunk, Corvil, etc), optical transport integration (Infinera), or storage partner integration. In addition, CloudVision can integrate with customer-specific or third party controller and network management solutions through its open APIs.
Of the various points of integration, the network service (firewall, load-balancer, etc) is often the most difficult to design into today’s cloud networks, with both virtualized and non-virtualized hosts as well as increasing east-west traffic patterns. The challenge becomes where to place the network service so that it is in the data plane path to make a filtering or load balancing decision for most – if not all – traffic in a data center. With Arista’s Macro-Segmentation Services (MSS), these network services can be more efficiently inserted into network designs, regardless of where traffic originates. MSS runs on CloudVision, which sees the entire physical network and serves as a broker point for service insertion by leveraging APIs to the appropriate service device. And MSS doesn’t change the service operations model, as all service enforcement and administration is in the domain of the appropriate service appliance.
For more information on Macro-Segmentation, see our Solution Brief
These open APIs give customers the flexibility to write scripts that integrate their environment with any 3rd party vendor to manage the network.
Next Steps
The next step within any organization is to embark on the automation journey. This section helps guide how to start down this path and allows you to progress to very advanced levels of automation should you choose to do so.
Automation within your data center is not a specific destination but more of a journey, with newer ideas being generated as improvements are made. The best recommended approach is to:
a. Create a list of use cases that need to be tackled within the organization. Use the above white paper as a tool to generate ideas on which use cases need to be tackled to compile the initial list.
b. The next step is to prioritize the list of use cases that are relevant for your organization. Use the 80:20 rule to prioritize the list, by identifying the 20% of use cases that will provide the largest benefit.
c. If automation has never been deployed before within your organization, the recommendation is to start with the easiest use case first even if it brings only small value. Once confidence is built within the organization that automation is truly delivering a success story, more complex use case scenarios can be tackled subsequently.
Consider the list of use-cases in Table #1 to help embark on this automation journey:
a. Prioritize your category which needs to be automated first by giving it a ranking from 1-6
b. Within each category prioritize each use case by giving it a priority 1-10
c. Estimate the Opex Savings (in currency) that can be potentially achieved
d. Work with your Arista account team to review the use case sheet and how best to embark on the CloudVision deployment
CloudVision Use Cases Worksheet
|CloudVision Use Case|Priority (1-10)|Opex Savings ($)|
|Automate new switch provisioning using ZTP| | |
|Create custom scripts to be managed on switches| | |
|Maintain golden images post code certification| | |
|Manage & Provision switch configuration, image, scripts centrally| | |
|Smart System Upgrade| | |
|Streaming Analytics and Telemetry| | |
|Network wide virtualization visibility| | |
|Centralized TAP Aggregation tool| | |
|VXLAN plug-n-play deployment| | |
|VXLAN topology view| | |
|VXLAN virtualized host view| | |
|OpenStack Integration (Rackspace, RedHat, etc.)| | |
|Overlay Controller Integration (VMware NSX, Microsoft, Bring Your Own Controller ‘BYOC’, etc.)| | |
|Flexible Compute Integration (HP, Dell, etc.)| | |
|Macro-Segmentation Services (Palo Alto, F5, Checkpoint, Fortinet, etc.)| | |
|Workflow Tool Integration (ServiceNow)| | |
|Telemetry Tool Integration (Splunk, Corvil, etc.)| | |
|Optical Transport Integration (Infinera, etc.)| | |
|CloudVision Core Services| | |
|Network wide view| | |
|Consolidation of various network monitoring tools e.g. change management tools, scripting tools, code deployment tools| | |
|Consolidated reporting within a single tool for compliance, security, alerting and management reasons| | |
|Deployment of monitoring tools within global data centers| | |
|TOTAL ESTIMATED OPEX SAVINGS:| | |
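Once the worksheet has been filled in, the prioritization steps can be sketched in a few lines of Python. The savings figures below are made-up placeholders, not estimates, and `summarize_worksheet` is a hypothetical helper that totals the savings column and picks the top 20% of use cases by estimated benefit, following the 80:20 rule suggested earlier.

```python
def summarize_worksheet(rows, share=0.2):
    """Sort use cases by estimated savings, total them, and pick the top share."""
    ranked = sorted(rows, key=lambda row: row["savings"], reverse=True)
    total = sum(row["savings"] for row in ranked)
    keep = max(1, round(len(ranked) * share))
    return {"total": total, "focus": [row["use_case"] for row in ranked[:keep]]}


# Placeholder numbers only; replace with your own estimates.
worksheet = [
    {"use_case": "Automate new switch provisioning using ZTP", "savings": 25},
    {"use_case": "Smart System Upgrade", "savings": 40},
    {"use_case": "Streaming Analytics and Telemetry", "savings": 20},
    {"use_case": "VXLAN topology view", "savings": 5},
    {"use_case": "Network wide view", "savings": 10},
]

summary = summarize_worksheet(worksheet)
print(summary["total"], summary["focus"])
```

If automation is new to the organization, the earlier advice still applies: the highest-ranked use case is not necessarily the one to start with, and an easier one may be a better first step.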
No matter which stage of the automation life cycle the organization is in, CloudVision can help tackle simple or advanced use cases. If additional use cases are required, please share use case scenarios with your Arista account team, who would be more than happy to incorporate ideas in subsequent releases.
Summary
Shifting spend from IT Operations to innovation and meeting business needs more quickly are the key goals for every CIO. The only way to obtain the substantial OpEx cost reductions required to remain competitive is to automate the network environment. Traditional approaches have been shackled by closed or limited network operating systems. This seriously restricts the ability of an organization to be agile and flexible as the requirements of its data centers change quickly. CloudVision gives companies, for the first time, the flexibility to manage a network infrastructure network wide using any of the following methods: CLI, API, scripts, or a portal.
Arista EOS CloudVision is built on an innovative network-wide database architecture and is a truly open, next generation solution for cloud-like operations. With a focus on easy provisioning, configuration, image management, troubleshooting, visibility, security and 3rd party integration, CloudVision provides the platform to allow an organization to start leveraging its network automation in ways it was never able to do before, and drastically reduces OpEx costs to run the infrastructure.