Network Automation Levels

Jan Lindblad · ‎12-20-2022

Network automation is a hot topic in the industry today and many will tell you about how they have successfully automated their network. But what do they mean by network automation? What is network automation?

In general, anything that moves work tasks from a human operator to a computer is automation. However, such a broad definition makes it hard to have a conversation on the topic. To simplify things, we attempt to define five levels that classify varying degrees of automation.

It is similar to what the car industry has done with regard to self-driving cars. Initially, the concept could be taken to mean anything from a car with lane assist and adaptive cruise control, all the way to a driverless taxi. In order to sort things out, some automotive industry people got together and produced a list of five levels of self-driving cars where the criteria for meeting each level were clearly explained. The Wikipedia page for self-driving cars (https://en.wikipedia.org/wiki/Self-driving_car) includes this list, which makes it easier to reason and have a conversation about self-driving cars.

Nomenclature

Before diving into the levels, we quickly need to cover the environment and context in which it is relevant to define these levels.

The network or networks are managed by one or more **management systems**. The management systems configure **services** in the network to provide value to the users of the network. Usually there are many types of services and many instances of each service type. For example, a service could be Internet connectivity for a residence or a point-to-point Layer 2 Ethernet connection. Services can be stacked, i.e. one service can create or call another service in order to form the full service. For example, to provide an Internet connection, a Layer 2 Ethernet connection from the customer location through the access network to the nearest edge router needs to be created too. The network can be divided into domains, with each management system managing its own domain. For example, the access network can be managed by one management system while the edge routers providing Internet connectivity are in another domain. The management system of each domain can talk to the management systems of other domains.

Level 0: Text Templates

If you have a network with no automation, you're likely typing commands on a Command Line Interface (CLI) directly on the device, or possibly clicking around in a Graphical User Interface (GUI). Seasoned operators tend to prefer the text-based interface as it allows them to paste prepared chunks of configuration, which is much faster than manually typing things into numerous text input fields in a GUI. While we're hesitant to call this automation, at least it takes a lot of typing out of the management work.

A Method of Procedure (MOP) is a document that describes a sequence of tasks to perform in order to reach a goal. It's a little bit like a computer program with a sequence of instructions, except that it's intended for a human to process and carry out those instructions. There are many MOPs for different situations, for example how to install a new customer circuit or what to do when a backbone link goes down, as indicated by an alarm. MOPs are usually of the form: run show command X, look at some part of the output, then copy and paste the following lines of text into the CLI, first replacing the <customer-id> placeholder with the actual Customer ID.

With a very low threshold to entry, MOPs are a great way to start. Just write up new MOPs as the need for new ones arises, or adjust old ones when you've learned a better way of doing something. Over time this approach will lead to a great variety in how services are implemented in the network. Old service instances that are already in the network are rarely updated when the MOP template is updated. Instructions in MOPs are performed by humans, so the tasks are often expressed as rather fuzzy high-level tasks, which rely on the experience of the human operator by requiring judgment calls or implicit checks, i.e. "common sense". This makes the approach very flexible but also largely dependent on the human operators.

The largest drawback is a major risk of human error. For instance, pasting the correct configuration into the wrong CLI terminal window, involuntarily causing an L2 loop, might turn your L2 network with 164 devices into 164 isolated L2 networks. The cleanup from that is rather painful. And even if pasting prepared configuration removes the time spent typing, having human operators carry out MOPs scales poorly. More customers and more services directly translate into a need for more human operators, which drives up the network operations costs.

Level 1: Macro scripts

To get to level 1, up one level from cutting and pasting, you just turn the MOPs into scripts that an operator can run. A script will take as arguments the variables that were listed as placeholders in the MOPs, such as the Customer ID in the previous example. The input arguments together with the stored configuration template are expanded into configuration that can be sent to a device. It is equivalent to macro expansion. Sometimes the variable arguments are retrieved from an inventory system rather than passed as script arguments. These scripts are often implemented in Perl or Python, or driven by simple automation tools like Ansible. Basically they contain the configuration template that used to be in the MOP document and replace the variables with the argument values.
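As a minimal sketch of such a macro script, the expansion step could look like the following in Python. The CLI syntax, the `CIRCUIT_TEMPLATE` name and the `customer_id` and `vlan` arguments are all made up for illustration and do not come from a real MOP or vendor.

```python
from string import Template

# Hypothetical configuration template lifted from a MOP document.
# The CLI syntax is illustrative only, not that of a specific vendor.
CIRCUIT_TEMPLATE = Template(
    "interface GigabitEthernet0/1.$vlan\n"
    " description customer $customer_id\n"
    " encapsulation dot1q $vlan\n"
)


def render_circuit(customer_id: str, vlan: int) -> str:
    """Expand the template with the script arguments (macro expansion)."""
    # substitute() raises KeyError if a placeholder is left unfilled.
    return CIRCUIT_TEMPLATE.substitute(customer_id=customer_id, vlan=vlan)


if __name__ == "__main__":
    print(render_circuit("ACME-42", 100))
```

Nothing in this sketch inspects the device before or after the change; it only turns arguments into text.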

The MOP is a procedure written for a human to carry out. By converting it into a script, we effectively translate that procedure into a programming language so a computer can carry out the work instead. Computers are great because they are very fast and consistent. However, they have no inherent knowledge or experience, so the procedure in the script needs to be much more precise.

At level 1, the macro command scripts typically operate in a "blind" mode. There is little or no up front checking of the existing configuration and conditions, to ensure they are as expected, before carrying out the provisioning procedure. Similarly, after configuration has been pushed, there are no checks afterwards to ensure that the desired outcome has been achieved. Since there are no or few checks, it is rare that an error is detected. If something goes terribly wrong during the execution, the macro command gives no guarantees as to what state the network will be left in. Cleanup would be left as a manual exercise to the operator.

Networks on this level typically also have some sort of alarm management system and collection of operational state data in place that gives the operators an aggregated view of the performance of their network.

Macro scripts remove many of the human mistakes and much of the variation in the low-level text handling, intentional or otherwise. While reliability is significantly improved, the approach is still prone to errors due to environmental factors and may easily leave the network in an inconsistent state. The investment cost is only slightly higher than the level 0 approach, but the much higher speed of changes saves a lot on the network operation expenses. The flexibility of the operator to solve problems in an ad hoc manner is considerably lower, for better and for worse.

Level 2: Adaptive Activation Scripts

The level 2 Adaptive Activation approach is fundamentally similar to macro scripts, but with some logic sprinkled on top. First, there would be proper pre-flight checks to see that the environment is properly prepared for the commands to run. Certain information or facts can be retrieved from the target and this information is in turn used to affect the configuration, for example a configuration section can be skipped if it is found to already be configured on the target device. After the command has run, there would be proper post-flight checks to see that the desired goals have been reached.
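The pattern described above can be sketched roughly as follows. `FakeDevice` is a purely hypothetical in-memory stand-in for a real device, and the VLAN example and function names are assumptions chosen to keep the sketch short.

```python
class FakeDevice:
    """In-memory stand-in for a real device; purely hypothetical."""

    def __init__(self, vlans=None, reachable=True):
        self.vlans = set(vlans or [])
        self.reachable = reachable

    def add_vlan(self, vlan):
        self.vlans.add(vlan)


def activate_vlan(device, vlan):
    """Configure a VLAN with pre-flight and post-flight checks."""
    # Pre-flight: is the environment ready for the change?
    if not device.reachable:
        raise RuntimeError("pre-flight failed: device unreachable")

    # Adapt to facts read from the target: skip what is already there.
    if vlan in device.vlans:
        return "skipped"

    device.add_vlan(vlan)

    # Post-flight: was the desired goal actually reached?
    if vlan not in device.vlans:
        raise RuntimeError("post-flight failed: VLAN missing after change")
    return "configured"
```

Every check here is written by hand for this one operation, which is the defining trait of this level.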

If a problem is detected, the script will handle common errors, either working around the issue to get the operation to complete successfully, or aborting the change and restoring the network to a known good state. The restoration of the previous good state typically relies on rollback functionality of the target device, or it is manually implemented as the inverse of the applied configuration, which means it is quite brittle and prone to going out of date. In this sort of environment, there would also be a script to remove the service, and perhaps to update existing services in certain pre-defined ways, e.g. to adjust a particular parameter or go from the basic to the premium version of the service.

Many users feel that they have arrived at a "fully automated" network when this level is reached, since there are scripts to handle the full service life cycle, i.e. creation, deletion and the typical kinds of updates. Occasionally error situations arise that the scripts are not able to recover from, and manual intervention is required, but that's true of every network, isn't it?

While many of the use cases or MOPs have indeed been automated, the manner in which they are implemented is brittle and prone to errors. Idempotency is typically implemented in a manual fashion. The configuration template author must condition every piece of configuration on some pre-flight check. When the script is updated due to service definition changes over time, the updated script is not necessarily able to handle deletion and updates of all the older service instances. After a few generations and service definition changes, the combinatorics of all operations across all versions of service instances make issues creep up.

Level 3: Model Driven Services

At levels 1 and 2, the messages sent by the management system to the devices, containing configuration payloads or requests for operational data, have been authored by a human programmer using some sort of templates with variable placeholders. A MOP would consist of a sequence of such messages that gets the particular task done. It is up to the programmer to figure out a sequence of messages that works, and they can do that using any kind of tools, prior knowledge or trial and error. Once a working sequence of messages is determined, it's a simple matter of making sure the management system replicates that sequence to carry out the operation. There is no dependency on any sort of standards, and as long as there is a workaround for any issue found, the hand-crafted messages can be updated to avoid the problem.

The drawback with the level 1 and 2 approach as described is exactly the same as its strength: it depends entirely on a programmer to foresee all situations the system should deal with. On these low automation levels, the system will forever be limited in what it can send to the devices by the set of messages the programmer has prepared. If something out of the ordinary happens, there might not be any sequence of prepared messages that can get the network back to where it should be. This is analogous to our example with self-driving cars, where some vehicles today navigate along paths prepared in advance. Even if this approach easily handles multiple destinations, and alternative paths between them, it is still a far cry from autonomous navigation.

If the journey is from Paris to Dakar, it's going to be a lot of work to prepare all necessary turns in advance, due to the sheer number. It's going to be even worse if we need a few alternate routes. Plus many cross over routes in between the alternate routes, so that we don't need to backtrack to the start in order to take the alternate route. Ah, and yes, did I mention all the paths needed to backtrack from the current position back to the start or to the closest of the cross over routes?

No, as the journey becomes longer and more complex, autonomous navigation becomes an indispensable requirement. This realization has a profound consequence. We can no longer rely on a programmer to prepare templates for every situation the management system might find itself in. In this world, the management system must be able to send commands that it has computed on its own, messages that are tailored to the current situation and desired goal. To accomplish this feat, the management system derives all the messages to send to devices from two things: knowledge of the current state on each device, and data models (or "schemas") that describe the behavior of the devices. This approach is referred to as a model driven architecture. Basically, the management system reads the current state of the devices, compares with the desired state, and computes the operations required to take the device's current state to the desired state based on the models that describe the devices' behavior.
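The compare-and-compute step can be illustrated with a deliberately small sketch. It assumes the device state is reduced to a flat mapping from configuration path to value, which is a large simplification of a real data model; the function names and the operation tuples are hypothetical.

```python
def compute_operations(current: dict, desired: dict) -> list:
    """Return the operations that take `current` to `desired`.

    Both arguments are flat dicts mapping a config path to its value.
    """
    ops = []
    # Anything present now but absent from the goal must be deleted.
    for path in sorted(current.keys() - desired.keys()):
        ops.append(("delete", path, None))
    # Anything missing or different must be created or replaced.
    for path in sorted(desired):
        if path not in current:
            ops.append(("create", path, desired[path]))
        elif current[path] != desired[path]:
            ops.append(("replace", path, desired[path]))
    return ops


def apply_operations(state: dict, ops: list) -> dict:
    """Apply computed operations to a copy of `state` and return it."""
    new_state = dict(state)
    for op, path, value in ops:
        if op == "delete":
            del new_state[path]
        else:
            new_state[path] = value
    return new_state
```

No operation is written in advance here: whatever the current and desired states are, the list of operations is derived from them.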

Since there are no fixed operation templates to consider, this approach allows creation, deletion and any kind of update. Reversing a previously applied configuration change is also possible by computing the inverse of a patch using knowledge of the data model. Not only does this free us from manually having to write down the reverse patch, but we are also guaranteed an always up to date version based on what was actually sent to the device. Thus, if an operation fails, it's possible for the management system to undo it by computing the operations required to take it from the current state to the last known good state. Further, with updates to the service definition, a model driven approach can automatically compute the necessary configuration changes to bring a service from version 1 to version 2.
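As a rough sketch of computing such an inverse, assume a patch is a list of hypothetical `(operation, path, value)` tuples over a flat path-to-value state, with each path appearing at most once per patch. A real system would work from the data model rather than a flat dict.

```python
def apply_patch(state: dict, patch: list) -> dict:
    """Apply a patch to a copy of `state` and return the result."""
    new_state = dict(state)
    for op, path, value in patch:
        if op == "delete":
            del new_state[path]
        elif op in ("create", "replace"):
            new_state[path] = value
        else:
            raise ValueError(f"unknown operation: {op}")
    return new_state


def invert_patch(before: dict, patch: list) -> list:
    """Compute the patch that undoes `patch`.

    `before` is the state the original patch was applied to. Assumes
    each path appears at most once in `patch`.
    """
    inverse = []
    for op, path, _value in reversed(patch):
        if op == "create":
            inverse.append(("delete", path, None))
        elif op == "delete":
            inverse.append(("create", path, before[path]))
        elif op == "replace":
            inverse.append(("replace", path, before[path]))
        else:
            raise ValueError(f"unknown operation: {op}")
    return inverse
```

Because the inverse is derived from what was actually applied, it cannot drift out of date the way a hand-written rollback script can.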

While the model driven approach frees us from a dependency on code prepared by a programmer (and implicitly, the programmer's ability to imagine and work through a huge number of scenarios), we will depend on three new factors: the device's ability to correctly and precisely describe its current state, its accurate execution of control commands, and a model that correctly describes the device behavior. Again, this is similar to how self-driving cars depend on correct and precise sensors, accurate implementation of steering and control functions, as well as a map that is kept up to date with reality.

Level 4: Verified Service Delivery

While level 3 network automation can reliably take our network from any state to any other, level 4 adds verification that configured services are being delivered as intended and meet Service Level Agreements (SLAs). Correct configuration on all devices involved in a service does not mean the service is working. At a minimum, the operational state must also be monitored to ensure e.g. all interfaces are up.

Proper service monitoring involves measuring the direct KPIs of a service. For example, for a Layer 2 point-to-point connection, we can't judge whether the service is working just by checking that the interface to the customer is operationally up. We have to send traffic across the connection to gauge latency and similar metrics. For a corporate Internet connection, we can monitor the BGP session to make sure it is up and both sending and receiving the proper routes, but we should also send actual measurement probes to verify connectivity.
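To make the idea of judging a service by its measured KPIs concrete, here is a small sketch that evaluates probe results against an SLA. The `Sla` fields, the choice of mean latency and loss ratio as KPIs, and the convention that `None` marks a lost probe are all assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Sla:
    max_latency_ms: float  # hypothetical threshold on mean latency
    max_loss_ratio: float  # hypothetical threshold on lost probes


def meets_sla(samples: Sequence[Optional[float]], sla: Sla) -> bool:
    """Evaluate probe results; None marks a probe that got no reply."""
    if not samples:
        return False  # no measurements means nothing has been verified
    replies = [s for s in samples if s is not None]
    loss_ratio = 1 - len(replies) / len(samples)
    if loss_ratio > sla.max_loss_ratio:
        return False
    if not replies:
        return False  # every probe was lost, so latency is unknown
    mean_latency = sum(replies) / len(replies)
    return mean_latency <= sla.max_latency_ms
```

An interface that is operationally up but drops half the probes would fail this check, which is exactly the case that configuration and interface state alone cannot reveal.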

Some KPIs may only be possible to measure before the service instance is handed over to a customer, as measurements would otherwise interfere with the customer's use of the service. Others can be measured or inferred throughout the lifetime of the service instance. Once a service has been tested and verified to fulfill SLAs, a service instance SLA report may be issued and handed over to the customer together with the service. Continued (passive or active) monitoring of each service instance is required at this level.

By proactively monitoring individual service instances, we are no longer waiting for customers to call us to report issues pertaining to their services. That gives us happier users, but also a deep understanding of how good we are at delivering the value promised in the SLAs. For a technically minded organization, it's quite a difference to know what is being delivered minute by minute, as opposed to reading the responses from the yearly customer satisfaction survey.

Level 5: Closed Loop

At level 5, the principles of model driven services and verified service delivery are the same as at levels 3 and 4. Level 5 adds automatic recovery and optimization. Once in a while, the continuous service instance monitoring from level 4 might indicate a degradation of delivered service quality, or even a breach of the SLA or total loss of service. The management system will then reexamine the service instance and see how it could be redeployed to deliver the expected service. If an underlay link is lost, simply redeploying the service will compute a new way of deploying the service instance based on the updated topology information to restore the working state of the service. If, and only if, the resource lost was crucial to the service delivery with no alternative backup, an operator alarm would be raised. Alarms are for triggering operator action, remember?
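The redeploy-or-alarm decision can be sketched as below, assuming the topology is a plain list of undirected links and that a breadth-first search is a good enough stand-in for whatever path computation a real management system uses. All names are hypothetical.

```python
from collections import deque


def find_path(links, src, dst):
    """Breadth-first search over undirected links; node list or None."""
    neighbors = {}
    for a, b in links:
        neighbors.setdefault(a, []).append(b)
        neighbors.setdefault(b, []).append(a)
    queue = deque([[src]])
    visited = {src}
    while queue:
        path = queue.popleft()
        if path[-1] == dst:
            return path
        for nxt in neighbors.get(path[-1], []):
            if nxt not in visited:
                visited.add(nxt)
                queue.append(path + [nxt])
    return None


def handle_link_down(links, failed_link, src, dst):
    """Redeploy over the remaining topology, or raise an operator alarm."""
    # Compare as sets so that (A, B) and (B, A) name the same link.
    remaining = [link for link in links if set(link) != set(failed_link)]
    path = find_path(remaining, src, dst)
    if path is None:
        return ("alarm", None)  # no alternative: a human must act
    return ("redeployed", path)
```

The alarm branch is only reached when no path remains, matching the rule that alarms are reserved for situations needing operator action.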

The environment that a service instance lives in changes constantly. Not only do underlay links go up or down, sometimes they are removed permanently or new links are added. Congestion comes and goes. The cost of CPU hours or the environmental footprint of a given data center may change frequently. The decisions about which resources to allocate for a given service when the service was created may not be optimal after some time has passed. Just as the management system may react to SLAs being jeopardized or violated, it needs to reexamine how each service instance is allocated at regular intervals, to keep them optimized. Whether this reevaluation happens once an hour or once a year is not important for the principle, but this ongoing optimization is an integral part of level 5 network automation.

With the full service lifecycle automated, including day to day error handling and optimization, there is still plenty of work to dig into. Handling of exceptional situations, such as bugs in devices, exotic hardware failure, cyberattacks or forest fires will be part of it. Planning the network evolution, both in terms of equipment and services. Optimizing the internal processes, since the need to become leaner and faster will never end. Maybe even getting time to develop new services and revenue streams?

What about Higher Levels?

Although this network automation scale ends with the definition of level 5, the area of network operations certainly encompasses more than has been covered here. One example is capacity planning of the network, which requires not only the notion of the current state of the network and our desired state, but also a view into history and the future, such that we can base future predictions of network usage on historic data. We need to predict when links become full and plan capacity upgrades accordingly. Would that be a level 6? Or should we consider it a separate component with its own defined levels? Regardless, there are vast areas of common operational procedures that we as an industry can automate.

Big data and AI aren't explicitly mentioned or required at any level. It might be that certain KPIs or certain root cause inference is best tackled by a neural network, but that does not change the fundamental needs or criteria of each level.

Building Blocks

Building management systems or devices that play well in this scenario isn't easy, so a few observations based on existing implementations may be in order.

Standards. Getting an entire industry to cooperate is not easy. Open standards are the fair way to get predictability to replace fragmentation. It's less about the standards being perfect (they usually are not), and more about avoiding trying to keep up with dozens of ways to get things done.

Models and modeling language. In order to get the programmer out of the equation, it is key that devices declare their interfaces in a way that management systems can fully understand, both in terms of how to call upon the interface and when it comes to the precise effects of calling it. A catalog of API calls with descriptions in English clearly stops at automation level 2, since it requires a programmer to understand which API to call, and trial and error to find out the exact effects it has.

Templates. Humans have an easy time working with templates. That's a good thing worth leveraging. For automation levels 3 and above, any templates used have to be decoupled from protocol operations, since the protocol operations are now computed by the management system.

Transactions. It is hard to build a management system with good support for high network automation levels. Devices that support transactions drastically reduce the amount of code in the management system. This means a lot of hairy error situations are handled by the device itself, and the management system no longer needs to worry about sequencing of individual configuration stanzas, as the device itself takes care of applying configuration in order within a transaction.
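The all-or-nothing behavior that makes transactions so valuable can be sketched with a toy device model. `TransactionalDevice`, its `commit` method and the MTU validation rule are hypothetical and only illustrate the principle; they do not represent any particular device or protocol.

```python
class TransactionalDevice:
    """Hypothetical device that applies a set of changes all-or-nothing."""

    def __init__(self):
        self.running = {}

    def commit(self, changes: dict) -> None:
        # Work on a private copy so a failure never touches `running`.
        candidate = dict(self.running)
        candidate.update(changes)
        self._validate(candidate)
        # Only a fully valid candidate replaces the running config.
        self.running = candidate

    @staticmethod
    def _validate(config: dict) -> None:
        # Illustrative rule only: every MTU must be within 64..9216.
        for path, value in config.items():
            if path.endswith("/mtu") and not 64 <= value <= 9216:
                raise ValueError(f"invalid MTU at {path}: {value}")
```

When a commit fails, the caller has nothing to clean up, since no part of the rejected change was ever applied.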

Handling overlapping service instances. In most networks, some resources will be shared between service instances. Well-functioning management systems need to track which service instances depend on which resources, so that a shared resource can be reclaimed when the last service instance that depends on it goes away, and not before that.
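One simple way to picture this tracking is a per-resource set of dependent service instances, as in the sketch below. The `ResourceTracker` class and the identifiers used with it are hypothetical.

```python
from collections import defaultdict


class ResourceTracker:
    """Tracks which service instances depend on which shared resources."""

    def __init__(self):
        self._users = defaultdict(set)

    def acquire(self, resource: str, service: str) -> None:
        """Record that `service` depends on `resource`."""
        self._users[resource].add(service)

    def release(self, resource: str, service: str) -> bool:
        """Drop a dependency; return True if the resource can be reclaimed."""
        users = self._users.get(resource)
        if users is None:
            return False  # unknown resource: nothing to reclaim
        users.discard(service)
        if users:
            return False  # other service instances still depend on it
        del self._users[resource]
        return True
```

For example, a VLAN used by two service instances is only reported as reclaimable when the second of them releases it.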

Upgrading existing service instances. In a well-run network, service instances are continuously upgraded to the latest definition as the service definition changes over time. If not, the number of service instance versions tends to increase continuously, and over time that makes the network very hard to manage.

by Kristan Larsson and Jan Lindblad