A typical network operations center (NOC) monitors the network for performance and availability. When a problem occurs, the NOC tries to find the cause. Whether or not it finds the cause, the next step is to turn the problem over to network engineering. If the NOC has found the probable cause, network engineering can start on repairs immediately. If not, engineering spends more time searching for it. Finding the cause of a problem can take minutes, hours, or days.
First, let’s define change. For this post, we’ll discuss these types of change:
- A change in the running state of the device
- Unsaved changes – OK, not really a change but still worth noting as you’ll see momentarily
- A change in the boot configuration of a device
- A software update on the device
- Adding a new device to the network
Correctly executed changes have no negative impact. That is the goal, but not always the result. Let’s look at how each type of change might cause a problem, and at how Network Operations Management (NOM) can help.
Spoiler alert: Everyone dies in the end! Ah, no, that was for a movie. One piece of background: NOM includes both performance/availability monitoring and configuration monitoring/management for network devices.
1: A change in the running state of a network device
Someone logs in to the network device and makes a change. Most network devices have a configuration file that holds the parameters the device loads at boot. Once the device is running, changes to the running state affect the running device but not the configuration file. This is great for testing a fix.
Let’s assume the change was wrong, and a performance problem occurs. A NOC operator will look for the problem. But without knowledge of the change, the operator will take longer to find the cause of the problem.
Because a change is, in general, the most likely cause of an issue, a tool that shows the operator what changed on the device can speed troubleshooting.
NOM sets up alerts on network devices so that it is informed when the running state of a device changes. Because it also stores periodic snapshots of configurations, it can compare the current (changed) configuration with the previous one and point out precisely what changed.
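To make the snapshot comparison concrete, here is a minimal sketch using Python's standard `difflib`. It is not how NOM does it internally; the function name `config_diff` and the sample configuration text are made up for illustration.

```python
import difflib


def config_diff(previous: str, current: str) -> list[str]:
    """Return only the removed ('-') and added ('+') lines between two snapshots."""
    diff = list(
        difflib.unified_diff(
            previous.splitlines(),
            current.splitlines(),
            fromfile="previous",
            tofile="current",
            lineterm="",
            n=0,  # no surrounding context lines, just the changes
        )
    )
    # The first two lines are the '---'/'+++' file headers; '@@' lines mark hunks.
    return [line for line in diff[2:] if not line.startswith("@@")]


# Hypothetical snapshots: someone changed the duplex setting on one interface.
previous_snapshot = """hostname edge-1
interface Gi0/1
 duplex full
 speed 1000
"""
current_snapshot = """hostname edge-1
interface Gi0/1
 duplex half
 speed 1000
"""

changes = config_diff(previous_snapshot, current_snapshot)
for line in changes:
    print(line)
```

An operator looking at this output sees the one line that was removed and the one that replaced it, rather than two full configuration files.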
As a side note, NOM can roll back to a previous configuration. That’s something network engineering might use to improve Mean Time To Repair (MTTR).
2: Unsaved changes
In this case, somebody made a change to fix a problem, but the person making the change didn’t save the changes to the boot configuration file. Later, when a reboot occurs, the performance problem returns.
As mentioned above, NOM is notified when there is an exit from configuration mode and takes a snapshot. That configuration change is expected. Later, the device is rebooted and loads the previous configuration, which lacks the change that solved the problem. Here NOM has two ways of showing the problem. First, it detects unsaved changes. Second, it detects reboots and shows what changed between the two configurations: the one that just booted from the configuration file and the one that was previously live.
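The idea behind unsaved-change detection can be sketched as a comparison between the running configuration and the boot configuration. The code below is an illustration only, not NOM's implementation. It assumes the two configurations are available as text and that lines starting with `!` are comments (an IOS-style convention); the function names are hypothetical. It also compares lines without regard to which section they sit in, which a real tool would need to handle.

```python
def normalize(config: str) -> list[str]:
    """Drop blank lines and '!' comment lines; strip trailing whitespace."""
    lines = []
    for raw in config.splitlines():
        line = raw.rstrip()
        if line and not line.lstrip().startswith("!"):
            lines.append(line)
    return lines


def lost_on_reboot(running: str, startup: str) -> list[str]:
    """Lines present in the running config but missing from the boot config."""
    saved = set(normalize(startup))
    return [line for line in normalize(running) if line not in saved]


# Hypothetical example: an MTU fix was applied live but never saved.
running_config = "hostname edge-1\n!\ninterface Gi0/1\n mtu 9000\n"
startup_config = "hostname edge-1\n!\ninterface Gi0/1\n"

unsaved = lost_on_reboot(running_config, startup_config)
if unsaved:
    print("Unsaved changes that a reboot would discard:")
    for line in unsaved:
        print(line)
```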
When the performance problem comes back, the NOC team will see the changes on the graph.
Figure 1: Device config change notification
3: A change in the boot configuration of a device
Tuesday morning, a change request is submitted and approved. Your organization has a change window at 2 am. Around noon, the network engineer makes an incorrect change to the boot configuration, then asks for a device reboot during the change window. Depending on the nature of the change and your organization’s usage patterns, you might see the impact immediately or not until folks start work in the morning.
In this use case, either way, the impact is separated from the change by hours.
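When the impact trails the change by hours, the useful question is "what changed on this device in the window before the problem?" Here is a rough sketch of that lookup. The `ChangeEvent` record, the `changes_before` function, the device names, and the 24-hour default are all assumptions made for the example, not part of NOM.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class ChangeEvent:
    device: str
    when: datetime
    summary: str


def changes_before(alert_time, device, events, lookback=timedelta(hours=24)):
    """Changes on `device` within `lookback` before the alert, newest first."""
    window_start = alert_time - lookback
    hits = [
        e for e in events
        if e.device == device and window_start <= e.when <= alert_time
    ]
    return sorted(hits, key=lambda e: e.when, reverse=True)


# Hypothetical timeline matching the scenario above.
events = [
    ChangeEvent("edge-1", datetime(2024, 1, 2, 12, 0), "boot configuration edited"),
    ChangeEvent("edge-1", datetime(2024, 1, 3, 2, 0), "device rebooted"),
    ChangeEvent("core-2", datetime(2024, 1, 3, 7, 0), "running configuration changed"),
]
alert_time = datetime(2024, 1, 3, 8, 30)  # users arrive, performance drops

suspects = changes_before(alert_time, "edge-1", events)
for e in suspects:
    print(e.when.isoformat(), e.summary)
```

A lookback that is too short would miss the noon edit entirely, which is why the window is a parameter rather than a fixed value.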
4: Software image update on the device
Number four is a particular case of #3: here the change is a software change rather than a configuration change. Not all software updates work as well as the developers would like. Just look at the issues both Windows and Mac updates have caused. Software updates are typically scheduled during a maintenance window, so, as in #3, there may be a delayed impact if the new software has a problem. The rest of this use case is the same as #3, but the change will show as a different software version instead of a configuration change.
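Spotting a software change comes down to comparing the version recorded in one snapshot with the version in the next. A minimal sketch, assuming each snapshot is reduced to a device-to-version mapping (the function name and the version strings are invented for the example):

```python
def version_changes(before: dict[str, str], after: dict[str, str]) -> dict[str, tuple[str, str]]:
    """Map each device whose software version changed to (old, new)."""
    changed = {}
    for device, new_version in after.items():
        old_version = before.get(device)
        # Devices that are new in `after` are not version changes; skip them.
        if old_version is not None and old_version != new_version:
            changed[device] = (old_version, new_version)
    return changed


# Hypothetical version inventory before and after a maintenance window.
versions_before = {"edge-1": "15.2(4)", "core-2": "17.3.1"}
versions_after = {"edge-1": "15.2(7)", "core-2": "17.3.1", "edge-9": "17.3.1"}

upgraded = version_changes(versions_before, versions_after)
for device, (old, new) in upgraded.items():
    print(f"{device}: {old} -> {new}")
```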
5: Adding a device to the network
When you add a new device to the network, an incorrect version of the software or a misconfiguration (or lack of configuration) can cause problems. An example might be a duplex mismatch. A duplex mismatch only shows up during times of heavier network load. If there is little traffic in the middle of the night, it probably won’t be noticed until the morning when the network load increases. Another example is a slow link misconfigured as a fast link.
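A duplex or speed mismatch is easy to check for once you have the settings from both ends of a link. The sketch below assumes those settings have already been collected into a simple record; `PortSettings`, `link_mismatches`, and the device and port names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PortSettings:
    device: str
    port: str
    duplex: str       # "full" or "half"
    speed_mbps: int


def link_mismatches(a: PortSettings, b: PortSettings) -> list[str]:
    """Describe any duplex or speed disagreement between two link endpoints."""
    problems = []
    if a.duplex != b.duplex:
        problems.append(
            f"duplex: {a.device} {a.port} is {a.duplex}, "
            f"{b.device} {b.port} is {b.duplex}"
        )
    if a.speed_mbps != b.speed_mbps:
        problems.append(
            f"speed: {a.device} {a.port} is {a.speed_mbps} Mb/s, "
            f"{b.device} {b.port} is {b.speed_mbps} Mb/s"
        )
    return problems


# Hypothetical link: the newly added switch came up at half duplex.
existing_end = PortSettings("core-2", "Gi1/0/4", "full", 1000)
new_end = PortSettings("access-7", "Gi0/1", "half", 1000)

findings = link_mismatches(existing_end, new_end)
for problem in findings:
    print(problem)
```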
Another concern is a new device that is non-compliant (configuration or OS), which can leave it vulnerable to attack or failing to meet required standards such as PCI or HIPAA. If there’s no performance problem, this is a network engineering responsibility.
The good news here is that NOM has real-time discovery and will find the new device quickly and take a snapshot of its configuration. Discovering a new device is considered a change.
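Treating discovery as a change can be pictured as a comparison between the known inventory and the latest discovery results. This is only a sketch of the idea under that assumption; the function name and device names are made up, and it says nothing about how NOM's discovery actually works.

```python
def newly_discovered(known_inventory: set[str], latest_scan: set[str]) -> list[str]:
    """Devices seen in the latest scan that are not yet in the inventory."""
    return sorted(latest_scan - known_inventory)


# Hypothetical inventory and scan results.
known = {"core-2", "edge-1"}
scan = {"core-2", "edge-1", "access-7"}

new_devices = newly_discovered(known, scan)
for device in new_devices:
    # Each new device is recorded as a change and queued for a first snapshot.
    print(f"change: new device {device} discovered; take initial config snapshot")
```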
We’ve covered five examples of the types of changes detected by NOM. Given these use cases, how do NOC operators find out about a change that may affect a network’s performance?
As you can see in Figure 2 below, change indicators show on a performance graph that is readily available to your NOC team. In the right-hand screen, you can see the change to a packet filter.
Figure 2: Change overlay on performance screen. On the right, the configuration change detected
The first question I ask when troubleshooting is, “What changed?” If the NOC has the tools to provide that information to network engineering, diagnosis time will improve. Most network monitoring tools don’t have configuration management capabilities, so they cannot show changes to the folks looking at performance. NOM does. This lets your NOC team add more value, and it helps you restore your network faster.
The solution is monitoring for change. Discover changes and give the NOC a change overlay to quickly determine that a change occurred and correlate it to the performance problem.