Auditing Node Configuration
November 15, 2010
Issue
In a large-scale deployment it becomes difficult to know whether nodes are configured correctly in OpenNMS. As with any large configuration, things get missed, and small mistakes can lead to critical holes in monitoring. The scripts on this site help maintain node configurations, but some aspects of a node cannot be checked or updated programmatically. Someone who knows the node must audit the configuration to ensure it’s correct. Often this requires input from staff other than the OpenNMS administrator. Staff find this tedious at best and want to spend as little time on it as possible. A well-defined process is needed to audit the nodes in OpenNMS effectively and in a timely manner.
Solution
Below is a basic process for auditing the configuration of nodes already in OpenNMS (a future post will look at auditing the discovery process). This process has worked well for me, but you will need to modify it to meet your needs and those of your organization. I suggest auditing the nodes with small groups of staff, for example the OpenNMS administrator and a group of 3 to 6 staff who all manage similar resources (e.g., servers or networking). This allows everyone to agree on how things should be configured and puts enough knowledge in the room to catch errors. A group this small should still be able to move quickly through an audit.
The reports posted on this site are used heavily because they present information in a quick-reference format. I often print the reports and write edits on the paper during an audit meeting. The actual changes are made after the meeting, or set of meetings, when there are fewer distractions. Future posts will show how to break reports down by category, such as by management group, which speeds up the audit process further.
I also run cleanup scripts, such as clean-rrd-data, set-critical-paths, and check-dns, 24 to 48 hours before an audit to ensure everything is up to date.
Implementation
Prerequisites
The process uses many of the patterns and scripts from this site. Each is listed with the section it applies to.
Process
Using the Node Details (nodedetails.cgi) report, ask the following questions for each node:
1) Is the LABEL correct?
Is the LABEL the best name to use for the node? The node label can come from multiple sources including: DNS name, SMB name, SNMP name, IP address, or manually set. If you use ITIL processes then it may help for the OpenNMS label to match the name in the CMDB. Also consider if short names or fully qualified names are needed.
If labels are a common issue during an audit then I do a full run through of the Known Names (knownnames.cgi) report before moving onto the Node Details report. This lets everyone concentrate on names without the other information getting in the way.
2) Is the SNMP Info correct?
The SNMP Info should contain a valid name, location, contact, and description. If the information is not correct then it can be fixed on the node. The data will propagate to the OpenNMS server during the next data collection.
3) Is the “last 7 days datum count” greater than 0?
Look at the last number on the SNMP Info line. This is the last 7 days datum count. It represents the number of SNMP datum types (updated jrb files) that have been successfully collected at least once in the last 7 days. If the number is 0 then the OpenNMS server has been unable to collect SNMP data from this node for at least 7 days.
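The idea behind the datum count can be sketched in a few lines of Python. This is only an illustration, not the code behind the Node Details report: it assumes that a file's modification time is a reasonable proxy for "successfully collected", and the function name recent_datum_count and the node_dir argument (a directory holding one node's jrb files) are hypothetical.

```python
import time
from pathlib import Path

SECONDS_PER_DAY = 86400


def recent_datum_count(node_dir, days=7, now=None):
    """Count .jrb files under node_dir modified within the last `days` days.

    Assumption: a recently modified .jrb file means the datum was collected.
    `now` (seconds since the epoch) can be passed in to make runs repeatable.
    """
    now = time.time() if now is None else now
    cutoff = now - days * SECONDS_PER_DAY
    root = Path(node_dir)
    if not root.is_dir():
        # A node with no data directory has collected nothing.
        return 0
    # rglob also picks up files in per-interface subdirectories.
    return sum(1 for f in root.rglob("*.jrb") if f.stat().st_mtime >= cutoff)
```

A result of 0 for a node that should answer SNMP is the case worth raising during the audit.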
4) Is the node in the correct categories?
If you’re using A System for Node Categories then categories are used to filter nodes in reports and views and to determine where to send alerts. Besides the categories below, your group may have custom categories, views, and alerts that need to be considered.
a) Is the node in at least one M-* category?
The M- categories represent which staff groups manage a node. There can be several M- entries for a node if it is co-managed by several groups. Examples of M- categories are:
| M-Toronto | Toronto Region |
| M-Networking | Central Networking Group |
| M-UPS | Central UPS Group |
| M-Server | Central Server Group |
| M-Storage | Central Storage Group |
| M-Calgary | Calgary Office |
b) Is the node in one S-* category?
The S- categories represent the node state. Every node must be in one, and only one, S- category. There are 3 normal states:
| S-Production | Node provides production services. |
| S-Maintenance | Node provides production services but is temporarily under maintenance. Once maintenance is complete this node will be put back in the S-Production category and removed from S-Maintenance. |
| S-Other | Node is monitored but does not provide production services. |
See Node State – Controlling Outage Response for more information.
c) Is the node in at least one T-* category?
The T- categories represent the type of node. There could be several T- entries for a node but in most cases there is just one. This is normally set automatically by the set-cats-from-snmp script but if SNMP is not available or is not descriptive then this may need to be set manually.
d) Is the node in the correct A-* categories?
The A- categories represent the applications the node helps provide. There could be several A- entries for a node or none.
e) Is the node in the correct P-* categories?
The P- categories alter how a node is monitored and how its data is collected. There could be several P- entries for a node or none.
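The counting rules for the M-, S-, and T- categories are mechanical enough to check with a script before the meeting, which leaves the group free to discuss whether the chosen categories are actually right. Below is a minimal Python sketch of those rules. The function name audit_categories is hypothetical, and it assumes you can already obtain each node's categories as a list of strings. A- and P- categories are not checked because a node may have none.

```python
def audit_categories(categories):
    """Return a list of problems found in one node's category list.

    Rules checked (from the questions above):
      - at least one M- category (managing group)
      - exactly one S- category (node state)
      - at least one T- category (node type)
    """
    def with_prefix(prefix):
        return [c for c in categories if c.startswith(prefix)]

    problems = []
    if not with_prefix("M-"):
        problems.append("no M- category (managing group)")
    states = with_prefix("S-")
    if len(states) != 1:
        problems.append(
            f"expected exactly one S- category, found {len(states)}"
        )
    if not with_prefix("T-"):
        problems.append("no T- category (node type)")
    return problems


# "T-Router" is a made-up type category used only for illustration.
ok_node = ["M-Networking", "S-Production", "T-Router"]
bad_node = ["M-Server", "S-Production", "S-Maintenance"]
```

Here audit_categories(ok_node) returns an empty list, while audit_categories(bad_node) reports two problems: the node is in two S- categories and has no T- category.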
5) Are the discovered Services and IPs correct?
The IPs and services in the Node Details report are a summary of what has been discovered on the node. In most cases, the OpenNMS discovery process will correctly identify this information. In some situations this information will be incorrect or incomplete.
6) Are the Links and Critical Path correct?
Finally, review the remaining columns of the Node Details report. In most cases, the OpenNMS discovery process or scripts from this site will correctly set this information. In some situations this information will be incorrect or incomplete.
7) Does the node appear in Comments / Forced Unmanaged?
Using the report Comments / Forced Unmanaged (commentsforced.cgi), see if the node appears and review the information.
8) Does the node appear in Duplicate IP Addresses?
Using the report Duplicate IP Addresses (dupipaddr.cgi), see if the node appears and review the information.
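For readers who want to see what a duplicate IP check amounts to, here is a small Python sketch. It is not how dupipaddr.cgi is implemented. It assumes you have a mapping from node label to that node's IP addresses, and the function name duplicate_ips and the sample labels are hypothetical.

```python
from collections import defaultdict


def duplicate_ips(node_ips):
    """Map each IP seen on more than one node to the sorted node labels.

    node_ips: dict of node label -> iterable of IP address strings.
    """
    owners = defaultdict(set)
    for label, ips in node_ips.items():
        for ip in ips:
            owners[ip].add(label)
    return {
        ip: sorted(labels)
        for ip, labels in owners.items()
        if len(labels) > 1
    }


# Sample data using documentation-only addresses (192.0.2.0/24).
sample = {
    "core-sw1": ["192.0.2.1", "192.0.2.10"],
    "core-sw2": ["192.0.2.2", "192.0.2.10"],
    "mail1": ["192.0.2.25"],
}
```

With this sample, duplicate_ips(sample) returns {"192.0.2.10": ["core-sw1", "core-sw2"]}. Whether a shared address is an error or intentional (a virtual IP, for example) is still a question for the staff in the room.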
What’s missing?
In most cases you have now audited all the critical information but there are other details you may want to check. For example, the Node Details report lists the services and the IP addresses but it does not say which services are monitored on which addresses. To see these details, and others, click on the node label to enter the OpenNMS page for the node.