(dev|web)ops | security | biking | soap box


VMware “Management Network” failover

Just a quick post demonstrating how to fail over the VMware management network to a second virtual switch. You probably want the “Management Network” traffic for your ESXi hosts to be separate from your VM traffic. Unfortunately, you don’t always have enough network capacity to give each virtual switch two uplinks. Below is a script that moves the “Management Network” to a second vSwitch when the uplink of the primary vSwitch becomes unavailable. In effect it provides software-based failover of the Management Network by moving it to the virtual switch used for your VM traffic. Just place the script as a heredoc in the /etc/rc.local file on ESXi and add a cron entry that runs it every 5 minutes or so. The script will also fail back to the original virtual switch once connectivity returns.


vSwitch0_nic=$(esxcfg-vswitch -l | awk '$1 ~ /vSwitch0/ {print $6}')
ip=$(esxcfg-vmknic -l | awk '$2 ~ /Management/ && $3 ~ /Network/ {print $5}')
subnet=$(esxcfg-vmknic -l | awk '$2 ~ /Management/ && $3 ~ /Network/ {print $6}')

# vSwitch0 uplink down: fail over to vSwitch1
if ! esxcfg-nics -l | awk -v nic="$vSwitch0_nic" '$1 == nic && $4 ~ /Down/ {exit 1}'; then
    # "Management Network" still on vSwitch0
    if esxcfg-vswitch -l | awk '/vSwitch0/,/Switch Name/' | grep -q 'Management Network'; then
        # remove "Management Network" from vSwitch0
        esxcfg-vmknic -d "Management Network"
        esxcfg-vswitch -D "Management Network" vSwitch0

        # add "Management Network" portgroup to vSwitch1 (vlan 96)
        esxcfg-vswitch -A "Management Network" vSwitch1
        esxcfg-vswitch --pg="Management Network" --vlan=96 vSwitch1

        # re-add "Management Network" VMkernel NIC
        esxcfg-vmknic -a -i "$ip" -n "$subnet" "Management Network"
    fi
# vSwitch0 uplink back up: fail back from vSwitch1
else
    # "Management Network" still on vSwitch1
    if esxcfg-vswitch -l | awk '/vSwitch1/,/Switch Name/' | grep -q 'Management Network'; then
        # remove "Management Network" from vSwitch1
        esxcfg-vmknic -d "Management Network"
        esxcfg-vswitch -D "Management Network" vSwitch1

        # add "Management Network" portgroup to vSwitch0
        esxcfg-vswitch -A "Management Network" vSwitch0
        esxcfg-vswitch --pg="Management Network" vSwitch0

        # re-add "Management Network" VMkernel NIC
        esxcfg-vmknic -a -i "$ip" -n "$subnet" "Management Network"
    fi
fi
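The rc.local and cron wiring mentioned above might look something like the following sketch. The script path, the 5-minute interval, and the crond pid-file location are assumptions; busybox crond on ESXi reads /var/spool/cron/crontabs/root:

```sh
# appended to /etc/rc.local -- recreate the script on every boot
cat << 'EOF' > /opt/mgmt_failover.sh
#!/bin/sh
# ... failover script from above ...
EOF
chmod +x /opt/mgmt_failover.sh

# schedule it every 5 minutes and restart busybox crond to pick it up
echo '*/5 * * * * /opt/mgmt_failover.sh' >> /var/spool/cron/crontabs/root
kill "$(cat /var/run/crond.pid)"
crond
```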


Hyperic – scripting removal of server resources

By default the Hyperic agent will autodiscover sendmail and NTP server resources, so many of the platforms in your installation will likely be monitoring them. In quite a few cases you don’t care about monitoring these resources and would rather remove them to improve Hyperic performance. They also tend to fill up the auto-discovery screen and become a nuisance.

1. You can add a line to the agent.properties file to have the Hyperic agent NOT autodiscover these services:
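For example, something like the following in agent.properties should stop the agent from loading those plugins at all. The exact property name and values are an assumption here; verify them against the agent documentation for your Hyperic version:

```
# agent.properties -- skip autodiscovery of the sendmail and NTP plugins
# (property name/values are assumptions; check your Hyperic version's docs)
plugins.exclude=sendmail,ntp
```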


2. To remove the “servers” from existing “platforms” you can use the script below. It makes use of the hqapi.sh CLI tool (available for download from Hyperic). Just change the “server_to_remove” variable appropriately.

Caution: this script was written in 2.3 seconds and I’m quite sure the XML tree parsing is suboptimal.


import os
import sys
import subprocess as SP
import xml.etree.ElementTree as ET

def run_cmd(cmd, wd=os.getcwd()):
    """ Executes a unix command and verifies exit status. Takes command to be
        executed and directory from which to execute. """
    try:
        child = SP.Popen(cmd.split(), stdout=SP.PIPE, stderr=SP.PIPE, cwd=wd)
        stdout, stderr = [ str(out, 'UTF-8', 'ignore') for out in child.communicate()[:2] ]
        rc = child.returncode
    except OSError as e:
        print('Critical: Error running {}, {}'.format(cmd, e), file=sys.stderr)
        sys.exit(2)

    if rc:
        print('Error running command: {}'.format(cmd), file=sys.stderr)
        print(stderr, file=sys.stderr)

    return stdout


def main():
    server_to_remove = "NTP 4.x"
    api_path = '/bin/hqapi.sh'
    all_resources = '{} resource list --prototype=Linux --children'.format(api_path)
    del_resource = '{} resource delete --id={}'.format(api_path, '{}')

    xml = run_cmd(all_resources)
    elements = ET.fromstring(xml)
    for element in elements:
        for child in element:
            ResourcePrototype = child.find('ResourcePrototype')
            if ResourcePrototype is not None:
                if server_to_remove in str(ResourcePrototype.get('name')):
                    run_cmd(del_resource.format(child.get('id')))

if __name__ == '__main__': main()


Nagios – Mitigating false positives

A common issue when monitoring thousands of services is dealing with intermittent issues and “false positives” clogging up the status page. Often when checks fail and then clear on their own, the issue is deemed a “false positive” by the operations staff. What’s more likely is that an actual issue was briefly observed but was merely intermittent in nature (a true positive). In a perfect world, when a service fails, even for a moment, you would perform root cause analysis and resolve the issue. In the real world, when a service check fails the operations staff waits to see if the alert clears without intervention. How long they wait is determined by how often things show up in monitoring and clear on their own (aka flapping). The more often things alert and clear without need for intervention, the longer the NOC is going to postpone investigating a possible issue. The goal, then, is to have checks display in monitoring only when a sustained issue or outage is occurring. The NOC can then react quickly, knowing every alert is likely a serious incident deserving of their attention. Nagios has several features that can assist with keeping known intermittent issues out of monitoring.
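As a taste of what those features look like, here is a hedged sketch of a service definition (host, service, and values are illustrative, not a recommendation): a higher max_check_attempts with a short retry_interval keeps a briefly failing check in a SOFT state, so it never notifies unless the problem is sustained, and flap detection suppresses notifications for checks that oscillate.

```
define service {
    use                     generic-service
    host_name               web01          ; illustrative host/service
    service_description     HTTP
    check_command           check_http
    check_interval          5              ; minutes between normal checks
    retry_interval          1              ; recheck quickly once a problem is seen
    max_check_attempts      5              ; stay SOFT through 5 failures before alerting
    flap_detection_enabled  1              ; suppress notifications while flapping
}
```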
Continue reading


Nagios – send external commands to collector

One of only a few issues I’ve experienced when using Nagios or Icinga in a distributed setup is the inability to send external commands to a remote instance instead of the Nagios instance the CGI web interface resides on. For example, say you have just two Nagios instances running on separate servers: one doing all the active checks and sending the results to the other, “central” server. In this configuration the central server doesn’t perform any active checks. It is merely responsible for processing the check results, updating the database, performing notification logic, running event handlers, hosting the web interface, etc. When you attempt to schedule an immediate check of a given service through the central server’s classic CGI interface, an external command is generated. That command is processed by the central server and NOT the “collector”; the collector never receives the external command. There isn’t a way to schedule an immediate check on the collector from the central server’s interface. That’s a big issue! You can’t expect a busy operations staff to log into a second web interface, on the collector itself, each time an immediate check needs to be performed.
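For context, Nagios external commands are just timestamped lines written to the daemon’s command pipe, so any workaround boils down to delivering such a line to the collector. A minimal sketch of building one (the collector hostname and pipe path in the comment are hypothetical):

```python
import time

def forced_check_cmd(host, service, when=None):
    """Build a Nagios external command that forces an immediate service check."""
    when = int(when if when is not None else time.time())
    # documented format: [time] SCHEDULE_FORCED_SVC_CHECK;<host>;<service>;<check_time>
    return '[{0}] SCHEDULE_FORCED_SVC_CHECK;{1};{2};{0}'.format(when, host, service)

# One way to deliver it to the collector (hypothetical host and pipe path):
#   ssh collector "echo '<cmd>' > /var/spool/nagios/nagios.cmd"
print(forced_check_cmd('web01', 'HTTP', when=1300000000))
# -> [1300000000] SCHEDULE_FORCED_SVC_CHECK;web01;HTTP;1300000000
```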

Continue reading


VIM as a Python IDE

I recently began scripting in Python using the VIM editor, my editor of choice. In what became a failing effort to keep my sanity, I forwent customization of the VIM settings on my personal machines. You see, I’m often tasked with editing files on servers whose VIM settings I can’t customize. I feared that if I were to become overly accustomed to any custom settings, I’d likely blurt obscenities when forced to use a vanilla VIM.

Without some tweaking of my vimrc I end up having to manually indent code in Python. Talk about a loss of productivity; having to use the space bar to indent Python code is the surest path to insanity. Mimicking the mindless repetition that’s better suited to steam-powered machinery is a less than efficient use of my time. I’ve since admitted defeat and tailored my VIM settings to Python. I may occasionally blurt an obscenity when using VIM on somebody else’s machine, but it’s a calculated loss. Below is a breakdown of my VIM settings. I hope others will find it useful. Continue reading
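The full breakdown is behind the cut; as a taste, the indentation-related core of a Python-friendly vimrc usually looks something like this (values reflect PEP 8 conventions and are illustrative, not necessarily my exact settings):

```vim
" ~/.vimrc -- minimal Python-friendly indentation (illustrative)
syntax on
filetype plugin indent on   " enable per-filetype indent rules
set tabstop=8               " display width of a literal tab
set softtabstop=4           " <Tab>/<BS> feel like 4 spaces
set shiftwidth=4            " >> and autoindent shift by 4 spaces
set expandtab               " never insert literal tabs
set autoindent              " carry indentation to the next line
```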


Nagios plugin: check_dell

Just finished a Python script to check Dell hardware components via the omreport utility. It’s designed to be used client-side via NRPE (or check_by_ssh). Additional usage information can be found within the script’s docstrings as well as via the --help option. Some gotchas:

  • In some instances NRPE will not execute scripts that start with #!/usr/bin/env. In those cases you will need to specify the full path to python.
  • The plugin expects a symlink to omreport in /usr/sbin; you may need to add one if the OMSA install script didn’t. I hard-coded the path because relying on the shell environment’s PATH variable is a security concern, especially in cases where the plugin is setuid root or called via sudo.
  • When starting OMSA, use srv-admin.sh start on Red Hat-based systems or /etc/init.d/dataeng start on Debian-based ones. The order in which the services start is crucial: the necessary device drivers must be loaded prior to the loading of the IPMI module.

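Putting the first two gotchas together, the NRPE side might look like the following sketch (the plugin path and the OMSA install location are assumptions; adjust for your layout):

```
# /etc/nagios/nrpe.cfg -- invoke the interpreter explicitly instead of
# relying on the script's "#!/usr/bin/env python" shebang
command[check_dell]=/usr/bin/python /usr/local/nagios/libexec/check_dell

# the plugin expects omreport at a fixed path; symlink it if OMSA didn't:
#   ln -s /opt/dell/srvadmin/bin/omreport /usr/sbin/omreport
```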
Continue reading

