I’m Ryan Bowlby

a devops practitioner, mtn biker, hiker, & coffee connoisseur

about me

I’m a devops engineer working to automate away the usual operational tedium. Hacking on something all day with coffee and a multi-day backpacking trip sound equally appealing.

Here I ramble about infrastructure as code, systems architecture, configuration management, scaling, and security.

VMware “Management Network” Failover

Just a quick post demonstrating how to fail over the VMware management network to a second virtual switch. You probably want the “Management Network” traffic for your ESXi hosts to be separate from your VM traffic. Unfortunately, you don’t always have enough network capacity for two uplinks per virtual switch. Below is a script that will effectively move the “Management Network” to a second vSwitch when the uplink of the primary vSwitch becomes unavailable. Basically it provides software-based failover of the Management Network by moving it to the virtual switch used for your VM traffic. Just place the script as a here doc in ESXi’s /etc/rc.local file and add a cron entry that runs it every 5 minutes or so. The script will also fail back to the original virtual switch once connectivity returns.

VMware management network failover
#!/bin/ash

# Uplink NIC of vSwitch0 and the current Management Network IP/netmask.
vSwitch0_nic=$(esxcfg-vswitch -l | awk '$1 ~ /vSwitch0/ {print $6}')
ip=$(esxcfg-vmknic -l | awk '$2 ~ /Management/ && $3 ~ /Network/ {print $5}')
subnet=$(esxcfg-vmknic -l | awk '$2 ~ /Management/ && $3 ~ /Network/ {print $6}')

# vSwitch0 nic down
if ! esxcfg-nics -l | awk -v nic="$vSwitch0_nic" '$1 == nic && $4 ~ /Down/ {exit 1}'; then
    # "Management Network" on vSwitch0
    if esxcfg-vswitch -l | awk '/vSwitch0/,/Switch Name/' | grep -q 'Management Network'; then
        # remove "Management Network" from vSwitch0
        esxcfg-vmknic -d "Management Network"
        esxcfg-vswitch -D "Management Network" vSwitch0

        # add "Management Network" portgroup to vSwitch1 (vlan 96)
        esxcfg-vswitch -A "Management Network" vSwitch1
        esxcfg-vswitch -p "Management Network" -v 96 vSwitch1

        # add "Management Network" VMkernel NIC
        esxcfg-vmknic -a -i "$ip" -n "$subnet" "Management Network"
    fi
else
    # "Management Network" on vSwitch1
    if esxcfg-vswitch -l | awk '/vSwitch1/,/Switch Name/' | grep -q 'Management Network'; then
        # remove "Management Network" from vSwitch1
        esxcfg-vmknic -d "Management Network"
        esxcfg-vswitch -D "Management Network" vSwitch1

        # add "Management Network" portgroup back to vSwitch0 (untagged)
        esxcfg-vswitch -A "Management Network" vSwitch0
        esxcfg-vswitch -p "Management Network" -v 0 vSwitch0

        # add "Management Network" VMkernel NIC
        esxcfg-vmknic -a -i "$ip" -n "$subnet" "Management Network"
    fi
fi
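The rc.local and cron wiring described above might look like the following sketch. On ESXi the real targets are /etc/rc.local and root’s crontab; they’re parameterized here (and the script body abbreviated) so the sketch can be dry-run anywhere:

```shell
# Hypothetical install wiring; RC_LOCAL/CRONTAB default to local files
# so this can be exercised outside of ESXi.
RC_LOCAL="${RC_LOCAL:-./rc.local}"
CRONTAB="${CRONTAB:-./crontab.root}"
SCRIPT=/opt/mgmt-failover.sh

# rc.local re-creates the script at every boot via a here doc (much of
# ESXi's filesystem is rebuilt on boot, but rc.local itself persists).
cat >> "$RC_LOCAL" <<'EOF'
cat > /opt/mgmt-failover.sh <<'SCRIPT'
#!/bin/ash
# ... failover script body from above ...
SCRIPT
chmod +x /opt/mgmt-failover.sh
EOF

# Run the failover check every 5 minutes, adding the entry only once.
grep -q "$SCRIPT" "$CRONTAB" 2>/dev/null || \
    echo "*/5 * * * * $SCRIPT" >> "$CRONTAB"
```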

written in InfoTech

Hyperic - Scripting Removal of Server Resources


By default the Hyperic agent will autodiscover sendmail and NTP server resources, so quite a few of the platforms in your installation will likely be monitoring them. In many cases you don’t care about monitoring these resources and would rather improve Hyperic performance by removing them. They will also fill up the auto-discovery screen and become a nuisance.

  1. You can add a line to the agent.properties file so the Hyperic agent does NOT autodiscover these services:

plugins.exclude=sendmail,ntp

  2. To remove the “servers” from existing “platforms” you can use the script below. It makes use of the hqapi.sh CLI tool (available for download from Hyperic). Just change the “server_to_remove” variable appropriately.

Caution: this script was written in 2.3 seconds and I’m quite sure the XML tree parsing is suboptimal.

Hyperic – Remove Server Resources
#!/usr/bin/python3.2

import sys
import subprocess as SP
import xml.etree.ElementTree as ET

def run_cmd(cmd, wd=None):
    """ Executes a unix command and verifies exit status. Takes the command
        to be executed and an optional directory from which to execute. """

    try:
        child = SP.Popen(cmd.split(), stdout=SP.PIPE, stderr=SP.PIPE, cwd=wd)
        stdout, stderr = [str(out, 'UTF-8', 'ignore') for out in child.communicate()[:2]]
        rc = child.returncode
    except OSError as e:
        print('Critical: Error running {}, {}'.format(cmd, e), file=sys.stderr)
        sys.exit(2)

    if rc:
        print('Error running command: {}'.format(cmd), file=sys.stderr)
        print(stderr, file=sys.stderr)
        sys.exit(2)

    return stdout.rstrip()

def main():
    server_to_remove = "NTP 4.x"
    api_path = '/bin/hqapi.sh'
    all_resources = '{} resource list --prototype=Linux --children'.format(api_path)
    del_resource = '{} resource delete --id={}'.format(api_path, '{}')

    xml = run_cmd(all_resources)
    elements = ET.fromstring(xml)
    for element in elements:
        for child in element:
            prototype = child.find('ResourcePrototype')
            if prototype is not None:
                if server_to_remove in str(prototype.get('name')):
                    result = run_cmd(del_resource.format(child.get('id')))
                    print(result)

if __name__ == '__main__':
    main()
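For reference, the XML shape the parsing loop expects looks roughly like this (an inferred, abbreviated example, not verbatim hqapi.sh output): the loop walks two levels of `Resource` elements and deletes any child whose `ResourcePrototype` name matches.

```xml
<ResourcesResponse>
  <Resource id="10001" name="host01.example.com">
    <Resource id="10042" name="NTP 4.x @ host01">
      <ResourcePrototype name="NTP 4.x"/>
    </Resource>
  </Resource>
</ResourcesResponse>
```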

written in General, InfoTech

Nagios - Mitigating False Positives


A common issue when monitoring thousands of services is dealing with intermittent issues and “false positives” clogging up the status page. When checks fail and then clear on their own, the operations staff often deems the issue a “false positive.” What’s more likely is that an actual issue was briefly observed but was merely intermittent in nature (a true positive). In a perfect world, when a service fails, even for a moment, you would perform root cause analysis and resolve the issue. In the real world, when a service check fails the operations staff waits to see if the alert clears without intervention.

How long they wait is determined by how often things show up in monitoring and clear on their own (aka flapping). The more often things alert and clear without need for intervention, the longer the NOC is going to postpone investigating a possible issue. The goal, then, is to have checks appear in monitoring only when a sustained issue or outage is occurring. The NOC then reacts quickly, knowing every alert is likely a serious incident deserving of their attention. Nagios has several features that can assist with keeping known intermittent issues out of monitoring.
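The features in question include retry logic and flap detection. For example, requiring several consecutive failures before a hard state keeps one-off blips off the status page (directive values here are illustrative, not from the original post):

```cfg
define service {
    host_name               web01
    service_description     HTTP
    check_command           check_http
    max_check_attempts      5    ; five consecutive failures before a HARD state
    retry_interval          2    ; minutes between re-checks while SOFT
    flap_detection_enabled  1    ; suppress notifications while state is flapping
}
```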

written in General, InfoTech Read on →

Nagios - Send External Commands to Collector

One of only a few issues I’ve experienced when using Nagios or Icinga in a distributed setup is the inability to send external commands to a remote instance instead of the instance hosting the CGI web interface. For example, say you have just two Nagios instances running on separate servers: one doing all the active checks and sending the results to the other, “central” server. In this configuration the central server doesn’t perform any active checks. It is merely responsible for processing check results, updating the database, performing notification logic, running event handlers, hosting the web interface, and so on. When you attempt to schedule an immediate check of a given service through the central server’s classic CGI interface, an external command is generated. That command is processed by the central server and NOT the “collector”; the collector never receives the external command. There isn’t a way to schedule an immediate check on the collector from the central server’s interface. That’s a big issue! You can’t expect a busy operations staff to log into a second web interface, on the collector itself, each time an immediate check needs to be performed.
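One workaround is to write the external command to the collector’s command pipe yourself, for example over SSH. A sketch (the collector hostname and FIFO path are assumptions; adjust for your layout):

```shell
# Build a Nagios external command line: [timestamp] COMMAND;arg;arg;...
build_cmd() {
    ts=$1 host=$2 svc=$3
    printf '[%s] SCHEDULE_FORCED_SVC_CHECK;%s;%s;%s\n' "$ts" "$host" "$svc" "$ts"
}

# Usage: force an immediate check of HTTP on web01 by appending to the
# command FIFO the collector's Nagios daemon reads (hypothetical host/path):
#   build_cmd "$(date +%s)" web01 HTTP | \
#       ssh collector 'cat >> /usr/local/nagios/var/rw/nagios.cmd'
```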


written in InfoTech Read on →

VIM as a Python IDE

I recently began scripting in Python using the VIM editor, my editor of choice. In what became a failing effort to keep my sanity, I forwent customization of the VIM settings on my personal machines. You see, I’m often tasked with editing files on servers whose VIM settings I can’t customize. I feared that if I became overly accustomed to any custom settings, I’d likely blurt obscenities when forced to use a vanilla VIM.

Without some tweaking of my vimrc I end up having to manually indent Python code. Talk about a loss of productivity: having to use the space bar to indent Python code is the surest path to insanity. Mimicking the mindless repetition better suited to steam-powered machinery is a less than efficient use of my time. I’ve since admitted defeat and tailored my VIM settings to Python. I may occasionally blurt an obscenity when using VIM on somebody else’s machine, but it’s a calculated loss. Below is a breakdown of my VIM settings. I hope others will find it useful.
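The core of the Python-specific tweaks is indentation. A representative vimrc fragment along these lines (not necessarily the exact settings from the post):

```vim
" PEP 8 friendly indentation for Python
set expandtab          " insert spaces, never literal tab characters
set tabstop=4          " a tab character displays as 4 columns
set shiftwidth=4       " autoindent and >>/<< move by 4 columns
set softtabstop=4      " backspace removes a full indent level
filetype plugin indent on
syntax on
```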

written in InfoTech Read on →

Nagios Plugin: Check_dell

Just finished a Python script to check Dell hardware components via the omreport utility. It’s designed to be used client-side via NRPE (or check_by_ssh). Additional usage information can be found in the script’s docstrings as well as the --help option. Some gotchas:

  • In some instances NRPE will not execute scripts that start with #!/usr/bin/env. In these instances you will need to specify the full path to python.

  • The plugin expects a symlink to omreport in /usr/sbin; you may need to add one if the OMSA install script didn’t. I hard-coded the path because relying on the shell environment’s PATH variable is a security concern, especially in cases where the plugin is setuid root or called via sudo.

  • When starting OMSA use srvadmin-services.sh start on Red Hat-based systems or /etc/init.d/dataeng start on Debian-based systems. The order in which the services start is crucial: the necessary device drivers must be loaded prior to the loading of the IPMI module.
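Putting the first two gotchas together, the client-side NRPE command definition might look like this (paths are illustrative):

```cfg
# nrpe.cfg: invoke the plugin with an explicit interpreter path instead of
# relying on "#!/usr/bin/env python"; the plugin itself calls
# /usr/sbin/omreport rather than trusting $PATH
command[check_dell]=/usr/bin/python /usr/local/nagios/libexec/check_dell.py
```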

written in General, InfoTech Read on →

Bonjour Isn’t Evil, But..

Just finished watching a Google Tech Talk on Bonjour presented by Dr. Stuart Cheshire. It’s a very simple introduction to Apple’s implementation of Zeroconf. Bonjour, aka Zeroconf, aka Avahi, isn’t the evil I thought it was; and I don’t know why I assumed it was evil. I guess it’s a mixture of hating that Avahi is on by default in most RH-based distros, coupled with my misconception that Bonjour was AppleTalk rebranded. AppleTalk had a reputation for being chatty, so I just assumed Bonjour inherited that gene.

Truth is, Bonjour doesn’t introduce any non-standard whiz-bang protocols or “chatty” communications into the LAN. It’s simple multicast mixed with creative use of DNS PTR and SRV records. It uses some of the same tactics ARP uses, updating all devices based on the requests and replies broadcast by other devices (nothing too surprising there).

Don’t go getting the impression I posted this just to evangelize the obvious utility or practicality of Zeroconf. It’s one of those technologies that is implicitly trusting of the local network. With today’s ubiquitous use of wifi, often public wifi, that’s a major fault. All OSes ship local network technologies that operate on the assumption that “all devices are inherently good,” then ambivalently pair them with something like Kerberos and Active Directory, where Kerberos believes all networks are inherently evil, to the point where it doesn’t even trust the network enough to send an encrypted hash of a user’s password. Apple then blends these contradictions into the tremendously ill-advised practice of allowing the local DHCP server to specify the primary domain controller used when authenticating on your local system. Read that last sentence again for effect. Let’s remember that next time “it just works.”

System Preferences → Accounts → Login Options → Join → Search Policy → change from “Automatic” to “Local Directory” → commence acting like you knew.

written in InfoTech

DynDNS With Iptables

I wanted to use a DynDNS address with iptables. Obviously, you need a way to update the iptables rules when the IP behind the DynDNS address changes. The easiest solution is to cron a script that updates iptables when the IP changes. Here is one such script:
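A sketch of the approach (the hostname, cache path, nslookup parsing, and the specific iptables rules are all illustrative assumptions):

```shell
HOST="${HOST:-myhost.dyndns.org}"      # the DynDNS name to track
CACHE="${CACHE:-/var/run/dyndns.ip}"   # last IP we configured a rule for

# Update only when resolution succeeded and the address actually changed.
ip_changed() {
    old=$1 new=$2
    [ -n "$new" ] && [ "$new" != "$old" ]
}

# Take the last "Address" line, skipping the resolver's own address.
new_ip=$(nslookup "$HOST" 2>/dev/null | awk '/^Address/ {a=$2} END {print a}')
old_ip=$(cat "$CACHE" 2>/dev/null) || old_ip=""

if ip_changed "$old_ip" "$new_ip"; then
    # Swap the ACCEPT rule over to the new address, then cache it.
    [ -n "$old_ip" ] && iptables -D INPUT -s "$old_ip" -j ACCEPT
    iptables -A INPUT -s "$new_ip" -j ACCEPT
    echo "$new_ip" > "$CACHE"
fi
```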

written in InfoTech Read on →

FreeBSD Ramdisk - Mdconfig

Creating a ramdisk on FreeBSD is straightforward, but Google will lead you astray. The main problem with finding accurate results on how to create a ramdisk is that it’s not called a ramdisk; it’s technically referred to as a “memory-based disk” in FreeBSD. To make matters worse, the name of the utility has recently changed: it used to be “vnconfig” and is now “mdconfig”.

Most of the articles instruct the user to create a startup script, run out of rc.local, that initializes and mounts the memory disk at startup. However, FreeBSD added proper rc scripts that create the ramdisk earlier in the boot process, for services that rely on its existence (such as Nagios).

There are two startup scripts in /etc/rc.d, named mdconfig and mdconfig2. The first script runs early in the rc order and completes the majority of the mdconfig options. The second script exists to perform mdconfig operations that can’t be completed early in the boot process. They basically work as a team to carry out the options you specify in the rc.conf file. Here’s an example block from rc.conf where I create a 128M ramdisk of type malloc named /dev/md0.

mdconfig_md0="-t malloc -s 128m"
mdconfig_md0_owner="nagios:nagios"
mdconfig_md0_perms="2775"
mdconfig_md0_cmd="su -m nagios -c 'mkdir -p /var/spool/nagios/ramdisk/checkresults /var/spool/nagios/ramdisk/rw'"

The owner and perms options should be self-explanatory. The _cmd option will perform just about anything as long as the necessary services the command relies on have been started. Here is the necessary fstab entry:

/dev/md0        /var/spool/nagios/ramdisk    ufs    rw    0    0

That’s basically all there is to it; see man mdconfig and man rc.conf (search for mdconfig). The type you choose should usually be swap-backed (-t swap, not malloc): swap backing allows the OS to manage the memory more safely, paging it out when it’s inactive. In this instance I’d rather the OS page out virtually anything else before status.dat, so I used malloc. Also, you don’t need to restart for these changes to take effect (but you may want to for testing); you can run /etc/rc.d/mdconfig start and /etc/rc.d/mdconfig2 start. I don’t want to talk about how much time I spent finding this information. I hope this helps someone; let me know in the comments.

written in InfoTech