Learn to write command line utilities in Python

If you read my series of posts on writing command line utilities in R, but were wondering how to do the same thing in Python, you’ve come to the right place.

I learned a lot with that original series of posts though, so this time we’re going to switch things up a bit, dive right into a complete working example, and cover it all off in a single post.

So let’s roll up our sleeves and get started!

Recap - What are command line utilities?

Command line utilities are tools that you can run on the command line of a computer. We most often see these on Linux and macOS computers using the ‘bash’ shell, but Windows users have options like CMD, git-bash and PowerShell too.

These tools allow you to instruct the computer to do things using text alone. You can also start chaining the commands together to give you a really powerful way to get computers to do things for you. You’ve possibly already used command line tools like ls and cd before, to ‘list’ a directory’s contents and ‘change directory’ respectively, but by writing our own tools we really unlock the power of the command line.

Imagine you’re an entomologist studying the sounds that cicadas make. You have field equipment set up that records audio overnight and sends it to a Linux computer in your lab. Every morning you come into the lab, see what files you have and then begin to process them.

First, you might list the files in the directory with ls. You notice there’s lots of other stuff in that directory as well as your ‘wav’ audio files, so you can do ls *.wav to just list the files you’re interested in. Then you have to run a pre-processing command on each file to turn the audio into the data you need. Finally you need to generate some preliminary plots from that data. And that’s all before you even start to do any analysis!

Wouldn’t it be better if you could get the computer to do all that for you before you’ve even arrived at the lab? Using the command line and writing our own tools for it, we can do just that. In the example above we’d want to do something like the following pseudo-code (which is mostly standard bash syntax)…

# process each wav file using the fictional 'audio-to-data' command line 
# tool, which generates a csv file for each input file
for wavfile in *.wav
  do
    ./audio-to-data "${wavfile}"
  done

# process each data file to create preliminary plots using the fictional 
# 'data-to-plot' command line tool, which outputs a png file for each input file
for datafile in *.csv
  do
    ./data-to-plot "${datafile}"
  done
  
# now we can tidy up

## move all the raw audio files to the 'raw-audio' subdirectory
mv *.wav ./raw-audio/

## move all the csv files to a 'data' subdirectory
mv *.csv ./data/

## move all the preliminary plots to a 'plots' subdirectory
mv *.png ./plots/

Now that we’ve written out the entomologist’s morning routine like this, it makes sense to get the computer to run that automatically. We can then use the scheduling tools built into the operating system (a thing called ‘cron’ in this instance), to run this as a script each morning at 6am. That means that all this work is already done by the time our entomologist arrives at the lab, meaning they can get on with the job of actually analysing the data and plots.
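Since this post is about Python, here is how the tidy-up steps at the end of that script might look using only the standard library. Treat it as a rough sketch: the tidy_up function and the DESTINATIONS mapping are names I've made up for illustration, and the demo runs in a throwaway directory so nothing real gets moved.

```python
import shutil
import tempfile
from pathlib import Path

# which subdirectory each file type belongs in (mirrors the bash version)
DESTINATIONS = {".wav": "raw-audio", ".csv": "data", ".png": "plots"}


def tidy_up(directory):
    """Move files into subdirectories based on their extension."""
    directory = Path(directory)
    moved = []
    # sorted() takes a snapshot first, so new subdirectories aren't revisited
    for path in sorted(directory.iterdir()):
        target_name = DESTINATIONS.get(path.suffix)
        if path.is_file() and target_name:
            target_dir = directory / target_name
            target_dir.mkdir(exist_ok=True)
            shutil.move(str(path), str(target_dir / path.name))
            moved.append(path.name)
    return moved


# try it out in a temporary directory with some empty stand-in files
with tempfile.TemporaryDirectory() as tmp:
    for name in ["night1.wav", "night1.csv", "night1.png", "notes.txt"]:
        (Path(tmp) / name).touch()
    print(tidy_up(tmp))
```

Files with any other extension, like notes.txt here, are left where they are.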

This is all well and good, but I cheated! Some of the commands in my example were fictional! The operating system sometimes doesn’t have a built-in command that can help you – for instance there’s no built-in command to detect cicada sounds! – and that’s why we write our own command line utilities: to fill a specific need that isn’t already met by your operating system.

A Python Sorting Hat

In this post we’re not going to try anything quite so ambitious as an audio-file-to-csv converter, but instead we’ll take a look at an example which provides some good foundations that you can build on yourself.

Below is the Python code for a command line sorting hat. If you did follow along on the R version of this you should recognise it. If you didn’t, it’s basically a small, text-based program that takes a name as input and then tells you which Hogwarts house that person has been sorted into.

$ ./sortinghat.py Mark
Hello Mark, you can join Slytherin!
$ ./sortinghat.py Hermione
Hello Hermione, you can join Ravenclaw!

The code for the R version appeared in the sixth installment of the R series. Here’s the output of that one:

$ ./sortinghat.R Mark
Hello Mark, you can join Slytherin!
$ ./sortinghat.R Hermione
Hello Hermione, you can join Ravenclaw!

Exact same thing. Poor Hermione!

Here’s the full code for the Python version.

#!/usr/bin/env python
"""
A sorting hat you can run on the command line
"""

import argparse
import hashlib

PARSER = argparse.ArgumentParser()
# add a positional argument
PARSER.add_argument("name", help="name of the person to sort")
# Add a debug flag
PARSER.add_argument("-d", "--debug", help="enable debug mode",
                    action="store_true")
# Add a short output flag
PARSER.add_argument("-s", "--short", help="output only the house",
                    action="store_true")

ARGV = PARSER.parse_args()

def debug_msg(*args):
    """prints the message if the debug option is set"""
    if ARGV.debug:
        print("DEBUG: {}".format("".join(args)))

debug_msg("Debug option is set")

debug_msg("Your name is - ", ARGV.name)

HOUSES = {"0" : "Hufflepuff",
          "1" : "Gryffindor",
          "2" : "Ravenclaw",
          "3" : "Slytherin",
          "4" : "Hufflepuff",
          "5" : "Gryffindor",
          "6" : "Ravenclaw",
          "7" : "Slytherin",
          "8" : "Hufflepuff",
          "9" : "Gryffindor",
          "a" : "Ravenclaw",
          "b" : "Slytherin",
          "c" : "Hufflepuff",
          "d" : "Gryffindor",
          "e" : "Ravenclaw",
          "f" : "Slytherin"
         }


NAME_HASH = hashlib.sha1(ARGV.name.lower().encode('utf-8')).hexdigest()

debug_msg("The name_hash is - ", NAME_HASH)

HOUSE_KEY = NAME_HASH[0]

debug_msg("The house_key is - ", HOUSE_KEY)

HOUSE = HOUSES[HOUSE_KEY]

if ARGV.short:
    print(HOUSE)
else:
    print("Hello {}, you can join {}!".format(ARGV.name, HOUSE))

In order to actually run this thing, you can either type it out yourself, or just copy and paste it into a file called ‘sortinghat.py’.

We could just run this with python sortinghat.py, but that doesn’t make our utility feel like it’s a proper command line tool. In order for Linux and macOS shells (and Windows Subsystem for Linux and git-bash) to treat the file as ‘executable’ we must mark it as such by changing the ‘mode’ of the file.

Make sure you’re in the same directory as your file and run:

$ chmod +x ./sortinghat.py

Now you can just run ./sortinghat.py to run the command.

Breaking things down

Next we’re going to go through each section in turn and look at what it does.

shebang and docstring

#!/usr/bin/env python
"""
A sorting hat you can run on the command line
"""

That very first line is referred to as a ‘shebang’ and it tells the operating system which program to use to execute everything that follows when you run the file from your command line shell (of which there are many, but ‘bash’ is the most common). In this case we’re using a command called env, which looks up python on your PATH so we don’t have to hard-code where it is installed.

Note: I’m using python 3 for this example. On some systems that have both python 2 and 3, 3 is referred to as python3, not just python. If that’s the case for you, you’ll need to change the shebang to #!/usr/bin/env python3.
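If you're not sure which interpreter the shebang has picked up, a small guard near the top of the script can tell you. This is only a sketch and isn't part of the sorting hat script; the require_python3 name is made up for illustration.

```python
import sys


def require_python3(version_info=sys.version_info):
    """Return True when the major version is 3 or newer (hypothetical helper)."""
    return version_info[0] >= 3


if not require_python3():
    # sys.exit with a string prints it to stderr and exits with status 1
    sys.exit("This script needs Python 3; try changing the shebang to python3")

print("Running on Python {}.{}".format(*sys.version_info[:2]))
```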

After the shebang is a standard python docstring, just telling you what the app is all about.

import

import argparse
import hashlib

Next we import the external modules we’re going to use. Lucky for us, python has an extensive and varied standard library of modules that ship with it, so we don’t need to install anything extra.

‘argparse’ will parse command line arguments for us. If you think of a command line tool like ls, arguments are things you can put after it to modify its behaviour. For example ls -l has -l as the argument and causes ls to print ‘longer’ output with more information than the standard output. For ls *.wav, the shell expands the *.wav pattern into the names of the matching files and passes those to ls as arguments, so ls only lists those files.
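By the time a program sees its arguments they are just a list of strings, which Python exposes as sys.argv and which argparse reads for us. Here's a tiny sketch of that idea, using a made-up list in place of a real command line so it behaves the same wherever you run it. The file names are invented.

```python
# In a real script this list would come from sys.argv. The first item is the
# program name and the rest are the arguments, with any *.wav pattern already
# expanded into file names by the shell.
example_argv = ["ls", "-l", "first.wav", "second.wav"]

program, arguments = example_argv[0], example_argv[1:]
flags = [arg for arg in arguments if arg.startswith("-")]
files = [arg for arg in arguments if not arg.startswith("-")]

print("program:", program)
print("flags:", flags)
print("files:", files)
```

Sorting arguments into flags and values by hand like this gets fiddly quickly, which is exactly the job argparse takes off our hands.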

‘hashlib’ is a module that implements various hash and message digest algorithms, which we’ll need later on for the sorting part of the utility.

Handling arguments

PARSER = argparse.ArgumentParser()
# add a positional argument
PARSER.add_argument("name", help="name of the person to sort")
# Add a debug flag
PARSER.add_argument("-d", "--debug", help="enable debug mode",
                    action="store_true")
# Add a short output flag
PARSER.add_argument("-s", "--short", help="output only the house",
                    action="store_true")

ARGV = PARSER.parse_args()

This block sets up a new argument parser for us and adds some arguments to it. Arguments whose names don’t start with - or -- are ‘positional’, which by default means they’re mandatory. If you define multiple positional arguments they must be specified at run-time in the order they are defined.
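To see the ordering rule in action, here's a minimal sketch with two positional arguments. The first_name and last_name arguments are invented for this example, and I'm passing a list to parse_args() so it doesn't depend on a real command line.

```python
import argparse

parser = argparse.ArgumentParser()
# positional arguments are filled in the order they are defined
parser.add_argument("first_name")
parser.add_argument("last_name")

# the list stands in for typing: ./example.py Hermione Granger
args = parser.parse_args(["Hermione", "Granger"])
print(args.first_name)  # Hermione
print(args.last_name)   # Granger
```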

In our case, if we don’t specify the ‘name’, then we’ll get an error:

$ ./sortinghat.py
usage: sortinghat.py [-h] [-d] [-s] name
sortinghat.py: error: the following arguments are required: name

We didn’t have to create this error message, argparse did that for us because it knows that ‘name’ is a required argument.
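As well as printing the message, argparse stops the program with a non-zero exit status, which is handy when your tool is called from another script. A small sketch of that behaviour, catching the exit so we can look at it (the empty list stands in for running the tool with no arguments):

```python
import argparse

parser = argparse.ArgumentParser(prog="sortinghat.py")
parser.add_argument("name", help="name of the person to sort")

exit_code = 0
try:
    parser.parse_args([])
except SystemExit as exit_info:
    # by this point argparse has printed the usage and error text to stderr
    exit_code = exit_info.code

print("exit status:", exit_code)
```

In a real utility you wouldn't normally catch SystemExit like this; it's only here so the example can show the status instead of stopping.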

The other arguments are ‘flags’, which means we can turn things on and off with them. Flags are specified with -- for the long form and - for the short form. You don’t have to have both, but this has developed into something of a convention over the years. Specifying them separately like this is also useful as it gives you full control over how the short options relate to the longer ones.

If, for instance, you wanted two arguments in your application called --force and --file, the convention would be to use -f as the short form, but you can’t use it for both. Explicitly assigning the short form version allows you to decide what you want to use instead. Maybe you’d go for -i for an input file or -o for an output file or something like that.
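Here's a rough sketch of how that --force and --file situation might be resolved. The choice of -i for the file is just one option, and the file name is made up.

```python
import argparse

parser = argparse.ArgumentParser()
# --force keeps the conventional -f ...
parser.add_argument("-f", "--force", action="store_true",
                    help="overwrite existing output")
# ... so --file gets -i (for 'input') instead
parser.add_argument("-i", "--file", help="input file to read")

args = parser.parse_args(["-f", "-i", "cicadas.wav"])
print(args.force)  # True
print(args.file)   # cicadas.wav
```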

These arguments are flags because we set action="store_true" in them, which stores True if they’re set and False if they’re not.

If you omit the action="store_true", you get an optional argument that takes a value. This could be something like --file /path/to/file, where you must specify something immediately after the argument. You can use these for specifying additional parameters for your scripts. We’re not really covering that in this script though, so here are a few quick examples to get you thinking:

--config /path/to/config_file - specify an alternate config file to use instead of the default
--environment production - run against production data rather than test data
--algo algorithm_name - use a different algorithm instead of the default
--period weekly - change the default calculation period of your utility
--options /path/to/options/file - provide options for an analysis from an external file
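A couple of the ideas above could be sketched like this. The default values and the list of allowed periods are assumptions I've made up for the example; default and choices themselves are standard add_argument() parameters.

```python
import argparse

parser = argparse.ArgumentParser()
# used when --config isn't given on the command line
parser.add_argument("--config", default="default.cfg",
                    help="alternate config file")
# choices makes argparse reject anything that isn't in the list
parser.add_argument("--period", default="daily",
                    choices=["daily", "weekly", "monthly"],
                    help="calculation period")

args = parser.parse_args(["--period", "weekly"])
print(args.config)  # default.cfg
print(args.period)  # weekly
```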

Another freebie we get from argparse is -h and --help. These are built-in and print nicely formatted help output for your users or future-self!

$ ./sortinghat.py -h
usage: sortinghat.py [-h] [-d] [-s] name

positional arguments:
  name         name of the person to sort

optional arguments:
  -h, --help   show this help message and exit
  -d, --debug  enable debug mode
  -s, --short  output only the house

Lastly for this section, we use parse_args() to parse the command line and store the results in a new namespace called ARGV so we can use them later. Arguments stored in ARGV are retrievable using the long version of the argument name, so in this example: ARGV.name, ARGV.debug and ARGV.short.
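One detail worth knowing about that namespace: if a long option contains a dash, argparse swaps it for an underscore in the attribute name. A quick sketch, using a --dry-run flag that I've invented for the example and that isn't in the sorting hat:

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("name")
parser.add_argument("-n", "--dry-run", action="store_true")

args = parser.parse_args(["--dry-run", "Mark"])
print(args.name)     # Mark
print(args.dry_run)  # True
# vars() turns the namespace into a plain dictionary
print(vars(args))
```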

Everything from this point onward is largely to do with the functionality of the utility, not the command line execution of it, so we’ll go through it quite quickly.

Printing debug messages

I didn’t want to get bogged down using a proper logging library for this small tool, so this function takes care of our very basic needs for us.

def debug_msg(*args):
    """prints the message if the debug option is set"""
    if ARGV.debug:
        print("DEBUG: {}".format("".join(args)))

Essentially, it will only print a message if ARGV.debug is True and that will only be true if we set the -d flag when we run the tool on the command line.

We can then put messages like debug_msg("Debug option is set") in our code and they’ll do nothing unless that -d flag is set. If it is set, you’ll get output like:

$ ./sortinghat.py -d Mark
DEBUG: Debug option is set
DEBUG: Your name is - Mark
DEBUG: The name_hash is - f1b5a91d4d6ad523f2610114591c007e75d15084
DEBUG: The house_key is - f
Hello Mark, you can join Slytherin!

Using a technique like this – or perhaps a --verbose flag – can help to provide additional information about what’s going on inside your utility at run time that could be helpful to others or your future-self if they encounter any difficulties with it.

The debug_msg() function is used in this way throughout the rest of the program.
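If your tool outgrows a helper like this, the logging module in the standard library can do the same job. This is a minimal sketch of one way to wire it to the -d flag, not how the sorting hat actually does it:

```python
import argparse
import logging

parser = argparse.ArgumentParser()
parser.add_argument("name")
parser.add_argument("-d", "--debug", action="store_true")
# the list stands in for running: ./example.py -d Mark
args = parser.parse_args(["-d", "Mark"])

# show debug messages only when the flag is set
level = logging.DEBUG if args.debug else logging.WARNING
logging.basicConfig(level=level, format="%(levelname)s: %(message)s")

logging.debug("Your name is - %s", args.name)
```

One difference to be aware of is that logging writes to stderr by default rather than stdout, which keeps debug chatter out of any output you pipe to another command.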

Figuring out the house

To figure out which house to assign someone to we use the same approach that we did for the R version. We calculate the SHA-1 hash of the input name and store the hexadecimal representation. Since hex uses the numbers 0-9 and the characters a-f, we can assign the four Hogwarts houses to these 16 symbols evenly in a Python dictionary.

We can then use the first character of the input name hash as the key when retrieving the value from the dictionary:

HOUSES = {"0" : "Hufflepuff",
          "1" : "Gryffindor",
          "2" : "Ravenclaw",
          "3" : "Slytherin",
          "4" : "Hufflepuff",
          "5" : "Gryffindor",
          "6" : "Ravenclaw",
          "7" : "Slytherin",
          "8" : "Hufflepuff",
          "9" : "Gryffindor",
          "a" : "Ravenclaw",
          "b" : "Slytherin",
          "c" : "Hufflepuff",
          "d" : "Gryffindor",
          "e" : "Ravenclaw",
          "f" : "Slytherin"
         }


NAME_HASH = hashlib.sha1(ARGV.name.lower().encode('utf-8')).hexdigest()

HOUSE_KEY = NAME_HASH[0]

HOUSE = HOUSES[HOUSE_KEY]

We also make sure that the input name is converted to lower case first to prevent us from running into any discrepancies between, for example, ‘Mark’ and ‘mark’.
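Because the houses repeat in the same order every four symbols, the lookup could also be done with a little arithmetic instead of a 16-entry dictionary. Here's a sketch of that alternative; the sort_name function and HOUSE_NAMES list are my own names for illustration, not part of the script above.

```python
import hashlib

# same order as keys "0" to "3" in the dictionary version
HOUSE_NAMES = ["Hufflepuff", "Gryffindor", "Ravenclaw", "Slytherin"]


def sort_name(name):
    """Return the house for a name (hypothetical helper)."""
    name_hash = hashlib.sha1(name.lower().encode("utf-8")).hexdigest()
    # int(..., 16) turns the first hex digit into a number from 0 to 15,
    # and % 4 wraps that round onto the four houses
    return HOUSE_NAMES[int(name_hash[0], 16) % 4]


# lower-casing first means capitalisation doesn't change the result
print(sort_name("Mark") == sort_name("MARK"))  # True
```

The dictionary is more typing, but it's easier to read at a glance and easier to change if you ever want an uneven split.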

Printing output

Here in the final section, we use the value of ARGV.short to decide whether to print the long output or the short output. Flags are False by default with argparse, so we can test if it’s been set to True (by specifying the -s flag on the command line) and print accordingly.

if ARGV.short:
    print(HOUSE)
else:
    print("Hello {}, you can join {}!".format(ARGV.name, HOUSE))

Using the -s flag on the command line results in the following short output:

$ ./sortinghat.py -s Mark
Slytherin

Since the flags are optional you can combine them if you need to, so something like ./sortinghat.py -s -d Mark will produce the expected output - debug info with the short version of the final message.
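argparse also lets you bunch short flags together behind a single dash, the same way you might type ls -la. A quick sketch to show that both spellings give the same result, using a cut-down copy of the sorting hat's parser:

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("name")
parser.add_argument("-d", "--debug", action="store_true")
parser.add_argument("-s", "--short", action="store_true")

separate = parser.parse_args(["-s", "-d", "Mark"])
combined = parser.parse_args(["-sd", "Mark"])
print(separate == combined)  # True
```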

That’s it for now

I hope you found this post useful and that you have some great ideas for things in your workflows that could be automated with command line utilities. If you do end up writing your own tool, let me know about it. I love hearing about all the awesome ways people are using these techniques to solve real world problems.

Saturday, November 3, 2018