Learn to Write Command Line Utilities in R

Introduction

Do you know some R? Have you ever wanted to write your own command line utilities, but didn’t know where to start? Do you like Harry Potter?

If the answer to these questions is “Yes!”, then you’ve come to the right place. If the answer is “No”, but you have some free time, stick around anyway; it might be fun!

Over the course of the next few posts we’re going to be taking a look at writing a simple command line based “Sorting Hat” utility, but first some background…

What is a command line utility?

Command line utilities come in many shapes and sizes. If you’ve ever used the Linux command line (or the macOS or Windows ones) you’ll have used these sorts of tools before: things like ls, to ‘list’ the contents of a directory (dir on Windows), or cd, to ‘change directory’. Other utilities provide more advanced functionality, such as grep, for searching through text for a given string or regular expression, or wc, for counting the number of words or lines in a file.

The thing that most of these tools have in common is that they do a small number of things really well. For instance, the cd tool could also count words in files, like wc does, but that would make it considerably more difficult to learn.

Command line utilities are written in all sorts of languages, but mainly C and Python, though these tools can be written in any language that can be run on the system.

Why write command line tools in R?

Given that it’s not something you see a lot of, what’s the point of writing command line utilities in R? Well, there are a lot of good reasons, but they’re mainly the same as why you’d choose to use R for anything:

Its analytical power is second to none
It’s a really handy skill in HPC (High Performance Computing) Environments
It’s rarely used for command line tools, so you tend to have complete control over the environment, for example control over the R version used
Maybe it’s the language you’re most comfortable with
It’s fun - This is the most important one, obviously!

Writing command line utilities is a good exercise for the mind, especially if you’re used to using R interactively. These sorts of tools are designed to be run from start to finish with no user interaction and are therefore fairly easy to run in bulk, or on a schedule.

Say you wanted to perform some analysis every night at midnight. You could write one of these command line tools to do just that, leaving its outputs in a particular directory, or writing them to a database. Then you could schedule that script to run every day without having to be there to do it yourself.
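As a rough idea of what such a nightly job might look like, here is a minimal sketch. The ‘analysis’ is just a stand-in using R’s built-in mtcars dataset, and the output directory and file name are made up for illustration, so treat it as a shape to adapt rather than a recipe.

```r
# Hypothetical nightly job: summarise a dataset and save the result.
# The output directory and file name below are illustrative assumptions.
out_dir <- file.path(tempdir(), "nightly-output")
dir.create(out_dir, showWarnings = FALSE, recursive = TRUE)

# Stand-in for a real analysis: mean miles-per-gallon by cylinder count
result <- aggregate(mpg ~ cyl, data = mtcars, FUN = mean)

# Name the file after today's date so each run leaves its own output
out_file <- file.path(
  out_dir,
  paste0("summary-", format(Sys.Date(), "%Y-%m-%d"), ".csv")
)
write.csv(result, out_file, row.names = FALSE)

cat("Wrote", out_file, "\n")
```

Because nothing in the script waits for user input, a scheduler such as cron on Linux or Task Scheduler on Windows can run it unattended.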

OK, I’m convinced. What next?

Starting with this post, we’re going to write a little command line utility of our own, to see how it works and to get a feel for the steps we need to go through.

I’ve already written 7 versions of a command line “Sorting Hat” from Harry Potter, each one adding a new feature over the previous version. “Why a Sorting Hat?” I hear you ask. Well, I’m not a statistician or a data scientist, so the idea of me trying to do something cool in that space is not very appealing, and it’s the holidays, and this is more fun! The things to take away from all this are the concepts though, not the Sorting Hat itself. The concepts discussed over the course of this series are completely transferable to any sort of command line tool you might want to write.

So, let’s get started!

Our first command line Sorting Hat

For ease of use and consistency we’re going to be using the latest version of the RStudio IDE.

Create a new project in RStudio and either type out, or copy and paste, the following code into a file called ‘sortinghat.R’.

#!/usr/bin/env Rscript --vanilla
houses <- c("Hufflepuff", "Gryffindor", "Ravenclaw", "Slytherin")
house <- sample(houses, 1)
cat(house, "\n")

This is the first version of our first sortinghat utility and as you can see, it’s very simple.

Let’s break it down line by line:

#!/usr/bin/env Rscript --vanilla

Sometimes called a ‘shebang’, this line tells the Linux and macOS command line interpreters (which both default to one called ‘bash’) what you want to use to run the rest of the code in the file. As you write more command line tools like this, you may see variations on this, but using env in this way is generally considered to be very portable. It basically says to the bash interpreter that you want to run the rest of the code in this file through whichever version of ‘Rscript’ the env command knows about. The env tool knows about your ‘environment’, which includes things like what tools are available on your PATH and so on.

All scripts on Linux and macOS execute using the command interpreter specified on the first line like this. Rscript is such a command interpreter and is installed along with R. It is specifically intended to be used in these sorts of scripting scenarios. The --vanilla on the end tells Rscript to run without saving or restoring anything in the process. This just keeps things nice and clean.

One caveat: Linux hands everything after /usr/bin/env to env as a single argument, so on some Linux systems this line can fail with an error saying that ‘Rscript --vanilla’ cannot be found. If you hit that, either drop the --vanilla or, where your version of env supports the -S option, use ‘env -S Rscript --vanilla’ instead.
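If you’re curious about what the shebang line actually handed to R, one way to check is to print the full command line from inside a script. This is a small sketch using base R’s commandArgs(); the exact output depends on your system and R version, so none is shown here.

```r
# Print the full command line that this R process was started with,
# one entry per line. When run via Rscript, this includes the options
# that Rscript passed on to R as well as the script file itself.
args <- commandArgs()
cat(args, sep = "\n")
```

When a script is run through Rscript with --vanilla, that option should appear somewhere in the list.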

houses <- c("Hufflepuff", "Gryffindor", "Ravenclaw", "Slytherin")

This line sets up our houses.

house <- sample(houses, 1)

This one randomly selects a house from the list.
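To get a feel for how sample() behaves, here is a short sketch you can paste into the R console. Fixing the seed with set.seed() is not something our script does; it’s used here only to show that the ‘random’ draw is repeatable when the seed is the same. The seed value and the number of draws are arbitrary choices.

```r
houses <- c("Hufflepuff", "Gryffindor", "Ravenclaw", "Slytherin")

# With the same seed, sample() makes the same choice each time
set.seed(42)
first <- sample(houses, 1)
set.seed(42)
second <- sample(houses, 1)
identical(first, second)  # TRUE: same seed, same draw

# Draw many times (with replacement) and count how often each house comes up
draws <- sample(houses, 1000, replace = TRUE)
table(draws)
```

Each house should come up roughly a quarter of the time, since sample() picks each element with equal probability by default.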

cat(house, "\n")

This one prints it out.
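The reason for using cat() rather than print() is easiest to see side by side. A quick sketch, with the house hard-coded for the example:

```r
house <- "Ravenclaw"

# print() shows the value the way the R console does,
# with an index and quotes: [1] "Ravenclaw"
print(house)

# cat() writes the bare text, which is what we want from a command line tool.
# It puts a space between its arguments, so this prints "Ravenclaw " plus a newline.
cat(house, "\n")

# Setting sep = "" avoids that trailing space
cat(house, "\n", sep = "")
```

The trailing space is harmless here, but it’s worth knowing about if another program ever needs to read your tool’s output.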

Running our new command line utility

Now we need to actually run our command line tool to see it in action.

macOS/Linux

Make sure the file is saved and then switch to the ‘Terminal’ tab in RStudio. (If you’re not using RStudio, I’ll leave you to figure this one out on your own, in your Terminal application of choice.)

You should automatically be in the same directory that the project was started in, so if you type ls and hit return, you should see a file called ‘sortinghat.R’.

Next, we need to make that file executable, so that we can run it.

$ chmod +x sortinghat.R

This command sets the sortinghat.R file to be directly executable. (The ‘$’ symbol is the prompt, so you only need to type what comes after it.) This means that, if the file is set up correctly, which is what that first line is about, you can ‘execute’ the file as though it were itself a command. To do that, run the following:

$ ./sortinghat.R
Ravenclaw

This time, we’ve run our new command (again, everything after the ‘$’ prompt). We use the ./ to tell the command line that the file we want to execute is in the current working directory.

You should hopefully find that it produces the name of one of the houses as output. Try running it again and see what happens. Then run it a bunch more times for fun.

Windows

Running command line tools on Windows is a little harder than on Linux and macOS, which are both Unix-like and therefore have very similar underpinnings. Windows is altogether different and so it requires a completely different approach. That said, if you’re using Git Bash for working with git repositories then you can follow the Linux instructions above.

If you’re not sure, the easiest way to check is to switch to the Terminal tab in RStudio and run the following:

echo $SHELL

If the terminal prints ‘/usr/bin/bash’ or something similar underneath, skip this bit and just follow the Linux instructions above. If, however, the terminal prints ‘$SHELL’ underneath the command then you have the standard Windows command line, often just referred to as ‘cmd’. For cmd, we need to do a little extra work, but fortunately, we only need to do it once.

Create a new text file in the IDE and save it as ‘sortinghat.bat’. Some of you may be familiar with the ‘.bat’ extension as the one used for DOS Batch files, which is exactly what we’re going to make here, though ours will just be a really simple wrapper around our R script.

@echo off
"C:\Program Files\R\R-3.4.2\bin\Rscript" --vanilla "sortinghat.R" %*

That’s the whole batch file. The first line tells cmd not to echo the commands in the file as it runs them. This just keeps our output looking tidy. The second line is the interesting part. The first section is the full path to the Rscript interpreter on your system. I’ve hopefully given you a helping hand by showing you the path on my system, but if you have a different version of R, or it’s installed in a non-standard place, you may have to change that part. The next bit is the --vanilla, which we saw in the first line of the script itself. Then we have the name of the file that we want to run. Lastly, we have this weird little ‘%*’ thing. That’s going to let us do some more interesting things later on in the series, but for now, let’s just say that it will enable us to pass options and things to our script in the future.
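As a small preview of what ‘%*’ makes possible, here is a sketch of how an R script can read whatever was passed along to it. It uses base R’s commandArgs(); it isn’t part of our Sorting Hat yet, and later posts in the series may well handle arguments differently.

```r
# Collect only the arguments supplied after the script name,
# leaving out R's own start-up options
args <- commandArgs(trailingOnly = TRUE)

if (length(args) == 0) {
  cat("No extra arguments were supplied\n")
} else {
  cat("Arguments received:", paste(args, collapse = ", "), "\n")
}
```

Run with nothing after the script name, it takes the first branch. Anything typed after the command name is forwarded by ‘%*’ and ends up in args.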

That’s it. Now if you make sure that’s saved, head back to the ‘Terminal’ tab and run:

sortinghat

You should get a random house name. If you get an error, check it carefully; it’s likely that something in ‘sortinghat.bat’ isn’t quite right.

Once it’s working, try running it again and see what happens. Then run it a bunch more times for fun.

Wrapping up

OK, so by this point you should have a working command line application that prints a Hogwarts house at random to the command line. While you have the core of a command line app, once any initial novelty has worn off, I’m sure you’ll agree that it’s pretty bad. It’s just random, and there are no options to play with!

Fear not though; in the upcoming parts of this series we’ll be able to go a bit faster, since we now have a foundation upon which to build. This means we’ll be able to quickly add new features that improve dramatically on what we’ve done so far and will hopefully turn our Sorting Hat script into something a lot more fun. (And the posts themselves will probably be shorter too!)

Monday, December 18, 2017

sellorm