Learn to Write Command Line Utilities in R - part 4

Check out the first post in this series for an index of all the other posts.

In previous posts, we’ve been working on our command line Sorting Hat utility. We started out with a really simple tool that ran on the command line and just output a random Hogwarts house. Since then, we’ve extended that to accept an argument – in this case a name – and also added some input validation and an error message.

The focus of this post will be a little different: we’ll be working on improving the Sorting Hat functionality, rather than anything specifically related to the command line operation of the script.

This is a really cool #rstats tutorial… but since when does the sorting hat assign a random Hogwarts house?! 🧙‍♀️🎩🎲➡️🏠🤔 — Maëlle Salmon (@ma_salmon) December 19, 2017

Maëlle is right of course, so how can we go about fixing that?

There are undoubtedly many ways to solve this problem and my approach may not be the best, but it gets the job done well enough for our purposes.

We need a method that will produce the same output, in this case the house name, for a given input, the person’s name.

The approach we’ll take basically breaks down like this:

  • Read in the name
  • Create a hexadecimal cryptographic hash of that name
  • Get the first character of the hash
  • Use that first character to look up a house name

Our script is already reading in the name, which ends up as args[1], so the next step is to turn that into the cryptographic hash.

To do that, we’ll use the ‘digest’ package. We need to install it first, and we should have a quick play around with it at the same time, just to get a feel for what it does.

> install.packages("digest")
> digest::sha1("sellorm")
[1] "e9d883753fd4742672f4e7df6b93d367640e33bf"

We’ll be using the sha1() function from the digest package. The SHA-1 algorithm outputs a 40-character hexadecimal string. The process is repeatable: a given input string always produces the same hash. In theory the hash should also be unique to that input. (In practice it isn’t; see the Wikipedia page for a discussion of this if you’re interested.) Try using the sha1() function yourself and vary the input.
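If you’d like to check the repeatability for yourself, here’s a minimal sketch. It assumes the digest package is installed as above, and the variable names are made up for illustration. Note that the third input differs only in capitalisation, which is enough to count as a different input.

```r
library(digest)

# Hash the same name twice, plus a variant that differs only by case
hash_a <- sha1("sellorm")
hash_b <- sha1("sellorm")
hash_c <- sha1("Sellorm")

identical(hash_a, hash_b)  # same input, same hash
identical(hash_a, hash_c)  # different input, so the hashes should differ
nchar(hash_a)              # number of characters in the hash
substr(hash_a, 1, 1)       # the first character of the hash
```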

Now that we have the hash value, we can use its first character with a look-up table to assign the house. Since the hash is returned in hexadecimal, we know the first character can be either a digit from 0 to 9 or a letter from a to f (that gives you the 16 characters needed for hexadecimal). We can create a simple look-up table using a named character vector like this:

houses <- c("0" = "Hufflepuff",
            "1" = "Gryffindor",
            "2" = "Ravenclaw",
            "3" = "Slytherin",
            "4" = "Hufflepuff",
            "5" = "Gryffindor",
            "6" = "Ravenclaw",
            "7" = "Slytherin",
            "8" = "Hufflepuff",
            "9" = "Gryffindor",
            "a" = "Ravenclaw",
            "b" = "Slytherin",
            "c" = "Hufflepuff",
            "d" = "Gryffindor",
            "e" = "Ravenclaw",
            "f" = "Slytherin"
            )

In the digest::sha1("sellorm") example above, the first character was an “e”. If we look up “e” in the table we find that it corresponds to “Ravenclaw”.
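Because the houses simply repeat in the same order, the same table could also be built programmatically. This is only a sketch of an alternative, not what the script uses; `hex_digits`, `house_cycle` and `houses_alt` are made-up names for this example. It also shows the look-up itself, using the hash from the example above as a plain string.

```r
# The 16 hexadecimal digits as character strings: "0".."9", "a".."f"
hex_digits <- c(0:9, letters[1:6])

# The four houses, in the order they repeat in the table above
house_cycle <- c("Hufflepuff", "Gryffindor", "Ravenclaw", "Slytherin")

# Repeat the cycle four times and name each entry with its hex digit
houses_alt <- setNames(rep(house_cycle, times = 4), hex_digits)

# Look up the first character of the hash quoted earlier
name_hash <- "e9d883753fd4742672f4e7df6b93d367640e33bf"
first_char <- substr(name_hash, 1, 1)
houses_alt[[first_char]]  # "Ravenclaw"

# Count how many hex digits map to each house
table(houses_alt)
```

Using `[[` rather than `[` returns just the house name, without the "e" label attached.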

Now we need to incorporate this into our existing script, so that we end up with something that looks like this:

#!/usr/bin/env Rscript --vanilla
args <- commandArgs(trailingOnly = TRUE)
if (length(args) < 1){
  stop("I think you forgot your name\n")
}
your_name <- args[1]
houses <- c("0" = "Hufflepuff",
            "1" = "Gryffindor",
            "2" = "Ravenclaw",
            "3" = "Slytherin",
            "4" = "Hufflepuff",
            "5" = "Gryffindor",
            "6" = "Ravenclaw",
            "7" = "Slytherin",
            "8" = "Hufflepuff",
            "9" = "Gryffindor",
            "a" = "Ravenclaw",
            "b" = "Slytherin",
            "c" = "Hufflepuff",
            "d" = "Gryffindor",
            "e" = "Ravenclaw",
            "f" = "Slytherin"
            )
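# Hash the name; the same name always gives the same hash
name_hash <- digest::sha1(your_name)
# Take the first hexadecimal character of the hash
house_index <- substr(name_hash, 1, 1)
# Use that character to look up the house in the table
house <- houses[house_index]
cat(paste0("Hello ", your_name, ", you can join ", house, "\n"))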

We’ve done all this to remove the randomness of the earlier version. Now, each name you input will be converted to a hash value, which is then used to assign a house. If, for instance, we use ‘sellorm’ as the name, this will always have a SHA-1 hash of ‘e9d883753fd4742672f4e7df6b93d367640e33bf’, which will always resolve to Ravenclaw, using our simple look-up method.

Running our command line utility

MacOS/Linux/git-bash

$ ./sortinghat.R sellorm
Hello sellorm, you can join Ravenclaw

Remember to type everything after the ‘$’ symbol and feel free to replace ‘sellorm’ with a name of your choosing. You should see output similar to that displayed above.

Windows

Don’t forget, if you’re using git-bash (see the first article for more info), you need to follow the instructions for Linux/MacOS.

sortinghat sellorm
Hello sellorm, you can join Ravenclaw

Feel free to replace ‘sellorm’ with a name of your choosing. You should see output similar to that displayed above.

Wrapping up

To be honest, a large part of the motivation for the approach that I’ve taken is that I didn’t want my kids to be able to easily figure out how it was working. It’s complicated enough to obfuscate what’s going on, but still reasonably simple to implement. I also haven’t performed any testing to ensure that this approach provides a reasonable distribution between the houses.

Tweak the houses for best results! ;)

If you know people who’ll make trouble for you if they’re not in a specific house (like my kids did for me!) you might like to tweak the ordering of the houses in the look-up table. This is reasonably straightforward to do, but since our script is getting a little more complicated, it might be useful if we had some way to see what was going on inside it as it’s running. To do this, we need some sort of debug logging, so we’ll look at that in the next instalment.
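As a rough illustration of that kind of tweak, the sketch below swaps two entries in the look-up table, so each house keeps the same number of slots. It rebuilds the table in a compact form so the example stands alone, and the choice of which entries to swap is purely hypothetical.

```r
# Rebuild the look-up table from the script in a compact form
houses <- setNames(
  rep(c("Hufflepuff", "Gryffindor", "Ravenclaw", "Slytherin"), times = 4),
  c(0:9, letters[1:6])
)

# Suppose a name whose hash starts with "e" really must be in Gryffindor.
# Swap the "e" entry (Ravenclaw) with the "d" entry (Gryffindor).
houses[c("d", "e")] <- houses[c("e", "d")]

houses[["e"]]  # "Gryffindor"
houses[["d"]]  # "Ravenclaw"
table(houses)  # still four slots per house
```

Bear in mind that this moves every name whose hash starts with “d” or “e”, not just the one you care about.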

Update (2018-01-01): Given that I hadn’t thoroughly tested the distribution of house names using this method, I decided to run all of the unique names from the babynames package through it. It seems to be pretty good, with results for each house as follows:

# A tibble: 4 x 2
  `unlist(housedf$house_results)`     n
                            <chr> <int>
1                      Gryffindor 23783
2                      Hufflepuff 23587
3                       Ravenclaw 23913
4                       Slytherin 23742

The code I used to figure this out is available on GitHub.
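For a feel of how such a check might work, here’s a small sketch. It is not the code linked above: the `assign_house()` helper and the handful of sample names are invented for illustration, and it assumes the digest package is installed. Note that sha1() hashes a whole vector as a single object, so each name has to be hashed separately.

```r
library(digest)

# Compact version of the look-up table from the script
houses <- setNames(
  rep(c("Hufflepuff", "Gryffindor", "Ravenclaw", "Slytherin"), times = 4),
  c(0:9, letters[1:6])
)

# Hypothetical helper: hash one name and look up its house
assign_house <- function(name) {
  first_char <- substr(sha1(name), 1, 1)
  houses[[first_char]]
}

# A few made-up sample names; swap in a bigger set for a real check
sample_names <- c("Harry", "Hermione", "Ron", "Draco",
                  "Luna", "Neville", "Ginny", "Cedric")

# Apply the helper to each name in turn, then count names per house
house_results <- vapply(sample_names, assign_house, character(1))
table(house_results)
```

With only a few names the counts will be lumpy; the balance only starts to show with a much larger set, such as the babynames data mentioned above.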