The "nhsnumber" package and the joy of sharing your niche

This post originally appeared on the NHS-R blog.

Being the author of a package with tens of thousands of users must be incredibly rewarding: all those people getting value from your work and using it to do remarkable things. Few of us will ever write a package with that kind of reach, though.

Most of us must be content to give back to our communities in smaller ways.

In 2019 I was working for a company building software for Genomics England and the NHS. We had many conversations about NHS numbers and NHS Spine and so on, and over time, I became interested in the numbers themselves.

As a manager, I wasn’t directly involved in writing any code, but was still heavily involved in the R community in my free time. So I wrote some code to validate the checksums used by NHS numbers to get a better understanding of how they work and to play with some R.

NHS numbers use a fairly simple format. The first 9 digits are the actual number and the 10th digit is a checksum: a piece of data that is used to verify another piece of data. That 10th digit is generated by an algorithm that takes the first 9 digits as its input. This means you can check the validity of a number by taking the first 9 digits, computing the checksum and comparing it with the checksum provided. If they match, the number is valid.

Taken together, all 10 digits form the complete NHS number.
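
For the curious, the algorithm behind NHS number checksums is known as Modulus 11. Below is a minimal base R sketch of it, written for this post rather than taken from the package. The function names nhs_check_digit() and nhs_is_valid() are made up for illustration and are not part of the nhsnumber API, and the sketch works on strings so that leading zeros survive.

```r
# Hypothetical helper, NOT part of the nhsnumber package.
# Computes the Modulus 11 check digit for the first nine digits of an NHS number.
nhs_check_digit <- function(first_nine) {
  # Work on a single string of exactly nine digits
  first_nine <- as.character(first_nine)
  stopifnot(length(first_nine) == 1, grepl("^[0-9]{9}$", first_nine))

  digits <- as.integer(strsplit(first_nine, "")[[1]])
  weights <- 10:2  # first digit is weighted 10, ninth digit is weighted 2
  remainder <- sum(digits * weights) %% 11
  check <- 11 - remainder

  if (check == 11) return(0L)           # a result of 11 is written as 0
  if (check == 10) return(NA_integer_)  # a result of 10 means the number cannot be valid
  as.integer(check)
}

# Hypothetical validator: recompute the check digit and compare it with the 10th digit
nhs_is_valid <- function(nhs_number) {
  nhs_number <- as.character(nhs_number)
  if (length(nhs_number) != 1 || !grepl("^[0-9]{10}$", nhs_number)) return(FALSE)
  expected <- nhs_check_digit(substr(nhs_number, 1, 9))
  !is.na(expected) && expected == as.integer(substr(nhs_number, 10, 10))
}

nhs_check_digit("123456788")  # 1
nhs_is_valid("1234567881")    # TRUE
nhs_is_valid("1234567882")    # FALSE
```

For the made-up number 123456788 the weighted sum is 208, which leaves a remainder of 10 when divided by 11, so the check digit is 11 - 10 = 1. The nhsnumber package does this work for you: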

library(nhsnumber)
# Take a made-up number and generate a checksum
get_checksum(123456788, full_output = TRUE)
# Take that output and a version of the same input number with an
# incorrect checksum and test their validity
is_valid(c(1234567881, 1234567882))

Credit card numbers, and many other numbers found in the wild, use the same technique, though often with different algorithms (the Luhn algorithm in the case of credit card numbers). This makes it possible for us to check that a number is well formed before we pass it on to upstream services for final validation and association with the person it was assigned to.
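
To show how a different algorithm can fill the same role, here is a rough base R sketch of a Luhn check. It is not part of the nhsnumber package, and luhn_is_valid() is a name invented for this example. It expects the number as a string, since a 16-digit card number stored as a numeric can lose precision.

```r
# Hypothetical helper illustrating the Luhn algorithm; not from any package.
luhn_is_valid <- function(number) {
  stopifnot(is.character(number), length(number) == 1)
  if (!grepl("^[0-9]+$", number)) return(FALSE)

  # Reverse the digits so that position 1 is the check digit
  digits <- rev(as.integer(strsplit(number, "")[[1]]))

  # Double every second digit, counting from the right
  second <- seq_along(digits) %% 2 == 0
  doubled <- digits[second] * 2
  # Summing the two digits of a doubled value above 9 is the same as subtracting 9
  doubled <- ifelse(doubled > 9, doubled - 9, doubled)

  # The number passes if the grand total is a multiple of 10
  (sum(digits[!second]) + sum(doubled)) %% 10 == 0
}

luhn_is_valid("4111111111111111")  # TRUE, a commonly used test card number
luhn_is_valid("4111111111111112")  # FALSE, last digit altered
```

The shape is the same as the NHS number check: weight the digits, add them up and test the total against a modulus. Only the weights and the modulus differ.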

Of course, in most cases the algorithms are well known, and there’s nothing to stop people generating fake numbers that pass checksum validation. However, this early-stage checksum validation can be used to flag typos and transcription errors in a number, or to weed out obvious chancers.

Once I had an implementation figured out, I wrapped it up into an R package, dropped it on GitHub, tweeted about it…

…and then promptly forgot about it.

Over the years I’ve written a lot of super-niche and one-off R packages, principally for my own entertainment, and this felt like another one of those.

Cut to a year later and I’m working for RStudio, helping out a little at Data Orchard and spending a lot of time thinking about the data science community and ways to give back to a community that’s always been so generous to me. It was at this point that I decided to make the effort to publish the package to CRAN.

Getting your first package on CRAN can be a nerve-racking experience, but for me it was a fairly smooth process, with only small changes required by the CRAN team before it could be published.

When you publish any package, you never really know if anyone will use it. You’re pushing your work out into the world to see if it can survive on its own. After nhsnumber was published, I would occasionally check its stats and see low but consistent numbers of downloads. A couple of times since it was published, actual users have reached out to say thanks or report a bug.

As a stats-first language rather than a general-purpose one, some might say that R is itself somewhat niche (though a pretty large niche, it must be said!). Add to that a package that only makes sense in one geographic region, and is also specific to those working within and alongside one organisation within that region, and we’re knee-deep in niches! Clearly this sort of package is never going to be applicable to all R users.

But none of that means a package isn’t valuable. If only one user benefits from its existence, I’d consider that a success. If your work can serve a community, however small, and help improve their work in some way, I consider that a win. So if, like me, you have ideas for R packages but consider them too niche to be worth the time, I’d encourage you to share them in some way anyway. You’ll learn a lot, maybe have a bit of fun thinking about how best to organise and present your package and may, just may, improve someone else’s life along the way.

Happy coding everyone!

Mark

PS. Because of the interest shown in the nhsnumber R package by the NHS-R community, I decided it would be fun to port the package to Python too.