My workflow automation journey: Discovering Shake (Haskell)
GitHub repository: https://github.com/nevrome/ShakeExperiment
Workflow management, i.e. software to organize and run data analysis scripts, is one of those fortunate domains where dozens of open source solutions compete for our attention. There’s probably something for every taste (see e.g. the extensive list here), and many of these projects are actively maintained or at least comparatively easy to resurrect. This post describes my personal search for a tool that fits me, in the hope of motivating you to go searching as well.
My user story
TL;DR: I ended up with Shake, because it is written in Haskell, fairly simple to use and capable of doing exactly what I need. You can skip the story of how I arrived at this conclusion and scroll down to the juicy coding bits in Using Shake.
My PhD research is located somewhere between Bioinformatics and Archaeoinformatics (yep — that’s a thing), and I work with large, high-dimensional datasets. Not really Big Data, but big enough to require a high performance computing environment to run analyses in reasonable time. Space, time and (ancient) DNA meet in my data, so my code necessarily relies on a variety of software libraries from different domains. In the last two years I piled scripts on top of scripts and thus created a complex network of interlinked code for data preparation, analysis and visualization.
This is my personal user story. It eventually brought me to a point where I realized that I had to introduce a more sophisticated system for dependency management and workflow automation. The former is especially important for reproducibility, the latter for propagating changes, so that derived data products and plots always stay up to date. I needed a system that defines, runs and monitors a pipeline of code across different interacting scripts.
As I share these challenges with a large number of people who work professionally with computers, there are many excellent solutions for exactly this out there. I just had to pick what fits me, my tasks and my interests. So I followed my gut feeling and ended up with the containerization solutions docker and singularity to encapsulate my development environment (which will only be mentioned in passing here), and the build system Shake to orchestrate my analysis pipeline.
Why Shake, of all things?
The first options I considered for pipeline management were Nextflow and Snakemake. Both are very popular among my colleagues in bioinformatics; at our department there seems to be an even divide between strong fans of the former and the latter. Personally, though, I wanted to deal with neither Groovy nor Python, which Nextflow and Snakemake respectively use as their underlying configuration language. Ideally I wanted to write the pipeline definition in a language and framework I’m already familiar with. That’s not (only) laziness: by working in either R or Haskell, the languages I feel most comfortable with, I could more easily leverage their power.
So I then gave some scrutiny to targets, an implementation of a pipelining tool in R. This might have worked for me, but it gave me the impression of being too focused on workflows within R. R is certainly an important component of my personal tech stack right now, but I wanted to be prepared for whatever the future might bring. I also — and that’s very shallow — didn’t like targets’ syntax from what I saw in the example code, where every computation in a pipeline gets crammed into a single list object.
At this point I realized I would really like to solve this in Haskell, as the language had become something of a personal passion anyway. A functional, strongly typed language should also — at least in theory — be a good fit to formalize build rules. I did some research and came across three Haskell tools that seemed to offer workflow management: Funflow, Porcupine and Bioshake. Instead of diving into them one after the other, I took a step back and asked the excellent Haskell community on reddit for advice: Experiences with workflow managers implemented in Haskell (funflow, porcupine, bioshake, ?)
Fortunately Justin Bedő, the author of Bioshake, saw the post and gave me some insights into his implementation. At the time he had already moved one step further and discontinued the development of Bioshake in favour of his new solution BioNix, which solves both (!) dependency and workflow management with the fascinating Nix infrastructure. As Nix is a big world of its own, I couldn’t follow him there. So I instead gave the Bioshake documentation a good read. There I realized that Bioshake heavily relies on Shake internally: understanding Shake seemed to be a prerequisite for figuring out Bioshake. And Shake alone already turned out to be powerful and flexible enough for my current needs!
I had reached the end of my software exploration journey.
Your journey towards a workflow management solution will certainly be different, and you will most likely reach different conclusions. But I encourage you to explore this realm if you think you share a user story similar to mine. Keep reading if you want to see how I configured Shake to help me with my challenges.
Using Shake
Shake is a build system like make, i.e. software to organize the compilation of large software projects. That’s why its manual focuses entirely on building C code. In my perception, though, building software and managing a data analysis pipeline are very similar tasks: in the end you want to run every script necessary to get a certain product, and it does not matter much whether that product is a set of cross-compiled executables or a set of plots.
The Shake homepage does a good job of listing the advantages Shake has over its competitors. Here are three aspects I find particularly appealing:
- “Pull-based”: Shake starts from the desired end product and figures out which scripts it has to run to reach that result. If I modify a script, it only rebuilds everything that depends on it downstream.
- Fast and parallel: Compiling and running the massive, 600-line Shakefile I need for my current main project feels fast and responsive. It’s incredibly satisfying to see Shake plow through independent scripts in parallel.
- Configurable: Shake is a library with a simple interface, extensive documentation and useful configuration options. It boils down to idiomatic Haskell code, fully adjustable to your needs.
To illustrate how it works, I want to present a basic example in the following section (Code on GitHub).
A simple Shakefile
Let’s imagine a workflow like this:
raw_input.csv --> A.R -
                        \
                          -> C.R --> 3D.png
                        /
                  B.R -
We have three .R scripts: A, B and C. A requires an input .csv file, B is independent of A, and C requires the intermediate output of both A and B to produce our desired final output 3D.png.
In our file system this looks as follows:
.
├── input
│ └── raw_input.csv
└── scripts
├── A.R
├── B.R
└── C.R
Now let’s add a “Shakefile”, i.e. a script that expresses our tiny pipeline with Shake. This boils down to a Haskell script with a main method, which describes the interaction of these files in a way Shake can parse and understand.
In my opinion the easiest way to run an independent Haskell script is via the Stack script interpreter. So if we have stack installed on our system, we can create a new script file Shakefile.hs and add these two lines at the top:
#!/usr/bin/env stack
-- stack --resolver lts-18.7 script --package shake
If we later run our script with ./Shakefile.hs, stack will automatically download and prepare the necessary dependencies: the Glasgow Haskell Compiler and the Shake package. That allows us to import modules with functions and data types from Shake.
import Development.Shake
import Development.Shake.Command
import Development.Shake.FilePath
Finally we can define our main method like this:
main :: IO ()
main = shake shakeOptions {shakeFiles = "_build"} $ do

    want [ "output" </> "3D.png" ]

    "output" </> "3D.png" %> \out -> do
        let script = "scripts" </> "C.R"
            dataFiles = [
                "intermediate" </> "dens_surface.RData",
                "intermediate" </> "colours.RData"
                ]
        need $ script : dataFiles
        cmd_ "Rscript" script

    "intermediate" </> "dens_surface.RData" %> \out -> do
        let script = "scripts" </> "A.R"
            dataFiles = [ "input" </> "raw_input.csv" ]
        need $ script : dataFiles
        cmd_ "Rscript" script

    "intermediate" </> "colours.RData" %> \out -> do
        let script = "scripts" </> "B.R"
        need [ script ]
        cmd_ "Rscript" script
I don’t want to get lost in the intricate details of Haskell and the Shake interface here, so it shall be enough to say that the function

shake :: ShakeOptions -> Rules () -> IO ()

called at the very beginning of the main method takes a configuration type ShakeOptions and a set of rules — which can be written with the Monad instance and do-notation — and evaluates them and the actions within them in a meaningful order.
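Since Rules () is a monad, the individual rules are just statements in this do block, and utility rules can live right next to the file rules. A common pattern from the Shake documentation is a phony "clean" rule. Here is a minimal sketch (not part of the example Shakefile above, assuming the same imports); a single cleanRule statement added to the do block in main would register it:

cleanRule :: Rules ()
cleanRule = phony "clean" $ do
    -- a phony rule produces no output file; it simply runs its action
    -- whenever the target "clean" is requested
    putInfo "Deleting generated files"
    removeFilesAfter "_build" ["//*"]
    removeFilesAfter "intermediate" ["//*"]
    removeFilesAfter "output" ["//*"]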
Coming back to our pipeline, this is what one of its rules looks like:
"intermediate" </> "dens_surface.RData" %> \out -> do
let script = "scripts" </> "A.R"
dataFiles = [ "input" </> "raw_input.csv" ]
need $ script : dataFiles
cmd_ "Rscript" script
Each rule has output files (here: dens_surface.RData in the directory intermediate) and requires input files (here: the script A.R and input/raw_input.csv). It finally also has some mechanism that connects input and output, for example a command to run a specific script that takes the input and yields the output (here: cmd_ "Rscript" script).
In a Shakefile you write all the rules necessary to fully represent your pipeline. The rest is pure magic: Shake runs all scripts in the right order, creates missing directories and carefully keeps track of the state of each input and output file.
$ ./Shakefile1.hs
# Rscript (for intermediate/colours.RData)
# Rscript (for intermediate/dens_surface.RData)
# Rscript (for output/3D.png)
After running our toy example, our directory will look like this, now full of output files:
.
├── _build
├── input
│ └── raw_input.csv
├── intermediate
│ ├── colours.RData
│ └── dens_surface.RData
├── output
│ └── 3D.png
├── scripts
│ ├── A.R
│ ├── B.R
│ └── C.R
└── Shakefile1.hs
_build is where Shake stores its knowledge and puts intermediate files for itself. If you work with Git, you should certainly add it to your .gitignore file, along with the intermediate and output directories created by the pipeline.
As a small experiment, and to test Shake’s power, we can edit one of the scripts. B.R only produces a colour vector to be used in the plotting function in C.R, so it’s an easy target for modification. And indeed: if we edit one of the colours there and run our Shakefile again, it only reruns B and C, producing a new, nifty 3D.png. Brilliant!
$ ./Shakefile1.hs
# Rscript (for intermediate/colours.RData)
# Rscript (for output/3D.png)
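One more thing worth knowing at this point: the example calls shake directly, so every run builds the targets listed under want. Shake also provides a wrapper shakeArgs with the same shape, which additionally reads flags and targets from the command line, so individual files can be requested explicitly. A minimal sketch of how the main method above could look with it (the rules themselves stay unchanged):

main :: IO ()
main = shakeArgs shakeOptions {shakeFiles = "_build"} $ do
    -- with shakeArgs, running e.g. ./Shakefile1.hs intermediate/colours.RData
    -- builds only that file; without arguments the want'ed target is built
    want [ "output" </> "3D.png" ]
    -- ... the three rules from above ...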
Adjustments for my needs and convenience
Our very simple Shake script is already fulfilling its basic purpose. The pipeline is fully defined and runs when we execute the Shakefile.
But some more advanced elements I personally need for my actual workflows are still missing (e.g. support for singularity and our in-house HPC system). Shake itself also has some neat configuration options to explore. And finally, the versatility of Haskell should allow the core pipeline mechanics to be rewritten in shorter and clearer syntax. So: we have some room for improvement, and I wanted to dive deeper into that.
Here’s a refactored version of the script above:
#!/usr/bin/env stack
-- stack --resolver lts-18.7 script --package shake

import Development.Shake
import Development.Shake.Command
import Development.Shake.FilePath

data Settings = Settings {
      singularityContainer :: FilePath
    , bindPath :: String
    , qsubCommand :: String
    }

mpiEVAClusterSettings = Settings {
      singularityContainer = "singularity_experiment.sif"
    , bindPath = "--bind=/mnt/archgen/users/schmid"
    , qsubCommand = "qsub -sync y -b y -cwd -q archgen.q \
                    \-pe smp 1 -l h_vmem=10G -now n -V -j y \
                    \-o ~/log -N example"
    }

relevantRunCommand :: Settings -> FilePath -> Action ()
relevantRunCommand (Settings singularityContainer bindPath qsubCommand) x
    | takeExtension x == ".R"  = cmd_ qsubCommand
        "singularity" "exec" bindPath singularityContainer "Rscript" x
    | takeExtension x == ".sh" = cmd_ qsubCommand
        "singularity" "exec" bindPath singularityContainer x

infixl 8 %$
(%$) :: FilePath -> ([FilePath], [FilePath]) -> Rules ()
(%$) script (inFiles, outFiles) =
    let settings = mpiEVAClusterSettings
    in  outFiles &%> \out -> do
            need $ [script, singularityContainer settings] ++ inFiles
            relevantRunCommand settings script

infixl 9 -->
(-->) :: a -> b -> (a,b)
(-->) x y = (x,y)

input        x = "input"        </> x
intermediate x = "intermediate" </> x
scripts      x = "scripts"      </> x
output       x = "output"       </> x

main :: IO ()
main = shake shakeOptions {
      shakeFiles = "_build"
    , shakeProgress = progressSimple
    , shakeColor = True
    , shakeVerbosity = Verbose
    , shakeThreads = 3
    , shakeTimings = True
    } $ do

    want [output "3D.png"]

    scripts "A.R" %$
        [input "raw_input.csv"] --> [intermediate "dens_surface.RData"]
    scripts "B.R" %$
        [ ] --> [intermediate "colours.RData"]
    scripts "C.R" %$
        map intermediate ["dens_surface.RData", "colours.RData"] -->
            [output "3D.png"]
There’s plenty to unpack here. So let’s pull it apart, starting with the new files I added to the simple setup above.
.
├── input
│ └── raw_input.csv
├── scripts
│ ├── A.R
│ ├── B.R
│ └── C.R
├── Shakefile2.hs
├── singularity_build_sif.sh
├── singularity_experiment.def
└── singularity_experiment.sif
Specifically for Singularity I added three files: singularity_build_sif.sh is a bash script to build the singularity image file singularity_experiment.sif as defined in singularity_experiment.def:
Bootstrap: docker
From: rocker/r-base:4.1.0

%post
    # install the necessary R packages
    R --slave -e 'install.packages("MASS")'
This simple configuration file describes a reproducible, self-sufficient computational environment with R v4.1.0 and only one additional R package (MASS). Singularity is very well integrated with docker — here I build directly on top of a rocker image. As I don’t want to get lost in singularity here, I’ll leave it at that, and instead jump right into the new Shakefile.
Rules that don’t hurt the eyes
I think the build rule creation syntax in Shake is an eyesore — as you can see in the first Shakefile above. For my new Shakefile I therefore wrote a wrapper that expresses rules more clearly.
Let’s start with the new operator %$, which encapsulates Shake’s %>:
(%$) :: FilePath -> ([FilePath], [FilePath]) -> Rules ()
(%$) script (inFiles, outFiles) =
    let settings = mpiEVAClusterSettings
    in  outFiles &%> \out -> do
            need $ [script, singularityContainer settings] ++ inFiles
            relevantRunCommand settings script
It allows us to write rules in an — in my opinion — much more idiomatic way:
script %$ ([input files], [output files])
The tuple ([],[]) used to express input and output files in the second argument still feels a bit awkward, so I added an operator --> to express tuple creation more neatly. Using an arrow for that of course only makes sense in the pipeline context we’re covering here. To make sure that the two new operators are actually evaluated in the correct order, we have to set their fixity manually.
(-->) :: a -> b -> (a,b)
(-->) x y = (x,y)

infixl 8 %$
infixl 9 -->
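The fixities are what make the combined syntax parse as intended: --> binds more tightly than %$, so the two file lists are first combined into a tuple, which is then handed to the rule wrapper. A small sketch, using the path helpers from the Shakefile above:

-- with infixl 9 (-->) and infixl 8 (%$), this rule ...
exampleRule :: Rules ()
exampleRule =
    scripts "A.R" %$ [input "raw_input.csv"] --> [intermediate "dens_surface.RData"]
-- ... is read as
--   scripts "A.R" %$ ([input "raw_input.csv"] --> [intermediate "dens_surface.RData"])
-- With the default fixity (infixl 9 for both operators) it would instead be read as
--   (scripts "A.R" %$ [input "raw_input.csv"]) --> [intermediate "dens_surface.RData"]
-- which does not typecheck, because (%$) expects the input/output tuple as its second argument.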
That boils rule creation down to some wonderful syntax:
script %$ [input files] --> [output files]
The horrible
"intermediate" </> "colours.RData" %> \out -> do
let script = "scripts" </> "B.R"
need [ script ]
cmd_ "Rscript" script
becomes a much more pleasant
scripts "B.R" %$ [ ] --> [intermediate "colours.RData"]
Custom run commands and environments
Now that the rules look nicer, we can turn towards the system environment. As described above, I have pretty specific requirements for how exactly my scripts should be run: through our high performance computing setup and through a singularity container.
HPC runs Singularity runs Rscript runs my scripts
To express this, I added the function relevantRunCommand, which does just that: compiling a relevant run command — here depending on the file extension of the respective script.
relevantRunCommand :: Settings -> FilePath -> Action ()
relevantRunCommand (Settings singularityContainer bindPath qsubCommand) x
    | takeExtension x == ".R"  = cmd_ qsubCommand
        "singularity" "exec" bindPath singularityContainer "Rscript" x
    | takeExtension x == ".sh" = cmd_ qsubCommand
        "singularity" "exec" bindPath singularityContainer x
This function also requires the configuration type Settings, which serves to make relevantRunCommand somewhat flexible. It stores highly variable configuration like the path to the singularity container, which directories should be mapped into the container via bind mounts, and how exactly the scripts should be submitted to run on the HPC cluster. The example here is simplified, but true to the real setup I typically use:
data Settings = Settings {
      singularityContainer :: FilePath
    , bindPath :: String
    , qsubCommand :: String
    }

mpiEVAClusterSettings = Settings {
      singularityContainer = "singularity_experiment.sif"
    , bindPath = "--bind=/mnt/archgen/users/schmid"
    , qsubCommand = "qsub -sync y -b y -cwd -q archgen.q \
                    \-pe smp 1 -l h_vmem=10G -now n -V -j y \
                    \-o ~/log -N example"
    }
For my real production code, the settings data type is a bit more complex and features additional elements — for example different cluster submission commands for different computing power requirements.
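Just to illustrate the direction (the names and resource values here are invented for this sketch and are not my actual production code): one way to support multiple submission profiles is to derive the qsub call from a small job-size type instead of storing a single string.

-- hypothetical sketch of an extended settings type with per-job-size submission commands
data JobSize = SmallJob | BigMemJob

data SettingsMultiQueue = SettingsMultiQueue {
      containerPath :: FilePath
    , mounts        :: String
    , qsubFor       :: JobSize -> String  -- submission command per job size
    }

exampleSettings :: SettingsMultiQueue
exampleSettings = SettingsMultiQueue {
      containerPath = "singularity_experiment.sif"
    , mounts        = "--bind=/mnt/archgen/users/schmid"
    , qsubFor       = \size -> case size of
        -- placeholder resource requests, to be adjusted to the real cluster setup
        SmallJob  -> "qsub -sync y -b y -cwd -q archgen.q -l h_vmem=10G -N small"
        BigMemJob -> "qsub -sync y -b y -cwd -q archgen.q -l h_vmem=64G -N bigmem"
    }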
You see that the building of the singularity image itself is not part of the pipeline. Building it requires sudo permissions, and — more fundamentally — building it every time would undermine reproducibility: the recipe in the .def file requires multiple different online servers to be available and to always provide specific versions of certain software dependencies. In a way, the singularity image should be considered a stable input data file, not something to be produced on the fly.
This approach to environment management and configuration is bare-bones. I like the flexibility that comes with it, but I also see the appeal of a higher level of abstraction as provided by e.g. nextflow’s executors.
Shake options
Shake itself comes with a number of easily configurable options how it should run. They are set in the record type shakeOptions
, as described here. These are the ones I modified for this example:
shakeOptions {
      shakeFiles = "_build"
    , shakeThreads = 3
    , shakeChange = ChangeModtime
    , shakeProgress = progressSimple
    , shakeColor = True
    , shakeVerbosity = Verbose
    , shakeTimings = True
    }
- shakeFiles: The directory used for storing Shake metadata files. We already used that option above.
- shakeThreads: The maximum number of rules to run in parallel. In our pipeline there are only three rules, and one depends on the two others, so three is literally more than enough for maximum speed.
- shakeChange: How Shake determines whether a file has changed. The data type Change has multiple constructors, including the default ChangeModtime, which invalidates files based on timestamps, and alternatively ChangeDigest, which does so via checksums (see the sketch after this list).
- shakeProgress: How progress should be reported while the pipeline is running. progressSimple is a basic default, but there is an entire data type Progress to specify configuration options.
- shakeColor: Whether to colorize the command line output.
- shakeVerbosity: How verbose the command line output should be. A data type Verbosity controls the different possible levels.
- shakeTimings: Print timing information for each stage at the end.
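If, for example, timestamp-based invalidation triggers too many rebuilds (say, because files get touched without actually changing), the change detection can be switched to checksums. A minimal sketch, reusing the options from above:

-- rebuild decisions based on file content digests instead of modification times
digestOptions :: ShakeOptions
digestOptions = shakeOptions {
      shakeFiles   = "_build"
    , shakeChange  = ChangeDigest
    , shakeThreads = 3
    }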
There is more to discover among these options, and beyond them in the other mechanisms Shake provides. Fortunately the library is quite extensively documented.
Conclusion
Thanks for bearing with me until here. I wrote this post partly to document my decision process in this matter, but also to bring across one major and two minor points:
- Workflow managers are useful even for small projects. Check if a tool like Nextflow, Snakemake or targets (or whatever you prefer!) can make your daily work easier, faster and more reproducible. I find it a relief to be sure that all my plots represent the latest state of the work in every script.
- Shake is a powerful tool, if you know some Haskell. It’s flexible, very well written and elaborately documented.
- Haskell is a beautiful language to express logic in a concise, yet clear way. Its custom operators can reduce repetitive code to a minimum.
Acknowledgements: I got some valuable feedback from my colleague Alexander Hübner (@alexhbnr) for this post.