From scripts to pipelines with targets


https://papsti.github.io/talks/2023-10-19_targets.html

Irena Papst
Senior Scientist
Public Health Agency of Canada

Pipelining is the process of writing down a recipe for outputs where all dependencies are stated explicitly.

Isn’t that just a script??

No!

“Disease X is picking up, can you forecast it for the next few weeks?”

A simple script

library(readr); library(dplyr); library(ggplot2)

# read & clean data
data = (read_csv("data/case-counts.csv")
   |> filter(date >= "2023-01-01")
)

# simulate
sim = make_forecast(data)

# plot results
plot_forecast(sim)

“Great! Now can you add multiple forecast scenarios?”

A slightly-more-complicated script

library(readr); library(dplyr); library(ggplot2)

# read & clean data
data = (read_csv("data/case-counts.csv")
   |> filter(date >= "2023-01-01")
)

# specify parameters
for(scenario in c("A", "B", "C")){
  # simulate
  sim = make_forecast(
    data, 
    scenario = scenario
  )
  # plot results
  plot_forecast(sim)
}

Parallelization to the rescue?

library(readr); library(dplyr); library(ggplot2)
library(doParallel); cl <- makeCluster(4); registerDoParallel(cl)

# read & clean data
data = (read_csv("data/case-counts.csv")
   |> filter(date >= "2023-01-01")
)

# specify parameters
foreach(scenario = c("A", "B", "C"),
        .packages = c("dplyr", "ggplot2")) %dopar% { # workers need packages loaded explicitly
  # simulate
  sim = make_forecast(
    data, 
    scenario = scenario
  )
  # plot results
  plot_forecast(sim)
}

What if…

  • I only want to re-run certain scenarios?
  • I’m running into errors for one scenario and need to debug?
  • I step away for a meeting and forget which scenario results are up-to-date?

That’s a lot to manage!

Enter targets

“The targets package is a Make-like pipeline tool for statistics and data science in R.”

the targets user manual

Just use Make!!!

I don’t wanna!

In a government and/or corporate setting, an R package can be easier

  • to install
  • to set up for colleagues (thanks to renv <3)
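A minimal sketch of the renv workflow mentioned above (these are standard renv calls; the project is whatever directory your pipeline lives in):

```r
# one-time project setup: create a project library and lockfile
install.packages("renv")
renv::init()

# after installing or updating packages, record exact versions
renv::snapshot()   # writes renv.lock

# a colleague cloning the project recreates the same library
renv::restore()    # reads renv.lock
```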

Show & tell

library(readr); library(dplyr); library(ggplot2)

# read & clean data
data = (read_csv("data/case-counts.csv")
   |> filter(date >= "2023-01-01")
)

# simulate
sim = make_forecast(data)

# plot results
plot_forecast(sim)

Show & tell

Before

# read & clean data
data = (read_csv("data/case-counts.csv")
   |> filter(date >= "2023-01-01")
)

After

# as a target
tar_target(
  # name of the target
  data,
  # recipe for the target
  (read_csv("data/case-counts.csv")
   |> filter(date >= "2023-01-01")
  )
)

Make a _targets.R file

library(targets)
use_targets()

From script to pipeline

library(targets)
source("R/make_forecast.R"); source("R/plot_forecast.R")
tar_option_set(packages = c("readr", "dplyr", "ggplot2", "lubridate"))

# pipeline
list(
  # read & clean data
  tar_target(
    data,
    (read_csv("data/case-counts.csv")
      |> filter(date >= "2023-01-01")
    )
  ),
  # simulate
  tar_target(
    sim,
    make_forecast(data)
  ),
  # plot results
  tar_target(
    plot,
    plot_forecast(sim)
  )
)

The free lunch

Easy to make

tar_make(plot)
▶ start target data
● built target data [0.272 seconds]
▶ start target sim
● built target sim [0.003 seconds]
▶ start target plot
● built target plot [0.007 seconds]
▶ end pipeline [0.357 seconds]

Easy to read

tar_read(plot)  # return the target's stored value
tar_load(plot)  # assign the value to `plot` in the current environment

Easy to visualize the pipeline

tar_visnetwork()

Only make what you need

tar_visnetwork()
tar_make()
✔ skip target data
✔ skip target sim
▶ start target plot
● built target plot [0.009 seconds]
▶ end pipeline [0.245 seconds]

Only make what you need

tar_visnetwork()

And so much more!

  • Branching: easily repeat sections of the pipeline
  • Distributed computing: parallelize easily with crew
  • tarchetypes: target archetypes for common tasks
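As a hedged sketch of the crew integration: targets can dispatch work to a crew controller set via tar_option_set(). This assumes a local two-worker setup; cluster and cloud controllers follow the same pattern.

```r
# in _targets.R: let targets send targets to crew workers
library(targets)
library(crew)

tar_option_set(
  packages = c("readr", "dplyr", "ggplot2"),
  controller = crew_controller_local(workers = 2)  # two local R processes
)
```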

Learn from my mistakes

Not everything has to be a target

  • Start with broad strokes, drill down into smaller pieces as needed
    • Larger targets = bigger speed-ups when skipped
    • Smaller targets = more control over what is skipped and what isn’t
  • Ask yourself:
    • Might this target get updated independently of other targets?
    • Is this something I’ll want to look at down the line?

Don’t be afraid to run the pipeline

tar_make(sim)
✔ skip target data
▶ start target sim
✖ error target sim
▶ end pipeline [0.278 seconds]
Error:
! Error running targets::tar_make()
  Error messages: targets::tar_meta(fields = error, complete_only = TRUE)
  Debugging guide: https://books.ropensci.org/targets/debugging.html
  How to ask for help: https://books.ropensci.org/targets/help.html
  Last error: object 'x' not found

Don’t do this!

source("R/make_forecast.R")
debug(make_forecast)
tar_load(data)
make_forecast(data)

Do this

# insert browser() statement 
# into make_forecast()
tar_make(sim,
         callr_function = NULL)

Be explicit

Don’t do this!

tar_target(
  data,
  (read_csv("data/case-counts.csv")
   |> filter(date >= "2023-01-01")
  )
)

Do this

tar_target(
  data.file,
  "data/case-counts.csv",
  format = "file"
) # track the file!!!
tar_target(
  data,
  (read_csv(data.file)
   |> filter(date >= "2023-01-01")
  )
)

Dynamic branching > static branching

  • Static branching
    • Targets generated before pipeline is run
    • Clearly named targets generated (sim_ON)
    • More annoying to aggregate
  • Dynamic branching
    • Targets generated at run-time
    • Cryptic names (sim_3e0e0255)
    • Automagic

Final thoughts

You don’t always need a pipeline…

  • Short, one-off task? Maybe write a script
  • Multi-step process that you’ll run again or hand off to a colleague? Maybe write a pipeline

…but you can still set yourself up for success

  • Adopt a file structure compatible with targets pipelines (and R packages)
    • Define functions in R/
  • Document functions as if you’re going to package them
    • Use roxygen
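For example, documenting the forecasting helper with roxygen (the signature here is hypothetical):

```r
#' Forecast case counts
#'
#' @param data Cleaned case-count data, as produced by the `data` target.
#' @param scenario Scenario label, e.g. "A", "B", or "C".
#'
#' @return A forecast object that `plot_forecast()` can plot.
make_forecast <- function(data, scenario = "A") {
  # ... fit model and simulate forward ...
}
```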

Getting started

Thank you!

https://papsti.github.io/talks/2023-10-19_targets.html