From scripts to pipelines with targets


https://papsti.github.io/talks/2023-10-19_targets.html

Irena Papst
Senior Scientist
Public Health Agency of Canada

Pipelining is the process of writing down a recipe for outputs where all dependencies are stated explicitly.

Isn’t that just a script??

No!

“Disease X is picking up, can you forecast it for the next few weeks?”

A simple script

library(readr); library(dplyr); library(ggplot2)

# read & clean data
data = (read_csv("data/case-counts.csv")
   |> filter(date >= "2023-01-01")
)

# simulate
sim = make_forecast(data)

# plot results
plot_forecast(sim)

“Great! Now can you add multiple forecast scenarios?”

A slightly-more-complicated script

library(readr); library(dplyr); library(ggplot2)

# read & clean data
data = (read_csv("data/case-counts.csv")
   |> filter(date >= "2023-01-01")
)

# specify parameters
for(scenario in c("A", "B", "C")){
  # simulate
  sim = make_forecast(
    data, 
    scenario = scenario
  )
  # plot results
  plot_forecast(sim)
}

Parallelization to the rescue?

library(readr); library(dplyr); library(ggplot2)
library(doParallel); cl <- makeCluster(4); registerDoParallel(cl)

# read & clean data
data = (read_csv("data/case-counts.csv")
   |> filter(date >= "2023-01-01")
)

# specify parameters
foreach(scenario = c("A", "B", "C"),
        .packages = c("dplyr", "ggplot2")) %dopar% { # workers need packages loaded explicitly
  # simulate
  sim = make_forecast(
    data, 
    scenario = scenario
  )
  # plot results
  plot_forecast(sim)
}

What if…

  • I only want to re-run certain scenarios?
  • I’m running into errors for one scenario and need to debug?
  • I step away for a meeting and forget which scenario results are up-to-date?

That’s a lot to manage!

Enter targets

“The targets package is a Make-like pipeline tool for statistics and data science in R.”

the targets user manual

Just use Make!!!

I don’t wanna!

In a government and/or corporate setting, an R package can be easier

  • to install
  • to set up for colleagues (thanks to renv <3)
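A minimal sketch of the renv workflow mentioned above (these are standard renv calls; the project is whatever directory your pipeline lives in):

```r
# one-time project setup: create a project library and lockfile
install.packages("renv")
renv::init()

# after installing or updating packages, record exact versions
renv::snapshot()   # writes renv.lock

# a colleague cloning the project recreates the same library
renv::restore()    # reads renv.lock
```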

Show & tell

library(readr); library(dplyr); library(ggplot2)

# read & clean data
data = (read_csv("data/case-counts.csv")
   |> filter(date >= "2023-01-01")
)

# simulate
sim = make_forecast(data)

# plot results
plot_forecast(sim)

Show & tell

Before

# read & clean data
data = (read_csv("data/case-counts.csv")
   |> filter(date >= "2023-01-01")
)

After

# as a target
tar_target(
  # name of the target
  data,
  # recipe for the target
  (read_csv("data/case-counts.csv")
   |> filter(date >= "2023-01-01")
  )
)

Make a _targets.R file

library(targets)
use_targets()

From script to pipeline

library(targets)
source("R/make_forecast.R"); source("R/plot_forecast.R")
tar_option_set(packages = c("readr", "dplyr", "ggplot2", "lubridate"))

# pipeline
list(
  # read & clean data
  tar_target(
    data,
    (read_csv("data/case-counts.csv")
      |> filter(date >= "2023-01-01")
    )
  ),
  # simulate
  tar_target(
    sim,
    make_forecast(data)
  ),
  # plot results
  tar_target(
    plot,
    plot_forecast(sim)
  )
)

The free lunch

Easy to make

tar_make(plot)
▶ start target data
● built target data [0.272 seconds]
▶ start target sim
● built target sim [0.003 seconds]
▶ start target plot
● built target plot [0.007 seconds]
▶ end pipeline [0.357 seconds]

Easy to read

tar_read(plot)  # return the target's stored value
tar_load(plot)  # assign the value to `plot` in the current environment

Easy to visualize the pipeline

tar_visnetwork()

Only make what you need

tar_visnetwork()
tar_make()
✔ skip target data
✔ skip target sim
▶ start target plot
● built target plot [0.009 seconds]
▶ end pipeline [0.245 seconds]

Only make what you need

tar_visnetwork()

And so much more!

  • Branching: easily repeat sections of the pipeline
  • Distributed computing: parallelize easily with crew
  • tarchetypes: target archetypes for common tasks
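As a hedged sketch of the crew integration: targets can dispatch work to a crew controller set via tar_option_set(). This assumes a local two-worker setup; cluster and cloud controllers follow the same pattern.

```r
# in _targets.R: let targets send targets to crew workers
library(targets)
library(crew)

tar_option_set(
  packages = c("readr", "dplyr", "ggplot2"),
  controller = crew_controller_local(workers = 2)  # two local R processes
)
```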

Learn from my mistakes

Not everything has to be a target

  • Start with broad strokes, drill down into smaller pieces as needed
    • Larger targets = bigger speed-ups when skipped
    • Smaller targets = more control over what is skipped and what isn’t
  • Ask yourself:
    • Might this target get updated independently of other targets?
    • Is this something I’ll want to look at down the line?

Don’t be afraid to run the pipeline

tar_make(sim)
✔ skip target data
▶ start target sim
✖ error target sim
▶ end pipeline [0.278 seconds]
Error:
! Error running targets::tar_make()
  Error messages: targets::tar_meta(fields = error, complete_only = TRUE)
  Debugging guide: https://books.ropensci.org/targets/debugging.html
  How to ask for help: https://books.ropensci.org/targets/help.html
  Last error: object 'x' not found

Don’t do this!

source("R/make_forecast.R")
debug(make_forecast)
tar_load(data)
make_forecast(data)

Do this

# insert browser() statement 
# into make_forecast()
tar_make(sim,
         callr_function = NULL)

Be explicit

Don’t do this!

tar_target(
  data,
  (read_csv("data/case-counts.csv")
   |> filter(date >= "2023-01-01")
  )
)

Do this

tar_target(
  data.file,
  "data/case-counts.csv",
  format = "file"
) # track the file!!!
tar_target(
  data,
  (read_csv(data.file)
   |> filter(date >= "2023-01-01")
  )
)

Dynamic branching > static branching

  • Static branching
    • Targets generated before pipeline is run
    • Clearly named targets generated (sim_ON)
    • More annoying to aggregate
  • Dynamic branching
    • Targets generated at run-time
    • Cryptic names (sim_3e0e0255)
    • Automagic

Final thoughts

You don’t always need a pipeline…

  • Short, one-off task? Maybe write a script
  • Multi-step process that you’ll run again or hand off to a colleague? Maybe write a pipeline

…but you can still set yourself up for success

  • Adopt a file structure compatible with targets pipelines (and R packages)
    • Define functions in R/
  • Document functions as if you’re going to package them
    • Use roxygen
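For example, documenting the forecasting helper with roxygen (the signature here is hypothetical):

```r
#' Forecast case counts
#'
#' @param data Cleaned case-count data, as produced by the `data` target.
#' @param scenario Scenario label, e.g. "A", "B", or "C".
#'
#' @return A forecast object that `plot_forecast()` can plot.
make_forecast <- function(data, scenario = "A") {
  # ... fit model and simulate forward ...
}
```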

Getting started

Thank you!

https://papsti.github.io/talks/2023-10-19_targets.html