Manage your analyses with ease using {targets}


https://papsti.github.io/talks/2025-01-22_targets.html

Irena Papst
Scientist, infectious disease modelling
Public Health Agency of Canada

The ask

  • “Can you combine these datasets for the team?”
  • “Can you make some plots for our Tuesday morning slide deck?”
  • “Can you make the tables and figures for our manuscript?”
  • “Can you forecast disease X for the next few weeks?”

What do these tasks have in common?

  • Moderate complexity
    • Several inputs coming together
    • Several outputs need to be produced
  • May need to be repeated or reproduced (recurring work, or for accountability)

“Can you forecast disease X for the next few weeks?”

A simple script

library(readr); library(dplyr); library(ggplot2)

# read data
data <- read_data("data/case-counts.csv")

# simulate
sim <- make_forecast(data)

# plot results
plot_forecast(sim)

“Great! But what if…”

A slightly-more-complicated script

library(readr); library(dplyr); library(ggplot2)

# read data
data <- read_data("data/case-counts.csv")

# loop over scenario parameters
for(scenario in c("A", "B", "C")){
  # simulate
  sim <- make_forecast(
    data, 
    scenario = scenario
  )
  # plot results
  plot_forecast(sim)
}

Parallelization to the rescue?

library(readr); library(dplyr); library(ggplot2)
library(doParallel); cl <- makeCluster(4); registerDoParallel(cl)

# read data
data <- read_data("data/case-counts.csv")

# loop over scenario parameters, in parallel
foreach(
  scenario = c("A", "B", "C"),
  .packages = c("readr", "dplyr", "ggplot2") # workers need the packages too
) %dopar% {
  # simulate
  sim <- make_forecast(
    data, 
    scenario = scenario
  )
  # plot results
  plot_forecast(sim)
}
stopCluster(cl) # shut down the workers

source("forecast.R")

  • Scripts are reproducible…
    • …but we sometimes want to use them interactively
  • What if…
    • my parameters changed for only one scenario?
    • I’m running into errors for the last scenario and need to debug?
    • I step away for a meeting and forget which scenario results are up-to-date?

That’s a lot to manage!

The problem

My analysis code is mixed up with my “management” code.

sim <- make_forecast(data)

My analysis code is mixed up with my “management” code.

(
  sim <-
  make_forecast(data)
)

My analysis code is mixed up with my “management” code.

(
  sim <- # management code
  make_forecast(data) # analysis code
)

Other management code:

  • for(scenario in c("A", "B", "C")){...}
  • saveRDS(sim, "sim.rds")

The solution

Outsource your management code to {targets} by writing a pipeline.

What is a pipeline?

A pipeline is a layer on top of your analysis code that encodes the recipe for a sequence of generated objects, called “targets”, and how these targets depend on each other.


{targets} is an R package that you can use to write, manage, and execute pipelines.
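
{targets} is on CRAN, so setup is a one-liner:

install.packages("targets")
library(targets)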

Show & tell

library(readr); library(dplyr); library(ggplot2)

# read data
data <- read_data("data/case-counts.csv")

# simulate
sim <- make_forecast(data)

# plot results
plot_forecast(sim)

Show & tell

Before

# simulate
sim <- make_forecast(data)

After

# as a target
tar_target(
  # name of the target
  sim,
  # recipe for the target
  make_forecast(data)
)

Make a _targets.R file

library(targets)
use_targets() # writes a template _targets.R in your project root

From script to pipeline

library(targets)
source("R/make_forecast.R"); source("R/plot_forecast.R")
tar_option_set(packages = c("readr", "dplyr", "ggplot2", "lubridate"))

# pipeline
list(
  # read data
  tar_target(
    data,
    read_data("data/case-counts.csv")
  ),
  # simulate
  tar_target(
    sim,
    make_forecast(data)
  ),
  # plot results
  tar_target(
    plot,
    plot_forecast(sim)
  )
)
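
Running tar_make() executes this recipe in dependency order and caches each completed target in the _targets/ data store; tar_read() and tar_load() (below) pull values back out of that store.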

The free lunch

Easy to make things

tar_make(plot)
▶ dispatched target data
● completed target data [2.01 seconds, 357 bytes]
▶ dispatched target sim
● completed target sim [0.01 seconds, 661 bytes]
▶ dispatched target plot
● completed target plot [0.01 seconds, 95.58 kilobytes]
▶ ended pipeline [2.33 seconds]

Easy to read things

tar_read(plot) # return the target's stored value
tar_load(plot) # load the value into the session as `plot`

Easy to visualize the overall process

tar_visnetwork()

[dependency graph: data → sim → plot]

Only make what you need

Before

plot_forecast <- function(sim){
(ggplot(sim, aes(x = date, y = count))
  + geom_smooth()
  + geom_point()
  + theme_bw(base_size = 20)
)}

After

plot_forecast <- function(sim){
(ggplot(sim, aes(x = date, y = count))
  + geom_smooth()
  + geom_point(color = "red")
  + theme_bw(base_size = 20)
)}
tar_visnetwork()

[dependency graph: plot is now outdated after the change to plot_forecast()]

tar_make()
✔ skipping targets (1 so far)...
▶ dispatched target plot
● completed target plot [0.05 seconds, 95.671 kilobytes]
▶ ended pipeline [4.04 seconds]

Only make what you need

tar_visnetwork()

[dependency graph: all targets up to date]

Easy to scale

Distributed computing: parallelize easily with {crew}

# _targets.R
tar_option_set(
  controller = 
    crew::crew_controller_local(
      workers = 2
    )
)


To run:

tar_make()
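
{crew} is a separate package (install.packages("crew")). Once the controller is set, tar_make() runs targets that don't depend on each other in parallel across the two local workers, with none of the cluster bookkeeping from the foreach version.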

Easy to scale

Branching: easily repeat sections of the pipeline

setup

# define the scenarios as a target
tar_target(
  scenario,
  c("A", "B", "C")
)
# dynamic branching: one sim per scenario
tar_target(
  sim,
  make_forecast(data, scenario),
  pattern = map(scenario)
)

payoff

# plot from combined results
# data frame (e.g., facetted
# by parameter set)
tar_target( 
  combined_plot,
  plot_forecast(sim)
)
# creating a list of plots, 
# one per parameter set
tar_target(
  single_plot,
  plot_forecast(sim),
  pattern = map(sim),
  iteration = "list"
)
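
Putting the pieces together, a branched version of the earlier pipeline might look like the sketch below (same helper functions as before; it assumes make_forecast() accepts a scenario argument):

library(targets)
source("R/make_forecast.R"); source("R/plot_forecast.R")
tar_option_set(packages = c("readr", "dplyr", "ggplot2", "lubridate"))

list(
  tar_target(data, read_data("data/case-counts.csv")),
  tar_target(scenario, c("A", "B", "C")),
  # dynamic branching: one forecast per scenario
  tar_target(
    sim,
    make_forecast(data, scenario),
    pattern = map(scenario)
  ),
  # one plot built from the combined results
  tar_target(combined_plot, plot_forecast(sim))
)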

This is not a {targets} tutorial

These are {targets} tutorials: start with the {targets} user manual (https://books.ropensci.org/targets/)

Learn from my mistakes

Design tips

Sketch out a basic version of your pipeline

  • Start small and iterate
  • Right-size your targets
    • Larger targets = bigger speed-ups when skipped
    • Smaller targets = more control over what is skipped and what isn’t

Not every object has to be a target!

  • Will I want to update the contents of this target?
  • Might this target be updated independently of the others, or can some targets be bundled together?
  • Is this just an intermediate object or something I’ll want to look at down the line?

Create thoughtful and systematic target names

  • {targets} supports {tidyselect} syntax!
  • When naming targets, think about usage
tar_target(
  plot_results,
  make_plot(results)
)
tar_target(
  plot_diagnostics,
  make_plot(diagnostics)
)


tar_make(starts_with("plot")) # remake every plot_* target at once

Be explicit about your dependencies

Don’t do this!

tar_target(
  data,
  read_data("data/case-counts.csv")
)

Do this

tar_target(
  data_file,
  "data/case-counts.csv",
  format = "file"
) # track the file!!!
tar_target(
  data,
  read_data(data_file)
)

or this

tarchetypes::tar_file(
  data_file,
  "data/case-counts.csv"
)
tar_target(
  data,
  read_data(data_file)
)
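
In both versions, {targets} hashes the contents of data/case-counts.csv, so editing the data file invalidates data_file and everything downstream. With the bare string in the "don't" version, the pipeline never notices that the file changed.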

Use functions

Functions allow you to automate common tasks in a more powerful and general way than copy-and-pasting. Writing a function has four big advantages over using copy-and-paste:

  1. You can give a function an evocative name that makes your code easier to understand.
  2. As requirements change, you only need to update code in one place, instead of many.
  3. You eliminate the chance of making incidental mistakes when you copy and paste (i.e. updating a variable name in one place, but not in another).
  4. It makes it easier to reuse work from project-to-project, increasing your productivity over time.

R for Data Science (2e)
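
For example, the read_data() helper used throughout this talk could live in R/read_data.R. The body below is hypothetical (the talk never shows it), but it illustrates the shape:

# R/read_data.R (hypothetical implementation)
read_data <- function(path) {
  readr::read_csv(path, show_col_types = FALSE) |>
    dplyr::mutate(date = lubridate::as_date(date)) # assumes a `date` column
}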

Workflow tips

Don’t be afraid to crash the pipeline

tar_make(sim)
✔ skip target data
▶ start target sim
✖ error target sim
▶ end pipeline [0.278 seconds]
Error:
! Error running targets::tar_make()
  Error messages: targets::tar_meta(fields = error, complete_only = TRUE)
  Debugging guide: https://books.ropensci.org/targets/debugging.html
  How to ask for help: https://books.ropensci.org/targets/help.html
  Last error: object 'x' not found

Don’t be afraid to crash the pipeline

Don’t do this!

source("R/make_forecast.R")
debug(make_forecast)
tar_load(data)
make_forecast(data)

Do this

# _targets.R
tar_option_set(
  workspace_on_error = TRUE
)
# load what you need to recreate
# the error in an interactive session
tar_workspace(sim)
debug(make_forecast) 
  # or insert browser()
make_forecast(data)

and/or

# inspect the error traceback
# in more detail
tar_traceback(sim) # get error traceback

(maybe restart your R session?)

Not every idea has to immediately go into your pipeline

  • Keep two scratch folders
    • dev for .R files
    • notes for .Rmd files
  • Use tar_load() and tar_read() in development
  • Encode the analysis in your pipeline if it’s a keeper
    • i.e., something you’ll want to reproduce regularly

Scope your pipelines (projects)

  • Pipelines (projects) can quickly grow out of control
  • Keep your code focused
  • Maintain a scope statement in your project README file
  • Out of scope? New project!

Final thoughts

There is a fundamental tension between reproducibility and interactivity

  • Both have their place within the same project
  • Can be hard to move between these two modes
  • {targets} makes this move easier for analysis projects
  • Tradeoff: you need to speak {targets}

“Should every project be a {targets} pipeline?”

No!

Consider maintainability

  • How maintainable do you need this project to be?
  • Is it a quick and dirty task or something you may want to return to in a few weeks/months?
  • Will you need to pass this project off to someone?

Consider modularity

  • Will I be able to make my code modular in a meaningful way?
  • Will I be performing replicates of an experiment or repeating experiments with different inputs?

Consider complexity

  • Will my analysis have many different inter-dependent bits?
  • Will I need to be careful about the order of execution for various bits?

Set yourself up for success

  • Write functions!!!
  • Adopt a file structure compatible with {targets} pipelines (and R packages)
    • Define functions in R/
  • Document your functions with {roxygen2} (once they’re stable)
  • Consider using other reproducibility tools
    • {renv} to manage a project-specific package library (see the sketch below)
    • {config} to manage the context within which you are deploying your code
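
A minimal {renv} setup, for reference:

install.packages("renv")
renv::init()     # create a project-specific library and lockfile
renv::snapshot() # record exact package versions as the project evolves
renv::restore()  # recreate the library on another machine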

Thank you!

https://papsti.github.io/talks/2025-01-22_targets.html