Landscape 2024 masterclass
September 16, 2024
Introduction (15 mins)
Research projects with R (30 mins)
Comfort break (10 mins)
3 workflows for Reproducibility (20 mins)
Quarto (20 mins)
Comfort break (10 mins)
Exercise time (45 mins)
Discussion + feedback (30 mins)
Ben Black
Doctoral researcher
Manuel Kurmann
Research assistant
Let’s hear your thoughts: What does reproducible research mean to you?
Findability, Accessibility, Interoperability, and Reusability[1].
Developed by diverse stakeholders (academia, industry, funders, publishers).
Addressed the need for infrastructure supporting data reuse.
Emphasis on both human and machine readability.
Don’t just take our word for it: research funders are increasingly focused on reproducible research too.
But just using R doesn’t necessarily make your research reproducible…
Jenny Bryan: A good R project… “creates everything it needs, in its own workspace or folder, and it touches nothing it did not create.” [2]
Stay away from setwd()!
Use RStudio Projects:
Designates a new or existing folder as the working directory by creating an .Rproj file within it.
When you open a project, the working directory is set automatically and all paths are relative to it.
The .Rproj file can be shared along with the rest of the research project, so other users can open the project and get the same working directory.
To create a project: go to File > New Project; it can be created in a new or existing directory.
To open a project:
Use File > Open Project in the top left of RStudio.
Use the drop-down menu in the top-right of the RStudio session.
Outside of R, double-click on the .Rproj file in the folder.
.Rprofile files
RStudio projects can store project-specific settings using the .Rprofile file.
The file is run every time the project is opened and can be used to perform actions such as opening a particular script:
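As a rough sketch of the idea, a project-level .Rprofile could register a hook that opens a script once RStudio has finished starting up. The script path is hypothetical, and combining the `rstudio.sessionInit` hook with `rstudioapi::navigateToFile()` is one possible approach, not necessarily the one used in the masterclass:

```r
# Contents of a project-level .Rprofile (sketch).
# "src/scripts/01_data_prep.R" is a hypothetical path used for illustration.
setHook("rstudio.sessionInit", function(newSession) {
  script_to_open <- "src/scripts/01_data_prep.R"
  # Only act on a fresh session, and only if the file and rstudioapi exist
  if (newSession && file.exists(script_to_open) &&
      requireNamespace("rstudioapi", quietly = TRUE)) {
    rstudioapi::navigateToFile(script_to_open)
  }
}, action = "append")
```

Outside RStudio the hook is registered but never triggered, so the same .Rprofile stays harmless when the project is run from a terminal.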
The easiest way to create and edit .Rprofile files is to use the functions from the usethis package:
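A minimal sketch of this, assuming usethis is installed; the guard variable `can_edit_profile` is introduced here only so that the snippet does nothing in a non-interactive session:

```r
# Open (or create) the project-level .Rprofile for editing.
# Guarded so the sketch is harmless where usethis is missing or R is non-interactive.
can_edit_profile <- interactive() && requireNamespace("usethis", quietly = TRUE)
if (can_edit_profile) {
  usethis::edit_r_profile(scope = "project")
}
```

Using `scope = "project"` targets the .Rprofile in the current project rather than the user-level file in the home directory.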
Familiar lines from the beginning of many an R script:
Again, what is wrong?
No indication of the package version to be installed, which means:
Potential to break code
Potential to introduce dependency conflicts
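To make the problem concrete, the sketch below shows the kind of script opening meant here (as comments, so nothing gets installed) followed by one simple way of stating which version the code expects. The package and the minimum version are illustrative assumptions only:

```r
# A typical (problematic) script opening, shown as comments so nothing is installed:
# install.packages("ggplot2")   # installs whatever version is current today
# library(ggplot2)              # loads whatever version happens to be installed

# Stating the expected version at least documents what the code was written against.
# "stats" ships with R, so this runs anywhere; the minimum version is illustrative.
required_version <- "3.0.0"
installed_version <- packageVersion("stats")
if (installed_version < required_version) {
  stop("Package 'stats' >= ", required_version, " is required.")
}
```

A check like this only detects a mismatch; it does not re-create the right environment.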
But the problem is bigger than just packages…
When your code runs it is also utilizing:
A specific version of R
A specific operating system
Specific versions of system dependencies, i.e. other software that R packages utilise.
Collectively, these are the environment of your code; documenting and managing it is essential to ensure reproducibility.
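As a first, low-effort step towards documenting the environment, base R can already report most of these components. The following is a sketch; the output file name and the list layout are assumptions for illustration:

```r
# Collect the main components of the environment in one list
environment_record <- list(
  r_version = R.version.string,
  os        = Sys.info()[["sysname"]],
  platform  = R.version$platform,
  packages  = loadedNamespaces()
)

# Save a human-readable snapshot next to your results (a temporary file here)
record_path <- file.path(tempdir(), "session_info.txt")
writeLines(capture.output(sessionInfo()), record_path)
str(environment_record)
```

This documents the environment but does not manage it.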
But how to manage your environment?
Different approaches range in complexity, so some may suit certain projects and not others.
The most user-friendly way to manage your package environment in R (caveat to be discussed): the renv package.
renv helps you create reproducible environments for your R projects by:
Documenting your package environment
Providing functionality to re-create it.
renv creates project-specific libraries of packages (renv/library) which contain all the packages used by your project.
renv also creates project-specific lockfiles (renv.lock) which contain sufficient metadata so that the project library can be re-installed on a new machine.
Result: Different projects can use different versions of packages and installing, updating, or removing packages in one project doesn’t affect any other project.
renv limitation
renv is not intended to manage other aspects of your environment, such as your version of R or your operating system.
This is why, if you want ‘bullet-proof’ reproducibility, renv needs to be used alongside other approaches such as containerization.
There is no objective measure that makes code ‘clean’ vs. ‘un-clean’.
Think of ‘clean coding’ as the pursuit of making your code easier to read, understand and maintain.
Like writing, code should follow a set of rules and conventions. For example, in English, a sentence starts with a capital letter and ends with a full stop.
For R code there is no single set of conventions; instead there are numerous styles. The two most common are the Tidyverse style and the Google R style.
Most important: Choose a style and apply it consistently in your coding.
Code styles express opinionated preferences on a series of common topics:
We won’t discuss in detail but you should read one of the style guides when you have the time.
Two R packages for code styling, lintr and styler:
lintr checks your code for style issues and potential programming errors, then presents them to you to correct, like doing a ‘spellcheck’ on a written document.
styler automatically formats your code to a particular style, the default of which is the tidyverse style.
To use lintr and styler, call their functions like any other package:
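A short sketch of what such calls can look like, assuming both packages are installed. The untidy example script is made up and written to a temporary file so that no real project file is touched:

```r
# Write a small, deliberately untidy script to a temporary file
script_path <- tempfile(fileext = ".R")
writeLines(c("x=c(1,2,3)", "if(mean(x)>1){print( 'large' )}"), script_path)

# lintr reports issues without changing the file
if (requireNamespace("lintr", quietly = TRUE)) {
  print(lintr::lint(script_path))
}

# styler rewrites the file in place (tidyverse style by default)
if (requireNamespace("styler", quietly = TRUE)) {
  styler::style_file(script_path)
}
cat(readLines(script_path), sep = "\n")
```

Because `styler::style_file()` overwrites the file, it is worth running it on code that is under version control or otherwise backed up.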
styler can also be used through the RStudio Addins menu below the navigation bar:
Both packages can be used as part of a continuous integration (CI) workflow with GitHub, meaning that their functions can be run automatically when you update your code.
Starting your scripts with a consistent header containing information about its purpose, author(s), and creation and modification dates is very helpful!
There are no rules as to what this should look like but this is an example:
```{r}
#############################################################################
## Script_title: Brief description of script purpose
##
## Notes: More detailed notes about the script and its purpose
##
## Date created:
## Author(s):
#############################################################################
```
To save time inserting your script header, use RStudio’s code snippets feature.
Code snippets are text macros that insert a section of code using a keyword.
To create your own code snippet, go to Tools > Global Options > Code > Edit Snippets and add a new snippet with your code below it.
To use a code snippet, start typing the keyword in the script; when the auto-completion list appears, press Tab and the code section will be inserted:
Braced ({}) sections of code (i.e. function definitions, conditional blocks, etc.) can be folded to hide their contents by clicking on the small triangle in the left margin.
Code sections can also be defined manually with a comment line that ends in at least four dashes (-), equal signs (=), or pound signs (#).
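For illustration, a script divided into foldable sections might look like the following sketch (the data and section names are made up):

```r
# Load data ----
measurements <- data.frame(site = c("A", "B", "C"), value = c(1.2, 3.4, 2.2))

# Summarise ====
mean_value <- mean(measurements$value)

# Report ####
message("Mean value: ", round(mean_value, 2))
```

Each of the three comment lines ends in four identical characters, which is what RStudio uses to recognise a section; which character you pick is purely a matter of taste.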
To navigate between code sections:
Workflow decomposition is the structuring or compartmentalising of code into separate logical parts that make it easier to maintain [5].
You probably already instinctively do decomposition by splitting typical processes such as:
This often leads to scripts with logical-sounding names like Data_prep.R and Data_analysis.R, but can others be expected to know which order these must be run in?
Solutions:
1st step: Give your scripts sequential numeric tags in their names, e.g. 01_Data_prep.R, 02_Data_analysis.R, ensuring that they are presented in numerical order in their designated directory.
Next level: Create a Master script that sources your other scripts in sequence (think of them as sub-scripts) so that users need only run one script.
Use the base::source() function to run the sub-scripts:
#############################################################################
## Master_script: Run steps of research project in order
#############################################################################
# Environment in which the sub-scripts are evaluated
scripting_env <- new.env()
# Prepare LULC data
source("Scripts/Preparation/Dep_var_dat_prep.R", local = scripting_env)
# Prepare predictor data
source("Scripts/Preparation/Ind_var_data_prep.R", local = scripting_env)
Sub-scripts can be run in their own environment (using the source(local = ) argument).
Within your sub-scripts, processes should also be separated into code sections and any repetitive tasks should be performed with custom functions.
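The master-script idea can be tried out end to end with stand-in files. In the sketch below the two sub-scripts, their names and their contents are invented for the demonstration and live in a temporary directory:

```r
# Create two tiny stand-in sub-scripts in a temporary directory
script_dir <- file.path(tempdir(), "scripts_demo")
dir.create(script_dir, showWarnings = FALSE)
writeLines("raw_values <- c(4, 8, 15)", file.path(script_dir, "01_data_prep.R"))
writeLines("result <- mean(raw_values)", file.path(script_dir, "02_data_analysis.R"))

# Master script logic: run every numbered sub-script in order,
# evaluating all of them in one shared environment
scripting_env <- new.env()
sub_scripts <- sort(list.files(script_dir, pattern = "^[0-9]{2}_.*\\.R$",
                               full.names = TRUE))
for (script in sub_scripts) {
  message("Running ", basename(script))
  source(script, local = scripting_env)
}
scripting_env$result
```

Evaluating the sub-scripts in `scripting_env` keeps their objects out of the global environment while still letting later scripts see what earlier ones created.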
Following this approach you end up with a workflow that will look something like this:
A clean project directory that has well-organised sub-directories makes your project’s code easier to understand for others.
Try to use:
data/raw/climatic/precipitation/2020/precip_2020.rds vs. data/precip_2020_raw.rds (helpful when it comes to programmatically constructing file paths)
As an example, my go-to project directory structure looks like this:
my_project
├── data            # The research data
│   ├── raw
│   └── processed
├── output          # Storing results
├── publication     # Containing the academic manuscript of the project
├── src             # For all files that perform operations in the project
│   ├── scripts
│   └── functions
└── tools           # Auxiliary files and settings
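A structure like this can be created, and then exploited for path construction, with a few lines of base R. This is a sketch only: the helper `build_raw_path()` and its arguments are hypothetical, and everything is created inside a temporary folder:

```r
# Create the directory skeleton inside a temporary folder
project_root <- file.path(tempdir(), "my_project")
sub_dirs <- c("data/raw", "data/processed", "output", "publication",
              "src/scripts", "src/functions", "tools")
for (sub_dir in sub_dirs) {
  dir.create(file.path(project_root, sub_dir), recursive = TRUE,
             showWarnings = FALSE)
}

# Nested directories make file paths easy to build programmatically
build_raw_path <- function(variable, prefix, year) {
  file.path(project_root, "data", "raw", "climatic", variable, year,
            paste0(prefix, "_", year, ".rds"))
}
build_raw_path("precipitation", "precip", 2020)

# Print the resulting directory structure
print(list.dirs(project_root, full.names = FALSE, recursive = TRUE))
```

Because each piece of information (variable, year) has its own directory level, looping over years or variables only requires changing the arguments.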
Creation of the project directory structure can be automated using RStudio’s Project Templates functionality.
Allows selection of custom template when creating a new Rstudio project (File > New Project > New Directory > New Project Template).
Warning: Implementation of personal template is labor intensive as it needs to be contained within an R-package. But several template packages appropriate for scientific research projects are available:
But writing comprehensive documentation that covers all aspects of projects is time-consuming…
Suggested solution in the R research community: Research as package approach (i.e. creating your project as an R-package) [6].
Pro: R-packages have an existing strict set of conventions for documentation
Cons:
Learning curve for those unfamiliar with R-packages
May not be appropriate for all project requirements.
Our advice: don’t let the perfect be the enemy of the good and focus on these key areas:
Provide adequate in-script commentary: Remember that comments should be used to explain the purpose of the code, not what the code is doing
Document your functions with roxygen skeletons
Include a README file: README files are where you should document your project at the macro-level, i.e. what it is about and how it is supposed to work.
roxygen2
Base R provides a standard way of documenting functions in packages as separate .Rd (R documentation) files.
.Rd files use a custom syntax to detail key aspects of the functions such as input parameters, outputs, and package dependencies [7].
Documenting functions in this way is a good practice for your project even if you are not creating a package.
Rather than manually creating .Rd files, we can use the roxygen2 package.
roxygen2 provides functionality to add blocks of comments (a roxygen skeleton) to the top of function scripts. These are then used to automatically generate .Rd files.
To add a roxygen skeleton, place your cursor inside the function you want to document and press Ctrl + Alt + Shift + R (or Cmd + Option + Shift + R on Mac), or go to Code Tools > Insert Roxygen Skeleton (wand icon in the top row of the source pane).
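For illustration, a function with a filled-in roxygen block could look like the sketch below. The function `rescale01()` is a made-up example, not part of the masterclass material:

```r
#' Rescale a numeric vector to the range 0-1
#'
#' @param x A numeric vector.
#' @param na.rm Logical; should missing values be ignored when finding the range?
#'
#' @return A numeric vector of the same length as `x`.
#' @examples
#' rescale01(c(2, 4, 6))
rescale01 <- function(x, na.rm = TRUE) {
  value_range <- range(x, na.rm = na.rm)
  (x - value_range[1]) / (value_range[2] - value_range[1])
}

rescale01(c(2, 4, 6))
```

The `#'` lines are ordinary comments as far as R is concerned, so the script runs unchanged whether or not roxygen2 is ever used to turn them into .Rd files.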
README.md files
.md is the Markdown format, which is the most common format for README files in R projects because it can be read by many programs and rendered in a variety of formats.
README.md files are often accompanied by the corresponding file README.Rmd, an R Markdown file which generates them.
README.Rmd files can be created using the usethis package (use_readme_rmd()).
In some cases a plain-text README (.txt) may be better.
There is no single standardised format for what should be included, but here is an example of a README.txt file from one of the authors’ publications.
It is useful to include a tree diagram of the project directory structure down to the file level; this can be generated with the fs package.
Now some of the details of the graphical overview probably make more sense to you:
We will implement some of these good practices in our 1st exercise.
We will discuss three workflows for reproducibility:
Rstudio project to Zenodo pipeline
Containerization with Docker
Version control with Git
These are suggestions for different approaches and we hope that in future you will be able to adapt these workflows to the needs of your own research projects.
RStudio project to Zenodo pipeline
renv:
Creates project-specific libraries
Captures package versions in a lockfile (renv.lock)
Ensures reproducibility of package environment
Centralizes package environment management within each project
renv workflow:
Initialize renv inside the project directory to identify dependencies using renv::init()
Snapshot dependencies to create a lockfile using renv::snapshot()
Restore environments using renv::restore()
Easy integration with RStudio for workflow management
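The three calls fit together roughly as in the sketch below, assuming renv is installed. The guard variable `run_renv_workflow` is introduced here only so that the snippet changes nothing when run non-interactively or outside a project you intend to manage:

```r
# Guarded so nothing is modified unless renv is installed and R is interactive
run_renv_workflow <- interactive() && requireNamespace("renv", quietly = TRUE)
if (run_renv_workflow) {
  renv::init()      # 1. set up the project library and write an initial renv.lock
  renv::snapshot()  # 2. after installing/updating packages, record the new state
  renv::restore()   # 3. on another machine, reinstall the versions in renv.lock
}
```

In practice the three calls are rarely run back to back: `init()` happens once per project, `snapshot()` whenever the package set changes, and `restore()` by whoever receives the project.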
renv:
Does not manage R versions or system-wide dependencies
Focuses on managing package environments within R
Best combined with containerization (e.g., Docker) for full reproducibility
Complements external repositories (e.g., Zenodo) for sharing and preservation
Zenodo:
Long-term storage with a generous 50GB upload limit per record
Permanent DOIs for easy citation and versioning support for updates
GitHub integration for seamless code archiving with DOI snapshots
Supports FAIR principles: aligned with open access, transparency, and reusability
Community creation for grouping related research outputs
API and open-source: flexible for programmatic access and customization
Zenodo with zen4R:
Upload datasets, code, and metadata from R to Zenodo
Automate publication and deposition management
Retrieve and update Zenodo records directly in R
Facilitates integration and reproducibility in R workflows
renv and Zenodo:
renv manages internal project environments
Zenodo ensures external reproducibility with archiving
Together, they provide a comprehensive solution
Aligns with open science and FAIR principles
Containerization is the process of bundling code along with all of its dependencies including:
The operating system
Software libraries (packages)
Other system software
Because everything needed to run the code is included, the code is portable and can be run on any platform or cloud service.
This makes containerization the gold standard for reproducibility.
Docker is an open-source platform for containerization, and the most popular one.
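Dockerfile: a text file of instructions describing how to build an image.
Docker Image: a read-only template, built from a Dockerfile, containing the code and its dependencies.
Docker Container: a running instance of an image.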
Two main resources that can help in the creation of containerized R projects:
A project that catalogs and manages Docker Images for R projects.
Basic images include different versions of R and RStudio
Other images offering collections of R packages for specific purposes (e.g. tidyverse).
A package which creates a custom class object that represents the Dockerfile.
Has slots corresponding to common elements of Docker images, allowing elements to be added to the Dockerfile from R.
Two methods of integrating renv with Docker to manage the package environment of your project:
Method 1: Use renv to install packages when the Docker image is built:
Installing packages (renv::restore()) when building the image is slow, so try to avoid the need to re-build the base image many times.
Method 2: Use renv to install/restore packages only when Docker containers are run:
Better when you plan to have multiple projects built from the same base image but with different package requirements.
The package library is not included in the image; instead, different project-specific libraries are mounted to the container when it is run [8].
If renv::restore() is run with caching, packages are not re-installed every time the container is run.
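As a sketch of the first method (restoring packages when the image is built), the Dockerfile itself can be written from R with base functions. The base image tag, the CRAN mirror and the master script path are assumptions chosen for illustration, and the file is written to a temporary directory:

```r
# Lines of a Dockerfile that restores the renv library at build time.
# The base image tag and the script path are assumptions for illustration.
dockerfile_lines <- c(
  "FROM rocker/r-ver:4.4.1",
  "WORKDIR /project",
  "# Copy only the lockfile first so the restore layer is cached until it changes",
  "COPY renv.lock renv.lock",
  "RUN R -e \"install.packages('renv', repos = 'https://cloud.r-project.org')\"",
  "RUN R -e \"renv::restore()\"",
  "COPY . .",
  "CMD [\"Rscript\", \"src/scripts/00_master.R\"]"
)
dockerfile_path <- file.path(tempdir(), "Dockerfile")
writeLines(dockerfile_lines, dockerfile_path)
cat(readLines(dockerfile_path), sep = "\n")
```

Copying renv.lock before the rest of the project means that edits to scripts do not invalidate the slow package-installation layer, which addresses the re-build cost mentioned above.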
Version control: A more systematic way to organise data beyond “dataprep_1”, “dataprep_final”, “dataprep_finalfinal” etc.
Systematic documentation and storage of code changes allowing us to track changes and revert back to previous versions when needed.
Create a GitHub repository in your account
Download and install Git
Add credentials for your account to Git
Link RStudio project to GitHub repository
Open, check out and navigate the local version of the Git repository via RStudio
Basic functionalities of Git in Rstudio
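Several of these steps can also be carried out from the R console. The sketch below assumes the usethis and gitcreds packages are installed and that Git itself is already on the system; the user name and e-mail are placeholders. Note that it creates the GitHub repository from the local project, a slightly different order from the steps listed above:

```r
# Guarded so nothing happens unless R is interactive and usethis is available
set_up_git <- interactive() && requireNamespace("usethis", quietly = TRUE)
if (set_up_git) {
  # Placeholder identity: replace with your own details
  usethis::use_git_config(user.name = "Jane Doe", user.email = "jane@example.org")
  usethis::create_github_token()  # opens GitHub in the browser to create a token
  gitcreds::gitcreds_set()        # stores the token as the credential for Git
  usethis::use_git()              # initialises a Git repository in the project
  usethis::use_github()           # creates the GitHub repository and links it
}
```

Each call prompts for confirmation where needed, so it is best run line by line rather than sourced as a whole.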
An open source scientific and technical publishing system
Integrates code in multiple programming languages, written material, and interactive visual components
Produces a range of document formats including HTML, PDF, and Word
Developed by Posit, the same company that created RStudio.
How many programs do you currently use when writing academic papers?
Quarto solves this problem by allowing you to write full academic manuscripts from start to finish including text, code, and visualizations in a single document:
Key benefits:
.qmd files can be edited with various code/text editors (VS Code, RStudio etc.)
More reproducible, as it allows others to use your underlying manuscript file in combination with your data to directly re-create your results.
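To give a feel for what such a single-source manuscript contains, the sketch below writes a minimal .qmd file from R. The title, section and figure label are invented for illustration, and rendering is only attempted interactively when both the quarto R package and the Quarto command-line tool are found:

```r
# Build the code-fence marker programmatically to keep this example readable
fence <- strrep("`", 3)

qmd_path <- file.path(tempdir(), "manuscript.qmd")
writeLines(c(
  "---",
  "title: \"Example manuscript\"",
  "format: html",
  "---",
  "",
  "## Results",
  "",
  paste0(fence, "{r}"),
  "#| label: fig-cars",
  "#| fig-cap: \"Speed versus stopping distance.\"",
  "plot(cars)",
  fence,
  "",
  "As @fig-cars shows, stopping distance increases with speed."
), qmd_path)

# Optional: render the document if Quarto is available
if (interactive() && requireNamespace("quarto", quietly = TRUE) &&
    nzchar(Sys.which("quarto"))) {
  quarto::quarto_render(qmd_path)
}
```

Text, the code that produces the figure, and the cross-reference to that figure all live in the same file, which is what lets others re-create the results from the manuscript source.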
Several formats: RevealJS, Microsoft Powerpoint and Beamer using a common syntax.
Useful features:
Modern themes with functionality to publish your own theme.
Interactive content: Executable code blocks, graphs, maps
Dynamic resizing of content depending on screen size
Functionality for slide notes, automatic transitions, timers etc.
Easy export to PDF or HTML
Similar to manuscripts, code-based content is dynamically updated.
Websites to act as guides, tutorials or teaching materials:
Personal websites to share publications and presentations:
Websites for research projects to share progress and results:
Many options for interactive data visualisations, tables and diagrams using:
On the website for the masterclass under the heading Guided exercises you will find 4 exercises that put into practice the workflows we have discussed, as well as starting to write an academic manuscript with Quarto.
The exercises build incrementally on each other but they don’t need to be completed in order.
Choose whichever interests you most, or choose depending on your existing knowledge and expertise.
We have allocated 45 minutes to work on the exercises and we will be here to help you if you have any questions.
This is an open discussion so feel free to raise any points you might have, but here are some ideas:
Any questions of understanding or clarification about the content we have covered today?
What are your own experiences with trying to make your work reproducible? Particular successes or obstacles you have encountered?
Are there any other tools or workflows that you have found useful that you would like to share with the group?
Have you encountered any particular differences in the way that reproducibility is approached in your field/discipline?
Please feel free to share the website of the masterclass with your colleagues