Reproducible Research with R and RStudio: Workflows for data, projects and publications

A workshop presented at the Landscape 2024 conference, Berlin

This image highlights some of the key concepts we will discuss in the workshop, which have been divided into separate sections:

  1. Background: Some introductory information on why reproducible and transparent research is important.

  2. Research projects with R: Starting from the basics to develop good practice for creating research projects with R, focusing on some features of RStudio as an Integrated Development Environment that can help ensure your work is reproducible.

  3. Workflows for reproducibility: Here we present three workflows of differing levels of complexity and discuss how they can be combined and which might be best suited to a given project's research needs.

  4. Quarto: Here we introduce the open-source scientific and technical publishing system Quarto which can be used for numerous academic activities including preparing manuscripts.

  5. Guided Exercises: Now it’s time to get hands-on with some guided exercises to put into practice some of the concepts we have discussed.

1 Background

About us

We are four researchers from the research group Planning of Landscape and Urban Systems (PLUS) at ETH Zürich. Click on the social icons below our pictures to find out more about our individual research or get in touch with us.

Ben Black

Doctoral Researcher

Nivedita Harisena

Doctoral Researcher

Manuel Kurmann

Research Assistant

Maarten Van Strien

Senior scientist

What is reproducible research?

Reproducibility is a key aspect of reliable scientific research. It enables other researchers to reproduce the same results using the original data, code, and documentation [1]. Below are the core principles to ensure reproducibility in research:

1.
Essawy BT, Goodall JL, Voce D, et al (2020) A taxonomy for reproducible and replicable research in environmental modelling. Environmental Modelling & Software 134:104753. https://doi.org/10.1016/j.envsoft.2020.104753

Starts with planning

Reproducibility begins during the planning stage. It is essential to organize data management and ensure clear protocols are in place even before starting the analysis.

Consistent data storage

Regular backups of data are crucial. Storing data in multiple locations ensures accessibility and minimizes the risk of data loss. [2]

Contains clear documentation

Thorough documentation is essential to guarantee that data and methods can be accurately interpreted and reproduced by others. This entails the use of well-organised files and the inclusion of metadata that describes the data, how it was obtained, and how it was processed. [2] [3]

Utilizes version control

Using version control systems helps track changes in the project over time. This approach preserves the history of the project and facilitates the reversion of files to a previous state in the event of an error. [2]

Is accessible

Data should be stored in nonproprietary, portable formats to ensure broad accessibility and long-term usability. This practice ensures that researchers can access the data without relying on specific software tools. Making data and code publicly available in accessible repositories supports scientific transparency and allows broader use of research outputs. [2] [3]

By following these steps, researchers contribute to the wider scientific community, ensuring that their work can be efficiently and accurately reproduced by others.

Introducing the FAIR Principles

While the principles above lay the groundwork for reproducibility, the FAIR principles (Findability, Accessibility, Interoperability, and Reusability) provide a more comprehensive framework for enhancing the value of research data in the digital age. These principles expand on reproducibility by emphasizing not only human access to research outputs but also machine actionability, ensuring that data can be effectively found, accessed, and reused by both people and computational tools. [4]

How FAIR Principles Build on Reproducibility

The FAIR principles naturally complement and expand on the core aspects of reproducible research:

  • Findability reinforces the importance of clear documentation. Assigning persistent identifiers and providing rich metadata makes it easier for researchers and search tools to locate and understand datasets, ensuring that your research remains accessible over time.

  • Accessibility builds on the concept of using nonproprietary formats. FAIR emphasizes that data should be retrievable using open, standardized protocols, which ensures long-term access to the metadata even if the data itself becomes unavailable.

  • Interoperability relates to the consistent use of data standards and version control. By using standardized formats and vocabularies, research data can be more easily integrated with other datasets, supporting reuse and long-term relevance in broader research contexts.

  • Reusability directly aligns with the goals of reproducible research by ensuring that data is accompanied by clear licensing and provenance, allowing others to confidently reuse it. This principle reinforces the need for thorough documentation and transparent methods.

By incorporating the FAIR principles, researchers ensure that their data not only meets the standards of reproducibility but is also optimized for long-term use and discovery. This fosters a research environment where data is more transparent, accessible, and impactful over time. [4]

Why strive for reproducible research?

In recent years, various scientific disciplines have experienced what is known as a “replication crisis”. This crisis arises when researchers are unable to reproduce the headline results of key studies using the reported data and methods [5] [6] [7]. This lack of reproducibility undermines public trust in science, as it raises doubts about the validity of research findings.

5.
Moonesinghe R, Khoury MJ, Janssens ACJW (2007) Most Published Research Findings Are False—But a Little Replication Goes a Long Way. PLOS Medicine 4(2):1–4. https://doi.org/10.1371/journal.pmed.0040028
6.
Open Science Collaboration (2015) Estimating the reproducibility of psychological science. Science 349(6251):aac4716. https://doi.org/10.1126/science.aac4716
7.
Bohannon J (2015) Many psychology papers fail replication test. Science 349(6251):910–911. https://doi.org/10.1126/science.349.6251.910

Advantages of Reproducibility for Your Research

Personal Reference

Conducting reproducible research simplifies the process of remembering how and why specific analyses were performed. This makes it easier to explain your work to collaborators, supervisors, and reviewers, enhancing communication throughout your project. [2]

Efficient Modifications

Reproducible research enables you to quickly adjust analyses and figures when requested by supervisors, collaborators, or reviewers. This streamlined process can save substantial time during revisions. [2]

Streamlined Future Projects

By maintaining well-organized and reproducible systems, you can reuse code and organizational structures for future projects. This reduces the time and effort required for similar tasks in subsequent research. [2]

Demonstrates Rigor and Transparency

Reproducibility demonstrates scientific rigor and transparency. It allows others to verify your methods and results, improving the peer review process and reducing the risk of errors or accusations of misconduct. [2]

Increases Impact and Citations

Making your research reproducible can lead to higher citation rates [8] [9]. By sharing your code and data, you enable others to reuse your work, broadening its impact and increasing its relevance in the scientific community. [10] [11].

8.
Piwowar HA, Day RS, Fridsma DB (2007) Sharing Detailed Research Data Is Associated with Increased Citation Rate. PLOS ONE 2(3):1–5. https://doi.org/10.1371/journal.pone.0000308
9.
McKiernan EC, Bourne PE, Brown CT, et al (2016) How open science helps researchers succeed. eLife 5:e16800. https://doi.org/10.7554/eLife.16800
10.
Whitlock MC (2011) Data archiving in ecology and evolution: Best practices. Trends in Ecology & Evolution 26(2):61–65. https://doi.org/10.1016/j.tree.2010.11.006
11.
Culina A, Crowther TW, Ramakers JJC, Gienapp P, Visser ME (2018) How to do meta-analysis of open datasets. Nature Ecology & Evolution 2(7):1053–1056. https://doi.org/10.1038/s41559-018-0579-2

Advantages of Reproducibility for Other Researchers

Facilitates Learning

Sharing data and code helps others learn from your work more easily. New researchers can use your data and code as a reference, speeding up their learning curve and improving the quality of their analyses. [2]

Enables Reproducibility

Reproducible research makes it simpler for others to reproduce and build upon your work, fostering more compatible and robust research across studies. [2]

Error Detection

By allowing others to access and review your data and code, reproducibility helps detect and correct errors, ensuring that mistakes are caught early and reducing the chance of their propagation in future research. [2]

2.
Alston JM, Rick JA (2021) A Beginner’s Guide to Conducting Reproducible Research. The Bulletin of the Ecological Society of America 102(2):e01801. https://doi.org/10.1002/bes2.1801

Why R for reproducible research?

R is increasingly recognized as a powerful tool for ensuring reproducibility in scientific research. Here are some key advantages of using R for reproducible research:

Open Source Accessibility

R is freely available to everyone, eliminating cost barriers and promoting inclusive access to research tools. This open-source model ensures that researchers around the world can use and contribute to its development, fostering a collaborative research environment. [3]

Comprehensive Documentation

R encourages thorough documentation of the entire research process. This ensures that analyses are well-tracked and can be easily replicated across different projects, enhancing the overall transparency and reliability of the research.

Integrated Version Control

R seamlessly integrates with version control systems like Git, allowing researchers to track changes to code, data, and documents. This helps maintain a detailed record of a project’s evolution and ensures that all steps are easily reproducible. [3]

3.
Siraji MA, Rahman M (2023) Primer on Reproducible Research in R: Enhancing Transparency and Scientific Rigor. Clocks & sleep 6(1):1–10. https://doi.org/10.3390/clockssleep6010001

Consistency Across Platforms

R provides a stable environment that works consistently across different operating systems, whether you are using Windows, Mac, or Linux. This cross-platform consistency greatly enhances the reproducibility of research across diverse systems.

Broad Community Support

The R community is large and active, continuously contributing to the improvement of the software. This broad support makes R a reliable choice for long-term research projects, ensuring that new tools and methods are constantly being developed and shared.

Flexibility and Adaptability

R offers a wide range of tools and functions that can be adapted to various research needs. This flexibility allows researchers to handle diverse tasks within a reproducible framework, making it a versatile tool for projects of all kinds.

2 Research projects with R

Let’s start with a definition of what makes a good R project from Jenny Bryan:

A good R project… “creates everything it needs, in its own workspace or folder, and it touches nothing it did not create.” [12]

12.

This is a good definition that captures important concepts, such as the notion that projects should be ‘self-contained’. However, we add one more caveat to this definition: a good R project should also explain itself.

For the purpose of this workshop we will approach this by splitting it into six topics, which are highlighted in this graphic:

Graphical overview of the components of a good research project in R

As you move through these you will see that there are areas of overlap and complementarity between them. These topics are also central to the choice of approaches in the three workflows for reproducibility that we will share.

RStudio projects

How many times have you opened an R script and been greeted by this line:

setwd("C:/Users/ben/path/that/only/I/have")

While it is well-intentioned (i.e. it avoids the need for full paths for all objects that will subsequently be loaded or saved), the problem with it is obvious: this specific path is only relevant for the author, not for other potential users, and even for the author it will become invalid if they change computers. The good news is that there is a very simple way to avoid using setwd() at all: RStudio Projects.

RStudio Projects designate new or existing folders as a defined working directory by creating an .Rproj file within them. This means that when you open a project, the working directory of the RStudio session is automatically set to the directory containing the .Rproj file, and the paths of all files in this folder are relative to it.

The .Rproj file can be shared along with the rest of the research project files, meaning that other users can easily open the project and have the same working directory, removing the need for those troublesome setwd() lines.
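For example, once the project is open, scripts can reference files with paths relative to the project root (the file names here are illustrative):

# No setwd() required: paths resolve relative to the folder containing the .Rproj file
lulc_data <- read.csv("data/raw/lulc_data.csv")
saveRDS(lulc_data, "data/processed/lulc_data.rds")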

Creating and opening projects

Creating an RStudio project is as simple as using File > New Project in the top left and then choosing between creating the project in a new or existing directory.

There are several ways to open a Project:

  1. Using File > Open Project in the top left of RStudio.

  2. Using the drop-down menu in the top-right of the RStudio session.

  3. Outside of R, by double-clicking on the .Rproj file in the folder.

Utilising project-specific .Rprofile files

Another useful feature of RStudio projects is the ability to store project-specific settings using the .Rprofile file, which controls the initialisation behaviour of the R session when the project is opened. A useful application of this for reproducible research projects is to automatically open a particular script, for example a master script that runs all the code in the project (a concept that will be discussed under workflow decomposition).

To do this, the contents of your .Rprofile file would look like this:

setHook("rstudio.sessionInit", function(newSession) {
  if (newSession)
    # Open the script specified by the path
    rstudioapi::navigateToFile('scripts/script_to_open.R', line = -1L, column = -1L)
}, action = "append")

The easiest way to create and edit .Rprofile files is to use the functions from the package usethis:

# Note the use of scope = "project" to create a project specific .Rprofile
usethis::edit_r_profile(scope = "project")

Environment management

These lines of code are also probably familiar from the beginning of many an R script:

install.packages("ggplot2")
library(ggplot2)

But what is wrong with these lines?

Well firstly, there is no indication of which version of the package should be installed; if the code installing the package is old, it may not work with the most recent version of the package. (This is less of a problem for well-established packages like those of the Tidyverse, but for less common packages, which may see large changes between versions, it could be substantial.)

Secondly, having the user install an unspecified version of a package could also cause dependency conflicts with other packages required by the code. This is because almost all packages have some form of dependency on (i.e. they use the functionality of) other packages. The image below illustrates this aptly: although now outdated, it showed that in 2014, installing the 7 most popular R packages of the time would actually install 63 packages in total once their dependencies were considered.

Package dependencies of popular R packages [13]
13.

However the problem is bigger than just packages because when your code runs it is also utilising:

  • A specific version of R

  • A specific operating system

  • Specific versions of system dependencies, i.e. other software in other languages that R packages themselves utilise, e.g. GDAL for spatial analysis packages like terra.

All of these things together make up what is known as the ‘environment’ of your code. Hence, documenting and managing this environment is necessary to ensure that your code is reproducible (i.e. it not only runs but also consistently produces the same results).

There are different approaches to environment management that differ in their complexity and hence may be suited to some projects and not others. For the purpose of this workshop we will focus on what we have found to be one of the most user-friendly ways to manage your package environment in R (with a caveat that will be discussed): the package renv. Below we introduce this package in more detail, as it forms a central part of the three workflows for reproducibility that we present.

Creating reproducible environments with renv

As mentioned above, renv is an R package that helps you create reproducible environments for your R projects, not only documenting your package environment but also providing functionality to re-create it.

It does this by creating project-specific libraries (i.e. directories: renv/library) which contain all the packages used by your project. This is different from the default approach to package installation, whereby all packages are stored in a single library on your machine (the system library). Having separate project libraries means “that different projects can use different versions of packages and installing, updating, or removing packages in one project doesn’t affect any other project.” [14]. To make sure that your project uses the project library every time it is opened, renv utilises the functionality of .Rprofile files to set the project library as the default library.

14.
Ushey K, Wickham H (2024) Renv: Project environments

Another key process of renv is to create project specific lockfiles (renv.lock) which contain sufficient metadata about each package in the project library so that it can be re-installed on a new machine.

As alluded to, renv does a great job of managing your packages but is not intended to manage other aspects of your environment, such as your version of R or your operating system. This is why, if you want ‘bullet-proof’ reproducibility, renv needs to be used alongside other approaches such as containerization, which is the third and most complex workflow we will discuss.

Writing clean code

The notion of writing ‘clean’ code can be daunting, especially for those new to programming. However, the most important thing to bear in mind is that there is no objective measure that makes code ‘clean’ vs. ‘un-clean’; rather, we should think of ‘clean’ coding as the pursuit of making your code easier to read, understand and maintain. Also, while we should aspire to writing clean code, it is arguably more important that the code functions correctly and efficiently.

The central concept of clean coding is that, like normal writing, we should follow a set of rules and conventions. For example, in English a sentence should start with a capital letter and end with a full stop. Unfortunately, for writing R code there is not a single set of conventions that everyone subscribes to; instead, numerous styles have been outlined, and the important thing is to choose a style and apply it consistently in your coding.

Perhaps the two most common styles are the Tidyverse style and the Google R style (which is actually a derivative of the former). Neither style can be said to be more correct; rather, they express opinionated preferences on a series of common topics such as object naming, use of assignment operators, spacing, indentation, line length, parentheses placement, etc.

Rather than detail all of these topics here, we will focus on just a few related tips that we think are most relevant for scientific research coding, including how to automate the formatting of your code to a particular style. However, we encourage you to go through the different style guides when you have the time.

Script headers

Starting your scripts with a consistent header containing information about their purpose, author(s), and creation and modification dates is a great step towards making your workflow more understandable and hopefully reproducible. There are no rules as to what such a header should look like, but this is the style I like to use:

#############################################################################
## Script_title: Brief description of script purpose
##
## Notes: More detailed notes about the script and its purpose
##
## Date created: 
## Author(s):
#############################################################################

To save time inserting this header into new scripts, you can use RStudio’s code snippets feature. Code snippets are simply text macros that quickly insert a section of code using a short keyword.

To create your own Code snippet go to Tools > Global Options > Code > Edit Snippets and then add a new snippet with your code below it:

To use a code snippet, simply start typing the keyword in the script; when the auto-completion list appears, press Tab and the code section will be inserted:
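For example, a snippet definition for the header above might look like this in the snippet editor (a sketch: the keyword header is our own choice, body lines must be indented with a tab, and the ${1:...} fields are placeholders you can Tab through after insertion):

snippet header
	#############################################################################
	## Script_title: ${1:Brief description of script purpose}
	##
	## Notes: ${2:More detailed notes about the script and its purpose}
	##
	## Date created: ${3:date}
	## Author(s): ${4:authors}
	#############################################################################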

Code sections

As you may already know braced ({}) sections of code (i.e. function definitions, conditional blocks, etc.) can be folded to hide their contents in RStudio by clicking on the small triangle in the left margin.

However, an often overlooked feature is the ability to create named code sections that can also be folded, as well as easily navigated between. These can be used to break longer scripts into a set of discrete regions according to specific parts of the analysis (discussed in more detail later). In this regard, another good tip is to give the resulting sections sequential alphabetical or numerical prefixes. Code sections are created by inserting a comment line that contains at least four trailing dashes (-), equal signs (=), or pound signs (#):

# Section One ---------------------------------
 
# Section Two =================================
 
# Section Three #############################

Alternatively you can use the Code > Insert Section command.

To navigate between code sections:

  • Use the Jump To menu available at the bottom of the editor [15]

  • Use the document outline pane in the top right corner of the source pane

Automating the styling of your code

There are two R packages that are very helpful in ensuring your code conforms to a consistent style: lintr and styler.

  • lintr checks your code for common style issues and potential programming errors, then presents them to you to correct; think of it like running a ‘spellcheck’ on a written document.
  • styler is more active in the sense that it automatically formats your code to a particular style, the default being the tidyverse style.

To use lintr and styler you call their functions like those of any other package, but styler can also be used through the RStudio Addins menu below the navigation bar, as shown in this gif:

Another very useful feature of both packages is that they can be used as part of a continuous integration (CI) workflow with a version control application like Git. This is a topic we will cover as part of our Version control workflow, but what it means is that the styler and lintr functions are run automatically when you push your code to a remote repository.
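As a flavour of basic usage, both packages can be pointed at a single script (the script path here is illustrative):

# Check a script for common style issues and potential errors;
# the results are listed for you to correct manually
lintr::lint("scripts/01_Data_prep.R")

# Automatically re-format the same script, defaulting to the tidyverse style
styler::style_file("scripts/01_Data_prep.R")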

Workflow decomposition

In computer science, workflow decomposition refers to structuring or compartmentalising your code into separate logical parts, making it easier to maintain [16].

In terms of coding scientific research projects, many of us probably already do some decomposition instinctively, by splitting up typical processes such as data preparation, statistical modelling, analysis of results and producing final visualizations.

However, this is not always realized in the most understandable way. For example, we may have separate scripts with logical-sounding names like Data_prep.R and Data_analysis.R, but can others really be expected to know exactly which order these must be run in, or indeed whether they even need to be run sequentially at all?

A good first step to remedying this is to give your scripts sequential numeric tags in their names, e.g. 01_Data_prep.R, 02_Data_analysis.R. This also ensures that they are presented in numerical order when placed in a designated directory (see Structuring your project directory) and can be explicitly described in your project documentation.

But you can take this to the next level by creating a master script that sources your other scripts in sequence (think of them as sub-scripts) so that users of your code need only run one script. Doing this is as simple as creating the master script as you would any normal R script (File > New File > R script) and then using the base::source() function to run the sub-scripts:

#############################################################################
## Master_script: Run steps of research project in order
##
## Date created: 30/7/2024
## Author(s): Jane Doe
#############################################################################

# Create a shared environment for all sub-scripts to run in
scripting_env <- new.env()

### =========================================================================
### A- Prepare dependent variable data
### =========================================================================

# Prepare LULC data
source("Scripts/Preparation/Dep_var_dat_prep.R", local = scripting_env)

### =========================================================================
### B- Prepare independent variable data
### =========================================================================

# Prepare predictor data
source("Scripts/Preparation/Ind_var_data_prep.R", local = scripting_env)

### =========================================================================
### C- Perform statistical modelling
### =========================================================================

source("Scripts/Modelling/Fit_stat_models.R", local = scripting_env)

As you can see, this example code also makes use of a script header and code sections, as previously discussed, to make the division of sub-processes even clearer. Another advantage of this approach is that all sub-scripts can utilise the same environment (specified via the local argument of source()), which means that each individual script does not need to load packages or define paths as objects.

Finally, within your sub-scripts, processes should also be separated into code sections, and ideally any repetitive tasks should be performed with custom functions, which are again contained within their own files.

Following this approach you end up with a workflow that will look something like this:

The benefit of this hierarchical approach is that individual processes are not only easier to debug and maintain, but the workflow is also more amenable to adding new processes.

Structuring your project directory

Similar to having clean code, having a clean project directory with well-organised sub-directories goes a long way towards making your project's code easier to understand for others. For software development there are numerous sets of conventions for directory structures, although these are not always so applicable to scientific research projects. However, we can borrow some basic principles:

  • Use logical naming

  • Stick to a consistent style, i.e. use of capitalisation and separators

  • Make use of nested sub-directories, e.g. data/raw/climatic/precipitation/2020/precip_2020.rds vs. data/precip_2020_raw.rds. This is very helpful when it comes to programmatically constructing file paths, especially in projects with a lot of data.

As an example my go-to project directory structure looks like this:

└── my_project
    ├── data # The research data
    │   ├── raw
    │   └── processed
    ├── output # Storing results
    ├── publication # Containing the academic manuscript of the project
    ├── src # For all files that perform operations in the project
    │   ├── scripts
    │   └── functions
    └── tools # Auxiliary files and settings

Rather than manually creating this directory structure every time you start a new project, save yourself some time and automate it using RStudio’s Project Templates functionality. This allows you to select a custom template as an option when creating a new RStudio project through the New Project wizard (File > New Project > New Directory > New Project Template).

Implementing this yourself is fairly labour-intensive, even for an intermediate R user, as your custom project directory template needs to be contained within an R package in order to be available in the wizard. However, quite a few templates with directory structures appropriate for scientific research projects have been created by others:

Project documentation

As an example of why documentation is important, imagine you bought a new table from Ikea only to excitedly rip open the box and find that there are no instructions for how to assemble it. Sure, you know what a table is supposed to look like, and given enough time you will end up with something that is probably mostly right, but maybe it’s missing small details. It will also probably take you just as long to take it apart in five years’ time. Working with undocumented code from research projects is similar, except a lot more complicated!

Writing comprehensive documentation that covers all aspects of our projects is time-consuming, which is why it is often neglected. For example, there are many different metadata conventions that you could apply, but learning and adhering strictly to these can be overwhelming and can even have the opposite effect, i.e. the resulting documentation is not simple for others to understand either.

In response to this, there has been a movement in the R research community to adopt the ‘research as a package’ approach, which, as the name suggests, involves creating your project as an R package, with its strict set of conventions for documentation [17]. This is a viable approach for those who are familiar with R packages, but is arguably not the best for all projects and users.

17.
Marwick B, Boettiger C, Mullen L (2018) Packaging data analytical work reproducibly using r (and friends). The American Statistician 72(1):80–88. https://doi.org/10.1080/00031305.2017.1375986

Instead, we would suggest following the maxim of not letting the perfect be the enemy of the good, and focusing on these key areas:

  • Provide adequate in-script commentary: This is perhaps contentious for those from a software development background, but given the choice I would rather read through a script with too many comments than one with too few. Remember, however, that comments should be used to explain the purpose of the code, not what the code is doing. In line with this, use script headers.

  • Document your functions with roxygen skeletons (see below).

  • Include a README file: README files are where you should document your project at the macro-level i.e. what it is about and how it is supposed to work.

The latter two of these are more detailed, so we have provided further information and tips in the sections below.

Function documentation with roxygen2

Base R provides a standard way of documenting a package where each function is documented in an .Rd file (R documentation). This documentation uses a custom syntax to detail key aspects of the functions such as their input parameters, outputs and any package dependencies [18].

18.
Wickham H, Bryan J (2024) R packages (2e), 2nd Edition

In the case of many research projects you will not be creating a package; however, it is still useful to apply this documentation style to your functions, as it is a good way to make them understandable and easier for others to modify. For example, having clear information about the object (e.g. a vector or data.frame) that a function accepts saves others time guessing what the function is expecting if they are trying to use new data.

However, rather than manually writing .Rd files, we can use the roxygen2 package to automatically generate them from a block of comments added to the top of the function scripts. To add this comment block, place your cursor inside a function you want to document and press Ctrl + Alt + Shift + R (or Cmd + Option + Shift + R on Mac), or go to code tools > Insert Roxygen Skeleton (code tools is represented by the wand icon in the top row of the source pane). As you can see in the gif below, when you insert the roxygen block it will already contain the names of the function, its arguments and any returns. You can then fill in the rest of the information, such as the description and dependencies; for a guide to these other fields, see the roxygen2 documentation.

Inserting a roxygen block
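As an illustration, a filled-in roxygen block for a small helper function might look like this (the function and its fields are hypothetical, not from a real project):

#' Summarise a numeric column of a data.frame
#'
#' Computes the mean and standard deviation of a single numeric column,
#' optionally removing missing values first.
#'
#' @param dat A data.frame containing the column to summarise.
#' @param col A character string naming a numeric column in dat.
#' @param na_rm Logical; should missing values be removed? Defaults to TRUE.
#'
#' @return A named numeric vector with elements mean and sd.
#' @export
summarise_column <- function(dat, col, na_rm = TRUE) {
  values <- dat[[col]]
  c(mean = mean(values, na.rm = na_rm), sd = sd(values, na.rm = na_rm))
}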

Tips for README writing

If you look at the source code of R packages or projects that use R in GitHub repositories, you will see that they all contain README.md files. The .md extension denotes Markdown, the most common format for README files in R projects because it can be read by many programs and rendered in a variety of formats. These files are often accompanied by a corresponding README.Rmd file, which generates the README.md file. In this sense, writing the README for your project in Markdown makes sense, and there are tools available to help you do this, such as the usethis package, whose function use_readme_rmd() will create a README.Rmd file for you. However, depending on who you anticipate using your project, you may also want to create your README as a raw text file (.txt), which may be a more familiar format for some users and, again, can be opened by many different programs.

Again, there is not a single standardised format for what should be included in your README file, but here is an example of a README written for one of the authors’ code/data uploads alongside a publication: README.txt

You will see that one of the things this README includes is a tree diagram showing the directory structure of the project right down to the file level. This is a useful way to give an overview of what users should find in the project, to which explanatory notes can be added describing the purpose of each file or directory. Such a diagram can be easily generated using the fs package:

install.packages("fs")
library(fs)

#vector path of the target directory to make a file tree from
Target_dir <- "YOUR DIR"

#produce tree diagram of directory sub-dirs and files and save output using capture.ouput from base R utils.
capture.output(dir_tree(Target_dir), file= 'Dir_tree_output.txt')

3 Workflows for Reproducibility

For this workshop we will outline three different workflows for creating reproducible research projects with R combined with other tools. We have named these workflows as follows:

  1. RStudio project to Zenodo pipeline

  2. Containerization with Docker

  3. Version control with Git and GitHub

These workflows are inter-related in the sense that the second and third build upon elements of the first, and indeed the techniques of the latter workflows can also be combined. The workflows differ in the level of reproducibility they ensure, but the trade-off for better reproducibility is increased complexity. As such, the most reproducible workflow may not always be the most appropriate to implement, depending on the needs of your research project and the capabilities of the collaborators involved.

Of course, these workflows are by no means the only way of doing things, and indeed we would actively encourage you to expand upon them in developing your own preferred approach.

RStudio project to Zenodo pipeline

The graphic above shows the main steps of this workflow. It starts with developing your research project as an RStudio project following the good-practice project guidelines we have discussed. Then, the renv package is used to manage the project environment so that others can re-create it. Finally, the code, data and environment are uploaded to the open-access repository Zenodo, which provides a DOI for your work, ensuring long-term accessibility and reproducibility.

The renv package helps maintain the R environment, allowing others to recreate the environment in which your analysis was conducted. By combining renv with Zenodo, you create a comprehensive solution for reproducible research. renv ensures that the computational environment is captured, while Zenodo makes your research outputs accessible and citable, supporting the FAIR principles of findability, accessibility, interoperability, and reusability [4].

4.
Wilkinson MD, Dumontier M, Aalbersberg IjJ, et al (2016) The FAIR Guiding Principles for scientific data management and stewardship. Scientific Data 3(1):160018. https://doi.org/10.1038/sdata.2016.18

The next sections provide some additional detail on environment management with renv and some background and practical tips on Zenodo. You will have the chance to apply this workflow to an example project in the accompanying exercise.

Environment Management with renv

renv is a powerful R package designed to help manage project environments by creating project-specific libraries and lockfiles. As mentioned earlier, renv captures the exact versions of R packages used in a project, storing this information in a renv.lock file. This allows users to recreate the exact package environment when revisiting a project or transferring it to a different machine, ensuring reproducibility.

The renv workflow is straightforward:

  • Initialize renv in a project: renv creates a separate library in the project folder, isolating the packages from the system-wide library.

  • Snapshot dependencies: renv scans the project, identifying which packages are being used and recording their versions in the lockfile.

  • Restore environments: Anyone cloning or receiving the project can run renv::restore() to install into the project library the exact versions of the packages listed in the lockfile, reproducing the original project package environment (see the command sketch below).
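In code, this workflow boils down to three calls (shown here as a sketch; each function prompts you interactively where needed):

# One-time setup: create the project library and lockfile (renv/library, renv.lock)
renv::init()

# After installing or updating packages, record the current state in the lockfile
renv::snapshot()

# On a new machine or a fresh copy of the project: re-install the recorded versions
renv::restore()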

One of the core strengths of renv is its flexibility. It integrates seamlessly with tools like RStudio, allowing easy management of dependencies without disrupting existing workflows. This makes it particularly well-suited for ensuring that research projects are reproducible across different systems and platforms.

However, renv does not manage the entire system environment (such as the version of R itself or external dependencies like system libraries). For complete reproducibility, combining renv with containerization tools (like Docker) or publishing outputs (such as code or data) via repositories like Zenodo is recommended.

Zenodo as a research repository

Zenodo is a platform created under the European Commission’s OpenAIRE project in partnership with CERN to publish, archive, and share scientific research outputs, including datasets, code, and publications.

Of course there are many other similar research repositories, such as Dryad, Figshare, Mendeley Data and OSF, but we recommend Zenodo for several reasons:

  • Generous upload size of 50GB (100 files) per record

  • Aligns with FAIR and Open Science principles: The practical features of Zenodo that ensure this are described in its principles

  • Ability to create communities: Zenodo Communities are used to group similar records together. This is useful for creating a collection of related research outputs, either for a research group or a large-scale funded project.

  • Long term preservation with assignment of DOIs: Each item published on Zenodo is assigned a permanent Digital Object Identifier (DOI), which is a better way than a URL to cite the record in academic writing.

  • Open source: This means that Zenodo is not just free to use but you can even see the code it is built on and contribute to it.

  • Versioning functionality: Every record starts with a first version, and new versions can be added as the research is updated, while earlier versions remain accessible. This is crucial in scientific research, where updated analyses and data corrections are often necessary but transparency around earlier versions of the work should be maintained.

  • Integration with GitHub: When a research project (e.g., code) is hosted on GitHub, Zenodo can be used to archive the repository upon each new release, creating a snapshot with a DOI. This means that a version of the code can be more easily cited in scientific publications.

  • Application Programming Interface (API) to access records programmatically: This is a useful feature as it allows for interfacing with Zenodo records without using the website, and is the backbone of the zen4R package, which allows records to be published directly from R, as we discuss in more detail below.

Publishing to Zenodo with zen4R

The zen4R package [19] provides functions to interact with Zenodo’s API directly from R. The package allows you to:

  • Retrieve the metadata of, and download, existing Zenodo records.

  • Create new records and versions of records, write their metadata and upload files to Zenodo.

We will use zen4R to publish the code, data, and environment of our example project to Zenodo in the accompanying exercise.
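As a taste of what this looks like, a minimal deposit might proceed as follows (a sketch based on the zen4R documentation; the token, metadata and file name are placeholders, and method names may differ between zen4R versions):

library(zen4R)

# Connect to Zenodo using a personal access token
zenodo <- ZenodoManager$new(token = "<YOUR_ZENODO_TOKEN>")

# Create a record and fill in some minimal metadata
rec <- ZenodoRecord$new()
rec$setTitle("Example reproducible research project")
rec$setDescription("Code, data and renv lockfile for the example project")

# Deposit the record as a draft, upload a file, then publish
rec <- zenodo$depositRecord(rec, publish = FALSE)
zenodo$uploadFile("project_archive.zip", record = rec)
zenodo$publishRecord(rec$id)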

Containerization with Docker

The title of this workflow raises two questions, the first being: what is containerization? And the second: what is Docker?

Containerization

Simply put, containerization is the process of bundling code along with all of its dependencies, i.e. all the components we discussed as making up the environment, including the operating system, software libraries (packages), and other system software. The fact that everything needed to run the code is included means that the code is portable and can be run on any platform or cloud service. This also makes containerization something of a gold standard for reproducibility, as the entire environment is explicitly re-produced.

Docker

Docker is the most popular open-source platform for containerization. Before we dive into a practical example of using Docker for research projects with R, it is important to introduce three key terms that we will come across:

  • Dockerfile: The first step in the containerization process. A Dockerfile is a straightforward text file containing a collection of commands or procedures used to create a new Docker Image; in this sense we can consider a Dockerfile the source code of a Docker Image. Importantly, Dockerfiles typically start from a base image, an existing Docker Image that your image extends.

  • Docker Image: A read-only file that contains the instructions for creating a Docker Container. Think of an image as the blueprint of what will be in a container when it is running. Docker Images can be shared via Docker Hub so that they can be used by others.

  • Docker Container: An actual running instance of a Docker Image. It runs completely isolated from the host environment by default and only accesses host files (i.e. data) if it has been configured to do so. It is possible to create multiple containers simultaneously from the same Docker Image, and each container can be started, stopped, moved, and deleted independently of the others.

The graphic above shows the relationships between these components, including the central Docker commands that connect them: build and run.

Using Docker with R

So, to create a Docker Image that containerizes our R research projects, we need to start by writing a Dockerfile and, as mentioned above, this should start from a base image. In our case this base image must logically include R and RStudio (if we want to utilise the RStudio Projects features).

Fortunately there is a project that specifically catalogs and manages Docker Images for R projects: Rocker. The images available through the Rocker project not only include different versions of R and RStudio but also images containing collections of R packages for specific purposes (e.g. tidyverse for data wrangling and visualisation, geospatial packages etc.).

In terms of actually creating the Dockerfile for our R project, this can be done manually (see a good R-focused tutorial here), but there are also R packages that can help with this process, such as dockerfiler and rrtools (https://github.com/benmarwick/rrtools).

For our exercise of this workflow we will use the dockerfiler package, which creates a custom class object representing the Dockerfile, with methods corresponding to the common Dockerfile instructions. This allows us to build up the Dockerfile in a more R-like way. The following code snippet demonstrates adding maintainer details to a Dockerfile object before saving it:

library(dockerfiler)
# Create a dockerfile template
my_dock <- Dockerfile$new()

# Add maintainer
my_dock$MAINTAINER("Jane Doe", "jane_doe@gmail.com")

# Save
my_dock$write()
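Building on this, a fuller sketch might combine a Rocker base image with instructions that copy in the project and restore its renv library (the image tag, paths and commands are illustrative; the R6 methods mirror the corresponding Dockerfile instructions):

library(dockerfiler)

# Start from a Rocker image bundling specific R and RStudio versions
my_dock <- Dockerfile$new(FROM = "rocker/rstudio:4.4.1")

# Copy the project into the image and restore its package library
my_dock$COPY(".", "/home/rstudio/my_project")
my_dock$WORKDIR("/home/rstudio/my_project")
my_dock$RUN("R -e 'install.packages(\"renv\"); renv::restore()'")

# Save the result to a Dockerfile in the working directory
my_dock$write()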

Docker with renv

Docker can be used with the renv package to manage the package environment of your project.

There are two methods of implementing this which come with their own considerations:

  1. Use renv to install packages when the Docker image is built: This approach is useful if you plan to have multiple projects with identical package requirements. This is because, by creating an image containing this package library, you can simply re-use the image as a base for new images for different projects [20]. Warning: Restoring the package library (renv::restore()) when building the image will be slow if there are large numbers of packages, so try to avoid the need to re-build the base image many times.

  2. Use renv to install/restore packages only when Docker containers are run: This approach is better when you plan to have multiple projects that are built from the same base image but have different package requirements. Here it is preferable not to include the package library in the image but instead to mount different project-specific libraries to the container when it is run [20]. If project libraries are dynamically provisioned in this way and renv::restore() is run with caching, the packages are not re-installed every time the container is run.

20.
Ushey K, Wickham H (2024) Using renv with docker

Version control with Git and GitHub

Version control software has multiple uses, including providing a systematic procedure for collaboration and teamwork, as well as making it easier for other users, or the original researcher, to reproduce the work. We all know the difficulty of tracing back the workflow of a project a few months after having shelved it, and it is here that Git and GitHub can be highly useful [21].

About Git

Git allows us to make a snapshot, or record, of the changes made to a script, and to store it with a message that describes the change. In this way, even after multiple updates, the history is preserved, allowing us to revert, compare and systematically trace the development of the workflow.

Git is also useful for data scientists and researchers who work individually yet want to create systematically reproducible workflows with version control [21].

21.
Alexander R (2023) Telling stories with data, 1st Edition

GitHub repositories and functionalities

The version-controlled code and all other auxiliary files related to the project are stored in a GitHub repository, created by the user in their GitHub account. A repository can be set as either public or private, depending on the user's need for visibility of their work.

To help with version control, GitHub repositories provide multiple functionalities like creating ‘branches’ and ‘clones’, ‘merging’ multiple branches, setting up a ‘pull request’ before a merge, etc. As we get into more complicated workflows handled by multiple developers, GitHub offers many more functionalities as checks and balances on code development. However, we will limit ourselves to what is needed to create work with a version control history that allows for small-scale collaboration, with the main goal of creating reproducible research (see more details).

Basic functionalities

  • Creating a repository and setting up user authorisations: A project repository must first be set up on GitHub as either a private or a public repository. If it is not an individual project, collaborators can be added with appropriate (read or write) permission levels (see more details). It is also good to write a detailed README file to help viewers get an idea of the repository (see more details).

  • Push and Pull: The data and code related to a project must first be cloned from the remote version to a local version before changes are made. Make sure to pull from updated branches (i.e. merged branches; see below) before making changes. Once changes are made, the user must ‘commit’ all the correct changes; the changed code can then be pushed back to the branch (a code sketch of these basic operations follows this list).

  • Branching and merging: Git allows users to create branches that feed into the ‘main’ branch of a project repository on GitHub. Each branch can be created either for different tasks or for different users, as required. To merge branches into ‘main’, users set up a pull request, which needs to be approved by an authorised user.

  • Git additionally provides many more functionalities for identifying differences between changed files or between branches, making temporary commits and reverting back to a certain commit in the history. However, this is beyond the scope of this workshop.
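For those who prefer to stay inside R rather than use the command line or RStudio's Git tab, the gert package wraps these basic operations (a sketch; the repository URL and file name are placeholders):

library(gert)

# Clone the remote repository to a local folder
git_clone("https://github.com/jane-doe/my_project.git", path = "my_project")

# After editing files (run from within the cloned repository):
# stage, commit and push the changes
git_add("scripts/01_Data_prep.R")
git_commit("Fix bug in data preparation")
git_push()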

Git in RStudio

When a GitHub repository is connected to an R project, RStudio adds a ‘Git’ tab (see image below) with ‘push’, ‘pull’, ‘commit’ and ‘diff’ functionalities. We can switch between branches to pull from or push to, and additionally trace the history of changes in the GitHub repository by all users.

Git integration into RStudio

Alternatively, you can use GitHub Desktop to perform the same functionalities.

Additional functionalities

  1. Public GitHub repositories can now be archived on Zenodo as a permanent record of the work, with a Digital Object Identifier (DOI) that can be cited in academic work (click to see more details).

  2. GitHub further provides advanced functionalities like GitHub Actions, which allows the user to automate certain processes for application development or project management.

  3. GitHub releases can be automatically published on Docker Hub or GitHub Packages as part of a Continuous Integration (CI) workflow (see more details).

4 Quarto

Quarto is an open-source scientific and technical publishing system developed by Posit, the same company that created RStudio. Quarto allows you to integrate code in multiple programming languages, written material, and a wide variety of interactive visual components into a range of different document formats. If you are familiar with R Markdown then you will find Quarto familiar, as it is in many ways an evolution of it.

The Quarto website presents many examples of applications of the software, but here we will focus on some of its key uses and features that are relevant to academics and producing reproducible research.

Writing academic manuscripts

We all know how painful it can be switching between multiple programs to produce academic manuscripts: maybe you write your text in Word and produce your visualizations in R or Python before having to convert the end result to PDF for submission. This is especially annoying when you need to update parts of the manuscript during the review process.

Quarto solves this problem by allowing you to write full academic manuscripts from start to finish including text, code, and visualizations in a single program. This functionality has been expanded even further with the release of Quarto Manuscripts as a project type from Quarto version 1.4.

Some key benefits of writing your manuscript with Quarto include:

  • Figures and tables are dynamically updated as your code changes

  • Supports the use and inclusion of R, Python and Julia code as well as LaTeX and Markdown

  • Simple cross-referencing capability for figures, tables, and sections

  • Documents can be rendered as Word, PDF, or HTML

  • Easily include citations and bibliographies from Crossref, DataCite, and PubMed, as well as via direct integration with Zotero

  • High quality formatting options for equations and tables

  • Quarto’s .qmd files can be edited with various code/text editors including VS Code, RStudio and more

  • Track changes and collaborate using Git or other version control systems.

Writing your academic manuscripts with Quarto makes them more reproducible, as it allows others to use your underlying manuscript file in combination with your data to directly re-create your results.
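To give a flavour, a minimal .qmd manuscript combines YAML metadata, prose, citations and executable code in one file (a sketch; the citation key and content are illustrative):

---
title: "Example manuscript"
author: "Jane Doe"
format: pdf
bibliography: references.bib
---

## Introduction

Reproducibility matters for modern research [@alston2021].

```{r}
#| label: fig-pressure
#| fig-cap: "A figure generated from code at render time"
plot(pressure)
```

As @fig-pressure shows, the figure and its cross-reference update automatically whenever the document is rendered.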

In one of our exercises you will practice creating a manuscript with Quarto for an example project.

Presentations with RevealJS

Quarto also allows you to create presentations in several formats (RevealJS, Microsoft PowerPoint and Beamer) using a common syntax. In fact, the presentation document for this workshop was created using Quarto and RevealJS.

Some useful features of making presentations with Quarto include:

  • Selection of pre-existing modern themes with functionality to publish your own theme.

  • Include interactive content: Executable code blocks and visual components (graphs and maps)

  • Dynamic resizing of content depending on screen size

  • Functionality for slide notes, automatic transitions, timers etc.

  • Easy export to PDF or HTML

  • Similar to manuscripts, code-based content is dynamically updated.

Websites

Quarto can also be used to create websites that can be freely hosted through GitHub Pages or other services like Netlify or Posit Connect. This is a great way to share your research with a wider audience or promote your work. Creating Quarto websites is an intuitive and user-friendly process, and because Quarto uses the Bootstrap framework there is a lot of guidance available for customizing beyond the 25 default themes.
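As an indication of how simple the setup is, a minimal _quarto.yml configuration for a website project might look like this (titles, pages and theme are illustrative):

project:
  type: website

website:
  title: "My Research"
  navbar:
    left:
      - href: index.qmd
        text: Home
      - href: publications.qmd
        text: Publications

format:
  html:
    theme: cosmo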

Some examples from the authors:

  • This website is created with Quarto and hosted on GitHub Pages, and you can see all of its source code here.

  • This is an example of personal website created by Ben to share publications, presentations and blog content: https://blenback.github.io/

Researcher's personal website

NASCENT-Peru website

Dashboards to demonstrate your research output

Quarto dashboards allow you to arrange multiple interactive or static components in a single page with a highly customizable layout. These components can include text summaries, tables, plots, maps and more. This is a great way to collect feedback on aspects of your research during the development process or to present your results in a visually appealing way.

Here is an interactive example produced using Python interactive plots:

Gapminder dashboard from JJallaire.

Data exploration and visualization

As has already been mentioned, Quarto provides a lot of options for creating interactive data visualisations, tables and diagrams using frameworks such as:

There are lots of examples of these on the Quarto website, but one we particularly like is Leaflet, which allows you to create interactive maps with markers, popups and other features. In a research context this is useful for letting others explore the spatial results of your research. Here is a simple example we used to show the locations and timings of workshops we were conducting in Peru.

library(leaflet)
library(fontawesome)
library(mapview) # only needed if saving the map (e.g. with mapview::mapshot())

# Icon styles for the national and regional workshops
national <- makeAwesomeIcon(text = fa("people-roof"),
                            iconColor = 'white',
                            library = 'fa',
                            markerColor = 'darkgreen')

regional <- makeAwesomeIcon(text = fa("people-roof"),
                            iconColor = 'white',
                            library = 'fa',
                            markerColor = 'lightgreen')

# Shared label styling for all markers
label_opts <- labelOptions(#noHide = TRUE,
                           direction = "bottom",
                           style = list("color" = "black",
                                        "font-family" = "Roboto",
                                        "font-style" = "italic",
                                        "box-shadow" = "3px 3px rgba(0,0,0,0.25)",
                                        "font-size" = "12px",
                                        "border-color" = "rgba(0,0,0,0.5)"))

#uncomment for saving map
#Workshop_locations <-
leaflet(options = leafletOptions(zoomControl = FALSE,
                                 attributionControl = FALSE)) %>%
  #addProviderTiles(providers$Esri.WorldTerrain) %>%
  addTiles() %>% # Add default OpenStreetMap map tiles
  addAwesomeMarkers(lng = -77.043, lat = -12.038, icon = national,
                    popup = "Lima, National Workshop 9th May",
                    label = "Lima, National Workshop, 9th May",
                    labelOptions = label_opts) %>%
  addAwesomeMarkers(lng = -69.189, lat = -12.594, icon = regional,
                    popup = "Puerto Maldonado, Regional Workshop",
                    label = "Puerto Maldonado, Regional Workshop 28th May",
                    labelOptions = label_opts) %>%
  addAwesomeMarkers(lng = -77.529, lat = -9.526, icon = regional,
                    popup = "Huaraz, Regional Workshop",
                    label = "Huaraz, Regional Workshop 4th June",
                    labelOptions = label_opts) %>%
  addAwesomeMarkers(lng = -76.370, lat = -6.485, icon = regional,
                    popup = "Tarapoto, Regional Workshop",
                    label = "Tarapoto, Regional Workshop 10th June",
                    labelOptions = label_opts) %>%
  addAwesomeMarkers(lng = -80.633, lat = -5.194, icon = regional,
                    popup = "Piura, Regional Workshop",
                    label = "Piura, Regional Workshop 20th May",
                    labelOptions = label_opts)

5 Guided exercises

Now it’s time to get hands-on with some guided exercises that will let you put into practice the three workflows for reproducible research that we have introduced, as well as writing an academic manuscript using Quarto. Each exercise is available to download as a PDF if you prefer to view the instructions offline, and the files required for each are available to download at the start of each exercise.

RStudio Project to Zenodo exercise

In this exercise you will set up a basic RStudio project following the good practices for organizing data, writing clean code and decomposing your workflow introduced in the Research projects with R section. We’ll also initialize renv to manage package dependencies, ensuring reproducibility. Once the project is set up, we’ll go through the process of publishing the project to Zenodo as described in the Zenodo workflow section.

If you would prefer to view the exercise script offline, here is a PDF version: Download exercise instructions

Step 1: Download the exercise resources

  • Click here to download the resources for the exercise: Download resources for exercise
  • Unzip the downloaded file and move the folder to a location on your computer where you can easily find it.

Step 2: Create a new RStudio Project

Start by creating a new RStudio project in the root of the exercise_1_data directory you have just downloaded. You can name the project whatever you like, but in this example we have named it Rice_farm_analysis.

  • Open RStudio.
  • Create the project using: File > New Project > Existing Directory.
  • Select the exercise_1_data folder as the location and give the project a name, for example Rice_farm_analysis.

This creates a .Rproj file in the root of your project to help manage the workspace and project-specific settings.

Step 3: Organize Your Data

It’s good practice to organize raw and processed data in separate folders. Let’s start by organizing the data:

  • Create a directory Data/Raw inside your project folder.
  • Move the provided CSV file into this Data/Raw directory.

Step 4: Organize and Split Your Scripts

We’ll now organize the project’s scripts by splitting the original script into separate analysis and visualization scripts.

  • Create a Scripts folder inside your project directory.
  • Move the original RiceFarm_project.R script into the Scripts folder.

  • Create two new scripts named 01_data_analysis.R and 02_data_visualisation.R.

For 01_data_analysis.R

  • Copy the following code from RiceFarm_project.R:

    • The call to the relevant library library(stringr)
    • Everything before the call to the ggplot() function
  • In addition to this, replace the setwd() function with this code to set up relative paths, create a directory to save the processed data in and save rice_data_summary to disk after processing:

# Define relative data paths and create the processed data directory
raw_dir <- "Data/Raw"
save_dir <- "Data/Processed"
dir.create(save_dir,
           showWarnings = FALSE,
           recursive = TRUE)

# Load csv file of data
rice_data <- read.csv(file.path(raw_dir, "RiceFarms.csv"))

# Save the summarized data
write.csv(rice_data_summary,
          file.path(save_dir, "RiceFarms_summary.csv"),
          row.names = FALSE)
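
Note that rice_data_summary itself is produced by the summary code you copied over from RiceFarm_project.R. If you want a self-contained sketch of what that step does (assuming the columns region, varieties and size that are used in the plot later), it is roughly equivalent to:

# Hypothetical stand-in for the summary step copied from RiceFarm_project.R:
# total farm size per region and rice variety
rice_data_summary <- aggregate(size ~ region + varieties,
                               data = rice_data,
                               FUN = sum)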

For 02_data_visualisation.R

  • Copy the following code from RiceFarm_project.R:

    • The call to the relevant library library(ggplot2)
    • The call to the functions ggplot() and ggsave()
  • Add the following code after the library() call to create an output directory for the plots and load the summarized data from the processed data folder:

# Directory for saving plots
plot_dir <- "Output/Visualisations"
dir.create(plot_dir, showWarnings = FALSE, recursive = TRUE)

# load the summarized data
rice_data_summary <- read.csv("Data/Processed/RiceFarms_summary.csv")

Step 5: Add Script Headers

  • Add headers to both new scripts. You can use this template:
# -----------------------------------------------
# Script Name: [01_data_analysis.R / 02_data_visualisation.R]
# Project: Rice Farm Analysis
# Purpose: [Data analysis / Data visualization]
# Author: [Your Name]
# Date: [YYYY-MM-DD]
# -----------------------------------------------

Step 6: Create a master script

As a next step you will create a master script that runs both the data analysis and visualization scripts.

  • In the root of your project, create a new file named RiceFarm_master.R

  • Add a header as in Step 5.

  • Add the following code snippet to the script to source 01_data_analysis.R and 02_data_visualisation.R:

### =========================================================================
### 01- Data analysis
### =========================================================================

# Source the data analysis script
source("Scripts/01_data_analysis.R")

### =========================================================================
### 02- Data visualization
### =========================================================================

# Source the data visualization script
source("Scripts/02_data_visualisation.R")

Running this master script will execute both analysis and visualization steps.

Step 7: Initialize renv to manage package dependencies

We will use renv to make your project’s package environment reproducible.

  • Install the renv package:
install.packages("renv")
  • Run the following command in your master script to set up the project-specific environment
renv::init()

This creates a local library for your project and captures the required packages.
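
At any point you can check whether the project library and the lockfile agree using renv::status():

# Report packages that are used by the project but not yet recorded in the
# lockfile (or recorded but not installed)
renv::status()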

  • Once the initialization is complete, run:
renv::snapshot()

This records the project’s environment in a renv.lock file, which is essential for reproducibility.

Step 8: Automate opening of the master script

For convenience, we will configure RStudio to automatically open the master script when the project is loaded.

  • Install the rstudioapi package: install.packages("rstudioapi")

  • Open the .Rprofile file in the root of your project directory. The file might be hidden: on Windows click View > Show > Hidden items in the Explorer, and on macOS press Command+Shift+Dot within the root directory to see the file.

  • Add the following R code to the .Rprofile file:

setHook("rstudio.sessionInit", function(newSession) {
  if (newSession)
    rstudioapi::navigateToFile('RiceFarm_master.R', line = -1L, column = -1L)
}, action = "append")

Step 9: Re-snapshot the Project

After modifying the .Rprofile file, it’s important to capture these changes in the renv.lock file.

  • Run the following command in your master script to ensure that the rstudioapi package (which enables automatic script opening) is included in the snapshot:
renv::snapshot()
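
With the lockfile in place, anyone who later opens the project (including future you, or a collaborator) can rebuild the same package library from renv.lock:

# Restore the project library from renv.lock
# (run in the project root; renv is activated by the project's .Rprofile)
renv::restore()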

Now we have a nicely organised project structure, with the workflow decomposed into separate scripts and a master script to run the whole project.

Step 10: Install zen4R to access Zenodo through R

The following steps are heavily based on [19]. We have extracted the most relevant parts to explain the workflow. If you are interested in more details, check out their user manual at: https://cran.r-project.org/web/packages/zen4R/vignettes/zen4R.html.

19.
Blondel E, Barde J (2024) zen4R: R Interface to Zenodo REST API

For this exercise we will not be using Zenodo directly but the Zenodo Sandbox. The Zenodo Sandbox is a separate, secure testing environment where users can explore Zenodo's features without impacting the main platform's publicly accessible data. It allows you to test file uploads, generate test DOIs, and experiment with API integrations. DOIs created in the sandbox are only for testing and use a different prefix. You will need a separate account and access token for the sandbox, distinct from those used on Zenodo's main site.

Zenodo can be accessed with the R package zen4R to upload, edit, publish and download data.

  • Create a new R script outside of the project directory.
  • Install zen4R library with the following code:
#install dependency "remotes"
install.packages("remotes")

#install zen4R from GitHub
remotes::install_github("eblondel/zen4R")

Step 11: Create a new Zenodo record

A Zenodo record includes metadata, data and a Digital Object Identifier (DOI) which is automatically generated by Zenodo for all uploads. But before you can add records to Zenodo, you need to get access to your account through R.

library(zen4R)
        
#Create manager to access your Zenodo repository
zenodo <- ZenodoManager$new(
    token = "your_token",
    sandbox = TRUE,
    logger = "INFO" 
    )
        
##Prepare a new record to be filled with metadata and uploaded to Zenodo
myrec <- ZenodoRecord$new()

If you want to connect to Zenodo and not Zenodo Sandbox, create the token in your Zenodo account and remove the line sandbox = TRUE from the code above.

The types of metadata that can be included in a Zenodo record are vast. A full list can be found in the documentation.

  • Copy and run the example below to add metadata to your record.
myrec$setTitle("RiceFarm") #title of the record
myrec$addAdditionalTitle("This is an alternative title", type = "alternative-title")
myrec$setDescription("Calculating statistics of RiceFarm dataset") #description
myrec$addAdditionalDescription("This is an abstract", type = "abstract")
myrec$setPublicationDate("2024-09-16") #Format YYYY-MM-DD
myrec$setResourceType("dataset")
myrec$addCreator(firstname = "Yourfirstname", lastname = "Yourlastname", role = "datamanager", orcid = "0000-0001-0002-0003")
myrec$setKeywords(c("R","dataset")) #For filtering
myrec$addReference("Blondel E. et al., 2024 zen4R: R Interface to Zenodo REST API")

A record can be deposited on Zenodo before it is published. This will add the record to your account without making it public yet. A deposited record can still be edited or deleted. You can also upload data to a deposited record. If you prefer a graphical interface, you can also edit the record on the Zenodo website.

  • Deposit the record on Zenodo:
#deposit record (publish = FALSE keeps it unpublished for now)
myrec <- zenodo$depositRecord(myrec, publish = FALSE)
  • View the deposited record at https://sandbox.zenodo.org/me/uploads?q=&l=list&p=1&s=10&sort=newest

  • Compress your project directory to a .zip file.

  • Upload the .zip file to your deposited record:

#add data to the record, adjust the path below
zenodo$uploadFile("path/to/your/file", record = myrec)
  • Publish the record:
#make the record publicly available on Zenodo (Sandbox).
myrec <- zenodo$publishRecord(myrec$id)

Step 12: Edit a published Zenodo record

  • It is also possible to edit or update the metadata of published records:
#get your record by metadata query, e.g. by title
myrec <- zenodo$getDepositions(q='title:RiceFarm')

#get depositions creates a list, access first element
myrec <- myrec[[1]]

#edit metadata
myrec <- zenodo$editRecord(myrec$id)
myrec$setTitle("RiceFarm 2.0")

#redeposit and publish the edited record
myrec <- zenodo$depositRecord(myrec, publish = TRUE)
  • Once a record has been published, it is not possible to edit the data that has been attached to it. However, it is possible to upload an updated version of the data. The previous version of the data will remain accessible via Zenodo. The record will have one overall DOI, while each version will have its own DOI.

  • Reconnect to your account:

zenodo <- ZenodoManager$new(
      token = "your_token",
      sandbox = TRUE,
      logger = "INFO")
  • Access your record:
#get your record by querying the metadata, e.g. by title; this will give you a list of all records with that title
myrec <- zenodo$getDepositions(q='title:RiceFarm')
    
#access the first item in the list, as there should only be one record with that particular title
myrec <- myrec[[1]]
  • Rename your .zip file on your computer
  • Upload the renamed .zip file:
#edit data, delete_latest_files = TRUE to not include data of previous version in newer version
myrec <- zenodo$depositRecordVersion(myrec, delete_latest_files = TRUE, files = "path/to/your/new/file", publish = TRUE)
  • Again, go to https://sandbox.zenodo.org/me/uploads?q=&l=list&p=1&s=10&sort=newest
  • Activate “View all versions” on the left hand side.
  • Check that both versions show up.
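
To close the loop, zen4R can also download the data attached to a published record. Here is a minimal sketch using its download_zenodo() helper (the DOI below is a placeholder; replace it with the DOI of your own record, and note that sandbox records may require the sandbox-specific options described in the zen4R manual):

library(zen4R)

# Download all files attached to a published record into a local folder
dir.create("Downloaded_record", showWarnings = FALSE)
download_zenodo(doi = "10.5281/zenodo.1234567", path = "Downloaded_record")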

Containerization with Docker exercise

In this exercise we will create and run a Docker container for an example R project. The project is the same as the one created in the first workflow exercise; however, to save time, or in case you haven’t completed that exercise, we will start from its finished output.

Warning: Docker is complex software, and getting Docker Desktop running on different machines is not always smooth. For example, I had no problem getting it running on my desktop computer, but my work laptop did not have the capabilities. If you do run into issues there is good support available online, but asking your IT department for help may also be a good idea.

If you would prefer to view the exercise script offline, here is a PDF version: Download exercise instructions

Step 1: Download the resources

  • Click here to download the resources for the exercise: Download resources for exercise

  • Unzip the downloaded file and move the folder to a location on your computer where you can easily find it.

Step 2: Download Docker

  • Download Docker Desktop for your operating system from the Docker website.

  • Once downloaded, run the installer as you would for other software. If your computer is managed by your institution or employer you will likely need an admin account to run the installer, and you may need to restart your computer after installation.

  • While you are running the installer it is useful to make a Docker account. This is not necessary but can be useful for managing your containers. You can also sign in with your GitHub account.

Step 3: Open Docker desktop

  • Open the Docker desktop app. If the app does not open you may need to add yourself to the program user-group on your computer. This is a common issue on Windows machines because only the admin account is added to the user-group by default. To add yourself, search for Computer Management in the Start menu, right-click it and select to run it with admin privileges. Then navigate to Local Users and Groups -> Groups -> Docker Users. Right-click on Docker Users, select Add to Group, and add your user account to the group.

  • Once the Docker desktop app is open it should automatically start the Docker engine, which is the software that runs the containers. In the bottom left of the app window you will see the status of the engine.

Docker Engine status in app

Alternatively, if you look in the system tray on Windows or the top menu bar on Mac, you will see an icon of the Docker whale logo; clicking on this also shows the status of the engine.

Docker Engine status in system tray

Step 4: Creating the Dockerfile

  • Open Rstudio and navigate to the folder you downloaded in step 1.
  • Create a new R script and name it Create_Dockerfile.R.
  • Install the Dockerfiler package: install.packages("dockerfiler").
  • Add the following code to the script and replace the entries with your details:
# Load dockerfiler package
library(dockerfiler)

# Get your R version to select a base image to use for your image/container
R.Version()$version.string 

# Create a dockerfile template object using the Dockerfile class from the
# dockerfiler package and specify your version of R in the base image name
# my version is 4.3.1 hence the base image is rocker/r-ver:4.3.1
# but you should replace the end of this string with your version number from above
RiceFarm_dock <- Dockerfile$new(FROM = "rocker/r-ver:4.3.1")

# Add maintainer information (replace with your details)
RiceFarm_dock$MAINTAINER("Your_name", "Your_email")

# By default docker images contain a home directory and because our project
# is simple we will move the files we need there

# Copy the data directory 
# (1st argument is the source, 2nd is the destination in the container)
RiceFarm_dock$COPY("/Data", "/home/Data")

# Copy the scripts directory
RiceFarm_dock$COPY("/Scripts", "/home/Scripts")

# Copy the master script
RiceFarm_dock$COPY("/RiceFarm_master.R", "/home")

# For our project we need "ggplot2" and "stringr" packages
# We could try to find a base image on Rocker that has these installed
# But because we are not using lots of packages lets just install them in the container
# Note that the R commands are wrapped in `r()` which is a helper function from dockerfiler
# that then wraps the command in the correct syntax for the Dockerfile
RiceFarm_dock$RUN(r(install.packages("ggplot2")))
RiceFarm_dock$RUN(r(install.packages("stringr")))

# Set the working directory so that the relative paths in the scripts resolve
RiceFarm_dock$WORKDIR("/home")

# Add the command to run the master script
# Note the use of `Rscript` which is the command line tool included with R to run scripts
RiceFarm_dock$CMD("Rscript RiceFarm_master.R")

# Save the Dockerfile
RiceFarm_dock$write()

# Create a dir in the host directory to receive the results from the container
dir.create("Output", showWarnings = FALSE)
  • After running this code you will see that a Dockerfile has been created in the directory where you downloaded the resources.

Step 5: Creating the Docker image

  • The Docker command build is used to create a Docker image from the instructions contained in your Dockerfile.

  • The build command should be called through a Command Line Interface (CLI) such as the terminal in Rstudio or the CLI of your operating system (e.g. Command Prompt on Windows).

  • In Rstudio switch to the terminal tab next to the console pane:

Rstudio terminal
  • Run the following command: docker build -t ricefarm_01 . Note: the -t flag tags the image with a name (in this case ricefarm_01). The . at the end of the command specifies the current directory as the location of the Dockerfile to be used.

  • After running the command you will see the Docker engine pulling the base image from the Docker Hub and then building the image. This process can take a few minutes depending on the size of the base image and the number of packages you are installing. The output in the terminal will look something like this:

Docker build output in terminal
  • Once the image has been built you can check that it is there by running the command docker images in the terminal. This will show you a list of all the images on your computer. You should see the image you just created in the list.

  • Alternatively you can check the image in the Docker desktop app. You will see the image in the list of images on the left of the app window. You can inspect the image by clicking on it and see the details of the image:

Docker build output in app

Step 6: Running the Docker container

  • The Docker command run is used to run a container from an image.

  • This can be done through the CLI: in the terminal tab in Rstudio run the following command: docker run ricefarm_01.

  • Or in the Docker desktop app: Click on the image you want to run and then click the run button in the top right. This will open a window where you can specify the settings for the container but for now you should just run the container with the default settings.

Docker images in app
  • After running the container you can also check the status of the container in the Docker desktop app. You will see the container in the list of containers on the left of the app window. You can inspect the container by clicking on it and see the details of the container.

Docker containers in app

Step 7: Copying files from the container to the host

  • One way to access the files created inside your container is to mount a directory from your host machine to a directory in the container. This is done using the -v flag in the docker run command, e.g. docker run -v "$(pwd)/Output:/home/Output" ricefarm_01 in a Unix-style shell. However, this is not so effective for the example container we are using, because the code completes in a matter of seconds, after which the container exits.

  • Instead we will copy the output files from the container using the CLI. To do this you need to know the container ID of the container you want to copy files from. You can get the container ID by running the command docker ps -a in the terminal; this will show you a list of all the containers on your computer, and you can copy the ID from the terminal output:

Docker container ID in terminal
  • Now that you have the ID, run the command docker cp <Container ID>:/home/Output/Visualisations/Regional_size_summary_bar.png ./Output/ in the terminal, replacing <Container ID> with the ID you copied. The first argument /home/Output/Visualisations/Regional_size_summary_bar.png is the path to the file you want to copy inside the container. The second argument ./Output/ is the path to the directory to copy the file to on your host machine; again this is a relative path, and the . specifies the current directory. After running the command you should see a message printed in the terminal and the file should be copied to the directory you specified:

Docker copy output in terminal

That’s it: you have successfully created a Dockerfile, a Docker image and a container, run your code inside the container, and copied the output back to your host machine. If you were to share the Dockerfile with someone else, they could build the image and run the container on their machine and get exactly the same results as you. Obviously this is a very simple example, but the same principles apply to more complex projects where reproducibility becomes more challenging.

Version control with Git exercise

In this exercise we will show how version control with Git can be implemented for an example R project. The project is the same as the one created in the first workflow exercise; however, to save time, or in case you haven’t completed that exercise, we will start from its finished output.

If you would prefer to view the exercise script offline, here is a PDF version: Download exercise instructions

There is an alternative way of working with Git: instead of linking Rstudio directly to your Github account, you can use the Github Desktop application, which provides an easy-to-use GUI and a very simple setup. If you are interested in using Github Desktop instead, skip to Step 8.

Step 1: Download and configure Git

  • Download Git from https://git-scm.com/downloads
  • Once downloaded, open the Git terminal window and enter the following with your own credentials:

git config --global user.name "NVHarisena1"
git config --global user.email "NVHarisena1@ethz.ch"
git config --global --list

The third command should return your updated user name and email.

Step 2: Create a repository on Github

We will make a quick repository on Github for an individual project, without changing many of the specific configurations, since these are beyond the scope of this workshop.

  • Login to your account at https://github.com/

  • Create a new repository by clicking the ‘+’ sign at the top right of the website or via the ‘Start new repository’ section on the homepage

  • Provide a clear name for the repository, e.g. “R_repro_nv”, and a quick description like “Test for Reproducible research workshop”

  • Set the visibility of the repository to “Public”

  • Initialize this repository with: Add a README file.

  • Select a license for your repository in the “Choose your license” section. Check out this website to identify which license works for you. Even though it is optional to add license information to a repository, it is good practice to include this (See more details).

  • Click the green Create repository button

Setting up the repository

Step 5: Edit the README.md file

  • Open the README.md from the file viewer pane

  • Add a description section for the project with a heading and a describing sentence, e.g. “This project is a test”.

  • Add license information for the project, for e.g. “This project is licensed under the terms of the MIT license.”

  • Save the file.

Edit README.md

Step 6: Edit the .gitignore file

The .gitignore file tells git which files to ignore when ‘pushing’ local changes to the remote (online) repository (see more details). In this example we will tell git to ignore all .html files. These are created when you preview a file; for example, click Preview on the edited README.md and a .html file should be created.

  • Open the .gitignore file, add *.html on a new line and save the file

Editing the .gitignore file

Step 7: Commit and push

  • Click the Commit button in the Git tab

  • Check all the files listed in the top left section

  • Write a sentence describing the changes, e.g. “Updated exercise, data, Readme and .gitignore”

  • Click Commit and close the window

  • Click Push in the Git tab; a window will pop up showing the interface with the remote system and details of the upload.

Commit changes

Great! You have finished your first project update (local changes committed and pushed) via Git on Rstudio.

Step 8: Download and install Github Desktop

This is an alternative way to work with Github when coding in R. For this, we will not link Rstudio to our Github account but we will use the Github Desktop application instead.

  • Download Github Desktop

  • Install the application and login with your Github account

Step 9: Create a repository on Github

We will make a quick repository on Github for an individual project, without changing many of the specific configurations, since these are beyond the scope of this workshop.

  • Login to your account on Github

  • Create a new repository by clicking the ‘+’ sign at the top right of the website or via the ‘Start new repository’ section on the homepage

  • Provide a clear name for the repository, e.g. “R_repro_nv”, and a quick description like “Test for Reproducible research workshop”

  • Set the visibility of the repository to “Public”

  • Initialize this repository with: Add a README file.

  • Select a license for your repository in the “Choose your license” section. Check out this website to identify which license works for you. Even though it is optional to add license information to a repository, it is good practice to include this (See more details).

  • Click the green Create repository button

Setting up the repository

Step 10: Clone your Github repository to your computer

  • In the Github Desktop application click File > Clone Repository and select your new repository. Define the directory on your computer where you want to clone the repository to, then press the Clone button.

Step 11: Edit the README.md file

  • Open the README.md in your local project directory with a text editor.

  • Add a description section for the project with a heading and a describing sentence, e.g. “This project is a test”.

  • Add license information for the project, for e.g. “This project is licensed under the terms of the MIT license.”

  • Save the file.

Step 12: Push the updated README.md file

Since you have edited the README.md file, the Github Desktop app will propose pushing this change from your computer to your repository.

  • Add a title and a description and click Commit to main button.

  • Click Push origin button to upload your changes to your repository on Github.

Committing changes in Github Desktop

Writing an academic paper with Quarto

In this exercise we will start to write an academic manuscript using Quarto, incorporating data from the example R project we have used in the other exercises. The project is the same as the one created in the first workflow exercise; however, to save time, or in case you haven’t completed that exercise, we will start from its finished output.

If you would prefer to view the exercise script offline, here is a PDF version: Download exercise instructions

Step 1: Download the resources

  • Click here to download the resources for the exercise: Download resources for exercise

  • Unzip the downloaded file and move the folder to a location on your computer where you can easily find it.

Step 2: Open the project in RStudio

  • Open RStudio and navigate to the folder where you saved the resources for the exercise.
  • Open the Rstudio project file in the directory (Rice_farm_analysis.Rproj)

Step 3: Download Quarto

  • Note: Because Quarto is bundled with Rstudio, if you have a recent version of Rstudio you may already have Quarto installed. To check, go to File > New File and you should see Quarto Document as an option.

  • If you do not have Quarto installed you can install it from the Quarto website here

Step 4: Create a Quarto document for the article content

The Quarto website provides a template for creating a Manuscript project but we will create one from scratch so that you understand the various files.

  • The first step is to create a Quarto document that will act as the primary content file: go to File > New File > Quarto Document. You don’t need to add title or author information in the wizard as we will do this in the document itself.

Quarto document wizard
  • Save the file as index.qmd in the root directory of the project. Using index as the file name is the convention for the main content file in a Quarto project, and the software will look for this when attempting to render the content. However, it is possible to use a different file name if you prefer, although this requires setting an option in the configuration file.

Step 5: Create a _quarto.yml configuration file

YAML files (.yml) are configuration files, and in the case of Quarto the convention is to use a file named _quarto.yml in the root directory of the project. This file will be used to set a range of options including the project type, code execution options and options for the output format/s.

  • Create a new file with File > New File > Text File and save it to the root directory of the project and name it _quarto.yml. Adding the extension .yml will automatically set the file type to YAML.

Step 6: Specify the Manuscript project type in the configuration file

  • Note: All YAML options are set using key-value pairs and the Quarto documentation provides a list of options that can be set in the configuration file.

  • The first step is to identify your project as a manuscript by entering the following code in the _quarto.yml file:

_quarto.yml
project: 
  type: manuscript
  • If you do want your main content file to be named something other than index.* you can add the file name with the article key (but make sure that your .qmd file name matches the entry):
_quarto.yml
manuscript:
  article: Rice_farm.qmd

Step 7: Set output formats in the configuration file

The next step is to set the output formats for the document. Quarto supports outputting your manuscript to a range of formats simultaneously including HTML, PDF, Docx and LaTeX. The output formats and their individual options are also set in the _quarto.yml file.

  • For this exercise we will set the output formats to HTML, docx and PDF. Add the following code to the _quarto.yml file:
_quarto.yml
format:
  html:
    toc: true
    comments:
      hypothesis: true
  docx: default
  • For PDF outputs specifically, there are a number of templates that have been produced to align with the formatting of major academic publishers. These are available as Quarto extensions.

  • For this exercise we will use the Elsevier template. Quarto extensions can be installed directly using the terminal in Rstudio. To install the Elsevier template run the following code in the terminal: quarto add quarto-journals/elsevier

  • Once the template is installed you can set the PDF output to use the template by adding the following code to the _quarto.yml file indented below the existing format entry:
_quarto.yml
  elsevier-pdf: default

Step 8: Complete the front matter content

The front matter of the document is where you can enter the title, author/s and other metadata for the document. In the case of academic manuscripts this also includes the abstract, keywords, author institutions and roles (according to the CRediT Taxonomy). The full list of front matter options for Quarto manuscripts can be found here.

Front matter is entered as YAML content at the start of the main content file index.qmd. To be recognised as YAML content it must be enclosed by three dashes --- at the start and end of the content block.

  • Enter example front matter for this exercise as follows:
---
title: An Analysis of Rice Production in Indonesia
author:
  - name: Jane Doe
    orcid: 0000-0002-0760-5497
    corresponding: true
    email: Janedoe@gmail.com
    roles:
      - Investigation
      - Project administration
      - Software
      - Visualization
    affiliations:
      - id: ETH_PLUS
        name: ETH Zürich
        department: Planning of Landscape and Urban Systems (PLUS) Institut für Raum- und Landschaftsentwicklung (IRL)
        address: HIL H 52.1, Stefano-Franscini-Platz 5
        city: Zürich
        postal-code: 8093
  - name: John Doe
    orcid: 0000-0002-7859-8394
    corresponding: false
    roles: []
    affiliations: 
      - id: ETH_PLUS
keywords:
  - Rice farms
  - Indonesia
  - Regional production
abstract: |
  This is the abstract
plain-language-summary: |
  Rice production data for the country of Indonesia is analysed ...
date: last-modified
number-sections: true
---

Now take some time to practice editing this content by, for example, changing the title, adding yourself as an author or changing the abstract. You will see that indentation is very important for YAML content to work correctly.

One nice feature is the fact that you can use the affiliation: id option to avoid needing to repeat details for authors from the same institution.

Step 9: Add manuscript content

Now we have our manuscript project and content file set up let’s add some content. Content in Quarto files is written in Markdown format, which takes some getting used to but will quickly become intuitive.

  • Add some content to the index.qmd file. For example, use # to create a heading and ## for subheadings:
# Introduction

Rice is a staple food in Indonesia and is grown in many regions across the country. The production of rice is a key economic activity in Indonesia and is a major source of employment for many people. In this study we analyse the production of different types of rice in different regions in Indonesia.  

# Methods

## Data
For this study we used data from the Indonesian Ministry of Agriculture on rice production in the country. The data was collected from different regions across Indonesia and includes information on the types of rice grown, the amount of rice produced and the area of land used for rice cultivation.

# Results

Figure 1 shows the distribution of rice production across different regions in Indonesia. Table 1 shows the types of rice grown in each region and the amount of rice produced.

Now let’s render our manuscript for the first time to see how the output looks. To render the document, click the Render button at the top of the Rstudio source pane:

This will render the document to HTML and open it in the viewer pane on the right:

Step 10: Add Figures and Tables

Figures and tables can be added to Quarto documents using code blocks in various programming languages. To add a code block to a Quarto document you use three backticks followed by the language name in braces, for example an R chunk starts with ```{r}, and the block is closed with another three backticks. Code blocks are executed when the document is rendered and their output is included in the final document (although this behaviour can be controlled with execution options).
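
For example, execution options are set with special #| comments at the top of a chunk; the options below (standard Quarto cell options) would hide the code and suppress warnings in the rendered output while still showing the resulting figure:

```{r}
#| echo: false
#| warning: false

plot(pressure)
```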

Let’s add a figure to the document using the R code from the exercise script Scripts/02_data_visualisation.R to create a bar chart of the area of farms growing different rice varieties in different regions of Indonesia.

  • Add the following code block to the index.qmd file below the Results heading:
```{r}

# Load required packages
library(ggplot2)

# load the summarized data
rice_data_summary <- read.csv("Data/Processed/RiceFarms_summary.csv")

# plot summary data as grouped bar chart
ggplot(rice_data_summary, aes(x=region, y=size, fill=varieties)) +
  geom_bar(stat="identity", position="dodge") +
  labs(title="Total size of farms by region and rice varieties",
       x="Region",
       y="Total size of farms",
       fill="Rice varieties") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
```

Now if you re-render the document you should see the figure included in the output.

Step 11: Add Cross-references

Quarto allows you to cross-reference figures, tables and sections in your document.

  • Add a label and caption to the figure code block by adding the following lines at the top of the block (Quarto needs both the fig- prefixed label and a fig-cap to treat the output as a cross-referenceable figure):
```{r}
#| label: fig-barchart
#| fig-cap: Total size of farms by region and rice varieties
```
  • Now add a cross-reference to this figure in the text by changing the words Figure 1 to @fig-barchart in the first sentence of the Results section:
# Results

@fig-barchart shows the distribution of rice production across different regions in Indonesia. Table 1 shows the types of rice grown in each region and the amount of rice produced.
  • Now if you re-render the document you will see that the figure has been automatically numbered and the text now contains a hyperlinked cross reference to the figure:

Step 12: Add citations

Quarto provides the functionality to not only add citations to your documents but to automatically generate a bibliography, both of which can be formatted according to a wide range of citation styles to meet journal requirements. For a detailed explanation of citations with Quarto see the guide here.

Quarto uses the standard Pandoc markdown representation for citations, whereby citations go inside square brackets and are separated by semicolons, e.g. [@citation]. Citations can be entered manually, but they need a corresponding entry in the article’s bibliography file (e.g. a .bib file in the root directory of the project). Instead, we will add a citation using the wizard in the Quarto visual editor, which automatically creates the citation entry in the bibliography file.

  • At the top of the source pane click the Visual button to switch to the visual editor:

  • Navigate to the Methods section and position the cursor at the end of the sentence:
For this study we used data from the Indonesian Ministry of Agriculture on rice production in the country.
  • In the Visual editor toolbar at the top of the source pane click Insert and in the drop-down menu select Citation (or alternatively use the keyboard shortcut Ctrl + Shift + F8 on Windows). This will open the Insert Citation wizard:

  • The Rice Farms data used for the exercises comes from the plm package, so we will cite this as an example (even though it is not strictly accurate in terms of the sentence content). Click on the tab for R packages on the left side of the wizard, enter plm in the search bar and click on the entry in the window below to select it:

  • Make sure the Create box in the bottom left corner of the wizard has a value in it; the default should be references.bib.

  • Now click Insert and the wizard will close. You will see that the citation has been added at the cursor location and a references.bib file has been created in the root directory of the project. If you open this file you can see the bibliographic details of the citation. (Note: if you have previously added citations or already created a references.bib file, citations will all be added to the same file.)

Now when you render the document this citation will automatically be hyperlinked to the corresponding entry in the bibliography that is appended to the end of the document.

Tip: In our experience the easiest method of adding academic citations is to use the Quarto-Zotero integration. In this case you select the Zotero tab in the Insert Citation wizard and if Zotero is running on your computer you will be able to search your whole library to add citations.

Summary

In this exercise we have only touched upon a few of the many useful features of Quarto for writing reproducible research documents. As such, we would urge you to look through the Quarto documentation to learn about some of the other features, as well as useful packages and extensions such as the trackdown package, which allows you to convert Quarto documents into Google Docs (and back to Quarto again) to facilitate collaboration with users not familiar with the format.

Further resources

R and Git

https://rstudio.github.io/renv/articles/renv.html

https://intro2r.com/

https://docs.github.com/en/get-started/getting-started-with-git

Quarto

https://youtu.be/w3X76e_tZr8?si=bE4OfrNm-wLaeBlk

https://youtu.be/EbAAmrB0luA?si=yUXEup4qpRXCDh5k

https://www.marvinschmitt.com/blog/website-tutorial-quarto/

https://marioangst.com/en/blog/posts/multi-language-quarto/

Zenodo

https://cran.r-project.org/web/packages/zen4R/vignettes/zen4R.html
