
Setting up and managing a self-contained environment makes your projects more:

isolated: each project has its own dedicated package library, which helps keep track of the installed packages and avoids cluttering the global environment,
portable: because the state of the project is captured, it’s easier to share projects or collaborate with others,
reproducible: a project state can be saved and restored.

This post presents a way to set up and manage a project-local environment in each of two programming languages, R and Python.

R

Renv package

The \(\texttt{renv}\) package is an R package that acts as a dependency manager and provides all the features listed above (and more).

Initialization

Recent versions of RStudio let you create a new project with automatic installation of the \(\texttt{renv}\) package. This can be done directly from the GUI:

File > New Project > New Directory > New Project

and check the box “Use renv with this project”.

To use renv in an existing project, you first need to install it:

install.packages("renv")

then, initialize renv so that it will discover the R packages already used in your project and install them into a private project library:

renv::init()

Calling renv::init() will also write out the infrastructure necessary to automatically load and use the private library for future R sessions launched from the project root directory. This is accomplished by creating (or amending) a project-local .Rprofile with the necessary code to load the project when the R session is started.

Once the project is initialized, you can proceed as usual by installing (or removing) R packages as needed.

Saving and reverting

Saving

\(\texttt{renv}\) can save the current state of a project library:

renv::snapshot()

By default, the snapshot type is implicit: it captures only the packages installed in your project library that are also used in your R and R Markdown scripts (found recursively from the project root directory), as inferred by renv::dependencies().

A common caveat is that it will sometimes miss packages that are not explicitly loaded in a script. To circumvent this issue, \(\texttt{renv}\) can be configured to capture all packages, like so:

renv::settings$snapshot.type("all")
renv::snapshot()

The list of captured packages is written into a lockfile, named renv.lock, located at the root of the R project folder. In the Packages pane of the RStudio interface, the Lockfile column lists the version of each package captured in the current snapshot. An empty field means that the corresponding package has not been captured.
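The lockfile can also be inspected outside of RStudio. The following Python sketch assumes that renv.lock is a JSON document whose top-level "Packages" object maps each package name to a record with a "Version" field; the function name and the shortened example content are hypothetical and only serve as an illustration.

```python
import json


def lockfile_versions(lockfile_text):
    """Return {package name: version} for every package recorded in a lockfile."""
    lock = json.loads(lockfile_text)
    packages = lock.get("Packages", {})
    return {name: record.get("Version") for name, record in packages.items()}


# Hypothetical, heavily shortened lockfile content (illustration only).
EXAMPLE_LOCKFILE = """
{
  "R": {"Version": "4.1.2"},
  "Packages": {
    "dplyr": {"Package": "dplyr", "Version": "1.0.7", "Source": "Repository"},
    "renv": {"Package": "renv", "Version": "0.15.1", "Source": "Repository"}
  }
}
"""

if __name__ == "__main__":
    # Print one "name version" pair per line, sorted by package name.
    for name, version in sorted(lockfile_versions(EXAMPLE_LOCKFILE).items()):
        print(name, version)
```

For a real project, the text would be read from the file itself, for instance with open("renv.lock").read().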

Reverting

Restoring the current project library to a previous snapshot is done via:

renv::restore()

With this command, the packages recorded in the lockfile renv.lock are compared against the packages currently installed in the project library, and any package that differs is reinstalled at the recorded version. By default, packages that are installed in the library but not recorded in the lockfile are left in place. With the option clean = TRUE, these packages are removed instead.

At any point, the state of the project library can be checked by calling:

renv::status()

Collaborating

\(\texttt{renv}\) drastically eases sharing your project with others. To do so:

Save the state of your project so that the lockfile renv.lock and the autoloader files .Rprofile and renv/activate.R are up to date.
Share your project with both the lockfile and the autoloader files. The file .Rhistory and the folder .Rproj.user should not be shared.
When a collaborator first launches R in the project, \(\texttt{renv}\) should automatically bootstrap itself, that is, download and install the appropriate version of \(\texttt{renv}\).
The collaborator then calls renv::restore() to rebuild the project library as it was saved, installing the package versions listed in the lockfile.
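Before sharing, it can be handy to check that the files mentioned in the steps above are actually present. The following Python sketch is one possible way to do so; the function name and the constant are hypothetical, and the check only tests for existence, not for whether the files are up to date.

```python
from pathlib import Path

# Files named in the steps above that a collaborator needs to receive.
REQUIRED_FILES = ("renv.lock", ".Rprofile", "renv/activate.R")


def missing_renv_files(project_dir):
    """Return the required renv files that are absent from project_dir."""
    root = Path(project_dir)
    return [name for name in REQUIRED_FILES if not (root / name).is_file()]


if __name__ == "__main__":
    # Check the current working directory, assumed to be the project root.
    missing = missing_renv_files(".")
    print("missing files:", missing if missing else "none")
```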

Warning

File conflicts may appear if multiple collaborators are concurrently installing new packages and
updating their own copy of the lockfile.

To guard against this, all collaborators should use a version control system, git for instance, to ensure that the lockfile always stays up to date.

Managing Python dependencies

If your R project also relies on Python packages (through the reticulate package), \(\texttt{renv}\) offers a quick and easy integration. The first step is to create a project-local Python environment through the function renv::use_python() with the following arguments:

python: a path that points to the version of Python that will be used in the project-local environment,
type: the type of Python environment to use.

If you have a pre-existing conda environment on your machine, you can call:

renv::use_python(python=[path_to_python_bin], type="conda")

where [path_to_python_bin] points to the Python executable of the conda environment. This command will automatically set up reticulate to use the specified Python binary.

Once this is done, the Python integration becomes active and \(\texttt{renv}\) will:

capture the set of installed Python packages when renv::snapshot() is called. The list of captured packages is stored in a YAML file, named “environment.yml”, located at the project root directory,
re-install the set of recorded Python packages from the YAML file “environment.yml” when renv::restore() is called.

This last option is particularly helpful when you wish to share your project or collaborate with someone else. Following the steps described in the previous section, when your collaborator calls renv::restore() to rebuild the project, both the R and the Python packages are restored.

As an example, the following Git repository makes use of \(\texttt{renv}\) to integrate Python dependencies.

Import package (optional)

Along with \(\texttt{renv}\), the package \(\texttt{import}\) offers simple functionalities to improve the management of packages in an active R session. To illustrate its use, consider the following lines:

library(dplyr)
df <- data.frame(year = rev(2000:2005), value = (0:5) ^ 2)
df %>% filter(value < 2) %>% arrange(year)

In the example above, the entire dplyr package is loaded just to use two of its functions, filter and arrange.

With the help of the \(\texttt{import}\) package, only the two functions of interest are loaded, instead of the whole package, like so:

import::from("dplyr", c("filter", "arrange"), .character_only=TRUE)
import::from("magrittr", "%>%")  # the pipe operator used below must be imported explicitly as well
df <- data.frame(year = rev(2000:2005), value = (0:5) ^ 2)
df %>% filter(value < 2) %>% arrange(year)

This improves coding habits in two ways:

the explicit import makes it easier to track the parent package of each function,
a function is less likely to be masked by a function of the same name loaded from another package.

Python

Conda

An easy way to quickly set up a self-contained Python environment is to use \(\texttt{conda}\). Conda is an open source package management system and environment management system that comes in two distributions: \(\texttt{Anaconda}\) and \(\texttt{Miniconda}\). Which one to install is up to the user, see recommendations.

Once installed, the conda command-line tool can be used to create and manage self-contained Python environments, that is, environments that have different versions of Python and/or packages installed in them.

You can check that conda is correctly installed by opening a terminal and entering:

conda --version

Packages are installed from channels, which are the locations (URLs or paths) where conda looks for packages. To resolve channel collisions ¹, conda prefers packages from a higher priority channel over any version from a lower priority channel.
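The priority rule can be pictured with a toy model in Python. This is only a sketch of the idea, not conda's actual solver, which weighs more factors; the function name and the channel contents below are hypothetical.

```python
def pick_channel(package, channels, index):
    """Return the first channel, in priority order, that provides `package`."""
    for channel in channels:  # channels are listed from highest to lowest priority
        if package in index.get(channel, set()):
            return channel
    return None  # no listed channel provides the package


# Hypothetical channel contents (illustration only).
EXAMPLE_INDEX = {
    "defaults": {"numpy", "scipy"},
    "pytorch": {"pytorch", "numpy"},
}

if __name__ == "__main__":
    # "numpy" exists in both channels: the higher priority channel is chosen.
    print(pick_channel("numpy", ["defaults", "pytorch"], EXAMPLE_INDEX))
```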

Conda provides a useful cheatsheet, always pointing to the latest stable version, that sums up the main commands. Aside from that, extensive documentation is available to learn how to manage conda itself, environments, channels and packages.

Creating and managing an environment

Environment file

A practical way to set up a new environment is by using an environment file. The environment file is a YAML file that contains the following fields:

name: specifies the name of the environment,
channels: a list of channels from which to install the packages; channel priority is handled by conda,
dependencies: the list of packages to install.

Let's consider the following example:

name: torch
channels:
  - defaults
  - pytorch
dependencies:
  - cudatoolkit=10.2
  - matplotlib-base
  - nb_conda_kernels
  - numpy
  - numpy-base
  - pip
  - psutil
  - python=3.8.5
  - pytorch=1.10.*
  - scipy
  - spyder-kernels
  - pip:
    - h5py
    - torchinfo

In this case, the YAML file will create the environment named \(\texttt{torch}\) using two channels:

defaults: this channel searches for packages under the https://repo.anaconda.com/pkgs/ directory,
pytorch: a channel specific to the pytorch library.

While package dependencies are resolved internally by conda, it is possible to enforce some package versions. Returning to the previous example,

the line python=3.8.5 pins Python to version 3.8.5,
the line pytorch=1.10.* requests the latest available patch release of the 1.10 minor release of the PyTorch library.
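To make the meaning of such pins concrete, here is a small Python sketch that checks whether a given version falls under a pin written in the name=version form used above. It is a simplified illustration under that assumption; conda's own matching supports more operators and build strings, and both function names are hypothetical.

```python
def split_spec(spec):
    """Split 'name=version' into (name, version); version is None when unpinned."""
    name, _, version = spec.partition("=")
    return name, (version or None)


def satisfies(version, constraint):
    """Toy check: does a concrete version fall under a 'name=constraint' pin?"""
    if constraint is None:
        return True  # unpinned packages accept any version
    # Drop a trailing wildcard, so that "1.10.*" and "1.10" behave the same here.
    base = constraint.rstrip("*").rstrip(".")
    return version == base or version.startswith(base + ".")


if __name__ == "__main__":
    name, constraint = split_spec("pytorch=1.10.*")
    for candidate in ("1.10.0", "1.10.2", "1.11.0"):
        print(name, candidate, satisfies(candidate, constraint))
```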

Further information about package versioning and releases is provided at this link.

Creating the environment

Specify the location

Once the environment file is written, it must be passed to conda to actually build the new environment. It is possible to control where the new environment will be created.

Regarding the example at hand, this would be done by entering in the terminal:

conda env create -f [path_to_yaml]/[yaml_file].yml -p [path_to_env]/torch

This line would:

call the command conda env create which is the command to use for creating a new environment from an environment file,
use the argument -f that specifies the path where to find and read the environment file,
use the argument -p that specifies the path where to create the new environment.

[path_to_yaml], [yaml_file] and [path_to_env] are user-specific placeholders:

[path_to_yaml]: the absolute (or relative) path to the YAML file,
[yaml_file]: the name of the YAML file,
[path_to_env]: the absolute (or relative) path to the folder where the new environment will be created.
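When the environment is created from a script rather than typed by hand, the same command can be assembled in Python. The sketch below only builds the argument list and does not call conda; the function name and the example paths are hypothetical.

```python
from pathlib import Path


def build_create_command(yaml_path, env_dir, env_name):
    """Assemble the `conda env create` call described above as an argument list."""
    prefix = Path(env_dir) / env_name  # folder where the environment will live
    return ["conda", "env", "create", "-f", str(Path(yaml_path)), "-p", str(prefix)]


if __name__ == "__main__":
    # Hypothetical paths standing in for [path_to_yaml]/[yaml_file].yml and [path_to_env].
    command = build_create_command("config/torch.yml", "my_envs", "torch")
    print(" ".join(command))
```

On a machine where conda is installed, the resulting list can be passed to subprocess.run(command, check=True) to actually create the environment.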

By default, \(\texttt{Miniconda}\) and \(\texttt{Anaconda}\) provide a folder envs, located at the installation root, which can be used to install new environments. That is, one can set [path_to_env]="[path_to_root]/envs", where [path_to_root] is the path to the root folder (by default a folder such as ~/miniconda3 or ~/anaconda3 in your home directory). This path can be found with the following command:

conda info --base

Register the name

The final step is to add the folder that contains the newly created environment to conda's list of environment directories (envs_dirs). This is done with the following command:

conda config --append envs_dirs [path_to_env]

where [path_to_env] refers to the path discussed above. Once your new environment has been registered, it can be activated simply by calling its name. For the present toy case, this would be done as follows:

conda activate torch

You may find it useful to turn off the auto-activation of the base environment. This can be done by manually adding the line auto_activate_base: false to your .condarc or, more simply, by entering the following line in your terminal:

conda config --set auto_activate_base False

¹ A channel collision occurs when the same package exists in two different channels.

How to set up a project local environment
