The `Require` approach, comparing `pak` and `renv`
Eliot McIntire
Source:vignettes/Require.Rmd
Require.Rmd
Require
is a single package that combines features of
base::install.packages
, base::library
,
base::require
, as well as pak::pkg_install
,
remotes::install_github
, and
versions::install_version
, plus the snapshotting
capabilities of renv
. It takes its name from the idea that
a user could simply have one line named from the require
function that would load a package, but in this case it will also
install the package if necessary. Set it and forget it. This means that
even if a user has a dependency that is removed from CRAN (“archived”),
the line will still work. Because it can be done in one line, it becomes
relatively easy to share, which facilitates, for example, making
reprexes for debugging. This package can be a key part of a reproducible
workflow.
Principles used in Require
Require
is designed with features that facilitate
running R code that is part of a continuous reproducible workflow, from
data-to-decisions. For this to work, all functions called by a user
should have a property whereby the initial time they are called does the
heavy work, and the subsequent times are sufficiently fast that the user
is not forced to skip over lines of code when re-running code. This is
called “rerun-tolerance”, i.e., the line can be rerun under identical
conditions and very quickly return the original result. The package,
reproducible
, has a function Cache
which can
convert many function calls to have this property. It does not work well
for functions whose objectives are side-effects, like installing and
loading packages. Require
fills this gap.
Key features
Features include:
- Fast, parallel installs and downloads.
- Installs CRAN and CRAN-alike even if they have been archived..
- Installs GitHub packages.
- User can specify which version to install using the standard
R-version approach (e.g.,
==3.5.0
or>=3.5.0
). - Local package caching and cloning (see below) for fast (re-)installs.
- Manages (some types of) conflicting package requests, i.e., different GitHub branches.
-
options
-level control of which packages should be installed from source (seeRequireOptions()
) even if they are being downloaded from a binary repository. - Finds specific versions of packages from an incomplete CRAN-like repository (such as r-universe.dev), even when the version is not available, but it is available on the main CRAN mirrors.
- Handles some errors that are not handled by
install.packages
like “already in use”.
How it works
Require
uses install.packages
internally to
install packages. However, it does not let install.packages
download the packages. Rather, it identifies dependencies recursively,
finds out where they are (CRAN, GitHub, Archives, Local), downloads them
(or gets from local cache or clones from an specified package library).
If libcurl
is available (assessed via
capabilities("libcurl")
), it will download them in parallel
from CRAN-like repositories. If sys
is installed, it will
download GitHub packages in parallel also. If a user has not set
options("Ncpus")
manually, then it will set that to a value
up to 8 for parallel installs of binary and source packages.
Rerun-tolerance
To be functionally reproducible, code must be regularly run and tested on many operating systems and computers. When this does not happen, a user/developer does not know that certain code chunks no longer work until they try to run it later. In other words, code gets stale because underlying algorithms and data change. To be rerun-tolerant, a function must:
- return the same result or outcome every time it is run (first, second or more times later);
- be very fast after the first time; when it is not fast, users will skip running it “because we don’t need to run it again and it is slow”
Require
does both of these. See below “why is it
fast”.
Why these features help teams
It is common during code development to work in teams, and to be updating package code. This is beneficial whether the team is very tight, all working on exactly the same project, or looser where they only share certain components across diverse projects.
All working on same project
If the whole team is working on the same “whole” project, then it may
be useful to use a “package snapshot” approach, as is used with the
renv
package. Require
offers similar
functionality with the function pkgSnapshot()
. Using this
approach provides a mechanism for each team member to update code, then
snapshot the project, commit the snapshot and push to the cloud for the
team to share.
Diverse projects
However, if a team is more diversified and they are actually sharing
the new code, but not the whole project, then project snapshots will be
very inefficient and package management must be on a package-by-package
case, not the whole project. In other words, the code developer can work
on their package, and the various team members will have 2 options of
what they might want to do: keep at the bleeding edge or update only if
necessary for dependencies. More likely, they will want to have a
mixture of these strategies, i.e., bleeding edge with some code, but
only if necessary with others. Thus, Require
offers
programmatic control for this. For example
library(Require)
Require::Install(
c("PredictiveEcology/reproducible@development (HEAD)",
"PredictiveEcology/SpaDES.core@development (>=2.0.5.9004)"))
will keep the project at the bleeding edge of the development branch
of reproducible
, but will only update if necessary (based
on the version needed, expressed by the inequality) for the development
branch of SpaDES.core
. The user does not have to make
decisions at run time as to whether an update should be made, and for
which packages.
How Require
differs from other approaches
Default behaviours different
For packages that are not yet installed:
Description | Outcome |
---|---|
Install("data.table") |
data.table installed |
install.packages("data.table") |
data.table installed |
pak::pkg_install("data.table") |
data.table installed |
renv::install("data.table") |
data.table installed |
For packages that are installed:
Description | Outcome |
---|---|
Install("data.table") |
No installation |
install.packages("data.table") |
data.table installed |
pak::pkg_install("data.table") |
No installation |
renv::install("data.table") |
data.table installed |
For packages that are already installed, but not latest on CRAN:
Description | Outcome |
---|---|
Install("data.table") |
No installation |
install.packages("data.table") |
data.table installed |
pak::pkg_install("data.table") |
data.table installed, asks user if wants to update if
available |
renv::install("data.table") |
data.table installed, asks user if wants to update if
available |
Differences and similarities between pak
and
Require
This table is based on Require v1.0.0
and
pak v0.7.2
.
* Indicates that there is an example below.
Description | Require |
pak |
---|---|---|
Parallel downloads | Yes | Yes |
Parallel installs | Yes | Yes |
Archived package* (e.g., "knn" ) |
Automatic | Must prefix with url:: and exact url
path |
Archived package in dependency* | Automatic | May not work, even if manually adding
url:: or any::
|
Dependency conflicts* | Yes | No (see example below using any:: ) |
Multiple requests of same package* | Resolves by version number specification, or most recent version | Error |
Control individual package updates | With HEAD
|
No |
Very clean messaging | somewhat, with
options(Require.installPackagesSys = 1)
|
Yes |
Package dependencies |
data.table , sys
|
None (though yes if user wants control, e.g.,
pkgcache ) |
Uses local cache | Yes | Yes |
Package updates (default) | No, unless needed by version number | Yes, prompt user |
Package install by version | Yes | Yes, but does not deal well with multiple packages with specific versions |
Package conflict (CRAN & GitHub)* | Prefers CRAN, if version requirements met | Error |
Version specification by user | Yes e.g., Require (>=1.0.0)
|
Not an option |
Exact version specification by user | Uses DESCRIPTION file approach e.g.,
Require (==1.0.0)
|
Uses @ e.g.,
Require@1.0.0
|
Version conflicts | Require attempts to resolve them, detailing conflict | Reports “dependency conflict” without details |
Cache of package dependencies | Yes (internally in Require::pkgDep ) |
No (cache not used in pak::pkg_dep ) |
Additional_repositories (in DESCRIPTION
file of a package) |
Uses | Does not use (like
install.packages ) |
Cache of package binaries built locally from source | Yes | No (pak version 0.7.2 ) |
Archived packages
Between mid March 2024 and April 5, 2024, fastdigest
was
taken off CRAN. If this is part of your direct dependencies,
you can remove it and find an alternative. However, if it is an indirect
dependency, you don’t have that choice: your workflow will break.
Require
will just get the most recent archived copy and the
work can continue. While fastdigest
is back on CRAN, others
are not, e.g., an older knn
package:
Require::Install("knn")
try(pak::pkg_install(c("knn")))
Dependency conflict
When doing code development, it is common to use many
GitHub
packages. Each of these (or their dependencies) may
point to one or more branches, either directly by user or in
Remotes
field. In this next example, pak
errors, while Require
makes decisions and installs. This is
a common occurrence for teams developing packages concurrently. The
pak
approach suggests prepending any::
to the
package(s) that is/are causing the conflict. This may suffice under some
situations. The Require
approach is to assume the
equivalent of any::
which means to prioritize base on (in
this order) 1. use package version requirements, 2. CRAN-like
repositories, 3. order.
library(Require)
# Fails because of a) packages taken off CRAN & multiple GitHub branches requested within the nested dependencies
pkgs <- c("reproducible", "PredictiveEcology/SpaDES@development")
dirTmp <- tempdir2(sub = "first")
.libPaths(dirTmp)
install.packages("pak") # need this in the library; can't use personal library version
try(pak::pkg_install(pkgs))
# ✔ Loading metadata database ... done
# Error : ! error in pak subprocess
# Caused by error:
# ! Could not solve package dependencies:
# * reproducible: dependency conflict
# * PredictiveEcology/SpaDES@development: Can't install dependency PredictiveEcology/reproducible@development (>= 2.0.10)
# * PredictiveEcology/reproducible@development: Conflicts with reproducible
pkgsAny <- c("any::reproducible", "PredictiveEcology/SpaDES@development")
try(pak::pkg_install(pkgsAny))
# Fine
dirTmp <- tempdir2(sub = "second")
.libPaths(dirTmp)
Require::Install(pkgs)
# Fails
try(pk <- pak::pak(c("PredictiveEcology/LandR@development", "PredictiveEcology/LandR@main")))
# Error : ! error in pak subprocess
# Caused by error:
# ! Could not solve package dependencies:
# * PredictiveEcology/LandR@development: Conflicts with PredictiveEcology/LandR@main
# * PredictiveEcology/LandR@main: Conflicts with PredictiveEcology/LandR@development
# Fine -- takes in order, so main first in this example
rq <- Require::Install(c("PredictiveEcology/LandR@main", "PredictiveEcology/LandR@development"))
# Fine -- takes by version requirement, so takes development,
# which is the only one that fulfills requirement on Jul 25, 2024
rq <- Require::Install(c("PredictiveEcology/LandR@main", "PredictiveEcology/LandR@development (>=1.1.5)"))
The following does not work with pak
because BioSIM, a
dependency on GitHub is not found. This may be because the package name
is not the repository name, but it is not clear from the error message
why:
Version requirements determine package installation
Version number requirements drive package updates. If a user does not need an update because version numbers are sufficient, no update will occur.
If no version number specification, then installs only occur if package is not present.
Multiple simultaneous requests to install a package from what appear to be incompatible sources, will not create a conflict unless version requirements cause the conflict. If version number requirements are not specified, CRAN versions will take precedence, and sequence of packages listed at installation will take preference otherwise.
# The following has no version specifications,
# so CRAN version will be installed or none installed if already installed
Require::Install(c("PredictiveEcology/reproducible@development", "reproducible"))
# The following specifies "HEAD" after the Github package name. This means the
# tip of the development branch of reproducible will be installed if not already installed
Require::Install(c("PredictiveEcology/reproducible@development (HEAD)", "reproducible"))
# The following specifies "HEAD" after the package name. This means the
# tip of the development branch of reproducible
Require::Install(c("PredictiveEcology/reproducible@development", "reproducible (HEAD)"))
# Not a problem because version number specifies
Require::Install(c("PredictiveEcology/reproducible@modsForLargeArchives (>=2.0.10.9010)",
"PredictiveEcology/reproducible (>= 2.0.10)"))
# Even if branch does not exist, if later version requirement specifies a different branch, no error
Require::Install(c("PredictiveEcology/reproducible@modsForLargeArchives (>=2.0.10.9010)",
"PredictiveEcology/reproducible@validityTest (>= 2.0.9)"))
Require
can handle package version specifications at the
function call (pak
can handle them if they are in a
DESCRIPTION
file, if they are >=
), whereas
pak
cannot (currently).
## FAILS - can't specify version requirements
try(pak::pkg_install(
c("PredictiveEcology/reproducible@modsForLargeArchives (>=2.0.10.9010)",
"PredictiveEcology/reproducible (>= 2.0.10)")))
Why is it fast?
Some of the features make it fast the first time being used on a system, some make it fast the second & subsequent time on a system (which can be first time in a new project). These features are caching, cloning, and parallel downloads.
Caching
Require
creates a local cache of several steps: the
packages files (source or binary including locally built binaries); the
package dependency tree (only in RAM currently, so only affects the same
session); available package matrices for CRAN-like repositories.
Together, these speed up the installation of packages on a computer that
can access the local cache, e.g., for each new project.
Require
keeps the binary once the source
package is built, and it can therefore install the binary each
subsequent installation. This results in dramatically faster
installations of source packages after they have been built locally.
Cloning (still experimental; do not default)
Require
has an option,
options("Require.cloneFrom")
, which, when set, will create
a hard link between the current project’s package library and the
library pointed to by the option. Setting to
e.g. options("Require.cloneFrom" = Sys.getenv("R_LIBS_USER"))
will allow packages in the user’s personal library to be the source of
the “copying” to the project library. This is dramatically faster than
installing, even when the installation is a local binary from the local
cache.
Binary on Linux
On Linux, users have the ability to install binary packages that are
pre-built e.g., from the Posit Package Manager. Sometimes the binary is
incompatible with a user’s system, even though it is the correct
operating system. This occurs generally for several packages, and thus
they must be installed from source. Require
has a function
sourcePkgs()
, which can be informed by
options("Require.spatialPkgs")
and
options("Require.otherPkgs")
that can be set by a user on a
package-by-package basis. By default, some are automatically installed
from "source"
because in our experience, they tend to fail
if installed from the binary.
# In this example, it is `terra` that generally needs to be installed from source on Linux
if (Require:::isUbuntuOrDebian()) {
Require::setLinuxBinaryRepo()
pkgs <- c("terra", "PSPclean")
pkgFullName <- "ianmseddy/PSPclean@development"
try(remove.packages(pkgs))
pak::cache_delete() # make sure a locally built one is not present in the cache
try(pak::pkg_install(pkgFullName))
# ✔ Loading metadata database ... done
#
# → Will install 2 packages.
# → Will download 2 packages with unknown size.
# + PSPclean 0.1.4.9005 [bld][cmp][dl] (GitHub: fed9253)
# + terra 1.7-71 [dl] + ✔ libgdal-dev, ✔ gdal-bin, ✔ libgeos-dev, ✔ libproj-dev, ✔ libsqlite3-dev
# ✔ All system requirements are already installed.
#
# ℹ Getting 2 pkgs with unknown sizes
# ✔ Got PSPclean 0.1.4.9005 (source) (43.29 kB)
# ✔ Got terra 1.7-71 (x86_64-pc-linux-gnu-ubuntu-22.04) (4.24 MB)
# ✔ Downloaded 2 packages (4.28 MB) in 2.9s
# ✔ Installed terra 1.7-71 (61ms)
# ℹ Packaging PSPclean 0.1.4.9005
# ✔ Packaged PSPclean 0.1.4.9005 (420ms)
# ℹ Building PSPclean 0.1.4.9005
# ✖ Failed to build PSPclean 0.1.4.9005 (3.7s)
# Error:
# ! error in pak subprocess
# Caused by error in `stop_task_build(state, worker)`:
# ! Failed to build source package PSPclean.
# Type .Last.error to see the more details.
# Works fine because the `sourcePkgs()`
try(remove.packages(pkgs)) # uninstall to make sure it is a clean install for this test
Require::cacheClearPackages(pkgs, ask = FALSE) # remove any existing local packages
Require::Install(pkgFullName)
}
Package dependencies
default arguments – pkgDep(..., which = XX)
includes
LinkingTo
pkgDep
, by default, includes LinkingTo
as
these are required by Rcpp
if that is required, and so are
strictly necessary. pak::pkg_deps
does not include
LinkingTo
by default.
depPak <- pak::pkg_deps("PredictiveEcology/LandR@LandWeb")
depRequire <- Require::pkgDep("PredictiveEcology/LandR@LandWeb") # Slightly different default in Require
# Same
pakDepsClean <- setdiff(Require::extractPkgName(depPak$ref), Require:::.basePkgs)
requireDepsClean <- setdiff(Require::extractPkgName(depRequire[[1]]), Require:::.basePkgs)
setdiff(pakDepsClean, requireDepsClean)
setdiff(requireDepsClean, pakDepsClean) # does not report "RcppArmadillo", "RcppEigen", "cpp11" which are LinkingTo
CRAN-preference
If there is no version specification, Require
prefers
CRAN packages when there are multiple pointers to a package. Thus, even
though a package may have a Remotes
field pointing to e.g.,
PredictiveEcology/SpaDES.tools@development
, if there is a
recursive dependency within that package that specifies
SpaDES.tools
without a Remotes
field, then
pkgDep
will return the CRAN
version. If a user
wants to override this behaviour, then the user can specify a version
requirement that can only be satisfied with the Remotes
option. Then pkgDep
will take that.
pak::pkg_deps
prefers the top-level specification, i.e.,
the non-recursive Remotes
field will be returned, even if
the same package is also specified within a recursive dependency without
a Remotes
field, i.e, if a recursive dependency points the
CRAN package, it will not return that version of the dependency.
pak
fails for packages on GitHub that are not same name
as Git Repo in Remotes
gg <- pak::pkg_deps("PredictiveEcology/LandR@development", dependencies = TRUE)
# Error:
# ! error in pak subprocess
# Caused by error:
# ! Could not solve package dependencies:
# * PredictiveEcology/LandR@development: Can't install dependency BioSIM
# * BioSIM: Can't find package called BioSIM.
# Type .Last.error to see the more details.
ff <- Require::pkgDep("PredictiveEcology/LandR@development", dependencies = TRUE)
# $`PredictiveEcology/LandR@development`
# [1] "BH" "BIEN"
# [3] "BioSIM" "DBI (>= 0.8)"
# [5] "Deriv" "ENMeval"
# ...
renv
and Require
Managing projects during development
renv
has a concept of a lockfile. This lockfile records
a specific version of a package. If the current installed version of a
package is different from the lockfile (e.g., I am the developer and I
increment the local version), renv
will attempt to revert
the local changes (with prompt to confirm) unless the local
package is installed from a cloud repository (e.g., GitHub), and a
snapshot
is taken. This sequence is largely incompatible
with pkgload::load_all()
or
devtools::install()
, as these do not record “where” to get
the current version from. Thus, the renv
sequence can be
quite time consuming (1-2 minutes, instead of 1 second with
pkgload::load_all()
).
Require
does not attempt to update anything unless
required by a package. Thus, this issue never comes up. If and when it
is important to “snapshot”, then pkgSnapshot
or
pkgSnapshot2
can be used.
Using DESCRIPTION
file to maintain minimum
versions
During a project, a user can build and maintain and “project-level”
DESCRIPTION file, which can be useful for a renv
managed
project. This approach does not, however, automatically detect minimum
version changes or GitHub branch changes (renv::status
does
not recognize these). In order for a user to inherit the correct
requirements, a manual renv::install
must be used. For even moderate sized projects, this can take over
20 seconds.
Require
does not need a lockfile; package violations are
found on the fly.