When you create an R data package, the data may be too big to host on CRAN, which has a package size limit of 5 MB, or even on GitHub, which has a repository size limit of 1 GB and a file size limit of 100 MB. Here I show how to build such a package so that the data stays on another server and the user downloads it the first time (s)he needs it. The download copies the data into the installed package's file hierarchy, so the user does not have to download it again for subsequent uses.
In the example below I simply use Dropbox as the data server. This is particularly convenient when the local repository of the package is synchronized with Dropbox. In that case, the operation basically consists of (i) making Git ignore the data and (ii) letting the user download the data from Dropbox the first time (s)he needs it.
- In your package file hierarchy, create a data-raw directory with the following structure:
data-raw
|-data_creation.R
|-dropbox
|-data-raw
|-extdata
where data-raw/ is created, following Wickham’s suggestion, by the following command:
> devtools::use_data_raw()
The data-raw/ directory contains the raw data used to generate the data that
will be included in the package. It is not included in the bundled version of
the package. Also, still following Wickham’s suggestion, data-raw/ contains an
R script, data_creation.R, that documents how the clean version of the data
included in the package is created from these raw data. What we add here to
this file structure is the sub-directory dropbox/, which contains the data that
will not be included in the bundled version of the package but will instead be
downloaded by the user when (s)he first needs them. So we have to make Git
ignore this directory, at the bash command line:
$ echo "data-raw/dropbox" >> .gitignore
or manually. Then, create a get* function. This function tests whether the
data is already present in the package file hierarchy. If not, it asks the
user whether (s)he wants to download and install it:
> getsrtm <- function() {
+   # Download the raster only if it is not yet in the installed package
+   if (!file.exists(paste0(installed.packages()["srtmVN", "LibPath"], "/srtmVN/extdata/srtm90.tif"))) {
+     message("SRTM data are not on disk.")
+     message("Do you want to download them from the internet (108.0 MB)? y (default) / n")
+     ans <- readline()
+     if (ans %in% c("y", ""))
+       download("http://marcchoisy.free.fr/srtm90.tif", "srtmVN", "srtm90.tif")
+     else return(NULL)
+   }
+   # Load the raster object and point it at the file on disk
+   data("srtm90", package = "srtmVN")
+   srtm90@file@name <- system.file("extdata", "srtm90.tif", package = "srtmVN")
+   srtm90
+ }
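The getsrtm() function relies on a download() helper to fetch the file and place it in the installed package. As a rough illustration of what such a helper might do, here is a minimal sketch, assuming it only needs to copy a remote file into a target directory once. The name fetch_once() and its arguments url, dest_dir and file_name are hypothetical and not part of srtmVN; the only real call used is base R's download.file().

```r
# Hypothetical helper (not part of srtmVN): download `url` into
# `dest_dir/file_name` unless that file is already on disk.
fetch_once <- function(url, dest_dir, file_name) {
  dest <- file.path(dest_dir, file_name)
  # Already downloaded on a previous call: nothing to do
  if (file.exists(dest)) return(invisible(dest))
  dir.create(dest_dir, recursive = TRUE, showWarnings = FALSE)
  # mode = "wb" keeps binary files (such as GeoTIFFs) intact on Windows
  status <- utils::download.file(url, destfile = dest, mode = "wb", quiet = TRUE)
  if (status != 0) stop("download failed with status ", status)
  invisible(dest)
}
```

For an installed package, dest_dir could for instance be file.path(find.package("srtmVN"), "extdata"), provided the user has write permission on the library where the package is installed.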