
Statistical temporal downscaling

Downscaling is a procedure that consists of inferring high-resolution information from low-resolution data. It can be performed either in time or, more commonly, in space. Methodologically, it can be done statistically (statistical downscaling), based on observed relationships between the variables of our data, or mechanistically (often called dynamical downscaling), using a mechanistic model of the process generating the data. Such techniques are widely used, for example, in meteorology and climatology to derive local-scale weather and climate from Global Climate Model (GCM) data, which have a typical resolution of 50 x 50 km. See for example the CORDEX project. Downscaling is not easy and, unfortunately, very few tools, if any, exist in R for that purpose (see for example this StackOverflow post). Here I use a very simple example of statistical temporal downscaling based on linear models in order to illustrate what the difficulties are, when such methods can be used, and when they should not be.

Downscaling is often misunderstood as a problem of interpolation. Even though the two techniques have aspects in common, the problems of interpolation and downscaling are radically different, and downscaling is substantially more difficult than interpolation. To illustrate this, let's consider a simple example using measurements of temperature as a function of time. Imagine that you measure temperature once per quarter, on the 15th of February, the 15th of May, the 15th of August and the 15th of November. If you want to infer the temperatures on the 15th of the other months from these 4 measurements, you can use an interpolation technique, which consists of fitting these data to a model, either statistical, such as a linear regression, or mechanistic. For example, in base R, the function approx() performs linear interpolation and the function spline() performs spline interpolation. See the examples of these functions to get a sense of what they are doing.
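
For instance, sticking to the quarterly temperature example above (with made-up temperature values and months encoded as decimal numbers), the two functions could be used as follows:

# temperatures measured on the 15th of February, May, August and November
# (made-up values; months are encoded as decimal numbers, the 15th being roughly mid-month):
x_meas <- c(2.5, 5.5, 8.5, 11.5)
y_meas <- c(6, 18, 26, 11)
x_out <- seq(2.5, 11.5, by = 1)       # the 15th of every month from February to November
approx(x_meas, y_meas, xout = x_out)  # linear interpolation
spline(x_meas, y_meas, xout = x_out)  # cubic spline interpolation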

Let's now imagine a slightly different situation: instead of having temperature measurements at one time point per quarter, you have the average temperature of each quarter and you would like, as before, to get an estimate of the temperature on the 15th of every month. Compared to the above situation of inferring values at certain time points from values measured at other time points, here we want to infer values at certain time points from aggregates of values over time intervals. Thus, instead of guessing values in between measured values, the problem here is to guess the values that were used to generate the aggregated values we have at hand. The type of aggregation could be the mean, as in our example, but it could be something else, a cumulative sum for example. The figure below illustrates the differences between interpolation and downscaling.
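
For concreteness, here is a minimal sketch (with made-up monthly temperatures) of how such aggregated data arise; the twelve monthly values are what we would like to recover, while the four quarterly means are all we actually observe:

monthly <- c(5, 7, 11, 15, 19, 24, 27, 26, 22, 16, 10, 6)  # made-up monthly temperatures
quarterly <- colMeans(matrix(monthly, nrow = 3))           # one mean per quarter
quarterly
# [1]  7.666667 19.333333 25.000000 10.666667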

Whatever technique we use to perform downscaling, we generally want to comply with two constraints:

  • within each time interval, the aggregate of the inferred values should be equal to the initial aggregate datum;

  • the transitions between consecutive time intervals should be as smooth as possible.

Here, the time intervals are those to which the initial aggregate data correspond, a quarter in the example above. In the case of spatial downscaling (2 dimensions instead of 1), we would speak of pixels instead of intervals. These two constraints can be satisfied easily with linear models. More sophisticated methods such as splines would do a better job but would also be more complicated. For the purpose of illustrating the concepts, we will thus stick to the simplest method, and we will assume that the aggregates are means (we'll see that a cumulative sum would not be much more difficult).
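
The reason the first constraint is easy to satisfy with linear models is that the mean of a linear function over an interval is exactly its value at the midpoint of that interval. Here is a minimal numerical check, with an arbitrary line and the interval [0, 4] chosen purely for illustration:

f <- function(t) 2 * t + 1    # an arbitrary linear function
integrate(f, 0, 4)$value / 4  # its mean over the interval [0, 4]
# [1] 5
f(2)                          # its value at the midpoint of the interval
# [1] 5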

Step-by-step R code

We want to design a function that takes two vectors as inputs: one (y) containing the aggregate values, and the other (x) containing the time coordinates of the middles of the time intervals to which the aggregates correspond. For the sake of illustration, consider the concrete example where we have the rates of change of a population averaged by year and we want to downscale these data so that we can express the rates of change of this population averaged by semester. The first step consists of identifying the limits of the intervals to which the aggregates correspond, materialized by the red vertical lines in the figure below.

The function below, used with the option with_borders = TRUE, does exactly that:

centers <- function(x, with_borders = FALSE) {
  # midpoints between consecutive values of x:
  ctrs <- x[-1] - diff(x) / 2
  # optionally add the two outer limits, obtained by symmetry around the first and last x:
  if (with_borders)
    return(c(2 * x[1] - ctrs[1], ctrs, 2 * tail(x, 1) - tail(ctrs, 1)))
  ctrs
}
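
For instance, with regularly spaced interval centers at 1, 2, 3 and 4 (a made-up example), the function returns the midpoints between consecutive centers and, with with_borders = TRUE, also adds the two outer limits obtained by symmetry:

centers(1:4)
# [1] 1.5 2.5 3.5
centers(1:4, with_borders = TRUE)
# [1] 0.5 1.5 2.5 3.5 4.5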

To meet the first constraint, we simply need, on each interval, to define a linear model that goes through the blue dots of the above figure. Furthermore, we know that through any two points passes one and only one line. Thus, the second constraint is easy to address too, as long as we know, for each interval, the y value of the linear model of the previous interval at the limit between the two intervals (vertical red lines in the figure above). We can thus design a recursive algorithm that estimates the linear models interval by interval, from left to right, and the for-loop below does the job:

nb <- length(x)                   # number of intervals
the_centers <- centers(x, TRUE)   # limits of the intervals (borders included)
y2 <- c(y2_0, rep(NA, nb))        # y values at the interval limits (y2_0 is discussed below)
for(i in 1:nb) {
  # fit a line through the left limit of the interval and its center:
  x_val <- c(the_centers[i], x[i])
  y_val <- c(y2[i], y[i])
  m <- lm(y_val ~ x_val)
  # extrapolate this line to the right limit of the interval:
  y2[i + 1] <- predict(m, data.frame(x_val = the_centers[i + 1]))
}

where nb is the number of intervals and the_centers is the output of the centers() function. A problem arises, however, for the first of these intervals: by definition it has no previous interval, and thus no left point through which to make our linear model pass. Let's call y2_0 the y coordinate of this initial left point. In principle, we could choose any value for y2_0. However, in order to meet our second constraint, let's instead apply an optimization algorithm to find the y2_0 that makes the transitions between intervals as smooth as possible. For that, let's define unsmoothness as the sum of the absolute differences between the slopes of adjacent linear models. The following function computes this statistic from a list of linear models:

unsmoothness <- function(list_of_models) {
  require(magrittr)
  list_of_models %>% 
    sapply(get_slope) %>%
    diff() %>% 
    abs() %>% 
    sum()
}

where the function get_slope() is defined by the following code:

get_slope <- function(model) coef(model)[2]
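
As a quick sanity check with two toy models (made up only for this illustration), a pair of lines with slopes 1 and 2 gives an unsmoothness of 1:

m1 <- lm(c(0, 1) ~ c(0, 1))   # a line of slope 1
m2 <- lm(c(1, 3) ~ c(1, 2))   # a line of slope 2
unsmoothness(list(m1, m2))
# [1] 1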

What we then need to perform this optimization is a function that takes a value of y2_0 as input and returns a list of models. The following function, built around the above for-loop, does the job:

make_models <- function(y2_0) {
  models <- vector("list", nb)
  y2 <- c(y2_0, rep(NA, nb))
  for(i in 1:nb) {
    x_val <- c(the_centers[i], x[i])
    y_val <- c(y2[i], y[i])
    models[[i]] <- m <- lm(y_val ~ x_val)
    y2[i + 1] <- predict(m, data.frame(x_val = the_centers[i + 1]))
  }
  models
}

As we can see, this function returns a list of models that we can then feed to unsmoothness() in order to compute the unsmoothness that corresponds to our choice of y2_0. The combination of make_models() and unsmoothness() can then be used by the base R optimize() function to find, over a given interval, the value of y2_0 that minimizes the unsmoothness. The problem now is to define the interval to search in. One option is given by the following function:

range2 <- function(..., n = 1, na.rm = FALSE) {
  the_range <- range(..., na.rm = na.rm)
  n * c(-1, 1) * diff(the_range) / 2 + mean(the_range)
}

which is basically a wrapper around the base R function range() that allows the range to be shrunk or expanded. We can now look for the optimal value of y2_0 with the following call:

optimize(function(x) unsmoothness(make_models(x)), range2(y, n = 3))$minimum
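
To get a sense of the search interval used here, assuming purely for illustration aggregate values running from 1 to 10, range2() simply expands the data range symmetrically around its center:

range(1:10)          # the original range of the (hypothetical) aggregate values
# [1]  1 10
range2(1:10, n = 3)  # the same range expanded threefold around its center
# [1] -8 19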

Once we have our optimal linear models on each of our intervals, the last step consists of inferring the new y values at a number of x values, and this step requires a small additional precaution in order to ensure that the first of the two constraints listed above is met. Let's call n the number of time points per time interval at which we wish to infer y values. If we want these points to be spread over each interval so that they represent the middles of n equal-duration sub-intervals, their x coordinates can be computed by the following function:

subsegment_center <- function(x, n) {
  the_centers <- centers(x, TRUE)  # limits of the intervals
  bys <- diff(the_centers) / n     # widths of the sub-intervals
  # midpoints of the n sub-intervals of each interval:
  Map(seq, from = the_centers[seq_along(x)] + bys / 2, by = bys, le = n)
}

This function is basically a mapping of seq() over its parameters from and by. Note that in the particular case where the initial x values are regularly spaced, all the values of the bys vector are the same.
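
For instance, with interval centers at 1, 2 and 3 and n = 2 (a made-up example), each interval of width 1 is split into two sub-intervals of width 0.5, and the midpoints of these sub-intervals are returned:

subsegment_center(1:3, n = 2)
# [[1]]
# [1] 0.75 1.25
#
# [[2]]
# [1] 1.75 2.25
#
# [[3]]
# [1] 2.75 3.25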

The final step is a call to the predict() function to produce the inferred y values:

unlist(Map(predict,
           models,
           lapply(subsegment_center(x, n), function(x) data.frame(x_val = x))))

Note that the elements of the list output by subsegment_center() are converted into data frames in order to meet the requirements of the predict() function.

Wrapping into one function

All the code above can be wrapped into a single function of three arguments: y, the aggregated values; x, the centers of the (time) intervals to which these aggregated values correspond; and n, the number of new y values we want to generate within each of these intervals:

downscale <- function(y, x, n) {
  require(magrittr)

  nb <- length(x)
  the_centers <- centers(x, TRUE)

  # The function that makes the list of linear models (one per interval).
  # Note that it uses `nb` and `the_centers` defined above.
  make_models <- function(y2_0) {
    models <- vector("list", nb)
    y2 <- c(y2_0, rep(NA, nb))
    for(i in 1:nb) {
      x_val <- c(the_centers[i], x[i])
      y_val <- c(y2[i], y[i])
      models[[i]] <- m <- lm(y_val ~ x_val)
      y2[i + 1] <- predict(m, data.frame(x_val = the_centers[i + 1]))
    }
    models
  }

  # Using the function `unsmoothness()` to find the best `y2_0` value:
  best_y2_0 <- optimize(
    function(x) unsmoothness(make_models(x)),
    range2(y, n = 3)
  )$minimum

  # Generating the list of models with the best `y2_0` value:
  models <- make_models(best_y2_0)

  # Now that everything is ready:
  x %>%
    subsegment_center(n) %>%
    lapply(function(x) data.frame(x_val = x)) %>%
    Map(predict, models, .) %>%
    unlist()
}

Testing the function

Let's now test the function downscale(). Let's consider 10 time intervals with centers from 1 to 10, and say that we want to infer 12 values per interval:

x <- 1:10
n <- 12

The limits between these intervals are:

the_centers <- centers(x, TRUE)

And the x coordinates of the inferred values will be:

new_x <- unlist(subsegment_center(x, n))

Let’s start with a simple example where the initial aggregate data are values from 1 to 10:

y1 <- 1:10
y1b <- downscale(y1, x, n)
plot(new_x, y1b)
points(x, y1, col = "blue", pch = 19)
abline(v = the_centers, col = "red")

In this figure, the vertical red lines show the limits of the intervals, the blue dots show the aggregated values available for each of these intervals, and the black dots show the inferred disaggregated values within each interval. The figure shows that the continuity between intervals is respected (the second of the above-listed constraints). Let's now check that the first constraint is also respected (i.e. that the aggregates of the inferred values are equal to the initial aggregate data):

plot(y1, colMeans(matrix(y1b, n)),
     xlab = "initial aggregate data",
     ylab = "aggregates of infered values")
abline(0, 1)

That works fine! Let's now consider a slightly more complicated example that scrambles the values of y1:

y2 <- sample(y1)
y2b <- downscale(y2, x, n)
plot(new_x, y2b)
points(x, y2, col = "blue", pch = 19)
abline(v = the_centers, col = "red")

plot(y2, colMeans(matrix(y2b, n)),
     xlab = "initial aggregate data",
     ylab = "aggregates of infered values")
abline(0, 1)

Works well too!

Conclusions

The function downscale() shows a very simple example of statistical temporal (i.e. one-dimensional) downscaling using linear models. This is probably the simplest possible example of downscaling. The function could be extended in basically two directions, which are not mutually exclusive: adding dimensions (to handle spatial downscaling, for example) and considering more complex models. The latter can be done either by sticking with statistical models, such as spline smoothing, or by using mechanistic (dynamical) models. The second option provides a less general tool and requires some domain knowledge. The functions centers(), range2(), subsegment_center() and downscale() are available in the mcstats package. To install it locally, just type:

# install.packages("devtools")
devtools::install_github("choisy/mcstats")