Here we explain how you can quickly, efficiently and reproducibly download the incidence data you need from the www.data.gov.my national data server and the climatic data from the TuTiempo.net platform. For that, you will need the following packages: dplyr, lubridate, magrittr, purrr, readr, readxl, rvest, stringr, tidyr, xml2 and zeallot. Make sure that they are installed:

to_install <- setdiff(c("dplyr", "lubridate", "magrittr", "purrr", "readr", "readxl",
                        "rvest", "stringr", "tidyr", "xml2", "zeallot"),
                      rownames(installed.packages()))
if (length(to_install)) install.packages(to_install)

Incidence data

Below we define a function that downloads the weekly incidence data per state for given years from www.data.gov.my, cleans them a bit and rearranges them into a long format sorted by state, year and week:

datagovmy <- function(disease, hash, years) {
  require(magrittr) # for the " %>% " operator
# build one URL per year from the dataset hash and the yearly resource keys:
  urls <- paste0("http://www.data.gov.my/data/ms_MY/dataset/", hash,
                 "/resource/", years, "/download/", names(years),
                 "bilangan-kes-penyakit-", disease, "-mingguan-mengikut-negeri.xlsx")
  tempfiles <- replicate(length(urls), tempfile(fileext = ".xlsx"))
  Map(download.file, urls, tempfiles, mode = "wb") # "wb" so the binary xlsx files are not corrupted on Windows
  tempfiles %>%
    lapply(readxl::read_excel, skip = 2) %>% 
    setNames(names(years)) %>% 
    dplyr::bind_rows(.id = "year") %>% 
    dplyr::rename(week = `MINGGU EPID`) %>% # Malay for "EPID WEEK"
    dplyr::filter(week != "Grand Total") %>% # remove the yearly totals
    dplyr::select(-MALAYSIA) %>% # remove the country-level column
    dplyr::mutate_all(as.integer) %>% 
    tidyr::gather("state", "incidence", -year, -week) %>% # wide to long format
    dplyr::select(state, year, week, incidence) %>% 
    dplyr::mutate(state = stringr::str_to_title(sub("WP ", "", state))) %>% # "WP" = Wilayah Persekutuan (federal territory)
    dplyr::arrange(state, year, week)
}
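
To make the URL scheme explicit, here is the URL that the function would build for one hypothetical year entry (the dataset hash and resource key below are placeholders, not real data.gov.my identifiers):

years <- c(`2010` = "some-resource-key") # hypothetical resource key, named by year
unname(paste0("http://www.data.gov.my/data/ms_MY/dataset/", "some-dataset-hash",
              "/resource/", years, "/download/", names(years),
              "bilangan-kes-penyakit-", "dengue-haemorrhagic-fever",
              "-mingguan-mengikut-negeri.xlsx"))
## [1] "http://www.data.gov.my/data/ms_MY/dataset/some-dataset-hash/resource/some-resource-key/download/2010bilangan-kes-penyakit-dengue-haemorrhagic-fever-mingguan-mengikut-negeri.xlsx"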

Let’s now define the resource key for each year of dengue data:

dengue_years <- setNames(c("1dda5107-a25d-4529-8fa3-75d219b17298",
                           "e2ee5f21-0480-4af1-a236-ea94ed620e09",
                           "d96c67da-4e4b-439b-b4be-5611ead9d8e8",
                           "854288d8-d7a9-4ba4-892d-5f6d37a767e4",
                           "6af451f1-b80b-40c5-be98-0a1c8d55bda1",
                           "c8610944-44d3-4c65-abc3-36dbb9f23a0d",
                           "d8bcd5de-9934-4d4d-a2d7-586e76e64310",
                           "07257dc1-c39a-4d35-8f72-bad7fdd96bbb"), 2010:2017)

Same for tuberculosis:

tb_years <- setNames(c("c201ccf5-a4de-4806-bfa7-e7cb03079773",
                       "4955717e-f5d7-4203-b2a3-420751d4261d",
                       "e0a7812f-049a-462b-8036-0d40cc261d8c",
                       "f5444304-17bf-480b-a678-1332fcaf979e"), 2014:2017)

And for hand-foot-and-mouth disease:

hfmd_years <- setNames(c("33b20c42-8877-4016-987f-683707b7137a",
                         "0a53857a-d209-4ef0-a3ab-539be6720905",
                         "ff8d5c7c-267a-4c5d-88c0-43b15eb1084b",
                         "afe50193-ec76-4ea2-a05a-ba8324bde64b",
                         "3baee861-9f6f-4859-bb84-b56fbc4c66b5",
                         "ace3504a-6f17-4f9c-9386-8e47806dc9ea",
                         "a86a99fe-e8a7-45fc-9431-a416f17f009b",
                         "15d45312-6318-4b22-b7c3-d4a684a15f08"), 2010:2017)

With the above information, we can use our datagovmy() function to download the incidence data we want:

dengue <- datagovmy("dengue-haemorrhagic-fever", "0e34a58f-909e-4ec2-85ab-8633830be91c", dengue_years)
tb <- datagovmy("tuberculosis", "fbb1dd02-65c4-4383-aa41-344980001a88", tb_years)
hfmd <- datagovmy("hfmd", "e3edfbb1-2dba-4e0e-9c02-d6a213f58221", hfmd_years)

The outputs are in the form of neat data frames:

dengue
## # A tibble: 6,255 x 4
##    state  year  week incidence
##    <chr> <int> <int>     <int>
##  1 Johor  2010     1         2
##  2 Johor  2010     2         5
##  3 Johor  2010     3         4
##  4 Johor  2010     4         6
##  5 Johor  2010     5         7
##  6 Johor  2010     6         4
##  7 Johor  2010     7         7
##  8 Johor  2010     8         5
##  9 Johor  2010     9         4
## 10 Johor  2010    10         5
## # … with 6,245 more rows
tb
## # A tibble: 3,135 x 4
##    state  year  week incidence
##    <chr> <int> <int>     <int>
##  1 Johor  2014     1        56
##  2 Johor  2014     2        49
##  3 Johor  2014     3        38
##  4 Johor  2014     4        42
##  5 Johor  2014     5        34
##  6 Johor  2014     6        48
##  7 Johor  2014     7        48
##  8 Johor  2014     8        39
##  9 Johor  2014     9        49
## 10 Johor  2014    10        28
## # … with 3,125 more rows
hfmd
## # A tibble: 6,255 x 4
##    state  year  week incidence
##    <chr> <int> <int>     <int>
##  1 Johor  2010     1         4
##  2 Johor  2010     2         1
##  3 Johor  2010     3        12
##  4 Johor  2010     4         9
##  5 Johor  2010     5        11
##  6 Johor  2010     6        11
##  7 Johor  2010     7         9
##  8 Johor  2010     8        18
##  9 Johor  2010     9         7
## 10 Johor  2010    10        14
## # … with 6,245 more rows

If you prefer, you can download the preprocessed data in the form of CSV files that we have made directly available here:

dengue <- readr::read_csv("https://raw.githubusercontent.com/choisy/DMo2019/master/data/dengue.csv")
tb <- readr::read_csv("https://raw.githubusercontent.com/choisy/DMo2019/master/data/tb.csv")
hfmd <- readr::read_csv("https://raw.githubusercontent.com/choisy/DMo2019/master/data/hfmd.csv")

Climatic data

The website TuTiempo.net contains meteorological and climatic data from many climatic stations around the world, including Malaysia. Here we show how to download all the daily data from the 28 Malaysian climatic stations from 2010 onwards. For that, we need a number of utility functions that we start by defining.

The following function removes the last 2 rows (which contain summary statistics) of a matrix m:

rm_summaries <- function(m) {
  n <- nrow(m)
  m[-((n - 1):n), ]
}
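
A quick sanity check on a toy matrix:

m <- matrix(1:12, ncol = 2)
rm_summaries(m) # the same matrix without its last 2 rows
##      [,1] [,2]
## [1,]    1    7
## [2,]    2    8
## [3,]    3    9
## [4,]    4   10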

The following function coerces a matrix m to a data frame, using the first row for the variable names:

as.data.frame2 <- function(m) {
  setNames(as.data.frame(m, as.is = TRUE), m[1, ])[-1, ]
}
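
Again, a quick illustration on a toy matrix:

m <- matrix(c("x", "y", "1", "2", "3", "4"), ncol = 2, byrow = TRUE)
as.data.frame2(m) # the first row becomes the variable names
##   x y
## 2 1 2
## 3 3 4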

The following function downloads the data from the URL url and organizes it into a data frame:

get_page <- function(url) {
  require(magrittr) # for the " %>% " operator
  print(url) # to monitor progress
  url %>%
    xml2::read_html() %>%
    rvest::html_nodes(".mensuales td , th") %>% # the cells of the monthly data table
    rvest::html_text() %>%
    matrix(ncol = 15, byrow = TRUE) %>% # 15 columns: Day plus 14 climatic variables
    rm_summaries() %>%
    as.data.frame2()
}

A safe version of the get_page() function, trying the URL again and again if the internet connection is interrupted, but returning gracefully in case of an expected error (e.g. a 404 for a page that does not exist):

safe_get_page <- function(..., error) {
  repeat {
    out <- purrr::safely(get_page)(...)
# return if the download succeeded or if it failed with the expected error;
# otherwise (e.g. interrupted connection) try again:
    if (is.null(out$error) || grepl(error, out$error)) return(out)
  }
}
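
As a reminder of how purrr::safely() works in general (this is plain purrr behaviour, not something specific to our functions): it wraps a function so that, instead of failing, it always returns a list with a result element and an error element, one of which is NULL:

safe_log <- purrr::safely(log)
safe_log(10)$result # the value of log(10), with $error being NULL
## [1] 2.302585
is.null(safe_log("a")$result) # on failure, $result is NULL and $error is filled in
## [1] TRUE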

The following function pads 1-digit numbers to 2-digit ones with zeros on the left:

pad <- function(x) {
  stringr::str_pad(as.character(x), 2, pad = "0")
}
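
For example:

pad(c(1, 9, 10, 12))
## [1] "01" "09" "10" "12"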

The following function builds a URL from a year, a month and a station:

make_url <- function(year, month, station) {
  paste0("http://en.tutiempo.net/climate/",
         pad(month), "-", year, "/ws-", station, ".html")
}
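
For example, the URL of the January 2015 page for the Alor Star station (station code 486030 in the station list downloaded below) is:

make_url(2015, 1, 486030)
## [1] "http://en.tutiempo.net/climate/01-2015/ws-486030.html"

and that page could then be parsed with get_page(make_url(2015, 1, 486030)), provided an internet connection is available.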

Here is the main function that downloads the data for the station station:

download_data <- function(station, years, months = 1:12, error = "HTTP error 404") {
  require(magrittr) # for the " %>% " operator
  require(zeallot) # for the " %<-% " operator
# generate all the combinations of months and years:
  c(months, years) %<-% expand.grid(months, years)
  out <- purrr::map2(years, months, make_url, station = station) %>%
    purrr::map(safe_get_page, error = error) %>%
    purrr::transpose()
# keep only the successfully downloaded pages and bind them into one data frame:
  out <- out$result %>%
    setNames(paste(years, pad(months), sep = "-")) %>%
    `[`(sapply(out$error, is.null)) %>%
    dplyr::bind_rows(.id = "ym") %>%
    dplyr::mutate(day = lubridate::ymd(paste(ym, pad(Day), sep = "-"))) %>%
    dplyr::select(-ym, -Day) %>%
    dplyr::select(day, dplyr::everything()) %>%
    dplyr::mutate_if(is.factor, as.character) %>%
    dplyr::mutate_at(dplyr::vars(T, TM, Tm, SLP, PP, VV, V, VM, VG), as.numeric) %>%
    dplyr::mutate_at(dplyr::vars(H), as.integer) %>%
    dplyr::mutate_at(dplyr::vars(RA, SN, TS, FG), function(x) x == "o") # "o" marks occurrence
# rename the temperature variables and make all the names lower case:
  names(out) %<>%
    sub("^T$", "ta", .) %>%
    sub("TM", "tx", .) %>%
    sub("Tm", "tn", .) %>%
    tolower()
  out
}
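
For example, a one-month download for the Alor Star station could look like this (illustrative call, with a hypothetical object name; it requires an internet connection):

alor_star <- download_data(486030, 2015, months = 10)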

Now, we want to slightly modify the above function so that it downloads the data only over the range of years that we have in the incidence data:

diseases_years <- c(dengue$year, hfmd$year, tb$year)
max_year <- max(diseases_years)
min_year <- min(diseases_years)
download_data2 <- function(station, year) {
  download_data(station, year:max_year)
}
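
With the incidence data above, min_year is 2010 and max_year is 2017. In the download below, each station will be queried from the later of its first year of record and min_year; for instance, for a station whose records start in 1959:

max(1959, min_year) # the first year that will actually be requested
## [1] 2010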

A list of the 28 Malaysian climatic stations that have data in TuTiempo.net can be downloaded from this CSV file:

stations <- readr::read_csv("https://raw.githubusercontent.com/choisy/DMo2019/master/data/climatic%20stations.csv")
## Parsed with column specification:
## cols(
##   location = col_character(),
##   station = col_integer(),
##   longitude = col_double(),
##   latitude = col_double(),
##   elevation = col_integer(),
##   from = col_integer()
## )

which gives:

stations
## # A tibble: 28 x 6
##    location           station longitude latitude elevation  from
##    <chr>                <int>     <dbl>    <dbl>     <int> <int>
##  1 Alor Star           486030      100.     6.2          5  1959
##  2 Bintulu             964410      113.     3.2          2  1955
##  3 Butterworth         486020      100.     5.46         4  1962
##  4 Cameron Highlands   486320      101.     4.46      1545  2015
##  5 Ipoh                486250      101.     4.56        40  1959
##  6 Johore Bharu/Senai  486790      104.     1.63        37  1999
##  7 Kota Bharu          486150      102.     6.16         5  1954
##  8 Kota Kinabalu       964710      116.     5.93         3  1955
##  9 Kuantan             486570      103.     3.78        18  1954
## 10 Kuching             964130      110.     1.48        27  1955
## # … with 18 more rows

Now, everything is ready to be downloaded! (It’ll take about 2 hours).

climate <- stations %$% # " %$% " exposes the columns of stations to the expression below
  purrr::map2(station, sapply(from, max, min_year), download_data2) %>% # start from the later of the station's first year and min_year
  setNames(stations$station) %>%
  dplyr::bind_rows(.id = "station") %>% 
  dplyr::mutate_if(is.character, as.integer)

which gives:

climate
## # A tibble: 73,392 x 16
##    station day           ta    tx    tn   slp     h    pp    vv     v    vm
##      <int> <date>     <dbl> <dbl> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <dbl>
##  1  486030 2015-10-01    NA    NA    NA    NA    NA    NA    NA    NA    NA
##  2  486030 2015-10-02    NA    NA    NA    NA    NA    NA    NA    NA    NA
##  3  486030 2015-10-03    NA    NA    NA    NA    NA    NA    NA    NA    NA
##  4  486030 2015-10-04    NA    NA    NA    NA    NA    NA    NA    NA    NA
##  5  486030 2015-10-05    NA    NA    NA    NA    NA    NA    NA    NA    NA
##  6  486030 2015-10-06    NA    NA    NA    NA    NA    NA    NA    NA    NA
##  7  486030 2015-10-07    NA    NA    NA    NA    NA    NA    NA    NA    NA
##  8  486030 2015-10-08    NA    NA    NA    NA    NA    NA    NA    NA    NA
##  9  486030 2015-10-09    NA    NA    NA    NA    NA    NA    NA    NA    NA
## 10  486030 2015-10-10    NA    NA    NA    NA    NA    NA    NA    NA    NA
## # … with 73,382 more rows, and 5 more variables: vg <dbl>, ra <lgl>,
## #   sn <lgl>, ts <lgl>, fg <lgl>

The variables’ meanings are as follows (following TuTiempo.net’s legend, with the temperature columns renamed as in download_data()):

day: date of the record
ta:  average temperature (°C)
tx:  maximum temperature (°C)
tn:  minimum temperature (°C)
slp: atmospheric pressure at sea level (hPa)
h:   average relative humidity (%)
pp:  total rainfall and/or snowmelt (mm)
vv:  average visibility (km)
v:   average wind speed (km/h)
vm:  maximum sustained wind speed (km/h)
vg:  maximum wind gust (km/h)
ra:  whether there was rain or drizzle
sn:  whether it snowed
ts:  whether there were thunderstorms
fg:  whether there was fog

If you don’t want to spend the 2 hours downloading the data from TuTiempo.net, you can download them in CSV form, directly from here:

climate <- readr::read_csv("https://raw.githubusercontent.com/choisy/DMo2019/master/data/climate.csv")