Here we explain how you can quickly, efficiently and reproducibly download the incidence data you need from the www.data.gov.my national data server and the climatic data from the TuTiempo.net platform. For that, you will need the following packages: dplyr, lubridate, magrittr, purrr, readr, readxl, rvest, stringr, tidyr, xml2 and zeallot. Make sure that they are installed:
to_install <- setdiff(c("dplyr", "lubridate", "magrittr", "purrr", "readr", "readxl",
                        "rvest", "stringr", "tidyr", "xml2", "zeallot"),
                      rownames(installed.packages()))
if (length(to_install)) install.packages(to_install)
Below we define a function that downloads the weekly incidence data per state for a given disease from www.data.gov.my, cleans them a bit and rearranges them into a long format sorted by state, year and week:
datagovmy <- function(disease, hash, years) {
  require(magrittr)  # for the " %>% " operator
  # build one URL per year of data:
  urls <- paste0("http://www.data.gov.my/data/ms_MY/dataset/", hash,
                 "/resource/", years, "/download/", names(years),
                 "bilangan-kes-penyakit-", disease, "-mingguan-mengikut-negeri.xlsx")
  # download the Excel files to temporary files:
  tempfiles <- replicate(length(urls), tempfile(fileext = ".xlsx"))
  Map(download.file, urls, tempfiles)
  tempfiles %>%
    lapply(readxl::read_excel, skip = 2) %>%               # read the Excel files
    setNames(names(years)) %>%
    dplyr::bind_rows(.id = "year") %>%                     # one data frame with a year column
    dplyr::rename(week = `MINGGU EPID`) %>%
    dplyr::filter(week != "Grand Total") %>%               # remove the yearly totals
    dplyr::select(-MALAYSIA) %>%                           # remove the country-level totals
    dplyr::mutate_all(as.integer) %>%
    tidyr::gather("state", "incidence", -year, -week) %>%  # wide to long format
    dplyr::select(state, year, week, incidence) %>%
    dplyr::mutate(state = stringr::str_to_title(sub("WP ", "", state))) %>%
    dplyr::arrange(state, year, week)
}
Let’s now define the key for each year of dengue data:
dengue_years <- setNames(c("1dda5107-a25d-4529-8fa3-75d219b17298",
                           "e2ee5f21-0480-4af1-a236-ea94ed620e09",
                           "d96c67da-4e4b-439b-b4be-5611ead9d8e8",
                           "854288d8-d7a9-4ba4-892d-5f6d37a767e4",
                           "6af451f1-b80b-40c5-be98-0a1c8d55bda1",
                           "c8610944-44d3-4c65-abc3-36dbb9f23a0d",
                           "d8bcd5de-9934-4d4d-a2d7-586e76e64310",
                           "07257dc1-c39a-4d35-8f72-bad7fdd96bbb"), 2010:2017)
Same for tuberculosis:
tb_years <- setNames(c("c201ccf5-a4de-4806-bfa7-e7cb03079773",
                       "4955717e-f5d7-4203-b2a3-420751d4261d",
                       "e0a7812f-049a-462b-8036-0d40cc261d8c",
                       "f5444304-17bf-480b-a678-1332fcaf979e"), 2014:2017)
And for hand-foot-and-mouth disease:
hfmd_years <- setNames(c("33b20c42-8877-4016-987f-683707b7137a",
                         "0a53857a-d209-4ef0-a3ab-539be6720905",
                         "ff8d5c7c-267a-4c5d-88c0-43b15eb1084b",
                         "afe50193-ec76-4ea2-a05a-ba8324bde64b",
                         "3baee861-9f6f-4859-bb84-b56fbc4c66b5",
                         "ace3504a-6f17-4f9c-9386-8e47806dc9ea",
                         "a86a99fe-e8a7-45fc-9431-a416f17f009b",
                         "15d45312-6318-4b22-b7c3-d4a684a15f08"), 2010:2017)
With the information above, we can use our datagovmy() function to download the incidence data we want:
dengue <- datagovmy("dengue-haemorrhagic-fever", "0e34a58f-909e-4ec2-85ab-8633830be91c", dengue_years)
tb <- datagovmy("tuberculosis", "fbb1dd02-65c4-4383-aa41-344980001a88", tb_years)
hfmd <- datagovmy("hfmd", "e3edfbb1-2dba-4e0e-9c02-d6a213f58221", hfmd_years)
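For reference, this is the first URL that the dengue call above constructs (assembled from the dataset hash and the 2010 resource key defined earlier):

paste0("http://www.data.gov.my/data/ms_MY/dataset/0e34a58f-909e-4ec2-85ab-8633830be91c",
       "/resource/1dda5107-a25d-4529-8fa3-75d219b17298",
       "/download/2010bilangan-kes-penyakit-dengue-haemorrhagic-fever-mingguan-mengikut-negeri.xlsx")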
The outputs are in the form of neat data frames:
dengue
## # A tibble: 6,255 x 4
## state year week incidence
## <chr> <int> <int> <int>
## 1 Johor 2010 1 2
## 2 Johor 2010 2 5
## 3 Johor 2010 3 4
## 4 Johor 2010 4 6
## 5 Johor 2010 5 7
## 6 Johor 2010 6 4
## 7 Johor 2010 7 7
## 8 Johor 2010 8 5
## 9 Johor 2010 9 4
## 10 Johor 2010 10 5
## # … with 6,245 more rows
tb
## # A tibble: 3,135 x 4
## state year week incidence
## <chr> <int> <int> <int>
## 1 Johor 2014 1 56
## 2 Johor 2014 2 49
## 3 Johor 2014 3 38
## 4 Johor 2014 4 42
## 5 Johor 2014 5 34
## 6 Johor 2014 6 48
## 7 Johor 2014 7 48
## 8 Johor 2014 8 39
## 9 Johor 2014 9 49
## 10 Johor 2014 10 28
## # … with 3,125 more rows
hfmd
## # A tibble: 6,255 x 4
## state year week incidence
## <chr> <int> <int> <int>
## 1 Johor 2010 1 4
## 2 Johor 2010 2 1
## 3 Johor 2010 3 12
## 4 Johor 2010 4 9
## 5 Johor 2010 5 11
## 6 Johor 2010 6 11
## 7 Johor 2010 7 9
## 8 Johor 2010 8 18
## 9 Johor 2010 9 7
## 10 Johor 2010 10 14
## # … with 6,245 more rows
If you prefer, you can download the preprocessed data in the form of CSV files that have been made directly available:
dengue <- readr::read_csv("https://raw.githubusercontent.com/choisy/DMo2019/master/data/dengue.csv")
tb <- readr::read_csv("https://raw.githubusercontent.com/choisy/DMo2019/master/data/tb.csv")
hfmd <- readr::read_csv("https://raw.githubusercontent.com/choisy/DMo2019/master/data/hfmd.csv")
The website TuTiempo.net contains meteorological and climatic data from many climatic stations around the world, including Malaysia. Here we show how to download all the daily data from the 28 climatic stations of Malaysia from 2010 onwards. For that, we need a number of utility functions that we start by defining.
The following function removes the last 2 rows of a matrix m (on the TuTiempo.net pages, these contain monthly summaries):
rm_summaries <- function(m) {
  n <- nrow(m)
  m[-((n - 1):n), ]
}
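A quick illustration on a toy matrix of our own:

(m <- matrix(1:10, ncol = 2))  # a 5 x 2 matrix
rm_summaries(m)                # only the first 3 rows are left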
The following function coerces a matrix m to a data frame, using the first row for the variable names:
as.data.frame2 <- function(m) {
  setNames(as.data.frame(m, as.is = TRUE), m[1, ])[-1, ]
}
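For example, with a made-up character matrix whose first row contains the column names:

m <- matrix(c("Day", "1", "2", "T", "27.4", "26.9"), ncol = 2)
as.data.frame2(m)  # a 2-row data frame with columns "Day" and "T"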
The following function downloads data from the URL url and organizes it into a data frame:
get_page <- function(url) {
  require(magrittr)  # for the " %>% " operator
  print(url)
  url %>%
    xml2::read_html() %>%
    rvest::html_nodes(".mensuales td , th") %>%  # the cells of the monthly data table
    rvest::html_text() %>%
    matrix(ncol = 15, byrow = TRUE) %>%          # the table has 15 columns
    rm_summaries() %>%
    as.data.frame2()
}
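To try it on one month of data from one station (not run here, since it requires an internet connection; 486030 is the Alor Star station, see the stations data frame below):

# get_page("http://en.tutiempo.net/climate/01-2015/ws-486030.html")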
A safe version of the get_page() function, trying the URL again and again if the internet connection is interrupted, and handling specific errors (e.g. 404):
safe_get_page <- function(..., error) {
  repeat {
    # purrr::safely() makes get_page() return a list with "result" and
    # "error" elements instead of failing:
    out <- purrr::safely(get_page)(...)
    # return if the call succeeded, or if the error is one we do not retry on:
    if (is.null(out$error) || grepl(error, out$error)) return(out)
  }
}
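For example (same illustrative URL as above, not run here):

# out <- safe_get_page("http://en.tutiempo.net/climate/01-2015/ws-486030.html",
#                      error = "HTTP error 404")
# out$result  # the data frame if the download succeeded, NULL otherwise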
The following function pads 1-digit numbers to 2-digit ones with zeros on the left:
pad <- function(x) {
  stringr::str_pad(as.character(x), 2, pad = "0")
}
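For example:

pad(3)   # "03"
pad(11)  # "11"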
The following function builds a URL from a year, a month and a station:
make_url <- function(year, month, station) {
  paste0("http://en.tutiempo.net/climate/",
         pad(month), "-", year, "/ws-", station, ".html")
}
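For example:

make_url(2015, 1, 486030)
# "http://en.tutiempo.net/climate/01-2015/ws-486030.html"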
Here is the main function that downloads the data for the station station over the years years:
download_data <- function(station, years, months = 1:12, error = "HTTP error 404") {
  require(magrittr)  # for the " %>% " operator
  require(zeallot)   # for the " %<-% " operator
  # all the combinations of months and years:
  c(months, years) %<-% expand.grid(months, years)
  # download one page per month:
  out <- purrr::map2(years, months, make_url, station = station) %>%
    purrr::map(safe_get_page, error = error) %>%
    purrr::transpose()
  # keep the successful downloads only, bind them into one data frame
  # and convert the variables to their proper types:
  out <- out$result %>%
    setNames(paste(years, pad(months), sep = "-")) %>%
    `[`(sapply(out$error, is.null)) %>%
    dplyr::bind_rows(.id = "ym") %>%
    dplyr::mutate(day = lubridate::ymd(paste(ym, pad(Day), sep = "-"))) %>%
    dplyr::select(-ym, -Day) %>%
    dplyr::select(day, dplyr::everything()) %>%
    dplyr::mutate_if(is.factor, as.character) %>%
    dplyr::mutate_at(dplyr::vars(T, TM, Tm, SLP, PP, VV, V, VM, VG), as.numeric) %>%
    dplyr::mutate_at(dplyr::vars(H), as.integer) %>%
    dplyr::mutate_at(dplyr::vars(RA, SN, TS, FG), function(x) x == "o")
  # rename the temperature variables and lowercase all the names:
  names(out) %<>%
    sub("^T$", "ta", .) %>%
    sub("TM", "tx", .) %>%
    sub("Tm", "tn", .) %>%
    tolower()
  out
}
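For example, a small test download (the first 3 months of 2015 for the Alor Star station; not run here since it hits the network):

# x <- download_data(486030, 2015, months = 1:3)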
Now we want to slightly modify the above function so that it downloads the data only over the range of years that we have in the incidence data:
diseases_years <- c(dengue$year, hfmd$year, tb$year)
max_year <- max(diseases_years)
min_year <- min(diseases_years)
download_data2 <- function(station, year) {
  download_data(station, year:max_year)
}
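For example, the following call would download the daily data of station 486030 from 2015 up to max_year (again, not run here):

# download_data2(486030, 2015)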
A list of the 28 Malaysian climatic stations that have data on TuTiempo.net can be downloaded from the following CSV file:
stations <- readr::read_csv("https://raw.githubusercontent.com/choisy/DMo2019/master/data/climatic%20stations.csv")
## Parsed with column specification:
## cols(
## location = col_character(),
## station = col_integer(),
## longitude = col_double(),
## latitude = col_double(),
## elevation = col_integer(),
## from = col_integer()
## )
which gives:
stations
## # A tibble: 28 x 6
## location station longitude latitude elevation from
## <chr> <int> <dbl> <dbl> <int> <int>
## 1 Alor Star 486030 100. 6.2 5 1959
## 2 Bintulu 964410 113. 3.2 2 1955
## 3 Butterworth 486020 100. 5.46 4 1962
## 4 Cameron Highlands 486320 101. 4.46 1545 2015
## 5 Ipoh 486250 101. 4.56 40 1959
## 6 Johore Bharu/Senai 486790 104. 1.63 37 1999
## 7 Kota Bharu 486150 102. 6.16 5 1954
## 8 Kota Kinabalu 964710 116. 5.93 3 1955
## 9 Kuantan 486570 103. 3.78 18 1954
## 10 Kuching 964130 110. 1.48 27 1955
## # … with 18 more rows
Now, everything is ready to be downloaded! (It’ll take about 2 hours).
climate <- stations %$%
  # start each station at the later of its first year of data and min_year:
  purrr::map2(station, sapply(from, max, min_year), download_data2) %>%
  setNames(stations$station) %>%
  dplyr::bind_rows(.id = "station") %>%
  dplyr::mutate_if(is.character, as.integer)
which gives:
climate
## # A tibble: 73,392 x 16
## station day ta tx tn slp h pp vv v vm
## <int> <date> <dbl> <dbl> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <dbl>
## 1 486030 2015-10-01 NA NA NA NA NA NA NA NA NA
## 2 486030 2015-10-02 NA NA NA NA NA NA NA NA NA
## 3 486030 2015-10-03 NA NA NA NA NA NA NA NA NA
## 4 486030 2015-10-04 NA NA NA NA NA NA NA NA NA
## 5 486030 2015-10-05 NA NA NA NA NA NA NA NA NA
## 6 486030 2015-10-06 NA NA NA NA NA NA NA NA NA
## 7 486030 2015-10-07 NA NA NA NA NA NA NA NA NA
## 8 486030 2015-10-08 NA NA NA NA NA NA NA NA NA
## 9 486030 2015-10-09 NA NA NA NA NA NA NA NA NA
## 10 486030 2015-10-10 NA NA NA NA NA NA NA NA NA
## # … with 73,382 more rows, and 5 more variables: vg <dbl>, ra <lgl>,
## # sn <lgl>, ts <lgl>, fg <lgl>
The meanings of the variables are as follows: ta, tx and tn are the average, maximum and minimum temperatures (°C); slp is the atmospheric pressure at sea level (hPa); h is the average relative humidity (%); pp is the total precipitation (mm); vv is the average visibility (km); v, vm and vg are the average, maximum sustained and maximum gust wind speeds (km/h); and ra, sn, ts and fg indicate the occurrence of rain or drizzle, snow, thunderstorm and fog, respectively.
If you don’t want to spend the 2 hours downloading the data from TuTiempo.net, you can download them in CSV form, directly from here:
climate <- readr::read_csv("https://raw.githubusercontent.com/choisy/DMo2019/master/data/climate.csv")