---
title: "Clustering time series using funtimes package"
author: "Srishti Vishwakarma"
date: "`r Sys.Date()`"
output:
  rmarkdown::html_vignette:
    toc: true
    toc_depth: 3
    number_sections: true
bibliography: vignrefs.bib
vignette: >
  %\VignetteIndexEntry{Clustering time series using funtimes package}
  %\VignetteEngine{knitr::rmarkdown}
  \usepackage[utf8]{inputenc}
---

```{r, echo = FALSE, message = FALSE}
knitr::opts_chunk$set(collapse = TRUE, comment = "#", fig.width = 7, fig.height = 6)
```

# Introduction

In this tutorial, two unsupervised clustering algorithms from the `funtimes` package are used to identify clusters of Australia's sea level time series.

## Loading libraries

First, load the essential libraries for the analysis:

```{r echo=TRUE, warning=FALSE}
library(funtimes)
library(ggplot2)
library(gridExtra)
library(readxl)
library(reshape2)
```

# Data

The daily sea level data are available from 1993 to 2012 for 17 locations. The data are obtained from @Maharaj_etal_2019 using the following link . Download the `Application7_3.zip` folder, in which the `Aus Sea Levels 17.xlsx` file contains the sea level records. The daily records are averaged by year to convert them to an annual resolution.

```{r, eval=FALSE}
# read the daily records
d_org <- readxl::read_xlsx("Aus_Sea_Levels_17.xlsx", skip = 1, n_max = 7300)
# yearly average
d <- data.frame(aggregate(d_org[, 4:20], list(d_org$Year), FUN = 'mean', na.rm = TRUE)[, -1],
                row.names = unique(d_org$Year))
```

```{r, echo=FALSE}
# saveRDS(d, "Aus_Sea_Levels_17.rds")
d <- readRDS("Aus_Sea_Levels_17.rds")
```

## Plotting time series

Below is the plot of the annual sea level time series for the 17 locations:

```{r}
# reshape to long format: one row per location and year
dlong <- reshape2::melt(t(d))
names(dlong)[1:2] <- c("Location", "Year")
ggplot(dlong) +
    geom_line(aes(x = Year, y = value, color = Location), size = 1) +
    ylab('Sea level (m)') +
    theme_bw()
```

This plot demonstrates the variation in the sea levels across the locations.
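As a rough numerical complement to the plot, the slope of a simple linear trend can be estimated for each location. The chunk below is only a minimal sketch and is not part of the `funtimes` workflow: it assumes that the row names of `d` hold the years (as set when `d` was created above), and the names `years` and `slopes` are introduced purely for this illustration.

```{r, eval=FALSE}
# Assumption: row names of d are the calendar years (1993-2012)
years <- as.numeric(rownames(d))
# Ordinary least squares slope (change in sea level, m per year) for each location
slopes <- sapply(d, function(y) coef(lm(y ~ years))[["years"]])
# Sort to compare the estimated linear trends across locations
sort(slopes)
```

Large differences among the sorted slopes would suggest that a single linear trend may not describe all the locations.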
It can be seen that not all the time series share a common trend since 1993. Grouping the locations with a common trend could help the Australian government assess and implement climate adaptation strategies for the impact of sea level rise at the clustered locations.

# Clustering time series based on trend synchronism

The first function from the package to test is `sync_cluster`, which groups the time series with a common linear trend. The `Window` argument, passed on to the underlying synchronism test, sets the length of the local window, and `B` sets the number of bootstrap replications. If the time series are short (here, only 20 annual observations), the window needs to be set manually.

```{r}
set.seed(123)
Clus_sync <- sync_cluster(d ~ t, Window = 3, B = 100)
Clus_sync
```

In total, `r sum(Clus_sync$cluster != 0)` locations are clustered with a common linear trend, while the remaining `r sum(Clus_sync$cluster == 0)` are not tied to any other location and form the so-called noise cluster.

Below is the plot of the clustered sea level time series, where `Cluster 0` is the noise cluster without any common linear trend, and `Cluster 1` shows the time series of locations with a common linear trend:

```{r}
# build one plot per cluster label and store it as py0, py1, ...
for (i in 0:max(Clus_sync$cluster)) {
    assign(paste('py', i, sep = ''),
           ggplot(melt(t(d[, Clus_sync$cluster == i]))) +
               geom_line(aes(x = Var2, y = value, color = Var1), size = 1) +
               ylab('Sea level (m)') +
               xlab('Year') +
               theme_bw() +
               ggtitle(paste('Cluster', i)) +
               theme(axis.text = element_text(size = 13),
                     axis.title.x = element_text(size = 15),
                     axis.title.y = element_text(size = 15),
                     legend.text = element_text(size = 10),
                     legend.title = element_blank(),
                     legend.key.size = unit(0.3, "cm")))
}
grid.arrange(py0, py1)
```

# Clustering time series using a spatiotemporal approach

The `BICC` function applies an unsupervised spatiotemporal clustering algorithm, TRUST, from @Ciampi_etal_2010.
The algorithm has a few tuning parameters, and the `BICC` function automatically selects two of those (`Delta` and `Epsilon`; for manual setting of all the parameters, use the lower-level functions `CSlideCluster` and `CWindowCluster`). First, the time series are clustered within small slides; the length of the slides is defined with the parameter `p` (i.e., the number of time-series observations in each slide). Then, slides are aggregated into windows (each window contains `w` consecutive slides), and slide-level cluster assignments are used to cluster the time series at the window level. When defining the windows, the user can also set the step `s`, which is the number of slides by which the window is shifted (if `s = w`, the windows do not overlap).

```{r}
Clus_BICC <- BICC(as.matrix(d), p = 5, w = 4, s = 4)
Clus_BICC
```

The algorithm detected only one cluster.

# Citation {-}

This vignette belongs to the R package `funtimes`. If you wish to cite this page, please cite the package:

```{r}
citation("funtimes")
```

# References