Time Series Clustering and Classification

Time series clustering and classification are relevant to a diverse range of fields, including geology, medicine, environmental science, finance and economics. Clustering is an unsupervised approach to grouping similar items of interest and was initially applied to cross-sectional data. However, clustering time series data has become a popular research topic over the past three to four decades, and a rich literature exists on this topic. A set of time series can be clustered using conventional hierarchical and non-hierarchical clustering methods, fuzzy clustering methods, machine learning methods and model-based methods.

Actual time series observations can be clustered (e.g., D’Urso, 2000; Coppi and D’Urso, 2001; D’Urso, 2005), or features extracted from the time series can be clustered. Features are extracted in the time, frequency and wavelet domains. Clustering based on time domain features such as autocorrelations, partial autocorrelations and cross-correlations has been proposed by several authors, including Goutte et al. (1999), Galeano and Peña (2000), Dose and Cincotti (2005), Singhal and Seborg (2005), Caiado et al. (2006), Basalto et al. (2007), Wang et al. (2007), Takayuki et al. (2006), Ausloos and Lambiotte (2007), Miskiewicz and Ausloos (2008), and D’Urso and Maharaj (2009).
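As a minimal sketch of the time domain feature approach, the following Python code computes sample autocorrelations and compares two series by the Euclidean distance between their autocorrelation vectors. The function names, the choice of five lags and the toy sine and alternating series are assumptions made for illustration; they are not taken from any of the cited papers.

```python
import math


def autocorrelations(series, max_lag):
    """Return the sample autocorrelations at lags 1..max_lag."""
    n = len(series)
    mean = sum(series) / n
    centered = [x - mean for x in series]
    variance = sum(c * c for c in centered)
    return [
        sum(centered[t] * centered[t + lag] for t in range(n - lag)) / variance
        for lag in range(1, max_lag + 1)
    ]


def acf_distance(series_a, series_b, max_lag=5):
    """Euclidean distance between the autocorrelation vectors of two series."""
    acf_a = autocorrelations(series_a, max_lag)
    acf_b = autocorrelations(series_b, max_lag)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(acf_a, acf_b)))


# Toy data (illustrative only): two smooth series with the same dynamics
# but different phase, and one series that alternates in sign.
smooth_1 = [math.sin(0.3 * t) for t in range(100)]
smooth_2 = [math.sin(0.3 * t + 1.0) for t in range(100)]
alternating = [(-1) ** t for t in range(100)]

d_similar = acf_distance(smooth_1, smooth_2)
d_different = acf_distance(smooth_1, alternating)
```

The resulting pairwise distances could then be passed to any conventional hierarchical or non-hierarchical clustering method. In this toy case the two smooth series are much closer to each other than either is to the alternating series.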

In the frequency domain, features such as the periodogram and the spectral and cepstral ordinates are extracted; contributions to this literature include Zhang et al. (2005), Maharaj et al. (2010), D’Urso and Maharaj (2012) and D’Urso et al. (2014).
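To make the idea of frequency domain features concrete, here is a small sketch that computes periodogram ordinates at the Fourier frequencies with a direct discrete Fourier transform. The normalisation by the series length is one common convention among several and is an assumption here, as are the function name and the toy series.

```python
import cmath
import math


def periodogram(series):
    """Periodogram ordinates at Fourier frequencies 2*pi*k/n, k = 1..n//2."""
    n = len(series)
    mean = sum(series) / n
    centered = [x - mean for x in series]
    ordinates = []
    for k in range(1, n // 2 + 1):
        omega = 2 * math.pi * k / n
        dft = sum(x * cmath.exp(-1j * omega * t) for t, x in enumerate(centered))
        ordinates.append(abs(dft) ** 2 / n)
    return ordinates


# Toy series (illustrative only): a sine wave with period 10 observed
# for 100 time points, i.e. exactly 10 full cycles.
series = [math.sin(2 * math.pi * t / 10) for t in range(100)]
ordinates = periodogram(series)
peak_index = ordinates.index(max(ordinates))  # position of the dominant frequency
```

The vector of ordinates (or its logarithm, or a normalised version) can serve as the feature vector for each series. For long series a fast Fourier transform would normally replace the direct sum used in this sketch.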

The features extracted in the wavelet domain are discrete wavelet transforms (DWT), wavelet variances and wavelet correlations, and methods have been proposed by authors such as Zhang et al. (2005), Maharaj et al. (2010), D’Urso and Maharaj (2012) and D’Urso et al. (2014). Time series can also be modelled and the parameter estimates used as the clustering variables. Studies on model-based clustering methods include those by Piccolo (1990), Tong and Dabas (1990), Maharaj (1996, 2000), Kalpakis et al. (2001), Ramoni et al. (2002), Xiong and Yeung (2004), Boets (2005), Singhal and Seborg (2005), Savvides et al. (2008), Otranto (2008), Caiado and Crato (2010), D’Urso et al. (2013), Maharaj et al. (2016) and D’Urso et al. (2016).
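The model-based idea can be sketched with the simplest possible case: fit a first-order autoregressive model, AR(1), to each series and use the estimated coefficient as the clustering variable. The least squares estimator, the simulated data, the seed and all names below are assumptions chosen for illustration; the cited studies use richer models and distance measures.

```python
import random


def simulate_ar1(phi, n, rng):
    """Simulate n observations from x_t = phi * x_{t-1} + e_t, e_t ~ N(0, 1)."""
    series = [0.0]
    for _ in range(n - 1):
        series.append(phi * series[-1] + rng.gauss(0.0, 1.0))
    return series


def estimate_ar1(series):
    """Least squares estimate of the AR(1) coefficient of a mean-centred series."""
    n = len(series)
    mean = sum(series) / n
    centered = [x - mean for x in series]
    numerator = sum(centered[t] * centered[t - 1] for t in range(1, n))
    denominator = sum(c * c for c in centered[:-1])
    return numerator / denominator


rng = random.Random(0)  # fixed seed so the sketch is reproducible

# Two hypothetical groups of series with different dynamics.
persistent = [estimate_ar1(simulate_ar1(0.8, 2000, rng)) for _ in range(3)]
oscillating = [estimate_ar1(simulate_ar1(-0.5, 2000, rng)) for _ in range(3)]
```

The estimates for the two groups fall near their generating coefficients, so clustering on this single parameter would separate them. With higher-order models the clustering variables become vectors of parameter estimates.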

Classification is a supervised approach to grouping items of interest, and discriminant analysis and machine learning methods are amongst the approaches that have been used. Initially classification was applied to cross-sectional data, but a large literature now exists on the classification of time series, which includes many very useful applications. These time series classification methods include the use of feature-based, model-based and machine learning techniques. The features are extracted in the time domain (Chandler and Polonik, 2006; Maharaj, 2014), the frequency domain (Kakizawa et al., 1998; Maharaj, 2002; Shumway, 2003) and the wavelet domain (Maharaj, 2005; Maharaj and Alonso, 2007, 2014; Fryzlewicz and Ombao, 2012). Model-based approaches for time series classification include ARIMA models, Gaussian mixture models and Bayesian approaches (Maharaj, 1999, 2000; Sykacek and Roberts, 2002; Liu and Maharaj, 2013; Liu et al., 2014; Kotsifakos and Papapetrou, 2014), while machine learning approaches include classification trees, nearest neighbour methods and support vector machines (Douzal-Chouakria and Amblard, 2000; Do et al., 2017; Gudmundsson et al., 2008; Zhang et al., 2010).
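A feature-based nearest neighbour classifier can be sketched in a few lines: each labelled series is reduced to a short autocorrelation vector, and a new series receives the label of the training series whose features are closest. The feature choice (two lags), the labels and the toy sine series are assumptions for illustration only.

```python
import math


def acf_features(series, max_lag=2):
    """Sample autocorrelations at lags 1..max_lag, used as the feature vector."""
    n = len(series)
    mean = sum(series) / n
    centered = [x - mean for x in series]
    variance = sum(c * c for c in centered)
    return [
        sum(centered[t] * centered[t + lag] for t in range(n - lag)) / variance
        for lag in range(1, max_lag + 1)
    ]


def nearest_neighbour_label(query, training_set):
    """Return the label of the training series closest to query in feature space."""
    query_features = acf_features(query)

    def distance(item):
        return math.dist(query_features, acf_features(item[0]))

    return min(training_set, key=distance)[1]


def sine(frequency, n=120):
    """Toy series generator (illustrative only)."""
    return [math.sin(frequency * t) for t in range(n)]


# Hypothetical labelled training set: slowly and rapidly oscillating series.
training_set = [
    (sine(0.2), "slow"), (sine(0.3), "slow"),
    (sine(2.4), "fast"), (sine(2.6), "fast"),
]

predicted_slow = nearest_neighbour_label(sine(0.25), training_set)
predicted_fast = nearest_neighbour_label(sine(2.5), training_set)
```

The same structure applies when the features come from the frequency or wavelet domain; only the feature extraction function changes.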

It should be noted that clustering and classifying data evolving in time is substantially different from classifying static data. Hence, the bulk of the work on these topics focuses on extracting time series features or considering specific time series models, and on understanding the risks of directly extending commonly used metrics for static data to time series data.
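One such risk can be illustrated with a toy example: under the ordinary Euclidean distance applied to the raw observations, a periodic series can appear further from a time-shifted copy of itself than from a series with no dynamics at all. The series below are assumptions constructed purely to show this effect.

```python
import math

n = 200
period = 20

# A sine wave, the same wave shifted by half a period, and a flat series.
wave = [math.sin(2 * math.pi * t / period) for t in range(n)]
shifted = [math.sin(2 * math.pi * (t + period // 2) / period) for t in range(n)]
flat = [0.0] * n

# Euclidean distance on the raw observations, as used for static data.
dist_to_shifted = math.dist(wave, shifted)
dist_to_flat = math.dist(wave, flat)
```

Here the distance from the wave to its shifted copy is twice its distance to the flat series, even though the wave and its copy share exactly the same dynamics. Feature-based and model-based dissimilarities are designed to avoid this kind of outcome.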

Ann Maharaj, Pierpaolo D’Urso and Jorge Caiado