Proceeding Paper

Robust Methods for Soft Clustering of Multidimensional Time Series †

by Ángel López-Oriona 1,*, Pierpaolo D’Urso 2, José A. Vilar 1,3 and Borja Lafuente-Rego 1

1 Research Group MODES, Research Center for Information and Communication Technologies (CITIC), University of A Coruña, 15071 A Coruña, Spain
2 Department of Economics, Sapienza University of Rome, Piazzale Aldo Moro 5, 00185 Rome, Italy
3 Technological Institute for Industrial Mathematics (ITMATI), 15782 Santiago de Compostela, Spain
* Author to whom correspondence should be addressed.
† Presented at the 4th XoveTIC Conference, A Coruña, Spain, 7–8 October 2021.
Eng. Proc. 2021, 7(1), 60; https://doi.org/10.3390/engproc2021007060
Published: 12 November 2021
(This article belongs to the Proceedings of The 4th XoveTIC Conference)

Abstract

Three robust algorithms for clustering multidimensional time series from the perspective of underlying processes are proposed. The methods are robust extensions of a fuzzy C-means model based on estimates of the quantile cross-spectral density. Robustness to the presence of anomalous elements is achieved by using the so-called metric, noise and trimmed approaches. Analyses from a wide simulation study indicate that the algorithms are substantially effective in coping with the presence of outlying series, clearly outperforming alternative procedures. The usefulness of the suggested methods is also highlighted by means of a specific application.

1. Introduction

Clustering of time series is a pivotal problem in statistics with several applications [1,2]. Generally, the goal is to divide a collection of unlabelled time series into homogeneous groups so that intra-cluster similarity is maximized whereas inter-cluster similarity is minimized. Most of the current techniques deal with univariate time series (UTS), while clustering of multidimensional time series (MTS) has received limited attention. This paper proposes three robust clustering methods for MTS. All of them are aimed at neutralizing the effect of outlying series while detecting the underlying grouping structure.

2. Robust Clustering Methods for Multivariate Time Series

Let $\{\boldsymbol{X}_t, t \in \mathbb{Z}\} = \{(X_{t,1}, \ldots, X_{t,d}), t \in \mathbb{Z}\}$ be a $d$-variate real-valued strictly stationary stochastic process. Let $F_j$ be the marginal distribution function of $X_{t,j}$, $j = 1, \ldots, d$, and let $q_j(\tau) = F_j^{-1}(\tau)$, $\tau \in [0, 1]$, be the corresponding quantile function. For a fixed $l \in \mathbb{Z}$ and an arbitrary pair of quantile levels $(\tau, \tau') \in [0, 1]^2$, consider the cross-covariance of the indicator functions $I\{X_{t,j_1} \le q_{j_1}(\tau)\}$ and $I\{X_{t+l,j_2} \le q_{j_2}(\tau')\}$,
$$\gamma_{j_1,j_2}(l, \tau, \tau') = \operatorname{Cov}\left( I\{X_{t,j_1} \le q_{j_1}(\tau)\},\; I\{X_{t+l,j_2} \le q_{j_2}(\tau')\} \right),$$
for $1 \le j_1, j_2 \le d$. Taking $j_1 = j_2 = j$, the function $\gamma_{j,j}(l, \tau, \tau')$, with $(\tau, \tau') \in [0, 1]^2$, the so-called quantile autocovariance function (QAF) of lag $l$, generalizes the traditional autocovariance function.
For the multivariate process $\{\boldsymbol{X}_t, t \in \mathbb{Z}\}$, we can consider the $d \times d$ matrix $\boldsymbol{\Gamma}(l, \tau, \tau') = \left( \gamma_{j_1,j_2}(l, \tau, \tau') \right)_{1 \le j_1, j_2 \le d}$, which simultaneously gives information about both the cross-dependence (when $j_1 \neq j_2$) and the serial dependence (since there is a lag $l$).
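The quantile cross-covariance and the matrix built from it can be illustrated with a plain sample estimate. The following is a minimal sketch, assuming the series is stored as a NumPy array of shape (T, d) and that population quantiles are replaced by empirical ones; the function names `quantile_cross_cov` and `gamma_matrix` are hypothetical and are not taken from the paper.

```python
import numpy as np

def quantile_cross_cov(x, j1, j2, lag, tau1, tau2):
    """Sample analogue of gamma_{j1,j2}(lag, tau1, tau2) for a (T, d) array x."""
    if lag < 0:
        # By stationarity, gamma_{j1,j2}(-l, tau, tau') = gamma_{j2,j1}(l, tau', tau).
        return quantile_cross_cov(x, j2, j1, -lag, tau2, tau1)
    T = x.shape[0]
    q1 = np.quantile(x[:, j1], tau1)  # empirical quantile of component j1
    q2 = np.quantile(x[:, j2], tau2)  # empirical quantile of component j2
    a = (x[: T - lag, j1] <= q1).astype(float)  # I{X_{t,j1} <= q_{j1}(tau)}
    b = (x[lag:, j2] <= q2).astype(float)       # I{X_{t+l,j2} <= q_{j2}(tau')}
    return np.mean(a * b) - np.mean(a) * np.mean(b)

def gamma_matrix(x, lag, tau1, tau2):
    """d x d matrix of sample quantile cross-covariances at a given lag."""
    d = x.shape[1]
    return np.array([[quantile_cross_cov(x, j1, j2, lag, tau1, tau2)
                      for j2 in range(d)] for j1 in range(d)])

rng = np.random.default_rng(0)
x = rng.normal(size=(1000, 2))
print(gamma_matrix(x, lag=1, tau1=0.1, tau2=0.9))
```

The diagonal entries of the returned matrix are sample QAF values, and the off-diagonal entries capture cross-dependence between the two components at the chosen lag and quantile levels.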
Under appropriate summability conditions (mixing conditions), we can define the Fourier transform of the cross-covariances. In this regard, the quantile cross-spectral density is given by
$$f_{j_1,j_2}(\omega, \tau, \tau') = \frac{1}{2\pi} \sum_{l=-\infty}^{\infty} \gamma_{j_1,j_2}(l, \tau, \tau')\, e^{-il\omega},$$
for $1 \le j_1, j_2 \le d$, $\omega \in \mathbb{R}$ and $\tau, \tau' \in [0, 1]$. Note that $f_{j_1,j_2}(\omega, \tau, \tau')$ is complex-valued.
The quantile cross-spectral density contains information about the general dependence patterns of a given stochastic process. For a specific realization of the process, this quantity can be consistently estimated by means of the so-called smoothed CCR-periodogram, $\hat{G}_{T,R}^{j_1,j_2}(\omega, \tau, \tau')$, proposed by [3].
Based on previous remarks, a simple dissimilarity measure between two realizations of the $d$-variate process (MTS) can be defined as follows. Given the $i$-th MTS, $\boldsymbol{X}_t^{(i)}$, consider the set $\boldsymbol{G}^{(i)} = \{ \hat{G}_{T,R}^{j_1,j_2}(\omega, \tau, \tau'),\ j_1, j_2 = 1, \ldots, d,\ \omega \in \Omega,\ \tau, \tau' \in \mathcal{T} \}$, where $\Omega$ is the set of Fourier frequencies and $\mathcal{T} = \{0.1, 0.5, 0.9\}$. Let $\boldsymbol{\Psi}^{(i)}$ be the vector formed by concatenating the elements of the set $\boldsymbol{G}^{(i)}$. The dissimilarity measure between the series $\boldsymbol{X}_t^{(1)}$ and $\boldsymbol{X}_t^{(2)}$ is defined as the Euclidean distance between the complex vectors $\boldsymbol{\Psi}^{(1)}$ and $\boldsymbol{\Psi}^{(2)}$. We call this dissimilarity $d_{QCD}$.
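The construction of the feature vector and of the distance can be sketched in code. Note that the sketch below does not implement the smoothed CCR-periodogram of [3]; as a stand-in, it assumes a simple Bartlett lag-window estimate built from centred quantile indicators, takes the Fourier frequencies as $2\pi k / T$ in $[0, \pi]$, and uses the hypothetical names `qcd_features` and `d_qcd`. It is only meant to show how the complex vector is assembled and compared.

```python
import numpy as np

TAUS = (0.1, 0.5, 0.9)  # quantile levels used in the text

def qcd_features(x, max_lag=None):
    """Concatenate crude quantile cross-spectral estimates for a (T, d) series."""
    T, d = x.shape
    if max_lag is None:
        max_lag = int(np.sqrt(T))  # assumed truncation rule, not from the paper
    lags = np.arange(-max_lag, max_lag + 1)
    weights = 1.0 - np.abs(lags) / (max_lag + 1)    # Bartlett lag window
    omegas = 2 * np.pi * np.arange(T // 2 + 1) / T  # assumed frequency grid
    q = np.quantile(x, TAUS, axis=0)                # shape (n_tau, d)
    ind = (x[:, :, None] <= q.T[None, :, :]).astype(float)  # (T, d, n_tau)
    ind -= ind.mean(axis=0)                         # centre the indicators
    fourier = np.exp(-1j * np.outer(lags, omegas))  # e^{-i l omega}
    feats = []
    for j1 in range(d):
        for j2 in range(d):
            for a in range(len(TAUS)):
                for b in range(len(TAUS)):
                    u, v = ind[:, j1, a], ind[:, j2, b]
                    # Sample cross-covariance of the indicators at each lag.
                    gam = np.array([
                        np.dot(u[: T - l], v[l:]) / T if l >= 0
                        else np.dot(u[-l:], v[: T + l]) / T
                        for l in lags
                    ])
                    feats.append((weights * gam) @ fourier / (2 * np.pi))
    return np.concatenate(feats)

def d_qcd(x1, x2):
    """Euclidean distance between the complex feature vectors of two series."""
    return np.linalg.norm(qcd_features(x1) - qcd_features(x2))

rng = np.random.default_rng(1)
x, y = rng.normal(size=(64, 2)), rng.normal(size=(64, 2))
print(d_qcd(x, y))
```

Both series must have the same length and dimension here so that the two feature vectors are comparable entry by entry.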
The dissimilarity $d_{QCD}$ is used to develop three robust fuzzy clustering methods. All of them assume that we want to group $n$ MTS into $C$ clusters, and are based on the traditional fuzzy C-means clustering algorithm. They look for the set of centroids $\bar{\boldsymbol{\Psi}} = \{ \bar{\boldsymbol{\Psi}}^{(1)}, \ldots, \bar{\boldsymbol{\Psi}}^{(C)} \}$, and the $n \times C$ matrix of fuzzy coefficients, $\boldsymbol{U} = (u_{ic})$, $i = 1, \ldots, n$, $c = 1, \ldots, C$, which define the solution of a given minimization problem. The quantity $u_{ic}$ represents the membership degree of the $i$-th MTS in the $c$-th cluster. The minimization problem for the first method is the following:
$$\min_{\bar{\boldsymbol{\Psi}}, \boldsymbol{U}} \sum_{i=1}^{n} \sum_{c=1}^{C} u_{ic}^{m} \left[ 1 - \exp\left\{ -\beta \left\| \boldsymbol{\Psi}^{(i)} - \bar{\boldsymbol{\Psi}}^{(c)} \right\|_2^2 \right\} \right] \quad \text{subject to} \quad \sum_{c=1}^{C} u_{ic} = 1 \ \text{ and } \ u_{ic} \ge 0,$$
where $\beta$ is a hyperparameter that needs to be set in advance and $m$ is a parameter which determines the fuzziness of the partition, frequently called the fuzziness parameter.
The exponential distance is used in the previous model because it is capable of neutralizing the effect of outlying series by spreading out their membership degrees between the different clusters [4].
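The way the exponential distance spreads the memberships of an outlier can be seen in a small alternating-optimization sketch. The update rules below are the standard fixed-point equations obtained by setting the derivatives of the exponential objective to zero (in the spirit of [4]); they are an illustration under that assumption, not necessarily the exact iteration used by the authors. The names `exp_memberships` and `exp_fcm` and the toy data are hypothetical.

```python
import numpy as np

def exp_memberships(psi, centroids, m, beta):
    """Memberships and kernel values for the exponential distance."""
    d2 = np.sum(np.abs(psi[:, None, :] - centroids[None, :, :]) ** 2, axis=2)
    kernel = np.exp(-beta * d2)             # values in [0, 1]
    loss = np.maximum(1.0 - kernel, 1e-12)  # exponential dissimilarity, clipped
    inv = loss ** (-1.0 / (m - 1.0))
    return inv / inv.sum(axis=1, keepdims=True), kernel

def exp_fcm(psi, init_centroids, m=2.0, beta=1.0, n_iter=50):
    """Alternate membership and centroid updates for a fixed number of steps."""
    centroids = np.array(init_centroids, dtype=psi.dtype)
    for _ in range(n_iter):
        u, kernel = exp_memberships(psi, centroids, m, beta)
        w = (u ** m) * kernel  # far-away series get weights close to zero
        centroids = (w.T @ psi) / w.sum(axis=0)[:, None]
    u, _ = exp_memberships(psi, centroids, m, beta)
    return u, centroids

rng = np.random.default_rng(0)
group_a = rng.normal(0.0, 0.05, size=(10, 2))
group_b = rng.normal(5.0, 0.05, size=(10, 2))
outlier = np.array([[100.0, -100.0]])
psi = np.vstack([group_a, group_b, outlier])  # toy real-valued features
u, centroids = exp_fcm(psi, psi[[0, 10]])
print(np.round(u[-1], 3))  # memberships of the outlier
```

For a series far from every centroid the kernel vanishes, so its dissimilarity to each cluster is close to 1 and its memberships are close to $1/C$, while its weight in the centroid update is close to zero.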
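The second robust procedure follows the noise cluster approach, and takes into account the following minimization problem:
$$\min_{\bar{\boldsymbol{\Psi}}, \boldsymbol{U}} \sum_{i=1}^{n} \sum_{c=1}^{C-1} u_{ic}^{m} \left\| \boldsymbol{\Psi}^{(i)} - \bar{\boldsymbol{\Psi}}^{(c)} \right\|_2^2 + \sum_{i=1}^{n} \delta^2 \left( 1 - \sum_{c=1}^{C-1} u_{ic} \right)^{m} \quad \text{subject to} \quad \sum_{c=1}^{C} u_{ic} = 1 \ \text{ and } \ u_{ic} \ge 0,$$
where $\delta > 0$ is a parameter known as the noise distance, which has to be specified in advance.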
The previous model includes $C$ groups, but only $(C - 1)$ are “real” clusters. The noise cluster is artificially created for outlier identification purposes. The aim is to locate the outliers and place them in the noise cluster, which is represented by a fictitious prototype that has a constant distance from every MTS (the noise distance, $\delta$).
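The role of the fictitious prototype can be made concrete with the membership update for fixed centroids. The sketch below treats the noise cluster as an extra prototype at squared distance $\delta^2$ from every series and applies the usual fuzzy C-means membership formula; this is the standard noise-clustering update and is assumed here rather than quoted from the paper. The function name `noise_memberships` and the example distances are hypothetical.

```python
import numpy as np

def noise_memberships(d2, delta, m=2.0):
    """Memberships for (C-1) real clusters plus a noise cluster.

    d2: (n, C-1) squared distances to the real centroids.
    Returns an (n, C) matrix whose last column is the noise membership.
    """
    noise_col = np.full(d2.shape[0], delta ** 2)  # constant noise distance
    d2_all = np.column_stack([d2, noise_col])
    inv = np.maximum(d2_all, 1e-12) ** (-1.0 / (m - 1.0))
    return inv / inv.sum(axis=1, keepdims=True)

# First series is close to cluster 1; second is far from both real clusters.
d2 = np.array([[0.01, 4.0],
               [100.0, 120.0]])
print(np.round(noise_memberships(d2, delta=1.0), 3))
```

A series whose distance to every real centroid is large relative to $\delta$ receives most of its membership in the last column, which is how outliers end up in the noise cluster.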
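The third technique can be expressed by means of the minimization problem:
$$\min_{\boldsymbol{Y}, \boldsymbol{U}} \sum_{i=1}^{H(\alpha)} \sum_{c=1}^{C} u_{ic}^{m} \left\| \boldsymbol{\Psi}^{(i)} - \bar{\boldsymbol{\Psi}}^{(c)} \right\|_2^2 \quad \text{subject to} \quad \sum_{c=1}^{C} u_{ic} = 1 \ \text{ and } \ u_{ic} \ge 0,$$
where $\boldsymbol{Y}$ ranges over all the subsets of $\boldsymbol{\Psi} = \{ \boldsymbol{\Psi}^{(1)}, \ldots, \boldsymbol{\Psi}^{(n)} \}$ of size $H(\alpha) = \lfloor n(1 - \alpha) \rfloor$. The model attains its robustness by removing a certain proportion of the series and requires the specification of the fraction $\alpha$ of the data to be trimmed.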
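A single trimming step for fixed centroids can be sketched as follows. The text does not state how the retained subset is selected in practice; the sketch assumes one common heuristic, namely ranking the series by the squared distance to their closest centroid and keeping the $H(\alpha)$ smallest, and then computing standard fuzzy C-means memberships for the retained series. The function name `trimmed_step` and the toy distances are hypothetical.

```python
import numpy as np

def trimmed_step(d2, alpha, m=2.0):
    """Keep the H(alpha) series closest to some centroid and compute memberships.

    d2: (n, C) squared distances between the n series and the C centroids.
    Returns the sorted indices of the retained series and their memberships.
    """
    n = d2.shape[0]
    # Small epsilon guards against floating-point rounding in n * (1 - alpha).
    h = int(np.floor(n * (1.0 - alpha) + 1e-9))
    closest = d2.min(axis=1)               # assumed ranking criterion
    keep = np.sort(np.argsort(closest)[:h])
    inv = np.maximum(d2[keep], 1e-12) ** (-1.0 / (m - 1.0))
    u = inv / inv.sum(axis=1, keepdims=True)
    return keep, u

d2 = np.ones((10, 2))
d2[:, 0] = 0.1
d2[3] = [50.0, 60.0]  # two series far from both centroids
d2[7] = [80.0, 90.0]
keep, u = trimmed_step(d2, alpha=0.2)
print(keep)  # indices of the retained series
```

In a full algorithm this step would alternate with a centroid update computed only from the retained series; the trimmed series receive no membership degrees at all.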
The three previously presented robust models have been analysed by means of a broad simulation study containing a wide variety of generating processes. Two alternative dissimilarities were taken into account for comparison purposes [5,6]. In all cases, the three proposed algorithms outperformed the competitors.

3. Application to Real Data

The three techniques proposed in Section 2 were applied to perform clustering in a real MTS database. Specifically, we considered daily stock returns and trading volume of the top 20 companies of the S&P 500 index, thus obtaining 20 bivariate MTS. Table 1 shows the membership degrees of the series obtained with the trimmed approach.
The companies which were trimmed away, shown without membership degrees in Table 1, are Berkshire Hathaway (BRK.B), Walmart (WMT) and Home Depot (HD). Similar clustering solutions were obtained with the remaining two methods.

4. Conclusions

This work proposes three robust methods to perform fuzzy clustering of MTS. They are based on the so-called exponential, noise and trimmed ideas. Each approach attains robustness to outlying series in a different way. The three procedures have been presented and assessed through a wide simulation study, substantially outperforming alternative approaches. A real data application has also been carried out in order to show the usefulness of the presented techniques.

Acknowledgments

This research has been supported by MINECO (MTM2017-82724-R and PID2020-113578RB-100), the Xunta de Galicia (ED431C-2020-14), and “CITIC” (ED431G 2019/01).

References

  1. Liao, T.W. Clustering of time series data—A survey. Pattern Recognit. 2005, 38, 1857–1874.
  2. Aghabozorgi, S.; Shirkhorshidi, A.S.; Wah, T.Y. Time-series clustering—A decade review. Inf. Syst. 2015, 53, 16–38.
  3. Baruník, J.; Kley, T. Quantile coherency: A general measure for dependence between cyclical economic variables. Econom. J. 2019, 22, 131–152.
  4. Wu, K.L.; Yang, M.S. Alternative c-means clustering algorithms. Pattern Recognit. 2002, 35, 2267–2278.
  5. D’Urso, P.; Maharaj, E.A. Autocorrelation-based fuzzy clustering of time series. Fuzzy Sets Syst. 2009, 160, 3565–3589.
  6. D’Urso, P.; Maharaj, E.A. Wavelets-based clustering of multivariate time series. Fuzzy Sets Syst. 2012, 193, 33–61.
Table 1. Membership degrees for the top 20 companies in the S&P 500 index by considering the trimmed approach and a 6-cluster partition.
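| Company | $C_1$ | $C_2$ | $C_3$ | $C_4$ | $C_5$ | $C_6$ |
|---------|-------|-------|-------|-------|-------|-------|
| AAPL    | 0.083 | 0.146 | 0.299 | 0.365 | 0.066 | 0.041 |
| MSFT    | 0.107 | 0.049 | 0.213 | 0.356 | 0.099 | 0.176 |
| AMZN    | 0.865 | 0.017 | 0.051 | 0.032 | 0.010 | 0.025 |
| GOOGL   | 0.682 | 0.032 | 0.092 | 0.128 | 0.025 | 0.040 |
| GOOG    | 0.902 | 0.010 | 0.031 | 0.028 | 0.008 | 0.022 |
| FB      | 0.002 | 0.983 | 0.006 | 0.004 | 0.003 | 0.002 |
| TSLA    | 0.023 | 0.012 | 0.056 | 0.885 | 0.013 | 0.010 |
| BRK.B   | -     | -     | -     | -     | -     | -     |
| V       | 0.004 | 0.014 | 0.015 | 0.017 | 0.941 | 0.009 |
| JNJ     | 0.004 | 0.015 | 0.019 | 0.013 | 0.937 | 0.013 |
| WMT     | -     | -     | -     | -     | -     | -     |
| JPM     | 0.002 | 0.001 | 0.003 | 0.003 | 0.002 | 0.989 |
| MA      | 0.005 | 0.006 | 0.968 | 0.010 | 0.005 | 0.006 |
| PG      | 0.015 | 0.012 | 0.028 | 0.016 | 0.019 | 0.909 |
| UNH     | 0.006 | 0.924 | 0.026 | 0.013 | 0.022 | 0.008 |
| DIS     | 0.020 | 0.038 | 0.772 | 0.099 | 0.042 | 0.030 |
| NVDA    | 0.025 | 0.020 | 0.085 | 0.804 | 0.043 | 0.024 |
| HD      | -     | -     | -     | -     | -     | -     |
| PYPL    | 0.155 | 0.301 | 0.297 | 0.115 | 0.057 | 0.075 |
| BAC     | 0.076 | 0.086 | 0.225 | 0.067 | 0.060 | 0.485 |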
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
