Efficient Time Series Clustering and Its Application to Social Network Mining

Cangqi Zhou; Qianchuan Zhao

doi:10.1515/jisys-2014-0005

Open Access Published by De Gruyter February 21, 2014

From the journal Journal of Intelligent Systems

Abstract

Mining time series data is of great significance in various areas. To efficiently find representative patterns in these data, this article focuses on the definition of a valid dissimilarity measure and the acceleration of partitioning clustering, a common group of techniques used to discover typical shapes of time series. Dissimilarity measure is a crucial component in clustering. It is required, by some particular applications, to be invariant to specific transformations. The rationale for using the angle between two time series to define a dissimilarity is analyzed. Moreover, our proposed measure satisfies the triangle inequality with specific restrictions. This property can be employed to accelerate clustering. An integrated algorithm is proposed. The experiments show that angle-based dissimilarity captures the essence of time series patterns that are invariant to amplitude scaling. In addition, the accelerated algorithm outperforms the standard one as redundancies are pruned. Our approach has been applied to discover typical patterns of information diffusion in an online social network. Analyses revealed the formation mechanisms of different patterns.

Keywords: Time series clustering; dissimilarity measure; algorithm acceleration; social network mining

1 Introduction

The ubiquitous use of dynamic data obtained from networks, such as wireless sensor networks (WSNs) and online social networks (OSNs), has created a great interest in the mining of these data. The analysis of dynamic data takes the domain of time into account. Time series, which record the dynamic variations of specific quantities in networks, is an important type of dynamic data. A time series is a sequence of points recording specific values or the number of occurrences of an event measured at successive time points, which are often spaced at equal intervals. Common examples include the periodic records of a sensor such as a thermometer, and the number of retweets in Twitter per hour.

The variations of “trends” are embedded in time series. Discovering the similarity of these patterns could be of great use. For instance, one may be looking for topics in social networks with similar variations in a specific time period. This information may be useful in predicting future trends. This task is known as representative pattern discovery, and a common method is clustering [5]. Clustering is aimed at identifying hidden structures in data sets. Among various clustering methods, partitioning clustering, such as k-means, will separate the data into several homogeneous clusters. In this article, we consider the problem of discovering similar patterns in time series data by partitioning clustering.

One crucial component in clustering is the dissimilarity measure used to compare two specific data points. A suitable dissimilarity measure must provide a concrete way of evaluating the distance between any two points. The shape of time series is the key factor in identifying pattern similarity. Some transformations, such as offset shifting and translation in time, do not change these shapes. Amplitude and time scaling change the shapes visually; the pattern variations of the time series, however, are essentially unchanged. Eliminating the effect of these transformations is useful in some applications. For instance, different sensors may not be synchronized to the same clock. Also, recorded physical values may be measured in different units, such as Celsius and Fahrenheit for temperature. Nevertheless, after eliminating the effect of translation and scaling, the recorded patterns may be similar to each other. This similarity reflects the intrinsic trend, which is independent of specific transformations.

We seek a dissimilarity measure that is invariant to specific transformations. Simply using Euclidean distance is not a good choice, as it is easily affected by scaling and shifting. Some researchers calculated distance-based dissimilarities after normalization, while Zhou et al. [25] pointed out that normalization methods do not employ the optimal scaling and shifting. Chu and Wong [3] viewed time series as points in vector space. They first projected these points to eliminate the effect of amplitude shifting, then determined the optimal scaling factors to obtain the minimum distance between two time series. Unfortunately, the measurement obtained by this method violates the symmetry property [10]. Dynamic time warping (DTW) is a robust measure for time series [12]. It takes into consideration the alignment along the time axis. Scaling of amplitude, however, was not a key consideration in the design of DTW. In addition, the calculation of this measure is time consuming. Zhou et al. concluded that the angle between two vectors is more intrinsically accurate than distance-based measures in capturing the essence of invariance. We will show that, as an application, the dissimilarity measure used by Yang and Leskovec [23] is actually an angle-based one, which supports the conclusion made by Zhou et al. In this article, we define an angle-based dissimilarity measure for time series clustering. We will show that this definition has some useful properties, such as the triangle inequality, but with certain restrictions. Many dynamic data obtained from networks actually obey these restrictions. Thus, our definition and its properties are significant to real-world applications.

Time series data are inherently large in size with high dimension, and this makes efficient mining of these data a non-trivial task, especially with the coming era of Big Data. Partitioning methods are relatively scalable, easy to implement, and faster than hierarchical methods [9]. The most typical and widely used partitioning method is k-means, and many other algorithms incorporate its basic ideas. This method consists of two steps: determine the “closest” center for each data point, and then update these centers. The algorithm iteratively runs the above steps until convergence. The dissimilarity is defined as the measure of “distance.” The approach of Yang and Leskovec [23] can be viewed as a variation of k-means, and it provides an effective method for updating centers.

In the processing of high-dimensional data, especially those with complex measures, calculating point–center dissimilarities is time consuming. Several methods have been developed to reduce redundancy in these calculations. One effective method is the indexing of the data [8]. It has been reported, however, that as the dimension of the data increases, the speed of the clustering procedures with indexing may decrease [17]. Elkan [4] accelerated k-means by using the triangle inequality property, which is the only a priori condition required for dissimilarities used in k-means. Elkan’s method avoids unnecessary calculations and produces results that are identical with those of the standard algorithm. As our dissimilarity obeys the triangle inequality under specific circumstances, we use Elkan’s method to accelerate our clustering task. Combining the clustering algorithm of Yang and Leskovec with Elkan’s method, we propose a rapid version of clustering algorithm with angle-based dissimilarity measure for discovering representative patterns in time series data.

In this work, we conducted several experiments to verify the rationale of our angle-based dissimilarity measure and the effectiveness of the accelerated clustering algorithm. By employing the 1-nearest neighbor (1NN) algorithm, we applied the dissimilarity measure to some benchmark data sets. In addition, we used our algorithm in repeating the Memetracker experiment of Yang and Leskovec [23]. We also applied our approach to a social network mining task, in order to discover meaningful patterns of information diffusion. The clustering results contained some typical single-peaked and multipeaked patterns. Analyses of the results revealed the mechanisms that lead to these differences. The rest of the article is organized as follows: Section 2 shows the motivation, the definition, and the properties of our angle-based dissimilarity measure. An accelerated clustering algorithm is introduced in Section 3. In Section 4, we show the details of the application to social network mining. Finally, we conclude and discuss future work in Section 5.

2 Dissimilarity Measure

In this section, we introduce the rationale of angle-based dissimilarity measure, which was originally interpreted by Zhou et al. [25]. Then, we provide the definition of our measure, AngDis, and prove the satisfaction of the triangle inequality under specific circumstances. We show that the measure not only addresses the essence of the problem of discovering representative patterns but also provides a feasible and efficient way for updating cluster centers. At the end of this section, we perform several experiments on some benchmark data sets to evaluate our proposed dissimilarity measure.

2.1 Motivation

Liao [22] listed nine similarity/distance measures for clustering time series, but none of them is based on angle. Distance-based measures, such as Euclidean distance, are still widely used. To eliminate insensitive transformations, distances are generally calculated after choosing some optimal factors for these transformations.

A time series x can be represented as a sequence of real numbers indexed by natural numbers, x = (x₁, x₂, …, x_d), where d is its length. It can be viewed as a point in d dimensional real vector space R^d. Chu and Wong [3] proposed an elegant method to deal with scaling and shifting. If x is scaled by a real number α, it becomes αx = (αx₁, …, αx_d). The vector x in R^d is multiplied by a scalar, which changes the vector’s norm but not its direction. If x is shifted by a real number β, it becomes x + βE = (x₁ + β, x₂ + β, …, x_d + β), where E is a vector in R^d and all elements of E equal 1. They did not take into account the translation transformation on time axis. Given two vectors x and y, the authors defined the dissimilarity between them as the minimum Euclidean distance with respect to optimal scalars α and β. Intuitively, this means one can scale and shift x with any real numbers until it reaches the minimum distance to y. Unfortunately, this measure is not symmetric [10]; in other words, dis(x, y) ≠ dis(y, x), where dis denotes the dissimilarity measure. This lack of symmetry is awkward and counterintuitive for most applications.

Zhou et al. [25], in fact, concluded that the angle between two vectors is the most important factor in describing the dissimilarity in shapes. Considering a plane P that is perpendicular to vector E, as shown in Figure 1, the projection of all the points on P will eliminate the effect of shifting. We then can scale the projected vectors into uniform norms where x becomes x_norm. Thus, these steps map the data points on a unit sphere on plane P. Because the shapes of these time series are required to be insensitive to scaling, the norms of these vectors are no longer crucial. The Euclidean distance between any two points on the unit sphere is therefore unimportant. Obviously, the angle θ (see Figure 1) between the radials on which two vectors lie reveals the difference between the vectors. The authors used cosθ as the similarity measure.

Figure 1

Projecting Time Series to Eliminate Shifting.

Yang and Leskovec [23] defined a measure of two time series x and y as follows:

$$\mathrm{dis}(x, y) = \min_{a,\,q} \frac{\lVert x - a\,y_{(q)} \rVert}{\lVert x \rVert}, \tag{1}$$

where a is the scaling factor, q is the translation factor, and ||·|| denotes the l₂ norm. Rather than solving the optimization problem numerically, we find that, with q fixed, the solution to Eq. (1) is the sine of the angle between x and y. Figure 2 shows the geometrical relationship between x and y. When y is allowed to be scaled, the optimization problem defines the solution as the minimum normalized distance between x and y. Clearly, the norm of x – a₂y is the minimum because this vector is perpendicular to y. After dividing by ||x||, we get sin θ. We can therefore scale all the vectors onto the unit sphere and then compare their differences by angles.

Figure 2

Geometrical Explanation to Eq. (1).
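
The reduction of Eq. (1) to sin θ for a fixed q can be checked with a short derivation. This is a sketch under the assumption that y ≠ 0; the symbol a* for the minimizing scaling factor and the function f are introduced here only for illustration, and x · y denotes the inner product.

```latex
\begin{align*}
f(a) &= \lVert x - a y \rVert^2
      = \lVert x \rVert^2 - 2a\,(x \cdot y) + a^2 \lVert y \rVert^2, \\
f'(a^*) = 0 \;&\Longrightarrow\; a^* = \frac{x \cdot y}{\lVert y \rVert^2}, \\
f(a^*) &= \lVert x \rVert^2 - \frac{(x \cdot y)^2}{\lVert y \rVert^2}
        = \lVert x \rVert^2 \left(1 - \cos^2\theta\right)
        = \lVert x \rVert^2 \sin^2\theta, \\
\min_a \frac{\lVert x - a y \rVert}{\lVert x \rVert} &= \sin\theta .
\end{align*}
```

The residual x − a*y is orthogonal to y, which matches the geometric argument that the shortest segment from x to the line spanned by y is the perpendicular one.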

The research described above provides the rationale for using angle-based dissimilarity for time series. In the remainder of this section, we define our measure and demonstrate its properties.

2.2 Definition

In practice, the values of the elements of some time series are real numbers, while some of the values are non-negative numbers. The latter case occurs when someone expects to record the number of occurrences of an event per unit time. This condition restricts the angle between any two time series in the domain of [0, π/2] because any inner product of two vectors with non-negative elements is greater than or equal to 0. In addition, sometimes when we record the entirety of an event from its start time to its end time, we are sensitive to amplitude shifting because the beginning and ending points are both zero. Thus, we restrict our problem to defining a suitable dissimilarity measure invariant to amplitude scaling for those vectors with non-negative elements. In Section 4, we will show that some applications actually satisfy these constraints. More complex situations will be discussed in future work.

Consider vectors x = (x₁, …, x_d) and y = (y₁, …, y_d), where x_i, y_i ≥ 0, i = 1, …, d. With respect to scaling invariance, the dissimilarity between them is defined as

$$\mathrm{AngDis}(x, y) = \sin\theta_{xy}, \tag{2}$$

where θ_xy is the angle between x and y. Clearly, this definition is independent of the norms of vectors. Thus, it is invariant to amplitude scaling of x and y. As shown in Figure 2, this definition indicates the component perpendicular to y, which reveals the dissimilarity part between x and y. The more similar a pair of vectors are, the lower the value. Although the form of our definition is quite simple, it is very useful in the task of discovering representative prototypes in a set of time sequences by classification or clustering. This measure is normalized into [0,1] with our restrictions, and it can be evaluated by

$$\sin\theta_{xy} = \sqrt{1 - \cos^2\theta_{xy}} = \sqrt{1 - \left(\frac{x \cdot y}{\lVert x \rVert\,\lVert y \rVert}\right)^{2}} \tag{3}$$

If vector x can be scaled to x′, that is, if they are in the same direction in R^d, they are considered similar to each other. As mentioned by Zhou et al. [25] and Goldin et al. [7], this similarity relation is an equivalence relation. Thus, the vectors that are similar to each other form an equivalence class. Our definition can be viewed as a measure for equivalence classes.
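
Eq. (3) translates directly into a few lines of code. The following is a minimal sketch in Python, assuming non-zero vectors with non-negative elements; the function name `ang_dis` and the toy vectors are illustrative choices, not part of the original work.

```python
import math

def ang_dis(x, y):
    """Sine of the angle between two non-zero, non-negative vectors (Eq. 3)."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    cos_theta = dot / (norm_x * norm_y)
    # Clamp to [0, 1] to guard against rounding for (nearly) parallel vectors.
    cos_theta = min(1.0, max(0.0, cos_theta))
    return math.sqrt(1.0 - cos_theta * cos_theta)

x = [0.0, 1.0, 3.0, 1.0, 0.0]
y = [0.0, 2.0, 6.0, 2.0, 0.0]   # same shape as x, twice the amplitude
z = [3.0, 1.0, 0.0, 0.0, 0.0]   # different shape

print(ang_dis(x, y))  # close to 0: y is a scaled copy of x
print(ang_dis(x, z))  # close to 1: the shapes barely overlap
```

Because only the direction of each vector enters the formula, x and y fall into the same equivalence class, while z is far from both.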

2.3 Proof of the Triangle Inequality

Some properties of a dissimilarity measure are useful for some applications. For instance, satisfying the symmetry property is important for clustering or classification; otherwise, one might obtain counterintuitive results. Moreover, the triangle inequality can be used to accelerate clustering. Strictly speaking, a dissimilarity metric must satisfy the properties of symmetry, non-negativity, self-identity, and the triangle inequality, whereas a dissimilarity measure need not satisfy the triangle inequality. Euclidean distance is a well-known metric for time series. DTW can only be called a measure because it violates the triangle inequality [12], which explains why DTW is difficult to use for exact indexing [12] and averaging [16]. AngDis satisfies the above four properties under specific conditions, as we demonstrate below.

Symmetry is very obvious. Non-negativity holds as we restrict the angles into the domain of [0, π/2]. The measure violates the property of self-identity in the strict sense, because if AngDis(x, y) = 0, x and y can be different points in R^d. Nevertheless, if we consider the dissimilarity as a measure for equivalence classes, this property is satisfied. The triangle inequality is not that obvious. It can be proved as follows:

Proof. Suppose x, y, z are three d dimensional vectors with non-negative elements, and θ_xy, θ_xz, θ_yz are angles between each pair of them. According to the property of a trihedral angle, i.e., that the sum of any two face angles is greater than the third face angle [14], we have

$$\theta_{xy} \le \theta_{xz} + \theta_{yz} \tag{4}$$

If θ_xy ≤ θ_xz + θ_yz ≤ π/2, sin θ_xy ≤ sin(θ_xz + θ_yz) holds due to the monotonicity of the sine function on [0, π/2]. The inequality

$$\sin(\theta_{xz} + \theta_{yz}) = \sin\theta_{xz}\cos\theta_{yz} + \cos\theta_{xz}\sin\theta_{yz} \le \sin\theta_{xz} + \sin\theta_{yz} \tag{5}$$

holds because the cosine function takes values in [0, 1] on the domain [0, π/2]; thus, the inequality sinθ_xy ≤ sinθ_xz + sinθ_yz holds. If θ_xy ≤ π/2 ≤ θ_xz + θ_yz ≤ π, we have

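$$\sin\theta_{xz} + \sin\theta_{yz} = 2\sin\!\left(\frac{\theta_{xz} + \theta_{yz}}{2}\right)\cos\!\left(\frac{\theta_{xz} - \theta_{yz}}{2}\right) \tag{6}$$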

As (θ_xz + θ_yz) ∈ [π/2, π] and (θ_xz – θ_yz) ∈ [–π/2, π/2], we have sin{(θ_xz + θ_yz)/2} ∈ [√2/2, 1] and cos{(θ_xz − θ_yz)/2} ∈ [√2/2, 1]. The right-hand side of Eq. (6) is twice the product of these two terms, so it is at least 2 · (√2/2) · (√2/2) = 1. Hence, sinθ_xz + sinθ_yz ≥ 1 ≥ sinθ_xy holds. In both cases, we have AngDis(x, y) ≤ AngDis(x, z) + AngDis(y, z). □
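
The result of the proof can also be checked numerically. The sketch below draws random triples of vectors with non-negative elements and records the smallest value of AngDis(x, z) + AngDis(y, z) − AngDis(x, y). The helper `ang_dis`, the seed, the dimension, and the rounding tolerance are all illustrative assumptions; such a check supports the proof but does not replace it.

```python
import math
import random

def ang_dis(x, y):
    """Sine of the angle between two non-zero, non-negative vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    cos_theta = min(1.0, max(0.0, dot / (norm_x * norm_y)))
    return math.sqrt(1.0 - cos_theta * cos_theta)

rng = random.Random(0)   # arbitrary seed, chosen only for reproducibility
d = 16                   # arbitrary dimension
worst_slack = float("inf")
for _ in range(2000):
    # Strictly positive entries keep every vector non-zero.
    x, y, z = ([rng.random() + 1e-9 for _ in range(d)] for _ in range(3))
    slack = ang_dis(x, z) + ang_dis(y, z) - ang_dis(x, y)
    worst_slack = min(worst_slack, slack)

# A non-negative slack (up to rounding) means no violation was found.
print(worst_slack >= -1e-9)
```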

From the proof of the triangle inequality, we can see that if we choose θ itself as the dissimilarity, the triangle inequality always holds due to the property of trihedral angle. With the restrictions, sinθ still obeys the property and it is more convenient in practice as sine function normalizes angles. Moreover, sinθ provides a feasible and effective way to update cluster centers in each iteration of clustering, thanks to the satisfaction of the triangle inequality, which facilitates the shape-averaging process of time series. More details will be given in the next section.

If we consider offset shifting, projecting time series to plane P in Figure 1 can eliminate this transformation. Unfortunately, such transformation can result in new problems. Although the angle between two vectors is restricted in [0, π/2], after projection, the angle will violate this restriction. The above proof holds, however, only under the condition of amplitude scaling.

2.4 Evaluation of Dissimilarity Measure

How well does our angle-based dissimilarity, AngDis, perform in capturing the shape or trend of time series? What are the differences between using AngDis and others? We will answer these questions by quantitatively performing several experiments on some benchmark data sets.

2.4.1 Dataset Description

Keogh and Kasetty [11] objectively evaluated some similarity measures using the 1NN algorithm. On the UCR Time Series Data homepage [13], the authors also strongly recommend that one test and report the 1NN accuracy when advocating a new dissimilarity measure. We use two publicly available data sets for the task:

Cylinder–bell–funnel: This is a well-known artificial data set originally proposed by Saito [18]. We briefly describe how it is generated because we test the performance of our dissimilarity thoroughly on this data set. The data set consists of three classes: cylinder (c), bell (b), and funnel (f). Sequences are generated as follows:
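$$
\begin{aligned}
c(t) &= (6 + \eta)\,\chi_{[a,b]}(t) + \varepsilon(t), \\
b(t) &= (6 + \eta)\,\chi_{[a,b]}(t)\,\frac{t - a}{b - a} + \varepsilon(t), \\
f(t) &= (6 + \eta)\,\chi_{[a,b]}(t)\,\frac{b - t}{b - a} + \varepsilon(t), \\
\chi_{[a,b]}(t) &= \begin{cases} 0, & t < a, \\ 1, & a \le t \le b, \\ 0, & t > b, \end{cases}
\end{aligned}
$$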
where η and ε(t) are drawn from standard normal distribution, a is an integer drawn uniformly from [16, 32], and b – a is also an integer drawn uniformly from [32, 96]. Some examples of each class are shown in Figure 3. This data set characterizes some typical properties of temporal domain. We used the data created by Geurts [6] in our first evaluation experiment. It contains 5000 examples with the dimension of 128. In the second evaluation experiment, we tuned some parameters and created synthetic data sets by ourselves.
Control chart: To be more rigorous, we also performed the 1NN algorithm on this data set. This data set can be freely downloaded from the UC Irvine Machine Learning Repository [1]. It consists of six classes with 100 examples of each. More details can be found in Ref. [1].
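
The generation rules for the cylinder–bell–funnel data can be sketched as follows. This is an illustrative Python version written from the formulas above, not the generator used by Geurts [6]; the function names, the time index running from 1 to 128, and the use of Python's `random` module are assumptions.

```python
import random

def draw_params(rng):
    """Draw (eta, a, b) as described in the text."""
    eta = rng.gauss(0.0, 1.0)        # standard normal amplitude offset
    a = rng.randint(16, 32)          # start time, bounds inclusive
    b = a + rng.randint(32, 96)      # end time, so b - a lies in [32, 96]
    return eta, a, b

def cbf_series(kind, rng, length=128):
    """Generate one series; kind is 'c' (cylinder), 'b' (bell) or 'f' (funnel)."""
    eta, a, b = draw_params(rng)
    series = []
    for t in range(1, length + 1):
        inside = 1.0 if a <= t <= b else 0.0     # indicator chi_[a,b](t)
        if kind == "c":
            shape = 1.0
        elif kind == "b":
            shape = (t - a) / (b - a)            # ramps up over [a, b]
        elif kind == "f":
            shape = (b - t) / (b - a)            # ramps down over [a, b]
        else:
            raise ValueError("kind must be 'c', 'b' or 'f'")
        noise = rng.gauss(0.0, 1.0)              # epsilon(t)
        series.append((6.0 + eta) * inside * shape + noise)
    return series

rng = random.Random(42)   # arbitrary seed
example = cbf_series("b", rng)
print(len(example))
```

Note that the additive noise can make individual values negative, so data generated this way do not automatically satisfy the non-negativity restriction discussed in Section 2.2.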

Figure 3

Examples of Cylinder–Bell–Funnel Data.

2.4.2 Experiment 1

In Ref. [11], the authors already reported the 1NN classification error rates for 12 similarity measures on both the cylinder–bell–funnel and control chart data sets. The results are surprising: the simplest benchmark, Euclidean distance, beats all the other techniques. Thus, we choose Euclidean distance as the rival of our proposed dissimilarity. In addition, DTW has been considered a much more robust measure for time series in recent years [12, 21, 22]. It allows non-linear alignment on the time axis and finds the optimal distance between “warped” time series. Although the calculation of DTW is not very efficient [19], it is still widely used in various fields [12]. Thus, we added a DTW dissimilarity measure to our experiment. AngDis is based on the angle between two vectors, while Pearson’s correlation can be viewed as the cosine of the angle between two vectors after centering. Thus, we also added Pearson’s correlation to our test set for comparison.

After choosing the rivals, we performed 1NN algorithm on the data sets mentioned above. We used leave-one-out cross-validation to assess the results. The error rates are reported in Table 1.
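
The evaluation protocol, 1NN with leave-one-out cross-validation, can be sketched with a pluggable dissimilarity. The code below is illustrative only: the function names and the four-point toy data set are assumptions made here and are unrelated to the benchmark data or to the figures in Table 1.

```python
import math

def ang_dis(x, y):
    """Sine of the angle between two non-zero, non-negative vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    cos_theta = min(1.0, max(0.0, dot / (norm_x * norm_y)))
    return math.sqrt(1.0 - cos_theta * cos_theta)

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def loo_1nn_error(data, labels, dis):
    """Leave-one-out 1NN error rate under the dissimilarity `dis`."""
    errors = 0
    for i, x in enumerate(data):
        # Nearest neighbour among all points except x itself.
        nearest = min(
            (j for j in range(len(data)) if j != i),
            key=lambda j: dis(x, data[j]),
        )
        if labels[nearest] != labels[i]:
            errors += 1
    return errors / len(data)

# Hypothetical toy data: two shapes, each recorded at two amplitudes.
data = [
    [0.0, 1.0, 2.0, 1.0],
    [0.0, 10.0, 20.0, 10.0],
    [2.0, 1.0, 0.0, 0.0],
    [20.0, 10.0, 0.0, 0.0],
]
labels = ["peak", "peak", "decay", "decay"]

print(loo_1nn_error(data, labels, ang_dis))
print(loo_1nn_error(data, labels, euclidean))
```

On this toy set, Euclidean distance pairs the two low-amplitude series with each other regardless of shape, whereas the angle-based measure groups series by shape.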

Table 1

Error Rates for Various Measures.

Similarity/Dissimilarity Measures	Cylinder–bell–funnel	Control Chart
Euclidean distance	0.0040	0.0133
DTW	0.0036	0.0067
Pearson’s correlation	0.0056	0.0750
AngDis	0.0052	0.0467

As shown in Table 1, DTW had the smallest error rates on both data sets. Although the performance of AngDis was moderate, its error rates were considerably smaller than those of almost all the techniques reported in Ref. [11] (several of which have error rates close to random guessing). We can conclude that our measure performs well in general, as these data sets were not generated for specific purposes.

2.4.3 Experiment 2

AngDis is particularly designed to capture temporal patterns that are insensitive to amplitude scaling. To demonstrate this characteristic, we performed 1NN classification algorithm several times on cylinder–bell–funnel data sets generated by ourselves, with some parameters changed.

In the generation functions of cylinder–bell–funnel data, the parameter η indicates the randomness of amplitude variation. In addition, the parameters a and (b – a) indicate the start time and the length of duration, respectively. We multiplied η by a scalar factor to control the randomness of amplitude variation and then generated the data set. The factor was set to 1, 2, 4, and 8 in turn. Then, we changed the ranges of the domains from which the parameters a and (b – a) are drawn. The specific values are shown in Figure 4. The purpose of these procedures is to evaluate the performance of our measure on various data sets with different levels of variation on the amplitude axis and the time axis.

Figure 4

Domains From Which a and (b – a) Are Drawn.

As we increased the factor multiplying η, the 1NN classification error rates of the four measures increased almost simultaneously (see Figure 5). Our proposed measure, however, showed its robustness to amplitude randomness. DTW had the lowest error rate when the multiplier of η was 1. Our measure had the lowest error rate when the multiplier was 2, 4, or 8. The error rate of our measure was much lower than those of Euclidean distance and DTW when the multiplier was high.

Figure 5

Experimental Results After Changing the Multiplier of η.

We then expanded the spans of the duration of events in the data set, by moving forward the spans of a and enlarging the spans of (b – a) (see Figure 4). In Figure 6, the DTW measure is very robust to this variation. Our angle-based dissimilarity measure, however, did not perform well. These results reveal the limitation of our measure; that is, the sensitivity to variation of translation along the time axis.

Figure 6

Experimental Results After Changing the Domains From Which a and (b – a) Are Drawn.

3 Time Series Clustering Acceleration

Nowadays, data volumes grow at unprecedented speed in almost every field of production. Dealing with so-called Big Data will become a key to competitiveness. Hence, improving the efficiency of algorithms is of great significance. We now introduce how to accelerate time series clustering using the triangle inequality. Elkan’s algorithm [4] is applicable to our problem because our dissimilarity satisfies the triangle inequality and because the algorithm suits the high-dimensional nature of time series data. In addition, with Euclidean distance, updating cluster centers only requires calculating the mean of the data; with the DTW dissimilarity measure, updating cluster centers is so challenging that the k-medoid algorithm, which avoids averaging, is often used instead for clustering. With our angle-based dissimilarity, however, simply calculating the mean does not accurately update cluster centers. The algorithm of Yang and Leskovec [23] provides an effective way of updating centers with angle-based dissimilarity.

3.1 Method

The triangle inequality informs us that if two points are close to a third point, the distance between these two points cannot be too large. Elkan [4] solidifies this idea by introducing two lemmas. Suppose x is a data point, as shown in Figure 7(A), and we have already calculated the dissimilarity between x and a center c_i. Now, considering another center c_j, we want to decide to which center x is closer. By applying the triangle inequality, we have:

Figure 7

Applying the Triangle Inequality.

$$\mathrm{dis}(x, c_j) \ge \mathrm{dis}(c_i, c_j) - \mathrm{dis}(x, c_i). \tag{7}$$

Inequality (7) reveals that if dis(c_i, c_j) is greater than or equal to 2 times dis(x, c_i), we can ensure that dis(x, c_j) ≥ dis(x, c_i). As shown in Figure 7(A), the condition above guarantees that x lies on the side closer to c_i of the perpendicular bisector of the segment c_ic_j. Thus, x is at least as close to c_i as to c_j. As k-means assigns data points to their closest centers, the calculation of dis(x, c_j) can be safely pruned without any change in the results. In practice, an upper bound u of dis(x, c_i) is used because dis(c_i, c_j) ≥ 2u is a stronger condition that implies dis(c_i, c_j) ≥ 2 dis(x, c_i).
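
The pruning test derived from Inequality (7) amounts to a single comparison. The sketch below uses the angle-based dissimilarity on non-negative vectors; the helper names `ang_dis` and `can_skip` and the example vectors are hypothetical.

```python
import math

def ang_dis(x, y):
    """Sine of the angle between two non-zero, non-negative vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    cos_theta = min(1.0, max(0.0, dot / (norm_x * norm_y)))
    return math.sqrt(1.0 - cos_theta * cos_theta)

def can_skip(u, d_centers):
    """True when dis(c_i, c_j) >= 2u, so c_j cannot be closer to x than c_i."""
    return d_centers >= 2.0 * u

c_i = [1.0, 0.0, 0.0]
c_j = [0.0, 1.0, 0.0]
x = [10.0, 1.0, 0.0]

u = ang_dis(x, c_i)             # here the upper bound is simply the exact value
d_centers = ang_dis(c_i, c_j)   # computed once per pair of centers

if can_skip(u, d_centers):
    print("pruned: dis(x, c_j) need not be computed")
else:
    print("must compute dis(x, c_j) =", ang_dis(x, c_j))
```

The inter-center dissimilarities are shared by all data points, so their cost is amortized over the whole assignment step.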

Within each iteration, we can calculate upper bounds for every data point to reduce redundancy using Elkan’s Lemma 1. After one iteration, however, the centers change, and we can no longer be sure that the previous closest center is still the best. Suppose we already know the lower bound l(x, c′) on the dissimilarity between every data point x and every center c′ [see Figure 7(B)] in the previous iteration. By applying the triangle inequality, we have

$$\mathrm{dis}(x, c') \le \mathrm{dis}(x, c) + \mathrm{dis}(c', c), \tag{8}$$

where c is a center in the current iteration. The change between c′ and c is usually small. Now we have

$$\mathrm{dis}(x, c) \ge \mathrm{dis}(x, c') - \mathrm{dis}(c', c) \ge l(x, c') - \mathrm{dis}(c', c). \tag{9}$$

Inequality (9) provides lower bounds on the dissimilarities between data points and the current centers. If the dissimilarity between a point and its assigned center from the previous iteration is lower than or equal to all of its current lower bounds, the assignment does not change.
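
The bound update implied by Inequality (9) can be written in a few lines. In this sketch, `lower[i][j]` holds the previous lower bound l(x_i, c′_j) and `shifts[j]` holds dis(c′_j, c_j), the amount by which center j moved; both names and the data layout are assumptions. Flooring the result at zero is an addition made here, which is valid because dissimilarities are non-negative.

```python
def update_lower_bounds(lower, shifts):
    """Return new lower bounds l(x_i, c_j) = max(l(x_i, c'_j) - dis(c'_j, c_j), 0)."""
    return [
        [max(l_ij - shifts[j], 0.0) for j, l_ij in enumerate(row)]
        for row in lower
    ]

# One data point, two centers: center 0 moved by 0.1, center 1 by 0.3.
new_lower = update_lower_bounds([[0.5, 0.2]], [0.1, 0.3])
print(new_lower)
```

Bounds shrink by exactly the amount each center moved, so they stay tight once the centers stabilize in later iterations.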

Algorithm 1

Accelerated K-SC Clustering Algorithm.

The discussion above details Elkan’s two lemmas. Although more time is needed to calculate the bounds, and more space is needed to store the bounds and the inter-center dissimilarities, the cost is worthwhile when a large number of point–center dissimilarity calculations would otherwise be required, as is the case for high-dimensional time series data.

3.2 Algorithm

After each assignment step, the center should be updated according to the following criterion [23]:

$$c_j^{*} = \arg\min_{c_j} \sum_{x_i \in C_j} \mathrm{dis}(x_i, c_j)^2, \tag{10}$$

where c_j is the center of cluster C_j and x_i is the ith data point in C_j. This means that the optimal center is the point that minimizes the sum of the squared dissimilarities between all point–center pairs within the cluster. If the dissimilarity is Euclidean distance, the minimizer is the mean of all points in C_j. With our definition, the K-SC algorithm [23] provides a feasible way to calculate the solution to Eq. (10), i.e., the eigenvector corresponding to the smallest eigenvalue of the matrix P = ∑_{x_i ∈ C_j} {I − (x_i x_iᵀ)/(x_iᵀ x_i)}. Equivalently, because P = |C_j| I − Q, the eigenvector corresponding to the largest eigenvalue of the matrix Q = ∑_{x_i ∈ C_j} (x_i x_iᵀ)/(x_iᵀ x_i) is also a solution. As mentioned in Section 2, if we use the angle itself rather than a function of it as the dissimilarity, we cannot calculate cluster centers through an analogous closed form. The pseudo-code in Algorithm 1 describes the method. It is a k-means-like clustering algorithm with Elkan’s method for reducing redundancies and a practical way to update centers.

Procedure K-SC-Centers(X, A, j)
  Require: Data set: X. Assignment set: A. Current cluster: j.
  index ← indices in A at which the element is j
  Q ← 0
  for all t in index do
    Q ← Q + x_t x_tᵀ / ||x_t||²
  c_j ← the eigenvector of Q corresponding to the largest eigenvalue
  return c_j

Procedure Elkans-Assignment(x_i, C, A, u, l, opt)
  Require: The current data point: x_i. Cluster set: C. Assignment set: A. Upper and lower bounds: u and l. The index of the assigned center of x_i: opt.
  Ensure: When dis(x, c) is computed, update l(x, c) to dis(x, c)
  for all t ≠ opt do
    if u(x_i) > l(x_i, c_t) and u(x_i) > ½ dis(c_opt, c_t) then
      if r(x_i) = 1 then
        compute dis(x_i, c_opt), r(x_i) ← 0
      else
        dis(x_i, c_opt) ← u(x_i)
      if dis(x_i, c_opt) > l(x_i, c_t) or dis(x_i, c_opt) > ½ dis(c_opt, c_t) then
        compute dis(x_i, c_t)
        if dis(x_i, c_t) < dis(x_i, c_opt) then
          the ith element in A ← t
  return A
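
The center-update procedure above can be illustrated without a linear algebra library. The sketch below approximates the eigenvector for the largest eigenvalue by power iteration, which is a substitute chosen here for self-containedness and not the eigen-solver used in Ref. [23]. The function name `ksc_center`, the iteration count, and the starting vector are assumptions; the data are assumed to be non-zero vectors with non-negative elements.

```python
import math

def ksc_center(cluster, iters=200):
    """Approximate the dominant eigenvector of the sum of x x^T / (x^T x)."""
    d = len(cluster[0])
    # Normalize each member so the matrix is a sum of u u^T over unit vectors u.
    units = []
    for x in cluster:
        norm = math.sqrt(sum(v * v for v in x))
        units.append([v / norm for v in x])
    # Start from the sum of the unit vectors.
    c = [sum(u[k] for u in units) for k in range(d)]
    for _ in range(iters):
        # Matrix-vector product as sum_i (u_i . c) u_i; the d x d matrix is never formed.
        new_c = [0.0] * d
        for u in units:
            proj = sum(a * b for a, b in zip(u, c))
            for k in range(d):
                new_c[k] += proj * u[k]
        norm = math.sqrt(sum(v * v for v in new_c))
        c = [v / norm for v in new_c]
    return c

# Two members point along the first axis and one along the second,
# so the center should align with the first axis.
center = ksc_center([[2.0, 0.0], [5.0, 0.0], [0.0, 3.0]])
print(center)
```

Power iteration converges slowly when the two largest eigenvalues are close, so a full eigen-decomposition from a numerical library is likely preferable in practice.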

Evaluating our dissimilarity takes O(d) time where d is the dimension of data points. During each iteration, the time complexity of updating centers is O(max(nd², kd³)) [23]. The time complexity of updating bounds after the center updating step is O(nkd) because lower bounds need to be updated between every point–center pair. The time complexity of the assignment step is O(nkd). Many dissimilarity calculations, however, do not happen due to the constraints of the bounds.

3.3 Evaluation of Clustering Acceleration

We performed an experiment to test the effectiveness of our algorithm for discovering representative patterns. Specifically, we repeated the Memetracker experiment in Ref. [23] to verify if the algorithm can reduce redundant dissimilarity calculations while guaranteeing identical results at the same time.

Yang and Leskovec [23] made their data set publicly available. It consists of 1000 time series of a dimension of 128. The data correspond to the 1000 most frequent short textual phrases extracted from online social media. All elements in the data set are non-negative. We ran the standard clustering algorithm and the accelerated version using the triangle inequality, respectively. We then compared the results produced by these two algorithms.

In the original experiment, the authors used an empirical approach to deal with translation invariance along the time axis. Because the data have a bursty nature, the peaks of all the time series were aligned to the same time point, and the minimum dissimilarity between x and y was calculated by allowing one of the time series to translate from –5 time units to +5 time units. That is, every dissimilarity evaluation requires 11 calculations, one per shift, and the minimum is taken as the result. Unfortunately, this translation may break the triangle inequality of the dissimilarity measure. We therefore conducted an experiment without translation in time to verify the acceleration performance. The results without translation are not much different from the original results. The number of clusters is set to 6.
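
The translation search described above can be sketched as a minimum over 11 shifted copies. How the vacated positions are filled is not stated in the text; padding with zeros is an assumption made here, as are the function names and the rule of skipping shifts that leave an all-zero series. The sketch also assumes that `max_shift` is smaller than the series length.

```python
import math

def ang_dis(x, y):
    """Sine of the angle between two non-zero, non-negative vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    cos_theta = min(1.0, max(0.0, dot / (norm_x * norm_y)))
    return math.sqrt(1.0 - cos_theta * cos_theta)

def shift(y, q):
    """Translate y by q time units, padding with zeros (an assumption)."""
    d = len(y)
    if q >= 0:
        return [0.0] * q + y[: d - q]
    return y[-q:] + [0.0] * (-q)

def shifted_dis(x, y, max_shift=5):
    """Minimum dissimilarity over q = -max_shift, ..., +max_shift."""
    candidates = (shift(y, q) for q in range(-max_shift, max_shift + 1))
    # Skip shifts that push the whole signal out of the window.
    return min(ang_dis(x, s) for s in candidates if any(s))

x = [0.0, 0.0, 1.0, 3.0, 1.0, 0.0, 0.0, 0.0]
y = [1.0, 3.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0]   # same burst, two steps earlier

print(ang_dis(x, y))       # large: the bursts do not line up
print(shifted_dis(x, y))   # close to 0: a shift of +2 aligns them
```

Taking a minimum over shifts is what may break the triangle inequality, which is why the acceleration experiment is run without it.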

Without translation, the centers and the assignment of data points produced by the accelerated algorithm are exactly the same as those produced by the standard version. The numbers of dissimilarity calculations are shown in Figure 8. The standard algorithm calculates all point–center dissimilarities during each iteration. Thus, the number of calculations increases linearly with the number of iterations. The accelerated algorithm eliminates many redundant calculations, especially in the later iterations. At the beginning of the algorithm, the centers move more between successive iterations than they do at the end. Therefore, the bounds updated after each early iteration are loose. As the changes become smaller, the much tighter bounds prune more unnecessary calculations. As shown in Figure 8, the accelerated algorithm reduced the number of calculations by almost 10-fold in our experiment. In addition, we visually compared our results with the original results obtained with translation, and the differences were small. We consider these differences negligible because Yang and Leskovec performed a qualitative investigation.

Figure 8

Acceleration Results of Experiment 1.
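The pruning effect reported above can be reproduced in miniature. The sketch below is not the integrated algorithm of Section 3.2; it only applies one consequence of the triangle inequality used in Elkan's approach [4]: if d(c, c') ≥ 2 d(x, c), then d(x, c') ≥ d(x, c), so center c' need not be evaluated for point x. The plain angle is used as an assumed stand-in for the dissimilarity, the data are synthetic and non-negative, and all function names are hypothetical. With n = 1000 points and k = 6 centers, the standard assignment step always costs n × k = 6000 point–center evaluations.

```python
import numpy as np

def angle(x, y):
    """Angle in radians between x and y (assumed stand-in for the measure)."""
    c = np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))
    return float(np.arccos(np.clip(c, -1.0, 1.0)))

def assign_standard(points, centers):
    """Evaluate every point-center pair: n * k calculations."""
    labels = [int(np.argmin([angle(x, c) for c in centers])) for x in points]
    return labels, len(points) * len(centers)

def assign_pruned(points, centers, start_labels):
    """Skip center j whenever d(c_best, c_j) >= 2 * d(x, c_best)."""
    k = len(centers)
    # Center-center dissimilarities; these are not counted as point-center calls.
    cc = np.array([[angle(a, b) for b in centers] for a in centers])
    labels, calls = [], 0
    for x, best in zip(points, start_labels):
        d_best = angle(x, centers[best])
        calls += 1
        for j in range(k):
            if j == best or cc[best, j] >= 2.0 * d_best:
                continue  # the triangle inequality rules out center j
            d = angle(x, centers[j])
            calls += 1
            if d < d_best:
                best, d_best = j, d
        labels.append(best)
    return labels, calls

# Synthetic non-negative data: 1000 series of dimension 128 around 6 prototypes.
rng = np.random.default_rng(0)
prototypes = rng.random((6, 128))
points = prototypes[rng.integers(0, 6, size=1000)] + 0.05 * rng.random((1000, 128))
start, _ = assign_standard(points, prototypes)       # labels from a previous iteration
centers = prototypes + 0.01 * rng.random((6, 128))   # slightly moved centers
std_labels, std_calls = assign_standard(points, centers)
fast_labels, fast_calls = assign_pruned(points, centers, start)
print(std_calls, fast_calls, fast_labels == std_labels)
```

The sketch mimics a late iteration, in which the centers have barely moved and most points keep their previous label; this is the regime in which the text above reports the largest savings.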

4 Application to Social Network Mining

The emergence and explosion of OSNs have transformed the way people receive news and messages, share information, and communicate with others. The most important mechanism for information diffusion in microblogging, one typical form of social network service, is retweeting, which is simply copying a message you find interesting and then forwarding it with tags or comments. This easy-to-use function is one of the key features that make microblogging a highly dynamic and relation-linked social ecosystem [24]. Retweeting traces contain a rich amount of information, which can be utilized for decision making. We aim to find typical temporal variation patterns of these traces using our time series clustering algorithm.

4.1 Dataset Description

We obtained a data set composed of 69M microblogs (tweets) and 3M user profiles from Sina Weibo, the most popular microblogging site in China. The data set spans from August 2009 to April 2010. Sina’s API provides access to many useful attributes, which allow us to identify the occurrence time of every retweet of a given original message. We then bin the sequence of timestamps at a chosen granularity and obtain a time series that reflects the variation of retweeting activity. The amount of information stored in a time series depends heavily on the number of times the original message has been retweeted. A temporal variation pattern cannot be observed with only a small number of retweets. We therefore keep the time series whose original messages have been retweeted from 500 to 3000 times. Our data set accounts for 20% to 30% of the total messages generated at the time. Hence, not every single retweet can be found. We then kept the time series that contain 85% of the actual retweets, which yielded 4025 time series. We set the length of the time series to 744 hours (a month).
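The binning step described above can be sketched as follows. This is a minimal illustration, assuming that timestamps are given in seconds, that the granularity is one hour, and that the 500 to 3000 range is inclusive; the function names are hypothetical and do not correspond to Sina’s API.

```python
import numpy as np

def retweet_series(post_time, retweet_times, n_bins=744, bin_seconds=3600):
    """Count retweets per hourly bin for the 744 hours after the original post."""
    offsets = np.asarray(retweet_times, dtype=float) - post_time
    idx = np.floor(offsets / bin_seconds).astype(int)
    keep = (idx >= 0) & (idx < n_bins)  # drop retweets outside the window
    return np.bincount(idx[keep], minlength=n_bins)

def enough_retweets(n_retweets, low=500, high=3000):
    """Retweet-count filter from the text (inclusive bounds are an assumption)."""
    return low <= n_retweets <= high

# Two retweets in the first hour, one in the second, one in the third,
# and one that falls just outside the 744-hour window.
series = retweet_series(0, [10, 3599, 3600, 7300, 744 * 3600])
print(series[:3], len(series))
```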

4.2 Clustering Results

After preprocessing the data, we ran our clustering algorithm several times with different numbers of clusters and calculated the Hartigan index recommended in Ref. [2]. On this basis, we set the number of clusters to 7.
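For readers unfamiliar with the Hartigan index, the sketch below shows the form in which it is commonly stated: H(k) = (W_k / W_{k+1} − 1)(n − k − 1), where W_k is the within-cluster dispersion for k clusters and n is the number of data points, with the smallest k satisfying H(k) ≤ 10 taken as the estimate. How W_k is computed with the angle-based dissimilarity is not specified here, so the dispersion values in the example are purely illustrative and the function names are hypothetical.

```python
def hartigan_index(w_k, w_k_plus_1, n, k):
    """H(k) = (W_k / W_{k+1} - 1) * (n - k - 1)."""
    return (w_k / w_k_plus_1 - 1.0) * (n - k - 1)

def choose_k(within, n, threshold=10.0):
    """Return the smallest k with H(k) <= threshold, or None if there is none.

    within: dict mapping k to the within-cluster dispersion W_k.
    """
    for k in sorted(within):
        if k + 1 in within and hartigan_index(within[k], within[k + 1], n, k) <= threshold:
            return k
    return None

# Illustrative dispersions only: the drop from k=1 to k=2 is large,
# and the drop from k=2 to k=3 is marginal.
within = {1: 100.0, 2: 40.0, 3: 38.0}
print(choose_k(within, n=50))
```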

Figure 9 shows the results of clustering. We smoothed the curves using splines. From the illustration, we observe that clusters 1, 2, and 3 have almost identical patterns, except that in cluster 3 the peak appears 2 h after the posting of the original message and decays more slowly. We calculated the percentage of cluster members relative to the total number and the average distance within each cluster. The first three clusters accounted for almost three-fourths of the total data, and the average distances within these clusters were relatively small. This pattern, represented by clusters 1, 2, and 3, is the most typical pattern of temporal variation of retweeting activities in microblogging sites such as Sina Weibo. It shows that the number of retweets rises sharply to its peak within 1 or 2 h after the posting of the original message and decreases rapidly within a few hours. After 24 h, there is no notable retweeting activity.

Figure 9

Clustering Results.

Clusters 4 to 7 are multipeaked cases. A possible explanation for multiple peaks is that, during the process of retweeting, the message was retweeted by some influential users. Their influence helps boost the popularity of the original message. Thus, several independent peaks appear. In clusters 5 and 7, the second peak is higher than the first one. In cluster 6, however, the opposite holds. A common characteristic of the patterns of clusters 1 to 3 and clusters 5 to 7 is that the lifespan of the popularity seems to be 24 h. Cluster 4, however, shows a case of a longer lifespan. The second peak appears at the 24th hour after the posting of the original message, and the popularity lasts for more than 2 days.

4.3 Analysis of Clustering Results

To give reasonable explanations for our clustering results, we analyzed the factors that lead to different temporal variation patterns. The reasons may be complicated. The topic and content of the messages, the original users’ influence and social relations, and even the background of public opinion are all possible factors that could affect the spread of the original message. In addition, the diversity and randomness of user behavior make the question more complex. Thus, we aim at giving reasonable qualitative explanations. Specifically, we extracted some potentially significant features and performed principal component analysis (PCA) on both single-peaked and multipeaked clusters.

4.3.1 Feature Generation

We adopted part of the features mentioned in Refs. [20, 26] and added some new ones that are available. Table 2 shows the list of features, including content features extracted from the original text of messages (URL, hashtag, mention, multimedia), user profile features related to the authors (followee, follower, activeness, VIP), the ratio of the VIP-certified ones in the author’s followers (VIP ratio), and the number of retweets for a given message (retweet).

Table 2

Features.

Feature No.	Feature Name	Description
1	URL	No. of URLs in the original message
2	Hashtag	No. of hashtags in the original message (# short phrase #)
3	Mention	No. of mentions in the original message (@username)
4	Multimedia	No. of videos and music clips in the original message
5	Followee	No. of the author’s followees
6	Follower	No. of the author’s followers
7	Activeness	No. of messages the author already posted in the past
8	VIP	Whether or not the author has VIP certification
9	VIP ratio	The ratio of VIP-certified ones in the followers
10	Retweet	No. of retweets for the original message

4.3.2 Principal Component Analysis

In Ref. [20], Suh et al. performed a PCA with nine features to find which ones have a strong relation to retweetability. PCA is a rigorous method for eliminating the redundancy of information and visualizing data. We adopted this idea to understand the formation mechanisms of single-peaked and multipeaked patterns.

As typical examples, we chose clusters 1 and 2 in Figure 10(A) as single-peaked samples and clusters 5 and 7 in Figure 10(B) as multipeaked samples, and performed PCA on each class. Table 3 shows the top four extracted factors and the corresponding percentages of variance in each class. The variance along the direction of a component is given by the corresponding eigenvalue of the covariance matrix of the features. The four largest components account for 49% and 53% of the total variance in the single-peaked and multipeaked classes, respectively.

Table 3

Variances of Factors.

Single-peaked Class		Multipeaked Class
Factor	Variance%	Factor	Variance%
1	17.25	1	18.58
2	11.38	2	12.03
3	10.53	3	11.24
4	10.16	4	10.90
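The variance percentages in Table 3 are of the kind produced by the following minimal sketch, which computes the eigenvalues of the covariance matrix of the features and normalizes them. Standardizing each feature first is an assumption made here because the features in Table 2 are on very different scales; the random matrix only stands in for the real feature data, and the function name is hypothetical.

```python
import numpy as np

def explained_variance_percent(features):
    """Percentage of total variance per principal component, in descending order.

    features: array of shape (n_samples, n_features).
    """
    # Assumed preprocessing: zero mean and unit variance for every feature.
    z = (features - features.mean(axis=0)) / features.std(axis=0, ddof=1)
    cov = np.cov(z, rowvar=False)
    eigvals = np.linalg.eigvalsh(cov)[::-1]  # eigvalsh returns ascending order
    return 100.0 * eigvals / eigvals.sum()

# Stand-in data: 200 messages described by 10 features.
rng = np.random.default_rng(0)
data = rng.normal(size=(200, 10))
pct = explained_variance_percent(data)
print(pct[:4], pct[:4].sum())
```

Summing the first four entries of the returned array gives the share of variance captured by the top four factors, which is how the 49% and 53% figures above are read from Table 3.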

Every factor is a linear combination of the original features, and loadings are the coefficients of the combination. A common way to summarize these loadings is by correlation circles. We plotted the correlation circles in Figures 10 and 11 for the first four factors of each class. Each feature was mapped as a vector to represent its correlation with corresponding factors.

Figure 10

Coefficient Circles of Single-Peaked Class.

Figure 11

Coefficient Circles of Multipeaked Class.
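The coordinates plotted in a correlation circle can be obtained as in the sketch below, which computes the correlation between every feature and each of the leading factors. It assumes standardized features, in which case the correlation of feature i with factor j equals the loading multiplied by the square root of the eigenvalue of that factor. The random matrix stands in for the real features, and the function name is hypothetical.

```python
import numpy as np

def correlation_circle(features, n_factors=4):
    """Return (corr, scores): corr[i, j] is the correlation between feature i
    and factor j; scores holds the factor values for every sample."""
    z = (features - features.mean(axis=0)) / features.std(axis=0, ddof=1)
    eigvals, eigvecs = np.linalg.eigh(np.cov(z, rowvar=False))
    order = np.argsort(eigvals)[::-1][:n_factors]  # largest eigenvalues first
    loadings = eigvecs[:, order]
    scores = z @ loadings
    corr = loadings * np.sqrt(eigvals[order])
    return corr, scores

# Stand-in data: 300 messages, 10 features with some induced correlation.
rng = np.random.default_rng(0)
data = rng.normal(size=(300, 10))
data[:, 9] += 0.8 * data[:, 5]  # make the last feature depend on feature 6
corr, scores = correlation_circle(data)
print(corr[9, :2], corr[5, :2])  # coordinates of two features on factors 1 and 2
```

Each feature is drawn as the vector (corr[i, a], corr[i, b]) for a chosen pair of factors a and b. Because the squared correlations of a feature with all factors sum to 1, every such vector lies inside the unit circle.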

By examining the angles (directions) between the retweet vector (feature no. 10) and the other feature vectors, we identify the features associated with retweeting in each class. On the basis of this correlation analysis, we propose different formation mechanisms for single-peaked and multipeaked patterns:

Single-peaked patterns are mainly generated by influential and active authors in relatively small communities.
Multipeaked patterns are mainly generated by a sequence of influential users in relatively large communities. This type of messages might be related to hot topics.

In the single-peaked class, retweeting is associated with the number of followees, the number of followers, activeness, and the number of mentions in messages (feature nos. 3, 5, 6, and 7). The influence and social circles of the authors themselves are the key factors in generating this type of pattern. The number of mentions (@) plays an important role in the single-peaked class. This is not surprising because messages with mentions tend to be more private [20] and involve more cognitive overhead [15] than messages with other content features. In contrast, messages that contain other content features, such as URLs, hashtags, and videos, are more likely to be “broadcasted.” In the multipeaked class, retweeting is associated with the number of URLs, hashtags, and mentions in the message content, whether the author is VIP certified, and the ratio of VIPs among the author’s followers (feature nos. 1, 2, 3, 8, and 9). The correlation between retweeting and the VIP ratio suggests that this type of message is often retweeted, like a relay race, by many influential users. In addition, messages that contain popular URLs and hashtags are more likely to spread into larger communities.

In comparison with the work of Suh et al. [20], rather than just analyzing which messages are more likely to be retweeted, we have obtained some new conclusions by comparing the PCA results of single-peaked and multipeaked patterns. We reveal different formation mechanisms that lead to these different patterns. Our conclusions are consistent with intuition and offer a reasonable explanation of our clustering results.

5 Conclusion and Future Work

In this article, we have defined an angle-based dissimilarity measure, AngDis, for clustering time series data, which is an important type of dynamic data in networks. This measure is not only appropriate to capture the essence of the “shape” of time series, but is also useful in accelerating clustering, based on the triangle inequality. We have shown that our dissimilarity measure satisfies the triangle inequality under the circumstances of (i) all elements being non-negative and (ii) considering only amplitude scaling transformation. Evaluation experiments have shown that our measure performs well in general, and it has the advantage of being robust to amplitude variations.

On the basis of Elkan’s approach for accelerating k-means, AngDis can be used to prune redundant calculations during clustering, assuming the triangle inequality is satisfied. We propose an integrated algorithm, which includes a practical method for calculating cluster centers. Our experimental results have shown that the accelerated algorithm reduces the number of dissimilarity calculations by almost an order of magnitude.

We have also applied our algorithm to a social network mining task. Clustering results reveal two typical patterns: single-peaked and multipeaked. By using PCA, we have given reasonable explanations of what factors lead to these differences in retweeting activities. It turns out that single-peaked patterns are mainly related to authors’ influence, whereas multipeaked patterns seem to be generated by a sequence of influential people.

In the future, we intend to extend our work to more general circumstances. Rather than amplitude scaling, other transformations, including offset shifting and translation, may be considered for specific applications. We also intend to analyze the effect of link structures among users on retweeting activities.

Corresponding author: Qianchuan Zhao, Tsinghua National Laboratory for Information Science and Technology, Department of Automation, Center for Intelligent and Networked Systems, Tsinghua University, 100084 Beijing, China, e-mail: zhaoqc@tsinghua.edu.cn

Acknowledgments

This work is supported in part by NSFC grant nos. 61074034, 61021063, 61174072, and 61174105.

Bibliography

[1] K. Bache and M. Lichman, UCI machine learning repository, University of California, Irvine, School of Information and Computer Sciences, http://archive.ics.uci.edu/ml, 2013.

[2] M. Chiang and B. Mirkin, Experiments for the number of clusters in k-means, Progr. Artif. Intell. 4874 (2007), 395–405. doi:10.1007/978-3-540-77002-2_33

[3] K. K. W. Chu and M. H. Wong, Fast time-series searching with scaling and shifting, in: Proceedings of the Eighteenth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, Philadelphia, PA, USA, pp. 237–248, ACM, New York, NY, USA, 1999.

[4] C. Elkan, Using the triangle inequality to accelerate k-means, in: ICML, pp. 147–153, AAAI Press, Menlo Park, CA, USA, 2003.

[5] T. Fu, A review on time series data mining, Eng. Appl. Artif. Intell. 24 (2011), 164–181. doi:10.1016/j.engappai.2010.09.007

[6] P. Geurts, Pattern extraction for time series classification, in: Principles of Data Mining and Knowledge Discovery, Freiburg, Germany, pp. 115–127, Springer, New York, NY, USA, 2001. doi:10.1007/3-540-44794-6_10

[7] D. Goldin and P. Kanellakis, On similarity queries for time-series data: constraint specification and implementation, in: Principles and Practice of Constraint Programming – CP’95, Cassis, France, pp. 137–153, Springer, New York, NY, USA, 1995. doi:10.1007/3-540-60299-2_9

[8] G. Hamerly, Making k-means even faster, in: SIAM International Conference on Data Mining (SDM), 2010. doi:10.1137/1.9781611972801.12

[9] J. Han and M. Kamber, Data mining: concepts and techniques, Morgan Kaufmann, Burlington, MA, USA, 2006.

[10] T. Kahveci, A. Singh and A. Gurel, Shift and scale invariant search of multi-attribute time sequences, in: Proc. of the SSDBM Conf., Fairfax, VA, USA, Citeseer, IEEE Computer Society, Washington, DC, USA, 2001.

[11] E. Keogh and S. Kasetty, On the need for time series data mining benchmarks: a survey and empirical demonstration, Data Mining Knowl. Discov. 7 (2003), 349–371.

[12] E. Keogh and C. A. Ratanamahatana, Exact indexing of dynamic time warping, Knowl. Inf. Syst. 7 (2005), 358–386. doi:10.1007/s10115-004-0154-9

[13] E. Keogh, Q. Zhu, B. Hu, Y. Hao, X. Xi, L. Wei and C. A. Ratanamahatana, The UCR Time Series Classification/Clustering homepage (2011). www.cs.ucr.edu/eamonn/time_series_data/.

[14] M. S. Klamkin, Vector proofs in solid geometry, Am. Math. Month. 77 (1970), 1051–1065. doi:10.1080/00029890.1970.11992664

[15] M. Nagarajan, H. Purohit and A. Sheth, A qualitative examination of topical tweet and retweet practices, in: Proceedings of Fourth International AAAI Conference on Weblogs and Social Media (ICWSM), Washington, DC, USA, pp. 295–298, AAAI Press, Menlo Park, CA, USA, 2010. doi:10.1609/icwsm.v4i1.14051

[16] V. Niennattrakul and C. A. Ratanamahatana, Inaccuracies of shape averaging method using dynamic time warping for time series data, in: Computational Science – ICCS 2007, pp. 513–520, Springer, New York, NY, USA, 2007. doi:10.1007/978-3-540-72584-8_68

[17] D. Pelleg and A. Moore, Accelerating exact k-means algorithms with geometric reasoning, in: Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 277–281, ACM, New York, NY, USA, 1999. doi:10.1145/312129.312248

[18] N. Saito, Local feature extraction and its applications using a library of bases, Yale University, New Haven, CT, USA, 1994.

[19] S. Salvador and P. Chan, Toward accurate dynamic time warping in linear time and space, Intell. Data Anal. 11 (2007), 561–580. doi:10.3233/IDA-2007-11508

[20] B. Suh, L. Hong, P. Pirolli and E. H. Chi, Want to be retweeted? Large scale analytics on factors impacting retweet in twitter network, in: Social Computing (SocialCom), 2010 IEEE Second International Conference on, pp. 177–184, IEEE, Washington, DC, USA, 2010. doi:10.1109/SocialCom.2010.33

[21] N. Cong Thuong and D. T. Anh, Comparing three lower bounding methods for DTW in time series classification, in: Proceedings of the Third Symposium on Information and Communication Technology, pp. 200–206, ACM, New York, NY, USA, 2012. doi:10.1145/2350716.2350747

[22] T. W. Liao, Clustering of time series data – a survey, Pattern Recognition 38 (2005), 1857–1874, Elsevier, Oxford, England. doi:10.1016/j.patcog.2005.01.025

[23] J. Yang and J. Leskovec, Patterns of temporal variation in online media, in: Proceedings of the Fourth ACM International Conference on Web Search and Data Mining, pp. 177–186, ACM, New York, NY, USA, 2011. doi:10.1145/1935826.1935863

[24] D. Zhao and M. B. Rosson, How and why people twitter: the role that micro-blogging plays in informal communication at work, in: Proceedings of the ACM 2009 International Conference on Supporting Group Work, pp. 243–252, ACM, New York, NY, USA, 2009. doi:10.1145/1531674.1531710

[25] M. Zhou, M. H. Wong and K. W. Chu, A geometrical solution to time series searching invariant to shifting and scaling, Knowl. Inf. Syst. 9 (2006), 202–229. doi:10.1007/s10115-005-0215-8

[26] J. Zhu, F. Xiong, D. Piao, Y. Liu and Y. Zhang, Statistically modeling the effectiveness of disaster information in social media, in: Global Humanitarian Technology Conference (GHTC), 2011 IEEE, pp. 431–436, IEEE, 2011. doi:10.1109/GHTC.2011.48

Article note: This work has been presented in part in the conference ICNSC 2013.

Published Online: 2014-2-21

Published in Print: 2014-6-1

This article is distributed under the terms of the Creative Commons Attribution Non-Commercial License, which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
