MISSING DATA IMPUTATION VIA OPTIMIZATION APPROACH: AN APPLICATION TO K-MEANS CLUSTERING OF EXTREME TEMPERATURE
Geovert John D. Labita1, Bernadette F. Tubo2
1University of Science and Technology of Southern Philippines, 2Mindanao State University - Iligan Institute of Technology, [email protected]
Abstract
This paper introduces an optimization approach to impute missing data within the K-means cluster analysis framework. The proposed method is applied to Philippine climate data covering 18 years (2006-2023) with the goal of classifying the regions according to average annual temperature, including the maximum and minimum. This dataset contains missing values resulting from weather stations' measurement failures over certain periods, with no chance of recovery; as a consequence, the regional groupings are greatly affected. This paper adapts a modified method of missing value imputation suitable for climate data clustering, inspired by the work of Bertsimas et al. (2017). The proposed methodology imputes each missing value within an observation by finding the value that minimizes the distance between the observation and a cluster centroid, with the Mahalanobis distance as the similarity measure. The clustering outcomes obtained through this optimization approach were compared with those of established imputation techniques, namely Mean Imputation, the Expectation-Maximization algorithm, and MICE, and the derived clusters were assessed using the silhouette coefficient as the performance metric. Results reveal that the proposed imputation gives the highest silhouette scores, which means that most of the observations are clustered appropriately as compared to the results of the other imputation algorithms. Moreover, it was found that most of the areas showing extreme temperature conditions are located in the middle part of the country.
Keywords: Optimization, K-Means, Mahalanobis
I. Introduction
The risk of extreme temperature most directly affects health by compromising the body's ability to regulate its internal temperature. Loss of internal temperature control can result in various illnesses including heat cramps, heat exhaustion, heatstroke, and hyperthermia from extreme heat events [7]. Thus, awareness of the climatic differences of a particular region of interest becomes a major concern for the safety of the individual.
In detecting weather phenomena like extreme temperature, it is important to classify or cluster regions according to their climatic elements. However, missing climatic data are common in most weather stations, resulting, for example, from damage to or failure of weather instruments. Events such as sickness or vacation of the personnel in charge can also create daily missing values that affect the climate statistics. When this happens, there is no record of measurements for a particular time, which can affect the clustering of weather data, a valuable endeavor in multiple respects. For example, the results can be used in various ways within a larger weather prediction framework or can simply serve as an analytical tool for characterizing climatic differences [4].
In the study of Carro-Calvo et al. [6], a new clustering technique was presented, aiming to generate a robust regionalization from climate datasets with incomplete information. Their method provided a new approach to clustering time series of different temporal lengths using most of the information contained in heterogeneous sets of climate records. Although they showed that their algorithm is able to generate a climatically consistent regionalization, it must be noted that no imputation is performed on the missing information, so the clustering accuracy remains somewhat questionable.
A common practice for dealing with missing values in the context of clustering is to first impute the missing values and then apply the clustering algorithm to the completed data [5]. In the study of Bertsimas et al. [3], a flexible framework based on formal optimization to impute missing data was proposed. This framework can readily incorporate various predictive models, such as k-Nearest Neighbors (kNN), in which the missing data of an observation are imputed by determining the k nearest observations and averaging them. However, the imputation of each observation is not based on the possibility that the point belongs to a particular cluster; the kNN imputation relies purely on the k neighbors, without involvement of the possible resulting clustering.
To resolve the aforementioned deficiencies, this paper develops an appropriate imputation technique for missing values in the context of clustering problems. Specifically, this study constructs a two-step optimization approach for data imputation in K-means cluster analysis, where K is the number of clusters. The first step determines the optimal initial cluster centroids, namely the K most frequent nearest neighbors over all incomplete observations, that is, the K points with the highest densities. The second step imputes the missing values of an observation by determining the values that give the minimum distance from the observation to a cluster centroid. The clustering outcomes achieved through this optimization approach are compared with those of some established imputation approaches, namely Mean Imputation, the Expectation-Maximization algorithm, and Multivariate Imputation by Chained Equations, and the derived clusters are assessed using the silhouette coefficient.
This paper is arranged as follows. The methodology is introduced and discussed in Section 2. The model solution is presented and derived in Section 3. The application of the proposed imputation is illustrated in Section 4, and concluding remarks are stated in Section 5.
II. Methods
This section presents the derivation of the optimization models of the proposed method, together with the imputation algorithm.
Let X = {x_i}_{i=1}^n be the given dataset with p variables and assume that each data vector contains continuous variables indexed by q ∈ {1, 2, ..., p}. Now, the missing and known values are defined by the following sets:

M = {(i, q) : x_iq is missing},   N = {(i, q) : x_iq is known}.

Also, let J be the set of indices of all incomplete observations, given by

J = {i : x_i has at least one missing coordinate}.
Let W ∈ R^{n×p} be the matrix of imputed values, where w_jq is the imputed value for entry x_jq with (j, q) ∈ M. The full imputation of observation x_j is referred to as w_j, where j ∈ J. The idea is to treat the missing data problem as an optimization problem that optimizes the missing values in all incomplete data points. Thus, the key decision variables are the missing values
Geovert John D. Labita, Bernadette F. Tubo. MISSING DATA IMPUTATION VIA OPTIMIZATION. RT&A, No 2 (78), Volume 19, June 2024.
{w_jq : (j, q) ∈ M}.
As a similarity measure, different distance metrics can be incorporated, but we prefer the Mahalanobis distance because it takes into account the variances and covariances among the variables, which is very important in clustering multivariate data. The Mahalanobis metric involves the centroid of the whole dataset, which means that the distance actually measures a point's deviation from the mean of the distribution. Specifically, according to Ghorbani [8], the Mahalanobis distance measures the number of standard deviations that an observation lies from the mean of a distribution.
In using the Mahalanobis distance as a similarity measure, the nearest neighbors of incomplete data are identified from the differences of the squared Mahalanobis distances of two observations. Thus, the nearest neighbor of each w_j, j ∈ J, is the observation w_i, i = 1, 2, ..., n, giving the smallest difference M_j − M_i, that is, the smallest deviation between w_j and w_i, where the squared Mahalanobis distance M_i is given by

M_i(w_i, v) = [w_i1 − v_1 ⋯ w_ip − v_p] Σ^{-1} [w_i1 − v_1 ⋯ w_ip − v_p]^T

with v = (v_1, ..., v_p) and Σ the mean and covariance matrix of the whole data, respectively, which are updated per iteration.
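As a small illustration of this similarity measure (a sketch in Python with my own function name, not code from the paper), the squared Mahalanobis distance can be computed directly from the definition above:

```python
import numpy as np

def mahalanobis_sq(x, mean, cov):
    """Squared Mahalanobis distance of observation x from the distribution
    with the given mean vector and covariance matrix."""
    diff = np.asarray(x, dtype=float) - np.asarray(mean, dtype=float)
    return float(diff @ np.linalg.inv(cov) @ diff)

# Toy 2-variable example: with an identity covariance the Mahalanobis
# distance reduces to the squared Euclidean distance.
mean = np.array([0.0, 0.0])
cov = np.eye(2)
print(mahalanobis_sq([3.0, 4.0], mean, cov))  # 25.0
```

With a non-identity covariance, the same call automatically rescales and decorrelates the coordinates, which is exactly why the paper prefers this metric for multivariate climate data.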
Imputation Model
To obtain the imputed values, the Mahalanobis distance between w_j, j ∈ J, and its appropriate centroid w_l, l ∈ {1, 2, ..., K}, is minimized. Thus, for each j ∈ J, the goal is to solve the imputation model:

min M_j − M_c      (1)

subject to

w_c ∈ {w_l}, l = 1, 2, ..., K      (2)
w_jq = x_jq, (j, q) ∈ N      (3)

The solutions {w_jq}, (j, q) ∈ M, are regarded as the imputed values for the corresponding {x_jq}. It must be noted that the objective function (1) assumes M_j ≥ M_c. If M_c > M_j, the objective changes to max M_j − M_c to represent the same idea, namely that the value of M_j should be near M_c. In other words, the objective function ensures that, whatever imputed values w_jq are obtained, the observation w_j is very close to its appropriate cluster centroid w_c, which is selected through constraint (2). These centroids are determined by the assignment model discussed in the next section. Constraint (3) ensures that all observed data are preserved.
Assignment Model
Let K be the number of clusters specified by the analyst. Assume that the initial cluster centroids {w_l : l = 1, 2, ..., K} are the K most frequent nearest neighbors over all incomplete observations. To obtain the initial centroids, the immediate nearest neighbor of each w_j, j ∈ J, must be determined, resulting in the following assignment model:
min Σ_{i=1}^n z_ij (M_j − M_i)      (4)

subject to

Σ_{i=1}^n z_ij = 1      (5)
z_jj = 0      (6)
z_ij ∈ {0, 1}
The assignment model assigns each incomplete observation to its immediate nearest neighbor, where z_ij = 1 if w_i is the nearest neighbor of w_j and z_ij = 0 otherwise. The objective function (4) determines which w_i is the nearest neighbor of w_j among all observations. Because of constraint (5), there is exactly one immediate nearest neighbor per incomplete observation, and constraint (6) prevents an incomplete observation from being its own nearest neighbor.
From all of the nearest neighbors, selecting the K most frequent observations can then be formulated as an optimization problem using binary variables y_i ∈ {0, 1}:

max Σ_{i=1}^n y_i Σ_{j∈J} z_ij   subject to   Σ_{i=1}^n y_i = K      (7)

The solution {y_{i_1}, ..., y_{i_K}} of model (7) corresponds to the desired initial centroids {w_{i_1}, ..., w_{i_K}}. It must be noted that the assignment model works only on complete data, that is, data whose missing entries have been imputed. For the first iteration, the missing entries can be filled with column means as warm-start values for the optimization process. The imputed values from the imputation model are then based on the centroids obtained from the assignment model; in return, the centroids are updated based on the new imputed values, making the procedure iterative.
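The assignment step and the vote-counting model (7) can be sketched as follows. This is an illustrative Python sketch under my own naming (`initial_centroids` is not from the paper); it reads "nearest" as the smallest |M_j − M_i|, matching the paper's remark that the sign of the difference is flipped when needed, and breaks ties arbitrarily:

```python
import numpy as np
from collections import Counter

def initial_centroids(W, incomplete_idx, K):
    """Pick K initial centroids as the K most frequently chosen immediate
    nearest neighbors of the incomplete observations, where 'nearest' means
    the smallest |M_j - M_i| (difference of squared Mahalanobis distances
    from the global mean), following the assignment model."""
    v = W.mean(axis=0)
    S_inv = np.linalg.inv(np.cov(W, rowvar=False))
    # Squared Mahalanobis distance of every row from the global mean.
    M = np.array([(w - v) @ S_inv @ (w - v) for w in W])
    votes = Counter()
    for j in incomplete_idx:
        diffs = np.abs(M[j] - M)
        diffs[j] = np.inf  # constraint (6): a point cannot be its own neighbor
        votes[int(np.argmin(diffs))] += 1
    # Model (7): keep the K indices receiving the most nearest-neighbor votes.
    return [i for i, _ in votes.most_common(K)]

# Toy run on already-completed data: rows 0 and 5 are the 'incomplete' ones.
W = np.array([[1., 2.], [2., 1.], [3., 3.], [10., 10.], [11., 9.], [2., 2.]])
winners = initial_centroids(W, incomplete_idx=[0, 5], K=2)
print(winners)
```

Note that if the incomplete observations share fewer than K distinct nearest neighbors, fewer than K centroids are returned; a production implementation would need a fallback for that case.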
Imputation Algorithm
The proposed data imputation algorithm is given in the following steps:
1. Input: X ∈ R^{n×p}, a data matrix with missing entries M = {(i, q) : x_iq is missing}; a warm start W^0 ∈ R^{n×p}; and the number of clusters K.
2. Output: W*, a full matrix with imputed values, and the initial centroids {w_{i_1}, ..., w_{i_K}}.
3. Initialize: W^old ← W^0.
4. repeat
5.    Update the mean μ and covariance matrix Σ based on W^old.
6.    Update the auxiliary variables Z* using the assignment model.
7.    Update the initial centroids as the K observations w_k whose vote counts satisfy Σ_{j∈J} z_kj ≥ Σ_{j∈J} z_ij for all i ∈ {1, 2, ..., n}.
8.    Update the imputation W* using the imputation model.
9.    (Z^old, W^old, μ^old) ← (Z*, W*, μ*).
10. until μ* = μ^old
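The iterative loop above can be sketched in Python. This is a minimal illustration under my own naming (`impute_optimize` is not from the paper): since a missing coordinate w_jq appears only in its own distance M_j, each imputation step applies the closed-form minimizer derived in the Results section, and the loop stops when the mean vector stabilizes. The combinatorial centroid-selection models are omitted here because the closed form does not involve the centroid term:

```python
import numpy as np

def impute_optimize(X, max_iter=50, tol=1e-6):
    """Sketch of the iterative scheme: warm-start missing entries with column
    means, then repeatedly re-estimate the mean/covariance and re-impute each
    missing entry with the value minimizing its squared Mahalanobis distance,
    until the mean vector stops changing (the stopping rule mu* = mu^old,
    relaxed here to a small tolerance)."""
    X = np.asarray(X, dtype=float)
    miss = np.isnan(X)
    W = X.copy()
    col_means = np.nanmean(X, axis=0)
    W[miss] = np.take(col_means, np.where(miss)[1])  # warm start W^0
    for _ in range(max_iter):
        v = W.mean(axis=0)
        S_inv = np.linalg.inv(np.cov(W, rowvar=False))
        W_new = W.copy()
        for j in np.where(miss.any(axis=1))[0]:
            for q in np.where(miss[j])[0]:
                # Closed form: w_jq = v_q - (1/s^qq) * sum_{a != q} s^qa (w_ja - v_a)
                others = [a for a in range(X.shape[1]) if a != q]
                corr = sum(S_inv[q, a] * (W_new[j, a] - v[a]) for a in others)
                W_new[j, q] = v[q] - corr / S_inv[q, q]
        if np.max(np.abs(W_new.mean(axis=0) - v)) < tol:
            return W_new
        W = W_new
    return W

# Two nearly collinear columns: the missing entry is pulled toward the line.
X = np.array([[1., 1.], [2., 2.], [3., 3.], [4., 4.], [5., np.nan]])
W = impute_optimize(X)
print(W[4, 1])
```

On this toy input the imputed entry moves away from the column mean (2.5) toward the value suggested by the strong inter-column correlation, which is the behavior the Mahalanobis objective is designed to produce.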
III. Results
This section presents the solution of the proposed imputation method using Mahalanobis distance.
Proposition 1. Let X = {x_i}_{i=1}^n be a dataset with p variables, where the missing and known values are specified by the sets M = {(i, q) : x_iq is missing} and N = {(i, q) : x_iq is known}, respectively. If (j, q) ∈ M, then the solution of the optimization problem (1)-(3) is given by

w_jq = v_q − (1/σ^{qq}) Σ_{a: a≠q} σ^{qa} (w_ja − v_a),

where the σ^{qa} ∈ R are the entries of Σ^{-1} and σ^{qq} > 0.
Proof. Let (j, q) ∈ M and consider the optimization problem (1)-(3). Suppose that w_c = w_l such that M_j − M_l ≤ M_j − M_m for all m ≠ l. Then, considering the unconstrained optimization obtained by substituting the known values x_jq into the corresponding w_jq for all (j, q) ∈ N in objective function (1), we can use the concept of a relative minimum in calculus to solve for the w_jq that minimizes M_j − M_l. Since the missing variable w_jq is present only in M_j, the problem reduces to differentiating

M_j = [w_j1 − v_1 ⋯ w_jq − v_q ⋯ w_jp − v_p] Σ^{-1} [w_j1 − v_1 ⋯ w_jq − v_q ⋯ w_jp − v_p]^T

with respect to w_jq, where v = (v_1, ..., v_p) and Σ are the mean and covariance matrix, respectively. Now, writing the inverse covariance matrix entrywise as Σ^{-1} = [σ^{ab}], a, b = 1, 2, ..., p, we have

M_j = Σ_{b=1}^p Σ_{a=1}^p σ^{ab} (w_ja − v_a)(w_jb − v_b).
To differentiate M_j, we separate the terms containing w_jq; using the symmetry of Σ^{-1},

M_j = σ^{qq}(w_jq − v_q)^2 + 2 Σ_{a: a≠q} σ^{qa} (w_jq − v_q)(w_ja − v_a) + Σ_{b: b≠q} Σ_{a: a≠q} σ^{ab} (w_ja − v_a)(w_jb − v_b),

so that

D_{w_jq}(M_j) = 2σ^{qq}(w_jq − v_q) + 2 Σ_{a: a≠q} σ^{qa}(w_ja − v_a).

Finally, equating the derivative to zero solves for the imputed value:

2σ^{qq}(w_jq − v_q) + 2 Σ_{a: a≠q} σ^{qa}(w_ja − v_a) = 0
2σ^{qq} w_jq = 2σ^{qq} v_q − 2 Σ_{a: a≠q} σ^{qa}(w_ja − v_a)
w_jq = v_q − (1/σ^{qq}) Σ_{a: a≠q} σ^{qa}(w_ja − v_a). ∎
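The closed form above can be checked numerically (an illustrative Python sketch, not part of the paper): for a random dataset, the formula's value should minimize the squared Mahalanobis distance M_j over the missing coordinate.

```python
import numpy as np

# Numerical sanity check of the closed-form minimizer
# w_jq = v_q - (1/s^qq) * sum_{a != q} s^qa (w_ja - v_a).
rng = np.random.default_rng(0)
data = rng.normal(size=(200, 3))
v = data.mean(axis=0)
S_inv = np.linalg.inv(np.cov(data, rowvar=False))

w = np.array([0.7, np.nan, -0.4])  # coordinate q = 1 is treated as missing
q = 1
others = [a for a in range(3) if a != q]
w_q = v[q] - sum(S_inv[q, a] * (w[a] - v[a]) for a in others) / S_inv[q, q]

def M(x):
    """Squared Mahalanobis distance of w with the missing slot filled by x."""
    d = np.array([w[0], x, w[2]]) - v
    return d @ S_inv @ d

# M at the closed-form value is no larger than at nearby perturbations.
assert M(w_q) <= M(w_q + 0.1) and M(w_q) <= M(w_q - 0.1)
```

Because M is a convex quadratic in the single missing coordinate, the stationary point is the exact global minimizer, so the assertion holds for any dataset with an invertible covariance matrix.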
The following theorem will be used to prove the next proposition.
Theorem 1 (Andréasson et al. [1]). Suppose that f: R^d → R is in C^2 on R^d, that is, f is twice differentiable with continuous second partial derivatives. Then ∇f(w*) = 0_d and ∇²f(w*) positive definite imply that w* is a strict local minimum of f, where ∇f(w) = (∂f/∂w_q)_{q=1}^d. For d = 1, f′(w*) = 0 and f″(w*) > 0 imply that w* ∈ R is a strict local minimum.
Proposition 2. The solution w_jq given in Proposition 1 is a strict local minimum of the optimization problem (1)-(3) in an unconstrained setting.
Proof (for the case d = 1). Let f: R → R be defined by the objective function of the optimization problem (1)-(3) in an unconstrained setting. Following the same argument as in the proof of Proposition 1, for any solution w*, we have

f′(w*) = 2σ^{qq}(w* − v_q) + 2 Σ_{a: a≠q} σ^{qa}(w_ja − v_a)   and   f″(w*) = 2σ^{qq}.

Since f′ and f″ are linear functions, they are continuous. Also, f″(w) = 2σ^{qq} > 0, since Σ^{-1} is positive definite (assuming the data samples are unique) and hence its diagonal entries are positive. Now, substituting the solution from Proposition 1,

f′(w_jq) = 2σ^{qq} ( v_q − (1/σ^{qq}) Σ_{a: a≠q} σ^{qa}(w_ja − v_a) − v_q ) + 2 Σ_{a: a≠q} σ^{qa}(w_ja − v_a)
         = −2 Σ_{a: a≠q} σ^{qa}(w_ja − v_a) + 2 Σ_{a: a≠q} σ^{qa}(w_ja − v_a)
         = 0.
Thus, by Theorem 1, the solution w_jq is a strict local minimum. ∎
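The positivity claim underlying the second-order condition can be confirmed numerically (illustrative only, not part of the paper): the diagonal of an inverse sample covariance matrix is positive because Σ^{-1} is positive definite.

```python
import numpy as np

# Diagonal entries of an inverse covariance matrix are positive, so
# f''(w) = 2 * s^qq > 0 and the stationary point is a strict local minimum.
rng = np.random.default_rng(2)
sample = rng.normal(size=(100, 4))
S_inv = np.linalg.inv(np.cov(sample, rowvar=False))
print(np.diag(S_inv))  # every entry is strictly positive
```

This holds for any sample whose covariance matrix is invertible, i.e., whenever no variable is an exact linear combination of the others.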
IV. Application
The proposed methodology is applied to the historical Philippine climate data (2006-2023) taken from the 52 weather stations around the country, which can be downloaded at https://en.tutiempo.net/climate/philippines.html and is shown in Table 2. This dataset of three continuous variables per year (a 52 × 54 data matrix) contains actual missing values. This study can be considered a multivariate time series clustering with the goal of classifying the regions suspected to have extreme temperature conditions.
In the experiment, the missing elements of the data are first imputed using the different imputation methods, and then the traditional K-means algorithm is applied to the imputed dataset. The experiments with random centroid initialization (mean, MICE, EM) are repeated 100 times with different random seeds to reduce the effect of randomness caused by the traditional K-means, and the best result is reported.
We use the R function "silhouette()" from the R package "cluster" to obtain the silhouette scores of the clustering results. The silhouette coefficient (or silhouette score), ranging from −1 to +1, is a measure of how similar an object is to its own cluster compared to other clusters; in other words, it is a metric used to calculate the goodness of a clustering [2]. A high value indicates that the object is well matched to its own cluster. Thus, it acts as the accuracy in the case when the cluster labels are not known.
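For readers not working in R, the silhouette computation can be reproduced in a few lines of Python (my own function name; Euclidean distances, matching the standard definition used by cluster::silhouette()):

```python
import numpy as np

def silhouette(X, labels):
    """Mean silhouette coefficient: for each point, a = mean distance to its
    own cluster, b = smallest mean distance to another cluster, and the
    per-point score is (b - a) / max(a, b); labels must be a numpy array."""
    X = np.asarray(X, dtype=float)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)  # pairwise distances
    n = len(X)
    scores = []
    for i in range(n):
        li = labels[i]
        own = (labels == li) & (np.arange(n) != i)
        a = D[i, own].mean()
        b = min(D[i, labels == l].mean() for l in set(labels) if l != li)
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))

# Two well-separated groups score near +1; shuffled labels score much lower.
X = np.array([[0., 0.], [0., 1.], [1., 0.], [10., 10.], [10., 11.], [11., 10.]])
good = silhouette(X, np.array([0, 0, 0, 1, 1, 1]))
print(good)
```

This sketch assumes every cluster has at least two points; singleton clusters would need the usual convention of assigning them a score of 0.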
Table 1 shows the silhouette score results for different numbers of clusters; the highest score in each case is attained by the proposed imputation.
Table 1: Silhouette Scores (%) using different imputation algorithms
# of Clusters Proposed Imputation Mean Imputation MICE Expectation-Maximization
K=2 84.78 75.26 61.51 70.98
K=3 72.6 62.7 52.64 58.66
K=4 58.02 36.03 29.39 21.75
K=5 58.02 22 26.38 19.12
K=6 57.99 20.5 33.33 19.04
K=7 41.11 20.09 17.89 18.17
K=8 38.06 17.89 17.66 17.16
K=9 36.56 17.59 17.03 16.65
K=10 36.27 16.88 15.74 17.75
Using the proposed imputation method, we can classify the extreme temperature areas. For example, if we set K = 10, the results show two clusters exhibiting extreme temperature, with an overall average of at least 28°C. These areas are shown in Figure 1.
Extreme Temperature:
• Iba, Zambales
• Manila
• Sangley Point, Cavite
• Catarman, Northern Samar
• Catbalogan, Western Samar
• Guiuan, Eastern Samar
• Roxas, Capiz
• Tagbilaran, Bohol
• Butuan
Figure 1: Philippine map with clustering results from the proposed imputation
From Figure 1, the areas with red spots are classified with extreme temperature. It can be observed that most of the areas are located in the middle part of the country.
V. Concluding Remarks
This paper presents a missing data imputation algorithm suited to partitional clustering. It is built on an optimization approach for imputing missing data and makes use of the Mahalanobis distance as a similarity measure. It also avoids the problem of centroid initialization when performing K-means clustering, because the initial cluster centroids are fixed by the centroids generated by the algorithm.
When clustering the Philippine climate data with 21% actual missing values, we identified nine places with an extreme temperature classification, which means that these places must be considered when predicting extreme temperature occurrences. It was found that the proposed imputation using the Mahalanobis distance gave higher clustering performance and is consistent across different numbers of clusters, which means that the proposed optimization approach is a suitable imputation algorithm in the context of partitional clustering.
Table 2: Historical Philippine climate data (2006-2023) from the 52 weather stations [data table not reproducible from the source scan]
References
[1] Andréasson, N., Evgrafov, A., & Patriksson, M. (2005). An introduction to optimization: Foundations and fundamental algorithms. Chalmers University of Technology Press: Gothenburg, Sweden, 1:1-205.
[2] Bhardwaj, A. (2020). Silhouette coefficient validating clustering techniques. Towards Data Science.
[3] Bertsimas, D., Pawlowski, C., & Zhuo, Y.D. (2017). From predictive methods to missing data imputation: an optimization approach. J. Mach. Learn. Res., 18(1);7133-7171.
[4] Beveridge, N.R. (2021). Deep learning for weather clustering and forecasting. Air Force Institute of Technology. https://scholar.afit.edu/etd/5082
[5] Boluki, S., Zamani Dadaneh, S., Qian, X., & Dougherty, E.R. (2018). Optimal clustering with missing values. In Proceedings of the 2018 ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics, 593-594.
[6] Carro-Calvo, L., Jaume-Santero, F., García-Herrera, R., & Salcedo-Sanz, S. (2021). k-Gaps: a novel technique for clustering incomplete climatological time series. Theoretical and Applied Climatology, 143(1-2);447-460.
[7] Ebi, K.L., Capon, A., Berry, P., Broderick, C., de Dear, R., Havenith, G., ... & Jay, O. (2021). Hot weather and heat extremes: health risks. The Lancet, 398(10301);698-708.
[8] Ghorbani, H. (2019). Mahalanobis distance and its application for detecting multivariate outliers. Facta Universitatis, Series: Mathematics and Informatics, 583-595.