Partitioning Methods of Data Clustering with R

Written by T. Engel

Paper category: Term Paper


Clustering is one of the most prominent data mining tools used in data exploration. It is a method of assigning data items to groups whose members are similar to each other and dissimilar to the members of other groups. Such partitions are useful for various tasks, such as document classification or customer segmentation. Among the commonly used clustering algorithms, the k-means method is one of the most effective (Han & Kamber, 2000, p. 1).

In this paper, R, a language and environment for statistical computing and graphics, is used to explore clustering on the iris data set. The two clustering exercises use the standard k-means and k-medoids methods, and the code required to apply them in R is shown. In addition, the example highlights how efficiently R supports the clustering analysis of the iris data set.

Unsupervised learning is essentially synonymous with clustering. Since the input examples carry no class labels, the learning process is unsupervised, and clustering is used to find classes in the data. Given example sets of items together with their correct clusterings, the goal is to learn a similarity measure so that future sets of items are clustered in a similar way. Unsupervised learning methods take a data set as input and find clusters in it. However, because the data is not labeled, the model cannot tell us what the clusters mean semantically.

We live in a world where large amounts of data are collected every day, and analyzing such data is a basic requirement. Data analysis can therefore be seen as a result of the continuous development of information technology. Unlike classification and regression, which analyze class-labeled data sets, cluster analysis examines data objects without reference to class labels. Often the data set has no class labels to begin with.

Clustering can be used to generate class labels for a set of data (Finley & Joachims, 2008). Data points are grouped according to the principle of maximizing intra-class similarity and minimizing inter-class similarity. The resulting clusters are formed so that the objects within a cluster are highly similar to each other but dissimilar to the objects in other clusters. Each cluster can be seen as a class of objects from which rules can be derived.

Outliers can be detected with statistical tests that assume a distribution or probability model for the data. Alternatively, a distance measure is used, and data points far away from every cluster are treated as outliers. Density-based methods, rather than relying on statistical or distance measures, can identify outliers in a local region even though these points look normal from the perspective of the global statistical distribution (Han & Kamber, 2000, pp. 19-21).

The iris data set was selected for applying the clustering methods. It is one of the best-known data sets in the field of pattern recognition. It consists of 150 samples divided among three iris species: Iris setosa, Iris versicolor, and Iris virginica. For each sample, four attributes were measured: the length and width of the sepals and the length and width of the petals. The R function str() gives a closer look at the data set. There are five columns of information for 150 observations. The first four are numeric variables, while the last column is a factor with three levels (Loseva, 2018).

k-means clustering is an unsupervised algorithm used to solve clustering problems. It partitions the given data set into a number of clusters, k. Data points within a cluster are homogeneous, and heterogeneous with respect to data points outside it. First, k objects are selected from the data set, each of which initially represents a cluster center. Every other object is assigned to the cluster whose center is closest to it. Second, the mean of each cluster is computed and becomes its new centroid. Finally, the assignment and update steps are repeated, reducing the distance between the data points and their cluster centers, until convergence occurs (Alsabti et al., 1998). The basic function kmeans() ships with R and does not need to be downloaded. To show the power of R, the package fpc is introduced, which can run k-means without the number of clusters being set in advance.

k-medoids clustering, best known through the Partitioning Around Medoids (PAM) algorithm, differs from traditional k-means clustering. A medoid is the object in a cluster whose dissimilarity to all other objects in the cluster is smallest. Where k-means represents a cluster by its mean, k-medoids represents it by the object closest to the center. Because k-medoids chooses an actual data point as the center, it is more robust to outliers. R provides PAM as the classic algorithm for k-medoids clustering. The CLARA algorithm extends PAM: it draws multiple samples from the data, applies PAM to each sample, and returns the best clustering. CLARA performs better on larger data sets. Both functions are available in the package cluster (Kaufman & Rousseeuw, 2005, pp. 70-74).

Performing partitioning clustering in R requires a modification of the iris data set. The clusters are meant to separate the iris species, so the class label must be removed. To keep access to the original data, the iris data set is first copied. After copying, R is used to remove the species column from the copy. This modified data set is used for all clustering runs.
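
The data preparation and the two partitioning methods described above might look roughly like this in R. This is a minimal sketch, not the paper's own code. The object names (iris2, km, pm, cl), the seed, the choice of k = 3, and nstart = 25 are assumptions made for illustration. kmeans() is part of base R, while pam() and clara() come from the cluster package.

```r
# Sketch only: object names, the seed, and parameter values are illustrative assumptions.
library(cluster)  # provides pam() and clara()

# Inspect the original data: 150 observations, 4 numeric columns, 1 factor
str(iris)

# Work on a copy so the labeled original stays available for comparison
iris2 <- iris
iris2$Species <- NULL  # drop the class label; four numeric columns remain

set.seed(42)  # arbitrary seed so the random starts of kmeans() are reproducible

# k-means with k = 3 (one cluster per species); nstart repeats the random start
km <- kmeans(iris2, centers = 3, nstart = 25)

# k-medoids via PAM with the same k
pm <- pam(iris2, k = 3)

# CLARA: applies PAM to several subsamples and keeps the best result
cl <- clara(iris2, k = 3, samples = 5)

# Cross-tabulate the held-back species against each clustering
print(table(iris$Species, km$cluster))
print(table(iris$Species, pm$clustering))
print(table(iris$Species, cl$clustering))

# Unlike k-means centers, the medoids are actual rows of the data
print(pm$medoids)
```

Setting k = 3 mirrors the three species in the data. The cross-tabulations show how closely each clustering recovers them, keeping in mind that cluster numbers are arbitrary labels and need not line up with the order of the species.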