Author: Mustakim

Published in: Journal of Theoretical and Applied Information Technology

Abstract:

One of the challenges in classification is how to divide a dataset into training and testing parts that each represent the full data distribution. The most commonly used technique is K-Fold Cross Validation, which divides the data into several parts and uses them alternately as training data and testing data. Another common option in data mining research is a percentage split (70% and 30%). K-Means is a clustering algorithm that can make the distribution of data for classification more effective. In experiments with K-Nearest Neighbor (K-NN) validated by a confusion matrix, data distribution based on K-Means Clustering reached a highest accuracy of 93.4%, higher than the K-Fold Cross Validation data distribution technique in every experiment, both on Education Management Information System (EMIS) data and on random data. Distributing data by group lets the members of each group be represented and can increase the accuracy of the classification algorithm, even though the experiments applied only a split of 70% training data and 30% testing data within each group.
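The cluster-wise split described in the abstract can be sketched roughly as follows. This is a minimal illustration in plain Python, not the paper's implementation: the helper names `kmeans` and `cluster_split`, the two-blob toy data, the choice of k = 2, and the fixed seeds are all assumptions made for the example. The K-NN classification step is left out.

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's k-means (illustrative); returns a cluster index per point."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # pick k distinct points as starting centers
    labels = [0] * len(points)
    for _ in range(iters):
        # Assign each point to its nearest center (squared Euclidean distance).
        labels = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            for p in points
        ]
        # Move each center to the mean of its members; an empty cluster keeps its center.
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels

def cluster_split(points, k, train_frac=0.7, seed=0):
    """Return (train_idx, test_idx), taking train_frac of every cluster for training."""
    rng = random.Random(seed)
    labels = kmeans(points, k, seed=seed)
    train_idx, test_idx = [], []
    for c in range(k):
        idx = [i for i, l in enumerate(labels) if l == c]
        rng.shuffle(idx)                   # random order inside the cluster
        cut = round(train_frac * len(idx))
        train_idx += idx[:cut]             # about 70% of this cluster
        test_idx += idx[cut:]              # the remaining ~30%
    return sorted(train_idx), sorted(test_idx)

# Hypothetical toy data: two well-separated 2-D blobs of 20 points each.
data_rng = random.Random(1)
points = [(data_rng.gauss(0, 1), data_rng.gauss(0, 1)) for _ in range(20)]
points += [(data_rng.gauss(8, 1), data_rng.gauss(8, 1)) for _ in range(20)]

train_idx, test_idx = cluster_split(points, k=2)
```

Because the 70/30 cut is made inside every cluster, each cluster contributes to both sets as long as it has enough members; a cluster with a single record would end up entirely in the training set under this rounding.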

Conclusion:

The research yields several findings. Distributing training data and testing data based on K-Means Clustering gave higher confusion-matrix accuracy than K-Fold Cross Validation in all experiments. The highest values for the two data distribution techniques were 93.4% for K-Means Clustering and 77.8% for K-Fold Cross Validation. Experiments on EMIS data tended to be more accurate than those on random data under both K-Means Clustering and K-Fold Cross Validation, because the distribution range of the random data has no specific variation. This research has not yet been tested on large data on the order of hundreds of millions of records with many attributes, and the distribution within each group was based only on 70% training data and 30% testing data. A further disadvantage of distributing data by clustering is that cluster membership can be uneven: clusters may differ greatly in the data they contain, and a group may have no members at all. The advantage of using clustering to divide the data is that the training data and the test data can both be represented by members of every cluster, so proximity becomes the guiding pattern for splitting data in the classification process.
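For reference, accuracy is read off a confusion matrix as the sum of the diagonal (correct predictions) divided by the total count. The sketch below uses made-up labels, not the paper's data, and the function names are illustrative assumptions.

```python
def confusion_matrix(y_true, y_pred, classes):
    """Rows are true classes, columns are predicted classes."""
    index = {c: i for i, c in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for t, p in zip(y_true, y_pred):
        matrix[index[t]][index[p]] += 1
    return matrix

def accuracy(matrix):
    """Correct predictions (diagonal) divided by all predictions."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total if total else 0.0

# Hypothetical toy labels, for illustration only.
y_true = ["a", "a", "b", "b", "b"]
y_pred = ["a", "b", "b", "b", "a"]

cm = confusion_matrix(y_true, y_pred, classes=["a", "b"])
acc = accuracy(cm)  # 3 correct out of 5
```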
