Raksmey Phann, Chitsutha Soomlek, Pusadee Seresangtakul

Multi-Class Text Classification on Khmer News Using Ensemble Method in Machine Learning Algorithms

Číslo: 2/2023
Periodikum: Acta Informatica Pragensia
DOI: 10.18267/j.aip.210

Klíčová slova: Text classification; Khmer news; Machine learning; Feature extraction; Optimal hyperparameters; News categorization; Ensemble learning method

Pro získání musíte mít účet v Citace PRO.

Přečíst po přihlášení

Anotace: The research herein applies text classification with which to categorize Khmer news articles. News articles were collected from three online websites through web scraping and grouped into nine categories. After text preprocessing, the dataset was split into training and testing sets. We then evaluated the performance of the ensemble learning method via machine learning classifiers with k-fold validation. Various machine learning classifiers were employed, namely logistic regression, Complement Naive Bayes, Bernoulli Naive Bayes, k-nearest neighbours, perceptron, support vector machines, stochastic gradient descent, AdaBoost, decision tree, and random forest were employed. Accuracy was improved for the categorization of Khmer news articles, in which Grid Search CV was used to find the optimal hyperparameters for each machine learning classifier with feature extraction TF-IDF and Delta TF-IDF. The results determined that the highest accuracy was achieved through the ensemble learning method in the support vector machine with the optimal hyperparameters (C = 10, kernel = rbf), using feature extraction TF-IDF and Delta TF-IDF, at 83.47% and 83.40%, respectively. The model establishes that Khmer news articles can be accurately categorized.