Detecting Offensive Malay Language Comments on YouTube using Support Vector Machine (SVM) and Naive Bayes (NB) Model

Ameliyana Mohd Isa, Suzana Ahmad, Norizan Mat Diah

Abstract

Social media platforms such as YouTube, Twitter, and Facebook have become a new means of communication, allowing many users to interact and obtain information. Many users now write and post content containing offensive language. Offensive language is any expression containing offensive words, whether spoken or written, including abusive, racial, and sexual content, and it can occur in multiple languages. Offensive language may jeopardize user engagement. Users can moderate offensive language manually; however, the sheer volume of unstructured data makes this challenging. This study therefore addresses the issue by identifying the offensive words used in YouTube comments, focusing on the Malay language and drawing on the list of offensive words obtained from the Malaysian Communications and Multimedia Commission (MCMC). The study also sets up an experiment for detecting offensive YouTube comments using Term Frequency-Inverse Document Frequency (TF-IDF) and Bag of Words (BoW) features. Random undersampling and random oversampling were employed to treat the imbalanced data. Support Vector Machine (SVM) and Naïve Bayes (NB) classifiers were used to identify whether a comment is offensive. The results showed that the SVM model with TF-IDF as the weighting feature is the best approach for this study, achieving a recall of 98.70%. Both models are effective in this study, with NB producing slightly lower results than SVM. Results could be improved through further data preprocessing and adjustment of the classifiers.
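The abstract names TF-IDF weighting and random oversampling but does not describe how they were implemented. As a rough illustration only, the pure-Python sketch below shows one common form of each step. The function names `tfidf` and `random_oversample`, the toy tokenised comments, and the placeholder token `kata_kasar` are all hypothetical, and the particular TF-IDF variant (length-normalised term frequency with an unsmoothed logarithmic IDF) is an assumption rather than the authors' setup. The SVM and NB classification step is not shown.

```python
import math
import random
from collections import Counter


def tfidf(docs):
    """Return one {term: weight} dict per tokenised document.

    Assumed variant: tf = count / document length, idf = ln(N / df).
    """
    n = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append(
            {t: (c / len(doc)) * math.log(n / df[t]) for t, c in tf.items()}
        )
    return weights


def random_oversample(samples, labels, seed=0):
    """Duplicate randomly chosen samples of smaller classes until all
    classes are as large as the largest one."""
    rng = random.Random(seed)
    by_class = {}
    for sample, label in zip(samples, labels):
        by_class.setdefault(label, []).append(sample)
    target = max(len(items) for items in by_class.values())
    out_samples, out_labels = [], []
    for label, items in by_class.items():
        extra = [rng.choice(items) for _ in range(target - len(items))]
        out_samples.extend(items + extra)
        out_labels.extend([label] * target)
    return out_samples, out_labels


# Hypothetical toy data: "kata_kasar" stands in for an offensive word.
docs = [
    ["video", "ini", "bagus"],
    ["kata_kasar", "video", "ini"],
    ["terima", "kasih", "video"],
]
labels = [0, 1, 0]  # 1 = offensive, 0 = not offensive

weights = tfidf(docs)
balanced_docs, balanced_labels = random_oversample(docs, labels)
```

In this toy data the term `video` occurs in every comment, so its IDF is zero and it carries no weight, while the rarer placeholder token receives the largest weight in its comment. Random undersampling would be the mirror image: discarding randomly chosen majority-class samples instead of duplicating minority-class ones.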
