"Classifying written and spoken text - A comparison between BERT and Naive Bayes".
Overview
- Date: 19 January 2023, 14:30–15:30
- Language: English
- Students: Jens Ifver and Calvin Smith
- Supervisor: Mattias Wahde
- Examiner: Torbjörn Lundh
- Opponent: Jonathan Hellgren
Abstract
In the field of natural language processing (NLP), written or spoken communication is modeled using machine learning algorithms. However, the scarcity of data generated from spoken language means that most pre-trained NLP models are trained on written text. This may hurt model performance, since written and spoken language differ significantly.
The study investigates how well the machine learning model BERT, compared with a naive Bayes classifier, can distinguish between written and spoken language, and whether BERT's ability to predict masked words differs between the two classes.
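To make the masked-word task concrete, the following is a minimal sketch of masked-word prediction with a pretrained BERT model using the Hugging Face transformers fill-mask pipeline; the model name (bert-base-uncased) and the example sentences are illustrative assumptions, not the exact model or data used in the study.

```python
# Minimal sketch: masked-word prediction with a pretrained BERT model.
# Assumption: bert-base-uncased and the example sentences are placeholders,
# not the thesis' actual model or corpora.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")

# [MASK] marks the token BERT is asked to predict.
sentences = [
    "The committee will [MASK] its decision next week.",     # written-style
    "yeah I mean we could [MASK] up tomorrow if that works",  # spoken-style
]

for sentence in sentences:
    print(sentence)
    for prediction in fill_mask(sentence, top_k=5):
        print(f"  {prediction['token_str']:>10}  score={prediction['score']:.3f}")
```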
The results indicate that both BERT and the naive Bayes classifier can separate written and spoken text fairly accurately, achieving mean accuracies of 80.8% and 75.7%, respectively. Surprisingly, however, the naive Bayes classifier outperforms BERT at classifying spoken sentences. The masked-word predictions show that BERT performs poorly at predicting masked words that are common in the spoken data but uncommon in the written data, which could imply that BERT has a weaker understanding of spoken language in general.
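A naive Bayes baseline of the kind compared against BERT can be sketched with scikit-learn; the toy sentences, bag-of-words features, and labels below are placeholder assumptions, not the corpora or preprocessing used in the thesis.

```python
# Minimal sketch: a multinomial naive Bayes baseline for written-vs-spoken
# classification. The toy sentences and labels are placeholders only.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.metrics import accuracy_score

train_sentences = [
    "The results are summarised in the appendix.",         # written
    "um so basically we just went with the first option",  # spoken
    "This report outlines the methodology of the study.",  # written
    "yeah no I totally get what you mean though",          # spoken
]
train_labels = ["written", "spoken", "written", "spoken"]

test_sentences = [
    "The findings are consistent with previous work.",     # written
    "okay so like what do we do next then",                # spoken
]
test_labels = ["written", "spoken"]

# Bag-of-words counts feed a multinomial naive Bayes classifier.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(train_sentences, train_labels)
print("accuracy:", accuracy_score(test_labels, model.predict(test_sentences)))
```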