Student seminar
The event has passed

Jens Ifver and Calvin Smith present their master's thesis

"Classifying written and spoken text - A comparison between BERT and Naive Bayes".

Overview

  • Date: 19 January 2023, 14:30 to 15:30
  • Language: English

Students: Jens Ifver and Calvin Smith

Supervisor: Mattias Wahde

Examiner: Torbjörn Lundh

Opponent: Jonathan Hellgren

Abstract

In the field of natural language processing (NLP), written or spoken communication is modeled using machine learning algorithms. However, because data generated from spoken language is scarce, most pre-trained NLP models have been trained on written text. Since written and spoken language differ significantly, this could degrade a model's performance on spoken language.

The study investigates how well the machine learning model BERT, compared with a naive Bayes classifier, can distinguish between written and spoken language, and whether BERT’s ability to predict masked words differs between the two classes.

The results indicate that both BERT and the naive Bayes classifier are able to separate written and spoken text fairly accurately, achieving a mean accuracy of 80.8% and 75.7% respectively. A surprising result, however, is that the naive Bayes classifier outperforms BERT at classifying spoken sentences. The masked word predictions showed that BERT performs poorly at predicting masked words that are common in the spoken data and uncommon in the written data. This suggests that BERT may have a weaker understanding of spoken language in general.
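
For readers unfamiliar with the baseline, a naive Bayes text classifier of the kind mentioned above can be sketched in a few lines. This is only an illustration under stated assumptions, not the implementation used in the thesis: the toy sentences, the labels "spoken" and "written", the whitespace tokenization, and the add-one smoothing are all hypothetical choices made for this example.

```python
import math
from collections import Counter, defaultdict


def train(samples):
    """Count documents per class and words per class from (text, label) pairs."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in samples:
        class_counts[label] += 1
        for word in text.lower().split():  # assumption: whitespace tokens
            word_counts[label][word] += 1
            vocab.add(word)
    return class_counts, word_counts, vocab


def predict(model, text):
    """Return the class with the highest log posterior for the given text."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_counts.items():
        score = math.log(n_docs / total_docs)  # log prior
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            if word not in vocab:
                continue  # ignore words never seen in training
            # Add-one (Laplace) smoothing avoids zero probabilities.
            count = word_counts[label][word]
            score += math.log((count + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical toy data, invented for this sketch only.
TOY_DATA = [
    ("um yeah so i think it was like really good", "spoken"),
    ("well you know i mean it was kind of okay", "spoken"),
    ("the results indicate a significant improvement in accuracy", "written"),
    ("the study investigates the performance of the model", "written"),
]

model = train(TOY_DATA)
print(predict(model, "yeah i mean it was like okay"))
print(predict(model, "the results of the study"))
```

The classifier scores each class by adding the log prior to the smoothed log likelihood of every known word, which is why filler words typical of speech pull a sentence toward the "spoken" class in this toy setup.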