Doctoral thesis

Teodor Fredriksson, Interaction Design and Software Engineering

On Semi-Supervised Learning: Evaluation, Challenges and Mitigation Strategies

Overview

Context: 
Supervised learning requires labeled data, but many real-world datasets contain few or no labeled instances. Companies may therefore need to allocate resources to obtain labels. However, labeling is not always trivial, and companies need people with domain knowledge to perform it. Acquiring suitable personnel may be expensive and time-consuming if new staff need to be hired and trained for labeling.

Objective:
The objective of this thesis is to investigate current challenges in data labeling and the strategies used to mitigate them. Once the challenges and the weaknesses of current mitigation strategies have been identified, the goal is to propose solutions and improve those strategies.

Method:
This thesis employs multiple methods. The first study is a systematic mapping study that presents the AI-based algorithms most commonly used for data-labeling problems, together with their most common applications and the datasets used to evaluate them. The second study reports on data collected during an industrial case study and interviews with practitioners from two companies. Based on these data, three data-labeling challenges were formulated, each with a corresponding mitigation strategy. Statistical methods play an important role in the remaining studies, where they are used to analyze algorithms. In two studies, the Bayesian Bradley-Terry model is used to rank graph-based and deep semi-supervised learning algorithms, respectively. In both of these studies, Bayesian generalized linear mixed models are used to analyze the probability that an algorithm reaches a certain performance, with and without added noise. In two other studies, Bayesian item response theory is used to assess how suitable the datasets are for evaluating graph-based and deep semi-supervised learning algorithms. Lastly, Bayesian linear regression is used to analyze the performance of a deep semi-supervised learning algorithm and its relative improvement over supervised learning on a real-world dataset provided by Saab.

Results:
First, the most common AI-based algorithms for data labeling are presented, along with their application domains and the datasets used to evaluate them. Second, challenges and mitigation strategies are presented, as well as currently available algorithms. Third, the best-performing graph-based and deep semi-supervised learning algorithms are identified for each data type. In addition, manual effort is analyzed to show how many labeled instances are required to reach a given accuracy. Fourth, the datasets best suited for evaluating graph-based and deep semi-supervised learning algorithms are presented. Finally, evidence is presented that deep semi-supervised learning may outperform supervised learning on real-world data collected from industry.

Conclusions:
Many AI-based algorithms may help mitigate problems related to data labeling. Active learning allows practitioners to reduce manual labeling and improve the performance of supervised learning by choosing the most informative instances to be labeled. Graph-based algorithms are inductive learning algorithms that automatically label data by learning from already labeled data. Deep semi-supervised learning algorithms are transductive algorithms that use unlabeled data to improve the performance of supervised learning by incorporating an additional loss term into the loss function. Empirical evidence indicates that active learning outperforms passive learning, in which the instances to be labeled are chosen at random. Theoretical studies demonstrate that machine learning algorithms that use unlabeled data may improve on the performance of supervised learning. On the other hand, there are studies indicating that unlabeled data may degrade performance. These conflicting observations may explain why global companies have yet to adopt semi-supervised learning and why there is a lack of research in which semi-supervised learning is applied to real-world data. Deep semi-supervised learning has increased in popularity due to its many advantages, such as robustness. The recently developed deep semi-supervised learning algorithms outperform supervised learning. Graph-based semi-supervised learning can label data with an accuracy above 90%. In addition to performing well on benchmark datasets, both types of algorithms have been shown to perform well when noise is present in the dataset, which suggests that they can be expected to perform well on real-world datasets. Noise may even increase the accuracy. On the other hand, the datasets used to evaluate the algorithms may be inappropriate in the sense that they may be too easy for the algorithms to learn. This can create a false sense of security, as the algorithms may perform worse on real-world datasets that are more difficult to learn.
Finally, it is demonstrated that deep semi-supervised learning algorithms based on pseudo-labeling and data augmentation have the ability to outperform supervised learning on real-world data from industry.
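
As a purely illustrative sketch, and not the algorithms evaluated in this thesis, the idea behind pseudo-labeling can be outlined as self-training: a model is fitted on the labeled instances, confident predictions on unlabeled instances are adopted as labels, and the model is refitted. The nearest-centroid classifier, the distance-based confidence score, the threshold of 0.8 and all function names below are assumptions made for this example only. Deep semi-supervised methods would use a neural network and typically combine this step with data augmentation.

```python
from math import dist  # Euclidean distance, available in Python 3.8+


def centroids(points, labels):
    """Return the mean point of each class as {label: tuple}."""
    sums, counts = {}, {}
    for p, y in zip(points, labels):
        acc = sums.setdefault(y, [0.0] * len(p))
        for k, v in enumerate(p):
            acc[k] += v
        counts[y] = counts.get(y, 0) + 1
    return {y: tuple(v / counts[y] for v in acc) for y, acc in sums.items()}


def predict(cents, p):
    """Return (label, confidence) for point p.

    Confidence is 1 - d1 / (d1 + d2), where d1 and d2 are the distances
    to the nearest and second-nearest centroid. It ranges from 0.5
    (ambiguous) to 1.0 (p lies on a centroid). Assumes at least two classes.
    """
    ranked = sorted(cents, key=lambda y: dist(p, cents[y]))
    d1 = dist(p, cents[ranked[0]])
    d2 = dist(p, cents[ranked[1]])
    conf = 1.0 - d1 / (d1 + d2) if (d1 + d2) > 0 else 0.5
    return ranked[0], conf


def pseudo_label(labeled, labels, unlabeled, threshold=0.8, max_rounds=10):
    """Self-training loop: adopt confident predictions as labels and refit.

    Returns the final centroids and the instances that stayed unlabeled.
    """
    points, ys = list(labeled), list(labels)
    pool = list(unlabeled)
    for _ in range(max_rounds):
        cents = centroids(points, ys)
        remaining, added = [], False
        for p in pool:
            y, conf = predict(cents, p)
            if conf >= threshold:
                points.append(p)   # treat the prediction as a label
                ys.append(y)
                added = True
            else:
                remaining.append(p)
        pool = remaining
        if not added or not pool:
            break
    return centroids(points, ys), pool


# Hypothetical toy data: two labeled instances, three unlabeled ones.
model, leftover = pseudo_label(
    labeled=[(0.0, 0.0), (10.0, 10.0)],
    labels=["a", "b"],
    unlabeled=[(1.0, 1.0), (9.0, 9.0), (5.0, 5.0)],
)
```

In this toy run the two unlabeled points near the labeled ones receive pseudo-labels, while the point midway between the classes stays unlabeled because its confidence falls below the threshold. The threshold controls the trade-off described above: a low value adds more pseudo-labels but risks reinforcing wrong predictions, which is one way unlabeled data can degrade performance.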