Student project
This event has passed

Thesis presentation by Hedvig Sundelin

Title: Machine Learning for Genetic Studies

Subtitle: Exploring the Potential of Machine Learning Models for Predicting Preterm Delivery using Genetic Markers

Overview

Examiner: Andreas Fhager

Supervisor: Julius Joudakis

Opponent: Shabir Afshar

Abstract:
Preterm delivery (PTD) is a significant contributor to infant mortality and morbidity worldwide, influenced by environmental and genetic factors. Although previous studies have identified genetic variants associated with PTD and gestational duration, their effect sizes remain relatively small, leaving a substantial portion of the hereditary variation unexplained. This thesis explores the potential of machine learning (ML) techniques to uncover additional insights into PTD and gestational duration using genetic data.

The background section underscores the global impact of preterm birth on child mortality and long-term health outcomes, emphasising the role of genetics with an estimated heritability of around 30%. This project aims to apply ML techniques to improve the prediction of gestational duration and PTD based on genetic data. Research questions address ML model selection, the impact of variables on prediction performance, and a comparison to previous studies. The study is based on the Norwegian Mother, Father and Child Cohort Study (MoBa) and uses data from the Medical Birth Registry of Norway (MBRN). The scope includes the use of genetic data and a focus on the 23 loci previously identified in a related study.

The theory chapter provides an overview of genetics and its application in studying complex conditions like preterm delivery. It also introduces ML and explains the theoretical foundations of the ML models employed for predicting PTD from individual-level genetic data. The methods and materials chapter then describes the data acquisition process, the preprocessing steps, the ML classifiers used for prediction, and the model evaluation methods. The chapter highlights the use of neural networks, classic ML algorithms, and the libraries used for implementation.

The results show varying AUC scores among the classic models, with logistic regression (LR) performing best. The choice of variables, particularly the maternal genome and the Top 23 set, improved prediction accuracy. The neural network models achieved competitive AUC scores, with stochastic gradient descent (SGD2) giving the highest score for binary classification. Additional analyses of the predicted probabilities yielded higher AUC scores than the binary classifications and identified RMSprop as the best-performing network model. Results were consistent across the classic models but varied between folds. The findings suggest that more extensive research is needed to unveil the potential of ML models for improving predictions of PTD and gestational duration from genetic data.
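
The difference between AUC computed on predicted probabilities and AUC computed on binary classifications can be illustrated with a minimal sketch. The labels and scores below are invented for illustration and are not data from the thesis; the `auc` helper is a hypothetical pairwise (Mann-Whitney) implementation rather than the code used in the study, and the 0.5 threshold is an assumption.

```python
def auc(labels, scores):
    """Pairwise AUC: the probability that a randomly chosen positive
    example gets a higher score than a randomly chosen negative one.
    Tied scores count as half a win."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Made-up example: 1 = preterm delivery, 0 = term delivery.
labels = [0, 0, 0, 0, 1, 1, 1, 1]
# Hypothetical predicted probabilities from some classifier.
probs = [0.1, 0.3, 0.4, 0.6, 0.35, 0.55, 0.7, 0.9]
# Binary classifications obtained with an assumed 0.5 threshold.
hard = [1 if p >= 0.5 else 0 for p in probs]

auc_prob = auc(labels, probs)
auc_hard = auc(labels, hard)
print(f"AUC on probabilities: {auc_prob}")
print(f"AUC on binary classifications: {auc_hard}")
```

On this toy data the probabilities give an AUC of 0.8125 and the thresholded labels give 0.75. Thresholding collapses the ranking into two groups, so every pair that falls on the same side of the threshold becomes a tie, which is one plausible reason why scoring the probabilities directly yields a higher AUC.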

Welcome!
Hedvig and Andreas