4/2, Aleksej Zelezniak, Chalmers: Uncovering genotype-phenotype relationships using artificial intelligence
Understanding the genetic regulatory code that governs gene expression is a primary, yet challenging aspiration in molecular biology that opens up possibilities to cure human diseases and solve biotechnology problems. I will present two our recent works (1,2). First, I will demonstrate how we applied deep learning on over 20,000 mRNA datasets to learn the genetic regulatory code controlling mRNA expression in 7 model organisms ranging from bacteria to human. There, we show that in all organisms, mRNA abundance can be predicted directly from the DNA sequence with high accuracy, demonstrating that up to 82% of the variation of gene expression levels is encoded in the gene regulatory structure. In second study, I will present a ProteinGAN, a specialised variant of the generative adversarial network that is able to ‘learn’ natural protein sequence diversity and enables the generation of functional protein sequences. We tested ProteinGAN experimentally showing that learns the evolutionary relationships of protein sequences directly from the complex multidimensional amino acid sequence space and creates new, highly diverse functional sequence variants with natural-like physical properties. ProteinGAN therefore demonstrates the potential of artificial intelligence to rapidly generate highly diverse novel functional proteins within the allowed biological constraints of the sequence space.
1) Zrimec J et al “Gene expression is encoded in all parts of a co-evolving interacting gene regulatory structure”, biorxiv, https://doi.org/10.1101/792531
2) Repecka D et al, "Expanding functional protein sequence space using generative adversarial networks”, biorxiv, https://doi.org/10.1101/789719
11/2, Henrik Imberg, Chalmers: Optimal sampling in unbiased active learning
Abstract: We study the statistical properties of weighted estimators in unbiased pool-based active learning where instances are sampled at random with unequal probabilities. For classification problems, the use of probabilistic uncertainty sampling has previously been suggested for such algorithms, motivated by the heuristic argument that this would target the most informative instances, and further by the assertion that this also would minimise the variance of the estimated (logarithmic) loss. We argue that probabilistic uncertainty sampling does, in fact, not reach any of these targets.
Considering a general family of parametric prediction models, we derive asymptotic expansions for the mean squared prediction error and for the variance of the total loss, and consequently present sampling schemes that minimise these quantities. We show that the resulting sampling schemes depend both on label uncertainty and on the influence on model fitting through the location of data points in the feature space, and have a close connection to statistical leverage.
The proposed active learning algorithm is evaluated on a number of datasets, and we demonstrate better predictive performance than competing methods on all benchmark datasets. In contrast, deterministic uncertainty sampling always performed worse than simple random sampling, as did probabilistic uncertainty sampling in one of the examples.
10/3, Nikolaos Kourentzes, University of Skövde: Predicting with hierarchies
Abstract: Predictive problems often exhibit some hierarchical structure. For instance, in supply chains, demand over different products aggregates to the total demand per store, and demand across stores aggregates to the total demand over a region and so on. Several applications can be seen in a hierarchical context. Modelling the different levels of the hierarchy can provide us with additional information that can improve the quality of predictions across the whole hierarchy, enriching supported decisions and insights. In this talk we present the mechanics of hierarchical modelling and proceed to discuss recent innovations. We look in some detail the cases of cross-sectional and temporal hierarchies that are applicable to a wide range of time series problems and the newer possibilistic hierarchies that address classification and clustering problems. We provide the theoretical arguments favouring hierarchical approaches and use a number of empirical cases to demonstrate their flexibility and efficacy.