4/2, Aleksej Zelezniak, Chalmers: Uncovering genotype-phenotype relationships using artificial intelligence
Understanding the genetic regulatory code that governs gene expression is a primary, yet challenging aspiration in molecular biology that opens up possibilities to cure human diseases and solve biotechnology problems. I will present two of our recent works (1,2). First, I will demonstrate how we applied deep learning to over 20,000 mRNA datasets to learn the genetic regulatory code controlling mRNA expression in 7 model organisms ranging from bacteria to human. There, we show that in all organisms, mRNA abundance can be predicted directly from the DNA sequence with high accuracy, demonstrating that up to 82% of the variation in gene expression levels is encoded in the gene regulatory structure. In the second study, I will present ProteinGAN, a specialised variant of the generative adversarial network that is able to ‘learn’ natural protein sequence diversity and enables the generation of functional protein sequences. We tested ProteinGAN experimentally, showing that it learns the evolutionary relationships of protein sequences directly from the complex multidimensional amino acid sequence space and creates new, highly diverse functional sequence variants with natural-like physical properties. ProteinGAN therefore demonstrates the potential of artificial intelligence to rapidly generate highly diverse novel functional proteins within the allowed biological constraints of the sequence space.
1) Zrimec J et al, “Gene expression is encoded in all parts of a co-evolving interacting gene regulatory structure”, bioRxiv, https://doi.org/10.1101/792531
2) Repecka D et al, “Expanding functional protein sequence space using generative adversarial networks”, bioRxiv, https://doi.org/10.1101/789719
11/2, Henrik Imberg, Chalmers: Optimal sampling in unbiased active learning
Abstract: We study the statistical properties of weighted estimators in unbiased pool-based active learning where instances are sampled at random with unequal probabilities. For classification problems, the use of probabilistic uncertainty sampling has previously been suggested for such algorithms, motivated by the heuristic argument that this would target the most informative instances, and further by the assertion that this would also minimise the variance of the estimated (logarithmic) loss. We argue that probabilistic uncertainty sampling in fact achieves neither of these targets.
Considering a general family of parametric prediction models, we derive asymptotic expansions for the mean squared prediction error and for the variance of the total loss, and consequently present sampling schemes that minimise these quantities. We show that the resulting sampling schemes depend both on label uncertainty and on the influence on model fitting through the location of data points in the feature space, and have a close connection to statistical leverage.
The proposed active learning algorithm is evaluated on a number of datasets, and we demonstrate better predictive performance than competing methods on all benchmark datasets. In contrast, deterministic uncertainty sampling always performed worse than simple random sampling, as did probabilistic uncertainty sampling in one of the examples.
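The role of the weights in such unbiased estimators can be illustrated with a minimal sketch. It assumes a Poisson sampling design (each instance is included independently with its own probability) and inverse-probability weights; the abstract does not state the design used in the talk, and the losses and probabilities below are made up. Enumerating every possible sample of a tiny pool shows that the weighted estimate of the total loss has the true total as its expectation, whatever the probabilities are.

```python
import itertools


def weighted_total_loss(losses, probs, included):
    """Inverse-probability-weighted estimate of the pool's total loss."""
    return sum(l / p for l, p, inc in zip(losses, probs, included) if inc)


def exact_expectation(losses, probs):
    """Average the estimator over every possible sample (small pools only)."""
    expectation = 0.0
    for included in itertools.product([False, True], repeat=len(losses)):
        # Probability of this particular sample under independent inclusion.
        sample_prob = 1.0
        for inc, p in zip(included, probs):
            sample_prob *= p if inc else 1.0 - p
        expectation += sample_prob * weighted_total_loss(losses, probs, included)
    return expectation


# Hypothetical per-instance losses and sampling probabilities (assumptions).
losses = [0.1, 0.7, 0.3, 1.2]
probs = [0.2, 0.9, 0.5, 0.6]

print("true total loss:     ", sum(losses))
print("expected estimate:   ", exact_expectation(losses, probs))
```

The probabilities affect only the variance of the estimate, not its expectation, which is why the choice of sampling scheme can be treated as a variance-minimisation problem.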
18/2, Ottmar Cronie, Department of Public Health and Community Medicine, University of Gothenburg, and Department of Mathematics and Mathematical Statistics, Umeå University: Resample-smoothing and its application to Voronoi estimators
Adaptive non-parametric estimators of point process intensity functions tend to have the drawback that they under-smooth in regions where the density of the observed point pattern is high, and over-smooth where the point density is low. Voronoi estimators, which are one such example, set the intensity estimate at a given location equal to the reciprocal of the size of the Voronoi/Dirichlet cell containing that location. To remedy the over-/under-smoothing issue, we introduce an additional smoothing operation, based on resampling the point pattern/process by independent random thinning, which we refer to as “resample-smoothing”, and apply it to the Voronoi estimator. In addition, we study statistical properties such as unbiasedness and variance, and propose a rule-of-thumb and a data-driven cross-validation approach to choose the amount of smoothing to apply. Through a simulation study we show that our resample-smoothing technique improves the estimation substantially and that it generally performs better than single-bandwidth kernel estimation (in combination with the state of the art in bandwidth selection). We finally apply our proposed intensity estimation scheme to real data.
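A one-dimensional sketch may help fix ideas. On an interval, the Voronoi cell of a point runs between the midpoints to its neighbours, so the Voronoi estimate at x is one over the length of the cell containing x. The resample-smoothed version below is one reading of the description above, under stated assumptions: each point is retained independently with probability p, the Voronoi estimate on the thinned pattern is rescaled by 1/p to compensate for the thinning, and the result is averaged over m thinnings. The rescaling, the function names and the parameters p and m are illustrative assumptions, not taken from the talk.

```python
import random


def voronoi_intensity_1d(points, x, lo, hi):
    """Reciprocal of the length of the 1-D Voronoi cell (clipped to [lo, hi]) containing x."""
    pts = sorted(points)
    if not pts:
        return 0.0  # no points, so no cell: estimate zero intensity
    # Index of the point nearest to x; x lies in that point's cell.
    i = min(range(len(pts)), key=lambda k: abs(pts[k] - x))
    left = lo if i == 0 else 0.5 * (pts[i - 1] + pts[i])
    right = hi if i == len(pts) - 1 else 0.5 * (pts[i] + pts[i + 1])
    return 1.0 / (right - left)  # assumes distinct points, so the cell has positive length


def resample_smoothed_intensity(points, x, lo, hi, p, m, rng):
    """Average of rescaled Voronoi estimates over m independent p-thinnings (requires p > 0)."""
    total = 0.0
    for _ in range(m):
        thinned = [pt for pt in points if rng.random() < p]
        total += voronoi_intensity_1d(thinned, x, lo, hi) / p
    return total / m


rng = random.Random(1)
pattern = [rng.random() for _ in range(200)]  # hypothetical pattern on [0, 1]
print(voronoi_intensity_1d(pattern, 0.5, 0.0, 1.0))
print(resample_smoothed_intensity(pattern, 0.5, 0.0, 1.0, p=0.2, m=400, rng=rng))
```

With p = 1 no point is ever removed and the smoothed estimate reduces to the plain Voronoi estimate; smaller p gives larger cells per thinning and hence more smoothing.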
10/3, Nikolaos Kourentzes, University of Skövde: Predicting with hierarchies
Abstract: Predictive problems often exhibit some hierarchical structure. For instance, in supply chains, demand over different products aggregates to the total demand per store, and demand across stores aggregates to the total demand over a region and so on. Several applications can be seen in a hierarchical context. Modelling the different levels of the hierarchy can provide us with additional information that can improve the quality of predictions across the whole hierarchy, enriching supported decisions and insights. In this talk we present the mechanics of hierarchical modelling and proceed to discuss recent innovations. We look in some detail at the cases of cross-sectional and temporal hierarchies that are applicable to a wide range of time series problems and the newer possibilistic hierarchies that address classification and clustering problems. We provide the theoretical arguments favouring hierarchical approaches and use a number of empirical cases to demonstrate their flexibility and efficacy.
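As a small illustration of the store example, consider a two-level hierarchy in which one total is the sum of k product-level series. Independent forecasts for each series rarely add up, and reconciliation makes them coherent. The sketch below shows two standard options, chosen here for illustration and not necessarily those covered in the talk: bottom-up aggregation, and ordinary least squares reconciliation, which for this simple hierarchy amounts to projecting the forecasts onto the set where the total equals the sum of the parts. The numbers are made up.

```python
def bottom_up(bottom_forecasts):
    """Coherent total obtained by summing the bottom-level forecasts."""
    return sum(bottom_forecasts)


def reconcile_ols(total_forecast, bottom_forecasts):
    """Least-squares reconciliation for a two-level hierarchy (total = sum of bottoms).

    The incoherence gap is spread equally over all k + 1 series: the total
    moves down by gap / (k + 1) and each bottom series moves up by the same amount.
    """
    k = len(bottom_forecasts)
    gap = total_forecast - sum(bottom_forecasts)
    shift = gap / (k + 1)
    return total_forecast - shift, [b + shift for b in bottom_forecasts]


# Hypothetical base forecasts: store total and two products (assumptions).
store_forecast = 100.0
product_forecasts = [30.0, 50.0]

print("bottom-up total:", bottom_up(product_forecasts))
total, products = reconcile_ols(store_forecast, product_forecasts)
print("reconciled total:", total, "reconciled products:", products)
```

Bottom-up discards the information in the store-level forecast, whereas the least-squares version uses every level, which is the sense in which modelling several levels can improve predictions across the hierarchy.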
17/3, Mike Pereira, Chalmers: A matrix-free approach to deal with non-stationary Gaussian random fields in geostatistical applications
Abstract: Geostatistics is the branch of Statistics concerned with modelling spatial phenomena through probabilistic models. In particular, the spatial phenomenon is described by a (generally Gaussian) random field, and the observed data are considered as resulting from a particular realization of this random field. To facilitate the modeling and the subsequent geostatistical operations applied to the data, the random field is usually assumed to be stationary, meaning that the spatial structure of the data replicates across the domain of study.
However, when dealing with complex spatial datasets, this assumption becomes ill-adapted. Indeed, what about the case where the data clearly display a spatial structure that varies across the domain? Using more complex models (when it is possible) generally comes at the price of a drastic increase in operational costs (computational and storage-wise), rendering them hard to apply to large datasets.
In this work, we propose a solution to this problem, which relies on the definition of a class of random fields on Riemannian manifolds. These fields build on ongoing work that leverages a characterization of the random fields classically used in Geostatistics as solutions of stochastic partial differential equations. The discretization of these generalized random fields, undertaken using a finite element approach, then provides an explicit characterization that is leveraged to solve the scalability problem. Indeed, matrix-free algorithms, in the sense that they do not require building or storing any covariance (or precision) matrix, are derived to tackle, for instance, the simulation of large Gaussian vectors with given covariance properties, even in the non-stationary setting.
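To make the term "matrix-free" concrete, here is a generic sketch that is not the algorithm of the talk. It takes a toy precision operator Q = kappa2 * I + L on a 1-D grid, with L the path-graph Laplacian, and solves Q x = b by conjugate gradients using only a function that applies Q to a vector. The operator, the parameter kappa2 and all function names are illustrative assumptions; the point is only that no n-by-n matrix is ever built or stored.

```python
def apply_precision(v, kappa2=1.0):
    """Return Q v for Q = kappa2 * I + L (L: 1-D path-graph Laplacian) without forming Q."""
    n = len(v)
    out = []
    for i in range(n):
        acc = kappa2 * v[i]
        if i > 0:
            acc += v[i] - v[i - 1]
        if i < n - 1:
            acc += v[i] - v[i + 1]
        out.append(acc)
    return out


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def conjugate_gradient(apply_op, b, tol=1e-10, max_iter=1000):
    """Solve A x = b for a symmetric positive definite A given only v -> A v."""
    x = [0.0] * len(b)
    r = list(b)  # residual b - A x for the starting point x = 0
    p = list(r)
    rs = dot(r, r)
    for _ in range(max_iter):
        if rs ** 0.5 < tol:
            break
        Ap = apply_op(p)
        alpha = rs / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rs_new = dot(r, r)
        p = [ri + (rs_new / rs) * pi for ri, pi in zip(r, p)]
        rs = rs_new
    return x


b = [1.0] * 50
x = conjugate_gradient(apply_precision, b)
residual = max(abs(qi - bi) for qi, bi in zip(apply_precision(x), b))
print("largest residual entry:", residual)
```

Memory use grows linearly with the number of grid nodes, since only a handful of vectors are kept, whereas a dense covariance matrix would grow quadratically.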
24/3, Valeria Vitelli, Department of Biostatistics, University of Oslo: A novel Variational Bayes approach to Preference Learning with the Mallows rank model
Abstract: Ranking data are ubiquitous in the digitalized society: we rank items as citizens, consumers, patients, and we receive recommendations based on estimates of our preferences. We have previously proposed a Bayesian preference learning framework based on the Mallows rank model, capable of jointly estimating the consensus ranking of the items and providing personalized, accurate and diverse recommendations, also in the presence of heterogeneity and data inconsistencies. The Bayesian paradigm allows proper propagation of uncertainties, and provides probabilistic recommendations. The main bottleneck has proven to be computational: the current MCMC implementation, which manages up to thousands of users and hundreds of items, mixes slowly, and does not scale to meet the demands of realistic applications. Here we propose a Variational Bayes approach to performing posterior inference for the Mallows model, based on a pseudo-marginal approximating distribution that scans the items one by one: the novel inferential approach supports several data types, it has nice theoretical properties, and it shows a dramatic computational improvement in larger applications. We introduce this novel approach, together with empirical investigations of its performance, and a real case study on clicking data from the Norwegian Broadcasting Company (NRK).
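For readers unfamiliar with the Mallows rank model, the following minimal sketch shows its form on a toy example. It assumes the Kendall distance and the scaling exp(-alpha / n * d(r, rho)), both chosen here for illustration (other distances between rankings are possible), and it normalises by brute-force enumeration of all n! rankings, which is only feasible for very small n. That cost is one way to see why inference for this model is computationally demanding.

```python
import itertools
import math


def kendall_distance(r, rho):
    """Number of item pairs that the rankings r and rho order differently."""
    n = len(r)
    return sum(
        1
        for i in range(n)
        for j in range(i + 1, n)
        if (r[i] - r[j]) * (rho[i] - rho[j]) < 0
    )


def mallows_pmf(rho, alpha):
    """Probability of every ranking under a Mallows model centred at rho.

    rho[i] is the consensus rank of item i; alpha >= 0 controls how tightly
    rankings concentrate around rho (alpha = 0 gives the uniform distribution).
    """
    n = len(rho)
    rankings = list(itertools.permutations(range(1, n + 1)))
    weights = [math.exp(-alpha / n * kendall_distance(r, rho)) for r in rankings]
    z = sum(weights)  # normalising constant, by enumeration
    return {r: w / z for r, w in zip(rankings, weights)}


consensus = (1, 2, 3, 4)  # hypothetical consensus ranking of four items
pmf = mallows_pmf(consensus, alpha=3.0)
for ranking, prob in sorted(pmf.items(), key=lambda kv: -kv[1])[:3]:
    print(ranking, round(prob, 4))
```

The consensus ranking is the mode of the distribution, and probability decays with the number of pairwise disagreements with it.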