4/2, Aleksej Zelezniak, Chalmers: Uncovering genotype-phenotype relationships using artificial intelligence
Understanding the genetic regulatory code that governs gene expression is a primary, yet challenging aspiration in molecular biology that opens up possibilities to cure human diseases and solve biotechnology problems. I will present two of our recent works (1,2). First, I will demonstrate how we applied deep learning to over 20,000 mRNA datasets to learn the genetic regulatory code controlling mRNA expression in 7 model organisms ranging from bacteria to human. There, we show that in all organisms, mRNA abundance can be predicted directly from the DNA sequence with high accuracy, demonstrating that up to 82% of the variation of gene expression levels is encoded in the gene regulatory structure. In the second study, I will present ProteinGAN, a specialised variant of the generative adversarial network that is able to ‘learn’ natural protein sequence diversity and enables the generation of functional protein sequences. We tested ProteinGAN experimentally, showing that it learns the evolutionary relationships of protein sequences directly from the complex multidimensional amino acid sequence space and creates new, highly diverse functional sequence variants with natural-like physical properties. ProteinGAN therefore demonstrates the potential of artificial intelligence to rapidly generate highly diverse novel functional proteins within the allowed biological constraints of the sequence space.
1) Zrimec J et al, “Gene expression is encoded in all parts of a co-evolving interacting gene regulatory structure”, bioRxiv, https://doi.org/10.1101/792531
2) Repecka D et al, “Expanding functional protein sequence space using generative adversarial networks”, bioRxiv, https://doi.org/10.1101/789719
11/2, Henrik Imberg, Chalmers: Optimal sampling in unbiased active learning
Abstract: We study the statistical properties of weighted estimators in unbiased pool-based active learning where instances are sampled at random with unequal probabilities. For classification problems, the use of probabilistic uncertainty sampling has previously been suggested for such algorithms, motivated by the heuristic argument that this would target the most informative instances, and further by the assertion that this would also minimise the variance of the estimated (logarithmic) loss. We argue that probabilistic uncertainty sampling does not, in fact, reach either of these targets.
Considering a general family of parametric prediction models, we derive asymptotic expansions for the mean squared prediction error and for the variance of the total loss, and consequently present sampling schemes that minimise these quantities. We show that the resulting sampling schemes depend both on label uncertainty and on the influence on model fitting through the location of data points in the feature space, and have a close connection to statistical leverage.
The proposed active learning algorithm is evaluated on a number of datasets, and we demonstrate better predictive performance than competing methods on all benchmark datasets. In contrast, deterministic uncertainty sampling always performed worse than simple random sampling, as did probabilistic uncertainty sampling in one of the examples.
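As a minimal illustration of the weighted estimators discussed in the abstract above, the following Python sketch uses the textbook Hansen-Hurwitz estimator for with-replacement sampling with unequal probabilities. It is not the sampling scheme proposed in the talk; the function names and the toy loss values are assumptions made for illustration. The sketch shows that inverse-probability weighting keeps the estimate of the total loss unbiased for any strictly positive sampling probabilities, while the choice of probabilities changes the variance.

```python
import random


def hansen_hurwitz_total(losses, probs, n_draws, rng):
    """Estimate sum(losses) from n_draws with-replacement draws.

    Instance i is drawn with probability probs[i] and its loss is
    weighted by 1 / probs[i], which makes the estimate unbiased.
    """
    idx = rng.choices(range(len(losses)), weights=probs, k=n_draws)
    return sum(losses[i] / probs[i] for i in idx) / n_draws


def single_draw_variance(losses, probs):
    """Exact variance of the single-draw estimate losses[i] / probs[i]."""
    total = sum(losses)
    return sum(p * (l / p - total) ** 2 for l, p in zip(losses, probs))


# Hypothetical per-instance losses in a small pool.
pool_losses = [0.1, 0.7, 0.2, 1.5]
uniform = [0.25] * 4
proportional = [l / sum(pool_losses) for l in pool_losses]

print(single_draw_variance(pool_losses, uniform))       # positive
print(single_draw_variance(pool_losses, proportional))  # numerically zero
```

Sampling proportionally to the (in practice unknown) losses removes the variance entirely in this toy setting, which is why the design of the sampling probabilities matters.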
18/2, Ottmar Cronie, Department of Public Health and Community Medicine, University of Gothenburg, and Department of Mathematics and Mathematical Statistics, Umeå University: Resample-smoothing and its application to Voronoi estimators
Adaptive non-parametric estimators of point process intensity functions tend to have the drawback that they under-smooth in regions where the density of the observed point pattern is high, and over-smooth where the point density is low. Voronoi estimators are one such example: the intensity estimate at a given location is equal to the reciprocal of the size of the Voronoi/Dirichlet cell containing that location. To remedy the over-/under-smoothing issue, we introduce an additional smoothing operation, based on resampling the point pattern/process by independent random thinning, which we refer to as “resample-smoothing”, and apply it to the Voronoi estimator. In addition, we study statistical properties such as unbiasedness and variance, and propose a rule-of-thumb and a data-driven cross-validation approach to choose the amount of smoothing to apply. Through a simulation study we show that our resample-smoothing technique improves the estimation substantially and that it generally performs better than single-bandwidth kernel estimation (in combination with the state of the art in bandwidth selection). We finally apply our proposed intensity estimation scheme to real data.
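The construction described above can be sketched in one dimension, where a Voronoi cell is simply the interval between the midpoints to the neighbouring points. This is a simplified sketch under stated assumptions: the window is an interval, the points are distinct, and the smoothed estimate is taken to be the average over m independent thinnings (retention probability p) of the thinned Voronoi estimate divided by p. The function names and parameters are hypothetical.

```python
import bisect
import random


def voronoi_intensity_1d(points, u, lo=0.0, hi=1.0):
    """Voronoi intensity estimate at u: 1 / length of the cell containing u."""
    if not points:
        return 0.0  # an empty pattern gives a zero estimate
    pts = sorted(points)
    # Cell boundaries: window ends plus midpoints between neighbouring points.
    edges = [lo] + [(a + b) / 2 for a, b in zip(pts, pts[1:])] + [hi]
    k = bisect.bisect_right(edges, u) - 1
    k = min(max(k, 0), len(pts) - 1)  # keep u == hi inside the last cell
    return 1.0 / (edges[k + 1] - edges[k])


def resample_smoothed_intensity_1d(points, u, p, m, rng, lo=0.0, hi=1.0):
    """Average of m thinned Voronoi estimates, each rescaled by 1 / p."""
    total = 0.0
    for _ in range(m):
        # Independent thinning: keep each point with probability p.
        thinned = [x for x in points if rng.random() < p]
        total += voronoi_intensity_1d(thinned, u, lo, hi) / p
    return total / m


pattern = [0.10, 0.15, 0.20, 0.80]
rng = random.Random(1)
print(voronoi_intensity_1d(pattern, 0.4))
print(resample_smoothed_intensity_1d(pattern, 0.4, p=0.5, m=200, rng=rng))
```

With p = 1 no points are removed and the smoothed estimate coincides with the plain Voronoi estimate; smaller p gives more smoothing.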
10/3, Nikolaos Kourentzes, University of Skövde: Predicting with hierarchies
Abstract: Predictive problems often exhibit some hierarchical structure. For instance, in supply chains, demand over different products aggregates to the total demand per store, and demand across stores aggregates to the total demand over a region and so on. Several applications can be seen in a hierarchical context. Modelling the different levels of the hierarchy can provide us with additional information that can improve the quality of predictions across the whole hierarchy, enriching supported decisions and insights. In this talk we present the mechanics of hierarchical modelling and proceed to discuss recent innovations. We look in some detail at the cases of cross-sectional and temporal hierarchies, which are applicable to a wide range of time series problems, and the newer possibilistic hierarchies, which address classification and clustering problems. We provide the theoretical arguments favouring hierarchical approaches and use a number of empirical cases to demonstrate their flexibility and efficacy.
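To make the aggregation structure concrete, here is a minimal Python sketch of the simplest hierarchical approach, bottom-up reconciliation, for a two-store region. It is only one of several possible reconciliation schemes and is not claimed to be the method presented in the talk; the summing matrix, the forecast values, and the function names are illustrative assumptions.

```python
def bottom_up(summing_matrix, bottom_forecasts):
    """Build forecasts for every node by aggregating the bottom level."""
    return [
        sum(s * b for s, b in zip(row, bottom_forecasts))
        for row in summing_matrix
    ]


def is_coherent(forecasts, summing_matrix, n_bottom):
    """Check that every node equals the aggregate of the bottom-level nodes."""
    bottom = forecasts[-n_bottom:]
    return forecasts == bottom_up(summing_matrix, bottom)


# Rows: region total, store A, store B. Columns: store A, store B.
S = [
    [1, 1],
    [1, 0],
    [0, 1],
]

# Hypothetical independent base forecasts; the total does not match A + B.
base = [100, 40, 50]
reconciled = bottom_up(S, base[-2:])

print(is_coherent(base, S, 2))  # False
print(reconciled)               # [90, 40, 50]
```

Bottom-up discards the information in the upper-level forecast; schemes that combine all levels trade this simplicity for the additional information that the abstract refers to.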
17/3, Mike Pereira, Chalmers: A matrix-free approach to deal with non-stationary Gaussian random fields in geostatistical applications
Abstract: Geostatistics is the branch of statistics devoted to modelling spatial phenomena through probabilistic models. In particular, the spatial phenomenon is described by a (generally Gaussian) random field, and the observed data are considered as resulting from a particular realization of this random field. To facilitate the modeling and the subsequent geostatistical operations applied to the data, the random field is usually assumed to be stationary, meaning that the spatial structure of the data replicates across the domain of study.
However, when dealing with complex spatial datasets, this assumption becomes ill-adapted. Indeed, what about the case where the data clearly display a spatial structure that varies across the domain? Using more complex models (when it is possible) generally comes at the price of a drastic increase in operational costs (computational and storage-wise), rendering them hard to apply to large datasets.
In this work, we propose a solution to this problem, which relies on the definition of a class of random fields on Riemannian manifolds. These fields extend ongoing work that has been done to leverage a characterization of the random fields classically used in Geostatistics as solutions of stochastic partial differential equations. The discretization of these generalized random fields, undertaken using a finite element approach, then provides an explicit characterization that is leveraged to solve the scalability problem. Indeed, matrix-free algorithms, in the sense that they do not require building or storing any covariance (or precision) matrix, are derived to tackle, for instance, the simulation of large Gaussian vectors with given covariance properties, even in the non-stationary setting.
24/3, Valeria Vitelli, Department of Biostatistics, University of Oslo: A novel Variational Bayes approach to Preference Learning with the Mallows rank model
Abstract: Ranking data are ubiquitous in the digitalized society: we rank items as citizens, consumers, patients, and we receive recommendations based on estimates of our preferences. We have previously proposed a Bayesian preference learning framework based on the Mallows rank model, capable of jointly estimating the items' consensus ranking, and of providing personalized accurate and diverse recommendations, also in the presence of heterogeneity and data inconsistencies. The Bayesian paradigm allows proper propagation of uncertainties, and provides probabilistic recommendations. The main bottleneck has proven to be computational: the current MCMC implementation, which manages up to thousands of users and hundreds of items, mixes slowly, and does not scale to meet the demands of realistic applications. Here we propose a Variational Bayes approach to performing posterior inference for the Mallows model, based on a pseudo-marginal approximating distribution that scans the items one by one: the novel inferential approach supports several data types, it has nice theoretical properties, and it shows a dramatic computational improvement in larger applications. We introduce this novel approach, together with empirical investigations of its performance, and a real case study on clicking data from the Norwegian Broadcasting Company (NRK).
21/4, Rasmus Pedersen, Roskilde Universitet: Modelling Hematopoietic Stem Cells and their Interaction with the Bone Marrow Micro-Environment
Blood cell formation (hematopoiesis) is a process maintained by the hematopoietic stem cells (HSCs) from within the bone marrow. HSCs give rise to progenitors which in turn produce the vast amount of cells located in the blood. HSCs are capable of self-renewal, and hence a sustained production of cells is possible, without exhaustion of the HSC pool.
Mutations in the HSC genome give rise to a wide range of hematologic malignancies, such as acute myeloid leukemia (AML) or the myeloproliferative neoplasms (MPNs). As HSCs are difficult to investigate experimentally, mathematical modelling of HSC and blood dynamics is a useful tool in the battle against blood cancers.
We have developed a mechanism-based mathematical model of the HSCs and their interaction with the bone marrow micro-environment. Specifically, the model directly considers the reversible binding of HSCs to their specific niches, often omitted in other modelling works. In my talk, I will discuss some of the aspects of developing the model and the immediate results that arise from the model, which include an expression of HSC fitness and insight into outcomes of bone marrow transplantation. To relate the HSC dynamics to observable measures such as blood-cell count, the model is reduced and incorporated into a separate model of the blood system. The combined model is compared with a vast data-set of blood measurements of MPN-diagnosed patients during treatment. By including the biological effects of the treatment used in the model, patient trajectories can be modelled to a satisfactory degree. Such insights from the model show great promise for future predictions of patient responses and design of optimal treatment schemes.
28/4, András Bálint, Chalmers: Mathematical methods in the analysis of traffic safety data
Abstract: This talk describes real-world examples related to traffic safety research in which mathematical methods or models have been applied or should be applied. Relevant data sources and current research challenges as well as potential approaches will be discussed. One example is presented in greater detail, namely the analysis of multitasking additional-to-driving (MAD) under various conditions. Results from an analysis of the Second Strategic Highway Research Program Naturalistic Driving Study (SHRP2 NDS) show that the number of secondary tasks that the drivers were engaged in differs substantially for different event types. A graphical representation is presented that allows mapping task prevalence and co-occurrence within an event type as well as a comparison between different event types. Odds ratios computed in the study indicate an elevated risk for all safety-critical events associated with MAD compared to no task engagement, with the greatest increase in the risk of rear-end striking crashes. The results are similar irrespective of whether secondary tasks are defined as in SHRP2 or in terms of general task groups. The results confirm that the reduction of driving performance from MAD observed in simulator studies is manifested in real-world crashes as well.
23/6, Chris Drovandi, Queensland University of Technology, Australia: Accelerating sequential Monte Carlo with surrogate likelihoods
Abstract: Delayed-acceptance is a technique for reducing computational effort for Bayesian models with expensive likelihoods. Using delayed-acceptance kernels in MCMC can reduce the number of expensive likelihood evaluations required to approximate a posterior expectation to a given accuracy. It uses a surrogate, or approximate, likelihood to avoid evaluation of the expensive likelihood when possible. Importantly, delayed-acceptance kernels preserve the intended target distribution of the Markov chain, when viewed as an extension of a Metropolis-Hastings kernel. Within the sequential Monte Carlo (SMC) framework, we utilise the history of the sampler to adaptively tune the surrogate likelihood to yield better approximations of the expensive likelihood, and use a surrogate-first annealing schedule to further increase computational efficiency. Moreover, we propose a framework for optimising computation time whilst avoiding particle degeneracy, which encapsulates existing strategies in the literature. Overall, we develop a novel algorithm for computationally efficient SMC with expensive likelihood functions. The method is applied to static Bayesian models, which we demonstrate on toy and real examples.
[This work is led by PhD student Joshua Bon (Queensland University of Technology) and is in collaboration with Professor Anthony Lee (University of Bristol)]
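The delayed-acceptance idea described in the abstract above can be sketched as a single two-stage Metropolis-Hastings step. This is a generic illustration with a symmetric Gaussian random-walk proposal, not the adaptive SMC algorithm of the talk; the function name, its arguments, and the returned flag are hypothetical. Stage 1 screens the proposal using only the cheap surrogate, and stage 2 corrects for the surrogate error so that the chain still targets the expensive density.

```python
import math
import random


def da_mh_step(x, log_target, log_surrogate, step, rng):
    """One delayed-acceptance step; returns (new_state, expensive_used)."""
    y = x + rng.gauss(0.0, step)  # symmetric proposal

    # Stage 1: accept/reject using the surrogate only.
    log_r1 = log_surrogate(y) - log_surrogate(x)
    u1 = 1.0 - rng.random()  # uniform on (0, 1], avoids log(0)
    if math.log(u1) > min(0.0, log_r1):
        return x, False  # rejected early, expensive target never evaluated

    # Stage 2: correct with the expensive target, dividing out stage 1.
    # (A real implementation would cache log_target(x) between steps.)
    log_r2 = (log_target(y) - log_target(x)) - log_r1
    u2 = 1.0 - rng.random()
    if math.log(u2) <= min(0.0, log_r2):
        return y, True
    return x, True
```

A typical use is to loop over `da_mh_step`, counting how often the second return value is `False`: each such step is a proposal that was discarded without paying for an expensive likelihood evaluation.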
13/10, Raphaël Huser, KAUST: Estimating high-resolution Red Sea surface temperature hotspots, using a low-rank semiparametric spatial model
Abstract: In this work, we estimate extreme sea surface temperature (SST) hotspots, i.e., high threshold exceedance regions, for the Red Sea, a vital region of high biodiversity. We analyze high-resolution satellite-derived SST data comprising daily measurements at 16703 grid cells across the Red Sea over the period 1985–2015. We propose a semiparametric Bayesian spatial mixed-effects linear model with a flexible mean structure to capture spatially-varying trend and seasonality, while the residual spatial variability is modelled through a Dirichlet process mixture (DPM) of low-rank spatial Student-t processes (LTPs). By specifying cluster-specific parameters for each LTP mixture component, the bulk of the SST residuals influence tail inference and hotspot estimation only moderately. Our proposed model has a nonstationary mean, covariance and tail dependence, and posterior inference can be drawn efficiently through Gibbs sampling. In our application, we show that the proposed method outperforms some natural parametric and semiparametric alternatives. Moreover, we show how hotspots can be identified and we estimate extreme SST hotspots for the whole Red Sea, projected for the year 2100. The estimated 95% credible region for joint high threshold exceedances includes large areas covering major endangered coral reefs in the southern Red Sea.
20/10, Luigi Acerbi, University of Helsinki: Practical sample-efficient Bayesian inference for models with and without likelihoods
Abstract: Bayesian inference in applied fields of science and engineering can be challenging because in the best-case scenario the likelihood is a black-box (e.g., mildly-to-very expensive, no gradients) and more often than not it is not even available, with the researcher being only able to simulate data from the model. In this talk, I review a recent sample-efficient framework for approximate Bayesian inference, Variational Bayesian Monte Carlo (VBMC), which uses only a limited number of potentially noisy log-likelihood evaluations. VBMC produces both a nonparametric approximation of the posterior distribution and an approximate lower bound of the model evidence, useful for model selection. VBMC combines well with a technique we (re)introduced, inverse binomial sampling (IBS), that obtains unbiased and normally-distributed estimates of the log-likelihood via simulation. VBMC has been tested on many real problems (up to 10 dimensions) from computational and cognitive neuroscience, with and without likelihoods. Our method performed consistently well in reconstructing the ground-truth posterior and model evidence with a limited budget of evaluations, showing promise as a general tool for black-box, sample-efficient approximate inference — with exciting potential extensions to more complex cases.
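The inverse binomial sampling estimator mentioned above can be sketched for a single trial as follows. This is a minimal illustration, assuming a simulator that reports whether a simulated response matches the observed one; the function name and the Bernoulli toy model are hypothetical. The estimator draws from the model until the first match at draw K and returns minus the sum of 1/k for k = 1, ..., K-1, which is an unbiased estimate of the log-probability of a match.

```python
import random


def ibs_log_likelihood(simulate_matches, rng):
    """Unbiased single-trial estimate of log P(simulated == observed).

    simulate_matches(rng) must return True when one simulated response
    equals the observed response, and must be able to do so eventually.
    """
    estimate = 0.0
    k = 1
    while not simulate_matches(rng):
        estimate -= 1.0 / k  # each miss before the first hit adds -1/k
        k += 1
    return estimate


# Toy model: the observed response is reproduced with probability 0.3.
rng = random.Random(0)
draws = [ibs_log_likelihood(lambda r: r.random() < 0.3, rng) for _ in range(20000)]
print(sum(draws) / len(draws))  # should be close to log(0.3), about -1.204
```

The number of simulations per trial is random and grows as the match probability shrinks, so practical use usually needs some cap or budget on the number of draws.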
10/11, Peter Jagers and Sergey Zuyev, Chalmers: Galton was right: all populations die out
Abstract. The frequent extinction of populations (families, species, ...) constitutes a classical scientific problem. In 1875 Francis Galton and Henry Watson introduced the Galton-Watson process and claimed that they had proved the extinction of all families within its framework. Their proof contained a now famous, but for 50 years undetected, gap: for branching-type processes (i.e. populations where individuals reproduce in an i.i.d. fashion) the real truth is a dichotomy between extinction and exponential growth. However, as we proved recently (J. Math. Biol. 2020), if populations turn subcritical whenever they exceed a carrying capacity, then they must die out, under natural (almost self-evident) conditions. This, however, may take a long time if the carrying capacity is large.
Mathematically, the proof relies upon local supermartingales and Doob’s maximal inequality.
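A toy simulation can illustrate the claim. The sketch below is not the general setting of the paper: it assumes binary splitting (each individual leaves 2 children with probability q and none otherwise), with q = 0.6 (supercritical) while the population is at or below the carrying capacity and q = 0.4 (subcritical) above it. The function name and all numerical values are assumptions chosen for illustration.

```python
import random


def generations_until_extinction(capacity, start, rng, max_generations=10_000):
    """Generation at which the population hits zero, or None if it survives."""
    n = start
    for generation in range(max_generations):
        if n == 0:
            return generation
        # Supercritical below the carrying capacity, subcritical above it.
        q = 0.6 if n <= capacity else 0.4
        n = 2 * sum(1 for _ in range(n) if rng.random() < q)
    return max_generations if n == 0 else None


times = [
    generations_until_extinction(capacity=5, start=3, rng=random.Random(seed))
    for seed in range(30)
]
print(times)
```

Raising `capacity` makes extinction times grow quickly, in line with the remark that dying out may take a long time when the carrying capacity is large.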