18/1, Emilia Pompe, University of Oxford: Robust inference using Posterior Bootstrap 

Abstract: Bayesian inference is known to provide misleading uncertainty estimation when the considered model is misspecified. This talk will explore various alternatives to standard Bayesian inference under model misspecification, based on extensions of the Weighted Likelihood Bootstrap (Newton & Raftery, 1994).
In the first part, we will talk about Posterior Bootstrap, which is an extension of Weighted Likelihood Bootstrap allowing the user to properly incorporate the prior. We will see how Edgeworth expansions can be used to understand the impact of the prior and guide the choice of hyperparameters.
Next we will talk about Bayesian models built of multiple components having shared parameters. Misspecification of any part of the model might propagate to all other parts and lead to unsatisfactory results. Cut distributions have been proposed as a remedy, where the information is prevented from flowing along certain directions. We will show that asymptotically cut distributions do not have the correct frequentist coverage for the associated credible regions. We will then discuss our new alternative methodology, based on the Posterior Bootstrap, which delivers credible regions with the nominal frequentist asymptotic coverage.
The talk is based on the following papers:
https://arxiv.org/abs/2110.11149 (joint work with Pierre Jacob)
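For orientation, the Weighted Likelihood Bootstrap that the talk builds on can be sketched as follows. This is a minimal illustration only, assuming a normal location model with known unit variance, so that the weighted maximum likelihood estimate is simply a weighted mean. The function name and the toy data are hypothetical, and the prior-incorporating Posterior Bootstrap discussed in the talk is not shown.

```python
import random

def weighted_likelihood_bootstrap(data, n_draws=2000, seed=0):
    """Draw approximate posterior samples of a normal mean (unit variance assumed)."""
    rng = random.Random(seed)
    draws = []
    for _ in range(n_draws):
        # Independent Exp(1) variables; after normalisation they are
        # Dirichlet(1, ..., 1) weights on the observations.
        weights = [rng.expovariate(1.0) for _ in data]
        total = sum(weights)
        # For this model the maximiser of the weighted log-likelihood
        # is the weighted average of the observations.
        draws.append(sum(w * x for w, x in zip(weights, data)) / total)
    return draws

data = [1.2, 0.7, 2.1, 1.5, 0.9, 1.8]  # hypothetical toy data
draws = sorted(weighted_likelihood_bootstrap(data))
lower = draws[int(0.025 * len(draws))]
upper = draws[int(0.975 * len(draws))]
print(f"approximate 95% interval for the mean: ({lower:.2f}, {upper:.2f})")
```

Each draw re-solves the estimation problem under fresh random weights, so the spread of the draws reflects the variability of the data rather than the assumed model variance.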


25/1, Tomáš Mrkvička, University of South Bohemia: What can be proven by functional test statistics: The R package GET

Abstract: Statistical testing is one of the major tools in biostatistics. Usually, the test statistic is one-dimensional, gathering the information from time or space into a single number. Performing the statistical inference with functional test statistics (such as the slope of warming measured every day of the year, spatial correlation measured at certain distances, or the F statistic of the GLM measured in every voxel of the brain) can reveal more information than a single aggregated test statistic. On the other hand, using functional test statistics brings difficulties with the model assumptions on the test statistic, such as normality or homogeneity. Therefore, we have introduced a powerful, nonparametric statistical inference method with functional test statistics in our R package GET. The methods also provide graphical inference that is equivalent to the formal inference and allows for easy interpretation of the results. The package provides inference for functional GLM with one-, two- or three-dimensional functions, goodness-of-fit tests based on multiple functional test statistics, graphical comparison of several distribution functions, graphical functional clustering, a graphical test of dependence of two variables, and functional central region detection together with a functional box plot. It also allows for composite hypothesis testing in goodness-of-fit testing, i.e., when the model parameters must be estimated. All the procedures satisfy family-wise error rate control. We are currently working on false discovery rate control to detect all hypotheses (points of the domain of the functional test statistic) that should be rejected. The provided procedures are based on the ordering of functions according to the extreme rank length functional depth, which allows for intrinsic graphical interpretation. The intrinsic graphical interpretation means that if the functional test statistic lies outside the constructed envelopes in at least one point, the null hypothesis is rejected. The envelopes thereby also identify the domain of rejection.

- Myllymäki M., Mrkvička T. (2019). GET: Global envelopes in R. http://arxiv.org/abs/1911.06583
- Myllymäki M., Mrkvička T., Seijo H., Grabarnik P., Hahn U. (2017). Global envelope tests for spatial processes, JRSS Series B 79/2, 381-404.
- Mrkvička T., Roskovec T., Rost M. (2021). A nonparametric graphical tests of significance in functional GLM, Methodol Comput Appl Probab. 23, 593-612.
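
The idea of ordering whole functions by rank can be illustrated with a small sketch. It is not the GET package and does not use the extreme rank length ordering mentioned in the abstract; it assumes the simpler extreme rank measure (the smallest two-sided pointwise rank of a curve) and reports the resulting p-interval. All function names and the simulated curves are hypothetical.

```python
import random

def extreme_ranks(curves):
    """curves[0] is the observed curve, the rest are simulated under the null.
    Returns, for every curve, its smallest two-sided pointwise rank."""
    n, d = len(curves), len(curves[0])
    ranks = [n] * n
    for r in range(d):
        order = sorted(range(n), key=lambda i: curves[i][r])
        for pos, i in enumerate(order):
            low = pos + 1    # rank counted from the smallest value
            high = n - pos   # rank counted from the largest value
            ranks[i] = min(ranks[i], low, high)
    return ranks

def rank_test_p_interval(curves):
    """Liberal and conservative p-values of the extreme rank test."""
    ranks = extreme_ranks(curves)
    n = len(ranks)
    liberal = sum(1 for r in ranks if r < ranks[0]) / n
    conservative = sum(1 for r in ranks if r <= ranks[0]) / n
    return liberal, conservative

rng = random.Random(0)
simulated = [[rng.gauss(0, 1) for _ in range(20)] for _ in range(999)]
observed = [rng.gauss(0, 1) for _ in range(20)]
observed[7] += 10.0  # hypothetical departure from the null at one point
liberal, conservative = rank_test_p_interval([observed] + simulated)
print(liberal, conservative)
```

In this sketch the null hypothesis is rejected at level 0.05 when the conservative p-value is at most 0.05, which corresponds to the observed curve leaving the rank-based envelope in at least one point.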


8/2, Arno Solin, Aalto University: On the link between neural networks and stationary Gaussian process priors 

Abstract: Deep feedforward neural networks have become an essential component of modern machine learning. These models are known to reinforce hidden data biases, making them unreliable and difficult to interpret. In Bayesian deep learning, the interests are two-fold: encoding prior knowledge into models and performing probabilistic inference under the specified model. In this talk, the focus is on the former and we seek to build models that 'know what they do not know' by introducing inductive biases in the function space. This is done by studying the connection between random (untrained) networks and Gaussian process priors. We will focus on stationary models, which act as a proxy for capturing sensitivity. Stationarity indicates translation-invariance, meaning that the joint probability distribution does not change when the inputs are shifted. This seemingly naive assumption has strong consequences in the sense that it induces conservative behaviour across the input domain, both in-distribution and outside the observed data.

This talk relates to two papers (https://arxiv.org/abs/2010.09494 and https://arxiv.org/abs/2110.13572) that were published in NeurIPS 2020 and 2021, respectively.
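
The translation-invariance property described in the abstract can be checked numerically on a covariance function. The sketch below is a generic illustration, not the construction from the two papers: it assumes a zero-mean Gaussian process, for which stationarity amounts to the covariance depending only on the difference of the inputs. The kernel functions and the test inputs are hypothetical.

```python
import math

def rbf_kernel(x, y, lengthscale=1.0):
    """Squared-exponential kernel; depends on the inputs only through x - y."""
    return math.exp(-0.5 * ((x - y) / lengthscale) ** 2)

def linear_kernel(x, y):
    """Dot-product kernel; non-stationary because it depends on absolute position."""
    return x * y

x, y, shift = 0.3, 1.1, 5.0
# Stationary: shifting both inputs leaves the covariance unchanged (up to rounding).
print(rbf_kernel(x, y), rbf_kernel(x + shift, y + shift))
# Non-stationary: the covariance changes under the same shift.
print(linear_kernel(x, y), linear_kernel(x + shift, y + shift))
```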


22/2, Paula Moraga, KAUST: Combined analysis of spatially misaligned data using Gaussian fields and the stochastic partial differential equation approach 

Abstract: Spatially misaligned data are becoming increasingly common due to advances in both data collection and management in a wide range of scientific disciplines including the epidemiological, ecological and environmental fields. Here, we present a Bayesian geostatistical model for fusion of data obtained at point and areal resolutions. The model assumes that underlying all observations there is a spatially continuous variable that can be modeled using a Gaussian random field process. The model is fitted using the integrated nested Laplace approximation (INLA) and the stochastic partial differential equation (SPDE) approaches. In the SPDE approach, a continuously indexed Gaussian random field is represented as a discretely indexed Gaussian Markov random field (GMRF) by means of a finite basis function defined on a triangulation of the region of study. In order to allow the combination of point and areal data, a new projection matrix for mapping the GMRF from the observation locations to the triangulation nodes is proposed which takes into account the types of data to be combined. The performance of the model is examined via simulation when it is fitted to point, areal, and point and areal data combined to predict several simulated surfaces that can appear in real settings. The model is also applied to predict the concentration of fine particulate matter (PM2.5) in Los Angeles and Ventura counties, USA, during 2011. The results show that the combination of point and areal data provides better predictions than if the method is applied to just one type of data, and this is consistent over both simulated and real data. We conclude the approach presented may be a helpful advance in the area of spatial statistics by providing a useful tool that is applicable in a wide range of situations where information at different spatial resolutions needs to be combined.


1/3, Maria Skoularidou, Cambridge University: Estimating the directed information and testing for causality

Abstract: The problem of estimating the directed information rate between two discrete processes {X_n} and {Y_n} via the plug-in (or maximum-likelihood) estimator is considered. When the joint process {(X_n, Y_n)} is a Markov chain of a given memory length, the plug-in estimator is shown to be asymptotically Gaussian and to converge at the optimal rate O(1/√n) under appropriate conditions; this is the first estimator that has been shown to achieve this rate. An important connection is drawn between the problem of estimating the directed information rate and that of performing a hypothesis test for the presence of causal influence between the two processes. Under fairly general conditions, the null hypothesis, which corresponds to the absence of causal influence, is equivalent to the requirement that the directed information rate be equal to zero. In that case, a finer result is established, showing that the plug-in converges at the faster rate O(1/n) and that it is asymptotically χ²-distributed. This is proved by showing that this estimator is equal to (a scalar multiple of) the classical likelihood ratio statistic for the above hypothesis test. Finally, it is noted that these results facilitate the design of an actual likelihood ratio test for the presence or absence of causal influence.
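
For readers unfamiliar with the quantity being estimated, the commonly used definition of directed information (due to Massey) and the corresponding rate and null hypothesis can be written as below. The notation, with X^i = (X_1, ..., X_i), is an assumption made here for illustration; the work behind the talk may use a slightly different convention.

```latex
I(X^n \to Y^n) = \sum_{i=1}^{n} I\left(X^i ; Y_i \mid Y^{i-1}\right),
\qquad
I(X \to Y) = \lim_{n \to \infty} \frac{1}{n}\, I(X^n \to Y^n),
\qquad
H_0 : I(X \to Y) = 0 .
```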


12/4, Radu Stoica, Université de Lorraine: Random structures and patterns in spatio-temporal data: probabilistic modelling and statistical inference

Abstract: The useful information carried by spatio-temporal data is often outlined by geometric structures and patterns. Filaments or clusters induced by galaxy positions in our Universe are one such example. Two situations are to be considered. First, the pattern of interest is hidden in the data set, hence the pattern should be detected. Second, the structure to be studied is observed, so it should be suitably characterized. Probabilistic modelling is one of the approaches that can provide answers to these questions. This is done by developing unified methodologies that simultaneously embrace three directions: modelling, simulation and inference. This talk presents the use of marked point processes applied to the detection and to the characterization of such structures. Practical examples are also shown.


19/4, Ottmar Cronie, Chalmers/GU: Point process learning

Abstract: Point processes are random sets which generalise the classical notion of a random (iid) sample by allowing i) the sample size to be random and/or ii) the sample points to be dependent. Therefore, point processes have become ubiquitous in the modeling of spatial and/or temporal event data, e.g. earthquakes and disease cases. In this talk, we present the first statistical learning framework for general point processes, which is based on a subtle combination of two new concepts: prediction errors and cross-validation for point processes. The general idea is to split a point process in two, through thinning, and estimate parameters by predicting one part using the other. By repeating this procedure, we implicitly induce a conditional repeated sampling scheme. The proposed approach allows us to introduce a variety of loss functions suitable not only for standard spatial statistical problems but also for general estimation settings, without imposing the iid assumption. Having discussed different properties of this new approach, we illustrate how it may be applied in different spatial statistical settings and, numerically, we show in (at least) one of these settings that it outperforms the state of the art.
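
The splitting step described in the abstract can be sketched as follows. This is only an illustration of independent thinning applied repeatedly to one point pattern; the prediction errors and loss functions of the actual framework are not implemented, and the function name, the retention probability and the toy pattern are all hypothetical.

```python
import random

def thin_split(points, p, rng):
    """Independent p-thinning: each point is retained with probability p.
    Returns the retained points and the removed points as two lists."""
    retained, removed = [], []
    for point in points:
        (retained if rng.random() < p else removed).append(point)
    return retained, removed

rng = random.Random(1)
# Hypothetical toy pattern: 200 uniform points on the unit square.
pattern = [(rng.random(), rng.random()) for _ in range(200)]

# Repeating the split yields several pairs of complementary sub-patterns,
# one of which could be used to predict the other.
splits = [thin_split(pattern, p=0.5, rng=rng) for _ in range(10)]
for part_a, part_b in splits[:3]:
    print(len(part_a), len(part_b))
```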


17/5, Marina Axelson-Fisk, Chalmers/GU: Comparative gene finding on highly diverse chromosome alleles 

Abstract: With the new sequencing technologies, the number of novel genome sequences has exploded. At the same time the genome projects are hampered by the lack of homologies to the novel genome, making it difficult both to properly train genome analysis tools and to assess the results. Moreover, the faster and cheaper sequencing methods come at the cost of data quality, and sequencing and annotation errors that make it into the genome databases are propagated to new genomes, making annotation an even bigger challenge. Methods that can improve the data quality are desperately needed.

Here we propose a novel use of comparative gene finding to improve the annotation in organisms with high nucleotide diversity. Comparative gene finding methods utilize the contrast in sequence similarity between functional and non-functional regions of evolutionarily related species. We show how this can be applied to chromosome pair sequences in diploid sequence data. We demonstrate the approach on the highly diverse bay barnacle Balanus improvisus using SLAM, a cross-species gene finder that simultaneously annotates and aligns two homologous sequences using Generalized Pair Hidden Markov Models.

Published: Fri 13 May 2022.