2015: Department of Statistics and Data Science

2015

Fall 2015

Time: Wednesday, 11 November, 11:00
Location: basement classroom, Department of Statistics, 2006 Sheridan Road, Evanston
Speaker: Douglas Downey, Department of Electrical Engineering and Computer Science, Northwestern University

Title: Large Topic Models: Efficient Inference and Applications

Abstract: Latent variable topic models such as Latent Dirichlet Allocation (LDA) can discover topics from text in an unsupervised fashion. However, scaling the models up to the many distinct topics exhibited in modern corpora is challenging. "Flat" topic models like LDA have difficulty modeling sparsely expressed topics, and richer hierarchical models become computationally intractable as the number of topics increases.

In this talk, I will introduce efficient methods for inferring large topic hierarchies. The approach is built upon the Sparse Backoff Tree (SBT), a new prior for latent topic distributions that organizes the latent topics as leaves in a tree. I will show how a document model based on SBTs can effectively infer accurate topic spaces of over a million topics. Experiments demonstrate that scaling to large topic spaces results in much more accurate models, and that SBT document models make use of large topic spaces more effectively than flat LDA. Lastly, I will describe how the models power Atlasify, a prototype exploratory search engine.

Time: Wednesday, 04 November, 15:30-17:00
Location: 617 Library Place, IPR Conference Room
Speaker: Professor Jennifer Hill, Department of Humanities and the Social Sciences, New York University

Title: Sensitivity Analysis with Relaxed Parametric Assumptions

Abstract: When estimating causal effects, unmeasured confounding and model misspecification are potential sources of bias. To mitigate these, sensitivity of a study to potential unmeasured confounders can be evaluated in terms of the strength of confounding needed to invalidate the result, while flexible model fitting techniques reduce parametric assumptions and thus minimize the chance of model misspecification. Hill and her colleagues propose a simulation-based, two-parameter sensitivity analysis strategy that uses Bayesian Additive Regression Trees (BART) to fit the model for the response. This results in an easily interpretable framework for testing for the impact of an unmeasured confounder that uses nonparametrics to limit the number of modeling assumptions. The researchers evaluate this approach in a large-scale simulation setting and with high blood pressure data taken from the Third National Health and Nutrition Examination Survey. The model is implemented as open-source software, integrated into the treatSens package for the R statistical programming language.

Time: Wednesday, 21 October at 11am
Location: basement classroom, Department of Statistics, 2006 Sheridan Road, Evanston
Speaker: Nathan VanHoudnos, Institute for Policy Research, Northwestern University

Title: Practical Meta-analysis of Incorrectly Analysed Studies

Abstract: In education research, group-randomized studies are commonly analyzed as if students were independent. This error is so common that the What Works Clearinghouse (WWC), which is the official evidence synthesis effort of the US Department of Education, has a policy of attempting to correct the significance tests from these studies before including them in their systematic reviews of educational interventions. A recent scrape of the WWC website revealed that of all the systematic reviews published by the WWC, 2 out of 5 rely on the results of corrected studies and 1 out of 5 rely entirely on corrected studies. Understanding the properties of the WWC correction strategy, therefore, is important for understanding the policy implications of the recommendations the WWC makes.

This talk summarizes recent research by VanHoudnos and Greenhouse (2015) that shows that the current WWC correction strategy performs poorly in practical situations, and presents a new method to meta-analyze a mixture of correctly and incorrectly analyzed studies. This new method relies on the correction of effect size estimators instead of significance tests. The effect size correction is such that (i) the corrected point estimators are unbiased and consistent, and (ii) the corrected standard errors accurately reflect the larger sampling variability implied by group-random assignment. We show that an evidence weighted meta-analysis that uses a mixture of correct and corrected effect sizes outperforms the current WWC procedure.
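As a rough illustration of why group-random assignment implies larger sampling variability, the sketch below uses the textbook design effect 1 + (m - 1) * ICC for equal-sized clusters. It is not the effect size correction of VanHoudnos and Greenhouse (2015), nor the WWC procedure; the function names and the example values (cluster size 25, intraclass correlation 0.2) are assumptions chosen for illustration.

```python
import math


def design_effect(cluster_size, icc):
    """Textbook variance inflation factor for equal-sized clusters."""
    return 1 + (cluster_size - 1) * icc


def inflate_standard_error(naive_se, cluster_size, icc):
    """Scale a standard error computed as if students were independent."""
    return naive_se * math.sqrt(design_effect(cluster_size, icc))


# Hypothetical study: classrooms of 25 students, intraclass correlation 0.2.
deff = design_effect(25, 0.2)
corrected_se = inflate_standard_error(0.10, 25, 0.2)
```

Under these assumed values the design effect is 5.8, so a standard error computed as if students were independent would understate the sampling variability by a factor of roughly 2.4.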

Time: Wednesday, 07 October at 11am
Location: basement classroom, Department of Statistics, 2006 Sheridan Road, Evanston
Speaker: Yuguang (James) Ban, Department of Statistics, Northwestern University

Title: Regularized estimation of the basis covariance on compositional data

Abstract: Large-scale microbial interactions can be studied by estimating their correlations on metagenomic data generated by next-generation sequencing. Metagenomic data are usually summarized in a compositional fashion (constant sum) due to varying sampling/sequencing depths from one sample to another. Analyzing compositional data using conventional correlation methods has been shown to be prone to bias that leads to spurious correlations. Data measured as proportions (summing to one) are also common in other areas of science such as geoscience and economics.

We consider that proportion data are formed by unobserved absolute basis abundance, and the covariance of the basis abundance is linked to the variance of the proportion data through the additive log-ratio transformation. The transformation results in an underdetermined system preventing general approaches from estimating the covariance of unobservables. We propose a novel method, regularized estimation of the basis covariance based on compositional data (REBACCA), to identify nonzero covariance by finding sparse solutions to the underdetermined system. To be specific, we construct the system using log ratios of count data and solve the system using the l1-norm shrinkage method. Under a certain sparsity constraint, the underdetermined system can be solved with a unique solution. Our comprehensive simulation studies show that REBACCA achieves higher accuracy and runs considerably faster than the existing comparable methods.

(This talk is based on joint work with Hongmei Jiang and Lingling An.)
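The additive log-ratio transformation mentioned in the abstract can be sketched as follows. This is a minimal illustration, not the REBACCA method: the function names and the counts are hypothetical, and the last component is assumed to serve as the reference. It shows that log ratios computed from raw counts do not depend on sequencing depth and agree with those computed from proportions.

```python
import math


def additive_log_ratio(values):
    """Return log(values[i] / values[-1]) for every component but the last."""
    reference = values[-1]
    return [math.log(v / reference) for v in values[:-1]]


def to_proportions(counts):
    """Close the counts to a composition that sums to one."""
    total = sum(counts)
    return [c / total for c in counts]


# Hypothetical counts for three taxa at two sequencing depths.
shallow = [20, 30, 50]
deep = [200, 300, 500]

alr_shallow = additive_log_ratio(shallow)
alr_deep = additive_log_ratio(deep)
alr_props = additive_log_ratio(to_proportions(shallow))
```

Because the sample total cancels in every ratio, the three results coincide up to floating-point rounding, which is why log ratios of counts can stand in for log ratios of proportions.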

Time: Wednesday, 30 September, 15:30-17:00
Location: 617 Library Place, IPR Conference Room
Speaker: Professor Xiao-Li Meng, Department of Statistics, Harvard University

Title: Is it a computing algorithm or a statistical procedure: Can you tell or do you care?

Abstract: For years, it irritated me whenever someone called the EM algorithm an estimation procedure. I’d argue passionately that EM merely is an algorithm designed to compute a maximum likelihood estimator (MLE), which can be computed by many other methods. Therefore the estimation principle/procedure is MLE, not EM, and it is dangerous to mix the two, for example by introducing modifications to EM steps without understanding how they would alter MLE as a statistical procedure. The reality, however, is that the line between computing algorithms and statistical procedures is becoming increasingly blurred. As a matter of fact, practitioners are now typically given a black box, which turns data into an “answer.” Is such a black box a computing algorithm or a statistical procedure? Does it matter that we know which is which? Should I continuously be irritated by the mixing of the two? This talk reports my contemplations of these questions that originated in my taking part in a team that has investigated the self-consistency principle introduced by Efron (1967). I will start with a simple regression problem to illustrate a self-consistency method that apparently can accomplish something that seems impossible at first sight, and the audience will be invited to contemplate whether it is a magical computing algorithm or a powerful statistical procedure. I will then discuss how such contemplations have played critical roles in developing the self-consistency principle into a full bloom generalization of EM for semi/non-parametric estimation with incomplete data and under an arbitrary loss function, capable of addressing wavelets de-noising with irregularly spaced data as well as variable selection via LASSO-type methods with incomplete data. Throughout the talk, the audience will also be invited to contemplate a widely open problem: how to formulate in general the trade-off between statistical efficiency and computational efficiency?
(This talk is based on joint work with Thomas Lee and Zhan Li.)

Spring 2015

Time: Wednesday, 27 May at 11am
Location: basement classroom, Department of Statistics, 2006 Sheridan Road, Evanston
Speaker: Professor Juned Siddique, Department of Preventive Medicine – Biostatistics and Department of Psychiatry and Behavioral Sciences, Northwestern University

Title: Multiple imputation for harmonizing longitudinal non-commensurate measures in individual participant data meta-analysis

Abstract: There are many advantages to individual participant data meta-analysis for combining data from multiple studies. These advantages include greater power to detect effects, increased sample heterogeneity, and the ability to perform more sophisticated analyses than meta-analyses that rely on published results. However, a fundamental challenge is that it is unlikely that variables of interest are measured the same way in all of the studies to be combined. We propose that this situation can be viewed as a missing data problem in which some outcomes are entirely missing within some trials, and use multiple imputation to fill in missing measurements. We apply our method to 5 longitudinal adolescent depression trials where 4 studies used one depression measure and the fifth study used a different depression measure. None of the 5 studies contained both depression measures. We describe a multiple imputation approach for filling in missing depression measures that makes use of external calibration studies in which both depression measures were used. We discuss some practical issues in developing the imputation model including taking into account treatment group and study. We present diagnostics for checking the fit of the imputation model and investigating whether external information is appropriately incorporated into the imputed values.

Time: Wednesday, 13 May at 11am
Location: basement classroom, Department of Statistics, 2006 Sheridan Road, Evanston
Speaker: Dr. Eduardo Mendes, The University of New South Wales Australia Business School

Title: Markov Interacting Importance Samplers (with Marcel Scharth and Robert Kohn)

Abstract: We introduce a new Markov chain Monte Carlo (MCMC) sampler that iterates by constructing conditional importance sampling (IS) approximations to target distributions. We present Markov interacting importance samplers (MIIS) in general form, followed by examples to demonstrate their flexibility. A leading application is when the exact Gibbs sampler is not available due to infeasibility of direct simulation from the conditional distributions. The MIIS algorithm uses conditional IS approximations to jointly sample the current state of the Markov chain and estimate conditional expectations (possibly by incorporating a full range of variance reduction techniques). We compute Rao-Blackwellized estimates based on the conditional expectations to construct control variates for estimating expectations under the target distribution. The control variates are particularly efficient when there are substantial correlations in the target distribution, a challenging setting for MCMC. We also introduce the MIIS random walk algorithm, designed to accelerate convergence and improve upon the computational efficiency of standard random walk samplers. Simulated and empirical illustrations for Bayesian analysis of the mixed Logit model and Markov modulated Poisson processes show that the method significantly reduces the variance of Monte Carlo estimates compared to standard MCMC approaches, at equivalent implementation and computational effort.

Link: http://arxiv.org/abs/1502.07039

Time: Wednesday, 29 April at 11am
Location: basement classroom, Department of Statistics, 2006 Sheridan Road, Evanston
Speaker: Professor Susan Holmes, Department of Statistics, Stanford University

Title: Statistical Challenges from the Human Microbiome: variance stabilizing avoids discarding data

Abstract: New data are now available to shed light on the complex symbiotic systems that inhabit the human gut. Next-generation sequencing provides tallies in contingency tables cross-tabulating samples by taxa, and we also need to incorporate contiguous information about the phylogenetic relationships between taxa as well as clinical data about the specimens. However, the interpretation of the count data requires special attention. In particular, the per-sample library sizes often vary by orders of magnitude within the same sequencing run, and the counts are overdispersed relative to a simple Poisson model.

This challenge can be addressed using an appropriate mixture model that simultaneously accounts for library size differences and biological variability.

We use statistical theory, extensive simulations, and empirical data to show that variance stabilizing normalization using a mixture model like the negative binomial is appropriate for microbiome count data. (Joint work with PJ McMurdie)

Time: Wednesday, 22 April at 11am
Location: basement classroom, Department of Statistics, 2006 Sheridan Road, Evanston
Speaker: Professor Bo Li, Department of Statistics, University of Illinois at Urbana-Champaign

Title: Reconstructing Past Temperatures using Short- and Long-memory Models

Abstract: Understanding the dynamics of climate change in its full richness requires the knowledge of long temperature time series. Although long term, widely distributed temperature observations are not available, there are other forms of data, known as climate proxies, that can have a statistical relationship with temperatures and have been used to infer temperatures in the past before direct measurements. We propose a Bayesian hierarchical model (BHM) to reconstruct past temperatures that integrates information from different sources, such as proxies with different temporal resolution and forcings acting as the external drivers of large scale temperature evolution. The reconstruction method is assessed, using a global climate model as the true climate system and with synthetic proxy data derived from the simulation. Then we apply the BHM to real datasets and produce new reconstructions of Northern Hemisphere annually averaged temperature anomalies back to 1000 AD. We also explore the effects of including external climate forcings within the reconstruction and of accounting for short-memory and long-memory features. Finally, we use posterior samples of model parameters to arrive at an estimate of the transient climate response to greenhouse gas forcings of 2.5°C (95% credible interval of [2.16, 2.92]°C), which is on the high end of, but consistent with, the expert-assessment-based uncertainties given in the recent Fifth Assessment Report of the IPCC.

Time: Wednesday, 15 April at 11am
Location: basement classroom, Department of Statistics, 2006 Sheridan Road, Evanston
Speaker: Professor Yichao Wu, Department of Statistics, North Carolina State University

Title: Automatic Structure Recovery for Additive Models

Abstract: We propose an automatic structure recovery method for additive models, based on a backfitting algorithm coupled with local polynomial smoothing, in conjunction with a new kernel-based variable selection strategy. Our method produces estimates of the set of noise predictors, the sets of predictors that contribute polynomially at different orders up to a specified order M, and the set of predictors that contribute beyond polynomially of order M. Its asymptotic consistency is proved. An extension to partially linear models is also described. Finite-sample performance of the proposed method is illustrated via Monte Carlo studies and a real data example.

Winter 2015

Time: Wednesday, 11 March at 11am
Location: basement classroom, Department of Statistics, 2006 Sheridan Road, Evanston
Speaker: Hua Yun Chen, Professor of Biostatistics, University of Illinois at Chicago

Title: Semiparametric Odds Ratio Model for Sparse Network Detection

Abstract: The Gaussian graphical model for network detection has been extensively studied in high-dimensional settings. This model for network detection relies on the joint normal distribution assumption for its interpretation of the nonzero off-diagonal elements of the precision matrix as the presence of network connection between two nodes. Generalizations of the Gaussian graphical model to the normal copula model and other generalized linear modeling approaches have been proposed to extend the scope of applications. However, limitations exist in these generalizations. In this talk, a unified graphical modeling approach to network detection through the semiparametric odds ratio model is proposed. The semiparametric odds ratio model is flexible in handling both discrete and continuous data and in including interactions, can avoid the problem of model incompatibility, and is invariant to biased sampling designs in network detection. A neighborhood selection approach is proposed. We show that the approach enjoys sign-consistency under a version of the irrepresentable condition. A coordinate descent algorithm is proposed to solve the computation problem. The proposed approach is compared with alternative approaches via simulations.

This is joint work with Dr. Jinsong Chen at the College of Medicine of UIC.

Time: Wednesday, 25 February at 11am
Location: basement classroom, Department of Statistics, 2006 Sheridan Road, Evanston
Speaker: Alan M. Polansky, Associate Professor, Division of Statistics, Northern Illinois University (http://www.math.niu.edu/~polansky/)

Title: The Bootstrap and Network Graphs

Abstract: The proliferation of relational data in the form of observed random networks has produced the need for reliable statistical inference on the parameters of the mechanisms that produce such structures. Resampling methods, which were instrumental in providing useful techniques of analyzing complex models in the past, have lagged far behind in the statistical analysis of network models. The relative complexity of dependence in network structures, along with the lack of important information from a single observed realization of a network, produces a framework where the traditional nonparametric bootstrap is difficult to apply in practice. This talk will explore some approaches to using resampling techniques for observed realizations of network graphs, and will hopefully stimulate discussions that will produce new ideas in this area.

Time: Wednesday, 11 February at 11am
Location: basement classroom, Department of Statistics, 2006 Sheridan Road, Evanston
Speaker: Sijian Wang, Associate Professor, Department of Biostatistics & Medical Informatics, Department of Statistics at University of Wisconsin

Title: Regularized Outcome Weighted Subgroup Identification for Differential Treatment Effects

Abstract: To facilitate comparative treatment selection when there is substantial heterogeneity of treatment effectiveness, it is important to identify subgroups that exhibit differential treatment effects. Existing approaches model outcomes directly and then define subgroups according to treatment-covariate interactions. However, outcomes are affected by both the covariate-treatment interactions and covariate main effects. Consequently, misspecification of the main effects interferes with the covariate-treatment interaction estimation and thus impedes valid predictive variable identification. We propose a method that approximates a target function whose value directly reflects correct treatment assignment for patients. This can disconnect the covariate main effects from the covariate-treatment interactions. The function uses patient outcomes as weights instead of as modeling targets. Consequently, our method can deal with binary, continuous, time-to-event, and possibly contaminated outcomes in the same fashion. We first focus on identifying only directional estimates from linear rules that characterize important subgroups. We further consider estimation of differential comparative treatment effects for identified subgroups. We demonstrate the advantages of our method in simulation studies and in an analysis of two real data sets. This is joint work with Yaoyao Xu, Quefeng Li, Menggang Yu, Yingqi Zhao and Jun Shao.

Time: Wednesday, 28 January at 11am
Location: basement classroom, Department of Statistics, 2006 Sheridan Road, Evanston
Speaker: Professor Tracy Ke from University of Chicago

Title: Covariate Assisted Multivariate Screening

Abstract: Given a linear regression model, we consider variable selection in a rather challenging situation: the columns of the design matrix are moderately or even heavily correlated, and signals (i.e., nonzero regression coefficients) are rare and weak. In such a case, popular penalization methods are simply overwhelmed.

An alternative approach is screening, which has recently become popular. However, univariate screening is fast but suffers from “signal cancellations”, and exhaustive multivariate screening may overcome “signal cancellations” but is computationally infeasible.

We discover that in multivariate screening, if all we wish is to overcome the “signal cancellations”, it is not necessary to screen all m-tuples in an exhaustive fashion: most of them can be safely skipped. In light of this, we propose covariate-assisted multivariate screening (CAS) as a new approach to variable selection, where we first construct a sparse graph using the Gram matrix, then use this graph to decide which m-tuples to skip and which to screen. CAS has a modest computational cost and is able to overcome the challenge of “signal cancellations”.

We demonstrate the advantage of CAS over penalization methods and univariate screening in a “rare and weak” signal model. We show that our method yields optimal convergence rate on the Hamming selection error and optimal phase diagram.

CAS is a flexible idea for incorporating correlation structures into inferences. We discuss its possible extensions to multiple testing and feature ranking.

Time: Wednesday, 21 January at 11am
Location: basement classroom, Department of Statistics, 2006 Sheridan Road, Evanston
Speaker: Professor Anindya Bhadra from Purdue University

Title: The Horseshoe+ Estimator of Sparse Signals

Abstract: We propose a new prior for sparse signal detection that we term the "horseshoe+ prior." The horseshoe+ prior is a natural extension of the horseshoe prior that has been shown to possess a number of desirable theoretical properties while enjoying computational feasibility in high dimensions. We improve upon these advantages with the horseshoe+ prior. Our work proves that the horseshoe+ posterior concentrates at a rate faster than that of horseshoe in the Kullback-Leibler (K-L) sense. We also establish theoretically that the proposed estimator has lower mean-squared error in estimating signals compared to the horseshoe and achieves the optimal Bayes risk in testing up to a constant. In simulations, we demonstrate the superior performance of the horseshoe+ estimator in a standard design setting against recent competitors, including the horseshoe and Dirichlet-Laplace estimators. This is joint work with Jyotishka Datta (SAMSI/Duke University), Nick Polson and Brandon Willard (The University of Chicago).

Time: Wednesday, 14 January at 11am
Location: basement classroom, Department of Statistics, 2006 Sheridan Road, Evanston
Speaker: Professor David Degras-Valabregue from DePaul University

Title: Estimation/detection of brain activations in fMRI studies

Abstract: Since its inception in the early 1990s, functional magnetic resonance imaging (fMRI) has enabled fundamental advances in understanding brain function. This technology has found important applications in neuropsychology (e.g., to treat schizophrenia, Alzheimer’s and Parkinson’s disease); pharmacology (to design new drugs for, e.g., migraine); and clinical practice (e.g., for pre-surgical mapping). A new area of fMRI research with tremendous medical impact is the development of individualized therapies.

A central task of functional neuroimaging is to study brain specialization, that is, to determine which specific brain regions are active during a given cognitive or sensorimotor task. From a statistical perspective, the analysis of fMRI data poses difficult and exciting challenges because of (i) the high spatial resolution of scans, (ii) their low signal/noise ratio, and (iii) the considerable variability of brain anatomy and brain function in populations.

In this talk I will discuss the estimation and detection of brain activations in fMRI data analysis. I will present a new approach that overcomes important limitations of mainstream methods. In particular, the proposed approach adopts a nonparametric spatial model for the hemodynamic response that sensibly improves the estimation/inference of activations across the brain. Time permitting, I will also touch on a current project on spatio-temporal modeling and the construction of simultaneous confidence regions for activation effects in fMRI data.

This talk is based on a recent paper (Degras and Lindquist, 2014, “A hierarchical model for simultaneous detection and estimation in multi-subject fMRI Studies” NeuroImage 98, 61-72) available on my website:

http://depaul.academia.edu/DavidDegras