2013: Department of Statistics and Data Science

2013

Fall 2013

Time: Wednesday, 20 November at 11am

Place: Basement classroom, Department of Statistics, 2006 Sheridan Road

Speaker: Jun Xie, Associate Professor of Statistics, Purdue University

Title: Dimension Reduction Method for High Dimensional but Low Sample Size Data

Abstract: Sliced inverse regression (SIR) is an effective dimension reduction method in prediction, where a response variable is assumed to depend on a large number of predictors through an unknown function. SIR extracts features of high dimensional predictors from their low dimensional projections and is based on eigenvector decomposition of the conditional sample covariance matrix. Nowadays, the popular open problems of big data often refer to the challenges posed by high dimensional but low sample size data. The difficulty of estimating large covariance matrices and their inverses limits the application of SIR. In this talk, a least squares formulation of SIR is considered and an alternating least squares method is used to obtain the SIR estimate. The alternating least squares method is not limited by high dimensional but low sample size data, and hence brings the conventional dimension reduction technique up to date for the challenges of big-data analysis. This is an ongoing project. Preliminary studies through simulations and a pharmacogenomics study for individualized treatment rules show that the extended SIR is very promising.

Time: Wednesday, 13 November at 11am

Place: Basement classroom, Department of Statistics, 2006 Sheridan Road

Speaker: Bhramar Mukherjee, Professor of Biostatistics, University of Michigan

Title: Incorporating Auxiliary Information for Improved Prediction in High Dimensional Datasets: An Ensemble of Shrinkage Approaches

Abstract: With advances in genomic technologies, it is common that two high-dimensional datasets are available, both measuring the same underlying biological phenomenon with different techniques. We consider predicting a continuous outcome Y using X, a set of p markers which is the best measure of the underlying biological process. This same biological process may also be measured by W, coming from prior technology but correlated with X. On a moderately sized sample we have (Y,X,W), and on a larger sample we have (Y,W). We utilize the data on W to boost prediction of Y by X. When p is large and the sub-sample containing X is small, this is a p>n situation. When p is small, this is akin to the classical measurement error problem; however, ours is not the typical goal of calibrating W for use in future studies, but using the available data on W and Y in the larger sample while predicting Y with X in new subjects. We propose to shrink the regression coefficients of Y on X towards different targets that use information derived from W in the larger dataset, comparing these with the classical ridge regression of Y on X, which does not use W. We unify all of these methods as targeted ridge estimators. Finally, we propose a hybrid estimator which is a linear combination of multiple estimators of the regression coefficients and balances efficiency and robustness in a data-adaptive way to theoretically yield smaller prediction error than any of its constituents. We also explore fully Bayesian methods with hierarchical priors to conduct joint analysis of all the available data.

The methods are evaluated via simulation studies. We apply them to a gene-expression dataset related to lung cancer. mRNA expression of 91 genes is measured by quantitative real-time polymerase chain reaction (qRT-PCR) and microarray technology on 47 lung cancer patients, with microarray measurements available on an additional 392 patients. The goal is to predict survival time using qRT-PCR in an independent sample of 101 patients. We compare and contrast the methods in this example.

This is joint work with Jeremy Taylor and Philip S. Boonstra at University of Michigan.
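
For readers unfamiliar with shrinkage toward a target, a generic ridge estimator with a nonzero target can be written as below. This is only an illustrative sketch: the abstract does not give the speakers' formulas, and the symbols \lambda (penalty weight), \beta_0 (a target vector, for example one built from W in the larger sample), and I_p (the p-by-p identity) are notation assumed here.

```latex
\hat{\beta}(\lambda, \beta_0)
  = \arg\min_{\beta} \; \|Y - X\beta\|_2^2 + \lambda \, \|\beta - \beta_0\|_2^2
  = (X^\top X + \lambda I_p)^{-1} \, (X^\top Y + \lambda \beta_0)
```

Setting \beta_0 = 0 recovers classical ridge regression of Y on X, while letting \lambda grow pulls the estimate toward the target \beta_0.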

Time: Wednesday, 30 October at 11am

Place: Basement classroom, Department of Statistics, 2006 Sheridan Road

Speaker: Jian (Frank) Zou, Assistant Professor of Statistics, Indiana University-Purdue University Indianapolis

Title: Bayesian Spatio-Temporal Methodology for Real-Time Detection of Disease Outbreaks

Abstract: The complexity of spatio-temporal data in epidemiology and surveillance presents challenges for researchers and public health agencies, such as low signal-to-noise ratios and high false positive rates. Central to the problem in the context of disease outbreaks is a decision structure that requires trading off false positives against delayed detections. We describe a novel Bayesian hierarchical model capturing the spatio-temporal dynamics in public health surveillance data sets. We further quantify the performance of the method to detect outbreaks by incorporating different criteria, including false alarm rate, timeliness and cost functions. Our data set is derived from emergency department (ED) visits for influenza-like illness and respiratory illness in the Indiana Public Health Emergency Surveillance System (PHESS). The methodology incorporates Gaussian Markov random field (GMRF) and spatio-temporal conditional autoregressive (CAR) modeling. Features of this model include timely detection of outbreaks, inference that is robust to model misspecification, reasonable prediction performance, as well as attractive analytical and visualization tools to assist public health authorities in risk assessment. Our numerical results show that the model captures salient spatio-temporal dynamics that are present in public health surveillance data sets, and that it appears to detect both “annual” and “atypical” outbreaks in a timely, accurate manner. We present maps that help make model output accessible and comprehensible to public health authorities. We use an illustrative family of decision rules to show how output from the model can be used to inform tradeoffs between false positives and delayed detections.

Time: Wednesday, 16 October at 11am

Place: Basement classroom, Department of Statistics, 2006 Sheridan Road

Speaker: Dr. Juan Hu, Assistant Professor of Statistics, DePaul University

Title: Optimal Low Rank Models for Massive Spatial Data

Abstract: Massive spatial data are observed in an increasing number of disciplines. Because spatial data are usually correlated, the large sample size imposes computational challenges on statistical inference and could make likelihood-based or Bayesian inference impractical. One of the approaches to the analysis of massive spatial data borrows ideas from dimension reduction to express the spatial process by a finite number of random variables, which results in a low rank model. Several low rank models have been proposed in recent years for spatial data. I will show for a given covariance function how the optimal low rank model can be constructed through the Karhunen-Loeve (KL) expansion. Although the KL expansion has been applied extensively in computer science and engineering, it has not been well studied for spatial processes. As a result, some widely adopted algorithms for calculating the eigenvalues and eigenfunctions do not work well for spatial processes. I will introduce a new algorithm I developed and compare it with the existing ones. I will compare the inferential results of the optimal low rank model with those of other low rank models and show that the optimal low rank model given by the KL expansion does outperform other low rank models. Finally, I will briefly show the application of the KL expansion to multivariate spatial models.

Time: Wednesday, 02 October at 11am

Place: Basement classroom, Department of Statistics, 2006 Sheridan Road

Speaker: Dr. Jin Liu, Assistant Professor, Division of Epidemiology & Biostatistics, University of Illinois at Chicago

Title: A Penalized Multi-trait Mixed Model for Association Mapping in Pedigree-based GWAS

Abstract: We consider genome-wide association studies (GWAS) of multiple highly correlated quantitative traits from pedigree data in this paper. In the analysis of GWAS, penalized regression has been found to be useful for identifying multiple associated genetic markers, while linear mixed models are commonly used to account for complicated dependence among samples. Therefore, a penalized linear mixed model is a natural choice that combines the advantages of both modeling approaches for GWAS data analysis. For GWAS of multiple quantitative traits that are highly correlated, analyzing each trait separately is sub-optimal. In this manuscript, we propose a penalized multi-trait mixed model (penalized-MTMM) which simultaneously accounts for both the within-trait and between-trait variance components to jointly analyze multiple traits. Our method not only accounts for the relatedness among study subjects, but also borrows information across traits through joint analysis of these traits using group penalties. We have evaluated the performance of penalized-MTMM in simulation studies and through its application to a GWAS data set from the Genetic Analysis Workshop (GAW) 18.

Spring 2013

Time: Wednesday, May 29 at 11 am

Speaker: James Pustejovsky, PhD; Department of Statistics; Northwestern University

Title: Some Markov models for direct observation of behavior

Abstract: Data based on direct observation of behavior are used extensively in certain areas of educational and psychological research. One procedure for recording direct observation of behavior, known as partial interval recording or "Hansen sampling," produces measurements that are difficult to interpret and sensitive to procedural details; despite these problems, the procedure remains in wide use. In this talk, I describe a latent Markov model for partial interval recording data, defined in terms of readily interpretable parameters that can be estimated by conventional methods. I then propose two alternative methods for recording direct observation of behavior, both of which offer efficiency gains relative to partial interval recording.

Time: Wednesday, May 22 at 11 am

Speaker: John Lafferty, Louis Block Professor; Department of Statistics and Department of Computer Science; University of Chicago

Title: Variants on Nonparametric Additive Models for Regression and Graphical Modeling

Abstract: We present recent research on nonparametric additive models for different families of statistical estimation problems, including multivariate regression and graphical modeling. The focus is on scaling to high dimensions while controlling the loss in statistical and computational efficiency compared with linear models. Convex geometry and optimization play a central role. Time permitting, we will also discuss some recent efforts to develop new courses and computational infrastructure for statistics and machine learning at UChicago.

Time: Wednesday, May 1 at 11 am

Speaker: Jim Berger, The Arts and Sciences Professor of Statistics; Department of Statistical Science; Duke University

Title: Reproducibility of Science: P-values and Multiplicity

Abstract: Published scientific findings seem to be increasingly failing efforts at replication. This is undoubtedly due to many sources, including specifics of individual scientific cultures and overall scientific biases such as publication bias. While these will be briefly discussed, the talk will focus on the all-too-common misuse of p-values and failure to properly account for multiplicities as two likely major contributors to the lack of reproducibility. The Bayesian approaches to both testing and multiplicity will be highlighted as possible general solutions to the problem.

Time: Wednesday, April 24 at 11 am

Speaker: Osnat Stramer, Associate Professor; Department of Statistics and Actuarial Science; University of Iowa

Title: Bayesian Inference for Diffusion Processes

Abstract: The problem of formal likelihood-based (either classical or Bayesian) inference for discretely observed multi-dimensional diffusions is particularly challenging. In principle this involves data augmentation of the observed data to give representations of the entire diffusion trajectory. In this talk, we provide a generic and transparent framework for data augmentation for diffusions. We introduce a generic program which can be followed in order to identify appropriate auxiliary variables, to construct Markov chain Monte Carlo algorithms that are valid even in the limit where continuous paths are imputed, and to approximate these limiting algorithms. We also present the pseudo-marginal (PM) approach for Bayesian inference in diffusion models. The PM approach can be viewed as a generalization of the popular data augmentation schemes that sample jointly from the missing paths and the parameters of the diffusion volatility. The efficacy of the proposed algorithms is demonstrated in a simulation study of the Heston models, and the algorithms are applied to the bivariate S&P 500 and VIX implied volatility data.

Time: Wednesday, April 10 at 11 am

Speaker: Mary Lindstrom, Professor; Department of Biostatistics and Medical Informatics; University of Wisconsin-Madison

Title: Modeling the shape and variability of curves

Abstract: I will present an overview of self-modeling for functional data. Functional data occur when the ideal observation for each experimental unit (individual) is a curve and each individual's data consist of a number of noisy observations at points along their curve. The self-modeling approach to the analysis of functional data is based on the assumption that all individuals' unknown curves have a common, possibly complex, shape and that a particular individual's curve is a low-dimensional, parametric transformation of the common shape curve. This simple idea works surprisingly well in practice. I will discuss the history of self-modeling and the natural extension of modeling the parameters in the individual transformations as random effects. I will also describe generalizations of the model to groups of curves that have similar but not identical shape, outlines (closed curves), curves with a nested structure and 2-dimensional, time-parameterized curves such as those arising from studies of motion.

Winter 2013

Time: Wednesday, March 13 at 11 am

Speaker: Robert D. Gibbons, Professor of Biostatistics in the Departments of Medicine and Health Studies, Director of the Center for Health Statistics, University of Chicago

Title: Statistical Issues in Drug Safety: The curious case of Antidepressants, Anticonvulsants, ...., and Suicide

Abstract: In 2003, the U.S. FDA, MHRA in the U.K., and EU released public health advisories for a possible causal link between antidepressant treatment and suicide in children and adolescents ages 18 and under. This led the U.S. FDA to issue a black box warning for antidepressant treatment of childhood depression in 2004, which was later extended to include young adults (18-24) in 2006. Following these warnings, rather than observing the anticipated decrease in youth suicide rates, increases in youth suicide rates were observed in both the U.S. and Europe. In this presentation, I review this literature and discuss new statistical and experimental design approaches to post-marketing drug safety surveillance.

Time: Wednesday, Jan. 16, 11am

Speaker: Ping Li, Assistant Professor of Statistics, Cornell University

Title: BigData: Probabilistic Methods for Efficient Search and Statistical Learning in Extremely High-Dimensional Data

Abstract: This talk will present a series of works on probabilistic hashing methods, which typically transform a challenging (or infeasible) massive-data computational problem into a probability and statistical estimation problem. For example, fitting a logistic regression (or SVM) model on a dataset with a billion observations and a billion (or a billion squared) variables would be difficult. Searching for similar documents (or images) in a repository of a billion web pages (or images) is another challenging example. In certain important applications in the search industry, a web page is often represented as a binary (0/1) vector in a billion squared (2 to the power 64) dimensions. For such data, both data reduction (i.e., reducing the number of nonzero entries) and dimensionality reduction are crucial for achieving efficient search and statistical learning.

This talk will present two closely related probabilistic methods: (1) b-bit minwise hashing and (2) one permutation hashing, which simultaneously perform effective data reduction and dimensionality reduction on massive, high-dimensional, binary data. For example, training an SVM for classification on a text dataset of size 24GB took only 3 seconds after reducing the dataset to merely 70MB using our probabilistic methods. Experiments on close to 1TB data will also be presented. Several challenging probability problems still remain open.

Key references: [1] P. Li, A. Owen, C-H Zhang, One Permutation Hashing, NIPS 2012; [2] P. Li, C. Konig, Theory and Applications of b-Bit Minwise Hashing, Research Highlights in Communications of the ACM 2011.
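
To make the idea of b-bit minwise hashing concrete, here is a minimal Python sketch. It is not the speaker's implementation: the function names, the use of salted BLAKE2b digests as stand-ins for random permutations, and the toy sets are all assumptions made for illustration. The correction in `estimate_resemblance` uses the sparse-data approximation in which two unrelated b-bit values collide with probability 1/2^b.

```python
import hashlib


def hash_value(item, j):
    """64-bit hash of `item` under the j-th hash function (salted BLAKE2b).

    Assumption: a salted cryptographic hash stands in for a random permutation.
    """
    digest = hashlib.blake2b(str(item).encode(), digest_size=8,
                             salt=j.to_bytes(8, "little")).digest()
    return int.from_bytes(digest, "big")


def bbit_signature(items, k, b):
    """For each of k hash functions, keep the lowest b bits of the minimum hash."""
    mask = (1 << b) - 1
    return [min(hash_value(x, j) for x in items) & mask for j in range(k)]


def estimate_resemblance(sig1, sig2, b):
    """Estimate Jaccard resemblance from two b-bit signatures.

    The raw match rate is corrected for accidental b-bit collisions,
    which occur with probability about 1 / 2**b for sparse data.
    """
    match_rate = sum(u == v for u, v in zip(sig1, sig2)) / len(sig1)
    chance = 1.0 / (1 << b)
    return (match_rate - chance) / (1.0 - chance)


# Toy example: two sets of 200 items sharing 100, so the true resemblance is 100/300.
set_a = set(range(0, 200))
set_b = set(range(100, 300))
sig_a = bbit_signature(set_a, k=500, b=2)
sig_b = bbit_signature(set_b, k=500, b=2)
estimate = estimate_resemblance(sig_a, sig_b, b=2)
print(f"estimated resemblance: {estimate:.3f} (true value 0.333)")
```

Each set is reduced to 500 two-bit values, regardless of how many items it contains, which is the sense in which the method performs data reduction; the price is a noisier resemblance estimate than full minwise hashing with the same number of hash functions.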

Time: Wednesday, Jan. 9, 11am

Speaker: Ryan Martin, Assistant Professor of Statistics, UIC

Title: A Bayesian test of normality versus a Dirichlet process mixture alternative

Abstract: In this talk I will describe a new Bayesian test of normality of univariate or multivariate data against alternative nonparametric models characterized by Dirichlet process mixture distributions. The alternative models are based on the principles of embedding and predictive matching. They can be interpreted to offer random granulation of a normal distribution into a mixture of normals with mixture components occupying a smaller volume the farther they are from the distribution center. A scalar parametrization based on latent clustering is used to cover an entire spectrum of separation between the normal distributions and the alternative models. An efficient sequential importance sampler is developed to calculate Bayes factors.

Simulations indicate the proposed test can effectively detect non-normality without favoring the nonparametric alternative when normality holds. (Joint work with Surya Tokdar at Duke.)