2017
Spring 2017
Wednesday, April 19, 2017
The Secret Life of I. J. Good
Time: 11:00 a.m.
Speaker: Sandy Zabell, Professor of Statistics and Mathematics, Northwestern University; Director of Undergraduate Studies, Department of Statistics
Place: Basement classroom - B02, Department of Statistics, 2006 Sheridan Road
Abstract: I. J. ("Jack") Good was a prominent Bayesian statistician for more than half a century after World War II, and played an important role in the (eventual) post-war Bayesian revival. But his graduate training had been in mathematical analysis (one of his advisors had been G. H. Hardy); what was responsible for this metamorphosis from pure mathematician to statistician?
As Good only revealed in 1976, during the war he had initially served as an assistant to Alan Turing at Bletchley Park, working on the cryptanalysis of the German Naval Enigma, and it was from Turing that he acquired his life-long Bayesian philosophy. Declassified documents now permit us to understand in some detail how this came about, and indeed how many of the ideas Good discussed, and papers he wrote, in the initial decades after the war in fact presented, in sanitized form, results that had their origins in his wartime work. In this talk, drawing on these newly available sources, I will discuss the daily and very real use of Bayesian methods by Turing and Good, and how this was only gradually revealed by Good over the course of his life (including his return to classified work in the 1950s).
Wednesday, April 26, 2017
Simple, Scalable and Accurate Posterior Interval Estimation
Time: 11:00 a.m.
Speaker: Cheng Li, Assistant Professor, Department of Statistics and Applied Probability, National University of Singapore
Place: Basement classroom - B02, Department of Statistics, 2006 Sheridan Road
Abstract: Standard posterior sampling algorithms, such as Markov chain Monte Carlo, face major challenges in scaling up to massive datasets. We propose a simple and general posterior interval estimation algorithm to rapidly and accurately estimate quantiles of the posterior distributions for one-dimensional functionals. Our algorithm runs Markov chain Monte Carlo in parallel for subsets of the data, and then averages quantiles estimated from each subset. We provide strong theoretical guarantees and show that the credible intervals from our algorithm asymptotically approximate those from the full posterior in the leading parametric order. Our theory also accounts for the Monte Carlo errors from posterior sampling. We compare the empirical performance of our algorithm with several competing embarrassingly parallel MCMC algorithms in both simulations and a real data example. We also discuss possible extensions to multivariate posterior credible regions.
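To make the quantile-averaging step concrete, here is a minimal sketch; the toy normal-mean model, the random-walk Metropolis sampler, the trick of raising each subset likelihood to the k-th power so subset posteriors match the full-data scale, and all tuning constants are illustrative assumptions, not the speaker's exact algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=2.0, scale=1.0, size=10_000)
k = 10                                   # number of data subsets
subsets = np.array_split(data, k)
probs = [0.025, 0.5, 0.975]              # posterior quantiles of interest

def metropolis(x, n_iter=5_000, step=0.02):
    """Random-walk Metropolis for the mean of a N(theta, 1) model under a
    flat prior, with the subset log-likelihood multiplied by k so each
    subset posterior has roughly the full-data scale."""
    theta = x.mean()
    draws = np.empty(n_iter)
    for i in range(n_iter):
        prop = theta + step * rng.normal()
        log_ratio = k * 0.5 * (np.sum((x - theta) ** 2) - np.sum((x - prop) ** 2))
        if np.log(rng.uniform()) < log_ratio:
            theta = prop
        draws[i] = theta
    return draws

# run the sampler on each subset in turn (in practice, in parallel),
# then average each estimated quantile across subsets
subset_q = np.array([np.quantile(metropolis(s), probs) for s in subsets])
print("averaged [2.5%, 50%, 97.5%] estimates:", subset_q.mean(axis=0))
```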
Wednesday, May 10, 2017
Structure Adaptive Multiple Testing
Time: 11:00 a.m.
Speaker: Rina Barber, Assistant Professor, Department of Statistics, University of Chicago
Place: Basement classroom - B02, Department of Statistics, 2006 Sheridan Road
Abstract: In many multiple testing problems, the extremely large number of questions being tested means that we may be unable to find many signals after correcting for multiple testing, even when using the false discovery rate (FDR) for more flexible error control. At the same time, prior information can indicate that signals are more likely to be found in some hypotheses than in others, or that they may tend to cluster together spatially. If we use our data to fit weights that reveal where the signals are most likely to be found, can we then reuse the same data to perform weighted multiple testing and identify discoveries? Our method, the structure-adaptive Benjamini-Hochberg algorithm (SABHA), uses the data twice in this way in order to boost power. We find that, as long as the first step is constrained so as not to overfit the data, the overall two-stage procedure maintains FDR control at nearly the target level.
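A minimal sketch of the two-stage idea: a deliberately coarse, blockwise first-stage estimate of the local null proportion (so it cannot overfit individual p-values), followed by a weighted Benjamini-Hochberg step. The binning scheme and weight formula below are illustrative assumptions, not the exact SABHA estimator.

```python
import numpy as np
from scipy.stats import norm

def weighted_bh(pvals, qhat, alpha=0.1):
    """Reject the largest set R with qhat[i] * pvals[i] <= alpha * |R| / n."""
    n = len(pvals)
    adj = qhat * pvals
    order = np.argsort(adj)
    passing = np.nonzero(adj[order] <= alpha * np.arange(1, n + 1) / n)[0]
    reject = np.zeros(n, dtype=bool)
    if passing.size:
        reject[order[: passing.max() + 1]] = True
    return reject

rng = np.random.default_rng(1)
n = 2_000
signal = np.arange(n) < 200                  # signals cluster at the start
pvals = norm.sf(rng.normal(size=n) + 3.0 * signal)

# first stage: crude blockwise estimate of the local null proportion
qhat = np.empty(n)
for start in range(0, n, 100):
    block = pvals[start:start + 100]
    qhat[start:start + 100] = np.clip(2.0 * np.mean(block > 0.5), 0.1, 1.0)

reject = weighted_bh(pvals, qhat)
print(f"{reject.sum()} discoveries, {np.sum(reject & ~signal)} false")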
Wednesday, May 17, 2017
A Group-Specific Recommender System
Time: 11:00 a.m.
Speaker: Annie Qu, Professor and Director of Illinois Statistics Office, Department of Statistics, University of Illinois at Urbana-Champaign
Place: Basement classroom - B02, Department of Statistics, 2006 Sheridan Road
Abstract: In recent years, there has been a growing demand for efficient recommender systems which track users’ preferences and recommend potential items of interest. In this talk, we propose a group-specific method to utilize dependency information from users and items which share similar characteristics under the singular value decomposition framework. The new approach is effective for the “cold-start” problem, where the majority of responses in the testing set are obtained from new users or for new items, and their preference information is not available from the training set. One advantage of the proposed model is that we are able to incorporate information from the missing mechanism and group-specific features through clustering based on the number of ratings from each user and other variables associated with missing patterns. Our simulation studies and MovieLens data analysis both indicate that the proposed group-specific method improves prediction accuracy significantly compared to existing competitive recommender system approaches. In addition, we extend the recommender system to tensor data with multiple arrays.
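One hedged, toy reading of the two-stage structure: group users by their number of ratings (a proxy for the missing mechanism), remove a group-specific mean, and fit a low-rank SVD to the residuals. The grouping rule and the zero-filled SVD below are simplifications for illustration, not the authors' estimator.

```python
import numpy as np

rng = np.random.default_rng(2)
n_users, n_items, r = 300, 200, 3

# toy ratings: low-rank structure plus a user-group effect, with the
# observation rate also depending on the group (informative missingness)
U = rng.normal(size=(n_users, r))
V = rng.normal(size=(n_items, r))
group = np.repeat([0, 1, 2], n_users // 3)
truth = U @ V.T + np.array([1.0, 0.0, -1.0])[group][:, None]
mask = rng.uniform(size=truth.shape) < np.array([0.30, 0.15, 0.05])[group][:, None]
R = np.where(mask, truth + 0.1 * rng.normal(size=truth.shape), np.nan)

# step 1: group users by how many ratings they gave (a proxy for the
# missing mechanism) and remove a group-specific mean
counts = mask.sum(axis=1)
est_group = np.digitize(counts, np.quantile(counts, [1 / 3, 2 / 3]))
centered, offset = R.copy(), np.zeros(n_users)
for g in range(3):
    rows = est_group == g
    offset[rows] = np.nanmean(R[rows])
    centered[rows] -= offset[rows][:, None]

# step 2: rank-r SVD of the zero-filled centered matrix, then add back
u, s, vt = np.linalg.svd(np.nan_to_num(centered), full_matrices=False)
pred = u[:, :r] * s[:r] @ vt[:r] + offset[:, None]
rmse = np.sqrt(np.nanmean((np.where(mask, np.nan, truth) - pred) ** 2))
print("held-out RMSE:", round(float(rmse), 3))
```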
Wednesday, May 24, 2017
Time: 11:00 a.m.
Speaker: Dan Apley, Professor, Department of Industrial Engineering and Management Sciences, Northwestern University
Place: Basement classroom - B02, Department of Statistics, 2006 Sheridan Road
Winter 2017
Wednesday, January 25, 2017
Generating Marketing Insights from Social Media Data
Time: 11:00 a.m.
Speaker: Jennifer Cutler, Assistant Professor of Marketing, Kellogg School of Management, Northwestern University
Place: Basement classroom - B02, Department of Statistics, 2006 Sheridan Road
Abstract: Social media contain a wealth of information about consumers and market structures, and this has many marketers excited about the potential uses for such data. However, it is often not straightforward or cost-effective to process these data for reliable marketing insights. In particular, many approaches to commonly desired classification and prediction tasks (such as classifying text by topic or classifying users by demographics) rely on supervised learning methods, which require (often extensive) labeled training data. Such data can be difficult to obtain, and, due to the idiosyncratic and rapidly evolving user behavior on different platforms (e.g., “netspeak” slang), can become out-of-date quickly. In this talk, I will explore ways of leveraging the organic structure of social media data to circumvent the need for curated training data, resulting in unsupervised or distantly-supervised algorithms that are flexible, scalable, and highly automated. I will share examples of how such methods can be applied to problems such as classifying marketer- and user-generated text by topic, predicting demographic traits of users, and estimating the strength of brand image associations.
Wednesday, February 8, 2017
Inference in High-dimensional Semi-parametric Graphical Models
Time: 11:00 a.m.
Speaker: Mladen Kolar, Assistant Professor of Econometrics and Statistics, University of Chicago Booth School of Business
Place: Basement classroom - B02, Department of Statistics, 2006 Sheridan Road
Abstract: In this talk, we discuss root-n consistent estimators for elements of the latent precision matrix under high-dimensional elliptical copula models. Under mild conditions, the estimator is shown to be asymptotically normal, which allows for the construction of tests about the presence of edges in the underlying graphical model. The asymptotic distribution is robust to model selection mistakes and does not require the non-zero elements to be separated away from zero. The key technical result is a new lemma on the “sign-subgaussian” property, which allows us to establish optimality of the estimator under the same conditions as in the Gaussian setting. An extension to dynamic elliptical copula models will also be presented.
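A sketch of the rank-based first step in such models: estimate the latent correlation matrix via the sin(π/2 · Kendall's tau) transform, then feed it to a sparse precision estimator. The de-biasing step that yields root-n inference for individual edges is omitted; this only illustrates the plug-in pipeline, and the AR(1) truth and tuning constants are assumptions.

```python
import numpy as np
from scipy.stats import kendalltau
from sklearn.covariance import graphical_lasso

rng = np.random.default_rng(3)
n, p = 500, 10
cov = 0.5 ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))
Z = rng.multivariate_normal(np.zeros(p), cov, size=n)
X = np.exp(Z)                        # unknown monotone marginal transforms

# rank-based estimate of the latent correlation matrix
Sigma = np.eye(p)
for i in range(p):
    for j in range(i + 1, p):
        tau, _ = kendalltau(X[:, i], X[:, j])
        Sigma[i, j] = Sigma[j, i] = np.sin(np.pi / 2 * tau)

# sparse precision matrix from the rank-based correlation estimate
_, Theta = graphical_lasso(Sigma, alpha=0.05)
print("estimated edges:", np.argwhere(np.triu(np.abs(Theta) > 0.05, 1)))
```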
Wednesday, March 1, 2017
Graphical Models via Joint Quantile Regression with Component Selection
Time: 11:00 a.m.
Speaker: Hyonho Chun, Assistant Professor, Department of Statistics, Purdue University
Place: Basement classroom - B02, Department of Statistics, 2006 Sheridan Road
Abstract: A graphical model is used for describing interrelationships among multiple variables. In many cases, the multivariate Gaussian assumption is made partly for its simplicity, but the assumption is rarely met in actual applications. In order to avoid dependence on this rather strong assumption, we propose to infer the graphical model via joint quantile regression with component selection, since the components of quantile regression carry the information needed to infer conditional independence. We demonstrate the advantages of our approach using simulation studies and apply our method to an interesting real biological dataset, where the dependence structure is highly complex.
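A minimal sketch of reading graph structure off penalized quantile regressions: regress each variable on all the others at several quantile levels, and connect an edge when a coefficient survives at any level. The component-selection penalty in the talk is more refined; the quantile grid, threshold, and toy data below are assumptions.

```python
import numpy as np
from sklearn.linear_model import QuantileRegressor

rng = np.random.default_rng(4)
n, p = 400, 6
X = rng.normal(size=(n, p))
X[:, 1] += 0.8 * X[:, 0]             # an edge between variables 0 and 1
X[:, 2] += 0.8 * np.abs(X[:, 1])     # a non-Gaussian edge between 1 and 2

adj = np.zeros((p, p), dtype=bool)
for j in range(p):
    others = np.delete(np.arange(p), j)
    for q in (0.25, 0.5, 0.75):
        # L1-penalized quantile regression of variable j on the rest
        fit = QuantileRegressor(quantile=q, alpha=0.05, solver="highs")
        fit.fit(X[:, others], X[:, j])
        adj[j, others] |= np.abs(fit.coef_) > 1e-6
adj = adj & adj.T                     # keep edges found in both directions
print("edges:", np.argwhere(np.triu(adj, 1)))
```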
Wednesday, March 8, 2017
Priors for the Long Run
Time: 11:00 a.m.
Speaker: Giorgio Primiceri, Professor, Department of Economics, Northwestern University
Place: Basement classroom - B02, Department of Statistics, 2006 Sheridan Road
Abstract: We propose a class of prior distributions that discipline the long-run behavior of Vector Autoregressions (VARs). These priors can be naturally elicited using economic theory, which provides guidance on the joint dynamics of macroeconomic time series in the long run. Our priors for the long run are conjugate, and can thus be easily implemented using dummy observations and combined with other popular priors. In VARs with standard macroeconomic variables, a prior based on the long-run predictions of a wide class of dynamic stochastic general equilibrium models yields substantial improvements in forecasting performance.
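The dummy-observation implementation is simple enough to sketch: stack artificial observations on top of the sample and run least squares on the augmented regression. The dummies below encode generic random-walk shrinkage for illustration, not the paper's specific long-run prior.

```python
import numpy as np

rng = np.random.default_rng(5)
T, k = 200, 2                        # sample length, number of variables
Y = np.cumsum(rng.normal(size=(T, k)), axis=0)      # toy persistent series
X = np.column_stack([Y[:-1], np.ones(T - 1)])       # VAR(1) with intercept
y = Y[1:]

lam = 1.0                            # prior tightness (assumed)
# dummies shrinking each equation toward a random walk: own-lag
# coefficient toward 1, intercept toward 0
X_star = lam * np.eye(k + 1)
Y_star = np.vstack([lam * np.eye(k), np.zeros((1, k))])

# least squares on the augmented data = conjugate posterior mean
X_aug, Y_aug = np.vstack([X, X_star]), np.vstack([y, Y_star])
B_post = np.linalg.lstsq(X_aug, Y_aug, rcond=None)[0]
print(B_post.round(2))
```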
Authors: Domenico Giannone, Michele Lenza, and Giorgio Primiceri
Fall 2016
Wednesday, October 12, 2016
Discrete Optimization via Simulation using Gaussian Markov Random Fields
Time: 11:15 a.m.
Speaker: Barry L. Nelson, Professor, Department of Industrial Engineering and Management Sciences, Northwestern University
Place: Basement classroom - B02, Department of Statistics, 2006 Sheridan Road
Abstract: We consider the problem of maximizing or minimizing the expected value of a stochastic performance measure that can be observed by running a dynamic, discrete-event simulation, when the feasible solutions are defined by integer decision variables. Inventory sizing, call center staffing and manufacturing system design are common applications. Standard approaches are ranking and selection, which takes no advantage of the relationships among solutions, and adaptive random search, which exploits them, but only heuristically (“good solutions tend to be clustered”). Instead, we construct an optimization procedure built on modeling the relationship among solutions as a discrete Gaussian Markov random field (GMRF). This enables computation of the expected improvement (EI) that could be obtained by running the simulation for any feasible solution, whether actually simulated or not. The computation of EI can be numerically challenging in general, but the GMRF representation greatly reduces the burden by facilitating the use of sparse matrix methods. By employing a multiresolution GMRF, problems with millions of feasible solutions can be solved.
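A small-scale sketch of the EI computation under a GMRF, assuming a one-dimensional chain graph, noise-free simulation output, and dense linear algebra (the talk's procedure relies on sparse solvers and a multiresolution GMRF to reach much larger problems):

```python
import numpy as np
from scipy.stats import norm

m = 100                                    # feasible solutions 0..m-1
truth = 0.05 * (np.arange(m) - 60.0) ** 2 + 10.0   # unknown objective (minimize)
obs = np.array([5, 30, 55, 80])            # solutions simulated so far
y = truth[obs]                             # noise-free output for simplicity
un = np.setdiff1d(np.arange(m), obs)

# chain-graph GMRF prior: tridiagonal (sparse) precision matrix
Q = 2.02 * np.eye(m) - np.eye(m, k=1) - np.eye(m, k=-1)
mu = np.full(m, y.mean())                  # constant prior mean

# condition the GMRF on the simulated values via precision-matrix identities
Q_uu = Q[np.ix_(un, un)]
Q_uo = Q[np.ix_(un, obs)]
cond_mean = mu[un] - np.linalg.solve(Q_uu, Q_uo @ (y - mu[obs]))
cond_sd = np.sqrt(np.diag(np.linalg.inv(Q_uu)))

# expected improvement below the best value simulated so far
best = y.min()
z = (best - cond_mean) / cond_sd
ei = (best - cond_mean) * norm.cdf(z) + cond_sd * norm.pdf(z)
print("next solution to simulate:", un[np.argmax(ei)])
```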
Wednesday, October 26, 2016
Designing randomized trials for making generalizations to policy-relevant populations
Time: 11:00 a.m.
Speaker: Elizabeth Tipton, Assistant Professor of Applied Statistics, Department of Human Development, Teachers College, Columbia University
Place: Basement classroom - B02, Department of Statistics, 2006 Sheridan Road
Abstract: Randomized trials are common in education, the social sciences, and medicine. While random assignment to treatment ensures that the average treatment effect estimated is causal, studies are typically conducted on samples of convenience, making generalizations of this causal effect outside the sample difficult. This talk provides an overview of new methods for improving generalizations through improved research design. This includes defining an appropriate inference population, developing a sampling plan and recruitment strategies, and taking into account planned analyses for treatment effect heterogeneity. The talk will also briefly introduce a new webtool useful for those planning randomized trials in education research.
Wednesday, November 2, 2016
Combined Hypothesis Testing on Graphs with Applications to Gene Set Enrichment Analysis
Time: 11:00 a.m.
Speaker: Ming Yuan, Professor, Department of Statistics, University of Wisconsin-Madison
Place: Basement classroom - B02, Department of Statistics, 2006 Sheridan Road
Abstract: Motivated by gene set enrichment analysis, we investigate the problem of combined hypothesis testing on a graph. We introduce a general framework to effectively use the structural information of the underlying graph when testing multivariate means. A new testing procedure is proposed within this framework. We show that the test is optimal in that it can consistently detect departure from the collective null at a rate that no other test could improve, for almost all graphs. We also provide general performance bounds for the proposed test under any specific graph, and illustrate their utility through several common types of graphs.
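For orientation, one simple (and generic) way to use graph structure when testing node-level means is a neighborhood scan calibrated by Monte Carlo, sketched below on a ring graph. This is only a baseline instance of the idea, not the optimal test from the talk.

```python
import numpy as np

rng = np.random.default_rng(6)
m = 200
# ring graph: each node's neighborhood is itself and its two neighbors
neighbors = [np.array([(i - 1) % m, i, (i + 1) % m]) for i in range(m)]

def scan_stat(z):
    # maximum standardized neighborhood average
    return max(z[nb].mean() * np.sqrt(len(nb)) for nb in neighbors)

z = rng.normal(size=m)
z[50:53] += 1.5                      # a weak, spatially clustered signal
obs = scan_stat(z)
null = np.array([scan_stat(rng.normal(size=m)) for _ in range(500)])
print("p-value ≈", (1 + np.sum(null >= obs)) / 501)
```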
Wednesday, November 9, 2016
Quantifying Nuisance Parameter Effects in Likelihood and Bayesian Inference
Time: 11:00 a.m.
Speaker: Todd Kuffner, Assistant Professor, Department of Mathematics, Washington University in St. Louis
Place: Basement classroom - B02, Department of Statistics, 2006 Sheridan Road
Abstract: In the age of computer-aided statistical inference, the practitioner has at her disposal an arsenal of computational tools to perform inference for low-dimensional parameters of interest, where the elimination of nuisance parameters can be accomplished by optimization, numerical integration, or some other computational sorcery. An operational assumption is that black-box tools can usually vaccinate the final inference for the interest parameter against potential effects arising from the presence of nuisance parameters. At the same time, from a theoretical, analytic perspective, accurate inference on a scalar interest parameter in the presence of nuisance parameters may be obtained by asymptotic refinement of likelihood-based statistics. Among these are Barndorff-Nielsen’s adjustment to the signed root likelihood ratio statistic, the Bartlett correction, and Cornish-Fisher transformations. We show how these adjustments may be decomposed into two terms, the first taking the same value when there are no nuisance parameters or there is an orthogonal nuisance parameter, and the second term being zero when there are no nuisance parameters. Illustrations are given for a number of examples which provide insight into the effect of nuisance parameters on parametric inference for parameters of interest. Connections and extensions for Bayesian inference are also discussed, and some open foundational questions are posed regarding the role of nuisance parameters in Bayesian inference, with some emphasis on possible effects in computational procedures for inference. Time permitting, I will explore potential links with recent work on post-selection inference.
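For orientation, the signed root likelihood ratio statistic and Barndorff-Nielsen's adjustment mentioned above take the standard forms

```latex
r(\psi) = \operatorname{sign}(\hat\psi - \psi)\,
          \sqrt{2\{\ell(\hat\theta) - \ell(\hat\theta_\psi)\}},
\qquad
r^{*}(\psi) = r(\psi) + \frac{1}{r(\psi)}\,\log\frac{u(\psi)}{r(\psi)},
```

where ℓ is the log-likelihood, θ̂ the maximum likelihood estimate, θ̂_ψ the constrained estimate for a given value ψ of the interest parameter, and u a model-dependent adjustment quantity; r* is standard normal to third-order accuracy. The talk's specific two-term decomposition of the adjustment is not reproduced here.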
Wednesday, November 30, 2016
Using Machine Learning to Predict Laboratory Test Results
Time: 11:00 a.m.
Speaker: Yuan Luo, Assistant Professor of Preventive Medicine (Health and Biomedical Informatics), Department of Industrial Engineering and Management Sciences, Northwestern University
Place: Basement classroom - B02, Department of Statistics, 2006 Sheridan Road
Abstract: While clinical laboratories report most test results as individual numbers, findings or observations, clinical diagnosis usually relies on the results of multiple tests. Clinical decision support that integrates multiple elements of laboratory data could be highly useful in enhancing laboratory diagnosis. Using the analyte ferritin as a proof of concept, we extracted clinical laboratory data from patient testing and applied a variety of machine learning algorithms to predict ferritin test results using the results from other tests. We compared predicted to measured results and reviewed selected cases to assess the clinical value of predicted ferritin. We show that patient demographics and results of other laboratory tests can discriminate normal from abnormal ferritin results with a high degree of accuracy (AUC as high as 0.97 on held-out test data). Case review indicated that predicted ferritin results may sometimes better reflect underlying iron status than measured ferritin. Our next step is to integrate temporality into the prediction of multivariate analytes. We devise an algorithm that alternates between multiple-imputation-based cross-sectional prediction and stochastic-process-based autoregressive prediction, and show a modest performance improvement for the combined algorithm compared to either component alone. These findings highlight the substantial informational redundancy present in patient test results and offer a potential foundation for a novel type of clinical decision support aimed at integrating, interpreting and enhancing the diagnostic value of multi-analyte sets of clinical laboratory test results.
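The core prediction pipeline can be sketched in a few lines, with synthetic data standing in for the study's real labs; the features, label mechanism, and gradient-boosting model below are illustrative assumptions, not the study's actual variables or algorithms.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(7)
n = 5_000
X = np.column_stack([
    rng.normal(13, 2, n),     # hemoglobin-like feature (assumed)
    rng.normal(90, 8, n),     # MCV-like feature (assumed)
    rng.integers(18, 90, n),  # age
])
# abnormal-ferritin label driven by the first two features (assumed)
logit = -0.8 * (X[:, 0] - 13) - 0.1 * (X[:, 1] - 90) - 0.5
y = rng.uniform(size=n) < 1 / (1 + np.exp(-logit))

# train on one split, evaluate discrimination on held-out data
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
model = GradientBoostingClassifier().fit(X_tr, y_tr)
print("held-out AUC:", roc_auc_score(y_te, model.predict_proba(X_te)[:, 1]))
```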