
Fall 2014

Time: Wednesday, 29 October at 11am
Speaker: Cindy Xin Wang, PhD, Sr Manager, Statistics, Abbvie

Title: One Approach to Bayesian Sample Size Determination for Bivariate Normal Data

Abstract: The Bayesian framework has been an attractive approach in clinical trial applications for its direct interpretation of the drug effect and its flexibility in making inferences based on cumulative data. While it is desirable to design a trial with a Bayesian method as the primary analysis, challenges arise in defining and studying the operating characteristics of Bayesian decision criteria.

There have been various schools of solutions to Bayesian sample size determination, such as hybrid Bayesian, proper Bayesian, and decision-theoretic Bayesian approaches. The authors were motivated by Whitehead (2008), which determines the trial sample size with enough precision to be conclusive (either efficacious or futile with high posterior probability). Whitehead's approach was extended to the two-dimensional continuous case. The efficacy and futility criteria were built around a joint probabilistic statement of the drug effect on both endpoints. A total of four decision criteria were considered. The sample size is selected to ensure that at least one criterion is met for all possible observed data.

Two search algorithms were proposed, with theoretical proofs of their validity. Examples of sample size selection are discussed for various correlations, decision probability boundaries, and standardized effect sizes.
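The one-dimensional construction that motivates this work can be sketched in a few lines. Below is a minimal Python sketch of a Whitehead-style criterion for a single normal endpoint with known variance and a flat prior (the bivariate extension in the talk replaces these with joint posterior statements on both endpoints); the reference margin `delta_ref` and probability thresholds `eta_e`, `eta_f` are illustrative values, not taken from the talk.

```python
from math import sqrt
from statistics import NormalDist

def conclusive_for_all_data(n, sigma, delta_ref, eta_e, eta_f):
    """Is a trial of size n guaranteed to be conclusive?

    One-dimensional sketch with a flat prior: given an observed mean
    xbar, the posterior for the effect delta is N(xbar, sigma^2 / n).
    The trial is conclusive if, for every possible xbar, either
      efficacy: P(delta > 0         | data) >= eta_e, or
      futility: P(delta < delta_ref | data) >= eta_f.
    """
    z_e = NormalDist().inv_cdf(eta_e)
    z_f = NormalDist().inv_cdf(eta_f)
    se = sigma / sqrt(n)
    # Efficacy holds iff xbar >= z_e * se; futility iff
    # xbar <= delta_ref - z_f * se.  Every possible xbar is covered
    # iff the two regions meet or overlap:
    return z_e * se <= delta_ref - z_f * se

# Simple search: smallest n that is conclusive for all possible data.
n = 1
while not conclusive_for_all_data(n, sigma=1.0, delta_ref=0.5,
                                  eta_e=0.95, eta_f=0.90):
    n += 1
```

The search here is a plain linear scan; the talk's two algorithms address the harder bivariate case, where the conclusive region depends on the correlation between endpoints.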

This is joint work with Wei Zhong and Yi-fan Huang.

Time: Wednesday, 22 October at 11am
Speaker: Tanya Yang Tang, Postdoctoral Fellow, Institute for Policy Research and Department of Statistics, Northwestern University

Title: The generalizability and precision of causal estimates from the comparative regression discontinuity design: theory and empirical evidence from six within-study comparisons

Abstract: The basic regression discontinuity (RD) design takes advantage of deterministic treatment assignment: units on one side of a cutoff value on an assignment variable receive treatment, while those on the other side do not. This kind of design is especially appropriate, and widely used, when treatment is assigned by some criterion of need, merit, age, or first-come first-served. Relative to the randomized experiment, the basic RD suffers from lower statistical power, greater dependence on functional form assumptions, and causal estimates that are only valid at the RD cutoff determining treatment assignment. The comparative regression discontinuity (CRD) design adds a no-treatment comparison function, formed either from pretest scores on the same measures as the outcome or from data on a non-equivalent control group, though other options are also possible. The purposes of the present paper are twofold. First, the paper provides empirical evidence from six within-study comparisons using the national Head Start Impact Study. We illustrate that the CRD design not only allows the functional form assumption to be better supported, but also supports generalized causal inference away from the RD cutoff. The CRD design also increases statistical power relative to the basic RD, but the power gain is contingent on this dataset and its parameters, and is therefore inherently limited. Thus the second purpose is to develop a more general theory of power for the CRD design. In addition, we calculate the gains each CRD design achieves under a variety of conditions akin to the practice-oriented ones that Schochet (2009) used to examine power in the basic RD design.
We show that: (a) each type of CRD can attain statistical power considerably greater than a basic RD; (b) for the same sample size, with the very strong pretest-posttest correlation found in many applications, CRD with pretests as the comparison function can attain power very close to the experiment's; (c) holding the sample sizes of the RD and the experiment fixed, adding comparison cases to the CRD with a non-equivalent comparison group can achieve almost the same statistical power as the experiment; and (d) the power of each CRD design is more robust to factors that influence the power of the basic RD design. The present paper adds another argument to the case for CRD as the design of choice in many settings where the basic RD design is now used.

Time: Wednesday, 01 October at 11am
Speaker: Professor Karl Rohe, Department of Statistics, University of Wisconsin - Madison

Title: Regularized and contextualized spectral clustering

Abstract: In relational data, we know how the data points relate to each other. These relationships could be observed (e.g., as in a social network) or built by some similarity function (e.g., between data points in Euclidean space). To cluster such relational data, spectral clustering utilizes the eigenvectors of the similarity matrix, whose (i,j)th element is the similarity between points i and j.

This talk will discuss (1) the evolution of the class of spectral clustering algorithms, (2) some simple "bells and whistles" that help the algorithm work in practice, and (3) how to incorporate contextualizing information on the nodes. Applications with a 1,000,000-node DTI neuroconnectome and a 4,000,000-node online social network will motivate the analysis. Two theorems show the consistency of spectral clustering under a mixture model; this helps us understand the algorithm as a statistical estimator.
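As a concrete illustration of the algorithm described above, here is a minimal Python sketch of spectral clustering: degree-normalize the similarity matrix, take its leading eigenvectors, and run a simple k-means on the embedded rows. The toy similarity matrix and the farthest-point initialization are illustrative choices, not details from the talk.

```python
import numpy as np

def spectral_clustering(S, k, n_iter=50):
    """Cluster items given a symmetric similarity matrix S into k groups."""
    # Degree-normalize: D^{-1/2} S D^{-1/2}.  (The regularized variant
    # would add a small constant to the degrees before inverting.)
    d = S.sum(axis=1)
    L = S / np.sqrt(np.outer(d, d))

    # Embed each item as a row of the k leading eigenvectors
    # (np.linalg.eigh returns eigenvalues in ascending order).
    _, vecs = np.linalg.eigh(L)
    X = vecs[:, -k:]

    # Simple k-means on the embedded rows, with deterministic
    # farthest-point initialization.
    centers = [X[0]]
    for _ in range(1, k):
        d2 = np.min([((X - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(X[int(np.argmax(d2))])
    centers = np.array(centers)
    for _ in range(n_iter):
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = np.argmin(dists, axis=1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = X[labels == j].mean(axis=0)
    return labels

# Toy similarity matrix with two obvious blocks.
S = np.array([[1.0, 0.9, 0.9, 0.0, 0.0],
              [0.9, 1.0, 0.9, 0.0, 0.0],
              [0.9, 0.9, 1.0, 0.0, 0.0],
              [0.0, 0.0, 0.0, 1.0, 0.8],
              [0.0, 0.0, 0.0, 0.8, 1.0]])
labels = spectral_clustering(S, k=2)
```

On this block-diagonal toy matrix the first three items land in one cluster and the last two in the other, because the leading eigenvectors of the normalized similarity matrix are piecewise constant on the connected blocks.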

Spring 2014

Time: Wednesday, 28 May at 11am
Speaker: Professor Lihui Zhao, Department of Preventive Medicine, Northwestern University Feinberg School of Medicine

Title: Effectively Selecting a Target Population for a Future Comparative Study

Abstract: When comparing a new treatment with a control in a randomized clinical study, the treatment effect is generally assessed by evaluating a summary measure over a specific study population. The success of the trial heavily depends on the choice of such a population. In this research, we show a systematic, effective way to identify a promising population, for which the new treatment is expected to have a desired benefit, using the data from a current study involving similar comparator treatments. Specifically, with the existing data we first create a parametric scoring system using multiple covariates to estimate subject-specific treatment differences. Using this system, we specify a desired level of treatment difference and create a subgroup of patients, defined as those whose estimated scores exceed this threshold. An empirically calibrated group-specific treatment difference curve across a range of threshold values is constructed. The population of patients with any desired level of treatment benefit can then be identified accordingly. To avoid any "self-serving" bias, we utilize a cross-training-evaluation method for implementing the above two-step procedure. Lastly, we show how to select the best scoring system among all competing models. The proposals are illustrated with real data examples.
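The two-step procedure can be illustrated with a toy simulation. The sketch below assumes a single covariate and a linear model per arm, and omits the cross-training-evaluation step the abstract describes; all data and the benefit threshold `c` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical existing-trial data: one covariate x, randomized arm
# indicator, outcome y.  The true subject-specific treatment
# difference, 0.2 + 0.8 * x, grows with x.
n = 400
x = rng.normal(size=n)
arm = rng.integers(0, 2, size=n)          # 0 = control, 1 = new treatment
y = 1.0 + 0.5 * x + arm * (0.2 + 0.8 * x) + rng.normal(scale=0.5, size=n)

def fit_line(xs, ys):
    """Least-squares intercept and slope."""
    X = np.column_stack([np.ones_like(xs), xs])
    beta, *_ = np.linalg.lstsq(X, ys, rcond=None)
    return beta

# Step 1: a parametric scoring system -- fit a linear model within each
# arm and score each subject by the difference of the fitted means.
b1 = fit_line(x[arm == 1], y[arm == 1])
b0 = fit_line(x[arm == 0], y[arm == 0])
score = (b1[0] - b0[0]) + (b1[1] - b0[1]) * x

# Step 2: for a desired benefit level c, the target population is the
# subgroup whose estimated treatment difference exceeds c.
c = 0.5
subgroup = score > c
```

Sweeping `c` over a grid of values traces out the treatment-difference curve across thresholds that the abstract mentions.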

Time: Wednesday, 21 May at 11am
Speaker: Professor Luke Tierney, Department of Statistics and Actuarial Science, University of Iowa

Title: Some Performance Improvements for the R Engine

Abstract: R is a dynamic language for statistical computing and graphics. In recent years R has become a major framework for both statistical practice and research. This talk presents a very brief outline of the R language and its evolution and describes some current efforts on improvements to the core computational engine, including work on compilation of R code, efforts to take advantage of multiple processor cores, and modifications to support working with larger data sets. Some new performance analysis tools will also be introduced.

Time: Wednesday, 16 April at 11am
Speaker: Professor Chunming Zhang, Department of Statistics, University of Wisconsin – Madison

Title: Estimation of the error auto-correlation matrix in semi-parametric model for brain fMRI data

Abstract: In the statistical analysis of functional magnetic resonance imaging (fMRI) data, dealing with temporal correlation is a major challenge in assessing changes within voxels. In this paper, we aim to address this issue by considering a semi-parametric model for fMRI data. For the error process in the semi-parametric model, we construct a banded estimate of the auto-correlation matrix R and propose a refined estimate of the inverse of R. Under some mild regularity conditions, we establish consistency of the banded estimate with an explicit convergence rate and show that the refined estimate converges under an appropriate norm. Numerical results suggest that the refined estimate performs well when applied to the detection of brain activity.
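For intuition, a generic banded estimate of an auto-correlation matrix can be sketched as follows. This is a simple Toeplitz banding of sample autocorrelations, not the paper's refined inverse estimate; the AR(1)-style residual series and the bandwidth are illustrative.

```python
import numpy as np

def banded_autocorr_matrix(e, bandwidth):
    """Banded Toeplitz estimate of an error auto-correlation matrix.

    Estimate the lag-h sample autocorrelations of the residual series e,
    keep only lags up to `bandwidth`, and set all entries of the
    Toeplitz matrix beyond the band to zero.
    """
    T = len(e)
    e = e - e.mean()
    gamma0 = np.dot(e, e) / T
    rho = np.zeros(T)
    rho[0] = 1.0
    for h in range(1, bandwidth + 1):
        rho[h] = np.dot(e[h:], e[:-h]) / (T * gamma0)
    # Fill R[i, j] = rho[|i - j|]; entries beyond the band stay zero.
    lag = np.abs(np.subtract.outer(np.arange(T), np.arange(T)))
    return rho[lag]

# AR(1)-like residual series: autocorrelation decays geometrically.
rng = np.random.default_rng(0)
T, phi = 500, 0.6
e = np.zeros(T)
for t in range(1, T):
    e[t] = phi * e[t - 1] + rng.normal()
R = banded_autocorr_matrix(e, bandwidth=5)
```

Banding trades a small bias at distant lags for a large variance reduction, which is what makes consistency with an explicit rate attainable for a single realization of the error series.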

Time: Wednesday, 09 April at 11am
Speaker: Professor Douglas Bates, Department of Statistics, University of Wisconsin – Madison

Title: Languages for statistical computing: present and future

Abstract: R (and its predecessor, S) has had an enormous impact on statistical practice and even, to some extent, on theory.  Those of us involved in the original development of R did not foresee anything close to its current popularity.  R can be used interactively, has an enormous number of user-contributed packages, provides for high-level programming, and now has an entire ecosystem of tools, like RStudio, available for users.  However, there are inherent difficulties in using R itself as a programming language for complex algorithms applied to large data sets.  This problem has been attacked in many different ways, including switching to other languages like Python or MATLAB, developing other languages like JAGS or Stan that can interface with R, or allowing "seamless" integration of compiled code with R through Rcpp.  Recently I have been developing in an open-source language called Julia, which is similar in structure to R (functions, generic functions, self-describing objects) but very different under the hood.  It uses the "just-in-time" (JIT) compiler capabilities of the LLVM project to compile functions and methods.  The result is code that can be amazingly fast while still having the dynamic properties we associate with R.  Julia is early in its development and does not yet have the ecosystem that has been built for R over the past 20 years (40 if you count earlier development on S).  Even so, I am convinced it will become the language of choice over the next several years.

Time: Wednesday, 02 April at 11am
Speaker: Professor Lingling An from the University of Arizona

Title: A two-stage statistical procedure for feature selection and comparison in functional analysis of metagenomes

Abstract: With the advance of new sequencing technologies producing massive short-read data, metagenomics is growing rapidly, especially in the fields of environmental biology and medical science. Metagenomic data are not only high-dimensional, with a large number of features and a limited number of samples, but also complex, with skewed distributions. Efficient computational and statistical tools are needed to deal with these unique characteristics of metagenomic data. In metagenomic studies, one main aim is to assess whether and how multiple microbial communities differ under various environmental conditions. In this research we propose a two-stage statistical procedure for selecting informative features and identifying differentially abundant features between two or more microbial communities. In the functional analysis of metagenomes, the features may refer to pathways, subsystems, functional roles, and so on. Compared with other available methods, the proposed approach demonstrates better performance in comprehensive simulation studies. The new method is also applied to real metagenomic datasets.

Winter 2014

Time: Wednesday, 05 March at 11am
Speaker: Dr. James D. Malley, Center for Information Technology, National Institutes of Health

Title: Statistical Learning Machines: Inference and Interpretation

Abstract: It is known that learning machines can be used for Bayes-optimal binary or category prediction. Deploying a recent observation, they can also be used for provably consistent estimation of probability and risk effects, starting from the same binary or category outcomes. These methods are entirely model-free, use existing code, and allow arbitrarily complex lists of features as input. It is claimed that subject-specific probability and risk-effect estimates are more informative than simple yes/no decisions. In at least partially opening up the black box of a learning machine, they also offer links to inference and interpretation. Background, applications, and newer developments are discussed.

Time: Wednesday, 26 February at 11am
Place: Basement classroom, Department of Statistics, 2006 Sheridan Road, Evanston
Speaker: Professor Hakan Demirtas from Division of Epidemiology and Biostatistics, University of Illinois at Chicago, School of Public Health

Title: Concurrent generation of ordinal and normal data

Abstract: The use of joint models that are capable of handling different data types is becoming increasingly popular in statistical practice. Evaluation of various statistical techniques that have been developed for mixed data in simulated environments requires joint generation of multiple variables. In this talk, I propose a unified framework for concurrently simulating ordinal and normal data given the marginal characteristics and correlation structure.
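One standard route to such joint generation, sketched below, is to discretize one coordinate of a latent bivariate normal at fixed cutpoints; this is not necessarily the speaker's exact algorithm. Note that discretization attenuates the latent correlation, so matching a target post-discretization correlation, as a unified framework would, requires an extra calibration step omitted here.

```python
import numpy as np

def gen_ordinal_normal(n, rho, cutpoints, seed=0):
    """Jointly simulate one ordinal and one normal variable.

    Draw bivariate normal pairs with latent correlation `rho`, then
    discretize the first coordinate at the given cutpoints to obtain
    ordinal categories 0, 1, 2, ...; the second coordinate keeps its
    N(0, 1) margin exactly.
    """
    rng = np.random.default_rng(seed)
    cov = [[1.0, rho], [rho, 1.0]]
    z = rng.multivariate_normal([0.0, 0.0], cov, size=n)
    ordinal = np.searchsorted(cutpoints, z[:, 0])
    return ordinal, z[:, 1]

# Three ordinal categories from two cutpoints.
ordinal, normal = gen_ordinal_normal(10_000, rho=0.5, cutpoints=[-0.5, 0.8])
```

The marginal characteristics of the ordinal variable are controlled through the cutpoints (here category probabilities Phi(-0.5), Phi(0.8) - Phi(-0.5), and 1 - Phi(0.8)), while the normal margin is untouched by the discretization.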

Time: Wednesday, 19 February at 11am
Place: Basement classroom, Department of Statistics, 2006 Sheridan Road, Evanston
Speaker:  Professor Alexander Torgovitsky, Economics Department, Northwestern University

Title: Instrumental variables estimation of a generalized correlated random coefficients model

Abstract: We study identification and estimation of the average treatment effect in a correlated random coefficients model that allows for first stage heterogeneity and binary instruments. The model also allows for multiple endogenous variables and interactions between endogenous variables and covariates. Our identification approach is based on averaging the coefficients obtained from a collection of ordinary linear regressions that condition on different realizations of a control function. This identification strategy suggests a transparent and computationally straightforward estimator of a trimmed average treatment effect constructed as the average of kernel-weighted linear regressions. We develop this estimator and establish its √n–consistency and asymptotic normality. Monte Carlo simulations show excellent finite-sample performance that is comparable in precision to the standard two-stage least squares estimator. We apply our results to analyze the effect of air pollution on house prices, and find substantial heterogeneity in first stage instrument effects as well as heterogeneity in treatment effects that is consistent with household sorting.

Time: Wednesday, 12 February at 11am
Place: Basement classroom, Department of Statistics, 2006 Sheridan Road, Evanston
Speaker: Mr. Cheng Li, Ph.D. candidate, Department of Statistics, Northwestern University

Title: Model selection for likelihood-free Bayesian methods based on moment conditions

Abstract: An important practice in statistics is to use robust, likelihood-free methods, such as estimating equations, which only require assumptions on the moments instead of specifying the full probabilistic model. We propose a Bayesian-flavored model selection approach for such likelihood-free methods, based on (quasi-)posterior probabilities from the Bayesian Generalized Method of Moments (BGMM). This novel concept allows us to incorporate two important advantages of a Bayesian approach: the expressiveness of posterior distributions and the convenient computational machinery of MCMC. Many different applications are possible, including modeling correlated longitudinal data, quantile regression, and graphical models based on partial correlation. We demonstrate numerically how our method works in these applications. In addition, under mild conditions, we show that BGMM can theoretically achieve posterior consistency for selecting the unknown true model, and that it possesses a Bayesian version of the oracle property; i.e., the posterior distribution for the parameter of interest is asymptotically normal and as informative as if the true model were known. These results remain true even when the dimension of the parameters diverges with the sample size and the true parameter is sparse.

Time: Wednesday, 29 January at 11am
Place: Basement classroom, Department of Statistics, 2006 Sheridan Road, Evanston
Speaker:  Professor Jing Cynthia Wu, University of Chicago Booth School of Business

Title: Term Structure of Interest Rate Volatility and Macroeconomic Uncertainty

Abstract:  We propose a new affine term structure model to capture the term structure of interest rate volatilities. With three volatility factors and three yield factors, our model is an excellent description of the data. The common movement in the volatilities provides a new measure of economy-wide uncertainty. We find that uncertainty pushes inflation higher when it is already high, and adds to deflationary concerns in a low inflation environment. Uncertainty also contributes to higher unemployment when the economy is in a deep recession.