2011: Department of Statistics and Data Science

Fall 2011

Wednesday November 30, 11am

Speaker: Blake McShane, Kellogg School of Management, Northwestern University

Title: Modeling Time Series Dependence for Scoring Sleep in Mice

Abstract: Current methods for scoring sleep behavior in mice are expensive, invasive, and labor intensive, thus leading to considerable interest in high-throughput automated systems which would allow many mice to be scored cheaply and quickly. Previous efforts have been able to differentiate sleep from wakefulness, but cannot differentiate the rare and important state of REM sleep from non-REM sleep. Key difficulties in detecting REM are that (i) REM is much rarer than non-REM and wakefulness, (ii) REM looks similar to non-REM in terms of the observed covariates, (iii) the data is noisy, and (iv) the data contains strong time dependence structures crucial for differentiating REM from non-REM. We develop a novel approach which combines statistical learning methods with generalized Markov models, thereby enhancing the former to account for the time dependence in our data. Our proposed methodology can accommodate very general and very long-term time dependence structures in an easily estimable and computationally tractable fashion. Furthermore, it shows improved differentiation of REM from non-REM sleep in our application to sleep scoring in mice.

Wednesday November 23, 11am

Speaker: Dacheng Xiu, Booth School of Business, University of Chicago

Title: A Tale of Two Option Markets: State-Price Densities Implied from S&P 500 and VIX Option Prices

Abstract: The S&P 500 options and VIX options reveal the dynamics of the index return and its volatility. To study these dynamics, we perform a nonparametric analysis of the state-price densities implicit in both sets of option prices. We find that the state-price density of the index depends strongly on the current VIX level, not only in the short run but also in the long term, when VIX options become unavailable. The short-run dependence is compatible with and can be explained by the state-price density of the VIX. Furthermore, we conduct nonparametric specification tests of state-of-the-art parametric models and offer some insights on modeling the dynamics.

Wednesday November 2, 11am

Speaker: Elie Tamer, Department of Economics, Northwestern University

Title: Sensitivity Analysis in some Econometric Models

Wednesday October 12, 11am

Speaker: Jules van Binsbergen, Kellogg School of Management, Northwestern University

Title: On the Timing and Pricing of Dividends

Abstract: We recover prices of dividend strips on the aggregate stock market using data from derivatives markets. The price of a k-year dividend strip is the present value of the dividend paid in k years. The value of the stock market is the sum of all dividend strip prices across maturities. We study the properties of strips and find that expected returns, Sharpe ratios, and volatilities on short-term strips are higher than on the aggregate stock market, while their CAPM betas are well below one. Short-term strip prices are more volatile than their realizations, leading to excess volatility and return predictability.
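The two definitions in this abstract can be written compactly. The notation below is an assumption introduced for illustration and does not come from the talk: P_{t,k} is the time-t price of the k-year strip, D_{t+k} the dividend paid in k years, M_{t,t+k} a stochastic discount factor, and S_t the value of the aggregate stock market.

```latex
% Price of the k-year dividend strip: present value of the dividend paid in k years
P_{t,k} = \mathbb{E}_t\!\left[ M_{t,t+k}\, D_{t+k} \right]

% Value of the stock market: sum of strip prices across all maturities
S_t = \sum_{k=1}^{\infty} P_{t,k}
```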

Spring 2011

Wednesday May 25, 11am

Speaker: Hui Xie, School of Public Health, University of Illinois at Chicago

Title: An Integrated Adaptive Approach to Data Fusion

Abstract: Data fusion combines data items from various sources based on a common set of variables. The fused database overcomes the limitations of a single-source dataset and offers the opportunity to answer important managerial questions that cannot be addressed with a single-source dataset. However, performing proper data fusion in a general, adaptive, flexible and robust fashion is challenging. In this article, we propose an integrated adaptive data fusion framework that integrates both the limited-information and full-information approaches to data fusion. Other useful features of the proposed methods include the following: they can handle a mixture of continuous, semi-continuous and discrete variables in a robust manner, in that no parametric distributional assumptions are required for variables in the datasets; and the joint predictive distribution of the variables of interest has probability mass concentrated on a finite number of points, thereby substantially simplifying a direct approach to data fusion in general situations. Under the integrated framework for data fusion, researchers have full access to an array of data fusion approaches for a wide range of types of marketing data and applications, and have the flexibility to choose the most suitable one for the data at hand. We conduct simulation studies to evaluate the performance of the proposed methods. We then apply the methods to an application that studies counterfeit exposure and purchasing behaviors by combining survey data and consumer databases. These analyses demonstrate several advantages of the proposed method compared with alternative approaches. This is joint work with Dr. Yi Qian at Northwestern University.

Wednesday May 18, 11am

Speaker: Feng Liang, Department of Statistics, University of Illinois at Urbana-Champaign

Title: A Bayesian Approach to Structured Sparsity with Application to Market Segmentation

Abstract: Benefit segmentation, that is, grouping consumers into different segments based on their product preferences, is an essential problem in marketing theory and practice. Modern marketing environments pose new challenges for traditional segmentation methods. For example, companies are adding more and more features to a single product, while the data we can collect from each consumer is relatively small. Although most methods in benefit segmentation assume that consumers use every product feature in their decision making, recent research has shown that consumers consider only a subset of features. Further, the heterogeneity among consumers in selecting important product features should be used as an additional index for market segmentation and for new product development. In response to these challenges, we propose a Bayesian approach for collaborative inference among consumers. The proposed method is a Bayesian approach to multi-task learning problems with structured sparsity, where the structures we consider are stochastic groups and graphs. Connections with existing work on structured sparsity are discussed. We demonstrate the utility of our method on several simulated data sets and a real case study of online shopping websites. The talk is based on joint work with Jianfeng Xu (UIUC) and Sunghoon Kim (PSU).

Wednesday May 11, 11am

Speaker: Yuan Liao, Department of Operations Research and Financial Engineering, Princeton University

Title: High Dimensional Covariance Matrix Estimation in Approximate Factor Model

Abstract: The variance-covariance matrix plays a central role in the inferential theories of high-dimensional factor models in finance and economics. Classical methods of estimating covariance matrices are based on strict factor models, which assume independent idiosyncratic components. This assumption, however, is restrictive in practical applications. By assuming a sparse error covariance matrix, we allow for the presence of cross-sectional correlation even after taking out common factors. The sparse covariance is estimated by the adaptive thresholding technique of Cai and Liu (2011). The covariance matrix of the outcome is then estimated based on the factor structure. We first consider the case of observable factors, then extend the results to the case of unobservable factors. It is shown that in both cases the estimated covariance matrix is still nonsingular regardless of its dimensionality, and is consistent under various norms. Finally, extensions to seemingly unrelated regression are considered.

Wednesday May 4, 11am

Speaker: Michael Stein, Department of Statistics, University of Chicago

Title: When does the screening effect hold?

Abstract: When using optimal linear prediction to interpolate point observations of a mean square continuous stationary spatial process, one might generally expect that the interpolant mostly depends on those observations located nearest to the predictand. This phenomenon is in fact commonly observed in practice and is called the screening effect. However, there are situations in which a screening effect does not hold in a reasonable asymptotic sense and theoretical support for the screening effect is limited to some rather specialized settings for the observation locations. This talk explores conditions on the observation locations and the process model under which an asymptotic screening effect holds. A series of examples shows the difficulty in formulating a general result, especially for processes with different degrees of smoothness in different directions, which can naturally occur for spatial-temporal processes. These examples motivate a general conjecture and I describe two theorems covering special cases of it. The key condition on the process is that its spectral density should change slowly at high frequencies. I will argue that models not satisfying this condition of slow high-frequency change should generally not be used in practice.

Wednesday April 27, 11am

Speaker: Robert Gramacy, Booth School of Business, University of Chicago

Title: Simulation-based Regularized Logistic Regression

Abstract: We develop an omnibus framework for regularized logistic regression by simulation-based inference, exploiting two important results on scale mixtures of normals. By carefully choosing a hierarchical model for the likelihood by one type of mixture, and how regularization may be implemented by another, we obtain subtly different MCMC schemes with varying efficiency depending on the data type (binary v. binomial, say) and the desired estimator (maximum likelihood, maximum a posteriori, posterior mean, etc.). Advantages of this umbrella approach include flexibility, computational efficiency, application in p >> n settings, uncertainty estimates, variable selection, and an ability to assess the optimal degree of regularization in a fully Bayesian setup. We compare the statistical and algorithmic efficiency of each of our proposed methods against each other, and against modern alternatives on synthetic and real data.

Thursday April 21, 11am

Speaker: Jim Berger, Department of Statistical Science, Duke University

Location: ITW Auditorium, Ford Engineering Building, 2133 Sheridan Road

Title: Risk Assessment for Pyroclastic Flows

Abstract: The problem of risk assessment for rare natural hazards - such as volcanic pyroclastic flows - is addressed, and illustrated with the Soufriere Hills Volcano on the island of Montserrat. Assessment is approached through a combination of mathematical computer modeling, statistical modeling of geophysical data, and extreme-event probability computation. A mathematical computer model of the natural hazard is used to provide the needed extrapolation to unseen parts of the hazard space. Statistical modeling of the available geophysical data is needed to determine the initializing distribution for exercising the computer model. In dealing with rare events, direct simulations involving the computer model are prohibitively expensive, so computation of the risk probabilities requires a combination of adaptive design of computer model approximations (emulators) and rare event simulation.

Wednesday April 20, 11am

Speaker: Peter Qian, Department of Statistics, University of Wisconsin-Madison

Title: Sudoku Based Space-Filling designs

Abstract: Sudoku is played by millions of people across the globe. It has simple rules and is very addictive. The game board is a nine-by-nine grid of numbers from one to nine. Several entries within the grid are provided, and the remaining entries must be filled in so that each row, column, and three-by-three subsquare contains no duplicated numbers. By exploiting these three types of uniformity, we propose an approach to constructing a new type of design, called a Sudoku-based space-filling design, intended for data pooling. Such a design can be divided into groups of subdesigns such that the complete design and each subdesign achieve maximum uniformity in both univariate and bivariate margins. Also discussed are several unexpected applications of experimental design techniques, including stochastic optimization, parallel computing, cross-validation and variable selection.
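The three types of uniformity named in the abstract (rows, columns, and three-by-three subsquares) can be checked with a short sketch. This illustrates only the puzzle constraints, not the speaker's design construction; the function names and the cyclic pattern used to build an example grid are assumptions made for illustration.

```python
from typing import List

Grid = List[List[int]]


def make_pattern_grid() -> Grid:
    """Build one valid completed grid from a simple cyclic pattern (illustrative only)."""
    return [[(3 * (r % 3) + r // 3 + c) % 9 + 1 for c in range(9)] for r in range(9)]


def is_uniform(values: List[int]) -> bool:
    """True if the nine values are exactly the numbers 1..9, each appearing once."""
    return sorted(values) == list(range(1, 10))


def satisfies_sudoku_uniformity(grid: Grid) -> bool:
    """Check the row, column, and 3x3 subsquare constraints of a completed grid."""
    rows_ok = all(is_uniform(row) for row in grid)
    cols_ok = all(is_uniform([grid[r][c] for r in range(9)]) for c in range(9))
    boxes_ok = all(
        is_uniform([grid[br + i][bc + j] for i in range(3) for j in range(3)])
        for br in range(0, 9, 3)
        for bc in range(0, 9, 3)
    )
    return rows_ok and cols_ok and boxes_ok
```

Calling `satisfies_sudoku_uniformity(make_pattern_grid())` checks the example grid against all three constraints; altering any single entry breaks at least one of them.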

Wednesday April 6, 11am

Speaker: Shaoli Wang, School of Statistics and Management, Shanghai University of Finance and Economics

Title: Laplace error penalty based variable selection in ultrahigh dimensions

Abstract: Variable selection is of fundamental importance in high-dimensional modeling and data analysis. The method based on the L0 penalty function, which gives rise to subset selection, is generally believed to be optimal among various penalized procedures. However, it is unstable and computationally infeasible as the dimension grows. In this talk we propose a novel penalty function, the Laplace Error Penalty (LEP), for variable selection. LEP is bounded and infinitely differentiable everywhere except at the origin. With an extra tuning parameter, LEP approximates the L0 penalty much faster than competing methods, and therefore achieves consistent model selection and accurate parameter estimation simultaneously. We show that penalized least squares via LEP has a unique global minimizer, and that the resulting estimator satisfies oracle properties. The LEP procedure allows fast computation and works well for high-dimensional data. Its performance is demonstrated through simulations and real data analysis.

Winter 2011

Wednesday March 9, 11am

Speaker: Guang Cheng, Department of Statistics, Purdue University

Title: How Many Iterations are Sufficient for Semiparametric Estimation?

Abstract: Iterative estimation is a common way to obtain an efficient estimate of the Euclidean parameter in semiparametric models. The main purpose of this talk is a rigorous theoretical study of the semiparametric iterative estimation approach. We first show that the grid search algorithm can be used to produce the desired initial estimate with the proper convergence rate. Our major contribution is a formula for calculating the minimal number of iterations k needed to produce an efficient estimate. We discover that k depends on the convergence rates of the initial estimate and the functional nuisance estimate, and that k iterations are also sufficient for recovering the estimation sparsity in high-dimensional data. These general conclusions hold, in particular, when the nuisance parameter is not estimable at the root-n rate, and apply to semiparametric models estimated under various regularizations, e.g., kernel or penalized estimation. In practice, our results may be useful in reducing the bootstrap computational cost for semiparametric models.

Wednesday March 2, 11am

Speaker: Fengqing Zhang, Department of Statistics, Northwestern University

Title: Imaging Mass Spectrometry Data Biomarker Selection and Classification

Abstract: Imaging Mass Spectrometry (IMS) has shown great potential in proteomics. However, data processing remains challenging because of the high dimensionality of the data, the fact that the number of predictors is significantly larger than the number of observations, and the need to consider both spectral and spatial information in order to realize the advantages of IMS technology. In this talk, I’ll present some recent progress on IMS data analysis using multivariate analysis methods. First, we incorporate a spatial penalty term into the elastic net (EN) model for IMS data processing. The EN-based model fully utilizes not only the spectral information within individual pixels but also the spatial information for the whole IMS image cube. Both the simulation and real data analysis results show that the EN-based model works effectively and efficiently for IMS data processing. We then propose a weighted elastic net (WEN) model that combines ion intensity spreading information directly with the elastic net model. Properties of the WEN model, including variable selection accuracy, are also discussed. Finally, we develop a software package, called IMSmining, that includes visualization and analysis tools for IMS data processing.

Wednesday February 23, 11am

Speaker: Tim McMurry, Department of Mathematical Sciences, DePaul University

Title: Robust Empirical Bayes With Application to Genome-Wide Association Studies

Abstract: Large scale technologies such as gene expression microarrays and genome-wide association studies measure a large number of parallel parameters on a usually much smaller number of subjects. Bayesian and empirical Bayes analyses are natural for large scale data because of their ability to infer the collective structure of the many underlying parameters and to borrow information from the other observations. This talk proposes a rank-conditioned procedure in which the inference is based on the conditional distribution of the error given the rank of the (raw) estimate among all other estimates as opposed to conditioning on the raw estimate itself. Our method is particularly suited for correcting ranking bias in large scale estimation and for constructing valid confidence intervals for selected top-ranked parameters. The new method is almost as efficient as the corresponding Bayesian analysis when the prior is correctly specified. When the prior is incorrectly specified, the new method can be much more robust in the sense that it continues to provide accurate point and interval estimates. The efficacy of the proposed method is demonstrated through application to two genome-wide association studies from the Wellcome Trust Case Control Consortium (2007).