2018: Department of Statistics and Data Science

2018

Spring 2018

Wednesday, May 9, 2018

On Data Reduction of Big Data

Time: 11:00 a.m.

Speaker: Min Yang, Professor, Department of Mathematics, Statistics, and Computer Science, University of Illinois at Chicago

Place: Basement classroom - B02, Department of Statistics, 2006 Sheridan Road

Abstract: Extraordinary amounts of data are being produced in many branches of science. Proven statistical methods are no longer applicable to extraordinarily large data sets due to computational limitations. A critical step in Big Data analysis is data reduction. In this presentation, I will review some existing approaches to data reduction and introduce a new strategy called information-based optimal subdata selection (IBOSS). Under both linear and nonlinear model setups, theoretical results and extensive simulations demonstrate that the IBOSS approach is superior to other approaches in terms of parameter estimation and predictive performance. The tradeoff between accuracy and computation cost is also investigated. When models are mis-specified, the performance of different data reduction methods is compared through simulation studies. Some ongoing research work as well as some open questions will also be discussed.

Wednesday, May 23, 2018

Normalization of Transcript Degradation Improves Accuracy in RNA-seq Analysis

Time: 11:00 a.m.

Speaker: Bin Xiong, PhD Candidate, Department of Statistics, Northwestern University

Place: Basement classroom - B02, Department of Statistics, 2006 Sheridan Road

Abstract: RNA-sequencing (RNA-seq) is a powerful high-throughput tool to profile transcriptional activities in cells. The observed read counts can be biased by various factors such that they do not accurately represent the true relative abundance of mRNA transcripts. Normalization is a critical step to ensure unbiased comparison of gene expression between samples or conditions. Real data show that the gene-specific heterogeneity of transcript degradation patterns across samples presents a common and major source of unwanted variation, and it may substantially bias the results in gene expression analysis. Most existing normalization approaches, which focus on global adjustment of systematic bias, are ineffective at correcting for this bias. We propose a novel method based on matrix factorization over-approximation that allows quantification of RNA degradation of each gene within each sample. The estimated degradation index scores are used to build a pipeline named DegNorm (short for degradation normalization) to adjust read counts for RNA degradation heterogeneity on a gene-by-gene basis while simultaneously controlling sequencing depth. The robust and effective performance of this method is demonstrated on an extensive set of real and simulated RNA-seq data.

Winter 2018

Wednesday, January 24, 2018

Adversarial Machine Learning - Big Data Meets Cyber Security

Time: 11:00 a.m.

Speaker: Bowei Xi, Associate Professor of Statistics, Department of Statistics, Purdue University

Place: Basement classroom - B02, Department of Statistics, 2006 Sheridan Road

Abstract: As more and more cyber security incident data ranging from systems logs to vulnerability scan results are collected, machine learning techniques are becoming an essential tool for real-world cyber security applications. One of the most important differences between cyber security and many other applications is the existence of malicious adversaries that actively adapt their behavior to make the existing learning models ineffective. Unfortunately, traditional learning techniques are insufficient to handle such adversarial problems directly. The adversaries adapt to the defender's reactions, and learning algorithms constructed based on the current training dataset degrade quickly. To address these concerns, we develop a game theoretic framework to model the sequential actions of the adversary and the defender, while both parties try to maximize their utilities. We also develop an adversarial support vector machine method and an adversarial clustering algorithm to defend against active adversaries.

Wednesday, February 7, 2018

A Sparse Clustering Algorithm for Identifying Cluster Changes Across Conditions with Applications in Single-cell RNA-sequencing Data

Time: 11:00 a.m.

Speaker: Jun Li, Associate Professor, Department of Applied and Computational Mathematics and Statistics, University of Notre Dame

Place: Basement classroom - B02, Department of Statistics, 2006 Sheridan Road

Abstract: Clustering analysis, in its traditional setting, identifies groupings of samples from a single population/condition. We consider a different setting in which the data available are samples from two different conditions, such as cells before and after drug treatment. Cell types in cell populations change as the condition changes: some cell types die out, new cell types may emerge, and surviving cell types evolve to adapt to the new condition. Using single-cell RNA-sequencing data that measure the gene expression of cells before and after the condition change, we propose an algorithm, SparseDC, which identifies cell types, traces their changes across conditions, and identifies marker genes for these changes. By solving a unified optimization problem, SparseDC completes all three tasks simultaneously. As a general algorithm that detects shared/distinct clusters for two groups of samples, SparseDC can be applied to problems outside the field of biology.

Wednesday, February 14, 2018

Statistical Learning for Time Dependent Data

Time: 11:00 a.m.

Speaker: Likai Chen, PhD candidate, Department of Statistics, University of Chicago

Place: Basement classroom - B02, Department of Statistics, 2006 Sheridan Road

Abstract: In statistical learning theory, researchers primarily deal with independent data, for which there is a huge literature. In comparison, the theory has been much less investigated for time-dependent data, which are commonly encountered in economics, engineering, finance, geography, physics and other fields. In this talk, we focus on concentration inequalities for suprema of empirical processes, which play a fundamental role in statistical learning theory. We derive a Gaussian approximation and an upper bound for the tail probability of the suprema under conditions on the size of the function class, the sample size, temporal dependence and the moment conditions of the underlying time series. Due to the dependence and heavy-tailedness, our tail probability bound is substantially different from the classical exponential bounds obtained under the independence assumption in that it involves an extra polynomially decaying term. We allow both short- and long-range dependent processes; the long-range dependence case has not been previously explored. We show that our tail probability inequality is sharp up to a multiplicative constant. These bounds work as theoretical guarantees for statistical learning applications under dependence.

Wednesday, February 28, 2018

From Integrative Genomics to Therapeutic Discovery in Cancer Immunotherapies

Time: 11:00 a.m.

Speaker: Riyue Bao, Research Assistant Professor, Center for Research Informatics & Department of Pediatrics, University of Chicago

Place: Basement classroom - B02, Department of Statistics, 2006 Sheridan Road

Abstract: Anti-PD1-based immunotherapy has had a major impact on treatment of multiple cancer histologies. However, only a subset of patients responds to these treatments, and a beneficial outcome is frequently observed in patients with a spontaneous pre-existing T-cell response against their tumor. Therefore, identifying variables that could contribute to the differences in patients’ response and the underlying mechanisms will enable development of therapeutic solutions for patients lacking a beneficial tumor microenvironment. We use genomics approaches including RNAseq, whole exome sequencing, 16S ribosomal RNA amplicon sequencing, and metagenomic shotgun sequencing to discover (1) tumor-intrinsic oncogenic pathways that drive immune exclusion in non-responders, (2) the gut microbiome associated with anti-PD1 efficacy in metastatic melanoma patients, and (3) the intratumor microbiome associated with survival in neuroblastoma patients, towards the ultimate goal of developing immune-potentiating interventions in combination with checkpoint inhibitors for improved clinical outcomes.

Fall 2017

Wednesday, October 18, 2017

CompareML: Structuring Machine Learning Research in Data Driven Science

Time: 11:00 a.m.

Speaker: Victoria Stodden, Associate Professor, University of Illinois at Urbana-Champaign

Place: 617 Library Place

Abstract: Statistical discovery is increasingly taking place using data not collected by the discoverers and often completely in silico. This calls for new consideration of methods and computational infrastructure that support statistical pipelines. In this talk I present a novel framework, called CompareML, for statistical analysis of "organic data" as opposed to "designed data" (Kreuter & Peng 2014); it permits the direct comparison of findings that purport to answer the same statistical question. I will illustrate that such computational frameworks are crucial to reproducible science by way of an example from genomics [acute leukemia (Golub et al. 1999)] where traditional approaches (surprisingly) fail at scale.

Wednesday, October 25, 2017

Robust Simultaneous Inference for the Mean Function of Functional Data

Time: 11:00 a.m.

Speaker: Nedret Billor, Professor, Department of Mathematics and Statistics, Auburn University

Place: Basement classroom - B02, Department of Statistics, 2006 Sheridan Road

Abstract: A substantial amount of attention has been drawn to the field of functional data analysis since many scientific fields involving applied statistics have started measuring and recording massive continuous data due to rapid technological advancements. While the study of the probabilistic tools for infinite dimensional variables started at the beginning of the 20th century, research on the development of statistical methods for functional data started only in the last two decades. Further, these methods mainly require homogeneity of the data, that is, data free of outliers.

However, functional data present new challenges when studying outlier-contaminated datasets. In this talk, we will discuss robust simultaneous inference for the mean function based on polynomial splines, together with robust simultaneous confidence bands, and the asymptotic properties of the proposed robust estimator. The robust simultaneous confidence band is also extended to the difference of the mean functions of two populations. The performance of the proposed robust methods and their robustness are demonstrated by an extensive simulation study and real data examples.

Wednesday, November 1, 2017

Data Science Applications in Genomics and Precision Medicine

Time: 11:00 a.m.

Speaker: Ramana V. Davuluri, PhD, Professor of Preventive Medicine, Health and Biomedical Informatics, Northwestern University, Feinberg School of Medicine

Place: Basement classroom - B02, Department of Statistics, 2006 Sheridan Road

Abstract: Given that biomedical research is rapidly acquiring the character of BIG DATA, with rapid accumulation of datasets on genes, proteins and other molecules, data science applications are increasingly playing an important role in the analysis and interpretation of these large files, ranging from the discovery phase to clinical applications. Bioinformaticians have successfully mastered the application of data science skills since the early days of human genome sequencing; for example, prediction of genes in the assembled genomes or genomic contigs. I will discuss some of those applications our group has successfully applied in the prediction of (a) gene promoters in the human genome, (b) gene regulatory signals that are altered in breast cancer, (c) molecular grouping of brain tumor patients, and (d) functional roles of germline single nucleotide variants that are associated with prostate cancer. I will also discuss various data science issues, for example: (a) processing of unstructured data to prepare the data matrices; (b) clustering of samples based on gene expression data; (c) feature selection; and (d) classification algorithms.