
Dissertation Defense Talks

PhD students in the Department of Statistics and Data Science give a talk that is open to Northwestern faculty and graduate students as part of their dissertation defense.


2024-2025

Spring 2025 dissertation defense talks


TBA

Date: Thursday, May 1, 2025
Time: 12:30pm
Location: TBA

Speaker: Zidan Wang, PhD candidate

Abstract: TBA


Statistical Methods for Characterizing Drought Indices

Date: Friday, May 2, 2025
Time: 3:00pm
Location: Ruan Conference Room, lower level (Chambers Hall, 600 Foster Street)
Hybrid talk, register here to attend online

Speaker: Tiffany Christian, PhD candidate

Abstract: The Standardized Vapor Pressure Deficit Drought Index (SVDI) is a new drought index, introduced in 2022, that holds a wealth of information for detecting and describing drought events. In this talk, we introduce three complementary research approaches to study a long-term SVDI dataset covering the continental USA from 1980 to 2021. We develop a new unsupervised machine learning clustering algorithm that creates 3D spatio-temporal drought clusters. The resulting clusters align with historical drought events and confirm long-term trends in drought severity and duration. We then apply a nonstationary Gaussian Process (GP) model to the data to uncover the spatial structure in the associated covariance function parameters, and we present a new hierarchical model capable of modeling this higher-level covariance function parameter structure. A comparative analysis of three different nonstationary GP methods is conducted to investigate the modeling accuracy of the covariance structure for a subset of the SVDI data.


Improved Estimation and Prediction of Intrinsic DNA Cyclizability Through Data Augmentation and an Application to CRISPR-Cas9 Cleavage Efficiency Modeling

Date: Tuesday, May 6, 2025
Time: 12:15pm
Location: 2006 Sheridan Rd, Room B02

Speaker: Brody Kendall, PhD candidate

Abstract: Loop-seq is a pioneering high-throughput assay that enables the simultaneous quantification of intrinsic cyclizability across a large set of DNA fragments. However, the assay's reliance on biotin-tethered elongated DNA fragments introduces a tethering effect, leading to biased cyclizability measurements. We demonstrate that the current de-biasing technique is inadequate for fully mitigating this bias. To address this, we introduce DNAcycP2, an enhanced software tool that extends the capabilities of its previous platform DNAcycP. DNAcycP2 incorporates a novel data augmentation approach to more effectively eliminate biotin tether bias, yielding more accurate estimates of intrinsic DNA cyclizability. Additionally, DNAcycP2 offers improved computational efficiency and expands accessibility through a newly developed R package alongside its existing Python package and web server, ensuring broader utility for the research community.

We also develop an application of DNAcycP2 predictions: constructing features and analyzing the relationship between intrinsic cyclizability and CRISPR-Cas9 gene editing efficiency. Specifically, we uncover a notable relationship between the predicted cyclizability of regions of the lentiviral delivery vector for the CRISPR-Cas9 system and its cleavage efficiency. Through exploratory analysis, hyperparameter optimization, feature selection, and final model construction on two distinct datasets, we investigate this relationship and conclude that it depends on the choice of promoter.


Winter 2025 dissertation defense talks


Reliability and Causality of Natural Language Processing Methods

Date: Monday, March 10, 2025
Time: 10:00am
Location: Ruan Conference Room, lower level (Chambers Hall, 600 Foster Street)

Speaker: Kayla Schroeder, PhD candidate

Abstract: Text data, while abundant, presents significant challenges for Natural Language Processing (NLP) due to its complexity and high dimensionality. The inherent variability in many NLP models, especially topic models and large language models (LLMs), necessitates robust reliability evaluations to ensure the trustworthiness of model outputs. Furthermore, inferring causal relationships between text and outcomes is complex, yet often done implicitly without proper consideration. This work aims to strengthen conclusions derived from NLP models by developing reliability metrics for topic models (e.g., LDA) and LLMs, and establishing a framework for causal inference with text as treatment. We introduce new reliability methods for topic models, demonstrating the shortcomings of existing approaches. For LLMs, we propose a reliability framework, highlighting the need for multiple instantiations to mitigate the impact of fixed randomness. Finally, we present a new causal dimension reduction method for text using semiparametric methods and topic models. This combined framework improves prediction and interpretability of textual treatments. This research contributes to more robust NLP applications, enabling more confident conclusions from text data.


Confidence Intervals for Evaluations of Data Mining

Date: Thursday, January 16, 2025
Time: 1:00pm
Location: Online, registration required (link below)

Speaker: Zheng Yuan, PhD candidate

Abstract: In data mining, when binary prediction rules are used to predict a binary outcome, many performance measures are used across a vast literature for the purposes of evaluation and comparison. Typically, these performance measures can only be approximately estimated from a finite dataset, which may lead to findings that are not statistically significant. In order to properly quantify such statistical uncertainty, it is important to provide confidence intervals associated with these estimated performance measures. In this talk, we introduce statistical inference for general performance measures used in data mining, with both individual and joint confidence intervals. Based on asymptotic normal approximations, these confidence intervals can be computed quickly, without the need for bootstrap resampling. One application of our framework is that multiple performance measures of multiple classification rules can be inferred simultaneously for comparison.
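As a minimal illustration of the simplest special case of such intervals (a sketch only, not the speaker's actual framework), a Wald-type confidence interval for a single proportion-valued performance measure such as accuracy can be computed in closed form from the asymptotic normal approximation, with no bootstrap resampling:

```python
from statistics import NormalDist

def wald_ci(successes: int, n: int, level: float = 0.95) -> tuple[float, float]:
    """Wald-type confidence interval for a proportion-valued
    performance measure (e.g., accuracy), via the asymptotic
    normal approximation. Illustrative only."""
    # Plug-in estimate of the performance measure
    p_hat = successes / n
    # Asymptotic standard error of the estimate
    se = (p_hat * (1 - p_hat) / n) ** 0.5
    # Normal critical value for the requested coverage level
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return p_hat - z * se, p_hat + z * se

# Example: 85 correct predictions out of 100 test cases
low, high = wald_ci(85, 100)  # roughly (0.78, 0.92)
```

Joint intervals for several measures or classifiers, as described in the abstract, would additionally require the joint asymptotic covariance of the estimates.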

Register here


2023-2024

Summer 2024 dissertation defense talks

Summarizing and Characterizing Heterogeneous Treatment Effects

Date: Thursday, July 25, 2024
Time: 1:00pm
Location: Harris Hall room L28

Speaker: Jingyang (Judy) Zhang, PhD candidate

Abstract: Whenever an intervention is proposed, people want to know if it works. Quantifying and summarizing treatment effects has been a focus in fields such as public policy, business, and medicine. Current practices emphasize finding “the” treatment effect, assumed to be shared by all individuals. For a continuous outcome, individual treatment effects are summarized using the standardized mean difference, an effect size comparing the mean outcomes of the treatment group to those of the control group.

In theory, the standardized mean difference can sufficiently summarize individual treatment effects if the variations among those effects are small and completely due to sampling error. However, individuals often respond to an intervention differently because of differences in their characteristics. The standardized mean difference alone does not reflect this heterogeneity in individual effects. Furthermore, some ad hoc alternative effect sizes, such as the robust standardized mean difference, fail to fully account for the heterogeneity in treatment effects, leaving the problem of better summarizing heterogeneous effects unsolved.

My dissertation focuses on developing a novel approach that better summarizes treatment effects by providing effect size parameters that characterize the probability distribution associated with the treatment effect. Additionally, I explore the relationship between treatment effects and baseline outcomes to characterize interventions as either inequality-increasing or decreasing.

I demonstrate how these novel effect size parameters can be estimated and derive the sampling properties of the corresponding estimators. The proposed methods are applied to empirical data, including studies from the What Works Clearinghouse, the National Study of Learning Mindsets, and a meta-analysis on the effect of school-based programs on cyberbullying perpetration and victimization. These examples reveal results that current practices would likely overlook, highlighting the additional insights provided by the proposed methods. The goal of my dissertation is to enhance research on variations in treatment effects, ultimately enabling more effective implementation of interventions.  

Fall 2023 dissertation defense talks

Optimization, Sampling and Their Interplay: Theory and Applications to Statistics and Machine Learning

Date: Friday, November 10, 2023
Time: 4:15pm-5:15pm
Location: University Hall 102

Speaker: Tim Tsz-Kit Lau, PhD candidate

Abstract: Optimization and sampling are two main pillars of modern high-dimensional statistics and machine learning, and more broadly, data science. Optimization theory and algorithms have been heavily involved in the development of numerical solvers for high-dimensional statistical estimation problems under the frequentist paradigm, as well as in the success of deep learning, whereas efficient sampling procedures have been the major workhorse of Bayesian inference and uncertainty quantification. Leveraging the recently revived and intriguing connection between optimization and sampling, I study the theoretical underpinnings of their interplay and develop novel algorithms for applications in statistics and machine learning. In particular, I address two intrinsic issues arising in both high-dimensional statistical estimation and sampling problems, "nonsmoothness" and "nonconvexity," which are exacerbated by the notoriously inevitable "curse of dimensionality" brought by massive datasets and gigantic models, employing tools from convex optimization and diffusion processes. Finally, I also explore the use of deep learning to develop novel estimation procedures for various high-dimensional regularized M-estimation problems.