Dissertation Defense Talks

As part of their dissertation defense, PhD students in the Department of Statistics and Data Science give a talk that is open to Northwestern faculty and graduate students.

 

2024-2025

Winter 2025 dissertation defense talks

 

Reliability and Causality of Natural Language Processing Methods

Date: Monday, March 10, 2025
Time: 10:00am
Location: Ruan Conference Room, lower level (Chambers Hall, 600 Foster Street)

Speaker: Kayla Schroeder, PhD candidate

Abstract: Text data, while abundant, presents significant challenges for Natural Language Processing (NLP) due to its complexity and high dimensionality. The inherent variability in many NLP models, especially topic models and large language models (LLMs), necessitates robust reliability evaluations to ensure the trustworthiness of model outputs. Furthermore, inferring causal relationships between text and outcomes is complex, yet often done implicitly without proper consideration. This work aims to strengthen conclusions derived from NLP models by developing reliability metrics for topic models (e.g., LDA) and LLMs, and establishing a framework for causal inference with text as treatment. We introduce new reliability methods for topic models, demonstrating the shortcomings of existing approaches. For LLMs, we propose a reliability framework, highlighting the need for multiple instantiations to mitigate the impact of fixed randomness. Finally, we present a new causal dimension reduction method for text using semiparametric methods and topic models. This combined framework improves prediction and interpretability of textual treatments. This research contributes to more robust NLP applications, enabling more confident conclusions from text data.
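To make the multiple-instantiation idea concrete, here is a minimal sketch (an illustration of the general principle, not the speaker's method) of one common way to gauge topic-model reliability: refit LDA under different random seeds and measure how well the resulting topics align.

import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = ["text data is high dimensional", "topic models summarize text data",
        "language models generate text", "causal inference with text data"]
X = CountVectorizer().fit_transform(docs)

def topic_word_matrix(seed, k=2):
    # Fit LDA with a given seed; rows are topic distributions over the vocabulary.
    lda = LatentDirichletAllocation(n_components=k, random_state=seed).fit(X)
    return lda.components_ / lda.components_.sum(axis=1, keepdims=True)

# Align topics across two runs (Hungarian matching on cosine similarity) and
# report the mean similarity of matched pairs: values near 1 indicate a stable fit.
A, B = topic_word_matrix(seed=0), topic_word_matrix(seed=1)
sim = cosine_similarity(A, B)
rows, cols = linear_sum_assignment(-sim)
print("stability score:", sim[rows, cols].mean())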

 

 

Confidence Intervals for Evaluations of Data Mining

Date: Thursday, January 16, 2025
Time: 1:00pm
Location: Online; registration required (link below)

Speaker: Zheng Yuan, PhD candidate

Abstract: In data mining, when binary prediction rules are used to predict a binary outcome, many performance measures are used across a vast array of literature for the purposes of evaluation and comparison. Typically, these performance measures are only approximately estimated from a finite dataset, which may lead to findings that are not statistically significant. To properly quantify such statistical uncertainty, it is important to provide confidence intervals for these estimated performance measures. In this talk, we introduce statistical inference for general performance measures used in data mining, with both individual and joint confidence intervals. Based on asymptotic normal approximations, these confidence intervals can be computed quickly, without the need for bootstrap resampling. One application of our framework is that multiple performance measures of multiple classification rules can be inferred simultaneously for comparison.
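As a minimal illustration of the normal-approximation idea (a sketch of the general principle only; the talk's framework covers general measures and joint intervals), a Wald-type interval for a single measure such as accuracy can be computed directly from its estimated standard error, with no resampling:

import numpy as np
from scipy.stats import norm

y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0, 1, 1])
y_pred = np.array([1, 0, 1, 0, 0, 1, 1, 0, 1, 1])

n = len(y_true)
acc = np.mean(y_true == y_pred)      # point estimate of accuracy
se = np.sqrt(acc * (1 - acc) / n)    # asymptotic standard error
z = norm.ppf(0.975)                  # two-sided 95% critical value
print(f"accuracy = {acc:.2f}, 95% CI = ({acc - z*se:.2f}, {acc + z*se:.2f})")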

Register here

 

 

2023-2024

Summer 2024 dissertation defense talks

 
Summarizing and Characterizing Heterogeneous Treatment Effects

Date: Thursday, July 25, 2024
Time: 1:00pm
Location: Harris Hall room L28

Speaker: Jingyang (Judy) Zhang, PhD candidate

Abstract: Whenever an intervention is proposed, people want to know if it works. Quantifying and summarizing treatment effects has been a focus in fields such as public policy, business, and medicine. Current practices emphasize finding “the” treatment effect, assumed to be shared by all individuals. For a continuous outcome, individual treatment effects are summarized using the standardized mean difference, an effect size comparing the mean outcomes of the treatment group to those of the control group.
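For reference, the standardized mean difference is conventionally defined as the difference in group mean outcomes scaled by the pooled standard deviation:

\[
d = \frac{\bar{Y}_T - \bar{Y}_C}{s_p},
\qquad
s_p = \sqrt{\frac{(n_T - 1)\, s_T^2 + (n_C - 1)\, s_C^2}{n_T + n_C - 2}}
\]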

In theory, the standardized mean difference can sufficiently summarize individual treatment effects if the variations among those effects are small and completely due to sampling error. However, individuals often respond to an intervention differently because of differences in their characteristics. The standardized mean difference alone does not reflect this heterogeneity in individual effects. Furthermore, some ad hoc alternative effect sizes, such as the robust standardized mean difference, fail to fully account for the heterogeneity in treatment effects, leaving the problem of better summarizing heterogeneous effects unsolved.

My dissertation focuses on developing a novel approach that better summarizes treatment effects by providing effect size parameters that characterize the probability distribution associated with the treatment effect. Additionally, I explore the relationship between treatment effects and baseline outcomes to characterize interventions as either inequality-increasing or decreasing.

I demonstrate how these novel effect size parameters can be estimated and derive the sampling properties of the corresponding estimators. The proposed methods are applied to empirical data, including studies from the What Works Clearinghouse, the National Study of Learning Mindsets, and a meta-analysis on the effect of school-based programs on cyberbullying perpetration and victimization. These examples reveal results that current practices would likely overlook, highlighting the additional insights provided by the proposed methods. The goal of my dissertation is to enhance research on variations in treatment effects, ultimately enabling more effective implementation of interventions.  

Fall 2023 dissertation defense talks

Optimization, Sampling and Their Interplay: Theory and Applications to Statistics and Machine Learning

Date: Friday, November 10, 2023
Time: 4:15pm-5:15pm
Location: University Hall 102

Speaker: Tim Tsz-Kit Lau, PhD candidate

Abstract: Optimization and sampling are two main pillars of modern high-dimensional statistics and machine learning, and more broadly, data science. Optimization theory and algorithms have been heavily involved in the development of numerical solvers for high-dimensional statistical estimation problems under the frequentist paradigm as well as the success of deep learning, whereas efficient sampling procedures have been the major workhorse of Bayesian inference and uncertainty quantification. Leveraging the recently revived and intriguing connection between optimization and sampling, I study the theoretical underpinnings of their interplay and develop novel algorithms for applications to statistics and machine learning. In particular, I address two intrinsic issues arising in both high-dimensional statistical estimation and sampling problems, "nonsmoothness" and "nonconvexity", which are exacerbated by the notoriously inevitable "curse of dimensionality" brought by massive datasets and gigantic models, employing tools from convex optimization and diffusion processes. Finally, I also explore the use of deep learning to develop novel estimation procedures for various high-dimensional regularized M-estimation problems.
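One classical instance of this optimization-sampling connection (a textbook illustration, not the dissertation's algorithms) is that gradient descent on a potential f converges to the minimizer of f, while the same update with injected Gaussian noise, the unadjusted Langevin algorithm, approximately samples from the density proportional to exp(-f):

import numpy as np

rng = np.random.default_rng(0)
grad_f = lambda x: x   # f(x) = x**2 / 2, so exp(-f) is the standard normal density
step = 0.1

x_opt, x_smp, samples = 5.0, 5.0, []
for _ in range(5000):
    x_opt -= step * grad_f(x_opt)   # gradient descent: drives x toward the minimizer 0
    # Langevin step: same gradient update plus Gaussian noise of scale sqrt(2 * step)
    x_smp += -step * grad_f(x_smp) + np.sqrt(2 * step) * rng.standard_normal()
    samples.append(x_smp)

print("optimizer limit:", round(x_opt, 4))       # approximately 0, the minimizer of f
print("sample mean, var:", np.mean(samples[1000:]), np.var(samples[1000:]))  # roughly 0 and 1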