## Spring 2024 Seminar Series

**Department of Statistics and Data Science 2023-2024 Seminar Series - Spring 2024**

**The 2023-2024 Seminar Series will primarily be in person**, but some talks will be offered virtually via Zoom. Virtual talks will be clearly designated, and registration is required to receive the Zoom link for those events. Please email Kisa Kowal at k-kowal@northwestern.edu if you have questions.

Seminar Series talks are free and open to faculty, graduate students, and advanced undergraduate students.

### Large Language Models to understand biomedical text

**Friday, April 5, 2024**

**Time:** 11:00 a.m. to 12:00 p.m. central time

**Location:** Online - talk will be presented on Zoom, registration is required to receive the link

**Speaker:** Yuan Luo, Director, Institute for Artificial Intelligence in Medicine - Center for Collaborative AI in Healthcare; Associate Professor of Preventive Medicine (Health and Biomedical Informatics), McCormick School of Engineering and Pediatrics

**Abstract:** Large language models such as transformer-based models have been wildly successful in setting state-of-the-art benchmarks on a broad range of natural language processing (NLP) tasks, including question answering (QA), document classification, machine translation, text summarization, and others. Recently, the release of OpenAI’s free tool ChatGPT demonstrated the ability of large language models to generate content, prompting anticipation about its possible uses as well as potential controversies. The ethical and acceptable boundaries of ChatGPT’s use in scientific writing remain unclear. I will talk about our research exploring large language models, e.g., long-sequence transformers and GPT-style models, in the clinical and biomedical domains. Our work examines the adaptability of these large language models to a series of clinical NLP tasks, including clinical inferencing, biomedical named entity recognition, EHR-based question answering, and interoperability.

### Surprises in binary linear classification

**Friday, April 12, 2024**

**Time:** 11:00 a.m. to 12:00 p.m. central time

**Location:** Ruan Conference Room – lower level (Chambers Hall 600 Foster Street)

**Speaker:** Andrea Montanari, John D. and Sigrid Banks Professor in Statistics and Mathematics, Stanford University

**Abstract:** Machine learning calls into question our understanding of statistical methodology, both because of the new classes of statistical models used and because of the new regimes and use cases. I will focus on the latter aspect by considering the (supposedly) well-understood case of binary linear classification. I will discuss a number of phenomena that are not captured by classical statistical theory: interpolation, universality, data subsampling, and tractability. High-dimensional asymptotics will be used to shed light on these behaviors.
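As a concrete instance of the interpolation phenomenon mentioned in the abstract (an illustrative numpy sketch, not material from the talk): when there are fewer samples than dimensions, a linear classifier can fit even pure-noise labels perfectly.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 200                        # fewer samples than dimensions
X = rng.normal(size=(n, d))
y = rng.choice([-1.0, 1.0], size=n)   # labels are pure noise

# Minimum-norm solution of the underdetermined system X w = y,
# which exists almost surely because rank(X) = n < d
w = np.linalg.lstsq(X, y, rcond=None)[0]
train_acc = (np.sign(X @ w) == y).mean()  # perfect fit of the noise labels
```

Classical theory, built for the n >> d regime, offers little guidance on how such interpolating classifiers behave on new data.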

This talk will be given **in person** on Northwestern's Evanston campus at the location listed above.

https://planitpurple.northwestern.edu/event/615126

### t-SNE and Local 1D Structures

**Friday, April 26, 2024**

**Time:** 11:00 a.m. to 12:00 p.m. central time

**Location:** Ruan Conference Room – lower level (Chambers Hall 600 Foster Street)

**Speaker:** Anna Ma, Assistant Professor, Department of Mathematics, UC Irvine

**Abstract:** Data visualization is a vital task in data exploration, especially in the presence of large-scale data sets. Rudimentary approaches for data visualization, such as scatter plots, histograms, and pie charts, can only represent a small number (typically, 1-2) of features at a time. Furthermore, such methods often lack the sophistication to capture higher dimensional structures in their representations. Fortunately, new approaches to high-dimensional data visualization, such as the t-distributed stochastic neighbor embedding (t-SNE) algorithm, have been proposed in recent years. One of t-SNE’s more interesting properties is its tendency to preserve local linear data structures while successfully representing clusterable data. Despite its wide success, there is limited mathematical understanding of the algorithm. In this talk, we will discuss the t-SNE algorithm and present theoretical guarantees for t-SNE’s output to answer the question: does t-SNE preserve 1-dimensional curves?

*The work presented is joint with Kat Dover and Roman Vershynin.*
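For readers unfamiliar with the algorithm, its core loop is short. The sketch below is a simplified numpy toy, not the talk's material: it uses a fixed Gaussian bandwidth for the high-dimensional affinities, whereas real t-SNE tunes a per-point bandwidth to a target perplexity and adds refinements such as early exaggeration.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(30, 5))              # toy high-dimensional data
Y = rng.normal(scale=1e-2, size=(30, 2))  # 2-D embedding, small random init

def sq_dists(Z):
    s = (Z ** 2).sum(axis=1)
    return s[:, None] + s[None, :] - 2.0 * Z @ Z.T

def kl(P, Q):
    mask = P > 0
    return float((P[mask] * np.log(P[mask] / Q[mask])).sum())

# High-dimensional affinities: Gaussian kernel (fixed bandwidth here)
P = np.exp(-sq_dists(X))
np.fill_diagonal(P, 0.0)
P /= P.sum()

def q_of(Y):
    num = 1.0 / (1.0 + sq_dists(Y))  # Student-t (1 df) kernel: the "t"
    np.fill_diagonal(num, 0.0)
    return num / num.sum(), num

Q, num = q_of(Y)
kl_start = kl(P, Q)

for _ in range(300):
    PQ = (P - Q) * num
    # Gradient of KL(P || Q) with respect to the embedding Y, matrix form
    grad = 4.0 * (np.diag(PQ.sum(axis=1)) - PQ) @ Y
    Y -= 0.5 * grad
    Q, num = q_of(Y)

kl_end = kl(P, Q)  # gradient descent should have reduced the divergence
```

The talk's question, in these terms, is whether minimizing this divergence can distort data that lie along a 1-dimensional curve.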

This talk will be given **in person** on Northwestern's Evanston campus at the location listed above.

https://planitpurple.northwestern.edu/event/612111

### Audience Choice: Bayesian Workflow / Causal Generalization / Modeling of Sampling Weights

**Friday, May 3, 2024**

**Time:** 11:00 a.m. to 12:00 p.m. central time

**Location:** Ruan Conference Room – lower level (Chambers Hall 600 Foster Street)

**Speaker:** Andrew Gelman, Professor of Statistics and Political Science, Columbia University

**The audience is invited to choose among three possible talks:**

**Bayesian Workflow:** The workflow of applied Bayesian statistics includes not just inference but also building, checking, and understanding fitted models. We discuss various live issues including prior distributions, data models, and computation, in the context of ideas such as the Fail Fast Principle and the Folk Theorem of Statistical Computing. We also consider some examples of Bayesian models that give bad answers and see if we can develop a workflow that catches such problems. For background, see here: http://www.stat.columbia.edu/~gelman/research/unpublished/Bayesian_Workflow_article.pdf

**Causal Generalization:** In causal inference, we generalize from sample to population, from treatment to control group, and from observed measurements to underlying constructs of interest. The challenge is that models for varying effects can be difficult to estimate from available data. We discuss limitations of existing approaches to causal generalization and how it might be possible to do better using Bayesian multilevel models. For background, see here: http://www.stat.columbia.edu/~gelman/research/published/KennedyGelman_manuscript.pdf and here: http://www.stat.columbia.edu/~gelman/research/published/causalreview4.pdf and here: http://www.stat.columbia.edu/~gelman/research/unpublished/causal_quartets.pdf

**Modeling of Sampling Weights:** A well-known rule in practical survey research is to include weights when estimating a population average but not to use weights when fitting a regression model—as long as the regression includes as predictors all the information that went into the sampling weights. But what if you don’t know where the weights came from? We propose a quasi-Bayesian approach using a joint regression of the outcome and the sampling weight, followed by poststratification on the two variables, thus using design information within a model-based context to obtain inferences for small-area estimates, regressions, and other population quantities of interest. For background, see here: http://www.stat.columbia.edu/~gelman/research/unpublished/weight_regression.pdf
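The design-based rule this abstract starts from can be seen in a tiny example (illustrative only, not the proposed method): when a subgroup is oversampled, the unweighted mean is biased toward it, while the classical weighted estimator sum(w*y)/sum(w) recovers the population average.

```python
import numpy as np

# Population: 50% group A (outcome mean 0), 50% group B (outcome mean 2).
# Sample: 80 units from A, 20 from B, so group B is undersampled.
y = np.concatenate([np.zeros(80), np.full(20, 2.0)])

# Weight = (population share) / (sample share) for each unit's group
w = np.concatenate([np.full(80, 0.5 / 0.8), np.full(20, 0.5 / 0.2)])

unweighted = y.mean()               # 0.4, biased toward group A
weighted = (w * y).sum() / w.sum()  # 1.0, the true population mean
```

The harder question the talk addresses is what to do when the weights arrive without this kind of transparent design information.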

Topic will be chosen live by the audience attending the talk.

This talk will be given **in person** on Northwestern's Evanston campus at the location listed above.

https://planitpurple.northwestern.edu/event/615862

### T-Stochastic Graphs

**Friday, May 10, 2024**

**Time:** 11:00 a.m. to 12:00 p.m. central time

**Location:** Ruan Conference Room – lower level (Chambers Hall 600 Foster Street)

**Speaker:** Karl Rohe, Professor of Statistics, University of Wisconsin–Madison

**Abstract:** Previous statistical approaches to hierarchical clustering for social network analysis all construct an "ultrametric" hierarchy. While the assumption of ultrametricity has been discussed and studied in the phylogenetics literature, it has not yet been acknowledged in the social network literature. We show that "non-ultrametric structure" in the network introduces significant instabilities in the existing top-down recovery algorithms. To address this issue, we introduce an instability diagnostic plot and use it to examine a collection of empirical networks. These networks appear to violate the "ultrametric" assumption. We propose a deceptively simple class of probabilistic models called T-Stochastic Graphs which impose no topological restrictions on the latent hierarchy. Perhaps surprisingly, this model generalizes the previous models. To illustrate this model, we propose six alternative forms of hierarchical network models and then show that all six are equivalent to the T-Stochastic Graph model. These alternative models motivate a novel approach to hierarchical clustering that combines spectral techniques with the well-known Neighbor-Joining algorithm from phylogenetic reconstruction. We prove this spectral approach is statistically consistent.
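For concreteness, the ultrametric property at issue is the strong triangle inequality d(i,k) ≤ max(d(i,j), d(j,k)), which balanced tree distances satisfy but ordinary metrics generally do not. A small illustrative check (not material from the talk):

```python
import numpy as np

def is_ultrametric(D, tol=1e-9):
    """Check the strong triangle inequality d(i,k) <= max(d(i,j), d(j,k))."""
    n = len(D)
    for i in range(n):
        for j in range(n):
            for k in range(n):
                if D[i, k] > max(D[i, j], D[j, k]) + tol:
                    return False
    return True

# Leaf-to-leaf distances in a balanced 2-level hierarchy with leaf
# groups {0, 1} and {2, 3}: this tree metric is ultrametric
U = np.array([[0, 1, 2, 2],
              [1, 0, 2, 2],
              [2, 2, 0, 1],
              [2, 2, 1, 0]], dtype=float)

# Path metric on the line graph 0-1-2-3: a valid metric, but
# d(0,3) = 3 > max(d(0,1), d(1,3)) = 2, so it is not ultrametric
L = np.abs(np.subtract.outer(np.arange(4), np.arange(4))).astype(float)
```

T-Stochastic Graphs drop this restriction on the latent hierarchy.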

This talk will be given **in person** on Northwestern's Evanston campus at the location listed above.

https://planitpurple.northwestern.edu/event/612112

### An Automatic Finite-Sample Robustness Check: Can Dropping a Little Data Change Conclusions?

**Friday, May 17, 2024**

**Time:** 11:00 a.m. to 12:00 p.m. central time

**Location:** Ruan Conference Room – lower level (Chambers Hall 600 Foster Street)

**Speaker:** Tamara Broderick, Associate Professor, Department of Electrical Engineering and Computer Science, MIT

**Abstract:** Practitioners will often analyze a data sample with the goal of applying any conclusions to a new population. For instance, if economists conclude microcredit is effective at alleviating poverty based on observed data, policymakers might decide to distribute microcredit in other locations or future years. Typically, the original data is not a perfect random sample from the population where policy is applied, but researchers might feel comfortable generalizing anyway so long as deviations from random sampling are small, and the corresponding impact on conclusions is small as well. Conversely, researchers might worry if a very small proportion of the data sample was instrumental to the original conclusion. So we propose a method to assess the sensitivity of statistical conclusions to the removal of a very small fraction of the data set. Manually checking all small data subsets is computationally infeasible, so we propose an approximation based on the classical influence function. Our method is automatically computable for common estimators. We provide finite-sample error bounds on approximation performance and a low-cost exact lower bound on sensitivity. We find that sensitivity is driven by a signal-to-noise ratio in the inference problem, does not disappear asymptotically, and is not decided by misspecification. Empirically we find that many data analyses are robust, but the conclusions of several influential economics papers can be changed by removing (much) less than 1% of the data.
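The influence-function idea at the core of the method can be sketched for ordinary least squares (an illustrative numpy toy, not the authors' released implementation): approximate the effect of dropping each point linearly, then drop the 1% of points predicted to move a coefficient the most.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
X = np.column_stack([np.ones(n), rng.normal(size=n)])  # intercept + slope
y = X @ np.array([1.0, 0.5]) + rng.normal(size=n)

XtX_inv = np.linalg.inv(X.T @ X)
beta = XtX_inv @ X.T @ y
resid = y - X @ beta

# First-order (influence-function) change in beta from dropping point i:
# beta_{-i} - beta ~= -(X'X)^{-1} x_i r_i
infl = -(XtX_inv @ (X * resid[:, None]).T).T          # shape (n, 2)

# Drop the 1% of points whose removal most increases the slope,
# according to the linear approximation
k = int(np.ceil(0.01 * n))
drop = np.argsort(infl[:, 1])[::-1][:k]
approx_change = infl[drop, 1].sum()

# Compare with the exact change from refitting on the remaining data
keep = np.setdiff1d(np.arange(n), drop)
beta_refit = np.linalg.lstsq(X[keep], y[keep], rcond=None)[0]
exact_change = beta_refit[1] - beta[1]
```

Ranking and summing per-point influences avoids refitting over every small subset, which is the computational point the abstract makes.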

This talk will be given **in person** on Northwestern's Evanston campus at the location listed above.

https://planitpurple.northwestern.edu/event/612113