Spring 2026 Seminar Series
Department of Statistics and Data Science
The 2025-2026 Seminar Series will primarily be held in person, but some talks will be offered virtually via Zoom. Virtual talks will be clearly designated, and registration is required to receive the Zoom link for those events. Please email Kisa Kowal at k-kowal@northwestern.edu with any questions.
Seminar Series talks are free and open to faculty, graduate students, and advanced undergraduate students.
Feature learning in kernel machines and applications to monitoring and steering LLMs
Friday, April 17, 2026
Time: 11:00 a.m. to 12:00 p.m. central time
Location: Ruan Conference Room – lower level (Chambers Hall 600 Foster Street)
Speaker: Mikhail Belkin, HDSI Endowed Chair Professor in AI, Halicioglu Data Science Institute, University of California San Diego
Abstract: Classical kernel machines are a powerful and theoretically grounded method for data analysis. However, they are not adaptive to low-dimensional "features" in the data.
In this talk I will discuss feature learning and introduce Recursive Feature Machines, a powerful method designed for extracting relevant features from tabular data. I will discuss some of its interesting properties and, in particular, will show how this technique enables us to detect and precisely guide LLM behaviors toward almost any desired concept by manipulating a single fixed vector in the LLM activation space.
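The idea of steering a model by manipulating a single fixed vector in activation space can be illustrated with a minimal sketch. This is my own toy illustration of the generic "activation steering" idea, not the speaker's actual method; the function name, the concept vector, and the scaling parameter `alpha` are all hypothetical.

```python
import numpy as np

def steer(activations: np.ndarray, concept: np.ndarray, alpha: float) -> np.ndarray:
    """Shift every activation row along one fixed 'concept' direction.

    activations : (tokens, hidden_dim) array of hidden states
    concept     : (hidden_dim,) fixed vector encoding the target concept
    alpha       : signed strength; positive pushes toward the concept
    """
    direction = concept / np.linalg.norm(concept)  # unit concept vector
    return activations + alpha * direction

# Toy usage: 3 token activations in a 4-dimensional hidden space.
acts = np.zeros((3, 4))
v = np.array([0.0, 2.0, 0.0, 0.0])  # hypothetical concept direction
steered = steer(acts, v, alpha=0.5)
```

In a real LLM the same additive shift would be applied to the hidden states of a chosen layer during the forward pass; the sketch only shows the vector arithmetic.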
This talk will be given in person on Northwestern's Evanston campus.
planitpurple.northwestern.edu/event/639970
Representation learning with iteratively reweighted kernel machines
Friday, May 1, 2026
Time: 11:00 a.m. to 12:00 p.m. central time
Location: Ruan Conference Room – lower level (Chambers Hall 600 Foster Street)
Speaker: Dmitriy Drusvyatskiy, Professor and HDSI Faculty Fellow, Halıcıoğlu Data Science Institute (HDSI), University of California San Diego
Abstract: The impressive practical performance of neural networks is often attributed to their ability to learn low-dimensional data representations and hierarchical structure directly from data. In this work, we argue that these two phenomena are not unique to neural networks and, surprisingly, can be elicited from classical kernel methods. Namely, we show that the derivative of the kernel predictor can detect the influential coordinates with low sample complexity. Moreover, by iteratively using the derivatives to reweight the data and retrain kernel machines, one is able to efficiently learn hierarchical polynomials in a high-dimensional regime. I will illustrate the developed theory with numerical experiments on both synthetic and real data sets.
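One round of the "derivative detects influential coordinates, then reweight" loop described in the abstract can be sketched as follows. This is a toy reconstruction under my own assumptions (Gaussian kernel, ridge regularization, diagonal gradient-energy weights), not the speaker's implementation; all names and parameters are hypothetical.

```python
import numpy as np

def gauss_kernel(A, B, bw=1.0):
    """Gaussian kernel matrix between row sets A and B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * bw ** 2))

def influence_weights(X, y, reg=1e-3, bw=1.0):
    """Fit kernel ridge regression, then return per-coordinate
    gradient energy of the predictor (diagonal of the average
    gradient outer product)."""
    K = gauss_kernel(X, X, bw)
    alpha = np.linalg.solve(K + reg * np.eye(len(X)), y)
    diff = X[None, :, :] - X[:, None, :]          # diff[i, j] = x_j - x_i
    # Gradient of f(x) = sum_j alpha_j K(x, x_j) evaluated at each x_i.
    grads = np.einsum('j,ij,ijd->id', alpha, K, diff) / bw ** 2
    return (grads ** 2).mean(0)

# Toy data: the target depends on coordinate 0 only.
rng = np.random.default_rng(0)
X = rng.normal(size=(80, 5))
y = np.sin(X[:, 0])
w = influence_weights(X, y)
X_rw = X * np.sqrt(w / w.max())  # reweight coordinates, then refit and repeat
```

Iterating the last two steps (reweight, refit) shrinks irrelevant coordinates toward zero so that subsequent kernel fits concentrate on the influential ones.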
This talk will be given in person on Northwestern's Evanston campus.
planitpurple.northwestern.edu/event/640122
Words Matter: Multimodal Suicide Risk Prediction from Veterans Health Administration Clinical Notes
Friday, May 8, 2026
Time: 11:00 a.m. to 12:00 p.m. central time
Location: Ruan Conference Room – lower level (Chambers Hall 600 Foster Street)
Speaker: Jiang Gui, Associate Professor, Biomedical Data Science, Dartmouth College
Abstract: In this talk, we demonstrate that integrating unstructured clinical narratives with structured electronic health record (EHR) data enhances suicide risk prediction for U.S. Veterans, outperforming models that rely on structured data alone. By analyzing a retrospective matched case-control cohort of 4,584 Veterans who died by suicide and 22,657 controls, we compared traditional count-based text features against pretrained contextual large language model (LLM) embeddings, such as Clinical Longformer and BioClinicalBERT. We found that while Adaptive Mixture Categorization (AMC) improves the utility of skewed linguistic data, contextual LLM embeddings consistently provide comparable or superior predictive power, particularly within low- and moderate-risk tiers where structured indicators may be less obvious. Our multimodal approach, which integrated 66 structured patient characteristics with text features, yielded substantial performance gains, increasing AUROC by approximately 0.07–0.11 across various risk tiers and time windows. Furthermore, our temporal analysis revealed that while long-term data (270 days) is most informative for low-risk patients, short-term windows (<30 days) are critical for high-risk individuals. Using SHAP-based interpretability and topic modeling, we identified clinically coherent themes that shift semantically as risk increases, providing a context-aware framework for improving suicide prevention efforts within the Veterans Health Administration.
This talk will be given in person on Northwestern's Evanston campus.
planitpurple.northwestern.edu/event/639974
Data-Centric Learning: Aligning Data and Model Knowledge for Better AI
Friday, May 15, 2026
Time: 11:00 a.m. to 12:00 p.m. central time
Location: Ruan Conference Room – lower level (Chambers Hall 600 Foster Street)
Speaker: Yanjie Fu, Associate Professor, School of Computing and Augmented Intelligence, Arizona State University
Abstract: Recent progress in AI has been driven largely by scaling models and compute. Yet in many real-world and scientific settings, AI failures are still rooted less in model architecture than in the data itself: missing or incomplete observations, noisy labels, distribution shift, imbalance, poor feature geometry, and weak coverage of the underlying domain. This talk argues for a shift from a model-centric view of AI to a data-centric learning perspective, where the central goal is not only to train better models, but to reshape better data for learning. I will present a unifying view of data-centric learning through the lens of data-model knowledge alignment: data serves as the knowledge base, models learn knowledge from data, and poor alignment between the two leads to poor generalization, shortcut learning, instability, and low trust. I will introduce key directions in this space, including data curation, relabeling, synthetic data generation, feature selection, feature transformation, and data reprogramming. I will also highlight our recent work on AI4Data-RL, AI4Data-GenAI, and AI4Data-LLM&Agents. Overall, the talk will discuss how data-centric learning opens a new path toward more robust and trustworthy AI systems.
This talk will be given in person on Northwestern's Evanston campus.