Spring 2025 Seminar Series : Department of Statistics and Data Science

Spring 2025 Seminar Series

Spring 2025 At A Glance

Friday, April 11 @ 11:00am
Foundation Models and Generative AI for Medical Imaging Segmentation in Ultra-Low Data Regimes
Pengtao Xie

Canceled: Friday, April 18 @ 11:00am
(Re)visiting Foundation Models for Science and Beyond
Wei Wang

Friday, April 25 @ 11:00am
Neural Collapse in AI Training
X.Y. Han

Friday, May 2 @ 11:00am
A Bayesian semi-parametric model for functional near-infrared spectroscopy data
Timothy D. Johnson

Friday, May 9 @ 11:00am
Learning Multi-Index Models
Ilias Diakonikolas

Friday, May 23 @ 11:00am
Data Augmentation for Graph Regression
Meng Jiang

Friday, May 30 @ 11:00am ZOOM
LLM-Enhanced, Theme-Focused Science Discovery: A Retrieval and Structuring Approach
Jiawei Han

Department of Statistics and Data Science 2024-2025 Seminar Series - Spring 2025

The 2024-2025 Seminar Series will primarily be in person, but some talks will be offered virtually using Zoom. Talks that are virtual will be clearly designated and registration for the Zoom talks will be required to receive the zoom link for the event. Please email Kisa Kowal at k-kowal@northwestern.edu if you have questions.

Seminar Series talks are free and open to faculty, graduate students, and advanced undergraduate students

Foundation Models and Generative AI for Medical Imaging Segmentation in Ultra-Low Data Regimes

Friday, April 11, 2025

Time: 11:00 a.m. to 12:00 p.m. central time

Location: Ruan Conference Room – lower level (Chambers Hall 600 Foster Street)

Speaker: Pengtao Xie, Assistant Professor, Electrical and Computer Engineering, University of California, San Diego

Abstract: Semantic segmentation of medical images is pivotal in disease diagnosis and treatment planning. While deep learning has excelled in automating this task, a major hurdle is the need for numerous annotated masks, which are resource-intensive to produce due to the required expertise and time. This scenario often leads to ultra-low data regimes where annotated images are scarce, challenging the generalization of deep learning models on test images. To address this, we introduce two complementary approaches. One involves developing foundation models. The other involves generating high-fidelity training data consisting of paired segmentation masks and medical images. In the former, we develop a bi-level optimization based method which can effectively adapt the general-domain Segment Anything Model (SAM) to the medical domain with just a few medical images. In the latter, we propose a multi-level optimization based method which can perform end-to-end generation of high-quality training data from a minimal number of real images. On eight segmentation tasks involving various diseases, organs, and imaging modalities, our methods demonstrate strong generalization performance in both in-domain and out-of-domain settings. Our methods require 8-12 times less training data than baselines to achieve comparable performance.

This talk will be given in person on Northwestern's Evanston campus at the location listed above.

https://planitpurple.northwestern.edu/event/627979

CANCELED - (Re)visiting Foundation Models for Science and Beyond

Friday, April 18, 2025

Speaker: Wei Wang, Leonard Kleinrock Professor in Computer Science and Computational Medicine, University of California, Los Angeles

This talk has been canceled. We hope to reschedule it for Fall 2025

Neural Collapse in AI Training

Friday, April 25, 2025

Time: 11:00 a.m. to 12:00 p.m. central time

Location: Ruan Conference Room – lower level (Chambers Hall 600 Foster Street)

Speaker: X.Y. Han, Assistant Professor of Operations Management, Booth School of Business, University of Chicago

Abstract: In the performance-dominated landscape of AI development, systematic understanding is difficult: Any architecture is fair game as long as it can climb the leaderboard. Yet, even among the overwhelming multiformity of neural networks, one motif remains: data is transformed layer-by-layer and iteration-by-iteration into high-dimensional representations with which predictions are eventually made. Thus, examining the geometry of these representations offers a powerful lens for the analysis of AI. Neural Collapse is a striking example, evident late in the training of predictive AI: Originally observed in classification networks during a 'Terminal Phase of Training' (TPT) – where training error vanishes but loss continues decreasing – Neural Collapse reveals a fundamental inductive bias. It encompasses four interconnected geometric regularities in the final layer representations: (NC1) Within-class variability of activations collapses towards zero as they converge to their class means; (NC2) These class means arrange themselves into a maximally separated, symmetric structure (a Simplex Equiangular Tight Frame); (NC3) The classifier vectors align with these class means in a self-dual configuration; (NC4) The model's decision mechanism effectively simplifies to Nearest Class-Center rule. This emergent geometric simplicity is linked to improved generalization, robustness, and interpretability. I will discuss the core Neural Collapse phenomenon, its theoretical underpinnings, and recent updates and extensions since its discovery.

This talk will be given in person on Northwestern's Evanston campus at the location listed above.

https://planitpurple.northwestern.edu/event/628239

A Bayesian semi-parametric model for functional near-infrared spectroscopy data

Friday, May 2, 2025

Time: 11:00 a.m. to 12:00 p.m. central time

Location: Ruan Conference Room – lower level (Chambers Hall 600 Foster Street)

Speaker: Timothy D. Johnson, Professor, Department of Biostatistics, School of Public Health, University of Michigan

Abstract: Functional near-infrared spectroscopy (fNIRS) is a relatively new neuroimaging technique. It is a low cost, portable, and non-invasive method to monitor brain activity. Similar to fMRI, it measures changes in the level of blood oxygen in the brain. Its time resolution is much finer than fMRI, however its spatial resolution is much courser—similar to EEG or MEG. fNIRS is finding widespread use on young children that have trouble staying still in the MRI magnet and it can be used in situations where fMRI is contraindicated—such as chochlear implant patients. In this talk, I propose a fully Bayesian semi-parametric model to analyze fNIRS data. The hemodynamic response function is modeled with the canonical HRF. The model error and the autoregressive process vary with time and are modeled in the dynamic linear model framework. The low frequency drift is modeled non-parameterically with a variable B-spline model (both locations and number of knots are allowed to vary). Although motion is not as big an issue as in fMRI, it can still cause huge inferential bias and poor statistical properties if not handled appropriately. The variable B-spline model not only models the low frequency drift, but will regress out motion artifacts as well. Most methods require motion to be removed prior to statistical analysis except one, which I refer to as the ARIRLS model. Via simulation studies, I show that this Bayesian model easily handles motion artifacts and results in better statistical properties than the AR-IRLS model. I then show its performance on real data.

This talk will be given in person on Northwestern's Evanston campus at the location listed above.

https://planitpurple.northwestern.edu/event/627980

Learning Multi-Index Models

Friday, May 9, 2025

Time: 11:00 a.m. to 12:00 p.m. central time

Location: Ruan Conference Room – lower level (Chambers Hall 600 Foster Street)

Speaker: Ilias Diakonikolas, Sheldon B. Lubar Professor, Department of Computer Sciences, University of Wisconsin, Madison

Abstract: Multi-index models (MIMs) are functions that depend on the projection of the input onto a low-dimensional subspace. These models offer a powerful framework for studying various machine learning tasks, including multiclass linear classification, learning intersections of halfspaces, and more complex neural networks. Despite extensive investigation, there remains a vast gap in our understanding of the efficient learnability of MIMs.

In this talk, we will survey recent algorithmic developments on learning MIMs, focusing on methods with provable performance guarantees. In particular, we will present a robust learning algorithm for a broad class of well-behaved MIMs under the Gaussian distribution. A key feature of our algorithm is that its running time has fixed-degree polynomial dependence on the input dimension. We will also demonstrate how this framework leads to more efficient and noise-tolerant learners for multiclass linear classifiers and intersections of halfspaces.

Time permitting, we will highlight some of the many open problems in this area.

The main part of the talk is based on joint work with I. Iakovidis, D. Kane, and N. Zarifis.

This talk will be given in person on Northwestern's Evanston campus at the location listed above.

https://planitpurple.northwestern.edu/event/628563

Data Augmentation for Graph Regression

Friday, May 23, 2025

Time: 11:00 a.m. to 12:00 p.m. central time

Location: Ruan Conference Room – lower level (Chambers Hall 600 Foster Street)

Speaker: Meng Jiang, Associate Professor, Department of Computer Science and Engineering, University of Notre Dame

Abstract: Graph regression plays a key role in materials discovery by enabling the prediction of numerical properties of molecules and polymers. However, graph regression models often rely on training sets with only a few hundred labeled examples, and these labels are typically imbalanced. While a large number of unlabeled examples are available, they are often drawn from diverse domains, making them less effective for improving target label predictions. In machine learning, data augmentation refers to techniques that increase the size of the training set by generating slightly modified or synthetic versions of existing data. These methods are simple yet effective. In this talk, I will introduce three graph data augmentation techniques tailored for supervised learning, imbalanced learning, and transfer learning in graph regression tasks. The first technique leverages the mutual enhancement between model rationalization and data augmentation, improving both accuracy and interpretability in molecular and polymer property prediction. This approach demonstrates that graph data augmentation can be effectively performed in latent spaces. The second technique generates representations of additional data points with underrepresented labels to balance the training set. The third technique introduces a graph diffusion transformer (Graph DiT) that facilitates data-centric transfer learning, addressing the limitations of self-supervised methods when dealing with unlabeled graph data. Graph DiT integrates multiple properties such as synthetic score and gas permeability as condition constraints into diffusion models for multi-conditional polymer generation. Lastly we will discuss foundation model approaches for materials discovery.

This talk will be given in person on Northwestern's Evanston campus at the location listed above.

https://planitpurple.northwestern.edu/event/627997

LLM-Enhanced, Theme-Focused Science Discovery: A Retrieval and Structuring Approach

Friday, May 30, 2025

Time: 11:00 a.m. to 12:00 p.m. central time

Location: Virtual talk, registration required (link below)

Speaker: Jiawei Han, Michael Aiken Chair Professor, Siebel School of Computing and Data Science, University of Illinois Urbana-Champaign

Abstract: Large Language Models (LLMs) may bring unprecedent power in scientific discovery. However, current LLMs may still encounter major challenges for effective scientific exploration due to their lack of in-depth, theme-focused data and knowledge. Retrieval augmented generation (RAG) has recently become an interesting approach for augmenting LLMs with grounded, theme-specific datasets. We discuss the challenges of RAG and propose a retrieval and structuring (RAS) approach, which enhances RAG by improving retrieval quality and mining structures (e.g., extracting entities and relations and building knowledge graphs) to ensure its effective integration of theme-specific data with LLM. We show the promise of this approach at augmenting LLMs and discuss its potential power for LLM-enabled science exploration.

Short bio: Jiawei Han is Michael Aiken Chair Professor in the Siebel School of Computing and Data Science, University of Illinois Urbana-Champaign. He received ACM SIGKDD Innovation Award (2004), IEEE Computer Society Technical Achievement Award (2005), IEEE Computer Society W. Wallace McDowell Award (2009), Japan's Funai Achievement Award (2018), and being elevated to Fellow of Royal Society of Canada (2022). He is Fellow of ACM and Fellow of IEEE and served as the Director of Information Network Academic Research Center (INARC) (2009-2016) supported by the Network Science-Collaborative Technology Alliance (NS-CTA) program of U.S. Army Research Lab and co-Director of KnowEnG, a Center of Excellence in Big Data Computing (2014-2019), funded by NIH Big Data to Knowledge (BD2K) Initiative. Currently, he is serving on the executive committees of two NSF funded research centers: MMLI (Molecular Maker Research Institute)—one of NSF funded national AI centers since 2020 and I-Guide—The National Science Foundation (NSF) Institute for Geospatial Understanding through an Integrative Discovery Environment (I-GUIDE) since 2021.

https://planitpurple.northwestern.edu/event/628563