Winter 2025 Seminar Series
Department of Statistics and Data Science 2024-2025 Seminar Series - Winter 2025
The 2024-2025 Seminar Series will primarily be in person, but some talks will be offered virtually using Zoom. Virtual talks will be clearly designated, and registration will be required to receive the Zoom link for the event. Please email Kisa Kowal at k-kowal@northwestern.edu if you have questions.
Seminar Series talks are free and open to faculty, graduate students, and advanced undergraduate students.
Leveraging multi-study, multi-outcome data to improve external validity and efficiency of clinical trials for managing schizophrenia
Friday, January 17, 2025
Time: 11:00 a.m. to 12:00 p.m. central time
Location: Ruan Conference Room – lower level (Chambers Hall, 600 Foster Street)
Speaker: Caleb H. Miles, Assistant Professor of Biostatistics, Columbia University Mailman School of Public Health
Abstract: As data sources have become more plentiful and readily accessible, the practice of data fusion has become increasingly ubiquitous. However, when the focus is on a causal effect on a particular outcome, a major limitation is that this outcome may not be available in all data sources. In fact, different randomized experiments or observational studies of a common exposure will often focus on potentially related, yet distinct outcomes. One such example is the Database of Cognitive Training and Remediation Studies (DoCTRS), which consists of several randomized trials of the effect of cognitive remediation therapy on various outcomes among patients with schizophrenia. We develop causally principled methodology for fusing data sets when multiple outcomes are observed across studies, which leverages outcomes of secondary interest as informative proxies for the missing outcome of primary interest, thereby maximizing power and efficiency by making full use of the available data. As this methodology relies on a key transportability assumption, we also develop methods to assess the degree of sensitivity to violations of this assumption. We apply this methodology to data from the DoCTRS trials to make improved causal inferences about the effectiveness of cognitive remediation therapy on cognition among patients with schizophrenia.
This talk will be given in person on Northwestern's Evanston campus at the location listed above.
https://planitpurple.northwestern.edu/event/624722
The Role of AI in Scientific Discovery: Opportunities and Limitations
Friday, January 24, 2025
Time: 11:00 a.m. to 12:00 p.m. central time
Location: Ruan Conference Room – lower level (Chambers Hall, 600 Foster Street)
Speaker: Xiangliang Zhang, Leonard C. Bettex Collegiate Professor of Computer Science, University of Notre Dame
Abstract: Artificial Intelligence (AI) is reshaping the landscape of scientific discovery, enabling breakthroughs across diverse fields. However, when these AI tools are applied to scientific problems, gaps and mismatches often arise. The inherent uncertainty in scientific phenomena, coupled with issues like data quality, biases, and interpretability, poses significant challenges. This talk will discuss the transformative potential of AI in scientific discovery, focusing on its applications in predictive modeling, generative tasks, optimization strategies, and literature analysis. Examples will include AI models ranging from traditional neural networks to large language models (LLMs). At the same time, their limitations will be critically examined, calling for collaboration between the AI and scientific communities to address these challenges and unlock AI’s full potential in advancing scientific discovery.
This talk will be given in person on Northwestern's Evanston campus at the location listed above.
https://planitpurple.northwestern.edu/event/624723
Tensor Time Series: Factor Modeling and Deep Neural Networks
Friday, January 31, 2025
Time: 11:00 a.m. to 12:00 p.m. central time
Location: Ruan Conference Room – lower level (Chambers Hall, 600 Foster Street)
Speaker: Yuefeng Han, Assistant Professor, Department of Applied and Computational Mathematics and Statistics, University of Notre Dame
Abstract: The analysis of tensors (multi-dimensional arrays) has become a vital area in modern statistics and data science, driven by advancements in scientific research and data collection. High-dimensional tensor data arise in diverse applications such as economics, genetics, microbiome studies, brain imaging, and hyperspectral imaging. These tensors are often high-dimensional and high-order, yet key information typically resides in reduced-dimensional subspaces governed by structural properties. This talk explores novel methodologies and theories for tensor time series analysis.
The presentation consists of two parts. The first part introduces a factor modeling framework for high-dimensional tensor time series, leveraging a structure similar to CP tensor decomposition. We propose a computationally efficient estimation procedure incorporating a warm-start initialization and an iterative simultaneous orthogonalization scheme. The algorithm achieves $\epsilon$-accuracy within $\log\log(1/\epsilon)$ iterations. Additionally, we establish inferential results, demonstrating consistency and asymptotic normality under relaxed assumptions. The second part integrates tensor factor models with deep neural networks. Specifically, a Tucker-type low-rank tensor structure is employed as a tensor-augmentation module in neural networks. Extensive experiments demonstrate the integration of this module into transformers and temporal neural networks for tensor time series prediction and tensor-on-tensor regression. The results highlight significant performance improvements, underscoring its potential for advancing time series forecasting.
This talk will be given in person on Northwestern's Evanston campus at the location listed above.
https://planitpurple.northwestern.edu/event/625082
AI for Nature: From Science to Impact
Friday, February 7, 2025
Time: 11:00 a.m. to 12:00 p.m. central time
Location: Ruan Conference Room – lower level (Chambers Hall, 600 Foster Street)
Speaker: Tanya Berger-Wolf, Professor, Computer Science and Engineering and Director, Translational Data Analytics Institute, The Ohio State University
Abstract: Computation has fundamentally changed the way we study nature. New data collection technologies, such as GPS, high-definition cameras, autonomous vehicles (under water, on the ground, and in the air), genotyping, acoustic sensors, and crowdsourcing, are generating data about life on the planet that are orders of magnitude richer than any previously collected. Yet, our ability to extract insight from these data lags substantially behind our ability to collect them.
The need for understanding is more urgent than ever and the challenges are great. We are in the middle of the 6th extinction, losing the planet's biodiversity at an unprecedented rate and scale. In many cases, we do not even have the basic numbers of what species we are losing, which impacts our ability to understand biodiversity loss drivers, predict the impact on ecosystems, and implement policy.
The talk will discuss how AI can turn these data into a high-resolution information source about living organisms, enabling scientific inquiry, conservation, and policy decisions. It will introduce a new field of science, imageomics, and present a vision and examples of AI as a trustworthy partner both in science and biodiversity conservation, discussing opportunities and challenges.
This talk will be given in person on Northwestern's Evanston campus at the location listed above.
https://planitpurple.northwestern.edu/event/625083
Towards Data-efficient Training of Large Language Models (LLMs)
Friday, February 14, 2025
Time: 11:00 a.m. to 12:00 p.m. central time
Location: Virtual talk, registration required (link below)
Speaker: Baharan Mirzasoleiman, Assistant Professor, Computer Science Department, UCLA
Abstract: High-quality data is crucial for training LLMs with superior performance. In this talk, I will present two theoretically rigorous approaches to find smaller subsets of examples that can improve the performance and efficiency of training LLMs. First, I will present a one-shot data selection method for supervised fine-tuning of LLMs. Then, I'll talk about an iterative data selection strategy to pretrain or fine-tune LLMs on imbalanced mixtures of language data. I'll conclude by showing empirical results confirming that the above data selection strategies can effectively improve the performance of various LLMs during fine-tuning and pretraining.
This talk will be given virtually via Zoom. Registration is required to receive the Zoom link for the talk.
TBA
Friday, February 21, 2025
Time: 11:00 a.m. to 12:00 p.m. central time
Location: Ruan Conference Room – lower level (Chambers Hall, 600 Foster Street)
Speaker:
Abstract:
This talk will be given in person on Northwestern's Evanston campus at the location listed above.