Winter 2025 Seminar Series: Department of Statistics and Data Science

Winter 2025 At a Glance

Friday, January 17 @ 11:00am
Leveraging multi-study, multi-outcome data to improve external validity and efficiency of clinical trials for managing schizophrenia
Caleb H. Miles

Friday, January 24 @ 11:00am
The Role of AI in Scientific Discovery: Opportunities and Limitations
Xiangliang Zhang

Friday, January 31 @ 11:00am
Tensor Time Series: Factor Modeling and Deep Neural Networks
Yuefeng Han

Friday, February 7 @ 11:00am
AI for Nature: From Science to Impact
Tanya Berger-Wolf

Friday, February 14 @ 11:00am - ZOOM
Towards Data-efficient Training of Large Language Models (LLMs)
Baharan Mirzasoleiman

Friday, February 21 @ 11:00am - ZOOM
Knowledge-Guided Machine Learning for Scientific Discovery: Challenges and Opportunities
Xiaowei Jia

Department of Statistics and Data Science 2024-2025 Seminar Series - Winter 2025

The 2024-2025 Seminar Series will be held primarily in person, but some talks will be offered virtually over Zoom. Virtual talks will be clearly designated, and registration is required to receive the Zoom link for the event. Please email Kisa Kowal at k-kowal@northwestern.edu if you have questions.

Seminar Series talks are free and open to faculty, graduate students, and advanced undergraduate students.

Leveraging multi-study, multi-outcome data to improve external validity and efficiency of clinical trials for managing schizophrenia

Friday, January 17, 2025

Time: 11:00 a.m. to 12:00 p.m. central time

Location: Ruan Conference Room – lower level (Chambers Hall, 600 Foster Street)

Speaker: Caleb H. Miles, Assistant Professor of Biostatistics, Columbia University Mailman School of Public Health

Abstract: As data sources have become more plentiful and readily accessible, the practice of data fusion has become increasingly ubiquitous. However, when the focus is on a causal effect on a particular outcome, a major limitation is that this outcome may not be available in all data sources. In fact, different randomized experiments or observational studies of a common exposure will often focus on potentially related, yet distinct outcomes. One such example is the Database of Cognitive Training and Remediation Studies (DoCTRS), which consists of several randomized trials of the effect of cognitive remediation therapy on various outcomes among patients with schizophrenia. We develop causally principled methodology for fusing data sets when multiple outcomes are observed across studies, which leverages outcomes of secondary interest as informative proxies for the missing outcome of primary interest, thereby maximizing power and efficiency by making full use of the available data. As this methodology relies on a key transportability assumption, we also develop methods to assess the degree of sensitivity to violations of this assumption. We apply this methodology to data from the DoCTRS trials to make improved causal inferences about the effectiveness of cognitive remediation therapy on cognition among patients with schizophrenia.

This talk will be given in person on Northwestern's Evanston campus at the location listed above.

https://planitpurple.northwestern.edu/event/624722

The Role of AI in Scientific Discovery: Opportunities and Limitations

Friday, January 24, 2025

Time: 11:00 a.m. to 12:00 p.m. central time

Location: Ruan Conference Room – lower level (Chambers Hall, 600 Foster Street)

Speaker: Xiangliang Zhang, Leonard C. Bettex Collegiate Professor of Computer Science, University of Notre Dame

Abstract: Artificial Intelligence (AI) is reshaping the landscape of scientific discovery, enabling breakthroughs across diverse fields. However, when these AI tools are applied to scientific problems, gaps and mismatches often arise. The inherent uncertainty in scientific phenomena, coupled with issues like data quality, biases, and interpretability, poses significant challenges. This talk will discuss the transformative potential of AI in scientific discovery, focusing on its applications in predictive modeling, generative tasks, optimization strategies, and literature analysis. Examples will include AI models ranging from traditional neural networks to large language models (LLMs). At the same time, their limitations will be critically examined, calling for collaboration between the AI and scientific communities to address these challenges and unlock AI’s full potential in advancing scientific discovery.

This talk will be given in person on Northwestern's Evanston campus at the location listed above.

https://planitpurple.northwestern.edu/event/624723

Tensor Time Series: Factor Modeling and Deep Neural Networks

Friday, January 31, 2025

Time: 11:00 a.m. to 12:00 p.m. central time

Location: Ruan Conference Room – lower level (Chambers Hall, 600 Foster Street)

Speaker: Yuefeng Han, Assistant Professor, Department of Applied and Computational Mathematics and Statistics, University of Notre Dame

Abstract: The analysis of tensors (multi-dimensional arrays) has become a vital area in modern statistics and data science, driven by advancements in scientific research and data collection. High-dimensional tensor data arise in diverse applications such as economics, genetics, microbiome studies, brain imaging, and hyperspectral imaging. These tensors are often high-dimensional and high-order, yet key information typically resides in reduced-dimensional subspaces governed by structural properties. This talk explores novel methodologies and theories for tensor time series analysis.

The presentation consists of two parts. The first part introduces a factor modeling framework for high-dimensional tensor time series, leveraging a structure similar to CP tensor decomposition. We propose a computationally efficient estimation procedure incorporating a warm-start initialization and an iterative simultaneous orthogonalization scheme. The algorithm achieves $\epsilon$-accuracy within $\log\log(1/\epsilon)$ iterations. Additionally, we establish inferential results, demonstrating consistency and asymptotic normality under relaxed assumptions. The second part integrates tensor factor models with deep neural networks. Specifically, a Tucker-type low-rank tensor structure is employed as a tensor-augmentation module in neural networks. Extensive experiments demonstrate the integration of this module into transformers and temporal neural networks for tensor time series prediction and tensor-on-tensor regression. The results highlight significant performance improvements, underscoring its potential for advancing time series forecasting.

This talk will be given in person on Northwestern's Evanston campus at the location listed above.

https://planitpurple.northwestern.edu/event/625082

AI for Nature: From Science to Impact

Friday, February 7, 2025

Time: 11:00 a.m. to 12:00 p.m. central time

Location: Ruan Conference Room – lower level (Chambers Hall, 600 Foster Street)

Speaker: Tanya Berger-Wolf, Professor, Computer Science and Engineering, and Director, Translational Data Analytics Institute, The Ohio State University

Abstract: Computation has fundamentally changed the way we study nature. New data collection technologies, such as GPS, high-definition cameras, autonomous vehicles under water, on the ground, and in the air, genotyping, acoustic sensors, and crowdsourcing, are generating data about life on the planet that are orders of magnitude richer than any previously collected. Yet, our ability to extract insight from these data lags substantially behind our ability to collect them.

The need for understanding is more urgent than ever and the challenges are great. We are in the middle of the sixth extinction, losing the planet's biodiversity at an unprecedented rate and scale. In many cases, we do not even have the basic numbers of what species we are losing, which impacts our ability to understand biodiversity loss drivers, predict the impact on ecosystems, and implement policy.

The talk will discuss how AI can turn these data into a high-resolution information source about living organisms, enabling scientific inquiry, conservation, and policy decisions. It will introduce a new field of science, imageomics, and present a vision and examples of AI as a trustworthy partner in both science and biodiversity conservation, discussing opportunities and challenges.

This talk will be given in person on Northwestern's Evanston campus at the location listed above.

https://planitpurple.northwestern.edu/event/625083

Towards Data-efficient Training of Large Language Models (LLMs)

Friday, February 14, 2025

Time: 11:00 a.m. to 12:00 p.m. central time

Location: Virtual talk, registration required (link below)

Speaker: Baharan Mirzasoleiman, Assistant Professor, Computer Science Department, UCLA

Abstract: High-quality data is crucial for training LLMs with superior performance. In this talk, I will present two theoretically rigorous approaches to find smaller subsets of examples that can improve the performance and efficiency of training LLMs. First, I will present a one-shot data selection method for supervised fine-tuning of LLMs. Then, I'll talk about an iterative data selection strategy to pretrain or fine-tune LLMs on imbalanced mixtures of language data. I'll conclude by showing empirical results confirming that the above data selection strategies can effectively improve the performance of various LLMs during fine-tuning and pretraining.

This talk will be given virtually by Zoom. Registration is required to receive the Zoom link for the talk.

Knowledge-Guided Machine Learning for Scientific Discovery: Challenges and Opportunities

Friday, February 21, 2025

Time: 11:00 a.m. to 12:00 p.m. central time

Location: Virtual talk, registration required (link coming soon)

Speaker: Xiaowei Jia, Assistant Professor, Department of Computer Science, University of Pittsburgh

Abstract: Data science and machine learning (ML) models, which have found tremendous success in several commercial applications where large-scale data is available, e.g., computer vision and natural language processing, have met with limited success in scientific domains. Traditionally, physics-based models of dynamical systems are often used to study engineering and environmental systems. Despite their extensive use, these models have several well-known limitations due to incomplete or inaccurate representations of the physical processes being modeled. Given rapid data growth due to advances in sensor technologies, there is a tremendous opportunity to systematically advance modeling in these domains by using machine learning methods. However, capturing this opportunity is contingent on a paradigm shift in data-intensive scientific discovery since the “black box” use of ML often leads to serious false discoveries in scientific applications. Because the hypothesis space of scientific applications is often complex and exponentially large, an uninformed data-driven search can easily select a highly complex model that is neither generalizable nor physically interpretable, resulting in the discovery of spurious relationships, predictors, and patterns. This problem becomes worse when there is a scarcity of labeled samples, which is quite common in science and engineering domains.

My work aims to build the foundations of knowledge-guided machine learning (KGML) by exploring several ways of bringing scientific knowledge and machine learning models together. In particular, we discuss gaps and opportunities in scientific discovery and show the effectiveness of KGML in multiple applications of great societal and scientific relevance. My work also has the potential to greatly advance the pace of discovery in a number of scientific and engineering disciplines where physics-based models are used, e.g., hydrology, agriculture, climate science, materials science, power engineering and biomedicine.

This talk will be given virtually by Zoom. Registration is required to receive the Zoom link for the talk.

https://planitpurple.northwestern.edu/event/626465