Past Biostatistics Events

March 2023

  • 3/15/23

    Biostat Seminar: Transfer learning in high-dimensional linear regression and graphical models

    Date
    Wednesday, March 15, 2023

    Time
    3:00pm - 4:50pm

    Location
    43-105 CHS

    Speaker
    Hongzhe Li
    Perelman Professor
    Biostatistics, Epidemiology, and Informatics
    University of Pennsylvania

    Abstract
    This talk considers estimation and prediction in a high-dimensional linear regression model in the setting of transfer learning, using samples from the target model as well as auxiliary samples from different but possibly related models. When the set of “informative” auxiliary samples is known, an estimator and a predictor are proposed and their optimality is established. The optimal rates of convergence for prediction and estimation are faster than the corresponding rates without using the auxiliary samples. This implies that knowledge from the informative auxiliary samples can be transferred to improve the learning performance of the target problem. In the case when sample informativeness is unknown, a data-driven procedure for transfer learning, called Trans-Lasso, is proposed, and its robustness to non-informative auxiliary samples and its efficiency in knowledge transfer are established. A related method, Trans-CLIME, is also developed for estimation and inference of high-dimensional Gaussian graphical models with transfer learning. The proposed procedures are demonstrated in numerical studies and are applied to the GTEx data sets concerning the associations among gene expressions in different tissues and to several GWAS data sets. It is shown that Trans-Lasso and Trans-CLIME lead to improved performance in gene expression prediction in a target tissue by incorporating the data from multiple different tissues as auxiliary samples.

  • 3/1/23

    Biostat Seminar: Constrained multivariate functional principal components analysis for novel outcomes in eye-tracking experiments

    Date
    Wednesday, March 1, 2023

    Time
    3:30pm - 4:50pm

    Refreshments at 3:00pm in the Biostat Library

    Location
    43-105 CHS

    Speaker
    Brian Kwan, PhD
    Post-Doctoral Fellow
    Department of Biostatistics
    Fielding School of Public Health, UCLA

    Abstract
    Individuals with autism spectrum disorder (ASD) tend to experience greater difficulties with social communication and processing sensory information. Of particular interest in ASD biomarker research is the study of visual attention, effectively quantified in eye tracking (ET) experiments. ET experiments offer a powerful, safe, and feasible platform for gaining insights into attentional processes by measuring moment-by-moment gaze patterns to stimuli. Even though moment-by-moment gaze patterns are recorded, analyses commonly collapse data across trials into variables such as total looking time duration in a region of interest. In addition, looking times in different regions of interest are typically analyzed separately. We propose a novel multivariate functional outcome that carries looking time duration information from multiple regions of interest jointly as a function of trial type. A novel constrained multivariate functional principal components analysis is also proposed to capture variation in this multivariate functional outcome, incorporating the constraint in the data that looking time durations from multiple regions of interest must sum up to the total trial time. Our proposals are motivated by the Activity Monitoring task, a social-attentional assay within the ET battery of the Autism Biomarkers Consortium for Clinical Trials (ABC-CT). Application to ABC-CT data yields new insights into dominant modes of variation of looking time durations from multiple regions of interest for school-age children with ASD and their typically developing (TD) peers and to novel group differences in social attention.

February 2023

  • 2/22/23

    Biostat Seminar: The unreasonable effectiveness of data science

    Date
    Wednesday, February 22, 2023

    Time
    3:30pm - 4:50pm

    Refreshments at 3:00pm in the Biostat Library

    Location
    43-105 CHS

    Speaker
    Renato Assunção, PhD
    Professor
    ESRI Inc. and Department of Computer Science
    Universidade Federal de Minas Gerais (Brazil)

    Abstract
    There are three factors responsible for the revolution brought about by artificial intelligence: (1) the constant increase in computational capacity; (2) the accumulation of large amounts of data generating insights and enabling the creation of data-driven products; (3) the development of statistical learning theory and its algorithms. The alignment of these planets allowed great success in difficult tasks such as the development of virtual assistants and chatbots, self-driving cars, the automatic translation between languages, and the early detection of unspecified anomalies in vital signs. In this talk, I will present an overview of these developments from a historical point of view, focusing on the contribution brought by Statistics. I will illustrate this presentation with examples from my own research on epidemiological surveillance using social media data, space-time demographic forecasting, and the Bayesian spatial partitioning of space-time maps.

  • 2/15/23

    Biostat Seminar: Statistical Methods and Predictive Modeling for Personalized Diabetes Management: Continuous Glucose Monitoring (CGM), Electronic Health Records (EHR), and Biobanks

    Date
    Wednesday, February 15, 2023

    Time
    3:30pm - 4:50pm

    Refreshments at 3:00pm in the Biostat Library

    Location
    43-105 CHS

    Speaker
    Jin Zhou, PhD
    Adjunct Associate Professor
    Department of Medicine
    David Geffen School of Medicine, UCLA

    Abstract
    Healthcare data in the modern era, such as electronic health records (EHRs), wearable devices, and biobanks, offer a wealth of multi-level and multi-scale information over an extended period. These datasets present a unique chance to analyze disease progression and related time-varying risk factors, but existing statistical tools and algorithms for effectively analyzing exposure trajectories and disease onset at this scale and complexity are limited. In particular, the study of biomarker trajectories and their role in disease onset and progression is underdeveloped. In the first part of the talk, I will introduce TrajGWAS, a linear mixed model-based method for testing the genetic impact on a biomarker trajectory, including shifts in mean or changes in within-subject variability. This method can handle biobank data with 100,000 to 1 million individuals and multiple longitudinal measurements, and is robust against distributional assumptions. In the second part, I will present our recent efforts in developing a joint model for longitudinal and survival data that can handle biobank data with millions of subjects, intensive longitudinal measurements, and multiple random effects. Finally, I will showcase the application of these methods in two retrospective studies using Veterans Health Administration (VHA) EHRs, covering 3.8 million veterans, and continuous glucose monitoring records of 4,000 veterans.

  • 2/8/23

    Biostat Seminar: Targeting underrepresented populations in precision medicine: Multi-source data integration via transfer learning

    Date
    Wednesday, February 8, 2023

    Time
    3:30pm - 4:50pm

    Refreshments at 3:00pm in the Biostat Library

    Location
    43-105 CHS

    Speaker
    Tian Gu, PhD
    Postdoctoral Fellow
    Department of Biostatistics
    Harvard School of Public Health

    Abstract
    The increasing numbers of large-scale biobanks and institutional data networks have brought unique opportunities to link patients’ genomics, electronic health records, and survey data for studying complex human diseases, especially to address the diminished model performance in minority and disadvantaged groups due to their low representation in biomedical studies. In this talk, I will introduce two transfer learning methods to improve statistical learning in underrepresented populations by integrating data from multiple biobanks, different ancestries, and related outcomes. These methods protect data privacy by learning from pre-trained models in external data sources without sharing patient-level data and account for potential data heterogeneity. We provide theoretical guarantees for the model performance and insights regarding when the external model can be helpful to the target model. We demonstrate the superiority of our methods compared to benchmark methods, with examples using data from the UK Biobank and the electronic Medical Records and Genomics (eMERGE) Network.

January 2023

  • 1/25/23

    Biostat Seminar: Inside Baseball: Statistical Analysis in a Baseball Front Office

    Date
    Wednesday, January 25, 2023

    Time
    3:30pm - 4:50pm

    Refreshments at 3:00pm in the Biostat Library

    Location
    43-105 CHS

    Speaker
    Richard Anderson, Ph.D., Director of Quantitative Analysis
    Justin Williams, Ph.D., Senior Quantitative Analyst
    Los Angeles Dodgers

    Abstract
    Twenty-five years ago, baseball statistics were primarily limited to what could be found in the newspaper. Today, all 30 teams employ teams of statistical analysts to inform player acquisition, coaching, and on-field strategy. Two members of the Dodgers Quantitative Analysis group will discuss the history of statistics in baseball, recent developments in data and methods of analysis, and how the work of statisticians is used inside of a baseball organization. They will also discuss baseball as a career, including how one ends up working in a baseball front office and what the career path for statisticians looks like inside of a baseball organization.

November 2022

  • 11/16/22

    Biostat Seminar: Recent Works on Synthetic Data and Trustworthy AI

    Date
    Wednesday, November 16, 2022

    Time
    3:30pm - 4:50pm

    Location
    13-041 Dentistry

    Speaker
    Guang Cheng, Ph.D.
    Professor
    Department of Statistics
    UCLA

    Abstract
    Our lab believes that the next generation of AI is driven by trustworthiness (beyond performance), and built upon synthetic data (on top of real data). Hence, this talk covers two related topics: synthetic data generation and trustworthy AI. In the first topic, we develop (perhaps the first) statistical learning framework to analyze synthetic data, and further use recommender systems as an example to illustrate how synthetic data can preserve privacy without sacrificing recommendation accuracy (i.e., utility of downstream tasks). In the second topic, we propose to protect privacy by machine un-learning, and develop theory-inspired and user-friendly fair classification algorithms.

  • 11/2/22

    Biostat Seminar: Hidden (Semi-)Markov Models for Dynamic Connectivity Analysis in Resting-State fMRI

    Date
    Wednesday, November 2, 2022

    Time
    3:30pm - 4:50pm

    Location
    13-041 Dentistry

    Speaker
    Mark Fiecas, Ph.D.
    Associate Professor
    Division of Biostatistics
    University of Minnesota

    Abstract
    Motivated by a study on adolescent mental health, we conduct a dynamic connectivity analysis using resting-state functional magnetic resonance imaging (fMRI) data. A dynamic connectivity analysis investigates how the interactions between different regions of the brain, represented by the different dimensions of a multivariate time series, change over time. Hidden Markov models (HMMs) and hidden semi-Markov models (HSMMs) are common analytic approaches for conducting dynamic connectivity analyses. In this seminar, we will give an overview of HMMs and their utility for dynamic connectivity analysis, and describe how we can use an HMM to approximate an HSMM. The approximate HSMM model allows one to explicitly model dwell-time distributions that are available to HSMMs, while maintaining the theoretical and methodological advances that are available to HMMs. We use these models to conduct a dynamic connectivity analysis on fMRI data obtained from female adolescents, where we show how dwell-time distributions vary across the severity of non-suicidal self-injury (NSSI).

October 2022

  • 10/19/22

    Biostat Seminar: Geometric EDA for Random Objects

    Date
    Wednesday, October 19, 2022

    Time
    3:30pm - 4:50pm

    Location
    13-041 Dentistry

    Speaker
    Paromita Dubey, Ph.D.
    Assistant Professor
    Marshall Business School
    USC

    Abstract
    In this talk I will propose new tools for the geometric exploration of data objects taking values in a general separable metric space. First, I will introduce depth profiles, where the depth profile of a point ω in the metric space refers to the distribution of the distances between ω and the data objects. I will describe how depth profiles can be harnessed to define transport ranks, which capture the centrality of each element in the metric space with respect to the data cloud. Next, I will discuss the properties of transport ranks and show how they can be an effective device for detecting and visualizing patterns in samples of random objects. Together with the practical illustrations of this approach, I will establish theoretical guarantees for the estimation of the depth profiles and the transport ranks for a wide class of metric spaces. Finally, I will describe a new two-sample test geared towards populations of random objects by utilizing the depth profiles corresponding to the data objects. I will demonstrate the utility of this new approach on distributional data comprising a sample of age-at-death distributions for various countries and on functional Magnetic Resonance Imaging data. This talk is based on joint work with Yaqing Chen and Hans-Georg Müller.

  • 10/4/22

    Biostat Seminar: The Role of Preferential Sampling in Spatial and Spatio-Temporal Geostatistical Modeling

    Date
    Tuesday, October 4, 2022

    Time
    11:00am - 12:00pm

    Location
    4660 Geology Building

    Speaker
    Alan E Gelfand, PhD
    James B. Duke Professor of Statistics and Decision Sciences
    Duke University

    Abstract
    The notion of preferential sampling was introduced into the literature in the seminal paper of Diggle et al. (2010). Subsequently, there has been considerable follow-up research. A standard illustration arises in geostatistical modeling. Consider the objective of inferring about environmental exposures. If environmental monitors are only placed in locations where environmental levels tend to be high, then interpolation based upon observations from these locations will necessarily produce only high predictions. A remedy lies in suitable spatial design of the locations, e.g., a random or space-filling design for locations over the region of interest is expected to preclude such bias. However, in practice, sampling may be designed in order to learn about areas of high exposure.

    While the set of sampling locations may not have been developed randomly, we study it as if it were a realization of a spatial point process. That is, it may be designed/specified in some fashion but not necessarily with the intention of being roughly uniformly distributed over the study region D. Then, the question becomes a stochastic one: is the realization of the responses independent of the realization of the locations? If not, then we have what is called preferential sampling. Importantly, the dependence here is stochastic dependence. Notationally/functionally, the responses are associated with the locations.

    Another setting is the case of species distribution modeling with a binary response, presence or absence, recorded at locations. Here, bias can arise when sampling is designed such that ecologists will tend to sample where they expect to find individuals. This setting can be extended to data fusion where we have both presence/absence data and presence-only data. Other potential applications include missing data settings and hedonic modeling for price with property sales. Very recent work explores preferential sampling in the context of multivariate geostatistical modeling.

    Fundamental issues are: (i) can we identify the occurrence of a preferential sampling effect, (ii) can we adjust inference in the presence of preferential sampling, and (iii) when can such adjustment improve predictive performance over a customary geostatistical model? We consider these issues in a modeling context and illustrate with application to presence/absence data, to property sales, and to tree data where we observe mean diameter at breast height (MDBH) and trees per hectare (TPH). (This is joint work with Shinichiro Shirota and Lucia Paci.)

May 2022

  • 5/25/22

    Biostat Seminar: Minimax powerful functional tests for longitudinal Genome-Wide Association Studies

    Date
    Wednesday, May 25, 2022

    Time
    3:30pm - 4:50pm

    Location
    Online via Zoom
    https://ucla.zoom.us/j/98667559876?pwd=QkNTWEI0aDlaak9RQ0RTc3JTdHdwdz09
    Meeting ID: 986 6755 9876 | Password: 563773

    Speaker
    Yehua Li, PhD
    Professor & Chair
    Department of Statistics
    UC Riverside

    Abstract
    We model the Alzheimer's Disease-related phenotype response variables observed at irregular time points in longitudinal Genome-Wide Association Studies as sparse functional data and propose nonparametric test procedures to detect functional genotype effects, while controlling the confounding effects of environmental covariates. Existing nonparametric tests do not take into account within-subject correlations, suffer from low statistical power, and fail to reach the genome-wide significance level. We propose a new class of functional analysis of covariance tests based on a seemingly unrelated kernel smoother, which can incorporate the correlations. We show that the proposed test combined with a uniformly consistent nonparametric covariance function estimator enjoys the Wilks phenomenon and is minimax most powerful. In an application to the Alzheimer's Disease Neuroimaging Initiative data, the proposed test leads to discovery of new genes that may be related to Alzheimer's Disease.

  • 5/18/22

    Biostat Seminar: GWAS of Longitudinal Trajectories at Biobank Scale

    Date
    Wednesday, May 18, 2022

    Time
    3:30pm - 4:50pm

    Location
    33-105 CHS and Online via Zoom
    https://ucla.zoom.us/j/98667559876?pwd=QkNTWEI0aDlaak9RQ0RTc3JTdHdwdz09
    Meeting ID: 986 6755 9876 | Password: 563773

    Speaker
    Jin Zhou, Ph.D.
    Associate Professor
    Department of Medicine
    UCLA

    Abstract
    Biobanks linked to massive, longitudinal electronic health record (EHR) data make numerous new genetic research questions feasible. Among these is the study of biomarker trajectories. For example, high blood pressure measurements over visits strongly predict stroke onset, and consistently high fasting glucose and HbA1c levels define diabetes. Recent research reveals that not only the mean level of biomarker trajectories but also their fluctuations, or within-subject (WS) variability, are risk factors for many diseases. Glycemic variation, for instance, has recently come to be considered an important clinical metric in diabetes management. It is crucial to identify the genetic factors that shift the mean or alter the WS variability of a biomarker trajectory. Compared to traditional cross-sectional studies, trajectory analysis utilizes more data points and captures a complete picture of the impact of time-varying factors, including medication history and lifestyle. Currently, there are no efficient tools for genome-wide association studies (GWAS) of biomarker trajectories at the biobank scale, even for just mean effects. We propose TrajGWAS, a linear mixed model-based method for testing genetic effects that shift the mean or alter the WS variability of a biomarker trajectory. It is scalable to biobank data with 100,000 to 1,000,000 individuals and many longitudinal measurements, and is robust to distributional assumptions. Simulation studies corroborate that TrajGWAS controls the type I error rate and is powerful. Analysis of eleven biomarkers measured longitudinally and extracted from UK Biobank primary care data for more than 150,000 participants with 1,800,000 observations reveals novel loci that significantly alter the mean or WS variability.

  • 5/11/22

    Biostat Seminar: Depth Importance in Precision Medicine (DIPM): A Tree and Forest Based Method for Right-Censored Survival Outcomes

    Date
    Wednesday, May 11, 2022

    Time
    3:30pm - 4:50pm

    Location
    Online via Zoom
    https://ucla.zoom.us/j/98667559876?pwd=QkNTWEI0aDlaak9RQ0RTc3JTdHdwdz09
    Meeting ID: 986 6755 9876 | Password: 563773

    Speaker
    Heping Zhang, Ph.D.
    Susan Dwight Bliss Professor
    Biostatistics
    Yale School of Public Health

    Abstract
    Many clinical trials have been conducted to compare right-censored survival outcomes between interventions. Such comparisons are typically made on the basis of the entire group receiving one intervention versus the others. In order to identify subgroups for which the preferential treatment may differ from the overall group, we propose the Depth Importance in Precision Medicine (DIPM) method for such data within the precision medicine framework. The approach first modifies the split criteria of the traditional classification tree to fit the precision medicine setting. Then, a random forest of trees is constructed at each node. The forest is used to calculate depth variable importance scores for each candidate split variable. The variable with the highest score is identified as the best variable to split the node. The importance score is a flexible and simply constructed measure that makes use of the observation that more important variables tend to be selected closer to the root nodes of trees. The DIPM method is primarily designed for the analysis of clinical data with two treatment groups. We also present the extension to the case of more than two treatment groups. We use simulation studies to demonstrate the accuracy of our method and provide the results of applications to two real-world datasets. In the case of one dataset, the DIPM method outperforms an existing method, and a primary motivation of this paper is the ability of the DIPM method to address the shortcomings of this existing method. Altogether, the DIPM method yields promising results that demonstrate its capacity to guide personalized treatment decisions in cases with right-censored survival outcomes. This is joint work with Victoria Chen.

April 2022

  • 4/27/22

    Biostat Seminar: A Simple, Consistent Estimator of Heritability from Genome-Wide Association Studies

    Date
    Wednesday, April 27, 2022

    Time
    3:30pm - 4:50pm

    Location
    33-105A CHS and Online via Zoom
    https://ucla.zoom.us/j/93739481772?pwd=MVZvMjNuQkUzdk1TM2hWWFZKUGZZQT09
    Meeting ID: 937 3948 1772 | Password: 339592

    Speaker
    Armin Schwartzman, Ph.D.
    Professor
    Division of Biostatistics and
    Halicioglu Data Science Institute
    UC San Diego

    Abstract
    Analysis of genome-wide association studies (GWAS) is characterized by a large number of univariate regressions where a quantitative trait is regressed on hundreds of thousands to millions of single-nucleotide polymorphism (SNP) allele counts, one at a time. In this talk, I present an estimator of the fraction of the variance of the trait explained by the SNPs in the study, also called SNP heritability. The proposed GWAS heritability (GWASH) estimator is easy to compute, highly interpretable and is consistent as the number of SNPs and the sample size increase. More importantly, it can be computed from summary statistics typically reported in GWAS. Unlike other proposed estimators in the literature, we establish the theoretical properties of the GWASH estimator and obtain analytical estimates of the precision, allowing for power and sample size calculations and forming a firm foundation for future methodological development.

  • 4/13/22

    Biostat Seminar: Analyzing Social Media Conversations At Scale with Tensor LDA

    Date
    Wednesday, April 13, 2022

    Time
    3:30pm - 4:50pm

    Location
    33-105 CHS and Online via Zoom
    https://ucla.zoom.us/j/93739481772?pwd=MVZvMjNuQkUzdk1TM2hWWFZKUGZZQT09
    Meeting ID: 937 3948 1772 | Password: 339592

    Speaker
    Sara Kangaslahti, Danny Ebanks, R. Michael Alvarez
    California Institute of Technology

    Abstract
    The data exchanged daily on social media platforms such as Twitter presents an important testbed for topics in social science research, including messaging coordination around social movements such as #MeToo. The growth of social media data and the prohibitive cost of annotation have rendered traditional supervised machine learning methods infeasible for analysis. This cost arises from the enormity and dynamic nature of some of the most compelling textual data. In order to analyze these datasets, unsupervised topic modeling methods such as Latent Dirichlet Allocation (LDA) have gained widespread popularity, since they can extract the important information without requiring labeling or prior knowledge. Unlike supervised methods, unsupervised learning methods like LDA do not require a pre-defined set of topics, but the existing LDA models face numerous computational limitations. To overcome these, we build on previous scalable spectral methods and propose an online GPU-based tensor Latent Dirichlet Allocation method (tLDA). We achieve optimal scalability by centering and batching the data, as well as providing an end-to-end GPU pipeline, from data pre-processing to model estimation. In order to demonstrate the utility of the method, we show qualitative results derived from applications of this method for studying topic evolution in domains relevant to social research, particularly a large #MeToo Twitter dataset composed of over 8 million tweets. We find that topics related to political events were generally ephemeral, whereas topics related to supporting women in the #MeToo Movement were persistently prominent as the topics dynamically evolved. To validate the method, we show relative gains in topic coherence of 7 to 44 percent over traditional LDA methods and 2 to 8 percent over previous tensor LDA methods. At the same time, our method is up to 7.6x faster than traditional LDA implementations and 20x faster than previous tensor LDA methods. On GPU, we show 10x improvement against the fastest parallelized CPU implementations of LDA. Finally, we demonstrate scaling in this paper up to 30 million tweets.

March 2022

  • 3/2/22

    Biostat Seminar: Survey Attention and Self-Reported Behavior

    Date
    Wednesday, March 2, 2022

    Time
    3:30pm - 4:30pm

    Location
    Online via Zoom
    https://ucla.zoom.us/j/98667559876?pwd=QkNTWEI0aDlaak9RQ0RTc3JTdHdwdz09
    Meeting ID: 986 6755 9876 | Password: 563773

    Speaker
    R. Michael Alvarez, Ph.D. (Professor) and Yimeng Li (PhD Candidate)
    Political and Computational Social Science
    Caltech

    Abstract
    Survey research methodology is evolving rapidly, as new technologies provide new opportunities. One of the areas of innovation regards the development of online interview best practices, and the advancement of methods that allow researchers to measure the attention that subjects are devoting to the survey task. Reliable measurement of subject attention can yield important information about the quality of the survey response. In this paper, we take advantage of an innovative survey we conducted in 2018, in which we directly connect survey responses to administrative data, allowing us to directly assess the association between survey attention and response quality. We show that attentive survey subjects are more likely to provide accurate survey responses regarding a number of behaviors and attributes that we can validate with our administrative data. The best strategy to deal with inattentive respondents, however, depends on the correlation between respondent attention and the outcome of interest.

February 2022

  • 2/23/22

    Biostat Seminar: Resampling-Based Assumption-Lean Statistical Inference in Big and Complex Data

    Date
    Wednesday, February 23, 2022

    Time
    3:30pm - 4:30pm

    Location
    Online via Zoom
    https://ucla.zoom.us/j/98667559876?pwd=QkNTWEI0aDlaak9RQ0RTc3JTdHdwdz09
    Meeting ID: 986 6755 9876 | Password: 563773

    Speaker
    Zhe Fei, Ph.D.
    Assistant Professor In-Residence
    Department of Biostatistics
    UCLA

    Abstract
    What is the role of statisticians in the era of big data and data science? Another way of asking this might be, what could a statistician offer that is different from a computer scientist, a machine learning specialist, or a mathematician? In this talk, I will share my own research experiences on this topic. I will outline the statistical challenges when analyzing various types of “big data,” including omics data, electronic health records (EHR) and medical imaging. There are two fundamental modeling goals, to estimate the effects of individual predictors, and to make predictions of future observations. For both goals, it is crucial to attach statistical inferences to the estimates or predictions, so that our findings can be validated with confidence. In other words, statistical inferences refer to the uncertainty measures of the model parameters of interest. I will introduce resampling-based inferential approaches to these problems and show their advantages both in practice and in theory.

  • 2/18/22

    Biostat Seminar: Bayesian methods for studying dynamic brain connectivity

    Date
    Friday, February 18, 2022

    Time
    12:00pm - 1:00pm

    Location
    Online via Zoom
    https://ucla.zoom.us/j/98667559876?pwd=QkNTWEI0aDlaak9RQ0RTc3JTdHdwdz09
    Meeting ID: 986 6755 9876 | Password: 563773

    Speaker
    Michele Guindani, Ph.D.
    Professor
    Department of Statistics
    UC Irvine

    Abstract
    An improved understanding of the heterogeneity of brain mechanisms is considered critical for developing interventions based on observed neuroimaging features. Recently, neuroscientists have been particularly interested in understanding how the brain reorganizes functional networks between brain areas throughout a neuroscience experiment. In this talk, we will look first at functional magnetic resonance imaging (fMRI) data and discuss a computationally efficient time-varying Bayesian vector autoregressive (VAR) approach for studying dynamic effective connectivity. Effective connectivity is defined as the direct influence that one brain region exerts on another. The proposed framework employs a tensor decomposition for the VAR coefficient matrices at different lags. Dynamically varying connectivity patterns are captured by assuming that only a subset of components in the tensor decomposition is active at any given time. Latent binary time series select the active components at each time via a convenient Ising prior specification. The proposed prior structure encourages sparsity in the tensor structure and allows ascertaining model complexity through the posterior distribution. More specifically, sparsity-inducing priors are employed to allow for global-local shrinkage of the coefficients, automatically determine the rank of the tensor decomposition, and guide the selection of the lags of the auto-regression. We will show the performance of our model formulation via simulation studies and data from an actual fMRI study involving a book reading experiment. We will then conclude by outlining extensions and further directions of research to study brain connectivity in both animal and human experiments.

  • 2/14/22

    Biostat Seminar: Data science and policy: Addressing inequity in health

    Date
    Monday, February 14, 2022

    Time
    12:00pm - 1:00pm

    Location
    Online via Zoom
    https://ucla.zoom.us/j/98667559876?pwd=QkNTWEI0aDlaak9RQ0RTc3JTdHdwdz09
    Meeting ID: 986 6755 9876 | Password: 563773

    Speaker
    Elizabeth Chin
    PhD Candidate
    Biomedical Data Science
    Stanford University

    Abstract
    Advances in statistics, econometrics, and computer science have the potential to facilitate data-driven decision making in improving the health of populations. However, adapting modern data science methods to eliminate health disparities remains challenging because interventions based singularly on health data do not fully address health issues borne from structural, upstream inequities. A multi-level approach that integrates social and health data to characterize how specific social systems perpetuate health inequities provides opportunities to create more tailored health and social policies. I will discuss examples of addressing health inequity through data science in two contexts: (1) mass incarceration in relationship to public health policies, and (2) algorithmic fairness for structurally vulnerable populations in social policy. An underlying theme is the importance of statistical methodology and study design informed by a holistic understanding of the interplay between social and health systems.

  • 2/7/22

    Biostat Seminar: Efficient Learning of Optimal Individualized Treatment Rules

    Date
    Monday, February 7, 2022

    Time
    12:00pm - 1:00pm

    Location
    Online via Zoom
    https://ucla.zoom.us/j/98667559876?pwd=QkNTWEI0aDlaak9RQ0RTc3JTdHdwdz09
    Meeting ID: 986 6755 9876 | Password: 563773

    Speaker
    Weibin Mo, Ph.D.
    Applied Scientist
    Graduate of the Department of Statistics and Operations Research, University of North Carolina at Chapel Hill

    Abstract
    Recent developments in data-driven decision science have brought great advances in individualized decision making. Given data with individual covariates, treatment assignments and outcomes, researchers can search for the optimal individualized treatment rule (ITR) that maximizes the expected outcome. Existing methods typically require initial estimation of some nuisance models. The double robustness property, which protects against misspecification of either the treatment-free effect or the propensity score, has been widely advocated. However, when model misspecification exists, a doubly robust estimate can be consistent but may suffer from reduced efficiency. Beyond potentially misspecified nuisance models, most existing methods do not account for the problem that arises when the variance of the outcome is heterogeneous across covariates and treatment. We observe that such heteroscedasticity can greatly affect the estimation efficiency of the optimal ITR. In this presentation, we demonstrate that the consequences of a misspecified treatment-free effect and heteroscedasticity can be unified as a covariate-treatment dependent variance of residuals. To improve efficiency of the estimated ITR, we propose an Efficient Learning (E-Learning) framework for finding an optimal ITR in the multi-armed treatment setting. We show that the proposed E-Learning is optimal among a regular class of semiparametric estimates that can allow treatment-free effect misspecification. In our simulation study, E-Learning demonstrates its effectiveness when either or both of a misspecified treatment-free effect and heteroscedasticity exist. Our analysis of a Type 2 Diabetes Mellitus observational study also suggests the improved efficiency of E-Learning.

  • 2/2/22

    Biostat Seminar: Efficient Learning of Optimal Individualized Treatment Rules

    Date
    Wednesday, February 2, 2022

    Time
    3:30pm - 4:30pm

    Location
    Online via Zoom
    https://ucla.zoom.us/j/98667559876?pwd=QkNTWEI0aDlaak9RQ0RTc3JTdHdwdz09
    Meeting ID: 986 6755 9876 | Password: 563773

    Speaker
    Molei Liu, Ph.D.
    PhD Candidate
    Department of Biostatistics
    Harvard Chan School of Public Health

    Abstract
    Precise statistical modeling and inference often rely on integrative analysis of datasets from multiple sites. However, such modern meta-analysis can be uniquely challenging for electronic health record (EHR) data due to noisiness, high dimensionality, heterogeneity and privacy constraints. I will present a novel statistical framework and approaches to overcome these practical challenges. Specifically, we develop three methods for individual-information-protected aggregation of multi-institutional, large-scale and heterogeneous EHR data sets, aiming at sparse regression, multiple testing, and surrogate-assisted semi-supervised learning, respectively. Through both asymptotic analysis and numerical experiments, we demonstrate that our proposed methods outperform existing options and perform close to the ideal individual patient data pooling analysis, which is not feasible due to the privacy constraint. We illustrate the use of our methods in real EHR-based studies, including EHR phenotyping for cardiovascular disease and inferring genetic associations of type II diabetes linked with biobank data.

January 2022

  • 1/31/22

    Biostat Seminar: A Burden Shared is a Burden Halved: A Fairness-Adjusted Approach to Classification

    Date
    Monday, January 31, 2022

    Time
    12:00pm - 1:00pm

    Location
    Online via Zoom
    https://ucla.zoom.us/j/98667559876?pwd=QkNTWEI0aDlaak9RQ0RTc3JTdHdwdz09
    Meeting ID: 986 6755 9876 | Password: 563773

    Speaker
    Bradley Rava
    PhD Candidate (Statistics)
    Department of Data Sciences and Operations
    USC

    Abstract
    We study fairness in classification, where one wishes to make automated decisions for people from different protected groups. When individuals are classified, the decision errors can be unfairly concentrated in certain protected groups. We develop a fairness-adjusted selective inference (FASI) framework and data-driven algorithms that achieve statistical parity in the sense that the false selection rate (FSR) is controlled and equalized among protected groups. The FASI algorithm operates by converting the outputs from black-box classifiers to R-values, which are intuitively appealing and easy to compute. Selection rules based on R-values are provably valid for FSR control, and avoid disparate impacts on protected groups. The effectiveness of FASI is demonstrated through both simulated and real data.

November 2021

  • 11/17/21

    Biostat Seminar: Investigating Longitudinal Patterns and Group-level Model Development for Diabetic Kidney Disease Progression Using Functional Data Methods

    Date
    Wednesday, November 17, 2021

    Time
    3:30pm - 4:30pm

    Location
    Online via Zoom
    https://ucla.zoom.us/j/92676785924?pwd=US9WSHpyZlMxdUJtWVE1TmJMYWxFUT09
    Meeting ID: 926 7678 5924 | Password: 507253

    Speaker
    Brian Kwan, Ph.D.
    Postdoctoral Researcher
    Department of Biostatistics
    UCLA

    Abstract
    Patients with diabetic kidney disease (DKD) are at high risk for kidney failure, and estimated glomerular filtration rate (eGFR) trajectories are natural markers for DKD progression. Longitudinal trajectories may exhibit nonlinear trends with the timing and number of repeated measurements varying per patient, leading to irregularly spaced and sparse data. In this talk, we discuss the application of functional principal components analysis (FPCA) to model and investigate salient patterns of eGFR trajectories among clinical subgroups of patients with diabetes and chronic kidney disease defined by the presence of albuminuria. Furthermore, to determine whether fitting a full cohort model or separate group-specific models is better suited for modeling long-term trajectories, we evaluated model fit, using our goodness-of-fit procedure, and future prediction accuracy. Our findings indicated there are advantages to both modeling approaches for accomplishing different objectives. While our application focused on DKD, our methods are applicable to other settings with longitudinally assessed biomarkers as indicators of disease progression. The talk concludes with a discussion of future directions to explore from both statistical and clinical viewpoints.

  • 11/10/21

    Biostat Seminar: Projection-based Testing in Longitudinal Functional Regression

    Date
    Wednesday, November 10, 2021

    Time
    3:30pm - 4:30pm

    Location
    Online via Zoom
    https://ucla.zoom.us/j/92676785924?pwd=US9WSHpyZlMxdUJtWVE1TmJMYWxFUT09
    Meeting ID: 926 7678 5924 | Password: 507253

    Speaker
    Ana-Maria Staicu, Ph.D.
    Professor
    Department of Statistics
    North Carolina State University

    Abstract
    We consider longitudinal functional regression, where for each subject, we observe multiple 1D profiles (curves) over different time visits. We explore the idea of “projecting” the data onto 1D data-driven directions and discuss significance tests based on “projections” in two general settings. First, we develop a test procedure to assess that the mean profile is time-invariant. Second, we extend the ideas to cross-over designs, to study if treatment is significant in the presence of the carryover effect. The tests have a non-standard null distribution that is easy to simulate. Numerical studies confirm that the testing approaches have the correct size in finite samples and have superior power relative to available competitors. The methods are illustrated on multiple sclerosis and wearable design applications.

  • 11/3/21

    Biostat Seminar: The Emergence and Future of Public Health Data Science

    Date
    Wednesday, November 3, 2021

    Time
    3:30pm - 4:30pm

    Location
    Online via Zoom
    https://ucla.zoom.us/j/92676785924?pwd=US9WSHpyZlMxdUJtWVE1TmJMYWxFUT09
    Meeting ID: 926 7678 5924 | Password: 507253

    Speaker
    Jeff Goldsmith, Ph.D.
    Associate Professor
    Department of Biostatistics
    Mailman School of Public Health
    Columbia University

    Abstract
    Although the major components of data science have existed for many years, the term has rapidly grown in prominence in the last decade. This reflects the confluence of several important trends in science, including the prevalence of big data, the development of computational approaches to analysis, and the recognized need for reproducibility in research. We'll provide an understanding of “data science”, with particular emphasis on connotation and the implied perspectives for working with data. We'll then consider the ongoing dialog between public health and data science, and suggest ways that public health data science might evolve and drive innovation in coming years.

October 2021

  • 10/27/21

    Biostat Seminar: Statistical Approaches for Integrative Learning for Neuroimaging Data

    Date
    Wednesday, October 27, 2021

    Time
    3:30pm - 4:30pm

    Location
    Online via Zoom
    https://ucla.zoom.us/j/92676785924?pwd=US9WSHpyZlMxdUJtWVE1TmJMYWxFUT09
    Meeting ID: 926 7678 5924 | Password: 507253

    Speaker
    Suprateek Kundu, Ph.D.
    Associate Professor
    Department of Biostatistics
    University of Texas MD Anderson Cancer Center

    Abstract
    Motivated by a recent interest in data fusion methods in medical imaging, we discuss novel approaches for joint analysis of multiple neuroimaging datasets. In the first part of the talk, I discuss our recently developed approach for integrative Bayesian learning of multiple brain networks using functional magnetic resonance imaging data. We illustrate that joint network learning results in biologically interpretable and reproducible results compared to single network analysis. In the second part of the talk, we propose a novel approach for joint estimation of multiple scalar-on-image regression models involving high-dimensional noisy images. Standard scalar-on-image regression models that fit each dataset separately are not equipped to leverage information across inter-related images, and existing multi-task learning approaches are compromised by the inability to account for the noise that is often observed in images. Under both convex and non-convex grouped penalties that are designed to pool information across inter-related images for joint learning, we are able to explicitly account for noise in high-dimensional images via a projection-based approach. In the presence of non-convexity arising due to noisy images, we derive non-asymptotic error bounds under non-convex as well as convex grouped penalties, even when the number of voxels increases exponentially with sample size. A projected gradient descent algorithm is used for computation, which is shown to approximate the optimal solution via well-defined non-asymptotic optimization error bounds under noisy images. Extensive simulations and application to a motivating longitudinal Alzheimer’s disease study illustrate significantly improved predictive ability and greater power to detect true signals that are simply missed by existing methods without noise correction due to the attenuation-to-the-null phenomenon.

  • 10/13/21

    Biostat Seminar: Statistical Data Depth and its Applications to Health Sciences

    Date
    Wednesday, October 13, 2021

    Time
    3:30pm - 4:30pm

    Location
    Online via Zoom
    https://ucla.zoom.us/j/92676785924?pwd=US9WSHpyZlMxdUJtWVE1TmJMYWxFUT09
    Meeting ID: 926 7678 5924 | Password: 507253

    Speaker
    Sara Lopez-Pintado, Ph.D.
    Associate Professor
    Department of Health Sciences
    Bouvé College of Health Sciences
    Northeastern University

    Abstract
    Data depth was originally introduced for multivariate data as a powerful non-parametric tool for developing robust exploratory data analysis methods. It provides a way of measuring how representative an observation is within the distribution or sample and of ranking multivariate observations from center-outward. Based on these depth-rankings, robust estimators and outliers can be defined. Notions of depth have been extended to functional data in the last several decades. In this work we develop different depth-based methods for functional data, such as an envelope test for detecting and visualizing differences between groups of functions. We applied this method to longitudinal growth data, where the goal is to find differences between the growth pattern of normal versus premature low birth weight babies. We also introduce and establish the properties of the metric halfspace depth, an extension of the well-known Tukey’s depth to object data in general metric spaces. The metric halfspace depth was applied to an Alzheimer's disease study, revealing group differences in the brain connectivity, modeled as covariance matrices, for subjects in different stages of dementia.

  • 10/6/21

    Biostat Seminar: Mixtures of Multivariate Regressions & Selective Inference for Clustering

    Date
    Wednesday, October 6, 2021

    Time
    3:30pm - 4:30pm

    Location
    Online via Zoom
    https://ucla.zoom.us/j/92676785924?pwd=US9WSHpyZlMxdUJtWVE1TmJMYWxFUT09
    Meeting ID: 926 7678 5924 | Password: 507253

    Speaker
    Jacob Bien, Ph.D.
    Associate Professor
    Data Sciences and Operations
    University of Southern California

    Abstract
    This will be a talk in two parts: The first part will focus on a statistical model developed for continuous flow cytometry data, while the second part will describe a widespread statistical challenge across many application areas.

    Part 1: Mixture of Multivariate Regressions Modeling for Oceanographic Flow Cytometry Data
    Although microscopic, phytoplankton in the ocean are extremely important to all of life and are together responsible for as much photosynthesis as all plants on land combined. Today, oceanographers are able to collect flow cytometry data in real time while onboard a moving ship, providing them with fine-scale resolution of the distribution of phytoplankton across thousands of kilometers. We present a novel sparse mixture of multivariate regressions model to estimate the time-varying phytoplankton subpopulations while simultaneously identifying the specific environmental covariates that are predictive of the observed changes to these subpopulations.

    Part 2: Selective Inference for Hierarchical Clustering
    Although statistics textbooks emphasize the importance of forming a hypothesis before looking at a data set, in practice it is quite common for data analysts to "double dip." That is, they first explore a data set to formulate some hypotheses and then they want to know whether what they have found is "real." For example, after running a clustering method on some data, a data analyst looking at two of the clusters might want to know whether their means are "truly" different from each other. Applying a standard two-sample test in such a setting will lead to a grossly inflated Type I error rate. We develop a selective inference approach to help answer this question while properly accounting for clustering having been performed on the data.

March 2021

  • 3/8/21

    Identification and Characterization of Genomic Variants with High Throughput Data | UCLA CTSI Biostatistics Seminar

    Date
    Monday, March 8, 2021

    Time
    12:00pm - 1:00pm

    Location
    Online via Zoom
    https://uclahs.zoom.us/j/780633065

    Speaker
    Feifei Xiao, Ph.D.
    Associate Professor, University of South Carolina

    Bio
    Feifei Xiao, Ph.D., is an Assistant Professor in the Department of Epidemiology and Biostatistics at the University of South Carolina. Dr. Xiao received her Ph.D. in Biostatistics from The University of Texas MD Anderson Cancer Center in 2013. She then completed her postdoctoral training in Biostatistics at the School of Public Health at Yale University (2013-2015). Dr. Xiao’s research focuses on high throughput genetic/genomics data, specifically on copy number variations, gene-gene/environment interactions, epigenetics and next generation sequencing data analysis. She has published 27 articles in peer-reviewed journals of statistics, genetics and bioinformatics, including Nucleic Acids Research, Human Genetics, and Bioinformatics.

    Abstract
    Massive datasets generated by modern technologies have enabled great effort toward precision medicine. Researchers have identified various genetics/genomics features as potential biomarkers for disease prevention and diagnosis. The first part of my talk will be on copy number variant (CNV) analysis. Most existing methods use algorithms that assume the observed data at different genetic loci are independent. Our study found that the correlation structure of CNV data is associated with linkage disequilibrium. Therefore, we developed a novel algorithm that systematically integrates the genomic correlation structure into the modeling. I will show simulations and the application to a whole genome melanoma study. Application to a large cohort lung cancer study to reveal high confidence CNVs predisposing to lung cancer risk will also be illustrated. In the second part of my talk, I will talk about the identification of a gene expression based immune signature for lung adenocarcinoma prognosis using machine learning methods.

    Participating CTSI Institutions
    UCLA, Harbor-UCLA, Charles Drew University, and Cedars-Sinai

February 2021

  • 2/22/21

    Genetics of Within-Subject Variability and Diabetes Complications | UCLA CTSI Biostatistics Seminar

    Date
    Monday, February 22, 2021

    Time
    12:00pm - 1:00pm

    Location
    Online via Zoom
    https://uclahs.zoom.us/j/780633065

    Speaker
    Jin Zhou, Ph.D.
    Associate Professor, University of Arizona

    Abstract
    The development of diabetes complications, both macrovascular and microvascular, is heterogeneous, even when patients have the same glucose control and clinical features. Research searching for susceptible genes underlying diabetes complications is limited due to the complexities of studying diseases (complications) within a disease (diabetes). Our prior findings highlighted the importance of time-varying within-subject (WS) glycemic variability (GV) and blood pressure variability (BPv) for developing diabetes complications. We hypothesize that genetic variants contribute to within-subject GV and BPv, contributing to progression to diabetes complications. In this talk, to quantify the genetic contributions to GV and BPv using biobank scale data, we develop a WS variance estimator by robust regression to estimate and make inference on the effects of both time-varying and time-invariant predictors on WS variance. Our method is robust against distributional misspecification. We further boost the computational efficiency by implementing a score test that only needs to fit the null model once for the entire data set, making it applicable to massive biobank data. We apply our method (vGWAS) to longitudinal glycemic and blood pressure (BP) measures extracted from electronic medical records from UK Biobank. Our results complement current BP GWAS and shed light on disease mechanisms.

  • 2/10/21

    Biostat Seminar: COVID-19 Vaccines: What We Know and What We Don’t

    Date
    Wednesday, February 10, 2021

    Time
    3:30pm - 4:30pm

    Location
    Online via Zoom
    https://ucla.zoom.us/j/95460365266?pwd=bEFCT2NBVE51RTVBTFQxa29WbnhqUT09
    Meeting ID: 954 6036 5266 | Password: 943207

    Speaker
    Abdelmonem Afifi
    Professor of Biostatistics & Biomathematics
    Fielding School of Public Health
    UCLA

    Abstract
    The emergence of vaccines in late 2020 has shone a bright light at the end of the long and dark COVID-19 tunnel. In this seminar, I summarize what I wanted to know about these vaccines. I begin by describing the different types of vaccines that have appeared or are under development. I describe the FDA approval process, particularly as it relates to the Pfizer-BioNTech and Moderna vaccines. I discuss the process and potential consequences of vaccine distribution in the USA, including herd immunity and what it takes to reach it. I conclude by speculating on what is next for the course of the pandemic.

  • 2/3/21

    Biostat Seminar: Meshed Gaussian Processes For Efficient Bayesian Inference Of Big Data Spatial Regression Models

    Date
    Wednesday, February 3, 2021

    Time
    3:30pm - 4:30pm

    Location
    Online via Zoom
    https://ucla.zoom.us/j/93422849767
    Meeting ID: 934 2284 9767

    Speaker
    Michele Peruzzi
    Postdoctoral Associate
    Department of Statistics
    Duke University

    Abstract
    Big spatial data are now routinely collected in massive amounts in diverse scientific and data-driven industrial applications including, but not limited to, natural and environmental sciences; economics; climate science; ecology; forestry; and public health. In this talk, I will introduce Meshed Gaussian Processes (MGPs) for scalable Bayesian regression modeling of spatial Big Data. The underlying idea combines concepts on high-dimensional geostatistics by partitioning the spatial domain and modeling the regions in the partition using a sparsity-inducing directed acyclic graph (DAG). Unlike other methods, MGPs consider the DAG as an explicit design choice -- rather than building the DAG based on some criterion (e.g. limiting conditional dependence to the m nearest neighbors), one chooses a DAG because of its known properties. The DAG is linked to groups of spatial locations, arising e.g. from domain tiling, tessellations, or other partitioning strategies. In particular, one may consider two particularly convenient DAGs and the corresponding domain partitioning strategies: (1) a recursive tree, (2) a "cubic" mesh. I will focus on the latter and show that the resulting "cubic" MGP (QMGP) corresponds to efficient parallel MCMC sampling of the latent spatial process, even with spatiotemporal data at more than ten million locations. I will then mention refinements, improvements and extensions of MGPs and QMGPs in particular: (1) MCMC for QMGPs may exhibit slow convergence for irregularly spaced data and/or in estimating the covariance parameters a posteriori. I will resolve these issues by showing that a Grid-Parametrize-Split (GriPS) strategy results in massively more efficient MCMC. (2) Why MCMC though? In some scenarios, it may be possible to fix some covariance parameters at some reasonable value; then, MCMC may be avoided. I will outline the possible computational advantages of QMGPs in these settings, compared to existing alternatives. (3) The idea of fixing the DAG allows one to devise tailor-made MCMC algorithms for sampling specific MGPs. As a result, MGPs may facilitate computations for more general regression models on (multivariate) non-Gaussian outcomes.

January 2021

  • 1/20/21

    Biostat Seminar: Implicit Bias Of Gradient Descent For Mean Squared Error Regression With Wide Neural Networks

    Date
    Wednesday, January 20, 2021

    Time
    3:30pm - 4:30pm

    Location
    Online via Zoom
    https://ucla.zoom.us/j/95460365266?pwd=bEFCT2NBVE51RTVBTFQxa29WbnhqUT09
    Meeting ID: 954 6036 5266 | Passcode: 943207

    Speaker
    Guido Montufar
    Assistant Professor
    Department of Mathematics and Statistics
    UCLA

    Abstract
    We investigate gradient descent training of wide neural networks and the corresponding implicit bias in function space. For 1D regression, we show that the solution of training a width-n shallow ReLU network is within n^(-1/2) of the function which fits the training data and whose difference from initialization has smallest 2-norm of the second derivative weighted by 1/ζ. The curvature penalty function 1/ζ is expressed in terms of the probability distribution that is utilized to initialize the network parameters, and we compute it explicitly for various common initialization procedures. For instance, asymmetric initialization with a uniform distribution yields a constant curvature penalty, and thence the solution function is the natural cubic spline interpolation of the training data. While similar results have been obtained in previous works, our analysis clarifies important details and allows us to obtain significant generalizations. In particular, the result generalizes to multivariate regression and different activation functions. Moreover, we show that the training trajectories are captured by trajectories of spatially adaptive smoothing splines with decreasing regularization strength. This is joint work with Hui Jin.

  • 1/13/21

    Biostat Seminar: Partial Separability And Functional Graphical Models

    Date
    Wednesday, January 13, 2021

    Time
    3:30pm - 4:30pm

    Location
    Online via Zoom
    https://ucla.zoom.us/j/95460365266?pwd=bEFCT2NBVE51RTVBTFQxa29WbnhqUT09
    Meeting ID: 954 6036 5266 | Passcode: 943207

    Speaker
    Alexander Petersen
    Assistant Professor
    Department of Statistics
    Brigham Young University
    UC Santa Barbara

    Abstract
    The covariance structure of multivariate functional data can be highly complex, especially if the multivariate dimension is large, making extension of statistical methods for standard multivariate data to the functional data setting quite challenging. For example, Gaussian graphical models have recently been extended to the setting of multivariate functional data by applying multivariate methods to the coefficients of truncated basis expansions. However, a key difficulty compared to multivariate data is that the covariance operator is compact, and thus not invertible. The methodology in this paper addresses the general problem of covariance modeling for multivariate functional data, and functional Gaussian graphical models in particular. As a first step, a new notion of separability for multivariate functional data is proposed, termed partial separability, leading to a novel Karhunen-Loève-type expansion for such data. Next, the partial separability structure is shown to be particularly useful in order to provide a well-defined Gaussian graphical model that can be identified with a sequence of finite-dimensional graphical models, each of fixed dimension. This motivates a simple and efficient estimation procedure through application of the joint graphical lasso. Empirical performance of the method for graphical model estimation is assessed through simulation and analysis of functional brain connectivity during a motor task.

December 2020

  • 12/9/20

    Biostat Seminar: Causal Learning: excursions in double robustness

    Date
    Wednesday, December 9, 2020

    Time
    3:30pm - 4:30pm

    Location
    Online via Zoom
    https://ucla.zoom.us/j/97619833513?pwd=dVdpYUtIOFBaWk5TR0xkNktTVCt3UT09
    Meeting ID: 976 1983 3513 | Passcode: 828359

    Speaker
    Jelena Bradic
    Associate Professor
    Department of Mathematics & Halicioglu Data Science Institute
    UC San Diego

    Abstract
    Recent progress in machine learning provides many potentially effective tools to learn estimates or make predictions from datasets of ever-increasing sizes. Can we trust such tools in clinical and highly-sensitive systems? If a learning algorithm predicts an effect of a new policy to be positive, what guarantees do we have concerning the accuracy of this prediction? The talk introduces new statistical ideas to ensure that the learned estimates satisfy some fundamental properties: especially causality and robustness. The talk will discuss potential connections and departures between causality and robustness.

  • 12/2/20

    Biostat Seminar: Reframing proportional-hazards modeling for large time-to-event datasets with applications to deep learning

    Date
    Wednesday, December 2, 2020

    Time
    3:30pm - 4:30pm

    Location
    Online via Zoom
    https://ucla.zoom.us/j/98576333860?pwd=QTdSdmVZOWMwaHZscldJZG1GUzhBQT09
    Meeting ID: 985 7633 3860 | Passcode: 140409

    Speaker
    Noah Simon
    Associate Professor
    Department of Biostatistics
    University of Washington

    Abstract
    To build inferential or predictive survival models, it is common to assume proportionality of hazards and fit a model by maximizing the partial likelihood. This has been combined with non-parametric and high dimensional techniques, e.g. spline expansions and penalties, to flexibly build survival models. New challenges require extension and modification of that approach. In a number of modern applications there is interest in using complex features such as images to predict survival. In these cases, it is necessary to connect more modern backends to the partial likelihood (such as deep learning infrastructures based on e.g. convolutional/recurrent neural networks). In such scenarios, large numbers of observations are needed to train the model. However, in cases where those observations are available, the structure of the partial likelihood makes optimization difficult (if not completely intractable).

    In this talk we show how the partial likelihood can be simply modified to easily deal with large amounts of data. In particular, with this modification, stochastic gradient-based methods, commonly applied in deep learning, are simple to employ. This simplicity holds even in the presence of left truncation/right censoring. This can also be applied relatively simply with data stored in a distributed manner.

November 2020

  • 11/25/20

    Biostat Seminar: Individualized Multi-directional Variable Selection

    Date
    Wednesday, November 25, 2020

    Time
    3:30pm - 4:30pm

    Location
    Online via Zoom
    https://ucla.zoom.us/j/99467473438?pwd=UmRadlpuR1pFaFpJY1hXN0h5WjNTQT09
    Meeting ID: 994 6747 3438 | Passcode: 254642

    Speaker
    Annie Qu
    Professor
    Department of Statistics
    University of California, Irvine

    Abstract
    In this talk we propose a heterogeneous modeling framework which achieves individual-wise feature selection and individualized covariates’ effects subgrouping simultaneously. In contrast to conventional model selection approaches, the new approach constructs a separation penalty with multi-directional shrinkages, which facilitates individualized modeling to distinguish strong signals from noisy ones and selects different relevant variables for different individuals. Meanwhile, the proposed model identifies subgroups among which individuals share similar covariates’ effects, and thus improves individualized estimation efficiency and feature selection accuracy. Moreover, the proposed model also incorporates within-individual correlation for longitudinal data to gain extra efficiency. We provide a general theoretical foundation under a double-divergence modeling framework where the number of individuals and the number of individual-wise measurements can both diverge, which enables inference on both an individual level and a population level. In particular, we establish a strong oracle property for the individualized estimator to ensure its optimal large sample property under various conditions.

  • 11/18/20

    Biostat Seminar: Constructing Confidence Interval for RMST under Group Sequential Setting

    Date
    Wednesday, November 18, 2020

    Time
    3:30pm - 4:30pm

    Location
    Online via Zoom
    https://ucla.zoom.us/j/94791678395?pwd=aTZoV0trQzdIaEE5WjNLY0IzSElpZz09
    Meeting ID: 947 9167 8395 | Passcode: 042094

    Speaker
    Lu Tian
    Associate Professor of Biomedical Data Science and Statistics
    Stanford University

    Abstract
    It is appealing to compare survival distributions based on restricted mean survival time (RMST), since it generates a clinically interpretable summary of the treatment effect and can be estimated nonparametrically without restrictive model assumptions such as the proportional hazards assumption. However, there are special challenges in designing and analyzing a group sequential study based on RMST, because the truncation timepoint of the RMST in the interim analysis often differs from that in the final analysis. A valid test that controls the unconditional type I error has been developed in the past. However, there is no appropriate statistical procedure for constructing the confidence interval for the treatment effect measured by a contrast in RMST, while it is crucial for informative clinical decision making. In this talk, I will review some important design issues for studies based on RMST. I will then discuss how to conduct hypothesis testing and how to construct confidence intervals for the difference in RMST in a group sequential setting. Examples and numerical studies will be presented to illustrate the method.

  • 11/10/20

    Biostatistics Admission Information Session

    Date
    Tuesday, November 10, 2020

    Time
    1:00pm - 2:00pm

    Location
    Online via Zoom
    https://ucla.zoom.us/j/94445671105?pwd=emxESzJiUXNxTHJjTkRGK0JSVm8yZz09
    Meeting ID: 944 4567 1105 | Passcode: 742314

    Details
    Biostatistics admission committee members and biostatistics student representatives are available to answer any questions you have regarding graduate programs (MS, MPH, PhD) in Biostatistics.

  • 11/4/20

    Biostat Seminar: Optimal post-selection inference for sparse signals: a nonparametric empirical-Bayes

    Date
    Wednesday, November 4, 2020

    Time
    3:30pm - 4:30pm

    Location
    Online via Zoom
    https://ucla.zoom.us/j/93327847797?pwd=OVpGUTg1SEJGZ1VWb3ZJK08rRThrZz09
    Meeting ID: 933 2784 7797 | Passcode: 001430

    Speaker
    Oscar Hernan Madrid Padilla
    Assistant Professor
    Statistics Department, UCLA

    Abstract
    Many recently developed Bayesian methods have focused on sparse signal detection. However, much less work has been done addressing the natural follow-up question: how to make valid inferences for the magnitude of those signals after selection. Ordinary Bayesian credible intervals suffer from selection bias, owing to the fact that the target of inference is chosen adaptively. Existing Bayesian approaches for correcting this bias produce credible intervals with poor frequentist properties, while existing frequentist approaches require sacrificing the benefits of shrinkage typical in Bayesian methods, resulting in confidence intervals that are needlessly wide. We address this gap by proposing a nonparametric empirical-Bayes approach for constructing optimal selection-adjusted confidence sets. Our method produces confidence sets that are as short as possible on average, while both adjusting for selection and maintaining exact frequentist coverage uniformly over the parameter space. Our main theoretical result establishes an important consistency property of our procedure: that under mild conditions, it asymptotically converges to the results of an oracle-Bayes analysis in which the prior distribution of signal sizes is known exactly. Across a series of examples, the method outperforms existing frequentist techniques for post-selection inference, producing confidence sets that are notably shorter but with the same coverage guarantee.

March 2020

  • 3/4/20

    (CANCELLED) | Biostat Seminar: Inference in the Presence of Intractable Normalizing Functions

    Date
    Wednesday, March 4, 2020

    Time
    3:30pm - 4:30pm

    Refreshments at 3:00pm in 51-254 CHS

    Location
    33-105 CHS

    Speaker
    Murali Haran
    Professor and Head, Department of Statistics
    Penn State University

    Abstract
    Models with intractable normalizing functions arise frequently in statistics. Common examples of such models include exponential random graph models for social networks and Markov point processes for ecology and disease modeling. Inference for these models is complicated because the normalizing functions of their probability distributions include the parameters of interest. We provide a framework for understanding existing algorithms for Bayesian inference for these models, comparing their computational and statistical efficiency, and discussing their theoretical bases. We propose an algorithm that provides computational gains over existing methods by replacing Monte Carlo approximations to the normalizing function with a Gaussian process-based approximation. We provide theoretical justification for this method. We also develop a closely related algorithm that is applicable more broadly to any likelihood function that is expensive to evaluate. We illustrate the application of our methods to a variety of challenging simulated and real data examples, including an exponential random graph model, a Markov point process, and a model for infectious disease dynamics. Our algorithms show significant gains in computational efficiency over existing methods, and have the potential for greater gains for more challenging problems. For a random graph model example, this gain in efficiency allows us to carry out Bayesian inference when other algorithms are computationally impractical.

February 2020

  • 2/26/20

    Biostat Seminar: Real-world Evidence in Drug Development and Regulatory Submission

    Date
    Wednesday, February 26, 2020

    Time
    3:30pm - 4:30pm

    Refreshments at 3:00pm in 51-254 CHS

    Location
    33-105A CHS

    Speaker
    Tse L. Lai
    Ray Lyman Wilbur Professor of Statistics
    Stanford University

    Abstract
    There has been growing interest in using real-world data (RWD) and evidence (RWE) for drug development since the passage of the 21st Century Cures Act in Dec 2016. The US FDA released its Framework for Real-World Evidence Program in Dec 2018 and subsequently issued a draft guidance for industry on submitting documents using RWD & E for drugs and biologics. I will discuss statistical challenges and opportunities in using RWD/RWE for drug development and regulatory submission, and describe some ongoing projects that are summarized in my forthcoming book (Chapman & Hall/CRC, 2020) with Richard Baumgartner and Jie Chen of Merck on RWD & E.

  • 2/19/20

    Biostat Seminar: Ghost Data

    Date
    Wednesday, February 19, 2020

    Time
    3:30pm - 4:30pm

    Refreshments at 3:00pm in 51-254 CHS

    Location
    33-105 CHS

    Speaker
    Dennis K.J. Lin
    University Distinguished Professor
    Department of Statistics, The Pennsylvania State University, University Park, USA

    Abstract
    As natural as the real data, ghost data is everywhere—it is just data that you cannot see. We need to learn how to handle it, how to model with it, and how to put it to work. Some examples of ghost data are (see, Sall, 2017):
    (a) Virtual data—it isn’t there until you look at it;
    (b) Missing data—there is a slot to hold a value, but the slot is empty;
    (c) Pretend data—data that is made up;
    (d) Highly Sparse Data—whose absence implies a near zero, and
    (e) Simulation data—data to answer “what if.”
    For example, absence of evidence/data is not evidence of absence. In fact, it can be evidence of something. Moreover, Ghost Data can be extended to other existing areas: Hidden Markov Chain, Two-stage Least Square Estimate, Optimization via Simulation, Partition Model, Topological Data, just to name a few. Three movies will be discussed in this talk: (1) “The Sixth Sense” (Bruce Willis): I can see things that you cannot see; (2) “Sherlock Holmes” (Robert Downey Jr.): absence of expected facts; and (3) “Edge of Tomorrow” (Tom Cruise): how to speed up your learning (AlphaGo-Zero will also be discussed). It will be helpful if you watch these movies before coming to my talk. This is an early stage of my research in this area; any feedback from you is deeply appreciated. Much of the basic idea is highly influenced by Mr. John Sall (JMP-SAS).

  • 2/5/20

    Biostat Seminar: Estimation and Inference for Changepoint Models

    Date
    Wednesday, February 5, 2020

    Time
    3:30pm - 4:30pm

    Refreshments at 3:00pm in 51-254 CHS

    Location
    33-105A CHS

    Speaker
    Sean Jewell
    PhD Candidate
    Department of Statistics at the University of Washington

    Abstract
    This talk is motivated by statistical challenges that arise in the analysis of calcium imaging data, a new technology in neuroscience that makes it possible to record from huge numbers of neurons at single-neuron resolution. In the first part of this talk, I will consider the problem of estimating a neuron’s spike times from calcium imaging data. A simple and natural model suggests a non-convex optimization problem for this task. I will show that by recasting the non-convex problem as a changepoint detection problem, we can efficiently solve it for the global optimum using a clever dynamic programming strategy.

    In the second part of this talk, I will consider quantifying the uncertainty in the estimated spike times. This is a surprisingly difficult task, since the spike times were estimated on the same data that we wish to use for inference. To simplify the discussion, I will focus specifically on the change-in-mean problem, and will consider the null hypothesis that there is no change in mean associated with an estimated changepoint. My proposed approach for this task can be efficiently instantiated for changepoints estimated using binary segmentation and its variants, L0 segmentation, or the fused lasso. Moreover, this framework allows us to condition on much less information than existing approaches, thereby yielding higher-powered tests. These ideas can be easily generalized to the spike estimation problem.

    This talk will feature joint work with Toby Hocking, Paul Fearnhead, and Daniela Witten.

  • 2/3/20

    Biostat Seminar: Estimation and Inference for Changepoint Models

    Date
    Monday, February 3, 2020

    Time
    3:30pm - 4:30pm

    Refreshments at 3:00pm in 51-254 CHS

    Location
    63-105A CHS

    Speaker
    Jessica Gronsbell, PhD
    Data Scientist
    Alphabet’s Verily Life Sciences

    Abstract
    The widespread adoption of electronic health records (EHR) and their subsequent linkage to specimen biorepositories has generated massive amounts of routinely collected medical data for use in translational research. These integrated data sets enable real-world predictive modeling of disease risk and progression. However, data heterogeneity and quality issues impose unique analytical challenges to the development of EHR-based prediction models. For example, ascertainment of validated outcome information, such as presence of a disease condition or treatment response, is particularly challenging as it requires manual chart review. Outcome information is therefore only available for a small number of patients in the cohort of interest, unlike the standard setting where this information is available for all patients. In this talk I will discuss semi-supervised and weakly-supervised learning methods for predictive modeling in such constrained settings where the proportion of labeled data is very small. I demonstrate that leveraging unlabeled examples can improve the efficiency of model estimation and evaluation and in turn substantially reduce the amount of labeled data required for developing prediction models.

January 2020

  • 1/31/20

    Biostat Seminar: Modeling and testing in high-throughput cancer drug screenings

    Date
    Friday, January 31, 2020

    Time
    11:00am - 12:00pm

    Refreshments at 10:30am in 51-254 CHS

    Location
    33-105A CHS

    Speaker
    Wesley Tansey
    Postdoctoral Research Scientist
    Columbia University

    Abstract
    High-throughput drug screens enable biologists to test hundreds of candidate drugs against thousands of cancer cell lines. The sensitivity of a cell line to a drug is driven by the molecular features of the tumor (e.g. gene mutations and expression). In this talk, I will consider two scientific goals at the forefront of cancer biology: (i) predicting drug response from molecular features, and (ii) discovering gene-drug associations that represent candidates for future drug development. I will present an end-to-end model of cancer drug response that combines hierarchical Bayesian modeling with deep neural networks to learn a flexible function from molecular features to drug response. The model achieves the first goal of state-of-the-art predictive performance, but the black box nature of deep learning makes the model difficult to interpret, presenting a barrier to the second goal of uncovering gene-drug associations. I will use this challenge as motivation for the development of a new method, the holdout randomization test (HRT), for conditional independence testing with black box predictive models. Applying the HRT to the deep probabilistic model of cancer drug response yields more biologically-plausible gene-drug associations than the current analysis technique in biology. I will use these projects to illustrate how statisticians can work closely with biologists to create a virtuous cycle where cutting-edge experiments lead to new statistical models and methods, which in turn drive all of science forward.

  • 1/27/20

    Biostat Seminar: Statistical Analysis of Brain Structural Connectomes

    Date
    Monday, January 27, 2020

    Time
    3:30pm - 4:30pm

    Refreshments at 3:00pm in 51-254 CHS

    Location
    63-105A CHS

    Speaker
    Zhengwu Zhang
    Assistant Professor
    University of Rochester

    Abstract
    There have been remarkable advances in imaging technology, used routinely and pervasively in many human studies, that non-invasively measures human brain structure and function. Among them, a particular imaging modality called diffusion magnetic resonance imaging (dMRI) is used to infer shapes of millions of white matter fiber tracts that act as highways for neural activity and communication across the brain. The collection of interconnected fiber tracts is referred to as the brain connectome. There is increasing evidence that an individual’s brain connectome plays a fundamental role in cognitive functioning, behavior, and the risk of developing mental disorders. Improved mechanistic understanding of relationships between brain connectome structure and phenotypes is critical to the prevention and treatment of mental disorders. However, progress in this area has been limited due to the complexity of the data. In this talk, I will present challenges of analyzing such data and our recent progress, including connectome reconstruction and novel statistical modeling methods.

  • 1/22/20

    Biostat Seminar: Bayes in the time of Big Data

    Date
    Wednesday, January 22, 2020

    Time
    3:30pm - 4:30pm

    Refreshments at 3:00pm in 51-254 CHS

    Location
    33-105 CHS

    Speaker
    Andrew Holbrook
    Postdoctoral Scholar
    UCLA, Human Genetics

    Abstract
    Big Bayes is the computationally intensive co-application of big data and large, expressive Bayesian models for the analysis of complex phenomena in scientific inference and statistical learning. Standing as an example, Bayesian multidimensional scaling (MDS) can help scientists learn viral trajectories through space and time, but its computational burden prevents its wider use. Crucial MDS model calculations scale quadratically in the number of observations. We mitigate this limitation through massive parallelization using multi-core central processing units, instruction-level vectorization and graphics processing units (GPUs). Fitting the MDS model using Hamiltonian Monte Carlo, GPUs can deliver more than 100-fold speedups over serial calculations and thus extend Bayesian MDS to a big data setting. To illustrate, we employ Bayesian MDS to infer the rate at which different seasonal influenza virus subtypes use worldwide air traffic to spread around the globe. We examine 5392 viral sequences and their associated 14 million pairwise distances arising from the number of commercial airline seats per year between viral sampling locations. To adjust for shared evolutionary history of the viruses, we implement a phylogenetic extension to the MDS model and learn that subtype H3N2 spreads most effectively, consistent with its epidemic success relative to other seasonal influenza subtypes.

  • 1/15/20

    Biostat Seminar: Bayes in the time of Big Data

    Date
    Wednesday, January 15, 2020

    Time
    3:30pm - 4:30pm

    Refreshments at 3:00pm in 51-254 CHS

    Location
    33-105 CHS

    Speaker
    Lan Luo
    Doctoral Candidate
    Department of Biostatistics
    University of Michigan, Ann Arbor

    Abstract
    This research is largely motivated by the challenges in modeling and analyzing streaming health data, which are becoming increasingly popular data sources in the fields of biomedical science and public health. In this work, the term “streaming data” refers to high throughput recording of large volumes of observations collected sequentially and perpetually over time, such as national disease registries, mobile health, and disease surveillance. Due to the large volume and frequent updates intrinsic to this type of data, major challenges arising from the analysis of streaming data pertain to data storage and information updating. This talk primarily concerns the development of a real-time statistical estimation and inference method for regression analysis, with a particular objective of addressing challenges in streaming data storage and computational efficiency. Termed “renewable estimation”, this method greatly helps overcome the data sharing barrier, reduce data storage cost, and improve computing speed, all without loss of statistical efficiency. The proposed algorithms for streaming real-time regression will be demonstrated in generalized linear models (GLM) for cross-sectional data. I will discuss both conceptual understanding and theoretical guarantees of the renewable method and illustrate its performance via numerical examples. This is joint work with my supervisor Peter Song at the University of Michigan.

  • 1/13/20

    Biostat Seminar: Risk Models with Polygenic Risk Scores

    Date
    Monday, January 13, 2020

    Time
    3:30pm - 4:30pm

    Refreshments at 3:00pm in 51-254 CHS

    Location
    63-105 CHS

    Speaker
    Allison Meisner, PhD
    Postdoctoral Fellow
    Department of Biostatistics
    Johns Hopkins University

    Abstract
    Most complex diseases are the result of environmental variables, genetic factors, and their interaction. In building risk models, it is important to account for each of these components to enable estimation of risk and identification of high-risk subgroups. Historically, research into the genetic determinants of disease has largely focused on the role of individual variants. However, this endeavor is complicated by the fact that most diseases are highly polygenic and result from the combined effect of many variants, each with small effect. A great deal of attention has been paid recently to polygenic risk scores, which represent the total genetic burden of a given trait. Here, I present recent work on utilizing polygenic risk scores in risk models, alongside environmental risk factors. This includes an efficient case-only method for using polygenic risk scores to identify gene-environment interactions and an expansive analysis of the combined utility of polygenic risk scores for specific diseases and mortality risk factors in predicting survival in the UK Biobank, a large cohort study. I will also touch on possibilities for future work in this area, including the use of polygenic risk scores in treatment selection.