Are you interested in helping solve one of the most ubiquitous and insidious problems in social, behavioural, and biomedical research? Then this is the PhD position for you.
Your job
The Department of Methodology and Statistics has a job opening for a PhD candidate. In this position, you will develop novel statistical algorithms to treat missing data in large, longitudinal datasets. Doing so will require you to develop a broad base of statistical expertise and a diverse data analytic toolkit. You will work with Bayesian modelling, (longitudinal) structural equation modelling, high-dimensional prediction algorithms, and computational statistics/numerical methods. You will also have the opportunity to learn high-performance statistical computing techniques and develop highly marketable software development skills. Just for good measure, we’ll throw in four years of intensive on-the-job training in open-science workflows and open-source software development. Are you up for the challenge?
Today’s researchers are blessed with a wealth of high-quality, publicly available, longitudinal datasets. Yet, these data are nearly always incomplete, so fully realising the potential of these rich datasets requires principled missing data treatments. Multiple imputation (MI) is one of the most flexible and broadly applicable principled missing data treatments available. Treating missing data with MI can produce optimal results if the assumptions of the underlying predictive models are satisfied. However, the task of satisfying these assumptions in datasets with hundreds of variables linked by complex, multivariate relations—like the datasets implied above—remains a daunting challenge.
A recent PhD project supervised by Dr Lang (who will also be the daily supervisor for this PhD project) has made promising headway on this problem by combining supervised principal components regression (SPCR) with MI. The MI-SPCA methods developed in that work have shown excellent performance in (high-dimensional) cross-sectional data, but the authors did not consider longitudinal or nested data structures. Clearly, many interesting datasets contain longitudinal measurements or otherwise nested structures, and these nested structures bring additional challenges for any missing data treatment (e.g., the need to preserve random effects, complex growth trajectories, and cross-level interactions).
In this project, you will work under the supervision of a team of three experts in missing data analysis, statistical computing, and longitudinal modelling to extend the abovementioned work to accommodate longitudinal data (e.g., by incorporating multilevel PCA methods). Although these new methods will suit diverse contexts, we will specifically target applications in developmental science. The methods you develop, therefore, will be designed to support valid estimation and inference in popular models of growth and temporal association (e.g., latent growth curves, [random intercept] cross-lagged panel models). You will use Monte Carlo simulation studies to compare the performance of the new methods to the current state of the art in missing data treatment for longitudinal/nested data. As these simulation studies will necessarily entail a high computational demand, they will offer an excellent opportunity to learn and practise high-performance computing techniques.
In addition to the statistical computing that will be woven throughout your methodological research, you will also have the opportunity for hands-on software development experience. A key aspect of the proposed research plan is to distribute the methods we develop via free, open-source software (e.g., a standalone R package or contributions to existing software). Your supervisors are well-equipped to guide you through this process.
Professor van Buuren is the author and maintainer of the mice package, and Dr Lang is one of the developers for the JASP package.
Finally, this project will provide the opportunity to explore best practices in open science. To ensure transparency and maximise chances for peer review, we will use public GitHub repositories for all software development projects. To ensure that our work is available to the widest possible audience, we will distribute any resulting software under a suitable open-source license (e.g., MPL, Apache, MIT). We will publish all papers with open access.