Multiple Chronic Conditions in Research for Emerging Investigators

Missing Data Methods in Longitudinal Data Sets

AGS/AGING LEARNING Collaborative Season 1 Episode 20

Join Terrence E. Murphy, PhD, MS, Pennsylvania State University College of Medicine, and Qian-Li Xue, PhD, Johns Hopkins Bloomberg School of Public Health, as they discuss missing data. They review the different types of missing data mechanisms, analytic approaches to handling missing data, and good practices for working with missing data.


Terrence E. Murphy, PhD, MS: Hello, my name is Terry Murphy. I'm a professor of biostatistics at the Penn State College of Medicine, and this morning we have the great privilege of presenting another podcast within our module of Useful Analytic Approaches. And this morning, the module we're going to discuss is by Dr. Qian-Li Xue from Johns Hopkins University School of Medicine, and his module is entitled Dealing with Missing Data in Longitudinal Data Sets.

Now, Dr. Xue is uniquely qualified to do this. In 2001, he received his Ph.D. in Biostatistics at Johns Hopkins. And that same year, he started working as the Director of [01:00] Biostatistics in the Division of Geriatric Medicine and Gerontology and the Center on Aging and Health. He has stayed at Hopkins for the duration.

In 2014, he received a joint appointment, becoming an associate professor, both in the Department of Biostatistics, as well as in the Department of Epidemiology, in the Johns Hopkins Bloomberg School of Public Health. Since 2014, he has also been an associate professor in the Department of Medicine, the Division of Geriatric Medicine and Gerontology.

He is very well known around the country. He has over 200 publications, and the great majority of those publications have to do with research on aging. [02:00] He has created a very nice module. You know, missing data is one of those things that we don't get taught about in graduate school. And yet, as soon as you go out and start doing applied work, it hits you in the face very quickly, because there's always missing data. The responsible analyst, first of all, documents it, usually in a table one, and then explains, in the statistical analysis sections of both grants and papers, what they do about it.

And so in his module, Dr. Xue first talks about missing data mechanisms, such as missing completely at random, missing at random, and missing not at random. He gives a nice example that shows how missingness in gait speed and leg strength, two [03:00] things often measured in studies of fall prevention and otherwise, can lead to either missing at random or missing not at random.

He discusses ignorable versus non-ignorable missing data. He reviews a few techniques that are commonly used when data are missing at random, such as inverse probability weighting and multiple imputation. He also touches on likelihood-based approaches and reviews basic longitudinal models. He makes a very important point of emphasizing that in fixed effects models, we're often concerned with the slope of the averages, whereas in random effects models, we're actually taking the average of the slopes. He continues in his slides to look at [04:00] longitudinal cases of both missing at random and missing not at random, and provides a nice summary and a list of references.
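The inverse probability weighting idea mentioned above can be made concrete with a small sketch. The following is a minimal, hypothetical numpy illustration (the variable names, the logistic dropout model, and the decile binning are all assumptions for illustration, not taken from Dr. Xue's slides): complete cases are reweighted by the inverse of their estimated probability of being observed, which removes most of the complete-case bias.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 20_000

# Hypothetical survey: education (years) predicts income (in $1,000s).
edu = rng.normal(14, 2, n)
inc = 20 + 3 * edu + rng.normal(0, 5, n)

# MAR dropout: respondents with less education are more likely to be missing.
observed = rng.random(n) > 1 / (1 + np.exp(edu - 14))

# Estimate P(observed | education) nonparametrically by decile bins of
# education, then weight each complete case by the inverse of that probability.
bins = np.quantile(edu, np.linspace(0, 1, 11))
idx = np.clip(np.digitize(edu, bins[1:-1]), 0, 9)
p_obs = np.array([observed[idx == k].mean() for k in range(10)])
w = 1 / p_obs[idx]

cc_mean = inc[observed].mean()                            # biased upward
ipw_mean = np.average(inc[observed], weights=w[observed]) # close to truth
print(f"true mean {inc.mean():.1f} | complete-case {cc_mean:.1f} | IPW {ipw_mean:.1f}")
```

The weighting makes the complete cases stand in for the dropouts who resemble them, which is exactly the missing at random assumption at work.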

So, Dr. Xue, hopefully I've done justice to your slides, and now I thought we would follow up by just discussing at a high level some of your thoughts and experiences on some of the important aspects of missing data.

So starting from the beginning, why is missing data a problem? Why should we care about this? 

Qian-Li Xue, PhD: Okay. Um, before I answer this question, I just wanted to thank you, Dr. Murphy for your introduction and it's my pleasure to talk about missing data with you all today. It is a very important topic. And so to answer your question, missing data certainly can be problematic [05:00] in research and data analysis for several reasons.

It challenges statistical inference because missing data can lead to biased parameter estimates, and it can also, you know, reduce the power for detecting significant effects. If you delete the missing observations, then it certainly reduces the sample size, therefore making it harder to draw meaningful conclusions or detect significant relationships.

Terrence E. Murphy, PhD, MS: Okay. So you're saying it's possibly bias, or some loss of precision, or both. And in statistics, since we're always concerned about bias and precision, missing data is something vitally important to the quality of our inference.

Qian-Li Xue, PhD: That's correct. 

Terrence E. Murphy, PhD, MS: So, when you look at your missing data, how do you [06:00] go about deciding what, if anything, you're going to do about the missing data?

Qian-Li Xue, PhD: Okay, I think that, first of all, the degree to which missing data stand to bias a given analysis depends on, you know, both the analytical models that you're using and the degree to which the observed data represent the target population. So knowledge of the process that leads to the missing values is important.

And this process was codified by Rubin in 1976 as the missing data mechanism. Before deciding what you can do with the missing data in your specific analysis, I think that it probably would be useful to give just a brief review of the different missing data mechanisms. So first is missing completely at [07:00] random, MCAR, which means that the data are missing just like a coin flip, purely by chance, with no clear reason.

So for instance, if in a survey people just forget to answer a question randomly, then this is a missing completely at random situation. And if you have missing completely at random, the inference is still unbiased, but the precision is going to be lower because you are losing information due to the missing observations.

Now the second type is called missing at random, which means that the missing information is linked to other data that you already have. For instance, if you are collecting data on age and income, and then you find that younger people tend to skip income questions, but you also have their education levels, which [08:00] we know are related to income, then you can use education levels to predict the missing incomes. So this is called missing at random because it's connected to available data in your data set.

Now the third type is the most challenging. It's called missing not at random, which means that the missingness is not random chance or related to known information in the data set. It is connected to the values that are missing. So, for example, in the same income survey, if you find out that wealthier individuals are not reporting their income, this would be an example of missing not at random, because the missing income data is directly related to the income values themselves. And if there isn't any information in the data set that can be used to predict [09:00] such missingness, then you have the situation called missing not at random.
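The three mechanisms Dr. Xue describes can be simulated in a few lines. Below is a minimal numpy sketch of a hypothetical income survey (all names and parameters are invented for illustration); notice how the complete-case mean stays honest under MCAR but drifts under MAR and MNAR.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Hypothetical income survey: education (years) predicts income ($1,000s).
education = rng.normal(14, 2, n)
income = 20 + 3 * education + rng.normal(0, 5, n)

# MCAR: missingness is a pure coin flip, unrelated to anything.
mcar_mask = rng.random(n) < 0.3

# MAR: less-educated respondents skip the income question, but education
# is observed, so missingness depends only on observed data.
p_mar = 1 / (1 + np.exp(education - 14))       # lower education -> more missing
mar_mask = rng.random(n) < p_mar

# MNAR: wealthier respondents withhold income -- missingness depends on
# the unobserved value itself.
p_mnar = 1 / (1 + np.exp(-(income - 62) / 5))  # higher income -> more missing
mnar_mask = rng.random(n) < p_mnar

for name, mask in [("MCAR", mcar_mask), ("MAR", mar_mask), ("MNAR", mnar_mask)]:
    print(f"{name}: mean of observed income = {income[~mask].mean():.1f} "
          f"(full-sample mean = {income.mean():.1f})")
```

Under MCAR the observed mean tracks the full-sample mean; under MAR it drifts upward (the observed cases skew toward the more educated), and under MNAR it drifts downward (the wealthy are silently missing).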

If you understand the differences among these three types, then you can think about which of the three scenarios your specific application fits, and then you can choose your methods accordingly. So I'll give you just a couple of examples. For example, the most commonly used naive approach is to simply delete all the records containing missing data.

So this is fine as long as the data are missing completely at random, and then you still get unbiased estimates. But also, as I said, in this case you get bigger standard error estimates because of the reduced sample size, potentially leading to non-significance. [10:00]

So another example is, you know, you may have heard about, you know, missing data imputation, right?

So you can impute missing values in different ways; the simplest is to use the sample mean to impute missing data. However, this will yield valid estimates only if, you know, the data are missing completely at random or missing at random. In other words, you need to make sure that you have information in your data set that you can use to impute values for the missing observations.

Terrence E. Murphy, PhD, MS: I think what you're saying is you have missing data and you want to decide what to do about it, but first you really need to make a decision on whether you think it's missing at random or missing completely at random or missing not at random. 

I love that [11:00] salary income question, because to me, an important conceptual point of missing at random is that it must not be related to the specific value. And I think we can all understand that very wealthy people, or people of very low income, may decide not to report exactly because of the value that they would report. And that is, by its very nature, missing not at random; the missingness is directly tied to the value.

And if that's the case, then the most common tools like multiple imputation are actually not appropriate.

Qian-Li Xue, PhD: That's correct.

Terrence E. Murphy, PhD, MS: So you've mentioned a couple of different types of missing data, and you've referred to multiple imputation. Do these different methods have their own sorts of biases? And what are the biases that might come from these [12:00] different methods?

Qian-Li Xue, PhD: Yes, different handling methods can introduce bias in different ways. So here are a few examples. We just talked about listwise deletion, which is simply deleting all the records with missing data. And I just said that if the data are not missing completely at random, it may lead to loss of information. It can also lead to non-representative results, and therefore it can introduce bias in the parameter estimates.

So now think about a mean imputation approach, right? So if you're going to just use the sample mean to impute missing values, in this case, let's suppose that the imputation is based on valid prediction, meaning that you correctly specify the missing data imputation model.

Under that condition, [13:00] you will have unbiased parameter estimates. However, in this case, you are underestimating the variability, or uncertainty, in your parameter estimates, meaning that you are underestimating their standard errors. Therefore, you may see significant findings where, in truth, the relationship would not have been significant had you observed the missing information.
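That shrinkage is easy to see in a toy example. Below is a minimal numpy sketch (hypothetical data): single mean imputation under MCAR leaves the mean roughly intact but visibly deflates the standard deviation, which is what drives the underestimated standard errors.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 2_000
x = rng.normal(50, 10, n)      # a hypothetical measure with true SD = 10

# Remove 40% of the values completely at random (MCAR).
mask = rng.random(n) < 0.4
x_obs = np.where(mask, np.nan, x)

# Single mean imputation: fill every hole with the observed mean.
x_imp = np.where(np.isnan(x_obs), np.nanmean(x_obs), x_obs)

# The mean survives, but the spread (and hence any standard error
# computed from these data) is artificially shrunk.
print(f"SD of complete data:      {x.std(ddof=1):.2f}")
print(f"SD after mean imputation: {x_imp.std(ddof=1):.2f}")
```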

Now, to account for the uncertainty in the imputed data, we can alternatively do multiple imputation to get a better estimate of the uncertainty in the imputed values. We do so by [14:00] generating multiple data sets with imputed values. This way we can preserve the uncertainty in the missing data imputation, as well as provide unbiased estimates, conditional on the missing at random assumption.
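A bare-bones version of that procedure can be sketched with numpy alone. The example below is a simplified illustration, not a production implementation: it uses stochastic regression imputation (a full multiple imputation would also draw the regression coefficients from their posterior, which this sketch skips for brevity) and then pools the per-dataset estimates with Rubin's rules. All variable names and parameters are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)
n, m = 1_000, 20                      # sample size, number of imputations

# Hypothetical data: education predicts income; income is MAR given education.
edu = rng.normal(14, 2, n)
inc = 20 + 3 * edu + rng.normal(0, 5, n)
miss = rng.random(n) < 1 / (1 + np.exp(edu - 14))   # low education -> missing
inc_obs = np.where(miss, np.nan, inc)

# Fit the imputation model (income ~ education) on complete cases.
cc = ~np.isnan(inc_obs)
X = np.column_stack([np.ones(cc.sum()), edu[cc]])
beta, *_ = np.linalg.lstsq(X, inc_obs[cc], rcond=None)
resid_sd = np.std(inc_obs[cc] - X @ beta, ddof=2)

estimates, variances = [], []
for _ in range(m):
    # Stochastic regression imputation: prediction + random residual noise.
    inc_m = inc_obs.copy()
    inc_m[miss] = beta[0] + beta[1] * edu[miss] + rng.normal(0, resid_sd, miss.sum())
    estimates.append(inc_m.mean())                  # estimate from this data set
    variances.append(inc_m.var(ddof=1) / n)         # its sampling variance

# Rubin's rules: pooled estimate, within- and between-imputation variance.
q_bar = np.mean(estimates)
w = np.mean(variances)                       # within-imputation variance
b = np.var(estimates, ddof=1)                # between-imputation variance
total_var = w + (1 + 1 / m) * b
print(f"Pooled mean income: {q_bar:.1f} (SE {np.sqrt(total_var):.2f})")
print(f"Complete-case mean: {np.nanmean(inc_obs):.1f}")
```

The between-imputation term `b` is exactly the "preserved uncertainty" Dr. Xue mentions: single imputation would set it to zero and report an overconfident standard error.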

And the other commonly used approach is a model-based missing data treatment. In that case, you really want to make sure that the model is specified correctly. That's why it's important to do sensitivity analyses, to make sure that the findings are robust to different modeling assumptions.

Terrence E. Murphy, PhD, MS: You've mentioned multiple imputation several times, and the more one reads about missing data, or reads papers where the statisticians describe their approach, you'll almost always see missing at random come up, [15:00] and it will often be in concert with the use of multiple imputation. So I just want to emphasize what you said about this super useful approach.

Multiple imputation is completely dependent upon you feeling quite confident that the missingness in your data is actually missing at random.

Qian-Li Xue, PhD: Yes. 

Terrence E. Murphy, PhD, MS: So another thing you mentioned in your slides is that in random effects models, there is a sort of imputation that takes place. Would you comment please on that? If you have, you know, a random intercept model, and it's doing some kind of extrapolation of missing values within a certain person, when is it fine to rely on that versus actually imputing missing values of outcomes?

Qian-Li Xue, PhD: [16:00] Sure. I'm happy to answer that question. So, in the case of random effects models, there are a few facts that I think we need to know.

First, random effects models yield unbiased estimates when data are missing completely at random or missing at random, conditional on the observed outcomes and baseline covariates in your model.

Second, you can include not only the random intercept, which captures the degree of heterogeneity in baseline status across your study subjects, but also time as a random effect, as in the case of growth curve modeling.

And that way you can address the issue of missing at random if the missingness has to do with study [17:00] attrition or study dropout over the course of the follow-up.

Now, of course it is true for every single model that correct model specification is the most important aspect of statistical inference in the case of missing data. So if you misspecify your model, it tends to lead to biased inference. Therefore, sensitivity analysis is very important in the case of random effects models in general. Now, there are a few situations in which combining multiple imputation with random effects model fitting could be useful.

Number one, if the inference is about population representation. In other words, you want to be able to generalize your findings to the general population.

But in many survey [18:00] studies, the proportions of people included in the study sample could differ: different by age, different by race. Therefore, in order to account for the different sampling proportions for different demographic groups, you want to make sure to adjust for sampling weights in your analysis.

Now, similarly, let's suppose that the missing data or study dropout varies depending on the particular demographic group. For example, older age groups tend to have greater missingness during follow-up, or a certain racial group tends to have greater or lesser missingness. So in order to account for those, combining random effects models with multiple imputation, for example, could be useful.

Now, multiple imputation [19:00] is also going to be useful because it could be more efficient, meaning that it can provide narrower confidence bands, if you do multiple imputation in combination with the random effects model fitting.

I'm not sure if I have made it clear.

Terrence E. Murphy, PhD, MS: Yes, I think so. I think what you're saying is that they each have strengths in particular cases. If we're talking about a more longitudinally intense design, such as growth curve modeling, then the person specific random effects may be preferred, but in other more general cases, multiple imputation may be better. And it sounds like there are some cases where the combination of the two may be the ideal approach. 

Qian-Li Xue, PhD: Yes, so if I may, [20:00] maybe I can restate a little bit. 

The random effects could be particularly powerful if you are doing growth curve modeling, where the time trajectory is the primary interest. In that case, when you include both the random intercept and the random time slope in your model, that helps address the missing at random situation, as long as you include in your model all the variables that are related to your outcome and also related to the reason for missingness.

Now, in some cases you have not only missing outcomes due to study dropout, but also missing covariates. In that case, having the random slope cannot address the missing covariate situation, so [21:00] imputation for the missing covariates could be very useful, in combination with using, you know, a random time slope to account for study attrition.

Terrence E. Murphy, PhD, MS: So is it fair to say that the random effects in the growth curve modeling perform a sort of imputation of the outcomes, but not necessarily the covariates, whereas the multiple imputation can do a good job with the covariates as well? 

Qian-Li Xue, PhD: Yes, that's correct.

Terrence E. Murphy, PhD, MS: Okay, so their combination can be especially helpful if one has the time and energy to take on that additional bit of complexity in their analysis.

Qian-Li Xue, PhD: Yeah, that's true. 

Terrence E. Murphy, PhD, MS: Okay, very good. And what reporting practices do you recommend to analysts when they're working with missing data? 

Qian-Li Xue, PhD: I think [22:00] transparency is the most important thing.

Um, so you want to describe the missing data by reporting the proportion of missing data and its mechanisms in your research paper. And you want to explain which models and which methods you used to account for missing data, and the reason why you chose a particular approach. And finally, I think it's very important for you to conduct sensitivity analyses, you know, considering different scenarios by which the missing data arise, to assess the robustness of your findings to the different missing data assumptions.

This is particularly important in the case of missing not at random, and also missing at random. Because I mean, the validity of the missing at random assumption [23:00] depends on the knowledge of the predictors of such missingness.

So sometimes it's not that clear, because your observed data will not be able to tell you for sure whether it is missing at random or missing not at random. So the only way to find out is by trying these different scenarios based on different assumptions and seeing how the findings actually change.

So the ideal scenario would be that regardless of how you impute the missing data or which model you use to address the missing data, the finding remains qualitatively similar. That would be the best scenario that you can hope for.
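One common way to run such a sensitivity analysis is a delta adjustment: impute under missing at random, then shift the imputed values by increasing amounts delta to represent ever more severe missing not at random, and watch how the headline estimate moves. Below is a minimal numpy sketch (all data and parameter choices are hypothetical, and it uses deterministic regression imputation to keep the scan easy to read):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 2_000
edu = rng.normal(14, 2, n)
inc = 20 + 3 * edu + rng.normal(0, 5, n)   # hypothetical income in $1,000s
miss = rng.random(n) < 0.3                 # 30% of incomes go missing
inc_obs = np.where(miss, np.nan, inc)

# MAR imputation model fitted on complete cases: income ~ education.
cc = ~miss
X = np.column_stack([np.ones(cc.sum()), edu[cc]])
beta, *_ = np.linalg.lstsq(X, inc_obs[cc], rcond=None)

# Delta adjustment: suppose non-responders actually earn `delta` more than
# the MAR model predicts, and watch how the overall estimate moves.
means = []
for delta in [0, 2, 5, 10]:
    inc_s = np.where(miss, beta[0] + beta[1] * edu + delta, inc_obs)
    means.append(inc_s.mean())
    print(f"delta = {delta:>2}: estimated mean income = {inc_s.mean():.1f}")
```

If the study's qualitative conclusion survives across the plausible range of delta, that is the robustness Dr. Xue describes; if it flips at a small delta, the missing at random assumption is doing a lot of work.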

Terrence E. Murphy, PhD, MS: Okay, so you're emphasizing sensitivity analyses, and I know in my statistical practice, I also like to [24:00] review the ideas of missingness with the content experts to get their input, since, like you say, there's no hard and fast test, right, to clearly tell you whether it's MAR or MNAR.

Uh, you want to do sensitivity analyses and have some experts weigh in to make sure that it's reasonable, right? 

Qian-Li Xue, PhD: That's very much true. 

Terrence E. Murphy, PhD, MS: Are there ethical considerations with missing data?

Qian-Li Xue, PhD: Yes, absolutely. First of all, from an informed consent standpoint, I think it's important to make sure that your study participants know, you know, how their data will be handled, including the potential missing data issues and how the analysis would address those issues.

And secondly, transparency, just like I mentioned before, is very important in terms of reporting how missing data was [25:00] handled in your analysis and its potential impact on study results.

And thirdly, it is important that we do everything we can to minimize missing data. We should, you know, for example, try to standardize the data collection procedures and conduct regular data checks, and also actively try to retain study participants, if it's a longitudinal follow-up study, to make sure that we can minimize the missing data.

We also need to make sure that we collect as much information as possible about the reasons for missing data, because the determination of the missing data mechanism can help us decide on, you know, appropriate models to address missing data.

And the other [26:00] thing is that I think it also has to do with fairness to your study subjects. You know, if you're just naively excluding people who have a lot of missing data, it's possible the reason they have a lot of missing data is their inability to complete the testing that you are conducting.

So in those cases, by excluding them you are affecting the generalizability of your results, and you are not considering perhaps the most important subset of the population you wanted to help, which is the most vulnerable subset of the older population.

And lastly, in terms of a publication, I think it's also ethical to adhere to the publication ethics by clearly reporting any data exclusions and [27:00] handling procedures in your research articles.

Terrence E. Murphy, PhD, MS: Okay, so important ethical considerations, both in how you conduct your handling of missing data and in how you report it in the follow-up work.

Okay, so we've reviewed a lot of things, but if you had to pick, you know, the ideal way to deal with missing data, what would that look like? What would that be?

Qian-Li Xue, PhD: I would have to say prevention, prevention, prevention. You know, preventing missing data from happening is probably the most important; it's the best way to address missing data. And as I just mentioned, every single missing data approach has its own assumptions. You know, sometimes, many times, I think most times, we cannot assess the validity of those assumptions. All we [28:00] can do is try different scenarios and hope that the findings remain the same. So the best approach to handling missing data is to try to prevent it from happening.

So here are a few things that we can do, from a study conduct and design perspective, to minimize missing data. For example, we can, you know, enhance participant engagement by maintaining open communication, establishing rapport, building trust, and also motivating our participants to remain engaged in the study.

And another thing we can do is consider offering incentives to participants for completing assessments or data submissions on time. And the other important thing we could do is make sure that our data collection instruments or surveys are user-friendly, easy to [29:00] understand, and not overly burdensome for participants.

And also it is important to ensure that, you know, the data collection process is standardized, with frequent data checks to minimize errors and address data omissions in real time.

Also, pilot testing data collection instruments and protocols before the main study is important to identify and address any issues that may lead to missing data.

Training of data collection staff is also important. You want to make sure that you provide detailed instructions and demonstrations and also provide opportunities for questions. 

And lastly, I think that, you know, taking advantage of new technology for data collection is also [30:00] a useful way to minimize missing data. You know, consider, for example, using electronic data capture tools and systems, which can have built-in validation checks that minimize errors during data entry.

And also in some cases, if the study participants cannot answer questions or participate in a survey by themselves due to, for example, cognitive impairment, then perhaps the use of proxy respondents would be important to fill in missing information on behalf of the study participants.

So there are many things that we can actually do at the study design and implementation phase to take into account the possibilities that can give rise to missing data.

Terrence E. Murphy, PhD, MS: What I'm hearing you say at the end is that even though we have [31:00] all this knowledge and all this statistical machinery to help us deal with missing data, the most elegant solution is to simply prevent it in the first place.

And to do that effectively, we have to consider it in the design of our data collection instruments, our recruitment, and a whole lot of other factors. And the benefits of that are huge in terms of reducing bias, improving precision, and honoring the integrity of our study and the way it's reported.

So missing data is not a small thing. It's actually a very, very important thing.

And I want to thank you, Dr. Xue, for sharing your experience on this and helping us see how important it is, how carefully we need to think about it, and some of these very important ways that we can address it [32:00] responsibly. Thanks so much for sharing your time and especially your really important work and all the learning you've accumulated in your studies on aging.

Qian-Li Xue, PhD: Thank you. Thank you for having me.