
BIS Seminar: A General Framework for Quantile Estimation with Incomplete Data

October 14, 2020

Linglong Kong

Department of Mathematical and Statistical Sciences

University of Alberta



Transcript

  • 00:00- Hi, everyone.
  • 00:02Welcome to the departmental seminar of
  • 00:04the Department of Biostatistics, Yale University.
  • 00:09I'm pleased to introduce Linglong Kong.
  • 00:12He is an associate professor in the Department of Mathematical
  • 00:16and Statistical Sciences at the University of Alberta.
  • 00:20His research interests are, and correct me if I'm wrong,
  • 00:24in functional and neuroimaging data analysis,
  • 00:27statistical machine learning,
  • 00:29and robust statistics and quantile regression.
  • 00:32So today, he is gonna talk about his work on
  • 00:35a general framework for quantile estimation
  • 00:38with incomplete data.
  • 00:40Thank you, Linglong. And whenever you're ready.
  • 00:44- Thank you, Laura, for the introduction.
  • 00:47And also thanks to Professor John for the invitation.
  • 00:52I'm very happy to be here, although it's way too early.
  • 00:57So today I'm going to talk about a general framework for
  • 01:01quantile estimation with incomplete data.
  • 01:13So, this is joint work with Peisong from the
  • 01:20University of Michigan, Jiwei from the
  • 01:23University of Wisconsin-Madison, and Xingcai.
  • 01:27And we started this work in the second year
  • 01:33after I started my position at the University of Alberta.
  • 01:37I have known Peisong since a long time ago, when he was a student,
  • 01:44and at that time he had just started his position as an
  • 01:48assistant professor at the University of Waterloo.
  • 01:51And I invited him to visit me and afterwards,
  • 01:56he invited me to visit him.
  • 01:58And since we felt like we had visited each other already,
  • 02:02we thought we should get something done.
  • 02:04I remember sitting in his office
  • 02:11at the University of Waterloo and thinking about
  • 02:15what we could do together.
  • 02:17And eventually we thought, "Okay, what am I good at?
  • 02:21Well, my main research area is quantile regression.
  • 02:24And what is Peisong good at?
  • 02:27One of Peisong's research areas is missing data."
  • 02:31So we said maybe we could put them together,
  • 02:34and then we wrote a couple of formulas on paper.
  • 02:41Then we felt like, "Okay, we have got something already."
  • 02:45Then we went to have dinner.
  • 02:48And then one year later Peisong sent me about
  • 02:52two pages of a draft and said maybe we should continue it.
  • 02:57And that became the first scenario of the topic
  • 03:03I'm gonna talk about.
  • 03:04And then another half year later, I sent him my feedback.
  • 03:12I said, "Why don't we make it more general,
  • 03:15make it a framework?
  • 03:17That way, we would be able to apply it
  • 03:20to other scenarios."
  • 03:22And then we both felt it was a good idea,
  • 03:26and we started working on it.
  • 03:27At that time, Jiwei was a postdoc at the University of Waterloo
  • 03:33and Xingcai was my postdoc.
  • 03:35So, we got together and started the project.
  • 03:38Eventually, it wound up as a project that I'm kind of proud of.
  • 03:47So, what's missing data?
  • 03:49Missing data arise in almost all
  • 03:52serious statistical analyses.
  • 03:56Missing values are representative of the
  • 04:03messiness of the real world.
  • 04:05Why would we have missing values?
  • 04:08There could be all kinds of reasons.
  • 04:12For example, it may be due to a social or natural process.
  • 04:17For example, a student graduates and
  • 04:20gets a job; in a clinical trial, people die; and so on.
  • 04:26It could also happen in a survey.
  • 04:29For example, certain questions are asked
  • 04:32only of respondents who answered yes
  • 04:35to an earlier question.
  • 04:38Or maybe the missingness is intentional,
  • 04:41as a part of the data collection process.
  • 04:45Or it comes from some other scenario, including random
  • 04:48data collection issues, respondent refusal, or non-response.
  • 04:56So, mathematically, how do we categorize these kinds of missingness?
  • 05:01Here are the three scenarios.
  • 05:05Now, the first scenario we call missing completely at random.
  • 05:10What does that mean?
  • 05:11That means the missingness has nothing to do with the
  • 05:15person being studied.
  • 05:17The values just went missing completely at random;
  • 05:19it's not related to any feature of this person.
  • 05:23The second scenario is missing at random.
  • 05:26The missingness has to do with the person, but it can be predicted
  • 05:30from other information about the person.
  • 05:34In certain scenarios,
  • 05:39the missingness may be predictable from some
  • 05:43auxiliary variables, some auxiliary information.
  • 05:48The third one is a very hard one: missing not at random.
  • 05:55The missingness depends on unobserved information,
  • 05:59and sometimes even on the response itself.
  • 06:05So, the missingness is specifically related to
  • 06:08what is missing.
  • 06:09For example, a person does not attend a drug test
  • 06:13because the person took drugs the night before.
  • 06:17And therefore, the next day,
  • 06:18he couldn't make it to the drug test,
  • 06:20and we couldn't get that drug test result.
  • 06:23These are the three missing-data mechanisms.
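
As a compact summary in symbols (an editorial sketch, not a slide from the talk): let R = 1 indicate that the value of interest Y is observed, and let X be the fully observed covariates.

```latex
% MCAR: missingness is unrelated to the data
P(R = 1 \mid Y, X) = P(R = 1)
% MAR: missingness depends only on observed information
P(R = 1 \mid Y, X) = P(R = 1 \mid X)
% MNAR: missingness still depends on the unobserved value itself
P(R = 1 \mid Y, X) \neq P(R = 1 \mid X)
```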
  • 06:30How do we handle those missing data?
  • 06:33There are many strategies.
  • 06:35For example, the first one would be,
  • 06:37well, let's try to get the missing data back.
  • 06:40That would be great.
  • 06:42But in reality, that's usually impossible.
  • 06:48The second one is, well, since we have incomplete cases,
  • 06:52let's just discard them.
  • 06:57Just analyze the complete cases, right?
  • 07:02But this could cause some other problems.
  • 07:05We will talk about it.
  • 07:07And the third one is we replace the missing data
  • 07:12with some conservative estimate.
  • 07:14For example, using the sample mean, sample median, and so on.
  • 07:20The fourth one is we try to estimate the missing data
  • 07:25from other data on the person.
  • 07:27We use some sort of more sophisticated method to impute.
  • 07:37Now in particular, mathematically speaking,
  • 07:43among the strategies we use today to deal
  • 07:46with missing data,
  • 07:48the first one is complete case analysis.
  • 07:51This is very simple, okay?
  • 07:52We just analyze the complete cases, okay?
  • 07:56We only take into consideration the individuals with
  • 08:01no missing data.
  • 08:05Sometimes it can provide good results,
  • 08:07but the estimates obtained from this complete case analysis
  • 08:12may be biased if the excluded individuals are systematically
  • 08:18different from those included.
  • 08:20So hence, if the complete cases are a good
  • 08:24representation of the missing cases,
  • 08:28then this method is fine.
  • 08:34Otherwise, if the complete cases are quite different from
  • 08:38those we miss, then our results can be biased.
  • 08:44And then there's the inverse probability weighting (IPW) method.
  • 08:50This is a commonly used method to correct the bias from a
  • 08:53complete case analysis.
  • 08:56What does that mean?
  • 08:57It means, okay, we give each complete case a weight.
  • 09:03This weight is the inverse of the probability of
  • 09:07being a complete case.
  • 09:12Well, this can also cause some bias,
  • 09:16because this IPW relies on a model for the missingness probability.
  • 09:25The third strategy is more sophisticated:
  • 09:29multiple imputation.
  • 09:31It's quite a common method,
  • 09:32especially nowadays in genetic studies.
  • 09:35How do we do multiple imputation?
  • 09:39We create multiple sets of imputations for
  • 09:44the missing values, using an imputation process
  • 09:48with a random component.
  • 09:51Now, we have full data sets.
  • 09:54Then we analyze each data set.
  • 09:59Those full data sets can be a little bit different,
  • 10:02slightly different, because of the randomness of
  • 10:08the imputation process.
  • 10:11Anyway, we analyze those completed data sets,
  • 10:14and from each one we get a set of parameter estimates.
  • 10:17Then we can combine those results,
  • 10:20and hopefully
  • 10:22we get a better result.
  • 10:26Multiple imputation sometimes works quite well,
  • 10:31but only if the missingness can be ignored
  • 10:36and we have good imputation models.
  • 10:39And that depends on the nature of the data
  • 10:41and on what kind of imputation model
  • 10:45we are going to use.
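
To make the multiple imputation recipe above concrete, here is a minimal, self-contained sketch: impute with a random component, analyze each completed data set, and combine with Rubin's rules. The normal linear imputation model, the toy data, and the target (a simple mean) are illustrative assumptions, not the speaker's setup.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
n, M = 500, 20                             # sample size, number of imputations
X = rng.normal(size=n)
Y = 2.0 + X + rng.normal(size=n)
R = rng.binomial(1, 1 / (1 + np.exp(-X)))  # R = 1 means Y observed (MAR given X)
obs = R == 1

# Fit an imputation model f(Y | X) on the complete cases.
model = LinearRegression().fit(X[obs, None], Y[obs])
resid_sd = np.std(Y[obs] - model.predict(X[obs, None]))

estimates, variances = [], []
for _ in range(M):
    Y_imp = Y.copy()
    # The random component: draw missing Y from the fitted conditional model.
    Y_imp[~obs] = (model.predict(X[~obs, None])
                   + resid_sd * rng.normal(size=(~obs).sum()))
    estimates.append(Y_imp.mean())          # analyze each completed data set
    variances.append(Y_imp.var(ddof=1) / n)

# Rubin's rules: combine the M estimates and their variances.
est = np.mean(estimates)
within = np.mean(variances)
between = np.var(estimates, ddof=1)
total_var = within + (1 + 1 / M) * between
print(f"MI estimate of E(Y): {est:.3f}, std. error: {np.sqrt(total_var):.3f}")
```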
  • 10:51Now, those are the strategies
  • 10:56we use to deal with missing data.
  • 11:01But let's match them up with the missing-data mechanisms:
  • 11:06how do we use these strategies to deal with
  • 11:11the different missing mechanisms?
  • 11:14For example, if the data are missing completely at random,
  • 11:19now in this case, the complete case analysis is quite good.
  • 11:25Multiple imputation or any other imputation method
  • 11:29is also okay,
  • 11:31also valid.
  • 11:32So, this missing completely at random is
  • 11:36the easiest case to deal with.
  • 11:40What if the data are missing at random?
  • 11:43Then in this case, some complete case analyses are valid,
  • 11:51and multiple imputation is nearly okay too,
  • 11:56if the bias is negligible.
  • 12:00Now in the last case,
  • 12:02if the data are missing not at random,
  • 12:05then we have to model the missingness explicitly.
  • 12:11We need joint modeling:
  • 12:15we need to jointly model the response
  • 12:17and also the missingness.
  • 12:22In practice, of course,
  • 12:23we try to assume missing at random whenever possible
  • 12:28and try to avoid dealing with the
  • 12:32missing-not-at-random situation.
  • 12:34But in reality, it's not something that we can control.
  • 12:41Sometimes the data really are missing not at random.
  • 12:45I think, in that case, there is even a special issue
  • 12:53dedicated to the missing-not-at-random situation.
  • 13:02Now, we have different strategies.
  • 13:04And as we said, these different strategies
  • 13:07have different advantages and disadvantages.
  • 13:12For example, multiple imputation is generally more efficient
  • 13:17than IPW, but it's more complex.
  • 13:23And the imputation and IPW approaches
  • 13:28require modeling the data distribution
  • 13:32and the missingness probability, respectively.
  • 13:35For imputation, we need to model the data distribution.
  • 13:39For IPW, we need to model the missingness probability.
  • 13:45And also, for all of these strategies,
  • 13:48we only have good properties
  • 13:52if the corresponding model is correctly specified.
  • 13:59Most existing methods are vulnerable to
  • 14:03such model misspecification.
  • 14:06Of course, we can use nonparametric methods to reduce the risk
  • 14:11of model misspecification, but it's often impractical
  • 14:16due to the curse of dimensionality.
  • 14:21So now, how do we deal with this model misspecification?
  • 14:27We have some methods available.
  • 14:30For example, we can use a double robust method.
  • 14:37In particular, among double robust methods,
  • 14:40we have this augmented IPW (AIPW).
  • 14:44We not only model the missingness probability,
  • 14:49but also the data distribution.
  • 14:52Why is it double robust?
  • 14:54Because the result will be consistent
  • 14:58if either model is correct.
  • 15:02If the way we model the missingness probability
  • 15:07or the way we model the distribution is correct,
  • 15:12then we will get a consistent result.
  • 15:14And that's why it's called double robust.
  • 15:18Well, if we are not satisfied with double robustness,
  • 15:22what about a multiple guarantee?
  • 15:25So, we have multiple robustness.
  • 15:27This was proposed by Peisong.
  • 15:33The multiple robust method is proposed to accommodate
  • 15:38multiple models for the missingness probability
  • 15:42and the data distribution.
  • 15:45In double robustness, we have only one model for the missingness
  • 15:48probability and one model for the data distribution.
  • 15:51Well, for multiple robustness,
  • 15:54we can have multiple models for the missingness probability,
  • 15:59and we can have multiple models for the data distribution.
  • 16:05The good thing is that the estimation result will be consistent
  • 16:11if any one of the models is correct.
  • 16:19Now, let's look at these questions mathematically.
  • 16:26So, we are looking at missing at random.
  • 16:29We assume the observed data are i.i.d.,
  • 16:34so we have i.i.d. copies of (R, RY, X).
  • 16:38We use R to denote the missingness indicator, and for the IPW estimator,
  • 16:48essentially we are trying to solve this estimating equation.
  • 16:52And here, this π is the probability
  • 16:58of being a complete case.
  • 17:01And IPW is consistent
  • 17:03only if this π(X) is correctly specified.
  • 17:08And then, from the equation,
  • 17:10we can get a consistent estimate of the
  • 17:13parameter we are interested in.
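
In symbols, the IPW estimating equation for the τth marginal quantile q takes roughly the following form (a reconstruction from the setup just described, not a copy of the slide):

```latex
\sum_{i=1}^{n} \frac{R_i}{\widehat{\pi}(X_i)} \, \psi_\tau(Y_i - q) \approx 0,
\qquad \psi_\tau(r) = \tau - I(r < 0),
```

where π(X) = P(R = 1 | X) is the probability of being a complete case.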
  • 17:17This is IPW. The other one is imputation.
  • 17:23For imputation, we need to model the data distribution.
  • 17:28And here we have a model for f(Y|X).
  • 17:36And as you can see,
  • 17:37we have our imputation for those missing data.
  • 17:44This imputation is consistent
  • 17:47only if the data distribution is correctly modeled,
  • 17:52that is, if this f(Y|X) is correctly modeled.
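
The imputation counterpart, sketched in the same assumed notation, replaces the contribution of a missing Y by its conditional expectation under the fitted model f̂(Y | X):

```latex
\sum_{i=1}^{n} \Big[ R_i \, \psi_\tau(Y_i - q)
  + (1 - R_i) \, \widehat{E}\{\psi_\tau(Y - q) \mid X_i\} \Big] \approx 0.
```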
  • 17:58Now, for the augmented inverse probability weighted method,
  • 18:05we actually combine these two together.
  • 18:11We have the first part from IPW,
  • 18:14the second part from imputation.
  • 18:17So the estimation result will be consistent
  • 18:23if either the model for the missingness probability
  • 18:28or the model for data distribution is correctly specified.
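
Combining the two gives the standard AIPW form, again as a hedged sketch in the assumed notation:

```latex
\sum_{i=1}^{n} \Big[ \frac{R_i}{\widehat{\pi}(X_i)} \, \psi_\tau(Y_i - q)
  - \Big\{ \frac{R_i}{\widehat{\pi}(X_i)} - 1 \Big\}
    \widehat{E}\{\psi_\tau(Y - q) \mid X_i\} \Big] \approx 0,
```

which is consistent if either π̂ or f̂ is correctly specified.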
  • 18:35Well, the multiple robust method
  • 18:38has a series of models for the missingness probability
  • 18:44and a series of models for the data distribution.
  • 18:49And the result will be consistent
  • 18:53if any one model is correctly specified.
  • 19:01Well, this was just
  • 19:03a quick review of missing data.
  • 19:07Like I said, this is
  • 19:12one of Peisong's research areas.
  • 19:14For me, my research area is quantile regression.
  • 19:18So, turning to quantile regression, at that time
  • 19:23we were thinking, "Okay, those methods,
  • 19:26these IPW, AIPW, or double robust methods,
  • 19:32multiple robust methods, have been quite well studied
  • 19:35for when we model the conditional mean.
  • 19:39But for the conditional quantile, there are not
  • 19:41a lot of methods available."
  • 19:44Why do we care about quantiles?
  • 19:46Quantiles not only provide the central features
  • 19:49of the distribution, but also capture the tail behavior.
  • 19:57And also under very mild conditions,
  • 20:01the quantile function can uniquely determine
  • 20:05the underlying distribution.
  • 20:07So, there are a lot of advantages to modeling the quantiles.
  • 20:13Then, we decided to study missingness
  • 20:18in quantile estimation.
  • 20:21In particular, we proposed a general framework
  • 20:23for quantile estimation with missing data.
  • 20:30So, our proposed framework
  • 20:35can handle many kinds of
  • 20:38quantile estimation with missingness.
  • 20:43But in this paper,
  • 20:46we particularly applied our proposed method to
  • 20:50three scenarios,
  • 20:52okay, three commonly encountered situations.
  • 20:56In the first one, we try to estimate
  • 21:01the marginal quantile of the response.
  • 21:04This response has some missingness,
  • 21:09while there are fully observed covariates.
  • 21:13That's the first scenario: the response has some missingness
  • 21:16while the corresponding covariates are fully observed.
  • 21:20The second scenario, we are looking at
  • 21:23the conditional quantile of a fully observed response.
  • 21:28In this scenario,
  • 21:31some covariates are only partially available.
  • 21:36So, we have some missingness in the covariates.
  • 21:39And then the third scenario, we are still looking at
  • 21:43the conditional quantile of a response.
  • 21:47And in this case, the response has some missingness,
  • 21:52and we have fully observed covariates
  • 21:55and also extra auxiliary variables.
  • 22:02Now, let's look at the first situation.
  • 22:07We want to estimate the marginal quantile.
  • 22:10In this scenario, the response has some missingness
  • 22:18and the covariates are fully observed.
  • 22:22Now, let m be the number of subjects with
  • 22:26data completely observed.
  • 22:30Then our method consists of the following five steps.
  • 22:38In the first step, we calculate, or estimate, this α.
  • 22:45This α is related to the missingness probability, okay?
  • 22:52The way we estimate it is by maximizing
  • 22:57the binomial likelihood.
  • 23:01So, in the first step we estimate the α,
  • 23:03and then we get an estimate of the missingness probability.
  • 23:10Okay?
  • 23:11In the second step, we calculate γ.
  • 23:16This γ is related to the data distribution.
  • 23:21So, we maximize the likelihood under the data distribution model.
  • 23:25This γ is a parameter related to the distribution.
  • 23:33And then in the third step, we get a
  • 23:39sort of preliminary estimate of the quantile,
  • 23:43the marginal quantile, through this imputation process,
  • 23:51by solving this equation.
  • 23:55And as you can see this is quite close to the AIPW scenario.
  • 24:05Okay?
  • 24:06And in this equation, this ψ is the score function
  • 24:13of the quantile loss function.
  • 24:17This ψ_τ(r) is τ - I(r < 0).
  • 24:23This is the generalized derivative
  • 24:28of the quantile loss function, okay?
  • 24:34Here, this equation cannot be made exactly zero.
  • 24:39The reason is that this ψ_τ is a non-smooth function,
  • 24:46so sometimes it won't be exactly zero.
  • 24:53That's basically the third step, okay?
  • 24:57Now, we have a preliminary estimator
  • 25:01of the marginal quantile.
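
One plausible form of this step-three equation, written as a hedged sketch (the paper's exact display may differ), approximates the conditional expectation by L random draws Y_i^(l) from the fitted f̂(Y | X_i), giving an AIPW-style equation for the preliminary estimate q̃:

```latex
\frac{1}{n} \sum_{i=1}^{n} \Big[ \frac{R_i}{\widehat{\pi}(X_i)} \, \psi_\tau(Y_i - \tilde{q})
  - \Big\{ \frac{R_i}{\widehat{\pi}(X_i)} - 1 \Big\}
    \frac{1}{L} \sum_{l=1}^{L} \psi_\tau\big(Y_i^{(l)} - \tilde{q}\big) \Big] \approx 0,
```

solved approximately, with ≈ rather than = because ψ_τ is non-smooth.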
  • 25:03The fourth step is the key step of the method;
  • 25:08it is where the multiple robustness is coming from.
  • 25:14Now, we calculate weights for the complete cases.
  • 25:19In total, we have m complete cases.
  • 25:21For each case, we calculate a weight.
  • 25:24As you can see, the weight is determined by three parts.
  • 25:32The first part is related to this alpha,
  • 25:36which is related to the missingness probability, okay?
  • 25:40The missingness probability.
  • 25:43The second part is related to this gamma.
  • 25:47This is related to the data distribution.
  • 25:52The third part is related to this q,
  • 25:56the preliminary estimate of the marginal quantile,
  • 26:02which comes from the third step.
  • 26:07As you can see, in the first three steps
  • 26:10we are getting ready for this,
  • 26:14getting the estimates needed for the weights for the complete cases,
  • 26:18for the complete cases.
  • 26:21And also, we have our parameter ρ,
  • 26:23which is obtained through
  • 26:27minimizing this equation, through minimizing this equation.
  • 26:33Now, after we calculate the weights,
  • 26:36we get our final estimate, the multiple robust estimate,
  • 26:42by solving the following weighted estimating equation.
  • 26:50This wi is the weight.
  • 26:52We estimated it from the first four steps.
  • 26:58And this ψ is the score function of the quantile loss, okay?
  • 27:06Now, you may be wondering what's going on
  • 27:10with these five steps.
  • 27:14Let me try to explain them one by one, okay?
  • 27:20In the first step, we get the estimate of α, okay?
  • 27:24We get the estimate of α.
  • 27:28In essence, we are trying to model the missingness probability, okay?
  • 27:33The missingness probability.
  • 27:35And of course, this missingness probability estimate is consistent
  • 27:41only if the model is correctly specified, okay?
  • 27:45So in the first step, we actually have multiple models
  • 27:49to model the missingness probability.
  • 27:52And we hope that at least one model is correct.
  • 27:57Otherwise, the missingness probability
  • 28:00will not be correctly specified.
  • 28:05Well, in the second step, we estimate γ.
  • 28:09We are trying to model the data distribution,
  • 28:14and we have multiple models for the data distribution.
  • 28:20And then in the third step,
  • 28:22we are sort of doing an imputation-based estimate
  • 28:26of the marginal quantile.
  • 28:32And this marginal quantile will be correctly estimated
  • 28:42if the data distribution is correctly specified.
  • 28:50Now for the key step,
  • 28:53(coughs)
  • 28:54Excuse me.
  • 28:55Step four is a typical formulation of
  • 28:59an empirical likelihood problem.
  • 29:03I will get back to this in the next slide,
  • 29:08why it's an empirical likelihood problem.
  • 29:12And this is the key contribution of the methodology.
  • 29:18Now, in step five, we have the structure of IPW, okay?
  • 29:23For the complete cases, we have weights to correct the bias, okay?
  • 29:32And this weight actually comes from two parts.
  • 29:35One part is from the missingness probability.
  • 29:41The other part is from the data distribution.
  • 29:45Now, the weight actually does not distinguish between
  • 29:48the missingness probability and the data distribution.
  • 29:54The weight treats them equally.
  • 29:59And another note I want to make is that steps two and four
  • 30:03are based on the complete cases only.
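
Here is a minimal end-to-end sketch of these five steps for case one, with one working model of each type to keep it short. Everything here is an illustrative assumption: the toy data, the logistic and normal working models, the particular calibration functions stacked into g, and the simple root-finding; it is not the authors' code.

```python
import numpy as np
from scipy.optimize import brentq, fsolve
from sklearn.linear_model import LinearRegression, LogisticRegression

rng = np.random.default_rng(0)
tau, n = 0.5, 2000                              # quantile level, sample size

# Toy data: Y is missing at random given the fully observed X.
X = rng.normal(size=n)
Y = 1.0 + X + rng.normal(size=n)
R = rng.binomial(1, 1 / (1 + np.exp(-(0.5 + X))))   # R = 1: complete case
obs = R == 1
m = obs.sum()                                   # number of complete cases

def psi(r, tau):
    """Score (generalized derivative) of the quantile loss: tau - I(r < 0)."""
    return tau - (r < 0)

# Step 1: one working model for the missingness probability pi(X).
pi_hat = (LogisticRegression(C=1e6)             # essentially unpenalized
          .fit(X[:, None], R).predict_proba(X[:, None])[:, 1])

# Step 2: one working model for the data distribution f(Y | X): normal linear.
lin = LinearRegression().fit(X[obs, None], Y[obs])
mu_hat = lin.predict(X[:, None])
sigma_hat = np.std(Y[obs] - mu_hat[obs])

# Step 3: preliminary imputation-based estimate of the marginal quantile,
# averaging psi over L draws from the fitted f(Y | X) for the missing cases.
L = 10
draws = mu_hat[:, None] + sigma_hat * rng.normal(size=(n, L))
def step3_eq(q):
    term = np.where(obs, psi(Y - q, tau), psi(draws - q, tau).mean(axis=1))
    return term.mean()
q_prelim = brentq(step3_eq, Y[obs].min(), Y[obs].max())

# Step 4: empirical likelihood weights for the m complete cases:
# w_i = (1/m) / (1 + rho' g_i), with rho solving the dual equation; g_i
# stacks centered calibration functions built from steps 1-3.
g1 = pi_hat[obs] - pi_hat.mean()                      # from the missingness model
g2 = (psi(Y[obs] - q_prelim, tau)
      - psi(draws[obs] - q_prelim, tau).mean(axis=1)) # from the outcome model
G = np.column_stack([g1, g2])
def el_dual(rho):
    return (G / (1.0 + G @ rho)[:, None]).mean(axis=0)
rho_hat = fsolve(el_dual, np.zeros(G.shape[1]))
w = (1.0 / m) / (1.0 + G @ rho_hat)                   # nonnegative, sum to one

# Step 5: the final multiple robust estimate solves the weighted equation
# sum_i w_i * psi(Y_i - q) = 0 over the complete cases.
def step5_eq(q):
    return np.sum(w * psi(Y[obs] - q, tau))
q_mr = brentq(step5_eq, Y[obs].min(), Y[obs].max())
print(f"preliminary: {q_prelim:.3f}  multiple robust: {q_mr:.3f}")
```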
  • 30:12Now, let's look at step four.
  • 30:15Okay? Let's look at step four.
  • 30:18In step four, we use the assumption of missing at random.
  • 30:26It's easy to verify this, okay?
  • 30:29It's that E[w(X){b(X) - E{b(X)}} | R = 1] = 0, where w(X) is
  • 30:34the inverse of the missingness probability, okay?
  • 30:43And in this case, we can let b(X) be the score function
  • 30:48of the quantile loss function.
  • 30:52And these expectations are conditional expectations,
  • 30:56conditional probabilities computed under this density.
  • 31:01And because of this, okay,
  • 31:06we can easily write the sample version, the sample analogue.
  • 31:14So, it looks like this.
  • 31:16All the weights are nonnegative,
  • 31:19the sum of the weights is one,
  • 31:22and this is the estimating equation part,
  • 31:25the estimating equation constraint.
  • 31:29As you can see,
  • 31:30this is a typical empirical likelihood scenario.
  • 31:40So, this is a typical formulation for empirical likelihood.
  • 31:47And the solution actually can be given in closed form,
  • 31:55as previously; it can be given by this one, okay?
  • 32:02The weights can be determined by this,
  • 32:05and this ρ̂ can be estimated by solving this equation.
  • 32:16Okay?
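
In the standard empirical likelihood formulation just described, the closed-form solution has the following shape (assumed notation: ĝ_i stacks the centered calibration functions from steps one to three, and ρ̂ is the Lagrange multiplier minimized over in step four):

```latex
w_i = \frac{1}{m} \cdot \frac{1}{1 + \widehat{\rho}^{\top} \widehat{g}_i},
\qquad
\frac{1}{m} \sum_{i:\,R_i = 1} \frac{\widehat{g}_i}{1 + \widehat{\rho}^{\top} \widehat{g}_i} = 0,
```

so the weights are nonnegative and sum to one over the m complete cases.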
  • 32:19So, that's all key steps for this methodology, okay?
  • 32:29This, actually, is the formula we first wrote down
  • 32:35on the paper.
  • 32:36And then we thought, "Okay, this might also be able
  • 32:40to be applied to other scenarios."
  • 32:44Indeed it can be applied in other scenarios.
  • 32:48For example, in this quantile regression
  • 32:52with missing covariates.
  • 32:55In this scenario, our parameter of interest is β0.
  • 33:01This β0 comes from this linear quantile regression.
  • 33:05We want to estimate this β0.
  • 33:10And our covariates have two parts, X1 and X2.
  • 33:17The X1 part is always observed,
  • 33:22while this X2 part may have some missingness.
  • 33:27So, the observed data
  • 33:31are i.i.d. copies of this format:
  • 33:33the missingness indicator, the completely observed response,
  • 33:43the completely observed covariates X1,
  • 33:45and the partially observed covariates X2, okay?
  • 33:49So, in this setting, we want to estimate β0,
  • 33:55as in the previous scenario.
  • 33:59We have two sets of models, okay?
  • 34:02One set of models is for π, the missingness probability.
  • 34:08And the other set of models is for the data distribution.
  • 34:15Here the distribution is that of X2,
  • 34:19conditional on the response
  • 34:21and the completely observed X1.
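
Summarizing the case-two setup in assumed notation (a sketch, not the slide):

```latex
% Linear quantile regression at level tau:
Q_\tau(Y \mid X) = X^{\top} \beta_0, \qquad X = (X_1^{\top}, X_2^{\top})^{\top}
% Observed data: i.i.d. copies of (R, Y, X_1, R X_2)
% Two sets of working models:
\pi(Y, X_1) = P(R = 1 \mid Y, X_1)
\quad \text{and} \quad
f(X_2 \mid Y, X_1)
```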
  • 34:27Now, as previous, we have five steps.
  • 34:35Steps one and two are the same as in case one.
  • 34:40In step one, we estimate the missingness probability.
  • 34:45In step two, we estimate the data distribution.
  • 34:53And then in step three,
  • 34:55we get a preliminary imputation estimate of β0
  • 35:03by solving this seemingly very complicated equation.
  • 35:09And here there's X(l), which has two parts,
  • 35:17the complete part and the missing part.
  • 35:21The missing part is randomly drawn
  • 35:24from the data distribution
  • 35:28we estimated in step two.
  • 35:32And then step four, okay?
  • 35:35The key is the empirical likelihood part,
  • 35:39which we use to compute the weights.
  • 35:46These weights are for the complete cases.
  • 35:50As previously, this weight depends on three parts.
  • 35:59One is the missingness probability, through α; one is the distribution,
  • 36:05through γ. Previously, it depended on the preliminary estimate
  • 36:09of the marginal quantile.
  • 36:11Now, it's related to the preliminary estimate of the
  • 36:18linear quantile regression coefficient β.
  • 36:22Okay?
  • 36:23After we estimate these weights wi,
  • 36:27then we can go to the estimating equation part, okay?
  • 36:35So, those are the five steps.
  • 36:40As you can see, in steps one, two, and three,
  • 36:44we adapt existing methods to estimate
  • 36:56the missingness probability, the data distribution,
  • 37:02and also to impute to get a preliminary estimate
  • 37:05of the parameter we are interested in.
  • 37:08And then from all these,
  • 37:10we pull all this information together to get
  • 37:12good weights for the complete cases.
  • 37:18And then, using this empirical likelihood method,
  • 37:25we adjust the complete cases with the
  • 37:31estimated weights to get a final estimate,
  • 37:34the final multiple robust estimate.
  • 37:41Now, case three, okay?
  • 37:45In case three, the parameter we are interested in
  • 37:49is still β0,
  • 37:51in this linear quantile regression here.
  • 37:55The scenario is that the full-data vector is (Y, X).
  • 38:02In this scenario, Y is missing at random, okay?
  • 38:07Of course, the simple complete case analysis
  • 38:10will lead to a consistent estimate,
  • 38:14but it doesn't mean it will be optimal.
  • 38:18Here we are trying to get a more sophisticated
  • 38:21but still very practical method.
  • 38:30We have some auxiliary variables.
  • 38:33These auxiliary variables are
  • 38:36usually not of main study interest,
  • 38:40and thus do not enter the quantile regression model.
  • 38:43However, we can use it to help us to explain
  • 38:48the missingness mechanism
  • 38:51and to help us to build a more plausible model
  • 38:55for the conditional distribution of Y.
  • 39:00Now, here are the observed data.
  • 39:06So, we now have i.i.d. copies of (R, RY, X, S):
  • 39:12this Y has some missingness, X is completely observed,
  • 39:19and we also have the auxiliary variable S.
  • 39:23We have the missing at random scenario.
  • 39:27We use π(X, S) to denote the probability,
  • 39:34and we use f(Y| X, S) to denote conditional density.
  • 39:40As previously, we have multiple models
  • 39:43for the missingness probability,
  • 39:46and we have multiple models for the data distribution.
  • 39:56And then once again, we have all five steps.
  • 40:00In the first step, we model the missingness probability.
  • 40:05And here we have this additional auxiliary variable.
  • 40:10In the second step, we model the data distribution.
  • 40:14Again, we have this auxiliary variable.
  • 40:17And then in step three,
  • 40:18we get a preliminary estimate
  • 40:21using this imputation method.
  • 40:24We have our preliminary estimate of the parameter
  • 40:28we are interested in,
  • 40:30which is the linear quantile regression coefficient here.
  • 40:36And then after the preparation of step one,
  • 40:39step two, and step three,
  • 40:41we are finally able to estimate our weights, okay?
  • 40:46Our weights are for the complete cases.
  • 40:50And from the formula here,
  • 40:52you can tell why I put this scenario as scenario three:
  • 40:55it gets more and more complicated.
  • 40:59Although the weight still depends on three parts,
  • 41:02related to the first three steps.
  • 41:05The missingness probability is related to this α,
  • 41:08the data distribution is related to this γ,
  • 41:12and the preliminary estimate is made by using the imputation
  • 41:19in step three.
  • 41:25And once we get the weights through
  • 41:28this empirical likelihood method,
  • 41:30we then put them into this estimating equation.
  • 41:34Adjusted by these weights, we can get our proposed estimator,
  • 41:39a multiple robust estimator of
  • 41:41the linear quantile regression coefficient.
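
As a small illustration of this final step, here is a sketch that solves the weighted quantile regression estimating equation by minimizing the weighted check loss over the complete cases. The placeholder data and the uniform random weights stand in for the empirical likelihood weights from step four; none of this is the authors' code.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(3)
tau, m = 0.5, 300                         # quantile level, complete cases
X = np.column_stack([np.ones(m), rng.normal(size=(m, 2))])
beta_true = np.array([1.0, 2.0, -1.0])
y = X @ beta_true + rng.normal(size=m)
w = rng.uniform(0.5, 1.5, size=m)
w /= w.sum()                              # placeholder for the EL weights

def weighted_check_loss(beta):
    # rho_tau(r) = r * (tau - I(r < 0)), weighted by w_i
    r = y - X @ beta
    return np.sum(w * r * (tau - (r < 0)))

beta_hat = minimize(weighted_check_loss, x0=np.zeros(3), method="Nelder-Mead").x
print("weighted quantile regression estimate:", beta_hat.round(3))
```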
  • 41:48Okay.
  • 41:50(coughs)
  • 41:51Our method, our framework in general,
  • 41:55has these five steps; the key thing is that step four
  • 41:58uses the empirical likelihood method to estimate the weights.
  • 42:03We estimate the missingness probability and the data distribution,
  • 42:06and we applied our framework in these three scenarios.
  • 42:13Of course there are some other scenarios,
  • 42:15and you can easily adapt these five steps to them.
  • 42:20Now, let's look at some theoretical properties,
  • 42:23why we propose these seemingly complicated five steps.
  • 42:30We first look at case one. There are two parts.
  • 42:36The first theorem is about consistency.
  • 42:40The second theorem is about asymptotic normality, okay?
  • 42:46So, under certain conditions, if...
  • 42:51Remember, we have two sets of models.
  • 42:53With one set of models, we model the missingness probability.
  • 42:57With the other set of models, we model the data distribution.
  • 43:02So if either one model from the set
  • 43:07modeling the missingness probability
  • 43:12or one model from the set modeling the data distribution,
  • 43:15if either one is correctly specified, okay?
  • 43:21Then, our estimate will be consistent.
  • 43:26Our estimate will be consistent.
  • 43:28So, our proposed method allows you to make mistakes, okay?
  • 43:37But if you make at least one right decision,
  • 43:44then you get a consistent result, okay?
  • 43:49Of course, if you make all the bad decisions,
  • 43:52and didn't choose any correct model in
  • 43:55these two sets of models, then you probably won't be able
  • 43:59to get that consistent result.
  • 44:01Right?
  • 44:04And then the second theorem is about
  • 44:07the asymptotic normality.
  • 44:09Under certain conditions, our estimate,
  • 44:17the multiple robust estimate of the marginal quantile,
  • 44:20will have an asymptotically normal distribution
  • 44:23with mean zero and a variance
  • 44:28related to this variable.
  • 44:30The variance is related to this ζ1 random variable.
  • 44:38And as you can see, the variance of this ζ1
  • 44:46actually comes from three parts:
  • 44:50the estimation of the missingness probability,
  • 44:53the estimation of the data distribution,
  • 44:56and also the imputation process, okay?
  • 45:00That's for case one.
  • 45:02Similarly for case two, we have these two theorems.
  • 45:08One is consistency.
  • 45:11As long as one model is correctly specified,
  • 45:14we have this consistency.
  • 45:17And then the asymptotic normality:
  • 45:21we have an asymptotically normal distribution.
  • 45:23And the variance, this ζ2, as you can see,
  • 45:28is related to the first three steps
  • 45:32that estimate the different components, okay?
  • 45:38And then case three: two theorems.
  • 45:43For consistency, we need at least one model.
  • 45:47As long as one model is correctly specified,
  • 45:50we have a consistent result.
  • 45:53And we have this asymptotic normality,
  • 45:56and the variance, this ζ3, comes from three parts. Okay?
  • 46:02As you can see, this is a very complicated formula.
  • 46:07The model is getting more and more complicated.
  • 46:10And also, you can compare the variance
  • 46:15of this ζ3 to the situation with complete case analysis.
  • 46:21Because for complete case analysis,
  • 46:23we also get a consistent result, but like I said,
  • 46:28it doesn't mean the variance would be optimal.
  • 46:30And here, we can actually verify that the variance of this ζ3
  • 46:34will be smaller if our models are correctly specified, okay?
  • 46:43Those are the theoretical properties.
  • 46:49Now, let's look at some simulation, okay?
  • 46:54We did simulations for each scenario,
  • 46:58but due to the time limit, I will only present scenario two.
  • 47:03Let's look at the second scenario.
  • 47:05In the second scenario, we have the following setup.
  • 47:09We have X1 following an exponential distribution and X2
  • 47:13following a normal distribution.
  • 47:16So the two covariates have different distributions, okay?
  • 47:20The model is a simple linear model,
  • 47:24and the error distribution,
  • 47:28as you can see, is heteroscedastic,
  • 47:32because the error distribution is related to X1.
  • 47:38As for the missing mechanism for X2,
  • 47:42in the second scenario, part of X2 is missing
  • 47:47through this logistic regression model, okay?
  • 47:50Now, the missingness rate is about 38%.
  • 47:57Eventually, we have this conditional quantile regression,
  • 48:00a linear quantile regression, with these coefficients of X.
  • 48:04This is our simulation setup in the second scenario.
  • 48:13Now, we consider two working models for π, okay?
  • 48:19The first one is correct. The second one is incorrect.
  • 48:24And we also have two models for the distribution, okay?
  • 48:32All right.
  • 48:33This is the incorrect one,
  • 48:35based on ordinary least squares regression.
  • 48:38And this is the correct one, with τ equal to 0.25 and 0.75.
  • 48:48We have 1,000 replications.
  • 48:51We have sample size n equal to 500, and L is 10.
  • 48:55This L is related to the random draws
  • 48:59in the imputation step.
  • 49:03Okay.
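
Here is a hedged sketch of a data-generating process matching this description: X1 exponential, X2 normal, a linear model with heteroscedastic error driven by X1, and X2 missing through a logistic model. The specific coefficients are my own illustrative assumptions, tuned only so the missingness rate lands near the reported 38%; the paper's exact setup may differ.

```python
import numpy as np

rng = np.random.default_rng(2020)
n = 500
X1 = rng.exponential(scale=1.0, size=n)     # assumed Exp(1)
X2 = rng.normal(size=n)                     # assumed N(0, 1)
eps = (1 + 0.5 * X1) * rng.normal(size=n)   # heteroscedastic: scale grows with X1
Y = 1.0 + X1 + X2 + eps                     # assumed linear model

# X2 goes missing through a logistic model in the observed (Y, X1).
lin_pred = -0.8 + 0.3 * Y - 0.4 * X1
p_miss = 1 / (1 + np.exp(-lin_pred))
R = rng.binomial(1, 1 - p_miss)             # R = 1: X2 observed
print(f"missingness rate for X2: {1 - R.mean():.2f}")   # roughly 0.36-0.40
```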
  • 49:03Now, here is all our simulation result, okay?
  • 49:09Note that the results have been multiplied by 100;
  • 49:14that's why, as you can see, the values look very large.
  • 49:15And also, we denote our methods as 0000, okay?
  • 49:25The first two digits represent
  • 49:28the missingness probability models.
  • 49:31The last two represent the data distribution models.
  • 49:35For example, for IPW 1000,
  • 49:40that means we only use the inverse probability weighting method,
  • 49:44and the weight estimation is based on
  • 49:49the correct model, okay?
  • 49:52And for the imputation methods,
  • 49:56that means we only use the data distribution models.
  • 50:00And for this IM 0010, that means we use our first model
  • 50:08for the data distribution.
  • 50:14The other one uses the second model for the data distribution.
  • 50:18And in either case,
  • 50:20it's always that the first one is the correct model.
  • 50:23The first one is the correct model.
  • 50:24The second one is not, okay?
  • 50:26That's just the notation.
  • 50:28As you can see here, using IPW,
  • 50:31if the model is correctly specified,
  • 50:34the bias is quite small
  • 50:35and everything is quite good.
  • 50:38However, if you misspecify the missingness probability,
  • 50:42we see the estimate is quite out of control, okay?
  • 50:47Similarly for IM, the imputation: if you correctly specify
  • 50:53the data distribution, the result is good.
  • 50:56If not, then it's not.
  • 50:58Okay.
  • 50:59Then there's the multiple robust method.
  • 51:03In the multiple robust method,
  • 51:08if we look at, for example, this one,
  • 51:12where the missingness probability is correctly specified,
  • 51:15then we get a good result.
  • 51:17If not, we get a bad result, as with IPW, okay?
  • 51:22But anyway, if we choose to use all these four models,
  • 51:29as you can see, the result is quite good, okay?
  • 51:33The take-home message from this simulation study is,
  • 51:39if you have some ideas about the missingness probability
  • 51:47and about the data distribution,
  • 51:50and you think, "Okay, maybe this one is right,
  • 51:53or maybe this one is also right," okay?
  • 51:56Our method just tells you,
  • 51:58"Okay, you don't have to choose; just put all these
  • 52:04potential candidate models into our framework."
  • 52:11Then you look at the result.
  • 52:16This was the simulation for scenario two.
  • 52:22We also have a simulation in scenario three,
  • 52:27but I will skip it here and go directly to the
  • 52:35real data analysis.
  • 52:38So, in this real data analysis, we look at this
  • 52:43AIDS Clinical Trials Group Protocol 175, or ACTG 175, data.
  • 52:52In this research, we evaluate treatment with either a single
  • 53:01nucleoside or two nucleosides in HIV-infected subjects
  • 53:05whose CD4 cell counts
  • 53:08are from 200 to 500 per cubic millimeter.
  • 53:14So, we consider two arms of treatment.
  • 53:17One is the standard treatment,
  • 53:19and the other one is with three newer treatments.
  • 53:24The two arms, respectively,
  • 53:29have about 500 and 1,600 subjects.
  • 53:34Now, the model we are looking at is
  • 53:36the linear quantile regression model
  • 53:39with those kinds of covariates inside.
  • 53:43The data can be found in this package.
  • 53:51Now for the data, the average subject is 35 years old,
  • 53:57the standard deviation is about nine,
  • 54:01and the variable CD4 96 is missing for approximately 37%.
  • 54:10It's quite similar to the simulation scenario;
  • 54:16it is quite similar to the setup of the simulation scenario.
  • 54:22However, at baseline and during the follow-up,
  • 54:25full measurements on additional variables that are correlated
  • 54:28with CD4 96 are obtained.
  • 54:30So this CD4 96 would be the missing part.
  • 54:39Here we assume this CD4 96 is missing at random.
  • 54:46And we also have other baseline variables, for example,
  • 54:50CD4 80 and CD4 20, and so on.
  • 54:56We will use these as auxiliary variables.
  • 55:01So, we have our third scenario
  • 55:07in this real data analysis.
  • 55:12And why did we choose this data?
  • 55:16Let's look at the histogram of this CD4 96, okay?
  • 55:24The left one is before any transformation, on its original scale.
  • 55:32The right one is after we do a log transformation.
  • 55:39So, as you can see, the left one is kind of truncated,
  • 55:46and the right one is also truncated.
  • 55:49So you may debate,
  • 55:50"Okay, which one I should use?
  • 55:52Do I take log transformation or not?
  • 55:59Or to be, or not to be."
  • 56:03So that's no apparent reason to favor one of them
  • 56:10for the imputation method.
  • 56:13Now, what do we do?
  • 56:17In our proposed method,
  • 56:19we can put both of these models in our framework, okay?
  • 56:26We don't need to make the choice,
  • 56:29because there is no apparent reason
  • 56:31to take the log or not take the log.
  • 56:33Now, let's put the two together into our model, okay?
  • 56:38So we can simultaneously accommodate both specifications.
  • 56:44And then we have eight covariates and auxiliary variables.
  • 56:49Then the missingness probability is modeled by
  • 56:54a logistic regression containing all main effects of X and S.
  • 57:02So, here is the result. Here is the result.
  • 57:04This is a big table, but let me summarize this table.
  • 57:10Okay.
  • 57:11The three newer treatments significantly slow the progression.
  • 57:16Our proposed method and the IPW method
  • 57:19produce very similar results, okay?
  • 57:23And the imputation estimator 1
  • 57:27failed to catch the difference in the treatment,
  • 57:31the treatment arm effect, for different quantiles.
  • 57:38The imputation estimator 2 gives
  • 57:40an increasing estimated effect across quantiles.
  • 57:44In addition, the two imputation estimators
  • 57:48are quite sensitive to the selection of the working models.
  • 58:04Okay?
  • 58:05And also, from these real data,
  • 58:07we can see that complete case analysis
  • 58:11overestimates the treatment arm effects once again,
  • 58:16so that even if sometimes the complete case analysis is valid,
  • 58:23there are still advantages to using our proposed method.
  • 58:34All right, so here's the summary of my talk.
  • 58:40We proposed a general framework for
  • 58:44quantile estimation with missing data.
  • 58:48And we actually applied this framework
  • 58:52in different scenarios.
  • 58:55Now, the take-home message is,
  • 59:00our proposed method is multiply robust against
  • 59:04possible model misspecification.
  • 59:08So, we have two sets of models,
  • 59:10one for the missingness probability
  • 59:12and one for the data distribution.
  • 59:14As long as one model is correct,
  • 59:17then we will get a good result.
  • 59:19And also, our method can easily be generalized
  • 59:23to many other scenarios.
  • 59:26And I think that's all of my talk,
  • 59:32and thank you.
  • 59:36- All right.
  • 59:37Thank you, Linglong. This was very interesting.
  • 59:39I think we're almost out of time, so
  • 59:43we probably have time for one question.
  • 59:45So if there's any, if not,
  • 59:48let's see if there are any questions.
  • 59:52Feel free to write in the chat box or speak up.
  • 01:00:12Okay.
  • 01:00:13Just gonna ask one question
  • 01:00:14and then I think I'm gonna ask all the questions
  • 01:00:17when we meet.
  • 01:00:19Just a quick question.
  • 01:00:20Do you know why the complete case analysis shows
  • 01:00:24overestimation rather than underestimation?
  • 01:00:27Like, do you have a feeling why that's the case?
  • 01:00:33- Well, I don't know. No.
  • 01:00:39- Yeah.
  • 01:00:40I believe it will be interesting to see in what cases,
  • 01:00:42like what are the conditions for overestimation
  • 01:00:45or underestimation for complete case analysis, I guess.
  • 01:00:48I guess it must depend on the data distribution
  • 01:00:52and the missingness mechanism that's being assumed.
  • 01:00:56But I'm not sure.
  • 01:00:59- I agree with you.
  • 01:01:01The reason I would answer "I don't know" is
  • 01:01:05because it's really hard to know how the data are missing.
  • 01:01:11Although we assume it's missing at random.
  • 01:01:13- Yeah.
  • 01:01:14- But, who knows the reality?
  • 01:01:17- Right. Yeah, right.
  • 01:01:19I guess, under your assumption of missing at random,
  • 01:01:22then I guess there could be conditions for underestimation
  • 01:01:27or overestimation under the assumption of MAR.
  • 01:01:31But, I don't know.
  • 01:01:32I was wondering if people have derived those or not.
  • 01:01:36(laughs)
  • 01:01:37That could be future work, right?
  • 01:01:40(laughs)
  • 01:01:42All right.
  • 01:01:43Linglong, thank you.
  • 01:01:44I'll see you in an hour for our one-on-one meetings,
  • 01:01:47and I know other students and maybe faculty have
  • 01:01:51signed up to meet with you.
  • 01:01:53So, thank you very much.
  • 01:01:55And I'll see you later. All right.
  • 01:01:57- Thank you.
  • 01:01:57- Bye-bye. Thank you everyone for joining.
  • 01:01:58Bye.
  • 01:01:59- Bye.
  • 01:02:00- Bye.