Data Collection

February 06, 2023
  • 00:02My name is Maria Ciarleglio
  • 00:04and I'm a faculty member
  • 00:05in the Department of Biostatistics
  • 00:08at the Yale School of Public Health.
  • 00:11In this video series,
  • 00:12I will introduce the clinical research process
  • 00:15to prepare you to collaborate with a statistician.
  • 00:20In this third video,
  • 00:22we'll discuss data collection
  • 00:24or how to structure your data in a spreadsheet
  • 00:27so that you can share the data files
  • 00:29with a statistician for analysis.
  • 00:32We'll also discuss variable types and special considerations
  • 00:37for quantitative, categorical, and time-to-event data.
  • 00:43Collecting good data is important
  • 00:45because the data enable you
  • 00:47to answer your research question.
  • 00:49Generally, we're interested in the effect
  • 00:51of one or more exposures
  • 00:53on one or more outcomes of interest.
  • 00:56Exposures and outcomes are key variables to collect.
  • 01:00There may be several other descriptive characteristics
  • 01:03that we would like to either report
  • 01:05when we're describing the sample
  • 01:07or control or adjust for in the analysis.
  • 01:11We can also explore these variables
  • 01:13as possible effect modifiers
  • 01:14or as characteristics that define subgroups of interest.
  • 01:19These additional variables are also recorded
  • 01:21during data collection.
  • 01:23What is the structure of this data table?
  • 01:26Specifically, the rows contain information
  • 01:29from different subjects
  • 01:30and the columns contain the different characteristics
  • 01:33or variables collected on the subjects.
  • 01:36This means that the number of rows in your data table
  • 01:39corresponds to the number of participants,
  • 01:42not including the header row
  • 01:44or the row that contains the variable names.
  • 01:47The number of columns in your data table
  • 01:49corresponds to the number of variables in your data table.
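
The layout described above (one row per participant, one column per variable, plus a header row) can be sketched in Python. The patient IDs, variable names, and values here are invented for illustration and do not come from the sample data table in the video.

```python
# Hypothetical toy data: one header row plus one row per participant.
header = ["patient_id", "sex", "total_bilirubin", "inr"]
rows = [
    [101, 1, 1.2, 1.1],
    [102, 2, 0.8, 1.0],
    [103, 2, 2.4, 1.4],
]

n_participants = len(rows)  # the header row is not counted
n_variables = len(header)   # one column per variable

print(n_participants, n_variables)
```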
  • 01:54There are several features of this table
  • 01:56that I want to discuss further,
  • 01:58but first, let's review a few options available to you
  • 02:02for collecting your study data.
  • 02:05About 90% of the time,
  • 02:07we work with data collected in Excel spreadsheets.
  • 02:11Although Excel was not designed as a data capture tool
  • 02:14for clinical research,
  • 02:16it does provide an easy way to collect,
  • 02:19store and share your data.
  • 02:22A nice feature of Excel is the ability to quickly filter
  • 02:26on different variables.
  • 02:28To enable filtering, go to the Data tab.
  • 02:31And then, select the Filter button.
  • 02:34After we turn filtering on,
  • 02:36notice how the cells in the header row
  • 02:38that contain the variable names
  • 02:40now have a dropdown button on the right side of the cell.
  • 02:44Click this button and select or filter
  • 02:47on the values that you want to view.
  • 02:50For example, suppose sex equals 1 corresponds to males.
  • 02:55If I want to look at the data in our male patients,
  • 02:58filter on the sex variable
  • 03:00and select the value 1 and deselect the value 2.
  • 03:05After filtering, we only see the rows
  • 03:07from the patients with sex equals 1.
  • 03:11Select the Clear filter button
  • 03:14to remove any filters applied in the worksheet.
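
Outside Excel, the same filter can be written as a short Python sketch. It assumes the rows have been loaded as dictionaries and that sex is coded 1 for males and 2 for females, as in the example above; the patient IDs are hypothetical.

```python
# Hypothetical rows; sex is coded 1 = male, 2 = female.
rows = [
    {"patient_id": 101, "sex": 1},
    {"patient_id": 102, "sex": 2},
    {"patient_id": 103, "sex": 1},
]

# Keep only the rows where sex equals 1, like selecting 1 and deselecting 2.
males = [row for row in rows if row["sex"] == 1]
male_ids = [row["patient_id"] for row in males]

print(male_ids)
```

The original list is left unchanged, which mirrors clearing the filter in Excel.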
  • 03:18Another option available to you
  • 03:20for data collection is REDCap.
  • 03:23REDCap is a web-based data collection tool
  • 03:26designed for research data.
  • 03:29The data are securely stored in the cloud.
  • 03:32You will need to request a REDCap account
  • 03:34from the REDCap team at Yale
  • 03:37to create a project.
  • 03:38The REDCap team also provides training and support
  • 03:42and you can find their contact information
  • 03:45on the REDCap website,
  • 03:46portal.redcap.yale.edu.
  • 03:51Let's briefly talk about collecting or recording your data.
  • 03:55In general, variables
  • 03:56are either quantitative or categorical.
  • 03:59We also sometimes collect text fields or notes.
  • 04:03Quantitative variables are numeric variables
  • 04:06such as height, weight, age,
  • 04:09bilirubin, or a calculated MELD score.
  • 04:12Technically, dates are also numeric
  • 04:15in the way they're stored in Excel
  • 04:17and most other statistical data analysis software.
  • 04:21In our sample data table,
  • 04:24total bilirubin and INR are quantitative variables.
  • 04:29We also have several dates
  • 04:31including date of birth and date of diagnosis.
  • 04:35In the full data sheet,
  • 04:37we also have intake date into the study.
  • 04:40We recommend that you collect dates
  • 04:43and allow us, the statisticians,
  • 04:44to calculate durations or lengths of time
  • 04:47such as age at diagnosis or age at study baseline
  • 04:51using statistical software.
  • 04:54We often use SAS or R to perform our data analysis.
  • 04:59For example, we can calculate age at diagnosis
  • 05:02using date of birth and date of diagnosis.
  • 05:06The SAS code here adds a new column to the data table
  • 05:09containing age at diagnosis in years.
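
The SAS code itself is not reproduced in this transcript. As a rough Python sketch of the same idea, the function below derives age at diagnosis from the two collected dates. The function name and the example dates are hypothetical, and it returns completed years; a statistician may prefer fractional years instead.

```python
from datetime import date


def age_in_years(date_of_birth, reference_date):
    """Return completed years between date_of_birth and reference_date."""
    years = reference_date.year - date_of_birth.year
    # Subtract one year if the birthday has not yet occurred in the reference year.
    if (reference_date.month, reference_date.day) < (date_of_birth.month, date_of_birth.day):
        years -= 1
    return years


# Hypothetical patient: born 14 May 1960, diagnosed 2 March 2021.
age_at_diagnosis = age_in_years(date(1960, 5, 14), date(2021, 3, 2))
print(age_at_diagnosis)
```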
  • 05:13The same usually goes for variables
  • 05:15such as FIB-4 and MELD score.
  • 05:18We can program the calculation
  • 05:20of these variables in our code
  • 05:21rather than have you perform the calculation
  • 05:24in your Excel spreadsheet.
  • 05:27Categorical variables are ones
  • 05:29where the values the variable can take
  • 05:31are essentially categories.
  • 05:34An example of a categorical variable
  • 05:36is race or sex or gender.
  • 05:40Categorical variables that can take only two levels
  • 05:43are called dichotomous or binary variables.
  • 05:46In our sample data table,
  • 05:48response status, treatment group, race and sex
  • 05:52are categorical variables
  • 05:53but notice that these variables are collected
  • 05:56and coded numerically.
  • 05:58This is a common method for recording categorical data
  • 06:01where each category is given a numerical label.
  • 06:06For example, sex is coded as 1 for males
  • 06:10and 2 for females.
  • 06:12We discourage the use of character variables or text
  • 06:15when collecting categorical data
  • 06:17because our statistical software is case sensitive
  • 06:21when reading character data.
  • 06:24It's important to maintain a data key
  • 06:26that defines the numerical coding.
  • 06:29This is especially important
  • 06:31when sharing the data with others.
  • 06:33Here, our data key would include the definition
  • 06:36of sex equals 1 as corresponding to males
  • 06:39and sex equals 2 as corresponding to females.
  • 06:43We recommend creating a separate tab in the Excel file
  • 06:47that defines the numerical coding
  • 06:49of the categorical variables in the data.
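
A data key can be thought of as a lookup table from numeric codes to labels. The Python sketch below uses the sex coding from the example above (1 for males, 2 for females); the dictionary name and the sample values are assumptions made for illustration. It also shows why free text is discouraged: entries that differ only in capitalization are treated as different categories.

```python
# Data key for the sex variable, as defined in the example above.
SEX_KEY = {1: "Male", 2: "Female"}

# Hypothetical coded column; None stands for a blank (missing) cell.
sex_codes = [1, 2, 2, None, 1]
sex_labels = [SEX_KEY.get(code) for code in sex_codes]  # missing stays None

# The same category typed as free text with inconsistent capitalization.
text_entries = ["Male", "male", "MALE"]
distinct_text_values = set(text_entries)

print(sex_labels)
print(len(distinct_text_values))
```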
  • 06:53A few notes on naming the variables in your spreadsheet.
  • 06:57Variable names should be descriptive,
  • 07:00making it clear what the variable represents.
  • 07:03SAS variable names may be up to 32 characters in length.
  • 07:07The first character of the variable name
  • 07:09must be an alphabetic character
  • 07:12or an underscore.
  • 07:13Variable names should not begin with a number or symbol.
  • 07:17And finally, the variable name should not contain
  • 07:20any special characters other than the underscore.
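
The naming rules above can be checked before sharing a spreadsheet. This Python sketch encodes only the rules stated here (at most 32 characters, first character a letter or underscore, remaining characters letters, digits, or underscores); the function name is hypothetical, and it does not cover any other restrictions that specific software may impose.

```python
import re

# First character: letter or underscore. Up to 31 more: letters, digits, underscores.
VALID_NAME = re.compile(r"[A-Za-z_][A-Za-z0-9_]{0,31}")


def is_valid_variable_name(name):
    """Return True if name follows the naming rules described above."""
    return VALID_NAME.fullmatch(name) is not None


for candidate in ["date_of_birth", "_inr", "2nd_visit", "total bilirubin", "meld-score"]:
    print(candidate, is_valid_variable_name(candidate))
```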
  • 07:25Sometimes certain variables are not available
  • 07:28or not collected.
  • 07:29We see in our sample data table
  • 07:31that we have a few missing values.
  • 07:34Missing values occur when we don't have the data available
  • 07:37for that individual.
  • 07:39One or more variables can be missing for an individual.
  • 07:43Here, race is missing for patients 106 and 117.
  • 07:48Date of diagnosis is missing for patient 106.
  • 07:53There are also no ultrasound findings
  • 07:55for the majority of patients.
  • 07:58Notice how missing values are represented
  • 08:00in the data as empty cells.
  • 08:02Do not use an N/A or a period or a dash
  • 08:06to indicate missing values.
  • 08:08Leave the cell blank when the value is missing.
  • 08:11Finally, if any calculated variables
  • 08:13involve variables with a missing value,
  • 08:16then that calculated variable will also be missing.
  • 08:20For example, if total bilirubin is missing for a patient,
  • 08:23then their calculated MELD score will also be missing.
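
The way missingness carries over into a calculated variable can be sketched in Python. The example uses FIB-4, mentioned earlier, with the commonly published formula (age times AST, divided by platelet count times the square root of ALT). Treat the formula, units, and function name as assumptions to confirm with your statistician; the input values are invented.

```python
import math


def fib4(age_years, ast, alt, platelets):
    """Return FIB-4, or None if any input is missing (None)."""
    inputs = (age_years, ast, alt, platelets)
    if any(value is None for value in inputs):
        return None  # a missing input makes the calculated variable missing
    return (age_years * ast) / (platelets * math.sqrt(alt))


complete = fib4(age_years=50, ast=40, alt=25, platelets=200)
incomplete = fib4(age_years=50, ast=None, alt=25, platelets=200)

print(complete, incomplete)
```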
  • 08:28Finally, I want to discuss some special considerations
  • 08:31for collecting endpoint or response data in your study.
  • 08:36Remember that your primary endpoint
  • 08:38answers your primary research question.
  • 08:41You may also be collecting data on secondary,
  • 08:44tertiary, or other exploratory endpoints.
  • 08:47Endpoints are either continuous or quantitative,
  • 08:50categorical or most often dichotomous,
  • 08:54or there's some measure of time to an event of interest.
  • 08:59Looking at quantitative variables,
  • 09:01we can begin by summarizing a quantitative variable
  • 09:05in the full sample.
  • 09:07We can summarize and compare that quantitative variable
  • 09:10in two or more groups
  • 09:12such as the group exposed to a particular intervention
  • 09:16and the group unexposed to that intervention.
  • 09:19And we can also analyze that quantitative variable
  • 09:22in a regression model,
  • 09:24allowing us to control for certain confounders.
  • 09:28In this 2022 hepatology paper,
  • 09:30they explore how TIPS affects PPG
  • 09:33in patients with ascites.
  • 09:36They found that mean PPG, portal pressure and IVC pressure
  • 09:41decreased significantly after TIPS.
  • 09:44These are quantitative endpoints
  • 09:46and these measurements are taken at two points in time,
  • 09:49before TIPS and after TIPS.
  • 09:53To analyze the change in these quantitative measures,
  • 09:56we need the measurements recorded in our data
  • 09:59before TIPS and after TIPS.
  • 10:01We can use statistical software
  • 10:03to compute and analyze the response of interest,
  • 10:06which is the change in these measures.
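
As a small Python sketch, the change score is simply the post-TIPS value minus the pre-TIPS value, computed only when both measurements were recorded. The column names and PPG values below are invented for illustration and are not taken from the paper.

```python
# Hypothetical pre/post measurements in mmHg; None marks a blank cell.
measurements = [
    {"patient_id": 1, "ppg_pre": 20.0, "ppg_post": 9.0},
    {"patient_id": 2, "ppg_pre": 18.0, "ppg_post": None},
    {"patient_id": 3, "ppg_pre": 16.0, "ppg_post": 7.0},
]


def change(pre, post):
    """Return post minus pre, or None if either value is missing."""
    if pre is None or post is None:
        return None
    return post - pre


changes = [change(m["ppg_pre"], m["ppg_post"]) for m in measurements]
observed = [c for c in changes if c is not None]
mean_change = sum(observed) / len(observed)

print(changes, mean_change)
```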
  • 10:10The next type of endpoint we often work with
  • 10:13is a dichotomous or binary endpoint.
  • 10:16For example, the goal could be to determine
  • 10:19if the patient's disease improves
  • 10:22over the course of the study.
  • 10:23This is recorded as a binary variable,
  • 10:26improvement, yes or no.
  • 10:28Similarly, we can look at occurrence
  • 10:30of surgical site infection following liver transplant.
  • 10:34Taking it a step further,
  • 10:36we look for an association between exposure
  • 10:38such as exposure to perioperative antibiotics
  • 10:42compared to intraoperative antibiotics
  • 10:44and development of surgical site infection.
  • 10:47We can also model development of surgical site infection
  • 10:50using predictor variables
  • 10:52such as exposure to a specific treatment
  • 10:55while controlling for potential confounders.
  • 10:59We can use similar methods to analyze categorical responses
  • 11:02with more than two levels and ordinal responses
  • 11:06in which the endpoint categories are ordered.
  • 11:10In this 2019 hepatology paper,
  • 11:13they compare the proportion of patients
  • 11:15with surgical site infection
  • 11:17in those patients receiving 72 hours
  • 11:19of perioperative antibiotics or extended antibiotics,
  • 11:24and those receiving intraoperative antibiotics only
  • 11:28or short antibiotics.
  • 11:31The primary endpoint here
  • 11:32is development of surgical site infection.
  • 11:36They also look at 30-day hospital readmission
  • 11:39and 30-day mortality,
  • 11:41which are also dichotomous responses.
  • 11:44Sometimes we use quantitative data
  • 11:47to define a categorical variable
  • 11:49such as a dichotomous endpoint.
  • 11:52For example, clinically significant portal hypertension
  • 11:55is defined as an HVPG greater than or equal
  • 11:59to 10 millimeters of mercury.
  • 12:02You would collect the quantitative data.
  • 12:04And then, we would create the categorical
  • 12:06or dichotomous variable to use in the analysis.
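
Deriving the dichotomous variable from the collected HVPG value can be sketched in Python. The threshold of 10 mmHg comes from the definition above; the function name, the 1/0 coding, and the sample values are assumptions.

```python
CSPH_THRESHOLD_MMHG = 10  # threshold stated above


def csph_indicator(hvpg_mmhg):
    """Return 1 if HVPG >= 10 mmHg, 0 if below, None if HVPG is missing."""
    if hvpg_mmhg is None:
        return None
    return 1 if hvpg_mmhg >= CSPH_THRESHOLD_MMHG else 0


# Hypothetical HVPG values; None marks a blank cell.
hvpg_values = [8.5, 10.0, 14.0, None]
csph = [csph_indicator(value) for value in hvpg_values]

print(csph)
```

Because the raw HVPG values are kept, a different cutoff can be applied later without collecting the data again.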
  • 12:11The final endpoint that we most often see
  • 12:14is a time-to-event or survival endpoint
  • 12:18such as time to death, time to decompensation,
  • 12:21time to recovery or response.
  • 12:24For example, our goal could be to determine
  • 12:27if patients exposed to a new treatment
  • 12:29have longer survival times
  • 12:31or greater likelihood of survival
  • 12:33to a certain time point.
  • 12:35When looking at one group such as the overall sample,
  • 12:38we could report survival probabilities
  • 12:40from a Kaplan-Meier survival curve
  • 12:43or median survival time.
  • 12:45Median survival time
  • 12:47is the time beyond which 50% of the individuals
  • 12:50are expected to survive.
  • 12:52We can naturally extend this to two or more groups
  • 12:55and formally compare the groups,
  • 12:58and we can also build regression models
  • 13:00that estimate the relationship
  • 13:02between rate of the event and exposure variables
  • 13:05such as treatment status.
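
For intuition, here is a minimal pure-Python sketch of the Kaplan-Meier estimate and the median survival time. It is not the SAS or R procedure a statistician would actually use. It assumes each subject has a follow-up time in months and an event indicator (1 if the event was observed, 0 if censored), and the data are invented.

```python
def kaplan_meier(times, events):
    """Return [(time, survival_probability)] at each distinct event time."""
    survival = 1.0
    curve = []
    event_times = sorted({t for t, e in zip(times, events) if e == 1})
    for t in event_times:
        at_risk = sum(1 for time in times if time >= t)  # still under follow-up at t
        n_events = sum(1 for time, e in zip(times, events) if time == t and e == 1)
        survival *= 1 - n_events / at_risk
        curve.append((t, survival))
    return curve


def median_survival(curve):
    """Return the first time at which estimated survival is 50% or lower."""
    for t, s in curve:
        if s <= 0.5:
            return t
    return None  # the curve never reaches 50%


# Hypothetical follow-up times in months; 1 = event observed, 0 = censored.
times = [2, 3, 4, 5, 6, 8]
events = [1, 0, 1, 1, 0, 1]

curve = kaplan_meier(times, events)
print(curve)
print(median_survival(curve))
```

Censored subjects never trigger a drop in the curve, but they do count in the at-risk total until their censoring time.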
  • 13:09So far with the other endpoints,
  • 13:11data collection has been pretty intuitive.
  • 13:14However, survival data requires
  • 13:16certain pieces of information
  • 13:18to properly complete the analysis.
  • 13:20In survival data,
  • 13:22the outcome is time to a target event occurring.
  • 13:26Dates are important because we need to compute time
  • 13:30from a specific start point to an event.
  • 13:33To calculate the survival time in your study,
  • 13:35you must precisely define:
  • 13:38the time origin or the starting point,
  • 13:40that is when follow up begins;
  • 13:42the ending event of interest, such as death,
  • 13:45decompensation, remission, relapse;
  • 13:49and the measurement scale for the passage of time,
  • 13:51for example, days, months, years.
  • 13:56An important feature of survival data
  • 13:58is that we may have patients who do not experience the event
  • 14:02during the study period.
  • 14:04For example, some patients
  • 14:06may not decompensate during the study.
  • 14:09Those patients are censored.
  • 14:11Here, patient one enters the study
  • 14:13and experiences the event at three months.
  • 14:17Patient two enters the study
  • 14:18and is either lost to follow-up
  • 14:21or withdraws from the study before we observe the event.
  • 14:24Patient two is censored at his withdrawal
  • 14:27or at the last time we knew he was event free at two months.
  • 14:32Patient three enters the study
  • 14:34and does not experience the event
  • 14:35before the end of the study.
  • 14:38Patient three is censored
  • 14:39at the administrative end of the study at four months.
  • 14:44It's straightforward to compute survival time in patient one
  • 14:47because he experienced the event,
  • 14:50so his survival time is his event date
  • 14:53minus his entry date.
  • 14:56For those censored,
  • 14:57there's no time-to-event
  • 14:59but we want to account for time at risk
  • 15:01because they're at risk for the event,
  • 15:04and the fact that they don't experience the event
  • 15:06before their loss or withdrawal
  • 15:09or end of follow up is meaningful.
  • 15:12We compute their survival time
  • 15:14as their censoring date minus their entry date.
  • 15:19Survival endpoints must contain two variables:
  • 15:22an event indicator,
  • 15:24which is usually referred to as a censoring indicator,
  • 15:28and the patient's survival time
  • 15:30which is time-to-event
  • 15:32for those experiencing the event
  • 15:34or time to censoring for those censored.
  • 15:37Patient one experiences the event
  • 15:40so his event indicator is equal to 1.
  • 15:42Patients 2 and 3 are censored
  • 15:45so their event indicator is equal to 0.
  • 15:49And survival time is time from entry
  • 15:51to the event or censoring, whichever occurs first.
  • 15:56We recommend that you collect the relevant dates
  • 15:59and allow us to compute survival time.
  • 16:02Here we have the sample toy data set
  • 16:05that we looked at earlier.
  • 16:06The goal is to analyze time to response
  • 16:09and describe the probability
  • 16:11of treatment response over time.
  • 16:13The first variable needed to define the survival endpoint
  • 16:18is the event or censoring indicator.
  • 16:21Here we identify which subjects experienced
  • 16:23the event of interest using the variable called response.
  • 16:28Response is equal to 1 if the subject experienced the event
  • 16:32and 0 if the subject did not experience the event.
  • 16:37Next, collect intake date or the start of follow up.
  • 16:41In the subjects who experienced the event,
  • 16:44the variable response date
  • 16:46records the date the event occurred.
  • 16:48Notice that this variable is missing
  • 16:51for those who do not experience the event.
  • 16:54For censored patients,
  • 16:56we compute their survival time
  • 16:58as the time between intake and censoring.
  • 17:02Censoring time is the last time
  • 17:04we knew the subject was event free.
  • 17:06In this case, we can use the last visit date for a patient.
  • 17:11You can record last visit date for all patients
  • 17:14and we would use this information as the censoring date
  • 17:17for those who do not experience the event.
  • 17:20Notice that the last visit dates
  • 17:22for subjects 106 and 107 are after their response dates.
  • 17:27But when we compute the time variable for these subjects,
  • 17:30we would use time from intake to the response date.
  • 17:35You could also leave the last visit date blank
  • 17:38for those subjects who experienced the event.
  • 17:41We will use the response column and the time column
  • 17:44as the response variable in the survival analysis.
  • 17:48The response variable indicates if the time variable
  • 17:51represents time to event or time to censoring.
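
The rule described above can be sketched in Python: use the response date for subjects who experienced the event, and the last visit date otherwise. The function name, subject IDs, and dates are invented for illustration, and time is measured in days.

```python
from datetime import date


def survival_endpoint(response, intake_date, response_date, last_visit_date):
    """Return (time_in_days, event_indicator) for one subject."""
    if response == 1:
        end_date = response_date    # event observed: time to event
    else:
        end_date = last_visit_date  # censored: last time known to be event free
    return (end_date - intake_date).days, response


# Hypothetical subject 201 responded; the last visit comes after the response date
# but is not used to compute the time.
responder = survival_endpoint(1, date(2020, 1, 10), date(2020, 4, 9), date(2020, 6, 1))

# Hypothetical subject 202 did not respond, so the response date is blank (None).
censored = survival_endpoint(0, date(2020, 2, 1), None, date(2020, 3, 2))

print(responder, censored)
```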
  • 17:56In the 2022 hepatology paper
  • 17:59looking at the effect of TIPS on PPG
  • 18:02in patients with ascites,
  • 18:04we know that they looked at PPG reduction
  • 18:06as a continuous endpoint
  • 18:08but they also looked at survival or time to death
  • 18:11in patients with ascites resolution after TIPS
  • 18:14(that's the purple curve)
  • 18:15compared to those with a persistent need for paracentesis
  • 18:20six weeks after TIPS, the red curve.
  • 18:23They found significantly improved survival
  • 18:26in those with ascites resolution.
  • 18:30In this video, we discussed data collection
  • 18:32and variable types and specifically looked
  • 18:35at quantitative endpoints, dichotomous endpoints,
  • 18:39and time-to-event endpoints.
  • 18:41Now that we've seen different examples
  • 18:43of common endpoint types in clinical research,
  • 18:46in our next video, the fourth video in this series,
  • 18:50we'll discuss an important step in the study design process
  • 18:55and that's sample size determination.