Data Collection

February 06, 2023

Information

In this third video, we discuss data collection and how to structure a database. We also discuss variable types and special considerations for quantitative, categorical, and time-to-event data.

00:02My name is Maria Ciarleglio
00:04and I'm a faculty member
00:05in the Department of Biostatistics
00:08at the Yale School of Public Health.
00:11In this video series,
00:12I will introduce the clinical research process
00:15to prepare you to collaborate with a statistician.
00:20In this third video,
00:22we'll discuss data collection
00:24or how to structure your data in a spreadsheet
00:27so that you can share the data files
00:29with a statistician for analysis.
00:32We'll also discuss variable types and special considerations
00:37for quantitative, categorical, and time-to-event data.
00:43Collecting good data is important
00:45because the data enable you
00:47to answer your research question.
00:49Generally, we're interested in the effect
00:51of one or more exposures
00:53on one or more outcomes of interest.
00:56Exposures and outcomes are key variables to collect.
01:00There may be several other descriptive characteristics
01:03that we would like to either report
01:05when we're describing the sample
01:07or control or adjust for in the analysis.
01:11We can also explore these variables
01:13as possible effect modifiers
01:14or as characteristics that define subgroups of interest.
01:19These additional variables are also recorded
01:21during data collection.
01:23What is the structure of this data table?
01:26Specifically, the rows contain information
01:29from different subjects
01:30and the columns contain the different characteristics
01:33or variables collected on the subjects.
01:36This means that the number of rows in your data table
01:39corresponds to the number of participants,
01:42not including the header row
01:44or the row that contains the variable names.
01:47The number of columns in your data table
01:49corresponds to the number of variables in your data table.
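As a minimal sketch of this layout, the Python snippet below reads a small table with one header row and one row per participant. The file contents, patient IDs, and column names are hypothetical and only illustrate the row and column counts described above.

```python
import csv
import io

# Hypothetical file contents: one header row, then one row per participant.
raw = """patient_id,sex,total_bilirubin,inr
101,1,1.2,1.1
102,2,0.8,1.0
103,1,2.4,1.3
"""

# DictReader consumes the header row and uses it as the variable names.
rows = list(csv.DictReader(io.StringIO(raw)))

n_participants = len(rows)   # number of rows, not counting the header row
n_variables = len(rows[0])   # number of columns (variables)

print(n_participants, n_variables)
```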
01:54There are several features of this table
01:56that I want to discuss further,
01:58but first, let's review a few options available to you
02:02for collecting your study data.
02:05About 90% of the time,
02:07we work with data collected in Excel spreadsheets.
02:11Although Excel was not designed as a data capture tool
02:14for clinical research,
02:16it does provide an easy way to collect,
02:19store and share your data.
02:22A nice feature of Excel is the ability to quickly filter
02:26on different variables.
02:28To enable filtering, go to the Data tab.
02:31And then, select the Filter button.
02:34After we turn filtering on,
02:36notice how the cells in the header row
02:38that contain the variable names
02:40now have a dropdown button on the right side of the cell.
02:44Click this button to select or filter
02:47on the values that you want to view.
02:50For example, suppose sex equals 1 corresponds to males.
02:55If I want to look at the data in our male patients,
02:58filter on the sex variable
03:00and select the value 1 and deselect the value 2.
03:05After filtering, we only see the rows
03:07from the patients with sex equals 1.
03:11Select the Clear filter button
03:14to remove any filters applied in the worksheet.
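The same filter can be expressed in code once the data leave Excel. This is only a sketch with hypothetical rows; it assumes the coding used in the example, where sex equals 1 corresponds to males and 2 to females.

```python
# Hypothetical rows; sex is coded 1 = male, 2 = female.
patients = [
    {"patient_id": 101, "sex": 1},
    {"patient_id": 102, "sex": 2},
    {"patient_id": 103, "sex": 1},
]

# Keep only the rows where sex equals 1, like selecting the value 1 in the filter.
males = [row for row in patients if row["sex"] == 1]

# The original list is untouched, like clearing the filter in the worksheet.
print(len(males), len(patients))
```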
03:18Another option available to you
03:20for data collection is REDCap.
03:23REDCap is a web-based data collection tool
03:26designed for research data.
03:29The data are securely stored in the cloud.
03:32You will need to request a REDCap account
03:34from the REDCap team at Yale
03:37to create a project.
03:38The REDCap team also provides training and support
03:42and you can find their contact information
03:45on the REDCap website,
03:46portal.redcap.yale.edu.
03:51Let's briefly talk about collecting or recording your data.
03:55In general, variables
03:56are either quantitative or categorical.
03:59We also sometimes collect text fields or notes.
04:03Quantitative variables are numeric variables
04:06such as height, weight, age,
04:09bilirubin, or a calculated MELD score.
04:12Technically, dates are also numeric
04:15in the way they're stored in Excel
04:17and most other statistical data analysis software.
04:21In our sample data table,
04:24total bilirubin and INR are quantitative variables.
04:29We also have several dates
04:31including date of birth and date of diagnosis.
04:35In the full data sheet,
04:37we also have intake date into the study.
04:40We recommend that you collect dates
04:43and allow us, the statisticians,
04:44to calculate durations or lengths of time
04:47such as age at diagnosis or age at study baseline
04:51using statistical software.
04:54We often use SAS or R to perform our data analysis.
04:59For example, we can calculate age at diagnosis
05:02using date of birth and date of diagnosis.
05:06The SAS code here adds a new column to the data table
05:09containing age at diagnosis in years.
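The SAS code shown in the video is not reproduced in this transcript. As a rough illustration of the same idea, here is a minimal Python sketch. The function name, the example dates, and the use of a 365.25-day year are assumptions, not the code from the video.

```python
from datetime import date

def age_at_diagnosis(date_of_birth, date_of_diagnosis):
    """Age at diagnosis in years; None if either date is missing."""
    if date_of_birth is None or date_of_diagnosis is None:
        return None
    # Assumption: convert days to years with an average year of 365.25 days.
    return (date_of_diagnosis - date_of_birth).days / 365.25

# Hypothetical patient: born 15 March 1960, diagnosed 15 March 2020.
age = age_at_diagnosis(date(1960, 3, 15), date(2020, 3, 15))
print(round(age, 1))

# A missing date of diagnosis gives a missing age.
print(age_at_diagnosis(date(1960, 3, 15), None))
```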
05:13The same usually goes for variables
05:15such as FIB-4 and MELD score.
05:18We can program the calculation
05:20of these variables in our code
05:21rather than have you perform the calculation
05:24in your Excel spreadsheet.
05:27Categorical variables are ones
05:29where the values the variable can take
05:31are essentially categories.
05:34An example of a categorical variable
05:36is race or sex or gender.
05:40Categorical variables that can take only two levels
05:43are called dichotomous or binary variables.
05:46In our sample data table,
05:48response status, treatment group, race and sex
05:52are categorical variables
05:53but notice that these variables are collected
05:56and coded numerically.
05:58This is a common method for recording categorical data
06:01where each category is given a numerical label.
06:06For example, sex is coded as 1 for males
06:10and 2 for females.
06:12We discourage the use of character variables or text
06:15when collecting categorical data
06:17because our statistical software is case sensitive
06:21when reading character data.
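To see why case sensitivity matters, here is a small Python sketch with made-up entries. It assumes the same five patients were recorded once as free text and once with numeric codes.

```python
from collections import Counter

# Hypothetical entries for the same five patients.
sex_as_text = ["Male", "male", "MALE", "Female", "female"]
sex_as_code = [1, 1, 1, 2, 2]

# Case-sensitive software treats each spelling as its own category.
text_counts = Counter(sex_as_text)
code_counts = Counter(sex_as_code)

print(len(text_counts))  # five apparent categories
print(len(code_counts))  # two categories
```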
06:24It's important to maintain a data key
06:26that defines the numerical coding.
06:29This is especially important
06:31when sharing the data with others.
06:33Here, our data key would include the definition
06:36of sex equals 1 as corresponding to males
06:39and sex equals 2 as corresponding to females.
06:43We recommend creating a separate tab in the Excel file
06:47that defines the numerical coding
06:49of the categorical variables in the data.
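A data key can be as simple as a lookup table. The sketch below is one hypothetical way to represent it in Python. The sex coding comes from the example above, while the `response` labels and the `decode` helper are assumptions made for illustration.

```python
# Data key: for each categorical variable, map numeric codes to labels.
data_key = {
    "sex": {1: "Male", 2: "Female"},
    "response": {0: "No response", 1: "Response"},  # labels are hypothetical
}

def decode(variable, code):
    """Return the label for a coded value; a missing value stays missing."""
    if code is None:
        return None
    return data_key[variable][code]

print(decode("sex", 1))
print(decode("sex", 2))
```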
06:53A few notes on naming the variables in your spreadsheet.
06:57Variable names should be descriptive,
07:00making it clear what the variable represents.
07:03SAS variable names may be up to 32 characters in length.
07:07The first character of the variable name
07:09must be an alphabetic character
07:12or an underscore.
07:13Variable names should not begin with a number or symbol.
07:17And finally, the variable name should not contain
07:20any special characters other than the underscore.
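These naming rules are easy to check before you share a file. The following Python sketch encodes only the rules stated above (at most 32 characters, first character a letter or underscore, no special characters other than the underscore). The function name is hypothetical, and SAS may impose rules that are not covered here.

```python
import re

# First character: letter or underscore; then up to 31 letters, digits, or underscores.
_VALID_NAME = re.compile(r"[A-Za-z_][A-Za-z0-9_]{0,31}")

def is_valid_variable_name(name):
    """Check a column name against the naming rules described above."""
    return _VALID_NAME.fullmatch(name) is not None

# Hypothetical column names.
for name in ["age_at_dx", "_flag", "2nd_visit", "total bilirubin", "INR%"]:
    print(name, is_valid_variable_name(name))
```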
07:25Sometimes certain variables are not available
07:28or not collected.
07:29We see in our sample data table
07:31that we have a few missing values.
07:34Missing values occur when we don't have the data available
07:37for that individual.
07:39One or more variables can be missing for an individual.
07:43Here, race is missing for patients 106 and 117.
07:48Date of diagnosis is missing for patient 106.
07:53There are also no ultrasound findings
07:55for the majority of patients.
07:58Notice how missing values are represented
08:00in the data as empty cells.
08:02Do not use an N/A or a period or a dash
08:06to indicate missing values.
08:08Leave the cell blank when the value is missing.
08:11Finally, if any calculated variables
08:13involve variables with a missing value,
08:16then that calculated variable will also be missing.
08:20For example, if total bilirubin is missing for a patient,
08:23then their calculated MELD score will also be missing.
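Here is a minimal sketch of how a missing input carries through to a calculated variable. It uses the commonly cited original MELD formula with the usual floor of 1.0 on each lab value and a cap of 4.0 on creatinine. Treat the formula, the creatinine input, and the function name as assumptions to confirm against your study protocol.

```python
import math

def meld_score(bilirubin, inr, creatinine):
    """Original MELD score; returns None if any input is missing."""
    if bilirubin is None or inr is None or creatinine is None:
        return None  # a missing input makes the calculated variable missing
    # Assumed conventions: floor each value at 1.0, cap creatinine at 4.0.
    b = max(bilirubin, 1.0)
    i = max(inr, 1.0)
    c = min(max(creatinine, 1.0), 4.0)
    return 3.78 * math.log(b) + 11.2 * math.log(i) + 9.57 * math.log(c) + 6.43

# Hypothetical patients: the second has a blank total bilirubin cell.
print(meld_score(2.0, 1.5, 1.2))
print(meld_score(None, 1.5, 1.2))
```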
08:28Finally, I want to discuss some special considerations
08:31for collecting endpoint or response data in your study.
08:36Remember that your primary endpoint
08:38answers your primary research question.
08:41You may also be collecting data on secondary,
08:44tertiary, or other exploratory endpoints.
08:47Endpoints are either continuous or quantitative,
08:50categorical or most often dichotomous,
08:54or they're some measure of time to an event of interest.
08:59Looking at quantitative variables,
09:01we can begin by summarizing a quantitative variable
09:05in the full sample.
09:07We can summarize and compare that quantitative variable
09:10in two or more groups
09:12such as the group exposed to a particular intervention
09:16and the group unexposed to that intervention.
09:19And we can also analyze that quantitative variable
09:22in a regression model,
09:24allowing us to control for certain confounders.
09:28In this 2022 hepatology paper,
09:30they explore how TIPS affects PPG
09:33in patients with ascites.
09:36They found that mean PPG, portal pressure and IVC pressure
09:41decreased significantly after TIPS.
09:44These are quantitative endpoints
09:46and these measurements are taken at two points in time,
09:49before TIPS and after TIPS.
09:53To analyze the change in these quantitative measures,
09:56we need the measurements recorded in our data
09:59before TIPS and after TIPS.
10:01We can use statistical software
10:03to compute and analyze the response of interest,
10:06which is the change in these measures.
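As a sketch of what that computation looks like, the Python snippet below derives the change from the two recorded measurements. The values and column names are invented for illustration and are not taken from the paper.

```python
# Hypothetical measurements recorded before and after TIPS (mmHg).
measurements = [
    {"patient_id": 1, "ppg_before": 20.0, "ppg_after": 8.0},
    {"patient_id": 2, "ppg_before": 18.0, "ppg_after": 10.0},
    {"patient_id": 3, "ppg_before": 22.0, "ppg_after": None},  # blank cell
]

def change(before, after):
    """Change from before to after; missing if either measurement is missing."""
    if before is None or after is None:
        return None
    return after - before

changes = [change(r["ppg_before"], r["ppg_after"]) for r in measurements]

# Summarize the change among patients with both measurements.
observed = [c for c in changes if c is not None]
mean_change = sum(observed) / len(observed)
print(changes, mean_change)
```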
10:10The next type of endpoint we often work with
10:13is a dichotomous or binary endpoint.
10:16For example, the goal could be to determine
10:19if the patient's disease improves
10:22over the course of the study.
10:23This is recorded as a binary variable,
10:26improvement, yes or no.
10:28Similarly, we can look at occurrence
10:30of surgical site infection following liver transplant.
10:34Taking it a step further,
10:36we look for an association between an exposure,
10:38such as exposure to perioperative antibiotics
10:42compared to intraoperative antibiotics,
10:44and development of surgical site infection.
10:47We can also model development of surgical site infection
10:50using predictor variables
10:52such as exposure to a specific treatment
10:55while controlling for potential confounders.
10:59We can use similar methods to analyze categorical responses
11:02with more than two levels and ordinal responses
11:06in which the endpoint categories are ordered.
11:10In this 2019 hepatology paper,
11:13they compare the proportion of patients
11:15with surgical site infection
11:17in those patients receiving 72 hours
11:19of perioperative antibiotics or extended antibiotics,
11:24and those receiving intraoperative antibiotics only
11:28or short antibiotics.
11:31The primary endpoint here
11:32is development of surgical site infection.
11:36They also look at 30-day hospital readmission
11:39and 30-day mortality,
11:41which are also dichotomous responses.
11:44Sometimes we use quantitative data
11:47to define a categorical variable
11:49such as a dichotomous endpoint.
11:52For example, clinically significant portal hypertension
11:55is defined as an HVPG greater than or equal
11:59to 10 millimeters of mercury.
12:02You would collect the quantitative data.
12:04And then, we would create the categorical
12:06or dichotomous variable to use in the analysis.
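Deriving the dichotomous variable from the recorded HVPG is a one-line rule. The sketch below uses the threshold of 10 mmHg stated above. The function and constant names are hypothetical.

```python
CSPH_THRESHOLD_MMHG = 10.0  # HVPG greater than or equal to 10 mmHg

def has_csph(hvpg):
    """1 if clinically significant portal hypertension, 0 if not, None if HVPG is missing."""
    if hvpg is None:
        return None
    return 1 if hvpg >= CSPH_THRESHOLD_MMHG else 0

# Hypothetical HVPG values, including one blank cell.
hvpg_values = [6.5, 10.0, 14.0, None]
csph = [has_csph(v) for v in hvpg_values]
print(csph)  # [0, 1, 1, None]
```

Recording the quantitative value also keeps other cutoffs available later, which would not be possible if only the yes or no category had been collected.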
12:11The final endpoint that we most often see
12:14is a time-to-event or survival endpoint
12:18such as time to death, time to decompensation,
12:21time to recovery or response.
12:24For example, our goal could be to determine
12:27if patients exposed to a new treatment
12:29have longer survival times
12:31or greater likelihood of survival
12:33to a certain time point.
12:35When looking at one group such as the overall sample,
12:38we could report survival probabilities
12:40from a Kaplan-Meier survival curve
12:43or median survival time.
12:45Median survival time
12:47is the time beyond which 50% of the individuals
12:50are expected to survive.
12:52We can naturally extend this to two or more groups
12:55and formally compare the groups,
12:58and we can also build regression models
13:00that estimate the relationship
13:02between rate of the event and exposure variables
13:05such as treatment status.
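For readers curious about what the software does, here is a minimal Python sketch of the Kaplan-Meier estimate and the median survival time read from it. The data are invented, the function names are hypothetical, and in practice this would be done in SAS or R rather than by hand.

```python
def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct event time.

    times: follow-up time for each subject
    events: 1 if the event was observed at that time, 0 if censored
    """
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        n_events = 0
        n_leaving = 0
        # Group everyone whose follow-up ends at time t.
        while i < len(data) and data[i][0] == t:
            n_events += data[i][1]
            n_leaving += 1
            i += 1
        if n_events > 0:
            survival *= 1 - n_events / n_at_risk
            curve.append((t, survival))
        # Subjects with an event or censored at t are no longer at risk.
        n_at_risk -= n_leaving
    return curve

def median_survival(curve):
    """First time at which estimated survival is 50% or lower; None if never reached."""
    for t, s in curve:
        if s <= 0.5:
            return t
    return None

# Hypothetical follow-up times in months with event indicators.
times = [2, 3, 4, 5, 6, 8]
events = [1, 0, 1, 1, 0, 1]
curve = kaplan_meier(times, events)
print(curve)
print(median_survival(curve))
```

Note that the censored subjects still count in the number at risk up to their censoring time, which is how their follow-up contributes to the estimate.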
13:09So far with the other endpoints,
13:11data collection has been pretty intuitive.
13:14However, survival data requires
13:16certain pieces of information
13:18to properly complete the analysis.
13:20In survival data,
13:22the outcome is time to a target event occurring.
13:26Dates are important because we need to compute time
13:30from a specific start point to an event.
13:33To calculate the survival time in your study,
13:35you must precisely define:
13:38the time origin or the starting point,
13:40that is when follow up begins;
13:42the ending event of interest, such as death,
13:45decompensation, remission, relapse;
13:49and the measurement scale for the passage of time,
13:51for example, days, months, years.
13:56An important feature of survival data
13:58is that we may have patients who do not experience the event
14:02during the study period.
14:04For example, some patients
14:06may not decompensate during the study.
14:09Those patients are censored.
14:11Here, patient one enters the study
14:13and experiences the event at three months.
14:17Patient two enters the study
14:18and is either lost to follow-up
14:21or withdraws from the study before we observe the event.
14:24Patient two is censored at his withdrawal
14:27or at the last time we knew he was event free at two months.
14:32Patient three enters the study
14:34and does not experience the event
14:35before the end of the study.
14:38Patient three is censored
14:39at the administrative end of the study at four months.
14:44It's straightforward to compute survival time in patient one
14:47because he experienced the event,
14:50so his survival time is his event date
14:53minus his entry date.
14:56For those censored,
14:57there's no time-to-event
14:59but we want to account for time at risk
15:01because they're at risk for the event.
15:04But the fact that they don't experience the event
15:06before their loss or withdrawal
15:09or end of follow up is meaningful.
15:12We compute their survival time
15:14as their censoring date minus their entry date.
15:19Survival endpoints must contain two variables:
15:22an event indicator,
15:24which is usually referred to as a censoring indicator
15:28and the patient's survival time
15:30which is time-to-event
15:32for those experiencing the event
15:34or time to censoring for those censored.
15:37Patient one experiences the event
15:40so his event indicator is equal to 1.
15:42Patients 2 and 3 are censored
15:45so their event indicator is equal to 0.
15:49And survival time is time from entry
15:51to the event or censoring, whichever occurs first.
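The three patients in this example can be written out as a short Python sketch. The function name is hypothetical, and times are expressed in months since study entry, as in the example.

```python
def survival_record(event_month, censor_month):
    """Return (survival_time, event_indicator) using whichever occurs first."""
    if event_month is not None and (censor_month is None or event_month <= censor_month):
        return event_month, 1   # event observed
    return censor_month, 0      # censored

# Months since entry for the three patients described above.
patient_1 = survival_record(event_month=3, censor_month=None)   # event at 3 months
patient_2 = survival_record(event_month=None, censor_month=2)   # lost or withdrew at 2 months
patient_3 = survival_record(event_month=None, censor_month=4)   # end of study at 4 months

print(patient_1, patient_2, patient_3)  # (3, 1) (2, 0) (4, 0)
```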
15:56We recommend that you collect the relevant dates
15:59and allow us to compute survival time.
16:02Here we have the sample toy data set
16:05that we looked at earlier.
16:06The goal is to analyze time to response
16:09and describe the probability
16:11of treatment response over time.
16:13The first variable needed to define the survival endpoint
16:18is the event or censoring indicator.
16:21Here we identify which subjects experienced
16:23the event of interest using the variable called response.
16:28Response is equal to 1 if the subject experienced the event
16:32and 0 if the subject did not experience the event.
16:37Next, collect intake date or the start of follow up.
16:41In the subjects who experienced the event,
16:44the variable response date
16:46records the date the event occurred.
16:48Notice that this variable is missing
16:51for those who do not experience the event.
16:54For censored patients,
16:56we compute their survival time
16:58as the time between intake and censoring.
17:02Censoring time is the last time
17:04we knew the subject was event free.
17:06In this case, we can use the last visit date for a patient.
17:11You can record last visit date for all patients
17:14and we would use this information as the censoring date
17:17for those who do not experience the event.
17:20Notice that the last visit dates
17:22for subjects 106 and 107 are after their response dates.
17:27But when we compute the time variable for these subjects,
17:30we would use time from intake to the response date.
17:35You could also leave the last visit date blank
17:38for those subjects who experienced the event.
17:41We will use the response column and the time column
17:44as the response variable in the survival analysis.
17:48The response variable indicates if the time variable
17:51represents time to event or time to censoring.
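Putting these columns together, the time variable could be derived roughly as follows. This is a Python sketch under stated assumptions: the dates are invented, subject 108 is hypothetical, the column names mirror the ones described above, and time is measured in days.

```python
from datetime import date

# Hypothetical rows. Subject 106 responded; subject 108 did not.
subjects = [
    {"id": 106, "response": 1, "intake_date": date(2021, 1, 10),
     "response_date": date(2021, 3, 11), "last_visit_date": date(2021, 6, 1)},
    {"id": 108, "response": 0, "intake_date": date(2021, 2, 1),
     "response_date": None, "last_visit_date": date(2021, 5, 2)},
]

def time_in_days(row):
    """Days from intake to the response date (event) or last visit date (censored)."""
    if row["response"] == 1:
        end = row["response_date"]     # later visits are ignored
    else:
        end = row["last_visit_date"]   # last time known to be event free
    if end is None or row["intake_date"] is None:
        return None
    return (end - row["intake_date"]).days

# Each subject contributes a (response, time) pair to the survival analysis.
endpoint = [(row["response"], time_in_days(row)) for row in subjects]
print(endpoint)  # [(1, 60), (0, 90)]
```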
17:56In the 2022 hepatology paper
17:59looking at the effect of TIPS on PPG
18:02in patients with ascites,
18:04we know that they looked at PPG reduction
18:06as a continuous endpoint
18:08but they also looked at survival or time to death
18:11in patients with ascites resolution after TIPS
18:14(the purple curve)
18:15compared to those with a persistent need for paracentesis
18:20six weeks after TIPS (the red curve).
18:23They found significantly improved survival
18:26in those with ascites resolution.
18:30In this video, we discussed data collection
18:32and variable types and specifically looked
18:35at quantitative endpoints, dichotomous endpoints,
18:39and time-to-event endpoints.
18:41Now that we've seen different examples
18:43of common endpoint types in clinical research,
18:46in our next video, the fourth video in this series,
18:50we'll discuss an important step in the study design process
18:55and that's sample size determination.