Skip to Main Content

SAS Video 2.1

July 09, 2019
Review, Arithmetic Operations, Functions, Date Variables
ID
3396
Maria Ciargleglio

Transcript

  • 00:01Welcome to video 2.1.
  • 00:03The first of our lab 2 videos in our SAS lab video series.
  • 00:07My name is Maria Ciarleglio, and I'm from the Department of
  • 00:10Biostatistics at the Yale School of Public Health.
  • 00:14In this lab we'll cover how to work with variables in SAS.
  • 00:17Specifically, how to create new variables from existing variables
  • 00:21in a dataset.
  • 00:26The topics covered in this lab will be presented over three videos.
  • 00:30Video 2.1 will begin by reviewing proc import to read
  • 00:34a data file into SAS.
  • 00:36We learned how to use this technique in video 1.2.
  • 00:40Then we will begin looking at a number of different ways of creating
  • 00:43variables.
  • 00:45We will cover arithmetic operations, such as addition, subtraction,
  • 00:49multiplication, division, and exponentiation.
  • 00:53Next, we'll look at some commonly used mathematical functions that
  • 00:56can transform a variable.
  • 00:58For example, the exponential function, square root, and logarithmic
  • 01:01functions.
  • 01:03We'll also discuss some arithmetic and statistical functions that
  • 01:07allow us to sum and find the mean standard deviation,
  • 01:10minimum or maximum value of several variables for each
  • 01:14subject.
  • 01:15At the end of video 2.1.
  • 01:17We will discuss how to work with date variables.
  • 01:20We will add a date variable to a SAS data set, calculate the length
  • 01:24of time between two date variables, and discuss different ways
  • 01:27of displaying or formatting date variables.
  • 01:31In video 2.2, we'll continue to learn about creating variables
  • 01:35using comparison operators such as less than,
  • 01:39greater than, or equal to.
  • 01:41These are commonly used in if-then statements.
  • 01:44For example, if a certain condition is true, then a
  • 01:47new variable will take on a certain value.
  • 01:51And lastly, we'll discuss logical operators such as
  • 01:54and, or, and not.
  • 01:57Video 2.3 will show you how to save this new analysis
  • 02:01dataset that you've created containing all of your new variables.
  • 02:06We will again look at data from the Framingham Heart Study.
  • 02:10The data file that we will use in this video is an Excel file called
  • 02:14Framingham subset 2 which is available in the lab 2
  • 02:17Resources folder on the course web page.
  • 02:20This dataset is similar to Framingham subset that we used
  • 02:24in videos 1.2 and 1.3, but some
  • 02:27of the variables have been modified and I've also added a few new
  • 02:31variables.
  • 02:33There are 30 variables in the data set shown in columns and
  • 02:384434 individual observations shown as rows.
  • 02:42Notice that the number of rows in our Excel spreadsheet is
  • 02:474435. This is because the first row of the spreadsheet contains
  • 02:50the variable names, and the individual observations begin
  • 02:54on row 2. This version of Framingham subset includes
  • 02:58a variable I created describing participant race.
  • 03:01And for those who died, I have generated a date of death
  • 03:05in order to give us an opportunity to work with missing data, the 10th
  • 03:08subject shown in row 11 has been modified to give
  • 03:12him some missing data values.
  • 03:14This will allow you to see how missing values in the data affect the
  • 03:17creation of new variables.
  • 03:19Additionally instead of only having information on baseline total
  • 03:23cholesterol, we now have total cholesterol information collected
  • 03:27during three examination periods.
  • 03:30In this study, the three examination periods were approximately six
  • 03:33years apart.
  • 03:35Touching again on missing data.
  • 03:36Notice that when an individual does not have cholesterol level
  • 03:40measured at one of the follow up examinations, the cell is
  • 03:43left blank. When SAS imports these data, SAS
  • 03:47knows that a blank cell indicates a missing value.
  • 03:50So if you're collecting data in an Excel spreadsheet simply leave
  • 03:54the cell blank if a subject has a missing value.
  • 03:57We will use the import procedure to read this Excel file
  • 04:01into SAS which will create a SAS dataset.
  • 04:04The SAS dataset will be given the name Framingham in SAS.
  • 04:08The imported data set will have the same number of variables and
  • 04:11observations as the original Excel spreadsheet.
  • 04:15The first step to working with data and adding new variables to
  • 04:19a data set is to get your data into SAS.
  • 04:22We will use the import procedure to reading our external data file.
  • 04:26This method was discussed in video 1.2.
  • 04:30Here is what typical SAS code for this procedure looks
  • 04:33like. Remember our trick of using the import wizard
  • 04:37and saving the code generated by the wizard for use later.
  • 04:42The generated code is what I have pasted here.
  • 04:45The first line begins with proc import which is the SAS
  • 04:49procedure name. Out = work.Framingham
  • 04:53is where we can specify the library where the imported SAS
  • 04:56data set will be saved, and we can also name the data set.
  • 05:00Here we're saving this data set to the temporary work library
  • 05:04that we discussed in video 1.3.
  • 05:06Remember that the contents of the work library will disappear once
  • 05:10we exit SAS. This means that if we exit our current SAS
  • 05:14session, any data sets that we created in the work library
  • 05:17will need to be re-imported, and any new variables that we created
  • 05:21will need to be recreated once we launch SAS if we want
  • 05:25to continue analysing these data.
  • 05:28We can also create a permanent library using the lib name
  • 05:31statement to save our imported data as a permanent SAS
  • 05:35data set. In video 2.3, I will give you a few
  • 05:39options for saving your modified data set containing all of
  • 05:42your new variables, so that you can directly use that analysis data
  • 05:46set later without having to repeat your data management
  • 05:49steps.
  • 05:51Note that if we had omitted the library name here and just
  • 05:54written out equals Framingham, the Framingham dataset would
  • 05:58have by default been saved to the work library anyway.
  • 06:03The name we are giving the data set is the name Framingham.
  • 06:06So when we refer to this data set in our code we can simply
  • 06:10call it Framingham and SAS will know to look in the
  • 06:13work library if no other library is specified.
  • 06:18On the second line we see the word data file followed by the equal
  • 06:21sign and the physical location of the file.
  • 06:24In this case, the excel file that we're going to import into SAS.
  • 06:28Notice that we put the path of our excel file in double quotes.
  • 06:31You can also use single quotes but just be sure to be consistent
  • 06:35on both ends.
  • 06:37On the 3rd line of code, we have the acronym DBMS
  • 06:40for database management systems.
  • 06:42We need this line of code to tell SAS what type of file we are
  • 06:46importing. Here we're importing an Excel file in this example.
  • 06:50The word replace is necessary to overwrite
  • 06:54any other existing files in SAS by the same name.
  • 06:58Here is where we also put our first semicolon.
  • 07:01The fourth line of code tells SAS which Excel worksheet to
  • 07:05read in. Here we're specifying the name that appears
  • 07:08on the tab of the specific worksheet of interest.
  • 07:12In our excel file Framingham subset 2, our worksheet
  • 07:16is named Framingham. So that is what we specify in the range
  • 07:19statement. We also end this line with a semicolon.
  • 07:23On the fifth line, get names equals yes tell SAS
  • 07:27that there is a header row in the Excel file which contains the
  • 07:30variable names. Line 6 through 9 specify how
  • 07:34text, date, and time variables are imported.
  • 07:38Finally we end the import procedure with the word run,
  • 07:41followed by a semicolon.
  • 07:44Now let's open SAS and run this procedure to
  • 07:47import Framingham subset 2.
  • 07:50I'm going to open SAS by directly double clicking on the existing
  • 07:53program file for this lab called SAS code Video
  • 07:572. This program file is available in the lab 2 Resources
  • 08:01folder on the course web page so that you can follow along.
  • 08:06In the program editor, I like to begin my code with some comments.
  • 08:10I've also pasted the proc import code generated by the
  • 08:14import wizard and inserted a comment about the procedure
  • 08:18to document what I'm doing and to help me remember my thought process
  • 08:21if I return to this code later.
  • 08:24Most of the comments in this file are to help you better
  • 08:28understand and locate the different pieces of code.
  • 08:31But the use of comments is completely up to you.
  • 08:34You can make them as detailed as you want or omit them altogether.
  • 08:38You will likely have to modify the data file statement to point
  • 08:41to the location where you have Framingham subset 2
  • 08:45saved on your computer.
  • 08:47I have the Framingham subset 2 excel files saved
  • 08:50in the SAS video 2 folder on my USP flash drive
  • 08:54which is mapped as drive G on my computer.
  • 08:58I'm going to highlight and submit or run the import procedure.
  • 09:02Remember you can run your highlighted code by either clicking on the
  • 09:05running man button in the menu bar or by using the keyboard
  • 09:09shortcuts F3 or F8.
  • 09:12Let's check our log window.
  • 09:13The lines in black are the lines of code we highlighted and submitted
  • 09:17and the lines in blue tell us that our SAS data set worked up
  • 09:21Framingham was successfully created.
  • 09:23The log also reports the number of observations and the number
  • 09:27of variables in the imported dataset.
  • 09:29This should match the number in the original imported excel file.
  • 09:34Next let's check the contents of the Framingham dataset.
  • 09:37We will use the contents procedure to give us a list of the variables
  • 09:41in our dataset.
  • 09:43We'll also be able to see if the variable was imported as
  • 09:46a numeric or character variable.
  • 09:49I would like the variables to be listed in the order they are included
  • 09:52in the file rather than in alphabetical order, so I specified
  • 09:56the order equals var num option in the first line of
  • 10:00PROC CONTENTS. I find this option useful because there tends to
  • 10:03be a reason for the ordering of variables in a dataset.
  • 10:07Although if I'm looking to see if the dataset contains a specific
  • 10:10variable, then looking at the variable list in alphabetical order
  • 10:14would be more helpful. Again highlight and run the code.
  • 10:19Checking the log, I don't see any errors.
  • 10:22But I do see a new window.
  • 10:24The results viewer window.
  • 10:26This is the window that contains our output.
  • 10:29Here we see the data file Framingham is saved in the work
  • 10:32library because under dataset name we see work.Framingham.
  • 10:37This dataset has 30 variables and
  • 10:414434 observations.
  • 10:43I also see the intake and date of death variables were
  • 10:46imported correctly as a date, and the sex and race
  • 10:50variables were imported correctly as character variables.
  • 10:54It's always a good idea to check that your data are being imported
  • 10:58correctly.
  • 11:00We will take the Framingham data set that we just imported and
  • 11:03create several new variables.
  • 11:05The new variables will appear as new columns at the end of
  • 11:09the data set. This process creates a new SAS
  • 11:12data file that contains both the original variables and
  • 11:16the new variables.
  • 11:18I call this new data set my analysis data set because
  • 11:21oftentimes we are performing this data management and creating
  • 11:25these new variables for the purpose of conducting data analysis
  • 11:29later on. The data set that then contains the variables we
  • 11:32need for a future analysis is my analysis data set.
  • 11:36We have two options for saving this new data set.
  • 11:39We can overwrite the existing Framingham data set by giving
  • 11:43the new data set. the name Framingham, the same as the original.
  • 11:47Alternatively, we can create a completely new dataset by
  • 11:51assigning it a different name.
  • 11:52In our example we will name our new dataset Framingham
  • 11:56new. This approach allows us to keep the Framingham
  • 12:00SAS data file in its original form.
  • 12:03Our new dataset is a copy of Framingham plus any newly
  • 12:07created or modified variables.
  • 12:09Be careful when using option one because you will overwrite
  • 12:13the version of Framingham that you originally imported.
  • 12:16I don't mean that you will overwrite the excel file; you will
  • 12:19overwrite the SAS data file that was created as a result of
  • 12:22the import procedure.
  • 12:24Once you overwrite a data set you cannot simply go back to the
  • 12:28previous version. For example, if we overwrite Framingham
  • 12:31and add 10 new variables, the dataset will now contain 40 variables.
  • 12:36This means that we've lost the original version of the dataset with
  • 12:3930 variables. To get the original back, we would have to rerun
  • 12:43the import procedure.
  • 12:44My preference is to always leave the original data file as
  • 12:48is and create a second analysis dataset that contains any
  • 12:51modifications changes or new variables.
  • 12:55This way if I make a mistake in creating a new variable or if I
  • 12:58accidentally overwrite an existing variable in the original dataset
  • 13:02I can correct my code and use the original dataset to try
  • 13:06again.
  • 13:07We will create new variables using a data step.
  • 13:11We used a data step in video 1.2 to read data
  • 13:14into SAS using manual data entry.
  • 13:17On line one, we begin the DATA step with the word data
  • 13:21followed by the name of the data set that we want to create in this
  • 13:24data step. Because I prefer option 2, I'm
  • 13:28giving the modified data set a new name, Framingham new.
  • 13:32Because I'm not specifying a library, this new data set will
  • 13:36be saved in the work library.
  • 13:38You could explicitly specify the work library as the location
  • 13:42of Framingham new as we see here but it's not necessary.
  • 13:46When you don't specify a library., SAS will default to
  • 13:50the work library. We end the data statement with a
  • 13:53semicolon.
  • 13:55On the second line I begin the SET statement.
  • 13:58This is where I identify the dataset that will be read
  • 14:01into the data step.
  • 14:03We want to read the original Framingham data set into
  • 14:07the DATA step, create a copy called Framingham new,
  • 14:10and make any of the modifications to the dataset specified
  • 14:14within the body of the DATA step.
  • 14:16This is where we will define new variables, usually by performing
  • 14:20operations on existing variables.
  • 14:23Again, we are reading the Framingham data set in from the work
  • 14:27library. You could have also put the SET statement on the same
  • 14:30line as the data statement.
  • 14:33In SAS, it's not the line breaks that indicate the end
  • 14:36of each statement, it's the semicolons.
  • 14:39End the DATA step with the word run followed by a semicolon.
  • 14:43Now we can discuss SAS operators and functions that are
  • 14:47used in SAS programming statements to create new variables.
  • 14:51We will begin by discussing arithmetic operators.
  • 14:55The first arithmetic operation we will discuss is addition.
  • 14:59We will use addition to add two or more variables together
  • 15:03to make a new variable, or to add a constant
  • 15:06to one or more variables.
  • 15:08We type the plus symbol between the terms we are summing.
  • 15:12We have five binary variables in our Framingham dataset
  • 15:16that report the presence of prevalent disease at the time of entry
  • 15:19into the study: prevalent coronary heart disease, angina
  • 15:23pectoris, myocardial infarction stroke, and prevalent
  • 15:27hypertension.
  • 15:28The variable equals one, if the individual has the prevalent disease,
  • 15:32and zero if the individual is free from the disease.
  • 15:36We see the count and percentage of individuals with each prevalent
  • 15:39condition reported in these frequency tables with prevalent
  • 15:43hypertension being the most common prevalent condition in these
  • 15:46individuals. We want to create a new variable in the dataset
  • 15:50that counts the number of prevalent conditions that a participant
  • 15:54has. Because the variables indicating presence or
  • 15:57absence of disease are binary, if we sum these indicator
  • 16:01variables, we will be counting the number of prevalent conditions
  • 16:05for each individual.
  • 16:06Here we are naming this new account variable prev cond,
  • 16:10and this variable will be in Framingham new.
  • 16:13Also notice that when it comes to variable names, SAS is
  • 16:17not case sensitive.
  • 16:18Although the prevalent condition variables are all capitalized
  • 16:22in the dataset Framingham, I do not need to use all caps
  • 16:25when referencing the variables.
  • 16:28Just so it's clear I want you to see that these operations will be
  • 16:31applied to each subject.
  • 16:33The subjects are represented in the rows of the data set.
  • 16:36Seven subjects are shown here.
  • 16:38The last subject happens to be the 10th subject in the original
  • 16:42dataset. This subject has several missing data values so that
  • 16:45you can see how variable creation behaves in the presence of
  • 16:49missing input values.
  • 16:51When you follow along in print some of the new variables we will
  • 16:54create using the code in the SAS program file, the obs
  • 16:58equals ten option in PROC PRINT will print the first
  • 17:0210 subjects in the dataset.
  • 17:04The new variable we're creating sums the values of
  • 17:08prevalent CHD, AP, MY, stroke, and hypertension
  • 17:12for each subject, essentially counting the number of prevalent
  • 17:16conditions each subject has.
  • 17:18Prev Cond equals zero for the first subject shown because
  • 17:22he has no prevalent conditions. One for the second subject,
  • 17:26three for the third, two for the fourth, zero
  • 17:29for the fifth, and one for the six.
  • 17:32When it comes to subject seven we see that his value of
  • 17:36Prev Cond is missing.
  • 17:38This is because one of the variables involved in calculating
  • 17:41the sum is missing.
  • 17:43Let's move on to subtraction.
  • 17:45Here we create a new variable called pulse pressure in Framingham
  • 17:49new which is the difference between systolic and diastolic blood
  • 17:53pressure. The next operation is multiplication which
  • 17:57uses the asterisk symbol.
  • 17:59We are creating a variable age months which is simply
  • 18:02age expressed in months and not in years.
  • 18:06We do this by multiplying the variable age in the Framingham
  • 18:09dataset which is aging years by the number 12.
  • 18:13The constant twelve will be multiplied by each individual's
  • 18:17age to give us our new variable.
  • 18:19Again age in months is missing when age itself is missing.
  • 18:24Looking next at Division we can calculate mean arterial pressure
  • 18:28the average pressure in a patient's arteries during one cardiac
  • 18:31cycle. By adding systolic blood pressure to twice diastolic
  • 18:35blood pressure and dividing that quantity by three.
  • 18:39This example illustrates how you can use multiple arithmetic
  • 18:42operations together when defining a variable.
  • 18:45Here we are using addition, multiplication, and division.
  • 18:49Also notice the use of parentheses.
  • 18:52Here, we want to divide the quantity systolic plus
  • 18:55twice diastolic by three.
  • 18:58If we had admitted the parentheses, SAS would have only divided
  • 19:02the quantity, two times diastolic by three.
  • 19:06A second formula for map involves first calculating
  • 19:10pulse pressure, which we have, and dividing that by three, then
  • 19:13adding diastolic blood pressure.
  • 19:15Even though we have not yet run the DATA step to formally create
  • 19:19pulse pressure, you can still reference variables that you
  • 19:22defined on an earlier line in that same data step.
  • 19:27Finally, we apply exponentiation using two asterisks.
  • 19:31We calculate age squared or age to the power 2.
  • 19:36As a reminder if one of the arguments for an arithmetic
  • 19:39operator is missing, the result is missing.
  • 19:44I want to take a moment before we get into the next topic to show
  • 19:47you how to use SAS as a calculator.
  • 19:50SAS can be used to perform calculations such as simple arithmetic
  • 19:54operations that we've seen or more complex calculations
  • 19:58involving different built in functions which we'll discuss shortly.
  • 20:02First let me show you how not to perform calculations
  • 20:05in your SAS program.
  • 20:07Suppose you wanted SAS to compute the sum of 1 in 1,
  • 20:11so 1 plus 1.
  • 20:12Typing 1 plus 1 in your program and running this line of
  • 20:16code will not return an answer.
  • 20:18In fact, you'll receive an error in your log.
  • 20:21Maybe we need to assign the result to a variable.
  • 20:24We want the variable called answer to equal the result of the
  • 20:28calculation. Unfortunately, this also isn't the way to
  • 20:31do it. You will again receive an error in your log.
  • 20:35SAS is different from other object oriented programming languages.
  • 20:39For the most part you can't have variables floating around.
  • 20:42You need to work within data sets which means we
  • 20:46need to perform our calculations within a data step.
  • 20:50We do this even though we're not reading in a data file using
  • 20:54the SET statement. We still need to work within a data
  • 20:57step when using SAS as a calculator .This means
  • 21:01that we need to print the data set that we create in
  • 21:05our data step or navigate to the created data set
  • 21:08in the Explorer tab in order to see the results of
  • 21:12our calculations.
  • 21:14The calculated value which is the output of the function
  • 21:17or the operation, will need to be assigned a variable name,
  • 21:21and that variable which holds the result will be the last value
  • 21:25in your created data set.
  • 21:26Or perhaps it will be the only value in your data set, as we'll see
  • 21:29with this first example.
  • 21:31As an example of how to use SAS as a calculator, let's use
  • 21:35some of the arithmetic operations that we're already familiar with.
  • 21:39I'm naming the data set that will contain the results of my
  • 21:42calculations, calc.
  • 21:44I want to compute an individual's body mass index, or
  • 21:48BMI, using the formula, weight in kilograms divided
  • 21:51by height in meters squared.
  • 21:53The BMI I want to calculate is for an individual with a weight
  • 21:57of 70 kilograms and a height equal to 1.6
  • 22:01meters. The result of the calculation will be in a
  • 22:05variable called BMI.
  • 22:07After running the DATA step when I run the print procedure to print
  • 22:11the data set called calc, I see a single row in a single
  • 22:14variable, the variable that we created called BMI.
  • 22:18The no OBS option in PROC PRINT suppresses SAS
  • 22:22is customary observation number.
  • 22:24Another way of performing this calculation is to define a variable
  • 22:28for weight and a variable for height and then use those
  • 22:32variables in the formula for BMI.
  • 22:35We can compare this to the calculation that we performed on line 2
  • 22:39where we plug the raw values directly into the formula.
  • 22:42As you would expect, we get the same result.
  • 22:45Notice here that the data set Calc 2 contains all
  • 22:48of the variables that we created in this second data step.
  • 22:53Next we'll discuss using mathematical functions in SAS.
  • 22:57Functions can be applied to a single variable or to a set of variables
  • 23:01to help us transform or analyze data.
  • 23:04There are many built in functions in SAS, and we'll be looking at some
  • 23:07of the most frequently used mathematical and statistical functions.
  • 23:12Functions are applied in a data step in the same way we carried
  • 23:16out arithmetic operations.
  • 23:18Each function takes one or more arguments.
  • 23:21These arguments can be existing variables in a data set.
  • 23:24In this case you want to use the values of those variables to
  • 23:28evaluate the function. SAS will take the input values, the
  • 23:31arguments for each subject, apply the function, and create the
  • 23:35new variable in Framingham new.
  • 23:37For example, we have BMI in our Framingham dataset.
  • 23:41If I want to compute the natural log of BMI for all subjects
  • 23:44in my dataset, I would create a variable in Framingham new
  • 23:48that applies the log function, which is natural log in SAS, to
  • 23:51the existing variable BMI.
  • 23:54The name I chose for this new variable is log BMI.
  • 23:57Notice that the log function applied to a missing value, as we see
  • 24:01here in the last subject, returns a missing value, as you would
  • 24:04expect. A function that can take multiple arguments
  • 24:08is the max function.
  • 24:10Recall that we have total cholesterol collected over three study
  • 24:14periods. The largest non missing value of total cholesterol
  • 24:17for each subject would be the max of existing variables,
  • 24:21Total cholesterol 1, 2, and 3.
  • 24:24The arguments of the function are separated by commas.
  • 24:27Again, notice that if all of the input values of the function are
  • 24:31missing, as we see in the last subject, and the maximum of
  • 24:34a set of missing values is also a missing value.
  • 24:38In this data step we are producing new variables as a result
  • 24:42of the calculations.
  • 24:43The new variables will be appended as columns to the dataset
  • 24:47created in the DATA step.
  • 24:49Log BMI and Max cholesterol are the final columns
  • 24:53of Framingham new.
  • 24:56Here is a list of some commonly used mathematical functions that can
  • 24:59be applied to a number.
  • 25:01Here X is a real number.
  • 25:03The first function we'll discuss is the exponential function.
  • 25:07The exponential function E to the X calculates the value
  • 25:10of E raised to the power of your argument where E is the base
  • 25:14of the natural logarithm 2.718.
  • 25:17Let's.switch to SAS to apply the exponential function to
  • 25:21a variable.
  • 25:23Here I used a DATA step to create a variable X
  • 25:27in a data set that I call Example 1.
  • 25:29I have nine observations here but one observation is missing.
  • 25:33I'm including the missing X value to show you how the functions deal
  • 25:37with missing values. Normally in a data step I enter each
  • 25:40observation on a separate line after the word data lines.
  • 25:44SAS will run a complete iteration of the DATA step to construct
  • 25:48one observation using one raw line of data.
  • 25:51The one line of data can contain one or more variables.
  • 25:55The data step then repeats this process again and again until there
  • 25:58are no raw data lines left to read.
  • 26:01But you see here that I list all nine values of X on the same line.
  • 26:05I am able to do this because I include the double trailing at sign
  • 26:09in the input statement.
  • 26:10Basically this allows me to create several observations from
  • 26:14one line of data.
  • 26:15I could also break the data across more than one line and SAS will
  • 26:19continue to read the next observation in the order the values are
  • 26:22entered. Let's run the data step and check the log for errors.
  • 26:30SAS tells us that we have nine observations and one variable in
  • 26:33our data set Example 1.
  • 26:36We can either print the data set or navigate to it in the Explorer
  • 26:40tab to view it as a table.
  • 26:42Since it's a small data set, let's print it to our results viewer
  • 26:45window.
  • 26:47I see that all nine of my observations have been read in and are
  • 26:51contained in the data setmExample 1.
  • 26:54Notice that we see SAS's automatic observation numbering
  • 26:57because I did not use the no OBS option in PROC PRINT.
  • 27:01Let's go back to our code.
  • 27:03We want to transform X by exponentiating.
  • 27:06I can do this by writing a second data step that reads in the data
  • 27:10set Example 1, and creates a new variable which is the
  • 27:14exponentiated x value, and this would be a fine way to do it.
  • 27:17I'm preserving the original data set Example 1 and creating
  • 27:21a copy called example 1A which will contain e
  • 27:25to the X.
  • 27:26When we print our new data set
  • 27:28Example 1A, We see it contains both X and E
  • 27:31to the X. Notice that exponentiating a missing value
  • 27:35results in a missing value.
  • 27:37SAS also tells us that we try to perform an operation on a missing
  • 27:41value in the log.
  • 27:43I want to show you that you can also create new variables in the
  • 27:46original data step where I'm reading in the data.
  • 27:49I can define any new variables that I also want this dataset
  • 27:52to contain right after the input statement and before the data
  • 27:56line statement. For example I know that this dataset will contain
  • 28:00the variable x because I specified this variable in the
  • 28:03input statement.
  • 28:05I want this data set to also contain e to the X.
  • 28:08So I define a new variable which I name EXPX,
  • 28:12equal to the exponential function applied to the variable x.
  • 28:16This saves us from having to run an additional data step.
  • 28:21Let's print the data set to see if it contains our two variables.
  • 28:25And it does.
  • 28:27The function for natural log or log base e is simply
  • 28:31log in SAS.
  • 28:33You might be used to using LN to refer to natural log
  • 28:37but know that in SAS log is natural log.
  • 28:39The function for log base ten is log ten
  • 28:43both log and log ten must have positive arguments, otherwise
  • 28:47the function will return a missing value and you will see a note about
  • 28:50this in the SAS log.
  • 28:53Next we see absolute value and square root.
  • 28:56Square root of a negative number will return a missing value.
  • 28:59In order to round a number to a certain number of decimal places, we
  • 29:03use the round function.
  • 29:05Specify the argument and then specify to how many decimals
  • 29:08we'd like to round the number.
  • 29:10In this example we would like to round X to the hundredth place.
  • 29:15Int basically cuts off the decimal from a number and returns
  • 29:19the integer only. This function does not round.
  • 29:22Ceiling returns the smallest integer that is greater than
  • 29:26or equal to the argument.
  • 29:28Floor returns the largest integer that is less than or
  • 29:31equal to the argument.
  • 29:33We'll see how these functions behave when we apply them to our x
  • 29:36variable in SAS.
  • 29:39I'm going to repeat the process of creating these new transformed
  • 29:43X variables by applying the remainder of the functions that we just
  • 29:46covered in the original data step where we entered the raw data.
  • 29:50I'm naming this dataset Example 1.
  • 29:52So after I run this data step SAS will overwrite
  • 29:56Example 1 that currently exists in the work library.
  • 30:00To keep our original example 1, we would need to assign
  • 30:03a different name to this dataset.
  • 30:06Let's run this code.
  • 30:09Checking the log we do see several messages telling us
  • 30:12when we are trying to perform a calculation that is not defined.
  • 30:16Each time SAS is asked to perform a calculation using
  • 30:20an invalid argument, it lets you know in the log.
  • 30:23SAS also notifies you when it's asked to perform calculations
  • 30:27using missing values and tells you that this action generated
  • 30:30a missing value. However, these are not program halting errors
  • 30:35and do not stop SAS from running the DATA step.
  • 30:38As we see at the very bottom, Example One was successfully created
  • 30:42and it contains nine observations and 10 variables.
  • 30:46Again, keep an eye on your log after each run because it can help
  • 30:49you identify problems with your data.
  • 30:54Here we see some common mainly statistical functions
  • 30:57that are applied to either a list of raw numbers entered as
  • 31:01arguments or to more than one variable in an existing dataset.
  • 31:05The first function listed here is the sum function.
  • 31:09I'll point out the difference between the SUM function and simply
  • 31:12using the addition arithmetic operation shortly.
  • 31:15Next we have mean, median, min max, number
  • 31:19of non missing values, number of missing values, standard
  • 31:23deviation, and variance.
  • 31:24The arguments are listed inside the parentheses separated by commas.
  • 31:29Note that the result of these functions is based on the non missing
  • 31:33values of the arguments.
  • 31:35Even if a subject has a missing value for one of the arguments
  • 31:38these functions will use the non missing values to perform the
  • 31:42calculation.
  • 31:43This is the difference between the basic arithmetic operations we
  • 31:47performed previously and the sum function, for example.
  • 31:50The SUM function will add all available data whereas if we
  • 31:54use the plus sign to define the sum if any argument of
  • 31:57the arithmetic operation is missing the sum itself will be missing.
  • 32:02Moving to SAS, we will apply these functions to our three total
  • 32:06cholesterol measurements taken over the three periods in this study.
  • 32:10I'm going to read in the original Framingham dataset and apply each
  • 32:14of these functions. The dataset Framingham new will
  • 32:17contain the results of the calculations.
  • 32:20It might be interesting to also create a few duplicate variables
  • 32:24that check the calculations performed by the functions.
  • 32:27For example let's compare a manual arithmetic calculation
  • 32:31of the sum of total cholesterol 1, 2, and 3 to
  • 32:35the sum calculated using the SUM function.
  • 32:38The end function reports the number of non missing observations
  • 32:42in the arguments for each, and Miss reports the number
  • 32:46of missing observations.
  • 32:48Since we have three total cholesterol variables the sum of
  • 32:52N and N Miss should equal three.
  • 32:55Let's also check if the mean cholesterol calculated using the
  • 32:58mean function equals the sum of all observations divided
  • 33:02by the number of observations used in the sum.
  • 33:05That is the number of non missing values in total cholesterol 1,
  • 33:092, and 3. Printing the first 10 observations in Framingham
  • 33:12new, we see that individuals tend not to have complete
  • 33:16cholesterol data in all three periods.
  • 33:19But the arithmetic and statistical functions apply to these variables
  • 33:22are not equal to missing.
  • 33:24In contrast, the manual SUM check Sum is missing if
  • 33:28any of the arguments in the manual some is missing.
  • 33:30Notice that these functions analyze the data within each subject.
  • 33:34We are not finding the means, say, over all subjects.
  • 33:39You can again use SAS as a calculator to manually enter a list
  • 33:42of values as arguments into the function.
  • 33:45Here, I'm computing the mean of the values 1 2 3
  • 33:494 5 and the standard deviation of those data values.
  • 33:53You can also nest functions.
  • 33:55For example suppose I want to compute the square root of the sum
  • 33:59of 1 2 3 4 5.
  • 34:01We can do that without creating an intermediate variable for
  • 34:04the sum. But you could also do it that way.
  • 34:07Again print the data set to view the results of the calculation.
  • 34:11As we would expect, the mean of that sequence is 3 and
  • 34:15the standard deviation of those five numbers is reported along
  • 34:18with the square root of the sum.
  • 34:22Lastly, we're going to talk about working with date variables.
  • 34:25In SAS, date variables are stored as numbers
  • 34:29and are displayed to us using different date formats.
  • 34:33We see here that our 2 date variables in Framingham,
  • 34:37intake and date of death, were imported as numeric variables.
  • 34:41The numeric value of the date is the number of days between
  • 34:45January 1st 1960 and the specified date,
  • 34:49that is day 0 in SAS is January 1st
  • 34:521960.
  • 34:54Dates before January 1st 1960 are represented
  • 34:57as negative numbers, and dates after as positive numbers.
  • 35:01The variables date and SAS date are the same.
  • 35:04The variable on the left has been formatted to display in a format
  • 35:08we're used to seeing, while the variable on the right is an
  • 35:11unformatted date showing you how SAS represents a date
  • 35:15numerically.
  • 35:17This numeric representation of dates allows us to easily
  • 35:20find the number of days between two dates.
  • 35:24You can use the arithmetic operation of subtraction to find the
  • 35:27number of days between an end date and a start date.
  • 35:31For example, to find the number of days between January 30th
  • 35:351948 and April 18th 1964, we see
  • 35:39that the end date is
  • 35:421569 days after January 1st 1960, and
  • 35:46the start date is 4354 days
  • 35:50before January 1st 1960.
  • 35:53To find the number of days in between we subtract the start date
  • 35:56from the end date, which gives us 5923
  • 36:00days. We can also convert those days to months
  • 36:04or years to give a more meaningful timescale.
  • 36:08In the Framingham data we have two dates, date of intake into
  • 36:11the study and date of death when a death was observed during study
  • 36:15follow up. To find the number of days between a participant's
  • 36:19date of death and their intake date, we subtract the start
  • 36:22date from the end date or intake from date of death.
  • 36:27The new variable we're defining is called days to death, and we
  • 36:30see that it is only defined for those individuals with non missing
  • 36:34values for both intake and date of death.
  • 36:37We can convert days to years by dividing the number of days
  • 36:41by 365.25.
  • 36:44With this many days a timescale of years is more appropriate.
  • 36:48Finally, since we also have the variable called age that tells
  • 36:52us the subject's age in years at intake, we can calculate
  • 36:56an approximate age at death by adding years to death to
  • 37:00the age variable.
  • 37:01Note this is an approximate age of death because we do not have the
  • 37:05individual's date of birth in these data.
  • 37:08If we had date of birth we could calculate age of death by subtracting
  • 37:12date of birth from date of death.
  • 37:15If we want to define a new date variable in our data set to use in
  • 37:18calculations, we can do so directly in a data step.
  • 37:23For example, if I know that follow up for long term endpoints ended
  • 37:26on December 31st 1972 in this study, I
  • 37:30can define a new variable and follow up equals December
  • 37:3431st 1972.
  • 37:37I will show you two ways of defining the date variable.
  • 37:40In the first method you enclose the date in quotes followed by
  • 37:43the letter D. It is important to specify the day,
  • 37:47followed by month, and then year.
  • 37:49You must also use the three letter abbreviation for the month.
  • 37:54The second method of defining a date variable uses the N
  • 37:58D Y function in SAS.
  • 38:00The arguments of the function are numerical month, followed by da,y
  • 38:04and then year with each argument separated by a comma.
  • 38:08Both methods produce identical results.
  • 38:11Another useful function is the today function in SAS.
  • 38:15This function will define a date variable using the current date.
  • 38:19There are no arguments of this function.
  • 38:21Because we're reading in the Framingham dataset you're using the SET
  • 38:25statement, these new date variables will be defined for every
  • 38:28subject in the dataset.
  • 38:30We will see three new columns at the end of our dataset but the value
  • 38:34of each date will be constant for all subjects.
  • 38:37This is necessary if we want to use these newly defined variables
  • 38:40to perform calculations involving existing date variables
  • 38:45such as date of intake because SAS evaluates expressions
  • 38:48involving variables in the data set for each observation.
  • 38:53All of these methods for defining date variables will produce
  • 38:56the date in the form of a SAS date value.
  • 38:59This is fine if you just want to use the date to perform calculations.
  • 39:03However, if you want to print the date variables it's helpful
  • 39:06to apply a date format to the variables so that you can
  • 39:10see the date presented in a way we are used to seeing.
  • 39:14For example, the way intake is displayed here.
  • 39:17Since the date variable intake was an existing variable
  • 39:21in the imported Excel workbook, SAS recognized that
  • 39:24this variable was a date and applied a date format to
  • 39:28the variable when reading in the data.
  • 39:30This is why it appears as a formatted date.
  • 39:34We see this when we look at PROC CONTENTS.
  • 39:36The format date 9 was applied to the variable intake.
  • 39:40Formats are applied to variables in a data step using a format
  • 39:44statement. After the word format specify the name of the
  • 39:47variable that you would like to format followed by the name of the
  • 39:50format. The format name must be followed by a period
  • 39:54or decimal point. We closed the format statement with a
  • 39:58semicolon.
  • 39:59You can apply the same format to more than one variable.
  • 40:03By including the format name after more than one variable name.
  • 40:07You can also apply formats to multiple variables in your dataset
  • 40:11using one format statement.
  • 40:13SAS has many predefined date formats to choose from,
  • 40:17even more than I have listed here.
  • 40:19We are applying a different date format to each of these 4 date
  • 40:23variables. Intake is formatted as date 9.
  • 40:279 refers to the length of the date variable under this format.
  • 40:31End follow up is formatted as MMDDyy.
  • 40:35The thing to notice about this format is that it uses a two
  • 40:38digit year and follow up version to use its
  • 40:42date format MMDDyy10.
  • 40:4610 again refers to the length of the resulting formatted date
  • 40:49including the forward slashes between month day and year.
  • 40:54Finally today date uses date format word date
  • 40:58which displays the date as we would in text form.
  • 41:01Formats do not change the value of the variable.
  • 41:04These dates are still thought of as numbers by SAS.
  • 41:08Formats only change the way values of the variable are displayed.
  • 41:13This concludes video 2.1 In this video, we reviewed
  • 41:17PROC import. We also covered how to perform calculations
  • 41:20and construct new variables using arithmetic operations and
  • 41:24using mathematical, arithmetic, and statistical functions.
  • 41:28Finally we discussed working with dates in SAS.
  • 41:32In video 2.2 We will discuss creating new variables
  • 41:36using comparison and logical operators.