SAS Video 2.1
July 09, 2019Review, Arithmetic Operations, Functions, Date Variables
Transcript
- 00:01Welcome to video 2.1.
- 00:03The first of our lab 2 videos in our SAS lab video series.
- 00:07My name is Maria Ciarleglio, and I'm from the Department of
- 00:10Biostatistics at the Yale School of Public Health.
- 00:14In this lab we'll cover how to work with variables in SAS.
- 00:17Specifically, how to create new variables from existing variables
- 00:21in a dataset.
- 00:26The topics covered in this lab will be presented over three videos.
- 00:30Video 2.1 will begin by reviewing proc import to read
- 00:34a data file into SAS.
- 00:36We learned how to use this technique in video 1.2.
- 00:40Then we will begin looking at a number of different ways of creating
- 00:43variables.
- 00:45We will cover arithmetic operations, such as addition, subtraction,
- 00:49multiplication, division, and exponentiation.
- 00:53Next, we'll look at some commonly used mathematical functions that
- 00:56can transform a variable.
- 00:58For example, the exponential function, square root, and logarithmic
- 01:01functions.
- 01:03We'll also discuss some arithmetic and statistical functions that
- 01:07allow us to sum and find the mean standard deviation,
- 01:10minimum or maximum value of several variables for each
- 01:14subject.
- 01:15At the end of video 2.1.
- 01:17We will discuss how to work with date variables.
- 01:20We will add a date variable to a SAS data set, calculate the length
- 01:24of time between two date variables, and discuss different ways
- 01:27of displaying or formatting date variables.
- 01:31In video 2.2, we'll continue to learn about creating variables
- 01:35using comparison operators such as less than,
- 01:39greater than, or equal to.
- 01:41These are commonly used in if-then statements.
- 01:44For example, if a certain condition is true, then a
- 01:47new variable will take on a certain value.
- 01:51And lastly, we'll discuss logical operators such as
- 01:54and, or, and not.
- 01:57Video 2.3 will show you how to save this new analysis
- 02:01dataset that you've created containing all of your new variables.
- 02:06We will again look at data from the Framingham Heart Study.
- 02:10The data file that we will use in this video is an Excel file called
- 02:14Framingham subset 2 which is available in the lab 2
- 02:17Resources folder on the course web page.
- 02:20This dataset is similar to Framingham subset that we used
- 02:24in videos 1.2 and 1.3, but some
- 02:27of the variables have been modified and I've also added a few new
- 02:31variables.
- 02:33There are 30 variables in the data set shown in columns and
- 02:384434 individual observations shown as rows.
- 02:42Notice that the number of rows in our Excel spreadsheet is
- 02:474435. This is because the first row of the spreadsheet contains
- 02:50the variable names, and the individual observations begin
- 02:54on row 2. This version of Framingham subset includes
- 02:58a variable I created describing participant race.
- 03:01And for those who died, I have generated a date of death
- 03:05in order to give us an opportunity to work with missing data, the 10th
- 03:08subject shown in row 11 has been modified to give
- 03:12him some missing data values.
- 03:14This will allow you to see how missing values in the data affect the
- 03:17creation of new variables.
- 03:19Additionally instead of only having information on baseline total
- 03:23cholesterol, we now have total cholesterol information collected
- 03:27during three examination periods.
- 03:30In this study, the three examination periods were approximately six
- 03:33years apart.
- 03:35Touching again on missing data.
- 03:36Notice that when an individual does not have cholesterol level
- 03:40measured at one of the follow up examinations, the cell is
- 03:43left blank. When SAS imports these data, SAS
- 03:47knows that a blank cell indicates a missing value.
- 03:50So if you're collecting data in an Excel spreadsheet simply leave
- 03:54the cell blank if a subject has a missing value.
- 03:57We will use the import procedure to read this Excel file
- 04:01into SAS which will create a SAS dataset.
- 04:04The SAS dataset will be given the name Framingham in SAS.
- 04:08The imported data set will have the same number of variables and
- 04:11observations as the original Excel spreadsheet.
- 04:15The first step to working with data and adding new variables to
- 04:19a data set is to get your data into SAS.
- 04:22We will use the import procedure to reading our external data file.
- 04:26This method was discussed in video 1.2.
- 04:30Here is what typical SAS code for this procedure looks
- 04:33like. Remember our trick of using the import wizard
- 04:37and saving the code generated by the wizard for use later.
- 04:42The generated code is what I have pasted here.
- 04:45The first line begins with proc import which is the SAS
- 04:49procedure name. Out = work.Framingham
- 04:53is where we can specify the library where the imported SAS
- 04:56data set will be saved, and we can also name the data set.
- 05:00Here we're saving this data set to the temporary work library
- 05:04that we discussed in video 1.3.
- 05:06Remember that the contents of the work library will disappear once
- 05:10we exit SAS. This means that if we exit our current SAS
- 05:14session, any data sets that we created in the work library
- 05:17will need to be re-imported, and any new variables that we created
- 05:21will need to be recreated once we launch SAS if we want
- 05:25to continue analysing these data.
- 05:28We can also create a permanent library using the lib name
- 05:31statement to save our imported data as a permanent SAS
- 05:35data set. In video 2.3, I will give you a few
- 05:39options for saving your modified data set containing all of
- 05:42your new variables, so that you can directly use that analysis data
- 05:46set later without having to repeat your data management
- 05:49steps.
- 05:51Note that if we had omitted the library name here and just
- 05:54written out equals Framingham, the Framingham dataset would
- 05:58have by default been saved to the work library anyway.
- 06:03The name we are giving the data set is the name Framingham.
- 06:06So when we refer to this data set in our code we can simply
- 06:10call it Framingham and SAS will know to look in the
- 06:13work library if no other library is specified.
- 06:18On the second line we see the word data file followed by the equal
- 06:21sign and the physical location of the file.
- 06:24In this case, the excel file that we're going to import into SAS.
- 06:28Notice that we put the path of our excel file in double quotes.
- 06:31You can also use single quotes but just be sure to be consistent
- 06:35on both ends.
- 06:37On the 3rd line of code, we have the acronym DBMS
- 06:40for database management systems.
- 06:42We need this line of code to tell SAS what type of file we are
- 06:46importing. Here we're importing an Excel file in this example.
- 06:50The word replace is necessary to overwrite
- 06:54any other existing files in SAS by the same name.
- 06:58Here is where we also put our first semicolon.
- 07:01The fourth line of code tells SAS which Excel worksheet to
- 07:05read in. Here we're specifying the name that appears
- 07:08on the tab of the specific worksheet of interest.
- 07:12In our excel file Framingham subset 2, our worksheet
- 07:16is named Framingham. So that is what we specify in the range
- 07:19statement. We also end this line with a semicolon.
- 07:23On the fifth line, get names equals yes tell SAS
- 07:27that there is a header row in the Excel file which contains the
- 07:30variable names. Line 6 through 9 specify how
- 07:34text, date, and time variables are imported.
- 07:38Finally we end the import procedure with the word run,
- 07:41followed by a semicolon.
- 07:44Now let's open SAS and run this procedure to
- 07:47import Framingham subset 2.
- 07:50I'm going to open SAS by directly double clicking on the existing
- 07:53program file for this lab called SAS code Video
- 07:572. This program file is available in the lab 2 Resources
- 08:01folder on the course web page so that you can follow along.
- 08:06In the program editor, I like to begin my code with some comments.
- 08:10I've also pasted the proc import code generated by the
- 08:14import wizard and inserted a comment about the procedure
- 08:18to document what I'm doing and to help me remember my thought process
- 08:21if I return to this code later.
- 08:24Most of the comments in this file are to help you better
- 08:28understand and locate the different pieces of code.
- 08:31But the use of comments is completely up to you.
- 08:34You can make them as detailed as you want or omit them altogether.
- 08:38You will likely have to modify the data file statement to point
- 08:41to the location where you have Framingham subset 2
- 08:45saved on your computer.
- 08:47I have the Framingham subset 2 excel files saved
- 08:50in the SAS video 2 folder on my USP flash drive
- 08:54which is mapped as drive G on my computer.
- 08:58I'm going to highlight and submit or run the import procedure.
- 09:02Remember you can run your highlighted code by either clicking on the
- 09:05running man button in the menu bar or by using the keyboard
- 09:09shortcuts F3 or F8.
- 09:12Let's check our log window.
- 09:13The lines in black are the lines of code we highlighted and submitted
- 09:17and the lines in blue tell us that our SAS data set worked up
- 09:21Framingham was successfully created.
- 09:23The log also reports the number of observations and the number
- 09:27of variables in the imported dataset.
- 09:29This should match the number in the original imported excel file.
- 09:34Next let's check the contents of the Framingham dataset.
- 09:37We will use the contents procedure to give us a list of the variables
- 09:41in our dataset.
- 09:43We'll also be able to see if the variable was imported as
- 09:46a numeric or character variable.
- 09:49I would like the variables to be listed in the order they are included
- 09:52in the file rather than in alphabetical order, so I specified
- 09:56the order equals var num option in the first line of
- 10:00PROC CONTENTS. I find this option useful because there tends to
- 10:03be a reason for the ordering of variables in a dataset.
- 10:07Although if I'm looking to see if the dataset contains a specific
- 10:10variable, then looking at the variable list in alphabetical order
- 10:14would be more helpful. Again highlight and run the code.
- 10:19Checking the log, I don't see any errors.
- 10:22But I do see a new window.
- 10:24The results viewer window.
- 10:26This is the window that contains our output.
- 10:29Here we see the data file Framingham is saved in the work
- 10:32library because under dataset name we see work.Framingham.
- 10:37This dataset has 30 variables and
- 10:414434 observations.
- 10:43I also see the intake and date of death variables were
- 10:46imported correctly as a date, and the sex and race
- 10:50variables were imported correctly as character variables.
- 10:54It's always a good idea to check that your data are being imported
- 10:58correctly.
- 11:00We will take the Framingham data set that we just imported and
- 11:03create several new variables.
- 11:05The new variables will appear as new columns at the end of
- 11:09the data set. This process creates a new SAS
- 11:12data file that contains both the original variables and
- 11:16the new variables.
- 11:18I call this new data set my analysis data set because
- 11:21oftentimes we are performing this data management and creating
- 11:25these new variables for the purpose of conducting data analysis
- 11:29later on. The data set that then contains the variables we
- 11:32need for a future analysis is my analysis data set.
- 11:36We have two options for saving this new data set.
- 11:39We can overwrite the existing Framingham data set by giving
- 11:43the new data set. the name Framingham, the same as the original.
- 11:47Alternatively, we can create a completely new dataset by
- 11:51assigning it a different name.
- 11:52In our example we will name our new dataset Framingham
- 11:56new. This approach allows us to keep the Framingham
- 12:00SAS data file in its original form.
- 12:03Our new dataset is a copy of Framingham plus any newly
- 12:07created or modified variables.
- 12:09Be careful when using option one because you will overwrite
- 12:13the version of Framingham that you originally imported.
- 12:16I don't mean that you will overwrite the excel file; you will
- 12:19overwrite the SAS data file that was created as a result of
- 12:22the import procedure.
- 12:24Once you overwrite a data set you cannot simply go back to the
- 12:28previous version. For example, if we overwrite Framingham
- 12:31and add 10 new variables, the dataset will now contain 40 variables.
- 12:36This means that we've lost the original version of the dataset with
- 12:3930 variables. To get the original back, we would have to rerun
- 12:43the import procedure.
- 12:44My preference is to always leave the original data file as
- 12:48is and create a second analysis dataset that contains any
- 12:51modifications changes or new variables.
- 12:55This way if I make a mistake in creating a new variable or if I
- 12:58accidentally overwrite an existing variable in the original dataset
- 13:02I can correct my code and use the original dataset to try
- 13:06again.
- 13:07We will create new variables using a data step.
- 13:11We used a data step in video 1.2 to read data
- 13:14into SAS using manual data entry.
- 13:17On line one, we begin the DATA step with the word data
- 13:21followed by the name of the data set that we want to create in this
- 13:24data step. Because I prefer option 2, I'm
- 13:28giving the modified data set a new name, Framingham new.
- 13:32Because I'm not specifying a library, this new data set will
- 13:36be saved in the work library.
- 13:38You could explicitly specify the work library as the location
- 13:42of Framingham new as we see here but it's not necessary.
- 13:46When you don't specify a library., SAS will default to
- 13:50the work library. We end the data statement with a
- 13:53semicolon.
- 13:55On the second line I begin the SET statement.
- 13:58This is where I identify the dataset that will be read
- 14:01into the data step.
- 14:03We want to read the original Framingham data set into
- 14:07the DATA step, create a copy called Framingham new,
- 14:10and make any of the modifications to the dataset specified
- 14:14within the body of the DATA step.
- 14:16This is where we will define new variables, usually by performing
- 14:20operations on existing variables.
- 14:23Again, we are reading the Framingham data set in from the work
- 14:27library. You could have also put the SET statement on the same
- 14:30line as the data statement.
- 14:33In SAS, it's not the line breaks that indicate the end
- 14:36of each statement, it's the semicolons.
- 14:39End the DATA step with the word run followed by a semicolon.
- 14:43Now we can discuss SAS operators and functions that are
- 14:47used in SAS programming statements to create new variables.
- 14:51We will begin by discussing arithmetic operators.
- 14:55The first arithmetic operation we will discuss is addition.
- 14:59We will use addition to add two or more variables together
- 15:03to make a new variable, or to add a constant
- 15:06to one or more variables.
- 15:08We type the plus symbol between the terms we are summing.
- 15:12We have five binary variables in our Framingham dataset
- 15:16that report the presence of prevalent disease at the time of entry
- 15:19into the study: prevalent coronary heart disease, angina
- 15:23pectoris, myocardial infarction stroke, and prevalent
- 15:27hypertension.
- 15:28The variable equals one, if the individual has the prevalent disease,
- 15:32and zero if the individual is free from the disease.
- 15:36We see the count and percentage of individuals with each prevalent
- 15:39condition reported in these frequency tables with prevalent
- 15:43hypertension being the most common prevalent condition in these
- 15:46individuals. We want to create a new variable in the dataset
- 15:50that counts the number of prevalent conditions that a participant
- 15:54has. Because the variables indicating presence or
- 15:57absence of disease are binary, if we sum these indicator
- 16:01variables, we will be counting the number of prevalent conditions
- 16:05for each individual.
- 16:06Here we are naming this new account variable prev cond,
- 16:10and this variable will be in Framingham new.
- 16:13Also notice that when it comes to variable names, SAS is
- 16:17not case sensitive.
- 16:18Although the prevalent condition variables are all capitalized
- 16:22in the dataset Framingham, I do not need to use all caps
- 16:25when referencing the variables.
- 16:28Just so it's clear I want you to see that these operations will be
- 16:31applied to each subject.
- 16:33The subjects are represented in the rows of the data set.
- 16:36Seven subjects are shown here.
- 16:38The last subject happens to be the 10th subject in the original
- 16:42dataset. This subject has several missing data values so that
- 16:45you can see how variable creation behaves in the presence of
- 16:49missing input values.
- 16:51When you follow along in print some of the new variables we will
- 16:54create using the code in the SAS program file, the obs
- 16:58equals ten option in PROC PRINT will print the first
- 17:0210 subjects in the dataset.
- 17:04The new variable we're creating sums the values of
- 17:08prevalent CHD, AP, MY, stroke, and hypertension
- 17:12for each subject, essentially counting the number of prevalent
- 17:16conditions each subject has.
- 17:18Prev Cond equals zero for the first subject shown because
- 17:22he has no prevalent conditions. One for the second subject,
- 17:26three for the third, two for the fourth, zero
- 17:29for the fifth, and one for the six.
- 17:32When it comes to subject seven we see that his value of
- 17:36Prev Cond is missing.
- 17:38This is because one of the variables involved in calculating
- 17:41the sum is missing.
- 17:43Let's move on to subtraction.
- 17:45Here we create a new variable called pulse pressure in Framingham
- 17:49new which is the difference between systolic and diastolic blood
- 17:53pressure. The next operation is multiplication which
- 17:57uses the asterisk symbol.
- 17:59We are creating a variable age months which is simply
- 18:02age expressed in months and not in years.
- 18:06We do this by multiplying the variable age in the Framingham
- 18:09dataset which is aging years by the number 12.
- 18:13The constant twelve will be multiplied by each individual's
- 18:17age to give us our new variable.
- 18:19Again age in months is missing when age itself is missing.
- 18:24Looking next at Division we can calculate mean arterial pressure
- 18:28the average pressure in a patient's arteries during one cardiac
- 18:31cycle. By adding systolic blood pressure to twice diastolic
- 18:35blood pressure and dividing that quantity by three.
- 18:39This example illustrates how you can use multiple arithmetic
- 18:42operations together when defining a variable.
- 18:45Here we are using addition, multiplication, and division.
- 18:49Also notice the use of parentheses.
- 18:52Here, we want to divide the quantity systolic plus
- 18:55twice diastolic by three.
- 18:58If we had admitted the parentheses, SAS would have only divided
- 19:02the quantity, two times diastolic by three.
- 19:06A second formula for map involves first calculating
- 19:10pulse pressure, which we have, and dividing that by three, then
- 19:13adding diastolic blood pressure.
- 19:15Even though we have not yet run the DATA step to formally create
- 19:19pulse pressure, you can still reference variables that you
- 19:22defined on an earlier line in that same data step.
- 19:27Finally, we apply exponentiation using two asterisks.
- 19:31We calculate age squared or age to the power 2.
- 19:36As a reminder if one of the arguments for an arithmetic
- 19:39operator is missing, the result is missing.
- 19:44I want to take a moment before we get into the next topic to show
- 19:47you how to use SAS as a calculator.
- 19:50SAS can be used to perform calculations such as simple arithmetic
- 19:54operations that we've seen or more complex calculations
- 19:58involving different built in functions which we'll discuss shortly.
- 20:02First let me show you how not to perform calculations
- 20:05in your SAS program.
- 20:07Suppose you wanted SAS to compute the sum of 1 in 1,
- 20:11so 1 plus 1.
- 20:12Typing 1 plus 1 in your program and running this line of
- 20:16code will not return an answer.
- 20:18In fact, you'll receive an error in your log.
- 20:21Maybe we need to assign the result to a variable.
- 20:24We want the variable called answer to equal the result of the
- 20:28calculation. Unfortunately, this also isn't the way to
- 20:31do it. You will again receive an error in your log.
- 20:35SAS is different from other object oriented programming languages.
- 20:39For the most part you can't have variables floating around.
- 20:42You need to work within data sets which means we
- 20:46need to perform our calculations within a data step.
- 20:50We do this even though we're not reading in a data file using
- 20:54the SET statement. We still need to work within a data
- 20:57step when using SAS as a calculator .This means
- 21:01that we need to print the data set that we create in
- 21:05our data step or navigate to the created data set
- 21:08in the Explorer tab in order to see the results of
- 21:12our calculations.
- 21:14The calculated value which is the output of the function
- 21:17or the operation, will need to be assigned a variable name,
- 21:21and that variable which holds the result will be the last value
- 21:25in your created data set.
- 21:26Or perhaps it will be the only value in your data set, as we'll see
- 21:29with this first example.
- 21:31As an example of how to use SAS as a calculator, let's use
- 21:35some of the arithmetic operations that we're already familiar with.
- 21:39I'm naming the data set that will contain the results of my
- 21:42calculations, calc.
- 21:44I want to compute an individual's body mass index, or
- 21:48BMI, using the formula, weight in kilograms divided
- 21:51by height in meters squared.
- 21:53The BMI I want to calculate is for an individual with a weight
- 21:57of 70 kilograms and a height equal to 1.6
- 22:01meters. The result of the calculation will be in a
- 22:05variable called BMI.
- 22:07After running the DATA step when I run the print procedure to print
- 22:11the data set called calc, I see a single row in a single
- 22:14variable, the variable that we created called BMI.
- 22:18The no OBS option in PROC PRINT suppresses SAS
- 22:22is customary observation number.
- 22:24Another way of performing this calculation is to define a variable
- 22:28for weight and a variable for height and then use those
- 22:32variables in the formula for BMI.
- 22:35We can compare this to the calculation that we performed on line 2
- 22:39where we plug the raw values directly into the formula.
- 22:42As you would expect, we get the same result.
- 22:45Notice here that the data set Calc 2 contains all
- 22:48of the variables that we created in this second data step.
- 22:53Next we'll discuss using mathematical functions in SAS.
- 22:57Functions can be applied to a single variable or to a set of variables
- 23:01to help us transform or analyze data.
- 23:04There are many built in functions in SAS, and we'll be looking at some
- 23:07of the most frequently used mathematical and statistical functions.
- 23:12Functions are applied in a data step in the same way we carried
- 23:16out arithmetic operations.
- 23:18Each function takes one or more arguments.
- 23:21These arguments can be existing variables in a data set.
- 23:24In this case you want to use the values of those variables to
- 23:28evaluate the function. SAS will take the input values, the
- 23:31arguments for each subject, apply the function, and create the
- 23:35new variable in Framingham new.
- 23:37For example, we have BMI in our Framingham dataset.
- 23:41If I want to compute the natural log of BMI for all subjects
- 23:44in my dataset, I would create a variable in Framingham new
- 23:48that applies the log function, which is natural log in SAS, to
- 23:51the existing variable BMI.
- 23:54The name I chose for this new variable is log BMI.
- 23:57Notice that the log function applied to a missing value, as we see
- 24:01here in the last subject, returns a missing value, as you would
- 24:04expect. A function that can take multiple arguments
- 24:08is the max function.
- 24:10Recall that we have total cholesterol collected over three study
- 24:14periods. The largest non missing value of total cholesterol
- 24:17for each subject would be the max of existing variables,
- 24:21Total cholesterol 1, 2, and 3.
- 24:24The arguments of the function are separated by commas.
- 24:27Again, notice that if all of the input values of the function are
- 24:31missing, as we see in the last subject, and the maximum of
- 24:34a set of missing values is also a missing value.
- 24:38In this data step we are producing new variables as a result
- 24:42of the calculations.
- 24:43The new variables will be appended as columns to the dataset
- 24:47created in the DATA step.
- 24:49Log BMI and Max cholesterol are the final columns
- 24:53of Framingham new.
- 24:56Here is a list of some commonly used mathematical functions that can
- 24:59be applied to a number.
- 25:01Here X is a real number.
- 25:03The first function we'll discuss is the exponential function.
- 25:07The exponential function E to the X calculates the value
- 25:10of E raised to the power of your argument where E is the base
- 25:14of the natural logarithm 2.718.
- 25:17Let's.switch to SAS to apply the exponential function to
- 25:21a variable.
- 25:23Here I used a DATA step to create a variable X
- 25:27in a data set that I call Example 1.
- 25:29I have nine observations here but one observation is missing.
- 25:33I'm including the missing X value to show you how the functions deal
- 25:37with missing values. Normally in a data step I enter each
- 25:40observation on a separate line after the word data lines.
- 25:44SAS will run a complete iteration of the DATA step to construct
- 25:48one observation using one raw line of data.
- 25:51The one line of data can contain one or more variables.
- 25:55The data step then repeats this process again and again until there
- 25:58are no raw data lines left to read.
- 26:01But you see here that I list all nine values of X on the same line.
- 26:05I am able to do this because I include the double trailing at sign
- 26:09in the input statement.
- 26:10Basically this allows me to create several observations from
- 26:14one line of data.
- 26:15I could also break the data across more than one line and SAS will
- 26:19continue to read the next observation in the order the values are
- 26:22entered. Let's run the data step and check the log for errors.
- 26:30SAS tells us that we have nine observations and one variable in
- 26:33our data set Example 1.
- 26:36We can either print the data set or navigate to it in the Explorer
- 26:40tab to view it as a table.
- 26:42Since it's a small data set, let's print it to our results viewer
- 26:45window.
- 26:47I see that all nine of my observations have been read in and are
- 26:51contained in the data setmExample 1.
- 26:54Notice that we see SAS's automatic observation numbering
- 26:57because I did not use the no OBS option in PROC PRINT.
- 27:01Let's go back to our code.
- 27:03We want to transform X by exponentiating.
- 27:06I can do this by writing a second data step that reads in the data
- 27:10set Example 1, and creates a new variable which is the
- 27:14exponentiated x value, and this would be a fine way to do it.
- 27:17I'm preserving the original data set Example 1 and creating
- 27:21a copy called example 1A which will contain e
- 27:25to the X.
- 27:26When we print our new data set
- 27:28Example 1A, We see it contains both X and E
- 27:31to the X. Notice that exponentiating a missing value
- 27:35results in a missing value.
- 27:37SAS also tells us that we try to perform an operation on a missing
- 27:41value in the log.
- 27:43I want to show you that you can also create new variables in the
- 27:46original data step where I'm reading in the data.
- 27:49I can define any new variables that I also want this dataset
- 27:52to contain right after the input statement and before the data
- 27:56line statement. For example I know that this dataset will contain
- 28:00the variable x because I specified this variable in the
- 28:03input statement.
- 28:05I want this data set to also contain e to the X.
- 28:08So I define a new variable which I name EXPX,
- 28:12equal to the exponential function applied to the variable x.
- 28:16This saves us from having to run an additional data step.
- 28:21Let's print the data set to see if it contains our two variables.
- 28:25And it does.
- 28:27The function for natural log or log base e is simply
- 28:31log in SAS.
- 28:33You might be used to using LN to refer to natural log
- 28:37but know that in SAS log is natural log.
- 28:39The function for log base ten is log ten
- 28:43both log and log ten must have positive arguments, otherwise
- 28:47the function will return a missing value and you will see a note about
- 28:50this in the SAS log.
- 28:53Next we see absolute value and square root.
- 28:56Square root of a negative number will return a missing value.
- 28:59In order to round a number to a certain number of decimal places, we
- 29:03use the round function.
- 29:05Specify the argument and then specify to how many decimals
- 29:08we'd like to round the number.
- 29:10In this example we would like to round X to the hundredth place.
- 29:15Int basically cuts off the decimal from a number and returns
- 29:19the integer only. This function does not round.
- 29:22Ceiling returns the smallest integer that is greater than
- 29:26or equal to the argument.
- 29:28Floor returns the largest integer that is less than or
- 29:31equal to the argument.
- 29:33We'll see how these functions behave when we apply them to our x
- 29:36variable in SAS.
- 29:39I'm going to repeat the process of creating these new transformed
- 29:43X variables by applying the remainder of the functions that we just
- 29:46covered in the original data step where we entered the raw data.
- 29:50I'm naming this dataset Example 1.
- 29:52So after I run this data step SAS will overwrite
- 29:56Example 1 that currently exists in the work library.
- 30:00To keep our original example 1, we would need to assign
- 30:03a different name to this dataset.
- 30:06Let's run this code.
- 30:09Checking the log we do see several messages telling us
- 30:12when we are trying to perform a calculation that is not defined.
- 30:16Each time SAS is asked to perform a calculation using
- 30:20an invalid argument, it lets you know in the log.
- 30:23SAS also notifies you when it's asked to perform calculations
- 30:27using missing values and tells you that this action generated
- 30:30a missing value. However, these are not program halting errors
- 30:35and do not stop SAS from running the DATA step.
- 30:38As we see at the very bottom, Example One was successfully created
- 30:42and it contains nine observations and 10 variables.
- 30:46Again, keep an eye on your log after each run because it can help
- 30:49you identify problems with your data.
- 30:54Here we see some common mainly statistical functions
- 30:57that are applied to either a list of raw numbers entered as
- 31:01arguments or to more than one variable in an existing dataset.
- 31:05The first function listed here is the sum function.
- 31:09I'll point out the difference between the SUM function and simply
- 31:12using the addition arithmetic operation shortly.
- 31:15Next we have mean, median, min max, number
- 31:19of non missing values, number of missing values, standard
- 31:23deviation, and variance.
- 31:24The arguments are listed inside the parentheses separated by commas.
- 31:29Note that the result of these functions is based on the non missing
- 31:33values of the arguments.
- 31:35Even if a subject has a missing value for one of the arguments
- 31:38these functions will use the non missing values to perform the
- 31:42calculation.
- 31:43This is the difference between the basic arithmetic operations we
- 31:47performed previously and the sum function, for example.
- 31:50The SUM function will add all available data whereas if we
- 31:54use the plus sign to define the sum if any argument of
- 31:57the arithmetic operation is missing the sum itself will be missing.
- 32:02Moving to SAS, we will apply these functions to our three total
- 32:06cholesterol measurements taken over the three periods in this study.
- 32:10I'm going to read in the original Framingham dataset and apply each
- 32:14of these functions. The dataset Framingham new will
- 32:17contain the results of the calculations.
- 32:20It might be interesting to also create a few duplicate variables
- 32:24that check the calculations performed by the functions.
- 32:27For example let's compare a manual arithmetic calculation
- 32:31of the sum of total cholesterol 1, 2, and 3 to
- 32:35the sum calculated using the SUM function.
- 32:38The end function reports the number of non missing observations
- 32:42in the arguments for each, and Miss reports the number
- 32:46of missing observations.
- 32:48Since we have three total cholesterol variables the sum of
- 32:52N and N Miss should equal three.
- 32:55Let's also check if the mean cholesterol calculated using the
- 32:58mean function equals the sum of all observations divided
- 33:02by the number of observations used in the sum.
- 33:05That is the number of non missing values in total cholesterol 1,
- 33:092, and 3. Printing the first 10 observations in Framingham
- 33:12new, we see that individuals tend not to have complete
- 33:16cholesterol data in all three periods.
- 33:19But the arithmetic and statistical functions apply to these variables
- 33:22are not equal to missing.
- 33:24In contrast, the manual SUM check Sum is missing if
- 33:28any of the arguments in the manual some is missing.
- 33:30Notice that these functions analyze the data within each subject.
- 33:34We are not finding the means, say, over all subjects.
- 33:39You can again use SAS as a calculator to manually enter a list
- 33:42of values as arguments into the function.
- 33:45Here, I'm computing the mean of the values 1 2 3
- 33:494 5 and the standard deviation of those data values.
- 33:53You can also nest functions.
- 33:55For example suppose I want to compute the square root of the sum
- 33:59of 1 2 3 4 5.
- 34:01We can do that without creating an intermediate variable for
- 34:04the sum. But you could also do it that way.
- 34:07Again print the data set to view the results of the calculation.
- 34:11As we would expect, the mean of that sequence is 3 and
- 34:15the standard deviation of those five numbers is reported along
- 34:18with the square root of the sum.
- 34:22Lastly, we're going to talk about working with date variables.
- 34:25In SAS, date variables are stored as numbers
- 34:29and are displayed to us using different date formats.
- 34:33We see here that our 2 date variables in Framingham,
- 34:37intake and date of death, were imported as numeric variables.
- 34:41The numeric value of the date is the number of days between
- 34:45January 1st 1960 and the specified date,
- 34:49that is day 0 in SAS is January 1st
- 34:521960.
- 34:54Dates before January 1st 1960 are represented
- 34:57as negative numbers, and dates after as positive numbers.
- 35:01The variables date and SAS date are the same.
- 35:04The variable on the left has been formatted to display in a format
- 35:08we're used to seeing, while the variable on the right is an
- 35:11unformatted date showing you how SAS represents a date
- 35:15numerically.
- 35:17This numeric representation of dates allows us to easily
- 35:20find the number of days between two dates.
- 35:24You can use the arithmetic operation of subtraction to find the
- 35:27number of days between an end date and a start date.
- 35:31For example, to find the number of days between January 30th
- 35:351948 and April 18th 1964, we see
- 35:39that the end date is
- 35:421569 days after January 1st 1960, and
- 35:46the start date is 4354 days
- 35:50before January 1st 1960.
- 35:53To find the number of days in between we subtract the start date
- 35:56from the end date, which gives us 5923
- 36:00days. We can also convert those days to months
- 36:04or years to give a more meaningful timescale.
- 36:08In the Framingham data we have two dates, date of intake into
- 36:11the study and date of death when a death was observed during study
- 36:15follow up. To find the number of days between a participant's
- 36:19date of death and their intake date, we subtract the start
- 36:22date from the end date or intake from date of death.
- 36:27The new variable we're defining is called days to death, and we
- 36:30see that it is only defined for those individuals with non missing
- 36:34values for both intake and date of death.
- 36:37We can convert days to years by dividing the number of days
- 36:41by 365.25.
- 36:44With this many days a timescale of years is more appropriate.
- 36:48Finally, since we also have the variable called age that tells
- 36:52us the subject's age in years at intake, we can calculate
- 36:56an approximate age at death by adding years to death to
- 37:00the age variable.
- 37:01Note this is an approximate age of death because we do not have the
- 37:05individual's date of birth in these data.
- 37:08If we had date of birth we could calculate age of death by subtracting
- 37:12date of birth from date of death.
- 37:15If we want to define a new date variable in our data set to use in
- 37:18calculations, we can do so directly in a data step.
- 37:23For example, if I know that follow up for long term endpoints ended
- 37:26on December 31st 1972 in this study, I
- 37:30can define a new variable and follow up equals December
- 37:3431st 1972.
- 37:37I will show you two ways of defining the date variable.
- 37:40In the first method you enclose the date in quotes followed by
- 37:43the letter D. It is important to specify the day,
- 37:47followed by month, and then year.
- 37:49You must also use the three letter abbreviation for the month.
- 37:54The second method of defining a date variable uses the N
- 37:58D Y function in SAS.
- 38:00The arguments of the function are numerical month, followed by da,y
- 38:04and then year with each argument separated by a comma.
- 38:08Both methods produce identical results.
- 38:11Another useful function is the today function in SAS.
- 38:15This function will define a date variable using the current date.
- 38:19There are no arguments of this function.
- 38:21Because we're reading in the Framingham dataset you're using the SET
- 38:25statement, these new date variables will be defined for every
- 38:28subject in the dataset.
- 38:30We will see three new columns at the end of our dataset but the value
- 38:34of each date will be constant for all subjects.
- 38:37This is necessary if we want to use these newly defined variables
- 38:40to perform calculations involving existing date variables
- 38:45such as date of intake because SAS evaluates expressions
- 38:48involving variables in the data set for each observation.
- 38:53All of these methods for defining date variables will produce
- 38:56the date in the form of a SAS date value.
- 38:59This is fine if you just want to use the date to perform calculations.
- 39:03However, if you want to print the date variables it's helpful
- 39:06to apply a date format to the variables so that you can
- 39:10see the date presented in a way we are used to seeing.
- 39:14For example, the way intake is displayed here.
- 39:17Since the date variable intake was an existing variable
- 39:21in the imported Excel workbook, SAS recognized that
- 39:24this variable was a date and applied a date format to
- 39:28the variable when reading in the data.
- 39:30This is why it appears as a formatted date.
- 39:34We see this when we look at PROC CONTENTS.
- 39:36The format date 9 was applied to the variable intake.
- 39:40Formats are applied to variables in a data step using a format
- 39:44statement. After the word format specify the name of the
- 39:47variable that you would like to format followed by the name of the
- 39:50format. The format name must be followed by a period
- 39:54or decimal point. We closed the format statement with a
- 39:58semicolon.
- 39:59You can apply the same format to more than one variable.
- 40:03By including the format name after more than one variable name.
- 40:07You can also apply formats to multiple variables in your dataset
- 40:11using one format statement.
- 40:13SAS has many predefined date formats to choose from,
- 40:17even more than I have listed here.
- 40:19We are applying a different date format to each of these 4 date
- 40:23variables. Intake is formatted as date 9.
- 40:279 refers to the length of the date variable under this format.
- 40:31End follow up is formatted as MMDDyy.
- 40:35The thing to notice about this format is that it uses a two
- 40:38digit year and follow up version to use its
- 40:42date format MMDDyy10.
- 40:4610 again refers to the length of the resulting formatted date
- 40:49including the forward slashes between month day and year.
- 40:54Finally today date uses date format word date
- 40:58which displays the date as we would in text form.
- 41:01Formats do not change the value of the variable.
- 41:04These dates are still thought of as numbers by SAS.
- 41:08Formats only change the way values of the variable are displayed.
- 41:13This concludes video 2.1 In this video, we reviewed
- 41:17PROC import. We also covered how to perform calculations
- 41:20and construct new variables using arithmetic operations and
- 41:24using mathematical, arithmetic, and statistical functions.
- 41:28Finally we discussed working with dates in SAS.
- 41:32In video 2.2 We will discuss creating new variables
- 41:36using comparison and logical operators.