March 2006 Technical Tip: Recoding data in SAS

Continuous data, such as income, is well suited to many forms of statistical analysis: it can be summarized with the five default statistics produced by PROC MEANS (count, mean, standard deviation, minimum, and maximum) or used as a variable in linear regression. But in some instances, such as in logistic regression, categorical data is preferred. This article illustrates a simple means by which continuous data can be recoded, or grouped, as categorical data.

In the example that follows, student test scores (continuous) are converted to both their letter grade and numeric grade equivalents (categorical). Thus the reader can see how to create categorical data that is either character or numeric.

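proc format;
  /* Score range to letter grade; "-<" excludes the upper bound */
  value NbrToLtr
    low-<60 = 'F'
    60 -<70 = 'D'
    70 -<80 = 'C'
    80 -<90 = 'B'
    90 -100 = 'A'
  ;
  /* Score range to grade points; the labels are still character */
  value NbrToNbr
    low-<60 = '0'
    60 -<70 = '1'
    70 -<80 = '2'
    80 -<90 = '3'
    90 -100 = '4'
  ;
run;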

data mydata;
  input student $ score;
  ltrGrade = put(score, NbrToLtr.);
  nbrGrade = input(put(score, NbrToNbr.), 1.0);
datalines;
713 71
421 92
701 55
125 92
896 63
626 81
402 79
263 80
;

proc print data=mydata;
run;

PROC FORMAT can be used to recode or group numeric or character data. This example uses numeric data only (the student's test score).

The PUT function returns a character value by applying a specified format to a variable. So if score has a value of 75, then put(score, NbrToLtr.) returns the character C, and the statement ltrGrade = put(score, NbrToLtr.); assigns the letter C to the variable ltrGrade.
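To see how the range boundaries behave, here is a minimal sketch that applies the NbrToLtr format to a handful of scores and writes the results to the log. The format definition is repeated so the step runs on its own; the test values are chosen for illustration and are not part of the original data.

```sas
proc format;
  value NbrToLtr
    low-<60 = 'F'
    60 -<70 = 'D'
    70 -<80 = 'C'
    80 -<90 = 'B'
    90 -100 = 'A'
  ;
run;

data _null_;
  /* Illustrative values placed on and around the range boundaries */
  do score = 59.9, 60, 75, 89.9, 90, 100;
    ltrGrade = put(score, NbrToLtr.);
    put score= ltrGrade=;   /* writes one line per score to the log */
  end;
run;

/* Log shows: 59.9 -> F, 60 -> D, 75 -> C, 89.9 -> B, 90 -> A, 100 -> A */
```

Because "-<" excludes the upper bound, a score of exactly 60 falls in the D range rather than the F range. Values that fall outside every range (a missing score, or one above 100) get no label from this format; adding an other = range is one way to catch them.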

But the labels defined in a VALUE statement in PROC FORMAT are always character, so the result of PUT is always character. So how do we recode or group data as a number? The INPUT function converts a character value to a numeric value using a numeric informat. For example, input("2", 1.0) converts the character "2" to the number 2. We can nest the PUT and INPUT functions, so if score has a value of 75 then the statement nbrGrade = input(put(score, NbrToNbr.), 1.0); assigns the number 2 to the variable nbrGrade.
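The nested call can be easier to follow when it is split into two steps. The following is a small sketch, not part of the original program; the variable names asChar, asNum, and doubled are made up for illustration.

```sas
proc format;
  value NbrToNbr
    low-<60 = '0'
    60 -<70 = '1'
    70 -<80 = '2'
    80 -<90 = '3'
    90 -100 = '4'
  ;
run;

data _null_;
  score   = 75;
  asChar  = put(score, NbrToNbr.);  /* step 1: character value '2' */
  asNum   = input(asChar, 1.0);     /* step 2: numeric value 2     */
  doubled = asNum * 2;              /* arithmetic works on asNum   */
  put asChar= asNum= doubled=;
run;

/* Log shows: asChar=2 asNum=2 doubled=4 */
```

The one-line form input(put(score, NbrToNbr.), 1.0) does the same thing without the intermediate variable.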

The output from PROC PRINT is as follows:

                            ltr      nbr
Obs    student    score    Grade    Grade

 1       713        71       C        2  
 2       421        92       A        4  
 3       701        55       F        0  
 4       125        92       A        4  
 5       896        63       D        1  
 6       626        81       B        3  
 7       402        79       C        2  
 8       263        80       B        3  

If the values you want to recode are character data, then the format name must begin with a dollar sign. The following example shows how to recode gender from the letter 'M' to the number 0 and from the letter 'F' to the number 1:

proc format;
  /* Character format: the name starts with $, and the labels are character */
  value $gender
    'M' = '0'
    'F' = '1'
  ;
run;

data genders;
  input gender $ @@;
  recoded = input(put(gender, $gender.), 1.0);
datalines;
M F M M F
;

proc print data=genders;
run;
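As an alternative to nesting PUT inside INPUT, PROC FORMAT can also define an informat with the INVALUE statement, which maps character input directly to a number. This is a different technique from the one shown above, offered only as a sketch; the names genderNum and genders2 are made up for illustration.

```sas
proc format;
  /* Numeric informat: no $ on the name, and the results are numbers */
  invalue genderNum
    'M' = 0
    'F' = 1
  ;
run;

data genders2;
  input gender $ @@;
  recoded = input(gender, genderNum.);  /* character in, number out */
datalines;
M F M M F
;

proc print data=genders2;
run;
```

With this approach a single INPUT call does the recoding, at the cost of maintaining an informat rather than a format.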

Reminder: do not put a trailing period on the format name when defining it in PROC FORMAT, but always include the trailing period when using the format name in a PUT or INPUT function!

We hope you will consider Caliber Data Training when you are in need of a SAS training provider.


Written by Bill Qualls. Copyright © 2006 by Caliber Data Training 800.938.1222