
R-statistical: Changing categorical variables from text to numbers

Discussion in 'Mac Programming' started by Erniecranks, May 4, 2011.

  1. macrumors newbie

    #1
    Is there an easy way in R to change the variables in a column, say 'blue' 'red' and 'green' to 1, 2, and 3?

    Thanks,

    ernie
     
  2. macrumors 6502a

    #2
    Assuming your categorical variables are factors, you can use them as integers directly, or call as.integer. When creating the factor, you can set the order of the codes with the levels argument.

    Code:
    > clrs <- factor(c("red", "green", "blue"))
    > clrs
    [1] red   green blue 
    Levels: blue green red
    > as.integer(clrs)
    [1] 3 2 1
    > clrs <- factor(c("red", "green", "blue"), levels=c("red", "green", "blue"))
    > clrs
    [1] red   green blue 
    Levels: red green blue
    > as.integer(clrs)
    [1] 1 2 3
    > r <- rnorm(10)
    > r
     [1]  1.8639513  0.4930384  1.8170273  1.7606512  0.1418951  1.0160500
     [7] -2.1571495 -1.0363607 -0.4395486 -0.6859069
    > r[clrs]
    [1] 1.8639513 0.4930384 1.8170273
    
     
  3. macrumors 6502a

    #3
    Alternative that works for both factors and character vectors:

    Code:
    > a <- data.frame(cbind(sample(c("red","green","blue"),10,TRUE),matrix(rnorm(30),10,3,TRUE)))
    > a
          X1                 X2                 X3                  X4
    1   blue  -2.65160520677154   1.13671997203813    1.59844807462027
    2    red   1.63603301299993  -1.44809803772613  -0.372299702576141
    3    red   -1.0520070389011   1.17005686224478   0.747703941203762
    4    red -0.577843522326433  0.157226421406988  0.0672999761529491
    5  green   1.04109600264608  0.103028340501787    2.76900952476021
    6   blue -0.811299237328568   1.42245069258426    2.09960604012682
    7   blue   1.92844116562255  0.371461524110289   -0.69285790935153
    8    red  -1.38648123984735 -0.377298779883762  0.0943156014716379
    9  green  0.579690002553588  0.172524006604432  -0.568180791202796
    10 green  0.198367546634958 -0.848545701166513 -0.0666525679750112
    
    > # as.character() so switch() matches by name even if X1 is a factor
    > a[,1] <- sapply(as.character(a[,1]),switch,"blue"=1,"red"=2,"green"=3)
    > a
       X1                 X2                 X3                  X4
    1   1  -2.65160520677154   1.13671997203813    1.59844807462027
    2   2   1.63603301299993  -1.44809803772613  -0.372299702576141
    3   2   -1.0520070389011   1.17005686224478   0.747703941203762
    4   2 -0.577843522326433  0.157226421406988  0.0672999761529491
    5   3   1.04109600264608  0.103028340501787    2.76900952476021
    6   1 -0.811299237328568   1.42245069258426    2.09960604012682
    7   1   1.92844116562255  0.371461524110289   -0.69285790935153
    8   2  -1.38648123984735 -0.377298779883762  0.0943156014716379
    9   3  0.579690002553588  0.172524006604432  -0.568180791202796
    10  3  0.198367546634958 -0.848545701166513 -0.0666525679750112
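
A vectorised variant of the same idea, as a rough sketch: look the codes up in a named vector instead of calling switch once per element. The data and the names x, codes, x.num and f.num are made up for illustration.

```r
# Hypothetical example data: a character vector of colour names
x <- c("blue", "red", "red", "green", "blue")

# Named lookup vector: names are the categories, values are the codes
codes <- c(blue = 1, red = 2, green = 3)

# Index by name; as.character() makes this safe for factors too
x.num <- unname(codes[as.character(x)])
x.num

# The same lookup applied to a factor gives the same result
f.num <- unname(codes[as.character(factor(x))])
identical(x.num, f.num)
```

Without as.character(), indexing with a factor would use its integer codes (alphabetical level order) rather than the names, so the mapping could come out differently from the one you wrote down.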
    
     
  4. macrumors 65816

    #4
    Above are two good options. My question is why you want to do it. Depending on that, it may be unnecessary or there may be a better way of doing what you want.
     
  5. macrumors newbie

    #5
    Why do I want to do it? Because....

    Isn't it easier to work with single numerals than character strings? I don't really know the answer in R. I've used Minitab, and 1 is a lot easier than typing 'left lateral recumbency', or even LLR.
     
  6. AlmostThere, May 8, 2011
    Last edited: May 9, 2011

    macrumors 6502a

    #6
    I would go the other way: "left lateral recumbency" or LLR carries meaning in the problem domain and 1 doesn't. Then again, whatever works for you. Provided you are still happy with '1' when you come to review your work, go for it.

    This mapping of nominal / categorical or ordinal values to numbers is basically what factors in R are for. I edit scripts in a text editor, so copying and pasting long names (or using autocompletion) is less of an issue than when working interactively.
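
To make that concrete, here is a small sketch of a factor carrying both the readable label and the integer code. Only "LLR" comes from this thread; the other labels and all variable names are invented for the example.

```r
# Hypothetical posture labels; "RLR" and "sternal" are made up for illustration
posture <- factor(c("LLR", "RLR", "LLR", "sternal"),
                  levels = c("LLR", "RLR", "sternal"))

# Integer codes follow the order given in levels
codes <- as.integer(posture)
codes

# The labels can always be recovered from the codes
labels.back <- levels(posture)[codes]
labels.back

# Summaries keep the readable labels
table(posture)
```

So you can get at the numbers whenever you need them without throwing the labels away.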

    If your data is in a data.frame, the text columns have probably been read in as factors already (stringsAsFactors = TRUE, which was the default before R 4.0.0). If you want to specify the levels, see above (more details here: http://www.statmethods.net/input/valuelabels.html).

    I think the easiest thing for you to do is to add another column to your data.frame using as.integer, e.g.:
    Code:
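    # read text columns as factors (no longer the default since R 4.0.0)
    obs <- read.delim('/your/data/observations.tsv', stringsAsFactors = TRUE)
    # store the integer code of each factor level in a new column
    obs$posture.int <- as.integer(obs$posture.factor)
    # readings for the second level of posture.factor
    obs$reading[obs$posture.int == 2]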
    
    but Hansr's solution will work too and is probably less typing if you have character data.
     
  7. macrumors 65816

    #7
    As AlmostThere suggests, you want your code to be as readable as possible. Using "red," "green," and "blue" as level names may be very slightly more typing, but it's a lot clearer what you're doing and what your model is than if you used "1", "2", and "3."

    Furthermore, R will use character variables as factors (categorical/class variables) by default. If you change them to integers, you'll have to remember to tell R to use them as factors rather than numeric variables.
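
A quick sketch of what that means in a model, using lm on made-up data. The data frame d, its columns and the fit names are all hypothetical.

```r
set.seed(1)  # arbitrary seed so the made-up data are reproducible

# Hypothetical data: a colour column stored as text, plus a numeric response
d <- data.frame(colour = rep(c("red", "green", "blue"), each = 4),
                y = rnorm(12),
                stringsAsFactors = FALSE)
d$colour.int <- as.integer(factor(d$colour))

fit.chr <- lm(y ~ colour, data = d)              # text: treated as a factor
fit.int <- lm(y ~ colour.int, data = d)          # integers: a single slope
fit.fac <- lm(y ~ factor(colour.int), data = d)  # integers wrapped in factor()

# Number of coefficients estimated by each model
n.chr <- length(coef(fit.chr))
n.int <- length(coef(fit.int))
n.fac <- length(coef(fit.fac))
c(text = n.chr, integer = n.int, factor = n.fac)
```

The text and factor() versions estimate an intercept plus one coefficient per extra colour, while the plain integer version fits a straight line through the codes 1, 2, 3, which is rarely what you want for colours.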

    I'd suggest you just keep them as character variables and not change them to numeric.
     
