R-statistical: Changing categorical variables from text to numbers

Erniecranks
May 4, 2011, 09:40 PM
Is there an easy way in R to change the variables in a column, say 'blue' 'red' and 'green' to 1, 2, and 3?

Thanks,

ernie

AlmostThere
May 5, 2011, 02:39 AM
Assuming your categorical variables are factors, you can use them as integers directly, or call as.integer. When creating the factor, you can set the order of the integer codes with the levels argument:

> clrs <- factor(c("red", "green", "blue"))
> clrs
[1] red green blue
Levels: blue green red
> as.integer(clrs)
[1] 3 2 1
> clrs <- factor(c("red", "green", "blue"), levels=c("red", "green", "blue"))
> clrs
[1] red green blue
Levels: red green blue
> as.integer(clrs)
[1] 1 2 3
> r <- rnorm(10)
> r
[1] 1.8639513 0.4930384 1.8170273 1.7606512 0.1418951 1.0160500
[7] -2.1571495 -1.0363607 -0.4395486 -0.6859069
> r[clrs]
[1] 1.8639513 0.4930384 1.8170273

Hansr
May 5, 2011, 04:44 AM
Alternative that works for both factors and character vectors:

> a <- data.frame(cbind(sample(c("red","green","blue"),10,TRUE),matrix(rnorm(30),10,3,TRUE)))
> a
X1 X2 X3 X4
1 blue -2.65160520677154 1.13671997203813 1.59844807462027
2 red 1.63603301299993 -1.44809803772613 -0.372299702576141
3 red -1.0520070389011 1.17005686224478 0.747703941203762
4 red -0.577843522326433 0.157226421406988 0.0672999761529491
5 green 1.04109600264608 0.103028340501787 2.76900952476021
6 blue -0.811299237328568 1.42245069258426 2.09960604012682
7 blue 1.92844116562255 0.371461524110289 -0.69285790935153
8 red -1.38648123984735 -0.377298779883762 0.0943156014716379
9 green 0.579690002553588 0.172524006604432 -0.568180791202796
10 green 0.198367546634958 -0.848545701166513 -0.0666525679750112

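> # as.character() so that switch matches on the text; a factor would be matched by its integer code
> a[,1] <- sapply(as.character(a[,1]),switch,"blue"=1,"red"=2,"green"=3)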
> a
X1 X2 X3 X4
1 1 -0.406057311060619 1.37221795975874 1.03588708890414
2 2 -1.14795852027568 -1.02997903738951 -0.371426930694828
3 3 0.586066884589126 1.10068549323689 0.414053801828515
4 3 -0.23266477205783 -0.127766174966108 0.0115180462499652
5 2 -1.42033488605275 0.0983241940109921 1.06460692207479
6 2 -0.377867851352621 -1.22987957019859 0.651746344101077
7 1 -0.951456500887181 0.260840314961966 2.04018777986721
8 2 0.758593153216336 0.0765212264963914 1.41236673762932
9 2 -0.917024506889731 -1.37698559321206 0.0197024018447221
10 3 0.343258023825715 -0.561586274559691 1.12637896095723
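
A vectorised variant of the same idea, as a rough sketch: a named lookup vector or match() does the mapping without calling switch once per element. The names colours and codes and the sample values are made up for illustration, not taken from the data above.

```r
# Hypothetical sample data; 'colours' and 'codes' are illustrative names.
colours <- c("blue", "red", "green", "red", "blue")

# Named lookup vector: the names are the text values, the elements are the codes.
codes <- c(blue = 1, red = 2, green = 3)

# as.character() lets the same lookup work for factors and character vectors.
by_lookup        <- unname(codes[as.character(colours)])
by_lookup_factor <- unname(codes[as.character(factor(colours))])

# match() returns the position of each value in the given vector of categories.
by_match <- match(colours, c("blue", "red", "green"))

print(by_lookup)
print(by_match)
```

Both give the same numbering here because the lookup codes follow the same order as the vector handed to match().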

kuwisdelu
May 6, 2011, 10:44 PM
Above are two good options. My question is why you want to do it. Depending on that, it may be unnecessary or there may be a better way of doing what you want.

Erniecranks
May 8, 2011, 01:14 PM
Isn't it easier to work with single numerals than character strings? I don't really know the answer for R. I've used Minitab, and there 1 is a lot easier than typing 'left lateral recumbency', or even LLR.

AlmostThere
May 8, 2011, 03:47 PM
I would go the other way: "left lateral recumbency" or LLR carries meaning in the problem domain, but 1 doesn't. Then again, whatever works for you. Provided you are still happy with '1' when you come to review your work, go for it.

This mapping of nominal / categorical or ordinal values is basically what factors in R are for. I edit scripts in a text editor, so copying and pasting long names (or using autocompletion) is less of an issue than it would be when working interactively.
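
To illustrate what a factor stores, here is a small sketch. Only "left lateral recumbency" comes from this thread; the other posture labels and the variable names are invented for the example.

```r
# Hypothetical observations; only "left lateral recumbency" comes from the thread.
posture <- factor(c("left lateral recumbency",
                    "sternal recumbency",
                    "left lateral recumbency",
                    "right lateral recumbency"))

levels(posture)       # the labels, sorted alphabetically by default
as.integer(posture)   # the integer codes stored underneath
table(posture)        # summaries are reported by label, not by code

# The codes can be turned back into labels through the levels.
codes       <- as.integer(posture)
labels_back <- levels(posture)[codes]
```

So the readable label and the number travel together, and you can pull out whichever one you need.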

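If your data is in a data.frame, the strings have probably been converted to factors already: stringsAsFactors = TRUE is the default (before R 4.0.0; from 4.0.0 onwards you have to ask for it). If you want to specify the levels see above (or more details here http://www.statmethods.net/input/valuelabels.html).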
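
A minimal sketch of that for a data.frame column, assuming a column called posture with made-up values ("RLR" and "sternal" are placeholders, not from your data):

```r
# Hypothetical data frame; the values other than "LLR" are placeholders.
obs <- data.frame(posture = c("sternal", "LLR", "RLR", "LLR"),
                  stringsAsFactors = TRUE)   # explicit, because the default changed in R 4.0.0

# By default the levels are sorted, so the codes follow alphabetical order.
default_codes <- as.integer(obs$posture)

# Rebuild the factor with the levels in the order the codes should follow.
obs$posture <- factor(obs$posture, levels = c("sternal", "LLR", "RLR"))
custom_codes <- as.integer(obs$posture)

print(default_codes)
print(custom_codes)
```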

I think the easiest thing for you to do is to add another column to your data.frame using as.integer
i.e.

obs$posture.int <- as.integer(obs$posture.factor)
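
Spelled out on a toy data.frame, this could look roughly like the following. The posture values and the key table are assumptions for illustration; only the column names obs$posture.factor and obs$posture.int are taken from the line above.

```r
# Toy data; the posture values are hypothetical.
obs <- data.frame(
  posture.factor = factor(c("LLR", "RLR", "LLR", "sternal"),
                          levels = c("LLR", "RLR", "sternal"))
)

# Keep the factor and add the integer codes as a second column.
obs$posture.int <- as.integer(obs$posture.factor)

# A small key so that the meaning of each code is not lost later.
key <- data.frame(code  = seq_along(levels(obs$posture.factor)),
                  label = levels(obs$posture.factor))

print(obs)
print(key)
```

Keeping both columns means you can type the short number while the label stays available for tables and plots.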