R-statistical: Changing categorical variables from text to numbers




Erniecranks
May 4, 2011, 09:40 PM
Is there an easy way in R to change the variables in a column, say 'blue' 'red' and 'green' to 1, 2, and 3?

Thanks,

ernie



AlmostThere
May 5, 2011, 02:39 AM
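Assuming your categorical variables are factors, you can use them as integers directly or call as.integer. The integer codes follow the order of the levels, which is alphabetical by default; you can set the order yourself with the levels argument when creating the factor.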


> clrs <- factor(c("red", "green", "blue"))
> clrs
[1] red green blue
Levels: blue green red
> as.integer(clrs)
[1] 3 2 1
> clrs <- factor(c("red", "green", "blue"), levels=c("red", "green", "blue"))
> clrs
[1] red green blue
Levels: red green blue
> as.integer(clrs)
[1] 1 2 3
> r <- rnorm(10)
> r
[1] 1.8639513 0.4930384 1.8170273 1.7606512 0.1418951 1.0160500
[7] -2.1571495 -1.0363607 -0.4395486 -0.6859069
> r[clrs]
[1] 1.8639513 0.4930384 1.8170273
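
The round trip between labels and codes can be sketched as follows. This is only an illustration: the `posture` vector and its values are made up, and the level order is an arbitrary choice.

```r
# Hypothetical data, for illustration only
posture <- c("LLR", "sternal", "RLR", "LLR")

# Fix the level order explicitly so the codes are predictable
f <- factor(posture, levels = c("LLR", "RLR", "sternal"))

codes <- as.integer(f)       # integer code of each observation
labels <- levels(f)[codes]   # map the codes back to their labels

print(codes)
print(labels)
```

Because the labels stay attached to the factor, nothing is lost by converting to integers later, only when the numbers are actually needed.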

Hansr
May 5, 2011, 04:44 AM
Alternative that works for both factors and character vectors:

> a <- data.frame(cbind(sample(c("red","green","blue"),10,T),matrix(rnorm(30),10,3,T)))
> a
X1 X2 X3 X4
1 blue -2.65160520677154 1.13671997203813 1.59844807462027
2 red 1.63603301299993 -1.44809803772613 -0.372299702576141
3 red -1.0520070389011 1.17005686224478 0.747703941203762
4 red -0.577843522326433 0.157226421406988 0.0672999761529491
5 green 1.04109600264608 0.103028340501787 2.76900952476021
6 blue -0.811299237328568 1.42245069258426 2.09960604012682
7 blue 1.92844116562255 0.371461524110289 -0.69285790935153
8 red -1.38648123984735 -0.377298779883762 0.0943156014716379
9 green 0.579690002553588 0.172524006604432 -0.568180791202796
10 green 0.198367546634958 -0.848545701166513 -0.0666525679750112

> a[,1] <- sapply(as.character(a[,1]),switch,"blue"=1,"red"=2,"green"=3,USE.NAMES=FALSE)
> a
X1 X2 X3 X4
1 1 -2.65160520677154 1.13671997203813 1.59844807462027
2 2 1.63603301299993 -1.44809803772613 -0.372299702576141
3 2 -1.0520070389011 1.17005686224478 0.747703941203762
4 2 -0.577843522326433 0.157226421406988 0.0672999761529491
5 3 1.04109600264608 0.103028340501787 2.76900952476021
6 1 -0.811299237328568 1.42245069258426 2.09960604012682
7 1 1.92844116562255 0.371461524110289 -0.69285790935153
8 2 -1.38648123984735 -0.377298779883762 0.0943156014716379
9 3 0.579690002553588 0.172524006604432 -0.568180791202796
10 3 0.198367546634958 -0.848545701166513 -0.0666525679750112
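
A possible variation on the same idea is a named lookup vector, or `match()`. This is a sketch rather than a replacement for the approach above; the sample vectors `x_chr` and `x_fac` are invented for the example. The `as.character()` call matters because a factor used directly as an index is interpreted through its internal integer codes, not its labels.

```r
# The desired mapping, taken from the post above
codes <- c(blue = 1, red = 2, green = 3)

x_chr <- c("blue", "red", "red", "green")  # hypothetical character data
x_fac <- factor(x_chr)                     # the same data as a factor

# Index the lookup vector by name; as.character() guards against
# a factor being used by its internal integer codes instead
from_chr <- unname(codes[as.character(x_chr)])
from_fac <- unname(codes[as.character(x_fac)])

# match() returns the position of each value in the given vector
# and converts factors to character itself
from_match <- match(x_fac, names(codes))

print(from_chr)
print(from_fac)
print(from_match)
```

All three results should agree regardless of whether the column was stored as a factor or as character.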

kuwisdelu
May 6, 2011, 10:44 PM
Above are two good options. My question is why you want to do it. Depending on that, it may be unnecessary or there may be a better way of doing what you want.

Erniecranks
May 8, 2011, 01:14 PM
Isn't it easier to work with single numerals than character strings? I don't really know that answer in R. I've used minitab, and 1 is a lot easier than saying 'left lateral recumbency', or even LLR.

AlmostThere
May 8, 2011, 03:47 PM
I would go the other way: "left lateral recumbency" or LLR carries meaning in the problem domain, whereas 1 doesn't. Then again, whatever works for you. Provided you are still happy with '1' when you come to review your work, go for it.

This mapping of nominal / categorical or ordinal values is basically what factors in R are for. I edit scripts in a text editor, so copying and pasting long names (or using autocompletion) is less of an issue than it would be when working interactively.

If your data is in a data.frame, strings will probably have been converted to factors by default (the stringsAsFactors argument). If you want to specify the levels, see above (more details at http://www.statmethods.net/input/valuelabels.html).

I think the easiest thing for you to do is to add another column to your data.frame using as.integer, e.g.:

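# read the tab-separated observations
obs <- read.delim('/your/data/observations.tsv')
# as.factor() makes this work whether the column was read as a factor or as character
obs$posture.int <- as.integer(as.factor(obs$posture.factor))
# readings for the posture whose level has code 2
obs$reading[obs$posture.int == 2]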


Hansr's solution will work too and is probably less typing if you have character data.
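
For anyone who wants to try this without a data file, here is a small self-contained sketch. The data frame, its column names and its values are all hypothetical. Note that from R 4.0.0 onwards the default is `stringsAsFactors = FALSE`, so the argument is set explicitly here.

```r
# Hypothetical observations standing in for the file read above
obs <- data.frame(
  posture = c("LLR", "RLR", "LLR", "sternal"),
  reading = c(4.1, 3.7, 4.4, 5.0),
  stringsAsFactors = TRUE   # explicit, since the default is FALSE from R 4.0.0
)

levels(obs$posture)   # shows which label corresponds to which code

# Add the integer codes as a separate column, keeping the labels
obs$posture.int <- as.integer(obs$posture)

# Two equivalent ways of selecting the LLR readings
by_code  <- obs$reading[obs$posture.int == 1]
by_label <- obs$reading[obs$posture == "LLR"]

print(by_code)
print(by_label)
```

Keeping both columns means the short codes are available for quick subsetting while the labels remain there for reading the results later.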

kuwisdelu
May 8, 2011, 07:30 PM
Erniecranks wrote: "Isn't it easier to work with single numerals than character strings? I don't really know that answer in R. I've used minitab, and 1 is a lot easier than saying 'left lateral recumbency', or even LLR."

As AlmostThere suggests, you want your code to be as readable as possible. Using "red," "green," and "blue" as the values may be very slightly more typing, but it's a lot clearer what you're doing and what your model is than if you used "1", "2", and "3."

Furthermore, R will use character variables as factors (categorical/class variables) by default. If you change them to integers, you'll have to remember to tell R to use them as factors rather than as numeric variables.

I'd suggest you just keep them as character variables and not change them to numeric.
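
The difference is easy to see in a model fit. The following is a rough sketch on simulated data (the variable names, the seed and the response are all made up): a character predictor is expanded into dummy variables, while the same information stored as integers is fitted as a single numeric slope unless it is wrapped in `factor()`.

```r
set.seed(1)  # arbitrary seed so the simulated data is reproducible
colour <- rep(c("red", "green", "blue"), each = 5)
y <- rnorm(15)  # simulated response, for illustration only

code <- as.integer(factor(colour))

fit_chr <- lm(y ~ colour)        # character: treated as a categorical predictor
fit_int <- lm(y ~ code)          # integer: treated as one numeric slope
fit_fac <- lm(y ~ factor(code))  # integer wrapped in factor(): categorical again

length(coef(fit_chr))  # intercept plus two dummy variables
length(coef(fit_int))  # intercept plus one slope
length(coef(fit_fac))  # same structure as the character fit
```

The integer version silently assumes the three colours lie on an evenly spaced numeric scale, which is rarely what is meant for a nominal variable.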