# R-statistical: Changing categorical variables from text to numbers

Discussion in 'Mac Programming' started by Erniecranks, May 4, 2011.

1. ### Erniecranks macrumors newbie

Joined:
Apr 10, 2011
#1
Is there an easy way in R to change the variables in a column, say 'blue' 'red' and 'green' to 1, 2, and 3?

Thanks,

ernie

2. ### AlmostThere macrumors 6502a

#2
Assuming your categorical variables are factors, you can use them as integers directly, or call as.integer. When creating the factor, you can set the order of the codes with the levels argument.

Code:
```
> clrs <- factor(c("red", "green", "blue"))
> clrs
[1] red   green blue
Levels: blue green red
> as.integer(clrs)
[1] 3 2 1
> clrs <- factor(c("red", "green", "blue"), levels=c("red", "green", "blue"))
> clrs
[1] red   green blue
Levels: red green blue
> as.integer(clrs)
[1] 1 2 3
> r <- rnorm(10)
> r
[1]  1.8639513  0.4930384  1.8170273  1.7606512  0.1418951  1.0160500
[7] -2.1571495 -1.0363607 -0.4395486 -0.6859069
> r[clrs]
[1] 1.8639513 0.4930384 1.8170273
```

3. ### Hansr macrumors 6502a

Joined:
Apr 1, 2007
#3
Alternative that works for both factors and character vectors, as long as the column goes through as.character() first (switch() treats a factor as its integer codes rather than its labels):

Code:
```
> a <- data.frame(cbind(sample(c("red","green","blue"),10,TRUE),matrix(rnorm(30),10,3,TRUE)))
> a
      X1                 X2                 X3                  X4
1   blue  -2.65160520677154   1.13671997203813    1.59844807462027
2    red   1.63603301299993  -1.44809803772613  -0.372299702576141
3    red   -1.0520070389011   1.17005686224478   0.747703941203762
4    red -0.577843522326433  0.157226421406988  0.0672999761529491
5  green   1.04109600264608  0.103028340501787    2.76900952476021
6   blue -0.811299237328568   1.42245069258426    2.09960604012682
7   blue   1.92844116562255  0.371461524110289   -0.69285790935153
8    red  -1.38648123984735 -0.377298779883762  0.0943156014716379
9  green  0.579690002553588  0.172524006604432  -0.568180791202796
10 green  0.198367546634958 -0.848545701166513 -0.0666525679750112

> a[,1] <- sapply(as.character(a[,1]),switch,"blue"=1,"red"=2,"green"=3)
> # (printout below comes from a separate run, so the random values differ from the first table)
> a
   X1                 X2                 X3                 X4
1   1 -0.406057311060619   1.37221795975874   1.03588708890414
2   2  -1.14795852027568  -1.02997903738951 -0.371426930694828
3   3  0.586066884589126   1.10068549323689  0.414053801828515
4   3  -0.23266477205783 -0.127766174966108 0.0115180462499652
5   2  -1.42033488605275 0.0983241940109921   1.06460692207479
6   2 -0.377867851352621  -1.22987957019859  0.651746344101077
7   1 -0.951456500887181  0.260840314961966   2.04018777986721
8   2  0.758593153216336 0.0765212264963914   1.41236673762932
9   2 -0.917024506889731  -1.37698559321206 0.0197024018447221
10  3  0.343258023825715 -0.561586274559691   1.12637896095723
```
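A named lookup vector is another way to get the same mapping without switch. This is only a sketch: colours_chr, colours_fct and codes are made-up names for illustration, not something from the posts above. It indexes the lookup by label, so it should give the same result for a character vector and for a factor, provided the factor is passed through as.character() first.

```r
# Hypothetical data: the same colours as a character vector and as a factor.
colours_chr <- c("blue", "red", "red", "green", "blue")
colours_fct <- factor(colours_chr)

# Named lookup vector: names are the labels, values are the codes we want.
codes <- c(blue = 1, red = 2, green = 3)

# Index by label. as.character() matters for the factor: indexing with a
# factor directly would use its internal integer codes instead of its labels.
from_chr <- unname(codes[colours_chr])
from_fct <- unname(codes[as.character(colours_fct)])

print(from_chr)
print(from_fct)
```

The lookup vector keeps the whole mapping in one place, which makes it easy to check or change later.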

4. ### kuwisdelu macrumors 65816

Joined:
Jan 13, 2008
#4
Above are two good options. My question is why you want to do it. Depending on that, it may be unnecessary or there may be a better way of doing what you want.

5. ### Erniecranks thread starter macrumors newbie

#5
Why do I want to do it? Because....

Isn't it easier to work with single numerals than character strings? I don't really know the answer in R. I've used Minitab, and 1 is a lot easier than typing 'left lateral recumbency', or even LLR.

6. ### AlmostThere macrumors 6502a

May 8, 2011
Last edited: May 9, 2011
#6
I would go the other way: "left lateral recumbency" or LLR carries meaning in the problem domain and 1 doesn't. Then again, whatever works for you. Provided you are happy with '1' when you come to review your work, go for it.

This mapping of nominal / categorical or ordinal values is basically what factors in R are for. I edit scripts in a text editor, so copying and pasting long names (or using autocompletion) is less of an issue than it is when working interactively.

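If your data is in a data.frame, your strings have probably already been converted to factors through the stringsAsFactors option (since R 4.0.0 this is no longer the default, so pass stringsAsFactors = TRUE or call factor() yourself). If you want to specify the levels see above (or more details here http://www.statmethods.net/input/valuelabels.html).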

I think the easiest thing for you to do is to add another column to your data.frame using as.integer, i.e.
Code:
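```
# stringsAsFactors = TRUE so text columns are read in as factors (not the default since R 4.0.0)
obs <- read.delim('/your/data/observations.tsv', stringsAsFactors = TRUE)
obs$posture.int <- as.integer(obs$posture.factor)
```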
but Hansr's solution will work too and is probably less typing if you have character data.
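To make the as.integer approach concrete, here is a minimal sketch on a made-up data frame. The posture labels, the column names and posture_levels are assumptions for illustration only; your real file and columns will differ. It fixes the level order explicitly, adds the integer column next to the labelled one, and shows that the codes can be mapped back to labels.

```r
# Hypothetical observations; the column name and posture labels are made up.
obs <- data.frame(
  posture = c("LLR", "RLR", "sternal", "LLR", "sternal"),
  stringsAsFactors = FALSE
)

# Fix the level order explicitly so the codes do not depend on alphabetical order.
posture_levels <- c("LLR", "RLR", "sternal")
obs$posture.factor <- factor(obs$posture, levels = posture_levels)

# Integer codes follow the level order: LLR = 1, RLR = 2, sternal = 3.
obs$posture.int <- as.integer(obs$posture.factor)

# Codes can be turned back into labels through the level vector.
recovered <- posture_levels[obs$posture.int]

print(obs)
```

Keeping both columns means you can type the short codes when it is convenient and still read the labels when you review the data.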

7. ### kuwisdelu macrumors 65816

#7
As AlmostThere suggests, you want your code to be as readable as possible. Using "red", "green", and "blue" as values may be very slightly more typing, but it's a lot clearer what you're doing and what your model is than if you used "1", "2", and "3".

Furthermore, R will use character variables as factors (categorical/class variables) by default. If you change them to integers, you'll have to remember to tell R to use them as factors rather than numeric variables.

I'd suggest you just keep them as character variables and not change them to numeric.
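The point about integers being read as numeric variables can be seen in the design matrix R builds for a model. This is a small sketch with made-up codes (colour_int is a hypothetical name): as plain integers the variable gets a single slope column, while wrapping it in factor() gives one dummy column per non-reference level.

```r
# Hypothetical colour codes (1, 2, 3) stored as plain integers.
colour_int <- c(1L, 2L, 3L, 1L, 2L, 3L)

# As a numeric variable the model gets an intercept plus one slope column.
X_numeric <- model.matrix(~ colour_int)

# Wrapped in factor() it gets an intercept plus one dummy column per
# non-reference level (two dummy columns for three levels).
X_factor <- model.matrix(~ factor(colour_int))

print(ncol(X_numeric))
print(ncol(X_factor))
```

Forgetting the factor() call does not raise an error. It just fits a different model, one that assumes the codes are evenly spaced points on a numeric scale.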