View Full Version : Random Letter by distribution

napierzaza

Oct 4, 2009, 10:07 PM

I am trying to figure out how to pick a random letter from the alphabet. But I don't want to pick random letters, I want to pick them according to their frequency in the the english language.

Does anyone have a lead on this? I have the values for each letter that I have to play with to remove the statistics from a "space".

http://www.data-compression.com/english.html

How can you implement this?

I already can switch between a number to a letter (0=a,1=b...) so that might make it easier.

lee1210

Oct 4, 2009, 10:36 PM

A way:

Take the table from that website. Create an array of characters that's 1000 or 10000 elements long. Based on the statistical likelihood of a letter's occurrence, enter it in the array this many times. If you go to 1000, multiply the values on that page by 100, and drop the decimal. You'll need to round up to one for z. For 10000, multiply by 1000. That page includes space, though, so you'd need to as well. You're going to have to round a bit, so you might need to fill the last few fields with the top most occurring letters (e and something else). Once you have this array, it would be best to shuffle it. How to do this will vary by language. Then get a random number, take it's value modulo 1000 or 10000, and grab that element of your array.

This probably isn't the best way. It's pseudorandom, imprecise, etc. But it's a way. You could probably build a range of reals from 0-1 sectioned off by probability, then get a random value divided by the range of that random, then look up which letter's range that value falls into.

Others may be able to provide more statistically rigorous ideas.

-Lee

Doctor Q

Oct 5, 2009, 12:19 AM

I have a method to suggest.

I took the table of letter frequencies from that site. I recomputed the frequencies eliminating the space character so it's just the letters A through Z (26 choices instead of 27). I did this by dividing each value by 0.8081818, which is the sum of the frequencies excluding the space, i.e., 1 - 0.1918182. Then I computed a running sum (cumulative frequencies).

This array of 26 floating-point values is the result:

{ 0.0806425, 0.0960163, 0.1229086, 0.1661953, 0.2950576, 0.3195423, 0.3391679, 0.4001551, 0.4692106, 0.4703283, 0.4765805, 0.5175973, 0.5426070, 0.6124568, 0.6862399, 0.7032713, 0.7043362, 0.7659019, 0.8297193, 0.9199659, 0.9478228, 0.9580807, 0.9792730, 0.9809672, 0.9990304, 1.0000000 }

Here's how you can use it to solve your problem:

1. Pick a random number n between 0 and 1, so 0 <= n < 1.

2. Scan the array for the first array value greater than n, let's say at index i.

3. If i is 0, use A. If i is 1, use B. ... If i is 25, use Z.

The letters will be picked at frequencies matching English, based on the website's data.

Please let me know if you use this technique and if it works for you. :)

TwinCities Dan

Oct 5, 2009, 12:23 AM

RSTLN and a vowel of E... ;)

Big money, big money! :D

napierzaza

Oct 5, 2009, 06:22 AM

lee1210:

That would work, but as you say it's pretty dirty. You're rounding as well as making a huge array that you can do without.

Doctor Q:

This is an interesting idea. But it seems as if this would be affected by the position in the array that the number is. If you randomly choose 0.07 you'll never get past the first array value. In fact anything below 0.8 won't Either.

If I reordered it it might work, but you depend on the distribution being even between all the letters. It's not a random based on the actual statistical probability, but a random based on the gap between the numbers. A letter that had more space above its neighbour would get selected more often because the pocket below would catch more random numbers.

In this setup I couldn't even re-order the list, unless I make a second array that's used to lookup the table and reorder the results.

rossipoo

Oct 5, 2009, 11:18 AM

If you want to be sure the distribution is even, like in scrabble for example, you need to make a "bag" system.

You start by creating an array with all the possible letters acording to their distribution. Put 10 Es into the array but only 1 X, for example.

Then move the elements randomly to a second "bag" array. Pick an index randomly, remove the element at that index from the first array and add it to the end of the bag array. Repeat till all the letters are gone from first array.

Now you have an array with the next 100 or so letters to use in your game. When you use up all the letters, you regenerate the "bag".

autorelease

Oct 5, 2009, 12:50 PM

Doctor Q's idea should work if you use a random number generator that provides uniformly distributed numbers between 0 and 1. It's basically like taking a dart board, dividing it into 26 regions of appropriate sizes, and throwing a dart.

I'm not sure how uniformly distributed the built-in random number generator is, so you might want to use something more "random," like Mersenne Twister.

Using a "bag" has additional memory overhead, only allows you to generate a small number of random samples, and has lower precision, since you can only put a limited number of items in the bag.

Doctor Q

Oct 5, 2009, 03:49 PM

Doctor Q:

This is an interesting idea. But it seems as if this would be affected by the position in the array that the number is. If you randomly choose 0.07 you'll never get past the first array value. In fact anything below 0.8 won't Either.

You are correct that if the random number is 0.07 you'll choose A. That's what you want. Based on the website's data, you want 8.06425% of all choices to be letter A. If you pick a random number between 0 and 1 (uniformly), it will be .0806425 or less 8.06425% of the time so you'll pick A 8.06425% of the time.

Random numbers between 0.0806425 and 0.0960163 get you a B, which mean it happens 0.0960163 - 0.0806425 = 1.53738% of the time, which is exactly what you want for letter B. And so on. (Note that you said 0.8 when you meant 0.08.)

If I reordered it it might work, but you depend on the distribution being even between all the letters. It's not a random based on the actual statistical probability, but a random based on the gap between the numbers. A letter that had more space above its neighbour would get selected more often because the pocket below would catch more random numbers.

My suggested method does not depend on the distribution being even between letters. The gaps vary purposely to match the frequencies you want. Letters (like vowels and letter T) have bigger gaps because they should be picked more often. My method gives you exactly the letter distribution of the data from the website. I think that's what you mean by "actual statistical probability".

In this setup I couldn't even re-order the list, unless I make a second array that's used to lookup the table and reorder the results.If you reorder my array then you'll get the wrong frequencies. If you use the array and algorithm as described you can certainly "shuffle" the letters you pick (after you pick them) and they will still have frequencies matching English, but since they are each picked randomly they are "already shuffled" and changing the order won't serve any purpose that I can think of.

Unless I misunderstood your original post, what I proposed is exactly what you asked for.

I tried the algorithm myself on 100000 trials and got these results:

12983 E

8914 T

8041 A

7455 O

7116 N

6837 I

6441 S

6057 R

6053 H

4287 D

4061 L

2825 C

2814 U

2484 M

2412 F

2091 W

1980 G

1771 Y

1683 P

1572 B

1007 V

646 K

171 X

116 J

108 Q

75 Z

napierzaza

Oct 6, 2009, 02:51 PM

I'll briefly explain what I ended up doing, I hadn't read Dr. Qs idea before I did mine, or the bad idea either.

Essentially, you don't need a 1000 item array. You need an array for 26 letters. Each array item doesn't contain the letter, but its probability out of 100.

Then you pick a random number out of 100.

You pick through the array from the beginning seeing if the item is larger than the picked number. This only works if you are summing the past items together so it climbs up the array's values.

Once the random number is overtaken by the arraySum you know that it is inside the particular range.

So it will be appropriate in regards to picking, it won't be like a bag though and I have seen 2 Qs go by.

lee1210

Oct 6, 2009, 03:03 PM

Your idea is like Doctor Q's, but just a bit worse. If you're only picking from 1 to 100, something needs to have at least a 1% change to get picked, or you have to do some severe rounding. Z, for example, only has a ~.7% chance of occurring. The issues with rounding in other suggestions is made much worse by only allowing for 100 different possibilities. This would be like using the bag idea, but only having 100 positions instead of 1000 or 10000.

-Lee

napierzaza

Oct 6, 2009, 07:05 PM

I had it more "accurate" before, but I then used the scrabble assignment, which is 100 tiles in a distribution which is similar to the more "accurate" one. I put accurate in quotes because really, it does not need to be so close to the letter distribution in the Bible or the Origin of the Species which is what the wikipedia information is based on.

But yes, I previously had it using the floats, and picking within 1000.