Running the Average and Std Dev on a data set - Newbie

tyrant

macrumors newbie
Original poster
May 1, 2007
13
0
IDon't have much programming experience. I have a data set that I was for each successive number in the set I want to the find the average (of pervious numbers and current number) and the standard deviation.

Example

Day 1 Average (Day 1), Standard Deviation (Day 1)
Day 2 Average (Day 1 - 2), Standard Deviation (Day 1 - 2)
Day 3 Average (Day 1 - 3), Standard Deviation (Day 1 - 3)
Day 4 Average (Day 1 - 4), Standard Deviation (Day 1 - 4)
...
Day n Average (Day 1 - n), Standard Deviation (Day 1 - n)

I have about 750 sets that I want to run this computation on and each set has about 6,050 data points (days).

I am assuming that each day in the set is going to add exponential number of calculations for the computer. Due to the high volume, excel would run out of RAM very quickly (I have 2 GB RAM). I assume that once programmed, the computations (just a guesstimation) would take a month or two of full time computing.

The only programming I have ever done was on the TI calculators in High School. Can someone point to a programming language that would be good for this type of program? I am working on this project over a couple of months and therefore don't mind learning a language that lends itself to this type of data computation.
 

ChrisA

macrumors G4
Jan 5, 2006
11,598
380
Redondo Beach, California
IDon't have much programming experience. I have a data set that I was for each successive number in the set I want to the find the average (of pervious numbers and current number) and the standard deviation.

Example

Day 1 Average (Day 1), Standard Deviation (Day 1)
Day 2 Average (Day 1 - 2), Standard Deviation (Day 1 - 2)
Day 3 Average (Day 1 - 3), Standard Deviation (Day 1 - 3)
Day 4 Average (Day 1 - 4), Standard Deviation (Day 1 - 4)
...
Day n Average (Day 1 - n), Standard Deviation (Day 1 - n)

I have about 750 sets that I want to run this computation on and each set has about 6,050 data points (days).

I am assuming that each day in the set is going to add exponential number of calculations for the computer. Due to the high volume, excel would run out of RAM very quickly (I have 2 GB RAM). I assume that once programmed, the computations (just a guesstimation) would take a month or two of full time computing.

The only programming I have ever done was on the TI calculators in High School. Can someone point to a programming language that would be good for this type of program? I am working on this project over a couple of months and therefore don't mind learning a language that lends itself to this type of data computation.
No, it's not exponential. It's linear. The time to compute will be limited by the speed of your disk drive, by how fast it can read the data. The computation time is nearly trivial

Hint: Keep three variables: N - the current number of samples, is incremented as you read the sample, SUM, the sum of samples read so far and SUM_SQUARED with is the sum of the squares. Once these are updated you can print to statics on the same line as the day's data then go to the nest day.

The entire process should take seconds not months on any Mac that can run Mac OS X.

Try doing a subset of the data by hand on paper to work out the steps.

Is this a homework problem? Sure sounds like it. I hope no one here does writes this for you. but I expect it to take about a dozen lines of code.
 

tyrant

macrumors newbie
Original poster
May 1, 2007
13
0
Is this a homework problem? Sure sounds like it. I hope no one here does writes this for you. but I expect it to take about a dozen lines of code.

Thank you for the response, to be clear; this is not a homework problem. I understand the ethical implications if it was.

One question, what language would you recommend for this?
 

chown33

Moderator
Staff member
Aug 9, 2009
8,380
4,367
Pumpkindale
One question, what language would you recommend for this?
Any language that can do basic arithmetic would be able to carry out the calculations.

Seriously, think about it a little and write down what needs to happen. Express it in plain English rather than trying to write a program right away. Don't take this as patronizing a noob: real software developers do this because it helps clarify the scope of the problem.

The running average (which I assume is the arithmetic mean) and standard deviation are your calculations. What's the calculation for mean? What's the calculation for std-dev?
http://en.wikipedia.org/wiki/Standard_deviation

What arithmetic operations are needed to carry out these calculations:
Addition, subtraction, multiplication, division, square root: that's it.

And you might want to explain why you thought the calculations would be exponential. Again, don't take that as patronizing a noob: explaining one's rationale and reasoning is an important part of real software development. It's unexplained and improperly reasoned kluges that cause the most trouble.


A bigger issue than calculations is likely to be reading your data sets into your program, and converting that data to suitable numbers that you can then add, multiply, etc.

So exactly what format are your data sets in? Excel files? Comma-separated text files? XML? Something else?

Not every language has the same ease of reading different data formats. In fact, if your data format isn't documented, then that's going to be a big problem right there. If it's documented but hard to decode, then that means writing a parser (or with luck, finding a free one already written).
 

tyrant

macrumors newbie
Original poster
May 1, 2007
13
0
Good points Chown33.

I have thought about the arithmetic operations, and that I feel comfortable with. Upon further consideration the computations are not exponential they are actually linier. Large, but linier. The following are the areas I am really clueless about:
1) How do I make the program pull the data in from a separate file?

2) How to I tell it to:

Day 1 Average = (Day 1)/1
Day 2 Average = (Day 1 +Day 2)/2
Day 3 Average = (Day 1 + Day 2 + Day 3)/3
Day n Average = (Day 1 + Day 2 + Day 3 … Day n)/n

I may not know the nomenclature, but I would call this successive history. This seems to be the biggest area that I have no clue about.

3) How to I make it save the output to the same file?
Data Average

Currently the data is in Excel. It could very easily be converted to comma separated vales or fixed with.
 

milbournosphere

macrumors 6502a
Mar 3, 2009
856
1
San Diego, CA
If the data is in Excel, why not use the built in functions that already exist for this? Excel has pretty decent formula functionality. I would look into their statistical functions, and use them to build what you want. I would recommend starting with a subset of 10 or so elements to get what you are looking for. I would think that this would not be so hard to build just using excel. That way, you could avoid the hassle of having to learn a programming language (unless that is part of your aim; programming is an excellent skill to have).
 

chown33

Moderator
Staff member
Aug 9, 2009
8,380
4,367
Pumpkindale
1) How do I make the program pull the data in from a separate file?
The exact procedure will depend on which language is used. This is called I/O, which stands for Input/Output, and is the means by which data is read into a program or written out from it. Every language has different I/O facilities, although the principles are generally the same.


2) How to I tell it to:

Day 1 Average = (Day 1)/1
Day 2 Average = (Day 1 +Day 2)/2
Day 3 Average = (Day 1 + Day 2 + Day 3)/3
Day n Average = (Day 1 + Day 2 + Day 3 … Day n)/n

I may not know the nomenclature, but I would call this successive history. This seems to be the biggest area that I have no clue about.
Break it down.
http://www.cocoadev.com/index.pl?BreakItDown

Look for patterns. For example, notice how Day 3's average consists of Day 2's numerator (summation) plus Day 3's value. Then notice how every other day exhibits the same pattern. The conclusion: you don't have to calculate Day 1 + Day 2 + Day 3 + Day n at each step. You can use the summation of previous days and only add one day's value, before dividing by the day-count. And adding each day's value sets things up for the following day.


3) How to I make it save the output to the same file?
Data Average
That's I/O again and it will depend on which language is used.


Currently the data is in Excel. It could very easily be converted to comma separated vales or fixed with.
"Fixed width"?

Comma-separated or tab-separate numbers are probably the easier form to process for most programming languages. Very few things can deal directly with Excel files.

Start with a limited data set, say only 5 days. Write down the specific steps, including the loop, the reading of one line, the calculations, and so on.

If you have the steps written down properly, without regard to language, it will be a lot simpler to figure out what to do in any particular language.
 

tyrant

macrumors newbie
Original poster
May 1, 2007
13
0
If the data is in Excel, why not use the built in functions that already exist for this?

My thinking is that it will take quite some time to do it to approx 750 files. I fI can automate it that would make me very happy. I tried adding all the sheets to one file, then pasting the formulas across all the sheets. Excel "ran out of memory" and then froze.
 

chown33

Moderator
Staff member
Aug 9, 2009
8,380
4,367
Pumpkindale
My thinking is that it will take quite some time to do it to approx 750 files. I fI can automate it that would make me very happy. I tried adding all the sheets to one file, then pasting the formulas across all the sheets. Excel "ran out of memory" and then froze.
I suggest trying it on one file, rather than just thinking about it or trying to do it on all data at once. Measure how long it takes Excel on that one file.

In the time it takes you to learn enough programming, it's possible that you can script Excel to do it using AppleScript and Excel's own builtin formulas, if you do it file by file rather than all at once. Because no matter what programming language you use, you'll still have to figure out a way to get the data out of 750 Excel files.

It depends on whether your primary goal is to have the calculated results, or whether your primary goal is to learn programming.
 

milbournosphere

macrumors 6502a
Mar 3, 2009
856
1
San Diego, CA
IDon't have much programming experience. I have a data set that I was for each successive number in the set I want to the find the average (of pervious numbers and current number) and the standard deviation.

Example

Day 1 Average (Day 1), Standard Deviation (Day 1)
Day 2 Average (Day 1 - 2), Standard Deviation (Day 1 - 2)
Day 3 Average (Day 1 - 3), Standard Deviation (Day 1 - 3)
Day 4 Average (Day 1 - 4), Standard Deviation (Day 1 - 4)
...
Day n Average (Day 1 - n), Standard Deviation (Day 1 - n)

I have about 750 sets that I want to run this computation on and each set has about 6,050 data points (days).

I am assuming that each day in the set is going to add exponential number of calculations for the computer. Due to the high volume, excel would run out of RAM very quickly (I have 2 GB RAM). I assume that once programmed, the computations (just a guesstimation) would take a month or two of full time computing.

The only programming I have ever done was on the TI calculators in High School. Can someone point to a programming language that would be good for this type of program? I am working on this project over a couple of months and therefore don't mind learning a language that lends itself to this type of data computation.
If I'm correct, are you trying to track the change in standard dev as time goes on? If so, I was able to build a quick formula using excel's STDEV() function. All you really have to to is define a name that references the starting element of the data set (assuming the data is in one column, for example column A, then the starting element would be in cell A1), like 'start' for example. then, in the cell you that you want to have display the std dev for that given row, enter the formula.

Example: (excuse the poor formatting)

A B
14 =STDEV(start:A1) [Should evaluate to 0, although excel will complain]
15 =STDEV(start:A2) [Should evaluate to 0.71]
16 =STDEV(start:A3) [Should evaluate 1]


In the above example, I created a reference to cell A1 called 'start'. Then, I input the formulas in row B. This should scale well, because after you've entered the first two or three formulas in the row, excel will figure out the pattern, and you can just drag the pattern down the row. In this way, you can track the change in standard deviation as you move down your collection of data points.
 

miles01110

macrumors Core
Jul 24, 2006
19,264
30
The Ivory Tower (I'm not coming down)
If I'm correct, are you trying to track the change in standard dev as time goes on? If so, I was able to build a quick formula using excel's STDEV() function.
This would work if all of the standard deviations were on the same sheet or even in the same workbook, but my understanding is that the OP has 750 different files, each with one standard deviation. The OP wants to work with 750 standard deviations and 750 averages, one from each file.
 

milbournosphere

macrumors 6502a
Mar 3, 2009
856
1
San Diego, CA
This would work if all of the standard deviations were on the same sheet or even in the same workbook, but my understanding is that the OP has 750 different files, each with one standard deviation. The OP wants to work with 750 standard deviations and 750 averages, one from each file.
He could merge the files into one spreadsheet, and then collapse them together, couldn't he?
 

miles01110

macrumors Core
Jul 24, 2006
19,264
30
The Ivory Tower (I'm not coming down)
My thinking is that it will take quite some time to do it to approx 750 files. I fI can automate it that would make me very happy. I tried adding all the sheets to one file, then pasting the formulas across all the sheets. Excel "ran out of memory" and then froze.
He could merge the files into one spreadsheet, and then collapse them together, couldn't he?
Apparently not.
 

tyrant

macrumors newbie
Original poster
May 1, 2007
13
0
The OP wants to work with 750 standard deviations and 750 averages, one from each file.
I want one for each day, in each file. Assuming approx. 6,050 days in each file, I would have 4,537,500 Std Devs across all files
 

milbournosphere

macrumors 6502a
Mar 3, 2009
856
1
San Diego, CA
So maybe you can get everything into one sheet. Another question. Is your data one massive related set, or are they distinct sets? IOW, do you need the standard devs to carry on from one file to another? If you don't need this behavior, then maybe you can write a script to handle the 750 files...
 

jared_kipe

macrumors 68030
Dec 8, 2003
2,967
1
Seattle
I'm close to releasing a statistics program for the iphone, I would be curious if you could send me a file or two so I could test a "really bad" case scenario for my algorithms.

I've been testing with 500+ line files, and so far most everything works pretty well. I'm curious what would happen with some 6000+ line files.


Also, what type of data needs to be averaged? How big are the numbers? What precision is expected?
 

tyrant

macrumors newbie
Original poster
May 1, 2007
13
0
So maybe you can get everything into one sheet. Another question. Is your data one massive related set, or are they distinct sets? IOW, do you need the standard devs to carry on from one file to another? If you don't need this behavior, then maybe you can write a script to handle the 750 files...
Each file is its own distinct set.
 

milbournosphere

macrumors 6502a
Mar 3, 2009
856
1
San Diego, CA
Each file is its own distinct set.
Well then, there's your answer! I would try to write an Automator script using the extras that Office installs for you. I am at work so I cannot play with this (stuck on an XP box), but I am confident that Automator would be the best way to handle the near flood of files.
 
Or you could use 1 file and reference the file of interest in another, put something like the following in "analysis.xls", assuming you have files "data1.xls" to "data750.xls" with columns "date" and "value":

Code:
   A                       B                      C                  D
1 date                   value                 average              stdev
2 =[data1.xls]Sheet1!A2  =[data1.xls]Sheet1!B2  =average($B$2:b2) =stdev($B$2:b2)
3 =[data1.xls]Sheet1!A3  =[data1.xls]Sheet1!B3  =average($B$2:b3) =stdev($B$2:b3)
4 =[data1.xls]Sheet1!A4  =[data1.xls]Sheet1!B4  =average($B$2:b4) =stdev($B$2:b4)
5 =[data1.xls]Sheet1!A5  =[data1.xls]Sheet1!B5  =average($B$2:b5) =stdev($B$2:b5)
6 =[data1.xls]Sheet1!A6  =[data1.xls]Sheet1!B6  =average($B$2:b6) =stdev($B$2:b6)
7 =[data1.xls]Sheet1!A7  =[data1.xls]Sheet1!B7  =average($B$2:b7) =stdev($B$2:b7)
Excel will autocomplete the cell references for you and you just replace data1.xls with whatever file you are working with

(Hope formatting is OK, I don't have Excel on this machine to give exact syntax)

That said, if you want to build some skills and are working with data like this, I would suggest using a database (learn some SQL) and a language like python to access and process that data. Python has libraries already (on OS X) to enable you to calculate these values in one line in the numpy package (a slightly older version on my 10.5)

Open a Terminal.app, and type the following:
Code:
mymac:~ almostthere$ python
Python 2.5.1 (r251:54863, Feb  6 2009, 19:02:12) 
[GCC 4.0.1 (Apple Inc. build 5465)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>>
>>> import numpy as np
>>> data = [1, 2, 3, 4, 5]
>>> print np.average(data)
>>> print np.std(data)
The data = .... line can be changed to load the contents of a file, csv, text are easiest but you can use Excel files to a certain extent through addon libraries.
Code:
data = [float(line) for line in open('data1.txt').readlines()]
for example would work if you have a file with a single column and one value per line
 

jared_kipe

macrumors 68030
Dec 8, 2003
2,967
1
Seattle
Algorithmically, average to n depends on on the sum of the previous numbers, and standard deviation depends on the sum of squares for the previous numbers.

Meaning that to depth n, average = (previous_sum(n-1)+x(n))/n
stdDeviation = sqrt( {(previousSSx(n-1)+SSx(n)}/n )

So in sudo code your table would be like...

Code:
input_array[]; // holds each numer in an index
int i;

double or int previous_sum = 0;
double or int previous_SSx = 0;

double or int temp_average = 0; // holds temporary output for average
double or int temp_stdDev = 0;  // same for standardDeviation

for ( i=0; i<(input_array count); i++)
     previous_sum = previous_sum + input_array[i];
     previous_SSx = previous_SSx + input_array[i]*input_array[i];

     temp_average = previous_sum / i;
     temp_stdDev = previous_SSx / i;
     temp_stdDev = temp_stdDev - (temp_average*temp_average);
     temp_stdDev = sqrt( temp_stdDev );

     // at this point the two temps have the data needed for depth i
// so you would write them to a file, or make a string and put them
// in an array or something.
// you would do it inside this for loop so that you get a list of them.
// for our purpouses I'll just write it out to standard output, in theory 
// if you actually wrote this in C you could just capture standard
// output into a file if you wanted to.  (assuming int's)

     printf("Day %i, avg=%i, stdDev=%i", (i+1), temp_average, temp_stdDev);
} // end for loop
// at the end of the loop temp_average and temp_stdDev contain the "final"
// values you would expect if you told someone to "calculate the average
// and standard deviation on this data"