Running the Average and Std Dev on a data set - Newbie

Discussion in 'Mac Programming' started by tyrant, Jan 15, 2010.

  1. tyrant macrumors newbie

    Joined:
    May 1, 2007
    #1
    IDon't have much programming experience. I have a data set that I was for each successive number in the set I want to the find the average (of pervious numbers and current number) and the standard deviation.

    Example

    Day 1 Average (Day 1), Standard Deviation (Day 1)
    Day 2 Average (Day 1 - 2), Standard Deviation (Day 1 - 2)
    Day 3 Average (Day 1 - 3), Standard Deviation (Day 1 - 3)
    Day 4 Average (Day 1 - 4), Standard Deviation (Day 1 - 4)
    ...
    Day n Average (Day 1 - n), Standard Deviation (Day 1 - n)

    I have about 750 sets that I want to run this computation on and each set has about 6,050 data points (days).

    I am assuming that each day in the set is going to add exponential number of calculations for the computer. Due to the high volume, excel would run out of RAM very quickly (I have 2 GB RAM). I assume that once programmed, the computations (just a guesstimation) would take a month or two of full time computing.

    The only programming I have ever done was on the TI calculators in High School. Can someone point to a programming language that would be good for this type of program? I am working on this project over a couple of months and therefore don't mind learning a language that lends itself to this type of data computation.
     
  2. ChrisA macrumors G4

    Joined:
    Jan 5, 2006
    Location:
    Redondo Beach, California
    #2
    No, it's not exponential. It's linear. The time to compute will be limited by the speed of your disk drive, by how fast it can read the data. The computation time is nearly trivial

    Hint: Keep three variables: N - the current number of samples, is incremented as you read the sample, SUM, the sum of samples read so far and SUM_SQUARED with is the sum of the squares. Once these are updated you can print to statics on the same line as the day's data then go to the nest day.

    The entire process should take seconds not months on any Mac that can run Mac OS X.

    Try doing a subset of the data by hand on paper to work out the steps.

    Is this a homework problem? Sure sounds like it. I hope no one here does writes this for you. but I expect it to take about a dozen lines of code.
     
  3. tyrant thread starter macrumors newbie

    Joined:
    May 1, 2007
    #3

    Thank you for the response, to be clear; this is not a homework problem. I understand the ethical implications if it was.

    One question, what language would you recommend for this?
     
  4. chown33 macrumors 604

    Joined:
    Aug 9, 2009
    #5
    Any language that can do basic arithmetic would be able to carry out the calculations.

    Seriously, think about it a little and write down what needs to happen. Express it in plain English rather than trying to write a program right away. Don't take this as patronizing a noob: real software developers do this because it helps clarify the scope of the problem.

    The running average (which I assume is the arithmetic mean) and standard deviation are your calculations. What's the calculation for mean? What's the calculation for std-dev?
    http://en.wikipedia.org/wiki/Standard_deviation

    What arithmetic operations are needed to carry out these calculations:
    Addition, subtraction, multiplication, division, square root: that's it.

    And you might want to explain why you thought the calculations would be exponential. Again, don't take that as patronizing a noob: explaining one's rationale and reasoning is an important part of real software development. It's unexplained and improperly reasoned kluges that cause the most trouble.


    A bigger issue than calculations is likely to be reading your data sets into your program, and converting that data to suitable numbers that you can then add, multiply, etc.

    So exactly what format are your data sets in? Excel files? Comma-separated text files? XML? Something else?

    Not every language has the same ease of reading different data formats. In fact, if your data format isn't documented, then that's going to be a big problem right there. If it's documented but hard to decode, then that means writing a parser (or with luck, finding a free one already written).
     
  5. tyrant thread starter macrumors newbie

    Joined:
    May 1, 2007
    #6
    Good points Chown33.

    I have thought about the arithmetic operations, and that I feel comfortable with. Upon further consideration the computations are not exponential they are actually linier. Large, but linier. The following are the areas I am really clueless about:
    1) How do I make the program pull the data in from a separate file?

    2) How to I tell it to:

    Day 1 Average = (Day 1)/1
    Day 2 Average = (Day 1 +Day 2)/2
    Day 3 Average = (Day 1 + Day 2 + Day 3)/3
    Day n Average = (Day 1 + Day 2 + Day 3 … Day n)/n

    I may not know the nomenclature, but I would call this successive history. This seems to be the biggest area that I have no clue about.

    3) How to I make it save the output to the same file?
    Data Average

    Currently the data is in Excel. It could very easily be converted to comma separated vales or fixed with.
     
  6. miles01110 macrumors Core

    miles01110

    Joined:
    Jul 24, 2006
    Location:
    The Ivory Tower (I'm not coming down)
    #7
    Python (as recommended by AlmostThere) or tcl would be my choices for this problem.
     
  7. milbournosphere macrumors 6502a

    milbournosphere

    Joined:
    Mar 3, 2009
    Location:
    San Diego, CA
    #8
    If the data is in Excel, why not use the built in functions that already exist for this? Excel has pretty decent formula functionality. I would look into their statistical functions, and use them to build what you want. I would recommend starting with a subset of 10 or so elements to get what you are looking for. I would think that this would not be so hard to build just using excel. That way, you could avoid the hassle of having to learn a programming language (unless that is part of your aim; programming is an excellent skill to have).
     
  8. chown33 macrumors 604

    Joined:
    Aug 9, 2009
    #9
    The exact procedure will depend on which language is used. This is called I/O, which stands for Input/Output, and is the means by which data is read into a program or written out from it. Every language has different I/O facilities, although the principles are generally the same.


    Break it down.
    http://www.cocoadev.com/index.pl?BreakItDown

    Look for patterns. For example, notice how Day 3's average consists of Day 2's numerator (summation) plus Day 3's value. Then notice how every other day exhibits the same pattern. The conclusion: you don't have to calculate Day 1 + Day 2 + Day 3 + Day n at each step. You can use the summation of previous days and only add one day's value, before dividing by the day-count. And adding each day's value sets things up for the following day.


    That's I/O again and it will depend on which language is used.


    "Fixed width"?

    Comma-separated or tab-separate numbers are probably the easier form to process for most programming languages. Very few things can deal directly with Excel files.

    Start with a limited data set, say only 5 days. Write down the specific steps, including the loop, the reading of one line, the calculations, and so on.

    If you have the steps written down properly, without regard to language, it will be a lot simpler to figure out what to do in any particular language.
     
  9. tyrant thread starter macrumors newbie

    Joined:
    May 1, 2007
    #10

    My thinking is that it will take quite some time to do it to approx 750 files. I fI can automate it that would make me very happy. I tried adding all the sheets to one file, then pasting the formulas across all the sheets. Excel "ran out of memory" and then froze.
     
  10. chown33 macrumors 604

    Joined:
    Aug 9, 2009
    #11
    I suggest trying it on one file, rather than just thinking about it or trying to do it on all data at once. Measure how long it takes Excel on that one file.

    In the time it takes you to learn enough programming, it's possible that you can script Excel to do it using AppleScript and Excel's own builtin formulas, if you do it file by file rather than all at once. Because no matter what programming language you use, you'll still have to figure out a way to get the data out of 750 Excel files.

    It depends on whether your primary goal is to have the calculated results, or whether your primary goal is to learn programming.
     
  11. milbournosphere macrumors 6502a

    milbournosphere

    Joined:
    Mar 3, 2009
    Location:
    San Diego, CA
    #12
    If I'm correct, are you trying to track the change in standard dev as time goes on? If so, I was able to build a quick formula using excel's STDEV() function. All you really have to to is define a name that references the starting element of the data set (assuming the data is in one column, for example column A, then the starting element would be in cell A1), like 'start' for example. then, in the cell you that you want to have display the std dev for that given row, enter the formula.

    Example: (excuse the poor formatting)

    A B
    14 =STDEV(start:A1) [Should evaluate to 0, although excel will complain]
    15 =STDEV(start:A2) [Should evaluate to 0.71]
    16 =STDEV(start:A3) [Should evaluate 1]


    In the above example, I created a reference to cell A1 called 'start'. Then, I input the formulas in row B. This should scale well, because after you've entered the first two or three formulas in the row, excel will figure out the pattern, and you can just drag the pattern down the row. In this way, you can track the change in standard deviation as you move down your collection of data points.
     
  12. miles01110 macrumors Core

    miles01110

    Joined:
    Jul 24, 2006
    Location:
    The Ivory Tower (I'm not coming down)
    #13
    This would work if all of the standard deviations were on the same sheet or even in the same workbook, but my understanding is that the OP has 750 different files, each with one standard deviation. The OP wants to work with 750 standard deviations and 750 averages, one from each file.
     
  13. milbournosphere macrumors 6502a

    milbournosphere

    Joined:
    Mar 3, 2009
    Location:
    San Diego, CA
    #14
    He could merge the files into one spreadsheet, and then collapse them together, couldn't he?
     
  14. miles01110 macrumors Core

    miles01110

    Joined:
    Jul 24, 2006
    Location:
    The Ivory Tower (I'm not coming down)
    #15
    Apparently not.
     
  15. tyrant thread starter macrumors newbie

    Joined:
    May 1, 2007
    #16
    I want one for each day, in each file. Assuming approx. 6,050 days in each file, I would have 4,537,500 Std Devs across all files
     
  16. miles01110 macrumors Core

    miles01110

    Joined:
    Jul 24, 2006
    Location:
    The Ivory Tower (I'm not coming down)
    #17
    Ah ok, well the point was more that you had multiple files to work with :p
     
  17. milbournosphere macrumors 6502a

    milbournosphere

    Joined:
    Mar 3, 2009
    Location:
    San Diego, CA
    #18
    I saw that, but I figured that maybe the error occurred when he was pasting the formulas. tyrant, when(at what step in the process) did the memory error occur?
     
  18. tyrant thread starter macrumors newbie

    Joined:
    May 1, 2007
    #19
    Recalculating
     
  19. milbournosphere macrumors 6502a

    milbournosphere

    Joined:
    Mar 3, 2009
    Location:
    San Diego, CA
    #20
    So maybe you can get everything into one sheet. Another question. Is your data one massive related set, or are they distinct sets? IOW, do you need the standard devs to carry on from one file to another? If you don't need this behavior, then maybe you can write a script to handle the 750 files...
     
  20. jared_kipe macrumors 68030

    jared_kipe

    Joined:
    Dec 8, 2003
    Location:
    Seattle
    #21
    I'm close to releasing a statistics program for the iphone, I would be curious if you could send me a file or two so I could test a "really bad" case scenario for my algorithms.

    I've been testing with 500+ line files, and so far most everything works pretty well. I'm curious what would happen with some 6000+ line files.


    Also, what type of data needs to be averaged? How big are the numbers? What precision is expected?
     
  21. tyrant thread starter macrumors newbie

    Joined:
    May 1, 2007
    #22
    Each file is its own distinct set.
     
  22. milbournosphere macrumors 6502a

    milbournosphere

    Joined:
    Mar 3, 2009
    Location:
    San Diego, CA
    #23
    Well then, there's your answer! I would try to write an Automator script using the extras that Office installs for you. I am at work so I cannot play with this (stuck on an XP box), but I am confident that Automator would be the best way to handle the near flood of files.
     
  23. AlmostThere macrumors 6502a

    #24
    Or you could use 1 file and reference the file of interest in another, put something like the following in "analysis.xls", assuming you have files "data1.xls" to "data750.xls" with columns "date" and "value":

    Code:
       A                       B                      C                  D
    1 date                   value                 average              stdev
    2 =[data1.xls]Sheet1!A2  =[data1.xls]Sheet1!B2  =average($B$2:b2) =stdev($B$2:b2)
    3 =[data1.xls]Sheet1!A3  =[data1.xls]Sheet1!B3  =average($B$2:b3) =stdev($B$2:b3)
    4 =[data1.xls]Sheet1!A4  =[data1.xls]Sheet1!B4  =average($B$2:b4) =stdev($B$2:b4)
    5 =[data1.xls]Sheet1!A5  =[data1.xls]Sheet1!B5  =average($B$2:b5) =stdev($B$2:b5)
    6 =[data1.xls]Sheet1!A6  =[data1.xls]Sheet1!B6  =average($B$2:b6) =stdev($B$2:b6)
    7 =[data1.xls]Sheet1!A7  =[data1.xls]Sheet1!B7  =average($B$2:b7) =stdev($B$2:b7)
    
    
    Excel will autocomplete the cell references for you and you just replace data1.xls with whatever file you are working with

    (Hope formatting is OK, I don't have Excel on this machine to give exact syntax)

    That said, if you want to build some skills and are working with data like this, I would suggest using a database (learn some SQL) and a language like python to access and process that data. Python has libraries already (on OS X) to enable you to calculate these values in one line in the numpy package (a slightly older version on my 10.5)

    Open a Terminal.app, and type the following:
    Code:
    mymac:~ almostthere$ python
    Python 2.5.1 (r251:54863, Feb  6 2009, 19:02:12) 
    [GCC 4.0.1 (Apple Inc. build 5465)] on darwin
    Type "help", "copyright", "credits" or "license" for more information.
    >>>
    >>> import numpy as np
    >>> data = [1, 2, 3, 4, 5]
    >>> print np.average(data)
    >>> print np.std(data)
    
    The data = .... line can be changed to load the contents of a file, csv, text are easiest but you can use Excel files to a certain extent through addon libraries.
    Code:
    data = [float(line) for line in open('data1.txt').readlines()]
    
    for example would work if you have a file with a single column and one value per line
     
  24. jared_kipe macrumors 68030

    jared_kipe

    Joined:
    Dec 8, 2003
    Location:
    Seattle
    #25
    Algorithmically, average to n depends on on the sum of the previous numbers, and standard deviation depends on the sum of squares for the previous numbers.

    Meaning that to depth n, average = (previous_sum(n-1)+x(n))/n
    stdDeviation = sqrt( {(previousSSx(n-1)+SSx(n)}/n )

    So in sudo code your table would be like...

    Code:
    input_array[]; // holds each numer in an index
    int i;
    
    double or int previous_sum = 0;
    double or int previous_SSx = 0;
    
    double or int temp_average = 0; // holds temporary output for average
    double or int temp_stdDev = 0;  // same for standardDeviation
    
    for ( i=0; i<(input_array count); i++)
         previous_sum = previous_sum + input_array[i];
         previous_SSx = previous_SSx + input_array[i]*input_array[i];
    
         temp_average = previous_sum / i;
         temp_stdDev = previous_SSx / i;
         temp_stdDev = temp_stdDev - (temp_average*temp_average);
         temp_stdDev = sqrt( temp_stdDev );
    
         // at this point the two temps have the data needed for depth i
    // so you would write them to a file, or make a string and put them
    // in an array or something.
    // you would do it inside this for loop so that you get a list of them.
    // for our purpouses I'll just write it out to standard output, in theory 
    // if you actually wrote this in C you could just capture standard
    // output into a file if you wanted to.  (assuming int's)
    
         printf("Day %i, avg=%i, stdDev=%i", (i+1), temp_average, temp_stdDev);
    } // end for loop
    // at the end of the loop temp_average and temp_stdDev contain the "final"
    // values you would expect if you told someone to "calculate the average
    // and standard deviation on this data"
    
     

Share This Page