View Full Version : Anyone got any recommendations?
Cromulent
Jul 2, 2008, 06:10 PM
I've attached an example data file that the program needs to read (it is actually a comma separated file but had to change the extension). The thing is, it needs to read data like this from multiple streams and process it at speeds up to one piece per second. It will be working discreetly to calculate things such as the moving average and must also take into account things such as network latency and inability to access data without hanging the computer.
I'm not really looking for programming help as such, as I can handle that part; I'm asking more for help with how best to approach the problem. What do I need to look out for? What would the best approach be? Does anyone have any experience with reading in large data sets and processing them? Especially when the data is being streamed to the computer from the internet in real time.
I would imagine it would be handy to write a helper application that just receives the data from the net and stores it in a file, so that the main program can process it at its leisure and does not need to worry about the network side of things. I guess it also helps with security, as you are sure the data you are reading is in the correct format.
All this will be done in C, although I'm starting to think Python may be a better alternative, as writing the file handling part is unnecessarily complex in C.
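For the moving average mentioned above, here is a minimal sketch in Python. It assumes Python 3, and the class name `MovingAverage` and the window size are made up for illustration. It keeps a running total so that each new price costs constant time, rather than re-summing the whole window on every update.

```python
from collections import deque


class MovingAverage:
    """Simple moving average over the last `window` values."""

    def __init__(self, window):
        if window <= 0:
            raise ValueError("window must be positive")
        self.window = window
        self.values = deque()
        self.total = 0.0

    def update(self, price):
        """Add one price and return the average of the retained values."""
        self.values.append(price)
        self.total += price
        if len(self.values) > self.window:
            # Drop the oldest value so that only `window` values are kept
            self.total -= self.values.popleft()
        return self.total / len(self.values)


ma = MovingAverage(window=3)
averages = [ma.update(p) for p in (10.0, 11.0, 12.0, 13.0)]
print(averages)
```

Until the window fills up, the average is taken over however many values have arrived so far; whether that is acceptable depends on the calculation.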
toddburch
Jul 2, 2008, 06:20 PM
Archives of stock data probably won't really be coming in THAT heavy, so I would most likely use a regular expression for data validation. Ruby, Python, Perl, C - all would be fine for this.
One approach I take for this type of thing is to use a scripting language first - for proof of concept and quick implementation. Then, if the performance sucks, go to a faster language.
Todd
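As a rough idea of what regular-expression validation could look like, here is a sketch in Python 3. The record layout (an ISO-style date followed by open, high, low, close and an integer volume) is an assumption, since the attached file is not shown in the thread; `RECORD_RE` and `is_valid_record` are hypothetical names.

```python
import re

# Assumed layout: YYYY-MM-DD,open,high,low,close,volume
RECORD_RE = re.compile(
    r"^\d{4}-\d{2}-\d{2}"      # date
    r"(?:,\d+(?:\.\d+)?){4}"   # open, high, low, close as decimal numbers
    r",\d+$"                   # volume as an integer
)


def is_valid_record(line):
    """Return True if the line matches the assumed record layout."""
    return RECORD_RE.match(line.strip()) is not None


print(is_valid_record("2008-07-02,12.5,13.0,12.25,12.75,100000"))
print(is_valid_record("2008-07-02,12.5,abc,12.25,12.75,100000"))
```

This only checks the shape of a line, not whether the values make sense (for example, that high is not below low).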
Cromulent
Jul 2, 2008, 06:23 PM
Depends. Ultimately it won't be archives but streaming and you can get data as quick as once a second.
Good plan with the scripting language though, that should knock quite a bit of time off initial development.
lee1210
Jul 2, 2008, 07:22 PM
Once per second there could be 1 record, or once per second there could be 10MB on the line? This makes a big difference in how you poll, whether you need a dedicated I/O thread, etc.
-Lee
Cromulent
Jul 2, 2008, 07:44 PM
Once per second for one line of data which is probably going to be 1/2KB, if that. The problem comes when you need to do the mathematical calculations on the data and have less than one second to download the data, append it to the end of the file and perform the calculations.
Basically I need it to run as fast as possible as a delay is not going to be very good for the calculations. I'm guessing I'm going to have to have an I/O thread and at least one data processing thread (probably one per algorithm).
All the data will be is the current stock price and the time. But there are at least 3 different things I need to calculate, one of which requires backtesting against a certain number of previous data points in order to make sure it is correct.
lee1210
Jul 2, 2008, 08:04 PM
Does the result of the calculation for the current line of input have to be complete before you read the next? Does it need to be complete before you start the calculation on the next line of input?
Depending on the requirements it might be easiest to have a separate thread (or even process) grabbing data from the line and putting it in a queue (could be a hand-crafted queue, a database table, etc.). Your processing thread(s)/process(es) would need to be able to grab the "next" record if the order of the calculations is important, or any available item if not.
-Lee
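The reader-plus-queue arrangement could be sketched like this in Python 3, using the standard `queue` and `threading` modules. The in-memory `fake_stream` list stands in for the network, and the "time,price" line format is an assumption for illustration.

```python
import queue
import threading

SENTINEL = None  # marks the end of the stream


def reader(lines, q):
    """Stand-in for the I/O thread: push each incoming line onto the queue."""
    for line in lines:
        q.put(line)
    q.put(SENTINEL)


def processor(q, results):
    """Processing thread: take records in arrival order and parse them."""
    while True:
        line = q.get()
        if line is SENTINEL:
            break
        timestamp, price = line.split(",")
        results.append((timestamp, float(price)))


fake_stream = ["09:30:00,101.5", "09:30:01,101.7", "09:30:02,101.6"]
q = queue.Queue()
results = []
t_in = threading.Thread(target=reader, args=(fake_stream, q))
t_proc = threading.Thread(target=processor, args=(q, results))
t_in.start()
t_proc.start()
t_in.join()
t_proc.join()
print(results)
```

`queue.Queue` is first-in, first-out, so with a single processing thread the records are handled in the order they arrived. With several processing threads that ordering guarantee no longer holds for the results.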
Cromulent
Jul 3, 2008, 02:05 PM
Does the result of the calculation for the current line of input have to be complete before you read the next? Does it need to be complete before you start the calculation on the next line of input?
Yep. It is a real time application that will eventually be used to spot simple trends in the numbers as they are received. Therefore it is imperative that all calculations are complete before the next set of data is received.
The point is that eventually it won't be limited to one stream and the data can be received at different intervals for each stream (one could be once a second, one could be once every 15 mins, one could be once a day).
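One way to think about several streams with different intervals is a priority queue keyed on each stream's next due time. The sketch below (Python 3) only computes the order in which streams come due and does no real waiting or network access; the stream names and the `due_order` function are hypothetical.

```python
import heapq


def due_order(intervals, horizon):
    """Return (time, stream) pairs in the order streams come due, up to `horizon` seconds."""
    heap = [(interval, name) for name, interval in intervals.items()]
    heapq.heapify(heap)
    order = []
    while heap and heap[0][0] <= horizon:
        due, name = heapq.heappop(heap)
        order.append((due, name))
        # Schedule the same stream again one interval later
        heapq.heappush(heap, (due + intervals[name], name))
    return order


intervals = {"fast": 1, "slow": 3}  # seconds between updates
print(due_order(intervals, 3))
```

A real version would sleep until the earliest due time and then poll that stream; ties are broken here by stream name simply because of how tuples compare.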
Cromulent
Jul 6, 2008, 02:05 PM
Okay following advice I decided to mock up parts of this using Python (first time using it) and am having a problem calling a Python module from my C code.
Here is the C code:
PyObject *module, *dict, *func, *value;
Py_Initialize();
module = PyImport_ImportModule("logic");
dict = PyModule_GetDict(module);
func = PyDict_GetItemString(dict, "logic");
value = PyObject_CallFunction(func, "HW", "");
Py_DECREF(module);
Py_DECREF(dict);
Py_DECREF(func);
Py_DECREF(value);
Py_Finalize();
Here is the Python code it is meant to call (just something simple to make sure it is working):
def HW():
    print 'Hello World!'
It seems to be crashing out at the PyModule_GetDict() function with EXC_BAD_ACCESS, but I have no idea why. As far as I can see they are legitimate places to write to and read from. The documentation is poor on this unfortunately.
The code compiles fine with no warnings or errors.
toddburch
Jul 6, 2008, 02:11 PM
Bummer, if you had chosen Ruby I'd dig in and help!
Cromulent
Jul 6, 2008, 05:39 PM
Bummer, if you had chosen Ruby I'd dig in and help!
Doh, unfortunately my Dad already has a few Python books so it makes sense to use that.
Although this is turning into a bigger problem than I realised. Hmm, stupid documentation. At least with C you are guaranteed to have decent documentation for the standard library and much used functions. I would have thought embedding a Python script in C would be an extremely well used part of the language.
Mac Player
Jul 6, 2008, 05:55 PM
Why not Java? It's easier than C and faster than Python.
Edit: Does the 1 sec limit include the network delay?
Cromulent
Jul 6, 2008, 05:59 PM
Why not Java? It's easier than C and faster than Python.
Not sure about that. Python can be extremely fast. Anyway I'm using Python because you can extend a standard C program with it which gives great flexibility. You do all the mission critical stuff in C and all the setup in Python. Plus you can add new features easily with Python.
Java on the other hand is not a particularly easy language to learn, it has a huge sprawling collection of libraries and like C++ it has tried to do the one language fits all approach and is thus pretty unwieldy.
Mac Player
Jul 6, 2008, 06:17 PM
But you don't have native threads in python.
Cromulent
Jul 6, 2008, 06:24 PM
But you don't have native threads in python.
Technically you don't have native threads in Java either.
Mac Player
Jul 6, 2008, 06:35 PM
I thought the green thread mode was removed.
Cromulent
Jul 6, 2008, 06:39 PM
I thought the green thread mode was removed.
My point was that Java runs on top of a virtual machine. Anything Java does is native for the VM, but it then gets JIT compiled down to the machine level as it is run. Whether you count that as native is up for debate.
Python on the other hand has a tool which will take normal Python code and turn it directly into native x86 assembly which can then be assembled and run as a normal program. So in that case Python is actually more native than Java as it is properly compiled and not interpreted at all.
Therefore I would argue Python can be faster than Java.
Mac Player
Jul 6, 2008, 07:02 PM
Java threads can execute concurrently, python threads can't. And the JVM runs faster even than psyco.
Cromulent
Jul 6, 2008, 07:23 PM
Java threads can execute concurrently, python threads can't. And the JVM runs faster even than psyco.
Having done a little Googling I see you are correct.
Still the point is moot because of these reasons:
C is faster than both Python and Java and I will be using that for the performance critical areas of the program.
Python can be called from within C threads which are native and can run concurrently.
Python is (IMO) a nicer language and easier to learn.
Python integrates with C in such a way as to make the program extensible in an easy and approachable manner.
lee1210
Jul 6, 2008, 08:12 PM
This is strictly academic, and I do not support using Java for this task, but it can be called from C and vice versa by way of the Java Native Interface (JNI). I have not tried it with Python but I hope it is easier than JNI.
Also, the input filter you are building will be the input process/thread. For this problem I don't think you need multiple input threads, so Python's threading model is irrelevant here. Thanks to those that brought it up, it is important to know in general. I just don't think it will come to bear for this project.
I mentioned it briefly before, but it may be worth considering using separate processes for I/O and data processing. These tasks don't need to interact aside from in the datastore. You can maintain coherency there via file locks or transactions in an RDBMS. With pthreads you need to learn a lot more (never a bad thing) before you can start working. You may need to use semaphores depending on your design. If you don't have a backing store, whatever you've read is gone if your program is terminated. If you need a backing store anyway, using this with coprocesses is probably easier.
Anyway, good luck, keep posting with progress. I might pull out my machine in a few minutes and give the C/Python bridge a try. I'll let you know if I come up with anything.
-Lee
Cromulent
Jul 6, 2008, 08:56 PM
Good advice, I think a database may work well as I will have to query ranges of data. I assume something like PostgreSQL will be suitable. Although I think it might be overly complex at this moment in time.
Working with a common data store could be beneficial in other ways too. I'll have to think about that a little bit.
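To make the common data store idea concrete, here is a small sketch in Python 3 using the standard `sqlite3` module. SQLite is only a stand-in chosen so the example is self-contained (it is not the PostgreSQL setup mentioned above), and the table and function names are made up. Each insert runs in its own transaction, and the reader pulls a range of records by timestamp.

```python
import sqlite3

# In-memory database for the demo; two real processes would share a file or a server
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ticks (ts INTEGER PRIMARY KEY, price REAL NOT NULL)")


def store_tick(conn, ts, price):
    """Acquisition side: insert one record inside a transaction."""
    # `with conn` commits on success and rolls back if an exception is raised
    with conn:
        conn.execute("INSERT INTO ticks (ts, price) VALUES (?, ?)", (ts, price))


def ticks_between(conn, start, end):
    """Processing side: fetch all records with start <= ts <= end, oldest first."""
    cur = conn.execute(
        "SELECT ts, price FROM ticks WHERE ts BETWEEN ? AND ? ORDER BY ts",
        (start, end),
    )
    return cur.fetchall()


for ts, price in [(1, 100.0), (2, 100.5), (3, 101.0), (4, 100.75)]:
    store_tick(conn, ts, price)
print(ticks_between(conn, 2, 3))
```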
As for Python the C API for it seems very comprehensive and rather nice for a bridge.
lee1210
Jul 6, 2008, 09:39 PM
I've never used Python on my system before, so there were some issues getting it built and getting my module included, but I have something that seems to be working (I don't normally consider this test complete until I can pass a dynamic chunk of data like a list of variable length of string, but I'm pretty tired).
Here's what I have:
testpy.c:
#include <stdio.h>
#include "Python.h" //I had to use a long path, but I assume your system is better set up
int main(int argc, char *argv[]) {
    PyObject *module, *dict, *func, *value;
    Py_Initialize(); //Set up the interpreter
    PyRun_SimpleString("import sys\n"); //This line and the next
    PyRun_SimpleString("sys.path.append('.')\n"); //are only b/c of my system
    module = PyImport_ImportModule("logic"); //Import the module in logic.py
    if (module == NULL) { PyErr_Print(); return 1; } //Import failed, e.g. logic.py not on sys.path
    dict = PyModule_GetDict(module); //The module's namespace (borrowed reference)
    func = PyDict_GetItemString(dict, "HW"); //Get a reference to the function (also borrowed)
    if (func == NULL) { fprintf(stderr, "HW not found\n"); return 1; }
    value = PyObject_CallFunction(func, NULL); //Call HW with no parameters
    Py_XDECREF(value); //Only release the references we own:
    Py_DECREF(module); //dict and func are borrowed, so no DECREF for them
    Py_Finalize();
    return 0;
}
logic.py:
def HW():
    print 'Hello from Python!'
When I built with GCC I had to explicitly include the Python library. Again, hopefully your system has this set up already, but just an FYI.
-Lee
Cromulent
Jul 7, 2008, 05:14 PM
Crashes out when I try and run it in Xcode. I have the Python included with Xcode tools (2.5.1) and have included libpython.dylib into the project as well as the Python header file.
I'll have to do some more reading when I'm not so tired. I don't think I'm thinking straight enough to do any programming at the moment.
mamcx
Jul 7, 2008, 05:46 PM
I think the best route is to do this directly in Python. You can make it run faster with psyco and can solve the threading with Stackless (but anyway, nobody can solve that unless it is Erlang!).
Also, I think you can capture the data stream as fast as you can, but delay the calculation part.
Imagine it is a waterfall. Put a lake there to hold the water, then clean it later. If the result is something to be shown in a GUI, the delay is minimal.
lee1210
Jul 7, 2008, 05:55 PM
GDB it and see which of the PyObject pointers you have there come back null. The crash is likely passing one of those as null to another Python function, or when you do Py_DECREF on them. Once you know that it may be easier to figure out. I had the logic.py file in the same working directory as my program (I just used gcc from terminal, I wasn't trying Xcode), which is why I did:
sys.path.append('.')
I don't know what directory Xcode uses for its working directory, but it might be better for you, while you're trying this out, to put the logic.py file in a specific place, then just add that to Python's path using the sys.path.append function.
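The effect of adding a fixed directory to Python's path can be tried from plain Python first, without the C bridge. This sketch (Python 3) uses a temporary directory as a stand-in for the "specific place", and the module name `logic_demo` is hypothetical, chosen so it does not clash with the real logic.py.

```python
import importlib
import os
import sys
import tempfile

# A temporary directory stands in for a fixed, well-known script location
script_dir = tempfile.mkdtemp()
with open(os.path.join(script_dir, "logic_demo.py"), "w") as f:
    f.write("def HW():\n    return 'Hello from Python!'\n")

# Make the directory importable regardless of the current working directory
if script_dir not in sys.path:
    sys.path.append(script_dir)
importlib.invalidate_caches()

logic_demo = importlib.import_module("logic_demo")
print(logic_demo.HW())
```

If the import works here but not from the embedded interpreter, the difference is most likely the working directory or the path, not the module itself.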
That was one of the causes of problems when I started out, the other was that I had to change the last argument of PyObject_CallFunction to NULL from the empty string. The rest of the issues I had seemed to be related to the environment, so if Xcode is already handling that you shouldn't have to worry about those.
-Lee
P.S. If you are going to go with a co-process method, you could just run a Python script by itself for the "data acquisition" process as mamcx suggested. However, I think these bridges are interesting so I like playing with them. Even after I went through the trouble of passing dynamic data to/from Fortran to C to Java (via JNI) and Fortran to C to C# (via Embedded Mono) I haven't ended up using either in production. In my case the overhead of starting a lot of JVMs for JNI or embedding the mono runtime in all of our processes wasn't worthwhile.
Cromulent
Jul 8, 2008, 03:37 PM
Bah, the program works perfectly when compiling on the command line :(. Yet another reason to hate IDEs which just get in your way...
I guess I'm going to have to have a closer look at Xcode.
lee1210
Jul 8, 2008, 04:33 PM
Did you try sticking the python script in a well-defined place, and putting:
sys.path.append('/users/Cromulent/scripts');
in the code? I'm sure Xcode uses some subdirectory under the project directory as the working directory when your program runs, so that would be more difficult to find and place the python script in.
-Lee
Cromulent
Jul 8, 2008, 06:35 PM
Okay, that seems to have fixed it. If you don't change the path the Python source needs to be in the same directory as the executable which would be rather inconvenient but there you go.
Cromulent
Aug 14, 2008, 08:44 PM
Just gone back to this after a few weeks putting it off. The Python part works fine and I can return the data to C and convert it to the relevant C types. The problem, really, is down to design. How do you guys decide the best way to pass arguments around functions? If I carry on down the route I'm going 90% of my functions will require 4+ arguments which is pretty hard to remember when you have a fair amount.
How would you handle a situation where you have 6 arrays each with 8000+ elements to them? Obviously put them in a struct but would you make each of the items an array or make an array of structs? Is there much of a difference in terms of performance or memory use? All elements are the same size so that is not an issue (in terms of array length).
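On the memory side of the struct question, the two layouts can be compared without writing any C by describing them with Python's `ctypes`, which follows the platform's C layout rules. This is only a sketch (Python 3): the field names, the 11-byte date string and the use of `int` for volume are assumptions, not the actual struct from the program.

```python
import ctypes

N = 8000  # number of records, roughly the figure mentioned above


class Record(ctypes.Structure):
    """One record with all six fields: the element type for an array of structs."""
    _fields_ = [
        ("date", ctypes.c_char * 11),  # e.g. "YYYY-MM-DD" plus NUL (assumed format)
        ("open", ctypes.c_double),
        ("high", ctypes.c_double),
        ("low", ctypes.c_double),
        ("close", ctypes.c_double),
        ("volume", ctypes.c_int),
    ]


class Columns(ctypes.Structure):
    """One struct holding six arrays: the struct-of-arrays layout."""
    _fields_ = [
        ("date", (ctypes.c_char * 11) * N),
        ("open", ctypes.c_double * N),
        ("high", ctypes.c_double * N),
        ("low", ctypes.c_double * N),
        ("close", ctypes.c_double * N),
        ("volume", ctypes.c_int * N),
    ]


aos_bytes = ctypes.sizeof(Record * N)  # array of structs
soa_bytes = ctypes.sizeof(Columns)     # struct of arrays
print(aos_bytes, soa_bytes)
```

Any difference comes from alignment padding, which the array of structs pays once per record. Performance depends more on access pattern: a calculation that walks a single column touches less memory with the struct of arrays, while one that needs every field of a record at once may prefer the array of structs.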
lee1210
Aug 14, 2008, 09:48 PM
It might seem "nicer" to use the bridge in this manner, but is this really superior to passing in a filename which points to a file with the data to process?
If you did want to send it all over the bridge, how slow this is will depend on the speed of the embedded Python code. I'm guessing it's pretty quick, but have no idea. I would think that you would want a struct in C that holds all of your arrays, which you pass to a function that breaks them out into 6 separate PyListObjects of the appropriate type (say, PyIntObjects). You can then compose that into one big list and stick all 6 of the PyListObjects you generated in that, and have that be your one parameter to your Python function. This will take a lot of C to do what's a few lines' worth of Python, but such is the life of a programmer doing embedding.
-Lee
Cromulent
Aug 14, 2008, 10:05 PM
Thanks for that. Basically my Python code reads the file, strips out the commas and splits it into 6 lists (one for each heading). I then return the 6 lists in a tuple, extract each list from the tuple and convert it into an array of either char (for the date), double or int depending on the type in question. Doing it this way basically means that the brunt of the code is C, and Python just does the stuff which is a royal pain in the arse in C.
I'm just trying to work out a nice solution to these arrays that does not require me to a) make them global or b) pass structs around by value. Plus the program does not know how big the list will be until Python returns the tuple so additionally they need to be C99 style dynamic arrays.
Design is not my strong point.
lee1210
Aug 14, 2008, 10:15 PM
I'm really not familiar with Python, but I was playing with this using lists. You should be able to, on return from the Python function, use PyList_Size to get the length, then allocate that times sizeof(type). I'm not familiar with C99 dynamic arrays, either, so I'm not sure how that changes things, but probably not much.
I'm playing with this now to see how long it's taking to setup all of the python objects from 51,000 ints.
-Lee
Cromulent
Aug 14, 2008, 10:27 PM
Here's my Python code:
def fileInput():
    # Lists for data read from file, one per column
    data = []
    date = []
    opening = []
    high = []
    low = []
    closing = []
    volume = []
    s = raw_input("Please enter the filename to process (enter full path if not in current directory): ")
    fd = open(s, "r")
    fd.readline() # Throw away the header line
    for record in fd:
        record = record.strip()
        items = record.split(',')
        # Columns 1-4 (open, high, low, close) are prices
        for i in range(1, 5):
            items[i] = float(items[i])
        items[5] = int(items[5]) # Volume
        # Append each field to its list once per record
        date.append(items[0])
        opening.append(items[1])
        high.append(items[2])
        low.append(items[3])
        closing.append(items[4])
        volume.append(items[5])
    fd.close()
    return (date, opening, high, low, closing, volume)
and here's the part of the C program which deals with the return:
retTuple = setupPyScriptForUse();
if(retTuple == NULL)
{
    printf("Failed to open and examine file.\n");
    Py_Finalize();
    return EXIT_FAILURE;
}
listItem = PyTuple_GetItem(retTuple, 1);
if(listItem == NULL)
{
    printf("Failed to extract item from Tuple.\n");
    return EXIT_FAILURE;
}
listSize = PyList_Size(listItem);
finDataPtr = parsePyInputOpen(listItem, listSize);
and the parsePyInputOpen function:
finData * parsePyInputOpen(PyObject *returnValue, Py_ssize_t nitems)
{
    PyObject *processed = NULL;
    Py_ssize_t i = 0;
    finData finDataStruct[nitems];
    finData *finDataPtr = &finDataStruct[0];
    printf("Size of the list: %i\n", (int)nitems);
    for(i = 0; i < nitems; i++)
    {
        processed = PyList_GetItem(returnValue, i);
        finDataStruct[i].finOpen = PyFloat_AsDouble(processed);
    }
    return finDataPtr;
}
I'm just not sure what the best way to handle it is, design-wise. I could probably keep it in the pseudo Python / C object style but I'd rather have the data as raw doubles etc. for later on.
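If the goal is raw doubles, one option on the Python side is to pack a list into a contiguous block with the standard `array` module before handing it back, instead of returning a list of float objects. The sketch below (Python 3, made-up sample values) shows only the packing and a round trip in Python; how the C side would read that block through the C API is not shown and would need checking against the documentation.

```python
from array import array

closing = [10.25, 10.75, 11.0]  # sample values for illustration

packed = array("d", closing)  # contiguous C doubles, one per list element
raw = packed.tobytes()        # the same data as a plain byte string
print(packed.itemsize, len(raw))

# Round trip: rebuild the values from the raw bytes
restored = array("d")
restored.frombytes(raw)
print(list(restored))
```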
lee1210
Aug 14, 2008, 11:34 PM
I played with this and this was what I came up with:
logic.py (just used the same one as I did originally when looking at this):
def HW(list):
    print 'Hello from Python!'
    print len(list)
    print len(list[0])
    list[0][4:10] = list[3][1:7]
    list[0].extend(list[1])
    list[0].extend(list[2][1:8024])
    return list
testpy.c:
#include <stdio.h>
#include <stdlib.h>
#include "Python.h"
int main(int argc, char *argv[]) {
    int cListA[8500];
    int cListB[8500];
    int cListC[8500];
    int cListD[8500];
    int cListE[8500];
    int cListF[8500];
    int *cResult = NULL;
    int loopControl = 0;
    int sz;
    PyObject *module, *dict, *func, *value, *pArgs;
    PyObject *listA, *listB, *listC, *listD, *listE, *listF, *componentList;
    Py_Initialize(); //Set up the interpreter
    PyRun_SimpleString("import sys\n"); //This line and the next
    PyRun_SimpleString("sys.path.append('.')\n"); //are only b/c of my system
    module = PyImport_ImportModule("logic"); //Import the module in logic.py
    if (module == NULL) { PyErr_Print(); return 1; } //logic.py not found or failed to load
    dict = PyModule_GetDict(module); //The module's namespace (borrowed reference)
    func = PyDict_GetItemString(dict, "HW"); //Get a reference to the function (also borrowed)
    for (loopControl = 0; loopControl < 8500; loopControl++) { //C data setup
        cListA[loopControl] = rand();
        cListB[loopControl] = rand();
        cListC[loopControl] = rand();
        cListD[loopControl] = rand();
        cListE[loopControl] = rand();
        cListF[loopControl] = rand();
    }
    listA = PyList_New(8500);
    listB = PyList_New(8500);
    listC = PyList_New(8500);
    listD = PyList_New(8500);
    listE = PyList_New(8500);
    listF = PyList_New(8500);
    for (loopControl = 0; loopControl < 8500; loopControl++) {
        //PyList_SetItem steals the reference to each new int object
        PyList_SetItem(listA, loopControl, PyInt_FromLong((long)cListA[loopControl]));
        PyList_SetItem(listB, loopControl, PyInt_FromLong((long)cListB[loopControl]));
        PyList_SetItem(listC, loopControl, PyInt_FromLong((long)cListC[loopControl]));
        PyList_SetItem(listD, loopControl, PyInt_FromLong((long)cListD[loopControl]));
        PyList_SetItem(listE, loopControl, PyInt_FromLong((long)cListE[loopControl]));
        PyList_SetItem(listF, loopControl, PyInt_FromLong((long)cListF[loopControl]));
    }
    componentList = PyList_New(6);
    PyList_SetItem(componentList, 0, listA); //componentList now owns listA..listF
    PyList_SetItem(componentList, 1, listB);
    PyList_SetItem(componentList, 2, listC);
    PyList_SetItem(componentList, 3, listD);
    PyList_SetItem(componentList, 4, listE);
    PyList_SetItem(componentList, 5, listF);
    pArgs = PyTuple_New(1);
    PyTuple_SetItem(pArgs, 0, componentList); //pArgs now owns componentList
    value = PyObject_CallObject(func, pArgs); //Call HW
    Py_DECREF(pArgs); //Releases the tuple's hold on the lists; no separate DECREF for them
    Py_DECREF(module); //dict and func are borrowed, so they are not DECREF'd
    if (value == NULL) { PyErr_Print(); Py_Finalize(); return 1; } //HW raised an exception
    sz = (int) PyList_Size(PyList_GetItem(value, 0));
    cResult = malloc(sz * sizeof(int));
    for (loopControl = 0; loopControl < sz; loopControl++) {
        cResult[loopControl] = (int) PyInt_AsLong(PyList_GetItem(PyList_GetItem(value, 0), loopControl));
        if (loopControl % 1000 == 0) printf("Value: %d\n", cResult[loopControl]);
    }
    free(cResult);
    Py_DECREF(value);
    Py_Finalize();
    return 0;
}
This doesn't do much more for you than what you are already doing... I would be more comfortable, personally, dynamically allocating memory in C on the heap than returning something of variable length from a function's stack. You do have to remember to free it, but you can do this multiple times without messing up your data if you do it that way.
The way I would handle it would just be to pass the data in whatever encapsulated form you use it elsewhere in your C program into a function, that will then tear it apart and stick it in to python objects. Call the python function, get the result, and unpack it from the python data structures into whatever you need back in C. I was concerned about the time it would take to generate the python objects, etc. but it seemed pretty negligible for the large number of ints I was working with. I doubt doubles will make this much worse.
If it makes you more comfortable your python interface function can call some other helper functions for performing specific tasks, like unboxing things from python.
-Lee
lee1210
Aug 15, 2008, 12:36 AM
One more thought, for efficiency considerations:
Does each "record" get processed by the Python routine individually? That's to say, are you acting 6000 times on 6000 items? Or do the results depend on interaction between the records?
If you are just processing each record independently, it would be easy to loop over each record, encapsulate it in python types, call a python function that processes the single record, return the result, and break the result back into C types. This way you can make a single loop. You'll have to remember to decref the python stuff before you assign a new object to them so you don't leak memory.
It doesn't seem like this will work for you, exactly, but I thought I'd mention it.
-Lee
Cromulent
Aug 15, 2008, 02:01 PM
One more thought, for efficiency considerations:
Does each "record" get processed by the Python routine individually? That's to say, are you acting 6000 times on 6000 items? Or do the results depend on interaction between the records?
Individually.
If you are just processing each record independently, it would be easy to loop over each record, encapsulate it in python types, call a python function that processes the single record, return the result, and break the result back into C types. This way you can make a single loop. You'll have to remember to decref the python stuff before you assign a new object to them so you don't leak memory.
Yeah, I completely forgot to decref the Python objects. Thanks for the reminder, I guess I'll have to look into how I'm going about this. At the moment the main function basically just calls the Python script from the start and then deals with the results, rather than setting up an adequate solution beforehand.
It doesn't seem like this will work for you, exactly, but I thought I'd mention it.
-Lee
Cheers for that. You've got me thinking about a few new possibilities.