Folding Accelerator?

Discussion in 'Distributed Computing' started by atszyman, Apr 27, 2005.

  1. atszyman macrumors 68020

    atszyman

    Joined:
    Sep 16, 2003
    Location:
    The Dallas 'burbs
    #1
    Ok, this is a bit out there but it ties in nicely with what I do for a living so I thought it would be an interesting idea to float.

    How much is known about the folding algorithm? I work pretty heavily in FPGAs and have become fairly proficient at VHDL and Verilog coding. FPGAs have a lot of powerful resources available in them now, and I have been working closely with Xilinx Virtex-4s, which claim to have DSP resources (multipliers/adders) that run at 500 MHz.

    My thought is that it might be possible to implement part, if not all, of the algorithm in an FPGA and offload the processor work to another piece of hardware. If the algorithm does a lot of multiplication-type operations, a proper implementation could yield huge speedups. Of course, I don't think anyone would be willing to buy the cards to run it on, which typically run in the $5k-$10k range, but there might be cheaper alternatives.

    Of course this would require time and motivation, and I am severely short on time these days. I also doubt anyone would be willing to spend that much on folding, so there is definitely a lack of motivation. Just thought it was an interesting idea.
     
  2. daveL macrumors 68020

    daveL

    Joined:
    Jun 18, 2003
    Location:
    Montana
    #2
    On the G4 and G5, the Folding cores are already highly optimized for Altivec which, of course, runs at the CPU clock rate. So, you'd have to get roughly 3x the parallelism of Altivec from your FPGA implementation (@ 500 MHz) simply to match the current G4 CPU's Folding performance. For G5s, you'd be looking at 5x. For the Stanford Folding team to invest the time to optimize the Folding cores to use the FPGA, I would think you would have to demonstrate at *least* a 2 to 3 fold performance boost beyond a high range G4 or low range G5. Even then, they'd want some idea that people would buy enough of your add-on boards to make a difference. You'd only be able to use it with systems that have an open PCI slot, as well. In short, I doubt it would fly.
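
    One way to read the 3x and 5x figures is as plain clock ratios against a 500 MHz FPGA. The sketch below just does that division. The CPU clocks (1.5 GHz for the G4, 2.5 GHz for the G5) are assumptions chosen to reproduce the figures in the post, and the function name is made up for illustration. It also assumes one FPGA cycle does the same work as one Altivec cycle, which is a simplification.

```python
def breakeven_parallelism(cpu_clock_mhz: float, fpga_clock_mhz: float = 500.0) -> float:
    """Factor by which the FPGA must out-parallelize Altivec to match the CPU.

    Assumes (simplification) that throughput scales linearly with clock rate.
    """
    return cpu_clock_mhz / fpga_clock_mhz


# Assumed clock speeds; the post does not state them.
g4_factor = breakeven_parallelism(1500.0)  # hypothetical 1.5 GHz G4
g5_factor = breakeven_parallelism(2500.0)  # hypothetical 2.5 GHz G5

print(f"G4 break-even: {g4_factor:.1f}x, G5 break-even: {g5_factor:.1f}x")
```

    Anything the FPGA delivers beyond those factors would be the actual gain over the CPU.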
     
  3. atszyman thread starter macrumors 68020

    atszyman

    Joined:
    Sep 16, 2003
    Location:
    The Dallas 'burbs
    #4
    I know that Altivec runs at the clock rate, but in one of these FPGAs we are talking about hundreds of multipliers that can be used in parallel and would be tasked with nothing but folding. So the comparison is between a massive number of parallel streams at 500 MHz with nothing interrupting them, and one stream running at the CPU clock rate that has the lowest priority on a CPU where you are doing everything else. I think you could get good gains provided that you could load all of the algorithm onto the FPGA and parallelize the heck out of it.
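
    As a rough illustration of that comparison, the sketch below estimates peak multiplies per second as units x clock. The numbers are assumptions: 200 multipliers stands in for "hundreds", the CPU clock is taken as 2.5 GHz, and Altivec is modeled as four 32-bit lanes per 128-bit vector. It ignores precision differences, memory stalls, and how well the algorithm actually parallelizes, so treat it as an upper bound rather than a prediction.

```python
def peak_mults_per_sec(units: int, clock_hz: float, utilization: float = 1.0) -> float:
    """Peak multiply throughput for `units` parallel multipliers at `clock_hz`."""
    return units * clock_hz * utilization


# Assumed figures, not measurements.
fpga = peak_mults_per_sec(units=200, clock_hz=500e6)   # "hundreds" of multipliers at 500 MHz
altivec = peak_mults_per_sec(units=4, clock_hz=2.5e9)  # 4 float lanes at an assumed 2.5 GHz

print(f"FPGA / Altivec peak ratio: {fpga / altivec:.1f}")
```

    Lowering `utilization` for the CPU side is one way to model Folding running at the lowest priority while other work is going on.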

    But as I said, I don't have time and demand definitely wouldn't be there at the current price point of the cards. I just thought it was an interesting idea.
     
  4. Dreadnought macrumors 68020

    Dreadnought

    Joined:
    Jul 22, 2002
    Location:
    Almere, The Netherlands
    #5
    A lot could be gained just by optimizing the folding app for the Mac, especially with tinkers. Right now they aren't optimized for the Mac, and a PC goes through a tinker a lot faster.
     
  5. daveL macrumors 68020

    daveL

    Joined:
    Jun 18, 2003
    Location:
    Montana
    #6
    Are you sure about the tinkers, Dreadnought? I have a dual 2 GHz Opteron server and a dual 2.5 GHz G5. I have been doing a lot of work on the G5, but the Opteron has been virtually dedicated to Folding (that will change). Anyway, I find the tinker completion times to be nearly the same on both machines, like +/- 5%. The Opteron is running Linux kernel 2.6.
     
  6. Dreadnought macrumors 68020

    Dreadnought

    Joined:
    Jul 22, 2002
    Location:
    Almere, The Netherlands
    #7
    That's what I mean: the G5 has 500 MHz more per processor and the time it takes is the same. With other apps (video, audio, etc.) the G5 blows everything away. We always have a drop in production when there is a tinker storm; other teams (PC teams) don't have that drop.
     
  7. daveL macrumors 68020

    daveL

    Joined:
    Jun 18, 2003
    Location:
    Montana
    #8
    But clock-for-clock, the Opteron and G5 are almost identical, with the Opteron having better memory access performance. As I said, I'm doing other work on the G5, so Folding isn't getting 100% of the G5, while the Opteron has been at virtually 100% for a number of weeks now. Anyway, it is what it is.
     
  8. daveL macrumors 68020

    daveL

    Joined:
    Jun 18, 2003
    Location:
    Montana
    #9
    This thread is likely dead, but I was thinking about this again yesterday, and the real problem is going to be PCI bus bandwidth. You'll have to load the operands into the FPGA and retrieve the results over the PCI or PCI-X bus, which isn't even close to the processor memory bus or processor cache in bandwidth or latency.
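
    To put rough numbers on the bus concern, the sketch below computes theoretical peak bandwidth as bus width x clock for nominal 32-bit/33 MHz PCI and 64-bit/133 MHz PCI-X, and the time to move a buffer across each. These are peak figures; sustained throughput is lower. The 5 MB buffer size is a hypothetical value picked only for illustration.

```python
def bus_peak_bytes_per_sec(width_bits: int, clock_hz: float) -> float:
    """Theoretical peak bandwidth: one transfer of `width_bits` per clock."""
    return width_bits / 8 * clock_hz


def transfer_seconds(n_bytes: float, width_bits: int, clock_hz: float) -> float:
    """Best-case time to move `n_bytes` across the bus (ignores latency and overhead)."""
    return n_bytes / bus_peak_bytes_per_sec(width_bits, clock_hz)


pci = bus_peak_bytes_per_sec(32, 33e6)     # roughly 132 MB/s
pci_x = bus_peak_bytes_per_sec(64, 133e6)  # roughly 1064 MB/s

buffer_bytes = 5 * 1024**2  # hypothetical 5 MB of operands
print(f"PCI:   {pci / 1e6:.0f} MB/s, {transfer_seconds(buffer_bytes, 32, 33e6) * 1e3:.1f} ms")
print(f"PCI-X: {pci_x / 1e6:.0f} MB/s, {transfer_seconds(buffer_bytes, 64, 133e6) * 1e3:.1f} ms")
```

    If operands have to cross the bus on every step of the computation, these transfer times add up quickly.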
     
  9. atszyman thread starter macrumors 68020

    atszyman

    Joined:
    Sep 16, 2003
    Location:
    The Dallas 'burbs
    #10
    Not if you have onboard memory. You could theoretically load the entire WU, process it on the card, then send it back to RAM for transmission. Almost no work for the CPU and possible gains through parallelization; it would be pretty slick if cheap enough. Heck, with a PCI(-X) form factor you could load up a G5's PCI-X slots and get a total of 5 folding units running in one box (2 CPUs + 3 PCI cards), and that's assuming only one FPGA/memory setup per PCI board.

    You would have to somehow periodically back up the data to disk, otherwise the cards would lose all of their data on power cycles, but that would require only a small amount of PCI bandwidth.
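
    A quick estimate of that checkpoint traffic, under assumed numbers: the sketch below averages one copy of the card's state over the checkpoint interval and compares it with the roughly 132 MB/s theoretical peak of 32-bit/33 MHz PCI. The 10 MB state size and 15 minute interval are hypothetical, not taken from the Folding client.

```python
PCI_PEAK_BYTES_PER_SEC = 32 / 8 * 33e6  # nominal 32-bit/33 MHz PCI, theoretical peak


def checkpoint_bandwidth(state_bytes: float, interval_s: float) -> float:
    """Average bytes per second needed to copy `state_bytes` once every `interval_s`."""
    return state_bytes / interval_s


# Hypothetical values for illustration only.
avg = checkpoint_bandwidth(state_bytes=10 * 1024**2, interval_s=15 * 60)
fraction = avg / PCI_PEAK_BYTES_PER_SEC

print(f"Average checkpoint traffic: {avg / 1024:.1f} KiB/s ({fraction:.6%} of PCI peak)")
```

    With numbers in that range the backup traffic is a tiny fraction of the bus, which fits the point that only the initial load and final result are large transfers.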
     
  10. mischief macrumors 68030

    mischief

    Joined:
    Aug 1, 2001
    Location:
    Santa Cruz Ca
    #11
    There've been a few of these but this is the most current I could find.

    If I had fifteen grand to blow on this I could deck out an old 9600 I have lying around...

    Then again, for that same fifteen thou (plus tax) I could buy sixteen maxed-out minis and implement an Xgrid cluster. *shrug*
     
  11. Dreadnought macrumors 68020

    Dreadnought

    Joined:
    Jul 22, 2002
    Location:
    Almere, The Netherlands
    #12
    That would be nice to put into my B&W: three cards or so, which would add 7 CPUs. That will kick Atszyman's butt! :D Although I prefer a couple of liquid-cooled G5s. I think that would be significantly faster and cheaper.
     
