Harvard/MIT Student Creates GPU Database, Hacker-Style 135

Posted by Unknown Lamer on Monday April 22, 2013 @07:21PM from the search-faster dept.

First time accepted submitter IamIanB writes "Harvard Middle Eastern Studies student Todd Mostak's first tangle with big data didn't go well; trying to process and map 40 million geolocated tweets from the Arab Spring uprising took days. So while taking a database course across town at MIT, he developed a massively parallel database that uses GeForce Titan GPUs to do the data processing. The system sees 70x performance increases over CPU-based systems, and can out crunch a 1000 node MapReduce cluster, in some cases. All for around $5,000 worth of hardware. Mostak plans to release the system under an open source license; you can play with a data set of 125 million tweets hosted at Harvard's WorldMap and see the millisecond response time." I seem to recall a dedicated database query processor that worked by having a few hundred really small processors that was integrated with INGRES in the '80s.

Harvard/MIT Student Creates GPU Database, Hacker-Style

This discussion has been archived. No new comments can be posted.

Search 135 Comments Log In/Create an Account

Comments Filter:

Re:I'm not a computer scientist, and... (Score:5, Informative)

by gubon13 ( 2695335 ) writes: on Monday April 22, 2013 @07:36PM (#43520659)

Sort of a lazy effort on my part to not summarize, but here's a great explanation: https://en.bitcoin.it/wiki/Why_a_GPU_mines_faster_than_a_CPU [bitcoin.it].

Re:I'm not a computer scientist, and... (Score:5, Informative)

by PhamNguyen ( 2695929 ) writes: on Monday April 22, 2013 @07:40PM (#43520687)

GPUs are much faster for code that can be parallelized (basically this means having many cores doing the same thing, but on different data). However there is a signficant complexity in isolating hte parts of the code that can be done in parallel. Additionally, there is a cost to moving data to the GPU's memory, and also from the GPU memory to the GPU cores. CPU's on the other hand, have a cache architecture that means that much of the time, memory access is extremely fast.
Given progress in the last 10 years, the set of algorithms that can be parallelized is very large. So the GPU advantage should be overwhelming. The main issue is that the complexity writing a program that does things on the GPU is much higher.

Re:That Didn't Take Long: Database Down For Maint. (Score:5, Informative)

by tmostak ( 2904307 ) writes: on Monday April 22, 2013 @07:59PM (#43520839)

Hi... MapD creator here... this is the first time we've been seriously load tested, and I realize I might have a "locking" bug that's creating a deadlock when people hit the server at the exact same time. Todd

Re:I'm not a computer scientist, and... (Score:5, Informative)

by gatkinso ( 15975 ) writes: on Monday April 22, 2013 @08:01PM (#43520857)

This is a gross simplification, glossing over the details and not correct in some aspects... but close enough.
SIMD - single instruction multiple data. If you have thousands or millions of elements/records/whatever that all require the exact same processing (gee, say like a bunch of polygons being rotated x radians perhaps????) then this data can all be arranged into a bitmap and loaded onto the GPU at once. The GPU then performs the same operation on your data elements simultaneously (simplification). You then yank off the resultant bitmap and off you go. CPU arranges data, loads and unloads the data. GPU crunches it.
A CPU would have to operate on each of these elements serially.
Think of it this way - you are making pennies. GPU takes a big sheet of copper and stamps out 10000 pennies at a time. CPU takes a ribbon of copper and stamps out 1 penny at a time... but each iteration of the CPU is much faster than each iteration of the GPU. Perhaps the CPU can perform 7000 cycles per second, but the GPU can only perform 1 cycle per second. At the end of that second... the GPU produced 3000 more pennies than the CPU.
Some problem sets are not SIMD in nature. Lot's of branhcing or relienace on the value of neighboring elements. This will slow the GPU processing down insanely. FPGA is far better (and more expensive, and more difficult to program) than GPU for this. CPU is better as well.

Re:sounds like... (Score:5, Informative)

by tmostak ( 2904307 ) writes: on Monday April 22, 2013 @08:04PM (#43520879)

Hi, MapD creator here - and I have to disagree with you. The database ultimately stores everything on disk, but it caches what it can in GPU memory and performs all the computation there. So all the SQL operations are occurring on the GPU, after which, in case of the tweetmap demo, the results are rendered to a texture before being sent out as a png. But it works equally well as a traditional database - it doesn't do the whole SQL standard yet but can handle aggregations, joins, etc just like a normal database, just much faster. Todd

Re:PostgreSQL used GPU 2 years ago (Score:5, Informative)

by tmostak ( 2904307 ) writes: on Monday April 22, 2013 @08:14PM (#43520935)

The 70X is actually highly conservative - and this was benched against an optimized parallelized main-memory (i.e. not off of disk) CPU version, not say MySQL. On things like rendering heatmaps, graph query operations, or clustering you can get 300-500X speedups. The database caches what it can in GPU memory (could be 128GB on one node if you have 16 GPUs) and only sends back a bitmap of the results to be joined with data sitting in CPU memory. But yeah, if the data's not cached, then it won't be this fast. That's true, a lot of work has been done on GPU database processing - this is a bit different I think b/c it runs on multiple GPUs and b/c it tries to cache what it can on the GPU. Todd (MapD creator)

Re:I'm not a computer scientist, and... (Score:5, Informative)

by Morpf ( 2683099 ) writes: on Monday April 22, 2013 @08:19PM (#43520965)

Close, but not quite correct.
The point is GPUs are fast doing the same operation on multiple data. (e.g. multiplying a vector with a scalar) The emphasize is on _same operation_, which might not be the case for every problem one can solve parallel. You will loose speed as soon your elements of a wavefront (e.g. 16 threads, executed in lockstep) diverge into multiple execution paths. This happens if you have something like an "if" in your code and one for one work item the condition is evaluated to true and for another it's evaluated to false. Your wavefront will only be executed one path at a time, so your code becomes kind of "sequential" at this point. You will loose speed, too, if the way you access your GPU memory does not fulfill some restrictions. And by the way: I'm not speaking about some mere 1% performance loss but quite a number. ;) So generally speaking: not every problem one can solve in parallel can be efficiently solved by a GPU.
There is something similar to caches in OpenCL: it's called local data storage, but it's the programmers job to use them efficiently. Memory access is always slow if it's not registers you are accessing, be it CPU or GPU. When using a GPU you can hide part of the memory latency by scheduling way more threads than you can physically run and always switch to those who aren't waiting for memory. This way you waste less cycles waiting for memory.
I support your view writing for GPU takes quite a bit of effort. ;)

Re:I'm not a computer scientist, and... (Score:5, Informative)

by UnknownSoldier ( 67820 ) writes: on Monday April 22, 2013 @08:24PM (#43520991)

If one woman can have a baby in 9 months, then 9 women can have a baby in one month, right?
No.
Not every task can be run in parallel.
Now however if your data is _independent_ then you can distribute the work out to each core. Let's say you want to search 2000 objects for some matching value. On a 8-core CPU you would need 2000/8 = 250 searches. On the Titan each core could process 1 object.
There are also latency vs bandwidth issues, meaning it takes time to transfer the data from RAM to the GPU, process, and transfer the results back, but if the GPU's processing time is vastly less then the CPU, you can still have HUGE wins.
There are also SIMD / MIMD paradigms which I won't get into, but basically in layman's terms means the SIMD is able to process more data in the same amount of time.
You may be interested in reading:
http://perilsofparallel.blogspot.com/2008/09/larrabee-vs-nvidia-mimd-vs-simd.html [blogspot.com]
http://stackoverflow.com/questions/7091958/cpu-vs-gpu-when-cpu-is-better [stackoverflow.com]
When your problem domain & data are able to be run in parallel then GPU's totally kick a CPU's in terms of processing power AND in price. i.e.
An i7 3770K costs around $330. Price/Core is $330/8 = $41.25/core
A GTX Titan costs around $1000. Price/Core is $1000/2688 = $0.37/core
Remember computing is about 2 extremes:
Slow & Flexible < - - - > Fast & Rigid
CPU (flexible) vs GPU (rigid)
* http://www.newegg.com/Product/Product.aspx?Item=N82E16819116501 [newegg.com]
* http://www.newegg.com/Product/Product.aspx?Item=N82E16814130897 [newegg.com]

Re:sounds like... (Score:5, Informative)

by tmostak ( 2904307 ) writes: on Monday April 22, 2013 @08:54PM (#43521149)

So I use postgres all the time, but MapD isn't built on Postgres, it actually stores its own data on disk in column-form in (I admit crude) memory-mapped files. I have written a Postgres connector that connects MapD to Postgres though since I use postgres to store the tweets I harvest for long-term archiving. The connector uses pqxx (the C++ Postgres library). Todd

Re:sounds like... (Score:5, Informative)

by tmostak ( 2904307 ) writes: on Monday April 22, 2013 @08:58PM (#43521179)

I'm not using thrust - I rolled my own hash join algorithm. This is something I still haven't optimized a great deal and I'm sure your stuff runs much better. Would love to talk. Just contact me on Twitter (@toddmostak) and I'll give you my contact details. Todd

Re:That Didn't Take Long: Database Down For Maint. (Score:5, Informative)

by tmostak ( 2904307 ) writes: on Monday April 22, 2013 @10:27PM (#43521611)

Har har... Well things got tricky when I wrote the code to support streaming inserts (not implemented in the current map) so you could view tweets or whatever else as they came in - this required a lot of fine-grained locking. May just bandaid this and give locks to connections as they come in until I can figure out what's going on. Todd

Re:I'm not a computer scientist, and... (Score:5, Informative)

by PhamNguyen ( 2695929 ) writes: on Tuesday April 23, 2013 @02:21AM (#43522551)

What you are describing is GPU computing 5 to 10 years ago. Now, (1) you don't wrote shaders you write kernels. (2) a GPU can do most of the functions of a CPU, the difference is in things like branch prediction and caching. (3) threads execute in blocks of 16 or some other round number. There is no performance loss as long as all threads in the same block take the same execution path.

Re:Large datasets are mostly IO limited (Score:4, Informative)

by tmostak ( 2904307 ) writes: on Tuesday April 23, 2013 @02:22AM (#43522555)

Hi - MapD creator here. Agreed, GPUs aren't going to me of much use if you have petabytes of data and are I/O bound, but what I think unfortunately gets missed in the rush to indiscriminately throw everything into the "big data bucket" is that a lot of people do have medium-sized (say 5GB-500GB) datasets that they would like to query, visualize and analyze in an iterative, real-time fashion, something that existing solutions won't allow you to do (even big clusters often incur enough latency to make real-time analysis difficult).
And then you have super-linear algorithms like graph processing, spatial joins, neural nets, clustering, rendering blurred heatmaps which do really well on the GPU, which the formerly memory bound speedup of 70X turns into 400-500X. Particularly since databases are expected to do more and more viz and machine learning, I don't think these are edge cases
Finally, although GPU memory will always be more expensive (but faster) than CPU memory, MapD already can run on a 16-card 128GB GPU ram server, and I'm working on a multi-node distributed implementation where you could string many of these together. So having a terabyte of GPU RAM is not out of the question, which, given the column-store architecture of the db can be used more efficiently by caching only the necessary columns in memory. Of course it will cost more, but for some applications the performance benefits may be worth it.
I just think people need to realize that different problems need different solutions, and just b/c a system is not built to handle a petabyte of data doesn't mean its not worthwhile.

Re:Two thoughts based on this story (Score:2, Informative)

by Anonymous Coward writes: on Tuesday April 23, 2013 @04:33AM (#43522935)

Drop the "the", just FB, it's cleaner

There may be more comments in this discussion. Without JavaScript enabled, you might want to turn on Classic Discussion System in your preferences instead.

Harvard/MIT Student Creates GPU Database, Hacker-Style 135

Harvard/MIT Student Creates GPU Database, Hacker-Style More Login

Harvard/MIT Student Creates GPU Database, Hacker-Style

Re:I'm not a computer scientist, and... (Score:5, Informative)

Re:I'm not a computer scientist, and... (Score:5, Informative)

Re:That Didn't Take Long: Database Down For Maint. (Score:5, Informative)

Re:I'm not a computer scientist, and... (Score:5, Informative)

Re:sounds like... (Score:5, Informative)

Re:PostgreSQL used GPU 2 years ago (Score:5, Informative)

Re:I'm not a computer scientist, and... (Score:5, Informative)

Re:I'm not a computer scientist, and... (Score:5, Informative)

Re:sounds like... (Score:5, Informative)

Re:sounds like... (Score:5, Informative)

Re:That Didn't Take Long: Database Down For Maint. (Score:5, Informative)

Re:I'm not a computer scientist, and... (Score:5, Informative)

Re:Large datasets are mostly IO limited (Score:4, Informative)

Re:Two thoughts based on this story (Score:2, Informative)

Related Links Top of the: day, week, month.

Slashdot Top Deals

Slashdot