Harvard/MIT Student Creates GPU Database, Hacker-Style
First time accepted submitter IamIanB writes "Harvard Middle Eastern Studies student Todd Mostak's first tangle with big data didn't go well; trying to process and map 40 million geolocated tweets from the Arab Spring uprising took days. So while taking a database course across town at MIT, he developed a massively parallel database that uses GeForce Titan GPUs to do the data processing. The system sees 70x performance increases over CPU-based systems and, in some cases, can out-crunch a 1,000-node MapReduce cluster. All for around $5,000 worth of hardware. Mostak plans to release the system under an open source license; you can play with a data set of 125 million tweets hosted at Harvard's WorldMap and see the millisecond response times."
I seem to recall a dedicated database query processor from the '80s that worked by having a few hundred really small processors and was integrated with INGRES.
Two thoughts based on this story (Score:5, Interesting)
1. Facebook would like to have a discussion with him.
2. The FBI would like to have a discussion with him.
Re: (Score:3)
1. Facebook would like to have a discussion with him.
2. The FBI would like to have a discussion with him.
Sadly, I offer one more thought...
3. Some patent attorney from East Texas would like to have a discussion with him.
Re: (Score:2, Informative)
Drop the "the", just FB, it's cleaner
I'm not a computer scientist, and... (Score:1)
I want to know: why are GPUs so much better at some tasks than CPUs? And why aren't they used more often if they are orders of magnitude faster?
Thanks.
Re:I'm not a computer scientist, and... (Score:4, Insightful)
Sprinters can run really fast. So, if speed is important in other sports, why aren't the other sports full of sprinters? Because being good at one thing doesn't mean you're well-suited to do everything. A sprinter who can't throw a ball is going to be terrible at a lot of sports.
Re:I'm not a computer scientist, and... (Score:5, Informative)
Re:I'm not a computer scientist, and... (Score:5, Informative)
GPUs are much faster for code that can be parallelized (basically this means having many cores doing the same thing, but on different data). However, there is significant complexity in isolating the parts of the code that can be done in parallel. Additionally, there is a cost to moving data to the GPU's memory, and also from the GPU memory to the GPU cores. CPUs, on the other hand, have a cache architecture that means that much of the time, memory access is extremely fast.
Given progress in the last 10 years, the set of algorithms that can be parallelized is very large, so the GPU advantage should be overwhelming. The main issue is that the complexity of writing a program that does things on the GPU is much higher.
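A minimal CUDA sketch of that point, with made-up names (scaleArray, h_data); this is an illustration under those assumptions, not code from the article. The kernel itself is trivially parallel, but the explicit copies to and from GPU memory are part of the cost and can dominate for small workloads:

#include <cuda_runtime.h>

// Each thread scales one element: same instruction, different data.
__global__ void scaleArray(float *data, int n, float factor)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

int main(void)
{
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);

    float *h_data = new float[n];
    for (int i = 0; i < n; ++i) h_data[i] = float(i);

    float *d_data;
    cudaMalloc((void **)&d_data, bytes);

    // These two transfers are the overhead the comment above is talking about.
    cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice);
    scaleArray<<<(n + 255) / 256, 256>>>(d_data, n, 2.0f);
    cudaMemcpy(h_data, d_data, bytes, cudaMemcpyDeviceToHost);

    cudaFree(d_data);
    delete[] h_data;
    return 0;
}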
Re: (Score:3)
>> The main issue is that the complexity of writing a program that does things on the GPU is much higher.
Not so much. There is programming overhead, but it isn't too bad.
Re: (Score:2, Insightful)
Yes, it is that bad. Not only is it extremely platform-specific, the toolchains are crap. We're just now transitioning from "impossible to debug" to "difficult to debug".
Re: (Score:2)
I guess we all have our strengths and weaknesses.
Re: (Score:2)
Get back to us when you've written a GPU-powered database.
Re: (Score:2)
No GPU powered database - just an H.264 codec, real time LIDAR data processor, and several GIS data processing modules.
GPGPU development (I use CUDA which makes it even easier) really isn't that hard. But then again I would expect a person with a high six digit user id to be challenged by the fundamentals.
Re: (Score:2)
Colour me curious. I've never met a programmer who's capable of writing an H.264 codec but still has the arrogance of a sophomore. Do you have a link to your project?
Re:I'm not a computer scientist, and... (Score:5, Informative)
Close, but not quite correct.
The point is GPUs are fast at doing the same operation on multiple data (e.g. multiplying a vector by a scalar). The emphasis is on _same operation_, which might not be the case for every problem one can solve in parallel. You will lose speed as soon as the elements of a wavefront (e.g. 16 threads, executed in lockstep) diverge into multiple execution paths. This happens if you have something like an "if" in your code and for one work item the condition evaluates to true while for another it evaluates to false. Your wavefront will only execute one path at a time, so your code becomes kind of "sequential" at this point. You will lose speed, too, if the way you access your GPU memory does not fulfill certain restrictions. And by the way: I'm not speaking about some mere 1% performance loss but quite a number. ;) So generally speaking: not every problem one can solve in parallel can be efficiently solved by a GPU.
There is something similar to caches in OpenCL: it's called local data storage, but it's the programmer's job to use it efficiently. Memory access is always slow if it's not registers you are accessing, be it CPU or GPU. When using a GPU you can hide part of the memory latency by scheduling way more threads than you can physically run and always switching to those that aren't waiting for memory. This way you waste fewer cycles waiting for memory.
I support your view that writing for the GPU takes quite a bit of effort. ;)
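A small CUDA sketch of that divergence point, with hypothetical names (divergentKernel, in, out); NVIDIA groups 32 threads per warp rather than the 16/64-wide wavefronts described above, but the effect is the same. When a branch splits a group, the two paths run one after the other with the non-participating threads masked off:

__global__ void divergentKernel(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    // Threads in the same warp take different paths here depending on their data,
    // so the warp executes the "expensive" path and the "cheap" path serially.
    if (in[i] > 0.0f) {
        out[i] = sqrtf(in[i]) * 2.0f;   // "expensive" path
    } else {
        out[i] = 0.0f;                  // "cheap" path
    }
}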
Re: (Score:2)
Re: (Score:1)
Yes, you will somehow need to branch at some point (well, at least I can't think of a way without branching), but not every branch makes your program crawl like a snail. The amount of work done inside the branches really does matter. If you can't avoid branching, try to do as little as possible in the branches. ;)
I for one would write the current position of a "hit" into the same position of a second array. Otherwise write a zero. So your branches are quite minimal:
if (hit)
    secondArray[i] = i;   /* i = this work item's position */
else
    secondArray[i] = 0;
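A fuller CUDA sketch of that idea (hypothetical kernel and variable names, not the parent's code); both branches do almost no work, so divergence costs very little, and a later compaction pass can gather the non-zero entries:

__global__ void markHits(const int *input, int *secondArray, int n, int target)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;   // guard against running past the end of the arrays

    // Every thread writes exactly one value: its own position on a hit, 0 otherwise.
    secondArray[i] = (input[i] == target) ? i : 0;
}

Launched as, say, markHits<<<(n + 255) / 256, 256>>>(d_input, d_secondArray, n, 42).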
Re: (Score:2)
Re: (Score:2)
Close, but not quite correct.
The point is GPUs are fast at doing the same operation on multiple data (e.g. multiplying a vector by a scalar). The emphasis is on _same operation_, which might not be the case for every problem one can solve in parallel. You will lose speed as soon as the elements of a wavefront (e.g. 16 threads, executed in lockstep) diverge into multiple execution paths. This happens if you have something like an "if" in your code and for one work item the condition evaluates to true while for another it evaluates to false. Your wavefront will only execute one path at a time, so your code becomes kind of "sequential" at this point. You will lose speed, too, if the way you access your GPU memory does not fulfill certain restrictions.
I'm not an expert on this subject, but with hundreds, or even thousands, of GPU cores, why not just run the calculation for all cases of an if/then and then toss out the ones that don't pan out? It's not a very efficient way to do things, but it could work.
I believe this is the principle of quantum computing. Process all possible answers simultaneously and then pick the right one(s).
Re: (Score:1)
Actually this is what is done on GPUs. Think of it this way: you have a number of "processors" which share one control flow. The number of "processors" sharing one control flow on an AMD 79xx is 64. Now all these "processors" evaluate the if-statement. If it's true for some and false for other "processors", then both paths are executed sequentially. Those "processors" which would normally not run (because they belong to the other branch) are masked, so they don't execute the instructions. If the if-statement evaluates the same way for all "processors" in the group, only that one path is executed.
Re: (Score:3, Insightful)
Parallelization is not why GPUs are fast; it's a side effect of rendering pixels, nothing more.
GPUs are fast because they do an extremely limited number of things REALLY REALLY fast, and when you're doing graphics ... well guess what, it's all pretty much doing those few things the GPU does well over and over again, per pixel (or vertex). They are parallelized because those simple, super fast processors are also small from a chip perspective, so stuffing a ton of them on the chip so it can do many pixels in parallel is cheap.
Re:I'm not a computer scientist, and... (Score:5, Informative)
Re: (Score:1)
Actually parallelization IS why GPUs are fast. You have some restrictions but it's the parallel execution which gives you the boost in performance.
The things a GPU can do are not so limited as you might think.
The statement about "if" pausing all processors is wrong. On my card 64 work items are executed in lockstep on 16 processors in something called a wavefront, and I have many more processors on the card. Furthermore, only when the if-statement in the control flow evaluates true for some work items and false for others are both paths executed.
Re:I'm not a computer scientist, and... (Score:5, Informative)
This is a gross simplification, glossing over the details and not correct in some aspects... but close enough.
SIMD - single instruction multiple data. If you have thousands or millions of elements/records/whatever that all require the exact same processing (gee, say like a bunch of polygons being rotated x radians perhaps????) then this data can all be arranged into a bitmap and loaded onto the GPU at once. The GPU then performs the same operation on your data elements simultaneously (simplification). You then yank off the resultant bitmap and off you go. CPU arranges data, loads and unloads the data. GPU crunches it.
A CPU would have to operate on each of these elements serially.
Think of it this way - you are making pennies. GPU takes a big sheet of copper and stamps out 10000 pennies at a time. CPU takes a ribbon of copper and stamps out 1 penny at a time... but each iteration of the CPU is much faster than each iteration of the GPU. Perhaps the CPU can perform 7000 cycles per second, but the GPU can only perform 1 cycle per second. At the end of that second... the GPU produced 3000 more pennies than the CPU.
Some problem sets are not SIMD in nature: lots of branching or reliance on the value of neighboring elements. This will slow the GPU processing down insanely. An FPGA is far better (and more expensive, and more difficult to program) than a GPU for this. A CPU is better as well.
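To make the SIMD idea concrete, here is a minimal CUDA sketch (names like rotatePoints, xs, ys are illustrative only, not from any real codebase): one kernel applies the identical rotation to every point at once, where the CPU version visits them one per loop iteration.

#include <math.h>

// GPU: every thread rotates one point by the same angle (same instruction, different data).
__global__ void rotatePoints(float *xs, float *ys, int n, float radians)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float c = cosf(radians), s = sinf(radians);
    float x = xs[i], y = ys[i];
    xs[i] = x * c - y * s;
    ys[i] = x * s + y * c;
}

// CPU: the same work done serially, one element per loop iteration.
void rotatePointsSerial(float *xs, float *ys, int n, float radians)
{
    float c = cosf(radians), s = sinf(radians);
    for (int i = 0; i < n; ++i) {
        float x = xs[i], y = ys[i];
        xs[i] = x * c - y * s;
        ys[i] = x * s + y * c;
    }
}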
Re: (Score:1)
not sure if i'm right, but i tend to think of any gpu-based application as having to structure its data like pixels on a screen or image (since that's what gpus are primarily designed to handle)
a cpu treats each pixel separately, whereas a gpu can process multiple pixels simultaneously
the problem comes about if you try to feed data into a gpu that isn't like pixels
do the programming difficulties come from trying to trick the gpu into thinking it's processing pixels even though it may be processing bitcoin algorithms etc?
Re: (Score:1)
Well, you don't have to trick the GPU into thinking it processes pixels. You can do general purpose computation with a language quite similar to C99.
You are right in the sense that you partition your problem into many subelements. In OpenCL those are called work items. But those are more like identical threads than pixels. Sometimes one maps the work items onto a 2D or 3D grid if the problem domain fits (e.g. image manipulation, physics simulation).
Actually it's not that hard implementing "normal" algorithms on a GPU.
Re:I'm not a computer scientist, and... (Score:5, Informative)
If one woman can have a baby in 9 months, then 9 women can have a baby in one month, right?
No.
Not every task can be run in parallel.
Now however if your data is _independent_ then you can distribute the work out to each core. Let's say you want to search 2000 objects for some matching value. On an 8-core CPU each core would need to do 2000/8 = 250 searches. On the Titan each core could process 1 object.
There are also latency vs bandwidth issues, meaning it takes time to transfer the data from RAM to the GPU, process it, and transfer the results back, but if the GPU's processing time is vastly less than the CPU's, you can still have HUGE wins.
There are also SIMD / MIMD paradigms which I won't get into, but basically in layman's terms it means that SIMD is able to process more data in the same amount of time.
You may be interested in reading:
http://perilsofparallel.blogspot.com/2008/09/larrabee-vs-nvidia-mimd-vs-simd.html [blogspot.com]
http://stackoverflow.com/questions/7091958/cpu-vs-gpu-when-cpu-is-better [stackoverflow.com]
When your problem domain & data are able to be run in parallel, then GPUs totally kick CPUs in terms of processing power AND in price. i.e.
An i7 3770K costs around $330. Price/Core is $330/8 = $41.25/core
A GTX Titan costs around $1000. Price/Core is $1000/2688 = $0.37/core
Remember computing is about 2 extremes:
Slow & Flexible < - - - > Fast & Rigid
CPU (flexible) vs GPU (rigid)
* http://www.newegg.com/Product/Product.aspx?Item=N82E16819116501 [newegg.com]
* http://www.newegg.com/Product/Product.aspx?Item=N82E16814130897 [newegg.com]
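A minimal CUDA sketch of that independent 2000-object search, with hypothetical names (findMatches, d_values); each thread checks exactly one element and matches are tallied with an atomic add:

__global__ void findMatches(const int *values, int n, int target, unsigned int *matchCount)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && values[i] == target) {
        // Each thread inspects one element; only hits touch the shared counter.
        atomicAdd(matchCount, 1u);
    }
}

With 256 threads per block: findMatches<<<8, 256>>>(d_values, 2000, 42, d_count). Every element is examined in one wave instead of 250 per core.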
Re: (Score:3, Funny)
Now however if your data is _independent_ then you can distribute the work out to each core.
Let me translate this into a woman-baby analogy: if one woman can have a baby in 9 months, then 9 women can have 9 babies in 9 months. At first the challenge is juggling the timing of dates and dividing the calendar so the conception events are as near to each other as possible, to keep up the efficiency and synchronization. Afterwards the challenge is the alimony, paying for college, and particularly Thanksgiving, when the fruits of the labor come together.
Re: (Score:2)
If one woman can have a baby in 9 months, then 9 women can have a baby in one month, right?
No.
You're wrong, otherwise we'd need close to 130 million months per year. Furthermore, the 9 women have their 9 babies after ~9 months, yielding an average production rate of 1 bpm (one baby per month) from this group of women, if kept perpetually pregnant. If we put 90 women in the baby farm they will produce TEN Babies Per Month.
Some people's kids, I swear. They must have botched the batch of logic circuits in your revision; this is Matrixology 101.
Re:I'm not a computer scientist, and... (Score:4, Insightful)
I think you totally missed his point -- tin whiskers on your circuit board? Blown caps?
The fact that 9 women can have 9 babies in 9 months for an average rate of 1/mo, does not disprove the assertion 9 women cannot have __a__ (i.e. a single) baby in one month. You're talking about something totally different and being awfully smug about it to boot.
Re: (Score:2)
What if you wanted only one baby? You'd still have to wait nine months, no matter how many women are involved.
Re: (Score:2)
If it is self sufficient from the parent - then that could be 1 every 30 years. So you draw the line where you want on this one.
Re: (Score:1)
"When your problem domain & data are able to be run in parallel then GPU's totally kick a CPU's in terms of processing power AND in price. i.e.
An i7 3770K costs around $330. Price/Core is $330/8 = $41.25/core
A GTX Titan costs around $1000. Price/Core is $1000/2688 = $0.37/core"
That's a very unfair comparison. For one thing, the i7-3770K has SIMD as well (8-wide AVX). A better comparison is maximum GFLOPS.
The max for the 3770K is frequency * 8 (AVX) * 2 (simultaneous multiplication and addition) * 4 cores.
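Working that formula through (assuming the stock 3.5 GHz base clock for the i7-3770K and the commonly quoted ~4.5 TFLOPS single-precision peak for the GTX Titan; both are ballpark, theoretical numbers):

#include <stdio.h>

int main(void)
{
    // i7-3770K theoretical FP32 peak: clock * AVX width * (separate mul + add ports) * cores
    double cpu_gflops = 3.5 * 8 * 2 * 4;   // = 224 GFLOPS
    double gpu_gflops = 4500.0;            // GTX Titan single-precision peak, roughly

    printf("CPU peak: %.0f GFLOPS\n", cpu_gflops);
    printf("GPU/CPU:  %.0fx on raw FLOPS\n", gpu_gflops / cpu_gflops);   // ~20x
    return 0;
}

So on raw theoretical FLOPS the gap is closer to 20x than the per-core price math suggests.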
Re: (Score:2)
If one woman can have a baby in 9 months, then 9 women can have a baby in one month, right?...
I'm sorry, this is slashdot. You must include an obligatory car analogy to get your point across.
Re: (Score:2)
Yeah, it is not exactly an apples-to-apples comparison. I really should have specified that.
Thanks for the extra details. I'm not sure where you are getting 6 x 32-bit from, though. Are you confusing it with the 6 ROPs?
http://images.anandtech.com/doci/6760/GK110_Block_Diagram_FINAL2.png [anandtech.com]
I just got my Titan and was kind of surprised to see the 14 SMXes myself. Each SMX has 192 FP32 cores (12*16) which explains where the odd 2688 comes from: 14*12*16 = 14*192 = 2688.
Re: (Score:2)
They do ONE thing well: floating point ops. EVERYTHING ELSE THEY SUCK AT, including simple logic checks; if statements are painfully, mind-numbingly slow on the GPU.
Same reason you can buy a $99 supercomputer (Score:1)
They're massively more parallel, running many more smaller simpler cores.
It's the same reason these guys can make a 16-core parallel computer for $99: the cores are focused on their job, so they can be smaller and cheaper and more of them can be put on a die.
http://www.kickstarter.com/projects/adapteva/parallella-a-supercomputer-for-everyone/
So these guys can run 8 daughter boards, with 64 cores per board, for 512 cores, and it looks like they plan on scaling to 4096 cores because they use the top 12 bits of the address to identify the core.
Re: (Score:1)
I want to know: why are GPUs so much better at some tasks than CPUs? And why aren't they used more often if they are orders of magnitude faster?
Thanks.
I'm glad you put the preface in there, because it's basic comp. sci.
Re: (Score:1)
Re: (Score:2)
Part of the answer is the "magic" of matrix math. You can represent multiple linear equations, one in each row of a matrix, and when you apply one operation (add, multiply, etc.) you perform it on all of the encoded equations inside the matrix at once.
If you can, for example, represent your problem as a linear equation (algebra), then you can also formulate 50 similar equations. Say you want to "transform" all your equations by some operation (let's say divide by 20); instead of calculating that operation 50 times, once for every equation, you apply it once to the whole matrix.
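As an illustrative sketch (hypothetical names, not the parent's actual code): a matrix whose rows each hold one equation's coefficients can be "divided by 20" with a single elementwise kernel instead of 50 separate loops.

// Scale every coefficient of every encoded equation by the same factor.
// A is stored row-major: rows = equations, cols = coefficients (plus the constant term).
__global__ void scaleEquations(float *A, int rows, int cols, float factor)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < rows * cols)
        A[idx] *= factor;   // one thread per coefficient, same operation everywhere
}

For the 50-equation example: scaleEquations<<<(50 * cols + 255) / 256, 256>>>(d_A, 50, cols, 1.0f / 20.0f).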
That Didn't Take Long: Database Down For Maint. (Score:3)
Slashdotted? I happened to catch the story just as it went live, and hit the link to the service. After scrolling the map and getting a couple of updates: Database is down for maintenance. The front end may not be as high performance as the back... or it may have been coincidence.
Re:That Didn't Take Long: Database Down For Maint. (Score:5, Informative)
Re: (Score:2)
Well since this is apparently from the guy who the article is talking about, perhaps someone could mod it up just a bit?
No points here
Re: (Score:2)
Is that why it's faster? It doesn't do any synchronization?
Re:That Didn't Take Long: Database Down For Maint. (Score:5, Informative)
Re: (Score:2, Interesting)
Re: (Score:1)
You mean, if Open Source isn't magic, it's bullshit? Way to straw man.
Re: (Score:1)
No, he means (and said) if Open Source isn't magic, claiming Open Source is magic is bullshit.
and the most amazing thing (Score:4, Funny)
as the TFS states he uses GPUs to do the data processing, but you are never going to believe what he uses to store the actual data, you won't believe it, that's why it's not mentioned in TFS. Sure sure, it's PostgreSQL, but the way the data was stored physically was in the computer monitor itself. Yes, he punched holes in computer monitors with a chisel and used punch card readers to read those holes from the screens.
Re:and the most amazing thing (Score:5, Funny)
Mod parent up!
Also: I heard he's using the printer port for communication. By spooling tractor feed paper between two printers in a loop, and by stopping and starting simultaneous paper-feed jobs, he can create a cybernetic feedback between the two printers that results in a series of quickly occurring "error - paper jam" messages that (due to two taped-down "reset" buttons) are quickly translated from the wide bandwidth analog physical matrix into kajamabits of digital codes. The perceived bandwidth gain is much higher than just a single one or zero at a time.
That way, he can access the mainframe any time, from any physical location, and it will translate directly into a virtual presence.
Re: (Score:1)
They don't get it. He solved the speed of processing and the lack of long-term durability of storage by doing what's described in the original comment... Worked like a charm without needing to rethink the entire problem of a single bus used to retrieve and store data on physical storage that still accesses data serially.
Re: (Score:1)
By spooling tractor feed paper between two printers in a loop, and by stopping and starting simultaneous paper-feed jobs, he can create a cybernetic feedback between the two printers that results in a series of quickly occurring "error - paper jam" messages that (due to two taped-down "reset" buttons) are quickly translated from the wide bandwidth analog physical matrix into kajamabits of digital codes
i would be really careful doing that... the system may become self-aware
sounds like... (Score:2, Redundant)
It sounds like he's doing standard GPU computations, loading everything into memory, and then calling it a "database", even though it really isn't a "database" in any traditional sense.
Re:sounds like... (Score:5, Informative)
Re:sounds like... (Score:4, Interesting)
Re:sounds like... (Score:5, Informative)
Re: (Score:2)
Re: (Score:1)
I'd be very interested to hear more details about the GPU SQL algorithms (JOIN in particular) if you are willing to share them. Did you use the set operations
in Thrust or did you write something custom?
Some of my colleagues are planning on releasing an open source library and some online tutorials about hash join and sort merge join in CUDA, and I would be very interested to share notes.
Re:sounds like... (Score:5, Informative)
Re: (Score:2)
I wonder what it would mean for the data if you were to lossily compress that PNG...
Re: (Score:2)
Horrible things, probably. Good thing PNG is lossless compression...
Re: (Score:2)
So, it sounds like you're implementing SQL as a data analytics language for in-memory data (plus a bunch of potentially useful algorithms), but apparently without the features that usually make a database a "database", like persistence, transactions, rollbacks, etc. It's those other features that make real databases slow, which is why you can't claim huge speedups over "databases" since you're not implementing the same thing.
Data analytics on GPUs is a great thing, which is why tons of people are doing it.
PostgreSQL used GPU 2 years ago (Score:1)
The 70x figure seems optimistic. Does it include ALL the overheads for the GPU?
But this was done and patented over 2 years ago.
http://www.scribd.com/doc/44661593/PostgreSQL-OpenCL-Procedural-Language
And there has been earlier work using SQLite on GPUs.
Re:PostgreSQL used GPU 2 years ago (Score:5, Informative)
Re: (Score:2)
Altera and Xilinx both have high-level synthesis tools out that can target FPGAs using generic C. The Altera one allows you to target GPUs, CPUs, or FPGAs. In the case of highly parallel tasks, an FPGA can run many times faster than even a GPU. There are fairly large gate-count devices with ARM cores available now, so you can move the tasks around for better performance. I'd love to see some of these tasks targeting these devices.
Re: (Score:2)
The FPGAs are far more power efficient though.
First customer? (Score:1)
So this is where all the Titans ended up.... (Score:2)
Still waiting for one/two....to play games on....
Re: (Score:1)
obvious question (Score:1)
does it blend?
Could have... (Score:1)
Re: (Score:1)
Re: (Score:2)
Where's Onslaught?
Re: (Score:3)
Where's Onslaught?
It's in Norweight.
Re: (Score:2)
The owner is the submitter. He knew what he was getting into, or should have.
Re: (Score:2)
Ingres-Actian and Vectorwise (Score:1)
Large datasets are mostly IO limited (Score:5, Interesting)
For these kinds of datasets, and where more compute is necessary, there is MARs [gpgpu.org].
Re: (Score:2)
Agreed. That easily fits into memory (3 times over, actually) on our main OLTP DB.
If it can fit into RAM for less than $50K, it's not big data.
Re: (Score:2)
This project's innovation is noting that GPUs have enough RAM now that you can push medium-sized data sets into them if you have enough available. With lots of cores and high memory bandwidth, in-memory data sets in a GPU can do multi-core operations faster than in-memory data sets in a standard CPU/memory combination.
That's great for simple data operations that are easy to run in parallel and when the data set is small enough to fit in your available hardware. Break any of those assumptions, and you lose most of the advantage.
Re:Large datasets are mostly IO limited (Score:4, Informative)
Hi - MapD creator here. Agreed, GPUs aren't going to be of much use if you have petabytes of data and are I/O bound, but what I think unfortunately gets missed in the rush to indiscriminately throw everything into the "big data bucket" is that a lot of people do have medium-sized (say 5GB-500GB) datasets that they would like to query, visualize and analyze in an iterative, real-time fashion, something that existing solutions won't allow you to do (even big clusters often incur enough latency to make real-time analysis difficult).
And then you have super-linear algorithms like graph processing, spatial joins, neural nets, clustering, and rendering blurred heatmaps, which do really well on the GPU; there the 70x speedup seen on memory-bound workloads turns into 400-500x. Particularly since databases are expected to do more and more viz and machine learning, I don't think these are edge cases.
Finally, although GPU memory will always be more expensive (but faster) than CPU memory, MapD can already run on a 16-card server with 128GB of GPU RAM, and I'm working on a multi-node distributed implementation where you could string many of these together. So having a terabyte of GPU RAM is not out of the question, which, given the column-store architecture of the db, can be used more efficiently by caching only the necessary columns in memory. Of course it will cost more, but for some applications the performance benefits may be worth it.
I just think people need to realize that different problems need different solutions, and just b/c a system is not built to handle a petabyte of data doesn't mean it's not worthwhile.
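As a generic illustration of why a column store helps here (this is not MapD's actual code; the names and layout are made up): a query that only touches a couple of fields can copy just those contiguous arrays to the GPU, instead of dragging whole rows across the bus.

#include <stdint.h>

// Row store: scanning one field drags every whole record through memory.
struct TweetRow {
    int64_t id;
    float   lon, lat;
    int32_t timestamp;
    char    text[140];
};

// Column store: each field is its own contiguous array, so a spatial query
// that only needs lon/lat copies just those two arrays to the GPU and scans them there.
struct TweetColumns {
    int64_t *ids;
    float   *lons;
    float   *lats;
    int32_t *timestamps;
    char    *texts;      // 140 bytes per tweet, only moved when a query actually needs it
};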
Re: (Score:3)
Yeah, they actually do. For in-memory queries, analysis, and visualization, people use statistical and numerical languages like R, Matlab, Python, and others (as well as tools with nice graphics frontends). And they have full GPU support available these days. In many cases, the GPU support parallelizes the heavy operations for you.
Re: (Score:2)
Check out Twitter's own Storm system built on top of ZeroMQ.
http://storm-project.net/ [storm-project.net]
http://www.zeromq.org/ [zeromq.org]
You may find something you like.
Re: (Score:2)
125 million tweets != 1,250,000,000
125,000,000 * 142 bytes ≈ 17.75 GB (16.5 GiB), which easily fits into my desktop's memory.
Re: (Score:1)
Re: (Score:2)
Mars needs guitars
"'Cause the man from Mars stopped eatin' cars and eatin' bars and now he only eats guitars"
See this video [youtube.com] at 3:47.
GPU is good - but you need the IOPS to leverage it (Score:2)
For data processing workloads, a frequent problem with GPU acceleration is that the working dataset size is too large to fit into the available GPU memory and the whole thing slows to a crawl on data ingest (physical disk seeks, random much of the time) or disk writes for persisting the results.
For folks serious about getting good ROI on their GPU hardware in real-world scenarios, I strongly recommend you take a look at the Fusion-io PCIe flash cards, which now support writing to and reading from them directly.
PDF link for GPU "on-board" flash memory (Score:2)
Talk to IBM PureData - beat to the punch (Score:3)
Granted it's not free or cheap, but IBM will ship you a prebuilt rack of 'stuff' that will load 5TB/hour and scan 128GB/sec. PGStrom came out in the last year. Custom hardware/ASIC/FPGA for this sort of thing is not new.
Good to see things like this. (Score:3)
Re: (Score:1)
Re: (Score:2)
Most normal mortals won't have any knowledge of why a large database is useful. Frankly, the first thing that leaps to my humble mind is trying to harvest money from the stock markets. Obviously there are numerous companies applying all kinds of computing power to the stock market. I do wonder if more computing power helps at this point, or whether there is some tipping point at which massive data crunching would yield much better results.
Re: (Score:2)
Q: Whats better than a GPU database? (Score:2)
A: Indexes that don't suck.
Using GPUs and massively parallel blah blah blah is cool and all, but most databases are not processor limited, so why should we care?
Re: (Score:3, Insightful)
Re: (Score:2)
Try to heatmap or do hierarchical clustering on a billion rows in a few milliseconds with just the aid of indexes - not all applications need lots of cores and high memory bandwidth - but some do.
Even your examples are I/O rather than processor limited. Sending billions of rows to a GPU over the southbridge ain't free.
MediumData (Score:1)
Code optimization (Score:2)
Student writes inefficient code, learns how to optimize it using known techniques, it becomes faster. Film at 11.
Oblig. In Soviet Russia ... (Score:1)
Re: (Score:1)
It's like tons of little fish devouring an elephant carcass rather than one shark doing the same. You asked for a non-technical explanation... Of course it's still hard drives (or SSDs today) all the way down.
Re: (Score:1)
does the shark have a laser?