The comparison you are asking about, though, is really ParAccel vs. Druid: ParAccel is the software that Amazon licenses for Redshift.
Aside from potential differences in performance, there are some functional differences. (These are all based on a cursory understanding of what ParAccel does; I've read what I could find on it, but a lot of my understanding is extracted from interpretations of marketing text, which can be a mixed bag.)
1) ParAccel is a full-on database with all kinds of SQL support, including things like joins and insert/update statements. Druid is intended as an analytical data store; its write semantics aren't as fluid and it doesn't do joins.
2) Data distribution model
ParAccel’s data distribution model is hash-based. Expanding the size of your cluster requires re-hashing the data across the nodes, making it difficult to do without taking downtime. From Amazon’s text, scaling up your Redshift cluster is actually a multi-step process:
a) set cluster into read-only mode
b) copy data from cluster to new cluster that exists in parallel
c) redirect traffic to new cluster
They do not indicate whether they are nice enough not to charge you for the extra machines consumed during the copy (a cost imposed by their own software's limitations), but even if you were scaling up to 100 of the big nodes and copying 20TB at 2GB/s across the cluster, that would only be about 3 hours, or ~$2k, so they probably figure that said cost doesn't really matter.
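The back-of-envelope math above works out as follows (a quick sketch using the numbers from the text, not measured figures):

```python
# Sanity check of the copy-time estimate: 20TB moved at an assumed
# aggregate rate of 2GB/s across the cluster.
data_tb = 20                       # total data to copy, in TB
throughput_gb_s = 2                # assumed aggregate copy rate, GB/s

seconds = (data_tb * 1024) / throughput_gb_s
hours = seconds / 3600
print(f"copy time: {hours:.1f} hours")  # ~3 hours, as estimated
```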
Druid’s data distribution, on the other hand, is based on segments that already exist on some sort of highly available “deep” storage, like S3. You can lose all of your compute nodes and still reload everything (as long as your deep storage is still there).
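To illustrate why losing compute nodes isn't fatal, here is a toy sketch (hypothetical names, not Druid's actual code) of the idea that compute nodes are just caches over segments that live permanently in deep storage:

```python
# Illustrative sketch: compute nodes hold local copies of segments whose
# source of truth is deep storage (e.g. S3), so a brand-new node can
# rebuild its entire serving set from deep storage alone.

deep_storage = {                   # stand-in for S3: segment id -> bytes
    "wikipedia_2013-01-01": b"...columnar data...",
    "wikipedia_2013-01-02": b"...columnar data...",
}

class ComputeNode:
    def __init__(self):
        self.local_segments = {}

    def load_assigned(self, segment_ids):
        # Pull assigned segments from deep storage into the local cache.
        for sid in segment_ids:
            self.local_segments[sid] = deep_storage[sid]

# Simulate total loss of compute capacity: a fresh node reloads
# everything, because deep storage survived.
node = ComputeNode()
node.load_assigned(deep_storage.keys())
print(sorted(node.local_segments))
```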
3) Replication strategy
ParAccel’s hash-based distribution also generally means that the replication strategy is necessarily via hot spares. From what I can tell, when you lose one node in ParAccel, you are covered by a hot spare that can come in and take its place; but if you lose that node as well, I’m not sure whether there is some mechanism to protect you from losing data. They probably re-load from a backup on S3 or something, which does significantly mitigate the risk. Beyond that, though, a hot-spare replication strategy often doesn’t lend itself to using the spare copy for read queries, meaning that only one node is serving a given piece of data at any point in time, which can become a hotspot/bottleneck. Allowing reads on all of the replicas at the same time greatly complicates mutations; they might have that base covered, but I do not know.
Druid’s distribution is at the segment level, meaning that you can add more nodes and have the data rebalance without doing a staged swap. The replication strategy also makes all replicas available for querying, so if you have a base replication factor of 2, you have two machines serving read queries against that segment. Druid doesn’t implement this yet, but hopefully by the end of the year we will be automatically adjusting replication factors of specific segments based on demand for the segment (i.e. dynamically responding to hotspots by adding replicas). There’s nothing stopping different segments from being replicated at different levels; the thing left to implement is the communication mechanism to figure out when a segment’s replication should be scaled up or down.
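A toy sketch of segment-level replication (hypothetical, not Druid's actual coordinator logic) makes the contrast with hot spares concrete: every replica of a segment can serve reads, and adding a node just means handing it segments rather than re-hashing everything:

```python
# Round-robin each segment onto `replication` nodes; with more nodes
# than the replication factor, each segment lands on distinct nodes.
from itertools import cycle

def assign(segments, nodes, replication=2):
    assignment = {n: set() for n in nodes}
    ring = cycle(nodes)
    for seg in segments:
        for _ in range(replication):
            assignment[next(ring)].add(seg)
    return assignment

segments = ["seg1", "seg2", "seg3"]
plan = assign(segments, ["nodeA", "nodeB", "nodeC"])

# Every segment has 2 live replicas, and both can serve queries:
for seg in segments:
    replicas = [n for n, segs in plan.items() if seg in segs]
    assert len(replicas) == 2
```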
4) Indexing strategy
I’m not sure if they’ve added it since, but last I looked at ParAccel they didn’t have indexing structures in place for the data; instead they relied only on column orientation and brute force to process queries. Indexing structures do increase the storage overhead of the data (and make it more difficult to allow for mutation), but they can also significantly speed up queries. Druid uses indexing structures to speed up query execution when a filter is provided.
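The benefit of an index for filtered queries can be shown with a minimal sketch (illustrative data and names, not Druid's internal format): an inverted index from dimension value to row ids lets a filter skip the brute-force scan entirely.

```python
# Toy segment: rows with one dimension ("page") and one metric ("edits").
rows = [
    {"page": "Justin_Bieber", "edits": 3},
    {"page": "Ke$ha",         "edits": 5},
    {"page": "Justin_Bieber", "edits": 2},
]

# Build the inverted index once, at segment-creation time.
index = {}
for i, row in enumerate(rows):
    index.setdefault(row["page"], set()).add(i)

# A filtered aggregation then touches only the matching rows.
matching = index["Justin_Bieber"]
total = sum(rows[i]["edits"] for i in matching)
print(total)  # 5
```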
5) Pluggable query execution engine
I’m not sure to what degree ParAccel allows for UDFs, but if you wanted to, you can plug different query types into Druid and have it run a completely different set of functionality in a scatter-gather across the segments in your cluster. Right now, the open source offering has timeseries queries, groupBy, “search”, timeBoundary, and segmentMetadata. Timeseries produces results that groupBy could also produce, but it is optimized for queries that just return a timeseries and don’t need to include any dimensions (it runs in about half the time of the equivalent groupBy query). segmentMetadata just looks at the various segments and reports back statistics like segment-local cardinalities of dimensions and expected input data size assuming tsv input. The mechanisms for extending this are completely pluggable (Metamarkets maintains some proprietary query extensions that support our dashboard product, implemented as modular plugins).
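For a sense of what one of these query types looks like on the wire, here is the shape of a Druid timeseries query (the dataSource and metric names are hypothetical; the JSON structure follows Druid's query API):

```python
# A Druid timeseries query is a JSON document POSTed to a broker node;
# here we just construct and serialize one.
import json

query = {
    "queryType": "timeseries",
    "dataSource": "wikipedia",                 # assumed datasource name
    "granularity": "day",
    "intervals": ["2013-01-01/2013-01-08"],
    "aggregations": [
        {"type": "longSum", "name": "edits", "fieldName": "count"},
    ],
}

print(json.dumps(query, indent=2))
```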