Big Data isn’t all about NoSQL or about Hadoop.
When I started dealing with Big Data, it was all Hadoop, top to bottom. There were talks about MongoDB and Cassandra from time to time, and in essence these are probably considered the three main technologies when someone talks about Big Data. But they are very different from one another, and they address different problems.
-> If you are new to Big Data, you might want to start with my short glossary.
As a side note (we will need it later), remember that Big Data is all about Volume, Velocity and Variety.
Here’s a shortlist of technologies that are considered Big Data – and by “considered” I mean there’s probably a debate about whether they are or not…
Hadoop and MapReduce
Hadoop is the crown jewel and darling of Big Data. It can be used for storage, for transactions, for computations and for analytics. The funny thing is that it’s a batch-processing platform, yet it is often used for analytics purposes – sometimes with requirements of low latency and real time.
Hadoop deals mainly with Volume and Variety when it comes to Big Data.
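The MapReduce model that Hadoop is built around can be sketched in a few lines of plain Python – a toy word count. The function names here (`map_phase`, `reduce_phase`) are my own illustration, not Hadoop’s actual API:

```python
from collections import defaultdict

def map_phase(documents):
    """Map: emit a (word, 1) pair for every word in every document."""
    for doc in documents:
        for word in doc.split():
            yield (word, 1)

def reduce_phase(pairs):
    """Reduce: sum the counts emitted for each word."""
    counts = defaultdict(int)
    for word, count in pairs:
        counts[word] += count
    return dict(counts)

docs = ["big data is big", "data is everywhere"]
word_counts = reduce_phase(map_phase(docs))
# word_counts == {"big": 2, "data": 2, "is": 2, "everywhere": 1}
```

The real thing, of course, distributes the map and reduce phases across a cluster and shuffles the intermediate pairs between machines – that distribution is the whole point.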
NoSQL

NoSQL is all about ditching the static schema and enabling the use of something a bit more “comfortable”, assuming you plan for change – and change is the only constant these days.
NoSQL, like Big Data, is more of a catch-all phrase. It includes things like Cassandra, Riak and HBase. It means “No SQL”, “Not Only SQL”, or whatever the specific vendor or developer decided for their particular project.
The way I see it? Most NoSQL solutions tend to focus on the Variety problem with Big Data.
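To see what ditching the static schema buys you, here’s a toy document-store sketch in plain Python. The data is made up, but the principle is the same one document stores like MongoDB rely on:

```python
# Two "documents" in the same collection, each with its own shape --
# no ALTER TABLE migration needed when a new field shows up.
users = [
    {"name": "Alice", "email": "alice@example.com"},
    {"name": "Bob", "email": "bob@example.com", "twitter": "@bob"},  # new field
]

# Queries simply cope with the variety:
with_twitter = [u["name"] for u in users if "twitter" in u]
# with_twitter == ["Bob"]
```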
In-memory solutions

There’s a trend of shifting data processing from storage and traditional databases into pure memory. This can go from simple distributed cache solutions, through key-value stores, up to in-memory databases. These solutions are expensive (memory costs more than disk space), but they are fast. Real fast.
Since memory is volatile, reliability is achieved either by write-behind mechanisms, where data is persisted to disk for retrieval in case of disaster, or simply by replication across multiple machines.
When it comes to in-memory solutions, they are all about Velocity – the speed at which you can process data and derive insights from it.
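A minimal sketch of the write-behind idea, in plain Python. The class and method names are my own, not any particular product’s API:

```python
class WriteBehindCache:
    """Reads and writes hit memory; dirty entries are flushed
    to the slow, durable backing store later."""

    def __init__(self, backing_store):
        self.memory = {}          # fast, volatile storage
        self.dirty = set()        # keys not yet persisted
        self.backing_store = backing_store

    def put(self, key, value):
        self.memory[key] = value  # in-memory write, returns immediately
        self.dirty.add(key)       # remember to persist later

    def get(self, key):
        return self.memory[key]   # served straight from RAM

    def flush(self):
        """Called periodically (or by a background thread)
        to persist dirty entries to the durable store."""
        for key in self.dirty:
            self.backing_store[key] = self.memory[key]
        self.dirty.clear()

disk = {}  # stands in for a durable store
cache = WriteBehindCache(disk)
cache.put("user:1", "Alice")
# "user:1" is not in disk yet -- it only lands there after flush()
cache.flush()
# disk == {"user:1": "Alice"}
```

The trade-off is visible right in the sketch: between `put` and `flush` a crash loses data, which is exactly why replication is the other common answer.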
Columnar data stores
These are databases that store data in a columnar fashion: instead of looking at rows of records – something you do for transactions and the day-to-day operation of a system – you look at the data by column. You are not interested in an individual record or a specific transaction, but rather in an aggregation of a field across a specific transaction type.
Put simply – these are databases that are designed and built with analytics and reports in mind.
They are usually expensive, and they are sometimes treated as Big Data and sometimes they aren’t.
They deal with Velocity, where the interest is mainly in analytics.
Stream processing

Stream processing can be viewed as a type of in-memory solution: data passes through a stream processor and gets modified along the way – counted, filtered, aggregated – whatever your heart desires.
These are found in places where storing the full volume of data isn’t practical, even for something like Hadoop, and where we first want to somehow aggregate or filter the data.
These tend to deal with Volume and Velocity aspects of Big Data.
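A minimal sketch of that filter-then-aggregate pipeline, using Python generators so that nothing is ever stored in full (all names and values here are illustrative):

```python
def event_stream():
    """Stands in for an unbounded source -- sensors, logs, clicks."""
    for value in [3, 15, 7, 42, 1, 99]:
        yield value

def filter_stage(stream, threshold):
    """Drop events below the threshold as they flow by."""
    for value in stream:
        if value >= threshold:
            yield value

def running_count(stream):
    """Aggregate on the fly -- only a counter lives in memory,
    never the stream itself."""
    count = 0
    for _ in stream:
        count += 1
    return count

count = running_count(filter_stage(event_stream(), 10))
# count == 3  (15, 42 and 99 pass the filter)
```

Because each stage is lazy, the pipeline handles one event at a time, which is exactly why this style works when the raw data is too big to keep.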
Did I miss something? Probably. Are these technologies intermingled? Probably.
Me? I am just trying to make sense out of it all.