Big Data is an abused term. Time to put some order into it.
I’ve already written about Big Data and the fact that it isn’t really a technology but rather a mindset. I am continuing my struggle with this thing.
Big Data is something I’ve been watching from the sidelines for the past couple of years, and lately it has become something I need to know more about. My way of learning things? Reading and writing about them. To put some order into it, I arranged a “short list” of glossary terms – terms whose meaning I need to know accurately rather than just use without much understanding.
Here’s my list of Big Data Glossary Terms (you will find the definitions a bit circular at times).
You probably won’t bump into this one unless you are already well into database technicalities, but it is good to know it.
ACID stands for Atomicity, Consistency, Isolation, Durability – all features you’d expect from your database; harder to achieve on the new set of NoSQL databases.
- Atomicity – a transaction made on the database succeeds or fails as a whole – there is no interim state (the best example is a bank transfer – if it fails, you want the money to stay in your account and not linger in some limbo-zone)
- Consistency – at all times, the database should stay in a valid state
- Isolation – if transactions are executed concurrently, they should behave as if they ran sequentially, one after the other (no transaction sees the intermediate states of another)
- Durability – once a transaction is committed, it stays committed, no matter what happens to the database next (think crash)
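To make atomicity concrete, here’s a minimal sketch using Python’s built-in sqlite3 module (SQLite is an ACID-compliant database). The table, names and the simulated crash are all illustrative – this is not a real banking system:

```python
import sqlite3

# Illustrative accounts table; names and balances are made up for the example.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 0)")
conn.commit()

try:
    with conn:  # the 'with' block is one transaction: commit on success, rollback on error
        conn.execute("UPDATE accounts SET balance = balance - 100 WHERE name = 'alice'")
        raise RuntimeError("simulated crash mid-transfer")
        conn.execute("UPDATE accounts SET balance = balance + 100 WHERE name = 'bob'")  # never reached
except RuntimeError:
    pass

# The failed transfer was rolled back as a whole – the money stayed put.
balances = dict(conn.execute("SELECT name, balance FROM accounts"))
print(balances)  # {'alice': 100, 'bob': 0}
```

The half-finished transfer never becomes visible: the withdrawal from alice is undone because the transaction failed as a whole.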
The CAP theorem states that in a distributed system it is impossible to guarantee all three of the following properties simultaneously:
- Consistency – all nodes see the same data at the same time
- Availability – every request receives a response about whether it succeeded or failed
- Partition tolerance – when parts of the nodes fail, the system continues to operate properly
Now, once you have Big Data, you need a distributed system. At that point, you need to stop thinking in terms of ACID and start deciding which of the three properties you can live without (or live with less of).
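To make that trade-off concrete, here’s a toy Python sketch (all names are made up) of two replicas during a network partition – you can refuse writes and stay consistent, or accept them and stay available, but not both:

```python
# Toy model of the CAP trade-off: two replicas that cannot reach each other.
class Replica:
    def __init__(self):
        self.value = None

a, b = Replica(), Replica()
partitioned = True  # the network link between a and b is down

def write(replica, value, prefer="availability"):
    if partitioned and prefer == "consistency":
        # CP choice: refuse the write rather than let replicas diverge
        return "error: cannot reach peer"
    # AP choice: accept the write locally; replicas may now disagree
    replica.value = value
    return "ok"

print(write(a, "x", prefer="availability"))  # ok (but b is now stale)
print(write(b, "y", prefer="consistency"))   # error: cannot reach peer
```

Either answer is a legitimate design choice – which one is right depends on whether stale reads or refused writes hurt your application more.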
An open source database in the form of a key-value store. Fast on reads and writes, with good high availability. Crappy when it comes to anything beyond simple schemas and the ability to do anything resembling SQL stuff. Consistency also isn’t one of its strong points (at least not for something like banking applications).
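To illustrate what the key-value model buys you (and what it doesn’t), here’s a minimal sketch using a plain Python dict as a stand-in for the store – the key naming scheme is just an assumption for the example:

```python
# A plain dict standing in for a distributed key-value store.
store = {}

# Reads and writes by key are fast, simple operations – this is the sweet spot.
store["user:42:name"] = "alice"
store["user:42:email"] = "alice@example.com"
print(store["user:42:name"])  # alice

# But anything resembling SQL ("find all users named alice") means scanning
# every key, because there is no schema or secondary index to lean on.
matches = [k for k, v in store.items() if k.endswith(":name") and v == "alice"]
print(matches)  # ['user:42:name']
```

A real key-value database spreads these pairs across many nodes, which is where the speed and availability come from – and also why the SQL-style queries get hard.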
An open source framework for data intensive distributed applications. As a framework, it encompasses things like HBase and HDFS (listed here as well).
Hadoop is designed to work on commodity hardware and scale horizontally by throwing more machines onto a problem.
Hadoop is the king of Big Data technologies. It is the most commonly used solution when referring to Big Data and is oftentimes used without any real need. While Hadoop gets used for real-time analytics, it is really good at write operations and slow on reads, making it… not that good for real-time analytics, contrary to what some incorrectly assume.
HBase sits at the heart of the Hadoop framework (another term on this page). It is a non-relational database. HBase is used to store anything (“unstructured data”), as long as it is large in size. It is a NoSQL database.
HDFS is the file system that is used by Hadoop/HBase. It is suitable for… Hadoop. There are other file systems you can use for Hadoop, but this is the basic one you get from the Apache foundation (home of Hadoop).
There’s a whole set of databases that don’t use disk storage – or at least not for their daily operations. These are in-memory databases. They are fast as hell, usually scale horizontally only up to a point, and are expensive when it comes to hardware (memory costs more than disk).
For real-time stuff, they probably need to be part of the solution.
The “new” SQL. Not really.
Map Reduce is a way of getting an answer out of a distributed system. It is what you’ll find at the core of Hadoop – you can put SQL on top of it, but that SQL gets translated into Map Reduce jobs.
At its core, Map Reduce starts by mapping a problem into smaller data sets, each one processed on a machine of its own. The answers from all the machines are then collected and reduced into the final response.
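Here’s a minimal single-machine sketch of that model in Python – the classic word-count example. It is illustrative only; in a real Hadoop cluster the map and reduce jobs run distributed across many machines:

```python
from itertools import groupby

def map_phase(chunk):
    # map: emit (key, value) pairs for each record in this chunk of data
    return [(word, 1) for word in chunk.split()]

def reduce_phase(key, values):
    # reduce: combine all values that share a key into one final answer
    return key, sum(values)

# Each chunk would live on (and be mapped by) a different machine.
chunks = ["big data is big", "data is everywhere"]
pairs = [pair for chunk in chunks for pair in map_phase(chunk)]

# Shuffle: group the pairs by key, as the framework does between map and reduce.
pairs.sort(key=lambda kv: kv[0])
grouped = {k: [v for _, v in g] for k, g in groupby(pairs, key=lambda kv: kv[0])}

# Reduce each key to produce the combined result.
result = dict(reduce_phase(k, vs) for k, vs in grouped.items())
print(result)  # {'big': 2, 'data': 2, 'everywhere': 1, 'is': 2}
```

The point of the split is that the map phase parallelizes trivially – each chunk is independent – and only the shuffle and reduce need coordination between machines.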
A hyped database that wants to replace SQL databases, but probably can’t for a lot of the use cases. It is marketed as being super-fast, but if you dig deeper, it seems like an unreliable database to use for mission critical stuff.
NewSQL is a marketing term that came about in response to NoSQL. The old regime of SQL databases had to reinvent itself, so NewSQL came into being. Other than that, I have no real clue what to do with this NewSQL thing.
NoSQL means “Not Only SQL”. It’s new and should be approached with great care – there’s a lot of misleading data out there about these databases (probably in this glossary as well).
The “Not Only” part means that you should probably first check if SQL fits your needs, and if it doesn’t, start looking elsewhere – and that elsewhere is a huge bucket of many different types of databases that behave differently and are suitable for different tasks.
To make it simple, NoSQL is an abused generalized term just like Big Data.
Feel free to:
- Add terms
- Correct my definitions
- Go to Wikipedia and other resources out there to complete this education
Great article to start with. Thanks!
Ouch. Not sure I like your MongoDB definition: you should mention the paradigm shift that occurs over here.
Stay high-level, and add:
* BASE (><ACID) (paradigm shift again)
* lambda architecture (paradigm shift once again)
Four things to say here:
1. Thank you 🙂
2. I need to add BASE into the glossary
3. I need to add Lambda architecture into the glossary
4. I think you are correct on BASE and Lambda. But while they are sometimes required, they are not always required; and when they are, I am not sure that the way to go with them is MongoDB rather than other solutions – just a thought
Great article! I’d like to point out one inaccuracy – MapReduce is not exactly what you wrote, though you got the general concept right. The mapping part is where the data gets split into tasks. The map jobs work on the entire dataset, which can be divided arbitrarily to feed several map jobs. The output of the map jobs ends up in the reduce jobs, which actually do the processing on the data, and each of these returns its result, which is combined into the final output.
For example, suppose you’re trying to find the average salaries of the entire US population and you want to do it one person at a time – then the mapper will map the data by the SSN (for example) and the reducer will look at a single person and average their salaries.
Overall, it’s a neat approach, but I think it’s being pretty abused at times.
I stand corrected – thanks.