My Big Data Glossary

Big Data is an abused term. Time to put some order into it.

I've already written about Big Data and the fact that it isn't really a technology but rather a state of mind. I am continuing my struggle with this thing.


Big Data is something I've been watching from the sidelines for the past couple of years, and lately it has become something I need to know more about. My way of learning things? Reading and writing about them. To put some order into it, I put together a "short list" of glossary terms: things whose meaning I need to know accurately, rather than just using them without much understanding.

Here's my list of Big Data Glossary Terms (you will find the definitions a bit circular at times).

ACID

You probably won't bump into this one unless you are already well into database technicalities, but it is good to know.

ACID stands for Atomicity, Consistency, Isolation, Durability – all features you'd expect from your database, and all harder to achieve on the new set of NoSQL databases.

Atomicity – a transaction made on the database succeeds or fails as a whole, with no interim state (the best example is a bank transfer: when it fails, you want the money to stay in your account and not live in limbo-zone)
Consistency – every transaction takes the database from one valid state to another, so all of its rules and constraints hold at all times
Isolation – if transactions are executed concurrently, they should behave as if they ran sequentially, one after the other (they see no mid-states of the database)
Durability – once a transaction is committed, it stays committed, no matter what happens next to the database (think crash)
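
To make the bank transfer example concrete, here is a minimal sketch using Python's built-in sqlite3 module. The accounts table, the account names and the transfer helper are all made up for illustration. The credit is deliberately applied before the debit, so that a failed debit shows the earlier credit being rolled back as well.

```python
import sqlite3

def transfer(conn, src, dst, amount):
    """Move `amount` from src to dst as a single all-or-nothing transaction."""
    try:
        # `with conn` commits if the block succeeds and rolls back if it raises.
        with conn:
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            # If this debit violates the CHECK constraint, the credit above
            # is rolled back too, so no money is created or left in limbo.
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        return False

def balances(conn):
    rows = conn.execute("SELECT name, balance FROM accounts ORDER BY name")
    return dict(rows)

# Hypothetical schema: balances may never go below zero.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    " name TEXT PRIMARY KEY,"
    " balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 50)")
conn.commit()

print(transfer(conn, "alice", "bob", 30))   # True
print(balances(conn))                       # {'alice': 70, 'bob': 80}
print(transfer(conn, "alice", "bob", 500))  # False: alice cannot cover it
print(balances(conn))                       # {'alice': 70, 'bob': 80}
```

The second transfer fails halfway through, yet bob's balance is unchanged afterwards. That is atomicity, with the CHECK constraint standing in for consistency.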

CAP Theorem

The CAP theorem states that a distributed system cannot guarantee all three of the following properties simultaneously:

Consistency – all nodes see the same data at the same time
Availability – every request receives a response about whether it succeeded or failed
Partition tolerance – the system continues to operate properly even when nodes fail or the network between them drops messages, splitting the system into parts that can't talk to each other

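Now, once you have Big Data, you need a distributed system. At that point, you need to stop thinking of ACID and start deciding which of the 3 properties you can live without (or live with less of). In practice, network partitions happen whether you like it or not, so the real choice tends to be between consistency and availability for as long as a partition lasts.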
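
The trade-off is easier to see in a toy model. The sketch below is not how any real database works: TinyCluster, its two replicas and the "CP"/"AP" modes are names invented here for illustration. It only shows what a system can do with a write that arrives while its replicas cannot reach each other.

```python
class Replica:
    def __init__(self):
        self.data = {}

class TinyCluster:
    """Two replicas joined by a network link that can be cut (toy model)."""

    def __init__(self, mode):
        if mode not in ("CP", "AP"):
            raise ValueError("mode must be 'CP' or 'AP'")
        self.mode = mode
        self.a, self.b = Replica(), Replica()
        self.partitioned = False

    def write(self, replica, key, value):
        """Return True if the write was accepted, False if it was refused."""
        other = self.b if replica is self.a else self.a
        if self.partitioned:
            if self.mode == "CP":
                # Stay consistent by refusing the write: availability is lost.
                return False
            # Stay available by accepting locally: replicas may now disagree.
            replica.data[key] = value
            return True
        # Healthy network: apply the write on both replicas.
        replica.data[key] = value
        other.data[key] = value
        return True

    def consistent(self):
        return self.a.data == self.b.data

cp = TinyCluster("CP")
cp.partitioned = True
print(cp.write(cp.a, "x", 1), cp.consistent())  # False True

ap = TinyCluster("AP")
ap.partitioned = True
print(ap.write(ap.a, "x", 1), ap.consistent())  # True False
```

With the link cut, the "CP" cluster refuses the write and stays consistent, while the "AP" cluster accepts it and ends up with two replicas that disagree.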

Cassandra

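An open source database, usually described as a key-value store (more precisely, a wide-column store). Fast on reads and writes, with good high availability. Crappy when it comes to non-simple schemas and the ability to do anything resembling SQL stuff such as joins. Strong consistency also isn't its default: it favors availability and lets you tune how consistent each read and write has to be, which is not what you want for something like banking applications.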

Hadoop

An open source framework for data-intensive distributed applications. Its ecosystem includes things like HDFS and HBase (both listed here as well).

Hadoop is designed to work on commodity hardware and to scale horizontally by throwing more machines at a problem.

Hadoop is the king of Big Data technologies. It is the most commonly used solution when people refer to Big Data, and it is often used without any real need. While Hadoop does get used for real-time analytics, it is built for batch work: really good at writing and crunching large volumes of data and slow on reads, making it… not as good for real-time analytics as some incorrectly assume.

HBase

HBase runs on top of Hadoop (another term on this page), using HDFS for its storage. It is a non-relational, column-oriented NoSQL database modeled after Google's Bigtable. HBase is used to store just about anything ("unstructured data"), as long as it is large in size.

HDFS

HDFS (Hadoop Distributed File System) is the file system used by Hadoop and HBase. It is suitable for… Hadoop. There are other file systems you can use with Hadoop, but this is the basic one you get from the Apache Software Foundation (home of Hadoop).

In-Memory

There's a whole set of databases that don't use disk storage – or at least not for their daily operations. These are in-memory databases. They are fast as hell, scale horizontally (most of the time only up to a point), and are expensive when it comes to hardware (memory costs more than disk).

For real-time stuff, they probably need to be part of the solution.
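
A small taste of the idea, as a sketch only: SQLite can run entirely in RAM when opened with the special ":memory:" name. This is nowhere near a distributed in-memory database, and the events table and count_events helper are made up for the example, but it does show the flip side of skipping the disk: nothing survives once the connection is gone.

```python
import sqlite3

def count_events(conn):
    return conn.execute("SELECT COUNT(*) FROM events").fetchone()[0]

# ":memory:" keeps the whole database in RAM; no file is created.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, payload TEXT)")
db.executemany(
    "INSERT INTO events (payload) VALUES (?)",
    [("click",), ("view",), ("click",)],
)
total = count_events(db)
print(total)  # 3
db.close()

# A new ":memory:" connection is a brand-new, empty database.
fresh = sqlite3.connect(":memory:")
tables = fresh.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table'"
).fetchall()
print(tables)  # []
```

That is why in-memory systems that care about their data usually pair RAM with some form of replication or persistence on the side.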

MapReduce

The "new" SQL. Not really.

MapReduce is a way of getting an answer from a distributed system. This is what you'll find in a Hadoop cluster – you can put SQL on top of it, but that gets translated into MapReduce jobs.

At its core, MapReduce starts by splitting the data into smaller sets and mapping the problem onto each of them, with each set processed on a machine of its own. The partial answers from all the machines are then collected and reduced into the final response.
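
The classic illustration is counting words. The sketch below runs everything in a single Python process, so it only mimics the shape of a MapReduce job and is not Hadoop's actual API. The function names (map_phase, shuffle, reduce_phase, word_count) are my own.

```python
from collections import defaultdict

def map_phase(chunk):
    """Map: turn one chunk of text into (word, 1) pairs."""
    return [(word.lower(), 1) for word in chunk.split()]

def shuffle(mapped_chunks):
    """Shuffle: group the values of every key across all mapper outputs."""
    groups = defaultdict(list)
    for pairs in mapped_chunks:
        for key, value in pairs:
            groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: collapse all the values of one key into a single result."""
    return key, sum(values)

def word_count(chunks):
    # On a real cluster, each chunk would be mapped on a machine of its own.
    mapped = [map_phase(chunk) for chunk in chunks]
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())

chunks = ["big data is big", "data is abused"]
print(word_count(chunks))  # {'big': 2, 'data': 2, 'is': 2, 'abused': 1}
```

The map and reduce steps never look at more than one chunk or one key at a time, which is what makes it possible to spread them over many machines.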

MongoDB

A hyped document database that wants to replace SQL databases, but probably can't for a lot of use cases. It is marketed as being super-fast. If you start digging deeper into it, it seems like an unreliable database to use for mission-critical stuff.

NewSQL

NewSQL is a marketing term that started because of NoSQL. The old regime of SQL databases had to reinvent itself, so NewSQL came into being. Other than that, I have no real clue what to do with this NewSQL thing.

NoSQL

NoSQL means "Not Only SQL". It's new and should be handled with great care – there is a lot of misleading information out there about these databases (probably in this glossary as well).

The "Not Only" part means that you should probably first check whether SQL fits your needs. If it doesn't, start looking elsewhere. That elsewhere is a huge bucket of different types of databases that behave differently and are suitable for different tasks.

To make it simple, NoSQL is an abused generalized term just like Big Data.

Feel free to:

Add terms
Correct my definitions
Go to Wikipedia and other resources out there to complete this education


Tsahi Levent-Levi
