MMS • Raul Salas
Which database do I use? I get this question often from senior management. In the open-source NoSQL world this is a complicated question, and the first question that needs to be answered is: what are you trying to accomplish? Let's start by outlining which open-source platforms are available.
- Hadoop – Apache Hadoop is an open-source framework that supports the processing of large data sets in a distributed computing environment. It is designed to scale from a single server to thousands of machines, each contributing computation and storage.
- Cassandra – Apache Cassandra is a free and open-source distributed NoSQL database management system designed to handle large amounts of data across many commodity servers, providing high availability with no single point of failure as well as bi-directional replication across data centers.
- Spark – Apache Spark is an open-source engine for large-scale data processing. It is optimized for executing multiple parallel operations on the same data set, as occurs in many iterative machine learning tasks.
- MongoDB – MongoDB offers rapid and agile development through its ability to handle unstructured data, its high availability, and its ability to scale across commodity hardware.
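Before moving to use cases, it may help to see the processing model Hadoop is built around. The sketch below is a toy, single-process illustration of map-reduce in plain Python with made-up sample data; a real Hadoop job would distribute these same phases across many machines.

```python
from collections import defaultdict

# Toy sketch of the map-reduce model Hadoop uses, in plain Python.
# The sample documents are invented for illustration.
documents = [
    "cassandra handles replication",
    "hadoop handles batch processing",
    "spark handles iterative processing",
]

# Map phase: each document is turned into (key, value) pairs independently,
# which is why this step can run in parallel on separate nodes.
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle phase: pairs are grouped by key.
grouped = defaultdict(list)
for word, count in mapped:
    grouped[word].append(count)

# Reduce phase: each key's values are combined into a final result.
word_counts = {word: sum(counts) for word, counts in grouped.items()}

print(word_counts["handles"])  # each sample line contains "handles" once: 3
```

The same map/shuffle/reduce shape is what lets Hadoop spread a job over a cluster: only the shuffle requires coordination between nodes.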
OK, so that’s a lot to consume! At this point, you’re probably still thinking “which database platform do I use?” The best way to explain each platform is by use case.

Use Case #1 – (Cassandra): Let’s start with larger companies, especially global ones, that run multiple data centers and need a single-pane-of-glass view of their data. While this sounds simple, large companies spend a lot of money shipping data across multiple platforms and systems to ensure a timely, accurate reflection of their data. Cassandra can provide that functionality: it can be set up in multiple data centers, accept updates in each of them, and reflect the data in all locations with eventual consistency. “Eventual consistency” is an important term. It does not mean the data is updated in real time; it means that at some point, usually not far from real time, the data will be consistent across data centers. For example, a health care company can quickly, and more importantly easily, obtain the information it wants from a central data store. It can track policy growth across multiple countries without many processes massaging and shipping the data to a consolidated store.

Use Case #2 – (Hadoop): Your company wants a 360-degree view of its customer base. For example, it wants to consolidate its CRM data, web site browsing data, call center activity, and social media sentiment analysis of its brand by both the general public and its customers. This use case involves a lot of data from disparate places, in both structured and unstructured form, and it requires a lot of storage and compute power. Apache Hadoop fits this requirement perfectly. Hadoop can bring many data formats and structures into its own file system, HDFS, and the way HDFS operates makes storage very affordable. In addition, Hadoop’s ability to scale horizontally and run map-reduce over large data sets makes it possible to process very large amounts of data. Hadoop can perform sentiment analysis on social media posts to determine what the general public thinks about your product and company. All of this can be combined and analyzed as a single pane of glass into your customers’ purchases, contacts, sales leads, website activity, and social media posts, giving a 360-degree view of your company’s customers and business.

Use Case #3 – (Spark): You’re a credit card processor and you would like to prevent credit card fraud in real time or near real time. This is where Spark comes into play. Spark can store the data temporarily, execute machine learning algorithms on it in real time, and decide from multiple inputs whether a transaction is fraudulent. Spark acts as a kind of temporary, scalable, read-only store.

Use Case #4 – (MongoDB): Your organization wants to quickly create a data warehouse from various sources of data that need to be related and consolidated into a single view for business analysis. MongoDB’s flexible schema allows dissimilar and unstructured data sources to be related and combined for easy analysis. For large data sets, horizontal scaling can spread the data across multiple commodity servers in order to meet response-time SLAs.

Hopefully, this explains in a simple way which NoSQL database you should use.
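As a closing illustration of the flexible schema from Use Case #4, here is a rough sketch in plain Python. The dictionaries stand in for MongoDB documents, and all field names and sample data are invented; a real deployment would store these through a driver such as pymongo rather than in Python lists.

```python
# Toy sketch: dissimilar "documents" (plain dicts standing in for MongoDB
# records) consolidated into a single customer view. All names and data
# here are hypothetical.

crm_records = [
    {"customer_id": 1, "name": "Acme Corp", "region": "EMEA"},
]
web_events = [
    {"customer_id": 1, "page": "/pricing", "duration_sec": 42},
    {"customer_id": 1, "page": "/docs"},  # no duration field at all
]

def consolidated_view(customer_id):
    """Relate the disparate sources on customer_id into one document."""
    base = next(r for r in crm_records if r["customer_id"] == customer_id)
    visits = [e for e in web_events if e["customer_id"] == customer_id]
    # Documents need not share a schema: missing fields are simply absent,
    # which is the point of a flexible-schema store.
    return {**base, "web_visits": visits}

view = consolidated_view(1)
print(view["name"], len(view["web_visits"]))  # Acme Corp 2
```

Note that the second web event simply omits `duration_sec`; nothing forces every record into one rigid table layout, which is what makes combining dissimilar sources straightforward.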