Choosing a Data Lake Storage System
Data Lake Storage System: Which One Should You Use?
When you create a Data Lake, one of the most overlooked questions is, “What storage technology should back the lake?” Most companies simply go with whatever tech stack they are familiar with, or whatever they are being sold. In reality, the Data Lake storage system should be chosen by asking the same questions you ask when you build out any other piece of the system:
1. Does the system cover all requirements and SLAs that are currently known?
2. Can the system be easily expanded if more functionality (or space) is needed?
3. Is the system in line with budgetary and engineering talent constraints?
Once these questions have been reviewed and answered, the selection of a storage technology can begin.
There are five widely accepted storage systems in use for Data Lakes. Each has both pros and cons as the basis for a Data Lake.
Data Storage System Pros and Cons
Type of System | Pro | Con |
---|---|---|
Hadoop Based System | Easily expandable and cheaper storage | Slower data retrieval times |
Non-Hadoop Based Storage + Hadoop / non-Hadoop Compute, e.g. S3 + Hive / Spark | Decouples storage and compute, optimized for cloud platforms | More difficult to implement on-prem |
Massively Parallel Processing (MPP) System, e.g. HP Vertica or IBM Netezza | Fast record retrieval and ease of setup | High cost |
NoSQL System (Cassandra, HBase) | Easily expandable and fast | Tech community less familiar with NoSQL systems |
SQL Database (SQL Server, Oracle, MySQL) | Well-defined technology | Cannot handle large amounts of data without high cost |
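One way to put the table to work is to encode each row as a record and filter the options against your own requirements. The Python sketch below is only an illustration: the trait labels (such as `expandable` or `high_cost`) and the `shortlist` helper are hypothetical names, and the labels are a simplified reading of the pros and cons above, not a benchmark or a definitive ranking.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StorageOption:
    """One row of the pros and cons table, reduced to short trait labels."""
    name: str
    pros: frozenset
    cons: frozenset


# Trait labels are assumptions: a simplified paraphrase of the table above.
OPTIONS = [
    StorageOption("Hadoop-based system",
                  frozenset({"expandable", "cheap_storage"}),
                  frozenset({"slow_retrieval"})),
    StorageOption("Non-Hadoop storage + separate compute",
                  frozenset({"decoupled_compute", "cloud_optimized"}),
                  frozenset({"hard_on_prem"})),
    StorageOption("MPP system",
                  frozenset({"fast_retrieval", "easy_setup"}),
                  frozenset({"high_cost"})),
    StorageOption("NoSQL system",
                  frozenset({"expandable", "fast_retrieval"}),
                  frozenset({"less_familiar"})),
    StorageOption("SQL database",
                  frozenset({"well_defined"}),
                  frozenset({"high_cost_at_scale"})),
]


def shortlist(required, unacceptable, options=OPTIONS):
    """Return the names of options that offer every required trait
    and carry none of the unacceptable ones."""
    required = frozenset(required)
    unacceptable = frozenset(unacceptable)
    return [
        option.name
        for option in options
        if required <= option.pros and not (unacceptable & option.cons)
    ]


if __name__ == "__main__":
    # Example: the lake must be easy to expand, and slow retrieval is a deal-breaker.
    print(shortlist({"expandable"}, {"slow_retrieval"}))
```

A filter like this only narrows the field. The three questions above (requirements and SLAs, room to expand, budget and talent) still have to be answered for each remaining candidate.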
At DesignMind, we have developed a proprietary pattern that not only ingests large amounts of data, but also:
- Makes data available to users at all levels of the system
- Allows data to be accessed in multiple formats
- Allows for simplified schema evolution management
Read more in our white paper, “Data Lake Storage Systems That Work”. Questions? Contact us and we’ll get back to you promptly.