Is HDFS a Data Lake?
You might be wondering: “Is HDFS a data lake?”
- Short Answer: No
- Long Answer: HDFS is often used to build data lakes. It can serve as the storage layer of a data lake, but it is not a data lake itself.
A data lake is a type of architecture, not a single product. It is very often built on top of Hadoop and can be composed of one or more Hadoop clusters, but it doesn’t have to use Hadoop at all. It can also combine several technologies at the same time, such as Amazon S3 alongside HDFS.
A data lake is generally expected to:
- store a variety of data types
- require very little, if any, pre-processing of the data
A data lake is generally less rigid than a data warehouse. With a data lake you don’t generally need to pick out a schema ahead of time. It is a large system that can hold huge amounts of data, and that data can come in many different forms and be organized in many different ways. It can include structured data like database tables, semi-structured data like CSV files, or just raw data. This is exactly the type of thing that HDFS is meant for, which is why data lakes tend to go together with Hadoop.
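To make the "no schema ahead of time" idea concrete, here is a minimal sketch of schema-on-read in plain Python. It is only an illustration: the in-memory `RAW_OBJECTS` dictionary stands in for files stored in a lake, and the file names, fields, and `SALES_SCHEMA` are all hypothetical. No HDFS or Hadoop API is used.

```python
import csv
import io
import json

# Raw payloads kept exactly as they arrived, the way files might land in a lake.
# Nothing declared a schema when they were written. (Hypothetical sample data.)
RAW_OBJECTS = {
    "events/2024-01-01.json": (
        '{"user": "ana", "amount": "12.50", "coupon": "NEW10"}\n'
        '{"user": "bo", "amount": "3"}\n'
    ),
    "events/2024-01-02.csv": "user,amount\ncy,7.25\n",
}


def read_rows(path, payload):
    """Parse one raw object into a list of dicts, based on its file extension."""
    if path.endswith(".json"):
        return [json.loads(line) for line in payload.splitlines() if line.strip()]
    if path.endswith(".csv"):
        return list(csv.DictReader(io.StringIO(payload)))
    raise ValueError(f"unsupported format: {path}")


def apply_schema(row, schema):
    """Schema-on-read: select and cast fields only when the data is queried."""
    return {name: cast(row[name]) for name, cast in schema.items()}


# The reader, not the writer, decides which fields matter and what types they have.
SALES_SCHEMA = {"user": str, "amount": float}

sales = [
    apply_schema(row, SALES_SCHEMA)
    for path, payload in sorted(RAW_OBJECTS.items())
    for row in read_rows(path, payload)
]
total = sum(row["amount"] for row in sales)
```

The extra `coupon` field in the first record is stored but simply ignored by this particular reader. A different reader could apply a different schema to the same raw files, which is the flexibility a data warehouse with a fixed schema does not give you.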
Compared to data warehouses, data lakes are supposed to be:
- more flexible
- less expensive
Data lakes are meant to make things easier than they would be with a data warehouse, but they aren’t a magical solution. They need to be implemented properly, so planning is critical, and they need the right tools in place to make the data accessible to end users. It is also important not to simply treat them as data warehouses.
Hadoop vs Data Lake
It doesn’t really make sense to compare Hadoop to a data lake. It isn’t an apples-to-apples comparison. It is more like comparing apples to a fruit basket. A data lake is generally built using Hadoop and other technologies all working together, so you could think of Hadoop as the apples and a data lake as the basket of fruit. A basket of fruit doesn’t always contain apples, but it usually does. It could contain a single apple or several, just like a data lake might contain a single Hadoop cluster or several. To extend the analogy, you could think of Amazon S3 object storage as the bananas.
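The fruit basket can be sketched in a few lines of Python. This is a toy model, not a real catalog service: the dataset names, cluster names, and bucket name below are made up for illustration. It treats the data lake as a catalog of datasets and groups them by the storage backend each one lives on.

```python
from collections import defaultdict
from urllib.parse import urlparse

# Hypothetical catalog: dataset name -> where its files actually live.
# Two Hadoop clusters (the apples) and one S3 bucket (the bananas).
CATALOG = {
    "clickstream": "hdfs://cluster-a/raw/clickstream",
    "sensor_logs": "hdfs://cluster-b/raw/sensors",
    "image_archive": "s3://example-lake-bucket/images",
}


def stores_by_backend(catalog):
    """Group the distinct clusters or buckets in the lake by URI scheme."""
    grouped = defaultdict(set)
    for uri in catalog.values():
        parsed = urlparse(uri)
        # scheme is "hdfs" or "s3"; netloc is the cluster or bucket name.
        grouped[parsed.scheme].add(parsed.netloc)
    return {scheme: sorted(hosts) for scheme, hosts in grouped.items()}


backends = stores_by_backend(CATALOG)
```

Here the lake is the whole catalog, while HDFS is just one kind of entry in it. Removing both `hdfs://` entries would still leave a (smaller) data lake, which is the point of the analogy.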