What Is A Data Pipeline?

A data pipeline is one or more processes that move data from one system to another. The data may or may not be transformed along the way.

A pipeline is generally divided into steps that move data along it. Steps can include enrichment, filtering, and more. Data pipelines can run in real time or in batches. An ETL (extract, transform, load) pipeline is one type of data pipeline and is generally used only for batch processing. A dataflow is the movement of data along the pipeline, and ETL is a common method for doing this.
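To make the idea of steps more concrete, here is a minimal sketch in Python. It assumes records are plain dictionaries, and every name in it (filter_valid, enrich_country, COUNTRY_NAMES, run_pipeline) is made up for illustration and is not part of any particular tool. Each step takes in records and passes records on, so steps can be chained in any order.

```python
from typing import Callable, Iterable, Iterator

Record = dict
Step = Callable[[Iterable[Record]], Iterator[Record]]

# Hypothetical lookup table used by the enrichment step.
COUNTRY_NAMES = {"US": "United States", "DE": "Germany"}


def filter_valid(records: Iterable[Record]) -> Iterator[Record]:
    """Filtering step: drop records that have no amount."""
    for record in records:
        if record.get("amount") is not None:
            yield record


def enrich_country(records: Iterable[Record]) -> Iterator[Record]:
    """Enrichment step: add a readable country name to each record."""
    for record in records:
        name = COUNTRY_NAMES.get(record["country"], "Unknown")
        yield {**record, "country_name": name}


def run_pipeline(origin: Iterable[Record], steps: list[Step]) -> list[Record]:
    """Pass the data through each step in order and collect the result."""
    data: Iterable[Record] = origin
    for step in steps:
        data = step(data)
    return list(data)


# A small in-memory batch standing in for the origin system.
origin = [
    {"id": 1, "country": "US", "amount": 10.0},
    {"id": 2, "country": "DE", "amount": None},
    {"id": 3, "country": "FR", "amount": 7.5},
]

destination = run_pipeline(origin, [filter_valid, enrich_country])
```

Because the steps are generators, the same functions could in principle be fed from a stream instead of a list, which is one simple way to picture the difference between batch and real-time processing.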

The system that the data originates from is called the origin, and the end point that the data is transferred to is called the destination.

A common use of a data pipeline is to transfer data from an OLTP system to an OLAP system or data warehouse.

OLTP - online transaction processing
OLAP - online analytical processing
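
As a rough illustration of moving data from an OLTP system to an OLAP system, the sketch below uses two in-memory SQLite databases from Python's standard library as stand-ins. The table names (orders, customer_totals) and the sample rows are assumptions invented for this example; a real setup would use a transactional database on one side and a data warehouse on the other. The three functions follow the batch ETL pattern: extract, transform, load.

```python
import sqlite3


def extract(oltp: sqlite3.Connection) -> list[tuple[str, float]]:
    """Extract: read the raw transaction rows from the origin."""
    return oltp.execute("SELECT customer, amount FROM orders").fetchall()


def transform(rows: list[tuple[str, float]]) -> list[tuple[str, float]]:
    """Transform: aggregate individual orders into one total per customer."""
    totals: dict[str, float] = {}
    for customer, amount in rows:
        totals[customer] = totals.get(customer, 0.0) + amount
    return sorted(totals.items())


def load(olap: sqlite3.Connection, totals: list[tuple[str, float]]) -> None:
    """Load: write the aggregated rows into the destination table."""
    olap.execute(
        "CREATE TABLE IF NOT EXISTS customer_totals "
        "(customer TEXT PRIMARY KEY, total REAL)"
    )
    olap.executemany(
        "INSERT OR REPLACE INTO customer_totals VALUES (?, ?)", totals
    )
    olap.commit()


# Stand-in for the OLTP system: many small transactional rows.
oltp = sqlite3.connect(":memory:")
oltp.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, amount REAL)"
)
oltp.executemany(
    "INSERT INTO orders (customer, amount) VALUES (?, ?)",
    [("alice", 20.0), ("bob", 5.0), ("alice", 12.5)],
)
oltp.commit()

# Stand-in for the OLAP system or data warehouse.
olap = sqlite3.connect(":memory:")
load(olap, transform(extract(oltp)))

result = olap.execute(
    "SELECT customer, total FROM customer_totals ORDER BY customer"
).fetchall()
```

The OLTP side stores one row per order, while the OLAP side stores a summary that is cheaper to query for analysis. That difference in shape is the main reason a transform step usually sits between the two.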

Overall, data pipelines are a core component of data science and engineering that you really need to be familiar with if you want to have a career in data. Reading this page should be your first step in a much longer journey to understand data and everything that comes along with it. On this journey you will come across an entire zoo (or sometimes jungle) of different software products and tools, all with different features, advantages, and disadvantages. Keep on exploring and see how many wild data analytics components you can find out there.

At the beginning, it can be confusing to sort out all of the acronyms and different products, but that doesn’t have to be a bad thing. You can treat it as something fun and interesting. You could almost compare it to hunting down new Pokémon. You gotta catch ’em all, so to speak.