The Case for a Reusable Data Pipeline Framework

September 20, 2022
5 minutes read

Author: Swathi Maheshkumar

If your IT team ingests multiple data sources, most likely you’ve built a data pipeline or two. These pipelines routinely and automatically transfer your data from its source to the data lake where it is stored or the data warehouse where it can be accessed for data analytics. Yet, every time you add a new data source, you may have to build a new data pipeline from scratch.

At CoStrategix, we developed a reusable data pipeline framework using Azure services that significantly reduces the time and effort it takes to ingest new data sources. Built on a best-practices approach, this framework helps us with both our own internal projects and our client projects. Our framework:

Transfers data from various sources as quickly as possible so it can be available for data-driven projects
Transforms data into a usable format as efficiently as possible
Ensures the quality of data before it reaches the data warehouse

The Value Proposition for a Reusable Data Pipeline Framework

We combined data collection, data transformation, and data quality monitoring into a single, reproducible framework. Our generic data pipeline framework is built from reusable components, so ingesting a new data source requires only minimal configuration changes. Using this standard framework also enables us to orchestrate the ingestion of data from multiple sources through a single pipeline.
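
To make the idea concrete, here is a minimal sketch in Python of a configuration-driven pipeline. It is not our Azure implementation: the `SourceConfig` fields, the `run_pipeline` function, and the simple cleaning rule are all hypothetical and chosen only for illustration. The point is that everything specific to a source lives in its configuration, while the pipeline steps are shared.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List

Row = Dict[str, str]


@dataclass(frozen=True)
class SourceConfig:
    """Everything that differs between sources lives here (hypothetical fields)."""
    name: str
    read: Callable[[], Iterable[Row]]  # how to fetch the raw rows
    required_columns: List[str]        # a minimal data quality rule
    target_table: str                  # where valid rows should land


def run_pipeline(sources: List[SourceConfig]) -> Dict[str, List[Row]]:
    """Run the same collect, transform, and validate steps for every source."""
    warehouse: Dict[str, List[Row]] = {}
    for src in sources:
        # Collect: read the raw rows from the source.
        rows = list(src.read())
        # Transform: trim stray whitespace from every value.
        cleaned = [{k: v.strip() for k, v in row.items()} for row in rows]
        # Validate: keep rows whose required columns are all non-empty.
        valid = [
            row for row in cleaned
            if all(row.get(col) for col in src.required_columns)
        ]
        warehouse.setdefault(src.target_table, []).extend(valid)
    return warehouse


# Adding a new source means adding one more SourceConfig, not a new pipeline.
sources = [
    SourceConfig("crm", lambda: [{"id": " 1 ", "email": "a@example.com"}],
                 ["id", "email"], "customers"),
    SourceConfig("web", lambda: [{"id": "2", "email": ""}],
                 ["id", "email"], "customers"),
]
result = run_pipeline(sources)
```

In this sketch, the second source's row is dropped because its `email` value is empty, and both sources flow through the same code path into one target table.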

Other benefits include:

Reducing the dependency on manual pipeline scripts
Shortening the data cycle time while decreasing manipulation time

How much time and manual effort can we save by standardizing our process, you might ask? We have built many data pipelines on behalf of our clients, so our level of expertise in this type of work is already high.

Yet still, our generic data pipeline framework:

Reduced the time to ingest a new data source (16x faster)
Reduced the time to create a new pipeline, even when copying the sources (4x faster)

Preventing Bad Data Ingestion

The biggest challenge that companies face when building a data pipeline is ingesting bad data. Fixing incorrect/invalid data after it has been ingested takes a lot of resources, and it’s painful to backtrack to the starting point to find the source of the incorrect data.

Having the right architecture in place with the proper checks and validation inside of the framework can provide gold data using fewer resources – while minimizing or avoiding corrections.

We employ a three-zone architecture in our data warehouses to help manage data quality. In Zone 1, we land “dirty data”: everyday files whose unmodified source data is loaded into the data tables. In Zone 2, we transform the data (append/upsert/replace) based on business requirements. By the time data reaches Zone 3, it should have undergone all data transformations and validations to become gold data – data ready for analysis by the business teams.
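
The flow through the zones can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the `id` key column, and the "required columns must be filled" rule are hypothetical, and a real warehouse would apply these steps to tables rather than in-memory lists.

```python
from typing import Dict, List

Row = Dict[str, object]


def load_zone1(raw_rows: List[Row]) -> List[Row]:
    """Zone 1: store the source rows exactly as received."""
    return [dict(row) for row in raw_rows]


def apply_to_zone2(existing: List[Row], incoming: List[Row],
                   mode: str, key: str = "id") -> List[Row]:
    """Zone 2: merge incoming rows using append, replace, or upsert."""
    if mode == "append":
        return existing + incoming
    if mode == "replace":
        return list(incoming)
    if mode == "upsert":
        # Update rows that share a key, insert the rest.
        merged = {row[key]: row for row in existing}
        for row in incoming:
            merged[row[key]] = row
        return list(merged.values())
    raise ValueError(f"unknown load mode: {mode}")


def promote_to_zone3(zone2_rows: List[Row], required: List[str]) -> List[Row]:
    """Zone 3: only rows that pass validation become gold data."""
    return [
        row for row in zone2_rows
        if all(row.get(col) not in (None, "") for col in required)
    ]


existing = [{"id": 1, "amount": 10}, {"id": 2, "amount": 20}]
incoming = load_zone1([{"id": 2, "amount": 25}, {"id": 3, "amount": None}])
zone2 = apply_to_zone2(existing, incoming, mode="upsert")
gold = promote_to_zone3(zone2, required=["id", "amount"])
```

Here the upsert updates row 2 and inserts row 3, and row 3 is then held back from Zone 3 because its `amount` is missing.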

The data pipeline framework we developed:

Employs checks on the application side (i.e., Azure Data Factory) to avoid ingesting invalid/incorrect data or file formats
Validates the data in each zone before moving it into the following zone of the data warehouse
Captures logs for all successful and failed cases, with enough detail about any invalid/incorrect data to find the root cause of an issue and correct it so that future pipeline runs do not fail
Parameterizes the pipeline to help in rerunning the pipelines for any previous day’s file ingestion
Adds additional metadata as the data traverses the pipeline to allow users to see when it was received, validated, and made available to the business users
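
As a rough illustration of how the items in this list fit together, the Python sketch below combines a file format check, row validation, logging, a run-date parameter, and added metadata in one function. It is not Azure Data Factory code: `ingest_file`, the allowed extensions, and the underscore-prefixed metadata column names are assumptions made for the example.

```python
import logging
from datetime import date, datetime, timezone
from typing import Dict, List, Optional

logger = logging.getLogger("ingestion")

# Assumption: the file formats this example pipeline accepts.
ALLOWED_EXTENSIONS = (".csv", ".parquet")


def ingest_file(file_name: str, rows: List[Dict[str, object]],
                run_date: date, required: List[str]
                ) -> Optional[List[Dict[str, object]]]:
    """Validate one day's file and return its rows with metadata, or None."""
    # Check the file format before anything is ingested.
    if not file_name.lower().endswith(ALLOWED_EXTENSIONS):
        logger.error("run_date=%s file=%s rejected: unsupported format",
                     run_date, file_name)
        return None

    received_at = datetime.now(timezone.utc).isoformat()

    # Validate rows and log which ones failed, to support root-cause analysis.
    bad_rows = [
        index for index, row in enumerate(rows)
        if any(row.get(col) in (None, "") for col in required)
    ]
    if bad_rows:
        logger.error("run_date=%s file=%s rejected: missing values in rows %s",
                     run_date, file_name, bad_rows)
        return None

    validated_at = datetime.now(timezone.utc).isoformat()
    logger.info("run_date=%s file=%s accepted: %d rows",
                run_date, file_name, len(rows))

    # Add metadata so users can see when the data was received and validated.
    return [
        {**row,
         "_run_date": run_date.isoformat(),
         "_received_at": received_at,
         "_validated_at": validated_at}
        for row in rows
    ]
```

Because the run date is a parameter rather than "today", rerunning a previous day's ingestion is a matter of calling the same function with that day's date and file, for example `ingest_file("orders.csv", rows, date(2022, 9, 19), ["id"])`.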

CoStrategix is a strategic consulting and engineering company that helps organizations realize value from digital and data. If you’re looking to harness data for decision-making, we can help you modernize your data platform infrastructure, drive insights, and advance data literacy throughout your organization.
