Modern Data Stack

Last updated Dec 7, 2023

Table of Contents

  1. Why the Modern Data Stack?
  2. Integrating with Dagster
  3. A comment I made on Social
  4. Further Links
  5. Counter Arguments against Modern Data Stack

The Modern Data Stack (MDS) comprises a suite of open-source tools designed for end-to-end analytics. This includes data ingestion, transformation, machine learning, and integration into a columnar data warehouse or lake solution, all complemented by an analytics BI dashboard backend. The stack’s versatility allows extensions for data quality, data cataloging, and more.

MDS aims to enable data insights using the best-suited tools for each process. It’s worth noting that “Modern Data Stack” is a relatively new term, with its definition still evolving.

Synonym Names
A newer term, ngods (new generation open-source data stack), has emerged. Previously, I’ve referred to this concept as the Open Data Stack Project. Additionally, Dagster introduced the term DataStack 2.0 in a recent blog post. Open Data Stack is my own definition of it.

Closed Source vs Open Source
Closed Source examples: dbt Cloud, Looker, Snowflake, Fivetran, Hightouch, Census
Open Source alternatives: Airbyte, dbt, Dagster, Superset, Reverse-ETL?

Modern Data Stack on a Laptop
DuckDB: Modern Data Stack in a Box


# Why the Modern Data Stack?


A perspective from Reddit highlights the shift in data warehousing and analytics. It underscores the reduced need for extensive teams and infrastructure, thanks to new tools that streamline data management and reporting. Particularly for small and mid-sized companies, MDS offers a competitive edge in data handling, allowing even a single data engineer to manage vast datasets efficiently.

A notable article discussing Lakehouse, Metrics Layer, and Clickhouse:
The Next Cloud Data Platform | Greylock

# Integrating with Dagster

The downside of MDS is unbundling (see Bundling vs Unbundling: Monolith Data vs Microservices), but Dagster helps integrate the full data stack and, in doing so, elevates the Modern Data Stack.

Explore more about its power with Dagster and Data Assets.


# A comment I made on Social


I often ponder over the ideal tools for a data stack. My preference leans toward a Cloud Data Warehouse such as Firebolt, Snowflake, BigQuery, Redshift, or Synapse, as a starting point.

The journey typically begins with Airbyte for data integration, followed by SQL-based transformation with dbt. Orchestrating the processes in Python with tools like Dagster is crucial.
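To make the ingest, transform, orchestrate flow concrete, here is a minimal sketch in plain Python. It does not use the Airbyte, dbt, or Dagster APIs: the functions `ingest_orders`, `transform_orders`, and `revenue_report`, the `STEPS` table, and the sample rows are all hypothetical stand-ins, and the tiny runner only imitates what an orchestrator does (run each step after its upstream dependencies).

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def ingest_orders():
    """Hypothetical stand-in for an ingestion sync (the Airbyte step)."""
    return [
        {"order_id": 1, "amount": "19.90", "status": "paid"},
        {"order_id": 2, "amount": "5.00", "status": "refunded"},
        {"order_id": 3, "amount": "42.10", "status": "paid"},
    ]


def transform_orders(raw_orders):
    """Hypothetical stand-in for a SQL model (the dbt step):
    keep paid orders and cast the amount to a number."""
    return [
        {"order_id": row["order_id"], "amount": float(row["amount"])}
        for row in raw_orders
        if row["status"] == "paid"
    ]


def revenue_report(clean_orders):
    """Downstream consumer, for example the input to a BI dashboard."""
    total = sum(row["amount"] for row in clean_orders)
    return {"orders": len(clean_orders), "revenue": round(total, 2)}


# Step name -> (function, names of upstream steps). Illustrative only.
STEPS = {
    "raw_orders": (ingest_orders, []),
    "clean_orders": (transform_orders, ["raw_orders"]),
    "revenue_report": (revenue_report, ["clean_orders"]),
}


def run_pipeline(steps):
    """Run every step after its upstream steps, passing results along."""
    graph = {name: set(deps) for name, (_, deps) in steps.items()}
    results = {}
    for name in TopologicalSorter(graph).static_order():
        func, deps = steps[name]
        results[name] = func(*(results[dep] for dep in deps))
    return results


if __name__ == "__main__":
    print(run_pipeline(STEPS)["revenue_report"])
```

A real orchestrator adds what this sketch leaves out: scheduling, retries, logging, and visibility into which step produced which dataset.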

From there, I would integrate additional open-source tools based on specific needs: Spark for processing, Delta Lake for data lake formatting and ACID Transactions, Amundsen for data cataloging, and Great Expectations for data quality, among others. For smaller projects, DuckDB is suitable for local OLAP scenarios, while Kubernetes and DevOps provide scalability.
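As a rough illustration of the local, in-process workflow mentioned for smaller projects, the sketch below runs an analytical aggregation without any server. Note the swap: it uses Python's built-in `sqlite3` module instead of DuckDB so that it runs with the standard library alone. SQLite is a row-oriented store, not a columnar OLAP engine, so this shows only the "database inside your process" workflow, not DuckDB's performance characteristics. The `events` table, its columns, and the sample rows are made up for the example.

```python
import sqlite3

# Hypothetical sample data: (day, country, revenue)
SAMPLE_EVENTS = [
    ("2023-12-01", "CH", 120.0),
    ("2023-12-01", "DE", 80.0),
    ("2023-12-02", "CH", 30.0),
    ("2023-12-02", "US", 200.0),
]


def revenue_by_country(events):
    """Load rows into an in-memory database and aggregate them with SQL."""
    con = sqlite3.connect(":memory:")  # in-process, nothing to deploy
    try:
        con.execute("CREATE TABLE events (day TEXT, country TEXT, revenue REAL)")
        con.executemany("INSERT INTO events VALUES (?, ?, ?)", events)
        return con.execute(
            """
            SELECT country, COUNT(*) AS n_events, SUM(revenue) AS total
            FROM events
            GROUP BY country
            ORDER BY total DESC
            """
        ).fetchall()
    finally:
        con.close()


if __name__ == "__main__":
    for country, n_events, total in revenue_by_country(SAMPLE_EVENTS):
        print(country, n_events, total)
```

The query itself is plain SQL (GROUP BY with COUNT and SUM), so the same statement should carry over to an analytical engine with little or no change.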

For teams without data engineering resources, closed-source options like Ascend or Foundry are viable alternatives.

Feel free to reach out for further discussion or clarifications.



# Counter Arguments against Modern Data Stack