DataEngineering OpenSource List

  • LarkMidTable : A distributed data middle-platform product based on flinkx.

  • DataSphereStudio : DataSphereStudio is a one-stop data application development and management portal, covering scenarios including data exchange, desensitization/cleansing, analysis/mining, quality measurement, visualization, and task scheduling.

Meta Management

  • Apache Atlas : Atlas is a scalable and extensible set of core foundational governance services, enabling enterprises to effectively and efficiently meet their compliance requirements within Hadoop and allowing integration with the whole enterprise data ecosystem.

  • 2020-Amundsen : Amundsen is a metadata-driven application for improving the productivity of data analysts, data scientists, and engineers when interacting with data.


  • 2016-ClickHouse : ClickHouse is an open-source column-oriented database management system capable of real-time generation of analytical data reports using SQL queries.

  • 2017-Druid : Apache Druid (incubating) is a high-performance analytics data store for event-driven data.

  • 2017-Mondrian : Mondrian is an Online Analytical Processing (OLAP) server that enables business users to analyze large quantities of data in real time.

  • 2021-Pinot : Pinot is a real-time distributed OLAP datastore, which is used at LinkedIn to deliver scalable real-time analytics with low latency.

  • 2021-Datafuse : A modern real-time data processing and analytics DBMS with a cloud-native architecture, built to make the Data Cloud easy.

  • 2023-ByConity : ByConity is a data warehouse designed for changes in modern cloud architecture. It adopts a cloud-native architecture design to meet the requirements of data warehouse users for flexible scaling, separation of reads and writes, resource isolation, and strong data consistency. At the same time, it provides excellent query and write performance.

OLAP Browser

  • 2015-Metabase : The simplest, fastest way to get business intelligence and analytics to everyone in your company.

  • Saiku : Saiku Analytics - The World's Greatest Open Source OLAP Browser

  • CBoard : An easy-to-use, self-service open BI reporting and BI dashboard platform.

  • Apache Superset : Apache Superset (incubating) is a modern, enterprise-ready business intelligence web application

  • Metatron Discovery : Metatron Discovery is an end-to-end big data self-discovery solution.

  • Poli : An easy-to-use BI server built for SQL lovers. Power data analysis in SQL and gain faster business insights.

  • 2020-Cube.js : Cube is the semantic layer for building data applications. It helps data engineers and application developers access data from modern data stores, organize it into consistent definitions, and deliver it to every application.

  • 2015-Caravel : The former name of Apache Superset (incubating), a modern, enterprise-ready business intelligence web application.

  • 2016-Redash : Make Your Company Data Driven. Connect to any data source, easily visualize, dashboard and share your data.

  • 2020-Rath : RATH is more than an open-source alternative to data analysis and visualization tools such as Tableau. It automates your exploratory data analysis workflow with an augmented analytic engine that discovers patterns, insights, and causal relationships, and presents them with powerful auto-generated multi-dimensional data visualizations.


  • Presto : Presto is an open source distributed SQL query engine for running interactive analytic queries against data sources of all sizes ranging from gigabytes to petabytes.

  • Materialize : Materialize is a streaming database for real-time applications. Materialize accepts input data from a variety of streaming sources (e.g. Kafka) and files (e.g. CSVs), and lets you query them using SQL.

  • Doris : Doris is an MPP-based interactive SQL data warehouse for reporting and analysis. It was originally developed at Baidu under the name Palo, and was renamed Doris after being donated to the Apache Software Foundation.

Business Intelligence

  • Apache Superset : Apache Superset (incubating) is a modern, enterprise-ready business intelligence web application, originally built as Airbnb's data exploration platform.

  • Grid studio : Grid studio is a web-based spreadsheet application with full integration of the Python programming language.

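  • Davinci : Davinci is a DVaaS (Data Visualization as a Service) platform solution for business users, data engineers, data analysts, and data scientists, aiming to provide a one-stop data visualization solution. It can be deployed standalone on a public or private cloud, or integrated into third-party systems as a visualization plugin. Users only need a simple configuration in the visual UI to serve a variety of data visualization applications, with support for visualization features such as advanced interaction, industry analysis, pattern exploration, and social intelligence.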

  • 2020-Querybook : Querybook is a Big Data Querying UI, combining collocated table metadata and a simple notebook interface.

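  • 2021-DataEase : DataEase is an open-source data visualization and analysis tool that helps users quickly analyze data and gain insight into business trends, so they can improve and optimize their business. DataEase supports a wide range of data source connections, lets users build charts quickly by drag and drop, and makes it easy to share them with others.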

Data Lake

  • Apache Hudi : Upserts, Deletes And Incremental Processing on Big Data.

  • 2023-Paimon : Apache Paimon (incubating) is a streaming data lake platform that supports high-speed data ingestion, change data tracking, and efficient real-time analytics.

Streaming Database

  • 2023-RisingWave : RisingWave is a cloud-native streaming database that uses SQL as the interface language. It is designed to reduce the complexity and cost of building real-time applications. RisingWave consumes streaming data, performs continuous queries, and updates results dynamically. As a database system, RisingWave maintains results inside its own storage and allows users to access data efficiently.

Data Aggregation

Data Orchestrator

  • Prefect : Prefect is a new workflow management system, designed for modern infrastructure and powered by the open-source Prefect Core workflow engine. Users organize Tasks into Flows, and Prefect takes care of the rest.

  • dagster : A data orchestrator for machine learning, analytics, and ETL.


  • awesome-etl : A curated list of awesome ETL frameworks, libraries, and software.

  • DataX : An offline data synchronization tool/platform widely used within Alibaba Group.

  • dbt : dbt (data build tool) enables data analysts and engineers to transform their data using the same practices that software engineers use to build applications.

  • 2021-SeaTunnel : SeaTunnel is a distributed, high-performance data integration platform for the synchronization and transformation of massive data (offline & real-time).

  • 2022-BitSail : BitSail is a distributed high-performance data integration engine which supports batch, streaming and incremental scenarios. BitSail is widely used to synchronize hundreds of trillions of data every day.

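  • 2023-DataCap : DataCap is integrated software for data transformation, integration, and visualization. It supports many data sources, file types, big-data databases, relational databases, NoSQL databases, and more. With it, users can manage multiple data sources, run various operations and transformations on the data in each source, build data charts, and monitor data sources.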

CDC & Data Pipeline

  • Debezium : Debezium is a distributed platform that turns your existing databases into event streams, so applications can see and respond immediately to each row-level change in the databases.

  • Canal : Alibaba's incremental subscription and consumption component for MySQL binlog. Alibaba Cloud DRDS, Alibaba TDDL secondary indexes, and small-table replication are powered by Canal.

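  • Otter : Alibaba's distributed database synchronization system (built to handle synchronization between data centers in China and the US).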

  • Arc : Arc is an opinionated framework for defining predictable, repeatable, and manageable data transformation pipelines.

  • 2021-Vector : Vector is a high-performance, end-to-end (agent & aggregator) observability data pipeline that puts you in control of your observability data. Collect, transform, and route all your logs, metrics, and traces to any vendors you want today and any other vendors you may want tomorrow. Vector enables dramatic cost reduction, novel data enrichment, and data security where you need it, not where it is most convenient for your vendors. Additionally, it is open source and up to 10x faster than every alternative in the space.

  • 2022-Airbyte : Data integration platform for ELT pipelines from APIs, databases & files to databases, warehouses & lakes