streamforge-ai

Introduction: Open-source real-time AI data pipeline for CDC ingestion, feature generation, and storage-aware prefetching

1. Project Overview

StreamForge AI is a real-time data pipeline platform for AI and analytics workloads. It focuses on:

  • CDC ingestion from operational databases
  • Stream processing for feature generation
  • Object-storage-based data sinking
  • Storage-aware prefetching for ML workloads

2. Motivation

  • Provide a minimal but realistic open-source AI data pipeline
  • Support local development and demo environments
  • Showcase best practices in streaming, storage, and pipeline orchestration
  • Demonstrate architecture leadership and contributor collaboration

3. Non-goals

  • Full production-grade multi-tenant platform
  • Large-scale distributed control plane
  • Enterprise authentication / authorization in v0.1

4. Architecture Summary

4.1 Ingestion layer

Uses Debezium to capture row-level changes from MySQL/Postgres and publish them to Kafka topics.
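To make the ingestion step concrete, the sketch below builds a Debezium MySQL source connector configuration as a Python dict. The connector name, hostnames, credentials, and database names are illustrative assumptions, not values from this repo's deploy scripts.

```python
# Hypothetical Debezium MySQL connector config; the name "streamforge-mysql",
# hosts, credentials, and the "inventory" database are illustrative assumptions.
connector_config = {
    "name": "streamforge-mysql",
    "config": {
        # Debezium's MySQL source connector class
        "connector.class": "io.debezium.connector.mysql.MySqlConnector",
        "database.hostname": "mysql",
        "database.port": "3306",
        "database.user": "debezium",
        "database.password": "dbz",
        # Unique numeric ID the connector uses when joining MySQL replication
        "database.server.id": "184054",
        # Logical name used as the prefix for emitted Kafka topics
        "topic.prefix": "streamforge",
        "database.include.list": "inventory",
        # Debezium stores schema history in its own Kafka topic
        "schema.history.internal.kafka.bootstrap.servers": "kafka:9092",
        "schema.history.internal.kafka.topic": "schema-history.inventory",
    },
}
```

Registering the connector is typically an HTTP POST of this JSON to the Kafka Connect REST API (for example `http://localhost:8083/connectors`); after that, row-level changes appear on topics prefixed with `streamforge.`.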

4.2 Streaming layer

Uses Apache Flink to:

  • consume CDC events
  • perform cleaning and transformation
  • compute simple feature aggregations
  • write processed outputs to storage
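In the pipeline these steps run inside a Flink job; as a plain-Python sketch of the logic only, the snippet below unwraps a Debezium-style change event and computes a toy per-customer feature aggregation. The event envelope and field names (`customer_id`, `amount`) are illustrative assumptions.

```python
import json
from collections import defaultdict

def unwrap(event: str) -> dict:
    """Extract the after-image row from a Debezium-style change event."""
    payload = json.loads(event).get("payload", {})
    return payload.get("after") or {}

def aggregate(rows):
    """Running order count and amount sum per customer (a toy feature)."""
    feats = defaultdict(lambda: {"order_count": 0, "total_amount": 0.0})
    for row in rows:
        f = feats[row["customer_id"]]
        f["order_count"] += 1
        f["total_amount"] += float(row["amount"])
    return dict(feats)

# Two synthetic insert ("op": "c") events for the same customer.
events = [
    '{"payload": {"op": "c", "after": {"customer_id": 1, "amount": "9.50"}}}',
    '{"payload": {"op": "c", "after": {"customer_id": 1, "amount": "3.00"}}}',
]
features = aggregate(unwrap(e) for e in events)
# features == {1: {"order_count": 2, "total_amount": 12.5}}
```

A real Flink job would express the same shape as keyed, windowed state with checkpointing, but the clean/transform/aggregate structure is the same.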

4.3 Storage layer

MinIO/S3-compatible storage is the initial storage target (and is optionally exercised by the prefetch demo). Future versions may support Iceberg table sinks for incremental analytics.
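One useful property of an S3-compatible sink is a predictable, partitioned key layout. The helper below sketches a Hive-style, date-partitioned naming convention; the `features/` prefix and `.jsonl` suffix are illustrative assumptions rather than the repo's actual layout, though a layout like this also ports naturally to an Iceberg partition spec later.

```python
from datetime import datetime, timezone

def output_key(table: str, ts: datetime, part_file: int) -> str:
    """Build a Hive-style, date-partitioned object key.

    Layout (assumed, not mandated by the repo):
    features/<table>/dt=YYYY-MM-DD/hour=HH/part-NNNNN.jsonl
    """
    return (
        f"features/{table}/"
        f"dt={ts:%Y-%m-%d}/hour={ts:%H}/"
        f"part-{part_file:05d}.jsonl"
    )

key = output_key("orders", datetime(2024, 5, 1, 13, tzinfo=timezone.utc), 7)
# key == "features/orders/dt=2024-05-01/hour=13/part-00007.jsonl"
```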

4.4 Prefetch layer

A lightweight prefetch engine analyzes expected access patterns and pulls selected objects into a hot cache area before an ML job starts (implemented in prefetch-engine/).
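As a minimal sketch of this idea (not the actual prefetch-engine/ implementation), the snippet below copies the objects an upcoming job is expected to read from a "cold" store into a hot cache directory, largest-first under a byte budget; the greedy size-based priority and the budget are assumptions standing in for a real benefit score.

```python
import shutil
import tempfile
from pathlib import Path

def prefetch(cold_dir: Path, hot_dir: Path, expected_keys, budget_bytes: int) -> list:
    """Pull expected objects into the hot cache, biggest first, within a budget.

    Greedy size-descending ordering is an illustrative stand-in for a real
    access-pattern-based benefit score.
    """
    fetched, used = [], 0
    candidates = sorted(
        (cold_dir / k for k in expected_keys if (cold_dir / k).exists()),
        key=lambda p: p.stat().st_size,
        reverse=True,
    )
    for src in candidates:
        size = src.stat().st_size
        if used + size > budget_bytes:
            continue  # over budget; skip and try smaller objects
        shutil.copy2(src, hot_dir / src.name)  # pull into the hot cache area
        fetched.append(src.name)
        used += size
    return fetched

# Tiny demo with a simulated cold store on the local filesystem.
cold = Path(tempfile.mkdtemp())
hot = Path(tempfile.mkdtemp())
(cold / "a.parquet").write_bytes(b"x" * 100)
(cold / "b.parquet").write_bytes(b"y" * 50)
got = prefetch(cold, hot, ["a.parquet", "b.parquet"], budget_bytes=120)
# got == ["a.parquet"]  (b.parquet would exceed the 120-byte budget)
```

In the real demo the cold store would be MinIO/S3 objects rather than local files, but the shape (expected keys in, cached objects out before the ML job starts) is the same.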

5. Initial module boundaries

  • stream-processor/
  • prefetch-engine/
  • deploy/

6. Design decisions

Why Kafka

Kafka is a widely adopted event backbone and integrates naturally with Debezium and Flink.

Why Flink

Flink provides strong streaming semantics, checkpointing, and flexible event processing.

Why MinIO first

MinIO is simple for local demos and provides an S3-compatible interface.

Why prefetch demo

Prefetching is a practical optimization for AI pipelines with repeated object access and training cold starts.

7. Core Features

This repo focuses on an MVP demo set that illustrates the intended architecture:

  • MySQL -> Kafka CDC ingestion via Debezium (see deploy/cdc-mysql-kafka-debezium/)
  • End-to-end demo: MySQL -> Debezium -> Kafka -> Flink -> MinIO (see deploy/cdc-flink-minio-demo/)
  • Storage-aware prefetching demo for ML workloads (see prefetch-engine/)
  • Optional MinIO upload of processed outputs from the prefetch demo (see prefetch-engine/README.md)

Planned next:

  • Additional storage sinks (e.g., Iceberg)
  • Metrics and observability
  • Benchmark scenarios

8. Roadmap

v0.1 Local demo (MVP)

  • Storage-aware prefetch demo (prefetch-engine/)
  • MySQL -> Kafka CDC ingestion via Debezium (deploy/cdc-mysql-kafka-debezium/)
  • End-to-end demo: MySQL -> Debezium -> Kafka -> Flink -> MinIO (deploy/cdc-flink-minio-demo/)
  • Runnable local stack (Docker Compose) for the full demo (deploy/cdc-flink-minio-demo/docker-compose.yml)

v0.2 Streaming + sinks

  • Flink stream processor job + example (stream-processor/)
  • MinIO/S3 sink with an output layout and naming conventions (deploy/cdc-flink-minio-demo/feature-sink/)
  • Iceberg sink support (optional)

v0.3 Hardening + integrations

  • Schema evolution handling
  • Metrics and observability
  • Benchmark scenarios
  • Training-job integration example
  • Backfill / replay tooling for historical reprocessing
  • Data quality checks (basic validation + drift signals)
  • Cost/performance tuning guide (Flink, Kafka, MinIO, Iceberg)

v0.4 Lakehouse + governance

  • Iceberg-first sink mode (partitioning, compaction, snapshots)
  • Table/catalog integration (REST catalog or Hive Metastore)
  • Data lineage basics (job -> dataset -> feature artifacts)
  • Access patterns for offline/online features (example layouts)

v0.5 Platformization

  • Pipeline configuration as code (YAML) + validation
  • Simple control-plane API for starting/stopping pipelines
  • Web UI for demo environments (status, logs, artifacts)
  • Basic authn/authz for local multi-user demos