streamforge-ai
1. Project Overview
StreamForge AI is a real-time data pipeline platform for AI and analytics workloads. It focuses on:
- CDC ingestion from operational databases
- stream processing for feature generation
- writing processed data to object storage
- storage-aware prefetching for ML workloads
2. Motivation
- Provide a minimal but realistic open-source AI data pipeline
- Support local development and demo environments
- Showcase best practices in streaming, storage, and pipeline orchestration
- Demonstrate architecture leadership and contributor collaboration
3. Non-goals
- Full production-grade multi-tenant platform
- Large-scale distributed control plane
- Enterprise authentication / authorization in v0.1
4. Architecture Summary
4.1 Ingestion layer
Uses Debezium to capture row-level changes from MySQL/Postgres and publish them to Kafka topics.
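As an illustration of how such a connector might be registered, the sketch below builds a Kafka Connect payload for a Debezium MySQL connector and posts it to the Connect REST API. It is not taken from this repository: the function names, credentials, server ID, and topic names are placeholders, and the property names assume Debezium 2.x (older releases use `database.server.name` and `database.history.*` instead).

```python
import json
import urllib.request


def build_mysql_connector_config(name, host, database, bootstrap_servers):
    """Return a Kafka Connect payload for a Debezium MySQL connector.

    All values are illustrative placeholders, not settings from this repo.
    """
    return {
        "name": name,
        "config": {
            "connector.class": "io.debezium.connector.mysql.MySqlConnector",
            "database.hostname": host,
            "database.port": "3306",
            "database.user": "debezium",      # placeholder credentials
            "database.password": "dbz",
            "database.server.id": "184054",   # must be unique among MySQL replication clients
            "topic.prefix": name,             # topics are named <prefix>.<database>.<table>
            "database.include.list": database,
            "schema.history.internal.kafka.bootstrap.servers": bootstrap_servers,
            "schema.history.internal.kafka.topic": f"schema-history.{name}",
        },
    }


def register_connector(connect_url, payload):
    """POST the payload to Kafka Connect (needs a running Connect worker)."""
    request = urllib.request.Request(
        f"{connect_url}/connectors",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

With a local Connect worker running, a call such as `register_connector("http://localhost:8083", build_mysql_connector_config("inventory", "mysql", "inventory", "kafka:9092"))` would create the connector. The host names and port here are assumptions about a typical Docker Compose setup.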
4.2 Streaming layer
Uses Apache Flink to:
- consume CDC events
- perform cleaning and transformation
- compute simple feature aggregations
- write processed outputs to storage
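The steps above can be sketched in plain Python, independent of Flink, to show the shape of the logic. This is a minimal illustration rather than the project's actual job: the `user_id` and `amount` columns are hypothetical, the amount is assumed to arrive as a plain number, and only insert (`c`) and snapshot (`r`) events are counted so that updates and deletes do not need retraction logic.

```python
from collections import defaultdict


def clean_event(event):
    """Extract the new row state from a Debezium change event, or None to drop it."""
    # The envelope may or may not be wrapped in a schema/payload structure.
    payload = event.get("payload", event)
    # Keep inserts ("c") and snapshot reads ("r"); skip updates and deletes.
    if payload.get("op") not in ("c", "r"):
        return None
    row = payload.get("after")
    # Drop rows missing the (hypothetical) columns this sketch relies on.
    if not row or row.get("user_id") is None or row.get("amount") is None:
        return None
    return {"user_id": row["user_id"], "amount": float(row["amount"])}


def aggregate_features(events):
    """Compute a per-user row count and amount total from cleaned events."""
    totals = defaultdict(lambda: {"order_count": 0, "total_amount": 0.0})
    for event in events:
        row = clean_event(event)
        if row is None:
            continue
        stats = totals[row["user_id"]]
        stats["order_count"] += 1
        stats["total_amount"] += row["amount"]
    return dict(totals)
```

In a real Flink job the same two stages would typically map to a filter/map step followed by a keyed aggregation, with state and checkpointing managed by Flink instead of an in-memory dictionary.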
4.3 Storage layer
MinIO/S3-compatible storage is the initial storage target (and is optionally exercised by the prefetch demo). Future versions may support Iceberg table sinks for incremental analytics.
4.4 Prefetch layer
A lightweight prefetch engine analyzes expected access patterns and pulls selected objects into a hot cache area before an ML job starts (implemented in prefetch-engine/).
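One way to picture this is a two-step process: rank objects by how often they are expected to be read, then copy the highest-ranked ones into the cache until a size budget is exhausted. The sketch below is an assumption about how that could look, not the code in prefetch-engine/. The function names and the byte budget are hypothetical, and local directories stand in for object storage and the hot cache.

```python
import shutil
from collections import Counter
from pathlib import Path


def plan_prefetch(expected_accesses, sizes, cache_budget_bytes):
    """Pick the most frequently expected objects that fit in the cache budget."""
    counts = Counter(expected_accesses)
    # Rank by expected access count (descending), then by key for a stable order.
    ranked = sorted(counts, key=lambda key: (-counts[key], key))
    plan, used = [], 0
    for key in ranked:
        size = sizes.get(key)
        # Skip objects of unknown size or ones that would exceed the budget.
        if size is None or used + size > cache_budget_bytes:
            continue
        plan.append(key)
        used += size
    return plan


def prefetch(plan, source_dir, cache_dir):
    """Copy planned objects from source_dir into cache_dir, keeping relative keys."""
    copied = []
    for key in plan:
        target = Path(cache_dir) / key
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copyfile(Path(source_dir) / key, target)
        copied.append(target)
    return copied
```

For example, with expected accesses `["a", "b", "a", "c", "b", "a"]`, sizes `{"a": 50, "b": 60, "c": 40}`, and a budget of 100 bytes, the plan is `["a", "c"]`: `b` is read more often than `c` but does not fit once `a` is cached. This greedy choice is simple rather than optimal.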
5. Initial module boundaries
- stream-processor/
- prefetch-engine/
- deploy/
6. Design decisions
Why Kafka
Kafka is a widely adopted event backbone and works naturally with Debezium and Flink.
Why Flink
Flink provides event-time processing, stateful stream operators, and checkpoint-based fault tolerance, which together give flexible and reliable event processing.
Why MinIO first
MinIO is simple for local demos and provides an S3-compatible interface.
Why prefetch demo
Prefetching is a practical optimization for AI pipelines with repeated object access and training cold starts.
Core Features
This repo focuses on an MVP demo set that illustrates the intended architecture:
- MySQL -> Kafka CDC ingestion via Debezium (see deploy/cdc-mysql-kafka-debezium/)
- End-to-end demo: MySQL -> Debezium -> Kafka -> Flink -> MinIO (see deploy/cdc-flink-minio-demo/)
- Storage-aware prefetching demo for ML workloads (see prefetch-engine/)
- Optional MinIO upload of processed outputs from the prefetch demo (see prefetch-engine/README.md)
Planned next:
- Additional storage sinks (e.g., Iceberg)
- Metrics and observability
- Benchmark scenarios
Roadmap
v0.1 Local demo (MVP)
- Storage-aware prefetch demo (prefetch-engine/)
- MySQL -> Kafka CDC ingestion via Debezium (deploy/cdc-mysql-kafka-debezium/)
- End-to-end demo: MySQL -> Debezium -> Kafka -> Flink -> MinIO (deploy/cdc-flink-minio-demo/)
- Runnable local stack (Docker Compose) for the full demo (deploy/cdc-flink-minio-demo/docker-compose.yml)
v0.2 Streaming + sinks
- Flink stream processor job + example (stream-processor/)
- MinIO/S3 sink with an output layout and naming conventions (deploy/cdc-flink-minio-demo/feature-sink/)
- Iceberg sink support (optional)
v0.3 Hardening + integrations
- Schema evolution handling
- Metrics and observability
- Benchmark scenarios
- Training-job integration example
- Backfill / replay tooling for historical reprocessing
- Data quality checks (basic validation + drift signals)
- Cost/performance tuning guide (Flink, Kafka, MinIO, Iceberg)
v0.4 Lakehouse + governance
- Iceberg-first sink mode (partitioning, compaction, snapshots)
- Table/catalog integration (REST catalog or Hive Metastore)
- Data lineage basics (job -> dataset -> feature artifacts)
- Access patterns for offline/online features (example layouts)
v0.5 Platformization
- Pipeline configuration as code (YAML) + validation
- Simple control-plane API for starting/stopping pipelines
- Web UI for demo environments (status, logs, artifacts)
- Basic authn/authz for local multi-user demos
Roadmap Links
- GitHub issues: https://github.com/cchenax/streamforge-ai/issues
- GitHub projects (if/when populated): https://github.com/cchenax/streamforge-ai/projects
Contribution Links
- Contribution guide: CONTRIBUTING.md
- Start a topic/track progress: https://github.com/cchenax/streamforge-ai/issues/new/choose
- Open PRs to main: https://github.com/cchenax/streamforge-ai/pulls
