streamforge-ai
1. Project Overview
StreamForge AI is a real-time data pipeline platform for AI and analytics workloads. It focuses on:
- CDC ingestion from operational databases
- stream processing for feature generation
- object-storage-based data sinking
- storage-aware prefetching for ML workloads
2. Motivation
- Provide a minimal but realistic open-source AI data pipeline
- Support local development and demo environments
- Showcase best practices in streaming, storage, and pipeline orchestration
- Demonstrate architecture leadership and contributor collaboration
3. Non-goals
- Full production-grade multi-tenant platform
- Large-scale distributed control plane
- Enterprise authentication / authorization in v0.1
Architecture Summary
4.1 Ingestion layer
Uses Debezium to capture row-level changes from MySQL/Postgres and publish them to Kafka topics.
4.2 Streaming layer
Uses Apache Flink to:
- consume CDC events
- perform cleaning and transformation
- compute simple feature aggregations
- write processed outputs to storage
4.3 Storage layer
MinIO/S3-compatible storage is the initial storage target (and is optionally exercised by the prefetch demo). Future versions may support Iceberg table sinks for incremental analytics.
4.4 Prefetch layer
A lightweight prefetch engine analyzes expected access patterns and pulls selected objects into a hot cache area before an ML job starts (implemented in prefetch-engine/).
5. Initial module boundaries
stream-processor/prefetch-engine/deploy/
6. Design decisions
Why Kafka
Kafka is a widely adopted event backbone and works naturally with Debezium and Flink.
Why Flink
Flink provides strong streaming semantics, checkpointing, and event processing flexibility.
Why MinIO first
MinIO is simple for local demos and provides an S3-compatible interface.
Why prefetch demo
Prefetching is a practical optimization for AI pipelines with repeated object access and training cold starts.
Core Features
This repo focuses on an MVP demo set that illustrates the intended architecture:
- MySQL -> Kafka CDC ingestion via Debezium (see
deploy/cdc-mysql-kafka-debezium/) - End-to-end demo: MySQL -> Debezium -> Kafka -> Flink -> MinIO (see
deploy/cdc-flink-minio-demo/) - Storage-aware prefetching demo for ML workloads (see
prefetch-engine/) - Optional MinIO upload of processed outputs from the prefetch demo (see
prefetch-engine/README.md)
Planned next:
- Additional storage sinks (e.g., Iceberg)
- Metrics and observability
- Benchmark scenarios
Roadmap
v0.1 Local demo (MVP)
- Storage-aware prefetch demo (
prefetch-engine/) - MySQL -> Kafka CDC ingestion via Debezium (
deploy/cdc-mysql-kafka-debezium/) - End-to-end demo: MySQL -> Debezium -> Kafka -> Flink -> MinIO (
deploy/cdc-flink-minio-demo/) - Runnable local stack (Docker Compose) for the full demo (
deploy/cdc-flink-minio-demo/docker-compose.yml)
v0.2 Streaming + sinks
- Flink stream processor job + example (
stream-processor/) - MinIO/S3 sink with an output layout and naming conventions (
deploy/cdc-flink-minio-demo/feature-sink/) - Iceberg sink support (optional)
v0.3 Hardening + integrations
- Schema evolution handling
- Metrics and observability
- Benchmark scenarios
- Training-job integration example
- Backfill / replay tooling for historical reprocessing
- Data quality checks (basic validation + drift signals)
- Cost/performance tuning guide (Flink, Kafka, MinIO, Iceberg)
v0.4 Lakehouse + governance
- Iceberg-first sink mode (partitioning, compaction, snapshots)
- Table/catalog integration (REST catalog or Hive Metastore)
- Data lineage basics (job -> dataset -> feature artifacts)
- Access patterns for offline/online features (example layouts)
v0.5 Platformization
- Pipeline configuration as code (YAML) + validation
- Simple control-plane API for starting/stopping pipelines
- Web UI for demo environments (status, logs, artifacts)
- Basic authn/authz for local multi-user demos
v0.6 Expanded sources & feature serving
- Multi-source CDC support (PostgreSQL, MongoDB)
- Feature store integration (Feast, Tecton-compatible interface)
- Online feature serving API with caching
- Schema registry integration (Confluent Schema Registry or Apicurio)
- Stream-to-batch joins (Flink with Iceberg dimension tables)
v0.7 ML/AI deeper integration
- Real-time feature drift detection & alerting
- Streaming windowed feature examples (rolling aggregates, sessionization)
- Real-time inference pipeline example (Flink + model server)
- Training data sampling/downsampling for active learning
- Metadata catalog for features & pipelines (OpenLineage/Marquez)
v0.8 Enterprise readiness
- Kubernetes operator for cluster deployment
- Secrets management (Vault integration or Kubernetes Secrets)
- Fine-grained RBAC for pipelines/datasets
- Audit logging (API, job, and data access events)
- High-availability mode for control plane & critical services
Roadmap Links
- GitHub issues: https://github.com/cchenax/streamforge-ai/issues
- GitHub projects (if/when populated): https://github.com/cchenax/streamforge-ai/projects
Contribution Links
- Contribution guide:
CONTRIBUTING.md - Start a topic/track progress: https://github.com/cchenax/streamforge-ai/issues/new/choose
- Open PRs to
main: https://github.com/cchenax/streamforge-ai/pulls
