| Function | Open-source Tools | Cloud Services |
|---|---|---|
| Orchestration | Apache Airflow, Luigi, Argo Workflows | AWS Step Functions, Azure Logic Apps, Google Cloud Composer |
| ETL (Extract, Transform, Load) | Apache NiFi, Apache Beam, Apache Spark, Talend Open Studio (QL) | Azure Data Factory, AWS Glue, Google Cloud Dataflow, IBM DataStage, Matillion |
| Data Warehousing | Apache Hive, Apache HBase, ClickHouse, Greenplum, Druid | Azure Synapse Analytics, Amazon Redshift, Google BigQuery, Snowflake, Teradata Vantage |
| Data Transformation | Apache Spark, Apache Flink, Apache Beam, dbt, Presto (Trino) | Databricks, Google Cloud Dataprep, AWS Glue, DataRobot |
| Data Monitoring | Prometheus, Grafana, Nagios, Zabbix, New Relic | Azure Monitor, AWS CloudWatch, Google Cloud Operations Suite |
| Data Governance | Apache Atlas, OpenLineage, DataHub | Azure Purview, AWS Lake Formation, Google Cloud Data Catalog |
| Data Storage | HDFS, Apache Cassandra, MongoDB, MinIO, Ceph | Azure Blob Storage, AWS S3, Google Cloud Storage, Snowflake |
| Data Integration | Apache Kafka, RabbitMQ, Apache Pulsar, NATS | Azure Event Grid, AWS SNS/SQS, Google Cloud Pub/Sub |
| Data Security | Apache Ranger, Vault, Keycloak | Azure Key Vault, AWS KMS, Google Cloud Secret Manager |
| Data Visualization | Apache Superset, Metabase, Redash | Power BI, Tableau, Google Data Studio, AWS QuickSight |
| Data Lineage | Marquez, OpenLineage, Apache Atlas | Azure Purview, Alation, AWS Glue Data Catalog |
| Batch Data Processing | Apache Hadoop, Apache Spark, Apache Flink, Apache Beam | AWS Batch, Google Cloud Dataproc, Azure HDInsight, Databricks |
| Real-Time Data Processing | Apache Kafka, Apache Flink, Apache Pulsar | AWS Kinesis, Azure Stream Analytics, Google Cloud Dataflow |
| Machine Learning | TensorFlow, PyTorch, Scikit-learn, XGBoost, H2O.ai | Azure Machine Learning, AWS SageMaker, Google AI Platform, IBM Watson |
| Data Backup & Recovery | Bacula, Duplicity, Restic | Azure Backup, AWS Backup, Google Cloud Backup & DR |
| Data Pipeline Testing | Great Expectations, dbt, pytest | Azure Data Factory Monitoring, AWS Glue Monitoring |
| Data Cleansing | Trifacta, DataWrangler | Google Cloud Dataprep, AWS Glue DataBrew |
| Data Streaming | Apache Kafka, Apache Pulsar, Apache Flink, Spark Streaming | Azure Event Hubs, AWS Kinesis, Google Cloud Pub/Sub |
| Job Scheduling | Cron, Celery, Rundeck | Azure Scheduler, AWS Batch, Google Cloud Scheduler |
| Data Cataloging | Apache Atlas, Amundsen, DataHub | Azure Purview, AWS Glue Data Catalog, Google Cloud Data Catalog |
More:
Criteria: