🗞️ Here's your weekly ASF release roundup! 🗞️

👉 Airflow Providers packages were just released. Users can install the providers via PyPI here: https://buff.ly/bXxWbgy

👉 Apache IoTDB 2.0.8 is now available for download: https://buff.ly/lDycode Apache IoTDB (Database for Internet of Things) is an IoT-native database with high performance for data management and analysis, deployable on the edge and the cloud.

👉 Apache APISIX 3.16.0 has been released. Download here: https://buff.ly/FZPmvxb

👉 Apache Storm is a distributed, fault-tolerant, and high-performance realtime computation system that provides strong guarantees on the processing of data. Apache Storm version 2.8.6 is now available for download: https://buff.ly/bmD4NRj

👉 Apache SkyWalking MCP 0.2.0 is now available for download: https://buff.ly/eo5sqCs SkyWalking is an APM (application performance monitor) tool for distributed systems, especially designed for microservices, cloud-native, and container-based (Docker, Kubernetes, Mesos) architectures.

👉 Apache Ant 1.10.17 is now available for download. Apache Ant is a Java library and command-line tool that helps build software.

👉 Apache NiFi is an easy-to-use, powerful, and reliable system to process and distribute data. Apache NiFi 2.9.0 is now available for download: https://buff.ly/yg9vvu0

#opensource #data #machinelearning #cloudcomputing #java #NoSQL #webserver #hadoop
Weekly ASF Release Roundup: Airflow Providers, Apache IoTDB, APISIX, Storm, SkyWalking, Ant, NiFi
More Relevant Posts
-
Great article by Deniz Sendil on how real-time CDC using Oracle GoldenGate and open table formats are converging in modern lakehouse architectures. Especially interesting to see direct Apache Iceberg writes without an external processing engine, plus the flexibility around catalogs and multicloud deployment options.
🚀 New blog post: Building a Real-Time Apache Iceberg Lakehouse on OCI Object Storage with OCI GoldenGate

Apache Iceberg has become the go-to open table format for modern lakehouse architectures, and OCI GoldenGate’s support for it keeps getting stronger. In my latest blog, I walk through how to configure OCI GoldenGate Big Data to replicate changes from a source database into Apache Iceberg tables on OCI Object Storage in real time, with no external processing engine needed.

Here’s what makes this architecture compelling:
✅ OCI GoldenGate writes directly to Iceberg tables using the Iceberg Java API
✅ Uses the OCI Object Storage S3 Compatibility API, making it flexible and familiar
✅ Supports Hadoop, Nessie, Polaris, and REST catalogs
✅ Extends to multicloud environments with OCI GoldenGate on Azure and Google Cloud

The blog covers step-by-step configuration, best practices such as key handling, aggregation windows, and uncompressed operations, and known limitations to watch out for.

If you’re building a real-time open lakehouse on OCI, or thinking about it, this one’s for you 👇
🔗 https://lnkd.in/gwdnymzF

#OCI #GoldenGate #ApacheIceberg #DataEngineering #Lakehouse #DataIntegration #Oracle #CDC #CloudData

Jeffrey T. Pollock Denis Gray Tomas Vavra Peter Inzana Alex Lima Alex Kotopoulis Julien Testut Shrinidhi Kulkarni
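For readers who have not touched the Iceberg Java API that GoldenGate uses under the hood, here is a rough idea of what a direct write path over an S3-compatible endpoint involves. This is a minimal, hypothetical sketch, not the GoldenGate handler itself; the endpoint URL, bucket, table, and credential values are placeholders, and it assumes the iceberg-core and hadoop-aws dependencies are on the classpath.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.iceberg.PartitionSpec;
import org.apache.iceberg.Schema;
import org.apache.iceberg.Table;
import org.apache.iceberg.catalog.TableIdentifier;
import org.apache.iceberg.hadoop.HadoopCatalog;
import org.apache.iceberg.types.Types;

public class IcebergOnObjectStorage {
    public static void main(String[] args) {
        // Point the Hadoop S3A filesystem at an S3-compatible endpoint.
        // For OCI Object Storage this would be the S3 Compatibility API endpoint
        // of your tenancy (placeholder shown; verify against your region).
        Configuration conf = new Configuration();
        conf.set("fs.s3a.endpoint", "https://<namespace>.compat.objectstorage.<region>.oraclecloud.com");
        conf.set("fs.s3a.path.style.access", "true");
        conf.set("fs.s3a.access.key", "<access-key>");
        conf.set("fs.s3a.secret.key", "<secret-key>");

        // A Hadoop catalog keeps table metadata directly in the object store;
        // the blog also covers Nessie, Polaris, and REST catalogs.
        HadoopCatalog catalog = new HadoopCatalog(conf, "s3a://lakehouse-bucket/warehouse");

        Schema schema = new Schema(
            Types.NestedField.required(1, "order_id", Types.LongType.get()),
            Types.NestedField.optional(2, "status", Types.StringType.get()));

        // Create a table that a CDC writer could then append change records to.
        Table table = catalog.createTable(
            TableIdentifier.of("replication", "orders"),
            schema,
            PartitionSpec.unpartitioned());

        System.out.println("Created Iceberg table at " + table.location());
    }
}
```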
-
Aggregation Strategies in Apache Flink: Keyed State vs Time Windows

When building stateful streaming pipelines in Apache Flink, aggregation design is not just about correctness; it directly impacts state growth, memory pressure, checkpoint size, and long-term scalability.

A subtle but critical design choice is how you group your data:
1. Using a derived string column (e.g., event_hour)
2. Using native time windows (window_start, window_end)

Let's walk through an example.

Option 1: Grouping by a derived event_hour column

    SELECT
        DATE_FORMAT(event_time, 'yyyy-MM-dd HH:00:00') AS event_hour,
        COUNT(*) AS events_count
    FROM events
    GROUP BY DATE_FORMAT(event_time, 'yyyy-MM-dd HH:00:00');

How is state managed?
-> Flink creates keyed state per unique event_hour value.
-> Each hour becomes a distinct aggregation key. Internally:
    Key = "2026-05-01 10:00:00" → count
    Key = "2026-05-01 11:00:00" → count
    ... and so on

Problem: potentially unbounded state growth
1. Flink treats derived string values as ordinary grouping keys and does not associate them with event-time window lifecycle semantics.
2. No automatic cleanup is tied to time progression.
3. State keeps growing over time unless TTL, retention, or custom cleanup mechanisms are configured.

At hourly granularity that is 24 keys/day and 8,760 keys/year, and rising key cardinality means rising long-term state size.

Operational impact:
1. Growing RocksDB state size
2. Longer checkpoint durations
3. Higher restore latency
4. Potential backpressure over time

Option 2: Grouping by hourly time windows (recommended)

    SELECT window_start, window_end, COUNT(*) AS events_count
    FROM TABLE(
        TUMBLE(TABLE events, DESCRIPTOR(event_time), INTERVAL '1' HOUR)
    )
    GROUP BY window_start, window_end;

Why is this fundamentally different?
1. State is scoped to a time window.
2. Windows have a defined lifecycle.
3. Flink leverages watermarks, window lifecycle management, and event-time progression.

Once the watermark passes the window end time (plus allowed lateness, if configured), Flink automatically cleans up the window state.

A small change in aggregation strategy can lead to massive optimization gains in Apache Flink. Flink is not a finite query engine over static data; it is a continuous stream processor managing evolving state over unbounded data. Design your aggregations the way Flink understands time, not the way traditional databases group rows.

#ApacheFlink #RealTimeDataStreaming #DataEngineering #FlinkOptimization
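One detail the SQL above takes for granted: the TUMBLE function only releases state because the events table declares an event-time attribute with a watermark. Below is a minimal sketch of the surrounding job in Flink's Java Table API, with illustrative Kafka connector options; the table.exec.state.ttl line is the safety net you would want if you stayed with the derived-key approach from Option 1.

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class HourlyEventCounts {
    public static void main(String[] args) {
        TableEnvironment tEnv = TableEnvironment.create(
            EnvironmentSettings.newInstance().inStreamingMode().build());

        // Only relevant for the derived-key style of Option 1: expire idle keyed
        // state after ~26 hours so old event_hour keys do not accumulate forever.
        tEnv.getConfig().getConfiguration().setString("table.exec.state.ttl", "26 h");

        // The WATERMARK clause gives Flink an event-time attribute, which is what
        // lets the TUMBLE window in Option 2 expire its state automatically.
        // Connector options are placeholders, not a specific deployment.
        tEnv.executeSql(
            "CREATE TABLE events (" +
            "  user_id STRING," +
            "  event_time TIMESTAMP(3)," +
            "  WATERMARK FOR event_time AS event_time - INTERVAL '5' SECOND" +
            ") WITH (" +
            "  'connector' = 'kafka'," +
            "  'topic' = 'events'," +
            "  'properties.bootstrap.servers' = 'localhost:9092'," +
            "  'scan.startup.mode' = 'latest-offset'," +
            "  'format' = 'json'" +
            ")");

        // Option 2 from the post: state lives only for the lifetime of each window.
        tEnv.executeSql(
            "SELECT window_start, window_end, COUNT(*) AS events_count " +
            "FROM TABLE(TUMBLE(TABLE events, DESCRIPTOR(event_time), INTERVAL '1' HOUR)) " +
            "GROUP BY window_start, window_end").print();
    }
}
```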
-
Recently built a small real-time data pipeline using MySQL, Debezium, Kafka, Node.js, Docker, and AKHQ. This project helped me understand how data can flow in real time without writing manual sync logic.

Simple flow:
MySQL (CRUD operations) → Debezium captures changes (CDC) → Kafka stores events → Node.js consumer processes events

Key learnings:
- Changes in MySQL can be captured automatically using CDC (Change Data Capture): no need for polling or manual data syncing
- Docker networking works differently than local environments: localhost inside a container is not the same as your machine
- Kafka configuration is critical, especially listeners and advertised.listeners
- Debezium requires proper MySQL binlog configuration
- Using fixed Docker image versions (e.g., mysql:8.0.x) is more reliable than floating tags
- AKHQ is extremely helpful for inspecting Kafka topics, messages, and connector status

This project gave me practical exposure to event-driven architecture, distributed systems, and real-world debugging. It was a great hands-on experience building something beyond just theory.

#Kafka #Debezium #MySQL #NodeJS #Docker #CDC #BackendDevelopment #SoftwareEngineering
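For illustration, the consumer step of a flow like this needs only a few lines with the plain Kafka client. The project above used Node.js; this sketch is Java for consistency with the other examples here. The topic name follows the usual Debezium <topic.prefix>.<database>.<table> convention and is an assumed placeholder, as is the broker address.

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class CdcEventConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // must match the broker's advertised listener
        props.put("group.id", "cdc-demo");
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());
        props.put("auto.offset.reset", "earliest");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            // Debezium publishes one topic per captured table (placeholder name).
            consumer.subscribe(List.of("dbserver1.inventory.customers"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    // Each value is a Debezium change envelope (op, before, after, source).
                    System.out.printf("key=%s value=%s%n", record.key(), record.value());
                }
            }
        }
    }
}
```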
-
Karapace vs Apicurio Registry: Which Schema Registry Should You Use with Apache Kafka?

If you're running Apache Kafka without Confluent, you still need a Schema Registry to keep your producers and consumers aligned on data formats. Two powerful open-source options exist, and most teams don't know the difference. Here's a clear breakdown.

Karapace (by Aiven)
• 1-to-1 drop-in replacement for Confluent Schema Registry
• Built in Python, licensed under Apache 2.0
• Supports Avro, JSON Schema, Protobuf
• Comes with a built-in Kafka REST Proxy (produce/consume over HTTP)
• Schemas are stored as a Kafka topic (_schemas)
• Ideal if you're migrating away from Confluent: zero client-side code changes needed

Apicurio Registry (by Red Hat)
• A broader API + schema governance platform
• Built in Java (Quarkus): cloud-native, low memory footprint
• Supports Avro, Protobuf, JSON Schema, OpenAPI, AsyncAPI, GraphQL
• Multiple storage backends: Kafka, PostgreSQL, SQL Server
• Full web console for schema management + compatibility rules
• Works with Kafka Connect, Debezium, Camel connectors
• Compatible with Confluent Schema Registry clients
• Ideal for enterprises using Red Hat / OpenShift ecosystems

When to choose which?

Choose Karapace if:
- You need a lightweight Confluent Schema Registry replacement
- You want HTTP-based Kafka access via the REST Proxy
- Your team uses Avro/Protobuf and wants zero migration friction

Choose Apicurio if:
- You manage both event schemas AND API specs (OpenAPI, AsyncAPI)
- You need a rich UI and governance rules for schema evolution
- You're working in a Red Hat / OpenShift / Kubernetes environment

Key insight: Both are 100% open source and Confluent-client compatible. The choice comes down to scope: Karapace is laser-focused on Kafka schemas, Apicurio is a full API + schema governance platform.

Drop a comment if you've used either in production; would love to hear your experience!

#ApacheKafka #DataEngineering #SchemaRegistry #Karapace #Apicurio #OpenSource #KafkaConnect #StreamProcessing #DataArchitecture #SoftwareEngineering
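Because both registries speak the Confluent wire protocol, a standard Avro producer only needs its schema.registry.url pointed at one or the other. A hedged sketch in Java: the broker and registry URLs are placeholders, and the Apicurio compatibility path shown in the comment is an assumption to verify against your deployment.

```java
import java.util.Properties;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class AvroProducerWithRegistry {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        // The standard Confluent Avro serializer works against both registries.
        props.put("value.serializer", "io.confluent.kafka.serializers.KafkaAvroSerializer");
        // Karapace serves the Confluent-compatible API directly; for Apicurio,
        // point at its Confluent-compatibility endpoint (path is an assumption).
        props.put("schema.registry.url", "http://karapace:8081");
        // props.put("schema.registry.url", "http://apicurio:8080/apis/ccompat/v7");

        Schema schema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"Order\",\"fields\":[" +
            "{\"name\":\"id\",\"type\":\"long\"}," +
            "{\"name\":\"status\",\"type\":\"string\"}]}");

        GenericRecord order = new GenericData.Record(schema);
        order.put("id", 42L);
        order.put("status", "CREATED");

        try (KafkaProducer<String, GenericRecord> producer = new KafkaProducer<>(props)) {
            // The serializer registers/looks up the schema under subject "orders-value".
            producer.send(new ProducerRecord<>("orders", "42", order));
            producer.flush();
        }
    }
}
```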
-
Building hybrid pipelines bridging on-prem Oracle with AWS? This one will save you hours.

🔍 Same code. Same subnet. Same SG. Spark ETL connected to on-prem Oracle fine. Python Shell threw a TCP failure. I ruled out the usual suspects: cx_Oracle driver, SG outbound rule on 1521, Direct Connect health, on-prem firewall. All fine. The issue was elsewhere.

💡 Root cause: Python Shell is NOT VPC-native by default. Spark is.

Spark workers get ENIs injected into your private subnet before any code runs. They have VPC network identity from the start, so the DX/VPN tunnel is reachable.

Python Shell is a serverless container. Without an explicitly attached Glue Network Connection, it runs in the AWS-managed network, outside your VPC. Your DX tunnel is invisible to it. The TCP SYN never reaches on-prem.

Bonus gotcha: Even with a Glue connection attached, ENI injection in Python Shell is async. Your script can start executing before the ENI is bound, so cx_Oracle.connect() fails not from bad config but from a cold-start race condition.

🔧 Fix checklist:
1️⃣ Create a Glue Connection (type: Network, not JDBC) → VPC + subnet with DX/VPN route + SG with outbound 1521 to the on-prem CIDR
2️⃣ Attach it explicitly to the Python Shell job under the Connections tab → This is what triggers ENI injection. Easy to miss.
3️⃣ Wrap cx_Oracle.connect() in a retry loop → ENI bind can lag 5–10s on cold start

📌 Mental model:
Python Shell → VPC is opt-in
Spark ETL → VPC is mandatory at cluster init

Tagging Amazon Web Services (AWS): would love to see the ENI timing behavior and cold-start race condition called out explicitly in the Python Shell VPC connectivity docs. Additionally tagging a few folks who might find this useful: Stéphane Maarek, Frank Kane, Sundog Education. Would love your thoughts on the Python Shell ENI timing behavior!

Hit this before? Drop a comment.

#AWS #AWSGlue #DataEngineering #ETL #DataPipeline #Python #Oracle #CloudData
-
🚀 Day 1/30: Building an Enterprise Cloud Storage Backend

You can’t build a scalable system on a weak data layer. So today, I focused entirely on laying the foundation.

🛠️ What I built:
Initialized a Spring Boot application
Connected it to PostgreSQL
Designed core entities: User, Folder, File
Implemented a self-referencing structure for infinite nested folders

🧠 Key decisions:
Chose PostgreSQL over NoSQL → strong relationships + ACID consistency matter for file systems
Implemented soft deletes using Hibernate (@SQLDelete, @SQLRestriction) to prevent accidental data loss

🐛 Challenges I faced:
Fixed the classic JPA mappedBy issue (avoiding unnecessary join tables)
Debugged the infamous “Port 8080 already in use” error (found and killed the rogue process)

📈 Takeaway: A clean data layer isn’t optional; it defines how far your system can scale.

What would you have designed differently at the data layer?

Next up: Designing the Spring Security architecture and building the custom JPA repositories.

#SpringBoot #Java #BackendEngineering #PostgreSQL #SoftwareArchitecture #BuildInPublic #CloudStorage
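For anyone curious what the soft-delete plus self-referencing folder setup looks like in code, here is a rough sketch under assumed table and column names (folders, deleted, parent_id). It illustrates the annotations mentioned above, not the project's actual entity.

```java
import jakarta.persistence.*;
import org.hibernate.annotations.SQLDelete;
import org.hibernate.annotations.SQLRestriction;
import java.util.ArrayList;
import java.util.List;

// Soft-deleted, self-referencing folder tree; all names are illustrative.
@Entity
@Table(name = "folders")
@SQLDelete(sql = "UPDATE folders SET deleted = true WHERE id = ?") // intercepts DELETE statements
@SQLRestriction("deleted = false")                                 // hides soft-deleted rows from queries
public class Folder {

    @Id
    @GeneratedValue(strategy = GenerationType.IDENTITY)
    private Long id;

    private String name;

    private boolean deleted = false;

    // Owning side: each child folder holds the parent_id foreign key.
    @ManyToOne(fetch = FetchType.LAZY)
    @JoinColumn(name = "parent_id")
    private Folder parent;

    // Inverse side: mappedBy avoids the unnecessary join table mentioned in the post.
    @OneToMany(mappedBy = "parent", cascade = CascadeType.ALL)
    private List<Folder> children = new ArrayList<>();

    // getters and setters omitted for brevity
}
```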
-
Schema evolution in Apache Spark almost broke our production pipeline. Here's what I learned the hard way, and how to handle it properly. 🧵

We had a Kafka → Spark Structured Streaming → Delta Lake pipeline running flawlessly for months. Then one day upstream added a new column. That's it. That single change triggered cascading failures across 3 downstream jobs.

Here's what most engineers miss about schema evolution:

1. mergeSchema is not a silver bullet
Yes, setting mergeSchema = true in Delta Lake lets new columns flow in automatically. But it won't save you from column type changes or column renames; those break the transaction log silently or loudly, depending on your luck.

2. Column renames are the silent killer
A renamed column looks like a delete + add to Spark. Your data doesn't break; it just shows up as null. In a trade reporting pipeline, null notional values on SFTR reports are a compliance nightmare.

3. Schema Registry is your first line of defence
If you're on Kafka, enforce Avro or Protobuf schemas via Confluent Schema Registry. Set compatibility mode to BACKWARD or FULL. Let the registry reject breaking changes before they reach your stream.

4. Dead Letter Queues save your SLAs
Don't let a bad record crash your micro-batch. Route schema-incompatible records to a DLQ topic, alert on it, and let your pipeline keep running. Reprocess after schema alignment.

5. Treat schema changes like DB migrations
Version your schemas. Use a schema change ticket. Deploy schema changes before code changes. This is the discipline most data teams skip, until they're debugging a 2 AM incident.

Schema evolution is not a Spark feature. It's a data contract discipline.

How does your team handle schema changes in streaming pipelines? Drop your approach below 👇

#ApacheSpark #DataEngineering #DeltaLake #Kafka #StreamingData #DataArchitecture #SchemaEvolution
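To make points 1 and 4 concrete, here is a hedged sketch of a foreachBatch sink in Spark's Java API: additive changes flow into Delta via mergeSchema, while records that fail to parse against the agreed schema are routed to a DLQ topic instead of failing the micro-batch. Topic names, paths, and the column layout are illustrative, not taken from the pipeline described above.

```java
import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.from_json;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.StructType;

public class SchemaGuardedStream {
    public static void main(String[] args) throws Exception {
        SparkSession spark = SparkSession.builder().appName("schema-guarded").getOrCreate();

        // The schema downstream jobs have agreed to: the data contract.
        StructType expected = new StructType()
            .add("trade_id", "string")
            .add("notional", "double")
            .add("event_time", "timestamp");

        Dataset<Row> raw = spark.readStream()
            .format("kafka")
            .option("kafka.bootstrap.servers", "broker:9092")
            .option("subscribe", "trades")
            .load()
            .selectExpr("CAST(value AS STRING) AS json");

        // Records that do not parse against the expected schema end up with a null struct.
        Dataset<Row> parsed = raw.withColumn("data", from_json(col("json"), expected));

        parsed.writeStream()
            .option("checkpointLocation", "s3://bucket/checkpoints/trades")
            .foreachBatch((Dataset<Row> batch, Long batchId) -> {
                // Good records: let additive schema changes through with mergeSchema.
                batch.filter(col("data").isNotNull())
                    .select("data.*")
                    .write()
                    .format("delta")
                    .mode("append")
                    .option("mergeSchema", "true")
                    .save("s3://bucket/delta/trades");

                // Unparseable records: park them on a DLQ topic instead of failing the batch.
                batch.filter(col("data").isNull())
                    .selectExpr("json AS value")
                    .write()
                    .format("kafka")
                    .option("kafka.bootstrap.servers", "broker:9092")
                    .option("topic", "trades.dlq")
                    .save();
            })
            .start()
            .awaitTermination();
    }
}
```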
-
Kafka vs Database. I see this confusion every single day in developer communities. Wrote the most practical guide I could 👇

🚀 Apache Kafka vs Databases (SQL & NoSQL): Part 2 is LIVE
👉 muslimahmad.com
Blog Link: https://pinlnk.me/lqRJ

→ Kafka vs MySQL, PostgreSQL, MongoDB, DynamoDB
→ CRUD in terminal + Node.js (full code)
→ Tombstone messages explained
→ Why old data still shows after delete
→ Kafka + Database real-world pattern

No fluff. Real commands. Real code. Real screenshots. 💻

💬 Are you using Kafka with a database in production? Drop your stack below!
🔗 muslimahmad.com

#ApacheKafka #KafkaVsDatabase #KafkaCRUD #KafkaTombstone #KafkaNodeJS #KafkaVsMongoDB #KafkaVsMySQL #KafkaVsPostgreSQL #NodeJS #MongoDB #MySQL #PostgreSQL #MERN #CRUDOperations #EventSourcing #BackendDevelopment #SystemDesign #muslimahmad #muslimahmadkhan
-
Stop treating Apache Kafka like a "dumb" Message Queue. 🛑

If you are using Spring Boot + Kafka just to send data from Point A to Point B, you are missing out on the most powerful feature of modern microservices: Event Streaming.

In a distributed system, a standard "request-response" cycle is a recipe for bottlenecks. If Service A waits for Service B, and Service B is slow, your entire system crawls.

How we solve this with the Spring Kafka ecosystem:
1. Decoupling with Purpose: Service A emits an event ("OrderCreated") and moves on. It doesn't care who is listening.
2. Stateful Processing with Kafka Streams: Don't just move data; transform it. Use Kafka Streams to join topics, aggregate totals, or filter noise in real-time before the data even hits your database.
3. The "Idempotent Consumer" Pattern: In the cloud, network glitches happen. Your consumer might receive the same message twice. Use @KafkaListener combined with a unique MessageID check to ensure you don't process the same payment or order twice.

The Technical Edge: By leveraging Spring for Apache Kafka, we get built-in support for transactions and "Exactly-Once" semantics (EOS). This is the gold standard for financial and high-integrity systems.

Are you building reactive, event-driven systems, or are your microservices still "tightly coupled" under the hood? Let’s talk stream processing vs. batch processing below! 👇

#ApacheKafka #Java #SpringBoot #EventDriven #Microservices #BackendEngineering #KafkaStreams #SystemDesign
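Point 3 in practice: a minimal idempotent-consumer sketch with Spring for Apache Kafka. The messageId header, topic, and group names are assumptions for illustration, and the in-memory set stands in for the durable store you would use in production.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import org.springframework.kafka.annotation.KafkaListener;
import org.springframework.messaging.handler.annotation.Header;
import org.springframework.messaging.handler.annotation.Payload;
import org.springframework.stereotype.Component;

@Component
public class OrderCreatedListener {

    // Demo-only dedupe store. In production, back this with a database table
    // that has a unique constraint on the message ID (checked in the same
    // transaction as the business write) so the guard survives restarts.
    private final Set<String> processedIds = ConcurrentHashMap.newKeySet();

    // "messageId" is an assumed header name set by the producer.
    @KafkaListener(topics = "order-created", groupId = "billing")
    public void onOrderCreated(@Payload String payload,
                               @Header("messageId") String messageId) {
        // Idempotent-consumer check: a redelivered message is silently skipped,
        // so the same payment or order is never processed twice.
        if (!processedIds.add(messageId)) {
            return;
        }
        // ... business logic for the OrderCreated event goes here ...
        System.out.println("Processing order event: " + payload);
    }
}
```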
-
🚀 **What is a Kafka Controller? (The Brain of the Kafka Cluster)**

When people learn Apache Kafka, they often focus on:
* Producers
* Consumers
* Brokers

But one hidden component quietly keeps the entire cluster coordinated:
👉 **The Kafka Controller**

---

🔹 **What is a Kafka Controller?**
A Kafka Controller is a special broker responsible for managing the cluster’s metadata and coordination. Think of it as:
🧠 **The Brain of the Kafka Cluster**

---

🔹 **What does the controller actually do?**
✅ Tracks which brokers are alive
✅ Detects broker failures
✅ Elects partition leaders
✅ Manages partition assignments
✅ Handles failover operations

Without the controller: 👉 The cluster cannot coordinate properly

---

🔹 **Important Concept**
In a Kafka cluster:
✔️ Many brokers exist
❗ But only ONE broker acts as the active controller at a time

---

🔹 **Example**
Imagine:
* Broker 1
* Broker 2
* Broker 3
If Broker 2 is elected as controller: 👉 It manages cluster decisions while still behaving like a normal broker.

---

🔹 **What happens if a broker fails?**
Suppose the partition leader was on Broker 1, and Broker 1 crashes. The controller will:
1️⃣ Detect the failure
2️⃣ Promote a follower replica as the new leader
3️⃣ Inform the cluster about metadata changes
👉 This is how Kafka achieves fault tolerance.

---

🔹 **ZooKeeper vs KRaft**
Older Kafka versions used Apache ZooKeeper for controller election. Modern Kafka (KRaft mode) uses built-in Raft consensus: no ZooKeeper dependency, faster and more scalable.

---

🧠 **Simple analogy**
Brokers = Workers
Controller = Manager
Workers store and serve data, but the manager coordinates the system.

---

🎯 **Key Takeaway**
The Kafka Controller does NOT store all data. Its real job is:
👉 Cluster coordination
👉 Metadata management
👉 Failover handling
And that’s what keeps Kafka highly available at scale.

#Kafka #SystemDesign #DistributedSystems #DataEngineering #BackendEngineering
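If you want to see which broker is currently acting as controller in your own cluster, the Kafka AdminClient exposes it. A small sketch, assuming a reachable bootstrap broker at localhost:9092:

```java
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.DescribeClusterResult;
import org.apache.kafka.common.Node;

public class WhoIsTheController {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            DescribeClusterResult cluster = admin.describeCluster();

            // The broker currently acting as the active controller.
            Node controller = cluster.controller().get();
            System.out.printf("Active controller: broker %d at %s:%d%n",
                controller.id(), controller.host(), controller.port());

            // All live brokers, for comparison.
            cluster.nodes().get().forEach(node ->
                System.out.printf("Broker %d at %s:%d%n",
                    node.id(), node.host(), node.port()));
        }
    }
}
```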