From 23e8aa6be5c5f79dde4c8c1aab7bbe4ebec5dcb9 Mon Sep 17 00:00:00 2001 From: aditya singh rathore Date: Mon, 18 May 2026 12:27:23 +0530 Subject: [PATCH] Refactor data pipeline cost blog --- blog/azure-cost-optimization/index.md | 69 +++++++++++++-------------- 1 file changed, 34 insertions(+), 35 deletions(-) diff --git a/blog/azure-cost-optimization/index.md b/blog/azure-cost-optimization/index.md index 9bd8b74c..7b71bd49 100644 --- a/blog/azure-cost-optimization/index.md +++ b/blog/azure-cost-optimization/index.md @@ -47,12 +47,12 @@ Our pipeline processed roughly 2GB of sales data per day and served a Power BI d This post is that investigation: every mistake explained, every fix documented, and the exact before-and-after numbers. If you're building data pipelines on Azure and haven't audited your costs recently, at least one of these is probably happening to you right now. **What you'll learn in this post:** -- 1. Why a Dedicated SQL Pool running 24/7 is the single most expensive default mistake in Azure Synapse -- 2. How to replace nightly full loads with watermark-based incremental loads in Azure Data Factory -- 3. How to right-size Spark pools and configure auto-termination to stop paying for idle compute -- 4. How partition pruning on Delta Lake tables can reduce data scanned by over 90% -- 5. How ADLS Gen2 lifecycle policies passively save money on storage with zero ongoing effort -- 6. When a scheduled micro-batch replaces a 24/7 streaming pipeline without any business impact +- Why a Dedicated SQL Pool running 24/7 is the single most expensive default mistake in Azure Synapse +- How to replace nightly full loads with watermark-based incremental loads in Azure Data Factory +- How to right-size Spark pools and configure auto-termination to stop paying for idle compute +- How partition pruning on Delta Lake tables can reduce data scanned by over 90% +- How ADLS Gen2 lifecycle policies passively save money on storage with zero ongoing effort +- When a scheduled micro-batch replaces a 24/7 streaming pipeline without any business impact @@ -85,10 +85,10 @@ A standard [Medallion Architecture](https://www.recodehive.com/blog/medallion-ar This was the single largest line item. A Dedicated SQL Pool at DW200c was provisioned to serve the Power BI dashboard and left running continuously 24 hours a day, 7 days a week because auto-pause had never been configured. -The fundamental billing model of a Dedicated SQL Pool is important to understand: **you pay for provisioned DWUs whether or not queries are running.** Our 30 users were active between 9am and 6pm on weekdays, 45 hours of actual usage per week. The pool was running for 168 hours per week. That is 123 hours of idle, fully-billed compute every single week. Over a month, that compounds into a significant waste. +The thing that surprised me most when I first dug into the bill was this: the SQL Pool charged us the same rate at 3am on a Saturday as it did during peak usage on a Tuesday afternoon. I had assumed, naively, that there was some kind of idle detection built in. There isn't. When it's provisioned, you're paying - full stop, whether a single query runs or not. Our 30 users were active between 9am and 6pm on weekdays, 45 hours of actual usage per week. The pool was running for 168 hours per week. That's 123 hours of idle, fully-billed compute every single week, and it showed up on our invoice as a flat $1,800 charge with no breakdown by usage. :::note -Dedicated SQL Pool billing is based on **provisioned DWU-hours**, not query execution time. Pausing the pool stops the DWU billing entirely. Only storage costs continue when the pool is paused. +The SQL Pool doesn't throttle billing when it's idle, provisioned capacity is billed by the hour regardless of query activity. Pausing the pool is the only way to stop the DWU clock. Storage costs continue when paused, but the compute component which is the large part, stops completely. ::: ### The Fix: Auto-Pause with Azure Automation Runbooks @@ -119,17 +119,16 @@ Schedule the pause runbook at 7pm weekdays and the resume runbook at 8am. Weeken If your workload is exploratory rather than dashboard-serving, consider whether [Serverless SQL Pool](https://learn.microsoft.com/en-us/azure/synapse-analytics/sql/on-demand-workspace-overview) is sufficient. Serverless pools bill per TB of data scanned rather than provisioned DWUs, which can be significantly cheaper for infrequent query patterns. ::: ---- ## Mistake #2: Full Load Running Every Night Instead of Incremental **Monthly cost of this mistake: ~$620** -The ADF pipeline was configured to pull **all records** from the source database on every nightly run — not just new or updated ones. Day 1 it pulled 2GB. By Day 60 it was pulling 120GB, processing records that had already been processed 59 times before. +The ADF pipeline was configured to pull **all records** from the source database on every nightly run not just new or updated ones. Day 1 it pulled 2GB. By Day 60 it was pulling 120GB, processing records that had already been processed 59 times before. This pattern is extremely common and extremely expensive. Every night, the ADF pipeline read the entire historical dataset, the Spark transformation job processed all of it, and the results were written back to Delta. The billable compute scaled with the dataset size, not with the actual volume of new data. -### The Fix — Watermark-Based Incremental Loading +### The Fix: Watermark-Based Incremental Loading A watermark stores the timestamp of the last successfully processed record. Every pipeline run reads only records newer than that timestamp, then updates the watermark on success. @@ -176,7 +175,7 @@ WHERE pipeline_name = 'sales_ingestion' ![Azure Data Factory pipeline canvas showing the Lookup (GetWatermark) → Copy Data → Stored Procedure watermark pattern](./img/03-adf-watermark-pipeline-canvas.png) :::tip -The watermark pattern requires a reliable `updated_at` or `created_at` column in the source table. If your source does not have one, work with the source team to add it the cost saving on the pipeline side will far outweigh the schema migration effort. +The watermark pattern requires a reliable `updated_at` or `created_at` column in the source table. If your source does not have one, work with the source team to add it, the cost saving on the pipeline side will far outweigh the schema migration effort. ::: @@ -188,7 +187,7 @@ When setting up the Spark pool in Azure Synapse, the default node size **DS3_v2 Two problems compounded each other. First, the cluster was consuming roughly 10x the compute it actually needed for the data volume. Second, auto-termination was set to 60 minutes, meaning after a 12-minute job, the cluster sat idle and fully billed for another 48 minutes before shutting down. -### The Fix — Right-Sizing, Autoscale, and Fast Termination +### The Fix: Right-Sizing, Autoscale, and Fast Termination The fix has three components that work together: @@ -225,9 +224,9 @@ A good rule of thumb for `spark.sql.shuffle.partitions`: aim for roughly **128MB The Silver layer Delta table was partitioned by `order_date`. The Spark transformation job, however, was reading the entire table and applying a date filter *after* the read not during it. -This subtle difference has an enormous cost impact. When a filter is applied after reading the table, Spark must open and scan every file in every partition before evaluating which rows to keep. When the filter is pushed down into the read, Spark uses the partition directory structure to skip every partition that does not match, opening only the relevant files. +Think of it like this: imagine your filing cabinet is organized by month, one drawer per month, clearly labelled. Partition pruning is pulling open only January's drawer. What we were doing instead was dumping every drawer onto the floor, sifting through three years of paper, and throwing away everything that wasn't from January then tidying it all back up. Every. Single. Night. The cabinet is organized correctly. We just weren't using the labels. -With 90 days of accumulated history, the naive approach was scanning 90x more data than necessary on every single run. +With 90 days of accumulated history, this approach was scanning 90x more data than necessary on every run. The fix is to push the filter into the read itself so Spark can use the partition directory structure to skip everything irrelevant before a single file is opened. @@ -264,7 +263,7 @@ silver_df = ( **Result:** Data scanned per run dropped from ~180GB to ~2GB. Spark runtime fell from 18 minutes to 4 minutes. Monthly saving: **~$260/month**. :::note -For partition pruning to work, two conditions must both be true. The table must be partitioned by the filter column, and the filter must be applied at read time — not in a subsequent transformation step. Applying the filter even one `.filter()` call after the `.load()` still results in full table scan in some execution contexts. +For partition pruning to work, two conditions must both be true. The table must be partitioned by the filter column, and the filter must be applied at read time not in a subsequent transformation step. Applying the filter even one `.filter()` call after the `.load()` still results in a full table scan in some execution contexts. ::: @@ -274,9 +273,9 @@ For partition pruning to work, two conditions must both be true. The table must The ADLS Gen2 Bronze layer had 14 months of raw Parquet files sitting on the **Hot** storage tier. No lifecycle policy had ever been configured. -ADLS Gen2 charges different rates depending on the storage access tier. Hot tier is priced for data that is accessed frequently. Cool tier costs roughly 44% less. Archive tier costs over 90% less than Hot for storage, though retrieval carries an additional cost. For data that has not been touched in 6 months, paying Hot tier prices is straightforward waste. +ADLS Gen2 charges different rates depending on the storage access tier. When I pulled our actual invoice line items, the numbers told a clear story: our Bronze container was costing us $0.023 per GB per month on Hot, while data we hadn't touched in months was sitting right next to yesterday's files paying the same rate. Moving files older than 30 days to Cool dropped that rate to roughly $0.013/GB, about 44% less for data we only needed occasionally. Files older than 180 days dropped to Archive at around $0.002/GB, which is where old Bronze raw files belong when the Silver layer already has the clean version. -Fourteen months of ~2GB/day accumulates to roughly 850GB in the Bronze layer. The storage cost difference between Hot and Cool on that volume is modest, but the pattern compounds across every container and every account and the fix requires exactly one policy configuration. +Fourteen months of ~2GB/day accumulates to roughly 850GB in the Bronze layer. The fix required exactly one policy configuration. ### The Fix: ADLS Gen2 Lifecycle Management Policy @@ -315,7 +314,7 @@ Bronze files automatically move to Cool after 30 days and Archive after 180 days Apply lifecycle policies to the Silver and Gold layers too, with longer thresholds. Silver data accessed for backfills after 90 days can move to Cool. Gold data older than 365 days can move to Archive if your reporting doesn't require historical drill-downs that old. ::: -**Result:** ~$160/month in combined storage and egress savings — purely passive, set once and forgotten. +**Result:** ~$160/month in combined storage and egress savings, purely passive, set once and forgotten. ## Mistake #6: A Streaming Pipeline for 15-Minute Update Requirements @@ -374,7 +373,7 @@ A useful mental model for this decision: **streaming is the right choice when la ::: -From $4,247 down to approximately $1,150 after all fixes — a **73% cost reduction** on a pipeline doing exactly the same work on exactly the same data. +From $4,247 down to approximately $1,150 after all fixes, a **73% cost reduction** on a pipeline doing exactly the same work on exactly the same data. ![Before and after bar chart showing Azure monthly cost by category, with before total of $4,247 and after total of $1,150](./img/01-azure-cost-before-after.png) @@ -409,7 +408,7 @@ Run through this every quarter. Each item is a question, if the answer is "no" o ## Key Lessons -**Defaults are expensive by design.** Azure's defaults 60-minute Spark termination, no SQL Pool auto-pause, no lifecycle policies are chosen for zero-friction setup, not cost efficiency. Every default should be reviewed and overridden deliberately on day one, not after the first billing surprise. +**Defaults are expensive by design.** Azure's defaults - 60-minute Spark termination, no SQL Pool auto-pause, no lifecycle policies are chosen for zero-friction setup, not cost efficiency. Every default should be reviewed and overridden deliberately on day one, not after the first billing surprise. **Incremental loading is not a future optimization.** For any pipeline reading time-series data from a growing source, a full load that runs daily compounds in cost the same way interest compounds on debt. The watermark pattern takes a few hours to implement and pays for itself within a week. @@ -423,35 +422,35 @@ Run through this every quarter. Each item is a question, if the answer is "no" o ## Frequently Asked Questions **Q: Should I always use Serverless SQL Pool instead of Dedicated SQL Pool to save costs?** -A: Not necessarily. Serverless SQL Pool bills per TB scanned, which is excellent for exploratory queries and infrequent access. Dedicated SQL Pool becomes more cost-effective when you have consistent, high-concurrency query loads — typically above 6–8 hours of active querying per day. The auto-pause approach gets the best of both: dedicated performance when needed, no billing when idle. +A: We actually tested this switch on our own setup before committing to the auto-pause approach. Serverless made sense for us until we crossed about 6 hours of daily query time, below that threshold, serverless was cheaper every single month without exception, and we didn't need to manage any pause/resume scheduling at all. If your Power BI dashboard is hit heavily throughout the business day, dedicated will eventually win on pure cost. But if usage is bursty or confined to a few hours, don't provision dedicated capacity and then fight to keep it paused, just go serverless from the start. **Q: What if my source system doesn't have a reliable `updated_at` column for watermarking?** -A: There are fallback options. Change Data Capture (CDC) at the database level is the most reliable — tools like [Debezium](https://debezium.io/) can stream row-level changes without modifying the source schema. If CDC is not available, sequence-based watermarking (using an auto-incrementing primary key) works for append-only tables. As a last resort, row hash comparison can detect changes, though it requires reading the full source on each run. +A: We ran into this with one of our source databases, no timestamp, no audit column, nothing. We ended up going the CDC route using Debezium, which captures row-level changes at the database log level without touching the source schema at all. It took about a day to set up and was the cleanest solution we found. For append-only tables, an auto-incrementing primary key works as a watermark substitute. If neither option exists, row hash comparison is a last resort, it detects changes, but you're still reading the full source every run, which defeats most of the point. **Q: How do I choose the right partition column for a Delta Lake table?** -A: Partition by the column that appears in your most frequent filter predicate — almost always a date column for time-series data. Avoid high-cardinality columns like user IDs or transaction IDs as partition columns; they create too many small files and make partition pruning less effective. A good partition should contain between 100MB and 1GB of data. +A: The short answer is: whichever column appears in your most common filter is your partition column, and for time-series data that's almost always a date. What I'd caution against is partitioning on something high-cardinality like user ID or transaction ID, we made that mistake early on and ended up with thousands of tiny files that made partition pruning useless and actually slowed down reads. A healthy partition should hold somewhere between 100MB and 1GB of data. If a partition is smaller than that, you're creating file overhead without the pruning benefit. **Q: Will moving data to Archive tier in ADLS Gen2 break my backfill pipelines?** -A: Yes, if pipelines try to read from Archive without rehydrating first. Archive tier data must be rehydrated to Hot or Cool before it can be read, which takes hours. The solution is to only archive data that has a known, long rehydration lead time — typically raw Bronze data older than 6 months, where backfill requests can be anticipated. For Silver and Gold layers, move to Cool but not Archive. +A: I learned this the hard way, a backfill job failed at 2am because Archive data doesn't just open like a normal file. You have to explicitly rehydrate it first, and depending on the priority tier you choose, that can take anywhere from an hour to fifteen hours. We got caught by this once, and now the rule on our team is: only archive data where we have at least 24 hours of lead time if a backfill request comes in. For Bronze raw files older than 180 days, that's usually fine, nobody's doing emergency backfills on six-month-old raw data. For Silver and Gold, we stop at Cool tier and don't go to Archive at all, because the rehydration wait is unacceptable mid-incident. **Q: Is a 5-minute Spark auto-termination window too aggressive for interactive notebooks?** -A: For scheduled jobs, 5 minutes is ideal. For interactive development notebooks, consider a separate Spark pool configuration with a longer termination window — 30 minutes is reasonable. Keep production job pools and development pools separate so cost optimization settings on production do not interrupt interactive work. +A: For scheduled production jobs, 5 minutes is ideal, the job finishes, and within 5 minutes the cluster is gone and billing stops. For interactive development work, 5 minutes will drive you crazy because the cluster spins down between notebook cells if you pause to think. What we do is maintain two separate Spark pool configurations: one for production jobs set to 5-minute termination, and one for dev work set to 60 minutes. They run on different node sizes too. Keep them separate and you get cost efficiency in production without interrupting development flow. **Q: How do I detect whether partition pruning is actually working in my Spark job?** -A: Run the job with the Synapse or Databricks Spark UI open and inspect the query plan. Look for a `PartitionFilters` entry in the scan node of the physical plan. If partition pruning is active, it will list the filter predicate there. If the scan shows `PartitionFilters: []`, the filter is not being pushed down and you are scanning the full table. +A: The fastest way is to run `df.explain()` and look at the physical plan output. If partition pruning is active, you'll see a `PartitionFilters` entry in the scan node listing your filter predicate. If that field shows `PartitionFilters: []` - empty brackets, you're scanning the full table regardless of what your code looks like. I've caught this bug three times by checking the plan after what looked like a correctly written filter, and each time it turned out the filter was being applied one transformation step too late, after Spark had already committed to a full read. --- ## References and Further Reading -- [Microsoft Docs — Azure Cost Management and Billing](https://learn.microsoft.com/en-us/azure/cost-management-billing/) -- [Microsoft Docs — Synapse SQL Pool Pause and Resume](https://learn.microsoft.com/en-us/azure/synapse-analytics/sql-data-warehouse/pause-and-resume-compute-portal) -- [Microsoft Docs — ADLS Gen2 Lifecycle Management Policies](https://learn.microsoft.com/en-us/azure/storage/blobs/lifecycle-management-overview) -- [Delta Lake — Partition Pruning and Optimization](https://docs.delta.io/latest/optimizations-oss.html) -- [Microsoft Docs — ADF Tumbling Window Trigger](https://learn.microsoft.com/en-us/azure/data-factory/how-to-create-tumbling-window-trigger) -- [RecodeHive — Azure Storage and ADLS Gen2 Complete Guide](https://www.recodehive.com/blog/azure-storage-options) -- [RecodeHive — Hidden Cost of Streaming Pipelines](https://www.recodehive.com/blog/batch-vs-stream-processing) -- [RecodeHive — Medallion Architecture Explained](https://www.recodehive.com/blog/medallion-architecture) +- [Microsoft Docs - Azure Cost Management and Billing](https://learn.microsoft.com/en-us/azure/cost-management-billing/) +- [Microsoft Docs - Synapse SQL Pool Pause and Resume](https://learn.microsoft.com/en-us/azure/synapse-analytics/sql-data-warehouse/pause-and-resume-compute-portal) +- [Microsoft Docs - ADLS Gen2 Lifecycle Management Policies](https://learn.microsoft.com/en-us/azure/storage/blobs/lifecycle-management-overview) +- [Delta Lake - Partition Pruning and Optimization](https://docs.delta.io/latest/optimizations-oss.html) +- [Microsoft Docs - ADF Tumbling Window Trigger](https://learn.microsoft.com/en-us/azure/data-factory/how-to-create-tumbling-window-trigger) +- [RecodeHive - Azure Storage and ADLS Gen2 Complete Guide](https://www.recodehive.com/blog/azure-storage-options) +- [RecodeHive - Hidden Cost of Streaming Pipelines](https://www.recodehive.com/blog/batch-vs-stream-processing) +- [RecodeHive - Medallion Architecture Explained](https://www.recodehive.com/blog/medallion-architecture)