Optimize Apache Spark jobs with partitioning, caching, shuffle optimization, and memory tuning. Use when improving Spark performance, debugging slow jobs, or scaling data processing pipelines.
Add this skill
npx mdskills install sickn33/spark-optimization

Comprehensive Spark optimization patterns with clear code examples and performance tuning strategies.
---
name: spark-optimization
description: Optimize Apache Spark jobs with partitioning, caching, shuffle optimization, and memory tuning. Use when improving Spark performance, debugging slow jobs, or scaling data processing pipelines.
---

# Apache Spark Optimization

Production patterns for optimizing Apache Spark jobs, including partitioning strategies, memory management, shuffle optimization, and performance tuning.

## Do not use this skill when

- The task is unrelated to Apache Spark optimization
- You need a different domain or tool outside this scope

## Instructions

- Clarify goals, constraints, and required inputs.
- Apply relevant best practices and validate outcomes.
- Provide actionable steps and verification.
- If detailed examples are required, open `resources/implementation-playbook.md`.

## Use this skill when

- Optimizing slow Spark jobs
- Tuning memory and executor configuration
- Implementing efficient partitioning strategies
- Debugging Spark performance issues
- Scaling Spark pipelines for large datasets
- Reducing shuffle and data skew

## Core Concepts

### 1. Spark Execution Model

```
Driver Program
    ↓
Job (triggered by action)
    ↓
Stages (separated by shuffles)
    ↓
Tasks (one per partition)
```

### 2. Key Performance Factors

| Factor | Impact | Solution |
|--------|--------|----------|
| **Shuffle** | Network I/O, disk I/O | Minimize wide transformations |
| **Data Skew** | Uneven task duration | Salting, broadcast joins |
| **Serialization** | CPU overhead | Use Kryo, columnar formats |
| **Memory** | GC pressure, spills | Tune executor memory |
| **Partitions** | Parallelism | Right-size partitions |

## Quick Start

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Create an optimized Spark session
spark = (SparkSession.builder
    .appName("OptimizedJob")
    .config("spark.sql.adaptive.enabled", "true")
    .config("spark.sql.adaptive.coalescePartitions.enabled", "true")
    .config("spark.sql.adaptive.skewJoin.enabled", "true")
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .config("spark.sql.shuffle.partitions", "200")
    .getOrCreate())

# Read with optimized settings
df = (spark.read
    .format("parquet")
    .option("mergeSchema", "false")
    .load("s3://bucket/data/"))

# Efficient transformations
result = (df
    .filter(F.col("date") >= "2024-01-01")
    .select("id", "amount", "category")
    .groupBy("category")
    .agg(F.sum("amount").alias("total")))

result.write.mode("overwrite").parquet("s3://bucket/output/")
```

## Patterns

### Pattern 1: Optimal Partitioning

```python
# Calculate optimal partition count
def calculate_partitions(data_size_gb: float, partition_size_mb: int = 128) -> int:
    """
    Optimal partition size: 128MB - 256MB.
    Too few: under-utilization, memory pressure.
    Too many: task-scheduling overhead.
    """
    return max(int(data_size_gb * 1024 / partition_size_mb), 1)

# Repartition for even distribution (full shuffle)
df_repartitioned = df.repartition(200, "partition_key")

# Coalesce to reduce partitions (no shuffle)
df_coalesced = df.coalesce(100)

# Partition pruning with predicate pushdown
df = (spark.read.parquet("s3://bucket/data/")
    .filter(F.col("date") == "2024-01-01"))  # Spark pushes this down

# Write with partitioning for future queries
(df.write
    .partitionBy("year", "month", "day")
    .mode("overwrite")
    .parquet("s3://bucket/partitioned_output/"))
```
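As a quick sanity check, here is a minimal usage sketch continuing with the `spark` session and the helper above; the 50 GB input size is a hypothetical figure, and `getNumPartitions()` confirms the effect of `repartition` versus `coalesce`:

```python
# Hypothetical input size; in practice, estimate it from the source files
input_gb = 50.0
target = calculate_partitions(input_gb)   # 50 * 1024 / 128 = 400

df = spark.read.parquet("s3://bucket/data/")
print(df.rdd.getNumPartitions())          # whatever the scan produced

df_even = df.repartition(target, "partition_key")
print(df_even.rdd.getNumPartitions())     # 400: full shuffle, even spread

df_fewer = df_even.coalesce(target // 4)
print(df_fewer.rdd.getNumPartitions())    # 100: merged locally, no shuffle
```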
### Pattern 2: Join Optimization

```python
from pyspark.sql import functions as F
from pyspark.sql.types import *

# 1. Broadcast Join - small-table joins
# Best when one side is below the broadcast threshold (10MB by default, configurable)
small_df = spark.read.parquet("s3://bucket/small_table/")  # < 10MB
large_df = spark.read.parquet("s3://bucket/large_table/")  # TBs

# Explicit broadcast hint
result = large_df.join(
    F.broadcast(small_df),
    on="key",
    how="left"
)

# 2. Sort-Merge Join - default for large tables
# Requires a shuffle, but handles any size
result = large_df1.join(large_df2, on="key", how="inner")

# 3. Bucket Join - pre-sorted, no shuffle at join time
# Write bucketed tables
(df.write
    .bucketBy(200, "customer_id")
    .sortBy("customer_id")
    .mode("overwrite")
    .saveAsTable("bucketed_orders"))

# Join bucketed tables (no shuffle!)
orders = spark.table("bucketed_orders")
customers = spark.table("bucketed_customers")  # Same bucket count
result = orders.join(customers, on="customer_id")

# 4. Skew Join Handling
# Enable AQE skew-join optimization
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.skewedPartitionFactor", "5")
spark.conf.set("spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes", "256MB")

# Manual salting for severe skew
def salt_join(df_skewed, df_other, key_col, num_salts=10):
    """Add a random salt to spread skewed keys across partitions."""
    # Add salt to the skewed side
    df_salted = df_skewed.withColumn(
        "salt",
        (F.rand() * num_salts).cast("int")
    ).withColumn(
        "salted_key",
        F.concat(F.col(key_col), F.lit("_"), F.col("salt"))
    )

    # Replicate the other side once per salt value
    df_exploded = df_other.crossJoin(
        spark.range(num_salts).withColumnRenamed("id", "salt")
    ).withColumn(
        "salted_key",
        F.concat(F.col(key_col), F.lit("_"), F.col("salt"))
    )

    # Join on the salted key
    return df_salted.join(df_exploded, on="salted_key", how="inner")
```

### Pattern 3: Caching and Persistence

```python
from pyspark import StorageLevel

# Cache when reusing a DataFrame across multiple actions
df = spark.read.parquet("s3://bucket/data/")
df_filtered = df.filter(F.col("status") == "active")

# Cache in memory (MEMORY_AND_DISK is the default)
df_filtered.cache()

# Or with a specific storage level
# (the *_SER levels exist only in the Scala/Java API; PySpark data
#  is always stored serialized)
df_filtered.persist(StorageLevel.DISK_ONLY)

# Force materialization
df_filtered.count()

# Use in multiple actions
agg1 = df_filtered.groupBy("category").count()
agg2 = df_filtered.groupBy("region").sum("amount")

# Unpersist when done
df_filtered.unpersist()

# Storage levels explained:
# MEMORY_ONLY     - Fast, but may not fit
# MEMORY_AND_DISK - Spills to disk if needed (recommended)
# DISK_ONLY       - When memory is tight
# OFF_HEAP        - Tungsten off-heap memory
# MEMORY_ONLY_SER - Serialized, less memory, more CPU (Scala/Java only)

# Checkpoint for complex lineage
spark.sparkContext.setCheckpointDir("s3://bucket/checkpoints/")
df_complex = (df
    .join(other_df, "key")
    .groupBy("category")
    .agg(F.sum("amount")))
df_complex = df_complex.checkpoint()  # Breaks lineage, materializes
```
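To confirm a cache is actually in effect, a small sketch: `storageLevel` reports the level requested via `cache()`/`persist()`, and the catalog API covers cached tables (the `active_rows` view name is illustrative; the Spark UI's Storage tab shows what is actually resident):

```python
df_filtered.cache()
df_filtered.count()                            # materialize

# The level requested via cache()/persist()
print(df_filtered.storageLevel)

# Catalog API for cached tables and views
df_filtered.createOrReplaceTempView("active_rows")
spark.catalog.cacheTable("active_rows")
print(spark.catalog.isCached("active_rows"))   # True

# Drop everything cached in this session
spark.catalog.clearCache()
```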
### Pattern 4: Memory Tuning

```python
# Executor memory configuration
# spark-submit --executor-memory 8g --executor-cores 4

# Memory breakdown (8GB executor, approximate; Spark first reserves ~300MB):
# - spark.memory.fraction = 0.6 (60% = 4.8GB shared by execution + storage)
# - spark.memory.storageFraction = 0.5 (50% of 4.8GB = 2.4GB protected for cache)
# - Remaining 2.4GB for execution (shuffles, joins, sorts)
# - The other 40% = 3.2GB for user data structures and internal metadata

spark = (SparkSession.builder
    .config("spark.executor.memory", "8g")
    .config("spark.executor.memoryOverhead", "2g")  # For non-JVM memory
    .config("spark.memory.fraction", "0.6")
    .config("spark.memory.storageFraction", "0.5")
    .config("spark.sql.shuffle.partitions", "200")
    # Raise the broadcast threshold (default 10MB)
    .config("spark.sql.autoBroadcastJoinThreshold", "50MB")
    # Cap input split size to avoid OOM on large scans
    .config("spark.sql.files.maxPartitionBytes", "128MB")
    .getOrCreate())

# Monitor memory usage via Spark's documented monitoring REST API
import json
import urllib.request

def print_memory_usage(spark):
    """Print per-executor storage memory (requires the Spark UI to be enabled)."""
    sc = spark.sparkContext
    base = f"{sc.uiWebUrl}/api/v1/applications/{sc.applicationId}"
    with urllib.request.urlopen(f"{base}/executors") as resp:
        for e in json.load(resp):
            used = e["memoryUsed"] / (1024**3)
            total = e["maxMemory"] / (1024**3)
            print(f"{e['id']}: {used:.2f}GB used of {total:.2f}GB storage memory")
```

### Pattern 5: Shuffle Optimization

```python
# Reduce shuffle data size
# With AQE and coalescePartitions enabled, Spark right-sizes post-shuffle
# partitions at runtime; the static value below is only the initial count.
# (The literal value "auto" is accepted only on Databricks.)
spark.conf.set("spark.sql.shuffle.partitions", "200")
spark.conf.set("spark.shuffle.compress", "true")
spark.conf.set("spark.shuffle.spill.compress", "true")

# Pre-aggregate before the wide shuffle
df_optimized = (df
    # Local aggregation first (combiner-style)
    .groupBy("key", "partition_col")
    .agg(F.sum("value").alias("partial_sum"))
    # Then global aggregation
    .groupBy("key")
    .agg(F.sum("partial_sum").alias("total")))

# Prefer approximate aggregates when exact results aren't required
# BAD: shuffles every distinct value
distinct_count = df.select("category").distinct().count()

# GOOD: approximate distinct shuffles only small sketches
approx_count = df.select(F.approx_count_distinct("category")).collect()[0][0]

# Use coalesce instead of repartition when reducing partitions
df_reduced = df.coalesce(10)  # No shuffle

# Optimize shuffle with compression
spark.conf.set("spark.io.compression.codec", "lz4")  # Fast compression
```

### Pattern 6: Data Format Optimization

```python
# Parquet optimizations
(df.write
    .option("compression", "snappy")  # Fast compression
    .option("parquet.block.size", 128 * 1024 * 1024)  # 128MB row groups
    .parquet("s3://bucket/output/"))

# Column pruning - only read needed columns
df = (spark.read.parquet("s3://bucket/data/")
    .select("id", "amount", "date"))  # Spark only reads these columns

# Predicate pushdown - filter at the storage level
df = (spark.read.parquet("s3://bucket/partitioned/year=2024/")
    .filter(F.col("status") == "active"))  # Pushed to the Parquet reader

# Delta Lake optimizations
(df.write
    .format("delta")
    .option("optimizeWrite", "true")  # Bin-packing
    .option("autoCompact", "true")    # Compact small files
    .mode("overwrite")
    .save("s3://bucket/delta_table/"))

# Z-ordering for multi-dimensional queries
spark.sql("""
    OPTIMIZE delta.`s3://bucket/delta_table/`
    ZORDER BY (customer_id, date)
""")
```
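To verify that column pruning and predicate pushdown actually fired, inspect the physical plan. A minimal sketch with illustrative column names; the exact plan text varies by Spark version:

```python
checked = (spark.read.parquet("s3://bucket/data/")
    .select("id", "amount")
    .filter(F.col("amount") > 100))

# In the FileScan node, ReadSchema should list only id and amount,
# and the filter should appear under PushedFilters, e.g.
#   PushedFilters: [IsNotNull(amount), GreaterThan(amount,100)]
checked.explain(mode="formatted")
```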
### Pattern 7: Monitoring and Debugging

```python
# Confirm key execution features are on (defaults in recent Spark versions)
spark.conf.set("spark.sql.codegen.wholeStage", "true")
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")

# Explain the query plan
df.explain(mode="extended")
# Modes: simple, extended, codegen, cost, formatted

# Get physical plan statistics
df.explain(mode="cost")

# Monitor task metrics
def analyze_stage_metrics(spark):
    """Print task counts for currently active stages."""
    status_tracker = spark.sparkContext.statusTracker()

    for stage_id in status_tracker.getActiveStageIds():
        stage_info = status_tracker.getStageInfo(stage_id)
        print(f"Stage {stage_id}:")
        print(f"  Tasks: {stage_info.numTasks}")
        print(f"  Completed: {stage_info.numCompletedTasks}")
        print(f"  Failed: {stage_info.numFailedTasks}")

# Identify data skew
def check_partition_skew(df):
    """Report row counts per partition and a max/avg skew ratio."""
    partition_counts = (df
        .withColumn("partition_id", F.spark_partition_id())
        .groupBy("partition_id")
        .count()
        .orderBy(F.desc("count")))

    partition_counts.show(20)

    stats = partition_counts.select(
        F.min("count").alias("min"),
        F.max("count").alias("max"),
        F.avg("count").alias("avg"),
        F.stddev("count").alias("stddev")
    ).collect()[0]

    skew_ratio = stats["max"] / stats["avg"]
    print(f"Skew ratio: {skew_ratio:.2f}x (>2x indicates skew)")
```

## Configuration Cheat Sheet

```python
# Production configuration template
spark_configs = {
    # Adaptive Query Execution (AQE)
    "spark.sql.adaptive.enabled": "true",
    "spark.sql.adaptive.coalescePartitions.enabled": "true",
    "spark.sql.adaptive.skewJoin.enabled": "true",

    # Memory
    "spark.executor.memory": "8g",
    "spark.executor.memoryOverhead": "2g",
    "spark.memory.fraction": "0.6",
    "spark.memory.storageFraction": "0.5",

    # Parallelism
    "spark.sql.shuffle.partitions": "200",
    "spark.default.parallelism": "200",

    # Serialization
    "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
    "spark.sql.execution.arrow.pyspark.enabled": "true",

    # Compression
    "spark.io.compression.codec": "lz4",
    "spark.shuffle.compress": "true",

    # Broadcast
    "spark.sql.autoBroadcastJoinThreshold": "50MB",

    # File handling
    "spark.sql.files.maxPartitionBytes": "128MB",
    "spark.sql.files.openCostInBytes": "4MB",
}
```

## Best Practices

### Do's
- **Enable AQE** - Adaptive Query Execution handles many issues automatically
- **Use Parquet/Delta** - Columnar formats with compression
- **Broadcast small tables** - Avoid shuffles for small joins
- **Monitor the Spark UI** - Check for skew, spills, GC
- **Right-size partitions** - 128-256MB per partition

### Don'ts
- **Don't collect large data** - Keep data distributed
- **Don't use UDFs unnecessarily** - Use built-in functions
- **Don't over-cache** - Memory is limited
- **Don't ignore data skew** - It dominates job time
- **Don't use `.count()` for existence** - Use `.take(1)` or `.isEmpty()`; see the sketch below
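A quick illustration of the last tip: `DataFrame.isEmpty()` exists in PySpark 3.3+, while `.take(1)` works everywhere (the `status` filter is illustrative):

```python
failed = df.filter(F.col("status") == "failed")

# COSTLY: counts every matching row just to test existence
if failed.count() > 0:
    print("found failures")

# CHEAP: stops after finding a single row
if len(failed.take(1)) > 0:
    print("found failures")

# Equivalent, PySpark 3.3+
if not failed.isEmpty():
    print("found failures")
```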
## Resources

- [Spark Performance Tuning](https://spark.apache.org/docs/latest/sql-performance-tuning.html)
- [Spark Configuration](https://spark.apache.org/docs/latest/configuration.html)
- [Databricks Optimization Guide](https://docs.databricks.com/en/optimizations/index.html)