spark
Apache Spark
DAG scheduler, stage boundaries, shuffle internals, RDD vs DataFrame, and memory management in Spark 3.x.
HDFS architecture, MapReduce internals, YARN resource management, and the Hadoop ecosystem overview.
HiveQL, partitioning strategies, bucketing, metastore architecture, and query optimization techniques.
DataFrame API, transformations vs actions, UDFs, Spark SQL integration, and performance tuning patterns.
Data engineering patterns: generators, decorators, async I/O, type hints, and testing with pytest.
Window functions, CTEs, execution plans, indexing strategies, and advanced aggregation patterns.