5 Comments
Arshadh:

Thanks for the post. I didn't know that Spark Connect hadn't existed until now. (With databricks-connect, I thought a native version had always been there.)

A few questions:

1) Isn't a Spark application a.k.a. the driver process? In your post I see both referred to as separate things.

2) This may be something you planned to cover in part 2. How is resource isolation planned? Is it something to do with FAIR pools?

Daniel Aronovich:

Great questions, Arshadh!

1. Traditionally, a Spark application's code runs inside the same process as the driver, which hosts Spark's execution engine. With Spark Connect, the application (client) is decoupled from the driver, meaning your code no longer runs inside the Spark driver process. Instead, it connects remotely to Spark's execution engine, allowing for better isolation, flexibility, and multi-language support. So while the driver is still required, it is now a separate entity from the client application.
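For example, here's a minimal sketch of that decoupling in PySpark (assuming pyspark is installed with the connect extras and a Spark Connect server is already listening on the default sc://localhost:15002 endpoint; the classic-mode line is shown only for contrast):

```python
from pyspark.sql import SparkSession

# Classic mode: creating a session here starts a driver attached to this application.
# spark = SparkSession.builder.master("local[*]").getOrCreate()

# Spark Connect mode: this process is only a thin client; the driver and the
# execution engine run remotely behind the Spark Connect endpoint.
spark = SparkSession.builder.remote("sc://localhost:15002").getOrCreate()

df = spark.range(10).filter("id % 2 = 0")  # builds an unresolved plan on the client
df.show()  # the plan is sent to the remote driver, executed there, and rows stream back
```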

2. Resource isolation with Spark Connect is not directly tied to FAIR pools, but it does improve workload separation. Since each client runs independently from the Spark driver’s JVM, one user’s heavy query won’t crash the driver or impact others on a shared cluster. FAIR scheduling still plays a role in managing resource distribution among Spark tasks, but Spark Connect enhances stability by preventing client-side failures from affecting the whole cluster.
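If it helps, here is a rough sketch of the FAIR scheduling side (this uses the classic-mode API; the scheduler mode is a driver-side setting either way, and the allocation file path and pool name below are just placeholders):

```python
from pyspark.sql import SparkSession

# FAIR scheduling is configured on the driver, independent of Spark Connect.
spark = (
    SparkSession.builder
    .config("spark.scheduler.mode", "FAIR")  # default is FIFO
    .config("spark.scheduler.allocation.file", "/path/to/fairscheduler.xml")  # optional pool definitions
    .getOrCreate()
)

# Jobs submitted from this thread are routed into a named pool defined in that file.
spark.sparkContext.setLocalProperty("spark.scheduler.pool", "interactive")
```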

Hope I understood and answered your questions :)

Arshadh:

Thank you.

Yogesh Gupta:

Spark has always followed a lazy evaluation approach. I am wondering, if I have a Spark job doing hundreds of transformations using the Dataset APIs, JDBC reads from another database, and reads from HDFS, how will execution happen in this scenario? My second question is: will this introduce too much traffic over Spark Connect? My third question is: how are we going to configure the resources needed for a given complex Spark job?

Daniel Aronovich:

Thanks for the great questions!

So execution remains lazy, meaning transformations are only computed when an action is triggered. The client sends a logical plan (not actual data) to the Spark driver, which optimizes and executes it on the cluster. This ensures efficient execution without unnecessary computation.
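To make that concrete, here's a hedged sketch along the lines of your example (the endpoint, HDFS path, and JDBC connection details are all hypothetical):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.remote("sc://localhost:15002").getOrCreate()

# None of these calls move rows to the client; they only extend the logical plan.
orders = spark.read.parquet("hdfs:///data/orders")
customers = (
    spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://db-host:5432/shop")
    .option("dbtable", "customers")
    .load()
)
report = orders.join(customers, "customer_id").groupBy("country").count()

# Only this action ships the plan to the driver, which optimizes it and runs the
# HDFS read, the JDBC read, and the join on the cluster, not on the client.
report.show()
```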

Network traffic is minimal since only logical plans are transmitted, and results are efficiently returned using Apache Arrow. Compared to JDBC-based approaches, Spark Connect reduces overhead while maintaining performance.

And lastly, resource configuration stays the same as in traditional Spark, managed at the cluster level (spark.driver.memory, spark.executor.memory). Dynamic allocation helps scale resources efficiently, and each application runs in isolation, improving stability in shared environments.
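As an illustration only (the sizing values are made up, and with Spark Connect these configs belong where the driver / Connect server is launched, not on each remote client):

```python
from pyspark.sql import SparkSession

# The same resource configs you would use without Spark Connect.
spark = (
    SparkSession.builder
    .config("spark.executor.memory", "16g")
    .config("spark.executor.cores", "4")
    .config("spark.dynamicAllocation.enabled", "true")
    .config("spark.dynamicAllocation.maxExecutors", "50")
    .getOrCreate()
)
# spark.driver.memory is normally passed at launch time (e.g. via spark-submit),
# since the driver JVM is already running by the time these configs are applied.
```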
