First Hurdle of My Spark Adventure: Docker containers
Overview
Geeeez, figuring out multiple machines, Docker containers, and Spark all at once was finicky!
As I was new to Docker and Spark, there were a lot of pitfalls that found a way to work in tandem to make my life very difficult. In this post I will elaborate on my final Docker configuration for the worker and master nodes, and in the next post I dive into how this enables the standalone Spark cluster to work across my Mac and Raspberry Pis.
Lessons learned
Two critical lessons I learned: 1) A Docker container (like spark-master running on my Mac, for example) has its own filesystem and network. We have to explicitly copy over files if the container must use them; we must also bind ports (e.g. 8080:8080) and define the network to use if we want containers to communicate successfully over the Local Area Network.
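As a concrete sketch of lesson 1 (the names demo, demo2, my-image and extra.conf below are placeholders, not files from this project; the mounted path just mirrors the spark-env.sh used later in this post):
# Containers get their own filesystem and network namespace; nothing on the host
# is visible to them unless you wire it through explicitly.

# Publish a port on the host and bind-mount a config file into the container:
docker run -d --name demo -p 8080:8080 \
  -v "$(pwd)/conf/spark-env.sh:/opt/spark/conf/spark-env.sh" my-image

# Alternatively, share the host's network stack entirely (what the Pi workers do later):
docker run -d --name demo2 --network host my-image

# Files can also be copied into a container that is already running:
docker cp extra.conf demo:/opt/spark/conf/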
2) At one point, I hit a weird bug where Docker consumed almost all available disk space. It turned out I had accumulated a ton of dangling images, unused volumes, and stale build layers from rebuilding containers during debugging. Cleanup commands I used:
docker system prune -f # Remove stopped containers, unused networks, dangling images, and dangling build cache
docker image prune -a -f # Remove all images not used by any container
docker volume prune -f # Clean up unused volumes (they persist even after container deletion)
du -sh ~/Library/Containers/com.docker.docker/Data/* # Check Docker disk usage on macOS
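Before pruning anything, it is worth checking where the space is actually going; the build cache in particular balloons when you rebuild images over and over:
docker system df         # per-category breakdown: images, containers, local volumes, build cache
docker builder prune -f  # drop dangling build cache left over from repeated rebuilds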
Docker config
There are 6 essential files for the proper functioning of containerized Spark workers on the RPis and a containerized Spark coordinator on my Mac (the directory layout they form is sketched just after this list):
- compose.yml
- master/Dockerfile
- master/start-master.sh (discussed in next post)
- worker/Dockerfile
- worker/start-worker.sh (discussed in next post)
- spark-env.sh (discussed in next post)
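Pieced together from the build contexts in compose.yml and the COPY paths in the Dockerfiles, the project layout looks roughly like this (the root folder name here is my own choice):
spark-cluster/            # project root (name is arbitrary)
├── compose.yml
├── conf/
│   └── spark-env.sh
├── master/
│   ├── Dockerfile
│   └── start-master.sh
└── worker/
    ├── Dockerfile
    └── start-worker.sh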
I won’t give an entire overview of Docker (there are plenty of good videos and blogs that already do that), but it is important to understand the idea of Images, Containers and Instances. The Dockerfile is where you design your Image. The Image is what tells your computer how to build a Container. Once the Image is built, it can be put into operation on any compatible device as a running Container, i.e. an instance.
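In command form the relationship looks like this (the names my-image and my-container are placeholders, not anything from this project):
docker build -t my-image .                    # the Dockerfile in the current directory becomes an Image
docker run -d --name my-container my-image    # the Image is started as a Container
docker ps                                     # list the Container instances currently running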
master/Dockerfile:
FROM openjdk:17-slim
ENV SPARK_VERSION="3.5.6"
ENV SPARK_HOME="/opt/spark"
ENV PATH="${SPARK_HOME}/sbin:${SPARK_HOME}/bin:$PATH"
EXPOSE 7077 8080
RUN apt-get update && \
    apt-get install -y --no-install-recommends \
        curl \
        wget \
        vim \
        procps \
        iproute2 \
        ca-certificates && \
    rm -rf /var/lib/apt/lists/* && \
    curl -L -o /tmp/spark.tgz https://dlcdn.apache.org/spark/spark-${SPARK_VERSION}/spark-${SPARK_VERSION}-bin-hadoop3.tgz && \
    tar -xzf /tmp/spark.tgz -C /opt && \
    mv /opt/spark-${SPARK_VERSION}-bin-hadoop3 ${SPARK_HOME} && \
    rm /tmp/spark.tgz && \
    rm -rf ${SPARK_HOME}/examples \
           ${SPARK_HOME}/jars/*mesos* \
           ${SPARK_HOME}/jars/*kubernetes* \
           ${SPARK_HOME}/jars/*hive* \
           ${SPARK_HOME}/jars/*parquet* \
           ${SPARK_HOME}/jars/*orc*
# Copy startup and config scripts
COPY conf/spark-env.sh ${SPARK_HOME}/conf/spark-env.sh
COPY master/start-master.sh /opt/start-master.sh
RUN chmod +x ${SPARK_HOME}/conf/spark-env.sh /opt/start-master.sh
CMD ["/opt/start-master.sh"]
Quick rundown: for Spark I need the Docker container to be able to use Java 17, so I use that as the base image: FROM openjdk:17-slim. Spark itself must be downloaded from the web using curl; other dependencies can be installed using RUN apt-get install dependency_x.
The content at the bottom of the file is the interesting stuff. We must copy the scripts that are going to be executed into the container. For me that is:
COPY conf/spark-env.sh ${SPARK_HOME}/conf/spark-env.sh
COPY master/start-master.sh /opt/start-master.sh
The first script deals with configuring IP addresses and ports of the workers and coordinator nodes in the cluster. The second script starts an instance of the Spark master within the container. These are both covered in more detail in the next post.
The final line uses CMD to specify what the container should run on startup, in this case our master launch script. It’s important that this is not a RUN command, since RUN executes at build time and its command does not run again when a container starts from the image.
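One way to see the difference in practice, as a sketch (the tag spark-master-test is a throwaway name I am choosing here, and it assumes the build runs from the project root so the COPY paths resolve):
# Build the image straight from the Dockerfile above
docker build -f master/Dockerfile -t spark-master-test .
# CMD (unlike ENTRYPOINT) is only a default: any command passed to `docker run`
# replaces it, which makes for an easy sanity check that Spark is installed
docker run --rm spark-master-test spark-submit --version
# Inspect the default startup command baked into the image
docker inspect --format '{{.Config.Cmd}}' spark-master-test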
compose.yml:
The compose.yml file coordinates the operation of multiple containers.
services:
  spark-master:
    build:
      context: .
      dockerfile: master/Dockerfile
    container_name: spark-master
    ports:
      - "7077:7077"
      - "8080:8080"
    environment:
      - SPARK_MODE=master
      - SPARK_MASTER_HOST=<MAC_LAN_IP>
      - SPARK_PUBLIC_DNS=<MAC_LAN_IP>
      - SPARK_LOCAL_IP=0.0.0.0
    volumes:
      - ./conf/spark-env.sh:/opt/spark/conf/spark-env.sh
  spark-worker:
    build:
      context: .
      dockerfile: worker/Dockerfile
    container_name: spark-worker
    # host networking shares the Pi's network stack directly, so the port
    # mappings below are ignored in this mode; the worker's ports (7078, 8081)
    # are opened on the host itself
    network_mode: "host"
    ports:
      - "7078:7078"
      - "8081:8081"
    environment:
      - SPARK_MODE=worker
      - SPARK_MASTER_HOST=<MAC_LAN_IP>
    volumes:
      - ./conf/spark-env.sh:/opt/spark/conf/spark-env.sh
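With these two services defined, bringing things up looks roughly like this; treat it as a sketch rather than the exact workflow, since the startup scripts and the <MAC_LAN_IP> placeholder are covered in the next post:
# On the Mac (coordinator): build and start the master container
docker compose up -d --build spark-master

# On each Raspberry Pi (worker), after filling in <MAC_LAN_IP> with the Mac's LAN address:
docker compose up -d --build spark-worker

# The master's web UI should then be reachable from the LAN at http://<MAC_LAN_IP>:8080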