
Docker Hadoop Cluster Lab

Our goal with this lab is to set up a Hadoop cluster from scratch as quickly as possible using Docker. The host we run Docker on is a VirtualBox VM running Ubuntu Linux 18.04. The desktop we test from, which runs VirtualBox, is running Ubuntu 19.04. The containers are also Ubuntu based. On this page we review the steps we took to get this up and running.

NOTE - Many commands here were run directly as root. You should probably use sudo instead.

Due to the amount of troubleshooting and ad-hoc testing, we spent a lot of time logged into the containers checking things out. To make this easier we installed a lot of tools and utilities that you normally wouldn't include in a lightweight container. We basically ended up treating our containers as though they were just lightweight VMs. This isn't really how they are meant to be used. You probably shouldn't do this in production, but we think it is fine for a testing environment.

zippy-zap - Desktop / dev host. Desktop workstation running Ubuntu 19.04.
docker1 - Docker server. VirtualBox VM running as a guest on our dev host. Uses Ubuntu 18.04.

Docker server (VM):

Install Docker:

sudo apt-get install docker.io

Create a test container and test it:

docker create --hostname test1 --name test1 -ti ubuntu bash
docker ps -a
docker start test1
docker stop test1
docker rm test1

docker image ls

docker attach test1
docker exec -it test1 /bin/bash

apt update
apt install vim -y


Install Basic Network Utils:

apt-get install net-tools -y
apt-get install dnsutils -y
apt install iproute2 -y

DNS / Hostnames

Inspect network settings on our test container:

docker inspect -f '{{range .NetworkSettings.Networks}}{{.IPAddress}}{{end}}' test1

Test out creating a network. You need to specify a subnet if you want to be able to explicitly assign IPs to containers.

docker network create -d bridge --subnet <subnet-cidr> my-network1
docker network ls
docker network inspect my-network1
docker network rm my-network1

Run a container connected to the network with an assigned IP.

docker run -tid --net=my-network1 --ip=<IP> --name=test1 ubuntu

Connect a new container to the network like this:

docker create  --hostname test2 --name test2 -ti ubuntu
docker network connect --ip <IP> my-network1 test2

Create a container connected to our network:

docker create --network my-network1 --ip <IP> --hostname test3 --name test3 -ti ubuntu


We need to be able to SSH between hosts. This not only makes administration and testing easier, but also allows the setup scripts that come with Hadoop to work correctly. Here we set up an SSH server, add a user, and set up SSH keys.

apt update
apt install openssh-server -y
mkdir /run/sshd

useradd user1 -s /bin/bash
passwd user1
mkdir /home/user1
chown user1:user1 /home/user1
ssh user1@test5

ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 0600 ~/.ssh/authorized_keys

Network and Routing

We wanted to make sure that we could reach our containers directly from our desktop / dev host. We need to be able to SSH directly into our containers. We also need to be able to reach any web based tools Hadoop has to offer.

On our dev host we add a route to the private network used by our docker containers. We set our docker server as the router for this network.

sudo sysctl -w net.ipv4.ip_forward=1
sudo route add -net <subnet-cidr> gw <docker-server-ip> dev eno2

echo "net.ipv4.ip_forward=1" | sudo tee -a /etc/sysctl.conf

On the Docker server, docker1, we need to clear out all firewall rules and save them.

sudo iptables -F
sudo iptables -P FORWARD ACCEPT

sudo apt install iptables-persistent netfilter-persistent
sudo cat /etc/iptables/rules.v4
sudo netfilter-persistent save

# now dev can ssh to containers on private network

Build Image From Existing Container

We tested out building an image from our test container. We opted not to use this approach, since building an image from a Dockerfile is a better solution.

docker create --network my-network1 --ip <IP> --hostname test1 --name test1 -ti ubuntu bash

docker commit test1 standard

Build Image From Dockerfile

We set up a directory for our builds. Our Dockerfile goes in this directory. This directory will also hold all of our configuration files and anything else the container should be built with.

mkdir mydockerbuild
cd mydockerbuild

We decided to use Supervisor to manage the processes within our containers. This helps to avoid issues with multiprocess containers. Besides the normal Hadoop processes we also need to run SSHD.

vi supervisord.conf

[supervisord]
nodaemon=true

[program:sshd]
command=/usr/sbin/sshd -D

We disable strict host key checking, both on the Docker host and in the image (via the ssh_config file copied in by the Dockerfile):

Host *
   StrictHostKeyChecking no

Our Standard Docker Image

This is our “Standard” Docker image. It is set up in such a way that we can treat it like a VM.

vi Dockerfile

FROM ubuntu

RUN apt update
RUN apt install vim -y
RUN apt-get install net-tools -y
RUN apt-get install dnsutils -y
RUN apt install iproute2 -y
RUN apt install openssh-server -y
RUN apt install supervisor -y

RUN useradd user1 -s /bin/bash
RUN mkdir /home/user1
RUN mkdir /home/user1/.ssh
RUN chown user1:user1 /home/user1

RUN mkdir /run/sshd

RUN ssh-keygen -t rsa -P '' -f /home/user1/.ssh/id_rsa
RUN cat /home/user1/.ssh/id_rsa.pub >> /home/user1/.ssh/authorized_keys
COPY id_rsa2.pub /home/user1/.ssh/id_rsa2.pub
RUN cat /home/user1/.ssh/id_rsa2.pub >> /home/user1/.ssh/authorized_keys
RUN chmod 0600 /home/user1/.ssh/authorized_keys
RUN chown -R user1:user1 /home/user1/.ssh
RUN cp -r /home/user1/.ssh /root/.ssh
RUN chown -R root:root /root/.ssh

COPY ssh_config /etc/ssh/ssh_config
COPY supervisord.conf /etc/supervisor/conf.d/supervisord.conf

CMD ["/usr/bin/supervisord"]

We build this image like this:

docker build -t standard .    # build an image
docker images

Our Hadoop Docker Image

This is based on our “Standard” image but includes tools for running and managing Hadoop.

FROM standard

RUN useradd hdfs -s /bin/bash
RUN mkdir /home/hdfs
RUN cp -r /home/user1/.ssh /home/hdfs/.ssh
RUN chown -R hdfs:hdfs /home/hdfs

RUN useradd yarn -s /bin/bash
RUN mkdir /home/yarn
RUN cp -r /home/user1/.ssh /home/yarn/.ssh
RUN chown -R yarn:yarn /home/yarn

RUN useradd mapred -s /bin/bash
RUN mkdir /home/mapred
RUN cp -r /home/user1/.ssh /home/mapred/.ssh
RUN chown -R mapred:mapred /home/mapred

#COPY jdk-13.0.1_linux-x64_bin.tar.gz /
#RUN tar xvfz jdk-13.0.1_linux-x64_bin.tar.gz
#RUN rm jdk-13.0.1_linux-x64_bin.tar.gz
#RUN cp -r jdk-13.0.1 /opt
#RUN cd /opt && ln -s jdk-13.0.1 java
COPY jdk-7u15-linux-x64.tar.gz /
RUN tar xvfz jdk-7u15-linux-x64.tar.gz
RUN rm jdk-7u15-linux-x64.tar.gz
RUN cp -r jdk1.7.0_15 /opt
RUN cd /opt && ln -s jdk1.7.0_15 java

COPY hadoop-3.2.1.tar.gz /
RUN tar xvfz hadoop-3.2.1.tar.gz
RUN rm hadoop-3.2.1.tar.gz
RUN cp -r hadoop-3.2.1 /opt
RUN cd /opt && ln -s hadoop-3.2.1 hadoop

COPY env_vars.txt /
RUN cat env_vars.txt >> /home/user1/.bashrc
RUN cat env_vars.txt >> /root/.bashrc

RUN cat env_vars.txt >> /home/hdfs/.bashrc
RUN cat env_vars.txt >> /home/yarn/.bashrc
RUN cat env_vars.txt >> /home/mapred/.bashrc

RUN chown hdfs:hdfs /home/hdfs/.bashrc
RUN chown yarn:yarn /home/yarn/.bashrc
RUN chown mapred:mapred /home/mapred/.bashrc

COPY hadoop-env.sh /opt/hadoop/etc/hadoop/hadoop-env.sh
COPY workers /opt/hadoop/etc/hadoop/workers
COPY core-site.xml /opt/hadoop/etc/hadoop/core-site.xml
COPY hdfs-site.xml /opt/hadoop/etc/hadoop/hdfs-site.xml
COPY mapred-site.xml /opt/hadoop/etc/hadoop/mapred-site.xml
COPY yarn-site.xml /opt/hadoop/etc/hadoop/yarn-site.xml

RUN mkdir /var/HDFS-namenode && chown hdfs:hdfs /var/HDFS-namenode
RUN mkdir /var/HDFS-datanode && chown hdfs:hdfs /var/HDFS-datanode
RUN mkdir /opt/hadoop/logs && chmod a+rwx /opt/hadoop/logs

CMD ["/usr/bin/supervisord"]
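The contents of env_vars.txt are not shown in this writeup. A minimal sketch of what it might contain, assuming the /opt/java and /opt/hadoop symlinks created by this Dockerfile:

```shell
# env_vars.txt - hypothetical sketch; paths assume the /opt symlinks above
export JAVA_HOME=/opt/java
export HADOOP_HOME=/opt/hadoop
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$JAVA_HOME/bin
```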

We build this image with the following commands:

docker build -t my-hadoop .    # build an image
docker images

Storage Space Issues

One issue we ran into is that we ran out of space on our Docker server VM. The combined space needed to store all of the packages on each container used up what little space was available on the VM. Instead of building a new VM, we solved this by adding an additional virtual disk to our existing VM. By default, Docker was using /var/lib/docker/overlay2 to store image and container layers. We copied the data to our new volume and mounted it over this directory, providing more space than we will likely ever need for this project.

mount /dev/sdb /var/lib/docker/overlay2
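The full migration sequence, roughly, assuming the new disk shows up as /dev/sdb with a filesystem made directly on the raw device (device names are assumptions; adjust to match your VM):

```shell
# Sketch of moving Docker's layer storage onto the new disk.
sudo systemctl stop docker                    # stop Docker before moving its data
sudo mkfs.ext4 /dev/sdb                       # filesystem directly on the new disk
sudo mount /dev/sdb /mnt                      # mount it temporarily
sudo cp -a /var/lib/docker/overlay2/. /mnt/   # copy data, preserving ownership/permissions
sudo umount /mnt
sudo mount /dev/sdb /var/lib/docker/overlay2  # mount over the old directory
sudo systemctl start docker
```

To make the mount survive reboots, a matching /etc/fstab entry (e.g. `/dev/sdb /var/lib/docker/overlay2 ext4 defaults 0 2`) is also needed.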

Create Hadoop Cluster Containers

We used the following commands to create the containers for our cluster. Note that we are explicitly specifying the IP addresses used. This is a quick and dirty solution that is to be replaced with something cleaner (Ansible).

docker create --network my-network1 --ip <IP> --hostname name-node1 --name name-node1 -ti my-hadoop
docker create --network my-network1 --ip <IP> --hostname resource-manager1 --name resource-manager1 -ti my-hadoop
docker create --network my-network1 --ip <IP> --hostname webappproxy1 --name webappproxy1 -ti my-hadoop
docker create --network my-network1 --ip <IP> --hostname history-server1 --name history-server1 -ti my-hadoop
docker create --network my-network1 --ip <IP> --hostname worker1 --name worker1 -ti my-hadoop
docker create --network my-network1 --ip <IP> --hostname worker2 --name worker2 -ti my-hadoop
docker create --network my-network1 --ip <IP> --hostname worker3 --name worker3 -ti my-hadoop
docker create --network my-network1 --ip <IP> --hostname worker4 --name worker4 -ti my-hadoop
docker create --network my-network1 --ip <IP> --hostname worker5 --name worker5 -ti my-hadoop
docker create --network my-network1 --ip <IP> --hostname worker6 --name worker6 -ti my-hadoop
docker create --network my-network1 --ip <IP> --hostname worker7 --name worker7 -ti my-hadoop
docker create --network my-network1 --ip <IP> --hostname worker8 --name worker8 -ti my-hadoop
docker create --network my-network1 --ip <IP> --hostname worker9 --name worker9 -ti my-hadoop
docker create --network my-network1 --ip <IP> --hostname worker10 --name worker10 -ti my-hadoop
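Rather than typing out each worker by hand, the ten worker commands can be generated in a loop. This is just a sketch: the BASE subnet prefix and the +100 offset are assumptions, so adjust them to match my-network1.

```shell
# Hypothetical sketch: print the "docker create" command for each worker.
# BASE and the +100 host offset are assumptions; set them to fit your subnet.
BASE=192.168.50
for n in $(seq 1 10); do
  echo docker create --network my-network1 --ip $BASE.$((100 + n)) \
    --hostname worker$n --name worker$n -ti my-hadoop
done
```

Dropping the `echo` (or piping the output to `sh`) actually creates the containers.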

We added /etc/hosts entries on our Docker server and on our dev host, mapping each container's IP to its hostname: name-node1, resource-manager1, webappproxy1, history-server1, and worker1 through worker10.
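Since Docker knows the addresses, the entries can also be generated rather than typed. A sketch using the inspect format string from earlier, assuming the containers have been started:

```shell
# Print an /etc/hosts line (IP followed by hostname) for each cluster container.
for name in name-node1 resource-manager1 webappproxy1 history-server1 \
            worker1 worker2 worker3 worker4 worker5 \
            worker6 worker7 worker8 worker9 worker10
do
  ip=$(docker inspect -f '{{range .NetworkSettings.Networks}}{{.IPAddress}}{{end}}' "$name")
  printf '%s %s\n' "$ip" "$name"
done
```

The output can then be appended to /etc/hosts on both the Docker server and the dev host.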

We use this loop to actually start the containers in the cluster:

for i in name-node1 resource-manager1 webappproxy1 history-server1 worker1 worker2 worker3 worker4 worker5 worker6 worker7 worker8 worker9 worker10; do docker start $i; done

Docker / Hadoop Cluster Tear Down

We used the following loop to stop all of our containers.

for i in name-node1 resource-manager1 webappproxy1 history-server1 worker1 worker2 worker3 worker4 worker5 worker6 worker7 worker8 worker9 worker10; do docker stop $i; done

We used this command to destroy our containers.

for i in name-node1 resource-manager1 webappproxy1 history-server1 worker1 worker2 worker3 worker4 worker5 worker6 worker7 worker8 worker9 worker10; do docker rm $i; done

Testing Our Hadoop Cluster

While the containers were up, we used the following to test passwordless SSH access to the name node as the hdfs user.

ssh hdfs@name-node1

From the name node we format the HDFS filesystem like this:

$HADOOP_HOME/bin/hdfs namenode -format hdfs://name-node1:9000
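Formatting alone doesn't start anything. Before testing, the daemons need to be running. With the helper scripts that ship with Hadoop (and the workers file we copied into the image), starting everything looks roughly like this; which user runs what depends on the user settings in hadoop-env.sh, so treat this as a sketch:

```shell
# As the hdfs user on name-node1: start the NameNode and all DataNodes.
$HADOOP_HOME/sbin/start-dfs.sh

# As the yarn user on resource-manager1: start the ResourceManager and NodeManagers.
$HADOOP_HOME/sbin/start-yarn.sh

# As the mapred user on history-server1: start the MapReduce history server.
$HADOOP_HOME/bin/mapred --daemon start historyserver
```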

We test out some HDFS commands like this:

hdfs dfs -mkdir /user
hdfs dfs -put test1.txt /user
hdfs dfs -ls /user
hdfs dfs -cat /user/test1.txt
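To exercise YARN and MapReduce end to end, the example jobs bundled with Hadoop can be run. The jar path below assumes the Hadoop 3.2.1 layout from our image:

```shell
# Estimate pi with 2 mappers and 10 samples each - a quick end-to-end smoke test.
yarn jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.2.1.jar pi 2 10
```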

What remains to be done: