Docker Hadoop Cluster Lab
Our goal with this lab is to set up a Hadoop cluster from scratch as quickly as possible using Docker. The host we run Docker on is a VirtualBox VM running Ubuntu Linux 18.04. The desktop we test from, which also runs VirtualBox, runs Ubuntu 19.04. The containers are Ubuntu based as well. On this page we review the steps we took to get this up and running.
NOTE - Many commands here were run directly as root. You should probably use sudo instead.
Due to the amount of troubleshooting and ad-hoc testing, we spent a lot of time logged into the containers checking things out. To make this easier we installed a lot of tools and utilities that you normally wouldn't include in a lightweight container. We basically ended up treating our containers as though they were lightweight VMs. This isn't really how they are meant to be used, and you probably shouldn't do this in production, but we think it is fine for a testing environment.
Host | Role | Description |
zippy-zap | Desktop / Dev Host | Desktop workstation running Ubuntu 19.04 |
docker1 | Docker Server | VirtualBox VM running as a guest on our dev host. Uses Ubuntu 18.04. |
name-node1 | Hadoop Container | Runs the HDFS NameNode |
resource-manager1 | Hadoop Container | Runs the YARN ResourceManager |
webappproxy1 | Hadoop Container | Runs the YARN Web Application Proxy |
history-server1 | Hadoop Container | Runs the MapReduce Job History Server |
worker1 | Hadoop Container | Worker node ( DataNode / NodeManager ) |
worker2 | Hadoop Container | Worker node ( DataNode / NodeManager ) |
worker3 | Hadoop Container | Worker node ( DataNode / NodeManager ) |
worker4 | Hadoop Container | Worker node ( DataNode / NodeManager ) |
worker5 | Hadoop Container | Worker node ( DataNode / NodeManager ) |
worker6 | Hadoop Container | Worker node ( DataNode / NodeManager ) |
worker7 | Hadoop Container | Worker node ( DataNode / NodeManager ) |
worker8 | Hadoop Container | Worker node ( DataNode / NodeManager ) |
worker9 | Hadoop Container | Worker node ( DataNode / NodeManager ) |
worker10 | Hadoop Container | Worker node ( DataNode / NodeManager ) |
Hadoop containers:
- passwordless ssh access
Docker server ( VM ):
- static IP
- bridge network with dev machine
Install Docker:
sudo apt-get install docker.io
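If you would rather not prefix every docker command with sudo ( we mostly ran things directly as root, per the note above ), the usual approach is to add your user to the docker group and log back in:
sudo usermod -aG docker $USER
docker --version    # after logging back in, docker should work without sudo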
Create a test container and test it:
docker create --hostname test1 --name test1 -ti ubuntu bash
docker ps -a                        # list all containers, running or stopped
docker start test1
docker stop test1
docker rm test1
docker image ls
docker attach test1                 # attach to the container's main process ( detach with Ctrl-p Ctrl-q )
docker exec -it test1 /bin/bash     # open a separate shell inside the running container
Install VIM:
apt update
apt install vim -y
Network
Install Basic Network Utils:
apt-get install net-tools -y
apt-get install dnsutils -y
apt install iproute2 -y
DNS / Hostnames
- /etc/hosts will only contain the local host by default ( this is fine )
- On a custom network, other containers will be resolvable by name
- Container names won't be resolvable from outside the Docker host
Inspect network settings on our test container:
docker inspect -f '{{range .NetworkSettings.Networks}}{{.IPAddress}}{{end}}' test1
Test out creating a network. You need to specify a subnet if you want to be able to explicitly assign IPs to containers.
docker network create -d bridge --subnet 172.25.0.0/16 my-network1
docker network ls
docker network inspect my-network1
docker network rm my-network1
Run a container connected to the network with an assigned IP.
docker run -tid --net=my-network1 --ip=172.25.0.101 --name=test1 ubuntu
Connect a new container to the network like this:
docker create --hostname test2 --name test2 -ti ubuntu
docker network connect my-network1 test2 --ip 172.25.0.102
Create a container connected to our network:
docker create --network my-network1 --ip 172.25.0.103 --hostname test3 --name test3 -ti ubuntu
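To confirm the name resolution behaviour described in the DNS / Hostnames section, start a couple of the test containers and check from inside and outside ( test2 and test3 here are just the containers created above ):
docker start test2 test3
docker exec test3 getent hosts test2    # resolves via Docker's embedded DNS on my-network1
getent hosts test3                      # on the docker host itself the name will not resolve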
SSH
We need to be able to ssh between hosts. This not only makes administration and testing easier, it also allows the setup scripts that come with Hadoop to work correctly. Here we set up an SSH server, add a user, and set up SSH keys.
apt update
apt install openssh-server -y
mkdir /run/sshd     # sshd needs this directory and it doesn't exist in the container
/usr/sbin/sshd      # start the SSH daemon
useradd user1 -s /bin/bash
passwd user1
mkdir /home/user1
chown user1:user1 /home/user1
ssh user1@test5
ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 0600 ~/.ssh/authorized_keys
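To verify that the key based login really works without a password prompt, a quick check along these lines can be used ( BatchMode makes ssh fail instead of asking for a password; test5 is just the example container name used above ):
ssh -o BatchMode=yes user1@test5 hostname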
Network and Routing
We wanted to make sure that we could reach our containers directly from our desktop / dev host: we need to be able to SSH straight into the containers and to reach any web based tools Hadoop has to offer.
- docker1 added to /etc/hosts on dev box
- routing enabled on dev box
- routing enabled on docker1 ( already set )
- added a route
On our dev host we add a route to the private network used by our docker containers. We set our docker server as the router for this network.
sudo sysctl -w net.ipv4.ip_forward=1
sudo route add -net 172.25.0.0/16 gw 192.168.3.101 dev eno2
echo "net.ipv4.ip_forward=1" >> /etc/sysctl.conf
On the Docker server, docker1, we need to clear out all firewall rules, allow forwarding, and save the result.
sudo iptables -F
sudo iptables -P FORWARD ACCEPT
sudo apt install iptables-persistent netfilter-persistent
sudo cat /etc/iptables/rules.v4
sudo netfilter-persistent save
# now dev can ssh to containers on private network
ssh 172.25.0.101
Build Image From Existing Container
We tested out building an image from our test container. We opted not to use this approach, since building an image from a Dockerfile is cleaner.
docker create --network my-network1 --ip 172.25.0.101 --hostname test1 --name test1 -ti ubuntu bash
docker commit test1 standard
Build Image From Dockerfile
We set up a directory for our builds. Our Dockerfile goes in this directory. It also holds all of our configuration files and anything else the container should be built with.
mkdir mydockerbuild
cd mydockerbuild
We decided to use Supervisor to manage the processes within our containers. This helps avoid issues with multi-process containers. Besides the normal Hadoop processes we also need to run sshd.
vi supervisord.conf
[supervisord]
nodaemon=true
[program:sshd]
command=/usr/sbin/sshd -D
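For now supervisord only manages sshd; having it control the Hadoop daemons is still on the to-do list at the bottom of this page. A rough sketch of what a daemon entry might look like with our layout ( the command path and user are assumptions, and we haven't wired this up yet ):
[program:namenode]
command=/opt/hadoop/bin/hdfs namenode
user=hdfs
environment=JAVA_HOME="/opt/java"
autostart=false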
On the docker host, and in every container via the Dockerfile, we disable strict host key checking so ssh never prompts about unknown hosts. The following goes in /etc/ssh/ssh_config:
Host *
StrictHostKeyChecking no
UserKnownHostsFile=/dev/null
Our Standard Docker Image
This is our “Standard” Docker image. It is setup in such a way that we can treat it like a VM.
vi Dockerfile
FROM ubuntu
RUN apt update
RUN apt install vim -y
RUN apt-get install net-tools -y
RUN apt-get install dnsutils -y
RUN apt install iproute2 -y
RUN apt install openssh-server -y
RUN apt install supervisor -y
RUN useradd user1 -s /bin/bash
RUN mkdir /home/user1
RUN mkdir /home/user1/.ssh
RUN chown user1:user1 /home/user1
RUN mkdir /run/sshd
RUN ssh-keygen -t rsa -P '' -f /home/user1/.ssh/id_rsa
RUN cat /home/user1/.ssh/id_rsa.pub >> /home/user1/.ssh/authorized_keys
# an additional public key allowed to log in ( e.g. the key used to ssh in from outside the containers )
COPY id_rsa2.pub /home/user1/.ssh/id_rsa2.pub
RUN cat /home/user1/.ssh/id_rsa2.pub >> /home/user1/.ssh/authorized_keys
RUN chmod 0600 /home/user1/.ssh/authorized_keys
RUN chown -R user1:user1 /home/user1/.ssh
RUN cp -r /home/user1/.ssh /root/.ssh
RUN chown -R root:root /root/.ssh
COPY ssh_config /etc/ssh/ssh_config
COPY supervisord.conf /etc/supervisor/conf.d/supervisord.conf
CMD ["/usr/bin/supervisord"]
We build this image like this:
docker build -t standard . # build an image
docker images
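A quick way to check the image behaves as intended is to start a container from it and ssh straight in ( the IP and name are just examples on my-network1, and this assumes id_rsa2.pub is the public key of whoever is doing the ssh ):
docker run -d --network my-network1 --ip 172.25.0.120 --name standard-test standard
ssh user1@172.25.0.120 hostname
docker rm -f standard-test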
Our Hadoop Docker Image
This is based on our “Standard” image but includes tools for running and managing Hadoop.
FROM standard
RUN useradd hdfs -s /bin/bash
RUN mkdir /home/hdfs
RUN cp -r /home/user1/.ssh /home/hdfs/.ssh
RUN chown -R hdfs:hdfs /home/hdfs
RUN useradd yarn -s /bin/bash
RUN mkdir /home/yarn
RUN cp -r /home/user1/.ssh /home/yarn/.ssh
RUN chown -R yarn:yarn /home/yarn
RUN useradd mapred -s /bin/bash
RUN mkdir /home/mapred
RUN cp -r /home/user1/.ssh /home/mapred/.ssh
RUN chown -R mapred:mapred /home/mapred
#COPY jdk-13.0.1_linux-x64_bin.tar.gz /
#RUN tar xvfz jdk-13.0.1_linux-x64_bin.tar.gz
#RUN rm jdk-13.0.1_linux-x64_bin.tar.gz
#RUN cp -r jdk-13.0.1 /opt
#RUN cd /opt && ln -s jdk-13.0.1 java
COPY jdk-7u15-linux-x64.tar.gz /
RUN tar xvfz jdk-7u15-linux-x64.tar.gz
RUN rm jdk-7u15-linux-x64.tar.gz
RUN cp -r jdk1.7.0_15 /opt
RUN cd /opt && ln -s jdk1.7.0_15 java
COPY hadoop-3.2.1.tar.gz /
RUN tar xvfz hadoop-3.2.1.tar.gz
RUN rm hadoop-3.2.1.tar.gz
RUN cp -r hadoop-3.2.1 /opt
RUN cd /opt && ln -s hadoop-3.2.1 hadoop
COPY env_vars.txt /
RUN cat env_vars.txt >> /home/user1/.bashrc
RUN cat env_vars.txt >> /root/.bashrc
RUN cat env_vars.txt >> /home/hdfs/.bashrc
RUN cat env_vars.txt >> /home/yarn/.bashrc
RUN cat env_vars.txt >> /home/mapred/.bashrc
RUN chown hdfs:hdfs /home/hdfs/.bashrc
RUN chown yarn:yarn /home/yarn/.bashrc
RUN chown mapred:mapred /home/mapred/.bashrc
COPY hadoop-env.sh /opt/hadoop/etc/hadoop/hadoop-env.sh
COPY workers /opt/hadoop/etc/hadoop/workers
COPY core-site.xml /opt/hadoop/etc/hadoop/core-site.xml
COPY hdfs-site.xml /opt/hadoop/etc/hadoop/hdfs-site.xml
COPY mapred-site.xml /opt/hadoop/etc/hadoop/mapred-site.xml
COPY yarn-site.xml /opt/hadoop/etc/hadoop/yarn-site.xml
RUN mkdir /var/HDFS-namenode && chown hdfs:hdfs /var/HDFS-namenode
RUN mkdir /var/HDFS-datanode && chown hdfs:hdfs /var/HDFS-datanode
RUN mkdir /opt/hadoop/logs && chmod a+rwx /opt/hadoop/logs
CMD ["/usr/bin/supervisord"]
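The Dockerfile copies in an env_vars.txt that gets appended to each user's .bashrc; its contents aren't shown above, but ours is roughly the following ( the paths rely on the /opt/java and /opt/hadoop symlinks created earlier ). The workers file it copies is simply the worker hostnames, worker1 through worker10, one per line.
export JAVA_HOME=/opt/java
export HADOOP_HOME=/opt/hadoop
export PATH=$PATH:$JAVA_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin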
We build this image with the following commands:
docker build -t my-hadoop . # build an image
docker images
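As a quick smoke test of the image ( JAVA_HOME is passed explicitly here in case neither hadoop-env.sh nor the .bashrc additions are picked up in a non-interactive run ):
docker run --rm -e JAVA_HOME=/opt/java my-hadoop /opt/hadoop/bin/hadoop version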
Storage Space Issues
One issue we ran into is that we ran out of space on our Docker server VM. The combined space needed to store all of the packages in each container used up what little space was available on the VM. Instead of building a new VM, we solved this by adding an additional virtual disk to the existing VM. By default Docker stores container and image layers under /var/lib/docker/overlay2, so we copied that data onto the new disk and mounted the disk over this directory, giving us more space than we will likely ever need for this project.
mount /dev/sdb /var/lib/docker/overlay2
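In a little more detail, the migration went roughly like this ( a sketch; /dev/sdb is the new virtual disk and we used ext4 ):
systemctl stop docker
mkfs.ext4 /dev/sdb
mount /dev/sdb /mnt
cp -a /var/lib/docker/overlay2/. /mnt/
umount /mnt
mount /dev/sdb /var/lib/docker/overlay2
echo '/dev/sdb /var/lib/docker/overlay2 ext4 defaults 0 0' >> /etc/fstab
systemctl start docker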
Create Hadoop Cluster Containers
We used the following commands to create the containers for our cluster. Note that we are explicitly specifying the IP addresses used. This is a quick and dirty approach that we plan to replace with something cleaner (Ansible).
docker create --network my-network1 --ip 172.25.0.102 --hostname name-node1 --name name-node1 -ti my-hadoop
docker create --network my-network1 --ip 172.25.0.103 --hostname resource-manager1 --name resource-manager1 -ti my-hadoop
docker create --network my-network1 --ip 172.25.0.104 --hostname webappproxy1 --name webappproxy1 -ti my-hadoop
docker create --network my-network1 --ip 172.25.0.105 --hostname history-server1 --name history-server1 -ti my-hadoop
docker create --network my-network1 --ip 172.25.0.106 --hostname worker1 --name worker1 -ti my-hadoop
docker create --network my-network1 --ip 172.25.0.107 --hostname worker2 --name worker2 -ti my-hadoop
docker create --network my-network1 --ip 172.25.0.108 --hostname worker3 --name worker3 -ti my-hadoop
docker create --network my-network1 --ip 172.25.0.109 --hostname worker4 --name worker4 -ti my-hadoop
docker create --network my-network1 --ip 172.25.0.110 --hostname worker5 --name worker5 -ti my-hadoop
docker create --network my-network1 --ip 172.25.0.111 --hostname worker6 --name worker6 -ti my-hadoop
docker create --network my-network1 --ip 172.25.0.112 --hostname worker7 --name worker7 -ti my-hadoop
docker create --network my-network1 --ip 172.25.0.113 --hostname worker8 --name worker8 -ti my-hadoop
docker create --network my-network1 --ip 172.25.0.114 --hostname worker9 --name worker9 -ti my-hadoop
docker create --network my-network1 --ip 172.25.0.115 --hostname worker10 --name worker10 -ti my-hadoop
We added the following /etc/hosts entries on both our Docker server and our dev host.
172.25.0.102 name-node1
172.25.0.103 resource-manager1
172.25.0.104 webappproxy1
172.25.0.105 history-server1
172.25.0.106 worker1
172.25.0.107 worker2
172.25.0.108 worker3
172.25.0.109 worker4
172.25.0.110 worker5
172.25.0.111 worker6
172.25.0.112 worker7
172.25.0.113 worker8
172.25.0.114 worker9
172.25.0.115 worker10
We use this loop to actually start the containers in the cluster:
for i in \
name-node1 \
resource-manager1 \
webappproxy1 \
history-server1 \
worker1 \
worker2 \
worker3 \
worker4 \
worker5 \
worker6 \
worker7 \
worker8 \
worker9 \
worker10 \
;do docker start $i; done
Docker / Hadoop Cluster Tear Down
We used the following loop to stop all of our containers.
for i in name-node1 resource-manager1 webappproxy1 history-server1 worker1 worker2 worker3 worker4 worker5 worker6 worker7 worker8 worker9 worker10;do docker stop $i; done
We used this command to destroy our containers.
for i in name-node1 resource-manager1 webappproxy1 history-server1 worker1 worker2 worker3 worker4 worker5 worker6 worker7 worker8 worker9 worker10;do docker rm $i; done
Testing Our Hadoop Cluster
While the containers were up we used the following to test passwordless ssh access to the name node as the hdfs user.
ssh hdfs@name-node1
From the name node we create an HDFS file system like this:
$HADOOP_HOME/bin/hdfs namenode -format hdfs://name-node1:9000
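Formatting alone doesn't start anything. Since supervisord doesn't manage the Hadoop daemons yet ( see the list at the end of this page ), the daemons have to be brought up by hand with the stock Hadoop scripts before the HDFS commands below will work, roughly like this:
# on name-node1, as the hdfs user
$HADOOP_HOME/sbin/start-dfs.sh
# on resource-manager1, as the yarn user
$HADOOP_HOME/sbin/start-yarn.sh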
We test out some HDFS commands like this:
hdfs dfs -mkdir /user
hdfs dfs -put test1.txt /user
hdfs dfs -ls /user
hdfs dfs -cat /user/test1.txt
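Testing MapReduce is still on the list below, but the stock examples jar that ships with Hadoop is the obvious starting point when we get there:
yarn jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.2.1.jar pi 2 10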
What remains to be done:
- Supervisord needs to control the procs
- Ansible needs to deploy the clusters
- Test MapReduce
- Make our cluster resilient