
Hadoop How To Add a New Datanode

Adding a new datanode to a Hadoop cluster is easy. Once you have done it a few times, it will seem even easier. This is a pretty standard operation.

Here are the steps:

  1. Create an includes file in the conf directory. Add the IP addresses of the existing data nodes and the new one to this file.

     vi ${HADOOP_HOME}/conf/includes
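
     The includes file is just one host per line. For example (these addresses are made up):

     10.0.0.101
     10.0.0.102
     10.0.0.103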
  2. Enable the includes file in hdfs-site.xml. Note that Hadoop does not expand shell environment variables such as ${HADOOP_HOME} inside its XML configuration files, so use the actual absolute path to the includes file as the value.

     <property>
       <name>dfs.hosts</name>
       <value>${HADOOP_HOME}/conf/includes</value>
       <final>true</final>
     </property>
  3. Enable the includes file in mapred-site.xml, again using the actual absolute path as the value.

     <property>
       <name>mapred.hosts</name>
       <value>${HADOOP_HOME}/conf/includes</value>
     </property>
  4. Refresh the node list on the NameNode so that it re-reads the includes file. No restart is needed.

     ${HADOOP_HOME}/bin/hadoop dfsadmin -refreshNodes
  5. Refresh the node list on the JobTracker.

     ${HADOOP_HOME}/bin/hadoop mradmin -refreshNodes
  6. Start the datanode and tasktracker daemons on the new host.

     cd ${HADOOP_HOME}
     bin/hadoop-daemon.sh start datanode 
     bin/hadoop-daemon.sh start tasktracker
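
     At this point you can sanity check the new node. Running jps on the new host should list the DataNode and TaskTracker processes, and a report from the NameNode should now include the new datanode:

     jps
     ${HADOOP_HOME}/bin/hadoop dfsadmin -report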
  7. Run the start balancer script.

     ${HADOOP_HOME}/bin/start-balancer.sh
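
The balancer moves blocks around until each datanode's utilization is within a threshold (a percentage, 10 by default) of the cluster average. You can pass a lower threshold if you want a tighter balance, and a long-running balance can be stopped early with stop-balancer.sh:

     ${HADOOP_HOME}/bin/start-balancer.sh -threshold 5
     ${HADOOP_HOME}/bin/stop-balancer.sh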

More - Adding a New Hadoop Datanode

Adding an additional data node is a pretty standard thing to do when working with a Hadoop cluster. Once you get used to doing it, you won’t need to think about it much, and you will probably end up automating the process (a rough sketch of that follows below). Generally you should be able to scale out as far as you need without a huge amount of effort. You could even go as far as scaling out automatically based on system load.
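
As a sketch of what that automation might look like, the whole procedure above boils down to a short shell script. This is only an illustration, not a tested tool: the script name, the includes path, the passwordless SSH to the new host, and an identical HADOOP_HOME layout on the new machine are all assumptions you would adapt to your own cluster.

     #!/bin/bash
     # add-datanode.sh -- hypothetical helper that adds one host to the cluster.
     # Usage: ./add-datanode.sh <new-host-ip>
     set -e

     NEW_HOST="$1"
     INCLUDES="${HADOOP_HOME}/conf/includes"

     # Add the new host to the includes file if it is not already there.
     grep -qx "${NEW_HOST}" "${INCLUDES}" || echo "${NEW_HOST}" >> "${INCLUDES}"

     # Tell the NameNode and JobTracker to re-read the includes file.
     "${HADOOP_HOME}/bin/hadoop" dfsadmin -refreshNodes
     "${HADOOP_HOME}/bin/hadoop" mradmin -refreshNodes

     # Start the daemons on the new host (assumes passwordless SSH and the
     # same HADOOP_HOME layout on the new machine).
     ssh "${NEW_HOST}" "${HADOOP_HOME}/bin/hadoop-daemon.sh start datanode"
     ssh "${NEW_HOST}" "${HADOOP_HOME}/bin/hadoop-daemon.sh start tasktracker"

     # Rebalance blocks across the cluster.
     "${HADOOP_HOME}/bin/start-balancer.sh"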

This is just one of many things that you will do to maintain a Hadoop cluster. You will probably want to test this out a lot in a development environment before attempting it in production. That goes for pretty much anything else you might do as well. I generally create virtualized or containerized environments to run my development Hadoop clusters.
