Hadoop Setup
Apache Hadoop 3.2.1
Modules and related projects:
- Hadoop Common
- Hadoop Distributed File System (HDFS)
- Hadoop YARN
- Hadoop MapReduce
- Hadoop Ozone
- Hadoop Submarine - machine learning engine
- Mahout - machine learning and data mining library
install jdk
NOTE - use a supported version of Java. Our example just uses the latest. Don’t do this in production.
tar xvfz jdk-13.0.1_linux-x64_bin.tar.gz
sudo mv jdk-13.0.1 /opt/
cd /opt
sudo ln -s jdk-13.0.1 java
For a supported setup, use a Java 8 JDK instead (Hadoop 3.0 through 3.2 is only tested against Java 8), e.g. a tarball like:
jdk-8uNNN-linux-x64.tar.gz
Java install dir: /opt/java
Any time you install a newer version, you can just update the link and you won’t need to update your env variables or anything else that points to your JDK.
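For example, repointing the link after installing a newer JDK (version below is hypothetical, just to show the idea):

cd /opt
sudo ln -sfn jdk-14.0.2 java    # replace the old symlink; JAVA_HOME stays /opt/java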
All users need this:
vi .bashrc
JAVA_HOME=/opt/java
export JAVA_HOME
PATH=$JAVA_HOME/bin:$PATH
export PATH
Confirm it is working:
which java
which javac
sudo apt-get install ssh
sudo apt-get install pdsh
Download and unpack hadoop
https://hadoop.apache.org/releases.html
tar xvfz hadoop-3.2.1.tar.gz
sudo mv hadoop-3.2.1 /opt/
cd /opt
sudo ln -s hadoop-3.2.1 hadoop
All Users:
vi .bashrc

export HADOOP_HOME=/opt/hadoop
export PATH=$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$PATH
Set JAVA_HOME to the root of your Java installation:
vi $HADOOP_HOME/etc/hadoop/hadoop-env.sh
export JAVA_HOME=/opt/java
Test the hadoop command:

hadoop
You can run hadoop in these modes:
- Local (Standalone) Mode
- Pseudo-Distributed Mode
- Fully-Distributed Mode
Standalone Operation
mkdir input
cp $HADOOP_HOME/etc/hadoop/*.xml input
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.2.1.jar grep input output 'dfs[a-z.]+'
cat output/*
Pseudo-Distributed Operation
vi $HADOOP_HOME/etc/hadoop/core-site.xml
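A minimal core-site.xml for pseudo-distributed mode, as in the Apache single-node guide (single NameNode on localhost:9000):

<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>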
vi $HADOOP_HOME/etc/hadoop/hdfs-site.xml
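For hdfs-site.xml, set the replication factor to 1, since there is only one DataNode (again following the Apache single-node guide):

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>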
JAVA_HOME in all sh files:

echo "export JAVA_HOME=/opt/java" >> $HADOOP_HOME/etc/hadoop/hadoop-env.sh
echo "export JAVA_HOME=/opt/java" >> $HADOOP_HOME/etc/hadoop/yarn-env.sh
echo "export JAVA_HOME=/opt/java" >> $HADOOP_HOME/etc/hadoop/mapred-env.sh
make sure passwordless ssh works
ssh localhost
setup if needed:
ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 0600 ~/.ssh/authorized_keys
Format the filesystem:

hdfs namenode -format
Start NameNode daemon and DataNode daemon:

start-dfs.sh
Logs:
log output is written to the $HADOOP_LOG_DIR directory (defaults to $HADOOP_HOME/logs).
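To watch the NameNode log while it starts up (the exact file name includes your user and hostname, so a glob is used here):

tail -f $HADOOP_HOME/logs/hadoop-*-namenode-*.log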
NameNode - http://localhost:9870/
Create dirs in HDFS for mapreduce usage
hdfs dfs -mkdir /user
hdfs dfs -mkdir /user/user1
Copy input files into HDFS
hdfs dfs -mkdir input
hdfs dfs -put $HADOOP_HOME/etc/hadoop/*.xml input
Run some examples:
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.2.1.jar grep input output 'dfs[a-z.]+'
View the files on HDFS
hdfs dfs -cat output/*
Pull down the output files to the local FS and view them
hdfs dfs -get output output
cat output/*
Stop the daemons:
stop-dfs.sh
YARN on a Single Node
echo "export JAVA_HOME=/opt/java" >> $HADOOP_HOME/etc/hadoop/mapred-env.sh
vi $HADOOP_HOME/etc/hadoop/mapred-site.xml
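For a single-node YARN setup, the Apache guide uses a mapred-site.xml along these lines (run MapReduce on YARN and tell tasks where the MapReduce jars live):

<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
  <property>
    <name>mapreduce.application.classpath</name>
    <value>$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/*:$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/lib/*</value>
  </property>
</configuration>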
vi $HADOOP_HOME/etc/hadoop/yarn-site.xml
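And a yarn-site.xml that enables the MapReduce shuffle service (the Apache guide also whitelists a set of environment variables for containers; check the docs for that full list):

<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
</configuration>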
Start ResourceManager daemon and NodeManager daemon:
start-yarn.sh
ResourceManager - http://localhost:8088/
Hadoop Cluster Setup
Masters (each gets a dedicated host):
- NameNode
- SecondaryNameNode
- ResourceManager
Other services (dedicated or shared host):
- WebAppProxy
- MapReduce Job History server
Workers:
- DataNode
- NodeManager
Users:
- hdfs
- yarn
- mapred
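One way to create these service accounts on an Ubuntu-style system (the useradd flags are just a reasonable choice, not from the Hadoop docs):

sudo useradd -m -s /bin/bash hdfs
sudo useradd -m -s /bin/bash yarn
sudo useradd -m -s /bin/bash mapred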
Read-only default configuration:
- core-default.xml
- hdfs-default.xml
- yarn-default.xml
- mapred-default.xml
Site-specific configuration:
- etc/hadoop/core-site.xml
- etc/hadoop/hdfs-site.xml
- etc/hadoop/yarn-site.xml
- etc/hadoop/mapred-site.xml
- etc/hadoop/hadoop-env.sh
- etc/hadoop/yarn-env.sh
- etc/hadoop/mapred-env.sh
JAVA_HOME in all sh files:

echo "export JAVA_HOME=/opt/java" >> $HADOOP_HOME/etc/hadoop/hadoop-env.sh
echo "export JAVA_HOME=/opt/java" >> $HADOOP_HOME/etc/hadoop/yarn-env.sh
echo "export JAVA_HOME=/opt/java" >> $HADOOP_HOME/etc/hadoop/mapred-env.sh
Daemon                         Environment Variable
NameNode                       HDFS_NAMENODE_OPTS
DataNode                       HDFS_DATANODE_OPTS
Secondary NameNode             HDFS_SECONDARYNAMENODE_OPTS
ResourceManager                YARN_RESOURCEMANAGER_OPTS
NodeManager                    YARN_NODEMANAGER_OPTS
WebAppProxy                    YARN_PROXYSERVER_OPTS
MapReduce Job History Server   MAPRED_HISTORYSERVER_OPTS
export HDFS_NAMENODE_OPTS="-XX:+UseParallelGC -Xmx4g"
HADOOP_PID_DIR - The directory where the daemons' process id files are stored.
HADOOP_LOG_DIR - The directory where the daemons' log files are stored. Log files are automatically created if they don't exist.
HADOOP_HEAPSIZE_MAX - The maximum heap size for a daemon's JVM; by default Hadoop lets the JVM decide, and a unit can be given (e.g. 4g).
/etc/profile.d:
HADOOP_HOME=/path/to/hadoop
export HADOOP_HOME
All worker hostnames go here (used by the helper scripts, not by the Java daemons):
$HADOOP_HOME/etc/hadoop/workers
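The format is one hostname per line; for example, with three hypothetical worker hosts:

worker1.example.com
worker2.example.com
worker3.example.com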
Configure logging:
$HADOOP_HOME/etc/hadoop/log4j.properties
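For example, Hadoop 3.2.x uses log4j 1.x properties syntax, so a standard per-package override like the one below can be appended (the package chosen here is just an illustration):

log4j.logger.org.apache.hadoop.hdfs=DEBUG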
HDFS normally runs as the hdfs user.
YARN normally runs as the yarn user.
The MapReduce JobHistory Server runs as the mapred user.
[hdfs]$ $HADOOP_HOME/bin/hdfs namenode -format
[hdfs]$ $HADOOP_HOME/bin/hdfs --daemon start namenode
[hdfs]$ $HADOOP_HOME/bin/hdfs --daemon start datanode
[hdfs]$ $HADOOP_HOME/sbin/start-dfs.sh
[yarn]$ $HADOOP_HOME/bin/yarn --daemon start resourcemanager
[yarn]$ $HADOOP_HOME/bin/yarn --daemon start nodemanager
[yarn]$ $HADOOP_HOME/bin/yarn --daemon start proxyserver
[yarn]$ $HADOOP_HOME/sbin/start-yarn.sh
[mapred]$ $HADOOP_HOME/bin/mapred --daemon start historyserver
[hdfs]$ $HADOOP_HOME/bin/hdfs --daemon stop namenode
[hdfs]$ $HADOOP_HOME/bin/hdfs --daemon stop datanode
[hdfs]$ $HADOOP_HOME/sbin/stop-dfs.sh
[yarn]$ $HADOOP_HOME/bin/yarn --daemon stop resourcemanager
[yarn]$ $HADOOP_HOME/bin/yarn --daemon stop nodemanager
[yarn]$ $HADOOP_HOME/sbin/stop-yarn.sh
[yarn]$ $HADOOP_HOME/bin/yarn --daemon stop proxyserver
[mapred]$ $HADOOP_HOME/bin/mapred --daemon stop historyserver
Daemon                        Web Interface           Notes
NameNode                      http://nn_host:port/    Default HTTP port is 9870.
ResourceManager               http://rm_host:port/    Default HTTP port is 8088.
MapReduce JobHistory Server   http://jhs_host:port/   Default HTTP port is 19888.
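Quick sanity check from the shell once the daemons are up, using the standard NameNode /jmx and ResourceManager REST endpoints (ports as in the table above; adjust hostnames for your cluster):

curl -s http://localhost:9870/jmx | head
curl -s http://localhost:8088/ws/v1/cluster/info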
vi $HADOOP_HOME/etc/hadoop/core-site.xml
vi $HADOOP_HOME/etc/hadoop/hdfs-site.xml
vi $HADOOP_HOME/etc/hadoop/mapred-site.xml
vi $HADOOP_HOME/etc/hadoop/yarn-site.xml
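As a sketch only (hostnames and paths below are hypothetical, and the real parameter list is much longer, see below), a cluster's core-site.xml and hdfs-site.xml typically start with at least:

<!-- core-site.xml -->
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://namenode.example.com:9000</value>
  </property>
</configuration>

<!-- hdfs-site.xml -->
<configuration>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>/data/hdfs/namenode</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>/data/hdfs/datanode</value>
  </property>
</configuration>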
---------- Big spreadsheet of configuration parameters omitted ----------
Failed to retrieve data from /webhdfs/v1/?op=LISTSTATUS: Server Error
This happens because the java.activation module was deprecated in Java 9 and removed in Java 11, so WebHDFS breaks on newer JDKs. If you are seeing this error, you are probably running an unsupported Java version. The workaround below only helps on JDKs that still ship the module; I wasn't able to fix this in Java 13.
To fix it, add this to your hadoop-env.sh:
export HADOOP_OPTS="--add-modules java.activation"