Hadoop is an open-source distributed computing framework for processing large-scale data. The following are detailed steps for installing Hadoop on a Linux system in single-node (pseudo-distributed) mode:
## 1. Environment Preparation

### Install Java

Hadoop requires a Java runtime; OpenJDK 8 or 11 is recommended:
```bash
# Ubuntu/Debian
sudo apt update
sudo apt install openjdk-8-jdk

# CentOS/RHEL
sudo yum install java-1.8.0-openjdk-devel

# Verify the installation
java -version
```
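If you want to script the version check, a small helper can extract the major version from `java -version`-style output, which uses the legacy `1.8.x` scheme for Java 8 but plain `11.x` for Java 11. The function below is my own sketch, not part of Hadoop or the JDK:

```shell
#!/bin/sh
# Hypothetical helper: extract the Java major version from a
# `java -version`-style line such as:
#   openjdk version "1.8.0_392"   (legacy scheme -> 8)
#   openjdk version "11.0.21"     (modern scheme -> 11)
java_major() {
  v=${1#*\"}        # drop everything up to the first quote
  v=${v%%\"*}       # drop the closing quote and the rest
  case $v in
    1.*) v=${v#1.}; echo "${v%%.*}" ;;  # "1.8.0_392" -> 8
    *)   echo "${v%%.*}" ;;             # "11.0.21"   -> 11
  esac
}

java_major 'openjdk version "1.8.0_392"'   # prints: 8
java_major 'openjdk version "11.0.21"'     # prints: 11
```

Against a live system you would feed it the real first line, e.g. `java_major "$(java -version 2>&1 | head -n1)"`.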
### Configure Passwordless SSH
```bash
# Generate a key pair
ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa

# Add the public key to the authorized list
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 0600 ~/.ssh/authorized_keys

# Test the SSH connection
ssh localhost
```
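The `chmod 0600` matters: `sshd` silently ignores an `authorized_keys` file whose permissions are too open. A tiny sketch (my own helper, assuming GNU `stat` as found on most Linux distributions) that checks a file's permission bits, demonstrated on a throwaway file rather than the real key:

```shell
#!/bin/sh
# Hypothetical helper: check that a file has exactly the given
# octal permission bits (uses GNU stat's -c option).
perm_is() {
  [ "$(stat -c %a "$1")" = "$2" ]
}

# Demonstrate on a temporary file:
tmp=$(mktemp)
chmod 600 "$tmp"
perm_is "$tmp" 600 && echo "permissions ok"   # prints: permissions ok
rm -f "$tmp"
```

Against the real file this would be `perm_is ~/.ssh/authorized_keys 600`.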
## 2. Download and Extract Hadoop
```bash
# Download Hadoop (version 3.3.6 as an example)
wget https://dlcdn.apache.org/hadoop/common/hadoop-3.3.6/hadoop-3.3.6.tar.gz

# Extract the archive (writing to /opt usually requires root,
# then hand ownership to your user so later steps work without sudo)
sudo tar -xzvf hadoop-3.3.6.tar.gz -C /opt/
sudo chown -R "$USER":"$USER" /opt/hadoop-3.3.6

# Create a symlink (optional; simplifies version management)
sudo ln -s /opt/hadoop-3.3.6 /opt/hadoop

# Add environment variables
echo '
export HADOOP_HOME=/opt/hadoop
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
export JAVA_HOME=$(readlink -f /usr/bin/java | sed "s:bin/java::")
' >> ~/.bashrc

# Reload the shell configuration
source ~/.bashrc
```
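The `PATH` additions above assume `HADOOP_HOME` contains the usual `bin/` and `sbin/` entry points. A quick structural sanity check (a hypothetical helper of mine, demonstrated on a mock directory so it runs anywhere):

```shell
#!/bin/sh
# Hypothetical sanity check: does a candidate HADOOP_HOME contain
# the entry points that the PATH additions rely on?
hadoop_home_ok() {
  [ -e "$1/bin/hadoop" ] && [ -e "$1/sbin/start-dfs.sh" ]
}

# Demonstrate on a mock layout (a real run would pass "$HADOOP_HOME"):
mock=$(mktemp -d)
mkdir -p "$mock/bin" "$mock/sbin"
touch "$mock/bin/hadoop" "$mock/sbin/start-dfs.sh"
hadoop_home_ok "$mock" && echo "layout ok"   # prints: layout ok
rm -rf "$mock"
```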
## 3. Configure Hadoop

Change to the Hadoop configuration directory:

```bash
cd /opt/hadoop/etc/hadoop
```
### Edit core-site.xml

```xml
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
```
### Edit hdfs-site.xml

```xml
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:/opt/hadoop/data/namenode</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:/opt/hadoop/data/datanode</value>
  </property>
</configuration>
```
### Edit mapred-site.xml

```xml
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>
```
### Edit yarn-site.xml

```xml
<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <property>
    <name>yarn.nodemanager.env-whitelist</name>
    <value>JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,CLASSPATH_PREPEND_DISTCACHE,HADOOP_YARN_HOME,HADOOP_MAPRED_HOME</value>
  </property>
</configuration>
```
### Configure the Java Path

Edit `hadoop-env.sh` and add:

```bash
export JAVA_HOME=$(readlink -f /usr/bin/java | sed "s:bin/java::")
```
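The `readlink | sed` idiom resolves the `java` symlink to its real location and strips the trailing `bin/java`, leaving the JDK root. The transformation is easy to see on a sample path (the JVM path below is illustrative, not a real lookup):

```shell
#!/bin/sh
# Illustrative only: show what the sed expression does to a typical
# resolved java path. ':' is used as the sed delimiter so the
# slashes in the path need no escaping.
sample=/usr/lib/jvm/java-8-openjdk-amd64/bin/java
echo "$sample" | sed "s:bin/java::"
# prints: /usr/lib/jvm/java-8-openjdk-amd64/
```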
## 4. Initialize HDFS

```bash
# Create the data directories
mkdir -p /opt/hadoop/data/namenode
mkdir -p /opt/hadoop/data/datanode

# Format the NameNode
hdfs namenode -format
```
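A successful format writes a `current/VERSION` file with the cluster metadata under the name directory. A small check (my own helper, demonstrated on a mock directory so it runs without a cluster):

```shell
#!/bin/sh
# Hypothetical check: a freshly formatted NameNode directory
# contains current/VERSION.
namenode_formatted() {
  [ -f "$1/current/VERSION" ]
}

# Demonstrate on a mock directory (a real run would pass
# /opt/hadoop/data/namenode):
mock=$(mktemp -d)
mkdir -p "$mock/current" && touch "$mock/current/VERSION"
namenode_formatted "$mock" && echo "formatted"   # prints: formatted
rm -rf "$mock"
```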
## 5. Start Hadoop

```bash
# Start HDFS
start-dfs.sh

# Start YARN
start-yarn.sh

# List the running Java processes
jps
```

If everything is working, you should see the following processes:

- NameNode
- DataNode
- SecondaryNameNode
- ResourceManager
- NodeManager
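Eyeballing `jps` output works, but the check can be scripted. The helper below is my own sketch: it takes `jps`-style output ("PID Name" lines) and reports any of the five expected daemons that are missing. It is demonstrated on a hard-coded sample listing; a real run would pass `"$(jps)"`:

```shell
#!/bin/sh
# Hypothetical helper: given `jps`-style output ("PID Name" lines),
# report any expected Hadoop daemon that is missing. Matching is
# anchored to the end of the line so "NameNode" does not falsely
# match "SecondaryNameNode".
check_daemons() {
  missing=0
  for d in NameNode DataNode SecondaryNameNode ResourceManager NodeManager; do
    if ! printf '%s\n' "$1" | grep -q " $d\$"; then
      echo "missing: $d"
      missing=1
    fi
  done
  return $missing
}

# Demonstrate on a sample listing:
sample='1234 NameNode
1240 DataNode
1250 SecondaryNameNode
1260 ResourceManager
1270 NodeManager
1280 Jps'
check_daemons "$sample" && echo "all daemons running"   # prints: all daemons running
```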
## 6. Verify the Installation

### Access the Web UIs

- NameNode status: http://localhost:9870
- ResourceManager status: http://localhost:8088
### Run a Sample Job

```bash
# Create a test directory
hdfs dfs -mkdir -p /user/hadoop

# Upload a file to HDFS
hdfs dfs -put /etc/profile /user/hadoop/

# Run the WordCount example
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.3.6.jar wordcount /user/hadoop/profile /user/hadoop/output

# View the results
hdfs dfs -cat /user/hadoop/output/part-r-00000
```
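WordCount emits one `word<TAB>count` line per distinct word. The same shape can be reproduced locally on a tiny sample with standard tools, which is a handy way to know what to expect before reading the HDFS output. This is illustrative only; the real job runs over the file uploaded to HDFS:

```shell
#!/bin/sh
# Mimic the WordCount output format (word<TAB>count) locally:
# split on spaces, sort, count duplicates, then swap the columns
# into word-first order.
printf 'hello world\nhello hadoop\n' \
  | tr -s ' ' '\n' \
  | sort \
  | uniq -c \
  | awk '{printf "%s\t%s\n", $2, $1}'
# prints:
# hadoop	1
# hello	2
# world	1
```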
## 7. Stop Hadoop

```bash
stop-yarn.sh
stop-dfs.sh
```