Benchmarking Spark Streaming Locally with HiBench

This post is written in memory of three bittersweet days of ups and downs.


Along the way I ran into countless errors. Benchmarking is a fairly niche topic, and no matter what I tried, searching Google, Baidu, Bing, and the other well-known search engines turned up nothing. Browsing the HiBench GitHub issues, I even found that the very problems I was hitting had been sitting there unanswered for months, and the email I later sent asking for help likewise sank without a trace. Fortunately I had the occasional flash of insight, and after round upon round of trial and error I stumbled onto the right fixes, worked through every problem, cleared every error, and got the benchmark running. This post records how.


Software installation

After downloading and extracting each package to a directory of your choice, set up the paths:

$ gedit ~/.bashrc

Add:

export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk-amd64
export SCALA_HOME=/path/to/scala-2.11.8
export PATH=$PATH:$SCALA_HOME/bin
export HADOOP_HOME=/path/to/hadoop-2.8.3
export SPARK_HOME=/path/to/spark-2.2.2
export PATH=$PATH:$SPARK_HOME/bin
export ZOOKEEPER_HOME=/path/to/zookeeper-3.4.10
export KAFKA_HOME=/path/to/kafka_2.11-0.8.2.2

Reload so the changes take effect:

$ source ~/.bashrc
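
To check that everything took effect, a quick version check of each tool helps (note that only the Scala and Spark bin directories were added to PATH above, so Hadoop is invoked via $HADOOP_HOME):

$ java -version                    # expect 1.8
$ scala -version                   # expect 2.11.8
$ $HADOOP_HOME/bin/hadoop version  # expect 2.8.3
$ spark-submit --version           # expect 2.2.2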


Software configuration

Hadoop

Enter the directory:

$ cd $HADOOP_HOME

Single node

Configure hadoop-env.sh:

$ gedit etc/hadoop/hadoop-env.sh

Change JAVA_HOME to the correct directory:

# set to the root of your Java installation
export JAVA_HOME=/path/to/your/java

Configure core-site.xml:

$ gedit etc/hadoop/core-site.xml

Change it to:

<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>

Configure hdfs-site.xml:

$ gedit etc/hadoop/hdfs-site.xml

Change it to:

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>

Make sure ssh to localhost works without a password:

$ ssh localhost

If it prompts for a password, run:

$ ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
$ chmod 0600 ~/.ssh/authorized_keys

Format the NameNode and start HDFS:

$ bin/hdfs namenode -format
$ sbin/start-dfs.sh
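
To confirm HDFS is up, jps (shipped with the JDK) should list the daemons; the NameNode web UI on port 50070 (the Hadoop 2.x default) is another quick check:

$ jps   # expect NameNode, DataNode, and SecondaryNameNode in the output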

Running on YARN

(This step is required. I skipped it at first, and HiBench kept failing to run correctly as a result.)

Configure mapred-site.xml:

$ gedit etc/hadoop/mapred-site.xml

Change it to:

<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>

Configure yarn-site.xml:

$ gedit etc/hadoop/yarn-site.xml

Change it to:

<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
</configuration>

Start YARN:

$ sbin/start-yarn.sh
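
jps works as a sanity check here too; the ResourceManager web UI on port 8088 should also respond:

$ jps   # expect ResourceManager and NodeManager in addition to the HDFS daemons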

ZooKeeper

Enter the directory:

$ cd $ZOOKEEPER_HOME

Create the configuration file:

$ cp conf/zoo_sample.cfg conf/zoo.cfg

Create a data directory:

$ mkdir /path/to/zookeeper-data

Point the configuration at the data directory:

$ gedit conf/zoo.cfg

Change the dataDir entry to:

dataDir=/path/to/zookeeper-data

Start it:

$ bin/zkServer.sh start
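
To verify ZooKeeper is serving, ask for its status, or send the standard ruok four-letter command (the second check assumes netcat is installed):

$ bin/zkServer.sh status          # should report "Mode: standalone"
$ echo ruok | nc localhost 2181   # should answer "imok"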

Kafka

Enter the directory:

$ cd $KAFKA_HOME

Create a log directory:

$ mkdir /path/to/kafka-logs

Configure server.properties:

$ gedit config/server.properties

Change the following entry:

log.dirs=/path/to/kafka-logs

Run it

Note: keep this terminal open; the broker runs in the foreground.

$ bin/kafka-server-start.sh config/server.properties
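
As a smoke test, open a second terminal in $KAFKA_HOME and create and list a throwaway topic (the topic name here is just an example; in this 0.8.x release the tools talk to ZooKeeper directly):

$ bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic smoke-test
$ bin/kafka-topics.sh --list --zookeeper localhost:2181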


HiBench

Build HiBench

From the HiBench source directory, run:

$ mvn -Phadoopbench -Psparkbench -Dspark=2.2 -Dscala=2.11 clean package

This takes quite a while.
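
If you only ever need the Spark workloads, the build docs [1] also describe building a single framework, which is faster; but since the streaming prepare step below runs a Hadoop job, this walkthrough builds both profiles as shown above:

$ mvn -Psparkbench -Dspark=2.2 -Dscala=2.11 clean package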


Hadoop bench

Configure hadoop.conf:

$ cp conf/hadoop.conf.template conf/hadoop.conf
$ gedit conf/hadoop.conf

Change:

hibench.hadoop.home /path/to/hadoop-2.8.3
hibench.hdfs.master hdfs://localhost:9000

Try running a Hadoop workload:

$ bin/workloads/micro/wordcount/prepare/prepare.sh
$ bin/workloads/micro/wordcount/hadoop/run.sh

View the report:

$ more report/hibench.report

More detailed data is under:
report/wordcount/hadoop/bench.log
report/wordcount/hadoop/monitor.html
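
For orientation, hibench.report is a plain whitespace-separated summary with one row per run; its columns look roughly like this (approximate; check your own file):

Type  Date  Time  Input_data_size  Duration(s)  Throughput(bytes/s)  Throughput/node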


Spark bench

Configure spark.conf:

$ cp conf/spark.conf.template conf/spark.conf
$ gedit conf/spark.conf

Change:

hibench.spark.home /path/to/spark-2.2.2
hibench.spark.master local[*]

Try running a Spark workload:

$ bin/workloads/micro/wordcount/prepare/prepare.sh
$ bin/workloads/micro/wordcount/spark/run.sh

View the report:

$ more report/hibench.report

Spark Streaming bench

Configure the Kafka settings:

$ gedit conf/hibench.conf

Change:

hibench.streambench.kafka.home /path/to/kafka_2.11-0.8.2.2
# zookeeper host:port of kafka cluster, host1:port1,host2:port2...
hibench.streambench.zkHost localhost:2181
# Kafka broker lists, written in mode host:port,host:port,..
hibench.streambench.kafka.brokerList localhost:9092

Run

Data generation

$ bin/workloads/streaming/identity/prepare/genSeedDataset.sh

This step runs a Hadoop job, so it will consume resources for a while.

$ bin/workloads/streaming/identity/prepare/dataGen.sh

From here on, this process keeps sending data continuously; do not close it.
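
To confirm records are actually flowing into Kafka, list the topics and tail the one the generator writes to (the topic name is created by HiBench, so check the list first; run these from $KAFKA_HOME):

$ bin/kafka-topics.sh --list --zookeeper localhost:2181
$ bin/kafka-console-consumer.sh --zookeeper localhost:2181 --topic <topic-from-the-list>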

Run the test

Tip: make the terminal window taller so the Spark WebUI URL is easy to find in the output.

$ bin/workloads/streaming/identity/spark/run.sh

Open http://<ip>:4040 in a browser to watch the load in real time.

Generate the report

$ bin/workloads/streaming/identity/common/metrics_reader.sh

Tuning the workload parameters

Changing the data volume

$ gedit conf/hibench.conf
hibench.streambench.datagen.intervalSpan       Interval span in milliseconds (default: 50)
hibench.streambench.datagen.recordsPerInterval Number of records to generate per interval span (default: 5)
hibench.streambench.datagen.recordLength       Fixed length of each record in bytes (default: 200)
hibench.streambench.datagen.producerNumber     Number of KafkaProducer instances running on different threads (default: 1)
hibench.streambench.datagen.totalRounds        Total rounds of data to send (default: -1, meaning infinite)
hibench.streambench.datagen.totalRecords       Total number of records to generate (default: -1, meaning infinite)
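
Taken together, the generated load is roughly recordsPerInterval × (1000 / intervalSpan) × producerNumber records per second: the defaults give 5 × (1000/50) × 1 = 100 records/s, about 20 KB/s at 200 bytes per record. An illustrative (not tuned) heavier setting would be:

hibench.streambench.datagen.intervalSpan       50
hibench.streambench.datagen.recordsPerInterval 500
hibench.streambench.datagen.recordLength       200

which works out to 10,000 records/s, around 2 MB/s.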

Changing the Spark configuration

$ gedit conf/spark.conf
# Spark Streaming batch interval in milliseconds (default: 100)
hibench.streambench.spark.batchInterval 800
# Number of nodes that will receive Kafka input (default: 4)
hibench.streambench.spark.receiverNumber 4
# RDD storage level (default: 2)
# 0 = StorageLevel.MEMORY_ONLY
# 1 = StorageLevel.MEMORY_AND_DISK_SER
# other = StorageLevel.MEMORY_AND_DISK_SER_2
hibench.streambench.spark.storageLevel 2
# Whether to test the write-ahead-log feature (default: false)
hibench.streambench.spark.enableWAL false
# If enableWAL is true, an HDFS path to store the streaming context must be specified; otherwise it can be empty (default: /var/tmp)
hibench.streambench.spark.checkpointPath /var/tmp
# Whether to use the direct approach instead of receivers (default: true)
hibench.streambench.spark.useDirectMode true

A friendly reminder

To keep the disk from filling up, clean out the data files Kafka and ZooKeeper produce from time to time; a sketch follows.
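
A minimal cleanup sketch, assuming the directories are the ones configured earlier (stop the services first so nothing is still writing):

$ $KAFKA_HOME/bin/kafka-server-stop.sh
$ $ZOOKEEPER_HOME/bin/zkServer.sh stop
$ rm -rf /path/to/kafka-logs/*
$ rm -rf /path/to/zookeeper-data/version-2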


References

  1. https://github.com/intel-hadoop/HiBench/blob/master/docs/build-hibench.md
  2. https://github.com/intel-hadoop/HiBench/blob/master/docs/run-sparkbench.md
  3. https://github.com/intel-hadoop/HiBench/blob/master/docs/run-streamingbench.md
  4. http://hadoop.apache.org/docs/r2.8.3/hadoop-project-dist/hadoop-common/SingleCluster.html
  5. https://zookeeper.apache.org/doc/current/zookeeperStarted.html
  6. https://kafka.apache.org/quickstart