Accumulators
How it works
Accumulators are used to aggregate values from the Executor side back to the Driver side. For a variable defined in the Driver program, every Task on the Executor side receives a fresh copy of it; after each Task updates its copy, the results are sent back to the Driver and merged there.
val rdd = sparkContext.makeRDD(List(1, 2, 3, 4, 5))
// Declare the accumulator
val sum = sparkContext.longAccumulator("sum")
rdd.foreach(
  num => {
    // Update the accumulator on the Executor side
    sum.add(num)
  }
)
// Read the merged value on the Driver side
println("sum = " + sum.value)
Implementing word count with a custom accumulator.
Create the custom accumulator:
import org.apache.spark.util.AccumulatorV2
import scala.collection.mutable

class WordCountAccumulator extends AccumulatorV2[String, mutable.Map[String, Long]] {
  var map: mutable.Map[String, Long] = mutable.Map()

  // The accumulator is "zero" when no word has been counted yet
  override def isZero: Boolean = map.isEmpty

  override def copy(): AccumulatorV2[String, mutable.Map[String, Long]] = new WordCountAccumulator

  override def reset(): Unit = map.clear()

  // Called on the Executor side: count one occurrence of the word
  override def add(v: String): Unit = {
    map(v) = map.getOrElse(v, 0L) + 1L
  }

  // Called on the Driver side: fold this copy's counts together with another copy's
  override def merge(other: AccumulatorV2[String, mutable.Map[String, Long]]): Unit = {
    val map1 = map
    val map2 = other.value
    map = map1.foldLeft(map2)(
      (innerMap, kv) => {
        innerMap(kv._1) = innerMap.getOrElse(kv._1, 0L) + kv._2
        innerMap
      }
    )
  }

  override def value: mutable.Map[String, Long] = map
}
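Spark calls merge on the Driver to combine the per-task copies of the accumulator. A minimal driver-side sketch (not part of the original text) of what that combination looks like:
val a = new WordCountAccumulator
val b = new WordCountAccumulator
a.add("spark")
b.add("spark")
b.add("hadoop")
// Fold b's counts into a, just as Spark does with the copies returned by each task
a.merge(b)
println(a.value) // Map(spark -> 2, hadoop -> 1), ordering may vary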
Use the custom accumulator:
val rdd = sparkContext.makeRDD(
  List("spark", "scala", "spark hadoop", "hadoop")
)
// Create and register the accumulator before using it in a job
val acc = new WordCountAccumulator
sparkContext.register(acc)
rdd.flatMap(_.split(" ")).foreach(
  word => acc.add(word)
)
// Read the aggregated word counts on the Driver
println(acc.value)
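With the sample input above, the printed map should contain spark -> 2, hadoop -> 2 and scala -> 1 (the ordering of a mutable.Map is not guaranteed).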
Broadcast Variables
How it works
Broadcast variables are used to distribute large objects efficiently: a large read-only value is shipped to every worker node once, to be used in one or more Spark operations. For example, if your application needs to send a large read-only lookup table to all nodes, a broadcast variable is a good fit. Without broadcasting, when the same variable is referenced in multiple parallel operations, Spark ships a separate copy of it with every task.
import org.apache.spark.broadcast.Broadcast
import org.apache.spark.rdd.RDD

val rdd1 = sparkContext.makeRDD(List(("a", 1), ("b", 2), ("c", 3), ("d", 4)), 4)
val list = List(("a", 4), ("b", 5), ("c", 6), ("d", 7))
// Ship the lookup list to every worker node once
val broadcast: Broadcast[List[(String, Int)]] = sparkContext.broadcast(list)
val resultRDD: RDD[(String, (Int, Int))] = rdd1.map {
  case (key, num) => {
    var num2 = 0
    // Look the key up in the broadcast value instead of joining two RDDs
    for ((k, v) <- broadcast.value) {
      if (k == key) {
        num2 = v
      }
    }
    (key, (num, num2))
  }
}
resultRDD.collect().foreach(println)
sparkContext.stop()
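With the data above, each key from rdd1 is paired with its own value and the matching value from the broadcast list, so the program should print the pairs (a,(1,4)), (b,(2,5)), (c,(3,6)) and (d,(4,7)).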