
Lesson 88: Pulling Data from Flume with Spark Streaming, a Hands-On Case Study and Source Code Walkthrough

Published: 2025-01-23. Author: 千家信息网 editorial staff

千家信息网 last updated: 2025-01-23


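This lesson has two parts:

Part 1: Spark Streaming pulling from Flume in practice

Part 2: Source code analysis of Spark Streaming pulling from Flume

First, a brief introduction to the two ways Flume can be integrated with Spark Streaming: push mode (Flume pushes to Spark Streaming) and pull mode (Spark Streaming pulls from Flume).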

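Push mode: Flume buffers the data and watches the configured port; whenever the receiving service is reachable, it pushes the data over. This is simple and keeps coupling low. The drawbacks are that the Flume agent reports errors if the Spark Streaming application is not running, and that data can arrive faster than Spark Streaming can consume it.

Pull mode: a custom sink is configured in Flume, and Spark Streaming fetches data from the channel through that sink at a rate that suits its own capacity, which makes this mode more stable.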
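The pull idea can be illustrated with a minimal sketch that uses only the JDK. This is not Flume or Spark code: the BlockingQueue merely stands in for the Flume channel, and PullModeSketch and pullBatch are hypothetical names chosen for this illustration. The consumer asks for at most maxBatchSize events when it is ready, and anything it does not take simply stays buffered.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical illustration of pull mode; none of these names come from Flume or Spark.
class PullModeSketch {

    // Take up to maxBatchSize events that are already buffered, without blocking.
    static List<String> pullBatch(BlockingQueue<String> channel, int maxBatchSize) {
        List<String> batch = new ArrayList<>();
        channel.drainTo(batch, maxBatchSize);
        return batch;
    }

    public static void main(String[] args) {
        // The queue plays the role of the Flume channel that buffers events.
        BlockingQueue<String> channel = new LinkedBlockingQueue<>();
        for (int i = 0; i < 5; i++) {
            channel.add("event-" + i);
        }
        // The consumer decides when to pull and how much to take.
        System.out.println(pullBatch(channel, 2));
        System.out.println(channel.size() + " events still buffered");
    }
}
```

In push mode the producer would call the consumer directly, so a slow or absent consumer becomes the producer's problem; here the buffer absorbs the difference.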

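Flume pull in practice:

Step 1: Install Flume. This is not covered again here; see Lesson 87 (Flume Pushing Data to Spark Streaming: Hands-On Case Study and Source Code Walkthrough).

Step 2: Add the sink libraries to Flume. Following the official guide (http://spark.apache.org/docs/latest/streaming-flume-integration.html), either add the dependencies or download the following three jars directly, and place them in the lib directory under the Flume installation directory:

spark-streaming-flume-sink_2.10-1.6.0.jar, scala-library-2.10.5.jar, commons-lang3-3.3.2.jar

Step 3: Configure the Flume agent. Copy flume-conf.properties.template to flume-conf.properties and edit the copy: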

# Flume pull mode
agent0.sources = source1
agent0.channels = memoryChannel
agent0.sinks = sink1

# Configure source1
agent0.sources.source1.type = spooldir
agent0.sources.source1.spoolDir = /home/hadoop/flume/tmp/TestDir
agent0.sources.source1.channels = memoryChannel
agent0.sources.source1.fileHeader = false
agent0.sources.source1.interceptors = il
agent0.sources.source1.interceptors.il.type = timestamp

# Configure sink1
agent0.sinks.sink1.type = org.apache.spark.streaming.flume.sink.SparkSink
agent0.sinks.sink1.hostname = SparkMaster
agent0.sinks.sink1.port = 9999
agent0.sinks.sink1.channel = memoryChannel

# Configure the channel (despite its name, memoryChannel is a file channel)
agent0.channels.memoryChannel.type = file
agent0.channels.memoryChannel.checkpointDir = /home/hadoop/flume/tmp/checkpoint
agent0.channels.memoryChannel.dataDirs = /home/hadoop/flume/tmp/dataDir
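An easy mistake in a file like this is a source or sink that names a channel the agent never declares. The following sketch, which assumes nothing beyond java.util.Properties, loads such a configuration and checks that wiring. FlumeConfCheck, parse, and channelsAreWired are hypothetical helpers written for this illustration; they are not part of Flume.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Arrays;
import java.util.List;
import java.util.Properties;

// Hypothetical helper, not part of Flume: sanity-checks the channel wiring of one agent.
class FlumeConfCheck {

    // Parse properties text such as the contents of flume-conf.properties.
    static Properties parse(String text) {
        Properties p = new Properties();
        try {
            p.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return p;
    }

    static boolean channelsAreWired(Properties p, String agent, String source, String sink) {
        // Declared channels are a whitespace-separated list.
        List<String> declared =
                Arrays.asList(p.getProperty(agent + ".channels", "").trim().split("\\s+"));
        String sourceChannels = p.getProperty(agent + ".sources." + source + ".channels", "").trim();
        String sinkChannel = p.getProperty(agent + ".sinks." + sink + ".channel", "").trim();
        // A source may list several channels; a sink reads from exactly one.
        return !sourceChannels.isEmpty()
                && !sinkChannel.isEmpty()
                && declared.containsAll(Arrays.asList(sourceChannels.split("\\s+")))
                && declared.contains(sinkChannel);
    }

    public static void main(String[] args) {
        String conf = "agent0.channels = memoryChannel\n"
                + "agent0.sources.source1.channels = memoryChannel\n"
                + "agent0.sinks.sink1.channel = memoryChannel\n";
        System.out.println(channelsAreWired(parse(conf), "agent0", "source1", "sink1"));
    }
}
```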

Command to start Flume:

root@SparkMaster:~/flume/flume-1.6.0/bin# ./flume-ng agent --conf ../conf/ --conf-file ../conf/flume-conf.properties --name agent0 -Dflume.root.logger=INFO,console

Or, from the Flume home directory:

root@SparkMaster:~/flume/flume-1.6.0# flume-ng agent --conf ./conf/ --conf-file ./conf/flume-conf.properties --name agent0 -Dflume.root.logger=INFO,console

Step 4: Write a simple business application (Java version)

package com.dt.spark.SparkApps.sparkstreaming;

import java.nio.charset.StandardCharsets;
import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.api.java.function.PairFunction;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaPairDStream;
import org.apache.spark.streaming.api.java.JavaReceiverInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.streaming.flume.FlumeUtils;
import org.apache.spark.streaming.flume.SparkFlumeEvent;

import scala.Tuple2;

public class SparkStreamingPullDataFromFlume {

    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setMaster("spark://SparkMaster:7077");
        conf.setAppName("SparkStreamingPullDataFromFlume");
        // One batch every 30 seconds
        JavaStreamingContext jsc = new JavaStreamingContext(conf, Durations.seconds(30));

        // Pull events from the SparkSink configured in the Flume agent
        JavaReceiverInputDStream<SparkFlumeEvent> lines =
                FlumeUtils.createPollingStream(jsc, "SparkMaster", 9999);

        // Split each event body into words
        JavaDStream<String> words = lines.flatMap(new FlatMapFunction<SparkFlumeEvent, String>() {
            @Override
            public Iterable<String> call(SparkFlumeEvent event) throws Exception {
                // getBody() returns a ByteBuffer, so decode its bytes instead of calling
                // toString(); the text is assumed to be UTF-8
                String line = new String(event.event().getBody().array(), StandardCharsets.UTF_8);
                return Arrays.asList(line.split(" "));
            }
        });

        // Map each word to a (key, value) pair
        JavaPairDStream<String, Integer> pairs = words.mapToPair(new PairFunction<String, String, Integer>() {
            @Override
            public Tuple2<String, Integer> call(String word) throws Exception {
                return new Tuple2<String, Integer>(word, 1);
            }
        });

        // reduceByKey: sum the values that share the same key
        JavaPairDStream<String, Integer> wordsCount = pairs.reduceByKey(new Function2<Integer, Integer, Integer>() {
            @Override
            public Integer call(Integer v1, Integer v2) throws Exception {
                return v1 + v2;
            }
        });

        wordsCount.print();

        jsc.start();
        jsc.awaitTermination();
        jsc.close();
    }
}

Package the program as a jar file and upload it to the Spark cluster.
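The heart of the job is decoding each event body and counting words. The sketch below reproduces that logic with plain JDK collections so it can be tried without a cluster. WordCountSketch and its methods are hypothetical names, and UTF-8 is an assumed encoding. It also shows why the body has to be decoded rather than converted with toString().

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical, cluster-free version of the word count logic.
class WordCountSketch {

    // Decode the remaining bytes of a buffer as UTF-8 text (the encoding is an assumption).
    static String decode(ByteBuffer body) {
        // duplicate() leaves the caller's buffer position untouched.
        return StandardCharsets.UTF_8.decode(body.duplicate()).toString();
    }

    // The same split / pair / reduce steps as the streaming job, applied to one line.
    static Map<String, Integer> countWords(String line) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String word : line.split(" ")) {
            counts.merge(word, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        ByteBuffer body = ByteBuffer.wrap("spark flume spark".getBytes(StandardCharsets.UTF_8));
        // ByteBuffer.toString() describes the buffer (position, limit, capacity), not its content.
        System.out.println(body.toString());
        System.out.println(countWords(decode(body)));
    }
}
```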

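Step 5: Start HDFS, the Spark cluster, and Flume.

Start Flume:

root@SparkMaster:~/flume/flume-1.6.0/bin# ./flume-ng agent --conf ../conf/ --conf-file ../conf/flume-conf.properties --name agent0 -Dflume.root.logger=INFO,console

Step 6: Copy test files into the /home/hadoop/flume/tmp/TestDir directory and watch the Flume log for changes.

Step 7: Run the program with spark-submit (the class name must include the sparkstreaming package declared in the source file):

./spark-submit --class com.dt.spark.SparkApps.sparkstreaming.SparkStreamingPullDataFromFlume --name SparkStreamingPullDataFromFlume /home/hadoop/spark/SparkStreamingPullDataFromFlume.jar

Check the output every 30 seconds, which is the batch interval set in the code.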

Part 2: Source code analysis

1. The entry point, createPollingStream (FlumeUtils.scala)

Note: the default storage level is MEMORY_AND_DISK_SER_2.

/**
 * Creates an input stream that is to be used with the Spark Sink deployed on a Flume agent.
 * This stream will poll the sink for data and will pull events as they are available.
 * This stream will use a batch size of 1000 events and run 5 threads to pull data.
 * @param hostname Address of the host on which the Spark Sink is running
 * @param port Port of the host at which the Spark Sink is listening
 * @param storageLevel Storage level to use for storing the received objects
 */
def createPollingStream(
    ssc: StreamingContext,
    hostname: String,
    port: Int,
    storageLevel: StorageLevel = StorageLevel.MEMORY_AND_DISK_SER_2
  ): ReceiverInputDStream[SparkFlumeEvent] = {
  createPollingStream(ssc, Seq(new InetSocketAddress(hostname, port)), storageLevel)
}

2. Default parameters: the polling parallelism and batch size are private constants of FlumeUtils, so callers of this overload cannot change them.

private val DEFAULT_POLLING_PARALLELISM = 5
private val DEFAULT_POLLING_BATCH_SIZE = 1000

/**
 * Creates an input stream that is to be used with the Spark Sink deployed on a Flume agent.
 * This stream will poll the sink for data and will pull events as they are available.
 * This stream will use a batch size of 1000 events and run 5 threads to pull data.
 * @param addresses List of InetSocketAddresses representing the hosts to connect to.
 * @param storageLevel Storage level to use for storing the received objects
 */
def createPollingStream(
    ssc: StreamingContext,
    addresses: Seq[InetSocketAddress],
    storageLevel: StorageLevel
  ): ReceiverInputDStream[SparkFlumeEvent] = {
  createPollingStream(ssc, addresses, storageLevel,
    DEFAULT_POLLING_BATCH_SIZE, DEFAULT_POLLING_PARALLELISM)
}

3. Creating the FlumePollingInputDStream object

/**
 * Creates an input stream that is to be used with the Spark Sink deployed on a Flume agent.
 * This stream will poll the sink for data and will pull events as they are available.
 * @param addresses List of InetSocketAddresses representing the hosts to connect to.
 * @param maxBatchSize Maximum number of events to be pulled from the Spark sink in a
 *                     single RPC call
 * @param parallelism Number of concurrent requests this stream should send to the sink. Note
 *                    that having a higher number of requests concurrently being pulled will
 *                    result in this stream using more threads
 * @param storageLevel Storage level to use for storing the received objects
 */
def createPollingStream(
    ssc: StreamingContext,
    addresses: Seq[InetSocketAddress],
    storageLevel: StorageLevel,
    maxBatchSize: Int,
    parallelism: Int
  ): ReceiverInputDStream[SparkFlumeEvent] = {
  new FlumePollingInputDStream[SparkFlumeEvent](ssc, addresses, maxBatchSize,
    parallelism, storageLevel)
}
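The createPollingStream overloads form a delegation chain: each shorter signature fills in a default and calls the next longer one. The same pattern can be sketched in plain Java. PollingConfig and createPolling are hypothetical names used only for this illustration; the default values 1000, 5, and MEMORY_AND_DISK_SER_2 are the ones quoted from FlumeUtils.scala.

```java
// Hypothetical stand-in for the stream that FlumeUtils builds; it only records the settings.
class PollingConfig {
    static final int DEFAULT_POLLING_BATCH_SIZE = 1000;
    static final int DEFAULT_POLLING_PARALLELISM = 5;
    static final String DEFAULT_STORAGE_LEVEL = "MEMORY_AND_DISK_SER_2";

    final String address;
    final String storageLevel;
    final int maxBatchSize;
    final int parallelism;

    private PollingConfig(String address, String storageLevel, int maxBatchSize, int parallelism) {
        this.address = address;
        this.storageLevel = storageLevel;
        this.maxBatchSize = maxBatchSize;
        this.parallelism = parallelism;
    }

    // Shortest form: host and port only; the storage level is defaulted.
    static PollingConfig createPolling(String hostname, int port) {
        return createPolling(hostname + ":" + port, DEFAULT_STORAGE_LEVEL);
    }

    // Middle form: batch size and parallelism are defaulted.
    static PollingConfig createPolling(String address, String storageLevel) {
        return createPolling(address, storageLevel,
                DEFAULT_POLLING_BATCH_SIZE, DEFAULT_POLLING_PARALLELISM);
    }

    // Full form: every setting is chosen by the caller.
    static PollingConfig createPolling(String address, String storageLevel,
                                       int maxBatchSize, int parallelism) {
        return new PollingConfig(address, storageLevel, maxBatchSize, parallelism);
    }

    public static void main(String[] args) {
        PollingConfig c = createPolling("SparkMaster", 9999);
        System.out.println(c.address + " " + c.storageLevel + " " + c.maxBatchSize + " " + c.parallelism);
    }
}
```

A caller who needs a different batch size or parallelism has to use the full form, since the defaults themselves are not configurable.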

4. FlumePollingInputDStream extends ReceiverInputDStream and overrides getReceiver, which creates a FlumePollingReceiver

private[streaming] class FlumePollingInputDStream[T: ClassTag](
    _ssc: StreamingContext,
    val addresses: Seq[InetSocketAddress],
    val maxBatchSize: Int,
    val parallelism: Int,
    storageLevel: StorageLevel
  ) extends ReceiverInputDStream[SparkFlumeEvent](_ssc) {

  override def getReceiver(): Receiver[SparkFlumeEvent] = {
    new FlumePollingReceiver(addresses, maxBatchSize, parallelism, storageLevel)
  }
}

5. FlumePollingReceiver builds a cached thread pool whose threads are daemon threads. Both the pool and the NioClientSocketChannelFactory that uses it are created lazily (NioClientSocketChannelFactory comes from Netty)

lazy val channelFactoryExecutor =
  Executors.newCachedThreadPool(new ThreadFactoryBuilder().setDaemon(true).
    setNameFormat("Flume Receiver Channel Thread - %d").build())

lazy val channelFactory =
  new NioClientSocketChannelFactory(channelFactoryExecutor, channelFactoryExecutor)

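6. receiverExecutor is also a thread pool, with a fixed size equal to parallelism. connections is a queue of FlumeConnection handles to the agents of the distributed Flume deployment; a worker thread takes a handle from the queue and uses it to fetch data.

lazy val receiverExecutor = Executors.newFixedThreadPool(parallelism,
  new ThreadFactoryBuilder().setDaemon(true).setNameFormat("Flume Receiver Thread - %d").build())

private lazy val connections = new LinkedBlockingQueue[FlumeConnection]()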
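The Spark code relies on Guava's ThreadFactoryBuilder to produce named daemon threads. As a rough JDK-only sketch of the same idea, a hand-written ThreadFactory can set the daemon flag and apply the name format. DaemonPoolSketch, daemonFactory, and describeWorker are hypothetical names for this illustration, not Spark or Guava APIs.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ThreadFactory;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical JDK-only counterpart of a named daemon thread pool.
class DaemonPoolSketch {

    // Builds threads that are daemons and are named from a format such as "Worker - %d".
    static ThreadFactory daemonFactory(final String nameFormat) {
        final AtomicInteger counter = new AtomicInteger();
        return new ThreadFactory() {
            @Override
            public Thread newThread(Runnable task) {
                Thread t = new Thread(task, String.format(nameFormat, counter.getAndIncrement()));
                t.setDaemon(true); // daemon threads do not keep the JVM alive
                return t;
            }
        };
    }

    // Runs one task on a fixed-size pool and reports the worker's name and daemon flag.
    static String describeWorker(int parallelism) {
        ExecutorService pool =
                Executors.newFixedThreadPool(parallelism, daemonFactory("Flume Receiver Thread - %d"));
        try {
            Callable<String> task = () ->
                    Thread.currentThread().getName() + " daemon=" + Thread.currentThread().isDaemon();
            return pool.submit(task).get();
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(describeWorker(5));
    }
}
```

Daemon threads matter here because polling workers loop for as long as the receiver runs; marking them as daemons means they cannot prevent the JVM from exiting.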

7. On start, the receiver creates a NettyTransceiver for each address, then submits one FlumeBatchFetcher per unit of parallelism (5 by default)

override def onStart(): Unit = {
  // Create the connections to each Flume agent.
  addresses.foreach(host => {
    val transceiver = new NettyTransceiver(host, channelFactory)
    val client = SpecificRequestor.getClient(classOf[SparkFlumeProtocol.Callback], transceiver)
    connections.add(new FlumeConnection(transceiver, client))
  })
  for (i <- 0 until parallelism) {
    logInfo("Starting Flume Polling Receiver worker threads..")
    // Threads that pull data from Flume.
    receiverExecutor.submit(new FlumeBatchFetcher(this))
  }
}

8. In FlumeBatchFetcher.run, each iteration takes a connection handle from the receiver. sendAck and sendNack implement message acknowledgement: an ack is sent once the batch has been stored, otherwise a nack is sent.

def run(): Unit = {
  while (!receiver.isStopped()) {
    val connection = receiver.getConnections.poll()
    val client = connection.client
    var batchReceived = false
    var seq: CharSequence = null
    try {
      getBatch(client) match {
        case Some(eventBatch) =>
          batchReceived = true
          seq = eventBatch.getSequenceNumber
          val events = toSparkFlumeEvents(eventBatch.getEvents)
          if (store(events)) {
            sendAck(client, seq)
          } else {
            sendNack(batchReceived, client, seq)
          }
        case None =>
      }
    } catch {
      // The excerpt stops here; the exception handling is not shown.
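The control flow of one fetch iteration can be sketched in plain Java as follows. This is an illustration under stated assumptions, not Spark's implementation: FetchLoopSketch, SinkClient, and fetchOnce are hypothetical names, events are reduced to strings, and the ack or nack decision is returned as text instead of being sent over RPC. Returning the connection to the queue in a finally block is also an assumption of the sketch, since the excerpt above is cut off before that point.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.function.Predicate;

// Hypothetical sketch of one poll / store / acknowledge cycle.
class FetchLoopSketch {

    // Stand-in for the RPC client; an empty result models a batch that could not be fetched.
    interface SinkClient {
        Optional<List<String>> getBatch(int maxBatchSize);
    }

    // Borrow a connection, fetch a batch, store it, and decide between ack and nack.
    static String fetchOnce(BlockingQueue<SinkClient> connections,
                            Predicate<List<String>> store,
                            int maxBatchSize) {
        SinkClient client = connections.poll();
        if (client == null) {
            return "no-connection";
        }
        try {
            Optional<List<String>> batch = client.getBatch(maxBatchSize);
            if (!batch.isPresent()) {
                return "no-batch";
            }
            // Acknowledge only after the events were stored successfully.
            return store.test(batch.get()) ? "ack" : "nack";
        } finally {
            // Assumption: hand the connection back so that other workers can reuse it.
            connections.add(client);
        }
    }

    public static void main(String[] args) {
        BlockingQueue<SinkClient> connections = new LinkedBlockingQueue<>();
        connections.add(size -> Optional.of(Arrays.asList("e1", "e2")));
        System.out.println(fetchOnce(connections, events -> true, 1000));
    }
}
```

Acknowledging only after a successful store is what lets the sink keep unacknowledged events in the channel, so a failed store does not lose data.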

9. getBatch fetches the data one batch at a time

/**
 * Gets a batch of events from the specified client. This method does not handle any exceptions
 * which will be propagated to the caller.
 * @param client Client to get events from
 * @return [[Some]] which contains the event batch if Flume sent any events back, else [[None]]
 */
private def getBatch(client: SparkFlumeProtocol.Callback): Option[EventBatch] = {
  val eventBatch = client.getEventBatch(receiver.getMaxBatchSize)
  if (!SparkSinkUtils.isErrorBatch(eventBatch)) {
    // No error, proceed with processing data
    logDebug(s"Received batch of ${eventBatch.getEvents.size} events with sequence " +
      s"number: ${eventBatch.getSequenceNumber}")
    Some(eventBatch)
  } else {
    logWarning("Did not receive events from Flume agent due to error on the Flume agent: " +
      eventBatch.getErrorMsg)
    None
  }
}

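Notes:

Source: DT_大数据梦工厂 (DT Big Data Dream Factory)

For more content, follow the WeChat official account DT_Spark.

If you are interested in big data and Spark, 王家林 (Wang Jialin) hosts a permanently free public Spark class every evening at 20:00 in YY room 68917580.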
