Typical Flume Application Scenarios
1. Writing Configuration Files for Different Flume Sources and Sinks
(1) Source --- spooldir
Watches a directory (the directory must not contain subdirectories) and monitors the files placed under it. Once a file has been fully collected, it is renamed with the suffix .COMPLETED.
Configuration file:
# Name the components on this agent
# a1 is the agent name; it can be customized, but note: agent names must be unique on the same node
# These lines define the aliases of the sources, sinks, and channels
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# Specify the source type and its parameters
a1.sources.r1.type = spooldir
a1.sources.r1.spoolDir = /home/hadoop/flumedata
# Configure the channel
a1.channels.c1.type = memory
# Configure the sink
a1.sinks.k1.type = logger
# Bind the source and sink to the channel
# Channel for the source
a1.sources.r1.channels = c1
# Channel for the sink
a1.sinks.k1.channel = c1
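To try this out, start the agent with the file above and drop a file into the watched directory; the properties file name below is an assumed name for this example.

# Assuming the config above is saved as agentconf/spool-logger.properties
bin/flume-ng agent -c conf -f agentconf/spool-logger.properties -n a1 -Dflume.root.logger=INFO,console
# In another shell, place a file into the spool directory
echo "hello flume" > /home/hadoop/flumedata/test.txt
# After collection the file is renamed to test.txt.COMPLETED and its lines are printed by the logger sink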
(2) Source --- netcat
A NetCat source listens on a specified port and converts each line of received data into an event.
Source: netcat (monitors a TCP port)
Channel: memory
Destination: console (logger)
Configuration file:
# Name the components on this agent
a1.sources = r1
a1.channels = c1
a1.sinks = k1
# Configure the source
a1.sources.r1.channels = c1
# Source type
a1.sources.r1.type = netcat
# Host to listen on
a1.sources.r1.bind = 192.168.191.130
# Port to listen on
a1.sources.r1.port = 3212
# Configure the channel
a1.channels.c1.type = memory
# Sink: write the data out via logger
a1.sinks.k1.channel = c1
a1.sinks.k1.type = logger
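A quick way to test the netcat source is to start the agent and then send a few lines to the monitored port with nc (or telnet); the properties file name is assumed.

bin/flume-ng agent -c conf -f agentconf/netcat-logger.properties -n a1 -Dflume.root.logger=INFO,console
# From another terminal
nc 192.168.191.130 3212
hello
world
# Each line shows up as one event in the agent's console output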
(3) Source --- avro
Listens on an Avro port and receives event streams from external Avro clients. With an Avro source you can build multi-hop, fan-out, and fan-in flows. It can also receive log data sent via the Avro client that ships with Flume.
Source: avro
Channel: memory
Destination: console (logger)
Configuration file:
# Name the components on this agent
a1.sources = r1
a1.channels = c1
a1.sinks = k1
# Configure the source
a1.sources.r1.channels = c1
# Source type
a1.sources.r1.type = avro
# Hostname to listen on
a1.sources.r1.bind = hadoop03
# Port to listen on
a1.sources.r1.port = 3212
# Configure the channel
a1.channels.c1.type = memory
# Configure the sink
a1.sinks.k1.channel = c1
a1.sinks.k1.type = logger
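To send test data to this Avro source, you can use the Avro client bundled with Flume; the properties file name and the file being sent are example paths.

bin/flume-ng agent -c conf -f agentconf/avro-logger.properties -n a1 -Dflume.root.logger=INFO,console
# From another terminal, send a file to the Avro source
bin/flume-ng avro-client -H hadoop03 -p 3212 -F /home/hadoop/flumedata/test.txt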
(4) Collecting a log file into HDFS
source ==== exec (a Linux command: tail -F)
channel ==== memory
sink ==== hdfs
Note: if the cluster is a high-availability (HA) cluster, core-site.xml and hdfs-site.xml need to be placed in Flume's conf directory.
Configuration file:
a1.sources = r1
a1.channels = c1
a1.sinks = k1
# Configure the source
a1.sources.r1.channels = c1
# Source type
a1.sources.r1.type = exec
# Command run by the exec source
a1.sources.r1.command = tail -F /home/hadoop/flumedata/zy.log
# Configure the channel
a1.channels.c1.type = memory
# Configure the sink: write to HDFS
a1.sinks.k1.channel = c1
a1.sinks.k1.type = hdfs
# Path of the files generated on HDFS: year-month-day/hour_minute
a1.sinks.k1.hdfs.path = /flume/%y-%m-%d/%H_%M
# Enable rounding
a1.sinks.k1.hdfs.round = true
# Rounding value (controls directory rollover)
a1.sinks.k1.hdfs.roundValue = 24
# Unit of the rounding value
a1.sinks.k1.hdfs.roundUnit = hour
# File rolling settings
# Time interval for rolling the current file (in seconds)
a1.sinks.k1.hdfs.rollInterval = 10
# Roll the file once it reaches this size (in bytes)
a1.sinks.k1.hdfs.rollSize = 1024
# Roll the file after this many events
a1.sinks.k1.hdfs.rollCount = 10
# Time source (true means use the local time)
a1.sinks.k1.hdfs.useLocalTimeStamp = true
# File type stored on HDFS (DataStream = plain text)
a1.sinks.k1.hdfs.fileType = DataStream
# File name prefix
a1.sinks.k1.hdfs.filePrefix = zzy
# File name suffix
a1.sinks.k1.hdfs.fileSuffix = .log
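After starting the agent (properties file name assumed), append a line to the monitored log and check the output directory on HDFS.

bin/flume-ng agent -c conf -f agentconf/exec-hdfs.properties -n a1 -Dflume.root.logger=INFO,console
echo "a new log line" >> /home/hadoop/flumedata/zy.log
hdfs dfs -ls /flume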
2. Typical Flume Usage Scenarios
(1) Multi-agent flow
Events are passed from the Flume agent on the first machine to the Flume agent on the second machine.
Example:
Plan:
hadoop02: tail-avro.properties
Use exec "tail -F /home/hadoop/testlog/welog.log" to collect the data
Use an avro sink to send the data to the next agent
hadoop03: avro-hdfs.properties
Use an avro source to receive the collected data
Use an hdfs sink to write the data to its destination
Configuration files (startup commands follow after the two files):
#tail-avro.properties
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# Describe/configure the source
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /home/hadoop/testlog/date.log
a1.sources.r1.channels = c1
# Describe the sink (points at the avro source on hadoop03)
a1.sinks.k1.type = avro
a1.sinks.k1.channel = c1
a1.sinks.k1.hostname = hadoop03
a1.sinks.k1.port = 4141
a1.sinks.k1.batch-size = 2
# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
#avro-hdfs.properties
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# Describe/configure the source
a1.sources.r1.type = avro
a1.sources.r1.channels = c1
a1.sources.r1.bind = 0.0.0.0
a1.sources.r1.port = 4141
# Describe k1
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = hdfs://myha01/testlog/flume-event/%y-%m-%d/%H-%M
a1.sinks.k1.hdfs.filePrefix = date_
a1.sinks.k1.hdfs.maxOpenFiles = 5000
a1.sinks.k1.hdfs.batchSize = 100
a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.hdfs.writeFormat = Text
a1.sinks.k1.hdfs.rollSize = 102400
a1.sinks.k1.hdfs.rollCount = 1000000
a1.sinks.k1.hdfs.rollInterval = 60
a1.sinks.k1.hdfs.round = true
a1.sinks.k1.hdfs.roundValue = 10
a1.sinks.k1.hdfs.roundUnit = minute
a1.sinks.k1.hdfs.useLocalTimeStamp = true
# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
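Start the downstream agent (avro-hdfs.properties, on hadoop03) first so that its Avro source is already listening on port 4141 before the upstream agent begins sending; otherwise the avro sink cannot connect. The config file paths below are assumed.

# On hadoop03: downstream agent first
bin/flume-ng agent -c conf -f agentconf/avro-hdfs.properties -n a1 -Dflume.root.logger=INFO,console
# On hadoop02: then the upstream agent
bin/flume-ng agent -c conf -f agentconf/tail-avro.properties -n a1 -Dflume.root.logger=INFO,console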
(2) Fan-out collection (one source, multiple channels and sinks)
A single agent has multiple channels and multiple sinks, and the sinks write the data out to different files or file systems.
Plan:
hadoop02: (tail-hdfsandlogger.properties)
Use exec "tail -F /home/hadoop/testlog/datalog.log" to collect the data
Use sink1 to store the data in HDFS
Use sink2 to write the data to the console
Configuration file:
#tail-hdfsandlogger.properties
# Configuration with 2 channels and 2 sinks
# Name the components on this agent
a1.sources = s1
a1.sinks = k1 k2
a1.channels = c1 c2
# Describe/configure tail -F source1
a1.sources.s1.type = exec
a1.sources.s1.command = tail -F /home/hadoop/logs/catalina.out
# Rule for fanning the source out to multiple channels
a1.sources.s1.selector.type = replicating
a1.sources.s1.channels = c1 c2
# Use channels which buffer events in memory
# channel c1
a1.channels.c1.type = memory
# channel c2
a1.channels.c2.type = memory
# Describe the sinks
# k1: write to HDFS
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = hdfs://myha01/flume_log/%y-%m-%d/%H-%M
a1.sinks.k1.hdfs.filePrefix = events
a1.sinks.k1.hdfs.maxOpenFiles = 5000
a1.sinks.k1.hdfs.batchSize = 100
a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.hdfs.writeFormat = Text
a1.sinks.k1.hdfs.rollSize = 102400
a1.sinks.k1.hdfs.rollCount = 1000000
a1.sinks.k1.hdfs.rollInterval = 60
a1.sinks.k1.hdfs.round = true
a1.sinks.k1.hdfs.roundValue = 10
a1.sinks.k1.hdfs.roundUnit = minute
a1.sinks.k1.hdfs.useLocalTimeStamp = true
a1.sinks.k1.channel = c1
# k2: write to the console
a1.sinks.k2.type = logger
a1.sinks.k2.channel = c2
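The selector above is replicating, so every event is copied to both channels and both sinks see all the data. If you instead want true multiplexing, where events are routed to different channels based on a header value, Flume's multiplexing selector can be used; a minimal sketch, with the header name and values assumed for illustration:

a1.sources.s1.selector.type = multiplexing
a1.sources.s1.selector.header = Type
a1.sources.s1.selector.mapping.LOGIN = c1
a1.sources.s1.selector.mapping.ORDER = c2
a1.sources.s1.selector.default = c1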
(3) High-availability collection deployment
Data is first collected on three web servers and handed off to the collector tier, which is highly available: collector01 is the active node and receives all collected data, while collector02 stays in hot standby and receives nothing. If collector01 goes down, collector02 takes over and receives the data, which is finally delivered to HDFS or Kafka.
Deployment of agents and collectors
Data from Agent1 and Agent2 flows into Collector1 and Collector2. Flume NG itself provides a failover mechanism that can switch over and recover automatically. Collector1 and Collector2 then write the data to HDFS.
(Diagram: Agent1/Agent2 → Collector1 (active) / Collector2 (standby) → HDFS)
Configuration files:
#ha_agent.properties
#agent name: agent1
agent1.channels = c1
agent1.sources = r1
agent1.sinks = k1 k2
#set group
agent1.sinkgroups = g1
#set channel
agent1.channels.c1.type = memory
agent1.channels.c1.capacity = 1000
agent1.channels.c1.transactionCapacity = 100
agent1.sources.r1.channels = c1
agent1.sources.r1.type = exec
agent1.sources.r1.command = tail -F /home/hadoop/testlog/testha.log
agent1.sources.r1.interceptors = i1 i2
agent1.sources.r1.interceptors.i1.type = static
agent1.sources.r1.interceptors.i1.key = Type
agent1.sources.r1.interceptors.i1.value = LOGIN
agent1.sources.r1.interceptors.i2.type = timestamp
#set sink1
agent1.sinks.k1.channel = c1
agent1.sinks.k1.type = avro
agent1.sinks.k1.hostname = hadoop02
agent1.sinks.k1.port = 52020
#set sink2
agent1.sinks.k2.channel = c1
agent1.sinks.k2.type = avro
agent1.sinks.k2.hostname = hadoop03
agent1.sinks.k2.port = 52020
#set sink group
agent1.sinkgroups.g1.sinks = k1 k2
#set failover
agent1.sinkgroups.g1.processor.type = failover
agent1.sinkgroups.g1.processor.priority.k1 = 10
agent1.sinkgroups.g1.processor.priority.k2 = 1
agent1.sinkgroups.g1.processor.maxpenalty = 10000
#ha_collector.properties
#set agent name
a1.sources = r1
a1.channels = c1
a1.sinks = k1
#set channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
#other node, nna to nns
a1.sources.r1.type = avro
## change this to the hostname of the machine this collector runs on
a1.sources.r1.bind = hadoop03
a1.sources.r1.port = 52020
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = static
a1.sources.r1.interceptors.i1.key = Collector
## change this to the hostname of the machine this collector runs on
a1.sources.r1.interceptors.i1.value = hadoop03
a1.sources.r1.channels = c1
#set sink to hdfs
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = hdfs://myha01/flume_ha/loghdfs
a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.hdfs.writeFormat = TEXT
a1.sinks.k1.hdfs.rollInterval = 10
a1.sinks.k1.channel = c1
a1.sinks.k1.hdfs.filePrefix = %Y-%m-%d
Finally, start everything:
# First start the collector role on hadoop02 and hadoop03:
bin/flume-ng agent -c conf -f agentconf/ha_collector.properties -n a1 -Dflume.root.logger=INFO,console
# Then start the agent role on hadoop01 and hadoop02:
bin/flume-ng agent -c conf -f agentconf/ha_agent.properties -n agent1 -Dflume.root.logger=INFO,console
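To verify the failover behavior, you can kill the collector on hadoop02 (the higher-priority sink k1), keep appending to the monitored log, and confirm that data still reaches HDFS through the hadoop03 collector; after hadoop02 is restarted, traffic switches back because of its higher priority. A rough check, assuming the paths above:

# On hadoop02: find and kill the collector JVM (Flume agents show up as "Application" in jps)
jps | grep Application
kill -9 <pid>
# On the agent machine: keep generating data
echo "failover test" >> /home/hadoop/testlog/testha.log
# New files should still appear under the collector's HDFS output path
hdfs dfs -ls /flume_ha/loghdfs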