HBase: How Wide and Narrow Tables Affect Region Splits

Published: 2024-09-22 Author: 千家信息网 editor

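HBase's hbase.hregion.max.filesize property sets the threshold for splitting a region. In the version discussed here it defaults to 268435456 bytes (256 MB): when a store file of any column family grows beyond this value, the region is split into two regions.
An HBase row can hold a very large number of columns, so a table can be designed in two ways: as a wide table (each row has many columns) or as a narrow table (many rows, each with few columns).
Take a table that stores users' emails as an example.
As a wide table, all of a user's emails are stored in a single row:
userid1 email1 email2 email3 ... ... ... ... ... emailn
userid2 email1 email2 email3 ... ... ... ... ... emailn
...
useridn email1 email2 email3 ... ... ... ... ... emailn
As a narrow table, the rowkey is composed of the user ID and the email ID, with one email per row:
userid1_emailid1 email1
userid1_emailid2 email2
userid1_emailid3 email3
...
userid1_emailidn emailn
userid2_emailid1 email1
userid2_emailid2 email2
userid2_emailid3 email3
...
userid2_emailidn emailn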
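
To make the difference between the two layouts concrete, here is a minimal sketch of how the rowkeys could be built in each design. The class and method names (RowKeyDesign, wideRowKey, narrowRowKey, rowKeys) and the underscore separator are illustrative assumptions, not part of any HBase API.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, for illustration only; not an HBase API.
class RowKeyDesign {
    // Wide design: the rowkey is just the user ID; each email becomes a column.
    static String wideRowKey(String userId) {
        return userId;
    }

    // Narrow design: the rowkey combines user ID and email ID; one email per row.
    static String narrowRowKey(String userId, String emailId) {
        return userId + "_" + emailId;
    }

    // Distinct rowkeys produced when storing the given emails for one user.
    static List<String> rowKeys(String userId, String[] emailIds, boolean wide) {
        List<String> keys = new ArrayList<>();
        for (String emailId : emailIds) {
            String key = wide ? wideRowKey(userId) : narrowRowKey(userId, emailId);
            if (!keys.contains(key)) {
                keys.add(key);
            }
        }
        return keys;
    }

    public static void main(String[] args) {
        String[] emailIds = {"emailid1", "emailid2", "emailid3"};
        System.out.println("wide:   " + rowKeys("userid1", emailIds, true));
        System.out.println("narrow: " + rowKeys("userid1", emailIds, false));
    }
}
```

The wide design yields a single rowkey however many emails a user has, while the narrow design yields one rowkey per email.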
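These two designs affect region splitting differently. While reading the HFileOutputFormat code today, I noticed that the RecordWriter it creates restricts where a split can happen: a split is made only when the rowkey changes. As long as the rowkey stays the same, no split is made, even if the size has already exceeded hbase.hregion.max.filesize.
The RecordWriter code: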

  public void write(ImmutableBytesWritable row, KeyValue kv)
      throws IOException {
    long length = kv.getLength();
    byte [] family = kv.getFamily();
    WriterLength wl = this.writers.get(family);
    // Start a new writer on the first write for this family, or when the size
    // limit is reached AND the row differs from the previously written row.
    if (wl == null || ((length + wl.written) >= maxsize) &&
        Bytes.compareTo(this.previousRow, 0, this.previousRow.length,
          kv.getBuffer(), kv.getRowOffset(), kv.getRowLength()) != 0) {
      // Get a new writer.
      Path basedir = new Path(outputdir, Bytes.toString(family));
      if (wl == null) {
        wl = new WriterLength();
        this.writers.put(family, wl);
        if (this.writers.size() > 1) throw new IOException("One family only");
        // If wl == null, first file in family. Ensure family dir exists.
        if (!fs.exists(basedir)) fs.mkdirs(basedir);
      }
      wl.writer = getNewWriter(wl.writer, basedir);
      LOG.info("Writer=" + wl.writer.getPath() +
        ((wl.written == 0)? "": ", wrote=" + wl.written));
      wl.written = 0;
    }
    kv.updateLatestStamp(this.now);
    wl.writer.append(kv);
    wl.written += length;
    // Copy the row so we know when a row transition.
    this.previousRow = kv.getRow();
  }

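The condition of the if statement at the top of write() (highlighted in the original post) shows that a split is made only when the size written so far reaches hbase.hregion.max.filesize and the current row differs from the previously written row. Two consequences follow:
1. In a wide table, a single row larger than hbase.hregion.max.filesize is not split.
2. When many versions are written under the same rowkey, no split is made even if their total size exceeds hbase.hregion.max.filesize.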
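
The rule can be reproduced without HBase. The following is a simplified model of the condition in write(), assuming a single column family and cells of equal size. RollRuleSketch and countFiles are hypothetical names, and the model only counts how many writers would be started.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Simplified, hypothetical model of the rollover condition in write(); not HBase code.
class RollRuleSketch {
    private final long maxSize;
    private long written = 0;
    private byte[] previousRow = new byte[0];
    private boolean hasWriter = false;
    private int writersStarted = 0;

    RollRuleSketch(long maxSize) {
        this.maxSize = maxSize;
    }

    void write(String row, long length) {
        byte[] rowBytes = row.getBytes(StandardCharsets.UTF_8);
        boolean sizeReached = (length + written) >= maxSize;
        boolean rowChanged = !Arrays.equals(previousRow, rowBytes);
        // Same shape as the original test: first write, or size reached AND row changed.
        if (!hasWriter || (sizeReached && rowChanged)) {
            hasWriter = true;
            writersStarted++;
            written = 0;
        }
        written += length;
        previousRow = rowBytes;
    }

    // Number of writers started after writing one equally sized cell per row entry.
    static int countFiles(long maxSize, String[] rows, long cellLength) {
        RollRuleSketch sketch = new RollRuleSketch(maxSize);
        for (String row : rows) {
            sketch.write(row, cellLength);
        }
        return sketch.writersStarted;
    }

    public static void main(String[] args) {
        String[] sameRow = new String[10];
        String[] distinctRows = new String[10];
        for (int i = 0; i < 10; i++) {
            sameRow[i] = "row1";
            distinctRows[i] = "row" + i;
        }
        System.out.println("same rowkey:      " + countFiles(10, sameRow, 8));
        System.out.println("distinct rowkeys: " + countFiles(10, distinctRows, 8));
    }
}
```

With a threshold of 10 and cells of length 8, the ten writes to one rowkey stay with a single writer, while the ten distinct rowkeys each start a new one.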

Let's verify this.
To see the effect quickly, change two configuration parameters in hbase-site.xml:

  <property>
    <name>hbase.hregion.memstore.flush.size</name>
    <value>5</value>
    <description>
    Memstore will be flushed to disk if size of the memstore
    exceeds this number of bytes. Value is checked by a thread that runs
    every hbase.server.thread.wakefrequency.
    </description>
  </property>
  <property>
    <name>hbase.hregion.max.filesize</name>
    <value>10</value>
    <description>
    Maximum HStoreFile size. If any one of a column families' HStoreFiles has
    grown to exceed this value, the hosting HRegion is split in two.
    Default: 256M.
    </description>
  </property>
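
With thresholds of 5 and 10 bytes, practically any write crosses them. As a rough check, the sketch below estimates the serialized length of one cell, assuming the classic KeyValue layout with no tags: a 4-byte key length, a 4-byte value length, then a key made of a 2-byte row length, the row, a 1-byte family length, the family, the qualifier, an 8-byte timestamp and a 1-byte type, followed by the value. CellSizeSketch and keyValueLength are hypothetical names, the inputs are sample values, and the sizes HBase actually compares against these settings include further overhead.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical estimate of a serialized cell, assuming the classic KeyValue layout.
class CellSizeSketch {
    static int byteLength(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    static int keyValueLength(String row, String family, String qualifier, String value) {
        int keyLength = 2 + byteLength(row)      // row length prefix + row
                      + 1 + byteLength(family)   // family length prefix + family
                      + byteLength(qualifier)    // qualifier (no length prefix)
                      + 8                        // timestamp
                      + 1;                       // key type
        // 4-byte key length + 4-byte value length + key + value
        return 4 + 4 + keyLength + byteLength(value);
    }

    public static void main(String[] args) {
        int size = keyValueLength("row1", "f1", "c0", "swallow0");
        System.out.println(size + " bytes, above a 10-byte limit: " + (size > 10));
    }
}
```

Under these assumptions a single small cell already comes to 36 bytes, well above both configured values.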

Create the test tables t1 and t2:

  hbase(main):076:0* create 't1','f1'
  0 row(s) in 1.6460 seconds
  hbase(main):077:0> create 't2','f1'
  0 row(s) in 1.1790 seconds

Check the system table .META.:

  hbase(main):081:0* scan '.META.'
  ROW COLUMN+CELL
  t1,,1314720667274.d8acd6bc659ac8326b88850d645a90ad column=info:regioninfo, timestamp=1314720667384, value=REGION => {NAME => 't1,,1314720667274.d8acd6bc659ac8326b88850d645a90ad.', STARTKEY => '', ENDK
  . EY => '', ENCODED => d8acd6bc659ac8326b88850d645a90ad, TABLE => {{NAME => 't1', FAMILIES => [{NAME => 'f1', BLOOMFILTER => 'NONE', REPLICATION_SCOPE
  => '0', COMPRESSION => 'NONE', VERSIONS => '3', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'}]}}
  t1,,1314720667274.d8acd6bc659ac8326b88850d645a90ad column=info:server, timestamp=1314720667941, value=yinjie:60020
  .
  t1,,1314720667274.d8acd6bc659ac8326b88850d645a90ad column=info:serverstartcode, timestamp=1314720667941, value=1314716290123
  .
  t2,,1314720672168.16bb3d2563eab3b4e25477c64e007e71 column=info:regioninfo, timestamp=1314720672241, value=REGION => {NAME => 't2,,1314720672168.16bb3d2563eab3b4e25477c64e007e71.', STARTKEY => '', ENDK
  . EY => '', ENCODED => 16bb3d2563eab3b4e25477c64e007e71, TABLE => {{NAME => 't2', FAMILIES => [{NAME => 'f1', BLOOMFILTER => 'NONE', REPLICATION_SCOPE
  => '0', COMPRESSION => 'NONE', VERSIONS => '3', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'}]}}
  t2,,1314720672168.16bb3d2563eab3b4e25477c64e007e71 column=info:server, timestamp=1314720672346, value=yinjie:60020
  .
  t2,,1314720672168.16bb3d2563eab3b4e25477c64e007e71 column=info:serverstartcode, timestamp=1314720672346, value=1314716290123
  .
  2 row(s) in 0.0230 seconds

At this point t1 and t2 each have one region.
First, insert 10 records into t1, all with the same rowkey:

  hbase(main):086:0* for i in 0..9 do\
  hbase(main):087:1* put 't1','row1',"f1:c#{i}","swallow#{i}"\
  hbase(main):088:1* end
  0 row(s) in 0.0180 seconds
  0 row(s) in 0.0070 seconds
  0 row(s) in 0.0420 seconds
  0 row(s) in 0.0620 seconds
  0 row(s) in 0.0120 seconds
  0 row(s) in 0.0770 seconds
  0 row(s) in 0.0150 seconds
  0 row(s) in 0.1290 seconds
  0 row(s) in 10.0740 seconds
  0 row(s) in 0.1230 seconds
  => 0..9
  hbase(main):089:0>

Check the records in t1:

  hbase(main):089:0> scan 't1'
  ROW COLUMN+CELL
  row1 column=f1:c0, timestamp=1314720946495, value=swallow0
  row1 column=f1:c1, timestamp=1314720946507, value=swallow1
  row1 column=f1:c2, timestamp=1314720946903, value=swallow2
  row1 column=f1:c3, timestamp=1314720946939, value=swallow3
  row1 column=f1:c4, timestamp=1314720946976, value=swallow4
  row1 column=f1:c5, timestamp=1314720947055, value=swallow5
  row1 column=f1:c6, timestamp=1314720947070, value=swallow6
  row1 column=f1:c7, timestamp=1314720947198, value=swallow7
  row1 column=f1:c8, timestamp=1314720957272, value=swallow8
  row1 column=f1:c9, timestamp=1314720957392, value=swallow9
  1 row(s) in 0.0300 seconds

Check .META. again:

  hbase(main):090:0> scan '.META.'
  ROW COLUMN+CELL
  t1,,1314720667274.d8acd6bc659ac8326b88850d645a90ad column=info:regioninfo, timestamp=1314720667384, value=REGION => {NAME => 't1,,1314720667274.d8acd6bc659ac8326b88850d645a90ad.', STARTKEY => '', ENDK
  . EY => '', ENCODED => d8acd6bc659ac8326b88850d645a90ad, TABLE => {{NAME => 't1', FAMILIES => [{NAME => 'f1', BLOOMFILTER => 'NONE', REPLICATION_SCOPE
  => '0', COMPRESSION => 'NONE', VERSIONS => '3', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'}]}}
  t1,,1314720667274.d8acd6bc659ac8326b88850d645a90ad column=info:server, timestamp=1314720667941, value=yinjie:60020
  .
  t1,,1314720667274.d8acd6bc659ac8326b88850d645a90ad column=info:serverstartcode, timestamp=1314720667941, value=1314716290123
  .
  t2,,1314720672168.16bb3d2563eab3b4e25477c64e007e71 column=info:regioninfo, timestamp=1314720672241, value=REGION => {NAME => 't2,,1314720672168.16bb3d2563eab3b4e25477c64e007e71.', STARTKEY => '', ENDK
  . EY => '', ENCODED => 16bb3d2563eab3b4e25477c64e007e71, TABLE => {{NAME => 't2', FAMILIES => [{NAME => 'f1', BLOOMFILTER => 'NONE', REPLICATION_SCOPE
  => '0', COMPRESSION => 'NONE', VERSIONS => '3', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'}]}}
  t2,,1314720672168.16bb3d2563eab3b4e25477c64e007e71 column=info:server, timestamp=1314720672346, value=yinjie:60020
  .
  t2,,1314720672168.16bb3d2563eab3b4e25477c64e007e71 column=info:serverstartcode, timestamp=1314720672346, value=1314716290123
  .
  2 row(s) in 0.0210 seconds

t1 still has only one region.

Next, insert the same 10 records into t2, but with a different rowkey for each:

  hbase(main):091:0> for i in 0..9 do\
  hbase(main):092:1* put 't2',"row#{i}","f1:c#{i}","swallow#{i}"\
  hbase(main):093:1* end
  0 row(s) in 0.1140 seconds
  0 row(s) in 0.0080 seconds
  0 row(s) in 0.0410 seconds
  0 row(s) in 0.0820 seconds
  0 row(s) in 0.0210 seconds
  0 row(s) in 0.0410 seconds
  0 row(s) in 0.0200 seconds
  0 row(s) in 0.1210 seconds
  0 row(s) in 0.0140 seconds
  0 row(s) in 0.0360 seconds
  => 0..9

Check the records in t2:

  hbase(main):097:0* scan 't2'
  ROW COLUMN+CELL
  row0 column=f1:c0, timestamp=1314721110769, value=swallow0
  row1 column=f1:c1, timestamp=1314721110787, value=swallow1
  row2 column=f1:c2, timestamp=1314721110830, value=swallow2
  row3 column=f1:c3, timestamp=1314721110916, value=swallow3
  row4 column=f1:c4, timestamp=1314721110932, value=swallow4
  row5 column=f1:c5, timestamp=1314721110971, value=swallow5
  row6 column=f1:c6, timestamp=1314721110989, value=swallow6
  row7 column=f1:c7, timestamp=1314721111121, value=swallow7
  row8 column=f1:c8, timestamp=1314721111130, value=swallow8
  row9 column=f1:c9, timestamp=1314721111172, value=swallow9
  10 row(s) in 1.0450 seconds

Check .META. once more:

  hbase(main):102:0> scan '.META.'
  ROW COLUMN+CELL
  t1,,1314720667274.d8acd6bc659ac8326b88850d645a90ad column=info:regioninfo, timestamp=1314720667384, value=REGION => {NAME => 't1,,1314720667274.d8acd6bc659ac8326b88850d645a90ad.', STARTKEY => '', ENDK
  . EY => '', ENCODED => d8acd6bc659ac8326b88850d645a90ad, TABLE => {{NAME => 't1', FAMILIES => [{NAME => 'f1', BLOOMFILTER => 'NONE', REPLICATION_SCOPE
  => '0', COMPRESSION => 'NONE', VERSIONS => '3', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'}]}}
  t1,,1314720667274.d8acd6bc659ac8326b88850d645a90ad column=info:server, timestamp=1314720667941, value=yinjie:60020
  .
  t1,,1314720667274.d8acd6bc659ac8326b88850d645a90ad column=info:serverstartcode, timestamp=1314720667941, value=1314716290123
  .
  t2,,1314720672168.16bb3d2563eab3b4e25477c64e007e71 column=info:regioninfo, timestamp=1314721112130, value=REGION => {NAME => 't2,,1314720672168.16bb3d2563eab3b4e25477c64e007e71.', STARTKEY => '', ENDK
  . EY => '', ENCODED => 16bb3d2563eab3b4e25477c64e007e71, OFFLINE => true, SPLIT => true, TABLE => {{NAME => 't2', FAMILIES => [{NAME => 'f1', BLOOMFILT
  ER => 'NONE', REPLICATION_SCOPE => '0', VERSIONS => '3', COMPRESSION => 'NONE', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOC
  KCACHE => 'true'}]}}
  t2,,1314720672168.16bb3d2563eab3b4e25477c64e007e71 column=info:server, timestamp=1314720672346, value=yinjie:60020
  .
  t2,,1314720672168.16bb3d2563eab3b4e25477c64e007e71 column=info:serverstartcode, timestamp=1314720672346, value=1314716290123
  .
  t2,,1314720672168.16bb3d2563eab3b4e25477c64e007e71 column=info:splitA, timestamp=1314721112130, value=REGION => {NAME => 't2,,1314721111490.71df02214242923574b71fe5e2a19360.', STARTKEY => '', ENDKEY =
  . > 'row0', ENCODED => 71df02214242923574b71fe5e2a19360, TABLE => {{NAME => 't2', FAMILIES => [{NAME => 'f1', BLOOMFILTER => 'NONE', REPLICATION_SCOPE
  => '0', VERSIONS => '3', COMPRESSION => 'NONE', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'}]}}
  t2,,1314720672168.16bb3d2563eab3b4e25477c64e007e71 column=info:splitB, timestamp=1314721112130, value=REGION => {NAME => 't2,row0,1314721111490.915ee8d4a32c59a4ec3960e335b061ca.', STARTKEY => 'row0',
  . ENDKEY => '', ENCODED => 915ee8d4a32c59a4ec3960e335b061ca, TABLE => {{NAME => 't2', FAMILIES => [{NAME => 'f1', BLOOMFILTER => 'NONE', REPLICATION_SC
  OPE => '0', VERSIONS => '3', COMPRESSION => 'NONE', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'}]}}
  t2,,1314721111490.71df02214242923574b71fe5e2a19360 column=info:regioninfo, timestamp=1314721112267, value=REGION => {NAME => 't2,,1314721111490.71df02214242923574b71fe5e2a19360.', STARTKEY => '', ENDK
  . EY => 'row0', ENCODED => 71df02214242923574b71fe5e2a19360, TABLE => {{NAME => 't2', FAMILIES => [{NAME => 'f1', BLOOMFILTER => 'NONE', REPLICATION_SC
  OPE => '0', VERSIONS => '3', COMPRESSION => 'NONE', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'}]}}
  t2,,1314721111490.71df02214242923574b71fe5e2a19360 column=info:server, timestamp=1314721112267, value=yinjie:60020
  .
  t2,,1314721111490.71df02214242923574b71fe5e2a19360 column=info:serverstartcode, timestamp=1314721112267, value=1314716290123
  .
  t2,row0,1314721111490.915ee8d4a32c59a4ec3960e335b0 column=info:regioninfo, timestamp=1314721112627, value=REGION => {NAME => 't2,row0,1314721111490.915ee8d4a32c59a4ec3960e335b061ca.', STARTKEY => 'row
  61ca. 0', ENDKEY => '', ENCODED => 915ee8d4a32c59a4ec3960e335b061ca, TABLE => {{NAME => 't2', FAMILIES => [{NAME => 'f1', BLOOMFILTER => 'NONE', REPLICATIO
  N_SCOPE => '0', VERSIONS => '3', COMPRESSION => 'NONE', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'}]}}
  t2,row0,1314721111490.915ee8d4a32c59a4ec3960e335b0 column=info:server, timestamp=1314721112627, value=yinjie:60020
  61ca.
  t2,row0,1314721111490.915ee8d4a32c59a4ec3960e335b0 column=info:serverstartcode, timestamp=1314721112627, value=1314716290123
  61ca.
  4 row(s) in 0.0380 seconds

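The region of t2 has been split: .META. now marks the original region as OFFLINE and SPLIT and lists two daughter regions.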
