一条命令搞垮MongoDB实例
一条命令搞垮MongoDB实例
背景
Part1:写在最前
在副本集架构中,我们会经常通过rs.add(),rs.remove()命令来调整后台数据库架构,在本案例中,我们异常的触发到了一个MongoDB的BUG,并尽快的找到了官方的人进行咨询。在生产环境中,我们做实例迁移,将研发自行维护的MongoDB副本集迁移至DBA管理,由于硬件和版本都不符合规范,因此我们对集群先进行了升级处理,又使用了rs.add()和rs.remove()来完成数据库的迁移工作。
Part2:背景
在研发自行维护的数据库版本为2.6版本,我们先将数据库升级至3.4版本,并利用3.4版本的特性实现0 downtime开启认证。在研发的数据库原有架构中,存在离线节点,即hidden和no-vote节点。也正是因为这一特性,触发了MongoDB宕机。
实战
Part1:整体架构
原有集群为7节点副本集架构,其中2台为hidden节点,并且配置了no-vote参数。
Part2:迁移原理
我们利用新的机器使用rs.add()加入到原有副本集,在原有副本集的基础上添加了新的节点,待同步完成后,rs.remove()掉老的研发机器,完成实例迁移。
Part3:错误日志
在使用rs.remove时,我们提前写好了迁移文档,由于rs.remove()的速度很快,我们采取了直接rs.remove()多个节点的复制粘贴方式,这个也可能是后续导致crash的原因之一。
rs.remove()2018-04-17T14:54:23.793+0800 I NETWORK [conn163572] received client metadata from 192.168.1.100:16400 conn163572: { driver: { name: "PyMongo", version: "3.5.1" }, os: { type: "Linux", name: "CentOS 6.4 Final", architecture: "x86_64", version: "2.6.32-279.23.1.mi5.el6.x86_64" }, platform: "CPython 2.7.6.final.0" }2018-04-17T14:54:23.811+0800 I NETWORK [thread1] connection accepted from 192.168.1.101:57568 #163573 (73 connections now open)2018-04-17T14:54:23.811+0800 I NETWORK [conn163573] received client metadata from 192.168.1.101:57568 conn163573: { driver: { name: "PyMongo", version: "3.5.1" }, os: { type: "Linux", name: "CentOS 6.3 Final", architecture: "x86_64", version: "2.6.32-279.23.1.mi5.el6.x86_64" }, platform: "CPython 2.7.6.final.0" }2018-04-17T14:54:23.818+0800 I - [replication-25230] Invariant failure i < _members.size() src/mongo/db/repl/repl_set_config.cpp 6202018-04-17T14:54:23.818+0800 I - [replication-25230] ***aborting after invariant() failure2018-04-17T14:54:23.822+0800 I NETWORK [thread1] connection accepted from 192.168.1.102:32210 #163574 (74 connections now open)2018-04-17T14:54:23.822+0800 I NETWORK [conn163574] received client metadata from 192.168.1.102:32210 conn163574: { driver: { name: "PyMongo", version: "3.5.1" }, os: { type: "Linux", name: "CentOS 6.3 Final", architecture: "x86_64", version: "2.6.32-279.23.1.mi5.el6.x86_64" }, platform: "CPython 2.7.6.final.0" }2018-04-17T14:54:23.822+0800 I NETWORK [thread1] connection accepted from 192.168.1.101:57569 #163575 (75 connections now open)2018-04-17T14:54:23.823+0800 I NETWORK [thread1] connection accepted from 192.168.1.103:50661 #163576 (76 connections now open)2018-04-17T14:54:23.824+0800 I NETWORK [conn163576] received client metadata from 192.168.1.103:50661 conn163576: { driver: { name: "PyMongo", version: "3.5.1" }, os: { type: "Linux", name: "CentOS 6.3 Final", architecture: "x86_64", version: "2.6.32-279.23.1.mi5.el6.x86_64" }, platform: "CPython 2.7.6.final.0" }2018-04-17T14:54:23.830+0800 I NETWORK [thread1] connection accepted from 192.168.1.101:57570 #163577 (77 connections now open)2018-04-17T14:54:23.830+0800 I NETWORK [conn163577] received client metadata from 192.168.1.101:57570 conn163577: { driver: { name: "PyMongo", version: "3.5.1" }, os: { type: "Linux", name: "CentOS 6.3 Final", architecture: "x86_64", version: "2.6.32-279.23.1.mi5.el6.x86_64" }, platform: "CPython 2.7.6.final.0" }2018-04-17T14:54:23.832+0800 I NETWORK [thread1] connection accepted from 192.168.1.104:41163 #163578 (78 connections now open)2018-04-17T14:54:23.838+0800 F - [replication-25230] Got signal: 6 (Aborted). 0x7ffc3f0a41d1 0x7ffc3f0a3159 0x7ffc3f0a363d 0x7ffc3c0bf500 0x7ffc3bd4f8a5 0x7ffc3bd51085 0x7ffc3e2d7a8e 0x7ffc3ea5883c 0x7ffc3ea58999 0x7ffc3eb19b9d 0x7ffc3eaa2676 0x7ffc3e9c83e9 0x7ffc3e9b0032 0x7ffc3ea41bf7 0x7ffc3e3812f1 0x7ffc3ee25a5a 0x7ffc3ee28e53 0x7ffc3ee2932b 0x7ffc3f0185bc 0x7ffc3f01906c 0x7ffc3f019a56 0x7ffc3fdc7a00 0x7ffc3c0b7851 0x7ffc3be0511d----- BEGIN BACKTRACE -----{"backtrace":[{"b":"7FFC3DA22000","o":"16821D1","s":"_ZN5mongo15printStackTraceERSo"},{"b":"7FFC3DA22000","o":"1681159"},{"b":"7FFC3DA22000","o":"168163D"},{"b":"7FFC3C0B0000","o":"F500"},{"b":"7FFC3BD1D000","o":"328A5","s":"gsignal"},{"b":"7FFC3BD1D000","o":"34085","s":"abort"},{"b":"7FFC3DA22000","o":"8B5A8E","s":"_ZN5mongo17invariantOKFailedEPKcRKNS_6StatusES1_j"},{"b":"7FFC3DA22000","o":"103683C"},{"b":"7FFC3DA22000","o":"1036999"},{"b":"7FFC3DA22000","o":"10F7B9D","s":"_ZNK5mongo4repl23TopologyCoordinatorImpl22shouldChangeSyncSourceERKNS_11HostAndPortERKNS0_6OpTimeERKNS_3rpc15ReplSetMetadataEN5boost8optionalINS8_18OplogQueryMetadataEEENS_6Date_tE"},{"b":"7FFC3DA22000","o":"1080676","s":"_ZN5mongo4repl26ReplicationCoordinatorImpl22shouldChangeSyncSourceERKNS_11HostAndPortERKNS_3rpc15ReplSetMetadataEN5boost8optionalINS5_18OplogQueryMetadataEEE"},{"b":"7FFC3DA22000","o":"FA63E9","s":"_ZN5mongo4repl31DataReplicatorExternalStateImpl18shouldStopFetchingERKNS_11HostAndPortERKNS_3rpc15ReplSetMetadataEN5boost8optionalINS5_18OplogQueryMetadataEEE"},{"b":"7FFC3DA22000","o":"F8E032"},{"b":"7FFC3DA22000","o":"101FBF7","s":"_ZN5mongo4repl12OplogFetcher9_callbackERKNS_10StatusWithINS_7Fetcher13QueryResponseEEEPNS_14BSONObjBuilderE"},{"b":"7FFC3DA22000","o":"95F2F1","s":"_ZN5mongo7Fetcher9_callbackERKNS_8executor12TaskExecutor25RemoteCommandCallbackArgsEPKc"},{"b":"7FFC3DA22000","o":"1403A5A"},{"b":"7FFC3DA22000","o":"1406E53","s":"_ZN5mongo8executor22ThreadPoolTaskExecutor11runCallbackESt10shared_ptrINS1_13CallbackStateEE"},{"b":"7FFC3DA22000","o":"140732B"},{"b":"7FFC3DA22000","o":"15F65BC","s":"_ZN5mongo10ThreadPool10_doOneTaskEPSt11unique_lockISt5mutexE"},{"b":"7FFC3DA22000","o":"15F706C","s":"_ZN5mongo10ThreadPool13_consumeTasksEv"},{"b":"7FFC3DA22000","o":"15F7A56","s":"_ZN5mongo10ThreadPool17_workerThreadBodyEPS0_RKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE"},{"b":"7FFC3DA22000","o":"23A5A00"},{"b":"7FFC3C0B0000","o":"7851"},{"b":"7FFC3BD1D000","o":"E811D","s":"clone"}],"processInfo":{ "mongodbVersion" : "3.4.9-2.9", "gitVersion" : "dcdd0758e067949b27cfdf61c641905e619756e6", "compiledModules" : [], "uname" : { "sysname" : "Linux", "release" : "2.6.32-279.23.1.mi5.el6.x86_64", "version" : "#1 SMP Mon Sep 2 12:00:41 CST 2013", "machine" : "x86_64" }, "somap" : [ { "b" : "7FFC3DA22000", "elfType" : 3, "buildId" : "BE2829B1404DE963963BBCCA55C6EBBFE34CE20F" }, { "b" : "7FFFA8CFF000", "elfType" : 3, "buildId" : "A6539A72BE0493D91090D15510B7866310CBEA50" }, { "b" : "7FC5C55EA000", "path" : "/lib64/libz.so.1", "elfType" : 3, "buildId" : "D0ABBCCAC542E41D33A638138FEC450AC08A1CF2" }, { "b" : "7FC5C37D9000", "path" : "/lib64/libbz2.so.1", "elfType" : 3, "buildId" : "732F8FD5054C4FA43CF0CD4CC8C5FF02CEA3CC54" }, { "b" : "7FC5BCDBF000", "path" : "/usr/lib64/libsasl2.so.2", "elfType" : 3, "buildId" : "C447C77E41A336BA9AEE12D08FBF5D15948D7468" }, { "b" : "7FC5BFF53000", "path" : "/usr/lib64/libssl.so.10", "elfType" : 3, "buildId" : "318EAB33420B000D542F09B91B716BACAB1AD546" }, { "b" : "7FC5C2773000", "path" : "/usr/lib64/libcrypto.so.10", "elfType" : 3, "buildId" : "3A8D65B9A373C0AFAF106F3A979835B16DBEFF1A" }, { "b" : "7FC5C416B000", "path" : "/lib64/librt.so.1", "elfType" : 3, "buildId" : "5E9DDD9EE40AD0D4DDD032CE1086E402B7FA955A" }, { "b" : "7FC5C5367000", "path" : "/lib64/libdl.so.2", "elfType" : 3, "buildId" : "15B0822C819020F18BBF0E0C0286373155E03BE2" }, { "b" : "7FC5C40E3000", "path" : "/lib64/libm.so.6", "elfType" : 3, "buildId" : "A26BC945B5765B1258DB01FFEFB0C4F53F3961D7" }, { "b" : "7FC5C1ACD000", "path" : "/lib64/libgcc_s.so.1", "elfType" : 3, "buildId" : "CE152B8676517F23E7F54AD6408330979BE41443" }, { "b" : "7FC5C44B0000", "path" : "/lib64/libpthread.so.0", "elfType" : 3, "buildId" : "14853815DD64F2B830B8DCCB3A958A3804E13EFC" }, { "b" : "7FC5C451D000", "path" : "/lib64/libc.so.6", "elfType" : 3, "buildId" : "F2BBDD778ABFECFBA0C59BBCBA94D1151DDF96E4" }, { "b" : "7FC5C6800000", "path" : "/lib64/ld-linux-x86-64.so.2", "elfType" : 3, "buildId" : "D6BD776B36DAC438642CF84B282956738727901D" }, { "b" : "7FC5C2703000", "path" : "/lib64/libresolv.so.2", "elfType" : 3, "buildId" : "31545FEA4F1F72061992E79A1DF461EC719942E8" }, { "b" : "7FC5C04CC000", "path" : "/lib64/libcrypt.so.1", "elfType" : 3, "buildId" : "5F5EB7F30B61E0DAF6BBF8A367C388A54B7010EA" }, { "b" : "7FC5BEA88000", "path" : "/lib64/libgssapi_krb5.so.2", "elfType" : 3, "buildId" : "76A3DEEB6876CBED69A57D3EBC1E2AFBCA84EC76" }, { "b" : "7FC5BEBA2000", "path" : "/lib64/libkrb5.so.3", "elfType" : 3, "buildId" : "605701A8AE551604303523B4F0D3A7E98CF9E153" }, { "b" : "7FC5C059E000", "path" : "/lib64/libcom_err.so.2", "elfType" : 3, "buildId" : "4623A78918C882770E81AE7B5EE9DDF8DD2B6674" }, { "b" : "7FC5BEB72000", "path" : "/lib64/libk5crypto.so.3", "elfType" : 3, "buildId" : "190D45F6743DEF9DF8169D65801D4575B01825BD" }, { "b" : "7FC5BF510000", "path" : "/lib64/libfreebl3.so", "elfType" : 3, "buildId" : "68195872ECFB188389D29AAF01031A976FD18168" }, { "b" : "7FC5BEB05000", "path" : "/lib64/libkrb5support.so.0", "elfType" : 3, "buildId" : "DAE2A7E4E8B37D43EF6839FF5D8E012AFCF21A69" }, { "b" : "7FC5BF902000", "path" : "/lib64/libkeyutils.so.1", "elfType" : 3, "buildId" : "8A8734DC37305D8CC2EF8F8C3E5EA03171DB07EC" }, { "b" : "7FC5C16E3000", "path" : "/lib64/libselinux.so.1", "elfType" : 3, "buildId" : "A287DC6B86A9823038F057105CE64671E0B392EC" } ] }} mongod(_ZN5mongo15printStackTraceERSo+0x41) [0x7ffc3f0a41d1] mongod(+0x1681159) [0x7ffc3f0a3159] mongod(+0x168163D) [0x7ffc3f0a363d] libpthread.so.0(+0xF500) [0x7ffc3c0bf500] libc.so.6(gsignal+0x35) [0x7ffc3bd4f8a5] libc.so.6(abort+0x175) [0x7ffc3bd51085] mongod(_ZN5mongo17invariantOKFailedEPKcRKNS_6StatusES1_j+0x0) [0x7ffc3e2d7a8e] mongod(+0x103683C) [0x7ffc3ea5883c] mongod(+0x1036999) [0x7ffc3ea58999] mongod(_ZNK5mongo4repl23TopologyCoordinatorImpl22shouldChangeSyncSourceERKNS_11HostAndPortERKNS0_6OpTimeERKNS_3rpc15ReplSetMetadataEN5boost8optionalINS8_18OplogQueryMetadataEEENS_6Date_tE+0x32D) [0x7ffc3eb19b9d] mongod(_ZN5mongo4repl26ReplicationCoordinatorImpl22shouldChangeSyncSourceERKNS_11HostAndPortERKNS_3rpc15ReplSetMetadataEN5boost8optionalINS5_18OplogQueryMetadataEEE+0xB6) [0x7ffc3eaa2676] mongod(_ZN5mongo4repl31DataReplicatorExternalStateImpl18shouldStopFetchingERKNS_11HostAndPortERKNS_3rpc15ReplSetMetadataEN5boost8optionalINS5_18OplogQueryMetadataEEE+0x59) [0x7ffc3e9c83e9] mongod(+0xF8E032) [0x7ffc3e9b0032] mongod(_ZN5mongo4repl12OplogFetcher9_callbackERKNS_10StatusWithINS_7Fetcher13QueryResponseEEEPNS_14BSONObjBuilderE+0x1E57) [0x7ffc3ea41bf7] mongod(_ZN5mongo7Fetcher9_callbackERKNS_8executor12TaskExecutor25RemoteCommandCallbackArgsEPKc+0x621) [0x7ffc3e3812f1] mongod(+0x1403A5A) [0x7ffc3ee25a5a] mongod(_ZN5mongo8executor22ThreadPoolTaskExecutor11runCallbackESt10shared_ptrINS1_13CallbackStateEE+0x1B3) [0x7ffc3ee28e53] mongod(+0x140732B) [0x7ffc3ee2932b] mongod(_ZN5mongo10ThreadPool10_doOneTaskEPSt11unique_lockISt5mutexE+0x14C) [0x7ffc3f0185bc] mongod(_ZN5mongo10ThreadPool13_consumeTasksEv+0xBC) [0x7ffc3f01906c] mongod(_ZN5mongo10ThreadPool17_workerThreadBodyEPS0_RKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE+0x96) [0x7ffc3f019a56] mongod(+0x23A5A00) [0x7ffc3fdc7a00] libpthread.so.0(+0x7851) [0x7ffc3c0b7851] libc.so.6(clone+0x6D) [0x7ffc3be0511d]----- END BACKTRACE -----
Part4:问题分析
我们可以看到,在错误日志中存在如下内容:
Invariant failure i < _members.size() src/mongo/db/repl/repl_set_config.cpp 620
我们可以看到,在错误日志中存在如下内容:
Invariant failure i < _members.size() src/mongo/db/repl/repl_set_config.cpp 620
通过追寻源码,我们定位到该处逻辑:
透过invariant关键字,我们向上追寻,可以找到如下内容:
看完代码逻辑,我们认为报错的原因是由于批量rs.remove()引起,由于批量rs.remove(),代码逻辑中可以看出有一个判断:
const MemberConfig& ReplSetConfig::getMemberAt(size_t i) const
{ invariant(i < _members.size()); return _members[i]; }
例如,一个5节点的副本集,那么他的_id索引值为0,1,2,3,4。此时_members.size是5。 4 must <5逻辑是正确的。
但当我们使用rs.remove()命令删除超过1个节点成员时,例如同时删除_id=3和_id=4的,那么此时副本集就只有3个成员,_members.size变为3
但是i值却依旧因为某些原因停留在4,4<3这个逻辑不对,导致了被删除节点abort退出crash。当然这个推论目前我们没有确切的证据。
Part5:Mongo官方人员答复
同时,我们搜索到了相关的mongoDB jira,留言了我们的疑问,官方的jira中也是其他网友遇到的问题。他们的情况比我们的要更明显一些,是由于no-vote节点rs.remove导致的
笔者案例中也确实存在no-vote节点。目前官方给出的答复计划在4.1版本中进行修复。
官方jira连接
https://jira.mongodb.org/browse/SERVER-28079
Part6:规避方法
1. 将非投票节点升级为投票节点;
2. 请提前做好备份,以备不时之需。
--总结--
通过本文,我们了解到rs.remove()潜在可能导致mongoDB crash的场景,由于时间有限,且crash掉的机器本就是我们计划rs.remove()掉的节点,因此在笔者复现无果后,决定暂时放弃继续跟进这个issue。如果有网友遇到过类似的问题,且找到了根本原因和复现方式,可在博文下留言,笔者感激不尽!~由于笔者的水平有限,编写时间也很仓促,文中难免会出现一些错误或者不准确的地方,不妥之处恳请读者批评指正。喜欢笔者的文章,右上角点一波关注,谢谢~