千家信息网

Greenplum Segment Failure Recovery

Published: 2024-11-22  Author: 千家信息网 editorial staff
Segment Detection and Failover Mechanism

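The GP Master probes the Primary first. Only if the Primary is unreachable does it go on to probe the Mirror. A Primary/Mirror pair can be in one of four states:

  1. Primary up, Mirror up. The probe of the Primary succeeds, so the GP Master returns immediately and moves on to the next Segment.
  2. Primary up, Mirror down. The probe of the Primary succeeds, and the status returned by the Primary tells the GP Master that the Mirror is down (the Primary detects the Mirror failure itself and switches into ChangeTracking mode). The GP Master updates its metadata and moves on to the next Segment.
  3. Primary down, Mirror up. The probe of the Primary fails, so the GP Master probes the Mirror and finds it alive. It updates the metadata on the Master, has the Mirror take over as Primary (failover), and moves on to the next Segment.
  4. Primary down, Mirror down. The probe of the Primary fails, and so does the probe of the Mirror. The GP Master retries up to the maximum retry count, then stops probing this Segment without updating the Master metadata, and moves on to the next Segment.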

Cases 2 through 4 above require running gprecoverseg to recover the segment.
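The four cases can be summarized as a small decision table. The following is a minimal sketch of that logic only, not Greenplum's actual probing code; the names `Action`, `probe_result`, and `needs_gprecoverseg` are hypothetical and chosen for illustration.

```python
from enum import Enum


class Action(Enum):
    """What the Master does after probing one Primary/Mirror pair (illustrative)."""
    NONE = "both up: nothing to change"
    MARK_MIRROR_DOWN = "update Master metadata: Mirror down, Primary in ChangeTracking"
    FAILOVER = "update Master metadata and let the Mirror take over as Primary"
    GIVE_UP = "both down: stop after max retries, metadata left unchanged"


def probe_result(primary_up: bool, mirror_up: bool) -> Action:
    """Map the state of one pair to the action described in cases 1-4."""
    if primary_up:
        # The Mirror's state is learned from the Primary's reply.
        return Action.NONE if mirror_up else Action.MARK_MIRROR_DOWN
    # Primary unreachable: the Mirror is probed next.
    return Action.FAILOVER if mirror_up else Action.GIVE_UP


def needs_gprecoverseg(primary_up: bool, mirror_up: bool) -> bool:
    """Cases 2-4 leave the pair degraded, so recovery is needed afterwards."""
    return probe_result(primary_up, mirror_up) is not Action.NONE
```

For example, `probe_result(False, True)` yields `Action.FAILOVER`, which corresponds to case 3.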

Segment instances that are marked as failed are simply skipped when the database starts.

[gpadmin@mdw ~]$ gpstart
gpstart:mdw:gpadmin-[INFO]:-Starting gpstart with args:
gpstart:mdw:gpadmin-[INFO]:-Gathering information and validating the environment...
gpstart:mdw:gpadmin-[INFO]:-Greenplum Binary Version: 'postgres (Greenplum Database) 5.0.0 build 1'
......
gpstart:mdw:gpadmin-[INFO]:-Master Started...
gpstart:mdw:gpadmin-[INFO]:-Shutting down master
gpstart:mdw:gpadmin-[WARNING]:-Skipping startup of segment marked down in configuration: on sdw2 directory /data/gpdata/gpdatam/gpseg0 <<<<<
gpstart:mdw:gpadmin-[INFO]:---------------------------
gpstart:mdw:gpadmin-[INFO]:-Master instance parameters
gpstart:mdw:gpadmin-[INFO]:---------------------------
gpstart:mdw:gpadmin-[INFO]:-Database                 = template1
gpstart:mdw:gpadmin-[INFO]:-Master Port              = 1921
gpstart:mdw:gpadmin-[INFO]:-Master directory         = /data/gpdata/pgmaster/gpseg-1
gpstart:mdw:gpadmin-[INFO]:-Timeout                  = 600 seconds
gpstart:mdw:gpadmin-[INFO]:-Master standby           = Off
gpstart:mdw:gpadmin-[INFO]:---------------------------------------
gpstart:mdw:gpadmin-[INFO]:-Segment instances that will be started
gpstart:mdw:gpadmin-[INFO]:---------------------------------------
gpstart:mdw:gpadmin-[INFO]:-   Host   Datadir                       Port    Role
gpstart:mdw:gpadmin-[INFO]:-   sdw1   /data/gpdata/gpdatap/gpseg0   40000   Primary
gpstart:mdw:gpadmin-[INFO]:-   sdw2   /data/gpdata/gpdatap/gpseg1   40000   Primary
gpstart:mdw:gpadmin-[INFO]:-   sdw1   /data/gpdata/gpdatam/gpseg1   50000   Mirror
Continue with Greenplum instance startup Yy|Nn (default=N):
> y
gpstart:mdw:gpadmin-[INFO]:-Commencing parallel primary and mirror segment instance startup, please wait...
gpstart:mdw:gpadmin-[INFO]:-Process results...
gpstart:mdw:gpadmin-[INFO]:-----------------------------------------------------
gpstart:mdw:gpadmin-[INFO]:-   Successful segment starts                                            = 3
gpstart:mdw:gpadmin-[INFO]:-   Failed segment starts                                                = 0
gpstart:mdw:gpadmin-[WARNING]:-Skipped segment starts (segments are marked down in configuration)   = 1   <<<<<<<<
gpstart:mdw:gpadmin-[INFO]:-----------------------------------------------------
gpstart:mdw:gpadmin-[INFO]:-
gpstart:mdw:gpadmin-[INFO]:-Successfully started 3 of 3 segment instances, skipped 1 other segments
gpstart:mdw:gpadmin-[INFO]:-----------------------------------------------------
gpstart:mdw:gpadmin-[WARNING]:-****************************************************************************
gpstart:mdw:gpadmin-[WARNING]:-There are 1 segment(s) marked down in the database
gpstart:mdw:gpadmin-[WARNING]:-To recover from this current state, review usage of the gprecoverseg
gpstart:mdw:gpadmin-[WARNING]:-management utility which will recover failed segment instance databases.
gpstart:mdw:gpadmin-[WARNING]:-****************************************************************************
gpstart:mdw:gpadmin-[INFO]:-Starting Master instance mdw directory /data/gpdata/pgmaster/gpseg-1
gpstart:mdw:gpadmin-[INFO]:-Command pg_ctl reports Master mdw instance active
gpstart:mdw:gpadmin-[INFO]:-No standby master configured.  skipping...
gpstart:mdw:gpadmin-[WARNING]:-Number of segments not attempted to start: 1
gpstart:mdw:gpadmin-[INFO]:-Check status of database with gpstate utility
Check the startup status of the database's mirror segments
[gpadmin@mdw ~]$ gpstate -m
gpstate:mdw:gpadmin-[INFO]:-Starting gpstate with args: -m
gpstate:mdw:gpadmin-[INFO]:-local Greenplum Version: 'postgres (Greenplum Database) 5.0.0 build 1'
gpstate:mdw:gpadmin-[INFO]:-master Greenplum Version: 'PostgreSQL 8.3.23 (Greenplum Database 5.0.0 build 1)
gpstate:mdw:gpadmin-[INFO]:-Obtaining Segment details from master...
gpstate:mdw:gpadmin-[INFO]:--------------------------------------------------------------
gpstate:mdw:gpadmin-[INFO]:--Current GPDB mirror list and status
gpstate:mdw:gpadmin-[INFO]:--Type = Spread
gpstate:mdw:gpadmin-[INFO]:--------------------------------------------------------------
gpstate:mdw:gpadmin-[INFO]:-   Mirror   Datadir                       Port    Status    Data Status
gpstate:mdw:gpadmin-[WARNING]:-sdw2     /data/gpdata/gpdatam/gpseg0   50000   Failed                   <<<<<<<<
gpstate:mdw:gpadmin-[INFO]:-   sdw1     /data/gpdata/gpdatam/gpseg1   50000   Passive   Synchronized
gpstate:mdw:gpadmin-[INFO]:--------------------------------------------------------------
gpstate:mdw:gpadmin-[WARNING]:-1 segment(s) configured as mirror(s) have failed

The failed mirror is plainly visible in the line "[WARNING]:-sdw2 /data/gpdata/gpdatam/gpseg0 50000 Failed".
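When there are many segments, scanning this listing by eye gets tedious. The sketch below pulls the mirror rows out of captured `gpstate -m` text and filters the failed ones. It assumes the row layout shown in this article (host, data directory, port, status, optional data status after the `]:-` log prefix), which may differ in other Greenplum versions; `MirrorStatus`, `parse_mirror_rows`, and `SAMPLE` are hypothetical names.

```python
from typing import List, NamedTuple, Optional


class MirrorStatus(NamedTuple):
    host: str
    datadir: str
    port: int
    status: str
    data_status: Optional[str]


def parse_mirror_rows(output: str) -> List[MirrorStatus]:
    """Extract mirror rows from `gpstate -m` text (layout assumed from this article)."""
    rows = []
    for line in output.splitlines():
        # Everything after "]:-" is the message part of a log line.
        _, sep, message = line.partition("]:-")
        fields = message.split()
        # Data rows look like: host datadir port status [data status]
        if sep and len(fields) >= 4 and fields[1].startswith("/") and fields[2].isdigit():
            data_status = " ".join(fields[4:]) or None
            rows.append(MirrorStatus(fields[0], fields[1], int(fields[2]),
                                     fields[3], data_status))
    return rows


# Sample rows copied from the listing above (without the "<<<" highlight marker).
SAMPLE = """\
gpstate:mdw:gpadmin-[INFO]:-   Mirror   Datadir                       Port    Status    Data Status
gpstate:mdw:gpadmin-[WARNING]:-sdw2     /data/gpdata/gpdatam/gpseg0   50000   Failed
gpstate:mdw:gpadmin-[INFO]:-   sdw1     /data/gpdata/gpdatam/gpseg1   50000   Passive   Synchronized
"""

failed = [m for m in parse_mirror_rows(SAMPLE) if m.status == "Failed"]
for m in failed:
    print(f"failed mirror on {m.host}:{m.port} at {m.datadir}")
```

The same parser can be reused after recovery to wait until no row reports a data status other than Synchronized.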

How do we recover this mirror segment?

First, generate a recovery configuration file: gprecoverseg -o ./recov

---- A failed primary segment is recovered in the same way ----

[gpadmin@mdw ~]$ gprecoverseg -o ./recov
gprecoverseg:mdw:gpadmin-[INFO]:-Starting gprecoverseg with args: -o ./recov
gprecoverseg:mdw:gpadmin-[INFO]:-local Greenplum Version: 'postgres (Greenplum Database) 5.0.0 build 1'
gprecoverseg:mdw:gpadmin-[INFO]:-master Greenplum Version: 'PostgreSQL 8.3.23 (Greenplum Database 5.0.0 build 1) on x86_64-unknown-linux-gnu, compiled by GCC gcc (GCC) 4.4.2 compiled on
gprecoverseg:mdw:gpadmin-[INFO]:-Checking if segments are ready
gprecoverseg:mdw:gpadmin-[INFO]:-Obtaining Segment details from master...
gprecoverseg:mdw:gpadmin-[INFO]:-Obtaining Segment details from master...
gprecoverseg:mdw:gpadmin-[INFO]:-Configuration file output to ./recov successfully.
View the recovery configuration file

It shows which segments need to be recovered.

[gpadmin@mdw ~]$ cat recov
filespaceOrder=fastdisk
sdw2:50000:/data/gpdata/gpdatam/gpseg0
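To review such a file before passing it to `gprecoverseg -i`, it can be read programmatically. The sketch below assumes the layout seen in this article: a `filespaceOrder=` line followed by one `host:port:datadir` entry per failed segment. Anything after whitespace on an entry line is kept as a raw recovery target, and skipping `#` comment lines is an assumption. `FailedSegment` and `parse_recovery_config` are hypothetical names.

```python
from typing import List, NamedTuple, Optional, Tuple


class FailedSegment(NamedTuple):
    host: str
    port: int
    datadir: str
    target: Optional[str]  # raw "recover to another location" part, if present


def parse_recovery_config(text: str) -> Tuple[List[str], List[FailedSegment]]:
    """Return (filespace names, failed segments) from a gprecoverseg -o file."""
    filespaces: List[str] = []
    segments: List[FailedSegment] = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):  # assumption: comments may appear
            continue
        if line.startswith("filespaceOrder="):
            value = line.split("=", 1)[1]
            filespaces = [name for name in value.split(":") if name]
            continue
        parts = line.split(None, 1)
        host, port, datadir = parts[0].split(":", 2)
        target = parts[1] if len(parts) > 1 else None
        segments.append(FailedSegment(host, int(port), datadir, target))
    return filespaces, segments


# The file contents shown above.
RECOV = "filespaceOrder=fastdisk\nsdw2:50000:/data/gpdata/gpdatam/gpseg0\n"
filespaces, segments = parse_recovery_config(RECOV)
for seg in segments:
    print(f"will recover {seg.host}:{seg.port} in {seg.datadir}")
```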
Run the recovery using the configuration file:
[gpadmin@mdw ~]$ gprecoverseg -i ./recov
gprecoverseg:mdw:gpadmin-[INFO]:-Starting gprecoverseg with args: -i ./recov
gprecoverseg:mdw:gpadmin-[INFO]:-local Greenplum Version: 'postgres (Greenplum Database) 5.0.0 build 1'
gprecoverseg:mdw:gpadmin-[INFO]:-master Greenplum Version: 'PostgreSQL 8.3.23 (Greenplum Database 5.0.0 build 1) on x86_64-unknown-linux-gnu, compiled by GCC gcc (GCC) 4.4.2 compiled on
gprecoverseg:mdw:gpadmin-[INFO]:-Checking if segments are ready
gprecoverseg:mdw:gpadmin-[INFO]:-Obtaining Segment details from master...
gprecoverseg:mdw:gpadmin-[INFO]:-Obtaining Segment details from master...
gprecoverseg:mdw:gpadmin-[INFO]:-Greenplum instance recovery parameters
gprecoverseg:mdw:gpadmin-[INFO]:----------------------------------------------------------
gprecoverseg:mdw:gpadmin-[INFO]:-Recovery from configuration -i option supplied
gprecoverseg:mdw:gpadmin-[INFO]:----------------------------------------------------------
gprecoverseg:mdw:gpadmin-[INFO]:-Recovery 1 of 1
gprecoverseg:mdw:gpadmin-[INFO]:----------------------------------------------------------
gprecoverseg:mdw:gpadmin-[INFO]:-   Synchronization mode                          = Incremental
gprecoverseg:mdw:gpadmin-[INFO]:-   Failed instance host                          = sdw2
gprecoverseg:mdw:gpadmin-[INFO]:-   Failed instance address                       = sdw2
gprecoverseg:mdw:gpadmin-[INFO]:-   Failed instance directory                     = /data/gpdata/gpdatam/gpseg0
gprecoverseg:mdw:gpadmin-[INFO]:-   Failed instance port                          = 50000
gprecoverseg:mdw:gpadmin-[INFO]:-   Failed instance replication port              = 51000
gprecoverseg:mdw:gpadmin-[INFO]:-   Failed instance fastdisk directory            = /data/gpdata/seg1/pg_mir_cdr/gpseg0
gprecoverseg:mdw:gpadmin-[INFO]:-   Recovery Source instance host                 = sdw1
gprecoverseg:mdw:gpadmin-[INFO]:-   Recovery Source instance address              = sdw1
gprecoverseg:mdw:gpadmin-[INFO]:-   Recovery Source instance directory            = /data/gpdata/gpdatap/gpseg0
gprecoverseg:mdw:gpadmin-[INFO]:-   Recovery Source instance port                 = 40000
gprecoverseg:mdw:gpadmin-[INFO]:-   Recovery Source instance replication port     = 41000
gprecoverseg:mdw:gpadmin-[INFO]:-   Recovery Source instance fastdisk directory   = /data/gpdata/seg1/pg_pri_cdr/gpseg0
gprecoverseg:mdw:gpadmin-[INFO]:-   Recovery Target                               = in-place
gprecoverseg:mdw:gpadmin-[INFO]:-Process results...
gprecoverseg:mdw:gpadmin-[INFO]:-Done updating primaries
gprecoverseg:mdw:gpadmin-[INFO]:-******************************************************************
gprecoverseg:mdw:gpadmin-[INFO]:-Updating segments for resynchronization is completed.
gprecoverseg:mdw:gpadmin-[INFO]:-For segments updated successfully, resynchronization will continue in the background.
gprecoverseg:mdw:gpadmin-[INFO]:-
gprecoverseg:mdw:gpadmin-[INFO]:-Use  gpstate -s  to check the resynchronization progress.
gprecoverseg:mdw:gpadmin-[INFO]:-******************************************************************
Check the recovery status
[gpadmin@mdw ~]$ gpstate -m
gpstate:mdw:gpadmin-[INFO]:-Starting gpstate with args: -m
gpstate:mdw:gpadmin-[INFO]:-local Greenplum Version: 'postgres (Greenplum Database) 5.0.0 build 1'
......
gpstate:mdw:gpadmin-[INFO]:-Obtaining Segment details from master...
gpstate:mdw:gpadmin-[INFO]:--------------------------------------------------------------
gpstate:mdw:gpadmin-[INFO]:--Current GPDB mirror list and status
gpstate:mdw:gpadmin-[INFO]:--Type = Spread
gpstate:mdw:gpadmin-[INFO]:--------------------------------------------------------------
gpstate:mdw:gpadmin-[INFO]:-   Mirror   Datadir                       Port    Status    Data Status
gpstate:mdw:gpadmin-[INFO]:-   sdw2     /data/gpdata/gpdatam/gpseg0   50000   Passive   Resynchronizing
gpstate:mdw:gpadmin-[INFO]:-   sdw1     /data/gpdata/gpdatam/gpseg1   50000   Passive   Synchronized
gpstate:mdw:gpadmin-[INFO]:--------------------------------------------------------------
Switching the primary and mirror roles back

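The primary/mirror pair is now restored, but one optional step remains: switching the primary and mirror roles back. After a failover, the segments' current roles are the reverse of their preferred roles. To switch them back, use the command below. Note that it interrupts service while it runs (in-progress queries are cancelled and rolled back), so schedule it accordingly.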

gprecoverseg -r

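gprecoverseg is the utility used to repair Segments. It is fairly simple to use; its main options are:
-i: The main option. Specifies a configuration file that describes the Segments to repair and where they should be placed after the repair.
-F: Optional. When specified, gprecoverseg deletes the instances named in the "-i" file (or the instances marked "d") and copies a complete copy from the live Mirror to the target location.
-r: When FTS finds a Primary down and performs a failover, the Mirror that took over as Primary does not switch back right after gprecoverseg finishes the repair. Some hosts then end up with too many active Segments, which creates a performance bottleneck. The Segments therefore need to be returned to their original roles, which is called re-balance.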
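Whether a re-balance is needed can be judged by comparing each segment's current role with its preferred role; both are exposed as the `role` and `preferred_role` columns of the `gp_segment_configuration` catalog table. The sketch below works on rows already fetched from that table. The sample data is hypothetical (a cluster like the one in this article after a primary on sdw1 failed over to sdw2), and `Segment`, `unbalanced`, and `active_primaries_per_host` are illustrative names.

```python
from collections import Counter
from typing import Dict, List, NamedTuple


class Segment(NamedTuple):
    content: int
    hostname: str
    role: str            # current role: "p" (primary) or "m" (mirror)
    preferred_role: str  # role assigned when the system was initialized


def unbalanced(segments: List[Segment]) -> List[Segment]:
    """Segments whose current role differs from their preferred role."""
    return [s for s in segments if s.role != s.preferred_role]


def active_primaries_per_host(segments: List[Segment]) -> Dict[str, int]:
    """Count acting primaries per host to spot overloaded hosts."""
    return dict(Counter(s.hostname for s in segments if s.role == "p"))


# Hypothetical rows: content 0 failed over from sdw1 to sdw2 and was then recovered.
SEGMENTS = [
    Segment(0, "sdw1", "m", "p"),
    Segment(0, "sdw2", "p", "m"),
    Segment(1, "sdw2", "p", "p"),
    Segment(1, "sdw1", "m", "m"),
]

print(active_primaries_per_host(SEGMENTS))  # sdw2 carries both acting primaries
print([(s.content, s.hostname) for s in unbalanced(SEGMENTS)])
```

In this sample, sdw2 hosts both acting primaries while sdw1 hosts none, which is the situation `gprecoverseg -r` is meant to correct.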