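PostgreSQL Source Code Walkthrough (130) - MVCC #14 (the vacuum process: the lazy_scan_heap function)
This section briefly covers how PostgreSQL processes a manually issued VACUUM. It focuses on the implementation of lazy_scan_heap, which is reached through the call chain ExecVacuum->vacuum->vacuum_rel->heap_vacuum_rel->lazy_scan_heap. The function scans an open heap relation and prunes every page in the heap.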
I. Data Structures
Macros and type definitions
Options for the VACUUM and ANALYZE commands
/* ----------------------
 *		Vacuum and Analyze Statements
 *
 * Even though these are nominally two statements, it's convenient to use
 * just one node type for both.  Note that at least one of VACOPT_VACUUM
 * and VACOPT_ANALYZE must be set in options.
 * ----------------------
 */
typedef enum VacuumOption
{
	VACOPT_VACUUM = 1 << 0,		/* do VACUUM */
	VACOPT_ANALYZE = 1 << 1,	/* do ANALYZE */
	VACOPT_VERBOSE = 1 << 2,	/* print progress info */
	VACOPT_FREEZE = 1 << 3,		/* FREEZE option */
	VACOPT_FULL = 1 << 4,		/* FULL (non-concurrent) vacuum */
	VACOPT_SKIP_LOCKED = 1 << 5,	/* skip if cannot get lock */
	VACOPT_SKIPTOAST = 1 << 6,	/* don't process the TOAST table, if any */
	VACOPT_DISABLE_PAGE_SKIPPING = 1 << 7	/* don't skip any pages */
} VacuumOption;
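Because each option occupies its own bit, the options argument is an ordinary int bitmask and individual options are tested with a bitwise AND, as lazy_scan_heap does with VACOPT_DISABLE_PAGE_SKIPPING. The following is a minimal standalone sketch of that idiom. The enum is copied from above so the example compiles on its own; the helpers vacopt_is_set and vacopt_is_valid are hypothetical names introduced only for illustration and are not PostgreSQL functions.

```c
#include <assert.h>
#include <stdbool.h>

/* Copy of the enum shown above, so this sketch is self-contained. */
typedef enum VacuumOption
{
	VACOPT_VACUUM = 1 << 0,
	VACOPT_ANALYZE = 1 << 1,
	VACOPT_VERBOSE = 1 << 2,
	VACOPT_FREEZE = 1 << 3,
	VACOPT_FULL = 1 << 4,
	VACOPT_SKIP_LOCKED = 1 << 5,
	VACOPT_SKIPTOAST = 1 << 6,
	VACOPT_DISABLE_PAGE_SKIPPING = 1 << 7
} VacuumOption;

/* Hypothetical helper: is the given flag set in the options bitmask? */
bool
vacopt_is_set(int options, VacuumOption flag)
{
	return (options & flag) != 0;
}

/*
 * Hypothetical helper mirroring the rule in the comment above:
 * at least one of VACOPT_VACUUM and VACOPT_ANALYZE must be set.
 */
bool
vacopt_is_valid(int options)
{
	return (options & (VACOPT_VACUUM | VACOPT_ANALYZE)) != 0;
}
```

For example, a bitmask of VACOPT_VACUUM | VACOPT_VERBOSE has VACOPT_VERBOSE set and is valid, whereas VACOPT_VERBOSE alone is not a valid combination under the stated rule.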
xl_heap_freeze_tuple
This struct represents a "freeze plan": the information needed to freeze a single tuple during vacuum.
/*
 * This struct represents a 'freeze plan', which is what we need to know about
 * a single tuple being frozen during vacuum.
 */
/* 0x01 was XLH_FREEZE_XMIN */
#define		XLH_FREEZE_XVAC		0x02
#define		XLH_INVALID_XVAC	0x04

typedef struct xl_heap_freeze_tuple
{
	TransactionId xmax;
	OffsetNumber offset;
	uint16		t_infomask2;
	uint16		t_infomask;
	uint8		frzflags;
} xl_heap_freeze_tuple;
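lazy_scan_heap uses this struct in two phases: heap_prepare_freeze_tuple fills in one plan per tuple that needs freezing, and the plans are applied afterwards in a single pass over the frozen array. The sketch below models only the second phase on a toy page. It is a simplified illustration under stated assumptions: ToyTupleHeader, toy_execute_freeze and toy_execute_all are hypothetical names, the typedefs are stand-ins for PostgreSQL's own, and the frzflags handling of the real code is omitted.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in typedefs (assumptions; PostgreSQL defines its own versions). */
typedef uint32_t TransactionId;
typedef uint16_t OffsetNumber;
typedef uint16_t uint16;
typedef uint8_t uint8;

/* Same layout as the struct shown above. */
typedef struct xl_heap_freeze_tuple
{
	TransactionId xmax;
	OffsetNumber offset;
	uint16		t_infomask2;
	uint16		t_infomask;
	uint8		frzflags;
} xl_heap_freeze_tuple;

/* Toy tuple header: only the fields this sketch overwrites. */
typedef struct ToyTupleHeader
{
	TransactionId xmax;
	uint16		t_infomask2;
	uint16		t_infomask;
} ToyTupleHeader;

/* Apply one previously prepared plan to one tuple header. */
void
toy_execute_freeze(ToyTupleHeader *tup, const xl_heap_freeze_tuple *frz)
{
	tup->xmax = frz->xmax;
	tup->t_infomask = frz->t_infomask;
	tup->t_infomask2 = frz->t_infomask2;
}

/*
 * Apply nfrozen plans to a page modelled as an array of headers.
 * Offsets are 1-based, so offset N refers to page[N - 1].
 */
void
toy_execute_all(ToyTupleHeader *page, const xl_heap_freeze_tuple *frozen,
				int nfrozen)
{
	for (int i = 0; i < nfrozen; i++)
		toy_execute_freeze(&page[frozen[i].offset - 1], &frozen[i]);
}
```

Separating planning from execution means the page is modified only once all decisions for it are known, which is what lets the real code wrap the modification and its WAL record in one critical section.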
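II. Source Code Walkthrough
lazy_scan_heap scans an open heap relation and prunes every page in the heap. Specifically, it:
1. Truncates dead tuples to dead line pointers
2. Defragments the page
3. Sets commit status bits (see heap_page_prune)
4. Builds a list of dead tuples and a list of pages with free space
5. Computes statistics on the number of live tuples in the heap and marks pages all-visible where appropriate
6. When done, or when space for dead-tuple TIDs runs low, vacuums the indexes and calls lazy_vacuum_heap to reclaim dead line pointers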
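The trigger for an intermediate index and heap vacuum cycle is a simple space check on the dead-tuple TID array: a cycle starts when fewer free slots remain than one page could fill, provided something has already been collected. The sketch below isolates that condition from the function body. ToyDeadTupleStore and toy_needs_vacuum_cycle are hypothetical names, and the constant 291 is an assumption that matches MaxHeapTuplesPerPage only for the default 8 kB block size.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed value: MaxHeapTuplesPerPage with the default 8 kB block size. */
#define TOY_MAX_HEAP_TUPLES_PER_PAGE 291

/* Hypothetical stand-in for the two LVRelStats fields involved. */
typedef struct ToyDeadTupleStore
{
	int			max_dead_tuples;	/* capacity of the TID array */
	int			num_dead_tuples;	/* TIDs collected so far */
} ToyDeadTupleStore;

/*
 * True if the remaining space might not hold all the dead TIDs of the
 * next page, and there is at least one TID to clean up.
 */
bool
toy_needs_vacuum_cycle(const ToyDeadTupleStore *s)
{
	return (s->max_dead_tuples - s->num_dead_tuples) < TOY_MAX_HEAP_TUPLES_PER_PAGE
		&& s->num_dead_tuples > 0;
}
```

The second half of the condition matters when the array is tiny: with an empty array there is nothing to vacuum, so the scan proceeds even if the capacity is below one page's worth of tuples.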
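The processing flow is as follows:
1. Initialize the relevant variables
2. Get the total number of blocks (nblocks)
3. Initialize the statistics and the related arrays (vacrelstats/frozen)
4. Compute the next block that cannot be skipped (next_unskippable_block)
5. Loop over every block
5.1 If the current block is next_unskippable_block, compute the next block that cannot be skipped;
otherwise, if skipping_blocks is true and a page check is not being forced, skip to the next block
5.2 If the space available for dead-tuple TIDs is about to run out, run a vacuum cycle before processing this page
5.2.1 Loop over the index relations and call lazy_vacuum_index on each of them
5.2.2 Call lazy_vacuum_heap to remove the dead tuples from the heap relation
5.2.3 Reset the vacrelstats->num_dead_tuples counter to 0
5.2.4 Vacuum the FSM so that the newly freed space becomes visible on the upper-level FSM pages
5.3 Read the block into a buffer with ReadBufferExtended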
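5.4 Try to acquire the buffer cleanup lock without waiting; if that fails:
A. If aggressive is false and a page check is not being forced, release the buffer, increment pinskipped_pages, and move on to the next block;
B. Otherwise (aggressive is true, or a page check is being forced), take a share lock and call lazy_check_needs_freeze; if no tuple needs freezing, update the statistics and skip the block;
C. If some tuple does need freezing but aggressive is false (that is, we are here only because the page check is forced), update the statistics without advancing scanned_pages and skip the block;
D. Otherwise, call LockBufferForCleanup to wait for the cleanup lock and fall through to normal processing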
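5.5 If the page is new, handle it (initialize it if it is still uninitialized after re-locking, mark the buffer dirty, record its free space) and continue with the next block;
5.6 If the page is empty, handle it (set the all-visible and all-frozen flags if needed, record its free space) and continue with the next block;
5.7 Call heap_page_prune to prune all HOT-update chains on the page
5.8 Loop over the line pointers on the page
5.8.1 The line pointer is unused: count it and continue with the next one
5.8.2 The line pointer is a redirect: set hastup to true and continue with the next one
5.8.3 The line pointer is dead: call lazy_record_dead_tuple to record the TID to be removed, set all_visible to false, and continue with the next one
5.8.4 Fill in the tuple variable (t_data, t_len, t_tableOid)
5.8.5 Call HeapTupleSatisfiesVacuum to determine the tuple's state and update the flags and counters accordingly
5.8.6 If tupgone is true, record the tuple for removal; otherwise call heap_prepare_freeze_tuple to decide whether it needs freezing, and if so record its offset
5.9 If the number of freeze plans is greater than 0, walk the tuples to be frozen and execute the freezes; if the relation needs WAL, write a WAL record
5.10 If the relation has no indexes and dead tuples were recorded, vacuum the page immediately, so that no second heap pass is needed
5.11 Synchronize the visibility map using the all_visible and all_visible_according_to_vm flags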
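6. After the loop ends, free the frozen array
7. Update the statistics
8. Run a final vacuum cycle for the last batch of dead tuples
9. Vacuum the remainder of the FSM
10. Do the post-vacuum cleanup and update the statistics for each index
11. Write the log message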
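Steps 4 and 5.1 of the outline above are the least obvious part of the flow: a block is skipped only when it lies inside a run of skippable blocks longer than SKIP_PAGES_THRESHOLD. The standalone simulation below reproduces just that bookkeeping. It is a sketch under stated assumptions: the visibility map is modelled as a plain bool array in which true means "skippable for this kind of scan", the forced check of the last page (FORCE_CHECK_PAGE) is deliberately left out, and toy_next_unskippable and toy_count_scanned are hypothetical helper names.

```c
#include <assert.h>
#include <stdbool.h>

/* Same threshold as vacuumlazy.c uses. */
#define SKIP_PAGES_THRESHOLD 32

/* First block >= start that is not skippable, or nblocks if there is none. */
unsigned
toy_next_unskippable(const bool *skippable, unsigned nblocks, unsigned start)
{
	unsigned	b = start;

	while (b < nblocks && skippable[b])
		b++;
	return b;
}

/* Count how many blocks the scan would actually read. */
unsigned
toy_count_scanned(const bool *skippable, unsigned nblocks)
{
	unsigned	next_unskippable = toy_next_unskippable(skippable, nblocks, 0);
	bool		skipping = next_unskippable >= SKIP_PAGES_THRESHOLD;
	unsigned	scanned = 0;

	for (unsigned blkno = 0; blkno < nblocks; blkno++)
	{
		if (blkno == next_unskippable)
		{
			/* This block must be read; look ahead for the next one. */
			next_unskippable = toy_next_unskippable(skippable, nblocks,
													 blkno + 1);
			skipping = (next_unskippable - blkno) > SKIP_PAGES_THRESHOLD;
		}
		else if (skipping)
			continue;			/* inside a long enough skippable run */

		scanned++;
	}
	return scanned;
}
```

In this model, a table of 10 skippable blocks is read in full because the run is shorter than the threshold, while a table of 100 skippable blocks is skipped entirely. The real function would still read the last page when truncation is worth attempting.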
/* * lazy_scan_heap() -- scan an open heap relation * lazy_scan_heap() -- 扫描已打开的heap relation * * This routine prunes each page in the heap, which will among other * things truncate dead tuples to dead line pointers, defragment the * page, and set commit status bits (see heap_page_prune). It also builds * lists of dead tuples and pages with free space, calculates statistics * on the number of live tuples in the heap, and marks pages as * all-visible if appropriate. When done, or when we run low on space for * dead-tuple TIDs, invoke vacuuming of indexes and call lazy_vacuum_heap * to reclaim dead line pointers. * 这个例程将清理堆中的每个页面, * 其中包括将DEAD元组截断为DEAD行指针、整理页面碎片和设置提交状态位(参见heap_page_prune)。 * 它还构建具有空闲空间的DEAD元组和页链表, * 计算堆中存活元组数量的统计信息,并在适当的情况下将页标记为all-visible。 * 当完成时,或者当DEAD元组TIDs的空间不足时, * 执行index vacuuming并调用lazy_vacuum_heap来回收DEAD行指针。 * If there are no indexes then we can reclaim line pointers on the fly; * dead line pointers need only be retained until all index pointers that * reference them have been killed. * 如果没有索引,那么我们可以动态地回收行指针; * DEAD行指针需要保留到引用它们的所有索引指针都被清理为止。 */static voidlazy_scan_heap(Relation onerel, int options, LVRelStats *vacrelstats, Relation *Irel, int nindexes, bool aggressive){ BlockNumber nblocks,//块数 blkno;//块号 HeapTupleData tuple;//元组 char *relname;//关系名称 TransactionId relfrozenxid = onerel->rd_rel->relfrozenxid;//冻结的XID TransactionId relminmxid = onerel->rd_rel->relminmxid;//最新的mxid BlockNumber empty_pages,//空页数 vacuumed_pages,//已被vacuum数 next_fsm_block_to_vacuum;//块号 //未被清理的元组数/仍存活的元组数(估算)/通过vacuum清理的元组数/DEAD但未被清理的元组数/未使用的行指针 double num_tuples, /* total number of nonremovable tuples */ live_tuples, /* live tuples (reltuples estimate) */ tups_vacuumed, /* tuples cleaned up by vacuum */ nkeep, /* dead-but-not-removable tuples */ nunused; /* unused item pointers */ IndexBulkDeleteResult **indstats; int i;//临时变量 PGRUsage ru0; Buffer vmbuffer = InvalidBuffer;//buffer BlockNumber next_unskippable_block;//block number bool skipping_blocks;//是否跳过block? 
xl_heap_freeze_tuple *frozen;//冻结元组数组 StringInfoData buf; const int initprog_index[] = { PROGRESS_VACUUM_PHASE, PROGRESS_VACUUM_TOTAL_HEAP_BLKS, PROGRESS_VACUUM_MAX_DEAD_TUPLES }; int64 initprog_val[3]; //初始化PGRUsage变量 pg_rusage_init(&ru0); //获取关系名称 relname = RelationGetRelationName(onerel); //记录操作日志 if (aggressive) ereport(elevel, (errmsg("aggressively vacuuming \"%s.%s\"", get_namespace_name(RelationGetNamespace(onerel)), relname))); else ereport(elevel, (errmsg("vacuuming \"%s.%s\"", get_namespace_name(RelationGetNamespace(onerel)), relname))); //初始化变量 empty_pages = vacuumed_pages = 0; next_fsm_block_to_vacuum = (BlockNumber) 0; num_tuples = live_tuples = tups_vacuumed = nkeep = nunused = 0; indstats = (IndexBulkDeleteResult **) palloc0(nindexes * sizeof(IndexBulkDeleteResult *)); //获取该relation总的块数 nblocks = RelationGetNumberOfBlocks(onerel); //初始化统计信息 vacrelstats->rel_pages = nblocks; vacrelstats->scanned_pages = 0; vacrelstats->tupcount_pages = 0; vacrelstats->nonempty_pages = 0; vacrelstats->latestRemovedXid = InvalidTransactionId; //每个block都进行单独记录 lazy_space_alloc(vacrelstats, nblocks); //为frozen数组分配内存空间 frozen = palloc(sizeof(xl_heap_freeze_tuple) * MaxHeapTuplesPerPage); /* Report that we're scanning the heap, advertising total # of blocks */ //报告正在扫描heap,并广播总的块数 //PROGRESS_VACUUM_PHASE_SCAN_HEAP状态 initprog_val[0] = PROGRESS_VACUUM_PHASE_SCAN_HEAP; initprog_val[1] = nblocks;//总块数 initprog_val[2] = vacrelstats->max_dead_tuples;//最大废弃元组数 pgstat_progress_update_multi_param(3, initprog_index, initprog_val); /* * Except when aggressive is set, we want to skip pages that are * all-visible according to the visibility map, but only when we can skip * at least SKIP_PAGES_THRESHOLD consecutive pages. Since we're reading * sequentially, the OS should be doing readahead for us, so there's no * gain in skipping a page now and then; that's likely to disable * readahead and so be counterproductive. 
Also, skipping even a single * page means that we can't update relfrozenxid, so we only want to do it * if we can skip a goodly number of pages. * 除非设置了aggressive,否则我们希望跳过根据vm确定的全部可见页面, * 但只有当我们可以跳过至少SKIP_PAGES_THRESHOLD个连续页面时才可以。 * 因为我们是按顺序读取的,所以操作系统应该为我们提前读取, * 所以时不时地跳过一个页面是没有好处的;这可能会禁用readahead,从而产生反效果。 * 而且,即使跳过一个页面,也意味着我们无法更新relfrozenxid,所以我们只希望跳过相当多的页面。 * * When aggressive is set, we can't skip pages just because they are * all-visible, but we can still skip pages that are all-frozen, since * such pages do not need freezing and do not affect the value that we can * safely set for relfrozenxid or relminmxid. * 当设置了aggressive(T),我们不能仅仅因为页面都是可见的就跳过它们, * 但是我们仍然可以跳过全部冻结的页面,因为这些页面不需要冻结, * 并且不影响我们可以安全地为relfrozenxid或relminmxid设置新值。 * * Before entering the main loop, establish the invariant that * next_unskippable_block is the next block number >= blkno that we can't * skip based on the visibility map, either all-visible for a regular scan * or all-frozen for an aggressive scan. We set it to nblocks if there's * no such block. We also set up the skipping_blocks flag correctly at * this stage. * 在进入主循环之前,建立一个不变式,即next_unskippable_block: the next block number >= blkno, * 那么我们不能基于vm跳过它,对于常规扫描是全可见的,对于主动扫描是全冻结的。 * 如果不存在这样的block,那么我们就设它为nblocks。 * 同时,我们还在这个阶段正确地设置了skipping_blocks标志。 * * Note: The value returned by visibilitymap_get_status could be slightly * out-of-date, since we make this test before reading the corresponding * heap page or locking the buffer. This is OK. If we mistakenly think * that the page is all-visible or all-frozen when in fact the flag's just * been cleared, we might fail to vacuum the page. It's easy to see that * skipping a page when aggressive is not set is not a very big deal; we * might leave some dead tuples lying around, but the next vacuum will * find them. But even when aggressive *is* set, it's still OK if we miss * a page whose all-frozen marking has just been cleared. 
Any new XIDs * just added to that page are necessarily newer than the GlobalXmin we * computed, so they'll have no effect on the value to which we can safely * set relfrozenxid. A similar argument applies for MXIDs and relminmxid. * 注意:visibilitymap_get_status返回的值可能有点过时, * 因为我们在读取相应的堆页面或锁定缓冲区之前进行了测试。这没有什么问题。 * 如果我们错误地认为页面是全部可见或全部冻结, * 而实际上刚刚清除了标志,那么我们可能无法执行vacuum。 * 显而易见,在没有设置aggressive的情况下跳过一个页面并不是什么大问题; * 我们可能会留下一些DEAD元组,但是下一个vacuum会找到它们。 * 但是,即使设置了aggressive,如果我们错过了刚刚清除了所有冻结标记的页面,也没关系。 * 刚刚添加到该页面的任何新xid都必须比我们计算的GlobalXmin更新, * 因此它们不会影响我们安全地设置relfrozenxid的值。 * 类似的观点也适用于mxid和relminmxid。 * * We will scan the table's last page, at least to the extent of * determining whether it has tuples or not, even if it should be skipped * according to the above rules; except when we've already determined that * it's not worth trying to truncate the table. This avoids having * lazy_truncate_heap() take access-exclusive lock on the table to attempt * a truncation that just fails immediately because there are tuples in * the last page. This is worth avoiding mainly because such a lock must * be replayed on any hot standby, where it can be disruptive. 
	 */
	/* first block that cannot be skipped; the search starts at block 0 */
	next_unskippable_block = 0;
	if ((options & VACOPT_DISABLE_PAGE_SKIPPING) == 0)
	{
		/* page skipping has not been disabled */
		while (next_unskippable_block < nblocks)
		{
			uint8		vmstatus;	/* visibility map status bits */

			vmstatus = visibilitymap_get_status(onerel, next_unskippable_block,
												&vmbuffer);
			if (aggressive)
			{
				if ((vmstatus & VISIBILITYMAP_ALL_FROZEN) == 0)
					break;		/* block is not all-frozen: cannot skip it */
			}
			else
			{
				if ((vmstatus & VISIBILITYMAP_ALL_VISIBLE) == 0)
					break;		/* block is not all-visible: cannot skip it */
			}
			vacuum_delay_point();
			next_unskippable_block++;
		}
	}

	if (next_unskippable_block >= SKIP_PAGES_THRESHOLD)
		skipping_blocks = true;		/* the skippable run reaches the threshold */
	else
		skipping_blocks = false;

	for (blkno = 0; blkno < nblocks; blkno++)
	{
		/* process each block in turn */
		Buffer		buf;			/* buffer holding the current block */
		Page		page;			/* the page itself */
		OffsetNumber offnum,		/* current line pointer offset */
					maxoff;
		bool		tupgone,
					hastup;
		int			prev_dead_count;	/* dead-tuple count before this page */
		int			nfrozen;		/* freeze plans collected on this page */
		Size		freespace;		/* free space on the page */
		bool		all_visible_according_to_vm = false;	/* all-visible per the VM? */
		bool		all_visible;	/* all-visible per our own scan? */
		bool		all_frozen = true;	/* provided all_visible is also true */
		bool		has_dead_tuples;	/* does the page contain dead tuples? */
TransactionId visibility_cutoff_xid = InvalidTransactionId;//事务ID /* see note above about forcing scanning of last page */ //请查看上述关于最后一个page的强制扫描注释 //全部扫描&尝试截断#define FORCE_CHECK_PAGE() \ (blkno == nblocks - 1 && should_attempt_truncation(vacrelstats)) //更新统计信息 pgstat_progress_update_param(PROGRESS_VACUUM_HEAP_BLKS_SCANNED, blkno); if (blkno == next_unskippable_block) { //到达了next_unskippable_block标记的地方 /* Time to advance next_unskippable_block */ //是时候增加next_unskippable_block计数了 next_unskippable_block++; //寻找下一个需跳过的block if ((options & VACOPT_DISABLE_PAGE_SKIPPING) == 0) { while (next_unskippable_block < nblocks) { uint8 vmskipflags; vmskipflags = visibilitymap_get_status(onerel, next_unskippable_block, &vmbuffer); if (aggressive) { if ((vmskipflags & VISIBILITYMAP_ALL_FROZEN) == 0) break; } else { if ((vmskipflags & VISIBILITYMAP_ALL_VISIBLE) == 0) break; } vacuum_delay_point(); next_unskippable_block++; } } /* * We know we can't skip the current block. But set up * skipping_blocks to do the right thing at the following blocks. * 不能跳过当前block. * 但设置skipping_blocks标记处理接下来的blocks */ if (next_unskippable_block - blkno > SKIP_PAGES_THRESHOLD) skipping_blocks = true; else skipping_blocks = false; /* * Normally, the fact that we can't skip this block must mean that * it's not all-visible. But in an aggressive vacuum we know only * that it's not all-frozen, so it might still be all-visible. * 通常,我们不能跳过这个块的事实一定意味着它不是完全可见的。 * 但在一个aggressive vacuum中,我们只知道它不是完全冻结的,所以它可能仍然是完全可见的。 */ if (aggressive && VM_ALL_VISIBLE(onerel, blkno, &vmbuffer)) all_visible_according_to_vm = true; } else { //尚未到达next_unskippable_block标记的地方 /* * The current block is potentially skippable; if we've seen a * long enough run of skippable blocks to justify skipping it, and * we're not forced to check it, then go ahead and skip. * Otherwise, the page must be at least all-visible if not * all-frozen, so we can set all_visible_according_to_vm = true. 
* 当前块可能是可跳过的;如果我们已经看到了足够长的可跳过的块运行时间,则可以跳过它, * 并且如果我们不需要检查,那么就继续跳过它。 * 否则,页面必须至少全部可见(如果不是全部冻结的话), * 因此我们可以设置all_visible_according_to_vm = true。 */ if (skipping_blocks && !FORCE_CHECK_PAGE()) { /* * Tricky, tricky. If this is in aggressive vacuum, the page * must have been all-frozen at the time we checked whether it * was skippable, but it might not be any more. We must be * careful to count it as a skipped all-frozen page in that * case, or else we'll think we can't update relfrozenxid and * relminmxid. If it's not an aggressive vacuum, we don't * know whether it was all-frozen, so we have to recheck; but * in this case an approximate answer is OK. * 困难,棘手。如果这是在aggressive vacuum中, * 那么在我们检查页面是否可跳过时,页面肯定已经完全冻结,但现在可能不会了。 * 在这种情况下,我们必须小心地将其视为跳过的全部冻结页面, * 否则我们将认为无法更新relfrozenxid和relminmxid。 * 如果它不是一个aggressive vacuum,我们不知道它是否完全冻结了, * 所以我们必须重新检查;但在这种情况下,近似的答案是可以的。 */ if (aggressive || VM_ALL_FROZEN(onerel, blkno, &vmbuffer)) vacrelstats->frozenskipped_pages++;//完全冻结的page计数+1 continue;//跳到下一个block } all_visible_according_to_vm = true; } vacuum_delay_point(); /* * If we are close to overrunning the available space for dead-tuple * TIDs, pause and do a cycle of vacuuming before we tackle this page. * 如即将超出DEAD元组tid的可用空间,那么在处理此页面之前,暂停并执行一个vacuuming循环。 */ if ((vacrelstats->max_dead_tuples - vacrelstats->num_dead_tuples) < MaxHeapTuplesPerPage && vacrelstats->num_dead_tuples > 0) { //存在废弃的元组,而且: //MaxHeapTuplesPerPage + vacrelstats->num_dead_tuples > vacrelstats->max_dead_tuples const int hvp_index[] = { PROGRESS_VACUUM_PHASE, PROGRESS_VACUUM_NUM_INDEX_VACUUMS }; int64 hvp_val[2]; /* * Before beginning index vacuuming, we release any pin we may * hold on the visibility map page. This isn't necessary for * correctness, but we do it anyway to avoid holding the pin * across a lengthy, unrelated operation. * 在开始index vacuuming前,释放在vm page上持有的所有pin. 
* 这对于正确性并不是必需的,但是我们这样做是为了避免在一个冗长的、不相关的操作中持有pin。 */ if (BufferIsValid(vmbuffer)) { ReleaseBuffer(vmbuffer); vmbuffer = InvalidBuffer; } /* Log cleanup info before we touch indexes */ //在开始处理indexes前清除日志信息 vacuum_log_cleanup_info(onerel, vacrelstats); /* Report that we are now vacuuming indexes */ //正在清理vacuum indexes pgstat_progress_update_param(PROGRESS_VACUUM_PHASE, PROGRESS_VACUUM_PHASE_VACUUM_INDEX); /* Remove index entries */ //遍历index relation,执行vacuum //删除指向在vacrelstats->dead_tuples元组的索引条目,更新运行时统计信息 for (i = 0; i < nindexes; i++) lazy_vacuum_index(Irel[i], &indstats[i], vacrelstats); /* * Report that we are now vacuuming the heap. We also increase * the number of index scans here; note that by using * pgstat_progress_update_multi_param we can update both * parameters atomically. * 报告正在vacumming heap. * 这里会增加索引扫描,注意通过设置pgstat_progress_update_multi_param参数可以同时自动更新参数. */ hvp_val[0] = PROGRESS_VACUUM_PHASE_VACUUM_HEAP; hvp_val[1] = vacrelstats->num_index_scans + 1; pgstat_progress_update_multi_param(2, hvp_index, hvp_val); /* Remove tuples from heap */ //清理heap relation中的元组 lazy_vacuum_heap(onerel, vacrelstats); /* * Forget the now-vacuumed tuples, and press on, but be careful * not to reset latestRemovedXid since we want that value to be * valid. * 无需理会now-vacuumed元组, * 继续处理,但是要小心不要重置latestRemovedXid,因为我们希望该值是有效的。 */ vacrelstats->num_dead_tuples = 0;//重置计数 vacrelstats->num_index_scans++;//索引扫描次数+1 /* * Vacuum the Free Space Map to make newly-freed space visible on * upper-level FSM pages. Note we have not yet processed blkno. * Vacuum FSM以使新释放的空间再顶层FSM pages中可见. * 注意,我们还没有处理blkno。 */ FreeSpaceMapVacuumRange(onerel, next_fsm_block_to_vacuum, blkno); next_fsm_block_to_vacuum = blkno; /* Report that we are once again scanning the heap */ //报告再次扫描heap pgstat_progress_update_param(PROGRESS_VACUUM_PHASE, PROGRESS_VACUUM_PHASE_SCAN_HEAP); } /* * Pin the visibility map page in case we need to mark the page * all-visible. 
In most cases this will be very cheap, because we'll * already have the correct page pinned anyway. However, it's * possible that (a) next_unskippable_block is covered by a different * VM page than the current block or (b) we released our pin and did a * cycle of index vacuuming. * 如需要标记page为all-visible,则在内存中PIN VM. * 在大多数情况下,这个动作的成本很低,因为我们已经pinned page了. * 但是,有可能(a) next_unskippable_block被不同的VM page而不是当前block覆盖 * (b) 释放了pin并且执行了index vacuuming */ visibilitymap_pin(onerel, blkno, &vmbuffer); //以扩展方式读取buffer buf = ReadBufferExtended(onerel, MAIN_FORKNUM, blkno, RBM_NORMAL, vac_strategy); /* We need buffer cleanup lock so that we can prune HOT chains. */ //需要buffer cleanup lock以便清理HOT chains. //ConditionalLockBufferForCleanup - 跟LockBufferForCleanup类似,但不会等待锁的获取 if (!ConditionalLockBufferForCleanup(buf)) { //----------- 不能获取到锁 /* * If we're not performing an aggressive scan to guard against XID * wraparound, and we don't want to forcibly check the page, then * it's OK to skip vacuuming pages we get a lock conflict on. They * will be dealt with in some future vacuum. * 如果执行的不是aggressive扫描(用于避免XID wraparound),而且我们不希望强制检查页面, * 那么出现锁冲突跳过vacuuming pages也是可以接受的. * 这些page会在未来的vacuum中进行处理. */ if (!aggressive && !FORCE_CHECK_PAGE()) { //非aggressive扫描 && 不强制检查page //释放buffer,跳过pinned pages+1 ReleaseBuffer(buf); vacrelstats->pinskipped_pages++; continue; } /* * Read the page with share lock to see if any xids on it need to * be frozen. If not we just skip the page, after updating our * scan statistics. If there are some, we wait for cleanup lock. * 使用共享锁读取page,检查是否存在XIDs需要冻结. * 如无此需要,则更新扫描统计信息后跳过此page. * 如有此需要,则等待clean lock. * * We could defer the lock request further by remembering the page * and coming back to it later, or we could even register * ourselves for multiple buffers and then service whichever one * is received first. For now, this seems good enough. 
* 我们可以通过记住页面并稍后返回来进一步延迟锁请求, * 或者甚至可以为多个缓冲区注册,然后为最先接收到的缓冲区提供服务。 * * If we get here with aggressive false, then we're just forcibly * checking the page, and so we don't want to insist on getting * the lock; we only need to know if the page contains tuples, so * that we can update nonempty_pages correctly. It's convenient * to use lazy_check_needs_freeze() for both situations, though. * 如aggressive为F,那么强制执行page检查,这时候不希望一直持有锁, * 我们只需要知道page包含tuples以便可以正确的更新非空pages. * 对于这两种情况,都可以方便地使用lazy_check_needs_freeze()。 */ //共享方式锁定buffer LockBuffer(buf, BUFFER_LOCK_SHARE); //lazy_check_needs_freeze --> 扫描page检查是否存在元组需要清理以避免wraparound if (!lazy_check_needs_freeze(buf, &hastup)) { //不存在需要清理的tuples UnlockReleaseBuffer(buf); vacrelstats->scanned_pages++; vacrelstats->pinskipped_pages++; if (hastup) vacrelstats->nonempty_pages = blkno + 1; //跳过该block continue; } if (!aggressive) { /* * Here, we must not advance scanned_pages; that would amount * to claiming that the page contains no freezable tuples. * 在这里不需要增加scanned_pages,这相当于声明页面不包含可冻结的元组。 */ UnlockReleaseBuffer(buf); vacrelstats->pinskipped_pages++; if (hastup) vacrelstats->nonempty_pages = blkno + 1; continue; } LockBuffer(buf, BUFFER_LOCK_UNLOCK); LockBufferForCleanup(buf); /* drop through to normal processing */ } //更新统计信息 vacrelstats->scanned_pages++; vacrelstats->tupcount_pages++; //获取page page = BufferGetPage(buf); if (PageIsNew(page)) { //-------------- 新初始化的PAGE /* * An all-zeroes page could be left over if a backend extends the * relation but crashes before initializing the page. Reclaim such * pages for use. * 如果后台进程扩展了relation但在初始化页面前数据库崩溃,那么初始化(全0)的page可以一直保留. * 重新声明该页面可用即可. * * We have to be careful here because we could be looking at a * page that someone has just added to the relation and not yet * been able to initialize (see RelationGetBufferForTuple). To * protect against that, release the buffer lock, grab the * relation extension lock momentarily, and re-lock the buffer. 
If * the page is still uninitialized by then, it must be left over * from a crashed backend, and we can initialize it. * 在这里注意小心应对,我们可能正在搜索一个其他进程需要添加到relation的page, * 而且该page尚未初始化(详见RelationGetBufferForTuple). * 为了避免这种情况引起的问题,释放缓存锁,暂时获取关系扩展锁,并重新锁定缓冲. * 如果这时候page仍为初始化,那么该page肯定是一个崩溃的后台进程导致的, * 这时候我们可以初始化该page. * * We don't really need the relation lock when this is a new or * temp relation, but it's probably not worth the code space to * check that, since this surely isn't a critical path. * 对于新的或临时relation,这时候不需要获取relation锁, * 但是可能不值得花这么多代码来检查它,因为这肯定不是一个关键路径。 * * Note: the comparable code in vacuum.c need not worry because * it's got exclusive lock on the whole relation. * 注意:无需担心vacuum.c中的对比代码,因为代码并没有获取整个relation的独享锁. */ LockBuffer(buf, BUFFER_LOCK_UNLOCK); //ExclusiveLock锁定 LockRelationForExtension(onerel, ExclusiveLock); //ExclusiveLock释放 UnlockRelationForExtension(onerel, ExclusiveLock); //锁定buffer LockBufferForCleanup(buf); //再次判断page是否NEW if (PageIsNew(page)) { //page仍然是New的,那可以重新init该page了. ereport(WARNING, (errmsg("relation \"%s\" page %u is uninitialized --- fixing", relname, blkno))); PageInit(page, BufferGetPageSize(buf), 0); empty_pages++; } //获取空闲空间 freespace = PageGetHeapFreeSpace(page); //标记buffer为脏 MarkBufferDirty(buf); UnlockReleaseBuffer(buf); //标记page RecordPageWithFreeSpace(onerel, blkno, freespace); //下一个page continue; } if (PageIsEmpty(page)) { //----------------- 空PAGE empty_pages++; freespace = PageGetHeapFreeSpace(page); /* empty pages are always all-visible and all-frozen */ //空pages通常是all-visible和all-frozen的 if (!PageIsAllVisible(page)) { //Page不是all-Visible //处理之 START_CRIT_SECTION(); /* mark buffer dirty before writing a WAL record */ //写入WAL Record前标记该buffer为脏buffer MarkBufferDirty(buf); /* * It's possible that another backend has extended the heap, * initialized the page, and then failed to WAL-log the page * due to an ERROR. 
Since heap extension is not WAL-logged, * recovery might try to replay our record setting the page * all-visible and find that the page isn't initialized, which * will cause a PANIC. To prevent that, check whether the * page has been previously WAL-logged, and if not, do that * now. * 存在可能:另外一个后台进程已扩展了heap,并初始化了page,但记录日志失败. * 因为heap扩展是没有写日志的,恢复过程可能尝试回放我们的记录设置page * 为all-visible并发现该page并未初始化,这会导致PANIC. * 为了避免这种情况,检查page先前是否已记录日志,如没有,现在执行该操作. */ if (RelationNeedsWAL(onerel) && PageGetLSN(page) == InvalidXLogRecPtr) //如需要记录WAL Record但page的LSN非法,则记录日志 log_newpage_buffer(buf, true); //设置page的all-visible标记 PageSetAllVisible(page); //设置vm visibilitymap_set(onerel, blkno, buf, InvalidXLogRecPtr, vmbuffer, InvalidTransactionId, VISIBILITYMAP_ALL_VISIBLE | VISIBILITYMAP_ALL_FROZEN); END_CRIT_SECTION(); } UnlockReleaseBuffer(buf); RecordPageWithFreeSpace(onerel, blkno, freespace); //处理下一个block continue; } /* * Prune all HOT-update chains in this page. * 清理该page中的所有HOT-update链 * * We count tuples removed by the pruning step as removed by VACUUM. * 计算通过VACUUM的清理步骤清楚的tuples数量. */ tups_vacuumed += heap_page_prune(onerel, buf, OldestXmin, false, &vacrelstats->latestRemovedXid); /* * Now scan the page to collect vacuumable items and check for tuples * requiring freezing. * 现在,扫描page统计已清理的条目数并检查哪些tuples需要冻结. */ all_visible = true; has_dead_tuples = false; nfrozen = 0; hastup = false; prev_dead_count = vacrelstats->num_dead_tuples; maxoff = PageGetMaxOffsetNumber(page);//获取最大偏移 /* * Note: If you change anything in the loop below, also look at * heap_page_is_all_visible to see if that needs to be changed. * 注意:如果在下面的循环中修改了业务逻辑, * 需要检查heap_page_is_all_visible判断是否需要改变. */ for (offnum = FirstOffsetNumber; offnum <= maxoff; offnum = OffsetNumberNext(offnum)) { ItemId itemid; itemid = PageGetItemId(page, offnum); /* Unused items require no processing, but we count 'em */ //未使用的条目无需处理,但需要计数. 
if (!ItemIdIsUsed(itemid)) { //未被使用,跳过 nunused += 1; continue; } /* Redirect items mustn't be touched */ //重定向的条目不需要"接触". if (ItemIdIsRedirected(itemid)) { //重定向的ITEM //该page不能被截断 hastup = true; /* this page won't be truncatable */ continue; } //设置行指针 ItemPointerSet(&(tuple.t_self), blkno, offnum); /* * DEAD item pointers are to be vacuumed normally; but we don't * count them in tups_vacuumed, else we'd be double-counting (at * least in the common case where heap_page_prune() just freed up * a non-HOT tuple). * 废弃的行指针将被正常vacuumed. * 但我们不需要通过tups_vacuumed变量计数,否则会重复统计. * (起码在通常情况下,heap_page_prune()会释放non-HOT元组) */ if (ItemIdIsDead(itemid)) { //记录需删除的tuple //vacrelstats->dead_tuples[vacrelstats->num_dead_tuples] = *itemptr; //vacrelstats->num_dead_tuples++; lazy_record_dead_tuple(vacrelstats, &(tuple.t_self)); all_visible = false; continue; } Assert(ItemIdIsNormal(itemid)); //获取数据 tuple.t_data = (HeapTupleHeader) PageGetItem(page, itemid); tuple.t_len = ItemIdGetLength(itemid); tuple.t_tableOid = RelationGetRelid(onerel); tupgone = false; /* * The criteria for counting a tuple as live in this block need to * match what analyze.c's acquire_sample_rows() does, otherwise * VACUUM and ANALYZE may produce wildly different reltuples * values, e.g. when there are many recently-dead tuples. * 统计存活元组的计算策略需要与analyze.c中的acquire_sample_rows()逻辑匹配, * 否则的话,VACUUM/ANALYZE可能会产生差异很大的reltuples值, * 比如在出现非常多近期被废弃的元组的情况下. * * The logic here is a bit simpler than acquire_sample_rows(), as * VACUUM can't run inside a transaction block, which makes some * cases impossible (e.g. in-progress insert from the same * transaction). * 这里的逻辑比acquire_sample_rows()函数逻辑要简单许多, * 因为VACUUM不能在事务块内支线,这可以减少许多不必要的逻辑. */ //为VACUUM确定元组的状态. //在这里,主要目的是一个元组是否可能对所有正在运行中的事务可见. switch (HeapTupleSatisfiesVacuum(&tuple, OldestXmin, buf)) { case HEAPTUPLE_DEAD: /* * Ordinarily, DEAD tuples would have been removed by * heap_page_prune(), but it's possible that the tuple * state changed since heap_page_prune() looked. 
In * particular an INSERT_IN_PROGRESS tuple could have * changed to DEAD if the inserter aborted. So this * cannot be considered an error condition. * 通常来说,废弃的元组可能已通过heap_page_prune()函数清除, * 但在heap_page_prune()搜索的过程中元组的状态可能会出现变更. * 特别是,如果插入程序中止,INSERT_IN_PROGRESS元组可能已经变成DEAD。 * 所以这不能被认为是一个错误条件。 * * If the tuple is HOT-updated then it must only be * removed by a prune operation; so we keep it just as if * it were RECENTLY_DEAD. Also, if it's a heap-only * tuple, we choose to keep it, because it'll be a lot * cheaper to get rid of it in the next pruning pass than * to treat it like an indexed tuple. * 如果该tuple是HOT-updated,那么必须通过pruge操作清理. * 因此元组状态调整为RECENTLY_DEAD. * 同时,如果这是一个HOT,我们选择保留该tuple, * 因为在下一次清理中删除它要比现在像处理索引元组那样处理它成本要低得多。 * * If this were to happen for a tuple that actually needed * to be deleted, we'd be in trouble, because it'd * possibly leave a tuple below the relation's xmin * horizon alive. heap_prepare_freeze_tuple() is prepared * to detect that case and abort the transaction, * preventing corruption. * 如果这种情况发生在需要删除的元组上,我们就有麻烦了, * 因为它可能会使关系小于xmin的元组保持活动状态。 * heap_prepare_freeze_tuple()函数用于检测这种状态,并终止事务以避免出现崩溃. */ if (HeapTupleIsHotUpdated(&tuple) || HeapTupleIsHeapOnly(&tuple)) nkeep += 1; else //可以删除元组 tupgone = true; /* we can delete the tuple */ //存在dead tuple,设置all Visible标记为F all_visible = false; break; case HEAPTUPLE_LIVE: /* * Count it as live. Not only is this natural, but it's * also what acquire_sample_rows() does. * 存活元组计数. * 这不仅很自然,而且acquire_sample_rows()也是这样做的。 */ live_tuples += 1; /* * Is the tuple definitely visible to all transactions? * 元组对所有事务肯定可见吗? * * NB: Like with per-tuple hint bits, we can't set the * PD_ALL_VISIBLE flag if the inserter committed * asynchronously. See SetHintBits for more info. Check * that the tuple is hinted xmin-committed because of * that. * 注意:与per-tuple hint bits类似,如果异步提交,那么不能设置PD_ALL_VISIBLE标记. * 详见SetHintBits函数. * 因此需要检测该元组已标记为xmin-committed. 
*/ if (all_visible) { //all_visible = T TransactionId xmin; if (!HeapTupleHeaderXminCommitted(tuple.t_data)) { //xmin not committed,设置为F all_visible = false; break; } /* * The inserter definitely committed. But is it old * enough that everyone sees it as committed? * 插入器确实已经提交 * 但已足够老,其他进程都可以看到? */ xmin = HeapTupleHeaderGetXmin(tuple.t_data); if (!TransactionIdPrecedes(xmin, OldestXmin)) { //元组xmin比OldestXmin要小,则设置为F all_visible = false; break; } /* Track newest xmin on page. */ //跟踪page上最新的xmin //if (int32)(xmin > visibility_cutoff_xid) > 0,return T if (TransactionIdFollows(xmin, visibility_cutoff_xid)) visibility_cutoff_xid = xmin; } break; case HEAPTUPLE_RECENTLY_DEAD: /* * If tuple is recently deleted then we must not remove it * from relation. * 如元组是近期被删除的,那么不能从relation中删除这些元组. */ nkeep += 1; all_visible = false; break; case HEAPTUPLE_INSERT_IN_PROGRESS: /* * This is an expected case during concurrent vacuum. * 在并发vacuum期间这是可以预期的情况. * * We do not count these rows as live, because we expect * the inserting transaction to update the counters at * commit, and we assume that will happen only after we * report our results. This assumption is a bit shaky, * but it is what acquire_sample_rows() does, so be * consistent. * 不能统计这些元组为存活元组,因为我们期望插入事务在提交时更新计数器, * 同时我们假定只在报告了结果后才会发生. * 这个假设有点不可靠,但acquire_sample_rows()就是这么做的,所以要保持一致。 */ all_visible = false; break; case HEAPTUPLE_DELETE_IN_PROGRESS: /* This is an expected case during concurrent vacuum */ //在同步期间,这种情况可以预期 all_visible = false; /* * Count such rows as live. As above, we assume the * deleting transaction will commit and update the * counters after we report. * 这些行视为存活行. * 如上所述,我们假定删除事务会提交并在我们报告后更新计数器. */ live_tuples += 1; break; default: //没有其他状态了. 
elog(ERROR, "unexpected HeapTupleSatisfiesVacuum result"); break; } if (tupgone) { //记录需删除的tuple //vacrelstats->dead_tuples[vacrelstats->num_dead_tuples] = *itemptr; //vacrelstats->num_dead_tuples++; lazy_record_dead_tuple(vacrelstats, &(tuple.t_self)); HeapTupleHeaderAdvanceLatestRemovedXid(tuple.t_data, &vacrelstats->latestRemovedXid); tups_vacuumed += 1; has_dead_tuples = true; } else { bool tuple_totally_frozen;//所有都冻结标记 num_tuples += 1; hastup = true; /* * Each non-removable tuple must be checked to see if it needs * freezing. Note we already have exclusive buffer lock. * 每一个未清理的tuple必须检查看看是否需要冻结. * 注意我们已经持有了独占缓冲锁. */ if (heap_prepare_freeze_tuple(tuple.t_data, relfrozenxid, relminmxid, FreezeLimit, MultiXactCutoff, &frozen[nfrozen], &tuple_totally_frozen)) frozen[nfrozen++].offset = offnum; if (!tuple_totally_frozen) all_frozen = false; } } /* scan along page */ /* * If we froze any tuples, mark the buffer dirty, and write a WAL * record recording the changes. We must log the changes to be * crash-safe against future truncation of CLOG. * 如果冻结了所有的元组,标记缓冲为脏状态,写入WAL Record记录这些变化. * 必须记录这些变化以避免截断CLOG时出现崩溃导致数据丢失. */ if (nfrozen > 0) { //已冻结计数>0,执行相关处理 START_CRIT_SECTION(); //标记缓冲为脏 MarkBufferDirty(buf); /* execute collected freezes */ //执行冻结 for (i = 0; i < nfrozen; i++) { ItemId itemid; HeapTupleHeader htup; itemid = PageGetItemId(page, frozen[i].offset); htup = (HeapTupleHeader) PageGetItem(page, itemid); //执行冻结 heap_execute_freeze_tuple(htup, &frozen[i]); } /* Now WAL-log freezing if necessary */ //如需要,记录冻结日志 if (RelationNeedsWAL(onerel)) { XLogRecPtr recptr; recptr = log_heap_freeze(onerel, buf, FreezeLimit, frozen, nfrozen); PageSetLSN(page, recptr); } END_CRIT_SECTION(); } /* * If there are no indexes then we can vacuum the page right now * instead of doing a second scan. * 如果没有索引,那么现在执行vacuum page而不需要二次扫描. 
         */
        if (nindexes == 0 &&
            vacrelstats->num_dead_tuples > 0)
        {
            // no indexes and dead tuples exist: vacuum the page now
            /* Remove tuples from heap */
            lazy_vacuum_page(onerel, blkno, buf, 0, vacrelstats, &vmbuffer);
            has_dead_tuples = false;

            /*
             * Forget the now-vacuumed tuples, and press on, but be careful
             * not to reset latestRemovedXid since we want that value to be
             * valid.
             */
            vacrelstats->num_dead_tuples = 0;   // reset the counter
            vacuumed_pages++;                   // one more page vacuumed

            /*
             * Periodically do incremental FSM vacuuming to make newly-freed
             * space visible on upper FSM pages.  Note: although we've cleaned
             * the current block, we haven't yet updated its FSM entry (that
             * happens further down), so passing end == blkno is correct.
             */
            if (blkno - next_fsm_block_to_vacuum >= VACUUM_FSM_EVERY_PAGES)
            {
                // handle the accumulated range in one batch
                FreeSpaceMapVacuumRange(onerel, next_fsm_block_to_vacuum,
                                        blkno);
                next_fsm_block_to_vacuum = blkno;
            }
        }

        // get the free space of the page
        freespace = PageGetHeapFreeSpace(page);

        // the if/else chain below synchronizes the visibility map state
        /* mark page all-visible, if appropriate */
        if (all_visible && !all_visible_according_to_vm)
        {
            uint8       flags = VISIBILITYMAP_ALL_VISIBLE;

            if (all_frozen)
                flags |= VISIBILITYMAP_ALL_FROZEN;

            /*
             * It should never be the case that the visibility map page is set
             * while the page-level bit is clear, but the reverse is allowed
             * (if checksums are not enabled).  Regardless, set the both bits
             * so that we get back in sync.
             *
             * NB: If the heap page is all-visible but the VM bit is not set,
             * we don't need to dirty the heap page.  However, if checksums
             * are enabled, we do need to make sure that the heap page is
             * dirtied before passing it to visibilitymap_set(), because it
             * may be logged.
             * Given that this situation should only happen in
             * rare cases after a crash, it is not worth optimizing.
             */
            PageSetAllVisible(page);
            MarkBufferDirty(buf);
            visibilitymap_set(onerel, blkno, buf, InvalidXLogRecPtr,
                              vmbuffer, visibility_cutoff_xid, flags);
        }

        /*
         * As of PostgreSQL 9.2, the visibility map bit should never be set if
         * the page-level bit is clear.  However, it's possible that the bit
         * got cleared after we checked it and before we took the buffer
         * content lock, so we must recheck before jumping to the conclusion
         * that something bad has happened.
         */
        else if (all_visible_according_to_vm && !PageIsAllVisible(page)
                 && VM_ALL_VISIBLE(onerel, blkno, &vmbuffer))
        {
            elog(WARNING, "page is not marked all-visible but visibility map bit is set in relation \"%s\" page %u",
                 relname, blkno);
            visibilitymap_clear(onerel, blkno, vmbuffer,
                                VISIBILITYMAP_VALID_BITS);
        }

        /*
         * It's possible for the value returned by GetOldestXmin() to move
         * backwards, so it's not wrong for us to see tuples that appear to
         * not be visible to everyone yet, while PD_ALL_VISIBLE is already
         * set.  The real safe xmin value never moves backwards, but
         * GetOldestXmin() is conservative and sometimes returns a value
         * that's unnecessarily small, so if we see that contradiction it just
         * means that the tuples that we think are not visible to everyone yet
         * actually are, and the PD_ALL_VISIBLE flag is correct.
         *
         * There should never be dead tuples on a page with PD_ALL_VISIBLE
         * set, however.
         */
        else if (PageIsAllVisible(page) && has_dead_tuples)
        {
            elog(WARNING, "page containing dead tuples is marked as all-visible in relation \"%s\" page %u",
                 relname, blkno);
            PageClearAllVisible(page);
            MarkBufferDirty(buf);
            visibilitymap_clear(onerel, blkno, vmbuffer,
                                VISIBILITYMAP_VALID_BITS);
        }

        /*
         * If the all-visible page is turned out to be all-frozen but not
         * marked, we should so mark it.  Note that all_frozen is only valid
         * if all_visible is true, so we must check both.
         */
        else if (all_visible_according_to_vm && all_visible && all_frozen &&
                 !VM_ALL_FROZEN(onerel, blkno, &vmbuffer))
        {
            /*
             * We can pass InvalidTransactionId as the cutoff XID here,
             * because setting the all-frozen bit doesn't cause recovery
             * conflicts.
             */
            visibilitymap_set(onerel, blkno, buf, InvalidXLogRecPtr,
                              vmbuffer, InvalidTransactionId,
                              VISIBILITYMAP_ALL_FROZEN);
        }

        UnlockReleaseBuffer(buf);

        /* Remember the location of the last page with nonremovable tuples */
        if (hastup)
            vacrelstats->nonempty_pages = blkno + 1;

        /*
         * If we remembered any tuples for deletion, then the page will be
         * visited again by lazy_vacuum_heap, which will compute and record
         * its post-compaction free space.  If not, then we're done with this
         * page, so remember its free space as-is.  (This path will always be
         * taken if there are no indexes.)
         */
        if (vacrelstats->num_dead_tuples == prev_dead_count)
            RecordPageWithFreeSpace(onerel, blkno, freespace);
    }
    // end of the per-block loop

    /* report that everything is scanned and vacuumed */
    pgstat_progress_update_param(PROGRESS_VACUUM_HEAP_BLKS_SCANNED, blkno);

    pfree(frozen);

    /* save stats for use later */
    vacrelstats->tuples_deleted = tups_vacuumed;
    vacrelstats->new_dead_tuples = nkeep;

    /* now we can compute the new value for pg_class.reltuples */
    vacrelstats->new_live_tuples = vac_estimate_reltuples(onerel,
                                                          nblocks,
                                                          vacrelstats->tupcount_pages,
                                                          live_tuples);

    /* also compute total number of surviving heap entries */
    vacrelstats->new_rel_tuples =
        vacrelstats->new_live_tuples + vacrelstats->new_dead_tuples;

    /*
     * Release any remaining pin on visibility map page.
     */
    if (BufferIsValid(vmbuffer))
    {
        ReleaseBuffer(vmbuffer);
        vmbuffer = InvalidBuffer;
    }

    /* If any tuples need to be deleted, perform final vacuum cycle */
    /* XXX put a threshold on min number of tuples here? */
    if (vacrelstats->num_dead_tuples > 0)
    {
        const int   hvp_index[] = {
            PROGRESS_VACUUM_PHASE,
            PROGRESS_VACUUM_NUM_INDEX_VACUUMS
        };
        int64       hvp_val[2];

        /* Log cleanup info before we touch indexes */
        vacuum_log_cleanup_info(onerel, vacrelstats);

        /* Report that we are now vacuuming indexes */
        pgstat_progress_update_param(PROGRESS_VACUUM_PHASE,
                                     PROGRESS_VACUUM_PHASE_VACUUM_INDEX);

        /* Remove index entries */
        for (i = 0; i < nindexes; i++)
            lazy_vacuum_index(Irel[i],
                              &indstats[i],
                              vacrelstats);

        /* Report that we are now vacuuming the heap */
        hvp_val[0] = PROGRESS_VACUUM_PHASE_VACUUM_HEAP;
        hvp_val[1] = vacrelstats->num_index_scans + 1;
        pgstat_progress_update_multi_param(2, hvp_index, hvp_val);

        /* Remove tuples from heap */
        pgstat_progress_update_param(PROGRESS_VACUUM_PHASE,
                                     PROGRESS_VACUUM_PHASE_VACUUM_HEAP);
        lazy_vacuum_heap(onerel, vacrelstats);
        vacrelstats->num_index_scans++;
    }

    /*
     * Vacuum the remainder of the Free Space Map.  We must do this whether or
     * not there were indexes.
     */
    if (blkno > next_fsm_block_to_vacuum)
        FreeSpaceMapVacuumRange(onerel, next_fsm_block_to_vacuum, blkno);

    /* report all blocks vacuumed; and that we're cleaning up */
    pgstat_progress_update_param(PROGRESS_VACUUM_HEAP_BLKS_VACUUMED, blkno);
    pgstat_progress_update_param(PROGRESS_VACUUM_PHASE,
                                 PROGRESS_VACUUM_PHASE_INDEX_CLEANUP);

    /* Do post-vacuum cleanup and statistics update for each index */
    for (i = 0; i < nindexes; i++)
        lazy_cleanup_index(Irel[i], indstats[i], vacrelstats);

    /* If no indexes, make log report that lazy_vacuum_heap would've made */
    if (vacuumed_pages)
        ereport(elevel,
                (errmsg("\"%s\": removed %.0f row versions in %u pages",
                        RelationGetRelationName(onerel),
                        tups_vacuumed, vacuumed_pages)));

    /*
     * This is pretty messy, but we split it up so that we can skip emitting
     * individual parts of the message when not applicable.
     */
    initStringInfo(&buf);
    appendStringInfo(&buf,
                     _("%.0f dead row versions cannot be removed yet, oldest xmin: %u\n"),
                     nkeep, OldestXmin);
    appendStringInfo(&buf, _("There were %.0f unused item pointers.\n"),
                     nunused);
    appendStringInfo(&buf, ngettext("Skipped %u page due to buffer pins, ",
                                    "Skipped %u pages due to buffer pins, ",
                                    vacrelstats->pinskipped_pages),
                     vacrelstats->pinskipped_pages);
    appendStringInfo(&buf, ngettext("%u frozen page.\n",
                                    "%u frozen pages.\n",
                                    vacrelstats->frozenskipped_pages),
                     vacrelstats->frozenskipped_pages);
    appendStringInfo(&buf, ngettext("%u page is entirely empty.\n",
                                    "%u pages are entirely empty.\n",
                                    empty_pages),
                     empty_pages);
    appendStringInfo(&buf, _("%s."), pg_rusage_show(&ru0));

    ereport(elevel,
            (errmsg("\"%s\": found %.0f removable, %.0f nonremovable row versions in %u out of %u pages",
                    RelationGetRelationName(onerel),
                    tups_vacuumed, num_tuples,
                    vacrelstats->scanned_pages, nblocks),
             errdetail_internal("%s", buf.data)));
    pfree(buf.data);
}
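The listing relies on TransactionIdPrecedes and TransactionIdFollows to decide whether an xmin is old enough and to track the newest xmin on the page. The following is a minimal standalone sketch of that circular comparison, not the PostgreSQL source itself: the names xid_precedes, xid_follows and track_xmin are hypothetical, and the rule that XIDs below 3 are special and compared as plain integers is stated here as an assumption about how the real functions behave.

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t TransactionId;

/* Assumption: XIDs 0..2 are special and do not take part in wraparound. */
#define FIRST_NORMAL_XID ((TransactionId) 3)

/* Hypothetical helper: true if xid is an ordinary, wraparound-aware XID. */
bool xid_is_normal(TransactionId xid)
{
    return xid >= FIRST_NORMAL_XID;
}

/* True if a is logically older than b (modulo-2^32 comparison). */
bool xid_precedes(TransactionId a, TransactionId b)
{
    if (!xid_is_normal(a) || !xid_is_normal(b))
        return a < b;
    return (int32_t) (a - b) < 0;
}

/* True if a is logically newer than b (modulo-2^32 comparison). */
bool xid_follows(TransactionId a, TransactionId b)
{
    if (!xid_is_normal(a) || !xid_is_normal(b))
        return a > b;
    return (int32_t) (a - b) > 0;
}

/*
 * Mirrors the HEAPTUPLE_LIVE branch for a committed xmin: returns false if
 * the page can no longer be all-visible, otherwise advances *cutoff to the
 * newest xmin seen so far and returns true.
 */
bool track_xmin(TransactionId xmin, TransactionId oldest_xmin,
                TransactionId *cutoff)
{
    if (!xid_precedes(xmin, oldest_xmin))
        return false;           /* not yet visible to every transaction */
    if (xid_follows(xmin, *cutoff))
        *cutoff = xmin;
    return true;
}
```

Subtracting first and casting to a signed 32-bit value is what makes the comparison survive XID wraparound: an XID just before the wrap point still counts as older than a small XID just after it.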
III. Trace Analysis
Test script: run vacuum while a stress test is in progress.
-- session 1
pgbench -c 2 -C -f ./update.sql -j 1 -n -T 600 -U xdb testdb
-- session 2
17:52:59 (xdb@[local]:5432)testdb=# vacuum verbose t1;
Start gdb and set a breakpoint.
(gdb) b lazy_scan_heap
Breakpoint 1 at 0x6bc38a: file vacuumlazy.c, line 470.
(gdb) c
Continuing.

Breakpoint 1, lazy_scan_heap (onerel=0x7f224a197788, options=5, vacrelstats=0x296d7b8, Irel=0x296d8b0, nindexes=1, aggressive=false) at vacuumlazy.c:470
470 TransactionId relfrozenxid = onerel->rd_rel->relfrozenxid;
(gdb)
Input parameters
1-relation
(gdb) p *onerel
$1 = {rd_node = {spcNode = 1663, dbNode = 16402, relNode = 50820}, rd_smgr = 0x2930270, rd_refcnt = 1, rd_backend = -1, rd_islocaltemp = false, rd_isnailed = false, rd_isvalid = true, rd_indexvalid = 1 '\001', rd_statvalid = false, rd_createSubid = 0, rd_newRelfilenodeSubid = 0, rd_rel = 0x7f224a197bb8, rd_att = 0x7f224a0d8050, rd_id = 50820, rd_lockInfo = {lockRelId = {relId = 50820, dbId = 16402}}, rd_rules = 0x0, rd_rulescxt = 0x0, trigdesc = 0x0, rd_rsdesc = 0x0, rd_fkeylist = 0x0, rd_fkeyvalid = false, rd_partkeycxt = 0x0, rd_partkey = 0x0, rd_pdcxt = 0x0, rd_partdesc = 0x0, rd_partcheck = 0x0, rd_indexlist = 0x7f224a198fe8, rd_oidindex = 0, rd_pkindex = 0, rd_replidindex = 0, rd_statlist = 0x0, rd_indexattr = 0x0, rd_projindexattr = 0x0, rd_keyattr = 0x0, rd_pkattr = 0x0, rd_idattr = 0x0, rd_projidx = 0x0, rd_pubactions = 0x0, rd_options = 0x0, rd_index = 0x0, rd_indextuple = 0x0, rd_amhandler = 0, rd_indexcxt = 0x0, rd_amroutine = 0x0, rd_opfamily = 0x0, rd_opcintype = 0x0, rd_support = 0x0, rd_supportinfo = 0x0, rd_indoption = 0x0, rd_indexprs = 0x0, rd_indpred = 0x0, rd_exclops = 0x0, rd_exclprocs = 0x0, rd_exclstrats = 0x0, rd_amcache = 0x0, rd_indcollation = 0x0, rd_fdwroutine = 0x0, rd_toastoid = 0, pgstat_info = 0x2923e50}
(gdb)
2-options=5, i.e. VACOPT_VACUUM | VACOPT_VERBOSE
3-vacrelstats
(gdb) p *vacrelstats
$2 = {hasindex = true, old_rel_pages = 75, rel_pages = 0, scanned_pages = 0, pinskipped_pages = 0, frozenskipped_pages = 0, tupcount_pages = 0, old_live_tuples = 10000, new_rel_tuples = 0, new_live_tuples = 0, new_dead_tuples = 0, pages_removed = 0, tuples_deleted = 0, nonempty_pages = 0, num_dead_tuples = 0, max_dead_tuples = 0, dead_tuples = 0x0, num_index_scans = 0, latestRemovedXid = 0, lock_waiter_detected = false}
(gdb)
4-Irel
(gdb) p *Irel
$3 = (Relation) 0x7f224a198688
(gdb) p **Irel
$4 = {rd_node = {spcNode = 1663, dbNode = 16402, relNode = 50823}, rd_smgr = 0x29302e0, rd_refcnt = 1, rd_backend = -1, rd_islocaltemp = false, rd_isnailed = false, rd_isvalid = true, rd_indexvalid = 0 '\000', rd_statvalid = false, rd_createSubid = 0, rd_newRelfilenodeSubid = 0, rd_rel = 0x7f224a1988a0, rd_att = 0x7f224a1989b8, rd_id = 50823, rd_lockInfo = {lockRelId = {relId = 50823, dbId = 16402}}, rd_rules = 0x0, rd_rulescxt = 0x0, trigdesc = 0x0, rd_rsdesc = 0x0, rd_fkeylist = 0x0, rd_fkeyvalid = false, rd_partkeycxt = 0x0, rd_partkey = 0x0, rd_pdcxt = 0x0, rd_partdesc = 0x0, rd_partcheck = 0x0, rd_indexlist = 0x0, rd_oidindex = 0, rd_pkindex = 0, rd_replidindex = 0, rd_statlist = 0x0, rd_indexattr = 0x0, rd_projindexattr = 0x0, rd_keyattr = 0x0, rd_pkattr = 0x0, rd_idattr = 0x0, rd_projidx = 0x0, rd_pubactions = 0x0, rd_options = 0x0, rd_index = 0x7f224a198d58, rd_indextuple = 0x7f224a198d20, rd_amhandler = 330, rd_indexcxt = 0x28cb340, rd_amroutine = 0x28cb480, rd_opfamily = 0x28cb598, rd_opcintype = 0x28cb5b8, rd_support = 0x28cb5d8, rd_supportinfo = 0x28cb600, rd_indoption = 0x28cb738, rd_indexprs = 0x0, rd_indpred = 0x0, rd_exclops = 0x0, rd_exclprocs = 0x0, rd_exclstrats = 0x0, rd_amcache = 0x0, rd_indcollation = 0x28cb718, rd_fdwroutine = 0x0, rd_toastoid = 0, pgstat_info = 0x2923ec8}
(gdb)
5-nindexes=1: the table has one index
6-aggressive=false: a full-table scan is not forced, so pages marked all-visible may be skipped
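The value options=5 is a bitmask over the VacuumOption enum shown in the data-structure section. As a small sketch of how such a mask decodes, the helper below (describe_vacuum_options, a hypothetical name that does not exist in PostgreSQL) walks the flags and builds a readable string; the enum values are copied from the listing above.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

typedef enum VacuumOption
{
    VACOPT_VACUUM = 1 << 0,
    VACOPT_ANALYZE = 1 << 1,
    VACOPT_VERBOSE = 1 << 2,
    VACOPT_FREEZE = 1 << 3,
    VACOPT_FULL = 1 << 4,
    VACOPT_SKIP_LOCKED = 1 << 5,
    VACOPT_SKIPTOAST = 1 << 6,
    VACOPT_DISABLE_PAGE_SKIPPING = 1 << 7
} VacuumOption;

/* Lookup table pairing each flag bit with its name. */
const struct
{
    int         bit;
    const char *name;
}           vacopt_names[] = {
    {VACOPT_VACUUM, "VACOPT_VACUUM"},
    {VACOPT_ANALYZE, "VACOPT_ANALYZE"},
    {VACOPT_VERBOSE, "VACOPT_VERBOSE"},
    {VACOPT_FREEZE, "VACOPT_FREEZE"},
    {VACOPT_FULL, "VACOPT_FULL"},
    {VACOPT_SKIP_LOCKED, "VACOPT_SKIP_LOCKED"},
    {VACOPT_SKIPTOAST, "VACOPT_SKIPTOAST"},
    {VACOPT_DISABLE_PAGE_SKIPPING, "VACOPT_DISABLE_PAGE_SKIPPING"}
};

/*
 * Hypothetical helper: writes the names of all set flags into out,
 * separated by " | ", and returns how many names were written in full.
 */
int describe_vacuum_options(int options, char *out, size_t outlen)
{
    size_t      used = 0;
    int         nset = 0;

    assert(outlen > 0);
    out[0] = '\0';
    for (size_t i = 0; i < sizeof(vacopt_names) / sizeof(vacopt_names[0]); i++)
    {
        int         n;

        if ((options & vacopt_names[i].bit) == 0)
            continue;
        n = snprintf(out + used, outlen - used, "%s%s",
                     nset > 0 ? " | " : "", vacopt_names[i].name);
        if (n < 0 || (size_t) n >= outlen - used)
            break;              /* buffer too small: stop appending */
        used += (size_t) n;
        nset++;
    }
    return nset;
}
```

Calling describe_vacuum_options(5, buf, sizeof buf) yields the two flags named in item 2, since 5 = (1 << 0) | (1 << 2).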
Next, the local variables are initialized.
(gdb) n
471 TransactionId relminmxid = onerel->rd_rel->relminmxid;
(gdb)
483 Buffer vmbuffer = InvalidBuffer;
(gdb)
488 const int initprog_index[] = {
(gdb)
495 pg_rusage_init(&ru0);
(gdb)
497 relname = RelationGetRelationName(onerel);
(gdb)
498 if (aggressive)
(gdb)
504 ereport(elevel,
(gdb)
509 empty_pages = vacuumed_pages = 0;
(gdb)
510 next_fsm_block_to_vacuum = (BlockNumber) 0;
(gdb)
511 num_tuples = live_tuples = tups_vacuumed = nkeep = nunused = 0;
(gdb)
514 palloc0(nindexes * sizeof(IndexBulkDeleteResult *));
(gdb)
513 indstats = (IndexBulkDeleteResult **)
(gdb)
516 nblocks = RelationGetNumberOfBlocks(onerel);
(gdb) p relminmxid
$5 = 1
(gdb) p ru0
$6 = {tv = {tv_sec = 1548669429, tv_usec = 578779}, ru = {ru_utime = {tv_sec = 0, tv_usec = 29531}, ru_stime = {tv_sec = 0, tv_usec = 51407}, {ru_maxrss = 7488, __ru_maxrss_word = 7488}, {ru_ixrss = 0, __ru_ixrss_word = 0}, {ru_idrss = 0, __ru_idrss_word = 0}, {ru_isrss = 0, __ru_isrss_word = 0}, {ru_minflt = 1819, __ru_minflt_word = 1819}, { ru_majflt = 0, __ru_majflt_word = 0}, {ru_nswap = 0, __ru_nswap_word = 0}, {ru_inblock = 2664, __ru_inblock_word = 2664}, {ru_oublock = 328, __ru_oublock_word = 328}, {ru_msgsnd = 0, __ru_msgsnd_word = 0}, { ru_msgrcv = 0, __ru_msgrcv_word = 0}, {ru_nsignals = 0, __ru_nsignals_word = 0}, {ru_nvcsw = 70, __ru_nvcsw_word = 70}, {ru_nivcsw = 3, __ru_nivcsw_word = 3}}}
(gdb) p relname
$7 = 0x7f224a197bb8 "t1"
(gdb)
Get the total number of blocks.
(gdb) n
517 vacrelstats->rel_pages = nblocks;
(gdb) p nblocks
$8 = 75
(gdb)
Initialize the statistics and the related arrays.
(gdb) n
518 vacrelstats->scanned_pages = 0;
(gdb)
519 vacrelstats->tupcount_pages = 0;
(gdb)
520 vacrelstats->nonempty_pages = 0;
(gdb)
521 vacrelstats->latestRemovedXid = InvalidTransactionId;
(gdb)
523 lazy_space_alloc(vacrelstats, nblocks);
(gdb)
524 frozen = palloc(sizeof(xl_heap_freeze_tuple) * MaxHeapTuplesPerPage);
(gdb)
527 initprog_val[0] = PROGRESS_VACUUM_PHASE_SCAN_HEAP;
(gdb)
528 initprog_val[1] = nblocks;
(gdb)
529 initprog_val[2] = vacrelstats->max_dead_tuples;
(gdb)
530 pgstat_progress_update_multi_param(3, initprog_index, initprog_val);
(gdb) p *vacrelstats
$9 = {hasindex = true, old_rel_pages = 75, rel_pages = 75, scanned_pages = 0, pinskipped_pages = 0, frozenskipped_pages = 0, tupcount_pages = 0, old_live_tuples = 10000, new_rel_tuples = 0, new_live_tuples = 0, new_dead_tuples = 0, pages_removed = 0, tuples_deleted = 0, nonempty_pages = 0, num_dead_tuples = 0, max_dead_tuples = 21825, dead_tuples = 0x297e820, num_index_scans = 0, latestRemovedXid = 0, lock_waiter_detected = false}
(gdb)
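After lazy_space_alloc returns, max_dead_tuples is 21825, which equals 75 blocks times 291 tuples per page. The sketch below shows one way such a cap can arise. It is a simplified reconstruction of the sizing rule, not the function's actual code: the name dead_tuple_capacity is hypothetical, the 6-byte item pointer and the 291 tuples-per-page constant are assumptions for a default 8 kB build, and the 64 MB memory budget used in the usage note is assumed because the trace does not show the setting.

```c
#include <assert.h>

/* Assumptions for a default 8 kB block build. */
#define ITEM_POINTER_SIZE        6      /* bytes per dead-tuple TID */
#define MAX_HEAP_TUPLES_PER_PAGE 291

/*
 * Hypothetical helper: number of dead-tuple slots to allocate.
 * With indexes, the array is sized from the memory budget but never larger
 * than what the table could possibly need; without indexes, one page worth
 * of slots is enough because each page is vacuumed immediately.
 */
long dead_tuple_capacity(long work_mem_kb, unsigned int rel_blocks,
                         int has_index)
{
    long        maxtuples;

    assert(work_mem_kb > 0);

    if (!has_index)
        return MAX_HEAP_TUPLES_PER_PAGE;

    maxtuples = work_mem_kb * 1024L / ITEM_POINTER_SIZE;

    /* Divide before comparing so the multiplication below cannot overflow. */
    if ((unsigned long) (maxtuples / MAX_HEAP_TUPLES_PER_PAGE) > rel_blocks)
        maxtuples = (long) rel_blocks * MAX_HEAP_TUPLES_PER_PAGE;

    /* Always leave room for at least one full page of tuples. */
    if (maxtuples < MAX_HEAP_TUPLES_PER_PAGE)
        maxtuples = MAX_HEAP_TUPLES_PER_PAGE;

    return maxtuples;
}
```

Under these assumptions, dead_tuple_capacity(65536, 75, 1) evaluates to 75 * 291 = 21825, which matches the max_dead_tuples value printed by gdb for the 75-block table t1.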
Compute the next block that cannot be skipped.
Block 0 itself cannot be skipped, so next_unskippable_block stays 0; because 0 < SKIP_PAGES_THRESHOLD (32), skipping_blocks is set to false.
(gdb) n
576 next_unskippable_block = 0;
(gdb)
577 if ((options & VACOPT_DISABLE_PAGE_SKIPPING) == 0)
(gdb)
579 while (next_unskippable_block < nblocks)
(gdb)
583 vmstatus = visibilitymap_get_status(onerel, next_unskippable_block,
(gdb)
585 if (aggressive)
(gdb) p vmstatus
$10 = 0 '\000'
(gdb) n
592 if ((vmstatus & VISIBILITYMAP_ALL_VISIBLE) == 0)
(gdb)
593 break;
(gdb)
600 if (next_unskippable_block >= SKIP_PAGES_THRESHOLD)
(gdb) p next_unskippable_block
$11 = 0
(gdb) p SKIP_PAGES_THRESHOLD
$12 = 32
(gdb) n
603 skipping_blocks = false;
(gdb)
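The search traced above can be modeled with a plain array that stands in for the visibility map. This is only a sketch: find_next_unskippable and initial_skipping_blocks are hypothetical names, the array replaces the real visibilitymap_get_status lookups, and the vacuum delay point inside the real loop is omitted.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define VM_ALL_VISIBLE       0x01
#define VM_ALL_FROZEN        0x02
#define SKIP_PAGES_THRESHOLD 32

/*
 * Returns the first block at or after start that may not be skipped.
 * An aggressive vacuum may only skip all-frozen pages; a normal vacuum
 * may skip any all-visible page.  Returns nblocks if every page qualifies.
 */
uint32_t find_next_unskippable(const uint8_t *vm, uint32_t nblocks,
                               uint32_t start, bool aggressive)
{
    uint8_t     required = aggressive ? VM_ALL_FROZEN : VM_ALL_VISIBLE;
    uint32_t    blk = start;

    assert(vm != NULL || nblocks == 0);
    while (blk < nblocks && (vm[blk] & required) != 0)
        blk++;
    return blk;
}

/* Skipping only starts if the leading skippable run is long enough. */
bool initial_skipping_blocks(const uint8_t *vm, uint32_t nblocks,
                             bool aggressive)
{
    return find_next_unskippable(vm, nblocks, 0, aggressive)
        >= SKIP_PAGES_THRESHOLD;
}
```

With an all-zero map of 75 blocks, as in the trace where vmstatus is 0, the search stops at block 0 and skipping stays off. The threshold exists so that short runs of skippable pages are still read, which keeps the scan sequential.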
Start iterating over the blocks.
Initialize the per-block variables.
(gdb)
605 for (blkno = 0; blkno < nblocks; blkno++)
(gdb)
616 bool all_visible_according_to_vm = false;
(gdb)
618 bool all_frozen = true; /* provided all_visible is also true */
(gdb)
620 TransactionId visibility_cutoff_xid = InvalidTransactionId;
(gdb)
626 pgstat_progress_update_param(PROGRESS_VACUUM_HEAP_BLKS_SCANNED, blkno);
(gdb)
628 if (blkno == next_unskippable_block)
(gdb)
blkno == next_unskippable_block, so look for the next block that cannot be skipped.
(gdb) p blkno
$13 = 0
(gdb) p next_unskippable_block
$14 = 0
(gdb) n
631 next_unskippable_block++;
(gdb)
632 if ((options & VACOPT_DISABLE_PAGE_SKIPPING) == 0)
(gdb)
634 while (next_unskippable_block < nblocks)
(gdb)
638 vmskipflags = visibilitymap_get_status(onerel,
(gdb)
641 if (aggressive)
(gdb) p vmskipflags
$15 = 0 '\000'
(gdb) n
648 if ((vmskipflags & VISIBILITYMAP_ALL_VISIBLE) == 0)
(gdb)
649 break;
(gdb)
660 if (next_unskippable_block - blkno > SKIP_PAGES_THRESHOLD)
(gdb) p next_unskippable_block
$16 = 1
(gdb) n
1047 if (onerel->rd_rel->relhasoids &&
(gdb)
1132 if (tupgone)
(gdb)
tupgone is false, so check whether the tuple needs freezing (it does not).
Advance to the next offset and continue iterating over the tuples.
(gdb) p tupgone
$17 = false
(gdb) n
1144 num_tuples += 1;
(gdb)
1145 hastup = true;
(gdb)
1151 if (heap_prepare_freeze_tuple(tuple.t_data,
(gdb)
1154 &frozen[nfrozen],
(gdb) p nfrozen
$18 = 0
(gdb) n
1151 if (heap_prepare_freeze_tuple(tuple.t_data,
(gdb)
1158 if (!tuple_totally_frozen)
(gdb)
1159 all_frozen = false;
(gdb)
958 offnum = OffsetNumberNext(offnum))
(gdb)
956 for (offnum = FirstOffsetNumber;
(gdb)
This line pointer is normal (in use, neither redirected nor dead).
(gdb) p offnum
$19 = 3
(gdb) n
962 itemid = PageGetItemId(page, offnum);
(gdb)
965 if (!ItemIdIsUsed(itemid))
(gdb)
972 if (ItemIdIsRedirected(itemid))
(gdb)
978 ItemPointerSet(&(tuple.t_self), blkno, offnum);
(gdb)
986 if (ItemIdIsDead(itemid))
(gdb)
993 Assert(ItemIdIsNormal(itemid));
(gdb)
995 tuple.t_data = (HeapTupleHeader) PageGetItem(page, itemid);
(gdb)
996 tuple.t_len = ItemIdGetLength(itemid);
(gdb)
997 tuple.t_tableOid = RelationGetRelid(onerel);
(gdb)
999 tupgone = false;
(gdb)
Call HeapTupleSatisfiesVacuum to determine the tuple's state; its main purpose is to decide whether the tuple may still be visible to any running transaction.
This tuple is a live tuple.
1012 switch (HeapTupleSatisfiesVacuum(&tuple, OldestXmin, buf))
(gdb)
(gdb) n
1047 if (onerel->rd_rel->relhasoids &&
(gdb) n
1056 live_tuples += 1;
(gdb)
1067 if (all_visible)
(gdb) p all_visible
$20 = false
Leave the tuple loop by setting a breakpoint right after it.
(gdb) b vacuumlazy.c:1168
Breakpoint 2 at 0x6bd4e7: file vacuumlazy.c, line 1168.
(gdb) c
Continuing.

Breakpoint 2, lazy_scan_heap (onerel=0x7f224a197788, options=5, vacrelstats=0x296d7b8, Irel=0x296d8b0, nindexes=1, aggressive=false) at vacuumlazy.c:1168
1168 if (nfrozen > 0)
(gdb)
Update the page-level state: no tuple was frozen (nfrozen = 0), none of the visibility map branches applies, and the page's free space is recorded.
(gdb) n
1203 if (nindexes == 0 &&
(gdb) p nfrozen
$23 = 0
(gdb) n
1232 freespace = PageGetHeapFreeSpace(page);
(gdb)
1235 if (all_visible && !all_visible_according_to_vm)
(gdb)
1268 else if (all_visible_according_to_vm && !PageIsAllVisible(page)
(gdb)
1290 else if (PageIsAllVisible(page) && has_dead_tuples)
(gdb)
1305 else if (all_visible_according_to_vm && all_visible && all_frozen &&
(gdb)
1318 UnlockReleaseBuffer(buf);
(gdb)
1321 if (hastup)
(gdb)
1322 vacrelstats->nonempty_pages = blkno + 1;
(gdb) p hastup
$24 = true
(gdb) n
1331 if (vacrelstats->num_dead_tuples == prev_dead_count)
(gdb)
1332 RecordPageWithFreeSpace(onerel, blkno, freespace);
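In this trace gdb steps through lines 1235, 1268, 1290 and 1305 without entering any branch. The four-way if/else chain that keeps the page flag and the visibility map in sync can be restated as a pure decision function, which makes the conditions easier to test in isolation. The sketch below is only a model: PageVisState, VmAction and choose_vm_action are hypothetical names, and each boolean stands in for the corresponding variable or macro result in the listing.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef enum VmAction
{
    VM_NOTHING,                 /* page and VM already agree */
    VM_SET_ALL_VISIBLE,         /* set page flag and VM bit(s) */
    VM_CLEAR_STALE_VM_BIT,      /* VM bit set while page flag is clear */
    VM_CLEAR_PAGE_AND_VM,       /* dead tuples on an all-visible page */
    VM_SET_ALL_FROZEN           /* add the all-frozen bit */
} VmAction;

typedef struct PageVisState
{
    bool        all_visible;    /* computed by this scan of the page */
    bool        all_frozen;     /* only meaningful if all_visible */
    bool        all_visible_according_to_vm;    /* VM bit seen before locking */
    bool        page_flag_all_visible;  /* stands in for PageIsAllVisible() */
    bool        vm_all_visible_now;     /* stands in for VM_ALL_VISIBLE() */
    bool        vm_all_frozen_now;      /* stands in for VM_ALL_FROZEN() */
    bool        has_dead_tuples;
} PageVisState;

/* Evaluates the branches in the same order as the listing. */
VmAction choose_vm_action(const PageVisState *s)
{
    assert(s != NULL);

    if (s->all_visible && !s->all_visible_according_to_vm)
        return VM_SET_ALL_VISIBLE;
    else if (s->all_visible_according_to_vm && !s->page_flag_all_visible &&
             s->vm_all_visible_now)
        return VM_CLEAR_STALE_VM_BIT;
    else if (s->page_flag_all_visible && s->has_dead_tuples)
        return VM_CLEAR_PAGE_AND_VM;
    else if (s->all_visible_according_to_vm && s->all_visible &&
             s->all_frozen && !s->vm_all_frozen_now)
        return VM_SET_ALL_FROZEN;
    return VM_NOTHING;
}
```

For the page traced above, all_visible is false and no branch fires, so the model returns VM_NOTHING. The order of the branches matters: the all-frozen case is only reached when the VM already considered the page all-visible.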
Continue with the next block.
(gdb)
605 for (blkno = 0; blkno < nblocks; blkno++)
(gdb) p blkno
$25 = 0
(gdb) n
616 bool all_visible_according_to_vm = false;
(gdb) p blkno
$26 = 1
(gdb)
Check whether (vacrelstats->max_dead_tuples - vacrelstats->num_dead_tuples) < MaxHeapTuplesPerPage && vacrelstats->num_dead_tuples > 0; the condition is false, so execution continues.
...
(gdb)
701 vacuum_delay_point();
(gdb)
707 if ((vacrelstats->max_dead_tuples - vacrelstats->num_dead_tuples) < MaxHeapTuplesPerPage &&
(gdb) p vacrelstats->max_dead_tuples
$27 = 21825
(gdb) p vacrelstats->num_dead_tuples
$28 = 0
(gdb) p MaxHeapTuplesPerPage
No symbol "__builtin_offsetof" in current context.
(gdb)
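gdb cannot print MaxHeapTuplesPerPage because the macro expands to offsetof expressions, but the value can be worked out by hand. The sketch below does that arithmetic and then evaluates the condition from line 707. The sizes are assumptions for a default build (8192-byte blocks, 24-byte page header, 23-byte tuple header, 4-byte line pointer, 8-byte maximum alignment), and both function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed sizes for a default build; adjust if the build differs. */
#define BLCKSZ                 8192
#define PAGE_HEADER_SIZE       24   /* offset of the line pointer array */
#define HEAP_TUPLE_HEADER_SIZE 23   /* offset of the null bitmap */
#define ITEM_ID_SIZE           4
#define MAXIMUM_ALIGNOF        8
#define MAXALIGN(len) (((len) + (MAXIMUM_ALIGNOF - 1)) & ~(MAXIMUM_ALIGNOF - 1))

/* Upper bound on tuples per page: each needs an aligned header plus a line pointer. */
int max_heap_tuples_per_page(void)
{
    return (BLCKSZ - PAGE_HEADER_SIZE) /
        (MAXALIGN(HEAP_TUPLE_HEADER_SIZE) + ITEM_ID_SIZE);
}

/*
 * True when the dead-tuple array might not have room for one more page,
 * which is the point where an index and heap vacuum cycle must run before
 * the scan continues.
 */
bool dead_tuple_array_nearly_full(long max_dead, long num_dead)
{
    assert(num_dead >= 0 && num_dead <= max_dead);
    return (max_dead - num_dead) < max_heap_tuples_per_page() &&
        num_dead > 0;
}
```

Under these assumptions the bound is (8192 - 24) / (24 + 4) = 291. With max_dead_tuples = 21825 and num_dead_tuples = 0 the condition is false, which is why the trace continues without an intermediate vacuum cycle.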
Read the buffer with ReadBufferExtended.
(gdb) n
783 visibilitymap_pin(onerel, blkno, &vmbuffer);
(gdb)
785 buf = ReadBufferExtended(onerel, MAIN_FORKNUM, blkno,
(gdb)
789 if (!ConditionalLockBufferForCleanup(buf))
(gdb)
Acquire the buffer cleanup lock: success!
Call heap_page_prune to prune all HOT-update chains on this page.
(gdb) n
847 vacrelstats->scanned_pages++;
(gdb)
848 vacrelstats->tupcount_pages++;
(gdb)
850 page = BufferGetPage(buf);
(gdb)
852 if (PageIsNew(page))
(gdb)
894 if (PageIsEmpty(page))
(gdb)
938 tups_vacuumed += heap_page_prune(onerel, buf, OldestXmin, false,
(gdb)
945 all_visible = true;
(gdb)
Iterate over the line pointers on the page. The first tuple takes the HEAPTUPLE_RECENTLY_DEAD branch: nkeep is incremented and all_visible is cleared.
956 for (offnum = FirstOffsetNumber;
(gdb) p maxoff
$29 = 291
(gdb)
$30 = 291
(gdb) n
962 itemid = PageGetItemId(page, offnum);
(gdb) n
965 if (!ItemIdIsUsed(itemid))
(gdb)
972 if (ItemIdIsRedirected(itemid))
(gdb)
978 ItemPointerSet(&(tuple.t_self), blkno, offnum);
(gdb)
986 if (ItemIdIsDead(itemid))
(gdb)
993 Assert(ItemIdIsNormal(itemid));
(gdb)
995 tuple.t_data = (HeapTupleHeader) PageGetItem(page, itemid);
(gdb)
996 tuple.t_len = ItemIdGetLength(itemid);
(gdb)
997 tuple.t_tableOid = RelationGetRelid(onerel);
(gdb)
999 tupgone = false;
(gdb)
1012 switch (HeapTupleSatisfiesVacuum(&tuple, OldestXmin, buf))
(gdb)
1099 nkeep += 1;
(gdb)
1100 all_visible = false;
(gdb)
1101 break;
(gdb)
1132 if (tupgone)
(gdb)
1144 num_tuples += 1;
Leave the tuple loop.
(gdb) c
Continuing.

Breakpoint 2, lazy_scan_heap (onerel=0x7f224a197788, options=5, vacrelstats=0x296d7b8, Irel=0x296d8b0, nindexes=1, aggressive=false) at vacuumlazy.c:1168
1168 if (nfrozen > 0)
(gdb)
DONE!
IV. References
PG Source Code