
Understanding the Clock Sweep Algorithm in PostgreSQL


This article explains the Clock Sweep algorithm used by PostgreSQL's buffer manager. It first describes the data structures involved and then walks through the relevant source code.

1. Data Structures

BufferDesc
Shared descriptor/state data for a single shared buffer.

/*
 * Flags for buffer descriptors
 *
 * Note: TAG_VALID essentially means that there is a buffer hashtable
 * entry associated with the buffer's tag.
 */
#define BM_LOCKED               (1U << 22)  /* buffer header is locked */
#define BM_DIRTY                (1U << 23)  /* data needs writing */
#define BM_VALID                (1U << 24)  /* data is valid */
#define BM_TAG_VALID            (1U << 25)  /* tag is assigned */
#define BM_IO_IN_PROGRESS       (1U << 26)  /* read or write in progress */
#define BM_IO_ERROR             (1U << 27)  /* previous I/O failed */
#define BM_JUST_DIRTIED         (1U << 28)  /* dirtied since write started */
#define BM_PIN_COUNT_WAITER     (1U << 29)  /* have waiter for sole pin */
#define BM_CHECKPOINT_NEEDED    (1U << 30)  /* must write for checkpoint */
#define BM_PERMANENT            (1U << 31)  /* permanent buffer (not unlogged,
                                             * or init fork) */

/*
 *  BufferDesc -- shared descriptor/state data for a single shared buffer.
 *
 * Note: Buffer header lock (BM_LOCKED flag) must be held to examine or change
 * the tag, state or wait_backend_pid fields.  In general, buffer header lock
 * is a spinlock which is combined with flags, refcount and usagecount into
 * single atomic variable.  This layout allow us to do some operations in a
 * single atomic operation, without actually acquiring and releasing spinlock;
 * for instance, increase or decrease refcount.  buf_id field never changes
 * after initialization, so does not need locking.  freeNext is protected by
 * the buffer_strategy_lock not buffer header lock.  The LWLock can take care
 * of itself.  The buffer header lock is *not* used to control access to the
 * data in the buffer!
 *
 * It's assumed that nobody changes the state field while buffer header lock
 * is held.  Thus buffer header lock holder can do complex updates of the
 * state variable in single write, simultaneously with lock release (cleaning
 * BM_LOCKED flag).  On the other hand, updating of state without holding
 * buffer header lock is restricted to CAS, which insure that BM_LOCKED flag
 * is not set.  Atomic increment/decrement, OR/AND etc. are not allowed.
 *
 * An exception is that if we have the buffer pinned, its tag can't change
 * underneath us, so we can examine the tag without locking the buffer header.
 * Also, in places we do one-time reads of the flags without bothering to
 * lock the buffer header; this is generally for situations where we don't
 * expect the flag bit being tested to be changing.
 *
 * We can't physically remove items from a disk page if another backend has
 * the buffer pinned.  Hence, a backend may need to wait for all other pins
 * to go away.  This is signaled by storing its own PID into
 * wait_backend_pid and setting flag bit BM_PIN_COUNT_WAITER.  At present,
 * there can be only one such waiter per buffer.
 *
 * We use this same struct for local buffer headers, but the locks are not
 * used and not all of the flag bits are useful either. To avoid unnecessary
 * overhead, manipulations of the state field should be done without actual
 * atomic operations (i.e. only pg_atomic_read_u32() and
 * pg_atomic_unlocked_write_u32()).
 *
 * Be careful to avoid increasing the size of the struct when adding or
 * reordering members.  Keeping it below 64 bytes (the most common CPU
 * cache line size) is fairly important for performance.
 */
typedef struct BufferDesc
{
    BufferTag   tag;            /* ID of page contained in buffer */
    int         buf_id;         /* buffer's index number (from 0) */

    /* state of the tag, containing flags, refcount and usagecount */
    pg_atomic_uint32 state;

    int         wait_backend_pid;   /* backend PID of pin-count waiter */
    int         freeNext;       /* link in freelist chain */

    LWLock      content_lock;   /* to lock access to buffer contents */
} BufferDesc;
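To make the "flags, refcount and usagecount packed into one atomic word" layout concrete, here is a minimal standalone sketch. The bit positions and macro names mirror those in PostgreSQL's buf_internals.h (refcount in bits 0-17, usage count in bits 18-21, flag bits from 22 upward), but the program itself is only an illustration, not PostgreSQL code.

/* Standalone sketch of the packed state word; not PostgreSQL code. */
#include <stdio.h>
#include <stdint.h>

#define BUF_REFCOUNT_MASK     ((1U << 18) - 1)   /* bits 0..17  */
#define BUF_USAGECOUNT_MASK   0x003C0000U        /* bits 18..21 */
#define BUF_USAGECOUNT_SHIFT  18
#define BUF_USAGECOUNT_ONE    (1U << BUF_USAGECOUNT_SHIFT)
#define BUF_FLAG_MASK         0xFFC00000U        /* bits 22..31 */

#define BM_DIRTY              (1U << 23)

#define BUF_STATE_GET_REFCOUNT(state)   ((state) & BUF_REFCOUNT_MASK)
#define BUF_STATE_GET_USAGECOUNT(state) \
    (((state) & BUF_USAGECOUNT_MASK) >> BUF_USAGECOUNT_SHIFT)

int main(void)
{
    uint32_t state = 0;

    state += 1;                   /* pin: refcount + 1     */
    state += BUF_USAGECOUNT_ONE;  /* bump usage count by 1 */
    state |= BM_DIRTY;            /* set a flag bit        */

    printf("refcount   = %u\n", BUF_STATE_GET_REFCOUNT(state));
    printf("usagecount = %u\n", BUF_STATE_GET_USAGECOUNT(state));
    printf("dirty      = %s\n", (state & BM_DIRTY) ? "yes" : "no");
    return 0;
}

Because all three quantities live in one word, a pin, an unpin or a usage-count bump can be a single atomic add on the word, which is exactly why the comment above forbids plain atomic increments when the BM_LOCKED flag might be set.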

BufferTag
The buffer tag identifies which disk block the buffer contains.

/*
 * Buffer tag identifies which disk block the buffer contains.
 *
 * Note: the BufferTag data must be sufficient to determine where to write the
 * block, without reference to pg_class or pg_tablespace entries.  It's
 * possible that the backend flushing the buffer doesn't even believe the
 * relation is visible yet (its xact may have started before the xact that
 * created the rel).  The storage manager must be able to cope anyway.
 *
 * Note: if there's any pad bytes in the struct, INIT_BUFFERTAG will have
 * to be fixed to zero them, since this struct is used as a hash key.
 */
typedef struct buftag
{
    RelFileNode rnode;          /* physical relation identifier */
    ForkNumber  forkNum;
    BlockNumber blockNum;       /* blknum relative to begin of reln */
} BufferTag;
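The note about pad bytes matters because the whole struct is hashed and compared byte-for-byte when it is used as a hash key. The following standalone sketch shows the zero-then-assign pattern; the simplified types stand in for RelFileNode/ForkNumber/BlockNumber, and this is an illustration, not the actual INIT_BUFFERTAG macro.

/* Standalone sketch: zero a hash-key struct before filling its fields. */
#include <stdio.h>
#include <string.h>
#include <stdint.h>

typedef struct
{
    uint32_t spcNode;           /* tablespace */
    uint32_t dbNode;            /* database   */
    uint32_t relNode;           /* relation   */
} RelFileNodeDemo;

typedef struct
{
    RelFileNodeDemo rnode;
    int32_t         forkNum;
    uint32_t        blockNum;
} BufferTagDemo;

static void
init_buffer_tag(BufferTagDemo *tag, RelFileNodeDemo rnode,
                int32_t forkNum, uint32_t blockNum)
{
    memset(tag, 0, sizeof(*tag));   /* zero any compiler-inserted padding */
    tag->rnode = rnode;
    tag->forkNum = forkNum;
    tag->blockNum = blockNum;
}

int main(void)
{
    RelFileNodeDemo rel = {1663, 16384, 24576};
    BufferTagDemo   a, b;

    init_buffer_tag(&a, rel, 0, 42);
    init_buffer_tag(&b, rel, 0, 42);

    /* comparing the raw bytes is only safe because padding was zeroed */
    printf("tags equal: %s\n",
           memcmp(&a, &b, sizeof(a)) == 0 ? "yes" : "no");
    return 0;
}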

2. Source Code Walkthrough

The idea behind the algorithm has already been covered in detail in the Buffer Manager article; the pseudocode is excerpted below:

WHILE true
(1)     Obtain the candidate buffer descriptor pointed to by nextVictimBuffer
(2)     IF the candidate descriptor is unpinned THEN
(3)         IF the candidate descriptor's usage_count == 0 THEN
                BREAK WHILE LOOP  /* the corresponding slot of this descriptor is the victim slot */
            ELSE
                Decrease the candidate descriptor's usage_count by 1
            END IF
        END IF
(4)     Advance nextVictimBuffer to the next one
END WHILE
(5) RETURN buffer_id of the victim

The idea is plain and simple, far simpler than a full LRU implementation.
nextVictimBuffer acts as the hand of a clock, and the buffer pool is treated as a circular array that the hand sweeps around repeatedly. When the hand reaches a buffer that is unpinned (refcount == 0) and has usage_count == 0, that buffer is chosen as the victim; otherwise its usage_count is decremented by one and the sweep continues until a qualifying buffer is found. The chosen buffer is therefore one of the less recently used buffers.
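Before looking at the real source, the following minimal single-threaded sketch of the loop just described may help. The toy pool, field names and NBUFFERS value are made up for illustration; this is not PostgreSQL code.

/* Toy, single-threaded sketch of the clock-sweep loop described above. */
#include <stdio.h>

#define NBUFFERS 8

typedef struct
{
    int refcount;       /* pinned if > 0 */
    int usage_count;    /* bumped on each access, decremented by the sweep */
} ToyBuffer;

static ToyBuffer pool[NBUFFERS];
static int nextVictimBuffer = 0;    /* the clock hand */

/* Return the index of a victim slot, or -1 if every buffer is pinned. */
static int
clock_sweep(void)
{
    int trycounter = NBUFFERS;

    for (;;)
    {
        int        id  = nextVictimBuffer;
        ToyBuffer *buf = &pool[id];

        nextVictimBuffer = (nextVictimBuffer + 1) % NBUFFERS;  /* advance hand */

        if (buf->refcount == 0)
        {
            if (buf->usage_count == 0)
                return id;          /* unpinned and "cold": this is the victim */
            buf->usage_count--;     /* give it another lap around the clock */
            trycounter = NBUFFERS;
        }
        else if (--trycounter == 0)
            return -1;              /* every buffer is pinned */
    }
}

int main(void)
{
    pool[3].usage_count = 2;        /* recently used */
    pool[5].refcount = 1;           /* pinned        */

    printf("victim slot: %d\n", clock_sweep());
    return 0;
}

The real StrategyGetBuffer() follows the same shape, with the extra work of taking the buffer header lock and advancing the hand atomically via ClockSweepTick().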

Code snippet from StrategyGetBuffer():

...
    /* Nothing on the freelist, so run the "clock sweep" algorithm */
    /* int NBuffers = 1000; */
    trycounter = NBuffers;          /* number of attempts left */
    for (;;)
    {
        /* advance the clock hand and fetch the descriptor it now points to */
        buf = GetBufferDescriptor(ClockSweepTick());

        /*
         * If the buffer is pinned or has a nonzero usage_count, we cannot use
         * it; decrement the usage_count (unless pinned) and keep scanning.
         */
        local_buf_state = LockBufHdr(buf);

        if (BUF_STATE_GET_REFCOUNT(local_buf_state) == 0)
        {
            if (BUF_STATE_GET_USAGECOUNT(local_buf_state) != 0)
            {
                /* usage_count != 0: decrement it and reset the try counter */
                local_buf_state -= BUF_USAGECOUNT_ONE;
                trycounter = NBuffers;
            }
            else
            {
                /* Found a usable buffer */
                if (strategy != NULL)
                    /* add it to this strategy's ring buffer */
                    AddBufferToRing(strategy, buf);
                *buf_state = local_buf_state;
                return buf;
            }
        }
        else if (--trycounter == 0)
        {
            /*
             * We've scanned all the buffers without making any state changes,
             * so all the buffers are pinned (or were when we looked at them).
             * We could hope that someone will free one eventually, but it's
             * probably better to fail than to risk getting stuck in an
             * infinite loop.
             */
            UnlockBufHdr(buf, local_buf_state);
            elog(ERROR, "no unpinned buffers available");
        }
        /* release the buffer header lock and keep sweeping */
        UnlockBufHdr(buf, local_buf_state);
    }

ClockSweepTick() is a helper routine for StrategyGetBuffer(): it moves the clock hand one buffer ahead of its current position and returns the id of the buffer now under the hand. Its implementation is as follows:

/*
 * ClockSweepTick - Helper routine for StrategyGetBuffer()
 *
 * Move the clock hand one buffer ahead of its current position and return the
 * id of the buffer now under the hand.
 */
static inline uint32
ClockSweepTick(void)
{
    uint32      victim;

    /*
     * Atomically move hand ahead one buffer - if there's several processes
     * doing this, this can lead to buffers being returned slightly out of
     * apparent order.
     */
    victim =
        pg_atomic_fetch_add_u32(&StrategyControl->nextVictimBuffer, 1);

    if (victim >= NBuffers)
    {
        /* remember the raw (un-wrapped) value */
        uint32      originalVictim = victim;

        /* always wrap what we look up in BufferDescriptors */
        victim = victim % NBuffers;

        /*
         * If we're the one that just caused a wraparound, force
         * completePasses to be incremented while holding the spinlock. We
         * need the spinlock so StrategySyncStart() can return a consistent
         * value consisting of nextVictimBuffer and completePasses.
         */
        if (victim == 0)
        {
            uint32      expected;   /* expected raw value (before wrapping) */
            uint32      wrapped;    /* value after wrapping */
            bool        success = false;

            expected = originalVictim + 1;

            while (!success)
            {
                /*
                 * Acquire the spinlock while increasing completePasses. That
                 * allows other readers to read nextVictimBuffer and
                 * completePasses in a consistent manner which is required for
                 * StrategySyncStart().  In theory delaying the increment
                 * could lead to an overflow of nextVictimBuffers, but that's
                 * highly unlikely and wouldn't be particularly harmful.
                 */
                SpinLockAcquire(&StrategyControl->buffer_strategy_lock);

                wrapped = expected % NBuffers;

                /* atomically compare-and-swap the clock hand */
                success = pg_atomic_compare_exchange_u32(&StrategyControl->nextVictimBuffer,
                                                         &expected, wrapped);
                if (success)
                    StrategyControl->completePasses++;

                SpinLockRelease(&StrategyControl->buffer_strategy_lock);
            }
        }
    }
    return victim;
}

/*
 * pg_atomic_compare_exchange_u32 - CAS operation
 *
 * Atomically compare the current value of ptr with *expected and store newval
 * iff ptr and *expected have the same value. The current value of *ptr will
 * always be stored in *expected.
 *
 * Return true if values have been exchanged, false otherwise.
 *
 * Full barrier semantics.
 */
static inline bool
pg_atomic_compare_exchange_u32(volatile pg_atomic_uint32 *ptr,
                               uint32 *expected, uint32 newval)
{
    AssertPointerAlignment(ptr, 4);
    AssertPointerAlignment(expected, 4);

    return pg_atomic_compare_exchange_u32_impl(ptr, expected, newval);
}

bool
pg_atomic_compare_exchange_u32_impl(volatile pg_atomic_uint32 *ptr,
                                    uint32 *expected, uint32 newval)
{
    bool        ret;

    /*
     * Do atomic op under a spinlock. It might look like we could just skip
     * the cmpxchg if the lock isn't available, but that'd just emulate a
     * 'weak' compare and swap. I.e. one that allows spurious failures. Since
     * several algorithms rely on a strong variant and that is efficiently
     * implementable on most major architectures let's emulate it here as
     * well.
     */
    SpinLockAcquire((slock_t *) &ptr->sema);

    /* perform compare/exchange logic */
    ret = ptr->value == *expected;
    *expected = ptr->value;
    if (ret)
        ptr->value = newval;

    /* and release lock */
    SpinLockRelease((slock_t *) &ptr->sema);

    return ret;
}
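The advance-and-wrap pattern above can be reproduced in a few lines with standard C11 atomics. The sketch below keeps only the fetch-add plus strong compare-exchange skeleton; completePasses, the buffer_strategy_lock and StrategySyncStart() are deliberately omitted, so it illustrates the idea rather than the PostgreSQL implementation.

/* C11 sketch of the fetch-add + CAS wraparound in ClockSweepTick(). */
#include <stdio.h>
#include <stdatomic.h>
#include <stdint.h>

#define NBUFFERS 8

static _Atomic uint32_t nextVictimBuffer = 0;

static uint32_t
clock_sweep_tick(void)
{
    /* atomically move the hand ahead one buffer */
    uint32_t victim = atomic_fetch_add(&nextVictimBuffer, 1);

    if (victim >= NBUFFERS)
    {
        uint32_t expected = victim + 1;   /* raw value we expect to still see */
        uint32_t wrapped;

        victim = victim % NBUFFERS;

        /* only the caller that just crossed the boundary folds the shared
         * counter back into range; retry until the strong CAS succeeds
         * (on failure, expected is reloaded with the current value) */
        if (victim == 0)
        {
            for (;;)
            {
                wrapped = expected % NBUFFERS;
                if (atomic_compare_exchange_strong(&nextVictimBuffer,
                                                   &expected, wrapped))
                    break;
            }
        }
    }
    return victim;
}

int main(void)
{
    for (int i = 0; i < 20; i++)
        printf("%u ", clock_sweep_tick());
    printf("\n");   /* prints 0..7 repeatedly, wrapping at NBUFFERS */
    return 0;
}

A strong compare-exchange is used here for the same reason the fallback pg_atomic_compare_exchange_u32_impl() emulates one: a weak CAS may fail spuriously, and the wraparound logic assumes each retry fails only because another process really changed the hand.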

"怎么理解PostgreSQL中Clock Sweep算法"的内容就介绍到这里了,感谢大家的阅读。如果想了解更多行业相关的知识可以关注网站,小编将为大家输出更多高质量的实用文章!
