What Does PostgreSQL's BufferAlloc Function Do?


This article examines what the BufferAlloc function does in PostgreSQL: first the shared-memory data structures involved, then a commented walkthrough of the function's source code, and finally a gdb trace of an actual call.

I. Data Structures

BufferDesc
Shared descriptor/state data for a single shared buffer.

/*
 * Flags for buffer descriptors
 *
 * Note: TAG_VALID essentially means that there is a buffer hashtable
 * entry associated with the buffer's tag.
 */
#define BM_LOCKED               (1U << 22)  /* buffer header is locked */
#define BM_DIRTY                (1U << 23)  /* data needs writing */
#define BM_VALID                (1U << 24)  /* data is valid */
#define BM_TAG_VALID            (1U << 25)  /* tag is assigned */
#define BM_IO_IN_PROGRESS       (1U << 26)  /* read or write in progress */
#define BM_IO_ERROR             (1U << 27)  /* previous I/O failed */
#define BM_JUST_DIRTIED         (1U << 28)  /* dirtied since write started */
#define BM_PIN_COUNT_WAITER     (1U << 29)  /* have waiter for sole pin */
#define BM_CHECKPOINT_NEEDED    (1U << 30)  /* must write for checkpoint */
#define BM_PERMANENT            (1U << 31)  /* permanent buffer (not unlogged,
                                             * or init fork) */

/*
 *  BufferDesc -- shared descriptor/state data for a single shared buffer.
 *
 * Note: Buffer header lock (BM_LOCKED flag) must be held to examine or change
 * the tag, state or wait_backend_pid fields.  In general, buffer header lock
 * is a spinlock which is combined with flags, refcount and usagecount into
 * single atomic variable.  This layout allow us to do some operations in a
 * single atomic operation, without actually acquiring and releasing spinlock;
 * for instance, increase or decrease refcount.  buf_id field never changes
 * after initialization, so does not need locking.  freeNext is protected by
 * the buffer_strategy_lock not buffer header lock.  The LWLock can take care
 * of itself.  The buffer header lock is *not* used to control access to the
 * data in the buffer!
 *
 * It's assumed that nobody changes the state field while buffer header lock
 * is held.  Thus buffer header lock holder can do complex updates of the
 * state variable in single write, simultaneously with lock release (cleaning
 * BM_LOCKED flag).  On the other hand, updating of state without holding
 * buffer header lock is restricted to CAS, which insure that BM_LOCKED flag
 * is not set.  Atomic increment/decrement, OR/AND etc. are not allowed.
 *
 * An exception is that if we have the buffer pinned, its tag can't change
 * underneath us, so we can examine the tag without locking the buffer header.
 * Also, in places we do one-time reads of the flags without bothering to
 * lock the buffer header; this is generally for situations where we don't
 * expect the flag bit being tested to be changing.
 *
 * We can't physically remove items from a disk page if another backend has
 * the buffer pinned.  Hence, a backend may need to wait for all other pins
 * to go away.  This is signaled by storing its own PID into
 * wait_backend_pid and setting flag bit BM_PIN_COUNT_WAITER.  At present,
 * there can be only one such waiter per buffer.
 *
 * We use this same struct for local buffer headers, but the locks are not
 * used and not all of the flag bits are useful either. To avoid unnecessary
 * overhead, manipulations of the state field should be done without actual
 * atomic operations (i.e. only pg_atomic_read_u32() and
 * pg_atomic_unlocked_write_u32()).
 *
 * Be careful to avoid increasing the size of the struct when adding or
 * reordering members.  Keeping it below 64 bytes (the most common CPU
 * cache line size) is fairly important for performance.
 */
typedef struct BufferDesc
{
    BufferTag   tag;            /* ID of page contained in buffer */
    int         buf_id;         /* buffer's index number (from 0) */

    /* state of the tag, containing flags, refcount and usagecount */
    pg_atomic_uint32 state;

    int         wait_backend_pid;   /* backend PID of pin-count waiter */
    int         freeNext;       /* link in freelist chain */

    LWLock      content_lock;   /* to lock access to buffer contents */
} BufferDesc;
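
For reference, the layout of the single atomic state word is defined by companion macros in buf_internals.h (PostgreSQL 11): the low 18 bits hold the refcount, bits 18-21 the usage count, and bits 22-31 the flags listed above.

#define BUF_REFCOUNT_ONE        1
#define BUF_REFCOUNT_MASK       ((1U << 18) - 1)
#define BUF_USAGECOUNT_MASK     0x003C0000U
#define BUF_USAGECOUNT_ONE      (1U << 18)
#define BUF_USAGECOUNT_SHIFT    18
#define BUF_FLAG_MASK           0xFFC00000U

/* Get refcount and usagecount from buffer state */
#define BUF_STATE_GET_REFCOUNT(state) ((state) & BUF_REFCOUNT_MASK)
#define BUF_STATE_GET_USAGECOUNT(state) \
    (((state) & BUF_USAGECOUNT_MASK) >> BUF_USAGECOUNT_SHIFT)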

BufferTag
A buffer tag identifies which disk block the buffer contains.

/*
 * Buffer tag identifies which disk block the buffer contains.
 *
 * Note: the BufferTag data must be sufficient to determine where to write the
 * block, without reference to pg_class or pg_tablespace entries.  It's
 * possible that the backend flushing the buffer doesn't even believe the
 * relation is visible yet (its xact may have started before the xact that
 * created the rel).  The storage manager must be able to cope anyway.
 *
 * Note: if there's any pad bytes in the struct, INIT_BUFFERTAG will have
 * to be fixed to zero them, since this struct is used as a hash key.
 */
typedef struct buftag
{
    RelFileNode rnode;          /* physical relation identifier */
    ForkNumber  forkNum;
    BlockNumber blockNum;       /* blknum relative to begin of reln */
} BufferTag;
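
For reference, the macros that build and clear a tag (from buf_internals.h; CLEAR_BUFFERTAG exists precisely because of the pad-byte note above, zeroing every field so stray padding cannot pollute the hash key):

#define CLEAR_BUFFERTAG(a) \
( \
    (a).rnode.spcNode = InvalidOid, \
    (a).rnode.dbNode = InvalidOid, \
    (a).rnode.relNode = InvalidOid, \
    (a).forkNum = InvalidForkNumber, \
    (a).blockNum = InvalidBlockNumber \
)
#define INIT_BUFFERTAG(a,xx_rnode,xx_forkNum,xx_blockNum) \
( \
    (a).rnode = (xx_rnode), \
    (a).forkNum = (xx_forkNum), \
    (a).blockNum = (xx_blockNum) \
)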

SMgrRelation
smgr.c maintains a hash table of SMgrRelation objects, which are essentially cached file handles.

/*
 * smgr.c maintains a table of SMgrRelation objects, which are essentially
 * cached file handles.  An SMgrRelation is created (if not already present)
 * by smgropen(), and destroyed by smgrclose().  Note that neither of these
 * operations imply I/O, they just create or destroy a hashtable entry.
 * (But smgrclose() may release associated resources, such as OS-level file
 * descriptors.)
 *
 * An SMgrRelation may have an "owner", which is just a pointer to it from
 * somewhere else; smgr.c will clear this pointer if the SMgrRelation is
 * closed.  We use this to avoid dangling pointers from relcache to smgr
 * without having to make the smgr explicitly aware of relcache.  There
 * can't be more than one "owner" pointer per SMgrRelation, but that's
 * all we need.
 *
 * SMgrRelations that do not have an "owner" are considered to be transient,
 * and are deleted at end of transaction.
 */
typedef struct SMgrRelationData
{
    /* rnode is the hashtable lookup key, so it must be first! */
    RelFileNodeBackend smgr_rnode;  /* relation physical identifier */

    /* pointer to owning pointer, or NULL if none */
    struct SMgrRelationData **smgr_owner;

    /*
     * These next three fields are not actually used or manipulated by smgr,
     * except that they are reset to InvalidBlockNumber upon a cache flush
     * event (in particular, upon truncation of the relation).  Higher levels
     * store cached state here so that it will be reset when truncation
     * happens.  In all three cases, InvalidBlockNumber means "unknown".
     */
    BlockNumber smgr_targblock; /* current insertion target block */
    BlockNumber smgr_fsm_nblocks;   /* last known size of fsm fork */
    BlockNumber smgr_vm_nblocks;    /* last known size of vm fork */

    /* additional public fields may someday exist here */

    /*
     * Fields below here are intended to be private to smgr.c and its
     * submodules.  Do not touch them from elsewhere.
     */
    int         smgr_which;     /* storage manager selector */

    /*
     * for md.c; per-fork arrays of the number of open segments
     * (md_num_open_segs) and the segments themselves (md_seg_fds).
     */
    int         md_num_open_segs[MAX_FORKNUM + 1];
    struct _MdfdVec *md_seg_fds[MAX_FORKNUM + 1];

    /* if unowned, list link in list of all unowned SMgrRelations */
    struct SMgrRelationData *next_unowned_reln;
} SMgrRelationData;

typedef SMgrRelationData *SMgrRelation;
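
A minimal usage sketch of this API (hedged: rnode here stands in for an already-valid RelFileNode; the calls themselves are PostgreSQL 11's real smgr.c interface). Note that only smgrread() touches the file; open/close just manage the hashtable entry, with md.c opening file descriptors lazily.

    char         page[BLCKSZ];
    SMgrRelation reln;

    reln = smgropen(rnode, InvalidBackendId);   /* regular (shared) relation */
    smgrread(reln, MAIN_FORKNUM, 0, page);      /* read block 0 of the main fork */
    smgrclose(reln);                            /* drop the hashtable entry */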

RelFileNodeBackend
Augmenting a relfilenode with the backend ID provides all the information needed to locate the relation's physical storage.

/*
 * Augmenting a relfilenode with the backend ID provides all the information
 * we need to locate the physical storage.  The backend ID is InvalidBackendId
 * for regular relations (those accessible to more than one backend), or the
 * owning backend's ID for backend-local relations.  Backend-local relations
 * are always transient and removed in case of a database crash; they are
 * never WAL-logged or fsync'd.
 */
typedef struct RelFileNodeBackend
{
    RelFileNode node;           /* physical relation identifier */
    BackendId   backend;        /* owning backend, or InvalidBackendId */
} RelFileNodeBackend;
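
A small related macro from relfilenode.h makes the convention explicit: a relation is backend-local (temporary) exactly when the backend field holds a real backend ID.

#define RelFileNodeBackendIsTemp(rnode) \
    ((rnode).backend != InvalidBackendId)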

II. Source Code Walkthrough

BufferAlloc is a subroutine of ReadBuffer that handles the lookup of a shared buffer. If the buffer does not already exist, it selects a replacement victim and evicts the old page, but does NOT read in the new page.
The main processing logic is as follows:
1. Initialize; compute the hash value and partition lock ID from the tag
2. Check whether the block is already in the buffer pool
3. The buffer is found in the pool (buf_id >= 0)
3.1 Get the buffer descriptor and pin the buffer
3.2 If PinBuffer returns false (the page is not valid), call StartBufferIO; if StartBufferIO returns true, set *foundPtr to false so the caller performs the read
3.3 Return buf
4. The buffer is not found in the pool (buf_id < 0)
4.1 Release newPartitionLock
4.2 Loop until a suitable victim buffer is found
4.2.1 Ensure, while the spinlock is not yet held, that there is a free refcount entry
4.2.2 Select a victim buffer
4.2.3 Copy the buffer flags into oldFlags
4.2.4 Pin the buffer, then release the buffer spinlock
4.2.5 If the buffer is marked BM_DIRTY, write it out with FlushBuffer
4.2.6 If the buffer is marked BM_TAG_VALID, compute the old tag's hash code and partition lock ID, and lock the old and new partition locks in address order;
otherwise only the new partition is needed: lock the new partition lock and reset oldPartitionLock and oldHash
4.2.7 Try to make a hashtable entry for the buffer under its new tag
4.2.8 On a collision (buf_id >= 0), handle it just as if the buffer had been found in the buffer pool at the start
4.2.9 No collision (buf_id < 0): lock the buffer header; if the buffer has not been re-dirtied and carries no pin other than ours, the victim is usable, so break out of the loop;
otherwise unlock the buffer header, delete the hashtable entry, release the locks, and go look for another buffer
4.3 It is now safe to set the new buffer tag; afterwards unlock the buffer header, delete the old tag's hashtable entry, and release the partition locks
4.4 Call StartBufferIO and set the *foundPtr flag accordingly
4.5 Return buf

/*
 * BufferAlloc -- subroutine for ReadBuffer.  Handles lookup of a shared
 *      buffer.  If no buffer exists already, selects a replacement
 *      victim and evicts the old page, but does NOT read in new page.
 *
 * "strategy" can be a buffer replacement strategy object, or NULL for
 * the default strategy.  The selected buffer's usage_count is advanced when
 * using the default strategy, but otherwise possibly not (see PinBuffer).
 *
 * The returned buffer is pinned and is already marked as holding the
 * desired page.  If it already did have the desired page, *foundPtr is
 * set true.  Otherwise, *foundPtr is set false and the buffer is marked
 * as IO_IN_PROGRESS; ReadBuffer will now need to do I/O to fill it.
 *
 * *foundPtr is actually redundant with the buffer's BM_VALID flag, but
 * we keep it for simplicity in ReadBuffer.
 *
 * No locks are held either at entry or exit.
 */
static BufferDesc *
BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
            BlockNumber blockNum,
            BufferAccessStrategy strategy,
            bool *foundPtr)
{
    BufferTag   newTag;         /* identity of requested block */
    uint32      newHash;        /* hash value for newTag */
    LWLock     *newPartitionLock;   /* buffer partition lock for it */
    BufferTag   oldTag;         /* previous identity of selected buffer */
    uint32      oldHash;        /* hash value for oldTag */
    LWLock     *oldPartitionLock;   /* buffer partition lock for it */
    uint32      oldFlags;
    int         buf_id;
    BufferDesc *buf;
    bool        valid;
    uint32      buf_state;

    /* create a tag so we can lookup the buffer */
    INIT_BUFFERTAG(newTag, smgr->smgr_rnode.node, forkNum, blockNum);

    /* determine its hash code and partition lock ID */
    newHash = BufTableHashCode(&newTag);
    newPartitionLock = BufMappingPartitionLock(newHash);

    /* see if the block is in the buffer pool already */
    LWLockAcquire(newPartitionLock, LW_SHARED);
    buf_id = BufTableLookup(&newTag, newHash);
    if (buf_id >= 0)
    {
        /*
         * Found it.  Now, pin the buffer so no one can steal it from the
         * buffer pool, and check to see if the correct data has been loaded
         * into the buffer.
         */
        buf = GetBufferDescriptor(buf_id);

        valid = PinBuffer(buf, strategy);

        /* Can release the mapping lock as soon as we've pinned it */
        LWLockRelease(newPartitionLock);

        *foundPtr = true;

        if (!valid)
        {
            /*
             * We can only get here if (a) someone else is still reading in
             * the page, or (b) a previous read attempt failed.  We have to
             * wait for any active read attempt to finish, and then set up our
             * own read attempt if the page is still not BM_VALID.
             * StartBufferIO does it all.
             */
            if (StartBufferIO(buf, true))
            {
                /*
                 * If we get here, previous attempts to read the buffer must
                 * have failed ... but we shall bravely try again.
                 */
                *foundPtr = false;
            }
        }

        return buf;
    }

    /*
     * Didn't find it in the buffer pool.  We'll have to initialize a new
     * buffer.  Remember to unlock the mapping lock while doing the work.
     */
    LWLockRelease(newPartitionLock);

    /* Loop here in case we have to try another victim buffer */
    for (;;)
    {
        /*
         * Ensure, while the spinlock's not yet held, that there's a free
         * refcount entry.
         */
        ReservePrivateRefCountEntry();

        /*
         * Select a victim buffer.  The buffer is returned with its header
         * spinlock still held!
         */
        buf = StrategyGetBuffer(strategy, &buf_state);

        Assert(BUF_STATE_GET_REFCOUNT(buf_state) == 0);

        /* Must copy buffer flags while we still hold the spinlock */
        oldFlags = buf_state & BUF_FLAG_MASK;

        /* Pin the buffer and then release the buffer spinlock */
        PinBuffer_Locked(buf);

        /*
         * If the buffer was dirty, try to write it out.  There is a race
         * condition here, in that someone might dirty it after we released it
         * above, or even while we are writing it out (since our share-lock
         * won't prevent hint-bit updates).  We will recheck the dirty bit
         * after re-locking the buffer header.
         */
        if (oldFlags & BM_DIRTY)
        {
            /*
             * We need a share-lock on the buffer contents to write it out
             * (else we might write invalid data, eg because someone else is
             * compacting the page contents while we write).  We must use a
             * conditional lock acquisition here to avoid deadlock.  Even
             * though the buffer was not pinned (and therefore surely not
             * locked) when StrategyGetBuffer returned it, someone else could
             * have pinned and exclusive-locked it by the time we get here. If
             * we try to get the lock unconditionally, we'd block waiting for
             * them; if they later block waiting for us, deadlock ensues.
             * (This has been observed to happen when two backends are both
             * trying to split btree index pages, and the second one just
             * happens to be trying to split the page the first one got from
             * StrategyGetBuffer.)
             */
            if (LWLockConditionalAcquire(BufferDescriptorGetContentLock(buf),
                                         LW_SHARED))
            {
                /*
                 * If using a nondefault strategy, and writing the buffer
                 * would require a WAL flush, let the strategy decide whether
                 * to go ahead and write/reuse the buffer or to choose another
                 * victim.  We need lock to inspect the page LSN, so this
                 * can't be done inside StrategyGetBuffer.
                 */
                if (strategy != NULL)
                {
                    XLogRecPtr  lsn;

                    /* Read the LSN while holding buffer header lock */
                    buf_state = LockBufHdr(buf);
                    lsn = BufferGetLSN(buf);
                    UnlockBufHdr(buf, buf_state);

                    if (XLogNeedsFlush(lsn) &&
                        StrategyRejectBuffer(strategy, buf))
                    {
                        /* Drop lock/pin and loop around for another buffer */
                        LWLockRelease(BufferDescriptorGetContentLock(buf));
                        UnpinBuffer(buf, true);
                        continue;
                    }
                }

                /* OK, do the I/O */
                TRACE_POSTGRESQL_BUFFER_WRITE_DIRTY_START(forkNum, blockNum,
                                                          smgr->smgr_rnode.node.spcNode,
                                                          smgr->smgr_rnode.node.dbNode,
                                                          smgr->smgr_rnode.node.relNode);

                FlushBuffer(buf, NULL);
                LWLockRelease(BufferDescriptorGetContentLock(buf));

                ScheduleBufferTagForWriteback(&BackendWritebackContext,
                                              &buf->tag);

                TRACE_POSTGRESQL_BUFFER_WRITE_DIRTY_DONE(forkNum, blockNum,
                                                         smgr->smgr_rnode.node.spcNode,
                                                         smgr->smgr_rnode.node.dbNode,
                                                         smgr->smgr_rnode.node.relNode);
            }
            else
            {
                /*
                 * Someone else has locked the buffer, so give it up and loop
                 * back to get another one.
                 */
                UnpinBuffer(buf, true);
                continue;
            }
        }

        /*
         * To change the association of a valid buffer, we'll need to have
         * exclusive lock on both the old and new mapping partitions.
         */
        if (oldFlags & BM_TAG_VALID)
        {
            /*
             * Need to compute the old tag's hashcode and partition lock ID.
             * XXX is it worth storing the hashcode in BufferDesc so we need
             * not recompute it here?  Probably not.
             */
            oldTag = buf->tag;
            oldHash = BufTableHashCode(&oldTag);
            oldPartitionLock = BufMappingPartitionLock(oldHash);

            /*
             * Must lock the lower-numbered partition first to avoid
             * deadlocks.
             */
            if (oldPartitionLock < newPartitionLock)
            {
                LWLockAcquire(oldPartitionLock, LW_EXCLUSIVE);
                LWLockAcquire(newPartitionLock, LW_EXCLUSIVE);
            }
            else if (oldPartitionLock > newPartitionLock)
            {
                LWLockAcquire(newPartitionLock, LW_EXCLUSIVE);
                LWLockAcquire(oldPartitionLock, LW_EXCLUSIVE);
            }
            else
            {
                /* only one partition, only one lock */
                LWLockAcquire(newPartitionLock, LW_EXCLUSIVE);
            }
        }
        else
        {
            /* if it wasn't valid, we need only the new partition */
            LWLockAcquire(newPartitionLock, LW_EXCLUSIVE);
            /* remember we have no old-partition lock or tag */
            oldPartitionLock = NULL;
            /* this just keeps the compiler quiet about uninit variables */
            oldHash = 0;
        }

        /*
         * Try to make a hashtable entry for the buffer under its new tag.
         * This could fail because while we were writing someone else
         * allocated another buffer for the same block we want to read in.
         * Note that we have not yet removed the hashtable entry for the old
         * tag.
         */
        buf_id = BufTableInsert(&newTag, newHash, buf->buf_id);

        if (buf_id >= 0)
        {
            /*
             * Got a collision. Someone has already done what we were about to
             * do. We'll just handle this as if it were found in the buffer
             * pool in the first place.  First, give up the buffer we were
             * planning to use.
             */
            UnpinBuffer(buf, true);

            /* Can give up that buffer's mapping partition lock now */
            if (oldPartitionLock != NULL &&
                oldPartitionLock != newPartitionLock)
                LWLockRelease(oldPartitionLock);

            /* remaining code should match code at top of routine */

            buf = GetBufferDescriptor(buf_id);

            valid = PinBuffer(buf, strategy);

            /* Can release the mapping lock as soon as we've pinned it */
            LWLockRelease(newPartitionLock);

            *foundPtr = true;

            if (!valid)
            {
                /*
                 * We can only get here if (a) someone else is still reading
                 * in the page, or (b) a previous read attempt failed.  We
                 * have to wait for any active read attempt to finish, and
                 * then set up our own read attempt if the page is still not
                 * BM_VALID.  StartBufferIO does it all.
                 */
                if (StartBufferIO(buf, true))
                {
                    /*
                     * If we get here, previous attempts to read the buffer
                     * must have failed ... but we shall bravely try again.
                     */
                    *foundPtr = false;
                }
            }

            return buf;
        }

        /*
         * Need to lock the buffer header too in order to change its tag.
         */
        buf_state = LockBufHdr(buf);

        /*
         * Somebody could have pinned or re-dirtied the buffer while we were
         * doing the I/O and making the new hashtable entry.  If so, we can't
         * recycle this buffer; we must undo everything we've done and start
         * over with a new victim buffer.
         */
        oldFlags = buf_state & BUF_FLAG_MASK;
        if (BUF_STATE_GET_REFCOUNT(buf_state) == 1 && !(oldFlags & BM_DIRTY))
            break;

        UnlockBufHdr(buf, buf_state);
        BufTableDelete(&newTag, newHash);
        if (oldPartitionLock != NULL &&
            oldPartitionLock != newPartitionLock)
            LWLockRelease(oldPartitionLock);
        LWLockRelease(newPartitionLock);
        UnpinBuffer(buf, true);
    }

    /*
     * Okay, it's finally safe to rename the buffer.
     *
     * Clearing BM_VALID here is necessary, clearing the dirtybits is just
     * paranoia.  We also reset the usage_count since any recency of use of
     * the old content is no longer relevant.  (The usage_count starts out at
     * 1 so that the buffer can survive one clock-sweep pass.)
     *
     * Make sure BM_PERMANENT is set for buffers that must be written at every
     * checkpoint.  Unlogged buffers only need to be written at shutdown
     * checkpoints, except for their "init" forks, which need to be treated
     * just like permanent relations.
     */
    buf->tag = newTag;
    buf_state &= ~(BM_VALID | BM_DIRTY | BM_JUST_DIRTIED |
                   BM_CHECKPOINT_NEEDED | BM_IO_ERROR | BM_PERMANENT |
                   BUF_USAGECOUNT_MASK);
    if (relpersistence == RELPERSISTENCE_PERMANENT || forkNum == INIT_FORKNUM)
        buf_state |= BM_TAG_VALID | BM_PERMANENT | BUF_USAGECOUNT_ONE;
    else
        buf_state |= BM_TAG_VALID | BUF_USAGECOUNT_ONE;

    UnlockBufHdr(buf, buf_state);

    if (oldPartitionLock != NULL)
    {
        BufTableDelete(&oldTag, oldHash);
        if (oldPartitionLock != newPartitionLock)
            LWLockRelease(oldPartitionLock);
    }

    LWLockRelease(newPartitionLock);

    /*
     * Buffer contents are currently invalid.  Try to get the io_in_progress
     * lock.  If StartBufferIO returns false, then someone else managed to
     * read it before we did, so there's nothing left for BufferAlloc() to do.
     */
    if (StartBufferIO(buf, true))
        *foundPtr = false;
    else
        *foundPtr = true;

    return buf;
}

III. Tracing and Analysis

Test script: query a data table:

10:01:54 (xdb@[local]:5432)testdb=# select * from t1 limit 10;

Start gdb and set a breakpoint:

(gdb) b BufferAlloc
Breakpoint 1 at 0x8778ad: file bufmgr.c, line 1005.
(gdb) c
Continuing.

Breakpoint 1, BufferAlloc (smgr=0x2267430, relpersistence=112 'p', forkNum=MAIN_FORKNUM, blockNum=0, strategy=0x0,
    foundPtr=0x7ffcc97fb4f3) at bufmgr.c:1005
1005        INIT_BUFFERTAG(newTag, smgr->smgr_rnode.node, forkNum, blockNum);
(gdb)

Input parameters:
smgr - pointer to the SMgrRelationData struct
relpersistence - relation persistence ('p' = permanent, 'u' = unlogged, 't' = temporary; the 112 in the trace is ASCII 'p')
forkNum - fork type; MAIN_FORKNUM is the data file proper, with other forks covering the fsm/vm files
blockNum - block number
strategy - buffer access strategy, NULL here (the default strategy)
*foundPtr - output parameter

(gdb) p *smgr
$1 = {smgr_rnode = {node = {spcNode = 1663, dbNode = 16402, relNode = 51439}, backend = -1}, smgr_owner = 0x7f86133f3778,
  smgr_targblock = 4294967295, smgr_fsm_nblocks = 4294967295, smgr_vm_nblocks = 4294967295, smgr_which = 0,
  md_num_open_segs = {0, 0, 0, 0}, md_seg_fds = {0x0, 0x0, 0x0, 0x0}, next_unowned_reln = 0x0}
(gdb) p *smgr->smgr_owner
$2 = (struct SMgrRelationData *) 0x2267430
(gdb) p **smgr->smgr_owner
$3 = {smgr_rnode = {node = {spcNode = 1663, dbNode = 16402, relNode = 51439}, backend = -1}, smgr_owner = 0x7f86133f3778,
  smgr_targblock = 4294967295, smgr_fsm_nblocks = 4294967295, smgr_vm_nblocks = 4294967295, smgr_which = 0,
  md_num_open_segs = {0, 0, 0, 0}, md_seg_fds = {0x0, 0x0, 0x0, 0x0}, next_unowned_reln = 0x0}
(gdb)

1. Initialize; compute the hash value and partition lock ID from the tag

(gdb) n
1008        newHash = BufTableHashCode(&newTag);
(gdb) p newTag
$4 = {rnode = {spcNode = 1663, dbNode = 16402, relNode = 51439}, forkNum = MAIN_FORKNUM, blockNum = 0}
(gdb) n
1009        newPartitionLock = BufMappingPartitionLock(newHash);
(gdb)
1012        LWLockAcquire(newPartitionLock, LW_SHARED);
(gdb)
1013        buf_id = BufTableLookup(&newTag, newHash);
(gdb) p newHash
$5 = 1398580903
(gdb) p newPartitionLock
$6 = (LWLock *) 0x7f85e5db9600
(gdb) p *newPartitionLock
$7 = {tranche = 59, state = {value = 536870913}, waiters = {head = 2147483647, tail = 2147483647}}
(gdb)
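
For context, the partition lock is derived from the hash code by these buf_internals.h macros (PostgreSQL 11, where NUM_BUFFER_PARTITIONS is 128), so the newHash above maps to mapping partition 1398580903 % 128 = 39:

#define BufTableHashPartition(hashcode) \
    ((hashcode) % NUM_BUFFER_PARTITIONS)
#define BufMappingPartitionLock(hashcode) \
    (&MainLWLockArray[BUFFER_MAPPING_LWLOCK_OFFSET + \
        BufTableHashPartition(hashcode)].lock)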

2. Check whether the block is already in the buffer pool

(gdb) n
1014        if (buf_id >= 0)
(gdb) p buf_id
$8 = -1

4. The buffer is not found in the pool (buf_id < 0)
4.1 Release newPartitionLock
4.2 Loop until a suitable victim buffer is found
4.2.1 Ensure, while the spinlock is not yet held, that there is a free refcount entry --> ReservePrivateRefCountEntry

(gdb) n
1056        LWLockRelease(newPartitionLock);
(gdb)
1065            ReservePrivateRefCountEntry();
(gdb)

4.2.2 Select a victim buffer

(gdb) n
1071            buf = StrategyGetBuffer(strategy, &buf_state);
(gdb) n
1073            Assert(BUF_STATE_GET_REFCOUNT(buf_state) == 0);
(gdb) p buf
$9 = (BufferDesc *) 0x7f85e705fd80
(gdb) p *buf
$10 = {tag = {rnode = {spcNode = 0, dbNode = 0, relNode = 0}, forkNum = InvalidForkNumber, blockNum = 4294967295},
  buf_id = 104, state = {value = 4194304}, wait_backend_pid = 0, freeNext = -2, content_lock = {tranche = 54, state = {
      value = 536870912}, waiters = {head = 2147483647, tail = 2147483647}}}
(gdb)
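
StrategyGetBuffer first tries the freelist and only then falls back to a clock sweep over the buffer pool. A greatly simplified sketch of that sweep follows (hedged: ClockSweepTick() is the real file-local helper in freelist.c, but the actual code also handles the strategy ring and bgwriter statistics). Note the victim is returned with its header spinlock still held, which is why BufferAlloc copies oldFlags before releasing it.

    for (;;)
    {
        BufferDesc *candidate = GetBufferDescriptor(ClockSweepTick());
        uint32      local_state = LockBufHdr(candidate);

        if (BUF_STATE_GET_REFCOUNT(local_state) == 0)
        {
            if (BUF_STATE_GET_USAGECOUNT(local_state) == 0)
                return candidate;           /* cold and unpinned: evict it;
                                             * header spinlock stays held */
            local_state -= BUF_USAGECOUNT_ONE;  /* warm: decay, keep sweeping */
        }
        UnlockBufHdr(candidate, local_state);
    }

Incidentally, the state value 4194304 printed above is exactly 1<<22, i.e. BM_LOCKED: the only bit set on the freshly chosen victim is the header spinlock itself.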

4.2.3 Copy the buffer flags into oldFlags

(gdb) n
1076            oldFlags = buf_state & BUF_FLAG_MASK;
(gdb)

4.2.4 Pin the buffer, then release the buffer spinlock

(gdb)
1079            PinBuffer_Locked(buf);
(gdb)

4.2.5 If the buffer is marked BM_DIRTY, write it out with FlushBuffer (not taken in this trace: the victim is clean)

1088            if (oldFlags & BM_DIRTY)
(gdb)

4.2.6 If the buffer is marked BM_TAG_VALID, compute the old tag's hash code and partition lock ID, and lock the old and new partition locks in address order;
otherwise only the new partition is needed: lock the new partition lock and reset oldPartitionLock and oldHash

(gdb)
1166            if (oldFlags & BM_TAG_VALID)
(gdb)
1200                LWLockAcquire(newPartitionLock, LW_EXCLUSIVE);
(gdb)
1202                oldPartitionLock = NULL;
(gdb)
1204                oldHash = 0;
(gdb) p oldFlags
$11 = 4194304
(gdb)

4.2.7 Try to make a hashtable entry for the buffer under its new tag

(gdb)
1214            buf_id = BufTableInsert(&newTag, newHash, buf->buf_id);
(gdb) n
1216            if (buf_id >= 0)
(gdb) p buf_id
$12 = -1
(gdb)

4.2.9 No collision (buf_id < 0): lock the buffer header; if the buffer has not been re-dirtied and carries no pin other than ours, the victim is usable, so break out of the loop;
otherwise unlock the buffer header, delete the hashtable entry, release the locks, and go look for another buffer

(gdb) n
1267            buf_state = LockBufHdr(buf);
(gdb)
1275            oldFlags = buf_state & BUF_FLAG_MASK;
(gdb)
1276            if (BUF_STATE_GET_REFCOUNT(buf_state) == 1 && !(oldFlags & BM_DIRTY))
(gdb)
1277                break;
(gdb)

4.3 It is now safe to set the new buffer tag; afterwards unlock the buffer header, delete the old tag's hashtable entry, and release the partition locks

1301        buf->tag = newTag;
(gdb)
1302        buf_state &= ~(BM_VALID | BM_DIRTY | BM_JUST_DIRTIED |
(gdb)
1305        if (relpersistence == RELPERSISTENCE_PERMANENT || forkNum == INIT_FORKNUM)
(gdb)
1306            buf_state |= BM_TAG_VALID | BM_PERMANENT | BUF_USAGECOUNT_ONE;
(gdb)
1310        UnlockBufHdr(buf, buf_state);
(gdb)
1312        if (oldPartitionLock != NULL)
(gdb)
1319        LWLockRelease(newPartitionLock);
(gdb) p *buf
$13 = {tag = {rnode = {spcNode = 1663, dbNode = 16402, relNode = 51439}, forkNum = MAIN_FORKNUM, blockNum = 0},
  buf_id = 104, state = {value = 2181300225}, wait_backend_pid = 0, freeNext = -2, content_lock = {tranche = 54, state = {
      value = 536870912}, waiters = {head = 2147483647, tail = 2147483647}}}
(gdb)
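
As a cross-check, the new state.value = 2181300225 = 0x82040001 decodes, per the bit layout shown in section I, to BM_PERMANENT (1<<31) | BM_TAG_VALID (1<<25) | usage_count = 1 | refcount = 1, which is exactly what lines 1302-1306 just set. BM_VALID is still clear because the page has not been read in yet.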

4.4 Call StartBufferIO and set the *foundPtr flag accordingly

(gdb)
1326        if (StartBufferIO(buf, true))
(gdb) n
1327            *foundPtr = false;
(gdb)

4.5 Return buf

(gdb)
1331        return buf;
(gdb)
1332    }
(gdb)

Execution completes and control returns to ReadBuffer_common:

(gdb)
ReadBuffer_common (smgr=0x2267430, relpersistence=112 'p', forkNum=MAIN_FORKNUM, blockNum=0, mode=RBM_NORMAL, strategy=0x0,
    hit=0x7ffcc97fb5eb) at bufmgr.c:747
747         if (found)
(gdb)
750             pgBufferUsage.shared_blks_read++;
(gdb)
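
For reference, this is roughly how the caller consumes the result (condensed from ReadBuffer_common around the lines shown in the trace; not the verbatim source):

    bufHdr = BufferAlloc(smgr, relpersistence, forkNum, blockNum,
                         strategy, &found);
    if (found)
        pgBufferUsage.shared_blks_hit++;    /* page was already valid: no I/O */
    else
        pgBufferUsage.shared_blks_read++;   /* cache miss: this backend now
                                             * owns the BM_IO_IN_PROGRESS read */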

This concludes the walkthrough of what BufferAlloc does in PostgreSQL: it either finds the requested block already in the shared buffer pool, or claims a victim buffer, re-tags it, and hands it back for the caller to fill.
