千家信息网

PostgreSQL 源码解读(146)- Storage Manager#2(fsm_search_avail函数)

发表于:2025-02-02 作者:千家信息网编辑
千家信息网最后更新 2025年02月02日,本节简单介绍了PostgreSQL在执行插入过程中与存储相关的函数RecordAndGetPageWithFreeSpace->fsm_set_and_search,该函数设置给定的FSM page的
千家信息网最后更新 2025年02月02日PostgreSQL 源码解读(146)- Storage Manager#2(fsm_search_avail函数)

本节简单介绍了PostgreSQL在执行插入过程中与存储相关的函数RecordAndGetPageWithFreeSpace->fsm_set_and_search,该函数设置给定的FSM page的相应值,并返回合适的slot。

一、数据结构

FSMAddress
内部的FSM处理过程以逻辑地址scheme的方式工作,树的每一个层次都可以认为是一个独立的地址文件.

/* * The internal FSM routines work on a logical addressing scheme. Each * level of the tree can be thought of as a separately addressable file. * 内部的FSM处理过程工作在一个逻辑地址scheme上. * 树的每一个层次都可以认为是一个独立的地址文件. */typedef struct{    //层次    int         level;          /* level */    //该层次内的页编号    int         logpageno;      /* page number within the level */} FSMAddress;/* Address of the root page. *///根页地址static const FSMAddress FSM_ROOT_ADDRESS = {FSM_ROOT_LEVEL, 0};

FSMPage
FSM page数据结构.详细可参看src/backend/storage/freespace/README.

/* * Structure of a FSM page. See src/backend/storage/freespace/README for * details. * FSM page数据结构.详细可参看src/backend/storage/freespace/README. */typedef struct{    /*     * fsm_search_avail() tries to spread the load of multiple backends by     * returning different pages to different backends in a round-robin     * fashion. fp_next_slot points to the next slot to be returned (assuming     * there's enough space on it for the request). It's defined as an int,     * because it's updated without an exclusive lock. uint16 would be more     * appropriate, but int is more likely to be atomically     * fetchable/storable.     * fsm_search_avail()函数尝试通过在一轮循环中返回不同的页面到不同的后台进程,     *   从而分散在后台进程上分散负载.     * 该字段因为无需独占锁,因此定义为整型.     * unit16可能会更合适,但整型看起来更适合于原子提取和存储.     */    int         fp_next_slot;    /*     * fp_nodes contains the binary tree, stored in array. The first     * NonLeafNodesPerPage elements are upper nodes, and the following     * LeafNodesPerPage elements are leaf nodes. Unused nodes are zero.     * fp_nodes以数组的形式存储二叉树.     * 第一个NonLeafNodesPerPage元素是上一层的节点,接下来的LeafNodesPerPage元素是叶子节点.     * 未使用的节点为0.     */    uint8       fp_nodes[FLEXIBLE_ARRAY_MEMBER];} FSMPageData;typedef FSMPageData *FSMPage;

FSMLocalMap
对于小表,不需要创建FSM来存储空间信息,使用本地的内存映射信息.

/* Either already tried, or beyond the end of the relation *///已尝试或者已在表的末尾之后#define FSM_LOCAL_NOT_AVAIL 0x00/* Available to try *///可用于尝试#define FSM_LOCAL_AVAIL     0x01/* * For small relations, we don't create FSM to save space, instead we use * local in-memory map of pages to try.  To locate free space, we simply try * pages directly without knowing ahead of time how much free space they have. * 对于小表,不需要创建FSM来存储空间信息,使用本地的内存映射信息. * 为了定位空闲空间,我们不需要知道他们有多少空闲空间而是直接简单的对page进行尝试. * * Note that this map is used to the find the block with required free space * for any given relation.  We clear this map when we have found a block with * enough free space, when we extend the relation, or on transaction abort. * See src/backend/storage/freespace/README for further details. * 注意这个map用于搜索给定表的请求空闲空间. * 在找到有足够空闲空间的block/扩展了relation/在事务回滚时,则清除这个map的信息. * 详细可查看src/backend/storage/freespace/README. */typedef struct{    BlockNumber nblocks;//块数    uint8       map[HEAP_FSM_CREATION_THRESHOLD];//数组}           FSMLocalMap;static FSMLocalMap fsm_local_map ={    0,    {        FSM_LOCAL_NOT_AVAIL    }};#define FSM_LOCAL_MAP_EXISTS (fsm_local_map.nblocks > 0)

二、源码解读

fsm_set_and_search设置给定的FSM page的相应值,并返回合适的slot.
主要处理逻辑如下:
1.初始化相关变量
2.设置相应页面的FSM可用标记
3.如search_cat/minvalue(请求空间)不为0,则调用fsm_search_avail搜索slot
4.解锁,返回

搜索树的算法是逐渐扩展"search triangle"(暂且称为搜索三角),也就是说,被当前节点所覆盖的所有节点,确保我们从开始点向右搜索.在第一步,只有目标slot会被检查.当我们从左边子节点往上移动到父节点时,我们同时添加了父节点的右子树到搜索三角中.当我们从右边子节点往上移动到父节点时,我们同时删除了当前搜索三角(这时候已经知道了没有合适的page), 同时向右检索下一个更大尺寸的三角.因此,我们不会从起点向左搜索,在每一步搜索三角的尺寸都会翻倍,确保只需要log2(N)步就可以搜索N个页面.
例如,考虑下面这棵树:

       7   7       6 5   7   6   54 5 5 7 2 6 5 2            T

假定目标节点是字母T指示的节点,同时我们正在使用6或更大的数字搜索节点.检索从T开始.在第一轮迭代,移到右边,然后到父节点,到达最右边的节点 5.在第二轮迭代,移到右边,回卷,然后上溯,在第三个层次上到达节点 7这时候7满足我们的搜索要求,因此我们沿着7这条路径下降到底部.实际上,这是(考虑到回卷)起点右侧第一个满足条件的页面.

/* * Set value in given FSM page and slot. * 设置给定的FSM page的相应值,并返回合适的slot. * * If minValue > 0, the updated page is also searched for a page with at * least minValue of free space. If one is found, its slot number is * returned, -1 otherwise. * 如minValue大于0,更新后的page还将搜索空闲空间只是为最小值的page. * 如果找到了,返回slot编号,否则返回-1. *///search_slot = fsm_set_and_search(rel, addr, slot, old_cat, search_cat);static intfsm_set_and_search(Relation rel, FSMAddress addr, uint16 slot,                   uint8 newValue, uint8 minValue){    Buffer      buf;    Page        page;    int         newslot = -1;    //获取FSM的buffer    buf = fsm_readbuf(rel, addr, true);    //锁定    LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);    //获取FSM的page    page = BufferGetPage(buf);    if (fsm_set_avail(page, slot, newValue))        MarkBufferDirtyHint(buf, false);    if (minValue != 0)    {        /* Search while we still hold the lock */        //搜索slot        newslot = fsm_search_avail(buf, minValue,                                   addr.level == FSM_BOTTOM_LEVEL,                                   true);    }    UnlockReleaseBuffer(buf);    return newslot;}/* * Searches for a slot with category at least minvalue. * Returns slot number, or -1 if none found. * 搜索目录至少为最小值的slot. * * The caller must hold at least a shared lock on the page, and this * function can unlock and lock the page again in exclusive mode if it * needs to be updated. exclusive_lock_held should be set to true if the * caller is already holding an exclusive lock, to avoid extra work. * 调用者必须持有page共享锁,如page需要更新则该函数可能重新以独占模式解锁或锁定page. * 如调用者已持有独占锁exclusive_lock_held需设置为T,以避免额外的工作. * * If advancenext is false, fp_next_slot is set to point to the returned * slot, and if it's true, to the slot after the returned slot. * 如果advancenext是F,fp_next_slot被设置为指向返回的slot; * 如为T,则指向返回的slot后的slot. *///newslot = fsm_search_avail(buf, minValue, \                                   addr.level == FSM_BOTTOM_LEVEL, \                                   true);intfsm_search_avail(Buffer buf, uint8 minvalue, bool advancenext,                 bool exclusive_lock_held){    Page        page = BufferGetPage(buf);//获取page    FSMPage     fsmpage = (FSMPage) PageGetContents(page);//获取FSMPage    int         nodeno;    int         target;    uint16      slot;restart:    /*     * Check the root first, and exit quickly if there's no leaf with enough     * free space     * 首先检查根节点,如叶子节点无足够的空闲空间则快速退出.     */    if (fsmpage->fp_nodes[0] < minvalue)        return -1;    /*     * Start search using fp_next_slot.  It's just a hint, so check that it's     * sane.  (This also handles wrapping around when the prior call returned     * the last slot on the page.)     * 使用fp_next_slot开始搜索.     * 这个标记只是一个提示而已,检查它是否正常.     * (这同时处理了在上一次调用返回的页面最后一个slot的回卷问题)     */    //#define SlotsPerFSMPage   LeafNodesPerPage    //#define LeafNodesPerPage   (NodesPerPage - NonLeafNodesPerPage)    //#define NodesPerPage (BLCKSZ - MAXALIGN(SizeOfPageHeaderData) - \                       offsetof(FSMPageData, fp_nodes))     //#define NonLeafNodesPerPage (BLCKSZ / 2 - 1) = 4095     target = fsmpage->fp_next_slot;    if (target < 0 || target >= LeafNodesPerPage)        target = 0;    target += NonLeafNodesPerPage;    /*----------     * Start the search from the target slot.  At every step, move one     * node to the right, then climb up to the parent.  Stop when we reach     * a node with enough free space (as we must, since the root has enough     * space).     * 从目标slot开始检索.每一步,把节点移到右边,然后上溯父节点.     * 在到达一个有足够空闲空间的节点时停止(因为根节点有足够的空间).     *     * The idea is to gradually expand our "search triangle", that is, all     * nodes covered by the current node, and to be sure we search to the     * right from the start point.  At the first step, only the target slot     * is examined.  When we move up from a left child to its parent, we are     * adding the right-hand subtree of that parent to the search triangle.     * When we move right then up from a right child, we are dropping the     * current search triangle (which we know doesn't contain any suitable     * page) and instead looking at the next-larger-size triangle to its     * right.  So we never look left from our original start point, and at     * each step the size of the search triangle doubles, ensuring it takes     * only log2(N) work to search N pages.     * 算法思想是逐渐扩展"search triangle 搜索三角",也就是说,     *   被当前节点所覆盖的所有节点,确保我们从开始点向右搜索.     * 在第一步,只有目标slot会被检查.     * 当我们从左边子节点往上移动到父节点时,我们同时添加了父节点的右子树到搜索三角中.     * 当我们从右边子节点往上移动到父节点时,我们同时删除了当前搜索三角(这时候已经知道了没有合适的page),     *   同时向右检索下一个更大尺寸的三角.     * 因此,我们不会从起点向左搜索,在每一步搜索三角的尺寸都会翻倍,确保只需要log2(N)步就可以搜索N个页面.     *     * The "move right" operation will wrap around if it hits the right edge     * of the tree, so the behavior is still good if we start near the right.     * Note also that the move-and-climb behavior ensures that we can't end     * up on one of the missing nodes at the right of the leaf level.     * "move right"操作会在触碰到树的右边缘时回卷,因此在右边邻近开始搜索也是没有问题的.     * 注意move-and-climb操作确保了我们不能在叶子层右边的节点上结束.     *     * For example, consider this tree:     * 例如,考虑下面这棵树:     *     *         7     *     7       6     *   5   7   6   5     *  4 5 5 7 2 6 5 2     *              T     *     * Assume that the target node is the node indicated by the letter T,     * and we're searching for a node with value of 6 or higher. The search     * begins at T. At the first iteration, we move to the right, then to the     * parent, arriving at the rightmost 5. At the second iteration, we move     * to the right, wrapping around, then climb up, arriving at the 7 on the     * third level.  7 satisfies our search, so we descend down to the bottom,     * following the path of sevens.  This is in fact the first suitable page     * to the right of (allowing for wraparound) our start point.     * 假定目标节点是字母T指示的节点,同时我们正在使用6或更大的数字搜索节点.检索从T开始.     * 在第一轮迭代,移到右边,然后到父节点,到达最右边的节点 5.在第二轮迭代,移到右边,回卷,     *   然后上溯,在第三个层次上到达节点 7这时候7满足我们的搜索要求,     *   因此我们沿着7这条路径下降到底部.     * 实际上,这是(考虑到回卷)起点右侧第一个满足条件的页面.     *----------     */    nodeno = target;    while (nodeno > 0)    {        if (fsmpage->fp_nodes[nodeno] >= minvalue)            break;        /*         * Move to the right, wrapping around on same level if necessary, then         * climb up.         * 移动到右边,如需要在同一层次上回卷,然后上溯.         */        nodeno = parentof(rightneighbor(nodeno));    }    /*     * We're now at a node with enough free space, somewhere in the middle of     * the tree. Descend to the bottom, following a path with enough free     * space, preferring to move left if there's a choice.     * 现在已到达了有足够空闲空间的节点,树中间的某个位置上面.     * 沿着有足够空闲空间的路径下降到底部,如有选择,往左移动.     */    while (nodeno < NonLeafNodesPerPage)    {        //左树节点        int         childnodeno = leftchild(nodeno);        if (childnodeno < NodesPerPage &&            fsmpage->fp_nodes[childnodeno] >= minvalue)        {            //如有机会,往左移动            nodeno = childnodeno;            continue;        }        //指向右树节点        childnodeno++;          /* point to right child */        if (childnodeno < NodesPerPage &&            fsmpage->fp_nodes[childnodeno] >= minvalue)        {            //相当于下降了一层            nodeno = childnodeno;        }        else        {            /*             * Oops. The parent node promised that either left or right child             * has enough space, but neither actually did. This can happen in             * case of a "torn page", IOW if we crashed earlier while writing             * the page to disk, and only part of the page made it to disk.             * 父节点承诺了不是左树就是右树有足够的空间,但实际上并不是这样.             * 这会发生在"torn page"的情况下,在写入page到磁盘上时系统崩溃,             *   page只有一部分写入到磁盘中.             *             * Fix the corruption and restart.             * 修复损坏并重启.             */            RelFileNode rnode;            ForkNumber  forknum;            BlockNumber blknum;            //获取tag            BufferGetTag(buf, &rnode, &forknum, &blknum);            elog(DEBUG1, "fixing corrupt FSM block %u, relation %u/%u/%u",                 blknum, rnode.spcNode, rnode.dbNode, rnode.relNode);            /* make sure we hold an exclusive lock */            //确保持有独占锁            if (!exclusive_lock_held)            {                LockBuffer(buf, BUFFER_LOCK_UNLOCK);                LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);                exclusive_lock_held = true;            }            //重建FSM            fsm_rebuild_page(page);            MarkBufferDirtyHint(buf, false);            //重新搜索            goto restart;        }    }    /* We're now at the bottom level, at a node with enough space. */    //现在已到底层,在有足够空间的节点上.    slot = nodeno - NonLeafNodesPerPage;    /*     * Update the next-target pointer. Note that we do this even if we're only     * holding a shared lock, on the grounds that it's better to use a shared     * lock and get a garbled next pointer every now and then, than take the     * concurrency hit of an exclusive lock.     * 更新下一个目标块指针.     * 注意:就算我们只持有共享锁,也可以做这个事情,     *   基于这样的理由,最好是使用共享锁,时不时的获取一个打乱的下一个指针,     *   这样比持有独占锁要好.     *     * Wrap-around is handled at the beginning of this function.     * 在该函数的开始,回卷已处理.     */    fsmpage->fp_next_slot = slot + (advancenext ? 1 : 0);    return slot;}

三、跟踪分析

测试脚本

15:54:13 (xdb@[local]:5432)testdb=# insert into t1 values (1,'1','1');

启动gdb,设置断点

(gdb) b fsm_set_and_searchBreakpoint 1 at 0x888850: file freespace.c, line 676.(gdb) cContinuing.Breakpoint 1, fsm_set_and_search (rel=0x7fd0d2f10788, addr=..., slot=5, newValue=0 '\000', minValue=1 '\001')    at freespace.c:676676     int         newslot = -1;(gdb)

输入参数

(gdb) p *rel$1 = {rd_node = {spcNode = 1663, dbNode = 16402, relNode = 50820}, rd_smgr = 0x1233b00, rd_refcnt = 1, rd_backend = -1,   rd_islocaltemp = false, rd_isnailed = false, rd_isvalid = true, rd_indexvalid = 1 '\001', rd_statvalid = false,   rd_createSubid = 0, rd_newRelfilenodeSubid = 0, rd_rel = 0x7fd0d2f109a0, rd_att = 0x7fd0d2f10ab8, rd_id = 50820,   rd_lockInfo = {lockRelId = {relId = 50820, dbId = 16402}}, rd_rules = 0x0, rd_rulescxt = 0x0, trigdesc = 0x0,   rd_rsdesc = 0x0, rd_fkeylist = 0x0, rd_fkeyvalid = false, rd_partkeycxt = 0x0, rd_partkey = 0x0, rd_pdcxt = 0x0,   rd_partdesc = 0x0, rd_partcheck = 0x0, rd_indexlist = 0x7fd0d2f0f820, rd_oidindex = 0, rd_pkindex = 0,   rd_replidindex = 0, rd_statlist = 0x0, rd_indexattr = 0x0, rd_projindexattr = 0x0, rd_keyattr = 0x0, rd_pkattr = 0x0,   rd_idattr = 0x0, rd_projidx = 0x0, rd_pubactions = 0x0, rd_options = 0x0, rd_index = 0x0, rd_indextuple = 0x0,   rd_amhandler = 0, rd_indexcxt = 0x0, rd_amroutine = 0x0, rd_opfamily = 0x0, rd_opcintype = 0x0, rd_support = 0x0,   rd_supportinfo = 0x0, rd_indoption = 0x0, rd_indexprs = 0x0, rd_indpred = 0x0, rd_exclops = 0x0, rd_exclprocs = 0x0,   rd_exclstrats = 0x0, rd_amcache = 0x0, rd_indcollation = 0x0, rd_fdwroutine = 0x0, rd_toastoid = 0,   pgstat_info = 0x12275f0}(gdb) (gdb) p addr$2 = {level = 0, logpageno = 0}

获取buffere,并锁定

(gdb) n678     buf = fsm_readbuf(rel, addr, true);(gdb) 679     LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);(gdb) p buf$3 = 105(gdb)

获取page

(gdb) p page$4 = (Page) 0x7fd0bf41dc00 ""(gdb) p *page$5 = 0 '\000'

调用fsm_search_avail函数,进入fsm_search_avail

(gdb) n684         MarkBufferDirtyHint(buf, false);(gdb) 686     if (minValue != 0)(gdb) 690                                    addr.level == FSM_BOTTOM_LEVEL,(gdb) step689         newslot = fsm_search_avail(buf, minValue,(gdb) fsm_search_avail (buf=105, minvalue=1 '\001', advancenext=true, exclusive_lock_held=true) at fsmpage.c:161161     Page        page = BufferGetPage(buf);(gdb)

获取FSMPage,fp_nodes数组,0表示未使用

(gdb) n162     FSMPage     fsmpage = (FSMPage) PageGetContents(page);(gdb) p page$6 = (Page) 0x7fd0bf41dc00 ""(gdb) n173     if (fsmpage->fp_nodes[0] < minvalue)(gdb) p *fsmpage$7 = {fp_next_slot = 6, fp_nodes = 0x7fd0bf41dc1c "{{"}(gdb) p *fsmpage->fp_nodes$8 = 123 '{'(gdb) p fsmpage->fp_nodes[0]$9 = 123 '{'(gdb) p fsmpage->fp_nodes[1]$10 = 123 '{'(gdb) p fsmpage->fp_nodes[2]$11 = 0 '\000'(gdb) p fsmpage->fp_nodes[3]$12 = 123 '{'(gdb) p fsmpage->fp_nodes[4]$13 = 0 '\000'(gdb) p fsmpage->fp_nodes[5]$14 = 0 '\000'(gdb) p fsmpage->fp_nodes[6]$15 = 0 '\000'

使用fp_next_slot开始搜索.
从目标slot开始检索.每一步,把节点移到右边,然后上溯父节点.
在到达一个有足够空闲空间的节点时停止(因为根节点有足够的空间).

(gdb) n181     target = fsmpage->fp_next_slot;(gdb) 182     if (target < 0 || target >= LeafNodesPerPage)(gdb) p target$16 = 6(gdb) p LeafNodesPerPageNo symbol "__builtin_offsetof" in current context.(gdb) n184     target += NonLeafNodesPerPage;(gdb) 227     nodeno = target;(gdb) p target$17 = 4101(gdb)

循环,直至找到满足条件的节点.
方法是移动到右边,如需要在同一层次上回卷,然后上溯.

(gdb) p fsmpage->fp_nodes[nodeno]$19 = 0 '\000'(gdb) p minvalue$20 = 1 '\001'(gdb) n237         nodeno = parentof(rightneighbor(nodeno));(gdb) n228     while (nodeno > 0)(gdb) p nodeno$21 = 2050(gdb) n230         if (fsmpage->fp_nodes[nodeno] >= minvalue)(gdb) fsmpage->fp_nodes[nodeno]Undefined command: "fsmpage->fp_nodes".  Try "help".(gdb) p fsmpage->fp_nodes[nodeno]$22 = 0 '\000'(gdb) n237         nodeno = parentof(rightneighbor(nodeno));(gdb) 228     while (nodeno > 0)(gdb) 230         if (fsmpage->fp_nodes[nodeno] >= minvalue)(gdb) p nodeno$23 = 1025(gdb) n231             break;(gdb) p fsmpage->fp_nodes[nodeno]$24 = 1 '\001'(gdb)

现在已到达了有足够空闲空间的节点,树中间的某个位置上面.
沿着有足够空闲空间的路径下降到底部.
如有选择,往左移动.本例,往左移动了.

(gdb) n245     while (nodeno < NonLeafNodesPerPage)(gdb) p nodeno$25 = 1025(gdb) n247         int         childnodeno = leftchild(nodeno);(gdb) 249         if (childnodeno < NodesPerPage &&(gdb) p childnodeno$26 = 2051(gdb) n250             fsmpage->fp_nodes[childnodeno] >= minvalue)(gdb) p fsmpage->fp_nodes[childnodeno]$27 = 1 '\001'(gdb) n249         if (childnodeno < NodesPerPage &&(gdb) 252             nodeno = childnodeno;(gdb) 253             continue;(gdb)

找到了相应的叶子节点

(gdb) 245     while (nodeno < NonLeafNodesPerPage)(gdb) n247         int         childnodeno = leftchild(nodeno);(gdb) 249         if (childnodeno < NodesPerPage &&(gdb) 250             fsmpage->fp_nodes[childnodeno] >= minvalue)(gdb) 249         if (childnodeno < NodesPerPage &&(gdb) 255         childnodeno++;          /* point to right child */(gdb) 256         if (childnodeno < NodesPerPage &&(gdb) p childnodeno$28 = 4104(gdb) n257             fsmpage->fp_nodes[childnodeno] >= minvalue)(gdb) 256         if (childnodeno < NodesPerPage &&(gdb) 259             nodeno = childnodeno;(gdb) p childnodeno$29 = 4104(gdb) n245     while (nodeno < NonLeafNodesPerPage)(gdb)

现在已到底层,在有足够空间的节点上.
同时,更新下一个目标块指针.

(gdb) 293     slot = nodeno - NonLeafNodesPerPage;(gdb) n303     fsmpage->fp_next_slot = slot + (advancenext ? 1 : 0);(gdb) p slot$30 = 9(gdb) p nodeno$31 = 4104(gdb) n305     return slot;(gdb) 306 }(gdb)

回到fsm_set_and_search,返回slot

(gdb) fsm_set_and_search (rel=0x7fd0d2f10788, addr=..., slot=5, newValue=0 '\000', minValue=1 '\001') at freespace.c:694694     UnlockReleaseBuffer(buf);(gdb) 696     return newslot;(gdb) (gdb) p newslot$32 = 9(gdb)

DONE!

四、参考资料

PG Source Code
Database Cluster, Databases, and Tables

0