千家信息网

PostgreSQL中ExecHashJoin依赖其他函数的实现逻辑分析

发表于:2025-01-23 作者:千家信息网编辑
千家信息网最后更新 2025年01月23日,这篇文章主要介绍"PostgreSQL中ExecHashJoin依赖其他函数的实现逻辑分析",在日常操作中,相信很多人在PostgreSQL中ExecHashJoin依赖其他函数的实现逻辑分析问题上存
千家信息网最后更新 2025年01月23日PostgreSQL中ExecHashJoin依赖其他函数的实现逻辑分析

这篇文章主要介绍"PostgreSQL中ExecHashJoin依赖其他函数的实现逻辑分析",在日常操作中,相信很多人在PostgreSQL中ExecHashJoin依赖其他函数的实现逻辑分析问题上存在疑惑,小编查阅了各式资料,整理出简单好用的操作方法,希望对大家解答"PostgreSQL中ExecHashJoin依赖其他函数的实现逻辑分析"的疑惑有所帮助!接下来,请跟着小编一起来学习吧!

这些函数在HJ_NEED_NEW_OUTER阶段中使用,包括ExecHashJoinOuterGetTuple、ExecPrepHashTableForUnmatched、ExecHashGetBucketAndBatch、ExecHashGetSkewBucket、ExecHashJoinSaveTuple和ExecFetchSlotMinimalTuple等。

一、数据结构

Plan
所有计划节点通过将Plan结构作为第一个字段从Plan结构"派生"。这确保了在将节点转换为计划节点时,一切都能正常工作。(在执行器中以通用方式传递时,节点指针经常被转换为Plan *)

/* ---------------- *      Plan node * * All plan nodes "derive" from the Plan structure by having the * Plan structure as the first field.  This ensures that everything works * when nodes are cast to Plan's.  (node pointers are frequently cast to Plan* * when passed around generically in the executor) * 所有计划节点通过将Plan结构作为第一个字段从Plan结构"派生"。 * 这确保了在将节点转换为计划节点时,一切都能正常工作。 * (在执行器中以通用方式传递时,节点指针经常被转换为Plan *) * * We never actually instantiate any Plan nodes; this is just the common * abstract superclass for all Plan-type nodes. * 从未实例化任何Plan节点;这只是所有Plan-type节点的通用抽象超类。 * ---------------- */typedef struct Plan{    NodeTag     type;//节点类型    /*     * 成本估算信息;estimated execution costs for plan (see costsize.c for more info)     */    Cost        startup_cost;   /* 启动成本;cost expended before fetching any tuples */    Cost        total_cost;     /* 总成本;total cost (assuming all tuples fetched) */    /*     * 优化器估算信息;planner's estimate of result size of this plan step     */    double      plan_rows;      /* 行数;number of rows plan is expected to emit */    int         plan_width;     /* 平均行大小(Byte为单位);average row width in bytes */    /*     * 并行执行相关的信息;information needed for parallel query     */    bool        parallel_aware; /* 是否参与并行执行逻辑?engage parallel-aware logic? */    bool        parallel_safe;  /* 是否并行安全;OK to use as part of parallel plan? */    /*     * Plan类型节点通用的信息.Common structural data for all Plan types.     */    int         plan_node_id;   /* unique across entire final plan tree */    List       *targetlist;     /* target list to be computed at this node */    List       *qual;           /* implicitly-ANDed qual conditions */    struct Plan *lefttree;      /* input plan tree(s) */    struct Plan *righttree;    List       *initPlan;       /* Init Plan nodes (un-correlated expr                                 * subselects) */    /*     * Information for management of parameter-change-driven rescanning     * parameter-change-driven重扫描的管理信息.     *      * extParam includes the paramIDs of all external PARAM_EXEC params     * affecting this plan node or its children.  setParam params from the     * node's initPlans are not included, but their extParams are.     *     * allParam includes all the extParam paramIDs, plus the IDs of local     * params that affect the node (i.e., the setParams of its initplans).     * These are _all_ the PARAM_EXEC params that affect this node.     */    Bitmapset  *extParam;    Bitmapset  *allParam;} Plan;

JoinState
Hash/NestLoop/Merge Join的基类

/* ---------------- *   JoinState information * *      Superclass for state nodes of join plans. *      Hash/NestLoop/Merge Join的基类 * ---------------- */typedef struct JoinState{    PlanState   ps;//基类PlanState    JoinType    jointype;//连接类型    //在找到一个匹配inner tuple的时候,如需要跳转到下一个outer tuple,则该值为T    bool        single_match;   /* True if we should skip to next outer tuple                                 * after finding one inner match */    //连接条件表达式(除了ps.qual)    ExprState  *joinqual;       /* JOIN quals (in addition to ps.qual) */} JoinState;

HashJoinState
Hash Join运行期状态结构体

/* these structs are defined in executor/hashjoin.h: */typedef struct HashJoinTupleData *HashJoinTuple;typedef struct HashJoinTableData *HashJoinTable;typedef struct HashJoinState{    JoinState   js;             /* 基类;its first field is NodeTag */    ExprState  *hashclauses;//hash连接条件    List       *hj_OuterHashKeys;   /* 外表条件链表;list of ExprState nodes */    List       *hj_InnerHashKeys;   /* 内表连接条件;list of ExprState nodes */    List       *hj_HashOperators;   /* 操作符OIDs链表;list of operator OIDs */    HashJoinTable hj_HashTable;//Hash表    uint32      hj_CurHashValue;//当前的Hash值    int         hj_CurBucketNo;//当前的bucket编号    int         hj_CurSkewBucketNo;//行倾斜bucket编号    HashJoinTuple hj_CurTuple;//当前元组    TupleTableSlot *hj_OuterTupleSlot;//outer relation slot    TupleTableSlot *hj_HashTupleSlot;//Hash tuple slot    TupleTableSlot *hj_NullOuterTupleSlot;//用于外连接的outer虚拟slot    TupleTableSlot *hj_NullInnerTupleSlot;//用于外连接的inner虚拟slot    TupleTableSlot *hj_FirstOuterTupleSlot;//    int         hj_JoinState;//JoinState状态    bool        hj_MatchedOuter;//是否匹配    bool        hj_OuterNotEmpty;//outer relation是否为空} HashJoinState;

HashJoinTable
Hash表数据结构

typedef struct HashJoinTableData{    int         nbuckets;       /* 内存中的hash桶数;# buckets in the in-memory hash table */    int         log2_nbuckets;  /* 2的对数(nbuckets必须是2的幂);its log2 (nbuckets must be a power of 2) */    int         nbuckets_original;  /* 首次hash时的桶数;# buckets when starting the first hash */    int         nbuckets_optimal;   /* 优化后的桶数(每个批次);optimal # buckets (per batch) */    int         log2_nbuckets_optimal;  /* 2的对数;log2(nbuckets_optimal) */    /* buckets[i] is head of list of tuples in i'th in-memory bucket */    //bucket [i]是内存中第i个桶中的元组链表的head item    union    {        /* unshared array is per-batch storage, as are all the tuples */        //未共享数组是按批处理存储的,所有元组均如此        struct HashJoinTupleData **unshared;        /* shared array is per-query DSA area, as are all the tuples */        //共享数组是每个查询的DSA区域,所有元组均如此        dsa_pointer_atomic *shared;    }           buckets;    bool        keepNulls;      /*如不匹配则存储NULL元组,该值为T;true to store unmatchable NULL tuples */    bool        skewEnabled;    /*是否使用倾斜优化?;are we using skew optimization? */    HashSkewBucket **skewBucket;    /* 倾斜的hash表桶数;hashtable of skew buckets */    int         skewBucketLen;  /* skewBucket数组大小;size of skewBucket array (a power of 2!) */    int         nSkewBuckets;   /* 活动的倾斜桶数;number of active skew buckets */    int        *skewBucketNums; /* 活动倾斜桶数组索引;array indexes of active skew buckets */    int         nbatch;         /* 批次数;number of batches */    int         curbatch;       /* 当前批次,第一轮为0;current batch #; 0 during 1st pass */    int         nbatch_original;    /* 在开始inner扫描时的批次;nbatch when we started inner scan */    int         nbatch_outstart;    /* 在开始outer扫描时的批次;nbatch when we started outer scan */    bool        growEnabled;    /* 关闭nbatch增加的标记;flag to shut off nbatch increases */    double      totalTuples;    /* 从inner plan获得的元组数;# tuples obtained from inner plan */    double      partialTuples;  /* 通过hashjoin获得的inner元组数;# tuples obtained from inner plan by me */    double      skewTuples;     /* 倾斜元组数;# tuples inserted into skew tuples */    /*     * These arrays are allocated for the life of the hash join, but only if     * nbatch > 1.  A file is opened only when we first write a tuple into it     * (otherwise its pointer remains NULL).  Note that the zero'th array     * elements never get used, since we will process rather than dump out any     * tuples of batch zero.     * 这些数组在散列连接的生命周期内分配,但仅当nbatch > 1时分配。     * 只有当第一次将元组写入文件时,文件才会打开(否则它的指针将保持NULL)。     * 注意,第0个数组元素永远不会被使用,因为批次0的元组永远不会转储.     */    BufFile   **innerBatchFile; /* 每个批次的inner虚拟临时文件缓存;buffered virtual temp file per batch */    BufFile   **outerBatchFile; /* 每个批次的outer虚拟临时文件缓存;buffered virtual temp file per batch */    /*     * Info about the datatype-specific hash functions for the datatypes being     * hashed. These are arrays of the same length as the number of hash join     * clauses (hash keys).     * 有关正在散列的数据类型的特定于数据类型的散列函数的信息。     * 这些数组的长度与散列连接子句(散列键)的数量相同。     */    FmgrInfo   *outer_hashfunctions;    /* outer hash函数FmgrInfo结构体;lookup data for hash functions */    FmgrInfo   *inner_hashfunctions;    /* inner hash函数FmgrInfo结构体;lookup data for hash functions */    bool       *hashStrict;     /* 每个hash操作符是严格?is each hash join operator strict? */    Size        spaceUsed;      /* 元组使用的当前内存空间大小;memory space currently used by tuples */    Size        spaceAllowed;   /* 空间使用上限;upper limit for space used */    Size        spacePeak;      /* 峰值的空间使用;peak space used */    Size        spaceUsedSkew;  /* 倾斜哈希表的当前空间使用情况;skew hash table's current space usage */    Size        spaceAllowedSkew;   /* 倾斜哈希表的使用上限;upper limit for skew hashtable */    MemoryContext hashCxt;      /* 整个散列连接存储的上下文;context for whole-hash-join storage */    MemoryContext batchCxt;     /* 该批次存储的上下文;context for this-batch-only storage */    /* used for dense allocation of tuples (into linked chunks) */    //用于密集分配元组(到链接块中)    HashMemoryChunk chunks;     /* 整个批次使用一个链表;one list for the whole batch */    /* Shared and private state for Parallel Hash. */    //并行hash使用的共享和私有状态    HashMemoryChunk current_chunk;  /* 后台进程的当前chunk;this backend's current chunk */    dsa_area   *area;           /* 用于分配内存的DSA区域;DSA area to allocate memory from */    ParallelHashJoinState *parallel_state;//并行执行状态    ParallelHashJoinBatchAccessor *batches;//并行访问器    dsa_pointer current_chunk_shared;//当前chunk的开始指针} HashJoinTableData;typedef struct HashJoinTableData *HashJoinTable;

HashJoinTupleData
Hash连接元组数据

/* ---------------------------------------------------------------- *              hash-join hash table structures * * Each active hashjoin has a HashJoinTable control block, which is * palloc'd in the executor's per-query context.  All other storage needed * for the hashjoin is kept in private memory contexts, two for each hashjoin. * This makes it easy and fast to release the storage when we don't need it * anymore.  (Exception: data associated with the temp files lives in the * per-query context too, since we always call buffile.c in that context.) * 每个活动的hashjoin都有一个可散列的控制块,它在执行程序的每个查询上下文中都是通过palloc分配的。 * hashjoin所需的所有其他存储都保存在私有内存上下文中,每个hashjoin有两个。 * 当不再需要它的时候,这使得释放它变得简单和快速。 * (例外:与临时文件相关的数据也存在于每个查询上下文中,因为在这种情况下总是调用buffile.c。) * * The hashtable contexts are made children of the per-query context, ensuring * that they will be discarded at end of statement even if the join is * aborted early by an error.  (Likewise, any temporary files we make will * be cleaned up by the virtual file manager in event of an error.) * hashtable上下文是每个查询上下文的子上下文,确保在语句结束时丢弃它们,即使连接因错误而提前中止。 *   (同样,如果出现错误,虚拟文件管理器将清理创建的任何临时文件。) * * Storage that should live through the entire join is allocated from the * "hashCxt", while storage that is only wanted for the current batch is * allocated in the "batchCxt".  By resetting the batchCxt at the end of * each batch, we free all the per-batch storage reliably and without tedium. * 通过整个连接的存储空间应从"hashCxt"分配,而只需要当前批处理的存储空间在"batchCxt"中分配。 * 通过在每个批处理结束时重置batchCxt,可以可靠地释放每个批处理的所有存储,而不会感到单调乏味。 *  * During first scan of inner relation, we get its tuples from executor. * If nbatch > 1 then tuples that don't belong in first batch get saved * into inner-batch temp files. The same statements apply for the * first scan of the outer relation, except we write tuples to outer-batch * temp files.  After finishing the first scan, we do the following for * each remaining batch: *  1. Read tuples from inner batch file, load into hash buckets. *  2. Read tuples from outer batch file, match to hash buckets and output. * 在内部关系的第一次扫描中,从执行者那里得到了它的元组。 * 如果nbatch > 1,那么不属于第一批的元组将保存到批内临时文件中。 * 相同的语句适用于外关系的第一次扫描,但是我们将元组写入外部批处理临时文件。 * 完成第一次扫描后,我们对每批剩余的元组做如下处理:  * 1.从内部批处理文件读取元组,加载到散列桶中。 * 2.从外部批处理文件读取元组,匹配哈希桶和输出。  * * It is possible to increase nbatch on the fly if the in-memory hash table * gets too big.  The hash-value-to-batch computation is arranged so that this * can only cause a tuple to go into a later batch than previously thought, * never into an earlier batch.  When we increase nbatch, we rescan the hash * table and dump out any tuples that are now of a later batch to the correct * inner batch file.  Subsequently, while reading either inner or outer batch * files, we might find tuples that no longer belong to the current batch; * if so, we just dump them out to the correct batch file. * 如果内存中的哈希表太大,可以动态增加nbatch。 * 散列值到批处理的计算是这样安排的: *   这只会导致元组进入比以前认为的更晚的批处理,而不会进入更早的批处理。 * 当增加nbatch时,重新扫描哈希表,并将现在属于后面批处理的任何元组转储到正确的内部批处理文件。 * 随后,在读取内部或外部批处理文件时,可能会发现不再属于当前批处理的元组; *   如果是这样,只需将它们转储到正确的批处理文件即可。 * ---------------------------------------------------------------- *//* these are in nodes/execnodes.h: *//* typedef struct HashJoinTupleData *HashJoinTuple; *//* typedef struct HashJoinTableData *HashJoinTable; */typedef struct HashJoinTupleData{    /* link to next tuple in same bucket */    //link同一个桶中的下一个元组    union    {        struct HashJoinTupleData *unshared;        dsa_pointer shared;    }           next;    uint32      hashvalue;      /* 元组的hash值;tuple's hash code */    /* Tuple data, in MinimalTuple format, follows on a MAXALIGN boundary */}           HashJoinTupleData;#define HJTUPLE_OVERHEAD  MAXALIGN(sizeof(HashJoinTupleData))#define HJTUPLE_MINTUPLE(hjtup)  \    ((MinimalTuple) ((char *) (hjtup) + HJTUPLE_OVERHEAD))

二、源码解读

ExecHashJoinOuterGetTuple
获取非并行模式下hashjoin的下一个外部元组:要么在第一次执行外部plan节点,要么从hashjoin批处理的临时文件中获取。

/*----------------------------------------------------------------------------------------------------                                    HJ_NEED_NEW_OUTER 阶段----------------------------------------------------------------------------------------------------*//* * ExecHashJoinOuterGetTuple * *      get the next outer tuple for a parallel oblivious hashjoin: either by *      executing the outer plan node in the first pass, or from the temp *      files for the hashjoin batches. *      获取非并行模式下hashjoin的下一个外部元组:要么在第一次执行外部plan节点,要么从hashjoin批处理的临时文件中获取。 * * Returns a null slot if no more outer tuples (within the current batch). * 如果没有更多外部元组(在当前批处理中),则返回空slot。 * * On success, the tuple's hash value is stored at *hashvalue --- this is * either originally computed, or re-read from the temp file. * 如果成功,tuple的散列值存储在输入参数*hashvalue中--这是最初计算的,或者是从临时文件中重新读取的。 */static TupleTableSlot *ExecHashJoinOuterGetTuple(PlanState *outerNode,//outer 节点                          HashJoinState *hjstate,//Hash Join执行状态                          uint32 *hashvalue)//Hash值{    HashJoinTable hashtable = hjstate->hj_HashTable;//hash表    int         curbatch = hashtable->curbatch;//当前批次    TupleTableSlot *slot;//返回的slot    if (curbatch == 0)          /* 第一个批次;if it is the first pass */    {        /*         * Check to see if first outer tuple was already fetched by         * ExecHashJoin() and not used yet.         * 检查第一个外部元组是否已经由ExecHashJoin()函数获取且尚未使用。         */        slot = hjstate->hj_FirstOuterTupleSlot;        if (!TupIsNull(slot))            hjstate->hj_FirstOuterTupleSlot = NULL;//重置slot        else            slot = ExecProcNode(outerNode);//如为NULL,则获取slot        while (!TupIsNull(slot))//slot不为NULL        {            /*             * We have to compute the tuple's hash value.             * 计算hash值             */            ExprContext *econtext = hjstate->js.ps.ps_ExprContext;//表达式计算上下文            econtext->ecxt_outertuple = slot;//存储获取的slot            if (ExecHashGetHashValue(hashtable, econtext,                                     hjstate->hj_OuterHashKeys,                                     true,  /* outer tuple */                                     HJ_FILL_OUTER(hjstate),                                     hashvalue))//计算Hash值            {                /* remember outer relation is not empty for possible rescan */                hjstate->hj_OuterNotEmpty = true;//设置标记(outer不为空)                return slot;//返回匹配的slot            }            /*             * That tuple couldn't match because of a NULL, so discard it and             * continue with the next one.             * 该元组无法匹配,丢弃它,继续下一个元组。             */            slot = ExecProcNode(outerNode);//继续获取下一个        }    }    else if (curbatch < hashtable->nbatch)//不是第一个批次    {        BufFile    *file = hashtable->outerBatchFile[curbatch];//获取缓冲的文件        /*         * In outer-join cases, we could get here even though the batch file         * is empty.         * 在外连接的情况下,即使批处理文件是空的,也可以在这里进行处理。         */        if (file == NULL)            return NULL;//如文件为NULL,则返回        slot = ExecHashJoinGetSavedTuple(hjstate,                                         file,                                         hashvalue,                                         hjstate->hj_OuterTupleSlot);//从文件中获取slot        if (!TupIsNull(slot))            return slot;//非NULL,则返回    }    /* End of this batch */    //已完成,则返回NULL    return NULL;}/* * ExecHashGetHashValue *      Compute the hash value for a tuple *      ExecHashGetHashValue - 计算元组的Hash值 * * The tuple to be tested must be in either econtext->ecxt_outertuple or * econtext->ecxt_innertuple.  Vars in the hashkeys expressions should have * varno either OUTER_VAR or INNER_VAR. * 要测试的元组必须位于econtext->ecxt_outertuple或econtext->ecxt_innertuple中。 * hashkeys表达式中的Vars应该具有varno,即OUTER_VAR或INNER_VAR。 * * A true result means the tuple's hash value has been successfully computed * and stored at *hashvalue.  A false result means the tuple cannot match * because it contains a null attribute, and hence it should be discarded * immediately.  (If keep_nulls is true then false is never returned.) * T意味着tuple的散列值已经成功计算并存储在*hashvalue参数中。 * F意味着元组不能匹配,因为它包含null属性,因此应该立即丢弃它。 * (如果keep_nulls为真,则永远不会返回F。) */boolExecHashGetHashValue(HashJoinTable hashtable,//Hash表                     ExprContext *econtext,//上下文                     List *hashkeys,//Hash键值链表                     bool outer_tuple,//是否外表元组                     bool keep_nulls,//是否保存NULL                     uint32 *hashvalue)//返回的Hash值{    uint32      hashkey = 0;//hash键    FmgrInfo   *hashfunctions;//hash函数    ListCell   *hk;//临时变量    int         i = 0;    MemoryContext oldContext;    /*     * We reset the eval context each time to reclaim any memory leaked in the     * hashkey expressions.     * 我们每次重置eval上下文来回收hashkey表达式中分配的内存。     */    ResetExprContext(econtext);    //切换上下文    oldContext = MemoryContextSwitchTo(econtext->ecxt_per_tuple_memory);    if (outer_tuple)        hashfunctions = hashtable->outer_hashfunctions;//外表元组    else        hashfunctions = hashtable->inner_hashfunctions;//内表元组    foreach(hk, hashkeys)//遍历Hash键值    {        ExprState  *keyexpr = (ExprState *) lfirst(hk);//键值表达式        Datum       keyval;        bool        isNull;        /* rotate hashkey left 1 bit at each step */        //哈希键左移1位        hashkey = (hashkey << 1) | ((hashkey & 0x80000000) ? 1 : 0);        /*         * Get the join attribute value of the tuple         * 获取元组的连接属性值         */        keyval = ExecEvalExpr(keyexpr, econtext, &isNull);        /*         * If the attribute is NULL, and the join operator is strict, then         * this tuple cannot pass the join qual so we can reject it         * immediately (unless we're scanning the outside of an outer join, in         * which case we must not reject it).  Otherwise we act like the         * hashcode of NULL is zero (this will support operators that act like         * IS NOT DISTINCT, though not any more-random behavior).  We treat         * the hash support function as strict even if the operator is not.         * 如果属性为NULL,并且join操作符是严格的,那么这个元组不能传递连接条件join qual,         *   因此可以立即拒绝它(除非正在扫描外连接的外表,在这种情况下不能拒绝它)。         * 否则,我们的行为就好像NULL的哈希码是零一样(这将支持IS NOT DISTINCT操作符,但不会有任何随机的情况出现)。         * 即使操作符不是严格的,也将哈希函数视为严格的。         *         * Note: currently, all hashjoinable operators must be strict since         * the hash index AM assumes that.  However, it takes so little extra         * code here to allow non-strict that we may as well do it.         * 注意:目前,所有哈希可连接操作符都必须严格,因为哈希索引AM假定如此。         *      但是,这里只需要很少的额外代码就可以实现非严格性,我们也可以这样做。         */        if (isNull)        {            //NULL值            if (hashtable->hashStrict[i] && !keep_nulls)            {                MemoryContextSwitchTo(oldContext);                //不保持NULL值,不匹配                return false;   /* cannot match */            }            /* else, leave hashkey unmodified, equivalent to hashcode 0 */            //否则的话,不修改hashkey,仍为0        }        else        {            //不为NULL            /* Compute the hash function */            //计算hash值            uint32      hkey;            hkey = DatumGetUInt32(FunctionCall1(&hashfunctions[i], keyval));            hashkey ^= hkey;        }        i++;//下一个键    }    //切换上下文    MemoryContextSwitchTo(oldContext);    //返回Hash键值    *hashvalue = hashkey;    return true;//成功获取}

ExecPrepHashTableForUnmatched
为ExecScanHashTableForUnmatched函数调用作准备

/* * ExecPrepHashTableForUnmatched *      set up for a series of ExecScanHashTableForUnmatched calls *      为ExecScanHashTableForUnmatched函数调用作准备 */voidExecPrepHashTableForUnmatched(HashJoinState *hjstate){    /*----------     * During this scan we use the HashJoinState fields as follows:     *     * hj_CurBucketNo: next regular bucket to scan     * hj_CurSkewBucketNo: next skew bucket (an index into skewBucketNums)     * hj_CurTuple: last tuple returned, or NULL to start next bucket     * 在这次扫描期间,我们使用HashJoinState结构体中的字段如下:     * hj_CurBucketNo: 下一个常规的bucket     * hj_CurSkewBucketNo: 下一个个倾斜的bucket     * hj_CurTuple: 最后返回的元组,或者为NULL(下一个bucket开始)     *----------     */    hjstate->hj_CurBucketNo = 0;    hjstate->hj_CurSkewBucketNo = 0;    hjstate->hj_CurTuple = NULL;}

ExecHashGetBucketAndBatch
确定哈希值的bucket号和批处理号

/* * ExecHashGetBucketAndBatch *      Determine the bucket number and batch number for a hash value * ExecHashGetBucketAndBatch *      确定哈希值的bucket号和批处理号 *  * Note: on-the-fly increases of nbatch must not change the bucket number * for a given hash code (since we don't move tuples to different hash * chains), and must only cause the batch number to remain the same or * increase.  Our algorithm is *      bucketno = hashvalue MOD nbuckets *      batchno = (hashvalue DIV nbuckets) MOD nbatch * where nbuckets and nbatch are both expected to be powers of 2, so we can * do the computations by shifting and masking.  (This assumes that all hash * functions are good about randomizing all their output bits, else we are * likely to have very skewed bucket or batch occupancy.) * 注意:nbatch的动态增加不能更改给定哈希码的桶号(因为我们不将元组移动到不同的哈希链), *   并且只能使批号保持不变或增加。我们的算法是: *      bucketno = hashvalue MOD nbuckets *      batchno = (hashvalue DIV nbuckets) MOD nbatch * 这里nbucket和nbatch都是2的幂,所以我们可以通过移动和屏蔽来进行计算。 * (这假定所有哈希函数都能很好地随机化它们的所有输出位,否则很可能会出现非常倾斜的桶或批处理占用。) * * nbuckets and log2_nbuckets may change while nbatch == 1 because of dynamic * bucket count growth.  Once we start batching, the value is fixed and does * not change over the course of the join (making it possible to compute batch * number the way we do here). * 当nbatch == 1时,由于动态bucket计数的增长,nbucket和log2_nbucket可能会发生变化。 * 一旦开始批处理,这个值就固定了,并且在连接过程中不会改变(这使得我们可以像这里那样计算批号)。 * * nbatch is always a power of 2; we increase it only by doubling it.  This * effectively adds one more bit to the top of the batchno. * nbatch总是2的幂;我们只是通过x2来调整。这相当于为批号的头部增加了一位。 */voidExecHashGetBucketAndBatch(HashJoinTable hashtable,                          uint32 hashvalue,                          int *bucketno,                          int *batchno){    uint32      nbuckets = (uint32) hashtable->nbuckets;//桶数    uint32      nbatch = (uint32) hashtable->nbatch;//批次号    if (nbatch > 1)//批次>1    {        /* we can do MOD by masking, DIV by shifting */        //我们可以通过屏蔽来实现MOD,通过移动来实现DIV        *bucketno = hashvalue & (nbuckets - 1);//nbuckets - 1后相当于N个1        *batchno = (hashvalue >> hashtable->log2_nbuckets) & (nbatch - 1);    }    else    {        *bucketno = hashvalue & (nbuckets - 1);//只有一个批次,简单处理即可        *batchno = 0;    }}

ExecHashGetSkewBucket
返回这个哈希值的倾斜桶的索引,如果哈希值与任何活动的倾斜桶没有关联,则返回INVALID_SKEW_BUCKET_NO。

/* * ExecHashGetSkewBucket * *      Returns the index of the skew bucket for this hashvalue, *      or INVALID_SKEW_BUCKET_NO if the hashvalue is not *      associated with any active skew bucket. *      返回这个哈希值的倾斜桶的索引,如果哈希值与任何活动的倾斜桶没有关联,则返回INVALID_SKEW_BUCKET_NO。 */intExecHashGetSkewBucket(HashJoinTable hashtable, uint32 hashvalue){    int         bucket;    /*     * Always return INVALID_SKEW_BUCKET_NO if not doing skew optimization (in     * particular, this happens after the initial batch is done).     * 如果不进行倾斜优化(特别是在初始批处理完成之后),则返回INVALID_SKEW_BUCKET_NO。     */    if (!hashtable->skewEnabled)        return INVALID_SKEW_BUCKET_NO;    /*     * Since skewBucketLen is a power of 2, we can do a modulo by ANDing.'     * 由于skewBucketLen是2的幂,可以通过AND操作来做一个模。     */    bucket = hashvalue & (hashtable->skewBucketLen - 1);    /*     * While we have not hit a hole in the hashtable and have not hit the     * desired bucket, we have collided with some other hash value, so try the     * next bucket location.     * 虽然我们没有在哈希表中找到一个hole,也没有找到所需的bucket,     *   但是与其他一些哈希值发生了冲突,所以尝试下一个bucket位置。     */    while (hashtable->skewBucket[bucket] != NULL &&           hashtable->skewBucket[bucket]->hashvalue != hashvalue)        bucket = (bucket + 1) & (hashtable->skewBucketLen - 1);    /*     * Found the desired bucket?     * 找到了bucket,返回     */    if (hashtable->skewBucket[bucket] != NULL)        return bucket;    /*     * There must not be any hashtable entry for this hash value.     */    //否则返回INVALID_SKEW_BUCKET_NO    return INVALID_SKEW_BUCKET_NO;}

ExecHashJoinSaveTuple
在批处理文件中保存元组.每个元组在文件中记录的是它的散列值,然后是最小化格式的元组。

/* * ExecHashJoinSaveTuple *      save a tuple to a batch file. *      在批处理文件中保存元组 *  * The data recorded in the file for each tuple is its hash value, * then the tuple in MinimalTuple format. * 每个元组在文件中记录的是它的散列值,然后是最小化格式的元组。 *  * Note: it is important always to call this in the regular executor * context, not in a shorter-lived context; else the temp file buffers * will get messed up. * 注意:在常规执行程序上下文中调用它总是很重要的,而不是在较短的生命周期中调用它; *   否则临时文件缓冲区就会出现混乱。 */voidExecHashJoinSaveTuple(MinimalTuple tuple, uint32 hashvalue,                      BufFile **fileptr){    BufFile    *file = *fileptr;//文件指针    size_t      written;//写入大小    if (file == NULL)    {        /* First write to this batch file, so open it. */        //文件指针为NULL,首次写入,则打开批处理文件        file = BufFileCreateTemp(false);        *fileptr = file;    }    //首先写入hash值,返回写入的大小    written = BufFileWrite(file, (void *) &hashvalue, sizeof(uint32));    if (written != sizeof(uint32))//写入有误,报错        ereport(ERROR,                (errcode_for_file_access(),                 errmsg("could not write to hash-join temporary file: %m")));    //写入tuple    written = BufFileWrite(file, (void *) tuple, tuple->t_len);    if (written != tuple->t_len)//写入有误,报错        ereport(ERROR,                (errcode_for_file_access(),                 errmsg("could not write to hash-join temporary file: %m")));}

ExecFetchSlotMinimalTuple
以最小化物理元组的格式提取slot的数据

/* -------------------------------- *      ExecFetchSlotMinimalTuple *          Fetch the slot's minimal physical tuple. *          以最小化物理元组的格式提取slot的数据. * *      If the given tuple table slot can hold a minimal tuple, indicated by a *      non-NULL get_minimal_tuple callback, the function returns the minimal *      tuple returned by that callback. It assumes that the minimal tuple *      returned by the callback is "owned" by the slot i.e. the slot is *      responsible for freeing the memory consumed by the tuple. Hence it sets *      *shouldFree to false, indicating that the caller should not free the *      memory consumed by the minimal tuple. In this case the returned minimal *      tuple should be considered as read-only. *      如果给定的元组table slot可以保存由non-NULL get_minimal_tuple回调函数指示的最小元组, *        则函数将返回该回调函数返回的最小元组。 *      它假定回调函数返回的最小元组由slot"拥有",即slot负责释放元组所消耗的内存。 *      因此,它将*shouldFree设置为false,表示调用方不应该释放内存。 *      在这种情况下,返回的最小元组应该被认为是只读的。 * *      If that callback is not supported, it calls copy_minimal_tuple callback *      which is expected to return a copy of minimal tuple represnting the *      contents of the slot. In this case *shouldFree is set to true, *      indicating the caller that it should free the memory consumed by the *      minimal tuple. In this case the returned minimal tuple may be written *      up. *      如果不支持该回调函数,则调用copy_minimal_tuple回调函数, *        该回调将返回一个表示slot内容的最小元组副本。 *      *shouldFree被设置为true,这表示调用者应该释放内存。 *      在这种情况下,可以写入返回的最小元组。 * -------------------------------- */MinimalTupleExecFetchSlotMinimalTuple(TupleTableSlot *slot,                          bool *shouldFree){    /*     * sanity checks     * 安全检查     */    Assert(slot != NULL);    Assert(!TTS_EMPTY(slot));    if (slot->tts_ops->get_minimal_tuple)//调用slot->tts_ops->get_minimal_tuple    {        //调用成功,则该元组为只读,由slot负责释放        if (shouldFree)            *shouldFree = false;        return slot->tts_ops->get_minimal_tuple(slot);    }    else    {        //调用不成功,设置为true,由调用方释放        if (shouldFree)            *shouldFree = true;        return slot->tts_ops->copy_minimal_tuple(slot);//调用copy_minimal_tuple函数    }}

三、跟踪分析

测试脚本如下

testdb=# set enable_nestloop=false;SETtestdb=# set enable_mergejoin=false;SETtestdb=# explain verbose select dw.*,grjf.grbh,grjf.xm,grjf.ny,grjf.je testdb-# from t_dwxx dw,lateral (select gr.grbh,gr.xm,jf.ny,jf.je testdb(#                         from t_grxx gr inner join t_jfxx jf testdb(#                                        on gr.dwbh = dw.dwbh testdb(#                                           and gr.grbh = jf.grbh) grjftestdb-# order by dw.dwbh;                                          QUERY PLAN                                           ----------------------------------------------------------------------------------------------- Sort  (cost=14828.83..15078.46 rows=99850 width=47)   Output: dw.dwmc, dw.dwbh, dw.dwdz, gr.grbh, gr.xm, jf.ny, jf.je   Sort Key: dw.dwbh   ->  Hash Join  (cost=3176.00..6537.55 rows=99850 width=47)         Output: dw.dwmc, dw.dwbh, dw.dwdz, gr.grbh, gr.xm, jf.ny, jf.je         Hash Cond: ((gr.grbh)::text = (jf.grbh)::text)         ->  Hash Join  (cost=289.00..2277.61 rows=99850 width=32)               Output: dw.dwmc, dw.dwbh, dw.dwdz, gr.grbh, gr.xm               Inner Unique: true               Hash Cond: ((gr.dwbh)::text = (dw.dwbh)::text)               ->  Seq Scan on public.t_grxx gr  (cost=0.00..1726.00 rows=100000 width=16)                     Output: gr.dwbh, gr.grbh, gr.xm, gr.xb, gr.nl               ->  Hash  (cost=164.00..164.00 rows=10000 width=20)                     Output: dw.dwmc, dw.dwbh, dw.dwdz                     ->  Seq Scan on public.t_dwxx dw  (cost=0.00..164.00 rows=10000 width=20)                           Output: dw.dwmc, dw.dwbh, dw.dwdz         ->  Hash  (cost=1637.00..1637.00 rows=100000 width=20)               Output: jf.ny, jf.je, jf.grbh               ->  Seq Scan on public.t_jfxx jf  (cost=0.00..1637.00 rows=100000 width=20)                     Output: jf.ny, jf.je, jf.grbh(20 rows)

启动gdb,设置断点

(gdb) b ExecHashJoinOuterGetTupleBreakpoint 1 at 0x702edc: file nodeHashjoin.c, line 807.(gdb) b ExecHashGetHashValueBreakpoint 2 at 0x6ff060: file nodeHash.c, line 1778.(gdb) b ExecHashGetBucketAndBatchBreakpoint 3 at 0x6ff1df: file nodeHash.c, line 1880.(gdb) b ExecHashJoinSaveTupleBreakpoint 4 at 0x703973: file nodeHashjoin.c, line 1214.(gdb)

ExecHashGetHashValue
ExecHashGetHashValue->进入函数ExecHashGetHashValue

(gdb) cContinuing.Breakpoint 2, ExecHashGetHashValue (hashtable=0x14acde8, econtext=0x149c3d0, hashkeys=0x14a8e40, outer_tuple=false,     keep_nulls=false, hashvalue=0x7ffc7eba5c20) at nodeHash.c:17781778        uint32      hashkey = 0;

ExecHashGetHashValue->初始化,切换内存上下文

1778        uint32      hashkey = 0;(gdb) n1781        int         i = 0;(gdb) 1788        ResetExprContext(econtext);(gdb) 1790        oldContext = MemoryContextSwitchTo(econtext->ecxt_per_tuple_memory);(gdb) 1792        if (outer_tuple)

ExecHashGetHashValue->inner hash函数

1792        if (outer_tuple)(gdb) 1795            hashfunctions = hashtable->inner_hashfunctions;

ExecHashGetHashValue->获取hahs键信息
1号RTE(varnoold = 1,即t_dwxx)的dwbh字段(varattno = 2)

(gdb) 1797        foreach(hk, hashkeys)(gdb) 1799            ExprState  *keyexpr = (ExprState *) lfirst(hk);(gdb) 1804            hashkey = (hashkey << 1) | ((hashkey & 0x80000000) ? 1 : 0);(gdb) p *keyexpr$1 = {tag = {type = T_ExprState}, flags = 2 '\002', resnull = false, resvalue = 0, resultslot = 0x0, steps = 0x14a8a00,   evalfunc = 0x6d1a6e , expr = 0x1498fc0, evalfunc_private = 0x6d1e97 ,   steps_len = 3, steps_alloc = 16, parent = 0x149b738, ext_params = 0x0, innermost_caseval = 0x0, innermost_casenull = 0x0,   innermost_domainval = 0x0, innermost_domainnull = 0x0}(gdb) p *(RelabelType *)keyexpr->expr$3 = {xpr = {type = T_RelabelType}, arg = 0x1499018, resulttype = 25, resulttypmod = -1, resultcollid = 100,   relabelformat = COERCE_IMPLICIT_CAST, location = -1}(gdb) p *((RelabelType *)keyexpr->expr)->arg$4 = {type = T_Var}(gdb) p *(Var *)((RelabelType *)keyexpr->expr)->arg$5 = {xpr = {type = T_Var}, varno = 65000, varattno = 2, vartype = 1043, vartypmod = 24, varcollid = 100, varlevelsup = 0,   varnoold = 1, varoattno = 2, location = 218}(gdb)

ExecHashGetHashValue->获取hash值,解析表达式

(gdb) n1809            keyval = ExecEvalExpr(keyexpr, econtext, &isNull);(gdb) 1824            if (isNull)(gdb) p hashkey$6 = 0(gdb) p keyval$7 = 140460362257270(gdb)

ExecHashGetHashValue->返回值不为NULL

(gdb) p isNull$8 = false(gdb) n1838                hkey = DatumGetUInt32(FunctionCall1(&hashfunctions[i], keyval));

ExecHashGetHashValue->计算Hash值

(gdb) n1839                hashkey ^= hkey;(gdb) p hkey$9 = 3663833849(gdb) p hashkey$10 = 0(gdb) n1842            i++;(gdb) p hashkey$11 = 3663833849(gdb)

ExecHashGetHashValue->返回计算结果

(gdb) n1797        foreach(hk, hashkeys)(gdb) 1845        MemoryContextSwitchTo(oldContext);(gdb) 1847        *hashvalue = hashkey;(gdb) 1848        return true;(gdb) 1849    }

ExecHashGetBucketAndBatch
ExecHashGetBucketAndBatch->进入ExecHashGetBucketAndBatch

(gdb) cContinuing.Breakpoint 3, ExecHashGetBucketAndBatch (hashtable=0x14acde8, hashvalue=3663833849, bucketno=0x7ffc7eba5bdc,     batchno=0x7ffc7eba5bd8) at nodeHash.c:18801880        uint32      nbuckets = (uint32) hashtable->nbuckets;

ExecHashGetBucketAndBatch->获取bucket数和批次数

1880        uint32      nbuckets = (uint32) hashtable->nbuckets;(gdb) n1881        uint32      nbatch = (uint32) hashtable->nbatch;(gdb) 1883        if (nbatch > 1)(gdb) p nbuckets$12 = 16384(gdb) p nbatch$13 = 1(gdb)

ExecHashGetBucketAndBatch->计算桶号和批次号(只有一个批次,设置为0)

(gdb) n1891            *bucketno = hashvalue & (nbuckets - 1);(gdb) 1892            *batchno = 0;(gdb) 1894    }(gdb) p bucketno$14 = (int *) 0x7ffc7eba5bdc(gdb) p *bucketno$15 = 11001(gdb)

ExecHashJoinOuterGetTuple
ExecHashJoinOuterGetTuple->进入ExecHashJoinOuterGetTuple函数

(gdb) info breakNum     Type           Disp Enb Address            What1       breakpoint     keep y   0x0000000000702edc in ExecHashJoinOuterGetTuple at nodeHashjoin.c:8072       breakpoint     keep y   0x00000000006ff060 in ExecHashGetHashValue at nodeHash.c:1778    breakpoint already hit 4 times3       breakpoint     keep y   0x00000000006ff1df in ExecHashGetBucketAndBatch at nodeHash.c:1880    breakpoint already hit 4 times4       breakpoint     keep y   0x0000000000703973 in ExecHashJoinSaveTuple at nodeHashjoin.c:1214(gdb) del 2(gdb) del 3(gdb) cContinuing.Breakpoint 1, ExecHashJoinOuterGetTuple (outerNode=0x149ba10, hjstate=0x149b738, hashvalue=0x7ffc7eba5ccc)    at nodeHashjoin.c:807807     HashJoinTable hashtable = hjstate->hj_HashTable;(gdb)

ExecHashJoinOuterGetTuple->查看输入参数
outerNode:outer relation为顺序扫描得到的relation(对t_jfxx进行顺序扫描)
hjstate:Hash Join执行状态
hashvalue:Hash值

(gdb) p *outerNode$16 = {type = T_SeqScanState, plan = 0x1494d10, state = 0x149b0f8, ExecProcNode = 0x71578d ,   ExecProcNodeReal = 0x71578d , instrument = 0x0, worker_instrument = 0x0, worker_jit_instrument = 0x0,   qual = 0x0, lefttree = 0x0, righttree = 0x0, initPlan = 0x0, subPlan = 0x0, chgParam = 0x0,   ps_ResultTupleSlot = 0x149c178, ps_ExprContext = 0x149bb28, ps_ProjInfo = 0x0, scandesc = 0x7fbfa69a8308}(gdb) p *hjstate$17 = {js = {ps = {type = T_HashJoinState, plan = 0x1496d18, state = 0x149b0f8, ExecProcNode = 0x70291d ,       ExecProcNodeReal = 0x70291d , instrument = 0x0, worker_instrument = 0x0, worker_jit_instrument = 0x0,       qual = 0x0, lefttree = 0x149ba10, righttree = 0x149c2b8, initPlan = 0x0, subPlan = 0x0, chgParam = 0x0,       ps_ResultTupleSlot = 0x14a7498, ps_ExprContext = 0x149b950, ps_ProjInfo = 0x149cef0, scandesc = 0x0},     jointype = JOIN_INNER, single_match = true, joinqual = 0x0}, hashclauses = 0x14a7b30, hj_OuterHashKeys = 0x14a8930,   hj_InnerHashKeys = 0x14a8e40, hj_HashOperators = 0x14a8ea0, hj_HashTable = 0x14acde8, hj_CurHashValue = 0,   hj_CurBucketNo = 0, hj_CurSkewBucketNo = -1, hj_CurTuple = 0x0, hj_OuterTupleSlot = 0x14a79f0,   hj_HashTupleSlot = 0x149cc18, hj_NullOuterTupleSlot = 0x0, hj_NullInnerTupleSlot = 0x0,   hj_FirstOuterTupleSlot = 0x149bbe8, hj_JoinState = 2, hj_MatchedOuter = false, hj_OuterNotEmpty = false}(gdb) p *hashvalue$18 = 32703(gdb)

ExecHashJoinOuterGetTuple->只有一个批次,批次号为0

(gdb) n808     int         curbatch = hashtable->curbatch;(gdb) 811     if (curbatch == 0)          /* if it is the first pass */(gdb) p curbatch$20 = 0

ExecHashJoinOuterGetTuple->获取首个outer tuple slot(不为NULL),重置hjstate->hj_FirstOuterTupleSlot为NULL

(gdb) n817         slot = hjstate->hj_FirstOuterTupleSlot;(gdb) 818         if (!TupIsNull(slot))(gdb) p *slot$21 = {type = T_TupleTableSlot, tts_isempty = false, tts_shouldFree = false, tts_shouldFreeMin = false, tts_slow = false,   tts_tuple = 0x14ac200, tts_tupleDescriptor = 0x7fbfa69a8308, tts_mcxt = 0x149afe0, tts_buffer = 345, tts_nvalid = 0,   tts_values = 0x149bc48, tts_isnull = 0x149bc70, tts_mintuple = 0x0, tts_minhdr = {t_len = 0, t_self = {ip_blkid = {        bi_hi = 0, bi_lo = 0}, ip_posid = 0}, t_tableOid = 0, t_data = 0x0}, tts_off = 0, tts_fixedTupleDescriptor = true}(gdb) (gdb) n819             hjstate->hj_FirstOuterTupleSlot = NULL;(gdb)

ExecHashJoinOuterGetTuple->循环获取,找到匹配的slot

(gdb) 823         while (!TupIsNull(slot))(gdb) n828             ExprContext *econtext = hjstate->js.ps.ps_ExprContext;(gdb)

ExecHashJoinOuterGetTuple->成功匹配,返回slot

(gdb) n830             econtext->ecxt_outertuple = slot;(gdb) 834                                      HJ_FILL_OUTER(hjstate),(gdb) 831             if (ExecHashGetHashValue(hashtable, econtext,(gdb) 838                 hjstate->hj_OuterNotEmpty = true;(gdb) 840                 return slot;(gdb) p *slot$22 = {type = T_TupleTableSlot, tts_isempty = false, tts_shouldFree = false, tts_shouldFreeMin = false, tts_slow = true,   tts_tuple = 0x14ac200, tts_tupleDescriptor = 0x7fbfa69a8308, tts_mcxt = 0x149afe0, tts_buffer = 345, tts_nvalid = 1,   tts_values = 0x149bc48, tts_isnull = 0x149bc70, tts_mintuple = 0x0, tts_minhdr = {t_len = 0, t_self = {ip_blkid = {        bi_hi = 0, bi_lo = 0}, ip_posid = 0}, t_tableOid = 0, t_data = 0x0}, tts_off = 2, tts_fixedTupleDescriptor = true}(gdb)

到此,关于"PostgreSQL中ExecHashJoin依赖其他函数的实现逻辑分析"的学习就结束了,希望能够解决大家的疑惑。理论与实践的搭配能更好的帮助大家学习,快去试试吧!若想继续学习更多相关知识,请继续关注网站,小编会继续努力为大家带来更多实用的文章!

0