千家信息网

如何使用Data Lake Analytics + OSS分析CSV格式的TPC-H数据集

发表于:2024-10-19 作者:千家信息网编辑
千家信息网最后更新 2024年10月19日,如何使用Data Lake Analytics + OSS分析CSV格式的TPC-H数据集,很多新手对此不是很清楚,为了帮助大家解决这个难题,下面小编将为大家详细讲解,有这方面需求的人可以来学习下,希
千家信息网最后更新 2024年10月19日如何使用Data Lake Analytics + OSS分析CSV格式的TPC-H数据集

如何使用Data Lake Analytics + OSS分析CSV格式的TPC-H数据集,很多新手对此不是很清楚,为了帮助大家解决这个难题,下面小编将为大家详细讲解,有这方面需求的人可以来学习下,希望你能有所收获。

1. 开通Data Lake Analytics与OSS服务

如果您已经开通,可以跳过该步骤。如果没有开通,可以参考:https://help.aliyun.com/document_detail/70386.html
进行产品开通服务申请。

2. 下载TPC-H测试数据集

可以从这下载TPC-H 100MB的数据集:
https://public-datasets-cn-hangzhou.oss-cn-hangzhou.aliyuncs.com/tpch_100m_data.zip

3. 上传数据文件到OSS

登录阿里云官网的OSS控制台:https://oss.console.aliyun.com/overview
规划您要使用的OSS bucket,创建或选择好后,点击"文件管理",因为有8个数据文件,为每个数据文件创建对应的文件目录:

创建好8个目录如下:

点击进入目录,上传相应的数据文件,例如,customer目录,则上传customer.tbl文件。

上传好后,如下图。然后,依次把其他7个数据文件也上传到对应的目录下。

至此,8个数据文件都上传到了您的OSS bucket中:

oss://xxx/tpch_100m/customer/customer.tbloss://xxx/tpch_100m/lineitem/lineitem.tbloss://xxx/tpch_100m/nation/nation.tbloss://xxx/tpch_100m/orders/orders.tbloss://xxx/tpch_100m/part/part.tbloss://xxx/tpch_100m/partsupp/partsupp.tbloss://xxx/tpch_100m/region/region.tbloss://xxx/tpch_100m/supplier/supplier.tbl

4. 登录Data Lake Analytics控制台

https://openanalytics.console.aliyun.com/
点击"登录数据库",输入开通服务时分配的用户名和密码,登录Data Lake Analytics控制台。

5. 创建Schema和Table

输入创建SCHEMA的语句,点击"同步执行"。

CREATE SCHEMA tpch_100m with DBPROPERTIES(  LOCATION = 'oss://test-bucket-julian-1/tpch_100m/',  catalog='oss');

(注意:目前在同一个阿里云region,Data Lake Analytics的schema名全局唯一,建议schema名尽量根据业务定义,已有重名schema,在创建时会提示报错,则请换一个schema名字。)

Schema创建好后,在"数据库"的下拉框中,选择刚刚创建的schema。然后在SQL文本框中输入建表语句,点击同步执行。
建表语句语法参考:https://help.aliyun.com/document_detail/72006.html

TPC-H对应的8个表的建表语句如下,分别贴入文档框中执行(LOCATION子句中的数据文件位置请根据您的实际OSS bucket目录相应修改)。(注意:目前控制台中还不支持多个SQL语句执行,请单条语句执行。)

CREATE EXTERNAL TABLE nation (    N_NATIONKEY INT,     N_NAME STRING,       N_ID STRING,    N_REGIONKEY INT,     N_COMMENT STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' STORED AS TEXTFILE LOCATION 'oss://test-bucket-julian-1/tpch_100m/nation';CREATE EXTERNAL TABLE lineitem (    L_ORDERKEY INT,     L_PARTKEY INT,     L_SUPPKEY INT,     L_LINENUMBER INT,     L_QUANTITY DOUBLE,     L_EXTENDEDPRICE DOUBLE,     L_DISCOUNT DOUBLE,     L_TAX DOUBLE,     L_RETURNFLAG STRING,     L_LINESTATUS STRING,     L_SHIPDATE DATE,     L_COMMITDATE DATE,     L_RECEIPTDATE DATE,     L_SHIPINSTRUCT STRING,     L_SHIPMODE STRING,     L_COMMENT STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' STORED AS TEXTFILE LOCATION 'oss://test-bucket-julian-1/tpch_100m/lineitem';CREATE EXTERNAL TABLE orders (    O_ORDERKEY INT,     O_CUSTKEY INT,     O_ORDERSTATUS STRING,     O_TOTALPRICE DOUBLE,     O_ORDERDATE DATE,     O_ORDERPRIORITY STRING,     O_CLERK STRING,     O_SHIPPRIORITY INT,     O_COMMENT STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' STORED AS TEXTFILE LOCATION 'oss://test-bucket-julian-1/tpch_100m/orders';CREATE EXTERNAL TABLE supplier (    S_SUPPKEY INT,     S_NAME STRING,     S_ADDRESS STRING,     S_NATIONKEY INT,     S_PHONE STRING,     S_ACCTBAL DOUBLE,     S_COMMENT STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' STORED AS TEXTFILE LOCATION 'oss://test-bucket-julian-1/tpch_100m/supplier';CREATE EXTERNAL TABLE partsupp (    PS_PARTKEY INT,     PS_SUPPKEY INT,     PS_AVAILQTY INT,     PS_SUPPLYCOST DOUBLE,     PS_COMMENT STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' STORED AS TEXTFILE LOCATION 'oss://test-bucket-julian-1/tpch_100m/partsupp';CREATE EXTERNAL TABLE customer (    C_CUSTKEY INT,     C_NAME STRING,     C_ADDRESS STRING,     C_NATIONKEY INT,     C_PHONE STRING,     C_ACCTBAL DOUBLE,     C_MKTSEGMENT STRING,     C_COMMENT STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' STORED AS TEXTFILE LOCATION 'oss://test-bucket-julian-1/tpch_100m/customer';CREATE EXTERNAL TABLE part (    P_PARTKEY INT,     P_NAME STRING,     P_MFGR STRING,     P_BRAND STRING,     P_TYPE STRING,     P_SIZE INT,     P_CONTAINER STRING,     P_RETAILPRICE DOUBLE,     P_COMMENT STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' STORED AS TEXTFILE LOCATION 'oss://test-bucket-julian-1/tpch_100m/part';CREATE EXTERNAL TABLE region (    R_REGIONKEY INT,     R_NAME STRING,     R_COMMENT STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' STORED AS TEXTFILE LOCATION 'oss://test-bucket-julian-1/tpch_100m/region';

查看更多

建表完毕后,刷新页面,在左边导航条中能看到schema下的8张表。

6. 执行TPC-H查询

TPC-H总共22条查询,如下:
Q1:

SELECT   l_returnflag,         l_linestatus,         Sum(l_quantity)                                       AS sum_qty,         Sum(l_extendedprice)                                  AS sum_base_price,         Sum(l_extendedprice * (1 - l_discount))               AS sum_disc_price,         Sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)) AS sum_charge,         Avg(l_quantity)                                       AS avg_qty,         Avg(l_extendedprice)                                  AS avg_price,         Avg(l_discount)                                       AS avg_disc,         Count(*)                                              AS count_orderFROM     lineitemWHERE    l_shipdate <= date '1998-12-01' - INTERVAL '93' dayGROUP BY l_returnflag,         l_linestatusORDER BY l_returnflag,         l_linestatusLIMIT    1;

查看更多

Q2:

SELECT   s_acctbal,         s_name,         n_name,         p_partkey,         p_mfgr,         s_address,         s_phone,         s_commentFROM     part,         supplier,         partsupp,         nation,         regionWHERE    p_partkey = ps_partkeyAND      s_suppkey = ps_suppkeyAND      p_size = 35AND      p_type LIKE '%NICKEL'AND      s_nationkey = n_nationkeyAND      n_regionkey = r_regionkeyAND      r_name = 'MIDDLE EAST'

查看更多

Q3:

SELECT   l_orderkey,         Sum(l_extendedprice * (1 - l_discount)) AS revenue,         o_orderdate,         o_shippriorityFROM     customer,         orders,         lineitemWHERE    c_mktsegment = 'AUTOMOBILE'AND      c_custkey = o_custkeyAND      l_orderkey = o_orderkeyAND      o_orderdate < date '1995-03-31'AND      l_shipdate >  date '1995-03-31'GROUP BY l_orderkey,         o_orderdate,         o_shippriorityORDER BY revenue DESC,         o_orderdateLIMIT    10;

Q4:

SELECT   o_orderpriority,         Count(*) AS order_countFROM     orders,         lineitemWHERE    o_orderdate >= date '1997-10-01'AND      o_orderdate <  date '1997-10-01' + INTERVAL '3' monthAND      l_orderkey = o_orderkeyAND      l_commitdate < l_receiptdateGROUP BY o_orderpriorityORDER BY o_orderpriorityLIMIT    1;

Q5:

SELECT   n_name,         Sum(l_extendedprice * (1 - l_discount)) AS revenueFROM     customer,         orders,         lineitem,         supplier,         nation,         regionWHERE    c_custkey = o_custkeyAND      l_orderkey = o_orderkeyAND      l_suppkey = s_suppkeyAND      c_nationkey = s_nationkeyAND      s_nationkey = n_nationkeyAND      n_regionkey = r_regionkeyAND      r_name = 'ASIA'AND      o_orderdate >= date '1995-01-01'AND      o_orderdate <  date '1995-01-01' + INTERVAL '1' yearGROUP BY n_nameORDER BY revenue DESCLIMIT    1;

Q6:

SELECT sum(l_extendedprice * l_discount) AS revenueFROM lineitemWHERE l_shipdate >= date '1995-01-01'AND l_shipdate < date '1995-01-01' + interval '1' yearAND l_discount between 0.04 - 0.01 AND 0.04 + 0.01AND l_quantity < 24LIMIT 1;

Q7:

SELECT   supp_nation,         cust_nation,         l_year,         Sum(volume) AS revenueFROM     (                SELECT n1.n_name                          AS supp_nation,                       n2.n_name                          AS cust_nation,                       Extract(year FROM l_shipdate)      AS l_year,                       l_extendedprice * (1 - l_discount) AS volume                FROM   supplier,                       lineitem,                       orders,                       customer,                       nation n1,                       nation n2                WHERE  s_suppkey = l_suppkey                AND    o_orderkey = l_orderkey                AND    c_custkey = o_custkey                AND    s_nationkey = n1.n_nationkey                AND    c_nationkey = n2.n_nationkey                AND    ( (                                     n1.n_name = 'GERMANY'                              AND    n2.n_name = 'INDIA')                       OR     (                                     n1.n_name = 'INDIA'                              AND    n2.n_name = 'GERMANY') )                AND    l_shipdate BETWEEN date '1995-01-01' AND    date '1996-12-31' ) AS shippingGROUP BY supp_nation,         cust_nation,         l_yearORDER BY supp_nation,         cust_nation,         l_yearLIMIT    1;

查看更多

Q8:

SELECT   o_year,         Sum(         CASE                  WHEN nation = 'INDIA' THEN volume                  ELSE 0         end) / Sum(volume) AS mkt_shareFROM     (                SELECT Extract(year FROM o_orderdate)     AS o_year,                       l_extendedprice * (1 - l_discount) AS volume,                       n2.n_name                          AS nation                FROM   part,                       supplier,                       lineitem,                       orders,                       customer,                       nation n1,                       nation n2,                       region                WHERE  p_partkey = l_partkey                AND    s_suppkey = l_suppkey                AND    l_orderkey = o_orderkey                AND    o_custkey = c_custkey                AND    c_nationkey = n1.n_nationkey                AND    n1.n_regionkey = r_regionkey                AND    r_name = 'ASIA'                AND    s_nationkey = n2.n_nationkey                AND    o_orderdate BETWEEN date '1995-01-01' AND    date '1996-12-31'                AND    p_type = 'STANDARD ANODIZED STEEL' ) AS all_nationsGROUP BY o_yearORDER BY o_yearLIMIT    1;

查看更多

Q9:

SELECT   nation,         o_year,         Sum(amount) AS sum_profitFROM     (                SELECT n_name                                                          AS nation,                       Extract(year FROM o_orderdate) AS o_year,                       l_extendedprice * (1 - l_discount) - ps_supplycost * l_quantity AS amount                FROM   part,                       supplier,                       lineitem,                       partsupp,                       orders,                       nation                WHERE  s_suppkey = l_suppkey                AND    ps_suppkey = l_suppkey                AND    ps_partkey = l_partkey                AND    p_partkey = l_partkey                AND    o_orderkey = l_orderkey                AND    s_nationkey = n_nationkey                AND    p_name LIKE '%aquamarine%' ) AS profitGROUP BY nation,         o_yearORDER BY nation,         o_year DESCLIMIT    1;

查看更多

Q10:

SELECT   c_custkey,         c_name,         Sum(l_extendedprice * (1 - l_discount)) AS revenue,         c_acctbal,         n_name,         c_address,         c_phone,         c_commentFROM     customer,         orders,         lineitem,         nationWHERE    c_custkey = o_custkeyAND      l_orderkey = o_orderkeyAND      o_orderdate >= date '1994-08-01'AND      o_orderdate <  date '1994-08-01' + INTERVAL '3' monthAND      l_returnflag = 'R'AND      c_nationkey = n_nationkeyGROUP BY c_custkey,         c_name,         c_acctbal,         c_phone,         n_name,         c_address,         c_commentORDER BY revenue DESCLIMIT    20;

查看更多

Q11:

SELECT   ps_partkey,         Sum(ps_supplycost * ps_availqty) AS valueFROM     partsupp,         supplier,         nationWHERE    ps_suppkey = s_suppkeyAND      s_nationkey = n_nationkeyAND      n_name = 'PERU'GROUP BY ps_partkeyHAVING   Sum(ps_supplycost * ps_availqty) >(SELECT Sum(ps_supplycost * ps_availqty) * 0.0001000000 as sum_value  FROM partsupp,       supplier,       nation  WHERE  ps_suppkey = s_suppkey  AND    s_nationkey = n_nationkey  AND    n_name = 'PERU')ORDER BY value DESCLIMIT    1;

Q12:

SELECT l_shipmode, sum(case when o_orderpriority = '1-URGENT' or o_orderpriority = '2-HIGH' then 1    else 0end) AS high_line_count, sum(case when o_orderpriority <> '1-URGENT' and o_orderpriority <> '2-HIGH' then 1else 0end) AS low_line_countFROM orders,     lineitemWHERE o_orderkey = l_orderkeyAND l_shipmode in ('MAIL', 'TRUCK')AND l_commitdate < l_receiptdateAND l_shipdate < l_commitdateAND l_receiptdate >= date '1996-01-01'AND l_receiptdate < date '1996-01-01' + interval '1' yearGROUP BY l_shipmodeORDER BY l_shipmodeLIMIT 1;

Q13:

SELECT c_count, count(*) AS custdistFROM (    SELECT c_custkey, count(o_orderkey) AS c_count    FROM customer,         orders    WHERE c_custkey = o_custkey    AND o_comment NOT LIKE '%pending%accounts%'    GROUP BY c_custkey ) AS c_ordersGROUP BY c_countORDER BY custdist DESC, c_count DESCLIMIT 1;

Q14:

SELECT 100.00 * sum(case when p_type like 'PROMO%' then l_extendedprice * (1 - l_discount)    else 0end) / sum(l_extendedprice * (1 - l_discount)) AS promo_revenueFROM lineitem,     partWHERE l_partkey = p_partkeyAND l_shipdate >= date '1996-01-01'AND l_shipdate < date '1996-01-01' + interval '1' monthLIMIT 1;

Q15:

WITH revenue0 AS(SELECT l_suppkey AS supplier_no, sum(l_extendedprice * (1 - l_discount)) AS total_revenueFROM lineitemWHERE l_shipdate >= date '1993-01-01'AND l_shipdate < date '1993-01-01' + interval '3' monthGROUP BY l_suppkey)SELECT s_suppkey, s_name, s_address, s_phone, total_revenueFROM supplier, revenue0WHERE s_suppkey = supplier_noAND total_revenue IN (    SELECT max(total_revenue)    FROM revenue0 )ORDER BY s_suppkey;

Q16:

SELECT p_brand, p_type, p_size, count(distinct ps_suppkey) AS supplier_cntFROM partsupp,     partWHERE p_partkey = ps_partkeyAND p_brand <> 'Brand#23'AND p_type NOT LIKE 'PROMO BURNISHED%'AND p_size IN (1, 13, 10, 28, 21, 35, 31, 11)AND ps_suppkey NOT IN (    SELECT s_suppkey    FROM supplier    WHERE s_comment LIKE '%Customer%Complaints%' )GROUP BY p_brand, p_type, p_sizeORDER BY supplier_cnt DESC, p_brand, p_type, p_sizeLIMIT 1;

Q17:

SELECT    sum(l_extendedprice) / 7.0 AS avg_yearlyFROM    lineitem,    partWHERE p_partkey = l_partkey    AND p_brand = 'Brand#44'    AND p_container = 'WRAP PKG'    AND l_quantity < (        SELECT            0.2 * avg(l_quantity)        FROM            lineitem, part        WHERE            l_partkey = p_partkey    );

Q18:

SELECT c_name, c_custkey, o_orderkey, o_orderdate, o_totalprice, sum(l_quantity)FROM customer,     orders,     lineitemWHERE o_orderkey IN (      SELECT l_orderkey      FROM lineitem      GROUP BY l_orderkey      HAVING sum(l_quantity) > 315 )AND c_custkey = o_custkeyAND o_orderkey = l_orderkeyGROUP BY c_name, c_custkey, o_orderkey, o_orderdate, o_totalpriceORDER BY o_totalprice DESC, o_orderdateLIMIT 100;

Q19:

SELECT sum(l_extendedprice* (1 - l_discount)) AS revenueFROM lineitem,     partWHERE ( p_partkey = l_partkey and p_brand = 'Brand#12'        and p_container in ('SM CASE', 'SM BOX', 'SM PACK', 'SM PKG')        and l_quantity >= 6 and l_quantity <= 6 + 10        and p_size between 1 and 5        and l_shipmode in ('AIR', 'AIR REG')        and l_shipinstruct = 'DELIVER IN PERSON' )    or ( p_partkey = l_partkey and p_brand = 'Brand#13'        and p_container in ('MED BAG', 'MED BOX', 'MED PKG', 'MED PACK')        and l_quantity >= 10 and l_quantity <= 10 + 10        and p_size between 1 and 10        and l_shipmode in ('AIR', 'AIR REG')        and l_shipinstruct = 'DELIVER IN PERSON' )    or ( p_partkey = l_partkey and p_brand = 'Brand#24'        and p_container in ('LG CASE', 'LG BOX', 'LG PACK', 'LG PKG')        and l_quantity >= 21 and l_quantity <= 21 + 10        and p_size between 1 and 15        and l_shipmode in ('AIR', 'AIR REG')        and l_shipinstruct = 'DELIVER IN PERSON' )LIMIT 1;

查看更多

Q20:

with temp_table as( select 0.5 * sum(l_quantity) as col1 from lineitem,      partsupp where l_partkey = ps_partkey and l_suppkey = ps_suppkey and l_shipdate >= date '1993-01-01' and l_shipdate < date '1993-01-01' + interval '1' year)select s_name, s_addressfrom supplier,     nationwhere s_suppkey in (    select ps_suppkey    from partsupp,         temp_table    where ps_partkey in (        select p_partkey        from part        where p_name like 'dark%' )        and ps_availqty > temp_table.col1 )    and s_nationkey = n_nationkey and n_name = 'JORDAN'order by s_namelimit 1;

Q21:

select    s_name,    count(*) as numwaitfrom    supplier,    lineitem l1,    orders,    nationwhere    s_suppkey = l1.l_suppkey    and o_orderkey = l1.l_orderkey    and o_orderstatus = 'F'    and l1.l_receiptdate > l1.l_commitdate    and exists (        select            *        from            lineitem l2        where            l2.l_orderkey = l1.l_orderkey            and l2.l_suppkey <> l1.l_suppkey    )    and not exists (        select            *        from            lineitem l3        where            l3.l_orderkey = l1.l_orderkey            and l3.l_suppkey <> l1.l_suppkey            and l3.l_receiptdate > l3.l_commitdate    )    and s_nationkey = n_nationkey    and n_name = 'SAUDI ARABIA'group by    s_nameorder by    numwait desc,    s_namelimit 100;

查看更多

Q22:

with temp_table_1 as(  select avg(c_acctbal) as avg_value  from customer  where c_acctbal > 0.00 and substring(c_phone from 1 for 2)  in ('33', '29', '37', '35', '25', '27', '43')),temp_table_2 as(  select count(*) as count1  from orders, customer  where o_custkey = c_custkey)select cntrycode, count(*) as numcust, sum(c_acctbal) as totacctbalfrom (    select substring(c_phone from 1 for 2) as cntrycode, c_acctbal    from customer, temp_table_1, temp_table_2    where substring(c_phone        from 1        for 2) in ('33', '29', '37', '35', '25', '27', '43')        and c_acctbal > temp_table_1.avg_value        and temp_table_2.count1 = 0) as custsalegroup by cntrycodeorder by cntrycodelimit 1;

7. 异步执行查询

Data Lake Analytics支持"同步执行"模式和"异步执行"模式。"同步执行"模式下,控制台界面等待执行结果返回;"异步执行"模式下,立刻返回查询任务的ID。

点击"执行状态",可以看到该异步查询任务的执行状态,主要分为:"RUNNING","SUCCESS","FAILURE"。

点击"刷新",当STATUS变为"SUCCESS"时,表示查询成功,同时可查看查询耗时"ELAPSE_TIME"和查询扫描的数据字节数"SCANNED_DATA_BYTES"。

8. 查看查询历史

点击"执行历史",可以看到您执行的查询的历史详细信息,包括:
1)查询语句;
2)查询耗时与执行具体时间;
3)查询结果返回行数;
4)查询状态;
5)查询扫描的字节数;
6)结果集回写到的目标OSS文件(Data Lake Analytics会将查询结果集保存用户的bucket中)。

查询结果文件自动上传到用户同region的OSS bucket中,其中包括结果数据文件和结果集元数据描述文件。

{QueryLocation}/{query_name}|Unsaved}/{yyyy}/{mm}/{dd}/{query_id}/xxx.csv{QueryLocation}/{query_name}|Unsaved}/{yyyy}/{mm}/{dd}/{query_id}/xxx.csv.metadata

其中QueryLocation为:

aliyun-oa-query-results--

看完上述内容是否对您有帮助呢?如果还想对相关知识有进一步的了解或阅读更多相关文章,请关注行业资讯频道,感谢您对的支持。

查询 数据 文件 更多 结果 语句 目录 控制台 控制 模式 同步 登录 历史 状态 用户 支持 服务 输入 任务 字节 数据库的安全要保护哪些东西 数据库安全各自的含义是什么 生产安全数据库录入 数据库的安全性及管理 数据库安全策略包含哪些 海淀数据库安全审计系统 建立农村房屋安全信息数据库 易用的数据库客户端支持安全管理 连接数据库失败ssl安全错误 数据库的锁怎样保障安全 企业网络安全准入控制系统推荐 明日之后两个服务器哪个好 上海板材套料软件开发商 幼儿园开展网络安全周活动 java控制数据库代码 广东远程服务器租赁虚拟主机 绘图软件开发费税率 农行云音箱连接服务器时断时连 服务器管理工具有哪些 网络安全员cism 服务器markdown搭建 数据库原理张红娟习题4 网络安全管理专题培训 增强网络安全意识的内容 网站如何数据库的图片 软件开发学语言的顺序 竹山服务软件开发诚信为本 宁夏惠普服务器维修调试 英雄联盟开放服务器下载 浦东新区品牌软件开发信息中心 网络安全ps制作 征途sf服务器安全吗 服务器设置多界面 服务器没密码能进去吗 网络安全指挥中心是全额事业编吗 增强网络安全意识的内容 人工智能软件开发如何收费 软件开发商业计划书附录 文艺汇演专题网络安全宣传周简报 数据库减少了冗杂
0