千家信息网

How to Define a Schema and Generate Parquet Files with Python

Published: 2025-01-19 | Last updated: 2025-01-19 | Author: 千家信息网 editorial staff

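This article shows how to define a schema in Python and generate Parquet files from it. The examples use pyarrow to build and write the data, and pandas to read it back. Two cases are covered: a schema with simple fields, and a schema with a nested field.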

I. Simple field definitions

1. Define the schema and generate the Parquet file

    import pandas as pd
    import pyarrow as pa
    import pyarrow.parquet as pq

    # Define the schema
    schema = pa.schema([
        ('id', pa.int32()),
        ('email', pa.string())
    ])

    # Prepare the data, one array per column
    ids = pa.array([1, 2], type=pa.int32())
    emails = pa.array(['first@example.com', 'second@example.com'], pa.string())

    # Build a record batch that follows the schema, then wrap it in a table
    batch = pa.RecordBatch.from_arrays(
        [ids, emails],
        schema=schema
    )
    table = pa.Table.from_batches([batch])

    # Write the Parquet file plain.parquet
    pq.write_table(table, 'plain.parquet')

2. Verify the Parquet data file

We can use the parquet-tools utility to inspect the schema and data of plain.parquet:

    $ parquet-tools schema plain.parquet
    message schema {
        optional int32 id;
        optional binary email (STRING);
    }

    $ parquet-tools cat --json plain.parquet
    {"id":1,"email":"first@example.com"}
    {"id":2,"email":"second@example.com"}

This matches what we expected. The schema and data can also be read back in Python, using pyarrow for the schema and pandas for the data:

    # Read only the schema from the file
    schema = pq.read_schema('plain.parquet')
    print(schema)

    # Load the data into a pandas DataFrame and print it as JSON
    df = pd.read_parquet('plain.parquet')
    print(df.to_json())

The printed schema lists id as int32 and email as string, and the JSON produced by df.to_json() holds the same two rows, keyed first by column name and then by row index.

II. Definitions with nested fields

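The next schema adds a nested object: address, which contains two child fields, email_address and post_address. The schema definition and the code that generates the Parquet file are as follows: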

    import pandas as pd
    import pyarrow as pa
    import pyarrow.parquet as pq

    # Child fields of the nested object
    address_fields = [
        ('email_address', pa.string()),
        ('post_address', pa.string()),
    ]

    # Define the schema; address is a struct that nests address_fields
    schema = pa.schema([
        pa.field('id', pa.int32()),
        pa.field('address', pa.struct(address_fields)),
    ])

    # Prepare the data; each tuple fills the struct fields in order
    ids = pa.array([1, 2], type=pa.int32())
    addresses = pa.array(
        [('first@example.com', 'city1'), ('second@example.com', 'city2')],
        pa.struct(address_fields)
    )

    # Build a record batch that follows the schema, then wrap it in a table
    batch = pa.RecordBatch.from_arrays(
        [ids, addresses],
        schema=schema
    )
    table = pa.Table.from_batches([batch])

    # Write the Parquet data to nested.parquet
    pq.write_table(table, 'nested.parquet')

1. Verify the Parquet data file

Again, use parquet-tools to inspect the nested.parquet file:

    $ parquet-tools schema nested.parquet
    message schema {
        optional int32 id;
        optional group address {
            optional binary email_address (STRING);
            optional binary post_address (STRING);
        }
    }

    $ parquet-tools cat --json nested.parquet
    {"id":1,"address":{"email_address":"first@example.com","post_address":"city1"}}
    {"id":2,"address":{"email_address":"second@example.com","post_address":"city2"}}

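The schema shown by parquet-tools does not contain the word struct; address appears as a group instead, which still reflects the nesting between address and its child fields.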

Now let's see what the schema and data of nested.parquet look like when read from Python:

    # Read only the schema from the file
    schema = pq.read_schema('nested.parquet')
    print(schema)

    # Load the data into a pandas DataFrame and print it as JSON
    df = pd.read_parquet('nested.parquet')
    print(df.to_json())

Output:

    id: int32
      -- field metadata --
      PARQUET:field_id: '1'
    address: struct<email_address: string, post_address: string>
      child 0, email_address: string
        -- field metadata --
        PARQUET:field_id: '3'
      child 1, post_address: string
        -- field metadata --
        PARQUET:field_id: '4'
      -- field metadata --
      PARQUET:field_id: '2'
    {"id":{"0":1,"1":2},"address":{"0":{"email_address":"first@example.com","post_address":"city1"},"1":{"email_address":"second@example.com","post_address":"city2"}}}

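The data is of course the same. The one difference is in the displayed schema: pyarrow labels address explicitly as a struct type, rather than only showing the nesting hierarchy as parquet-tools does.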

That covers how to define a schema in Python and generate Parquet files, for both simple and nested fields. Some of these techniques are likely to come up in day-to-day work.
