通过Apache Spark和Pandas轻松介绍Apache Arrow

wptr33 2024-12-20 19:03 13 浏览

这次，我将尝试解释如何将Apache Arrow与Apache Spark和Python结合使用。首先，让我分享有关此开源项目的一些基本概念。

Apache Arrow是用于内存数据的跨语言开发平台。它为平面和分层数据指定了一种与语言无关的标准化列式存储格式，该格式组织用于在现代硬件上进行有效的分析操作。 [Apache箭头页面]

简而言之，它促进了许多组件之间的通信，例如，使用Python（熊猫）读取实木复合地板文件并转换为Spark数据框，Falcon Data Visualization或Cassandra，而无需担心转换。

一个好问题是问数据在内存中的外观如何？好吧，Apache Arrow利用列缓冲区来减少IO并加快分析处理性能。

在我们的例子中，我们将使用pyarrow库执行一些基本代码并检查一些功能。为了安装，我们有两个使用conda或pip命令*的选项。

conda install -c conda-forge pyarrow
pip install pyarrow

*建议在Python 3环境中使用conda。

带有HDFS的Apache Arrow（远程文件系统）

Apache Arrow附带了到Hadoop File System的基于C ++的接口的绑定。这意味着我们可以从HDFS读取或下载所有文件，并直接使用Python进行解释。

连接

主机是名称节点，端口通常是RPC或WEBHDFS，允许使用更多参数，例如user，kerberos ticket。强烈建议您阅读所需的环境变量。

import pyarrow as pa
host = '1970.x.x.x'
port = 8022
fs = pa.hdfs.connect(host, port)

· 如果您的连接位于数据或边缘节点的前面，则可以选择使用

fs = pa.hdfs.connect()

将Parquet文件写入HDFS

pq.write_to_dataset(table, root_path='dataset_name', partition_cols=['one', 'two'], filesystem=fs)

从HDFS读取CSV

import pandas as pd
from pyarrow import csv
import pyarrow as pa
fs = pa.hdfs.connect()
with fs.open('iris.csv', 'rb') as f: 
	df = pd.read_csv(f, nrows = 10)
	df.head()

从HDFS读取Parquet文件

有两种形式可以从HDFS读取实木复合地板文件

使用Pandas和Pyarrow引擎

import pandas as pd
pdIris = pd.read_parquet('hdfs:///iris/part-00000–27c8e2d3-fcc9–47ff-8fd1–6ef0b079f30e-c000.snappy.parquet', engine='pyarrow')
pdTrain.head()

Parquet

import pyarrow.parquet as pq
path = 'hdfs:///iris/part-00000–71c8h2d3-fcc9–47ff-8fd1–6ef0b079f30e-c000.snappy.parquet'
table = pq.read_table(path)
table.schema
df = table.to_pandas()
df.head()

其他文件扩展名

由于我们可以存储任何类型的文件（SAS，STATA，Excel，JSON或对象），因此Python可以轻松解释其中的大多数文件。为此，我们将使用open函数，该函数返回一个缓冲区对象，许多pandas函数（如read_sas，read_json）都可以接收该缓冲区对象作为输入，而不是字符串URL。

SAS

import pandas as pd
import pyarrow as pa
fs = pa.hdfs.connect()
with fs.open('/datalake/airplane.sas7bdat', 'rb') as f: 
	sas_df = pd.read_sas(f, format='sas7bdat')
	sas_df.head()

电子表格

import pandas as pd
import pyarrow as pa
fs = pa.hdfs.connect()
with fs.open('/datalake/airplane.xlsx', 'rb') as f: 
	g.download('airplane.xlsx')
	ex_df = pd.read_excel('airplane.xlsx')

JSON格式

import pandas as pd
import pyarrow as pa
fs = pa.hdfs.connect()
with fs.open('/datalake/airplane.json', 'rb') as f: 
	g.download('airplane.json')
	js_df = pd.read_json('airplane.json')

从HDFS下载文件

如果我们只需要下载文件，Pyarrow为我们提供了下载功能，可以将文件保存在本地。

import pandas as pd
import pyarrow as pa
fs = pa.hdfs.connect()
with fs.open('/datalake/airplane.cs', 'rb') as f: 
	g.download('airplane.cs')

上传文件到HDFS

如果我们只需要下载文件，Pyarrow为我们提供了下载功能，可以将文件保存在本地。

import pyarrow as pa
fs = pa.hdfs.connect()
with open('settings.xml') as f: 
	pa.hdfs.HadoopFileSystem.upload(fs, '/datalake/settings.xml', f)

Apache Arrow with Pandas（本地文件系统）

将Pandas Dataframe转换为Apache Arrow Table

import numpy as np
import pandas as pd
import pyarrow as pa
df = pd.DataFrame({'one': [20, np.nan, 2.5],'two': ['january', 'february', 'march'],'three': [True, False, True]},index=list('abc'))
table = pa.Table.from_pandas(df)

Pyarrow表到Pandas数据框

df_new = table.to_pandas()

读取CSV

from pyarrow import csv
fn = 'data/demo.csv'
table = csv.read_csv(fn)
 Ω

从Apache Arrow编写Parquet文件

import pyarrow.parquet as pq
pq.write_table(table, 'example.parquet')

读取Parquet文件

table2 = pq.read_table('example.parquet')
table2

从parquet文件中读取一些列

table2 = pq.read_table('example.parquet', columns=['one', 'three'])

从分区数据集读取

dataset = pq.ParquetDataset('dataset_name_directory/')
table = dataset.read()
table

将Parquet文件转换为Pandas DataFrame

pdf = pq.read_pandas('example.parquet', columns=['two']).to_pandas()
pdf

避免Pandas指数

table = pa.Table.from_pandas(df, preserve_index=False)
pq.write_table(table, 'example_noindex.parquet')
t = pq.read_table('example_noindex.parquet')
t.to_pandas()

检查元数据

parquet_file = pq.ParquetFile('example.parquet')
parquet_file.metadata

查看数据模式

parquet_file.schema

时间戳记

请记住，Pandas使用纳秒，因此您可以以毫秒为单位截断兼容性。

pq.write_table(table, where, coerce_timestamps='ms')
pq.write_table(table, where, coerce_timestamps='ms', allow_truncated_timestamps=True)

压缩

默认情况下，尽管允许其他编解码器，但Apache arrow使用快速压缩（压缩程度不高，但更易于访问）。

pq.write_table(table, where, compression='snappy')
pq.write_table(table, where, compression='gzip')
pq.write_table(table, where, compression='brotli')
pq.write_table(table, where, compression='none')

另外，在一个表中可以使用多个压缩

pq.write_table(table, 'example_diffcompr.parquet', compression={b'one': 'snappy', b'two': 'gzip'})

编写分区的Parquet表

df = pd.DataFrame({'one': [1, 2.5, 3], 'two': ['Peru', 'Brasil', 'Canada'], 'three': [True, False, True]}, index=list('abc'))
table = pa.Table.from_pandas(df)
pq.write_to_dataset(table, root_path='dataset_name',partition_cols=['one', 'two'])

· 兼容性说明：如果您使用pq.write_to_dataset创建一个供HIVE使用的表，则分区列值必须与您正在运行的HIVE版本的允许字符集兼容。

带有Apache Spark的Apache Arrow

Apache Arrow自2.3版本以来已与Spark集成在一起，它很好地演示了如何优化时间以避免序列化和反序列化过程，并与其他库进行了集成，例如Holden Karau上关于在Spark上加速Tensorflow Apache Arrow的演示。

存在其他有用的文章，例如Brian Cutler发表的文章以及Spark官方文档中的非常好的示例

Apache Arrow的一些有趣用法是：

· 加快从Pandas数据框到Spark数据框的转换

· 加快从Spark数据框到Pandas数据框的转换

· 与Pandas UDF（也称为矢量化UDF）一起使用

· 使用Apache Spark优化R

第三项是下一篇文章的一部分，因为这是一个非常有趣的主题，目的是在不损失性能的情况下扩展Pandas和Spark之间的集成，对于第四项，我建议您阅读该文章（于2019年发布！）以获得了解更多。

让我们先测试Pandas和Spark之间的转换，而不进行任何修改，然后再使用Arrow。

from pyspark.sql import SparkSession
warehouseLocation = "/antonio"
spark = SparkSession\
  .builder.appName("demoMedium")\
  .config("spark.sql.warehouse.dir", warehouseLocation)\
  .enableHiveSupport()\
  .getOrCreate()

#Create test Spark DataFrame
from pyspark.sql.functions import rand
df = spark.range(1 << 22).toDF("id").withColumn("x", rand())
df.printSchema()

#Benchmark time%time 
pdf = df.toPandas()spark.conf.set("spark.sql.execution.arrow.enabled", "true")
%time 
pdf = df.toPandas()
pdf.describe()

结果显然是使用Arrow减少时间转换更方便。

如果我们需要测试相反的情况（Pandas来激发df），那么我们也会及时发现优化。

%time df = spark.createDataFrame(pdf)
spark.conf.set("spark.sql.execution.arrow.enabled", "false")
%time 
df = spark.createDataFrame(pdf)
df.describe().show()

结论

本文的目的是发现并了解Apache Arrow以及它如何与Apache Spark和Pandas一起使用，我也建议您查看It的官方页面，以进一步了解CUDA或C ++等其他可能的集成，如果您想更深入地了解它，并了解有关Apache Spark的更多信息，我认为Spark：权威指南是一本很好的书。

附注：如果您有任何疑问，或者想澄清一些问题，可以在Twitter和LinkedIn上找到我。我最近发表了Apache Druid的简要介绍，这是一个新的Apache项目，非常适合分析数十亿行。

(本文翻译自Antonio Cachuan的文章《A gentle introduction to Apache Arrow with Apache Spark and Pandas》，参考：https://towardsdatascience.com/a-gentle-introduction-to-apache-arrow-with-apache-spark-and-pandas-bb19ffe0ddae)

spark安装

上一篇：用NAS白嫖8个AI平台 !含通义千问、智谱、讯飞星火、Kimi等
下一篇：万字详文:腾讯研究员详解 Spark 部署与工作原理

通过Apache Spark和Pandas轻松介绍Apache Arrow

带有HDFS的Apache Arrow（远程文件系统）

Apache Arrow with Pandas（本地文件系统）

带有Apache Spark的Apache Arrow

结论

相关推荐

C# 13 和 .NET 9 全知道 :13 使用 ASP.NET Core 构建网站 (1)

因果推断Matching方式实现代码因果推断模型

git pull命令使用实例 git pull--rebase

面试官:git pull是哪两个指令的组合?

git 执行pull错误如何撤销 git pull fail

git pull 和git fetch 命令分别有什么作用?二者有什么区别?

git fetch 和git pull 的异同 git中fetch和pull的区别

git pull 之后本地代码被覆盖解决方案

还可以这样玩?Git基本原理及各种骚操作，涨知识了

git命令之pull git.pull

通过Apache Spark和Pandas轻松介绍Apache Arrow

带有HDFS的Apache Arrow（远程文件系统）

Apache Arrow with Pandas（本地文件系统）

带有Apache Spark的Apache Arrow

结论

相关推荐

C# 13 和 .NET 9 全知道 :13 使用 ASP.NET Core 构建网站 (1)

因果推断Matching方式实现代码 因果推断模型

git pull命令使用实例 git pull--rebase

面试官:git pull是哪两个指令的组合?

git 执行pull错误如何撤销 git pull fail

git pull 和git fetch 命令分别有什么作用?二者有什么区别?

git fetch 和git pull 的异同 git中fetch和pull的区别

git pull 之后本地代码被覆盖 解决方案

还可以这样玩?Git基本原理及各种骚操作，涨知识了

git命令之pull git.pull

因果推断Matching方式实现代码因果推断模型

git pull 之后本地代码被覆盖解决方案