How to Use GPU Acceleration in Spark 3.0

This article looks at how to use GPU acceleration in Spark 3.0. Many readers may not be familiar with the topic, so the key points are summarized below.


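Overview

The RAPIDS Accelerator for Apache Spark uses GPUs to accelerate data processing and is built on the RAPIDS libraries.

As data scientists move from traditional data analytics to AI applications to meet complex market demands, traditional CPU-based processing can no longer keep up on speed or cost. Fast-growing AI analytics needs a new framework that processes data quickly and saves cost, and GPUs are the way to get there.

The RAPIDS Accelerator for Apache Spark combines the RAPIDS cuDF library with the Spark distributed computing framework. The RAPIDS Accelerator library also has a built-in accelerated shuffle based on UCX, which can be configured to use GPU-to-GPU communication and RDMA capabilities.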

Spark RAPIDS Downloads (v0.4.1)

RAPIDS Spark Package
cuDF 11.0 Package
cuDF 10.2 Package
cuDF 10.1 Package

RAPIDS Notebooks

cuML Notebooks
cuGraph Notebooks
CLX Notebooks
cuSpatial Notebooks
cuxfilter Notebooks
XGBoost Notebooks

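Introduction

These notebooks provide examples of how to use RAPIDS. They are designed to be self-contained with the runtime version of the RAPIDS Docker Container and RAPIDS Nightly Docker Containers, and they can run on air-gapped systems. You can get a container quickly and then install and use it by following the RAPIDS.ai Getting Started page.

Usage

To get the latest notebook repo updates, run ./update.sh or use the following command: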

git submodule update --init --remote --no-single-branch --depth 1

Download the CUDA Installer for Linux Ubuntu 20.04 x86_64

The base installation is as follows.

Base Installer

Installation instructions:

wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/cuda-ubuntu2004.pin
sudo mv cuda-ubuntu2004.pin /etc/apt/preferences.d/cuda-repository-pin-600
wget https://developer.download.nvidia.com/compute/cuda/11.1.0/local_installers/cuda-repo-ubuntu2004-11-1-local_11.1.0-455.23.05-1_amd64.deb
sudo dpkg -i cuda-repo-ubuntu2004-11-1-local_11.1.0-455.23.05-1_amd64.deb
sudo apt-key add /var/cuda-repo-ubuntu2004-11-1-local/7fa2af80.pub
sudo apt-get update
sudo apt-get -y install cuda

The CUDA Toolkit includes open-source project software; the download page links to it.
Checksums for the installer and patches are listed under Installer Checksums.

Performance & Cost Benefits

The RAPIDS Accelerator for Apache Spark takes advantage of GPU performance while lowering cost, as shown below:

[Figure: performance and cost comparison]

*ETL for FannieMae Mortgage Dataset (~200GB) as shown in our demo. Costs based on Cloud T4 GPU instance market price & V100 GPU price on Databricks Standard edition.

Easy to Use

Existing Apache Spark applications run without any code changes. Launch Spark with the RAPIDS Accelerator for Apache Spark plugin jar, then turn it on with a configuration setting, as follows:

spark.conf.set('spark.rapids.sql.enabled','true')

[Figure: physical plan with operators running on the GPU]

A Unified AI Framework for ETL + ML/DL

A single pipeline covers everything from data preparation to model training.

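Getting Started with the RAPIDS Accelerator for Apache Spark

Apache Spark 3.0+ lets users provide a plugin that can replace the backend for SQL and DataFrame operations. No API changes are required: the plugin replaces supported SQL operations with GPU-accelerated versions. If an operation is not supported on the GPU, it falls back to the Spark CPU version.

⚠️ Note that the plugin cannot accelerate operations that manipulate RDDs directly.

The accelerator library also provides an implementation of Spark's shuffle that can use UCX to optimize GPU data transfers, keeping as much data on the GPU as possible and bypassing the CPU to do GPU-to-GPU transfers.

The GPU-accelerated processing plugin does not require the accelerated shuffle implementation. However, if accelerated SQL processing is not enabled, the shuffle implementation falls back to the default SortShuffleManager.

To enable GPU processing acceleration, you will need: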

Apache Spark 3.0+
A spark cluster configured with GPUs that comply with the requirements for the version of cudf.

One GPU per executor.

The following jars:

A cudf jar that corresponds to the version of CUDA available on your cluster.
RAPIDS Spark accelerator plugin jar.

To set the config spark.plugins to com.nvidia.spark.SQLPlugin
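
The requirements above can be pulled together into one launch command. The following is a minimal Python sketch, not an official launcher: it only assembles a spark-submit argument list and does not start Spark. The helper name build_submit_command, the jar file names, and the application script name are hypothetical placeholders; the configuration keys and the plugin class name are the ones given in this article.

```python
import shlex


def build_submit_command(cudf_jar, rapids_jar, app, gpus_per_executor=1):
    """Assemble a spark-submit argument list that enables the RAPIDS plugin.

    Paths are supplied by the caller; nothing is checked on disk.
    """
    confs = {
        "spark.plugins": "com.nvidia.spark.SQLPlugin",
        "spark.rapids.sql.enabled": "true",
        "spark.executor.resource.gpu.amount": str(gpus_per_executor),
    }
    # --jars takes a comma-separated list of jar paths.
    args = ["spark-submit", "--jars", f"{cudf_jar},{rapids_jar}"]
    for key, value in confs.items():
        args += ["--conf", f"{key}={value}"]
    args.append(app)
    return args


if __name__ == "__main__":
    # Hypothetical file names, used only to show the shape of the command.
    cmd = build_submit_command("cudf.jar", "rapids-4-spark.jar", "etl_job.py")
    print(shlex.join(cmd))
```

Keeping the settings in a dictionary makes it easy to add cluster-specific options, such as a GPU discovery script, without touching the rest of the command.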

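Spark GPU Scheduling Overview

Apache Spark 3.0 now supports GPU scheduling, provided the cluster manager in use supports it. You can have Spark request GPUs and assign them to tasks. The exact configs depend on the cluster manager. Here are some examples: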

Request your executor to have GPUs:

--conf spark.executor.resource.gpu.amount=1

Specify the number of GPUs per task:

--conf spark.task.resource.gpu.amount=1

Specify a GPU discovery script (required on YARN and K8S):

--conf spark.executor.resource.gpu.discoveryScript=./getGpusResources.sh

See the deployment-specific details for the exact method and restrictions of each cluster manager.
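
A discovery script reports the GPUs on a host by printing a JSON object with a resource name and a list of addresses to standard output. The contents of getGpusResources.sh are not shown in this article, so the following is only a Python sketch of that output format. The helper name gpu_resource_json is hypothetical, and feeding it the output of nvidia-smi --query-gpu=index --format=csv,noheader is an assumption about how the indices would be obtained.

```python
import json


def gpu_resource_json(nvidia_smi_output):
    """Build the JSON text a GPU discovery script prints on stdout.

    `nvidia_smi_output` is assumed to be the text printed by
    `nvidia-smi --query-gpu=index --format=csv,noheader`,
    i.e. one GPU index per line.
    """
    addresses = [
        line.strip() for line in nvidia_smi_output.splitlines() if line.strip()
    ]
    return json.dumps({"name": "gpu", "addresses": addresses})


if __name__ == "__main__":
    # Sample text standing in for real nvidia-smi output on a two-GPU host.
    print(gpu_resource_json("0\n1\n"))
```

Separating the formatting from the nvidia-smi call keeps the sketch runnable on machines without a GPU.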

Note that spark.task.resource.gpu.amount can be a decimal. If you want multiple tasks to run on an executor at the same time and be assigned to the same GPU, set it to a value less than 1. The value should correspond to the spark.executor.cores setting. For example, spark.executor.cores=2 allows 2 tasks to run on each executor; if you want both tasks to share the same GPU, set spark.task.resource.gpu.amount=0.5.
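
The arithmetic behind that setting can be sketched in a few lines of Python. This is an illustration under stated assumptions, not part of Spark: it assumes one GPU per executor (as required above) and one CPU core per task, so the number of concurrent tasks equals spark.executor.cores. The helper name gpu_amount_per_task is hypothetical.

```python
def gpu_amount_per_task(executor_cores):
    """Return a spark.task.resource.gpu.amount value that lets every
    concurrently running task on an executor share its single GPU.

    Assumes one GPU per executor and one CPU core per task.
    """
    if executor_cores < 1:
        raise ValueError("executor_cores must be at least 1")
    return 1 / executor_cores


if __name__ == "__main__":
    cores = 2
    amount = gpu_amount_per_task(cores)
    print(f"--conf spark.executor.cores={cores}")
    print(f"--conf spark.task.resource.gpu.amount={amount}")
```

With cores set to 2 this reproduces the 0.5 value from the example above.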

Reposted from: http://hbruida.cn/article/jeodjs.html