Environment Setup
Introduction
Tutorial & Docs
Quick Start
Deploy
[Spark download page][1]
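As an illustrative sketch (the package name below is a placeholder; pick the prebuilt package matching your Hadoop version from the download page), unpack the tarball and work from its top-level directory:

# Placeholder file name; substitute the package you actually downloaded
tar -xzf spark-1.3.1-bin-hadoop2.6.tgz
cd spark-1.3.1-bin-hadoop2.6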
Standalone
To start a standalone master, run:

./sbin/start-master.sh

After executing this command, the log prints something like:
15/05/28 13:20:57 INFO Master: Starting Spark master at spark://localhost.localdomain:7077
15/05/28 13:21:07 INFO MasterWebUI: Started MasterWebUI at http://192.168.199.166:8080
The spark://HOST:PORT URL given in this log is the master URL. Next, a worker needs to be started and pointed at it:
./bin/spark-class org.apache.spark.deploy.worker.Worker spark://IP:PORT
Note that the IP here is the address resolved from the hostname that the master was listening on when it was started.
![enter description here][2]
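If the master binds to an address other than the one workers and clients should use, one common approach (a sketch assuming the conf/spark-env.sh template shipped with Spark 1.x) is to set the bind address and ports explicitly before running start-master.sh:

# conf/spark-env.sh -- values below are illustrative
export SPARK_MASTER_IP=192.168.199.166    # address the master binds to and advertises
export SPARK_MASTER_PORT=7077             # master service port (default: 7077)
export SPARK_MASTER_WEBUI_PORT=8080       # master web UI port (default: 8080)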
The following arguments can be passed when starting the master or a worker:
| Argument | Meaning |
|---|---|
| -h HOST, --host HOST | Hostname to listen on |
| -i HOST, --ip HOST | Hostname to listen on (deprecated, use -h or --host) |
| -p PORT, --port PORT | Port for service to listen on (default: 7077 for master, random for worker) |
| --webui-port PORT | Port for web UI (default: 8080 for master, 8081 for worker) |
| -c CORES, --cores CORES | Total CPU cores to allow Spark applications to use on the machine (default: all available); only on worker |
| -m MEM, --memory MEM | Total amount of memory to allow Spark applications to use on the machine, in a format like 1000M or 2G (default: your machine's total RAM minus 1 GB); only on worker |
| -d DIR, --work-dir DIR | Directory to use for scratch space and job output logs (default: SPARK_HOME/work); only on worker |
| --properties-file FILE | Path to a custom Spark properties file to load (default: conf/spark-defaults.conf) |
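For example, a worker can be started with explicit resource limits using the options above (the core, memory, and address values below are illustrative; the master URL should be the one printed in the master log):

# Illustrative: start a worker with 4 cores, 8 GB of memory, and a custom web UI port
./bin/spark-class org.apache.spark.deploy.worker.Worker \
--cores 4 \
--memory 8G \
--webui-port 8081 \
spark://192.168.199.166:7077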
Cluster Launch
- sbin/start-master.sh - Starts a master instance on the machine the script is executed on.
- sbin/start-slaves.sh - Starts a slave instance on each machine specified in the conf/slaves file.
- sbin/start-all.sh - Starts both a master and a number of slaves as described above.
- sbin/stop-master.sh - Stops the master that was started via the sbin/start-master.sh script.
- sbin/stop-slaves.sh - Stops all slave instances on the machines specified in the conf/slaves file.
- sbin/stop-all.sh - Stops both the master and the slaves as described above.
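As a minimal sketch, conf/slaves simply lists the hostnames of the worker machines, one per line (the hostnames below are hypothetical); the launch scripts then log in to each host over SSH to start a worker, so the master machine typically needs password-less SSH access to each of them:

# conf/slaves -- one worker host per line (hypothetical hostnames)
worker-node-1
worker-node-2
worker-node-3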
Docker
Application Submit
./bin/spark-submit \
--class <main-class> \
--master <master-url> \
--deploy-mode <deploy-mode> \
--conf <key>=<value> \
... # other options
<application-jar> \
[application-arguments]
The options are as follows:
- --class: The entry point for your application (e.g. org.apache.spark.examples.SparkPi)
- --master: The master URL for the cluster (e.g. spark://23.195.26.187:7077)
- --deploy-mode: Whether to deploy your driver on the worker nodes (cluster) or locally as an external client (client) (default: client)
- --conf: Arbitrary Spark configuration property in key=value format. For values that contain spaces, wrap "key=value" in quotes (see the example below).
- application-jar: Path to a bundled jar including your application and all dependencies. The URL must be globally visible inside your cluster, for instance an hdfs:// path or a file:// path that is present on all nodes.
- application-arguments: Arguments passed to the main method of your main class, if any
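For instance, a configuration value that contains spaces must be quoted so it reaches spark-submit as a single key=value argument (the JVM options below are only illustrative):

# Wrap "key=value" in quotes when the value contains spaces (illustrative JVM options)
./bin/spark-submit \
--class org.apache.spark.examples.SparkPi \
--master local[8] \
--conf "spark.executor.extraJavaOptions=-XX:+PrintGCDetails -XX:+PrintGCTimeStamps" \
/path/to/examples.jar \
100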
Standalone
# Run application locally on 8 cores
./bin/spark-submit \
--class org.apache.spark.examples.SparkPi \
--master local[8] \
/path/to/examples.jar \
100
# Run on a Spark Standalone cluster in client deploy mode
./bin/spark-submit \
--class org.apache.spark.examples.SparkPi \
--master spark://207.184.161.138:7077 \
--executor-memory 20G \
--total-executor-cores 100 \
/path/to/examples.jar \
1000
# Run on a Spark Standalone cluster in cluster deploy mode with supervise
./bin/spark-submit \
--class org.apache.spark.examples.SparkPi \
--master spark://207.184.161.138:7077 \
--deploy-mode cluster \
--supervise \
--executor-memory 20G \
--total-executor-cores 100 \
/path/to/examples.jar \
1000
After the application is submitted, the management web UI shows:
![enter description here][3]
Program
Initializing Spark
Start by building a SparkConf object and creating a JavaSparkContext from it:
SparkConf conf = new SparkConf().setAppName(appName).setMaster(master);
JavaSparkContext sc = new JavaSparkContext(conf);
In the code above, appName is a name for your application that is shown in the cluster UI, and master is the cluster's master URL (or the special string "local" to run in local mode).
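For example, for local testing the master can be set to a local mode string directly (a minimal sketch; the application name "MyApp" and the thread count are arbitrary):

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

// Minimal sketch: run locally using 2 threads; "MyApp" is an arbitrary application name
SparkConf conf = new SparkConf().setAppName("MyApp").setMaster("local[2]");
JavaSparkContext sc = new JavaSparkContext(conf);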
Resilient Distributed Datasets (RDDs)
Parallelized Collections
List<Integer> data = Arrays.asList(1, 2, 3, 4, 5);
JavaRDD<Integer> distData = sc.parallelize(data);
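Once created, the distributed dataset can be operated on in parallel; for example, the elements can be summed with reduce (a sketch building on the distData RDD above, using Java 8 lambda syntax):

// Sum the elements of distData in parallel
int sum = distData.reduce((a, b) -> a + b);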
External Datasets
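Text file RDDs can be created with the SparkContext's textFile method; a minimal sketch (the path data.txt is hypothetical, and could also be an hdfs:// URI):

// Create an RDD of the lines of a text file (path is illustrative)
JavaRDD<String> distFile = sc.textFile("data.txt");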
RDD Operations
Note that, because