Environment Setup
Introduction
Tutorial & Docs
Quick Start
Deploy
[Spark download page][1]
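As an illustrative sketch (the package name below is a placeholder; pick the prebuilt package matching your Hadoop version from the download page), unpack the tarball and work from its top-level directory:

# Placeholder file name; substitute the package you actually downloaded
tar -xzf spark-1.3.1-bin-hadoop2.6.tgz
cd spark-1.3.1-bin-hadoop2.6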
Standalone
To start a standalone master, run:

./sbin/start-master.sh

After executing this command, the log prints something like:
15/05/28 13:20:57 INFO Master: Starting Spark master at spark://localhost.localdomain:7077
15/05/28 13:21:07 INFO MasterWebUI: Started MasterWebUI at http://192.168.199.166:8080
The spark://HOST:PORT URL given in this log is the master URL. Next, a worker needs to be started and pointed at it:
./bin/spark-class org.apache.spark.deploy.worker.Worker spark://IP:PORT
Note that the IP here is the address resolved from the hostname that the master was listening on when it was started.
![enter description here][2]
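If the master binds to an address other than the one workers and clients should use, one common approach (a sketch assuming the conf/spark-env.sh template shipped with Spark 1.x) is to set the bind address and ports explicitly before running start-master.sh:

# conf/spark-env.sh -- values below are illustrative
export SPARK_MASTER_IP=192.168.199.166    # address the master binds to and advertises
export SPARK_MASTER_PORT=7077             # master service port (default: 7077)
export SPARK_MASTER_WEBUI_PORT=8080       # master web UI port (default: 8080)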
The following arguments can be passed when starting the master or a worker:
| Argument | Meaning |
|---|---|
| -h HOST, --host HOST | Hostname to listen on |
| -i HOST, --ip HOST | Hostname to listen on (deprecated, use -h or --host) |
| -p PORT, --port PORT | Port for service to listen on (default: 7077 for master, random for worker) |
| --webui-port PORT | Port for web UI (default: 8080 for master, 8081 for worker) |
| -c CORES, --cores CORES | Total CPU cores to allow Spark applications to use on the machine (default: all available); only on worker |
| -m MEM, --memory MEM | Total amount of memory to allow Spark applications to use on the machine, in a format like 1000M or 2G (default: your machine's total RAM minus 1 GB); only on worker |
| -d DIR, --work-dir DIR | Directory to use for scratch space and job output logs (default: SPARK_HOME/work); only on worker |
| --properties-file FILE | Path to a custom Spark properties file to load (default: conf/spark-defaults.conf) |
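For example, a worker can be started with explicit resource limits using the options above (the core, memory, and address values below are illustrative; the master URL should be the one printed in the master log):

# Illustrative: start a worker with 4 cores, 8 GB of memory, and a custom web UI port
./bin/spark-class org.apache.spark.deploy.worker.Worker \
--cores 4 \
--memory 8G \
--webui-port 8081 \
spark://192.168.199.166:7077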
Cluster Launch
- sbin/start-master.sh - Starts a master instance on the machine the script is executed on.
- sbin/start-slaves.sh - Starts a slave instance on each machine specified in the conf/slaves file.
- sbin/start-all.sh - Starts both a master and a number of slaves as described above.
- sbin/stop-master.sh - Stops the master that was started via the sbin/start-master.sh script.
- sbin/stop-slaves.sh - Stops all slave instances on the machines specified in the conf/slaves file.
- sbin/stop-all.sh - Stops both the master and the slaves as described above.
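As a minimal sketch, conf/slaves simply lists the hostnames of the worker machines, one per line (the hostnames below are hypothetical); the launch scripts then log in to each host over SSH to start a worker, so the master machine typically needs password-less SSH access to each of them:

# conf/slaves -- one worker host per line (hypothetical hostnames)
worker-node-1
worker-node-2
worker-node-3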
Docker
Application Submit
./bin/spark-submit \
--class <main-class> \
--master <master-url> \
--deploy-mode <deploy-mode> \
--conf <key>=<value> \
... # other options
<application-jar> \
[application-arguments]
The options are as follows:
- --class: The entry point for your application (e.g. org.apache.spark.examples.SparkPi)
- --master: The master URL for the cluster (e.g. spark://23.195.26.187:7077)
- --deploy-mode: Whether to deploy your driver on the worker nodes (cluster) or locally as an external client (client) (default: client)
- --conf: Arbitrary Spark configuration property in key=value format. For values that contain spaces, wrap "key=value" in quotes (see the example below).
- application-jar: Path to a bundled jar including your application and all dependencies. The URL must be globally visible inside your cluster, for instance an hdfs:// path or a file:// path that is present on all nodes.
- application-arguments: Arguments passed to the main method of your main class, if any
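For instance, a configuration value that contains spaces must be quoted so it reaches spark-submit as a single key=value argument (the JVM options below are only illustrative):

# Wrap "key=value" in quotes when the value contains spaces (illustrative JVM options)
./bin/spark-submit \
--class org.apache.spark.examples.SparkPi \
--master local[8] \
--conf "spark.executor.extraJavaOptions=-XX:+PrintGCDetails -XX:+PrintGCTimeStamps" \
/path/to/examples.jar \
100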
Standalone
# Run application locally on 8 cores
./bin/spark-submit \
--class org.apache.spark.examples.SparkPi \
--master local[8] \
/path/to/examples.jar \
100
# Run on a Spark Standalone cluster in client deploy mode
./bin/spark-submit \
--class org.apache.spark.examples.SparkPi \
--master spark://207.184.161.138:7077 \
--executor-memory 20G \
--total-executor-cores 100 \
/path/to/examples.jar \
1000
# Run on a Spark Standalone cluster in cluster deploy mode with supervise
./bin/spark-submit \
--class org.apache.spark.examples.SparkPi \
--master spark://207.184.161.138:7077 \
--deploy-mode cluster \
--supervise \
--executor-memory 20G \
--total-executor-cores 100 \
/path/to/examples.jar \
1000
After the application is submitted, the management web UI shows:
![enter description here][3]
Program
Initializing Spark
Start by building a SparkConf object and creating a JavaSparkContext from it:
SparkConf conf = new SparkConf().setAppName(appName).setMaster(master);
JavaSparkContext sc = new JavaSparkContext(conf);
In the code above, appName is a name for your application that is shown in the cluster UI, and master is the cluster's master URL (or the special string "local" to run in local mode).
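For example, for local testing the master can be set to a local mode string directly (a minimal sketch; the application name "MyApp" and the thread count are arbitrary):

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

// Minimal sketch: run locally using 2 threads; "MyApp" is an arbitrary application name
SparkConf conf = new SparkConf().setAppName("MyApp").setMaster("local[2]");
JavaSparkContext sc = new JavaSparkContext(conf);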
Resilient Distributed Datasets (RDDs)
Parallelized Collections
List<Integer> data = Arrays.asList(1, 2, 3, 4, 5);
JavaRDD<Integer> distData = sc.parallelize(data);
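Once created, the distributed dataset can be operated on in parallel; for example, the elements can be summed with reduce (a sketch building on the distData RDD above, using Java 8 lambda syntax):

// Sum the elements of distData in parallel
int sum = distData.reduce((a, b) -> a + b);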
External Datasets
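Text file RDDs can be created with the SparkContext's textFile method; a minimal sketch (the path data.txt is hypothetical, and could also be an hdfs:// URI):

// Create an RDD of the lines of a text file (path is illustrative)
JavaRDD<String> distFile = sc.textFile("data.txt");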
RDD Operations
Note that, because