Sets the threshold on the number of reducer partitions below which sort-based shuffle does not merge-sort the data internally, but instead writes each partition straight to its own file, much like hash-based shuffle. The difference is that these files are still merged into a single file at the end, with an index file recording where each partition starts. From the reducer's point of view, the data file and index file look exactly the same whether or not a merge sort was performed internally. This can be seen as a compromise that lets sort-based shuffle fall back to hash-based behaviour when the shuffle volume is small. Like hash-based shuffle, it suffers from having many files open at once, which increases memory usage, so if GC pressure is heavy or memory is tight you can lower this value appropriately.
Customize the locality wait for node locality. For example, you can set this to 0 to skip node locality and search immediately for rack locality (if your cluster has rack information).
spark.locality.wait.process
spark.locality.wait
Customize the locality wait for process locality. This affects tasks that attempt to access cached data in a particular executor process.
spark.locality.wait.rack
spark.locality.wait
Customize the locality wait for rack locality.
spark.scheduler.maxRegisteredResourcesWaitingTime
30s
Maximum amount of time to wait for resources to register before scheduling begins.
spark.scheduler.mode
FIFO
The job scheduling mode. Can be set to FAIR (fair scheduling) or FIFO (first in, first out).
spark.scheduler.revive.interval
1s
The interval length for the scheduler to revive the worker resource offers to run tasks.
spark.speculation
false
If set to true, Spark performs speculative execution: tasks that run slowly in a stage are launched again on other executors, and the result of whichever copy finishes first is used (see the example after this table).
spark.speculation.interval
100ms
How often Spark checks for tasks to speculate.
spark.speculation.multiplier
1.5
How many times slower than the median task duration a task must be before speculation is triggered for it.
spark.speculation.quantile
0.75
Fraction of tasks that must be complete before speculation is enabled for a particular stage.
spark.task.cpus
1
Number of CPU cores to allocate to each task.
spark.task.maxFailures
4
Maximum number of task failures before the job is given up; this is 1 more than the number of allowed retries.
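The scheduler, locality and speculation properties above are normally set when the SparkConf is built (or in conf/spark-defaults.conf). A minimal Scala sketch follows; all values are illustrative, not recommendations:

    import org.apache.spark.{SparkConf, SparkContext}

    // Illustrative values only; tune them against your own workload.
    val conf = new SparkConf()
      .setAppName("scheduling-example")
      .setMaster("local[*]")                        // local master, just for a quick test
      .set("spark.scheduler.mode", "FAIR")          // FAIR or FIFO (the default)
      .set("spark.locality.wait", "3s")             // base wait reused by process/node/rack waits
      .set("spark.speculation", "true")             // re-launch slow tasks speculatively
      .set("spark.speculation.multiplier", "1.5")   // how much slower than the median counts as slow
      .set("spark.speculation.quantile", "0.75")    // fraction of tasks that must finish first
      .set("spark.task.maxFailures", "4")           // give up on the job after 4 failures of a task
    val sc = new SparkContext(conf)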
10. Dynamic Resource Allocation
Property
Default
Description
spark.dynamicAllocation.enabled
false
Whether to enable dynamic resource allocation (see the sketch after this table).
spark.dynamicAllocation.executorIdleTimeout
60s
If dynamic allocation is enabled and an executor has been idle for longer than this duration, the executor is removed.
spark.dynamicAllocation.cachedExecutorIdleTimeout
infinity
If dynamic allocation is enabled and an executor that has cached data blocks has been idle for longer than this duration, the executor is removed.
spark.dynamicAllocation.initialExecutors
spark.dynamicAllocation.minExecutors
Initial number of executors to run if dynamic allocation is enabled.
spark.dynamicAllocation.maxExecutors
infinity
Upper bound for the number of executors if dynamic allocation is enabled.
spark.dynamicAllocation.minExecutors
0
Lower bound for the number of executors if dynamic allocation is enabled.
spark.dynamicAllocation.schedulerBacklogTimeout
1s
If dynamic allocation is enabled and there have been pending tasks backlogged for more than this duration, new executors will be requested.
spark.dynamicAllocation.sustainedSchedulerBacklogTimeout
schedulerBacklogTimeout
Same as spark.dynamicAllocation.schedulerBacklogTimeout, but used only for subsequent executor requests.
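A short Scala sketch of the dynamic allocation settings above; the executor bounds are illustrative, and note that dynamic allocation also expects the external shuffle service to be enabled so that executors can be removed without losing their shuffle files:

    import org.apache.spark.SparkConf

    // Illustrative bounds; choose them according to cluster capacity.
    val conf = new SparkConf()
      .set("spark.dynamicAllocation.enabled", "true")
      .set("spark.shuffle.service.enabled", "true")              // companion setting, usually required
      .set("spark.dynamicAllocation.minExecutors", "2")
      .set("spark.dynamicAllocation.maxExecutors", "20")
      .set("spark.dynamicAllocation.initialExecutors", "4")
      .set("spark.dynamicAllocation.executorIdleTimeout", "60s") // remove executors idle for 60s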
11. Security
Property
Default
Description
spark.acls.enable
false
Whether Spark acls should be enabled. If enabled, this checks to see if the user has access permissions to view or modify the job. Note this requires the user to be known, so if the user comes across as null no checks are done. Filters can be used with the UI to authenticate and set the user (see the sketch after this table).
spark.admin.acls
Empty
Comma separated list of users/administrators that have view and modify access to all Spark jobs. This can be used if you run on a shared cluster and have a set of administrators or devs who help debug when things do not work. Putting a “*” in the list means any user can have the privilege of admin.
spark.authenticate
false
Whether Spark authenticates its internal connections. See spark.authenticate.secret if not running on YARN.
spark.authenticate.secret
None
Set the secret key used for Spark to authenticate between components. This needs to be set if not running on YARN and authentication is enabled.
spark.authenticate.enableSaslEncryption
false
Enable encrypted communication when authentication is enabled. This option is currently only supported by the block transfer service.
spark.network.sasl.serverAlwaysEncrypt
false
Disable unencrypted connections for services that support SASL authentication. This is currently supported by the external shuffle service.
spark.core.connection.ack.wait.timeout
60s
How long for the connection to wait for ack to occur before timing out and giving up. To avoid unwanted timeouts caused by long pauses such as GC, you can set a larger value.
spark.core.connection.auth.wait.timeout
30s
How long for the connection to wait for authentication to occur before timing out and giving up.
spark.modify.acls
Empty
Comma separated list of users that have modify access to the Spark job. By default only the user that started the Spark job has access to modify it (kill it for example). Putting a “*” in the list means any user can have access to modify it.
spark.ui.filters
None
Comma separated list of filter class names to apply to the Spark web UI. The filter should be a standard javax servlet Filter. Parameters to each filter can also be specified by setting a java system property of spark.<class name of filter>.params='param1=value1,param2=value2'. For example: -Dspark.ui.filters=com.test.filter1 -Dspark.com.test.filter1.params='param1=foo,param2=testing'
spark.ui.view.acls
Empty
Comma separated list of users that have view access to the Spark web ui. By default only the user that started the Spark job has view access. Putting a “*” in the list means any user can have view access to this Spark job.
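A minimal Scala sketch of the ACL and authentication settings; the user names and the secret are placeholders, and in practice these values usually live in conf/spark-defaults.conf rather than application code:

    import org.apache.spark.SparkConf

    // "alice", "bob", "carol" and the secret below are placeholders.
    val conf = new SparkConf()
      .set("spark.acls.enable", "true")
      .set("spark.admin.acls", "alice,bob")          // view + modify access to all jobs
      .set("spark.ui.view.acls", "carol")            // extra users allowed to view the web UI
      .set("spark.modify.acls", "carol")             // extra users allowed to modify (e.g. kill) the job
      .set("spark.authenticate", "true")
      .set("spark.authenticate.secret", "change-me") // only needed when not running on YARN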
12. Encryption
Property
Default
Description
spark.ssl.enabled
false
Whether to enable SSL connections on all supported protocols (see the sketch after this table).
spark.ssl.enabledAlgorithms
Empty
A comma separated list of ciphers. The specified ciphers must be supported by the JVM.
spark.ssl.keyPassword
None
A password to the private key in the key-store.
spark.ssl.keyStore
None
A path to a key-store file. The path can be absolute or relative to the directory where the component is started in.
spark.ssl.keyStorePassword
None
A password to the key-store
spark.ssl.protocol
None
A protocol name. The protocol must be supported by the JVM. A reference list of protocol names can be found in the JVM's JSSE documentation.
spark.ssl.trustStore
None
A path to a trust-store file. The path can be absolute or relative to the directory where the component is started in.
spark.ssl.trustStorePassword
None
A password to the trust-store.
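The spark.ssl.* properties are normally placed in conf/spark-defaults.conf so that every component (master, workers, driver) sees them at startup; the Scala sketch below only illustrates the shape of the configuration, and all paths and passwords are placeholders:

    import org.apache.spark.SparkConf

    // All paths and passwords below are placeholders.
    val conf = new SparkConf()
      .set("spark.ssl.enabled", "true")
      .set("spark.ssl.protocol", "TLSv1.2")
      .set("spark.ssl.keyStore", "/path/to/keystore.jks")
      .set("spark.ssl.keyStorePassword", "keystore-password")
      .set("spark.ssl.keyPassword", "key-password")
      .set("spark.ssl.trustStore", "/path/to/truststore.jks")
      .set("spark.ssl.trustStorePassword", "truststore-password")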
13. Spark Streaming
Property
Default
Description
spark.streaming.backpressure.enabled
false
Enables or disables Spark Streaming’s internal backpressure mechanism (since 1.5). This enables Spark Streaming to control the receiving rate based on the current batch scheduling delays and processing times so that the system receives only as fast as it can process. Internally, this dynamically sets the maximum receiving rate of receivers. This rate is upper bounded by the values spark.streaming.receiver.maxRate and spark.streaming.kafka.maxRatePerPartition if they are set (see below, and the sketch after this table).
spark.streaming.blockInterval
200ms
Interval at which data received by Spark Streaming receivers is chunked into blocks of data before storing them in Spark. Minimum recommended - 50 ms. See the performance tuning section in the Spark Streaming programming guide for more details.
spark.streaming.receiver.maxRate
not set
Maximum rate (number of records per second) at which each receiver will receive data. Effectively, each stream will consume at most this number of records per second. Setting this configuration to 0 or a negative number will put no limit on the rate. See the deployment guide in the Spark Streaming programming guide for more details.
spark.streaming.receiver.writeAheadLog.enable
false
Enable write ahead logs for receivers. All the input data received through receivers will be saved to write ahead logs that will allow it to be recovered after driver failures. See the deployment guide in the Spark Streaming programming guide for more details.
spark.streaming.unpersist
true
Force RDDs generated and persisted by Spark Streaming to be automatically unpersisted from Spark’s memory. The raw input data received by Spark Streaming is also automatically cleared. Setting this to false will allow the raw data and persisted RDDs to be accessible outside the streaming application as they will not be cleared automatically. But it comes at the cost of higher memory usage in Spark.
spark.streaming.stopGracefullyOnShutdown
false
If true, Spark shuts down the StreamingContext gracefully on JVM shutdown rather than immediately.
spark.streaming.kafka.maxRatePerPartition
not set
Maximum rate (number of records per second) at which data will be read from each Kafka partition when using the new Kafka direct stream API. See the Kafka Integration guide for more details.
spark.streaming.kafka.maxRetries
1
Maximum number of consecutive retries the driver will make in order to find the latest offsets on the leader of each partition (a default value of 1 means that the driver will make a maximum of 2 attempts). Only applies to the new Kafka direct stream API.
spark.streaming.ui.retainedBatches
1000
How many batches the Spark Streaming UI and status APIs remember before garbage collecting.
spark.streaming.driver.writeAheadLog.closeFileAfterWrite
false
Whether to close the file after writing a write ahead log record on the driver. Set this to 'true' when you want to use S3 (or any file system that does not support flushing) for the metadata WAL on the driver.
spark.streaming.receiver.writeAheadLog.closeFileAfterWrite
false
Whether to close the file after writing a write ahead log record on the receivers. Set this to 'true' when you want to use S3 (or any file system that does not support flushing) for the data WAL on the receivers.
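A minimal Scala sketch that combines the backpressure, rate-limit and write-ahead-log settings above when building a StreamingContext; the batch interval and the rate are illustrative only:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    // Illustrative values; the sustainable rate depends on the workload and cluster.
    val conf = new SparkConf()
      .setAppName("streaming-example")
      .setMaster("local[2]")                                         // at least 2 threads: one receiver + processing
      .set("spark.streaming.backpressure.enabled", "true")           // adapt receiving rate to processing rate
      .set("spark.streaming.receiver.maxRate", "10000")              // upper bound, records/second per receiver
      .set("spark.streaming.receiver.writeAheadLog.enable", "true")  // recover received data after driver failure
      .set("spark.streaming.stopGracefullyOnShutdown", "true")
    val ssc = new StreamingContext(conf, Seconds(1))                 // 1-second batches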
14. SparkR
Property
Default
Description
spark.r.numRBackendThreads
2
Number of RPC handler threads used by the RBackend to handle RPC calls from the SparkR package.
spark.r.command
Rscript
Executable for executing R scripts in cluster modes for both driver and workers.
spark.r.driver.command
spark.r.command
Executable for executing R scripts in client modes for driver. Ignored in cluster modes