Download & Install Spark
Download Spark 2.3.1 pre-built for Hadoop 2.7 (the paths used throughout this post are for spark-2.3.1-bin-hadoop2.7) and untar the distribution under /opt/spark.
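The download step can be scripted. The following is only a sketch: the version (2.3.1, taken from the paths used later in this post), the Apache archive URL, the /opt install root and the /opt/spark symlink are all assumptions, since the post does not say where the tarball comes from or how /opt/spark relates to the versioned directory.

```shell
# Assumed values: neither the version pin nor the mirror URL is stated in the post.
SPARK_VERSION="2.3.1"
HADOOP_PROFILE="hadoop2.7"
SPARK_DIST="spark-${SPARK_VERSION}-bin-${HADOOP_PROFILE}"
SPARK_URL="https://archive.apache.org/dist/spark/spark-${SPARK_VERSION}/${SPARK_DIST}.tgz"
INSTALL_ROOT="/opt"

# Download the tarball to /tmp and unpack it under $INSTALL_ROOT.
install_spark() {
    curl -fSL -o "/tmp/${SPARK_DIST}.tgz" "$SPARK_URL" || return 1
    tar -xzf "/tmp/${SPARK_DIST}.tgz" -C "$INSTALL_ROOT" || return 1
    # Hypothetical convenience symlink so that /opt/spark points at the versioned directory.
    ln -sfn "${INSTALL_ROOT}/${SPARK_DIST}" "${INSTALL_ROOT}/spark"
}
```

Running `install_spark` as root performs the download and unpack; nothing happens until the function is called.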
Configure Spark Master node
Configure the following environment variables in /opt/spark-2.3.1-bin-hadoop2.7/conf/. Note that in this example the Spark master port and web UI port are set to custom values; both settings are optional.
SPARK_MASTER_PORT=7177
SPARK_MASTER_WEBUI_PORT=8180
SPARK_LOCAL_IP=
SPARK_WORKER_MEMORY=168g
SPARK_WORKER_CORES=21
SPARK_LOCAL_DIRS=/opt/spark-2.3.1-bin-hadoop2.7/tmp
#SPARK_WORKER_OPTS="-Dspark.worker.cleanup.enabled=true"
Configure Spark Worker node
SPARK_MASTER_HOST must be defined on worker nodes:
SPARK_MASTER_HOST=
SPARK_LOCAL_IP=
SPARK_WORKER_MEMORY=168g
SPARK_WORKER_CORES=21
SPARK_LOCAL_DIRS=/opt/spark-2.3.1-bin-hadoop2.7/tmp
#SPARK_WORKER_OPTS="-Dspark.worker.cleanup.enabled=true"
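Since the master and worker settings differ only in a few variables, both can be generated from one helper. This is a minimal sketch, not part of the original setup: `render_spark_env` is a hypothetical function name, and the IP addresses in the usage note are placeholders. The memory, core and port values are the ones from this post.

```shell
# render_spark_env ROLE LOCAL_IP [MASTER_HOST]
# Prints the environment variables for a master or worker node to stdout.
render_spark_env() {
    local role="$1" local_ip="$2" master_host="$3"
    case "$role" in
        master)
            # Custom ports from the post; both are optional.
            echo "SPARK_MASTER_PORT=7177"
            echo "SPARK_MASTER_WEBUI_PORT=8180"
            ;;
        worker)
            # Workers need to know where the master lives.
            if [ -z "$master_host" ]; then
                echo "worker role requires a master host" >&2
                return 1
            fi
            echo "SPARK_MASTER_HOST=${master_host}"
            ;;
        *)
            echo "unknown role: ${role}" >&2
            return 1
            ;;
    esac
    # Settings shared by both roles.
    echo "SPARK_LOCAL_IP=${local_ip}"
    echo "SPARK_WORKER_MEMORY=168g"
    echo "SPARK_WORKER_CORES=21"
    echo "SPARK_LOCAL_DIRS=/opt/spark-2.3.1-bin-hadoop2.7/tmp"
}
```

For example, `render_spark_env worker 10.0.0.2 10.0.0.1` prints the worker variant. The output can be redirected into the environment file under conf/; Spark's launch scripts read conf/spark-env.sh, though the post itself does not name the file.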
mkdir -p /var/lib/spark/rdd
useradd -u 1120 spark
chown -R spark: spark
chmod g+w -R spark
mkdir -p /opt/spark-2.3.1-bin-hadoop2.7/tmp
chmod 777 /opt/spark-2.3.1-bin-hadoop2.7/tmp
setfacl -Rdm g::rwx /opt/spark-2.3.1-bin-hadoop2.7/tmp
setfacl -Rdm o::rwx /opt/spark-2.3.1-bin-hadoop2.7/tmp
getfacl /opt/spark-2.3.1-bin-hadoop2.7/tmp
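The scratch-directory part of these commands can be wrapped in a small helper that also reports the resulting mode. This is a sketch under assumptions: `prepare_scratch_dir` is a hypothetical name, `stat -c` is GNU syntax (Linux), and the ACL step is skipped with a warning when setfacl is missing or the filesystem does not support ACLs.

```shell
# prepare_scratch_dir DIR
# Creates DIR, makes it world-writable, sets default ACLs where possible,
# and prints the resulting octal mode.
prepare_scratch_dir() {
    local dir="$1"
    mkdir -p "$dir" || return 1
    chmod 777 "$dir" || return 1
    if command -v setfacl >/dev/null 2>&1; then
        # Default ACLs make files created later inherit group/other rwx.
        setfacl -Rdm g::rwx "$dir" || echo "warning: could not set group default ACL" >&2
        setfacl -Rdm o::rwx "$dir" || echo "warning: could not set other default ACL" >&2
    else
        echo "warning: setfacl not found, skipping default ACLs" >&2
    fi
    # GNU stat: print the permission bits in octal.
    stat -c '%a' "$dir"
}
```

Calling `prepare_scratch_dir /opt/spark-2.3.1-bin-hadoop2.7/tmp` repeats the steps above for the SPARK_LOCAL_DIRS directory. Note that mode 777 is what this post uses; a stricter mode may be preferable on shared hosts.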
Derby Network Server Install
Copy the following jars from your
scp root@ /opt/spark/jars
scp root@ /opt/spark/jars
Download hadoop-aws-2.7.x.jar and aws-java-sdk-1.7.4.jar and place them under /opt/spark/jars/.
- hadoop-aws-2.7.x.jar can be found in hadoop-2.7.4.tar.gz/hadoop/share/hadoop/tools/lib/hadoop-aws-2.7.4.jar
- aws-java-sdk-1.7.x.jar can be found in
Create the following spark-defaults.conf. The same config can be used on both master and worker nodes.
cat >> /opt/spark/conf/spark-defaults.conf << EOF
spark.master spark://
# spark.eventLog.enabled true
# spark.eventLog.dir hdfs://namenode:8021/directory
# spark.serializer org.apache.spark.serializer.KryoSerializer
spark.sql.warehouse.dir /opt/spark-2.3.1-bin-hadoop2.7/tmp/warehouse
spark.driver.memory 2g
spark.executor.memory 2g
spark.executor.cores 2
spark.jars /opt/spark-2.3.1-bin-hadoop2.7/jars/hadoop-aws-2.7.3.jar,/opt/spark-2.3.1-bin-hadoop2.7/jars/aws-java-sdk-1.7.4.jar
spark.task.reaper.enabled true
spark.task.reaper.killTimeout 300s
#spark.driver.extraJavaOptions -Dderby.system.home=/opt/spark-2.3.1-bin-hadoop2.7/tmp/derby
EOF
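Because spark.jars lists exact file names, a version mismatch between the jar that was downloaded and the jar named in the config is easy to miss. The helper below is a sketch for catching that; `list_missing_jars` is a hypothetical name, and it assumes the whitespace-separated key/value layout used in the config above.

```shell
# list_missing_jars CONF_FILE
# Prints every path listed under spark.jars that does not exist on disk.
list_missing_jars() {
    local conf="$1" jars jar
    # Value of the last uncommented spark.jars line (commented lines start with '#').
    jars=$(awk '$1 == "spark.jars" { v = $2 } END { print v }' "$conf")
    # Split the comma-separated list and test each path.
    printf '%s\n' "$jars" | tr ',' '\n' | while IFS= read -r jar; do
        if [ -n "$jar" ] && [ ! -f "$jar" ]; then
            echo "$jar"
        fi
    done
}
```

Running `list_missing_jars /opt/spark/conf/spark-defaults.conf` prints nothing when every listed jar is present, and one path per line otherwise.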
Integrate Spark master and worker with supervisord
The following scripts need to be placed under $SPARK_HOME/sbin/. They are modified from the originals: each one launches its process with a command that retains the parent PID, so that supervisord can manage it.
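The name of the launch command is missing from the text above. As an illustration only, the standard shell builtin `exec` has the described effect: it replaces the current shell with the new program instead of forking, so the PID that supervisord started stays the PID of the long-running process. Whether the modified scripts use `exec` is an assumption here. The sketch below demonstrates the PID behaviour with plain `sh`, without involving Spark.

```shell
# The outer shell prints its PID, then exec replaces it with a second shell,
# which prints its own PID. exec does not fork, so both lines are identical.
pids=$(sh -c 'echo "$$"; exec sh -c "echo \$\$"')
before=$(printf '%s\n' "$pids" | sed -n '1p')
after=$(printf '%s\n' "$pids" | sed -n '2p')

if [ "$before" = "$after" ]; then
    echo "exec kept the same PID"
else
    echo "PID changed"
fi
```

Without `exec`, the second shell would run as a child with a different PID, and a supervisor watching the parent would lose track of the real process when the parent exits.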
Create the supervisord config for master node
cat > /etc/supervisord.d/spark23-master.ini << EOF
[program:spark23-master]
environment=SPARK_NO_DAEMONIZE="true"
command=/opt/spark-2.3.1-bin-hadoop2.7/sbin/ ; the program (relative uses PATH, can take args)
process_name=%(program_name)s ; process_name expr (default %(program_name)s)
numprocs=1 ; number of processes copies to start (def 1)
directory=/opt/spark-2.3.1-bin-hadoop2.7 ; directory to cwd to before exec (def no cwd)
;umask=022 ; umask for process (default None)
priority=3 ; the relative start priority (default 999)
;autostart=true ; start at supervisord start (default: true)
;autorestart=true ; restart at unexpected quit (default: true)
startsecs=10 ; number of secs prog must stay running (def. 1)
;startretries=3 ; max # of serial start failures (default 3)
;exitcodes=0,2 ; 'expected' exit codes for process (default 0,2)
;stopsignal=QUIT ; signal used to kill process (default TERM)
;stopwaitsecs=10 ; max num secs to wait b4 SIGKILL (default 10)
user=spark ; setuid to this UNIX account to run the program
redirect_stderr=true ; redirect proc stderr to stdout (default false)
stdout_logfile=/opt/spark-2.3.1-bin-hadoop2.7/logs/%(program_name)s-stdout.log ; stdout log path, NONE for none; default AUTO
;stdout_logfile_maxbytes=1MB ; max # logfile bytes b4 rotation (default 50MB)
;stdout_logfile_backups=10 ; # of stdout logfile backups (default 10)
;stdout_capture_maxbytes=1MB ; number of bytes in 'capturemode' (default 0)
;stdout_events_enabled=false ; emit events on stdout writes (default false)
;stderr_logfile=/a/path ; stderr log path, NONE for none; default AUTO
;stderr_logfile_maxbytes=1MB ; max # logfile bytes b4 rotation (default 50MB)
;stderr_logfile_backups=10 ; # of stderr logfile backups (default 10)
;stderr_capture_maxbytes=1MB ; number of bytes in 'capturemode' (default 0)
;stderr_events_enabled=false ; emit events on stderr writes (default false)
;environment=A=1,B=2 ; process environment additions (def no adds)
;serverurl=AUTO ; override serverurl computation (childutils)
EOF
Create the supervisord config for worker node
cat > /etc/supervisord.d/spark23-slave.ini << EOF
[program:spark23-slave]
environment=SPARK_NO_DAEMONIZE="true"
command=/opt/spark-2.3.1-bin-hadoop2.7/sbin/ spark:// ; the program (relative uses PATH, can take args)
process_name=%(program_name)s ; process_name expr (default %(program_name)s)
numprocs=1 ; number of processes copies to start (def 1)
directory=/opt/spark-2.3.1-bin-hadoop2.7 ; directory to cwd to before exec (def no cwd)
;umask=022 ; umask for process (default None)
priority=4 ; the relative start priority (default 999)
;autostart=true ; start at supervisord start (default: true)
;autorestart=true ; restart at unexpected quit (default: true)
startsecs=10 ; number of secs prog must stay running (def. 1)
;startretries=3 ; max # of serial start failures (default 3)
;exitcodes=0,2 ; 'expected' exit codes for process (default 0,2)
;stopsignal=QUIT ; signal used to kill process (default TERM)
;stopwaitsecs=10 ; max num secs to wait b4 SIGKILL (default 10)
user=spark ; setuid to this UNIX account to run the program
redirect_stderr=true ; redirect proc stderr to stdout (default false)
stdout_logfile=/opt/spark-2.3.1-bin-hadoop2.7/logs/%(program_name)s-stdout.log ; stdout log path, NONE for none; default AUTO
;stdout_logfile_maxbytes=1MB ; max # logfile bytes b4 rotation (default 50MB)
;stdout_logfile_backups=10 ; # of stdout logfile backups (default 10)
;stdout_capture_maxbytes=1MB ; number of bytes in 'capturemode' (default 0)
;stdout_events_enabled=false ; emit events on stdout writes (default false)
;stderr_logfile=/a/path ; stderr log path, NONE for none; default AUTO
;stderr_logfile_maxbytes=1MB ; max # logfile bytes b4 rotation (default 50MB)
;stderr_logfile_backups=10 ; # of stderr logfile backups (default 10)
;stderr_capture_maxbytes=1MB ; number of bytes in 'capturemode' (default 0)
;stderr_events_enabled=false ; emit events on stderr writes (default false)
;environment=A=1,B=2 ; process environment additions (def no adds)
;serverurl=AUTO ; override serverurl computation (childutils)
EOF
Add the new config into supervisord and start it
sudo supervisorctl reread
sudo supervisorctl update
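After the update, it helps to confirm that the programs actually reach the RUNNING state; with startsecs=10 a program stays in STARTING for the first ten seconds. The helpers below are a sketch: `program_state` and `wait_until_running` are hypothetical names, and they rely on `supervisorctl status` printing the program name in the first column and its state in the second.

```shell
# program_state NAME
# Reads `supervisorctl status` output on stdin and prints NAME's state column.
program_state() {
    awk -v name="$1" '$1 == name { print $2 }'
}

# wait_until_running NAME [TRIES]
# Polls once per second until NAME is RUNNING; gives up after TRIES attempts (default 30).
wait_until_running() {
    local name="$1" tries="${2:-30}" state
    while [ "$tries" -gt 0 ]; do
        state=$(supervisorctl status "$name" 2>/dev/null | program_state "$name")
        [ "$state" = "RUNNING" ] && return 0
        sleep 1
        tries=$((tries - 1))
    done
    return 1
}
```

For example, `wait_until_running spark23-master && echo "master is up"` on the master node, and the same with spark23-slave on a worker.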