Tuesday, August 7, 2018

JupyterHub Setup (CentOS7)


We are going to use Anaconda to manage Python & jupyterhub packages.

Install Anaconda

wget https://repo.continuum.io/archive/Anaconda3-5.0.1-Linux-x86_64.sh
sudo bash Anaconda3-5.0.1-Linux-x86_64.sh -b -p /opt/conda
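A quick check that conda landed where the rest of this guide expects it (/opt/conda):
/opt/conda/bin/conda --version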

Install Jupyterhub Prerequisites

sudo yum install epel-release.noarch
sudo yum install npm nodejs
sudo npm install -g configurable-http-proxy
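A quick way to confirm the node toolchain and proxy installed correctly:
node -v
which configurable-http-proxy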
We are going to install the JupyterHub OAuth package, as we will integrate authentication with GitHub OAuth.

Install Jupyterhub

/opt/conda/bin/pip install jupyterhub
/opt/conda/bin/pip install oauthenticator
/opt/conda/bin/pip install --upgrade notebook
Generate a sample configuration file and set the authorized users and admins:
jupyterhub --generate-config
cat >> /opt/jupyterhub/jupyterhub_config.py << EOF
from oauthenticator.github import GitHubOAuthenticator
c.JupyterHub.authenticator_class = GitHubOAuthenticator
c.Authenticator.whitelist = {'wuf', 'fwu'}
c.Authenticator.admin_users = {'wuf'}
EOF
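If you need to pin the hub's bind address or port explicitly, the corresponding options can be appended to the same config file (an optional sketch; the values are placeholders for your environment):
cat >> /opt/jupyterhub/jupyterhub_config.py << EOF
c.JupyterHub.ip = '0.0.0.0'
c.JupyterHub.port = 8000
EOF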
Create a startup script that includes all the custom configuration required for our jupyterhub. Note that the OAUTH & GITHUB variables are required to integrate with our Enterprise GitHub.
mkdir -p /opt/jupyterhub
cat > /opt/jupyterhub/jupyterhub.sh << 'EOF'
#!/bin/bash
export PATH="/opt/conda/bin:$PATH"
export OAUTH_CLIENT_ID=********************
export OAUTH_CLIENT_SECRET=*****************************************
export OAUTH_CALLBACK_URL=http://hostname:8000/hub/oauth_callback
export GITHUB_HOST=github.internal.server
export GITHUB_HTTP=true
/opt/conda/bin/jupyterhub --ip hostname -f /opt/jupyterhub/jupyterhub_config.py
EOF
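Make the script executable so supervisord can launch it:
chmod +x /opt/jupyterhub/jupyterhub.sh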
The GitHub OAuth application (client ID, secret, and callback URL) is configured by your internal GitHub site admin.
Users need to be created on the local system:
groupadd -g 500 users
useradd -m -s /bin/bash -u 1000 -g 500 wuf
useradd -m -s /bin/bash -u 1001 -g 500 fwu
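If you have more than a handful of users, a small loop keeps this manageable (a sketch; the usernames and starting UID mirror the example above):
uid=1000
for u in wuf fwu; do
  useradd -m -s /bin/bash -u $uid -g 500 $u
  uid=$((uid+1))
done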
Jupyter kernels can be listed and validated by running the following command; by default, kernels are loaded from /opt/conda/share/jupyter/kernels and /usr/local/share/jupyter/kernels:
jupyter kernelspec list
We will place all our custom kernels here:
mkdir -p /usr/local/share/jupyter/kernels
Anaconda comes with python 3; we can install a python 2.7 environment if needed.

Install python 2.7 environment

/opt/conda/bin/conda create -n py27 python=2.7 anaconda

Activate the python 2.7 environment

source activate /opt/conda/envs/py27
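To expose this environment as a notebook kernel under /usr/local/share/jupyter/kernels, register it with ipykernel (a sketch; the kernel name and display name are arbitrary):
/opt/conda/envs/py27/bin/pip install ipykernel
/opt/conda/envs/py27/bin/python -m ipykernel install --prefix=/usr/local --name py27 --display-name "Python 2.7"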

Install additional python packages

# install packages in python3
# using conda 
/opt/conda/bin/conda install pandas numpy matplotlib seaborn requests tabulate six future xgboost
# using pip
/opt/conda/bin/pip install hyperopt

# install packages in python2.7
source activate /opt/conda/envs/py27
# using conda
/opt/conda/bin/conda install -n py27 pandas numpy matplotlib seaborn requests tabulate six future xgboost
# using pip
/opt/conda/envs/py27/bin/pip install hyperopt

Install Jupyter notebook extensions

/opt/conda/bin/conda install -c conda-forge jupyter_contrib_nbextensions
/opt/conda/bin/jupyter contrib nbextension install --system
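Individual extensions can then be toggled; for example, to enable the table-of-contents extension system-wide (toc2 shown purely as an illustration):
/opt/conda/bin/jupyter nbextension enable toc2/main --system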

/opt/conda/bin/conda clean --all

Install supervisor to manage jupyterhub start/stop

The PATH override below is required so the supervisor packages are installed with the system-managed python instead of the python managed by conda.
sudo yum install supervisor
export PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/root/bin
sudo easy_install superlance
exec bash
cat > /etc/supervisord.d/jupyterhub.ini << EOF
[program:jupyterhub]
command=/opt/jupyterhub/jupyterhub.sh               ; the program (relative uses PATH, can take args)
process_name=%(program_name)s  ; process_name expr (default %(program_name)s)
numprocs=1                     ; number of processes copies to start (def 1)
directory=/opt/jupyterhub        ; directory to cwd to before exec (def no cwd)
priority=1                     ; the relative start priority (default 999)
startsecs=30                    ; number of secs prog must stay running (def. 1)
redirect_stderr=true          ; redirect proc stderr to stdout (default false)
stdout_logfile=/var/log/%(program_name)s-stdout.log        ; stdout log path, NONE for none; default AUTO

[eventlistener:crashmailbatch]
command=crashmailbatch -t alert@domain.com -f supervisord@%(host_node_name)s -s "jupyterhub crashed on %(host_node_name)s"
events=PROCESS_STATE,TICK_60
EOF
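Make sure supervisord itself is enabled and running, then load the new program definition (service name as packaged in EPEL):
sudo systemctl enable supervisord
sudo systemctl start supervisord
sudo supervisorctl reread
sudo supervisorctl update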

JupyterHub start/stop/restart

supervisorctl start jupyterhub
supervisorctl stop jupyterhub
supervisorctl restart jupyterhub
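To check the current state:
supervisorctl status jupyterhub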

JupyterHub + oauthenticator Setup (docker)

Clone the latest code from jupyterhub oauthenticator

git clone https://github.com/jupyterhub/oauthenticator.git
cd oauthenticator/examples/full

Create an env file containing all the necessary environment variables to be injected into the docker container

cat > env << EOF
OAUTH_CLIENT_ID=***********************
OAUTH_CLIENT_SECRET=********************************************
OAUTH_CALLBACK_URL=http://hostname:8000/hub/oauth_callback
GITHUB_HOST=github.internal.server 
GITHUB_HTTP=true
EOF

Create the users list

cat > userslist << EOF
wuf admin
fwu
EOF

Build the docker container

docker build -t jupyterhub-oauth .

Rename the old container

If there is an old container with the same name, rename it and change its restart policy so that the old and new containers do not both start and compete for the same ports.
docker rename jupyterhub jupyterhub-backup
docker update --restart=no jupyterhub-backup
docker ps -a
docker rename  jupyterhub 

Start the container

The following command starts the docker container with 20G of base storage (the default is 10G), sets the restart policy to always, and uses the host network because Spark opens many random ports.
docker run -d -it --storage-opt size=20G --network host --env-file=env --restart always jupyterhub-oauth
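The docker exec and rename steps assume a container named jupyterhub; if you prefer, name it at creation time instead (same command with a --name flag added):
docker run -d -it --name jupyterhub --storage-opt size=20G --network host --env-file=env --restart always jupyterhub-oauth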

Enter the container

docker exec -it jupyterhub bash

Set the docker container timezone

echo "America/Toronto" > /etc/timezone
rm /etc/localtime
ln -snf /usr/share/zoneinfo/America/Toronto /etc/localtime
dpkg-reconfigure -f noninteractive tzdata

Apache Spark 2.x + Derby + AWS Integration


Download & Install Spark

Download Spark 2.3.1 pre-built for Hadoop 2.7 and untar the distribution under /opt.
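For example (Apache archive URL; the symlink is optional but matches how /opt/spark and /opt/spark-2.3.1-bin-hadoop2.7 are both used below):
cd /opt
wget https://archive.apache.org/dist/spark/spark-2.3.1/spark-2.3.1-bin-hadoop2.7.tgz
tar -xzf spark-2.3.1-bin-hadoop2.7.tgz
ln -s /opt/spark-2.3.1-bin-hadoop2.7 /opt/spark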

Configure Spark Master node

Configure the following environment variables in /opt/spark-2.3.1-bin-hadoop2.7/conf/spark-env.sh. Note that in this example the Spark master port and web UI port are set to custom values; these overrides are optional.
SPARK_MASTER_PORT=7177
SPARK_MASTER_WEBUI_PORT=8180
SPARK_LOCAL_IP=10.4.12.38
SPARK_WORKER_MEMORY=168g
SPARK_WORKER_CORES=21
SPARK_LOCAL_DIRS=/opt/spark-2.3.1-bin-hadoop2.7/tmp
#SPARK_WORKER_OPTS="-Dspark.worker.cleanup.enabled=true"
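With spark-env.sh in place, the master can be sanity-checked by starting it manually before handing it over to supervisord (stop it again afterwards):
/opt/spark-2.3.1-bin-hadoop2.7/sbin/start-master.sh
# the web UI should be reachable on the custom port, e.g. http://10.4.12.38:8180
/opt/spark-2.3.1-bin-hadoop2.7/sbin/stop-master.sh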

Configure Spark Worker node

Configure /opt/spark-2.3.1-bin-hadoop2.7/conf/spark-env.sh on each worker node; SPARK_MASTER_HOST must be defined for worker nodes.
SPARK_MASTER_HOST=192.168.1.10
SPARK_LOCAL_IP=192.168.1.11
SPARK_WORKER_MEMORY=168g
SPARK_WORKER_CORES=21
SPARK_LOCAL_DIRS=/opt/spark-2.3.1-bin-hadoop2.7/tmp
#SPARK_WORKER_OPTS="-Dspark.worker.cleanup.enabled=true"
mkdir -p /var/lib/spark/rdd

useradd -u 1120 spark
chown -R spark: /var/lib/spark
chmod -R g+w /var/lib/spark
mkdir -p /opt/spark-2.3.1-bin-hadoop2.7/tmp
chmod 777 /opt/spark-2.3.1-bin-hadoop2.7/tmp
setfacl -Rdm g::rwx /opt/spark-2.3.1-bin-hadoop2.7/tmp ; setfacl -Rdm o::rwx /opt/spark-2.3.1-bin-hadoop2.7/tmp ; getfacl /opt/spark-2.3.1-bin-hadoop2.7/tmp
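As with the master, the worker can be started manually once to confirm it registers with the master (URL as in spark-defaults.conf below), then stopped again:
/opt/spark-2.3.1-bin-hadoop2.7/sbin/start-slave.sh spark://192.168.1.10:7177
/opt/spark-2.3.1-bin-hadoop2.7/sbin/stop-slave.sh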

Copy the following jars from your Derby Network Server installation

scp root@192.168.1.10:/opt/derby/lib/derbytools.jar /opt/spark/jars
scp root@192.168.1.10:/opt/derby/lib/derbyclient.jar /opt/spark/jars

Download hadoop-aws-2.7.x.jar and aws-java-sdk-1.7.4.jar and place them under /opt/spark/jars/
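With those jars in place, S3A access usually also needs credentials; one option is to add the standard Hadoop S3A properties to the spark-defaults.conf created below (the values here are placeholders):
spark.hadoop.fs.s3a.impl          org.apache.hadoop.fs.s3a.S3AFileSystem
spark.hadoop.fs.s3a.access.key    <access-key>
spark.hadoop.fs.s3a.secret.key    <secret-key>
spark.hadoop.fs.s3a.endpoint      s3.amazonaws.com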

Create the following spark-defaults.conf; the same config can be used on both master and worker nodes.

cat >> /opt/spark/conf/spark-defaults.conf << EOF
spark.master                     spark://192.168.1.10:7177

# spark.eventLog.enabled         true
# spark.eventLog.dir             hdfs://namenode:8021/directory
# spark.serializer               org.apache.spark.serializer.KryoSerializer

spark.sql.warehouse.dir      /opt/spark-2.3.1-bin-hadoop2.7/tmp/warehouse
spark.driver.memory              2g
spark.executor.memory            2g
spark.executor.cores             2
spark.jars                       /opt/spark-2.3.1-bin-hadoop2.7/jars/hadoop-aws-2.7.3.jar,/opt/spark-2.3.1-bin-hadoop2.7/jars/aws-java-sdk-1.7.4.jar
spark.task.reaper.enabled        true
spark.task.reaper.killTimeout    300s
#spark.driver.extraJavaOptions    -Dderby.system.home=/opt/spark-2.3.1-bin-hadoop2.7/tmp/derby
EOF
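Once the master and at least one worker are up, a quick smoke test with the bundled SparkPi example confirms the cluster and config work (the master URL is picked up from spark-defaults.conf):
/opt/spark/bin/run-example SparkPi 10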

Integrate Spark master and worker with supervisord

The following scripts need to be placed under $SPARK_HOME/sbin/. They are modified from the original spark-daemon.sh, start-master.sh, and start-slave.sh so that the Spark process is launched with the exec command and keeps the parent PID, allowing supervisord to manage it; see the sketch after this list.
supervisor-spark-daemon.sh
supervisor-start-master.sh
supervisor-start-slave.sh
supervisor-start-thriftserver.sh
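As a reference point, a minimal wrapper along these lines achieves the same foreground behaviour, assuming SPARK_NO_DAEMONIZE keeps start-master.sh from daemonizing (illustrative only; the actual modified scripts differ):
#!/bin/bash
# supervisor-start-master.sh (illustrative sketch)
export SPARK_HOME=/opt/spark-2.3.1-bin-hadoop2.7
export SPARK_NO_DAEMONIZE=true
# exec replaces this wrapper shell, so supervisord signals the Spark launch process directly
exec "${SPARK_HOME}/sbin/start-master.sh"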
Create the supervisord config for master node
cat > /etc/supervisord.d/spark23-master.ini << EOF
[program:spark23-master]
environment=SPARK_NO_DAEMONIZE="true"
command=/opt/spark-2.3.1-bin-hadoop2.7/sbin/supervisor-start-master.sh           ; the program (relative uses PATH, can take args)
process_name=%(program_name)s  ; process_name expr (default %(program_name)s)
numprocs=1                     ; number of processes copies to start (def 1)
directory=/opt/spark-2.3.1-bin-hadoop2.7           ; directory to cwd to before exec (def no cwd)
;umask=022                     ; umask for process (default None)
priority=3                     ; the relative start priority (default 999)
;autostart=true                ; start at supervisord start (default: true)
;autorestart=true              ; retstart at unexpected quit (default: true)
startsecs=10                    ; number of secs prog must stay running (def. 1)
;startretries=3                ; max # of serial start failures (default 3)
;exitcodes=0,2                 ; 'expected' exit codes for process (default 0,2)
;stopsignal=QUIT               ; signal used to kill process (default TERM)
;stopwaitsecs=10               ; max num secs to wait b4 SIGKILL (default 10)
user=spark                     ; setuid to this UNIX account to run the program
redirect_stderr=true          ; redirect proc stderr to stdout (default false)
stdout_logfile=/opt/spark-2.3.1-bin-hadoop2.7/logs/%(program_name)s-stdout.log        ; stdout log path, NONE for none; default AUTO
;stdout_logfile_maxbytes=1MB   ; max # logfile bytes b4 rotation (default 50MB)
;stdout_logfile_backups=10     ; # of stdout logfile backups (default 10)
;stdout_capture_maxbytes=1MB   ; number of bytes in 'capturemode' (default 0)
;stdout_events_enabled=false   ; emit events on stdout writes (default false)
;stderr_logfile=/a/path        ; stderr log path, NONE for none; default AUTO
;stderr_logfile_maxbytes=1MB   ; max # logfile bytes b4 rotation (default 50MB)
;stderr_logfile_backups=10     ; # of stderr logfile backups (default 10)
;stderr_capture_maxbytes=1MB   ; number of bytes in 'capturemode' (default 0)
;stderr_events_enabled=false   ; emit events on stderr writes (default false)
;environment=A=1,B=2           ; process environment additions (def no adds)
;serverurl=AUTO                ; override serverurl computation (childutils)
EOF
Create the supervisord config for worker node
cat > /etc/supervisord.d/spark23-slave.ini << EOF
[program:spark23-slave]
environment=SPARK_NO_DAEMONIZE="true"
command=/opt/spark-2.3.1-bin-hadoop2.7/sbin/supervisor-start-slave.sh spark://10.4.12.36:7177 ; the program (relative uses PATH, can take args)
process_name=%(program_name)s  ; process_name expr (default %(program_name)s)
numprocs=1                     ; number of processes copies to start (def 1)
directory=/opt/spark-2.3.1-bin-hadoop2.7 ; directory to cwd to before exec (def no cwd)
;umask=022                     ; umask for process (default None)
priority=4                     ; the relative start priority (default 999)
;autostart=true                ; start at supervisord start (default: true)
;autorestart=true              ; retstart at unexpected quit (default: true)
startsecs=10                    ; number of secs prog must stay running (def. 1)
;startretries=3                ; max # of serial start failures (default 3)
;exitcodes=0,2                 ; 'expected' exit codes for process (default 0,2)
;stopsignal=QUIT               ; signal used to kill process (default TERM)
;stopwaitsecs=10               ; max num secs to wait b4 SIGKILL (default 10)
user=spark                     ; setuid to this UNIX account to run the program
redirect_stderr=true          ; redirect proc stderr to stdout (default false)
stdout_logfile=/opt/spark-2.3.1-bin-hadoop2.7/logs/%(program_name)s-stdout.log        ; stdout log path, NONE for none; default AUTO
;stdout_logfile_maxbytes=1MB   ; max # logfile bytes b4 rotation (default 50MB)
;stdout_logfile_backups=10     ; # of stdout logfile backups (default 10)
;stdout_capture_maxbytes=1MB   ; number of bytes in 'capturemode' (default 0)
;stdout_events_enabled=false   ; emit events on stdout writes (default false)
;stderr_logfile=/a/path        ; stderr log path, NONE for none; default AUTO
;stderr_logfile_maxbytes=1MB   ; max # logfile bytes b4 rotation (default 50MB)
;stderr_logfile_backups=10     ; # of stderr logfile backups (default 10)
;stderr_capture_maxbytes=1MB   ; number of bytes in 'capturemode' (default 0)
;stderr_events_enabled=false   ; emit events on stderr writes (default false)
;environment=A=1,B=2           ; process environment additions (def no adds)
;serverurl=AUTO                ; override serverurl computation (childutils)
EOF
Add the new configs to supervisord and start them:
sudo supervisorctl reread
sudo supervisorctl update
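Confirm the processes came up on each node:
sudo supervisorctl status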