Tuesday, August 7, 2018

JupyterHub Setup (CentOS7)


We are going to use Anaconda to manage Python & jupyterhub packages.

Install Anaconda

wget https://repo.continuum.io/archive/Anaconda3-5.0.1-Linux-x86_64.sh
sudo bash Anaconda3-5.0.1-Linux-x86_64.sh
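The rest of this guide assumes Anaconda is installed under /opt/conda. A minimal sketch of a non-interactive install to that prefix (the interactive installer will prompt for the same path):

# -b accepts the license non-interactively, -p sets the install prefix
sudo bash Anaconda3-5.0.1-Linux-x86_64.sh -b -p /opt/conda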

Install Jupyterhub Prerequisites

sudo yum install epel-release.noarch
sudo yum install npm nodejs
sudo npm install -g configurable-http-proxy
We are going to install the JupyterHub OAuth package (which pulls in JupyterHub itself as a dependency), as we will be integrating authentication with GitHub OAuth.

Install Jupyterhub

/opt/conda/bin/pip install oauthenticator
/opt/conda/bin/pip install --upgrade notebook
Generate a sample configuration file and set the authorized users and admins:
jupyterhub --generate-config
cat >> /opt/jupyterhub/jupyterhub_config.py << EOF
from oauthenticator.github import GitHubOAuthenticator
c.JupyterHub.authenticator_class = GitHubOAuthenticator
c.Authenticator.whitelist = {'wuf', 'fwu'}
c.Authenticator.admin_users = {'wuf'}
EOF
Create a startup script that includes all the necessary custom configuration required for our JupyterHub. Note the OAUTH and GITHUB settings are required to integrate with our Enterprise GitHub.
mkdir -p /opt/jupyterhub
cat > /opt/jupyterhub/jupyterhub.sh << EOF
#!/bin/bash
export PATH="/opt/conda/bin:$PATH"
export OAUTH_CLIENT_ID=********************
export OAUTH_CLIENT_SECRET=*****************************************
export OAUTH_CALLBACK_URL=http://hostname:8000/hub/oauth_callback
export GITHUB_HOST=github.internal.server
export GITHUB_HTTP=true
/opt/conda/bin/jupyterhub --ip hostname -f /opt/jupyterhub/jupyterhub_config.py
EOF
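Make the script executable so supervisord (configured below) can launch it:
chmod +x /opt/jupyterhub/jupyterhub.sh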
GitHub OAuth is configured by your internal GitHub site admin.
Users need to be created on the local system:
groupadd -g 500 users
useradd -m -s /bin/bash -u 1000 -g 500 wuf
useradd -m -s /bin/bash -u 1001 -g 500 fwu
Jupyter kernels can be listed and validated by running the following command. By default, kernels are loaded from /opt/conda/share/jupyter/kernels and /usr/local/share/jupyter/kernels:
jupyter kernelspec list
We will place all our custom kernels here (an example kernelspec follows the python 2.7 setup below):
mkdir -p /usr/local/share/jupyter/kernels
Conda/Anaconda comes with Python 3; we can install a Python 2.7 environment if needed.

Install python 2.7 environment

/opt/conda/bin/conda create -n py27 python=2.7 anaconda

Activate the python 2.7 environment

source activate /opt/conda/envs/py27
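To expose this py27 environment as a notebook kernel, a kernelspec can be dropped into the custom kernel directory created above. A minimal sketch, assuming ipykernel is installed in the environment (the anaconda metapackage includes it):

mkdir -p /usr/local/share/jupyter/kernels/py27
cat > /usr/local/share/jupyter/kernels/py27/kernel.json << EOF
{
 "display_name": "Python 2.7",
 "language": "python",
 "argv": ["/opt/conda/envs/py27/bin/python", "-m", "ipykernel_launcher", "-f", "{connection_file}"]
}
EOF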

Install additional python packages

# install packages in python3
# using conda 
/opt/conda/bin/conda install pandas numpy matplotlib seaborn requests tabulate six future xgboost
# using pip
/opt/conda/bin/pip install hyperopt

# install packages in python2.7
source activate /opt/conda/envs/py27
# using conda
/opt/conda/envs/py27/bin/conda install pandas numpy matplotlib seaborn requests tabulate six future xgboost
# using pip
/opt/conda/envs/py27/bin/pip install hyperopt

Install Jupyter notebook extensions

/opt/conda/bin/conda install -c conda-forge jupyter_contrib_nbextensions
/opt/conda/bin/jupyter contrib nbextension install --system
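Individual extensions can then be toggled with jupyter nbextension enable/disable; for example (toc2 here is just an illustration):

/opt/conda/bin/jupyter nbextension enable toc2/main --system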

/opt/conda/bin/conda clean --all

Install supervisor to manage jupyterhub start/stop

The PATH override is required so the supervisor packages are installed with the system-managed Python instead of the Python managed by conda.
sudo yum install supervisor
export PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/root/bin
sudo easy_install superlance
exec bash
cat > /etc/supervisord.d/jupyterhub.ini << EOF
[program:jupyterhub]
command=/opt/jupyterhub/jupyterhub.sh               ; the program (relative uses PATH, can take args)
process_name=%(program_name)s  ; process_name expr (default %(program_name)s)
numprocs=1                     ; number of processes copies to start (def 1)
directory=/opt/jupyterhub        ; directory to cwd to before exec (def no cwd)
priority=1                     ; the relative start priority (default 999)
startsecs=30                    ; number of secs prog must stay running (def. 1)
redirect_stderr=true          ; redirect proc stderr to stdout (default false)
stdout_logfile=/var/log/%(program_name)s-stdout.log        ; stdout log path, NONE for none; default AUTO

[eventlistener:crashmailbatch]
command=crashmailbatch -t alert@domain.com -f supervisord@%(host_node_name)s -s "jupyterhub crashed on %(host_node_name)s"
events=PROCESS_STATE,TICK_60
EOF
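Make sure supervisord is running, then reload it so it picks up the new program definition:
sudo systemctl enable supervisord
sudo systemctl start supervisord
sudo supervisorctl reread
sudo supervisorctl update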

JupyterHub start/stop/restart

supervisorctl start jupyterhub
supervisorctl stop jupyterhub
supervisorctl restart jupyterhub

JupyterHub + oauthenticator Setup (docker)

Clone the latest code from jupyterhub oauthenticator

git clone https://github.com/jupyterhub/oauthenticator.git
cd oauthenticator/examples/full

Create an env file containing all the necessary environment variables to be injected into the Docker container

cat > env << EOF
OAUTH_CLIENT_ID=***********************
OAUTH_CLIENT_SECRET=********************************************
OAUTH_CALLBACK_URL=http://hostname:8000/hub/oauth_callback
GITHUB_HOST=github.internal.server 
GITHUB_HTTP=true
EOF

Create users

cat > userslist << EOF
wuf admin
fwu
EOF

Build the Docker image

docker build -t jupyterhub-oauth .

Rename the old container

If there is an old container with the same name, we need to rename it and change its restart policy to avoid having both containers compete for the same ports:
docker rename jupyterhub jupyterhub-backup
docker update --restart=no jupyterhub-backup
docker ps -a
# after starting the new container (next step), find its name with docker ps and rename it to jupyterhub
docker rename <new_container_name> jupyterhub

Start the container

The following command starts the Docker container with 20G of base storage (the default is 10G), sets the restart policy to always, and uses the host network because Spark opens many random ports when running.
docker run -d -it --storage-opt size=20G --network host --env-file=env --restart always jupyterhub-oauth

Enter the container

docker exec -it jupyterhub bash

Set the docker container timezone

echo "America/Toronto" > /etc/timezone
rm /etc/localtime
ln -snf /usr/share/zoneinfo/America/Toronto /etc/localtime
dpkg-reconfigure -f noninteractive tzdata

Apache Spark 2.x + Derby + AWS Integration


Download & Install Spark

Download Spark pre-built for Hadoop 2.7 (the examples below use spark-2.3.1-bin-hadoop2.7) and untar the distribution under /opt; /opt/spark below refers to the install directory.
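A sketch of the download and unpack steps (the symlink is a convenience assumed by the /opt/spark paths used later):

wget https://archive.apache.org/dist/spark/spark-2.3.1/spark-2.3.1-bin-hadoop2.7.tgz
tar xzf spark-2.3.1-bin-hadoop2.7.tgz -C /opt
ln -s /opt/spark-2.3.1-bin-hadoop2.7 /opt/spark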

Configure Spark Master node

Configure the following environment variables in /opt/spark-2.3.1-bin-hadoop2.7/conf/spark-env.sh. Please note that in this example the Spark master port and web UI port are set to custom values; these are optional.
SPARK_MASTER_PORT=7177
SPARK_MASTER_WEBUI_PORT=8180
SPARK_LOCAL_IP=10.4.12.38
SPARK_WORKER_MEMORY=168g
SPARK_WORKER_CORES=21
SPARK_LOCAL_DIRS=/opt/spark-2.3.1-bin-hadoop2.7/tmp
#SPARK_WORKER_OPTS="-Dspark.worker.cleanup.enabled=true"
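Before wiring the master into supervisord (below), it can be started by hand to verify the configuration:

/opt/spark-2.3.1-bin-hadoop2.7/sbin/start-master.sh
# the master web UI should come up on port 8180 of this host; stop it again with sbin/stop-master.sh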

Configure Spark Worker node

SPARK_MASTER_HOST should be defined for worker nodes
SPARK_MASTER_HOST=192.168.1.10
SPARK_LOCAL_IP=192.168.1.11
SPARK_WORKER_MEMORY=168g
SPARK_WORKER_CORES=21
SPARK_LOCAL_DIRS=/opt/spark-2.3.1-bin-hadoop2.7/tmp
#SPARK_WORKER_OPTS="-Dspark.worker.cleanup.enabled=true"
mkdir -p /var/lib/spark/rdd

useradd -u 1120 spark
chown -R spark: /var/lib/spark
chmod -R g+w /var/lib/spark
mkdir -p /opt/spark-2.3.1-bin-hadoop2.7/tmp
chmod 777 /opt/spark-2.3.1-bin-hadoop2.7/tmp
setfacl -Rdm g::rwx /opt/spark-2.3.1-bin-hadoop2.7/tmp ; setfacl -Rdm o::rwx /opt/spark-2.3.1-bin-hadoop2.7/tmp ; getfacl /opt/spark-2.3.1-bin-hadoop2.7/tmp
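The worker can likewise be started by hand, using the master address configured above, to confirm it registers before handing it over to supervisord:

/opt/spark-2.3.1-bin-hadoop2.7/sbin/start-slave.sh spark://192.168.1.10:7177
# the worker should appear on the master web UI; stop it again with sbin/stop-slave.sh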

Copy the following jars from your Derby Network Server Install

scp root@192.168.1.10:/opt/derby/lib/derbytools.jar /opt/spark/jars
scp root@192.168.1.10:/opt/derby/lib/derbyclient.jar /opt/spark/jars

Download hadoop-aws-2.7.x.jar and aws-java-sdk-1.7.4.jar and place them under /opt/spark/jars/

Create the following spark-defaults.conf; the same config can be used on both master and worker nodes.

cat >> /opt/spark/conf/spark-defaults.conf << EOF
spark.master                     spark://192.168.1.10:7177

# spark.eventLog.enabled         true
# spark.eventLog.dir             hdfs://namenode:8021/directory
# spark.serializer               org.apache.spark.serializer.KryoSerializer

spark.sql.warehouse.dir      /opt/spark-2.3.1-bin-hadoop2.7/tmp/warehouse
spark.driver.memory              2g
spark.executor.memory            2g
spark.executor.cores             2
spark.jars                       /opt/spark-2.3.1-bin-hadoop2.7/jars/hadoop-aws-2.7.3.jar,/opt/spark-2.3.1-bin-hadoop2.7/jars/aws-java-sdk-1.7.4.jar
spark.task.reaper.enabled        true
spark.task.reaper.killTimeout    300s
#spark.driver.extraJavaOptions    -Dderby.system.home=/opt/spark-2.3.1-bin-hadoop2.7/tmp/derby
EOF
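With the hadoop-aws and aws-java-sdk jars on the classpath, S3 is reachable through the s3a:// scheme. A quick smoke test from PySpark (a sketch; the credential values and bucket path are placeholders, and the shell picks up spark-defaults.conf automatically):

/opt/spark/bin/pyspark \
  --conf spark.hadoop.fs.s3a.access.key=$AWS_ACCESS_KEY_ID \
  --conf spark.hadoop.fs.s3a.secret.key=$AWS_SECRET_ACCESS_KEY
# inside the shell: spark.read.text("s3a://your-bucket/some/key").count()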

Integrate Spark master and worker with supervisord

The following scripts need to be placed under $SPARK_HOME/sbin/. They are modified from the original spark-daemon.sh, start-master.sh and start-slave.sh: each launches the process with exec so it retains the parent PID, allowing supervisor to manage it.
supervisor-spark-daemon.sh
supervisor-start-master.sh
supervisor-start-slave.sh
supervisor-start-thriftserver.sh
Create the supervisord config for master node
cat > /etc/supervisord.d/spark23-master.ini << EOF
[program:spark23-master]
environment=SPARK_NO_DAEMONIZE="true"
command=/opt/spark-2.3.1-bin-hadoop2.7/sbin/supervisor-start-master.sh           ; the program (relative uses PATH, can take args)
process_name=%(program_name)s  ; process_name expr (default %(program_name)s)
numprocs=1                     ; number of processes copies to start (def 1)
directory=/opt/spark-2.3.1-bin-hadoop2.7           ; directory to cwd to before exec (def no cwd)
;umask=022                     ; umask for process (default None)
priority=3                     ; the relative start priority (default 999)
;autostart=true                ; start at supervisord start (default: true)
;autorestart=true              ; retstart at unexpected quit (default: true)
startsecs=10                    ; number of secs prog must stay running (def. 1)
;startretries=3                ; max # of serial start failures (default 3)
;exitcodes=0,2                 ; 'expected' exit codes for process (default 0,2)
;stopsignal=QUIT               ; signal used to kill process (default TERM)
;stopwaitsecs=10               ; max num secs to wait b4 SIGKILL (default 10)
user=spark                     ; setuid to this UNIX account to run the program
redirect_stderr=true          ; redirect proc stderr to stdout (default false)
stdout_logfile=/opt/spark-2.3.1-bin-hadoop2.7/logs/%(program_name)s-stdout.log        ; stdout log path, NONE for none; default AUTO
;stdout_logfile_maxbytes=1MB   ; max # logfile bytes b4 rotation (default 50MB)
;stdout_logfile_backups=10     ; # of stdout logfile backups (default 10)
;stdout_capture_maxbytes=1MB   ; number of bytes in 'capturemode' (default 0)
;stdout_events_enabled=false   ; emit events on stdout writes (default false)
;stderr_logfile=/a/path        ; stderr log path, NONE for none; default AUTO
;stderr_logfile_maxbytes=1MB   ; max # logfile bytes b4 rotation (default 50MB)
;stderr_logfile_backups=10     ; # of stderr logfile backups (default 10)
;stderr_capture_maxbytes=1MB   ; number of bytes in 'capturemode' (default 0)
;stderr_events_enabled=false   ; emit events on stderr writes (default false)
;environment=A=1,B=2           ; process environment additions (def no adds)
;serverurl=AUTO                ; override serverurl computation (childutils)
EOF
Create the supervisord config for worker node
cat > /etc/supervisord.d/spark23-slave.ini << EOF
[program:spark23-slave]
environment=SPARK_NO_DAEMONIZE="true"
command=/opt/spark-2.3.1-bin-hadoop2.7/sbin/supervisor-start-slave.sh spark://10.4.12.36:7177 ; the program (relative uses PATH, can take args)
process_name=%(program_name)s  ; process_name expr (default %(program_name)s)
numprocs=1                     ; number of processes copies to start (def 1)
directory=/opt/spark-2.3.1-bin-hadoop2.7 ; directory to cwd to before exec (def no cwd)
;umask=022                     ; umask for process (default None)
priority=4                     ; the relative start priority (default 999)
;autostart=true                ; start at supervisord start (default: true)
;autorestart=true              ; retstart at unexpected quit (default: true)
startsecs=10                    ; number of secs prog must stay running (def. 1)
;startretries=3                ; max # of serial start failures (default 3)
;exitcodes=0,2                 ; 'expected' exit codes for process (default 0,2)
;stopsignal=QUIT               ; signal used to kill process (default TERM)
;stopwaitsecs=10               ; max num secs to wait b4 SIGKILL (default 10)
user=spark                     ; setuid to this UNIX account to run the program
redirect_stderr=true          ; redirect proc stderr to stdout (default false)
stdout_logfile=/opt/spark-2.3.1-bin-hadoop2.7/logs/%(program_name)s-stdout.log        ; stdout log path, NONE for none; default AUTO
;stdout_logfile_maxbytes=1MB   ; max # logfile bytes b4 rotation (default 50MB)
;stdout_logfile_backups=10     ; # of stdout logfile backups (default 10)
;stdout_capture_maxbytes=1MB   ; number of bytes in 'capturemode' (default 0)
;stdout_events_enabled=false   ; emit events on stdout writes (default false)
;stderr_logfile=/a/path        ; stderr log path, NONE for none; default AUTO
;stderr_logfile_maxbytes=1MB   ; max # logfile bytes b4 rotation (default 50MB)
;stderr_logfile_backups=10     ; # of stderr logfile backups (default 10)
;stderr_capture_maxbytes=1MB   ; number of bytes in 'capturemode' (default 0)
;stderr_events_enabled=false   ; emit events on stderr writes (default false)
;environment=A=1,B=2           ; process environment additions (def no adds)
;serverurl=AUTO                ; override serverurl computation (childutils)
EOF
Add the new configs to supervisord and start them:
sudo supervisorctl reread
sudo supervisorctl update
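Check that both processes came up cleanly:
sudo supervisorctl status spark23-master spark23-slave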

Thursday, June 28, 2018

Check disk usage by file type on Linux

#!/bin/bash

for ext in `find . -type f | perl -ne 'print $1 if m/\.([^.\/]+)$/' | sort -u`; do
    echo "`find . -name "*."$ext -print0 | du -ch --files0-from=- | tail -1` by "$ext
done

Monday, June 25, 2018

Jupyter Kernel & DataStax Enterprise Spark Integration

Install DSE from tarball
tar xpvf dse-4.8.15-bin.tar.gz
These Hadoop config files need to be copied from the target DSE environment:
$DSE_HOME/resources/hadoop/conf
dse-mapred-default.xml
dse-core-default.xml
Create the following temp work directories for Spark.
Note: all users who use the notebook must have write permission to var/lib/spark; group-writable permission is recommended.
mkdir -p /opt/dse4/var/lib/spark/worker
mkdir -p /opt/dse4/var/lib/spark/rdd
chown -R root:fleet /opt/dse4/var/lib/spark
chmod g+w -R /opt/dse4/var/lib/spark
The following environment variables need to be set in spark-env.sh under the DSE home:
/opt/dse4/resources/spark/conf/spark-env.sh
export SPARK_WORKER_DIR="/opt/dse4/var/lib/spark/worker"
export SPARK_LOCAL_DIRS="/opt/dse4/var/lib/spark/rdd"
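To finish the integration on the Jupyter side, the DSE Spark environment can be exposed through a kernelspec. The sketch below is illustrative only: the Python interpreter must have ipykernel installed and be compatible with the Spark version DSE ships, and the exact PYTHONPATH entries (in particular the py4j zip name under /opt/dse4/resources/spark/python/lib) depend on the DSE version.

mkdir -p /usr/local/share/jupyter/kernels/dse-pyspark
cat > /usr/local/share/jupyter/kernels/dse-pyspark/kernel.json << EOF
{
 "display_name": "PySpark (DSE)",
 "language": "python",
 "argv": ["python", "-m", "ipykernel_launcher", "-f", "{connection_file}"],
 "env": {
  "SPARK_HOME": "/opt/dse4/resources/spark",
  "PYTHONPATH": "/opt/dse4/resources/spark/python:/opt/dse4/resources/spark/python/lib/py4j-<version>-src.zip"
 }
}
EOF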

Wednesday, June 6, 2018

Multiple PHP version with Apache on CentOS 7

Step 1
Install all the necessary repos and packages (big thanks to https://rpms.remirepo.net/wizard/).
The following commands assume you have already run sudo su -; otherwise add sudo to each command:
yum install httpd -y
yum install https://dl.fedoraproject.org/pub/epel/epel-release-latest-7.noarch.rpm
yum install http://rpms.remirepo.net/enterprise/remi-release-7.rpm
yum install yum-utils -y
yum install php56 -y
yum install php72 -y
yum install php56-php-fpm -y
yum install php72-php-fpm -y
Stop both FPM servers:
systemctl stop php56-php-fpm
systemctl stop php72-php-fpm
By default both listen on 127.0.0.1 port 9000; make them listen on different ports:
sed -i 's/:9000/:9056/' /etc/opt/remi/php56/php-fpm.d/www.conf
sed -i 's/:9000/:9072/' /etc/opt/remi/php72/php-fpm.d/www.conf
Now the two different versions of FPM can be started on different ports:
systemctl start php72-php-fpm
systemctl start php56-php-fpm
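Optionally confirm the two pools are listening on their new ports:

ss -tlnp | grep -E ':9056|:9072'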


Step 2
Make script wrappers to call php56-cgi and php72-cgi:

cat > /var/www/cgi-bin/php56.fcgi << EOF
#!/bin/bash
exec /bin/php56-cgi
EOF

cat > /var/www/cgi-bin/php72.fcgi << EOF
#!/bin/bash
exec /bin/php72-cgi
EOF
Make them executable by Apache:
sudo chmod 755 /var/www/cgi-bin/php56.fcgi
sudo chmod 755 /var/www/cgi-bin/php72.fcgi

Create the PHP configuration for Apache. By default it uses the php56-fcgi handler:

cat > /etc/httpd/conf.d/php.conf << EOF
ScriptAlias /cgi-bin/ "/var/www/cgi-bin/"
AddHandler php56-fcgi .php
Action php56-fcgi /cgi-bin/php56.fcgi
Action php72-fcgi /cgi-bin/php72.fcgi

<Directory /var/www/html/php56>
    DirectoryIndex index.php
    AllowOverride all
    Require all granted
</Directory>
<Directory /var/www/html/php72>
    DirectoryIndex index.php
    AllowOverride all
    Require all granted
</Directory>
EOF
Step 2 - Alternative
Remi Collet pointed out this could be replaced by SetHandler with httpd 2.4
cat > /etc/httpd/conf.d/php.conf << EOF
<Directory /var/www/html/php56>
    SetHandler "proxy:fcgi://127.0.0.1:9056" 
    DirectoryIndex index.php
    AllowOverride all
    Require all granted
</Directory>
<Directory /var/www/html/php72>
    SetHandler "proxy:fcgi://127.0.0.1:9072"
    DirectoryIndex index.php
    AllowOverride all
    Require all granted
</Directory>
EOF


Step 3
Make test pages and create an .htaccess that uses the php72-fcgi handler:
mkdir -p /var/www/html/php56
mkdir -p /var/www/html/php72
echo "" > /var/www/html/php56/index.php
echo "" > /var/www/html/php72/index.php
echo "AddHandler php72-fcgi .php" > /var/www/html/php72/.htaccess

Now you should be able to test it

If you want these instances to start automatically after a server reboot, enable the services below. That's all, you are ready to go!

sudo systemctl enable httpd
sudo systemctl enable php56-php-fpm
sudo systemctl enable php72-php-fpm