Friday, November 10, 2017

JupyterHub & Spark Setup

Install Java

Java is required to run pyspark

Docker

The Docker container is based on Debian jessie; jessie-backports needs to be added to get OpenJDK 8
echo "deb http://cdn-fastly.deb.debian.org/debian jessie-backports main" >> /etc/apt/sources.list
apt-get update -y
Please note the -t jessie-backports option
apt-get install -t jessie-backports openjdk-8-jdk -y
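
A quick sanity check, assuming the backports package installed cleanly:
java -version
# expect an openjdk version "1.8.0_..." banner from the backports build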

Download & Install Spark

Download Spark 2.2.0 pre-built for Hadoop 2.7 [https://www.apache.org/dyn/closer.lua/spark/spark-2.2.0/spark-2.2.0-bin-hadoop2.7.tgz] and untar the distribution under /opt/spark
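A minimal sketch of those two steps (the archive.apache.org URL is an assumption; any mirror from the link above works):
wget https://archive.apache.org/dist/spark/spark-2.2.0/spark-2.2.0-bin-hadoop2.7.tgz
mkdir -p /opt/spark
# strip the top-level spark-2.2.0-bin-hadoop2.7/ directory so files land directly in /opt/spark
tar -xzf spark-2.2.0-bin-hadoop2.7.tgz -C /opt/spark --strip-components=1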

Configure Spark

Configure the following environment variables in /opt/spark/conf/spark-env.sh, for example:
SPARK_LOCAL_IP=192.168.1.11
SPARK_WORKER_MEMORY=4g
SPARK_WORKER_CORES=2
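
For PySpark sessions launched from JupyterHub to find this install, the usual approach is to export SPARK_HOME. SPARK_HOME and PYSPARK_PYTHON are standard Spark variables; pointing the kernels at python3 is an assumption:
export SPARK_HOME=/opt/spark
export PATH="$SPARK_HOME/bin:$PATH"
# assumption: notebook kernels run on python3
export PYSPARK_PYTHON=python3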
Make the default Spark RDD directory group writable
mkdir -p /var/lib/spark/rdd
chown -R spark: /var/lib/spark
chmod -R g+w /var/lib/spark
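
To make Spark actually use that directory for RDD/shuffle scratch space, point the standard spark.local.dir property at it; appending to spark-defaults.conf as below is one sketch, not the only way:
echo "spark.local.dir /var/lib/spark/rdd" >> /opt/spark/conf/spark-defaults.conf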

chmod group read/write on a deep sub-directory under home

In rare cases you may need to grant group read and write access to a multi-level sub-directory under your home, though this is not recommended from a security perspective. Note that once your home directory is group writable, passwordless SSH will fail: sshd's StrictModes check rejects public-key login when the home directory is writable by group or others.

# grant group rw on each directory from the target up to (but not including) /home
p="/home/runwuf/project/test/p1/runwuf" ; while [ "$p" != "/home" ] ; do chmod g+rw "$p" ; p=$(dirname "$p") ; done
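
To verify the resulting permissions at every level of the path, namei from util-linux prints the owner, group, and mode of each component:
namei -l /home/runwuf/project/test/p1/runwuf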