Friday, November 10, 2017

JupyterHub & Spark Setup

Install Java

Java is required to run PySpark.

Docker

The Docker container runs Debian jessie, so the jessie-backports repository needs to be added to install OpenJDK 8:
echo "deb http://cdn-fastly.deb.debian.org/debian jessie-backports main" >> /etc/apt/sources.list
apt-get update -y
Note the -t jessie-backports option, which pulls the package from backports:
apt-get install -t jessie-backports openjdk-8-jdk -y
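After installing, it's worth confirming the JDK is active and exporting JAVA_HOME so PySpark can find it (the path below is Debian's default for openjdk-8 on amd64; adjust for your architecture):

```shell
# Confirm the backports JDK installed correctly
java -version   # should report an openjdk 1.8.0 build

# Export JAVA_HOME for pyspark (Debian amd64 default install path)
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
```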

Download & Install Spark

Download Spark 2.2.0 pre-built for Hadoop 2.7 [https://www.apache.org/dyn/closer.lua/spark/spark-2.2.0/spark-2.2.0-bin-hadoop2.7.tgz] and untar the distribution under /opt/spark.
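The download-and-untar step might look like this (the archive.apache.org URL is one example mirror; pick whichever the closer.lua link above suggests):

```shell
# Fetch the pre-built Spark 2.2.0 / Hadoop 2.7 tarball
SPARK_TGZ=spark-2.2.0-bin-hadoop2.7.tgz
wget "https://archive.apache.org/dist/spark/spark-2.2.0/${SPARK_TGZ}"

# Unpack directly into /opt/spark, dropping the top-level directory
mkdir -p /opt/spark
tar -xzf "${SPARK_TGZ}" -C /opt/spark --strip-components=1
```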

Configure Spark

Configure the following environment variables in /opt/spark/conf/spark-env.sh, for example:
SPARK_LOCAL_IP=192.168.1.11
SPARK_WORKER_MEMORY=4g
SPARK_WORKER_CORES=2
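With spark-env.sh in place, a standalone master and worker can be started using Spark's bundled sbin scripts (a sketch, reusing the example IP from above):

```shell
# Start the standalone master (listens on SPARK_LOCAL_IP:7077 by default)
/opt/spark/sbin/start-master.sh

# Start a worker and register it with the master
/opt/spark/sbin/start-slave.sh spark://192.168.1.11:7077
```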
Make the default Spark RDD directory group-writable:
mkdir -p /var/lib/spark/rdd
chown -R spark: /var/lib/spark
chmod -R g+w /var/lib/spark
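Finally, a notebook launched by JupyterHub needs to find the Spark install. One minimal approach, assuming the /opt/spark prefix used above, is to put Spark's Python sources on sys.path by hand (the py4j zip name is tied to the Spark release; 0.10.4 ships with Spark 2.2.0, adjust if yours differs):

```python
import os
import sys

# Point notebooks at the Spark install from the steps above
SPARK_HOME = "/opt/spark"
os.environ["SPARK_HOME"] = SPARK_HOME

# Make pyspark importable: Spark's Python sources plus the bundled py4j zip
sys.path.insert(0, os.path.join(SPARK_HOME, "python", "lib", "py4j-0.10.4-src.zip"))
sys.path.insert(0, os.path.join(SPARK_HOME, "python"))
```

After this, `import pyspark` works in a notebook without installing pyspark into the hub's Python environment separately.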
