Saturday, 26 September 2015

Installing the VM Image for the Jupyter IPython Notebook for PySpark

Process of installing the VM image for the Jupyter IPython Notebook for PySpark:

VirtualBox 4.3.28 (or later).
Make sure VirtualBox is installed on your machine with the command:
vboxmanage --version
If it is not installed, install it with the following command:
sudo apt-get install virtualbox

Vagrant 1.7.2 (or later).
Make sure Vagrant is installed on your machine with the command:
vagrant --version
If it is not installed, install it with the following command:
sudo apt-get install vagrant



Create a file named Vagrantfile in an empty directory of your choice, with the following contents:

# -*- mode: ruby -*-
# vi: set ft=ruby :

ipythonPort = 8001                 # Ipython port to forward (also set in IPython notebook config)

Vagrant.configure(2) do |config|
  config.ssh.insert_key = true
  config.vm.define "sparkvm" do |master|
    master.vm.box = "sparkmooc/base"
    master.vm.box_url = "https://atlas.hashicorp.com/sparkmooc/boxes/base/versions/0.0.7.1/providers/virtualbox.box"
    master.vm.box_download_insecure = true
    master.vm.boot_timeout = 900
    master.vm.network :forwarded_port, host: ipythonPort, guest: ipythonPort, auto_correct: true   # IPython port (set in notebook config)
    master.vm.network :forwarded_port, host: 4040, guest: 4040, auto_correct: true                 # Spark UI (Driver)
    master.vm.hostname = "sparkvm"
    master.vm.usable_port_range = 4040..4090

    master.vm.provider :virtualbox do |v|
      v.name = master.vm.hostname.to_s
    end
  end
end


Then, from that directory, run the command vagrant up.

Once the VM is running, to access the notebook, open a web browser to "http://localhost:8001/" (on Windows and Mac) or "http://127.0.0.1:8001/" (on Linux).
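
To verify that Spark is working from the notebook, run a small job in a new cell. This is only a sanity-check sketch: it assumes the notebook provides a pre-created SparkContext named sc (if it does not, create one with pyspark.SparkContext first):

# Sanity check in a notebook cell.
# Assumes a SparkContext named `sc` is already available in the notebook.
rdd = sc.parallelize(range(100))   # distribute a small list of numbers
print(rdd.count())                 # should print 100 if Spark is working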


Spark and PySpark-Cassandra Connector Installation

Use the following commands to install Java on an Ubuntu machine:
$ sudo apt-add-repository ppa:webupd8team/java
$ sudo apt-get update
$ sudo apt-get install oracle-java7-installer
To check that the Java installation was successful:
$ java -version
It shows the installed Java version:
java version "1.7.0_72"
Java(TM) SE Runtime Environment (build 1.7.0_72-b14)
Java HotSpot(TM) 64-Bit Server VM (build 24.72-b04, mixed mode)
The next step is to install Scala; follow the instructions below to set it up. First download Scala,
copy the downloaded file to some location, for example /usr/local/src, untar the file, and set the PATH variable:
$ wget http://www.scala-lang.org/files/archive/scala-2.10.4.tgz
$ sudo mkdir /usr/local/src/scala
$ sudo tar xvf scala-2.10.4.tgz -C /usr/local/src/scala/
$ vi .bashrc
And add the following at the end of the file:
export SCALA_HOME=/usr/local/src/scala/scala-2.10.4
export PATH=$SCALA_HOME/bin:$PATH
Then reload .bashrc:
$ . .bashrc
To check that Scala is installed successfully:
$ scala -version
It shows the installed Scala version:
Scala code runner version 2.10.4 -- Copyright 2002-2013, LAMP/EPFL
Or just type scala to enter the Scala interactive shell:
$ scala
scala>
The next step is to install git:
sudo apt-get install git
Finally, download the Spark distribution:
$ wget http://d3kbcqa49mib13.cloudfront.net/spark-1.4.1.tgz
$ tar xvf spark-1.4.1.tgz
Once Spark is downloaded, build it with the following command:
$ sudo build/mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -DskipTests clean package
Clone pyspark-cassandra from TargetHolding into a directory of your choice and build it using sbt:
$ sudo git clone https://github.com/TargetHolding/pyspark-cassandra.git
Building:
PySpark Cassandra can be compiled using sbt. First install sbt:
$ sudo apt-get install sbt
Then go to the pyspark-cassandra directory and compile:
$ sbt compile
The package can be published locally with:
$ sbt spPublishLocal
Both a Java/JVM library and a Python library are required to use PySpark Cassandra. They can be built with:
$ make dist
This creates:
1) a fat jar with the Spark Cassandra Connector and additional classes for bridging Spark and PySpark for Cassandra data, at target/pyspark_cassandra-<version>.jar, and
2) a python source distribution at target/pyspark_cassandra_<version>-<python version>.egg.
To run against a Spark cluster from pyspark, first start a master:
$ export SPARK_MASTER_IP=127.0.0.1
$ ./sbin/start-master.sh
Now you can start a single worker; it will run in the foreground:
$ ./bin/spark-class org.apache.spark.deploy.worker.Worker spark://127.0.0.1:7077
Finally, start pyspark with the connector jar and egg, pointing it at the Cassandra host and the Spark master:
$ ./bin/pyspark --jars ${PYSPARK_ROOT}/pyspark_cassandra-0.1.5.jar --driver-class-path ${PYSPARK_ROOT}/pyspark_cassandra-0.1.5.jar --py-files ${PYSPARK_ROOT}/pyspark_cassandra-0.1.5-py2.7.egg --conf spark.cassandra.connection.host=host-ip --master spark://127.0.0.1:7077
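
The same jar and egg can also be used from a standalone script submitted with spark-submit and the same --jars, --driver-class-path and --py-files options. Below is a minimal sketch of reading a Cassandra table as an RDD; the keyspace name, table name and connection host are placeholders for your own values, and it assumes a Cassandra node is reachable from the cluster:

# Minimal sketch: read a Cassandra table from PySpark via pyspark-cassandra.
# The keyspace/table names and connection host below are placeholders.
from pyspark import SparkConf
from pyspark_cassandra import CassandraSparkContext

conf = SparkConf() \
    .setAppName("pyspark-cassandra-demo") \
    .setMaster("spark://127.0.0.1:7077") \
    .set("spark.cassandra.connection.host", "host-ip")

sc = CassandraSparkContext(conf=conf)                  # SparkContext with Cassandra helpers

rows = sc.cassandraTable("my_keyspace", "my_table")    # RDD of rows in the table
print(rows.first())                                    # fetch one row to confirm connectivity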

If you have problems installing Scala, then follow these steps:
This was tested on Ubuntu 15.04, but should work the same on 14.04.
1) Remove the following lines from your .bashrc:
export SCALA_HOME=/usr/local/src/scala/scala-2.10.4
export PATH=$SCALA_HOME/bin:$PATH
2) Remove and reinstall scala
sudo rm -rf /usr/local/src/scala
# The following line is only needed if you installed scala another way, if so remove the #
# sudo apt-get remove scala-library scala
wget http://www.scala-lang.org/files/archive/scala-2.11.7.deb
sudo dpkg -i scala-2.11.7.deb
sudo apt-get update
sudo apt-get install scala