Saturday, 26 September 2015

Spark and PySpark-Cassandra connector installation

Use the following commands to install Java on an Ubuntu machine:
$ sudo apt-add-repository ppa:webupd8team/java
$ sudo apt-get update
$ sudo apt-get install oracle-java7-installer
To check that the Java installation was successful:
$ java -version
It shows the installed Java version:
java version "1.7.0_72"
Java(TM) SE Runtime Environment (build 1.7.0_72-b14)
Java HotSpot(TM) 64-Bit Server VM (build 24.72-b04, mixed mode)
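If the Spark build later complains that JAVA_HOME is not set, export it in .bashrc as well. The path below is the default install location used by the oracle-java7-installer package; adjust it if yours differs:
$ export JAVA_HOME=/usr/lib/jvm/java-7-oracle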
The next step is to install Scala; follow the instructions below. First download Scala, copy the downloaded file to a location such as /usr/local/src, untar it, and set the PATH variable:
$ wget http://www.scala-lang.org/files/archive/scala-2.10.4.tgz
$ sudo mkdir /usr/local/src/scala
$ sudo tar xvf scala-2.10.4.tgz -C /usr/local/src/scala/
$ vi .bashrc
and add the following at the end of the file:
export SCALA_HOME=/usr/local/src/scala/scala-2.10.4
export PATH=$SCALA_HOME/bin:$PATH
Reload .bashrc:
$ . .bashrc
To check that Scala is installed successfully:
$ scala -version
It shows the installed Scala version:
Scala code runner version 2.10.4 -- Copyright 2002-2013, LAMP/EPFL
Or just type scala to enter the Scala interactive shell:
$ scala
scala>
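A quick expression confirms the REPL works (type :quit to exit):
scala> 1 + 1
res0: Int = 2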
Next, install git:
$ sudo apt-get install git
Finally, download the Spark distribution:
$ wget http://d3kbcqa49mib13.cloudfront.net/spark-1.4.1.tgz
$ tar xvf spark-1.4.1.tgz
Once Spark is downloaded, use the following command to build it:
$ sudo build/mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -DskipTests clean package
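Once the build finishes, a quick sanity check (using the SparkPi example that ships with Spark) is:
$ ./bin/run-example SparkPi 10
It should print a line similar to "Pi is roughly 3.14".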
Clone pyspark-cassandra from TargetHolding into a directory of your choice and build it using sbt:
$ sudo git clone https://github.com/TargetHolding/pyspark-cassandra.git
Building:
PySpark Cassandra can be compiled with sbt. Install sbt first:
$ sudo apt-get install sbt
Then go into the pyspark-cassandra directory and compile:
$ sbt compile
The package can be published locally with:
$ sbt spPublishLocal
Using PySpark Cassandra requires both a Java/JVM library and a Python library. They can be built with:
$ make dist
This creates:
1) a fat jar with the Spark Cassandra Connector and additional classes for bridging Spark and PySpark for Cassandra data, at target/pyspark_cassandra-<version>.jar, and
2) a Python source distribution at target/pyspark_cassandra_<version>-<python version>.egg.
Commands to start a Spark standalone cluster and run pyspark against it:
$ export SPARK_MASTER_IP=127.0.0.1
$ ./sbin/start-master.sh
Now you can start a worker; it will run in the foreground:
$ ./bin/spark-class org.apache.spark.deploy.worker.Worker spark://127.0.0.1:7077
$ ./pyspark --jars ${PYSPARK_ROOT}/pyspark_cassandra-0.1.5.jar --driver-class-path ${PYSPARK_ROOT}/pyspark_cassandra-0.1.5.jar --py-files ${PYSPARK_ROOT}/pyspark_cassandra-0.1.5-py2.7.egg --conf spark.cassandra.connection.host=host-ip --master spark://127.0.0.1:7077
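With pyspark running and the connector on the classpath, reading a Cassandra table looks roughly like the sketch below. It assumes pyspark_cassandra exposes monkey_patch_sc() and cassandraTable() as described in the TargetHolding README for this version; my_keyspace and my_table are placeholders for your own schema.
>>> import pyspark_cassandra
>>> pyspark_cassandra.monkey_patch_sc(sc)  # assumed helper: adds cassandraTable() to the existing SparkContext
>>> rdd = sc.cassandraTable("my_keyspace", "my_table")  # placeholder keyspace/table names
>>> rdd.first()  # fetch one row to confirm the connection to Cassandra works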

If you have problems installing Scala, follow these steps:
This was done on Ubuntu 15.04 but should work the same on 14.04.
1) Remove the following lines from your .bashrc:
export SCALA_HOME=/usr/local/src/scala/scala-2.10.4
export PATH=$SCALA_HOME/bin:$PATH
2) Remove and reinstall Scala:
sudo rm -rf /usr/local/src/scala
# The following line is only needed if you installed scala another way, if so remove the #
# sudo apt-get remove scala-library scala
wget http://www.scala-lang.org/files/archive/scala-2.11.7.deb
sudo dpkg -i scala-2.11.7.deb
sudo apt-get update
sudo apt-get install scala
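Then verify the new version:
$ scala -version
It should now report Scala code runner version 2.11.7.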
