Saturday, 26 September 2015

Spark and Pyspark-Cassandra connector installation

Use the following commands to install Java on an Ubuntu machine:
$ sudo apt-add-repository ppa:webupd8team/java
$ sudo apt-get update
$ sudo apt-get install oracle-java7-installer
To check that the Java installation succeeded:
$ java -version
It shows the installed Java version:
java version "1.7.0_72"
Java(TM) SE Runtime Environment (build 1.7.0_72-b14)
Java HotSpot(TM) 64-Bit Server VM (build 24.72-b04, mixed mode)
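Because `java -version` writes its report to standard error, capturing the version in a script needs a `2>&1` redirect. A minimal sketch, where `java_version_string` is a hypothetical helper name and not part of any tool:

```shell
# Hypothetical helper: read `java -version` output on stdin and print
# the quoted version string found on its first line.
java_version_string() {
    sed -n '1s/.*version "\([^"]*\)".*/\1/p'
}
```

Run it as `java -version 2>&1 | java_version_string`. For the output shown above it prints 1.7.0_72, and it prints nothing if the first line has no quoted version.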
The next step is to install Scala. First download Scala from here.
Copy the downloaded file to some location, for example /usr/local/src, untar it, and set the path variable:
$ wget
$ sudo mkdir /usr/local/src/scala
$ sudo tar xvf scala-2.10.4.tgz -C /usr/local/src/scala/
$ vi .bashrc
Add the following at the end of the file:
export SCALA_HOME=/usr/local/src/scala/scala-2.10.4
export PATH=$PATH:$SCALA_HOME/bin
Reload .bashrc:
$ . .bashrc
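Editing .bashrc by hand works, but repeating the setup can leave duplicate entries. As a sketch of a scripted alternative, the hypothetical helper below (`append_once` is an illustrative name) appends a line to a file only when that exact line is not already there:

```shell
# Hypothetical helper: append line $1 to file $2 unless the file
# already contains that exact line.
append_once() {
    line="$1"
    file="$2"
    touch "$file"
    # -x matches whole lines only, -F treats the line as a fixed string.
    grep -qxF -- "$line" "$file" || printf '%s\n' "$line" >> "$file"
}
```

For example, `append_once 'export SCALA_HOME=/usr/local/src/scala/scala-2.10.4' ~/.bashrc` can be run any number of times and adds the line only once.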
To check that Scala was installed successfully:
$ scala -version
It shows the installed Scala version:
Scala code runner version 2.10.4 -- Copyright 2002-2013, LAMP/EPFL
Or just type scala to enter the Scala interactive shell:
$ scala
The next step is to install git:
$ sudo apt-get install git
Finally, download the Spark distribution from here:
$ wget
$ tar xvf spark-1.4.1.tgz
Once Spark is downloaded, run the following command from the extracted Spark directory to build it:
$ sudo build/mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -DskipTests clean package
Clone pyspark-cassandra from TargetHolding into a directory of your choice and build it using sbt:
$ sudo git clone
PySpark Cassandra is compiled with sbt, so install sbt first:
$ sudo apt-get install sbt
Go to the pyspark-cassandra directory and compile:
$ sbt compile
The package can be published locally with:
$ sbt spPublishLocal
Using PySpark Cassandra requires both a Java/JVM library and a Python library. They can be built with:
$ make dist
This creates:
1) a fat jar with the Spark Cassandra Connector and additional classes for bridging Spark and PySpark for Cassandra data and
2) a python source distribution at:
target/pyspark_cassandra_<version>-<python version>.egg.
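Since the file names under target/ depend on the version built, it can help to look the two artifacts up rather than type their names. A minimal sketch, assuming both files end up somewhere below target/; the helper name `first_match` and the glob patterns are illustrative, not part of the project:

```shell
# Hypothetical helper: print the first regular file under directory $1
# whose name matches the glob pattern $2 (sorted for a stable result).
first_match() {
    find "$1" -type f -name "$2" | sort | head -n 1
}
```

From the pyspark-cassandra directory, `JAR=$(first_match target '*.jar')` and `EGG=$(first_match target '*.egg')` then hold the paths to pass to pyspark. If the build leaves more than one jar, narrow the pattern.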
To run pyspark against a Spark cluster, start the cluster first:
$ ./sbin/
Now you can start up a single set of workers. It'll start in the foreground:
$ ./bin/spark-class org.apache.spark.deploy.worker.Worker spark://
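In another terminal, launch pyspark with the connector jar and egg (PYSPARK_ROOT is the directory that holds them):
$ ./pyspark --jars ${PYSPARK_ROOT}/pyspark_cassandra-0.1.5.jar --driver-class-path ${PYSPARK_ROOT}/pyspark_cassandra-0.1.5.jar --py-files ${PYSPARK_ROOT}/pyspark_cassandra-0.1.5-py2.7.egg --conf --master spark://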
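A wrong PYSPARK_ROOT or version number in the jar or egg path tends to surface only later, as a class or import error inside the pyspark shell. A small pre-flight check can catch it earlier. This is a sketch; `require_files` is a hypothetical helper name:

```shell
# Hypothetical pre-flight check: succeed only if every path given as an
# argument is an existing regular file; report the first one that is not.
require_files() {
    for f in "$@"; do
        if [ ! -f "$f" ]; then
            echo "missing file: $f" >&2
            return 1
        fi
    done
}
```

Calling `require_files "$PYSPARK_ROOT/pyspark_cassandra-0.1.5.jar" "$PYSPARK_ROOT/pyspark_cassandra-0.1.5-py2.7.egg"` before the pyspark command returns a non-zero status and names the missing file if either path is wrong.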

If you have problems installing Scala, follow these steps.
This was done on Ubuntu 15.04 but should work the same on 14.04.
1) Remove the Scala-related lines from your .bashrc, such as:
export SCALA_HOME=/usr/local/src/scala/scala-2.10.4
2) Remove and reinstall Scala:
sudo rm -rf /usr/local/src/scala
# The following line is only needed if you installed scala another way, if so remove the #
# sudo apt-get remove scala-library scala
sudo dpkg -i scala-2.11.7.deb
sudo apt-get update
sudo apt-get install scala
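Step 1 can also be scripted. The sketch below deletes every line that mentions SCALA_HOME from a file and keeps a backup copy next to it. The helper name `remove_scala_lines` is hypothetical. The pattern is deliberately broad, so it also removes a PATH line that refers to $SCALA_HOME.

```shell
# Hypothetical helper: delete every line mentioning SCALA_HOME from
# file $1, leaving the original content in "$1.bak".
remove_scala_lines() {
    sed -i.bak '/SCALA_HOME/d' "$1"
}
```

Run `remove_scala_lines ~/.bashrc`, then open a new shell so the old setting is no longer in effect.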


  4. Hi,

    I am not able to compile with sbt.

    Please help me to solve this error.

    Compiling 7 Scala sources to /home/veeresh/hadoop/tal/pyspark-cassandra/target/scala-2.10/classes...
    [error] /home/veeresh/hadoop/tal/pyspark-cassandra/src/main/scala/pyspark_cassandra/Pickling.scala:17: not found: object pyspark_util
    [error] import pyspark_util.Conversions._
    [error] ^
    [error] /home/veeresh/hadoop/tal/pyspark-cassandra/src/main/scala/pyspark_cassandra/Pickling.scala:18: not found: object pyspark_util
    [error] import pyspark_util.{ Pickling => PicklingUtils, _ }
