Saturday, 26 September 2015

Spark and PySpark-Cassandra connector installation

Use the following commands to install Java on an Ubuntu machine:
$ sudo apt-add-repository ppa:webupd8team/java
$ sudo apt-get update
$ sudo apt-get install oracle-java7-installer
To check that the Java installation was successful:
$ java -version
It shows the installed Java version:
java version "1.7.0_72"
Java(TM) SE Runtime Environment (build 1.7.0_72-b14)
Java HotSpot(TM) 64-Bit Server VM (build 24.72-b04, mixed mode)
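If the Spark build later complains that JAVA_HOME is not set, export it in .bashrc as well. The path below is the default install location used by the oracle-java7-installer package; adjust it if yours differs:
$ export JAVA_HOME=/usr/lib/jvm/java-7-oracle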
The next step is to install Scala; follow the instructions below. First download Scala, copy the downloaded file to a location such as /usr/local/src, untar it, and set the PATH variable:
$ wget http://www.scala-lang.org/files/archive/scala-2.10.4.tgz
$ sudo mkdir /usr/local/src/scala
$ sudo tar xvf scala-2.10.4.tgz -C /usr/local/src/scala/
$ vi .bashrc
and add the following at the end of the file:
export SCALA_HOME=/usr/local/src/scala/scala-2.10.4
export PATH=$SCALA_HOME/bin:$PATH
Reload .bashrc:
$ . .bashrc
To check that Scala is installed successfully:
$ scala -version
It shows the installed Scala version:
Scala code runner version 2.10.4 -- Copyright 2002-2013, LAMP/EPFL
Or just type scala to enter the Scala interactive shell:
$ scala
scala>
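A quick expression confirms the REPL works (type :quit to exit):
scala> 1 + 1
res0: Int = 2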
Next, install git:
$ sudo apt-get install git
Finally, download the Spark distribution:
$ wget http://d3kbcqa49mib13.cloudfront.net/spark-1.4.1.tgz
$ tar xvf spark-1.4.1.tgz
Once Spark is downloaded, use the following command to build it:
$ sudo build/mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -DskipTests clean package
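Once the build finishes, a quick sanity check (using the SparkPi example that ships with Spark) is:
$ ./bin/run-example SparkPi 10
It should print a line similar to "Pi is roughly 3.14".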
Clone pyspark-cassandra from TargetHolding into a directory of your choice and build it using sbt:
$ sudo git clone https://github.com/TargetHolding/pyspark-cassandra.git
Building:
PySpark Cassandra can be compiled with sbt. Install sbt first:
$ sudo apt-get install sbt
Then go into the pyspark-cassandra directory and compile:
$ sbt compile
The package can be published locally with:
$ sbt spPublishLocal
Using PySpark Cassandra requires both a Java/JVM library and a Python library. They can be built with:
$ make dist
This creates:
1) a fat jar with the Spark Cassandra Connector and additional classes for bridging Spark and PySpark for Cassandra data, at target/pyspark_cassandra-<version>.jar, and
2) a Python source distribution at target/pyspark_cassandra_<version>-<python version>.egg.
Commands to start a Spark standalone cluster and run pyspark against it:
$ export SPARK_MASTER_IP=127.0.0.1
$ ./sbin/start-master.sh
Now you can start a worker; it will run in the foreground:
$ ./bin/spark-class org.apache.spark.deploy.worker.Worker spark://127.0.0.1:7077
$ ./pyspark --jars ${PYSPARK_ROOT}/pyspark_cassandra-0.1.5.jar --driver-class-path ${PYSPARK_ROOT}/pyspark_cassandra-0.1.5.jar --py-files ${PYSPARK_ROOT}/pyspark_cassandra-0.1.5-py2.7.egg --conf spark.cassandra.connection.host=host-ip --master spark://127.0.0.1:7077
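With pyspark running and the connector on the classpath, reading a Cassandra table looks roughly like the sketch below. It assumes pyspark_cassandra exposes monkey_patch_sc() and cassandraTable() as described in the TargetHolding README for this version; my_keyspace and my_table are placeholders for your own schema.
>>> import pyspark_cassandra
>>> pyspark_cassandra.monkey_patch_sc(sc)  # assumed helper: adds cassandraTable() to the existing SparkContext
>>> rdd = sc.cassandraTable("my_keyspace", "my_table")  # placeholder keyspace/table names
>>> rdd.first()  # fetch one row to confirm the connection to Cassandra works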

If you have problems installing Scala, follow these steps:
This was done on Ubuntu 15.04 but should work the same on 14.04.
1) Remove the following lines from your .bashrc:
export SCALA_HOME=/usr/local/src/scala/scala-2.10.4
export PATH=$SCALA_HOME/bin:$PATH
2) Remove and reinstall Scala:
sudo rm -rf /usr/local/src/scala
# The following line is only needed if you installed scala another way, if so remove the #
# sudo apt-get remove scala-library scala
wget http://www.scala-lang.org/files/archive/scala-2.11.7.deb
sudo dpkg -i scala-2.11.7.deb
sudo apt-get update
sudo apt-get install scala
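Then verify the new version:
$ scala -version
It should now report Scala code runner version 2.11.7.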
