Sunday, 23 July 2017

Installing Kubernetes

In the last blog, we discussed what Kubernetes is and what its advantages are. In this second post of the series, we are going to discuss how to install Kubernetes locally on your machine. You can find all the posts in the series here.

Installing Kubernetes on a Local Machine

One of the cool features of Kubernetes is that it can be installed and tried out locally. It behaves exactly as it would on a full cluster. To try out Kubernetes locally, we need to install Minikube and kubectl.
The steps are below.
  • Step 1: Prerequisites

To install Kubernetes on a local machine, we install Minikube. Minikube normally uses a virtualization layer to run the needed software, so for our example we will use VirtualBox as our virtualization layer. For more prerequisites, refer here.
Download and install VirtualBox from here.
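Once VirtualBox is installed, a quick sanity check (these commands are only a suggested check on a Linux host, not part of the official guide) is to confirm the install and that hardware virtualization is available:
VBoxManage --version
egrep -c '(vmx|svm)' /proc/cpuinfo    # a non-zero count means VT-x/AMD-V is available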
  • Step 2: Install Minikube

Run the below command to install Minikube on Linux. For other operating systems, refer here. The latest version as of this writing is 0.16.0.
curl -Lo minikube https://storage.googleapis.com/minikube/releases/v0.16.0/minikube-linux-amd64 && chmod +x minikube && sudo mv minikube /usr/local/bin/
  • Step 3: Install kubectl

kubectl is a command line utility which communicates with Kubernetes over its REST API. We can install it using the below commands.
curl -LO https://storage.googleapis.com/kubernetes-release/release/$(curl -s https://storage.googleapis.com/kubernetes-release/release/stable.txt)/bin/linux/amd64/kubectl
chmod +x ./kubectl
sudo mv ./kubectl /usr/local/bin/kubectl
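To confirm the binary is on the path (an optional check, not part of the original steps), print the client version:
kubectl version --client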

Interacting With Minikube

Once we have installed Minikube and kubectl, we can start playing with Kubernetes.
We can start Minikube using the below command. It downloads the Minikube ISO and starts a virtual machine in VirtualBox.
minikube start
We can open the Kubernetes dashboard using the below command.
minikube dashboard
We can check whether anything is running using the below command.
kubectl get po --all-namespaces
This command should show some Kubernetes containers running.
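As an extra sanity check, we can deploy a small test service and hit it through Minikube. This is a minimal sketch based on the standard Minikube quick start; the echoserver image and port are just the example values used there:
kubectl run hello-minikube --image=gcr.io/google_containers/echoserver:1.4 --port=8080
kubectl expose deployment hello-minikube --type=NodePort
minikube service hello-minikube --url    # prints a URL you can curl or open in a browser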
Now we have successfully installed and configured Kubernetes on our machine.
In our next post, we will discuss the different abstractions of Kubernetes and how to use them in our applications.

Kubernetes with Spark

As our workloads become more and more microservice oriented, building an infrastructure to deploy them easily becomes important. Most big data applications need multiple services like HDFS, YARN and Spark, each with its own cluster. Creating, deploying and monitoring them manually is tedious and error prone.
So most users move to the cloud to simplify this. Solutions like EMR, Databricks etc. help in this regard, but then users are locked into those specific services. Also, we sometimes want the same deployment strategy to work on premise, and most cloud providers don't offer that option today.
So we need a framework which helps us create and monitor complex big data clusters, and which lets us move between on-premise clusters and cloud providers seamlessly. Kubernetes is one such framework.
In this set of posts, we are going to discuss how Kubernetes, an open source container orchestration framework from Google, helps us achieve a deployment strategy for Spark and other big data tools that works across on-premise and cloud environments. As part of the series, we will discuss how to install, configure and scale Kubernetes both locally and on the cloud. We are also going to discuss how to build our own customised images for the services and applications.
This is the first blog in the series, where we discuss what Kubernetes is and its advantages. You can access all the other blogs in the series here.

What is Kubernetes?

Kubernetes is an open source container orchestration framework. In simple words, it's a framework which allows us to create and manage multiple containers. These containers are typically Docker containers running some service, which can be your typical web app, a database, or even big data tools like Spark, HBase etc.

Why Kubernetes?

Most readers may have tried Docker before. It's a framework which allows developers to containerise their applications, and it has become a popular way to develop, test and deploy applications at scale. When we already have Docker, what does Kubernetes bring to the picture? Can't we just build our clusters using plain Docker itself?
Below are some of the advantages of using Kubernetes over plain Docker tooling.
  • Orchestration

One of the important features that sets Kubernetes apart from Docker is that it is not a container framework; it is more of an orchestration layer for the multiple containers that normally make up an application. Docker itself has the Compose feature, but it is very limited. As our applications become complex, we end up with many containers that need to be orchestrated, and doing that manually becomes tricky. Kubernetes helps in that regard.
Kubernetes also has support for multiple container runtimes; currently it supports Docker and rkt. This lets users choose their own container runtime rather than sticking with only Docker.
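To make this concrete, below is a minimal sketch of how an application is declared to Kubernetes; the names and image are hypothetical placeholders. We describe the desired state (three replicas of a web container) and Kubernetes keeps reality in sync with it:
apiVersion: apps/v1beta1      # the Deployment API version current around Kubernetes 1.6/1.7
kind: Deployment
metadata:
  name: webapp                # hypothetical application name
spec:
  replicas: 3                 # keep three copies running, rescheduling them on failure
  template:
    metadata:
      labels:
        app: webapp
    spec:
      containers:
      - name: web
        image: nginx:1.11     # any container image
        ports:
        - containerPort: 80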
  • Cloud Independent

One of the important design goals of Kubernetes is the ability to run everywhere. We can run Kubernetes on a local machine, on on-premise clusters or on the cloud. Kubernetes has support for AWS, GCE and Azure out of the box. Not only does it normalise deployments across clouds, it also uses the best tool a specific cloud offers for a given problem, so it optimises for each cloud.
  • Support for Easy Clustering

One of the hard parts of installing big data tools like Spark on the cloud is building and maintaining the cluster. Creating clusters often needs tinkering with networking to make sure all services are started in the right places. Also, once the cluster is up and running, making sure each node has sufficient resources is tricky.
Scaling a cluster, by adding or removing nodes, is often tricky too. Kubernetes makes all of this much easier compared to current solutions. It has excellent support for virtual networking and the ability to easily scale workloads at will.
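For example (a hedged sketch, reusing the hypothetical webapp deployment from the earlier snippet), scaling a service up is a single command once the cluster is running:
kubectl get nodes                               # list the machines that make up the cluster
kubectl scale deployment webapp --replicas=5    # run five copies instead of three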
  • Support for Service Upgrades and Rollbacks

One of the hard parts of running clustered applications is updating the software. Sometimes you want to update the application code, sometimes the version of Spark itself. Having a well defined strategy to upgrade the cluster, with checks and balances, is critical. And when things go south, the ability to roll back in a reasonable time frame is just as important.
Kubernetes provides well defined, container-image-based upgrade policies which unify how different services are upgraded across the cluster. This makes life easier for all the ops people out there.
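As an illustration (again using the hypothetical webapp deployment and its web container from the earlier sketch), a rolling upgrade and a rollback are single kubectl commands:
kubectl set image deployment/webapp web=nginx:1.12   # roll pods to the new image a few at a time
kubectl rollout status deployment/webapp             # watch the rollout progress
kubectl rollout undo deployment/webapp               # roll back to the previous revision if things go south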
  • Effective Resource Isolation and Management

One question we often ponder: should we run services like Kafka next to Spark or not? Most of the time people advise having separate machines so that each service gets sufficient resources. But defining machine sizes and segregating services by machine becomes tricky as we scale our services.
Kubernetes frees you from the machine. It asks you to define how many resources you want to dedicate to a service, and then it takes care of figuring out which machines to run it on. It makes sure resources are used effectively across machines and gives guarantees about resource allocation. You no longer need to worry about one service taking over all the resources and depriving the others, or about your machines being under-utilised.
Kubernetes not only allows you to define resources in terms of GB of RAM or number of CPUs, it also lets you express them as fractions of a CPU and as scheduler requests versus hard limits. These options let you dedicate resources at a finer granularity.
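A minimal sketch of what this looks like in a pod definition; the pod name, image and numbers are placeholders. Requests are what the scheduler reserves on a machine, limits are the hard cap:
apiVersion: v1
kind: Pod
metadata:
  name: spark-worker          # hypothetical pod name
spec:
  containers:
  - name: worker
    image: spark:2.1          # placeholder image name
    resources:
      requests:
        memory: "4Gi"         # the scheduler reserves 4 GB of RAM on some machine
        cpu: "500m"           # half a core, expressed in millicores
      limits:
        memory: "8Gi"         # the container is killed if it goes above this
        cpu: "2"              # hard cap of two cores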
  • Well Defined Storage Management

One of the challenges of microservice-oriented architectures is preserving state across container restarts and upgrades. It is critical that applications like databases do not lose data when something goes wrong with a container or a machine.
Kubernetes gives a clear storage abstraction whose life cycle is independent of the container itself. This lets users plug in different storage backends, such as host-based volumes or network-attached drives, to make sure there is no data loss. These abstractions also tie in well with the persistence options provided by the cloud, like EBS on AWS. Kubernetes makes running long-lived persistent services like databases a breeze.
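A minimal sketch of that abstraction (the claim name and size are placeholders): a PersistentVolumeClaim that a database pod can mount without caring whether the 10 GB behind it is a local disk or an EBS volume:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: db-data               # hypothetical claim name
spec:
  accessModes:
  - ReadWriteOnce             # mounted read-write by a single node
  resources:
    requests:
      storage: 10Gi           # ask for 10 GB; the cluster decides where it comes from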
Now we know what Kubernetes brings to the table. In our next post, we will discuss how to install Kubernetes on a local machine.

Sunday, 11 June 2017

Set up Zeppelin to access LDAP, create new interpreters to access SparkR and JDBC, and externalise notebooks to S3

Use Apache Zeppelin as an interactive notebook for data exploration.

Apache Zeppelin is a web-based notebook for data analysis, visualisation and reporting. Zeppelin lets you perform data analysis interactively and view the outcome of your analysis visually. It supports the Scala functional programming language with Spark by default. If you have used Jupyter Notebook (previously known as IPython Notebook) or Databricks Cloud before, you will find Zeppelin familiar.

Create an AWS EMR cluster with Zeppelin.

Once you have created an AWS EMR cluster with Zeppelin, you need to make configuration changes to use other interpreters like JDBC, Pig, Python, HBase, Livy for SparkR, etc. These interpreters help you access the different data sources in your environment.
Prerequisites:
1) Install Maven

2) Install Git (if not available):

   sudo yum install git

3) Install the Livy server, an open source REST service for Apache Spark (Apache License):

git clone https://github.com/cloudera/livy.git      

cd livy      

mvn package -DskipTests

sudo vi ~/.bashrc

Add the below lines to the file:

export SPARK_HOME=/usr/lib/spark

export HADOOP_CONF_DIR=/etc/hadoop/conf

Then reload the environment and start the Livy server:

source ~/.bashrc
nohup ./bin/livy-server &

Check that Livy has started by copying this URL into a browser: http://ip-xx-xx-xx-xx.ap-southeast-2.compute.internal:8998/ui
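You can also exercise the Livy REST API directly; this is a hedged sketch, so replace localhost with your Livy host. Creating an interactive Spark session and then listing sessions looks like:
curl -s -X POST -H "Content-Type: application/json" -d '{"kind": "spark"}' http://localhost:8998/sessions
curl -s http://localhost:8998/sessions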

Then add the Livy server settings in the interpreter section of Zeppelin.

To add the JDBC interpreter to Zeppelin notebooks to access Phoenix, Aurora, Redshift etc.:

  • Install all community managed interpreters: ./bin/install-interpreter.sh --all
  • Install specific interpreters: ./bin/install-interpreter.sh --name md,shell,jdbc,python
  • You can get the full list of community managed interpreters by running: ./bin/install-interpreter.sh --list
Then add the interpreter in the interpreter section of Zeppelin. Please go through https://zeppelin.apache.org/docs/0.7.1/interpreter/jdbc.html to configure Redshift, Aurora, Hive, etc. through JDBC.
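As an illustration (a hedged sketch, not taken from the Zeppelin docs verbatim: the endpoint, database and credentials are placeholders, and Redshift is reached here through the PostgreSQL JDBC driver), a Redshift connection on the JDBC interpreter would be configured roughly like:
default.driver     org.postgresql.Driver
default.url        jdbc:postgresql://your-cluster.xxxx.ap-southeast-2.redshift.amazonaws.com:5439/dev
default.user       your_db_user
default.password   your_db_password
You also need to add the driver jar as a dependency on the interpreter, for example the artifact org.postgresql:postgresql:9.4.1211 (the version is only an example).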
Restart Zeppelin each time after adding a new interpreter: sudo /usr/lib/zeppelin/bin/zeppelin-daemon.sh restart

To add LDAP configuration through Shiro.

Apache Shiro is a powerful and easy-to-use Java security framework that performs authentication, authorization, cryptography, and session management. In this documentation, we will explain step by step how Shiro works for Zeppelin notebook authentication.
When you connect to Apache Zeppelin, you will be asked to enter your credentials. Once you have logged in, you have access to all notes, including other users' notes.

1. Secure the HTTP channel

To secure the HTTP channel, you have to change both the anon and authc settings in conf/shiro.ini. Here, anon means "the access is anonymous" and authc means "form-based auth security".
Their default status is:
/** = anon
#/** = authc
Comment out the line "/** = anon" and uncomment the line "/** = authc" in the conf/shiro.ini file:
#/** = anon
/** = authc
For further information about the shiro.ini file format, please refer to the Shiro Configuration documentation.

2. Secure the WebSocket channel

Set the property zeppelin.anonymous.allowed to false in conf/zeppelin-site.xml. If you don't have this file yet, just copy conf/zeppelin-site.xml.template to conf/zeppelin-site.xml.
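The property in zeppelin-site.xml then looks roughly like this, with the value flipped from the default true to false:
<property>
  <name>zeppelin.anonymous.allowed</name>
  <value>false</value>
  <description>Anonymous user allowed by default</description>
</property>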

3. Start Zeppelin

bin/zeppelin-daemon.sh start (or restart).

4. Groups and permissions for LDAP

In case you want to leverage user groups and permissions, use one of the following configurations for LDAP or AD under the [main] section in shiro.ini.
Example shiro.ini:

[main]
### A sample for configuring Active Directory Realm
activeDirectoryRealm = org.apache.zeppelin.realm.ActiveDirectoryGroupRealm
#activeDirectoryRealm.systemUsername = userNameA
#activeDirectoryRealm = org.apache.zeppelin.server.ActiveDirectoryGroupRealm
activeDirectoryRealm.systemUsername = cn=svc.aws.abc,ou=ServiceAccounts,ou=ABC,Users,dc=ABC,dc=abccompany,dc=com,dc=xy
activeDirectoryRealm.systemPassword = password!
activeDirectoryRealm.searchBase = dc=xyz,dc=abccompany,dc=com,dc=xy
activeDirectoryRealm.url = ldap://yourldapport:389
activeDirectoryRealm.groupRolesMap = "cn=Right-Usr-AP-Science.Zeppelin.Developer-U-GS,ou=Rights,ou=ABCGroups,dc=xyz,dc=abccompany,DC=com,DC=xy":"developers"
activeDirectoryRealm.hadoopSecurityCredentialPath = jceks://user/zeppelin/conf/zeppelin.jceks
[roles]
#role1 = *
#role2 = *
#role3 = *
admin = *
developers = *
Make sure the LDAP and other configurations are correct. Follow this link for details: https://zeppelin.apache.org/docs/0.7.1/security/shiroauthentication.html

Restart Zeppelin.


Externalise notebooks to S3

In Zeppelin there are two options for notebook storage: by default notebooks are stored in the notebook folder on your local file system, and the second option is S3.

Notebook Storage in S3


You need the following folder structure on S3:
bucket_name/ 
      username/ 
         notebook/

Set the environment variables in the file zeppelin-env.sh:

export ZEPPELIN_NOTEBOOK_S3_BUCKET=bucket_name
export ZEPPELIN_NOTEBOOK_S3_USER=username
For example:
export ZEPPELIN_NOTEBOOK_S3_BUCKET=abc-science-data-dev
export ZEPPELIN_NOTEBOOK_S3_USER=zeppelin
export ZEPPELIN_NOTEBOOK_STORAGE=org.apache.zeppelin.notebook.repo.S3NotebookRepo

In the file zeppelin-site.xml, uncomment and complete the following properties:


<property>
  <name>zeppelin.notebook.s3.user</name>
  <value>username</value>
  <description>user name for s3 folder structure</description>
</property>
<property>
  <name>zeppelin.notebook.s3.bucket</name>
  <value>bucket_name</value>
  <description>bucket name for notebook storage</description>
</property>

Uncomment the following property to use the S3NotebookRepo class:

<property>
  <name>zeppelin.notebook.storage</name>
  <value>org.apache.zeppelin.notebook.repo.S3NotebookRepo</value>
  <description>notebook persistence layer implementation</description>
</property>

Comment out the following property:

<property>
  <name>zeppelin.notebook.storage</name>
  <value>org.apache.zeppelin.notebook.repo.VFSNotebookRepo</value>
  <description>notebook persistence layer implementation</description>
</property>


Restart Zeppelin.