Sunday, 11 June 2017

Set up Zeppelin to access LDAP, create new interpreters for SparkR and JDBC, and externalise notebooks to S3

Use Apache Zeppelin as a notebook for interactive data exploration.

Apache Zeppelin is a web-based notebook for data analysis, visualisation and reporting. Zeppelin lets you perform data analysis interactively and view the outcome of your analysis visually. It supports the Scala functional programming language with Spark by default. If you have used Jupyter Notebook (previously known as IPython Notebook) or Databricks Cloud before, you will find Zeppelin familiar.

Create an AWS EMR cluster with Zeppelin.

Once you have created an AWS EMR cluster with Zeppelin, you need to make configuration changes to use other interpreters such as JDBC, Pig, Python, HBase, and Livy (for SparkR). These interpreters help you access the different data sources in your environment.

Prerequisites:
1) Install Maven

2) Install git (if not available)

   sudo yum install git

3) Install the Livy server, an open source REST service for Apache Spark (Apache License)

git clone https://github.com/cloudera/livy.git      

cd livy      

mvn package -DskipTests

vi ~/.bashrc

     Add the following lines:

export SPARK_HOME=/usr/lib/spark    

export HADOOP_CONF_DIR=/etc/hadoop/conf     

source ~/.bashrc

nohup ./bin/livy-server &

Check that Livy has started by opening this URL in a browser: http://ip-xx-xx-xx-xx.ap-southeast-2.compute.internal:8998/ui
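
If the master node has no browser access, the same check can be done from the command line. The following is a minimal sketch, assuming curl is installed and Livy is listening on its default port 8998. The helper name livy_is_up is hypothetical; /sessions is the Livy REST endpoint that lists sessions.

```shell
#!/bin/sh
# Return 0 if a Livy server answers on the given host and port.
# livy_is_up is a hypothetical helper name; host and port default to
# localhost:8998 (8998 is Livy's default port).
livy_is_up() {
    host="${1:-localhost}"
    port="${2:-8998}"
    # -s: quiet, -f: treat HTTP errors as failure, --max-time: do not hang.
    curl -sf --max-time 5 "http://${host}:${port}/sessions" >/dev/null
}

# Usage (run on the master node after starting livy-server):
#   livy_is_up && echo "Livy is running" || echo "Livy is not reachable"
```

Keeping the check in a function makes it easy to reuse in a bootstrap script that waits for Livy before configuring the interpreter.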

     Configure the Livy server in the interpreter section of Zeppelin.

To add the JDBC interpreter to Zeppelin notebooks to access Phoenix, Aurora, Redshift, etc.

  • Install all community managed interpreters:
      ./bin/install-interpreter.sh --all
  • Install specific interpreters:
      ./bin/install-interpreter.sh --name md,shell,jdbc,python
  • Get the full list of community managed interpreters:
      ./bin/install-interpreter.sh --list
Then add the interpreter in the interpreter section of Zeppelin. See https://zeppelin.apache.org/docs/0.7.1/interpreter/jdbc.html to configure Redshift, Aurora, Hive, etc. over JDBC.
Restart Zeppelin each time you add a new interpreter: sudo /usr/lib/zeppelin/bin/zeppelin-daemon.sh restart

To add LDAP configuration through Shiro.

Apache Shiro is a powerful and easy-to-use Java security framework that performs authentication, authorization, cryptography, and session management. In this documentation, we will explain step by step how Shiro works for Zeppelin notebook authentication.
When you connect to Apache Zeppelin, you will be asked to enter your credentials. Once you have logged in, you have access to all notes, including other users' notes.

1. Secure the HTTP channel

To secure the HTTP channel, you have to change both the anon and authc settings in conf/shiro.ini. Here, anon means "the access is anonymous" and authc means "form-based authentication".
Their default state is:
/** = anon
#/** = authc
Deactivate the line "/** = anon" and activate the line "/** = authc" in conf/shiro.ini file.
#/** = anon
/** = authc
For further information about the shiro.ini file format, refer to the Shiro Configuration documentation.
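
The switch from anon to authc described above can also be scripted, which is handy when the cluster is rebuilt often. A minimal sketch, assuming the two lines appear in shiro.ini exactly in the form shown above (no leading whitespace); enable_authc is a hypothetical helper name.

```shell
#!/bin/sh
# Comment out "/** = anon" and activate "/** = authc" in a shiro.ini file.
# enable_authc is a hypothetical helper; a backup is kept as <file>.bak.
enable_authc() {
    file="$1"
    cp "$file" "$file.bak" || return 1
    # Only the catch-all "/**" rules are touched; other URL rules such as
    # "/api/version = anon" are left as they are.
    sed -e 's|^/\*\* *= *anon *$|#/** = anon|' \
        -e 's|^#/\*\* *= *authc *$|/** = authc|' \
        "$file.bak" > "$file"
}

# Usage:
#   enable_authc conf/shiro.ini
```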

2. Secure the WebSocket channel

Set the property zeppelin.anonymous.allowed to false in conf/zeppelin-site.xml. If you don't have this file yet, just copy conf/zeppelin-site.xml.template to conf/zeppelin-site.xml.
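
As a rough sketch, the copy-and-edit step can be automated as follows. It assumes the <value> element sits on the line directly after the <name> element, as it does in the shipped template; disable_anonymous is a hypothetical helper name, and a real XML tool would be more robust than sed.

```shell
#!/bin/sh
# Set zeppelin.anonymous.allowed to false in <conf_dir>/zeppelin-site.xml.
# disable_anonymous is a hypothetical helper. Assumption: <value> is on the
# line immediately following <name>, as in zeppelin-site.xml.template.
disable_anonymous() {
    conf_dir="$1"
    site="$conf_dir/zeppelin-site.xml"
    # Create the file from the template if it does not exist yet.
    [ -f "$site" ] || cp "$site.template" "$site" || return 1
    cp "$site" "$site.bak" || return 1
    # Find the property name, move to the next line, rewrite its value.
    sed '/<name>zeppelin\.anonymous\.allowed<\/name>/{
n
s|<value>[^<]*</value>|<value>false</value>|
}' "$site.bak" > "$site"
}

# Usage:
#   disable_anonymous conf
```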

3. Start Zeppelin

bin/zeppelin-daemon.sh start (or restart).

4. Groups and permissions for LDAP

If you want to leverage user groups and permissions, use a configuration such as the following for LDAP or AD under the [main] section in shiro.ini.
Example shiro.ini:

[main]
### A sample for configuring Active Directory Realm
activeDirectoryRealm = org.apache.zeppelin.realm.ActiveDirectoryGroupRealm
#activeDirectoryRealm.systemUsername = userNameA
#activeDirectoryRealm = org.apache.zeppelin.server.ActiveDirectoryGroupRealm
activeDirectoryRealm.systemUsername = cn=svc.aws.abc,ou=ServiceAccounts,ou=ABC,Users,dc=ABC,dc=abccompany,dc=com,dc=xy
activeDirectoryRealm.systemPassword = password!
activeDirectoryRealm.searchBase = dc=xyz,dc=abccompany,dc=com,dc=xy
activeDirectoryRealm.url = ldap://yourldapport:389
activeDirectoryRealm.groupRolesMap = "cn=Right-Usr-AP-Science.Zeppelin.Developer-U-GS,ou=Rights,ou=ABCGroups,dc=xyz,dc=abccompany,DC=com,DC=xy":"developers"
activeDirectoryRealm.hadoopSecurityCredentialPath = jceks://user/zeppelin/conf/zeppelin.jceks
[roles]
#role1 = *
#role2 = *
#role3 = *
admin = *
developers = *
Make sure the LDAP and other configurations are correct. For details, see https://zeppelin.apache.org/docs/0.7.1/security/shiroauthentication.html
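
One way to confirm the bind account and search base outside Zeppelin is a quick directory query. A minimal sketch, assuming the OpenLDAP client tools (ldapsearch) are installed and that users are looked up by sAMAccountName, which is common for Active Directory but an assumption here; check_ad_bind is a hypothetical helper name.

```shell
#!/bin/sh
# Bind to the directory with the service account and look up one user.
# check_ad_bind is a hypothetical helper. Arguments:
#   $1 LDAP URL, $2 bind DN (systemUsername), $3 search base, $4 login name
check_ad_bind() {
    url="$1"
    bind_dn="$2"
    base="$3"
    user="$4"
    # -x: simple bind, -W: prompt for the password instead of passing it
    # on the command line, memberOf: show the user's group memberships.
    ldapsearch -x -H "$url" -D "$bind_dn" -W -b "$base" \
        "(sAMAccountName=$user)" memberOf
}

# Usage (values are placeholders):
#   check_ad_bind ldap://yourldapport:389 "cn=svc.aws.abc,..." \
#       "dc=xyz,dc=abccompany,dc=com,dc=xy" some.user
```

If the query returns the user together with the group named in groupRolesMap, the same values should work in shiro.ini.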

Restart Zeppelin.


Externalize notebooks to S3

Zeppelin has two options for notebook storage: by default, notebooks are stored in the notebook folder on the local file system; the second option is S3.

Notebook Storage in S3


You need the following folder structure on S3:
bucket_name/ 
      username/ 
         notebook/
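
This structure can be created from the command line. A minimal sketch, assuming the AWS CLI is installed and the instance role or profile is allowed to write to the bucket; create_notebook_prefix is a hypothetical helper name.

```shell
#!/bin/sh
# Create the "<user>/notebook/" prefix in the given bucket.
# create_notebook_prefix is a hypothetical helper name.
create_notebook_prefix() {
    bucket="$1"
    user="$2"
    # S3 has no real folders; an empty object whose key ends in "/"
    # makes the prefix show up as a folder in the console.
    aws s3api put-object --bucket "$bucket" --key "$user/notebook/"
}

# Usage:
#   create_notebook_prefix bucket_name username
```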

Set the environment variables in the file zeppelin-env.sh:

export ZEPPELIN_NOTEBOOK_S3_BUCKET=bucket_name
export ZEPPELIN_NOTEBOOK_S3_USER=username
For example:
export ZEPPELIN_NOTEBOOK_S3_BUCKET=abc-science-data-dev
export ZEPPELIN_NOTEBOOK_S3_USER=zeppelin
export ZEPPELIN_NOTEBOOK_STORAGE=org.apache.zeppelin.notebook.repo.S3NotebookRepo
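
When these exports are added by a script rather than by hand, it helps to avoid duplicate lines on repeated runs. A minimal sketch of one way to do that; set_env_var is a hypothetical helper name, and the path in the usage comment is the conf directory used elsewhere in this post.

```shell
#!/bin/sh
# Add or replace an "export NAME=value" line in an env file.
# set_env_var is a hypothetical helper name.
set_env_var() {
    file="$1"
    name="$2"
    value="$3"
    touch "$file" || return 1
    # Drop any earlier export of the same variable (grep -v exits non-zero
    # when nothing is left, which is not an error here).
    grep -v "^export ${name}=" "$file" > "$file.tmp" || true
    # Append the new export, then write back in place to keep permissions.
    printf 'export %s=%s\n' "$name" "$value" >> "$file.tmp"
    cat "$file.tmp" > "$file" && rm -f "$file.tmp"
}

# Usage:
#   set_env_var conf/zeppelin-env.sh ZEPPELIN_NOTEBOOK_S3_BUCKET abc-science-data-dev
#   set_env_var conf/zeppelin-env.sh ZEPPELIN_NOTEBOOK_S3_USER zeppelin
```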

In the file zeppelin-site.xml, uncomment and complete the following properties:


<property>
  <name>zeppelin.notebook.s3.user</name>
  <value>username</value>
  <description>user name for s3 folder structure</description>
</property>
<property>
  <name>zeppelin.notebook.s3.bucket</name>
  <value>bucket_name</value>
  <description>bucket name for notebook storage</description>
</property>

Uncomment the following property to use the S3NotebookRepo class:

<property>
  <name>zeppelin.notebook.storage</name>
  <value>org.apache.zeppelin.notebook.repo.S3NotebookRepo</value>
  <description>notebook persistence layer implementation</description>
</property>

Comment out the following property:

<property>
  <name>zeppelin.notebook.storage</name>
  <value>org.apache.zeppelin.notebook.repo.VFSNotebookRepo</value>
  <description>notebook persistence layer implementation</description>
</property>


Restart Zeppelin.