We are deploying the good old Hadoop on top of Kubernetes on Jetstream. Don’t ask why.

As usual, we start with a full-fledged Kubernetes deployment on Jetstream (1), deployed via Kubespray.

Deploy Hadoop via Helm

Fortunately, we have a Helm chart that deploys all the Hadoop components. It has been deprecated since November 2020, but it still works fine on Kubernetes 1.19.7.

Clone the usual repository with gh:

gh repo clone zonca/jupyterhub-deploy-kubernetes-jetstream
cd jupyterhub-deploy-kubernetes-jetstream/hadoop/

Verify the configuration in stable_hadoop_values.yaml. I’m currently keeping it simple, so there is no persistence.

Install Hadoop via Helm:

bash install_hadoop.sh
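
If you are curious what an install like this boils down to, here is a minimal sketch. It is an assumption, not the actual contents of install_hadoop.sh: it guesses that the chart is the archived stable/hadoop chart, and it infers the release name hadoop from the hadoop-hadoop-* pod names. The function name install_hadoop_sketch is made up for illustration.

```shell
# Hypothetical sketch of a Helm 3 install; the real install_hadoop.sh may differ.
install_hadoop_sketch() {
    # Assumption: the chart comes from the archived "stable" repository.
    helm repo add stable https://charts.helm.sh/stable &&
    helm repo update &&
    # Assumption: release name "hadoop", values taken from the file in this folder.
    helm install hadoop stable/hadoop -f stable_hadoop_values.yaml
}
```

Calling install_hadoop_sketch from the hadoop/ folder would run the three helm commands in order and stop at the first failure.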

Once the pods are running, you should see:

> kubectl get pods
NAME                      READY   STATUS    RESTARTS   AGE
hadoop-hadoop-hdfs-dn-0   1/1     Running   0          144m
hadoop-hadoop-hdfs-nn-0   1/1     Running   0          144m
hadoop-hadoop-yarn-nm-0   1/1     Running   0          144m
hadoop-hadoop-yarn-rm-0   1/1     Running   0          144m
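
If you would rather script this check than eyeball it, a small filter over the kubectl output is enough. This is just a sketch: all_pods_running is a hypothetical helper name, and it assumes the default column layout shown above, with STATUS in the third column.

```shell
# Hypothetical helper: reads `kubectl get pods` output on stdin and succeeds
# only if there is at least one pod and every pod has STATUS "Running".
all_pods_running() {
    awk 'NR > 1 { total++; if ($3 == "Running") ok++ }   # NR > 1 skips the header row
         END { exit !(total > 0 && ok == total) }'
}
```

For example, `kubectl get pods | all_pods_running && echo "Hadoop is up"` prints the message only once all four pods are running.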

Launch a test job

Get a terminal on the YARN node manager:

bash login_yarn.sh

You now have access to the Hadoop 2.9.0 cluster. Launch a test MapReduce job to compute pi:

bin/yarn jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.9.0.jar pi 16 1000
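
The two arguments to pi are the number of map tasks (16) and the number of sample points per map (1000). The example estimates pi with a quasi-Monte Carlo method: it scatters points from a Halton sequence over the unit square and counts how many fall inside the inscribed circle. Below is a rough single-process sketch of the same idea, only to show what those 16 x 1000 points are for. estimate_pi is a hypothetical helper, not part of Hadoop, and it does not reproduce Hadoop's code or its exact result.

```shell
# Hypothetical helper: estimate pi from (maps * samples) Halton points.
# Usage: estimate_pi <maps> <samples_per_map>
estimate_pi() {
    awk -v maps="$1" -v samples="$2" '
    # i-th element of the Halton sequence in the given base, a value in [0, 1)
    function halton(i, base,    f, r) {
        f = 1; r = 0
        while (i > 0) { f /= base; r += f * (i % base); i = int(i / base) }
        return r
    }
    BEGIN {
        total = maps * samples
        for (i = 1; i <= total; i++) {
            # Shift the point so the circle is centered at the origin, radius 0.5
            x = halton(i, 2) - 0.5
            y = halton(i, 3) - 0.5
            if (x * x + y * y <= 0.25) inside++
        }
        # Area ratio circle/square is pi/4, so scale the hit fraction by 4
        printf "%.4f\n", 4 * inside / total
    }'
}
```

Running `estimate_pi 16 1000` locally gives a feel for the accuracy you can expect from the job; on the cluster the work is split across the 16 maps instead of a single loop.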

Access the YARN Dashboard

You can also expose the YARN dashboard from the cluster on your local machine:

bash expose_yarn.sh

Connect locally to port 8088 to check the status of the jobs.
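
One common way to get this kind of local-only access is kubectl port-forward, which listens on localhost by default. The sketch below assumes the resource manager pod is named as in the kubectl get pods listing and serves its web UI on port 8088. It is not necessarily what expose_yarn.sh does, and forward_yarn_dashboard is a made-up name.

```shell
# Hypothetical sketch: forward a local port to the YARN resource manager UI.
# Usage: forward_yarn_dashboard [local_port]   (local_port defaults to 8088)
forward_yarn_dashboard() {
    # Assumption: pod name taken from the `kubectl get pods` output.
    # port-forward binds to localhost unless --address says otherwise.
    kubectl port-forward hadoop-hadoop-yarn-rm-0 "${1:-8088}:8088"
}
```

While the function is running, the dashboard is reachable at http://localhost:8088 and only from your own machine; stop it with Ctrl-C.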

Make sure this port is never exposed publicly. I learned the hard way that there are botnets scanning the internet and compromising exposed YARN services for crypto-mining; see this article for details.