We are deploying the good old Hadoop on top of Kubernetes on Jetstream. Don’t ask why.
As usual, we start with a full-fledged Kubernetes deployment on Jetstream (1), deployed via Kubespray.
Deploy Hadoop via Helm
Fortunately, there is a Helm chart that deploys all the Hadoop components. It has been deprecated since November 2020, but it still works fine on Kubernetes 1.19.7.
Clone the usual repository with
gh repo clone zonca/jupyterhub-deploy-kubernetes-jetstream
cd jupyterhub-deploy-kubernetes-jetstream/hadoop/
Verify the configuration in stable_hadoop_values.yaml. I'm currently keeping it simple, so there is no persistence.
Install Hadoop via Helm:
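As a rough sketch of what the install step could look like, assuming the chart is the deprecated `stable/hadoop` chart served from the archived Helm stable repository: the release name `hadoop` is inferred from the `hadoop-hadoop-*` pod names shown in this post, and the repository URL and chart name are assumptions, so check them against the script in the cloned repository.

```shell
# Assumption: release name "hadoop", inferred from the hadoop-hadoop-* pod names
RELEASE=hadoop
# Values file mentioned above, expected in the current directory
VALUES=stable_hadoop_values.yaml

# Assumption: the deprecated chart is still served by the archived "stable" repository
helm repo add stable https://charts.helm.sh/stable
helm repo update

# Install the chart with the custom values
helm install "$RELEASE" stable/hadoop --values "$VALUES"
```

You can follow the rollout with `kubectl get pods --watch` until every pod reports `Running`.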
Once the pods are running, you should see:
> kubectl get pods
NAME                      READY   STATUS    RESTARTS   AGE
hadoop-hadoop-hdfs-dn-0   1/1     Running   0          144m
hadoop-hadoop-hdfs-nn-0   1/1     Running   0          144m
hadoop-hadoop-yarn-nm-0   1/1     Running   0          144m
hadoop-hadoop-yarn-rm-0   1/1     Running   0          144m
Launch a test job
Get a terminal on the YARN node manager:
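One way to do this is with `kubectl exec`. This is a minimal sketch: the pod name comes from the `kubectl get pods` listing in this post, while the availability of `/bin/bash` inside the container is an assumption.

```shell
# YARN node manager pod, as listed by `kubectl get pods`
NM_POD=hadoop-hadoop-yarn-nm-0

# Open an interactive shell in the pod (assumes bash exists in the image)
kubectl exec -it "$NM_POD" -- /bin/bash
```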
You now have access to the Hadoop 2.9.0 cluster. Launch a test MapReduce job to compute pi:
bin/yarn jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.9.0.jar pi 16 1000
Access the YARN Dashboard
You can also export the YARN dashboard from the cluster to your local machine.
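A port-forward is probably the simplest way to do this. The sketch below assumes the dashboard is served by the resource manager pod `hadoop-hadoop-yarn-rm-0` (taken from the pod listing in this post) on port 8088.

```shell
# YARN resource manager pod, as listed by `kubectl get pods`
RM_POD=hadoop-hadoop-yarn-rm-0
# Port of the YARN web UI, used for both the local and the remote side
PORT=8088

# Forward the pod port to the same port on localhost; stop with Ctrl-C
kubectl port-forward "$RM_POD" "$PORT:$PORT"
```

By default `kubectl port-forward` listens only on localhost, so the dashboard is reachable from your machine but not from the network.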
Connect locally to port 8088 to check the status of the jobs.
Make sure this port is never exposed publicly. I learned the hard way that botnets scan the internet and compromise exposed YARN services for crypto-mining; see this article for details.