Installing and using Hadoop and PySpark on Ubuntu with VirtualBox and VMware
Download the Ubuntu 22.04.2 desktop image (ubuntu-22.04.2-desktop-amd64.iso) and use it to create the virtual machine in VirtualBox or VMware.
Switch to the root account and add the ubuntu user to the sudo group:
su
usermod -aG sudo ubuntu
sudo apt install default-jdk
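To confirm the JDK is available before configuring Hadoop, the installed version can be checked (a quick sanity check, not part of the original steps):
java -version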
wget https://dlcdn.apache.org/hadoop/common/hadoop-3.3.6/hadoop-3.3.6.tar.gz
tar -xvzf hadoop-3.3.6.tar.gz
sudo mv hadoop-3.3.6 /usr/local/hadoop
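Hadoop expects JAVA_HOME and HADOOP_HOME to be set. A minimal sketch of the lines to append to ~/.bashrc, assuming the default-java symlink created by the default-jdk package and the /usr/local/hadoop location used above:
# Append to ~/.bashrc
export JAVA_HOME=/usr/lib/jvm/default-java    # symlink provided by the default-jdk package
export HADOOP_HOME=/usr/local/hadoop
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
# Reload the shell configuration and confirm Hadoop is on the PATH
source ~/.bashrc
hadoop version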
wget https://repo.anaconda.com/archive/Anaconda3-2023.07-1-Linux-x86_64.sh
bash Anaconda3-2023.07-1-Linux-x86_64.sh
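Anaconda does not include PySpark by default. One way to add it, assuming the Anaconda base environment is active after the installer finishes:
# Install PySpark into the active Anaconda environment
conda install -c conda-forge pyspark
# or, equivalently:
pip install pyspark
The pip/conda package bundles Spark itself, so no separate Spark download is needed for local use.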
In the following screenshots, PySpark is used with Hadoop and the prediction results are successfully written to the local file system:
predictions.select('prediction', 'label').write \
    .format('csv') \
    .option('header', 'true') \
    .save('predictions.csv')
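For context, a minimal sketch of how a predictions DataFrame like the one written above could be produced; the column names, toy data, and LogisticRegression model here are illustrative assumptions, not the original pipeline:
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName('example').getOrCreate()

# Hypothetical toy data: two feature columns and a binary label
df = spark.createDataFrame(
    [(0.0, 1.1, 0.0), (2.0, 1.0, 1.0), (2.1, 1.3, 1.0), (0.3, 0.9, 0.0)],
    ['f1', 'f2', 'label'])

# Assemble the feature columns into the single 'features' vector Spark ML expects
assembled = VectorAssembler(inputCols=['f1', 'f2'], outputCol='features').transform(df)

# Fit a simple classifier and score the data, which adds the 'prediction' column
model = LogisticRegression(featuresCol='features', labelCol='label').fit(assembled)
predictions = model.transform(assembled)
Note that Spark saves 'predictions.csv' as a directory of part files in the working directory rather than a single CSV file.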
Connecting to GitHub (here via the GitKraken client):
sudo dpkg -i gitkraken-amd64.deb
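This assumes gitkraken-amd64.deb has already been downloaded from the GitKraken website. GitKraken still needs a Git identity and credentials to reach GitHub; a minimal sketch, with the name and e-mail below as placeholders and SSH as the chosen authentication method:
# Set the Git identity used for commits
git config --global user.name "Your Name"
git config --global user.email "you@example.com"
# Generate an SSH key and add the public key (~/.ssh/id_ed25519.pub) to the GitHub account
ssh-keygen -t ed25519 -C "you@example.com"
# Confirm the connection once the key has been added on github.com
ssh -T git@github.com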
Jupyter Lab:
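JupyterLab ships with Anaconda, so no separate installation is needed; it can be launched from the Anaconda environment:
# Start JupyterLab; it opens in the default browser
jupyter lab
Inside a notebook, a SparkSession can then be created with SparkSession.builder.getOrCreate(), as in the PySpark sketch above.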