Data Processing and Data Analysis — Improved workflow using AWS EMR

Ram Thiruveedhi
5 min read · Nov 13, 2021

In this document, I will demonstrate how to do data analysis and plotting using a simple workflow with EMR notebooks. This workflow eliminates the complexity of a disjointed workflow with separate data processing (in EMR) and exploratory data analysis (in a Python notebook). It is based on my experience using the methods from an AWS blog entry.

Motivation

Data scientists dive deep into data by aggregating it and plotting the results. On a small dataset this can be done on a laptop using many tools (R, Python, etc.). But as the size of the data grows, this becomes impractical: the data needs large amounts of disk space, and processing it needs large amounts of RAM and CPU.

Big data tools like Apache Spark can process large amounts of data, but the aggregated results have to be transported to and plotted in another environment. I followed this workflow for a while, and it required several steps and keeping the data in sync between the processing step (in AWS EMR) and the analysis step (a Jupyter notebook on my laptop).

Old way — notebooks in Amazon EMR produce csv

Amazon EMR is a big data platform for processing and analyzing large amounts of data using Apache Spark, Apache Hadoop, and other open source projects (Hive, Presto, etc.). Amazon EMR has introduced notebooks with both Spark and PySpark kernels to interact with Spark sessions and process the data.

For many weeks, I processed my data using notebooks (PySpark) and wrote my Spark dataframes with aggregated results to csv files on S3. I still use this method when I want a file containing the results. But for producing quick analysis and charts, this method is inefficient and forces the user into a disjointed workflow.

spark_df.write.option("header", "true").csv("s3://mybucket/out")
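
The second half of this old workflow then happened on my laptop: pull the csv output down from S3 and load it into pandas. A minimal sketch of that local step; the bucket, path, and part file name are placeholders, and reading s3:// paths with pandas assumes the s3fs package is installed:

import pandas as pd

# Spark writes one csv file per partition under the output prefix;
# the part file name below is a placeholder.
df = pd.read_csv("s3://mybucket/out/part-00000.csv")
df.head()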

New way — set up EMR and use sparkmagic

I started doing data analysis inside the PySpark notebook using the methods from the AWS blog entry. In the next two sections I will describe my experience with the two methods proposed in the blog.

I was able to set up an EMR cluster using the defaults. The only decision I made was the choice of applications: I chose EMR release 6.4.0 with JupyterEnterpriseGateway 2.1.0, Spark 3.1.2, Presto 0.254.1, Livy 0.7.1, and JupyterHub 1.4.1. I used the web interface and it took only a few clicks.
I recommend using a minimum of m5.4xlarge.
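
If you prefer scripting to clicking, a comparable cluster can also be created with boto3. This is only a sketch of my defaults; the region, cluster name, and IAM role names are assumptions you should adapt to your account:

import boto3

emr = boto3.client("emr", region_name="us-east-1")  # assumed region

# Sketch: roughly the same choices I made in the web interface.
response = emr.run_job_flow(
    Name="analysis-cluster",  # hypothetical name
    ReleaseLabel="emr-6.4.0",
    Applications=[
        {"Name": "Spark"},
        {"Name": "Livy"},
        {"Name": "JupyterEnterpriseGateway"},
        {"Name": "JupyterHub"},
        {"Name": "Presto"},
    ],
    Instances={
        "MasterInstanceType": "m5.4xlarge",
        "SlaveInstanceType": "m5.4xlarge",
        "InstanceCount": 3,
        "KeepJobFlowAliveWhenNoSteps": True,
    },
    JobFlowRole="EMR_EC2_DefaultRole",  # default EMR roles, adjust as needed
    ServiceRole="EMR_DefaultRole",
)
print(response["JobFlowId"])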

Best Option — Use libraries provided by EMR

EMR already has pandas and plotly installed in the notebook's local Jupyter environment (EMR 6.4.0), but I needed to stream the data from the Spark kernel to the local kernel. For that I needed sparkmagic, and I was pleasantly surprised to find that EMR has it already enabled.
The AWS blog entry has documented the steps, but I would recommend also reading the sparkmagic documentation (and using %%help to get quick help in the notebook).
Note: the blog said they could find matplotlib in the local libraries, but I only found plotly in EMR 6.3 and EMR 6.4. You can check the package list for your EMR release using conda list (it may depend on your EMR version).

%%local
conda list
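
To confirm that the two libraries I rely on are present in the local kernel, a quick version check also works:

%%local
import pandas as pd
import plotly
print(pd.__version__, plotly.__version__)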

Transfer data to a local pandas dataframe

The only limitation here is to make sure the dataframe you are bringing to pandas is not very large (<100MB). There are two options: a direct transfer and sql magic. I prefer the second (sql magic) for clarity.

If you already have a Spark dataframe ready, you can simply transport it to a local dataframe. Both dataframes will coexist, one in Spark and one in the local kernel.
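
For either option to have something to pull, a Spark dataframe has to exist in the session first. Here is a minimal sketch with made-up data; the dataframe name all_students matches the direct-transfer magic below, and the columns s_name and st_name match the sql query:

# Runs on the cluster in the PySpark kernel; toy data for illustration only.
from pyspark.sql import Row

all_students = spark.createDataFrame([
    Row(s_name="Lincoln High", st_name="Ana"),
    Row(s_name="Lincoln High", st_name="Ben"),
    Row(s_name="Valley Prep", st_name="Cho"),
])

# Register a temp view so %%sql can query it as STUDENTS.
all_students.createOrReplaceTempView("STUDENTS")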

%%spark -o all_students -n -1

I prefer using sql magic: I can quickly write a query and also choose a new name for the local pandas dataframe.

%%sql -o school_num_students_local -n -1 -q
SELECT s_name AS school, COUNT(st_name) AS num_students
FROM STUDENTS
GROUP BY s_name
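
After this cell runs, school_num_students_local is a regular pandas dataframe in the local kernel, which you can sanity-check right away:

%%local
print(type(school_num_students_local))
school_num_students_local.head()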

Write Python Code and Functions

Treat %%local as your local Python kernel and write code (think functions) that can be used later in the notebook.

%%local
import datetime

# Small helper reused in later cells to set a chart title.
def update_fig(fig, title_text):
    fig.update_layout(title=title_text)

Use Pandas and Plotly to do more analysis and plot

%%local
import pandas as pd
import plotly.express as px
fig = px.bar(school_num_students_local, x="school", y="num_students")
update_fig(fig, 'Students by School')
fig.show()

Pro Tips:

  • If you are getting errors during Spark startup, restart your notebook kernel
  • Choose a bigger instance type if you get memory-related errors
  • Choose an auto-terminating EMR cluster if you would like it to shut down at the end of the day (see the sketch after this list)
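
Auto-termination can also be attached to a running cluster via boto3. A hedged sketch; the cluster id is a placeholder and the idle timeout is in seconds (this API needs a reasonably recent EMR release):

import boto3

emr = boto3.client("emr")

# Terminate the cluster after one hour of inactivity.
emr.put_auto_termination_policy(
    ClusterId="j-XXXXXXXXXXXXX",  # placeholder cluster id
    AutoTerminationPolicy={"IdleTimeout": 3600},
)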

Conclusion

This article covered an easier way of doing data analysis in an EMR notebook right after data processing. It eliminates the need for multiple steps with separate data processing and data analysis.

Another option — installing Python packages onto the cluster:

This may be a better option if you want to work with libraries that are not available in the local environment (see the previous section). For me, pandas and plotly are all I need, so I do not use this option, but my experience is captured below and I can confirm it works.

I tried to install pandas and other libraries onto the cluster as described in the AWS blog entry:

sc.list_packages()
sc.install_pypi_package("pandas==0.25.1")
sc.install_pypi_package("matplotlib", "https://pypi.org/simple") #Install matplotlib from given PyPI repository

But I ran into an issue with package versions; there may be other possible issues with finding repositories and getting past the security setup put in place for your EMR cluster by your organization.

Failed building wheel for pillow
Command "/tmp/1636747058971-0/bin/python -u -c "import setuptools, tokenize;__file__='/mnt/tmp/pip-build-d664m4y8/pillow/setup.py';f=getattr(tokenize, 'open', open)(__file__);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, __file__, 'exec'))" install --record /tmp/pip-3vwbtvlr-record/install-record.txt --single-version-externally-managed --compile --install-headers /tmp/1636747058971-0/include/site/python3.7/pillow" failed with error code 1 in /mnt/tmp/pip-build-d664m4y8/pillow/
You are using pip version 9.0.1, however version 21.3.1 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.

I found a workaround, pinning the package version, and got everything working.

sc.install_pypi_package("matplotlib==3.1.1", "https://pypi.org/simple") #Install matplotlib from given PyPI repository

I got everything to work. Best practice is to uninstall the packages at the end of the notebook, but if you own the cluster, you may keep them installed for other notebooks.

sc.uninstall_package('pandas')
sc.uninstall_package('matplotlib')
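
You can confirm the cleanup with the same listing call used earlier:

sc.list_packages()  # pandas and matplotlib should no longer appear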
