Data processing using Sagemaker connected to EMR

Ram Thiruveedhi
Mar 26, 2022

In my last blog post, I discussed adding data analysis to data processing in EMR. Though that workflow works well for data processing, it has two drawbacks.

  • EMR has only a basic set of data processing libraries. These are sufficient for basic data analysis and plotting, as detailed in my blog post, but for data scientists who work with many libraries (machine learning and others), an EMR cluster may be limiting. Hence they choose to export the data and bring it into Sagemaker. This in turn leads to a disjointed workflow that needs additional monitoring to keep Sagemaker and EMR in sync.
  • (Minor issue) Data scientists who set up their projects in Sagemaker would prefer to have all their code in Sagemaker.

While browsing the Amazon Sagemaker examples, I discovered a better way of using Amazon EMR. The post and its setup are straightforward, but I will share a tl;dr version and some lessons learned.

Connect EMR to Sagemaker

For full instructions, refer to the documentation.
Here is the tl;dr version:

  • The doc is unclear about the setup. It links to two different setups. I tested both, and both work. You do NOT need both. I prefer this simpler setup.
  • Warning: If you follow the instructions in the doc, you may get an error (missing sklearn). You can either install sklearn in Sagemaker or test the connection by running PySpark commands on any other CSV file:
df = spark.read.format("csv").options(header="true").load(data_s3_path)
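
If you want slightly more than a bare read as a connection test, the snippet above can be wrapped in a small helper. This is only a sketch: the function name check_emr_connection is hypothetical, and it assumes the notebook kernel already provides a spark session once the EMR connection is set up. data_s3_path is the same variable as in the line above and should point to any CSV file the cluster can read.

```python
def check_emr_connection(spark, data_s3_path):
    """Read a CSV through the given Spark session and summarize it.

    `spark` is assumed to be the session object the notebook exposes
    after connecting to EMR; `data_s3_path` is the path of any CSV file.
    Returns (column_names, row_count).
    """
    df = (
        spark.read.format("csv")
        .options(header="true")   # first row holds the column names
        .load(data_s3_path)
    )
    # count() triggers a Spark job, so a successful return shows that
    # the cluster could actually read the whole file.
    return df.columns, df.count()
```

In a notebook cell you would call it as columns, n_rows = check_emr_connection(spark, data_s3_path) and look at the two values. If the call returns without an error, the connection works and no sklearn install is needed for the test.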

Minor con of this setup:

  • The setup needs to be redone whenever you terminate the EMR cluster. This may not be an issue for projects that last weeks or months, but for ad-hoc data projects the setup time is spent again on every new cluster. Personally, I do not mind the additional setup. The first time it took me an hour or two, but it got faster after that. The effort was worth the benefit: all my notebooks are now in Sagemaker.
