A Data Engineer's Guide to Amazon EMR

This white paper explains the key architectural components of Amazon Elastic MapReduce (EMR) and provides guides for getting started. Learn the building blocks of distributed workloads on EMR, including ETL with Spark, big data migration, and machine learning.

The authors have contributed best practices and lessons learned from their thousands of hours of combined experience.

About the authors: 

— Sam Portillo, Data Engineer

— Pooja Krishnan, Data Engineer

— Emma York, Data Engineer and Technical Manager

— Rodrigo Moran, Software / Data Engineer 

