AWS Athena, Redshift, EMR

Tola Ore-Aruwaji
3 min read · Aug 1, 2022

What is Athena?

AWS Athena is an interactive query service that makes it easy to analyze data directly in Amazon S3 using standard SQL.

Athena can query structured or unstructured data stored in S3. It is serverless: it reads the data in place from S3 rather than loading it into a cluster first, so there is no compute for you to provision or manage.
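To make the "SQL directly on S3" idea concrete, here is a minimal sketch of how a query could be submitted from Python with boto3. The table `access_logs`, the database `weblogs`, and the results bucket are hypothetical names picked for illustration. The sketch only builds the request; `run_query` needs boto3 and AWS credentials, so it is defined but not called.

```python
def build_athena_request(sql, database, output_location):
    """Build keyword arguments for Athena's StartQueryExecution call."""
    return {
        "QueryString": sql,
        "QueryExecutionContext": {"Database": database},
        # Athena writes the query result files to this S3 location.
        "ResultConfiguration": {"OutputLocation": output_location},
    }


def run_query(request):
    """Submit the query. Requires boto3 and AWS credentials; not called here."""
    import boto3  # imported lazily so the builder above works without boto3

    athena = boto3.client("athena")
    response = athena.start_query_execution(**request)
    return response["QueryExecutionId"]


# Hypothetical names, for illustration only.
request = build_athena_request(
    sql="SELECT status, COUNT(*) AS hits FROM access_logs GROUP BY status",
    database="weblogs",
    output_location="s3://example-athena-results/queries/",
)
print(request["QueryExecutionContext"])
```

Notice that nothing in the request describes servers or clusters: you only say what to query and where the results should go.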

AWS Athena also comes with ODBC and JDBC drivers, so you can connect to it from standard SQL clients and analytics tools.

You can easily integrate AWS Athena with AWS Glue, whose Data Catalog stores the tables and schemas that describe your different data sources. Glue can also run ETL (Extract, Transform, Load) jobs on that data, and Athena can then query the results.

What is Amazon Redshift?

Amazon Redshift is a fully managed, petabyte-scale data warehouse service in the cloud. A Redshift data warehouse is a collection of computing resources called nodes, which are organized into a group called a cluster.

Each cluster runs an Amazon Redshift engine and contains one or more databases.

Importance of Amazon Redshift

1. Amazon Redshift is designed for Online Analytical Processing (OLAP) and business intelligence (BI) applications, letting you run complex queries over large datasets.

2. It integrates with various ETL tools.

3. The architecture is based on PostgreSQL, so most of your existing SQL client applications will work with only minimal changes.
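As a rough illustration of the kind of analytical query point 1 refers to, the sketch below prepares a "monthly revenue per region" aggregation for the Redshift Data API (`execute_statement` on boto3's `redshift-data` client). The cluster identifier, database, user, and the `sales` table with its columns are all assumptions made up for this example. `submit` needs boto3 and AWS credentials, so it is defined but not called.

```python
def build_redshift_statement(cluster_id, database, db_user, sql):
    """Build keyword arguments for the Redshift Data API's ExecuteStatement call."""
    return {
        "ClusterIdentifier": cluster_id,
        "Database": database,
        "DbUser": db_user,
        # Collapse whitespace so the multi-line query is sent as a single line.
        "Sql": " ".join(sql.split()),
    }


def submit(statement):
    """Send the statement. Requires boto3 and AWS credentials; not called here."""
    import boto3  # imported lazily so the builder above works without boto3

    client = boto3.client("redshift-data")
    return client.execute_statement(**statement)["Id"]


# An OLAP-style aggregation over a hypothetical "sales" table.
MONTHLY_REVENUE = """
    SELECT region,
           DATE_TRUNC('month', ordered_at) AS month,
           SUM(amount) AS revenue
    FROM sales
    GROUP BY region, DATE_TRUNC('month', ordered_at)
    ORDER BY month, revenue DESC
"""

# Hypothetical cluster, database, and user names.
statement = build_redshift_statement(
    "example-cluster", "analytics", "report_user", MONTHLY_REVENUE
)
print(statement["Sql"])
```

Because Redshift speaks a PostgreSQL-style SQL dialect, the query itself should look familiar to anyone who has written reporting queries against PostgreSQL.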

What is Amazon EMR?

EMR is a managed cluster platform that lets you analyze and process vast amounts of data by running big data frameworks, such as Apache Hadoop and Apache Spark, on AWS. AWS EMR also lets you transform and move large amounts of data into and out of other AWS data stores and databases, such as S3 and DynamoDB.

You can use Hadoop and Spark, as well as other open-source software such as Apache Hive and Apache Pig, to process data for analytics and business intelligence workloads.

EMR Components

EMR is built on a collection of EC2 instances. The EC2 instances are called nodes in the cluster.

EMR installs different software components on each node type, which defines the node’s role in the distributed architecture of EMR.

There are three types of nodes in an EMR cluster.

  • Master node (called the primary node in current AWS documentation): Manages the cluster, running software components that coordinate the distribution of data and tasks across the other nodes for processing.
  • Core node: Runs software components that execute tasks and store data in the Hadoop Distributed File System (HDFS) on your cluster.
  • Task node: Runs software components that only execute tasks and do not store data in HDFS. Task nodes are optional.
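The three node types above map onto instance groups when a cluster is defined programmatically. The sketch below builds such a list in the shape that boto3's EMR `run_job_flow` call expects under `Instances["InstanceGroups"]`, where the roles are named `MASTER`, `CORE`, and `TASK`. The helper name, the group names, the instance type, and the node counts are assumptions chosen for illustration; no cluster is actually launched.

```python
def build_instance_groups(instance_type, core_count, task_count=0):
    """Describe the EMR node roles as a list of instance groups."""
    groups = [
        # A single master (primary) node coordinates the cluster.
        {
            "Name": "Primary",
            "InstanceRole": "MASTER",
            "InstanceType": instance_type,
            "InstanceCount": 1,
        },
        # Core nodes run tasks and store data in HDFS.
        {
            "Name": "Core",
            "InstanceRole": "CORE",
            "InstanceType": instance_type,
            "InstanceCount": core_count,
        },
    ]
    if task_count > 0:
        # Task nodes only run tasks and hold no HDFS data.
        groups.append(
            {
                "Name": "Task",
                "InstanceRole": "TASK",
                "InstanceType": instance_type,
                "InstanceCount": task_count,
            }
        )
    return groups


# Hypothetical sizing: 1 master, 2 core, and 3 task nodes.
groups = build_instance_groups("m5.xlarge", core_count=2, task_count=3)
for group in groups:
    print(f"{group['InstanceRole']}: {group['InstanceCount']} x {group['InstanceType']}")
```

Since task nodes do not hold HDFS data, the task group is the natural one to grow or shrink when you need more or less processing capacity.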

In the next post, we will be going deeper into these components with a practical session; stay tuned :)
