Apache Hadoop Course

Overview

This four-day data analyst course is for anyone who wants to access, manipulate, transform, and analyse massive data sets in a Hadoop cluster using SQL and familiar scripting languages. It is the core curriculum in the data analyst learning path.

Cloudera's Data Analyst Training course focuses on Apache Hive and Apache Impala. You will learn how to apply traditional data analytics and business intelligence skills to big data, using the tools data professionals need to access, manipulate, transform, and analyse complex data sets with SQL and familiar scripting languages.

Apache Hive makes transformation and analysis of complex, multi-structured data scalable in Cloudera environments. Apache Impala enables real-time interactive analysis of the data stored in Hadoop using a native SQL environment. Together, they make multi-structured data accessible to analysts, database administrators, and others without Java programming expertise.
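
To give a sense of the material, a query of the kind written throughout the course might look like the following sketch, which runs in both Hive and Impala; the table and column names are illustrative only and are not taken from the course materials:

    -- Total order value per region, over data stored in the cluster
    -- (the "orders" table and its columns are hypothetical)
    SELECT region,
           SUM(order_total) AS total_sales
    FROM orders
    GROUP BY region
    ORDER BY total_sales DESC;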

Audience

This course is designed for:

  • Data analysts
  • Business intelligence specialists
  • Developers
  • System architects
  • Database administrators

Skills Gained

Through instructor-led discussion and interactive, hands-on exercises, participants will navigate the Hadoop ecosystem, learning skills such as:

  • How the open-source ecosystem of big data tools addresses challenges not met by traditional RDBMSs
  • Using Apache Hive and Apache Impala to provide SQL access to data
  • Hive and Impala syntax and data formats, including functions and subqueries
  • Creating, modifying, and deleting tables, views, and databases; loading data; and storing the results of queries
  • Creating and using partitions and different file formats
  • Combining two or more datasets using JOIN or UNION, as appropriate
  • What analytic and windowing functions are, and how to use them (a brief sketch follows this list)
  • Storing and querying complex or nested data structures
  • Processing and analysing semi-structured and unstructured data
  • Techniques for optimising Hive and Impala queries
  • Extending the capabilities of Hive and Impala using parameters, custom file formats and SerDes, and external scripts
  • How to determine whether Hive, Impala, an RDBMS, or a mix of these is best for a given task
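
As a brief sketch of the analytic and windowing functions mentioned above, the following query ranks salaries within each department; the "employees" table and its columns are hypothetical and used here only for illustration:

    -- Rank each employee's salary within their department
    -- (the "employees" table is hypothetical)
    SELECT department,
           name,
           salary,
           RANK() OVER (PARTITION BY department ORDER BY salary DESC) AS salary_rank
    FROM employees;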

Prerequisites

Some knowledge of SQL is assumed, as is basic Linux command-line familiarity. Prior knowledge of Apache Hadoop is not required.

The supply of this course by DDLS is governed by the booking terms and conditions. Please read the terms and conditions carefully before enrolling in this course, as enrolment in the course is conditional on acceptance of these terms and conditions.

Outline

  • Apache Hadoop Fundamentals
  • Introduction to Apache Hive and Impala
  • Querying with Apache Hive and Impala
  • Common Operators and Built-In Functions
  • Data Management
  • Data Storage and Performance
  • Working with Multiple Datasets
  • Analytic Functions and Windowing
  • Complex Data
  • Analysing Text
  • Apache Hive Optimisation
  • Apache Impala Optimisation
  • Extending Apache Hive and Impala
  • Choosing the Best Tool for the Job

Introduction

Apache Hadoop Fundamentals

  • The Motivation for Hadoop
  • Hadoop Overview
  • Data Storage: HDFS
  • Distributed Data Processing: YARN, MapReduce, and Spark
  • Data Processing and Analysis: Hive and Impala
  • Database Integration: Sqoop
  • Other Hadoop Data Tools
  • Exercise Scenario Explanation

Introduction to Apache Hive and Impala

  • What is Hive?
  • What is Impala?
  • Why Use Hive and Impala?
  • Schema and Data Storage
  • Comparing Hive and Impala to Traditional Databases
  • Use Cases

Querying with Apache Hive and Impala

  • Databases and Tables
  • Basic Hive and Impala Query Language Syntax
  • Data Types
  • Using Hue to Execute Queries
  • Using Beeline (Hive's Shell)
  • Using the Impala Shell

Common Operators and Built-In Functions

  • Operators
  • Scalar Functions
  • Aggregate Functions

Data Management

  • Data Storage
  • Creating Databases and Tables
  • Loading Data
  • Altering Databases and Tables
  • Simplifying Queries with Views
  • Storing Query Results

Data Storage and Performance

  • Partitioning Tables
  • Loading Data into Partitioned Tables
  • When to Use Partitioning
  • Choosing a File Format
  • Using Avro and Parquet File Formats

Working with Multiple Datasets

  • UNION and Joins
  • Handling NULL Values in Joins
  • Advanced Joins

Analytic Functions and Windowing

  • Using Common Analytic Functions
  • Other Analytic Functions
  • Sliding Windows

Complex Data

  • Complex Data with Hive
  • Complex Data with Impala

Analysing Text

  • Using Regular Expressions with Hive and Impala
  • Processing Text Data with SerDes in Hive
  • Sentiment Analysis and n-grams

Apache Hive Optimisation

  • Understanding Query Performance
  • Bucketing
  • Hive on Spark

Apache Impala Optimisation

  • How Impala Executes Queries
  • Improving Impala Performance

Extending Apache Hive and Impala

  • Custom SerDes and File Formats in Hive
  • Data Transformation with Custom Scripts in Hive
  • User-Defined Functions
  • Parameterised Queries

Choosing the Best Tool for the Job

  • Comparing Hive, Impala, and Relational Databases
  • Which to Choose?

Conclusion

Thinking about Onsite?

If you need training for 3 or more people, ask us about onsite training. Putting aside the obvious location benefit, the content can be customised to better meet your business objectives, and more can be covered than in a public classroom. It's a cost-effective option. One-on-one training can also be delivered at reasonable rates.

Submit an enquiry from any page on this site and note your interest in onsite training in the requirements box, or simply mention it when we contact you.

All $ prices are in USD unless the course date is in NZ or AU.

SPVC = Self Paced Virtual Class

LVC = Live Virtual Class

Please Note: All courses are available as Live Virtual Classes.

Trusted by over half a million students in 15 countries

Our clients have included prestigious national organisations such as Oxford University Press, multi-national private corporations such as JP Morgan and HSBC, as well as public sector institutions such as the Department of Defence and the Department of Health.