Introduction to Spark 3 with Scala Training in Boston

Enroll in or hire us to teach our Introduction to Spark 3 with Scala class in Boston, Massachusetts by calling us at 303.377.6176. Like all HSG classes, Introduction to Spark 3 with Scala may be offered either onsite or via instructor-led virtual training. Consider looking at our public training schedule to see if it is scheduled: Public Training Classes
Provided there are enough attendees, Introduction to Spark 3 with Scala may be taught at one of our local training facilities.
We offer private customized training for groups of 3 or more attendees.

Course Description

 

This course introduces the Apache Spark distributed computing engine, and is suitable for developers, data analysts, architects, technical managers, and anyone who needs to use Spark in a hands-on manner. It is based on the Spark 3.x release.

The course provides a solid technical introduction to the Spark architecture and how Spark works. It covers the basic building blocks of Spark (e.g. RDDs and the distributed compute engine), as well as higher-level constructs that provide a simpler and more capable interface (e.g. DataSets/DataFrames and Spark SQL). It includes in-depth coverage of Spark SQL, DataFrames, and DataSets, which are now the preferred programming API. This includes exploring possible performance issues and strategies for optimization.

The course also covers more advanced capabilities such as the use of Spark Streaming to process streaming data, and integrating with the Kafka server.

The course is very hands-on, with many labs. Participants will interact with Spark through the Spark shell (for interactive, ad-hoc processing) as well as through programs using the Spark API. After taking this course, you will be ready to work with Spark in an informed and productive manner.

Labs are supported in Scala. There is a separate course for Python users.

Course Length: 4 Days
Course Tuition: $1890 (US)

Prerequisites

Working knowledge of some programming language - no Java experience needed

Course Outline

 
Session 1 (Optional): Scala Ramp Up
Scala Introduction, Variables, Data Types, Control Flow
The Scala Interpreter
Collections and their Standard Methods (e.g. map())
Functions, Methods, Function Literals
Class, Object, Trait, case Class
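
As a rough taste of the ramp-up topics, the sketch below combines a trait, a case class, a function literal, and the standard collection methods filter() and map(). The names Describable, Session, and RampUp are hypothetical and exist only for this illustration; they are not taken from the course materials.

```scala
// Hypothetical domain model, used only for illustration.
trait Describable {
  def describe: String
}

// A case class gets equals, hashCode, toString, and copy generated for it.
case class Session(number: Int, title: String, optional: Boolean = false) extends Describable {
  def describe: String = s"Session $number: $title"
}

object RampUp {
  val sessions: List[Session] = List(
    Session(1, "Scala Ramp Up", optional = true),
    Session(2, "Introduction to Spark"),
    Session(3, "RDDs and Spark Architecture")
  )

  // A function literal stored in a val.
  val isRequired: Session => Boolean = s => !s.optional

  // A method that chains standard collection methods.
  def requiredTitles(all: List[Session]): List[String] =
    all.filter(isRequired).map(_.title)

  def main(args: Array[String]): Unit = {
    sessions.map(_.describe).foreach(println)
    println(requiredTitles(sessions).mkString(", "))
  }
}
```

The same filter() and map() vocabulary reappears later on RDDs and DataSets, which is one reason the ramp-up session starts with collections.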
 
Session 2: Introduction to Spark
Overview, Motivations, Spark Systems
Spark Ecosystem
Spark vs. Hadoop
Acquiring and Installing Spark
The Spark Shell, SparkContext
 
Session 3: RDDs and Spark Architecture
RDD Concepts, Lifecycle, Lazy Evaluation
RDD Partitioning and Transformations
Working with RDDs - Creating and Transforming (map, filter, etc.)
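
The RDD lifecycle listed above can be sketched as follows, assuming Spark 3.x is on the classpath and a local in-process master is acceptable for experimentation. The object name RddSketch and the method squaresOfEvens are hypothetical. The point is that filter() and map() only describe work, while collect() is the action that triggers it.

```scala
import org.apache.spark.sql.SparkSession

object RddSketch {
  // Returns the squares of the even numbers in 1..upTo, computed through an RDD.
  def squaresOfEvens(spark: SparkSession, upTo: Int): Array[Int] = {
    val sc = spark.sparkContext

    // Create an RDD split into 4 partitions (an arbitrary choice for this sketch).
    val numbers = sc.parallelize(1 to upTo, numSlices = 4)

    // Transformations are lazy: nothing has been computed at this point.
    val transformed = numbers.filter(n => n % 2 == 0).map(n => n * n)

    // An action triggers execution and returns the results to the driver.
    transformed.collect()
  }

  def main(args: Array[String]): Unit = {
    // local[*] runs Spark inside this JVM; suitable for experiments only.
    val spark = SparkSession.builder().appName("rdd-sketch").master("local[*]").getOrCreate()
    try println(squaresOfEvens(spark, 10).mkString(", "))
    finally spark.stop()
  }
}
```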
 
Session 4: Spark SQL, DataFrames, and DataSets
Overview
SparkSession, Loading/Saving Data, Data Formats (JSON, CSV, Parquet, text ...)
Introducing DataFrames and DataSets (Creation and Schema Inference)
Supported Data Formats (JSON, Text, CSV, Parquet)
Working with the DataFrame (untyped) Query DSL (Column, Filtering, Grouping, Aggregation)
SQL-based Queries
Working with the DataSet (typed) API
Mapping and Splitting (flatMap(), explode(), and split())
DataSets vs. DataFrames vs. RDDs
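
To make the typed versus untyped distinction concrete, here is a minimal sketch assuming Spark 3.x with a local master. The Employee case class, the sample rows, and the object name DataFrameSketch are invented for illustration. It shows a typed DataSet filter, the untyped DataFrame DSL, and the equivalent SQL query.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}
import org.apache.spark.sql.functions.{avg, col}

// Hypothetical record type; defined at top level so Spark can derive an encoder.
case class Employee(name: String, dept: String, salary: Double)

object DataFrameSketch {
  // Builds a small in-memory DataSet; the rows are made up.
  def sample(spark: SparkSession): Dataset[Employee] = {
    import spark.implicits._ // brings toDS() and case-class encoders into scope
    Seq(
      Employee("Ana", "eng", 120.0),
      Employee("Bo", "eng", 100.0),
      Employee("Cy", "ops", 90.0)
    ).toDS()
  }

  // Typed API: the lambda sees Employee objects and is checked at compile time.
  def wellPaidNames(employees: Dataset[Employee]): Array[String] =
    employees.filter(e => e.salary > 95.0).collect().map(_.name)

  // Untyped DataFrame DSL: columns are referenced by name and resolved at run time.
  def avgSalaryByDept(employees: Dataset[Employee]): Map[String, Double] =
    employees.toDF()
      .groupBy(col("dept"))
      .agg(avg(col("salary")).as("avg_salary"))
      .collect()
      .map(row => row.getString(0) -> row.getDouble(1))
      .toMap

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("df-sketch").master("local[*]").getOrCreate()
    try {
      val employees = sample(spark)
      println(wellPaidNames(employees).mkString(", "))
      println(avgSalaryByDept(employees))

      // The same aggregation expressed as SQL against a temporary view.
      employees.createOrReplaceTempView("employees")
      spark.sql("SELECT dept, AVG(salary) AS avg_salary FROM employees GROUP BY dept").show()
    } finally spark.stop()
  }
}
```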
 
Session 5: Shuffling Transformations and Performance
Grouping, Reducing, Joining
Shuffling, Narrow vs. Wide Dependencies, and Performance Implications
Exploring the Catalyst Query Optimizer (explain(), Query Plans, Issues with lambdas)
The Tungsten Optimizer (Binary Format, Cache Awareness, Whole-Stage Code Gen)
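
One way to see the narrow versus wide distinction is to compare query plans with explain(). The sketch below assumes Spark 3.x with a local master; the sales data and the name ShuffleSketch are hypothetical. A filter can run within each partition, while a groupBy typically introduces an Exchange (shuffle) step in the physical plan. The exact plan text varies by Spark version and settings.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, sum}

object ShuffleSketch {
  // Made-up sample data.
  def sales(spark: SparkSession): DataFrame = {
    import spark.implicits._
    Seq(("east", 10), ("west", 20), ("east", 5)).toDF("region", "amount")
  }

  // Narrow: each output partition depends on a single input partition.
  def largeSales(df: DataFrame): DataFrame =
    df.filter(col("amount") > 7)

  // Wide: rows with the same key must be brought together, which requires a shuffle.
  def totalsByRegion(df: DataFrame): DataFrame =
    df.groupBy(col("region")).agg(sum(col("amount")).as("total"))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("shuffle-sketch").master("local[*]").getOrCreate()
    try {
      val df = sales(spark)
      largeSales(df).explain()     // plan without a shuffle
      totalsByRegion(df).explain() // plan that typically includes an Exchange
      totalsByRegion(df).show()
    } finally spark.stop()
  }
}
```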
 
Session 6: Performance Tuning
Caching - Concepts, Storage Type, Guidelines
Minimizing Shuffling for Increased Performance
Using Broadcast Variables and Accumulators
General Performance Guidelines
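
The caching, broadcast, and accumulator topics above might look like this in practice. This is a sketch under assumptions: Spark 3.x, a local master, and a hypothetical department lookup table and object name (TuningSketch). Accumulator updates made inside transformations can be applied more than once if a task is re-executed, so the counter here should be read as a diagnostic, not an exact figure.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

object TuningSketch {
  // Hypothetical lookup table, small enough to broadcast.
  val deptNames: Map[String, String] = Map("eng" -> "Engineering", "ops" -> "Operations")

  // Resolves department codes to names and reports how many codes were unknown.
  def resolve(spark: SparkSession, codes: Seq[String]): (List[String], Long) = {
    val sc = spark.sparkContext

    // Broadcast variable: the table is shipped to each executor once, not with every task.
    val lookup = sc.broadcast(deptNames)

    // Accumulator: tasks add to it, the driver reads the total.
    val unknown = sc.longAccumulator("unknownDeptCodes")

    val resolved = sc.parallelize(codes).map { code =>
      lookup.value.getOrElse(code, { unknown.add(1L); "Unknown" })
    }.persist(StorageLevel.MEMORY_ONLY) // keep the result in memory for reuse

    resolved.count()                      // first action computes and caches the RDD
    val names = resolved.collect().toList // second action can be served from the cache
    resolved.unpersist()

    (names, unknown.value.longValue)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("tuning-sketch").master("local[*]").getOrCreate()
    try println(resolve(spark, Seq("eng", "ops", "hr", "eng")))
    finally spark.stop()
  }
}
```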
 
Session 7: Creating Standalone Applications
Core API, SparkSession.Builder
Configuring and Creating a SparkSession
Building and Running Applications - sbt/build.sbt and spark-submit
Application Lifecycle (Driver, Executors, and Tasks)
Cluster Managers (Standalone, YARN, Mesos)
Logging and Debugging
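
A standalone application built around SparkSession.Builder could be sketched as below, assuming Spark 3.x. The object name StandaloneApp, the application name, and the shuffle-partition value are illustrative choices, not recommendations from the course. The master is left optional so that spark-submit can supply it.

```scala
import org.apache.spark.sql.SparkSession

object StandaloneApp {
  // Builds a session; pass None for master when spark-submit supplies --master.
  def buildSession(appName: String, master: Option[String]): SparkSession = {
    val base = SparkSession.builder()
      .appName(appName)
      .config("spark.sql.shuffle.partitions", "8") // illustrative value only
    master.fold(base)(m => base.master(m)).getOrCreate()
  }

  // Counts the even ids in the range [0, n).
  def countEven(spark: SparkSession, n: Long): Long =
    spark.range(0, n).filter("id % 2 = 0").count()

  def main(args: Array[String]): Unit = {
    // An optional first argument (e.g. local[*]) sets the master when run outside spark-submit.
    val spark = buildSession("standalone-sketch", args.headOption)
    try println(s"even ids: ${countEven(spark, 1000L)}")
    finally spark.stop()
  }
}
```

A typical workflow would declare the spark-sql dependency in build.sbt (often with "provided" scope), package the application with sbt package, and launch the resulting jar through spark-submit using its --class and --master options.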
 
Session 8: Spark Streaming
Introduction and Streaming Basics
Structured Streaming (Spark 2+)
Continuous Applications
Table Paradigm, Result Table
Steps for Structured Streaming
Sources and Sinks
Consuming Kafka Data
Kafka Overview
Structured Streaming - "kafka" format
Processing the Stream

Java Programming Uses & Stats

Java Programming is Used For:
Android & iOS Development, Software Products, Video Games, Desktop GUIs
Difficulty
Popularity
Year Created
1995
Pros

Most Commonly Used: 
According to Oracle, three billion devices run on Java.  And, because of its real-world applications, it consistently ranks at the top of the TIOBE Programming Community Index. 

Great Career Choice: 
According to Glassdoor, some of the fastest-growing salaries in the U.S. in 2018 were for Java developers.

Android Apps Development:
Developers predominantly use their Java skills in building apps for Google's Android. The Android platform is the number one mobile platform in the world.

It Can Run On Any Platform:
Java code compiled on Windows produces a file that runs unchanged on Linux, Windows, and Mac.

Great Supporting IDE's:
Over the years, coding in Java has become simpler with the introduction of open source development tools, e.g. Eclipse and NetBeans, that use Java capabilities for debugging.
 

Cons

Uses a Lot of Memory:
Java can be significantly slower and more memory-consuming than natively compiled languages such as C or C++.

Difficulty in Learning: 
Learning Java can be a bit challenging if you are a beginner.  However, once you get the hang of Object Oriented Programming and a decent grasp of the syntax, you will be well on your way.

Slow Start Up Times:
There is quite a bit of one-time initialization done by JDK classes before compiling, as well as class loading and verification (making sure code doesn't do evil things), all of which takes longer than in some other languages such as C.

Verbose and Complex Code:
Long, over-complicated statements make code less readable and scannable. Compared to, say, Python, the difference is clear: Python doesn’t require semicolons; uses “and,” “or,” and “not” as operators instead of Java’s “&&,” “||,” and “!”; and generally has fewer bells and whistles such as parentheses or curly braces.

Commercial License Cost:
Companies have to prepare for the changes that Oracle will institute in 2019. Today, the current version of Java is free and available for redistribution for general-purpose computing. However, if you are a developer, Oracle recommends that you review the roadmap information for Java SE 8 and beyond and take appropriate action depending on the type of application you develop and your distribution mode.

Java Programming Job Market
Average Salary
$102,000
Job Count
26,856
Top Job Locations

New York City 
San Jose
Washington, D.C.

Complementary Skills to have along with Java Programming

- If you are an experienced Java developer, learning a complementary language to Java should come much more naturally.  As an example, JetBrains recently created the Kotlin programming language, which is officially supported by Google for mobile development.  Kotlin compiles to Java bytecode and runs on the JVM; it's purported to address many of Java's shortcomings...

Interesting Reads

Take a class with us and receive a book of your choosing for 50% off MSRP.