Apache Spark Training


Apache Spark is a powerful open source processing engine for Hadoop data built around speed, ease of use, and sophisticated analytics.

Fast and Easy Big Data Processing with Spark

Apache Spark is a framework for writing fast, distributed programs. Spark solves problems similar to those Hadoop MapReduce addresses, but with a fast in-memory approach and a clean, functional-style API. With its ability to integrate with Hadoop and its built-in tools for interactive query analysis (Shark), large-scale graph processing and analysis (Bagel), and real-time analysis (Spark Streaming), it can be used interactively to quickly process and query big data sets.

Apache Spark is a fast and general-purpose cluster computing system. It provides high-level APIs in Scala, Java, and Python that make parallel jobs easy to write, and an optimized engine that supports general computation graphs. It also supports a rich set of higher-level tools including Shark (Hive on Spark), MLlib for machine learning, GraphX for graph processing, and Spark Streaming.
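The interactive shells mentioned above ship with the Spark distribution itself; assuming Spark has been downloaded and unpacked, they are launched from the install directory roughly as follows (paths may vary slightly between releases):

```shell
# Launch Spark's interactive Scala shell from the Spark install directory
./bin/spark-shell

# Or the Python equivalent
./bin/pyspark
```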

Outline of Apache Spark:

Spark components
Cluster managers
Hardware & configuration
Linking with Spark
Monitoring and measuring

What you will learn

Prototype distributed applications with Spark’s interactive shell
Learn different ways to interact with Spark’s distributed representation of data (RDDs)
Load data from various data sources
Integrate Shark queries with Spark programs
Query Spark with a SQL-like query syntax
Effectively test your distributed software
Tune a Spark installation
Install and set up Spark on your cluster
Work effectively with large data sets
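Several of the topics above center on Spark's functional style of working with distributed data (RDDs): chaining transformations such as map and filter, then triggering computation with an action such as reduce. As a local illustration of that style only, here is a plain-Python analogue in which an ordinary list stands in for an RDD; none of this is the Spark API itself:

```python
from functools import reduce

# Local analogue of Spark's RDD transformation/action pattern.
# In Spark, map and filter are lazy transformations on a distributed
# RDD, and reduce is an action that triggers the computation; here a
# plain Python list stands in for the RDD (illustration only).
data = [1, 2, 3, 4, 5]

squared = map(lambda x: x * x, data)           # transformation: square each element
evens = filter(lambda x: x % 2 == 0, squared)  # transformation: keep even squares
total = reduce(lambda a, b: a + b, evens)      # action: sum the surviving values

print(total)  # 4 + 16 = 20
```

In Spark the same chain would be written against an RDD, with the work distributed across the cluster rather than executed on one local list.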

Apache Spark application

Driver program: the user program that creates a SparkContext
Workers: processes that execute tasks and store data
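The driver/worker split above can be sketched with a small local analogue: a "driver" partitions the data, hands one task per partition to "worker" processes, and collects the partial results. This is an illustration in plain Python using multiprocessing, not the Spark API (in Spark the driver creates a SparkContext and the cluster manager schedules tasks on remote workers):

```python
from multiprocessing import Pool

def task(partition):
    # Work executed by a "worker": sum of squares over one partition.
    return sum(x * x for x in partition)

if __name__ == "__main__":
    data = list(range(10))
    # The "driver" splits the data into partitions...
    partitions = [data[0:5], data[5:10]]
    # ...and farms the tasks out to worker processes.
    with Pool(2) as workers:
        partials = workers.map(task, partitions)
    # The driver collects and combines the partial results.
    print(sum(partials))  # sum of squares of 0..9 = 285
```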