內容大鋼
Apache Spark易學易用令人驚喜。但如果你尚未看到期望的性能改善效果,或者還是沒有足夠信心在生產環境中使用Spark,這本實用書籍——《高性能Spark(影印版)(英文版)》就是給你準備的。作者霍爾頓·卡勞和雷切爾·沃倫展示了如何使用更少資源,讓Spark查詢運行更快、處理更大數據的性能優化方法。
本書描述了減少數據基礎設施成本和開發時間的技巧,適用於軟體工程師、數據工程師、開發者和系統管理員。你不僅可以從中獲得關於Spark的全面理解,也將學會如何讓它運轉自如。
目錄
Preface
1.Introduction to High Performance Spark
What Is Spark and Why Performance Matters
What You Can Expect to Get from This Book
Spark Versions
Why Scala?
To Be a Spark Expert You Have to Learn a Little Scala Anyway
The Spark Scala API Is Easier to Use Than the lava API
Scala Is More Performant Than Python
Why Not Scala?
Learning Scala
Conclusion
2.How Spark Works
How Spark Fits into the Big Data Ecosystem
Spark Components
Spark Model of Parallel Computing: RDDs
Lazy Evaluation
In-Memory Persistence and Memory Management
Immutability and the RDD Interface
Types of RDDs
Functions on RDDs: Transformations Versus Actions
Wide Versus Narrow Dependencies
Spark Job Scheduling
Resource Allocation Across Applications
The Spark Application
The Anatomy of a Spark lob
The DAG
Jobs
Stages
Tasks
Conclusion
3.DataFrames, Datasets, and Spark SQL
Getting Started with the SparkSession (or HiveContext or SQLContext)
Spark SQL Dependencies
Managing Spark Dependencies
Avoiding Hive JARs
Basics of Schemas
DataFrame API
Transformations
Multi-DataFrame Transformations
Plain Old SQL Queries and Interacting with Hive Data
Data Representation in DataFrames and Datasets
Tungsten
Data Loading and Saving Functions
DataFrameWriter and DataFrameReader
Formats
Save Modes
Partitions (Discovery and Writing)
Datasets
Interoperability with RDDs, DataFrames, and Local Collections
Compile-Time Strong Typing
Easier Functional (RDD "like") Transformations
Relational Transformations
Multi-Dataset Relational Transformations
Grouped Operations on Datasets
Extending with User-Defined Functions and Aggregate Functions (UDFs,UDAFs)
Query Optimizer
Logical and Physical Plans
Code Generation
Large Query Plans and Iterative Algorithms
Debugging Spark SQL Queries
JDBC/ODBC Server
Conclusion
4.Joins (SQL and Core)
Core Spark Joins
Choosing a Join Type
Choosing an Execution Plan
Spark SQL Joins
DataFrame Joins
Dataset Joins
Conclusion
5.Effective Transformations
Narrow Versus Wide Transformations
Implications for Performance
Implications for Fault Tolerance
The Special Case of coalesce
What Type of RDD Does Your Transformation Return?
Minimizing Object Creation
Reusing Existing Objects
Using Smaller Data Structures
Iterator-to-Iterator Transformations with mapPartitions
What Is an Iterator-to-Iterator Transformation?
Space and Time Advantages
An Example
Set Operations
Reducing Setup Overhead
Shared Variables
Broadcast Variables
Accumulators
Reusing RDDs
Cases for Reuse
Deciding if Recompute Is Inexpensive Enough
Types of Reuse: Cache, Persist, Checkpoint, Shuffle Files
Alluxio (nee Tachyon)
LRU Caching
Noisy Cluster Considerations
Interaction with Accumulators
Conclusion
6.Working with Key/Value Data
The Goldilocks Example
Goldilocks Version 0: Iterative Solution
How to Use PairRDDFunctions and OrderedRDDFunctions
Actions on Key/Value Pairs
What's So Dangerous About the groupByKey Function
Goldilocks Version 1: groupByKey Solution
Choosing an Aggregation Operation
Dictionary of Aggregation Operations with Performance Considerations
Multiple RDD Operations
Co-Grouping
Partitioners and Key/Value Data
Using the Spark Partitioner Object
Hash Partitioning
Range Partitioning
Custom Partitioning
Preserving Partitioning Information Across Transformations
Leveraging Co-Located and Co-Partitioned RDDs
Dictionary of Mapping and Partitioning Functions PairRDDFunctions
Dictionary of OrderedRDDOperations
Sorting by Two Keys with SortByKey
Secondary Sort and repartitionAndSortWithinPartitions
Leveraging repartitionAndSortWithinPartitions for a Group by Key and Sort Values Function
How Not to Sort by Two Orderings
Goldilocks Version 2: Secondary Sort
A Different Approach to Goldilocks
Goldilocks Version 3: Sort on Cell Values
Straggler Detection and Unbalanced Data
Back to Goldilocks (Again)
Goldilocks Version 4: Reduce to Distinct on Each Partition
Conclusion
7.Going Beyond Scala
Beyond Scala within the JVM
Beyond Scala, and Beyond the JVM
How PySpark Works
How SparkR Works
Spark.jl (Julia Spark)
How Eclair JS Works
Spark on the Common Language Runtime (CLR)--C# and Friends
Calling Other Languages from Spark
Using Pipe and Friends
JNI
Java Native Access (JNA)
Underneath Everything Is FORTRAN
Getting to the GPU
The Future
Conclusion
8.Testing and Validation
Unit Testing
General Spark Unit Testing
Mocking RDDs
Getting Test Data
Generating Large Datasets
Sampling
Property Checking with ScalaCheck
Computing RDD Difference
Integration Testing
Choosing Your Integration Testing Environment
Verifying Performance
Spark Counters for Verifying Performance
Projects for Verifying Performance
Job Validation
Conclusion
9.Spark MLlib and ML
Choosing Between Spark MLlib and Spark ML
Working with MLlib
Getting Started with MLlib (Organization and Imports)
MLlib Feature Encoding and Data Preparation
Feature Scaling and Selection
MLlib Model Training
Predicting
Serving and Persistence
Model Evaluation
Working with Spark ML
Spark ML Organization and Imports
Pipeline Stages
Explain Params
Data Encoding
Data Cleaning
Spark ML Models
Putting It All Together in a Pipeline
Training a Pipeline
Accessing Individual Stages
Data Persistence and Spark ML
Extending Spark ML Pipelines with Your Own Algorithms
Model and Pipeline Persistence and Serving with Spark ML
General Serving Considerations
Conclusion
10.Spark Components and Packages
Stream Processing with Spark
Sources and Sinks
Batch Intervals
Data Checkpoint Intervals
Considerations for DStreams
Considerations for Structured Streaming
High Availability Mode (or Handling Driver Failure or Checkpointing)
GraphX
Using Community Packages and Libraries
Creating a Spark Package
Conclusion
A.Tuning, Debugging, and Other Things Developers Like to Pretend Don't Exist
Index