  • Authors: Holden Karau (Canada) // Rachel Warren
  • Publisher: Southeast University Press
  • ISBN: 9787564175184
  • Publication date: 2018/02/01
  • Binding: Paperback
  • Pages: 341
Price: RMB 88

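    Apache Spark is delightfully easy to learn and use. But if you have not yet seen the performance improvements you expected, or still lack the confidence to run Spark in production, this practical book, High Performance Spark (Reprint Edition, English), is for you. Authors Holden Karau and Rachel Warren present performance optimization techniques that let Spark queries run faster and handle larger data while using fewer resources.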


1.Introduction to High Performance Spark
  What Is Spark and Why Performance Matters
  What You Can Expect to Get from This Book
  Spark Versions
  Why Scala?
    To Be a Spark Expert You Have to Learn a Little Scala Anyway
    The Spark Scala API Is Easier to Use Than the Java API
    Scala Is More Performant Than Python
    Why Not Scala?
    Learning Scala
2.How Spark Works
  How Spark Fits into the Big Data Ecosystem
    Spark Components
  Spark Model of Parallel Computing: RDDs
    Lazy Evaluation
    In-Memory Persistence and Memory Management
    Immutability and the RDD Interface
    Types of RDDs
    Functions on RDDs: Transformations Versus Actions
    Wide Versus Narrow Dependencies
  Spark Job Scheduling
    Resource Allocation Across Applications
    The Spark Application
  The Anatomy of a Spark Job
    The DAG
3.DataFrames, Datasets, and Spark SQL
  Getting Started with the SparkSession (or HiveContext or SQLContext)
  Spark SQL Dependencies
    Managing Spark Dependencies
    Avoiding Hive JARs
  Basics of Schemas
  DataFrame API
    Multi-DataFrame Transformations
    Plain Old SQL Queries and Interacting with Hive Data
  Data Representation in DataFrames and Datasets
  Data Loading and Saving Functions
    DataFrameWriter and DataFrameReader
    Save Modes
    Partitions (Discovery and Writing)
    Interoperability with RDDs, DataFrames, and Local Collections
    Compile-Time Strong Typing
    Easier Functional (RDD "like") Transformations
    Relational Transformations
    Multi-Dataset Relational Transformations
    Grouped Operations on Datasets
  Extending with User-Defined Functions and Aggregate Functions (UDFs, UDAFs)
  Query Optimizer
    Logical and Physical Plans
    Code Generation
    Large Query Plans and Iterative Algorithms
  Debugging Spark SQL Queries
  JDBC/ODBC Server
4.Joins (SQL and Core)
  Core Spark Joins
    Choosing a Join Type
    Choosing an Execution Plan
  Spark SQL Joins
    DataFrame Joins
    Dataset Joins
5.Effective Transformations
  Narrow Versus Wide Transformations
    Implications for Performance
    Implications for Fault Tolerance
    The Special Case of coalesce
  What Type of RDD Does Your Transformation Return?
  Minimizing Object Creation
    Reusing Existing Objects
    Using Smaller Data Structures
  Iterator-to-Iterator Transformations with mapPartitions
    What Is an Iterator-to-Iterator Transformation?
    Space and Time Advantages
    An Example
  Set Operations
  Reducing Setup Overhead
    Shared Variables
    Broadcast Variables
  Reusing RDDs
    Cases for Reuse
    Deciding if Recompute Is Inexpensive Enough
    Types of Reuse: Cache, Persist, Checkpoint, Shuffle Files
    Alluxio (nee Tachyon)
    LRU Caching
    Noisy Cluster Considerations
    Interaction with Accumulators
6.Working with Key/Value Data
  The Goldilocks Example
    Goldilocks Version 0: Iterative Solution
    How to Use PairRDDFunctions and OrderedRDDFunctions
  Actions on Key/Value Pairs
  What's So Dangerous About the groupByKey Function
    Goldilocks Version 1: groupByKey Solution
  Choosing an Aggregation Operation
    Dictionary of Aggregation Operations with Performance Considerations
  Multiple RDD Operations
  Partitioners and Key/Value Data
    Using the Spark Partitioner Object
    Hash Partitioning
    Range Partitioning
    Custom Partitioning
    Preserving Partitioning Information Across Transformations
    Leveraging Co-Located and Co-Partitioned RDDs
    Dictionary of Mapping and Partitioning Functions PairRDDFunctions
  Dictionary of OrderedRDDOperations
    Sorting by Two Keys with SortByKey
  Secondary Sort and repartitionAndSortWithinPartitions
    Leveraging repartitionAndSortWithinPartitions for a Group by Key and Sort Values Function
    How Not to Sort by Two Orderings
    Goldilocks Version 2: Secondary Sort
    A Different Approach to Goldilocks
    Goldilocks Version 3: Sort on Cell Values
  Straggler Detection and Unbalanced Data
    Back to Goldilocks (Again)
    Goldilocks Version 4: Reduce to Distinct on Each Partition
7.Going Beyond Scala
  Beyond Scala within the JVM
  Beyond Scala, and Beyond the JVM
    How PySpark Works
    How SparkR Works
    Spark.jl (Julia Spark)
    How Eclair JS Works
    Spark on the Common Language Runtime (CLR)--C# and Friends
  Calling Other Languages from Spark
    Using Pipe and Friends
    Java Native Access (JNA)
    Underneath Everything Is FORTRAN
    Getting to the GPU
  The Future
8.Testing and Validation
  Unit Testing
    General Spark Unit Testing
    Mocking RDDs
  Getting Test Data
    Generating Large Datasets
  Property Checking with ScalaCheck
    Computing RDD Difference
  Integration Testing
    Choosing Your Integration Testing Environment
  Verifying Performance
    Spark Counters for Verifying Performance
    Projects for Verifying Performance
  Job Validation
9.Spark MLlib and ML
  Choosing Between Spark MLlib and Spark ML
  Working with MLlib
    Getting Started with MLlib (Organization and Imports)
    MLlib Feature Encoding and Data Preparation
    Feature Scaling and Selection
    MLlib Model Training
    Serving and Persistence
    Model Evaluation
  Working with Spark ML
    Spark ML Organization and Imports
    Pipeline Stages
    Explain Params
    Data Encoding
    Data Cleaning
    Spark ML Models
    Putting It All Together in a Pipeline
    Training a Pipeline
    Accessing Individual Stages
    Data Persistence and Spark ML
    Extending Spark ML Pipelines with Your Own Algorithms
    Model and Pipeline Persistence and Serving with Spark ML
  General Serving Considerations
10.Spark Components and Packages
  Stream Processing with Spark
    Sources and Sinks
    Batch Intervals
    Data Checkpoint Intervals
    Considerations for DStreams
    Considerations for Structured Streaming
    High Availability Mode (or Handling Driver Failure or Checkpointing)
  Using Community Packages and Libraries
    Creating a Spark Package
A.Tuning, Debugging, and Other Things Developers Like to Pretend Don't Exist

