
High Performance Spark (Reprint Edition) (English Edition)

  • Authors: Holden Karau (Canada) // Rachel Warren
  • Publisher: Southeast University Press
  • ISBN: 9787564175184
  • Publication date: 2018/02/01
  • Binding: Paperback
  • Pages: 341
List price: RMB 88

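Synopsis
    Apache Spark is remarkably easy to learn and use. But if you have not yet seen the performance improvements you expected, or still lack the confidence to run Spark in production, this practical book, High Performance Spark (Reprint Edition) (English Edition), is for you. Authors Holden Karau and Rachel Warren demonstrate performance optimizations that help Spark queries run faster and handle larger data sizes while using fewer resources.
    The book describes techniques for reducing data infrastructure costs and development time, and is suited to software engineers, data engineers, developers, and system administrators. You will not only gain a comprehensive understanding of Spark, but also learn how to make it run smoothly.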

About the Authors
Holden Karau (Canada) // Rachel Warren

Table of Contents
Preface
1.Introduction to High Performance Spark
  What Is Spark and Why Performance Matters
  What You Can Expect to Get from This Book
  Spark Versions
  Why Scala?
    To Be a Spark Expert You Have to Learn a Little Scala Anyway
    The Spark Scala API Is Easier to Use Than the Java API
    Scala Is More Performant Than Python
    Why Not Scala?
    Learning Scala
  Conclusion
2.How Spark Works
  How Spark Fits into the Big Data Ecosystem
    Spark Components
  Spark Model of Parallel Computing: RDDs
    Lazy Evaluation
    In-Memory Persistence and Memory Management
    Immutability and the RDD Interface
    Types of RDDs
    Functions on RDDs: Transformations Versus Actions
    Wide Versus Narrow Dependencies
  Spark Job Scheduling
    Resource Allocation Across Applications
    The Spark Application
  The Anatomy of a Spark Job
    The DAG
    Jobs
    Stages
    Tasks
  Conclusion
3.DataFrames, Datasets, and Spark SQL
  Getting Started with the SparkSession (or HiveContext or SQLContext)
  Spark SQL Dependencies
    Managing Spark Dependencies
    Avoiding Hive JARs
  Basics of Schemas
  DataFrame API
    Transformations
    Multi-DataFrame Transformations
    Plain Old SQL Queries and Interacting with Hive Data
  Data Representation in DataFrames and Datasets
    Tungsten
  Data Loading and Saving Functions
    DataFrameWriter and DataFrameReader
    Formats
    Save Modes
    Partitions (Discovery and Writing)
  Datasets
    Interoperability with RDDs, DataFrames, and Local Collections
    Compile-Time Strong Typing
    Easier Functional (RDD "like") Transformations
    Relational Transformations
    Multi-Dataset Relational Transformations
    Grouped Operations on Datasets
  Extending with User-Defined Functions and Aggregate Functions (UDFs, UDAFs)
  Query Optimizer
    Logical and Physical Plans
    Code Generation
    Large Query Plans and Iterative Algorithms
  Debugging Spark SQL Queries
  JDBC/ODBC Server
  Conclusion
4.Joins (SQL and Core)
  Core Spark Joins
    Choosing a Join Type
    Choosing an Execution Plan
  Spark SQL Joins
    DataFrame Joins
    Dataset Joins
  Conclusion
5.Effective Transformations
  Narrow Versus Wide Transformations
    Implications for Performance
    Implications for Fault Tolerance
    The Special Case of coalesce
  What Type of RDD Does Your Transformation Return?
  Minimizing Object Creation
    Reusing Existing Objects
    Using Smaller Data Structures
  Iterator-to-Iterator Transformations with mapPartitions
    What Is an Iterator-to-Iterator Transformation?
    Space and Time Advantages
    An Example
  Set Operations
  Reducing Setup Overhead
    Shared Variables
    Broadcast Variables
    Accumulators
  Reusing RDDs
    Cases for Reuse
    Deciding if Recompute Is Inexpensive Enough
    Types of Reuse: Cache, Persist, Checkpoint, Shuffle Files
    Alluxio (nee Tachyon)
    LRU Caching
    Noisy Cluster Considerations
    Interaction with Accumulators
  Conclusion
6.Working with Key/Value Data
  The Goldilocks Example
    Goldilocks Version 0: Iterative Solution
    How to Use PairRDDFunctions and OrderedRDDFunctions
  Actions on Key/Value Pairs
  What's So Dangerous About the groupByKey Function
    Goldilocks Version 1: groupByKey Solution
  Choosing an Aggregation Operation
    Dictionary of Aggregation Operations with Performance Considerations
  Multiple RDD Operations
    Co-Grouping
  Partitioners and Key/Value Data
    Using the Spark Partitioner Object
    Hash Partitioning
    Range Partitioning
    Custom Partitioning
    Preserving Partitioning Information Across Transformations
    Leveraging Co-Located and Co-Partitioned RDDs
    Dictionary of Mapping and Partitioning Functions PairRDDFunctions
  Dictionary of OrderedRDDOperations
    Sorting by Two Keys with SortByKey
  Secondary Sort and repartitionAndSortWithinPartitions
    Leveraging repartitionAndSortWithinPartitions for a Group by Key and Sort Values Function
    How Not to Sort by Two Orderings
    Goldilocks Version 2: Secondary Sort
    A Different Approach to Goldilocks
    Goldilocks Version 3: Sort on Cell Values
  Straggler Detection and Unbalanced Data
    Back to Goldilocks (Again)
    Goldilocks Version 4: Reduce to Distinct on Each Partition
  Conclusion
7.Going Beyond Scala
  Beyond Scala within the JVM
  Beyond Scala, and Beyond the JVM
    How PySpark Works
    How SparkR Works
    Spark.jl (Julia Spark)
    How Eclair JS Works
    Spark on the Common Language Runtime (CLR)--C# and Friends
  Calling Other Languages from Spark
    Using Pipe and Friends
    JNI
    Java Native Access (JNA)
    Underneath Everything Is FORTRAN
    Getting to the GPU
  The Future
  Conclusion
8.Testing and Validation
  Unit Testing
    General Spark Unit Testing
    Mocking RDDs
  Getting Test Data
    Generating Large Datasets
    Sampling
  Property Checking with ScalaCheck
    Computing RDD Difference
  Integration Testing
    Choosing Your Integration Testing Environment
  Verifying Performance
    Spark Counters for Verifying Performance
    Projects for Verifying Performance
  Job Validation
  Conclusion
9.Spark MLlib and ML
  Choosing Between Spark MLlib and Spark ML
  Working with MLlib
    Getting Started with MLlib (Organization and Imports)
    MLlib Feature Encoding and Data Preparation
    Feature Scaling and Selection
    MLlib Model Training
    Predicting
    Serving and Persistence
    Model Evaluation
  Working with Spark ML
    Spark ML Organization and Imports
    Pipeline Stages
    Explain Params
    Data Encoding
    Data Cleaning
    Spark ML Models
    Putting It All Together in a Pipeline
    Training a Pipeline
    Accessing Individual Stages
    Data Persistence and Spark ML
    Extending Spark ML Pipelines with Your Own Algorithms
    Model and Pipeline Persistence and Serving with Spark ML
  General Serving Considerations
  Conclusion
10.Spark Components and Packages
  Stream Processing with Spark
    Sources and Sinks
    Batch Intervals
    Data Checkpoint Intervals
    Considerations for DStreams
    Considerations for Structured Streaming
    High Availability Mode (or Handling Driver Failure or Checkpointing)
  GraphX
  Using Community Packages and Libraries
    Creating a Spark Package
  Conclusion
A.Tuning, Debugging, and Other Things Developers Like to Pretend Don't Exist
Index
