
Hadoop Beginner's Guide (eBook)


Author: Garry Turkington

Publisher: Packt Publishing

Publication date: 2013-02-22

Word count: 3.476 million

Category: Imported Books > Foreign-Language Originals > Computers/Networking

As a Packt Beginner's Guide, the book is packed with clear step-by-step instructions for performing the most useful tasks, getting you up and running quickly, and learning by doing. This book assumes no existing experience with Hadoop or cloud services. It assumes you have familiarity with a programming language such as Java or Ruby but gives you the needed background on the other topics.

Hadoop Beginner's Guide

Table of Contents

Hadoop Beginner's Guide

Credits

About the Author

About the Reviewers

www.PacktPub.com

Support files, eBooks, discount offers and more

Why Subscribe?

Free Access for Packt account holders

Preface

What this book covers

What you need for this book

Who this book is for

Conventions

Time for action – heading

What just happened?

Pop quiz – heading

Have a go hero – heading

Reader feedback

Customer support

Downloading the example code

Errata

Piracy

Questions

1. What It's All About

Big data processing

The value of data

Historically for the few and not the many

Classic data processing systems

Scale-up

Early approaches to scale-out

Limiting factors

A different approach

All roads lead to scale-out

Share nothing

Expect failure

Smart software, dumb hardware

Move processing, not data

Build applications, not infrastructure

Hadoop

Thanks, Google

Thanks, Doug

Thanks, Yahoo

Parts of Hadoop

Common building blocks

HDFS

MapReduce

Better together

Common architecture

What it is and isn't good for

Cloud computing with Amazon Web Services

Too many clouds

A third way

Different types of costs

AWS – infrastructure on demand from Amazon

Elastic Compute Cloud (EC2)

Simple Storage Service (S3)

Elastic MapReduce (EMR)

What this book covers

A dual approach

Summary

2. Getting Hadoop Up and Running

Hadoop on a local Ubuntu host

Other operating systems

Time for action – checking the prerequisites

What just happened?

Setting up Hadoop

A note on versions

Time for action – downloading Hadoop

What just happened?

Time for action – setting up SSH

What just happened?

Configuring and running Hadoop

Time for action – using Hadoop to calculate Pi

What just happened?

Three modes

Time for action – configuring the pseudo-distributed mode

What just happened?

Configuring the base directory and formatting the filesystem

Time for action – changing the base HDFS directory

What just happened?

Time for action – formatting the NameNode

What just happened?

Starting and using Hadoop

Time for action – starting Hadoop

What just happened?

Time for action – using HDFS

What just happened?

Time for action – WordCount, the Hello World of MapReduce

What just happened?

Have a go hero – WordCount on a larger body of text

Monitoring Hadoop from the browser

The HDFS web UI

The MapReduce web UI

Using Elastic MapReduce

Setting up an account in Amazon Web Services

Creating an AWS account

Signing up for the necessary services

Time for action – WordCount on EMR using the management console

What just happened?

Have a go hero – other EMR sample applications

Other ways of using EMR

AWS credentials

The EMR command-line tools

The AWS ecosystem

Comparison of local versus EMR Hadoop

Summary

3. Understanding MapReduce

Key/value pairs

What it means

Why key/value data?

Some real-world examples

MapReduce as a series of key/value transformations

Pop quiz – key/value pairs

The Hadoop Java API for MapReduce

The 0.20 MapReduce Java API

The Mapper class

The Reducer class

The Driver class

Writing MapReduce programs

Time for action – setting up the classpath

What just happened?

Time for action – implementing WordCount

What just happened?

Time for action – building a JAR file

What just happened?

Time for action – running WordCount on a local Hadoop cluster

What just happened?

Time for action – running WordCount on EMR

What just happened?

The pre-0.20 Java MapReduce API

Hadoop-provided mapper and reducer implementations

Time for action – WordCount the easy way

What just happened?

Walking through a run of WordCount

Startup

Splitting the input

Task assignment

Task startup

Ongoing JobTracker monitoring

Mapper input

Mapper execution

Mapper output and reduce input

Partitioning

The optional partition function

Reducer input

Reducer execution

Reducer output

Shutdown

That's all there is to it!

Apart from the combiner…maybe

Why have a combiner?

Time for action – WordCount with a combiner

What just happened?

When you can use the reducer as the combiner

Time for action – fixing WordCount to work with a combiner

What just happened?

Reuse is your friend

Pop quiz – MapReduce mechanics

Hadoop-specific data types

The Writable and WritableComparable interfaces

Introducing the wrapper classes

Primitive wrapper classes

Array wrapper classes

Map wrapper classes

Time for action – using the Writable wrapper classes

What just happened?

Other wrapper classes

Have a go hero – playing with Writables

Making your own

Input/output

Files, splits, and records

InputFormat and RecordReader

Hadoop-provided InputFormat

Hadoop-provided RecordReader

OutputFormat and RecordWriter

Hadoop-provided OutputFormat

Don't forget Sequence files

Summary

4. Developing MapReduce Programs

Using languages other than Java with Hadoop

How Hadoop Streaming works

Why use Hadoop Streaming

Time for action – implementing WordCount using Streaming

What just happened?

Differences in jobs when using Streaming

Analyzing a large dataset

Getting the UFO sighting dataset

Getting a feel for the dataset

Time for action – summarizing the UFO data

What just happened?

Examining UFO shapes

Time for action – summarizing the shape data

What just happened?

Time for action – correlating sighting duration to UFO shape

What just happened?

Using Streaming scripts outside Hadoop

Time for action – performing the shape/time analysis from the command line

What just happened?

Java shape and location analysis

Time for action – using ChainMapper for field validation/analysis

What just happened?

Have a go hero

Too many abbreviations

Using the Distributed Cache

Time for action – using the Distributed Cache to improve location output

What just happened?

Counters, status, and other output

Time for action – creating counters, task states, and writing log output

What just happened?

Too much information!

Summary

5. Advanced MapReduce Techniques

Simple, advanced, and in-between

Joins

When this is a bad idea

Map-side versus reduce-side joins

Matching account and sales information

Time for action – reduce-side join using MultipleInputs

What just happened?

DataJoinMapper and TaggedMapperOutput

Implementing map-side joins

Using the Distributed Cache

Have a go hero – implementing map-side joins

Pruning data to fit in the cache

Using a data representation instead of raw data

Using multiple mappers

To join or not to join...

Graph algorithms

Graph 101

Graphs and MapReduce – a match made somewhere

Representing a graph

Time for action – representing the graph

What just happened?

Overview of the algorithm

The mapper

The reducer

Iterative application

Time for action – creating the source code

What just happened?

Time for action – the first run

What just happened?

Time for action – the second run

What just happened?

Time for action – the third run

What just happened?

Time for action – the fourth and last run

What just happened?

Running multiple jobs

Final thoughts on graphs

Using language-independent data structures

Candidate technologies

Introducing Avro

Time for action – getting and installing Avro

What just happened?

Avro and schemas

Time for action – defining the schema

What just happened?

Time for action – creating the source Avro data with Ruby

What just happened?

Time for action – consuming the Avro data with Java

What just happened?

Using Avro within MapReduce

Time for action – generating shape summaries in MapReduce

What just happened?

Time for action – examining the output data with Ruby

What just happened?

Time for action – examining the output data with Java

What just happened?

Have a go hero – graphs in Avro

Going forward with Avro

Summary

6. When Things Break

Failure

Embrace failure

Or at least don't fear it

Don't try this at home

Types of failure

Hadoop node failure

The dfsadmin command

Cluster setup, test files, and block sizes

Fault tolerance and Elastic MapReduce

Time for action – killing a DataNode process

What just happened?

NameNode and DataNode communication

Have a go hero – NameNode log delving

Time for action – the replication factor in action

What just happened?

Time for action – intentionally causing missing blocks

What just happened?

When data may be lost

Block corruption

Time for action – killing a TaskTracker process

What just happened?

Comparing the DataNode and TaskTracker failures

Permanent failure

Killing the cluster masters

Time for action – killing the JobTracker

What just happened?

Starting a replacement JobTracker

Have a go hero – moving the JobTracker to a new host

Time for action – killing the NameNode process

What just happened?

Starting a replacement NameNode

The role of the NameNode in more detail

File systems, files, blocks, and nodes

The single most important piece of data in the cluster – fsimage

DataNode startup

Safe mode

SecondaryNameNode

So what to do when the NameNode process has a critical failure?

BackupNode/CheckpointNode and NameNode HA

Hardware failure

Host failure

Host corruption

The risk of correlated failures

Task failure due to software

Failure of slow running tasks

Time for action – causing task failure

What just happened?

Have a go hero – HDFS programmatic access

Hadoop's handling of slow-running tasks

Speculative execution

Hadoop's handling of failing tasks

Have a go hero – causing tasks to fail

Task failure due to data

Handling dirty data through code

Using Hadoop's skip mode

Time for action – handling dirty data by using skip mode

What just happened?

To skip or not to skip...

Summary

7. Keeping Things Running

A note on EMR

Hadoop configuration properties

Default values

Time for action – browsing default properties

What just happened?

Additional property elements

Default storage location

Where to set properties

Setting up a cluster

How many hosts?

Calculating usable space on a node

Location of the master nodes

Sizing hardware

Processor / memory / storage ratio

EMR as a prototyping platform

Special node requirements

Storage types

Commodity versus enterprise class storage

Single disk versus RAID

Finding the balance

Network storage

Hadoop networking configuration

How blocks are placed

Rack awareness

The rack-awareness script

Time for action – examining the default rack configuration

What just happened?

Time for action – adding a rack awareness script

What just happened?

What is commodity hardware anyway?

Pop quiz – setting up a cluster

Cluster access control

The Hadoop security model

Time for action – demonstrating the default security

What just happened?

User identity

The super user

More granular access control

Working around the security model via physical access control

Managing the NameNode

Configuring multiple locations for the fsimage class

Time for action – adding an additional fsimage location

What just happened?

Where to write the fsimage copies

Swapping to another NameNode host

Having things ready before disaster strikes

Time for action – swapping to a new NameNode host

What just happened?

Don't celebrate quite yet!

What about MapReduce?

Have a go hero – swapping to a new NameNode host

Managing HDFS

Where to write data

Using balancer

When to rebalance

MapReduce management

Command line job management

Have a go hero – command line job management

Job priorities and scheduling

Time for action – changing job priorities and killing a job

What just happened?

Alternative schedulers

Capacity Scheduler

Fair Scheduler

Enabling alternative schedulers

When to use alternative schedulers

Scaling

Adding capacity to a local Hadoop cluster

Have a go hero – adding a node and running balancer

Adding capacity to an EMR job flow

Expanding a running job flow

Summary

8. A Relational View on Data with Hive

Overview of Hive

Why use Hive?

Thanks, Facebook!

Setting up Hive

Prerequisites

Getting Hive

Time for action – installing Hive

What just happened?

Using Hive

Time for action – creating a table for the UFO data

What just happened?

Time for action – inserting the UFO data

What just happened?

Validating the data

Time for action – validating the table

What just happened?

Time for action – redefining the table with the correct column separator

What just happened?

Hive tables – real or not?

Time for action – creating a table from an existing file

What just happened?

Time for action – performing a join

What just happened?

Have a go hero – improve the join to use regular expressions

Hive and SQL views

Time for action – using views

What just happened?

Handling dirty data in Hive

Have a go hero – do it!

Time for action – exporting query output

What just happened?

Partitioning the table

Time for action – making a partitioned UFO sighting table

What just happened?

Bucketing, clustering, and sorting... oh my!

User-Defined Function

Time for action – adding a new User Defined Function (UDF)

What just happened?

To preprocess or not to preprocess...

Hive versus Pig

What we didn't cover

Hive on Amazon Web Services

Time for action – running UFO analysis on EMR

What just happened?

Using interactive job flows for development

Have a go hero – using an interactive EMR cluster

Integration with other AWS products

Summary

9. Working with Relational Databases

Common data paths

Hadoop as an archive store

Hadoop as a preprocessing step

Hadoop as a data input tool

The serpent eats its own tail

Setting up MySQL

Time for action – installing and setting up MySQL

What just happened?

Did it have to be so hard?

Time for action – configuring MySQL to allow remote connections

What just happened?

Don't do this in production!

Time for action – setting up the employee database

What just happened?

Be careful with data file access rights

Getting data into Hadoop

Using MySQL tools and manual import

Have a go hero – exporting the employee table into HDFS

Accessing the database from the mapper

A better way – introducing Sqoop

Time for action – downloading and configuring Sqoop

What just happened?

Sqoop and Hadoop versions

Sqoop and HDFS

Time for action – exporting data from MySQL to HDFS

What just happened?

Mappers and primary key columns

Other options

Sqoop's architecture

Importing data into Hive using Sqoop

Time for action – exporting data from MySQL into Hive

What just happened?

Time for action – a more selective import

What just happened?

Datatype issues

Time for action – using a type mapping

What just happened?

Time for action – importing data from a raw query

What just happened?

Have a go hero

Sqoop and Hive partitions

Field and line terminators

Getting data out of Hadoop

Writing data from within the reducer

Writing SQL import files from the reducer

A better way – Sqoop again

Time for action – importing data from Hadoop into MySQL

What just happened?

Differences between Sqoop imports and exports

Inserts versus updates

Have a go hero

Sqoop and Hive exports

Time for action – importing Hive data into MySQL

What just happened?

Time for action – fixing the mapping and re-running the export

What just happened?

Other Sqoop features

Incremental merge

Avoiding partial exports

Sqoop as a code generator

AWS considerations

Considering RDS

Summary

10. Data Collection with Flume

A note about AWS

Data data everywhere...

Types of data

Getting network traffic into Hadoop

Time for action – getting web server data into Hadoop

What just happened?

Have a go hero

Getting files into Hadoop

Hidden issues

Keeping network data on the network

Hadoop dependencies

Reliability

Re-creating the wheel

A common framework approach

Introducing Apache Flume

A note on versioning

Time for action – installing and configuring Flume

What just happened?

Using Flume to capture network data

Time for action – capturing network traffic in a log file

What just happened?

Time for action – logging to the console

What just happened?

Writing network data to log files

Time for action – capturing the output of a command to a flat file

What just happened?

Logs versus files

Time for action – capturing a remote file in a local flat file

What just happened?

Sources, sinks, and channels

Sources

Sinks

Channels

Or roll your own

Understanding the Flume configuration files

Have a go hero

It's all about events

Time for action – writing network traffic onto HDFS

What just happened?

Time for action – adding timestamps

What just happened?

To Sqoop or to Flume...

Time for action – multi-level Flume networks

What just happened?

Time for action – writing to multiple sinks

What just happened?

Selectors – replicating and multiplexing

Handling sink failure

Have a go hero – handling sink failure

Next, the world

Have a go hero – next, the world

The bigger picture

Data lifecycle

Staging data

Scheduling

Summary

11. Where to Go Next

What we did and didn't cover in this book

Upcoming Hadoop changes

Alternative distributions

Why alternative distributions?

Bundling

Free and commercial extensions

Cloudera Distribution for Hadoop

Hortonworks Data Platform

MapR

IBM InfoSphere Big Insights

Choosing a distribution

Other Apache projects

HBase

Oozie

Whir

Mahout

MRUnit

Other programming abstractions

Pig

Cascading

AWS resources

HBase on EMR

SimpleDB

DynamoDB

Sources of information

Source code

Mailing lists and forums

LinkedIn groups

HUGs

Conferences

Summary

A. Pop Quiz Answers

Chapter 3, Understanding MapReduce

Pop quiz – key/value pairs

Pop quiz – walking through a run of WordCount

Chapter 7, Keeping Things Running

Pop quiz – setting up a cluster

Index
