Surrounding the Requirements
Requirements
Business Needs
Compliance Requirements
Data Profiling
Security Requirements
Data Integration
Data Latency
Archiving and Lineage
End User Delivery Interfaces
Available Skills
Legacy Licenses
Architecture
ETL Tool Versus Hand Coding (Buy a Tool Suite or Roll Your Own)
The Back Room—Preparing the Data
The Front Room—Data Access
The Mission of the Data Warehouse
What the Data Warehouse Is
What the Data Warehouse Is Not
Industry Terms Not Used Consistently
Resolving Architectural Conflict: A Hybrid Approach
How the Data warehouse Is Changing
The Mission of the ETL Team
ETL Data Structures
To Stage or Not to Stage
Designing the Staging Area
Data Structures in the ETL System
Flat Files
XML Data Sets
Relational Tables
Independent DBMS Working Tables
Third Normal Form Entity/Relation Models
Nonrelational Data Sources
Dimensional Data Models: The Handoff from the Back Room to the Front Room
Fact Tables
Dimension Tables
Atomic and Aggregate Fact Tables
Surrogate Key Mapping Tables
Planning and Design Standards
Impact Analysis
Metadata Capture
Naming Conventions
Auditing Data Transformation Steps
Summary
Extracting
The Logical Data Map
Designing Logical Before Physical
Inside the Logical Data Map
Components of the Logical Data Map
Using Tools for the Logical Data Map
Building the Logical Data Map
Data Discovery Phase
Data Content Analysis
Collecting Business Rules in the ETL Process
Integrating Heterogeneous Data Sources
The Challenge of Extracting from Disparate Platforms
Connecting to Diverse Sources Through ODBC
Mainframe Sources
Working with COBOL Copybooks
EBCDIC Character Set
Converting EBCDIC to ASCII
Tranferring Data Between Platforms
Handling Mainframe Numeric Data
Using PICtures
Unpacking Packed Decimal
Working with Redefined Fields
Multiple OCCURS
Managing Multiple Mainframe Record Type Files
Handling Mainframe Variable Record Lengths
Flat Files
Processing Fixed Length Flat Files
Processing Delimited Flat Files
XML Sources
Character Sets
XML Meta Data
Web Log Sources
W3C Common and Extended Formats
Name Value Pairs in Web Logs
ERP System Sources
Extracting Changed Data
Detecting Changes
Extraction Tips
Detecting Deleted or Overwritten Fact Records at the Source
Summary
Cleaning and Conforming
Defining Data Quality
Assumptions
Part 1: Design Objectives
Understand Your Key Constituencies
Competing Factors
Balancing Conflicting Priorities
Formulate a Policy
Part 2: Cleaning Deliverables
Data Profiling Deliverable
Cleaning Deliverable #1: Error Event Table
Cleaning Deliverable #2: Audit Dimension
Audit Dimension Fine Points
Part 3: Screens and Their Measurements
Anomaly Detection Phase
Types of Enforcement
Column Property Enforcement
Structure Enforcement
Data and Value Rule Enforcement
Measurements Driving Screen Design
Overall Process Flow
The Show Must Go On—Usually
Screens
Known Table Row Counts
Column Nullity
Column Numeric and Date Ranges
Column Length Restriction
Column Explicit Valid Values
Column Explicit Invalid Values
Checking Table Row Count Reasonability
Checking Column Distribution Reasonability
General Data and Value Rule Reasonability
Part 4: Conforming Deliverables
Conformed Dimensions
Designing the Conformed Dimensions
Taking the Pledge
Permissible Variations of Conformed Dimensions
Conformed Facts
The Fact Table Provider
The Dimension Manager: Publishing Conformed Dimensions to Affected Fact Tables
Detailed Delivery Steps for Conformed Dimensions
Implementing the Conforming Modules
Matching Drives Deduplication
Surviving: Final Step of Conforming
Delivering
Summary
Delivering Dimension Tables
The Basic Structure of a Dimension
The Grain of a Dimension
The Basic Load Plan for a Dimension
Flat Dimensions and Snowflaked Dimensions
Date and Time Dimensions
Big Dimensions
Small Dimensions
One Dimension or Two
Dimensional Roles
Dimensions as Subdimensions of Another Dimension
Degenerate Dimensions
Slowly Changing Dimensions
Type 1 Slowly Changing Dimension (Overwrite)
Type 2 Slowly Changing Dimension (Partitioning History)
Precise Time Stamping of a Type 2 Slowly Changing Dimension
Type 3 Slowly Changing Dimension (Alternate Realities)
Hybrid Slowly Changing Dimensions
Late-Arriving Dimension Records and Correcting Bad Data
Multivalued Dimensions and Bridge Tables
Ragged Hierarchies and Bridge Tables
Populating Hierarchy Bridge Tables
Using Positional Attributes in a Dimension to Represent Text Facts
Summary
Delivering Fact Tables
The Basic Structure of a Fact Table
Guaranteeing Referential Integrity
Surrogate Key Pipeline
Using the Dimension Instead of a Lookup Table
Fundamental Grains
Transaction Grain Fact Tables
Periodic Snapshot Fact Tables
Accumulating Snapshot Fact Tables
Preparing for Loading Fact Tables
Managing Indexes
Managing Partions
Outwitting the Rollback Log
Loading the Data
Incremental Loading
Inserting Facts
Updating and Correcting Facts
Negating Facts
Updating Facts
Deleting Facts
Physically Deleting Facts
Logically Deleting Facts
Factless Fact Tables
Augmenting a Type 1 Fact Table with Type 2 History
Graceful Modifications
Multiple Units of Measure in a Fact Table
Collecting Revenue in Multiple Currencies
Late Arriving Facts
Aggregations
Design Requirements #1 Through #4
Administering Aggregations, Including Materialized Views
Delivering Dimensional Data to OLAP Cubes
Cube Data Sources
Processing Dimensions
Changes in Dimension Data
Processing Facts
Integrating OLAP Processing into the ETL System
OLAP Wrapup
Summary
Development
Current Marketplace ETL Tool Suite Offerings
Current Scripting Languages
Time Is of the Essence
Push Me or Pull Me
Ensuring Transfers with Sentinels
Sorting Data During a Preload
Sorting on Mainframe Systems
Sorting on UNIX and Windows Systems
Trimming the Fat (Filtering)
Extracting a Subset of the Source File Records on Mainframe Systems
Extracting a Subset of the Source File Fields
Extracting a Subset of the Source File Records on UNIX and Windows Systems
Extracting a Subset of the Source File Fields
Creating Aggregated Extracts on Mainframe Systems
Creating Aggregated Extracts on UNIX and Windows Systems
Using Database Bulk Loader Utilities to Speed Inserts
Preparing for Bulk Load
Managing Database Features to Improve Performance
The Order of Things
The Effect of Aggregates and Group Bys on Performance
Performance Impact of Using Scalar Functions
Avoiding Triggers
Overcoming ODBC Bottlenecks
Benefiting from Parallel Processing
Troubleshooting Performance Problems
Increasing ETL Throughput
Reducing Input/Output Contention
Eliminating Database Reads/Writes
Filtering as Soon as Possible
Partiioning and Parellelizing
Updating Aggregates Incrementally
Taking Only What You Need
Bulk Loading/Eliminating Logging
Dropping Database Constraints and Indexes
Eliminating Network Traffic
Letting the ETL Engine Do the Work
Summary
Operations
Scheduling and Support
Reliability, Availability, Manageability Analysis for ETL
ETL Scheduling 101
Scheduling Tools
Load Dependencies
Metadata
Migrating to Production
Operational Support for the Data Warehouse
Bundling Version Releases
Supporting the ETL System in Production
Achieving Optimal ETL Performance
Estimating Load Time
Vulnerabilities of Long-Running ETL Processes
Minimizing the Risk of Load Failures
Purging Historic Data
Monitoring the ETL System
Measuring ETL Specific Performance Indicators
Measuring Infrastructure Performance Indicators
Measuring Data Warehouse Usage to Help Manage ETL Processes
Tuning ETL Processes
Explaining Database Overhead
ETL System Security
Securing the Development Environment
Securing the Production Environment
Short-Term Archiving and Recovery
Long-Term Archiving and Recovery
Media, Formats, Software, and Hardware
Obsolete Formats and Archaic Formats
Hard Copy, Standards, and Museums
Refreshing, Migrating, Emulating, and Encapsulating
Summary
Metadata
Defining Metadata
Metadata—What Is It?
Source System Metadata
Data-Staging Metadata
DBMS Metadata
Front Room Metadata
Business Metadata
Business Definitions
Source System Information
Data Warehouse Data Dictionary
Logical Data Maps
Technical Metadata
System Inventory
Data Models
Data Definitions
Business Rules
ETL-Generated Metadata
ETL Job Metadata
Transformation Metadata
Batch Metadata
Data Quality Error Event Metadata
Process Execution Metadata
Metadata Standards and Practices
Establishing Rudimentary Standards
Naming Conventions
Impact Analysis
Summary
Responsibilities
Planning and Leadership
Having Dedicated Leadership
Planning Large, Building Small
Hiring Qualified Developers
Building Teams with Database Expertise
Don’t Try to Save the World
Enforcing Standardization
Monitoring, Auditing, and Publishing Statistics
Maintaining Documentation
Providing and Utilizing Metadata
Keeping It Simple
Optimizing Throughput
Managing the Project
Responsibility of the ETL Team
Defining the Project
Planning the Project
Determining the Tool Set
Staffing Your Project
Project Plan Guidelines
Managing Scope
Summary
Real-Time ETL Systems
Why Real-Time ETL?
Defining Real-Time ETL
Challenges and Opportunities of Real-Time Data Warehousing
Real-Time Data Warehousing Review
Generation 1—The Operational Data Store
Generation 2—The Real-Time Partition
Recent CRM Trends
The Strategic Role of the Dimension Manager
Categorizing the Requirement
Data Freshness and Historical Needs
Reporting Only or Integration, Too?
Just the Facts or Dimension Changes, Too?
Alerts, Continuous Polling, or Nonevents?
Data Integration or Application Integration?
Point-to-Point Versus Hub-and-Spoke
Customer Data Cleanup Considerations
Real-Time ETL Approaches
Microbatch ETL
Enterprise Application Integration
Capture, Transform, and Flow
Enterprise Information Integration
The Real-Time Dimension Manager
Microbatch Processing
Choosing an Approach—A decision Guide
Summary
Conclusions
Deepening the Definition of ETL
The Future of Data Warehousing and ETL in Particular
Ongoing Evolution of ETL Systems