Python for Data Analysis: Data Wrangling with Pandas, NumPy, and IPython

by Wes McKinney

Quick Summary

The second edition of this O'Reilly reference by the creator of pandas provides comprehensive coverage of Python's data analysis ecosystem. McKinney covers IPython/Jupyter, NumPy array operations, pandas DataFrames, data loading and cleaning, data wrangling, plotting with matplotlib, time series analysis, and advanced pandas features. It is the definitive practical guide for using Python to manipulate, process, and analyze structured data.

Detailed Summary

Wes McKinney, the original creator of the pandas library, wrote "Python for Data Analysis" as the authoritative reference for using Python's scientific computing stack for data manipulation. The second edition, updated for Python 3.6 and pandas 0.20+, is structured as a progressive tutorial that also serves as a reference.

The book opens with preliminaries explaining the Python data analysis ecosystem: NumPy for numerical computing, pandas for structured data manipulation, matplotlib for plotting, IPython and Jupyter for interactive development, and SciPy, scikit-learn, and statsmodels for scientific computing and statistical modeling.

Chapter 2 covers Python language basics with particular emphasis on IPython and Jupyter Notebook workflows, including tab completion, introspection, magic commands, and matplotlib integration. Chapter 3 reviews Python's built-in data structures (tuples, lists, dicts, sets), comprehensions, functions, generators, and file I/O.
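As a rough illustration of the built-in features Chapter 3 reviews, the sketch below uses a dict comprehension, a set comprehension, and a generator. The word list and the function name `squares` are hypothetical and not taken from the book.

```python
# Hypothetical sample data for illustration only.
words = ["apple", "bat", "bar", "atom", "book"]

# Dict comprehension: map each word to its length.
lengths = {w: len(w) for w in words}

# Set comprehension: collect the distinct first letters.
first_letters = {w[0] for w in words}


def squares(n):
    """Generator that lazily yields the squares of 0 .. n-1."""
    for i in range(n):
        yield i * i


# The generator is consumed on demand by sum(); no list is built.
total = sum(squares(5))
```

A generator suits this kind of reduction because the values are produced one at a time rather than held in memory all at once.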

Chapters 4 and 5 form the technical core. Chapter 4 provides deep coverage of NumPy: ndarray creation, data types, arithmetic operations, indexing (basic, boolean, and fancy), transposing, universal functions, array-oriented programming, linear algebra, and random number generation. Chapter 5 introduces the pandas Series and DataFrame objects, essential functionality (reindexing, dropping entries, selection, arithmetic, function application, sorting, ranking), and descriptive statistics.
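A minimal sketch of a few of the operations named above, assuming NumPy and pandas are installed. The arrays, labels, and variable names are invented for illustration and are not the book's own examples.

```python
import numpy as np
import pandas as pd

# NumPy: a 3x4 array of the integers 0..11.
arr = np.arange(12).reshape(3, 4)

evens = arr[arr % 2 == 0]   # boolean indexing returns a 1-D array
picked = arr[[2, 0]]        # fancy indexing: rows 2 and 0, in that order
roots = np.sqrt(arr)        # universal function applied elementwise

# pandas: a Series with a string index.
s = pd.Series([4.5, 7.2, -5.3], index=["d", "b", "a"])

# Reindexing aligns to the new labels; the missing label "c" becomes NaN.
reindexed = s.reindex(["a", "b", "c", "d"])

# Ranking assigns 1.0 to the smallest value.
ranked = s.rank()

# DataFrame sorting and descriptive statistics.
frame = pd.DataFrame({"x": [3, 1, 2], "y": [0.5, 0.1, 0.9]})
sorted_frame = frame.sort_values("x")
summary = frame.describe()
```

Boolean and fancy indexing both copy the selected data, whereas basic slicing returns a view onto the original array.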

Chapters 6 and 7 address data loading (CSV, JSON, HTML, HDF5, Excel, databases) and data cleaning and preparation (handling missing data, data transformation, string manipulation, regular expressions). Chapter 8 covers data wrangling: hierarchical indexing, combining and merging datasets, reshaping and pivoting. Chapter 9 covers plotting and visualization with matplotlib and seaborn.
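The loading, cleaning, and wrangling steps listed above might look like the following sketch. The CSV text, column names, and values are hypothetical; an in-memory string stands in for a file on disk.

```python
import io

import pandas as pd

# Loading: read CSV text from an in-memory buffer (stand-in for a file path).
csv_text = "key,value\na,1.0\nb,\nc,3.0\n"
df = pd.read_csv(io.StringIO(csv_text))  # the empty field is parsed as NaN

# Cleaning: either fill missing values or drop the rows that contain them.
filled = df.fillna({"value": 0.0})
dropped = df.dropna()

# String manipulation through the vectorized .str accessor.
names = pd.Series(["  Alice ", "BOB", "carol"])
clean_names = names.str.strip().str.lower()

# Merging: an inner join keeps only keys present in both tables.
left = pd.DataFrame({"key": ["a", "b", "c"], "lval": [1, 2, 3]})
right = pd.DataFrame({"key": ["a", "b", "d"], "rval": [10, 20, 40]})
merged = pd.merge(left, right, on="key", how="inner")

# Reshaping: pivot long-format rows into a wide table.
long_format = pd.DataFrame(
    {
        "date": ["d1", "d1", "d2", "d2"],
        "item": ["x", "y", "x", "y"],
        "value": [1, 2, 3, 4],
    }
)
wide = long_format.pivot(index="date", columns="item", values="value")
```

Whether to fill or drop missing values depends on the analysis; the sketch shows both only to contrast them.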

Chapter 10 covers data aggregation and group operations, built around the split-apply-combine paradigm that is central to real-world data analysis. Chapter 11 addresses time series analysis: date and time types, time series indexing, date ranges and frequencies, shifting, resampling, and moving window functions. Chapter 12 covers advanced pandas features, including categorical data, advanced GroupBy usage, and method chaining.
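To make split-apply-combine and the time series operations concrete, here is a small sketch under assumed data. The keys, dates, and values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Split-apply-combine: split rows by "key", apply mean, combine into a Series.
df = pd.DataFrame({"key": ["a", "b", "a", "b"], "data": [1.0, 2.0, 3.0, 4.0]})
means = df.groupby("key")["data"].mean()

# Time series: six consecutive daily observations with values 0.0 .. 5.0.
idx = pd.date_range("2000-01-01", periods=6, freq="D")
ts = pd.Series(np.arange(6.0), index=idx)

shifted = ts.shift(1)                  # move values forward one period
two_day = ts.resample("2D").sum()      # downsample into two-day bins
rolling = ts.rolling(window=3).mean()  # three-observation moving average

# Categorical data: values are stored as integer codes plus a category list.
cat = pd.Series(["low", "high", "low"]).astype("category")
```

Shifting and rolling windows leave NaN where there is not enough history, which is why the first entries of `shifted` and `rolling` are missing.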

Chapter 13 introduces modeling libraries in Python, showing how pandas data feeds into statsmodels and scikit-learn.

Chapter 14 provides extended data analysis examples, including the US Baby Names dataset, the USDA food database, and the 2012 Federal Election Commission database, showing complete analytical workflows.

Appendix A covers advanced NumPy features: ndarray internals, advanced array manipulation, broadcasting, structured arrays, and memory-mapped files.
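Broadcasting and structured arrays, two of the advanced NumPy topics listed above, can be sketched as follows. The array contents and the field names `x` and `y` are illustrative assumptions.

```python
import numpy as np

arr = np.arange(12.0).reshape(4, 3)

# Broadcasting: the (3,) vector of column means is stretched across all 4 rows.
col_means = arr.mean(axis=0)
demeaned = arr - col_means

# To subtract row means, add a new axis so shapes (4, 3) and (4, 1) line up.
row_means = arr.mean(axis=1)
row_demeaned = arr - row_means[:, np.newaxis]

# Structured array: each element is a record with named, typed fields.
dtype = [("x", np.float64), ("y", np.int32)]
sarr = np.array([(1.5, 6), (np.pi, -2)], dtype=dtype)
y_values = sarr["y"]  # access a whole field by name
```

Without the added axis, subtracting `row_means` from `arr` would raise an error, because shapes (4, 3) and (4,) are not compatible under the broadcasting rules.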

This book is indispensable for anyone doing data analysis in Python. Its authority derives from the fact that the author created pandas, the library at the center of the book. The focus is practical throughout, with real datasets and real analysis patterns.