Author: Jason Brownlee
Data preparation is the transformation of raw data into a form that is more appropriate for modeling.
It is a challenging topic to discuss as the data differs in form, type, and structure from project to project.
Nevertheless, there are common data preparation tasks across projects. It is a huge field of study and goes by many names, such as “data cleaning,” “data wrangling,” “data preprocessing,” “feature engineering,” and more. Some of these are distinct data preparation tasks, and some of the terms are used to describe the entire data preparation process.
Even though it is a challenging topic to discuss, there are a number of books on the topic.
In this post, you will discover the top books on data cleaning, data preparation, feature engineering, and related topics.
Let’s get started.
Overview
The focus here is on data preparation for tabular data, e.g. data in the form of a table with rows and columns, as it looks in an Excel spreadsheet.
Data preparation is an important topic for all data types, although specialty methods are required for each, such as image data in computer vision, text data in natural language processing, and sequence data in time series forecasting.
Data preparation is often a chapter in a machine learning textbook, although there are books dedicated to the topic. We will focus on these books.
I have gathered all the books I can find on the topic of data preparation, selected what I think are the best or better books, and organized them into three groups:
- Data Cleaning
- Data Wrangling
- Feature Engineering
I will try to give the flavor of each book, including the goal, the table of contents, and where to learn more about it.
Data Cleaning
Data cleaning refers to identifying and fixing errors in the data prior to modeling, such as outliers and missing values.
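To make the two examples named above concrete, here is a minimal sketch using only the Python standard library. The data values are made up for illustration, and median imputation and the interquartile range (IQR) rule are just two common choices that I am assuming here; none of the books below prescribes this exact recipe.

```python
from statistics import median, quantiles

# Hypothetical column of raw readings; None marks a missing value.
raw = [12.0, 11.5, None, 12.3, 95.0, 11.9, None, 12.1]

# Impute missing values with the median of the observed values.
observed = [x for x in raw if x is not None]
fill = median(observed)
imputed = [fill if x is None else x for x in raw]

# Flag outliers with the IQR rule: anything more than
# 1.5 * IQR below the first or above the third quartile.
q1, _, q3 = quantiles(observed, n=4)
iqr = q3 - q1
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [x for x in imputed if x < low or x > high]

print("imputed:", imputed)
print("outliers:", outliers)
```

On a real project, you would choose the imputation statistic and the outlier threshold based on the data and the model, and fit them on the training set only.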
The top books on data cleaning include:
- Bad Data Handbook: Cleaning Up The Data So You Can Get Back To Work, 2012.
- Best Practices in Data Cleaning: A Complete Guide to Everything You Need to Do Before and After Collecting Your Data, 2012.
- Data Cleaning, 2019.
Let’s take a closer look at each in turn.
“Bad Data Handbook”
The book “Bad Data Handbook: Cleaning Up The Data So You Can Get Back To Work” was edited by Q. Ethan McCallum and was published in 2012.
Bad data is described not only as corrupt data but any data that impairs the modeling process.
It’s tough to nail down a precise definition of “Bad Data.” Some people consider it a purely hands-on, technical phenomenon: missing values, malformed records, and cranky file formats. Sure, that’s part of the picture, but Bad Data is so much more. […] Bad Data is data that gets in the way.
— Page 1, “Bad Data Handbook: Cleaning Up The Data So You Can Get Back To Work,” 2012.
It is a collection of essays by 19 machine learning practitioners and is full of useful nuggets on data preparation and management.
The complete table of contents for the book is listed below.
- Chapter 01: Setting the Pace: What Is Bad Data?
- Chapter 02: Is It Just Me, or Does This Data Smell Funny?
- Chapter 03: Data Intended for Human Consumption, Not Machine Consumption
- Chapter 04: Bad Data Lurking in Plain Text
- Chapter 05: (Re)Organizing the Web’s Data
- Chapter 06: Detecting Liars and the Confused in Contradictory Online Reviews
- Chapter 07: Will the Bad Data Please Stand Up?
- Chapter 08: Blood, Sweat, and Urine
- Chapter 09: When Data and Reality Don’t Match
- Chapter 10: Subtle Sources of Bias and Error
- Chapter 11: Don’t Let the Perfect Be the Enemy of the Good: Is Bad Data Really Bad?
- Chapter 12: When Databases Attack: A Guide for When to Stick to Files
- Chapter 13: Crouching Table, Hidden Network
- Chapter 14: Myths of Cloud Computing
- Chapter 15: The Dark Side of Data Science
- Chapter 16: How to Feed and Care for Your Machine-Learning Expert
- Chapter 17: Data Traceability
- Chapter 18: Social Media: Erasable Ink?
- Chapter 19: Data Quality Analysis Demystified: Knowing When Your Data Is Good Enough
I like this book a lot; it is full of valuable practical advice. I highly recommend it!
“Best Practices in Data Cleaning”
The book “Best Practices in Data Cleaning: A Complete Guide to Everything You Need to Do Before and After Collecting Your Data” was written by Jason Osborne and was published in 2012.
This is a more general textbook on data preparation for computational-based social sciences rather than machine learning specifically. Nevertheless, it contains a ton of useful advice.
My goal in writing this book is to collect, in one place, a systematic overview of what I consider to be best practices in data cleaning—things I can demonstrate as making a difference in your data analyses. I seek to change the status quo, the current state of affairs in quantitative research in the social sciences (and beyond).
— Page 2, “Best Practices in Data Cleaning: A Complete Guide to Everything You Need to Do Before and After Collecting Your Data,” 2012.
The complete table of contents for the book is listed below.
- Chapter 01: Why Data Cleaning Is Important: Debunking the Myth of Robustness
- Chapter 02: Power and Planning for Data Collection: Debunking the Myth of Adequate Power
- Chapter 03: Being True to the Target Population: Debunking the Myth of Representativeness
- Chapter 04: Using Large Data Sets With Probability Sampling Frameworks: Debunking the Myth of Equality
- Chapter 05: Screening Your Data for Potential Problems: Debunking the Myth of Perfect Data
- Chapter 06: Dealing With Missing or Incomplete Data: Debunking the Myth of Emptiness
- Chapter 07: Extreme and Influential Data Points: Debunking the Myth of Equality
- Chapter 08: Improving the Normality of Variables Through Box-Cox Transformation: Debunking the Myth of Distributional Irrelevance
- Chapter 09: Does Reliability Matter? Debunking the Myth of Perfect Measurement
- Chapter 10: Random Responding, Motivated Misresponding, and Response Sets: Debunking the Myth of the Motivated Participant
- Chapter 11: Why Dichotomizing Continuous Variables Is Rarely a Good Practice: Debunking the Myth of Categorization
- Chapter 12: The Special Challenge of Cleaning Repeated Measures Data: Lots of Pits in Which to Fall
- Chapter 13: Now That the Myths Are Debunked…: Visions of Rational Quantitative Methodology for the 21st Century
I think this is a great reference guide for general data preparation techniques, perhaps with better coverage than most “machine learning” focused books, given the stronger statistical focus.
“Data Cleaning”
The book “Data Cleaning” was written by Ihab Ilyas and Xu Chu, and published in 2019.
As the name suggests, the book is focused on data cleaning techniques that fix errors in raw data prior to modeling.
Data cleaning is used to refer to all kinds of tasks and activities to detect and repair errors in the data. Rather than focus on a particular data cleaning task, in this book, we give an overview of the end-to-end data cleaning process, describing various error detection and repair methods, and attempt to anchor these proposals with multiple taxonomies and views.
— Page ixx, “Data Cleaning,” 2019.
The complete table of contents for the book is listed below.
- Chapter 01: Introduction
- Chapter 02: Outlier Detection
- Chapter 03: Data Deduplication
- Chapter 04: Data Transformation
- Chapter 05: Data Quality Rule Definition and Discovery
- Chapter 06: Rule-Based Data Cleaning
- Chapter 07: Machine Learning and Probabilistic Data Cleaning
- Chapter 08: Conclusion and Future Thoughts
It is more of a textbook than a practical book and is a good fit for academics and researchers looking for both a review of methods and references to the original research papers.
Data Wrangling
Data wrangling is a more general or colloquial term for data preparation that might include some data cleaning and feature engineering.
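As a rough illustration of what wrangling can look like in practice, the sketch below turns a small, messy CSV export into clean records using only the Python standard library. The sample data, the column names, and the `wrangle()` helper are all hypothetical and chosen for the example.

```python
import csv
import io

# Hypothetical messy export: padded headers, stray whitespace,
# a thousands separator, inconsistent case, and a blank line.
messy = """ Customer Name , Order Total ,Country
 alice ," 1,200.50 ",us

Bob,300,  UK
"""

def wrangle(text):
    """Parse messy CSV text into a list of cleaned dict records."""
    reader = csv.reader(io.StringIO(text), skipinitialspace=True)
    # Drop rows that are empty or contain only whitespace.
    rows = [r for r in reader if any(cell.strip() for cell in r)]
    # Normalize headers to snake_case.
    header = [h.strip().lower().replace(" ", "_") for h in rows[0]]
    clean = []
    for r in rows[1:]:
        rec = dict(zip(header, (cell.strip() for cell in r)))
        rec["customer_name"] = rec["customer_name"].title()
        rec["order_total"] = float(rec["order_total"].replace(",", ""))
        rec["country"] = rec["country"].upper()
        clean.append(rec)
    return clean

records = wrangle(messy)
for rec in records:
    print(rec)
```

The same steps (normalize names, trim whitespace, convert types, standardize categories) carry over directly to Pandas or R once the data no longer fits comfortably in a script like this.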
The top books on data wrangling include:
- Data Wrangling with Python: Tips and Tools to Make Your Life Easier, 2016.
- Principles of Data Wrangling: Practical Techniques for Data Preparation, 2017.
- Data Wrangling with R, 2016.
Let’s take a closer look at each in turn.
“Data Wrangling with Python”
The book “Data Wrangling with Python: Tips and Tools to Make Your Life Easier” was written by Jacqueline Kazil and Katharine Jarmul and was published in 2016.
The focus of this book is the tools and methods that help you get raw data into a form ready for modeling.
Data wrangling is about taking a messy or unrefined source of data and turning it into something useful.
— Page xii, “Data Wrangling with Python: Tips and Tools to Make Your Life Easier,” 2016.
This is a beginner’s book for those making their first steps into Python for data preparation and modeling, e.g. current Excel users.
This book is for folks who want to explore data wrangling beyond desktop tools. If you are great at Excel and want to take your data analysis to the next level, this book will help!
— Page xii, “Data Wrangling with Python: Tips and Tools to Make Your Life Easier,” 2016.
The complete table of contents for the book is listed below.
- Chapter 01: Introduction to Python
- Chapter 02: Python Basics
- Chapter 03: Data Meant to Be Read by Machines
- Chapter 04: Working with Excel Files
- Chapter 05: PDFs and Problem Solving in Python
- Chapter 06: Acquiring and Storing Data
- Chapter 07: Data Cleanup: Investigation, Matching, and Formatting
- Chapter 08: Data Cleanup: Standardizing and Scripting
- Chapter 09: Data Exploration and Analysis
- Chapter 10: Presenting Your Data
- Chapter 11: Web Scraping: Acquiring and Storing Data from the Web
- Chapter 12: Advanced Web Scraping: Screen Scrapers and Spiders
- Chapter 13: APIs
- Chapter 14: Automation and Scaling
- Chapter 15: Conclusion
This is the book to get if you are just starting out with Python for data loading and organization.
“Principles of Data Wrangling”
The book “Principles of Data Wrangling: Practical Techniques for Data Preparation” was written by Tye Rattenbury, et al. and was published in 2017.
Data wrangling is used to describe all of the tasks related to getting data ready for modeling.
The phrase data wrangling, born in the modern context of agile analytics, is meant to describe the lion’s share of the time people spend working with data.
— Page ix, “Principles of Data Wrangling: Practical Techniques for Data Preparation,” 2017.
The complete table of contents for the book is listed below.
- Chapter 01: Introduction
- Chapter 02: A Data Workflow Framework
- Chapter 03: The Dynamics of Data Wrangling
- Chapter 04: Profiling
- Chapter 05: Transformation: Structuring
- Chapter 06: Transformation: Enriching
- Chapter 07: Using Transformation to Clean Data
- Chapter 08: Roles and Responsibilities
- Chapter 09: Data Wrangling Tools
It’s a good book, but very high level. Perhaps it is better suited to the manager than the practitioner. For example, I don’t think I saw a single line of code.
“Data Wrangling with R”
The book “Data Wrangling with R” was written by Bradley Boehmke and was published in 2016.
As its name suggests, this book is focused on data preparation with R.
In this book, I will help you learn the essentials of preprocessing data leveraging the R programming language to easily and quickly turn noisy data into usable pieces of information.
— Page v, “Data Wrangling with R,” 2016.
This is a practical book. It has lots of small, focused chapters with code examples on specific problems you will encounter during data preparation. It’s a welcome change compared to many of the other high-level books in this round-up.
The complete table of contents for the book is listed below.
- Chapter 01: The Role of Data Wrangling
- Chapter 02: Introduction to R
- Chapter 03: The Basics
- Chapter 04: Dealing with Numbers
- Chapter 05: Dealing with Character Strings
- Chapter 06: Dealing with Regular Expressions
- Chapter 07: Dealing with Factors
- Chapter 08: Dealing with Dates
- Chapter 09: Data Structure Basics
- Chapter 10: Managing Vectors
- Chapter 11: Managing Lists
- Chapter 12: Managing Matrices
- Chapter 13: Managing Data Frames
- Chapter 14: Dealing with Missing Values
- Chapter 15: Importing Data
- Chapter 16: Scraping Data
- Chapter 17: Exporting Data
- Chapter 18: Functions
- Chapter 19: Loop Control Statements
- Chapter 20: Simplify Your Code with %>%
- Chapter 21: Reshaping Your Data with tidyr
- Chapter 22: Transforming Your Data with dplyr
I’m a fan of this book, and if you are using R, you need a copy. A downside is that there is a little too much of the R basics in this book. I would rather these be left out and the reader directed to an introductory R book, raising the requirements on the reader slightly.
Feature Engineering
Feature engineering refers to creating new input variables from raw data, although it also refers to data preparation more generally.
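For example, new input variables can often be derived from fields that a model cannot use directly. The sketch below is a minimal illustration using only the Python standard library; the record fields (`signup`, `visits`, `purchases`) and the `engineer()` helper are hypothetical, and the chosen transforms are just a few common options.

```python
from datetime import date
from math import log1p

# Hypothetical raw records; the field names are illustrative only.
raw = [
    {"signup": "2020-03-14", "visits": 40, "purchases": 4},
    {"signup": "2020-03-16", "visits": 0, "purchases": 0},
]

def engineer(rec):
    """Derive numeric features from one raw record."""
    d = date.fromisoformat(rec["signup"])
    visits = rec["visits"]
    return {
        # Calendar features from the raw date string (0 = Monday).
        "signup_weekday": d.weekday(),
        "signup_is_weekend": int(d.weekday() >= 5),
        # Log transform to compress a skewed count; log1p handles zero.
        "log_visits": log1p(visits),
        # Ratio feature, guarding against division by zero.
        "purchase_rate": rec["purchases"] / visits if visits else 0.0,
    }

features = [engineer(r) for r in raw]
for f in features:
    print(f)
```

Whether any of these derived features actually helps is an empirical question that you would answer by evaluating models with and without them.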
Top books on feature engineering include:
- Feature Engineering and Selection: A Practical Approach for Predictive Models, 2019.
- Feature Engineering for Machine Learning: Principles and Techniques for Data Scientists, 2018.
Let’s take a closer look at each in turn.
“Feature Engineering and Selection”
The book “Feature Engineering and Selection: A Practical Approach for Predictive Models” was written by Max Kuhn and Kjell Johnson and was published in 2019.
This book describes the general process of preparing raw data for modeling as feature engineering.
Adjusting and reworking the predictors to enable models to better uncover predictor-response relationships has been termed feature engineering.
— Page xi, “Feature Engineering and Selection: A Practical Approach for Predictive Models,” 2019.
The examples in the book are demonstrated using R, which is important, as the author Max Kuhn is also the creator of the popular caret package.
An important perspective taken in the book is that data preparation is not just about meeting the expectations of modeling algorithms; it is required to best expose the underlying structure of the problem, requiring iterative trial and error. This is the same perspective that I take in general and it’s refreshing to see in a modern book.
… we often do not know the best re-representation of the predictors to improve model performance. Instead, the re-working of predictors is more of an art, requiring the right tools and experience to find better predictor representations. Moreover, we may need to search many alternative predictor representations to improve model performance.
— Page xii, “Feature Engineering and Selection: A Practical Approach for Predictive Models,” 2019.
The complete table of contents for the book is listed below.
- Chapter 1. Introduction
- Chapter 2. Illustrative Example: Predicting Risk of Ischemic Stroke
- Chapter 3. A Review of the Predictive Modeling Process
- Chapter 4. Exploratory Visualizations
- Chapter 5. Encoding Categorical Predictors
- Chapter 6. Engineering Numeric Predictors
- Chapter 7. Detecting Interaction Effects
- Chapter 8. Handling Missing Data
- Chapter 9. Working with Profile Data
- Chapter 10. Feature Selection Overview
- Chapter 11. Greedy Search Methods
- Chapter 12. Global Search Methods
I think this is a must-own book, even if R is not your primary language. The breadth of the methods discussed is worth the sticker price alone.
“Feature Engineering for Machine Learning”
The book “Feature Engineering for Machine Learning: Principles and Techniques for Data Scientists” was written by Alice Zheng and Amanda Casari and was published in 2018.
I think this book has the most direct definitions up front of all of the books I looked at, describing a feature as a numerical input to a model and feature engineering as getting useful numerical features from the raw data. Very crisp!
A feature is a numeric representation of an aspect of raw data. Features sit between data and models in the machine learning pipeline. Feature engineering is the act of extracting features from raw data and transforming them into formats that are suitable for the machine learning model.
— Page vii, “Feature Engineering for Machine Learning: Principles and Techniques for Data Scientists,” 2018.
The examples are in Python and focus on using NumPy and Pandas, and there are lots of worked examples, which are great. I think this is a good sister book or Python equivalent to the above “Data Wrangling with R” or “Feature Engineering and Selection,” although perhaps with less coverage.
The complete table of contents for the book is listed below.
- Chapter 1: Machine Learning Pipeline
- Chapter 2: Fancy Tricks with Simple Numbers
- Chapter 3: Text Data: Flattening, Filtering, and Chunking
- Chapter 4: The Effects of Feature Scaling: From Bag-of-Words to Tf-Idf
- Chapter 5: Categorical Variables: Counting Eggs in the Age of Robotic Chickens
- Chapter 6: Dimensionality Reduction: Squashing the Data Pancake with PCA
- Chapter 7: Nonlinear Featurization via K-Means Model Stacking
- Chapter 8: Automating the Featurizer: Image Feature Extraction and Deep Learning
- Chapter 9: Back to the Future: Building an Academic Paper Recommender
- Appendix A: Linear Modeling and Linear Algebra Basics
I like the book.
I guess I would prefer to drop the math and direct the reader to a textbook. I would also prefer the examples to focus on the machine learning modeling pipeline rather than standalone transforms. But I’m being picky and pushing hard for directly useful code on a given project.
Recommendations
You have to pick the book that is right for you, based on your needs, e.g. code or textbook, Python or R.
I own all of these books, but the two I recommend are:
- Feature Engineering and Selection: A Practical Approach for Predictive Models, 2019.
- Feature Engineering for Machine Learning: Principles and Techniques for Data Scientists, 2018.
The reason is I like practical books and I like the R and Python perspectives when I’m figuring out what to try.
A close follow-up would be:
- Data Wrangling with R, 2016.
- Bad Data Handbook: Cleaning Up The Data So You Can Get Back To Work, 2012.
The first is super practical; the second is full of super helpful (yet super specific) advice.
For textbooks, needed for their references by most researchers, I’d probably recommend:
Summary
In this post, you discovered the top books on data cleaning, data preparation, feature engineering and related topics.
Did I miss a good book on data preparation?
Let me know in the comments below.
Have you read any of the books listed?
Let me know what you think of it in the comments.
The post 8 Top Books on Data Cleaning and Feature Engineering appeared first on Machine Learning Mastery.