Data Cleaning
Data cleaning is the process of modifying data using operations such as:
selection
correction
alteration
rearrangement
regrouping
removal
for purposes such as:
consistency with other data
consistency with an intended use
elimination of incorrect data
elimination of unwanted data
Benefits of Data Cleaning using SQL Code over Python Pandas Code
Handling Very Large Datasets
The size of dataset Python Pandas can handle is limited by available memory (RAM), since Pandas loads data into memory. SQL database engines operate on data stored on disk and are designed to handle datasets far larger than memory.
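When a dataset does exceed memory, Pandas can still process it piecewise. The sketch below, with a hypothetical file name and hypothetical cleaning steps, reads a CSV in fixed-size chunks so only one chunk is in memory at a time:

```python
# Sketch: cleaning a CSV too large for memory, one chunk at a time.
# The file path and the cleaning steps are illustrative assumptions.
import pandas as pd

def cleaned_row_count(csv_path):
    """Count the rows that survive cleaning, processed chunk by chunk."""
    total = 0
    # chunksize makes read_csv yield DataFrames of up to 100_000 rows each,
    # so memory use stays bounded regardless of the file's total size.
    for chunk in pd.read_csv(csv_path, chunksize=100_000):
        chunk = chunk.dropna()            # remove rows with missing values
        chunk = chunk.drop_duplicates()   # remove exact duplicate rows
        total += len(chunk)
    return total
```

Note that per-chunk deduplication only removes duplicates within a chunk; global deduplication of out-of-memory data is one of the cases where a SQL engine is simpler.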
Ease of Handling Simple Query Logic
If only simple data cleaning logic is being applied, SQL code can be simpler than Python Pandas code.
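As an illustration, the sketch below applies the same simple rule (keep rows with a positive price, drop exact duplicates) both as a single declarative SQL statement and as Pandas code, using an in-memory SQLite database; the table and column names are hypothetical:

```python
# Sketch: one simple cleaning rule expressed in SQL and in Pandas.
# The sales table and its columns are illustrative assumptions.
import sqlite3
import pandas as pd

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (item TEXT, price REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("apple", 1.0), ("apple", 1.0), ("pear", -2.0)])

# SQL version: one declarative statement.
sql_rows = conn.execute(
    "SELECT DISTINCT item, price FROM sales WHERE price > 0").fetchall()

# Pandas version of the same rule.
df = pd.read_sql_query("SELECT * FROM sales", conn)
pandas_rows = df[df["price"] > 0].drop_duplicates()

print(sql_rows)          # [('apple', 1.0)]
print(len(pandas_rows))  # 1
```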
Benefits of Data Cleaning using Python Pandas Code over SQL Code
In general, simple data cleaning operations might be more easily handled using SQL code. As data cleaning complexity grows, the advantages of using Python Pandas become significant.
Logic Implementation
Python Pandas has significantly more capabilities and flexibility in implementing data cleaning logic than SQL code. As SQL code logic grows in complexity, it becomes unwieldy and difficult to debug and maintain.
Code Reflection Capability
Python Pandas code has the capability for reflection, that is, the ability to inspect and modify its own behavior at run time.
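A minimal sketch of this idea: selecting DataFrame cleaning methods by name at run time with getattr. The pipeline list here is a hypothetical configuration that could, for example, be loaded from a file:

```python
# Sketch: choosing cleaning steps at run time via reflection.
# The pipeline contents are an illustrative assumption.
import pandas as pd

df = pd.DataFrame({"x": [1.0, 1.0, None, 4.0]})

# Each entry names a DataFrame method and its keyword arguments.
pipeline = [("fillna", {"value": 0.0}),
            ("drop_duplicates", {})]

for method_name, kwargs in pipeline:
    # getattr looks the method up by its string name at run time.
    df = getattr(df, method_name)(**kwargs)

print(df["x"].tolist())  # → [1.0, 0.0, 4.0]
```

SQL has no comparable mechanism; changing the cleaning steps means generating new SQL text.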
Programming Constructs
The Python language has a very large number of programming constructs compared to SQL.
Code Modification and Testing
Code logic can be compartmentalized and documented for relatively easy modification and testing compared to SQL.
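For example, each cleaning step can live in its own documented function with a unit-style check beside it; the function and column names below are hypothetical:

```python
# Sketch: a compartmentalized cleaning step with its own test.
import pandas as pd

def remove_outliers(df, column, lower, upper):
    """Return only the rows whose column value lies within [lower, upper]."""
    return df[df[column].between(lower, upper)]

# A unit-style check documents and verifies the step in isolation.
sample = pd.DataFrame({"age": [25, 40, 130, -3]})
cleaned = remove_outliers(sample, "age", 0, 120)
assert cleaned["age"].tolist() == [25, 40]
```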
Python Pandas DataFrame Functions
Using Python Pandas is one effective way to clean data. It provides a large set of data manipulation functions.
Below are select Pandas DataFrame functions useful for data cleaning:
abs( ) - returns the absolute numeric value of each element
add( ) - adds additional data
add_prefix( ) - adds a string prefix
add_suffix( ) - adds a string suffix
aggregate( ) - aggregates data
align( ) - aligns two objects with a specified join method
all( ) - returns whether all elements are True
any( ) - returns whether any element is True
append( ) - appends rows of other to the end of caller, returning a new object
apply( ) - applies a function along an axis (columns or rows) of the DataFrame
applymap( ) - applies a function to DataFrame elements
asfreq( ) - converts a time series to a specified frequency
assign( ) - assigns new columns to a DataFrame
astype( ) - casts an object to a specified data type
at_time( ) - selects values at a particular time of day
between_time( ) - selects values between particular times of the day
bool( ) - returns the boolean of a single element
clip( ) - trims values at a given threshold
columns( ) - the column names of the DataFrame
combine( ) - performs a column-wise combine with another DataFrame
combine_first( ) - updates null elements with the value in the same location in other
copy( ) - makes a copy of an object's indices and data
count( ) - counts non-NA cells
cummax( ) - returns a cumulative maximum over a DataFrame or Series axis
cummin( ) - returns a cumulative minimum over a DataFrame or Series axis
cumprod( ) - returns a cumulative product over a DataFrame or Series axis
cumsum( ) - returns a cumulative sum over a DataFrame or Series axis
describe( ) - generates descriptive statistics
diff( ) - first discrete difference of element
divide( ) - gets floating division of DataFrame and other elements
drop( ) - drops specified labels from rows or columns
drop_duplicates( ) - returns DataFrame with duplicate rows removed
droplevel( ) - returns DataFrame with requested index / column level(s) removed
dropna( ) - removes missing values
duplicated( ) - returns boolean Series denoting duplicate rows
eq( ) - gets equal to of DataFrame and other, element-wise
equals( ) - tests whether two objects contain the same elements
explode( ) - transforms each element of a list-like to a row, replicating index values
fillna( ) - fills NA/NaN values using the specified method
filter( ) - subsets the DataFrame rows or columns according to the specified index labels
first_valid_index( ) - returns index for first non-NA/null value
from_dict( ) - constructs a DataFrame from a dictionary
from_records( ) - converts records to a DataFrame
get( ) - gets an item from object for a given key
groupby( ) - groups DataFrame using a mapper or by a series of columns
gt( ) - gets greater than of a DataFrame
head( ) - returns the first n rows
idxmax( ) - returns the index of first occurrence of maximum over requested axis
idxmin( ) - returns the index of first occurrence of minimum over requested axis
infer_objects( ) - attempts to infer better types for object columns
insert( ) - inserts a column into DataFrame at a specified location
interpolate( ) - interpolates values according to specified methods
isin( ) - determines whether each element in the DataFrame is contained in specified values
isna( ) - detects missing values
isnull( ) - detects null values
items( ) - iterates over (column name, Series) pairs
iteritems( ) - iterates over (column name, Series) pairs
iterrows( ) - iterates over DataFrame rows as (index, Series) pairs
itertuples( ) - iterates over DataFrame rows as named tuples
join( ) - joins columns
last( ) - subsets final periods of time series data based on a date offset
last_valid_index( ) - returns the index for last non-NA/null value
le( ) - gets less than or equal to of DataFrame
loc( ) - selects a range of rows
lookup( ) - label-based indexing function for DataFrames
lt( ) - gets less than of DataFrame
mad( ) - returns the mean absolute deviation of values
mask( ) - replaces values where a specified condition is True
max( ) - returns the maximum of the values for specified data
mean( ) - returns the mean of the values for specified data
median( ) - returns the median of the values for specified data
memory_usage( ) - returns the memory usage of each column in bytes
merge( ) - merges DataFrame or named Series objects with a database-style join
min( ) - returns the minimum of the values for specified data
mode( ) - gets the mode(s) of each element for specified data
multiply( ) - gets multiplication of DataFrame and other data
ne( ) - gets not equal to of DataFrame and other data
nlargest( ) - returns the first n rows ordered by columns in descending order
notna( ) - detects existing (non-missing) values
notnull( ) - detects existing (non-missing) values
nsmallest( ) - returns the first n rows ordered by columns in ascending order
nunique( ) - counts distinct observations over specified data
pct_change( ) - percentage change between the current and a prior element
pivot_table( ) - creates a spreadsheet-style pivot table as a DataFrame
pop( ) - returns item and drops it from DataFrame
prod( ) - returns the product of the values for the specified data
product( ) - returns the product of the values for the specified data
query( ) - queries the columns of a DataFrame with a boolean expression
radd( ) - gets addition of DataFrame and specified data
read_gbq( ) - reads Google BigQuery data into a DataFrame
read_json( ) - reads JSON data into a DataFrame
read_pickle( ) - reads pickled file data into a DataFrame
read_sql_query( ) - reads an SQL DB table into a DataFrame using an SQL query
read_sql_table( ) - reads an SQL DB table into a DataFrame
rdiv( ) - gets floating point division of DataFrame and other specified data
reindex( ) - conforms DataFrame to a new index with optional filling logic
reindex_like( ) - returns an object with matching indices as another object
rename( ) - alters axes labels
rename_axis( ) - sets the name of the axis for the index or columns
replace( ) - replaces values with new values in a DataFrame
resample( ) - resamples time-series data
reset_index( ) - resets an index
rfloordiv( ) - gets integer division of a DataFrame and specified data
rmod( ) - gets modulo of a DataFrame and specified data
rmul( ) - gets multiplication of a DataFrame and specified data
round( ) - rounds a DataFrame to a variable number of decimal places
rsub( ) - gets subtraction of a DataFrame and specified data
rtruediv( ) - gets floating point division of a DataFrame and specified data
sample( ) - returns a random sample of selected DataFrame elements
select_dtypes( ) - returns a subset of the DataFrame's columns based on column types
sem( ) - returns the unbiased standard error of the mean over specified DataFrame data
set_axis( ) - assigns a desired index to a given DataFrame axis
set_index( ) - sets a DataFrame index using existing columns
shift( ) - shifts an index by a specified number of periods
size( ) - returns an int representing the number of elements in the object
skew( ) - returns an unbiased skew over a requested DataFrame axis
sort_index( ) - sorts by labels along a DataFrame axis
sort_values( ) - sorts by values along a DataFrame axis
squeeze( ) - squeezes 1-dimensional axis objects into scalars
std( ) - returns sample standard deviation over a specified DataFrame axis
sub( ) - gets subtraction of a DataFrame and specified data
subtract( ) - gets subtraction of a DataFrame and element-wise data (binary operator sub)
sum( ) - returns the sum of the values for the specified axis
tail( ) - returns the last n rows of data
take( ) - returns the elements in the given positional indices along an axis
to_clipboard( ) - copies objects to the system clipboard
to_csv( ) - writes object to a comma-separated values (csv) file
to_datetime( ) - converts fields to datetime based on supplied specifications
to_dict( ) - converts a DataFrame to a dictionary
to_excel( ) - writes objects to an Excel spreadsheet
to_gbq( ) - writes a DataFrame to a Google BigQuery table
to_hdf( ) - writes data to an HDF5 file using HDFStore
to_html( ) - renders a DataFrame as an HTML table
to_json( ) - converts an object to a JSON string
to_latex( ) - renders an object to a LaTeX object
to_markdown( ) - prints DataFrame in Markdown-friendly format
to_numpy( ) - converts a DataFrame to a NumPy array
to_parquet( ) - writes a DataFrame to the binary parquet format
to_period( ) - converts a DataFrame from DatetimeIndex to PeriodIndex
to_pickle( ) - writes an object to a Pickle (serialized) object file
to_records( ) - converts a DataFrame to a NumPy record array
to_sql( ) - writes records stored in a DataFrame to a SQL database
to_stata( ) - exports a DataFrame object to Stata (statistics) dta format
to_string( ) - renders a DataFrame to a console-friendly tabular output
to_timestamp( ) - casts to DatetimeIndex of timestamps, at beginning of period
to_xarray( ) - returns an xarray (multi-dimensional) object from the pandas object
transform( ) - produces a DataFrame using a transform function
transpose( ) - transposes index and columns
truediv( ) - gets floating point division of a DataFrame and specified element-wise data
truncate( ) - truncates a Series or DataFrame before and after a specified index value
tshift( ) - shifts a time index
tz_convert( ) - converts a tz-aware axis to a target time zone
tz_localize( ) - localizes a tz-naive index of a Series or DataFrame to a target time zone
update( ) - modifies in place using non-NA values from another DataFrame
value_counts( ) - returns a Series containing counts of unique rows in the DataFrame
var( ) - returns unbiased variance over the specified axis
where( ) - replaces values where the condition is False
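To show how a few of the listed functions combine in practice, here is a minimal sketch on toy data (the column name and values are illustrative assumptions):

```python
# Sketch: chaining rename( ), fillna( ), astype( ), and drop_duplicates( ).
import pandas as pd

df = pd.DataFrame({"Qty ": ["1", "2", "2", None]})

df = (df.rename(columns={"Qty ": "qty"})    # rename( ): fix an axis label
        .fillna("0")                        # fillna( ): fill missing values
        .astype({"qty": int})               # astype( ): cast to a data type
        .drop_duplicates())                 # drop_duplicates( ): dedupe rows

print(df["qty"].tolist())  # → [1, 2, 0]
```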
Python Example using Pandas
""" data_cleaning_using_pandas.py provides examples of using a Python Pandas DataFrame and its methods """ # Import needed libraries. import pandas as pd import numpy as np from sklearn.datasets import load_iris import pprint # Set parameters. selected_row_count = 20 # Load test data. iris_data = load_iris() # Create a DataFrame from the test data. data_frame = pd.DataFrame(iris_data.data, columns=iris_data.feature_names) data_frame_subset = data_frame.head(selected_row_count) # Print the data. print("Data Subset of First " + str(selected_row_count) + " Rows") print(data_frame_subset.to_string()) # Get a count of non-NA cells. non_na_cell_count = data_frame_subset.count() # Print the count. print("Non Empty Cell count:") print(non_na_cell_count) # Get the mean of cell rows. mean_of_rows = data_frame_subset.mean(axis=1) # Print the means. print("Means of Rows:") print(mean_of_rows) # Get the mean of cell columns. mean_of_columns = data_frame_subset.mean(axis=0) # Print the means. print("Means of Columns:") print(mean_of_columns) # Round values to 0 decimal places. data_frame_with_rounded_values = data_frame_subset.round(decimals=0) # Print the DataFrame. print("Data Subset with Rounded Values:") print(data_frame_with_rounded_values.to_string()) # Drop duplicate rows. data_frame_rounded_without_duplicates = data_frame_with_rounded_values.drop_duplicates() # Print the means. print("Data Subset with Rounded Values without Duplicates:") print(data_frame_rounded_without_duplicates.to_string())
Output is below:
Data Subset of First 20 Rows
    sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)
0                 5.1               3.5                1.4               0.2
1                 4.9               3.0                1.4               0.2
2                 4.7               3.2                1.3               0.2
3                 4.6               3.1                1.5               0.2
4                 5.0               3.6                1.4               0.2
5                 5.4               3.9                1.7               0.4
6                 4.6               3.4                1.4               0.3
7                 5.0               3.4                1.5               0.2
8                 4.4               2.9                1.4               0.2
9                 4.9               3.1                1.5               0.1
10                5.4               3.7                1.5               0.2
11                4.8               3.4                1.6               0.2
12                4.8               3.0                1.4               0.1
13                4.3               3.0                1.1               0.1
14                5.8               4.0                1.2               0.2
15                5.7               4.4                1.5               0.4
16                5.4               3.9                1.3               0.4
17                5.1               3.5                1.4               0.3
18                5.7               3.8                1.7               0.3
19                5.1               3.8                1.5               0.3
Non Empty Cell count:
sepal length (cm)    20
sepal width (cm)     20
petal length (cm)    20
petal width (cm)     20
dtype: int64
Means of Rows:
0     2.550
1     2.375
2     2.350
3     2.350
4     2.550
5     2.850
6     2.425
7     2.525
8     2.225
9     2.400
10    2.700
11    2.500
12    2.325
13    2.125
14    2.800
15    3.000
16    2.750
17    2.575
18    2.875
19    2.675
dtype: float64
Means of Columns:
sepal length (cm)    5.035
sepal width (cm)     3.480
petal length (cm)    1.435
petal width (cm)     0.235
dtype: float64
Data Subset with Rounded Values:
    sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)
0                 5.0               4.0                1.0               0.0
1                 5.0               3.0                1.0               0.0
2                 5.0               3.0                1.0               0.0
3                 5.0               3.0                2.0               0.0
4                 5.0               4.0                1.0               0.0
5                 5.0               4.0                2.0               0.0
6                 5.0               3.0                1.0               0.0
7                 5.0               3.0                2.0               0.0
8                 4.0               3.0                1.0               0.0
9                 5.0               3.0                2.0               0.0
10                5.0               4.0                2.0               0.0
11                5.0               3.0                2.0               0.0
12                5.0               3.0                1.0               0.0
13                4.0               3.0                1.0               0.0
14                6.0               4.0                1.0               0.0
15                6.0               4.0                2.0               0.0
16                5.0               4.0                1.0               0.0
17                5.0               4.0                1.0               0.0
18                6.0               4.0                2.0               0.0
19                5.0               4.0                2.0               0.0
Data Subset with Rounded Values without Duplicates:
    sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)
0                 5.0               4.0                1.0               0.0
1                 5.0               3.0                1.0               0.0
3                 5.0               3.0                2.0               0.0
5                 5.0               4.0                2.0               0.0
8                 4.0               3.0                1.0               0.0
14                6.0               4.0                1.0               0.0
15                6.0               4.0                2.0               0.0