Data Cleaning
Data cleaning is the process of modifying data using operations such as:
selection
correction
alteration
rearrangement
regrouping
removal
for purposes such as:
consistency with other data
consistency with an intended use
elimination of incorrect data
elimination of unwanted data
Benefits of Data Cleaning using SQL Code over Python Pandas Code
Handling Very Large Datasets
The size of dataset Python Pandas can handle is limited by available memory (RAM), since Pandas loads data into memory. SQL database engines operate on data stored on disk and are designed to handle datasets far larger than memory.
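When a dataset does exceed memory, Pandas can still process it piecewise. The sketch below, with a hypothetical file name and hypothetical cleaning steps, reads a CSV in fixed-size chunks so only one chunk is in memory at a time:

```python
# Sketch: cleaning a CSV too large for memory, one chunk at a time.
# The file path and the cleaning steps are illustrative assumptions.
import pandas as pd

def cleaned_row_count(csv_path):
    """Count the rows that survive cleaning, processed chunk by chunk."""
    total = 0
    # chunksize makes read_csv yield DataFrames of up to 100_000 rows each,
    # so memory use stays bounded regardless of the file's total size.
    for chunk in pd.read_csv(csv_path, chunksize=100_000):
        chunk = chunk.dropna()            # remove rows with missing values
        chunk = chunk.drop_duplicates()   # remove exact duplicate rows
        total += len(chunk)
    return total
```

Note that per-chunk deduplication only removes duplicates within a chunk; global deduplication of out-of-memory data is one of the cases where a SQL engine is simpler.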
Ease of Handling Simple Query Logic
If only simple data cleaning logic is being applied, SQL code can be simpler than Python Pandas code.
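As an illustration, the sketch below applies the same simple rule (keep rows with a positive price, drop exact duplicates) both as a single declarative SQL statement and as Pandas code, using an in-memory SQLite database; the table and column names are hypothetical:

```python
# Sketch: one simple cleaning rule expressed in SQL and in Pandas.
# The sales table and its columns are illustrative assumptions.
import sqlite3
import pandas as pd

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (item TEXT, price REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("apple", 1.0), ("apple", 1.0), ("pear", -2.0)])

# SQL version: one declarative statement.
sql_rows = conn.execute(
    "SELECT DISTINCT item, price FROM sales WHERE price > 0").fetchall()

# Pandas version of the same rule.
df = pd.read_sql_query("SELECT * FROM sales", conn)
pandas_rows = df[df["price"] > 0].drop_duplicates()

print(sql_rows)          # [('apple', 1.0)]
print(len(pandas_rows))  # 1
```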
Benefits of Data Cleaning using Python Pandas Code over SQL Code
In general, simple data cleaning operations might be more easily handled using SQL code. As data cleaning complexity grows, the advantages of using Python Pandas become significant.
Logic Implementation
Python Pandas has significantly more capabilities and flexibility in implementing data cleaning logic than SQL code. As SQL code logic grows in complexity, it becomes unwieldy and difficult to debug and maintain.
Code Reflection Capability
Python Pandas code has the capability for reflection, that is, the ability to inspect and modify its own behavior at run time.
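A minimal sketch of this idea: selecting DataFrame cleaning methods by name at run time with getattr. The pipeline list here is a hypothetical configuration that could, for example, be loaded from a file:

```python
# Sketch: choosing cleaning steps at run time via reflection.
# The pipeline contents are an illustrative assumption.
import pandas as pd

df = pd.DataFrame({"x": [1.0, 1.0, None, 4.0]})

# Each entry names a DataFrame method and its keyword arguments.
pipeline = [("fillna", {"value": 0.0}),
            ("drop_duplicates", {})]

for method_name, kwargs in pipeline:
    # getattr looks the method up by its string name at run time.
    df = getattr(df, method_name)(**kwargs)

print(df["x"].tolist())  # → [1.0, 0.0, 4.0]
```

SQL has no comparable mechanism; changing the cleaning steps means generating new SQL text.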
Programming Constructs
The Python language has a very large number of programming constructs compared to SQL.
Code Modification and Testing
Code logic can be compartmentalized and documented for relatively easy modification and testing compared to SQL.
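For example, each cleaning step can live in its own documented function with a unit-style check beside it; the function and column names below are hypothetical:

```python
# Sketch: a compartmentalized cleaning step with its own test.
import pandas as pd

def remove_outliers(df, column, lower, upper):
    """Return only the rows whose column value lies within [lower, upper]."""
    return df[df[column].between(lower, upper)]

# A unit-style check documents and verifies the step in isolation.
sample = pd.DataFrame({"age": [25, 40, 130, -3]})
cleaned = remove_outliers(sample, "age", 0, 120)
assert cleaned["age"].tolist() == [25, 40]
```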
Python Pandas DataFrame Functions
Using Python Pandas is one effective way to clean data. It provides a large set of data manipulation functions.
Below are select Pandas DataFrame functions useful for data cleaning:
abs( ) - returns the absolute numeric value of each element
add( ) - adds additional data
add_prefix( ) - adds a string prefix
add_suffix( ) - adds a string suffix
aggregate( ) - aggregates data
align( ) - aligns two objects with a specified join method
all( ) - returns whether all elements are True
any( ) - returns whether any element is True
append( ) - appends rows of other to the end of caller, returning a new object
apply( ) - applies a function along an axis (columns or rows) of the DataFrame
applymap( ) - applies a function to DataFrame elements
asfreq( ) - converts a time series to a specified frequency
assign( ) - assigns new columns to a DataFrame
astype( ) - casts an object to a specified data type
at_time( ) - selects values at a particular time of day
between_time( ) - selects values between particular times of the day
bool( ) - returns the boolean of a single element
clip( ) - trims values at a given threshold
columns( ) - the column names of the DataFrame
combine( ) - performs a column-wise combine with another DataFrame
combine_first( ) - updates null elements with the value in the same location in other
copy( ) - makes a copy of an object's indices and data
count( ) - counts non-NA cells
cummax( ) - returns a cumulative maximum over a DataFrame or Series axis
cummin( ) - returns a cumulative minimum over a DataFrame or Series axis
cumprod( ) - returns a cumulative product over a DataFrame or Series axis
cumsum( ) - returns a cumulative sum over a DataFrame or Series axis
describe( ) - generates descriptive statistics
diff( ) - first discrete difference of element
divide( ) - gets floating division of DataFrame and other elements
drop( ) - drops specified labels from rows or columns
drop_duplicates( ) - returns DataFrame with duplicate rows removed
droplevel( ) - returns DataFrame with requested index / column level(s) removed
dropna( ) - removes missing values
duplicated( ) - returns boolean Series denoting duplicate rows
eq( ) - gets equal to of DataFrame and other, element-wise
equals( ) - tests whether two objects contain the same elements
explode( ) - transforms each element of a list-like to a row, replicating index values
fillna( ) - fills NA/NaN values using the specified method
filter( ) - subsets the DataFrame rows or columns according to the specified index labels
first_valid_index( ) - returns index for first non-NA/null value
from_dict( ) - constructs a DataFrame from a dictionary
from_records( ) - converts records to a DataFrame
get( ) - gets an item from object for a given key
groupby( ) - groups DataFrame using a mapper or by a series of columns
gt( ) - gets greater than of a DataFrame
head( ) - returns the first n rows
idxmax( ) - returns the index of first occurrence of maximum over requested axis
idxmin( ) - returns the index of first occurrence of minimum over requested axis
infer_objects( ) - attempts to infer better types for object columns
insert( ) - inserts a column into DataFrame at a specified location
interpolate( ) - interpolates values according to specified methods
isin( ) - determines whether each element in the DataFrame is contained in specified values
isna( ) - detects missing values
isnull( ) - detects null values
items( ) - iterates over (column name, Series) pairs
iteritems( ) - iterates over (column name, Series) pairs
iterrows( ) - iterates over DataFrame rows as (index, Series) pairs
itertuples( ) - iterates over DataFrame rows as named tuples
join( ) - joins columns
last( ) - subsets final periods of time series data based on a date offset
last_valid_index( ) - returns the index for last non-NA/null value
le( ) - gets less than or equal to of DataFrame
loc( ) - selects a range of rows
lookup( ) - label-based indexing function for DataFrames
lt( ) - gets less than of DataFrame
mad( ) - returns the mean absolute deviation of values
mask( ) - replaces values where a specified condition is True
max( ) - returns the maximum of the values for specified data
mean( ) - returns the mean of the values for specified data
median( ) - returns the median of the values for specified data
memory_usage( ) - returns the memory usage of each column in bytes
merge( ) - merges DataFrame or named Series objects with a database-style join
min( ) - returns the minimum of the values for specified data
mode( ) - gets the mode(s) of each element for specified data
multiply( ) - gets multiplication of DataFrame and other data
ne( ) - gets not equal to of DataFrame and other data
nlargest( ) - returns the first n rows ordered by columns in descending order
notna( ) - detects existing (non-missing) values
notnull( ) - detects existing (non-missing) values
nsmallest( ) - returns the first n rows ordered by columns in ascending order
nunique( ) - counts distinct observations over specified data
pct_change( ) - percentage change between the current and a prior element
pivot_table( ) - creates a spreadsheet-style pivot table as a DataFrame
pop( ) - returns item and drops it from DataFrame
prod( ) - returns the product of the values for the specified data
product( ) - returns the product of the values for the specified data
query( ) - queries the columns of a DataFrame with a boolean expression
radd( ) - gets addition of DataFrame and specified data
read_gbq( ) - reads Google BigQuery data into a DataFrame
read_json( ) - reads JSON data into a DataFrame
read_pickle( ) - reads pickled file data into a DataFrame
read_sql_query( ) - reads an SQL DB table into a DataFrame using an SQL query
read_sql_table( ) - reads an SQL DB table into a DataFrame
rdiv( ) - gets floating point division of DataFrame and other specified data
reindex( ) - conforms DataFrame to a new index with optional filling logic
reindex_like( ) - returns an object with matching indices as another object
rename( ) - alters axes labels
rename_axis( ) - sets the name of the axis for the index or columns
replace( ) - replaces values with new values in a DataFrame
resample( ) - resamples time-series data
reset_index( ) - resets an index
rfloordiv( ) - gets integer division of a DataFrame and specified data
rmod( ) - gets modulo of a DataFrame and specified data
rmul( ) - gets multiplication of a DataFrame and specified data
round( ) - rounds a DataFrame to a variable number of decimal places
rsub( ) - gets subtraction of a DataFrame and specified data
rtruediv( ) - gets floating point division of a DataFrame and specified data
sample( ) - returns a random sample of selected DataFrame elements
select_dtypes( ) - returns a subset of the DataFrame's columns based on column types
sem( ) - returns the unbiased standard error of the mean over specified DataFrame data
set_axis( ) - assigns a desired index to a given DataFrame axis
set_index( ) - sets a DataFrame index using existing columns
shift( ) - shifts an index by a specified number of periods
size( ) - returns an int representing the number of elements in the object
skew( ) - returns an unbiased skew over a requested DataFrame axis
sort_index( ) - sorts by labels along a DataFrame axis
sort_values( ) - sorts by values along a DataFrame axis
squeeze( ) - squeezes 1-dimensional axis objects into scalars
std( ) - returns sample standard deviation over a specified DataFrame axis
sub( ) - gets subtraction of a DataFrame and specified data
subtract( ) - gets subtraction of a DataFrame and element-wise data (binary operator sub)
sum( ) - returns the sum of the values for the specified axis
tail( ) - returns the last n rows of data
take( ) - returns the elements in the given positional indices along an axis
to_clipboard( ) - copies objects to the system clipboard
to_csv( ) - writes object to a comma-separated values (csv) file
to_datetime( ) - converts fields to datetime based on supplied specifications
to_dict( ) - converts a DataFrame to a dictionary
to_excel( ) - writes objects to an Excel spreadsheet
to_gbq( ) - writes a DataFrame to a Google BigQuery table
to_hdf( ) - writes data to an HDF5 file using HDFStore
to_html( ) - renders a DataFrame as an HTML table
to_json( ) - converts an object to a JSON string
to_latex( ) - renders an object to a LaTeX object
to_markdown( ) - prints DataFrame in Markdown-friendly format
to_numpy( ) - converts a DataFrame to a NumPy array
to_parquet( ) - writes a DataFrame to the binary parquet format
to_period( ) - converts a DataFrame from DatetimeIndex to PeriodIndex
to_pickle( ) - writes an object to a Pickle (serialized) object file
to_records( ) - converts a DataFrame to a NumPy record array
to_sql( ) - writes records stored in a DataFrame to a SQL database
to_stata( ) - exports a DataFrame object to Stata (statistics) dta format
to_string( ) - renders a DataFrame to a console-friendly tabular output
to_timestamp( ) - casts to DatetimeIndex of timestamps, at beginning of period
to_xarray( ) - returns an xarray (multi-dimensional) object from the pandas object
transform( ) - produces a DataFrame using a transform function
transpose( ) - transposes index and columns
truediv( ) - gets floating point division of a DataFrame and specified element-wise data
truncate( ) - truncates a Series or DataFrame before and after a specified index value
tshift( ) - shifts a time index
tz_convert( ) - converts a tz-aware axis to a target time zone
tz_localize( ) - localizes a tz-naive index of a Series or DataFrame to a target time zone
update( ) - modifies in place using non-NA values from another DataFrame
value_counts( ) - returns a Series containing counts of unique rows in the DataFrame
var( ) - returns unbiased variance over the specified axis
where( ) - replaces values where the condition is False
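To show how a few of the listed functions combine in practice, here is a minimal sketch on toy data (the column name and values are illustrative assumptions):

```python
# Sketch: chaining rename( ), fillna( ), astype( ), and drop_duplicates( ).
import pandas as pd

df = pd.DataFrame({"Qty ": ["1", "2", "2", None]})

df = (df.rename(columns={"Qty ": "qty"})    # rename( ): fix an axis label
        .fillna("0")                        # fillna( ): fill missing values
        .astype({"qty": int})               # astype( ): cast to a data type
        .drop_duplicates())                 # drop_duplicates( ): dedupe rows

print(df["qty"].tolist())  # → [1, 2, 0]
```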
Python Example using Pandas
""" data_cleaning_using_pandas.py provides examples of using a Python Pandas DataFrame and its methods """ # Import needed libraries. import pandas as pd import numpy as np from sklearn.datasets import load_iris import pprint # Set parameters. selected_row_count = 20 # Load test data. iris_data = load_iris() # Create a DataFrame from the test data. data_frame = pd.DataFrame(iris_data.data, columns=iris_data.feature_names) data_frame_subset = data_frame.head(selected_row_count) # Print the data. print("Data Subset of First " + str(selected_row_count) + " Rows") print(data_frame_subset.to_string()) # Get a count of non-NA cells. non_na_cell_count = data_frame_subset.count() # Print the count. print("Non Empty Cell count:") print(non_na_cell_count) # Get the mean of cell rows. mean_of_rows = data_frame_subset.mean(axis=1) # Print the means. print("Means of Rows:") print(mean_of_rows) # Get the mean of cell columns. mean_of_columns = data_frame_subset.mean(axis=0) # Print the means. print("Means of Columns:") print(mean_of_columns) # Round values to 0 decimal places. data_frame_with_rounded_values = data_frame_subset.round(decimals=0) # Print the DataFrame. print("Data Subset with Rounded Values:") print(data_frame_with_rounded_values.to_string()) # Drop duplicate rows. data_frame_rounded_without_duplicates = data_frame_with_rounded_values.drop_duplicates() # Print the means. print("Data Subset with Rounded Values without Duplicates:") print(data_frame_rounded_without_duplicates.to_string())
Output is below:
Data Subset of First 20 Rows
    sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)
0                 5.1               3.5                1.4               0.2
1                 4.9               3.0                1.4               0.2
2                 4.7               3.2                1.3               0.2
3                 4.6               3.1                1.5               0.2
4                 5.0               3.6                1.4               0.2
5                 5.4               3.9                1.7               0.4
6                 4.6               3.4                1.4               0.3
7                 5.0               3.4                1.5               0.2
8                 4.4               2.9                1.4               0.2
9                 4.9               3.1                1.5               0.1
10                5.4               3.7                1.5               0.2
11                4.8               3.4                1.6               0.2
12                4.8               3.0                1.4               0.1
13                4.3               3.0                1.1               0.1
14                5.8               4.0                1.2               0.2
15                5.7               4.4                1.5               0.4
16                5.4               3.9                1.3               0.4
17                5.1               3.5                1.4               0.3
18                5.7               3.8                1.7               0.3
19                5.1               3.8                1.5               0.3
Non Empty Cell count:
sepal length (cm)    20
sepal width (cm)     20
petal length (cm)    20
petal width (cm)     20
dtype: int64
Means of Rows:
0     2.550
1     2.375
2     2.350
3     2.350
4     2.550
5     2.850
6     2.425
7     2.525
8     2.225
9     2.400
10    2.700
11    2.500
12    2.325
13    2.125
14    2.800
15    3.000
16    2.750
17    2.575
18    2.875
19    2.675
dtype: float64
Means of Columns:
sepal length (cm)    5.035
sepal width (cm)     3.480
petal length (cm)    1.435
petal width (cm)     0.235
dtype: float64
Data Subset with Rounded Values:
    sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)
0                 5.0               4.0                1.0               0.0
1                 5.0               3.0                1.0               0.0
2                 5.0               3.0                1.0               0.0
3                 5.0               3.0                2.0               0.0
4                 5.0               4.0                1.0               0.0
5                 5.0               4.0                2.0               0.0
6                 5.0               3.0                1.0               0.0
7                 5.0               3.0                2.0               0.0
8                 4.0               3.0                1.0               0.0
9                 5.0               3.0                2.0               0.0
10                5.0               4.0                2.0               0.0
11                5.0               3.0                2.0               0.0
12                5.0               3.0                1.0               0.0
13                4.0               3.0                1.0               0.0
14                6.0               4.0                1.0               0.0
15                6.0               4.0                2.0               0.0
16                5.0               4.0                1.0               0.0
17                5.0               4.0                1.0               0.0
18                6.0               4.0                2.0               0.0
19                5.0               4.0                2.0               0.0
Data Subset with Rounded Values without Duplicates:
    sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)
0                 5.0               4.0                1.0               0.0
1                 5.0               3.0                1.0               0.0
3                 5.0               3.0                2.0               0.0
5                 5.0               4.0                2.0               0.0
8                 4.0               3.0                1.0               0.0
14                6.0               4.0                1.0               0.0
15                6.0               4.0                2.0               0.0