Features with One to One Correspondence
How do we learn about the redundant features?
A data set may have several features that refer to the same information. For example, in the train.csv data from the Kaggle competition "Predict West Nile virus in mosquitos across the city of Chicago", the features Address, AddressNumberAndStreet, Latitude and Longitude all refer to the locations of mosquito traps. These features have the same total number of unique values (138) and represent the same information.
--------------------------------------------------------------------------------
Number of unique values and their corresponding features
--------------------------------------------------------------------------------
Number of unique values: 64, Features: ['Block']
Number of unique values: 128, Features: ['Street']
Number of unique values: 50, Features: ['NumMosquitos']
Number of unique values: 4, Features: ['AddressAccuracy']
Number of unique values: 7, Features: ['Species']
Number of unique values: 136, Features: ['Trap']
Number of unique values: 138, Features: ['Address', 'AddressNumberAndStreet', 'Latitude', 'Longitude']
Number of unique values: 2, Features: ['WnvPresent']
Number of unique values: 95, Features: ['Date']

********** Conclusion for one to one correspondence **********
Features ['Address', 'AddressNumberAndStreet', 'Latitude', 'Longitude'] are one to one correspondence.
In contrast to the example above, some data sets have features whose names do not suggest at first glance that they carry the same information. Conversely, features that share the same total number of unique values are not guaranteed to represent the same information. To establish that two features really do represent the same information, their unique values must be in one to one correspondence: each value of one feature must always appear with exactly one value of the other, and vice versa. I designed two functions, unique_sets() and one2one(), which help to evaluate whether any pair of features sharing the same total number of unique values is also in one to one correspondence.
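unique_sets() collects the features that share the same total number of unique values into the same group. Two features with different totals cannot be in one to one correspondence, so only features within the same group need to be compared. one2one() evaluates the correspondence with the help of the groupby() function of pandas.DataFrame. For a DataFrame with two features, we group the rows by the first feature (df.columns[0], refer to the code below) and count the unique values of the second feature within each group. For a one to one relationship to be valid, that count must be exactly 1 in every group. Checking this single direction is enough here because the two features already have the same total number of unique values: if every value of the first feature maps to exactly one value of the second, and both features have equally many unique values, no two values of the first feature can share a value of the second.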
import numpy as np


def one2one(df):
    """Check whether the two columns of a DataFrame have one to one correspondence.

    Parameters
    ----------
    df : pandas.DataFrame, shape [n_samples, 2]

    Returns
    -------
    relation : boolean
        True, if one to one correspondence found.
    """
    assert len(df.columns) == 2, "DataFrame does not have two columns"

    # Count the distinct 2nd-column values within each 1st-column group.
    # Checking this one direction implies a bidirectional mapping only
    # because the two columns are expected to share the same total number
    # of unique values (as grouped by unique_sets()).
    group = df.groupby(df.columns[0])
    nums = group.transform(lambda x: len(x.unique()))

    # the correspondence holds when the count is 1 for every row
    relation = bool(np.all(nums.values == 1))
    return relation
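The post does not list unique_sets() itself. As a rough sketch of the grouping step it describes, and not the author's implementation, the function below collects column names by their number of unique values. The name group_by_unique_count and the toy data are hypothetical, chosen only for illustration.

```python
from collections import defaultdict

import pandas as pd


def group_by_unique_count(df):
    """Map each unique-value count to the list of columns having that count.

    Hypothetical helper: a sketch of the grouping step, not unique_sets().
    """
    groups = defaultdict(list)
    for col in df.columns:
        # dropna=False counts NaN as a value, matching len(Series.unique())
        groups[df[col].nunique(dropna=False)].append(col)
    return dict(groups)


# Toy data (made up): "trap" and "lat" both have 3 unique values.
toy = pd.DataFrame({
    "trap": ["T1", "T2", "T3", "T1"],
    "lat": [41.9, 41.8, 41.7, 41.9],
    "species": ["a", "b", "a", "b"],
})
groups = group_by_unique_count(toy)
```

Only groups containing more than one column are candidates for the one to one check; a column alone in its group has nothing to be compared with.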
Based on these functions, I designed a helper function, one2one_sets(), which first runs unique_sets() and then inspects every pair of features that share the same total number of unique values for one to one correspondence.
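The pairwise inspection can be sketched as follows. This is a minimal stand-alone illustration under my own assumptions, not the code behind one2one_sets(): the names is_one_to_one and one_to_one_pairs and the toy data are hypothetical, and the mapping is checked explicitly in both directions so that the check does not rely on the equal-count filter.

```python
from itertools import combinations

import pandas as pd


def is_one_to_one(df, a, b):
    """Return True if columns a and b map to each other one to one."""
    # each value of a must appear with exactly one value of b ...
    forward = df.groupby(a)[b].nunique().eq(1).all()
    # ... and each value of b with exactly one value of a
    backward = df.groupby(b)[a].nunique().eq(1).all()
    return bool(forward and backward)


def one_to_one_pairs(df):
    """List column pairs with equal unique counts that are one to one."""
    counts = df.nunique()
    pairs = []
    for a, b in combinations(df.columns, 2):
        # unequal counts rule out a one to one correspondence immediately
        if counts[a] == counts[b] and is_one_to_one(df, a, b):
            pairs.append((a, b))
    return pairs


# Toy data (made up): all three columns have 3 unique values,
# but "code" breaks the correspondence because "A" maps to both "x" and "y".
toy = pd.DataFrame({
    "addr": ["A", "A", "B", "C"],
    "short": ["a", "a", "b", "c"],
    "code": ["x", "y", "y", "z"],
})
pairs = one_to_one_pairs(toy)
```

In the toy data only the pair ("addr", "short") survives, even though "code" has the same number of unique values as the other two columns.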
Let us turn to a corrupted data set, train-corrupted.csv, which mislabels some parts of the data in train.csv. In train-corrupted.csv, the Address of the first entry is incorrectly marked as "6200 North Mandell Avenue, Chicago, IL 60646, USA", while the AddressNumberAndStreet of the second entry is incorrectly marked as "6200 N MANDELL AVE, Chicago, IL". As a result, Address and AddressNumberAndStreet no longer fulfill the one to one relationship, and one2one_sets() correctly identifies that only the Latitude and Longitude features fulfill the required criteria, even though all four features still share the same total number of unique values.
Lastly, if we apply the script to all the features of the weather.csv data, we can conclude that the features Station and Depth refer to the same information. Once such groups are identified, we can remove the redundant features from our data before continuing our analysis.
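Removing the redundant features could look like the sketch below, which keeps the first feature of each one to one group and drops the rest. The helper drop_redundant and the toy values are hypothetical; which feature of a group is worth keeping is a judgment call that depends on the analysis.

```python
import pandas as pd


def drop_redundant(df, groups):
    """Keep the first column of each one to one group and drop the others.

    Hypothetical helper; `groups` is a list of lists of column names.
    """
    to_drop = [col for group in groups for col in group[1:]]
    return df.drop(columns=to_drop)


# Toy data (made up values) with one group of equivalent location features.
toy = pd.DataFrame({
    "Address": ["addr 1", "addr 2", "addr 1"],
    "Latitude": [41.95, 41.99, 41.95],
    "Longitude": [-87.80, -87.77, -87.80],
    "NumMosquitos": [1, 4, 2],
})
reduced = drop_redundant(toy, [["Address", "Latitude", "Longitude"]])
```

The original DataFrame is left untouched because DataFrame.drop returns a new object by default.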
The following example code uses the one2one_sets() function to inspect the one to one correspondence in the three data sets discussed in this post.
""" train.csv and weather.csv files downloaded from
https://www.kaggle.com/c/predict-west-nile-virus/data """
import sys
sys.path.append("../bin/")
from data import DataIn
# from preprocess import one2one
if __name__ == "__main__":
    train = DataIn("train.csv")
    train.one2one_sets()
    corrupted = DataIn("train-corrupted.csv")
    corrupted.one2one_sets()
    weather = DataIn("weather.csv")
    weather.one2one_sets()