Five-Day Data Cleaning Challenge

For the original content and the data file downloads, see the Data Cleaning Challenge.

Day 1: Handling missing values

Take a first look at the data

Importing the libraries and reading in the csv file are not repeated here.
First, check whether there are any missing values.

# look at a random sample of 5 rows of the data
nfl_data.sample(5)

See how many missing data points we have

Check how many missing values each column has.

# get the number of missing data points per column
missing_values_count = nfl_data.isnull().sum()

# look at the # of missing points in the first ten columns
missing_values_count[0:10]

Looking at the missing values as a percentage may be more intuitive.

# how many total missing values do we have?
total_cells = np.product(nfl_data.shape)  # np.product returns the product of the array elements (here, the total number of cells)
total_missing = missing_values_count.sum()

# percent of data that is missing
(total_missing/total_cells) * 100

Figure out why the data is missing

An important question to ask is:

Is a value missing because it wasn't recorded, or because it doesn't exist at all?

If it genuinely doesn't exist, there is no point trying to guess what it should have been.
If it wasn't recorded, you can try to infer it from the other values in the same row and column.

Tip: reading the dataset's documentation carefully (and more than once) is a good way to figure this out.
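
Before deciding, it can help to see which columns are worst affected. A minimal sketch, assuming the missing_values_count Series computed above:

# rank columns by how many values they are missing
missing_values_count.sort_values(ascending=False).head(10)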

Drop missing values

If you are in a hurry, or cannot work out why the values are missing, one option is simply to drop the rows or columns that contain them (this is not recommended, though).
Dropping rows that contain a missing value:

# remove all the rows that contain a missing value
nfl_data.dropna()

Dropping columns that contain a missing value:

# remove all columns with at least one missing value
columns_with_na_dropped = nfl_data.dropna(axis=1)
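
If you go this route, it is worth checking how much data was lost. A small sketch, assuming nfl_data and columns_with_na_dropped from above:

# compare the number of columns before and after dropping
print("Columns in original dataset: %d" % nfl_data.shape[1])
print("Columns with na's dropped: %d" % columns_with_na_dropped.shape[1])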

Filling in missing values automatically

Instead of dropping missing values outright, you can also try to fill them in.
The simplest option is to fill them with 0.

# get a small subset of the NFL dataset
subset_nfl_data = nfl_data.loc[:, 'EPA':'Season'].head()

# replace all NA's with 0
subset_nfl_data.fillna(0)

You can also pass method='bfill' to fill each missing value with the value from the next row in the same column:

# replace all NA's with the value that comes directly after it in the same column,
# then replace all the remaining na's with 0
subset_nfl_data.fillna(method='bfill', axis=0).fillna(0)
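
A version note: in pandas 2.x the method= argument to fillna is deprecated, and the dedicated bfill() method does the same thing. A sketch on the same subset:

# backfill down each column, then fill whatever is still missing with 0
subset_nfl_data.bfill().fillna(0)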

Day 2: Scaling and normalization

Scaling

Scaling changes the range of the data, and is commonly needed for methods such as SVM and KNN.

# generate 1000 data points randomly drawn from an exponential distribution
original_data = np.random.exponential(size=1000)

# min-max scale the data to between 0 and 1 (minmax_scaling comes from mlxtend.preprocessing)
scaled_data = minmax_scaling(original_data, columns=[0])

# plot both together to compare
fig, ax = plt.subplots(1, 2)
sns.distplot(original_data, ax=ax[0])
ax[0].set_title("Original Data")
sns.distplot(scaled_data, ax=ax[1])
ax[1].set_title("Scaled data")


You can see that the shape of the data does not change, but its range shrinks from roughly 0 to 8 down to 0 to 1.
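
Min-max scaling is just a linear rescaling, so if you prefer not to depend on mlxtend the same result can be computed by hand. A minimal sketch on the same original_data array:

# map every value linearly onto [0, 1]
scaled_by_hand = (original_data - original_data.min()) / (original_data.max() - original_data.min())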

Normalization

Normalization transforms the data so that it follows a normal distribution as closely as possible. It is commonly assumed by t-tests, ANOVAs, linear regression, linear discriminant analysis (LDA) and Gaussian naive Bayes.
Here a Box-Cox transformation is used.

# normalize the exponential data with boxcox
normalized_data = stats.boxcox(original_data)

# plot both together to compare
fig, ax = plt.subplots(1, 2)
sns.distplot(original_data, ax=ax[0])
ax[0].set_title("Original Data")
sns.distplot(normalized_data[0], ax=ax[1])
ax[1].set_title("Normalized data")


You can see that after the transformation the data is much closer to a normal distribution.
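
Two details worth knowing about scipy.stats.boxcox: it only accepts strictly positive data, and when no lambda is passed it returns both the transformed values and the lambda it fitted, which is why the plotting code above indexes normalized_data[0]. A small sketch:

# stats.boxcox returns (transformed_data, fitted_lambda) when lmbda is not given
transformed, fitted_lambda = stats.boxcox(original_data)
print(fitted_lambda)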

Day 3: Parsing dates

Check the data type of our date column

print(landslides['date'].head())

Output:

0 3/2/07
1 3/22/07
2 4/6/07
3 4/14/07
4 4/15/07
Name: date, dtype: object

As humans we can tell at a glance that these values are dates, but Python does not know that. The dtype: object at the bottom of the output means pandas is storing the column as a generic object dtype, which it uses when a column holds mixed types, usually including strings.
Pandas does have a dedicated datetime64 dtype for dates.

# check the data type of our date column
landslides['date'].dtype

Output:

dtype('O')

This is another way to check a column's data type; 'O' stands for object.

Convert our date columns to datetime

When converting the strings to datetime, you need to tell pandas the format the original strings follow: %d for day, %m for month, %y for a two-digit year and %Y for a four-digit year.
For example:

  • 1/17/07 has the format "%m/%d/%y"
  • 17-1-2007 has the format "%d-%m-%Y"
landslides.date = pd.to_datetime(landslides.date, format="%m/%d/%y")
landslides.date.head()

Output:

0 2007-03-02
1 2007-03-22
2 2007-04-06
3 2007-04-14
4 2007-04-15
Name: date, dtype: datetime64[ns]

If the dates are not all in one fixed format such as "%m/%d/%y" but are a mix of several formats, you can ask pandas to infer the correct format, like this:

landslides['date_parsed'] = pd.to_datetime(landslides['date'], infer_datetime_format=True)

Don't reach for this by default, though: pandas cannot always infer the format correctly, and inference is noticeably slower.
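
For reference, infer_datetime_format is deprecated in pandas 2.x; on those versions, one option for genuinely mixed formats is format='mixed'. A sketch, assuming a recent pandas:

# pandas >= 2.0: parse each element individually instead of inferring a single format
landslides['date_parsed'] = pd.to_datetime(landslides['date'], format='mixed')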

Select just the day of the month from our column

With datetime data you can pull out the day, month and year separately:

day_date = landslides.date.dt.day
month_date = landslides.date.dt.month
year_date = landslides.date.dt.year

Plot the day of the month to check the date parsing

Visualize the parsed dates:

# remove na's
day_date = day_date.dropna()

# plot the day of the month
sns.distplot(day_date, kde=False, bins=31)  # kde toggles the kernel density estimate; bins sets the number of histogram bins

Supplementary reading: a tutorial on time-series visualization.

Day 4: Character encodings

What are encodings?

When handling text in Python 3 you will run into two data types. The first is str, the default type for text:

# start with a string
before = "This is the euro symbol: €"

# check to see what datatype it is
type(before)

Output:

str

The other is bytes, a sequence of integers. You can encode a string into bytes:

# encode it to a different encoding, replacing characters that raise errors
after = before.encode("utf-8", errors="replace")

# check the type
type(after)

Output:

bytes

And of course you can decode the bytes back into a string:

after.decode('utf-8')

Output:

'This is the euro symbol: €'
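
The reverse direction is where encoding problems usually appear: decoding those UTF-8 bytes with an encoding that cannot represent the euro sign raises an error. For example:

# decoding UTF-8 bytes as ASCII fails, because the euro sign has no ASCII representation
after.decode("ascii")  # raises UnicodeDecodeError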

Reading in files with encoding problems

Reading a file that is not UTF-8 encoded will raise an error. One approach is to read just the first ten thousand bytes and let chardet guess the encoding:

# look at the first ten thousand bytes to guess the character encoding
with open("../input/kickstarter-projects/ks-projects-201801.csv", 'rb') as rawdata:
    result = chardet.detect(rawdata.read(10000))

# check what the character encoding might be
print(result)

Output:

{'encoding': 'Windows-1252', 'confidence': 0.73, 'language': ''}

In other words, chardet is 73% confident that the file is encoded as Windows-1252, so we can open the file with that encoding:

# read in the file with the encoding detected by chardet
kickstarter_2016 = pd.read_csv("../input/kickstarter-projects/ks-projects-201612.csv", encoding='Windows-1252')

# look at the first few lines
kickstarter_2016.head()

Saving your files with UTF-8 encoding

Files are saved as UTF-8 by default:

# save our file (will be saved as UTF-8 by default!)
kickstarter_2016.to_csv("ks-projects-201801-utf8.csv")

Day 5: Inconsistent Data Entry

Do some preliminary text pre-processing

First, check whether the City column contains inconsistent entries:

# get all the unique values in the 'City' column
cities = suicide_attacks['City'].unique()

# sort them alphabetically and then take a closer look
cities.sort()
cities

Converting everything to lower case and stripping leading and trailing whitespace removes roughly 80% of these entry inconsistencies:

# convert to lower case
suicide_attacks['City'] = suicide_attacks['City'].str.lower()
# remove leading and trailing white spaces
suicide_attacks['City'] = suicide_attacks['City'].str.strip()

Use fuzzy matching to correct inconsistent data entry

The fuzzywuzzy package can be used to spot strings that are nearly, but not exactly, the same.
Fuzzywuzzy returns a similarity ratio between two strings; the closer the ratio is to 100, the more similar the strings.

# get the top 10 closest matches to "d.i khan"
matches = fuzzywuzzy.process.extract("d.i khan", cities, limit=10, scorer=fuzzywuzzy.fuzz.token_sort_ratio)

# take a look at them
matches

Output:

[('d. i khan', 100),
 ('d.i khan', 100),
 ('d.g khan', 88),
 ('khanewal', 50),
 ('sudhanoti', 47),
 ('hangu', 46),
 ('kohat', 46),
 ('dara adam khel', 45),
 ('chaman', 43),
 ('mardan', 43)]

Matches with a ratio of 90 or above look safe to replace, so we can write a function to do the replacement:

# function to replace rows in the provided column of the provided dataframe
# that match the provided string above the provided ratio with the provided string
def replace_matches_in_column(df, column, string_to_match, min_ratio=90):
    # get a list of unique strings
    strings = df[column].unique()

    # get the top 10 closest matches to our input string
    matches = fuzzywuzzy.process.extract(string_to_match, strings,
                                         limit=10, scorer=fuzzywuzzy.fuzz.token_sort_ratio)

    # only keep matches with a ratio >= min_ratio
    close_matches = [match[0] for match in matches if match[1] >= min_ratio]

    # get the rows of all the close matches in our dataframe
    rows_with_matches = df[column].isin(close_matches)

    # replace all rows with close matches with the input string
    df.loc[rows_with_matches, column] = string_to_match

    # let us know the function's done
    print("All done!")

Then, for any given string, all of its close variants can be replaced in one call:

# use the function we just wrote to replace close matches to "d.i khan" with "d.i khan"
replace_matches_in_column(df=suicide_attacks, column='City', string_to_match="d.i khan")
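
As a final sanity check, listing the unique city names again should show that 'd. i khan' has been folded into 'd.i khan', while 'd.g khan' (a different city, which matched at only 88) is left alone:

# get all the unique values in the 'City' column again to confirm the fix
cities = suicide_attacks['City'].unique()
cities.sort()
cities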