Five-Day Data Cleaning Challenge

For the original content and the data file downloads, see the Data Cleaning Challenge.

Day 1: Handling missing values

Take a first look at the data

Importing the libraries and reading in the csv file are not repeated here.
First, check whether there are any missing values.

# look at a random sample of 5 rows of the data
nfl_data.sample(5)

See how many missing data points we have

Check how many missing values each column has.

# get the number of missing data points per column
missing_values_count = nfl_data.isnull().sum()

# look at the # of missing points in the first ten columns
missing_values_count[0:10]

Looking at the missing values as a percentage may be more intuitive.

# how many total missing values do we have?
total_cells = np.product(nfl_data.shape)  # np.product returns the product of the array elements (here, the total number of cells)
total_missing = missing_values_count.sum()

# percent of data that is missing
(total_missing/total_cells) * 100

Figure out why the data is missing

An important question to ask is:

Is a value missing because it wasn't recorded, or because it doesn't exist at all?

If it genuinely doesn't exist, there is no point trying to guess what it should have been.
If it wasn't recorded, you can try to infer it from the other values in the same row and column.

Tip: reading the dataset's documentation carefully (and more than once) is a good way to figure this out.
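
Before deciding, it can help to see which columns are worst affected. A minimal sketch, assuming the missing_values_count Series computed above:

# rank columns by how many values they are missing
missing_values_count.sort_values(ascending=False).head(10)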

Drop missing values

If you are in a hurry, or cannot work out why the values are missing, one option is simply to drop the rows or columns that contain them (this is not recommended, though).
Dropping rows that contain a missing value:

# remove all the rows that contain a missing value
nfl_data.dropna()

Dropping columns that contain a missing value:

# remove all columns with at least one missing value
columns_with_na_dropped = nfl_data.dropna(axis=1)
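
If you go this route, it is worth checking how much data was lost. A small sketch, assuming nfl_data and columns_with_na_dropped from above:

# compare the number of columns before and after dropping
print("Columns in original dataset: %d" % nfl_data.shape[1])
print("Columns with na's dropped: %d" % columns_with_na_dropped.shape[1])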

Filling in missing values automatically

Instead of dropping missing values outright, you can also try to fill them in.
The simplest option is to fill them with 0.

# get a small subset of the NFL dataset
subset_nfl_data = nfl_data.loc[:, 'EPA':'Season'].head()

# replace all NA's with 0
subset_nfl_data.fillna(0)

You can also pass method='bfill' to fill each missing value with the value from the next row in the same column:

# replace all NA's with the value that comes directly after it in the same column,
# then replace all the remaining na's with 0
subset_nfl_data.fillna(method='bfill', axis=0).fillna(0)
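
A version note: in pandas 2.x the method= argument to fillna is deprecated, and the dedicated bfill() method does the same thing. A sketch on the same subset:

# backfill down each column, then fill whatever is still missing with 0
subset_nfl_data.bfill().fillna(0)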

Day 2: Scaling and normalization

Scaling

Scaling changes the range of the data, and is commonly needed for methods such as SVM and KNN.

# generate 1000 data points randomly drawn from an exponential distribution
original_data = np.random.exponential(size=1000)

# min-max scale the data to between 0 and 1 (minmax_scaling comes from mlxtend.preprocessing)
scaled_data = minmax_scaling(original_data, columns=[0])

# plot both together to compare
fig, ax = plt.subplots(1, 2)
sns.distplot(original_data, ax=ax[0])
ax[0].set_title("Original Data")
sns.distplot(scaled_data, ax=ax[1])
ax[1].set_title("Scaled data")


You can see that the shape of the data does not change, but its range shrinks from roughly 0 to 8 down to 0 to 1.
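
Min-max scaling is just a linear rescaling, so if you prefer not to depend on mlxtend the same result can be computed by hand. A minimal sketch on the same original_data array:

# map every value linearly onto [0, 1]
scaled_by_hand = (original_data - original_data.min()) / (original_data.max() - original_data.min())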

Normalization

Normalization transforms the data so that it follows a normal distribution as closely as possible. It is commonly assumed by t-tests, ANOVAs, linear regression, linear discriminant analysis (LDA) and Gaussian naive Bayes.
Here a Box-Cox transformation is used.

# normalize the exponential data with boxcox
normalized_data = stats.boxcox(original_data)

# plot both together to compare
fig, ax = plt.subplots(1, 2)
sns.distplot(original_data, ax=ax[0])
ax[0].set_title("Original Data")
sns.distplot(normalized_data[0], ax=ax[1])
ax[1].set_title("Normalized data")


You can see that after the transformation the data is much closer to a normal distribution.
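
Two details worth knowing about scipy.stats.boxcox: it only accepts strictly positive data, and when no lambda is passed it returns both the transformed values and the lambda it fitted, which is why the plotting code above indexes normalized_data[0]. A small sketch:

# stats.boxcox returns (transformed_data, fitted_lambda) when lmbda is not given
transformed, fitted_lambda = stats.boxcox(original_data)
print(fitted_lambda)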

Day 3: Parsing dates

Check the data type of our date column

print(landslides['date'].head())

Output:

0 3/2/07
1 3/22/07
2 4/6/07
3 4/14/07
4 4/15/07
Name: date, dtype: object

As humans we can tell at a glance that these values are dates, but Python does not know that. The dtype: object at the bottom of the output means pandas is storing the column as a generic object dtype, which it uses when a column holds mixed types, usually including strings.
Pandas does have a dedicated datetime64 dtype for dates.

# check the data type of our date column
landslides['date'].dtype

Output:

dtype('O')

This is another way to check a column's data type; 'O' stands for object.

Convert our date columns to datetime

When converting the strings to datetime, you need to tell pandas the format the original strings follow: %d for day, %m for month, %y for a two-digit year and %Y for a four-digit year.
For example:

  • 1/17/07 has the format "%m/%d/%y"
  • 17-1-2007 has the format "%d-%m-%Y"
landslides.date = pd.to_datetime(landslides.date, format="%m/%d/%y")
landslides.date.head()

Output:

0 2007-03-02
1 2007-03-22
2 2007-04-06
3 2007-04-14
4 2007-04-15
Name: date, dtype: datetime64[ns]

If the dates are not all in one fixed format such as "%m/%d/%y" but are a mix of several formats, you can ask pandas to infer the correct format, like this:

landslides['date_parsed'] = pd.to_datetime(landslides['date'], infer_datetime_format=True)

Don't reach for this by default, though: pandas cannot always infer the format correctly, and inference is noticeably slower.
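
For reference, infer_datetime_format is deprecated in pandas 2.x; on those versions, one option for genuinely mixed formats is format='mixed'. A sketch, assuming a recent pandas:

# pandas >= 2.0: parse each element individually instead of inferring a single format
landslides['date_parsed'] = pd.to_datetime(landslides['date'], format='mixed')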

Select just the day of the month from our column

With datetime data you can pull out the day, month and year separately:

day_date = landslides.date.dt.day
month_date = landslides.date.dt.month
year_date = landslides.date.dt.year

Plot the day of the month to check the date parsing

Visualize the parsed dates:

# remove na's
day_date = day_date.dropna()

# plot the day of the month
sns.distplot(day_date, kde=False, bins=31)  # kde toggles the kernel density estimate; bins sets the number of histogram bins

Supplementary reading: a tutorial on time-series visualization.

Day 4: Character encodings

What are encodings?

When handling text in Python 3 you will run into two data types. The first is str, the default type for text:

# start with a string
before = "This is the euro symbol: €"

# check to see what datatype it is
type(before)

Output:

str

The other is bytes, a sequence of integers. You can encode a string into bytes:

# encode it to a different encoding, replacing characters that raise errors
after = before.encode("utf-8", errors="replace")

# check the type
type(after)

Output:

bytes

And of course you can decode the bytes back into a string:

after.decode('utf-8')

Output:

'This is the euro symbol: €'
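
The reverse direction is where encoding problems usually appear: decoding those UTF-8 bytes with an encoding that cannot represent the euro sign raises an error. For example:

# decoding UTF-8 bytes as ASCII fails, because the euro sign has no ASCII representation
after.decode("ascii")  # raises UnicodeDecodeError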

Reading in files with encoding problems

Reading a file that is not UTF-8 encoded will raise an error. One approach is to read just the first ten thousand bytes and let chardet guess the encoding:

# look at the first ten thousand bytes to guess the character encoding
with open("../input/kickstarter-projects/ks-projects-201801.csv", 'rb') as rawdata:
    result = chardet.detect(rawdata.read(10000))

# check what the character encoding might be
print(result)

Output:

{'encoding': 'Windows-1252', 'confidence': 0.73, 'language': ''}

In other words, chardet is 73% confident that the file is encoded as Windows-1252, so we can open the file with that encoding:

# read in the file with the encoding detected by chardet
kickstarter_2016 = pd.read_csv("../input/kickstarter-projects/ks-projects-201612.csv", encoding='Windows-1252')

# look at the first few lines
kickstarter_2016.head()

Saving your files with UTF-8 encoding

Files are saved as UTF-8 by default:

# save our file (will be saved as UTF-8 by default!)
kickstarter_2016.to_csv("ks-projects-201801-utf8.csv")

Day 5: Inconsistent Data Entry

Do some preliminary text pre-processing

First, check whether the City column contains inconsistent entries:

# get all the unique values in the 'City' column
cities = suicide_attacks['City'].unique()

# sort them alphabetically and then take a closer look
cities.sort()
cities

Converting everything to lower case and stripping leading and trailing whitespace removes roughly 80% of these entry inconsistencies:

# convert to lower case
suicide_attacks['City'] = suicide_attacks['City'].str.lower()
# remove leading and trailing white spaces
suicide_attacks['City'] = suicide_attacks['City'].str.strip()

Use fuzzy matching to correct inconsistent data entry

The fuzzywuzzy package can be used to spot strings that are nearly, but not exactly, the same.
Fuzzywuzzy returns a similarity ratio between two strings; the closer the ratio is to 100, the more similar the strings.

# get the top 10 closest matches to "d.i khan"
matches = fuzzywuzzy.process.extract("d.i khan", cities, limit=10, scorer=fuzzywuzzy.fuzz.token_sort_ratio)

# take a look at them
matches

Output:

[('d. i khan', 100),
 ('d.i khan', 100),
 ('d.g khan', 88),
 ('khanewal', 50),
 ('sudhanoti', 47),
 ('hangu', 46),
 ('kohat', 46),
 ('dara adam khel', 45),
 ('chaman', 43),
 ('mardan', 43)]

Matches with a ratio of 90 or above look safe to replace, so we can write a function to do the replacement:

# function to replace rows in the provided column of the provided dataframe
# that match the provided string above the provided ratio with the provided string
def replace_matches_in_column(df, column, string_to_match, min_ratio=90):
    # get a list of unique strings
    strings = df[column].unique()

    # get the top 10 closest matches to our input string
    matches = fuzzywuzzy.process.extract(string_to_match, strings,
                                         limit=10, scorer=fuzzywuzzy.fuzz.token_sort_ratio)

    # only keep matches with a ratio >= min_ratio
    close_matches = [match[0] for match in matches if match[1] >= min_ratio]

    # get the rows of all the close matches in our dataframe
    rows_with_matches = df[column].isin(close_matches)

    # replace all rows with close matches with the input string
    df.loc[rows_with_matches, column] = string_to_match

    # let us know the function's done
    print("All done!")

Then, for any given string, all of its close variants can be replaced in one call:

# use the function we just wrote to replace close matches to "d.i khan" with "d.i khan"
replace_matches_in_column(df=suicide_attacks, column='City', string_to_match="d.i khan")
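
As a final sanity check, listing the unique city names again should show that 'd. i khan' has been folded into 'd.i khan', while 'd.g khan' (a different city, which matched at only 88) is left alone:

# get all the unique values in the 'City' column again to confirm the fix
cities = suicide_attacks['City'].unique()
cities.sort()
cities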