pandas in Python

Series

  • A Series is a one-dimensional, array-like sequence paired with an array of labels called the index. The default index is 0, 1, 2, …, but you can also specify the index when creating the Series.
import pandas as pd

obj = pd.Series([4, 7, 5, -3])
obj
# 0 4
# 1 7
# 2 5
# 3 -3
# dtype: int64

obj2 = pd.Series([4, 7, 5, -3], index = ['a', 'b', 'c', 'd'])
obj2
# a 4
# b 7
# c 5
# d -3
# dtype: int64
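To illustrate why the index matters, here is a minimal sketch that selects values by label rather than by position. It rebuilds the `obj2` Series from above; the particular selections are illustrative additions, not part of the original notes.

```python
import pandas as pd

obj2 = pd.Series([4, 7, 5, -3], index=['a', 'b', 'c', 'd'])

single = obj2['a']           # look up one value by its label
subset = obj2[['c', 'a']]    # a list of labels returns a new Series in that order
positive = obj2[obj2 > 0]    # boolean filtering keeps the matching labels
has_b = 'b' in obj2          # membership tests check the index, not the values
```

Note that filtering keeps each surviving value attached to its original label.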
  • Applying NumPy functions or NumPy-like operations preserves the index-value links.
import numpy as np
np.exp(obj2)
# a 54.598150
# b 1096.633158
# c 148.413159
# d 0.049787
# dtype: float64
  • Another way to think about a Series is as a fixed-length, ordered dict that maps index labels to values, so you can also create a Series from an existing dict.
sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon':16000, 'Utah': 5000}
obj3 = pd.Series(sdata)
obj3

# Ohio 35000
# Texas 71000
# Oregon 16000
# Utah 5000
# dtype: int64
  • The isnull and notnull functions in pandas detect missing data. In the example below, 'California' is not a key in sdata, so its value is NaN.
states = ['California', 'Ohio', 'Oregon', 'Texas']
obj4 = pd.Series(sdata, index=states)
obj4
# California NaN
# Ohio 35000.0
# Oregon 16000.0
# Texas 71000.0
# dtype: float64

pd.isnull(obj4)
# California True
# Ohio False
# Oregon False
# Texas False
# dtype: bool

pd.notnull(obj4)
# California False
# Ohio True
# Oregon True
# Texas True
# dtype: bool
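The boolean results above are typically used for counting or filtering. The following sketch, which rebuilds `obj4` from the same data, shows both uses; the variable names `n_missing` and `present` are made up for this illustration.

```python
import pandas as pd

sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
obj4 = pd.Series(sdata, index=['California', 'Ohio', 'Oregon', 'Texas'])

n_missing = int(obj4.isnull().sum())   # True counts as 1, so this counts the NaN entries
present = obj4[obj4.notnull()]         # keep only the labels that have data
```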
  • A useful feature of Series is that arithmetic operations automatically align on index labels; a label present in only one operand yields NaN in the result.
obj3
# Ohio 35000
# Texas 71000
# Oregon 16000
# Utah 5000
# dtype: int64

obj4
# California NaN
# Ohio 35000.0
# Oregon 16000.0
# Texas 71000.0
# dtype: float64

obj3 + obj4
# California NaN
# Ohio 70000.0
# Oregon 32000.0
# Texas 142000.0
# Utah NaN
# dtype: float64
  • A Series's index can be changed in place by assignment.
obj
# 0 4
# 1 7
# 2 5
# 3 -3
# dtype: int64

obj.index = ['Bob', 'Steve', 'Jeff', 'Ryan']
obj
# Bob 4
# Steve 7
# Jeff 5
# Ryan -3
# dtype: int64

DataFrame

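  • Thinking of a DataFrame as an Excel spreadsheet is the most intuitive way to understand it.
  • One way to build a DataFrame is from a dict whose values are equal-length lists.
  • Like a Series, a DataFrame is automatically assigned an index.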
data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'], 
'year': [2000, 2001, 2002, 2001, 2002, 2003],
'pop': [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}

frame = pd.DataFrame(data)

frame

# state year pop
# 0 Ohio 2000 1.5
# 1 Ohio 2001 1.7
# 2 Ohio 2002 3.6
# 3 Nevada 2001 2.4
# 4 Nevada 2002 2.9
# 5 Nevada 2003 3.2
  • If you pass columns, the DataFrame's columns are arranged in the order you specify.
pd.DataFrame(data, columns=['year', 'state', 'pop'])
# year state pop
# 0 2000 Ohio 1.5
# 1 2001 Ohio 1.7
# 2 2002 Ohio 3.6
# 3 2001 Nevada 2.4
# 4 2002 Nevada 2.9
# 5 2003 Nevada 3.2
  • Extracting a column from a DataFrame returns a Series; you can retrieve it either with dict-like notation or as an attribute.
frame['state']
# 0 Ohio
# 1 Ohio
# 2 Ohio
# 3 Nevada
# 4 Nevada
# 5 Nevada
# Name: state, dtype: object

frame.year
# 0 2000
# 1 2001
# 2 2002
# 3 2001
# 4 2002
# 5 2003
# Name: year, dtype: int64
  • To extract a row, use loc with the row's index label.
frame.loc[1]
# state Ohio
# year 2001
# pop 1.7
# Name: 1, dtype: object
  • If you assign a list or array to a column, its length must match the length of the DataFrame. If you assign a Series, its labels are aligned to the DataFrame's index, and positions without a match are filled with missing values.
frame['debt'] = 0
frame
# state year pop debt
# 0 Ohio 2000 1.5 0
# 1 Ohio 2001 1.7 0
# 2 Ohio 2002 3.6 0
# 3 Nevada 2001 2.4 0
# 4 Nevada 2002 2.9 0
# 5 Nevada 2003 3.2 0

values = np.arange(6)  # avoid shadowing the built-in name `list`
frame['debt'] = values
frame
# state year pop debt
# 0 Ohio 2000 1.5 0
# 1 Ohio 2001 1.7 1
# 2 Ohio 2002 3.6 2
# 3 Nevada 2001 2.4 3
# 4 Nevada 2002 2.9 4
# 5 Nevada 2003 3.2 5

val = pd.Series([-1.2, -1.3, -1.5], index = [1, 3, 5])
frame['debt'] = val
frame
# state year pop debt
# 0 Ohio 2000 1.5 NaN
# 1 Ohio 2001 1.7 -1.2
# 2 Ohio 2002 3.6 NaN
# 3 Nevada 2001 2.4 -1.3
# 4 Nevada 2002 2.9 NaN
# 5 Nevada 2003 3.2 -1.5
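The two assignment rules can be checked side by side. This is a small sketch on a made-up three-row frame (not the one used above): a list of the wrong length raises a ValueError, while a Series with missing labels is aligned and padded with NaN.

```python
import pandas as pd

frame = pd.DataFrame({'state': ['Ohio', 'Nevada', 'Utah'],
                      'pop': [1.5, 2.4, 3.2]})

# A list of the wrong length is rejected outright.
try:
    frame['debt'] = [1, 2]
    length_error = None
except ValueError as exc:
    length_error = exc

# A Series is aligned on the index instead; labels it lacks become NaN.
frame['debt'] = pd.Series([9.9], index=[1])
```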
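  • The column returned by indexing a DataFrame is a view on the underlying data, so in-place changes to that Series can show up in the DataFrame (under pandas's Copy-on-Write mode they do not). Use the copy method to get an explicit, independent copy.
  • If you pass a nested dict like the one below to DataFrame, pandas uses the outer dict's keys as the columns and the inner keys as the row index.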
pop = {'Nevada': {2001: 2.4, 2002: 2.9},
'Ohio': {2000: 1.5, 2001: 1.7, 2002: 3.6}}
frame3 = pd.DataFrame(pop)
frame3

# Nevada Ohio
# 2000 NaN 1.5
# 2001 2.4 1.7
# 2002 2.9 3.6
  • If a DataFrame's index and columns have their name attributes set, those names are displayed as well.
frame3.index.name = 'year'; frame3.columns.name = 'state'
frame3

# state Nevada Ohio
# year
# 2000 NaN 1.5
# 2001 2.4 1.7
# 2002 2.9 3.6
  • Index objects are immutable.
obj = pd.Series(range(3), index=['a', 'b', 'c'])
index = obj.index
index
# Index(['a', 'b', 'c'], dtype='object')
index[1]
# 'b'
index[1] = 'd'
# TypeError: Index does not support mutable operations
  • An Index therefore behaves not only like an array but also somewhat like a set; unlike a set, however, an Index can contain duplicate labels.
obj = pd.Series(range(3), index=['a', 'c', 'c'])
index = obj.index
index
# Index(['a', 'c', 'c'], dtype='object')
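The set-like side of an Index shows up in methods such as `union` and `intersection`. Here is a brief sketch with two hypothetical indexes, `left` and `right`, chosen only for illustration.

```python
import pandas as pd

left = pd.Index(['a', 'b', 'c'])
right = pd.Index(['c', 'd'])

both = left.intersection(right)   # labels present in both indexes
either = left.union(right)        # all labels from either index
is_member = 'b' in left           # set-style membership test
```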

Essential Functionality

Reindexing: reindex

  • reindex creates a new object with the data rearranged to match the new index; labels that were not present in the original introduce missing values.
obj = pd.Series([4.5, 7.2, -5.3, 3.6], index=['d', 'b', 'a', 'c'])
obj
# d 4.5
# b 7.2
# a -5.3
# c 3.6
# dtype: float64

obj2 = obj.reindex(['a', 'b', 'c', 'd', 'e'])
obj2
# a -5.3
# b 7.2
# c 3.6
# d 4.5
# e NaN
# dtype: float64
  • For a DataFrame, reindex can change the rows, the columns, or both.
frame = pd.DataFrame(np.arange(9).reshape(3, 3),
index=['a', 'c', 'd'],
columns=['Ohio', 'Texas', 'California'])
frame
# Ohio Texas California
# a 0 1 2
# c 3 4 5
# d 6 7 8

frame2 = frame.reindex(['a', 'b', 'c', 'd'])
frame2
# Ohio Texas California
# a 0.0 1.0 2.0
# b NaN NaN NaN
# c 3.0 4.0 5.0
# d 6.0 7.0 8.0

states = ['Texas', 'Utah', 'California']
frame3 = frame.reindex(columns=states)
frame3
# Texas Utah California
# a 1 NaN 2
# c 4 NaN 5
# d 7 NaN 8
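When NaN is not the desired filler, reindex accepts a `method` argument (for ordered data) and a `fill_value` argument. The sketch below uses small made-up Series to show both; treat it as an illustration rather than a complete tour of the options.

```python
import pandas as pd

colors = pd.Series(['blue', 'purple', 'yellow'], index=[0, 2, 4])
filled = colors.reindex(range(6), method='ffill')   # carry the last seen value forward

obj = pd.Series([4.5, 7.2], index=['a', 'b'])
padded = obj.reindex(['a', 'b', 'c'], fill_value=0.0)  # use 0.0 instead of NaN for new labels
```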

Dropping Entries

  • For a Series, drop returns a new object with the specified labels removed from the axis.
obj = pd.Series(np.arange(5.), index=['a', 'b', 'c', 'd', 'e'])
obj
# a 0.0
# b 1.0
# c 2.0
# d 3.0
# e 4.0
# dtype: float64
new_obj = obj.drop('c')
new_obj
# a 0.0
# b 1.0
# d 3.0
# e 4.0
# dtype: float64
obj.drop(['b', 'c'])
# a 0.0
# d 3.0
# e 4.0
# dtype: float64
  • For a DataFrame, labels can be dropped from either axis.
  • To drop rows, pass the row labels to drop.
  • To drop columns, specify axis=1.
data = pd.DataFrame(np.arange(16).reshape(4, 4),
index=['Ohio', 'Colorado', 'Utah', 'New York'],
columns=['one', 'two', 'three', 'four'])
data
# one two three four
# Ohio 0 1 2 3
# Colorado 4 5 6 7
# Utah 8 9 10 11
# New York 12 13 14 15

data.drop(['Colorado', 'Ohio'])
# one two three four
# Utah 8 9 10 11
# New York 12 13 14 15

data.drop('two', axis=1)

# one three four
# Ohio 0 2 3
# Colorado 4 6 7
# Utah 8 10 11
# New York 12 14 15
  • drop can also modify the Series or DataFrame in place instead of returning a new object; set inplace=True.
obj.drop('c', inplace=True)
obj
# a 0.0
# b 1.0
# d 3.0
# e 4.0
# dtype: float64
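As an alternative to the axis argument, drop also accepts `index=` and `columns=` keywords, which some readers find easier to scan. A minimal sketch on the same 4x4 frame (the result names are illustrative):

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(16).reshape(4, 4),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns=['one', 'two', 'three', 'four'])

no_two = data.drop(columns=['two'])                      # same effect as data.drop('two', axis=1)
trimmed = data.drop(index=['Ohio'], columns=['four'])    # rows and columns in one call
```

Neither call modifies `data`, because inplace is left at its default of False.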

Indexing, Selection, and Filtering

  • For a Series, you can select entries by integer or by label. Slicing with labels differs from normal Python slicing in that the endpoint is included:
obj = pd.Series(np.arange(4.), index=['a', 'b', 'c', 'd'])
obj
# a 0.0
# b 1.0
# c 2.0
# d 3.0
# dtype: float64

obj['b':'c']
# b 1.0
# c 2.0
# dtype: float64

obj[1:2]
# b 1.0
# dtype: float64
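The same contrast can be written with loc and iloc, which makes it explicit whether a slice is by label or by position. This sketch reuses the Series above and also assigns through a label slice:

```python
import numpy as np
import pandas as pd

obj = pd.Series(np.arange(4.), index=['a', 'b', 'c', 'd'])

by_label = obj.loc['b':'c']      # label slice: both endpoints included
by_position = obj.iloc[1:2]      # position slice: end excluded, like a Python list

obj.loc['b':'c'] = 5             # assignment through a label slice updates both entries
```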
  • For a DataFrame, indexing with a single value or a sequence selects one or more columns. Rows can be selected with slice syntax such as data[:2], and there are some special cases, such as passing a boolean array.
data = pd.DataFrame(np.arange(16).reshape((4, 4)),
index=['Ohio', 'Colorado', 'Utah', 'New York'],
columns=['one', 'two', 'three', 'four'])
data
# one two three four
# Ohio 0 1 2 3
# Colorado 4 5 6 7
# Utah 8 9 10 11
# New York 12 13 14 15

data['one']
# Ohio 0
# Colorado 4
# Utah 8
# New York 12
# Name: one, dtype: int64

data[:2]
# one two three four
# Ohio 0 1 2 3
# Colorado 4 5 6 7

data[data['three']>6]
# one two three four
# Utah 8 9 10 11
# New York 12 13 14 15
  • Another option is to index with a boolean DataFrame.
data < 5
# one two three four
# Ohio True True True True
# Colorado True False False False
# Utah False False False False
# New York False False False False

data[data<5] = 0
data
# one two three four
# Ohio 0 0 0 0
# Colorado 0 5 6 7
# Utah 8 9 10 11
# New York 12 13 14 15

loc and iloc

  • loc indexes by label.
  • iloc indexes by integer position.
data
# one two three four
# Ohio 0 0 0 0
# Colorado 0 5 6 7
# Utah 8 9 10 11
# New York 12 13 14 15

data.loc['Colorado']
# one 0
# two 5
# three 6
# four 7
# Name: Colorado, dtype: int64

data.iloc[1]
# one 0
# two 5
# three 6
# four 7
# Name: Colorado, dtype: int64

data.loc['Colorado', ['two', 'three']]
# two 5
# three 6
# Name: Colorado, dtype: int64

data.iloc[1, [1,2]]
# two 5
# three 6
# Name: Colorado, dtype: int64
  • loc and iloc also support slicing.
data.loc[:'Utah', 'two']
# Ohio 0
# Colorado 5
# Utah 9
# Name: two, dtype: int64

data.iloc[:3, 1]
# Ohio 0
# Colorado 5
# Utah 9
# Name: two, dtype: int64
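loc also accepts a boolean mask for the rows, combined with a list of column labels. The sketch below rebuilds the original 4x4 frame (before any values were zeroed out) so that it runs on its own; the threshold of 5 is an arbitrary choice for illustration.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(16).reshape(4, 4),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns=['one', 'two', 'three', 'four'])

# Rows where column 'three' exceeds 5, restricted to two columns.
selected = data.loc[data['three'] > 5, ['one', 'two']]

# All rows, first three columns by position.
first_three = data.iloc[:, :3]
```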

Arithmetic and Data Alignment

  • When you add two objects whose indexes differ, the index of the result is the union of the two indexes.
  • For a DataFrame, alignment happens on both the rows and the columns.
s1 = pd.Series([7.3, -2.5, 3.4, 1.5], index=['a', 'c', 'd', 'e'])
s2 = pd.Series([2.1, 3.6, -1.5, 4, 3.1], index=['a', 'c', 'e', 'f', 'g'])
s1
# a 7.3
# c -2.5
# d 3.4
# e 1.5
# dtype: float64

s2
# a 2.1
# c 3.6
# e -1.5
# f 4.0
# g 3.1
# dtype: float64

s1+s2
# a 9.4
# c 1.1
# d NaN
# e 0.0
# f NaN
# g NaN
# dtype: float64

df1 = pd.DataFrame(np.arange(9.).reshape((3, 3)), columns=['b', 'c', 'd'],
index=['Ohio', 'Texas', 'Colorado'])
df2 = pd.DataFrame(np.arange(12.).reshape((4, 3)), columns=['b', 'd', 'e'],
index=['Utah', 'Ohio', 'Texas', 'Oregon'])
df1
# b c d
# Ohio 0.0 1.0 2.0
# Texas 3.0 4.0 5.0
# Colorado 6.0 7.0 8.0
df2
# b d e
# Utah 0.0 1.0 2.0
# Ohio 3.0 4.0 5.0
# Texas 6.0 7.0 8.0
# Oregon 9.0 10.0 11.0
df1+df2
# b c d e
# Colorado NaN NaN NaN NaN
# Ohio 3.0 NaN 6.0 NaN
# Oregon NaN NaN NaN NaN
# Texas 9.0 NaN 12.0 NaN
# Utah NaN NaN NaN NaN
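  • Passing fill_value to an arithmetic method substitutes that value wherever a label is missing from one of the two objects. With fill_value=0, the result at such a position is simply the value from the other object; positions missing from both objects remain NaN.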
df1.add(df2, fill_value=0)
# b c d e
# Colorado 6.0 7.0 8.0 NaN
# Ohio 3.0 1.0 6.0 5.0
# Oregon 9.0 NaN 10.0 11.0
# Texas 9.0 4.0 12.0 8.0
# Utah 0.0 NaN 1.0 2.0
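The same fill_value behaviour applies to Series. As a quick check, this sketch reuses `s1` and `s2` from earlier in this section and compares the plain `+` operator with the `add` method:

```python
import pandas as pd

s1 = pd.Series([7.3, -2.5, 3.4, 1.5], index=['a', 'c', 'd', 'e'])
s2 = pd.Series([2.1, 3.6, -1.5, 4, 3.1], index=['a', 'c', 'e', 'f', 'g'])

plain = s1 + s2                     # labels found in only one Series give NaN
filled = s1.add(s2, fill_value=0)   # a missing side is treated as 0 before adding
```

With fill_value=0, the labels 'd', 'f', and 'g' keep the value from whichever Series contains them.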

Operations Between DataFrame and Series

  • By default, the Series's index is matched against the DataFrame's columns, and the operation is broadcast down the rows.
frame = pd.DataFrame(np.arange(12.).reshape((4, 3)),
columns=['b','d','e'],
index=['Utah', 'Ohio', 'Texas', 'Oregon'])
series = frame.iloc[0]

frame
# b d e
# Utah 0.0 1.0 2.0
# Ohio 3.0 4.0 5.0
# Texas 6.0 7.0 8.0
# Oregon 9.0 10.0 11.0

series
# b 0.0
# d 1.0
# e 2.0
# Name: Utah, dtype: float64

frame - series
# b d e
# Utah 0.0 0.0 0.0
# Ohio 3.0 3.0 3.0
# Texas 6.0 6.0 6.0
# Oregon 9.0 9.0 9.0
  • If a label is missing from either the DataFrame's columns or the Series's index, the result is reindexed to the union of the two.
series2 = pd.Series(range(3), index=['b', 'e', 'f'])
frame + series2
# b d e f
# Utah 0.0 NaN 3.0 NaN
# Ohio 3.0 NaN 6.0 NaN
# Texas 6.0 NaN 9.0 NaN
# Oregon 9.0 NaN 12.0 NaN
  • To broadcast over the columns instead, matching on the rows, you must use one of the arithmetic methods and specify the axis.
frame
# b d e
# Utah 0.0 1.0 2.0
# Ohio 3.0 4.0 5.0
# Texas 6.0 7.0 8.0
# Oregon 9.0 10.0 11.0
series3 = frame['d']
series3
# Utah 1.0
# Ohio 4.0
# Texas 7.0
# Oregon 10.0
# Name: d, dtype: float64
frame.sub(series3, axis='index')
# b d e
# Utah -1.0 0.0 1.0
# Ohio -1.0 0.0 1.0
# Texas -1.0 0.0 1.0
# Oregon -1.0 0.0 1.0
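A common use of row-wise broadcasting is centering each row on its own mean. The following is a minimal sketch on the same frame; `row_means` and `centered` are names introduced here for illustration.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(12.).reshape((4, 3)),
                     columns=['b', 'd', 'e'],
                     index=['Utah', 'Ohio', 'Texas', 'Oregon'])

row_means = frame.mean(axis='columns')          # one mean per row, indexed by row label
centered = frame.sub(row_means, axis='index')   # match on the row index, broadcast across columns
```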

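Function Application and Mapping

  • To apply a function that operates on one-dimensional arrays to each column or row, use DataFrame's apply method.
  • By default the function is applied to each column; set axis to apply it to each row instead.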
frame
# b d e
# Utah 0.0 1.0 2.0
# Ohio 3.0 4.0 5.0
# Texas 6.0 7.0 8.0
# Oregon 9.0 10.0 11.0

f = lambda x: x.max() - x.min()
frame.apply(f)
# b 9.0
# d 9.0
# e 9.0
# dtype: float64

frame.apply(f, axis='columns')
# Utah 2.0
# Ohio 2.0
# Texas 2.0
# Oregon 2.0
# dtype: float64
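  • For a function applied to every element, use applymap (deprecated since pandas 2.1 in favor of DataFrame.map).
  • The example below formats each element as a string with two decimal places.
fmt = lambda x: '%.2f' % x  # named fmt to avoid shadowing the built-in format
frame.applymap(fmt)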
# b d e
# Utah 0.00 1.00 2.00
# Ohio 3.00 4.00 5.00
# Texas 6.00 7.00 8.00
# Oregon 9.00 10.00 11.00
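The function passed to apply does not have to return a scalar; it can return a Series, in which case the results are assembled into a DataFrame. The sketch below also shows Series.map, which applies an element-wise function to a single column. The helper name `min_max` is hypothetical.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(12.).reshape((4, 3)),
                     columns=['b', 'd', 'e'],
                     index=['Utah', 'Ohio', 'Texas', 'Oregon'])

def min_max(col):
    # Return two statistics per column as a labeled Series.
    return pd.Series([col.min(), col.max()], index=['min', 'max'])

summary = frame.apply(min_max)                       # rows 'min' and 'max', one column per input column
formatted = frame['e'].map(lambda x: '%.2f' % x)     # element-wise formatting of one column
```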

Sorting and Ranking

  • sort_index sorts by index labels. By default it sorts the row index; set axis=1 to sort the columns. The order is ascending by default; set ascending=False for descending order.
frame = pd.DataFrame(np.arange(8).reshape((2, 4)),
index=['three', 'one'],
columns=['d', 'a', 'b', 'c'])
frame
# d a b c
# three 0 1 2 3
# one 4 5 6 7
frame.sort_index()
# d a b c
# one 4 5 6 7
# three 0 1 2 3
frame.sort_index(axis=1)
# a b c d
# three 1 2 3 0
# one 5 6 7 4
frame.sort_index(axis=1, ascending=False)
# d c b a
# three 0 3 2 1
# one 4 7 6 5
  • Use sort_values to sort by value; missing values are placed at the end by default.
  • For a DataFrame, pass one or more column names via by; multiple columns are passed as a list.
obj = pd.Series([4, np.nan, 7, np.nan, -3, 2])
obj.sort_values()
# 4 -3.0
# 5 2.0
# 0 4.0
# 2 7.0
# 1 NaN
# 3 NaN
# dtype: float64

frame = pd.DataFrame({'b': [4, 7, -3, 2], 'a': [0, 1, 0, 1]})
frame
# b a
# 0 4 0
# 1 7 1
# 2 -3 0
# 3 2 1

frame.sort_values(by='b')
# b a
# 2 -3 0
# 3 2 1
# 0 4 0
# 1 7 1

frame.sort_values(by=['a', 'b'])
# b a
# 2 -3 0
# 0 4 0
# 3 2 1
# 1 7 1
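sort_values takes further options beyond by. As a brief sketch, the Series from above can be sorted in descending order, or with the missing values placed first via `na_position`:

```python
import numpy as np
import pandas as pd

obj = pd.Series([4, np.nan, 7, np.nan, -3, 2])

descending = obj.sort_values(ascending=False)      # largest first; NaN still goes last
nan_first = obj.sort_values(na_position='first')   # ascending, with NaN moved to the front
```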
  • rank gives each value's position in the sorted order of the Series. By default, tied values are assigned the mean of the ranks they span.
obj = pd.Series([7, -5, 7, 4, 2, 0, 4])
obj
# 0 7
# 1 -5
# 2 7
# 3 4
# 4 2
# 5 0
# 6 4
# dtype: int64
obj.sort_values()
# 1 -5
# 5 0
# 4 2
# 3 4
# 6 4
# 0 7
# 2 7
# dtype: int64
obj.rank()
# 0 6.5
# 1 1.0
# 2 6.5
# 3 4.5
# 4 3.0
# 5 2.0
# 6 4.5
# dtype: float64
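The tie-breaking rule is controlled by the `method` argument. This sketch applies three alternatives to the same Series; the comments describe how each treats the tied values 7 and 4.

```python
import pandas as pd

obj = pd.Series([7, -5, 7, 4, 2, 0, 4])

by_first = obj.rank(method='first')                  # ties ranked in order of appearance
by_min = obj.rank(method='min')                      # ties share the lowest rank in the group
desc_max = obj.rank(ascending=False, method='max')   # descending; ties share the highest rank
```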
  • For a DataFrame, rank can be computed over the rows or the columns.
frame = pd.DataFrame({'b': [4.3, 7, -3, 2],
'a': [0, 1, 0, 1],
'c': [-2, 5, 8, -2.5]})
frame
# b a c
# 0 4.3 0 -2.0
# 1 7.0 1 5.0
# 2 -3.0 0 8.0
# 3 2.0 1 -2.5
frame.rank(axis='columns')
# b a c
# 0 3.0 2.0 1.0
# 1 3.0 1.0 2.0
# 2 1.0 2.0 3.0
# 3 3.0 2.0 1.0

Duplicate Labels

  • The is_unique property of an index tells you whether its labels are unique.
obj = pd.Series(np.arange(5), index=['a', 'a', 'b', 'b', 'c'])
obj.index.is_unique
# False
  • Selecting a label that appears more than once returns a Series; a unique label returns a scalar value.
obj['a']
# a 0
# a 1
# dtype: int64
obj['c']
# 4
  • The same applies to a DataFrame: a duplicated label returns a DataFrame, while a unique one returns a Series.
df = pd.DataFrame(np.random.randn(5, 3), index=['a', 'a', 'b', 'b', 'c'])
df
# 0 1 2
# a 0.413265 -1.520993 0.211549
# a 0.192379 0.119111 -0.637629
# b -0.400623 0.455354 0.059163
# b -0.278948 1.009609 -0.859333
# c -0.221088 -1.393599 -0.311840
df.loc['a']
# 0 1 2
# a 0.413265 -1.520993 0.211549
# a 0.192379 0.119111 -0.637629
df.loc['c']
# 0 -0.221088
# 1 -1.393599
# 2 -0.311840
# Name: c, dtype: float64
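Because the return type depends on whether a label is duplicated, it can help to detect or remove duplicates up front. One possible approach, sketched here, uses `Index.duplicated` to keep only the first occurrence of each label:

```python
import numpy as np
import pandas as pd

obj = pd.Series(np.arange(5), index=['a', 'a', 'b', 'b', 'c'])

dup_mask = obj.index.duplicated()   # True for every repeat after a label's first occurrence
first_only = obj[~dup_mask]         # keep the first entry for each label
```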

Unique Values and Value Counts

  • Several methods extract information from a one-dimensional Series. unique returns the distinct values.
obj = pd.Series(['c', 'a', 'd', 'a', 'a', 'b', 'b', 'c', 'c'])
obj
# 0 c
# 1 a
# 2 d
# 3 a
# 4 a
# 5 b
# 6 b
# 7 c
# 8 c
# dtype: object
obj.unique()
# array(['c', 'a', 'd', 'b'], dtype=object)
  • value_counts computes how many times each value occurs in a Series.
obj.value_counts()
# c 3
# a 3
# b 2
# d 1
# dtype: int64
  • isin performs a vectorized set-membership check: it tests whether each value in a Series (or DataFrame column) belongs to a given set of values, which makes it useful for filtering a dataset.
obj.isin(['b', 'c'])
# 0 True
# 1 False
# 2 False
# 3 False
# 4 False
# 5 True
# 6 True
# 7 True
# 8 True
# dtype: bool
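The boolean mask from isin is usually fed straight back into indexing. A minimal sketch using the same Series follows; the names `mask`, `subset`, and `counts` are illustrative.

```python
import pandas as pd

obj = pd.Series(['c', 'a', 'd', 'a', 'a', 'b', 'b', 'c', 'c'])

mask = obj.isin(['b', 'c'])       # True where the value is 'b' or 'c'
subset = obj[mask]                # keep only those entries
counts = subset.value_counts()    # tally what remains
```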

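Summarizing and Descriptive Statistics

  • Reductions such as sum or mean extract a single value from a Series, or a Series of values from the rows or columns of a DataFrame.
  • NA (missing) values are excluded from the computation unless the entire slice is NA. Pass skipna=False to disable this exclusion.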
df = pd.DataFrame([[1.4, np.nan], [7.1, -4.5],
[np.nan, np.nan], [0.75, -1.3]],
index=['a', 'b', 'c', 'd'],
columns=['one', 'two'])
df
# one two
# a 1.40 NaN
# b 7.10 -4.5
# c NaN NaN
# d 0.75 -1.3

df.sum()
# one 9.25
# two -5.80
# dtype: float64

df.sum(axis='columns')
# a 1.40
# b 2.60
# c 0.00
# d -0.55
# dtype: float64

df.mean(axis='columns')
# a 1.400
# b 1.300
# c NaN
# d -0.275
# dtype: float64

df.mean(axis = 'columns', skipna = False)
# a NaN
# b 1.300
# c NaN
# d -0.275
# dtype: float64
  • idxmin and idxmax return the index labels of the minimum and maximum values, and cumsum computes cumulative sums.
df.idxmax()
# one b
# two d
# dtype: object

df.cumsum()
# one two
# a 1.40 NaN
# b 8.50 -4.5
# c NaN NaN
# d 9.25 -5.8
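For completeness, here is a short sketch of idxmin on the same frame, together with count, which reports how many non-missing values each column has. Both skip NaN entries by default.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1.4, np.nan], [7.1, -4.5],
                   [np.nan, np.nan], [0.75, -1.3]],
                  index=['a', 'b', 'c', 'd'],
                  columns=['one', 'two'])

lowest = df.idxmin()       # row label of the minimum in each column
non_missing = df.count()   # number of non-NaN values per column
```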
  • describe produces several summary statistics in one call.
  • For non-numeric data, describe produces a different set of summary statistics.
df.describe()
# one two
# count 3.000000 2.000000
# mean 3.083333 -2.900000
# std 3.493685 2.262742
# min 0.750000 -4.500000
# 25% 1.075000 -3.700000
# 50% 1.400000 -2.900000
# 75% 4.250000 -2.100000
# max 7.100000 -1.300000

obj = pd.Series(['a', 'a', 'b', 'c'] * 4)
obj
# 0 a
# 1 a
# 2 b
# 3 c
# 4 a
# 5 a
# 6 b
# 7 c
# 8 a
# 9 a
# 10 b
# 11 c
# 12 a
# 13 a
# 14 b
# 15 c
# dtype: object

obj.describe()
# count 16
# unique 3
# top a
# freq 8
# dtype: object