Original source: Pandas 中文网 (the Pandas Chinese site)
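Importing Pandas and NumPy
This section is a short introduction to help Pandas newcomers get started quickly. Pandas and NumPy are imported as follows:
import numpy as np
import pandas as pd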
Object creation
See the Intro to data structures documentation for details.
When a Series is created from a list of values, Pandas generates a default integer index:
In [3]: s = pd.Series([1, 3, 5, np.nan, 6, 8])
In [4]: s
Out[4]:
0 1.0
1 3.0
2 5.0
3 NaN
4 6.0
5 8.0
dtype: float64
Create a DataFrame from a NumPy array, with a datetime index and labeled columns:
In [5]: dates = pd.date_range('20200401', periods=6)
In [6]: dates
Out[6]:
DatetimeIndex(['2020-04-01', '2020-04-02', '2020-04-03', '2020-04-04',
'2020-04-05', '2020-04-06'],
dtype='datetime64[ns]', freq='D')
In [7]: df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
In [8]: df
Out[8]:
A B C D
2020-04-01 -0.859006 0.435218 0.074366 1.433158
2020-04-02 0.707593 -1.753789 1.038006 -0.127971
2020-04-03 1.417718 -0.555007 0.157268 0.378105
2020-04-04 -1.015458 0.561142 0.892268 0.412834
2020-04-05 0.516340 -0.805451 0.546185 0.785496
2020-04-06 0.554840 -2.502272 -1.278944 0.211286
Create a DataFrame from a dict of Series-like objects:
In [9]: df2 = pd.DataFrame({'A': 1.,
...: 'B': pd.Timestamp('20200401'),
...: 'C': pd.Series(1, index=list(range(4)), dtype='float32'),
...: 'D': np.array([3] * 4, dtype='int32'),
...: 'E': pd.Categorical(["test", "train", "test", "train"]),
...: 'F': 'foo'})
In [10]: df2
Out[10]:
A B C D E F
0 1.0 2020-04-01 1.0 3 test foo
1 1.0 2020-04-01 1.0 3 train foo
2 1.0 2020-04-01 1.0 3 test foo
3 1.0 2020-04-01 1.0 3 train foo
The columns of the resulting DataFrame have different dtypes:
In [11]: df2.dtypes
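As a minimal sketch of the point above, the following builds a small frame whose columns deliberately use different types and inspects `dtypes`. The name `mixed` and its values are illustrative assumptions rather than part of the original walkthrough. The exact names of the datetime and string dtypes can differ between Pandas versions, so no expected output is shown.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: each column is given a different type on purpose.
mixed = pd.DataFrame({
    "A": 1.0,                                           # scalar, broadcast to every row
    "B": pd.Timestamp("2020-04-01"),                    # datetime column
    "C": pd.Series(1, index=range(4), dtype="float32"),
    "D": np.array([3] * 4, dtype="int32"),
    "E": pd.Categorical(["test", "train", "test", "train"]),
})

# dtypes returns a Series that maps each column name to its dtype.
column_types = mixed.dtypes
print(column_types)
```

Each column keeps its own dtype, which is why a single frame can hold floats, integers, timestamps, and categoricals side by side.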
IPython supports tab completion for column names and public attributes. Columns A, B, C, D, and E can all be tab-completed; for brevity, only a subset of the completions is shown here:
In [12]: df2.<TAB> # noqa: E225, E999
df2.A df2.bool
df2.abs df2.boxplot
df2.add df2.C
df2.add_prefix df2.clip
df2.add_suffix df2.clip_lower
df2.align df2.clip_upper
df2.all df2.columns
df2.any df2.combine
df2.append df2.combine_first
df2.apply df2.compound
df2.applymap df2.consolidate
df2.D
Viewing data
See the Basics documentation for details.
View the top and bottom rows of a DataFrame:
In [12]: df.head()
Out[12]:
A B C D
2020-04-01 -0.859006 0.435218 0.074366 1.433158
2020-04-02 0.707593 -1.753789 1.038006 -0.127971
2020-04-03 1.417718 -0.555007 0.157268 0.378105
2020-04-04 -1.015458 0.561142 0.892268 0.412834
2020-04-05 0.516340 -0.805451 0.546185 0.785496
In [13]: df.tail(3)
Out[13]:
A B C D
2020-04-04 -1.015458 0.561142 0.892268 0.412834
2020-04-05 0.516340 -0.805451 0.546185 0.785496
2020-04-06 0.554840 -2.502272 -1.278944 0.211286
Display the index and the column names:
In [14]: df.index
Out[14]:
DatetimeIndex(['2020-04-01', '2020-04-02', '2020-04-03', '2020-04-04',
'2020-04-05', '2020-04-06'],
dtype='datetime64[ns]', freq='D')
In [15]: df.columns
Out[15]: Index([u'A', u'B', u'C', u'D'], dtype='object')
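DataFrame.to_numpy() returns a NumPy representation of the underlying data. Note that this can be an expensive operation when the columns of the DataFrame have different dtypes, which reflects a fundamental difference between Pandas and NumPy: a NumPy array has one dtype for the entire array, while a DataFrame has one dtype per column. When you call DataFrame.to_numpy(), Pandas looks for a NumPy dtype that can hold all of the dtypes in the DataFrame. This may end up being object, which requires casting every value to a Python object.
For df, whose values are all floating point, DataFrame.to_numpy() is fast and does not require copying data:
In [16]: df.to_numpy()
For df2, which contains several dtypes, DataFrame.to_numpy() is comparatively expensive:
In [17]: df2.to_numpy()
Note: the output of DataFrame.to_numpy() does not include the index or column labels.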
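The dtype behaviour described above can be sketched with two small frames. The names `floats` and `mixed` and their contents are assumptions made up for illustration; the sketch only inspects the resulting array dtypes and does not measure performance.

```python
import numpy as np
import pandas as pd

# Homogeneous frame: every column is float64.
floats = pd.DataFrame(
    np.arange(6, dtype="float64").reshape(3, 2), columns=["A", "B"]
)

# Heterogeneous frame: a float column next to a string column.
mixed = pd.DataFrame({"A": [1.0, 2.0, 3.0], "E": ["one", "two", "three"]})

float_array = floats.to_numpy()  # one shared dtype, so the result stays float64
mixed_array = mixed.to_numpy()   # no common numeric dtype, so the result is object

print(float_array.dtype)
print(mixed_array.dtype)
print(float_array.shape)         # labels are dropped; only the 3 x 2 values remain
```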
describe() shows a quick statistical summary of the data:
In [18]: df.describe()
Out[18]:
A B C D
count 6.000000 6.000000 6.000000 6.000000
mean 0.220338 -0.770027 0.238191 0.515485
std 0.955282 1.203376 0.836357 0.538705
min -1.015458 -2.502272 -1.278944 -0.127971
25% -0.515170 -1.516705 0.095092 0.252991
50% 0.535590 -0.680229 0.351726 0.395469
75% 0.669405 0.187662 0.805747 0.692331
max 1.417718 0.561142 1.038006 1.433158
Transpose the data with DataFrame.T:
In [20]: df.T
Out[20]:
2020-04-01 2020-04-02 2020-04-03 2020-04-04 2020-04-05 2020-04-06
A -0.859006 0.707593 1.417718 -1.015458 0.516340 0.554840
B 0.435218 -1.753789 -0.555007 0.561142 -0.805451 -2.502272
C 0.074366 1.038006 0.157268 0.892268 0.546185 -1.278944
D 1.433158 -0.127971 0.378105 0.412834 0.785496 0.211286
Sort by an axis with DataFrame.sort_index():
In [23]: df.sort_index(axis=1, ascending=False)
Out[23]:
D C B A
2020-04-01 1.433158 0.074366 0.435218 -0.859006
2020-04-02 -0.127971 1.038006 -1.753789 0.707593
2020-04-03 0.378105 0.157268 -0.555007 1.417718
2020-04-04 0.412834 0.892268 0.561142 -1.015458
2020-04-05 0.785496 0.546185 -0.805451 0.516340
2020-04-06 0.211286 -1.278944 -2.502272 0.554840
Sort by values with DataFrame.sort_values():
In [24]: df.sort_values(by='B')
Out[24]:
A B C D
2020-04-06 0.554840 -2.502272 -1.278944 0.211286
2020-04-02 0.707593 -1.753789 1.038006 -0.127971
2020-04-05 0.516340 -0.805451 0.546185 0.785496
2020-04-03 1.417718 -0.555007 0.157268 0.378105
2020-04-01 -0.859006 0.435218 0.074366 1.433158
2020-04-04 -1.015458 0.561142 0.892268 0.412834
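To make the difference between the two sorting calls easier to check by eye, here is a small sketch with hand-picked integers instead of random numbers. The frame `scores` and its labels are hypothetical.

```python
import pandas as pd

# Hypothetical frame with columns deliberately out of alphabetical order.
scores = pd.DataFrame(
    {"A": [30, 10, 20], "C": [7, 8, 9], "B": [3, 1, 2]},
    index=["x", "y", "z"],
)

# axis=1 sorts the column labels; ascending=False reverses the order.
by_columns = scores.sort_index(axis=1, ascending=False)

# sort_values reorders the rows by the values in column "B".
by_values = scores.sort_values(by="B")

print(list(by_columns.columns))
print(list(by_values.index))
```

Both calls return a new DataFrame, so `scores` itself keeps its original row and column order.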
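Selection
Note: standard Python / NumPy expressions for selecting and setting are intuitive and convenient for interactive work, but for production code we recommend the optimized Pandas data access methods: .at, .iat, .loc, and .iloc.
Getting
Selecting a single column yields a Series, equivalent to df.A:
In [25]: df['A']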
Out[25]:
2020-04-01 -0.859006
2020-04-02 0.707593
2020-04-03 1.417718
2020-04-04 -1.015458
2020-04-05 0.516340
2020-04-06 0.554840
Freq: D, Name: A, dtype: float64
Slice rows with [ ]. An integer slice selects by position and excludes the end position, while a label slice includes both endpoints:
In [28]: df[0:3]
Out[28]:
A B C D
2020-04-01 -0.859006 0.435218 0.074366 1.433158
2020-04-02 0.707593 -1.753789 1.038006 -0.127971
2020-04-03 1.417718 -0.555007 0.157268 0.378105
In [29]: df['20200402':'20200404']
Out[29]:
A B C D
2020-04-02 0.707593 -1.753789 1.038006 -0.127971
2020-04-03 1.417718 -0.555007 0.157268 0.378105
2020-04-04 -1.015458 0.561142 0.892268 0.412834
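The two slicing styles can be compared on a frame whose values equal the row positions, which makes the endpoint behaviour visible. This is a sketch under the assumption of a daily DatetimeIndex like the one used in this walkthrough; the name `frame` is illustrative.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("2020-04-01", periods=6)
frame = pd.DataFrame({"A": np.arange(6)}, index=dates)

# Integer slice: positional, end position excluded (rows 0, 1, 2).
first_three = frame[0:3]

# Label slice on the DatetimeIndex: both endpoints are included.
april_2_to_4 = frame["2020-04-02":"2020-04-04"]

print(list(first_three["A"]))
print(list(april_2_to_4["A"]))
```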
Selection by label
See Selection by label for details.
Get a single row by label:
In [30]: df.loc[dates[0]]
Out[30]:
A -0.859006
B 0.435218
C 0.074366
D 1.433158
Name: 2020-04-01 00:00:00, dtype: float64
Select multiple columns by label:
In [36]: df.loc[:, ['A', 'B']]
Out[36]:
A B
2020-04-01 -0.859006 0.435218
2020-04-02 0.707593 -1.753789
2020-04-03 1.417718 -0.555007
2020-04-04 -1.015458 0.561142
2020-04-05 0.516340 -0.805451
2020-04-06 0.554840 -2.502272
Slice by label; both endpoints are included:
In [37]: df.loc['20200402': '20200404', ['A', 'B']]
Out[37]:
A B
2020-04-02 0.707593 -1.753789
2020-04-03 1.417718 -0.555007
2020-04-04 -1.015458 0.561142
Selecting a single row label reduces the dimensions of the returned object:
In [38]: df.loc['20200402', ['A', 'B']]
Out[38]:
A 0.707593
B -1.753789
Name: 2020-04-02 00:00:00, dtype: float64
Get a scalar value:
In [39]: df.loc[dates[0], 'A']
Out[39]: -0.8590064033920874
Get fast access to a scalar, equivalent to the previous method:
In [41]: df.at[dates[0], 'A']
Out[41]: -0.8590064033920874
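The label-based accessors above can be tried on deterministic data, where each result is easy to verify. The frame below is an assumption for illustration: it holds the numbers 0 to 23 laid out row by row, instead of the random values used in the transcripts.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("2020-04-01", periods=6)
frame = pd.DataFrame(
    np.arange(24, dtype="float64").reshape(6, 4),
    index=dates,
    columns=list("ABCD"),
)

row = frame.loc[dates[0]]                                 # one row label gives a Series
block = frame.loc["2020-04-02":"2020-04-04", ["A", "B"]]  # label slice keeps both ends
scalar_loc = frame.loc[dates[0], "A"]                     # general label lookup
scalar_at = frame.at[dates[0], "A"]                       # scalar-only accessor

print(block.shape)
print(scalar_loc == scalar_at)
```

`.at` accepts only a single row label and a single column label, which is what lets it skip the more general handling in `.loc`.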
Selection by position
See Selection by position for details.
Select by integer position:
In [45]: df.iloc[3]
Out[45]:
A -1.015458
B 0.561142
C 0.892268
D 0.412834
Name: 2020-04-04 00:00:00, dtype: float64
Slice with integers, as in NumPy/Python:
In [48]: df.iloc[3:5, 0:2]
Out[48]:
A B
2020-04-04 -1.015458 0.561142
2020-04-05 0.516340 -0.805451
Select by lists of integer positions, as in NumPy/Python:
In [52]: df.iloc[[1, 2, 4], [0, 2]]
Out[52]:
A C
2020-04-02 0.707593 1.038006
2020-04-03 1.417718 0.157268
2020-04-05 0.516340 0.546185
Slice rows explicitly:
In [53]: df.iloc[1:3, :]
Out[53]:
A B C D
2020-04-02 0.707593 -1.753789 1.038006 -0.127971
2020-04-03 1.417718 -0.555007 0.157268 0.378105
Slice columns explicitly:
In [54]: df.iloc[:, 1:3]
Out[54]:
B C
2020-04-01 0.435218 0.074366
2020-04-02 -1.753789 1.038006
2020-04-03 -0.555007 0.157268
2020-04-04 0.561142 0.892268
2020-04-05 -0.805451 0.546185
2020-04-06 -2.502272 -1.278944
Get a value explicitly:
In [55]: df.iloc[1, 1]
Out[55]: -1.7537894094589277
Get fast access to a scalar, equivalent to the previous method:
In [56]: df.iat[1, 1]
Out[56]: -1.7537894094589277
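The positional accessors follow the usual Python rule that a slice excludes its end position. A short sketch on deterministic data (an assumed 6 x 4 frame holding 0 to 23 row by row, not the random frame from the transcripts) shows this:

```python
import numpy as np
import pandas as pd

dates = pd.date_range("2020-04-01", periods=6)
frame = pd.DataFrame(
    np.arange(24, dtype="float64").reshape(6, 4),
    index=dates,
    columns=list("ABCD"),
)

row = frame.iloc[3]                        # fourth row, returned as a Series
block = frame.iloc[3:5, 0:2]               # rows 3-4 and columns 0-1 (ends excluded)
picked = frame.iloc[[1, 2, 4], [0, 2]]     # explicit lists of positions
scalar_iloc = frame.iloc[1, 1]             # general positional lookup
scalar_iat = frame.iat[1, 1]               # scalar-only accessor

print(block.shape, picked.shape)
print(scalar_iloc == scalar_iat)
```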
Boolean indexing
Use a single column's values to select data:
In [57]: df[df.A > 0]
Out[57]:
A B C D
2020-04-02 0.707593 -1.753789 1.038006 -0.127971
2020-04-03 1.417718 -0.555007 0.157268 0.378105
2020-04-05 0.516340 -0.805451 0.546185 0.785496
2020-04-06 0.554840 -2.502272 -1.278944 0.211286
Select the values in a DataFrame that meet a condition; values that do not are replaced with NaN:
In [58]: df[df > 0]
Out[58]:
A B C D
2020-04-01 NaN 0.435218 0.074366 1.433158
2020-04-02 0.707593 NaN 1.038006 NaN
2020-04-03 1.417718 NaN 0.157268 0.378105
2020-04-04 NaN 0.561142 0.892268 0.412834
2020-04-05 0.516340 NaN 0.546185 0.785496
2020-04-06 0.554840 NaN NaN 0.211286
Filter with isin():
In [59]: df2 = df.copy()
In [60]: df2['E'] = ['one', 'one', 'two', 'three', 'four', 'three']
In [61]: df2
Out[61]:
A B C D E
2020-04-01 -0.859006 0.435218 0.074366 1.433158 one
2020-04-02 0.707593 -1.753789 1.038006 -0.127971 one
2020-04-03 1.417718 -0.555007 0.157268 0.378105 two
2020-04-04 -1.015458 0.561142 0.892268 0.412834 three
2020-04-05 0.516340 -0.805451 0.546185 0.785496 four
2020-04-06 0.554840 -2.502272 -1.278944 0.211286 three
In [62]: df2[df2['E'].isin(['two', 'four'])]
Out[62]:
A B C D E
2020-04-03 1.417718 -0.555007 0.157268 0.378105 two
2020-04-05 0.516340 -0.805451 0.546185 0.785496 four
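The three filtering styles in this section differ in what they return: a row mask drops rows, a frame-wide mask keeps the shape and inserts NaN, and isin() builds a row mask from membership. A minimal sketch, using a made-up frame named `frame`:

```python
import pandas as pd

frame = pd.DataFrame({
    "A": [-1.0, 2.0, 3.0, -4.0],
    "E": ["one", "two", "three", "four"],
})

# Row mask from one column: only rows where A is positive survive.
positive_rows = frame[frame["A"] > 0]

# Frame-wide mask: the shape is kept and failing cells become NaN.
numeric = frame[["A"]]
masked = numeric[numeric > 0]

# Membership test: keep rows whose E value appears in the list.
chosen = frame[frame["E"].isin(["two", "four"])]

print(list(positive_rows.index))
print(int(masked["A"].isna().sum()))
print(list(chosen.index))
```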
Setting
Setting a new column automatically aligns the data by index:
In [63]: s1 = pd.Series([1, 2, 3, 4, 5, 6], index=pd.date_range('20200401', periods=6))
In [64]: s1
Out[64]:
2020-04-01 1
2020-04-02 2
2020-04-03 3
2020-04-04 4
2020-04-05 5
2020-04-06 6
Freq: D, dtype: int64
In [65]: df['F'] = s1
Set a value by label:
In [66]: df.at[dates[0], 'A'] = 0
Set a value by position:
In [67]: df.iat[0, 1] = 0
Set a column from a NumPy array:
In [69]: df.loc[:, 'D'] = np.array([5] * len(df))
The result of the assignments above:
In [70]: df
Out[70]:
A B C D F
2020-04-01 0.000000 0.000000 0.074366 5 1
2020-04-02 0.707593 -1.753789 1.038006 5 2
2020-04-03 1.417718 -0.555007 0.157268 5 3
2020-04-04 -1.015458 0.561142 0.892268 5 4
2020-04-05 0.516340 -0.805451 0.546185 5 5
2020-04-06 0.554840 -2.502272 -1.278944 5 6
Conditional assignment with a where operation:
In [78]: df2 = df.copy()
In [79]: df2[df2 > 0] = -df2
In [80]: df2
Out[80]:
A B C D F
2020-04-01 0.000000 0.000000 -0.074366 -5 -1
2020-04-02 -0.707593 -1.753789 -1.038006 -5 -2
2020-04-03 -1.417718 -0.555007 -0.157268 -5 -3
2020-04-04 -1.015458 -0.561142 -0.892268 -5 -4
2020-04-05 -0.516340 -0.805451 -0.546185 -5 -5
2020-04-06 -0.554840 -2.502272 -1.278944 -5 -6
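Two ideas from this section, index alignment when adding a column and conditional assignment through a boolean mask, can be sketched on a small frame. The names `frame`, `extra`, and `negated` are hypothetical, and the Series is deliberately given an index that only partly overlaps the frame so the alignment is visible.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("2020-04-01", periods=4)
frame = pd.DataFrame({"A": [1.0, -2.0, 3.0, -4.0]}, index=dates)

# Conditional assignment: flip the sign of every positive value.
negated = frame.copy()
negated[negated > 0] = -negated

# New column from a Series that starts one day later than the frame:
# values are matched by label, and the unmatched first row becomes NaN.
extra = pd.Series([10, 20, 30], index=pd.date_range("2020-04-02", periods=3))
frame["F"] = extra

print(list(negated["A"]))
print(frame["F"].isna().tolist())
```

Alignment happens on the index labels rather than on position, which is why the first value of `extra` lands on 2020-04-02 and not on the first row.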
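Missing data
Pandas primarily uses np.nan to represent missing data. By default, missing values are excluded from computations. See Missing data for details.
Reindexing with DataFrame.reindex
Reindexing lets you change, add, or delete the index on a specified axis. It returns a copy of the data and leaves the original unchanged.
In [81]: df1 = df.reindex(index=dates[0:4], columns=list(df.columns) + ['E'])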
In [82]: df1.loc[dates[0]:dates[1], 'E'] = 1
In [83]: df1
Out[83]:
A B C D F E
2020-04-01 0.000000 0.000000 0.074366 5 1 1.0
2020-04-02 0.707593 -1.753789 1.038006 5 2 1.0
2020-04-03 1.417718 -0.555007 0.157268 5 3 NaN
2020-04-04 -1.015458 0.561142 0.892268 5 4 NaN
Drop every row that contains a missing value:
In [84]: df1.dropna(how='any')
Out[84]:
A B C D F E
2020-04-01 0.000000 0.000000 0.074366 5 1 1.0
2020-04-02 0.707593 -1.753789 1.038006 5 2 1.0
Fill missing values:
In [85]: df1.fillna(value=5)
Out[85]:
A B C D F E
2020-04-01 0.000000 0.000000 0.074366 5 1 1.0
2020-04-02 0.707593 -1.753789 1.038006 5 2 1.0
2020-04-03 1.417718 -0.555007 0.157268 5 3 5.0
2020-04-04 -1.015458 0.561142 0.892268 5 4 5.0
Get a boolean mask of where values are NaN:
In [86]: pd.isna(df1)
Out[86]:
A B C D F E
2020-04-01 False False False False False False
2020-04-02 False False False False False False
2020-04-03 False False False False False True
2020-04-04 False False False False False True
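The missing-data steps above can be chained on a small deterministic frame to confirm that each call returns a new object. This is a sketch with assumed names (`frame`, `subset`) and values, not the frame from the transcripts.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("2020-04-01", periods=6)
frame = pd.DataFrame({"A": np.arange(6, dtype="float64")}, index=dates)

# reindex returns a new object; the added column "E" starts out as all NaN.
subset = frame.reindex(index=dates[0:4], columns=["A", "E"])
subset.loc[dates[0]:dates[1], "E"] = 1

complete_rows = subset.dropna(how="any")  # drop rows that contain any NaN
filled = subset.fillna(value=5)           # replace NaN with 5
mask = pd.isna(subset)                    # True where a value is missing

print(len(complete_rows))
print(list(filled["E"]))
print(list(mask["E"]))
```

Neither dropna() nor fillna() changes `subset` here; both return modified copies, in the same way that reindex() leaves `frame` untouched.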
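Operations
See the section on binary operations for details.
This walkthrough is incomplete; refer to the official Pandas documentation for the rest.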