3. Grouping (groupby)

Pandas' grouping functionality splits the rows of a table into groups according to the contents of one column, then applies a statistical function to each group, such as sum, mean, median, or standard deviation.

For example, with the .groupby() method we can group the data table below by the 'Company' column, then call .mean() to compute each group's average:

In [1]:
import pandas as pd
import numpy as np
d={'Company':['Google','Google','Google','Oracle','Oracle','Oracle','Twitter','Twitter','Twitter'],
   'Person':['Peter','Sam','Ann','Cindy','Tom','Lee','Gim','Will','Bob'], 'Sale':[200,300,275,250,270,300,350,220,250],
   'Wage':[4000,3870,3900,3500,3600,3870,3650,3220,3660]}
df=pd.DataFrame(d)
df
Out[1]:
Company Person Sale Wage
0 Google Peter 200 4000
1 Google Sam 300 3870
2 Google Ann 275 3900
3 Oracle Cindy 250 3500
4 Oracle Tom 270 3600
5 Oracle Lee 300 3870
6 Twitter Gim 350 3650
7 Twitter Will 220 3220
8 Twitter Bob 250 3660

Grouping the companies ('Company') and taking the mean of each numeric column (e.g. mean of 'Sale'). Note that this groups rather than sorts by value: the string index is simply arranged alphabetically.

In [2]:
df.groupby('Company').mean() # in the result, 'Sale' and 'Wage' become per-company group means
Out[2]:
Sale Wage
Company
Google 258.333333 3923.333333
Oracle 273.333333 3656.666667
Twitter 273.333333 3510.000000
In [3]:
type(df.groupby('Company'))
Out[3]:
pandas.core.groupby.groupby.DataFrameGroupBy
In [4]:
df.groupby('Company')['Wage'].mean()
Out[4]:
Company
Google     3923.333333
Oracle     3656.666667
Twitter    3510.000000
Name: Wage, dtype: float64
In [5]:
df.groupby('Person').mean()
Out[5]:
Sale Wage
Person
Ann 275 3900
Bob 250 3660
Cindy 250 3500
Gim 350 3650
Lee 300 3870
Peter 200 4000
Sam 300 3870
Tom 270 3600
Will 220 3220
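
Beyond a single statistic, the standard GroupBy API also offers .agg(), which applies several aggregation functions in one call (this method is not shown in the cells above). A minimal sketch on a smaller version of the same data:

```python
import pandas as pd

# A smaller version of the company data used above
df = pd.DataFrame({'Company': ['Google', 'Google', 'Oracle', 'Oracle'],
                   'Sale': [200, 300, 250, 270]})

# .agg() applies several aggregation functions to each group at once
result = df.groupby('Company')['Sale'].agg(['mean', 'min', 'max'])
print(result)
```

Each requested function becomes a column of the result, indexed by group.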

4. Counting

The .count(axis, level, numeric_only) method counts the non-null values in a DataFrame.

https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.count.html

axis : {0 or ‘index’, 1 or ‘columns’}, default 0

If 0 or ‘index’ counts are generated for each column. If 1 or ‘columns’ counts are generated for each row.

level : int or str, optional

If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a DataFrame. A str specifies the level name.

numeric_only : boolean, default False

Include only float, int or boolean data.
In [6]:
df.set_index(["Company", "Person"]) # build a MultiIndex
Out[6]:
Sale Wage
Company Person
Google Peter 200 4000
Sam 300 3870
Ann 275 3900
Oracle Cindy 250 3500
Tom 270 3600
Lee 300 3870
Twitter Gim 350 3650
Will 220 3220
Bob 250 3660
In [7]:
df.set_index(["Company", "Person"]).count(level="Company") # count within one level of the MultiIndex
Out[7]:
Sale Wage
Company
Google 3 3
Oracle 3 3
Twitter 3 3
In [8]:
df.set_index(["Person", "Company"])
Out[8]:
Sale Wage
Person Company
Peter Google 200 4000
Sam Google 300 3870
Ann Google 275 3900
Cindy Oracle 250 3500
Tom Oracle 270 3600
Lee Oracle 300 3870
Gim Twitter 350 3650
Will Twitter 220 3220
Bob Twitter 250 3660
In [9]:
df.set_index(["Person", "Company"]).count(level="Person")
Out[9]:
Sale Wage
Person
Ann 1 1
Bob 1 1
Cindy 1 1
Gim 1 1
Lee 1 1
Peter 1 1
Sam 1 1
Tom 1 1
Will 1 1
In [10]:
df
Out[10]:
Company Person Sale Wage
0 Google Peter 200 4000
1 Google Sam 300 3870
2 Google Ann 275 3900
3 Oracle Cindy 250 3500
4 Oracle Tom 270 3600
5 Oracle Lee 300 3870
6 Twitter Gim 350 3650
7 Twitter Will 220 3220
8 Twitter Bob 250 3660
In [11]:
df.count() # non-null values in each column
Out[11]:
Company    9
Person     9
Sale       9
Wage       9
dtype: int64
In [12]:
df.count(axis=1) # non-null values in each row
Out[12]:
0    4
1    4
2    4
3    4
4    4
5    4
6    4
7    4
8    4
dtype: int64
In [13]:
df['Company']
Out[13]:
0     Google
1     Google
2     Google
3     Oracle
4     Oracle
5     Oracle
6    Twitter
7    Twitter
8    Twitter
Name: Company, dtype: object
In [14]:
df['Company'].count()
Out[14]:
9
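
Since the sample table contains no missing values, every count above equals the full row count. A small sketch with missing entries shows that .count() really counts non-null values, not non-zero ones:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'A': [0, np.nan, 3],      # contains a zero and a NaN
                   'B': ['x', 'y', None]})   # None also counts as null
counts = df.count()  # non-null values per column
print(counts)
```

The zero in 'A' is counted; only the NaN and None entries are skipped, so both columns count 2.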

5. Describing the Data

Pandas' .describe() method analyzes the data in a DataFrame and produces a whole set of descriptive statistics in one call, giving you a quick intuitive picture of the data.

From left to right, the statistics are: count, mean, standard deviation, minimum, the values at the 25%/50%/75% percentiles, and maximum.

In [15]:
df.groupby('Company').describe()
Out[15]:
Sale Wage
count mean std min 25% 50% 75% max count mean std min 25% 50% 75% max
Company
Google 3.0 258.333333 52.041650 200.0 237.5 275.0 287.5 300.0 3.0 3923.333333 68.068593 3870.0 3885.0 3900.0 3950.0 4000.0
Oracle 3.0 273.333333 25.166115 250.0 260.0 270.0 285.0 300.0 3.0 3656.666667 191.398363 3500.0 3550.0 3600.0 3735.0 3870.0
Twitter 3.0 273.333333 68.068593 220.0 235.0 250.0 300.0 350.0 3.0 3510.000000 251.197134 3220.0 3435.0 3650.0 3655.0 3660.0
In [16]:
df.groupby('Company')['Sale'].describe() # equivalent to df.groupby('Company').describe()['Sale']
Out[16]:
count mean std min 25% 50% 75% max
Company
Google 3.0 258.333333 52.041650 200.0 237.5 275.0 287.5 300.0
Oracle 3.0 273.333333 25.166115 250.0 260.0 270.0 285.0 300.0
Twitter 3.0 273.333333 68.068593 220.0 235.0 250.0 300.0 350.0
In [17]:
df.groupby('Company').describe()['Sale']
Out[17]:
count mean std min 25% 50% 75% max
Company
Google 3.0 258.333333 52.041650 200.0 237.5 275.0 287.5 300.0
Oracle 3.0 273.333333 25.166115 250.0 260.0 270.0 285.0 300.0
Twitter 3.0 273.333333 68.068593 220.0 235.0 250.0 300.0 350.0

If you don't like this layout, the .transpose() method gives you a vertical format:

In [18]:
df.groupby('Company').describe().transpose()
Out[18]:
Company Google Oracle Twitter
Sale count 3.000000 3.000000 3.000000
mean 258.333333 273.333333 273.333333
std 52.041650 25.166115 68.068593
min 200.000000 250.000000 220.000000
25% 237.500000 260.000000 235.000000
50% 275.000000 270.000000 250.000000
75% 287.500000 285.000000 300.000000
max 300.000000 300.000000 350.000000
Wage count 3.000000 3.000000 3.000000
mean 3923.333333 3656.666667 3510.000000
std 68.068593 191.398363 251.197134
min 3870.000000 3500.000000 3220.000000
25% 3885.000000 3550.000000 3435.000000
50% 3900.000000 3600.000000 3650.000000
75% 3950.000000 3735.000000 3655.000000
max 4000.000000 3870.000000 3660.000000
In [19]:
df.groupby('Company')['Sale'].describe().transpose()
Out[19]:
Company Google Oracle Twitter
count 3.000000 3.000000 3.000000
mean 258.333333 273.333333 273.333333
std 52.041650 25.166115 68.068593
min 200.000000 250.000000 220.000000
25% 237.500000 260.000000 235.000000
50% 275.000000 270.000000 250.000000
75% 287.500000 285.000000 300.000000
max 300.000000 300.000000 350.000000

And if you only want to see Google's data, you can do this:

In [20]:
df.groupby('Company').describe().transpose()['Google']
Out[20]:
Sale  count       3.000000
      mean      258.333333
      std        52.041650
      min       200.000000
      25%       237.500000
      50%       275.000000
      75%       287.500000
      max       300.000000
Wage  count       3.000000
      mean     3923.333333
      std        68.068593
      min      3870.000000
      25%      3885.000000
      50%      3900.000000
      75%      3950.000000
      max      4000.000000
Name: Google, dtype: float64
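
.describe() also accepts a percentiles parameter when the default 25%/50%/75% cut points are not what you need. A minimal sketch:

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5])
# Request the 10% and 90% percentiles instead of the defaults
desc = s.describe(percentiles=[0.1, 0.9])
print(desc)
```

The median (50%) is always included alongside the requested percentiles.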

6. Concatenation (Concat)

Concatenation simply stacks several DataFrames into one larger DataFrame. When you concatenate, pay close attention to the direction in which your tables' indexes and columns extend, and stack along the matching axis.

For example, take these 3 DataFrames:

In [21]:
df1=pd.DataFrame({'A':['A0','A1','A2'],'B':['B0','B1','B2'],'C':['C1','C2','C3']},index=[0,1,2])
df2=pd.DataFrame({'A':['A3','A4','A5'],'B':['B3','B4','B5'],'C':['C3','C4','C5']},index=[3,4,5])
df3=pd.DataFrame({'A':['A6','A7','A8'],'B':['B6','B7','B8'],'C':['C6','C7','C8']},index=[6,7,8])
pd.concat([df1,df2,df3])
Out[21]:
A B C
0 A0 B0 C1
1 A1 B1 C2
2 A2 B2 C3
3 A3 B3 C3
4 A4 B4 C4
5 A5 B5 C5
6 A6 B6 C6
7 A7 B7 C7
8 A8 B8 C8
In [22]:
pd.concat([df1,df2,df3],axis=0)
Out[22]:
A B C
0 A0 B0 C1
1 A1 B1 C2
2 A2 B2 C3
3 A3 B3 C3
4 A4 B4 C4
5 A5 B5 C5
6 A6 B6 C6
7 A7 B7 C7
8 A8 B8 C8
In [23]:
pd.concat([df1,df2,df3],axis=1) 
# Note the large number of NaN values: among the 3 DataFrames being stacked, many index labels have no corresponding data.
# So when you use pd.concat(), make sure the axis you stack along (rows or columns) contains all the data you need.
Out[23]:
A B C A B C A B C
0 A0 B0 C1 NaN NaN NaN NaN NaN NaN
1 A1 B1 C2 NaN NaN NaN NaN NaN NaN
2 A2 B2 C3 NaN NaN NaN NaN NaN NaN
3 NaN NaN NaN A3 B3 C3 NaN NaN NaN
4 NaN NaN NaN A4 B4 C4 NaN NaN NaN
5 NaN NaN NaN A5 B5 C5 NaN NaN NaN
6 NaN NaN NaN NaN NaN NaN A6 B6 C6
7 NaN NaN NaN NaN NaN NaN A7 B7 C7
8 NaN NaN NaN NaN NaN NaN A8 B8 C8

With the default RangeIndex, all three tables share the same index labels, so they can also be stacked horizontally:

In [24]:
df1=pd.DataFrame({'A':['A0','A1','A2'],'B':['B0','B1','B2'],'C':['C1','C2','C3']})
df2=pd.DataFrame({'A':['A3','A4','A5'],'B':['B3','B4','B5'],'C':['C3','C4','C5']})
df3=pd.DataFrame({'A':['A6','A7','A8'],'B':['B6','B7','B8'],'C':['C6','C7','C8']})
pd.concat([df1,df2,df3],axis=0)
Out[24]:
A B C
0 A0 B0 C1
1 A1 B1 C2
2 A2 B2 C3
0 A3 B3 C3
1 A4 B4 C4
2 A5 B5 C5
0 A6 B6 C6
1 A7 B7 C7
2 A8 B8 C8
In [25]:
pd.concat([df1,df2,df3],axis=1)
Out[25]:
A B C A B C A B C
0 A0 B0 C1 A3 B3 C3 A6 B6 C6
1 A1 B1 C2 A4 B4 C4 A7 B7 C7
2 A2 B2 C3 A5 B5 C5 A8 B8 C8
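
The repeated 0, 1, 2 index labels in the vertical stack above can be avoided with ignore_index=True, which gives the result a fresh RangeIndex. A minimal sketch:

```python
import pandas as pd

df1 = pd.DataFrame({'A': ['A0', 'A1']})
df2 = pd.DataFrame({'A': ['A2', 'A3']})

# ignore_index=True discards the original labels and renumbers the rows
stacked = pd.concat([df1, df2], ignore_index=True)
print(stacked)
```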

7. Merging (Merge)

The pd.merge() function combines several DataFrames; the merge works much like joining SQL tables.

The basic syntax of a merge is pd.merge(left_dataframe, right_dataframe, how='inner', on='Key'), where:

  • left_dataframe is the DataFrame placed on the left
  • right_dataframe is the DataFrame placed on the right
  • how='inner' controls what happens when the two DataFrames contain keys that don't overlap: 'inner' keeps only the intersection (the default), 'outer' keeps the union
  • on='Key' names the column holding the keys to merge on; the whole table is aligned on that column.
In [26]:
merge_df1=pd.DataFrame({'key':['k1','k2','k3'],'A':['A1','A2','A3'],'B':['B1','B2','B3']})
merge_df2=pd.DataFrame({'key':['k1','k2','k3'],'C':['C1','C2','C3'],'D':['D1','D2','D3']})
pd.merge(merge_df1,merge_df2,how='inner',on='key')
Out[26]:
key A B C D
0 k1 A1 B1 C1 D1
1 k2 A2 B2 C2 D2
2 k3 A3 B3 C3 D3

We can also pass several columns via the on parameter to merge on multiple keys at once:

In [27]:
merge_df1=pd.DataFrame({'key1':['k1','k2','k3'],'key2':['k4','k5','k6'],'A':['A1','A2','A3'],'B':['B1','B2','B3']})
merge_df2=pd.DataFrame({'key1':['k1','k2','k3'],'key2':['k4','k5','k6'],'C':['C1','C2','C3'],'D':['D1','D2','D3']})
pd.merge(merge_df1,merge_df2,how='inner',on=['key1','key2'])
Out[27]:
key1 key2 A B C D
0 k1 k4 A1 B1 C1 D1
1 k2 k5 A2 B2 C2 D2
2 k3 k6 A3 B3 C3 D3

Use how='outer' when the keys don't fully match:

In [28]:
merge_df1=pd.DataFrame({'key1':['k1','k3','k4'],'A':['A1','A2','A3'],'B':['B1','B2','B3']})
merge_df2=pd.DataFrame({'key1':['k1','k2','k3'],'C':['C1','C2','C3'],'D':['D1','D2','D3']})
pd.merge(merge_df1,merge_df2,how='outer',on='key1') 
Out[28]:
key1 A B C D
0 k1 A1 B1 C1 D1
1 k3 A2 B2 C3 D3
2 k4 A3 B3 NaN NaN
3 k2 NaN NaN C2 D2
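
Besides 'inner' and 'outer', how also accepts 'left' and 'right', which keep all the keys from one side only. A minimal sketch of a left merge:

```python
import pandas as pd

left = pd.DataFrame({'key': ['k1', 'k2'], 'A': ['A1', 'A2']})
right = pd.DataFrame({'key': ['k1', 'k3'], 'B': ['B1', 'B3']})

# how='left' keeps every key of the left table; unmatched rows get NaN
merged = pd.merge(left, right, how='left', on='key')
print(merged)
```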

8. Joining (Join)

If you want to join two tables that share few common columns, try the .join() method.

!! Unlike .merge(), a join uses the index as the common key rather than a column.

It also accepts a how parameter ('left', 'right', 'inner', 'outer'); the default is 'left', which keeps all the labels of the calling DataFrame's index (as the output below shows).

In [29]:
join_df1=pd.DataFrame({'A':['A1','A2','A3'],'B':['B1','B2','B3']})
join_df2=pd.DataFrame({'C':['C1','C2','C3'],'D':['D1','D2','D3']})
join_df1.join(join_df2)
Out[29]:
A B C D
0 A1 B1 C1 D1
1 A2 B2 C2 D2
2 A3 B3 C3 D3
In [30]:
join_df1=pd.DataFrame({'A':['A1','A2','A3'],'B':['B1','B2','B3']},index=['key1','key2','key3'])
join_df2=pd.DataFrame({'C':['C1','C2','C3'],'D':['D1','D2','D3']},index=['key1','key3','key4'])
join_df1.join(join_df2)
Out[30]:
A B C D
key1 A1 B1 C1 D1
key2 A2 B2 NaN NaN
key3 A3 B3 C2 D2
In [31]:
join_df1=pd.DataFrame({'A':['A1','A2','A3'],'B':['B1','B2','B3']},index=['key1','key2','key3'])
join_df2=pd.DataFrame({'C':['C1','C2','C3'],'D':['D1','D2','D3']},index=['key1','key3','key4'])
join_df1.join(join_df2,how='outer')
Out[31]:
A B C D
key1 A1 B1 C1 D1
key2 A2 B2 NaN NaN
key3 A3 B3 C2 D2
key4 NaN NaN C3 D3
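
If the two DataFrames share a column name, .join() raises an error unless you supply lsuffix/rsuffix to disambiguate. A minimal sketch:

```python
import pandas as pd

df_a = pd.DataFrame({'X': [1, 2]}, index=['k1', 'k2'])
df_b = pd.DataFrame({'X': [3, 4]}, index=['k1', 'k2'])

# Both tables have a column 'X'; the suffixes keep the names distinct
joined = df_a.join(df_b, lsuffix='_left', rsuffix='_right')
print(joined)
```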

9. Data Processing

9.1 Finding Unique Values

Unique values are the distinct values that appear in a DataFrame. Identifying them during data analysis helps you avoid sample bias. Pandas offers 3 main methods:

  • .unique() returns all the distinct values;
  • .nunique() returns the number of distinct values;
  • .value_counts() returns the distinct values together with their counts.
In [32]:
join_df1=pd.DataFrame({'A':['A1','A2','A3','A1'],'B':['B1','B2','B3','B4']})
join_df1.head()
Out[32]:
A B
0 A1 B1
1 A2 B2
2 A3 B3
3 A1 B4
In [33]:
join_df1['A'].unique()
Out[33]:
array(['A1', 'A2', 'A3'], dtype=object)
In [34]:
join_df1['B'].nunique()
Out[34]:
4
In [35]:
join_df1['A'].nunique()
Out[35]:
3
In [36]:
join_df1['A'].value_counts()
Out[36]:
A1    2
A2    1
A3    1
Name: A, dtype: int64
In [37]:
join_df1.value_counts() # this pandas version has no DataFrame.value_counts(); only Series does (it was added to DataFrame in pandas 1.1)
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-37-f4aa70f5183d> in <module>
----> 1 join_df1.value_counts()

AttributeError: 'DataFrame' object has no attribute 'value_counts'
In [ ]:
type(join_df1['A'])
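
.value_counts() also takes a normalize=True argument that returns proportions instead of raw counts. A minimal sketch on the same kind of data:

```python
import pandas as pd

s = pd.Series(['A1', 'A2', 'A3', 'A1'])
freq = s.value_counts(normalize=True)  # fraction of rows per distinct value
print(freq)
```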

9.2 The apply() Method

With the .apply() method, you can run a custom function over the data in a DataFrame to process it.

For example, let's first define a square() function, then apply it to the 'Sale' column of the table:

In [38]:
# suppose we have a function to calculate the square of a given value

def square(x):
    return x*x


# we can call the self-defined function with .apply()

df_apply=pd.DataFrame({'Company':['Google','Google','Google','Oracle','Oracle','Oracle','Twitter','Twitter','Twitter'],
   'Person':['Peter','Sam','Ann','Cindy','Tom','Lee','Gim','Will','Bob'], 'Sale':[20,30,27.5,25,27,30,35,22,25]})
df_apply['Sale'].apply(square)
Out[38]:
0     400.00
1     900.00
2     756.25
3     625.00
4     729.00
5     900.00
6    1225.00
7     484.00
8     625.00
Name: Sale, dtype: float64
In [39]:
df_apply['Company'].apply(len) # character length of each entry in the 'Company' column
Out[39]:
0    6
1    6
2    6
3    6
4    6
5    6
6    7
7    7
8    7
Name: Company, dtype: int64

Sometimes a function you define is only ever used once. In that case, you can replace the function definition with a lambda expression and simplify the code. For example, this lambda replaces the square() definition above:

In [40]:
df_apply['Sale'].apply(lambda x: x*x)
Out[40]:
0     400.00
1     900.00
2     756.25
3     625.00
4     729.00
5     900.00
6    1225.00
7     484.00
8     625.00
Name: Sale, dtype: float64
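
.apply() also works on whole DataFrames: with axis=1 the function receives each row as a Series, so it can combine several columns. A minimal sketch:

```python
import pandas as pd

df = pd.DataFrame({'Sale': [20, 30], 'Wage': [4000, 3870]})

# axis=1 passes one row at a time; here we compute Wage per unit of Sale
ratio = df.apply(lambda row: row['Wage'] / row['Sale'], axis=1)
print(ratio)
```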

9.3 DataFrame Attributes

A DataFrame's attributes include the names of its columns and index. If you're not sure whether a column name contains stray characters such as spaces, you can read .columns to see the exact names.

In [41]:
df_apply.index
Out[41]:
RangeIndex(start=0, stop=9, step=1)
In [42]:
df_apply.columns
Out[42]:
Index(['Company', 'Person', 'Sale'], dtype='object')
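
If .columns reveals stray whitespace in the names, the .str accessor on the Index can clean them up. A minimal sketch (the padded column name is a made-up example):

```python
import pandas as pd

df = pd.DataFrame({' Company ': ['Google'], 'Sale': [200]})
df.columns = df.columns.str.strip()  # remove leading/trailing spaces
print(list(df.columns))
```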

9.4 Sorting

To sort the whole table by the values in one column, use .sort_values('Column_Name'):

The table below is sorted by the 'Sale' column in ascending order. Note that the index still follows the rows from before the sort; sorting does not discard the original index labels.

In [43]:
df_apply.sort_values('Sale')
Out[43]:
Company Person Sale
0 Google Peter 20.0
7 Twitter Will 22.0
3 Oracle Cindy 25.0
8 Twitter Bob 25.0
4 Oracle Tom 27.0
2 Google Ann 27.5
1 Google Sam 30.0
5 Oracle Lee 30.0
6 Twitter Gim 35.0
In [44]:
df_apply#df_apply.set_index('Person')
Out[44]:
Company Person Sale
0 Google Peter 20.0
1 Google Sam 30.0
2 Google Ann 27.5
3 Oracle Cindy 25.0
4 Oracle Tom 27.0
5 Oracle Lee 30.0
6 Twitter Gim 35.0
7 Twitter Will 22.0
8 Twitter Bob 25.0
In [45]:
df_apply.sort_values('Sale')
Out[45]:
Company Person Sale
0 Google Peter 20.0
7 Twitter Will 22.0
3 Oracle Cindy 25.0
8 Twitter Bob 25.0
4 Oracle Tom 27.0
2 Google Ann 27.5
1 Google Sam 30.0
5 Oracle Lee 30.0
6 Twitter Gim 35.0
In [46]:
type(df_apply.sort_values('Sale'))
Out[46]:
pandas.core.frame.DataFrame
In [47]:
df_apply.sort_values('Sale').loc['Will']
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-47-e99213bf1e72> in <module>
----> 1 df_apply.sort_values('Sale').loc['Will']

KeyError: 'the label [Will] is not in the [index]'
In [48]:
df_apply.sort_values('Sale').loc['Will','Company']
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-48-373f7b5d5e4f> in <module>
----> 1 df_apply.sort_values('Sale').loc['Will','Company']

KeyError: 'the label [Will] is not in the [index]'
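
Both KeyErrors above have the same cause: 'Person' is an ordinary column, so 'Will' is not an index label. Setting 'Person' as the index first (as the commented-out .set_index('Person') in In [44] hints) makes the label lookup work. A minimal sketch on a two-row subset of the data:

```python
import pandas as pd

df = pd.DataFrame({'Company': ['Twitter', 'Google'],
                   'Person': ['Will', 'Peter'],
                   'Sale': [22.0, 20.0]})

# Make 'Person' the index so its values become valid .loc labels
by_person = df.set_index('Person').sort_values('Sale')
row = by_person.loc['Will']
print(row)
```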

9.5 Finding Null Values

If you have a large dataset, Pandas' .isnull() method is a quick and convenient way to find the null values in a table.

It returns a new DataFrame of booleans (True/False) marking whether each position of the original DataFrame holds a null value.

In [49]:
df_apply.isnull()
Out[49]:
Company Person Sale
0 False False False
1 False False False
2 False False False
3 False False False
4 False False False
5 False False False
6 False False False
7 False False False
8 False False False
In [50]:
df_apply['Company'].isnull()
Out[50]:
0    False
1    False
2    False
3    False
4    False
5    False
6    False
7    False
8    False
Name: Company, dtype: bool
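
Once .isnull() has located missing values, .fillna() and .dropna() are the usual follow-ups for handling them. A minimal sketch:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'A': [1.0, np.nan], 'B': ['x', 'y']})
filled = df.fillna(0)   # replace every NaN with a default value
dropped = df.dropna()   # drop every row that contains a NaN
print(filled)
print(dropped)
```

Both methods return a new DataFrame; the original is left unchanged.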

9.6 Pivot Tables

If you have used Excel, you may already know its pivot table feature. A pivot table is a summary table presenting aggregate statistics of the original data. Pandas' pivot tables can automatically group, slice, filter, sort, count, sum, or average your data, and present the result clearly.

For example, here is a small table of animal records:

In [51]:
data={'A':['Dog','Dog','Dog','Goat','Goat','Goat'],
      'B':['Brown','Brown','Black','Brown','Brown','Black'],
      'C':['x','y','x','y','x','y'],
      'D':[1,3,2,4,5,3]}
df_pivot=pd.DataFrame(data)
df_pivot
Out[51]:
A B C D
0 Dog Brown x 1
1 Dog Brown y 3
2 Dog Black x 2
3 Goat Brown y 4
4 Goat Brown x 5
5 Goat Black y 3

The pivot table syntax is .pivot_table(data, values='', index=[''], columns=['']), where:

  • values is the column whose data gets summarized,
  • index is the column (or columns) that becomes the index of the summary table,
  • columns is the column (or columns) that becomes the columns of the summary table.
In [52]:
pd.pivot_table(df_pivot,values='D',index=['A','B'],columns='C')
Out[52]:
C x y
A B
Dog Black 2.0 NaN
Brown 1.0 3.0
Goat Black NaN 3.0
Brown 5.0 4.0

Alternatively, you can call the method directly on the df object:

In [53]:
df_pivot.pivot_table(values='D',index=['A','B'],columns='C')
Out[53]:
C x y
A B
Dog Black 2.0 NaN
Brown 1.0 3.0
Goat Black NaN 3.0
Brown 5.0 4.0
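
By default pivot_table averages the values that fall into the same cell; the aggfunc parameter switches to another aggregation. A minimal sketch using a sum:

```python
import pandas as pd

df = pd.DataFrame({'A': ['Dog', 'Dog', 'Goat'],
                   'C': ['x', 'x', 'y'],
                   'D': [1, 3, 4]})

# Both 'Dog'/'x' rows land in one cell; aggfunc='sum' adds them up
pt = pd.pivot_table(df, values='D', index='A', columns='C', aggfunc='sum')
print(pt)
```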

9.7 Importing and Exporting Data

With methods of the form pd.read_*, Pandas can read data files in many different formats, including Excel spreadsheets, CSV files, SQL databases, and even HTML pages.

9.7.1 Writing a CSV File

In [54]:
df_pivot.to_csv('New_DataFrame.csv') # note what happens later when the saved index is read back in
In [55]:
df_pivot.to_csv('New_DataFrame_2.csv',index=False) # index=False drops the 0, 1, 2, ... index

9.7.2 Reading a CSV File

pd.read_csv() turns the data in a CSV file into a DataFrame object:

In [56]:
import os
os.chdir("/home/junyao/文档/pandas_practice")
In [57]:
new_df=pd.read_csv('New_DataFrame_2.csv')
new_df
Out[57]:
A B C D
0 Dog Brown x 1
1 Dog Brown y 3
2 Dog Black x 2
3 Goat Brown y 4
4 Goat Brown x 5
5 Goat Black y 3
In [58]:
new_df1=pd.read_csv('New_DataFrame.csv')
new_df1
Out[58]:
Unnamed: 0 A B C D
0 0 Dog Brown x 1
1 1 Dog Brown y 3
2 2 Dog Black x 2
3 3 Goat Brown y 4
4 4 Goat Brown x 5
5 5 Goat Black y 3
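
The 'Unnamed: 0' column above is the saved index coming back as data. Instead of rewriting the file, you can tell read_csv to treat the first column as the index with index_col=0. A minimal sketch using an in-memory CSV:

```python
import pandas as pd
from io import StringIO

# Simulates a CSV that was written with its index column included
csv_text = ",A,B\n0,Dog,Brown\n1,Dog,Black\n"
df = pd.read_csv(StringIO(csv_text), index_col=0)  # first column -> index
print(df)
```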

9.7.3 Writing an Excel File

In [59]:
df_pivot_new = df_pivot.set_index('A')
df_pivot_new
Out[59]:
B C D
A
Dog Brown x 1
Dog Brown y 3
Dog Black x 2
Goat Brown y 4
Goat Brown x 5
Goat Black y 3
In [60]:
df_pivot_new.to_excel('New_DataFrame.xlsx',sheet_name='Sheet 1')

9.7.4 Reading an Excel File

Excel files are a handy data source. With pd.read_excel(), we can import the data of an Excel sheet into Pandas. Note that Pandas imports only the tabular data; other objects such as macros, charts, and formulas are not imported, and if the file contains them, pd.read_excel() may fail.

In [61]:
new_df_pivot_new=pd.read_excel('New_DataFrame.xlsx')
In [62]:
new_df_pivot_new
Out[62]:
A B C D
0 Dog Brown x 1
1 Dog Brown y 3
2 Dog Black x 2
3 Goat Brown y 4
4 Goat Brown x 5
5 Goat Black y 3

9.7.5 Reading HTML Data

To read HTML, you need the html5lib, lxml, and BeautifulSoup4 libraries installed. Run the following commands in a terminal or command prompt:

- conda install lxml

- conda install html5lib

- conda install beautifulsoup4

As an example, let's have Pandas read the data on this page: https://en.wikipedia.org/wiki/Udacity . Since one page can contain several different tables, we access the individual tables in the resulting list by subscript [0, ..., len(tables) - 1].

In the example below, we display the first two tables found on the page:

In [64]:
df_html=pd.read_html('https://en.wikipedia.org/wiki/Udacity')
In [65]:
df_html[0]
Out[65]:
0 1
0 NaN NaN
1 Type of site Online education
2 Available in English
3 Founded June 2011[1]
4 Created by Sebastian Thrun, David Stavens, Mike Sokolsky
5 CEO Vishal Makhijani [2]
6 Website www.udacity.com
7 Alexa rank 1,781 (September 2017[3]
8 Commercial Yes
9 Registration Required
10 Users 1.6 Million[4]
11 Launched February 2012
12 Current status Active
In [66]:
df_html[1]
Out[66]:
0 1 2 3
0 ID Course Name Primary Instructor(s) Launch Date
1 BIO110 Tales from the Genome: Introduction to Genetic... Matthew Cook, Lauren Castellano, Joanna Mounta... NaN
2 CS046 Intro to Programming in Java: Learning Java Cay Horstmann 3 June 2013[32]
3 CS101 Introduction to Computer Science: Building a S... David Evans 20 February 2012
4 CS212 Design of Computer Programs: Programming Princ... Peter Norvig 16 April 2012
5 CS215 Algorithms: Crunching Social Networks Michael Littman 25 June 2012
6 CS222 Making Math Matter: Differential Equations in ... Jörn Loviscach 3 September 2012
7 CS253 Web Development: How to Build a Blog Steve Huffman 16 April 2012
8 CS255 HTML5 Game Development: Building High Performa... Colt McAnlis, Peter Lubbers 4 February 2013[30]
9 CS256 Mobile Web Development: Building Mobile Web Ex... Chris Wilson, Peter Lubbers NaN
10 CS258 Software Testing: How to Make Software Fail John Regehr 25 June 2012
11 CS259 Software Debugging: Automating the Boring Tasks Andreas Zeller 3 September 2012
12 CS262 Programming Languages: Building a Web Browser Westley Weimer 16 April 2012
13 CS271 Introduction to Artificial Intelligence: AI-Class Sebastian Thrun, Peter Norvig 28 November 2012[n 1]
14 CS291 Interactive Rendering: Introduction to 3D Comp... Eric Haines 11 March 2013
15 CS313 Intro to Theoretical Computer Science: Dealing... Sebastian Wernicke 1 October 2012
16 CS344 Introduction to Parallel Programming: Using CU... John Owens, David Luebke 4 February 2013[30]
17 CS348 Functional Hardware Verification: How to Verif... Axel Scherer, Hannes Fröhlich 12 March 2013[30]
18 CS373 Artificial Intelligence: Programming A Robotic... Sebastian Thrun 20 February 2012
19 CS387 Applied Cryptography: Science of Secrets David Evans 16 April 2012
20 Design101 The Design of Everyday Things Don Norman, Kristian Simsarian NaN
21 EP245 How to Build a Startup: The Lean LaunchPad Steve Blank 14 September 2012
22 MA006 Visualizing Algebra: Patterns and Problems Susan McClory, Sandra DeSousa 30 January 2013[33]
23 MA008 College Algebra: Animals, Architecture, and In... Julie Sliva Spitzer 30 January 2013[33]
24 PH100 Intro to Physics: Landmarks in Physics Andy Brown 25 June 2012
25 PS001 Introduction to Psychology: The Science of Tho... Susan Snycerski, Greg Feist 3 June 2013[32]
26 ST095 Statistics: The Science of Decisions Sean Laraway, Ronald Rogers 30 January 2013[33]
27 ST101 Intro to Statistics: Making Decisions Based on... Sebastian Thrun 25 June 2012