[ github ] 10 Exercise Sets to Tie Together Common Pandas Knowledge

Published: 2022-12-18

GitHub repo: https://github.com/guipsamora/pandas_exercises
Datasets: https://www.aliyundrive.com/s/Wk6L6R67rrn

1. Getting to Know the Data

1.1 Import the dataset and take a first look
import pandas as pd
chipotle = pd.read_csv('001_chipotle.tsv',sep='\t')
chipotle.head()
order_id quantity item_name choice_description item_price
0 1 1 Chips and Fresh Tomato Salsa NaN $2.39
1 1 1 Izze [Clementine] $3.39
2 1 1 Nantucket Nectar [Apple] $3.39
3 1 1 Chips and Tomatillo-Green Chili Salsa NaN $2.39
4 2 2 Chicken Bowl [Tomatillo-Red Chili Salsa (Hot), [Black Beans... $16.98
chipotle.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4622 entries, 0 to 4621
Data columns (total 5 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   order_id            4622 non-null   int64 
 1   quantity            4622 non-null   int64 
 2   item_name           4622 non-null   object
 3   choice_description  3376 non-null   object
 4   item_price          4622 non-null   object
dtypes: int64(2), object(3)
memory usage: 180.7+ KB
chipotle.columns
Index(['order_id', 'quantity', 'item_name', 'choice_description',
       'item_price'],
      dtype='object')
chipotle.index
RangeIndex(start=0, stop=4622, step=1)
1.2 Which item was ordered most often?
  • as_index=False keeps the group keys as a regular column (with a default integer index),
  • which also keeps all column names on a single display row ("no merged cells"), which is easier to read
item_buy_count = chipotle[['item_name','quantity']].groupby('item_name').sum()['quantity'].nlargest(1)
item_buy_count
item_name
Chicken Bowl    761
Name: quantity, dtype: int64
  • Original answer
item_buy_count_1 = chipotle[['item_name','quantity']].groupby('item_name', as_index=False).agg({'quantity':'sum'})
item_buy_count_1.sort_values('quantity', ascending=False, inplace=True)
item_buy_count_1.head()
item_name quantity
17 Chicken Bowl 761
18 Chicken Burrito 591
25 Chips and Guacamole 506
39 Steak Burrito 386
10 Canned Soft Drink 351
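  • A one-line alternative (a sketch on the same chipotle frame): idxmax() returns the label of the largest value directly
chipotle.groupby('item_name')['quantity'].sum().idxmax()
'Chicken Bowl'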
1.3 How many different items were ordered?
  • nunique() counts distinct values
chipotle['item_name'].nunique()
50
# quick bar chart of order counts per item
s = chipotle['item_name'].value_counts()

from matplotlib import pyplot as plt
plt.figure(figsize=(20,4),dpi=100)
plt.xticks(fontsize=13, rotation='vertical')
plt.bar(s.index,s.values)
<BarContainer object of 50 artists>


[figure: bar chart of order counts per item_name]

1.4 Ranking of choice_description values among the ordered items
  • The wording of this exercise is unclear; a simple value_counts() ranking is shown below
chipotle['choice_description'].value_counts().head()
[Diet Coke]                                                                          134
[Coke]                                                                               123
[Sprite]                                                                              77
[Fresh Tomato Salsa, [Rice, Black Beans, Cheese, Sour Cream, Lettuce]]                42
[Fresh Tomato Salsa, [Rice, Black Beans, Cheese, Sour Cream, Guacamole, Lettuce]]     40
Name: choice_description, dtype: int64
1.5 How many items were ordered in total?
chipotle['quantity'].sum()
4972
1.6 Convert item_price to float
chipotle['item_price'] = chipotle['item_price'].str.lstrip('$').astype('float')
chipotle.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4622 entries, 0 to 4621
Data columns (total 5 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   order_id            4622 non-null   int64  
 1   quantity            4622 non-null   int64  
 2   item_name           4622 non-null   object 
 3   choice_description  3376 non-null   object 
 4   item_price          4622 non-null   float64
dtypes: float64(1), int64(2), object(2)
memory usage: 180.7+ KB
  • Original solution: the original notebook names the DataFrame chipo, so the right-hand side below is undefined here and raises a NameError (it would also fail now that item_price has already been converted to float above)
chipotle['item_price'] = chipo['item_price'].apply(lambda x: float(x[1:-1]))
NameError: name 'chipo' is not defined
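  • If item_price were still the raw string column (e.g. '$2.39'), a slightly more defensive sketch would go through pd.to_numeric, coercing malformed prices to NaN instead of raising
raw_price = pd.read_csv('001_chipotle.tsv', sep='\t')['item_price']   # re-read the raw column
pd.to_numeric(raw_price.str.replace('$', '', regex=False), errors='coerce')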
1.7 Total revenue
chipotle['sub_total'] = (chipotle['item_price'] * chipotle['quantity'])
chipotle['sub_total'].sum()
39237.02
1.8 Total number of orders
chipotle['order_id'].nunique()
1834
1.9 Average revenue per order
chipotle['sub_total'].sum()/chipotle['order_id'].nunique()
21.39423118865867
  • Original answer
chipotle[['order_id','sub_total']].groupby(by=['order_id']).agg({'sub_total':'sum'})['sub_total'].mean()
21.39423118865867

2. Filtering and Sorting

2.1 Import the data and take a first look
import pandas as pd
euro = pd.read_csv('Euro2012_stats.csv')
euro.head()
Team Goals Shots on target Shots off target Shooting Accuracy % Goals-to-shots Total shots (inc. Blocked) Hit Woodwork Penalty goals Penalties not scored ... Saves made Saves-to-shots ratio Fouls Won Fouls Conceded Offsides Yellow Cards Red Cards Subs on Subs off Players Used
0 Croatia 4 13 12 51.9% 16.0% 32 0 0 0 ... 13 81.3% 41 62 2 9 0 9 9 16
1 Czech Republic 4 13 18 41.9% 12.9% 39 0 0 0 ... 9 60.1% 53 73 8 7 0 11 11 19
2 Denmark 4 10 10 50.0% 20.0% 27 1 0 0 ... 10 66.7% 25 38 8 4 0 7 7 15
3 England 5 11 18 50.0% 17.2% 40 0 0 0 ... 22 88.1% 43 45 6 5 0 11 11 16
4 France 3 22 24 37.9% 6.5% 65 1 0 0 ... 6 54.6% 36 51 5 6 0 11 11 19

5 rows × 35 columns

euro.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16 entries, 0 to 15
Data columns (total 35 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   Team                        16 non-null     object 
 1   Goals                       16 non-null     int64  
 2   Shots on target             16 non-null     int64  
 3   Shots off target            16 non-null     int64  
 4   Shooting Accuracy           16 non-null     object 
 5   % Goals-to-shots            16 non-null     object 
 6   Total shots (inc. Blocked)  16 non-null     int64  
 7   Hit Woodwork                16 non-null     int64  
 8   Penalty goals               16 non-null     int64  
 9   Penalties not scored        16 non-null     int64  
 10  Headed goals                16 non-null     int64  
 11  Passes                      16 non-null     int64  
 12  Passes completed            16 non-null     int64  
 13  Passing Accuracy            16 non-null     object 
 14  Touches                     16 non-null     int64  
 15  Crosses                     16 non-null     int64  
 16  Dribbles                    16 non-null     int64  
 17  Corners Taken               16 non-null     int64  
 18  Tackles                     16 non-null     int64  
 19  Clearances                  16 non-null     int64  
 20  Interceptions               16 non-null     int64  
 21  Clearances off line         15 non-null     float64
 22  Clean Sheets                16 non-null     int64  
 23  Blocks                      16 non-null     int64  
 24  Goals conceded              16 non-null     int64  
 25  Saves made                  16 non-null     int64  
 26  Saves-to-shots ratio        16 non-null     object 
 27  Fouls Won                   16 non-null     int64  
 28  Fouls Conceded              16 non-null     int64  
 29  Offsides                    16 non-null     int64  
 30  Yellow Cards                16 non-null     int64  
 31  Red Cards                   16 non-null     int64  
 32  Subs on                     16 non-null     int64  
 33  Subs off                    16 non-null     int64  
 34  Players Used                16 non-null     int64  
dtypes: float64(1), int64(29), object(5)
memory usage: 4.5+ KB
2.2 How many teams took part in Euro 2012?
euro['Team'].nunique()
16
2.3 Store the columns Team, Yellow Cards and Red Cards in a separate DataFrame named discripline
discripline = euro[['Team','Yellow Cards', 'Red Cards']]
discripline
Team Yellow Cards Red Cards
0 Croatia 9 0
1 Czech Republic 7 0
2 Denmark 4 0
3 England 5 0
4 France 6 0
5 Germany 4 0
6 Greece 9 1
7 Italy 16 0
8 Netherlands 5 0
9 Poland 7 1
10 Portugal 12 0
11 Republic of Ireland 6 1
12 Russia 6 0
13 Spain 11 0
14 Sweden 7 0
15 Ukraine 5 0
2.4 Sort discripline by Yellow Cards and Red Cards
discripline.sort_values(['Yellow Cards', 'Red Cards'], ascending=[False, False], inplace=True)
discripline
C:\Users\LiCN\AppData\Local\Temp/ipykernel_19644/3135130171.py:1: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  discripline.sort_values(['Yellow Cards', 'Red Cards'], ascending=[False, False], inplace=True)
Team Yellow Cards Red Cards
7 Italy 16 0
10 Portugal 12 0
13 Spain 11 0
6 Greece 9 1
0 Croatia 9 0
9 Poland 7 1
1 Czech Republic 7 0
14 Sweden 7 0
11 Republic of Ireland 6 1
4 France 6 0
12 Russia 6 0
3 England 5 0
8 Netherlands 5 0
15 Ukraine 5 0
2 Denmark 4 0
5 Germany 4 0
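  • The SettingWithCopyWarning above appears because discripline is a slice of euro; a sketch that avoids it by taking an explicit copy:
discripline = euro[['Team', 'Yellow Cards', 'Red Cards']].copy()   # independent copy, safe to modify
discripline = discripline.sort_values(['Yellow Cards', 'Red Cards'], ascending=False)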
2.5 Mean number of yellow cards per team
discripline['Yellow Cards'].mean()
7.4375
2.6 Teams with more than 6 yellow cards (the original exercise asks for teams that scored more than 6 goals, i.e. euro[euro['Goals'] > 6]; the code below filters on yellow cards instead)
discripline[discripline['Yellow Cards']>6]
# note: attribute access (discripline.Yellow Cards) would not work here because the column name contains a space
Team Yellow Cards Red Cards
7 Italy 16 0
10 Portugal 12 0
13 Spain 11 0
6 Greece 9 1
0 Croatia 9 0
9 Poland 7 1
1 Czech Republic 7 0
14 Sweden 7 0
2.7 Teams whose name starts with the letter G
euro[euro['Team'].str.startswith('G')]
Team Goals Shots on target Shots off target Shooting Accuracy % Goals-to-shots Total shots (inc. Blocked) Hit Woodwork Penalty goals Penalties not scored ... Saves made Saves-to-shots ratio Fouls Won Fouls Conceded Offsides Yellow Cards Red Cards Subs on Subs off Players Used
5 Germany 10 32 32 47.8% 15.6% 80 2 1 0 ... 10 62.6% 63 49 12 4 0 15 15 17
6 Greece 5 8 18 30.7% 19.2% 32 1 1 1 ... 13 65.1% 67 48 12 9 1 12 12 20

2 rows × 35 columns

2.8 Select the first 7 columns
euro.iloc[:,0:7]
Team Goals Shots on target Shots off target Shooting Accuracy % Goals-to-shots Total shots (inc. Blocked)
0 Croatia 4 13 12 51.9% 16.0% 32
1 Czech Republic 4 13 18 41.9% 12.9% 39
2 Denmark 4 10 10 50.0% 20.0% 27
3 England 5 11 18 50.0% 17.2% 40
4 France 3 22 24 37.9% 6.5% 65
5 Germany 10 32 32 47.8% 15.6% 80
6 Greece 5 8 18 30.7% 19.2% 32
7 Italy 6 34 45 43.0% 7.5% 110
8 Netherlands 2 12 36 25.0% 4.1% 60
9 Poland 2 15 23 39.4% 5.2% 48
10 Portugal 6 22 42 34.3% 9.3% 82
11 Republic of Ireland 1 7 12 36.8% 5.2% 28
12 Russia 5 9 31 22.5% 12.5% 59
13 Spain 12 42 33 55.9% 16.0% 100
14 Sweden 5 17 19 47.2% 13.8% 39
15 Ukraine 2 7 26 21.2% 6.0% 38
2.9 Select all columns except the last 3
euro.iloc[:,:-3]
Team Goals Shots on target Shots off target Shooting Accuracy % Goals-to-shots Total shots (inc. Blocked) Hit Woodwork Penalty goals Penalties not scored ... Clean Sheets Blocks Goals conceded Saves made Saves-to-shots ratio Fouls Won Fouls Conceded Offsides Yellow Cards Red Cards
0 Croatia 4 13 12 51.9% 16.0% 32 0 0 0 ... 0 10 3 13 81.3% 41 62 2 9 0
1 Czech Republic 4 13 18 41.9% 12.9% 39 0 0 0 ... 1 10 6 9 60.1% 53 73 8 7 0
2 Denmark 4 10 10 50.0% 20.0% 27 1 0 0 ... 1 10 5 10 66.7% 25 38 8 4 0
3 England 5 11 18 50.0% 17.2% 40 0 0 0 ... 2 29 3 22 88.1% 43 45 6 5 0
4 France 3 22 24 37.9% 6.5% 65 1 0 0 ... 1 7 5 6 54.6% 36 51 5 6 0
5 Germany 10 32 32 47.8% 15.6% 80 2 1 0 ... 1 11 6 10 62.6% 63 49 12 4 0
6 Greece 5 8 18 30.7% 19.2% 32 1 1 1 ... 1 23 7 13 65.1% 67 48 12 9 1
7 Italy 6 34 45 43.0% 7.5% 110 2 0 0 ... 2 18 7 20 74.1% 101 89 16 16 0
8 Netherlands 2 12 36 25.0% 4.1% 60 2 0 0 ... 0 9 5 12 70.6% 35 30 3 5 0
9 Poland 2 15 23 39.4% 5.2% 48 0 0 0 ... 0 8 3 6 66.7% 48 56 3 7 1
10 Portugal 6 22 42 34.3% 9.3% 82 6 0 0 ... 2 11 4 10 71.5% 73 90 10 12 0
11 Republic of Ireland 1 7 12 36.8% 5.2% 28 0 0 0 ... 0 23 9 17 65.4% 43 51 11 6 1
12 Russia 5 9 31 22.5% 12.5% 59 2 0 0 ... 0 8 3 10 77.0% 34 43 4 6 0
13 Spain 12 42 33 55.9% 16.0% 100 0 1 0 ... 5 8 1 15 93.8% 102 83 19 11 0
14 Sweden 5 17 19 47.2% 13.8% 39 3 0 0 ... 1 12 5 8 61.6% 35 51 7 7 0
15 Ukraine 2 7 26 21.2% 6.0% 38 0 0 0 ... 0 4 4 13 76.5% 48 31 4 5 0

16 rows × 32 columns

2.10 Shooting Accuracy for England, Italy and Russia
euro[euro['Team'].isin(['England', 'Italy', 'Russia'])][['Team','Shooting Accuracy']]
Team Shooting Accuracy
3 England 50.0%
7 Italy 43.0%
12 Russia 22.5%
  • Original solution
euro.loc[euro['Team'].isin(['England', 'Italy', 'Russia']), ['Team','Shooting Accuracy']]
Team Shooting Accuracy
3 England 50.0%
7 Italy 43.0%
12 Russia 22.5%
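  • Shooting Accuracy is stored as a string such as '50.0%'; a sketch converting it to a numeric column for further analysis:
euro['Shooting Accuracy'].str.rstrip('%').astype(float) / 100   # e.g. '50.0%' -> 0.5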

3. Grouping: Exploring Alcohol Consumption Data

import pandas as pd

drinks = pd.read_csv('drinks.csv')
drinks.head()
country beer_servings spirit_servings wine_servings total_litres_of_pure_alcohol continent
0 Afghanistan 0 0 0 0.0 AS
1 Albania 89 132 54 4.9 EU
2 Algeria 25 0 14 0.7 AF
3 Andorra 245 138 312 12.4 EU
4 Angola 217 57 45 5.9 AF
drinks.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 193 entries, 0 to 192
Data columns (total 6 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   country                       193 non-null    object 
 1   beer_servings                 193 non-null    int64  
 2   spirit_servings               193 non-null    int64  
 3   wine_servings                 193 non-null    int64  
 4   total_litres_of_pure_alcohol  193 non-null    float64
 5   continent                     170 non-null    object 
dtypes: float64(1), int64(3), object(2)
memory usage: 9.2+ KB
3.1 Which continent consumes the most beer on average?
  • note: sort_values() sorts ascending by default, so ascending=False is needed here; otherwise head(1) returns the continent that drinks the least
drinks.groupby('continent')['beer_servings'].mean().sort_values(ascending=False).head(1)
continent
EU    193.777778
Name: beer_servings, dtype: float64
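  • Equivalent one-liner (a sketch): idxmax() returns the continent label of the maximum directly
drinks.groupby('continent')['beer_servings'].mean().idxmax()
'EU'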
3.2 Descriptive statistics of wine consumption for each continent
drinks.groupby('continent')['wine_servings'].describe()
count mean std min 25% 50% 75% max
continent
AF 53.0 16.264151 38.846419 0.0 1.0 2.0 13.00 233.0
AS 44.0 9.068182 21.667034 0.0 0.0 1.0 8.00 123.0
EU 45.0 142.222222 97.421738 0.0 59.0 128.0 195.00 370.0
OC 16.0 35.625000 64.555790 0.0 1.0 8.5 23.25 212.0
SA 12.0 62.416667 88.620189 1.0 3.0 12.0 98.50 221.0
3.3 Mean consumption of every drink type per continent (on pandas 2.x you may need numeric_only=True, since the country column is non-numeric)
drinks.groupby('continent').mean()
beer_servings spirit_servings wine_servings total_litres_of_pure_alcohol
continent
AF 61.471698 16.339623 16.264151 3.007547
AS 37.045455 60.840909 9.068182 2.170455
EU 193.777778 132.555556 142.222222 8.617778
OC 89.687500 58.437500 35.625000 3.381250
SA 175.083333 114.750000 62.416667 6.308333
3.4 Median consumption of every drink type per continent
drinks.groupby('continent').median()
beer_servings spirit_servings wine_servings total_litres_of_pure_alcohol
continent
AF 32.0 3.0 2.0 2.30
AS 17.5 16.0 1.0 1.20
EU 219.0 122.0 128.0 10.00
OC 52.5 37.0 8.5 1.75
SA 162.5 108.5 12.0 6.85
3.5 Mean, maximum and minimum spirit consumption per continent
drinks.groupby('continent')['spirit_servings'].agg(['mean','max','min'])
mean max min
continent
AF 16.339623 152 0
AS 60.840909 326 0
EU 132.555556 373 0
OC 58.437500 254 0
SA 114.750000 302 25
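  • Named aggregation (available since pandas 0.25) is a readable sketch of the same computation:
drinks.groupby('continent')['spirit_servings'].agg(
    spirit_mean='mean', spirit_max='max', spirit_min='min')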

4. The apply Function (US Crime Data, 1960-2014)

4.1 Import the data and take a first look
import pandas as pd

crime = pd.read_csv('US_Crime_Rates_1960_2014.csv')

crime.head()
Year Population Total Violent Property Murder Forcible_Rape Robbery Aggravated_assault Burglary Larceny_Theft Vehicle_Theft
0 1960 179323175 3384200 288460 3095700 9110 17190 107840 154320 912100 1855400 328200
1 1961 182992000 3488000 289390 3198600 8740 17220 106670 156760 949600 1913000 336000
2 1962 185771000 3752200 301510 3450700 8530 17550 110860 164570 994300 2089600 366800
3 1963 188483000 4109500 316970 3792500 8640 17650 116470 174210 1086400 2297800 408300
4 1964 191141000 4564600 364220 4200400 9360 21420 130390 203050 1213200 2514400 472800
crime.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 55 entries, 0 to 54
Data columns (total 12 columns):
 #   Column              Non-Null Count  Dtype
---  ------              --------------  -----
 0   Year                55 non-null     int64
 1   Population          55 non-null     int64
 2   Total               55 non-null     int64
 3   Violent             55 non-null     int64
 4   Property            55 non-null     int64
 5   Murder              55 non-null     int64
 6   Forcible_Rape       55 non-null     int64
 7   Robbery             55 non-null     int64
 8   Aggravated_assault  55 non-null     int64
 9   Burglary            55 non-null     int64
 10  Larceny_Theft       55 non-null     int64
 11  Vehicle_Theft       55 non-null     int64
dtypes: int64(12)
memory usage: 5.3 KB
4.2 Convert Year to datetime
crime['Year'] = pd.to_datetime(crime['Year'], format='%Y')
crime.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 55 entries, 0 to 54
Data columns (total 12 columns):
 #   Column              Non-Null Count  Dtype         
---  ------              --------------  -----         
 0   Year                55 non-null     datetime64[ns]
 1   Population          55 non-null     int64         
 2   Total               55 non-null     int64         
 3   Violent             55 non-null     int64         
 4   Property            55 non-null     int64         
 5   Murder              55 non-null     int64         
 6   Forcible_Rape       55 non-null     int64         
 7   Robbery             55 non-null     int64         
 8   Aggravated_assault  55 non-null     int64         
 9   Burglary            55 non-null     int64         
 10  Larceny_Theft       55 non-null     int64         
 11  Vehicle_Theft       55 non-null     int64         
dtypes: datetime64[ns](1), int64(11)
memory usage: 5.3 KB
4.3 Set Year as the index
# no need for drop=True here; the old RangeIndex is simply discarded and does not become a new column
crime.set_index('Year', inplace=True)
crime.head()
Population Total Violent Property Murder Forcible_Rape Robbery Aggravated_assault Burglary Larceny_Theft Vehicle_Theft
Year
1960-01-01 179323175 3384200 288460 3095700 9110 17190 107840 154320 912100 1855400 328200
1961-01-01 182992000 3488000 289390 3198600 8740 17220 106670 156760 949600 1913000 336000
1962-01-01 185771000 3752200 301510 3450700 8530 17550 110860 164570 994300 2089600 366800
1963-01-01 188483000 4109500 316970 3792500 8640 17650 116470 174210 1086400 2297800 408300
1964-01-01 191141000 4564600 364220 4200400 9360 21420 130390 203050 1213200 2514400 472800
4.4 Delete the column named Total
del crime['Total']
4.5 Resample the data by decade (resample is very important)
  • resample is pandas' convenient method for resampling regular time-series data and converting between frequencies.
  • The argument is the resampling rule. Common frequency aliases: A year, M month, W week, D day, H hour, T minute, S second.
  • Here the records are aggregated in ten-year bins.
crime_10Ys = crime.resample('10As').sum()
crime_10Ys
Population Violent Property Murder Forcible_Rape Robbery Aggravated_assault Burglary Larceny_Theft Vehicle_Theft
Year
1960-01-01 1915053175 4134930 45160900 106180 236720 1633510 2158520 13321100 26547700 5292100
1970-01-01 2121193298 9607930 91383800 192230 554570 4159020 4702120 28486000 53157800 9739900
1980-01-01 2371370069 14074328 117048900 206439 865639 5383109 7619130 33073494 72040253 11935411
1990-01-01 2612825258 17527048 119053499 211664 998827 5748930 10568963 26750015 77679366 14624418
2000-01-01 2947969117 13968056 100944369 163068 922499 4230366 8652124 21565176 67970291 11412834
2010-01-01 1570146307 6072017 44095950 72867 421059 1749809 3764142 10125170 30401698 3569080
  • Replace the resampled Population values with the maximum in each decade: summing population over ten years is meaningless,
  • so use the largest value within each decade instead.
crime_10Ys['Population'] = crime['Population'].resample('10As').max()
crime_10Ys.head()
Population Violent Property Murder Forcible_Rape Robbery Aggravated_assault Burglary Larceny_Theft Vehicle_Theft
Year
1960-01-01 201385000 4134930 45160900 106180 236720 1633510 2158520 13321100 26547700 5292100
1970-01-01 220099000 9607930 91383800 192230 554570 4159020 4702120 28486000 53157800 9739900
1980-01-01 248239000 14074328 117048900 206439 865639 5383109 7619130 33073494 72040253 11935411
1990-01-01 272690813 17527048 119053499 211664 998827 5748930 10568963 26750015 77679366 14624418
2000-01-01 307006550 13968056 100944369 163068 922499 4230366 8652124 21565176 67970291 11412834
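  • The two steps above can also be collapsed into one agg() call on the resampler (a sketch on the same crime frame):
rules = {col: 'sum' for col in crime.columns}   # sum every column by default
rules['Population'] = 'max'                     # ...but take the decade maximum for Population
crime_10Ys = crime.resample('10AS').agg(rules)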
4.6 When was the most dangerous time to live in the US?
  • idxmax() returns the index label of the maximum value
crime
Population Violent Property Murder Forcible_Rape Robbery Aggravated_assault Burglary Larceny_Theft Vehicle_Theft
Year
1960-01-01 179323175 288460 3095700 9110 17190 107840 154320 912100 1855400 328200
1961-01-01 182992000 289390 3198600 8740 17220 106670 156760 949600 1913000 336000
1962-01-01 185771000 301510 3450700 8530 17550 110860 164570 994300 2089600 366800
1963-01-01 188483000 316970 3792500 8640 17650 116470 174210 1086400 2297800 408300
1964-01-01 191141000 364220 4200400 9360 21420 130390 203050 1213200 2514400 472800
1965-01-01 193526000 387390 4352000 9960 23410 138690 215330 1282500 2572600 496900
1966-01-01 195576000 430180 4793300 11040 25820 157990 235330 1410100 2822000 561200
1967-01-01 197457000 499930 5403500 12240 27620 202910 257160 1632100 3111600 659800
1968-01-01 199399000 595010 6125200 13800 31670 262840 286700 1858900 3482700 783600
1969-01-01 201385000 661870 6749000 14760 37170 298850 311090 1981900 3888600 878500
1970-01-01 203235298 738820 7359200 16000 37990 349860 334970 2205000 4225800 928400
1971-01-01 206212000 816500 7771700 17780 42260 387700 368760 2399300 4424200 948200
1972-01-01 208230000 834900 7413900 18670 46850 376290 393090 2375500 4151200 887200
1973-01-01 209851000 875910 7842200 19640 51400 384220 420650 2565500 4347900 928800
1974-01-01 211392000 974720 9278700 20710 55400 442400 456210 3039200 5262500 977100
1975-01-01 213124000 1039710 10252700 20510 56090 470500 492620 3265300 5977700 1009600
1976-01-01 214659000 1004210 10345500 18780 57080 427810 500530 3108700 6270800 966000
1977-01-01 216332000 1029580 9955000 19120 63500 412610 534350 3071500 5905700 977700
1978-01-01 218059000 1085550 10123400 19560 67610 426930 571460 3128300 5991000 1004100
1979-01-01 220099000 1208030 11041500 21460 76390 480700 629480 3327700 6601000 1112800
1980-01-01 225349264 1344520 12063700 23040 82990 565840 672650 3795200 7136900 1131700
1981-01-01 229146000 1361820 12061900 22520 82500 592910 663900 3779700 7194400 1087800
1982-01-01 231534000 1322390 11652000 21010 78770 553130 669480 3447100 7142500 1062400
1983-01-01 233981000 1258090 10850500 19310 78920 506570 653290 3129900 6712800 1007900
1984-01-01 236158000 1273280 10608500 18690 84230 485010 685350 2984400 6591900 1032200
1985-01-01 238740000 1328800 11102600 18980 88670 497870 723250 3073300 6926400 1102900
1986-01-01 240132887 1489169 11722700 20613 91459 542775 834322 3241410 7257153 1224137
1987-01-01 242282918 1483999 12024700 20096 91110 517704 855088 3236184 7499900 1288674
1988-01-01 245807000 1566220 12356900 20680 92490 542970 910090 3218100 7705900 1432900
1989-01-01 248239000 1646040 12605400 21500 94500 578330 951710 3168200 7872400 1564800
1990-01-01 248709873 1820130 12655500 23440 102560 639270 1054860 3073900 7945700 1635900
1991-01-01 252177000 1911770 12961100 24700 106590 687730 1092740 3157200 8142200 1661700
1992-01-01 255082000 1932270 12505900 23760 109060 672480 1126970 2979900 7915200 1610800
1993-01-01 257908000 1926020 12218800 24530 106010 659870 1135610 2834800 7820900 1563100
1994-01-01 260341000 1857670 12131900 23330 102220 618950 1113180 2712800 7879800 1539300
1995-01-01 262755000 1798790 12063900 21610 97470 580510 1099210 2593800 7997700 1472400
1996-01-01 265228572 1688540 11805300 19650 96250 535590 1037050 2506400 7904700 1394200
1997-01-01 267637000 1634770 11558175 18208 96153 498534 1023201 2460526 7743760 1354189
1998-01-01 270296000 1531044 10944590 16914 93103 446625 974402 2329950 7373886 1240754
1999-01-01 272690813 1426044 10208334 15522 89411 409371 911740 2100739 6955520 1152075
2000-01-01 281421906 1425486 10182586 15586 90178 408016 911706 2050992 6971590 1160002
2001-01-01 285317559 1439480 10437480 16037 90863 423557 909023 2116531 7092267 1228391
2002-01-01 287973924 1423677 10455277 16229 95235 420806 891407 2151252 7057370 1246646
2003-01-01 290690788 1383676 10442862 16528 93883 414235 859030 2154834 7026802 1261226
2004-01-01 293656842 1360088 10319386 16148 95089 401470 847381 2144446 6937089 1237851
2005-01-01 296507061 1390745 10174754 16740 94347 417438 862220 2155448 6783447 1235859
2006-01-01 299398484 1418043 9983568 17030 92757 447403 860853 2183746 6607013 1192809
2007-01-01 301621157 1408337 9843481 16929 90427 445125 855856 2176140 6568572 1095769
2008-01-01 304374846 1392628 9767915 16442 90479 443574 842134 2228474 6588046 958629
2009-01-01 307006550 1325896 9337060 15399 89241 408742 812514 2203313 6338095 795652
2010-01-01 309330219 1251248 9112625 14772 85593 369089 781844 2168457 6204601 739565
2011-01-01 311587816 1206031 9052743 14661 84175 354772 752423 2185140 6151095 716508
2012-01-01 313873685 1217067 9001992 14866 85141 355051 762009 2109932 6168874 723186
2013-01-01 316497531 1199684 8650761 14319 82109 345095 726575 1931835 6018632 700294
2014-01-01 318857056 1197987 8277829 14249 84041 325802 741291 1729806 5858496 689527
crime.idxmax()
Population           2014-01-01
Violent              1992-01-01
Property             1991-01-01
Murder               1991-01-01
Forcible_Rape        1992-01-01
Robbery              1991-01-01
Aggravated_assault   1993-01-01
Burglary             1980-01-01
Larceny_Theft        1991-01-01
Vehicle_Theft        1991-01-01
dtype: datetime64[ns]
  • The maxima of most crime categories cluster in 1991-1993, while the population grows steadily across the dataset (from about 180 million in 1960 to about 320 million in 2014), so crime rates in that period were higher than in other periods (although this alone does not prove the whole 1990s were the most dangerous decade)
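  • A quick sketch to compare rates rather than raw counts, dividing by population (per 100,000 people):
violent_rate = crime['Violent'] / crime['Population'] * 100000
violent_rate.idxmax()   # the year with the highest violent-crime rate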

5. Merging (Fictitious Names Data)

5.1 Build the sample data
import numpy as np
import pandas as pd

raw_data_1 = {
 'subject_id': ['1', '2', '3', '4', '5'],
 'first_name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'],
 'last_name': ['Anderson', 'Ackerman', 'Ali', 'Aoni', 'Atiches']}
raw_data_2 = {
 'subject_id': ['4', '5', '6', '7', '8'],
 'first_name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'],
 'last_name': ['Bonder', 'Black', 'Balwner', 'Brice', 'Btisan']}
raw_data_3 = {
 'subject_id': ['1', '2', '3', '4', '5', '7', '8', '9', '10', '11'],
 'test_id': [51, 15, 15, 61, 16, 14, 15, 1, 61, 16]}

data_1 = pd.DataFrame(raw_data_1)
data_2 = pd.DataFrame(raw_data_2)
data_3 = pd.DataFrame(raw_data_3)

print(data_1)
print(data_2)
print(data_3)
  subject_id first_name last_name
0          1       Alex  Anderson
1          2        Amy  Ackerman
2          3      Allen       Ali
3          4      Alice      Aoni
4          5     Ayoung   Atiches
  subject_id first_name last_name
0          4      Billy    Bonder
1          5      Brian     Black
2          6       Bran   Balwner
3          7      Bryce     Brice
4          8      Betty    Btisan
  subject_id  test_id
0          1       51
1          2       15
2          3       15
3          4       61
4          5       16
5          7       14
6          8       15
7          9        1
8         10       61
9         11       16
5.2 Concatenate top to bottom (stacking rows), axis=0
all_data = pd.concat([data_1,data_2], ignore_index=True)
all_data
subject_id first_name last_name
0 1 Alex Anderson
1 2 Amy Ackerman
2 3 Allen Ali
3 4 Alice Aoni
4 5 Ayoung Atiches
5 4 Billy Bonder
6 5 Brian Black
7 6 Bran Balwner
8 7 Bryce Brice
9 8 Betty Btisan
5.3 Concatenate side by side, axis=1
by_col = pd.concat([data_1,data_2], axis=1)
by_col
subject_id first_name last_name subject_id first_name last_name
0 1 Alex Anderson 4 Billy Bonder
1 2 Amy Ackerman 5 Brian Black
2 3 Allen Ali 6 Bran Balwner
3 4 Alice Aoni 7 Bryce Brice
4 5 Ayoung Atiches 8 Betty Btisan
5.4 Merge all_data with data_3 on subject_id
all_data = pd.merge(all_data,data_3,on='subject_id')
all_data
subject_id first_name last_name test_id
0 1 Alex Anderson 51
1 2 Amy Ackerman 15
2 3 Allen Ali 15
3 4 Alice Aoni 61
4 4 Billy Bonder 61
5 5 Ayoung Atiches 16
6 5 Brian Black 16
7 7 Bryce Brice 14
8 8 Betty Btisan 15
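  • merge() defaults to an inner join, so subject_id 6 (only in all_data) and 9-11 (only in data_3) are dropped above; an outer-join sketch keeps them, filling the missing side with NaN:
pd.concat([data_1, data_2], ignore_index=True).merge(data_3, on='subject_id', how='outer')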
5.5 Inner join of data_1 and data_2 on subject_id
inner_join = pd.merge(data_1,data_2, on='subject_id', how='inner')
inner_join
subject_id first_name_x last_name_x first_name_y last_name_y
0 4 Alice Aoni Billy Bonder
1 5 Ayoung Atiches Brian Black
5.6 Outer join of data_1 and data_2 on subject_id
outer_join = pd.merge(data_1,data_2, on='subject_id', how='outer')
outer_join
subject_id first_name_x last_name_x first_name_y last_name_y
0 1 Alex Anderson NaN NaN
1 2 Amy Ackerman NaN NaN
2 3 Allen Ali NaN NaN
3 4 Alice Aoni Billy Bonder
4 5 Ayoung Atiches Brian Black
5 6 NaN NaN Bran Balwner
6 7 NaN NaN Bryce Brice
7 8 NaN NaN Betty Btisan

6. Statistics: Exploring Wind Speed Data

6.1 Import the data
import pandas as pd
import datetime

wind = pd.read_table('wind.data', sep='\s+', parse_dates=[[0,1,2]])
wind.head()
Yr_Mo_Dy RPT VAL ROS KIL SHA BIR DUB CLA MUL CLO BEL MAL
0 2061-01-01 15.04 14.96 13.17 9.29 NaN 9.87 13.67 10.25 10.83 12.58 18.50 15.04
1 2061-01-02 14.71 NaN 10.83 6.50 12.62 7.67 11.50 10.04 9.79 9.67 17.54 13.83
2 2061-01-03 18.50 16.88 12.33 10.13 11.17 6.17 11.25 NaN 8.50 7.67 12.75 12.71
3 2061-01-04 10.58 6.63 11.75 4.58 4.54 2.88 8.63 1.79 5.83 5.88 5.46 10.88
4 2061-01-05 13.33 13.25 11.42 6.17 10.71 8.21 11.92 6.54 10.92 10.34 12.92 11.83
wind.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6574 entries, 0 to 6573
Data columns (total 13 columns):
 #   Column    Non-Null Count  Dtype         
---  ------    --------------  -----         
 0   Yr_Mo_Dy  6574 non-null   datetime64[ns]
 1   RPT       6568 non-null   float64       
 2   VAL       6571 non-null   float64       
 3   ROS       6572 non-null   float64       
 4   KIL       6569 non-null   float64       
 5   SHA       6572 non-null   float64       
 6   BIR       6574 non-null   float64       
 7   DUB       6571 non-null   float64       
 8   CLA       6572 non-null   float64       
 9   MUL       6571 non-null   float64       
 10  CLO       6573 non-null   float64       
 11  BEL       6574 non-null   float64       
 12  MAL       6570 non-null   float64       
dtypes: datetime64[ns](1), float64(12)
memory usage: 667.8 KB
6.2 Years such as 2061 are bad data; try to fix them
  • The original answer omits the outer pd.to_datetime() call, in which case wind['Yr_Mo_Dy'] would no longer have a datetime dtype after applying the function
    • For a single element you can use x.year directly
def fix_year(x):
    year=x.year-100 if x.year>1978 else x.year
    return pd.to_datetime(datetime.date(year, x.month, x.day),format='%Y/%m/%d')

wind['Yr_Mo_Dy']  = wind['Yr_Mo_Dy'].apply(fix_year)
wind['Yr_Mo_Dy'].dt.year.unique()
array([1961, 1962, 1963, 1964, 1965, 1966, 1967, 1968, 1969, 1970, 1971,
       1972, 1973, 1974, 1975, 1976, 1977, 1978], dtype=int64)
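  • A vectorized sketch of the same fix (run instead of the apply above), shifting the bad years back by a century:
mask = wind['Yr_Mo_Dy'].dt.year > 1978
wind.loc[mask, 'Yr_Mo_Dy'] = wind.loc[mask, 'Yr_Mo_Dy'] - pd.DateOffset(years=100)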
6.3 Set the date as the index; note the dtype should be datetime64
wind.set_index('Yr_Mo_Dy', inplace=True)
wind.head()
RPT VAL ROS KIL SHA BIR DUB CLA MUL CLO BEL MAL
Yr_Mo_Dy
1961-01-01 15.04 14.96 13.17 9.29 NaN 9.87 13.67 10.25 10.83 12.58 18.50 15.04
1961-01-02 14.71 NaN 10.83 6.50 12.62 7.67 11.50 10.04 9.79 9.67 17.54 13.83
1961-01-03 18.50 16.88 12.33 10.13 11.17 6.17 11.25 NaN 8.50 7.67 12.75 12.71
1961-01-04 10.58 6.63 11.75 4.58 4.54 2.88 8.63 1.79 5.83 5.88 5.46 10.88
1961-01-05 13.33 13.25 11.42 6.17 10.71 8.21 11.92 6.54 10.92 10.34 12.92 11.83
6.4 How many values are missing at each location?
wind.isna().sum()
RPT    6
VAL    3
ROS    2
KIL    5
SHA    2
BIR    0
DUB    3
CLA    2
MUL    3
CLO    1
BEL    0
MAL    4
dtype: int64
6.5 How many non-missing values are there at each location?
  • Both notna() and info() report the non-null counts
wind.notna().sum()
RPT    6568
VAL    6571
ROS    6572
KIL    6569
SHA    6572
BIR    6574
DUB    6571
CLA    6572
MUL    6571
CLO    6573
BEL    6574
MAL    6570
dtype: int64
# Original answer
wind.shape[0] - wind.isnull().sum()
RPT    6568
VAL    6571
ROS    6572
KIL    6569
SHA    6572
BIR    6574
DUB    6571
CLA    6572
MUL    6571
CLO    6573
BEL    6574
MAL    6570
dtype: int64
6.6 Mean wind speed over the whole dataset
wind.mean().mean()
10.227982360836938
6.7 Create a DataFrame storing the minimum, maximum, mean and standard deviation of the wind speed at each location
  • describe() aggregates column-wise, i.e. per location
wind_des = wind.describe().T.iloc[:,[3,7,1,2]]
wind_des
min max mean std
RPT 0.67 35.80 12.362987 5.618413
VAL 0.21 33.37 10.644314 5.267356
ROS 1.50 33.84 11.660526 5.008450
KIL 0.00 28.46 6.306468 3.605811
SHA 0.13 37.54 10.455834 4.936125
BIR 0.00 26.16 7.092254 3.968683
DUB 0.00 30.37 9.797343 4.977555
CLA 0.00 31.08 8.495053 4.499449
MUL 0.00 25.88 8.493590 4.166872
CLO 0.04 28.21 8.707332 4.503954
BEL 0.13 42.38 13.121007 5.835037
MAL 0.67 42.54 15.599079 6.699794
# step-by-step variant; note that axis=1 aggregates across locations for each day (per-day stats), whereas the question above asks for per-location stats (axis=0, the default)
day_stats = pd.DataFrame()
day_stats['min'] = wind.min(axis=1)
day_stats['max'] = wind.max(axis=1)
day_stats['mean'] = wind.mean(axis=1)
day_stats['std'] = wind.std(axis=1)
day_stats.head()
6.8 For each location, compute the mean wind speed in January
  • Note: the original author treats January of every year (1961, 1962, ...) together as "January"
  • query() is another option here; see the sketch at the end of this sub-section
wind['date'] = wind.index
wind.head()
RPT VAL ROS KIL SHA BIR DUB CLA MUL CLO BEL MAL date
Yr_Mo_Dy
1961-01-01 15.04 14.96 13.17 9.29 NaN 9.87 13.67 10.25 10.83 12.58 18.50 15.04 1961-01-01
1961-01-02 14.71 NaN 10.83 6.50 12.62 7.67 11.50 10.04 9.79 9.67 17.54 13.83 1961-01-02
1961-01-03 18.50 16.88 12.33 10.13 11.17 6.17 11.25 NaN 8.50 7.67 12.75 12.71 1961-01-03
1961-01-04 10.58 6.63 11.75 4.58 4.54 2.88 8.63 1.79 5.83 5.88 5.46 10.88 1961-01-04
1961-01-05 13.33 13.25 11.42 6.17 10.71 8.21 11.92 6.54 10.92 10.34 12.92 11.83 1961-01-05
wind['year']= wind['date'].dt.year
wind['month']= wind['date'].dt.month
wind['day']= wind['date'].dt.day
wind.head()
RPT VAL ROS KIL SHA BIR DUB CLA MUL CLO BEL MAL date year month day
Yr_Mo_Dy
1961-01-01 15.04 14.96 13.17 9.29 NaN 9.87 13.67 10.25 10.83 12.58 18.50 15.04 1961-01-01 1961 1 1
1961-01-02 14.71 NaN 10.83 6.50 12.62 7.67 11.50 10.04 9.79 9.67 17.54 13.83 1961-01-02 1961 1 2
1961-01-03 18.50 16.88 12.33 10.13 11.17 6.17 11.25 NaN 8.50 7.67 12.75 12.71 1961-01-03 1961 1 3
1961-01-04 10.58 6.63 11.75 4.58 4.54 2.88 8.63 1.79 5.83 5.88 5.46 10.88 1961-01-04 1961 1 4
1961-01-05 13.33 13.25 11.42 6.17 10.71 8.21 11.92 6.54 10.92 10.34 12.92 11.83 1961-01-05 1961 1 5
# Original answer
wind['year'] = wind['date'].apply(lambda x: x.year)
wind['month'] = wind['date'].apply(lambda x: x.month)
wind['day'] = wind['date'].apply(lambda x: x.day)
wind[wind['month']==1].iloc[:,0:12].mean()
RPT    14.847325
VAL    12.914560
ROS    13.299624
KIL     7.199498
SHA    11.667734
BIR     8.054839
DUB    11.819355
CLA     9.512047
MUL     9.543208
CLO    10.053566
BEL    14.550520
MAL    18.028763
dtype: float64
  • Alternative: locate the rows through the DatetimeIndex with .loc
wind.loc[wind.index.month==1].mean()
RPT    14.847325
VAL    12.914560
ROS    13.299624
KIL     7.199498
SHA    11.667734
BIR     8.054839
DUB    11.819355
CLA     9.512047
MUL     9.543208
CLO    10.053566
BEL    14.550520
MAL    18.028763
dtype: float64
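  • And the query() usage mentioned above (a sketch, assuming the month column added earlier):
wind.query('month == 1').iloc[:, 0:12].mean()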
6.9 Downsample the records to yearly frequency for each location
  • A DatetimeIndex is a collection of individual timestamps
  • A PeriodIndex is a DatetimeIndex with a fixed frequency, obtained via DatetimeIndex.to_period(freq)
wind.groupby(wind.index.to_period('A')).mean()
RPT VAL ROS KIL SHA BIR DUB CLA MUL CLO BEL MAL
Yr_Mo_Dy
1961 12.299583 10.351796 11.362369 6.958227 10.881763 7.729726 9.733923 8.858788 8.647652 9.835577 13.502795 13.680773
1962 12.246923 10.110438 11.732712 6.960440 10.657918 7.393068 11.020712 8.793753 8.316822 9.676247 12.930685 14.323956
1963 12.813452 10.836986 12.541151 7.330055 11.724110 8.434712 11.075699 10.336548 8.903589 10.224438 13.638877 14.999014
1964 12.363661 10.920164 12.104372 6.787787 11.454481 7.570874 10.259153 9.467350 7.789016 10.207951 13.740546 14.910301
1965 12.451370 11.075534 11.848767 6.858466 11.024795 7.478110 10.618712 8.879918 7.907425 9.918082 12.964247 15.591644
1966 13.461973 11.557205 12.020630 7.345726 11.805041 7.793671 10.579808 8.835096 8.514438 9.768959 14.265836 16.307260
1967 12.737151 10.990986 11.739397 7.143425 11.630740 7.368164 10.652027 9.325616 8.645014 9.547425 14.774548 17.135945
1968 11.835628 10.468197 11.409754 6.477678 10.760765 6.067322 8.859180 8.255519 7.224945 7.832978 12.808634 15.017486
1969 11.166356 9.723699 10.902000 5.767973 9.873918 6.189973 8.564493 7.711397 7.924521 7.754384 12.621233 15.762904
1970 12.600329 10.726932 11.730247 6.217178 10.567370 7.609452 9.609890 8.334630 9.297616 8.289808 13.183644 16.456027
1971 11.273123 9.095178 11.088329 5.241507 9.440329 6.097151 8.385890 6.757315 7.915370 7.229753 12.208932 15.025233
1972 12.463962 10.561311 12.058333 5.929699 9.430410 6.358825 9.704508 7.680792 8.357295 7.515273 12.727377 15.028716
1973 11.828466 10.680493 10.680493 5.547863 9.640877 6.548740 8.482110 7.614274 8.245534 7.812411 12.169699 15.441096
1974 13.643096 11.811781 12.336356 6.427041 11.110986 6.809781 10.084603 9.896986 9.331753 8.736356 13.252959 16.947671
1975 12.008575 10.293836 11.564712 5.269096 9.190082 5.668521 8.562603 7.843836 8.797945 7.382822 12.631671 15.307863
1976 11.737842 10.203115 10.761230 5.109426 8.846339 6.311038 9.149126 7.146202 8.883716 7.883087 12.332377 15.471448
1977 13.099616 11.144493 12.627836 6.073945 10.003836 8.586438 11.523205 8.378384 9.098192 8.821616 13.459068 16.590849
1978 12.504356 11.044274 11.380000 6.082356 10.167233 7.650658 9.489342 8.800466 9.089753 8.301699 12.967397 16.771370
6.10 Downsample the records to monthly frequency for each location
wind.groupby(wind.index.to_period('M')).mean()
RPT VAL ROS KIL SHA BIR DUB CLA MUL CLO BEL MAL
Yr_Mo_Dy
1961-01 14.841333 11.988333 13.431613 7.736774 11.072759 8.588065 11.184839 9.245333 9.085806 10.107419 13.880968 14.703226
1961-02 16.269286 14.975357 14.441481 9.230741 13.852143 10.937500 11.890714 11.846071 11.821429 12.714286 18.583214 15.411786
1961-03 10.890000 11.296452 10.752903 7.284000 10.509355 8.866774 9.644194 9.829677 10.294138 11.251935 16.410968 15.720000
1961-04 10.722667 9.427667 9.998000 5.830667 8.435000 6.495000 6.925333 7.094667 7.342333 7.237000 11.147333 10.278333
1961-05 9.860968 8.850000 10.818065 5.905333 9.490323 6.574839 7.604000 8.177097 8.039355 8.499355 11.900323 12.011613
... ... ... ... ... ... ... ... ... ... ... ... ...
1978-08 9.645161 8.259355 9.032258 4.502903 7.368065 5.935161 5.650323 5.417742 7.241290 5.536774 10.466774 12.054194
1978-09 10.913667 10.895000 10.635000 5.725000 10.372000 9.278333 10.790333 9.583000 10.069333 8.939000 15.680333 19.391333
1978-10 9.897742 8.670968 9.295806 4.721290 8.525161 6.774194 8.115484 7.337742 8.297742 8.243871 13.776774 17.150000
1978-11 16.151667 14.802667 13.508000 7.317333 11.475000 8.743000 11.492333 9.657333 10.701333 10.676000 17.404667 20.723000
1978-12 16.175484 13.748065 15.635161 7.094839 11.398710 9.241613 12.077419 10.194839 10.616774 11.028710 13.859677 21.371613

216 rows × 12 columns

6.11 Downsample the records to weekly frequency for each location
wind.groupby(wind.index.to_period('W')).mean()
RPT VAL ROS KIL SHA BIR DUB CLA MUL CLO BEL MAL
Yr_Mo_Dy
1960-12-26/1961-01-01 15.040000 14.960000 13.170000 9.290000 NaN 9.870000 13.670000 10.250000 10.830000 12.580000 18.500000 15.040000
1961-01-02/1961-01-08 13.541429 11.486667 10.487143 6.417143 9.474286 6.435714 11.061429 6.616667 8.434286 8.497143 12.481429 13.238571
1961-01-09/1961-01-15 12.468571 8.967143 11.958571 4.630000 7.351429 5.072857 7.535714 6.820000 5.712857 7.571429 11.125714 11.024286
1961-01-16/1961-01-22 13.204286 9.862857 12.982857 6.328571 8.966667 7.417143 9.257143 7.875714 7.145714 8.124286 9.821429 11.434286
1961-01-23/1961-01-29 19.880000 16.141429 18.225714 12.720000 17.432857 14.828571 15.528571 15.160000 14.480000 15.640000 20.930000 22.530000
... ... ... ... ... ... ... ... ... ... ... ... ...
1978-11-27/1978-12-03 14.934286 11.232857 13.941429 5.565714 10.215714 8.618571 9.642857 7.685714 9.011429 9.547143 11.835714 18.728571
1978-12-04/1978-12-10 20.740000 19.190000 17.034286 9.777143 15.287143 12.774286 14.437143 12.488571 13.870000 14.082857 18.517143 23.061429
1978-12-11/1978-12-17 16.758571 14.692857 14.987143 6.917143 11.397143 7.272857 10.208571 7.967143 9.168571 8.565714 11.102857 15.562857
1978-12-18/1978-12-24 11.155714 8.008571 13.172857 4.004286 7.825714 6.290000 7.798571 8.667143 7.151429 8.072857 11.845714 18.977143
1978-12-25/1978-12-31 14.951429 11.801429 16.035714 6.507143 9.660000 8.620000 13.708571 10.477143 10.868571 11.471429 12.947143 26.844286

940 rows × 12 columns

6.12 Compute the minimum, maximum, mean and standard deviation of the wind speed across all locations for each of the first 52 weeks (assume the first week starts on 2 January 1961)
# resample to weekly frequency and aggregate
weekly = wind.resample('W').agg(['min','max','mean','std'])
weekly
RPT VAL ROS ... CLO BEL MAL
min max mean std min max mean std min max ... mean std min max mean std min max mean std
Yr_Mo_Dy
1961-01-01 15.04 15.04 15.040000 NaN 14.96 14.96 14.960000 NaN 13.17 13.17 ... 12.580000 NaN 18.50 18.50 18.500000 NaN 15.04 15.04 15.040000 NaN
1961-01-08 10.58 18.50 13.541429 2.631321 6.63 16.88 11.486667 3.949525 7.62 12.33 ... 8.497143 1.704941 5.46 17.54 12.481429 4.349139 10.88 16.46 13.238571 1.773062
1961-01-15 9.04 19.75 12.468571 3.555392 3.54 12.08 8.967143 3.148945 7.08 19.50 ... 7.571429 4.084293 5.25 20.71 11.125714 5.552215 5.17 16.92 11.024286 4.692355
1961-01-22 4.92 19.83 13.204286 5.337402 3.42 14.37 9.862857 3.837785 7.29 20.79 ... 8.124286 4.783952 6.50 15.92 9.821429 3.626584 6.79 17.96 11.434286 4.237239
1961-01-29 13.62 25.04 19.880000 4.619061 9.96 23.91 16.141429 5.170224 12.67 25.84 ... 15.640000 3.713368 14.04 27.71 20.930000 5.210726 17.50 27.63 22.530000 3.874721
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1978-12-03 9.08 21.29 14.934286 4.931754 4.54 21.34 11.232857 5.978968 8.21 24.04 ... 9.547143 6.284973 4.92 21.42 11.835714 5.950112 11.50 25.75 18.728571 6.393188
1978-12-10 9.92 29.33 20.740000 7.215012 12.54 24.79 19.190000 4.953060 7.21 25.37 ... 14.082857 5.516405 9.54 26.08 18.517143 5.600389 15.34 34.59 23.061429 8.093976
1978-12-17 9.87 23.13 16.758571 4.499431 3.21 24.04 14.692857 7.578665 8.04 18.05 ... 8.565714 5.487801 5.00 21.50 11.102857 6.631925 6.92 22.83 15.562857 6.005594
1978-12-24 6.21 16.62 11.155714 3.522759 3.63 13.29 8.008571 3.882900 8.50 22.21 ... 8.072857 3.023131 3.21 19.79 11.845714 5.750301 10.29 31.71 18.977143 7.194108
1978-12-31 7.21 20.33 14.951429 4.350400 5.46 17.41 11.801429 4.705392 7.83 27.29 ... 11.471429 5.533397 1.21 21.79 12.947143 7.523148 11.96 41.46 26.844286 11.621233

940 rows × 48 columns

weekly.loc[weekly.index[1:53],:].head()
RPT VAL ROS ... CLO BEL MAL
min max mean std min max mean std min max ... mean std min max mean std min max mean std
Yr_Mo_Dy
1961-01-08 10.58 18.50 13.541429 2.631321 6.63 16.88 11.486667 3.949525 7.62 12.33 ... 8.497143 1.704941 5.46 17.54 12.481429 4.349139 10.88 16.46 13.238571 1.773062
1961-01-15 9.04 19.75 12.468571 3.555392 3.54 12.08 8.967143 3.148945 7.08 19.50 ... 7.571429 4.084293 5.25 20.71 11.125714 5.552215 5.17 16.92 11.024286 4.692355
1961-01-22 4.92 19.83 13.204286 5.337402 3.42 14.37 9.862857 3.837785 7.29 20.79 ... 8.124286 4.783952 6.50 15.92 9.821429 3.626584 6.79 17.96 11.434286 4.237239
1961-01-29 13.62 25.04 19.880000 4.619061 9.96 23.91 16.141429 5.170224 12.67 25.84 ... 15.640000 3.713368 14.04 27.71 20.930000 5.210726 17.50 27.63 22.530000 3.874721
1961-02-05 10.58 24.21 16.827143 5.251408 9.46 24.21 15.460000 5.187395 9.04 19.70 ... 9.460000 2.839501 9.17 19.33 14.012857 4.210858 7.17 19.25 11.935714 4.336104

5 rows × 48 columns
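  • An equivalent positional slice of the same resampled frame (a sketch): skip the partial week ending 1961-01-01 and keep the next 52
first_52_weeks = weekly.iloc[1:53]
first_52_weeks.shape   # (52, 48): 52 weeks x (12 locations x 4 statistics)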

7. Exploring the Titanic Disaster Data (Visualization)

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

plt.rcParams["font.sans-serif"]=['SimHei']  #正常显示中文
plt.rcParams["axes.unicode_minus"]=False   # 正常显示负号

%matplotlib inline

import warnings
warnings.filterwarnings('ignore')
7.1 Import the data
titanic = pd.read_csv('train.csv')
titanic.head()
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S
titanic.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
7.2 Draw a pie chart showing the proportion of passengers by sex
males = (titanic['Sex']=='male').sum()
females = (titanic['Sex']=='female').sum()

proportions = [males, females]
#plt.figure(figsize=(8,8), dpi=100)
#plt.Subplot(111,facecolor='#F0F0F0')

plt.pie(
    proportions,
    labels = ['Males', 'Females'],
    shadow = True,
    colors = ['#E76278','#FAC699'],
    explode = (0.15,0),
    startangle = 90,
    autopct = '%1.1f%%'
    )

plt.axis('equal')
plt.title('Sex Proportion')
plt.tight_layout()
plt.show()


[figure: pie chart of passenger sex proportions]
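  • A shorter sketch of essentially the same chart via value_counts():
titanic['Sex'].value_counts().plot.pie(autopct='%1.1f%%', colors=['#E76278', '#FAC699'])
plt.ylabel('')
plt.show()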

7.3 Draw a scatter plot of ticket fare against passenger age, colored by sex
lm = sns.lmplot(x='Age', y='Fare', data=titanic, hue='Sex', fit_reg=False)
lm.set(title='Fare_Age Scatter')

plt.tight_layout()
plt.show()

[figure: Fare vs. Age scatter plot]

7.4 How many passengers survived?
titanic['Survived'].sum()
342
7.5 Draw a histogram of ticket fares
plt.figure(figsize=(10,4), dpi=100)
plt.subplot(111, facecolor='#F0F0F0')

plt.title('Fare Distribution')
plt.xlabel('Fare')
plt.ylabel('Passenger Count')

plt.hist(titanic['Fare'], bins=50, color='#447C69')
plt.xticks(np.arange(0,300,20), fontsize=10)
plt.yticks(fontsize=10)

plt.tight_layout()

[figure: histogram of ticket fares]

8. Exploring Pokemon Data

8.1 Set up the raw data
import pandas as pd
raw_data = {"name": ['Bulbasaur', 'Charmander','Squirtle','Caterpie'],
 "evolution": ['Ivysaur','Charmeleon','Wartortle','Metapod'],
 "type": ['grass', 'fire', 'water', 'bug'],
 "hp": [45, 39, 44, 45],
 "pokedex": ['yes', 'no','yes','no']
 }
8.2 Create the DataFrame
pokemon = pd.DataFrame(raw_data)
pokemon.head()
name evolution type hp pokedex
0 Bulbasaur Ivysaur grass 45 yes
1 Charmander Charmeleon fire 39 no
2 Squirtle Wartortle water 44 yes
3 Caterpie Metapod bug 45 no
8.3 Reorder the columns
pokemon = pokemon[['name','type','hp','evolution','pokedex']]
pokemon.head()
name type hp evolution pokedex
0 Bulbasaur grass 45 Ivysaur yes
1 Charmander fire 39 Charmeleon no
2 Squirtle water 44 Wartortle yes
3 Caterpie bug 45 Metapod no
8.4 Add a place column
pokemon['place'] = ['park','street','lake','forest']
pokemon
name type hp evolution pokedex place
0 Bulbasaur grass 45 Ivysaur yes park
1 Charmander fire 39 Charmeleon no street
2 Squirtle water 44 Wartortle yes lake
3 Caterpie bug 45 Metapod no forest
8.5 Check the dtype of each column
pokemon.dtypes
name         object
type         object
hp            int64
evolution    object
pokedex      object
place        object
dtype: object

9. Exploring Apple Stock Price Data

9.1 Import the data
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
%matplotlib inline
apple = pd.read_csv('Apple_stock.csv')
apple.head()
Date Open High Low Close Volume Adj Close
0 2014-07-08 96.27 96.80 93.92 95.35 65130000 95.35
1 2014-07-07 94.14 95.99 94.10 95.97 56305400 95.97
2 2014-07-03 93.67 94.10 93.20 94.03 22891800 94.03
3 2014-07-02 93.87 94.06 93.09 93.48 28420900 93.48
4 2014-07-01 93.52 94.07 93.13 93.52 38170200 93.52
apple.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 8465 entries, 1980-12-12 to 2014-07-08
Data columns (total 6 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   Open       8465 non-null   float64
 1   High       8465 non-null   float64
 2   Low        8465 non-null   float64
 3   Close      8465 non-null   float64
 4   Volume     8465 non-null   int64  
 5   Adj Close  8465 non-null   float64
dtypes: float64(5), int64(1)
memory usage: 721.0 KB
9.2 Are there any duplicate dates?
  • Two ways to check
apple['Date'].duplicated().sum()
# no duplicate dates
0
apple['Date'].is_unique
True
# is_unique also works on an index
apple.index.is_unique
True
  • Set the Date column as the index
apple['Date'] = pd.to_datetime(apple['Date'])
apple.set_index('Date', inplace=True)
apple.head()
Open High Low Close Volume Adj Close
Date
2014-07-08 96.27 96.80 93.92 95.35 65130000 95.35
2014-07-07 94.14 95.99 94.10 95.97 56305400 95.97
2014-07-03 93.67 94.10 93.20 94.03 22891800 94.03
2014-07-02 93.87 94.06 93.09 93.48 28420900 93.48
2014-07-01 93.52 94.07 93.13 93.52 38170200 93.52
9.3 Sort the index in ascending order
apple.sort_index(ascending=True, inplace=True)
apple.head()
Open High Low Close Volume Adj Close
Date
1980-12-12 28.75 28.87 28.75 28.75 117258400 0.45
1980-12-15 27.38 27.38 27.25 27.25 43971200 0.42
1980-12-16 25.37 25.37 25.25 25.25 26432000 0.39
1980-12-17 25.87 26.00 25.87 25.87 21610400 0.40
1980-12-18 26.63 26.75 26.63 26.63 18362400 0.41
9.4 Find the last business day of each month
# 'BM' business month end frequency
apple_month = apple.resample('BM').mean()
apple_month.head()
Open High Low Close Volume Adj Close
Date
1980-12-31 30.481538 30.567692 30.443077 30.443077 2.586252e+07 0.473077
1981-01-30 31.754762 31.826667 31.654762 31.654762 7.249867e+06 0.493810
1981-02-27 26.480000 26.572105 26.407895 26.407895 4.231832e+06 0.411053
1981-03-31 24.937727 25.016818 24.836364 24.836364 7.962691e+06 0.387727
1981-04-30 27.286667 27.368095 27.227143 27.227143 6.392000e+06 0.423333
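  • resample('BM').mean() returns monthly means labelled at business-month ends; if the goal is literally the last trading day present in the data each month, a sketch:
apple.index.to_series().groupby(apple.index.to_period('M')).max().head()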
9.5 How many days separate the earliest and latest dates in the dataset?
(apple.index.max()-apple.index.min()).days
12261
9.6 How many months are there in the dataset?
len(apple.resample('BM').mean())
404
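  • An equivalent count without resampling (a sketch):
apple.index.to_period('M').nunique()   # number of distinct year-months present in the data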
9.7 Plot the Adj Close values in chronological order
plt.figure(figsize=(8,3), dpi=100)
apple['Adj Close'].plot(title='Apple Stock')
<AxesSubplot:title={'center':'Apple Stock'}, xlabel='Date'>

[figure: Apple Adj Close over time]

10. Exploring the Iris Dataset

10.1 Import the data
import pandas as pd
import numpy as np   # np.nan is used below
iris = pd.read_csv('iris.csv')
iris.head()
5.1 3.5 1.4 0.2 Iris-setosa
0 4.9 3.0 1.4 0.2 Iris-setosa
1 4.7 3.2 1.3 0.2 Iris-setosa
2 4.6 3.1 1.5 0.2 Iris-setosa
3 5.0 3.6 1.4 0.2 Iris-setosa
4 5.4 3.9 1.7 0.4 Iris-setosa
iris.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 149 entries, 0 to 148
Data columns (total 5 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   5.1          149 non-null    float64
 1   3.5          149 non-null    float64
 2   1.4          149 non-null    float64
 3   0.2          149 non-null    float64
 4   Iris-setosa  149 non-null    object 
dtypes: float64(4), object(1)
memory usage: 5.9+ KB
10.2 Add column names (note: read_csv treated the first data row as the header, which is why only 149 rows remain; header=None on import would keep all 150)
iris.columns = ['sepal_length','sepal_width', 'petal_length','petal_width', 'class']
iris.head()
sepal_length sepal_width petal_length petal_width class
0 4.9 3.0 1.4 0.2 Iris-setosa
1 4.7 3.2 1.3 0.2 Iris-setosa
2 4.6 3.1 1.5 0.2 Iris-setosa
3 5.0 3.6 1.4 0.2 Iris-setosa
4 5.4 3.9 1.7 0.4 Iris-setosa
10.3 Does the dataset have any missing values?
iris.isna().sum()
sepal_length    0
sepal_width     0
petal_length    0
petal_width     0
class           0
dtype: int64
10.4 Set rows 10 through 19 of petal_length to missing
  • Python's range() is half-open: stop=20 below, so row 20 is not included
  • a direct alternative is iris.loc[10:19, 'petal_length'] = np.nan (loc is label-based and includes 19)
for i in range(10,20):
    iris['petal_length'].at[i] = np.nan

iris.head(21)
sepal_length sepal_width petal_length petal_width class
0 4.9 3.0 1.4 0.2 Iris-setosa
1 4.7 3.2 1.3 0.2 Iris-setosa
2 4.6 3.1 1.5 0.2 Iris-setosa
3 5.0 3.6 1.4 0.2 Iris-setosa
4 5.4 3.9 1.7 0.4 Iris-setosa
5 4.6 3.4 1.4 0.3 Iris-setosa
6 5.0 3.4 1.5 0.2 Iris-setosa
7 4.4 2.9 1.4 0.2 Iris-setosa
8 4.9 3.1 1.5 0.1 Iris-setosa
9 5.4 3.7 1.5 0.2 Iris-setosa
10 4.8 3.4 NaN 0.2 Iris-setosa
11 4.8 3.0 NaN 0.1 Iris-setosa
12 4.3 3.0 NaN 0.1 Iris-setosa
13 5.8 4.0 NaN 0.2 Iris-setosa
14 5.7 4.4 NaN 0.4 Iris-setosa
15 5.4 3.9 NaN 0.4 Iris-setosa
16 5.1 3.5 NaN 0.3 Iris-setosa
17 5.7 3.8 NaN 0.3 Iris-setosa
18 5.1 3.8 NaN 0.3 Iris-setosa
19 5.4 3.4 NaN 0.2 Iris-setosa
20 5.1 3.7 1.5 0.4 Iris-setosa
# note: column position 1 is sepal_width, so this hits the wrong column; iloc[10:20, 2:3] would target petal_length
# (the output below apparently comes from a fresh reload of the data, which is why petal_length shows values again)
iris.iloc[10:20, 1:2] = np.nan
iris.head(21)
sepal_length sepal_width petal_length petal_width class
0 4.9 3.0 1.4 0.2 Iris-setosa
1 4.7 3.2 1.3 0.2 Iris-setosa
2 4.6 3.1 1.5 0.2 Iris-setosa
3 5.0 3.6 1.4 0.2 Iris-setosa
4 5.4 3.9 1.7 0.4 Iris-setosa
5 4.6 3.4 1.4 0.3 Iris-setosa
6 5.0 3.4 1.5 0.2 Iris-setosa
7 4.4 2.9 1.4 0.2 Iris-setosa
8 4.9 3.1 1.5 0.1 Iris-setosa
9 5.4 3.7 1.5 0.2 Iris-setosa
10 4.8 NaN 1.6 0.2 Iris-setosa
11 4.8 NaN 1.4 0.1 Iris-setosa
12 4.3 NaN 1.1 0.1 Iris-setosa
13 5.8 NaN 1.2 0.2 Iris-setosa
14 5.7 NaN 1.5 0.4 Iris-setosa
15 5.4 NaN 1.3 0.4 Iris-setosa
16 5.1 NaN 1.4 0.3 Iris-setosa
17 5.7 NaN 1.7 0.3 Iris-setosa
18 5.1 NaN 1.5 0.3 Iris-setosa
19 5.4 NaN 1.7 0.2 Iris-setosa
20 5.1 3.7 1.5 0.4 Iris-setosa
10.5 Replace all missing values with 1.0
iris.fillna(1, inplace=True)
iris.head(21)
sepal_length sepal_width petal_length petal_width class
0 4.9 3.0 1.4 0.2 Iris-setosa
1 4.7 3.2 1.3 0.2 Iris-setosa
2 4.6 3.1 1.5 0.2 Iris-setosa
3 5.0 3.6 1.4 0.2 Iris-setosa
4 5.4 3.9 1.7 0.4 Iris-setosa
5 4.6 3.4 1.4 0.3 Iris-setosa
6 5.0 3.4 1.5 0.2 Iris-setosa
7 4.4 2.9 1.4 0.2 Iris-setosa
8 4.9 3.1 1.5 0.1 Iris-setosa
9 5.4 3.7 1.5 0.2 Iris-setosa
10 4.8 3.4 1.0 0.2 Iris-setosa
11 4.8 3.0 1.0 0.1 Iris-setosa
12 4.3 3.0 1.0 0.1 Iris-setosa
13 5.8 4.0 1.0 0.2 Iris-setosa
14 5.7 4.4 1.0 0.4 Iris-setosa
15 5.4 3.9 1.0 0.4 Iris-setosa
16 5.1 3.5 1.0 0.3 Iris-setosa
17 5.7 3.8 1.0 0.3 Iris-setosa
18 5.1 3.8 1.0 0.3 Iris-setosa
19 5.4 3.4 1.0 0.2 Iris-setosa
20 5.1 3.7 1.5 0.4 Iris-setosa
10.6 Delete the class column
del iris['class']
iris.head()
sepal_length sepal_width petal_length petal_width
0 4.9 3.0 1.4 0.2
1 4.7 3.2 1.3 0.2
2 4.6 3.1 1.5 0.2
3 5.0 3.6 1.4 0.2
4 5.4 3.9 1.7 0.4
10.7 Set the first three rows to missing
iris.iloc[0:3,:] = np.nan
iris.head()
sepal_length sepal_width petal_length petal_width
0 NaN NaN NaN NaN
1 NaN NaN NaN NaN
2 NaN NaN NaN NaN
3 5.0 3.6 1.4 0.2
4 5.4 3.9 1.7 0.4
10.8 Drop rows with missing values (note: the code below actually calls drop_duplicates(), which only removes the duplicate all-NaN rows, which is why an all-NaN row is still visible at index 0; see the dropna() sketch after this sub-section)
iris.drop_duplicates(inplace=True)
iris.head()
sepal_length sepal_width petal_length petal_width
0 NaN NaN NaN NaN
3 5.0 3.6 1.4 0.2
4 5.4 3.9 1.7 0.4
5 4.6 3.4 1.4 0.3
6 5.0 3.4 1.5 0.2
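  • The dropna() version the heading actually asks for (a sketch; it would also remove the all-NaN row still visible at index 0):
iris.dropna(how='any', inplace=True)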
10.9 Reset the index
iris.reset_index(inplace=True, drop=True)
iris.head()
sepal_length sepal_width petal_length petal_width
0 NaN NaN NaN NaN
1 5.0 3.6 1.4 0.2
2 5.4 3.9 1.7 0.4
3 4.6 3.4 1.4 0.3
4 5.0 3.4 1.5 0.2