GitHub repo: https://github.com/guipsamora/pandas_exercises
Dataset: https://www.aliyundrive.com/s/Wk6L6R67rrn
1. Getting to know the data
1.1 Import the dataset and inspect it
import pandas as pd
chipotle = pd.read_csv('001_chipotle.tsv',sep='\t')
chipotle.head()
|   | order_id | quantity | item_name | choice_description | item_price |
|---|---|---|---|---|---|
| 0 | 1 | 1 | Chips and Fresh Tomato Salsa | NaN | $2.39 |
| 1 | 1 | 1 | Izze | [Clementine] | $3.39 |
| 2 | 1 | 1 | Nantucket Nectar | [Apple] | $3.39 |
| 3 | 1 | 1 | Chips and Tomatillo-Green Chili Salsa | NaN | $2.39 |
| 4 | 2 | 2 | Chicken Bowl | [Tomatillo-Red Chili Salsa (Hot), [Black Beans... | $16.98 |
chipotle.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4622 entries, 0 to 4621
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 order_id 4622 non-null int64
1 quantity 4622 non-null int64
2 item_name 4622 non-null object
3 choice_description 3376 non-null object
4 item_price 4622 non-null object
dtypes: int64(2), object(3)
memory usage: 180.7+ KB
chipotle.columns
Index(['order_id', 'quantity', 'item_name', 'choice_description',
'item_price'],
dtype='object')
chipotle.index
RangeIndex(start=0, stop=4622, step=1)
1.2 What is the most-ordered item?
- as_index=False keeps the grouping key as a regular column and uses the default integer index;
- in display terms, all column names stay on one line ("no merged cells"), which is usually easier to work with; see the sketch below.
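A minimal sketch of the difference (using the chipotle frame loaded above):

```python
# as_index=True (the default): item_name becomes the index of the result
chipotle.groupby('item_name')['quantity'].sum().head(2)

# as_index=False: item_name stays a regular column, with a default integer index
chipotle.groupby('item_name', as_index=False)['quantity'].sum().head(2)
```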
item_buy_count = chipotle[['item_name','quantity']].groupby('item_name').sum()['quantity'].nlargest(1)
item_buy_count
item_name
Chicken Bowl 761
Name: quantity, dtype: int64
item_buy_count_1 = chipotle[['item_name','quantity']].groupby('item_name', as_index=False).agg({'quantity':'sum'})
item_buy_count_1.sort_values('quantity', ascending=False, inplace=True)
item_buy_count_1.head()
|    | item_name | quantity |
|---|---|---|
| 17 | Chicken Bowl | 761 |
| 18 | Chicken Burrito | 591 |
| 25 | Chips and Guacamole | 506 |
| 39 | Steak Burrito | 386 |
| 10 | Canned Soft Drink | 351 |
1.3 How many distinct items were ordered?
chipotle['item_name'].nunique()
50
s = chipotle['item_name'].value_counts()
from matplotlib import pyplot as plt
plt.figure(figsize=(20,4),dpi=100)
plt.xticks(fontsize=13, rotation='vertical')
plt.bar(s.index,s.values)
<BarContainer object of 50 artists>
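The same chart can also be drawn through pandas' plotting accessor, which wraps matplotlib (an equivalent sketch):

```python
# Series.plot.bar drives plt.bar under the hood
s.plot.bar(figsize=(20, 4), fontsize=13)
```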
1.4 Ranking of choice_description among the ordered items
chipotle['choice_description'].value_counts().head()
[Diet Coke] 134
[Coke] 123
[Sprite] 77
[Fresh Tomato Salsa, [Rice, Black Beans, Cheese, Sour Cream, Lettuce]] 42
[Fresh Tomato Salsa, [Rice, Black Beans, Cheese, Sour Cream, Guacamole, Lettuce]] 40
Name: choice_description, dtype: int64
1.5 How many items were ordered in total?
chipotle['quantity'].sum()
4972
1.6 Convert item_price to float
chipotle['item_price'] = chipotle['item_price'].str.lstrip('$').astype('float')
chipotle.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4622 entries, 0 to 4621
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 order_id 4622 non-null int64
1 quantity 4622 non-null int64
2 item_name 4622 non-null object
3 choice_description 3376 non-null object
4 item_price 4622 non-null float64
dtypes: float64(1), int64(2), object(2)
memory usage: 180.7+ KB
An equivalent one-liner (the reference solution names the frame `chipo`, which caused a NameError here; with the name used in this notebook, it must also run *before* the lstrip conversion above, while item_price is still a string):
chipotle['item_price'] = chipotle['item_price'].apply(lambda x: float(x[1:-1]))
1.7 Total revenue
chipotle['sub_total'] = (chipotle['item_price'] * chipotle['quantity'])
chipotle['sub_total'].sum()
39237.02
1.8 Total number of orders
chipotle['order_id'].nunique()
1834
1.9 Average revenue per order
chipotle['sub_total'].sum()/chipotle['order_id'].nunique()
21.39423118865867
chipotle[['order_id','sub_total']].groupby(by=['order_id']).agg({'sub_total':'sum'})['sub_total'].mean()
21.39423118865867
2. Data filtering and sorting
2.1 Import the data and inspect it
import pandas as pd
euro = pd.read_csv('Euro2012_stats.csv')
euro.head()
|   | Team | Goals | Shots on target | Shots off target | Shooting Accuracy | % Goals-to-shots | Total shots (inc. Blocked) | Hit Woodwork | Penalty goals | Penalties not scored | ... | Saves made | Saves-to-shots ratio | Fouls Won | Fouls Conceded | Offsides | Yellow Cards | Red Cards | Subs on | Subs off | Players Used |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Croatia | 4 | 13 | 12 | 51.9% | 16.0% | 32 | 0 | 0 | 0 | ... | 13 | 81.3% | 41 | 62 | 2 | 9 | 0 | 9 | 9 | 16 |
| 1 | Czech Republic | 4 | 13 | 18 | 41.9% | 12.9% | 39 | 0 | 0 | 0 | ... | 9 | 60.1% | 53 | 73 | 8 | 7 | 0 | 11 | 11 | 19 |
| 2 | Denmark | 4 | 10 | 10 | 50.0% | 20.0% | 27 | 1 | 0 | 0 | ... | 10 | 66.7% | 25 | 38 | 8 | 4 | 0 | 7 | 7 | 15 |
| 3 | England | 5 | 11 | 18 | 50.0% | 17.2% | 40 | 0 | 0 | 0 | ... | 22 | 88.1% | 43 | 45 | 6 | 5 | 0 | 11 | 11 | 16 |
| 4 | France | 3 | 22 | 24 | 37.9% | 6.5% | 65 | 1 | 0 | 0 | ... | 6 | 54.6% | 36 | 51 | 5 | 6 | 0 | 11 | 11 | 19 |

5 rows × 35 columns
euro.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16 entries, 0 to 15
Data columns (total 35 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Team 16 non-null object
1 Goals 16 non-null int64
2 Shots on target 16 non-null int64
3 Shots off target 16 non-null int64
4 Shooting Accuracy 16 non-null object
5 % Goals-to-shots 16 non-null object
6 Total shots (inc. Blocked) 16 non-null int64
7 Hit Woodwork 16 non-null int64
8 Penalty goals 16 non-null int64
9 Penalties not scored 16 non-null int64
10 Headed goals 16 non-null int64
11 Passes 16 non-null int64
12 Passes completed 16 non-null int64
13 Passing Accuracy 16 non-null object
14 Touches 16 non-null int64
15 Crosses 16 non-null int64
16 Dribbles 16 non-null int64
17 Corners Taken 16 non-null int64
18 Tackles 16 non-null int64
19 Clearances 16 non-null int64
20 Interceptions 16 non-null int64
21 Clearances off line 15 non-null float64
22 Clean Sheets 16 non-null int64
23 Blocks 16 non-null int64
24 Goals conceded 16 non-null int64
25 Saves made 16 non-null int64
26 Saves-to-shots ratio 16 non-null object
27 Fouls Won 16 non-null int64
28 Fouls Conceded 16 non-null int64
29 Offsides 16 non-null int64
30 Yellow Cards 16 non-null int64
31 Red Cards 16 non-null int64
32 Subs on 16 non-null int64
33 Subs off 16 non-null int64
34 Players Used 16 non-null int64
dtypes: float64(1), int64(29), object(5)
memory usage: 4.5+ KB
2.2 How many teams took part in Euro 2012?
euro['Team'].nunique()
16
2.3 Store the columns Team, Yellow Cards and Red Cards in a DataFrame named discripline
discripline = euro[['Team', 'Yellow Cards', 'Red Cards']].copy()  # .copy() avoids the SettingWithCopyWarning when sorting in place below
discripline
|    | Team | Yellow Cards | Red Cards |
|---|---|---|---|
| 0 | Croatia | 9 | 0 |
| 1 | Czech Republic | 7 | 0 |
| 2 | Denmark | 4 | 0 |
| 3 | England | 5 | 0 |
| 4 | France | 6 | 0 |
| 5 | Germany | 4 | 0 |
| 6 | Greece | 9 | 1 |
| 7 | Italy | 16 | 0 |
| 8 | Netherlands | 5 | 0 |
| 9 | Poland | 7 | 1 |
| 10 | Portugal | 12 | 0 |
| 11 | Republic of Ireland | 6 | 1 |
| 12 | Russia | 6 | 0 |
| 13 | Spain | 11 | 0 |
| 14 | Sweden | 7 | 0 |
| 15 | Ukraine | 5 | 0 |
2.4 Sort discripline
discripline.sort_values(['Yellow Cards', 'Red Cards'], ascending=[False, False], inplace=True)
discripline
|    | Team | Yellow Cards | Red Cards |
|---|---|---|---|
| 7 | Italy | 16 | 0 |
| 10 | Portugal | 12 | 0 |
| 13 | Spain | 11 | 0 |
| 6 | Greece | 9 | 1 |
| 0 | Croatia | 9 | 0 |
| 9 | Poland | 7 | 1 |
| 1 | Czech Republic | 7 | 0 |
| 14 | Sweden | 7 | 0 |
| 11 | Republic of Ireland | 6 | 1 |
| 4 | France | 6 | 0 |
| 12 | Russia | 6 | 0 |
| 3 | England | 5 | 0 |
| 8 | Netherlands | 5 | 0 |
| 15 | Ukraine | 5 | 0 |
| 2 | Denmark | 4 | 0 |
| 5 | Germany | 4 | 0 |
2.5 Mean number of yellow cards per team
discripline['Yellow Cards'].mean()
7.4375
2.6 Teams with more than 6 yellow cards
(The original exercise asks for teams that scored more than 6 goals, i.e. euro[euro['Goals'] > 6]; the filter here uses yellow cards, which is what the output below shows.)
discripline[discripline['Yellow Cards']>6]
|    | Team | Yellow Cards | Red Cards |
|---|---|---|---|
| 7 | Italy | 16 | 0 |
| 10 | Portugal | 12 | 0 |
| 13 | Spain | 11 | 0 |
| 6 | Greece | 9 | 1 |
| 0 | Croatia | 9 | 0 |
| 9 | Poland | 7 | 1 |
| 1 | Czech Republic | 7 | 0 |
| 14 | Sweden | 7 | 0 |
2.7 Select the teams whose names start with G
euro[euro['Team'].str.startswith('G')]
|   | Team | Goals | Shots on target | Shots off target | Shooting Accuracy | % Goals-to-shots | Total shots (inc. Blocked) | Hit Woodwork | Penalty goals | Penalties not scored | ... | Saves made | Saves-to-shots ratio | Fouls Won | Fouls Conceded | Offsides | Yellow Cards | Red Cards | Subs on | Subs off | Players Used |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 5 | Germany | 10 | 32 | 32 | 47.8% | 15.6% | 80 | 2 | 1 | 0 | ... | 10 | 62.6% | 63 | 49 | 12 | 4 | 0 | 15 | 15 | 17 |
| 6 | Greece | 5 | 8 | 18 | 30.7% | 19.2% | 32 | 1 | 1 | 1 | ... | 13 | 65.1% | 67 | 48 | 12 | 9 | 1 | 12 | 12 | 20 |

2 rows × 35 columns
2.8 Select the first 7 columns
euro.iloc[:,0:7]
|    | Team | Goals | Shots on target | Shots off target | Shooting Accuracy | % Goals-to-shots | Total shots (inc. Blocked) |
|---|---|---|---|---|---|---|---|
| 0 | Croatia | 4 | 13 | 12 | 51.9% | 16.0% | 32 |
| 1 | Czech Republic | 4 | 13 | 18 | 41.9% | 12.9% | 39 |
| 2 | Denmark | 4 | 10 | 10 | 50.0% | 20.0% | 27 |
| 3 | England | 5 | 11 | 18 | 50.0% | 17.2% | 40 |
| 4 | France | 3 | 22 | 24 | 37.9% | 6.5% | 65 |
| 5 | Germany | 10 | 32 | 32 | 47.8% | 15.6% | 80 |
| 6 | Greece | 5 | 8 | 18 | 30.7% | 19.2% | 32 |
| 7 | Italy | 6 | 34 | 45 | 43.0% | 7.5% | 110 |
| 8 | Netherlands | 2 | 12 | 36 | 25.0% | 4.1% | 60 |
| 9 | Poland | 2 | 15 | 23 | 39.4% | 5.2% | 48 |
| 10 | Portugal | 6 | 22 | 42 | 34.3% | 9.3% | 82 |
| 11 | Republic of Ireland | 1 | 7 | 12 | 36.8% | 5.2% | 28 |
| 12 | Russia | 5 | 9 | 31 | 22.5% | 12.5% | 59 |
| 13 | Spain | 12 | 42 | 33 | 55.9% | 16.0% | 100 |
| 14 | Sweden | 5 | 17 | 19 | 47.2% | 13.8% | 39 |
| 15 | Ukraine | 2 | 7 | 26 | 21.2% | 6.0% | 38 |
2.9 Select all columns except the last 3
euro.iloc[:,:-3]
|    | Team | Goals | Shots on target | Shots off target | Shooting Accuracy | % Goals-to-shots | Total shots (inc. Blocked) | Hit Woodwork | Penalty goals | Penalties not scored | ... | Clean Sheets | Blocks | Goals conceded | Saves made | Saves-to-shots ratio | Fouls Won | Fouls Conceded | Offsides | Yellow Cards | Red Cards |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Croatia | 4 | 13 | 12 | 51.9% | 16.0% | 32 | 0 | 0 | 0 | ... | 0 | 10 | 3 | 13 | 81.3% | 41 | 62 | 2 | 9 | 0 |
| 1 | Czech Republic | 4 | 13 | 18 | 41.9% | 12.9% | 39 | 0 | 0 | 0 | ... | 1 | 10 | 6 | 9 | 60.1% | 53 | 73 | 8 | 7 | 0 |
| 2 | Denmark | 4 | 10 | 10 | 50.0% | 20.0% | 27 | 1 | 0 | 0 | ... | 1 | 10 | 5 | 10 | 66.7% | 25 | 38 | 8 | 4 | 0 |
| 3 | England | 5 | 11 | 18 | 50.0% | 17.2% | 40 | 0 | 0 | 0 | ... | 2 | 29 | 3 | 22 | 88.1% | 43 | 45 | 6 | 5 | 0 |
| 4 | France | 3 | 22 | 24 | 37.9% | 6.5% | 65 | 1 | 0 | 0 | ... | 1 | 7 | 5 | 6 | 54.6% | 36 | 51 | 5 | 6 | 0 |
| 5 | Germany | 10 | 32 | 32 | 47.8% | 15.6% | 80 | 2 | 1 | 0 | ... | 1 | 11 | 6 | 10 | 62.6% | 63 | 49 | 12 | 4 | 0 |
| 6 | Greece | 5 | 8 | 18 | 30.7% | 19.2% | 32 | 1 | 1 | 1 | ... | 1 | 23 | 7 | 13 | 65.1% | 67 | 48 | 12 | 9 | 1 |
| 7 | Italy | 6 | 34 | 45 | 43.0% | 7.5% | 110 | 2 | 0 | 0 | ... | 2 | 18 | 7 | 20 | 74.1% | 101 | 89 | 16 | 16 | 0 |
| 8 | Netherlands | 2 | 12 | 36 | 25.0% | 4.1% | 60 | 2 | 0 | 0 | ... | 0 | 9 | 5 | 12 | 70.6% | 35 | 30 | 3 | 5 | 0 |
| 9 | Poland | 2 | 15 | 23 | 39.4% | 5.2% | 48 | 0 | 0 | 0 | ... | 0 | 8 | 3 | 6 | 66.7% | 48 | 56 | 3 | 7 | 1 |
| 10 | Portugal | 6 | 22 | 42 | 34.3% | 9.3% | 82 | 6 | 0 | 0 | ... | 2 | 11 | 4 | 10 | 71.5% | 73 | 90 | 10 | 12 | 0 |
| 11 | Republic of Ireland | 1 | 7 | 12 | 36.8% | 5.2% | 28 | 0 | 0 | 0 | ... | 0 | 23 | 9 | 17 | 65.4% | 43 | 51 | 11 | 6 | 1 |
| 12 | Russia | 5 | 9 | 31 | 22.5% | 12.5% | 59 | 2 | 0 | 0 | ... | 0 | 8 | 3 | 10 | 77.0% | 34 | 43 | 4 | 6 | 0 |
| 13 | Spain | 12 | 42 | 33 | 55.9% | 16.0% | 100 | 0 | 1 | 0 | ... | 5 | 8 | 1 | 15 | 93.8% | 102 | 83 | 19 | 11 | 0 |
| 14 | Sweden | 5 | 17 | 19 | 47.2% | 13.8% | 39 | 3 | 0 | 0 | ... | 1 | 12 | 5 | 8 | 61.6% | 35 | 51 | 7 | 7 | 0 |
| 15 | Ukraine | 2 | 7 | 26 | 21.2% | 6.0% | 38 | 0 | 0 | 0 | ... | 0 | 4 | 4 | 13 | 76.5% | 48 | 31 | 4 | 5 | 0 |

16 rows × 32 columns
2.10 Find the Shooting Accuracy of England, Italy and Russia
euro[euro['Team'].isin(['England', 'Italy', 'Russia'])][['Team','Shooting Accuracy']]
|    | Team | Shooting Accuracy |
|---|---|---|
| 3 | England | 50.0% |
| 7 | Italy | 43.0% |
| 12 | Russia | 22.5% |
euro.loc[euro['Team'].isin(['England', 'Italy', 'Russia']), ['Team','Shooting Accuracy']]
|    | Team | Shooting Accuracy |
|---|---|---|
| 3 | England | 50.0% |
| 7 | Italy | 43.0% |
| 12 | Russia | 22.5% |
3. Grouping: exploring alcohol consumption data
import pandas as pd
drinks = pd.read_csv('drinks.csv')
drinks.head()
|   | country | beer_servings | spirit_servings | wine_servings | total_litres_of_pure_alcohol | continent |
|---|---|---|---|---|---|---|
| 0 | Afghanistan | 0 | 0 | 0 | 0.0 | AS |
| 1 | Albania | 89 | 132 | 54 | 4.9 | EU |
| 2 | Algeria | 25 | 0 | 14 | 0.7 | AF |
| 3 | Andorra | 245 | 138 | 312 | 12.4 | EU |
| 4 | Angola | 217 | 57 | 45 | 5.9 | AF |
drinks.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 193 entries, 0 to 192
Data columns (total 6 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 country 193 non-null object
1 beer_servings 193 non-null int64
2 spirit_servings 193 non-null int64
3 wine_servings 193 non-null int64
4 total_litres_of_pure_alcohol 193 non-null float64
5 continent 170 non-null object
dtypes: float64(1), int64(3), object(2)
memory usage: 9.2+ KB
3.1 Which continent consumes the most beer on average?
drinks.groupby('continent')['beer_servings'].mean().sort_values(ascending=False).head(1)
continent
EU    193.777778
Name: beer_servings, dtype: float64
(Note: with a plain ascending sort_values(), head(1) would return the continent that drinks the *least* beer, AS at 37.045455.)
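If only the continent's name is needed, `idxmax()` reads more directly than sorting (a sketch):

```python
drinks.groupby('continent')['beer_servings'].mean().idxmax()   # 'EU'
```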
3.2 Descriptive statistics of wine consumption for each continent
drinks.groupby('continent')['wine_servings'].describe()
| continent | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| AF | 53.0 | 16.264151 | 38.846419 | 0.0 | 1.0 | 2.0 | 13.00 | 233.0 |
| AS | 44.0 | 9.068182 | 21.667034 | 0.0 | 0.0 | 1.0 | 8.00 | 123.0 |
| EU | 45.0 | 142.222222 | 97.421738 | 0.0 | 59.0 | 128.0 | 195.00 | 370.0 |
| OC | 16.0 | 35.625000 | 64.555790 | 0.0 | 1.0 | 8.5 | 23.25 | 212.0 |
| SA | 12.0 | 62.416667 | 88.620189 | 1.0 | 3.0 | 12.0 | 98.50 | 221.0 |
3.3 Mean consumption of each type of drink per continent
drinks.groupby('continent').mean()
| continent | beer_servings | spirit_servings | wine_servings | total_litres_of_pure_alcohol |
|---|---|---|---|---|
| AF | 61.471698 | 16.339623 | 16.264151 | 3.007547 |
| AS | 37.045455 | 60.840909 | 9.068182 | 2.170455 |
| EU | 193.777778 | 132.555556 | 142.222222 | 8.617778 |
| OC | 89.687500 | 58.437500 | 35.625000 | 3.381250 |
| SA | 175.083333 | 114.750000 | 62.416667 | 6.308333 |
3.4 Median consumption of each type of drink per continent
drinks.groupby('continent').median()
| continent | beer_servings | spirit_servings | wine_servings | total_litres_of_pure_alcohol |
|---|---|---|---|---|
| AF | 32.0 | 3.0 | 2.0 | 2.30 |
| AS | 17.5 | 16.0 | 1.0 | 1.20 |
| EU | 219.0 | 122.0 | 128.0 | 10.00 |
| OC | 52.5 | 37.0 | 8.5 | 1.75 |
| SA | 162.5 | 108.5 | 12.0 | 6.85 |
3.5 Mean, max and min spirit consumption per continent
drinks.groupby('continent')['spirit_servings'].agg(['mean','max','min'])
| continent | mean | max | min |
|---|---|---|---|
| AF | 16.339623 | 152 | 0 |
| AS | 60.840909 | 326 | 0 |
| EU | 132.555556 | 373 | 0 |
| OC | 58.437500 | 254 | 0 |
| SA | 114.750000 | 302 | 25 |
4. The apply function (exploring US crime data, 1960-2014)
4.1 Import the data and inspect it
import pandas as pd
crime = pd.read_csv('US_Crime_Rates_1960_2014.csv')
crime.head()
|   | Year | Population | Total | Violent | Property | Murder | Forcible_Rape | Robbery | Aggravated_assault | Burglary | Larceny_Theft | Vehicle_Theft |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1960 | 179323175 | 3384200 | 288460 | 3095700 | 9110 | 17190 | 107840 | 154320 | 912100 | 1855400 | 328200 |
| 1 | 1961 | 182992000 | 3488000 | 289390 | 3198600 | 8740 | 17220 | 106670 | 156760 | 949600 | 1913000 | 336000 |
| 2 | 1962 | 185771000 | 3752200 | 301510 | 3450700 | 8530 | 17550 | 110860 | 164570 | 994300 | 2089600 | 366800 |
| 3 | 1963 | 188483000 | 4109500 | 316970 | 3792500 | 8640 | 17650 | 116470 | 174210 | 1086400 | 2297800 | 408300 |
| 4 | 1964 | 191141000 | 4564600 | 364220 | 4200400 | 9360 | 21420 | 130390 | 203050 | 1213200 | 2514400 | 472800 |
crime.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 55 entries, 0 to 54
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Year 55 non-null int64
1 Population 55 non-null int64
2 Total 55 non-null int64
3 Violent 55 non-null int64
4 Property 55 non-null int64
5 Murder 55 non-null int64
6 Forcible_Rape 55 non-null int64
7 Robbery 55 non-null int64
8 Aggravated_assault 55 non-null int64
9 Burglary 55 non-null int64
10 Larceny_Theft 55 non-null int64
11 Vehicle_Theft 55 non-null int64
dtypes: int64(12)
memory usage: 5.3 KB
4.2 Convert Year to datetime
crime['Year'] = pd.to_datetime(crime['Year'], format='%Y')
crime.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 55 entries, 0 to 54
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Year 55 non-null datetime64[ns]
1 Population 55 non-null int64
2 Total 55 non-null int64
3 Violent 55 non-null int64
4 Property 55 non-null int64
5 Murder 55 non-null int64
6 Forcible_Rape 55 non-null int64
7 Robbery 55 non-null int64
8 Aggravated_assault 55 non-null int64
9 Burglary 55 non-null int64
10 Larceny_Theft 55 non-null int64
11 Vehicle_Theft 55 non-null int64
dtypes: datetime64[ns](1), int64(11)
memory usage: 5.3 KB
4.3 Set Year as the index
crime.set_index('Year', inplace=True)
crime.head()
| Year | Population | Total | Violent | Property | Murder | Forcible_Rape | Robbery | Aggravated_assault | Burglary | Larceny_Theft | Vehicle_Theft |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1960-01-01 | 179323175 | 3384200 | 288460 | 3095700 | 9110 | 17190 | 107840 | 154320 | 912100 | 1855400 | 328200 |
| 1961-01-01 | 182992000 | 3488000 | 289390 | 3198600 | 8740 | 17220 | 106670 | 156760 | 949600 | 1913000 | 336000 |
| 1962-01-01 | 185771000 | 3752200 | 301510 | 3450700 | 8530 | 17550 | 110860 | 164570 | 994300 | 2089600 | 366800 |
| 1963-01-01 | 188483000 | 4109500 | 316970 | 3792500 | 8640 | 17650 | 116470 | 174210 | 1086400 | 2297800 | 408300 |
| 1964-01-01 | 191141000 | 4564600 | 364220 | 4200400 | 9360 | 21420 | 130390 | 203050 | 1213200 | 2514400 | 472800 |
4.4 Delete the Total column
del crime['Total']
4.5 Resample the data by decade (resample matters)
- pandas' resample() re-bins a regular time series; it is a convenient method for resampling and frequency conversion.
- Its argument is the sampling rule; common frequency aliases: A year, M month, W week, D day, H hour, T minute, S second (see the sketch below).
- Here the data is binned into ten-year groups.
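A toy sketch of how the rule string controls the bins (self-contained, made-up data):

```python
import pandas as pd

s = pd.Series(range(6), index=pd.date_range('2024-01-01', periods=6, freq='D'))
s.resample('2D').sum()   # three 2-day bins: 1, 5, 9
s.resample('MS').sum()   # one month-start bin: 15
```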
crime_10Ys = crime.resample('10AS').sum()
crime_10Ys
| Year | Population | Violent | Property | Murder | Forcible_Rape | Robbery | Aggravated_assault | Burglary | Larceny_Theft | Vehicle_Theft |
|---|---|---|---|---|---|---|---|---|---|---|
| 1960-01-01 | 1915053175 | 4134930 | 45160900 | 106180 | 236720 | 1633510 | 2158520 | 13321100 | 26547700 | 5292100 |
| 1970-01-01 | 2121193298 | 9607930 | 91383800 | 192230 | 554570 | 4159020 | 4702120 | 28486000 | 53157800 | 9739900 |
| 1980-01-01 | 2371370069 | 14074328 | 117048900 | 206439 | 865639 | 5383109 | 7619130 | 33073494 | 72040253 | 11935411 |
| 1990-01-01 | 2612825258 | 17527048 | 119053499 | 211664 | 998827 | 5748930 | 10568963 | 26750015 | 77679366 | 14624418 |
| 2000-01-01 | 2947969117 | 13968056 | 100944369 | 163068 | 922499 | 4230366 | 8652124 | 21565176 | 67970291 | 11412834 |
| 2010-01-01 | 1570146307 | 6072017 | 44095950 | 72867 | 421059 | 1749809 | 3764142 | 10125170 | 30401698 | 3569080 |
- Summing Population within each decade is meaningless, so the resampled total should not be used.
- Replace that sum with the decade's maximum population (a one-pass alternative is sketched after the table below).
crime_10Ys['Population'] = crime['Population'].resample('10AS').max()
crime_10Ys.head()
| Year | Population | Violent | Property | Murder | Forcible_Rape | Robbery | Aggravated_assault | Burglary | Larceny_Theft | Vehicle_Theft |
|---|---|---|---|---|---|---|---|---|---|---|
| 1960-01-01 | 201385000 | 4134930 | 45160900 | 106180 | 236720 | 1633510 | 2158520 | 13321100 | 26547700 | 5292100 |
| 1970-01-01 | 220099000 | 9607930 | 91383800 | 192230 | 554570 | 4159020 | 4702120 | 28486000 | 53157800 | 9739900 |
| 1980-01-01 | 248239000 | 14074328 | 117048900 | 206439 | 865639 | 5383109 | 7619130 | 33073494 | 72040253 | 11935411 |
| 1990-01-01 | 272690813 | 17527048 | 119053499 | 211664 | 998827 | 5748930 | 10568963 | 26750015 | 77679366 | 14624418 |
| 2000-01-01 | 307006550 | 13968056 | 100944369 | 163068 | 922499 | 4230366 | 8652124 | 21565176 | 67970291 | 11412834 |
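The two steps above (sum everything, then overwrite Population) can also be expressed in a single pass with a per-column aggregation dict; a sketch, assuming the same `crime` frame:

```python
agg_map = {col: 'sum' for col in crime.columns}
agg_map['Population'] = 'max'   # a decade's population is its peak, not a sum of yearly counts
crime_10Ys = crime.resample('10AS').agg(agg_map)
```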
4.6 When was the most dangerous era to live in the US?
crime
| Year | Population | Violent | Property | Murder | Forcible_Rape | Robbery | Aggravated_assault | Burglary | Larceny_Theft | Vehicle_Theft |
|---|---|---|---|---|---|---|---|---|---|---|
| 1960-01-01 | 179323175 | 288460 | 3095700 | 9110 | 17190 | 107840 | 154320 | 912100 | 1855400 | 328200 |
| 1961-01-01 | 182992000 | 289390 | 3198600 | 8740 | 17220 | 106670 | 156760 | 949600 | 1913000 | 336000 |
| 1962-01-01 | 185771000 | 301510 | 3450700 | 8530 | 17550 | 110860 | 164570 | 994300 | 2089600 | 366800 |
| 1963-01-01 | 188483000 | 316970 | 3792500 | 8640 | 17650 | 116470 | 174210 | 1086400 | 2297800 | 408300 |
| 1964-01-01 | 191141000 | 364220 | 4200400 | 9360 | 21420 | 130390 | 203050 | 1213200 | 2514400 | 472800 |
| 1965-01-01 | 193526000 | 387390 | 4352000 | 9960 | 23410 | 138690 | 215330 | 1282500 | 2572600 | 496900 |
| 1966-01-01 | 195576000 | 430180 | 4793300 | 11040 | 25820 | 157990 | 235330 | 1410100 | 2822000 | 561200 |
| 1967-01-01 | 197457000 | 499930 | 5403500 | 12240 | 27620 | 202910 | 257160 | 1632100 | 3111600 | 659800 |
| 1968-01-01 | 199399000 | 595010 | 6125200 | 13800 | 31670 | 262840 | 286700 | 1858900 | 3482700 | 783600 |
| 1969-01-01 | 201385000 | 661870 | 6749000 | 14760 | 37170 | 298850 | 311090 | 1981900 | 3888600 | 878500 |
| 1970-01-01 | 203235298 | 738820 | 7359200 | 16000 | 37990 | 349860 | 334970 | 2205000 | 4225800 | 928400 |
| 1971-01-01 | 206212000 | 816500 | 7771700 | 17780 | 42260 | 387700 | 368760 | 2399300 | 4424200 | 948200 |
| 1972-01-01 | 208230000 | 834900 | 7413900 | 18670 | 46850 | 376290 | 393090 | 2375500 | 4151200 | 887200 |
| 1973-01-01 | 209851000 | 875910 | 7842200 | 19640 | 51400 | 384220 | 420650 | 2565500 | 4347900 | 928800 |
| 1974-01-01 | 211392000 | 974720 | 9278700 | 20710 | 55400 | 442400 | 456210 | 3039200 | 5262500 | 977100 |
| 1975-01-01 | 213124000 | 1039710 | 10252700 | 20510 | 56090 | 470500 | 492620 | 3265300 | 5977700 | 1009600 |
| 1976-01-01 | 214659000 | 1004210 | 10345500 | 18780 | 57080 | 427810 | 500530 | 3108700 | 6270800 | 966000 |
| 1977-01-01 | 216332000 | 1029580 | 9955000 | 19120 | 63500 | 412610 | 534350 | 3071500 | 5905700 | 977700 |
| 1978-01-01 | 218059000 | 1085550 | 10123400 | 19560 | 67610 | 426930 | 571460 | 3128300 | 5991000 | 1004100 |
| 1979-01-01 | 220099000 | 1208030 | 11041500 | 21460 | 76390 | 480700 | 629480 | 3327700 | 6601000 | 1112800 |
| 1980-01-01 | 225349264 | 1344520 | 12063700 | 23040 | 82990 | 565840 | 672650 | 3795200 | 7136900 | 1131700 |
| 1981-01-01 | 229146000 | 1361820 | 12061900 | 22520 | 82500 | 592910 | 663900 | 3779700 | 7194400 | 1087800 |
| 1982-01-01 | 231534000 | 1322390 | 11652000 | 21010 | 78770 | 553130 | 669480 | 3447100 | 7142500 | 1062400 |
| 1983-01-01 | 233981000 | 1258090 | 10850500 | 19310 | 78920 | 506570 | 653290 | 3129900 | 6712800 | 1007900 |
| 1984-01-01 | 236158000 | 1273280 | 10608500 | 18690 | 84230 | 485010 | 685350 | 2984400 | 6591900 | 1032200 |
| 1985-01-01 | 238740000 | 1328800 | 11102600 | 18980 | 88670 | 497870 | 723250 | 3073300 | 6926400 | 1102900 |
| 1986-01-01 | 240132887 | 1489169 | 11722700 | 20613 | 91459 | 542775 | 834322 | 3241410 | 7257153 | 1224137 |
| 1987-01-01 | 242282918 | 1483999 | 12024700 | 20096 | 91110 | 517704 | 855088 | 3236184 | 7499900 | 1288674 |
| 1988-01-01 | 245807000 | 1566220 | 12356900 | 20680 | 92490 | 542970 | 910090 | 3218100 | 7705900 | 1432900 |
| 1989-01-01 | 248239000 | 1646040 | 12605400 | 21500 | 94500 | 578330 | 951710 | 3168200 | 7872400 | 1564800 |
| 1990-01-01 | 248709873 | 1820130 | 12655500 | 23440 | 102560 | 639270 | 1054860 | 3073900 | 7945700 | 1635900 |
| 1991-01-01 | 252177000 | 1911770 | 12961100 | 24700 | 106590 | 687730 | 1092740 | 3157200 | 8142200 | 1661700 |
| 1992-01-01 | 255082000 | 1932270 | 12505900 | 23760 | 109060 | 672480 | 1126970 | 2979900 | 7915200 | 1610800 |
| 1993-01-01 | 257908000 | 1926020 | 12218800 | 24530 | 106010 | 659870 | 1135610 | 2834800 | 7820900 | 1563100 |
| 1994-01-01 | 260341000 | 1857670 | 12131900 | 23330 | 102220 | 618950 | 1113180 | 2712800 | 7879800 | 1539300 |
| 1995-01-01 | 262755000 | 1798790 | 12063900 | 21610 | 97470 | 580510 | 1099210 | 2593800 | 7997700 | 1472400 |
| 1996-01-01 | 265228572 | 1688540 | 11805300 | 19650 | 96250 | 535590 | 1037050 | 2506400 | 7904700 | 1394200 |
| 1997-01-01 | 267637000 | 1634770 | 11558175 | 18208 | 96153 | 498534 | 1023201 | 2460526 | 7743760 | 1354189 |
| 1998-01-01 | 270296000 | 1531044 | 10944590 | 16914 | 93103 | 446625 | 974402 | 2329950 | 7373886 | 1240754 |
| 1999-01-01 | 272690813 | 1426044 | 10208334 | 15522 | 89411 | 409371 | 911740 | 2100739 | 6955520 | 1152075 |
| 2000-01-01 | 281421906 | 1425486 | 10182586 | 15586 | 90178 | 408016 | 911706 | 2050992 | 6971590 | 1160002 |
| 2001-01-01 | 285317559 | 1439480 | 10437480 | 16037 | 90863 | 423557 | 909023 | 2116531 | 7092267 | 1228391 |
| 2002-01-01 | 287973924 | 1423677 | 10455277 | 16229 | 95235 | 420806 | 891407 | 2151252 | 7057370 | 1246646 |
| 2003-01-01 | 290690788 | 1383676 | 10442862 | 16528 | 93883 | 414235 | 859030 | 2154834 | 7026802 | 1261226 |
| 2004-01-01 | 293656842 | 1360088 | 10319386 | 16148 | 95089 | 401470 | 847381 | 2144446 | 6937089 | 1237851 |
| 2005-01-01 | 296507061 | 1390745 | 10174754 | 16740 | 94347 | 417438 | 862220 | 2155448 | 6783447 | 1235859 |
| 2006-01-01 | 299398484 | 1418043 | 9983568 | 17030 | 92757 | 447403 | 860853 | 2183746 | 6607013 | 1192809 |
| 2007-01-01 | 301621157 | 1408337 | 9843481 | 16929 | 90427 | 445125 | 855856 | 2176140 | 6568572 | 1095769 |
| 2008-01-01 | 304374846 | 1392628 | 9767915 | 16442 | 90479 | 443574 | 842134 | 2228474 | 6588046 | 958629 |
| 2009-01-01 | 307006550 | 1325896 | 9337060 | 15399 | 89241 | 408742 | 812514 | 2203313 | 6338095 | 795652 |
| 2010-01-01 | 309330219 | 1251248 | 9112625 | 14772 | 85593 | 369089 | 781844 | 2168457 | 6204601 | 739565 |
| 2011-01-01 | 311587816 | 1206031 | 9052743 | 14661 | 84175 | 354772 | 752423 | 2185140 | 6151095 | 716508 |
| 2012-01-01 | 313873685 | 1217067 | 9001992 | 14866 | 85141 | 355051 | 762009 | 2109932 | 6168874 | 723186 |
| 2013-01-01 | 316497531 | 1199684 | 8650761 | 14319 | 82109 | 345095 | 726575 | 1931835 | 6018632 | 700294 |
| 2014-01-01 | 318857056 | 1197987 | 8277829 | 14249 | 84041 | 325802 | 741291 | 1729806 | 5858496 | 689527 |
crime.idxmax()
Population 2014-01-01
Violent 1992-01-01
Property 1991-01-01
Murder 1991-01-01
Forcible_Rape 1992-01-01
Robbery 1991-01-01
Aggravated_assault 1993-01-01
Burglary 1980-01-01
Larceny_Theft 1991-01-01
Vehicle_Theft 1991-01-01
dtype: datetime64[ns]
- The maxima of the various crime counts cluster in 1991-1993, while the population grows steadily across the whole dataset (from about 180 million in 1960 to 320 million in 2014), so crime rates in that period were higher than in other periods (though this alone does not prove the entire 1990s were the most dangerous decade).
5. Merging (exploring fictitious name data)
5.1 Build the dummy data
import numpy as np
import pandas as pd
raw_data_1 = {
'subject_id': ['1', '2', '3', '4', '5'],
'first_name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'],
'last_name': ['Anderson', 'Ackerman', 'Ali', 'Aoni', 'Atiches']}
raw_data_2 = {
'subject_id': ['4', '5', '6', '7', '8'],
'first_name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'],
'last_name': ['Bonder', 'Black', 'Balwner', 'Brice', 'Btisan']}
raw_data_3 = {
'subject_id': ['1', '2', '3', '4', '5', '7', '8', '9', '10', '11'],
'test_id': [51, 15, 15, 61, 16, 14, 15, 1, 61, 16]}
data_1 = pd.DataFrame(raw_data_1)
data_2 = pd.DataFrame(raw_data_2)
data_3 = pd.DataFrame(raw_data_3)
print(data_1)
print(data_2)
print(data_3)
subject_id first_name last_name
0 1 Alex Anderson
1 2 Amy Ackerman
2 3 Allen Ali
3 4 Alice Aoni
4 5 Ayoung Atiches
subject_id first_name last_name
0 4 Billy Bonder
1 5 Brian Black
2 6 Bran Balwner
3 7 Bryce Brice
4 8 Betty Btisan
subject_id test_id
0 1 51
1 2 15
2 3 15
3 4 61
4 5 16
5 7 14
6 8 15
7 9 1
8 10 61
9 11 16
5.2 Vertical concatenation (stacking along axis=0)
all_data = pd.concat([data_1,data_2], ignore_index=True)
all_data
|   | subject_id | first_name | last_name |
|---|---|---|---|
| 0 | 1 | Alex | Anderson |
| 1 | 2 | Amy | Ackerman |
| 2 | 3 | Allen | Ali |
| 3 | 4 | Alice | Aoni |
| 4 | 5 | Ayoung | Atiches |
| 5 | 4 | Billy | Bonder |
| 6 | 5 | Brian | Black |
| 7 | 6 | Bran | Balwner |
| 8 | 7 | Bryce | Brice |
| 9 | 8 | Betty | Btisan |
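If you need to remember which source each row came from, `keys=` labels the pieces with an outer index level instead of renumbering (a sketch using the frames above):

```python
labeled = pd.concat([data_1, data_2], keys=['d1', 'd2'])
labeled.loc['d2']   # just the rows that came from data_2
```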
5.3 Horizontal concatenation, axis=1
by_col = pd.concat([data_1,data_2], axis=1)
by_col
|   | subject_id | first_name | last_name | subject_id | first_name | last_name |
|---|---|---|---|---|---|---|
| 0 | 1 | Alex | Anderson | 4 | Billy | Bonder |
| 1 | 2 | Amy | Ackerman | 5 | Brian | Black |
| 2 | 3 | Allen | Ali | 6 | Bran | Balwner |
| 3 | 4 | Alice | Aoni | 7 | Bryce | Brice |
| 4 | 5 | Ayoung | Atiches | 8 | Betty | Btisan |
5.4 Merge all_data and data_3 on subject_id
all_data = pd.merge(all_data,data_3,on='subject_id')
all_data
|   | subject_id | first_name | last_name | test_id |
|---|---|---|---|---|
| 0 | 1 | Alex | Anderson | 51 |
| 1 | 2 | Amy | Ackerman | 15 |
| 2 | 3 | Allen | Ali | 15 |
| 3 | 4 | Alice | Aoni | 61 |
| 4 | 4 | Billy | Bonder | 61 |
| 5 | 5 | Ayoung | Atiches | 16 |
| 6 | 5 | Brian | Black | 16 |
| 7 | 7 | Bryce | Brice | 14 |
| 8 | 8 | Betty | Btisan | 15 |
5.5 Inner join of data_1 and data_2 on subject_id
inner_join = pd.merge(data_1,data_2, on='subject_id', how='inner')
inner_join
|   | subject_id | first_name_x | last_name_x | first_name_y | last_name_y |
|---|---|---|---|---|---|
| 0 | 4 | Alice | Aoni | Billy | Bonder |
| 1 | 5 | Ayoung | Atiches | Brian | Black |
5.6 Outer join of data_1 and data_2 on subject_id
outer_join = pd.merge(data_1,data_2, on='subject_id', how='outer')
outer_join
|   | subject_id | first_name_x | last_name_x | first_name_y | last_name_y |
|---|---|---|---|---|---|
| 0 | 1 | Alex | Anderson | NaN | NaN |
| 1 | 2 | Amy | Ackerman | NaN | NaN |
| 2 | 3 | Allen | Ali | NaN | NaN |
| 3 | 4 | Alice | Aoni | Billy | Bonder |
| 4 | 5 | Ayoung | Atiches | Brian | Black |
| 5 | 6 | NaN | NaN | Bran | Balwner |
| 6 | 7 | NaN | NaN | Bryce | Brice |
| 7 | 8 | NaN | NaN | Betty | Btisan |
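`indicator=True` is a useful companion to an outer join: it adds a `_merge` column recording whether each row matched on the left, right, or both sides (a sketch):

```python
pd.merge(data_1, data_2, on='subject_id', how='outer', indicator=True)
# _merge is 'left_only' for subject_id 1-3, 'both' for 4-5, 'right_only' for 6-8
```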
6. Statistics: exploring wind-speed data
6.1 Import the data
import pandas as pd
import datetime
wind = pd.read_table('wind.data', sep=r'\s+', parse_dates=[[0, 1, 2]])
wind.head()
|   | Yr_Mo_Dy | RPT | VAL | ROS | KIL | SHA | BIR | DUB | CLA | MUL | CLO | BEL | MAL |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2061-01-01 | 15.04 | 14.96 | 13.17 | 9.29 | NaN | 9.87 | 13.67 | 10.25 | 10.83 | 12.58 | 18.50 | 15.04 |
| 1 | 2061-01-02 | 14.71 | NaN | 10.83 | 6.50 | 12.62 | 7.67 | 11.50 | 10.04 | 9.79 | 9.67 | 17.54 | 13.83 |
| 2 | 2061-01-03 | 18.50 | 16.88 | 12.33 | 10.13 | 11.17 | 6.17 | 11.25 | NaN | 8.50 | 7.67 | 12.75 | 12.71 |
| 3 | 2061-01-04 | 10.58 | 6.63 | 11.75 | 4.58 | 4.54 | 2.88 | 8.63 | 1.79 | 5.83 | 5.88 | 5.46 | 10.88 |
| 4 | 2061-01-05 | 13.33 | 13.25 | 11.42 | 6.17 | 10.71 | 8.21 | 11.92 | 6.54 | 10.92 | 10.34 | 12.92 | 11.83 |
wind.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6574 entries, 0 to 6573
Data columns (total 13 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Yr_Mo_Dy 6574 non-null datetime64[ns]
1 RPT 6568 non-null float64
2 VAL 6571 non-null float64
3 ROS 6572 non-null float64
4 KIL 6569 non-null float64
5 SHA 6572 non-null float64
6 BIR 6574 non-null float64
7 DUB 6571 non-null float64
8 CLA 6572 non-null float64
9 MUL 6571 non-null float64
10 CLO 6573 non-null float64
11 BEL 6574 non-null float64
12 MAL 6570 non-null float64
dtypes: datetime64[ns](1), float64(12)
memory usage: 667.8 KB
6.2 The year 2061 is bad data; fix it
- The reference answer omits the pd.to_datetime() wrapper; without it, wind['Yr_Mo_Dy'] would no longer have datetime dtype after the function is applied.
def fix_year(x):
    year = x.year - 100 if x.year > 1978 else x.year
    return pd.to_datetime(datetime.date(year, x.month, x.day), format='%Y/%m/%d')
wind['Yr_Mo_Dy'] = wind['Yr_Mo_Dy'].apply(fix_year)
wind['Yr_Mo_Dy'].dt.year.unique()
array([1961, 1962, 1963, 1964, 1965, 1966, 1967, 1968, 1969, 1970, 1971,
1972, 1973, 1974, 1975, 1976, 1977, 1978], dtype=int64)
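The same repair can also be done without a row-wise apply; a vectorized sketch (to be run on the freshly parsed column, instead of the apply above):

```python
# subtract a century from every date that was mis-parsed into the future
mask = wind['Yr_Mo_Dy'].dt.year > 1978
wind.loc[mask, 'Yr_Mo_Dy'] -= pd.DateOffset(years=100)
```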
6.3 Set the date as the index; note the dtype should be datetime64
wind.set_index('Yr_Mo_Dy', inplace=True)
wind.head()
| Yr_Mo_Dy | RPT | VAL | ROS | KIL | SHA | BIR | DUB | CLA | MUL | CLO | BEL | MAL |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1961-01-01 | 15.04 | 14.96 | 13.17 | 9.29 | NaN | 9.87 | 13.67 | 10.25 | 10.83 | 12.58 | 18.50 | 15.04 |
| 1961-01-02 | 14.71 | NaN | 10.83 | 6.50 | 12.62 | 7.67 | 11.50 | 10.04 | 9.79 | 9.67 | 17.54 | 13.83 |
| 1961-01-03 | 18.50 | 16.88 | 12.33 | 10.13 | 11.17 | 6.17 | 11.25 | NaN | 8.50 | 7.67 | 12.75 | 12.71 |
| 1961-01-04 | 10.58 | 6.63 | 11.75 | 4.58 | 4.54 | 2.88 | 8.63 | 1.79 | 5.83 | 5.88 | 5.46 | 10.88 |
| 1961-01-05 | 13.33 | 13.25 | 11.42 | 6.17 | 10.71 | 8.21 | 11.92 | 6.54 | 10.92 | 10.34 | 12.92 | 11.83 |
6.4 How many values are missing at each location?
wind.isna().sum()
RPT 6
VAL 3
ROS 2
KIL 5
SHA 2
BIR 0
DUB 3
CLA 2
MUL 3
CLO 1
BEL 0
MAL 4
dtype: int64
6.5 How many complete values are there at each location?
- Both notna() and info() report the non-null counts.
wind.notna().sum()
RPT 6568
VAL 6571
ROS 6572
KIL 6569
SHA 6572
BIR 6574
DUB 6571
CLA 6572
MUL 6571
CLO 6573
BEL 6574
MAL 6570
dtype: int64
wind.shape[0] - wind.isnull().sum()
RPT 6568
VAL 6571
ROS 6572
KIL 6569
SHA 6572
BIR 6574
DUB 6571
CLA 6572
MUL 6571
CLO 6573
BEL 6574
MAL 6570
dtype: int64
6.6 Mean wind speed over the whole dataset
wind.mean().mean()
10.227982360836938
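Note that `wind.mean().mean()` is the mean of the 12 station means; since the stations have slightly different missing-value counts, it is not exactly the mean over every single reading. For the latter, a sketch:

```python
wind.stack().mean()   # pools all non-missing readings into one long Series first
```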
6.7 Create a DataFrame holding the min, max, mean and standard deviation of wind speed at each location
wind_des = wind.describe().T.iloc[:,[3,7,1,2]]
wind_des
|     | min | max | mean | std |
|---|---|---|---|---|
| RPT | 0.67 | 35.80 | 12.362987 | 5.618413 |
| VAL | 0.21 | 33.37 | 10.644314 | 5.267356 |
| ROS | 1.50 | 33.84 | 11.660526 | 5.008450 |
| KIL | 0.00 | 28.46 | 6.306468 | 3.605811 |
| SHA | 0.13 | 37.54 | 10.455834 | 4.936125 |
| BIR | 0.00 | 26.16 | 7.092254 | 3.968683 |
| DUB | 0.00 | 30.37 | 9.797343 | 4.977555 |
| CLA | 0.00 | 31.08 | 8.495053 | 4.499449 |
| MUL | 0.00 | 25.88 | 8.493590 | 4.166872 |
| CLO | 0.04 | 28.21 | 8.707332 | 4.503954 |
| BEL | 0.13 | 42.38 | 13.121007 | 5.835037 |
| MAL | 0.67 | 42.54 | 15.599079 | 6.699794 |
# per-day variant: statistics across the 12 stations for each date (axis=1)
day_stats = pd.DataFrame()
day_stats['min'] = wind.min(axis=1)    # slowest reading that day
day_stats['max'] = wind.max(axis=1)    # fastest reading that day
day_stats['mean'] = wind.mean(axis=1)  # mean across stations
day_stats['std'] = wind.std(axis=1)    # spread across stations
day_stats.head()
6.8 Mean January wind speed at each location
wind['date'] = wind.index
wind.head()
| Yr_Mo_Dy | RPT | VAL | ROS | KIL | SHA | BIR | DUB | CLA | MUL | CLO | BEL | MAL | date |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1961-01-01 | 15.04 | 14.96 | 13.17 | 9.29 | NaN | 9.87 | 13.67 | 10.25 | 10.83 | 12.58 | 18.50 | 15.04 | 1961-01-01 |
| 1961-01-02 | 14.71 | NaN | 10.83 | 6.50 | 12.62 | 7.67 | 11.50 | 10.04 | 9.79 | 9.67 | 17.54 | 13.83 | 1961-01-02 |
| 1961-01-03 | 18.50 | 16.88 | 12.33 | 10.13 | 11.17 | 6.17 | 11.25 | NaN | 8.50 | 7.67 | 12.75 | 12.71 | 1961-01-03 |
| 1961-01-04 | 10.58 | 6.63 | 11.75 | 4.58 | 4.54 | 2.88 | 8.63 | 1.79 | 5.83 | 5.88 | 5.46 | 10.88 | 1961-01-04 |
| 1961-01-05 | 13.33 | 13.25 | 11.42 | 6.17 | 10.71 | 8.21 | 11.92 | 6.54 | 10.92 | 10.34 | 12.92 | 11.83 | 1961-01-05 |
wind['year']= wind['date'].dt.year
wind['month']= wind['date'].dt.month
wind['day']= wind['date'].dt.day
wind.head()
| Yr_Mo_Dy | RPT | VAL | ROS | KIL | SHA | BIR | DUB | CLA | MUL | CLO | BEL | MAL | date | year | month | day |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1961-01-01 | 15.04 | 14.96 | 13.17 | 9.29 | NaN | 9.87 | 13.67 | 10.25 | 10.83 | 12.58 | 18.50 | 15.04 | 1961-01-01 | 1961 | 1 | 1 |
| 1961-01-02 | 14.71 | NaN | 10.83 | 6.50 | 12.62 | 7.67 | 11.50 | 10.04 | 9.79 | 9.67 | 17.54 | 13.83 | 1961-01-02 | 1961 | 1 | 2 |
| 1961-01-03 | 18.50 | 16.88 | 12.33 | 10.13 | 11.17 | 6.17 | 11.25 | NaN | 8.50 | 7.67 | 12.75 | 12.71 | 1961-01-03 | 1961 | 1 | 3 |
| 1961-01-04 | 10.58 | 6.63 | 11.75 | 4.58 | 4.54 | 2.88 | 8.63 | 1.79 | 5.83 | 5.88 | 5.46 | 10.88 | 1961-01-04 | 1961 | 1 | 4 |
| 1961-01-05 | 13.33 | 13.25 | 11.42 | 6.17 | 10.71 | 8.21 | 11.92 | 6.54 | 10.92 | 10.34 | 12.92 | 11.83 | 1961-01-05 | 1961 | 1 | 5 |
# equivalent alternative via apply:
wind['year'] = wind['date'].apply(lambda x: x.year)
wind['month'] = wind['date'].apply(lambda x: x.month)
wind['day'] = wind['date'].apply(lambda x: x.day)
wind[wind['month']==1].iloc[:,0:12].mean()
RPT 14.847325
VAL 12.914560
ROS 13.299624
KIL 7.199498
SHA 11.667734
BIR 8.054839
DUB 11.819355
CLA 9.512047
MUL 9.543208
CLO 10.053566
BEL 14.550520
MAL 18.028763
dtype: float64
wind.loc[wind.index.month==1].mean()
RPT 14.847325
VAL 12.914560
ROS 13.299624
KIL 7.199498
SHA 11.667734
BIR 8.054839
DUB 11.819355
CLA 9.512047
MUL 9.543208
CLO 10.053566
BEL 14.550520
MAL 18.028763
dtype: float64
6.9 Downsample the records to yearly frequency for each location
- A DatetimeIndex is a collection of individual timestamps.
- A PeriodIndex is a DatetimeIndex with a fixed frequency, obtained via DatetimeIndex.to_period(freq); see the sketch below.
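A tiny sketch of the conversion (made-up dates):

```python
idx = pd.to_datetime(['1961-01-01', '1961-06-15', '1962-03-02'])
idx.to_period('A')   # PeriodIndex(['1961', '1961', '1962'], dtype='period[A-DEC]')
```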
# drop the helper columns from 6.8 first, e.g. wind = wind.drop(columns=['date', 'year', 'month', 'day'])
wind.groupby(wind.index.to_period('A')).mean()
| Yr_Mo_Dy | RPT | VAL | ROS | KIL | SHA | BIR | DUB | CLA | MUL | CLO | BEL | MAL |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1961 | 12.299583 | 10.351796 | 11.362369 | 6.958227 | 10.881763 | 7.729726 | 9.733923 | 8.858788 | 8.647652 | 9.835577 | 13.502795 | 13.680773 |
| 1962 | 12.246923 | 10.110438 | 11.732712 | 6.960440 | 10.657918 | 7.393068 | 11.020712 | 8.793753 | 8.316822 | 9.676247 | 12.930685 | 14.323956 |
| 1963 | 12.813452 | 10.836986 | 12.541151 | 7.330055 | 11.724110 | 8.434712 | 11.075699 | 10.336548 | 8.903589 | 10.224438 | 13.638877 | 14.999014 |
| 1964 | 12.363661 | 10.920164 | 12.104372 | 6.787787 | 11.454481 | 7.570874 | 10.259153 | 9.467350 | 7.789016 | 10.207951 | 13.740546 | 14.910301 |
| 1965 | 12.451370 | 11.075534 | 11.848767 | 6.858466 | 11.024795 | 7.478110 | 10.618712 | 8.879918 | 7.907425 | 9.918082 | 12.964247 | 15.591644 |
| 1966 | 13.461973 | 11.557205 | 12.020630 | 7.345726 | 11.805041 | 7.793671 | 10.579808 | 8.835096 | 8.514438 | 9.768959 | 14.265836 | 16.307260 |
| 1967 | 12.737151 | 10.990986 | 11.739397 | 7.143425 | 11.630740 | 7.368164 | 10.652027 | 9.325616 | 8.645014 | 9.547425 | 14.774548 | 17.135945 |
| 1968 | 11.835628 | 10.468197 | 11.409754 | 6.477678 | 10.760765 | 6.067322 | 8.859180 | 8.255519 | 7.224945 | 7.832978 | 12.808634 | 15.017486 |
| 1969 | 11.166356 | 9.723699 | 10.902000 | 5.767973 | 9.873918 | 6.189973 | 8.564493 | 7.711397 | 7.924521 | 7.754384 | 12.621233 | 15.762904 |
| 1970 | 12.600329 | 10.726932 | 11.730247 | 6.217178 | 10.567370 | 7.609452 | 9.609890 | 8.334630 | 9.297616 | 8.289808 | 13.183644 | 16.456027 |
| 1971 | 11.273123 | 9.095178 | 11.088329 | 5.241507 | 9.440329 | 6.097151 | 8.385890 | 6.757315 | 7.915370 | 7.229753 | 12.208932 | 15.025233 |
| 1972 | 12.463962 | 10.561311 | 12.058333 | 5.929699 | 9.430410 | 6.358825 | 9.704508 | 7.680792 | 8.357295 | 7.515273 | 12.727377 | 15.028716 |
| 1973 | 11.828466 | 10.680493 | 10.680493 | 5.547863 | 9.640877 | 6.548740 | 8.482110 | 7.614274 | 8.245534 | 7.812411 | 12.169699 | 15.441096 |
| 1974 | 13.643096 | 11.811781 | 12.336356 | 6.427041 | 11.110986 | 6.809781 | 10.084603 | 9.896986 | 9.331753 | 8.736356 | 13.252959 | 16.947671 |
| 1975 | 12.008575 | 10.293836 | 11.564712 | 5.269096 | 9.190082 | 5.668521 | 8.562603 | 7.843836 | 8.797945 | 7.382822 | 12.631671 | 15.307863 |
| 1976 | 11.737842 | 10.203115 | 10.761230 | 5.109426 | 8.846339 | 6.311038 | 9.149126 | 7.146202 | 8.883716 | 7.883087 | 12.332377 | 15.471448 |
| 1977 | 13.099616 | 11.144493 | 12.627836 | 6.073945 | 10.003836 | 8.586438 | 11.523205 | 8.378384 | 9.098192 | 8.821616 | 13.459068 | 16.590849 |
| 1978 | 12.504356 | 11.044274 | 11.380000 | 6.082356 | 10.167233 | 7.650658 | 9.489342 | 8.800466 | 9.089753 | 8.301699 | 12.967397 | 16.771370 |
6.10 Downsample to monthly frequency for each location
wind.groupby(wind.index.to_period('M')).mean()
| Yr_Mo_Dy | RPT | VAL | ROS | KIL | SHA | BIR | DUB | CLA | MUL | CLO | BEL | MAL |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1961-01 | 14.841333 | 11.988333 | 13.431613 | 7.736774 | 11.072759 | 8.588065 | 11.184839 | 9.245333 | 9.085806 | 10.107419 | 13.880968 | 14.703226 |
| 1961-02 | 16.269286 | 14.975357 | 14.441481 | 9.230741 | 13.852143 | 10.937500 | 11.890714 | 11.846071 | 11.821429 | 12.714286 | 18.583214 | 15.411786 |
| 1961-03 | 10.890000 | 11.296452 | 10.752903 | 7.284000 | 10.509355 | 8.866774 | 9.644194 | 9.829677 | 10.294138 | 11.251935 | 16.410968 | 15.720000 |
| 1961-04 | 10.722667 | 9.427667 | 9.998000 | 5.830667 | 8.435000 | 6.495000 | 6.925333 | 7.094667 | 7.342333 | 7.237000 | 11.147333 | 10.278333 |
| 1961-05 | 9.860968 | 8.850000 | 10.818065 | 5.905333 | 9.490323 | 6.574839 | 7.604000 | 8.177097 | 8.039355 | 8.499355 | 11.900323 | 12.011613 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1978-08 | 9.645161 | 8.259355 | 9.032258 | 4.502903 | 7.368065 | 5.935161 | 5.650323 | 5.417742 | 7.241290 | 5.536774 | 10.466774 | 12.054194 |
| 1978-09 | 10.913667 | 10.895000 | 10.635000 | 5.725000 | 10.372000 | 9.278333 | 10.790333 | 9.583000 | 10.069333 | 8.939000 | 15.680333 | 19.391333 |
| 1978-10 | 9.897742 | 8.670968 | 9.295806 | 4.721290 | 8.525161 | 6.774194 | 8.115484 | 7.337742 | 8.297742 | 8.243871 | 13.776774 | 17.150000 |
| 1978-11 | 16.151667 | 14.802667 | 13.508000 | 7.317333 | 11.475000 | 8.743000 | 11.492333 | 9.657333 | 10.701333 | 10.676000 | 17.404667 | 20.723000 |
| 1978-12 | 16.175484 | 13.748065 | 15.635161 | 7.094839 | 11.398710 | 9.241613 | 12.077419 | 10.194839 | 10.616774 | 11.028710 | 13.859677 | 21.371613 |

216 rows × 12 columns
6.11 Downsample to weekly frequency for each location
wind.groupby(wind.index.to_period('W')).mean()
| Yr_Mo_Dy | RPT | VAL | ROS | KIL | SHA | BIR | DUB | CLA | MUL | CLO | BEL | MAL |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1960-12-26/1961-01-01 | 15.040000 | 14.960000 | 13.170000 | 9.290000 | NaN | 9.870000 | 13.670000 | 10.250000 | 10.830000 | 12.580000 | 18.500000 | 15.040000 |
| 1961-01-02/1961-01-08 | 13.541429 | 11.486667 | 10.487143 | 6.417143 | 9.474286 | 6.435714 | 11.061429 | 6.616667 | 8.434286 | 8.497143 | 12.481429 | 13.238571 |
| 1961-01-09/1961-01-15 | 12.468571 | 8.967143 | 11.958571 | 4.630000 | 7.351429 | 5.072857 | 7.535714 | 6.820000 | 5.712857 | 7.571429 | 11.125714 | 11.024286 |
| 1961-01-16/1961-01-22 | 13.204286 | 9.862857 | 12.982857 | 6.328571 | 8.966667 | 7.417143 | 9.257143 | 7.875714 | 7.145714 | 8.124286 | 9.821429 | 11.434286 |
| 1961-01-23/1961-01-29 | 19.880000 | 16.141429 | 18.225714 | 12.720000 | 17.432857 | 14.828571 | 15.528571 | 15.160000 | 14.480000 | 15.640000 | 20.930000 | 22.530000 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1978-11-27/1978-12-03 | 14.934286 | 11.232857 | 13.941429 | 5.565714 | 10.215714 | 8.618571 | 9.642857 | 7.685714 | 9.011429 | 9.547143 | 11.835714 | 18.728571 |
| 1978-12-04/1978-12-10 | 20.740000 | 19.190000 | 17.034286 | 9.777143 | 15.287143 | 12.774286 | 14.437143 | 12.488571 | 13.870000 | 14.082857 | 18.517143 | 23.061429 |
| 1978-12-11/1978-12-17 | 16.758571 | 14.692857 | 14.987143 | 6.917143 | 11.397143 | 7.272857 | 10.208571 | 7.967143 | 9.168571 | 8.565714 | 11.102857 | 15.562857 |
| 1978-12-18/1978-12-24 | 11.155714 | 8.008571 | 13.172857 | 4.004286 | 7.825714 | 6.290000 | 7.798571 | 8.667143 | 7.151429 | 8.072857 | 11.845714 | 18.977143 |
| 1978-12-25/1978-12-31 | 14.951429 | 11.801429 | 16.035714 | 6.507143 | 9.660000 | 8.620000 | 13.708571 | 10.477143 | 10.868571 | 11.471429 | 12.947143 | 26.844286 |

940 rows × 12 columns
6.12 Compute the min, max, mean and standard deviation of wind speed across all locations for the first 52 weeks (assuming the first week starts on January 2, 1961).
weekly = wind.resample('W').agg(['min','max','mean','std'])
weekly
| Yr_Mo_Dy | RPT min | RPT max | RPT mean | RPT std | VAL min | VAL max | VAL mean | VAL std | ROS min | ROS max | ... | CLO mean | CLO std | BEL min | BEL max | BEL mean | BEL std | MAL min | MAL max | MAL mean | MAL std |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1961-01-01 | 15.04 | 15.04 | 15.040000 | NaN | 14.96 | 14.96 | 14.960000 | NaN | 13.17 | 13.17 | ... | 12.580000 | NaN | 18.50 | 18.50 | 18.500000 | NaN | 15.04 | 15.04 | 15.040000 | NaN |
| 1961-01-08 | 10.58 | 18.50 | 13.541429 | 2.631321 | 6.63 | 16.88 | 11.486667 | 3.949525 | 7.62 | 12.33 | ... | 8.497143 | 1.704941 | 5.46 | 17.54 | 12.481429 | 4.349139 | 10.88 | 16.46 | 13.238571 | 1.773062 |
| 1961-01-15 | 9.04 | 19.75 | 12.468571 | 3.555392 | 3.54 | 12.08 | 8.967143 | 3.148945 | 7.08 | 19.50 | ... | 7.571429 | 4.084293 | 5.25 | 20.71 | 11.125714 | 5.552215 | 5.17 | 16.92 | 11.024286 | 4.692355 |
| 1961-01-22 | 4.92 | 19.83 | 13.204286 | 5.337402 | 3.42 | 14.37 | 9.862857 | 3.837785 | 7.29 | 20.79 | ... | 8.124286 | 4.783952 | 6.50 | 15.92 | 9.821429 | 3.626584 | 6.79 | 17.96 | 11.434286 | 4.237239 |
| 1961-01-29 | 13.62 | 25.04 | 19.880000 | 4.619061 | 9.96 | 23.91 | 16.141429 | 5.170224 | 12.67 | 25.84 | ... | 15.640000 | 3.713368 | 14.04 | 27.71 | 20.930000 | 5.210726 | 17.50 | 27.63 | 22.530000 | 3.874721 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1978-12-03 | 9.08 | 21.29 | 14.934286 | 4.931754 | 4.54 | 21.34 | 11.232857 | 5.978968 | 8.21 | 24.04 | ... | 9.547143 | 6.284973 | 4.92 | 21.42 | 11.835714 | 5.950112 | 11.50 | 25.75 | 18.728571 | 6.393188 |
| 1978-12-10 | 9.92 | 29.33 | 20.740000 | 7.215012 | 12.54 | 24.79 | 19.190000 | 4.953060 | 7.21 | 25.37 | ... | 14.082857 | 5.516405 | 9.54 | 26.08 | 18.517143 | 5.600389 | 15.34 | 34.59 | 23.061429 | 8.093976 |
| 1978-12-17 | 9.87 | 23.13 | 16.758571 | 4.499431 | 3.21 | 24.04 | 14.692857 | 7.578665 | 8.04 | 18.05 | ... | 8.565714 | 5.487801 | 5.00 | 21.50 | 11.102857 | 6.631925 | 6.92 | 22.83 | 15.562857 | 6.005594 |
| 1978-12-24 | 6.21 | 16.62 | 11.155714 | 3.522759 | 3.63 | 13.29 | 8.008571 | 3.882900 | 8.50 | 22.21 | ... | 8.072857 | 3.023131 | 3.21 | 19.79 | 11.845714 | 5.750301 | 10.29 | 31.71 | 18.977143 | 7.194108 |
| 1978-12-31 | 7.21 | 20.33 | 14.951429 | 4.350400 | 5.46 | 17.41 | 11.801429 | 4.705392 | 7.83 | 27.29 | ... | 11.471429 | 5.533397 | 1.21 | 21.79 | 12.947143 | 7.523148 | 11.96 | 41.46 | 26.844286 | 11.621233 |

940 rows × 48 columns
weekly.loc[weekly.index[1:53],:].head()
| Yr_Mo_Dy | RPT min | RPT max | RPT mean | RPT std | VAL min | VAL max | VAL mean | VAL std | ROS min | ROS max | ... | CLO mean | CLO std | BEL min | BEL max | BEL mean | BEL std | MAL min | MAL max | MAL mean | MAL std |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1961-01-08 | 10.58 | 18.50 | 13.541429 | 2.631321 | 6.63 | 16.88 | 11.486667 | 3.949525 | 7.62 | 12.33 | ... | 8.497143 | 1.704941 | 5.46 | 17.54 | 12.481429 | 4.349139 | 10.88 | 16.46 | 13.238571 | 1.773062 |
| 1961-01-15 | 9.04 | 19.75 | 12.468571 | 3.555392 | 3.54 | 12.08 | 8.967143 | 3.148945 | 7.08 | 19.50 | ... | 7.571429 | 4.084293 | 5.25 | 20.71 | 11.125714 | 5.552215 | 5.17 | 16.92 | 11.024286 | 4.692355 |
| 1961-01-22 | 4.92 | 19.83 | 13.204286 | 5.337402 | 3.42 | 14.37 | 9.862857 | 3.837785 | 7.29 | 20.79 | ... | 8.124286 | 4.783952 | 6.50 | 15.92 | 9.821429 | 3.626584 | 6.79 | 17.96 | 11.434286 | 4.237239 |
| 1961-01-29 | 13.62 | 25.04 | 19.880000 | 4.619061 | 9.96 | 23.91 | 16.141429 | 5.170224 | 12.67 | 25.84 | ... | 15.640000 | 3.713368 | 14.04 | 27.71 | 20.930000 | 5.210726 | 17.50 | 27.63 | 22.530000 | 3.874721 |
| 1961-02-05 | 10.58 | 24.21 | 16.827143 | 5.251408 | 9.46 | 24.21 | 15.460000 | 5.187395 | 9.04 | 19.70 | ... | 9.460000 | 2.839501 | 9.17 | 19.33 | 14.012857 | 4.210858 | 7.17 | 19.25 | 11.935714 | 4.336104 |

5 rows × 48 columns
7. Exploring the Titanic disaster data (visualization)
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
plt.rcParams["font.sans-serif"]=['SimHei']
plt.rcParams["axes.unicode_minus"]=False
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')
7.1 Import the data
titanic = pd.read_csv('train.csv')
titanic.head()
|   | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
titanic.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 PassengerId 891 non-null int64
1 Survived 891 non-null int64
2 Pclass 891 non-null int64
3 Name 891 non-null object
4 Sex 891 non-null object
5 Age 714 non-null float64
6 SibSp 891 non-null int64
7 Parch 891 non-null int64
8 Ticket 891 non-null object
9 Fare 891 non-null float64
10 Cabin 204 non-null object
11 Embarked 889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
7.2 Pie chart of the passengers' sex ratio
males = (titanic['Sex']=='male').sum()
females = (titanic['Sex']=='female').sum()
proportions = [males, females]
plt.pie(
proportions,
labels = ['Males', 'Females'],
shadow = True,
colors = ['#E76278','#FAC699'],
explode = (0.15,0),
startangle = 90,
autopct = '%1.1f%%'
)
plt.axis('equal')
plt.title('Sex Proportion')
plt.tight_layout()
plt.show()
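The counting and plotting can also be fused: `value_counts()` feeds `plot.pie` directly (an equivalent sketch):

```python
titanic['Sex'].value_counts().plot.pie(autopct='%1.1f%%', startangle=90)
plt.axis('equal')
plt.show()
```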
7.3 Scatter plot of ticket fare against passenger age
lm = sns.lmplot(x='Age', y='Fare', data=titanic, hue='Sex', fit_reg=False)
lm.set(title='Fare_Age Scatter')
plt.tight_layout()
plt.show()

7.4 How many passengers survived?
titanic['Survived'].sum()
342
7.5 Histogram of ticket fares
plt.figure(figsize=(10,4), dpi=100)
plt.subplot(111, facecolor='#F0F0F0')
plt.title('船票价格分布')
plt.xlabel('船票价格')
plt.ylabel('人数')
plt.hist(titanic['Fare'], bins=50, color='#447C69')
plt.xticks(np.arange(0,300,20), fontsize=10)
plt.yticks(fontsize=10)
plt.tight_layout()

8. Exploring Pokemon data
8.1 Imports and raw data
import pandas as pd
raw_data = {"name": ['Bulbasaur', 'Charmander','Squirtle','Caterpie'],
"evolution": ['Ivysaur','Charmeleon','Wartortle','Metapod'],
"type": ['grass', 'fire', 'water', 'bug'],
"hp": [45, 39, 44, 45],
"pokedex": ['yes', 'no','yes','no']
}
8.2 Create the DataFrame
pokemon = pd.DataFrame(raw_data)
pokemon.head()
|   | name | evolution | type | hp | pokedex |
|---|---|---|---|---|---|
| 0 | Bulbasaur | Ivysaur | grass | 45 | yes |
| 1 | Charmander | Charmeleon | fire | 39 | no |
| 2 | Squirtle | Wartortle | water | 44 | yes |
| 3 | Caterpie | Metapod | bug | 45 | no |
8.3 Reorder the columns
pokemon = pokemon[['name','type','hp','evolution','pokedex']]
pokemon.head()
|   | name | type | hp | evolution | pokedex |
|---|---|---|---|---|---|
| 0 | Bulbasaur | grass | 45 | Ivysaur | yes |
| 1 | Charmander | fire | 39 | Charmeleon | no |
| 2 | Squirtle | water | 44 | Wartortle | yes |
| 3 | Caterpie | bug | 45 | Metapod | no |
8.4 Add a place column
pokemon['place'] = ['park','street','lake','forest']
pokemon
|   | name | type | hp | evolution | pokedex | place |
|---|---|---|---|---|---|---|
| 0 | Bulbasaur | grass | 45 | Ivysaur | yes | park |
| 1 | Charmander | fire | 39 | Charmeleon | no | street |
| 2 | Squirtle | water | 44 | Wartortle | yes | lake |
| 3 | Caterpie | bug | 45 | Metapod | no | forest |
8.5 Check each column's dtype
pokemon.dtypes
name object
type object
hp int64
evolution object
pokedex object
place object
dtype: object
9. Exploring Apple stock price data
9.1 Import the data
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
apple = pd.read_csv('Apple_stock.csv')
apple.head()
|   | Date | Open | High | Low | Close | Volume | Adj Close |
|---|---|---|---|---|---|---|---|
| 0 | 2014-07-08 | 96.27 | 96.80 | 93.92 | 95.35 | 65130000 | 95.35 |
| 1 | 2014-07-07 | 94.14 | 95.99 | 94.10 | 95.97 | 56305400 | 95.97 |
| 2 | 2014-07-03 | 93.67 | 94.10 | 93.20 | 94.03 | 22891800 | 94.03 |
| 3 | 2014-07-02 | 93.87 | 94.06 | 93.09 | 93.48 | 28420900 | 93.48 |
| 4 | 2014-07-01 | 93.52 | 94.07 | 93.13 | 93.52 | 38170200 | 93.52 |
apple.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 8465 entries, 1980-12-12 to 2014-07-08
Data columns (total 6 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Open 8465 non-null float64
1 High 8465 non-null float64
2 Low 8465 non-null float64
3 Close 8465 non-null float64
4 Volume 8465 non-null int64
5 Adj Close 8465 non-null float64
dtypes: float64(5), int64(1)
memory usage: 721.0 KB
(Note: this info() output shows a DatetimeIndex because it was captured after the set_index('Date') step below; on a fresh read, Date is still a regular column.)
9.2 Are there any duplicate dates?
apple['Date'].duplicated().sum()
0
apple['Date'].is_unique
True
apple.index.is_unique
True
apple['Date'] = pd.to_datetime(apple['Date'])
apple.set_index('Date', inplace=True)
apple.head()
| Date | Open | High | Low | Close | Volume | Adj Close |
|---|---|---|---|---|---|---|
| 2014-07-08 | 96.27 | 96.80 | 93.92 | 95.35 | 65130000 | 95.35 |
| 2014-07-07 | 94.14 | 95.99 | 94.10 | 95.97 | 56305400 | 95.97 |
| 2014-07-03 | 93.67 | 94.10 | 93.20 | 94.03 | 22891800 | 94.03 |
| 2014-07-02 | 93.87 | 94.06 | 93.09 | 93.48 | 28420900 | 93.48 |
| 2014-07-01 | 93.52 | 94.07 | 93.13 | 93.52 | 38170200 | 93.52 |
9.3 Sort the index into ascending order
apple.sort_index(ascending=True, inplace=True)
apple.head()
| Date | Open | High | Low | Close | Volume | Adj Close |
|---|---|---|---|---|---|---|
| 1980-12-12 | 28.75 | 28.87 | 28.75 | 28.75 | 117258400 | 0.45 |
| 1980-12-15 | 27.38 | 27.38 | 27.25 | 27.25 | 43971200 | 0.42 |
| 1980-12-16 | 25.37 | 25.37 | 25.25 | 25.25 | 26432000 | 0.39 |
| 1980-12-17 | 25.87 | 26.00 | 25.87 | 25.87 | 21610400 | 0.40 |
| 1980-12-18 | 26.63 | 26.75 | 26.63 | 26.63 | 18362400 | 0.41 |
9.4 Find the last business day of each month
apple_month = apple.resample('BM').mean()
apple_month.head()
| Date | Open | High | Low | Close | Volume | Adj Close |
|---|---|---|---|---|---|---|
| 1980-12-31 | 30.481538 | 30.567692 | 30.443077 | 30.443077 | 2.586252e+07 | 0.473077 |
| 1981-01-30 | 31.754762 | 31.826667 | 31.654762 | 31.654762 | 7.249867e+06 | 0.493810 |
| 1981-02-27 | 26.480000 | 26.572105 | 26.407895 | 26.407895 | 4.231832e+06 | 0.411053 |
| 1981-03-31 | 24.937727 | 25.016818 | 24.836364 | 24.836364 | 7.962691e+06 | 0.387727 |
| 1981-04-30 | 27.286667 | 27.368095 | 27.227143 | 27.227143 | 6.392000e+06 | 0.423333 |
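Note that `resample('BM').mean()` computes the *mean over each month*, merely labeled with the month's last business day. To pick out the actual last trading row of each month, a sketch:

```python
apple.resample('BM').last().head()   # last observed row in each business-month bin
```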
9.5 How many days apart are the earliest and latest dates in the dataset?
(apple.index.max()-apple.index.min()).days
12261
9.6 How many months are there in the dataset?
len(apple.resample('BM').mean())
404
9.7 Plot the Adj Close values in time order
plt.figure(figsize=(8,3), dpi=100)
apple['Adj Close'].plot(title='Apple Stock')
<AxesSubplot:title={'center':'Apple Stock'}, xlabel='Date'>

10. Exploring the Iris flower dataset
10.1 Import the data
import pandas as pd
iris = pd.read_csv('iris.csv')
iris.head()
|   | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa |
|---|---|---|---|---|---|
| 0 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa |
| 1 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |
| 2 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa |
| 3 | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa |
| 4 | 5.4 | 3.9 | 1.7 | 0.4 | Iris-setosa |
iris.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 149 entries, 0 to 148
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 5.1 149 non-null float64
1 3.5 149 non-null float64
2 1.4 149 non-null float64
3 0.2 149 non-null float64
4 Iris-setosa 149 non-null object
dtypes: float64(4), object(1)
memory usage: 5.9+ KB
10.2 Add column names
iris.columns = ['sepal_length','sepal_width', 'petal_length','petal_width', 'class']
iris.head()
|   | sepal_length | sepal_width | petal_length | petal_width | class |
|---|---|---|---|---|---|
| 0 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa |
| 1 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |
| 2 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa |
| 3 | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa |
| 4 | 5.4 | 3.9 | 1.7 | 0.4 | Iris-setosa |
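Note: because the file was read without `header=None`, the first data row (5.1, 3.5, 1.4, 0.2, Iris-setosa) was consumed as the column names, which is why info() reported 149 rows instead of 150. A sketch that keeps every row (column names as assumed here):

```python
cols = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'class']
iris = pd.read_csv('iris.csv', header=None, names=cols)   # 150 rows
```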
10.3 Does the dataset have missing values?
iris.isna().sum()
sepal_length 0
sepal_width 0
petal_length 0
petal_width 0
class 0
dtype: int64
10.4 Set rows 10 through 19 of petal_length to missing
- Python's range() is closed on the left and open on the right, so stop=20 below is excluded.
import numpy as np  # np.nan is used below, but numpy was not imported in this section

for i in range(10, 20):
    iris.at[i, 'petal_length'] = np.nan
iris.head(21)
|    | sepal_length | sepal_width | petal_length | petal_width | class |
|---|---|---|---|---|---|
| 0 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa |
| 1 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |
| 2 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa |
| 3 | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa |
| 4 | 5.4 | 3.9 | 1.7 | 0.4 | Iris-setosa |
| 5 | 4.6 | 3.4 | 1.4 | 0.3 | Iris-setosa |
| 6 | 5.0 | 3.4 | 1.5 | 0.2 | Iris-setosa |
| 7 | 4.4 | 2.9 | 1.4 | 0.2 | Iris-setosa |
| 8 | 4.9 | 3.1 | 1.5 | 0.1 | Iris-setosa |
| 9 | 5.4 | 3.7 | 1.5 | 0.2 | Iris-setosa |
| 10 | 4.8 | 3.4 | NaN | 0.2 | Iris-setosa |
| 11 | 4.8 | 3.0 | NaN | 0.1 | Iris-setosa |
| 12 | 4.3 | 3.0 | NaN | 0.1 | Iris-setosa |
| 13 | 5.8 | 4.0 | NaN | 0.2 | Iris-setosa |
| 14 | 5.7 | 4.4 | NaN | 0.4 | Iris-setosa |
| 15 | 5.4 | 3.9 | NaN | 0.4 | Iris-setosa |
| 16 | 5.1 | 3.5 | NaN | 0.3 | Iris-setosa |
| 17 | 5.7 | 3.8 | NaN | 0.3 | Iris-setosa |
| 18 | 5.1 | 3.8 | NaN | 0.3 | Iris-setosa |
| 19 | 5.4 | 3.4 | NaN | 0.2 | Iris-setosa |
| 20 | 5.1 | 3.7 | 1.5 | 0.4 | Iris-setosa |
An iloc-based alternative (note: column position 1 is sepal_width, and the output below shows petal_length back at its original values, so the data was evidently re-read before this cell ran):
iris.iloc[10:20, 1:2] = np.nan
iris.head(21)
|    | sepal_length | sepal_width | petal_length | petal_width | class |
|---|---|---|---|---|---|
| 0 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa |
| 1 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |
| 2 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa |
| 3 | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa |
| 4 | 5.4 | 3.9 | 1.7 | 0.4 | Iris-setosa |
| 5 | 4.6 | 3.4 | 1.4 | 0.3 | Iris-setosa |
| 6 | 5.0 | 3.4 | 1.5 | 0.2 | Iris-setosa |
| 7 | 4.4 | 2.9 | 1.4 | 0.2 | Iris-setosa |
| 8 | 4.9 | 3.1 | 1.5 | 0.1 | Iris-setosa |
| 9 | 5.4 | 3.7 | 1.5 | 0.2 | Iris-setosa |
| 10 | 4.8 | NaN | 1.6 | 0.2 | Iris-setosa |
| 11 | 4.8 | NaN | 1.4 | 0.1 | Iris-setosa |
| 12 | 4.3 | NaN | 1.1 | 0.1 | Iris-setosa |
| 13 | 5.8 | NaN | 1.2 | 0.2 | Iris-setosa |
| 14 | 5.7 | NaN | 1.5 | 0.4 | Iris-setosa |
| 15 | 5.4 | NaN | 1.3 | 0.4 | Iris-setosa |
| 16 | 5.1 | NaN | 1.4 | 0.3 | Iris-setosa |
| 17 | 5.7 | NaN | 1.7 | 0.3 | Iris-setosa |
| 18 | 5.1 | NaN | 1.5 | 0.3 | Iris-setosa |
| 19 | 5.4 | NaN | 1.7 | 0.2 | Iris-setosa |
| 20 | 5.1 | 3.7 | 1.5 | 0.4 | Iris-setosa |
10.5 Replace all missing values with 1.0
iris.fillna(1, inplace=True)
iris.head(21)
|    | sepal_length | sepal_width | petal_length | petal_width | class |
|---|---|---|---|---|---|
| 0 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa |
| 1 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |
| 2 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa |
| 3 | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa |
| 4 | 5.4 | 3.9 | 1.7 | 0.4 | Iris-setosa |
| 5 | 4.6 | 3.4 | 1.4 | 0.3 | Iris-setosa |
| 6 | 5.0 | 3.4 | 1.5 | 0.2 | Iris-setosa |
| 7 | 4.4 | 2.9 | 1.4 | 0.2 | Iris-setosa |
| 8 | 4.9 | 3.1 | 1.5 | 0.1 | Iris-setosa |
| 9 | 5.4 | 3.7 | 1.5 | 0.2 | Iris-setosa |
| 10 | 4.8 | 3.4 | 1.0 | 0.2 | Iris-setosa |
| 11 | 4.8 | 3.0 | 1.0 | 0.1 | Iris-setosa |
| 12 | 4.3 | 3.0 | 1.0 | 0.1 | Iris-setosa |
| 13 | 5.8 | 4.0 | 1.0 | 0.2 | Iris-setosa |
| 14 | 5.7 | 4.4 | 1.0 | 0.4 | Iris-setosa |
| 15 | 5.4 | 3.9 | 1.0 | 0.4 | Iris-setosa |
| 16 | 5.1 | 3.5 | 1.0 | 0.3 | Iris-setosa |
| 17 | 5.7 | 3.8 | 1.0 | 0.3 | Iris-setosa |
| 18 | 5.1 | 3.8 | 1.0 | 0.3 | Iris-setosa |
| 19 | 5.4 | 3.4 | 1.0 | 0.2 | Iris-setosa |
| 20 | 5.1 | 3.7 | 1.5 | 0.4 | Iris-setosa |
10.6 Delete the class column
del iris['class']
iris.head()
|   | sepal_length | sepal_width | petal_length | petal_width |
|---|---|---|---|---|
| 0 | 4.9 | 3.0 | 1.4 | 0.2 |
| 1 | 4.7 | 3.2 | 1.3 | 0.2 |
| 2 | 4.6 | 3.1 | 1.5 | 0.2 |
| 3 | 5.0 | 3.6 | 1.4 | 0.2 |
| 4 | 5.4 | 3.9 | 1.7 | 0.4 |
10.7 Set the first three rows to missing
iris.iloc[0:3,:] = np.nan
iris.head()
|   | sepal_length | sepal_width | petal_length | petal_width |
|---|---|---|---|---|
| 0 | NaN | NaN | NaN | NaN |
| 1 | NaN | NaN | NaN | NaN |
| 2 | NaN | NaN | NaN | NaN |
| 3 | 5.0 | 3.6 | 1.4 | 0.2 |
| 4 | 5.4 | 3.9 | 1.7 | 0.4 |
10.8 Drop the rows with missing values
iris.drop_duplicates(inplace=True)  # note: this only collapses the three identical all-NaN rows into one; iris.dropna(inplace=True) would actually remove them (the output below still keeps one NaN row)
iris.head()
|   | sepal_length | sepal_width | petal_length | petal_width |
|---|---|---|---|---|
| 0 | NaN | NaN | NaN | NaN |
| 3 | 5.0 | 3.6 | 1.4 | 0.2 |
| 4 | 5.4 | 3.9 | 1.7 | 0.4 |
| 5 | 4.6 | 3.4 | 1.4 | 0.3 |
| 6 | 5.0 | 3.4 | 1.5 | 0.2 |
10.9 Reset the index
iris.reset_index(inplace=True, drop=True)
iris.head()
|   | sepal_length | sepal_width | petal_length | petal_width |
|---|---|---|---|---|
| 0 | NaN | NaN | NaN | NaN |
| 1 | 5.0 | 3.6 | 1.4 | 0.2 |
| 2 | 5.4 | 3.9 | 1.7 | 0.4 |
| 3 | 4.6 | 3.4 | 1.4 | 0.3 |
| 4 | 5.0 | 3.4 | 1.5 | 0.2 |