[ github ] 10 Exercise Sets to Tie Together Common Pandas Knowledge

Published: 2022-12-18

GitHub repo: https://github.com/guipsamora/pandas_exercises
Datasets: https://www.aliyundrive.com/s/Wk6L6R67rrn

1. Getting to Know the Data

1.1 Import the dataset and take a first look
import pandas as pd
chipotle = pd.read_csv('001_chipotle.tsv',sep='\t')
chipotle.head()
order_id quantity item_name choice_description item_price
0 1 1 Chips and Fresh Tomato Salsa NaN $2.39
1 1 1 Izze [Clementine] $3.39
2 1 1 Nantucket Nectar [Apple] $3.39
3 1 1 Chips and Tomatillo-Green Chili Salsa NaN $2.39
4 2 2 Chicken Bowl [Tomatillo-Red Chili Salsa (Hot), [Black Beans... $16.98
chipotle.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4622 entries, 0 to 4621
Data columns (total 5 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   order_id            4622 non-null   int64 
 1   quantity            4622 non-null   int64 
 2   item_name           4622 non-null   object
 3   choice_description  3376 non-null   object
 4   item_price          4622 non-null   object
dtypes: int64(2), object(3)
memory usage: 180.7+ KB
chipotle.columns
Index(['order_id', 'quantity', 'item_name', 'choice_description',
       'item_price'],
      dtype='object')
chipotle.index
RangeIndex(start=0, stop=4622, step=1)
1.2 Which item was ordered most often?
  • as_index=False keeps the group keys as a regular column (with a default integer index),
  • which also keeps all column names on a single display row ("no merged cells"), which is easier to read
item_buy_count = chipotle[['item_name','quantity']].groupby('item_name').sum()['quantity'].nlargest(1)
item_buy_count
item_name
Chicken Bowl    761
Name: quantity, dtype: int64
  • Original answer
item_buy_count_1 = chipotle[['item_name','quantity']].groupby('item_name', as_index=False).agg({'quantity':'sum'})
item_buy_count_1.sort_values('quantity', ascending=False, inplace=True)
item_buy_count_1.head()
item_name quantity
17 Chicken Bowl 761
18 Chicken Burrito 591
25 Chips and Guacamole 506
39 Steak Burrito 386
10 Canned Soft Drink 351
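  • A one-line alternative (a sketch on the same chipotle frame): idxmax() returns the label of the largest value directly
chipotle.groupby('item_name')['quantity'].sum().idxmax()
'Chicken Bowl'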
1.3 How many different items were ordered?
  • nunique() counts distinct values
chipotle['item_name'].nunique()
50
# quick bar chart of order counts per item
s = chipotle['item_name'].value_counts()

from matplotlib import pyplot as plt
plt.figure(figsize=(20,4),dpi=100)
plt.xticks(fontsize=13, rotation='vertical')
plt.bar(s.index,s.values)
<BarContainer object of 50 artists>


[figure: bar chart of order counts per item_name]

1.4 Ranking of choice_description values among the ordered items
  • The wording of this exercise is unclear; a simple value_counts() ranking is shown below
chipotle['choice_description'].value_counts().head()
[Diet Coke]                                                                          134
[Coke]                                                                               123
[Sprite]                                                                              77
[Fresh Tomato Salsa, [Rice, Black Beans, Cheese, Sour Cream, Lettuce]]                42
[Fresh Tomato Salsa, [Rice, Black Beans, Cheese, Sour Cream, Guacamole, Lettuce]]     40
Name: choice_description, dtype: int64
1.5 How many items were ordered in total?
chipotle['quantity'].sum()
4972
1.6 Convert item_price to float
chipotle['item_price'] = chipotle['item_price'].str.lstrip('$').astype('float')
chipotle.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4622 entries, 0 to 4621
Data columns (total 5 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   order_id            4622 non-null   int64  
 1   quantity            4622 non-null   int64  
 2   item_name           4622 non-null   object 
 3   choice_description  3376 non-null   object 
 4   item_price          4622 non-null   float64
dtypes: float64(1), int64(2), object(2)
memory usage: 180.7+ KB
  • Original solution: the original notebook names the DataFrame chipo, so the right-hand side below is undefined here and raises a NameError (it would also fail now that item_price has already been converted to float above)
chipotle['item_price'] = chipo['item_price'].apply(lambda x: float(x[1:-1]))
NameError: name 'chipo' is not defined
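  • If item_price were still the raw string column (e.g. '$2.39'), a slightly more defensive sketch would go through pd.to_numeric, coercing malformed prices to NaN instead of raising
raw_price = pd.read_csv('001_chipotle.tsv', sep='\t')['item_price']   # re-read the raw column
pd.to_numeric(raw_price.str.replace('$', '', regex=False), errors='coerce')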
1.7 Total revenue
chipotle['sub_total'] = (chipotle['item_price'] * chipotle['quantity'])
chipotle['sub_total'].sum()
39237.02
1.8 Total number of orders
chipotle['order_id'].nunique()
1834
1.9 Average revenue per order
chipotle['sub_total'].sum()/chipotle['order_id'].nunique()
21.39423118865867
  • Original answer
chipotle[['order_id','sub_total']].groupby(by=['order_id']).agg({'sub_total':'sum'})['sub_total'].mean()
21.39423118865867

2. Filtering and Sorting

2.1 Import the data and take a first look
import pandas as pd
euro = pd.read_csv('Euro2012_stats.csv')
euro.head()
Team Goals Shots on target Shots off target Shooting Accuracy % Goals-to-shots Total shots (inc. Blocked) Hit Woodwork Penalty goals Penalties not scored ... Saves made Saves-to-shots ratio Fouls Won Fouls Conceded Offsides Yellow Cards Red Cards Subs on Subs off Players Used
0 Croatia 4 13 12 51.9% 16.0% 32 0 0 0 ... 13 81.3% 41 62 2 9 0 9 9 16
1 Czech Republic 4 13 18 41.9% 12.9% 39 0 0 0 ... 9 60.1% 53 73 8 7 0 11 11 19
2 Denmark 4 10 10 50.0% 20.0% 27 1 0 0 ... 10 66.7% 25 38 8 4 0 7 7 15
3 England 5 11 18 50.0% 17.2% 40 0 0 0 ... 22 88.1% 43 45 6 5 0 11 11 16
4 France 3 22 24 37.9% 6.5% 65 1 0 0 ... 6 54.6% 36 51 5 6 0 11 11 19

5 rows × 35 columns

euro.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16 entries, 0 to 15
Data columns (total 35 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   Team                        16 non-null     object 
 1   Goals                       16 non-null     int64  
 2   Shots on target             16 non-null     int64  
 3   Shots off target            16 non-null     int64  
 4   Shooting Accuracy           16 non-null     object 
 5   % Goals-to-shots            16 non-null     object 
 6   Total shots (inc. Blocked)  16 non-null     int64  
 7   Hit Woodwork                16 non-null     int64  
 8   Penalty goals               16 non-null     int64  
 9   Penalties not scored        16 non-null     int64  
 10  Headed goals                16 non-null     int64  
 11  Passes                      16 non-null     int64  
 12  Passes completed            16 non-null     int64  
 13  Passing Accuracy            16 non-null     object 
 14  Touches                     16 non-null     int64  
 15  Crosses                     16 non-null     int64  
 16  Dribbles                    16 non-null     int64  
 17  Corners Taken               16 non-null     int64  
 18  Tackles                     16 non-null     int64  
 19  Clearances                  16 non-null     int64  
 20  Interceptions               16 non-null     int64  
 21  Clearances off line         15 non-null     float64
 22  Clean Sheets                16 non-null     int64  
 23  Blocks                      16 non-null     int64  
 24  Goals conceded              16 non-null     int64  
 25  Saves made                  16 non-null     int64  
 26  Saves-to-shots ratio        16 non-null     object 
 27  Fouls Won                   16 non-null     int64  
 28  Fouls Conceded              16 non-null     int64  
 29  Offsides                    16 non-null     int64  
 30  Yellow Cards                16 non-null     int64  
 31  Red Cards                   16 non-null     int64  
 32  Subs on                     16 non-null     int64  
 33  Subs off                    16 non-null     int64  
 34  Players Used                16 non-null     int64  
dtypes: float64(1), int64(29), object(5)
memory usage: 4.5+ KB
2.2 How many teams took part in Euro 2012?
euro['Team'].nunique()
16
2.3 Store the columns Team, Yellow Cards and Red Cards in a separate DataFrame named discripline
discripline = euro[['Team','Yellow Cards', 'Red Cards']]
discripline
Team Yellow Cards Red Cards
0 Croatia 9 0
1 Czech Republic 7 0
2 Denmark 4 0
3 England 5 0
4 France 6 0
5 Germany 4 0
6 Greece 9 1
7 Italy 16 0
8 Netherlands 5 0
9 Poland 7 1
10 Portugal 12 0
11 Republic of Ireland 6 1
12 Russia 6 0
13 Spain 11 0
14 Sweden 7 0
15 Ukraine 5 0
2.4 Sort discripline by Yellow Cards and Red Cards
discripline.sort_values(['Yellow Cards', 'Red Cards'], ascending=[False, False], inplace=True)
discripline
C:\Users\LiCN\AppData\Local\Temp/ipykernel_19644/3135130171.py:1: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  discripline.sort_values(['Yellow Cards', 'Red Cards'], ascending=[False, False], inplace=True)
Team Yellow Cards Red Cards
7 Italy 16 0
10 Portugal 12 0
13 Spain 11 0
6 Greece 9 1
0 Croatia 9 0
9 Poland 7 1
1 Czech Republic 7 0
14 Sweden 7 0
11 Republic of Ireland 6 1
4 France 6 0
12 Russia 6 0
3 England 5 0
8 Netherlands 5 0
15 Ukraine 5 0
2 Denmark 4 0
5 Germany 4 0
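  • The SettingWithCopyWarning above appears because discripline is a slice of euro; a sketch that avoids it by taking an explicit copy:
discripline = euro[['Team', 'Yellow Cards', 'Red Cards']].copy()   # independent copy, safe to modify
discripline = discripline.sort_values(['Yellow Cards', 'Red Cards'], ascending=False)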
2.5 Mean number of yellow cards per team
discripline['Yellow Cards'].mean()
7.4375
2.6 Teams with more than 6 yellow cards (the original exercise asks for teams that scored more than 6 goals, i.e. euro[euro['Goals'] > 6]; the code below filters on yellow cards instead)
discripline[discripline['Yellow Cards']>6]
# note: attribute access (discripline.Yellow Cards) would not work here because the column name contains a space
Team Yellow Cards Red Cards
7 Italy 16 0
10 Portugal 12 0
13 Spain 11 0
6 Greece 9 1
0 Croatia 9 0
9 Poland 7 1
1 Czech Republic 7 0
14 Sweden 7 0
2.7 Teams whose name starts with the letter G
euro[euro['Team'].str.startswith('G')]
Team Goals Shots on target Shots off target Shooting Accuracy % Goals-to-shots Total shots (inc. Blocked) Hit Woodwork Penalty goals Penalties not scored ... Saves made Saves-to-shots ratio Fouls Won Fouls Conceded Offsides Yellow Cards Red Cards Subs on Subs off Players Used
5 Germany 10 32 32 47.8% 15.6% 80 2 1 0 ... 10 62.6% 63 49 12 4 0 15 15 17
6 Greece 5 8 18 30.7% 19.2% 32 1 1 1 ... 13 65.1% 67 48 12 9 1 12 12 20

2 rows × 35 columns

2.8 Select the first 7 columns
euro.iloc[:,0:7]
Team Goals Shots on target Shots off target Shooting Accuracy % Goals-to-shots Total shots (inc. Blocked)
0 Croatia 4 13 12 51.9% 16.0% 32
1 Czech Republic 4 13 18 41.9% 12.9% 39
2 Denmark 4 10 10 50.0% 20.0% 27
3 England 5 11 18 50.0% 17.2% 40
4 France 3 22 24 37.9% 6.5% 65
5 Germany 10 32 32 47.8% 15.6% 80
6 Greece 5 8 18 30.7% 19.2% 32
7 Italy 6 34 45 43.0% 7.5% 110
8 Netherlands 2 12 36 25.0% 4.1% 60
9 Poland 2 15 23 39.4% 5.2% 48
10 Portugal 6 22 42 34.3% 9.3% 82
11 Republic of Ireland 1 7 12 36.8% 5.2% 28
12 Russia 5 9 31 22.5% 12.5% 59
13 Spain 12 42 33 55.9% 16.0% 100
14 Sweden 5 17 19 47.2% 13.8% 39
15 Ukraine 2 7 26 21.2% 6.0% 38
2.9 Select all columns except the last 3
euro.iloc[:,:-3]
Team Goals Shots on target Shots off target Shooting Accuracy % Goals-to-shots Total shots (inc. Blocked) Hit Woodwork Penalty goals Penalties not scored ... Clean Sheets Blocks Goals conceded Saves made Saves-to-shots ratio Fouls Won Fouls Conceded Offsides Yellow Cards Red Cards
0 Croatia 4 13 12 51.9% 16.0% 32 0 0 0 ... 0 10 3 13 81.3% 41 62 2 9 0
1 Czech Republic 4 13 18 41.9% 12.9% 39 0 0 0 ... 1 10 6 9 60.1% 53 73 8 7 0
2 Denmark 4 10 10 50.0% 20.0% 27 1 0 0 ... 1 10 5 10 66.7% 25 38 8 4 0
3 England 5 11 18 50.0% 17.2% 40 0 0 0 ... 2 29 3 22 88.1% 43 45 6 5 0
4 France 3 22 24 37.9% 6.5% 65 1 0 0 ... 1 7 5 6 54.6% 36 51 5 6 0
5 Germany 10 32 32 47.8% 15.6% 80 2 1 0 ... 1 11 6 10 62.6% 63 49 12 4 0
6 Greece 5 8 18 30.7% 19.2% 32 1 1 1 ... 1 23 7 13 65.1% 67 48 12 9 1
7 Italy 6 34 45 43.0% 7.5% 110 2 0 0 ... 2 18 7 20 74.1% 101 89 16 16 0
8 Netherlands 2 12 36 25.0% 4.1% 60 2 0 0 ... 0 9 5 12 70.6% 35 30 3 5 0
9 Poland 2 15 23 39.4% 5.2% 48 0 0 0 ... 0 8 3 6 66.7% 48 56 3 7 1
10 Portugal 6 22 42 34.3% 9.3% 82 6 0 0 ... 2 11 4 10 71.5% 73 90 10 12 0
11 Republic of Ireland 1 7 12 36.8% 5.2% 28 0 0 0 ... 0 23 9 17 65.4% 43 51 11 6 1
12 Russia 5 9 31 22.5% 12.5% 59 2 0 0 ... 0 8 3 10 77.0% 34 43 4 6 0
13 Spain 12 42 33 55.9% 16.0% 100 0 1 0 ... 5 8 1 15 93.8% 102 83 19 11 0
14 Sweden 5 17 19 47.2% 13.8% 39 3 0 0 ... 1 12 5 8 61.6% 35 51 7 7 0
15 Ukraine 2 7 26 21.2% 6.0% 38 0 0 0 ... 0 4 4 13 76.5% 48 31 4 5 0

16 rows × 32 columns

2.10 Shooting Accuracy for England, Italy and Russia
euro[euro['Team'].isin(['England', 'Italy', 'Russia'])][['Team','Shooting Accuracy']]
Team Shooting Accuracy
3 England 50.0%
7 Italy 43.0%
12 Russia 22.5%
  • Original solution
euro.loc[euro['Team'].isin(['England', 'Italy', 'Russia']), ['Team','Shooting Accuracy']]
Team Shooting Accuracy
3 England 50.0%
7 Italy 43.0%
12 Russia 22.5%
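  • Shooting Accuracy is stored as a string such as '50.0%'; a sketch converting it to a numeric column for further analysis:
euro['Shooting Accuracy'].str.rstrip('%').astype(float) / 100   # e.g. '50.0%' -> 0.5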

3. Grouping: Exploring Alcohol Consumption Data

import pandas as pd

drinks = pd.read_csv('drinks.csv')
drinks.head()
country beer_servings spirit_servings wine_servings total_litres_of_pure_alcohol continent
0 Afghanistan 0 0 0 0.0 AS
1 Albania 89 132 54 4.9 EU
2 Algeria 25 0 14 0.7 AF
3 Andorra 245 138 312 12.4 EU
4 Angola 217 57 45 5.9 AF
drinks.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 193 entries, 0 to 192
Data columns (total 6 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   country                       193 non-null    object 
 1   beer_servings                 193 non-null    int64  
 2   spirit_servings               193 non-null    int64  
 3   wine_servings                 193 non-null    int64  
 4   total_litres_of_pure_alcohol  193 non-null    float64
 5   continent                     170 non-null    object 
dtypes: float64(1), int64(3), object(2)
memory usage: 9.2+ KB
3.1 Which continent consumes the most beer on average?
  • note: sort_values() sorts ascending by default, so ascending=False is needed here; otherwise head(1) returns the continent that drinks the least
drinks.groupby('continent')['beer_servings'].mean().sort_values(ascending=False).head(1)
continent
EU    193.777778
Name: beer_servings, dtype: float64
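  • Equivalent one-liner (a sketch): idxmax() returns the continent label of the maximum directly
drinks.groupby('continent')['beer_servings'].mean().idxmax()
'EU'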
3.2 Descriptive statistics of wine consumption for each continent
drinks.groupby('continent')['wine_servings'].describe()
count mean std min 25% 50% 75% max
continent
AF 53.0 16.264151 38.846419 0.0 1.0 2.0 13.00 233.0
AS 44.0 9.068182 21.667034 0.0 0.0 1.0 8.00 123.0
EU 45.0 142.222222 97.421738 0.0 59.0 128.0 195.00 370.0
OC 16.0 35.625000 64.555790 0.0 1.0 8.5 23.25 212.0
SA 12.0 62.416667 88.620189 1.0 3.0 12.0 98.50 221.0
3.3 Mean consumption of every drink type per continent (on pandas 2.x you may need numeric_only=True, since the country column is non-numeric)
drinks.groupby('continent').mean()
beer_servings spirit_servings wine_servings total_litres_of_pure_alcohol
continent
AF 61.471698 16.339623 16.264151 3.007547
AS 37.045455 60.840909 9.068182 2.170455
EU 193.777778 132.555556 142.222222 8.617778
OC 89.687500 58.437500 35.625000 3.381250
SA 175.083333 114.750000 62.416667 6.308333
3.4 Median consumption of every drink type per continent
drinks.groupby('continent').median()
beer_servings spirit_servings wine_servings total_litres_of_pure_alcohol
continent
AF 32.0 3.0 2.0 2.30
AS 17.5 16.0 1.0 1.20
EU 219.0 122.0 128.0 10.00
OC 52.5 37.0 8.5 1.75
SA 162.5 108.5 12.0 6.85
3.5 Mean, maximum and minimum spirit consumption per continent
drinks.groupby('continent')['spirit_servings'].agg(['mean','max','min'])
mean max min
continent
AF 16.339623 152 0
AS 60.840909 326 0
EU 132.555556 373 0
OC 58.437500 254 0
SA 114.750000 302 25
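  • Named aggregation (available since pandas 0.25) is a readable sketch of the same computation:
drinks.groupby('continent')['spirit_servings'].agg(
    spirit_mean='mean', spirit_max='max', spirit_min='min')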

4. The apply Function (US Crime Data, 1960-2014)

4.1 Import the data and take a first look
import pandas as pd

crime = pd.read_csv('US_Crime_Rates_1960_2014.csv')

crime.head()
Year Population Total Violent Property Murder Forcible_Rape Robbery Aggravated_assault Burglary Larceny_Theft Vehicle_Theft
0 1960 179323175 3384200 288460 3095700 9110 17190 107840 154320 912100 1855400 328200
1 1961 182992000 3488000 289390 3198600 8740 17220 106670 156760 949600 1913000 336000
2 1962 185771000 3752200 301510 3450700 8530 17550 110860 164570 994300 2089600 366800
3 1963 188483000 4109500 316970 3792500 8640 17650 116470 174210 1086400 2297800 408300
4 1964 191141000 4564600 364220 4200400 9360 21420 130390 203050 1213200 2514400 472800
crime.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 55 entries, 0 to 54
Data columns (total 12 columns):
 #   Column              Non-Null Count  Dtype
---  ------              --------------  -----
 0   Year                55 non-null     int64
 1   Population          55 non-null     int64
 2   Total               55 non-null     int64
 3   Violent             55 non-null     int64
 4   Property            55 non-null     int64
 5   Murder              55 non-null     int64
 6   Forcible_Rape       55 non-null     int64
 7   Robbery             55 non-null     int64
 8   Aggravated_assault  55 non-null     int64
 9   Burglary            55 non-null     int64
 10  Larceny_Theft       55 non-null     int64
 11  Vehicle_Theft       55 non-null     int64
dtypes: int64(12)
memory usage: 5.3 KB
4.2 Convert Year to datetime
crime['Year'] = pd.to_datetime(crime['Year'], format='%Y')
crime.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 55 entries, 0 to 54
Data columns (total 12 columns):
 #   Column              Non-Null Count  Dtype         
---  ------              --------------  -----         
 0   Year                55 non-null     datetime64[ns]
 1   Population          55 non-null     int64         
 2   Total               55 non-null     int64         
 3   Violent             55 non-null     int64         
 4   Property            55 non-null     int64         
 5   Murder              55 non-null     int64         
 6   Forcible_Rape       55 non-null     int64         
 7   Robbery             55 non-null     int64         
 8   Aggravated_assault  55 non-null     int64         
 9   Burglary            55 non-null     int64         
 10  Larceny_Theft       55 non-null     int64         
 11  Vehicle_Theft       55 non-null     int64         
dtypes: datetime64[ns](1), int64(11)
memory usage: 5.3 KB
4.3 Set Year as the index
# no need for drop=True here; the old RangeIndex is simply discarded and does not become a new column
crime.set_index('Year', inplace=True)
crime.head()
Population Total Violent Property Murder Forcible_Rape Robbery Aggravated_assault Burglary Larceny_Theft Vehicle_Theft
Year
1960-01-01 179323175 3384200 288460 3095700 9110 17190 107840 154320 912100 1855400 328200
1961-01-01 182992000 3488000 289390 3198600 8740 17220 106670 156760 949600 1913000 336000
1962-01-01 185771000 3752200 301510 3450700 8530 17550 110860 164570 994300 2089600 366800
1963-01-01 188483000 4109500 316970 3792500 8640 17650 116470 174210 1086400 2297800 408300
1964-01-01 191141000 4564600 364220 4200400 9360 21420 130390 203050 1213200 2514400 472800
4.4 Delete the column named Total
del crime['Total']
4.5 Resample the data by decade (resample is very important)
  • resample is pandas' convenient method for resampling regular time-series data and converting between frequencies.
  • The argument is the resampling rule. Common frequency aliases: A year, M month, W week, D day, H hour, T minute, S second.
  • Here the records are aggregated in ten-year bins.
crime_10Ys = crime.resample('10As').sum()
crime_10Ys
Population Violent Property Murder Forcible_Rape Robbery Aggravated_assault Burglary Larceny_Theft Vehicle_Theft
Year
1960-01-01 1915053175 4134930 45160900 106180 236720 1633510 2158520 13321100 26547700 5292100
1970-01-01 2121193298 9607930 91383800 192230 554570 4159020 4702120 28486000 53157800 9739900
1980-01-01 2371370069 14074328 117048900 206439 865639 5383109 7619130 33073494 72040253 11935411
1990-01-01 2612825258 17527048 119053499 211664 998827 5748930 10568963 26750015 77679366 14624418
2000-01-01 2947969117 13968056 100944369 163068 922499 4230366 8652124 21565176 67970291 11412834
2010-01-01 1570146307 6072017 44095950 72867 421059 1749809 3764142 10125170 30401698 3569080
  • Replace the resampled Population values with the maximum in each decade: summing population over ten years is meaningless,
  • so use the largest value within each decade instead.
crime_10Ys['Population'] = crime['Population'].resample('10As').max()
crime_10Ys.head()
Population Violent Property Murder Forcible_Rape Robbery Aggravated_assault Burglary Larceny_Theft Vehicle_Theft
Year
1960-01-01 201385000 4134930 45160900 106180 236720 1633510 2158520 13321100 26547700 5292100
1970-01-01 220099000 9607930 91383800 192230 554570 4159020 4702120 28486000 53157800 9739900
1980-01-01 248239000 14074328 117048900 206439 865639 5383109 7619130 33073494 72040253 11935411
1990-01-01 272690813 17527048 119053499 211664 998827 5748930 10568963 26750015 77679366 14624418
2000-01-01 307006550 13968056 100944369 163068 922499 4230366 8652124 21565176 67970291 11412834
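  • The two steps above can also be collapsed into one agg() call on the resampler (a sketch on the same crime frame):
rules = {col: 'sum' for col in crime.columns}   # sum every column by default
rules['Population'] = 'max'                     # ...but take the decade maximum for Population
crime_10Ys = crime.resample('10AS').agg(rules)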
4.6 When was the most dangerous time to live in the US?
  • idxmax() returns the index label of the maximum value
crime
Population Violent Property Murder Forcible_Rape Robbery Aggravated_assault Burglary Larceny_Theft Vehicle_Theft
Year
1960-01-01 179323175 288460 3095700 9110 17190 107840 154320 912100 1855400 328200
1961-01-01 182992000 289390 3198600 8740 17220 106670 156760 949600 1913000 336000
1962-01-01 185771000 301510 3450700 8530 17550 110860 164570 994300 2089600 366800
1963-01-01 188483000 316970 3792500 8640 17650 116470 174210 1086400 2297800 408300
1964-01-01 191141000 364220 4200400 9360 21420 130390 203050 1213200 2514400 472800
1965-01-01 193526000 387390 4352000 9960 23410 138690 215330 1282500 2572600 496900
1966-01-01 195576000 430180 4793300 11040 25820 157990 235330 1410100 2822000 561200
1967-01-01 197457000 499930 5403500 12240 27620 202910 257160 1632100 3111600 659800
1968-01-01 199399000 595010 6125200 13800 31670 262840 286700 1858900 3482700 783600
1969-01-01 201385000 661870 6749000 14760 37170 298850 311090 1981900 3888600 878500
1970-01-01 203235298 738820 7359200 16000 37990 349860 334970 2205000 4225800 928400
1971-01-01 206212000 816500 7771700 17780 42260 387700 368760 2399300 4424200 948200
1972-01-01 208230000 834900 7413900 18670 46850 376290 393090 2375500 4151200 887200
1973-01-01 209851000 875910 7842200 19640 51400 384220 420650 2565500 4347900 928800
1974-01-01 211392000 974720 9278700 20710 55400 442400 456210 3039200 5262500 977100
1975-01-01 213124000 1039710 10252700 20510 56090 470500 492620 3265300 5977700 1009600
1976-01-01 214659000 1004210 10345500 18780 57080 427810 500530 3108700 6270800 966000
1977-01-01 216332000 1029580 9955000 19120 63500 412610 534350 3071500 5905700 977700
1978-01-01 218059000 1085550 10123400 19560 67610 426930 571460 3128300 5991000 1004100
1979-01-01 220099000 1208030 11041500 21460 76390 480700 629480 3327700 6601000 1112800
1980-01-01 225349264 1344520 12063700 23040 82990 565840 672650 3795200 7136900 1131700
1981-01-01 229146000 1361820 12061900 22520 82500 592910 663900 3779700 7194400 1087800
1982-01-01 231534000 1322390 11652000 21010 78770 553130 669480 3447100 7142500 1062400
1983-01-01 233981000 1258090 10850500 19310 78920 506570 653290 3129900 6712800 1007900
1984-01-01 236158000 1273280 10608500 18690 84230 485010 685350 2984400 6591900 1032200
1985-01-01 238740000 1328800 11102600 18980 88670 497870 723250 3073300 6926400 1102900
1986-01-01 240132887 1489169 11722700 20613 91459 542775 834322 3241410 7257153 1224137
1987-01-01 242282918 1483999 12024700 20096 91110 517704 855088 3236184 7499900 1288674
1988-01-01 245807000 1566220 12356900 20680 92490 542970 910090 3218100 7705900 1432900
1989-01-01 248239000 1646040 12605400 21500 94500 578330 951710 3168200 7872400 1564800
1990-01-01 248709873 1820130 12655500 23440 102560 639270 1054860 3073900 7945700 1635900
1991-01-01 252177000 1911770 12961100 24700 106590 687730 1092740 3157200 8142200 1661700
1992-01-01 255082000 1932270 12505900 23760 109060 672480 1126970 2979900 7915200 1610800
1993-01-01 257908000 1926020 12218800 24530 106010 659870 1135610 2834800 7820900 1563100
1994-01-01 260341000 1857670 12131900 23330 102220 618950 1113180 2712800 7879800 1539300
1995-01-01 262755000 1798790 12063900 21610 97470 580510 1099210 2593800 7997700 1472400
1996-01-01 265228572 1688540 11805300 19650 96250 535590 1037050 2506400 7904700 1394200
1997-01-01 267637000 1634770 11558175 18208 96153 498534 1023201 2460526 7743760 1354189
1998-01-01 270296000 1531044 10944590 16914 93103 446625 974402 2329950 7373886 1240754
1999-01-01 272690813 1426044 10208334 15522 89411 409371 911740 2100739 6955520 1152075
2000-01-01 281421906 1425486 10182586 15586 90178 408016 911706 2050992 6971590 1160002
2001-01-01 285317559 1439480 10437480 16037 90863 423557 909023 2116531 7092267 1228391
2002-01-01 287973924 1423677 10455277 16229 95235 420806 891407 2151252 7057370 1246646
2003-01-01 290690788 1383676 10442862 16528 93883 414235 859030 2154834 7026802 1261226
2004-01-01 293656842 1360088 10319386 16148 95089 401470 847381 2144446 6937089 1237851
2005-01-01 296507061 1390745 10174754 16740 94347 417438 862220 2155448 6783447 1235859
2006-01-01 299398484 1418043 9983568 17030 92757 447403 860853 2183746 6607013 1192809
2007-01-01 301621157 1408337 9843481 16929 90427 445125 855856 2176140 6568572 1095769
2008-01-01 304374846 1392628 9767915 16442 90479 443574 842134 2228474 6588046 958629
2009-01-01 307006550 1325896 9337060 15399 89241 408742 812514 2203313 6338095 795652
2010-01-01 309330219 1251248 9112625 14772 85593 369089 781844 2168457 6204601 739565
2011-01-01 311587816 1206031 9052743 14661 84175 354772 752423 2185140 6151095 716508
2012-01-01 313873685 1217067 9001992 14866 85141 355051 762009 2109932 6168874 723186
2013-01-01 316497531 1199684 8650761 14319 82109 345095 726575 1931835 6018632 700294
2014-01-01 318857056 1197987 8277829 14249 84041 325802 741291 1729806 5858496 689527
crime.idxmax()
Population           2014-01-01
Violent              1992-01-01
Property             1991-01-01
Murder               1991-01-01
Forcible_Rape        1992-01-01
Robbery              1991-01-01
Aggravated_assault   1993-01-01
Burglary             1980-01-01
Larceny_Theft        1991-01-01
Vehicle_Theft        1991-01-01
dtype: datetime64[ns]
  • The maxima of most crime categories cluster in 1991-1993, while the population grows steadily across the dataset (from about 180 million in 1960 to about 320 million in 2014), so crime rates in that period were higher than in other periods (although this alone does not prove the whole 1990s were the most dangerous decade)
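  • A quick sketch to compare rates rather than raw counts, dividing by population (per 100,000 people):
violent_rate = crime['Violent'] / crime['Population'] * 100000
violent_rate.idxmax()   # the year with the highest violent-crime rate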

5. Merging (Fictitious Names Data)

5.1 Build the sample data
import numpy as np
import pandas as pd

raw_data_1 = {
 'subject_id': ['1', '2', '3', '4', '5'],
 'first_name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'],
 'last_name': ['Anderson', 'Ackerman', 'Ali', 'Aoni', 'Atiches']}
raw_data_2 = {
 'subject_id': ['4', '5', '6', '7', '8'],
 'first_name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'],
 'last_name': ['Bonder', 'Black', 'Balwner', 'Brice', 'Btisan']}
raw_data_3 = {
 'subject_id': ['1', '2', '3', '4', '5', '7', '8', '9', '10', '11'],
 'test_id': [51, 15, 15, 61, 16, 14, 15, 1, 61, 16]}

data_1 = pd.DataFrame(raw_data_1)
data_2 = pd.DataFrame(raw_data_2)
data_3 = pd.DataFrame(raw_data_3)

print(data_1)
print(data_2)
print(data_3)
  subject_id first_name last_name
0          1       Alex  Anderson
1          2        Amy  Ackerman
2          3      Allen       Ali
3          4      Alice      Aoni
4          5     Ayoung   Atiches
  subject_id first_name last_name
0          4      Billy    Bonder
1          5      Brian     Black
2          6       Bran   Balwner
3          7      Bryce     Brice
4          8      Betty    Btisan
  subject_id  test_id
0          1       51
1          2       15
2          3       15
3          4       61
4          5       16
5          7       14
6          8       15
7          9        1
8         10       61
9         11       16
5.2 Concatenate top to bottom (stacking rows), axis=0
all_data = pd.concat([data_1,data_2], ignore_index=True)
all_data
subject_id first_name last_name
0 1 Alex Anderson
1 2 Amy Ackerman
2 3 Allen Ali
3 4 Alice Aoni
4 5 Ayoung Atiches
5 4 Billy Bonder
6 5 Brian Black
7 6 Bran Balwner
8 7 Bryce Brice
9 8 Betty Btisan
5.3 Concatenate side by side, axis=1
by_col = pd.concat([data_1,data_2], axis=1)
by_col
subject_id first_name last_name subject_id first_name last_name
0 1 Alex Anderson 4 Billy Bonder
1 2 Amy Ackerman 5 Brian Black
2 3 Allen Ali 6 Bran Balwner
3 4 Alice Aoni 7 Bryce Brice
4 5 Ayoung Atiches 8 Betty Btisan
5.4 Merge all_data with data_3 on subject_id
all_data = pd.merge(all_data,data_3,on='subject_id')
all_data
subject_id first_name last_name test_id
0 1 Alex Anderson 51
1 2 Amy Ackerman 15
2 3 Allen Ali 15
3 4 Alice Aoni 61
4 4 Billy Bonder 61
5 5 Ayoung Atiches 16
6 5 Brian Black 16
7 7 Bryce Brice 14
8 8 Betty Btisan 15
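  • merge() defaults to an inner join, so subject_id 6 (only in all_data) and 9-11 (only in data_3) are dropped above; an outer-join sketch keeps them, filling the missing side with NaN:
pd.concat([data_1, data_2], ignore_index=True).merge(data_3, on='subject_id', how='outer')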
5.5 Inner join of data_1 and data_2 on subject_id
inner_join = pd.merge(data_1,data_2, on='subject_id', how='inner')
inner_join
subject_id first_name_x last_name_x first_name_y last_name_y
0 4 Alice Aoni Billy Bonder
1 5 Ayoung Atiches Brian Black
5.6 Outer join of data_1 and data_2 on subject_id
outer_join = pd.merge(data_1,data_2, on='subject_id', how='outer')
outer_join
subject_id first_name_x last_name_x first_name_y last_name_y
0 1 Alex Anderson NaN NaN
1 2 Amy Ackerman NaN NaN
2 3 Allen Ali NaN NaN
3 4 Alice Aoni Billy Bonder
4 5 Ayoung Atiches Brian Black
5 6 NaN NaN Bran Balwner
6 7 NaN NaN Bryce Brice
7 8 NaN NaN Betty Btisan

6. Statistics: Exploring Wind Speed Data

6.1 Import the data
import pandas as pd
import datetime

wind = pd.read_table('wind.data', sep='\s+', parse_dates=[[0,1,2]])
wind.head()
Yr_Mo_Dy RPT VAL ROS KIL SHA BIR DUB CLA MUL CLO BEL MAL
0 2061-01-01 15.04 14.96 13.17 9.29 NaN 9.87 13.67 10.25 10.83 12.58 18.50 15.04
1 2061-01-02 14.71 NaN 10.83 6.50 12.62 7.67 11.50 10.04 9.79 9.67 17.54 13.83
2 2061-01-03 18.50 16.88 12.33 10.13 11.17 6.17 11.25 NaN 8.50 7.67 12.75 12.71
3 2061-01-04 10.58 6.63 11.75 4.58 4.54 2.88 8.63 1.79 5.83 5.88 5.46 10.88
4 2061-01-05 13.33 13.25 11.42 6.17 10.71 8.21 11.92 6.54 10.92 10.34 12.92 11.83
wind.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6574 entries, 0 to 6573
Data columns (total 13 columns):
 #   Column    Non-Null Count  Dtype         
---  ------    --------------  -----         
 0   Yr_Mo_Dy  6574 non-null   datetime64[ns]
 1   RPT       6568 non-null   float64       
 2   VAL       6571 non-null   float64       
 3   ROS       6572 non-null   float64       
 4   KIL       6569 non-null   float64       
 5   SHA       6572 non-null   float64       
 6   BIR       6574 non-null   float64       
 7   DUB       6571 non-null   float64       
 8   CLA       6572 non-null   float64       
 9   MUL       6571 non-null   float64       
 10  CLO       6573 non-null   float64       
 11  BEL       6574 non-null   float64       
 12  MAL       6570 non-null   float64       
dtypes: datetime64[ns](1), float64(12)
memory usage: 667.8 KB
6.2 Years such as 2061 are bad data; try to fix them
  • The original answer omits the outer pd.to_datetime() call, in which case wind['Yr_Mo_Dy'] would no longer have a datetime dtype after applying the function
    • For a single element you can use x.year directly
def fix_year(x):
    year=x.year-100 if x.year>1978 else x.year
    return pd.to_datetime(datetime.date(year, x.month, x.day),format='%Y/%m/%d')

wind['Yr_Mo_Dy']  = wind['Yr_Mo_Dy'].apply(fix_year)
wind['Yr_Mo_Dy'].dt.year.unique()
array([1961, 1962, 1963, 1964, 1965, 1966, 1967, 1968, 1969, 1970, 1971,
       1972, 1973, 1974, 1975, 1976, 1977, 1978], dtype=int64)
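  • A vectorized sketch of the same fix (run instead of the apply above), shifting the bad years back by a century:
mask = wind['Yr_Mo_Dy'].dt.year > 1978
wind.loc[mask, 'Yr_Mo_Dy'] = wind.loc[mask, 'Yr_Mo_Dy'] - pd.DateOffset(years=100)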
6.3 Set the date as the index; note the dtype should be datetime64
wind.set_index('Yr_Mo_Dy', inplace=True)
wind.head()
RPT VAL ROS KIL SHA BIR DUB CLA MUL CLO BEL MAL
Yr_Mo_Dy
1961-01-01 15.04 14.96 13.17 9.29 NaN 9.87 13.67 10.25 10.83 12.58 18.50 15.04
1961-01-02 14.71 NaN 10.83 6.50 12.62 7.67 11.50 10.04 9.79 9.67 17.54 13.83
1961-01-03 18.50 16.88 12.33 10.13 11.17 6.17 11.25 NaN 8.50 7.67 12.75 12.71
1961-01-04 10.58 6.63 11.75 4.58 4.54 2.88 8.63 1.79 5.83 5.88 5.46 10.88
1961-01-05 13.33 13.25 11.42 6.17 10.71 8.21 11.92 6.54 10.92 10.34 12.92 11.83
6.4 How many values are missing at each location?
wind.isna().sum()
RPT    6
VAL    3
ROS    2
KIL    5
SHA    2
BIR    0
DUB    3
CLA    2
MUL    3
CLO    1
BEL    0
MAL    4
dtype: int64
6.5 How many non-missing values are there at each location?
  • Both notna() and info() report the non-null counts
wind.notna().sum()
RPT    6568
VAL    6571
ROS    6572
KIL    6569
SHA    6572
BIR    6574
DUB    6571
CLA    6572
MUL    6571
CLO    6573
BEL    6574
MAL    6570
dtype: int64
# Original answer
wind.shape[0] - wind.isnull().sum()
RPT    6568
VAL    6571
ROS    6572
KIL    6569
SHA    6572
BIR    6574
DUB    6571
CLA    6572
MUL    6571
CLO    6573
BEL    6574
MAL    6570
dtype: int64
6.6 Mean wind speed over the whole dataset
wind.mean().mean()
10.227982360836938
6.7 Create a DataFrame storing the minimum, maximum, mean and standard deviation of the wind speed at each location
  • describe() aggregates column-wise, i.e. per location
wind_des = wind.describe().T.iloc[:,[3,7,1,2]]
wind_des
min max mean std
RPT 0.67 35.80 12.362987 5.618413
VAL 0.21 33.37 10.644314 5.267356
ROS 1.50 33.84 11.660526 5.008450
KIL 0.00 28.46 6.306468 3.605811
SHA 0.13 37.54 10.455834 4.936125
BIR 0.00 26.16 7.092254 3.968683
DUB 0.00 30.37 9.797343 4.977555
CLA 0.00 31.08 8.495053 4.499449
MUL 0.00 25.88 8.493590 4.166872
CLO 0.04 28.21 8.707332 4.503954
BEL 0.13 42.38 13.121007 5.835037
MAL 0.67 42.54 15.599079 6.699794
# step-by-step variant; note that axis=1 aggregates across locations for each day (per-day stats), whereas the question above asks for per-location stats (axis=0, the default)
day_stats = pd.DataFrame()
day_stats['min'] = wind.min(axis=1)
day_stats['max'] = wind.max(axis=1)
day_stats['mean'] = wind.mean(axis=1)
day_stats['std'] = wind.std(axis=1)
day_stats.head()
6.8 For each location, compute the mean wind speed in January
  • Note: the original author treats January of every year (1961, 1962, ...) together as "January"
  • query() is another option here; see the sketch at the end of this sub-section
wind['date'] = wind.index
wind.head()
RPT VAL ROS KIL SHA BIR DUB CLA MUL CLO BEL MAL date
Yr_Mo_Dy
1961-01-01 15.04 14.96 13.17 9.29 NaN 9.87 13.67 10.25 10.83 12.58 18.50 15.04 1961-01-01
1961-01-02 14.71 NaN 10.83 6.50 12.62 7.67 11.50 10.04 9.79 9.67 17.54 13.83 1961-01-02
1961-01-03 18.50 16.88 12.33 10.13 11.17 6.17 11.25 NaN 8.50 7.67 12.75 12.71 1961-01-03
1961-01-04 10.58 6.63 11.75 4.58 4.54 2.88 8.63 1.79 5.83 5.88 5.46 10.88 1961-01-04
1961-01-05 13.33 13.25 11.42 6.17 10.71 8.21 11.92 6.54 10.92 10.34 12.92 11.83 1961-01-05
wind['year']= wind['date'].dt.year
wind['month']= wind['date'].dt.month
wind['day']= wind['date'].dt.day
wind.head()
RPT VAL ROS KIL SHA BIR DUB CLA MUL CLO BEL MAL date year month day
Yr_Mo_Dy
1961-01-01 15.04 14.96 13.17 9.29 NaN 9.87 13.67 10.25 10.83 12.58 18.50 15.04 1961-01-01 1961 1 1
1961-01-02 14.71 NaN 10.83 6.50 12.62 7.67 11.50 10.04 9.79 9.67 17.54 13.83 1961-01-02 1961 1 2
1961-01-03 18.50 16.88 12.33 10.13 11.17 6.17 11.25 NaN 8.50 7.67 12.75 12.71 1961-01-03 1961 1 3
1961-01-04 10.58 6.63 11.75 4.58 4.54 2.88 8.63 1.79 5.83 5.88 5.46 10.88 1961-01-04 1961 1 4
1961-01-05 13.33 13.25 11.42 6.17 10.71 8.21 11.92 6.54 10.92 10.34 12.92 11.83 1961-01-05 1961 1 5
# Original answer
wind['year'] = wind['date'].apply(lambda x: x.year)
wind['month'] = wind['date'].apply(lambda x: x.month)
wind['day'] = wind['date'].apply(lambda x: x.day)
wind[wind['month']==1].iloc[:,0:12].mean()
RPT    14.847325
VAL    12.914560
ROS    13.299624
KIL     7.199498
SHA    11.667734
BIR     8.054839
DUB    11.819355
CLA     9.512047
MUL     9.543208
CLO    10.053566
BEL    14.550520
MAL    18.028763
dtype: float64
  • Alternative: locate the rows through the DatetimeIndex with .loc
wind.loc[wind.index.month==1].mean()
RPT    14.847325
VAL    12.914560
ROS    13.299624
KIL     7.199498
SHA    11.667734
BIR     8.054839
DUB    11.819355
CLA     9.512047
MUL     9.543208
CLO    10.053566
BEL    14.550520
MAL    18.028763
dtype: float64
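  • And the query() usage mentioned above (a sketch, assuming the month column added earlier):
wind.query('month == 1').iloc[:, 0:12].mean()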
6.9 Downsample the records to yearly frequency for each location
  • A DatetimeIndex is a collection of individual timestamps
  • A PeriodIndex is a DatetimeIndex with a fixed frequency, obtained via DatetimeIndex.to_period(freq)
wind.groupby(wind.index.to_period('A')).mean()
RPT VAL ROS KIL SHA BIR DUB CLA MUL CLO BEL MAL
Yr_Mo_Dy
1961 12.299583 10.351796 11.362369 6.958227 10.881763 7.729726 9.733923 8.858788 8.647652 9.835577 13.502795 13.680773
1962 12.246923 10.110438 11.732712 6.960440 10.657918 7.393068 11.020712 8.793753 8.316822 9.676247 12.930685 14.323956
1963 12.813452 10.836986 12.541151 7.330055 11.724110 8.434712 11.075699 10.336548 8.903589 10.224438 13.638877 14.999014
1964 12.363661 10.920164 12.104372 6.787787 11.454481 7.570874 10.259153 9.467350 7.789016 10.207951 13.740546 14.910301
1965 12.451370 11.075534 11.848767 6.858466 11.024795 7.478110 10.618712 8.879918 7.907425 9.918082 12.964247 15.591644
1966 13.461973 11.557205 12.020630 7.345726 11.805041 7.793671 10.579808 8.835096 8.514438 9.768959 14.265836 16.307260
1967 12.737151 10.990986 11.739397 7.143425 11.630740 7.368164 10.652027 9.325616 8.645014 9.547425 14.774548 17.135945
1968 11.835628 10.468197 11.409754 6.477678 10.760765 6.067322 8.859180 8.255519 7.224945 7.832978 12.808634 15.017486
1969 11.166356 9.723699 10.902000 5.767973 9.873918 6.189973 8.564493 7.711397 7.924521 7.754384 12.621233 15.762904
1970 12.600329 10.726932 11.730247 6.217178 10.567370 7.609452 9.609890 8.334630 9.297616 8.289808 13.183644 16.456027
1971 11.273123 9.095178 11.088329 5.241507 9.440329 6.097151 8.385890 6.757315 7.915370 7.229753 12.208932 15.025233
1972 12.463962 10.561311 12.058333 5.929699 9.430410 6.358825 9.704508 7.680792 8.357295 7.515273 12.727377 15.028716
1973 11.828466 10.680493 10.680493 5.547863 9.640877 6.548740 8.482110 7.614274 8.245534 7.812411 12.169699 15.441096
1974 13.643096 11.811781 12.336356 6.427041 11.110986 6.809781 10.084603 9.896986 9.331753 8.736356 13.252959 16.947671
1975 12.008575 10.293836 11.564712 5.269096 9.190082 5.668521 8.562603 7.843836 8.797945 7.382822 12.631671 15.307863
1976 11.737842 10.203115 10.761230 5.109426 8.846339 6.311038 9.149126 7.146202 8.883716 7.883087 12.332377 15.471448
1977 13.099616 11.144493 12.627836 6.073945 10.003836 8.586438 11.523205 8.378384 9.098192 8.821616 13.459068 16.590849
1978 12.504356 11.044274 11.380000 6.082356 10.167233 7.650658 9.489342 8.800466 9.089753 8.301699 12.967397 16.771370
6.10 Downsample the records to monthly frequency for each location
wind.groupby(wind.index.to_period('M')).mean()
RPT VAL ROS KIL SHA BIR DUB CLA MUL CLO BEL MAL
Yr_Mo_Dy
1961-01 14.841333 11.988333 13.431613 7.736774 11.072759 8.588065 11.184839 9.245333 9.085806 10.107419 13.880968 14.703226
1961-02 16.269286 14.975357 14.441481 9.230741 13.852143 10.937500 11.890714 11.846071 11.821429 12.714286 18.583214 15.411786
1961-03 10.890000 11.296452 10.752903 7.284000 10.509355 8.866774 9.644194 9.829677 10.294138 11.251935 16.410968 15.720000
1961-04 10.722667 9.427667 9.998000 5.830667 8.435000 6.495000 6.925333 7.094667 7.342333 7.237000 11.147333 10.278333
1961-05 9.860968 8.850000 10.818065 5.905333 9.490323 6.574839 7.604000 8.177097 8.039355 8.499355 11.900323 12.011613
... ... ... ... ... ... ... ... ... ... ... ... ...
1978-08 9.645161 8.259355 9.032258 4.502903 7.368065 5.935161 5.650323 5.417742 7.241290 5.536774 10.466774 12.054194
1978-09 10.913667 10.895000 10.635000 5.725000 10.372000 9.278333 10.790333 9.583000 10.069333 8.939000 15.680333 19.391333
1978-10 9.897742 8.670968 9.295806 4.721290 8.525161 6.774194 8.115484 7.337742 8.297742 8.243871 13.776774 17.150000
1978-11 16.151667 14.802667 13.508000 7.317333 11.475000 8.743000 11.492333 9.657333 10.701333 10.676000 17.404667 20.723000
1978-12 16.175484 13.748065 15.635161 7.094839 11.398710 9.241613 12.077419 10.194839 10.616774 11.028710 13.859677 21.371613

216 rows × 12 columns

6.11 Downsample the records to weekly frequency for each location
wind.groupby(wind.index.to_period('W')).mean()
RPT VAL ROS KIL SHA BIR DUB CLA MUL CLO BEL MAL
Yr_Mo_Dy
1960-12-26/1961-01-01 15.040000 14.960000 13.170000 9.290000 NaN 9.870000 13.670000 10.250000 10.830000 12.580000 18.500000 15.040000
1961-01-02/1961-01-08 13.541429 11.486667 10.487143 6.417143 9.474286 6.435714 11.061429 6.616667 8.434286 8.497143 12.481429 13.238571
1961-01-09/1961-01-15 12.468571 8.967143 11.958571 4.630000 7.351429 5.072857 7.535714 6.820000 5.712857 7.571429 11.125714 11.024286
1961-01-16/1961-01-22 13.204286 9.862857 12.982857 6.328571 8.966667 7.417143 9.257143 7.875714 7.145714 8.124286 9.821429 11.434286
1961-01-23/1961-01-29 19.880000 16.141429 18.225714 12.720000 17.432857 14.828571 15.528571 15.160000 14.480000 15.640000 20.930000 22.530000
... ... ... ... ... ... ... ... ... ... ... ... ...
1978-11-27/1978-12-03 14.934286 11.232857 13.941429 5.565714 10.215714 8.618571 9.642857 7.685714 9.011429 9.547143 11.835714 18.728571
1978-12-04/1978-12-10 20.740000 19.190000 17.034286 9.777143 15.287143 12.774286 14.437143 12.488571 13.870000 14.082857 18.517143 23.061429
1978-12-11/1978-12-17 16.758571 14.692857 14.987143 6.917143 11.397143 7.272857 10.208571 7.967143 9.168571 8.565714 11.102857 15.562857
1978-12-18/1978-12-24 11.155714 8.008571 13.172857 4.004286 7.825714 6.290000 7.798571 8.667143 7.151429 8.072857 11.845714 18.977143
1978-12-25/1978-12-31 14.951429 11.801429 16.035714 6.507143 9.660000 8.620000 13.708571 10.477143 10.868571 11.471429 12.947143 26.844286

940 rows × 12 columns

6.12 Compute the minimum, maximum, mean and standard deviation of the wind speed across all locations for each of the first 52 weeks (assume the first week starts on 2 January 1961)
# resample to weekly frequency and aggregate
weekly = wind.resample('W').agg(['min','max','mean','std'])
weekly
RPT VAL ROS ... CLO BEL MAL
min max mean std min max mean std min max ... mean std min max mean std min max mean std
Yr_Mo_Dy
1961-01-01 15.04 15.04 15.040000 NaN 14.96 14.96 14.960000 NaN 13.17 13.17 ... 12.580000 NaN 18.50 18.50 18.500000 NaN 15.04 15.04 15.040000 NaN
1961-01-08 10.58 18.50 13.541429 2.631321 6.63 16.88 11.486667 3.949525 7.62 12.33 ... 8.497143 1.704941 5.46 17.54 12.481429 4.349139 10.88 16.46 13.238571 1.773062
1961-01-15 9.04 19.75 12.468571 3.555392 3.54 12.08 8.967143 3.148945 7.08 19.50 ... 7.571429 4.084293 5.25 20.71 11.125714 5.552215 5.17 16.92 11.024286 4.692355
1961-01-22 4.92 19.83 13.204286 5.337402 3.42 14.37 9.862857 3.837785 7.29 20.79 ... 8.124286 4.783952 6.50 15.92 9.821429 3.626584 6.79 17.96 11.434286 4.237239
1961-01-29 13.62 25.04 19.880000 4.619061 9.96 23.91 16.141429 5.170224 12.67 25.84 ... 15.640000 3.713368 14.04 27.71 20.930000 5.210726 17.50 27.63 22.530000 3.874721
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1978-12-03 9.08 21.29 14.934286 4.931754 4.54 21.34 11.232857 5.978968 8.21 24.04 ... 9.547143 6.284973 4.92 21.42 11.835714 5.950112 11.50 25.75 18.728571 6.393188
1978-12-10 9.92 29.33 20.740000 7.215012 12.54 24.79 19.190000 4.953060 7.21 25.37 ... 14.082857 5.516405 9.54 26.08 18.517143 5.600389 15.34 34.59 23.061429 8.093976
1978-12-17 9.87 23.13 16.758571 4.499431 3.21 24.04 14.692857 7.578665 8.04 18.05 ... 8.565714 5.487801 5.00 21.50 11.102857 6.631925 6.92 22.83 15.562857 6.005594
1978-12-24 6.21 16.62 11.155714 3.522759 3.63 13.29 8.008571 3.882900 8.50 22.21 ... 8.072857 3.023131 3.21 19.79 11.845714 5.750301 10.29 31.71 18.977143 7.194108
1978-12-31 7.21 20.33 14.951429 4.350400 5.46 17.41 11.801429 4.705392 7.83 27.29 ... 11.471429 5.533397 1.21 21.79 12.947143 7.523148 11.96 41.46 26.844286 11.621233

940 rows × 48 columns

weekly.loc[weekly.index[1:53],:].head()
RPT VAL ROS ... CLO BEL MAL
min max mean std min max mean std min max ... mean std min max mean std min max mean std
Yr_Mo_Dy
1961-01-08 10.58 18.50 13.541429 2.631321 6.63 16.88 11.486667 3.949525 7.62 12.33 ... 8.497143 1.704941 5.46 17.54 12.481429 4.349139 10.88 16.46 13.238571 1.773062
1961-01-15 9.04 19.75 12.468571 3.555392 3.54 12.08 8.967143 3.148945 7.08 19.50 ... 7.571429 4.084293 5.25 20.71 11.125714 5.552215 5.17 16.92 11.024286 4.692355
1961-01-22 4.92 19.83 13.204286 5.337402 3.42 14.37 9.862857 3.837785 7.29 20.79 ... 8.124286 4.783952 6.50 15.92 9.821429 3.626584 6.79 17.96 11.434286 4.237239
1961-01-29 13.62 25.04 19.880000 4.619061 9.96 23.91 16.141429 5.170224 12.67 25.84 ... 15.640000 3.713368 14.04 27.71 20.930000 5.210726 17.50 27.63 22.530000 3.874721
1961-02-05 10.58 24.21 16.827143 5.251408 9.46 24.21 15.460000 5.187395 9.04 19.70 ... 9.460000 2.839501 9.17 19.33 14.012857 4.210858 7.17 19.25 11.935714 4.336104

5 rows × 48 columns
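  • An equivalent positional slice of the same resampled frame (a sketch): skip the partial week ending 1961-01-01 and keep the next 52
first_52_weeks = weekly.iloc[1:53]
first_52_weeks.shape   # (52, 48): 52 weeks x (12 locations x 4 statistics)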

7. Exploring the Titanic Disaster Data (Visualization)

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

plt.rcParams["font.sans-serif"]=['SimHei']  #正常显示中文
plt.rcParams["axes.unicode_minus"]=False   # 正常显示负号

%matplotlib inline

import warnings
warnings.filterwarnings('ignore')
7.1 Import the data
titanic = pd.read_csv('train.csv')
titanic.head()
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S
titanic.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
7.2 Draw a pie chart showing the proportion of passengers by sex
males = (titanic['Sex']=='male').sum()
females = (titanic['Sex']=='female').sum()

proportions = [males, females]
#plt.figure(figsize=(8,8), dpi=100)
#plt.Subplot(111,facecolor='#F0F0F0')

plt.pie(
    proportions,
    labels = ['Males', 'Females'],
    shadow = True,
    colors = ['#E76278','#FAC699'],
    explode = (0.15,0),
    startangle = 90,
    autopct = '%1.1f%%'
    )

plt.axis('equal')
plt.title('Sex Proportion')
plt.tight_layout()
plt.show()


[figure: pie chart of passenger sex proportions]
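  • A shorter sketch of essentially the same chart via value_counts():
titanic['Sex'].value_counts().plot.pie(autopct='%1.1f%%', colors=['#E76278', '#FAC699'])
plt.ylabel('')
plt.show()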

7.3 Draw a scatter plot of ticket fare against passenger age, colored by sex
lm = sns.lmplot(x='Age', y='Fare', data=titanic, hue='Sex', fit_reg=False)
lm.set(title='Fare_Age Scatter')

plt.tight_layout()
plt.show()

[figure: Fare vs. Age scatter plot]

7.4 How many passengers survived?
titanic['Survived'].sum()
342
7.5 Draw a histogram of ticket fares
plt.figure(figsize=(10,4), dpi=100)
plt.subplot(111, facecolor='#F0F0F0')

plt.title('Fare Distribution')
plt.xlabel('Fare')
plt.ylabel('Passenger Count')

plt.hist(titanic['Fare'], bins=50, color='#447C69')
plt.xticks(np.arange(0,300,20), fontsize=10)
plt.yticks(fontsize=10)

plt.tight_layout()

[figure: histogram of ticket fares]

8. Exploring Pokemon Data

8.1 Set up the raw data
import pandas as pd
raw_data = {"name": ['Bulbasaur', 'Charmander','Squirtle','Caterpie'],
 "evolution": ['Ivysaur','Charmeleon','Wartortle','Metapod'],
 "type": ['grass', 'fire', 'water', 'bug'],
 "hp": [45, 39, 44, 45],
 "pokedex": ['yes', 'no','yes','no']
 }
8.2 Create the DataFrame
pokemon = pd.DataFrame(raw_data)
pokemon.head()
name evolution type hp pokedex
0 Bulbasaur Ivysaur grass 45 yes
1 Charmander Charmeleon fire 39 no
2 Squirtle Wartortle water 44 yes
3 Caterpie Metapod bug 45 no
8.3 Reorder the columns
pokemon = pokemon[['name','type','hp','evolution','pokedex']]
pokemon.head()
name type hp evolution pokedex
0 Bulbasaur grass 45 Ivysaur yes
1 Charmander fire 39 Charmeleon no
2 Squirtle water 44 Wartortle yes
3 Caterpie bug 45 Metapod no
8.4 Add a place column
pokemon['place'] = ['park','street','lake','forest']
pokemon
name type hp evolution pokedex place
0 Bulbasaur grass 45 Ivysaur yes park
1 Charmander fire 39 Charmeleon no street
2 Squirtle water 44 Wartortle yes lake
3 Caterpie bug 45 Metapod no forest
8.5 Check the dtype of each column
pokemon.dtypes
name         object
type         object
hp            int64
evolution    object
pokedex      object
place        object
dtype: object

9. Exploring Apple Stock Price Data

9.1 Import the data
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
%matplotlib inline
apple = pd.read_csv('Apple_stock.csv')
apple.head()
Date Open High Low Close Volume Adj Close
0 2014-07-08 96.27 96.80 93.92 95.35 65130000 95.35
1 2014-07-07 94.14 95.99 94.10 95.97 56305400 95.97
2 2014-07-03 93.67 94.10 93.20 94.03 22891800 94.03
3 2014-07-02 93.87 94.06 93.09 93.48 28420900 93.48
4 2014-07-01 93.52 94.07 93.13 93.52 38170200 93.52
apple.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 8465 entries, 1980-12-12 to 2014-07-08
Data columns (total 6 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   Open       8465 non-null   float64
 1   High       8465 non-null   float64
 2   Low        8465 non-null   float64
 3   Close      8465 non-null   float64
 4   Volume     8465 non-null   int64  
 5   Adj Close  8465 non-null   float64
dtypes: float64(5), int64(1)
memory usage: 721.0 KB
9.2 Are there any duplicate dates?
  • Two ways to check
apple['Date'].duplicated().sum()
# no duplicate dates
0
apple['Date'].is_unique
True
# is_unique also works on an index
apple.index.is_unique
True
  • Set the Date column as the index
apple['Date'] = pd.to_datetime(apple['Date'])
apple.set_index('Date', inplace=True)
apple.head()
Open High Low Close Volume Adj Close
Date
2014-07-08 96.27 96.80 93.92 95.35 65130000 95.35
2014-07-07 94.14 95.99 94.10 95.97 56305400 95.97
2014-07-03 93.67 94.10 93.20 94.03 22891800 94.03
2014-07-02 93.87 94.06 93.09 93.48 28420900 93.48
2014-07-01 93.52 94.07 93.13 93.52 38170200 93.52
9.3 Sort the index in ascending order
apple.sort_index(ascending=True, inplace=True)
apple.head()
Open High Low Close Volume Adj Close
Date
1980-12-12 28.75 28.87 28.75 28.75 117258400 0.45
1980-12-15 27.38 27.38 27.25 27.25 43971200 0.42
1980-12-16 25.37 25.37 25.25 25.25 26432000 0.39
1980-12-17 25.87 26.00 25.87 25.87 21610400 0.40
1980-12-18 26.63 26.75 26.63 26.63 18362400 0.41
9.4 Find the last business day of each month
# 'BM' business month end frequency
apple_month = apple.resample('BM').mean()
apple_month.head()
Open High Low Close Volume Adj Close
Date
1980-12-31 30.481538 30.567692 30.443077 30.443077 2.586252e+07 0.473077
1981-01-30 31.754762 31.826667 31.654762 31.654762 7.249867e+06 0.493810
1981-02-27 26.480000 26.572105 26.407895 26.407895 4.231832e+06 0.411053
1981-03-31 24.937727 25.016818 24.836364 24.836364 7.962691e+06 0.387727
1981-04-30 27.286667 27.368095 27.227143 27.227143 6.392000e+06 0.423333
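  • resample('BM').mean() returns monthly means labelled at business-month ends; if the goal is literally the last trading day present in the data each month, a sketch:
apple.index.to_series().groupby(apple.index.to_period('M')).max().head()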
9.5 How many days separate the earliest and latest dates in the dataset?
(apple.index.max()-apple.index.min()).days
12261
9.6 How many months are there in the dataset?
len(apple.resample('BM').mean())
404
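  • An equivalent count without resampling (a sketch):
apple.index.to_period('M').nunique()   # number of distinct year-months present in the data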
9.7 Plot the Adj Close values in chronological order
plt.figure(figsize=(8,3), dpi=100)
apple['Adj Close'].plot(title='Apple Stock')
<AxesSubplot:title={'center':'Apple Stock'}, xlabel='Date'>

[figure: Apple Adj Close over time]

10. Exploring the Iris Dataset

10.1 Import the data
import pandas as pd
import numpy as np   # np.nan is used below
iris = pd.read_csv('iris.csv')
iris.head()
5.1 3.5 1.4 0.2 Iris-setosa
0 4.9 3.0 1.4 0.2 Iris-setosa
1 4.7 3.2 1.3 0.2 Iris-setosa
2 4.6 3.1 1.5 0.2 Iris-setosa
3 5.0 3.6 1.4 0.2 Iris-setosa
4 5.4 3.9 1.7 0.4 Iris-setosa
iris.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 149 entries, 0 to 148
Data columns (total 5 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   5.1          149 non-null    float64
 1   3.5          149 non-null    float64
 2   1.4          149 non-null    float64
 3   0.2          149 non-null    float64
 4   Iris-setosa  149 non-null    object 
dtypes: float64(4), object(1)
memory usage: 5.9+ KB
10.2 Add column names (note: read_csv treated the first data row as the header, which is why only 149 rows remain; header=None on import would keep all 150)
iris.columns = ['sepal_length','sepal_width', 'petal_length','petal_width', 'class']
iris.head()
sepal_length sepal_width petal_length petal_width class
0 4.9 3.0 1.4 0.2 Iris-setosa
1 4.7 3.2 1.3 0.2 Iris-setosa
2 4.6 3.1 1.5 0.2 Iris-setosa
3 5.0 3.6 1.4 0.2 Iris-setosa
4 5.4 3.9 1.7 0.4 Iris-setosa
10.3 Does the dataset have any missing values?
iris.isna().sum()
sepal_length    0
sepal_width     0
petal_length    0
petal_width     0
class           0
dtype: int64
10.4 Set rows 10 through 19 of petal_length to missing
  • Python's range() is half-open: stop=20 below, so row 20 is not included
  • a direct alternative is iris.loc[10:19, 'petal_length'] = np.nan (loc is label-based and includes 19)
for i in range(10,20):
    iris['petal_length'].at[i] = np.nan

iris.head(21)
sepal_length sepal_width petal_length petal_width class
0 4.9 3.0 1.4 0.2 Iris-setosa
1 4.7 3.2 1.3 0.2 Iris-setosa
2 4.6 3.1 1.5 0.2 Iris-setosa
3 5.0 3.6 1.4 0.2 Iris-setosa
4 5.4 3.9 1.7 0.4 Iris-setosa
5 4.6 3.4 1.4 0.3 Iris-setosa
6 5.0 3.4 1.5 0.2 Iris-setosa
7 4.4 2.9 1.4 0.2 Iris-setosa
8 4.9 3.1 1.5 0.1 Iris-setosa
9 5.4 3.7 1.5 0.2 Iris-setosa
10 4.8 3.4 NaN 0.2 Iris-setosa
11 4.8 3.0 NaN 0.1 Iris-setosa
12 4.3 3.0 NaN 0.1 Iris-setosa
13 5.8 4.0 NaN 0.2 Iris-setosa
14 5.7 4.4 NaN 0.4 Iris-setosa
15 5.4 3.9 NaN 0.4 Iris-setosa
16 5.1 3.5 NaN 0.3 Iris-setosa
17 5.7 3.8 NaN 0.3 Iris-setosa
18 5.1 3.8 NaN 0.3 Iris-setosa
19 5.4 3.4 NaN 0.2 Iris-setosa
20 5.1 3.7 1.5 0.4 Iris-setosa
# note: column position 1 is sepal_width, so this hits the wrong column; iloc[10:20, 2:3] would target petal_length
# (the output below apparently comes from a fresh reload of the data, which is why petal_length shows values again)
iris.iloc[10:20, 1:2] = np.nan
iris.head(21)
sepal_length sepal_width petal_length petal_width class
0 4.9 3.0 1.4 0.2 Iris-setosa
1 4.7 3.2 1.3 0.2 Iris-setosa
2 4.6 3.1 1.5 0.2 Iris-setosa
3 5.0 3.6 1.4 0.2 Iris-setosa
4 5.4 3.9 1.7 0.4 Iris-setosa
5 4.6 3.4 1.4 0.3 Iris-setosa
6 5.0 3.4 1.5 0.2 Iris-setosa
7 4.4 2.9 1.4 0.2 Iris-setosa
8 4.9 3.1 1.5 0.1 Iris-setosa
9 5.4 3.7 1.5 0.2 Iris-setosa
10 4.8 NaN 1.6 0.2 Iris-setosa
11 4.8 NaN 1.4 0.1 Iris-setosa
12 4.3 NaN 1.1 0.1 Iris-setosa
13 5.8 NaN 1.2 0.2 Iris-setosa
14 5.7 NaN 1.5 0.4 Iris-setosa
15 5.4 NaN 1.3 0.4 Iris-setosa
16 5.1 NaN 1.4 0.3 Iris-setosa
17 5.7 NaN 1.7 0.3 Iris-setosa
18 5.1 NaN 1.5 0.3 Iris-setosa
19 5.4 NaN 1.7 0.2 Iris-setosa
20 5.1 3.7 1.5 0.4 Iris-setosa
10.5 Replace all missing values with 1.0
iris.fillna(1, inplace=True)
iris.head(21)
sepal_length sepal_width petal_length petal_width class
0 4.9 3.0 1.4 0.2 Iris-setosa
1 4.7 3.2 1.3 0.2 Iris-setosa
2 4.6 3.1 1.5 0.2 Iris-setosa
3 5.0 3.6 1.4 0.2 Iris-setosa
4 5.4 3.9 1.7 0.4 Iris-setosa
5 4.6 3.4 1.4 0.3 Iris-setosa
6 5.0 3.4 1.5 0.2 Iris-setosa
7 4.4 2.9 1.4 0.2 Iris-setosa
8 4.9 3.1 1.5 0.1 Iris-setosa
9 5.4 3.7 1.5 0.2 Iris-setosa
10 4.8 3.4 1.0 0.2 Iris-setosa
11 4.8 3.0 1.0 0.1 Iris-setosa
12 4.3 3.0 1.0 0.1 Iris-setosa
13 5.8 4.0 1.0 0.2 Iris-setosa
14 5.7 4.4 1.0 0.4 Iris-setosa
15 5.4 3.9 1.0 0.4 Iris-setosa
16 5.1 3.5 1.0 0.3 Iris-setosa
17 5.7 3.8 1.0 0.3 Iris-setosa
18 5.1 3.8 1.0 0.3 Iris-setosa
19 5.4 3.4 1.0 0.2 Iris-setosa
20 5.1 3.7 1.5 0.4 Iris-setosa
10.6 Delete the class column
del iris['class']
iris.head()
sepal_length sepal_width petal_length petal_width
0 4.9 3.0 1.4 0.2
1 4.7 3.2 1.3 0.2
2 4.6 3.1 1.5 0.2
3 5.0 3.6 1.4 0.2
4 5.4 3.9 1.7 0.4
10.7 Set the first three rows to missing
iris.iloc[0:3,:] = np.nan
iris.head()
sepal_length sepal_width petal_length petal_width
0 NaN NaN NaN NaN
1 NaN NaN NaN NaN
2 NaN NaN NaN NaN
3 5.0 3.6 1.4 0.2
4 5.4 3.9 1.7 0.4
10.8 Drop rows with missing values (note: the code below actually calls drop_duplicates(), which only removes the duplicate all-NaN rows, which is why an all-NaN row is still visible at index 0; see the dropna() sketch after this sub-section)
iris.drop_duplicates(inplace=True)
iris.head()
sepal_length sepal_width petal_length petal_width
0 NaN NaN NaN NaN
3 5.0 3.6 1.4 0.2
4 5.4 3.9 1.7 0.4
5 4.6 3.4 1.4 0.3
6 5.0 3.4 1.5 0.2
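  • The dropna() version the heading actually asks for (a sketch; it would also remove the all-NaN row still visible at index 0):
iris.dropna(how='any', inplace=True)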
10.9 Reset the index
iris.reset_index(inplace=True, drop=True)
iris.head()
sepal_length sepal_width petal_length petal_width
0 NaN NaN NaN NaN
1 5.0 3.6 1.4 0.2
2 5.4 3.9 1.7 0.4
3 4.6 3.4 1.4 0.3
4 5.0 3.4 1.5 0.2