Python EXCEL: DataFrame Processing Example: Mapping, Dropping Columns, Renaming, and Grouping - EW帮帮网

Sample dataset

	Level 0	Level 1	Level 2	Level 3	Level 4	Level 5	Level 6	Level 7	Level 8	Level 9	Level 10
0	!&5	证券代码	所属行业名称	交易币种	证券简称	分析师评级	20240902	20240903	20240904	20240905	20240906
1	!	000007	房地产	CNY	平安	AAA	10.11	10.08	10.02	10.07	10.08
2	!	000008	机械制造	CNH	万A	AAA	6.42	6.52	6.46	6.59	6.47
3	!	000009	电力设备	HKD	*ST国	CCC	11.61	12.25	12.08	11.95	11.67
4	!	000010	建筑装饰	USD	深振	AAA	3.79	4.17	4.13	4.21	4.12
5	!	000011	房地产	CNH	全好	BBB	4.64	4.62	4.58	4.77	4.73
6	!	000012	非金属材料	HKD	沙份	AAA	1.95	1.96	1.93	1.94	1.96
7	!	000014	房地产	CNY	深A	AAA	7.38	7.72	7.88	7.83	7.72
8	!	000026	纺织服装与珠宝	CNH	华A	BB	1.7	1.69	1.67	1.69	1.68
9	!	000027	公用事业	CNY	深股	AA	7.4	7.6	7.66	7.78	7.56
10	!	000028	医疗	CNH	发A	AA	5	5.04	4.95	4.98	4.88
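
To experiment with the steps in this article without the original workbook, the table can be rebuilt in memory. The following is a minimal sketch that assumes the frame arrives with placeholder column names (Level 0 to Level 10) and the real header in row 0, as in the table above. Only the first three securities are included, the date labels are assumed to be strings, and raw_rows and df_raw are illustrative names.

```python
import pandas as pd

# The real header sits in the first data row; the column labels are placeholders.
raw_rows = [
    ["!&5", "证券代码", "所属行业名称", "交易币种", "证券简称", "分析师评级",
     "20240902", "20240903", "20240904", "20240905", "20240906"],
    ["!", "000007", "房地产", "CNY", "平安", "AAA", 10.11, 10.08, 10.02, 10.07, 10.08],
    ["!", "000008", "机械制造", "CNH", "万A", "AAA", 6.42, 6.52, 6.46, 6.59, 6.47],
    ["!", "000009", "电力设备", "HKD", "*ST国", "CCC", 11.61, 12.25, 12.08, 11.95, 11.67],
]
df_raw = pd.DataFrame(raw_rows, columns=[f"Level {i}" for i in range(11)])

print(df_raw.shape)                  # (4, 11)
print(df_raw.iloc[0, :6].tolist())   # the six text labels of the real header
```

Security codes are kept as strings so that leading zeros such as 000007 survive.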

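Requirements

Drop columns: Level 0 and Level 4.
Define the rating mapping:
- Ratings: ['AAA', 'AA', 'A', 'BBB', 'BB', 'B', 'CCC', 'CC', 'C', 'D']
- Scores: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
- A rating that is not in the list could either be mapped to 10 (the score of D) or raise an error; here it is mapped to 10.
Add two columns: RateE (the original values of the Level 5 column) and RateS (the mapped score).
Rename columns: Level 1 -> code, Level 2 -> industry, Level 3 -> Eco.
Set the code column as the index.
Group by industry to get a dictionary that maps each industry to its DataFrame.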
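
The two ways of handling an unlisted rating mentioned in the requirements can be sketched side by side. This is an illustration only: the rating "NR", the helper map_strict, and the variable names are assumptions made for the example, not part of the original task.

```python
import pandas as pd

ratings = ["AAA", "AA", "A", "BBB", "BB", "B", "CCC", "CC", "C", "D"]
rating_to_score = dict(zip(ratings, range(1, 11)))

observed = pd.Series(["AAA", "BB", "CCC", "NR"])  # "NR" is a made-up, unlisted rating

# Lenient option: unknown ratings fall back to 10 (the score of 'D').
scores = observed.map(rating_to_score).fillna(10).astype(int)
print(scores.tolist())  # [1, 5, 7, 10]


def map_strict(series, mapping):
    """Hypothetical strict variant: raise instead of silently mapping to 10."""
    unknown = sorted(set(series.dropna()) - set(mapping))
    if unknown:
        raise ValueError(f"unlisted ratings: {unknown}")
    return series.map(mapping).astype(int)


try:
    map_strict(observed, rating_to_score)
except ValueError as exc:
    strict_error = str(exc)
    print(strict_error)  # unlisted ratings: ['NR']
```

The lenient option keeps the pipeline running but hides data problems; the strict one surfaces them at the cost of stopping on the first bad file.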

import pandas as pd

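def create_rating_mapping(rating_list, value_list):
    """
    Build the rating mapping dictionary.

    Parameters:
    rating_list: list of rating strings, e.g. ['AAA', 'AA', 'A', 'BBB', ...]
    value_list: list of the corresponding scores, e.g. [1, 2, 3, 4, ...]

    Returns:
    dict that maps each rating to its score
    """
    return dict(zip(rating_list, value_list))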

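def process_dataframe(df_, rating_mapping=None):
    """
    Clean the raw frame and split it by industry.

    Parameters:
    df_: input DataFrame whose first row holds the real column names
    rating_mapping: rating-to-score dict; if None, the default mapping is used

    Returns:
    dict keyed by industry name, with the matching DataFrame as each value
    """
    # Work on a deep copy so the caller's frame is left untouched
    df = df_.copy(deep=True)

    # Promote the first row (the one starting with '!&5') to column names
    new_columns = df.iloc[0].values
    df.columns = new_columns

    # Drop that header row from the data
    df = df.iloc[1:].reset_index(drop=True)

    # Step 1: drop the unneeded columns (Level 0 and Level 4)
    df = df.drop(columns=['!&5', '证券简称'])

    # Build the default 10-level mapping if none was supplied
    if rating_mapping is None:
        ratings = ['AAA', 'AA', 'A', 'BBB', 'BB', 'B', 'CCC', 'CC', 'C', 'D']
        values = list(range(1, 11))
        rating_mapping = create_rating_mapping(ratings, values)

    # Step 2: handle the analyst rating
    # RateE keeps the original rating text; RateS holds the mapped score
    df['RateE'] = df['分析师评级']
    df['RateS'] = df['分析师评级'].map(rating_mapping).fillna(10).astype(int)  # unmapped ratings become 10
    df = df.drop(columns=['分析师评级'])  # drop the original Level 5 column

    # Step 3: rename columns and use the security code as the index
    df = df.rename(columns={
        '证券代码': 'code',
        '所属行业名称': 'industry',
        '交易币种': 'Eco'
    }).set_index('code')

    # Step 4: group by industry into separate DataFrames
    industry_groups = df.groupby('industry')
    industry_dfs = {industry: group for industry, group in industry_groups}

    return industry_dfs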
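
One side effect of promoting a data row to the header is worth checking. Each price column originally held a text label on top of its numbers, so pandas is likely to keep it as object dtype even after the label row is dropped. The sketch below shows one way to convert such columns with pd.to_numeric; it uses a made-up two-row frame, and price_cols is an illustrative name.

```python
import pandas as pd

# A column that mixes a header label with numbers is stored as object dtype.
raw = pd.DataFrame({"Level 1": ["证券代码", "000007", "000008"],
                    "Level 6": ["20240902", 10.11, 6.42]})
df = raw.iloc[1:].reset_index(drop=True)
df.columns = raw.iloc[0].tolist()

print(df["20240902"].dtype)  # object

# Treat every all-digit column name as a price column (an assumption that
# fits the date-named columns of the sample data).
price_cols = [c for c in df.columns if str(c).isdigit()]

# errors="coerce" turns cells that cannot be parsed into NaN instead of raising.
df[price_cols] = df[price_cols].apply(pd.to_numeric, errors="coerce")

print(df["20240902"].dtype)  # float64
```

Without this conversion, numeric operations on the price columns fall back to slow object arithmetic or fail outright.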

Usage example:

# Build a custom rating mapping (if needed)
custom_ratings = ['AAA', 'AA+', 'AA', 'AA-', 'A+', 'A', 'A-', 'BBB+', 'BBB', 'BBB-']
custom_values = list(range(1, 11))
custom_mapping = create_rating_mapping(custom_ratings, custom_values)

# Process the raw DataFrame (df is the frame shown in the sample dataset)
result_dfs = process_dataframe(df, custom_mapping)

# Or use the default mapping
result_dfs = process_dataframe(df)

# Access the data for one industry
real_estate_df = result_dfs['房地产']
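
The industry dictionary can also be built in one expression, and looked up without risking a KeyError. This is a small sketch on a made-up frame that mimics the shape of the processed data; by_industry and the industry name "银行" are illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    {"industry": ["房地产", "机械制造", "房地产"], "RateS": [1, 1, 4]},
    index=pd.Index(["000007", "000008", "000011"], name="code"),
)

# Iterating a GroupBy yields (key, sub-frame) pairs, which dict() accepts directly.
by_industry = dict(tuple(df.groupby("industry")))

print(len(by_industry))                      # 2
print(by_industry["房地产"].index.tolist())  # ['000007', '000011']

# .get returns None for an industry that does not occur in the data.
missing = by_industry.get("银行")
print(missing is None)  # True
```

This is equivalent to the dictionary comprehension in process_dataframe; which form to use is a matter of taste.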

Thoughts on the mapping

import pandas as pd
import numpy as np

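def convert_rating_column(df, original_col, new_col, rating_mapping, default_value=np.nan):
    """
    Extract the rating from the given column and convert it to a numeric level.

    Parameters:
    df: pandas DataFrame
    original_col: name of the source column (strings that contain the rating)
    new_col: name of the new column (the converted numeric values)
    rating_mapping: rating-to-value dict, e.g. {'AAA': 1, 'AA': 2, ...}
    default_value: value used when no mapping is found; defaults to NaN

    Returns:
    DataFrame containing the original column and the new column
    """
    # Make sure the source column exists
    if original_col not in df.columns:
        raise ValueError(f"Column '{original_col}' does not exist in the DataFrame")

    # Work on a copy so the original data is not modified
    result_df = df.copy()

    # Extract the rating and convert it to a number
    def extract_and_map_rating(rating_str):
        if pd.isna(rating_str):
            return default_value

        # Split on '_'; str.split always returns at least one element,
        # so the last piece can be taken without a length check
        parts = str(rating_str).split('_')
        rating = parts[-1]

        # Look the rating up in the mapping
        return rating_mapping.get(rating, default_value)

    # Apply the conversion to every cell
    result_df[new_col] = result_df[original_col].apply(extract_and_map_rating)

    return result_df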

Usage example

# Sample data
data = {
    '分析评级': ['穆迪评级_25_AAA', '穆迪评级_30_AA', '穆迪评级_15_A', '穆迪评级_40_BBB', '穆迪评级_22_CCC', np.nan, '无效评级']
}

df = pd.DataFrame(data)

# Rating mapping dictionary
rating_mapping = {
    'AAA': 1,
    'AA': 2,
    'A': 3,
    'BBB': 4,
    'BB': 5,
    'B': 6,
    'CCC': 7,
    'CC': 8,
    'C': 9,
    'D': 10
}

# Convert the rating column
result_df = convert_rating_column(df, '分析评级', '评级数值', rating_mapping, default_value=-1)

print(result_df)

Output:

        分析评级  评级数值
0  穆迪评级_25_AAA     1
1   穆迪评级_30_AA     2
2    穆迪评级_15_A     3
3  穆迪评级_40_BBB     4
4  穆迪评级_22_CCC     7
5            NaN    -1
6         无效评级    -1
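
To see why the last two rows end up as -1, it helps to trace how individual cells are parsed. The following sketch repeats the split-and-lookup step on three hand-picked strings; the bare 'AAA' cell is an extra case added for illustration and does not appear in the sample data.

```python
rating_mapping = {"AAA": 1, "AA": 2, "A": 3}
cells = ["穆迪评级_25_AAA", "无效评级", "AAA"]

parsed = []
for cell in cells:
    parts = cell.split("_")   # always returns at least one element
    rating = parts[-1]        # the last piece is treated as the rating
    parsed.append((rating, rating_mapping.get(rating, -1)))

print(parsed)
# [('AAA', 1), ('无效评级', -1), ('AAA', 1)]
```

A string without any underscore is passed to the lookup unchanged, so a bare rating such as 'AAA' is still mapped, while free text falls through to the default.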

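Function features

Flexibility: works with different rating formats and mappings.
Fault tolerance: handles NaN values and unrecognized rating strings.
Non-destructive: returns a copy of the DataFrame and leaves the original data unmodified.
Configurability: the default value for unmappable entries can be customized.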

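Vectorized operations

For large DataFrames, consider replacing the row-by-row apply with pandas string methods and Series.map to improve performance: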

import pandas as pd
import numpy as np

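def convert_rating_column_vectorized(df, original_col, new_col, rating_mapping, default_value=np.nan):
    """
    Extract the rating from the given column with vectorized pandas operations
    and convert it to a numeric level.

    Parameters:
    df: pandas DataFrame
    original_col: name of the source column (strings that contain the rating)
    new_col: name of the new column (the converted numeric values)
    rating_mapping: rating-to-value dict, e.g. {'AAA': 1, 'AA': 2, ...}
    default_value: value used when no mapping is found; defaults to NaN

    Returns:
    DataFrame containing the original column and the new column
    """
    # Make sure the source column exists
    if original_col not in df.columns:
        raise ValueError(f"Column '{original_col}' does not exist in the DataFrame")

    result_df = df.copy()

    # Cast to str, split on '_', and keep the last piece; missing cells do not
    # match any key in the mapping and are handled by the default below
    extracted_ratings = result_df[original_col].astype(str).str.split('_').str[-1]

    # Series.map looks every value up in the dict; values without a match become NaN
    mapped_ratings = extracted_ratings.map(rating_mapping)

    # Fill unmapped values only when a non-NaN default was supplied
    if not pd.isna(default_value):
        mapped_ratings = mapped_ratings.fillna(default_value)

    # Add the new column to the copy
    result_df[new_col] = mapped_ratings

    return result_df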
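
The map-based version differs from the apply-based one in its result dtype: as soon as one entry has no match, Series.map returns a float column. The sketch below shows two ways to get integers back, under the assumption that the mapped values are whole numbers; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

rating_mapping = {"AAA": 1, "AA": 2}
cells = pd.Series(["穆迪评级_25_AAA", "穆迪评级_30_AA", np.nan, "无效评级"])

mapped = cells.astype(str).str.split("_").str[-1].map(rating_mapping)
print(mapped.dtype)  # float64, because the unmapped entries are NaN

# Option 1: nullable integer dtype keeps missing entries as <NA>.
as_nullable = mapped.astype("Int64")
print(as_nullable.dtype)  # Int64

# Option 2: with a numeric default, fill first and then cast to a plain int.
as_int = mapped.fillna(-1).astype(int)
print(as_int.tolist())  # [1, 2, -1, -1]
```

Option 1 preserves the difference between "no rating" and a real score; option 2 matches the -1 convention used in the usage example.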

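Reducing copies to improve efficiency: the version below drops the explicit copy() call and returns the result through assign, which still produces a new DataFrame and leaves the input unmodified.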

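def convert_rating_column_optimized(df, original_col, new_col, rating_mapping, default_value=np.nan):
    """
    Extract the rating from the given column and convert it to a numeric level.
    Uses string splitting for extraction; strings without an underscore are
    passed to the mapping unchanged.

    Parameters:
    df: pandas DataFrame
    original_col: name of the source column (strings that contain the rating)
    new_col: name of the new column (the converted numeric values)
    rating_mapping: rating-to-value dict, e.g. {'AAA': 1, 'AA': 2, ...}
    default_value: value used when no mapping is found; defaults to NaN

    Returns:
    DataFrame containing the original column and the new column
    """
    if original_col not in df.columns:
        raise ValueError(f"Column '{original_col}' does not exist in the DataFrame")

    # Cast to str, split on '_', and keep the last piece
    extracted_ratings = df[original_col].astype(str).str.split('_').str[-1]
    # Note: a string without an underscore comes back whole, and it may not be a rating

    # Series.map looks every value up in the dict; values without a match become NaN
    mapped_ratings = extracted_ratings.map(rating_mapping)

    # Fill unmapped values only when a non-NaN default was supplied
    if not pd.isna(default_value):
        mapped_ratings = mapped_ratings.fillna(default_value)

    # assign returns a new DataFrame with the extra column; the input is not modified
    return df.assign(**{new_col: mapped_ratings})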
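
The final line of the function relies on two properties of DataFrame.assign: it accepts a column name that is only known at run time when the name is passed through dict unpacking, and it leaves the input frame as it was. A minimal sketch with made-up values:

```python
import pandas as pd

df = pd.DataFrame({"分析评级": ["穆迪评级_25_AAA", "穆迪评级_30_AA"]})
new_col = "评级数值"  # column name only known at run time
scores = pd.Series([1, 2], index=df.index)

# assign takes keyword arguments, so a dynamic name is passed by unpacking a dict.
result = df.assign(**{new_col: scores})

print(list(df.columns))      # ['分析评级']
print(list(result.columns))  # ['分析评级', '评级数值']
```

Because assign builds a new DataFrame, callers can chain it with other methods without worrying about side effects on df.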
