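1. pd.read_csv()
pd.read_csv() reads a CSV (Comma-Separated Values) file and converts it into a DataFrame object.
- CSV is a common data storage format: the data is stored as plain text, each line is one record, and fields are separated by commas (or another delimiter).
- Basic usage:
pd.read_csv(file_path, sep)
1) file_path: path to the file
2) sep: the field delimiter of the CSV file; defaults to a comma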
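As a minimal sketch of the two parameters above, the snippet below reads the same records twice, once with the default comma and once with an explicit sep. The sample text and the use of io.StringIO in place of a real file path are assumptions made for illustration.

```python
import io

import pandas as pd

# Hypothetical in-memory data standing in for files on disk.
comma_text = "name,age\nAlice,30\nBob,25\n"
semicolon_text = "name;age\nAlice;30\nBob;25\n"

# The default separator is a comma, so sep can be omitted.
df_comma = pd.read_csv(io.StringIO(comma_text))

# Any other delimiter has to be passed explicitly through sep.
df_semicolon = pd.read_csv(io.StringIO(semicolon_text), sep=";")

print(df_comma)
print(df_semicolon)
```

With a real file, the first argument would simply be its path, for example pd.read_csv("data.csv", sep=";").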
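Full signature, as shown by a type stub for an older pandas release (squeeze, prefix, mangle_dupe_cols, error_bad_lines and warn_bad_lines were removed in pandas 2.0):
read_csv(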
reader: FilePathOrBuffer, *,
sep: str = ...,
delimiter: str | None = ...,
header: int | Sequence[int] | str = ...,
names: Sequence[str] | None = ...,
index_col: int | str | Sequence | Literal[False] | None = ...,
usecols: int | str | Sequence | None = ...,
squeeze: bool = ...,
prefix: str | None = ...,
mangle_dupe_cols: bool = ...,
dtype: str | Mapping[str, Any] | None = ...,
engine: str | None = ...,
converters: Mapping[int | str, (*args, **kwargs) -> Any] | None = ...,
true_values: Sequence[Scalar] | None = ...,
false_values: Sequence[Scalar] | None = ...,
skipinitialspace: bool = ...,
skiprows: Sequence | int | (*args, **kwargs) -> Any | None = ...,
skipfooter: int = ...,
nrows: int | None = ...,
na_values=...,
keep_default_na: bool = ...,
na_filter: bool = ...,
verbose: bool = ...,
skip_blank_lines: bool = ...,
parse_dates: bool | List[int] | List[str] = ...,
infer_datetime_format: bool = ...,
keep_date_col: bool = ...,
date_parser: (*args, **kwargs) -> Any | None = ...,
dayfirst: bool = ...,
cache_dates: bool = ...,
iterator: Literal[True],
chunksize: int | None = ...,
compression: str | None = ...,
thousands: str | None = ...,
decimal: str | None = ...,
lineterminator: str | None = ...,
quotechar: str = ...,
quoting: int = ...,
doublequote: bool = ...,
escapechar: str | None = ...,
comment: str | None = ...,
encoding: str | None = ...,
dialect: str | None = ...,
error_bad_lines: bool = ...,
warn_bad_lines: bool = ...,
delim_whitespace: bool = ...,
low_memory: bool = ...,
memory_map: bool = ...,
float_precision: str | None = ...)
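Most of these parameters are rarely needed. The sketch below shows a handful of the commonly used ones (usecols, index_col, na_values, parse_dates, dtype, chunksize) on a small made-up dataset; the column names and values are assumptions chosen for illustration.

```python
import io

import pandas as pd

# Hypothetical sample data; "?" marks a missing price here.
csv_text = (
    "id,date,city,price\n"
    "1,2024-01-05,Paris,10.5\n"
    "2,2024-01-06,NA,?\n"
    "3,2024-01-07,Rome,7.25\n"
)

df = pd.read_csv(
    io.StringIO(csv_text),
    usecols=["id", "date", "price"],  # read only these columns
    index_col="id",                   # use the id column as the row index
    na_values=["?"],                  # treat "?" as missing, in addition to the defaults
    parse_dates=["date"],             # parse the date column as datetime
    dtype={"price": "float64"},       # force the dtype of the price column
)
print(df.dtypes)

# With chunksize, read_csv returns an iterator of DataFrames instead of one DataFrame.
chunk_lengths = [
    len(chunk) for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2)
]
print(chunk_lengths)
```

Reading in chunks is mainly useful for files that do not fit comfortably in memory.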
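2. DataFrame.drop()
- Removes the specified rows, columns, or elements from a DataFrame or Series.
DataFrame.drop(labels=None, *, axis=0, index=None, columns=None, level=None, inplace=False, errors='raise')
1) labels: the column name(s) or row index label(s) to drop; a single value (int/str) or a list
2) axis: the direction to drop along (rows or columns). 0 or 'index': drop rows; 1 or 'columns': drop columns
3) index: row labels to drop (index=labels is equivalent to labels, axis=0)
4) columns: column labels to drop (columns=labels is equivalent to labels, axis=1)
5) inplace: bool. True modifies the object in place and returns None; False returns a new DataFrame. Defaults to False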
import pandas as pd
df = pd.DataFrame({
'A': [1, 2, 3],
'B': [4, 5, 6],
'C': [7, 8, 9]
})
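df_dropped = df.drop('A', axis=1)            # drop column 'A'
df_dropped_equiv = df.drop(columns='A')      # same result, using the columns keyword
df_dropped_row = df.drop(1, axis=0)          # drop the row whose index label is 1
df_dropped_row_equiv = df.drop(index=1)      # same result, using the index keyword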
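The parameters that the example above does not exercise can be sketched as follows: dropping several labels at once, errors='ignore', and inplace=True. The label 'Z' is a deliberately non-existent column used only for illustration.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6], "C": [7, 8, 9]})

# Drop several rows and several columns in a single call.
df_small = df.drop(index=[0, 2], columns=["A", "C"])

# errors="ignore" skips labels that do not exist instead of raising KeyError.
df_ignored = df.drop(columns=["A", "Z"], errors="ignore")

# inplace=True modifies df itself and returns None.
result = df.drop(columns="B", inplace=True)

print(df_small)
print(df_ignored)
print(result, list(df.columns))
```

Because inplace=True returns None, writing df = df.drop(..., inplace=True) would replace df with None; assigning the result of a non-inplace call is usually the safer pattern.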
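3. pd.get_dummies()
pd.get_dummies() converts categorical variables into one-hot (dummy) variables. It is typically used in data preprocessing; in recommender systems, for example, categorical variables can be one-hot encoded and then passed on to an embedding step.
pandas.get_dummies(data, prefix=None, prefix_sep='_', dummy_na=False, columns=None, sparse=False, drop_first=False, dtype=None)
1) data: the categorical data to convert; a Series or a DataFrame
2) prefix: the prefix for the generated column names; a str (a list or dict of str is also accepted when encoding several columns). See the example below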
import numpy as np
import pandas as pd
data = pd.DataFrame({
    'A': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'foo'],
    'B': ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'],
    'C': np.random.randn(8),
    'D': np.random.randn(8)
})
# dtype=int gives 0/1 indicators; since pandas 2.0 the default dtype is bool (True/False)
dummy_data = pd.get_dummies(data['A'], prefix='A', dtype=int)
'''
dummy_data is then:
   A_bar  A_foo
0      0      1
1      1      0
2      0      1
3      1      0
4      0      1
5      1      0
6      0      1
7      0      1
'''
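The remaining parameters in the signature can be sketched on a whole DataFrame: columns selects which columns to encode, and drop_first removes the first level of each encoded variable. The small frame below is made up for illustration, and dtype=int is passed only so the indicators are 0/1 regardless of pandas version.

```python
import pandas as pd

data = pd.DataFrame({
    "A": ["foo", "bar", "foo"],
    "B": ["one", "two", "one"],
    "C": [1.0, 2.0, 3.0],
})

# Encode only the listed columns; column C passes through unchanged.
encoded = pd.get_dummies(data, columns=["A", "B"], dtype=int)

# drop_first=True keeps k-1 indicator columns for a variable with k levels.
encoded_k1 = pd.get_dummies(data, columns=["A", "B"], drop_first=True, dtype=int)

print(encoded)
print(encoded_k1)
```

Dropping the first level avoids perfectly collinear indicator columns, which matters for linear models more than for embedding-based pipelines.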