1. Data Structures: Series & DataFrame
Pandas provides two core data structures: Series and DataFrame.
A Series is a one-dimensional labeled array that can hold any data type (integers, floats, strings, Python objects, and so on). You can think of it as a single column with an index.
import pandas as pd

# Create a Series from a list (mixed types fall back to the object dtype)
s1 = pd.Series([1, 3, 5, 'a', 7])
print("Series from list:\n", s1)

# Create a Series from a dictionary (the keys become the index)
s2 = pd.Series({'a': 1, 'b': 2, 'c': 3})
print("\nSeries from dictionary:\n", s2)

# Create a Series with a custom index
s3 = pd.Series([10, 20, 30], index=['x', 'y', 'z'])
print("\nSeries with custom index:\n", s3)
A DataFrame is a two-dimensional, tabular data structure. It can be thought of as a dictionary of Series: each column is a Series. A DataFrame has both a row index and column labels, which makes it well suited to representing and manipulating structured data.
# Create a DataFrame from a dictionary of columns
data = {'Name': ['Alice', 'Bob', 'Charlie'],
        'Age': [25, 30, 28],
        'City': ['New York', 'London', 'Paris']}
df1 = pd.DataFrame(data)
print("DataFrame from dictionary:\n", df1)

# Create a DataFrame from a list of lists
data2 = [['David', 35, 'Berlin'], ['Emily', 22, 'Tokyo']]
df2 = pd.DataFrame(data2, columns=['Name', 'Age', 'City'])
print("\nDataFrame from list of lists:\n", df2)

# Create a DataFrame from a Series
s4 = pd.Series([100, 200, 300], index=['price1', 'price2', 'price3'])
df3 = pd.DataFrame(s4, columns=['Price'])
print("\nDataFrame from Series:\n", df3)
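To make the "dictionary of Series" view concrete, here is a small sketch using the df1 defined above: selecting a single column returns a Series whose index is the DataFrame's row index.

# Each column of a DataFrame is itself a Series
col = df1['Name']
print(type(col))   # <class 'pandas.core.series.Series'>
print(col.index)   # same row index as df1: RangeIndex(start=0, stop=3, step=1)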
2. Data Input/Output
pd.read_csv(): Reads a comma-separated values (CSV) file. You can specify the separator, encoding, rows to skip, column names, the index column, and other parameters.

# Assuming a file named 'data.csv' exists
# df = pd.read_csv('data.csv')
# print(df)
# Specify the separator and encoding
# df = pd.read_csv('data.csv', sep=';', encoding='utf-8')
# Skip the first row
# df = pd.read_csv('data.csv', skiprows=1)
# Provide column names
# df = pd.read_csv('data.csv', names=['col1', 'col2', 'col3'])
# Use a specific column as the index
# df = pd.read_csv('data.csv', index_col='column_name')
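Because 'data.csv' may not exist on disk, here is a minimal, runnable sketch that feeds an in-memory io.StringIO buffer to pd.read_csv instead; the column names col1/col2 and the sample rows are made up for illustration.

import io
import pandas as pd

# Hypothetical CSV content held in memory rather than in a file
csv_text = "col1,col2\n1,a\n2,b\n3,c\n"
df_demo = pd.read_csv(io.StringIO(csv_text))                      # parse the buffer like a file
print(df_demo)
df_demo2 = pd.read_csv(io.StringIO(csv_text), index_col='col1')   # use col1 as the index
print(df_demo2)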
pd.read_excel(): Reads an Excel file. You can specify which sheet to read.

# Assuming a file named 'data.xlsx' exists
# df = pd.read_excel('data.xlsx')
# print(df)
# Read a specific sheet
# df = pd.read_excel('data.xlsx', sheet_name='Sheet2')
to_csv(): Writes a DataFrame to a CSV file (note that this is a DataFrame method, not a top-level pandas function). You can choose whether to write the index.

# df1.to_csv('output.csv')
# Skip the index
# df1.to_csv('output.csv', index=False)
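As a runnable complement that writes nothing to disk, the following sketch round-trips df1 through an in-memory buffer; the variable names buf and df_back are illustrative.

import io

buf = io.StringIO()
df1.to_csv(buf, index=False)   # write the CSV text into the buffer instead of a file
print(buf.getvalue())          # show the generated CSV text
buf.seek(0)
df_back = pd.read_csv(buf)     # read it back to verify the round trip
print(df_back)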
to_excel(): Writes a DataFrame to an Excel file (also a DataFrame method). You can choose whether to write the index.

# df1.to_excel('output.xlsx')
# Skip the index
# df1.to_excel('output.xlsx', index=False)
3. Data Inspection
head(n): Shows the first n rows of a DataFrame (5 by default).

print("First 2 rows:\n", df1.head(2))
tail(n): Shows the last n rows of a DataFrame (5 by default).

print("\nLast 1 row:\n", df1.tail(1))
info(): Prints basic information about the DataFrame, including column names, non-null counts, and data types. It prints its report directly and returns None, so call it on its own rather than wrapping it in print().

print("\nDataFrame info:")
df1.info()
describe(): Provides a statistical summary of the numeric columns, including count, mean, standard deviation, minimum, and maximum.

print("\nDataFrame description:\n", df1.describe())
shape: Returns the shape of the DataFrame as (number of rows, number of columns).

print("\nDataFrame shape:", df1.shape)
columns: Returns the column labels of the DataFrame.

print("\nDataFrame columns:", df1.columns)
index: Returns the index of the DataFrame.

print("\nDataFrame index:", df1.index)
dtypes: Returns the data type of each column.

print("\nDataFrame data types:", df1.dtypes)
4. Data Selection & Indexing
[]: Select columns by name, or rows by slice.

# Select a single column
print("Selected single column:\n", df1['Name'])
print("\nSelected single column:\n", df1.Name)  # equivalent attribute-style access
# Select multiple columns
print("\nSelected multiple columns:\n", df1[['Name', 'Age']])
# Slice rows
print("\nSliced rows:\n", df1[0:2])
loc[]: Label-based selection of rows, columns, or subsets. Label slices such as 0:1 include both endpoints.

# Select a single row by label
print("\nSelect row by label:\n", df1.loc[0])
# Select multiple rows by label (the end label is included)
print("\nSelect rows by label:\n", df1.loc[0:1])
# Select rows and columns by label
print("\nSelect row and columns by label:\n", df1.loc[0, ['Name', 'Age']])
print("\nSelect rows and columns by label:\n", df1.loc[0:1, ['Name', 'Age']])
print("\nSelect all rows and a specific column:\n", df1.loc[:, 'Age'])
iloc[]: Position-based (integer) selection. Integer slices such as 0:2 exclude the end position, like regular Python slicing.

# Select a single row by position
print("\nSelect row by position:\n", df1.iloc[0])
# Select multiple rows by position
print("\nSelect multiple rows by position:\n", df1.iloc[0:2])
# Select rows and columns by position
print("\nSelect row and columns by position:\n", df1.iloc[0, [0, 1]])
print("\nSelect rows and columns by position:\n", df1.iloc[0:2, [0, 1]])
print("\nSelect all rows and a specific column by position:\n", df1.iloc[:, 1])
Conditional selection: use boolean indexing to select the rows that satisfy a condition.
# Select rows where Age is greater than 28
print("\nSelect rows by condition:\n", df1[df1['Age'] > 28])
# Select the names of rows where Age is greater than 28
print("\nSelect names of rows with condition:\n", df1[df1['Age'] > 28]['Name'])
# Multiple conditions: wrap each one in parentheses and combine with & or |
print("\nSelect rows by multiple conditions:\n", df1[(df1['Age'] > 25) & (df1['City'] == 'London')])
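To make the mechanism explicit, here is a small sketch (the variable name mask is illustrative): the comparison produces a boolean Series aligned with df1's index, and passing that Series to [] keeps only the rows where it is True.

mask = df1['Age'] > 28   # boolean Series: False, True, False for Alice/Bob/Charlie
print(mask)
print(df1[mask])         # same result as df1[df1['Age'] > 28]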
5. Data Manipulation
assign(): Adds new columns to a DataFrame and returns a new DataFrame.

df_new = df1.assign(Salary=[50000, 60000, 55000])
print("\nDataFrame with a new column:\n", df_new)
drop(): Removes the specified columns or rows. The inplace parameter controls whether the original DataFrame is modified directly.

df_dropped = df1.drop(columns=['City'])
print("\nDataFrame after dropping a column:\n", df_dropped)
df_dropped_row = df1.drop(index=[0, 1])
print("\nDataFrame after dropping rows:\n", df_dropped_row)
# inplace=True modifies the original DataFrame directly
# df1.drop(columns=['City'], inplace=True)
# print("\nOriginal DataFrame after inplace drop:\n", df1)
rename(): Renames column labels or index labels. The inplace parameter controls whether the original DataFrame is modified directly.

df_renamed = df1.rename(columns={'Name': 'FullName', 'Age': 'Years'})
print("\nDataFrame after renaming columns:\n", df_renamed)
df_renamed_index = df1.rename(index={0: 'first', 1: 'second', 2: 'third'})
print("\nDataFrame after renaming index:\n", df_renamed_index)
# inplace=True modifies the original DataFrame directly
# df1.rename(columns={'Name': 'FullName'}, inplace=True)
# print("\nOriginal DataFrame after inplace rename:\n", df1)
sort_values(): Sorts the DataFrame by the values of the specified column(s).

df_sorted = df1.sort_values(by='Age', ascending=False)
print("\nDataFrame sorted by age (descending):\n", df_sorted)
set_index(): Sets one of the DataFrame's columns as the index (pass drop=False if you also want to keep it as a regular column).

df_setindex = df1.set_index('Name')
print("\nDataFrame with index set:\n", df_setindex)
reset_index(): Resets the index, turning the index back into an ordinary column (pass drop=True to discard it instead).

df_resetindex = df_setindex.reset_index()
print("\nDataFrame with reset index:\n", df_resetindex)
apply(): Applies a function along an axis of a DataFrame (or to a Series). axis=0 applies it column by column, axis=1 row by row.

def add_10(x):
    return x + 10

# Apply column-wise
df_applied_col = df1[['Age']].apply(add_10, axis=0)
print("\nDataFrame after apply by column:\n", df_applied_col)

# Apply row-wise
def combine(row):
    return row['Name'] + '-' + str(row['Age'])

df_applied_row = df1.apply(combine, axis=1)
print("\nResult after apply by row:\n", df_applied_row)
applymap(): Applies a function to every element of a DataFrame. (In newer pandas releases, 2.1+, applymap() is deprecated in favor of DataFrame.map().)

df_applied_map = df1[['Age']].applymap(lambda x: x * 2)
print("\nDataFrame after applying a function to every element:\n", df_applied_map)
map(): Applies a function (or a mapping) to every element of a Series.

s5 = pd.Series([1, 2, 3])
s5_mapped = s5.map(lambda x: x * x)
print("\nSeries after applying map function:\n", s5_mapped)
df1_mapped_name = df1.Name.map(lambda x: x + '_test')
print("\nName column after applying map function:\n", df1_mapped_name)
df1_mapped_age = df1.Age.map(lambda x: x + 10)
print("\nAge column after applying map function:\n", df1_mapped_age)
replace(): Replaces values in a DataFrame or Series. The inplace parameter controls whether the original DataFrame is modified directly.

df_replaced = df1.replace({'Age': {25: 26}}, inplace=False)
print("\nDataFrame after replacement:\n", df_replaced)
# inplace=True modifies the original DataFrame directly
# df1.replace({'Age': {25: 26}}, inplace=True)
# print("\nOriginal DataFrame after inplace replace:\n", df1)
6. Grouping & Aggregation
groupby(): Groups a DataFrame by the values of one or more columns (a multi-column sketch follows this example).

data3 = {'Category': ['A', 'B', 'A', 'B', 'A'],
         'Value': [10, 20, 15, 25, 12]}
df3 = pd.DataFrame(data3)
grouped = df3.groupby('Category')
print("\nDataFrame grouped by Category:\n", grouped)
# Sum per group
sum_values = grouped.sum()
print("\nSum of Values for each group:\n", sum_values)
# Mean per group
mean_values = grouped.mean()
print("\nMean of Values for each group:\n", mean_values)
# Count per group
count_values = grouped.count()
print("\nCount of Values for each group:\n", count_values)
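Since the description mentions grouping by more than one column, here is a minimal sketch of that; the extra 'Store' column and its values are made up for illustration.

df_multi = pd.DataFrame({
    'Category': ['A', 'A', 'B', 'B'],
    'Store': ['S1', 'S2', 'S1', 'S2'],   # hypothetical second grouping key
    'Value': [10, 15, 20, 25],
})
# Grouping by two columns produces a result indexed by a MultiIndex (Category, Store)
print(df_multi.groupby(['Category', 'Store'])['Value'].sum())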
agg(): Applies aggregation functions such as sum, mean, and count to grouped data.

# Apply several functions at once
aggregated = grouped.agg(['sum', 'mean', 'count'])
print("\nAggregated results:\n", aggregated)
# Apply different aggregation functions to different columns
aggregated2 = df3.groupby('Category').agg({'Value': ['sum', 'max']})
print("\nAggregated results:\n", aggregated2)
transform(): Applies a transformation to grouped data and returns a result aligned with the original rows; it is commonly used to broadcast an aggregate back onto each row.

df3['sum_of_category'] = df3.groupby('Category')['Value'].transform('sum')
print("\nTransformed DataFrame with the sum of each category:\n", df3)
7. Missing Value Handling
isnull(): Detects missing values, returning a boolean DataFrame in which missing values are True and non-missing values are False.

data4 = {'A': [1, 2, None, 4], 'B': [5, None, 7, 8]}
df4 = pd.DataFrame(data4)
print("\nDataFrame with missing values:\n", df4)
print("\nBoolean DataFrame indicating missing values:\n", df4.isnull())
# Which columns contain missing values?
print("\nCheck if any null value in each column:\n", df4.isnull().any())
# Which rows contain missing values?
print("\nCheck if any null value in each row:\n", df4.isnull().any(axis=1))
fillna(): Fills missing values with a fixed value, the column mean, a forward fill, a backward fill, and so on. The inplace parameter controls whether the original DataFrame is modified directly. (In newer pandas releases the method= argument is deprecated in favor of the dedicated ffill()/bfill() methods.)

# Fill with a fixed value
df_filled = df4.fillna(0)
print("\nDataFrame with missing values filled with 0:\n", df_filled)
# Fill with the column mean
df_filled_mean = df4.fillna(df4.mean())
print("\nDataFrame with missing values filled with mean:\n", df_filled_mean)
# Forward fill (carry the previous value forward); equivalent to df4.ffill()
df_filled_forward = df4.fillna(method='ffill')
print("\nDataFrame with missing values filled by forward fill:\n", df_filled_forward)
# Backward fill (carry the next value backward); equivalent to df4.bfill()
df_filled_backward = df4.fillna(method='bfill')
print("\nDataFrame with missing values filled by backward fill:\n", df_filled_backward)
# inplace=True modifies the original DataFrame directly
# df4.fillna(0, inplace=True)
# print("\nOriginal DataFrame after inplace fillna:\n", df4)
dropna(): Drops rows or columns that contain missing values. The how parameter selects rows/columns containing any missing value ('any') or only those that are entirely missing ('all'); the inplace parameter controls whether the original DataFrame is modified directly.

# Drop rows containing any missing value
df_dropped_na = df4.dropna(axis=0, how='any')
print("\nDataFrame after dropping rows with any NA:\n", df_dropped_na)
# Drop columns whose values are all missing
df_dropped_na_col = df4.dropna(axis=1, how='all')
print("\nDataFrame after dropping columns with all NA:\n", df_dropped_na_col)
# inplace=True modifies the original DataFrame directly
# df4.dropna(inplace=True)
# print("\nOriginal DataFrame after inplace dropna:\n", df4)
8. Data Merging
pd.concat(): Concatenates multiple DataFrames along a given axis (rows or columns). The join parameter selects an outer or inner join on the other axis.

data5 = {'X': [1, 2, 3], 'Y': [4, 5, 6]}
df5 = pd.DataFrame(data5)
data6 = {'X': [7, 8], 'Z': [9, 10]}
df6 = pd.DataFrame(data6)
# Concatenate by rows
df_concat_row = pd.concat([df5, df6])
print("\nDataFrame after concatenating by rows:\n", df_concat_row)
# Concatenate by columns
df_concat_col = pd.concat([df5, df6], axis=1)
print("\nDataFrame after concatenating by columns:\n", df_concat_col)
# Inner join: keep only the row labels shared by both frames
df_concat_inner = pd.concat([df5, df6], axis=1, join='inner')
print("\nDataFrame after inner concatenation by columns:\n", df_concat_inner)
# Outer join (the default): keep all row labels
df_concat_outer = pd.concat([df5, df6], axis=1, join='outer')
print("\nDataFrame after outer concatenation by columns:\n", df_concat_outer)
pd.merge(): Merges two DataFrames on the values of one or more key columns. The how parameter selects an inner, left, right, or outer join.

data7 = {'id': [1, 2, 3], 'name': ['Alice', 'Bob', 'Charlie']}
df7 = pd.DataFrame(data7)
data8 = {'id': [2, 3, 4], 'age': [25, 30, 28]}
df8 = pd.DataFrame(data8)
# Inner join
df_merge_inner = pd.merge(df7, df8, on='id', how='inner')
print("\nDataFrame after merging by inner join:\n", df_merge_inner)
# Left join
df_merge_left = pd.merge(df7, df8, on='id', how='left')
print("\nDataFrame after merging by left join:\n", df_merge_left)
# Right join
df_merge_right = pd.merge(df7, df8, on='id', how='right')
print("\nDataFrame after merging by right join:\n", df_merge_right)
# Outer join
df_merge_outer = pd.merge(df7, df8, on='id', how='outer')
print("\nDataFrame after merging by outer join:\n", df_merge_outer)
9. Pivot Tables
pd.pivot_table(): Creates a pivot table that summarizes and reorganizes data.

data9 = {'Date': ['1/1/2023', '1/1/2023', '2/1/2023', '2/1/2023', '3/1/2023', '3/1/2023'],
         'Region': ['East', 'West', 'East', 'West', 'East', 'West'],
         'Sales': [100, 150, 120, 180, 110, 200]}
df9 = pd.DataFrame(data9)
pivot_table = pd.pivot_table(df9, values='Sales', index='Region', columns='Date', aggfunc='sum')
print("\nDataFrame as pivot table:\n", pivot_table)
10. Time Series Data
Pandas provides powerful functionality for working with time series data.
pd.to_datetime(): Converts strings (or other formats) to datetime values.

date_str = ['2023-01-01', '2023-01-02', '2023-01-03']
dates = pd.to_datetime(date_str)
print("\nConverted datetimes:\n", dates)
df9['Date'] = pd.to_datetime(df9['Date'])
print("\nDataFrame with converted datetime column:")
df9.info()  # the Date column is now datetime64[ns]
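Once a column holds datetime values, its components become accessible through the .dt accessor; a minimal sketch using the converted df9['Date'] column from above:

print(df9['Date'].dt.year)        # the year of each timestamp
print(df9['Date'].dt.month)       # the month of each timestamp
print(df9['Date'].dt.day_name())  # the weekday name, e.g. 'Sunday'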
Time series indexing and operations: use pd.date_range() to create a datetime index, then slice, resample, and aggregate the time series.

# Create a datetime index
dates_index = pd.date_range(start='2023-01-01', end='2023-01-05', freq='D')
print("\nDate range:", dates_index)
s6 = pd.Series([10, 20, 30, 40, 50], index=dates_index)
print("\nSeries with datetime index:\n", s6)
# Slice the time series by date
print("\nSlice series based on date range:\n", s6['2023-01-02':'2023-01-04'])
# Aggregate by resampling into two-day bins
s6_resampled = s6.resample('2D').sum()
print("\nSeries after resampling to two-day periods:\n", s6_resampled)