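Data Cleaning
Data cleaning is a key step in data analysis. It includes handling missing values and outliers and transforming data into usable types and formats.
Missing Value Handling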
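Before choosing between dropping and filling, it can help to see how much of each column is missing. The sketch below is illustrative only: the helper name `missing_report` and the sample frame are assumptions, not part of the original example.

```python
import pandas as pd

def missing_report(df: pd.DataFrame) -> pd.DataFrame:
    """Return the count and percentage of missing values per column.

    Hypothetical helper, for illustration only.
    """
    counts = df.isna().sum()
    percent = df.isna().mean() * 100
    return pd.DataFrame({'missing': counts, 'percent': percent})

# Illustrative sample data (assumed)
sample = pd.DataFrame({
    'age': [25, None, 30, 35, None],
    'income': [50000, 60000, None, 80000, 90000],
})
report = missing_report(sample)
print(report)
```

A column that is mostly empty is usually a candidate for removal, while a column with only a few gaps is a better candidate for filling.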
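```python
import pandas as pd
import numpy as np

# Create sample data
data = {
    'name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'age': [25, None, 30, 35, None],
    'income': [50000, 60000, None, 80000, 90000]
}
df = pd.DataFrame(data)

# Count missing values per column
print(df.isnull().sum())

# Drop rows that contain any missing value
df_drop = df.dropna()

# Fill missing values per column: median for age, mean for income
df_fill = df.fillna({
    'age': df['age'].median(),
    'income': df['income'].mean()
})

# Forward fill: propagate the last valid value downward
# (fillna(method='ffill') is deprecated in recent pandas releases)
df_ffill = df.ffill()

# Backward fill: propagate the next valid value upward
df_bfill = df.bfill()
```
Outlier Handling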
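The IQR rule treats values below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR as outliers. Flagged values do not have to be dropped: when the row count must be preserved, they can be capped at those fences instead. A minimal sketch, assuming a hypothetical helper `cap_outliers_iqr` and the conventional multiplier of 1.5:

```python
import pandas as pd

def cap_outliers_iqr(s: pd.Series, k: float = 1.5) -> pd.Series:
    """Clip values to [Q1 - k*IQR, Q3 + k*IQR]. Hypothetical helper."""
    q1 = s.quantile(0.25)
    q3 = s.quantile(0.75)
    iqr = q3 - q1
    return s.clip(lower=q1 - k * iqr, upper=q3 + k * iqr)

values = pd.Series([10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 1000])
capped = cap_outliers_iqr(values)

# The extreme value is pulled down to the upper fence; no rows are removed
print(capped.max(), len(capped))
```

Capping keeps every row but distorts the tail of the distribution, so whether it is preferable to dropping depends on the analysis.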
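```python
import pandas as pd

# Create data with one extreme value
data = {'value': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 1000]}
df = pd.DataFrame(data)

# Detect outliers with the interquartile range (IQR) rule
Q1 = df['value'].quantile(0.25)
Q3 = df['value'].quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

# Identify outliers: rows outside [lower_bound, upper_bound]
outliers = df[(df['value'] < lower_bound) | (df['value'] > upper_bound)]
print("Outliers:", outliers)

# Handle outliers by keeping only the rows inside the bounds
df_clean = df[(df['value'] >= lower_bound) & (df['value'] <= upper_bound)]
```
Duplicate Handling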
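Duplicates are not always exact copies of a whole row. Often a record counts as a duplicate when only its key columns match. The sketch below, using assumed sample data, compares on a subset of columns and uses `keep=False` to list every member of a duplicate group before deciding which one to keep.

```python
import pandas as pd

records = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Alice', 'David'],
    'age': [25, 30, 26, 35],
})

# Full-row comparison finds nothing: the two Alice rows differ in age
full_dupes = records.duplicated().sum()

# Comparing on 'name' only; keep=False marks every member of a duplicate group
name_dupes = records[records.duplicated(subset=['name'], keep=False)]

# Keep the first row per name
deduped = records.drop_duplicates(subset=['name'], keep='first')

print(name_dupes)
print(deduped)
```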
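```python
import pandas as pd

# Create data that contains duplicate rows
data = {
    'name': ['Alice', 'Bob', 'Alice', 'David', 'Bob'],
    'age': [25, 30, 25, 35, 30]
}
df = pd.DataFrame(data)

# Flag duplicates: True for each row that repeats an earlier row
print(df.duplicated())

# Drop duplicates, keeping the first occurrence (the default)
df_unique = df.drop_duplicates()

# Drop duplicates, keeping the last occurrence instead
df_last = df.drop_duplicates(keep='last')
```
Data Type Conversion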
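String columns in real data often contain entries that cannot be parsed, and a direct `astype` call raises an error on the first bad value. One common approach is to coerce unparseable entries to missing values and then count them. The sample data below is assumed for illustration.

```python
import pandas as pd

raw = pd.DataFrame({
    'age': ['25', 'thirty', '35'],
    'date': ['2023-01-01', 'not a date', '2023-03-20'],
})

# errors='coerce' turns unparseable entries into NaN / NaT instead of raising
raw['age_num'] = pd.to_numeric(raw['age'], errors='coerce')
raw['date_parsed'] = pd.to_datetime(raw['date'], errors='coerce')

# Count how many values failed to parse in each column
print(raw[['age_num', 'date_parsed']].isna().sum())
```

The coerced values can then be handled with the same techniques used for other missing values.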
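```python
import pandas as pd

# Create data in which every column is stored as strings
data = {
    'age': ['25', '30', '35'],
    'income': ['50000', '60000', '70000'],
    'is_student': ['True', 'False', 'True'],
    'date': ['2023-01-01', '2023-02-15', '2023-03-20']
}
df = pd.DataFrame(data)

# Convert each column to an appropriate type
df['age'] = df['age'].astype(int)
df['income'] = df['income'].astype(float)
# astype(bool) would turn every non-empty string, including 'False', into True,
# so map the string values explicitly
df['is_student'] = df['is_student'].map({'True': True, 'False': False})
df['date'] = pd.to_datetime(df['date'])

# Extract date components
df['year'] = df['date'].dt.year
df['month'] = df['date'].dt.month
df['day'] = df['date'].dt.day
```
Text Data Processing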
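When a full tokenizer is not needed, basic text cleanup (lowercasing, removing punctuation, splitting on whitespace) can be done with pandas string methods alone. This is a rough sketch: whitespace splitting is an assumption that suits space-delimited English text and is not equivalent to NLTK's `word_tokenize`.

```python
import pandas as pd

texts = pd.Series(['  Hello   World! ', 'Python is great.'])

cleaned = (
    texts.str.lower()
         .str.replace(r'[^\w\s]', '', regex=True)  # drop punctuation
         .str.replace(r'\s+', ' ', regex=True)     # collapse whitespace runs
         .str.strip()                              # trim leading/trailing spaces
)

# Whitespace tokenization; no external data downloads required
tokens = cleaned.str.split()
print(tokens.tolist())
```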
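```python
import pandas as pd
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# One-time setup: word_tokenize and stopwords need NLTK data packages.
# Older NLTK releases use 'punkt'; newer ones use 'punkt_tab'.
nltk.download('punkt')
nltk.download('punkt_tab')
nltk.download('stopwords')

# Create data
data = {'text': ['Hello World!', 'Python is great.', 'Data Analysis']}
df = pd.DataFrame(data)

# Convert to lowercase
df['text_lower'] = df['text'].str.lower()

# Remove punctuation (anything that is not a word character or whitespace)
df['text_clean'] = df['text_lower'].str.replace(r'[^\w\s]', '', regex=True)

# Tokenize
df['tokens'] = df['text_clean'].apply(word_tokenize)

# Remove English stop words
stop_words = set(stopwords.words('english'))
df['tokens_filtered'] = df['tokens'].apply(lambda x: [word for word in x if word not in stop_words])
```
Data Standardization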
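Both transformations in this section are simple formulas. Z-score standardization computes z = (x - mean) / std, and min-max normalization computes x' = (x - min) / (max - min). The sketch below reproduces them with pandas alone to make the arithmetic visible. Note that scikit-learn's `StandardScaler` uses the population standard deviation, hence `ddof=0`.

```python
import pandas as pd

values = pd.Series([10, 20, 30, 40, 50], dtype=float)

# Z-score: subtract the mean, divide by the population standard deviation
# (ddof=0 matches what scikit-learn's StandardScaler uses)
z_scores = (values - values.mean()) / values.std(ddof=0)

# Min-max: rescale linearly so the minimum maps to 0 and the maximum to 1
min_max = (values - values.min()) / (values.max() - values.min())

print(z_scores.round(3).tolist())
print(min_max.tolist())
```

Computing the statistics by hand also makes it clear that they should come from the training data only when the scaled values feed a model.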
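```python
import pandas as pd
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# Create data
data = {'value': [10, 20, 30, 40, 50]}
df = pd.DataFrame(data)

# Z-score standardization: zero mean, unit variance
scaler = StandardScaler()
# fit_transform returns a 2-D array; ravel() flattens it into a single column
df['standardized'] = scaler.fit_transform(df[['value']]).ravel()

# Min-max normalization: rescale to the range [0, 1]
scaler = MinMaxScaler()
df['normalized'] = scaler.fit_transform(df[['value']]).ravel()
```
Data Encoding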
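For quick exploratory work, both encodings can also be produced with pandas alone. The sketch below is an alternative under assumed sample data, not a replacement for the scikit-learn encoders, which remember the fitted categories and can be reapplied to new data.

```python
import pandas as pd

df_cat = pd.DataFrame({'category': ['A', 'B', 'C', 'A', 'B']})

# One-hot encoding: one indicator column per category value
dummies = pd.get_dummies(df_cat['category'], prefix='category')

# Label encoding: integer codes assigned in sorted order of the categories
codes = df_cat['category'].astype('category').cat.codes

print(dummies.columns.tolist())
print(codes.tolist())
```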
```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, LabelEncoder

# Create data
data = {'category': ['A', 'B', 'C', 'A', 'B']}
df = pd.DataFrame(data)

# One-hot encoding
# sparse_output replaces the older `sparse` argument (scikit-learn 1.2+)
encoder = OneHotEncoder(sparse_output=False)
encoded = encoder.fit_transform(df[['category']])
encoded_df = pd.DataFrame(encoded, columns=encoder.get_feature_names_out())

# Label encoding: map each category to an integer
encoder = LabelEncoder()
df['encoded'] = encoder.fit_transform(df['category'])
```
Notes
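- Back up the data: keep an untouched copy of the raw data before cleaning.
- Record the process: log every cleaning step so it can be reproduced.
- Validate the results: check data quality after cleaning.
- Stay consistent: apply the same processing logic throughout.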
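These points can be combined in a small workflow. The sketch below is one possible arrangement, not a prescribed method: the helper `apply_step`, the `log` list, and the sample data are all hypothetical names introduced for illustration.

```python
import pandas as pd

raw = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Alice', 'David'],
    'age': [25, None, 25, 35],
})

# Backup: work on a copy so the raw data stays untouched
df = raw.copy()
log = []  # one entry per cleaning step: (description, rows before, rows after)

def apply_step(frame, description, func):
    """Apply one cleaning step and record its effect on the row count."""
    result = func(frame)
    log.append((description, len(frame), len(result)))
    return result

# Consistency: every step goes through the same helper
df = apply_step(df, 'drop duplicate rows', lambda d: d.drop_duplicates())
df = apply_step(df, 'drop rows with missing age', lambda d: d.dropna(subset=['age']))

# Validation: fail loudly if the cleaned data breaks an expectation
assert df['age'].notna().all()
assert not df.duplicated().any()

for entry in log:
    print(entry)
```

Routing every step through one helper keeps the log complete, and the row counts make it easy to spot a step that removed more data than expected.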