Preface

This post uses machine learning algorithms to test how well common price-volume factors predict price. The factors involved are:

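  • Price trend factors
    • Moving averages
      • SMA_5
      • SMA_10
      • SMA_20
      • EMA_5
      • EMA_10
      • EMA_20
    • Momentum indicators
      • MACD
      • MACD_signal
      • MACD_hist
      • RSI
  • Volatility factors
    • BB_middle, BB_upper, BB_lower, BB_width (Bollinger Bands)
    • volatility (20-period rolling standard deviation of returns)
  • Other related factors
    • volume
    • open_oi
    • close_oi
    • returns (rate of return)
  • Base fields
    • open
    • high
    • low
    • close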

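The Chinese labels in the original figures render as garbled characters. I didn't bother fixing that, and it doesn't affect readability.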
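
If you would rather fix the garbled labels, the usual cause is that matplotlib's default font has no CJK glyphs. A minimal sketch, assuming at least one of the listed fonts is installed (the font names are common choices on Windows, macOS and Linux, not something this post specifies):

```python
import matplotlib.pyplot as plt

# Candidate CJK-capable fonts; matplotlib falls back through the list in order.
# Which of these exist on your machine is an assumption to verify.
plt.rcParams["font.sans-serif"] = ["SimHei", "Arial Unicode MS", "Noto Sans CJK SC", "DejaVu Sans"]

# Draw the minus sign as an ASCII hyphen, since some CJK fonts lack U+2212.
plt.rcParams["axes.unicode_minus"] = False
```

Run this after any `plt.style.use(...)` call, because applying a style sheet can overwrite the font settings.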

For the data source, see this article: 金融数据采集 (Financial Data Collection) | FlyDay

Importing the required libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, TimeSeriesSplit, GridSearchCV
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.svm import SVR
from sklearn.neural_network import MLPRegressor
from xgboost import XGBRegressor
from lightgbm import LGBMRegressor
from catboost import CatBoostRegressor
import warnings
warnings.filterwarnings('ignore')

# Set the plotting style
plt.style.use('seaborn-v0_8')
sns.set_palette("husl")
%matplotlib inline

Loading the data

file_path = "file/KQ_m_CZCE_SA_3m.csv"
df = pd.read_csv(file_path)
print("Data shape:", df.shape)
print("\nFirst 5 rows:")
df.head()

Data shape: (10000, 12)

datetime id open high low close volume open_oi close_oi symbol duration date
0 2023-04-20 10:30:00+08:00 90620.0 2224.0 2225.0 2217.0 2218.0 44242.0 879082.0 886347.0 KQ.m@CZCE.SA 180 2023-04-20
1 2023-04-20 10:33:00+08:00 90621.0 2218.0 2220.0 2213.0 2214.0 37043.0 886347.0 887119.0 KQ.m@CZCE.SA 180 2023-04-20
2 2023-04-20 10:36:00+08:00 90622.0 2214.0 2216.0 2207.0 2210.0 55160.0 887119.0 886485.0 KQ.m@CZCE.SA 180 2023-04-20
3 2023-04-20 10:39:00+08:00 90623.0 2210.0 2211.0 2204.0 2205.0 42947.0 886485.0 887544.0 KQ.m@CZCE.SA 180 2023-04-20
4 2023-04-20 10:42:00+08:00 90624.0 2205.0 2208.0 2200.0 2202.0 48951.0 887544.0 891117.0 KQ.m@CZCE.SA 180 2023-04-20

Basic information about the data

print("Basic data information:")
print(df.info())
print("\nMissing values per column:")
print(df.isnull().sum())
print("\nDescriptive statistics:")
df.describe()

Basic data information:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 12 columns):
 #   Column    Non-Null Count  Dtype
---  ------    --------------  -----
 0   datetime  10000 non-null  object
 1   id        10000 non-null  float64
 2   open      10000 non-null  float64
 3   high      10000 non-null  float64
 4   low       10000 non-null  float64
 5   close     10000 non-null  float64
 6   volume    10000 non-null  float64
 7   open_oi   10000 non-null  float64
 8   close_oi  10000 non-null  float64
 9   symbol    10000 non-null  object
 10  duration  10000 non-null  int64
 11  date      10000 non-null  object
dtypes: float64(8), int64(1), object(3)
memory usage: 937.6+ KB
None

Missing values per column:
datetime 0
id 0
open 0
high 0
low 0
close 0
volume 0
open_oi 0
close_oi 0
symbol 0
duration 0
date 0
dtype: int64

Descriptive statistics:

id open high low close volume open_oi close_oi duration
count 10000.00000 10000.000000 10000.000000 10000.000000 10000.000000 10000.000000 1.000000e+04 1.000000e+04 10000.0
mean 95619.50000 1798.731600 1801.590100 1795.792400 1798.714400 18270.753800 1.073193e+06 1.073204e+06 180.0
std 2886.89568 192.727222 192.958489 192.477165 192.697244 16468.048484 1.713484e+05 1.713394e+05 0.0
min 90620.00000 1511.000000 1517.000000 1508.000000 1511.000000 817.000000 5.486800e+05 5.486800e+05 180.0
25% 93119.75000 1651.000000 1654.000000 1648.000000 1651.000000 8140.750000 9.515280e+05 9.515445e+05 180.0
50% 95619.50000 1708.000000 1711.000000 1704.000000 1708.000000 13232.500000 1.114078e+06 1.114078e+06 180.0
75% 98119.25000 1953.000000 1956.000000 1948.000000 1953.000000 22584.750000 1.180854e+06 1.180854e+06 180.0
max 100619.00000 2248.000000 2251.000000 2242.000000 2248.000000 255318.000000 1.379724e+06 1.379724e+06 180.0

Converting datetime to a datetime type and setting it as the index

df['datetime'] = pd.to_datetime(df['datetime'])
df.set_index('datetime', inplace=True)
df.sort_index(inplace=True)

# Check the time range
print(f"Data time range: {df.index.min()} to {df.index.max()}")

Data time range: 2023-04-20 10:30:00+08:00 to 2023-08-28 22:12:00+08:00

Data preprocessing and feature engineering

# Create the target variable: the next bar's close price
df['target'] = df['close'].shift(-1)

# Drop the last row (its target is NaN). Copy so that the columns added
# below are written to an independent DataFrame rather than to a slice.
df = df.iloc[:-1].copy()

# Create technical-indicator features
def add_technical_indicators(df):
    # Simple moving averages
    df['SMA_5'] = df['close'].rolling(window=5).mean()
    df['SMA_10'] = df['close'].rolling(window=10).mean()
    df['SMA_20'] = df['close'].rolling(window=20).mean()

    # Exponential moving averages
    df['EMA_5'] = df['close'].ewm(span=5).mean()
    df['EMA_10'] = df['close'].ewm(span=10).mean()
    df['EMA_20'] = df['close'].ewm(span=20).mean()

    # Relative Strength Index (RSI), using simple 14-bar averages of gains and losses
    delta = df['close'].diff()
    gain = (delta.where(delta > 0, 0)).rolling(window=14).mean()
    loss = (-delta.where(delta < 0, 0)).rolling(window=14).mean()
    rs = gain / loss
    df['RSI'] = 100 - (100 / (1 + rs))

    # Moving Average Convergence Divergence (MACD)
    exp12 = df['close'].ewm(span=12).mean()
    exp26 = df['close'].ewm(span=26).mean()
    df['MACD'] = exp12 - exp26
    df['MACD_signal'] = df['MACD'].ewm(span=9).mean()
    df['MACD_hist'] = df['MACD'] - df['MACD_signal']

    # Bollinger Bands
    df['BB_middle'] = df['close'].rolling(window=20).mean()
    bb_std = df['close'].rolling(window=20).std()
    df['BB_upper'] = df['BB_middle'] + (bb_std * 2)
    df['BB_lower'] = df['BB_middle'] - (bb_std * 2)
    df['BB_width'] = (df['BB_upper'] - df['BB_lower']) / df['BB_middle']

    # Returns and their rolling volatility
    df['returns'] = df['close'].pct_change()
    df['volatility'] = df['returns'].rolling(window=20).std()

    return df

# Add the technical indicators
df = add_technical_indicators(df)

# Drop rows containing NaN (the warm-up period of the rolling indicators)
df.dropna(inplace=True)

print("Data shape after adding technical indicators:", df.shape)
df[['close', 'SMA_5', 'SMA_10', 'RSI', 'MACD', 'target']].head()

Data shape after adding technical indicators: (9958, 28)

datetime close SMA_5 SMA_10 RSI MACD target
2023-04-20 14:30:00+08:00 2225.0 2225.0 2224.4 50.000000 0.067594 2220.0
2023-04-20 14:33:00+08:00 2220.0 2224.8 2223.7 40.909091 -0.275750 2219.0
2023-04-20 14:36:00+08:00 2219.0 2224.0 2222.8 41.860465 -0.606698 2218.0
2023-04-20 14:39:00+08:00 2218.0 2222.2 2222.1 40.909091 -0.925325 2218.0
2023-04-20 14:42:00+08:00 2218.0 2220.0 2222.0 29.729730 -1.162220 2218.0
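
To make the target construction concrete, here is a small sketch on a made-up five-bar series (the prices are illustrative, not taken from the dataset). It shows that `shift(-1)` pairs each bar's features with the next bar's close, while a rolling feature only uses bars up to and including the current one:

```python
import pandas as pd

# Hypothetical close prices for five consecutive bars
toy = pd.DataFrame({"close": [100.0, 102.0, 101.0, 105.0, 104.0]})

toy["target"] = toy["close"].shift(-1)                # next bar's close
toy["SMA_3"] = toy["close"].rolling(window=3).mean()  # mean of bars t-2, t-1, t

# Drop the last row (no target) and the warm-up rows (no SMA yet)
toy = toy.iloc[:-1].dropna()
print(toy)
```

Only the rows where both the feature and the target are defined survive, which is the same reason the real DataFrame loses rows at both ends.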

Selecting the feature columns

feature_columns = ['open', 'high', 'low', 'close', 'volume', 'open_oi', 'close_oi',
                   'SMA_5', 'SMA_10', 'SMA_20', 'EMA_5', 'EMA_10', 'EMA_20',
                   'RSI', 'MACD', 'MACD_signal', 'MACD_hist',
                   'BB_middle', 'BB_upper', 'BB_lower', 'BB_width',
                   'returns', 'volatility']

X = df[feature_columns]
y = df['target']

print("Feature matrix shape:", X.shape)
print("Target shape:", y.shape)

Feature matrix shape: (9958, 23)
Target shape: (9958,)

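# For time series data, split the training and test sets in chronological order
train_size = int(0.8 * len(X))
X_train, X_test = X.iloc[:train_size], X.iloc[train_size:]
y_train, y_test = y.iloc[:train_size], y.iloc[train_size:]

print("Training set size:", X_train.shape)
print("Test set size:", X_test.shape)

Training set size: (7983, 23)
Test set size: (1996, 23)

(These sizes sum to 9979 rows rather than the 9958 printed above, so the two outputs appear to come from different runs of the notebook.)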

# Standardize the features (fit on the training set only, so no test-set statistics leak in)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
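
One way to make the fit-on-train-only rule hard to violate is to bundle the scaler and the model in a scikit-learn pipeline. The sketch below uses synthetic data and a Ridge model purely for illustration; it is not the setup used in the rest of this post:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic features and target (assumed data, for illustration only)
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=50.0, scale=10.0, size=(200, 3))
y_demo = X_demo @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)

split = int(0.8 * len(X_demo))  # chronological 80/20 split
pipe = make_pipeline(StandardScaler(), Ridge(alpha=1.0))

# fit() learns the scaler statistics from the training rows only;
# score() and predict() reuse them on the held-out rows.
pipe.fit(X_demo[:split], y_demo[:split])
test_r2 = pipe.score(X_demo[split:], y_demo[split:])

scaler_step = pipe.named_steps["standardscaler"]
print(scaler_step.mean_, test_r2)
```

A pipeline also slots directly into cross-validation utilities, which then refit the scaler inside every fold.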

Model training and evaluation

# Initialize the models
models = {
    'Linear Regression': LinearRegression(),
    'Ridge Regression': Ridge(alpha=1.0),
    'Lasso Regression': Lasso(alpha=0.1),
    'Random Forest': RandomForestRegressor(n_estimators=100, random_state=42),
    'Gradient Boosting': GradientBoostingRegressor(n_estimators=100, random_state=42),
    'SVR': SVR(kernel='rbf', C=1.0, gamma='scale'),
    'XGBoost': XGBRegressor(n_estimators=100, random_state=42),
    'LightGBM': LGBMRegressor(n_estimators=100, random_state=42),
    'CatBoost': CatBoostRegressor(iterations=100, verbose=0, random_state=42)
}

# Train and evaluate each model
results = {}
for name, model in models.items():
    print(f"Training {name}...")
    model.fit(X_train_scaled, y_train)
    y_pred = model.predict(X_test_scaled)

    # Compute the evaluation metrics
    mse = mean_squared_error(y_test, y_pred)
    mae = mean_absolute_error(y_test, y_pred)
    rmse = np.sqrt(mse)
    r2 = r2_score(y_test, y_pred)

    results[name] = {
        'MSE': mse,
        'MAE': mae,
        'RMSE': rmse,
        'R2': r2
    }

    print(f"{name} - RMSE: {rmse:.4f}, R2: {r2:.4f}")

# Collect the results in a DataFrame, sorted by RMSE
results_df = pd.DataFrame(results).T
results_df = results_df.sort_values('RMSE')
results_df

Training Linear Regression...
Linear Regression - RMSE: 4.8202, R2: 0.9990
Training Ridge Regression...
Ridge Regression - RMSE: 4.8699, R2: 0.9990
Training Lasso Regression...
Lasso Regression - RMSE: 5.1683, R2: 0.9989
Training Random Forest...
Random Forest - RMSE: 12.8100, R2: 0.9930
Training Gradient Boosting...
Gradient Boosting - RMSE: 12.6408, R2: 0.9932
Training SVR...
SVR - RMSE: 34.2561, R2: 0.9500
Training XGBoost...
XGBoost - RMSE: 14.6063, R2: 0.9909
Training LightGBM...
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000541 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 5865
[LightGBM] [Info] Number of data points in the train set: 7983, number of used features: 23
[LightGBM] [Info] Start training from score 1824.969059
LightGBM - RMSE: 12.9541, R2: 0.9928
Training CatBoost...
CatBoost - RMSE: 15.7077, R2: 0.9895

                           MSE        MAE       RMSE        R2
Linear Regression    23.234569   3.077902   4.820225  0.999009
Ridge Regression     23.715960   3.081490   4.869903  0.998989
Lasso Regression     26.710834   3.342289   5.168253  0.998861
Gradient Boosting   159.790863   7.861443  12.640841  0.993188
Random Forest       164.096094   7.890666  12.810000  0.993004
LightGBM            167.809114   7.983772  12.954116  0.992846
XGBoost             213.343612   9.307866  14.606287  0.990905
CatBoost            246.730999  10.522483  15.707673  0.989481
SVR                1173.478480  24.779806  34.256072  0.949972
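
An R² of 0.999 looks impressive, but when the target is the next bar's price level, simply repeating the current close already scores very high, because prices move little from one bar to the next relative to their overall range. The sketch below illustrates this on a synthetic random walk (assumed data, not this post's series). It is a sanity check worth running on the real data before reading much into the table above:

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Synthetic random-walk "close" series: unpredictable by construction
rng = np.random.default_rng(42)
close = 2000.0 + rng.normal(0.0, 1.0, size=5001).cumsum()

current, nxt = close[:-1], close[1:]
test = slice(4000, None)  # treat the last bars as the test set

# Naive persistence forecast on price levels: next close = current close
r2_level = r2_score(nxt[test], current[test])
rmse_level = np.sqrt(mean_squared_error(nxt[test], current[test]))

# The same naive idea on one-bar changes: next change = current change
change = np.diff(close)
r2_change = r2_score(change[1:][test], change[:-1][test])

print(f"levels:  R2={r2_level:.4f}, RMSE={rmse_level:.4f}")
print(f"changes: R2={r2_change:.4f}")
```

On the real data, the analogous check is `r2_score(y_test, X_test['close'])`. A model only adds value if it beats that baseline, and scoring on returns or price changes usually gives a more honest picture than scoring on levels.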

Visualizing model performance

# Plot the model performance comparison
fig, axes = plt.subplots(2, 2, figsize=(15, 10))

# RMSE comparison
results_df['RMSE'].plot(kind='bar', ax=axes[0, 0], color='skyblue')
axes[0, 0].set_title('Model RMSE comparison')
axes[0, 0].set_ylabel('RMSE')
axes[0, 0].tick_params(axis='x', rotation=45)

# R² comparison
results_df['R2'].plot(kind='bar', ax=axes[0, 1], color='lightgreen')
axes[0, 1].set_title('Model R² comparison')
axes[0, 1].set_ylabel('R²')
axes[0, 1].tick_params(axis='x', rotation=45)

# MAE comparison
results_df['MAE'].plot(kind='bar', ax=axes[1, 0], color='lightcoral')
axes[1, 0].set_title('Model MAE comparison')
axes[1, 0].set_ylabel('MAE')
axes[1, 0].tick_params(axis='x', rotation=45)

# MSE comparison
results_df['MSE'].plot(kind='bar', ax=axes[1, 1], color='gold')
axes[1, 1].set_title('Model MSE comparison')
axes[1, 1].set_ylabel('MSE')
axes[1, 1].tick_params(axis='x', rotation=45)

plt.tight_layout()
plt.show()

Visualizing the best model's predictions

# Select the best model (lowest RMSE)
best_model_name = results_df.index[0]
print(f"Best model: {best_model_name}")

# Refit the best model
best_model = models[best_model_name]
best_model.fit(X_train_scaled, y_train)
y_pred = best_model.predict(X_test_scaled)

# Plot the predictions against the actual values
plt.figure(figsize=(15, 7))
plt.plot(y_test.index, y_test.values, label='Actual', alpha=0.7)
plt.plot(y_test.index, y_pred, label='Predicted', alpha=0.7)
plt.title(f'{best_model_name} - Predicted vs. actual')
plt.xlabel('Time')
plt.ylabel('Price')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

# Plot the residuals
residuals = y_test - y_pred
plt.figure(figsize=(12, 6))
plt.scatter(y_pred, residuals, alpha=0.5)
plt.axhline(y=0, color='r', linestyle='--')
plt.title('Residual plot')
plt.xlabel('Predicted value')
plt.ylabel('Residual')
plt.grid(True, alpha=0.3)
plt.show()

Feature importance analysis (for tree models)

# For tree models, analyze the feature importances
if hasattr(best_model, 'feature_importances_'):
    feature_importance = pd.DataFrame({
        'feature': feature_columns,
        'importance': best_model.feature_importances_
    })
    feature_importance = feature_importance.sort_values('importance', ascending=False)

    plt.figure(figsize=(12, 8))
    plt.barh(feature_importance['feature'][:15], feature_importance['importance'][:15])
    plt.title('Top 15 feature importances')
    plt.xlabel('Importance')
    plt.tight_layout()
    plt.show()

elif hasattr(best_model, 'coef_'):
    # For linear models, use the coefficients (fitted on standardized features)
    feature_importance = pd.DataFrame({
        'feature': feature_columns,
        'coefficient': best_model.coef_
    })
    feature_importance = feature_importance.sort_values('coefficient', key=abs, ascending=False)

    plt.figure(figsize=(12, 8))
    plt.barh(feature_importance['feature'][:15], feature_importance['coefficient'][:15])
    plt.title('Top 15 feature coefficients (sorted by absolute value)')
    plt.xlabel('Coefficient')
    plt.tight_layout()
    plt.show()
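
Many of these features are strongly correlated (the moving averages track close, and BB_middle is computed exactly like SMA_20), so individual linear coefficients can be large, opposite in sign, and hard to interpret. Permutation importance on held-out data is one model-agnostic alternative. A minimal sketch on synthetic data using scikit-learn's `permutation_importance`; the feature names are invented for the example:

```python
import numpy as np
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LinearRegression

# Synthetic data: one strong driver, one weak driver, one pure-noise feature
rng = np.random.default_rng(0)
n = 500
X_demo = rng.normal(size=(n, 3))
y_demo = 3.0 * X_demo[:, 0] + 0.5 * X_demo[:, 1] + rng.normal(scale=0.1, size=n)
names = ["strong", "weak", "noise"]  # hypothetical feature names

model = LinearRegression().fit(X_demo[:400], y_demo[:400])

# Shuffle one column at a time on the held-out rows and measure the drop in R²
result = permutation_importance(
    model, X_demo[400:], y_demo[400:], n_repeats=10, random_state=0
)

ranking = sorted(zip(names, result.importances_mean), key=lambda kv: kv[1], reverse=True)
for name, score in ranking:
    print(f"{name}: {score:.4f}")
```

With highly correlated features, permutation importance is shared among them as well, so it is worth dropping exact duplicates before interpreting either measure.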

Time series cross-validation

# Use time series cross-validation to assess model stability
tscv = TimeSeriesSplit(n_splits=5)
model = GradientBoostingRegressor(n_estimators=100, random_state=42)

cv_scores = []
for train_index, test_index in tscv.split(X):
    X_train_cv, X_test_cv = X.iloc[train_index], X.iloc[test_index]
    y_train_cv, y_test_cv = y.iloc[train_index], y.iloc[test_index]

    # Standardize with a fresh scaler per fold, so the `scaler` fitted on the
    # training set earlier is not overwritten
    cv_scaler = StandardScaler()
    X_train_cv_scaled = cv_scaler.fit_transform(X_train_cv)
    X_test_cv_scaled = cv_scaler.transform(X_test_cv)

    model.fit(X_train_cv_scaled, y_train_cv)
    y_pred_cv = model.predict(X_test_cv_scaled)
    score = r2_score(y_test_cv, y_pred_cv)
    cv_scores.append(score)

print("Cross-validation R² scores:", cv_scores)
print("Mean R² score:", np.mean(cv_scores))
print("Standard deviation of R² scores:", np.std(cv_scores))

plt.figure(figsize=(10, 6))
plt.plot(range(1, len(cv_scores) + 1), cv_scores, marker='o')
plt.title('Time series cross-validation performance')
plt.xlabel('Fold')
plt.ylabel('R² score')
plt.grid(True, alpha=0.3)
plt.show()
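
For readers new to `TimeSeriesSplit`: each fold trains on an expanding window of earlier rows and tests on the block that immediately follows, so no fold ever trains on the future. A small sketch on 12 dummy rows makes the index layout visible:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X_demo = np.arange(12).reshape(-1, 1)  # 12 dummy rows in time order

folds = list(TimeSeriesSplit(n_splits=5).split(X_demo))
for i, (train_idx, test_idx) in enumerate(folds, start=1):
    print(f"fold {i}: train={train_idx.tolist()} test={test_idx.tolist()}")
```

Because early folds train on very little data, their scores are often noticeably worse than later ones. That spread is part of what the plot of per-fold R² is meant to show.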

Hyperparameter tuning is of course also possible, but I won't go further here.
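
As a pointer for that next step, here is a hedged sketch of how tuning could look with `GridSearchCV` and `TimeSeriesSplit`, so that every candidate is scored only on data later than what it was trained on. The data is synthetic, and the model and alpha grid are arbitrary choices for the example:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression problem (assumed data, for illustration only)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 4))
y_demo = X_demo @ np.array([2.0, -1.0, 0.0, 0.5]) + rng.normal(scale=0.5, size=300)

search = GridSearchCV(
    estimator=make_pipeline(StandardScaler(), Ridge()),
    param_grid={"ridge__alpha": [0.01, 0.1, 1.0, 10.0]},  # arbitrary grid
    cv=TimeSeriesSplit(n_splits=5),                       # chronological folds
    scoring="neg_root_mean_squared_error",
)
search.fit(X_demo, y_demo)

print("best params:", search.best_params_)
print("best CV RMSE:", -search.best_score_)
```

Putting the scaler inside the pipeline means it is refitted within each fold, which avoids leaking statistics from the validation block into training.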

Model deployment and prediction

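# Use the best model to predict future prices
def predict_future_price(model, last_known_data, scaler, steps=1):
    """
    Predict future prices with a trained model.

    Parameters:
        model: the trained model
        last_known_data: the last known data point (single-row DataFrame of features)
        scaler: the scaler used for standardization
        steps: number of steps to predict

    Returns:
        predictions: list of predicted values
    """
    predictions = []
    current_data = last_known_data.copy()
    close_pos = current_data.columns.get_loc('close')  # column position of close

    for _ in range(steps):
        # Standardize the current data
        current_data_scaled = scaler.transform(current_data)

        # Predict the next step
        next_price = model.predict(current_data_scaled)[0]
        predictions.append(next_price)

        # Update the current data (adjust this to your actual setup).
        # A real application needs more elaborate logic to update the other features.
        current_data.iloc[0, close_pos] = next_price  # update the close price

    return predictions

# Example: predict from the last row of data
last_known = X.iloc[[-1]].copy()
future_predictions = predict_future_price(best_model, last_known, scaler, steps=5)

print("Predicted prices for the next 5 periods:", future_predictions)

Predicted prices for the next 5 periods: [np.float64(1832.2961094333193), np.float64(1840.6007680672349), np.float64(1847.2991579272611), np.float64(1852.7019599802757), np.float64(1857.0597639651191)]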
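
The function above only overwrites `close` between steps, so every other feature (moving averages, RSI, MACD and so on) stays frozen at its last known value. One way to make multi-step forecasting more consistent is to append each prediction to the price history and recompute the indicators from it. The sketch below does this for a deliberately tiny feature set that depends only on `close`. `make_features` and `recursive_forecast` are hypothetical helpers, the data is a synthetic random walk, and inputs such as volume or open interest would need forecasts of their own:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

def make_features(close: pd.Series) -> pd.DataFrame:
    """Indicators that can be recomputed from the close history alone."""
    return pd.DataFrame({
        "close": close,
        "SMA_5": close.rolling(window=5).mean(),
        "returns": close.pct_change(),
    })

def recursive_forecast(model, close: pd.Series, steps: int) -> list:
    """Predict `steps` bars ahead, feeding each prediction back into the history."""
    history = close.reset_index(drop=True)
    preds = []
    for _ in range(steps):
        latest = make_features(history).iloc[[-1]]  # features of the newest bar
        next_close = float(model.predict(latest)[0])
        preds.append(next_close)
        history = pd.concat([history, pd.Series([next_close])], ignore_index=True)
    return preds

# Synthetic close prices (assumed data, for illustration only)
rng = np.random.default_rng(0)
close = pd.Series(2000.0 + rng.normal(0.0, 2.0, size=300).cumsum())

# Train on rows where both the features and the next-bar target exist
features = make_features(close)
target = close.shift(-1)
valid = features.notna().all(axis=1) & target.notna()
model = LinearRegression().fit(features[valid], target[valid])

future = recursive_forecast(model, close, steps=5)
print(future)
```

Errors compound with every step in a recursive forecast, so predictions several bars out should be treated with more suspicion than the one-step figure.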

Whether any of this is actually correct or usable in practice, you'll have to check for yourself, haha~
