Testing Futures Price-Volume Factor Models

Investing · LY · 2025-09-16

Preface

This post uses machine-learning algorithms to test how well common price-volume factors predict price. The factors involved are:

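Price-trend factors
- Moving averages
  - SMA_5
  - SMA_10
  - SMA_20
  - EMA_5
  - EMA_10
  - EMA_20
- Momentum indicators
  - MACD
  - MACD_signal
  - MACD_hist
  - RSI
Volatility factors
- BB_middle, BB_upper, BB_lower, BB_width (Bollinger Bands)
- volatility (20-bar rolling standard deviation of returns)
Other related factors
- volume
- open_oi
- close_oi
- returns (rate of return)
Base fields
- open
- high
- low
- close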

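The Chinese labels in the original figures rendered as garbled characters (presumably a missing-font issue). I did not bother fixing it, and it does not affect the reading.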

For the data source, see this article: 金融数据采集 (Financial Data Collection) | FlyDay

Importing the required libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, TimeSeriesSplit, GridSearchCV
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.svm import SVR
from sklearn.neural_network import MLPRegressor
from xgboost import XGBRegressor
from lightgbm import LGBMRegressor
from catboost import CatBoostRegressor
import warnings
warnings.filterwarnings('ignore')

# Set the plotting style
plt.style.use('seaborn-v0_8')
sns.set_palette("husl")
%matplotlib inline

Loading the data

file_path = "file/KQ_m_CZCE_SA_3m.csv"
df = pd.read_csv(file_path)
print("Data shape:", df.shape)
print("\nFirst 5 rows:")
df.head()

Data shape: (10000, 12)

datetime	id	open	high	low	close	volume	open_oi	close_oi	symbol	duration	date
0	2023-04-20 10:30:00+08:00	90620.0	2224.0	2225.0	2217.0	2218.0	44242.0	879082.0	886347.0	KQ.m@CZCE.SA	180	2023-04-20
1	2023-04-20 10:33:00+08:00	90621.0	2218.0	2220.0	2213.0	2214.0	37043.0	886347.0	887119.0	KQ.m@CZCE.SA	180	2023-04-20
2	2023-04-20 10:36:00+08:00	90622.0	2214.0	2216.0	2207.0	2210.0	55160.0	887119.0	886485.0	KQ.m@CZCE.SA	180	2023-04-20
3	2023-04-20 10:39:00+08:00	90623.0	2210.0	2211.0	2204.0	2205.0	42947.0	886485.0	887544.0	KQ.m@CZCE.SA	180	2023-04-20
4	2023-04-20 10:42:00+08:00	90624.0	2205.0	2208.0	2200.0	2202.0	48951.0	887544.0	891117.0	KQ.m@CZCE.SA	180	2023-04-20

Basic data information

print("Basic data information:")
print(df.info())
print("\nMissing-value counts:")
print(df.isnull().sum())
print("\nDescriptive statistics:")
df.describe()

Basic data information:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 12 columns):

 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   datetime  10000 non-null  object 
 1   id        10000 non-null  float64
 2   open      10000 non-null  float64
 3   high      10000 non-null  float64
 4   low       10000 non-null  float64
 5   close     10000 non-null  float64
 6   volume    10000 non-null  float64
 7   open_oi   10000 non-null  float64
 8   close_oi  10000 non-null  float64
 9   symbol    10000 non-null  object 
 10  duration  10000 non-null  int64  
 11  date      10000 non-null  object 
dtypes: float64(8), int64(1), object(3)
memory usage: 937.6+ KB
None

Missing-value counts:
datetime    0
id          0
open        0
high        0
low         0
close       0
volume      0
open_oi     0
close_oi    0
symbol      0
duration    0
date        0
dtype: int64

Descriptive statistics:

	id	open	high	low	close	volume	open_oi	close_oi	duration
count	10000.00000	10000.000000	10000.000000	10000.000000	10000.000000	10000.000000	1.000000e+04	1.000000e+04	10000.0
mean	95619.50000	1798.731600	1801.590100	1795.792400	1798.714400	18270.753800	1.073193e+06	1.073204e+06	180.0
std	2886.89568	192.727222	192.958489	192.477165	192.697244	16468.048484	1.713484e+05	1.713394e+05	0.0
min	90620.00000	1511.000000	1517.000000	1508.000000	1511.000000	817.000000	5.486800e+05	5.486800e+05	180.0
25%	93119.75000	1651.000000	1654.000000	1648.000000	1651.000000	8140.750000	9.515280e+05	9.515445e+05	180.0
50%	95619.50000	1708.000000	1711.000000	1704.000000	1708.000000	13232.500000	1.114078e+06	1.114078e+06	180.0
75%	98119.25000	1953.000000	1956.000000	1948.000000	1953.000000	22584.750000	1.180854e+06	1.180854e+06	180.0
max	100619.00000	2248.000000	2251.000000	2242.000000	2248.000000	255318.000000	1.379724e+06	1.379724e+06	180.0

Convert datetime to a datetime type and set it as the index

df['datetime'] = pd.to_datetime(df['datetime'])
df.set_index('datetime', inplace=True)
df.sort_index(inplace=True)

# Check the time range
print(f"Data time range: {df.index.min()} to {df.index.max()}")

Data time range: 2023-04-20 10:30:00+08:00 to 2023-08-28 22:12:00+08:00

Data preprocessing and feature engineering

# Create the target variable: the next bar's closing price
df['target'] = df['close'].shift(-1)

# Drop the last row (its target is NaN); copy so later column assignments do not write to a slice
df = df.iloc[:-1].copy()

# Build technical-indicator features
def add_technical_indicators(df):
    # Simple moving averages
    df['SMA_5'] = df['close'].rolling(window=5).mean()
    df['SMA_10'] = df['close'].rolling(window=10).mean()
    df['SMA_20'] = df['close'].rolling(window=20).mean()

    # Exponential moving averages
    df['EMA_5'] = df['close'].ewm(span=5).mean()
    df['EMA_10'] = df['close'].ewm(span=10).mean()
    df['EMA_20'] = df['close'].ewm(span=20).mean()

    # Relative Strength Index (RSI), simple rolling-mean variant over 14 bars
    delta = df['close'].diff()
    gain = (delta.where(delta > 0, 0)).rolling(window=14).mean()
    loss = (-delta.where(delta < 0, 0)).rolling(window=14).mean()
    rs = gain / loss
    df['RSI'] = 100 - (100 / (1 + rs))

    # Moving Average Convergence Divergence (MACD)
    exp12 = df['close'].ewm(span=12).mean()
    exp26 = df['close'].ewm(span=26).mean()
    df['MACD'] = exp12 - exp26
    df['MACD_signal'] = df['MACD'].ewm(span=9).mean()
    df['MACD_hist'] = df['MACD'] - df['MACD_signal']

    # Bollinger Bands (BB_middle is identical to SMA_20)
    df['BB_middle'] = df['close'].rolling(window=20).mean()
    bb_std = df['close'].rolling(window=20).std()
    df['BB_upper'] = df['BB_middle'] + (bb_std * 2)
    df['BB_lower'] = df['BB_middle'] - (bb_std * 2)
    df['BB_width'] = (df['BB_upper'] - df['BB_lower']) / df['BB_middle']

    # Returns and their 20-bar rolling volatility
    df['returns'] = df['close'].pct_change()
    df['volatility'] = df['returns'].rolling(window=20).std()

    return df

# Add the technical indicators
df = add_technical_indicators(df)

# Drop rows containing NaN (the warm-up period of the rolling indicators)
df.dropna(inplace=True)

print("Data shape after adding technical indicators:", df.shape)
df[['close', 'SMA_5', 'SMA_10', 'RSI', 'MACD', 'target']].head()

Data shape after adding technical indicators: (9958, 28)

datetime	close	SMA_5	SMA_10	RSI	MACD	target
2023-04-20 14:30:00+08:00	2225.0	2225.0	2224.4	50.000000	0.067594	2220.0
2023-04-20 14:33:00+08:00	2220.0	2224.8	2223.7	40.909091	-0.275750	2219.0
2023-04-20 14:36:00+08:00	2219.0	2224.0	2222.8	41.860465	-0.606698	2218.0
2023-04-20 14:39:00+08:00	2218.0	2222.2	2222.1	40.909091	-0.925325	2218.0
2023-04-20 14:42:00+08:00	2218.0	2220.0	2222.0	29.729730	-1.162220	2218.0
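
The RSI above averages gains and losses with a plain 14-bar rolling mean. Many charting packages use Wilder's smoothing instead, so the values here may not match what a trading terminal shows. A minimal sketch of that variant follows; `rsi_wilder` is a hypothetical helper that is not part of the original notebook, and it seeds the average from the first observation rather than from an initial SMA, so early values are only approximate.

```python
import numpy as np
import pandas as pd

def rsi_wilder(close: pd.Series, period: int = 14) -> pd.Series:
    """RSI with Wilder-style smoothing (an EMA with alpha = 1 / period)."""
    delta = close.diff()
    gain = delta.clip(lower=0)            # positive moves, zero otherwise
    loss = (-delta).clip(lower=0)         # size of negative moves, zero otherwise
    avg_gain = gain.ewm(alpha=1 / period, adjust=False, min_periods=period).mean()
    avg_loss = loss.ewm(alpha=1 / period, adjust=False, min_periods=period).mean()
    rs = avg_gain / avg_loss              # inf when there were no losses
    rsi = 100 - 100 / (1 + rs)            # inf maps to an RSI of 100
    # A completely flat window gives 0 / 0 = NaN; treat it as the neutral value 50.
    flat = (avg_gain == 0) & (avg_loss == 0)
    return rsi.where(~flat, 50.0)

# Synthetic prices, for illustration only
rng = np.random.default_rng(0)
demo_close = pd.Series(2000 + rng.normal(0, 3, 200).cumsum())
demo_rsi = rsi_wilder(demo_close)
print(demo_rsi.tail())
```

Swapping this in for the rolling-mean version changes only the `RSI` column; whether it helps the models is something to test rather than assume.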

Selecting the feature columns

feature_columns = ['open', 'high', 'low', 'close', 'volume', 'open_oi', 'close_oi', 
                   'SMA_5', 'SMA_10', 'SMA_20', 'EMA_5', 'EMA_10', 'EMA_20', 
                   'RSI', 'MACD', 'MACD_signal', 'MACD_hist', 
                   'BB_middle', 'BB_upper', 'BB_lower', 'BB_width', 
                   'returns', 'volatility']

X = df[feature_columns]
y = df['target']

print("Feature matrix shape:", X.shape)
print("Target variable shape:", y.shape)

Feature matrix shape: (9958, 23)
Target variable shape: (9958,)

# For time-series data, split into training and test sets in chronological order
train_size = int(0.8 * len(X))
X_train, X_test = X.iloc[:train_size], X.iloc[train_size:]
y_train, y_test = y.iloc[:train_size], y.iloc[train_size:]

print("Training set size:", X_train.shape)
print("Test set size:", X_test.shape)

Training set size: (7983, 23)
Test set size: (1996, 23)

Note that 7983 + 1996 = 9979 rows, not the 9958 printed earlier. The outputs appear to come from different notebook runs: executing the preprocessing cell a second time drops one more row for the target and 20 more for the rolling windows, which would account for the 21-row gap.

# Standardize the features (fit on the training set only)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

Model training and evaluation

# Initialize the models
models = {
    'Linear Regression': LinearRegression(),
    'Ridge Regression': Ridge(alpha=1.0),
    'Lasso Regression': Lasso(alpha=0.1),
    'Random Forest': RandomForestRegressor(n_estimators=100, random_state=42),
    'Gradient Boosting': GradientBoostingRegressor(n_estimators=100, random_state=42),
    'SVR': SVR(kernel='rbf', C=1.0, gamma='scale'),
    'XGBoost': XGBRegressor(n_estimators=100, random_state=42),
    'LightGBM': LGBMRegressor(n_estimators=100, random_state=42),
    'CatBoost': CatBoostRegressor(iterations=100, verbose=0, random_state=42)
}

# Train and evaluate the models
results = {}
for name, model in models.items():
    print(f"Training {name}...")
    model.fit(X_train_scaled, y_train)
    y_pred = model.predict(X_test_scaled)

    # Compute the evaluation metrics
    mse = mean_squared_error(y_test, y_pred)
    mae = mean_absolute_error(y_test, y_pred)
    rmse = np.sqrt(mse)
    r2 = r2_score(y_test, y_pred)

    results[name] = {
        'MSE': mse,
        'MAE': mae,
        'RMSE': rmse,
        'R2': r2
    }

    print(f"{name} - RMSE: {rmse:.4f}, R2: {r2:.4f}")

# Build the results DataFrame, best (lowest RMSE) first
results_df = pd.DataFrame(results).T
results_df = results_df.sort_values('RMSE')
results_df

Training Linear Regression...
Linear Regression - RMSE: 4.8202, R2: 0.9990
Training Ridge Regression...
Ridge Regression - RMSE: 4.8699, R2: 0.9990
Training Lasso Regression...
Lasso Regression - RMSE: 5.1683, R2: 0.9989
Training Random Forest...
Random Forest - RMSE: 12.8100, R2: 0.9930
Training Gradient Boosting...
Gradient Boosting - RMSE: 12.6408, R2: 0.9932
Training SVR...
SVR - RMSE: 34.2561, R2: 0.9500
Training XGBoost...
XGBoost - RMSE: 14.6063, R2: 0.9909
Training LightGBM...
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000541 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 5865
[LightGBM] [Info] Number of data points in the train set: 7983, number of used features: 23
[LightGBM] [Info] Start training from score 1824.969059
LightGBM - RMSE: 12.9541, R2: 0.9928
Training CatBoost...
CatBoost - RMSE: 15.7077, R2: 0.9895

	MSE	MAE	RMSE	R2
Linear Regression	23.234569	3.077902	4.820225	0.999009
Ridge Regression	23.715960	3.081490	4.869903	0.998989
Lasso Regression	26.710834	3.342289	5.168253	0.998861
Gradient Boosting	159.790863	7.861443	12.640841	0.993188
Random Forest	164.096094	7.890666	12.810000	0.993004
LightGBM	167.809114	7.983772	12.954116	0.992846
XGBoost	213.343612	9.307866	14.606287	0.990905
CatBoost	246.730999	10.522483	15.707673	0.989481
SVR	1173.478480	24.779806	34.256072	0.949972
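
An R² near 0.999 on price levels is easy to reach, because adjacent 3-minute closes are almost identical. Before reading much into the table, it helps to score a naive "next close = current close" forecast on the same test window. The sketch below does this on a synthetic random walk, not on the post's data; `naive_baseline_report` and `demo_close` are hypothetical names that do not appear in the original notebook.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import mean_squared_error, r2_score

def naive_baseline_report(close: pd.Series, test_fraction: float = 0.2) -> dict:
    """Score the persistence forecast on the last `test_fraction` of the series."""
    target = close.shift(-1).iloc[:-1]     # next-bar close, as in the post
    last_close = close.iloc[:-1]           # persistence forecast: repeat the current close
    start = int((1 - test_fraction) * len(target))
    y_true = target.iloc[start:]
    y_naive = last_close.iloc[start:]
    return {
        "RMSE": float(np.sqrt(mean_squared_error(y_true, y_naive))),
        "R2": float(r2_score(y_true, y_naive)),
    }

# Synthetic random walk with a per-bar standard deviation of 4 points
rng = np.random.default_rng(0)
demo_close = pd.Series(1800 + rng.normal(0, 4, 5000).cumsum())
report = naive_baseline_report(demo_close)
print(report)
```

On the post's data the equivalent check would be scoring `X_test['close']` against `y_test`. A model adds information only if its RMSE is clearly below that baseline.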

Visualizing model performance

# Plot the model performance comparison
fig, axes = plt.subplots(2, 2, figsize=(15, 10))

# RMSE comparison
results_df['RMSE'].plot(kind='bar', ax=axes[0, 0], color='skyblue')
axes[0, 0].set_title('Model RMSE comparison')
axes[0, 0].set_ylabel('RMSE')
axes[0, 0].tick_params(axis='x', rotation=45)

# R² comparison
results_df['R2'].plot(kind='bar', ax=axes[0, 1], color='lightgreen')
axes[0, 1].set_title('Model R² comparison')
axes[0, 1].set_ylabel('R²')
axes[0, 1].tick_params(axis='x', rotation=45)

# MAE comparison
results_df['MAE'].plot(kind='bar', ax=axes[1, 0], color='lightcoral')
axes[1, 0].set_title('Model MAE comparison')
axes[1, 0].set_ylabel('MAE')
axes[1, 0].tick_params(axis='x', rotation=45)

# MSE comparison
results_df['MSE'].plot(kind='bar', ax=axes[1, 1], color='gold')
axes[1, 1].set_title('Model MSE comparison')
axes[1, 1].set_ylabel('MSE')
axes[1, 1].tick_params(axis='x', rotation=45)

plt.tight_layout()
plt.show()

Visualizing the best model's predictions

# Pick the best model (lowest RMSE)
best_model_name = results_df.index[0]
print(f"Best model: {best_model_name}")

# Refit the best model
best_model = models[best_model_name]
best_model.fit(X_train_scaled, y_train)
y_pred = best_model.predict(X_test_scaled)

# Plot the predictions
plt.figure(figsize=(15, 7))
plt.plot(y_test.index, y_test.values, label='Actual', alpha=0.7)
plt.plot(y_test.index, y_pred, label='Predicted', alpha=0.7)
plt.title(f'{best_model_name} - Predicted vs. actual')
plt.xlabel('Time')
plt.ylabel('Price')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

# Plot the residuals
residuals = y_test - y_pred
plt.figure(figsize=(12, 6))
plt.scatter(y_pred, residuals, alpha=0.5)
plt.axhline(y=0, color='r', linestyle='--')
plt.title('Residual plot')
plt.xlabel('Predicted value')
plt.ylabel('Residual')
plt.grid(True, alpha=0.3)
plt.show()
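
A predicted-vs-actual line chart on price levels will look nearly perfect for any model that stays close to the last close. A complementary check is whether the predicted move from the current close has the same sign as the realized move. The helper below is a minimal sketch of such a hit-rate metric; `directional_accuracy` is a hypothetical name, and excluding bars with no price change is an assumption made here.

```python
import numpy as np

def directional_accuracy(current_close, y_true, y_pred) -> float:
    """Share of bars where predicted and realized moves share a sign (flat bars excluded)."""
    current_close = np.asarray(current_close, dtype=float)
    actual_move = np.asarray(y_true, dtype=float) - current_close
    predicted_move = np.asarray(y_pred, dtype=float) - current_close
    moved = actual_move != 0               # skip bars where the price did not change
    if not moved.any():
        return float("nan")
    hits = np.sign(actual_move[moved]) == np.sign(predicted_move[moved])
    return float(hits.mean())

# Toy example: realized moves are +1, +2, -1, 0; predicted moves are +0.5, -1, -0.5, +1
acc = directional_accuracy(
    current_close=[100, 101, 103, 102],
    y_true=[101, 103, 102, 102],
    y_pred=[100.5, 100.0, 102.5, 103.0],
)
print(acc)  # 2 of the 3 non-flat bars have matching signs
```

With the notebook's variables the call would be `directional_accuracy(X_test['close'], y_test, y_pred)`. A value near 0.5 would suggest the model has little directional information despite a high R².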

Feature importance analysis (tree models, with a fallback for linear models)

# For tree models, use the built-in feature importances
if hasattr(best_model, 'feature_importances_'):
    feature_importance = pd.DataFrame({
        'feature': feature_columns,
        'importance': best_model.feature_importances_
    })
    feature_importance = feature_importance.sort_values('importance', ascending=False)

    plt.figure(figsize=(12, 8))
    plt.barh(feature_importance['feature'][:15], feature_importance['importance'][:15])
    plt.title('Top 15 feature importances')
    plt.xlabel('Importance')
    plt.tight_layout()
    plt.show()

elif hasattr(best_model, 'coef_'):
    # For linear models, use the coefficients (on standardized features)
    feature_importance = pd.DataFrame({
        'feature': feature_columns,
        'coefficient': best_model.coef_
    })
    feature_importance = feature_importance.sort_values('coefficient', key=abs, ascending=False)

    plt.figure(figsize=(12, 8))
    plt.barh(feature_importance['feature'][:15], feature_importance['coefficient'][:15])
    # Bars show signed coefficients; only the ordering uses the absolute value
    plt.title('Top 15 feature coefficients (sorted by absolute value)')
    plt.xlabel('Coefficient')
    plt.tight_layout()
    plt.show()
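
Built-in importances and raw coefficients are model-specific, and the coefficients in particular become unstable when features are strongly correlated, as the moving averages here are. Permutation importance is a model-agnostic alternative: shuffle one feature on held-out data and measure how much the score drops. The sketch below uses scikit-learn's `permutation_importance` on synthetic data; the feature names `f0` to `f2` and the demo arrays are illustrative only.

```python
import numpy as np
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LinearRegression

# Synthetic data: f0 matters most, f1 a little, f2 not at all
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(600, 3))
y_demo = 3.0 * X_demo[:, 0] + 0.5 * X_demo[:, 1] + rng.normal(scale=0.1, size=600)
names = ["f0", "f1", "f2"]

split = 480                                # chronological-style split: first 80% for training
demo_model = LinearRegression().fit(X_demo[:split], y_demo[:split])

# Shuffle each feature on the held-out part and record the drop in R²
result = permutation_importance(
    demo_model, X_demo[split:], y_demo[split:], n_repeats=10, random_state=0
)
order = np.argsort(result.importances_mean)[::-1]
ranking = [names[i] for i in order]
for i in order:
    print(f"{names[i]}: {result.importances_mean[i]:.4f} +/- {result.importances_std[i]:.4f}")
```

In the notebook this would be called with `best_model`, `X_test_scaled`, and `y_test`. With highly collinear inputs, the importance tends to be spread across the correlated group, so the ranking should be read with that in mind.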

Time-series cross-validation

# Evaluate model stability with time-series cross-validation
tscv = TimeSeriesSplit(n_splits=5)
model = GradientBoostingRegressor(n_estimators=100, random_state=42)

cv_scores = []
for train_index, test_index in tscv.split(X):
    X_train_cv, X_test_cv = X.iloc[train_index], X.iloc[test_index]
    y_train_cv, y_test_cv = y.iloc[train_index], y.iloc[test_index]

    # Standardize with a fold-local scaler so the scaler fitted on X_train is not overwritten
    cv_scaler = StandardScaler()
    X_train_cv_scaled = cv_scaler.fit_transform(X_train_cv)
    X_test_cv_scaled = cv_scaler.transform(X_test_cv)

    model.fit(X_train_cv_scaled, y_train_cv)
    y_pred_cv = model.predict(X_test_cv_scaled)
    score = r2_score(y_test_cv, y_pred_cv)
    cv_scores.append(score)

print("Cross-validation R² scores:", cv_scores)
print("Mean R² score:", np.mean(cv_scores))
print("R² score standard deviation:", np.std(cv_scores))

plt.figure(figsize=(10, 6))
plt.plot(range(1, len(cv_scores) + 1), cv_scores, marker='o')
plt.title('Time-series cross-validation performance')
plt.xlabel('Fold')
plt.ylabel('R² score')
plt.grid(True, alpha=0.3)
plt.show()

Hyperparameter tuning is of course also possible, but I won't go into it here.
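
For readers who do want to tune, one common pattern is to combine `GridSearchCV` with `TimeSeriesSplit` so that every validation fold lies after its training fold. Wrapping the scaler and the model in a `Pipeline` means the scaler is refit inside each fold. The following is only a sketch on synthetic data: the choice of `Ridge`, the `alpha` grid, and the demo arrays are assumptions for illustration, not settings from the post.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression problem standing in for the feature matrix
rng = np.random.default_rng(7)
X_demo = rng.normal(size=(400, 5))
y_demo = X_demo @ np.array([2.0, -1.0, 0.0, 0.5, 0.0]) + rng.normal(scale=0.2, size=400)

# Scaler and model in one estimator, so scaling is learned per training fold
pipe = Pipeline([("scaler", StandardScaler()), ("model", Ridge())])
param_grid = {"model__alpha": [0.01, 0.1, 1.0, 10.0]}

search = GridSearchCV(
    pipe,
    param_grid,
    cv=TimeSeriesSplit(n_splits=5),            # folds respect time order
    scoring="neg_root_mean_squared_error",     # higher is better, so RMSE is negated
)
search.fit(X_demo, y_demo)

best_alpha = search.best_params_["model__alpha"]
best_rmse = -search.best_score_
print(f"best alpha: {best_alpha}, CV RMSE: {best_rmse:.4f}")
```

In the notebook, `X_train` and `y_train` (unscaled) would take the place of the demo arrays, leaving the final test window untouched for a last check.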

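Model deployment and prediction

# Use the best model to predict future prices
def predict_future_price(model, last_known_data, scaler, steps=1):
    """
    Predict future prices with a trained model.

    Parameters:
    model: the trained model
    last_known_data: the last known data point (a one-row DataFrame)
    scaler: the scaler used for standardization
    steps: number of steps to predict

    Returns:
    predictions: list of predicted values
    """
    predictions = []
    current_data = last_known_data.copy()
    close_col = current_data.columns.get_loc('close')

    for _ in range(steps):
        # Standardize the current data
        current_data_scaled = scaler.transform(current_data)

        # Predict the next step
        next_price = model.predict(current_data_scaled)[0]
        predictions.append(next_price)

        # Update the current data: only close is refreshed here, all other
        # features stay stale; a real application needs logic to update them too
        current_data.iloc[0, close_col] = next_price

    return predictions

# Example: predict from the last row of data
last_known = X.iloc[[-1]].copy()
future_predictions = predict_future_price(best_model, last_known, scaler, steps=5)

print("Predicted prices for the next 5 periods:", future_predictions)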

Predicted prices for the next 5 periods: [np.float64(1832.2961094333193), np.float64(1840.6007680672349), np.float64(1847.2991579272611), np.float64(1852.7019599802757), np.float64(1857.0597639651191)]
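
Because the function above only overwrites `close` and leaves every other feature frozen, multi-step forecasts drift away from a consistent feature state. One way to keep the inputs coherent is to append each prediction to the price history and rebuild the features from that history at every step. The sketch below illustrates the idea under a strong simplification: it uses a reduced, hypothetical feature set derived from `close` alone, since volume and open interest cannot be rolled forward this way. `make_features` and `recursive_forecast` are illustrative names, and the data is synthetic.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

def make_features(close: pd.Series) -> pd.DataFrame:
    """Features computed from the close series only (reduced, hypothetical set)."""
    return pd.DataFrame({
        "close": close,
        "SMA_5": close.rolling(5).mean(),
        "EMA_5": close.ewm(span=5).mean(),
        "returns": close.pct_change(),
    })

def recursive_forecast(model, close: pd.Series, steps: int = 5) -> list:
    """Predict one bar at a time, appending each prediction and rebuilding the features."""
    history = close.reset_index(drop=True).copy()
    predictions = []
    for _ in range(steps):
        latest = make_features(history).iloc[[-1]]      # feature row for the newest bar
        next_close = float(model.predict(latest)[0])
        predictions.append(next_close)
        history = pd.concat([history, pd.Series([next_close])], ignore_index=True)
    return predictions

# Synthetic random-walk prices, for illustration only
rng = np.random.default_rng(3)
demo_close = pd.Series(1800 + rng.normal(0, 4, 1000).cumsum())

feats = make_features(demo_close)
data = feats.assign(target=demo_close.shift(-1)).dropna()   # next-bar close as target
demo_model = LinearRegression().fit(data[feats.columns], data["target"])

demo_preds = recursive_forecast(demo_model, demo_close, steps=5)
print(demo_preds)
```

Errors still compound with each step, since every forecast is built on earlier forecasts, so this mainly removes the inconsistency between `close` and the indicators rather than making long horizons reliable.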

Whether this is actually right, and whether it is usable, is for you to check yourself, haha~