A complete machine learning pipeline for forecasting solar irradiance using weather data from the Open-Meteo API. The system uses a Random Forest model to predict 24-hour ahead solar irradiance (shortwave radiation in W/m²).
The primary objective is to build an accurate solar irradiance forecasting system that can:
- Ingest weather data from Open-Meteo's historical archive API for Paris, France
- Engineer temporal and weather-derived features to capture solar patterns
- Train a machine learning model to predict shortwave radiation 24 hours ahead
- Serve predictions via a REST API for real-time forecasting
Solar irradiance forecasting is critical for:
- Optimizing solar panel energy production
- Grid management and energy trading
- Building energy management systems
┌─────────────┐     ┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│   Ingest    │────▶│ Preprocess  │────▶│    Train    │────▶│   Predict   │
│   (Open-    │     │  (Feature   │     │   (Random   │     │   (API /    │
│   Meteo)    │     │   Engin.)   │     │   Forest)   │     │    CLI)     │
└─────────────┘     └─────────────┘     └─────────────┘     └─────────────┘
solar-forecasting/
├── configs/
│   └── config.yaml              # All configuration (data, ingestion, model)
├── data/
│   ├── raw/                     # Raw weather data from Open-Meteo
│   ├── processed/               # Feature-engineered datasets
│   ├── features/                # Feature column list
│   └── eda/                     # Exploratory data analysis outputs
├── docs/
│   └── PREDICTION_GUIDE.md      # Detailed prediction usage guide
├── notions/
│   ├── time_series.md           # Time series concepts reference
│   └── feature_selection.md     # Feature selection methodology
├── scripts/
│   ├── ingest.py                # Data ingestion from Open-Meteo
│   ├── preprocess.py            # Feature engineering pipeline
│   ├── train.py                 # Model training and evaluation
│   └── eda.py                   # Exploratory data analysis
├── src/
│   ├── api/
│   │   └── predict.py           # SolarPredictor class for inference
│   ├── ingestion/
│   │   ├── client.py            # Open-Meteo API client with caching/retry
│   │   ├── validator.py         # Data quality validation
│   │   └── writer.py            # Raw data persistence
│   ├── preprocessing/
│   │   ├── features.py          # Feature engineering functions
│   │   └── pipeline.py          # Preprocessing pipeline orchestration
│   └── training/
│       └── model.py             # Model training, evaluation, persistence
├── tests/
│   ├── test_preprocessing.py    # Preprocessing unit tests
│   └── test_training.py         # Training unit tests
├── models/                      # Trained model files
├── app.py                       # FastAPI web application
├── predict.py                   # Command-line prediction interface
├── train.py                     # Training script (standalone)
└── requirements.txt             # Python dependencies
Fetches historical weather data from Open-Meteo's Archive API:
- Location: Paris, France (latitude: 48.8566, longitude: 2.3522)
- Time Range: 2023-01-01 to 2025-12-31 (~3 years)
- Frequency: Hourly observations
Variables Collected:
| Variable | Description | Unit |
|---|---|---|
| shortwave_radiation | Target variable - total shortwave solar radiation | W/m² |
| direct_radiation | Direct solar radiation | W/m² |
| diffuse_radiation | Diffuse solar radiation | W/m² |
| temperature_2m | Air temperature at 2m | °C |
| relative_humidity_2m | Relative humidity | % |
| precipitation | Rain/snowfall | mm |
| wind_speed_10m | Wind speed at 10m | km/h |
| wind_direction_10m | Wind direction | degrees |
| cloud_cover | Cloud cover percentage | % |
| surface_pressure | Atmospheric pressure | hPa |
Features:
- Caching with requests-cache (expires after 1 day)
- Automatic retry with exponential backoff (5 retries)
- Data validation against physical bounds
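The retry behaviour listed above can be illustrated without any HTTP library. The sketch below is a standard-library approximation of "retry with exponential backoff", not the project's client code: the names `with_retries`, `base_delay`, and the injectable `sleep` argument are hypothetical, and the project itself lists retry-requests as the dependency that handles this.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def with_retries(
    fetch: Callable[[], T],
    retries: int = 5,           # matches the "5 retries" stated above
    base_delay: float = 0.2,    # assumed starting delay in seconds
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fetch(); after each failure wait base_delay * 2**attempt, then retry."""
    for attempt in range(retries + 1):
        try:
            return fetch()
        except Exception:
            if attempt == retries:
                raise  # retries exhausted: surface the last error
            sleep(base_delay * 2 ** attempt)
    raise ValueError("retries must be non-negative")


# Demo: a fetch that fails twice before succeeding.
calls = {"n": 0}


def flaky() -> str:
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient failure")
    return "ok"


delays = []  # record the waits instead of actually sleeping
result = with_retries(flaky, sleep=delays.append)
print(result, delays)
```

Passing `sleep` as a parameter keeps the demo instantaneous and makes the delay schedule easy to inspect.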
Ensures data quality with DataValidator:
- Column check: All expected columns present
- Temporal continuity: No gaps in hourly timestamps
- Physical bounds: Values within realistic ranges
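The three checks above might look roughly like the following. This is a minimal sketch under stated assumptions, not the actual DataValidator: the function name `validate`, the `BOUNDS` table and its limits are all hypothetical, and the frame is assumed to carry an hourly DatetimeIndex.

```python
import pandas as pd

# Hypothetical physical limits for illustration; the real validator's bounds may differ.
BOUNDS = {
    "shortwave_radiation": (0.0, 1500.0),
    "cloud_cover": (0.0, 100.0),
    "relative_humidity_2m": (0.0, 100.0),
}


def validate(df: pd.DataFrame, expected: list) -> list:
    """Return a list of problems found; an empty list means all checks passed."""
    problems = []

    # Column check: every expected column must be present.
    missing = [c for c in expected if c not in df.columns]
    if missing:
        problems.append(f"missing columns: {missing}")

    # Temporal continuity: consecutive timestamps must be exactly one hour apart.
    steps = df.index.to_series().diff().dropna()
    gaps = int((steps != pd.Timedelta(hours=1)).sum())
    if gaps:
        problems.append(f"{gaps} gap(s) in hourly index")

    # Physical bounds: NaN values also fail `between`, so they are counted here.
    for col, (lo, hi) in BOUNDS.items():
        if col in df.columns:
            bad = int((~df[col].between(lo, hi)).sum())
            if bad:
                problems.append(f"{col}: {bad} value(s) outside [{lo}, {hi}]")
    return problems


idx = pd.date_range("2023-01-01", periods=5, freq="h")
good = pd.DataFrame(
    {
        "shortwave_radiation": [0.0, 10.0, 250.0, 400.0, 120.0],
        "cloud_cover": [100.0, 80.0, 20.0, 0.0, 50.0],
    },
    index=idx,
)
# Break the frame: drop one hour and push one cloud_cover value above 100 %.
bad = good.drop(idx[2]).assign(cloud_cover=[100.0, 80.0, 120.0, 50.0])

good_report = validate(good, ["shortwave_radiation", "cloud_cover"])
bad_report = validate(bad, ["shortwave_radiation", "cloud_cover", "temperature_2m"])
print(good_report)
print(bad_report)
```

Returning a list of messages rather than raising on the first failure lets a single run report every problem in a batch.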
Applies feature engineering to create the training dataset:
- hour - Hour of day (0-23)
- day_of_week - Day of week (0-6)
- month - Month of year (1-12)
- day_of_year - Day of year (1-366; the date range includes leap year 2024)
- is_weekend - Weekend indicator (0/1)
Time features encoded with sine/cosine to preserve circular nature:
- hour_sin, hour_cos - Period = 24 hours
- month_sin, month_cos - Period = 12 months
- total_radiation = direct_radiation + diffuse_radiation
- is_daytime = (shortwave_radiation > 0) indicator
- direct_ratio = direct_radiation / (shortwave_radiation + ε)
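The three derived features translate directly into pandas. The sketch below assumes a value of 1e-6 for ε and a hypothetical helper name `add_derived_features`; neither is confirmed by the project code.

```python
import pandas as pd

EPS = 1e-6  # assumed value for ε; keeps the ratio finite when radiation is zero


def add_derived_features(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with total_radiation, is_daytime and direct_ratio added."""
    out = df.copy()
    out["total_radiation"] = out["direct_radiation"] + out["diffuse_radiation"]
    out["is_daytime"] = (out["shortwave_radiation"] > 0).astype(int)
    out["direct_ratio"] = out["direct_radiation"] / (out["shortwave_radiation"] + EPS)
    return out


# One night-time row and one sunny midday row.
sample = pd.DataFrame(
    {
        "shortwave_radiation": [0.0, 400.0],
        "direct_radiation": [0.0, 300.0],
        "diffuse_radiation": [0.0, 100.0],
    }
)
features = add_derived_features(sample)
print(features)
```

Without ε, the night-time row would divide zero by zero and produce NaN in direct_ratio; with it, the ratio is simply 0.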
Historical values at multiple time steps:
- Lags: [1, 3, 6, 12, 24] hours
- Source columns: shortwave_radiation, direct_radiation, cloud_cover
target = shortwave_radiation shifted by -forecast_horizon (24 hours ahead)
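Both the lag features and the target come down to `shift` calls on an hourly-indexed frame. The sketch below assumes a `{column}_lag_{n}` naming scheme and drops rows left incomplete by the shifts; both choices are illustrative and may differ from the project's pipeline.

```python
import pandas as pd

LAGS = [1, 3, 6, 12, 24]
LAG_SOURCES = ["shortwave_radiation", "direct_radiation", "cloud_cover"]
FORECAST_HORIZON = 24  # hours ahead, as in configs/config.yaml


def add_lags_and_target(df: pd.DataFrame) -> pd.DataFrame:
    """Add lagged copies of the source columns and the 24-hour-ahead target."""
    out = df.copy()
    for col in LAG_SOURCES:
        for lag in LAGS:
            # Positive shift: the row at time t receives the value from t - lag.
            out[f"{col}_lag_{lag}"] = out[col].shift(lag)
    # Negative shift: the row at time t receives the value from t + horizon.
    out["target"] = out["shortwave_radiation"].shift(-FORECAST_HORIZON)
    # The first 24 rows lack lag_24 and the last 24 rows lack a target.
    return out.dropna()


idx = pd.date_range("2023-01-01", periods=72, freq="h")
raw = pd.DataFrame(
    {
        "shortwave_radiation": [float(i) for i in range(72)],
        "direct_radiation": [i * 0.5 for i in range(72)],
        "cloud_cover": [50.0] * 72,
    },
    index=idx,
)
frame = add_lags_and_target(raw)
lag_columns = [c for c in frame.columns if "_lag_" in c]
print(len(frame), len(lag_columns))
```

Because every feature at time t uses only values from t or earlier while the target looks 24 hours forward, no future information leaks into the inputs.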
Output: Chronological split into 80% train / 20% test
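A chronological split cuts the time-ordered frame at a single point instead of sampling rows at random, so the test set lies strictly after the training set. A minimal sketch, with a hypothetical function name and the 0.8 ratio taken from the config:

```python
import pandas as pd


def chronological_split(df: pd.DataFrame, train_ratio: float = 0.8):
    """Split a time-indexed frame into (train, test) without shuffling."""
    ordered = df.sort_index()
    cut = int(len(ordered) * train_ratio)
    return ordered.iloc[:cut], ordered.iloc[cut:]


idx = pd.date_range("2023-01-01", periods=100, freq="h")
data = pd.DataFrame({"shortwave_radiation": range(100)}, index=idx)
train, test = chronological_split(data)
print(len(train), len(test), train.index.max(), test.index.min())
```

A shuffled split would place hours from the same day on both sides of the boundary, and with lag features in play that would let the model effectively see test-period values during training.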
Comprehensive data analysis including:
- Dataset statistics (26,305 hourly records)
- Correlation analysis
- Target distribution visualization
Key findings from EDA:
- Strong positive correlation: shortwave_radiation ↔ direct_radiation (~0.90)
- Strong negative correlation: shortwave_radiation ↔ cloud_cover (~-0.80)
- Target is right-skewed: Mean (~200 W/m²) > Median (~50 W/m²)
- Clear diurnal pattern: zeros at night, peaks at midday
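The kind of check behind these findings can be reproduced on toy data. The sketch below builds a synthetic month of irradiance (a clipped sine for the daily cycle, dimmed by random cloud cover) and computes the same statistics. The data generator and its constants are invented for illustration, so the resulting numbers will not match the project's EDA values.

```python
import numpy as np

rng = np.random.default_rng(0)
hours = np.arange(24 * 30)  # 30 synthetic days of hourly steps

# Daily cycle: zero at night, peaking at midday (800 W/m² is an arbitrary peak).
clear_sky = 800.0 * np.clip(np.sin(np.pi * ((hours % 24) - 6) / 12), 0.0, None)
cloud_cover = rng.uniform(0.0, 100.0, size=hours.size)
# Assumed toy relationship: full cloud cover removes 75 % of clear-sky irradiance.
irradiance = clear_sky * (1.0 - 0.75 * cloud_cover / 100.0)

mean = float(irradiance.mean())
median = float(np.median(irradiance))
corr = float(np.corrcoef(irradiance, cloud_cover)[0, 1])
print(f"mean={mean:.1f}  median={median:.1f}  corr(irradiance, cloud)={corr:.2f}")
```

The night-time zeros pull the median far below the mean, which is the right-skew noted above, and the cloud term produces a negative correlation, though a weaker one than in real data where cloud cover and radiation are physically coupled.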
Configuration (configs/config.yaml):
model:
  name: solar_forecasting_rf
  target_column: "shortwave_radiation"
  forecast_horizon: 24
  hyperparameters:
    n_estimators: 100
    max_depth: 10
    random_state: 42

On held-out test set:
| Metric | Value | Interpretation |
|---|---|---|
| MAE | 44.64 W/m² | Average absolute prediction error |
| RMSE | 87.98 W/m² | Root mean squared error; penalizes large errors more heavily |
| R² | 0.8537 | Model explains 85.37% of variance |
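For reference, the three metrics in the table can be computed from their definitions with NumPy alone. The helper name `regression_metrics` and the toy values are illustrative; the project may use scikit-learn's metric functions instead.

```python
import numpy as np


def regression_metrics(y_true, y_pred) -> dict:
    """Compute MAE, RMSE and R² from their textbook definitions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_true - y_pred
    mae = np.mean(np.abs(err))
    rmse = np.sqrt(np.mean(err ** 2))
    # R² = 1 - (residual sum of squares) / (total sum of squares around the mean)
    r2 = 1.0 - np.sum(err ** 2) / np.sum((y_true - y_true.mean()) ** 2)
    return {"mae": float(mae), "rmse": float(rmse), "r2": float(r2)}


# Toy values in W/m²; one prediction is off by 40, the others by 10 to 20.
metrics = regression_metrics([0, 100, 200, 300], [10, 90, 220, 260])
print(metrics)
```

RMSE can never be smaller than MAE, and the gap widens when a few errors are much larger than the rest. An RMSE roughly twice the MAE, as in the table, therefore suggests that a minority of hours carry large misses.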
GridSearchCV with 3-fold cross-validation is supported:
# Uncomment in config.yaml for tuning
# hyperparameters:
#   param_grid:
#     n_estimators: [50, 100, 200]
#     max_depth: [5, 10, 15]

pip install -r requirements.txt

python scripts/ingest.py

Fetches ~3 years of weather data from Open-Meteo. Cached after first run.
python scripts/preprocess.py

Creates:
- data/processed/solar_forecasting_dataset.csv - Full dataset
- data/processed/train.csv - Training split
- data/processed/test.csv - Test split
- data/features/feature_columns.txt - List of features
python scripts/train.py

Trains Random Forest and saves to models/solar_forecasting_rf.joblib.
python -m pytest tests/

# Sample prediction using test data
python predict.py --sample-prediction
# Predict from CSV file
python predict.py --input-file data/weather.csv --output-file predictions.csv
# Custom model path
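python predict.py --input-file data/weather.csv --model-path models/custom_model.joblib

from src.api.predict import create_predictor
import pandas as pd

predictor = create_predictor("models/solar_forecasting_rf.joblib")

# Load input rows; this path reuses the CLI example above, and the file
# must contain the columns the model expects
weather_data = pd.read_csv("data/weather.csv")

# Make predictions
predictions = predictor.predict(weather_data)
print(predictions)

# Start server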
python app.py
# Server runs at http://localhost:8000
# API documentation at http://localhost:8000/docs

Endpoints:
| Endpoint | Method | Description |
|---|---|---|
| / | GET | API information |
| /health | GET | Health check |
| /predict | POST | Batch predictions |
| /predict/single | POST | Single prediction |
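A request to /predict/single can also be assembled from Python with the standard library. The sketch below only builds the request object and leaves the actual call commented out, since it needs a running server. The helper name is hypothetical, the payload fields mirror the curl example in this README, and the response schema is not assumed.

```python
import json
import urllib.request


def build_single_prediction_request(payload: dict, base_url: str = "http://localhost:8000"):
    """Build a JSON POST request for the /predict/single endpoint."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/predict/single",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


payload = {
    "datetime": "2024-06-15T12:00:00",
    "temperature_2m": 25.0,
    "relative_humidity_2m": 60.0,
    "precipitation": 0.0,
    "cloud_cover": 20.0,
    "wind_speed_10m": 3.5,
    "wind_direction_10m": 180.0,
}
request = build_single_prediction_request(payload)
print(request.get_method(), request.full_url)

# With the server running (python app.py), send it like this:
# with urllib.request.urlopen(request) as response:
#     print(json.load(response))
```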
Example curl:
curl -X POST http://localhost:8000/predict/single \
-H "Content-Type: application/json" \
-d '{
"datetime": "2024-06-15T12:00:00",
"temperature_2m": 25.0,
"relative_humidity_2m": 60.0,
"precipitation": 0.0,
"cloud_cover": 20.0,
"wind_speed_10m": 3.5,
"wind_direction_10m": 180.0
}'

Lag features capture temporal dependencies critical for solar forecasting:
- Solar irradiance follows daily patterns (24h periodicity)
- Weather conditions persist (clouds don't appear/disappear instantly)
- Lag values at [1, 3, 6, 12, 24] hours give the model a "memory" of recent conditions
Hour 23 and hour 0 are adjacent, but raw integers suggest they're far apart:
import numpy as np

# WRONG - treats hour 23 and 0 as distant
df['hour'] = df.index.hour  # Values: 0, 1, 2, ... 23
# CORRECT - preserves circular structure
df['hour_sin'] = np.sin(2 * np.pi * df['hour'] / 24)
df['hour_cos'] = np.cos(2 * np.pi * df['hour'] / 24)
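The effect of the encoding can be checked numerically. The sketch below maps each hour to a point on the unit circle and compares distances between hours; the helper names are illustrative.

```python
import math


def encode_hour(hour: int):
    """Map an hour of day to (sin, cos) coordinates on the unit circle."""
    angle = 2 * math.pi * hour / 24
    return math.sin(angle), math.cos(angle)


def encoded_distance(a: int, b: int) -> float:
    """Euclidean distance between two hours in the encoded space."""
    return math.dist(encode_hour(a), encode_hour(b))


raw_gap = abs(23 - 0)              # raw integers: 23 apart
d_23_0 = encoded_distance(23, 0)   # encoded: same as any one-hour step
d_0_1 = encoded_distance(0, 1)
d_0_12 = encoded_distance(0, 12)   # opposite sides of the circle
print(raw_gap, round(d_23_0, 4), round(d_0_1, 4), round(d_0_12, 4))
```

In the encoded space every one-hour step has the same length, including the step across midnight, while midnight and noon sit at the maximum distance of 2.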
Time features: hour, day_of_week, month, day_of_year, is_weekend, hour_sin, hour_cos, month_sin, month_cos
Derived features: total_radiation, is_daytime, direct_ratio
Lag features (5 lags × 3 sources): lag_1/3/6/12/24 of [shortwave_radiation, direct_radiation, cloud_cover]
All settings in configs/config.yaml:
project:
  name: solar-forecasting
  version: "0.1.0"
data:
  raw_dir: data/raw
  processed_dir: data/processed
  features_dir: data/features
  train_ratio: 0.8
ingestion:
  source: open-meteo
  latitude: 48.8566
  longitude: 2.3522
  start_date: "2023-01-01"
  end_date: "2025-12-31"
  timezone: "Europe/Paris"
  hourly_variables:
    - shortwave_radiation
    - direct_radiation
    # ... (10 variables total)
model:
  name: solar_forecasting_rf
  target_column: "shortwave_radiation"
  forecast_horizon: 24
  hyperparameters:
    n_estimators: 100
    max_depth: 10
    random_state: 42

See notions/time_series.md for detailed explanations of:
- Time series fundamentals and train/test splits
- Temporal leakage prevention
- Feature engineering for time series
- Evaluation metrics (MAE, RMSE, R²)
- Walk-forward validation
See notions/feature_selection.md for:
- Understanding negative correlations (cloud_cover ↔ radiation)
- Feature selection methodology
pandas==2.2.3
numpy>=1.26.4
matplotlib
seaborn
openmeteo-requests>=0.3.0
requests-cache>=1.2.0
retry-requests>=2.0.0
pyyaml>=6.0.2
python-dotenv>=1.0.1
scikit-learn>=1.3.0
joblib>=1.4.0
fastapi>=0.115.0
uvicorn>=0.32.0
| Metric | Test Set Performance |
|---|---|
| MAE | 44.64 W/m² |
| RMSE | 87.98 W/m² |
| R² | 0.8537 (85.37%) |
The model successfully captures:
- Diurnal solar patterns (daily cycles)
- Seasonal variation (summer vs winter)
- Weather impact (cloud cover effects)
- Deploy API - Containerize with Docker for production deployment
- Real-time integration - Connect to live Open-Meteo forecast API
- Model improvement - Try XGBoost/LightGBM, hyperparameter tuning
- Multi-horizon forecasting - Predict 24, 48, 72 hours ahead
- Alert system - Notify grid operators of low/high irradiance predictions