EngineerProjects/solar-forecasting

Solar Irradiance Forecasting

A complete machine learning pipeline for forecasting solar irradiance from Open-Meteo weather data. The system uses a Random Forest model to predict solar irradiance (shortwave radiation in W/m²) 24 hours ahead.

Project Goal

The primary objective is to build an accurate solar irradiance forecasting system that can:

  1. Ingest weather data from Open-Meteo's historical archive API for Paris, France
  2. Engineer temporal and weather-derived features to capture solar patterns
  3. Train a machine learning model to predict shortwave radiation 24 hours ahead
  4. Serve predictions via a REST API for real-time forecasting

Solar irradiance forecasting is critical for:

  • Optimizing solar panel energy production
  • Grid management and energy trading
  • Building energy management systems

Architecture Overview

┌─────────────┐     ┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│   Ingest    │────▶│  Preprocess │────▶│   Train     │────▶│  Predict    │
│   (Open-    │     │  (Feature   │     │   (Random   │     │  (API /     │
│   Meteo)    │     │   Engin.)   │     │   Forest)   │     │   CLI)      │
└─────────────┘     └─────────────┘     └─────────────┘     └─────────────┘

Project Structure

solar-forecasting/
├── configs/
│   └── config.yaml          # All configuration (data, ingestion, model)
├── data/
│   ├── raw/                  # Raw weather data from Open-Meteo
│   ├── processed/            # Feature-engineered datasets
│   ├── features/             # Feature column list
│   └── eda/                  # Exploratory data analysis outputs
├── docs/
│   └── PREDICTION_GUIDE.md   # Detailed prediction usage guide
├── notions/
│   ├── time_series.md        # Time series concepts reference
│   └── feature_selection.md  # Feature selection methodology
├── scripts/
│   ├── ingest.py             # Data ingestion from Open-Meteo
│   ├── preprocess.py         # Feature engineering pipeline
│   ├── train.py              # Model training and evaluation
│   └── eda.py                # Exploratory data analysis
├── src/
│   ├── api/
│   │   └── predict.py        # SolarPredictor class for inference
│   ├── ingestion/
│   │   ├── client.py         # Open-Meteo API client with caching/retry
│   │   ├── validator.py      # Data quality validation
│   │   └── writer.py         # Raw data persistence
│   ├── preprocessing/
│   │   ├── features.py       # Feature engineering functions
│   │   └── pipeline.py       # Preprocessing pipeline orchestration
│   └── training/
│       └── model.py          # Model training, evaluation, persistence
├── tests/
│   ├── test_preprocessing.py # Preprocessing unit tests
│   └── test_training.py      # Training unit tests
├── models/                   # Trained model files
├── app.py                    # FastAPI web application
├── predict.py                # Command-line prediction interface
├── train.py                  # Training script (standalone)
└── requirements.txt          # Python dependencies

Data Pipeline

1. Data Ingestion (scripts/ingest.py)

Fetches historical weather data from Open-Meteo's Archive API:

  • Location: Paris, France (latitude: 48.8566, longitude: 2.3522)
  • Time Range: 2023-01-01 to 2025-12-31 (~3 years)
  • Frequency: Hourly observations

Variables Collected:

| Variable | Description | Unit |
|---|---|---|
| shortwave_radiation | Target variable - total shortwave solar radiation | W/m² |
| direct_radiation | Direct solar radiation | W/m² |
| diffuse_radiation | Diffuse solar radiation | W/m² |
| temperature_2m | Air temperature at 2m | °C |
| relative_humidity_2m | Relative humidity | % |
| precipitation | Rain/snowfall | mm |
| wind_speed_10m | Wind speed at 10m | km/h |
| wind_direction_10m | Wind direction | degrees |
| cloud_cover | Cloud cover percentage | % |
| surface_pressure | Atmospheric pressure | hPa |

Features:

  • Caching with requests-cache (expires after 1 day)
  • Automatic retry with exponential backoff (5 retries)
  • Data validation against physical bounds
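
The caching and retry behaviour listed above can be sketched roughly as follows. This is a minimal sketch, not the code in src/ingestion/client.py: the function names, the `.cache` path, and the `backoff_factor` value are assumptions made for illustration, while the one-day expiry and five retries come from the list above.

```python
def build_archive_params(latitude, longitude, start_date, end_date, variables,
                         timezone="Europe/Paris"):
    """Build query parameters for an Open-Meteo archive request (hypothetical helper)."""
    return {
        "latitude": latitude,
        "longitude": longitude,
        "start_date": start_date,
        "end_date": end_date,
        "hourly": list(variables),
        "timezone": timezone,
    }


def fetch_archive(params, cache_path=".cache", retries=5):
    """Fetch hourly data through a cached session that retries with backoff."""
    # Imported lazily so the parameter builder works without the network packages.
    import openmeteo_requests
    import requests_cache
    from retry_requests import retry

    # Cached responses expire after one day (86400 seconds).
    cache_session = requests_cache.CachedSession(cache_path, expire_after=86400)
    # backoff_factor=0.2 is an assumed value, not taken from the project.
    retry_session = retry(cache_session, retries=retries, backoff_factor=0.2)
    client = openmeteo_requests.Client(session=retry_session)
    return client.weather_api("https://archive-api.open-meteo.com/v1/archive",
                              params=params)


params = build_archive_params(48.8566, 2.3522, "2023-01-01", "2025-12-31",
                              ["shortwave_radiation", "cloud_cover"])
```

Calling `fetch_archive(params)` needs network access and the packages listed under Dependencies; building the parameters does not.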

2. Data Validation (src/ingestion/validator.py)

Ensures data quality with DataValidator:

  • Column check: All expected columns present
  • Temporal continuity: No gaps in hourly timestamps
  • Physical bounds: Values within realistic ranges
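
To make the three checks concrete, here is a small standard-library sketch. It is not the project's DataValidator: the `validate` function, the row format (a list of dicts with a `time` key), and the bound values are all assumptions chosen for illustration.

```python
from datetime import datetime, timedelta

# Illustrative bounds only; the project's real limits are not shown in this README.
BOUNDS = {
    "shortwave_radiation": (0.0, 1500.0),
    "relative_humidity_2m": (0.0, 100.0),
    "cloud_cover": (0.0, 100.0),
}


def validate(rows, expected_columns, bounds=BOUNDS):
    """Return a list of problems found in hourly rows (dicts with a `time` datetime)."""
    problems = []
    for i, row in enumerate(rows):
        # Column check: every expected column must be present.
        missing = [c for c in expected_columns if c not in row]
        if missing:
            problems.append(f"row {i}: missing columns {missing}")
        # Physical bounds: present values must lie inside [low, high].
        for col, (low, high) in bounds.items():
            value = row.get(col)
            if value is not None and not (low <= value <= high):
                problems.append(f"row {i}: {col}={value} outside [{low}, {high}]")
    # Temporal continuity: consecutive timestamps must be exactly one hour apart.
    for i in range(1, len(rows)):
        if rows[i]["time"] - rows[i - 1]["time"] != timedelta(hours=1):
            problems.append(f"gap between row {i - 1} and row {i}")
    return problems


start = datetime(2023, 1, 1)
rows = [
    {"time": start, "shortwave_radiation": 0.0, "cloud_cover": 80.0},
    {"time": start + timedelta(hours=1), "shortwave_radiation": 5.0, "cloud_cover": 120.0},
    {"time": start + timedelta(hours=3), "shortwave_radiation": 40.0, "cloud_cover": 50.0},
]
problems = validate(rows, ["shortwave_radiation", "cloud_cover"])
```

In this sample the second row has an out-of-range cloud cover and the third row skips an hour, so both are reported.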

3. Preprocessing Pipeline (scripts/preprocess.py)

Applies feature engineering to create the training dataset:

Time Features

  • hour - Hour of day (0-23)
  • day_of_week - Day of week (0-6)
  • month - Month of year (1-12)
  • day_of_year - Day of year (1-365, or 1-366 in leap years such as 2024)
  • is_weekend - Weekend indicator (0/1)
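
As an illustration, these calendar features can be derived from a pandas DatetimeIndex along the following lines. The `add_time_features` helper is hypothetical and assumes the frame is indexed by hourly timestamps; the project's own implementation in src/preprocessing/features.py may differ.

```python
import pandas as pd


def add_time_features(df):
    """Add calendar features derived from a DatetimeIndex (hypothetical helper)."""
    out = df.copy()
    out["hour"] = out.index.hour
    out["day_of_week"] = out.index.dayofweek  # Monday=0 ... Sunday=6
    out["month"] = out.index.month
    out["day_of_year"] = out.index.dayofyear
    out["is_weekend"] = (out.index.dayofweek >= 5).astype(int)
    return out


# 2024-06-15 is a Saturday in a leap year.
index = pd.date_range("2024-06-15 00:00", periods=48, freq="h")
features = add_time_features(pd.DataFrame(index=index))
```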

Cyclic Encoding

Time features encoded with sine/cosine to preserve circular nature:

  • hour_sin, hour_cos - Period = 24 hours
  • month_sin, month_cos - Period = 12 months

Derived Features

  • total_radiation = direct_radiation + diffuse_radiation
  • is_daytime = (shortwave_radiation > 0) indicator
  • direct_ratio = direct_radiation / (shortwave_radiation + ε)
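
A possible pandas version of the three formulas above is sketched here. The helper name and the value of ε are assumptions; the README does not state which ε the project uses.

```python
import pandas as pd

EPSILON = 1e-6  # assumed value; guards against division by zero at night


def add_derived_features(df, eps=EPSILON):
    """Add total_radiation, is_daytime and direct_ratio (hypothetical helper)."""
    out = df.copy()
    out["total_radiation"] = out["direct_radiation"] + out["diffuse_radiation"]
    out["is_daytime"] = (out["shortwave_radiation"] > 0).astype(int)
    out["direct_ratio"] = out["direct_radiation"] / (out["shortwave_radiation"] + eps)
    return out


# One night-time row and one daytime row.
sample = pd.DataFrame({
    "shortwave_radiation": [0.0, 400.0],
    "direct_radiation": [0.0, 300.0],
    "diffuse_radiation": [0.0, 100.0],
})
derived = add_derived_features(sample)
```

With ε in the denominator, the night-time row yields a ratio of 0 rather than a division error.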

Lag Features

Historical values at multiple time steps:

  • Lags: [1, 3, 6, 12, 24] hours
  • Source columns: shortwave_radiation, direct_radiation, cloud_cover
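
One way to build these lags with pandas is `Series.shift`, as sketched below. The `<column>_lag_<n>` naming and the helper itself are assumptions; the sketch also assumes the rows are hourly and gap-free, so that shifting by n rows means shifting by n hours.

```python
import pandas as pd

LAGS = [1, 3, 6, 12, 24]
LAG_SOURCES = ["shortwave_radiation", "direct_radiation", "cloud_cover"]


def add_lag_features(df, columns=LAG_SOURCES, lags=LAGS):
    """Add `<column>_lag_<n>` columns holding the value n rows (hours) earlier."""
    out = df.copy()
    for col in columns:
        for lag in lags:
            out[f"{col}_lag_{lag}"] = out[col].shift(lag)
    return out


hours = pd.date_range("2023-01-01", periods=30, freq="h")
frame = pd.DataFrame(
    {col: [float(i) for i in range(30)] for col in LAG_SOURCES}, index=hours
)
lagged = add_lag_features(frame)
```

The first n rows of each lag column are NaN because no earlier observation exists for them.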

Forecast Target

target = shortwave_radiation shifted by -forecast_horizon (24 hours ahead)

Output: Chronological split into 80% train / 20% test
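
The target shift and the chronological split can be sketched together as follows. The helper is hypothetical, and dropping rows with missing values before splitting is an assumption about how the pipeline handles the final 24 hours, which have no future target.

```python
import pandas as pd


def add_target_and_split(df, target_column="shortwave_radiation",
                         horizon=24, train_ratio=0.8):
    """Attach the future target, drop incomplete rows, and split chronologically."""
    out = df.copy()
    # shift(-horizon) pulls the value from `horizon` rows ahead onto the current row.
    out["target"] = out[target_column].shift(-horizon)
    out = out.dropna()
    # No shuffling: the earliest rows train the model, the latest rows test it.
    cut = int(len(out) * train_ratio)
    return out.iloc[:cut], out.iloc[cut:]


hours = pd.date_range("2023-01-01", periods=124, freq="h")
data = pd.DataFrame(
    {"shortwave_radiation": [float(i) for i in range(124)]}, index=hours
)
train, test = add_target_and_split(data)
```

Keeping the split chronological means every test timestamp is later than every training timestamp, which avoids leaking future information into training.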

4. EDA Report (scripts/eda.py)

Comprehensive data analysis including:

  • Dataset statistics (26,305 hourly records)
  • Correlation analysis
  • Target distribution visualization

Key findings from EDA:

  • Strong positive correlation: shortwave_radiation ↔ direct_radiation (~0.90)
  • Strong negative correlation: shortwave_radiation ↔ cloud_cover (~-0.80)
  • Target is right-skewed: Mean (~200 W/m²) > Median (~50 W/m²)
  • Clear diurnal pattern: zeros at night, peaks at midday

Model

Algorithm: Random Forest Regressor

Configuration (configs/config.yaml):

model:
  name: solar_forecasting_rf
  target_column: "shortwave_radiation"
  forecast_horizon: 24
  hyperparameters:
    n_estimators: 100
    max_depth: 10
    random_state: 42

Evaluation Metrics

On held-out test set:

| Metric | Value | Interpretation |
|---|---|---|
| MAE | 44.64 W/m² | Average prediction error |
| RMSE | 87.98 W/m² | Large errors penalized |
| R² | 0.8537 | Model explains 85.37% of variance |
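
For reference, the three metrics follow their standard definitions, which can be computed as in this standard-library sketch. The function name and the sample numbers are illustrative only and are unrelated to the reported results.

```python
import math


def regression_metrics(y_true, y_pred):
    """Compute MAE, RMSE and R² from two equal-length sequences."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    ss_res = sum(e * e for e in errors)          # residual sum of squares
    rmse = math.sqrt(ss_res / n)
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    r2 = 1.0 - ss_res / ss_tot
    return {"mae": mae, "rmse": rmse, "r2": r2}


metrics = regression_metrics([0.0, 100.0, 200.0, 300.0],
                             [10.0, 90.0, 220.0, 300.0])
```

Because RMSE squares each error before averaging, a few large misses raise it well above MAE, which matches the gap between the two values in the table.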

Hyperparameter Tuning

GridSearchCV with 3-fold cross-validation is supported:

# Uncomment in config.yaml for tuning
# hyperparameters:
#   param_grid:
#     n_estimators: [50, 100, 200]
#     max_depth: [5, 10, 15]
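
In scikit-learn terms, the grid above corresponds roughly to the sketch below. The `tune_random_forest` helper, the MAE-based `scoring` choice, and the synthetic data are assumptions; how the project's training script wires the config into GridSearchCV is not shown in this README.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

PARAM_GRID = {"n_estimators": [50, 100, 200], "max_depth": [5, 10, 15]}


def tune_random_forest(X, y, param_grid=PARAM_GRID, cv=3, random_state=42):
    """Grid-search Random Forest hyperparameters with 3-fold cross-validation."""
    search = GridSearchCV(
        RandomForestRegressor(random_state=random_state),
        param_grid,
        cv=cv,
        scoring="neg_mean_absolute_error",  # assumed scoring choice
    )
    search.fit(X, y)
    return search


# Small synthetic dataset so the sketch runs quickly.
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(120, 3))
y = 500 * X[:, 0] * (1 - X[:, 1]) + rng.normal(0, 5, size=120)
search = tune_random_forest(X, y)
print(search.best_params_)
```

For time-ordered data, passing `cv=TimeSeriesSplit(n_splits=3)` from `sklearn.model_selection` instead of `cv=3` keeps each validation fold later than its training fold.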

Usage

Prerequisites

pip install -r requirements.txt

Step 1: Ingest Data

python scripts/ingest.py

Fetches ~3 years of hourly weather data from Open-Meteo. API responses are cached after the first run.

Step 2: Preprocess Data

python scripts/preprocess.py

Creates:

  • data/processed/solar_forecasting_dataset.csv - Full dataset
  • data/processed/train.csv - Training split
  • data/processed/test.csv - Test split
  • data/features/feature_columns.txt - List of features

Step 3: Train Model

python scripts/train.py

Trains the Random Forest model and saves it to models/solar_forecasting_rf.joblib.

Step 4: Run Tests

python -m pytest tests/

Prediction Interfaces

Command-Line Interface (predict.py)

# Sample prediction using test data
python predict.py --sample-prediction

# Predict from CSV file
python predict.py --input-file data/weather.csv --output-file predictions.csv

# Custom model path
python predict.py --input-file data/weather.csv --model-path models/custom_model.joblib

Python API (src/api/predict.py)

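import pandas as pd

from src.api.predict import create_predictor

# Load the trained model
predictor = create_predictor("models/solar_forecasting_rf.joblib")

# weather_data: DataFrame of hourly weather observations
# (same example path as the CLI above; adjust to your data)
weather_data = pd.read_csv("data/weather.csv")

# Make predictions
predictions = predictor.predict(weather_data)
print(predictions)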

FastAPI Web Service (app.py)

# Start server
python app.py

# Server runs at http://localhost:8000
# API documentation at http://localhost:8000/docs

Endpoints:

| Endpoint | Method | Description |
|---|---|---|
| / | GET | API information |
| /health | GET | Health check |
| /predict | POST | Batch predictions |
| /predict/single | POST | Single prediction |

Example curl:

curl -X POST http://localhost:8000/predict/single \
  -H "Content-Type: application/json" \
  -d '{
    "datetime": "2024-06-15T12:00:00",
    "temperature_2m": 25.0,
    "relative_humidity_2m": 60.0,
    "precipitation": 0.0,
    "cloud_cover": 20.0,
    "wind_speed_10m": 3.5,
    "wind_direction_10m": 180.0
  }'
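
The same request can be issued from Python using only the standard library, as sketched here. The helper name is hypothetical, and the response schema is not documented in this README, so the sketch simply prints whatever JSON comes back.

```python
import json
import urllib.request


def build_single_prediction_request(payload, base_url="http://localhost:8000"):
    """Build a POST request for /predict/single (hypothetical helper)."""
    return urllib.request.Request(
        f"{base_url}/predict/single",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


payload = {
    "datetime": "2024-06-15T12:00:00",
    "temperature_2m": 25.0,
    "relative_humidity_2m": 60.0,
    "precipitation": 0.0,
    "cloud_cover": 20.0,
    "wind_speed_10m": 3.5,
    "wind_direction_10m": 180.0,
}
request = build_single_prediction_request(payload)

# Sending requires the server started with `python app.py` to be running:
# with urllib.request.urlopen(request) as response:
#     print(json.load(response))
```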

Feature Engineering Details

Why Lag Features?

Lag features capture temporal dependencies critical for solar forecasting:

  • Solar irradiance follows daily patterns (24h periodicity)
  • Weather conditions persist (clouds don't appear/disappear instantly)
  • Lag values at [1, 3, 6, 12, 24] hours give the model a short-term "memory"

Why Cyclic Encoding?

Hour 23 and hour 0 are adjacent, but raw integers suggest they're far apart:

import numpy as np

# WRONG - treats hour 23 and 0 as distant
df['hour'] = df.index.hour  # Values: 0, 1, 2, ... 23

# CORRECT - preserves circular structure
df['hour_sin'] = np.sin(2 * np.pi * df['hour'] / 24)
df['hour_cos'] = np.cos(2 * np.pi * df['hour'] / 24)
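
A quick numeric check illustrates the point. In this sketch (function names are illustrative), each hour becomes a point on the unit circle and the distance between two hours is measured in that encoded space.

```python
import math


def encode_hour(hour):
    """Map an hour of day to a (sin, cos) point on the unit circle."""
    angle = 2 * math.pi * hour / 24
    return math.sin(angle), math.cos(angle)


def encoded_distance(hour_a, hour_b):
    """Euclidean distance between two hours in the encoded space."""
    return math.dist(encode_hour(hour_a), encode_hour(hour_b))


gap_midnight = encoded_distance(23, 0)   # adjacent across midnight
gap_adjacent = encoded_distance(11, 12)  # adjacent at midday
gap_opposite = encoded_distance(0, 12)   # half a day apart
```

Hours 23 and 0 end up exactly as close as hours 11 and 12, while midnight and noon sit at opposite sides of the circle; with raw integers, 23 and 0 would differ by 23.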

Feature List (41 total)

Time features: hour, day_of_week, month, day_of_year, is_weekend, hour_sin, hour_cos, month_sin, month_cos

Derived features: total_radiation, is_daytime, direct_ratio

Lag features (5 lags × 3 sources): lag_1/3/6/12/24 of [shortwave_radiation, direct_radiation, cloud_cover]

Configuration

All settings in configs/config.yaml:

project:
  name: solar-forecasting
  version: "0.1.0"

data:
  raw_dir: data/raw
  processed_dir: data/processed
  features_dir: data/features
  train_ratio: 0.8

ingestion:
  source: open-meteo
  latitude: 48.8566
  longitude: 2.3522
  start_date: "2023-01-01"
  end_date: "2025-12-31"
  timezone: "Europe/Paris"
  hourly_variables:
    - shortwave_radiation
    - direct_radiation
    # ... (10 variables total)

model:
  name: solar_forecasting_rf
  target_column: "shortwave_radiation"
  forecast_horizon: 24
  hyperparameters:
    n_estimators: 100
    max_depth: 10
    random_state: 42

Key Concepts

See notions/time_series.md for detailed explanations of:

  • Time series fundamentals and train/test splits
  • Temporal leakage prevention
  • Feature engineering for time series
  • Evaluation metrics (MAE, RMSE, R²)
  • Walk-forward validation

See notions/feature_selection.md for:

  • Understanding negative correlations (cloud_cover ↔ radiation)
  • Feature selection methodology

Dependencies

pandas==2.2.3
numpy>=1.26.4
matplotlib
seaborn
openmeteo-requests>=0.3.0
requests-cache>=1.2.0
retry-requests>=2.0.0
pyyaml>=6.0.2
python-dotenv>=1.0.1
scikit-learn>=1.3.0
joblib>=1.4.0
fastapi>=0.115.0
uvicorn>=0.32.0

Results Summary

| Metric | Test Set Performance |
|---|---|
| MAE | 44.64 W/m² |
| RMSE | 87.98 W/m² |
| R² | 0.8537 (85.37%) |

The model successfully captures:

  • Diurnal solar patterns (daily cycles)
  • Seasonal variation (summer vs winter)
  • Weather impact (cloud cover effects)

Next Steps

  1. Deploy API - Containerize with Docker for production deployment
  2. Real-time integration - Connect to live Open-Meteo forecast API
  3. Model improvement - Try XGBoost/LightGBM, hyperparameter tuning
  4. Multi-horizon forecasting - Predict 24, 48, 72 hours ahead
  5. Alert system - Notify grid operators of low/high irradiance predictions
