When evaluating a thermal demand forecast, the accuracy metric you're most likely to see quoted is MAPE — Mean Absolute Percentage Error. A vendor says their model achieves 7% MAPE. What does that mean? Is it good? And what does it not tell you that you need to know before trusting the forecast with HVAC staging decisions?
MAPE is a useful and widely used metric, but it has specific failure modes that HVAC applications expose in ways that electricity grid forecasting or financial modeling don't. Understanding these failure modes is essential for anyone using forecast accuracy claims to make operational decisions about building energy management.
What MAPE Actually Calculates
MAPE is straightforward in definition: for each time period, calculate the absolute error (how far off the forecast was from actuals) as a percentage of the actual value, then average those percentage errors across all periods.
Formally: MAPE = (100% / n) × Σ (|Actual - Forecast| / |Actual|), where the sum runs over the n periods in the evaluation set.
If a model forecasts 450 kW on a period when actual demand was 500 kW, the percentage error for that period is |500-450|/500 = 10%. If the model forecasts 510 kW on a period when actual demand was 500 kW, the error is |500-510|/500 = 2%. MAPE averages these percentage errors across all periods in the evaluation set.
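The calculation above can be sketched in a few lines of Python. This is a minimal illustration rather than any vendor's implementation; the function name `mape` and the two-period input are assumptions chosen to mirror the worked example.

```python
def mape(actuals, forecasts):
    """Mean absolute percentage error, in percent."""
    if len(actuals) != len(forecasts):
        raise ValueError("actuals and forecasts must have the same length")
    if not actuals:
        raise ValueError("need at least one period")
    # Each term is |actual - forecast| / |actual|.
    # An actual of exactly zero raises ZeroDivisionError: MAPE is undefined there.
    pct_errors = [abs(a - f) / abs(a) for a, f in zip(actuals, forecasts)]
    return 100.0 * sum(pct_errors) / len(pct_errors)


# The two periods from the worked example: a 10% error and a 2% error.
example_mape = mape([500.0, 500.0], [450.0, 510.0])
print(f"{example_mape:.1f}%")
```

Averaging the 10% and 2% errors gives a MAPE of 6% for this two-period evaluation set.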
The absolute value in the numerator means that positive errors (forecast too high) and negative errors (forecast too low) are treated identically — they don't cancel out. This is why it's "mean absolute percentage error" and not just "mean percentage error." A model that's consistently 5% too high and one that's consistently 5% too low both have a MAPE of 5%, even though they have very different implications for HVAC pre-conditioning decisions.
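To make the difference concrete, the sketch below compares MAPE with a signed mean percentage error for two hypothetical models, one consistently 5% high and one consistently 5% low. The sign convention (forecast minus actual, so positive means over-forecast) and the sample demand values are assumptions made for illustration.

```python
def mape(actuals, forecasts):
    """Mean absolute percentage error, in percent."""
    terms = [abs(a - f) / abs(a) for a, f in zip(actuals, forecasts)]
    return 100.0 * sum(terms) / len(terms)


def mean_percentage_error(actuals, forecasts):
    """Signed percentage error, in percent.

    Assumed convention: positive means the forecast was above the actual.
    """
    terms = [(f - a) / abs(a) for a, f in zip(actuals, forecasts)]
    return 100.0 * sum(terms) / len(terms)


# Illustrative demand values in kW (not measured data).
actuals = [400.0, 500.0, 600.0]
high_model = [a * 1.05 for a in actuals]  # consistently 5% too high
low_model = [a * 0.95 for a in actuals]   # consistently 5% too low

for name, forecasts in [("high", high_model), ("low", low_model)]:
    print(name,
          f"MAPE={mape(actuals, forecasts):.1f}%",
          f"MPE={mean_percentage_error(actuals, forecasts):+.1f}%")
```

Both models report the same MAPE, while the signed metric separates them by direction.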
The Failure Mode at Low Demand Levels
MAPE has a well-known mathematical failure mode when actual values are near zero: the percentage error denominator becomes very small, causing individual percentage errors to blow up and dominate the average. In electricity load forecasting, this typically matters when overnight loads drop to near-zero levels in very low-occupancy buildings.
For thermal demand forecasting specifically, this manifests during overnight setback periods. If a building's HVAC demand drops to 15 kW during a mild overnight (minimal heating or cooling load needed), a forecast of 22 kW is a 47% error (|15-22|/15), but the absolute error is only 7 kW, which is operationally insignificant. That 47% error inflates the reported MAPE substantially even though it has no practical consequence for pre-conditioning decisions, which are calibrated to peak-demand windows, not overnight minimums.
For this reason, MAPE calculated over full 24-hour periods is a less informative accuracy metric for HVAC demand forecasting than MAPE calculated only over the occupied and pre-conditioning hours, roughly 5 AM through 9 PM. Some demand forecasting providers report "operating hours MAPE" that excludes overnight minimums; others report full-day MAPE that includes them. A 7% full-day MAPE and a 7% operating-hours MAPE are substantially different claims, so always ask which window a quoted figure covers.
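The effect of the evaluation window can be sketched with a synthetic day. Everything here is an illustrative assumption: the helper name `segmented_mape`, the hourly resolution, the 5 AM to 9 PM window boundaries, and the demand values (15 kW overnight with a 22 kW forecast, 500 kW during operating hours with a forecast 5% high).

```python
def mape(actuals, forecasts):
    """Mean absolute percentage error, in percent."""
    terms = [abs(a - f) / abs(a) for a, f in zip(actuals, forecasts)]
    return 100.0 * sum(terms) / len(terms)


def segmented_mape(hours, actuals, forecasts, start_hour=5, end_hour=21):
    """Return (full_day_mape, operating_hours_mape) for hourly data.

    An hour h counts as operating when start_hour <= h < end_hour, so the
    defaults cover the periods starting 5 AM through 8 PM (ending 9 PM).
    """
    keep = [i for i, h in enumerate(hours) if start_hour <= h < end_hour]
    op_actuals = [actuals[i] for i in keep]
    op_forecasts = [forecasts[i] for i in keep]
    return mape(actuals, forecasts), mape(op_actuals, op_forecasts)


# Synthetic day (illustrative values only): 15 kW overnight, 500 kW otherwise.
hours = list(range(24))
actuals = [500.0 if 5 <= h < 21 else 15.0 for h in hours]
# Forecast is 7 kW high overnight and 25 kW (5%) high during operating hours.
forecasts = [525.0 if 5 <= h < 21 else 22.0 for h in hours]

full_day, operating = segmented_mape(hours, actuals, forecasts)
print(f"full-day MAPE: {full_day:.1f}%")
print(f"operating-hours MAPE: {operating:.1f}%")
```

In this synthetic case the same forecast scores about 18.9% over the full day and 5.0% over operating hours, which is why the evaluation window has to be stated alongside any quoted figure.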
Directional Bias Matters More Than Symmetric Accuracy
MAPE treats over-forecasting and under-forecasting symmetrically. For demand management, they're not symmetric — they have different operational consequences.
An over-forecast (model predicts 600 kW, actual is 500 kW) triggers earlier and more aggressive pre-conditioning than the day actually requires. The building gets pre-cooled more than necessary. This wastes energy (off-peak energy, but still real kWh on the bill) and may create comfort issues from over-chilled spaces at occupancy. It's a false positive: the model indicated high demand risk that didn't materialize.
An under-forecast (model predicts 450 kW, actual is 600 kW) triggers insufficient pre-conditioning. The morning ramp-up is larger than expected, and the demand event that was supposed to be avoided occurs anyway. This is a false negative: the model missed the high-demand event, and the demand charge is incurred. From a demand charge reduction perspective, under-forecasting is significantly worse than over-forecasting.
A useful complement to MAPE is a directional bias metric — what fraction of forecasts are over- versus under-predictions, and is there a systematic pattern. A model with 7% MAPE but 60% of errors in the under-forecast direction is a worse demand management tool than a model with 8% MAPE that's symmetrically balanced between over- and under-forecasts. MAPE alone doesn't surface this.
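One simple way to compute such a directional breakdown is sketched below. The function name `under_forecast_share`, the choice to ignore exact hits, and the sample values are all assumptions for illustration; a real evaluation would likely restrict this to peak-demand windows as well.

```python
def under_forecast_share(actuals, forecasts):
    """Fraction of non-exact forecasts that fell below the actual.

    Returns None when every forecast matched its actual exactly,
    since there is no direction to report in that case.
    """
    under = sum(1 for a, f in zip(actuals, forecasts) if f < a)
    over = sum(1 for a, f in zip(actuals, forecasts) if f > a)
    if under + over == 0:
        return None
    return under / (under + over)


# Illustrative peak-window demand in kW (not measured data).
actuals = [500.0, 600.0, 550.0, 480.0, 620.0]
forecasts = [470.0, 560.0, 565.0, 450.0, 600.0]

share = under_forecast_share(actuals, forecasts)
print(f"under-forecast share: {share:.0%}")
```

Here four of the five forecasts are below the actual, so the share is 80%, a skew that the MAPE figure for the same data would not reveal.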
We're not saying MAPE is the wrong metric — it's the right starting point. We're saying it's not sufficient on its own, and a vendor that reports only overall MAPE without directional breakdown and operating-hours versus full-day segmentation is giving you an incomplete accuracy picture.
The Validation Methodology Question
How MAPE is calculated matters as much as what the number is. The key question: is the forecast accuracy being measured in-sample (on the same data the model was trained on) or out-of-sample (on data the model has never seen)?
In-sample MAPE (accuracy evaluated on the training data) is almost always lower, and therefore more flattering, than out-of-sample MAPE. A model that has overfit the training data will report very low in-sample MAPE but fail to generalize to new conditions. For commercial building demand forecasting, where the model needs to generalize to weather conditions and occupancy patterns it hasn't seen in training, out-of-sample validation is the only meaningful accuracy benchmark.
The correct out-of-sample validation methodology for time-series forecasting is temporal cross-validation: train on data from an earlier period, test on a later period, and never allow the test period's information to influence the model parameters. The test period should be held out completely as a contiguous later block, not drawn by random sampling across all available data, which lets future data leak into training and inflates apparent accuracy.
Heatvelo validates forecast models using temporal cross-validation on building-specific data: 80% of available historical data for training, the most recent 20% as a held-out test set, evaluated strictly in temporal order. The MAPE figures we report to buildings during pilot scoping are out-of-sample operating-hours MAPE — a harder benchmark than either in-sample MAPE or full-day MAPE. The typical range for well-instrumented commercial office buildings is 5–9%.
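A chronological 80/20 holdout of the kind described above can be sketched as follows. This is not Heatvelo's implementation; the function name `temporal_split`, the `(timestamp, demand_kw)` record layout, and the sample values are assumptions for illustration.

```python
from datetime import datetime, timedelta


def temporal_split(records, train_fraction=0.8):
    """Split (timestamp, value) records into (train, test) in time order.

    The earliest train_fraction of records becomes the training set and
    the remainder becomes the held-out test set. Nothing is shuffled.
    """
    if not 0.0 < train_fraction < 1.0:
        raise ValueError("train_fraction must be between 0 and 1")
    ordered = sorted(records, key=lambda r: r[0])  # sort by timestamp
    cut = int(len(ordered) * train_fraction)
    return ordered[:cut], ordered[cut:]


# Ten hourly readings with made-up demand values in kW.
start = datetime(2024, 1, 1, 0, 0)
records = [(start + timedelta(hours=i), 400.0 + 10.0 * i) for i in range(10)]

train, test = temporal_split(records)
print(len(train), "training records;", len(test), "held-out records")
print("last training timestamp:", train[-1][0])
print("first test timestamp:", test[0][0])
```

Because the split is by position in time rather than by random sampling, every test timestamp is later than every training timestamp, which is the property that prevents future data from leaking into training.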
Why 8% Is the Operational Adequacy Threshold
The 8% MAPE threshold for operational adequacy in thermal demand forecasting isn't arbitrary — it comes from the relationship between forecast error and pre-conditioning decision quality. A forecast with MAPE above 8% produces pre-conditioning windows that are systematically too narrow or too wide to reliably prevent demand events, based on empirical analysis of pre-conditioning outcomes across building types in the load forecasting literature.
Below 8% MAPE, forecast accuracy is sufficient that pre-conditioning decisions based on the forecast consistently outperform fixed-schedule pre-conditioning on the same buildings. The threshold isn't a hard line: buildings with high thermal mass tolerate more forecast error than light-frame buildings, because the pre-cooling window recommendation doesn't need to be as precise. Still, 8% is a reasonable threshold across the range of commercial building types where demand management strategies are typically applied.
For buildings where MAPE exceeds 8% in backtesting, the appropriate response is to diagnose why — typically, insufficient training data, irregular occupancy patterns, sensor gaps in the BMS historian, or an extreme-weather event regime the model hasn't seen enough of. In some cases, the underlying data quality or occupancy predictability is insufficient to support forecast-based staging, and we tell the building operator this before committing to a pilot rather than discovering it live.