Machine Learning
Building the Budget Forecast Model
The central question the ML module tries to answer is: given what we know about oil prices, exchange rates, population, economic growth, and debt levels, what should we expect Nigeria's federal budget to look like over the next five years? A linear regression could answer part of this. But Nigeria's budget does not grow linearly. It is driven by non-linear interactions between macroeconomic variables, making tree-based ensemble methods a stronger fit.
Why XGBoost
XGBoost (Extreme Gradient Boosting) is an ensemble method that builds decision trees sequentially, with each tree correcting the residuals of the previous one. It handles non-linear relationships without requiring the data to be normalised, manages interactions between variables automatically, and provides feature importance scores that make the model interpretable. For a dataset with 24 training rows and 9 features, a shallow XGBoost (maximum depth of 3) provides enough flexibility without overfitting.
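The idea of each tree correcting the residuals of the previous one can be sketched with plain gradient boosting on one-split trees (stumps). This is a simplified illustration for squared error only, not XGBoost itself, which adds regularisation and other refinements. The toy data, the learning rate, and the helper names fit_stump and predict_stump are assumptions made for the example.

```python
import numpy as np

def fit_stump(x, residual):
    """Find the single split on x that minimises squared error on the residuals."""
    best = None
    for threshold in np.unique(x)[:-1]:            # every split that leaves both sides non-empty
        left = residual[x <= threshold]
        right = residual[x > threshold]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, threshold, left.mean(), right.mean())
    _, threshold, left_value, right_value = best
    return threshold, left_value, right_value

def predict_stump(stump, x):
    threshold, left_value, right_value = stump
    return np.where(x <= threshold, left_value, right_value)

# Hypothetical toy data: a jump plus a slope, which a straight line fits poorly.
x = np.arange(1.0, 13.0)
y = np.where(x < 7, 10.0, 30.0) + 0.5 * x

learning_rate = 0.5
prediction = np.full_like(y, y.mean())             # round zero: predict the mean
errors = []
for _ in range(20):
    residual = y - prediction                      # what is still unexplained
    stump = fit_stump(x, residual)                 # the next tree is fitted to the residuals
    prediction = prediction + learning_rate * predict_stump(stump, x)
    errors.append(float(((y - prediction) ** 2).mean()))

print(f"MSE after 1 round: {errors[0]:.3f}, after 20 rounds: {errors[-1]:.3f}")
```

Each round shrinks the training error a little further, which is also why depth and the number of rounds need to be kept small on a dataset of only 24 rows.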
Feature Engineering
The nine features fed into the model are a mix of direct economic indicators and engineered terms. The two most important design choices are the lag terms and the interaction term.
df["Lag1_Budget"] = df["Budget_Approved_Bn"].shift(1)
df["Lag2_Budget"] = df["Budget_Approved_Bn"].shift(2)
df["Oil_USD_x_FX"] = df["Oil_Price_USD"] * df["USDNGN"]
df["Debt_to_GDP"] = df["Debt_Stock_Bn"] / df["GDP_Bn"]
The oil-times-FX interaction term deserves particular attention. Oil is priced in dollars while Nigeria's revenues are reported in naira, so the relevant quantity for budget planning is not the dollar oil price alone but the product of that price and the exchange rate. A weaker naira produces more naira revenue per barrel even if the dollar price is unchanged. Encoding this interaction explicitly gives the model information that neither variable carries alone.
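To make the lag and interaction terms concrete, here is a small sketch on a toy frame. The column names follow the excerpt above, but every value is hypothetical and chosen only to show the mechanics. Dropping the rows with missing lags is one possible treatment, not necessarily what the dashboard does.

```python
import pandas as pd

# Hypothetical values for illustration only; this is not the dashboard's dataset.
df = pd.DataFrame({
    "Year": [2021, 2022, 2023, 2024],
    "Budget_Approved_Bn": [100.0, 120.0, 150.0, 200.0],
    "Oil_Price_USD": [70.0, 70.0, 80.0, 80.0],
    "USDNGN": [400.0, 450.0, 450.0, 900.0],
})

# Lag terms: last year's and the year before's approved budget.
df["Lag1_Budget"] = df["Budget_Approved_Bn"].shift(1)
df["Lag2_Budget"] = df["Budget_Approved_Bn"].shift(2)

# Interaction term: the naira value of a barrel of oil.
df["Oil_USD_x_FX"] = df["Oil_Price_USD"] * df["USDNGN"]

# shift() leaves NaN in the earliest rows; here they are simply dropped.
train = df.dropna().reset_index(drop=True)

print(df)
print("usable rows:", len(train))
```

In the last two rows the dollar oil price is the same, yet the interaction term doubles because the naira has halved in value. That is the signal the separate columns do not expose directly.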
The Forecasting Challenge for Tree Models
Tree-based models have a well-known limitation for time series forecasting: they cannot extrapolate beyond the range of target values seen during training. Each leaf outputs a constant learned from the training rows, so when asked to predict a budget for 2027, even from feature values beyond anything in the training data, the model returns a value anchored to the training range rather than projecting the underlying trend forward. This is not a bug in XGBoost. It is a fundamental property of decision trees.
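The limitation is easy to reproduce with a single small regression tree. The sketch below is a hand-written one-feature tree, not XGBoost, and the steadily growing budget series is hypothetical. The function names build_tree and predict are assumptions made for the example.

```python
def mean(values):
    return sum(values) / len(values)

def build_tree(points, depth):
    """points: list of (x, y). Returns a leaf value or a (threshold, left, right) node."""
    xs = sorted({x for x, _ in points})
    if depth == 0 or len(xs) < 2:
        return mean([y for _, y in points])        # leaf: average of the training targets

    def sse(t):
        # Squared error left over if the data is split at threshold t.
        total = 0.0
        for side in ([y for x, y in points if x <= t], [y for x, y in points if x > t]):
            m = mean(side)
            total += sum((y - m) ** 2 for y in side)
        return total

    t = min(xs[:-1], key=sse)                      # best split point
    return (t,
            build_tree([p for p in points if p[0] <= t], depth - 1),
            build_tree([p for p in points if p[0] > t], depth - 1))

def predict(tree, x):
    while isinstance(tree, tuple):                 # walk down until a leaf is reached
        t, left, right = tree
        tree = left if x <= t else right
    return tree

# Hypothetical series: a budget that grows by 10 every year from 2000 to 2023.
years = list(range(2000, 2024))
budgets = [100.0 + 10.0 * (yr - 2000) for yr in years]
tree = build_tree(list(zip(years, budgets)), depth=3)

for yr in (2023, 2027, 2035):
    print(yr, predict(tree, yr))
```

Every year past the end of the training data falls into the same final leaf, so 2027 and 2035 receive the same prediction, and that prediction never exceeds the largest budget seen in training, even though the series has a perfectly regular upward trend.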
The solution implemented in this dashboard is a blended forecast. The XGBoost prediction is combined with a drift estimate derived from the user-set macroeconomic assumptions, specifically the implied nominal budget growth from the inflation and GDP growth sliders. The first forecast year weights the two components equally. Later years lean more on the drift term, with its weight capped at 0.65, reflecting the reality that the model's signal degrades rapidly with distance from the training data.
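# Implied nominal growth: 60% of (inflation + GDP growth), clipped to [-10%, +40%]
implied_growth = np.clip((inf_fc + gdp_fc) / 100 * 0.6, -0.1, 0.40)
for i, yr in enumerate(fc_years):
    # feat_row and last_b1 are prepared for each forecast year outside this excerpt
    gbm_pred = model.predict(scaler.transform(feat_row))[0]
    drift_pred = last_b1 * (1 + implied_growth)
    # alpha is the weight on the drift term: 0.50 in the first year, capped at 0.65
    alpha = min(0.5 + i * 0.12, 0.65)
    pred_val = (1 - alpha) * gbm_pred + alpha * drift_pred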
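The arithmetic of the blend can be checked in isolation. The sketch below wraps the formulas from the excerpt in small functions. The function names, the slider values, and the two predictions are hypothetical and serve only as a worked example.

```python
import numpy as np

def implied_growth(inf_fc, gdp_fc):
    # Same formula as the excerpt: 60% of (inflation + GDP growth), clipped to [-10%, +40%].
    return float(np.clip((inf_fc + gdp_fc) / 100 * 0.6, -0.1, 0.40))

def drift_weight(i):
    # alpha from the excerpt: weight on the drift term at forecast step i (0-based).
    return min(0.5 + i * 0.12, 0.65)

def blend(gbm_pred, drift_pred, i):
    alpha = drift_weight(i)
    return (1 - alpha) * gbm_pred + alpha * drift_pred

# Hypothetical slider values and predictions, for illustration only.
growth = implied_growth(inf_fc=20.0, gdp_fc=3.0)          # (20 + 3) / 100 * 0.6
weights = [round(drift_weight(i), 2) for i in range(5)]   # five forecast years
example = blend(gbm_pred=50_000.0, drift_pred=60_000.0, i=0)

print("implied growth:", growth)
print("drift weights:", weights)
print("first-year blend:", example)
```

With these formulas the drift weight reaches its 0.65 cap in the third forecast year and stays there, so the model output never contributes less than 35 percent of the blended value.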
Model Performance
The model is validated using TimeSeriesSplit cross-validation, which respects the temporal ordering of the data. Unlike standard k-fold cross-validation, TimeSeriesSplit ensures that future data is never used to predict past values. With four folds, the cross-validated mean absolute percentage error sits around 34 percent, reflecting the high volatility of Nigeria's budget trajectory across periods as different as the 2016 recession and the 2025 post-subsidy-removal surge. In-sample, the model fits to within 0.2 percent.
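The fold layout and the error metric can be sketched without any library. The splitter below is a hand-rolled expanding-window scheme written in the spirit of TimeSeriesSplit, not scikit-learn's own code. The target series is hypothetical, and the "carry the last value forward" predictor is only a stand-in for the real model.

```python
def expanding_window_splits(n_samples, n_splits):
    """Yield (train_indices, test_indices) where training always precedes testing."""
    test_size = n_samples // (n_splits + 1)
    first_test_start = n_samples - n_splits * test_size
    for k in range(n_splits):
        start = first_test_start + k * test_size
        yield list(range(start)), list(range(start, start + test_size))

def mape(actual, predicted):
    """Mean absolute percentage error, in percent."""
    return 100 * sum(abs(a - p) / abs(a) for a, p in zip(actual, predicted)) / len(actual)

# 24 rows and 4 folds, as in the text; the target values themselves are hypothetical.
y = [100.0 * 1.15 ** t for t in range(24)]
splits = list(expanding_window_splits(len(y), n_splits=4))

fold_scores = []
for train_idx, test_idx in splits:
    last_seen = y[train_idx[-1]]                   # stand-in model: repeat the last training value
    preds = [last_seen] * len(test_idx)
    fold_scores.append(mape([y[i] for i in test_idx], preds))
    print(f"train 0..{train_idx[-1]}  test {test_idx[0]}..{test_idx[-1]}  "
          f"MAPE {fold_scores[-1]:.1f}%")

print(f"mean MAPE across folds: {sum(fold_scores) / len(fold_scores):.1f}%")
```

Each fold trains on a longer history than the one before it and is scored only on the years that immediately follow, so no future observation ever leaks into a prediction of the past.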
Model Limitations
The 34 percent cross-validated MAPE should be interpreted in context. Nigeria's budget has experienced year-on-year changes ranging from negative 35 percent (2002) to positive 91 percent (2025). A model whose predictions on held-out folds miss by 34 percent on average is performing reasonably given such volatile outcomes and the structural breaks in the data. The model is best used as a directional indicator, not a point forecast.