source("source/weatherModel.R")
source("source/print_parameter_comparison_table.R")
6 Calibration walk through
Load source file containing the R implementation of the Weather model:
To generate new data based on a given dataset, we must first be able to estimate the Weather model parameters from that dataset; that is, to find the parameter values that best approximate the daily series of a given year. Once this can be done for each year in the dataset, we can estimate the hyperparameters as descriptive statistics of the per-year estimates (e.g., mean and standard deviation, minimum, maximum).
A good estimation of the parameters of the solar radiation and temperature submodels (i.e. sinusoids) can be made directly by measuring the year's minimum and maximum.
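As a minimal sketch of that idea (not part of the original walkthrough, and using a mock daily_temperature series as a stand-in for real data), a sinusoidal submodel parameterised by its annual minimum and maximum can be estimated by reading those two values off the observed series:
# Illustrative sketch: mock one year of daily temperatures following a sinusoid
daily_temperature <- 15 + 10 * sin(2 * pi * (1:365) / 365)

# For a sinusoid parameterised by annual minimum and maximum,
# the estimates are simply the observed extremes
estimated_minimum <- min(daily_temperature)
estimated_maximum <- max(daily_temperature)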
However, the case of precipitation is far from trivial, given the complexity of the algorithm behind it. The workflow to estimate the parameters of the precipitation submodel deserves a demonstration.
6.1 Parameter estimation using optim()
Set up six variations of parameter settings for the annual double logistic curve (i.e. plateau_value, inflection1, rate1, inflection2, rate2), the discretisation producing the annual cumulative precipitation curve (i.e. n_samples, max_sample_size), and annualPrecipitation, assuming a year length of 365 days. The random generator seed used in the discretisation is fixed:
# Fixed random seed for reproducibility
SEED <- 0

# Simulation parameters
YEAR_LENGTH <- 365

# Double logistic function parameters
params_values_double_logistic <- matrix(
  c(0.01, 125, 0.3, 245, 0.22,
    0.15, 63, 0.55, 195, 0.6,
    0.5, 64, 0.05, 261, 0.12,
    0.45, 215, 0.01, 276, 0.39,
    0.6, 20, 0.38, 254, 0.04,
    0.85, 97, 0.24, 219, 0.17),
  nrow = 6,
  byrow = TRUE,
  dimnames = list(NULL, c("plateau_value", "inflection1", "rate1", "inflection2", "rate2"))
)

# Discretisation parameters
params_values_discretisation <- matrix(
  c(152, 22,
    220, 10,
    240, 6,
    168, 13,
    191, 9,
    205, 17),
  nrow = 6,
  byrow = TRUE,
  dimnames = list(NULL, c("n_samples", "max_sample_size"))
)

# Annual precipitation sum values
annual_sum_values <- c(410, 1050, 636, 320, 1280, 745)
Predefine the range of values explored for each parameter:
# Bounds, in order: plateau_value, inflection1, rate1,
# inflection2, rate2, n_samples, max_sample_size
params_range_lower <- c(0, 1, 0.01, 1, 0.01, 1, 3)
params_range_upper <- c(1, 365, 0.9, 365, 0.9, 365, 30)
6.1.1 Calibrating gen_annual_double_logistic_curve() (deterministic function)
Select the first set of parameter values from the params_values_double_logistic dataset and generate the corresponding curve with the gen_annual_double_logistic_curve() function. These points will represent the original state of the model that we aim to reverse engineer from the outcome curve. Plot it.
original_params <- params_values_double_logistic[1, 1:5]

curve <- gen_annual_double_logistic_curve(
  plateau_value = original_params[1],
  inflection1 = original_params[2],
  rate1 = original_params[3],
  inflection2 = original_params[4],
  rate2 = original_params[5],
  year_length = YEAR_LENGTH)

plot(curve, cex = 2)
Define the initial_guess vector with your initial parameter guess values. Generate the curve using the gen_annual_double_logistic_curve() function with the initial guess. Plot it. Notice that our initial guess generates a somewhat “average” cumulative curve.
initial_guess <- c(0.5, 100, 0.1, 200, 0.1) # Initial parameter guess

first_guess_curve <- gen_annual_double_logistic_curve(
  plateau_value = initial_guess[1],
  inflection1 = initial_guess[2],
  rate1 = initial_guess[3],
  inflection2 = initial_guess[4],
  rate2 = initial_guess[5],
  year_length = YEAR_LENGTH)

plot(first_guess_curve, cex = 2)
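To make the gap between the initial guess and the target visible, the two curves can be overlaid (an optional check, not required by the calibration itself, using the curve and first_guess_curve objects already defined):
# Optional check: overlay the initial guess (blue) on the original curve
plot(curve, cex = 2)
lines(first_guess_curve, col = 'blue', lwd = 2)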
Define the eval_objective_func() function that calculates the sum of squared differences between the observed data and the predicted values generated by the gen_annual_double_logistic_curve() function with a given parameter setting. Then, use the optim() function to estimate the best parameter values by minimizing the objective function.
NOTE: optim() is used here with method "L-BFGS-B"; see ?optim or:
> Byrd, R. H., Lu, P., Nocedal, J. and Zhu, C. (1995). A limited memory algorithm for bound constrained optimization. SIAM Journal on Scientific Computing, 16, 1190–1208. doi:10.1137/0916069.
observed_data <- curve

# Objective function to minimize (difference between observed and predicted values or "residual")
eval_objective_func <- function(params) {
  predicted_data <- gen_annual_double_logistic_curve(params[1], params[2], params[3], params[4], params[5], YEAR_LENGTH)
  sum((observed_data - predicted_data)^2)
}

# Use the least squares method to estimate the parameter values
fit <- optim(initial_guess, eval_objective_func,
             method = "L-BFGS-B",
             lower = params_range_lower[1:5],
             upper = params_range_upper[1:5])

best_estimation_curve <- gen_annual_double_logistic_curve(fit$par[1], fit$par[2], fit$par[3], fit$par[4], fit$par[5], YEAR_LENGTH)
Plot the original curve (curve) and overlay it with the curve generated using the best estimated parameter values (best_estimation_curve). The best estimated curve is shown in red.
plot(curve, cex = 2)
lines(best_estimation_curve, col = 'red', lwd = 3)
print_parameter_comparison_table(original_params, fit, params_range_upper[1:5], params_range_lower[1:5])
| | original | estimated | delta | range | delta (%) |
|---|---|---|---|---|---|
| plateau_value | 0.01 | 0.0090998 | 0.000900 | 1.00 | 0.0900 |
| inflection1 | 125.00 | 113.7955514 | 11.204449 | 364.00 | 3.0781 |
| rate1 | 0.30 | 0.8999843 | 0.599984 | 0.89 | 67.4140 |
| inflection2 | 245.00 | 244.9879056 | 0.012094 | 364.00 | 0.0033 |
| rate2 | 0.22 | 0.2195381 | 0.000462 | 0.89 | 0.0519 |
We can see that reverse engineering the parameter values of the double logistic curve is relatively straightforward. This specific curve offers a clear hint that rate1 and rate2 (i.e. the maximum growth rates of each logistic component) are harder to estimate when plateau_value is extreme.
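A quick way to see why, sketched here as an aside: with plateau_value close to 0, the first logistic component contributes almost nothing to the curve, so very different rate1 values produce near-identical outputs (and, presumably, the same holds for rate2 when plateau_value is close to 1). This is consistent with the table above, where the optimiser drove rate1 to 0.9 while still achieving an excellent fit:
# Aside: with plateau_value = 0.01, the first logistic component is nearly
# invisible, so rate1 is weakly identified by the curve alone
curve_low_rate1 <- gen_annual_double_logistic_curve(0.01, 125, 0.3, 245, 0.22, YEAR_LENGTH)
curve_high_rate1 <- gen_annual_double_logistic_curve(0.01, 125, 0.9, 245, 0.22, YEAR_LENGTH)

# Maximum pointwise difference; expected to be small relative to the curve range
max(abs(curve_low_rate1 - curve_high_rate1))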
6.1.2 Adding discretise_curve() (stochastic function)
However, precipitation in the Weather model presents an additional challenge: the continuous cumulative curve is broken down into “steps” through discretise_curve(), which introduces stochasticity. We will also add rescale_curve() to the end of the process, in order to approach the curve that would be created by generate_annual_precipitation().
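As an aside, the effect of this stochasticity can be seen by generating the same parameter configuration under two different seeds (a sketch using the wrapper gen_cum_precipitation_of_year() employed below, with the first configuration from the matrices above):
# Aside: same parameters, different seeds -> different step patterns.
# This is why the seed is held fixed during calibration below.
curve_seed_0 <- gen_cum_precipitation_of_year(
  plateau_value = 0.01, inflection1 = 125, rate1 = 0.3,
  inflection2 = 245, rate2 = 0.22,
  year_length = YEAR_LENGTH,
  n_samples = 152, max_sample_size = 22,
  seed = 0)
curve_seed_1 <- gen_cum_precipitation_of_year(
  plateau_value = 0.01, inflection1 = 125, rate1 = 0.3,
  inflection2 = 245, rate2 = 0.22,
  year_length = YEAR_LENGTH,
  n_samples = 152, max_sample_size = 22,
  seed = 1)

plot(curve_seed_0, type = 'l', lwd = 2)
lines(curve_seed_1, col = 'blue', lwd = 2, lty = 2)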
Let us extend the workflow used above with gen_annual_double_logistic_curve() to also cover the two additional parameters of discretise_curve() (for now, fix seed = 0):
original_params <- c(params_values_double_logistic[1, 1:5], params_values_discretisation[1, 1:2])

curve <- gen_cum_precipitation_of_year(
  plateau_value = original_params[1],
  inflection1 = original_params[2],
  rate1 = original_params[3],
  inflection2 = original_params[4],
  rate2 = original_params[5],
  year_length = YEAR_LENGTH,
  n_samples = original_params[6],
  max_sample_size = original_params[7],
  seed = SEED)

plot(curve, type = 'l', lwd = 3)
initial_guess <- c(0.5, 100, 0.1, 200, 0.1, 180, 15) # Initial parameter guess

first_guess_curve <- gen_cum_precipitation_of_year(
  plateau_value = initial_guess[1],
  inflection1 = initial_guess[2],
  rate1 = initial_guess[3],
  inflection2 = initial_guess[4],
  rate2 = initial_guess[5],
  year_length = YEAR_LENGTH,
  n_samples = initial_guess[6],
  max_sample_size = initial_guess[7],
  seed = SEED)

plot(first_guess_curve, type = 'l', lwd = 3)
observed_data <- curve

# Objective function to minimize (difference between observed and predicted values)
eval_objective_func <- function(params) {
  predicted_data <- gen_cum_precipitation_of_year(
    plateau_value = params[1],
    inflection1 = params[2], rate1 = params[3],
    inflection2 = params[4], rate2 = params[5],
    year_length = YEAR_LENGTH,
    n_samples = params[6],
    max_sample_size = params[7],
    seed = SEED
  )
  sum((observed_data - predicted_data)^2)
}

# Use the least squares method to estimate the parameter values
fit <- optim(initial_guess, eval_objective_func,
             method = "L-BFGS-B",
             lower = params_range_lower,
             upper = params_range_upper)

best_estimation_curve <- gen_cum_precipitation_of_year(
  plateau_value = fit$par[1],
  inflection1 = fit$par[2], rate1 = fit$par[3],
  inflection2 = fit$par[4], rate2 = fit$par[5],
  year_length = YEAR_LENGTH,
  n_samples = fit$par[6],
  max_sample_size = fit$par[7],
  seed = SEED
)
plot(curve, type = 'l', lwd = 3)
lines(best_estimation_curve, col = 'red', lwd = 3)
print_parameter_comparison_table(original_params, fit, params_range_upper, params_range_lower)
| | original | estimated | delta | range | delta (%) |
|---|---|---|---|---|---|
| plateau_value | 0.01 | 0.0126065 | 0.002606 | 1.00 | 0.2606 |
| inflection1 | 125.00 | 100.2542491 | 24.745751 | 364.00 | 6.7983 |
| rate1 | 0.30 | 0.9000000 | 0.600000 | 0.89 | 67.4157 |
| inflection2 | 245.00 | 250.3699407 | 5.369941 | 364.00 | 1.4753 |
| rate2 | 0.22 | 0.9000000 | 0.680000 | 0.89 | 76.4045 |
| n_samples | 152.00 | 143.9652346 | 8.034765 | 364.00 | 2.2074 |
| max_sample_size | 22.00 | 30.0000000 | 8.000000 | 27.00 | 29.6296 |
Close, but a much worse fit than the one obtained with gen_annual_double_logistic_curve() alone. We should take this performance into consideration going forward.
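One way to see where the difficulty comes from, sketched here as an aside: the stochastic discretisation makes the objective function jumpy, so a gradient-based method such as L-BFGS-B can stall in a poor region. Evaluating the objective along a one-dimensional slice, varying only n_samples around its true value while keeping the other parameters at their original values, illustrates this:
# Aside: the objective is not smooth in n_samples because the discretisation
# is stochastic; small parameter changes can cause abrupt jumps in the residual
n_samples_slice <- seq(100, 220, by = 5)
objective_slice <- sapply(n_samples_slice, function(ns) {
  params <- original_params
  params[6] <- ns
  eval_objective_func(params)
})

plot(n_samples_slice, objective_slice, type = 'b',
     xlab = 'n_samples', ylab = 'sum of squared residuals')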
6.1.3 Calibrating multiple example curves to determine hyperparameters
Let us now apply the same workflow to estimate the hyperparameters able to generate an approximation of a sequence of yearly daily series.
First, generate the original dataset based on the different configurations present in params_values_double_logistic and params_values_discretisation:
curves <- list()
original_params_list <- list()

for (i in 1:nrow(params_values_double_logistic))
{
  original_params <- c(params_values_double_logistic[i, 1:5], params_values_discretisation[i, 1:2])

  curve <- gen_cum_precipitation_of_year(
    plateau_value = original_params[1],
    inflection1 = original_params[2],
    rate1 = original_params[3],
    inflection2 = original_params[4],
    rate2 = original_params[5],
    year_length = YEAR_LENGTH,
    n_samples = original_params[6],
    max_sample_size = original_params[7],
    seed = SEED
  )

  curves[[i]] <- curve
  original_params_list[[i]] <- original_params
}

plot(curves[[1]], type = 'l', col = 1, lwd = 3, ylab = 'curve')
for (i in 2:length(curves))
{
  lines(curves[[i]], col = i, lwd = 3)
}
Apply optim(), reusing initial_guess and eval_objective_func, to each curve and generate a sequence of best estimation curves:
best_estimation_curves <- list()
best_estimation_fits <- list()

for (i in 1:nrow(params_values_double_logistic))
{
  observed_data <- curves[[i]]

  # Use the least squares method to estimate the parameter values
  fit <- optim(initial_guess, eval_objective_func,
               method = "L-BFGS-B",
               lower = params_range_lower,
               upper = params_range_upper)

  best_estimation_curve <- gen_cum_precipitation_of_year(
    plateau_value = fit$par[1],
    inflection1 = fit$par[2], rate1 = fit$par[3],
    inflection2 = fit$par[4], rate2 = fit$par[5],
    year_length = YEAR_LENGTH,
    n_samples = fit$par[6],
    max_sample_size = fit$par[7],
    seed = SEED)

  best_estimation_curves[[i]] <- best_estimation_curve
  best_estimation_fits[[i]] <- fit
}
Plot original and estimated curves:
layout(matrix(1:6, nrow = 3, ncol = 2, byrow = TRUE))
par(mar = c(0.1, 0.1, 0.1, 0.1))
for (i in 1:length(curves)) {
plot(curves[[i]], type = 'l', col = i, lwd = 3, xaxt = 'n', yaxt = 'n')
lines(best_estimation_curves[[i]], col = i, lty = 2)
}
Visualise the aggregate estimation quality:
# Helper functions
calculate_mean_sd <- function(list_of_vectors) {
  if (length(list_of_vectors) == 0) return(list(mean = numeric(0), sd = numeric(0)))

  values_matrix <- do.call(rbind, list_of_vectors)

  list(
    mean = colMeans(values_matrix),
    sd = apply(values_matrix, 2, sd)
  )
}

get_list_params_from_fit <- function(list_of_fit_objects) {
  lapply(list_of_fit_objects, `[[`, "par")
}
# Main functions
create_parameter_comparison_summary <- function(original_params_list, fits, params_range_upper, params_range_lower) {
  original_summary <- calculate_mean_sd(original_params_list)
  estimated_summary <- calculate_mean_sd(get_list_params_from_fit(fits))

  data.frame(
    original_mean = round(original_summary$mean, digits = 4),
    original_sd = round(original_summary$sd, digits = 4),
    estimated_mean = round(estimated_summary$mean, digits = 4),
    estimated_sd = round(estimated_summary$sd, digits = 4),
    delta_mean = round(abs(original_summary$mean - estimated_summary$mean), digits = 6),
    delta_sd = round(abs(original_summary$sd - estimated_summary$sd), digits = 6),
    #range = params_range_upper - params_range_lower,
    delta_mean_percent = round(
      100 * abs(original_summary$mean - estimated_summary$mean) / (params_range_upper - params_range_lower),
      digits = 4
    ),
    delta_sd_percent = round(
      100 * abs(original_summary$sd - estimated_summary$sd) / (params_range_upper - params_range_lower),
      digits = 4
    )
  )
}
print_parameter_comparison_summary_table <- function(parameter_comparison_summary) {
  knitr::kable(parameter_comparison_summary,
               format = "html",
               col.names = c(
                 "original (mean)", "original (sd)",
                 "estimated (mean)", "estimated (sd)",
                 "delta (mean)", "delta (sd)",
                 "delta (mean%)", "delta (sd%)"),
               align = c("c", "c", "c", "c", "c", "c", "c", "c"))
}
# Execution
parameter_comparison_summary <- create_parameter_comparison_summary(original_params_list, best_estimation_fits, params_range_upper, params_range_lower)

print_parameter_comparison_summary_table(parameter_comparison_summary)
| | original (mean) | original (sd) | estimated (mean) | estimated (sd) | delta (mean) | delta (sd) | delta (mean%) | delta (sd%) |
|---|---|---|---|---|---|---|---|---|
| plateau_value | 0.4267 | 0.3051 | 0.5238 | 0.4174 | 0.097147 | 0.112358 | 9.7147 | 11.2358 |
| inflection1 | 97.3333 | 67.6481 | 64.9747 | 49.8331 | 32.358633 | 17.815035 | 8.8897 | 4.8942 |
| rate1 | 0.2550 | 0.2034 | 0.4256 | 0.4247 | 0.170560 | 0.221209 | 19.1640 | 24.8549 |
| inflection2 | 241.6667 | 29.6895 | 256.9584 | 60.9816 | 15.291717 | 31.292059 | 4.2010 | 8.5967 |
| rate2 | 0.2567 | 0.2050 | 0.4631 | 0.4623 | 0.206456 | 0.257308 | 23.1973 | 28.9110 |
| n_samples | 196.0000 | 32.6741 | 159.1494 | 24.7352 | 36.850629 | 7.938989 | 10.1238 | 2.1810 |
| max_sample_size | 12.8333 | 5.8452 | 21.6078 | 10.1365 | 8.774432 | 4.291308 | 32.4979 | 15.8937 |
Although the original example curves are quite different from each other, let us assume that they correspond to the cumulative precipitation of six years at a single location. This will allow us to test the optim() calibration workflow on our ultimate target, the precipitation hyperparameters of the Weather model.
Initialise the weather model, setting the precipitation hyperparameters as the mean and standard deviation of the best estimation parameter values in parameter_comparison_summary:
weather_model <- initialise_weather_model(
  seed = 0,
  precip_plateau_value_mean = parameter_comparison_summary["plateau_value", "estimated_mean"],
  precip_plateau_value_sd = parameter_comparison_summary["plateau_value", "estimated_sd"],
  precip_inflection1_mean = parameter_comparison_summary["inflection1", "estimated_mean"],
  precip_inflection1_sd = parameter_comparison_summary["inflection1", "estimated_sd"],
  precip_rate1_mean = parameter_comparison_summary["rate1", "estimated_mean"],
  precip_rate1_sd = parameter_comparison_summary["rate1", "estimated_sd"],
  precip_inflection2_mean = parameter_comparison_summary["inflection2", "estimated_mean"],
  precip_inflection2_sd = parameter_comparison_summary["inflection2", "estimated_sd"],
  precip_rate2_mean = parameter_comparison_summary["rate2", "estimated_mean"],
  precip_rate2_sd = parameter_comparison_summary["rate2", "estimated_sd"],
  precip_n_samples_mean = parameter_comparison_summary["n_samples", "estimated_mean"],
  precip_n_samples_sd = parameter_comparison_summary["n_samples", "estimated_sd"],
  precip_max_sample_size_mean = parameter_comparison_summary["max_sample_size", "estimated_mean"],
  precip_max_sample_size_sd = parameter_comparison_summary["max_sample_size", "estimated_sd"]
)
Run the model to generate a number of cumulative curves:
weather_model <- run_weather_model(weather_model, num_years = 30, show_warnings = FALSE)
Plot original and generated curves:
plot(c(1, weather_model$PARAMS$year_length), c(0, 1), ann = F, bty = 'n', type = 'n', ylab = 'curve')

# original curves
for (i in 1:length(curves))
{
  lines(curves[[i]], col = i, lwd = 3)
}

# generated curves
for (year in unique(weather_model$daily$current_year))
{
  lines(1:weather_model$PARAMS$year_length,
        get_cumulative_precipitation_of_year(
          weather_model$daily$precipitation[
            weather_model$daily$current_year == year
          ]),
        col = "grey",
        lty = 2)
}
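Finally, as a closing check (a sketch, not part of the original walkthrough), one could compare the annual precipitation totals of the generated years against the observed annual sums stored in annual_sum_values at the start of this section, assuming weather_model$daily$precipitation holds daily precipitation values as used in the plot above:
# Sketch: compare generated annual totals with the six observed annual sums.
# Assumes weather_model$daily$precipitation holds daily precipitation values.
generated_annual_sums <- tapply(weather_model$daily$precipitation,
                                weather_model$daily$current_year,
                                sum)

summary(generated_annual_sums)
summary(annual_sum_values)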