SeaValDocumentation/DevelopmentGuidelines.Rmd at master · SeasonalForecastingEngine/SeaValDocumentation · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
---
title: "SeaVal - Notes on Development"
author: "Claudio"
date: "`r format(Sys.time(), '%d %B, %Y')`"
output: html_document
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```


* Within `SeaVal`, data is organized as data.tables. A major advantage of data.tables is that it is very easy to group operations. Generally, for evaluation forecasts that's super important: You might want to calculate the overall MSE for a range of predictions, or a separate MSE for each month, or for each grid point, or for each country and each month, etc.
* Generally data tables will contain columns containing *dimension variables* (e.g. `lon`/`lat`/`month`/`year`) and columns specifying *values* (e.g. `temp`,`pred`,`climatology`). Besides the usual candidates, the following columns are also considered *dimension variables* rather than *values*: `lead_time`, `member` (for ensemble forecasts), `system` (when you consider different NWPs in the same data table). Per unique dimension variable the data table may contain only one row.
* We should try being nice with errors, warnings, and messages. Please use `message` rather than `print` (looks nicer and can be suppressed).
* function names should be concise and as consistent as possible (e.g., we shouldn't name a function `MSE` with upper-case letters and another one `crps` with lower-case letters). Many functions will have the same input arguments, and these should be named consistently as well. I tried being consistent about using only underscores in function names (e.g. `climatology_ens_forecast`) and using only dots for arguments (e.g. `check.dims`, see below).
* We should test our functions on standard datasets. I started a folder scripts/tests/ to this end.
* Please do not work on the master branch. Instead create your own branch and make a pull request once you're finished with changes that are ready to be included in the final package.
* We possibly should start a to-do-list somewhere digitally.

# Evaluation Functions

There are evaluation tools that return a number (I call them scores, even though stuff like bias-estimates also belongs to this category) and tools that return diagrams/plots, such as reliability diagrams. My suggestion is to output:

* A data table for tools that return a number. The data table would have all grouping columns plus the 'score' column, e.g. lon/lat/month/crps.
* A ggplot object for tools returning a plot, when no grouping variables are provided.
* A list of ggplot objects for tools returning a plot, when grouping variables are provided. The list should contain a named page (e.g. 'grouping') containing the grouping levels (e.g. lon/lat/month)


For any evaluation metric, the data should be provided as a data table. There are four different types of columns within the data that need to be treated differently:
*forecast*, *observation*, *grouping variables* and *pooling variables*. Forecast and observation are pretty self-explanatory.
The grouping variables contain dimension variables for stratification. For each level of these dimension variables the evaluation metric is evaluated separately and a separate result is returned. The pooling variables contain dimension variables that are pooled together in the calculation. Say, for example, we want to calculate (mean) squared errors from the following typical data table:
```{r,echo = F,eval = TRUE,message=FALSE}
library(SeaVal)
fcs = ecmwf_monthly
fcs = fcs[,.(pred = mean(prec)),.(lon,lat,year,month)]
months = unique(fcs[,month])
years = unique(fcs[,year])
obs = load_chirps(months = months,years = years)
setnames(obs,'prec','obs')
dt = merge(fcs,obs,by = c('lon','lat','year','month'))
setkey(dt,year,month,lon,lat)
print(dt)
```

Maybe you want to analyze the MSE for each grid point and each month separately. In this case, your grouping columns would be `lon`, `lat` and `month`, and your pooling variables would be `year`. Then, the MSE-function would calculate `mean((pred-obs)^2)` separately for each month and gridpoint, and the mean is taken over all years. But what if you want to analyze your overall prediction error as a time series instead? No problem, just select `month` and `year` as grouping variables and `lon`, `lat` as over-columns. Then, the squared error is averaged over all locations, separately for each month and year.
You can also select all four dimension variables (lon/lat/year/month) as grouping variables (resulting in one squared error per dimension variable, no mean is taken) or all four dimension variables as pooling variables (resulting in one number, the overall MSE). *At least in theory you should be able to do that. As of today, some of the functions might crash because I haven't paid extensive attention to that.*

On first glance one might think 'Why not simply calculating always on the highest resolution (grouped by all dimension variables) and returning the full result? Then, the user can group themselves by averaging however they want.' This would work in the example of the MSE and for most other proper scores, but not for other metrics. You could not calculate, say, a correlation or a skill score separately for each dimension variable (i.e. based on a single value), and then average afterwards. Also, the philosophy of having grouping dimension variables and pooling dimension variables also applies to diagrams: You could derive reliability diagrams from a data table similar as above, grouped by month and pooling by lon/lat/year.

Following this philosophy, a function for an evaluation tool should always have the following arguments:
```{r,eval = FALSE}
MSE(dt,
    f,o,by,pool,
    dim.check)
```

Here, `dt` is the data table containing the data table in the desired format. The arguments `f`/`o`/`by`/`pool` all take characters or character vectors specifying the forecast-, observation-, grouping-, and pooling variables, respectively. The grouping variables are called `by` as reference to data table functionality. Meaning they need to contain column names present in `dt`. It is not allowed for `dt` to have columns with the names `'f'`, `'o'`, `'by'` or `'pool'`. The reason is that this does not work with the function `get`: The main line of the MSE function looks something like this:
```{r, eval = FALSE}
mse_dt = dt[,.(MSE = mean((get(f) - get(obs))^2)), by = by]
```
Here, `get(f)` tells data table to look for a column that is named the *value of* `f` (e.g. `'pred'` if you put `f = 'pred'` in the function header) rather than looking for a column called `'f'`.
Unfortunately, this doesn't work correctly when you try to actually call your forecast column `'f'`. Providing `f = 'f'` will confuse the `get` function and make the function crash (with a nice descriptive error message ;-)). This is the reason why I chose `f` and `o` for forecast- and observation columns, rather than more descriptive names, such as `obs`. Then, a user would not be allowed to have a column called `'obs'` in their data table which seems more restrictive than not being allowed to have a column called `'o'` (in fact `'obs'` is the default for `o` in most cases).
There would be ways to code around this, e.g. avoiding `get`, but all options I'm aware of are messy and make the function code way harder to read, so I think it's not worth trying to account for this.

The code example for calculating `mse_dt` above highlights another weird conundrum when working with data tables: For many (if not most) evaluation metrics, the `pool` variable(s) does not actually occur in the function code. Data table groups by `by` and takes the average error over all rows of `dt` that have identical values in the `by`-columns. This means our scores implicitly assumed that the columns in `by` and `pool` together contain all dimension variables. Whether this is satisfied is checked by the function when `dim.check = TRUE` (the default). Most of the score functions really only need the `pool` argument to run this check, and don't use it when `dim.check = FALSE`. Both the `by` and `pool` arguments have sensible defaults, such that the user usually normally does not need to pass them to the function.
If they want to do non-standard stuff, I think it's fair to ask for the extra-input `pool` to run a sanity check. It happened to me more than once that I accidentally averaged scores over a column I didn't mean to average over and had a hard time finding out what's wrong.

In general, different forecast/observation formats automatically require different data table formats. For example, a score applicable to ensemble forecasts might have an additional column `'member'` that does not belong to either of the four categories: Let us consider again the MSE as an example, and let's say we put the `'member'` column to the grouping variables. Then, we get the MSE for each member separately (but still averaged over all years) which is useless for an ensemble that is exchangable. If, instead, we put it to the pooling variables you end up with the squared error averaged over all ensemble members which is also not what you want. On the other hand, tercile forecasts, for example, do not contain a special `member`-column, but contain multiple predictions per observation (for the low-, medium- and high category). Therefore evaluation functions for these different types of forecasts require input data in different formats and require different arguments.

Therefore, on a function level we have more categories than just scores and diagrams, rather we have scores and diagrams for different forecast types. For example, the largest category we currently have is *scores for ensemble forecasts*, which includes the CPA, CRPS, CRPSS, MSE, MSES, and PCC.
I think it is useful to classify our evaluation functions like this. It helps us setting defaults, testing input and have general auxiliary functions. One example of each, from the scores-for-ensemble-forecasts-category:

* All functions in there have the default `by = by_cols_ens_fc_score()`, which is just an auxiliary function returning the typical names of grouping variables in `dt` for these kind of forecasts. If we need to change the default (e.g. add 'MONTH' because some people for some reason want to use capitalized column names), we can do this within the function and it automatically affects the defaults for all scores of ensemble forecasts.
* All functions in this category test their input calling the same function `checks_ens_fc_score()`, e.g. whether it has illegal column names etc.
* A general auxiliary function for this category is the function `climatology_ens_forecast` which is called by the skill scores `MSES` and `CRPSS`. This computes a leave-one-year-out ensemble climatology forecast from the data.

# Contributing to Development

The project is hosted in the github repository `SeasonalForecastingEngine/SeaVal`. When you first download that repository, please check that the .gitignore file is set up correctly. You should find it directly in your SeaVal folder and it should look similar to this:

```{r eval = FALSE, echo = TRUE}
.Rproj.user
scripts/SeaValDocumentation/bookdown/_temp.RData
scripts/SeaValDocumentation/bookdown/_data_dir.RData
scripts/SeaValDocumentation/bookdown/render_tutorial.R
*.Rhistory
scripts/SeaValDocumentation/bookdown/_book/
scripts/SeaValDocumentation/bookdown/_bookdown_files/
```

You are likely missing the line `scripts/SeaValDocumentation/bookdown/render_tutorial.R`, which you absolutely have to add when you want to contribute to the [documentation website](http://files.nr.no/samba/CONFER/SeaVal/).
After adding it to the .gitignore, please open the `render_tutorial.R` file and adjust the `setwd`-line to your directory.

## Updating Documentation

For updating the documentation, go into the directory `SeaVal/scripts/SeaValDocumentation/bookdown/`. This is the directory from which bookdown renders the documentation. The ordering of the files happens automatically: index.rmd is processed first, and thereafter all other .rmd files in order of their numbering. Each one generates it's own chapter. So if you want to add a new chapter, say, between chapter 2 and 3, you have to start its name by `03` and then change the names of all .rmd files beginning with `03` or higher numbers by increasing their number by one. Do not compile the .rmd files manually. When you're done with your changes, just run the 'render_tutorial.R' script. This renders the entire tutorial anew and publishes it online.