Benchmarking shoreline prediction models over multi-decadal timescales

Model submissions

As a benchmarking exercise conducted at an anonymized site (BeachX), with full details provided in the Benchmarking Setup section, ShoreShop2.0 solicited submissions from all types of shoreline models, including physics-based, hybrid, and data-driven models. However, only models defined as data-driven models (DDM) and hybrid models (HM) were submitted. DDMs, including regression, machine learning, and statistical models, rely entirely on data to establish relationships between wave characteristics and shoreline positions. In contrast, HMs include physical constraints through defined mathematical relationships and use data to calibrate free parameters. In ShoreShop2.0, 34 models, including 12 DDMs and 22 HMs, were evaluated and compared as part of the blind competition. Nearly all models were transect-based, with free parameters that were independently associated with and calibrated for each transect, except for four non-transect-based models that used a single set of free parameters for all transects. All submitted models completed the short-term (2019–2023) prediction task, while 29 provided medium-term (1951–1998) predictions, and 20 extended projections for the long-term period (2019–2100). Seven additional DDMs and five HMs were submitted after ShoreShop2.0 and are included here as references for potential model improvements, informed by lessons learned during the workshop and additional insights into the shoreline data; however, they are not considered blind tests because the initially withheld data were made available immediately following the workshop. For HMs, such as COCOONED39, CoSMoS-COAST34, ShoreFor11, LX-Shore13 and ShorelineS40, different versions from various modelers were also evaluated. While most of these models have been validated and applied across different beach types, this benchmarking tested their ability to transfer to an unstudied site.
The characteristics of each model submission are provided in Supplementary Table S1, and a detailed description of each model is available in the GitHub and archived repository41 as individual README files. Previous validation and application practices of the models are summarized in Supplementary Table S2.

Short-term model comparison

With agglomerative-hierarchical clustering42, blind model predictions for the short-term period (2019–2023) can be grouped into six distinct clusters based on the dissimilarity of temporal patterns (Fig. 2a). Details of the clustering process are described in the Model Clustering section. Clusters 1 and 2 (Fig. 2b–d) consist of HMs, most of which rely on the MD0443 or Y099 empirical shoreline models to quantify cross-shore sediment transport. These two clusters are characterized by sharp shoreline retreat in response to storms, followed by gentle recovery, as is evident in the ensembles of Clusters 1 and 2. The main distinction between Clusters 1 and 2 is their approach to incorporating longshore sediment transport. Most models in Cluster 1 either do not explicitly model longshore sediment transport (e.g., Y09_LFP, SLRM_LIM, and EqShoreB_MB) or incorporate it using beach rotation models (e.g., IH_MOOSE_LFP14), while models in Cluster 2 adopt CERC-like equations44 to quantify shoreline change related to gradients in longshore sediment transport.
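The grouping step described above can be illustrated in a few lines. The sketch below is not the study's code and uses entirely synthetic data: it applies SciPy's Ward's minimum-variance linkage on Euclidean distances to separate two hypothetical families of model predictions, one storm-responsive and one smoothly seasonal.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Illustrative sketch (synthetic data, not the ShoreShop2.0 code):
# cluster model prediction time series by the dissimilarity of their
# temporal patterns using Ward's minimum-variance clustering.
rng = np.random.default_rng(0)
t = np.arange(365)

# Two hypothetical model families: storm-responsive (sharp drop, slow
# recovery after day 100) versus smooth seasonal oscillation.
storm = -5.0 * (t > 100) * np.exp(-(t - 100).clip(min=0) / 30.0)
smooth = 2.0 * np.sin(2 * np.pi * t / 365)
models = np.vstack(
    [storm + rng.normal(0, 0.3, t.size) for _ in range(4)]
    + [smooth + rng.normal(0, 0.3, t.size) for _ in range(4)]
)

# Ward linkage requires Euclidean distances between the time series.
Z = linkage(models, method="ward", metric="euclidean")
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)  # the two families fall into two separate clusters
```

In the study, the same procedure is applied to the matrix of submitted model predictions, with the dendrogram cut to yield six clusters.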

Fig. 2: Clustering of short-term model predictions from the ensemble of blind-test submissions. a Dendrogram resulting from Euclidean distance-based Ward’s minimum variance clustering91. Blue and brown colors of tick labels represent DDM and HM, respectively. b–j Short-term prediction of shoreline positions from different clusters of models for different transects. The deep red line is the ensemble mean (interval mean between 5th and 95th percentiles) of models within each cluster. Black scatters with error bars are SDS shoreline positions with 8.9 m RMSE. The predictions by each individual model can be visualized using the online, interactive version of this plot (https://shoreshop.github.io/ShoreModel_Benchmark/plots.html). Full size image

Clusters 3 and 4 (Fig. 2e–g) consist of a mixture of HMs and DDMs. Models in these clusters exhibit relatively low-frequency variation and smooth trends. Cluster 3 includes the three best-performing models for the short-term period (i.e., GAT-LSTM_YM, iTransformer and CoSMoS-COAST-CONV_SV, ranked in Supplementary Fig. S1), with coherent variability independent of model type. All the HMs in Cluster 4 incorporate longshore sediment transport with CERC-like equations. Although some of them (e.g., the CoSMoS-COAST models) use the MD0443 or Y099 model for cross-shore sediment transport, the models in Cluster 4 are less responsive to storms than the models in Clusters 1 and 2.

Clusters 5 and 6 (Fig. 2h–j) consist of DDMs that struggle to predict shoreline positions (based on the results in Supplementary Fig. S1). Among these models, SARIMAX_AG, XGBoost_AG, and Catboost_MI in Cluster 5 are characterized by high-frequency fluctuations that correspond closely to daily wave characteristics. In contrast, models like SPADS_AG, ConvLSTM2D_LFP and wNOISE_JAAA in Cluster 6 exhibit less noise but struggle to accurately capture shoreline variability. As a result, the ensemble of models in Clusters 5 and 6 exhibits the highest noise and the lowest accuracy. Across all clusters, transects 2 and 8, which represent the ends of the beach and experience larger shoreline variations, are predicted more accurately, whereas transect 5, with its smaller and more irregular variations, presents a greater prediction challenge.

Medium-term model comparison

As the timescale of analysis increases from short-term (5 years) to medium-term (50 years), the clustering of model predictions changes (Fig. 3a). The first cluster of medium-term predictions is the same as Cluster 6 of the short-term predictions and includes noisy DDMs. Despite their daily-scale variations, the inter-annual variability of these models is comparable to that of the smoother models in Cluster 2, which mostly overlaps with short-term Cluster 3 and represents the best-performing models. Clusters 3 and 4 of the medium-term predictions overlap largely with Clusters 2 and 1, respectively, of the short-term predictions. These model predictions feature large and rapid responses to storms, which become more evident in the medium term, when more severe storm events (e.g., in 1972 and 1974) were observed. Model ensembles in Clusters 3 and 4 tend to predict larger shoreline erosion in response to these events than other clusters.

Fig. 3: Clustering and ensemble of medium-term model predictions. a Dendrogram resulting from Euclidean distance-based Ward’s minimum variance clustering91. Blue and brown colors of tick labels represent DDM and HM, respectively. b–j Clusters of medium-term predictions of shoreline positions for different transects. The deep red line is the ensemble mean (interval mean between 5th and 95th percentiles) of models within each cluster. Black circles and dots represent the target shoreline positions pre- and post-1986, respectively, for better visualization. The predictions by each individual model can be visualized using the online, interactive version of this plot (https://shoreshop.github.io/ShoreModel_Benchmark/plots.html). Full size image

Clusters 5 and 6 consist of model predictions with larger medium-term variations, for different reasons. For the models in Cluster 5, shoreline change is primarily driven by gradients in longshore sediment transport, resulting in planform response and redistribution of sediment, in contrast to the episodic beach erosion caused by cross-shore sediment transport43,45,46. The large variation of model performance in Cluster 6 is attributed to the extreme sensitivity of the ShoreFor model to shifts in wave climate11,30,47. Because the hindcast wave data use different observations for data assimilation pre- and post-197948, the wave climate changes slightly around 1979. This minor change in the distribution of waves leads to the large long-term divergence of the ShoreFor-based models (e.g., SegShoreFor_XC and ShoreForCaCeHb_KS) unless additional modeling techniques to address this issue are included (e.g., ShoreForAndRotation_GA).
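The kind of subtle wave-climate shift that destabilizes ShoreFor-type models can be surprisingly small. As an illustration only (synthetic Weibull-distributed wave heights with a hypothetical ~3% scale change; not the hindcast data), a two-sample Kolmogorov–Smirnov test readily detects such a shift once the sample is long enough:

```python
import numpy as np
from scipy.stats import ks_2samp

# Illustrative sketch (synthetic waves, hypothetical shift magnitude):
# a small change in the wave-height distribution, like the pre/post-1979
# change in the hindcast, is statistically detectable with long records.
rng = np.random.default_rng(3)
hs_pre = rng.weibull(2.0, 20000) * 1.50   # hypothetical Hs before the shift (m)
hs_post = rng.weibull(2.0, 20000) * 1.55  # ~3% larger scale after the shift

stat, p = ks_2samp(hs_pre, hs_post)
print(f"means: {hs_pre.mean():.2f} vs {hs_post.mean():.2f} m, KS p={p:.1e}")
```

Even though the mean wave height changes by only a few centimeters, a shoreline model whose free parameters were calibrated on one regime can diverge when driven by the other.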

Long-term model comparison

Although prediction of future states is a common goal among modeling applications, the accuracy of long-term (2019–2100) model projections cannot be critically evaluated due to the absence of observational data. Instead, the ensemble and variability of these projections can be used for statistical analysis of long-term coastal erosion risks (Fig. 4). Here, the 15 models incorporating sea-level rise (Supplementary Table S1) are included in the analysis. The ensemble projections (Fig. 4a1–c1) in both future climate scenarios exhibit strong seasonal and interannual variability driven by the variation of wave climates (Fig. 4d). This variability is more pronounced than the long-term trend of shoreline retreat caused by sea-level rise (Fig. 4d), particularly at transects 2 and 8. With the combined impacts of changing wave climates and sea-level rise over time, the frequency of shoreline erosion reaching the cross-shore location of the present-day dune toe increases with time. Similar to the first five years evaluated in the short-term comparison, the final five years of the 21st century (2095–2100, Fig. 4a2–c2) show that most models continue to provide consistent shoreline prediction statistics. Only a few models (one for transects 2 and 5, and four for transect 8) project that the average shoreline position will reach the present-day cross-shore location of the dune toe. However, when wave-driven shoreline erosion and seasonal effects are considered (i.e., the temporal variation of the predictions), the dune-erosion risk increases, particularly at transect 8, where 7 out of 15 models project maximum seasonal shoreline erosion to reach the present-day dune toe in both the RCP4.5 and RCP8.5 scenarios. For transect 8, most models project similar shoreline positions during the 2095–2100 period for both scenarios in terms of temporal minimum, maximum, and mean. However, the difference between the RCP scenarios is substantially larger for transects 2 and 5, with most models projecting greater erosion in the RCP8.5 scenario.
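The ensemble statistics behind this kind of risk analysis reduce to simple operations over a model-by-time projection matrix. A minimal sketch, with synthetic projections and a hypothetical dune-toe position (none of these numbers come from the study):

```python
import numpy as np

# Illustrative sketch (synthetic data, hypothetical dune-toe position):
# summarize an ensemble of long-term shoreline projections via the
# ensemble mean, the min/max envelope, and a count of models whose
# temporal minimum reaches the present-day dune toe.
rng = np.random.default_rng(1)
n_models, n_months = 15, 12 * 80            # 15 models, 80 years of monthly output
trend = -0.05 * np.arange(n_months) / 12    # assumed slow retreat (0.05 m/yr)
season = 5 * np.sin(2 * np.pi * np.arange(n_months) / 12)
proj = trend + season + rng.normal(0, 2, (n_models, n_months))

ens_mean = proj.mean(axis=0)                # ensemble mean per month
ens_min, ens_max = proj.min(axis=0), proj.max(axis=0)

dune_toe = -15.0                            # hypothetical cross-shore position (m)
reaches_toe = (proj.min(axis=1) <= dune_toe).sum()
print(f"{reaches_toe} of {n_models} models reach the dune toe at least once")
```

Counting models whose temporal minimum crosses a threshold, rather than only comparing ensemble means, is what makes the seasonal wave-driven excursions matter for the dune-erosion risk statistics.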

Fig. 4: Long-term shoreline projections in response to waves and sea-level rise. a1–c1 Ensemble of monthly long-term shoreline projections in the RCP4.5 and RCP8.5 scenarios, including only models that account for sea-level impacts. Solid lines are ensemble means, while the shaded areas represent the range between minimum and maximum projections. The red dash-dot line marks the position of the present-day dune toe. a2–c2 Model-wise statistics of shoreline projections between 2095 and 2100. Circles represent means, while caps indicate the range between temporal minimum and maximum. d Wave and sea-level projections. Solid lines are the 1-year backward running mean of significant wave height \({H}_{s}\), while dashed lines are yearly sea-level rise with respect to the mean sea level recorded between 1995 and 2014. The projection of each individual model can be visualized in the online, interactive version of this plot (https://shoreshop.github.io/ShoreModel_Benchmark/plots.html). Full size image

Model metrics

The Taylor diagram49 and related loss function (\({{\mathcal{L}}}\), refer to Eq. 2 in Methods) are used to benchmark model performance in ShoreShop2.0. Models are ranked based on the average loss \(\bar{{{\mathcal{L}}}}\) across all the different transects and for each timescale (Fig. 5). The evaluation for the medium-term task is separated into pre-1986 (1951–1985) and post-1986 (1986–1998) periods due to differences in the density and source of target data (i.e., photogrammetry versus satellite). In the majority of the Taylor diagrams in Fig. 5, the centered root mean square error (CRMSE) of models reaches the intrinsic accuracy (8.9 m) of SDS as reported for the adjacent Narrabeen Beach21, suggesting that model accuracy is beginning to be limited by the accuracy of the shoreline data used to train and validate the models. Examining Fig. 5 in more detail, the general model performance is comparable for the two ends of the beach, transects 2 (left column) and 8 (right column), across all periods and is substantially better than for transect 5, which represents the center of the embayment (center column). This is because the ends of the embayed beach oscillate with the seasonal directional wave climate, whereas the center of the embayment may be more influenced by contrasting cross-shore and alongshore processes or by the alongshore propagation of sand waves and sandbars through the middle of the beach. The model performance for medium-term prediction (Fig. 5d–i) is comparable to, if not better than, that for the short-term period, demonstrating the potential of the suite of shoreline models available for this benchmarking competition to reliably predict up to 50 years of coastal variability and shoreline change. The better skill metrics of the Medium (1951–1985) task (Fig. 5d–f) compared to other periods can be attributed to two factors. First, there are only six data points available pre-1986 for validation using the available photogrammetry data, compared to more than 100 data points available in other periods (refer to Fig. 3), which will undoubtedly influence the error statistics. Second, the aerial photogrammetry dataset generates a full beach profile above mean sea level, from which a specific mean sea level (MSL) shoreline contour can be extracted. Shoreline data based on MSL contours are less susceptible to noise than SDS data containing errors associated with tides, wave setup and runup21. The limitations of SDS data are further described in the Discussion section.
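The quantities plotted in a Taylor diagram follow directly from their definitions. A minimal sketch with synthetic data (`taylor_stats` is an illustrative helper, not the benchmark code):

```python
import numpy as np

def taylor_stats(obs, pred):
    """Statistics behind a Taylor diagram: correlation coefficient,
    standard-deviation ratio, and centered RMSE (bias removed)."""
    obs, pred = np.asarray(obs, float), np.asarray(pred, float)
    r = np.corrcoef(obs, pred)[0, 1]
    ratio = pred.std() / obs.std()
    # CRMSE removes the mean of each series before differencing, so a
    # constant bias does not contribute to this error measure.
    crmse = np.sqrt(np.mean(((pred - pred.mean()) - (obs - obs.mean())) ** 2))
    return r, ratio, crmse

# Synthetic check: a prediction that equals the observation plus a
# constant bias has r = 1, std ratio = 1, and CRMSE = 0.
obs = np.sin(np.linspace(0, 10, 200))
r, ratio, crmse = taylor_stats(obs, obs + 3.0)
print(round(r, 3), round(ratio, 3), round(crmse, 6))  # 1.0 1.0 0.0
```

Because CRMSE is blind to bias, the diagram is usually paired with a bias-aware metric (here the loss \({{\mathcal{L}}}\) and, later, Mielke's \(\lambda\)) when ranking models.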

Fig. 5: Model performance in Taylor diagrams. a–i Taylor diagrams for different transects and timescales. The diagrams show the normalized standard deviation (radial x- and y-axes), correlation coefficient (curved axis along the circumference of the circle), and normalized centered root mean square error (CRMSE, concentric dashed arcs). Stars, circles, and squares represent HM, DDM, and ensemble mean, respectively. Solid and hollow markers distinguish models submitted before (blind) and after (non-blind) ShoreShop2.0, respectively. The black triangle (Observed) shows the observed data, which plot in a Taylor diagram with zero error. Model performance is indicated by the distance of the scatter points of model predictions from the observed data. The red dashed arc indicates the normalized RMSE of SDS (8.9 m) with respect to the observed shoreline standard deviation (STD) for that time period. The legend entries are sorted based on the average loss \(\bar{{{\mathcal{L}}}}\) (displayed in brackets) for all transects and timescales where predictions are available. The superscript * after a model name indicates non-blind models submitted after ShoreShop2.0. The Taylor diagrams and model ranking for each timescale can be found at https://github.com/ShoreShop/ShoreModel_Benchmark. Full size image

Comparing the average loss across all three periods, the top three performing models were GAT-LSTM_YM, iTransformer-KC, and CoSMoS-COAST-CONV_SV, two of which are DDMs. The GAT-LSTM_YM model was the top-performing Medium (1951–1985) model, and CoSMoS-COAST-CONV_SV was the top-performing model for both the Short (2019–2023) and Medium (1986–1998) tasks. The median \(\bar{{{\mathcal{L}}}}\) of HMs (1.27) was marginally better than that of DDMs (1.28). In contrast to ShoreShop1.0 in 2018, where the model ensemble was recognized as the top-performing prediction, several individual models outperformed the ensemble in ShoreShop2.0. The predictions from most models are highly correlated, with only a few model pairs showing statistical non-correlation (P value > 0.01 in Pearson’s non-correlation test; Supplementary Fig. S2). With the availability of the previously hidden shoreline data and input-data pre-processing methods learned from the ShoreShop2.0 in-person workshop held in October 2024, all the non-blind model submissions except for EqShoreB_MB improved their accuracy. The detailed loss scores for each model and for different transects and tasks can be found in Supplementary Fig. S1.

The model performance was further evaluated using quantile-quantile plots and metrics used in ShoreShop1.0 for the short-term and medium-term (1986–1998) tasks, which have abundant target data (Fig. 6a–f). Although most models have high quantile-quantile correlations with the target data, biases are evident in several models (Fig. 6a–f). Notably, the underestimation of extreme shoreline positions is a recurring issue for many models, particularly for transects 2 and 5, a limitation that was also identified in ShoreShop1.025.
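The quantile-quantile comparison can be reproduced with matched empirical quantiles. In the sketch below (synthetic data; the "damped" model is a made-up example, not one of the submissions), a Q-Q slope below 1 is the signature of the underestimated extremes noted above:

```python
import numpy as np

# Illustrative sketch: compare the distributions of predicted and
# observed shoreline positions via matched empirical quantiles, as in
# a quantile-quantile plot.
rng = np.random.default_rng(2)
obs = rng.normal(50, 10, 300)                   # synthetic observed positions (m)
pred = 0.8 * obs + 5 + rng.normal(0, 3, 300)    # damped model: squeezes extremes

q = np.linspace(0.01, 0.99, 99)
q_obs = np.quantile(obs, q)
q_pred = np.quantile(pred, q)

# A slope below 1 in the Q-Q relationship indicates that the model's
# distribution is narrower than the observed one, i.e., it
# underestimates extreme shoreline positions.
slope = np.polyfit(q_obs, q_pred, 1)[0]
print(f"Q-Q slope: {slope:.2f}")
```

Note that a model can score a high Q-Q correlation while still showing this compression of the tails, which is why the slope (or visual departure from the 1:1 line) carries the diagnostic information.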

Fig. 6: Blind model performance for short- and medium-term model predictions. a–f Quantile-quantile plots for the three target transects across short-term (2019–2023) and medium-term (1986–1998) timescales. g Mielke’s modification index (\(\lambda\)). Squares, stars, and circles correspond to transects 2, 5, and 8, respectively, while hollow and solid markers distinguish short-term (2019–2023) and medium-term (1986–1998) results. The horizontal dashed red line indicates the ensemble model metrics reported in ShoreShop1.0. Models are arranged based on the average loss function \(\bar{{{\mathcal{L}}}}\) across target transects for the short-term prediction. The superscript * after a model name indicates non-blind models submitted after ShoreShop2.0. The quantile-quantile correlation of each individual model can be found in the online, interactive version of this plot (https://shoreshop.github.io/ShoreModel_Benchmark/plots.html). Full size image

Following ShoreShop1.0, Mielke’s modification index \(\lambda\)50, which accounts for both bias and dispersion, is also used to evaluate model performance. \(\lambda\) values range from 0 to 1, with \(\lambda = 1\) representing perfect agreement and \(\lambda = 0\) representing no agreement between observation and prediction. Compared to ShoreShop1.0, which benchmarked models over a 3-year period at Tairua Beach, NZ, the \(\lambda\) values in ShoreShop2.0 show slight improvements in some instances (Fig. 6g) despite the use of less accurate and less frequent shoreline data for training. However, model performance is also not necessarily consistent across all transects and timescales, as indicated by the range of \(\lambda\) for each model. Some of the best short-term models are also the worst medium-term performers (e.g., IH_MOOSE_LFP and SegShoreFor_XC), whereas other models exhibit more consistent metrics (e.g., CoSMoS-COAST-CONV_SV and GAT-LSTM_YM) across different timescales for transects 2 and 8. This is attributed to the different governing physics and architectures of the models used in ShoreShop2.0. Most non-blind models substantially improve their scores for individual transects and tasks; however, the consistency of performance shows less improvement in the non-blind models submitted after the workshop.
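One common formulation of Mielke's modified agreement index can be sketched as follows. This is an illustrative helper with synthetic data; the exact formulation used in ShoreShop may differ (e.g., in how negative correlations are handled):

```python
import numpy as np

def mielke_lambda(obs, pred):
    """Mielke's modified agreement index, in one common formulation:
    lambda = 1 - MSE / (var_obs + var_pred + (mean_obs - mean_pred)^2).
    Equals 1 for perfect agreement and approaches 0 as agreement
    vanishes (this simple form assumes non-negative correlation)."""
    obs, pred = np.asarray(obs, float), np.asarray(pred, float)
    mse = np.mean((obs - pred) ** 2)
    denom = obs.var() + pred.var() + (obs.mean() - pred.mean()) ** 2
    return 1.0 - mse / denom

obs = np.sin(np.linspace(0, 20, 500))
print(mielke_lambda(obs, obs))                  # 1.0: perfect agreement
print(round(mielke_lambda(obs, obs + 1.0), 3))  # < 1: constant bias penalized
```

Unlike the CRMSE of a Taylor diagram, the mean-difference term in the denominator means a constant bias lowers \(\lambda\), which is why it complements the Taylor-diagram metrics in this benchmarking.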


Source: https://www.nature.com/articles/s43247-025-02550-4
