Technical note: Evaluation of three machine learning models for surface ocean CO2 mapping

Reconstructing surface ocean CO2 from sparse measurements plays an important role in estimating oceanic CO2 uptake. The 14 models included in the Surface Ocean CO2 Mapping (SOCOM) inter-comparison initiative differ to varying degrees; five of them used neural networks. This investigation evaluates two neural network approaches used in SOCOM, self-organizing maps (SOMs) and feedforward neural networks (FNNs), and introduces a third machine learning model, the support vector machine (SVM), for ocean CO2 mapping. The technical note provides a practical guide to selecting among the models.


Model equations
The machine learning models included in this study cannot directly model the long-term trend of CO2. Therefore, we express the dependence of CO2 fugacity (fCO2) on year (YR), month (MON), latitude (LAT), and longitude (LON) as the sum of a nonlinear static component and a linear trend component:

fCO2(YR, MON, LAT, LON) = f(MON, LAT, LON) + r (YR − YRref). (1)

As available observations are scarce with respect to the biogeochemical properties of the surface ocean, we used sea surface temperature (SST), sea surface salinity (SSS), chlorophyll-a concentration (CHL), and mixed layer depth (MLD) as proxy variables of space and time. These proxy variables were commonly used by models included in SOCOM. The model equation becomes

fCO2 = f(SST, dSST, SSS, CHL, MLD, LAT) + r (YR − YRref), (2)

where dSST denotes the difference between the monthly and annual means of SST. Here we excluded LON and MON.
Published by Copernicus Publications on behalf of the European Geosciences Union.
J. Zeng et al.: Evaluation of three machine learning models for surface ocean CO2 mapping

These variables have a circular property and therefore cannot be used directly. For instance, longitude −180° is geographically connected to longitude 180°, but numerically they appear to the models as two extreme longitude values. Zeng et al. (2014, 2015) circumvented this problem by using sine- and cosine-transformed components. Their approach could unintentionally enhance the influence of LON and MON on fCO2, as one more variable derived from each of them was added to the model. We excluded LON in the belief that the combination of SST, SSS, CHL, and MLD contains sufficient spatial information, but retained LAT for its different seasonal and geophysical meanings in the Northern and Southern hemispheres. Replacing MON with dSST also improves the geographic expression of the seasonal effect.
The database has a 1° × 1° spatial resolution and includes global measurements from 1970 to 2014. Similar to Zeng et al. (2014), we excluded some data points by these criteria: (i) fCO2 values smaller than 250 µatm or larger than 550 µatm, (ii) ocean depth smaller than 500 m, (iii) salinity smaller than 25.0, and (iv) year before 1990. A total of 158 052 data points were extracted with these conditions.
The monthly SST data of 1990 to 2015 were extracted from the Optimum Interpolation (OI) V2 product of NOAA (Reynolds et al., 2002). The monthly SSS climatology was extracted from the World Ocean Atlas 2013 (WOA13) product (Boyer et al., 2013), which contains the monthly mean SSS from 27 June 1896 to 25 December 2012. The monthly CHL climatology was calculated using the MODIS Aqua and SeaWiFS climatologies, which cover the period of 2012 to 2015. The mean of the two CHLs was used as the CHL climatology. The mixed layer data were derived from the Monthly Isopycnal and Mixed-layer Ocean Climatology of NOAA (Schmidtko et al., 2013), which covers the period of 1955 to 2012.

Machine learning models
The Appendix and Table 1 summarize the algorithms of the three models. Here we focus on discussing their usage in CO2 mapping.
The trend in Eq. (2) cannot be modeled directly by the models. One approach to dealing with the problem is to normalize the measurements to a reference year using a global rate and to model only the nonlinear component. Zeng et al. (2014) presented a method to model the linear component in Eq. (2). Instead of repeating the process, we used their annual rate of 1.5 µatm yr−1 to remove the trend from fCO2 and normalize it to the reference year 2005, i.e.,

fCO2,2005 = fCO2 − 1.5 (YR − 2005). (3)

Although Takahashi et al. (2014) obtained a global mean rate of 1.9 µatm yr−1, we used 1.5 µatm yr−1 as this rate was obtained by using the gridded fCO2 of SOCAT version 2. The normalized fCO2 was used to model the nonlinear component in Eq. (2). In later discussions, fCO2 means the normalized fCO2 unless explicitly stated. Similarly, we applied the log transform of Zeng et al. (2014) to CHL prior to the data scaling discussed below, i.e.,

CHL = log10(1.0 + CHL). (4)
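For concreteness, the detrending of Eq. (3) and the CHL transform of Eq. (4) can be sketched in a few lines (a minimal illustration; the function and variable names are ours, not from the paper):

```python
import numpy as np

def normalize_fco2(fco2, year, rate=1.5, ref_year=2005):
    """Eq. (3): remove the linear trend, shifting fCO2 to the reference year."""
    return fco2 - rate * (year - ref_year)

def transform_chl(chl):
    """Eq. (4): log transform of chlorophyll-a concentration."""
    return np.log10(1.0 + chl)
```

For example, a measurement of 380 µatm taken in 2010 normalizes to 372.5 µatm at the 2005 reference year.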

SVM
For a given dataset, the SVM requires a prior step to find the optimal value for the parameter σ in Eq. (A10) and the parameter γ in Eq. (A11). To shorten the training time, we randomly chose 10 % of the measurement data in this step and obtained 0.06 for σ and 10 for γ. Note that these values depend on data scaling, which is necessary in this case to avoid the overflow problem in solving Eq. (A18). We scaled all input variables LAT, SST, SSS, CHL, MLD, and dSST by their minimum and maximum to confine them to the range (0, 1), i.e.,

x′ = (x − xmin) / (xmax − xmin). (5)
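The min-max scaling of Eq. (5) can be sketched as follows (a minimal illustration; the name is ours, and in practice the minima and maxima would be taken from the training data and reused at prediction time):

```python
import numpy as np

def minmax_scale(x):
    """Eq. (5): scale a variable by its minimum and maximum into [0, 1]."""
    x = np.asarray(x, dtype=float)
    return (x - x.min()) / (x.max() - x.min())
```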

FNN
Data scaling is not necessary for the FNN, but can improve its performance. Following Zeng et al. (2014), we scaled the input variables by their mean and standard deviation as

x′ = (x − x̄) / SD(x). (6)

The output variable fCO2 is scaled by

fCO2′ = 0.1 + 0.8 (fCO2 − fCO2,min) / (fCO2,max − fCO2,min). (7)

This confines the scaled fCO2 to between 0.1 and 0.9 for a better response to changes in input variables. The kernel function Eq. (A4) has the property that for any input in (−∞, +∞), the output varies between 0 and 1. For scaled fCO2 close to 0 or 1, a small change in fCO2 requires a very large adjustment of model parameters, which slows down the convergence of training.
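The two scalings can be sketched as follows (a minimal illustration with our own names, assuming the output is mapped linearly onto (0.1, 0.9) as in Eq. (7)):

```python
import numpy as np

def standardize(x):
    """Eq. (6): scale an input variable by its mean and standard deviation."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / x.std()

def scale_output(y, lo=0.1, hi=0.9):
    """Eq. (7): map fCO2 linearly into (0.1, 0.9) before training."""
    y = np.asarray(y, dtype=float)
    return lo + (hi - lo) * (y - y.min()) / (y.max() - y.min())
```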
Table 1. Feature comparison of the three machine learning models.

Input space projection
- SVM: projects the input-variable space to a high-dimensional space whose size is proportional to the number of training samples.
- FNN: projects the input space to a high-dimensional space whose size is proportional to the number of hidden neurons and input variables.
- SOM: projects the input space to a feature space whose size is determined by the number of neurons.

Prediction by
- SVM: continuous interpolation.
- FNN: continuous interpolation.
- SOM: picking the labeling samples whose features are closest to the input.

Problems
- SVM: may over-fit and over-interpolate.
- FNN: may over-fit and over-interpolate.
- SOM: discrete interpolation leads to spatial discontinuity.

Data scaling
- SVM: helps in solving the linear equation, but has no effect on the result.
- FNN: helps the convergence of training, but has an insignificant effect on the result.
- SOM: has a significant effect on the result.

Results affected by
- SVM: the parameter values for regularization and the kernel function.
- FNN: the number of hidden neurons.
- SOM: the number of neurons and data scaling.
We used 64 hidden neurons for the FNN, as Zeng et al. (2014) did. The learning rate in Eq. (A6) was set to 0.25 by trial and error. A small value makes training slow, whereas a large value may make training diverge. The constant in Eq. (A8) was determined dynamically in each iterative training loop: it was taken as 10 times the mean of the absolute differences between modeled and observed fCO2. We found that this method improves the performance of training.

SOM
Data scaling is critical for the SOM, as the distance defined by Eq. (A1) would otherwise be affected by variable units. We used Eq. (6) to scale the input variables in training the SOM. Based on our preliminary correlation analysis, we applied a factor of 2 to enhance the influence of SST and CHL on the distance. Using such a subjective factor was the only way we found to bring the correlations between the output and the input variables more in line with observed correlations.
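The weighting can be folded into the distance of Eq. (A1); a minimal sketch, assuming the weights multiply the scaled variables inside the Euclidean distance and assuming a fixed input ordering (both assumptions are ours):

```python
import numpy as np

# Assumed ordering of the scaled inputs: LAT, SST, SSS, CHL, MLD, dSST.
WEIGHTS = np.array([1.0, 2.0, 1.0, 2.0, 1.0, 1.0])  # factor of 2 on SST and CHL

def weighted_distance(sample, neuron, weights=WEIGHTS):
    """Distance for the best-matching-unit search, with per-variable weights."""
    d = weights * (np.asarray(sample, dtype=float) - np.asarray(neuron, dtype=float))
    return float(np.sqrt(np.sum(d * d)))
```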
From the labeling procedure of the SOM described in the Appendix, it is not difficult to see that the number of neuron cells in the SOM affects the labeling and hence the prediction. Unfortunately, there is no guideline for choosing the size. Based on previous studies (Telszewski et al., 2009; Nakaoka et al., 2013), we used 20 000 neuron cells, roughly one neuron cell per 1° × 1° grid cell of the sampled areas.

Model validation
We examined the goodness of fit by randomly selecting 10 to 50 % of the data points to train the FNN and SVM and to label the SOM, and then calculated the correlation coefficient between modeled and observed fCO2 for the selected data points.
The SOM yields the best correlation in the case of 10 % of randomly selected data points, and the correlation decreases as the number of data points increases (Fig. 1). The reason is that, for a given number of neuron cells, the fewer the data points, the less likely a neuron cell is to be labeled by multiple measurements and the more likely the prediction is to find the same fCO2 value used for labeling. Therefore, the goodness of fit does not necessarily mean good SOM modeling.
The correlations obtained by the SVM and FNN do not vary much with the number of data points. While the SVM's correlation decreases monotonically, if only slightly, with the number of data points, the FNN's correlation obtained with 75 000 data points is larger than that with 60 000 data points. The FNN is known for not being able to find the global optimum in training, and this case could be an indication of imperfect training. The FNN appears inferior to the SVM in all cases. However, imperfect training does not account for all the differences. If we use the number of model parameters to be determined by training as an indicator of the dimension of the model space, the FNN's dimension is far smaller than the SVM's. The former is determined by the number of hidden neurons and input variables, whereas the latter is determined by the number of training data. For 6 input variables, 15 000 training data, and 64 hidden neurons, the number of parameters is 509 for the FNN and 15 001 for the SVM.
A better indicator of the performance of the models is the goodness of prediction. To emulate the situation in which the sampled area is only a small portion of the global ocean, we evaluated the goodness of prediction by training the FNN and SVM and labeling the SOM with 10 % of randomly selected data and making predictions for the rest of the data. Figure 2 shows that the SVM yielded the best correlation (R2 = 0.72), the FNN fell behind (R2 = 0.67), and the SOM performed the worst (R2 = 0.54). The differences between predicted and observed fCO2 are 0.1 ± 17.4 µatm for the SVM, 0.1 ± 18.9 µatm for the FNN, and 0.2 ± 23.3 µatm for the SOM. Compared to the variation of fCO2 measurements, these differences are small, and their uncertainties are on the same order of magnitude as the variation of the measurements. Consider the standard deviation (SD) of fCO2 in those grids with at least three data points: the track-gridded fCO2 in SOCAT version 3.0 has an SD ranging from 0.1 to 71.2 µatm, with a mean of 5.2 µatm. Calculating the SD of normalized fCO2 in the same grids and in the same months of all years yielded a mean of 12.5 µatm in the range of 0.1 to 103.1 µatm. The normalization had little effect on the SD, as the calculation for non-normalized fCO2 gives a mean SD of 14.6 µatm in the range of 0.1 to 107.5 µatm.
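The validation statistics used here (squared correlation and the mean ± SD of the prediction differences) can be computed as follows (a minimal sketch; the function name is ours):

```python
import numpy as np

def validation_stats(predicted, observed):
    """Squared correlation (R^2) and mean/SD of predicted-minus-observed fCO2."""
    p = np.asarray(predicted, dtype=float)
    o = np.asarray(observed, dtype=float)
    r = np.corrcoef(p, o)[0, 1]
    diff = p - o
    return r ** 2, diff.mean(), diff.std()
```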
From the algorithm of the SOM in the Appendix, it is not difficult to see that the SOM does not extrapolate: the model always approximates new inputs by the measurements used for training and approximates fCO2 by the measurements used for labeling; therefore, the predicted fCO2 values are within the observed fCO2 range (Fig. 2a). Figure 2c shows that the extrapolated fCO2 by the SVM, if any, did not exceed the observed range. To investigate the extrapolation risk, we used 200 000 data points randomly generated for SST, dSST, SSS, MLD, and CHL in the ranges of (0, 40 °C), (−20, 20 °C), (20, 50), (1, 1500 m), and (0, 2 log(mg m−3)), respectively. These ranges are larger than the corresponding observed ranges of (0, 34 °C), (−13, 16 °C), (24, 40), (1, 1000 m), and (0, 1.2 log(mg m−3)). The SVM and FNN produced fCO2 in the ranges of (267, 468 µatm) and (199, 596 µatm), respectively, for the simulated samples. Compared to the observed fCO2 range of (240, 560 µatm), the experiment indicates that the over-extrapolation risk of the SVM is low.
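The random-sampling experiment can be sketched as follows (a minimal illustration; the trained model that would supply the predictions is not shown, and `fraction_within_observed` is a hypothetical helper of ours):

```python
import numpy as np

rng = np.random.default_rng(0)
N = 200_000

# Simulation ranges from the text (deliberately wider than observed).
ranges = {
    "SST":  (0.0, 40.0),     # degC
    "dSST": (-20.0, 20.0),   # degC
    "SSS":  (20.0, 50.0),
    "MLD":  (1.0, 1500.0),   # m
    "CHL":  (0.0, 2.0),      # log10(mg m-3), already log-transformed
}
samples = {name: rng.uniform(lo, hi, N) for name, (lo, hi) in ranges.items()}

def fraction_within_observed(pred, lo=240.0, hi=560.0):
    """Fraction of predicted fCO2 values inside the observed range."""
    pred = np.asarray(pred, dtype=float)
    return float(np.mean((pred >= lo) & (pred <= hi)))
```

Feeding `samples` to a trained model and checking `fraction_within_observed` on the output reproduces the range comparison described above.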

Differences
Figure 3 shows fCO2 maps for February and July 2005, the reference year for normalization. In the mapping, we randomly selected 50 % of the data to train the FNN and SVM and to label the SOM. All models captured the major features of the observed fCO2 distribution. The SOM exhibits obvious discontinuity because of its discrete manner of picking fCO2 values from the labeled SOM. For the year 2005, the mean fCO2 difference is −0.05 ± 12.73 µatm for FNN−SVM and −0.6 ± 18.80 µatm for SOM−SVM. The uncertainty is the standard deviation of the mean difference between predicted and observed values. These statistics indicate that the FNN agrees better with the SVM than the SOM does.
Although the differences among models might be on the order of 10 to 20 µatm, the effect on the global ocean CO2 flux estimate is small (Fig. 4). The fluxes are calculated using the wind speed from ECMWF's interim product (Dee et al., 2011). Our estimate of the oceanic uptake is on the higher end among those in Wanninkhof et al. (2013) and Le Quéré et al. (2015). For example, Wanninkhof et al. (2013) reported that the median sea-air anthropogenic CO2 fluxes centered on the year 2000 ranged from 1.9 to 2.5 PgC yr−1 among seven models. In comparison, our estimates from the three models are about 2.4 PgC yr−1. The mean difference of CO2 flux is 0.02 PgC yr−1 between the FNN and the SVM (FNN−SVM) and 0.06 PgC yr−1 between the SOM and the SVM (SOM−SVM). These are small compared with the differences among the models in Wanninkhof et al. (2013) and Le Quéré et al. (2015). Note that the flux estimate is highly dependent on the wind product, as shown by Wanninkhof et al. (2013) and Zeng et al. (2014). On spatial scales of tens of degrees, the three models show good mutual agreement in the modeled fCO2 distributions. However, each model shows distinct fine structures, which are determined by the biogeochemical processes in the ocean, by the model parameters obtained from training, and by the characteristics of the models. We believe that the modeled monthly fCO2 distributions are true to the degree given by the model validations.

Summary
The main features of the three machine learning models are listed in Table 1. The SVM is recommended when the computer has enough memory to store the matrix in Eq. (A18), whose size is proportional to the square of the number of training data. The SVM performs the best, but the training time can become very long when the number of training data is too large to be handled without virtual memory. For any given dataset, using the SVM requires a prior step to find the optimal values for the parameter σ in Eq. (A10) and the parameter γ in Eq. (A11).
The FNN does not perform as well as the SVM, but the number of training data does not affect its training as much as the SVM's. The training time can become long when a large number of hidden neurons are used and many iterations are needed to achieve convergence; for a small number of data points, it takes longer to train the FNN than the SVM. The FNN is simpler to use, as it requires no prior step, but it carries the risk of over-extrapolation.
The SOM is recommended only when the other two models have over-fitting or over-interpolation problems. The SOM performs the worst and is not as straightforward to use as the others, as its result depends heavily on data scaling and the number of neurons. An advantage of the SOM is that, once trained, re-labeling it with new CO2 measurements and making a new prediction is fast. Although the SOM does not have the over-extrapolation problem of the FNN, it may produce nonsensical predictions because of its strong dependence on data scaling.
In areas where there were no measurements on a large scale, predictions made by the models must be treated conservatively, as the SVM and FNN may produce extrapolated results and the SOM may extract CO2 from unexpected provinces. Figure 3 shows that the modeled CO2 east of the African coast near the Equator in July 2005 appeared much higher than the nearby measurements, which were made in July 1995 and adjusted to 2005 using the global rate of 1.5 µatm yr−1. However, considering the large variations of the rate from region to region (Takahashi et al., 2014) and of the repeated measurements discussed in Sect. 5, the measurements were not sufficient to support rejecting the modeled CO2. Similar CO2 hotspots occurred in the Southern Ocean west of South America in February 2005, around the latitudinal zone of 50° S. The modeled CO2 distributions of Takahashi et al. (2014) also showed CO2 hotspots around the latitudinal zone of 30° S in the same month and region. Their model used a completely different interpolation scheme based on a diffusion-advection transport model for surface waters. In principle, this hotspot CO2 was produced by our models using measurements from elsewhere, where the biogeochemical properties were similar to those in the hotspot areas. As the SOM does not extrapolate, the SVM has a low possibility of over-extrapolation, and the hotspots appeared in all models, the risk of accepting them would not be high.

Figure 1. Correlation coefficient between modeled and observed fCO2 (µatm). The sample size is the number of data points randomly selected to train the FNN and SVM and to label the SOM.

Figure 2. Predicted vs. observed fCO2 (µatm). Ten percent of the data points were selected randomly to train the FNN and SVM and to label the SOM, and the rest were used for validation.

Figure 3. Distributions of modeled and observed fCO2. The composite map for observations includes fCO2 in 1990-2014. Half of the data points, selected randomly, were used to train the FNN and SVM and to label the SOM to make predictions. (a) shows February and (b) shows July.

Figure 4. Modeled global CO2 fluxes. A negative value indicates oceanic uptake.
Data availability. The software and data used in this study are available at https://figshare.com/s/38488b7003b03e2103c9.