Classification of the maturity stage of coffee cherries using comparative feature and machine learning

This work presents the use of multiple techniques (i.e., physicochemical and spectral) applied to harvested coffee cherries for the postharvest classification of the maturity stage. The moisture content (MC), total soluble solids (TSS), bulk density, fruits’ hardness, CIEL*a*b parameters and the dielectric spectroscopy methods were applied on coffee cherries at seven maturity stages. These maturity stages were assessed according to the days after flowering (DAF) and the physical appearance as traditionally performed by growers. An increase of the green-to-red ratio (i.e., a*) parameter was perceived, accompanied by a monotonic response for the hardness, TSS and bulk density with a maximum moisture content at stage 5. In the case of the dielectric spectroscopy technique, the loss parameter presented higher losses for unripe stages at the ionic conduction region. To compare the individual performance of each of the techniques, three machine learning methods were used: random forest (RF), support vector machine (SVM) and k-nearest neighbours (k-NN). The meta-parameters for these techniques were optimized for each case to achieve the best performance possible. Furthermore, as the dielectric response is of spectral nature, recursive feature selection was applied and the 500 MHz to 1.3 GHz frequency range selected for the task. The highest performance was obtained for the colorimetric (75.1%) and hardness (72.5%) responses, while the lowest was obtained for the moisture content (45.5%). The dielectric spectroscopy response presented a promising response (56.8%), that achieved a clear separation of unripe from ripe stages, except for stage 5 in which some of the samples were classified as stage 2. Most techniques studied are compatible with field conditions, and the dielectric technique shows potential to be transferred based on available software-radio defined platforms.


INTRODUCTION
The processing of coffee represents an important challenge to the industry as it considerably affects the quality and represents the highest costs to the producers (Arcila et al., 2007;Farah, 2012;Sunarharum;Williams;Smyth, 2014). Colombia has traditionally produced coffee by the wet method. In this method, once the cherries are hand-picked, they are pulped, fermented, and dried in the coffee farm or processing units. The dry parchment coffee beans are then hulled and roasted or exported. Most of these processing practices are performed by traditional artisan techniques.
One of the most essential stages is the hand-picking and sorting of the ripe coffee cherries. For instance, the Colombian coffee industry performs a selective and manual harvesting of ripe cherries to avoid affecting the sensory attributes of the beverage (Marín et al., 2004;Puerta-Quintero, 2000). As the maturity of the cherries if not achieved uniformly, several harvest sessions are required (Craig et al., 2015). The selection process is performed according to the appearance of the cherry and completely depends on skilful human labour. The coffee cherry harvesting represents around 40% of the production costs for Colombian coffee grower (Federación Nacional de Cafeteros -FNC, 2010).
The use of field-compatible approaches as the total soluble solids measured by the refractive index which correlates to the total soluble solids, quantitative measurement of colour, titratable acidity, and equatorial firmness are appealing approaches that might be closer to the growers' requirements (Arcila et al., 2007;Silva et al., 2014). These are variables that can increase or reduce with the cherry development; for example, the sugar content of the pulp increases with maturity and the pericarp softens (De Castro;Marraccini, 2006). One of the drawbacks of these techniques is that some of them are not batch compatible.
The dielectric spectroscopy technique at the radiofrequency and microwave range has been of interest for quality assessment applications of multiple crops (Sosa-Morales et al., 2010), it can process batches and simple samples, and it can be transferred to field compatible equipment. The complex permittivity of the samples is indicative of the molecular composition, and is represented by a complex number (Equation 1) that depends on the frequency and temperature (Franco et al., 2015).
(1) Gray Level Co-occurrence Matrix (GLCM) and k-nearest neighbor (KNN), and support vector machine (SVM) among others, and it was possible the development of a classification model for maturity prediction with 100% of accuracy and 0.0995 s training time. Classification of cherries regular and irregular shaped was developed (Momeny et al., 2020) employing HOG and LBP to extract the features of the images of cherries and KNN, ANN, Fuzzy and EDT algorithms to classify them. Results were compared with the Convolutional Neural Network (CNN) method, which was able to classify cherries with accuracy around 99% in all image sizes. The aim of this study was to evaluate the physical features of coffee cherries that could operate as global maturity classifiers, regardless of the cultivar. Different physicochemical approaches compatible with the coffee cherry and the dielectric spectroscopy technique were used for this purpose. Machine learning was employed to the feature's comparison and classification according to defined maturity stages: days after flowering and physical appearance of the cherries.

Coffee samples
Four coffee cultivars were used for this study: two from northern Huila (NH1 / NH2), one from southern Huila (SH1), and one from Caldas (CA), all coffee producing regions in Colombia. All coffee cherries were collected during the 2018 harvest, and their features are presented in Table 1.
Around 10 kg of fresh cherries collected at the plantations were stored at 8 °C and transported to the South Colombian Coffee Research Centre -CESURCAFE pilot plant during a period not superior to 6 -8 h. Consequently, the sample set was constituted by 84 coffee cherry samples: 7 maturity stages, 4 cultivars and 3 biological replicates. Stage 1 represents green unripe cherries, which appear 196 d after flowering (Marín et al., 2004). Stage 2 corresponds to green-yellow cherries (i.e., 203 d), Stages 3 and 4 to almost ripe or "pintón" cherries (i.e., 208-215 d. Stages 5 and 6 represent ripe cherries (i.e., 217-224 d) while Stage 7 (i.e., after 224 d) represented overripe fruits. The difference between stages 3 and 4 were assessed according to physical appearance, as traditionally performed by growers. The same strategy was considered for stages 5 and 6. All cultivars considered were red varietals.
The real part of  is the relative electrical permittivity or dielectric constant (ε'), and the imaginary part of is the dielectric loss factor (ε'') with j = 1  . The electrical permittivity represents the ability of the material to polarize and to store electric energy in response to an applied electric field, and the loss factor is associated with the dissipation of energy as heat (Sosa-Morales et al., 2017).
As the maturity of many fruits and vegetables consists on the formation of sugars or the increase or loss of bounded and unbounded water, the dielectric spectroscopy approach has been used to assess the maturity of different fruits (Castro-Giraldez et al., 2010;Guo et al., 2015). This method has been also considered to study the water features of green and roasted coffee (Berbert et al., 2008;Iaccheri et al., 2015).
Finally, the development of analytical based models has employed machine learning techniques for this purpose. As a considerable wavelength and frequency information is generated, the selection for critical features and qualitatively describe the reasons for this selection should be considered during the construction (Li et al., 2018). To allow efficient technological transfer, the meta-parameters considered for different machine learning methods have to be optimized and selected to produce the best model possible (Fashi;Naderloo;Javadikia, 2019). Furthermore, it is ideal not to rely on a single scheme but to explore as many techniques as possible as some techniques have better behavior than others for specific tasks (Yang et al., 2019). Finally, all these elements merge together in the construction of a pipelined structure where feature selection, dimension reduction, and meta-parameter optimization are combined to produce simple and accurate models (Piedad et al., 2018). This strategy can be combined to consider multiple factors or in this case processing stages that support growers' in the complicated task of providing quality consistence (Rungpichayapichet et al., 2016).
Successfully results in maturity prediction of fruits using machine learning have been previously reported. For papaya (Santos Pereira et al., 2018), combined digital image features and random forest to develop a model to the ripening prediction. Combination of hand-crafted image features with machine learning techniques allowed to reach higher accuracy, improving the classification models for papaya. On the other hand, the work by (Behera; Rath; Sethy, 2020) employed local binary pattern (LBP), histogram of oriented gradients (HOG),

Colorimetric and physical analysis
The colorimetric analysis was performed with a Konica Minolta CR-410 chroma-meter (New Jersey, USA) evaluating parameters as L*(lightness), a*(redness-greenness) and b*(yellowness-blueness). Analysis were performed in triplicate form, recording 9 measurements for each sample. Colour difference was estimated using Equation 2.
were performed before the tests to verify the calibration and stability of the equipment.
Furthermore, after calibration, the probe and cable were not manipulated to avoid interferences. The 85070 v.E06.01.36 software (Agilent Technologies, Malaysia) controlled the network analyser to perform the frequency sweep and measure the complex reflection coefficient of the sample around the probe tip, which was then converted to the complex permittivity. The measurements were performed from 0.5 to 20 GHz. For each measurement, 1001 points were acquired with an IF-bandwidth of 10 kHz.
A mass of 40 g of coffee cherries stored at 4 °C for no longer than 1-2 weeks was mashed with a ceramic mortar until a homogenous paste was obtained. The coffee cherries paste was poured into a beaker and carefully placed on and around the probe tip, to avoid air bubbles around the sensing zone. All measurements were performed at 20 ºC and a relative humidity of 65%.
After each measurement, the probe was cleaned with distilled water and dried with soft paper. All the dielectric properties determinations were carried out in independent triplicates. Average values and standard deviations were calculated.

Data analysis
The first part of the study consisted in the evaluation of the seven maturity stages. Accordingly, a random effect multilevel model was considered. Maturity was considered as the fixed effect, and the variety was considered as the random effect (random intercept and slope). Then, Tukey pairwise comparisons were performed to identify the groups that differentiated the stages. For the dielectric spectroscopy response, the analysis was split in two parts. The first portion consisted in the identification of the most critical features. For this purpose, the technique of extremely randomized trees was used, in tandem with recursive variable elimination. In the first run, five-fold cross validation was performed to determine the number of trees that resulted in the highest maturity stage classification accuracy. Then, thirty random seeds were generated, and the process of recursive feature elimination performed for each seed. Interactions across some of the frequencies were also considered.
The remaining features were scaled and analysed with principal component analysis (PCA), because of the high collinearity across frequencies, and the first 50 components were obtained. The traditional SVD decomposition was used for this purpose, and the biplot of the scores and loadings presented.
The third stage consisted in the development of the classification models. In the case of the physicochemical attributes, each technique was individually considered. (2) All physical analyses were performed in triplicate. Total soluble solids as percentage were determined by an Atago PR 201 portable digital refractometer (Tokyo, Japan) and are presented as °Brix. Total soluble solids for Stages 1 and 2 could not be obtained as mucilage could not be extracted. The moisture content for the samples was calculated according to the wet basis gravimetrical method based on the ISO 6673 standard (Adnan et al., 2017). Empty aluminium tins to contain the samples were weighed with a 4-digit precision scale (i.e., WT). Then, a 5 g coffee cherry sample was deposited into the tin to obtain the wet weight (i.e., WW+WT). These full tins were heated inside a Memmert 55 oven (Schwabach, Germany) set at 105 °C, and the samples were dried for 24 h. After removing the samples from the oven, the tins were weighed again to obtain the dry weight (i.e., DW+WT). The moisture content of each sample was computed according to Equation 3 (Adnan et al., 2017).
The traditional weight over volume method was used to calculate the bulk density. The samples were deposited in a beaker. The volume that the samples occupied was measured, and the weight of the samples was recorded. The bulk density was calculated as the division of the weight of the coffee cherries by their volume.
The hardness measurements were completed with a Brookfield CT3 texture analyser (Pennsylvania, USA) equipped with a 50 kg load cell and a TA25/1000 cylindrical plane plunger. The plunger was set to advance at a velocity of 0.8 mm•s -1 until the pulp failed.

Dielectric spectroscopy characterization
The dielectric spectroscopy characterization was carried out using a Keysight 85070E open circuit coaxial probe (Santa Rosa, USA). Before performing the measurements, the vector network analyser was warmed up for at least 90 min and subsequently calibrated using the standard given by the equipment: air -open circuit, short block and load -distilled water. Three to four measurements Three types of classification techniques were considered: random forests, support vector machine (SVM) and k nearest neighbours (k-NN). The data was partitioned into two sets: 56 samples for training and 28 for validation, stratifying the observations by the interaction between maturity and variety. This partition was randomly performed 30 times, using the seeds previously computed, to obtain the statistics of the classification accuracy and confusion matrix results. This recalls the bootstrapping technique.
The hyper-parameters for each of the techniques were obtained by five-fold cross validation. The parameters to be calculated for each method were: number of trees for random forests, the sparsity regularization parameter and kernel function for SVM, and the number of neighbours used for k-NN. Then, each resulting model was evaluated with the validation set, and the classification accuracy and confusion matrix recorded. The latter were then averaged, and these correspond to the reported values.
In the case of the dielectric spectroscopy technique, an additional layer was considered in the pipelined classifier: the PCA components as exogenous variables. Cross validation was also considered to select the optimal number of components for each technique. Once selected, the process is identical to the approach used for the physicochemical features. As the dielectric spectroscopy response generates three different features (i.e., dielectric permittivity, loss factor and loss tangent), the feature with the best accuracy was selected. Multi-level random effects models were calculated with the lm4 and lsmeans R packages version 3.3.6. The machine learning models were implemented with the scikit-learn package in Anaconda with Python 3.7.

Colorimetric response
As the results for L* (lightness) and b*(yellownessblueness) are colinear, the plot between a* and b* are presented in Figure 1.
This plot presents an adequate distribution for each maturity stage as all varieties were red, and the only sample set that presented undesirable behaviour is the NH2-Stage 2 set, which recalls the NH2-Stage 3 set. Yet, having four varieties accounts for possible variations across samples and the biological samples present the consistency required. The dominating parameter for this experiment was a*. Although the presence of yellow during stages 2 to 4 dominated the variation on b* (yellowness-blueness), which slightly changed with the proportion of yellow in the cherries. The parameter a*(rednessgreenness) changed from green to red with the progress of maturity. The lightness L* depicted a monotonically response, decreasing with the ripening process (Velásquez et al., 2019).

Physical analyses: the effects of maturity
The results for the features considered are presented in Table 2. As can be seen, all features were distributed between three to five groups. The moisture content and bulk density values agree with those reported in the literature (Aristizábal-Torres et al., 2012;Farah, 2012).
The hardness and TSS values variations are related to the natural ripening process of fruits, which resulted in fruit softening and increase of sugar content other compounds water-soluble (Bashir;Abu-Goukh, 2003). Identical behaviour for hardness and TSS was previously reported as well (Marín-López et al., 2003). Due to the spread response of the original values, a natural logarithm transformation was performed on the data for modelling. This feature and the total soluble solids (TSS) were distributed in five groups.
In the case of the moisture content, stage 1 presented a value close to 70.5% and decreased in stages 2 and 3. Then, the moisture reached a maximum at stage 5 which corresponds with stage 6 to the ripe stages. Finally, it decreased at stage 7 due to pericarp absorption. Similar results were previously reported (Marín-López et al., 2003) for coffee fruits var. Colombia red, presenting the highest moisture content at stage 5 and then diminishing until stage 7.
In the case of bulk density, the higher bulk density values were seen for fully and green-yellow immature stages, and the lowest for mature stages. As can be seen in Table 2, this parameter increased with moisture content. This behaviour was previously reported (Chandrasekar;Viswanathan, 1999) coffee beans arabica and robusta varieties. Figure 2 presents the dielectric permittivity response for all maturity stages and the frequency range between 0.5 GHz and 20 GHz. The x-axis, which represents the frequency, is in logarithmic scale.

Permittivity and maturity
The response varies from 60 down to 20 for the complete frequency range. The relative dielectric permittivity presented no statistical difference across the maturity stages. The lowest dielectric permittivity was seen for stage 7. Figure 3 presents the loss factor response.  Opposed to the dielectric permittivity, the loss factor presented a difference across maturity stages, especially for stage 1 (green unripe). A second group was seen for stages 2 to 4, and a third for stages 5 to 7. The frequency regions where this difference was more evident was between 0.5 MHz and 2 GHz, that corresponds to the ionic conduction region (Keysight Technologies, 2015). The relaxation frequency, which was close to 15 GHz for all stages, presented no difference across stages. The dipolar rotation region, which is mainly dominated by the presence of water, is not the predominating phenomena in the discrimination of the maturity stage. The response is typical of a salt-mineral solution (Franco et al., 2015), which might be indicative that the maturity in coffee responds to the mineral content of the pulp rather than to its water content (Farah, 2012).
Finally, the loss tangent is presented in Figure 4. As with the loss factor, the main separation across maturity stages was perceived between 0.5 and 2 GHz.

Feature selection and principal component analysis (PCA)
The first cross validation showed that the optimal number of design trees was 300. After randomly running the recursive elimination scheme, 42 out of the 1001 frequencies were included in all models for maturity classification: frequencies between 0.5 GHz to 1.3 GHz. When running the PCA algorithm with these 42 frequencies, 948.5 MHz was the highest PC1 loading frequency. This first component represented almost 98% of the variability of the data. After analysing the remaining frequency ranges: 2 -10 GHz and 10 GHz to 20 GHz, the highest ranked in these ranges were frequencies between 2 to 3 GHz and those superior to 19 GHz. The interactions between these ranges and 948.5 MHz were included in the final model. A total of 62 features were used for the final principal component analysis. The bi-plot, which presents both the scores and loadings plot is presented in Figure 5.
The first component represents 97% of the variance while the second represents 1.6%. The high variance represented in component 1 shows that the high collinearity between frequencies. Yet, the discrimination capacity of the technique to classify the maturity stages is good although only 6% of the features are retained. The interactions mostly contributed in the direction of component 2. The first component was related to the maturity stage, while the second component presented a slight discrimination across varietals.
The highest dispersion was perceived for stage 1 and stage 2, as with the colorimetric analysis. Positive values of the first component are indicative of green underripe to almost ripe cherries, while negative values indicate ripe to overripe cherries. Samples below 0 for this component are related to good quality attributes (Puerta-Quintero, 2000). Figure 6 presents the loss tangent values at 948.5 MHz and the groups according to the random effects model. The maturity stages were separated in five different groups at this frequency, with the lowest values for stages 5 to 7 and the higher values for stages 1 to 4, being stage 1 the stage with the highest value. Consequently, Figures 3 to 6 depict the potential that the dielectric spectroscopy technique has for the classification of the maturity stage of coffee cherries.

Classification models
According to the pipeline structure presented in section 2.6, Table 3 presents the general results for the techniques used and all the dielectric properties.
The best classification results for the dielectric spectroscopy technique were obtained for the SVM (support vector machine) model, that used the first five components of the PCA analysis and used the radial basis kernel with regularization parameter equivalent to 10. Random Forest models present overfitting of the validation sets, and k-NN presented intermediate performance. Consequently, the SVM models were considered and the dielectric spectroscopy response was represented by the loss tangent. Table 4 presents the accuracy obtained for each of the measurements.   The highest ranked techniques were the hardness and colorimetric response. For the latter, the result is valid while analysed cultivars are red at ripeness. Yellow, orange or pink varieties might result in a lower global classification accuracy. These two techniques are followed by the loss tangent, which presents an accuracy close to 60 %: The lowest ranked techniques are the total soluble content and the moisture content. However, the normalized confusion matrix is more indicative of the real accuracy at each stage. Figure 7 presents these results, were the columns represent the actual stage, and the rows represent the stage predicted by the model. The desired response for an ideal model is an identity matrix. False positives are represented in values outside the diagonal of the rows. In the case of the dielectric spectroscopy, the highest true positive values were seen for stages 1, 5 and 6. In particular, the model for stage 1 did not present false positives from stages 5, 6, and 7. Hence, no underripe cherries will be considered as ripe. Stages 2 and 3 present a performance decrease compared to stage 1. The lowest performance samples corresponded to stages 4 and 5. In general, stage 4 is confused with stages 2,3 and 5. In the case of stage 5, samples are mostly confused with stages 4, 5 and 2. It is important to recall that this is independent of the cultivar.
In the case of colour, some of the stage 5 samples were classified as stage 1. The best accuracy was achieved for stage 4. Stage 5 was misclassified at every stage, mostly for stages 4 and 5. Colour is definitely an important variable for maturity assessment, but is important to consider other alternatives (Arcila et al., 2007).
The moisture content is the technique that presented the weakest performance. Yet, the stage that had the best performance was stage 5. Hence, the separation of stage 5 could be achieved by threshold verification. Nevertheless, performing moisture content measurements in the field is not an easy task.
The lack of measurements in stages 1 and 2 might affect the accuracy of the total soluble solid model. The remaining stages present adequate classification values for stages 3, 5 and 6. Stages 6 and 7 present similar performances and are eventually confused. Finally, the hardness presents the best performance. All stages are misclassified with at most 2 of their neighbour stages. Stages 1 and 7 presented the best performance.

CONCLUSIONS
Global maturity classifiers for coffee cherries were proposed. The validity of the sample collection was verified with the colorimetric analysis, which presented the highest variance for stages 1 and 2 and a reduction of the variance as the maturity advances. Cultivars with other colours should be considered to evaluate the consistency of the technique.
The physicochemical features associated to the maturity stages agree with previously reported results. Three groups were obtained for all considered technologies due to the most attributes presented a monotonic response. Even though the moisture content presented a maximum at stage 5, threshold strategies could be verified for this approach.
The dielectric spectroscopy technique presents an important potential for this application. Although the dielectric permittivity did not present a difference across the maturity stages, the loss factor and loss tangent did. The difference was mainly evidenced at the ionic conduction region. The loss tangent should be employed as dielectric spectroscopy marker at frequencies from 300 MHz to 1.3 GHz. The higher frequency interactions contributed to the separation across the second PCA component, which was mostly related to the cultivar, than to the first that responded to the maturity stage. This technique has potential for performing singular and mass (i.e., group of cherries) measurements. Furthermore, this could be transferred to simpler devices that can be carried to the field for cherry sorting considering that planar technologies (i.e., antennas and filters) and current software defined platforms operate well at the selected frequencies. This could be further extended to implement RF imaging equipment for online automatic sorting. From the three machine learning techniques, the best results were obtained for SVM. The training and test accuracy sets agreed well for this technique. As seen in the confusion matrices, the dielectric spectroscopy approach performed well for stages 1, 6 and 7, and reduced performance for stages 3 and 5. According to this, the technique could be used to predict the moisture content.
Although techniques such as the colorimetric analysis and hardness presented reliable performance, the analysis must be performed on a unique cherry basis or require high accuracy that cannot be easily transferred to portable field compatible technology.

ACKNOWLEDGMENTS
Physicochemical analyses of coffee and sample collection were performed by members of the Cesurcafé research centre at Universidad Surcolombiana. Dielectric properties measurements were performed by William Romero at Universidad de los Andes.