From visual estimates to fully automated sensor-based measurements of plant disease severity: status and challenges for improving accuracy

The severity of plant diseases, traditionally defined as the proportion of plant tissue exhibiting symptoms, is a key quantitative variable for many diseases, and its quantification is prone to error. Good quality disease severity data should be accurate (close to the true value). The earliest quantification of disease severity was by visual estimates. Sensor-based image analysis, including visible spectrum, hyperspectral and multispectral sensors, is an established technology that promises to substitute for, or complement, visual ratings. Indeed, these technologies have measured disease severity accurately under controlled conditions but are yet to demonstrate their full potential for accurate measurement under field conditions. Sensor technology is advancing rapidly, and artificial intelligence may help overcome issues for automating severity measurement under hyper-variable field conditions. The adoption of appropriate scales, training, instruction and aids (standard area diagrams) has contributed to improved accuracy of visual estimates. The apogee of accuracy for visual estimation is likely being approached, and any remaining increases in accuracy are likely to be small. Due to automation and rapidity, sensor-based measurement offers potential advantages compared with visual estimates, but the latter will remain important for years to come. Mobile, automated sensor-based systems will become increasingly common in controlled conditions and, eventually, in the field for measuring plant disease severity for the purpose of research and decision making.


Background
Plant disease epidemics impact agriculture and forestry by reducing the quantity and quality of the product, and pose a threat to food security and food safety (Strange and Scott 2005; Oerke 2006; Madden et al. 2007; Savary et al. 2012, 2017). Knowledge of the quantity of disease is fundamental to a) determine crop losses; b) conduct disease surveys; c) establish thresholds for decision-making; d) improve knowledge of disease epidemiology; and e) evaluate the effect of treatments (e.g. cultivar, fungicides, etc.). Plant disease intensity (a generic term) can be expressed by incidence or severity at the field/plot scale and below. Incidence is the proportion of the plant units that are diseased in a defined population or sample, while severity is the proportion of the plant unit exhibiting visible disease symptoms, usually expressed as a percentage (Madden et al. 2007). Symptoms of disease on a plant may change in size, shape and color. Disease severity is often the variable that is of most importance or interest in a particular experimental situation (Paul et al. 2005). Quantification of disease severity caused by biotic agents is the focus of this article.
Visual estimation is the action of assigning a value to severity of symptoms perceived by the human eye. A sensor or instrument directly or indirectly measures the amount of disease or stress signal based on remote sensing (Nilsson 1995; Bock et al. 2010a). Thus, an image can be captured in the visible spectrum (VIS) and processed using image analysis (Bock et al. 2010a; Bock and Nutter Jr 2011; Barbedo 2013, 2016a). The amount of disease can also be measured by image capture in the non-VIS spectral range, including by hyperspectral and multispectral imaging (HSI and MSI), and chlorophyll fluorescence or other methods. The latter methods are conceptually different from estimation or measurement of disease severity based on visible symptoms or the visible spectrum alone (Mahlein et al. 2012a; Mutka and Bart 2015; Simko et al. 2017; Kuska and Mahlein 2018; Mahlein et al. 2018). Visual estimates are based only on the perception of wavelengths of the electromagnetic spectrum in the VIS range (380 to 750 nm), while HSI and MSI systems use wavelengths in the range 250 to 2500 nm (Fig. 1). In general, only part of this range is chosen (usually the near-infrared (NIR) and infrared (IR) bands); no single system covers the entire range. Raters perceive and learn to discriminate symptomatic from asymptomatic tissue in order to estimate percent diseased tissue. VIS spectrum image analysis bases measurement on the number of pixels that conform to pre-defined properties of pixels representing a diseased state vs. healthy state, which are identified using a range of statistical procedures. HSI and MSI systems measure signature wavelengths associated with the diseased state. Image acquisition and analysis have additional challenges but also advantages over visual estimates (Mahlein 2016).
Similar to visual ratings, image-based systems, depending on the objective, should: (i) detect disease or other stress as early as possible, (ii) differentiate among biotic diseases, (iii) differentiate biotic from abiotic stresses, and (iv) quantify disease severity accurately.
Information on disease severity is needed at various spatial scales from the microscopic to plant organs, whole plants, plots, fields or regions, so scalability is an important criterion to take into account when choosing an assessment method. Furthermore, assessment of severity is needed to complement genomics-scale data and provide timely, appropriate and correct measurements to fulfil the needs of 'phenomics' in plant breeding (Mutka and Bart 2015; Simko et al. 2017). High throughput is an important consideration in the era of phenomics, affecting progress and resource use efficiency.
Optical sensors perform non-invasively and have been developed and used to support disease detection, classification and severity measurement. Precision agriculture and plant phenotyping for resistance breeding already benefit from these technologies (Fiorani and Schurr 2013; Kruse et al. 2014; Stewart et al. 2016; Mahlein et al. 2018). Although other sensor-based methods of disease or pathogen quantification exist (thermal imaging, chlorophyll fluorescence and molecular or serological approaches), the reader is recommended to seek out recent publications on these topics elsewhere (Sankaran et al. 2010; Mutka and Bart 2015; Mahlein 2016). This review will focus primarily on the status and use of visual estimation, VIS spectrum and HSI image analysis as methods to quantify disease severity, paying particular attention to recent developments and challenges to improve accuracy and reliability of the estimates and measurements.
Fig. 1 The electromagnetic spectrum showing wavelengths and frequencies illustrating the visible (VIS) range of light (specifically RGB) and the hyperspectral range used for disease severity estimation and measurement

Terms, concepts and the importance of accurate plant disease severity quantification
An accurate estimate or measurement is one that is close to the actual or true value, or 'gold standard' (Nutter Jr et al. 1991; Madden et al. 2007; Bock et al. 2010a; Bock et al. 2016a). In remote sensing, the actual or true values are referred to as 'ground truth' data. Biased estimates or measurements are those that deviate from the actual value. Two biases exist: systematic bias (over or underestimation which is related to the magnitude of the actual value) and constant bias (overall tendency to over or underestimate). Precision is the variability of estimates; in disease severity estimation or measurement, however, accuracy requires both precision and closeness to the true value (Madden et al. 2007). By definition, consistently accurate estimates must be reliable, where reliability is the tendency for repeated estimates or measurements of the same specimen(s) to be close to one another (Nutter Jr et al. 1991; Madden et al. 2007). Reliability can be described as inter-rater (or method, e.g. various imaging methods) reliability or intra-rater (or method) reliability. Reliability may be less of an issue when measuring disease under controlled conditions using devices like VIS image analysis or HSI compared to estimates by different visual raters, or measurements under field conditions. Accurate measurements or estimates of severity are important: they ensure that treatment effects are correctly analyzed, yield loss relationships are understood, surveys are meaningful, and germplasm phenotypes are rated appropriately. Furthermore, severity data might be used as a decision threshold or for disease forecasting purposes, and thus to determine the need to spray (or not). Inaccuracy can hamper the research process, waste resources, and could impact grower profitability. The required level of accuracy may vary among situations.
Several empirical and simulation-based studies have demonstrated that disease assessment can result in a type II error (a false negative) (Christ 1991; Newton and Hackett 1994; Parker et al. 1995a; Bock et al. 2010b; Chiang et al. 2014; Chiang et al. 2017a, 2017b). A type I error (a false positive) could be as damaging, although this has not been found in disease assessment studies. Accurate estimates or measurements will minimize these two errors.
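As a rough illustration of the two kinds of bias defined above, the sketch below regresses estimates on actual values by ordinary least squares. Under a common reading, an intercept away from 0 points to constant bias and a slope away from 1 points to systematic bias. The function name `bias_components` and the simulated raters are hypothetical and are not taken from the cited studies.

```python
from statistics import mean


def bias_components(actual, estimated):
    """Fit estimated = intercept + slope * actual by ordinary least squares."""
    if len(actual) != len(estimated) or len(actual) < 2:
        raise ValueError("need two equal-length sequences with at least 2 values")
    mean_x, mean_y = mean(actual), mean(estimated)
    sxx = sum((x - mean_x) ** 2 for x in actual)
    if sxx == 0:
        raise ValueError("actual values must not all be identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(actual, estimated))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical actual severities (%) and two simulated raters (assumed data).
actual = [5.0, 10.0, 20.0, 40.0, 80.0]
constant_rater = [x + 4.0 for x in actual]    # overestimates by 4 points everywhere
systematic_rater = [1.2 * x for x in actual]  # error grows with actual severity

for name, estimates in [("constant", constant_rater), ("systematic", systematic_rater)]:
    intercept, slope = bias_components(actual, estimates)
    print(f"{name}: intercept={intercept:.2f}, slope={slope:.2f}")
```

In practice the intercept and slope would be tested against 0 and 1, respectively, rather than read off directly.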

Visual estimation of disease severity
The evolving status of visual estimates has been punctuated by various reviews and book chapters (Anon 1947; Chester 1950; Large 1966; James 1974; Horsfall and Cowling 1978, Chapter 6; Kranz 1988, Chapter 3; Campbell and Madden 1990, Chapter 6; Chaube and Singh 1991, Chapter 9; Nilsson 1995; Cooke 2006, Chapter 2; Madden et al. 2007, Chapter 2; Bock et al. 2010a). Since 2010 there have been only two reviews, one relating to the issue of accuracy, and the other providing a summary of the development and validation of standard area diagrams (SADs, Del Ponte et al. 2017).

Methods of visual estimation and nature of the data
Visual estimates of disease severity are based on various kinds of scales typical of measurement science (Stevens 1946; Baird and Norma 1978). Of the four main scale types, only interval scales are not represented in plant disease severity estimation as they lack a true zero (it is not possible to estimate less than zero disease). Disease severity has been assessed using nominal, ordinal and ratio scales. Their perceived utility, advantages and disadvantages are as follows:
Nominal scales These qualitative (descriptive) scales have been defined and described (Newell and Tysdal 1945; Campbell and Madden 1990; Madden et al. 2007; Bock et al. 2010a; Bock et al. 2016a). Nominal scales are based on brief descriptions such as "no disease", "mild disease", "moderate disease" and "severe disease", or symbols "-" (healthy), "+", "++" and "+++" (various levels of severity). Nominal scales are subjective, and may vary by rater and assessment time. The data may be analyzed using statistical methods based on rank or frequencies.
Ordinal scales (quantitative and qualitative) There remains a lack of clarity on what in this review is termed a 'quantitative ordinal scale', which has a set number of classes describing numeric intervals between 0 and 100%. These have been termed interval scales (Nutter Jr and Esker 2006; Bock et al. 2009a), ordinal scales (Hartung and Piepho 2007), category scales (Chiang et al. 2014) and quantitative ordinal scales in the literature. The American Phytopathological Society in its instructions to authors considers them an ordinal scale (Anon 2020). Qualitative ordinal scales have a clear and significant order of values, but the numeric magnitude of the differences between each class is unknown (for example, the Likert scale, Likert 1932). Quantitative ordinal scales have a clear and significant order of values, and the magnitude of each ordered number is numerically bounded by a specified range.
Qualitative ordinal scales are valuable for comparing severity of some diseases that do not have easily quantified symptoms. Many viral diseases, other systemic diseases and root diseases may fall into this category, for example, cassava mosaic disease (Hahn et al. 1980) and huanglongbing of citrus (Gottwald et al. 2007). These rank data are based on discrete descriptions of symptom types and progression that is almost certainly not linear. It is not statistically appropriate to take means or use midpoints of these scales (Stevens 1946), as the mid-point and mean have little biological relation and violate assumptions of parametric tests. An index based on class frequencies can be calculated for qualitative ordinal scales, which may then be analyzed using parametric statistics, or they can be analyzed using non-parametric statistics suitable for various experiment designs and distribution functions (Shah and Madden 2004; Fu et al. 2012).
Quantitative ordinal scales may have equal or unequal intervals (Horsfall and Heuberger 1942; Horsfall and Barratt 1945; Hunter and Roberts 1978). The Horsfall-Barratt scale (HB, Horsfall and Barratt 1945) has been widely used (Table 1; Haynes et al. 2002; Miyasaka et al. 2012; Jones and Stansly 2014; Rioux et al. 2017; Kutcher et al. 2018; Strayer-Scherer et al. 2018). The US Forestry Service uses it to assess ozone injury (https://www.nrs.fs.fed.us/fia/topics/ozone/methods/). However, it is based on a questionable reading of the Weber-Fechner law (Nutter Jr and Esker 2006), and raters estimate better within the broad categories in the middle of the scale than the scale structure implies (Forbes and Korva 1994; Nutter Jr and Esker 2006; Bock et al. 2009b). Inappropriate scale structure is illustrated in results of studies in plant breeding (Xie et al. 2012). An improved quantitative ordinal scale has been developed that provides a lower risk of type II error, which is recommended where an ordinal scale is required (Chiang et al. 2014) (Table 2). Analysis of quantitative ordinal scales may be through mid-point conversion (mid-point of the percent interval, not mid-point of the scale itself) and subsequent parametric analysis, or as described above for qualitative ordinal scales, or using a proportional odds model.
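The mid-point conversion mentioned above can be sketched as follows. The class boundaries are the Horsfall-Barratt grades as they are commonly tabulated; they are entered here as an assumption and should be checked against Table 1 before use. The names `HB_INTERVALS`, `hb_midpoint` and `mean_midpoint_severity` are hypothetical.

```python
# Assumed Horsfall-Barratt grades: grade -> (lower %, upper %) of the interval.
HB_INTERVALS = {
    1: (0, 0), 2: (0, 3), 3: (3, 6), 4: (6, 12), 5: (12, 25), 6: (25, 50),
    7: (50, 75), 8: (75, 88), 9: (88, 94), 10: (94, 97), 11: (97, 100), 12: (100, 100),
}


def hb_midpoint(grade):
    """Return the mid-point (%) of the percent interval for one grade."""
    low, high = HB_INTERVALS[grade]
    return (low + high) / 2


def mean_midpoint_severity(grades):
    """Average the interval mid-points, not the grade numbers themselves."""
    if not grades:
        raise ValueError("no grades supplied")
    return sum(hb_midpoint(g) for g in grades) / len(grades)


# Hypothetical ratings of five leaves.
ratings = [2, 4, 4, 6, 7]
print(mean_midpoint_severity(ratings))
```

Averaging the grade numbers directly would treat the unequal intervals as if they were equally spaced, which is why the conversion is applied to the percent interval.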
The frequency of ordinal scores may be used to obtain a disease severity index (DSI) (Chester 1950). Disease severity is estimated on the specimens by a rater using the scale and is used to determine the DSI (%) = [sum (class frequency × score of rating class)] / [(total number of plants) × (maximal disease index)] × 100 (Chester 1950; Hunter and Roberts 1978; Chaube and Singh 1991; Kora et al. 2005; Vieira et al. 2012). Although a relationship may exist between true severity and a severity index, they are intrinsically different and should not be used interchangeably. Recent studies by Chiang et al. (2017a, 2017b) indicate that the DSI can be particularly prone to overestimation when using the above formula if the midpoint values of the rating class are not considered.
Very few studies have addressed resource use efficiency in visual disease assessment: how to minimize the risk of a type II error while optimizing use of specimen numbers and assessment method (Chiang et al. 2016b). The results of that study indicated that choice of assessment method, optimizing specimen numbers and number of replicate estimates while using a balanced experimental design are important criteria to consider for maximizing the power of hypothesis tests.
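The DSI formula given earlier in this section can be written out directly. This is a minimal sketch; the function name `disease_severity_index`, the 0 to 3 scale and the plant counts are hypothetical.

```python
def disease_severity_index(class_counts, max_score):
    """DSI (%) = sum(frequency * score) / (total plants * maximal score) * 100.

    class_counts maps each rating-class score to the number of plants in it.
    max_score is the highest score on the scale, not the highest score observed.
    """
    total_plants = sum(class_counts.values())
    if total_plants == 0:
        raise ValueError("no plants were rated")
    if max_score <= 0:
        raise ValueError("max_score must be positive")
    weighted_sum = sum(score * count for score, count in class_counts.items())
    return weighted_sum / (total_plants * max_score) * 100


# Hypothetical survey: 20 plants rated on a 0-3 ordinal scale.
counts = {0: 10, 1: 5, 2: 3, 3: 2}
print(round(disease_severity_index(counts, max_score=3), 2))
```

The result is an index on the score scale, so it should not be read as percent diseased area, in line with the caution above about using a severity index and true severity interchangeably.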

Sources of error
Rater variation The earliest study to clearly demonstrate rater variability was that of Nutter Jr et al. (1993), although Sherwood et al. (1983) demonstrated rater effects in their study comparing rater estimates of disease caused by Stagonospora arenaria on leaves of Dactylis glomerata. Bock et al. (2009a) described rater variability for 28 different raters assessing symptoms of citrus canker on leaves of grapefruit. Some individuals are innately accurate, yet others are inaccurate. Individual raters tend to over or under-estimate and this may extend over the whole scale, or the rater may have variable tendencies over the range of the percentage scale (Hau et al. 1989; Nita et al. 2003; Godoy et al. 2006; Bock et al. 2009a; Bardsley and Ngugi 2013; Yadav et al. 2013; Schwanck and Del Ponte 2014). Where rater bias is concerned, type II error can be exacerbated using quantitative ordinal scales.
Some rater-related characteristics may be associated with cognitive type, gender or other psychological traits, but this is yet to be explored in severity estimation. Inter-rater variability may be problematic although no studies have investigated the impact of different raters in an experiment. Minimizing the number of raters on a specific experiment will help remove potential variability from the data; or deploying raters by block or replicate will help minimize effects of individual raters.
Responses to disease characteristics A common tendency is to overestimate at low disease severities, which is particularly sensitive to the number of lesions and lesion size: the more lesions there are, the greater the tendency to overestimate (Sherwood et al. 1983; Forbes and Jeger 1987; Bock et al. 2008b).
Host organ characteristics Forbes and Jeger (1987) found that visual assessments of severity on simulated root structures were overestimated. Other organ types were not notably different in terms of accuracy (stems, leaves (various types), panicles, pods, tubers, heads and roots). But few studies have investigated the effect of organ type. Studies on the development and validation of SADs may be useful in this regard, but most diagrams have been developed for foliar diseases (Del Ponte et al. 2017).
Other factors Rating environment: does a rater perform more accurately under certain conditions? What is the effect of noise, heat, exhaustion or time allotted for an assessment? Fast assessments are not necessarily less precise (Parker et al. 1995a). Color blindness may impact disease severity estimation of some pathosystems (Nilsson 1995).

Methods to improve accuracy of estimates
Standard area diagrams (SADs) SADs are a simple and widely used tool to improve accuracy of rater estimates (Fig. 2). The diagrams developed by Cobb (1892) are the oldest assessment aid. James (1971) subsequently developed SADs for several crops. During the last 25 years, research on SAD development and validation has intensified, further demonstrating the value of SADs for improving accuracy (Del Ponte et al. 2017). Gains using SADs are variable among raters and across pathosystems (Spolti et al. 2011; Yadav et al. 2013; Schwanck and Del Ponte 2014), and are generally greatest for those raters who are least accurate (Yadav et al. 2013; Braido et al. 2014; González-Domínguez et al. 2014; Debona et al. 2015; Duan et al. 2015). Increase (Δ) in agreement (based on Lin's concordance correlation, ρc) may range from Δ > 0.4 for inexperienced raters, to Δ ~ 0, or possibly a slight loss in agreement for innately accurate raters. Overall, the use of SADs helps standardize raters, improving inter-rater reliability (itself a result of the accuracy of estimates of severity on individual specimens). Agreement (ρc) on the 0 to 100% range with actual values from image analysis is frequently > 0.90 when using SADs (Spolti et al. 2011; Duarte et al. 2013; Domiciano et al. 2014; González-Domínguez et al. 2014). This can be considered excellent agreement in measurement science (Altman 1991), although others are more conservative (McBride 2005). When SADs are not used, agreement is often < 0.85. There may be symptomatic patterns where unaided estimates can be quite accurate and so SADs are less useful (Del Ponte et al. unpublished).
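For readers unfamiliar with the agreement statistic used above, the following is a minimal sketch of Lin's concordance correlation coefficient, computed with 1/n variances and covariance as in Lin's original formulation. The function name `lins_ccc` and the example values are hypothetical.

```python
def lins_ccc(actual, estimated):
    """Lin's concordance correlation: 2*s_xy / (s_x^2 + s_y^2 + (mean_x - mean_y)^2)."""
    n = len(actual)
    if n != len(estimated) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 values")
    mean_x = sum(actual) / n
    mean_y = sum(estimated) / n
    var_x = sum((x - mean_x) ** 2 for x in actual) / n
    var_y = sum((y - mean_y) ** 2 for y in estimated) / n
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(actual, estimated)) / n
    denominator = var_x + var_y + (mean_x - mean_y) ** 2
    if denominator == 0:
        raise ValueError("both sequences are constant and identical")
    return 2 * cov_xy / denominator


# Hypothetical actual severities (%) and a rater who overestimates by 5 points.
actual = [10.0, 20.0, 30.0, 40.0]
shifted = [x + 5.0 for x in actual]
print(round(lins_ccc(actual, actual), 3), round(lins_ccc(actual, shifted), 3))
```

Unlike Pearson's r, which would be 1 for both comparisons, the coefficient drops for the shifted rater because the estimates no longer lie on the line of perfect agreement.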
A recent, comprehensive review of SADs quantitatively summarizes their characteristics and provides guidelines for additional research (see Table 3 in Del Ponte et al. 2017). Several questions remain to be addressed. Does the number of diagrams in a set of SADs affect the accuracy of the estimates? Recently, an electronic version of interactive SADs was developed for portable devices. The app, called 'Estimate', displays an ordinal quantitative scale (severity intervals in either linear or log increments) accompanied by a SAD representing the mid-point. The severity value is not entered directly as in typical use of a SAD. The rater first selects a main category (specific % interval) and, alternatively, a subcategory in 1% units (Pethybridge and Nelson 2015). An evaluation of the app showed the superiority of the linear over the log-incremental scale, but only for the two-stage (category and subcategory) assessment process. The delivery of SADs in portable devices may increase in the future, as sophistication improves usability.
Fig. 2 [...] (Domiciano et al. 2014), b frogeye leaf spot on soybean (Debona et al. 2015), c potato early blight (Duarte et al. 2013), and d anthracnose on fruit of sweet pepper (Pedroso et al. 2011). The numbers represent percentage (%) of leaf area showing symptoms

Training
Nutter Jr and Schultz (1995) demonstrated that computer-based training improved accuracy, but this may be short-lived (Parker et al. 1995b). In a few cases training may reduce accuracy, possibly due to training on pathosystems not related to the one being used in practice (Bardsley and Ngugi 2013). Nutter Jr and Schultz (1995) found that one rater's coefficient of determination (R2), indicative of precision, changed from 0.825 to 0.933 before and after training. Training software programs were developed for older computer operating systems, for example DISTRAIN (Tomerlin and Howell 1988) and Severity.Pro (Nutter Jr and Litwiller 1998). Neither new nor updated versions of these training programs based on computer-generated images exist; they may have been replaced by training raters with true-color photos of symptoms combined with the use of SADs technology.
Instruction Instruction provides an opportunity for the raters to recognize symptoms and estimate severity accurately. Bardsley and Ngugi (2013) found that good instruction on symptoms of bacterial spot on peach and nectarine resulted in the greatest improvement in inter-rater reliability (which could also be tangentially related to improvements in accuracy in that study) by inexperienced raters compared to training. The coefficient of determination (R2) increased from 0.76 to 0.96 after instruction (and to 0.88 after training).
Experience, general field-based training and other methods Experience in recognizing disease symptoms does have an impact on ability to estimate accurately. Although individual, inexperienced raters may be innately more accurate than some experienced raters, as a group, experienced raters tend to be more accurate (Yadav et al. 2013; González-Domínguez et al. 2014). Grids composed of squares that overlay a leaf (or other specimen area) were shown to improve accuracy (Parker et al. 1995b) but have never been widely implemented. Considering these tools available to improve accuracy and reliability (and acknowledging that many questions remain), standardized procedures may be outlined that will provide a basis to maximize accuracy of individual specimen estimates when performing visual assessments (Table 3).

Application in research and practice
Visual assessments are most often applied at the scale of individual organs (leaflets, leaves, fruit, flowers etc.), plants, and occasionally fields. However, these data are used at regional and global levels. Visually estimating severity at the field scale is somewhat archaic. For example, a key was developed during the 1940s to assess late blight of potato in the UK at the field scale (Moore 1943). Such field keys, although a valid method of disease severity assessment, are not considered further as they have been rarely used in recent times.
Visual severity assessment has been applied to compare treatments (for example, fungicide or cultural control methods), to assess the effect of disease on yield, to conduct surveys, and to assess the severity of disease on different genotypes.

Summary of how accuracy has been improved for visual estimates
Based on current research, where possible, the percentage scale is demonstrably the most accurate tool on which to base visual estimates of disease severity (Nita et al. 2003; Hartung and Piepho 2007; Bock et al. 2010b; Chiang et al. 2014). Thus, accuracy of disease severity estimation has been improved through a better understanding of error and of methods to reduce bias, particularly with the use of SADs, but also through instruction and training.
Visual estimation (with use of the approaches outlined in Table 3) has probably come close to maximizing accuracy of estimates. Appropriate scales, SADs, training and instruction, if correctly implemented can provide remarkably accurate estimates that will minimize the risk of any type II errors.

Measurement of disease severity using visible spectrum image analysis
Assessments based on VIS spectrum image analysis have the potential to be accurate, repeatable and reproducible (Martin and Rybicki 1998; Bock et al. 2008a; Barbedo 2014; Clément et al. 2015). Lindow and Webb (1983) were among the earliest pioneers of digital image analysis of plant disease. Particularly since 2000, more sophisticated algorithms and statistical approaches have advanced the capability of differentiating symptomatic from healthy tissue in digital images (Table 4) (Bock and Nutter Jr 2011; Barbedo 2013, 2016a, 2017, 2019).

Methods of image acquisition
Various cameras or image capturing devices record in the VIS spectrum. Red-green-blue (RGB) sensors are portable and widely available. With the advent of handheld devices with cameras, the possibilities of easily obtaining numerous images are increased many-fold.
Table 4 The crop, stress, and analysis technique used to describe severity measurement using visible spectrum (RGB) image analysis with symptom segmentation. The superscript numbers cross-reference the "Reference" with the "Analysis software/technique" and "Symptom measured" for each study. For example, in the first row 'Color Transformations' and 'filtering' were used only by Camargo and Smith (2009)

Methods of image analysis and processing
Segmentation Segmentation (delineation of the area of interest) is a step in many image analysis algorithms (Fig. 3). In testing image analysis, leaf segmentation is generally performed manually, but for practical application segmentation must be automated. The only difference between segmentation and severity measurement is that the latter includes an additional step relating the areas occupied by diseased and healthy tissues. With the rise of artificial intelligence (AI, machine learning, and its off-shoot, deep learning) segmentation is less of a requirement.
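The additional step that turns a segmentation into a severity measurement, relating diseased area to total leaf area, can be sketched in a few lines. This is only an illustration under stated assumptions: the leaf is already segmented from the background (background pixels are `None`), and a pixel is called diseased when a simple excess-green index falls at or below a threshold. The index, the threshold of 20 and the toy image are hypothetical choices, not a method from the cited studies.

```python
def excess_green(pixel):
    """Simple greenness index 2G - R - B for one (R, G, B) pixel."""
    r, g, b = pixel
    return 2 * g - r - b


def severity_percent(image, threshold=20):
    """Percent of leaf pixels classified as diseased.

    image is a list of rows; each entry is an (R, G, B) tuple or None for background.
    """
    leaf_pixels = 0
    diseased_pixels = 0
    for row in image:
        for pixel in row:
            if pixel is None:          # background, already segmented out
                continue
            leaf_pixels += 1
            if excess_green(pixel) <= threshold:
                diseased_pixels += 1
    if leaf_pixels == 0:
        raise ValueError("image contains no leaf pixels")
    return 100.0 * diseased_pixels / leaf_pixels


# Toy 3 x 3 image: 8 leaf pixels, 2 of them brown (assumed lesion color).
GREEN = (40, 160, 40)
BROWN = (120, 80, 40)
toy_leaf = [
    [None, GREEN, GREEN],
    [GREEN, BROWN, GREEN],
    [GREEN, BROWN, GREEN],
]
print(severity_percent(toy_leaf))
```

A fixed threshold of this kind is exactly what tends to fail when lighting or background changes, which is the limitation discussed later for field-acquired images.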
Software for image analysis Many studies have employed third-party software to measure severity, including Assess (Horvath and Vargas 2005; Steddom et al. 2005; Mirik et al. 2006; Bock et al. 2008a, 2008b, 2009a, 2009b, 2009c; De Coninck et al. 2012; Sun et al. 2014; El Jarroudi et al. 2015), launched in 2002 (Lamari 2002). Assess requires the user to predefine segmentation parameters for automation, but this works only if all images were captured under the same conditions (Bock et al. 2009c). [...] (Wijekoon et al. 2008; Goodwin and Hsiang 2010). In the review on SADs, 20 programs were reported to obtain actual severity measurements, but Assess and Quant (Vale et al. 2003) were the most commonly used (Del Ponte et al. 2017).
Validation Validation involves comparing the image-analyzed measurement to an actual or "gold standard" value. The actual value may be based on a visual estimate (Steddom et al. 2005; De Coninck et al. 2012; El Jarroudi et al. 2015) or manually delineated image analysis data (Martin and Rybicki 1998; Bock et al. 2009a; Peressotti et al. 2011). Regression has been widely used to compare accuracy of image analysis systems (Horvath and Vargas 2005; Steddom et al. 2005; Peressotti et al. 2011; El Jarroudi et al. 2015), although other statistical criteria are often used to provide more meaningful insights (Bock et al. 2009a; De Coninck et al. 2012; Stewart and McDonald 2014). Because experimental setups and contexts vary between studies, the results are not always comparable (Horvath and Vargas 2005); reported coefficients of determination (R2) and correlation coefficients (r) fall within the 0.70-1.00 range (Martin and Rybicki 1998; Steddom et al. 2005; Peressotti et al. 2011; De Coninck et al. 2012).
Custom systems using color transformations and artificial intelligence Newer methods for severity measurement can be divided into two categories. The first relies on color transformations; the second on AI using machine or deep learning techniques.
i) [...] (Price et al. 1993; Patil and Bodhe 2011; Clément et al. 2015) and filtering (Camargo and Smith 2009), with the objective of isolating the regions of interest. These algorithms are generally quick to develop and simple to implement but may not be suitable for dealing with subtle symptoms.
ii) Many applications of AI for image analysis are based on machine learning, which may be supervised or unsupervised. Supervised learning typically involves methods of classification (including logistic regression, support vector machines and artificial neural networks), while unsupervised learning relies on methods such as clustering and dimension reduction (including k-means clustering and principal component analysis) that exploit structural patterns in the data. For disease severity measurement the classifiers require the severity to be transformed from continuous data to a discrete scale of values. This is usually accomplished by either labelling each pixel as healthy or diseased, or by defining severity levels based on a nominal or ordinal scale, for example as "low", "medium" and "high". A variety of methods have been tested and reported in the literature, including K-means clustering [...]
[...], which contains > 50,000 curated images of many crop diseases; and Digipathos (Barbedo et al. 2018, available at https://www.digipathos-rep.cnptia.embrapa.br), also containing > 50,000 images of crop diseases. However, neither has image annotation for sample source location or actual severity. Image libraries are a progress-limiting gap. Data sharing is one solution: globally, plant pathologists working on various pathosystems could capture images to represent the diversity of characteristics and enable image analysis systems (Barbedo 2019).
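To make the unsupervised route concrete, the sketch below clusters leaf pixels into two groups with a plain k-means loop and labels each pixel as healthy or diseased, which is the pixel-level discretization described above. It is an illustration only: the starting centroids are supplied by the caller, the cluster with the lower share of green is assumed to be diseased tissue, and all names and pixel values are hypothetical.

```python
def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def nearest(point, centroids):
    """Index of the centroid closest to point."""
    return min(range(len(centroids)), key=lambda i: squared_distance(point, centroids[i]))


def kmeans(points, initial_centroids, max_iterations=50):
    """Plain k-means: assign points to the nearest centroid, then recompute means."""
    centroids = [tuple(float(v) for v in c) for c in initial_centroids]
    for _ in range(max_iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            clusters[nearest(p, centroids)].append(p)
        updated = [
            tuple(sum(channel) / len(members) for channel in zip(*members)) if members else old
            for old, members in zip(centroids, clusters)
        ]
        if updated == centroids:       # assignments have stabilized
            break
        centroids = updated
    labels = [nearest(p, centroids) for p in points]
    return centroids, labels


def green_share(rgb):
    total = sum(rgb)
    return rgb[1] / total if total else 0.0


def severity_from_clusters(pixels, initial_centroids):
    """Percent of pixels in the cluster assumed to be diseased (lowest green share)."""
    if not pixels:
        raise ValueError("no leaf pixels supplied")
    centroids, labels = kmeans(pixels, initial_centroids)
    diseased = min(range(len(centroids)), key=lambda i: green_share(centroids[i]))
    return 100.0 * labels.count(diseased) / len(pixels)


# Hypothetical leaf pixels: six greenish, two brownish.
pixels = [(40, 160, 40), (50, 170, 45), (45, 150, 50), (38, 165, 42),
          (42, 158, 47), (47, 162, 39), (120, 80, 40), (130, 90, 50)]
print(severity_from_clusters(pixels, [pixels[0], pixels[-1]]))
```

With real images the number of clusters, the color space and the rule for naming the diseased cluster would all need validation against reference values.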
Many trained deep learning models are lightweight enough for mobile applications, so they can be run directly on the device without the need for connectivity (Ramcharan et al. 2019), important in remote areas.

Accuracy of image analysis
The number of studies employing convolutional neural networks (CNN) has increased in the last few years. Ramcharan et al. (2019) used CNN and 2415 leaf samples to automatically detect two severity classes of cassava mosaic disease. Accuracy of low severity detection was 29.4%. Esgario et al. (2019) found that assigning severity of multiple diseases of coffee using deep learning was up to 84.13% accurate. Wang et al. (2017) found that accuracy of severity measurements of apple leaf black rot ranged from 83.3 to 100%, depending on class (there were 4 classes of severity). Thus, estimates of accuracy are often being considered at a lower resolution compared to visual estimation using the 0 to 100% scale. Scale type, number of intervals and replication may differ considerably to achieve the same power in a hypothesis test (Bock et al. 2010b; Chiang et al. 2014, 2016a, 2016b, 2019).
Much of the variation in image analysis may be attributed to two factors. The first is the conditions under which the images were captured and the variety of symptoms in the images. Studies using VIS spectrum images captured in the field often report lower accuracies. Examples of images captured under variable conditions include the systems proposed by Macedo-Cruz et al. (2011), Barbedo (2017) and Hu et al. (2017) (resulting in 92, 91, and 84% accuracy, respectively); images captured under controlled conditions include methods proposed by Patil and Bodhe (2011), Kruse et al. (2014) and Stewart et al. (2016) (resulting in 98, 95, and 94% accuracy, respectively). The second is the actual reference values to which the estimates are compared. Where the reference is a visual estimate, subjectivity will be directly related to the perceptions of the rater (Bock et al. 2008a).
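Class-wise accuracies of the kind quoted above can be computed from paired true and predicted severity classes. The sketch below reports, for each true class, the share of specimens assigned to that class (often called recall), alongside overall accuracy; whether each cited study defined accuracy in exactly this way is not stated here, and the labels and counts are hypothetical.

```python
def per_class_accuracy(true_labels, predicted_labels):
    """For each true class, the fraction of its specimens predicted as that class."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label sequences must have equal length")
    totals, correct = {}, {}
    for truth, predicted in zip(true_labels, predicted_labels):
        totals[truth] = totals.get(truth, 0) + 1
        if truth == predicted:
            correct[truth] = correct.get(truth, 0) + 1
    return {label: correct.get(label, 0) / totals[label] for label in totals}


def overall_accuracy(true_labels, predicted_labels):
    """Fraction of all specimens whose predicted class matches the true class."""
    if not true_labels or len(true_labels) != len(predicted_labels):
        raise ValueError("need two non-empty sequences of equal length")
    matches = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return matches / len(true_labels)


# Hypothetical two-class result: the low-severity class is mostly missed.
truth = ["low"] * 4 + ["high"] * 6
predicted = ["low", "high", "high", "high"] + ["high"] * 6
print(per_class_accuracy(truth, predicted), overall_accuracy(truth, predicted))
```

The example shows how a respectable overall figure can hide poor performance on one class, which is why class-wise results are worth reporting.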

Sources of error affecting accuracy
Operator Operators must accurately pair the diagnosis guidelines with the symptoms. Even manual measurements using image analysis have some subjectivity. Actual values based on image analysis used to validate automatic methods (or other methods of assessment) are variable (Barbedo 2013; Bock et al. 2008a), but the error should be small.
Variation in symptoms, host and background To work effectively, deep learning models must be trained using images covering a wide range of conditions. For most other techniques segmentation of leaf and disease is required (Barbedo 2016a). Threshold values and other parameters derived under one set of conditions generally fail under a different set of conditions due to variation in brightness, contrast, reflections, weather conditions and numerous other factors (Barbedo 2014). Symptoms may vary depending on stage of development (Patil and Bodhe 2011) and the interaction with environmental factors (Mutka et al. 2016). Separating image components automatically with field-acquired images is a challenging and complex task and solutions are only recently being developed (Zhang et al. 2018a). Automatic segmentation can be easier if a screen is placed behind the leaf prior to image capture (El Jarroudi et al. 2015; Pethybridge and Nelson 2015; Shrivastava et al. 2015), but this makes image capture more time-consuming and problematic. Thus, most methods using field-captured images rely on the user to manually segment the leaf (Barbedo 2014, 2016b, 2017).
Issues with image acquisition and differentiating diseased vs. healthy areas There is subjectivity in determining the edges of some symptoms (Barbedo 2014; Stewart et al. 2016). Leaves are not always flat, causing perspective problems (Barbedo 2014), or require flattening (Clément et al. 2015). Small symptoms may be confused with debris (Barbedo 2014). Shadows, leaf veins, and other parts of the plant may mimic symptoms, causing error (Olmstead et al. 2001; Bade and Carmona 2011; Barbedo 2014; Clément et al. 2015; Barbedo 2016a). Groups of lesions may merge, impairing a counting process (Bock et al. 2008a; Bade and Carmona 2011). The presence of other disorders may complicate delineation of the symptoms of interest (Bock et al. 2008a, 2009a; El Jarroudi et al. 2015; Barbedo 2016b).
Specular reflections may render parts of the leaf featureless (Steddom et al. 2005; Peressotti et al. 2011; Barbedo 2016a). Image compression may introduce distortions and artifacts (Steddom et al. 2005; Bock et al. 2010a). Symptom complexity affects the difficulty of the task (Bock et al. 2008a; Barbedo 2017), which has led some authors to argue that different algorithms are needed for each symptom (Contreras-Medina et al. 2012), or each host-pathogen pair (Mutka and Bart 2015). AI techniques can address some of these issues if trained with sufficiently comprehensive data. Factors that cause loss of information (specular reflections, shadows, etc.) can only be addressed by appropriate protocols during image capture.
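Why a threshold derived under one set of conditions may fail under another can be sketched with a synthetic example. The image, the intensity values and the threshold below are all hypothetical; real segmentation methods are considerably more elaborate.

```python
import numpy as np

def severity_percent(gray, leaf_mask, lesion_threshold):
    """Percent of leaf pixels darker than a fixed threshold (assumed to be lesions)."""
    lesion = (gray < lesion_threshold) & leaf_mask
    return 100.0 * lesion.sum() / leaf_mask.sum()

# Synthetic 10 x 10 "leaf": healthy tissue at intensity 0.6, lesions at 0.3
gray = np.full((10, 10), 0.6)
gray[:2, :] = 0.3                      # 20 of 100 pixels are lesion
leaf_mask = np.ones_like(gray, dtype=bool)

THRESHOLD = 0.45                       # hypothetical value tuned for this illumination

severity_reference = severity_percent(gray, leaf_mask, THRESHOLD)

# The same leaf imaged under brighter light: all intensities scaled by 1.6
brighter = np.clip(gray * 1.6, 0.0, 1.0)
severity_brighter = severity_percent(brighter, leaf_mask, THRESHOLD)

print(severity_reference, severity_brighter)
```

Under the original illumination the fixed threshold recovers the 20% lesion area, whereas under the brighter illumination the lesion pixels exceed the threshold and the measured severity drops to zero, although the leaf itself has not changed.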
Automatic image capture in the field can result in underlying leaves being obscured. Perspectives will be variable. This is an issue for plants with dense canopies if severity measurement on lower leaves is needed (Wiesner-Hanks et al. 2018).
Actual values Evaluation of measurements obtained using VIS image analysis is not straightforward. Generally, the "gold standard" reference is generated manually by image analysis (Peressotti et al. 2011;El Jarroudi et al. 2015), by expert visual estimation, or rarely other methods (Martin and Rybicki 1998). Due to subjectivity, even manually delineated image analysis may harbor operator error, and thus the systems developed are dependent on the references they are tasked to mimic; they could vary if other "gold standard" references were used.
System limitations Effective as the various new techniques are, including deep learning, images in the visible range sometimes do not carry enough information to distinguish severity classes. In such cases, combining different imaging methods may be a viable solution (Berdugo et al. 2014), albeit at higher cost and with reduced mobility. Image analysis software for disease severity measurement is available for mobile devices (Pethybridge and Nelson 2015; Manso et al. 2019). Mobile device-based applications generally require the user to set thresholds, which can lead to inconsistencies (Bock et al. 2008a, 2009c). Software was recently developed that automates severity estimation using fuzzy logic rules and image segmentation for the mobile application 'Leaf Doctor' (Sibiya and Sumbwanyambe 2019).

Scales of application
Image capture using mobile platforms (UAVs, ground robots etc) is being studied in the field, although disease detection is the primary focus (Johnson et al. 2003; Garcia-Ruiz et al. 2013; de Castro et al. 2015). Measurement of severity with VIS spectrum image analysis using mobile platforms is less common (Lelong et al. 2008; Sugiura et al. 2016; Duarte-Carvajalino et al. 2018; Franceschini et al. 2019; Ganthaler et al. 2018; Liu et al. 2018), but is an area of research need. An automated VIS image analysis system on a UAV for measuring severity had moderate precision compared to visual rating (R² = 0.73), but was deemed acceptable for rating potato resistance to late blight (Sugiura et al. 2016). Zhang et al. (2018b) found RGB images taken using a UAV were less effective (R² ≤ 0.554) in differentiating severity of sheath blight of rice compared to HSI sensors (R² ≤ 0.627). VIS image analysis to measure disease severity is not yet routinely used outside the research realm. There are a few examples of controlled environment, high-throughput systems used routinely for research purposes. Karisto et al. (2018) described automated VIS image analysis to measure severity of Septoria leaf blotch on wheat. There was a good relationship between image-analyzed measurements and visual estimates (Lin's concordance correlation, ρc = 0.76 to 0.99, depending on rater; Stewart and McDonald 2014). Microscopic imaging of powdery mildew on barley for genotype screening was considered ready for high-throughput processing (Ihlow et al. 2008). But both still require time-consuming sample preparation.
Fig. 5 "Spectral data cube". Three-dimensional structure of hyperspectral imaging data with two spatial dimensions y and x and a spectral dimension z. Each image pixel contains the spectral information over the measured range. In this example, the reflectance from barley leaves diseased with rust is illustrated at different disease severities
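Lin's concordance correlation coefficient, cited above for the agreement between image analysis and visual estimates, penalizes both scatter and systematic bias. A minimal sketch follows, assuming the usual estimator based on 1/n moments; the severity values are illustrative only.

```python
import numpy as np

def lins_ccc(x, y):
    """Lin's concordance correlation coefficient between two sets of measurements.

    Uses population (1/n) variances and covariance; this choice of estimator
    is an assumption of the sketch.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    mean_x, mean_y = x.mean(), y.mean()
    covariance = ((x - mean_x) * (y - mean_y)).mean()
    return 2.0 * covariance / (x.var() + y.var() + (mean_x - mean_y) ** 2)

# Illustrative values: a rater who overestimates every specimen by 10 percentage points
actual = np.array([10.0, 20.0, 30.0, 40.0])
estimate = actual + 10.0

ccc_biased = lins_ccc(actual, estimate)
ccc_perfect = lins_ccc(actual, actual)
print(ccc_biased, ccc_perfect)
```

In this example the Pearson correlation between the two series is 1, but the constant bias lowers the concordance coefficient to about 0.71, which is why the coefficient is suited to assessing agreement with actual values rather than mere association.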

Spectral sensor technology to measure plant disease severity
MSI and HSI sensors measure the light reflected by an object. In plant disease detection and severity measurement this might be a single plant organ (leaf, fruit, and/or storage root), a plant, or a crop stand. Several studies have demonstrated that diseases can be detected accurately even before symptoms are visible to the human eye (Rumpf et al. 2010; Zhao et al. 2017). Indeed, detecting the quantity of disease at very early stages is valuable for disease management decisions, and neither raters nor VIS image analysis can detect latent disease. Furthermore, HSI is non-invasive, non-destructive and objective, and if automated can significantly reduce the workload compared to other methods of assessment (Walter et al. 2015; Mahlein 2016; Virlet et al. 2017).

Characteristics of light reflectance from plants
The optical properties of plants are determined mainly by their reflectance, transmission and absorbance of light. Diseases affect these signature characteristics.
Reflectance of light from plants Reflectance depends on leaf properties. Transmission and absorbance are influenced by pigments and water (Gates et al. 1965;Curran 1989). Reflectance is caused by biochemical properties that result in a mixed signal (Gates et al. 1965;Carter and Knapp 2001;Gay et al. 2008). The visible range (400-700 nm) is characterized by absorption by chlorophyll, carotenoids and anthocyanins (Gay et al. 2008). According to Hindle (2008), NIR and SWIR stimulate molecular motion that induces absorption or reflection by compounds having characteristic spectral patterns. The NIR reflectance of leaves is determined mainly by the leaf and cell structures and the canopy architecture (Gates et al. 1965;Elvidge 1990). The NIR and SWIR regions have bands that are absorbed by water (particularly the SWIR region) (Seelig et al. 2008).
How do plant diseases influence the optical properties of plants? The pathogen causes changes in physiological and biochemical processes in the host, resulting in disease, often accompanied by symptoms. The pathogen and symptom types have consequences for the detectability and measurement of disease severity. Each host-parasite interaction has a specific spatial and temporal dynamic, impacting different wavebands during pathogenesis (Wahabzada et al. 2016). Sensors offer the potential to extract new features of disease severity and dynamics, and a new way to visualize and analyze severity. Progress in disease symptoms can be directly related to HSI measurements (as "metro maps" or "disease traces", Kuska et al. 2015; Wahabzada et al. 2015, 2016). Metro maps of plant disease dynamics explicitly track the host-pathogen interaction, providing an abstract yet interpretable view of disease progress.

Methods of hyperspectral image acquisition
In contrast to RGB cameras having a spatial resolution of several megapixels, spectral sensors include high-resolution techniques with greater spectral resolution (Fig. 5; Mahlein et al. 2018). HSI and MSI sensors assess narrow wavebands in specific ranges of the electromagnetic spectrum in combination with a high spatial resolution. The VIS and NIR regions (400-1000 nm) have the highest information content for monitoring plant stress. The ultraviolet range (UV, 250-400 nm) (Brugger et al. 2019) and SWIR range (1000-2500 nm) provide information as well. Spectral sensors can be characterized by resolution (number of wavebands per nm) and the type of the detector. Often, MSI sensors cover the RGB range in addition to NIR but provide less data due to lower spectral resolution, although they are lightweight and cost less. In contrast, HSI sensors are more complex, heavier and more expensive, and the measurement takes longer, demanding strict protocols. Systems consist of the sensor, a light source and a control unit for measuring, storing and processing the data (Thomas et al. 2018b).
Choice of HSI sensor in combination with the measuring design and platform is the basis of a data set. Accuracy and resolution are influenced by the distance between the sensor and the object. Thus, airborne or space borne systems have lower spatial resolution compared to near-range systems. Data preprocessing and analysis are closely linked and individually designed depending on the sensor, setup and purpose of measuring (Behmann et al. 2015a; Mishra et al. 2018).
Non-imaging sensors Non-imaging HSI sensors do not provide spatial information. The field of view of the optics and the distance to the target determine the size of the measured area. The signal comprises mixed information from healthy and diseased areas, affecting the sensitivity and specificity, so early detection and measurement of symptoms by non-imaging sensors is limited, especially at low disease severities. Measurement of severity of mixed infections is challenging using non-imaging sensors. Mahlein et al. (2010, 2012b) found the detection limit using non-imaging HSI for Cercospora leaf spot (CLS) and powdery mildew of sugar beet was 10 and 20% diseased leaf area, respectively.
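The limited sensitivity of non-imaging sensors at low severity can be sketched by treating the recorded signal as an area-weighted mixture of healthy and diseased reflectance. Linear mixing is a simplifying assumption made here for illustration, and the three-band spectra are hypothetical values.

```python
import numpy as np

def mixed_spectrum(healthy, diseased, diseased_fraction):
    """Area-weighted linear mixture of healthy and diseased reflectance spectra."""
    healthy = np.asarray(healthy, dtype=float)
    diseased = np.asarray(diseased, dtype=float)
    f = float(diseased_fraction)
    if not 0.0 <= f <= 1.0:
        raise ValueError("diseased_fraction must lie between 0 and 1")
    return (1.0 - f) * healthy + f * diseased

# Hypothetical reflectance in three bands (green, red, NIR)
healthy = np.array([0.05, 0.10, 0.45])
diseased = np.array([0.10, 0.20, 0.30])

# Largest change in the mixed signal, relative to a fully healthy leaf, per severity level
shifts = {
    f: float(np.abs(mixed_spectrum(healthy, diseased, f) - healthy).max())
    for f in (0.02, 0.10, 0.20)
}
print(shifts)
```

Under this assumption the signal change scales with the diseased fraction, so at 2% severity the largest deviation from the healthy spectrum is only 0.003 reflectance units, which can easily be lost in measurement noise.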

Imaging sensors
Imaging HSI sensors collect extra information on shape, gradient or color of the spatial dimension (Behmann et al. 2015a). Whisk-broom and push-broom scanners capture the spectral information of a single pixel point or of a pixel line at a time, respectively. The image emerges due to movement of the sensor and has high spatial and spectral resolution. Depending on image size, image acquisition may take minutes, limiting imaging sensors to motionless objects (Thomas et al. 2017).
Other HSI sensors Filter-based HSI sensors do not require the sensor to move and are generally faster than push- and whisk-broom sensors, but the subject must be motionless. HSI snapshot cameras capture images akin to RGB cameras, but have lower resolution compared to push- or whisk-broom sensors, although they have a fast image acquisition time (Thomas et al. 2017).

Choice of sensor platform
It is critical to consider purpose and subject. HSI sensor setups can be handheld or mounted on a platform (vehicles, robots, UAVs, airplanes or satellites). Choosing the right sensor in combination with the right measurement scale is the key requirement for successful field measurement. Possible targets include early disease detection/identification, or quantifying disease incidence or severity. Drone measurements at a height of 50 m above the crop with a low spatial resolution hyperspectral camera will not detect single leaf lesions, whereas a measuring device with high spatial resolution close to the leaf canopy can. Pixel-wise attribution of diseased and healthy tissue makes it possible to observe spectral reflectance patterns of diseases in detail. It should be noted that some disease symptoms can only be distinguished from other diseases and stresses when using HSI with high spatial resolution.
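The effect of measuring distance on the ability to resolve single lesions can be estimated with the ground sampling distance of a simple pinhole camera model viewing straight down. The pixel pitch, focal length and lesion size used below are hypothetical and do not describe any particular sensor.

```python
def ground_sampling_distance(distance_m, pixel_pitch_m, focal_length_m):
    """Ground footprint of one pixel (m), assuming a pinhole camera and nadir view."""
    if min(distance_m, pixel_pitch_m, focal_length_m) <= 0:
        raise ValueError("all arguments must be positive")
    return distance_m * pixel_pitch_m / focal_length_m

# Hypothetical optics: 5.5 micrometre pixel pitch, 8 mm focal length
PIXEL_PITCH_M = 5.5e-6
FOCAL_LENGTH_M = 8e-3
LESION_DIAMETER_M = 3e-3   # a 3 mm lesion, chosen for illustration

gsd_drone = ground_sampling_distance(50.0, PIXEL_PITCH_M, FOCAL_LENGTH_M)
gsd_close = ground_sampling_distance(0.5, PIXEL_PITCH_M, FOCAL_LENGTH_M)

# Number of pixels spanning the lesion at each distance
pixels_drone = LESION_DIAMETER_M / gsd_drone
pixels_close = LESION_DIAMETER_M / gsd_close
print(gsd_drone, gsd_close, pixels_drone, pixels_close)
```

With these assumed optics, one pixel covers roughly 3.4 cm of ground from 50 m, so a 3 mm lesion occupies a small fraction of a pixel, whereas from 0.5 m the same lesion spans several pixels.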

Data handling, training and analysis
There are several approaches for analyzing HSI and MSI data, but no standard one. Data preprocessing typically consists of normalization to a white reference standard and dark current images (Behmann et al. 2015a). A smoothing of the data can be performed. Often the background and parts of the image which are not required for further analysis are masked to reduce the data complexity.
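The normalization step can be sketched as follows, assuming the commonly used form in which the dark current is subtracted from both the raw image and the white reference before taking their ratio. The function name and the small example cube are hypothetical.

```python
import numpy as np

def to_reflectance(raw, white, dark, eps=1e-12):
    """Relative reflectance from raw sensor counts.

    raw, white and dark must share the same shape or broadcast against each
    other, for example (rows, cols, bands) for raw and (bands,) for the references.
    """
    raw = np.asarray(raw, dtype=float)
    white = np.asarray(white, dtype=float)
    dark = np.asarray(dark, dtype=float)
    # Guard against division by zero where white and dark counts coincide
    denominator = np.maximum(white - dark, eps)
    return (raw - dark) / denominator

# Illustrative counts for one pixel with three bands
dark = np.array([100.0, 100.0, 100.0])
white = np.array([4100.0, 4100.0, 4100.0])
raw = np.array([[[500.0, 2100.0, 4100.0]]])   # shape (1, 1, 3)

reflectance = to_reflectance(raw, white, dark)
print(reflectance)
```

Masking of background pixels, also mentioned above, would typically follow this step so that only plant pixels enter the subsequent analysis.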
Vegetation indices A common and straightforward way to analyze hyperspectral images is to use vegetation indices (VI) (Devadas et al. 2009; Ashourloo et al. 2014; Behmann et al. 2015a). VIs are algorithms based on band ratios. Often 2-6 bands are involved. VIs are used to highlight a specific factor while reducing data complexity and the impact of other factors (Jackson and Huete 1991; Gitelson et al. 2014; Blackburn 2007). Several well-described VIs have been used for the detection or quantification of diseases, but were not specifically developed for that purpose. Rather, these VIs are related to pigment content, vitality, biomass, water content and so on. For the analysis of MSI data, VIs are often the method of choice.
Some disease specific VIs have been developed (Mahlein et al. 2013; Ashourloo et al. 2014; Oerke et al. 2016). The correlations between disease severity and reflectance wavebands are calculated and those wavebands with the highest correlations are integrated into disease specific indices. Comparative studies have demonstrated that disease specific VIs are superior to standard VIs (Mahlein et al. 2013; Ashourloo et al. 2014). An overview of VIs for the detection and/or quantification of diseases is presented, including disease specific VIs (Table 5).
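Both ideas can be sketched briefly: a generic normalized difference index of two bands (the well-known NDVI is of this form, using a NIR and a red band), and a ranking of wavebands by their correlation with severity as a first step toward a disease specific index. The wavelengths, spectra and severities are hypothetical, and the ranking is a simplification of the published procedures.

```python
import numpy as np

def band_index(wavelengths, target_nm):
    """Index of the waveband closest to target_nm."""
    return int(np.argmin(np.abs(np.asarray(wavelengths, dtype=float) - target_nm)))

def normalized_difference(spectra, wavelengths, nm_a, nm_b):
    """(R_a - R_b) / (R_a + R_b), evaluated along the last (spectral) axis."""
    spectra = np.asarray(spectra, dtype=float)
    a = spectra[..., band_index(wavelengths, nm_a)]
    b = spectra[..., band_index(wavelengths, nm_b)]
    return (a - b) / (a + b)

def rank_bands_by_correlation(spectra, severity):
    """Pearson r of each band with severity; bands ordered by descending |r|.

    spectra: (n_samples, n_bands); severity: (n_samples,).
    Assumes no band is constant across samples.
    """
    spectra = np.asarray(spectra, dtype=float)
    severity = np.asarray(severity, dtype=float)
    xc = spectra - spectra.mean(axis=0)
    yc = severity - severity.mean()
    r = (xc * yc[:, None]).sum(axis=0) / np.sqrt((xc ** 2).sum(axis=0) * (yc ** 2).sum())
    return np.argsort(-np.abs(r)), r

# Hypothetical data: four leaves, three bands (550, 670, 800 nm)
wavelengths = [550, 670, 800]
severity = [0.0, 10.0, 20.0, 30.0]
spectra = np.array([
    [0.10, 0.05, 0.50],
    [0.12, 0.08, 0.47],
    [0.09, 0.11, 0.40],
    [0.11, 0.14, 0.36],
])

ndvi = normalized_difference(spectra, wavelengths, 800, 670)
order, r = rank_bands_by_correlation(spectra, severity)
print(ndvi, order, r)
```

In this constructed data set the red band ranks first and the NIR band second, so an index built for this hypothetical disease would draw on those two bands rather than the green band.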
Symptom recognition and analysis As for VIS image analysis, hyperspectral image analysis is challenging. The aim is to extract a small proportion of relevant information from the hyperspectral signal (Behmann et al. 2015b). Algorithms are developed to learn and make predictions about the data and can cope with hundreds of wavebands used for detection, quantification and characterization of plant diseases in the laboratory, greenhouse and field (Behmann et al. 2015b; Singh et al. 2016). Either the entire spectral data set can be analyzed, and patterns identified, or feature selection methods can be applied to reduce the data complexity. As with VIS image analysis methods, there are supervised and unsupervised learning approaches.
Supervised approaches like regression and classification demand annotated training data. Provision of training data is a limiting factor in severity measurement as sufficiently large image sets of annotated data for specific diseases under a full range of conditions are not available.
Compared to supervised approaches, unsupervised approaches are less well explored, but do not rely on annotation and training data. Unsupervised methods can be assigned to pattern recognition in hyperspectral image data. A 'crossover' is a data driven learning model that relies on the actual data set, and not on predefined models; the algorithm utilizes extreme data points to define archetypal signatures, including latent aspects of the data.
Approaches using AI for measuring severity are based on deep learning. In contrast to the predefined features of machine learning approaches, deep learning models determine more abstract and more informative data representation within the process of optimization to a particular task. Deep learning offers potential to identify optimal features for the detection and measurement of a specific disease. As with RGB images, CNNs show great potential as a component of deep learning. Nagasubramanian et al. (2017, 2019) applied a 3D CNN for detection of charcoal rot on soybean using close-range VIS-NIR hyperspectral images, achieved a detection accuracy of 97% and were able to predict lesion length on most stems. However, these technologies demand substantial training data. Establishing a library of ground-truthed data for different diseases is crucial to the successful implementation of deep learning for disease quantification. Related to general disease severity measurement, the importance of early detection (a "pre-visible symptom severity measurement") cannot be overstated and is critical in many circumstances; HSI can excel when severity is nascent.

Ground truthing, accuracy and measuring disease severity with spectral sensors
Various actual values or "ground truthing" have been used in HSI disease severity measurement including visual estimates based on nominal or ordinal scales (Wang et al. 2016; Leucker et al. 2017), described stages of symptom progression (Kuska et al. 2015; Wahabzada et al. 2015, 2016; Zhu et al. 2017), and molecular quantification of the pathogen (Thomas et al. 2017; Zhao et al. 2017). An increasing number of studies have demonstrated that HSI and MSI data can be used to accurately detect, differentiate and quantify symptoms of plant diseases (Mahlein et al. 2012a). However, as noted, accuracy is not necessarily measured using the 0 to 100% scale as it has historically been for visual estimates or even for VIS image analysis. It may be related directly to the physiological, biochemical, structural and developmental changes in the host and pathogen. Comparing symptoms estimated or measured using the 0 to 100% scale to HSI measurements can easily be done as HSI sensors provide pixel-based results on disease status (Fig. 6). The relation between visual rating and sensor measurement can be evaluated by post-classification routines and confusion matrices.
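How pixel-based results translate into the percentage scale, and how a confusion matrix summarizes agreement with a reference map, can be sketched as follows. The label coding and the two small label maps are hypothetical.

```python
import numpy as np

BACKGROUND, HEALTHY, DISEASED = 0, 1, 2   # hypothetical label coding

def percent_severity(label_map):
    """Percent of leaf pixels (healthy + diseased) classified as diseased."""
    labels = np.asarray(label_map)
    healthy = np.count_nonzero(labels == HEALTHY)
    diseased = np.count_nonzero(labels == DISEASED)
    if healthy + diseased == 0:
        raise ValueError("no leaf pixels in label map")
    return 100.0 * diseased / (healthy + diseased)

def confusion_matrix(reference, predicted, n_classes):
    """Rows are reference classes, columns are predicted classes."""
    ref = np.asarray(reference).ravel()
    pred = np.asarray(predicted).ravel()
    matrix = np.zeros((n_classes, n_classes), dtype=int)
    np.add.at(matrix, (ref, pred), 1)   # accumulate one count per pixel
    return matrix

def overall_accuracy(matrix):
    """Fraction of pixels on the diagonal of the confusion matrix."""
    return np.trace(matrix) / matrix.sum()

# Illustrative reference (e.g. manually annotated) and classifier output
reference = np.array([[0, 1, 1, 2],
                      [0, 1, 2, 2]])
predicted = np.array([[0, 1, 2, 2],
                      [0, 1, 2, 2]])

matrix = confusion_matrix(reference, predicted, n_classes=3)
print(percent_severity(reference), percent_severity(predicted))
print(matrix, overall_accuracy(matrix))
```

Here a single healthy pixel misclassified as diseased leaves overall pixel accuracy at 0.875 but shifts severity from 50% to about 67%, which shows that classification accuracy and accuracy on the percentage scale are related but not interchangeable.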
Accuracy of detection can be robust. Apan et al. (2004) detected sugarcane orange rust with 96.9% accuracy compared to visually ground-truthed data;  used in-field spectral images for early detection of yellow rust infected wheat with 96% when; classification accuracy ranged from 90% to 95% compared to a visually-assessed disease map; Hillnhütter et al. (2011, 2012) discriminated symptoms caused by the nematode Heterodera schachtii and the soil borne fungus Rhizoctonia solani in sugar beet under both field and controlled conditions (spectral reflectance data and manual symptom assessment were correlated, P < 0.01); Delalieux et al. (2007, 2009a) identified narrow waveband ratios with c-values (the c-index is derived from receiver operating characteristic curves, maximizing sensitivity for low values of the false-positive fraction) ranging from 0.80 to 0.88 for detecting scab (caused by Venturia inaequalis) on apple.
Fig. 6 RGB images and false-color classification of diseased pixels of wheat leaves with symptoms of powdery mildew caused by Blumeria graminis f.sp. tritici. Hyperspectral images were acquired using a Specim V10 camera system, and classification was performed using Support Vector Machines (SVM). Percentage of diseased leaf area assessed by SVM classification is indicated on the right
For measuring severity, Wahabzada et al. (2015, 2016) used advanced data mining techniques to define cardinal points during pathogenesis and differentiate spatial and temporal development of symptom dynamics of foliar diseases (caused by Pyrenophora teres, Puccinia hordei and Blumeria graminis f.sp. hordei) of barley. Disease was quantified by counting the number of diseased pixels to equate to the stage of infection, which is related to severity (leaf area diseased), although severity (as percent area diseased) was not explicitly calculated. Some of these ideas are ushering in novel paradigms in the progress of disease severity for HSI. Huang et al. (2007) demonstrated reliable measurement of severity using a 9-class ordinal scale for severity of yellow rust in wheat (R² = 0.91). Other studies have explored classification accuracy using ordinal groupings in classes of visually assessed specimens as the assumed gold standard (Alisaac et al. 2018; Thomas et al. 2018a; Alisaac et al. 2019), including the use of confusion matrices. Regression analysis of visual estimates of diseased wheat spikes on a percentage scale and hyperspectral measurements also had demonstrable reliability (R² up to 0.828, Kobayashi et al. 2016). Thomas et al. (2017), using pathogen DNA to ground-truth, achieved a coefficient of determination (R²) of 0.72 from 3 to 9 days after infection of barley with Blumeria graminis f.sp. hordei.
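The coefficient of determination reported in these studies can be computed as in the minimal sketch below, which assumes a simple linear regression of the reference values on the sensor measurements. The data are illustrative only.

```python
import numpy as np

def r_squared(x, y):
    """R² of a least-squares straight line fitted to y as a function of x."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    slope, intercept = np.polyfit(x, y, 1)
    residual_ss = ((y - (slope * x + intercept)) ** 2).sum()
    total_ss = ((y - y.mean()) ** 2).sum()
    return 1.0 - residual_ss / total_ss

# Illustrative sensor-derived values (x) and reference severities (y)
sensor_value = [0.0, 1.0, 2.0, 3.0]
reference_severity = [0.0, 1.0, 1.0, 2.0]

r2 = r_squared(sensor_value, reference_severity)
print(r2)
```

Note that R² measures how well a fitted line explains the reference values (precision); a high R² does not by itself show that the measurements agree with the actual values without bias.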

Sources of error affecting accuracy
Illumination Measurements in the field can be performed using shading and artificial light. If sunlight is used, robust checks against variation in sunlight intensity are critical (Wendel and Underwood 2017). Interpolation approaches may fail through lack of continuous illumination (Suomalainen et al. 2014). Solar altitude, clouds, dew or dust can be problematic. The application of suitable radiation transfer models may help reduce environmental effects (Jay et al. 2016) but is complex and time consuming. Appropriate calibration to reflectance standards or the continuous assessment of radiation intensity is necessary. Varying illumination issues are more acute in direct sunlight and less severe under cloudy conditions, where the light is more diffuse. So far there are no standard calibration methods; the method of choice has to be designed depending on the sensor platform and illumination situation (Banerjee et al. 2020). For HSI under laboratory conditions, calibration routines are well established (Behmann et al. 2015a).
Motion Crop motion due to wind can be an issue. Most HSI sensors record information with a small temporal offset. With line scanning HSI cameras, the single lines are measured consecutively, and movement distorts the spatial image, whereas the spectral information remains valid (Thomas et al. 2017). Filter based systems often demand several seconds to record an image. If the object moves, the spectrum will consist of the reflectance information from different leaf areas and possibly even the ground, which cannot be corrected as the movement geometry is unknown. Averaging the entire hyperspectral image mostly eliminates the effect, but spatial resolution is lost and the resulting data are comparable to those obtained using a simple spectrometer.
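Averaging a hyperspectral image into a single spectrum, as described above, can be sketched as follows. The cube layout (rows, columns, bands) and the optional plant mask are assumptions of this example.

```python
import numpy as np

def mean_spectrum(cube, mask=None):
    """Average spectrum of a hyperspectral cube with shape (rows, cols, bands).

    If a boolean mask of shape (rows, cols) is given, only pixels where the
    mask is True contribute to the average.
    """
    cube = np.asarray(cube, dtype=float)
    if mask is None:
        return cube.reshape(-1, cube.shape[-1]).mean(axis=0)
    return cube[np.asarray(mask, dtype=bool)].mean(axis=0)

# Illustrative 2 x 2 image with three bands
cube = np.array([
    [[0.1, 0.2, 0.5], [0.3, 0.2, 0.5]],
    [[0.1, 0.4, 0.7], [0.3, 0.4, 0.7]],
])
plant_mask = np.array([[True, False],
                       [True, False]])

spectrum_all = mean_spectrum(cube)
spectrum_plant = mean_spectrum(cube, plant_mask)
print(spectrum_all, spectrum_plant)
```

The result is one spectrum per image, which is robust to spatial distortion but, like a non-imaging spectrometer, can no longer separate diseased from healthy pixels.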
Mixed infection and mixed stress Quantification of a disease can be hindered by simultaneous stress (biotic or abiotic) or mixed infection. This aspect has only begun to be addressed. Studies are needed to demonstrate the potential of HSI to simultaneously identify and quantify multiple stressors or diseases.
Technical setup Leaves at different levels in a complex canopy require different exposure times. Shadows complicate saturation and since the choice of the exposure time is based on the brightest object, the exposure time is often much lower than required for shaded leaves low in the canopy, resulting in a noisier image.
Characteristics of the disease distribution Disease distributions may affect the ease with which the sensor can access specimens to sample. Some diseases spread from the lower leaves to the upper leaves through wind or the kinetic energy of rain droplets (e.g. Septoria leaf blotch). Also, Septoria leaf blotch has a prolonged biotrophic phase. Thus, the upper leaves may not reflect the true disease severity in the crop stand when measurements are captured from above the canopy. Wind borne pathogens may be more likely to infect upper parts of a plant. In cereals, this favors the detection of foliar rust diseases or powdery mildews.
These challenges notwithstanding, HSI has great potential to provide a sophisticated, accurate and rapid method to measure disease severity at multiple spatial scales. The challenges are technically surmountable, and the advances over the last several years demonstrate the utility of this technology.

Application in research and practice
Controlled conditions Many studies have measured disease severity using HSI under controlled conditions in the laboratory (Delalieux et al. 2009a, 2009b; Arens et al. 2016; Leucker et al. 2017). High spatial resolution can be obtained by hyperspectral microscopes (Kuska et al. 2015; Leucker et al. 2016), detecting plant-pathogen interactions at the submillimeter scale, before they are visible, or detectable using field-based HSI systems. Scale independent transfer of characteristic spectral signatures may be possible (Bohnenkamp et al. 2019), whereby spectral signatures of different diseases over time are used for detection and quantification models at different spatial scales. The approach will help process large numbers of complex host-pathogen interactions and the impact of mixed infections or abiotic stressors.
Field conditions HSI measurement of disease severity under field conditions is particularly challenging (West et al. 2003). As with systems under controlled conditions, these are at an early experimental phase. Applied systems do not yet exist. Variable environmental conditions and biological heterogeneity impair the quality of field data. Additionally, the infection biology and epidemiology of a disease may impact detectability and measurability (Mahlein et al. 2019).

Contrasting the methods
An overview of the methods is presented in Fig. 7, and some of the advantages and disadvantages of the methods are contrasted (Table 6). Clearly, they have different levels of subjectivity, speed, scalability and cost. Accuracy also varies. Inexperienced, untrained/uninstructed and unaided raters can be wildly inaccurate in severity estimation. But trained, well-instructed and aided raters can provide very accurate estimates. Raters are slow, may be more expensive, and have low throughput. Scalability for visual rating is limited to plot or at most, field levels of assessment. However, both VIS and HSI/MSI image analysis offer less variable measurements of severity under tightly controlled conditions. Both can offer high throughput. Early detection and measurement of severity, particularly by HSI or MSI (and other remote sensors) is a major advantage and is being realized in the research arena. However, both HSI and MSI are limited in field situations as they are currently less capable of dealing with the wide variability in host, pathogen and disease characteristics experienced in the field. Raters, when well-trained and instructed, can differentiate symptoms of diseases and select suitable samples for assessment. Visual estimation of disease severity will be widely used for many years yet and may be needed alongside automated systems for validation and ground-truthing of new or improved fully automated AI-based methods for the foreseeable future.
Fig. 7 The main characteristics of visual severity estimation and imaging severity measurement methods as described and discussed in the text
Visual rating, when performed by trained, well-instructed and aided raters has probably reached its zenith of accuracy. But much is left to be understood regarding visual severity estimation, and the level of improvement will vary according to disease symptoms and how consistency within and among raters can be improved. In contrast, both VIS and HSI/MSI image analysis are rapidly evolving fields with ever more sophisticated approaches being developed and used for image acquisition and processing to measure severity. This is clear in the recent development of high-throughput systems for measuring disease under controlled conditions. Although measuring disease severity under field conditions remains challenging, the technical hurdles are being addressed and various systems have been demonstrated to have some utility, if not yet of practical value. It is possible that a combination of manual operations with automated measures will be required to overcome some limitations.
Visual rating of plant disease severity remains the most widely performed method for all purposes of field research where severity is a required variable. Very few mobile, or field operated VIS and HSI/MSI image analysis systems are routinely used in plant breeding, plant disease management, or for other purposes requiring severity measurement. This will doubtless change as research makes more advances facilitating the field application of VIS and HSI/MSI image analysis. As described, new tools based on AI have demonstrated capability and the potential to overcome many of the barriers. Already some small companies and start-ups provide HSI services for crop monitoring. These may be a model for the future where plant disease assessment is a standard service using HSI and may be provided using various platforms. Furthermore, new digital technologies must be linked to existing prognosis and expert systems with integration into disease thresholding models for real-time management of disease. VIS and HSI/MSI image analysis will play an increasingly prominent role for quantifying disease in research and practice.
Table 6 A comparison of different criteria for visual assessment, visible spectrum image analysis (RGB) and hyperspectral image analysis as methods for obtaining plant disease severity data
Most visual estimates are assessed for accuracy based on the percentage scale, which offers high resolution for differentiating severity of disease. VIS image analysis under tightly controlled conditions can accurately measure disease either when manually operated or automated based on the percentage scale. But under field conditions accuracy is less certain, and the measurements are most often compared to a limited number of classes on an ordinal scale (up to 9 classes), which results in lower resolution to differentiate severity compared to the percentage scale. However, sample sizes can be rapidly and easily increased with VIS image analysis, which can improve the power of a hypothesis test. Severity data collected by HSI/MSI sensors are sometimes related to the percentage scale, but often the data are related to an ordinal or nominal scale rating of the ground-truthed samples, or to characteristic stages during the pathogenesis process. This may provide a new paradigm for rating severity other than using a ratio, ordinal or nominal scale.
A major challenge for both VIS and HSI/MSI is training image sample sizes covering the range in variability of symptoms and conditions expected to be experienced. This will require considerable effort. A possible solution is citizen science (Barbedo 2019), in which non-professional volunteers collect and/or process data (Silvertown 2009). Practitioners and stakeholders could capture images in the field and an expert could annotate these. This idea has been implemented by Plantix™ (https:// plantix.net/en/, PEAT, Berlin). This, and other studies referenced provide a sound basis for being optimistic for the technology in the future.
Furthermore, accuracies of different methods cannot be directly compared unless they are tested against identical gold standards or actual values. Thus, quantitatively inferring the state of the art is challenging. It is worth noting that many journals now encourage sharing the datasets used in published studies, so it might be possible to test new methodologies with the data used in prior experiments (Barbedo 2019), thus enabling more direct comparisons. Examples of accuracies attained by each of these methods are summarized in Table 7. These and other studies have demonstrated that all three methods can provide accurate estimates or measurements of disease severity. However, VIS and HSI/MSI image analysis are still primarily at a research and developmental stage. Remote sensor-based methods are becoming less expensive, more readily available and more portable, and have the advantage of high throughput and scalability. However, the capability of raters to provide accurate estimates should not be overlooked as more sophisticated methods become available. Indeed, it behooves us to ensure that the accuracy and reliability attained by remote sensing methods provide information at least sufficient for the purpose. Methods of validation should be in place to determine this: use of actual values or ground-truthing in all studies is critical to the ongoing process of ensuring accuracy.
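As a hedged illustration of validating estimates against actual values, the sketch below computes Lin's concordance correlation coefficient, one agreement statistic that penalizes both scatter and systematic bias. Its use here is an assumption made for illustration; the studies summarized in Table 7 may report other accuracy measures, and the function name and sample data are hypothetical.

```python
from statistics import fmean


def concordance_correlation(estimates, actual):
    """Lin's concordance correlation coefficient between two paired series.

    Uses population (1/n) variances and covariance. A value of 1 means
    every estimate equals its actual value.
    """
    if len(estimates) != len(actual) or len(estimates) < 2:
        raise ValueError("need two paired series of at least two values")
    n = len(estimates)
    mean_e = fmean(estimates)
    mean_a = fmean(actual)
    var_e = sum((e - mean_e) ** 2 for e in estimates) / n
    var_a = sum((a - mean_a) ** 2 for a in actual) / n
    cov = sum((e - mean_e) * (a - mean_a)
              for e, a in zip(estimates, actual)) / n
    denom = var_e + var_a + (mean_e - mean_a) ** 2
    if denom == 0:
        raise ValueError("both series are constant and identical")
    return 2 * cov / denom


if __name__ == "__main__":
    # Hypothetical actual severities (percent) and two sets of estimates.
    actual = [5.0, 10.0, 20.0, 40.0]
    unbiased = list(actual)
    overestimating = [a + 10.0 for a in actual]  # constant +10 bias
    print(concordance_correlation(unbiased, actual))
    print(concordance_correlation(overestimating, actual))
```

The second series is perfectly correlated with the actual values yet scores below 1 because of its constant overestimation, which is why agreement with actual values, rather than correlation alone, is the relevant check.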

Some needs for future research in visual disease assessment, RGB and HSI image analysis
This section poses specific questions and issues that need to be addressed through research. It is not intended to be exhaustive, but to suggest some important avenues for future study.
Visual severity estimation
When dealing with multiple raters, some individual or environment-related sources of error that may affect accuracy remain unknown:
- Do raters' characteristics such as information processing speed (reflective or impulsive) affect accuracy?
- Does the environment (heat, cold, etc.) affect accuracy of estimates?
We need to continue to optimize quantitative ordinal scales and SAD design to ensure that accuracy is maximized:
- Are there ordinal scales applicable to different pathosystems, regardless of severity range?
- How do we design SADs for diseases with different characteristics (lesion size, shape, color, etc.)?
- Does the number of diagrams in a SAD set affect severity estimates?
- Is it possible to develop a few generic SADs to cover the range of leaf types and diseases that have to be assessed?
- Is one SAD representative of a percent sufficient as a reference diagram?
- How can instruction be performed to maximize accuracy?
- What kind of training is most appropriate?
- Must it be in the specific pathosystem?
- Should training use actual photographs of the target disease, computer-generated images, or a combination of both?
RGB image analysis
Research is needed to determine if classification of severity using VIS image analysis and AI techniques provides the resolution and accuracy needed under field conditions.
- Can this be achieved using the 100% ratio scale?
- If ordinal type scales are used, how many classes are needed? How will that vary with pathosystem?
- How can RGB sensor-based systems penetrate the crop canopy where severity estimates of lower leaves might be required?
Databases of annotated images are needed for developing reliable and accurate automated systems based on AI:
- Is development of sufficient image databases for the numbers of diseases and crop combinations practical (true for both VIS and HSI/MSI image analysis)?
- If so, how best to coordinate the logistics of image acquisition?
Particularly for training using AI, systems need to be developed that do not need connectivity to a database:
- Can we develop more efficiently packaged mobile applications?
Explore further combining RGB with HSI/MSI or other techniques:
- Will this help maximize (and possibly synergize) information for accurate measurement of severity?
HSI/MSI and image processing
Several of the issues that affect RGB image analysis are common to HSI/MSI too (for example, databases of appropriately ground-truthed images for accurately measuring severity).
Ideally, hyperspectral signatures would be transferable across scales:
- Can we transfer discriminating hyperspectral signatures to different scales (leaf to plant to field scale) for different diseases?
- If so, are they effective for measuring severity in the variable field situation?
- If scalability is indeed practical for most diseases, how do we resolve the issue of proximal and distal sensing and resolution and still maintain accuracy of severity measurements (may not be an issue for detection)?
A major issue that remains is related to data quality:
- How do ground resolution, shadowing, crop motion and image capture influence accuracy of measurements?
- What standard is required for disease measurement?
- Are HSI/MSI measures based on disease development (metro maps, etc.) equally or more effective than traditional measures of severity using the percentage scale?
- Can more sophisticated mobile platforms or combinations of 3D sensors provide a method to resolve issues of architecture or hidden sampling units?
Intensive knowledge transfer is needed:
- What can we learn from other disciplines such as informatics, medicine, electrical engineering, etc.?
Acknowledgements ... on earlier versions of this article. His knowledge and expertise on the use of assessment scales for severity estimation is well-recognized, and his input on those sections was particularly insightful. AKM and DB would like to thank all group members and former group members of the INRES-Pflanzenkrankheiten, IfZ and partners for contributing to research on hyperspectral imaging for plant disease measurement. The article reports the results of research only. Mention of a trademark or proprietary product is solely for the purpose of providing specific information and does not constitute a guarantee or warranty of the product by the U.S. Department of Agriculture and does not imply its approval to the exclusion of other products that may also be suitable.
Authors' contributions
CHB led and coordinated the writing of the review, with emphasis on the section on visual disease assessment. EMD provided input on various sections including on SADs and visual disease assessment. JGAB led the section on VIS image analysis, and AKM led the section on HSI/MSI with input from DB. All authors coordinated writing of the introduction and conclusion sections. The author(s) read and approved the final manuscript.
Availability of data and materials
Not applicable.
Ethics approval and consent to participate
Not applicable (no human/animal subjects).

Consent for publication
Not applicable.