The GIZ Data Lab has partnered with the Sustainable Rice Platform on an experiment exploring ways to help fieldworkers automatically digitize and verify handwritten responses and comments in Thai farmer diaries.
Smallholder farmers in developing countries still largely suffer from low productivity and profitability. One way farmers can increase their yield and income is by tracking their work in so-called ‘farmer diaries’. These diaries collect information on agricultural practices and financial management, which is used to monitor programmatic impact and to offer tailored support through agricultural advice, financial services or marketing.
The Data Lab has joined forces with the Sustainable Rice Platform (SRP) projects to develop a data platform that allows for central storage and mining of farmer diaries. As a multi-stakeholder alliance, SRP promotes resource-use efficiency and climate change resilience in rice systems to help farmers attain better lives while also protecting the environment. Its vision is to ‘Feed the world. Sustainably.’
How data is currently collected
Currently, field staff collect the paper-based farmer diaries, manually check them for inconsistencies and errors, and personally clarify issues with the farmers. This step is particularly important to ensure data quality. The farmer diaries contain personal information such as name and gender, but also information on the farm property, agricultural practices, productivity and the use of agricultural inputs, among others. As such, the diaries are prone to spelling mistakes, blank fields and inconsistencies where fields require input in a specific format. After an initial verification, the staff manually digitize the diaries into Excel spreadsheets, which are in turn verified by two further staff members responsible for data.
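The manual checks described above (blank fields, entries in the wrong format) could in principle be expressed as simple rules. The following is only a minimal sketch of that idea, not the procedure used by field staff: the field names, the accepted gender values and the date format are all hypothetical and chosen purely for illustration.

```python
import re

# Hypothetical diary fields and format rules (illustrative assumptions only).
RULES = {
    "name": lambda v: True,  # any non-blank value is accepted
    "gender": lambda v: v.strip().lower() in {"male", "female", "other"},
    "farm_area": lambda v: re.fullmatch(r"\d+(\.\d+)?", v.strip()) is not None,
    "planting_date": lambda v: re.fullmatch(r"\d{4}-\d{2}-\d{2}", v.strip()) is not None,
}


def check_record(record):
    """Return a list of (field, problem) pairs for one digitized diary record."""
    issues = []
    for field, is_valid in RULES.items():
        value = record.get(field, "")
        if not value.strip():
            issues.append((field, "blank"))
        elif not is_valid(value):
            issues.append((field, "bad format"))
    return issues


# Invented example record: one badly formatted field, one blank field.
example = {
    "name": "Somchai",
    "gender": "male",
    "farm_area": "twelve",
    "planting_date": "",
}
print(check_record(example))
```

Records that return an empty list would pass these basic checks; anything else would still need to be clarified with the farmer in person.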
Can the digitization be conducted more efficiently?
Our experiment is looking for ways to enable field staff to automatically digitize and verify the paper diaries. To that end, the Data Lab has been looking into methodologies and tools (e.g. for Optical Character Recognition) that can adequately recognize handwritten answers and comments in an automated way. Finding a well-developed tool is especially challenging for the Thai language. So far, we have applied and tested different software solutions. Given a sample of 100 diaries (50 in English and 50 in Thai), the program detected over 90 per cent of the answers correctly when they were given as numbers or check boxes. Handwritten paragraphs in Thai, however, had an accuracy rate of less than 50 per cent. Nonetheless, combining solutions with third-party software made the results more accurate.
Given the low accuracy in detecting handwritten text, the question remains whether (semi-)automated character recognition is more efficient than manual procedures for paper-based surveys. This is a question that the Data Lab and SRP are currently exploring. The hope is that, even if the tool is not 100 per cent accurate, it will still substantially reduce the workload caused by manual diary verification. This is particularly the case if diaries are designed to mostly use check boxes, numbers or lists of predefined values or text to be filled in.
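Accuracy figures by answer type, like those quoted earlier in this section, can be obtained by comparing the recognized value of each field with a manually entered reference value. The sketch below shows one possible way to do this; it is an assumption about how such a comparison might be set up, not the evaluation actually used in the experiment, and the sample data is invented.

```python
from collections import defaultdict


def accuracy_by_type(fields):
    """Share of exactly matching fields per answer type.

    `fields` is an iterable of (field_type, recognized_value, reference_value).
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for field_type, recognized, reference in fields:
        total[field_type] += 1
        if recognized.strip() == reference.strip():
            correct[field_type] += 1
    return {t: correct[t] / total[t] for t in total}


# Invented sample: (answer type, recognized value, manually entered value).
sample = [
    ("checkbox", "yes", "yes"),
    ("checkbox", "no", "no"),
    ("number", "12", "12"),
    ("number", "17", "11"),
    ("text", "rice", "rice straw"),
]
print(accuracy_by_type(sample))
```

Exact string matching is a strict criterion for free text; a character-level measure might be more informative for handwritten paragraphs.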
Changes in the data processing stage
Besides re-engineering the digitization process, state-of-the-art statistical methods can improve data quality. These include data checks through outlier analysis and missing value imputation approaches. If adapted successfully, these methods could allow for real-time quality checks as new data is fed into the database of SRP projects. This is especially useful, as it accelerates the data processing and verification phase by automatically correcting data entries in real time and helps detect and flag suspicious observations in need of manual follow-up.
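As a rough illustration of outlier analysis combined with missing value imputation, the sketch below flags values outside the interquartile-range fences and fills blanks with the median. Both techniques are assumptions chosen for simplicity; the text does not state which statistical methods are used for SRP data, and the yield figures are invented.

```python
from statistics import median, quantiles


def flag_and_impute(values, k=1.5):
    """Return (cleaned, flags) for a list of numbers where None means missing.

    Missing values are replaced by the median of the observed values.
    Values outside [Q1 - k*IQR, Q3 + k*IQR] are kept but flagged for follow-up.
    """
    observed = [v for v in values if v is not None]
    if len(observed) < 4:
        # Too few observations to estimate quartiles; only mark the gaps.
        flags = ["missing" if v is None else "ok" for v in values]
        return list(values), flags

    q1, _, q3 = quantiles(observed, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    fill = median(observed)

    cleaned, flags = [], []
    for v in values:
        if v is None:
            cleaned.append(fill)
            flags.append("imputed")
        elif v < low or v > high:
            cleaned.append(v)
            flags.append("outlier")
        else:
            cleaned.append(v)
            flags.append("ok")
    return cleaned, flags


# Invented yields: one blank entry and one implausibly large value.
yields = [4.0, 4.5, None, 5.0, 4.2, 40.0, 4.8]
cleaned, flags = flag_and_impute(yields)
print(list(zip(cleaned, flags)))
```

In this sketch outliers are flagged rather than overwritten, so that a suspicious observation can still be checked manually before any correction is made.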
Scaling opportunities
While we currently work with SRP projects in Thailand, the methods explored in both the data sourcing and processing stages may be applicable at scale in other regions and contexts. Any data collection exercise in low-resource environments faces similar data quality challenges. As our methods do not depend on sector- or location-specific content, they can be customized for many other projects outside of the SRP context.
Source: GIZ Data Lab, Deutsche Gesellschaft für Internationale Zusammenarbeit (GIZ), 09 March 2021