Full paper “Human-in-the-loop Regular Expression Extraction for Single Column Format Inconsistency” by Shaochen Yu, Lei Han, Marta Indulska, Shazia Sadiq and Gianluca Demartini accepted at TheWebConf WWW2023
We propose a novel hybrid human-machine system that leverages crowdsourcing to address syntactic format inconsistencies effectively and cost-efficiently. We first ask crowd workers to select training examples for our inference algorithm through data selection and result validation. We then use a novel rule-based learning algorithm to infer a regular expression that addresses the format inconsistency issues in a given structured dataset. The inferred regular expression can then be applied to the entire dataset to find further inconsistency issues, so experts are no longer required to write the regular expressions.
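The final step, applying a regular expression to an entire column, can be illustrated with a minimal sketch. This is not the inference algorithm from the paper: the regular expression below is written by hand as a stand-in for an inferred one, and the column values, the assumed YYYY-MM-DD target format, and the function name `find_format_inconsistencies` are all hypothetical.

```python
import re


def find_format_inconsistencies(column, pattern):
    """Return (index, value) pairs whose value does not fully match the pattern."""
    regex = re.compile(pattern)
    return [(i, v) for i, v in enumerate(column) if regex.fullmatch(v) is None]


# Hypothetical date column; the expected format is assumed to be YYYY-MM-DD.
dates = ["2023-01-15", "2023-02-01", "15/03/2023", "2023-4-9", "2023-05-20"]

# Hand-written stand-in for a regular expression that the system would infer.
expected_format = r"\d{4}-\d{2}-\d{2}"

issues = find_format_inconsistencies(dates, expected_format)
print(issues)  # [(2, '15/03/2023'), (3, '2023-4-9')]
```

Using `fullmatch` rather than `search` means a value is flagged unless the whole cell conforms to the format, which is usually what a single-column consistency check needs.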