The Web Conf 2023 – Accepted Paper from CIRES

The full paper “Human-in-the-loop Regular Expression Extraction for Single Column Format Inconsistency” by Shaochen Yu, Lei Han, Marta Indulska, Shazia Sadiq, and Gianluca Demartini has been accepted at The Web Conference 2023 (WWW 2023).

We propose a novel hybrid human-machine system that uses crowdsourcing to address syntactic format inconsistencies effectively and cost-efficiently. Crowd workers first select training examples for our inference algorithm through two tasks: data selection and result validation. A novel rule-based learning algorithm then infers a regular expression that captures the format inconsistencies in a given structured dataset. Applying the inferred regular expression to the entire dataset uncovers further inconsistencies, so experts are no longer needed to write regular expressions by hand.
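To make the idea concrete, here is a minimal Python sketch of inferring a regular expression from a few selected examples and applying it to a whole column. It is not the algorithm from the paper: the character-class generalisation rule, the function names (`char_class`, `infer_pattern`, `infer_regex`, `find_matching`), and the sample dates are all assumptions made for illustration, and the sketch further assumes the selected examples are values in the inconsistent format.

```python
import re
from itertools import groupby


def char_class(ch: str) -> str:
    """Map one character to a coarse regex token (illustrative rule only)."""
    if "0" <= ch <= "9":
        return r"\d"
    if ch.isascii() and ch.isalpha():
        return "[A-Za-z]"
    # Punctuation, spaces and non-ASCII characters are kept as literals.
    return re.escape(ch)


def infer_pattern(example: str) -> str:
    """Generalise one example into a pattern, collapsing runs of equal tokens."""
    tokens = [char_class(ch) for ch in example]
    parts = []
    for token, run in groupby(tokens):
        n = len(list(run))
        parts.append(token if n == 1 else f"{token}{{{n}}}")
    return "".join(parts)


def infer_regex(examples: list[str]) -> re.Pattern:
    """Combine the distinct per-example patterns into one alternation."""
    patterns = sorted({infer_pattern(e) for e in examples})
    return re.compile("|".join(f"(?:{p})" for p in patterns))


def find_matching(column: list[str], examples: list[str]) -> list[str]:
    """Return every value in the column that has the same format as the examples."""
    regex = infer_regex(examples)
    return [value for value in column if regex.fullmatch(value)]


# Hypothetical column in which most dates use YYYY-MM-DD.
column = ["2023-05-01", "01/05/2023", "2022-01-09", "23/11/2020"]
# A single value flagged as inconsistent (assumed to come from a crowd worker).
flagged = find_matching(column, ["01/05/2023"])
print(flagged)  # ['01/05/2023', '23/11/2020']
```

One flagged example is enough here to surface the second value in the same format. A practical system would need a richer generalisation step (variable-length runs, optional parts) plus the validation loop described above.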