Data Profiling: Extracting Metadata from Tables

Abstract: Data profiling comprises a broad range of methods to efficiently extract metadata from a given dataset, including data types and value patterns, keys and foreign keys, and other data dependencies. This research area has recently thrived, due to (i) its simple problem statements, such as “discover all key candidates”, paired with the high computational complexity of the problems, (ii) the manifold opportunities for algorithmic improvements, such as apriori-inspired pruning or data sampling, and (iii) the many application areas for data profiling results, such as query optimization and data cleaning. Accordingly, the tutorial-style talk will be divided into three parts. After a motivation and an overview of the field covering some basic data structures and methods, we will examine several concrete dependency discovery algorithms. Finally, we will highlight several application areas for data profiling results, including information integration, data cleaning, and query optimization.
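
To make the problem statement “discover all key candidates” concrete, the following is a minimal sketch, not an algorithm from the talk. It searches column combinations level by level and skips any combination that contains an already discovered key, which is a simplified form of the apriori-inspired pruning mentioned above. The function names, the list-of-dicts table layout, and the sample rows are all assumptions made for illustration.

```python
from itertools import combinations


def is_unique(rows, cols):
    """Return True if no two rows share the same values on cols."""
    seen = set()
    for row in rows:
        key = tuple(row[c] for c in cols)
        if key in seen:
            return False
        seen.add(key)
    return True


def minimal_key_candidates(rows, columns):
    """Level-wise search for minimal unique column combinations.

    Hypothetical helper: a combination that contains a known key is
    skipped, because it is unique by definition but not minimal.
    """
    keys = []
    for size in range(1, len(columns) + 1):
        for combo in combinations(columns, size):
            if any(set(k) <= set(combo) for k in keys):
                continue  # pruned: superset of a known key
            if is_unique(rows, combo):
                keys.append(combo)
    return keys


# Made-up sample table, for illustration only.
rows = [
    {"id": 1, "first": "Ann", "last": "Lee", "city": "Berlin"},
    {"id": 2, "first": "Ann", "last": "Kim", "city": "Potsdam"},
    {"id": 3, "first": "Bob", "last": "Lee", "city": "Potsdam"},
    {"id": 4, "first": "Bob", "last": "Kim", "city": "Berlin"},
]
columns = ["id", "first", "last", "city"]

print(minimal_key_candidates(rows, columns))
# [('id',), ('first', 'last'), ('first', 'city'), ('last', 'city')]
```

Even with pruning, the number of candidate combinations grows exponentially with the number of columns in the worst case, which illustrates the computational complexity the abstract refers to. Practical algorithms would also avoid rescanning the full table for every candidate.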

Felix Naumann

Felix Naumann studied mathematics, economics, and computer science at the University of Technology in Berlin and completed his PhD thesis in the area of data quality at Humboldt University of Berlin in 2000. After a postdoc position at the IBM Almaden Research Center working on data integration topics, he became an assistant professor for information integration in 2003, again at Humboldt University of Berlin. Since 2006 he has held the chair for Information Systems at the Hasso Plattner Institute (HPI) at the University of Potsdam in Germany. He has been a visiting researcher at QCRI, AT&T Research, IBM Research, and SAP. His research interests include data profiling, data quality and cleansing, and data integration, recorded in over 200 scientific publications. In addition to numerous PC memberships for international conferences, he has organized several conferences in various roles, including VLDB 2021 as PC co-chair, and he is a trustee of the VLDB Endowment.