SEMINAR: Data Valuation for Tabular Data Analysis
Speaker: Zirui Tan. This presentation explores how to measure the value of individual data points in tabular datasets from a machine learning perspective.
Please join us for the PhD Progress Review 1 Confirmation seminar of CIRES HDR candidate Zirui Tan.
Abstract: Tabular data is the most common format for structured information in business and operational systems. Each row typically represents a user, transaction, or record, and carries potential commercial value, for example in advertising systems such as Google Ads. Traditionally, such value has been defined by business impact. However, as machine learning becomes integral to decision-making, the value of tabular data is increasingly defined by its contribution to model performance.
This research explores how to measure the value of individual data points in tabular datasets from a machine learning perspective. We focus on the data Shapley value, a game-theoretic method that quantifies each data point’s marginal contribution to model utility. Several approximation techniques (e.g., truncated Monte Carlo and gradient-based methods) have been developed to reduce the high computational cost of exact Shapley value computation, but these methods largely assume a static and complete dataset. In practice, real-world tabular datasets often present more diverse conditions, such as evolving features and incomplete records. In such cases, existing Shapley value approximations are inefficient or inapplicable.
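For readers unfamiliar with the technique, the truncated Monte Carlo approximation mentioned above can be sketched roughly as follows. This is a minimal illustration and not the speaker's implementation: the function name `tmc_data_shapley`, the `utility` callback, the toy 1-D dataset, and the 1-nearest-neighbour scorer are all assumptions made up for this example.

```python
import random


def tmc_data_shapley(n_points, utility, n_permutations=200, tolerance=1e-3, seed=0):
    """Estimate data Shapley values by truncated Monte Carlo permutation sampling.

    `utility(indices)` is a hypothetical callback: it should train a model on the
    given training indices and return a validation score.
    """
    rng = random.Random(seed)
    values = [0.0] * n_points
    full_score = utility(list(range(n_points)))
    empty_score = utility([])
    for t in range(1, n_permutations + 1):
        perm = list(range(n_points))
        rng.shuffle(perm)
        subset, prev_score = [], empty_score
        for idx in perm:
            if abs(full_score - prev_score) < tolerance:
                # Truncation: the score is already close to the full-data score,
                # so the remaining points are treated as contributing nothing.
                marginal = 0.0
            else:
                subset.append(idx)
                new_score = utility(subset)
                marginal = new_score - prev_score
                prev_score = new_score
            # Running mean of this point's marginal contribution.
            values[idx] += (marginal - values[idx]) / t
    return values


# Toy 1-D data, invented for illustration: (feature, label).
train = [(0.0, 0), (0.2, 0), (0.9, 1), (1.1, 1), (0.1, 1)]
valid = [(0.02, 0), (0.22, 0), (0.95, 1), (1.05, 1)]


def one_nn_accuracy(indices):
    """Validation accuracy of a 1-nearest-neighbour classifier fit on train[indices]."""
    if not indices:
        return 0.5  # assumed chance-level score for an empty training set
    correct = 0
    for x, y in valid:
        nearest = min(indices, key=lambda i: abs(train[i][0] - x))
        correct += train[nearest][1] == y
    return correct / len(valid)


# tolerance=0.0 disables truncation, so the estimates sum to
# utility(all points) - utility(no points).
values = tmc_data_shapley(len(train), one_nn_accuracy, n_permutations=500, tolerance=0.0)
for (x, y), v in zip(train, values):
    print(f"x={x:.1f} label={y} estimated value={v:+.3f}")
```

Each permutation requires up to one model fit per data point, which illustrates why the cost grows quickly on large tabular datasets and why the whole estimate has to be recomputed when the feature set changes.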
Our first work addresses this gap by proposing a Shapley value-based coreset selection method for efficient model adaptation in feature-incremental tabular learning. We show that feature expansion can significantly alter the contribution of individual data points, and that tracking these changes enables the selection of a small but highly informative subset. This supports a data-centric approach to cost-effective model updating for large-scale tabular datasets with complex model architectures. Our lightweight regression pipeline efficiently approximates Shapley value changes while preserving the relative importance ranking across data instances.
Our future work will focus on developing efficient and adaptive data valuation methods to address real-world challenges such as missing values and shifting distributions in tabular data.
Bio: Zirui Tan is a PhD candidate at the ARC Centre for Information Resilience (CIRES) within the School of EECS, UQ. She received her B. Actuarial Studies and B. Information Technology (Honours) from the Australian National University. Her primary research interests are tabular data analysis and data valuation.