Research Insight: Budgeted Causal Effect Estimation
Two excellent publications & research insights from the CIRES team at The University of Queensland including PhD researcher Hechuan Wen, CIs Dr. Rocky Chen, Prof. Hongzhi Yin, and Centre Director Prof. Shazia Sadiq, and colleagues. Thank you to our partners Dr. Li Kheng Chai and Health and Wellbeing Queensland for supporting this work.
Delighted to share our two works accepted in FY2024–2025: “Progressive Generalization Risk Reduction for Data-Efficient Causal Effect Estimation” and “Enhancing Treatment Effect Estimation via Active Learning: A Counterfactual Covering Perspective”, accepted at KDD’25 and ICML’25, respectively.
Together with my supervisors and collaborators, Dr. Rocky Chen, Dr. Li Kheng Chai, Dr. Guanhua Ye, A/Prof. Mingming Gong, Prof. Hongzhi Yin, and Prof. Shazia Sadiq, we study the theoretical foundations of budgeted causal effect estimation and propose a simple yet effective data acquisition scheme that “valuates” the unlabelled data and prioritizes spending the budget on labelling the most informative data. Huge thanks to the ARC Training Centre for Information Resilience (CIRES), Health and Wellbeing Queensland, and The University of Queensland for supporting this work!
Read the papers: https://lnkd.in/gH4HJwg7 and https://lnkd.in/gp_7xtGj.
What we did
We identify optimizable quantities through rigorous theoretical analysis. These quantities serve as guidelines to “valuate” the unlabelled data points and make budget spending more efficient. In other words, given a vast unlabelled data pool, the labelling budget can be spent as effectively as possible when building up a dataset for causal effect estimation.
Key Insights
► The most valuable unlabelled pair (one control unit and one treated unit) to acquire is the one with the highest estimation variance and the smallest distance between its two members.
► The overall estimation risk, which cannot be calculated directly, can be bounded indirectly by computable terms, namely the factual and counterfactual covering radii. This gives a theoretical grounding for unlabelled data valuation and selection.
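To make the two insights above more concrete, here is a minimal sketch in Python. It is not the algorithm from the papers: it assumes Euclidean distance between covariate vectors, a generic covering radius (the largest distance from any pool point to its nearest labelled point), and a hypothetical pair score of summed variance divided by distance. The function names covering_radius and best_pair, and the eps parameter, are illustrative only.

```python
import numpy as np


def covering_radius(pool, labelled):
    """Largest distance from any pool point to its nearest labelled point.

    Assumption: plain Euclidean distance on the covariates; the papers'
    exact factual/counterfactual definitions are not reproduced here.
    """
    # Pairwise distances, shape (n_pool, n_labelled)
    dists = np.linalg.norm(pool[:, None, :] - labelled[None, :, :], axis=-1)
    # Nearest labelled point per pool point, then the worst case over the pool
    return float(dists.min(axis=1).max())


def best_pair(x_control, x_treated, var_control, var_treated, eps=1e-8):
    """Pick the (control, treated) index pair with the highest score.

    Hypothetical score: (variance_i + variance_j) / (distance_ij + eps),
    so high-variance, close-together pairs are preferred.
    """
    dists = np.linalg.norm(
        x_control[:, None, :] - x_treated[None, :, :], axis=-1
    )
    score = (var_control[:, None] + var_treated[None, :]) / (dists + eps)
    i, j = np.unravel_index(np.argmax(score), score.shape)
    return int(i), int(j)
```

In this sketch, labelling more points can only shrink the covering radius, which is one way to read the idea of bounding the risk by a computable quantity. The variance inputs would come from whatever uncertainty estimate the model provides; that choice is left open here.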
Looking Forward
Running the proposed algorithm on a very large unlabelled pool incurs considerable computational cost, so improving the algorithm’s scalability is a worthwhile direction for future work.
What opportunities or risks do you see in building up the dataset for model training from scratch?