{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,22]],"date-time":"2026-03-22T12:23:05Z","timestamp":1774182185971,"version":"3.50.1"},"reference-count":85,"publisher":"Association for Computing Machinery (ACM)","issue":"8","content-domain":{"domain":["dl.acm.org"],"crossmark-restriction":true},"short-container-title":["Proc. VLDB Endow."],"published-print":{"date-parts":[[2025,4]]},"abstract":"<jats:p>Data-centric machine learning (ML) pipelines extend traditional ML pipelines\u2014of feature transformations, hyper-parameter tuning, and model training\u2014by additional pre-processing steps for data cleaning, data augmentation, and feature engineering to create high-quality data with good coverage. Finding effective data-centric ML pipelines is still a labor- and compute-intensive process though. While AutoML tools use effective search strategies, they struggle to scale with large datasets. Large language models (LLMs) show promise for code generation but face challenges in generating data-centric ML pipelines due to private datasets not seen during training, complex pre-processing requirements, and the need for mitigating hallucinations. These demands exceed typical code generation as it requires actions tailored to the characteristics and requirements of a particular dataset. This paper introduces CatDB, a comprehensive, LLM-based system for generating effective, error-free, and efficient data-centric ML pipelines. CatDB leverages data catalog information and refined metadata to dynamically create dataset-specific rules (instructions) to guide the LLM. Moreover, CatDB includes a robust mechanism for automatic validation and error handling of the generated pipeline. 
Our experimental results show that CatDB reliably generates effective ML pipelines across diverse datasets, achieving accuracy comparable to or better than existing LLM-based systems, standalone AutoML tools, and combined workflows of data cleaning and AutoML tools, while delivering up to orders of magnitude faster performance on large datasets.<\/jats:p>","DOI":"10.14778\/3742728.3742754","type":"journal-article","created":{"date-parts":[[2025,9,3]],"date-time":"2025-09-03T13:32:53Z","timestamp":1756906373000},"page":"2639-2652","update-policy":"https:\/\/doi.org\/10.1145\/crossmark-policy","source":"Crossref","is-referenced-by-count":1,"title":["CatDB: Data-Catalog-Guided, LLM-Based Generation of Data-Centric ML Pipelines"],"prefix":"10.14778","volume":"18","author":[{"given":"Saeed","family":"Fathollahzadeh","sequence":"first","affiliation":[{"name":"Concordia University"}]},{"given":"Essam","family":"Mansour","sequence":"additional","affiliation":[{"name":"Concordia University"}]},{"given":"Matthias","family":"Boehm","sequence":"additional","affiliation":[{"name":"Technische Universit\u00e4t Berlin"}]}],"member":"320","published-online":{"date-parts":[[2025,9,3]]},"reference":[{"key":"e_1_2_1_1_1","unstructured":"2021. State of Data Science and Machine Learning. https:\/\/www.kaggle.com\/kaggle-survey-2021"},{"key":"e_1_2_1_2_1","unstructured":"2024. Artificial Analysis. https:\/\/artificialanalysis.ai\/models\/mixtral-8x7b-instruct\/providers#summary"},{"key":"e_1_2_1_3_1","unstructured":"2024. ast: Abstract Syntax Trees. https:\/\/docs.python.org\/3\/library\/ast.html"},{"key":"e_1_2_1_4_1","unstructured":"2024. Google AI Studio. https:\/\/aistudio.google.com\/"},{"key":"e_1_2_1_5_1","unstructured":"2024. Groq Cloud. https:\/\/console.groq.com\/"},{"key":"e_1_2_1_6_1","unstructured":"2024. OpenAI. https:\/\/platform.openai.com\/"},{"key":"e_1_2_1_7_1","unstructured":"2024. Scikit-learn: Machine Learning in Python. 
https:\/\/scikit-learn.org\/stable\/"},{"key":"e_1_2_1_8_1","unstructured":"2024. TransmogrifAI: Automated Machine Learning for Structured Data. https:\/\/github.com\/salesforce\/TransmogrifAI"},{"key":"e_1_2_1_9_1","doi-asserted-by":"publisher","DOI":"10.1145\/3035918.3054772"},{"key":"e_1_2_1_10_1","unstructured":"AI@Meta. 2024. Llama 3 Model Card. (2024). https:\/\/github.com\/meta-llama\/llama3\/blob\/main\/MODEL_CARD.md"},{"key":"e_1_2_1_11_1","unstructured":"Apache Atlas. 2020. Open Metadata Management and Governance. https:\/\/atlas.apache.org\/."},{"key":"e_1_2_1_12_1","doi-asserted-by":"publisher","DOI":"10.1145\/3661826"},{"key":"e_1_2_1_13_1","doi-asserted-by":"publisher","DOI":"10.1145\/3097983.3098021"},{"key":"e_1_2_1_14_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-3-030-62466-8_41"},{"key":"e_1_2_1_15_1","doi-asserted-by":"publisher","DOI":"10.1145\/3308558.3313602"},{"key":"e_1_2_1_16_1","volume-title":"CIDR","author":"Boehm Matthias","unstructured":"Matthias Boehm, Iulian Antonov, Sebastian Baunsgaard, Mark Dokter, Robert Ginth\u00f6r, Kevin Innerebner, Florijan Klezin, Stefanie N. Lindstaedt, Arnab Phani, Benjamin Rath, Berthold Reinwald, Shafaq Siddiqui, and Sebastian Benjamin Wrede. 2020. SystemDS: A Declarative Machine Learning System for the End-to-End Data Science Lifecycle. In CIDR. http:\/\/cidrdb.org\/cidr2020\/papers\/p22-boehm-cidr20.pdf"},{"key":"e_1_2_1_17_1","doi-asserted-by":"publisher","DOI":"10.1145\/3308558.3313685"},{"key":"e_1_2_1_18_1","volume-title":"Brown et al","author":"Tom","year":"2020","unstructured":"Tom B. Brown et al. 2020. Language Models are Few-Shot Learners. In NeurIPS. 
https:\/\/proceedings.neurips.cc\/paper\/2020\/hash\/1457c0d6bfcb4967418bfb8ac142f64a-Abstract.html"},{"key":"e_1_2_1_19_1","doi-asserted-by":"publisher","DOI":"10.1016\/J.COMPELECENG.2013.11.024"},{"key":"e_1_2_1_20_1","doi-asserted-by":"publisher","DOI":"10.1561\/1900000006"},{"key":"e_1_2_1_21_1","doi-asserted-by":"publisher","DOI":"10.14778\/1687553.1687576"},{"key":"e_1_2_1_22_1","doi-asserted-by":"publisher","DOI":"10.1109\/CVPR.2019.00020"},{"key":"e_1_2_1_23_1","first-page":"42","article-title":"Validating Data and Models in Continuous ML Pipelines","volume":"44","author":"Dreves Mike","year":"2021","unstructured":"Mike Dreves, Gene Huang, Zhuo Peng, Neoklis Polyzotis, Evan Rosen, and Paul Suganthan G. C. 2021. Validating Data and Models in Continuous ML Pipelines. IEEE Data Eng. Bull. 44, 1 (2021), 42\u201350. http:\/\/sites.computer.org\/debull\/A21mar\/p42.pdf","journal-title":"IEEE Data Eng. Bull."},{"key":"e_1_2_1_24_1","volume-title":"Smola","author":"Erickson Nick","year":"2020","unstructured":"Nick Erickson, Jonas Mueller, Alexander Shirkov, Hang Zhang, Pedro Larroy, Mu Li, and Alexander J. Smola. 2020. AutoGluon-Tabular: Robust and Accurate AutoML for Structured Data. CoRR abs\/2003.06505 (2020). https:\/\/arxiv.org\/abs\/2003.06505"},{"key":"e_1_2_1_25_1","volume-title":"Hands-free AutoML via Meta-Learning. J. Mach. Learn. Res. 23","author":"Feurer Matthias","year":"2022","unstructured":"Matthias Feurer, Katharina Eggensperger, Stefan Falkner, Marius Lindauer, and Frank Hutter. 2022. Auto-Sklearn 2.0: Hands-free AutoML via Meta-Learning. J. Mach. Learn. Res. 23 (2022). http:\/\/jmlr.org\/papers\/v23\/21-0992.html"},{"key":"e_1_2_1_26_1","volume-title":"Hyperparameter optimization. Automated machine learning: Methods, systems, challenges","author":"Feurer Matthias","year":"2019","unstructured":"Matthias Feurer and Frank Hutter. 2019. Hyperparameter optimization. 
Automated machine learning: Methods, systems, challenges (2019), 3\u201333."},{"key":"e_1_2_1_27_1","volume-title":"Manuel Blum, and Frank Hutter.","author":"Feurer Matthias","year":"2015","unstructured":"Matthias Feurer, Aaron Klein, Katharina Eggensperger, Jost Tobias Springenberg, Manuel Blum, and Frank Hutter. 2015. Efficient and Robust Automated Machine Learning. In NeurIPS. 2962\u20132970. https:\/\/proceedings.neurips.cc\/paper\/2015\/hash\/11d0e6287202fced83f79975ec59a3a6-Abstract.html"},{"key":"e_1_2_1_28_1","first-page":"1289","article-title":"An Extensive Empirical Study of Feature Selection Metrics for Text Classification","volume":"3","author":"Forman George","year":"2003","unstructured":"George Forman. 2003. An Extensive Empirical Study of Feature Selection Metrics for Text Classification. Journal of Machine Learning Research 3 (2003), 1289\u20131305. http:\/\/jmlr.org\/papers\/v3\/forman03a.html","journal-title":"Journal of Machine Learning Research"},{"key":"e_1_2_1_29_1","doi-asserted-by":"publisher","DOI":"10.48550\/ARXIV.2308.15363"},{"key":"e_1_2_1_30_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICDE55515.2023.00303"},{"key":"e_1_2_1_31_1","doi-asserted-by":"publisher","DOI":"10.1007\/978-981-99-7022-3_23"},{"key":"e_1_2_1_32_1","doi-asserted-by":"publisher","DOI":"10.1145\/2882903.2903730"},{"key":"e_1_2_1_33_1","doi-asserted-by":"publisher","DOI":"10.1109\/IJCNN.2008.4633969"},{"key":"e_1_2_1_34_1","doi-asserted-by":"crossref","unstructured":"Mossad Helali Essam Mansour Ibrahim Abdelaziz and et al. 2022. A Scalable AutoML Approach Based on Graph Neural Networks. PVLDB 15 11 (2022). https:\/\/www.vldb.org\/pvldb\/vol15\/p2428-helali.pdf","DOI":"10.14778\/3551793.3551804"},{"key":"e_1_2_1_35_1","doi-asserted-by":"crossref","unstructured":"Mossad Helali Niki Monjazeb Shubham Vashisth Philippe Carrier Ahmed Helal Antonio Cavalcante Khaled Ammar Katja Hose and Essam Mansour. 2024. 
KGLiDS: A Platform for Semantic Abstraction Linking and Automation of Data Science. In ICDE.","DOI":"10.1109\/ICDE60146.2024.00021"},{"key":"e_1_2_1_36_1","volume-title":"The Eleventh International Conference on Learning Representations. https:\/\/openreview.net\/forum?id=cp5PvcI6w8_","author":"Hollmann Noah","year":"2023","unstructured":"Noah Hollmann, Samuel M\u00fcller, Katharina Eggensperger, and Frank Hutter. 2023. TabPFN: A Transformer That Solves Small Tabular Classification Problems in a Second. In The Eleventh International Conference on Learning Representations. https:\/\/openreview.net\/forum?id=cp5PvcI6w8_"},{"key":"e_1_2_1_37_1","unstructured":"Noah Hollmann Samuel M\u00fcller and Frank Hutter. 2023. Large Language Models for Automated Data Science: Introducing CAAFE for Context-Aware Automated Feature Engineering. In NeurIPS. https:\/\/arxiv.org\/pdf\/2305.03403"},{"key":"e_1_2_1_38_1","doi-asserted-by":"publisher","DOI":"10.48550\/ARXIV.2305.18341"},{"key":"e_1_2_1_39_1","unstructured":"Michael I. Jordan. 2018. SysML: Perspectives and Challenges. In MLSys."},{"key":"e_1_2_1_40_1","doi-asserted-by":"publisher","DOI":"10.48550\/ARXIV.2311.14648"},{"key":"e_1_2_1_41_1","doi-asserted-by":"publisher","DOI":"10.5555\/3430915.3442426"},{"key":"e_1_2_1_42_1","doi-asserted-by":"publisher","DOI":"10.1145\/3588689"},{"key":"e_1_2_1_43_1","doi-asserted-by":"publisher","DOI":"10.1145\/3534678.3539454"},{"key":"e_1_2_1_44_1","first-page":"948","article-title":"ActiveClean: Interactive Data Cleaning For Statistical Modeling","volume":"9","author":"Krishnan Sanjay","year":"2016","unstructured":"Sanjay Krishnan, Jiannan Wang, Eugene Wu, Michael J. Franklin, and Ken Goldberg. 2016. ActiveClean: Interactive Data Cleaning For Statistical Modeling. PVLDB 9, 12 (2016), 948\u2013959. 
http:\/\/www.vldb.org\/pvldb\/vol9\/p948-krishnan.pdf","journal-title":"PVLDB"},{"key":"e_1_2_1_45_1","volume-title":"Hinton","author":"Krizhevsky Alex","year":"2012","unstructured":"Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. 2012. ImageNet Classification with Deep Convolutional Neural Networks. In NeurIPS. 1106\u20131114. https:\/\/proceedings.neurips.cc\/paper\/2012\/hash\/c399862d3b9d6b76c8436e924a68c45b-Abstract.html"},{"key":"e_1_2_1_46_1","volume-title":"Proceedings of the AutoML Workshop at ICML","volume":"2020","author":"LeDell Erin","year":"2020","unstructured":"Erin LeDell and Sebastien Poirier. 2020. H2o automl: Scalable automatic machine learning. In Proceedings of the AutoML Workshop at ICML, Vol. 2020. ICML."},{"key":"e_1_2_1_47_1","doi-asserted-by":"publisher","DOI":"10.1145\/3136625"},{"key":"e_1_2_1_48_1","doi-asserted-by":"publisher","DOI":"10.1109\/ICDE51399.2021.00009"},{"key":"e_1_2_1_49_1","doi-asserted-by":"publisher","DOI":"10.48550\/ARXIV.2406.09534"},{"key":"e_1_2_1_50_1","volume-title":"SMARTFEAT: Efficient Feature Construction through Feature-Level Foundation Model Interactions. In CIDR. www.cidrdb.org. https:\/\/www.cidrdb.org\/cidr2024\/papers\/p72-lin.pdf","author":"Lin Yin","year":"2024","unstructured":"Yin Lin, Bolin Ding, H. V. Jagadish, and Jingren Zhou. 2024. SMARTFEAT: Efficient Feature Construction through Feature-Level Foundation Model Interactions. In CIDR. www.cidrdb.org. https:\/\/www.cidrdb.org\/cidr2024\/papers\/p72-lin.pdf"},{"key":"e_1_2_1_51_1","doi-asserted-by":"publisher","DOI":"10.48550\/ARXIV.2405.04674"},{"key":"e_1_2_1_52_1","volume-title":"International Conference on Automated Machine Learning (PMLR)","volume":"224","author":"Lopez Roque","year":"2023","unstructured":"Roque Lopez, Raoni Louren\u00e7o, R\u00e9mi Rampin, Sonia Castelo, A\u00e9cio S. R. Santos, Jorge Henrique Piazentin Ono, Cl\u00e1udio T. Silva, and Juliana Freire. 2023. AlphaD3M: An Open-Source AutoML Library for Multiple ML Tasks. 
In International Conference on Automated Machine Learning (PMLR), Vol. 224. 22\/1\u201322. https:\/\/proceedings.mlr.press\/v224\/lopez23a.html"},{"key":"e_1_2_1_53_1","doi-asserted-by":"publisher","unstructured":"Mark Mazumder et al. 2022. DataPerf: Benchmarks for Data-Centric AI Development. CoRR abs\/2207.10062 (2022). 10.48550\/ARXIV.2207.10062","DOI":"10.48550\/ARXIV.2207.10062"},{"key":"e_1_2_1_54_1","doi-asserted-by":"publisher","DOI":"10.14778\/3401960.3401972"},{"key":"e_1_2_1_55_1","doi-asserted-by":"publisher","DOI":"10.14778\/3574245.3574258"},{"key":"e_1_2_1_56_1","doi-asserted-by":"publisher","DOI":"10.1007\/S00778-023-00820-1"},{"key":"e_1_2_1_57_1","doi-asserted-by":"publisher","DOI":"10.48550\/ARXIV.2311.13028"},{"key":"e_1_2_1_58_1","volume-title":"Moore","author":"Olson Randal S.","year":"2016","unstructured":"Randal S. Olson and Jason H. Moore. 2016. TPOT: A Tree-based Pipeline Optimization Tool for Automating Machine Learning, Vol. 64. JMLR.org. http:\/\/proceedings.mlr.press\/v64\/olson_tpot_2016.html"},{"key":"e_1_2_1_59_1","volume-title":"https:\/\/openai.com\/index\/hello-gpt-4o\/","author":"AI.","year":"2024","unstructured":"OpenAI. 2024. GPT-4o. (2024). https:\/\/openai.com\/index\/hello-gpt-4o\/"},{"key":"e_1_2_1_60_1","first-page":"4092","article-title":"Efficient Neural Architecture Search via Parameter Sharing","volume":"80","author":"Pham Hieu","year":"2018","unstructured":"Hieu Pham, Melody Y. Guan, Barret Zoph, Quoc V. Le, and Jeff Dean. 2018. Efficient Neural Architecture Search via Parameter Sharing. In ICML, Vol. 80. 4092\u20134101. 
http:\/\/proceedings.mlr.press\/v80\/pham18a.html","journal-title":"ICML"},{"key":"e_1_2_1_61_1","doi-asserted-by":"publisher","DOI":"10.1145\/3448016.3452788"},{"key":"e_1_2_1_62_1","first-page":"334","article-title":"Data Validation for Machine Learning","volume":"1","author":"Polyzotis Neoklis","year":"2019","unstructured":"Neoklis Polyzotis, Martin Zinkevich, Sudip Roy, Eric Breck, and Steven Whang. 2019. Data Validation for Machine Learning. In MLSys, Vol. 1. 334\u2013347. https:\/\/proceedings.mlsys.org\/paper_files\/paper\/2019\/file\/928f1160e52192e3e0017fb63ab65391-Paper.pdf","journal-title":"MLSys"},{"key":"e_1_2_1_63_1","doi-asserted-by":"publisher","DOI":"10.48786\/EDBT.2024.12"},{"key":"e_1_2_1_64_1","doi-asserted-by":"publisher","DOI":"10.48550\/ARXIV.2305.20015"},{"key":"e_1_2_1_65_1","doi-asserted-by":"publisher","unstructured":"Sergey Redyuk Zoi Kaoudi Volker Markl and Sebastian Schelter. 2021. Automating Data Quality Validation for Dynamic Data Ingestion. In EDBT. 61\u201372. 10.5441\/002\/EDBT.2021.07","DOI":"10.5441\/002\/EDBT.2021.07"},{"key":"e_1_2_1_66_1","doi-asserted-by":"publisher","DOI":"10.14778\/3461535.3463474"},{"key":"e_1_2_1_67_1","doi-asserted-by":"publisher","unstructured":"A\u00e9cio S. R. Santos Aline Bessa Fernando Chirigati Christopher Musco and Juliana Freire. 2021. Correlation Sketches for Approximate Join-Correlation Queries. In SIGMOD. 1531\u20131544. 10.1145\/3448016.3458456","DOI":"10.1145\/3448016.3458456"},{"key":"e_1_2_1_68_1","doi-asserted-by":"publisher","DOI":"10.14778\/3229863.3229867"},{"key":"e_1_2_1_69_1","volume-title":"AIDE: Human-Level Performance in Data Science Competitions. https:\/\/www.weco.ai\/blog\/technical-report","author":"Schmidt Dominik","year":"2024","unstructured":"Dominik Schmidt, Yuxiang Wu, and Zhengyao Jiang. 2024. AIDE: Human-Level Performance in Data Science Competitions. 
https:\/\/www.weco.ai\/blog\/technical-report"},{"key":"e_1_2_1_70_1","doi-asserted-by":"publisher","unstructured":"Vraj Shah Jonathan Lacanlale Premanand Kumar Kevin Yang and Arun Kumar. 2021. Towards Benchmarking Feature Type Inference for AutoML Platforms. In SIGMOD. 1584\u20131596. 10.1145\/3448016.3457274","DOI":"10.1145\/3448016.3457274"},{"key":"e_1_2_1_71_1","doi-asserted-by":"publisher","unstructured":"Vraj Shah Jonathan Lacanlale Premanand Kumar Kevin Yang and Arun Kumar. 2021. Towards Benchmarking Feature Type Inference for AutoML Platforms. In SIGMOD. 1584\u20131596. 10.1145\/3448016.3457274","DOI":"10.1145\/3448016.3457274"},{"key":"e_1_2_1_72_1","first-page":"1391","article-title":"How do Categorical Duplicates Affect ML? A New Benchmark and Empirical Analyses","volume":"17","author":"Shah Vraj","year":"2024","unstructured":"Vraj Shah, Thomas J. Parashos, and Arun Kumar. 2024. How do Categorical Duplicates Affect ML? A New Benchmark and Empirical Analyses. PVLDB 17, 6 (2024), 1391\u20131404. https:\/\/www.vldb.org\/pvldb\/vol17\/p1391-shah.pdf","journal-title":"PVLDB"},{"key":"e_1_2_1_73_1","doi-asserted-by":"publisher","unstructured":"Zeyuan Shang Emanuel Zgraggen Benedetto Buratti Ferdinand Kossmann Philipp Eichmann Yeounoh Chung Carsten Binnig Eli Upfal and Tim Kraska. 2019. Democratizing Data Science through Interactive Curation of ML Pipelines. In SIGMOD. 1171\u20131188. 10.1145\/3299869.3319863","DOI":"10.1145\/3299869.3319863"},{"key":"e_1_2_1_74_1","first-page":"3","volume-title":"Proc. ACM Manag. Data 1","author":"Siddiqi Shafaq","year":"2023","unstructured":"Shafaq Siddiqi, Roman Kern, and Matthias Boehm. 2023. SAGA: A Scalable Framework for Optimizing Data Cleaning Pipelines for Machine Learning Applications. Proc. ACM Manag. Data 1, 3 (2023), 218:1\u2013218:26. 10.1145\/3617338"},{"key":"e_1_2_1_75_1","doi-asserted-by":"publisher","unstructured":"Evan R. Sparks Ameet Talwalkar Daniel Haas Michael J. Franklin Michael I. 
Jordan and Tim Kraska. 2015. Automating model search for large scale machine learning. In SoCC. 368\u2013380. 10.1145\/2806777.2806945","DOI":"10.1145\/2806777.2806945"},{"key":"e_1_2_1_76_1","doi-asserted-by":"publisher","unstructured":"Hubert Tardieu. 2022. Role of Gaia-X in the European Data Space Ecosystem. In Designing Data Spaces: The Ecosystem Approach to Competitive Advantage. 41\u201359. 10.1007\/978-3-030-93975-5_4","DOI":"10.1007\/978-3-030-93975-5_4"},{"key":"e_1_2_1_77_1","doi-asserted-by":"publisher","unstructured":"Gemini Team. 2024. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. 10.48550\/arXiv.2403.05530","DOI":"10.48550\/arXiv.2403.05530"},{"key":"e_1_2_1_78_1","doi-asserted-by":"publisher","unstructured":"Saravanan Thirumuruganathan Nan Tang Mourad Ouzzani and AnHai Doan. 2020. Data Curation with Deep Learning. In EDBT. 277\u2013286. 10.5441\/002\/EDBT.2020.25","DOI":"10.5441\/002\/EDBT.2020.25"},{"key":"e_1_2_1_79_1","doi-asserted-by":"publisher","unstructured":"Chris Thornton Frank Hutter Holger H. Hoos and Kevin Leyton-Brown. 2013. Auto-WEKA: combined selection and hyperparameter optimization of classification algorithms. In SIGKDD. 847\u2013855. 10.1145\/2487575.2487629","DOI":"10.1145\/2487575.2487629"},{"key":"e_1_2_1_80_1","volume-title":"FLAML: A Fast and Lightweight AutoML Library. In MLSys. https:\/\/proceedings.mlsys.org\/paper\/2021\/hash\/92cc227532d17e56e07902b254dfad10-Abstract.html","author":"Wang Chi","year":"2021","unstructured":"Chi Wang, Qingyun Wu, Markus Weimer, and Erkang Zhu. 2021. FLAML: A Fast and Lightweight AutoML Library. In MLSys. 
https:\/\/proceedings.mlsys.org\/paper\/2021\/hash\/92cc227532d17e56e07902b254dfad10-Abstract.html"},{"key":"e_1_2_1_81_1","volume-title":"Gabrielle Appleton, Myles Axton, Arie Baak, Niklas Blomberg, Jan-Willem Boiten, Luiz Bonino da Silva Santos, Philip E Bourne, et al.","author":"Wilkinson Mark D","year":"2016","unstructured":"Mark D Wilkinson, Michel Dumontier, IJsbrand Jan Aalbersberg, Gabrielle Appleton, Myles Axton, Arie Baak, Niklas Blomberg, Jan-Willem Boiten, Luiz Bonino da Silva Santos, Philip E Bourne, et al. 2016. The FAIR Guiding Principles for scientific data management and stewardship. Scientific data 3 (2016). https:\/\/www.nature.com\/articles\/sdata201618"},{"key":"e_1_2_1_82_1","volume-title":"ICLR 2024 Workshop on Large Language Model (LLM) Agents. https:\/\/openreview.net\/forum?id=uAjxFFing2","author":"Wu Qingyun","year":"2024","unstructured":"Qingyun Wu, Gagan Bansal, Jieyu Zhang, Yiran Wu, Beibin Li, Erkang Zhu, Li Jiang, Xiaoyun Zhang, Shaokun Zhang, Jiale Liu, Ahmed Hassan Awadallah, Ryen W White, Doug Burger, and Chi Wang. 2024. AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation. In ICLR 2024 Workshop on Large Language Model (LLM) Agents. https:\/\/openreview.net\/forum?id=uAjxFFing2"},{"key":"e_1_2_1_83_1","doi-asserted-by":"crossref","unstructured":"Wenglei Wu Nicholas Kunz and Paula Branco. 2022. Imbalanced learning regression-a python package to tackle the imbalanced regression problem. In ECML PKDD. 645\u2013648.","DOI":"10.1007\/978-3-031-26422-1_48"},{"key":"e_1_2_1_84_1","doi-asserted-by":"publisher","DOI":"10.1145\/3698811"},{"key":"e_1_2_1_85_1","doi-asserted-by":"crossref","first-page":"295","DOI":"10.1016\/j.neucom.2020.07.061","article-title":"On hyperparameter optimization of machine learning algorithms: Theory and practice","volume":"415","author":"Yang Li","year":"2020","unstructured":"Li Yang and Abdallah Shami. 2020. On hyperparameter optimization of machine learning algorithms: Theory and practice. 
Neurocomputing 415 (2020), 295\u2013316. https:\/\/www.sciencedirect.com\/science\/article\/pii\/S0925231220311693","journal-title":"Neurocomputing"}],"container-title":["Proceedings of the VLDB Endowment"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/dl.acm.org\/doi\/pdf\/10.14778\/3742728.3742754","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,9,3]],"date-time":"2025-09-03T13:35:57Z","timestamp":1756906557000},"score":1,"resource":{"primary":{"URL":"https:\/\/dl.acm.org\/doi\/10.14778\/3742728.3742754"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,4]]},"references-count":85,"journal-issue":{"issue":"8","published-print":{"date-parts":[[2025,4]]}},"alternative-id":["10.14778\/3742728.3742754"],"URL":"https:\/\/doi.org\/10.14778\/3742728.3742754","relation":{},"ISSN":["2150-8097"],"issn-type":[{"value":"2150-8097","type":"print"}],"subject":[],"published":{"date-parts":[[2025,4]]},"assertion":[{"value":"2025-09-03","order":3,"name":"published","label":"Published","group":{"name":"publication_history","label":"Publication History"}}]}}