Benchmarking Prompt Architectures: A Quantitative Study of Contextual and Decomposed Prompting for Complex ETL Code Generation

Authors

Rajitha Gentyala
Frisco, United States

Keywords:

Prompt Engineering, Large Language Models, Data Engineering, ETL Pipelines, Code Generation, Benchmark Evaluation, Contextual Prompting, Chain-of-Thought, GPT-4, Software Quality

Synopsis

The integration of large language models (LLMs) into data engineering workflows promises significant acceleration in the development of Extract, Transform, and Load (ETL) pipelines. However, practical adoption of this paradigm is hindered by a lack of empirical rigor in evaluating the efficacy of different prompt design strategies, particularly for tasks of high complexity. While current pedagogical resources effectively teach the art of prompt crafting, they do not provide quantitative evidence on which prompt architectures yield the most reliable, accurate, and efficient code generation. This study addresses this critical gap by conducting a systematic, benchmark-driven evaluation of two advanced prompt-design paradigms, contextual prompting and decomposed prompting, against a baseline of generic, direct prompts for the generation of complex ETL logic. Our methodology constructs a novel benchmark suite of 50 challenging data transformation tasks, derived from real-world scenarios, including nested JSON flattening, hierarchical data aggregation, and time-series imputation with conditional logic. For each task, we generate code solutions using the state-of-the-art GPT-4 model across three distinct prompt architectures: a minimal direct prompt, a context-rich prompt incorporating domain-specific schema definitions and constraint annotations, and a decomposed prompt that breaks the problem into a sequenced chain of subtasks. Drawing inspiration from the human-AI collaboration framework explored by Wu et al. (2022) in "PromptChainer: Chaining Large Language Model Prompts through Visual Programming," we formalize the decomposed prompting strategy into a reproducible chain-of-thought process tailored for data engineering. Furthermore, we extend the evaluation principles for code-generating models discussed by Chen et al. (2021) in "Evaluating Large Language Models Trained on Code," by applying a multi-faceted assessment metric.
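To make the three architectures concrete, the sketch below instantiates each for a nested-JSON-flattening task of the kind the benchmark describes. All prompt wording, schema fields, and subtask phrasing here are invented for illustration; they are not drawn from the study's actual benchmark suite.

```python
# Hypothetical instantiation of the three prompt architectures compared in
# the study, for a nested-JSON-flattening task. All prompt text, field
# names, and subtask wording are illustrative assumptions.

TASK = "Flatten nested order records into one output row per line item."

# 1. Minimal direct prompt: the task statement alone.
direct_prompt = f"Write a Python function for this task: {TASK}"

# 2. Context-rich prompt: augments the task with schema definitions and
#    constraint annotations, which the study found mitigates schema
#    hallucinations in the generated code.
contextual_prompt = f"""Write a Python function for this task: {TASK}

Input schema (JSON):
  order_id: str
  customer: {{name: str, region: str}}
  items: list of {{sku: str, qty: int, price: float or null}}

Constraints:
  - qty is always >= 1; treat a null price as 0.0.
  - Emit one flat dict per item, prefixing customer fields with 'customer_'.
"""

# 3. Decomposed prompt: the same task broken into a sequenced chain of
#    subtasks, each sent as its own prompt with prior answers in context.
decomposed_chain = [
    "Step 1: List the fields present at each nesting level of the input.",
    "Step 2: Define the flat output schema, one row per line item.",
    "Step 3: Write a Python function that performs the flattening.",
    "Step 4: Extend the function to handle null prices and empty item lists.",
]
```

The chained variant trades extra tokens and round trips for explicit intermediate reasoning, which is exactly the latency-versus-correctness trade-off the study quantifies.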
Each generated code artifact is automatically evaluated for functional correctness through unit-test execution on sample datasets, for computational efficiency via runtime profiling, and for robustness through stress testing with edge-case data. Our quantitative results demonstrate a statistically significant superiority of structured prompt architectures over direct prompting. Contextual prompting reduced critical runtime errors by approximately 60% by mitigating schema hallucinations, while decomposed prompting improved functional correctness on complex multi-step tasks by over 45% compared to the baseline. However, we also identify a trade-off: decomposed prompts increased total token consumption and initial latency by an average of 30%. The study concludes that the choice of an optimal prompt architecture is contingent upon the specific task complexity and operational priorities (correctness versus latency). These findings provide the first rigorous, evidence-based framework for prompt engineering in data engineering, moving beyond heuristic advice to deliver actionable guidelines for implementing reliable, AI-assisted ETL development. We contribute our benchmark suite and evaluation toolkit to the community to foster further research in this domain.
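The three evaluation axes named above (unit-test correctness, runtime profiling, edge-case robustness) can be sketched as a minimal harness. This is an illustrative assumption of how such a harness might look, not the study's released toolkit; the `flatten_orders` function and all field names are hypothetical stand-ins for model-generated code.

```python
import time

def flatten_orders(order):
    """Hypothetical candidate under test: flatten one nested order record
    into per-line-item rows (stands in for LLM-generated ETL code)."""
    rows = []
    for item in order.get("items", []):
        rows.append({
            "order_id": order["order_id"],
            "customer_name": order["customer"]["name"],
            "sku": item["sku"],
            "qty": item["qty"],
            "price": item["price"] if item["price"] is not None else 0.0,
        })
    return rows

def evaluate(candidate, unit_cases, edge_cases):
    """Score a generated function on the study's three axes:
    functional correctness, runtime, and robustness."""
    # Functional correctness: fraction of unit cases producing the
    # expected output on sample data.
    passed = sum(candidate(inp) == expected for inp, expected in unit_cases)
    # Computational efficiency: wall-clock time over the unit inputs.
    start = time.perf_counter()
    for inp, _ in unit_cases:
        candidate(inp)
    runtime = time.perf_counter() - start
    # Robustness: fraction of edge-case inputs survived without raising.
    survived = 0
    for inp in edge_cases:
        try:
            candidate(inp)
            survived += 1
        except Exception:
            pass
    return {
        "correctness": passed / len(unit_cases),
        "runtime_s": runtime,
        "robustness": survived / len(edge_cases),
    }
```

A real harness would additionally sandbox the generated code before execution; that safety layer is omitted here for brevity.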

References

[1] M. Chen et al., “Evaluating Large Language Models Trained on Code,” arXiv preprint arXiv:2107.03374, 2021. [Online]. Available: https://arxiv.org/abs/2107.03374

[2] T. Wu et al., “PromptChainer: Chaining Large Language Model Prompts through Visual Programming,” in Proc. CHI Conf. Hum. Factors Comput. Syst. Extended Abstracts, New Orleans, LA, USA, 2022, pp. 1–7, doi: 10.1145/3491101.3519729.

[3] T. Scholak, N. Schucher, and D. Bahdanau, “PICARD: Parsing Incrementally for Constrained Auto-Regressive Decoding from Language Models,” in Proc. Conf. Empirical Methods Natural Lang. Process. (EMNLP), Punta Cana, Dominican Republic, 2021, pp. 9895–9901, doi: 10.18653/v1/2021.emnlp-main.779.

[4] J. Austin et al., “Program Synthesis with Large Language Models,” arXiv preprint arXiv:2108.07732, 2021. [Online]. Available: https://arxiv.org/abs/2108.07732

[5] J. Li et al., “BIRD: A Big Bench for Large-scale Database Grounded Text-to-SQL Evaluation,” in Proc. 37th Conf. Neural Inf. Process. Syst. (NeurIPS), New Orleans, LA, USA, 2023. [Online]. Available: https://arxiv.org/abs/2310.03255

[6] P. Liang et al., “Holistic Evaluation of Language Models,” arXiv preprint arXiv:2211.09110, 2022. [Online]. Available: https://arxiv.org/abs/2211.09110

ISCSITR-IJCSE

Published: June 5, 2025