arXiv:2406.11731

PerfCurator: Curating a large-scale dataset of performance bug-related commits from public repositories

Published on Jun 17, 2024

Authors:

Abstract

PerfCurator, a repository miner using PcBERT-KD, a parameter-efficient BERT model, constructs a large-scale dataset of performance bug-fix commits, enhancing data-driven detection systems.

AI-generated summary

Performance bugs challenge software development, degrading performance and wasting computational resources. Software developers invest substantial effort in addressing these issues. Curating these performance bugs can offer valuable insights to the software engineering research community, aiding in developing new mitigation strategies. However, there is no large-scale open-source performance bugs dataset available. To bridge this gap, we propose PerfCurator, a repository miner that collects performance bug-related commits at scale. PerfCurator employs PcBERT-KD, a 125M parameter BERT model trained to classify performance bug-related commits. Our evaluation shows PcBERT-KD achieves accuracy comparable to 7 billion parameter LLMs but with significantly lower computational overhead, enabling cost-effective deployment on CPU clusters. Utilizing PcBERT-KD as the core component, we deployed PerfCurator on a 50-node CPU cluster to mine GitHub repositories. This extensive mining operation resulted in the construction of a large-scale dataset comprising 114K performance bug-fix commits in Python, 217.9K in C++, and 76.6K in Java. Our results demonstrate that this large-scale dataset significantly enhances the effectiveness of data-driven performance bug detection systems.

View arXiv page View PDF Add to collection

Community

Upload images, audio, and videos by dragging in the text input, pasting, or clicking here.

Tap or paste here to upload images

· Sign up or log in to comment

Upvote

Models citing this paper 0

No model linking this paper

Cite arxiv.org/abs/2406.11731 in a model README.md to link it from this page.

Datasets citing this paper 0

No dataset linking this paper

Cite arxiv.org/abs/2406.11731 in a dataset README.md to link it from this page.

Spaces citing this paper 0

No Space linking this paper

Cite arxiv.org/abs/2406.11731 in a Space README.md to link it from this page.

Collections including this paper 0

No Collection including this paper

Add this paper to a collection to link it from this page.