Data And Material For: "Content Classification Of Development Emails"

This data and material support the paper "Content classification of development emails", published in the proceedings of the 34th International Conference on Software Engineering (ICSE 2012).

Every software system has a history. We find traces of a system's history in software repositories, which developers use when building and maintaining their systems. Each repository tells us a part of the history, from its perspective: issue repositories murmur of dark events involving defective and flawed entities; versioning system repositories narrate of restless artifacts and of classes that nobody would ever touch; mailing list archives report unexpected stories of developers’ interactions and opinions.

But... can we seriously trust these repositories? Can we just listen to what they tell us and behave accordingly? Many wise researchers have warmly warned us about the risks of showing such naive faith in data repositories: versioning system repositories might seduce us with enchanting stories of ever-changing entities, while in reality many of these entities merely change their makeup and keep their old behaviour; issue repositories might tell us only a partial truth about certain very special entities, or developers.

We agree with these researchers. Natural language documents in particular contain information in different languages, surrounded by much noise, and we must pay special attention when using them. We created MUCCA, a classification method to use when dealing with natural language documents. It recognizes source code fragments, patches, stack traces, noise, and natural language with high accuracy. In this way, it allows one to subsequently apply ad hoc analysis techniques that exploit the peculiarities of each category and extract reliable information. This Zenodo upload supports the paper that describes our work on this topic.

1. Source code & Virtual Image

MUCCA is written in Cincom VisualWorks Smalltalk and is composed of several components. You can download the source code of the following MUCCA components from this upload (mucca-source_code folder):

- Miler2, the core of MUCCA, including metamodels, importers, classification engine, etc.
- MailPeek, our web application for the manual classification of email content
- PetitIsland, our grammar to generate island parsers
- PetitJava, the grammar of Java, which we implemented for PetitParser
- PetitSTrace, our island parser for Java stack traces

Note that, to run Miler2, you will also need the following external Smalltalk components: Moose, Glorp, Seaside, TwoFlower, MetaDB, PetitParser. In addition, we use the Weka workbench for the machine learning tasks.

You can download the two trained classifiers that compose MUCCA (mucca-classifiers folder):

- Naive Bayes based classifier (classifier1-nb.model)
- Decision Tree based classifier (classifier2-dt.model)

Alternatively, we created a VirtualBox image with a pre-configured VisualWorks environment, which includes all the MUCCA components and prerequisites (both Smalltalk and Java): MUCCA.ova (both user and password are muccauser).

2. Benchmark

To train machine-learning classifiers and evaluate the effectiveness of the different approaches, we manually created a benchmark in which emails are classified at character granularity. Given the time and effort needed to create such a benchmark, we humbly think it is a valuable contribution to the community. With the help of this benchmark, other researchers can reproduce our experiments and devise new classification methods, which can be immediately compared to ours. You can download the dataset from the GitHub repository (a dump of the GitHub repository is uploaded here: benchmark/githubDump.zip), or download the full database dump in PostgreSQL format (benchmark/benchmarkDump.tar.bz2).
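
To make "classified at character granularity" concrete, here is a minimal Python sketch of how such annotations could be represented and compared. It is an illustration only: it does not reflect the actual schema of the benchmark database or MUCCA's Smalltalk implementation. The category identifiers, the (start, end, category) span format, and the function names are all assumptions introduced for this example.

```python
from collections import Counter

# Hypothetical identifiers for the five content types named in the description.
CATEGORIES = ("natural_language", "source_code", "patch", "stack_trace", "noise")


def spans_to_labels(length, spans, default="natural_language"):
    """Expand (start, end, category) spans into one label per character.

    `end` is exclusive; characters not covered by any span get `default`.
    """
    labels = [default] * length
    for start, end, category in spans:
        if category not in CATEGORIES:
            raise ValueError(f"unknown category: {category}")
        if not 0 <= start <= end <= length:
            raise ValueError(f"span out of range: {(start, end)}")
        labels[start:end] = [category] * (end - start)
    return labels


def character_accuracy(gold, predicted):
    """Fraction of characters whose predicted label matches the gold label."""
    if len(gold) != len(predicted):
        raise ValueError("label sequences must have the same length")
    if not gold:
        return 1.0
    matches = sum(g == p for g, p in zip(gold, predicted))
    return matches / len(gold)


# A toy email body: prose, one line of code, prose again.
email = "It fails here:\nfoo.bar();\nAny idea?"
code = "foo.bar();"
start = email.index(code)

# Gold annotation marks the whole statement as source code.
gold = spans_to_labels(len(email), [(start, start + len(code), "source_code")])

# An imperfect prediction that misses the last five characters of the statement.
predicted = spans_to_labels(
    len(email), [(start, start + len(code) - 5, "source_code")]
)

counts = Counter(gold)  # number of characters per category in the gold labels
accuracy = character_accuracy(gold, predicted)
```

Scoring per character rather than per line or per message means a classifier is penalised for every character it places in the wrong category, which is what allows different classification methods to be compared on the same annotated emails.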

Item URL

http://data.d4science.org/ctlg/RISIS2OpenData/dedup_wf_001--0ec579db951f1b13a61fc7d3df898afc

Identity

Description: The Identity category includes attributes that support the identification of the resource.

Field	Value
PID	https://www.doi.org/10.5281/zenodo.1345172
PID	https://www.doi.org/10.5281/zenodo.1345171
URL	https://dx.doi.org/10.5281/zenodo.1345171
URL	https://dx.doi.org/10.5281/zenodo.1345172
URL	http://dx.doi.org/10.5281/zenodo.1345172
URL	https://zenodo.org/record/1345172
URL	http://dx.doi.org/10.5281/zenodo.1345171
URL	https://figshare.com/articles/Data_and_material_for_Content_classification_of_development_emails_/6969272

Access Modality

Description: The Access Modality category includes attributes that report the modality of exploitation of the resource.

Field	Value
Access Right	Open Access

Attribution

Description: Authorships and contributors

Field	Value
Author	Bacchelli, Alberto (ORCID: 0000-0003-0193-6823)
Author	Dal Sasso, Tommaso
Author	D'Ambros, Marco
Author	Lanza, Michele
Contributor	Schweizerischer Nationalfonds zur Förderung der Wissenschaftlichen Forschung

Publishing

Description: Attributes about the publishing venue (e.g. journal) and deposit location (e.g. repository)

Field	Value
Collected From	Zenodo; Datacite; figshare
Hosted By	Zenodo; figshare
Publication Date	2012-06-02
Publisher	Zenodo

Additional Info

Field	Value
Language	English
Resource Type	Dataset
system:type	dataset

Management Info

Field	Value
Source	https://science-innovation-policy.openaire.eu/search/dataset?datasetId=dedup_wf_001::0ec579db951f1b13a61fc7d3df898afc
Author	jsonws_user
Version	None
Last Updated	8 January 2021, 17:23 (CET)
Created	8 January 2021, 17:23 (CET)