Data And Material For: "Content Classification Of Development Emails"

This data and material support the paper "Content classification of development emails", published in the proceedings of the 34th International Conference on Software Engineering (ICSE 2012).

Every software system has a history. We find traces of a system's history in software repositories, which developers use when building and maintaining their systems. Each repository tells part of the history from its own perspective: issue repositories murmur of dark events involving defective and flawed entities; versioning system repositories narrate tales of restless artifacts and of classes that nobody would ever touch; mailing list archives report unexpected stories about developers' interactions and opinions.

But can we seriously trust these repositories? Can we just listen to what they tell us and behave accordingly? Many wise researchers have warned us about the risks of such naive faith in data repositories: versioning system repositories might seduce us with enchanting stories of ever-changing entities, when in reality many of these entities merely change their make-up and keep their old behaviour; issue repositories might tell us only a partial truth about certain very special entities or developers.

We agree with these researchers. Natural language documents in particular contain information in different languages, surrounded by much noise, and we must pay special attention when using them. We created MUCCA, a classification method for dealing with natural language documents. It recognizes source code fragments, patches, stack traces, noise, and natural language with high accuracy. This allows one to subsequently apply ad hoc analysis techniques that exploit the peculiarities of each category and extract reliable information. This Zenodo upload supports the paper that describes our work on this topic.

1. Source code & Virtual Image

MUCCA is written in Cincom VisualWorks Smalltalk and is composed of several components. You can download the source code of the following MUCCA components from this upload (mucca-source_code folder): Miler2, the core of MUCCA, including metamodels, importers, classification engine, etc.; MailPeek, our web application for the manual classification of email content; PetitIsland, our grammar to generate island parsers; PetitJava, the grammar of Java, which we implemented for PetitParser; PetitSTrace, our island parser for Java stack traces.

Note that, to run Miler2, you also need the following external Smalltalk components: Moose, Glorp, Seaside, TwoFlower, MetaDB, PetitParser. In addition, we use the Weka workbench for the machine learning tasks. You can download the two trained classifiers that compose MUCCA (mucca-classifiers folder): a Naive Bayes based classifier (classifier1-nb.model) and a Decision Tree based classifier (classifier2-dt.model).

Alternatively, we created a VirtualBox image with a pre-configured VisualWorks environment, which includes all the MUCCA components and prerequisites (both Smalltalk and Java): MUCCA.ova (user name and password are both muccauser).

2. Benchmark

To train machine-learning classifiers and evaluate the effectiveness of the different approaches, we manually created a benchmark in which emails are classified at character granularity. Given the time and effort needed to create such a benchmark, we humbly think it is a valuable contribution to the community. With the help of this benchmark, other researchers can reproduce our experiments and devise new classification methods, which can be immediately compared to ours. You can download the dataset from the GitHub repository (a dump of the GitHub repository is included in this upload as benchmark/githubDump.zip), or download the full database dump in PostgreSQL format (benchmark/benchmarkDump.tar.bz2).
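To make "classified at character granularity" concrete, the following is a minimal Python sketch of one way such annotations could be represented and compared. It is an illustration only: the names Span, to_char_labels and char_accuracy, the label strings, and the sample email are all hypothetical, and the sketch reflects neither the actual schema of the benchmark database nor the evaluation measure used in the paper.

```python
from dataclasses import dataclass

# The five content categories named in the description above.
# The exact label strings are an assumption made for this sketch.
CATEGORIES = ("natural_language", "source_code", "patch", "stack_trace", "noise")


@dataclass(frozen=True)
class Span:
    """A labeled character range (hypothetical representation)."""
    start: int  # inclusive character offset
    end: int    # exclusive character offset
    label: str


def to_char_labels(text_length, spans, default="natural_language"):
    """Expand labeled spans into one label per character."""
    labels = [default] * text_length
    for span in spans:
        if span.label not in CATEGORIES:
            raise ValueError(f"unknown category: {span.label}")
        if not 0 <= span.start <= span.end <= text_length:
            raise ValueError(f"span out of range: {span}")
        for i in range(span.start, span.end):
            labels[i] = span.label
    return labels


def char_accuracy(gold, predicted):
    """Fraction of characters whose predicted label matches the gold label."""
    if len(gold) != len(predicted):
        raise ValueError("label sequences must have the same length")
    if not gold:
        return 1.0
    return sum(g == p for g, p in zip(gold, predicted)) / len(gold)


# A made-up three-line email: greeting, one line of code, sign-off.
email = "Hi all,\nfoo.bar();\nThanks"
# Manual annotation: characters 8..17 are source code.
gold = to_char_labels(len(email), [Span(8, 18, "source_code")])
# A classifier that catches only part of the code fragment.
predicted = to_char_labels(len(email), [Span(8, 14, "source_code")])
print(char_accuracy(gold, predicted))
```

Storing spans rather than one label per character keeps annotations compact; expanding them to per-character labels only at comparison time makes it straightforward to score a classifier that splits the text at different boundaries than the manual annotation.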


Identity

Description: The Identity category includes attributes that support the identification of the resource.

Field Value
PID https://www.doi.org/10.5281/zenodo.1345172
PID https://www.doi.org/10.5281/zenodo.1345171
URL https://dx.doi.org/10.5281/zenodo.1345171
URL https://dx.doi.org/10.5281/zenodo.1345172
URL http://dx.doi.org/10.5281/zenodo.1345172
URL https://zenodo.org/record/1345172
URL http://dx.doi.org/10.5281/zenodo.1345171
URL https://figshare.com/articles/Data_and_material_for_Content_classification_of_development_emails_/6969272

Access Modality

Description: The Access Modality category includes attributes that report the modality of exploitation of the resource.

Field Value
Access Right Open Access

Attribution

Description: Authorships and contributors

Field Value
Author Bacchelli, Alberto, 0000-0003-0193-6823
Author Sasso, Tommaso Dal
Author D'Ambros, Marco
Author Lanza, Michele
Contributor Schweizerischer Nationalfonds zur Förderung der Wissenschaftlichen Forschung

Publishing

Description: Attributes about the publishing venue (e.g. journal) and deposit location (e.g. repository)

Field Value
Collected From Zenodo; Datacite; figshare
Hosted By Zenodo; figshare
Publication Date 2012-06-02
Publisher Zenodo

Additional Info

Field Value
Language English
Resource Type Dataset
system:type dataset

Management Info

Field Value
Source https://science-innovation-policy.openaire.eu/search/dataset?datasetId=dedup_wf_001::0ec579db951f1b13a61fc7d3df898afc
Author jsonws_user
Last Updated 8 January 2021, 17:23 (CET)
Created 8 January 2021, 17:23 (CET)