Sourcerer ASTs

Overview

Sourcerer is a large corpus of open source Java software. It was constructed by fully downloading the source code of 19 K projects, of which 6 K turned out to be empty. The authors then performed the following operations: removing non-Java files, removing SCM branches, removing duplicate projects, manually reviewing duplicate files, and removing test code. This reduced the 390 GB corpus to 14.3 GB. The resulting corpus has been made publicly available.

Paper Abstract

Measuring the internal quality of source code is one of the traditional goals of making software development into an engineering discipline. Cyclomatic Complexity (CC) is an often used source code quality metric, next to Source Lines of Code (SLOC). However, the use of the CC metric is challenged by the repeated claim that CC is redundant with respect to SLOC due to strong linear correlation. We test this claim by studying a corpus of 17.8M methods in 13K open-source Java projects. Our results show that direct linear correlation between SLOC and CC is only moderate, as caused by high variance. We observe that aggregating CC and SLOC over larger units of code improves the correlation, which explains reported results of strong linear correlation in literature. We suggest that the primary cause of correlation is the aggregation. Our conclusion is that there is no strong linear correlation between CC and SLOC of Java methods, so we do not conclude that CC is redundant with SLOC. This conclusion contradicts earlier claims from literature, but concurs with the widely accepted practice of measuring of CC next to SLOC.
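The aggregation effect described in the abstract can be illustrated with a small sketch. The code below does not use the Sourcerer corpus and is not the authors' analysis: it generates purely synthetic (SLOC, CC) pairs, and the noise model, the group sizes, and the helper names `pearson`, `synthetic_methods`, and `aggregate` are all assumptions made for illustration. It compares the Pearson correlation at the method level with the correlation after summing both metrics over randomly sized groups of methods, standing in for files.

```python
import random
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def synthetic_methods(rng, count):
    """Generate synthetic (sloc, cc) pairs; NOT real corpus data.

    Assumption: CC follows SLOC only loosely, with high variance.
    """
    methods = []
    for _ in range(count):
        sloc = rng.randint(1, 60)
        cc = max(1, round(1 + 0.1 * sloc + rng.gauss(0, 4)))
        methods.append((sloc, cc))
    return methods


def aggregate(methods, rng, max_group):
    """Sum SLOC and CC over consecutive groups of random size (hypothetical 'files')."""
    totals = []
    i = 0
    while i < len(methods):
        size = rng.randint(1, max_group)
        group = methods[i:i + size]
        totals.append((sum(s for s, _ in group), sum(c for _, c in group)))
        i += size
    return totals


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the sketch is reproducible
    methods = synthetic_methods(rng, 5000)
    files = aggregate(methods, rng, 30)

    method_r = pearson([s for s, _ in methods], [c for _, c in methods])
    file_r = pearson([s for s, _ in files], [c for _, c in files])
    print(f"method-level r = {method_r:.2f}")
    print(f"file-level r   = {file_r:.2f}")
```

In this toy setting the summed values of both metrics grow with the number of methods in a group, so group size alone pushes the aggregated correlation up even though the per-method relationship is noisy. That mirrors the mechanism the abstract suggests, but the numbers it prints say nothing about the actual corpus.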


Identity

Description: The Identity category includes attributes that support the identification of the resource.

Field  Value
PID    https://www.doi.org/10.5281/zenodo.268531
URL    http://dx.doi.org/10.5281/zenodo.268531
URL    https://zenodo.org/record/268531
URL    https://figshare.com/articles/Sourcerer_ASTs/6249913

Access Modality

Description: The Access Modality category includes attributes that report the modality of exploitation of the resource.

Field         Value
Access Right  Open Access

Attribution

Description: Authorships and contributors

Field   Value
Author  Landman, Davy

Publishing

Description: Attributes about the publishing venue (e.g. journal) and deposit location (e.g. repository)

Field             Value
Collected From    Zenodo; Datacite; figshare
Hosted By         Zenodo; figshare
Publication Date  2015-05-25
Publisher         Zenodo

Additional Info
Field          Value
Language       UNKNOWN
Resource Type  Dataset
system:type    dataset

Management Info
Field         Value
Source        https://science-innovation-policy.openaire.eu/search/dataset?datasetId=dedup_wf_001::2aaf14ff47a83b991470745841fe82f2
Author        jsonws_user
Version       None
Last Updated  13 January 2021, 16:35 (CET)
Created       13 January 2021, 16:35 (CET)