fbpx
Quantifying R Package Dependency Risk Quantifying R Package Dependency Risk
We recently commented on excess package dependencies as representing risk in the R package ecosystem. The question remains: how much risk? Is low dependency a... Quantifying R Package Dependency Risk

We recently commented on excess package dependencies as representing risk in the R package ecosystem.

The question remains: how much risk? Is low dependency a mere talisman, or is there evidence it is a good practice (or at least correlates with other good practices)?

[Related Article: Data-Driven Exploration of the R User Community Worldwide]

Well, it turns out we can quantify it: each additional non-core package declared as an “Imports” or “Depends” is associated with an extra 11% relative chance of a package having an indicated issue on CRAN. At over 5 non-core “Imports” plus “Depends” a package has significantly elevated risk.

The number of dependent packages in use versus modeled issue probability can be summed up in the following graph.

Unnamed chunk 6 2

In the above graph the dashed horizontal line is the overall rate that packages have issues on CRAN. Notice the curve crosses the line well before 5 non-trivial dependencies.

In fact packages importing more than 5 non-trivial dependencies have issues on CRAN at an empirical rate of 35%, (above the model prediction at 5 dependencies) and double the overall rate of 17%. Doubling a risk is considered very significant. And almost half the packages using more than 10 non-trivial dependencies have known issues on CRAN.

A very short analysis deriving the above can be found here.

Obviously we are using lack of problems on CRAN as a rough approximation for package quality, and number of non-trivial package Imports and Depends as rough proxy for package complexity. It would be interesting to quantify and control for other measures (including package complexity and purpose).

[Related Article: Using an Embedding Matrix on Tabular Data in R]

Our theory is the imports are not so much causing problems, but are a “code smell” correlated with other package issues. We feel this is evidence that the principles that guide some package developers to prefer packages with well defined purposes and low dependencies are part of a larger set of principles that lead to higher quality software.

A table of all scored packages and their problem/risk estimate can be found here.

Originally Posted Here

John Mount

John Mount

My specialty is analysis and design of algorithms, with an emphasis on efficient implementation. I work to find applications of state of the art methods in optimization, statistics and machine learning in various application areas. Currently co-authoring "Practical Data Science with R"

1