Over 10% of Python packages on PyPI are distributed without a license
September 18, 2018
Imagine that you installed a random Python package from PyPI. There’s a good 13.5% chance that the package has no licensing information. Considering it’s not uncommon to have hundreds of dependencies and sub-dependencies in a typical Python application, there’s a very good chance of using unlicensed code. Depending on the context, the consequences of using unlicensed software could be anywhere from insignificant to disastrous. OK, that’s a wild range, so this post will dig deeper into this issue.
What is PyPI?
PyPI is the most commonly used central repository for Python packages, created and maintained by the Python Software Foundation. It is usually accessed via package management tools, such as pip, to download and install Python libraries and apps. This is similar to package repositories for other languages, such as npm for JavaScript, RubyGems.org for Ruby and crates.io for Rust.
Background
Snyk analyzes the dependencies of software projects and finds issues such as security vulnerabilities and bad licenses. To this end, they recently hired me to fetch all of the meta-data from PyPI about packages and their released versions, including licensing information. With this data in hand, I examined the licensing situation for all of the Python packages on PyPI.
Below are some key insights from this analysis, followed by suggestions in light of these insights, and a description of the methods used to obtain them.
How is licensing in the Python ecosystem?
First up, here’s a quick disclaimer: I am not a lawyer. I am an experienced software engineer with some understanding of software licensing, but I’m no expert in the field. None of the following is to be considered legal advice, so it is recommended that you consult a lawyer if you need to!
Types of licenses
Generally speaking, almost all open-source software licenses can be divided into a few broad categories:
The “Copyleft” licenses are a relatively restrictive license type, including the well-known GPL, its many variations and similar licenses. These allow software to be used for practically any purpose, but require that modified versions you distribute are also made available under the same terms. Further, some of the “stronger” copyleft licenses require that code using such licensed software also be made publicly available, in certain circumstances.
The “Permissive” licenses, on the other hand, allow one to modify the software without having to make those modifications public. They are also generally much simpler and include fewer restrictions.
At the far end of the spectrum are the “Public Domain” licenses which generally place no restrictions at all on use.
For additional discussion of copyleft vs. permissive licensing, see a previous post on the subject on our blog.
In recent years, several “weak copyleft” licenses have appeared. These are mainly intended for software libraries, allowing their use with fewer or no restrictions on the using software. For more info on the distinction between “weak” and “strong” copyleft, see the Wikipedia article section on the subject.
It is also possible, although not recommended, to not specify any license. This is most often due to oversight or ignorance, but in some rare cases it is done intentionally. Without a license, use of the software is simply subject to copyright law (of each country!), and also other types of law such as patent law.
Breakdown of licensing types
Let’s take a look at the types of licenses used on PyPI:
From this data we can see the following:
The majority of packages on PyPI, 64%, use a “permissive” license, such as the MIT, Apache 2.0 and 2- or 3-Clause BSD licenses.
18.5% use a “strong copyleft” license, such as the GPL and AGPL licenses.
3% use a “weak copyleft” license, such as the LGPL and MPL licenses.
Only 1% use a “public domain” license, such as the CC0, WTFPL and Unlicense licenses.
13.5% don’t include any license.
This is unsurprising, as it is similar to the general trends seen on GitHub and other languages’ package repositories.
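To make the categories concrete, here is a minimal sketch of how license identifiers could be grouped into the categories above. The mapping only covers the licenses named as examples in this post; the specific version numbers chosen for AGPL, LGPL, MPL and CC0, as well as the helper names `categorize` and `breakdown`, are assumptions for illustration and not the code used for the analysis.

```python
from collections import Counter

# Illustrative mapping only: it covers the example licenses named in this
# post. The version numbers for AGPL, LGPL, MPL and CC0 are assumptions.
CATEGORY_BY_LICENSE = {
    "MIT": "permissive",
    "Apache-2.0": "permissive",
    "BSD-2-Clause": "permissive",
    "BSD-3-Clause": "permissive",
    "GPL-3.0": "strong copyleft",
    "AGPL-3.0": "strong copyleft",
    "LGPL-3.0": "weak copyleft",
    "MPL-2.0": "weak copyleft",
    "CC0-1.0": "public domain",
    "WTFPL": "public domain",
    "Unlicense": "public domain",
}


def categorize(license_id):
    """Return the broad category for a license identifier (hypothetical helper)."""
    if not license_id:
        return "no license"
    return CATEGORY_BY_LICENSE.get(license_id, "unknown")


def breakdown(license_ids):
    """Return the percentage of packages that fall into each category."""
    counts = Counter(categorize(license_id) for license_id in license_ids)
    total = sum(counts.values())
    return {category: 100 * n / total for category, n in counts.items()}
```

Feeding `breakdown` one identifier per package (with `None` for packages that declare nothing) yields the kind of percentages listed above.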
Most common licenses for Python packages
Four licenses dominate the landscape on PyPI, namely MIT, BSD-2-Clause, GPL-3.0 and Apache-2.0.
Many other licenses are in common use as well (used by at least 100 packages). Of these, two are unique to the Python ecosystem: The Python Software Foundation license (PSF) and the Zope Public License (ZPL).
What you should do NOW!
Given the licensing of Python packages on PyPI, I highly recommend the following:
Check if you are directly or indirectly using software that uses “copyleft” licenses and be comfortable with its implications on your code.
If you’re writing proprietary (non-open source) code, this could be a serious legal liability, especially with the “strong copyleft” licenses used by over 18% of PyPI packages.
If you’re writing open-source software, this could mean that you’ll be forced to license your code under similar licensing terms.
Check if any of your dependencies are missing a license.
If you need to modify such a dependency, or may need to in the future, you may not be allowed to according to copyright law.
Future versions of such dependencies are likely to introduce a license, and you cannot be sure what type of license that would be.
Set up an automated tool to periodically / continuously check the licensing of your dependencies.
Over a project’s lifetime, dependencies are often added and changed.
Dependencies may change their license between versions, so a version upgrade could result in a licensing change.
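As a starting point for such a check, the sketch below lists installed packages whose metadata declares no license at all. It uses only the standard library’s `importlib.metadata`; the function names are hypothetical, and treating the value “UNKNOWN” as “no license” is an assumption based on what older packaging tools wrote for empty fields. It only inspects declared metadata, so a package that ships a license file without declaring it would still be flagged.

```python
from importlib import metadata


def is_unlicensed(license_field, license_expression, classifiers):
    """Return True if none of the metadata fields carries license information."""
    declared = (license_field or "").strip()
    has_text = declared not in ("", "UNKNOWN")  # assumption: UNKNOWN means empty
    has_expression = bool((license_expression or "").strip())
    has_classifier = any(c.startswith("License ::") for c in classifiers)
    return not (has_text or has_expression or has_classifier)


def find_unlicensed():
    """Return the sorted names of installed packages with no declared license."""
    missing = set()
    for dist in metadata.distributions():
        meta = dist.metadata
        name = meta.get("Name")
        if name and is_unlicensed(
            meta.get("License"),
            meta.get("License-Expression"),  # newer metadata field
            meta.get_all("Classifier") or [],
        ):
            missing.add(name)
    return sorted(missing)
```

Running `find_unlicensed()` inside a project’s virtual environment, for example as a step in continuous integration, gives a list of dependencies that deserve a closer manual look.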
Further reading
We’ve just scratched the surface; licensing may be more important to you than you realize and you should read more! Consider reading up on the following subjects:
Multi-licensing (~500 packages on PyPI are multi-licensed)
The Legal Side of Open Source, and especially the section about choosing a license for your software.
Methodology
Here are some details on how the data collection and analysis were conducted.
First, I collected the metadata of all packages and all of their released versions on PyPI. This was done over several days in early August 2018. At the time there were nearly 150,000 packages on PyPI.
I then extracted the license information for each version of each package, from both the “license” and the “classifiers” fields. I then aggregated those into a single data structure, with the versions sorted according to a semantic versioning-style interpretation.
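For readers who want to try this on a single package, here is a minimal sketch of the extraction step. It assumes the metadata is fetched from PyPI’s public JSON endpoint (`https://pypi.org/pypi/<package>/json`), whose `info` section contains the “license” and “classifiers” fields; the function names are hypothetical and this is not necessarily how the original collection was implemented.

```python
import json
from urllib.request import urlopen


def fetch_metadata(package):
    """Download a package's metadata from PyPI's JSON API (needs network access)."""
    with urlopen(f"https://pypi.org/pypi/{package}/json") as response:
        return json.load(response)


def extract_license_fields(info):
    """Pull license hints from the 'info' section of a metadata document."""
    declared = (info.get("license") or "").strip()
    from_classifiers = [
        classifier.split(" :: ")[-1]  # keep only the license name at the end
        for classifier in info.get("classifiers") or []
        if classifier.startswith("License :: ")
    ]
    return {"license": declared or None, "classifiers": from_classifiers}
```

A call such as `extract_license_fields(fetch_metadata("requests")["info"])` returns both sources side by side, which is useful because the two fields often disagree or only one of them is filled in.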
Next I cleaned and normalized the licensing data. I built upon a version of an SPDX license normalizer used at Snyk. I iteratively improved the normalizer to correctly recognize a larger portion of the licenses, then manually created a large mapping of special cases to handle the long tail of exceptions.
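The general shape of such a normalizer can be sketched as follows. This is not Snyk’s normalizer: the alias table, the special-case mapping and the function names are all hypothetical and tiny compared to what a real data set needs. The idea is to reduce each free-text value to a canonical key, look it up in an alias table, and fall back to a hand-written mapping for strings that no general rule handles.

```python
import re

# Hypothetical alias table, keyed by the simplified form produced by _key().
ALIASES = {
    "mit": "MIT",
    "apache 2.0": "Apache-2.0",
    "gplv3": "GPL-3.0",
    "gpl 3.0": "GPL-3.0",
    "bsd 2 clause": "BSD-2-Clause",
}

# Hypothetical special cases, matched against the raw string as-is.
SPECIAL_CASES = {
    "http://www.apache.org/licenses/LICENSE-2.0": "Apache-2.0",
}


def _key(raw):
    """Reduce a free-text license string to a simplified lookup key."""
    text = raw.lower()
    text = re.sub(r"[^a-z0-9.]+", " ", text)  # punctuation becomes spaces
    text = re.sub(r"\b(the|version|licen[sc]e)\b", " ", text)  # drop filler words
    return " ".join(text.split())


def normalize(raw):
    """Return an SPDX-style identifier, or None if absent or unrecognized."""
    if not raw or raw.strip().upper() == "UNKNOWN":
        return None
    raw = raw.strip()
    if raw in SPECIAL_CASES:
        return SPECIAL_CASES[raw]
    # A real pipeline would record unrecognized values for manual review.
    return ALIASES.get(_key(raw))
```

Improving the normalizer then amounts to looking at the values that still come back unrecognized, adding general rules where a pattern repeats, and adding special cases where it does not.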
After this, I set out to analyze the 18% of the packages without a license. I had not yet looked at the files in the releases themselves. I randomly selected 50 license-less packages, downloaded their latest releases and searched for licensing information. After collecting this info, I classified these packages into five categories as follows:
Only those under the “Licensed” category actually have a license. Of those under the “training”, “throwaway” and “placeholder” categories, the great majority did include some code, and could conceivably be used as a dependency, whether purposefully or by mistake (e.g. a typo). Therefore, I found it reasonable to consider these proper unlicensed packages. Extrapolating from this sample, I arrived at the estimation that ~18% × 72% = ~13.5% of PyPI packages are unlicensed.
It is worth noting that I didn’t perform similar analysis for licensed packages on PyPI; it is likely that there are many “training”, “throwaway” and “placeholder” packages that do mention a license. Therefore, taking a conservative stance that only ~18% × 28% = ~5% of the “real” packages on PyPI are unlicensed would likely be severely inaccurate. I did not have the time to conduct such an analysis, unfortunately, so I stuck with the analysis described above.
If you have any questions or comments, I’d love to hear from you! Please reach out to me on Twitter at @taleinat.