State of the Python/PyPi dependency graph

I usually work in a Java/Maven environment, so when I explain to people that Python also has a package manager - a bit less heavy than Maven - and that it works pretty well, I always have to answer the same question: "OK, but how does it solve the transitive dependency hell?"

Also known as the historic DLL Hell, Jar Hell, etc. In short: you depend on A and C, A depends on B (version 1.2), and C depends on B (version 1.5). How do you choose which version of B to take?
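To make the conflict concrete, here is a minimal sketch of that A/B/C situation. The `requirements` dictionary and the `find_conflicts` helper are hypothetical names made up for illustration, and versions are treated as exact pins, which is a simplification of what real resolvers handle.

```python
from collections import defaultdict

# Hypothetical data: package -> {dependency: exact version it asks for}
requirements = {
    "A": {"B": "1.2"},
    "C": {"B": "1.5"},
}


def find_conflicts(requirements):
    """Return the dependencies that are requested in more than one version.

    The result maps each conflicting dependency to
    {version: [packages asking for that version]}.
    """
    wanted = defaultdict(lambda: defaultdict(list))
    for package, deps in requirements.items():
        for dep, version in deps.items():
            wanted[dep][version].append(package)
    return {
        dep: {version: sorted(pkgs) for version, pkgs in versions.items()}
        for dep, versions in wanted.items()
        if len(versions) > 1
    }


conflicts = find_conflicts(requirements)
print(conflicts)
```

Detecting the clash is the easy part; deciding which version of B wins is the actual "hell", and this sketch does not attempt it.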

I ended up trying to answer not exactly that question, but why I never really had that problem in Python. So this article is the first of a three-part series you could call "Dependency as a liability".

In this part, I wanted to analyse the Python library world in terms of a full dependency graph: how the libraries depend on each other.

When I talked with Tarek Ziadé about it, he told me how complicated things are right now. It seems that, as things stand, the only complete and reliable way to know what a package needs in terms of dependencies is to execute its installation on every operating system. This was a bit out of my scope for now, so I took another way, just to see where it would lead me.

Analyzing setup.py files

For recent packages that follow the Hitchhiker's Guide to Packaging, the package metadata is stored in a file called setup.py that looks like this:

[sourcecode language="python"]
from distutils.core import setup

setup(
    name='TowelStuff',
    version='0.1.0',
    author='J. Random Hacker',
    author_email='jrh@example.com',
    packages=['towelstuff', 'towelstuff.test'],
    scripts=['bin/stowe-towels.py','bin/wash-towels.py'],
    url='http://pypi.python.org/pypi/TowelStuff/',
    license='LICENSE.txt',
    description='Useful towel-related stuff.',
    long_description=open('README.txt').read(),
    install_requires=[
        "Django >= 1.1.1",
        "caldav == 0.1.4",
    ],
)
[/sourcecode]

You can notice a few things like the author, version, author_email, url, license... and the one I was focusing on: the install_requires parameter, where you declare all your dependencies. It may sound simple, but the setup.py file is itself a Python script, so the install_requires value can be computed or changed when the script is executed. So I took my chances and decided to create a project that extracts dependencies from all packages on PyPi according to the install_requires parameter, to see whether it is mainly used statically or dynamically. So what the meta-deps project does is:

extract all packages from PyPi using the XML-RPC API;
download the releases and extract the install_requires dependencies from the setup.py file;
store the results in a CSV file, pypi-deps.csv.
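The static-versus-dynamic distinction can be illustrated with a small sketch. This is not necessarily how meta-deps does it: it is one possible approach, using the standard `ast` module to read install_requires without executing the script. The function name `static_install_requires` is made up for this example, and a setup.py written in Python 2 syntax will not parse under Python 3, so it is reported as not statically readable.

```python
import ast


def static_install_requires(setup_source):
    """Return install_requires as a list when it is written as a literal
    inside a setup(...) call. Return None when it is absent, computed at
    run time, or when the source cannot be parsed."""
    try:
        tree = ast.parse(setup_source)
    except SyntaxError:
        return None  # e.g. Python 2 only syntax

    for node in ast.walk(tree):
        if not isinstance(node, ast.Call):
            continue
        func = node.func
        # Accept both setup(...) and something.setup(...)
        name = func.id if isinstance(func, ast.Name) else getattr(func, "attr", None)
        if name != "setup":
            continue
        for keyword in node.keywords:
            if keyword.arg != "install_requires":
                continue
            try:
                value = ast.literal_eval(keyword.value)
            except (ValueError, TypeError):
                return None  # a variable or expression: dynamic
            if isinstance(value, (list, tuple)):
                return list(value)
            return None
    return None


static_example = '''
from distutils.core import setup
setup(name="TowelStuff", install_requires=["Django >= 1.1.1", "caldav == 0.1.4"])
'''

dynamic_example = '''
import sys
from setuptools import setup
deps = ["caldav"]
if sys.version_info < (2, 7):
    deps.append("argparse")
setup(name="example", install_requires=deps)
'''

print(static_install_requires(static_example))
print(static_install_requires(dynamic_example))
```

The second example is the kind of case that only executing the installation can answer fully, since the final list depends on the interpreter running the script.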

If you want to re-use the raw data, you don't need to re-execute the process (and overload the PyPi servers in the meantime): just download the pypi-deps.csv file. It contains just these columns:

name of the dependency
version extracted
a base64-encoded JSON string storing the list of dependencies, so you just need to execute json.loads(b64decode(...))
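As a minimal sketch of reading that file, the snippet below builds one sample row in memory and decodes it back. It assumes the three columns appear in the order listed above, comma-separated and without a header row; check the real pypi-deps.csv before relying on that. The helper names `encode_deps` and `decode_deps` are made up for the example.

```python
import csv
import io
import json
from base64 import b64decode, b64encode


def encode_deps(deps):
    """Encode a list of dependency strings the way the third column stores it."""
    return b64encode(json.dumps(deps).encode("utf-8")).decode("ascii")


def decode_deps(field):
    """Decode the third column back into a Python list."""
    return json.loads(b64decode(field))


# Build a one-row sample in memory instead of downloading the real file.
sample = io.StringIO()
csv.writer(sample).writerow(
    ["TowelStuff", "0.1.0", encode_deps(["Django >= 1.1.1", "caldav == 0.1.4"])]
)
sample.seek(0)

rows = [
    (name, version, decode_deps(blob))
    for name, version, blob in csv.reader(sample)
]
print(rows)
```

To read the real file, replace `sample` with `open("pypi-deps.csv", newline="")`.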

Results

So what comes out of all this? This graph:

[caption id="attachment_968" align="aligncenter" width="640"]PyPi dependency graph generated using Gephi - click to see the interactive version[/caption]

OK, seen like that, you must think it looks like a huge jellyfish and that I'm just joking with you. So I spent a little time creating and optimizing an interactive graph of the PyPi dependencies (it seems best to open it using Chrome), where you can scroll and see all the dependencies with all the metrics and explanations needed.

The next steps will be to do the same with Maven dependencies in the Java world, and to compute the metrics needed to compare the two.

Vale
