
Commit 62ed063

Merge pull request #665 from pypa/package-downloads-revamp
Revamp the 'package downloads' guide
2 parents: b82fe4e + a9d439a

1 file changed: 128 additions & 82 deletions

File tree:
source/guides/analyzing-pypi-package-downloads.rst
@@ -2,10 +2,10 @@
 Analyzing PyPI package downloads
 ================================
 
-This section covers how to use the `PyPI package dataset`_ to learn more
-about downloads of a package (or packages) hosted on PyPI. For example, you can
-use it to discover the distribution of Python versions used to download a
-package.
+This section covers how to use the public PyPI download statistics dataset
+to learn more about downloads of a package (or packages) hosted on PyPI. For
+example, you can use it to discover the distribution of Python versions used to
+download a package.
 
 .. contents:: Contents
    :local:
@@ -14,71 +14,45 @@ package.
 Background
 ==========
 
-PyPI does not display download statistics because they are difficult to
-collect and display accurately. Reasons for this are included in the
-`announcement email
-<https://mail.python.org/pipermail/distutils-sig/2013-May/020855.html>`__:
-
-    There are numerous reasons for [download counts] removal/deprecation some
-    of which are:
-
-    - Technically hard to make work with the new CDN
-
-      - The CDN is being donated to the PSF, and the donated tier does
-        not offer any form of log access
-      - The work around for not having log access would greatly reduce
-        the utility of the CDN
-    - Highly inaccurate
-      - A number of things prevent the download counts from being
-        accurate, some of which include:
-
-        - pip download cache
-        - Internal or unofficial mirrors
-        - Packages not hosted on PyPI (for comparisons sake)
-        - Mirrors or unofficial grab scripts causing inflated counts
-          (Last I looked 25% of the downloads were from a known
-          mirroring script).
-    - Not particularly useful
-
-      - Just because a project has been downloaded a lot doesn't mean
-        it's good
-      - Similarly just because a project hasn't been downloaded a lot
-        doesn't mean it's bad
-
-    In short because it's value is low for various reasons, and the tradeoffs
-    required to make it work are high It has been not an effective use of
-    resources.
-
-As an alternative, the `Linehaul project
-<https://github.com/pypa/linehaul>`__ streams download logs to `Google
-BigQuery`_ [#]_. Linehaul writes an entry in a
-``the-psf.pypi.downloadsYYYYMMDD`` table for each download. The table
-contains information about what file was downloaded and how it was
-downloaded. Some useful columns from the `table schema
-<https://bigquery.cloud.google.com/table/the-psf:pypi.downloads20161022?tab=schema>`__
-include:
+PyPI does not display download statistics for a number of reasons: [#]_
 
-+------------------------+-----------------+-----------------------+
-| Column                 | Description     | Examples              |
-+========================+=================+=======================+
-| file.project           | Project name    | ``pipenv``, ``nose``  |
-+------------------------+-----------------+-----------------------+
-| file.version           | Package version | ``0.1.6``, ``1.4.2``  |
-+------------------------+-----------------+-----------------------+
-| details.installer.name | Installer       | pip, `bandersnatch`_  |
-+------------------------+-----------------+-----------------------+
-| details.python         | Python version  | ``2.7.12``, ``3.6.4`` |
-+------------------------+-----------------+-----------------------+
+- **Inefficient to make work with a Content Distribution Network (CDN):**
+  Download statistics change constantly. Including them in project pages, which
+  are heavily cached, would require invalidating the cache more often, and
+  reduce the overall effectiveness of the cache.
 
-.. [#] `PyPI BigQuery dataset announcement email <https://mail.python.org/pipermail/distutils-sig/2016-May/028986.html>`__
+- **Highly inaccurate:** A number of things prevent the download counts from
+  being accurate, some of which include:
 
-Setting up
-==========
+  - ``pip``'s download cache (lowers download counts)
+  - Internal or unofficial mirrors (can both raise or lower download counts)
+  - Packages not hosted on PyPI (for comparison's sake)
+  - Unofficial scripts or attempts at download count inflation (raises download
+    counts)
+  - Known historical data quality issues (lowers download counts)
+
+- **Not particularly useful:** Just because a project has been downloaded a lot
+  doesn't mean it's good; similarly, just because a project hasn't been
+  downloaded a lot doesn't mean it's bad!
 
-In order to use `Google BigQuery`_ to query the `PyPI package dataset`_,
-you'll need a Google account and to enable the BigQuery API on a Google
-Cloud Platform project. You can run the up to 1TB of queries per month `using
-the BigQuery free tier without a credit card
+In short, because its value is low for various reasons, and the tradeoffs
+required to make it work are high, it has not been an effective use of
+limited resources.
+
+Public dataset
+==============
+
+As an alternative, the `Linehaul project <https://github.com/pypa/linehaul>`__
+streams download logs from PyPI to `Google BigQuery`_ [#]_, where they are
+stored as a public dataset.
+
+Getting set up
+--------------
+
+In order to use `Google BigQuery`_ to query the `public PyPI download
+statistics dataset`_, you'll need a Google account and to enable the BigQuery
+API on a Google Cloud Platform project. You can run up to 1TB of queries
+per month `using the BigQuery free tier without a credit card
 <https://cloud.google.com/blog/big-data/2017/01/how-to-run-a-terabyte-of-google-bigquery-queries-each-month-without-a-credit-card>`__
 
 - Navigate to the `BigQuery web UI`_.
@@ -90,8 +64,31 @@ For more detailed instructions on how to get started with BigQuery, check out
 the `BigQuery quickstart guide
 <https://cloud.google.com/bigquery/docs/quickstarts/quickstart-web-ui>`__.
 
+
+Data schema
+-----------
+
+Linehaul writes an entry in a ``the-psf.pypi.downloadsYYYYMMDD`` table for each
+download. The table contains information about what file was downloaded and how
+it was downloaded. Some useful columns from the `table schema
+<https://console.cloud.google.com/bigquery?pli=1&p=the-psf&d=pypi&t=downloads&page=table>`__
+include:
+
++------------------------+-----------------+-----------------------+
+| Column                 | Description     | Examples              |
++========================+=================+=======================+
+| file.project           | Project name    | ``pipenv``, ``nose``  |
++------------------------+-----------------+-----------------------+
+| file.version           | Package version | ``0.1.6``, ``1.4.2``  |
++------------------------+-----------------+-----------------------+
+| details.installer.name | Installer       | pip, `bandersnatch`_  |
++------------------------+-----------------+-----------------------+
+| details.python         | Python version  | ``2.7.12``, ``3.6.4`` |
++------------------------+-----------------+-----------------------+
+
+
 Useful queries
-==============
+--------------
 
 Run queries in the `BigQuery web UI`_ by clicking the "Compose query" button.
 
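The schema columns above are what the guide's queries filter and group on. As an illustration (not part of this commit), a query over ``details.python`` could recover the Python-version distribution mentioned in the guide's introduction; the helper name ``python_version_query`` and the exact SQL text are assumptions here:

```python
def python_version_query(project):
    # Build an illustrative BigQuery Standard SQL query that groups
    # downloads of `project` by Python minor version, using the
    # details.python and file.project columns from the schema above.
    return r"""
SELECT
  REGEXP_EXTRACT(details.python, r'^\d+\.\d+') AS python_version,
  COUNT(*) AS num_downloads
FROM `the-psf.pypi.downloads*`
WHERE file.project = '{}'
GROUP BY python_version
ORDER BY num_downloads DESC
""".format(project)

# The resulting string can be pasted into the BigQuery web UI
# or passed to a BigQuery client.
query = python_version_query("pytest")
```
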
@@ -102,7 +99,7 @@ recent history by using `wildcard tables
 select all tables and then filter by ``_TABLE_SUFFIX``.
 
 Counting package downloads
---------------------------
+~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 The following query counts the total number of downloads for the project
 "pytest".
@@ -148,7 +145,7 @@ column.
 +---------------+
 
 Package downloads over time
----------------------------
+~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 To group by monthly downloads, use the ``_TABLE_SUFFIX`` pseudo-column. Also
 use the pseudo-column to limit the tables queried and the corresponding
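Grouping by month works because the first six characters of ``_TABLE_SUFFIX`` are the ``YYYYMM`` portion of the daily table name. A small illustrative helper (not part of this commit) that enumerates those month prefixes client-side, e.g. to build an explicit ``IN`` filter:

```python
from datetime import date

def month_prefixes(start, end):
    # Return the YYYYMM prefixes of _TABLE_SUFFIX for each month between
    # start and end (inclusive) -- the same grouping key the guide's SQL
    # derives from the _TABLE_SUFFIX pseudo-column.
    prefixes = []
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        prefixes.append("{:04d}{:02d}".format(year, month))
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return prefixes

months = month_prefixes(date(2017, 11, 1), date(2018, 2, 1))
# months == ["201711", "201712", "201801", "201802"]
```
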
@@ -188,7 +185,7 @@ costs.
 +---------------+--------+
 
 More queries
-------------
+~~~~~~~~~~~~
 
 - `Data driven decisions using PyPI download statistics
   <https://langui.sh/2016/12/09/data-driven-decisions/>`__
@@ -198,19 +195,68 @@ More queries
 - `Non-Windows downloads, grouped by platform
   <https://bigquery.cloud.google.com/savedquery/51422494423:ff1976af63614ad4a1258d8821dd7785>`__
 
+Caveats
+=======
+
+In addition to the caveats listed in the background above, Linehaul suffered
+from a bug which caused it to significantly under-report download statistics
+prior to July 26, 2018. Downloads before this date are proportionally accurate
+(e.g. the percentage of Python 2 vs. Python 3 downloads) but total numbers are
+lower than actual by an order of magnitude.
+
+
 Additional tools
 ================
 
-You can also access the `PyPI package dataset`_ programmatically via the
-BigQuery API.
+Besides using the BigQuery console, there are some additional tools which may
+be useful when analyzing download statistics.
+
+``google-cloud-bigquery``
+-------------------------
 
-pypinfo
--------
+You can also access the public PyPI download statistics dataset
+programmatically via the BigQuery API and the `google-cloud-bigquery`_ project,
+the official Python client library for BigQuery.
+
+.. code-block:: python
+
+    from google.cloud import bigquery
+
+    # Note: depending on where this code is being run, you may require
+    # additional authentication. See:
+    # https://cloud.google.com/bigquery/docs/authentication/
+    client = bigquery.Client()
+
+    query_job = client.query("""
+    SELECT COUNT(*) AS num_downloads
+    FROM `the-psf.pypi.downloads*`
+    WHERE file.project = 'pytest'
+      -- Only query the last 30 days of history
+      AND _TABLE_SUFFIX
+        BETWEEN FORMAT_DATE(
+          '%Y%m%d', DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY))
+        AND FORMAT_DATE('%Y%m%d', CURRENT_DATE())""")
+
+    results = query_job.result()  # Waits for job to complete.
+    for row in results:
+        print("{} downloads".format(row.num_downloads))
+
+
+``pypinfo``
+-----------
 
 `pypinfo`_ is a command-line tool which provides access to the dataset and
 can generate several useful queries. For example, you can query the total
 number of downloads for a package with the command ``pypinfo package_name``.
 
+Install `pypinfo`_ using pip.
+
+::
+
+    pip install pypinfo
+
+Usage:
+
 ::
 
     $ pypinfo requests
@@ -223,20 +269,20 @@ number of downloads for a package with the command ``pypinfo package_name``.
 | -------------- |
 | 9,316,415      |
 
-Install `pypinfo`_ using pip.
 
-::
+``pandas-gbq``
+--------------
+
+The `pandas-gbq`_ project allows for accessing query results via `Pandas`_.
 
-    pip install pypinfo
 
-Other libraries
----------------
+References
+==========
 
-- `google-cloud-bigquery`_ is the official client library to access the
-  BigQuery API.
-- `pandas-gbq`_ allows for accessing query results via `Pandas`_.
+.. [#] `PyPI Download Counts deprecation email <https://mail.python.org/pipermail/distutils-sig/2013-May/020855.html>`__
+.. [#] `PyPI BigQuery dataset announcement email <https://mail.python.org/pipermail/distutils-sig/2016-May/028986.html>`__
 
-.. _PyPI package dataset: https://bigquery.cloud.google.com/dataset/the-psf:pypi
+.. _public PyPI download statistics dataset: https://console.cloud.google.com/bigquery?p=the-psf&d=pypi&page=dataset
 .. _bandersnatch: /key_projects/#bandersnatch
 .. _Google BigQuery: https://cloud.google.com/bigquery
 .. _BigQuery web UI: https://console.cloud.google.com/bigquery
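The diff leaves ``pandas-gbq`` usage to the reader. A minimal sketch, assuming the ``pandas_gbq.read_gbq`` entry point and valid Google Cloud credentials; both helper names are hypothetical, and the import is kept inside the function so the sketch loads without the library installed:

```python
def count_query(project):
    # Build the illustrative total-downloads query for a project.
    return (
        "SELECT COUNT(*) AS num_downloads\n"
        "FROM `the-psf.pypi.downloads*`\n"
        "WHERE file.project = '{}'".format(project)
    )

def downloads_dataframe(project, project_id):
    # Run the query via pandas-gbq and return a pandas DataFrame.
    # `project_id` is your own Google Cloud project (for billing),
    # not the PyPI dataset. Requires `pip install pandas-gbq` and
    # Google Cloud credentials; imported lazily on purpose.
    import pandas_gbq
    return pandas_gbq.read_gbq(
        count_query(project), project_id=project_id, dialect="standard")
```
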

0 commit comments