22Analyzing PyPI package downloads
33================================
44
5- This section covers how to use the ` PyPI package dataset `_ to learn more
6- about downloads of a package (or packages) hosted on PyPI. For example, you can
7- use it to discover the distribution of Python versions used to download a
8- package.
5+ This section covers how to use the public PyPI download statistics dataset
6+ to learn more about downloads of a package (or packages) hosted on PyPI. For
7+ example, you can use it to discover the distribution of Python versions used to
8+ download a package.
99
1010.. contents:: Contents
1111   :local:
@@ -14,71 +14,45 @@ package.
1414Background
1515==========
1616
17- PyPI does not display download statistics because they are difficult to
18- collect and display accurately. Reasons for this are included in the
19- `announcement email
20- <https://mail.python.org/pipermail/distutils-sig/2013-May/020855.html> `__:
21-
22- There are numerous reasons for [download counts] removal/deprecation some
23- of which are:
24-
25- - Technically hard to make work with the new CDN
26-
27- - The CDN is being donated to the PSF, and the donated tier does
28- not offer any form of log access
29- - The work around for not having log access would greatly reduce
30- the utility of the CDN
31- - Highly inaccurate
32- - A number of things prevent the download counts from being
33- inaccurate, some of which include:
34-
35- - pip download cache
36- - Internal or unofficial mirrors
37- - Packages not hosted on PyPI (for comparisons sake)
38- - Mirrors or unofficial grab scripts causing inflated counts
39- (Last I looked 25% of the downloads were from a known
40- mirroring script).
41- - Not particularly useful
42-
43- - Just because a project has been downloaded a lot doesn't mean
44- it's good
45- - Similarly just because a project hasn't been downloaded a lot
46- doesn't mean it's bad
47-
48- In short because it's value is low for various reasons, and the tradeoffs
49- required to make it work are high It has been not an effective use of
50- resources.
51-
52- As an alternative, the `Linehaul project
53- <https://github.com/pypa/linehaul> `__ streams download logs to `Google
54- BigQuery `_ [# ]_. Linehaul writes an entry in a
55- ``the-psf.pypi.downloadsYYYYMMDD `` table for each download. The table
56- contains information about what file was downloaded and how it was
57- downloaded. Some useful columns from the `table schema
58- <https://bigquery.cloud.google.com/table/the-psf:pypi.downloads20161022?tab=schema> `__
59- include:
17+ PyPI does not display download statistics for a number of reasons: [#]_
6018
61- +------------------------+-----------------+-----------------------+
62- | Column | Description | Examples |
63- +========================+=================+=======================+
64- | file.project | Project name | ``pipenv ``, ``nose `` |
65- +------------------------+-----------------+-----------------------+
66- | file.version | Package version | ``0.1.6 ``, ``1.4.2 `` |
67- +------------------------+-----------------+-----------------------+
68- | details.installer.name | Installer | pip, `bandersnatch `_ |
69- +------------------------+-----------------+-----------------------+
70- | details.python | Python version | ``2.7.12 ``, ``3.6.4 `` |
71- +------------------------+-----------------+-----------------------+
19+ - **Inefficient to make work with a Content Distribution Network (CDN):**
20+   Download statistics change constantly. Including them in project pages, which
21+   are heavily cached, would require invalidating the cache more often, and
22+   reduce the overall effectiveness of the cache.
7223
73- .. [# ] `PyPI BigQuery dataset announcement email <https://mail.python.org/pipermail/distutils-sig/2016-May/028986.html >`__
24+ - **Highly inaccurate:** A number of things prevent the download counts from
25+   being accurate, some of which include:
7426
75- Setting up
76- ==========
27+   - ``pip``'s download cache (lowers download counts)
28+   - Internal or unofficial mirrors (can either raise or lower download counts)
29+   - Packages not hosted on PyPI (for comparison's sake)
30+   - Unofficial scripts or attempts at download count inflation (raises download
31+     counts)
32+   - Known historical data quality issues (lowers download counts)
33+
34+ - **Not particularly useful:** Just because a project has been downloaded a lot
35+   doesn't mean it's good; similarly, just because a project hasn't been
36+   downloaded a lot doesn't mean it's bad!
7737
78- In order to use `Google BigQuery `_ to query the `PyPI package dataset `_,
79- you'll need a Google account and to enable the BigQuery API on a Google
80- Cloud Platform project. You can run the up to 1TB of queries per month `using
81- the BigQuery free tier without a credit card
38+ In short, because its value is low for various reasons, and the tradeoffs
39+ required to make it work are high, it has not been an effective use of
40+ limited resources.
41+
42+ Public dataset
43+ ==============
44+
45+ As an alternative, the `Linehaul project <https://github.com/pypa/linehaul>`__
46+ streams download logs from PyPI to `Google BigQuery`_ [#]_, where they are
47+ stored as a public dataset.
48+
49+ Getting set up
50+ --------------
51+
52+ In order to use `Google BigQuery`_ to query the `public PyPI download
53+ statistics dataset`_, you'll need a Google account and to enable the BigQuery
54+ API on a Google Cloud Platform project. You can run up to 1TB of queries
55+ per month `using the BigQuery free tier without a credit card
8256<https://cloud.google.com/blog/big-data/2017/01/how-to-run-a-terabyte-of-google-bigquery-queries-each-month-without-a-credit-card>`__
8357
8458- Navigate to the `BigQuery web UI`_.
@@ -90,8 +64,31 @@ For more detailed instructions on how to get started with BigQuery, check out
9064the `BigQuery quickstart guide
9165<https://cloud.google.com/bigquery/docs/quickstarts/quickstart-web-ui>`__.
9266
67+
68+ Data schema
69+ -----------
70+
71+ Linehaul writes an entry in a ``the-psf.pypi.downloadsYYYYMMDD`` table for each
72+ download. The table contains information about what file was downloaded and how
73+ it was downloaded. Some useful columns from the `table schema
74+ <https://console.cloud.google.com/bigquery?pli=1&p=the-psf&d=pypi&t=downloads&page=table>`__
75+ include:
76+
77+ +------------------------+-----------------+-----------------------+
78+ | Column                 | Description     | Examples              |
79+ +========================+=================+=======================+
80+ | file.project           | Project name    | ``pipenv``, ``nose``  |
81+ +------------------------+-----------------+-----------------------+
82+ | file.version           | Package version | ``0.1.6``, ``1.4.2``  |
83+ +------------------------+-----------------+-----------------------+
84+ | details.installer.name | Installer       | pip, `bandersnatch`_  |
85+ +------------------------+-----------------+-----------------------+
86+ | details.python         | Python version  | ``2.7.12``, ``3.6.4`` |
87+ +------------------------+-----------------+-----------------------+
88+
89+
9390Useful queries
94- ==============
91+ --------------
9592
9693Run queries in the `BigQuery web UI`_ by clicking the "Compose query" button.
9794
@@ -102,7 +99,7 @@ recent history by using `wildcard tables
10299select all tables and then filter by ``_TABLE_SUFFIX``.
103100
104101Counting package downloads
105- --------------------------
102+ ~~~~~~~~~~~~~~~~~~~~~~~~~~
106103
107104The following query counts the total number of downloads for the project
108105"pytest".
@@ -148,7 +145,7 @@ column.
148145+---------------+
149146
150147Package downloads over time
151- ---------------------------
148+ ~~~~~~~~~~~~~~~~~~~~~~~~~~~
152149
153150To group by monthly downloads, use the ``_TABLE_SUFFIX`` pseudo-column. Also
154151use the pseudo-column to limit the tables queried and the corresponding
@@ -188,7 +185,7 @@ costs.
188185+---------------+--------+
189186
190187More queries
191- ------------
188+ ~~~~~~~~~~~~
192189
193190- `Data driven decisions using PyPI download statistics
194191  <https://langui.sh/2016/12/09/data-driven-decisions/>`__
@@ -198,19 +195,68 @@ More queries
198195- `Non-Windows downloads, grouped by platform
199196  <https://bigquery.cloud.google.com/savedquery/51422494423:ff1976af63614ad4a1258d8821dd7785>`__
200197
198+ Caveats
199+ =======
200+
201+ In addition to the caveats listed in the background above, Linehaul suffered
202+ from a bug which caused it to significantly under-report download statistics
203+ prior to July 26, 2018. Downloads before this date are proportionally accurate
204+ (e.g. the percentage of Python 2 vs. Python 3 downloads) but total numbers are
205+ lower than actual by an order of magnitude.
206+
207+
201208Additional tools
202209================
203210
204- You can also access the `PyPI package dataset `_ programmatically via the
205- BigQuery API.
211+ Besides using the BigQuery console, there are some additional tools which may
212+ be useful when analyzing download statistics.
213+
214+ ``google-cloud-bigquery``
215+ -------------------------
206216
207- pypinfo
208- -------
217+ You can also access the public PyPI download statistics dataset
218+ programmatically via the BigQuery API and the `google-cloud-bigquery`_ project,
219+ the official Python client library for BigQuery.
220+
221+ .. code-block:: python
222+
223+     from google.cloud import bigquery
224+
225+     # Note: depending on where this code is being run, you may require
226+     # additional authentication. See:
227+     # https://cloud.google.com/bigquery/docs/authentication/
228+     client = bigquery.Client()
229+
230+     query_job = client.query("""
231+     SELECT COUNT(*) AS num_downloads
232+     FROM `the-psf.pypi.downloads*`
233+     WHERE file.project = 'pytest'
234+       -- Only query the last 30 days of history
235+       AND _TABLE_SUFFIX
236+         BETWEEN FORMAT_DATE(
237+           '%Y%m%d', DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY))
238+         AND FORMAT_DATE('%Y%m%d', CURRENT_DATE())""")
239+
240+     results = query_job.result()  # Waits for job to complete.
241+     for row in results:
242+         print("{} downloads".format(row.num_downloads))
243+
244+
245+ ``pypinfo``
246+ -----------
209247
210248`pypinfo`_ is a command-line tool which provides access to the dataset and
211249can generate several useful queries. For example, you can query the total
212250number of downloads for a package with the command ``pypinfo package_name``.
213251
252+ Install `pypinfo`_ using pip.
253+
254+ ::
255+
256+     pip install pypinfo
257+
258+ Usage:
259+
214260::
215261
216262 $ pypinfo requests
@@ -223,20 +269,20 @@ number of download for a package with the command ``pypinfo package_name``.
223269 | -------------- |
224270 | 9,316,415 |
225271
226- Install `pypinfo `_ using pip.
227272
228- ::
273+ ``pandas-gbq``
274+ --------------
275+
276+ The `pandas-gbq`_ project allows for accessing query results via `Pandas`_.
229277
230- pip install pypinfo
231278
232- Other libraries
233- ---------------
279+ References
280+ ==========
234281
235- - `google-cloud-bigquery `_ is the official client library to access the
236- BigQuery API.
237- - `pandas-gbq `_ allows for accessing query results via `Pandas `_.
282+ .. [#] `PyPI Download Counts deprecation email <https://mail.python.org/pipermail/distutils-sig/2013-May/020855.html>`__
283+ .. [#] `PyPI BigQuery dataset announcement email <https://mail.python.org/pipermail/distutils-sig/2016-May/028986.html>`__
238284
239- .. _PyPI package dataset: https://bigquery.cloud.google.com/dataset/the-psf:pypi
285+ .. _public PyPI download statistics dataset: https://console.cloud.google.com/bigquery?p=the-psf&d=pypi&page=dataset
240286.. _bandersnatch: /key_projects/#bandersnatch
241287.. _Google BigQuery: https://cloud.google.com/bigquery
242288.. _BigQuery web UI: https://console.cloud.google.com/bigquery
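
A note on the new Caveats section and the ``_TABLE_SUFFIX`` filter used in the queries: the sketch below is not part of this change. It only illustrates, in plain Python, how the ``YYYYMMDD`` suffix bounds could be computed client-side and how a date range might be checked against the July 26, 2018 cutoff named in the Caveats text. The helper names (``table_suffix``, ``suffix_range``, ``is_under_reported``) are hypothetical and do not come from Linehaul, pypinfo, or any BigQuery library.

```python
from datetime import date, timedelta

# Cutoff taken from the Caveats section: totals recorded before this date
# are described there as under-reported by about an order of magnitude.
UNDER_REPORTING_FIXED = date(2018, 7, 26)


def table_suffix(day):
    """Format a date the way the daily tables are suffixed (YYYYMMDD)."""
    return day.strftime("%Y%m%d")


def suffix_range(end, days):
    """Return (start, end) suffixes reaching `days` days back from `end`.

    Mirrors DATE_SUB(CURRENT_DATE(), INTERVAL n DAY) in the SQL examples;
    both bounds are inclusive when used with BETWEEN.
    """
    return table_suffix(end - timedelta(days=days)), table_suffix(end)


def is_under_reported(suffix):
    """True if a YYYYMMDD suffix falls before the cutoff date.

    Plain string comparison is enough because the format is fixed-width
    and ordered year, month, day.
    """
    return suffix < table_suffix(UNDER_REPORTING_FIXED)
```

A caller could use ``is_under_reported(start)`` on the lower bound to decide whether to warn that absolute totals in the selected range are unreliable, while proportions remain usable.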
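
The introduction mentions discovering the distribution of Python versions used to download a package, and the Caveats section notes that proportions stay accurate even where totals do not. As a rough illustration only, the sketch below turns a list of ``details.python`` values into per-minor-version shares. The function name and the sample values are hypothetical, and it assumes some rows may carry no version at all.

```python
from collections import Counter


def python_version_shares(versions):
    """Map 'major.minor' to its fraction of downloads.

    `versions` is an iterable of details.python strings such as "2.7.12".
    Empty or missing values are skipped (an assumption about the data).
    """
    counts = Counter(
        ".".join(v.split(".")[:2]) for v in versions if v
    )
    total = sum(counts.values())
    if not total:
        return {}
    return {minor: n / total for minor, n in counts.items()}


# Hypothetical values for illustration, not real download data.
sample = ["2.7.12", "3.6.4", "3.6.4", None, "2.7.12", "3.6.1"]
shares = python_version_shares(sample)
```

For larger result sets it would usually be cheaper to do this grouping in the query itself; the client-side version is only meant to make the idea concrete.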