Commit b6bcddb

Fonts 2025 queries (#4175)
* Copy the queries from 2024 to 2025
* Update the readme
* Update the INCLUDE pseudo-directives
* Replace all with crawl
* Update the dates to 2025-07-01
* Update the usage of custom_metrics
* Update the usage of summary
* Update the usage of payload in the common functions
* Update the usage of JSON_*
* Fix development/fonts_hinting
* Add a Python script for validating the queries
* Add development/styles_hyphens
* Replace all single dates with a placeholder
* Replace all multiple dates with a placeholder
* Replace 2025 with {year}
* Mention the parameters in the readme
* Remove performance/scripts_font_face.sql
* Introduce a precision parameter and round proportions
* Sort by count, not proportion
* Add development/styles_text_wrap
* Do not print size if there is none
* Create sheets for query results
* Add a few comments
* Name sheets by the question
* Populate the spreadsheet
* Make a cosmetic adjustment
* Nullify NaNs
* Address a lint
* Exclude non-SQL files
* Add a parameter for controlling the number of workers
* Use SAFE.INT64 for respBodySize
* Take the first line of the error
* Cast file sizes to integers
* Downsample in design/fonts_family_by_script.sql
* Fix a typo
* Add rounding in design/fonts_metric.sql
* Fix a typo
* Fix the reporting of failures
* Update the readme
* Update the usage of the Chrome UX report
* Update the usage of parsed_css
* Use JSON instead of STRING in custom JavaScript functions
* Make a cosmetic adjustment
* Remove JSON_QUERY in favor of direct indexing
* Simplify SCRIPTS
* Simplify HAS_EMOJI
* Do no use subsampling
1 parent b3d1a03 commit b6bcddb

57 files changed

Lines changed: 3532 additions & 14 deletions

sql/2025/fonts/.gitignore

Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
1+
*.csv

sql/2025/fonts/README.md

Lines changed: 78 additions & 14 deletions
@@ -1,20 +1,84 @@
1-
# 2025 Fonts queries
1+
# Fonts
22

3-
<!--
4-
This directory contains all of the 2025 Fonts chapter queries.
3+
## Resources
54

6-
Each query should have a corresponding `metric_name.sql` file.
7-
Note that readers are linked to this directory, so try to make the SQL file names descriptive for easy browsing.
5+
* 📄 [Planning document]
6+
* 📊 [Results sheet]
7+
* 📝 [Chapter content]
88

9-
Analysts: if helpful, you can use this README to give additional info about the queries.
10-
-->
9+
## Structure
1110

12-
## Resources
11+
The queries are split by the section where they are used:
12+
13+
* `design/` is about foundries and families,
14+
* `development/` is about tools and technologies, and
15+
* `performance/` is about hosting and serving.
16+
17+
Each file name starts with one of the following prefixes indicating the primary subject of the corresponding analysis:
18+
19+
* `fonts_` is about font files,
20+
* `pages_` is about HTML pages,
21+
* `scripts_` is about JavaScript scripts, and
22+
* `styles_` is about CSS style sheets.
23+
24+
The prefix is followed by the property studied, given in the singular and potentially extended by one or several suffixes narrowing down the scope, as in `fonts_size_by_table.sql` and `pages_link_relation.sql`.
25+
26+
## Content
27+
28+
Each query starts with a preamble indicating the section, question, and normalization type, as illustrated below:
29+
30+
```sql
31+
-- Section: Performance
32+
-- Question: What is the distribution of the file size broken down by table?
33+
-- Normalization: Pages
34+
```
35+
36+
Many queries rely on temporary functions for convenience and clarity. The functions that appear in several queries are extracted into a common file called `common.sql`. Whenever any of the functions defined in `common.sql` is used by a query, the query has the following pseudo-directive at the top:
37+
38+
```sql
39+
-- INCLUDE https://github.com/HTTPArchive/almanac.httparchive.org/blob/main/sql/{year}/fonts/common.sql
40+
```
41+
42+
The pseudo-directive has to be replaced with the content of `common.sql` prior to executing the query in question.
43+
44+
In addition, queries generally have parameters, as in `@date`, so as to be able to run them for different configurations. The values for the parameters will have to be supplied upon execution.
45+
46+
All of the above is taken care of automatically if the queries are executed using `execute.py`, which we discuss next.
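
To make the pseudo-directive concrete, here is a minimal sketch of how it could be expanded before a query is sent for execution. It is not the code of `execute.py`: the names `expand_includes` and `load` are hypothetical, and `load` stands for any callback that maps the URL to the text of `common.sql`, for instance by reading the local file. Parameters such as `@date` are left untouched, since their values are supplied upon execution.

```python
import re
from typing import Callable

# Matches a whole line of the form "-- INCLUDE <url>".
INCLUDE = re.compile(r"^-- INCLUDE (\S+)[ \t]*$", re.MULTILINE)


def expand_includes(query: str, year: int, load: Callable[[str], str]) -> str:
    """Replace each INCLUDE pseudo-directive with the content it points to.

    `load` is a hypothetical callback returning the text behind a URL.
    """

    def replace(match: re.Match) -> str:
        # Resolve the {year} placeholder before looking the file up.
        url = match.group(1).replace("{year}", str(year))
        return load(url).rstrip("\n")

    # A function replacement is inserted literally, so backslashes in the
    # included SQL (as in regular expressions) are preserved.
    return INCLUDE.sub(replace, query)


query = (
    "-- INCLUDE https://github.com/HTTPArchive/almanac.httparchive.org"
    "/blob/main/sql/{year}/fonts/common.sql\n"
    "\n"
    "SELECT FAMILY(payload) FROM t WHERE date = @date\n"
)
expanded = expand_includes(query, 2025, lambda url: f"-- content of {url}\n")
```

In this toy run, the directive line is replaced by a single comment naming the resolved URL, and the rest of the query is unchanged.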
47+
48+
## Execution
49+
50+
The queries can be executed using the `execute.py` script. The results are first saved in local CSV files sitting next to the SQL files and then uploaded to the spreadsheet. In the spreadsheet, for each query, a separate sheet is created and named after the question the query answers, which is given in its preamble. If the CSV file already exists, the corresponding query is not executed. If cell A1 is already populated, the corresponding sheet is not updated.
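
The rule about existing CSV files can be pictured as follows. This is only a sketch under the assumption that the CSV file shares the base name of the SQL file; the helpers `result_path` and `pending` are hypothetical and are not taken from `execute.py`.

```python
from pathlib import Path
import tempfile


def result_path(query_path: Path) -> Path:
    """Return the CSV path sitting next to the given SQL file (assumed naming)."""
    return query_path.with_suffix(".csv")


def pending(query_paths: list[Path]) -> list[Path]:
    """Keep only the queries whose results have not been saved yet."""
    return [path for path in query_paths if not result_path(path).exists()]


with tempfile.TemporaryDirectory() as root:
    first = Path(root) / "fonts_size_by_table.sql"
    second = Path(root) / "pages_link_relation.sql"
    first.touch()
    second.touch()
    result_path(first).touch()  # Pretend the first query has already run.
    remaining = [path.name for path in pending([first, second])]
```

Under this reading, deleting a CSV file is what triggers a query to be executed again.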
51+
52+
First, ensure that the Application Default Credentials authorization strategy is configured, and that the HTTP Archive project is used as the quota project:
53+
54+
```shell
55+
gcloud auth application-default login \
56+
--scopes https://www.googleapis.com/auth/cloud-platform,https://www.googleapis.com/auth/spreadsheets
57+
gcloud auth application-default set-quota-project httparchive
58+
```
59+
60+
Second, install the Python prerequisites for the script:
61+
62+
```shell
63+
pip install -r requirements.txt
64+
```
65+
66+
The script can be run for all or a subset of the queries as illustrated below:
67+
68+
```shell
69+
python execute.py
70+
python execute.py design/*.sql
71+
python execute.py development/fonts_*.sql
72+
```
73+
74+
By default, it operates in a dry-run mode: it does not run the queries but prints an estimate of the amount of data that would be processed by each query. To actually run the queries, pass the `--no-dry-run` option as follows:
1375

14-
- [📄 Planning doc][~google-doc]
15-
- [📊 Results sheet][~google-sheets]
16-
- [📝 Markdown file][~chapter-markdown]
76+
```shell
77+
python execute.py --no-dry-run
78+
python execute.py --no-dry-run design/*.sql
79+
python execute.py --no-dry-run development/fonts_*.sql
80+
```
1781

18-
[~google-doc]: https://docs.google.com/document/d/1jVc0vgmAY_lBxryItRBguXxEq77mvbaQ3UpbTweUoSI/
19-
[~google-sheets]: https://docs.google.com/spreadsheets/d/1otdu4p_CCI70B4FVzw6k02frStsPMrQoFu7jUim_0Bg/edit
20-
[~chapter-markdown]: https://github.com/HTTPArchive/almanac.httparchive.org/tree/main/src/content/en/2025/fonts.md
82+
[Planning document]: https://docs.google.com/document/d/1jVc0vgmAY_lBxryItRBguXxEq77mvbaQ3UpbTweUoSI
83+
[Results sheet]: https://docs.google.com/spreadsheets/d/1otdu4p_CCI70B4FVzw6k02frStsPMrQoFu7jUim_0Bg
84+
[Chapter content]: https://github.com/HTTPArchive/almanac.httparchive.org/tree/main/src/content/en/2025/fonts.md

sql/2025/fonts/common.sql

Lines changed: 149 additions & 0 deletions
@@ -0,0 +1,149 @@
1+
-- Normalize a family name. Used in FAMILY_INNER.
2+
CREATE TEMPORARY FUNCTION FAMILY_INNER_INNER(name STRING) AS (
3+
CASE
4+
WHEN REGEXP_CONTAINS(name, r'(?i)font\s?awesome') THEN 'Font Awesome'
5+
ELSE IF(LENGTH(TRIM(name)) < 3, NULL, NULLIF(TRIM(name), ''))
6+
END
7+
);
8+
9+
-- Normalize a family name. Used in FAMILY.
10+
CREATE TEMPORARY FUNCTION FAMILY_INNER(name STRING) AS (
11+
FAMILY_INNER_INNER(
12+
REGEXP_REPLACE(
13+
name,
14+
r'(?i)([\s-]?(black|bold|book|cond(ensed)?|demi|ex(tra)?|heavy|italic|light|medium|narrow|regular|semi|thin|ultra|wide|\d00|\d+pt))+$',
15+
''
16+
)
17+
)
18+
);
19+
20+
-- Extract the family name from a payload.
21+
CREATE TEMPORARY FUNCTION FAMILY(payload JSON) AS (
22+
FAMILY_INNER(
23+
COALESCE(
24+
STRING(payload._font_details.names[16]),
25+
STRING(payload._font_details.names[1])
26+
)
27+
)
28+
);
29+
30+
-- Extract the file format from an extension and a MIME type.
31+
CREATE TEMPORARY FUNCTION FILE_FORMAT(extension STRING, type STRING) AS (
32+
LOWER(IFNULL(REGEXP_EXTRACT(type, '/(?:x-)?(?:font-)?(.*)'), extension))
33+
);
34+
35+
-- Normalize a foundry name. Used in FOUNDRY.
36+
CREATE TEMPORARY FUNCTION FOUNDRY_INNER(name STRING) AS (
37+
CASE UPPER(name)
38+
WHEN 'ADBO' THEN 'ADBE'
39+
WHEN 'PFED' THEN 'AWSM'
40+
ELSE NULLIF(TRIM(REGEXP_REPLACE(name, r'[[:cntrl:]]+', '')), '')
41+
END
42+
);
43+
44+
-- Extract the foundry name from a payload.
45+
CREATE TEMPORARY FUNCTION FOUNDRY(payload JSON) AS (
46+
FOUNDRY_INNER(STRING(payload._font_details.OS2.achVendID))
47+
);
48+
49+
-- Infer scripts from codepoints. Used in SCRIPTS.
50+
CREATE TEMPORARY FUNCTION SCRIPTS_INNER(codepoints JSON)
51+
RETURNS ARRAY<STRING>
52+
LANGUAGE js
53+
OPTIONS (library = ["gs://httparchive/lib/text-utils.js"])
54+
AS r"""
55+
if (codepoints && codepoints.length) {
56+
return detectWritingScript(codepoints.map((character) => parseInt(character, 10)), 0.05);
57+
} else {
58+
return [];
59+
}
60+
""";
61+
62+
-- Infer scripts from a payload.
63+
CREATE TEMPORARY FUNCTION SCRIPTS(payload JSON) AS (
64+
SCRIPTS_INNER(payload._font_details.cmap.codepoints)
65+
);
66+
67+
-- Infer the service from a URL.
68+
CREATE TEMPORARY FUNCTION SERVICE(url STRING) AS (
69+
CASE
70+
WHEN REGEXP_CONTAINS(url, r'(fonts|use)\.typekit\.(net|com)') THEN 'Adobe'
71+
WHEN REGEXP_CONTAINS(url, r'cloud\.typenetwork\.com') THEN 'typenetwork.com'
72+
WHEN REGEXP_CONTAINS(url, r'cloud\.typography\.com') THEN 'typography.com'
73+
WHEN REGEXP_CONTAINS(url, r'cloud\.webtype\.com') THEN 'webtype.com'
74+
WHEN REGEXP_CONTAINS(url, r'f\.fontdeck\.com') THEN 'fontdeck.com'
75+
WHEN REGEXP_CONTAINS(url, r'fast\.fonts\.(com|net)\/(jsapi|cssapi)') THEN 'fonts.com'
76+
WHEN REGEXP_CONTAINS(url, r'fnt\.webink\.com') THEN 'webink.com'
77+
WHEN REGEXP_CONTAINS(url, r'fontawesome\.com') THEN 'fontawesome.com'
78+
WHEN REGEXP_CONTAINS(url, r'fonts\.(gstatic|googleapis)\.com|themes.googleusercontent.com/static/fonts|ssl.gstatic.com/fonts') THEN 'Google'
79+
WHEN REGEXP_CONTAINS(url, r'fonts\.typonine\.com') THEN 'typonine.com'
80+
WHEN REGEXP_CONTAINS(url, r'fonts\.typotheque\.com') THEN 'typotheque.com'
81+
WHEN REGEXP_CONTAINS(url, r'kernest\.com') THEN 'kernest.com'
82+
WHEN REGEXP_CONTAINS(url, r'typefront\.com') THEN 'typefront.com'
83+
WHEN REGEXP_CONTAINS(url, r'typesquare\.com') THEN 'typesquare.com'
84+
WHEN REGEXP_CONTAINS(url, r'use\.edgefonts\.net|webfonts\.creativecloud\.com') THEN 'edgefonts.net'
85+
WHEN REGEXP_CONTAINS(url, r'webfont\.fontplus\.jp') THEN 'fontplus.jp'
86+
WHEN REGEXP_CONTAINS(url, r'webfonts\.fontslive\.com') THEN 'fontslive.com'
87+
WHEN REGEXP_CONTAINS(url, r'webfonts\.fontstand\.com') THEN 'fontstand.com'
88+
WHEN REGEXP_CONTAINS(url, r'webfonts\.justanotherfoundry\.com') THEN 'justanotherfoundry.com'
89+
ELSE 'self-hosted'
90+
END
91+
);
92+
93+
-- Extract the color formats from a formats payload and remove spurious entries
94+
-- via a table-sizes payload.
95+
--
96+
-- When nonempty, it is expected that
97+
--
98+
-- * `CBDT` is larger than 2 + 2 bytes,
99+
-- * `COLR` is larger than 2 + 2 + 4 + 4 + 2 (+ 4 + 4 + 4 + 4 + 4) bytes,
100+
-- * `SVG ` is larger than 2 + 4 + 4 + 2 bytes, and
101+
-- * `sbix` is larger than 2 + 2 + 4 + 4 bytes.
102+
--
103+
-- For simplicity, the threshold is set to 50 bytes.
104+
CREATE TEMPORARY FUNCTION COLOR_FORMATS_INNER(formats JSON, table_sizes JSON)
105+
RETURNS ARRAY<STRING>
106+
LANGUAGE js AS '''
107+
try {
108+
return formats.filter((format) => {
109+
const table = `${format} `.slice(0, 4);
110+
return table_sizes[table] > 50;
111+
});
112+
} catch (e) {
113+
return [];
114+
}
115+
''';
116+
117+
-- Extract the color formats from a payload.
118+
CREATE TEMPORARY FUNCTION COLOR_FORMATS(payload JSON) AS (
119+
COLOR_FORMATS_INNER(
120+
payload._font_details.color.formats,
121+
payload._font_details.table_sizes
122+
)
123+
);
124+
125+
-- Check if the font is a color font given its payload.
126+
CREATE TEMPORARY FUNCTION IS_COLOR(payload JSON) AS (
127+
ARRAY_LENGTH(COLOR_FORMATS(payload)) > 0
128+
);
129+
130+
-- Check if the font was successfully parsed given its payload.
131+
CREATE TEMPORARY FUNCTION IS_PARSED(payload JSON) AS (
132+
payload._font_details.table_sizes IS NOT NULL
133+
);
134+
135+
-- Check if the font is a variable font given its payload.
136+
CREATE TEMPORARY FUNCTION IS_VARIABLE(payload JSON) AS (
137+
REGEXP_CONTAINS(
138+
TO_JSON_STRING(payload._font_details.table_sizes),
139+
'(?i)gvar|CFF2'
140+
)
141+
);
142+
143+
-- Extract the variable formats from a payload.
144+
CREATE TEMPORARY FUNCTION VARIABLE_FORMATS(payload JSON) AS (
145+
REGEXP_EXTRACT_ALL(
146+
TO_JSON_STRING(payload._font_details.table_sizes),
147+
'(?i)glyf|CFF2'
148+
)
149+
);
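
For readers who want to try the family normalization outside BigQuery, the following is a rough Python mirror of `FAMILY_INNER` and `FAMILY_INNER_INNER`. It is an approximation: Python's `re` engine is not RE2, so `\Z` and the `re.ASCII` flag are used here to stay close to RE2's `$` and `\s`, and the function name `normalize_family` is made up for this sketch.

```python
import re
from typing import Optional

# Trailing style words, weights such as "700", and sizes such as "12pt".
# `\Z` stands in for RE2's `$`, which matches only at the very end of the text.
STYLE_SUFFIX = re.compile(
    r"([\s-]?(black|bold|book|cond(ensed)?|demi|ex(tra)?|heavy|italic|light"
    r"|medium|narrow|regular|semi|thin|ultra|wide|\d00|\d+pt))+\Z",
    re.IGNORECASE | re.ASCII,
)
FONT_AWESOME = re.compile(r"font\s?awesome", re.IGNORECASE | re.ASCII)


def normalize_family(name: Optional[str]) -> Optional[str]:
    """Strip style suffixes, unify Font Awesome, and drop very short names."""
    if name is None:
        return None
    name = STYLE_SUFFIX.sub("", name)
    if FONT_AWESOME.search(name):
        return "Font Awesome"
    name = name.strip()
    # Names shorter than three characters are treated as missing.
    return name if len(name) >= 3 else None


print(normalize_family("Open Sans Bold"))                 # Open Sans
print(normalize_family("Roboto Condensed Light Italic"))  # Roboto
print(normalize_family("Font Awesome 5 Free"))            # Font Awesome
print(normalize_family("Bold"))                           # None
```

The last call shows a side effect worth keeping in mind: a name consisting only of style words is reduced to nothing and therefore reported as missing.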
Lines changed: 54 additions & 0 deletions
@@ -0,0 +1,54 @@
1+
-- Section: Design
2+
-- Question: Which designers are popular?
3+
-- Normalization: Pages
4+
5+
-- INCLUDE https://github.com/HTTPArchive/almanac.httparchive.org/blob/main/sql/{year}/fonts/common.sql
6+
7+
WITH
8+
designers AS (
9+
SELECT
10+
client,
11+
NULLIF(TRIM(STRING(payload._font_details.names[9])), '') AS designer,
12+
COUNT(DISTINCT page) AS count,
13+
ROW_NUMBER() OVER (PARTITION BY client ORDER BY COUNT(DISTINCT page) DESC) AS rank
14+
FROM
15+
`httparchive.crawl.requests`
16+
WHERE
17+
date = @date AND
18+
type = 'font' AND
19+
is_root_page AND
20+
IS_PARSED(payload)
21+
GROUP BY
22+
client,
23+
designer
24+
QUALIFY
25+
rank <= 100
26+
),
27+
28+
pages AS (
29+
SELECT
30+
client,
31+
COUNT(DISTINCT page) AS total
32+
FROM
33+
`httparchive.crawl.requests`
34+
WHERE
35+
date = @date AND
36+
is_root_page
37+
GROUP BY
38+
client
39+
)
40+
41+
SELECT
42+
client,
43+
designer,
44+
count,
45+
total,
46+
ROUND(count / total, @precision) AS proportion
47+
FROM
48+
designers
49+
JOIN
50+
pages
51+
USING (client)
52+
ORDER BY
53+
client,
54+
count DESC
Lines changed: 42 additions & 0 deletions
@@ -0,0 +1,42 @@
1+
-- Section: Design
2+
-- Question: Which families are used broken down by foundry?
3+
-- Normalization: Requests (parsed only)
4+
5+
-- INCLUDE https://github.com/HTTPArchive/almanac.httparchive.org/blob/main/sql/{year}/fonts/common.sql
6+
7+
WITH
8+
requests AS (
9+
SELECT
10+
client,
11+
FOUNDRY(payload) AS foundry,
12+
FAMILY(payload) AS family,
13+
COUNT(0) OVER (PARTITION BY client) AS total
14+
FROM
15+
`httparchive.crawl.requests`
16+
WHERE
17+
date = @date AND
18+
type = 'font' AND
19+
IS_PARSED(payload) AND
20+
is_root_page
21+
)
22+
23+
SELECT
24+
client,
25+
foundry,
26+
family,
27+
COUNT(0) AS count,
28+
total,
29+
ROUND(COUNT(0) / total, @precision) AS proportion,
30+
ROW_NUMBER() OVER (PARTITION BY client ORDER BY COUNT(0) DESC) AS rank
31+
FROM
32+
requests
33+
GROUP BY
34+
client,
35+
foundry,
36+
family,
37+
total
38+
QUALIFY
39+
rank <= 100
40+
ORDER BY
41+
client,
42+
count DESC
Lines changed: 46 additions & 0 deletions
@@ -0,0 +1,46 @@
1+
-- Section: Design
2+
-- Question: Which families are used broken down by script?
3+
-- Normalization: Requests (parsed only)
4+
5+
-- INCLUDE https://github.com/HTTPArchive/almanac.httparchive.org/blob/main/sql/{year}/fonts/common.sql
6+
7+
WITH
8+
requests AS (
9+
SELECT
10+
client,
11+
SCRIPTS(payload) AS scripts,
12+
FAMILY(payload) AS family,
13+
COUNT(0) OVER (PARTITION BY client) AS total
14+
FROM
15+
`httparchive.crawl.requests`
16+
WHERE
17+
date = @date AND
18+
type = 'font' AND
19+
is_root_page AND
20+
IS_PARSED(payload)
21+
)
22+
23+
SELECT
24+
client,
25+
script,
26+
family,
27+
COUNT(0) AS count,
28+
total AS total,
29+
ROUND(COUNT(0) / total, @precision) AS proportion,
30+
ROW_NUMBER() OVER (PARTITION BY client, script ORDER BY COUNT(0) DESC) AS rank
31+
FROM
32+
requests,
33+
UNNEST(scripts) AS script
34+
WHERE
35+
family != 'Adobe Blank'
36+
GROUP BY
37+
client,
38+
script,
39+
family,
40+
requests.total
41+
QUALIFY
42+
rank <= 10
43+
ORDER BY
44+
client,
45+
script,
46+
count DESC
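
One property of the query above is easy to miss: `total` is computed per client before `UNNEST(scripts)` and before the `Adobe Blank` filter, while each request contributes one row per script it supports. The toy computation below illustrates the consequence with made-up data (the request list is hypothetical): proportions are relative to all parsed font requests, and summed across scripts they can exceed 1.

```python
from collections import Counter

# Hypothetical parsed font requests for one client: (family, scripts).
requests = [
    ("Noto Sans", ["Latin", "Cyrillic"]),
    ("Noto Sans", ["Latin"]),
    ("Roboto", ["Latin", "Greek"]),
    ("Adobe Blank", ["Latin"]),
]

# Counted before unnesting and filtering, as in the `requests` CTE.
total = len(requests)

# One count per (script, family) pair, mirroring UNNEST followed by GROUP BY.
counts = Counter(
    (script, family)
    for family, scripts in requests
    if family != "Adobe Blank"
    for script in scripts
)

rows = sorted(
    (script, family, count, round(count / total, 2))
    for (script, family), count in counts.items()
)
for row in rows:
    print(row)
```

Here the four resulting proportions add up to 1.25, because two of the requests are counted under two scripts each.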