
Commit 5e8aba5

Merge branch 'main' of github.com:HTTPArchive/almanac.httparchive.org into production
2 parents 71ac0a6 + dd62459 commit 5e8aba5

13 files changed

Lines changed: 875 additions & 125 deletions


.github/workflows/code-static-analysis.yml

Lines changed: 1 addition & 1 deletion
@@ -35,7 +35,7 @@ jobs:
         uses: actions/checkout@v4
       - name: Set up Python 3.8
         if: ${{ matrix.language == 'python' }}
-        uses: actions/setup-python@v5.0.0
+        uses: actions/setup-python@v5.1.0
         with:
           python-version: '3.8'
       - name: Install dependencies

.github/workflows/lintsql.yml

Lines changed: 1 addition & 1 deletion
@@ -19,7 +19,7 @@ jobs:
           # Full git history is needed to get a proper list of changed files within `super-linter`
           fetch-depth: 0
       - name: Set up Python 3.8
-        uses: actions/setup-python@v5.0.0
+        uses: actions/setup-python@v5.1.0
         with:
           python-version: '3.8'
       - name: Lint SQL code

.github/workflows/predeploy.yml

Lines changed: 1 addition & 1 deletion
@@ -37,7 +37,7 @@ jobs:
         with:
           node-version: '16'
       - name: Set up Python 3.8
-        uses: actions/setup-python@v5.0.0
+        uses: actions/setup-python@v5.1.0
         with:
           python-version: '3.8'
       - name: Install Asian Fonts

.github/workflows/progress-tracker.yml

Lines changed: 3 additions & 3 deletions
@@ -4,9 +4,9 @@ env:
   # Update these environment variables every year
   # And check the tracker is enabled at:
   # https://github.com/HTTPArchive/almanac.httparchive.org/actions/workflows/progress-tracker.yml
-  FILTER_LABEL: '2022 chapter'
-  TRACKER_ISSUE_NUMBER: 2908
-  SQL_BASE_DIR: 'https://github.com/HTTPArchive/almanac.httparchive.org/tree/main/sql/2022/'
+  FILTER_LABEL: '2024 chapter'
+  TRACKER_ISSUE_NUMBER: 3634
+  SQL_BASE_DIR: 'https://github.com/HTTPArchive/almanac.httparchive.org/tree/main/sql/2024/'

 on:
   workflow_dispatch:

.github/workflows/test_website.yml

Lines changed: 1 addition & 1 deletion
@@ -30,7 +30,7 @@ jobs:
         with:
           node-version: '16'
       - name: Set up Python 3.8
-        uses: actions/setup-python@v5.0.0
+        uses: actions/setup-python@v5.1.0
         with:
           python-version: '3.8'
       - name: Run the website

README.md

Lines changed: 1 addition & 1 deletion
@@ -30,7 +30,7 @@ See [our contributing guide](CONTRIBUTING.md). To run the Web Almanac locally pl

 ## Next Web Almanac timeline

-We are [taking a break for 2023](https://github.com/HTTPArchive/almanac.httparchive.org/discussions/33610), but hope to return in future.
+We [took a break in 2023](https://github.com/HTTPArchive/almanac.httparchive.org/discussions/33610), and are currently working on the 2024 version. If you would like to get involved, here's a [list of chapters](https://github.com/HTTPArchive/almanac.httparchive.org/issues?q=is%3Aissue+is%3Aopen+label%3A%222024+chapter%22) looking for help.

 In the meantime, enjoy the [2022 edition](https://almanac.httparchive.org) and we're still open to contributions in the form of [translations](https://github.com/HTTPArchive/almanac.httparchive.org/issues?q=is%3Aissue+is%3Aopen+label%3Atranslation), [development](https://github.com/HTTPArchive/almanac.httparchive.org/issues?q=is%3Aissue+is%3Aopen+label%3Adevelopment) or bug fixes.

sql/util/README.md

Lines changed: 5 additions & 1 deletion
@@ -24,10 +24,14 @@ This query generates a list of candidate URLs for manifest and service worker fi

 The `almanac.manifests` and `almanac.service_workers` tables depend on the `pwa_candidates` table. Running these queries will generate the latest data that can be appended to their respective tables.

-## green_web_foundation
+## [green_web_foundation.sql](./green_web_foundation.sql)

 1. Go to https://admin.thegreenwebfoundation.org/admin/green-urls
 2. Scroll to the bottom for the latest database dump
 3. Convert to a BQ-compatible format, ie CSV
 4. Import into a temporary BQ table
 5. Join with the date-partitioned `green_web_foundation` table
+
+## [bq_sql_to_spreadsheet.ipynb](./bq_to_sheets.ipynb)
+
+This Jupyter notebook runs BigQuery SQL queries for a chapter and saves the results to a Google Sheet. It uses the `gspread` library to interact with Google Sheets.
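
As a rough illustration of the selection step the notebook performs before any query runs, the sketch below applies its include/exclude matching and sheet-title derivation to plain file names. The regular expressions are copied from the notebook; the helper names `sheet_title_for` and `select_queries` and the sample file list are hypothetical and exist only for this example.

```python
import re

# Patterns copied from the notebook. The helpers below are hypothetical names.
FILE_MATCH_INCLUDE = (
    r"number_of_websites_with_features_based_on_string_search.sql"
    r"|number_of_websites_with_origin_trial_from_token.sql"
)
FILE_MATCH_EXCLUDE = r"^$"  # matches only an empty name, so nothing is excluded


def sheet_title_for(filename: str) -> str:
    """Turn a query file name into a worksheet title, using the notebook's regex."""
    return re.sub(r"(\.sql|[^a-zA-Z0-9]+)", " ", filename).strip().title()


def select_queries(filenames):
    """Return (filename, sheet title) pairs for names that pass both filters."""
    selected = []
    for name in filenames:
        included = re.search(FILE_MATCH_INCLUDE, name)
        excluded = re.search(FILE_MATCH_EXCLUDE, name)
        if included and not excluded:
            selected.append((name, sheet_title_for(name)))
    return selected


if __name__ == "__main__":
    sample = [
        "number_of_websites_with_origin_trial_from_token.sql",
        "some_other_query.sql",  # made-up file name for illustration
    ]
    for name, title in select_queries(sample):
        print(name, "->", title)
```

Because the include pattern is passed to `re.search`, an unescaped `.` in it matches any character; escaping the dots would make the match stricter, at the cost of a less readable pattern.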

sql/util/bq_to_sheets.ipynb

Lines changed: 232 additions & 0 deletions
@@ -0,0 +1,232 @@
+{
+  "cells": [
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "view-in-github",
+        "colab_type": "text"
+      },
+      "source": [
+        "<a href=\"https://colab.research.google.com/github/HTTPArchive/almanac.httparchive.org/blob/fellow-vicuna/sql/util/bq_to_sheets.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "# Almanac\n",
+        "CHAPTER = \"privacy\"\n",
+        "YEAR = \"2024\"\n",
+        "\n",
+        "# BigQuery\n",
+        "GCP_PROJECT = \"httparchive\"\n",
+        "\n",
+        "# Git\n",
+        "BRANCH_NAME = \"{chapter}-sql-{year}\".format(\n",
+        "    chapter=CHAPTER,\n",
+        "    year=YEAR\n",
+        ")\n",
+        "\n",
+        "# SQL folder\n",
+        "folder = r'almanac.httparchive.org/sql/{year}/{chapter}/*.sql'.format(\n",
+        "    year=YEAR,\n",
+        "    chapter=CHAPTER\n",
+        ")\n",
+        "\n",
+        "# Google Sheets\n",
+        "spreadsheet_name = \"{chapter} (Web Almanac {year})\".format(\n",
+        "    chapter=CHAPTER.capitalize(),\n",
+        "    year=YEAR\n",
+        ")\n",
+        "\n",
+        "# Set to `None` to create new one or an existing spreadsheet URL.\n",
+        "existing_spreadsheet_url = 'https://docs.google.com/spreadsheets/d/1U6DTYxxhDWf-39Fr0o1Jq2r1RUVa4EbyxIZu-wqrso0/edit'"
+      ],
+      "metadata": {
+        "id": "U37785Bxt5tE"
+      },
+      "execution_count": 1,
+      "outputs": []
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 2,
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/"
+        },
+        "id": "OVkCxlRQH6Yt",
+        "outputId": "9fb31f97-8541-461a-991f-e7932da56101"
+      },
+      "outputs": [
+        {
+          "output_type": "stream",
+          "name": "stdout",
+          "text": [
+            "Cloning into 'almanac.httparchive.org'...\n",
+            "remote: Enumerating objects: 43942, done.\u001b[K\n",
+            "remote: Counting objects: 100% (5935/5935), done.\u001b[K\n",
+            "remote: Compressing objects: 100% (1535/1535), done.\u001b[K\n",
+            "remote: Total 43942 (delta 4709), reused 4950 (delta 4391), pack-reused 38007\u001b[K\n",
+            "Receiving objects: 100% (43942/43942), 384.14 MiB | 29.81 MiB/s, done.\n",
+            "Resolving deltas: 100% (29622/29622), done.\n",
+            "Updating files: 100% (5472/5472), done.\n"
+          ]
+        }
+      ],
+      "source": [
+        "# Download repo\n",
+        "!git clone -b $BRANCH_NAME https://github.com/HTTPArchive/almanac.httparchive.org.git"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 3,
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/"
+        },
+        "id": "UzhgG5xvbQ1E",
+        "outputId": "4dfc6202-2034-49bd-a77c-5a6e00e01bea"
+      },
+      "outputs": [
+        {
+          "output_type": "stream",
+          "name": "stdout",
+          "text": [
+            "Already on 'privacy-sql-2024'\n",
+            "Your branch is up to date with 'origin/privacy-sql-2024'.\n",
+            "Already up to date.\n"
+          ]
+        }
+      ],
+      "source": [
+        "# Update local branch\n",
+        "!cd almanac.httparchive.org/ && git checkout $BRANCH_NAME && git pull"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 4,
+      "metadata": {
+        "id": "45dBifFPJAtO"
+      },
+      "outputs": [],
+      "source": [
+        "# Authenticate\n",
+        "import google.auth\n",
+        "import os\n",
+        "from google.colab import auth\n",
+        "from google.cloud import bigquery\n",
+        "\n",
+        "import gspread\n",
+        "from gspread_dataframe import set_with_dataframe\n",
+        "\n",
+        "os.environ[\"GOOGLE_CLOUD_PROJECT\"] = GCP_PROJECT\n",
+        "auth.authenticate_user()\n",
+        "credentials, project = google.auth.default()\n",
+        "client = bigquery.Client()\n",
+        "gc = gspread.authorize(credentials)"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 5,
+      "metadata": {
+        "colab": {
+          "base_uri": "https://localhost:8080/"
+        },
+        "id": "nblNil985Tjt",
+        "outputId": "ccde5268-430c-4ecc-b99c-fce20d061ec8"
+      },
+      "outputs": [
+        {
+          "output_type": "stream",
+          "name": "stdout",
+          "text": [
+            "Using existing spreadsheet: https://docs.google.com/spreadsheets/d/1U6DTYxxhDWf-39Fr0o1Jq2r1RUVa4EbyxIZu-wqrso0\n"
+          ]
+        }
+      ],
+      "source": [
+        "import glob\n",
+        "import re\n",
+        "\n",
+        "# Build Sheets\n",
+        "try:\n",
+        "    ss = gc.open_by_url(existing_spreadsheet_url)\n",
+        "    print('Using existing spreadsheet:', ss.url)\n",
+        "except:\n",
+        "    ss = gc.create(spreadsheet_name)\n",
+        "    print('Created a new spreadsheet:', spreadsheet_name, ss.url)\n",
+        "existing_sheets = [s.title for s in ss.worksheets()]\n",
+        "\n",
+        "file_match_include = r\"number_of_websites_with_features_based_on_string_search.sql\"+\"|\"+ \\\n",
+        "    \"number_of_websites_with_origin_trial_from_token.sql\"\n",
+        "\n",
+        "file_match_exclude = r\"^$\"\n",
+        "\n",
+        "overwrite = False\n",
+        "dry_run = True\n",
+        "tb_processed_limit = 0.1\n",
+        "\n",
+        "# Find matching .sql queries in folder and save to google sheet.\n",
+        "for filepath in glob.iglob(folder):\n",
+        "    filename = filepath.split('/')[-1]\n",
+        "    sheet_title = re.sub(r\"(\\.sql|[^a-zA-Z0-9]+)\", \" \", filename).strip().title()\n",
+        "\n",
+        "    if re.search(file_match_include, filename) and not re.search(file_match_exclude, filename):\n",
+        "\n",
+        "        print('Processing:', sheet_title)\n",
+        "        with open(filepath) as f:\n",
+        "            query = f.read()\n",
+        "\n",
+        "        response = client.query(\n",
+        "            query,\n",
+        "            job_config = bigquery.QueryJobConfig(dry_run = True)\n",
+        "        )\n",
+        "\n",
+        "        tb_processed = response.total_bytes_processed/1024/1024/1024/1024\n",
+        "        print(f\"Total Tb billed:{tb_processed:9.3f}\")\n",
+        "\n",
+        "        if dry_run:\n",
+        "            continue\n",
+        "\n",
+        "        if tb_processed > tb_processed_limit:\n",
+        "            print('Data volume hit the limit. Skipping:', sheet_title)\n",
+        "            continue\n",
+        "\n",
+        "        if sheet_title in existing_sheets:\n",
+        "            if not overwrite:\n",
+        "                print('Overwrite is False. Skipping:', sheet_title)\n",
+        "                continue\n",
+        "\n",
+        "            else:\n",
+        "                st = ss.worksheet(sheet_title)\n",
+        "                ss.del_worksheet(st)\n",
+        "\n",
+        "        df = client.query(query).to_dataframe()\n",
+        "        rows, cols = df.shape\n",
+        "\n",
+        "        st = ss.add_worksheet(title = sheet_title, rows = rows, cols = cols)\n",
+        "        set_with_dataframe(st, df)\n",
+        "\n",
+        "    else:\n",
+        "        print('Not Matched. Skipping:', sheet_title)"
+      ]
+    }
+  ],
+  "metadata": {
+    "colab": {
+      "provenance": []
+    },
+    "kernelspec": {
+      "display_name": "Python 3",
+      "name": "python3"
+    },
+    "language_info": {
+      "name": "python"
+    }
+  },
+  "nbformat": 4,
+  "nbformat_minor": 0
+}
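
The notebook's last cell checks a dry-run estimate of bytes processed before it runs a query for real, then decides whether to skip, replace, or create a worksheet. Below is a minimal sketch of that decision order, separated from BigQuery and Google Sheets so it can be read on its own. The function `decide` and its return labels are hypothetical names introduced for this example; only the order of the checks and the bytes-to-TiB conversion follow the notebook.

```python
TIB = 1024 ** 4  # bytes per tebibyte, same as dividing by 1024 four times


def bytes_to_tib(total_bytes: int) -> float:
    """Convert a byte count (e.g. a dry-run estimate) to tebibytes."""
    return total_bytes / TIB


def decide(total_bytes, *, dry_run, limit_tib, sheet_exists, overwrite):
    """Return the action for one query, checked in the notebook's order.

    The labels are hypothetical:
      'dry-run'   -> only report the estimate, never execute
      'too-large' -> estimate exceeds the limit, skip the query
      'keep'      -> worksheet exists and overwrite is off, skip
      'replace'   -> worksheet exists and overwrite is on
      'create'    -> no worksheet yet, run the query and add one
    """
    if dry_run:
        return "dry-run"
    if bytes_to_tib(total_bytes) > limit_tib:
        return "too-large"
    if sheet_exists:
        return "replace" if overwrite else "keep"
    return "create"


if __name__ == "__main__":
    estimate = TIB // 2  # pretend the dry run reported half a tebibyte
    print(decide(estimate, dry_run=False, limit_tib=0.1,
                 sheet_exists=False, overwrite=False))
```

With the committed defaults (`dry_run = True`, `overwrite = False`, `tb_processed_limit = 0.1`), every matching query would stop at the first check, so the notebook only reports estimates until `dry_run` is switched off.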

src/config/last_updated.json

Lines changed: 5 additions & 0 deletions
@@ -1648,6 +1648,11 @@
     "date_modified": "2024-02-18T00:00:00.000Z",
     "hash": "55b44b08b6f4ca26d87b4498490be742"
   },
+  "ja/2022/chapters/privacy.html": {
+    "date_published": "2024-05-07T00:00:00.000Z",
+    "date_modified": "2024-05-07T00:00:00.000Z",
+    "hash": "025f7034129e8d56d6b4a7ee0d699762"
+  },
   "ja/2022/chapters/seo.html": {
     "date_published": "2023-10-19T00:00:00.000Z",
     "date_modified": "2023-10-19T00:00:00.000Z",
