@@ -3,7 +3,6 @@
{
"cell_type": "code",
"execution_count": null,
-"id": "DZ1j6RRg-Td6",
"metadata": {
"cellView": "form",
"id": "f705f4be70e9"
@@ -27,7 +26,6 @@
},
{
"cell_type": "markdown",
-"id": "99c1c3fc2ca5",
"metadata": {
"id": "71a642b5575a"
},
@@ -50,7 +48,6 @@
},
{
"cell_type": "markdown",
-"id": "f9-tJ6RfDLIs",
"metadata": {
"id": "0779b48f654e"
},
@@ -88,7 +85,6 @@
},
{
"cell_type": "markdown",
-"id": "47GcOrZjosOx",
"metadata": {
"id": "69453bf7230e"
},
@@ -98,7 +94,6 @@
},
{
"cell_type": "markdown",
-"id": "1D_pWejJPHP3",
"metadata": {
"id": "bf3706e69f61"
},
@@ -109,15 +104,14 @@
"- Custom model serving TPU v5e cores per region\n",
"- Custom model serving Nvidia A100 80GB GPUs per region\n",
"\n",
-"By default, the quota for TPU deployment `Custom model serving TPU v5e cores per region` is 4, which is sufficient for serving the Llama 3.1 8B model. The Llama 3.1 70B model requires 16 TPU v5e cores. TPU quota is only available in `us-west1`. You can request for higher TPU quota following the instructions at [\"Request a higher quota\"](https://cloud.google.com/docs/quota/view-manage#requesting_higher_quota).\n",
+"By default, the quota for TPU deployment `Custom model serving TPU v5e cores per region` is 4, which is sufficient for serving the Llama 3.1 8B model. The Llama 3.1 70B model requires 16 TPU v5e cores. TPU quota is only available in `us-west1`. You can request for higher TPU quota following the instructions at [\"Request a quota adjustment\"](https://cloud.google.com/docs/quotas/view-manage#requesting_higher_quota).\n",
"\n",
-"The quota for A100_80GB deployment `Custom model serving Nvidia A100 80GB GPUs per region` is 0. You need to request at least 4 for 70B model and 1 for 8B model following the instructions at [\"Request a higher quota\"](https://cloud.google.com/docs/quota/view-manage#requesting_higher_quota)."
+"The quota for A100_80GB deployment `Custom model serving Nvidia A100 80GB GPUs per region` is 0. You need to request at least 4 for 70B model and 1 for 8B model following the instructions at [\"Request a quota adjustment\"](https://cloud.google.com/docs/quotas/view-manage#requesting_higher_quota)."
]
},
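The quota figures quoted in the hunk above (4 TPU v5e cores for Llama 3.1 8B, 16 for 70B; 1 A100 80GB GPU for 8B, 4 for 70B) can be captured in a small lookup helper. This is an illustrative sketch with the counts hard-coded from the text above — `required_quota` is a hypothetical helper, not part of any Google Cloud SDK:

```python
# Minimum per-region serving quota to request, per (model, accelerator).
# Illustrative constants copied from the quota notes above, not read from
# any Google Cloud API.
ACCELERATOR_REQUIREMENTS = {
    ("llama-3.1-8b", "TPU_V5E"): 4,
    ("llama-3.1-70b", "TPU_V5E"): 16,
    ("llama-3.1-8b", "NVIDIA_A100_80GB"): 1,
    ("llama-3.1-70b", "NVIDIA_A100_80GB"): 4,
}


def required_quota(model: str, accelerator: str) -> int:
    """Return the minimum per-region serving quota to request."""
    try:
        return ACCELERATOR_REQUIREMENTS[(model, accelerator)]
    except KeyError:
        raise ValueError(f"No sizing entry for {model} on {accelerator}")


print(required_quota("llama-3.1-70b", "TPU_V5E"))  # 16
```

Comparing the returned count against your current quota in the selected region tells you whether a quota adjustment request is needed before deploying.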
{
"cell_type": "code",
"execution_count": null,
-"id": "L3dqbxovo5t6",
"metadata": {
"cellView": "form",
"id": "50047cc80bb9"
@@ -136,7 +130,7 @@
"\n",
"REGION = \"\" # @param {type:\"string\"}\n",
"\n",
-"# @markdown 4. If you want to run predictions with A100 80GB or H100 GPUs, we recommend using the regions listed below. **NOTE:** Make sure you have associated quota in selected regions. Click the links to see your current quota for each GPU type: [Nvidia A100 80GB](https://console.cloud.google.com/iam-admin/quotas?metric=aiplatform.googleapis.com%2Fcustom_model_serving_nvidia_a100_80gb_gpus), [Nvidia H100 80GB](https://console.cloud.google.com/iam-admin/quotas?metric=aiplatform.googleapis.com%2Fcustom_model_serving_nvidia_h100_gpus). You can request for quota following the instructions at [\"Request a higher quota\"](https://cloud.google.com/docs/quota/view-manage#requesting_higher_quota).\n",
+"# @markdown 4. If you want to run predictions with A100 80GB or H100 GPUs, we recommend using the regions listed below. **NOTE:** Make sure you have associated quota in selected regions. Click the links to see your current quota for each GPU type: [Nvidia A100 80GB](https://console.cloud.google.com/iam-admin/quotas?metric=aiplatform.googleapis.com%2Fcustom_model_serving_nvidia_a100_80gb_gpus), [Nvidia H100 80GB](https://console.cloud.google.com/iam-admin/quotas?metric=aiplatform.googleapis.com%2Fcustom_model_serving_nvidia_h100_gpus). You can request for quota following the instructions at [\"Request a quota adjustment\"](https://cloud.google.com/docs/quotas/view-manage#requesting_higher_quota).\n",
"\n",
"# @markdown | Machine Type | Accelerator Type | Recommended Regions |\n",
"# @markdown | ----------- | ----------- | ----------- |\n",
@@ -223,7 +217,6 @@
},
{
"cell_type": "markdown",
-"id": "SeGqxuMfRBS5",
"metadata": {
"id": "4782dd003acb"
},
@@ -234,7 +227,6 @@
{
"cell_type": "code",
"execution_count": null,
-"id": "BxlzWU2KQqmw",
"metadata": {
"cellView": "form",
"id": "798068fc0355"
@@ -274,7 +266,6 @@
},
{
"cell_type": "markdown",
-"id": "JpNBJJgjWL7j",
"metadata": {
"id": "10ed490e28e5"
},
@@ -302,7 +293,6 @@
},
{
"cell_type": "markdown",
-"id": "9gZJ8cB27e1m",
"metadata": {
"id": "30ddb93fdd7b"
},
@@ -317,7 +307,6 @@
{
"cell_type": "code",
"execution_count": null,
-"id": "RpmoA2nXjdCd",
"metadata": {
"cellView": "form",
"id": "b56d82c1aa6f"
@@ -506,7 +495,6 @@
{
"cell_type": "code",
"execution_count": null,
-"id": "5QoK8c0R9U3B",
"metadata": {
"cellView": "form",
"id": "96c5afed49b4"
@@ -572,7 +560,6 @@
{
"cell_type": "code",
"execution_count": null,
-"id": "29rn5ATmB2YC",
"metadata": {
"cellView": "form",
"id": "9a95c9f90358"
@@ -646,7 +633,6 @@
},
{
"cell_type": "markdown",
-"id": "KjbM8E9DGuuR",
"metadata": {
"id": "12ad6d1ff725"
},
@@ -657,7 +643,6 @@
{
"cell_type": "code",
"execution_count": null,
-"id": "JpLU7GRQGuuR",
"metadata": {
"cellView": "form",
"id": "1ab4e3bb74b4"
@@ -684,7 +669,6 @@
},
{
"cell_type": "markdown",
-"id": "XZ33HhYmOxCS",
"metadata": {
"id": "7a8a9a1b2ddf"
},
@@ -706,7 +690,6 @@
{
"cell_type": "code",
"execution_count": null,
-"id": "E8OiHHNNE_wj",
"metadata": {
"cellView": "form",
"id": "4425cc0bdedc"
@@ -910,7 +893,6 @@
{
"cell_type": "code",
"execution_count": null,
-"id": "zex1oXl36A70",
"metadata": {
"cellView": "form",
"id": "bcbafec839cd"
@@ -973,7 +955,6 @@
{
"cell_type": "code",
"execution_count": null,
-"id": "gDOC_nfsJeUR",
"metadata": {
"cellView": "form",
"id": "e984f43422d5"
@@ -1047,7 +1028,6 @@
},
{
"cell_type": "markdown",
-"id": "GdGxaTirJeUR",
"metadata": {
"id": "dff0d10dcc20"
},
@@ -1058,7 +1038,6 @@
{
"cell_type": "code",
"execution_count": null,
-"id": "OgoqXE-VJeUR",
"metadata": {
"cellView": "form",
"id": "5b8751773e7f"
@@ -1085,7 +1064,6 @@
},
{
"cell_type": "markdown",
-"id": "w4Guijaw_NEs",
"metadata": {
"id": "863775857a46"
},
@@ -1100,7 +1078,6 @@
},
{
"cell_type": "markdown",
-"id": "ml8fgoIQWSbY",
"metadata": {
"id": "565cbdc3a06b"
},
@@ -1142,7 +1119,6 @@
},
{
"cell_type": "markdown",
-"id": "NmWRro8Q-Td6",
"metadata": {
"id": "94eaa9050abb"
},
@@ -1153,7 +1129,6 @@
{
"cell_type": "code",
"execution_count": null,
-"id": "72d1GlrYifKU",
"metadata": {
"cellView": "form",
"id": "5f358cc230a6"
@@ -1470,7 +1445,6 @@
{
"cell_type": "code",
"execution_count": null,
-"id": "CNiItf5hdVFU",
"metadata": {
"cellView": "form",
"id": "be3170e0e05a"
@@ -1546,7 +1520,6 @@
},
{
"cell_type": "markdown",
-"id": "WahYGAZyq6Gl",
"metadata": {
"id": "30c5d2535df3"
},
@@ -1556,7 +1529,6 @@
},
{
"cell_type": "markdown",
-"id": "bV5Yjkgav9BZ",
"metadata": {
"id": "63c10917ff95"
},
@@ -1567,7 +1539,6 @@
{
"cell_type": "code",
"execution_count": null,
-"id": "qsks36cOH9rb",
"metadata": {
"cellView": "form",
"id": "92892e1b1730"
@@ -95,7 +95,7 @@
"\n",
"# @markdown 1. [Make sure that billing is enabled for your project](https://cloud.google.com/billing/docs/how-to/modify-project).\n",
"\n",
-"# @markdown 2. By default, the quota for TPU deployment `Custom model serving TPU v5e cores per region` is 4. TPU quota is only available in `us-west1`. You can request for higher TPU quota following the instructions at [\"Request a higher quota\"](https://cloud.google.com/docs/quota/view-manage#requesting_higher_quota).\n",
+"# @markdown 2. By default, the quota for TPU deployment `Custom model serving TPU v5e cores per region` is 4. TPU quota is only available in `us-west1`. You can request for higher TPU quota following the instructions at [\"Request a quota adjustment\"](https://cloud.google.com/docs/quotas/view-manage#requesting_higher_quota).\n",
"\n",
"# @markdown 3. **[Optional]** [Create a Cloud Storage bucket](https://cloud.google.com/storage/docs/creating-buckets) for storing experiment outputs. Set the BUCKET_URI for the experiment environment. The specified Cloud Storage bucket (`BUCKET_URI`) should be located in the same region as where the notebook was launched. Note that a multi-region bucket (eg. \"us\") is not considered a match for a single region covered by the multi-region range (eg. \"us-central1\"). If not set, a unique GCS bucket will be created instead.\n",
"\n",
@@ -434,7 +434,6 @@
" )\n",
" return model, endpoint\n",
"\n",
-"\n",
"models[\"hexllm_tpu\"], endpoints[\"hexllm_tpu\"] = deploy_model_hexllm(\n",
" model_name=common_util.get_job_name_with_datetime(prefix=MODEL_ID),\n",
" model_id=model_id,\n",
@@ -659,6 +658,7 @@
" vllm_args.append(\"--enable-auto-tool-choice\")\n",
" vllm_args.append(\"--tool-call-parser=vertex-llama-3\")\n",
"\n",
+"\n",
" env_vars = {\n",
" \"MODEL_ID\": base_model_id,\n",
" \"DEPLOY_SOURCE\": \"notebook\",\n",
@@ -704,7 +704,6 @@
"\n",
" return model, endpoint\n",
"\n",
-"\n",
"models[\"vllm_gpu\"], endpoints[\"vllm_gpu\"] = deploy_model_vllm(\n",
" model_name=common_util.get_job_name_with_datetime(prefix=\"codegemma-serve-vllm\"),\n",
" model_id=model_id,\n",
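The hunks above toggle vLLM server flags such as `--enable-auto-tool-choice` and `--tool-call-parser=vertex-llama-3` before passing them to the serving container. A minimal sketch of how such an argument list is typically assembled — the flag strings are copied from the diff, while `build_vllm_args` and its `enable_tool_calling` switch are hypothetical names for illustration:

```python
def build_vllm_args(
    model_id: str,
    tensor_parallel_size: int = 1,
    enable_tool_calling: bool = False,
) -> list[str]:
    """Assemble CLI args for a vLLM serving container (illustrative)."""
    # Base serving arguments; names mirror common vLLM server flags.
    vllm_args = [
        f"--model={model_id}",
        f"--tensor-parallel-size={tensor_parallel_size}",
    ]
    # Tool calling is opt-in, matching the conditional appends in the diff.
    if enable_tool_calling:
        vllm_args.append("--enable-auto-tool-choice")
        vllm_args.append("--tool-call-parser=vertex-llama-3")
    return vllm_args


args = build_vllm_args("meta-llama/Llama-3.1-8B", enable_tool_calling=True)
```

Keeping the conditional flags in one helper like this makes it easy to see which serving features a given deployment call actually enables.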
@@ -97,7 +97,7 @@
"\n",
"REGION = \"\" # @param {type:\"string\"}\n",
"\n",
-"# @markdown 3. If you want to run predictions with A100 80GB or H100 GPUs, we recommend using the regions listed below. **NOTE:** Make sure you have associated quota in selected regions. Click the links to see your current quota for each GPU type: [Nvidia A100 80GB](https://console.cloud.google.com/iam-admin/quotas?metric=aiplatform.googleapis.com%2Fcustom_model_serving_nvidia_a100_80gb_gpus), [Nvidia H100 80GB](https://console.cloud.google.com/iam-admin/quotas?metric=aiplatform.googleapis.com%2Fcustom_model_serving_nvidia_h100_gpus). You can request for quota following the instructions at [\"Request a higher quota\"](https://cloud.google.com/docs/quota/view-manage#requesting_higher_quota).\n",
+"# @markdown 3. If you want to run predictions with A100 80GB or H100 GPUs, we recommend using the regions listed below. **NOTE:** Make sure you have associated quota in selected regions. Click the links to see your current quota for each GPU type: [Nvidia A100 80GB](https://console.cloud.google.com/iam-admin/quotas?metric=aiplatform.googleapis.com%2Fcustom_model_serving_nvidia_a100_80gb_gpus), [Nvidia H100 80GB](https://console.cloud.google.com/iam-admin/quotas?metric=aiplatform.googleapis.com%2Fcustom_model_serving_nvidia_h100_gpus). You can request for quota following the instructions at [\"Request a quota adjustment\"](https://cloud.google.com/docs/quotas/view-manage#requesting_higher_quota).\n",
"\n",
"# @markdown | Machine Type | Accelerator Type | Recommended Regions |\n",
"# @markdown | ----------- | ----------- | ----------- |\n",
@@ -176,9 +176,7 @@
"# @markdown You can also filter by model name.\n",
"model_filter = \"gemma\" # @param {type:\"string\"}\n",
"\n",
-"model_garden.list_deployable_models(\n",
-" list_hf_models=list_hf_models, model_filter=model_filter\n",
-")"
+"model_garden.list_deployable_models(list_hf_models=list_hf_models, model_filter=model_filter)"
]
},
{
@@ -228,10 +226,10 @@
"endpoints[LABEL] = model.deploy(\n",
" hugging_face_access_token=HF_TOKEN,\n",
" use_dedicated_endpoint=use_dedicated_endpoint,\n",
-" accept_eula=True, # Accept the End User License Agreement (EULA) on the model card before deploy. Otherwise, the deployment will be forbidden.\n",
+" accept_eula = True, # Accept the End User License Agreement (EULA) on the model card before deploy. Otherwise, the deployment will be forbidden.\n",
")\n",
"\n",
-"endpoint = endpoints[LABEL]"
+"endpoint=endpoints[LABEL]"
]
},
{
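Once `model.deploy(...)` in the hunk above returns, predictions go through the resulting endpoint's `predict(instances=...)` call. A sketch of building one instance payload — the field names (`prompt`, `max_tokens`, `temperature`) follow a common convention for vLLM-served text models and are assumptions, not schemas taken from this diff:

```python
def build_instance(
    prompt: str,
    max_tokens: int = 128,
    temperature: float = 0.7,
) -> dict:
    """Build one prediction instance (illustrative payload shape)."""
    # Typical request fields for a vLLM-served text model; adjust to the
    # schema your deployed serving container actually expects.
    return {
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }


instances = [build_instance("Write a haiku about GPU quotas.")]
# response = endpoint.predict(instances=instances)  # needs a live endpoint
```

Keeping payload construction separate from the deploy code makes it easy to reuse the same instances against different endpoints (e.g. the TPU and GPU deployments created earlier in the notebook).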