<!--
 Copyright 2026 Google LLC

 Licensed under the Apache License, Version 2.0 (the "License");
 you may not use this file except in compliance with the License.
 You may obtain a copy of the License at

     http://www.apache.org/licenses/LICENSE-2.0

 Unless required by applicable law or agreed to in writing, software
 distributed under the License is distributed on an "AS IS" BASIS,
 WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 See the License for the specific language governing permissions and
 limitations under the License.
-->

---
name: Vertex AI Model Garden Deploy
description: Deploy open models or custom weights to Vertex AI endpoints.
---

# Vertex AI Model Garden Deploy Skill

This skill provides instructions for deploying open models from Vertex AI Model
Garden to endpoints, and subsequently undeploying them to clean up resources.

## 1. Prerequisites

Before deploying, ensure you have the correct project and region set. The
commands below use placeholder variables `PROJECT_ID` and `LOCATION_ID`.

Ensure you are authenticated:

```bash
gcloud auth login
gcloud auth application-default login
gcloud config set project $PROJECT_ID
```

## 2. Discovering Deployable Models

You can list models available in Model Garden and check whether they can be
self-deployed:

```bash
gcloud ai model-garden models list
```

To see which machine types and accelerators are supported for a specific model
(e.g., `google/gemma3@gemma-3-27b-it`):

```bash
gcloud ai model-garden models list-deployment-config \
  --model="google/gemma3@gemma-3-27b-it"
```

> [!NOTE]
> Some models, especially Hugging Face models, might require a Hugging Face
> Access Token for deployment.

> [!TIP]
> **Model Recommendation Instructions:** If a user asks to deploy a model but
> **does not specify which one**, recommend a model based on their use case
> (e.g., Llama 3.3 70B for general purpose or Gemma 3 for lightweight tasks).
>
> - You **MUST** recommend the **latest** or **most popular** version of the
>   suggested model family.
> - You **MUST** verify the model is currently deployable using
>   `gcloud ai model-garden models list` before suggesting it to the user.

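The deployability check can be scripted before making a recommendation. A
minimal sketch; `list_models` is a stub standing in for the real
`gcloud ai model-garden models list` call so the snippet runs standalone:

```bash
#!/bin/bash
# Stub for `gcloud ai model-garden models list`; swap in the real command.
list_models() { echo "google/gemma3@gemma-3-27b-it"; }

FAMILY="gemma3"
if list_models | grep -qi "$FAMILY"; then
  MSG="$FAMILY is available for deployment"
else
  MSG="$FAMILY was not found in Model Garden"
fi
echo "$MSG"
```
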
## 3. Deploying a Model

> [!WARNING]
> Deploying models, especially large ones, consumes significant compute
> resources and incurs costs.
>
> 1. You **MUST** refer to
>    [Vertex AI prediction pricing](https://cloud.google.com/vertex-ai/pricing#prediction-and-explanation)
>    to calculate a rough cost estimate based on the requested `--machine-type`
>    and `--accelerator-type` (and count).
> 2. You **MUST** present this cost estimate to the user and warn them that it
>    is the **list price**, which may differ from their actual bill due to
>    potential discounts or reservations.
> 3. You **MUST ALWAYS** request explicit confirmation from the user agreeing
>    to the estimated cost before executing any `deploy` command.
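
The list-price arithmetic is simple enough to script. A minimal sketch, using
placeholder rates that are assumptions for illustration (always read the
current values off the pricing page):

```bash
#!/bin/bash
# Placeholder list prices in USD/hour -- NOT real quotes; look up the current
# rates on the Vertex AI prediction pricing page before presenting an estimate.
MACHINE_HOURLY=4.00   # hypothetical rate for the machine type
ACCEL_HOURLY=0.70     # hypothetical rate per accelerator
ACCEL_COUNT=4

# Hourly list price = machine rate + (accelerator rate * count)
HOURLY=$(awk "BEGIN {printf \"%.2f\", $MACHINE_HOURLY + $ACCEL_HOURLY * $ACCEL_COUNT}")
# ~730 hours in a month
MONTHLY=$(awk "BEGIN {printf \"%.0f\", $HOURLY * 730}")
echo "Estimated list price: \$$HOURLY/hour (~\$$MONTHLY/month)"
```

Present the result as a list price only; actual billing can differ with
discounts or reservations.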

To deploy a model, use the `deploy` command. It is highly recommended to use
the `--asynchronous` flag for long-running deployments, and then poll the
status as needed.

### Example: Deploying Gemma 3

Here is a typical bash script to deploy a model. You can run this block
directly.

```bash
#!/bin/bash
# Example script to deploy a model from Model Garden

PROJECT_ID=$(gcloud config get-value project)
LOCATION_ID="us-central1"                 # Recommended default region
MODEL_ID="google/gemma3@gemma-3-27b-it"   # Replace with your chosen model ID

echo "Deploying model $MODEL_ID to project $PROJECT_ID in $LOCATION_ID..."

# Model Garden can automatically select the required hardware based on the
# list-deployment-config if hardware params are omitted.
# Below is a comprehensive command with all supported parameters:
gcloud ai model-garden models deploy \
  --project=$PROJECT_ID \
  --region=$LOCATION_ID \
  --model=$MODEL_ID \
  --machine-type="g2-standard-48" \
  --accelerator-type="NVIDIA_L4" \
  --accelerator-count=4 \
  --endpoint-display-name="my-gemma-deployment" \
  --hugging-face-access-token="YOUR_HF_TOKEN" \
  --reservation-affinity="reservation-affinity-type=specific-reservation,key=compute.googleapis.com/reservation-name,values=my-reservation" \
  --asynchronous

echo "Deployment initiated asynchronously."
echo "Check the Google Cloud Console (Vertex AI -> Online Prediction) for status."
```

### Example: Deploying Custom Weights

To deploy a model using custom weights, use the exact same `deploy` command.
Instead of providing the Model Garden model ID, pass the Google Cloud Storage
(GCS) URI of your custom weights folder in the `--model` flag.

```bash
#!/bin/bash
# Example script to deploy a model with custom weights from a GCS bucket

PROJECT_ID=$(gcloud config get-value project)
LOCATION_ID="us-central1"
# Replace with the gs:// URI pointing to your custom weights
MODEL_GCS_URI="gs://your-bucket-name/path/to/custom-weights"

echo "Deploying custom model from $MODEL_GCS_URI to project $PROJECT_ID in $LOCATION_ID..."

gcloud ai model-garden models deploy \
  --project=$PROJECT_ID \
  --region=$LOCATION_ID \
  --model=$MODEL_GCS_URI \
  --machine-type="g2-standard-12" \
  --accelerator-type="NVIDIA_L4" \
  --endpoint-display-name="my-custom-model" \
  --asynchronous

echo "Deployment initiated asynchronously."
```

## 4. Checking Deployment Status

When you deploy a model asynchronously using the `--asynchronous` flag, the
`deploy` command returns an operation ID. You can use this ID to check the
ongoing status of the deployment.

```bash
gcloud ai operations describe YOUR_OPERATION_ID \
  --region=$LOCATION_ID
```

> [!NOTE]
> As an agent, you can also offer to check the status of a deployment for the
> user if they provide an operation ID or if they just initiated the
> deployment with you.

Alternatively, you can list your endpoints to see whether the new endpoint
appears, or check the Cloud Console under the "Online prediction" tab.

```bash
gcloud ai endpoints list \
  --region=$LOCATION_ID
```

Note: Large models (like Llama 3.1 8B or Gemma 3 27B) may take 15-20 minutes
to fully deploy and start serving.
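
Rather than re-running `describe` by hand, you can poll until the operation
reports done. In this skeleton, `get_done` is a stub so the snippet runs
standalone; in practice it would run
`gcloud ai operations describe "$OPERATION_ID" --region="$LOCATION_ID" --format="value(done)"`:

```bash
#!/bin/bash
# Stub standing in for the real `gcloud ai operations describe` status call.
get_done() { echo "True"; }

DONE=""
for attempt in $(seq 1 40); do   # up to ~20 minutes at 30-second intervals
  DONE=$(get_done)
  if [ "$DONE" = "True" ]; then
    echo "Deployment operation finished after $attempt check(s)."
    break
  fi
  sleep 30
done
```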

### Verifying Deployment

If the model is successfully deployed, verify it by making a test prediction
call. Because Model Garden models are often deployed to Dedicated Endpoints,
you shouldn't use `gcloud ai endpoints predict`. Instead, fetch the endpoint's
dedicated DNS name and send a `curl` request.

> [!TIP]
> Ask the user to try their own prompt to see the results; otherwise use the
> default.

Use the following script:

```bash
#!/bin/bash
PROJECT_ID=$(gcloud config get-value project)
LOCATION_ID="us-central1"
ENDPOINT_ID="YOUR_ENDPOINT_ID"
PROMPT=${1:-"Explain quantum computing in simple terms."}

echo "Fetching dedicated endpoint DNS..."
ENDPOINT_URL=$(gcloud ai endpoints describe $ENDPOINT_ID \
  --project=$PROJECT_ID \
  --region=$LOCATION_ID \
  --format="value(dedicatedEndpointDns)")

if [ -z "$ENDPOINT_URL" ]; then
  echo "Error: Could not retrieve a dedicated endpoint URL. Verify your ENDPOINT_ID."
  exit 1
fi

echo "Sending prediction request to $ENDPOINT_URL..."
curl -X POST \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json" \
  "https://${ENDPOINT_URL}/v1beta1/projects/${PROJECT_ID}/locations/${LOCATION_ID}/endpoints/${ENDPOINT_ID}/chat/completions" \
  -d '{
    "model": "'"$ENDPOINT_ID"'",
    "messages": [
      {
        "role": "user",
        "content": "'"$PROMPT"'"
      }
    ]
  }'
```

## 5. Undeploying and Cleaning Up

To stop incurring charges, you must undeploy the model from the endpoint. This
is a multi-step process if you don't already have the exact endpoint and
deployed model IDs.

### Example: Finding and Undeploying a Model

Here is a bash script demonstrating how to find the IDs and undeploy the model.

```bash
#!/bin/bash
# Example script to undeploy a model

PROJECT_ID=$(gcloud config get-value project)
LOCATION_ID="us-central1"
# The exact model ID is easiest to find via `gcloud ai models list`.
# For this example, assume we know the exact Endpoint ID and Deployed Model ID.

# 1. Find the Endpoint ID
echo "Listing endpoints in $LOCATION_ID:"
gcloud ai endpoints list --project=$PROJECT_ID --region=$LOCATION_ID

# (Assuming you extracted ENDPOINT_ID from the above output)
# ENDPOINT_ID="your_endpoint_id"

# 2. Find the Deployed Model ID
echo "Listing models in $LOCATION_ID to find the model:"
gcloud ai models list --project=$PROJECT_ID --region=$LOCATION_ID

# (Assuming you found the specific MODEL_ID)
# MODEL_ID="your_model_id"
# gcloud ai models describe $MODEL_ID --project=$PROJECT_ID --region=$LOCATION_ID
# (Extract the deployedModelId from the output)
# DEPLOYED_MODEL_ID="your_deployed_model_id"

# 3. Undeploy
# Uncomment and replace the variables below to actually perform the undeployment
# echo "Undeploying model $DEPLOYED_MODEL_ID from endpoint $ENDPOINT_ID..."
# gcloud ai endpoints undeploy-model $ENDPOINT_ID \
#   --project=$PROJECT_ID \
#   --region=$LOCATION_ID \
#   --deployed-model-id=$DEPLOYED_MODEL_ID
#
# echo "Model undeployed."

# 4. Delete the Endpoint
# echo "Deleting endpoint $ENDPOINT_ID..."
# gcloud ai endpoints delete $ENDPOINT_ID \
#   --project=$PROJECT_ID \
#   --region=$LOCATION_ID \
#   --quiet
# echo "Endpoint deleted."

# 5. Delete the Model
# echo "Deleting model $MODEL_ID..."
# gcloud ai models delete $MODEL_ID \
#   --project=$PROJECT_ID \
#   --region=$LOCATION_ID \
#   --quiet
# echo "Model deleted."
```

> [!WARNING]
> Failing to undeploy a model will result in continuous charges for the
> allocated compute resources, even if you are not sending prediction
> requests. Always clean up after testing.

## 6. Troubleshooting

### Deployment Failure: Quota or Resource Exhausted

If your deployment fails (or stays in an error state) with `QUOTA_EXCEEDED` or
`RESOURCE_EXHAUSTED` errors, the specific hardware requested (e.g., `NVIDIA_L4`
or `g2-standard-24`) is either not available in your chosen region or exceeds
your project's quota limits.

**Solution:** Look closely at the error message returned. It will often
recommend an alternative region or machine type that currently has
availability. **Ask the user for confirmation** to retry the deployment using
the suggested `--region` or `--machine-type` parameters.

> [!WARNING]
> If the alternative suggestions involve changing the machine type or
> accelerator, you **MUST** recalculate the estimated cost using
> [Vertex AI prediction pricing](https://cloud.google.com/vertex-ai/pricing#prediction-and-explanation),
> warn the user about list prices versus actual billing, and get their
> explicit confirmation for the new cost before retrying the deployment.