Autoscaling

Overview

Autoscaling is now available for all deployment types in AI Cloud. This feature automatically adjusts the number of running service instances based on traffic demand, ensuring responsive performance while using resources efficiently. Benefits include:

  • Lower costs for low-traffic or test environments
  • Improved reliability and responsiveness for high-demand production services

During the deployment creation process, you’ll see the Autoscaling section in the Deployment UI. Default settings are provided, but you can customize them as needed. By default:

  • Minimum Replicas: 1
  • Maximum Replicas: 10

You can keep the defaults or tailor them to your expected workload.

Settings

The available settings are as follows:

  • Minimum Replicas: The minimum number of service copies that will always be running, even during periods of low usage.
  • Maximum Replicas: The upper limit of service instances allowed to run concurrently during high demand.
  • Scaling Metric: The performance indicator (e.g., CPU, memory, requests per second) used to trigger scaling actions.
  • Target Metric: The ideal target value for the chosen scaling metric. The autoscaler maintains this value by adding or removing instances as needed.
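The documentation does not specify the exact algorithm the autoscaler uses to keep the metric at its target. As an illustration only, here is a minimal sketch of the proportional approach used by many autoscalers (for example, Kubernetes' Horizontal Pod Autoscaler): scale the replica count by the ratio of the observed metric to the target, then clamp to the configured bounds. The function name and signature are hypothetical, not part of the AI Cloud API.

```python
import math

def desired_replicas(current_replicas, metric_value, target_value,
                     min_replicas=1, max_replicas=10):
    """Sketch: replica count that would bring the average metric back
    to its target, clamped to the Minimum/Maximum Replicas bounds."""
    if current_replicas == 0:
        # Scaled to zero: any observed traffic triggers a scale-up.
        raw = 1 if metric_value > 0 else 0
    else:
        raw = math.ceil(current_replicas * metric_value / target_value)
    return max(min_replicas, min(max_replicas, raw))
```

For example, 4 replicas averaging 90% CPU against a 60% target yields ceil(4 × 90 / 60) = 6 replicas; a demand spike that would call for 25 replicas is clamped to the maximum of 10.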

Advanced settings (optional)

  • Initial Scale: Number of instances started immediately when a new version is deployed. Helps reduce startup time. Applied only once.
  • Scale Up Minimum: Sets how many instances to start when scaling up from zero. Ensures fast wake-up from idle state.
  • Scale Down Delay: Time to wait before reducing replicas after traffic drops. Prevents frequent scale-downs during temporary lulls.
  • Stable Window: The averaging period used for traffic evaluation before making scaling decisions. Helps avoid unnecessary fluctuations.
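The precise mechanics of Stable Window and Scale Down Delay are not documented here; the following is a simplified sketch of how such settings commonly behave, assuming the window averages recent metric samples and the delay withholds reductions until a lull has persisted. Class and function names are illustrative, not real API.

```python
from statistics import mean

def smoothed(samples, window):
    """Stable Window (sketch): average the most recent `window` samples
    so a single noisy reading cannot trigger a scaling decision."""
    return mean(samples[-window:])

class ScaleDownDelay:
    """Scale Down Delay (sketch): scale-ups apply immediately, but a
    lower desired replica count must persist for `delay` consecutive
    evaluations before it takes effect."""
    def __init__(self, delay, initial_replicas=1):
        self.delay = delay
        self.current = initial_replicas
        self._low_streak = 0

    def step(self, desired):
        if desired >= self.current:
            self.current = desired          # scale up or hold at once
            self._low_streak = 0
        else:
            self._low_streak += 1
            if self._low_streak >= self.delay:
                self.current = desired      # lull lasted long enough
                self._low_streak = 0
        return self.current
```

With `delay=3`, a service at 5 replicas asked to drop to 2 stays at 5 for two evaluations and only shrinks on the third, while a traffic spike in between resets the countdown.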

Notes and edge cases

  • For testing or low-traffic deployments, enabling scale to zero (setting Minimum Replicas to 0) is recommended to minimize resource usage.
  • Keep in mind that cold start time (scaling from 0 to active replicas) may introduce a delay when the first request arrives.
  • Make sure the scaling metric and its target align with your workload type: for example, requests per second suits latency-sensitive APIs, while CPU suits compute-bound services.
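One common way to soften the cold-start delay noted above is to send a warm-up request yourself right after deploying, so the probe absorbs the startup cost instead of the first real user. A minimal sketch, assuming your service exposes an HTTP endpoint; the URL is a placeholder you would replace with your deployment's address.

```python
import urllib.request

def warm_up(url, timeout=60):
    """Hit the service once right after deploy so the cold start is
    paid by this probe rather than by the first real request.
    `url` is a placeholder for your service's health or root endpoint."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.status
```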