Published on
March 28, 2023

How to minimise cloud costs and make the most of what’s out there.
According to research performed by IBM, approximately one-third of business and IT executives believe that in the near future, service workloads will be able to migrate across clouds smoothly to assure availability. Since 2019, business leaders have also said that taking advantage of unique features offered by multiple cloud providers is the main reason for wanting to spread workloads across them.

Imagine this: an organisation’s ERP system hosted on a managed Kubernetes cluster (GKE) on GCP, using Active Directory (Azure AD) from Microsoft Azure, and ingesting data into Snowflake using StreamSets, where the data is then prepared for training, testing, and validating an ML model for revenue predictions. Cool, right? It’s amazing to think of using the different cloud services available across the various platforms and simply choosing the best service for each use case. So it’s not surprising that more and more CTOs and other IT decision-makers are expected to pursue multi-cloud strategies in the near future to reap the benefits of a variety of services across various cloud platforms.

But of course decision-makers have apprehensions: not only that costs will spiral, but also about maintaining solution visibility while coordinating the multiple services provided by many cloud providers. This Forbes piece on multi-cloud visibility, security, and governance challenges does a great job of explaining the issue succinctly.

Various cost management and visibility services are available from the major cloud providers, but they currently only cover each provider’s own services. For instance, Amazon Web Services offers AWS Cloud Financial Management, a suite of resources for planning, monitoring, and reporting on cloud expenditure, and Google Cloud and Microsoft Azure offer similar tooling.

Cloud cost management and visibility can therefore add a great deal of complexity for a business seeking to adopt multi-cloud for its operations, where many services and tools must interact seamlessly to deliver value: the company will need to establish separate budgets, monitoring, and alerting mechanisms for each provider.

Because of this added complexity, many firms are hesitant to embrace cloud computing at all, let alone deploy multi-cloud solutions.

SO HOW CAN WE LOOK AT LOWERING COSTS?

Unexpected costs in the cloud can often be attributed to unnecessary or underutilised resources: idle compute instances, underused virtual private clouds (VPCs), and resources that were never cleaned up. Choosing the right set of resources for the task at hand helps to avoid these costs, and in the cloud there are many ways to complete the same task. For example, Apache Airflow can be used to deploy, run, and monitor a machine learning workflow in the cloud. Such a workflow can use the platform’s compute resources for preprocessing data (with EMR, Dataproc, or Databricks, say), training and validating a model, deploying it, and monitoring and visualising its performance. Cloud-specific platforms such as Amazon SageMaker, Azure Machine Learning, and Vertex AI on GCP can also be used for these tasks. SageMaker, for instance, provides a range of fully managed tools and libraries for preprocessing data, training ML models, deploying them, and monitoring their performance, and the process can be orchestrated with Lambda and Step Functions. GCP offers similar capabilities through Dataproc, Cloud Run, Cloud Functions, and MLflow, while Azure offers Azure Data Factory and Azure Machine Learning pipelines.
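
To make the orchestration idea concrete, here is a minimal sketch of an Airflow DAG that chains Dataproc preprocessing with training and deployment steps. It assumes Airflow 2.x with the Google provider installed; the project, region, bucket, job, and cluster names are placeholders, and the training and deployment callables are stubs.

    # A minimal sketch of an ML workflow in Airflow (assumes Airflow 2.x and the
    # apache-airflow-providers-google package; all names below are placeholders).
    from datetime import datetime

    from airflow import DAG
    from airflow.operators.python import PythonOperator
    from airflow.providers.google.cloud.operators.dataproc import DataprocSubmitJobOperator

    def train_model(**context):
        # Placeholder: train and validate the model.
        pass

    def deploy_model(**context):
        # Placeholder: deploy the validated model to a serving endpoint.
        pass

    with DAG(
        dag_id="ml_revenue_prediction",
        start_date=datetime(2023, 1, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        # Preprocess raw data on an existing Dataproc cluster.
        preprocess = DataprocSubmitJobOperator(
            task_id="preprocess_data",
            project_id="my-project",   # placeholder
            region="us-central1",      # placeholder
            job={
                "placement": {"cluster_name": "test-dataproc"},
                "pyspark_job": {"main_python_file_uri": "gs://my-bucket/jobs/preprocess.py"},
            },
        )

        train = PythonOperator(task_id="train_and_validate", python_callable=train_model)
        deploy = PythonOperator(task_id="deploy_model", python_callable=deploy_model)

        # Each step runs only if the previous one succeeds, so a failed
        # preprocessing job never spins up training or deployment compute.
        preprocess >> train >> deploy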

Regardless of the approach taken, keeping tabs on resource creation, updates, deletion, and replacement across all resources is crucial.

A FEW COST-SAVING HACKS

  1. As much as possible, automate the creation of production-ready resources to avoid engineering toil (see “Eliminating Toil” from Google’s SRE book) and the risk of an unreplicable production environment; resources that are not cleaned up properly can be left behind and accrue costs. Tools like Terraform and configuration management tools like Ansible can be used to streamline deployments and resource management. They let you manage resources using code that can be checked into a version control system (VCS), making it easier for the engineering team to understand the resources being used in production, and any updates or replacements can be made through the standard codebase. This makes it easier to manage different deployment versions and to migrate projects from one cloud platform to another with minimal changes, improving visibility into the resources used in production and ensuring that resources are cleaned up properly.
  2. In production, it is often best to use managed resources with auto-scaling enabled as much as possible. For example, if you want to deploy a multi-region, highly available RDBMS, you could create RDBMS instances in different regions, use load balancers to distribute traffic, and then use configuration tools like Chef, Puppet, or Ansible to update packages and migrate versions across the instances (the self-managed approach). However, your engineering team is then responsible for keeping every instance healthy and scaling the fleet based on traffic, hardware utilisation metrics, and so on. This increases engineering toil and resource utilisation costs, and problems such as version incompatibility or replication issues caused by networking can lead to downtime and lost revenue. To avoid these issues, you can use managed services such as Cloud Spanner on GCP or the equivalent services on AWS and Azure. Other examples: instead of creating a self-managed Kubernetes cluster, use a fully managed cluster with auto-scaling and high availability enabled on GKE, AKS, or EKS; instead of a self-managed Spark cluster, use auto-scalable services like Dataproc, EMR, or Databricks (see the Dataproc cluster example at the end of this post).
  3. Using workflows (pipelines) can help to visualise the resources being used and to monitor performance across them. In the machine learning example, a workflow can ensure that tasks like preprocessing, training, evaluation, and deployment only run if the previous steps succeeded. This avoids unnecessary resource usage and costs, keeps the deployment process consistent, and ensures that the best model is running in production. For example, you can define service level indicators (SLIs) and service level objectives (SLOs) for each step and only trigger the next step if the current step’s SLO is achieved, so resources are not wasted and the production environment runs optimally. A minimal sketch of this gating pattern appears just after this list.
  4. Consider serverless. One advantage of serverless computing is that its event-driven nature helps to avoid unnecessary costs: the platform releases infrastructure resources when there are no events and only spins them up when an event occurs, so you pay only for the execution time of the function. Using resources only when they are actually needed improves efficiency, reduces waste, and keeps utilisation costs down; a short serverless function sketch follows the gating example below.
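
To illustrate point 3, here is a minimal sketch of an SLO gate, assuming Airflow 2.x: a ShortCircuitOperator sits between evaluation and deployment, and deployment is skipped unless the reported metric clears a threshold. The metric name, threshold value, and task bodies are illustrative placeholders rather than anything prescribed above.

    # A minimal sketch of gating a pipeline step on an SLO (assumes Airflow 2.x).
    # The metric, threshold, and task bodies are illustrative placeholders.
    from datetime import datetime

    from airflow import DAG
    from airflow.operators.python import PythonOperator, ShortCircuitOperator

    ACCURACY_SLO = 0.9  # placeholder SLO: minimum acceptable validation accuracy

    def evaluate_model(**context):
        # Placeholder: compute validation metrics and share them via XCom.
        context["ti"].xcom_push(key="accuracy", value=0.93)

    def slo_met(**context):
        # Returning False short-circuits the DAG, so deployment (and its cost) is skipped.
        accuracy = context["ti"].xcom_pull(task_ids="evaluate_model", key="accuracy")
        return accuracy is not None and accuracy >= ACCURACY_SLO

    def deploy_model(**context):
        # Placeholder: promote the validated model to production.
        pass

    with DAG(
        dag_id="slo_gated_deployment",
        start_date=datetime(2023, 1, 1),
        schedule_interval=None,
        catchup=False,
    ) as dag:
        evaluate = PythonOperator(task_id="evaluate_model", python_callable=evaluate_model)
        gate = ShortCircuitOperator(task_id="check_slo", python_callable=slo_met)
        deploy = PythonOperator(task_id="deploy_model", python_callable=deploy_model)

        evaluate >> gate >> deploy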
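
And to illustrate point 4, a minimal event-driven function might look like the sketch below, assuming Python with the functions-framework library for Google Cloud Functions and a Cloud Storage trigger. The bucket and the downstream processing are placeholders; the point is simply that no compute is billed while no events arrive.

    # A minimal sketch of an event-driven serverless function (assumes the
    # functions-framework library for Google Cloud Functions).
    import functions_framework

    @functions_framework.cloud_event
    def on_file_uploaded(cloud_event):
        # Runs only when an object lands in the bucket; nothing is billed while idle.
        data = cloud_event.data
        print(f"Processing {data['name']} from bucket {data['bucket']}")
        # Placeholder: trigger downstream work here, e.g. submit a Dataproc or BigQuery job.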

THE ANSWER? A SKILLED ENGINEER

A multi-cloud approach is becoming increasingly necessary as we see more and more applications designed to take advantage of the variety of resources out there. Businesses simply need to have skilled engineers on hand to help them navigate the complexities of cloud engineering. A smart engineer will be able to suggest cost-saving tools and put strategies in place to help make the most of the incredible technical benefits of the cloud. The more clouds the merrier, I say.

SOME THINGS TO HELP YOU ON YOUR WAY!

Using the HashiCorp Configuration Language (HCL) and Terraform, you can easily manage a Dataproc cluster on GCP using a few commands. With Terraform, you can create, update, or delete resources on the cluster without any manual actions on the GCP console. This can streamline resource management and reduce the need for manual intervention, improving efficiency, reducing the risk of errors, and eliminating Engineering Toil.

  • Install Terraform
  • To securely access and manage Dataproc resources, create a service account with the necessary permissions. Then, download the credentials and store them in a safe location.
  • Create a directory for the example
    mkdir -p ~/code/github/<github username>/dataproc-terraform-example
  • Navigate to the newly created directory
    cd ~/code/github/<github username>/dataproc-terraform-example

     

  • Create a file named “provider.tf”, then add the following content to it
    /*
    Setting up the provider
    */
    terraform {
      required_providers {
        google = {
          source  = "hashicorp/google"
          version = "4.33.0"
        }
      }
    }

    provider "google" {
      credentials = file(var.credentials)
      project     = var.project
      region      = var.region
      zone        = var.zone
    }
  • Create a file named “main.tf” for setting up the resources for the cluster
    /*
    Create the dataproc cluster
    */
    resource "google_dataproc_cluster" "dataproc-terraform" {
      name    = var.cluster_name
      region  = var.region
      project = var.project

      cluster_config {
        # For adding metadata, initialization actions, and configs that apply to all instances in the cluster
        gce_cluster_config {
          zone = var.zone
        }

        # Allowing http port access to components inside the cluster
        endpoint_config {
          enable_http_port_access = "true"
        }

        # Configuring the master nodes
        master_config {
          num_instances = var.master_num_instances
          machine_type  = var.master_machine_type
          disk_config {
            boot_disk_size_gb = var.master_disk_size
          }
        }

        # Configuring the worker nodes
        worker_config {
          num_instances = var.node_num_instances
          machine_type  = var.node_machine_type
          disk_config {
            boot_disk_size_gb = var.node_disk_size
            num_local_ssds    = var.node_num_local_ssds
          }
        }

        software_config {
          override_properties = {
            # Add the spark-bigquery connector to save data in the BigQuery Data Warehouse
            "spark:spark.jars.packages"            = "com.google.cloud.spark:spark-3.1-bigquery:0.26.0-preview"
            "dataproc:dataproc.allow.zero.workers" = "true"
          }
          # Add components like Zeppelin, Jupyter, Druid, Hbase, etc.
          optional_components = var.additional_components
        }
      }
    }
  • Create a file named “variables.tf” to store the variables schema.
    variable "project" {
      type        = string
      description = "The default GCP project all of your resources will be created in."
    }

    variable "region" {
      type        = string
      description = "The default location for regional resources. Regional resources are spread across several zones."
    }

    variable "zone" {
      type        = string
      description = "The default location for zonal resources. Zonal resources exist in a single zone; all zones are part of a region."
    }

    variable "cluster_name" {
      type        = string
      description = "The name of the Dataproc cluster"
    }

    variable "master_machine_type" {
      type        = string
      description = "The compute type (CPU, memory, etc.) to assign to the master nodes"
    }

    variable "node_machine_type" {
      type        = string
      description = "The compute type (CPU, memory, etc.) to assign to the worker nodes"
    }

    variable "credentials" {
      type        = string
      description = "The path to the credentials file"
      sensitive   = true
    }

    variable "additional_components" {
      type        = list(string)
      description = "Additional components like Zeppelin, Hive, etc."
    }

    variable "node_num_instances" {
      type        = number
      description = "The number of worker instances in the cluster"
    }

    variable "master_num_instances" {
      type        = number
      description = "The number of master instances in the cluster"
    }

    variable "master_disk_size" {
      type        = number
      description = "The boot disk size (GB) of the master nodes"
    }

    variable "node_disk_size" {
      type        = number
      description = "The boot disk size (GB) of the worker nodes"
    }

    variable "node_num_local_ssds" {
      type        = number
      description = "The number of local SSDs per worker, useful for temporarily storing data to disk locally"
    }
  • Create a file named “outputs.tf” for collecting some outputs
    output "jupyter_url" {
      value = google_dataproc_cluster.dataproc-terraform.cluster_config[0].endpoint_config[0].http_ports["Jupyter"]
    }
  • In Terraform, variables can be registered in various ways, such as by exporting them as environment variables or by using a .tfvars file. In this tutorial, we will be using the .tfvars file method. It’s important to note that this file should not be committed to version control or shared with anyone on the internet. To proceed, create a file named “terraform.tfvars” and add the following content
    project               = "<locate this on your GCP console>"
    region                = "us-central1"    # choose a region
    zone                  = "us-central1-a"  # choose a zone
    cluster_name          = "test-dataproc"  # name of the cluster
    master_machine_type   = "n1-standard-2"
    node_machine_type     = "n1-standard-2"
    credentials           = "<the path to your credentials file>"
    additional_components = ["JUPYTER"]
    node_num_instances    = 2
    master_num_instances  = 1
    master_disk_size      = 30
    node_disk_size        = 50
    node_num_local_ssds   = 0
  • Ensure your directory structure looks like this
    .
    └── dataproc-terraform-example
        ├── main.tf
        ├── outputs.tf
        ├── provider.tf
        ├── terraform.tfvars
        └── variables.tf
  • Navigate to the dataproc-terraform-example directory
  • Run the command below to install and initialise the providers.
    terraform init
  • To review the changes that would be made to the GCP account to reach the specified state, run the following
    terraform plan -out latest-plan.tfplan
  • Run the next command to apply the changes to the GCP account
    terraform apply "latest-plan.tfplan"
  • Output after running terraform apply
    std-logic@dev-01:~/code/github/aastom/dataproc-terraform-example$ terraform apply "latest-plan.tfplan"
    google_dataproc_cluster.dataproc-terraform: Creating...
    google_dataproc_cluster.dataproc-terraform: Still creating... [10s elapsed]
    google_dataproc_cluster.dataproc-terraform: Still creating... [20s elapsed]
    google_dataproc_cluster.dataproc-terraform: Still creating... [30s elapsed]
    google_dataproc_cluster.dataproc-terraform: Still creating... [40s elapsed]
    google_dataproc_cluster.dataproc-terraform: Still creating... [50s elapsed]
    google_dataproc_cluster.dataproc-terraform: Still creating... [1m0s elapsed]
    google_dataproc_cluster.dataproc-terraform: Still creating... [1m10s elapsed]
    google_dataproc_cluster.dataproc-terraform: Still creating... [1m20s elapsed]
    google_dataproc_cluster.dataproc-terraform: Still creating... [1m30s elapsed]
    google_dataproc_cluster.dataproc-terraform: Still creating... [1m40s elapsed]
    google_dataproc_cluster.dataproc-terraform: Still creating... [1m50s elapsed]
    google_dataproc_cluster.dataproc-terraform: Still creating... [2m0s elapsed]
    google_dataproc_cluster.dataproc-terraform: Still creating... [2m10s elapsed]
    google_dataproc_cluster.dataproc-terraform: Still creating... [2m20s elapsed]
    google_dataproc_cluster.dataproc-terraform: Still creating... [2m30s elapsed]
    google_dataproc_cluster.dataproc-terraform: Still creating... [2m40s elapsed]
    google_dataproc_cluster.dataproc-terraform: Still creating... [2m50s elapsed]
    google_dataproc_cluster.dataproc-terraform: Still creating... [3m0s elapsed]
    google_dataproc_cluster.dataproc-terraform: Still creating... [3m10s elapsed]
    google_dataproc_cluster.dataproc-terraform: Still creating... [3m20s elapsed]
    google_dataproc_cluster.dataproc-terraform: Still creating... [3m30s elapsed]
    google_dataproc_cluster.dataproc-terraform: Still creating... [3m40s elapsed]
    google_dataproc_cluster.dataproc-terraform: Still creating... [3m50s elapsed]
    google_dataproc_cluster.dataproc-terraform: Creation complete after 3m52s [id=projects/spark-372815/regions/us-central1/clusters/test-dataproc]

    Apply complete! Resources: 1 added, 0 changed, 0 destroyed.

    Outputs:

    jupyter_url = "https://studup2gzreqlikar6m3jeuyoa-dot-us-central1.dataproc.googleusercontent.com/gateway/default/jupyter/"
    std-logic@dev-01:~/code/github/aastom/dataproc-terraform-example$
  • By executing the following command, Terraform can efficiently remove all resources defined in the configuration (the Dataproc cluster and anything else declared alongside it, such as VPCs, Compute Engine instances, or Kubernetes clusters), preventing any orphan resources from remaining and incurring additional costs.
    terraform destroy

     

To find out how bigspark can help your organisation successfully manage a multi-cloud approach, get in touch today.