Announcement: Pleco - the open-source Kubernetes and Cloud Services garbage collector

TL;DR: Pleco is a service that automatically removes managed cloud services and Kubernetes resources based on tags carrying a TTL.

When using cloud provider services, whether through the UI or Terraform, you usually have to create many resources (users, VPCs, virtual machines, clusters, etc.) to host and expose an application to the outside world. With Terraform, a deployment does not always go as planned: if it fails partway through for any reason, some resources get created and others do not. But what to do with those dangling resources?

Delete them by hand? Too slow and laborious. Manage the Terraform crash, its tfstate, and the probable state lock? Certainly not; what a pain! That's why at Qovery we created Pleco, a program that checks at regular intervals whether existing resources are still needed, reducing cloud cost and increasing engineering team efficiency.

Enzo Radnaï

December 24, 2021 · 7 min read

Software Engineer working on the internal tooling for Qovery.

Engineering · Cloud · Kubernetes

Why did we create and open-source Pleco?

"The Qovery team is super excited to offer Pleco to the open-source community. A tool that saves us tons of time and money in the Cloud. Especially in the Cloud, where forgetting to stop a resource can cost an arm and a leg" - Pierre Mavro - CTO and co-founder of Qovery

At Qovery, we provide a product that helps developers deploy their apps in the Cloud. Behind the scenes, we deploy and manage Kubernetes clusters in our customers' cloud accounts. To keep our product reliable for our users, we permanently run thousands of tests in the Cloud. Running tests means we can end up with dangling resources left behind by test failures. At the beginning of Qovery (2 years ago), we cleaned up those dangling resources manually. It was unreliable and a complete waste of time and money for our team. That is why we invested in writing Pleco: to make our life easier and save us from money loss.

How does Pleco work?

Pleco is an application driven by a simple command line that lets you set many options, such as:

  • Cloud provider on which it acts
  • How to connect to the cluster
  • Duration of the interval between each check and potential cleaning
  • Logger precision
  • Geographical areas in which Pleco must perform cleaning
  • Resource types to be cleaned
  • Dry run deactivation
pleco start <cloud provider> -k <connection mode> -i <interval in seconds> --level <logger precision> -a <regions or zones separated by a comma> <flags of resources to manage> <dry-run deactivation>

Example:

pleco start aws -k in --level debug -i 240 -a eu-west-3 -e -r -m -c -l -b -p -s -w -n -u -z -o -y  

Here is the resources list that Pleco currently manages:

Kubernetes

  • Namespaces

AWS

  • Document DB databases
  • Document DB subnet groups
  • Document DB snapshots
  • Elasticache databases
  • Elasticache subnet groups
  • RDS databases
  • RDS subnet groups
  • RDS parameter groups
  • RDS snapshots
  • EBS volumes
  • ELB load balancers
  • EC2 Key pairs
  • ECR repositories
  • EKS clusters
  • IAM groups
  • IAM users
  • IAM policies
  • IAM roles
  • Cloudwatch logs
  • KMS keys
  • VPCs
  • VPC internet gateways
  • VPC NAT gateways
  • VPC Elastic IPs
  • VPC route tables
  • VPC subnets
  • Security groups
  • S3 buckets

SCALEWAY

  • Kubernetes clusters
  • Database instances
  • Load balancers
  • Detached volumes
  • S3 Buckets
  • Unused Security Groups

DIGITAL OCEAN

  • Kubernetes clusters
  • Database instances
  • Load balancers
  • Detached volumes
  • S3 Buckets

To manage unnecessary resources, we considered two approaches:

  • Verify that the resource's owner, typically a cluster, still exists. This is possible but laborious, because resources have hierarchical dependencies.

[Figure: Cloud provider structure]

  • Use the resource's creation date plus a lifetime, stored as tags, to check its expiration over time.

Well, at Qovery, we have decided to do both simultaneously!

When Pleco is running, it uses the Kubernetes API and the corresponding cloud provider API to list resources and their tags. It then compares the resource's creation date, extended by its lifetime tag, with the current date. If a resource has expired, Pleco deletes it.

If a deleted resource contains other items, those items are deleted as well.

One other advantage of Pleco is that, thanks to its two connection modes, it can run inside a cluster to clean it from within, or on an external machine for remote cleaning.

To deploy Pleco, you have two options: a Docker image or a Helm chart.

Managing deployments

Different tools allow us to inject tags into resources: a formalized creation date and a lifetime (TTL) in seconds. Although many resources already expose a creation date across the different cloud providers, we decided to add our own tag to maintain consistency: depending on the provider, the date formats returned differ between resources, and some resources have no creation date at all.

locals {
  tags_common = {
    ClusterId = var.kubernetes_cluster_id
    ClusterName = var.kubernetes_cluster_name
    Region = var.region
    creationDate = time_static.on_cluster_create.rfc3339
    {% if resource_expiration_in_seconds is defined %}ttl = var.resource_expiration_in_seconds{% endif %}
  }

  tags_eks = merge(
  local.tags_common,
  {
    "Service" = "EKS"
  }
  )
}

resource "aws_eks_cluster" "eks_cluster" {
  name            = var.kubernetes_cluster_name
  role_arn        = aws_iam_role.eks_cluster.arn
  version         = var.eks_k8s_versions.masters

  enabled_cluster_log_types = ["api","audit","authenticator","controllerManager","scheduler"]

  vpc_config {
    security_group_ids = [aws_security_group.eks_cluster.id]
    subnet_ids = flatten([
      aws_subnet.eks_zone_a[*].id,
      aws_subnet.eks_zone_b[*].id,
      aws_subnet.eks_zone_c[*].id,
      {% if vpc_qovery_network_mode == "WithNatGateways" %}
      aws_subnet.eks_zone_a_public[*].id,
      aws_subnet.eks_zone_b_public[*].id,
      aws_subnet.eks_zone_c_public[*].id
      {% endif %}
    ])
  }

  tags = local.tags_eks

  # grow this value to reduce random AWS API timeouts
  timeouts {
    create = "60m"
    update = "90m"
    delete = "30m"
  }

  depends_on = [
    aws_iam_role_policy_attachment.eks_cluster_AmazonEKSClusterPolicy,
    aws_iam_role_policy_attachment.eks_cluster_AmazonEKSServicePolicy,
    aws_cloudwatch_log_group.eks_cloudwatch_log_group,
  ]
}

Unfortunately, not all resources support tags, and in some cases, since dependencies between resources can be used instead, tagging is irrelevant.

How is Pleco made?

Choice of language

Pleco is an application the Qovery team created to answer an immediate need, so we wrote it in Go. It is a perfect language for a small application: easy to use, with fast per-OS binary builds, more than acceptable performance, SDKs for everything we need (AWS, Scaleway, DigitalOcean, Kubernetes), and accessible asynchronous processing thanks to goroutines.

Cluster connection

To be able to remove unnecessary namespaces, Pleco uses the Kubernetes API, which requires client configuration. A Kubernetes client can therefore be generated in two ways:

  • With an in-cluster connection, which derives the configuration from the cluster Pleco is running in.
  • With an out-of-cluster connection, by providing the kubeconfig ourselves.

func AuthenticateInCluster() (*kubernetes.Clientset, error) {
	// creates the in-cluster config
	config, err := rest.InClusterConfig()
	if err != nil {
		return nil, fmt.Errorf("failed to get client config: %v", err)
	}
	// creates the clientset
	clientSet, err := kubernetes.NewForConfig(config)
	if err != nil {
		return nil, fmt.Errorf("failed to generate client set: %v", err)
	}
	return clientSet, nil
}

func AuthenticateOutOfCluster() (*kubernetes.Clientset, error) {
	kubeconfig := os.Getenv("KUBECONFIG")

	// use the current context in kubeconfig
	config, err := clientcmd.BuildConfigFromFlags("", kubeconfig)
	if err != nil {
		return nil, fmt.Errorf("failed to get client config: %v", err)
	}

	// create the clientset
	clientSet, err := kubernetes.NewForConfig(config)
	if err != nil {
		return nil, fmt.Errorf("failed to generate client set: %v", err)
	}
	return clientSet, nil
}

Regarding cloud provider API calls, Pleco requires environment variables containing the credentials of the target cloud account in order to create an API client.

func CreateSession(region string) *session.Session {
	sess, err := session.NewSession(&aws.Config{
		Region: aws.String(region)},
	)
	if err != nil {
		logrus.Fatalf("Can't connect to AWS: %s", err.Error())
	}
	return sess
}

Resource management

Pleco can act in different geographical locations and on different resources. To separate the processes, we use goroutines to manage API calls by region and by resource: one goroutine is created per region, and within each region, one goroutine per resource type.

func RunPlecoAWS(cmd *cobra.Command, regions []string, interval int64, wg *sync.WaitGroup, options AwsOptions) {
	for _, region := range regions {
		wg.Add(1)
		go runPlecoInRegion(region, interval, wg, options)
	}

	currentSession := CreateSession(regions[0])

	wg.Add(1)
	go runPlecoInGlobal(cmd, interval, wg, currentSession, options)
}

We could have worked synchronously, but given the number of regions and resources, we chose to parallelize the calls in order to complete processing faster at each interval.
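The per-region fan-out shown above can be extended one level down. Here is a minimal, hypothetical sketch of fanning out one goroutine per resource type within a region (`checkFunc` and the stand-in checkers are illustrative, not Pleco's actual API):

```go
package main

import (
	"fmt"
	"sync"
)

// checkFunc is a hypothetical signature for a per-resource-type check,
// e.g. "expired EBS volumes" or "expired RDS snapshots" in one region.
// It returns how many resources it deleted.
type checkFunc func(region string) int

// runRegion fans out one goroutine per resource type inside a region,
// waits for all of them, and sums the deletions.
func runRegion(region string, checkers []checkFunc) int {
	var (
		wg      sync.WaitGroup
		mu      sync.Mutex
		deleted int
	)
	for _, check := range checkers {
		wg.Add(1)
		go func(c checkFunc) {
			defer wg.Done()
			n := c(region)
			mu.Lock()
			deleted += n
			mu.Unlock()
		}(check)
	}
	wg.Wait()
	return deleted
}

func main() {
	// Two stand-in checkers that "delete" a fixed number of resources.
	checkers := []checkFunc{
		func(region string) int { return 2 }, // e.g. expired EBS volumes
		func(region string) int { return 3 }, // e.g. expired snapshots
	}
	fmt.Println(runRegion("eu-west-3", checkers)) // prints 5
}
```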

Protection tag

Although each resource is managed with an expiration date or a dependency, it is possible to add a third tag to block its deletion: the do_not_delete tag, with a Boolean value. Even if the resource has expired, it will never be deleted. There is another way to protect a resource from deletion: do not set a TTL tag at all.
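These protection rules are simple to express in code. A minimal sketch, assuming the tag names from the article (`do_not_delete`, `ttl`); the function itself is hypothetical:

```go
package main

import "fmt"

// isProtected reports whether a resource's tags exempt it from deletion:
// either an explicit do_not_delete=true tag, or no ttl tag at all.
// Hypothetical helper illustrating the rules described in the article.
func isProtected(tags map[string]string) bool {
	if tags["do_not_delete"] == "true" {
		return true
	}
	_, hasTTL := tags["ttl"]
	return !hasTTL // a resource without a TTL tag never expires
}

func main() {
	fmt.Println(isProtected(map[string]string{"ttl": "3600", "do_not_delete": "true"})) // prints true
	fmt.Println(isProtected(map[string]string{"ttl": "3600"}))                          // prints false
	fmt.Println(isProtected(map[string]string{}))                                       // prints true
}
```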

Conclusion

Even if you try to manage the lifetime of every resource, sometimes things go wrong, and you can still end up with useless stuff in your cloud account.

Pleco makes it easier for DevOps teams to handle unused resources that would otherwise lead to issues or financial loss. Thanks to the TTL tag, developers can work without thinking about quotas, or about deleting everything they created themselves.

Another advantage of using Pleco is that it forces developers to use consistent tagging, which helps organize and structure cloud accounts.

At the moment, Pleco is running on all of Qovery's test cloud accounts. Since we started using it, we haven't had any quota or flaky-resource issues. We have also saved thousands of dollars by regularly deleting unused resources.

Check out Pleco on GitHub - Give a star and feel free to contribute 🤩
