
Kubernetes Custom Resources: Building Production Operators

Master Kubernetes operators and custom resources for production environments. Learn to build robust Kubernetes controllers with real-world examples and best practices.

📖 19 min read 📅 June 4, 2026 ✍ By PropTechUSA AI

Managing complex applications in Kubernetes often requires extending the platform's native capabilities beyond what built-in resources can offer. While Deployments, Services, and ConfigMaps handle basic workloads effectively, modern distributed applications demand more sophisticated automation and domain-specific logic. This is where Kubernetes operators and custom resources transform how we architect and operate cloud-native systems.

Understanding the Operator Pattern Foundation

Kubernetes operators represent a paradigm shift in application lifecycle management, embodying the principle of infrastructure as code through declarative APIs. The operator pattern extends Kubernetes' core philosophy by allowing developers to codify operational knowledge and automate complex tasks that traditionally required manual intervention.

The Controller Architecture

At its core, every Kubernetes operator implements the controller pattern through a continuous reconciliation loop. This architecture ensures that the actual state of your system consistently matches the desired state defined in your custom resources.

go
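// ReconcileLoop reconciles MyCustomResource objects.
// analyzeState and updateResources are operator-specific helpers (not shown).
type ReconcileLoop struct {
	client.Client
	scheme *runtime.Scheme
}

func (r *ReconcileLoop) Reconcile(ctx context.Context, req reconcile.Request) (reconcile.Result, error) {
	// Fetch the current state
	instance := &v1alpha1.MyCustomResource{}
	if err := r.Get(ctx, req.NamespacedName, instance); err != nil {
		// A deleted object is not an error; any other failure is returned and retried
		return reconcile.Result{}, client.IgnoreNotFound(err)
	}

	// Compare desired vs actual state
	if needsUpdate := r.analyzeState(instance); needsUpdate {
		return r.updateResources(ctx, instance)
	}

	// Nothing to change; check again in five minutes to catch silent drift
	return reconcile.Result{RequeueAfter: 5 * time.Minute}, nil
}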

This reconciliation approach provides several advantages over traditional imperative automation scripts. The controller continuously monitors resource states, automatically correcting drift and ensuring consistency even when external factors modify system components.
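
To make the drift-correction idea concrete without a cluster, here is a minimal, dependency-free sketch. The `State` type and the `plan` and `apply` functions are illustrative names, not part of controller-runtime. The point is that the plan is computed from the difference between desired and actual state, so running it again after a successful apply yields no further work.

```go
package main

import "fmt"

// State is a hypothetical, simplified view of a managed workload.
type State struct {
	Replicas int
	Image    string
}

// plan compares desired and actual state and returns the corrective
// actions needed. It returns an empty slice when there is no drift,
// which is what makes repeated reconciliation safe.
func plan(desired, actual State) []string {
	var actions []string
	if actual.Replicas != desired.Replicas {
		actions = append(actions, fmt.Sprintf("scale %d -> %d", actual.Replicas, desired.Replicas))
	}
	if actual.Image != desired.Image {
		actions = append(actions, fmt.Sprintf("set image %s", desired.Image))
	}
	return actions
}

// apply pretends to execute the plan by making actual match desired.
func apply(desired State, actual *State) {
	*actual = desired
}

func main() {
	desired := State{Replicas: 3, Image: "db:1.2"}
	actual := State{Replicas: 2, Image: "db:1.2"} // someone scaled down by hand

	fmt.Println(plan(desired, actual)) // one corrective action
	apply(desired, &actual)
	fmt.Println(len(plan(desired, actual))) // no drift left
}
```

An imperative script would typically run its steps once and stop; a controller keeps re-evaluating `plan` and only acts when the result is non-empty.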

Custom Resource Definitions as APIs

Custom Resource Definitions (CRDs) serve as the interface between operators and users, defining new API endpoints that feel native to Kubernetes. When designing CRDs for production environments, focus on creating intuitive schemas that abstract complex implementation details while providing necessary configuration flexibility.

yaml
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: databases.proptech.ai
spec:
  group: proptech.ai
  versions:
    - name: v1alpha1
      served: true
      storage: true
      schema:
        openAPIV3Schema:
          type: object
          properties:
            spec:
              type: object
              properties:
                replicas:
                  type: integer
                  minimum: 1
                  maximum: 10
                backup:
                  type: object
                  properties:
                    schedule:
                      type: string
                      pattern: '^[0-9\*\s\/\-\,]+$'
                    retention:
                      type: string
                      default: "30d"
              required:
                - replicas
  scope: Namespaced
  names:
    plural: databases
    singular: database
    kind: Database
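
On the Go side, the schema above usually has a matching pair of structs. The sketch below is one plausible shape for them, using the `DatabaseSpec` and `BackupSpec` names that appear in the test later in this article; the `validate` and `parseSpec` helpers are hypothetical. In a real cluster the API server enforces the schema and applies defaults, so the checks are repeated here (in simplified form) only so the example runs on its own.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// BackupSpec mirrors spec.backup in the CRD schema.
type BackupSpec struct {
	Schedule  string `json:"schedule,omitempty"`
	Retention string `json:"retention,omitempty"`
}

// DatabaseSpec mirrors spec in the CRD schema.
type DatabaseSpec struct {
	Replicas int32      `json:"replicas"`
	Backup   BackupSpec `json:"backup,omitempty"`
}

// validate repeats the schema's bounds on replicas (1 to 10).
func (s DatabaseSpec) validate() error {
	if s.Replicas < 1 || s.Replicas > 10 {
		return errors.New("replicas must be between 1 and 10")
	}
	return nil
}

// parseSpec decodes a JSON spec, fills in the retention default,
// and validates the result.
func parseSpec(raw []byte) (DatabaseSpec, error) {
	var s DatabaseSpec
	if err := json.Unmarshal(raw, &s); err != nil {
		return s, err
	}
	if s.Backup.Retention == "" {
		s.Backup.Retention = "30d" // same default as the CRD schema
	}
	return s, s.validate()
}

func main() {
	spec, err := parseSpec([]byte(`{"replicas": 3, "backup": {"schedule": "0 2 * * *"}}`))
	fmt.Println(spec.Replicas, spec.Backup.Retention, err)

	_, err = parseSpec([]byte(`{"replicas": 0}`))
	fmt.Println(err)
}
```

Keeping the JSON tags identical to the schema's property names is what lets the same object round-trip between `kubectl`, the API server, and the controller.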

Event-Driven Architecture Benefits

Kubernetes operators build on the platform's event-driven architecture: instead of polling, a controller opens watches on the API server and is notified whenever a relevant object changes. Operators can therefore react to changes quickly while consuming very little CPU and API bandwidth when the cluster is stable.
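
A detail worth seeing in code is how events turn into work. Watch events are normally reduced to object keys and placed on a queue that drops duplicates, so a burst of changes to one object results in a single reconcile. The sketch below imitates that behaviour with plain Go; `keyQueue` is a hypothetical stand-in, far simpler than the rate-limited work queue that client-go provides.

```go
package main

import (
	"fmt"
	"sync"
)

// keyQueue stores object keys and ignores keys that are already waiting.
type keyQueue struct {
	mu      sync.Mutex
	pending map[string]bool
	order   []string
}

func newKeyQueue() *keyQueue {
	return &keyQueue{pending: make(map[string]bool)}
}

// add enqueues a key unless it is already pending.
// It reports whether the key was actually added.
func (q *keyQueue) add(key string) bool {
	q.mu.Lock()
	defer q.mu.Unlock()
	if q.pending[key] {
		return false
	}
	q.pending[key] = true
	q.order = append(q.order, key)
	return true
}

// get removes and returns the oldest pending key.
// ok is false when the queue is empty.
func (q *keyQueue) get() (key string, ok bool) {
	q.mu.Lock()
	defer q.mu.Unlock()
	if len(q.order) == 0 {
		return "", false
	}
	key = q.order[0]
	q.order = q.order[1:]
	delete(q.pending, key)
	return key, true
}

func main() {
	q := newKeyQueue()
	// Three watch events arrive, two of them for the same object.
	for _, k := range []string{"default/orders-db", "default/orders-db", "prod/users-db"} {
		q.add(k)
	}
	// The worker reconciles each object once, however many events fired.
	for {
		key, ok := q.get()
		if !ok {
			break
		}
		fmt.Println("reconcile", key)
	}
}
```

Because only the key is queued, the reconciler always reads the latest state of the object rather than replaying each individual change.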

Core Components of Production Operators

Building robust Kubernetes operators requires understanding the relationships between controllers, custom resources, and the broader Kubernetes ecosystem. Production-grade operators must handle edge cases, implement proper error handling, and provide observability features that enable effective troubleshooting.

Controller Implementation Strategies

Modern operator development frameworks like Operator SDK and Kubebuilder provide scaffolding and best practices, but understanding the underlying controller mechanics remains crucial for production deployments. The controller manager coordinates multiple controllers and handles cross-cutting concerns like leader election and graceful shutdown.

go
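package main

import (
	"os"

	"sigs.k8s.io/controller-runtime/pkg/client/config"
	"sigs.k8s.io/controller-runtime/pkg/controller"
	"sigs.k8s.io/controller-runtime/pkg/handler"
	"sigs.k8s.io/controller-runtime/pkg/manager"
	"sigs.k8s.io/controller-runtime/pkg/manager/signals"
	"sigs.k8s.io/controller-runtime/pkg/source"

	v1alpha1 "example.com/database-operator/api/v1alpha1" // hypothetical module path
)

// Note: this example targets the older controller-runtime API (the source.Kind
// struct and the Namespace and MetricsBindAddress manager options). Newer
// releases changed these, so adjust to the version you vendor.
func main() {
	mgr, err := manager.New(config.GetConfigOrDie(), manager.Options{
		Namespace:          "", // Watch all namespaces
		LeaderElection:     true,
		LeaderElectionID:   "proptech-operator-lock",
		MetricsBindAddress: ":8080",
	})
	if err != nil {
		os.Exit(1)
	}

	// Register the Database types so the client and cache can decode them
	if err := v1alpha1.AddToScheme(mgr.GetScheme()); err != nil {
		os.Exit(1)
	}

	// Create controller
	c, err := controller.New("database-controller", mgr, controller.Options{
		Reconciler: &DatabaseReconciler{
			Client: mgr.GetClient(),
			Scheme: mgr.GetScheme(),
		},
		MaxConcurrentReconciles: 2,
	})
	if err != nil {
		os.Exit(1)
	}

	// Watch for changes to Database resources
	if err := c.Watch(&source.Kind{Type: &v1alpha1.Database{}}, &handler.EnqueueRequestForObject{}); err != nil {
		os.Exit(1)
	}

	// Start the manager; the signal handler stops it cleanly on SIGINT or SIGTERM
	if err := mgr.Start(signals.SetupSignalHandler()); err != nil {
		os.Exit(1)
	}
}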

State Management and Status Reporting

Effective status reporting provides visibility into operator behavior and enables integration with monitoring systems. The status subresource should reflect the current operational state and include relevant metadata for troubleshooting.

go
type DatabaseStatus struct {
	Phase         string             `json:"phase"`
	Conditions    []metav1.Condition `json:"conditions,omitempty"`
	Replicas      int32              `json:"replicas"`
	ReadyReplicas int32              `json:"readyReplicas"`
	LastBackup    *metav1.Time       `json:"lastBackup,omitempty"`
}

func (r *DatabaseReconciler) updateStatus(ctx context.Context, db *v1alpha1.Database, phase string) error {
	db.Status.Phase = phase

	// meta is k8s.io/apimachinery/pkg/api/meta. SetStatusCondition updates the
	// existing "Ready" condition in place instead of appending a duplicate on
	// every reconcile, and sets LastTransitionTime only when the status changes.
	meta.SetStatusCondition(&db.Status.Conditions, metav1.Condition{
		Type:    "Ready",
		Status:  metav1.ConditionTrue,
		Reason:  "ReconcileSuccess",
		Message: "Database reconciliation completed successfully",
	})

	return r.Status().Update(ctx, db)
}
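
Conditions are usually treated as a set keyed by `Type` rather than an append-only log, so that the list stays small and `LastTransitionTime` records when the status last flipped. The sketch below shows that upsert behaviour with plain Go types. `Condition`, `setCondition`, and `demo` are simplified stand-ins written for this illustration, not the `metav1` API.

```go
package main

import (
	"fmt"
	"time"
)

// Condition is a simplified stand-in for metav1.Condition.
type Condition struct {
	Type               string
	Status             string // "True", "False" or "Unknown"
	Reason             string
	LastTransitionTime time.Time
}

// setCondition inserts c, or updates the existing condition of the same Type.
// LastTransitionTime only moves when Status actually changes.
func setCondition(conds []Condition, c Condition, now time.Time) []Condition {
	for i := range conds {
		if conds[i].Type != c.Type {
			continue
		}
		if conds[i].Status != c.Status {
			conds[i].Status = c.Status
			conds[i].LastTransitionTime = now
		}
		conds[i].Reason = c.Reason
		return conds
	}
	c.LastTransitionTime = now
	return append(conds, c)
}

// demo replays three reconciles, one minute apart, starting at a fixed time.
func demo() []Condition {
	t0 := time.Date(2026, 1, 1, 0, 0, 0, 0, time.UTC)
	var conds []Condition
	conds = setCondition(conds, Condition{Type: "Ready", Status: "False", Reason: "Provisioning"}, t0)
	conds = setCondition(conds, Condition{Type: "Ready", Status: "True", Reason: "ReconcileSuccess"}, t0.Add(time.Minute))
	conds = setCondition(conds, Condition{Type: "Ready", Status: "True", Reason: "ReconcileSuccess"}, t0.Add(2*time.Minute))
	return conds
}

func main() {
	conds := demo()
	// One condition remains, and its transition time is that of the second call.
	fmt.Println(len(conds), conds[0].Status, conds[0].LastTransitionTime.Format("15:04"))
}
```

Monitoring tools can then alert on how long a resource has been not ready, which an ever-growing list of identical entries would obscure.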

Error Handling and Recovery Mechanisms

Production operators must implement sophisticated error handling that distinguishes between transient failures requiring retry and permanent errors needing human intervention. The reconciliation loop should include exponential backoff and circuit breaker patterns to prevent cascading failures.

⚠️ Warning: Avoid infinite reconciliation loops by implementing proper error categorization and retry limits. Operators that continuously fail reconciliation can overwhelm the Kubernetes API server and impact cluster performance.
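
The categorization and retry limits described above can be sketched in a few lines. Everything here is an assumption made for illustration: the `errPermanent` sentinel, the delay constants, and the `decide` function are not part of any Kubernetes library. Note also that controller-runtime's default work queue already applies per-item exponential backoff when `Reconcile` returns an error, so in practice the operator's main job is to classify the error and decide whether to return it at all.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errPermanent marks failures that retrying cannot fix, such as an invalid spec.
var errPermanent = errors.New("permanent failure")

// errConnRefused is an example of a transient failure.
var errConnRefused = errors.New("connection refused")

const (
	baseDelay  = time.Second
	maxDelay   = 5 * time.Minute
	maxRetries = 8
)

// backoff returns the delay before retry number attempt (0-based),
// doubling each time and capping at maxDelay.
func backoff(attempt int) time.Duration {
	d := baseDelay
	for i := 0; i < attempt; i++ {
		d *= 2
		if d >= maxDelay {
			return maxDelay
		}
	}
	return d
}

// decide reports whether to retry and, if so, after how long.
func decide(err error, attempt int) (retry bool, after time.Duration) {
	switch {
	case err == nil:
		return false, 0
	case errors.Is(err, errPermanent):
		return false, 0 // record it in status and wait for a spec change
	case attempt >= maxRetries:
		return false, 0 // give up and alert instead of hammering the API server
	default:
		return true, backoff(attempt)
	}
}

func main() {
	for attempt := 0; attempt < 4; attempt++ {
		retry, after := decide(errConnRefused, attempt)
		fmt.Println(attempt, retry, after)
	}
	invalid := fmt.Errorf("replicas out of range: %w", errPermanent)
	fmt.Println(decide(invalid, 0))
}
```

Wrapping permanent errors with `%w` keeps the original message for the status condition while still letting `errors.Is` recognise the category.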

Implementation Patterns and Real-World Examples

Successful Kubernetes operators often follow established patterns that address common operational challenges. These patterns have emerged from production deployments across various industries and provide proven approaches to complex automation scenarios.

Multi-Resource Orchestration

Complex applications typically require coordinating multiple Kubernetes resources with dependencies and ordering constraints. PropTechUSA.ai's platform automation demonstrates this pattern by orchestrating database deployments, networking configurations, and security policies through a single operator.

go
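func (r *DatabaseReconciler) reconcileDatabase(ctx context.Context, db *v1alpha1.Database) error {
	// Step 1: Ensure ConfigMap exists
	configMap := r.buildConfigMap(db)
	if err := r.createOrUpdate(ctx, configMap); err != nil {
		return fmt.Errorf("failed to reconcile ConfigMap: %w", err)
	}

	// Step 2: Create Secrets for credentials
	secret := r.buildSecret(db)
	if err := r.createOrUpdate(ctx, secret); err != nil {
		return fmt.Errorf("failed to reconcile Secret: %w", err)
	}

	// Step 3: Deploy StatefulSet
	statefulSet := r.buildStatefulSet(db, configMap, secret)
	if err := r.createOrUpdate(ctx, statefulSet); err != nil {
		return fmt.Errorf("failed to reconcile StatefulSet: %w", err)
	}

	// Step 4: Ensure Service exists
	service := r.buildService(db)
	if err := r.createOrUpdate(ctx, service); err != nil {
		return fmt.Errorf("failed to reconcile Service: %w", err)
	}
	return nil
}

// createOrUpdate creates obj, or overwrites the stored object of the same name.
// errors is k8s.io/apimachinery/pkg/api/errors.
func (r *DatabaseReconciler) createOrUpdate(ctx context.Context, obj client.Object) error {
	err := r.Create(ctx, obj)
	if !errors.IsAlreadyExists(err) {
		return err // nil on success, or an unrelated failure
	}

	// Some kinds, Services among them, reject an Update that lacks the live
	// resourceVersion, so copy it from the stored object first.
	existing := obj.DeepCopyObject().(client.Object)
	if err := r.Get(ctx, client.ObjectKeyFromObject(obj), existing); err != nil {
		return err
	}
	obj.SetResourceVersion(existing.GetResourceVersion())
	return r.Update(ctx, obj)
}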

Lifecycle Management Automation

Operators excel at automating complex lifecycle operations like backup scheduling, rolling updates, and disaster recovery. Implementing these capabilities requires careful consideration of timing, dependencies, and rollback strategies.

go
// log is assumed to be a package-level logr.Logger.
type BackupController struct {
	client.Client
	scheme *runtime.Scheme
	cron   *cron.Cron
}

func (r *BackupController) scheduleBackup(db *v1alpha1.Database) error {
	jobFunc := func() {
		backup := &batchv1.Job{
			ObjectMeta: metav1.ObjectMeta{
				Name:      fmt.Sprintf("%s-backup-%d", db.Name, time.Now().Unix()),
				Namespace: db.Namespace,
			},
			Spec: batchv1.JobSpec{
				Template: corev1.PodTemplateSpec{
					Spec: r.buildBackupPodSpec(db),
				},
			},
		}

		// Own the Job so it is garbage-collected with the Database. The helper
		// (controller-runtime's controllerutil package) resolves APIVersion and
		// Kind from the scheme; db.APIVersion and db.Kind are often empty on
		// typed objects read through the client.
		if err := controllerutil.SetControllerReference(db, backup, r.scheme); err != nil {
			log.Error(err, "Failed to set owner reference", "database", db.Name)
			return
		}

		ctx, cancel := context.WithTimeout(context.Background(), 5*time.Minute)
		defer cancel()

		if err := r.Create(ctx, backup); err != nil {
			log.Error(err, "Failed to create backup job", "database", db.Name)
		}
	}

	_, err := r.cron.AddFunc(db.Spec.Backup.Schedule, jobFunc)
	return err
}

Observability and Metrics Integration

Production operators should expose metrics and logging that integrate with existing observability infrastructure. This includes custom metrics that reflect business logic and operational health indicators.

go
import (
	"github.com/prometheus/client_golang/prometheus"
	"sigs.k8s.io/controller-runtime/pkg/metrics"
)

var (
	// Number of managed databases, broken down by namespace and phase
	databasesTotal = prometheus.NewGaugeVec(
		prometheus.GaugeOpts{
			Name: "proptech_databases_total",
			Help: "Total number of managed databases",
		},
		[]string{"namespace", "phase"},
	)

	// Reconcile latency, broken down by controller and outcome
	reconciliationDuration = prometheus.NewHistogramVec(
		prometheus.HistogramOpts{
			Name: "proptech_reconciliation_duration_seconds",
			Help: "Time spent reconciling databases",
		},
		[]string{"controller", "result"},
	)
)

func init() {
	// Register with the controller-runtime registry so the metrics are served
	// on the manager's metrics endpoint
	metrics.Registry.MustRegister(databasesTotal, reconciliationDuration)
}

💡 Pro Tip: Implement structured logging with consistent field names across your operator to enable effective log aggregation and analysis. Include correlation IDs and request context in log entries for distributed tracing capabilities.
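
One way to keep field names consistent is to define them once and derive a per-request logger from them. controller-runtime logs through logr; the sketch below uses the standard library's `log/slog` (Go 1.21 or later) purely so that it is self-contained. The field names and the `reconcileLogger` and `logOnce` helpers are hypothetical choices, not an established convention.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"log/slog"
)

// Field names shared by every log call in the operator.
const (
	keyController  = "controller"
	keyNamespace   = "namespace"
	keyName        = "name"
	keyReconcileID = "reconcileID"
)

// reconcileLogger returns a logger that stamps every entry with the same
// request context, so all lines from one reconcile can be correlated.
func reconcileLogger(base *slog.Logger, namespace, name, reconcileID string) *slog.Logger {
	return base.With(
		keyController, "database",
		keyNamespace, namespace,
		keyName, name,
		keyReconcileID, reconcileID,
	)
}

// logOnce writes one JSON entry into a buffer and returns its decoded fields.
func logOnce(namespace, name, reconcileID string) (map[string]any, error) {
	var buf bytes.Buffer
	base := slog.New(slog.NewJSONHandler(&buf, nil))
	reconcileLogger(base, namespace, name, reconcileID).Info("statefulset updated", "replicas", 3)

	fields := map[string]any{}
	err := json.Unmarshal(buf.Bytes(), &fields)
	return fields, err
}

func main() {
	fields, err := logOnce("default", "orders-db", "a1b2c3")
	if err != nil {
		panic(err)
	}
	fmt.Println(fields[keyNamespace], fields[keyName], fields[keyReconcileID], fields["msg"])
}
```

With the context attached once, individual log calls only need to add what is specific to that step.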

Production Best Practices and Security Considerations

Deploying Kubernetes operators in production environments requires attention to security, reliability, and operational concerns that may not be apparent during development. These considerations become critical when operators manage business-critical workloads or handle sensitive data.

Security and RBAC Configuration

Operators require carefully scoped permissions that follow the principle of least privilege. The RBAC configuration should grant only the permissions the operator needs for its current functionality; plan for future features by adding rules when those features ship rather than granting access in advance.

yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: database-operator-manager
rules:
  - apiGroups:
      - proptech.ai
    resources:
      - databases
    verbs:
      - create
      - delete
      - get
      - list
      - patch
      - update
      - watch
  - apiGroups:
      - proptech.ai
    resources:
      - databases/status
    verbs:
      - get
      - patch
      - update
  - apiGroups:
      - apps
    resources:
      - statefulsets
    verbs:
      - create
      - delete
      - get
      - list
      - patch
      - update
      - watch
  - apiGroups:
      - ""
    resources:
      - secrets
      - configmaps
      - services
    verbs:
      - create
      - get
      - list
      - patch
      - update
      - watch
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: database-operator-manager
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: database-operator-manager
subjects:
  - kind: ServiceAccount
    name: database-operator-manager
    namespace: operator-system

Resource Management and Limits

Operators themselves consume cluster resources and should be configured with appropriate limits to prevent resource starvation. Additionally, operators that create workloads should implement resource quotas and limits for the resources they manage.

yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: database-operator-controller
  namespace: operator-system
spec:
  replicas: 1
  selector:
    matchLabels:
      control-plane: controller-manager
  template:
    metadata:
      labels:
        control-plane: controller-manager
    spec:
      containers:
        - name: manager
          image: proptech/database-operator:v1.2.0
          resources:
            limits:
              cpu: 100m
              memory: 128Mi
            requests:
              cpu: 50m
              memory: 64Mi
          livenessProbe:
            httpGet:
              path: /healthz
              port: 8081
            initialDelaySeconds: 15
            periodSeconds: 20
          readinessProbe:
            httpGet:
              path: /readyz
              port: 8081
            initialDelaySeconds: 5
            periodSeconds: 10

Testing and Validation Strategies

Comprehensive testing of k8s controllers requires multiple approaches including unit tests for business logic, integration tests for Kubernetes API interactions, and end-to-end tests for complete workflows.

go
func TestDatabaseReconciliation(t *testing.T) {
	// Setup test environment; every type the reconciler touches must be registered
	scheme := runtime.NewScheme()
	require.NoError(t, v1alpha1.AddToScheme(scheme))
	require.NoError(t, corev1.AddToScheme(scheme))
	require.NoError(t, appsv1.AddToScheme(scheme)) // needed for the StatefulSet below

	// Named k8sClient to avoid shadowing the controller-runtime client package
	k8sClient := fake.NewClientBuilder().WithScheme(scheme).Build()
	reconciler := &DatabaseReconciler{
		Client: k8sClient,
		Scheme: scheme,
	}

	// Create test database resource
	db := &v1alpha1.Database{
		ObjectMeta: metav1.ObjectMeta{
			Name:      "test-db",
			Namespace: "default",
		},
		Spec: v1alpha1.DatabaseSpec{
			Replicas: 3,
			Backup: v1alpha1.BackupSpec{
				Schedule:  "0 2 * * *",
				Retention: "30d",
			},
		},
	}
	err := k8sClient.Create(context.Background(), db)
	require.NoError(t, err)

	// Test reconciliation
	key := types.NamespacedName{Name: "test-db", Namespace: "default"}
	result, err := reconciler.Reconcile(context.Background(), reconcile.Request{NamespacedName: key})
	require.NoError(t, err)
	require.False(t, result.Requeue)

	// Verify StatefulSet was created
	statefulSet := &appsv1.StatefulSet{}
	err = k8sClient.Get(context.Background(), key, statefulSet)
	require.NoError(t, err)
	require.NotNil(t, statefulSet.Spec.Replicas)
	assert.Equal(t, int32(3), *statefulSet.Spec.Replicas)
}

Scaling Operations and Ecosystem Integration

As Kubernetes operators mature from proof-of-concept to production systems, they must integrate seamlessly with existing DevOps toolchains and operational procedures. This integration often determines the long-term success and adoption of operator-based automation.

CI/CD Pipeline Integration

Modern operator deployments leverage GitOps principles and automated testing pipelines. PropTechUSA.ai's infrastructure automation platform demonstrates how operators can integrate with continuous delivery systems to enable self-service infrastructure provisioning while maintaining governance and compliance requirements.

Operator versioning and upgrade strategies become critical as custom resources evolve. Implementing proper API versioning with conversion webhooks ensures backward compatibility during operator upgrades.

yaml
# Per-version schemas, scope, and names are omitted here for brevity.
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: databases.proptech.ai
spec:
  group: proptech.ai
  versions:
    - name: v1alpha1
      served: true
      storage: false
      deprecated: true
    - name: v1beta1
      served: true
      storage: true
  conversion:
    strategy: Webhook
    webhook:
      clientConfig:
        service:
          name: database-operator-webhook
          namespace: operator-system
          path: /convert
      conversionReviewVersions: ["v1", "v1beta1"]
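
Behind the `/convert` endpoint, the core of a conversion webhook is a pair of functions that translate between versions without losing information. This article does not define how `v1beta1` differs from `v1alpha1`, so the sketch below assumes one hypothetical change: the retention string (for example "30d") becomes an integer number of days. The type and function names are illustrative only.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// AlphaBackup is the assumed v1alpha1 shape: retention as a string like "30d".
type AlphaBackup struct {
	Schedule  string
	Retention string
}

// BetaBackup is the assumed v1beta1 shape: retention as whole days.
type BetaBackup struct {
	Schedule      string
	RetentionDays int
}

// alphaToBeta converts the old shape to the new one and rejects
// retention values it cannot represent.
func alphaToBeta(a AlphaBackup) (BetaBackup, error) {
	if !strings.HasSuffix(a.Retention, "d") {
		return BetaBackup{}, fmt.Errorf("unsupported retention %q", a.Retention)
	}
	days, err := strconv.Atoi(strings.TrimSuffix(a.Retention, "d"))
	if err != nil || days < 0 {
		return BetaBackup{}, fmt.Errorf("unsupported retention %q", a.Retention)
	}
	return BetaBackup{Schedule: a.Schedule, RetentionDays: days}, nil
}

// betaToAlpha converts back so that clients still using v1alpha1 keep working.
func betaToAlpha(b BetaBackup) AlphaBackup {
	return AlphaBackup{Schedule: b.Schedule, Retention: strconv.Itoa(b.RetentionDays) + "d"}
}

func main() {
	alpha := AlphaBackup{Schedule: "0 2 * * *", Retention: "30d"}
	beta, err := alphaToBeta(alpha)
	fmt.Println(beta.RetentionDays, err)
	fmt.Println(betaToAlpha(beta) == alpha) // the round trip loses nothing
}
```

Round-trip tests like the last line are worth keeping in the operator's test suite, because the API server may convert the same object in both directions while both versions are served.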

Multi-Cluster and Federation Patterns

Enterprise environments often require operators that span multiple clusters or coordinate resources across different environments. This adds complexity around networking, security, and data consistency that must be carefully architected.

Monitoring and Alerting Integration

Production operators should integrate with existing monitoring infrastructure and provide actionable alerts when manual intervention is required. Effective alert design focuses on symptoms rather than causes and provides sufficient context for rapid incident response.

💡 Pro Tip: Implement operator health checks that verify not just the controller's operational status but also the health of managed resources. This enables proactive monitoring and helps identify issues before they impact applications.
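
As a rough illustration of checking managed resources rather than just the controller process, the sketch below rolls the `Replicas` and `ReadyReplicas` values from each Database's status into a single health level. The `ManagedHealth` type, the three levels, and the thresholds are assumptions made for this example.

```go
package main

import "fmt"

// ManagedHealth is a hypothetical snapshot taken from one Database's status.
type ManagedHealth struct {
	Name          string
	Replicas      int32
	ReadyReplicas int32
}

// Level orders health states from best to worst.
type Level int

const (
	Healthy Level = iota
	Degraded
	Unavailable
)

func (l Level) String() string {
	return [...]string{"healthy", "degraded", "unavailable"}[l]
}

// assess classifies one database by comparing ready and desired replicas.
func assess(h ManagedHealth) Level {
	switch {
	case h.ReadyReplicas >= h.Replicas:
		return Healthy
	case h.ReadyReplicas > 0:
		return Degraded
	default:
		return Unavailable
	}
}

// worst returns the most severe level across all managed databases,
// together with the names of those that are not fully healthy.
func worst(all []ManagedHealth) (Level, []string) {
	level := Healthy
	var affected []string
	for _, h := range all {
		l := assess(h)
		if l != Healthy {
			affected = append(affected, h.Name)
		}
		if l > level {
			level = l
		}
	}
	return level, affected
}

func main() {
	level, affected := worst([]ManagedHealth{
		{Name: "orders-db", Replicas: 3, ReadyReplicas: 3},
		{Name: "users-db", Replicas: 3, ReadyReplicas: 1},
	})
	fmt.Println(level, affected)
}
```

Returning the affected names alongside the level gives an alert the context a responder needs, in line with the symptom-focused alerting described above.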

The future of Kubernetes operations increasingly relies on sophisticated automation that reduces operational overhead while improving reliability and consistency. Custom resources and operators provide the foundation for this evolution, enabling organizations to codify operational knowledge and scale expertise across teams. Success in this domain requires balancing automation complexity with operational simplicity, ensuring that operators enhance rather than complicate production environments.

For organizations looking to implement production-grade Kubernetes operators, start with clear requirements and simple use cases before expanding functionality. Consider leveraging platforms like PropTechUSA.ai that provide battle-tested operator patterns and infrastructure automation capabilities, allowing teams to focus on business logic rather than Kubernetes plumbing.

Ready to streamline your Kubernetes operations? Explore how PropTechUSA.ai's automation platform can accelerate your operator development and deployment processes while ensuring production reliability and security best practices.
