Google, which developed Kubernetes in the first place, has announced the beta release of the Kubernetes Operator for Apache Spark (Spark Operator). It enables Spark to run natively on Kubernetes clusters, allowing Spark applications for analytics, data engineering, or machine learning to be submitted to those clusters much as they would be to any other Spark deployment.
Apache Spark is an enormously popular execution framework for data engineering and machine learning workloads. It powers the Databricks platform and is available in both on-premises and cloud-based Hadoop services, such as Amazon EMR, Azure HDInsight, and Google Cloud Dataproc; it can also run on Mesos clusters. As Google notes, Spark Operator is a Kubernetes custom controller that uses custom resources for the declarative specification of Spark applications, and it supports automatic restart as well as cron-based, scheduled applications. Developers, data engineers, and data scientists can write declarative specifications that describe their Spark applications and then use native Kubernetes tooling such as kubectl to manage them.
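To make the declarative model concrete, here is a sketch of what such a specification looks like. The field names follow the Spark Operator's `SparkApplication` custom resource as documented in the project's examples; the image tag, jar path, and resource sizes below are illustrative placeholders, so consult the project's GitHub documentation for values matching your cluster.

```yaml
# Illustrative SparkApplication manifest for the Spark Operator.
# Image, jar path, and resource values are placeholders, not a
# definitive configuration.
apiVersion: "sparkoperator.k8s.io/v1beta1"
kind: SparkApplication
metadata:
  name: spark-pi
  namespace: default
spec:
  type: Scala
  mode: cluster
  image: "gcr.io/spark-operator/spark:v2.4.0"
  mainClass: org.apache.spark.examples.SparkPi
  mainApplicationFile: "local:///opt/spark/examples/jars/spark-examples_2.11-2.4.0.jar"
  sparkVersion: "2.4.0"
  restartPolicy:
    type: Never          # the operator also supports automatic restart policies
  driver:
    cores: 1
    memory: "512m"
    serviceAccount: spark
  executor:
    cores: 1
    instances: 2
    memory: "512m"
```

Once applied with `kubectl apply -f spark-pi.yaml`, the application can be inspected with standard tooling, e.g. `kubectl get sparkapplications`, which is precisely the "native Kubernetes tooling" workflow Google describes.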
Kubernetes Operator for Apache Spark is available on the Google Cloud Platform marketplace for Kubernetes applications, in the form of Google Click to Deploy containers, for easy deployment to Google Kubernetes Engine. But Kubernetes Operator for Apache Spark is an open-source project and can be deployed to any Kubernetes environment, and the project's GitHub site offers Helm chart-based command-line installation instructions. It will be intriguing to see whether Amazon and Microsoft support and provide simple deployment of the Spark Operator for their own Kubernetes services, Elastic Container Service for Kubernetes (EKS) and Azure Kubernetes Service (AKS), respectively. Doing so would be a great service to their users who do not want the overhead of an EMR, HDInsight, or Databricks workspace and cluster.
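The Helm-based installation mentioned above boils down to a couple of commands. The chart repository URL, chart name, and namespace below are assumptions based on typical Helm usage at the time of the beta; the project's GitHub README is the authoritative source and may list different values.

```sh
# Sketch of a Helm chart installation of the Spark Operator.
# Repository URL, chart name, and namespace are illustrative; check
# the project's GitHub README for the current instructions.
helm repo add incubator https://kubernetes-charts-incubator.storage.googleapis.com
helm install incubator/sparkoperator --namespace spark-operator
```

After installation, the operator watches for `SparkApplication` resources in the cluster and handles submitting, restarting, and scheduling the corresponding Spark jobs.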