
Giulia Lanzafame
on 4 September 2025

Implement an enterprise-ready data lakehouse architecture with Spark and Kyuubi 


Here at Canonical we are excited to announce the first release of our solution for enterprise-ready data lakehouses, built on the combination of Apache Spark and Apache Kyuubi. Using Charmed Apache Kyuubi together with Charmed Apache Spark, you can deliver a robust, production-grade, open source data lakehouse. Our Apache Kyuubi charm integrates tightly with the Charmed Apache Spark bundle, giving big data practitioners a single, simpler-to-use SQL interface.

Data lakehouse: an architecture overview

The lakehouse architecture for data processing and analytics is a paradigm shift in enterprise data management. Historically, organizations have been forced to trade off between the raw, scalable storage of data lakes and the fast, structured queryability of data warehouses. The lakehouse approach bridges that gap: enterprises can store large quantities of structured and unstructured data on a single platform and run streaming, batch processing, and rapid analytics, all wrapped in transactional integrity and governance. Canonical’s approach to the data lakehouse relies on the integration of Apache Spark and Apache Kyuubi, creating a platform where batch and streaming data can coexist, be processed at scale, and be made available for advanced analytics and AI/ML in an instant.

At the heart of this lakehouse blueprint is Apache Spark, the industry’s standard distributed data processing engine. Spark’s in-memory, fault-tolerant architecture lets users run high-throughput ETL, data transformation, and iterative machine learning workloads. Canonical’s approach leverages Spark OCI images with Kubernetes as the cluster manager, modernizing standard Spark jobs and optimizing for cost and performance. The integration supports a range of data sources for ingestion, including any S3-compatible object storage and Azure Blob Storage, and can use external databases as metastores.
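As an illustrative sketch, the snippet below shows how a PySpark session might be configured to use Kubernetes as the cluster manager and an S3-compatible object store as the data source. The API server URL, container image, bucket names, and credentials are hypothetical placeholders, not values taken from the Charmed Apache Spark bundle.

from pyspark.sql import SparkSession

# Minimal sketch: Spark on Kubernetes reading from S3-compatible storage.
# All endpoints, image names, and credentials are hypothetical; it also
# assumes the hadoop-aws (s3a) connector is on the classpath.
spark = (
    SparkSession.builder
    .appName("lakehouse-etl")
    .master("k8s://https://kubernetes.example.com:6443")  # Kubernetes as cluster manager
    .config("spark.kubernetes.container.image", "my-registry/spark:latest")
    .config("spark.hadoop.fs.s3a.endpoint", "https://s3.example.com")  # any S3-compatible store
    .config("spark.hadoop.fs.s3a.access.key", "ACCESS_KEY")
    .config("spark.hadoop.fs.s3a.secret.key", "SECRET_KEY")
    .getOrCreate()
)

# A simple batch transformation: read raw events, aggregate, write back.
events = spark.read.parquet("s3a://raw-zone/events/")
(events.groupBy("event_type").count()
       .write.mode("overwrite")
       .parquet("s3a://curated-zone/event_counts/"))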

One of the greatest challenges in enterprise Spark deployments has always been providing secure, multi-user, easy-to-use SQL access to business users, analysts, and data scientists. That is where Apache Kyuubi shines. Kyuubi is a high-throughput, multi-tenant SQL gateway for Spark that provides a single JDBC and ODBC endpoint, well suited to integration with data exploration tools like Tableau, Power BI, and Apache Superset. Unlike Spark’s own Thrift Server, Kyuubi provides true session isolation, so each application or user runs its own secure Spark context. This not only adds a layer of security but also enables fine-grained resource allocation, workload prioritization, and strict auditing, all of which are critical for compliance and governance in regulated industries.
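To make the idea concrete, here is a minimal sketch of querying the lakehouse through a Kyuubi endpoint from Python. Because Kyuubi speaks the HiveServer2 Thrift protocol, a Hive-compatible client such as PyHive can be used; the hostname, username, and table below are hypothetical placeholders.

from pyhive import hive

# Minimal sketch: connect to Kyuubi's Thrift endpoint (port 10009 is
# Kyuubi's default). Host, credentials, and table are hypothetical.
# Each session is backed by its own isolated Spark context.
conn = hive.Connection(host="kyuubi.example.com", port=10009, username="analyst")
cursor = conn.cursor()
cursor.execute("SELECT event_type, count(*) FROM curated.events GROUP BY event_type")
for row in cursor.fetchall():
    print(row)
conn.close()

BI tools such as Tableau, Power BI, or Apache Superset can point at the same endpoint through their generic Hive or JDBC connectors, which is what makes the single-gateway model convenient for analysts.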

A charming lakehouse, fit for an enterprise

Canonical’s Spark and Kyuubi lakehouse stack is built for speed and reliability. The deployment is automated end to end using Canonical’s charmed operators, which manage the lifecycle of Spark, Kyuubi, and the supporting components. This includes automated cluster provisioning, rolling upgrades, fault tolerance, security patching, and cloud-native elastic scaling across Kubernetes environments.

Security is built into every layer of the bundle. The release of the Charmed Apache Spark/Kyuubi bundle includes end-to-end encryption, native integration with the Canonical Observability Stack, and security hardening with improved documentation. In addition, we have patched several critical- and high-severity CVEs for this launch, strengthening the product’s overall security posture. The bundle now includes backup and restore for Kyuubi, improving reliability and business continuity, and adds in-place upgrades to minimize downtime and complexity. High-availability support allows Kyuubi servers to be scaled reliably for mission-critical workloads.

The spark-kyuubi bundle is platform-agnostic, supporting hybrid, multi-cloud, and on-premises deployments. This avoids vendor lock-in and empowers organizations to optimize cost, performance, and compliance on the infrastructure of their choice. Whether you are building a greenfield analytics platform or refactoring a legacy Hadoop deployment, Canonical’s solution provides an easy way forward with expert support every step of the way.

Alongside the new features and security patches, the release brings improved usability and documentation. The deployment process is fully documented, and the solution is available through the standard Canonical channels, so we encourage you to read the documentation and the release notes and, ultimately, to give it a try.

We have also recently delivered a webinar, “Open source data lakehouse architecture with Spark and Kyuubi – an engineering deep dive”, which you can follow for a guided deployment experience. The result is a more secure and innovative big data analytics stack that enterprises can deploy on premises or in the cloud. With this launch, organizations can move forward with confidence that they are benefiting from the latest open source developments in big data analytics.


Spark and Kyuubi: try it today

In summary, Canonical’s Kyuubi and Spark-based data lakehouse enables organizations to unify their data architecture, accelerate analytics, and future-proof their data strategy. By combining open source innovation with enterprise-grade support, Canonical empowers businesses to unlock the true potential of their data – reliably, efficiently, and at scale. We invite data engineers, architects, and IT enthusiasts to test the solution and find out more about how Canonical can help you build the next generation of data-driven applications and insights.
