A Cloud Data Lakehouse Success Story - Part 1
By: Ron L'Esteve | Updated: 2022-06-28
While organizations are interested in cloud adoption, they also want to learn about its benefits through real-world success stories. They want to know how to implement a Cloud Platform MVP project by identifying the challenges it will solve and the business goals it will achieve. What are some of the strategic approaches for cloud adoption and solution capabilities that might empower the customer journey to the cloud?
The journey to cloud adoption varies for each customer. While some organizations are just beginning their journey, others are more mature on the Lakehouse platform and are continuing their journey with Advanced Analytics and AI projects. Organizations in the foundational stages of their journey have been building successful and scalable Minimum Viable Products (MVPs) as part of applied innovation and digital transformation initiatives. These organizations have been bold enough to challenge their traditional on-premises norms by embarking on their journey into the cloud. Furthermore, these organizations have seen cost efficiencies and have gained access to robust reporting and advanced analytics capabilities, all of which accommodate the three Vs of Data (Volume, Velocity, and Variety). In this two-part series, we will begin to explore one such Data Lakehouse success story by understanding the existing on-premises challenges, the business goals, the strategy and solution approach, and the solution's capabilities.
Challenge & Business Goal
With a huge traditional on-premises footprint in place, a customer had been struggling with infrastructure from a cost, maintenance, and scalability perspective. Additionally, performing advanced analytics, integrating robust data engineering pipelines from multiple cloud sources, and deploying advanced reporting capabilities had become a challenge with traditional on-premises tools. The customer had to outsource their Data Science workloads to vendors, which added to their cost and complexity challenges. The customer needed a highly performant and cost-efficient Modern Data Platform to meet their growing needs for applied innovation and digital transformation.
Strategy & Solution Approach
The customer decided that they needed to modernize their Data and Analytics Platform through cloud-centric innovation and transformation. With the support of expert Cloud Consultants, a significant amount of time was spent envisioning the future state platform through architectural and design-led thinking workshops, which tailored the solution approach and accompanying architecture to best fit their needs. A team of Data, Infrastructure, DevOps, QA, and Reporting experts was subsequently put together to deliver a performant and cost-efficient foundational Cloud Data Platform Stack, reusable data engineering pipelines, reporting visualizations, dashboards, and data models to enable near real-time advanced insights for self-service BI, advanced analytics, and faster business decisions.
The foundational Cloud Platform that was planned for delivery would solve many on-premises challenges. The low-cost Data Lake storage supports the decoupling of compute from storage and is governed by secured, group membership-based access. Additionally, the Software as a Service (SaaS) and Platform as a Service (PaaS) resources would follow a 'pay-as-you-go' model, with the option to reserve and pre-purchase capacity at deep discounts for certain qualified technologies. This flexible cost model is itself a strong argument for adopting a Cloud Platform. In addition, the ability to set budgets, monitor spend, and alert owners and administrators when budgetary thresholds are reached puts platform administrators at ease as they manage costs while the platform grows and scales.
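The budget and threshold alerting idea described above can be sketched in a few lines. This is a minimal illustration only, not any cloud provider's budgeting API: the `Budget` class, the function names, and the 50%/80%/100% thresholds are all hypothetical choices made for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Budget:
    """Hypothetical budget definition; not tied to any provider's API."""
    name: str
    monthly_limit: float                   # budget amount in the billing currency
    thresholds: tuple = (0.5, 0.8, 1.0)    # assumed alert points, as fractions of the limit


def crossed_thresholds(budget, spend_to_date):
    """Return every threshold (as a fraction) that the current spend has reached."""
    used = spend_to_date / budget.monthly_limit
    return [t for t in budget.thresholds if used >= t]


def alert_messages(budget, spend_to_date):
    """Build one human-readable alert per threshold that has been reached."""
    return [
        f"{budget.name}: spend {spend_to_date:,.2f} has reached "
        f"{t:.0%} of the {budget.monthly_limit:,.2f} budget"
        for t in crossed_thresholds(budget, spend_to_date)
    ]
```

For example, with a hypothetical budget of 1,000 and a spend to date of 850, `crossed_thresholds` reports the 50% and 80% thresholds but not the 100% one. In practice, the cloud provider's own cost management service would evaluate these thresholds and notify the owners.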
Both Infrastructure as Code and robust security models were put into place to securely administer access to various resources and products. Data was ingested from multiple cloud and on-premises sources through a combination of incremental real-time and batch ELT pipelines, which were driven by metadata-based Audit, Balance, and Control (ABC) frameworks. Various storage zones were created to accommodate the different stages of processed data. The Bronze Zone contained raw data in its natural form. The Silver Zone contained cleansed data to which data quality and/or masking logic had been applied. If PII data needed to be further masked, encrypted, and provided to external vendors, a new Encrypted Zone could be introduced. As the curated and transformed data made its way into the final Gold Zone, it was served to various Business and Data Science teams for consumption by reporting and advanced analytics tools.
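The zone flow described above can be illustrated with a small, self-contained sketch. It uses plain Python lists and dictionaries rather than a real lake or Spark, and everything in it is an assumption made for illustration: the field names (`customer_id`, `email`, `amount`), the data quality rule, the use of a SHA-256 hash as a stand-in for masking, and the shape of the audit record are not taken from the customer's actual pipelines.

```python
import hashlib
from datetime import datetime, timezone


def to_silver(bronze_rows):
    """Cleanse raw Bronze rows: reject rows without a key and mask the email."""
    silver = []
    for row in bronze_rows:
        if not row.get("customer_id"):
            continue  # assumed data quality rule: a row must carry a customer_id
        email = row.get("email", "").strip().lower()
        silver.append({
            "customer_id": row["customer_id"],
            # One-way hash used here as a simple stand-in for PII masking.
            "email_hash": hashlib.sha256(email.encode("utf-8")).hexdigest(),
            "amount": float(row["amount"]),
        })
    return silver


def to_gold(silver_rows):
    """Curate Silver rows into a Gold aggregate: total amount per customer."""
    totals = {}
    for row in silver_rows:
        key = row["customer_id"]
        totals[key] = totals.get(key, 0.0) + row["amount"]
    return totals


def audit_record(pipeline, rows_in, rows_out):
    """ABC-style control entry: row counts in and out, so rejects can be reconciled."""
    return {
        "pipeline": pipeline,
        "rows_in": rows_in,
        "rows_out": rows_out,
        "rows_rejected": rows_in - rows_out,
        "logged_at": datetime.now(timezone.utc).isoformat(),
    }
```

A run would pass raw rows through `to_silver`, log `audit_record("bronze_to_silver", len(bronze), len(silver))`, and then build the Gold aggregate with `to_gold`. On a real Lakehouse, the same steps would typically be expressed as Spark jobs writing to separate storage locations per zone.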
There is an increasing demand for AI and ML driven workloads. The availability of both custom and out-of-the-box connectors, models, and reusable frameworks has empowered both mature and citizen AI Engineers, ML Engineers, and Data Scientists to do more with their Cloud Platform. Mature, data-driven cloud organizations frequently use advanced Data Science and AI services, tools, infrastructure, and frameworks to drive their most complex and innovative business use cases and to extract deep value and outcomes that contribute to the organization's growth.
Many cloud providers support open-source ML and AI frameworks and supply Engineers and Scientists with the infrastructure and compute to run these workloads on a variety of platforms, including the IDEs of their choice. With a Cloud Platform, customers have access to Deep Learning and Conversational AI Bot frameworks, in addition to Cognitive and Machine Learning services that can be used from the developer's IDE of choice. Developers can use a combination of VS Tools, ML Studios and Workbenches, Jupyter Notebooks, PyCharm, and more. Machine Learning and Deep Learning frameworks including Cognitive Toolkits, TensorFlow, Caffe, Keras, scikit-learn, and more are available for immediate use. Finally, AI data (e.g., Data Lake), compute (e.g., Apache Spark), and infrastructure (e.g., FPGA, GPU) resources are readily customizable and deployable. The figure below illustrates the various AI services, tools, and infrastructure.
There are numerous on-premises and cloud-based reporting tools available on the market. Often, as organizations begin their cloud adoption journey, they have an existing suite of on-premises tools that they have been using for years or decades. These on-premises technologies can include databases, ETL tools, analysis services, and report development tools. From a reporting perspective, traditional on-premises tools such as SQL Server Reporting Services (SSRS) and Tableau have been in existence since 2004. They are mature tools that have accumulated an excellent set of product features over time. However, they came with the burden of having to install and maintain costly reporting servers. As products and features within the modern cloud platform took flight, cost efficiencies were also applied to innovative technologies such as Microsoft's flagship cloud reporting technology, Power BI, which was initially released in 2011.
Power BI is a business analytics service that can analyze and visualize data, extract insights, and share them across departments within an organization. This cloud-based reporting technology supported seamless integration with multi-cloud technologies and provided access to a variety of AI and ML powered tools, visuals, and data enrichment capabilities. Additionally, Power BI introduced highly efficient cost model options, including the ability to pay per user. As Power BI gained traction as a result of these features, traditional on-premises reporting tools began to lose market share because of their high and inflexible cost models and their limited capabilities for AI, ML, and cloud-native integration and sharing. Customers began migrating their workloads from traditional on-premises to cloud-native reporting tools and continued this trajectory as cloud adoption MVP projects increased. To remain competitive in the face of this trend, Tableau introduced its cloud-based version, Tableau Online, in 2013. Following this competitive shift into the cloud, Gartner recognized both Microsoft (#1) and Tableau (#2) as Magic Quadrant Leaders in Analytics and Business Intelligence Platforms for thirteen years. The new era of cloud reporting tools brought per-user cost models, high performance for big data workloads, easy-to-use interfaces, a plethora of cloud source connectors, excellent custom development capabilities, rich visualizations, and out-of-the-box AI and ML tools. The figure below illustrates some of the AI features of Power BI.
Any transformational Cloud Adoption MVP project will need a team of fully dedicated Data Science and Engineering professionals. Each persona on these teams has a dedicated workspace in Databricks that caters to their tools, frameworks, compute, and infrastructure. In the Data Science & Engineering workspace, developers build complex real-time incremental streaming ELT pipelines and complex Data Science models such as customer lifetime value and propensity models. In the Machine Learning workspace, the Machine Learning engineer creates feature stores, runs AutoML or custom experiments, and productionizes models by serving them as endpoints to consumers. Finally, Business and BI Analysts leverage the SQL workspace to write ANSI SQL queries to explore and analyze data and to create rich visualizations that derive business insights and outcomes. This full collection of persona-based workspaces unifies an end-to-end data and analytics platform intended to deliver transformational initiatives at scale to achieve business value. The figure below shows the various workspaces and the capability of toggling between them.
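To make the customer lifetime value example mentioned above more concrete, here is a deliberately simplified sketch. It assumes a common historical heuristic (average order value × orders per year × expected customer lifespan) rather than the probabilistic models a Data Science team would typically build, and the function name, input shape, and parameters are hypothetical.

```python
from collections import defaultdict


def simple_clv(orders, period_years, expected_lifespan_years):
    """Estimate CLV per customer with a simple historical heuristic.

    orders: iterable of (customer_id, order_amount) observed over period_years.
    Returns {customer_id: avg order value * orders per year * expected lifespan}.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for customer_id, amount in orders:
        totals[customer_id] += amount
        counts[customer_id] += 1

    clv = {}
    for customer_id, total in totals.items():
        avg_order_value = total / counts[customer_id]
        orders_per_year = counts[customer_id] / period_years
        clv[customer_id] = avg_order_value * orders_per_year * expected_lifespan_years
    return clv
```

For instance, a customer with two orders of 100 and 50 in one observed year and an assumed three-year lifespan would receive an estimate of 75 × 2 × 3 = 450. The heuristic ignores churn, discounting, and margins, so it is only a starting point for the kind of model described here.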
- Read more about the Differences and Comparisons between Power BI and Tableau
- Explore Databricks' Solution Accelerator for Customer Lifetime Value (CLV)
- Read more about How to Get Started with Databricks Machine Learning
- Read more about the Databricks SQL Workspace and Data Science & Engineering Workspace
- Read more about the Benefits and Limitations of Multi-Cloud
- Read more about AWS vs Azure vs GCP