This post presents some high-level architectural ideas for implementing Data Lakes using SQL Server 2022, specifically SQL Server 2022 S3 data virtualisation. Whilst SQL Server 2022 is under NDA, this post and subsequent posts will focus on leveraging S3 for a Data Lake – ready for the data to be virtualised via SQL Server. This post also serves to provide some ideas for architectures that organisations already using Big Data Clusters could put into place.
Path Forwards for Analytics
On February 25th, in this blog post, Microsoft announced to the world its “Path forwards for analytics”. One of the more salient points in the post was the announcement that Big Data Clusters will be retired. I want to make one point abundantly clear: Microsoft has provided a wealth of guidance on where to go next and how it is prepared to help folks already using Big Data Clusters. Here I will pick out two key points from Microsoft’s post:
- You have time
Support for SQL Server 2019 Big Data Clusters will continue until February 25th, 2025
- Good architecture replacement options are just around the corner
SQL Server 2022 S3 object virtualisation is the cornerstone of what your organisation can put into place for an analytics platform after February 25th, 2025. Whilst Microsoft has yet to announce the GA date for SQL Server 2022, it is safe to assume that it will be available in public preview form some time before then.
Scale-out Analytics Platforms On-Premises Are No Longer A Thing?
Not quite: the blog post mentions Microsoft’s investment in Azure Arc. For a scale-out data/analytics platform on-premises – or one you can run anywhere – I would refer people to Azure Arc-enabled PostgreSQL Hyperscale.
This post assumes that, for reasons relating to data sovereignty, or fiduciary or regulatory concerns in general, the:
- analytics platform will be underpinned by something that is agnostic of both cloud and on-premises infrastructure – in other words, Kubernetes
- focal points of the Data Lake processing element will be Python and open source tools
- SQL Server 2022 S3 object virtualisation is the preferred technology for querying the Data Lake via a T-SQL surface area
- S3 is the preferred technology for storing the data in our Data Lake.
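To make the last two assumptions concrete, the T-SQL below sketches how a SQL Server 2022 instance might query Parquet data sitting in an S3 bucket, based on the publicly documented preview syntax for S3 data virtualisation. The endpoint, bucket, credential values and file path are all hypothetical placeholders:

```sql
-- Hypothetical names throughout; a database master key must already exist.
CREATE DATABASE SCOPED CREDENTIAL s3_datalake_cred
WITH IDENTITY = 'S3 Access Key',
     SECRET   = '<access_key_id>:<secret_key_id>';

CREATE EXTERNAL DATA SOURCE s3_datalake
WITH ( LOCATION   = 's3://s3.example.internal:9000/datalake',
       CREDENTIAL = s3_datalake_cred );

-- Query a Parquet file in the Data Lake via the T-SQL surface area
SELECT *
FROM OPENROWSET ( BULK '/sales/year=2022/data.parquet',
                  FORMAT      = 'PARQUET',
                  DATA_SOURCE = 's3_datalake' ) AS [s3_rows];
```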
High Level Architectures
I will propose three potential Data Lake-oriented architectures that can be used to replace Big Data Clusters:
- Data Lake with the Spark Operator
The Spark Operator, as its name suggests, is a Kubernetes operator for Apache Spark. Working with the Spark Operator involves deploying it to a Kubernetes cluster in the first instance. Jobs are then submitted by creating SparkApplication objects; refer to this example for a PySpark job. Before a SparkApplication job can be created, a Docker image for use in the job needs to be built: do this by customising the Dockerfile that ships with the standard Spark distribution and building it with the docker image build tool. Jobs created via SparkApplication objects differ from jobs created via spark-submit in that SparkApplication objects dynamically spin up a Spark cluster, whereas spark-submit relies on a cluster already being present.
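As a sketch of what such a job object might look like – the image name, namespace, script path and resource sizes below are all hypothetical – a PySpark SparkApplication submitted to the operator takes roughly this shape:

```yaml
apiVersion: "sparkoperator.k8s.io/v1beta2"
kind: SparkApplication
metadata:
  name: etl-job
  namespace: spark-jobs
spec:
  type: Python
  mode: cluster
  # Image built from the Spark distribution's Dockerfile
  image: "registry.example.internal/spark-py:v3.2.1"
  mainApplicationFile: "local:///opt/spark/jobs/etl.py"
  sparkVersion: "3.2.1"
  driver:
    cores: 1
    memory: "1g"
    serviceAccount: spark
  executor:
    instances: 2
    cores: 1
    memory: "2g"
```

When this object is created, the operator spins up the driver and executor pods for the lifetime of the job – the dynamic cluster behaviour described above.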
- Data Lake with the vanilla upstream Spark
Deploying vanilla upstream Spark to a Kubernetes cluster is well covered by the documentation on the Apache Spark site.
- Data Lake with Python/Boto3 Data Pipelines
Boto3 is the AWS SDK for Python, which allows you to work directly with S3. Code written in Python and packaged into Docker image form can then be incorporated into pipelines orchestrated using tools such as Argo Workflows – something that I am currently working on.
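As a minimal sketch of one such pipeline step – the bucket name, endpoint and key layout are hypothetical – the following uses Boto3 to upload a file into a date-partitioned S3 layout that SQL Server 2022 could later virtualise over:

```python
from datetime import date


def partition_key(prefix: str, run_date: date, file_name: str) -> str:
    """Build a Hive-style partitioned S3 key,
    e.g. sales/year=2022/month=03/day=01/data.parquet."""
    return (
        f"{prefix}/year={run_date.year}"
        f"/month={run_date.month:02d}"
        f"/day={run_date.day:02d}/{file_name}"
    )


def upload_to_data_lake(local_path: str, bucket: str,
                        key: str, endpoint_url: str) -> None:
    """Upload a local file to an S3-compatible object store."""
    import boto3  # AWS SDK for Python

    # endpoint_url points at the S3-compatible store backing the Data Lake
    s3 = boto3.client("s3", endpoint_url=endpoint_url)
    s3.upload_file(local_path, bucket, key)
```

A pipeline container would call `upload_to_data_lake("data.parquet", "datalake", partition_key("sales", date(2022, 3, 1), "data.parquet"), "https://s3.example.internal:9000")`; Argo Workflows would then sequence containers like this into a full pipeline.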
Tools for Working with S3
There are two tools in particular that I recommend for working with S3:
- Cyberduck – an S3 object browser in GUI form
- s5cmd – a high performance command line tool for manipulating data inside of S3 buckets, and loading/unloading data to/from S3
In a follow up post I hope to explore some of these Data Lake architecture options in greater detail.