Containerising Data Pipeline Components

The last post in this series covered some simple Python code that leveraged Twitter’s tweepy API in order to obtain tweets based on a query, sentiment score each tweet and then load these into an […]
In my last post I outlined a number of architectural options for solutions that could be implemented in light of Microsoft retiring SQL Server 2019 Big Data Clusters, one of which was data pipelines that […]
TL;DR This post presents some high-level architectural ideas for implementing Data Lakes using SQL Server 2022, specifically SQL Server 2022 S3 data virtualisation. Whilst SQL Server 2022 is under NDA, this post and subsequent posts […]
Someone I know had worked at an organization that needed to scale out their OpenShift footprint. They were constrained by the speed of their procurement department and were wondering if they could get by with […]
Part seven of this series focuses on deploying an Azure Arc enabled Data Services controller to a Kubernetes cluster. As per the closing comments of the last blog post, PX Backup will be covered in […]
Part six of this series will focus on deploying a storage solution to our Kubernetes cluster: Where Were We? If you have been following this blog post series you should have: a basic grasp […]
Our journey up the stack brings us to the installation of MetalLB – a software load balancer for Kubernetes: All the content in this series of blog posts relates to the Arc-PX-VMware-Faststart repo on GitHub, […]
In the last post, part 3 of this series – we started off at the bottom of the stack with the Terraform module for virtual machine creation. We continue our journey up the stack in […]
Part 3 of this series will begin the journey up the stack, starting with the deployment of the virtual machines that will host the Kubernetes cluster nodes: All the blog posts in this series relate […]
Before diving into what the various Terraform modules that make up the Arc-PX-VMware-Faststart repo do, I’m going to provide an introduction to Terraform in this blog post. Terraform comes from HashiCorp; it is a tool […]
One of the most significant things to change the landscape for Azure data professionals will be the general release of Azure Arc enabled Data Services. To provide an expedient means of experiencing all that Azure Arc […]
A source of some interesting discussions at work is whether or not Kubernetes nodes should be virtualized. The thesis behind why this is not a good idea is the fact that a virtualized layer adds […]
CU5 for SQL Server 2019 Big Data Clusters ushers in support for Red Hat OpenShift Container Platform. This is a big deal – but what exactly is OpenShift and, more saliently, why does it matter […]
This series of posts is essentially my presentation from the recent Data Weekender popup virtual conference – in blog form with some bonus additional content. This post focuses on the things you should consider before […]
At work we are seeing a burgeoning demand for Kubernetes test and development environments; as such, we have been looking at simple and rapid ways to provision clusters. I have already blogged about the use […]
I have not blogged for a while; it was my hope to produce part 5 in the series on creating a Kubernetes cluster for production grade Big Data Clusters. However, there is a very good […]
The seed of the idea behind this blog post first germinated when I noticed the following yaml: Note the line that includes mergeTestResults; this got me thinking along the lines of running multiple tests in […]
I was originally going to cover storage in its entirety in a single blog post. However, as storage and Kubernetes is the cause of a tremendous amount of confusion in the Microsoft data platform community, […]
The previous post in this series covered Kubernetes cluster creation via Kubespray. It was my intention to cover off load balancing in this post, however at the time of writing when you create a SQL […]
Part 1 of this series covered the creation of the virtualized infrastructure on which to create a Kubernetes cluster. There are a variety of tools for building clusters, including Kops, Kubespray and Kubeadm. Kubeadm is perhaps […]
This blog post is the first in a series detailing how to build a Kubernetes cluster to which a SQL Server 2019 big data cluster can be deployed. For the purposes of learning and on-boarding there are […]
From a ‘Vanilla’ Kubernetes perspective, where all nodes in the cluster run on Linux, only containers based on Linux images can run. As of version 1.9 of Kubernetes, we are currently on 1.12 at the […]
In the previous post the scene was set for why container orchestration is required, what Kubernetes is and why the world of open source should be entered into with one’s eyes wide open. This post […]
With the announcement of SQL Server 2019 big data clusters at Ignite, Kubernetes (often abbreviated to K8s) now stands front and center as part of Microsoft’s data platform vision. The obvious inference being that this […]
Where Were We? In part I of this series I set the scene for why you would want to use Docker and Jenkins for SQL Server continuous integration pipelines. The first post also covered […]
There seems to be a great deal of interest in containers and Kubernetes at present, fueled by Microsoft hinting that Kubernetes has a big part to play in the future of the Microsoft data platform: Of […]
In the first post in this series I covered why you might want to use Jenkins as a CI engine and how to deploy to SQL Server running in a container using the ‘Sidecar’ pattern. […]
The mainstay of my presentation material this year has been my deck on continuous integration, Docker and Jenkins. For people who have not had the chance to see this presentation or have seen it and […]
In the previous part of this blog post I discussed how containers could be used to scale out a singleton workload. Whereas my attempts to get my experiments to work ran into difficulties […]
I will forewarn readers of this blog post that this is ‘Conceptual’ in nature, due to the fact that in my tests I was spinning up containers which then fell over with core dumps. Nonetheless, I […]
This post covers building a simple continuous integration environment using Jenkins and SQL Server Data Tools which is fully containerised. There are two GitHub repositories associated with this post, the first contains the files for […]
Consider a scenario in which you wish to use DACPACs, but you want to spin up SQL Server in a container on Linux (say Ubuntu) because you wish to forgo the cost of having to […]
In this post I am going to demonstrate how to use one of Jenkins’ more powerful features: its ability to create multi-branch build pipelines. Source Code Control and Branching 101 The very first step […]
In this post I will demonstrate how a neat trick brought to my attention by Niko Neugebauer can turn the processing of windowing functions from “Anti-scale” to scalability. First of all we need to create […]
In the world of continuous integration and delivery, where we might want to perform numerous builds a day, Docker is ideally suited for spinning up environments and then tearing them down afterwards in use cases […]
I was fortunate enough to be selected to speak at SQL Saturday Dublin. The talk I gave was on leveraging the in-memory engine; the basic flow of the presentation is thus: I ask the audience […]
In this blog post I wanted to distill down the most fundamental points to consider when attempting to process a SQL Server workload in a scalable manner. However, many of the principles I will outline […]
This post continues my work on the LMAX Disruptor pattern. I have covered the pattern itself already; what I have not covered is: spinlock profiling; wait statistic profiling; how the in-memory engine now behaves with SQL […]
The aim of this blog post is twofold; it is to explain how: A “Self building pipeline” for the deployment of a SQL Server Data Tools project can be implemented using open source tools A build pipeline […]
Every so often a post appears on LinkedIn that serves as an oasis of information in a desert of spam. The article I speak of is “Non-Volatile Storage: Implications of The Data Center’s Shifting Center”. It’s a god […]
A question I received following my pre-conference training day at SQL Bits and during my post-conference training day in Poland was how my material relates to SQL Server running in a virtualized environment. As there are figures […]
A lot of the work I have done over the last year has involved placing stress on the database engine via singleton inserts using a stored procedure that inserts rows in a loop under the […]
In my Saturday community day session at SQL Bits I mentioned an article by Linchi Shea from the sqlblog site in which he demonstrates that the overhead of 100% foreign memory access does not incur the […]
I will be speaking at Join! Conference in Poland. On the Tuesday (May 10th) I will be presenting a regular session on leveraging memory in SQL Server and on the Wednesday I will be delivering […]
This is probably one of the most unheralded new features of SQL Server 2016, which gets but a single bullet point in the CSS engineers’ “It Just Runs Faster” blog post series. Querying sysprocesses and […]
The Story So Far The graph below represents the throughput we managed to get (from a warm buffer cache) from the legacy database engine and the help of an in-memory table as a scalable […]
It is with great honor that I have the privilege of announcing that my pre-conference submission for SQL Bits XV has been accepted! I will be putting on a day’s worth of training in […]
One of the many things I hope to get around to blogging about, time permitting, are the challenges of building “Web scale” platforms using SQL Server. Of the many challenges this presents, one is coming […]
Following some feedback from my last blog post: … what is inhibiting scalability? As the degree of parallelism is increased, CPU saturation should be achievable unless some contended resource is being waited […]
There is a type of behavior in the database engine which undermines scalability: this is when multiple threads contend for a single resource. Contention on the page free space bit map is the example that […]
kubectl is the de facto command line tool for administering Kubernetes clusters. Connecting to a cluster via kubectl requires a Kubernetes config file, which in turn contains one or more contexts. A context is simply a […]