This series of posts is essentially my presentation from the recent Data Weekender popup virtual conference – in blog form with some bonus additional content. This post focuses on the things you should consider before building out your infrastructure for your Big Data Cluster.
Kubernetes Cluster Requirements
For test, development and general learning purposes, a single-worker-node Kubernetes cluster will suffice. For production purposes, Microsoft recommends a minimum of three worker nodes, each with 64 GB of memory, eight logical processors and 100 GB of storage for container images:
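By way of illustration, these minimums can be expressed at cluster provisioning time. Here is a hedged sketch using the Azure CLI – the resource group name, cluster name and VM size are assumptions; pick any size with at least eight vCPUs and 64 GB of memory:

```sh
# Sketch: provision a three-worker-node AKS cluster sized to the
# recommended production minimums (names and VM size are assumptions).
az aks create \
  --resource-group bdc-rg \
  --name bdc-aks \
  --node-count 3 \
  --node-vm-size Standard_E8s_v3 \
  --node-osdisk-size 100
```

For kubeadm-based clusters on your own hardware, the same sizing applies to the hosts you build the cluster from.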
Using a ship as an analogy, the storage pool is the engine room of a big data cluster:
For production purposes, the master pool also needs special consideration. The default ‘prod’ config (for either kubeadm or aks) will create a master pool with three master instances. It’s therefore probably not a bad idea to go with a minimum of four worker nodes: two for the storage pool and two for the rest of the big data cluster, the idea being that at least two of the master instances reside on different worker nodes.
#Tip 1: Plan for storage pods to reside on dedicated nodes, and master instances to reside on at least two different worker nodes.
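One way of reserving worker nodes for storage pool pods is the standard Kubernetes label-and-taint mechanism. A hedged sketch follows – the node names and the label/taint key are assumptions, and the big data cluster deployment must be configured to apply the matching node selector and toleration:

```sh
# Sketch: dedicate two worker nodes to storage pool pods
# (node names and the bdc-storage key are assumptions).
kubectl label node worker-1 worker-2 bdc-storage=true
kubectl taint node worker-1 worker-2 bdc-storage=true:NoSchedule
```

The taint keeps other workloads off those nodes; the label gives the storage pool pods something to target.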
To Virtualize or Not To Virtualize, That Is The Question
Proposing the idea of using virtual machines as Kubernetes cluster nodes to a Kubernetes purist is likely to be met with consternation; the purist’s view of the world is that Kubernetes is designed to obviate the need for virtualization. However, the different nodes in your cluster have different resource requirements: a master node can get away with as little as 2 GB of memory and two logical processors, whereas worker nodes require far more, and it is a best practice never to run applications on master nodes in production. If you do go down the bare-metal route, it’s unlikely that you are going to purchase blades or servers with 2 GB of memory and two CPU cores, so at the very least consider using virtual machines to host your master nodes.

For organizations that have standardized on a software-defined virtualized infrastructure, Kubernetes will run on it perfectly happily. Virtualization also provides the fastest means of rapidly provisioning environments: simply create a virtual machine template and base your cluster node hosts on it.
#Tip 2: Prefer virtual machines for Kubernetes master node hosts at the very least.
A Word On State
State in your big data cluster and its associated Kubernetes cluster lives in various places. The state associated with the controller, compute and app pools is technically metadata that resides in the master pool instance databases. Storage is also required for etcd – the store in which a Kubernetes cluster keeps its state – as well as for certificates and for the images that an application’s containers are based on.
#Tip 3: Storage consideration needs to be given not just to a big data cluster as an application, but also to Kubernetes – the underlying platform.
A Word On Storage Plugins
Kubernetes talks to storage platforms via what is known as a storage plugin. A big data cluster requires a storage plugin that:
a) Supports persistent volumes
b) Works with an underlying storage platform that supports a storage protocol supported by SQL Server on Linux
There are three types of storage plugin:

- in-tree plugins, which are compiled into the core Kubernetes code base
- out-of-tree FlexVolume plugins
- out-of-tree Container Storage Interface (CSI) plugins
The only type of storage plugin that has any future within the Kubernetes Storage Special Interest Group is one that adheres to the Container Storage Interface (CSI) standard. A list of vendors that provide CSI-compliant plugins can be found here.
#Tip 4: Prefer the use of Container Storage Interface (CSI) storage plugins.
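As an illustration, a storage class backed by a CSI plugin typically looks something like the sketch below – the provisioner name and parameters are assumptions that vary from vendor to vendor:

```yaml
# Sketch: a StorageClass backed by a hypothetical CSI driver.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: bdc-storage
provisioner: csi.example.com   # vendor-specific CSI driver name (assumption)
parameters:
  fsType: ext4
reclaimPolicy: Retain
allowVolumeExpansion: true
```

The big data cluster deployment configuration can then reference this storage class by name for its persistent volume claims.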
The Best Kubernetes Storage Presentation You Will Ever See
The topic of what to do for Kubernetes storage appears to be causing rank confusion in the Microsoft data platform community at large. This presentation is, in my humble opinion, the best you are likely to see on the basics of Kubernetes storage:
#Tip 5: Watch this Kubecon Barcelona 2019 Keynote, Debunking the Myth: Kubernetes Storage is Hard.
HDFS Replication Factors
The default replication factor for HDFS is 3; for storage platforms that have high availability baked in via the likes of RAID and/or erasure coding, this is not required.
#Tip 6: The HDFS replication factor can be set to 1 for storage platforms that use RAID and / or erasure coding, as per the bdc.json excerpt below:
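The following is an illustrative sketch of such an excerpt – the exact setting path can vary between big data cluster releases, so treat it as an assumption to verify against the documentation for your release:

```json
{
  "spec": {
    "services": {
      "hdfs": {
        "settings": {
          "hdfs-site.dfs.replication": "1"
        }
      }
    }
  }
}
```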
Stateful Application Support
There tend to be two different means by which storage platforms allow pods to be rescheduled on different worker nodes in a Kubernetes cluster whilst maintaining their state:

1. Replication-based schemes
2. Centralized storage platforms
Both schemes work, but here comes the . . . however. If you are using a replication-based platform, every single bit and byte of data needs to be written at least twice; “at scale”, this overhead becomes considerable. Consider also what happens with a replication-based platform that does not support RAID or erasure coding. For the sake of argument, take 100 TB of data: with an HDFS replication factor of 3, 300 TB of raw capacity is required to store it. Then take into account a storage platform that makes this highly available by storing two copies of everything, and the raw storage capacity required for 100 TB of data becomes 600 TB.
#Tip 7: For storage pools that are required to store large volumes of data, pay special attention to how your storage platform makes data highly available from a RAW versus usable capacity standpoint.
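The raw-versus-usable arithmetic above can be sketched as follows (the numbers are from the worked example):

```sh
# Raw versus usable capacity for a replication-based storage platform.
data_tb=100          # usable data to be stored
hdfs_replication=3   # HDFS replication factor
platform_copies=2    # copies written by the storage platform itself
raw_tb=$((data_tb * hdfs_replication * platform_copies))
echo "${raw_tb} TB of raw capacity required"   # prints "600 TB of raw capacity required"
```

Dropping the HDFS replication factor to 1, as per Tip 6, brings this back down to 200 TB.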
Tools, Tools, Tools . . .
It is enshrined in DBA lore that development and administration tools should never be installed on the same servers as SQL Server instances. The exact same practice is equally relevant to the world of Kubernetes:
#Tip 8: Plan for a deployment server to host all the tools required to deploy and manage the life cycle of both your big data cluster and Kubernetes cluster.
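By way of a sketch, provisioning such a deployment server typically starts with kubectl and azdata. A hedged example for a Linux host follows – the azdata installation method has changed over time, so verify the current method in the documentation for your release:

```sh
# Sketch: install kubectl (latest stable release) on a deployment server.
curl -LO "https://dl.k8s.io/release/$(curl -L -s https://dl.k8s.io/release/stable.txt)/bin/linux/amd64/kubectl"
sudo install -o root -g root -m 0755 kubectl /usr/local/bin/kubectl

# azdata was historically distributed via pip (assumption: check the
# current install method for your release).
pip3 install -r https://aka.ms/azdata
```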
The next post in this series will move on to deploying a big data cluster.