I was originally going to cover storage in its entirety in a single blog post. However, as storage and Kubernetes are the cause of a tremendous amount of confusion in the Microsoft data platform community, I have decided to provide a thorough grounding in Kubernetes storage concepts in a dedicated blog post.
The traditional mindset around storage is rooted in the way physical and virtual machines consume storage. The logical steps for consuming storage are:
- Carve out a LUN / volume on your storage device,
- Attach the LUN / volume to a physical host; in the case of VMware this would be the physical ESXi server,
- For VMware, make the storage available to the virtual machine via a datastore, raw device mapping or virtual volume. For Hyper-V, virtual disks are created on an existing Windows volume or created as pass-through disks,
- Scan the SCSI bus on the physical machine or Windows guest,
- The disk(s) appear as offline and without a volume in Disk Management,
- and so on.
Note that this is a bottom-up process, working from the storage upwards.
The Kubernetes Storage “Layer Cake”
The Microsoft documentation for big data clusters mentions persistent volumes; however, persistent volumes form only one layer of the cake. The full cake comprises three layers, as detailed below.
Persistent Volumes
A persistent volume is similar to a disk that appears online in Disk Management. Use this kubectl command to view the persistent volumes associated with your cluster:
kubectl --namespace=[big data cluster name] get pv
These are the persistent volumes associated with the big data cluster ca-sqlbdc which will be used for example purposes in this post:
Note the various attributes associated with each persistent volume:
- Access mode – RWO
The volume can be mounted as read / write by a single node,
- Reclaim policy – Delete
Persistent volumes are deleted when their associated persistent volume claims are deleted,
- Status – Bound
A persistent volume claim is associated with the persistent volume and a pod can consume storage from it,
- Claim
The persistent volume claim associated with the persistent volume,
- Storage Class
A label for an underlying storage platform. Storage classes generally map to two different types of storage platform: block or file / object / unstructured data. Each storage class usually has a provisioner associated with it; simply put, a provisioner determines the storage plugin to be used for provisioning persistent volumes. An exception to this is NFS, which does not require a plugin.
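To tie these attributes together, below is a minimal sketch of what a persistent volume manifest might look like for an NFS-backed volume; the names, server address, path and size are purely illustrative and are not taken from the ca-sqlbdc cluster above:

apiVersion: v1
kind: PersistentVolume
metadata:
  name: example-pv                       # illustrative name
spec:
  capacity:
    storage: 10Gi                        # illustrative size
  accessModes:
    - ReadWriteOnce                      # RWO, as per the access mode above
  persistentVolumeReclaimPolicy: Delete  # reclaim policy, as per the list above
  storageClassName: example-sc           # illustrative storage class name
  nfs:                                   # NFS chosen purely for illustration
    server: 192.168.1.100
    path: /exports/example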
Storage classes are the ‘Glue’ by which Kubernetes talks to the storage platform(s) it consumes storage from. As before, kubectl is your friend when it comes to digging into storage classes:
kubectl get sc
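As an illustration, a storage class manifest might look something like the sketch below; this assumes the in-tree vSphere provisioner, and the name and parameters are purely illustrative:

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: example-sc                           # illustrative name, referenced by volumes and claims
provisioner: kubernetes.io/vsphere-volume    # the plugin used to provision persistent volumes
parameters:
  diskformat: thin                           # provisioner-specific parameter
reclaimPolicy: Delete                        # applied to dynamically provisioned volumes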
Persistent Volume Claims
Persistent volume claims are required in order to consume storage from a persistent volume. As before, kubectl is your friend for viewing these, as per the command below:
kubectl --namespace=[big data cluster name] get pvc
Think of a persistent volume claim as being similar to creating a Windows volume on a disk.
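A minimal sketch of a persistent volume claim manifest is shown below; the claim name, storage class and size are purely illustrative:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: example-pvc              # illustrative name
  namespace: ca-sqlbdc           # the big data cluster namespace used in this post
spec:
  accessModes:
    - ReadWriteOnce              # request a volume that a single node can mount read / write
  storageClassName: example-sc   # the storage class to provision the volume from
  resources:
    requests:
      storage: 10Gi              # illustrative size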
Volumes
Volumes are the touch point for storage at pod level. To view the volumes associated with a pod, issue a describe command against the pod of interest:
kubectl --namespace=[big data cluster name] describe pod [name of pod]
This is the volumes section for a describe pod command issued against the mssql-storage-pool-default-0 pod:
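For reference, volumes and volume mounts are declared in the pod specification along the lines of the sketch below; the pod name, image, claim name and mount path are purely illustrative and are not taken from the describe output above:

apiVersion: v1
kind: Pod
metadata:
  name: example-pod                  # illustrative name
spec:
  containers:
    - name: example-container
      image: example/image:latest    # illustrative image
      volumeMounts:
        - name: data                 # must match a volume name declared below
          mountPath: /var/opt/data   # illustrative path inside the container
  volumes:
    - name: data
      persistentVolumeClaim:
        claimName: example-pvc       # the persistent volume claim to consume storage from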
Local Storage
A question that often crops up is "Can I use local storage?", and the answer is "It depends". At its most basic and fundamental level, Kubernetes is essentially a container scheduler. The pod is the unit of scheduling; containers in the same pod share the same life cycle and always run on the same node. For stateless pods life is reasonably simple and straightforward; for stateful pods, life is a bit more nuanced. If for any reason a node fails, the pods that ran on that node have to be rescheduled to run on a working node, and their storage needs to follow them. This involves unmounting the volume from the failed node and then mounting it on the node the pod(s) are rescheduled to run on. With basic vanilla hyper-converged storage, i.e. storage and compute in the same chassis, this will ultimately lead to scheduling (and potentially data loss) problems.
Kubernetes 1.14 promotes "local persistent volumes", previously a beta feature, to general availability; however, the following disclaimer from the Kubernetes documentation on this subject should be noted:
Before going into details about how to use Local Persistent Volumes, note that local volumes are not suitable for most applications. Using local storage ties your application to that specific node, making your application harder to schedule. If that node or local volume encounters a failure and becomes inaccessible, then that pod also becomes inaccessible. In addition, many cloud providers do not provide extensive data durability guarantees for local storage, so you could lose all your data in certain scenarios.
For those reasons, most applications should continue to use highly available, remotely accessible, durable storage.
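For completeness, a local persistent volume looks something like the sketch below; note how the node affinity section ties the volume, and therefore any pod that uses it, to a single named node (the hostname, path and size are purely illustrative):

apiVersion: v1
kind: PersistentVolume
metadata:
  name: example-local-pv             # illustrative name
spec:
  capacity:
    storage: 100Gi
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  storageClassName: local-storage    # illustrative storage class
  local:
    path: /mnt/disks/ssd1            # disk or mount point on the node itself
  nodeAffinity:                      # pins the volume to one specific node
    required:
      nodeSelectorTerms:
        - matchExpressions:
            - key: kubernetes.io/hostname
              operator: In
              values:
                - worker-node-1      # illustrative node name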
An option to address this is to use a software defined solution to turn a hyper-converged infrastructure into a Kubernetes friendly storage cluster. The most basic requirement for any software defined solution that turns hyper-converged infrastructure into Kubernetes friendly storage is that a pod should be able to see the same persistent volume contents on at least two different servers (or nodes). Some people automatically associate HDFS with local storage; the reason for this is probably that, "Back in the day", the most cost-efficient way for Google (whose GFS paper inspired HDFS) to scale out its infrastructure was via commodity servers with local disks.
General Storage Considerations
The Storage Pool
The storage pool uses HDFS, which achieves resilience by keeping multiple replicas of each data block; by default each block has two replicas. The upshot of this is that 1 PB of data would consume 2 PB of effective capacity. A storage platform that performs data de-duplication might be helpful in this circumstance. HDFS is old school in the way it uses replicas; erasure coding is the more modern way of making storage platforms resilient.
Backup and Restore
For large volumes of data, the only practical means of backing up the persistent volumes is via snapshots. At the time of writing there is no facility to create a big data cluster using persistent volumes that already exist; however, it is highly likely that this will change between now and when the platform becomes generally available.
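At the time of writing, Kubernetes also has an alpha volume snapshot API for CSI-based storage plugins. A sketch of a snapshot request is shown below; bear in mind that many storage vendors expose snapshots through their own tooling instead, the alpha schema is subject to change, and the snapshot class and claim names are purely illustrative:

apiVersion: snapshot.storage.k8s.io/v1alpha1
kind: VolumeSnapshot
metadata:
  name: example-snapshot                      # illustrative name
  namespace: ca-sqlbdc                        # the big data cluster namespace used in this post
spec:
  snapshotClassName: example-snapshot-class   # provided by the storage plugin / vendor
  source:
    name: example-pvc                         # the persistent volume claim to snapshot
    kind: PersistentVolumeClaim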
Kubernetes Upgrades
At the time of writing, Kubernetes upgrades are incremental from one minor version to the next, i.e. if you wanted to go from version 1.10 to 1.12, you would have to upgrade your cluster from 1.10 to 1.11 and then from 1.11 to 1.12. With a storage platform that facilitates persistent volume snapshots, a new 1.12 cluster can be created and stood up against persistent volumes created from snapshots of those belonging to the 1.10 cluster. Again, the ability to leverage this depends on Microsoft shipping a future version of big data clusters that allows clusters to be created against persistent volumes that already exist.
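As an illustration, for a cluster built with kubeadm the incremental upgrade path looks something like the commands below, run against each control plane node in turn, where x is the desired patch release (the kubelet and kubectl upgrades on each node are omitted for brevity):

kubeadm upgrade plan
kubeadm upgrade apply v1.11.x
kubeadm upgrade plan
kubeadm upgrade apply v1.12.x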
Storage Node Failure Rebuild Times and Blast Radii
For storage platforms composed of multiple storage nodes that form an aggregated storage pool, test how long a node rebuild takes should one fail. Also consider the size of a node rebuild operation relative to the amount of data stored on each node.
Summary
Whatever storage solution you end up using when running a big data cluster on premises, persistent volumes, persistent volume claims and volumes should all behave in the same manner. The minimum requirement is that storage "Can follow" pods between at least two different nodes in the cluster; SQL Server 2019 big data clusters and vanilla Kubernetes cannot provide this ability alone. Most storage vendors should be able to provide a Kubernetes storage plugin, and there are a slew of vendors that provide software defined solutions for composing Kubernetes friendly storage.
Coming Up In Part 5
It was my intention to cover standing up a big data cluster with persistent storage in the next post in this series. However, because the topic of on-premises Kubernetes and persistent storage continues to be a source of great confusion, part 5 will cover the available options for persistent storage when running Kubernetes on-premises for big data clusters.