Azure Storage

Purpose

This page covers notes, labs, personal setup items, deviations, etc. from the AZ-104: Implement and manage storage in Azure learn module.

General Takeaways

  • Three types of storage
    1. Virtual machine data - block-level storage for VM disks, with managed file shares for file data
    2. Unstructured data - Azure Blob (Binary Large Object) and Azure Data Lake are object storage that doesn't adhere to architecture rules like key:value, tables, rows, etc. you may see in a DB schema. Just toss the object as bytes onto the storage medium and call it a day.
    3. Structured data - for data that adheres to some sort of strict structure. NoSQL table storage, Azure Cosmos DB, Azure SQL (as a service), etc.
    • I really wonder what storage appliances they're using on the back end - the IOPS and regional replication must be insane
  • The devil is in the details
    • e.g. Mounting blob storage via NFS works directly in Linux but not Windows.
    • Shifting between tiers
    • ACLs, service endpoints, paired replication regions… all sorts of things have little gotchas
  • Use resume capable clients when transmitting large chunks of data
    • e.g. Storage Explorer, Robocopy, SFTP clients, etc. and not just Copy + Paste via File Explorer and SMB shares.
  • The Developer Tools tab on a storage account has links to the documentation on accessing and using storage accounts via various tools (Azure CLI, PowerShell, Python, etc.)
  • Pretty interesting dev tool called Azurite that allows for emulating the Azure Storage APIs locally. Once dev is complete, shift the work/project to the cloud service
  • Their learn module is downright terrible with the introduction to SAS and the lack of accompanying stored access policy (SAP) coverage

Storage Notes

  1. Storage accounts CANNOT be transitioned from one type to another.
    1. Standard = HDD
    2. Premium = SSD with various levels of managed (throttled) IOPS
  2. Get the LRS, ZRS, GRS, etc. correct from the beginning when setting up things like storage accounts and Recovery Services vaults. RSVs default to GRS and cannot be shifted down to LRS without jumping through hoops or abandoning the data.
    1. The migration process could impact system uptime or be impossible depending on the deployment particulars.

Storage Account Types

The table here has a good overview with links to more detailed information.

Takeaways:

  1. You don’t just get what you pay for, you pay for what you get
    • Hot/Cold status, replication, transit, etc. ALL cost different types of money.
    • Enable Azure Monitor to get a good baseline and then evaluate where you’re over-spending
    • Right-size the provisioning, move cold data to cold storage, and make sure it really is cold - access charges on the colder tiers can add up to more than just keeping the data in the appropriate hot storage.
  2. General purpose is exactly that - general purpose - and should be the starting point for scenarios with a lot of unknowns
    • Transition to other storage account types if/when GP isn’t working out
  3. Premium block blob can be less expensive than general purpose depending on the number of transactions per second

Replication

The table below summarizes which replication options maintain data availability during a given event.

Event                          Replication options that keep data available
Data center down               ZRS, GRS, RA-GRS, GZRS, RA-GZRS
Region down                    GRS, RA-GRS, GZRS, RA-GZRS
Region down - read access      RA-GRS, RA-GZRS
  • LRS is cheap, disposable and only lives in a single datacenter
    • Back up to a storage account in a different region
  • ZRS replicates synchronously across three availability zones within a region
    • Basically three different data centers in a given region
    • Back up to a storage account in a different region
  • GRS - replicates to a paired secondary region for 16 nines of durability
    • Utilizes LRS for local redundancy, then ships it out to the second region (it skips the zone redundancy)
    • RA-GRS (Read Access GRS) allows you to read the replicated GRS data virtually in real time
      • It’s accessible from the second region even if an event hasn’t triggered a failover to the GRS paired region
  • GZRS - Geo-zone redundant storage
    • GRS, but it replicates to the primary region's zones via ZRS and then ships to the secondary region
    • RA-GZRS is also a thing (the template-side sku names for all of these options are sketched below)
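
All of the options above boil down to a single property in a template - the storage account sku name. A minimal Bicep sketch (the account name is made up for illustration):

// Redundancy is picked via the sku: Standard_LRS, Standard_ZRS, Standard_GRS,
// Standard_RAGRS, Standard_GZRS, Standard_RAGZRS (plus Premium_LRS / Premium_ZRS)
resource geoStorage 'Microsoft.Storage/storageAccounts@2023-01-01' = {
  name: 'stggeo${uniqueString(resourceGroup().id)}' // hypothetical account name
  location: resourceGroup().location
  kind: 'StorageV2'
  sku: {
    name: 'Standard_RAGZRS' // read-access geo-zone-redundant
  }
}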

Access

Every object in storage has a URL, with each service mapped to a specific subdomain:

  • Container - mystorageaccount.blob.core.windows.net
  • Table - mystorageaccount.table.core.windows.net
  • Queue - mystorageaccount.queue.core.windows.net
  • Files - mystorageaccount.file.core.windows.net

You can map custom domains to the access URLs above via a regular CNAME record or using an intermediary domain mapping
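
On the Bicep side this is the customDomain property on the storage account. A hedged sketch, with blobs.contoso.com standing in as a made-up domain:

resource storageWithDomain 'Microsoft.Storage/storageAccounts@2023-01-01' = {
  name: 'stgdomain${uniqueString(resourceGroup().id)}' // hypothetical account name
  location: resourceGroup().location
  kind: 'StorageV2'
  sku: {
    name: 'Standard_LRS'
  }
  properties: {
    customDomain: {
      name: 'blobs.contoso.com' // the domain the CNAME record maps to the account
      useSubDomainName: true    // true = validate via the intermediary (asverify) CNAME method
    }
  }
}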

Network Access

There is more to this than what they cover in the learn module, which boils down to the following:

  • Use a private endpoint to secure traffic from Azure networks to the storage account (a Bicep sketch follows this list)
  • Disable the wide-open public IP access and restrict it to known public IPs (or just do everything over an Azure VPN gateway or something similar)
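
A hedged Bicep sketch of the private endpoint piece, assuming an existing VNet, subnet, and storage account with the made-up names below. A private DNS zone for privatelink.blob.core.windows.net would normally go along with this:

resource vnet 'Microsoft.Network/virtualNetworks@2022-07-01' existing = {
  name: 'vnet-prod' // hypothetical existing VNet
}

resource storageSubnet 'Microsoft.Network/virtualNetworks/subnets@2022-07-01' existing = {
  parent: vnet
  name: 'snet-storage' // hypothetical existing subnet
}

resource stg 'Microsoft.Storage/storageAccounts@2023-01-01' existing = {
  name: 'stgexample123' // hypothetical existing storage account
}

resource blobPrivateEndpoint 'Microsoft.Network/privateEndpoints@2022-07-01' = {
  name: 'pe-${stg.name}-blob'
  location: resourceGroup().location
  properties: {
    subnet: {
      id: storageSubnet.id
    }
    privateLinkServiceConnections: [
      {
        name: 'blob-connection'
        properties: {
          privateLinkServiceId: stg.id
          groupIds: [
            'blob' // one endpoint per sub-resource: blob, file, table, queue, etc.
          ]
        }
      }
    ]
  }
}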

However, there’s an entire article with a best practice checklist. The article also has links to other great resources around securing blob storage access from the network level, access from trusted resources, service endpoints, etc.

There’s also an entire AZ-700 path for networking that looks pretty neat.

Blob Storage Notes

  • Access tiers are divided into Hot, Cool, Cold and Archive
    • Basically a cost gradient: storage cost drops going from Hot down to Archive, while access costs run the other way (Hot is cheap to access, Cold/Archive get expensive if accessed too often)
  • Minimum storage duration applies for:
    • Cool: 30 days
    • Cold: 90 days
    • Archive: 180 days

Lifecycle Management

This stuff is incredibly slick and something I need to explore with Bicep deployments. I've written PowerShell scripts that essentially perform lifecycle management on database data that was manually zipped up and moved off site. With blob lifecycle management, all of that can be skipped and everything memorialized in Bicep templates. The same idea applies to decommissioned VMs that are mothballed after X days, migration data, older project data, etc.

Offsite backups land in the Hot tier (lower transaction costs), then step through Cool, Cold, and Archive as appropriate. There may even be use cases where data queries to a table or other service type could be cached for X amount of time (hours/minutes) and then deleted automatically.

The simplest way to access Lifecycle Management is to go to the storage account and then type “life” into the top search box on the left.
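
A hedged sketch of what the Bicep side could look like - the account name, rule name, prefix, and day thresholds below are all made up for illustration:

resource stgForLifecycle 'Microsoft.Storage/storageAccounts@2023-01-01' existing = {
  name: 'stgexample123' // hypothetical existing account
}

resource lifecyclePolicy 'Microsoft.Storage/storageAccounts/managementPolicies@2023-01-01' = {
  parent: stgForLifecycle
  name: 'default' // the management policy name is always 'default'
  properties: {
    policy: {
      rules: [
        {
          enabled: true
          name: 'age-out-offsite-backups' // hypothetical rule name
          type: 'Lifecycle'
          definition: {
            filters: {
              blobTypes: [
                'blockBlob'
              ]
              prefixMatch: [
                'backups/' // hypothetical container/prefix
              ]
            }
            actions: {
              baseBlob: {
                tierToCool: {
                  daysAfterModificationGreaterThan: 30
                }
                tierToArchive: {
                  daysAfterModificationGreaterThan: 90
                }
                delete: {
                  daysAfterModificationGreaterThan: 365
                }
              }
            }
          }
        }
      ]
    }
  }
}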

Object Replication

My mind immediately goes to orchestrating some sort of Frankenstein CDN vs going with the built-in Azure service. I wonder if there are DB systems, software, etc. where the data could be replicated, compute performed, and then the data recompiled into the final, useful item. Sounds like container (K8s, Docker, etc.) orchestration. Ideally it's data that needs replication anyway, otherwise the transaction and transmit costs will probably exceed the compute savings.

The scalability and performance targets look like a good starting point if/when digging into storage items.
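
For reference, a heavily hedged Bicep sketch of an object replication policy deployed to the destination account (both accounts need versioning and change feed enabled, the policy goes on the destination first, and the generated policy/rule IDs are then reused on the source; every name below is hypothetical and both accounts are assumed to be in the same resource group):

resource orSource 'Microsoft.Storage/storageAccounts@2023-01-01' existing = {
  name: 'stgsource123' // hypothetical source account
}

resource orDestination 'Microsoft.Storage/storageAccounts@2023-01-01' existing = {
  name: 'stgreplica123' // hypothetical destination account
}

resource orPolicy 'Microsoft.Storage/storageAccounts/objectReplicationPolicies@2023-01-01' = {
  parent: orDestination
  name: 'default' // 'default' on the destination; the service assigns the real policy ID
  properties: {
    sourceAccount: orSource.id
    destinationAccount: orDestination.id
    rules: [
      {
        sourceContainer: 'projects' // hypothetical containers
        destinationContainer: 'projects-replica'
        filters: {
          prefixMatch: [
            'migration/'
          ]
        }
      }
    ]
  }
}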

Blob Types

  • Block blobs are the typical items, text, binary, images, etc.
  • Append blobs allow data to be tacked onto the end of an existing blob. e.g. log files
    • (REST API record coordination between services? Query A, write ‘adata’ - Query B, append ‘adata.bdata’ - Submit ‘adata.bdata’ to destination C)
  • Page blobs are what VMs use for OS disks and data disks
    • Sideloaded .vhd files must be uploaded as page blobs.
  • You cannot change the blob type after it is created
  • Versioning can be enabled from the Data protection blade (a Bicep sketch follows this list)
    • Backup and versioning are cool, but watch out for cost increases
    • Compliance items around retaining (or deleting) certain kinds of data?
  • Only billed for data that leaves the region
    • THIS could be a great reason for replication
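
A minimal Bicep sketch of those Data protection settings applied to the default blob service of an assumed existing account (the retention windows are made up):

resource stgForVersioning 'Microsoft.Storage/storageAccounts@2023-01-01' existing = {
  name: 'stgexample123' // hypothetical existing account
}

resource blobServiceSettings 'Microsoft.Storage/storageAccounts/blobServices@2023-01-01' = {
  parent: stgForVersioning
  name: 'default'
  properties: {
    isVersioningEnabled: true // blob versioning
    changeFeed: {
      enabled: true // change feed - also a prerequisite for object replication
    }
    deleteRetentionPolicy: {
      enabled: true
      days: 14 // soft delete for blobs
    }
    containerDeleteRetentionPolicy: {
      enabled: true
      days: 14 // soft delete for containers
    }
  }
}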

Storage Lab 01

Another Point-And-Click adventure that just shows you around the Azure portal - which means it’s a great opportunity to brush up on the Bicep authoring!

Tasks 1 and 2.a are completed via the Bicep deployment, 2.b was completed via Azure Storage Explorer and task 3 is completed from the portal.

Task 1: Create a storage account.

  • Create a storage account in your region with locally redundant storage.
  • Verify the storage account was created.

Task 2: Work with blob storage.

  • Create a private blob container.
  • Upload a file to the container.

Task 3: Monitor the storage container.

  • Review common storage problems and troubleshooting guides.
  • Review insights for performance, availability, and capacity.

Can I just say I love the parent property when naming child items? I don’t know if I completely missed it a couple of years ago, or if it’s a new addition, but this is an incredible step up from the tedious concat syntax from the ARM templates and joining a named variable with another variable (or Bicep var variables with similar syntax). Simply using the parent property is awesome!

param location string = resourceGroup().location
param storagePrefix string = 'stg'
param publicIP string

// toLower + uniqueString keeps the name valid (3-24 chars, lowercase letters and numbers only)
var storageAcctName = '${toLower(storagePrefix)}${uniqueString(resourceGroup().id)}'

param tags object = {
  Environment: 'Test'
  Method: 'Bicep'
}

resource storageaccount 'Microsoft.Storage/storageAccounts@2021-02-01' = {
  name: storageAcctName
  location: location
  kind: 'StorageV2'
  properties: {
    accessTier: 'Hot'
    minimumTlsVersion: 'TLS1_2'
    allowBlobPublicAccess: false
    allowSharedKeyAccess: true
    // Deny by default, then allow only the deployer's public IP through
    networkAcls: {
      defaultAction: 'Deny'
      ipRules: [
        {
          action: 'Allow'
          value: publicIP
        }
      ]
    }
  }
  sku: {
    name: 'Standard_LRS'
  }
  tags: tags
}

// The 'parent' property handles the child resource naming - no concat() or '/' gymnastics needed
resource blobService 'Microsoft.Storage/storageAccounts/blobServices@2023-01-01' = {
  parent: storageaccount
  name: 'default'
  properties: {}
}


resource blobContainer 'Microsoft.Storage/storageAccounts/blobServices/containers@2023-01-01' = {
  parent: blobService
  name: 'test01'
}

Storage Security

As a preface to everything below: there is an additional unit later in the Learn module that expands on SAS usage. However, all of the items I pointed out are still valid, as that unit works through a slew of content using ad-hoc SAS and only AFTER training users on how to do it the wrong way decides to show the learner the correct way. It does do a much better job in later units of expanding on use cases, programmatically creating SAS, how to reduce risk, etc.

Unfortunately, none of the .NET/C# sandbox exercises actually worked, which was a bit of a letdown; however, it doesn't feel like Azure admins are actually the target audience for those particular units. For thoroughness, documentation on obtaining a SAS via .NET is found here

  • Shared Key: creates a secure, encrypted signature string that is passed in the authorization header
    • Full access without fine-grained controls
    • Valid until manually revoked
      • Rotate the thing at some point because chances are it will eventually get out
    • Only use in fully trusted (or disposable) setups/environments/etc.
  • Shared access signatures: time-based access to provide specific permissions to specific people/devices/apps against specific storage objects
  • Anonymous read access for public items
  • Entra ID for RBAC (see the account-level sketch after this list)
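
The Shared Key and anonymous options can both be shut off at the account level in a template (the lab Bicep further up sets some of the same properties). A minimal sketch with a made-up account name:

resource lockedDownStg 'Microsoft.Storage/storageAccounts@2023-01-01' = {
  name: 'stglock${uniqueString(resourceGroup().id)}' // hypothetical account name
  location: resourceGroup().location
  kind: 'StorageV2'
  sku: {
    name: 'Standard_LRS'
  }
  properties: {
    allowBlobPublicAccess: false // no anonymous read access on containers/blobs
    allowSharedKeyAccess: false  // account keys (and key-signed SAS) stop working - Entra ID/RBAC only
    minimumTlsVersion: 'TLS1_2'
  }
}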

Shared Access Signatures

  • Potentially hierarchical access, so watch nested items
  • Account or Service level
    • Account: 1:n storage services
    • Service: 1:1 storage service
  • Control via IP and/or protocols

The URI structure is interesting, but I'm not sure how useful it really is to have the structure memorized. A lot of it is pretty self-explanatory, and the rest can be inferred. The connection string created when generating a SAS includes all service endpoints plus the security info. Some client libraries and other tools can parse this automatically as-is; other processes may need the individual connection URIs in order to function correctly.

A stored access policy is referenced a grand total of ONE time for mitigating compromised SAS. This isn’t covered at all during the module outside of a “Note” block on how SAP can help narrow down the SAS access scope. This is borderline malicious as a SAS + SAP scoped correctly is the only way to correctly apply and use SAS.

  1. Documentation going over SAS
  2. This is not like other cert based services (e.g. OpenVPN with cert based authentication) where you can just revoke client certificates to deny access
  3. The learn module creates an ad-hoc SAS which is a shared access signature without any sort of policy. It’s a one-and-done item that cannot be managed afterwards.
    • Basically these shouldn’t be used outside of very specific use cases as the only way to “revoke” is to regenerate the signing keys or delete the service the SAS is scoped to.
      • They can be scoped to the entire storage account, or the table, queue, etc. services
      • Time limits, read/write/delete permissions, etc. should be overly restrictive vs overly permissible
    • Both resolutions for a compromised SAS require jumping through a bunch of technical hoops, workarounds, and/or downtime which is unacceptable in a 24/7 operation.
  4. Service SAS with a stored access policy is the highly recommended way to go in a production environment
    • An access policy is assigned to a service
    • The SAS is created and scoped to the access policy on creation
    • If/when the SAS token is compromised, the access policy can be removed, modified, etc. to restrict or block access to the service

Storage Security Lab

Another point-and-click adventure which is great - more Bicep practice! I’ve tucked this one over in the Bicep page with the main deployment takeaways there.

  1. Mapping file shares via the PowerShell script/portal is pretty damn risky - it's really for admin access only as you're exposing the storage account key. If that gets out, it's full access into the storage account.
    1. It CAN be used - the question is whether or not it’s appropriate TO use
    2. Supplement access via IP rules and whatnot if mapping is truly necessary
    3. Authentication based solely on Entra ID is not supported (at this time)
      • “Microsoft Entra ID is not a domain controller, only a directory service. User accounts solely based in Microsoft Entra ID are currently not supported.”

Files And File Sync

File shares over SMB and NFS (restrictions apply - covered below) hosted in Azure. The description on the tin sounds pretty neat, but I think there's a deep uncanny valley on who this is useful to. (A Bicep provisioning sketch follows the notes below.)

There’s a lot of information not covered in the Learn module, more info here and some great enterprise level architecture information here.

  • Simultaneous access from many VMs
  • They can be mounted by VMs or applications
    • ehhh… key based? Careful with this.
  • Snapshots, backups, etc. are all automated and pretty slick
    • Only changed files (CBT on the back end?) are tracked across changes
    • Like other backup items, the backup, snapshot, etc. data must be deleted prior to deleting the protected share
  • Traffic on port 445 (SMB) is required to/from client networks
    • Some home ISPs will block this traffic - even if it is outbound
    • NSGs may need to be adjusted as well unless using the Microsoft.Storage service endpoint
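
Provisioning the share itself is straightforward in Bicep. A hedged sketch with a made-up share name and quota; enabledProtocols flips between SMB and NFS, with NFS requiring a premium FileStorage account:

resource stgForFiles 'Microsoft.Storage/storageAccounts@2023-01-01' existing = {
  name: 'stgexample123' // hypothetical existing account
}

resource fileService 'Microsoft.Storage/storageAccounts/fileServices@2023-01-01' = {
  parent: stgForFiles
  name: 'default'
  properties: {}
}

resource teamShare 'Microsoft.Storage/storageAccounts/fileServices/shares@2023-01-01' = {
  parent: fileService
  name: 'team-share' // hypothetical share name
  properties: {
    shareQuota: 100 // quota in GiB
    enabledProtocols: 'SMB' // 'NFS' requires a premium FileStorage account
  }
}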

File Sync

  • Storage Sync Service orchestrates the whole swarm of storage accounts, sync groups, sync agents, etc. (a Bicep sketch of the hierarchy follows this list)
  • Sync groups define the topology for a set of files
    • A group of objects syncing the files
    • A sync service can have as many groups as needed
  • File Sync Agent
    • Runs on Windows Server to facilitate the north/south traffic
      • East/west too? Curious if they p2p the traffic between each other
    • FileSyncSvc.exe monitors for local changes and also initiates sync sessions to Azure
    • StorageSync.sys is a system filter that supports the cloud tiering
      • more on that
  • Server endpoints are the synced file locations on the server, e.g. E:\sync1, E:\dir2, etc.
  • Cloud endpoint is an Azure file share that’s included in the sync group.
    • The entire cloud endpoint (file share) is synced
    • File share can only be a member of one cloud endpoint
    • File share can only be a member of one sync group
      • Pretty firm boundary
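
A hedged Bicep sketch of how that hierarchy nests (sync service > sync group > cloud endpoint). Server endpoints aren't really template-able since they depend on servers registered by the agent, and every name below is made up:

resource syncService 'Microsoft.StorageSync/storageSyncServices@2020-09-01' = {
  name: 'sync-svc-demo' // hypothetical sync service
  location: resourceGroup().location
  properties: {}
}

resource syncGroup 'Microsoft.StorageSync/storageSyncServices/syncGroups@2020-09-01' = {
  parent: syncService
  name: 'file-share-sync' // hypothetical sync group
  properties: {}
}

resource stgForSync 'Microsoft.Storage/storageAccounts@2023-01-01' existing = {
  name: 'stgexample123' // hypothetical account holding the Azure file share
}

resource cloudEndpoint 'Microsoft.StorageSync/storageSyncServices/syncGroups/cloudEndpoints@2020-09-01' = {
  parent: syncGroup
  name: 'cloud-endpoint-01' // hypothetical name
  properties: {
    storageAccountResourceId: stgForSync.id
    azureFileShareName: 'team-share' // hypothetical share name
    storageAccountTenantId: subscription().tenantId
  }
}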

The whole file sync stuff seems pretty cool (almost like the OneDrive sync processes on steroids), but I'm curious whether it's useful to cloud-first organizations. It appears to be 100% geared towards larger organizations, or at least ones with multiple locations that need data synced between geographical sites. For cloud-only orgs it feels like SharePoint, OneDrive, etc. would be the go-to daily drivers for data, with archival data offloaded into file shares, containers, etc.

Azure Storage Tools

Azure Storage Explorer was by far the most overlooked tool when I started working with Azure. All of the information I found around moving data, importing/exporting VHDs, etc. was shell based, which really wasn't a good fit for the scenarios I dealt with. The session management/keep-alives for large file transfers are an absolute godsend. No need to change session timeout policies while you wait for a 120GB disk to move from one region to another - just copy and paste in Storage Explorer!

  • Storage Explorer allows management of storage objects, object manipulations, attaching to 3rd party storage via SAS, etc.
  • The Azure Import/Export service is for prepping local data to disks that are shipped to an Azure DC where the data is imported into your storage accounts
  • AzCopy is the CLI behind the scenes copying data everywhere. Storage Explorer will show you the AzCopy commands used to accomplish the steps you’ve taken

For deeper dives: