WAF

Purpose

This page collects notes, labs, personal setup items, deviations, etc. from the Microsoft Azure Well-Architected Framework Learn module.

Cost Optimization

In general, promote widespread ownership and buy-in from team members by getting them invested in, or at least exposed to, the financial side of things. The Great Game of Business and "open-book" style management are one great way of accomplishing this. Basically, be open to feedback, suggestions, etc. on performance as well as design tweaks from any and all parties.

Cost Modeling

  • Define and develop a cost model to:
    • Segment expenses
    • Estimate costs
    • Forecast total cost
  • Include infrastructure costs, team expenditures and revenue
    • Compare different design approaches to shed light on total cost of ownership (TCO)
    • Make sure requirements and nice-to-haves are covered
    • Keep it flexible; these are sliding figures
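The cost model described above can be sketched as a small calculation. The segment names, rates and growth figure below are illustrative assumptions, not real Azure pricing:

```python
# Minimal cost-model sketch: segment expenses, estimate monthly cost,
# and forecast total cost over a planning horizon with simple growth.
# All figures are placeholders, not real pricing.

def estimate_monthly(segments):
    """Sum the estimated monthly cost of each expense segment."""
    return sum(segments.values())

def forecast_total(segments, months, monthly_growth=0.0):
    """Forecast total cost over `months`, compounding a flat growth rate."""
    total = 0.0
    monthly = estimate_monthly(segments)
    for _ in range(months):
        total += monthly
        monthly *= 1 + monthly_growth
    return round(total, 2)

segments = {
    "infrastructure": 4200.0,  # compute, storage, networking
    "team": 9000.0,            # operational staffing share
    "licenses": 300.0,         # nice-to-haves tracked separately
}

print(estimate_monthly(segments))          # 13500.0
print(forecast_total(segments, 12, 0.02))  # 12-month forecast at 2% growth
```

Keeping the growth rate as an input is one way of keeping the model flexible, since these are sliding figures.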

Design Cost

Migration design decisions (rehost, re-write, etc.) are tricky and need to include projections on ROI, solving (or accepting) technical debt, training, etc. Typically, maintaining a like-for-like approach is the most cost-effective option while still reducing the most egregious technical debt.

  • Focus on low-hanging fruit that requires minimal service/process changes but still reduces cost.
    • e.g. Azure SQL Database services vs VM scale sets in multiple regions. (assuming feature parity)
  • Min/Max the ROI potential
  • Take advantage of value
    • It may not be the cheapest option, but the SLA, resiliency, scaling options, etc. may make the spend worth it
  • Add guardrails to keep costs within budget
    • Use policies or application design patterns to prevent unapproved changes
      • Scale Limits
      • Block higher priced SKUs
      • Automate storage cleanup, data tiering, etc.
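Guardrails like these are usually enforced with Azure Policy, but the decision logic can be sketched in plain code. The SKU names and scale limit below are illustrative assumptions, not an official allowlist:

```python
# Guardrail sketch: reject deployments that exceed a scale limit or
# request a SKU outside an approved (budget-friendly) allowlist.
# SKU names and limits here are examples only.

APPROVED_SKUS = {"Standard_B2s", "Standard_D2s_v5", "Standard_D4s_v5"}
MAX_INSTANCES = 10

def validate_deployment(sku, instance_count):
    """Return a list of guardrail violations (empty means approved)."""
    violations = []
    if sku not in APPROVED_SKUS:
        violations.append(f"SKU {sku!r} is not in the approved list")
    if instance_count > MAX_INSTANCES:
        violations.append(f"instance count {instance_count} exceeds {MAX_INSTANCES}")
    return violations

print(validate_deployment("Standard_D2s_v5", 4))   # [] -> approved
print(validate_deployment("Standard_E64s_v5", 20)) # two violations
```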

Usage Optimization

Basically, use what you pay for and constantly evaluate product needs (if it's not used, why pay for it? e.g. Visio).

  • Use consumption based pricing for workloads that aren’t utilized 24/7/365 or that have “spiky” workloads
    • Fixed billing is typically best for steady workloads
  • Analyze high availability designs and determine if active-active or active-only models will optimize costs over active-passive
    • Essentially right-scoping resources
  • Keep environments clean
    • Delete old VMs (they still incur storage costs)
    • Move old data to archive tiers, or remove it
    • Clean up backup sets
    • Update configuration management process to include lifecycle management
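The consumption-versus-fixed choice in the first bullet comes down to a break-even point in monthly utilization. The rates below are made-up numbers for illustration:

```python
# Break-even sketch: consumption billing wins for spiky or idle workloads,
# fixed billing wins once utilization is high enough. Rates are invented.

CONSUMPTION_RATE_PER_HOUR = 0.40   # pay only for hours actually used
FIXED_MONTHLY_FEE = 150.00         # flat fee regardless of usage

def cheaper_plan(hours_used_per_month):
    consumption_cost = hours_used_per_month * CONSUMPTION_RATE_PER_HOUR
    return "consumption" if consumption_cost < FIXED_MONTHLY_FEE else "fixed"

break_even = FIXED_MONTHLY_FEE / CONSUMPTION_RATE_PER_HOUR
print(break_even)           # 375.0 hours/month
print(cheaper_plan(100))    # consumption
print(cheaper_plan(600))    # fixed
```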

Consolidation Optimization

The main hurdle to this is knowing your environment, teams, needs, current solutions, etc. Once that baseline is known, it's possible to start consolidating services, starting with the largest ROI and working toward the smallest.

Consolidating can also open doors for other features or billing practices. e.g. consolidating microservices into a larger, but stable and known, workload could change the billing strategy from consumption-based to Standard tiers, which offer additional features.

  • Evaluate compute performance to know what systems/deployments can take on more roles, or which can be right-sized (reduced) while maintaining their operational goals.
  • Use reservations
    • May take some time to establish known workload requirements, but once known, absolutely a must-have for static workloads
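Once the steady-state usage is known, the reservation math is simple. The pay-as-you-go rate and discount below are assumptions for illustration; actual Azure reservation discounts vary by service, term and region:

```python
# Reservation sketch: compare pay-as-you-go cost with a reserved rate
# for a known, static workload. All rates are illustrative.

PAYG_RATE = 0.20          # $/hour, pay-as-you-go (made-up)
RESERVED_DISCOUNT = 0.40  # e.g. ~40% off for a committed term (assumption)

def monthly_savings(hours_per_month):
    payg = hours_per_month * PAYG_RATE
    reserved = payg * (1 - RESERVED_DISCOUNT)
    return round(payg - reserved, 2)

# A static 24/7 workload (~730 hours/month) benefits the most:
print(monthly_savings(730))  # 58.4
```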

Monitoring

Knowing your workloads and establishing a baseline is essential to locating gaps where cost-saving steps can be taken. Beyond workloads are new features, products, billing practices, etc. that can improve costs - so monitor new features and products as well.

  • Monitor resources
    • Access times
    • Usage metrics
    • Data tiers
    • etc.
  • Monitor support contracts, suppliers and vendors
  • Implement a robust tagging system to help build useful cost reports
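With a consistent tag scheme in place, raw cost records can be rolled up into per-team or per-environment reports. The tags, resources and costs below are illustrative:

```python
# Tagging sketch: roll up resource costs by tag key to build cost reports.
from collections import defaultdict

records = [
    {"resource": "vm-web-01", "cost": 120.0, "tags": {"team": "web", "env": "prod"}},
    {"resource": "vm-web-02", "cost": 120.0, "tags": {"team": "web", "env": "dev"}},
    {"resource": "sql-core",  "cost": 300.0, "tags": {"team": "data", "env": "prod"}},
]

def cost_by_tag(records, tag_key):
    """Total cost per value of `tag_key`; untagged resources are surfaced too."""
    totals = defaultdict(float)
    for rec in records:
        totals[rec["tags"].get(tag_key, "untagged")] += rec["cost"]
    return dict(totals)

print(cost_by_tag(records, "team"))  # {'web': 240.0, 'data': 300.0}
print(cost_by_tag(records, "env"))   # {'prod': 420.0, 'dev': 120.0}
```

Surfacing an "untagged" bucket is a simple way to find resources that escaped the tagging standard.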

Estimate Budgets

Basically, establish boundaries around non-negotiable requirements, personnel costs, and processes that promote and anticipate growth.

  • Set boundaries and then check spending against them
    • Azure budgets, alerts, etc. help keep things in check
  • The TCO becomes clearer as more data points are generated and policies and budgets are defined
  • A real budget takes shape; however, there is still a need for flexibility/buffering
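The spend-versus-boundary check works the way Azure budget alerts do, firing as thresholds are crossed. The thresholds and figures below are illustrative:

```python
# Budget-boundary sketch: check actual spend against a budget with
# alert thresholds (e.g. warn at 80%, escalate at 100%). Figures invented.

def budget_status(spent, budget, thresholds=(0.8, 1.0)):
    """Return the highest threshold crossed, or None if under all of them."""
    crossed = [t for t in thresholds if spent >= budget * t]
    return max(crossed) if crossed else None

print(budget_status(500, 1000))   # None -> within budget
print(budget_status(850, 1000))   # 0.8  -> warning alert
print(budget_status(1200, 1000))  # 1.0  -> over budget
```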

DevOps

The big driver behind realizing DevOps benefits is collaboration and shared responsibility. Break knowledge, resources and skills out of their silos and make them available across teams and other divisions. Orchestrating this correctly while maintaining clear lines of ownership/accountability can provide a significant boost to productivity.

  1. Use common tools and processes
  2. Increase awareness on environments, challenges, features, etc.
    • Develop robust and flexible monitoring to assist in providing awareness across the product/organization
    • Include priority rankings to assist in prioritizing issues to address - include contextual data as some stakeholders will not know the who/what/why behind priority rankings
    • Dashboards provide great visual feedback as long as they have the correct correlations and context
    • Alert on the items monitored and make sure that all interested parties are included when designing the alerts
  3. Encourage continuous feedback, learning and experimentation throughout the development cycle
    • Track and report on bugs, failures, time to deploy, etc. to target areas of improvement
    • Early, frequent and consistent QA provides reliable feedback
  4. Foster knowledge sharing across teams and have a single source of truth
    • Documentation, knowledgebase, issue tracking, postmortems, project goals, timelines, etc.
    • Dashboards, alert destinations, etc. should be centralized
  5. Set standards for development, operations, deployments, emergency situations, monitoring, etc.
    • Performing the same action should always have the same result
    • Remove human elements that introduce unwanted deviations
      • IaC, CI/CD pipelines, etc.
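The "same action, same result" standard in step 5 is essentially idempotency, which is what IaC tools converge on. A minimal sketch of the idea, with an invented desired-state shape:

```python
# Idempotency sketch: an IaC-style apply converges actual state toward
# desired state, so running it once or many times yields the same result.

desired_state = {"vm_count": 3, "tls": "1.2"}

def apply(current_state, desired_state):
    """Converge current state toward desired state; safe to re-run."""
    return {**current_state, **desired_state}

state = {"vm_count": 1}
state = apply(state, desired_state)
state = apply(state, desired_state)  # re-running changes nothing
print(state)  # {'vm_count': 3, 'tls': '1.2'}
```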

Performance Efficiency

Efficiency is the ability to adjust to changes in demand. Scaling up, down, in and out are options that need to be accounted for, depending on workload demands and the ability to scale at all (e.g. some applications don't support multiple instances).

  1. This requires monitoring and clearly defined performance targets to trigger the adjustment
    • Know your business requirements, seasonal fluctuations, expectations from all stakeholders, etc.
    • This requires constant iteration as you can’t measure what hasn’t been created yet, and you can’t create without measurement
  2. Use historical data to gain visibility on usage patterns, bottlenecks, etc.
  3. Don’t hesitate to lean on third party/external resources for analysis and industry standards
  4. Ideally, include a range of acceptable to unacceptable performance - this allows for locating ideal design choices and associated costs to remain in budget
  5. Clearly define all workloads and their workflows
  6. Test for performance in development to ensure acceptable performance targets are met (make sure targets are defined/updated as appropriate)
  7. Sanitize systems to improve performance (e.g. clean up old data so database lookups are quicker, remove old roles/software, etc.)
  8. Stay up-to-date on new features that help with design improvements
  9. Dedicate time to go back and polish off those 80% “it’s not ideal but it works” items
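The monitoring-plus-targets loop in step 1 reduces to comparing observed utilization against defined thresholds. The CPU thresholds below are illustrative; real autoscale rules live in the platform (e.g. VM scale set autoscale):

```python
# Scaling sketch: decide whether to scale out, in, or hold based on a
# metric versus defined performance targets. Thresholds are illustrative.

def scale_decision(cpu_percent, scale_out_above=70, scale_in_below=30):
    if cpu_percent > scale_out_above:
        return "scale out"
    if cpu_percent < scale_in_below:
        return "scale in"
    return "hold"

print(scale_decision(85))  # scale out
print(scale_decision(20))  # scale in
print(scale_decision(50))  # hold
```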

Reliability

Business requirements are defined between business stakeholders and workload architects. Each group must reach an agreement on realistic and achievable requirements that meet customer needs.

  1. Set a goal - what does success look like?
    1. Use metrics to quantify expectations and provide data around costs and performance value
  2. Test failures early and often in the development lifecycle, and determine the impact of performance on reliability.
  3. Target high priority systems, workflows, etc. that must maintain the highest levels of reliability
    • Understand how platform/vendor SLAs support your own SLAs for the high priority systems - design to fill any gaps
  4. Document dependencies (internal and external) to identify potential workflow interruptions
  5. Determine failure risks for each point of failure
    • Blast radius, business interruption, etc.
    • Know how these failures will be handled if/when they occur
    • What are business tolerances for downtime, cost justifications, etc.
  6. Determine error handling for workflows and implement solutions
  7. Build in redundancy for data, PaaS, IaaS, etc.
  8. Prepare for disasters by designing and documenting tests, recovery plans, policies, etc. in the single source of truth
    • Compliance standards can be a PITA but they really help define a lot of the items here - consider looking into compliance practices even if compliance isn’t required
  9. Identify self-healing options and use them on appropriately important workloads
    • e.g. VM scale sets with Automatic Instance Repair
  10. Monitoring systems should correlate telemetry to improve visibility on an event (interlocking processes, systems, flows, etc. need to coordinate on monitoring data)
  11. Keep alerts actionable - avoid alert fatigue
  12. Avoid overengineering and keep critical paths lean
    • Break out subprocesses and workflows from critical ones
  13. Establish standards in code implementation, deployment, and processes, and document them.
    • Identify opportunities to enforce those standards by using automated validations.
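The SLA gap analysis in step 3 can be made concrete by composing availability figures: dependencies in series multiply their availabilities down, while redundant deployments in parallel compound them up. The SLA values below are examples, not current Azure commitments:

```python
# Composite-SLA sketch: series dependencies multiply availabilities;
# parallel (redundant) deployments improve them.

def series(*slas):
    """Availability when every dependency must be up."""
    result = 1.0
    for s in slas:
        result *= s
    return result

def parallel(sla, copies):
    """Availability when at least one of `copies` redundant instances is up."""
    return 1 - (1 - sla) ** copies

# Example: app tier (99.95%) in series with a database (99.99%):
print(round(series(0.9995, 0.9999), 6))  # 0.9994 -> lower than either SLA

# Two redundant 99.9% deployments in parallel:
print(round(parallel(0.999, 2), 8))      # 0.999999
```

This is why a workflow's effective SLA is always below its weakest dependency, and where redundancy is used to fill the gaps.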

Security

  1. Create a security readiness plan that’s aligned with business priorities.
    • Define workload requirements
    • Integrate with reliability, health modeling and self-preservation strategies
  2. Adopt a Zero Trust design philosophy to mitigate exposure and blast radius
    1. Segmentation to set security boundaries around workloads, infrastructure, processes, and teams to isolate access
      1. Determine based on business requirements - base on criticality, division of labor, privacy concerns, etc. as long as the choice makes sense and has high level buy-in/support
  3. Create and share the incident response plan - lean on industry frameworks for inspiration on preparedness, detection, containment, mitigation and post-incident processes.
    1. The goal is to limit all points of confusion by defining roles and responsibilities while checking off all remediation requirements.
  4. Functionality, budget and other considerations should have no impact on security investment
  5. Codify secure operations and development practices
  6. Classify workload data and implement security practices (i.e. access restrictions, data masking, full encryption, etc.) as appropriate
    • User
    • Usage
    • Configuration
    • Compliance
    • Intellectual property
    • Sensitivity
    • Potential Risk
    • etc.
  7. Time limit access to data
  8. Implement JIT access
  9. Encrypt all the things - at all times - in all places
  10. Add security scanning at every step of the development, deployment and management process
    • Scan during deployment for common CVEs in dependencies
    • Malware scan code and third party packages
    • Implement PaaS solutions for holistic coverage (Azure Stack HCI, Windows Defender Application Control, etc.)
  11. Utilize cryptographic mechanisms
    • Code signing, certificates, encryption, etc.
    • Keep ciphers up to date
  12. Encrypt backups and make them immutable
    • Lean on platform options where possible, e.g. storage policies for immutable archive tiers (WORM, Legal Hold and Time Based)
  13. Utilize security practices to optimize reliability
    • DDoS mitigation, input validation/sanitization, input throttling, etc.
  14. Implement code scanners, apply the latest security patches, update software, and protect your system with effective antimalware on an ongoing basis
  15. Apply at least the same level of security rigor in your recovery resources and processes as you do in the primary environment, including security controls and frequency of backup
    • Scan your backup systems after/during drills
  16. Keep everything, OS, Applications, Code libraries, etc. up to date.
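The classification step (item 6) can be sketched as a lookup from data class to the minimum controls applied. The classes and control names below are illustrative, not a compliance standard:

```python
# Classification sketch: map workload data classes to the security
# practices applied to them. Classes and controls are illustrative.

CONTROLS = {
    "public":       {"encryption_at_rest"},
    "internal":     {"encryption_at_rest", "access_restrictions"},
    "confidential": {"encryption_at_rest", "access_restrictions", "data_masking"},
    "restricted":   {"encryption_at_rest", "access_restrictions", "data_masking",
                     "jit_access", "time_limited_access"},
}

def required_controls(classification):
    """Look up minimum controls; unknown data fails closed to the strictest set."""
    return CONTROLS.get(classification.lower(), CONTROLS["restricted"])

print(sorted(required_controls("internal")))
print(sorted(required_controls("unknown")))  # unknown -> strictest controls
```

Failing closed for unclassified data lines up with the Zero Trust philosophy above: treat anything unknown as the most sensitive case.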