WAF

Purpose

This page collects notes, labs, personal setup items, deviations, etc. from the Microsoft Azure Well-Architected Framework Learn module.

Cost Optimization

In general, promote widespread ownership and buy-in from team members by getting them invested in, or at least exposed to, the financial side of things. The Great Game of Business and "open-book" style management are one great way of accomplishing this. Basically, be open to feedback, suggestions, etc. on performance as well as design tweaks from any and all parties.

Cost Modeling

  • Define and develop a cost model to:
    • Segment expenses
    • Estimate costs
    • Forecast total cost
  • Include infrastructure costs, team expenditures and revenue
    • Compare different design approaches to shed light on total cost of ownership (TCO)
    • Make sure requirements and nice-to-haves are covered
    • Keep it flexible; these are sliding figures
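The cost model described above can be sketched as a small calculation. The segment names, rates and growth figure below are illustrative assumptions, not real Azure pricing:

```python
# Minimal cost-model sketch: segment expenses, estimate monthly cost,
# and forecast total cost over a planning horizon with simple growth.
# All figures are placeholders, not real pricing.

def estimate_monthly(segments):
    """Sum the estimated monthly cost of each expense segment."""
    return sum(segments.values())

def forecast_total(segments, months, monthly_growth=0.0):
    """Forecast total cost over `months`, compounding a flat growth rate."""
    total = 0.0
    monthly = estimate_monthly(segments)
    for _ in range(months):
        total += monthly
        monthly *= 1 + monthly_growth
    return round(total, 2)

segments = {
    "infrastructure": 4200.0,  # compute, storage, networking
    "team": 9000.0,            # operational staffing share
    "licenses": 300.0,         # nice-to-haves tracked separately
}

print(estimate_monthly(segments))          # 13500.0
print(forecast_total(segments, 12, 0.02))  # 12-month forecast at 2% growth
```

Keeping the growth rate as an input is one way of keeping the model flexible, since these are sliding figures.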

Design Cost

Migration design decisions (rehost, re-write, etc.) are tricky and need to include projections on ROI, solving (or accepting) technical debt, training, etc. Typically, maintaining a like-for-like approach is the most cost-effective option while still reducing the most egregious technical debt.

  • Focus on low-hanging fruit that requires minimal service/process changes but still reduces cost.
    • e.g. Azure SQL Database services vs VM scale sets in multiple regions. (assuming feature parity)
  • Min/Max the ROI potential
  • Take advantage of value
    • It may not be the cheapest option, but the SLA, resiliency, scaling options, etc. may make the spend worth it
  • Add guardrails to keep costs within budget
    • Use policies or application design patterns to prevent unapproved changes
      • Scale Limits
      • Block higher priced SKUs
      • Automate storage cleanup, data tiering, etc.
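Guardrails like these are usually enforced with Azure Policy, but the decision logic can be sketched in plain code. The SKU names and scale limit below are illustrative assumptions, not an official allowlist:

```python
# Guardrail sketch: reject deployments that exceed a scale limit or
# request a SKU outside an approved (budget-friendly) allowlist.
# SKU names and limits here are examples only.

APPROVED_SKUS = {"Standard_B2s", "Standard_D2s_v5", "Standard_D4s_v5"}
MAX_INSTANCES = 10

def validate_deployment(sku, instance_count):
    """Return a list of guardrail violations (empty means approved)."""
    violations = []
    if sku not in APPROVED_SKUS:
        violations.append(f"SKU {sku!r} is not in the approved list")
    if instance_count > MAX_INSTANCES:
        violations.append(f"instance count {instance_count} exceeds {MAX_INSTANCES}")
    return violations

print(validate_deployment("Standard_D2s_v5", 4))   # [] -> approved
print(validate_deployment("Standard_E64s_v5", 20)) # two violations
```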

Usage Optimization

Basically, use what you pay for and constantly evaluate product needs (if it's not used, why pay for it? e.g. Visio).

  • Use consumption based pricing for workloads that aren’t utilized 24/7/365 or that have “spiky” workloads
    • Fixed billing is typically best for steady workloads
  • Analyze high availability designs and determine if active-active or active-only models will optimize costs over active-passive
    • Essentially right-scoping resources
  • Keep environments clean
    • Delete old VMs (they still incur storage costs)
    • Move old data to archive tiers, or remove it
    • Clean up backup sets
    • Update configuration management process to include lifecycle management
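The consumption-versus-fixed choice in the first bullet comes down to a break-even point in monthly utilization. The rates below are made-up numbers for illustration:

```python
# Break-even sketch: consumption billing wins for spiky or idle workloads,
# fixed billing wins once utilization is high enough. Rates are invented.

CONSUMPTION_RATE_PER_HOUR = 0.40   # pay only for hours actually used
FIXED_MONTHLY_FEE = 150.00         # flat fee regardless of usage

def cheaper_plan(hours_used_per_month):
    consumption_cost = hours_used_per_month * CONSUMPTION_RATE_PER_HOUR
    return "consumption" if consumption_cost < FIXED_MONTHLY_FEE else "fixed"

break_even = FIXED_MONTHLY_FEE / CONSUMPTION_RATE_PER_HOUR
print(break_even)           # 375.0 hours/month
print(cheaper_plan(100))    # consumption
print(cheaper_plan(600))    # fixed
```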

Consolidation Optimization

The main hurdle to this is knowing your environment, teams, needs, current solutions, etc. Once that baseline is known, it's possible to start consolidating services, starting with the largest ROI and working toward the smallest.

Consolidating can also open doors for other features or billing practices. e.g. consolidating microservices into a larger, but stable and known, workload could change the billing strategy from consumption-based to Standard tiers, which offer additional features.

  • Evaluate compute performance to know what systems/deployments can take on more roles, or which can be right-sized (reduced) while maintaining their operational goals.
  • Use reservations
    • May take some time to establish known workload requirements, but once known, absolutely a must-have for static workloads
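Once the steady-state usage is known, the reservation math is simple. The pay-as-you-go rate and discount below are assumptions for illustration; actual Azure reservation discounts vary by service, term and region:

```python
# Reservation sketch: compare pay-as-you-go cost with a reserved rate
# for a known, static workload. All rates are illustrative.

PAYG_RATE = 0.20          # $/hour, pay-as-you-go (made-up)
RESERVED_DISCOUNT = 0.40  # e.g. ~40% off for a committed term (assumption)

def monthly_savings(hours_per_month):
    payg = hours_per_month * PAYG_RATE
    reserved = payg * (1 - RESERVED_DISCOUNT)
    return round(payg - reserved, 2)

# A static 24/7 workload (~730 hours/month) benefits the most:
print(monthly_savings(730))  # 58.4
```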

Monitoring

Knowing your workloads and establishing a baseline is essential to locating gaps where cost-saving steps can be taken. Beyond workloads are new features, products, billing practices, etc. that can improve costs - so monitor new features and products as well.

  • Monitor resources
    • Access times
    • Usage metrics
    • Data tiers
    • etc.
  • Monitor support contracts, suppliers and vendors
  • Implement a robust tagging system to help build useful cost reports
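With a consistent tag scheme in place, raw cost records can be rolled up into per-team or per-environment reports. The tags, resources and costs below are illustrative:

```python
# Tagging sketch: roll up resource costs by tag key to build cost reports.
from collections import defaultdict

records = [
    {"resource": "vm-web-01", "cost": 120.0, "tags": {"team": "web", "env": "prod"}},
    {"resource": "vm-web-02", "cost": 120.0, "tags": {"team": "web", "env": "dev"}},
    {"resource": "sql-core",  "cost": 300.0, "tags": {"team": "data", "env": "prod"}},
]

def cost_by_tag(records, tag_key):
    """Total cost per value of `tag_key`; untagged resources are surfaced too."""
    totals = defaultdict(float)
    for rec in records:
        totals[rec["tags"].get(tag_key, "untagged")] += rec["cost"]
    return dict(totals)

print(cost_by_tag(records, "team"))  # {'web': 240.0, 'data': 300.0}
print(cost_by_tag(records, "env"))   # {'prod': 420.0, 'dev': 120.0}
```

Surfacing an "untagged" bucket is a simple way to find resources that escaped the tagging standard.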

Estimate Budgets

Basically, establish boundaries around non-negotiable requirements, personnel costs, and processes that promote and anticipate growth.

  • Set boundaries and then check spending against them
    • Azure budgets, alerts, etc. help keep things in check
  • The TCO becomes clearer as more data points are generated and policies and budgets are defined
  • A real budget takes shape; however, there is still a need for flexibility/buffering
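The spend-versus-boundary check works the way Azure budget alerts do, firing as thresholds are crossed. The thresholds and figures below are illustrative:

```python
# Budget-boundary sketch: check actual spend against a budget with
# alert thresholds (e.g. warn at 80%, escalate at 100%). Figures invented.

def budget_status(spent, budget, thresholds=(0.8, 1.0)):
    """Return the highest threshold crossed, or None if under all of them."""
    crossed = [t for t in thresholds if spent >= budget * t]
    return max(crossed) if crossed else None

print(budget_status(500, 1000))   # None -> within budget
print(budget_status(850, 1000))   # 0.8  -> warning alert
print(budget_status(1200, 1000))  # 1.0  -> over budget
```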

DevOps

The big driver behind realizing DevOps benefits is collaboration and shared responsibility. Break knowledge, resources and skills out of their silos and make them available across teams and other divisions. Orchestrating this correctly while maintaining clear lines of ownership/accountability can provide a significant boost to productivity.

  1. Use common tools and processes
  2. Increase awareness on environments, challenges, features, etc.
    • Develop robust and flexible monitoring to assist in providing awareness across the product/organization
    • Include priority rankings to assist in prioritizing issues to address - include contextual data as some stakeholders will not know the who/what/why behind priority rankings
    • Dashboards provide great visual feedback as long as they have the correct correlations and context
    • Alert on the items monitored and make sure that all interested parties are included when designing the alerts
  3. Encourage continuous feedback, learning and experimentation throughout the development cycle
    • Track and report on bugs, failures, time to deploy, etc. to target areas of improvement
    • Early, frequent and consistent QA provides reliable feedback
  4. Foster knowledge sharing across teams and have a single source of truth
    • Documentation, knowledgebase, issue tracking, postmortems, project goals, timelines, etc.
    • Dashboards, alert destinations, etc. should be centralized
  5. Set standards for development, operations, deployments, emergency situations, monitoring, etc.
    • Performing the same action should always have the same result
    • Remove human elements that introduce unwanted deviations
      • IaC, CI/CD pipelines, etc.
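The "same action, same result" standard in step 5 is essentially idempotency, which is what IaC tools converge on. A minimal sketch of the idea, with an invented desired-state shape:

```python
# Idempotency sketch: an IaC-style apply converges actual state toward
# desired state, so running it once or many times yields the same result.

desired_state = {"vm_count": 3, "tls": "1.2"}

def apply(current_state, desired_state):
    """Converge current state toward desired state; safe to re-run."""
    return {**current_state, **desired_state}

state = {"vm_count": 1}
state = apply(state, desired_state)
state = apply(state, desired_state)  # re-running changes nothing
print(state)  # {'vm_count': 3, 'tls': '1.2'}
```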

Performance Efficiency

Efficiency is the ability to adjust to changes in demand. Scaling up, down, in and out are options that need to be accounted for, depending on workload demands and the ability to scale at all (e.g. some applications don't support multiple instances).

  1. This requires monitoring and clearly defined performance targets to trigger the adjustment
    • Know your business requirements, seasonal fluctuations, expectations from all stakeholders, etc.
    • This requires constant iteration as you can’t measure what hasn’t been created yet, and you can’t create without measurement
  2. Use historical data to gain visibility on usage patterns, bottlenecks, etc.
  3. Don’t hesitate to lean on third party/external resources for analysis and industry standards
  4. Ideally, include a range of acceptable to unacceptable performance - this allows for locating ideal design choices and associated costs to remain in budget
  5. Clearly define all workloads and their workflows
  6. Test for performance in development to ensure acceptable performance targets are met (make sure targets are defined/updated as appropriate)
  7. Sanitize systems to improve performance (e.g. clean up old data so database lookups are quicker, remove old roles/software, etc.)
  8. Stay up-to-date on new features that help with design improvements
  9. Dedicate time to go back and polish off those 80% “it’s not ideal but it works” items
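The monitoring-plus-targets loop in step 1 reduces to comparing observed utilization against defined thresholds. The CPU thresholds below are illustrative; real autoscale rules live in the platform (e.g. VM scale set autoscale):

```python
# Scaling sketch: decide whether to scale out, in, or hold based on a
# metric versus defined performance targets. Thresholds are illustrative.

def scale_decision(cpu_percent, scale_out_above=70, scale_in_below=30):
    if cpu_percent > scale_out_above:
        return "scale out"
    if cpu_percent < scale_in_below:
        return "scale in"
    return "hold"

print(scale_decision(85))  # scale out
print(scale_decision(20))  # scale in
print(scale_decision(50))  # hold
```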

Reliability

Business requirements are defined between business stakeholders and workload architects. Each group must reach an agreement on realistic and achievable requirements that meet customer needs.

  1. Set a goal - what does success look like?
    1. Use metrics to quantify expectations and provide data around costs and performance value
  2. Test failures early and often in the development lifecycle, and determine the impact of performance on reliability.
  3. Target high priority systems, workflows, etc. that must maintain the highest levels of reliability
    • Understand how platform/vendor SLAs support your own SLAs for the high priority systems - design to fill any gaps
  4. Document dependencies (internal and external) to identify potential workflow interruptions
  5. Determine failure risks for each point of failure
    • Blast radius, business interruption, etc.
    • Know how these failures will be handled if/when they occur
    • What are business tolerances for downtime, cost justifications, etc.
  6. Determine error handling for workflows and implement solutions
  7. Build in redundancy for data, PaaS, IaaS, etc.
  8. Prepare for disasters by designing and documenting tests, recovery plans, policies, etc. in the single source of truth
    • Compliance standards can be a PITA but they really help define a lot of the items here - consider looking into compliance practices even if compliance isn’t required
  9. Identify self-healing options and use them on appropriately important workloads
    • e.g. VM scale sets with Automatic Instance Repair
  10. Monitoring systems should correlate telemetry to improve visibility on an event (interlocking processes, systems, flows, etc. need to coordinate on monitoring data)
  11. Keep alerts actionable - avoid alert fatigue
  12. Avoid overengineering and keep critical paths lean
    • Break out subprocesses and workflows from critical ones
  13. Establish standards in code implementation, deployment, and processes, and document them.
    • Identify opportunities to enforce those standards by using automated validations.
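The SLA gap analysis in step 3 can be made concrete by composing availability figures: dependencies in series multiply their availabilities down, while redundant deployments in parallel compound them up. The SLA values below are examples, not current Azure commitments:

```python
# Composite-SLA sketch: series dependencies multiply availabilities;
# parallel (redundant) deployments improve them.

def series(*slas):
    """Availability when every dependency must be up."""
    result = 1.0
    for s in slas:
        result *= s
    return result

def parallel(sla, copies):
    """Availability when at least one of `copies` redundant instances is up."""
    return 1 - (1 - sla) ** copies

# Example: app tier (99.95%) in series with a database (99.99%):
print(round(series(0.9995, 0.9999), 6))  # 0.9994 -> lower than either SLA

# Two redundant 99.9% deployments in parallel:
print(round(parallel(0.999, 2), 8))      # 0.999999
```

This is why a workflow's effective SLA is always below its weakest dependency, and where redundancy is used to fill the gaps.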

Security

  1. Create a security readiness plan that’s aligned with business priorities.
    • Define workload requirements
    • Integrate with reliability, health modeling and self-preservation strategies
  2. Adopt a Zero Trust design philosophy to mitigate exposure and blast radius
    1. Segmentation to set security boundaries around workloads, infrastructure, processes, and teams to isolate access
      1. Determine based on business requirements - base on criticality, division of labor, privacy concerns, etc. as long as the choice makes sense and has high level buy-in/support
  3. Create and share the incident response plan - lean on industry frameworks for inspiration on preparedness, detection, containment, mitigation and post-incident processes.
    1. The goal is to limit all points of confusion by defining roles and responsibilities while checking off all remediation requirements.
  4. Functionality, budget and other considerations should have no impact on security investment
  5. Codify secure operations and development practices
  6. Classify workload data and implement security practices (i.e. access restrictions, data masking, full encryption, etc.) as appropriate
    • User
    • Usage
    • Configuration
    • Compliance
    • Intellectual property
    • Sensitivity
    • Potential Risk
    • etc.
  7. Time limit access to data
  8. Implement JIT access
  9. Encrypt all the things - at all times - in all places
  10. Add security scanning at every step of the development, deployment and management process
    • Scan during deployment for common CVEs in dependencies
    • Malware scan code and third party packages
    • Implement PaaS solutions for holistic coverage (Azure Stack HCI, Windows Defender Application Control, etc.)
  11. Utilize cryptographic mechanisms
    • Code signing, certificates, encryption, etc.
    • Keep ciphers up to date
  12. Encrypt backups and make them immutable
    • Lean on platform options where possible, e.g. storage policies for immutable archive tiers (WORM, Legal Hold and Time Based)
  13. Utilize security practices to optimize reliability
    • DDoS mitigation, input validation/sanitization, input throttling, etc.
  14. Implement code scanners, apply the latest security patches, update software, and protect your system with effective antimalware on an ongoing basis
  15. Apply at least the same level of security rigor in your recovery resources and processes as you do in the primary environment, including security controls and frequency of backup
    • Scan your backup systems after/during drills
  16. Keep everything, OS, Applications, Code libraries, etc. up to date.
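The classification step (item 6) can be sketched as a lookup from data class to the minimum controls applied. The classes and control names below are illustrative, not a compliance standard:

```python
# Classification sketch: map workload data classes to the security
# practices applied to them. Classes and controls are illustrative.

CONTROLS = {
    "public":       {"encryption_at_rest"},
    "internal":     {"encryption_at_rest", "access_restrictions"},
    "confidential": {"encryption_at_rest", "access_restrictions", "data_masking"},
    "restricted":   {"encryption_at_rest", "access_restrictions", "data_masking",
                     "jit_access", "time_limited_access"},
}

def required_controls(classification):
    """Look up minimum controls; unknown data fails closed to the strictest set."""
    return CONTROLS.get(classification.lower(), CONTROLS["restricted"])

print(sorted(required_controls("internal")))
print(sorted(required_controls("unknown")))  # unknown -> strictest controls
```

Failing closed for unclassified data lines up with the Zero Trust philosophy above: treat anything unknown as the most sensitive case.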