Week in Tech #9 — August 25, 2025

Agentic AI That Ships, Giga‑Scale Infrastructure, and Governance That Sticks


This week's main focus: pragmatic agentic patterns (Agent Factory, MCP-aware ops), infra leaps for training/inference (NVIDIA + partners), and the rising twin mandates of governance and sustainability, complete with blueprints leaders can act on.


Executive Summary

Boardroom → Ops floor → Engine room → Workshop → Policy arena → Test track.

Walk into Monday’s boardroom and two numbers loom large: run‑rate AI spend and power draw. Bubble talk aside, budgets are flowing to systems that ship outcomes—agents embedded in real workflows, with the guardrails, observability, and cost controls CFOs can defend.

Down on the ops floor, Google adds Gemini Cloud Assist Investigations for faster root‑cause analysis; AIOps moves from dashboards to decisive suggestions. In the engine room, NVIDIA turns the dial with giga‑scale networking and faster inference serving, while HPE + NVIDIA align storage and accelerated compute for data platforms that don’t buckle at scale.

In the workshop, Microsoft’s Agent Factory goes from concept to how‑to—codified patterns for building the first agent and wiring it to tools, oversight, and telemetry. Google, meanwhile, ships real‑world GenAI blueprints and a tool chooser to reduce architecture guesswork.

Over in the policy arena, enterprises lean into AI governance—who can do what, with which tools, and when—while Google’s research on environmental footprint makes carbon a first‑class KPI. Finally, on the test track, cloud business cases (Azure) and autonomous CI/CD ideas for DevSecOps sketch a near future where safety gates, policy‑as‑code, and agentic change‑control are the default.

Future predictions

AIOps copilots go from RCA to remediation: one-click, policy-guarded fix flows enter the mainstream in 6–12 months.
Agent playbooks replace PoCs: "Agent Factory"-style blueprints (tools, human-in-the-loop (HIL) review, safety, telemetry) become platform features across clouds within the year.
Giga-scale becomes table stakes: networking and inference-serving optimizations shift the cost curve; leaders evaluate latency per $ as closely as tokens per $.
Green SLAs emerge: emission budgets and grams CO₂ per 1K tokens appear in vendor scorecards and quarterly reviews.
Autonomous DevSecOps: policy-as-code, signed SBOMs, and agentic pipelines reduce toil while increasing auditability by 2026.

Resources

HPE + NVIDIA partner on AI + enterprise storage
Strategic alignment around accelerated compute and modern storage for AI‑first data platforms. Expect tighter reference architectures that balance throughput, cost, and manageability.
Why it matters: Building AI at scale is as much a data plumbing problem as a model problem.
Action for tech leaders: Map training/inference bottlenecks (I/O, networking) and request partner reference architectures (RAs) that include TCO and ops playbooks.
Read more

Gemini Cloud Assist Investigations: RCA for cloud incidents
Google expands AIOps with guided root‑cause analysis for incidents. Moves beyond surfacing metrics to proposing causal paths.
Why it matters: Faster MTTR and clearer blameless postmortems.
Action: Pilot on one high‑noise service; measure MTTR and false‑positive deltas against current process.
Read more
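
For the pilot suggested above, one way to keep the comparison honest is to compute both numbers the same way for the current process and the pilot. The sketch below is a minimal illustration in plain Python; the `Incident` record, its fields, and the sample values are hypothetical and not taken from any Google tooling.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Incident:
    minutes_to_resolve: float  # detection to resolution
    was_real: bool             # False when the alert was a false positive


def mttr(incidents):
    """Mean time to resolve, counting only real incidents."""
    real = [i.minutes_to_resolve for i in incidents if i.was_real]
    return mean(real) if real else 0.0


def false_positive_rate(incidents):
    """Share of alerts that turned out not to be real incidents."""
    if not incidents:
        return 0.0
    return sum(not i.was_real for i in incidents) / len(incidents)


def compare(baseline, pilot):
    """Pilot minus baseline; negative deltas mean the pilot did better."""
    return {
        "mttr_delta_min": mttr(pilot) - mttr(baseline),
        "fp_rate_delta": false_positive_rate(pilot) - false_positive_rate(baseline),
    }


# Hypothetical sample data for one high-noise service.
baseline = [Incident(90, True), Incident(60, True), Incident(5, False), Incident(5, False)]
pilot = [Incident(40, True), Incident(50, True), Incident(60, True), Incident(5, False)]

print(compare(baseline, pilot))
```

Agree on what counts as a false positive before the pilot starts, so the two periods are labeled by the same rule.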

NVIDIA: giga‑scale networking + faster inference serving
Networking and serving upgrades to push larger, faster workloads through the same (or cheaper) footprint.
Why it matters: Throughput and tail‑latency drive UX and unit economics.
Action: Re‑benchmark critical paths (context window, batching) and revisit autoscaling policies.
Read more
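
A re-benchmark like the one suggested above usually comes down to two numbers per configuration: tail latency and throughput per dollar. The sketch below assumes you already have per-request latencies and a token count for each run; the run names, field names, and figures are hypothetical and do not describe any NVIDIA product.

```python
from statistics import quantiles


def p99(latencies_ms):
    """99th-percentile latency.

    method="inclusive" keeps the result inside the observed range.
    """
    return quantiles(latencies_ms, n=100, method="inclusive")[98]


def tokens_per_dollar(tokens_served, hourly_cost_usd, hours):
    """Throughput normalized by spend for one benchmark run."""
    return tokens_served / (hourly_cost_usd * hours)


def best_config(runs, p99_budget_ms):
    """Among runs that meet the tail-latency budget, pick the most tokens per dollar."""
    eligible = [r for r in runs if p99(r["latencies_ms"]) <= p99_budget_ms]
    if not eligible:
        return None
    winner = max(
        eligible,
        key=lambda r: tokens_per_dollar(r["tokens"], r["hourly_cost_usd"], r["hours"]),
    )
    return winner["name"]


# Hypothetical benchmark runs: larger batches serve more tokens but respond slower.
runs = [
    {"name": "batch-8", "latencies_ms": list(range(100, 201)),
     "tokens": 2_000_000, "hourly_cost_usd": 4.0, "hours": 1.0},
    {"name": "batch-32", "latencies_ms": list(range(300, 401)),
     "tokens": 5_000_000, "hourly_cost_usd": 4.0, "hours": 1.0},
]

print(best_config(runs, p99_budget_ms=250))
print(best_config(runs, p99_budget_ms=500))
```

Treating the latency budget as a hard filter and cost as the ranking criterion is one design choice among several; a weighted score would work too if your UX tolerates some slack.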

Agent Factory: build your first AI agent
Microsoft documents hands‑on patterns—tool orchestration, oversight, and telemetry—for agents that deliver business outcomes.
Why it matters: Reduces PoC thrash; standardizes how teams build and govern agents.
Action: Stand up a reference agent with HIL + audit logs; require SLOs (quality, cost, safety) before scale‑out.
Read more
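
The "require SLOs before scale-out" step above can be made concrete as a small gate that returns the reasons an agent is not ready. This is a minimal sketch, not Microsoft's Agent Factory API; the class names, fields, and thresholds are assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AgentSLO:
    min_task_success: float        # quality: share of tasks completed correctly
    max_cost_per_task_usd: float   # cost
    max_policy_violations: int     # safety: violations seen during the pilot


@dataclass(frozen=True)
class PilotMetrics:
    task_success: float
    cost_per_task_usd: float
    policy_violations: int


def scale_out_blockers(slo, metrics):
    """Return the SLO dimensions the pilot missed; an empty list means go."""
    blockers = []
    if metrics.task_success < slo.min_task_success:
        blockers.append("quality")
    if metrics.cost_per_task_usd > slo.max_cost_per_task_usd:
        blockers.append("cost")
    if metrics.policy_violations > slo.max_policy_violations:
        blockers.append("safety")
    return blockers


# Hypothetical thresholds and pilot results.
slo = AgentSLO(min_task_success=0.90, max_cost_per_task_usd=0.25, max_policy_violations=0)
pilot = PilotMetrics(task_success=0.93, cost_per_task_usd=0.40, policy_violations=0)

print(scale_out_blockers(slo, pilot))
```

Returning the list of misses rather than a bare pass/fail gives the human reviewer something specific to sign off on or push back against.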

AI’s power drain: Google details Gemini apps’ footprint
Fresh look at environmental impacts and efficiency levers for Gemini‑class applications.
Why it matters: Sustainability moves from marketing slide to budget line.
Action: Add energy metrics to observability; set targets for emissions per request alongside latency and cost.
Read more
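
To put emissions next to latency and cost, the usual starting point is measured energy multiplied by the carbon intensity of the grid that supplied it. The sketch below shows that arithmetic only; the function names and every number in the example are hypothetical, not figures from Google's report.

```python
def grams_co2(energy_kwh, grid_intensity_g_per_kwh):
    """Emissions for a measurement window: energy used times grid carbon intensity."""
    return energy_kwh * grid_intensity_g_per_kwh


def grams_per_request(total_grams, requests):
    """Average emissions attributed to one request in the window."""
    return total_grams / requests


def grams_per_1k_tokens(total_grams, tokens):
    """Average emissions attributed to 1,000 tokens in the window."""
    return total_grams / tokens * 1000


# Hypothetical one-hour window for a single service.
total = grams_co2(energy_kwh=2.0, grid_intensity_g_per_kwh=400.0)

print(total)
print(grams_per_request(total, requests=10_000))
print(grams_per_1k_tokens(total, tokens=4_000_000))
```

How you measure `energy_kwh` (accelerators only, or including host, idle capacity, and data-center overhead) changes the result considerably, so record the measurement boundary alongside the metric.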

Azure: business case for cloud + AI modernization
Guidance for framing ROI, risk, and capability uplift—what a “frontier firm” looks like on Azure.
Why it matters: Helps align exec sponsorship with platform roadmaps.
Action: Build a two‑page business case: value buckets, KPI hypotheses, and a 90‑day pilot plan.
Read more

White paper: DevSecOps in a fully autonomous CI/CD
Explores safety gates, policy‑as‑code, and signed artifacts in agent‑assisted pipelines.
Why it matters: Compliance and speed can rise together—if automation is observable and constrained.
Action: Add SBOM + provenance checks to your golden pipeline; simulate agent‑proposed changes with canary + rollback.
Read more
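
The SBOM and provenance checks mentioned above can start as a simple gate that fails closed. The sketch below is a simplified illustration: the record shapes, the trusted-builder set, and the banned-component list are hypothetical and do not follow any particular SBOM or provenance standard. It also skips signature verification, which a real pipeline would delegate to dedicated signing tooling.

```python
import hashlib

# Hypothetical policy inputs; in practice these would live in version control.
TRUSTED_BUILDERS = {"ci-main"}
BANNED_COMPONENTS = {"left-pad-fork"}


def sha256_hex(data: bytes) -> str:
    """Hex SHA-256 digest of the artifact bytes."""
    return hashlib.sha256(data).hexdigest()


def gate_failures(artifact: bytes, provenance: dict, sbom: dict) -> list:
    """Return the reasons to block a release; an empty list means the gate passes."""
    failures = []
    # The artifact we are about to ship must be the one the build recorded.
    if provenance.get("sha256") != sha256_hex(artifact):
        failures.append("digest mismatch")
    # Only builds from known pipelines are accepted.
    if provenance.get("builder") not in TRUSTED_BUILDERS:
        failures.append("untrusted builder")
    # A missing SBOM is treated as a failure, not as "nothing to check".
    components = sbom.get("components")
    if not components:
        failures.append("missing sbom")
    else:
        names = {c.get("name") for c in components}
        for name in sorted(names & BANNED_COMPONENTS):
            failures.append(f"banned component: {name}")
    return failures


artifact = b"demo build output"
provenance = {"sha256": sha256_hex(artifact), "builder": "ci-main"}
sbom = {"components": [{"name": "requests"}, {"name": "left-pad-fork"}]}

print(gate_failures(artifact, provenance, sbom))
```

The same gate can evaluate an agent-proposed change before the canary stage, so the agent's output is held to the rules a human-authored change would face.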

Google’s real‑world GenAI blueprints
Reference architectures for practical use cases; reduces time from idea to first deploy.
Why it matters: Lowers design risk and standardizes patterns.
Action: Pick one blueprint; adapt with your data/products and define success metrics up front.
Read more

The turn toward AI governance in enterprise workflows
Operational governance for agents and automations—permissions, audits, and change control baked into the flow.
Why it matters: Scales trust and reduces shadow AI.
Action: Establish an AI change‑advisory cadence and a minimal RACI for agent ownership.
Read more
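
Permissions and audits "baked into the flow" can be as small as an allowlist check that records every decision. The sketch below is a minimal in-memory illustration; the agent names, tool names, and log shape are hypothetical, and a production setup would store both the policy and the log somewhere durable.

```python
from datetime import datetime, timezone

# Hypothetical policy: which agent may call which tools.
POLICY = {
    "triage-agent": {"read_logs", "open_ticket"},
    "deploy-agent": {"read_logs", "propose_change"},
}

AUDIT_LOG = []


def authorize(agent: str, tool: str) -> bool:
    """Allow a tool call only if the policy lists it; record the decision either way."""
    allowed = tool in POLICY.get(agent, set())
    AUDIT_LOG.append({
        "at": datetime.now(timezone.utc).isoformat(),
        "agent": agent,
        "tool": tool,
        "allowed": allowed,
    })
    return allowed


print(authorize("triage-agent", "open_ticket"))
print(authorize("triage-agent", "propose_change"))
print(authorize("unknown-agent", "read_logs"))
print(len(AUDIT_LOG))
```

Logging denials as well as approvals is deliberate: the denied calls are often the most useful input for the change-advisory review.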

Choosing the right Google AI developer tool
A chooser guide for developers to align the tool with the job (apps, agents, data, or MLOps).
Why it matters: Cuts tooling friction and rework.
Action: Publish an internal “when to use what” one‑pager mapped to your platform guardrails.
Read more

Demystifying AI agents: hype vs. reality
Experts weigh in on where agents deliver today vs. what still needs research and controls.
Why it matters: Sets realistic expectations and avoids “agent sprawl.”
Action: Limit to 2–3 core agent use cases; require post‑launch reviews on accuracy, safety, and cost.
Read more

Stay curious, Maciej Gos

Thanks for reading! If you found these insights valuable, feel free to share and subscribe. I’d love to hear which topics you want to explore deeper — just reply to this email or drop a comment.

The Future Thinker
