As the scale and complexity of AI infrastructure grows, data center operators need continuous visibility into factors including performance, temperature and power usage. These insights enable data center operators to actively monitor and adjust data center configurations across large-scale, distributed systems, validating that these systems are operating at their highest efficiency and reliability.
NVIDIA is developing a software solution for visualizing and monitoring fleets of NVIDIA GPUs, giving cloud partners and enterprises an insights dashboard that can help them boost GPU uptime across computing infrastructures.
The offering is an opt-in, customer-installed service that monitors GPU usage, configuration and errors. It will include an open-source client software agent, part of NVIDIA’s ongoing support of open, transparent software that helps customers get the most from their GPU-powered systems.
With the service, data center operators will be able to:
- Track spikes in power usage to stay within energy budgets while maximizing performance per watt.
- Monitor utilization, memory bandwidth and interconnect health across the fleet.
- Detect hotspots and airflow issues early to avoid thermal throttling and premature component aging.
- Confirm consistent software configurations and settings to ensure reproducible results and reliable operation.
- Spot errors and anomalies to identify failing parts early.
These capabilities can help enterprises and cloud providers visualize their GPU fleet, address system bottlenecks and optimize productivity for higher return on investment.
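As a rough illustration of the kinds of checks listed above, the following Python sketch flags nodes over a power budget, GPUs above a temperature limit, and nodes whose driver version differs from the rest of the fleet. It is a minimal sketch under stated assumptions, not the service's implementation: the `GpuSample` record, its field names, the thresholds and the version strings are all hypothetical, since the telemetry format has not been published.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class GpuSample:
    """One hypothetical telemetry reading; field names are illustrative only."""
    node: str
    gpu_index: int
    power_w: float
    temp_c: float
    driver_version: str


def nodes_over_power_budget(samples, budget_w):
    """Return the sorted nodes whose summed GPU power draw exceeds budget_w."""
    totals = defaultdict(float)
    for s in samples:
        totals[s.node] += s.power_w
    return sorted(node for node, total in totals.items() if total > budget_w)


def hotspots(samples, limit_c):
    """Return (node, gpu_index) pairs whose temperature is at or above limit_c."""
    return [(s.node, s.gpu_index) for s in samples if s.temp_c >= limit_c]


def driver_drift(samples):
    """Return the sorted nodes whose driver differs from the most common one."""
    per_node = {s.node: s.driver_version for s in samples}
    if not per_node:
        return []
    baseline, _ = Counter(per_node.values()).most_common(1)[0]
    return sorted(n for n, v in per_node.items() if v != baseline)


# Made-up readings for three nodes; the values are not real measurements.
samples = [
    GpuSample("node-a", 0, 310.0, 71.0, "550.54"),
    GpuSample("node-a", 1, 420.0, 88.0, "550.54"),
    GpuSample("node-b", 0, 290.0, 69.0, "550.54"),
    GpuSample("node-c", 0, 300.0, 70.0, "535.10"),
]

print(nodes_over_power_budget(samples, budget_w=700.0))
print(hotspots(samples, limit_c=85.0))
print(driver_drift(samples))
```

Each check only reads the samples and returns a list of findings, which mirrors the read-only character of the telemetry the article describes.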
This optional service provides real-time monitoring: each GPU system communicates with the external cloud service and shares its GPU metrics. NVIDIA GPUs do not have hardware tracking technology, kill switches or backdoors.
Open-Source Agent Offers Insights for Data Center Owners
The service will feature a client software agent that the customer can install to stream node-level GPU telemetry data to a portal hosted on NVIDIA NGC. Customers will be able to visualize their GPU fleet utilization in a dashboard, globally or by compute zones, which are groups of nodes enrolled in the same physical or cloud locations.
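To make the "globally or by compute zone" view concrete, here is a small Python sketch that rolls node-level utilization readings up to per-zone and fleet-wide averages. It assumes a hypothetical `NodeReport` record with a zone label and one utilization percentage per GPU. The real agent's data model and the portal's aggregation logic are not described in this article, so treat every name here as illustrative.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import fmean


@dataclass(frozen=True)
class NodeReport:
    """Hypothetical node-level report; not the agent's actual schema."""
    node: str
    zone: str
    gpu_util_pct: tuple  # one utilization percentage per GPU on the node


def utilization_by_zone(reports):
    """Return {zone: mean GPU utilization} across every GPU in each zone."""
    per_zone = defaultdict(list)
    for r in reports:
        per_zone[r.zone].extend(r.gpu_util_pct)
    return {zone: fmean(vals) for zone, vals in per_zone.items() if vals}


def fleet_utilization(reports):
    """Return mean utilization over all GPUs, or None if there are no GPUs."""
    values = [u for r in reports for u in r.gpu_util_pct]
    return fmean(values) if values else None


# Made-up reports; zone names are placeholders.
reports = [
    NodeReport("n1", "us-west", (80.0, 60.0)),
    NodeReport("n2", "us-west", (40.0,)),
    NodeReport("n3", "on-prem-1", (90.0, 70.0)),
]

print(utilization_by_zone(reports))
print(fleet_utilization(reports))
```

Averaging per GPU rather than per node keeps nodes with more GPUs from being underweighted. A per-node average would be an equally reasonable choice depending on what the dashboard is meant to show.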

The client tooling agent is also slated to be open sourced, providing transparency and auditability. It will offer a working example of how customers can incorporate NVIDIA tools into their own solutions for monitoring GPU infrastructure, whether for critical compute clusters or entire fleets.
The software provides insight into a company’s GPU inventory but cannot modify GPU configurations or underlying operations. It provides read-only telemetry data that is customer managed and customizable.
The service will also enable customers to generate reports that detail GPU fleet information.
As AI applications grow in number and complexity, modern AI infrastructure management is evolving to keep pace. Making sure that AI data centers are running at peak health is vital as AI revolutionizes every industry and application. This software service is here to help.
Register for NVIDIA GTC, taking place March 16-19 in San Jose, California, to learn more.
See notice regarding software product information.

