This thread walks through a real customer question: how do you find which processes are eating the most CPU across a large VDI estate (200 servers) over a long period, so you know what to optimize?
It starts in ControlUp Dashboards. Using the VDI process data, you can build a widget showing top processes by CPU, then narrow it to the machines you care about using global filters (filter by folder, or switch to query filters with IS / IS NOT). A table beats a bar chart for a top-50 view, and any filters and time frames you set are saved in the URL, so you can bookmark or share the exact view. The built-in Gallery dashboards (notably "Big Screen Dashboard VDI") already include a processes-by-CPU widget out of the box.
The harder part is the one the customer really cared about: getting a meaningful picture across all machines rather than a single noisy spike on one box. The data is stored per process per 5-minute timeslot, not pre-aggregated per machine, so the practical approach that emerged was:
- Filter to the folder containing all target machines.
- Use CPU usage P95 to strip out short, expected spikes (with a fallback to average CPU, since P95 returned N/A in some environments — a possible bug worth a support ticket).
- Add metrics under Advanced: computer_name (unique count) to see how many machines run the process, and process_name (count) to see how often it appeared (each record ≈ a 5-minute aggregate).
- The processes worth optimizing are the ones combining high CPU + high machine count + high record count.
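The steps above can be sketched in a few lines of Python. This is only an illustration of the grouping logic, not ControlUp's implementation: the `Record` type, its field names, the `candidates` helper, and the sample rows are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Record:
    """One hypothetical per-process, per-timeslot row (field names are assumptions)."""
    process_name: str
    computer_name: str
    avg_cpu: float  # average CPU % inside one 5-minute timeslot


def summarize(records):
    """Group by process: mean CPU, unique machine count, record count."""
    groups = defaultdict(list)
    for r in records:
        groups[r.process_name].append(r)
    summary = [
        {
            "process_name": name,
            "avg_cpu": sum(r.avg_cpu for r in rows) / len(rows),
            "machines": len({r.computer_name for r in rows}),
            "records": len(rows),
        }
        for name, rows in groups.items()
    ]
    # Sort by CPU, highest first, as a top-N table sorted on that column would.
    summary.sort(key=lambda s: s["avg_cpu"], reverse=True)
    return summary


def candidates(summary, min_machines, min_records):
    """Keep only processes seen on enough machines, often enough."""
    return [s for s in summary
            if s["machines"] >= min_machines and s["records"] >= min_records]


# Made-up rows: one noisy spike on a single box vs. a modest load on every box.
rows = [Record("spiky.exe", "vda-01", 90.0)]
for host in ("vda-01", "vda-02", "vda-03"):
    rows += [Record("updater.exe", host, 10.0), Record("updater.exe", host, 12.0)]

table = summarize(rows)
shortlist = candidates(table, min_machines=2, min_records=4)
```

Sorted on CPU alone, `spiky.exe` tops the table even though it appeared once on one machine. Adding the machine and record counts is what lets `updater.exe` stand out as the fleet-wide candidate.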
Worked example: a process showing ~1,080 records across 197 machines roughly translates to ~90 hours of activity over 7 days, or a few minutes per machine per day. The same method surfaced candidates like CompatTelRunner.exe, WerFault.exe, and WEM-related activity for further investigation. To go deeper, the App Trends and App Statistics reports let you drill into a specific app, and product_name data helps identify what’s actually behind generic process names like setup.exe.
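The record-to-time conversion used in the worked example can be written as a small helper. This is a sketch under the thread's assumption that each record is roughly one 5-minute aggregate; the function and parameter names are made up for illustration. The result is an upper bound, because a record only means the process appeared in that slot, not that it ran for the full five minutes.

```python
SLOT_MINUTES = 5  # assumption from the thread: one record is about one 5-minute aggregate


def activity_estimate(records, machines, days, slot_minutes=SLOT_MINUTES):
    """Upper-bound estimate of how long a process was seen, based on its record count."""
    total_minutes = records * slot_minutes
    return {
        "total_hours": total_minutes / 60,
        "minutes_per_machine_per_day": total_minutes / machines / days,
    }


# Figures quoted in the thread: 1,080 records on 197 machines over 7 days.
edge = activity_estimate(records=1080, machines=197, days=7)
print(round(edge["total_hours"]), round(edge["minutes_per_machine_per_day"], 1))

# Second figure from the thread: 24,380 records on 197 machines over 7 days.
wem = activity_estimate(records=24380, machines=197, days=7)
print(round(wem["minutes_per_machine_per_day"], 1))
```

The first call reproduces the roughly 90 hours and 3.9 minutes per machine per day discussed in the thread; the second gives about 88 minutes per machine per day.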
A few important caveats came up: averages hide spikes (10% average could be one minute at 100%), process counts are per machine, and hypervisor-level CPU (e.g., XenServer) can disagree with in-VM agent data when hosts are over-provisioned — which is also why a Sizing Recommendations report may suggest removing vCPUs even on a machine that looks maxed out.
Bottom line: use the dashboard to identify the heavy processes (folder filter + P95/avg CPU + machine count + record count), then use the App Trends report to dig into the details per app.
Read the entire thread below:
Morning, is there any way to get data where I can see which process has been taking up the most CPU over the past 30 days for example? so I know what to optimize basically
Hi @member yes you can. I assume you mean from VDI data. See screenshot. (BTW this is also a very good question to ask our Ask Ai it can also generate these lists).
where you see this information?
Ah, you posted in the Dashboards thread, so I thought you were in ControlUp Dashboards 🙂
how do I choose the VDAs? It's picking the Citrix infrastructure
So there are multiple options. In the widget filter section you could filter on VDI name
Or in the global filters select the folder you want to see
I would also advise choosing a table instead of a bar chart when showing a top 50 🙂
nice! thanks, the folderpath is interesting. Can I target a folder?
the folders are part of global filters so you can make a full dashboard (save the widget to a dashboard and create more if you want) and then use the global filter to change the dashboard to show only that folder. And you can even select multiple folders or switch to query filter where you can do IS and IS NOT
where is that view where you can pick a folder
in the widget builder its here
when you save a widget on the dashboard its here
very awesome
if you go to the dashboard Gallery tab you can see the built-in dashboards we created for you to consume or customize
Big Screen Dashboard VDI might already contain what you are looking for
yeah in the bottom row we have processes with CPU usage
let me take a look
and in the gallery items you still have the ability to change the time frame and set global filter etc without needing to customize the dashboard 🙂
this is what I see when I try to use that big screen dashboard
have you tried big screen dashboard VDI
if you dont have desktop products the other one will be empty
this is the big screen dashboard desktop
try the one next to it that ends on VDI 🙂
the other one is for ControlUp for Desktops (our laptop and devices monitoring product)
yes that one works!
Last thing: if you set global filters or a different time frame, that is all reflected in the URL, which you can save as a favorite or share with someone. That way you don't need to re-add the filter each time you want to look at it.
Good luck 🙂
thank you so much!
just one thing
does this mean that us015a.exe takes the most cpu? we have 200 servers
how can I confirm that it checked all 200 servers
So it is the process with the highest avg_cpu of all processes
on all machines
ok great, thanks
but that might be on just one machine
yeah thats the thing.. I want to have the average of all machines
one machine doesnt tell me much
I understand hmm let me think about that because we save it per process
Ok what you at least can do is show on how many computers the process runs
yeah, I have 200 servers and I would like to check top 10-50 processes. If I see for example that SDXHelper.exe is top 10 with 40% cpu usage within these 200 servers then I know that I have to optimize something with office updates
if you click advanced you can add an extra metric with computer_name unique count this will add a extra column showing how many devices have the process
@member I might need your help with @member's question from the data standpoint 🙂 Do you have a good idea for how to display what he is asking? You know the VDI metrics better than me 🙂
if I pick past 6 months, I see that the process that used most CPU is related to chrome and edge.. which is weird
I filtered on the folder where all the 200 VDAs reside
Let me check
With the filtering you’re on the right track. Using that global filter to select the folder narrows the data down to only the machines in that folder.
Now to understand what you’re after: you want to know which process is taking up the most resources so you can take action there. Correct?
There are a couple of ways to go about this. You can do this
• Select cpu_usage_95. This takes the 95th percentile only, basically excluding (very) short spikes. In my opinion it’s better than taking the average, because the very short spikes are expected behavior (user clicks something, opens the app, etc.) and not something to be taken into account
• You grab the average of that. Keep in mind it then averages over the p95 cpu usage you just selected. We store our data in timeslots. So it’s averaging over each timeslot in the selected timeframe (7d in my case)
• You can also grab the highest value instead of average. That tells you the processes that spike for a longer period of time
• Always enable grouping and select process name so you see the details
@member I agree your way with the P95 is even better, but @member wants to see it over all machines. With your way (and mine), if a process is really busy for about 5 minutes on one machine, it will pop out in the widget. That is not the top 3 heaviest processes across all machines, which is what he wants to see. I also don't know if that is possible; maybe something in the machine table?
Doesn’t it? The processes table covers processes from all machines. Filtered down to the folder path he’d like to see, I’d think we’re already querying this across all machines.
The machines table has the cpu aggregated per machine. There is no info about processes there so it’ll stay very high-level
It does look at all the machines, but it doesn't average them out across all machines. I also don't know if that's even possible. Maybe filtering down on the folder, taking a P95 to remove spikes, and enabling the computer count is indeed the best way
Processes is organized to store a process with all the details you see in that table for a certain timeslot. Let’s say we’re looking at a 5 minute timeslot. It’ll average all real-time data (every 3 seconds) on cpu usage into that field. P95 does the same but removes the spikes.
What we’re doing here is looking at all records (processes) and then averaging either the average or P95 across all processes across all machines in a specific folder. I’d say that covers what @member is asking for.
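To make the timeslot idea concrete, here is a small sketch of how one 5-minute slot of 3-second samples could collapse into an average and a P95. The thread does not say how ControlUp computes its percentile, so the nearest-rank method below is an assumption, and the sample values are made up.

```python
def percentile_nearest_rank(values, pct):
    """Nearest-rank percentile; pct is an integer from 1 to 100."""
    ordered = sorted(values)
    rank = (pct * len(ordered) + 99) // 100  # integer ceiling of pct% of n
    return ordered[rank - 1]


def summarize_slot(samples):
    """Collapse one timeslot of CPU samples into (average, p95)."""
    return sum(samples) / len(samples), percentile_nearest_rank(samples, 95)


# A 5-minute slot sampled every 3 seconds holds 100 samples.
short_spike = [5.0] * 97 + [100.0] * 3   # about 9 seconds at 100%
long_burst = [5.0] * 70 + [100.0] * 30   # about 90 seconds at 100%

print(summarize_slot(short_spike))
print(summarize_slot(long_burst))
```

In the first slot the brief spike lifts the average but leaves the P95 at the idle level; in the second, the sustained burst shows up in both. That is the behavior described above: P95 ignores very short spikes while still flagging longer ones.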
perfect perfect, I’m going to try this now and see what data I get.
I targeted the folder with all the 200 servers. With your settings I see N/A under cpu usage 95
would be great if I can see the cpu usage
tried with 72 hours and its the same, N/A
@member I also get N/A on P95 in CUoffice ..
@member it works for me in CUDemo… and it should work everywhere. If not that’s potentially a bug we should investigate on our end.
@member could you get a support ticket created? Refer them back to this slack chat. Either support can investigate or they’ll escalate to development and they’ll take a look.
can I just email the support?
I need this asap
Please do
One simple thing btw, did you sort in table based on cpu usage 95?
yes
I have emailed the support now with the details.
says for example ms-teamsupdate.exe , question is if this is legit. If this teams updater is causing high CPU on all 200 servers
So click on advanced and click on +add metric then select computer_name and unique count
then you will see on how many computers this causes high cpu
now teamsupdater is not part of the list at all
can you for now put it back to avg 🙂 so we have cpu data
and then the ones with a high avg and a high number of computers will be the most interesting for you to check out 🙂
ok got it.. but can I trust this?
yeah but dont put the aggregation to highest value
that will always be 100
oh sorry.. changing to average now
here we go
so then these 2, CompatTelRunner.exe and WerFault.exe, are taking up 10% CPU over the past 7 days? can I trust that
yeah thats correct so now you can focus on the ones with the highest cpu and computers
so you still have to keep in mind when the process was running
if WerFault only ran for 5 minutes in the past week, it's less critical than something running a lot
we can add that information
that's the thing.. if I see 10% from WerFault I suppose it has been running the most, since I picked top 15 in the widget
add another +add extra metric then process_name then count
ok
the count on the end now shows the number of times the process had a record
sorry for taking so long to figure this out. The data you want aggregated over all your machines is not in the data record VDI provides to dashboards so we need to make something close to it
ah ok so for example
yeah this is starting to look good!
no worries
so basically.. setup.exe can be anything. How can I deep dive further into this?
very weird that setup.exe took 8% average cpu past 7 days with 1344 process names? not sure what process names means in this case
with the process_name count we count how often it popped up in the last 7 days. I think each count is a 5-minute aggregate
so if it says 1, then it popped up in one particular 5-minute slot in the last 7 days
doesnt mean it was the full 5 minutes btw
now to figure out what the setup.exe was lets see if we can find more details
ok! then this is very interesting.. 1344 past 7 days
and this is on 196 machines so its very interesting to know what is happening exactly
yeah that pops up a lot on a lot of devices
so this has been going on around 6720 minutes? as in 110+ hours past 7 days?
ok I found more info
so we do save product_name in there too
if you import this widget
it will tell you which products are causing setup.exe
import is in the 3 dots menu on top of your dashboard
ok give me a sec
you can also find more product names by process names by just editing the filter
imported
this only shows 2 products, think it should show more
I modified it a little
it shows 2 because those have the process name setup.exe
which we wanted to know 🙂 if you remove the filter it will just show all the product names 🙂 which is maybe more handy than the process names
so WEM takes up 5% CPU and it goes under setup.exe?
can you add the process counts to it ?
ah BTW another thing I didnt think about the process count is per machine!
but still, processes with high CPU + high number of machines + high number of records ARE the ones you want to focus on when optimizing
how do I add the process counts? was it a metric under advanced? or under grouping
under advanced + add metric
process_name then count
okok
yeah so this means that on 197 computers it had 24380 records (each record is an aggregate of 5 minutes) and the avg cpu of all those records was 5%
and with Microsoft edge installer?
1080 records, how much does that transfer into? in hours
at least 1080 minutes right
90 hours over 197 devices over the past 7 days
wow ok! thats quite a lot
3.9 minutes per day
seems like I can trust this data. So now I can deepdive into optimization of the microsoft edge installer for example and if I can disable it to save that CPU
per machine
3.9 minutes per machine, do you think that will make a big difference
yeah this data will lead you in the right way 🙂
well 3.9 minutes every day doing a install and taking 9% cpu doesnt sound that good
also it's avg CPU, so it can be 1 minute of 100% CPU
this one here is worse I think
yeah you can check up in WEM if its running a script often or doing something ? But I do know that during logon WEM can be CPU heavy with creating start menu items, registries etc
yeah I think so too.. so to calculate 1080, you converted it to 1080 minutes?
the 90 hours..
so first you take the records, which are 5-minute aggregates, so it's 24380 records x 5 minutes = 121900 minutes / 197 devices = 618 minutes in 7 days per device
divide by 7 to get 88 minutes per device per day
thanks!
again if we have a aggregate per machine it would be even better but I feel that you are pretty close here and you can start investigations 🙂
and this could be different aswell right, like 1 minute at 100% cpu or 5 minutes at 40% cpu etc
if you have any suggestions which can give me more accurate data then please go ahead
all I have is this pretty much, it has been very difficult to understand what takes up all the CPU on our VMs. But we can have a maximum of 10 users logged on per VM and they are up and running at 100% cpu sometimes
yeah so it's a 5-minute aggregate, so within those 5 minutes you can have spikes etc. I think the P95 is way better, but it's not showing data in my environment or yours, so let's see what the helpdesk will say 🙂
yeah Im waiting for that
@member we came at this from dashboarding. Do you know if there are any reports in VDI or somewhere in the real-time console to check more?
yeah realtime console works for me aswell if there is any other way to do this
yeah so in real time console you have the process view which you can sort on cpu high to low but its realtime not aggregated over 7 days 😄
exactly.. its very difficult to capture this in that way. Because CPU might be at 90% and suddenly drop to 30%
ah I found something interesting
I was thinking about that @member
The App Statistics report is useful here
so thats what I’m trying to understand.. what keeps the CPU at 90% for those minutes
in reports you have App Trends
oh yeah and app statistic
App trends also works but you’d have to select an app first. App statistic gives you a higher-level overview. And alot of metrics to sort by
let me check app statistics.. I think I have already tried that one
you could use the app names from the dashboard and go into app trends and select the app we found 🙂
True
in app statistic at the end it also shows on the number of machines its found 🙂
the selector is empty ?
Completely other way to go at this is using the Sizing recommendations report. Then for the machines it says are undersized investigate why cpu and/or memory are undersized via dashboard investigation
I pick setup.exe for example
I can try that Sonny
sizing recommendation tells me to remove vCPUs haha
thats not really accurate
CPU usage data is telling a different story. It looks like it’s over sized in terms of cpu.
there is something off here..
how can the report be to remove vCPUs when some of the servers are at 100%
This is from where?
Sorry have to jump into a meeting now
this is from xenserver, the platform where the servers reside
on the one where the cpu is 100% the report says remove 3 vCPUs haha
Would be interesting to understand the differences. This is real-time I think. Also they’re tracking from the hypervisor and we’re tracking from within the VM. These numbers could be off if you’ve over provisioned the CPUs on your hosts.
if I have overprovisioned the CPUs, does that affect control up data?
what did 25 mean in this case? was it still 5 minute aggregate?
We take data from the agent. So if you over provisioned cpus you might get some skewed data. Windows reports 50% cpu usage. But in the hypervisor that cpu could be shared with one or more other machines which could be using the other 50%.
Which time frame have you selected in the last screenshot?
here is my final filter.. this is for past 72H and 200 VMs included. So basically CompatTelRunner.exe with 1197 aggregations per 5 minutes and WerFault.exe with 445 aggregations per 5 minutes are the two processes that I need to optimize.
and indeed Sonny, CPUs are overprovisioned but xenserver reports very low usage
these are the hosts
Regardless of the overprovisioning I think optimizing those 2 processes make sense.
On the overprovisioning side. I would do a test and see if you can stop the overprovisioning for a few vms if possible. Especially if cpu load is still low. Is there even a reason to overprovision?
I of course don’t have full insight into your environment. I’m basing this off of what you’ve shared. If you want we can always have a quick chat on Zoom or Teams tomorrow.
that would be awesome Sonny.
I got some really good data from this today but it would be very nice to have a meeting, especially if you have been helping customers in similar situations where the CPU is very high and the customer is trying to figure out what is causing the spikes
so very thankful so far, I will definitely pass on this to the organization and tell them about the awesome service
Tomorrow I’m relatively empty. I’ve still got 1-3pm open and 4-5. I’m in the cet timezone. Send me a PM with what works for you and your email and I’ll send you an invite.
I work 08:00 – 16:00 CET. Im in a another meeting between 09:30 – 10:00 and 10:30 – 11:30 thats all
rest I’m available
I can even do more than 16:00 if needed
Let’s do 1pm right after lunch. Please send me your email (via a direct message to me or here). I’ll send you an invite
done!
this is what I got from the support
there is no metric named CPU Usage which the support is mentioning
The answer of support is not correct. @member
Chris what do you think about this one? Im using the usage_95 and even if it says N/A I can still see the process name 111 678 on all computers. So 398 minutes each day goes to this process where it has high CPU between 1-5 minutes?
Well the group by sorting is on the selected metric and if the metric is not there (NA) it might just sort alphabetical
so not telling a lot then I think avgcpu is better then nothing
this one right?
yes
it's also still worth looking into why you are missing the data, but I see you have a meeting with Sonny today
yes I do 😄
is this 84 minutes a day?
that goes to WEM process
yeah around 75 min per day per machine
but it could be anything from 10% cpu to 90?
well over all those machines and all those process combined the avg was 6% usage
that can be stable or spiky
I think you can see that back in the app trend report selected for that process
Yes I would definitely focus on using the dashboard to identify the heavy processes. Then use the app trend report to dive into the details per app
