Analyzing Hive Health and Scout Health in ControlUp Portal

This discussion focused on the difference between hive health and scout health, and on the health metrics shown in the scout details on the ControlUp portal. One suggestion was to include a "dead man’s switch" so that Scoutbee users have a metric for ruling out hive-related issues. Hive availability can also be monitored to understand why there are no test results in a certain period. The "Scout malfunction alert" feature is documented at https://support.controlup.com/docs/alerts-for-stopped-scouts.

Read the entire ‘Analyzing Hive Health and Scout Health in ControlUp Portal’ thread below:

Does the Uptime for a Hive in the Hives List section refer to hive health or to the success rate of the specific scout? While troubleshooting, I noticed that hive uptime (%) = successful tests / total tests. Is this expected?

Scout

270 / 288 = 93.75

Wait

Yeah

So it’s the uptime of the resource you’re testing

Depending on where you took the screenshot it could be the uptime of the scout for a specific hive (I think that’s in the bottom of the page)

(Doing this from memory, not at my desk)

The scout success rate is already under the scout section at the left top corner

I think this is a bug.

That’s all hives combined. But I think the logic is shared with other tests such as the network tests. With those tests you can select multiple hives for a single test. So with an EUC test there’s only one hive so the numbers are the same but with multiple hives the top number is the average across all hives for that scout

And this is an EUC test, right?

Network test

And with one Hive selected for the test?

If the hive server and service are healthy but some tests launched from the hive failed, then I would expect to see 100% health for the hive.

One hive only

You had 18 failed tests?

Yes

The percentages have nothing to do with the hive, it’s just a list sorted by hive. Like this test from me:

I had 100% successful tests on most of my hives but 99.93% from the Hive in Chennai. Hence the 99.99% across all hives.
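A minimal sketch of how a per-hive figure and an "all hives" figure can differ for one scout. The hive names and test counts are hypothetical, and pooling the counts across hives is only one possible way to combine them; the thread does not say how the portal actually computes the top number.

```python
# Hypothetical per-hive counts for one scout: (successful tests, total tests).
results_by_hive = {
    "hive-a": (288, 288),
    "hive-b": (288, 288),
    "hive-c": (286, 288),
}


def percent(successful, total):
    """Successful tests as a percentage of all tests that ran."""
    return 100.0 * successful / total


# One figure per hive, as in the hive list at the bottom of the scout page.
per_hive = {name: percent(ok, total) for name, (ok, total) in results_by_hive.items()}

# One combined figure, pooling the counts from every hive (an assumption).
total_ok = sum(ok for ok, _ in results_by_hive.values())
total_run = sum(total for _, total in results_by_hive.values())
combined = percent(total_ok, total_run)

for name, value in per_hive.items():
    print(f"{name}: {value:.2f}%")
print(f"all hives: {combined:.2f}%")
```

With a single hive selected, the per-hive and combined figures are the same number, which matches what was observed in this thread.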

In your case, 18 tests failed, which means that the hive could not reach the destination or load the page (depending on what network tests you used) 18 times, so your uptime for the tested resource is 93.75%.
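The arithmetic can be checked with a short sketch. The function name is illustrative and not part of any ControlUp API; the counts (288 tests, 18 failures) come from the thread.

```python
def uptime_percent(successful: int, total: int) -> float:
    """Successful tests as a percentage of all tests that actually ran."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * successful / total


total_tests = 288
failed_tests = 18
uptime = uptime_percent(total_tests - failed_tests, total_tests)
print(f"{uptime}%")  # 93.75%
```

Only tests that ran are counted, so the figure describes the tested resource and not the hive.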

Hive health is a different measurement from scout health, right?

So the hive health feature is new as of today, and currently only alerts when the hive is unavailable and tests can’t execute. I think that when a hive is down no tests run and they are not included in any of the scout data

https://support.controlup.com/docs/alerts-for-stopped-scouts re the hive health feature (or as the article calls it “stopped scouts” caused by an unavailable custom hive)

But we don’t report on hive availability. You could set up a 2nd custom hive and have the 2 hives test each other with a network test I guess 🙂

Or are you saying that the 18 failed tests were due to the hive being down?

Btw, is this Ivan from Canada? I think we met

I do use the new feature “Scout malfunction alert”, and this is for a custom hive. I have also configured multiple hives for some scouts, but not this one. The label “Hive Uptime” suggests a metric for hive server/service health, and for the hive only. It should be independent of the scout result.

Yes, Mr. Stocker 😀

I can explain to you over a shadow session if it works better.

Ok so it could use some label improvements in the UX. But all the numbers you see on a scout page are related to the targeted resource. We don’t monitor the uptime of a hive itself besides “it’s working or not” (and then alert when it’s not)

18 scout failures and the hive didn’t miss a beat. I would expect to see 100% Hive Health in the UX

I understand how it can cause confusion

But if you are on a scout page, all the numbers are about the tested resource

@member some UI improvement suggestion to remove the possible confusion on what the numbers mean in the Hive list at the bottom

Actually if the uptime is calculated based on hive availability, then it would be very helpful in root cause analysis (ruling out any hive issue)

But I think when the hive is the issue the test wouldn’t even start and there would be no data. So it wouldn’t even be in the number you are looking at. So the number you see is only including tests that were initiated by the hive

I hope @member can figure out a smart way to include the dead man’s switch in the compute on the portal end. 😉

People familiar with Scoutbee (jobs, etc.) can figure it out quickly. However, a hive health metric would give other Scoutbee users an easy way to know that the issue is not hive related but target related.

Hi Ivan

Thanks for your insights

The health and availability under the scout details are intended to help analyze whether the overall target health and availability is degraded due to a problem in a specific location (in which case you will see only one location with failures) or whether it is the same across all locations, which might hint at a problem on the target side.

As for hive availability: If for any reason the hive is not working, there will be no test results. Getting a report on hive availability can help you understand why there are no test results in a certain period. But you can’t use that to understand the target availability.
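The "no results means the hive may be down" idea can be sketched as a gap check over result timestamps. This is an illustration only: the 5-minute interval, the tolerance, and the timestamps are assumptions, and `find_gaps` is a hypothetical helper, not a ControlUp function.

```python
from datetime import datetime, timedelta


def find_gaps(timestamps, expected_interval, tolerance=timedelta(minutes=1)):
    """Return (last_seen, next_seen) pairs where results are further apart than expected."""
    ordered = sorted(timestamps)
    gaps = []
    for earlier, later in zip(ordered, ordered[1:]):
        if later - earlier > expected_interval + tolerance:
            gaps.append((earlier, later))
    return gaps


# Hypothetical data: one result every 5 minutes, with nothing between 00:10 and 00:30.
start = datetime(2024, 1, 1, 0, 0)
interval = timedelta(minutes=5)
timestamps = [start + i * interval for i in (0, 1, 2, 6, 7)]

gaps = find_gaps(timestamps, interval)
for last_seen, next_seen in gaps:
    print(f"no results between {last_seen:%H:%M} and {next_seen:%H:%M}")
```

A gap like this says nothing about the target; it only marks a period in which no tests ran. The sketch also only finds gaps between two results, so a hive that is still down at the end of the period would need a separate check against the current time.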


