Invalid against Linux or Mac from Windows

Der Mann mit der Ledertasche
Joined: 12 Dec 05
Posts: 151
Credit: 302594178
RAC: 0
Topic 198339

Hi Folks,

For the past couple of days I have been seeing WUs marked as invalid when the other crunching partner is a Mac or Linux host.
Can anybody confirm this behaviour? These are the four WUs in question:

https://einsteinathome.org/workunit/234153120
https://einsteinathome.org/workunit/234052697
https://einsteinathome.org/workunit/233893987
https://einsteinathome.org/workunit/233816659

BR

Greetings from the North

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5845
Credit: 109883879935
RAC: 30645119

Those 4 invalid tasks are spread over 3 hosts. On the host with 2 invalids, I took a quick look at some valid quorums and saw examples of 'mixed' validations where there was no problem. There are always going to be (fairly rare) examples where the 'mismatch' between platforms is possibly a contributing factor, but there's not much that can be done about that. In the past, a lot of effort was put into tweaking the validation routines to minimise the likelihood of cross-platform issues.

To be sure there wasn't anything unusual happening at the moment, I scanned all tasks for my hosts currently listed in the online database. I have 12243 tasks in total, 7373 BRP6 and 4870 FGRP4. Of those, there are 9 invalid, 5 FGRP4 and 4 BRP6. I looked at those 9 and whilst some were against machines running a different OS, the majority were against machines where at least one (and sometimes both) of the others was running the same OS (Linux). I'm fairly confident that there is no current issue with cross-platform validation.
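
As a quick check of the arithmetic, the counts above can be turned into invalid rates. This is only a minimal sketch: the variable and function names are illustrative, and only the task counts come from the post.

```python
# Task counts quoted in the post above: (total tasks, invalid tasks) per search.
counts = {
    "BRP6": (7373, 4),
    "FGRP4": (4870, 5),
}

def invalid_rate(total, invalid):
    """Return the invalid rate as a percentage of all tasks."""
    return 100.0 * invalid / total

# Per-search rates, then the combined rate over all tasks.
rates = {name: invalid_rate(total, invalid) for name, (total, invalid) in counts.items()}

total_tasks = sum(total for total, _ in counts.values())
total_invalid = sum(invalid for _, invalid in counts.values())
overall_rate = invalid_rate(total_tasks, total_invalid)

for name, rate in rates.items():
    print(f"{name}: {rate:.3f}% invalid")
print(f"Overall: {total_invalid} of {total_tasks} = {overall_rate:.3f}% invalid")
```

The combined rate comes out well under 0.1%, which is consistent with the view that invalid results are rare on these hosts.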

In my case, there has been a bout of quite hot weather in the last week or two so I reckon that a machine going a little outside its comfort zone due to heat is probably the prime suspect for the invalid results I can see.

Cheers,
Gary.

Der Mann mit der Ledertasche
Joined: 12 Dec 05
Posts: 151
Credit: 302594178
RAC: 0

Hello to Down Under,

I saw this WU https://einsteinathome.org/workunit/234293675 this morning and will keep an eye on it. I would expect the Linux host to lose. As far as I can remember, I haven't seen this behaviour before. Let's watch it over the next couple of weeks.

BR

Greetings from the North

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5845
Credit: 109883879935
RAC: 30645119

I don't think your expectations will be realised :-). I'm quietly confident that the Linux box will win, that a Windows/Linux pair will validate, and that your Windows box will be marked invalid.

Why do I say this, you ask? Well, I cheated and had a look at the owner of the Linux box :-). Do you still expect that the Linux box will be marked invalid? :-)

On a more serious note, your box already has 2 invalids and this current inconclusive seems likely to turn into #3. If it does, I would suggest you consider checking for good CPU cooling and after that I would check motherboard and PSU for any signs of swollen capacitors. Then I would check RAM. I see you are running Windows Server so I guess you will have server grade hardware? If so, that makes it less likely to be a hardware issue but you never know until you check for sure.

I'm not running any server grade hardware and whenever I see invalids, I seem to be able to find a heat/hardware issue that is causing it.

I hope I'm wrong and that the other machine ends up with the invalid result :-).

Cheers,
Gary.

Der Mann mit der Ledertasche
Joined: 12 Dec 05
Posts: 151
Credit: 302594178
RAC: 0

...OK, OK; it is a host from the AEI! ;-)

But my server is a new, 3-month-old HP server, and the two invalids are the result of "fighting" against Linux/Darwin! ;-)

BTW: The server is housed in our well-cooled data center.

Greetings from the North

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5845
Credit: 109883879935
RAC: 30645119

Sure, I know nothing about the nature of your hardware, where it's housed and the operating conditions. I'm merely mentioning the things I've seen that cause invalid results.

I run Linux on my hosts and there are more Windows boxes out there than Linux boxes. If there was an excessive cross-platform problem, I would expect to see examples amongst my hosts. Every time I look, I don't see any real evidence. I'm not saying there are no invalids caused by cross-platform differences. I'm just saying that the rate of occurrence seems quite low.

I believe the Devs keep a close watch on error/invalid rates and would take action if they could see evidence of "fighting" as you describe it. If both hosts in a quorum have 'inconclusives' when Windows is matched against Linux/OS X, then there must be (on average) more Linux or OS X boxes that eventually get invalid results because of the greater chance that a Windows box would be used for the deadlock-breaking 3rd result.
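
The tie-breaker argument above can be sketched as a small simulation. This is only an illustration under stated assumptions: the third host is drawn at random from the host population and always agrees with the result from its own platform family, and the 0.7 Windows share is an assumed figure, not a project statistic.

```python
import random

def non_windows_loss_fraction(windows_share, trials=100_000, seed=1):
    """Simulate deadlocked Windows vs Linux/OS X quorums.

    Assumption (hypothetical, for illustration): the tie-breaking third host
    is drawn at random from the host population and always agrees with the
    result from its own platform family.
    """
    rng = random.Random(seed)
    non_windows_losses = 0
    for _ in range(trials):
        third_is_windows = rng.random() < windows_share
        if third_is_windows:
            # The two Windows results validate; the Linux/OS X result goes invalid.
            non_windows_losses += 1
    return non_windows_losses / trials

# 0.7 is an assumed Windows share, not a figure from the project.
share = 0.7
fraction = non_windows_loss_fraction(share)
print(f"Linux/OS X result ends up invalid in about {fraction:.1%} of deadlocks")
```

Under these assumptions the non-Windows result loses in roughly the same proportion as the Windows share of hosts, so a systematic cross-platform problem would show up mostly as invalids on Linux and OS X boxes.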

Cheers,
Gary.

Der Mann mit der Ledertasche
Joined: 12 Dec 05
Posts: 151
Credit: 302594178
RAC: 0

...this is rubbish: the Windows wingman produced an "error while computing", a second Linux host came along, and I was out. If this behaviour is going to be "normal" in the future, I'll have to think about other options. I can accept errors from my hosts if there are thermal problems or hardware failures, but this sucks.

BR

Greetings from the North

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5845
Credit: 109883879935
RAC: 30645119

I guess you are referring to the quorum https://einsteinathome.org/workunit/234293675 that you pointed out in a previous message. It is indeed unfortunate that the 'deciding' 3rd task on the Windows box did error out and was replaced with a 4th task on Linux. So we don't yet see a quorum where your result doesn't match another Windows result. This still doesn't prove that the problem is a cross-platform one. I think you may still find such an example where your result doesn't agree with another Windows result.

I would suggest that you continue to monitor your host for invalid tasks and, if they keep occurring, look for ones where there was at least one other Windows host involved in the comparison. If you never see any examples of these, it may be a cross-platform issue and we could ask Bernd to investigate. I don't think we are at that stage yet.

Cheers,
Gary.

Der Mann mit der Ledertasche
Joined: 12 Dec 05
Posts: 151
Credit: 302594178
RAC: 0

I'll do so.
BTW: How far back in the history can I see tasks that are in Error, Invalid or Validation inconclusive? Perhaps I should write down the task number when the next errors occur?

THX so far.

Greetings from the North

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5845
Credit: 109883879935
RAC: 30645119

Tasks can disappear quite quickly - just a few days after the quorum is completed. This will happen for the current quorum now that all results are in. I'm not sure of the exact period. I think it might be 5 days. It used to be longer but was changed to keep the size of the online database within manageable limits.

You should write down the details of your invalids, particularly the OS used for the two other results that caused yours to be marked invalid. That way you will be able to point to accurate details for a decent sample size. Already, one of the previous invalids for the host in question has been removed, since the number showing is still 2 despite the 3rd one having been recently added. If it is a 5-day retention period, the oldest of the 2 current invalids is going to disappear around 2:00 PM UTC today, since the quorum was completed on the 11th around that time.
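
The retention estimate is simple date arithmetic, sketched below. The 5-day period is the guess stated in the post rather than a confirmed setting, and the year and month in the example are placeholders, since the post only gives the day of the month and the approximate time.

```python
from datetime import datetime, timedelta, timezone

# Assumed retention period; the post itself only says "I think it might be 5 days".
RETENTION = timedelta(days=5)

def purge_time(quorum_completed):
    """Return when a completed quorum would drop out of the online database."""
    return quorum_completed + RETENTION

# Placeholder year and month; only "the 11th, around 2 PM UTC" is from the post.
completed = datetime(2000, 1, 11, 14, 0, tzinfo=timezone.utc)
expires = purge_time(completed)
print(expires.strftime("day %d at %H:%M UTC"))
```

Noting the completion time of each invalid quorum along with the task number makes it easy to tell how long the details will remain visible online.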

Cheers,
Gary.

Christian Beer
Joined: 9 Feb 05
Posts: 595
Credit: 124687276
RAC: 314967

I investigated this issue a little bit. So far there is no widespread problem. The overall "invalid results" ratio for Windows is 0.19%. Those are successful tasks that are self-consistent but fail to match another self-consistent task.

Your personal ratio is 0.59% and the ATLAS Cluster (which is also well managed) has a ratio of 0.25%. Just so you can compare.

The comparison of tasks allows for a small difference, so the invalid tasks may come from mathematical differences between the platforms that are larger than we've seen in the past.
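
To illustrate what a comparison that "allows for a small difference" can look like, here is a minimal sketch. It is not the project's validator: the function name, the tolerance value and the sample numbers are all made up for illustration, and the real comparison is not described in this thread.

```python
import math

def results_agree(a, b, rel_tol=1e-5):
    """Return True if two result vectors match within a relative tolerance.

    Hypothetical stand-in for a validator; the real Einstein@Home
    comparison and its tolerance are not described in this thread.
    """
    if len(a) != len(b):
        return False
    return all(math.isclose(x, y, rel_tol=rel_tol) for x, y in zip(a, b))

# Made-up numbers standing in for the same computation on three hosts.
windows_result = [1.000000, 2.500000, 3.750000]
linux_result = [1.000001, 2.500002, 3.750001]    # tiny rounding differences
drifted_result = [1.000100, 2.500000, 3.750000]  # first value differs by 1e-4

print(results_agree(windows_result, linux_result))    # within tolerance
print(results_agree(windows_result, drifted_result))  # outside tolerance
```

With a scheme like this, two results that are each internally consistent can still fail to match once platform-specific rounding pushes a single value past the tolerance.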

I'm going to investigate this a bit more in-depth in January. I think there will be enough such tasks in the database then, so you don't need to collect the task IDs in the meantime.
