Greedy cluster

Nigel Garvey
Joined: 4 Oct 10
Posts: 50
Credit: 16828136
RAC: 77123
Topic 196028

I was intrigued last night to notice that all three of my G5's "pending" results were waiting for the same wingman to report, and that one of its current tasks was a replacement for one aborted by the same wingman a couple of days earlier.

It turns out that the other computer is one of 3792 attached to a single account (although only 376 have been "active in past 30 days"). It — like most of the active ones I sampled from the account — is fetching large numbers of tasks each day for all three searches, but is only returning Binary Radio Pulsar results. Tasks for the other searches are simply timing out or are being aborted after ten to fourteen days.

The account name is "ATLAS AEI Hannover", which also appears to be the name of a computer cluster at Einstein@home's headquarters. It's hard to imagine the project staff would intentionally abuse their own project and its participants in this way — even for BOINC points. So can anyone shed light on it?

NG

Bernd Machenschalk
Moderator
Administrator
Joined: 15 Oct 04
Posts: 4265
Credit: 244922893
RAC: 16808

LSC clusters (which include Atlas) use Einstein@home as 'backfill' or 'bottom feeder' work, i.e. to run on machines (or even single CPUs) when these are not busy with ordinary jobs from the scheduling system.

The problem with that is that it can't be predicted how much work (how many E@H tasks) can be done before the machine is claimed again by an ordinary job, so the BOINC 'work fetch' mechanism is not helpful for this purpose. The local computing power of these machines is pretty high, so even with short work caches they fetch a lot of tasks on each scheduler contact, but they may then be unable to complete these (or even contact the server again) before the deadline because they got assigned other jobs.
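
To make that concrete, here is a rough back-of-the-envelope sketch (Python, with invented numbers; real core counts, cache settings and task run times on Atlas will differ) of how even a small work cache on a fast node turns into a pile of tasks that can never run:

    # Rough illustration of BOINC work fetch on a backfill node.
    # All numbers are hypothetical, not measurements from Atlas.

    cores = 16                   # CPU cores on the node
    cache_days = 0.5             # "maintain enough work for" setting
    task_hours = 6.0             # typical task run time per core

    # The client tries to keep every core busy for cache_days, so on
    # scheduler contact it asks for roughly this many tasks:
    tasks_fetched = cores * (cache_days * 24.0) / task_hours

    # But a backfill slot may be reclaimed by the cluster's own
    # scheduler long before that cache drains:
    backfill_window_hours = 3.0  # time until ordinary jobs return
    tasks_completed = cores * backfill_window_hours / task_hours

    print(f"fetched:   {tasks_fetched:.0f}")    # 32
    print(f"completed: {tasks_completed:.0f}")  # 8
    print(f"stranded:  {tasks_fetched - tasks_completed:.0f}")
    # The 24 stranded tasks sit in the cache until they time out or
    # are aborted at the deadline.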

I have already suggested using 'BOINCLite' (a very simple BOINC client that just downloads, runs and reports a single task of a single project) for these backfill slots, and I hope that's being worked on (configurations considered, developed and tested). But currently all LSC clusters are using the standard BOINC client (and rather old versions at that).
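
For readers unfamiliar with the idea, the 'one task at a time' model boils down to the loop sketched below. This is only an illustrative Python sketch, not BOINCLite's actual code or API; fetch_one_task and report are hypothetical stand-ins for the real scheduler RPCs.

    import subprocess

    def fetch_one_task(project_url: str) -> dict:
        # Hypothetical stand-in for a scheduler request that asks the
        # project for exactly one task (no cache, no work-fetch
        # heuristics).
        raise NotImplementedError

    def report(project_url: str, task: dict, exit_code: int) -> None:
        # Hypothetical stand-in for uploading and reporting a result.
        raise NotImplementedError

    def backfill_loop(project_url: str) -> None:
        # Download, run and report a single task, then repeat.
        # If the cluster scheduler kills this process, at most one
        # task is lost; nothing sits in a cache past its deadline.
        while True:
            task = fetch_one_task(project_url)
            done = subprocess.run(task["command"])
            report(project_url, task, done.returncode)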

BM

Mike Hewson
Moderator
Joined: 1 Dec 05
Posts: 6534
Credit: 284703502
RAC: 110557

Fair question. ATLAS traditionally donates 'burn-in' time on new nodes to E@H WUs. The one mentioned is, I guess, one with quad Teslas (again guessing: it looks like it's mapping CPU to GPU 1:1). As you say, it's certainly not going to be an abuse/points thingy. I'll ask ...

Cheers, Mike.

(edit) Err, no I won't. Thanks, Bernd. :-)

I have made this letter longer than usual because I lack the time to make it shorter ...

... and my other CPU is a Ryzen 5950X :-) Blaise Pascal

Bernd Machenschalk
Moderator
Administrator
Joined: 15 Oct 04
Posts: 4265
Credit: 244922893
RAC: 16808

Quote:
The one mentioned is, I guess, one with quad Teslas (again guessing: it looks like it's mapping CPU to GPU 1:1)

Ok, I think they are testing new (Condor) configurations on these GPU machines. This may make things better or even worse than on the "normal" Atlas nodes.

BM

Nigel Garvey
Joined: 4 Oct 10
Posts: 50
Credit: 16828136
RAC: 77123

Thanks to you both for your replies. I don't really know what comment to make, except to point out that my G5's just picked up yet another Gravitational Wave task with the same wingman!

NG

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5842
Credit: 109386586808
RAC: 35930143

Quote:
... my G5's just picked up yet another Gravitational Wave task with the same wingman!


And it's very likely to continue doing that because of locality scheduling. There will be a number of hosts sharing the same frequency set, so if you took the trouble to look through all your tasks you would see other members of the same set of wingmen. You could change this by forcing a change to a different frequency set, but because there are so many Atlas nodes, you are just as likely as not to bump into different nodes on different frequency data showing the same frustrating behaviour.
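
For anyone curious, the gist of locality scheduling can be sketched like this (Python, heavily simplified; the real Einstein@home scheduler is far more involved, and all names here are invented). The server prefers to hand a host tasks for the big data files it has already downloaded, which is why the same small pool of wingmen keeps reappearing:

    # Simplified sketch of locality scheduling: tasks are grouped by
    # the large data file (frequency set) they need, and the server
    # prefers files a host has already downloaded.

    from collections import defaultdict

    tasks_by_file = {
        "h1_0450.10": ["task_a", "task_b", "task_c"],
        "h1_0450.15": ["task_d", "task_e"],
    }

    host_files = defaultdict(set)  # data files each host holds
    host_files["my_g5"].add("h1_0450.10")
    host_files["atlas_node"].add("h1_0450.10")

    def pick_task(host: str) -> str | None:
        # First choice: a task whose data the host already holds.
        for f in host_files[host]:
            if tasks_by_file.get(f):
                return tasks_by_file[f].pop()
        # Otherwise assign a new file (a big download) to the host.
        for f, pending in tasks_by_file.items():
            if pending:
                host_files[host].add(f)
                return pending.pop()
        return None

    # Both hosts hold h1_0450.10, so they keep drawing tasks, and
    # hence wingman pairings, from the same frequency set.
    print(pick_task("my_g5"), pick_task("atlas_node"))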

The simple solution is not to let it bother you :-). Tasks that time out or get aborted will be completed by other wingmen in your group eventually and all will be resolved in time.

Cheers,
Gary.

Nigel Garvey
Joined: 4 Oct 10
Posts: 50
Credit: 16828136
RAC: 77123

Hi, Gary.

Thanks for the additional insights.

Quote:
The simple solution is not to let it bother you :-). Tasks that time out or get aborted will be completed by other wingmen in your group eventually and all will be resolved in time.

While it's unquestionably an abusive situation, whether intended or not, to have the project's own computers grabbing fistfuls of tasks every day and simply sitting on two of the three searches for a couple of weeks, I'm not personally upset. I know everything will be resolved "in time" and that I'm not the only computer-time donor affected. In fact, at the moment, I'm quite enjoying the grossness of the situation. :) The wingman for the task my G5 received yesterday was yet again Computer 4231758, and the task received this morning is a replacement for one just aborted by that machine.

NG

Bernd Machenschalk
Moderator
Administrator
Joined: 15 Oct 04
Posts: 4265
Credit: 244922893
RAC: 16808

I just wanted to let you know that this issue is currently being discussed and worked on internally.

BM

DanNeely
Joined: 4 Sep 05
Posts: 1364
Credit: 3562358667
RAC: 1580

Quote:
I have already suggested using 'BOINCLite' (a very simple BOINC client that just downloads, runs and reports a single task of a single project) for these backfill slots, and I hope that's being worked on (configurations considered, developed and tested). But currently all LSC clusters are using the standard BOINC client (and rather old versions at that).

BM

Do you really need something like this? Wouldn't just setting the client's connection settings to "Computer is connected to the Internet about every: 0.001 days" and "Maintain enough work for an additional: 0.001 days" limit the client to only one task per core at a time?
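
For reference, those two settings live in the client's global_prefs_override.xml as work_buf_min_days and work_buf_additional_days. A minimal sketch of pinning them down, assuming a standard client installation where the override file sits in the BOINC data directory:

    # Sketch: force a near-zero work cache by writing a preferences
    # override file. Tag names are from the standard BOINC client's
    # override file; the path depends on the installation.

    OVERRIDE = """<global_preferences>
       <work_buf_min_days>0.001</work_buf_min_days>
       <work_buf_additional_days>0.001</work_buf_additional_days>
    </global_preferences>
    """

    with open("global_prefs_override.xml", "w") as f:
        f.write(OVERRIDE)

    # The client re-reads this on restart, or on
    # "boinccmd --read_global_prefs_override".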

Christoph
Joined: 25 Aug 05
Posts: 41
Credit: 5954206
RAC: 0

I want to point you to a discussion over at LHC@home1 / SIXTRACK: [url=http://lhcathomeclassic.cern.ch/sixtrack/forum_thread.php?id=3374]Long delay in jobs[/url].
Depending on the version of your BOINC server, there is an 'accelerating retries' mechanism and a 'trusted host' mechanism.
Maybe you are using them already, but if so, the ATLAS nodes should be marked 'untrusted', since they trash work on a regular basis, and they would then not get any resends. Regular work would still flow.

The mechanism is described in the fourth post of that thread. Would this be useful for you?
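
For what it's worth, the gating logic that mechanism applies amounts to something like the sketch below (Python; the thresholds are invented, and the real BOINC server derives reliability from per-host error-rate and turnaround statistics):

    # Sketch of the 'reliable host' idea behind accelerated retries:
    # resends of timed-out or aborted tasks only go to hosts with a
    # low recent error rate and fast turnaround; first-issue work
    # still flows to everyone. Thresholds below are invented.

    from dataclasses import dataclass

    @dataclass
    class Host:
        name: str
        error_rate: float           # fraction of recent tasks failed
        avg_turnaround_days: float  # average send-to-report time

    MAX_ERROR_RATE = 0.05
    MAX_TURNAROUND_DAYS = 4.0

    def may_receive_resend(host: Host) -> bool:
        return (host.error_rate <= MAX_ERROR_RATE
                and host.avg_turnaround_days <= MAX_TURNAROUND_DAYS)

    atlas = Host("atlas_node", error_rate=0.60, avg_turnaround_days=12.0)
    g5 = Host("nigels_g5", error_rate=0.01, avg_turnaround_days=2.0)
    print(may_receive_resend(atlas), may_receive_resend(g5))  # False True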

Greetings, Christoph

Bernd Machenschalk
Moderator
Administrator
Joined: 15 Oct 04
Posts: 4265
Credit: 244922893
RAC: 16808

We're not using "Accelerating

We're not using "Accelerating retries" and currently are not considering the reliability of hosts. The primary criterion for assigning jobs to a host is to minimize the download volume.

BM
