Sudden lurch in remaining work display



Message boards : Cruncher's Corner : Sudden lurch in remaining work display

archae86
Joined: Dec 6 05
Posts: 569
ID: 139940
Credit: 5,757,826
RAC: 9,250
Message 79937 - Posted 23 Jan 2008 17:39:36 UTC

The S5R3 search progress pane on the server status page suddenly changed from saying we had well over 300 days of work to go to claiming only 6.2 days.

Bug, or news item?
____________

Profile Bernd Machenschalk
Forum moderator
Project developer
Joined: Oct 15 04
Posts: 2033
ID: 2
Credit: 21,971,104
RAC: 41,805
Message 79939 - Posted 23 Jan 2008 17:47:45 UTC

Actually news, but I haven't gotten around to writing up the details yet. We found we had to break the current run into two parts at 800Hz frequency. The display shows the work remaining below 800Hz. We'll have the upper half set up in a few days.

BM

th3_1rzt
Joined: Aug 24 06
Posts: 208
ID: 210060
Credit: 1,948,393
RAC: 6,210
Message 79941 - Posted 23 Jan 2008 18:11:40 UTC

I noticed that all the WUs I had above 800 give way too much credit. Will future WUs in that range give lower credit?

Profile Bernd Machenschalk
Forum moderator
Project developer
Joined: Oct 15 04
Posts: 2033
ID: 2
Credit: 21,971,104
RAC: 41,805
Message 79943 - Posted 23 Jan 2008 18:17:50 UTC - in response to Message 79941.
Last modified: 23 Jan 2008 18:18:04 UTC

I noticed that all the WUs I had above 800 give way too much credit. Will future WUs in that range give lower credit?

An unexpected side effect of the problems we have found with the >=800Hz WUs is that they run shorter than intended. The new ones will get the same credit, but run noticeably longer.

BM

DanNeely
Joined: Sep 4 05
Posts: 780
ID: 106636
Credit: 4,560,479
RAC: 9,003
Message 79970 - Posted 24 Jan 2008 1:19:53 UTC - in response to Message 79943.

I noticed that all the WUs I had above 800 give way too much credit. Will future WUs in that range give lower credit?

An unexpected side effect of the problems we have found with the >=800Hz WUs is that they run shorter than intended. The new ones will get the same credit, but run noticeably longer.

BM


Would it be possible to keep the runtime unchanged and adjust the credit instead? This would reduce a lot of the grumbling from people with older machines that aren't on 24/7.
____________

Brian Silvers
Joined: Aug 26 05
Posts: 782
ID: 103927
Credit: 282,700
RAC: 0
Message 79971 - Posted 24 Jan 2008 1:25:11 UTC - in response to Message 79970.

I noticed that all the WUs I had above 800 give way too much credit. Will future WUs in that range give lower credit?

An unexpected side effect of the problems we have found with the >=800Hz WUs is that they run shorter than intended. The new ones will get the same credit, but run noticeably longer.

BM


Would it be possible to keep the runtime unchanged and adjust the credit instead? This would reduce a lot of the grumbling from people with older machines that aren't on 24/7.


If the official Windows app becomes 4.26, there may be enough of a speed boost to help the GUM (Great Unwashed Masses). If not, then boosting deadlines up to 16-18 days until SSE can be implemented in the Windows app may also help...
____________

Profile Donald A. Tevault
Joined: Feb 17 06
Posts: 308
ID: 173034
Credit: 9,455,962
RAC: 25,839
Message 79972 - Posted 24 Jan 2008 1:32:36 UTC - in response to Message 79971.

I noticed that all the WUs I had above 800 give way too much credit. Will future WUs in that range give lower credit?

An unexpected side effect of the problems we have found with the >=800Hz WUs is that they run shorter than intended. The new ones will get the same credit, but run noticeably longer.

BM


Would it be possible to keep the runtime unchanged and adjust the credit instead? This would reduce a lot of the grumbling from people with older machines that aren't on 24/7.


If the official Windows app becomes 4.26, there may be enough of a speed boost to help the GUM (Great Unwashed Masses). If not, then boosting deadlines up to 16-18 days until SSE can be implemented in the Windows app may also help...


Hmmm. . .

I don't know. If I understand Bernd correctly, it sounds like these >= 800Hz workunits don't run long enough to complete all of the needed calculations. Thus, the need to create new workunits with longer runtimes.
____________

Brian Silvers
Joined: Aug 26 05
Posts: 782
ID: 103927
Credit: 282,700
RAC: 0
Message 79974 - Posted 24 Jan 2008 1:40:45 UTC - in response to Message 79972.

I noticed that all the WUs I had above 800 give way too much credit. Will future WUs in that range give lower credit?

An unexpected side effect of the problems we have found with the >=800Hz WUs is that they run shorter than intended. The new ones will get the same credit, but run noticeably longer.

BM


Would it be possible to keep the runtime unchanged and adjust the credit instead? This would reduce a lot of the grumbling from people with older machines that aren't on 24/7.


If the official Windows app becomes 4.26, there may be enough of a speed boost to help the GUM (Great Unwashed Masses). If not, then boosting deadlines up to 16-18 days until SSE can be implemented in the Windows app may also help...


Hmmm. . .

I don't know. If I understand Bernd correctly, it sounds like these >= 800Hz workunits don't run long enough to complete all of the needed calculations. Thus, the need to create new workunits with longer runtimes.


Yes, and what I stated does depend on the runtime staying consistent between the < and the >=. If >= 800 takes longer than < 800, then there is definitely going to be some need for panic...

____________

DanNeely
Joined: Sep 4 05
Posts: 780
ID: 106636
Credit: 4,560,479
RAC: 9,003
Message 80002 - Posted 24 Jan 2008 11:43:43 UTC - in response to Message 79972.


I don't know. If I understand Bernd correctly, it sounds like these >= 800Hz workunits don't run long enough to complete all of the needed calculations. Thus, the need to create new workunits with longer runtimes.


Depends on what Bernd meant. The way I read it was that the WUs were completing all the work they needed to do in significantly less time than was expected.
____________

Profile Donald A. Tevault
Joined: Feb 17 06
Posts: 308
ID: 173034
Credit: 9,455,962
RAC: 25,839
Message 80007 - Posted 24 Jan 2008 12:42:58 UTC - in response to Message 80002.


I don't know. If I understand Bernd correctly, it sounds like these >= 800Hz workunits don't run long enough to complete all of the needed calculations. Thus, the need to create new workunits with longer runtimes.


Depends on what Bernd meant. The way I read it was that the WUs were completing all the work they needed to do in significantly less time than was expected.



If that's the case, then I don't understand what the problem is.

Hopefully, we'll get some more amplifying info on this later.
____________

Brian Silvers
Joined: Aug 26 05
Posts: 782
ID: 103927
Credit: 282,700
RAC: 0
Message 80010 - Posted 24 Jan 2008 13:09:50 UTC - in response to Message 80007.


I don't know. If I understand Bernd correctly, it sounds like these >= 800Hz workunits don't run long enough to complete all of the needed calculations. Thus, the need to create new workunits with longer runtimes.


Depends on what Bernd meant. The way I read it was that the WUs were completing all the work they needed to do in significantly less time than was expected.



If that's the case, then I don't understand what the problem is.

Hopefully, we'll get some more amplifying info on this later.


The way I translated it, the workunits ran much faster than anticipated. What isn't stated is why they ran faster than anticipated. Another related message here was about how tasks at the 799.xx frequency were erroring out immediately...

I unno... I've asked multiple times about deadline extensions. I was considering not asking again based upon the increase in speed by Windows 4.26. Will have to wait and see...
____________

Profile Donald A. Tevault
Joined: Feb 17 06
Posts: 308
ID: 173034
Credit: 9,455,962
RAC: 25,839
Message 80024 - Posted 24 Jan 2008 13:54:21 UTC

I've finally received a pair of these >= 800Hz jobs. They completed in about 76,000 seconds, far less than the 110,000 - 120,000 seconds that would be normal for this machine. So, there's definitely something strange here.

Dual Pentium III 866
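
For readers who want to put a number on that gap, here is a minimal sketch of the arithmetic, assuming the credit per task stays fixed while the runtime shrinks. The function name `credit_rate_multiplier` is purely illustrative and not part of BOINC or the Einstein@Home apps; the runtimes are the ones reported above.

```python
def credit_rate_multiplier(normal_seconds: float, observed_seconds: float) -> float:
    """Factor by which credit per hour rises when a fixed-credit task finishes early."""
    if normal_seconds <= 0 or observed_seconds <= 0:
        raise ValueError("runtimes must be positive")
    return normal_seconds / observed_seconds


# Runtimes reported for the Dual Pentium III 866 (seconds).
observed = 76_000
low = credit_rate_multiplier(110_000, observed)   # lower end of the normal range
high = credit_rate_multiplier(120_000, observed)  # upper end of the normal range

print(f"credit rate is roughly {low:.2f}x to {high:.2f}x the usual one")
```

Under that assumption, the short tasks on this host would earn credit somewhere around one and a half times faster than usual.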
____________

Brian Silvers
Joined: Aug 26 05
Posts: 782
ID: 103927
Credit: 282,700
RAC: 0
Message 80027 - Posted 24 Jan 2008 14:20:33 UTC - in response to Message 80024.

I've finally received a pair of these >= 800Hz jobs. They completed in about 76,000 seconds, far less than the 110,000 - 120,000 seconds that would be normal for this machine. So, there's definitely something strange here.

Dual Pentium III 866


My timing always sucks... I am only up to 779... :-(
____________

Profile Astro
Joined: Jan 18 05
Posts: 257
ID: 3237
Credit: 1,000,560
RAC: 0
Message 80029 - Posted 24 Jan 2008 14:34:46 UTC - in response to Message 80027.
Last modified: 24 Jan 2008 14:38:58 UTC

I've finally received a pair of these >= 800Hz jobs. They completed in about 76,000 seconds, far less than the 110,000 - 120,000 seconds that would be normal for this machine. So, there's definitely something strange here.

Dual Pentium III 866


My timing always sucks... I am only up to 779... :-(

Brian, here's a look at my Mobile AMD64 3700 laptop's WUs using Windows and the work done so far:

Brian Silvers
Joined: Aug 26 05
Posts: 782
ID: 103927
Credit: 282,700
RAC: 0
Message 80031 - Posted 24 Jan 2008 15:20:17 UTC - in response to Message 80029.

I've finally received a pair of these >= 800Hz jobs. They completed in about 76,000 seconds, far less than the 110,000 - 120,000 seconds that would be normal for this machine. So, there's definitely something strange here.

Dual Pentium III 866


My timing always sucks... I am only up to 779... :-(

Brian, here's a look at my Mobile AMD64 3700 laptop's WUs using Windows and the work done so far:



Yeah yeah... rub it in... You got the credit boost from going above 799 and then the performance boost by going to 4.26... :-P on you too...

____________

Profile Astro
Joined: Jan 18 05
Posts: 257
ID: 3237
Credit: 1,000,560
RAC: 0
Message 80032 - Posted 24 Jan 2008 15:42:00 UTC
Last modified: 24 Jan 2008 16:14:37 UTC

Well, to be honest, I hadn't looked at the credits for the 800's until you mentioned it.

Calculating Credit/hour with the benchmark means it should get 14.16/hour.

After 407 Rosetta WUs (recent app), this host got 12.79/hour avg.
After 125 stock 5.27 Seti WUs it got 15.21/hour avg.
With four 4.15 WUs <800 it got 16.225/hour, but with >800 and 4.15 it got 28.65/hour. And with >800 AND 4.26 it yields 33.1125/hour. WOW

OK, now in fairness/full disclosure: The other project I've recently run was Boinc Simap, and this host is getting an avg. of 22.45/hour there.
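
To make the comparison easier to see, the figures quoted in this post can be expressed relative to the benchmark-derived rate. This is only a sketch using the numbers above; the dictionary labels and the helper `relative_to_benchmark` are made up for illustration.

```python
BENCHMARK_RATE = 14.16  # credit/hour implied by the host benchmark, as quoted above

# Observed averages from the post (credit/hour); labels are illustrative.
observed_rates = {
    "Rosetta (recent app)": 12.79,
    "Seti stock 5.27": 15.21,
    "Einstein <800, app 4.15": 16.225,
    "Einstein >800, app 4.15": 28.65,
    "Einstein >800, app 4.26": 33.1125,
    "Boinc Simap": 22.45,
}


def relative_to_benchmark(rate: float, benchmark: float = BENCHMARK_RATE) -> float:
    """Return an observed credit rate as a multiple of the benchmark rate."""
    return rate / benchmark


ratios = {label: relative_to_benchmark(rate) for label, rate in observed_rates.items()}

# Print the projects from lowest to highest relative rate.
for label, ratio in sorted(ratios.items(), key=lambda item: item[1]):
    print(f"{label:<26} {ratio:.2f}x benchmark")

# Jump in credit rate when crossing 800 with the same app version (4.15).
jump = observed_rates["Einstein >800, app 4.15"] / observed_rates["Einstein <800, app 4.15"]
print(f"crossing 800 with app 4.15: {jump:.2f}x")
```

The last line isolates the effect of the short-running tasks from the speed-up of the 4.26 app, since both rates in that ratio come from the same app version.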

Profile Bernd Machenschalk
Forum moderator
Project developer
Joined: Oct 15 04
Posts: 2033
ID: 2
Credit: 21,971,104
RAC: 41,805
Message 80033 - Posted 24 Jan 2008 15:43:22 UTC
Last modified: 24 Jan 2008 15:48:00 UTC

The data files currently on Einstein@home of 800Hz and above (h1_0800.0_S5R2* / l1_0800.0_S5R2*) are wrong. While we are generating the correct ones, we stopped generating workunits for 800Hz and above.

We intend to let some thousand WUs that point to the wrong files and are already in the database simply run out. The ones on the boundary (that have 0799.5 files as well as 0800.0) will error out right at the beginning when trying to read the files ("error in SFT sequence"), with no CPU time wasted. The ones above 800Hz that are already in the database will run shorter than the assigned credit would suggest, because the run-time, and thus the credit, was estimated based on correct data files. If we simply cancelled these workunits, people who have already completed such a task would get no credit at all for it, so I decided to err on the side of generosity and let them run.

The current WU generator will only generate new WUs below 800Hz. There are ~300,000 left to be generated, which should be work for the project for about a week in total. During that time we will generate correct data files and set up a new WU generator for the work of 800Hz and above.
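
As a back-of-the-envelope check on those figures, the implied project throughput can be sketched as follows. This assumes "about a week" means seven days and that the ~300,000 count is exact, neither of which the post states precisely.

```python
remaining_workunits = 300_000  # approximate count quoted above
days_of_work = 7               # assumption: "about a week" taken as 7 days

workunits_per_day = remaining_workunits / days_of_work
workunits_per_hour = workunits_per_day / 24

print(f"~{workunits_per_day:,.0f} workunits/day, ~{workunits_per_hour:,.0f} workunits/hour")
```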

So the second half run of S5R3 (currently internally called S5R3b) should start early next week. The new Tasks will run as long as estimated and thus will get the same credit we currently give to the ones with the same base-frequency (but wrong data files).

Brian, we are considering your proposal to extend the deadline for these new WUs.

BM

Mats Nilsson
Joined: Dec 10 05
Posts: 88
ID: 143528
Credit: 74,377
RAC: 52
Message 80034 - Posted 24 Jan 2008 15:49:57 UTC

Will the current app handle these new WUs?
____________

Profile Bernd Machenschalk
Forum moderator
Project developer
Joined: Oct 15 04
Posts: 2033
ID: 2
Credit: 21,971,104
RAC: 41,805
Message 80035 - Posted 24 Jan 2008 15:53:37 UTC - in response to Message 80034.

Will the current app handle these new WUs?

The new workunits will reference the same Apps. No change there.

BM

Brian Silvers
Joined: Aug 26 05
Posts: 782
ID: 103927
Credit: 282,700
RAC: 0
Message 80036 - Posted 24 Jan 2008 16:26:40 UTC - in response to Message 80033.


Brian, we are considering your proposal to extend the deadline for these new WUs.


Thanks... The speed increase from 4.26 is definitely appreciated and would probably reduce the incidence of tasks missing deadline by only a couple of days as it appears to be 10-20% faster, depending on hardware. I guess it all will depend on how long the new results take...

Anyway, as for the boundary tasks, do you know if all of those have already been distributed? Since they fail very quickly, any host that gets them will likely be driven down to only 1/day quota...

____________

Profile Bernd Machenschalk
Forum moderator
Project developer
Joined: Oct 15 04
Posts: 2033
ID: 2
Credit: 21,971,104
RAC: 41,805
Message 80084 - Posted 25 Jan 2008 10:37:15 UTC - in response to Message 80036.

Anyway, as for the boundary tasks, do you know if all of those have already been distributed? Since they fail very quickly, any host that gets them will likely be driven down to only 1/day quota...

You're right, I cancelled the workunits, which means that no new tasks should be generated for them. For the few dozen tasks that have already been generated for these in the DB I'm afraid I won't be able to do anything (without risking DB inconsistencies).

BM

Profile Gary Roberts
Forum moderator
Joined: Feb 9 05
Posts: 2068
ID: 12521
Credit: 57,352,127
RAC: 174,818
Message 80166 - Posted 26 Jan 2008 21:32:13 UTC

The Server Status page has been showing around 4 days or so remaining work (I didn't pay attention to the precise figures) but this morning it seems to be above 5 days which suggests that incremental additions to the stock remaining are possibly being made. Perhaps testing of small numbers of new +800 tasks??

Anyone been paying better attention to the figures and perhaps can confirm this?


____________
Cheers,
Gary.

Profile Bernd Machenschalk
Forum moderator
Project developer
Joined: Oct 15 04
Posts: 2033
ID: 2
Credit: 21,971,104
RAC: 41,805
Message 80239 - Posted 28 Jan 2008 15:53:21 UTC

Update: We started to "drain" the current S5R3a workunit generator, i.e. it will generate all the workunits below ~799Hz that have not yet been generated, put them into the database and then terminate.

You should see a fast decrease in the "Work remaining", a fast increase in the "Workunits in database" and "Workunits with no canonical result" values on the Server status page, and finally find the "Einstein S5R3 generator" "Not running" any more.

This will allow us to start the new "S5R3b" workunit generator some time tomorrow. Hopefully everything goes smoothly enough that people not reading the forums won't even notice that there is a transition. The Apps (Beta-, power- and standard Apps) will stay the same, so there is no need to do anything there.

BM

Profile Bernd Machenschalk
Forum moderator
Project developer
Joined: Oct 15 04
Posts: 2033
ID: 2
Credit: 21,971,104
RAC: 41,805
Message 80356 - Posted 31 Jan 2008 10:18:23 UTC

We started to send out the first (few hundred) "upper-half" S5R3 tasks for testing. For the curious: The task names end in "S5R3b", and the data files in "S5R3". The Delay Bound ("deadline") has been increased to 18 days.

BM

Profile Richard Johnson
Joined: Feb 9 05
Posts: 5
ID: 12630
Credit: 18,792
RAC: 0
Message 82268 - Posted 7 Mar 2008 12:14:14 UTC

With respect to the 800 Hz, given the creditals of the format for receiving credits for WUs. I had believed that in the interim, computer that runs works under the assumption of half time. Meaning that if the work takes longer the credit should not be any different than if it took a short time. I believe that the wear and tear on the cpu in which its life expectancy diminishes. For the lack of a better word. The cpu dies out due to the excessive work load that it endures through processing data.
____________

Profile Bikeman
Forum moderator
Volunteer developer
Joined: Aug 28 06
Posts: 2056
ID: 210833
Credit: 5,079,839
RAC: 9,661
Message 82273 - Posted 7 Mar 2008 16:43:28 UTC - in response to Message 82268.

With respect to the 800 Hz, given the creditals of the format for receiving credits for WUs. I had believed that in the interim, computer that runs works under the assumption of half time. Meaning that if the work takes longer the credit should not be any different than if it took a short time. I believe that the wear and tear on the cpu in which its life expectancy diminishes. For the lack of a better word. The cpu dies out due to the excessive work load that it endures through processing data.


Hi!

The fact that the old near 800 Hz units ran twice as fast was a special effect, so it was not reflected in the credits for WUs. Now the runtime is back to "normal" and all should be fine.

As to wear and tear of the CPU: I would not be concerned about this, CPUs are designed to run at max load for years and years (if they run within the specified limits, overclocking is a different story, of course). Most components are stressed more when switching the system on and off, so 24/7 operation or a continuous high-load operation should not lower the lifetime of a CPU below the time you usually expect to have a CPU in operation (few of us use a CPU built 10 years ago, and we won't use our current CPUs in 10 years).

There are some moving parts like fans and disk drives that might show a lower life expectancy from BOINC, tho.

From an economic point of view, the cost of wear and tear should be insignificant anyway compared to the additional energy costs, so wear and tear is a non-issue, I guess.

CU
Bikeman

____________

archae86
Joined: Dec 6 05
Posts: 569
ID: 139940
Credit: 5,757,826
RAC: 9,250
Message 82276 - Posted 7 Mar 2008 17:15:59 UTC - in response to Message 82273.

As to wear and tear of the CPU: I would not be concerned about this, CPUs are designed to run at max load for years and years (if they run within the specified limits, overclocking is a different story, of course). Most components are stressed more when switching the system on and off, so 24/7 operation or a continuous high-load operation should not lower the lifetime of a CPU below the time you usually expect to have a CPU in operation (few of us use a CPU built 10 years ago, and we won't use our current CPUs in 10 years).

I was a reliability guy for a major semiconductor manufacturer for four years around 1990, and have had some further contact on this subject since. Your advice to users here differs from my understanding of the matter.

Actually, these days the in-service reliability goal is set for the distribution of expected operating conditions, not on the worst-case assumption that all in-service parts see worst-case conditions. If everybody re-wrote their flash card at the maximum feasible rate, many more would fail far sooner than the goals. If everybody operated their CPU at 100% utilization with poor cooling, the fleet CPU failure rate would be much higher than the requirement.

Hotter is worse, and higher voltage is worse, though the degree to which these two things hurt your chances varies with mechanism.

Thermal cycling of a CPU to the degree presented by switching a system off and on is an utterly negligible stress. There have been cases (usually involving thin-film compatibility issues) of component/package combinations with appreciable thermal cycling failure rates stemming from delamination, but the accumulated harm varies as a quite high power of the cycling range (something like sixth power, if I recall a paper my colleague Rich Blish presented on the subject), and the range for desk-top CPUs is just not much.
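
As a rough illustration of why a steep power law makes a small cycling range so benign, suppose the accumulated harm per cycle $D$ scales with the cycling range $\Delta T$ raised to an exponent $n$. The value $n \approx 6$ is only the recollected figure mentioned above, not a verified constant, and the halving example is hypothetical.

```latex
\[
  D \propto (\Delta T)^{n}, \qquad n \approx 6
  \quad\Longrightarrow\quad
  \frac{D_1}{D_2} = \left(\frac{\Delta T_1}{\Delta T_2}\right)^{6}
\]
\[
  \Delta T_1 = \tfrac{1}{2}\,\Delta T_2
  \quad\Longrightarrow\quad
  \frac{D_1}{D_2} = \left(\tfrac{1}{2}\right)^{6} = \frac{1}{64}
\]
```

Under that assumption, a part cycled over half the temperature range would accumulate only about 1/64 of the harm per cycle.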

In summary, yes, you are raising the probability of failure of your CPU at any given moment (including the first minute after start) by running BOINC applications in time which would otherwise be idle. You are raising it further if you increase power consumption and temperature by overclocking. You are raising it further if you raise the CPU voltage. You are raising it further if you operate the PC during hours in which you would have shut it down.

I do agree that there are probably components of the PC which don't like the system being powered up and down, but CPU failure probability is not likely in that category.

So after all that negativism, let me switch sides and say:
1. You are far more likely to have your system fail from fouled-up software than from any hardware problem.
2. Among hardware problems, last time I saw the data, hard drive failures and monitor failures are considerably more common than failures in the CPU/motherboard parts of the system.
3. So if you don't carry overvoltage overclocking to extremes, and assure decent cooling, I don't think your extra risk is troublingly high.

____________

Profile nevermore
Joined: Feb 14 06
Posts: 2719
ID: 171869
Credit: 1,388,406
RAC: 0
Message 82328 - Posted 9 Mar 2008 1:36:52 UTC - in response to Message 82276.
Last modified: 9 Mar 2008 1:38:28 UTC


snip
2. Among hardware problems, last time I saw the data, hard drive failures and monitor failures are considerably more common than failures in the CPU/motherboard parts of the system.
/snip


I cannot comment on the monitor bit (I shut mine off between sessions usually) but I can attest to the HDD failures/BOINC issue. I replaced one a few months ago that had been in use for about 4 years (crashed hard) and in another box I have one (2.5 years) that is showing the Spin Retry Count to be out of its acceptable threshold (debating whether to do a HD swap between boxes because the oldest one is going to the kids shortly). It is possible that both are a case of "averages" but none of my previous computers burned through their HDDs in their lifetimes (5+ years with no BOINC). IMO, the cost of an equivalent HDD is quite reasonable, though lost data, if unrecoverable from a crash, can be a bad thing (repeat mantra "backup data, backup data")
____________


This material is based upon work supported by the National Science Foundation (NSF) under Grant NSF-0200852 and by the Max Planck Gesellschaft (MPG). Any opinions, findings, and conclusions or recommendations expressed in this material are those of the investigators and do not necessarily reflect the views of the NSF or the MPG.

Copyright © 2009 Bruce Allen for the LIGO Scientific Collaboration