Gary Roberts Forum moderator
Joined: Feb 9 05 Posts: 2068 ID: 12521 Credit: 57,351,877 RAC: 174,811
You would have to have been totally not paying attention to have missed the fact that the servers are having issues at the moment. However if you think about the error messages you are receiving and look at how your own machines are behaving you should be able to work out for yourself a few important details. This might prevent you from taking some rather silly actions or making some rather silly statements in your frustration.
Yes, no doubt everyone is frustrated to some degree but if you think calmly about what is going on you are much less likely to give yourself a heart attack :).
Firstly, you've all seen your own client's messages and the large volume of identical stuff that people insist on posting as well. They all indicate a server problem and not a client problem. In other words there is nothing you can do to your client that is going to change things. So things like detaching, resetting, uninstalling, manually updating ad infinitum, etc are essentially a complete waste of time.
One of the things that perplexes a lot of people is "why do some machines/users seem to be largely unaffected and other machines just can't get action going at all?" I believe the reason for this is linked to whether or not a machine needs new large data file(s). I have many machines that don't need new large data files at the moment and so they are doing just as Pooh Bear has mentioned a couple of times, ie downloading and uploading results without problems. I have other machines that do not get any new work. I believe that this is because they need some form of database lookup to decide on a new large data file and that something of this nature is failing and so - no more work.
Secondly, many people are complaining that they can't upload results. If you are worried about this, here is what I have worked out with a little bit of experimenting. On the basis that the problems are connected in some way with downloading new large data files, I decided to break the connection between downloading and uploading so that they are not both being attempted (and both failing) at the same time. All I did was set "No new tasks" on an affected machine, and then "update" the EAH project on that machine. BOINC then tries the upload only without the request for new work. This seems to succeed in about 100% of the cases. After clearing the stuck uploads, I simply re-enable work requests. I still don't get work but at least dozens of uploads are successfully reported, with quite a few examples of "ALREADY Reported" messages too :).
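For anyone who prefers the command line to the BOINC Manager buttons, the same sequence can be sketched roughly as follows. This is only an illustration, not an official procedure: it assumes the `boinccmd` tool is installed and on your path, and the project URL shown is an assumption, so substitute whatever URL your own client is attached to. By default it only prints the commands rather than running them.

```python
import subprocess

# Assumed project URL (hypothetical default); use the URL your client shows.
PROJECT_URL = "http://einstein.phys.uwm.edu/"


def upload_only_commands(project_url):
    """Commands for the first half of the workaround, in order."""
    return [
        # Equivalent of clicking "No new tasks" for the project.
        ["boinccmd", "--project", project_url, "nomorework"],
        # Equivalent of clicking "Update": contact the server without asking for work.
        ["boinccmd", "--project", project_url, "update"],
    ]


def reenable_command(project_url):
    """Command to allow work requests again, once the stuck uploads have cleared."""
    return ["boinccmd", "--project", project_url, "allowmorework"]


def run(commands, dry_run=True):
    """Print each command, or execute it when dry_run is False."""
    for cmd in commands:
        if dry_run:
            print(" ".join(cmd))
        else:
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    run(upload_only_commands(PROJECT_URL))
    # Later, after checking that the uploads went through:
    # run([reenable_command(PROJECT_URL)], dry_run=False)
```

The re-enable step is kept separate on purpose, since it should only be done after the stuck uploads have actually been reported.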
Thirdly, a few people are complaining quite vocally about a lack of information. Statements about "just a line or two" being needed, or "the developers need to wake up" or "worst project for communication" or "the servers must be hacked" or "server status all green - what rubbish", etc, tend to fly about from time to time. Here are my thoughts on this.
If "Just a line or two would suffice" then simply read what the server tells you each time a transaction fails. The messages are actually quite informative if you think about them. Oh, I see, you really meant a page or three giving much fuller "blow by blow" descriptions of what is happening all the time. I would have thought it would be pretty obvious that this problem, whatever it is, is quite intractable and until all the facets of it have been fully investigated it's just about impossible to give you a deep and meaningful report without wasting a lot more time and perhaps indulging in speculation about possible scenarios which ultimately turn out to be wrong anyway. If you start trying to give short "update" reports, you can bet your bottom dollar that someone will start whingeing for the next one shortly after the previous one was given. The staff resource to manage this rather nasty ongoing situation is quite small and should be left alone to get the job done.
Much has been commented about the server status page. Here are my thoughts on this. Green means that the hardware is powered and that at least one process of the type indicated is running. Take validation for example. There are two different types of results (S5RI and the old S5R1) so two different programs are needed. Depending on how many results are being returned, multiple instances of the validator program may be needed to handle the load. Of course, each extra instance of the validator program chews up more RAM and more cycles and increases server load. I would think it would be quite feasible in an overstressed server environment, to temporarily shut down the bulk of the running validators to give cycles to other more needy parts of the system. If there is just one validator instance still running (but unable to keep up) the status will still be green but you will see a growing backlog of results to be validated. So what!!! You have to at least give the Devs some credit for trying to juggle things for the better performance of the system as a whole.
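To make the "green but falling behind" point concrete, here is a toy model. It is purely a sketch under my own assumptions: the function names, the rates and the idea that anything other than green is "red" are all hypothetical and have nothing to do with the real server code. The indicator only asks whether at least one validator instance is running, while the backlog depends on whether the running instances can keep up with the results coming in.

```python
def status_colour(running_instances):
    """Green if at least one instance is running; 'red' otherwise (assumed label)."""
    return "green" if running_instances >= 1 else "red"


def simulate_backlog(arrivals_per_hour, per_instance_rate, instances, hours, backlog=0):
    """Return the backlog of unvalidated results at the end of each hour."""
    history = []
    for _ in range(hours):
        capacity = per_instance_rate * instances
        # The backlog grows by whatever arrives and shrinks by what gets validated.
        backlog = max(0, backlog + arrivals_per_hour - capacity)
        history.append(backlog)
    return history


if __name__ == "__main__":
    # One instance left running: status stays green while the backlog grows.
    print(status_colour(1), simulate_backlog(100, 30, 1, 3))
    # Four instances: same green status, but the backlog stays at zero.
    print(status_colour(4), simulate_backlog(100, 30, 4, 3))
```

Both cases show green, which is why the status page colour on its own says very little about how far behind validation is.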
As a final comment to some people, please don't keep starting new threads with essentially the same complaint in perhaps a slightly different guise. We all know you are frustrated and we all support your right to express that frustration. But please not in umpteen different threads with pretty much the same whinge over and over again. That creates its own level of frustration in others.
____________
Cheers,
Gary.
ID: 64107 |
Richard Haselgrove
Joined: Dec 10 05 Posts: 579 ID: 144054 Credit: 2,965,572 RAC: 2,400
Well said. This deserves to be stickied, or even given FAQ status.
ID: 64111 |
Vladimir Zarkov
Joined: Feb 27 05 Posts: 51 ID: 41623 Credit: 772,994 RAC: 5
Cool, timely, and objective. Wow, and witty too. Thanks, Gary, reading your comments felt good. :)
____________
Hello Gary!
Many thanks for this very informative thread. We are sure the server crew is very, very busy these days, and that they want to do a really quick and good job.
I’m sure many of the threads about this failure over the last few days will also give you valuable information about how to come out of this obviously complex situation. Please don’t forget that a great part of the participants in this project are not computer specialists, and are taking part for the first time in such a complex project. Many of them learned computing in their spare time, or, if they are computing professionals, they are very busy – like you – and are pleased about some form of gentle support.
If there had been some information on the E@H homepage very early on, such as “We have become aware of low throughput of the validators. We are analysing the situation.”, that would have relaxed the situation for many participants without overloading the chief in charge of the server crew; it is more a question of their will. A short daily report of 1 or 2 sentences would do, like “Damned, it’s still not clear why we have only about x% of the nominal throughput. Please keep crunching if you get work and can upload results. We have sufficient disk space for another yy days – just whatever data you have available –.” or “We found the failure. It will probably take another 2 days to write, test and install new code. …. Meanwhile your work can go on as in the last few days.” Such short notes would not overstress the crew, but would relax the situation for the many people out there and create a more familiar and trusting atmosphere.
And this atmosphere is more important than you might think. Behind these very useful but stupid and silly computers are humans responsible for what they do, and they want to be accepted and treated as humans. Several participants wrote that they shut off E@H. – And how many didn’t write, but just did it? – And how many of them will stay off permanently, because they felt angry? A permanent loss for your success, too.
I know very well what I’m talking about, as I was responsible for the operation of big equipment – not computers – in science for several decades.
Thanks for the post Gary, just wish this thread would get stickied (stickyed?) so that the non-informative threads don't keep sending this one lower.
In the few years I have been BOINCing (>7 if you count the pre-BOINC version of SETI@Home), I have consistently found Einstein@Home to be the most stable of the projects I have crunched. I continue to crunch for this project as I support its goals, and I have another 2 projects that can take over any "spare" cycles if I don't manage to get WUs from here.
Thanks to the moderators for posting and keeping us informed, thanks to the project team for working hard in the background, and thanks to ALL you Einstein users who quietly continue to crunch this project without making any threats to leave, etc. I believe the quiet users are the majority :) but the squeaky wheels are the ones that get the attention.
Geez-gosh-whizz ... If you had bothered to post the info that you sent to the message board, that would have helped 100%! At least we would understand what is going on...
From the outset, let me say that having read this post, I am quite happy to sit and wait for a solution to be found and implemented.
Like everyone else, I'm experiencing problems with some of my computers. And, obviously unlike some of the posters, I have worked in a high-pressure support environment where you learn _very_ quickly to ignore telephone calls & etc while you are working to fix the problem.
But (don't ya just hate that word?) a prominent "line or two" (on the EAH home page perhaps) informs us, the user base, that the support team are aware of the problem. If everyone sat idly by and did not say anything, then it is possible, just possible, that the EAH team would not know that a problem exists.
Paying heed to the error messages and keeping quiet about them does not expedite a solution. Neither does starting numerous threads on the same issue. And polemic rhetoric should be left to the politicians and such others with no gainful employment. All we need is the application of that most rare of commodities - common sense!
Just my 2c worth.
____________
Problem Solving Algorithm:
1) Write down problem
2) Think really hard
3) Write down answer
- Richard Feynman
ID: 64305 |
Gary Roberts Forum moderator
Joined: Feb 9 05 Posts: 2068 ID: 12521 Credit: 57,351,877 RAC: 174,811
I would like to thank all those who have expressed appreciation for the information that I have tried to provide in this thread. It is frustrating for all of us to experience the difficulties in getting regular work and reporting the results. My reason for posting is to try to ease the level of frustration and not to try to pretend that the problems don't exist or to suggest that they should simply be ignored.
As I read through the responses, there are a couple of comments that need to be addressed. One of those is that I should sticky this post. OK, that has now been done. Another is that I should post a summary on the front page. Unfortunately that is something I can't do as I'm not a staff member of the project. I'm simply a user like everyone else, with the ability to do some basic housekeeping, like deleting posts or threads or making a thread sticky.
Many people wonder why the project staff seem to be insensitive to the user frustration. Believe me, I'm sure they are not. I'm sure it's just a matter of too many fires to fight and too few firefighters to do it. Take a look at the contributors page and see if you can find any IT specialists who might be responsible for the management of the server farm and the ongoing development of the software system that runs that farm. How many database specialists are there who know all the tricks to really improve database performance? Unfortunately it is the physicists themselves that have to do this. Any programmers you see there are working on the science apps and not the server back end or database code.
The problems are certainly with the server and database code as this thread over at Seti seems to indicate. In a later message, Matt Lebofsky indicates that both Seti and Einstein are being affected. I'm sure people like Bruce Allen and David Hammer are doing their best to resolve these problems as quickly as possible.
The problems will ultimately be solved. Indications are there that this may well be sooner rather than later now that those over at Seti seem to have worked out a possible strategy. I'm sure the project staff will let us know more details as soon as they are able to. In the meantime, I would like to thank you for your continuing support and patience.
Gary, I appreciate your efforts. I do believe there are multiple problems confronting the admin folks here and can appreciate that they are up to their elbows in alligators. Still, it really would be nice to have seen some home page update in the past month given the ongoing very real problems encountered here. It has been a rather lousy two months here.
Like others, I am a strong advocate of running multiple projects -- I have no systems with less than two projects and nearly all of my own collection have three or four active projects.
With the ongoing problems here, I was going to suspend processing on Einstein pending resolution -- and probably a resolution which is confirmed by 10 days to two weeks solid running. What I've done first though, is set Einstein to 'no new work'. That way I'll be able to clear my Einstein work to do within the next week or less. As each workstation clears the last Einstein workunit, I set it to suspend Einstein.
I think it is a reasonable approach for the duration as there are multiple worthy BOINC projects which currently are running fairly well (including SETI, even with its much larger database). Then again, it means that posting a home page announcement is a bit more important for me, as I'm rather disinclined to tramp thru multiple message boards and threads here to glean status information.
____________
ID: 64383 |
F. Prefect
Joined: Nov 7 05 Posts: 137 ID: 119854 Credit: 882,195 RAC: 526
Just a simple explanation is all I have been asking for during the past 3 weeks and all I got was flamed, and I would like to apologise for my petty response.
I would assume that since my uploaded and reported credits are showing up in the "credits pending" as well as "results", I will get credit eventually, but like yourself, I just can't figure out why they can't write a couple of sentences explaining things on the status page. It kind of makes one wonder if the problem is one of a serious nature, but again an explanation that the project is going to continue would be enough for me to remain.
It appears I have been getting credit for some of the pending jobs as my overall point total is slowly rising. However the results pending number is rising much faster. As long as the thing still seems to be working I'm going to stay put. I am running Rosetta on a couple of machines, but being on dialup I'm spending more time downloading than anything else. :-( If you know of a worthwhile program that's up and running smoothly let me know. :-)
Correct, the database has been in 'dribble validation mode' for the past week or so (in addition to other problems that have occurred and been dealt with or remain ongoing).
Regarding other projects that run pretty smoothly, the two I added in the past year or so are World Grid (which has a bit different interface but is part of the BOINC group) and Rosetta -- I pushed computer cycles over to Rosetta with the beginning of Einstein's severe problems in December. Those two projects now are getting about 60% of my cycles over the past two months. SETI has also been a beneficiary of the shift from Einstein -- back in November (when Einstein was still running reliably), SETI was getting less than 15% of my CPU time, now it gets 25%. SETI runs reasonably reliably (they have a weekly 3 to 4 hour outage on Tuesdays which is planned), and further, the much larger and proactive user community there means that when there are problems, the word gets out big time.
Thanks for your efforts and the valuable information, Gary.
I'll keep cooking for E@H, with a duty cycle of 12.5% down from 100%. And have elected climateprediction.net as my 'other' project - no worries with this one, the expected execution time is over 2000 hours per WU, you download one WU and forget about it ... Once I have chewed about 10 of those, I'll have learnt to be patient and will reconsider E@H ;)
It's a pity that no official information has been forthcoming from the project. Someone wrote in a post that E@H never promised a problem-free ride. Granted. But if the broad user support E@H have gained just evaporates, nothing will be gained.
I could understand if this problem had only been happening for a few days, but IIRC it's been ongoing for almost two weeks. That, and the fact there is no "official" response on the main E@H page, shows a real lack of respect toward contributors.
I quit E@H once before, looks like it's time to do so again. Sad really.
____________
ID: 64458 |
F. Prefect
Joined: Nov 7 05 Posts: 137 ID: 119854 Credit: 882,195 RAC: 526
Excellent post. Personally I believe that if there are still plans for the project to remain in existence over a significant period of time, they are making a very big mistake by not informing the current contributors regarding the current as well as future status of the project.
The only possible reason I can imagine for their lack of action is the fear of losing participants, who of course, in most cases, will never return after setting up shop elsewhere. I hope I'm wrong, and am continuing to run Einstein@home on all machines and will continue to do so as long as the results are being shown to have been received, my total credit number continues to increase, albeit at a very slow rate, all sent jobs are showing up on the results page and my pending credit continues to increase.
However, after reading several of the posts in this forum and others, I tend to believe they are losing more participants than they would if they simply posted a short message giving the current and future status of the project. Then those who have been with the program for a lengthy period of time, as well as others who are considering joining or have only been with it for a short time, could make an informed decision as to what they intend to do.
If their apparent position is that "the truth will kill", they are very likely to discover that a total lack of action will get the job done just as quickly, while at the same time showing total disrespect to those who have several years invested in the project.
ID: 64464 |
Mike Hewson Forum moderator
Joined: Dec 1 05 Posts: 1868 ID: 135571 Credit: 4,434,081 RAC: 5,120
Well spoken Gary!
Some of my thoughts after more than a year of E@H involvement:
- E@H is a victim of its own success. For a number of reasons, and ironically reliability is prominent amongst them, it has steadily grown.
For those of you cognisant with IT stuff and database design in particular, it is often the case that some things just don't scale well. This came to light for E@H late last year when the increasing activity of the project made a number of hardware issues patent.
- we are all actually getting feedback, but many do not realise it, as part of the design of the project: it's called the BOINC error messages. While not absolutely personalised enough for some, they do accurately reflect their computer's situation. If some people are meaning "lack of feedback" == "the project's internal decision loop is not visible to me", then you are absolutely right. By definition really. I would respectfully suggest that it is correct for it to be that way - I won't repeat the many reasons already mentioned for this. Sadly some will take personal offence at this scenario, or at least appear to.. :-)
- it is always worth mentioning, again, the multi-project aspect of BOINC. This is not to say "go away if you're not happy" ( although you may ), but simply reflects that the design of BOINC explicitly caters for this. Again no personal impugnment ought be deduced from this.
- there is a fascinating social aspect to this distributed processing paradigm. It certainly confirms the old adage "you can't please everybody". Fundamentally there is a frequent contributor expectation that the provision of volunteered resources necessarily implies a quid pro quo of some sort. I guess it does, however some look beyond a mere "thanks for your time", aka credits. I really don't know where to go from there along that line of thinking, except to point out that credits don't actually mean anything outside the confines of this "castle in the air" that is the E@H project. It really is a pretty pure knowledge project/experiment and about as cutting edge as it gets ( that is: no-one has yet detected a gravitational wave, and boy will it make a splash when it does! ) My view is that if one doesn't get a tingle up the spine simply by being involved in this historic enterprise then I probably am actually on another intellectual plane! I mean that sincerely and kindly without malice or being condescending !! :-)
- as for the project staff, they are superb in my view. No doubt they wince when reading some posts ( mine included ) but they have remained professional and hardworking. Please don't be too harsh if you interpret their apparent absence as some variety of snub. Also please don't assume that there is some open ended resource bucket yet to be dipped into either. Instead consider it a blessing if the ship's engineer spends his time in the engine-room rather than chatting with the passengers on deck, and more so if the seas are rough!
Having vented the bilge on that ( metaphors are not my strong point ), let's all ease up a bit and be patient. :-)
Cheers, Mike.
NB. I wonder how many know that Bruce Allen was a graduate student of Stephen Hawking? To grossly understate - that is not a position casually obtained! That does not make him some god that strides upon the Earth, but likewise he is no flunky functionary that some contributors have implied. There really is no known reason to concoct personal attacks - you know who you are - and I will vigorously delete such board activity at any such time it arises, then as now. :-)
( edit ) '...open ended resource bucket...' - groan! What was I thinking! :-)
____________
"I have made this letter longer than usual, because I lack the time to make it short." - Blaise Pascal
Well spoken Gary!
Some of my thoughts after more than a year of E@H involvement:
- E@H is a victim of it's own success. For a number of reasons, and ironically reliability is prominent amongst them, it has steadily grown ie.
((image : see previous post))
for those of you cognisant with IT stuff and database design in particular, it is often the case that some things just don't scale well. This came to light for E@H late last year with increasing activity of the project declared a number of hardware issues patent.
- we are all actually getting feedback, but many do not realise it, as part of the design of the project: it's called the BOINC error messages. While not absolutely personalised enough for some, they do accurately reflect their computer's situation. If some people are meaning "lack of feedback" == "the project's internal decision loop is not visible to me", then you are absolutely right. By definition really. I would respectfully suggest that it is correct for it to be that way - I won't repeat the many reasons already mentioned for this. Sadly some will take personal offence at this scenario, or at least appear to.. :-)
- it is always worth mentioning, again, the multi-project aspect of BOINC. This is not to say "go away if you're not happy" ( although you may ), but simply reflects that the design of BOINC explicitly caters for this. Again no personal impugnment ought to be deduced from this.
- there is a fascinating social aspect to this distributed processing paradigm. It certainly confirms the old adage "you can't please everybody". Fundamentally there is a frequent contributor expectation that the provision of volunteered resources necessarily implies a quid pro quo of some sort. I guess it does, however some look beyond a mere "thanks for your time", aka credits. I really don't know where to go from there along that line of thinking, except to point out that credits don't actually mean anything outside the confines of this "castle in the air" that is the E@H project. It really is a pretty pure knowledge project/experiment and about as cutting edge as it gets ( that is: no-one has yet detected a gravitational wave, and boy will it make a splash when someone does! ) My view is that if one doesn't get a tingle up the spine simply by being involved in this historic enterprise then I probably am actually on another intellectual plane! I mean that sincerely and kindly without malice or being condescending !! :-)
- as for the project staff, they are superb in my view. They no doubt wince when reading some posts ( mine included ) but they have remained professional and hardworking. Please don't be too harsh if you take their apparent absence as some variety of snub. Also please don't assume that there is some open ended resource bucket yet to be dipped into either. Instead consider it a blessing if the ship's engineer spends his time in the engine-room rather than chatting with the passengers on deck, and more so if the seas are rough!
Having vented the bilge on that ( metaphors are not my strong point ), let's all ease up a bit and be patient. :-)
Cheers, Mike.
NB. I wonder how many know that Bruce Allen was a graduate student of Stephen Hawking? To grossly understate - that is not a position casually obtained! That does not make him some god that strides upon the Earth, but likewise he is no flunky functionary that some contributors have implied. There really is no known reason to concoct personal attacks - you know who you are - and I will vigorously delete such board activity at any such time it arises, then as now. :-)
( edit ) '...open ended resource bucket...' - groan! What was I thinking! :-)
Thanks Mike !
There is indeed a social aspect to this argument. It's all about partnership and expected fairness, as I understand it.
The contributors' community controls a valuable resource, which it allocates to various degrees to E@H's computing needs. The community does so for the love of a good science project, or to garner credits, or for whatever particular reasons a member may have. By the mere fact that the project needs and accepts the use of this freely contributed resource, it enters a partnership with the community.
We all rely for our social interactions on some sense of fairness, as do some of our relatives on the path of evolution. This sense of fairness has helped resolve or prevent frustrations on several occasions :
1. the credit allocation scheme was revised after long-standing protests from the Linux contributor base, which claimed that allocation based on the BOINC benchmarks was arbitrary and unfair.
2. feedback and valuable technical information on the project's progress were eventually posted on the Science board.
3. within hours of the hardware outages hitting E@H's file server, the IT staff had an HTTP server up with a short and informative message on display, which the community greatly appreciated.
Now, if there is growing dissatisfaction within the contributors community with the PR or problem management aspects of the project, I guess they can a) educate the community to proper attitudes and expectations, or b) adjust their model of said community to fit the experimental data. As you say, a fascinating and unexpected aspect of this project.
What I see so far is the community helping itself - many thanks again to the moderators. The silence on the project's side sounds like
we don't have a problem, regrettably we don't even know how to fix it, and it's none of your business anyway.
Now, I was just trolling - but a few lines would have prevented a lot of grief.
I spent 40 minutes reading all the threads about the past two weeks' issues.
In these days of trouble, I am having a hard time keeping all my machines busy crunching for EAH (being in the top 30 RAC I do have a few machines to take care of...) and it would really help if somebody could confirm that it is 100% sure that no credit will be lost in the recovery process. Maybe I overlooked some threads but frankly this information deserves a well placed message.
One remark.
I am not a usual participant in these forums. I believe a lot of people like me would not have needed to spend 40 minutes trying to find out what is going wrong if at least a small message had been posted on the home page of EAH. We all know the system admins are busy trying to fix the problem but the variety of participants (the IT-savvy and the others) deserves various communication channels, I believe.
In these days of trouble, I am having a hard time keeping all my machines busy crunching for EAH (being in the top 30 RAC I do have a few machines to take care of...) and it would really help if somebody could confirm that it is 100% sure that no credit will be lost in the recovery process.
I doubt anything can be stated with 100% certainty in this business … but AFAICT it would take a new and unforeseen (and probably quite spectacular, or at least multiple) failure of some kind for results that are already ‘safely pending’ in the database to get lost.
____________
ID: 64511 |
F. Prefect
Joined: Nov 7 05 Posts: 137 ID: 119854 Credit: 882,195 RAC: 526
Hello All,
I spent 40 minutes reading all the threads about the past two weeks' issues.
In these days of trouble, I am having a hard time keeping all my machines busy crunching for EAH (being in the top 30 RAC I do have a few machines to take care of...) and it would really help if somebody could confirm that it is 100% sure that no credit will be lost in the recovery process. Maybe I overlooked some threads but frankly this information deserves a well placed message.
One remark.
I am not a usual participant in these forums. I believe a lot of people like me would not have needed to spend 40 minutes trying to find out what is going wrong if at least a small message had been posted on the home page of EAH. We all know the system admins are busy trying to fix the problem but the variety of participants (the IT-savvy and the others) deserves various communication channels, I believe.
Sherwood
Sounds logical to me. I even suggested 3 weeks ago that that might be the best course of action and was flamed. However, the fact that it seems so simple may, I believe, hold a hidden meaning. Of what, I have no idea, but there must be something happening for which a brief explanation may not, at least at this point in time, be possible. I have been with this project for approx 1.5 years and in the past when ANYthing was out of the ordinary, there would be a post from the director or one of his assistants informing all of the nature of the problem and when it was forecast to be rectified.
This seems to be a problem with too many DC projects. Lack of communication between Project honchos and the participants. People can make all the excuses and / or apologies they want, but frankly it amounts to poor communication.
____________
This seems to be a problem with too many DC projects. Lack of communication between Project honchos and the participants. People can make all the excuses and / or apologies they want, but frankly it amounts to poor communication.
It is hard to believe that the project staff is not yet aware of the need of this DC community for better communication. It is even harder to figure out why they should have chosen to ignore this need for so long.
As has been pointed out by Sherwood below, the lack of structured communications means that everyone interested has to piece together posts scattered through multiple threads. Not to speak of the frustrated and angry members who flame or post sarcastic comments. Now, what is the Greater Good that is worth this regrettable state of things ?
Look back. Einstein - yes, Albert Einstein. He was also good at communicating, wasn't he ?
So, I spent a lot of time reading all these threads, but I can't find a solution to my problem: BOINC Manager can't download any work from the project. I keep getting messages like this:
2007-02-20 10:24:20|http://einstein.phys.uwm.edu/|Sending scheduler request to http://einstein.phys.uwm.edu/EinsteinAtHome_cgi/cgi
2007-02-20 10:24:20|http://einstein.phys.uwm.edu/|Reason: Requested by user
2007-02-20 10:24:20|http://einstein.phys.uwm.edu/|Requesting 17280 seconds of new work
2007-02-20 10:26:21|http://einstein.phys.uwm.edu/|Scheduler request failed: HTTP internal server error
2007-02-20 10:26:21|http://einstein.phys.uwm.edu/|Deferring scheduler requests for 1 minutes and 0 seconds
or like this:
2007-02-20 10:29:04|http://einstein.phys.uwm.edu/|Scheduler list download succeeded
2007-02-20 10:29:04|http://einstein.phys.uwm.edu/|Sending scheduler request to http://einstein.phys.uwm.edu/EinsteinAtHome_cgi/cgi
2007-02-20 10:29:04|http://einstein.phys.uwm.edu/|Reason: Requested by user
2007-02-20 10:29:04|http://einstein.phys.uwm.edu/|Requesting 17280 seconds of new work
2007-02-20 10:29:07|http://einstein.phys.uwm.edu/|Scheduler request succeeded
2007-02-20 10:29:07|http://einstein.phys.uwm.edu/|Message from server: Server can't open database
2007-02-20 10:29:07|http://einstein.phys.uwm.edu/|Resetting project
2007-02-20 10:29:07||Rescheduling CPU: exit_tasks
2007-02-20 10:29:07|http://einstein.phys.uwm.edu/|Detaching from project
Tell me what is wrong? I think, it is your business to make the project work OK.
____________
ID: 64534 |
F. Prefect
Joined: Nov 7 05 Posts: 137 ID: 119854 Credit: 882,195 RAC: 526
This seems to be a problem with too many DC projects. Lack of communication between Project honchos and the participants. People can make all the excuses and / or apologies they want, but frankly it amounts to poor communication.
Regarding the communication issue, I was one of the first to start asking questions (complaining) and am in complete agreement with you.
However, for the first time since I joined this project, I began to follow the actual "numbers" in the past week or 10 days and I have to admit it appears to still be working. For the first time in at least a couple of weeks my average has now risen in the past 2 days although my pending credits continued to rise.
The problem, as I'm sure most participants are aware, is in the verification process. I have no idea how many times a job has to be "run" before it is considered to be "finished" and credits move from the pending column into actual credits, but the lag time, for some reason, has become greater in verifying what appears to be a goodly sized number of jobs. It could be they have taken in too many volunteers (there has to be a limit at some point) and are unable to process the results fast enough to "keep up"? I don't know, but would doubt that would be the reason.
But as long as the jobs are being recorded I think I'm just going to wait it out. It makes no sense at all if there are plans to shut down the project, for them to keep providing new work and continue to record the results of work that is being uploaded.
Perhaps we could have a contest: Why is there no communication in 25 words or less.
So, I spent a lot of time reading all these threads, but I can't find a solution to my problem: BOINC Manager can't download any work from the project. I keep getting messages like this:
2007-02-20 10:24:20|http://einstein.phys.uwm.edu/|Sending scheduler request to http://einstein.phys.uwm.edu/EinsteinAtHome_cgi/cgi
2007-02-20 10:24:20|http://einstein.phys.uwm.edu/|Reason: Requested by user
2007-02-20 10:24:20|http://einstein.phys.uwm.edu/|Requesting 17280 seconds of new work
2007-02-20 10:26:21|http://einstein.phys.uwm.edu/|Scheduler request failed: HTTP internal server error
2007-02-20 10:26:21|http://einstein.phys.uwm.edu/|Deferring scheduler requests for 1 minutes and 0 seconds
or like this:
2007-02-20 10:29:04|http://einstein.phys.uwm.edu/|Scheduler list download succeeded
2007-02-20 10:29:04|http://einstein.phys.uwm.edu/|Sending scheduler request to http://einstein.phys.uwm.edu/EinsteinAtHome_cgi/cgi
2007-02-20 10:29:04|http://einstein.phys.uwm.edu/|Reason: Requested by user
2007-02-20 10:29:04|http://einstein.phys.uwm.edu/|Requesting 17280 seconds of new work
2007-02-20 10:29:07|http://einstein.phys.uwm.edu/|Scheduler request succeeded
2007-02-20 10:29:07|http://einstein.phys.uwm.edu/|Message from server: Server can't open database
2007-02-20 10:29:07|http://einstein.phys.uwm.edu/|Resetting project
2007-02-20 10:29:07||Rescheduling CPU: exit_tasks
2007-02-20 10:29:07|http://einstein.phys.uwm.edu/|Detaching from project
Tell me what is wrong? I think, it is your business to make the project work OK.
It's not a fault at your end, it's at the server end. Alas it seems there will be no quick solution, so my main message is to be patient and have faith. Or to put it differently: just keep hanging on to this project and you'll get some work eventually.
If you run multiple projects then in the meantime your spare CPU cycles will get donated to those automatically. If you don't run other BOINC projects and you want to donate your spare CPU cycles, you could consider attaching to other BOINC projects; currently there are around 30 of them to choose between.
Regarding any expectations of the Einstein@Home people in this project I always remind myself that BOINC projects are all based on voluntary donations from the users' end, with voluntary meaning to me that I do not expect any reward whatsoever apart from the act of giving itself.
OK, I admit, there is also the very slim chance I happen to be the one whose donated cycles made any discovery possible :).
[edit] had to add that smile... [/edit]
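On the point above that the fault is at the server end: for anyone who wants to check their own message log rather than take it on faith, here is a minimal Python sketch. It assumes the `timestamp|project|message` layout seen in the excerpts quoted above, and the two marker strings are copied from those excerpts. Treating them as "server-side" is my own reading, not an official BOINC list, and the function names are made up for illustration.

```python
from collections import Counter

# Substrings copied from the quoted log excerpts. Classifying them as
# server-side is this sketch's assumption, not an official BOINC list.
SERVER_SIDE_MARKERS = (
    "HTTP internal server error",
    "Server can't open database",
)


def parse_line(line):
    """Split one 'timestamp|project|message' line into its three fields."""
    timestamp, project, message = line.rstrip("\n").split("|", 2)
    return timestamp, project, message


def classify(message):
    """Return 'server' if the message contains a known server-side marker."""
    if any(marker in message for marker in SERVER_SIDE_MARKERS):
        return "server"
    return "other"


def summarize(lines):
    """Count message classes across an iterable of log lines."""
    counts = Counter()
    for line in lines:
        if line.count("|") < 2:
            continue  # skip lines that do not match the expected layout
        _, _, message = parse_line(line)
        counts[classify(message)] += 1
    return counts
```

Feed it the lines from your own Messages tab (for example `summarize(open("messages.txt"))`, where the file name is whatever you saved them as). If the failures all land in the "server" bucket, resetting or detaching on your side is unlikely to change anything.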
ID: 64570 |
F. Prefect
Joined: Nov 7 05 Posts: 137 ID: 119854 Credit: 882,195 RAC: 526
Read home page, an update was put there, today.
Well, my pending numbers took the largest drop in several weeks today, and what I suppose is the average of my daily "output" made a similar move up. Glad I didn't go anywhere, but I will remain critical of how the whole thing was handled. I suppose scientists operate on their own clock, although I still can't help but think something else was afoot, but I have no idea of what it may have been. Hopefully things will settle down over the next few weeks and just MAYBE they can keep the slaves just a little better informed.
I would also like to address a couple of replies I made to criticism of my continued effort to get some kind of a statement such as was issued today. I apologise to all who were offended, which I suspect was just about everyone, but I was simply a little POed that my squeaky wheel technique was not only a failure but that I was being criticized for a simple request for information. I was way out of line and in the future will just have to adjust to the administrator's clock, like it or not.
February 21, 2007
The database problems have been identified but in order to correct them some modifications will be needed on the Einstein@Home back-end components. To effect these modifications, the project will have to be repeatedly taken up and down today. So please be warned: we will have frequent and unscheduled service interruptions while we work on this today.
Pages are loading extremely fast.
____________
ID: 64581 |
Terry
Joined: Feb 18 07 Posts: 7 ID: 246269 Credit: 9,229 RAC: 0
Not here: the Einstein website and forums are loading very slowly, and that's with my high-speed DSL connection 20 miles from UWM. Einstein@Home's servers and database seem to be working fine now though, according to the BOINC Manager messages. I just started using BOINC a few days ago. It's fascinating to be part of this research, if only in a small way.
February 21, 2007
Pages are loading extremely fast.
Not here: the Einstein website and forums are loading very slowly, and that's with my high-speed DSL connection 20 miles from UWM.
Last night I experienced a very refreshing improvement in the website’s responsiveness for a while: pages were loading almost instantly. But today it’s back to the ‘molasses flowing uphill in January’ performance we’ve been getting for the past few weeks.
____________
ID: 64585 |
Richard Haselgrove
Joined: Dec 10 05 Posts: 579 ID: 144054 Credit: 2,965,572 RAC: 2,400
As the front page says, work is ongoing, and I keep losing all connections - but when the server is up, it seems to be much more responsive.
And my Celeron has just downloaded new work for the first time in over a week.
Definite signs of progress - keep up the good work. (And thanks for the daily updates on the front page).
ID: 64593 |
Terry
Joined: Feb 18 07 Posts: 7 ID: 246269 Credit: 9,229 RAC: 0
The webpages and forum are remarkably speedy this early AM (5:30 CST). I no longer feel like I did back in the old dialup days. Excellent work to the E@H team.
____________ --Terry photostuff.org
ID: 64596 |
Vladimir Zarkov
Joined: Feb 27 05 Posts: 51 ID: 41623 Credit: 772,994 RAC: 5
The project looks better and better today - servers started gobbling that unvalidated load as my dwindling pending credit shows. And the curve in Total Credit chart in my BOINC Manager points at the sky right now. How can it not make me happy? :)))
Heroic work again. Hats off to the project's team.
____________
Good to see things going back to normal, and very good to see news on the home page telling about the progress made. Thanks a lot.
-rg-
(But my two boxes remain committed to 88% to climateprediction - these WUs take long weeks to complete, and it's stupid to throw away whatever work was done on them.)
____________
ID: 64599 |
F. Prefect
Joined: Nov 7 05 Posts: 137 ID: 119854 Credit: 882,195 RAC: 526
Looking very good right now at 6:07am CST.
Everything appears to be back to normal as of 11:00AM CST. I still have about 3 times my usual pending numbers, but they have been falling rapidly all morning.
Please remove them in any way you think appropriate. Thanks & regards.
-rg-
____________
ID: 64622 |
Gary Roberts Forum moderator
Joined: Feb 9 05 Posts: 2068 ID: 12521 Credit: 57,351,877 RAC: 174,811
Please remove them in any way you think appropriate. Thanks & regards.
-rg-
Both deleted at your request. When a thread is deleted, a category for the type of deletion has to be assigned. The only options are:-
Obscene,
Flame/hate mail, and
Commercial spam.
Obviously your thread fits none of these. Please don't take offense that it was categorised as spam when you get an email informing you of the deletion :).
Thanks for your assistance during the period of the database problems.
____________
Cheers,
Gary.
ID: 64624 |
Bruce Allen Forum moderator Project administrator Project developer Project scientist
Joined: Oct 15 04 Posts: 985 ID: 3 Credit: 170,849,008 RAC: 0
Dear Einstein@Home volunteers and contributors,
I thought I would post a description of what went wrong and how it was fixed.
(1) Project performance problems. These were due to our database getting overloaded. It was processing an average of 950 queries per second, with peaks of up to about 3000 queries per second. Ultimately, these were due to the way that the BOINC locality scheduler works and the fact that our new analysis run did not have many low-frequency workunits. Einstein@Home is the only project that uses the locality scheduler, which is designed to send many workunits for the same data file, only sending a new data file when there is no work left for the previous data file. What happened was that many hosts that had low frequency files (because they were slower than the majority of hosts) requested work for these files, or NEW workunits also for low frequency files. When the project ran out of work for these files, the locality scheduler would then perform an extremely database intensive 'crawl' through the database looking for more work. So the slowest 20% of hosts were generating very large numbers of database queries looking for non-existent low frequency workunits. I fixed this by modifying the algorithm that searches for new work. Anyone interested in the details can look at BOINC CVS next week when I check in the modified code.
The database is now averaging about 60 to 80 queries per second, and the database server and project servers are once again snappy and responsive.
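For readers who want a feel for why the exhausted low-frequency files hurt so much, here is a toy Python model of the behaviour described above. It is a sketch under stated assumptions and not the BOINC locality scheduler or the modified search algorithm: the class name, the `exhausted` cache and the query counter are all hypothetical, and remembering which data files have run dry is just one plausible way to avoid repeating a fruitless search. Going by the figures quoted above, the real change took the load from about 950 queries per second to 60 to 80, roughly a 12- to 16-fold drop.

```python
class ToyLocalityScheduler:
    """Toy model of locality scheduling; not the BOINC implementation."""

    def __init__(self, work_by_file):
        # work_by_file maps a data file name to a list of unsent workunits.
        self.work_by_file = {f: list(w) for f, w in work_by_file.items()}
        self.exhausted = set()  # hypothetical cache of files with no work left
        self.queries = 0        # stand-in for database queries issued

    def _lookup(self, data_file):
        """Simulate one database query for unsent work on a data file."""
        self.queries += 1
        return self.work_by_file.get(data_file, [])

    def _try_files(self, data_files, use_cache):
        """Return (data_file, workunit) from the first file that has work."""
        for data_file in data_files:
            if use_cache and data_file in self.exhausted:
                continue  # skip the query: this file is known to be empty
            pending = self._lookup(data_file)
            if pending:
                return data_file, pending.pop(0)
            self.exhausted.add(data_file)
        return None

    def request_work(self, host_files, use_cache=True):
        """Prefer files the host already holds, then crawl all other files."""
        found = self._try_files(host_files, use_cache)
        if found is None:
            # Nothing left for the host's files: search every known file.
            found = self._try_files(sorted(self.work_by_file), use_cache)
        return found
```

With `work = {"low_1": [], "low_2": [], "high_1": ["wu_a", "wu_b", "wu_c"]}`, a host holding only the two low-frequency files costs three simulated queries per request when `use_cache=False`, but only one per request after the first when the cache is on. A real scheduler would also have to clear such a cache whenever new work is generated for a file, which this toy ignores.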
(2) File server problems. Our project uses three file servers, each of which has about 8TB of RAID-6 disk space. The file servers use Areca 24-port SATA controller cards, and Western Digital WD4000YR disks. For a number of months we have been experiencing problems in which a disk would apparently drop from the array and then reappear a few seconds later, prompting a RAID array rebuild. In the end we sent one of our server boxes (approximately 80 kg, worth about 10kUSD) by express mail to Taiwan, and the Areca engineers looked at it more closely. (Many thanks to these engineers, who have given us first-rate support!) It turned out that our problems were due to a hardware problem with the WD4000YR drives. They have a SATA interface chip which (in some revisions of the WD4000YR) is incompatible with an interface chip used on the Areca RAID controller. This incompatibility is only triggered by issuing NCQ commands. So by disabling NCQ on the RAID controller, the problem was fixed. Our two remaining file servers have now been working without issues for more than two weeks.
These things were further exacerbated by my move to Germany with my family (our kids are 2.5 and 6 years old) which meant that I couldn't give these issues enough attention until now.
Hopefully these problems are behind us! I am grateful to everyone for their patience, and apologize for how long it took to track these things down and deal with them.
Thanks for that excellent recap, Bruce. Suddenly it all makes sense.
(And quite a feat to unveil the NCQ incompatibility, must have been annoyingly elusive. I can easily imagine the need to ship the entire subsystem to Areca.)
____________
I thought I would post a description of what went wrong and how it was fixed.
(snipped to conserve space)
Cheers,
Bruce Allen
Thanks Bruce for all your hard work getting the system back up and running full speed again. Am surely glad I didn't stop processing work during the spate of problems. All my back credits when processed brought me back online to where I would have been if things were running normally during that period.
I hope that you and your family enjoy the new surroundings in Germany.
Arion
(Edited for grammar)
____________
ID: 64744 |
Dave Burbank
Joined: Jan 30 06 Posts: 275 ID: 168016 Credit: 1,548,376 RAC: 0
Thank You for the update on the database and file server problems.
Glad to see that things are running smoothly again.
Keep up the good work.
____________
There are 10^11 stars in the galaxy. That used to be a huge number. But it's only a hundred billion. It's less than the national deficit! We used to call them astronomical numbers. Now we should call them economical numbers. - Richard Feynman
How often do the official E@H project people check this website?
I posted a serious tech problem in the Getting Started section and after more than a day there is so far no response.
:\\
Probably you should PM the moderator or try to find a direct contact for support. The forum is considered a place for discussion, so getting priority support may not work through the forum.
ID: 65927 |
Mike Hewson Forum moderator
Joined: Dec 1 05 Posts: 1868 ID: 135571 Credit: 4,434,081 RAC: 5,120
Probably you should PM the moderator or try to find a direct contact for support. The forum is considered a place for discussion, so getting priority support may not work through the forum.
I'll pass this on. :-)
Cheers, Mike.
____________
"I have made this letter longer than usual, because I lack the time to make it short." - Blaise Pascal
Mike Hewson Forum moderator
Joined: Dec 1 05 Posts: 1868 ID: 135571 Credit: 4,434,081 RAC: 5,120
Get a PM option here, Mike. So I can bug you daily. :)
(This post can be deleted)
Ah .... Jord :-) :-)
Well the secret is ( not now of course ) that the red-X does the trick there. Mind you with the red-X all/other moderators read the same so I encourage a preface like 'For Mike Hewson Only' or somesuch if you want my specific attention. I do actually get messages of 'non-complaining' nature via that channel anyway. Naturally I'm not after a lot of them but I appreciate that some privacy is required for some issues. I trust ( heaven forbid... ) people to use this thoughtfully. :-)
Whoops, the cat's out of the bag now .... an avalanche awaits!! :-)
Cheers, Mike.
____________
"I have made this letter longer than usual, because I lack the time to make it short." - Blaise Pascal
I have been working on computing the data recently and, by utilizing the latest techniques in analytical physics, discovered that there is a relation between the factors of this equation which have not yet been defined within specific parameters. The variables are of infinite proportions, so solving the equation is a bit trying, but I believe that in the next few weeks I may have a solution to the factors involved, capable of defining each of the 7 hundred variations, which I will divide by 2 Meco cycles, finding the relation between quantum theory and a revolutionary bio matrix of mathematical configurations once and for all. A hypothesis will then have the absolute properties of the equation and its value within set parameters, able to identify each factor in this difficult equation, as long as I am able to recover the technical data required to compute specifics without properties being identifiable outside of a controlled environment.
____________
This material is based upon work supported by the National Science
Foundation (NSF) under Grant NSF-0200852 and by the Max Planck
Gesellschaft (MPG). Any opinions, findings, and conclusions or
recommendations expressed in this material are those of the investigators
and do not necessarily reflect the views of the NSF or the MPG.