  #61 (permalink)  
Old 25-June-2005, 12:18 PM
Ken Vogt Ken Vogt is offline
Established Member
 
Join Date: Nov 2004
Location: Bloomington, Indiana, USA
Posts: 425
Default

Hi minbari,

You are welcome; PT is incredibly accessible (as is Bruce Allen at E@H; we are very fortunate to be running projects for such people).

As to your machine's problem:

It is not that the WU is failing to upload correctly, but that the result disagrees with the other two results, and so you get no credit for it. See the detail for the most recent "validate error," WU #1373063.

The more recent results are marked "pending" and "success," but you won't get credit for them either if they aren't confirmed by 2 other boxes.

The most likely prediction based on the previous results is, sad to say, that the newer ones won't be validated either.

This is the same kind of error that the BA had with his home computer; we never found a solution for it. Hopefully azazul's advice of rebooting will work. If not, and if you ask me to, I'll post over at Cruncher's Corner; maybe they know something. When I searched there for the BA, the only thing I found that seemed remotely plausible was overheating in the FPU, which makes the calculation produce an erroneous result without actually aborting it. Let us know how the temperatures are.

I hope this works out for you! And I agree that the ultimate solution to any DC problem is just to buy more processing power!
__________________
Ken

Visit the BAUT Team Page
BOINCview: Network control
and a superior UI!
  #62 (permalink)  
Old 25-June-2005, 01:27 PM
Minbari Minbari is offline
Junior Member
 
Join Date: Nov 2004
Posts: 98
Default

Hmmm, sadly this has occurred for 80% of the work units for this machine since the 16th of June, all with a Validate error ("The result was reported but could not be validated, typically because the output files were lost on the server."), yet there were 2 successful units completed this morning.

Thanks for the tips. The machine gets booted twice a day on schedule, and the temperature of both CPUs sits at around 44 Celsius, which is well within nominal range.

99% of work units prior to June 16th are tagged successful.

The only thing changed on this computer
http://einstein.phys.uwm.edu/results.php?hostid=257202

since that day may have been the firewall, though I think this was later.

If this continues, I will jump on the EH forum and try to hunt down why this may be happening all of a sudden.
  #63 (permalink)  
Old 25-June-2005, 02:01 PM
Klausnh Klausnh is offline
Established Member
 
Join Date: Feb 2003
Posts: 418
Default

Quote:
Originally Posted by Minbari
Hmmm, sadly this has occurred for 80% of the work units for this machine since the 16th of June, all with a Validate error ("The result was reported but could not be validated, typically because the output files were lost on the server."), yet there were 2 successful units completed this morning.

Thanks for the tips. The machine gets booted twice a day on schedule, and the temperature of both CPUs sits at around 44 Celsius, which is well within nominal range.

99% of work units prior to June 16th are tagged successful.

The only thing changed on this computer
http://einstein.phys.uwm.edu/results.php?hostid=257202

since that day may have been the firewall, though I think this was later.

If this continues, I will jump on the EH forum and try to hunt down why this may be happening all of a sudden.
Try here from Bernd Machenschalk
Quote:
The CPU chip gets hot at the spot where the most energy is needed. When it gets too hot, it first breaks the results of the unit that is located there. If an integer unit gives false results, this will soon end in a crash of the program or the OS, e.g. because of wrong memory address calculations. If it's the FPU that gets too hot, you will notice nothing of it while the program runs until you take a close look at the results.
__________________
"To excel in physics is to embrace doubt while walking the winding road to clarity." - Brian Greene
  #64 (permalink)  
Old 25-June-2005, 05:07 PM
Ken Vogt Ken Vogt is offline
Established Member
 
Join Date: Nov 2004
Location: Bloomington, Indiana, USA
Posts: 425
Default

Quote:
Originally Posted by Klausnh
Try here from Bernd Machenschalk
Quote:
The CPU chip gets hot at the spot where the most energy is needed. When it gets too hot, it first breaks the results of the unit that is located there. If an integer unit gives false results, this will soon end in a crash of the program or the OS, e.g. because of wrong memory address calculations. If it's the FPU that gets too hot, you will notice nothing of it while the program runs until you take a close look at the results.
Klaus, that's exactly the post I found while looking for the BA's problem. Still, it's hard to see how the FPU could overheat while the overall temp was 44C. But I'm not as small as these chips.

Quote:
Originally Posted by Minbari
yet there where 2 successfull unis completed this morning.
minbari,

Remember, "success" does not mean "validated." There's no way your BOINC client can know in advance whether the WUs are correct, i.e., will later be validated by two other boxes. Unfortunate term, "success." Maybe they coulda used "Tentative Success"? Good luck over there; there's a lot of knowledge.

In the BA's case, it was more important for the BA to be the BA than to keep working this problem, so he stopped BOINCing on that box.
  #65 (permalink)  
Old 26-June-2005, 12:06 AM
Minbari Minbari is offline
Junior Member
 
Join Date: Nov 2004
Posts: 98
Default

Will have to wait and see. I know 100% for a fact this cannot be temperature related, as there are about 5 temperature probes in this box at all times. They all have max, min and mean recordings for every day and will alarm even if ambient is out of nominal range.

I am convinced this is firewall related, as all data passes via an active armour and gets checked before being passed off to the firewall system for filtering.

I am certainly not dropping the firewall for days to test this. So if it doesn't get fixed, I guess EH will be removed from proxima.

Was crunching fine and now it's not. The only difference on the system since these validation errors started occurring is the firewall.
  #66 (permalink)  
Old 26-June-2005, 12:15 AM
tmosher tmosher is offline
Established Member
 
Join Date: Jul 2003
Location: Savannah, Georgia - Down by the Sea
Posts: 2,265
Default

Quote:
Originally Posted by Ken Vogt

Nice round numbers all of them; and all of this with most of tmosher's boxes still in packing boxes in Savannah! So I echo Klaus on the good going to all team members, big time. =D>=D>=D>

Our RAC is now about 500 above them, so we might just stay this time.

Thanks also to Klaus, and a big Aw Shucks, for the kind words. Somehow, I can't figure out how to edit out the large type from his quote, just at the moment.8)
One has gone unstable (Duron) and one is sitting in the hotel running again (wireless network). The other two are sitting in a storage shed with the rest of my possessions, including baby (my K75C motorcycle). Finding a place to live is a pain in the butt. Property managers here are as helpful as an Intel 8088 running Windows 3.11.

I'll probably replace the flaky Duron machine with an Athlon 64 once I get a couple of paychecks in my pocket from work (first paycheck is next Thursday).

I should have four systems running 24/7 in a week or two.
__________________
I feel a hot wind on my shoulder
And the touch of a world that is older
  #67 (permalink)  
Old 26-June-2005, 01:01 AM
gopher65 gopher65 is offline
Established Member
 
Join Date: Feb 2005
Location: Saskatoon, Saskatchewan, Canada
Posts: 422
Default

To me that sounds like corrupt files, bad memory, or a slightly damaged (but still mostly functional) hard drive. Maybe uninstall BOINC and Einstein@home, defrag, and then reinstall them. If that doesn't work, do a complete system wipe and reinstall everything. If that doesn't work, swap out the memory, and if it still doesn't work, then the HD. Then, if none of that works, you can beat me to death with a wiffle bat for telling you to do all that unnecessary work when it was actually a faulty power supply.
  #68 (permalink)  
Old 26-June-2005, 01:35 AM
azazul azazul is offline
Established Member
 
Join Date: Jan 2004
Location: Rio Hondo, TX
Posts: 368
Default

Quote:
Originally Posted by Minbari
I am convinced this is firewall related as all data passes via an active armour and gets checked before being passed off to the firewall system for filtering.
What kind of firewall is it? Can you open ports so that the data is untouched through the firewall?
__________________
www.csphysmath.com
BAUT Team Stats
  #69 (permalink)  
Old 26-June-2005, 02:39 AM
Klausnh Klausnh is offline
Established Member
 
Join Date: Feb 2003
Posts: 418
Default

Quote:
Originally Posted by Minbari
Will have to wait and see. I know 100% for a fact this cannot be temperature related, as there are about 5 temperature probes in this box at all times. They all have max, min and mean recordings for every day and will alarm even if ambient is out of nominal range.

I am convinced this is firewall related, as all data passes via an active armour and gets checked before being passed off to the firewall system for filtering.

I am certainly not dropping the firewall for days to test this. So if it doesn't get fixed, I guess EH will be removed from proxima.

Was crunching fine and now it's not. The only difference on the system since these validation errors started occurring is the firewall.


Error is
Quote:
Validate state Checked, but no consensus yet
From Wikipedia
Quote:
Checked, but no consensus yet
There have been several Results returned, and a Validation was attempted, but it was not possible to form a Quorum of Results. If this occurs, the current Results should be marked with this state and additional Results issued if the maximum number of erroneous Results has not been reached.
If, at the time of the last Validation attempt, the Work Unit had accumulated the maximum number of erroneous Results allowed, then all of the current Results will be tagged with "Invalid".
As I understand this, all the results will either be marked invalid or all valid.
If you want, I can post your problem at the Einstein@home message boards and get an explanation before you remove EH. I'm really curious and would like to understand what the problem is.
  #70 (permalink)  
Old 26-June-2005, 03:35 AM
Ken Vogt Ken Vogt is offline
Established Member
 
Join Date: Nov 2004
Location: Bloomington, Indiana, USA
Posts: 425
Default

Quote:
Originally Posted by Minbari
Thanks Azazul, einstein doesn't like sending completed work via my new firewalls; it keeps on malforming the request, spitting out an incomplete request message. Though there are no problems downloading new data to crunch when it needs, so I have to open up the firewall for a few seconds each day to push the completed data sets and some other bits of spurious traffic.
Forgive me, minbari, for having accused your computer of overheating. :oops: I meant to say that this was the only thing I could find over at CC that I could see might possibly bear on the problem. :-? It is now clear that this is not the problem. Sorry.

But here's what I'm not seeing about the firewall being the culprit: As I understand it the sequence goes:

1) Firewall full on
2) WU crunches away, completes, and local client sees computation as success. All without any attempt to open a port.
3) Client tries to upload the WU, is told by firewall port is blocked, client backs off, rinse, repeat.
----So far everything normal, firewall and client behaving impeccably, just
----as one would expect on dial up, say.

4) Firewall is manually turned off
5) Update command is issued manually
6) Client uploads the WU
7) Firewall turned back on
8) Much later, during verification, WU is found invalid.

Note first that "invalid" does not necessarily mean "corrupt," although it could; and maybe Gopher's ideas may help. But "invalid" is more likely to mean, IMO, that the WU calculated "42.1" for the answer when the other 3 computers agreed on "42" (to oversimplify).

But my main question is, how could it be the firewall that caused the invalidity/corruption, when FW was turned off when the transfer took place?

In going through this sequence, which I hope is fair, I guess it's possible that somehow in the very act, at step 3), of trying to upload the WU the firewall corrupted the WU. Maybe the active armor, which I don't understand.

Again, I sure could be wrong, but this is the only step where I can see a possible interaction between FW and client. If this is the case, you can maybe confirm or refute it this way. (Forgive me if you have it set this way already):

In general preferences, change "Confirm before connecting to Internet?" to "Yes." Then step 3 never happens. You will have to answer no to a lot of prompts to connect, but this is just an experiment. When you are ready to connect, start at step 4. Then with the FW off, answer yes when the client asks to connect. Mark down the number of the WU ID that gets sent in this way to see if it gets verified eventually. The WU will show up almost immediately in your list as "Pending." But actual verification will take some days. Meantime, you may as well stop running E@H on the box, unless you want to try some other remedy.

So then, if that test WU is verified, the firewall was corrupting the results, and so, unless you want to change firewalls etc, the simplest thing is to uninstall BOINC on that machine. It is supposed to be fun after all, not worth running into a brick wall, etc.

But if that test WU also comes back "Invalid" it seems to me you can rule out the FW as the problem, because it could never have touched that WU, since no requests were ever made to it to open ports, and no data was presented to it, while the WU was on your box, and the FW was disabled when you finally did send it.

So all you gain, in this second case, is one less possible culprit; so you still might well decide it's not worth looking for others. We would all support you if you decide to uninstall at this point, possibly reserving the possibility of trying the same setup when orbit@home comes on line, or even running a few seti WUs to see if the problem is with the einstein app, which of course, it certainly might be.

I'm sorry if I've said obvious things; I can only speak at my own low level of technical understanding; none of it meant to be condescending, or anything, only helpful.
  #71 (permalink)  
Old 26-June-2005, 03:56 AM
Ken Vogt Ken Vogt is offline
Established Member
 
Join Date: Nov 2004
Location: Bloomington, Indiana, USA
Posts: 425
Default

Quote:
Originally Posted by Klausnh
As I understand this, all the results will either be marked invalid or all valid.
Klaus,

I understand this a bit differently:

You can refer to the WU of Minbari's I was talking about above. 257202 is Minbari's box.

When the first result is uploaded, there is (obviously!) no way to know whether the result it has calculated is correct. (Otherwise, no project!)

When the second one comes in, there is likewise no way of knowing which, if any, is correct.

Now the third one comes in, and in the case above, doesn't agree with Minbari, but does agree with the second box.

This looks like bad news for Minbari, but the protocol requires 3 confirmations (the quorum), so the WU is sent to a 4th box, 249347. (Actually, Minbari's WU was the third to be received, but the argument is the same.)

The 4th result comes back and turns out to agree with the other 2, and so those 3 computers are marked valid, and get credit, while minbari is out of luck.

The fifth computer would have been needed if the 4th one had not agreed with the other 2.

So eventually 3 boxes do agree in most all cases; but the only way all results would be marked invalid is if they keep sending WUs and all of them keep disagreeing among themselves. At some point the algorithm will just discard the WU.

But more likely is Minbari's situation of 3 out of 4 agreeing.
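
The counting described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the actual BOINC validator: the function name `check_quorum`, the host labels, the tolerance, and the cap of 5 results are all assumptions made for the example. The quorum of 3 and the "42" versus "42.1" values come from the posts in this thread.

```python
def check_quorum(results, quorum=3, max_results=5, tolerance=1e-9):
    """Classify each result as 'valid', 'invalid' or 'pending'.

    results: list of (host, value) pairs in the order they were returned.
    quorum, max_results and tolerance are illustrative assumptions.
    """
    # Group hosts whose values agree with each other within the tolerance.
    groups = []  # each entry: (representative value, [hosts that agree with it])
    for host, value in results:
        for representative, hosts in groups:
            if abs(value - representative) <= tolerance:
                hosts.append(host)
                break
        else:
            groups.append((value, [host]))

    # A quorum exists as soon as one group is large enough.
    agreeing = next((hosts for _, hosts in groups if len(hosts) >= quorum), None)
    if agreeing is not None:
        return {host: "valid" if host in agreeing else "invalid"
                for host, _ in results}

    # No quorum and no more copies will be sent out: everything is discarded.
    if len(results) >= max_results:
        return {host: "invalid" for host, _ in results}

    # Otherwise another copy of the WU has to go out to one more box.
    return {host: "pending" for host, _ in results}


if __name__ == "__main__":
    first_three = [("box_1", 42.0), ("box_2", 42.0), ("minbari", 42.1)]
    print(check_quorum(first_three))
    print(check_quorum(first_three + [("box_4", 42.0)]))
```

With three results in, only two agree, so everything stays pending. Once the fourth box returns a matching value, the three agreeing boxes are marked valid and the odd one out is marked invalid, which is the 3-out-of-4 situation described above.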
  #72 (permalink)  
Old 26-June-2005, 11:57 AM
Klausnh Klausnh is offline
Established Member
 
Join Date: Feb 2003
Posts: 418
Default

Quote:
Originally Posted by Ken Vogt
Quote:
Originally Posted by Klausnh
As I understand this, all the results will either be marked invalid or all valid.
Klaus,

So eventually 3 boxes do agree in most all cases; but the only way all results would be marked invalid is if they keep sending WUs and all of them keep disagreeing among themselves. At some point the algorithm will just discard the WU.

But more likely is Minbari's situation of 3 out of 4 agreeing.
What I don't understand is why wouldn't Minbari's result just be marked "invalid" instead of leaving it in the unresolved state of "Validate state Checked, but no consensus yet"?
  #73 (permalink)  
Old 26-June-2005, 12:55 PM
Ken Vogt Ken Vogt is offline
Established Member
 
Join Date: Nov 2004
Location: Bloomington, Indiana, USA
Posts: 425
Default

A warm welcome to our newest team member Walter Williams =D> . He is member #75 for Einstein@Home, joining with 300 points already, and #21 for orbit@home.

Let us know if there's anything you need.

**********
I can't resist posting another clip from a post by Pasquale Tricarico about the science of orbit@home. The whole thread---still only a five post read!---is here.
Quote:
Originally Posted by Pasquale Tricarico
[data input I] - the data relative to the asteroids is already collected by the Minor Planet Center. Every observer is asked to send his observations there, and then MPC processes them, and publishes the Minor Planet Electronic Circulars, or MPECs. The latest MPECs are available here. So as a beginning, orbit@home will monitor these files. Every time a new MPEC is published, the data is collected and processed by o@h, creating WUs, waiting for results, and then updating the local database. Orbit@home doesn't need to recruit astronomers, or have any particular connection with observatories. All the data available will be processed by o@h, without limits on country or team. Backyard astronomers are welcome, but first they should go through the process of getting an MPC code, which also certifies the quality of the data (see MPC website). This is really needed, otherwise most of their data would be rejected because it is not accurate enough, and rejection usually means more computing time (you try to use that data, then decide to reject it, and compute again).

[data input II] - on the long term, it will be possible to accept data directly from observers. This to speed up the analysis process, and get results faster. So I believe that in special cases, observers will be happy to send their observations directly to o@h (and of course to MPC too).
************

This BABB thread will soon, if it hasn't already, get to the length where it is unreasonable to read it before posting to it. I've said before that this is fine with me; there has never been a post by anyone on any of the team threads even obliquely suggesting you should "search first."

But the speed with which topics move down the queue means that items like the scientific posts are not seen by so many people. So in future, I at least will be posting such things to threads like "orbit@home science," at the BABB section of mickal555's site, with just links here. I know this will be a great relief to many. Anyway, you are cordially invited down there every couple of weeks or so to see what's new; and of course you are more than welcome to add to and/or comment on such posts.

A similar invite applies to this poll down there about whether the orbit@home team will hurt the Einstein@Home team.

Opinions on all such things are still welcome in this thread of course, but some may find the multiple thread format at the Shack more congenial.
  #74 (permalink)  
Old 26-June-2005, 01:12 PM
Ken Vogt Ken Vogt is offline
Established Member
 
Join Date: Nov 2004
Location: Bloomington, Indiana, USA
Posts: 425
Default

Quote:
Originally Posted by Klausnh
What I don't understand is why wouldn't Minbari's result just be marked "invalid" instead of leaving it in the unresolved state of "Validate state Checked, but no consensus yet"?
Klaus, I think that is just poor wording on that page? In the detail page for the particular WU, the wording used in the Outcome column is "Validate error." If you click on the "explain" link, it says:
Quote:
Validate error The result was reported but could not be validated...
Fairly definitive?

It is clear from the WU detail that in fact a consensus has been reached, and Minbari's result is not part of it, and didn't get credit, so the wording you mention is just plain wrong. Maybe the software has a bug where it doesn't update that text when the final quorum is reached?

Apologies to Minbari, I believe I've been not capitalizing his nick in places, I'll try to do better.
  #75 (permalink)  
Old 26-June-2005, 02:19 PM
Minbari Minbari is offline
Junior Member
 
Join Date: Nov 2004
Posts: 98
Default

:oops: Oh no, what you suggest is a perfectly plausible idea, Ken; any suggestion or possible angle to look at is a good one as far as I am concerned. I just thought I would make it clear that this is probably not the cause so as to be able to continue the process of elimination.

I see there are two WUs that have succeeded in the last day, which makes this ever more elusive. The HDDs (SATA II) are in perfect order; drive temps report 38 Celsius and they only ever get loaded with the occasional paging, so fragmentation is not an issue.

You have all been most helpful indeed, I will of course try to carry on this spirit where possible.
  #76 (permalink)  
Old 26-June-2005, 04:11 PM
Klausnh Klausnh is offline
Established Member
 
Join Date: Feb 2003
Posts: 418
Default

Quote:
Originally Posted by Ken Vogt

It is clear from the WU detail that in fact a consensus has been reached, and Minbari's result is not part of it, and didn't get credit, so the wording you mention is just plain wrong. Maybe the software has a bug where it doesn't update that text when the final quorum is reached?
Yeah, that makes sense. Maybe the bug needs to be reported to E@H
  #77 (permalink)  
Old 26-June-2005, 04:32 PM
Ken Vogt Ken Vogt is offline
Established Member
 
Join Date: Nov 2004
Location: Bloomington, Indiana, USA
Posts: 425
Default

Quote:
Originally Posted by Minbari
I see there are two WUs that have succeeded in the last day, which makes this ever more elusive.
No kidding, Minbari! I mean, I'm very happy these succeeded, overjoyed, to tell the truth, but that makes this into an intermittent problem as well as a tough one; not a good combination. (If interested, you can follow the saga here.)

It does seem to exonerate the firewall from corrupting data though? A good thing? :-?

I had meant to ask previously if there were ever any unexpected error messages in the BOINC manager when it tried to upload WUs against the firewall?

There is a logging program for messages on the project sites, which might be helpful to track down any differences between a successful upload and one which will eventually be invalid. Immediate link to download is here: boinclogger.zip.

One other exotic possibility I thought of is an actual error in the FPU itself---not heat related. It's rare, but microchips can sometimes fail. And maybe even intermittently? A cursory google shows there might be tests for such things, and this site has a scary first paragraph, but I didn't locate a real easy-to-use FPU test for a modern desktop machine.
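
A very rough consistency check along those lines can be sketched in Python. It is not a real FPU diagnostic and does not come from any of the sites mentioned: it simply repeats the same deterministic floating-point workload and compares the results bit for bit, on the assumption that a marginal FPU would occasionally give a different answer for identical inputs. The function names and the workload are made up for illustration.

```python
import math


def workload(n=20000):
    """A fixed, deterministic floating-point calculation (arbitrary choice)."""
    total = 0.0
    for k in range(1, n + 1):
        total += math.sin(k) * math.sqrt(k) / (1.0 + math.log(k))
    return total


def count_mismatches(runs=20):
    """Repeat the workload and count runs whose result differs from the first.

    float.hex() gives an exact text form of the value, so the comparison
    is bit for bit rather than approximate.
    """
    reference = workload().hex()
    return sum(1 for _ in range(runs) if workload().hex() != reference)


if __name__ == "__main__":
    print("mismatching runs:", count_mismatches())
```

Healthy hardware should report zero mismatching runs. A clean run proves little, though, since this load is far lighter than a real work unit and an intermittent fault may only show up under sustained heat or on the other CPU.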

I would think though that predictor@home would use mostly integer arithmetic, since it would seem to be a combinatorial task? We are all hoping that the errors have vanished, never to return, but if it does start spitting them out again, seeing if predictor@home gives the same errors could maybe be a quick and dirty way to isolate the FPU as a factor.

The problem is sure made all the tougher by having to wait a week to know about success or failure.

IOW, all I have to say is, Glad It's Not Me, Minbari. Good luck and best wishes.
  #78 (permalink)  
Old 26-June-2005, 04:42 PM
Ken Vogt Ken Vogt is offline
Established Member
 
Join Date: Nov 2004
Location: Bloomington, Indiana, USA
Posts: 425
Default

Quote:
Originally Posted by Klausnh
Yeah, that makes sense. Maybe the bug needs to be reported to E@H
Go for it Klaus! Really, those result pages are very confusing. I know back in the day, that confusion contributed a little to the, um, misunderstanding I had with them over the Will I Ever Get Credit For This WU issue. Anything that makes these pages give consistent signals would be a big help to the whole community.
  #79 (permalink)  
Old 26-June-2005, 09:50 PM
Klausnh Klausnh is offline
Established Member
 
Join Date: Feb 2003
Posts: 418
Default

Quote:
Originally Posted by Ken Vogt
It is clear from the WU detail that in fact a consensus has been reached, and Minbari's result is not part of it, and didn't get credit, so the wording you mention is just plain wrong. Maybe the software has a bug where it doesn't update that text when the final quorum is reached?
You were right
Quote:
Originally Posted by Bernd Machenschalk
Hm, the validate state is actually "validate error", meaning that the validator was unable to check whether the Result is valid or not - for whatever reason. The "Checked, but no consensus yet" is probably a misleading error message. Thanks for pointing us to it, we'll check the validator.
  #80 (permalink)  
Old 26-June-2005, 11:31 PM
Ken Vogt Ken Vogt is offline
Established Member
 
Join Date: Nov 2004
Location: Bloomington, Indiana, USA
Posts: 425
Default

Klaus,

Thanks for doing that...good work!

I'm very envious of your concision: 2 posts totalling 3 lines and they see the problem clearly.

I woulda taken 4 paragraphs just to state the problem.

You are the official BABB Envoy to the Projects now, my friend!
  #81 (permalink)  
Old 27-June-2005, 12:09 AM
Minbari Minbari is offline
Junior Member
 
Join Date: Nov 2004
Posts: 98
Default

Thanks Ken. You're right in that it is becoming a curly problem. As far as tests go, this machine went through extensive tests during the build phase, i.e. specview and sciencemark tests for 3 days, and produced zero errors with astonishing speed results, so I am inclined to look past the FPU as being at fault (good suggestion though).

Also, as you can see from the list of results, proxima worked on EH flawlessly until the 16th of June, as stated back a page or 2 ago.

I will see how things progress over the next week. I don't have much time to contribute to fixing this problem, so there are no guarantees that any future Opterons in the farm can be added later.

Thanks for all your help again, it's always appreciated.
  #82 (permalink)  
Old 27-June-2005, 01:07 AM
gopher65 gopher65 is offline
Established Member
 
Join Date: Feb 2005
Location: Saskatoon, Saskatchewan, Canada
Posts: 422
Default

Maybe you could ask the BA to create a forum for BOINC on these boards. There is always something BOINCish to talk about, and more projects are created all the time.
  #83 (permalink)  
Old 27-June-2005, 03:30 AM
Ken Vogt Ken Vogt is offline
Established Member
 
Join Date: Nov 2004
Location: Bloomington, Indiana, USA
Posts: 425
Default

Quote:
Originally Posted by gopher65
Maybe you could ask the BA to create a forum for BOINC on these boards. There is always something BOINCish to talk about, and more projects are created all the time.
[Edited:] There's no question such a section would be helpful to the cause, and I appreciate the suggestion, gopher.
But even if the BA would go for it, I think it's a bit out of character with the other sections here? Plus opening it up for other worthwhile topics to ask for similar space? Other opinions?

Still BAD BOINC does have a certain ring to it, I'll admit. Reminds me of my youth.
  #84 (permalink)  
Old 28-June-2005, 12:12 AM
Ken Vogt Ken Vogt is offline
Established Member
 
Join Date: Nov 2004
Location: Bloomington, Indiana, USA
Posts: 425
Default Einstein@Home S4 data

I've been getting errors the last half hour while trying to post here, so here's a link to a Scotsons Shack post about the new S4 data set we are now starting to crunch.

Aha, working now.

****************
Also, BABB just moved into 21st place in RAC. Good going All! =D> =D>
  #85 (permalink)  
Old 28-June-2005, 06:20 AM
Minbari Minbari is offline
Junior Member
 
Join Date: Nov 2004
Posts: 98
Default

Sheesh, turned the firewall off and results are getting through, but now I am getting spurious errors relating to other things.

Like the one that gets my goat: EH will not send any more units to me as it is now saying my daily quota has been reached LOL.

My daily quota is 2 when this machine can push out a max of 8; I don't understand this silly scheduler any more.

If any machine should get errors it is the AMD Athlon 2700+ that has run all day at 69 Celsius for 6 months without missing a beat, so go figure.

Anyways I will persist for a little longer. The firewall was Nvidia's ForceWare with armour guard, which I have now changed over.

Einstein errors follow:
===================
28/06/2005 3:12:56 PM||Remote control not allowed; using loopback address
28/06/2005 3:12:56 PM|Einstein@Home|Resuming computation for result H1_0407.0__0407.1_0.1_T23_Fin1_0 using einstein version 4.79
28/06/2005 3:12:56 PM|Einstein@Home|Deferring communication with project for 19 hours, 45 minutes, and 38 seconds
28/06/2005 3:12:56 PM||Insufficient work; requesting more
28/06/2005 3:12:56 PM|orbit@home|Deferring communication with project for 22 hours, 29 minutes, and 40 seconds
28/06/2005 3:13:22 PM||request_reschedule_cpus: project op
28/06/2005 3:13:22 PM|Einstein@Home|Sending scheduler request to http://einstein.phys.uwm.edu/EinsteinAtHome_cgi/cgi
28/06/2005 3:13:22 PM|Einstein@Home|Requesting 8640 seconds of work, returning 1 results
28/06/2005 3:13:24 PM|Einstein@Home|Scheduler request to http://einstein.phys.uwm.edu/EinsteinAtHome_cgi/cgi succeeded
28/06/2005 3:13:24 PM|Einstein@Home|Message from server: No work sent
28/06/2005 3:13:24 PM|Einstein@Home|Message from server: (reached daily quota of 2 results)
  #86 (permalink)  
Old 28-June-2005, 06:38 AM
Minbari Minbari is offline
Junior Member
 
Join Date: Nov 2004
Posts: 98
Default

I just rebooted the system and it still says I have reached the daily quota limit of 2 requests; this is madness. So I finish the last unit in about 30 minutes and have to wait 18 hours until I receive another 2 units to work on. Keh, this hasn't happened to me before on any of the machines I have set up and running for the last half a year.

The dual-proc Opteron is pushing out less than a Celeron at the moment. I know it ain't the machine, so there must be an issue at EH with Opterons in the last weeks or something.

I have no control of EH scheduling, so my hands are tied until EH does something to fix this mess up.

Also, I noted that an update to the latest BOINC causes a WU in progress to terminate prematurely if one clicks to display the graphic. I did this just before and noted the error. LOL, the unit only had 5% to completion and now it's trashed 'cause of the silly graphic.

Sheesh, what a headache this is turning out to be. I'm starting to wonder if it's worth pursuing this any further.
===
28/06/2005 3:29:28 PM|Einstein@Home|Sending scheduler request to http://einstein.phys.uwm.edu/EinsteinAtHome_cgi/cgi
28/06/2005 3:29:28 PM|Einstein@Home|Requesting 8640 seconds of work, returning 0 results
28/06/2005 3:29:30 PM|Einstein@Home|Scheduler request to http://einstein.phys.uwm.edu/EinsteinAtHome_cgi/cgi succeeded
28/06/2005 3:29:30 PM|Einstein@Home|Message from server: No work sent
28/06/2005 3:29:30 PM|Einstein@Home|Message from server: (reached daily quota of 2 results)
28/06/2005 3:29:30 PM|Einstein@Home|No work from project
28/06/2005 3:29:31 PM|Einstein@Home|Deferring communication with project for 18 hours, 49 minutes, and 40 seconds
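
For anyone who wants to pull the numbers out of a message log like the one above, here is a rough Python sketch. It assumes the `date time|project|message` layout seen in the quoted lines and a day/month/year date; the names `parse_line`, `next_contact` and `daily_quota` are made up for illustration and are not part of BOINC. It works out when the client will next contact the project from a "Deferring communication" line, and reads the limit out of a "reached daily quota" message.

```python
import re
from datetime import datetime, timedelta

# Layout inferred from the log lines quoted in this thread (an assumption):
#   DD/MM/YYYY H:MM:SS AM|PM | project name (may be empty) | message text
LINE_RE = re.compile(
    r"^(\d{2}/\d{2}/\d{4} \d{1,2}:\d{2}:\d{2} [AP]M)\|([^|]*)\|(.*)$"
)
DEFER_RE = re.compile(
    r"Deferring communication with project for "
    r"(?:(\d+) hours?, )?(?:(\d+) minutes?, and )?(\d+) seconds?"
)
QUOTA_RE = re.compile(r"reached daily quota of (\d+) results")


def parse_line(line):
    """Split one log line into (timestamp, project, message), or None."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    stamp = datetime.strptime(m.group(1), "%d/%m/%Y %I:%M:%S %p")
    return stamp, m.group(2), m.group(3)


def next_contact(line):
    """Return (project, time of next contact) for a 'Deferring' line, else None."""
    parsed = parse_line(line)
    if parsed is None:
        return None
    stamp, project, message = parsed
    d = DEFER_RE.search(message)
    if d is None:
        return None
    hours, minutes, seconds = (int(g) if g else 0 for g in d.groups())
    return project, stamp + timedelta(hours=hours, minutes=minutes, seconds=seconds)


def daily_quota(line):
    """Return the quota named in a 'reached daily quota' line, else None."""
    parsed = parse_line(line)
    if parsed is None:
        return None
    q = QUOTA_RE.search(parsed[2])
    return int(q.group(1)) if q else None
```

Feeding it the last line of the log above (deferred for 18 hours, 49 minutes, and 40 seconds at 3:29:31 PM) gives a next contact a little after 10 the following morning, in whatever time zone the log was written in.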
  #87 (permalink)  
Old 28-June-2005, 07:03 AM
Ken Vogt Ken Vogt is offline
Established Member
 
Join Date: Nov 2004
Location: Bloomington, Indiana, USA
Posts: 425
Default

Hi Minbari,

You have been bitten by the "Ghost WU" bug; fortunately not serious, AFAIK.

If you look at the results for your computer here, you'll notice some completely spurious WUs that were all said to be downloaded at around 3:45. These would never have appeared on your "Work" tab, and they report a "client error" in downloading. But the scheduler thinks it has downloaded them, and so reports you've exceeded your daily quota.
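
To make the bookkeeping concrete, here is a toy Python model of what I think is going on. It is only a sketch of the idea, not how the real scheduler is written; the class `QuotaModel` and the WU names are invented for illustration. The point is that the server counts what it believes it sent, not what actually arrived.

```python
from dataclasses import dataclass, field


@dataclass
class QuotaModel:
    """Toy model: the server counts results it *believes* it sent."""
    daily_quota: int
    server_sent: list = field(default_factory=list)    # server-side record
    client_received: set = field(default_factory=set)  # what really arrived

    def request_work(self, name, download_ok=True):
        # The quota check looks only at the server-side record.
        if len(self.server_sent) >= self.daily_quota:
            return "No work sent (reached daily quota of %d results)" % self.daily_quota
        self.server_sent.append(name)
        if download_ok:
            self.client_received.add(name)
        return "sent " + name

    def ghosts(self):
        """Results the server counted but the client never got."""
        return [n for n in self.server_sent if n not in self.client_received]


model = QuotaModel(daily_quota=2)
model.request_work("wu_a", download_ok=False)  # download fails: a ghost
model.request_work("wu_b", download_ok=False)  # another ghost
print(model.request_work("wu_c"))              # refused: quota already used up
print(model.ghosts())
```

With a quota of 2, two failed downloads are enough to lock the box out for the day even though it has received nothing.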

As I remember the discussion at CC, this usually resolves itself when the next day rolls round and fresh, real WUs are downloaded. I don't recall ATM whether midnight is reckoned in GMT or local time.

If this doesn't work, I believe the solution is to nuke the master file containing the WUs. I wouldn't swear to this and am too tired ATM to check it; I will try tomorrow to be sure. You would shut down BM and delete (or rename for safety) the file h1_0407.0 from the einstein_4.79 directory. Restarting the client should then download a new master file, which would not have the ghost WU bug.

The file h1_0407.0 is one of the new S4 series, as you can see from clicking on any of the "Result IDs" in the far left column. So it's possible that this is a new problem with them, but it looks like ghost WUs to me.

And there is no question that the scheduler, to put it kindly, is not totally bug free.

Will check again tomorrow, especially if you still aren't getting work; there's nothing worse than having a box all fired up with nothing to do.

Once o@H comes on, this will be easier for those of us running both: it's unlikely both projects will be down or having errors at the same time. Then we will just have to worry about the scheduler negotiating the share properly.

Hope this helps for now; best I can do.
__________________
Ken

  #88 (permalink)  
Old 28-June-2005, 07:10 AM
Minbari Minbari is offline
Junior Member
 
Join Date: Nov 2004
Posts: 98
Default

Thanks for the feedback, Ken. I will sit and wait for the rollover before I try anything else.

I'm having some bad luck with EH on Proxima the last few weeks. Still the show must go on.

=D>
  #89 (permalink)  
Old 28-June-2005, 12:07 PM
Klausnh Klausnh is offline
Established Member
 
Join Date: Feb 2003
Posts: 418
Default

Problem with new h1 (lower case) WUs.
Quote:
Originally Posted by Bruce Allen
After some discussions with David Anderson, I've taken the simple way out. I've cancelled the workunits with names that start "h1_" (NOTE: this is case sensitive, work starting "H1_" is NOT cancelled).

I've also removed the problematic h1_XXXX.X data files from the download servers. After these changes propagate to the data server mirrors (15 to 30 minutes) this should generate hard download errors for any client that attempts these WU.

I'll rename the workunits and files using "w1" (w for Washington state, where the Hanford detector is located) and reissue them.

Apologies to everyone for this fiasco. It's my fault. Hopefully we can recover quickly.

Please feel free to manually abort any h1_ workunits. My apologies for wasted CPU cycles. Fortunately these workunits have only been out there for a half-day so this shouldn't be too severe.

Bruce
__________________
"To excel in physics is to embrace doubt while walking the winding road to clarity." - Brian Greene
  #90 (permalink)  
Old 28-June-2005, 01:06 PM
Ken Vogt Ken Vogt is offline
Established Member
 
Join Date: Nov 2004
Location: Bloomington, Indiana, USA
Posts: 425
Default

Klaus,

Thanks so much for this. From that thread:

Quote:
... I stopped BOINC, deleted the h1 file (lower case h), restarted BOINC, forced an update and got a new file ... and everything seems sweet again.

Reply by Dr Allen:

I'm glad this works. I think that this is probably the easiest procedure for most users.
To elaborate on the procedure quoted above:

There is no need to abort any running WUs first.

1) Exit the BOINCmanager, and the tray icon if it's still visible.

2) Delete any "h1_xxxx" files (lower-case "h") from your C:\program files\BOINC\projects\einstein.phys.uwm.edu\ folder.

3) Restart BOINCmanager. Any h1_xxxx WUs which were previously running will be marked 100%, "client error," and ready to report: their computation is automatically aborted by the deletion of the master file in step 2). Which is what you want.

4) Manually "update" from the projects tab. You may not need to do this since, upon finding no master file, the client may download one or two small WUs and resume crunching. Download of a new master file will then happen in accord with your "Connect to ..." cache settings. But at least one manual update is a good idea to be sure everything is running smoothly. Wait a minute before doing it to avoid any "deferring for 59 sec" messages.

(This will, or should, also solve Minbari's "Ghost WU" problem as well, though they are unrelated.)
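
If you'd rather script step 2) than hunt for the files by hand, something along these lines should do it. This is a minimal Python sketch under my own assumptions: the function name `sideline_h1_files` and the ".bad" suffix are invented, it renames instead of deleting (safer, and easy to undo), and you pass it whatever your own einstein.phys.uwm.edu project folder is. Exit BOINCmanager first, as in step 1).

```python
from pathlib import Path


def sideline_h1_files(project_dir, suffix=".bad"):
    """Rename files whose names start with lower-case 'h1_'.

    Returns the list of new paths. Files starting with upper-case 'H1_'
    are left alone, because str.startswith is case sensitive.
    """
    renamed = []
    for path in sorted(Path(project_dir).iterdir()):
        if not path.is_file():
            continue
        if path.name.startswith("h1_") and not path.name.endswith(suffix):
            target = path.with_name(path.name + suffix)
            path.rename(target)
            renamed.append(target)
    return renamed
```

Call it with the folder from step 2) and check the returned list before restarting BOINCmanager; once the new WUs are validating you can delete the renamed files for good.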

Repeating, there is no reason to continue with the h1_xxxx files, as all such WUs will either fail locally or upon verification.

Once this is done, you will of course have a few "client errors" when you view your computers under Your Account. It's a good idea to check back there in a day or two to be sure that the new WUs are getting credit.

Here's the latest quote from the E@H frontpage:
Quote:
Originally Posted by E@H
June 28, 2005
A mistake was made in generating some of the S4 workunits. All workunits whose names start with 'h1_' (LOWER CASE!) have been cancelled. To save CPU time on host machines, any running 'h1_' (LOWER CASE ONLY!) work can be aborted 'by hand' if desired. Some detail is in this message board thread.
Again, the reason for the problem is that on Linux "h1_xxxx" and "H1_xxxx" are different filenames, whereas on Windows, where filenames are not case sensitive, they refer to the same file.
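
A quick way to see the clash is to group names while ignoring case. The Python sketch below is only an illustration: `case_collisions` is a made-up helper, and `str.casefold` is just an approximation of how Windows compares filenames.

```python
from collections import defaultdict


def case_collisions(names):
    """Return groups of names that differ only by letter case."""
    groups = defaultdict(set)
    for name in names:
        groups[name.casefold()].add(name)
    return [sorted(group) for group in groups.values() if len(group) > 1]


# Distinct on Linux, but the first two would be one file on Windows.
print(case_collisions(["h1_0407.0", "H1_0407.0", "w1_0407.0"]))
# -> [['H1_0407.0', 'h1_0407.0']]
```

The "w1_" name never collides with "H1_", which is why renaming the reissued files sidesteps the problem.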

Since these WUs have now been cancelled, I think that the problem will self-correct when your currently running WUs complete: the client will try to fetch more WUs from the master file, be informed that they are cancelled, and eventually download a new master file. If this is not what happens on your machine, that is, if the client persists in running more h1_xxxx WUs, please let me know, and I'll send a team email in a few hours recommending the above steps.


Apologies for this snafu.
__________________
Ken

All times are GMT. The time now is 12:38 PM.

