#61 (permalink)
25-June-2005, 12:18 PM
Ken Vogt
Senior Member
Join Date: Nov 2004
Location: Bloomington, Indiana, USA
Posts: 425

Hi minbari,

You are welcome; PT is incredibly accessible, as is Bruce Allen at E@H. We are very fortunate to be running projects for such people.

As to your machine's problem:

It is not that the WU is failing to upload correctly, but that the result disagrees with the other two results, and so you get no credit for it. See the detail for the most recent "validate error," WU #1373063.

The more recent results are marked "pending" and "success," but you won't get credit for them either if they aren't confirmed by 2 other boxes.

The most likely prediction based on the previous results is, sad to say, that the newer ones won't be validated either.

This is the same kind of error that the BA had with his home computer; we never found a solution for it. Hopefully azazul's advice of rebooting will work. If not, and if you'd like me to, I'll post over at Cruncher's Corner; maybe they know something. When I searched there for the BA, the only thing I found that seemed remotely plausible was overheating in the FPU, which causes the calculation to produce an erroneous result without actually aborting. Let us know how the temperatures are.

I hope this works out for you! And I agree that the ultimate solution to any DC problem is just to buy more processing power!
__________________
Ken

Visit the BAUT Team Page
BOINCview: Network control
and a superior UI!

#62 (permalink)
25-June-2005, 01:27 PM
Minbari
Member
Join Date: Nov 2004
Posts: 98

Hmmm, sadly this has occurred for 80% of the work units for this machine since the 16th of June, all with a Validate error ("The result was reported but could not be validated, typically because the output files were lost on the server."), yet there were 2 successful units completed this morning.

Thanks for the tips. The machine gets rebooted twice a day on schedule and the temperature of both CPUs sits at around 44 Celsius, which is well within nominal range.

99% of work units prior to June 16th are tagged successful.

The only thing changed on this computer (http://einstein.phys.uwm.edu/results.php?hostid=257202) since that day may have been the firewall, though I think this was later.

If this continues, I will jump on the EH forum and try to hunt down why this may be happening all of a sudden.

#63 (permalink)
25-June-2005, 02:01 PM
Klausnh
Senior Member
Join Date: Feb 2003
Posts: 365

Quote:
Originally Posted by Minbari
Hmmm, sadly this has occurred for 80% of the work units for this machine since the 16th of June, all with a Validate error ("The result was reported but could not be validated, typically because the output files were lost on the server."), yet there were 2 successful units completed this morning.

Thanks for the tips. The machine gets rebooted twice a day on schedule and the temperature of both CPUs sits at around 44 Celsius, which is well within nominal range.

99% of work units prior to June 16th are tagged successful.

The only thing changed on this computer (http://einstein.phys.uwm.edu/results.php?hostid=257202) since that day may have been the firewall, though I think this was later.

If this continues, I will jump on the EH forum and try to hunt down why this may be happening all of a sudden.
Try this, from Bernd Machenschalk:
Quote:
The CPU chip gets hot at the spot where the most energy is needed. When it gets too hot, it first breaks the results of the unit that is located there. If an integer unit gives false results, this will soon end in a crash of the program or the OS, e.g. because of wrong memory address calculations. If it's the FPU that gets too hot, you will notice nothing of it while the program runs until you take a close look at the results.
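
A rough illustration of Bernd's point, as a Python sketch rather than anything taken from the E@H application: flipping a single bit in a floating-point value just gives a slightly different, perfectly ordinary-looking number, while the same kind of flip in an integer used as an index fails at once. The `flip_bit` helper, the bit positions, and the sample values are all made up for the example.

```python
import struct

def flip_bit(x: float, bit: int) -> float:
    """Flip one bit in the 64-bit binary representation of a float."""
    (as_int,) = struct.unpack("<Q", struct.pack("<d", x))
    (flipped,) = struct.unpack("<d", struct.pack("<Q", as_int ^ (1 << bit)))
    return flipped

good = 42.0
bad = flip_bit(good, 40)  # corrupt one mantissa bit
print(good, bad)          # both are ordinary finite numbers; nothing aborts

data = [1.0, 2.0, 3.0]
index = 1 ^ (1 << 20)     # the same kind of flip in an integer used as an index
crashed = False
try:
    data[index]
except IndexError:
    crashed = True        # the integer error is noticed immediately
print("integer error noticed:", crashed)
```

The wrong floating-point value only shows up when someone compares it with another machine's answer, which is what the validation step does.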
__________________
"To excel in physics is to embrace doubt while walking the winding road to clarity." - Brian Greene

#64 (permalink)
25-June-2005, 05:07 PM
Ken Vogt
Senior Member
Join Date: Nov 2004
Location: Bloomington, Indiana, USA
Posts: 425

Quote:
Originally Posted by Klausnh
Try here from Bernd Machenschalk
Quote:
The CPU chip gets hot at the spot where the most energy is needed. When it gets too hot, it first breaks the results of the unit that is located there. If an integer unit gives false results, this will soon end in a crash of the program or the OS, e.g. because of wrong memory address calculations. If it's the FPU that gets too hot, you will notice nothing of it while the program runs until you take a close look at the results.
Klaus, that's exactly the post I found while looking for the BA's problem. Still, it's hard to see how the FPU could overheat while the overall temp was 44C. But I'm not as small as these chips.

Quote:
Originally Posted by Minbari
yet there were 2 successful units completed this morning.
minbari,

Remember, "success" does not mean "validated." There's no way your BOINC client can know in advance whether the WUs are correct, i.e., will later be validated by two other boxes. Unfortunate term, "success." Maybe they coulda used "Tentative Success"? Good luck over there; there's a lot of knowledge.

In the BA's case, it was more important for the BA to be the BA than to keep working this problem, so he stopped BOINCing on that box.
__________________
Ken

Visit the BAUT Team Page
BOINCview: Network control
and a superior UI!

#65 (permalink)
26-June-2005, 12:06 AM
Minbari
Member
Join Date: Nov 2004
Posts: 98

Will have to wait and see. I know 100% for a fact this cannot be temperature related, as there are about 5 temperature probes in this box at all times. They all have max, min and mean recordings for every day and will alarm even if ambient is outside nominal ranges.

I am convinced this is firewall related as all data passes via an active armour and gets checked before being passed off to the firewall system for filtering.

I am certainly not dropping the firewall for days to test this. So if it doesn't get fixed I guess EH will be removed from proxima.

Was crunching fine and now it's not. The only difference on the system since these validation errors started occurring is the firewall.

#66 (permalink)
26-June-2005, 12:15 AM
tmosher
Senior Member
Join Date: Jul 2003
Location: Savannah, Georgia - Down by the Sea
Posts: 2,265

Quote:
Originally Posted by Ken Vogt

Nice round numbers all of them; and all of this with most of tmosher's boxes still in packing boxes in Savannah! So I echo Klaus on the good going to all team members, big time.

Our RAC is now about 500 above them, so we might just stay this time.

Thanks also to Klaus, and a big Aw Shucks, for the kind words. Somehow, I can't figure out how to edit out the large type from his quote, just at the moment.
One has gone unstable (Duron) and one is sitting in the hotel running again (wireless network). The other two are sitting in a storage shed with the rest of my possessions including baby (my K75C motorcycle). Finding a place to live is a pain in the butt. Property managers here are as helpful as an Intel 8088 running Windows 3.11.

I'll probably replace the flaky Duron machine with an Athlon 64 once I get a couple of paychecks in my pocket from work (first paycheck is next Thursday).

I should have four systems running 24/7 in a week or two.
__________________
I feel a hot wind on my shoulder
And the touch of a world that is older

#67 (permalink)
26-June-2005, 01:01 AM
gopher65
Senior Member
Join Date: Feb 2005
Location: Saskatoon, Saskatchewan, Canada
Posts: 422

To me that sounds like corrupt files, bad memory, or a slightly damaged (but still mostly functional) hard drive. Maybe uninstall BOINC and Einstein@home, defrag, and then reinstall them. If that doesn't work, do a complete system wipe and reinstall everything. If that doesn't work, swap out the memory, and if it still doesn't work, then the HD. Then if none of that works you can beat me to death with a wiffle bat for telling you to do all that unnecessary work when it was actually a faulty power supply.

#68 (permalink)
26-June-2005, 01:35 AM
azazul
Senior Member
Join Date: Jan 2004
Location: Rio Hondo, TX
Posts: 368

Quote:
Originally Posted by Minbari
I am convinced this is firewall related as all data passes via an active armour and gets checked before being passed off to the firewall system for filtering.
What kind of firewall is it? Can you open ports so that the data is untouched through the firewall?
__________________
www.csphysmath.com
BAUT Team Stats

#69 (permalink)
26-June-2005, 02:39 AM
Klausnh
Senior Member
Join Date: Feb 2003
Posts: 365

Quote:
Originally Posted by Minbari
Will have to wait and see. I know 100% for a fact this cannot be temperature related, as there are about 5 temperature probes in this box at all times. They all have max, min and mean recordings for every day and will alarm even if ambient is outside nominal ranges.

I am convinced this is firewall related as all data passes via an active armour and gets checked before being passed off to the firewall system for filtering.

I am certainly not dropping the firewall for days to test this. So if it doesn't get fixed I guess EH will be removed from proxima.

Was crunching fine and now it's not. The only difference on the system since these validation errors started occurring is the firewall.


Error is
Quote:
Validate state Checked, but no consensus yet
From Wikipedia
Quote:
Checked, but no consensus yet
There have been several Results returned, and a Validation was attempted; but, it was not possible to form a Quorum of Results. If this occurs, the current Results should be marked with this state and additional Results issued if the maximum number of erroneous Results has not been reached.
If, at the time of the last Validation attempt, the Work Unit had accumulated the maximum number of erroneous Results allowed, then all of the current Results will be tagged with "Invalid".
As I understand this, all the results will either be marked invalid or all valid.
If you want, I can post your problem at the Einstein@home message boards and get an explanation before you remove EH. I'm really curious and would like to understand what the problem is.
__________________
"To excel in physics is to embrace doubt while walking the winding road to clarity." - Brian Greene

#70 (permalink)
26-June-2005, 03:35 AM
Ken Vogt
Senior Member
Join Date: Nov 2004
Location: Bloomington, Indiana, USA
Posts: 425

Quote:
Originally Posted by Minbari
Thanks Azazul, einstein doesn't like sending completed work via my new firewalls; it keeps on malforming the request, spitting out an incomplete request message. Though there are no problems downloading new data to crunch when it needs, so I have to open up the firewall for a few seconds each day to push the completed data sets and some other bits of spurious traffic.
Forgive me, minbari, for having accused your computer of overheating. I meant to say that this was the only thing I could find over at CC that I could see might possibly bear on the problem. It is now clear that this is not the problem. Sorry.

But here's what I'm not seeing about the firewall being the culprit: As I understand it the sequence goes:

1) Firewall full on
2) WU crunches away, completes, and local client sees computation as success. All without any attempt to open a port.
3) Client tries to upload the WU, is told by firewall port is blocked, client backs off, rinse, repeat.
---- So far everything normal, firewall and client behaving impeccably, just as one would expect on dial up, say.

4) Firewall is manually turned off
5) Update command is issued manually
6) Client uploads the WU
7) Firewall turned back on
8) Much later, during verification, WU is found invalid.

Note first that "invalid" does not necessarily mean "corrupt," although it could, and Gopher's ideas may help. But "invalid" is more likely to mean, IMO, that the WU calculated "42.1" for the answer when the other 3 computers agreed on "42" (to oversimplify).
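
To make that oversimplification concrete, here is a minimal Python sketch of comparing two results by value. The `results_agree` name and the tolerance are assumptions for illustration; the real E@H validator presumably compares far more than a single number.

```python
import math

def results_agree(a: float, b: float, rel_tol: float = 1e-5) -> bool:
    """Return True when two computed answers match within a relative tolerance."""
    return math.isclose(a, b, rel_tol=rel_tol)

print(results_agree(42.0, 42.0))  # two boxes that computed the same answer
print(results_agree(42.1, 42.0))  # a cleanly uploaded but slightly wrong answer
```

The point is that a result can arrive intact, with nothing corrupted in transit, and still fail this kind of comparison.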

But my main question is, how could it be the firewall that caused the invalidity/corruption, when FW was turned off when the transfer took place?

In going through this sequence, which I hope is fair, I guess it's possible that somehow in the very act, at step 3), of trying to upload the WU the firewall corrupted the WU. Maybe the active armor, which I don't understand.

Again, I sure could be wrong, but this is the only step where I can see a possible interaction between FW and client. If this is the case, you can maybe confirm or refute it this way. (Forgive me if you have it set this way already):

In general preferences, change "Confirm before connecting to Internet?" to "Yes." Then step 3 never happens. You will have to answer no to a lot of prompts to connect, but this is just an experiment. When you are ready to connect, start at step 4. Then with the FW off, answer yes when the client asks to connect. Mark down the number of the WU ID that gets sent in this way to see if it gets verified eventually. The WU will show up almost immediately in your list as "Pending." But actual verification will take some days. Meantime, you may as well stop running E@H on the box, unless you want to try some other remedy.

So then, if that test WU is verified, the firewall was corrupting the results, and so, unless you want to change firewalls etc, the simplest thing is to uninstall BOINC on that machine. It is supposed to be fun after all, not worth running into a brick wall, etc.

But if that test WU also comes back "Invalid" it seems to me you can rule out the FW as the problem, because it could never have touched that WU, since no requests were ever made to it to open ports, and no data was presented to it, while the WU was on your box, and the FW was disabled when you finally did send it.

So all you gain, in this second case, is one less possible culprit; so you still might well decide it's not worth looking for others. We would all support you if you decide to uninstall at this point, possibly reserving the possibility of trying the same setup when orbit@home comes on line, or even running a few seti WUs to see if the problem is with the einstein app, which of course, it certainly might be.

I'm sorry if I've said obvious things; I can only speak at my own low level of technical understanding; none of it meant to be condescending, or anything, only helpful.
__________________
Ken

Visit the BAUT Team Page
BOINCview: Network control
and a superior UI!

#71 (permalink)
26-June-2005, 03:56 AM
Ken Vogt
Senior Member
Join Date: Nov 2004
Location: Bloomington, Indiana, USA
Posts: 425

Quote:
Originally Posted by Klausnh
As I understand this, all the results will either be marked invalid or all valid.
Klaus,

I understand this a bit differently:

You can refer to the WU of Minbari's I was talking about above. 257202 is Minbari's box.

When the first result is uploaded, there is (obviously!) no way to know whether the result it has calculated is correct. (Otherwise, no project!)

When the second one comes in, there is likewise no way of knowing which, if any, is correct.

Now the third one comes in, and in the case above, doesn't agree with Minbari, but does agree with the second box.

This looks like bad news for Minbari, but the protocol requires 3 confirmations (the quorum), so the WU is sent to a 4th box, 249347. (Actually, Minbari's WU was the third to be received, but the argument is the same.)

The 4th result comes back and turns out to agree with the other 2, and so those 3 computers are marked valid, and get credit, while minbari is out of luck.

The fifth computer would have been needed if the 4th one had not agreed with the other 2.

So eventually 3 boxes do agree in almost all cases; the only way all results would be marked invalid is if they keep sending WUs and all of them keep disagreeing among themselves. At some point the algorithm will just discard the WU.

But more likely is Minbari's situation of 3 out of 4 agreeing.
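
The sequence above can be sketched in a few lines of Python. This is only my reading of the process, not the actual BOINC validator: the `validate` function, the box labels, the tolerance, and the `max_results` cutoff are all hypothetical, and each result is reduced to a single number.

```python
import math

def validate(results, quorum=3, rel_tol=1e-5, max_results=8):
    """Check a work unit's results for a quorum of matching answers.

    results: list of (box, value) pairs in the order they were received.
    Returns (state, valid_boxes, invalid_boxes).
    """
    for _, candidate in results:
        # Collect every box whose answer matches this candidate answer.
        agreeing = [box for box, value in results
                    if math.isclose(value, candidate, rel_tol=rel_tol)]
        if len(agreeing) >= quorum:
            invalid = [box for box, _ in results if box not in agreeing]
            return "consensus", agreeing, invalid
    if len(results) >= max_results:
        # Too many disagreeing results: give up on the work unit.
        return "invalid", [], [box for box, _ in results]
    # Not enough matching results yet: the WU goes out to another box.
    return "no consensus yet", [], []

received = [("box1", 42.0), ("box2", 42.0), ("minbari", 42.1)]
print(validate(received))                     # three results, only two agree
print(validate(received + [("box4", 42.0)]))  # the fourth result completes the quorum
```

With three results in, the state stays at "no consensus yet"; once the fourth box agrees with the first two, those three are marked valid and the odd one out is not.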
__________________
Ken

Visit the BAUT Team Page
BOINCview: Network control
and a superior UI!

#72 (permalink)
26-June-2005, 11:57 AM
Klausnh
Senior Member
Join Date: Feb 2003
Posts: 365

Quote:
Originally Posted by Ken Vogt
Quote:
Originally Posted by Klausnh
As I understand this, all the results will either be marked invalid or all valid.
Klaus,

So eventually 3 boxes do agree in almost all cases; the only way all results would be marked invalid is if they keep sending WUs and all of them keep disagreeing among themselves. At some point the algorithm will just discard the WU.

But more likely is Minbari's situation of 3 out of 4 agreeing.
What I don't understand is why wouldn't Minbari's result just be marked "invalid" instead of leaving it in the unresolved state of "Validate state Checked, but no consensus yet"?
__________________
"To excel in physics is to embrace doubt while walking the winding road to clarity." - Brian Greene

#73 (permalink)
26-June-2005, 12:55 PM
Ken Vogt
Senior Member
Join Date: Nov 2004
Location: Bloomington, Indiana, USA
Posts: 425

A warm welcome to our newest team member, Walter Williams. He is member #75 for Einstein@Home, joining with 300 points already, and #21 for orbit@home.

Let us know if there's anything you need.

**********
I can't resist posting another clip from a post by Pasquale Tricarico about the science of orbit@home. The whole thread---still only a five post read!---is here.
Quote:
Originally Posted by Pasquale Tricarico
[data input I] - the data relative to the asteroids is already collected by the Minor Planet Center. Every observer is asked to send his observations there, and then MPC processes them, and publishes the Minor Planet Electronic Circulars, or MPECs. The latest MPECs are available here. So as a beginning, orbit@home will monitor these files. Every time a new MPEC is published, the data is collected and processed by o@h, creating wus, waiting for results, and then updating the local database. Orbit@home doesn't need to recruit astronomers, or have any particular connection with observatories. All the data available will be processed by o@h, without limits on country or team. Backyard astronomers are welcome, but first they should go through the process of getting an MPC code, which also certifies the quality of the data (see MPC website). This is really needed, otherwise most of their data would be rejected because it is not accurate enough, and rejection usually means more computing time (you try to use that data, then decide to reject it, and compute again).

[data input II] - in the long term, it will be possible to accept data directly from observers. This is to speed up the analysis process, and get results faster. So I believe that in special cases, observers will be happy to send their observations directly to o@h (and of course to MPC too).
************

This BABB thread will soon, if it hasn't already, get to the length where it is unreasonable to read it before posting to it. I've said before that this is fine with me; there has never been a post by anyone on any of the team threads even obliquely suggesting you should "search first."

But the speed with which topics move down the queue means that items like the scientific posts are not seen by so many people. So in future, I at least will be posting such things to threads like "orbit@home science," at the BABB section of mickal555's site, with just links here. I know this will be a great relief to many. Anyway, you are cordially invited down there every couple of weeks or so to see what's new; and of course you are more than welcome to add to and/or comment on such posts.

A similar invite applies to this poll down there about whether the orbit@home team will hurt the Einstein@Home team.

Opinions on all such things are still welcome in this thread of course, but some may find the multiple thread format at the Shack more congenial.
__________________
Ken

Visit the BAUT Team Page
BOINCview: Network control
and a superior UI!

#74 (permalink)
26-June-2005, 01:12 PM
Ken Vogt
Senior Member
Join Date: Nov 2004
Location: Bloomington, Indiana, USA
Posts: 425

Quote:
Originally Posted by Klausnh
What I don't understand is why wouldn't Minbari's result just be marked "invalid" instead of leaving it in the unresolved state of "Validate state Checked, but no consensus yet"?
Klaus, I think that is just poor wording on that page? In the detail page for the particular WU, the wording used in the Outcome column is "Validate error." If you click on the "explain" link, it says:
Quote:
Validate error The result was reported but could not be validated...
Fairly definitive?

It is clear from the WU detail that in fact a consensus has been reached, and Minbari's result is not part of it, and didn't get credit, so the wording you mention is just plain wrong. Maybe the software has a bug where it doesn't update that text when the final quorum is reached?

Apologies to Minbari, I believe I've been not capitalizing his nick in places, I'll try to do better.
__________________
Ken

Visit the BAUT Team Page
BOINCview: Network control
and a superior UI!

#75 (permalink)
26-June-2005, 02:19 PM
Minbari
Member
Join Date: Nov 2004
Posts: 98

Oh no, what you suggest is a perfectly plausible idea, Ken; any suggestion or possible angle to look at is a good one as far as I am concerned. I just thought I would make it clear that this is probably not the cause so as to be able to continue the process of elimination.

I see there are two WUs that have succeeded in the last day, which makes this ever more elusive. The HDDs (SATA II) are in perfect order, drive temps report 38 Celsius, and they only ever get loaded with the occasional paging, so fragmentation is not an issue.

You have all been most helpful indeed, I will of course try to carry on this spirit where possible.

#76 (permalink)
26-June-2005, 04:11 PM
Klausnh
Senior Member
Join Date: Feb 2003
Posts: 365

Quote:
Originally Posted by Ken Vogt

It is clear from the WU detail that in fact a consensus has been reached, and Minbari's result is not part of it, and didn't get credit, so the wording you mention is just plain wrong. Maybe the software has a bug where it doesn't update that text when the final quorum is reached?
Yeah, that makes sense. Maybe the bug needs to be reported to E@H.
__________________
"To excel in physics is to embrace doubt while walking the winding road to clarity." - Brian Greene

#77 (permalink)
26-June-2005, 04:32 PM
Ken Vogt