Hi minbari,
You are welcome; PT is incredibly accessible (as is Bruce Allen at E@H; we are very fortunate to be running projects for such people). As to your machine's problem: it is not that the WU is failing to upload correctly, but that the result disagrees with the other two results, and so you get no credit for it. See the detail for the most recent "validate error," WU #1373063. The more recent results are marked "pending" and "success," but you won't get credit for them either if they aren't confirmed by 2 other boxes. The most likely prediction based on the previous results is, sad to say, that the newer ones won't be validated either. This is the same kind of error that the BA had with his home computer; we never found a solution for it. Hopefully azazul's advice of rebooting will work; if not, and you ask me to, I'll post over at Cruncher's Corner; maybe they know something. When I searched there for the BA, the only thing I found that seemed remotely plausible was overheating in the FPU, which causes the calculation to produce an erroneous result without actually aborting. Let us know how the temperatures are. I hope this works out for you! And I agree that the ultimate solution to any DC problem is to just buy more processing power!
|
|||
|
Hmmm, sadly this has occurred for 80% of the work units for this machine since the 16th of June, all with a Validate error ("The result was reported but could not be validated, typically because the output files were lost on the server."), yet there were 2 successful units completed this morning.
Thanks for the tips. The machine gets rebooted twice a day on schedule, and the temperature of both CPUs sits at around 44 Celsius, which is well within nominal range. 99% of work units prior to June 16th are tagged successful. The only thing changed on this computer (http://einstein.phys.uwm.edu/results.php?hostid=257202) since that day may have been the firewall, though I think this was later. If this continues, I will jump on the EH forum and try to hunt down why this may be happening all of a sudden.
__________________
http://boincwapstats.sourceforge.net.../style:2/p.png |
|
|||
|
Quote:
Quote:
__________________
"To excel in physics is to embrace doubt while walking the winding road to clarity." - Brian Greene |
|
||||
|
[quote="Klausnh"]
Try here from Bernd Machenschalk Quote:
Quote:
Remember, "success" does not mean "validated." There's no way your BOINC client can know in advance whether the WUs are correct, i.e., will later be validated by two other boxes. Unfortunate term, "success." Maybe they coulda used "Tentative Success?" Good luck over there; there's a lot of knowledge. In the BA's case, it was more important for the BA to be the BA than to keep working this problem, so he stopped BOINCing on that box.
|
|||
|
Will have to wait and see. I know 100% for a fact this cannot be temperature related, as there are about 5 temperature probes in this box at all times. They all have max, min and mean recordings for every day and will alarm even if ambient is outside nominal ranges.
I am convinced this is firewall related, as all data passes via an active armour and gets checked before being passed off to the firewall system for filtering. I am certainly not dropping the firewall for days to test this. So if it doesn't fix itself, I guess EH will be removed from proxima. It was crunching fine and now it's not. The only difference on the system since these validation errors started occurring is the firewall.
|
||||
|
To me that sounds like corrupt files, bad memory, or a slightly damaged (but still mostly functional) hard drive. Maybe uninstall BOINC and Einstein@home, defrag, and then reinstall them. If that doesn't work, do a complete system wipe and reinstall everything. If that doesn't work, swap out the memory, and if it still doesn't work, then the HD. Then if none of that works you can beat me to death with a wiffle bat for telling you to do all that unnecessary work when it was actually a faulty power supply.
|
|||
|
Quote:
|
|
|||
|
Quote:
Error is Quote:
Quote:
If you want, I can post your problem at the Einstein@home message boards and get an explanation before you remove EH. I'm really curious and would like to understand what the problem is.
|
||||
|
Quote:
:oops: I meant to say that this was the only thing I could find over at CC that I could see might possibly bear on the problem. :-? It is now clear that this is not the problem. Sorry.
But here's what I'm not seeing about the firewall being the culprit. As I understand it, the sequence goes:
1) Firewall full on.
2) WU crunches away, completes, and the local client sees the computation as a success. All without any attempt to open a port.
3) Client tries to upload the WU, is told by the firewall that the port is blocked, client backs off, rinse, repeat.
----So far everything normal, firewall and client behaving impeccably, just as one would expect on dial-up, say.
4) Firewall is manually turned off.
5) Update command is issued manually.
6) Client uploads the WU.
7) Firewall turned back on.
8) Much later, during verification, WU is found invalid.
Note first that "invalid" does not necessarily mean "corrupt," although it could; and maybe Gopher's ideas may help. But "invalid" is more likely to mean, IMO, that the WU calculated "42.1" for the answer when the other 3 computers agreed on "42" (to oversimplify). But my main question is, how could it be the firewall that caused the invalidity/corruption, when the FW was turned off when the transfer took place?
In going through this sequence, which I hope is fair, I guess it's possible that somehow in the very act, at step 3), of trying to upload the WU, the firewall corrupted the WU. Maybe the active armor, which I don't understand. Again, I sure could be wrong, but this is the only step where I can see a possible interaction between FW and client. If this is the case, you can maybe confirm or refute it this way. (Forgive me if you have it set this way already.) In general preferences, change "Confirm before connecting to Internet?" to "Yes." Then step 3 never happens. You will have to answer no to a lot of prompts to connect, but this is just an experiment. When you are ready to connect, start at step 4. Then with the FW off, answer yes when the client asks to connect. Mark down the WU ID that gets sent in this way to see if it gets verified eventually. The WU will show up almost immediately in your list as "Pending," but actual verification will take some days. Meantime, you may as well stop running E@H on the box, unless you want to try some other remedy.
So then, if that test WU is verified, the firewall was corrupting the results, and so, unless you want to change firewalls etc., the simplest thing is to uninstall BOINC on that machine. It is supposed to be fun after all, not worth running into a brick wall, etc. But if that test WU also comes back "Invalid," it seems to me you can rule out the FW as the problem, because it could never have touched that WU: no requests were ever made to it to open ports, no data was presented to it while the WU was on your box, and the FW was disabled when you finally did send it. So all you gain, in this second case, is one less possible culprit, and you still might well decide it's not worth looking for others. We would all support you if you decide to uninstall at this point, possibly reserving the possibility of trying the same setup when orbit@home comes on line, or even running a few seti WUs to see if the problem is with the einstein app, which, of course, it certainly might be.
I'm sorry if I've said obvious things; I can only speak at my own low level of technical understanding. None of it is meant to be condescending or anything, only helpful.
|
||||
|
Quote:
I understand this a bit differently. You can refer to the WU of Minbari's I was talking about above; 257202 is Minbari's box. When the first result is uploaded, there is (obviously!) no way to know whether the result it has calculated is correct. (Otherwise, no project!) When the second one comes in, there is likewise no way of knowing which, if any, is correct. Now the third one comes in and, in the case above, doesn't agree with Minbari's but does agree with the second box. This looks like bad news for Minbari, but the protocol requires 3 confirmations (the quorum), so the WU is sent to a 4th box, 249347. (Actually, Minbari's WU was the third to be received, but the argument is the same.) The 4th result comes back and turns out to agree with the other 2, and so those 3 computers are marked valid and get credit, while Minbari is out of luck. A fifth computer would have been needed if the 4th one had not agreed with the other 2. So eventually 3 boxes do agree in almost all cases; the only way all results would be marked invalid is if the project keeps sending out the WU and all the results keep disagreeing among themselves. At some point the algorithm will just discard the WU. But more likely is Minbari's situation of 3 out of 4 agreeing.
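For anyone who finds code easier to follow than prose, the quorum idea described in this post can be sketched roughly as below. This is only an illustration and not the actual BOINC or Einstein@Home validator: the function name `check_quorum`, the labels `box_1` and `box_2`, the numeric results (borrowing the oversimplified "42" versus "42.1" picture from earlier in the thread) and the tolerance are all assumptions made up for the example. Only the host IDs 257202 and 249347 and the quorum of 3 come from the discussion.

```python
from typing import Dict, List, Optional

QUORUM = 3          # agreeing results required (the "3 confirmations" in the post)
TOLERANCE = 1e-6    # assumed: how close two results must be to count as agreeing


def check_quorum(results: Dict[str, float],
                 quorum: int = QUORUM,
                 tol: float = TOLERANCE) -> Optional[List[str]]:
    """Return the hosts whose results form a quorum, or None if there is none yet."""
    hosts = list(results)
    for candidate in hosts:
        # Collect every host whose result matches this candidate's result.
        agreeing = [h for h in hosts
                    if abs(results[h] - results[candidate]) <= tol]
        if len(agreeing) >= quorum:
            return agreeing
    return None


# Hypothetical numbers: two boxes report 42.0, Minbari's box (257202) reports 42.1.
results = {"box_1": 42.0, "box_2": 42.0, "257202": 42.1}
print(check_quorum(results))   # None: no quorum yet, so the WU goes to a 4th box

results["249347"] = 42.0       # the 4th result agrees with the other two
valid = check_quorum(results)
invalid = [h for h in results if h not in valid]
print(valid, invalid)
```

With three results in hand and only two agreeing, nothing can be marked valid or invalid, which is why a fourth box is needed. Once the fourth result arrives, three hosts agree and the odd one out gets no credit.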
|
|||
|
Quote:
|
||||
|
A warm welcome to our newest team member, Walter Williams! =D> He is member #75 for Einstein@Home, joining with 300 points already, and #21 for orbit@home. Let us know if there's anything you need.
**********
I can't resist posting another clip from a post by Pasquale Tricarico about the science of orbit@home. The whole thread (still only a five-post read!) is here. Quote:
This BABB thread will soon, if it hasn't already, get to the length where it is unreasonable to read it all before posting to it. I've said before that this is fine with me; there has never been a post by anyone on any of the team threads even obliquely suggesting you should "search first." But the speed with which topics move down the queue means that items like the scientific posts are not seen by very many people. So in future, I at least will be posting such things to threads like "orbit@home science," at the BABB section of mickal555's site, with just links here. I know this will be a great relief to many. Anyway, you are cordially invited down there every couple of weeks or so to see what's new; and of course you are more than welcome to add to and/or comment on such posts.
A similar invite applies to this poll down there about whether the orbit@home team will hurt the Einstein@Home team. Opinions on all such things are still welcome in this thread of course, but some may find the multiple-thread format at the Shack more congenial.
|
||||
|
Quote:
Quote:
It is clear from the WU detail that in fact a consensus has been reached, and Minbari's result is not part of it and didn't get credit, so the wording you mention is just plain wrong. Maybe the software has a bug where it doesn't update that text when the final quorum is reached?
Apologies to Minbari; I believe I've not been capitalizing his nick in places. I'll try to do better.
|
|||
:oops: Oh no, what you suggest is a perfectly plausible idea, Ken; any suggestion or possible angle to look at is a good one as far as I am concerned. I just thought I would make it clear that this is probably not the cause, so as to be able to continue the process of elimination. I see there are two WUs that have succeeded in the last day, which makes this ever more elusive. The HDDs (SATA II) are in perfect order, drive temps report 38 Celsius, and they only ever get loaded with the occasional paging, so fragmentation is not an issue. You have all been most helpful indeed; I will of course try to carry on this spirit where possible.
|
|||
|
Quote: