Hi minbari,
You are welcome; PT is incredibly accessible (as is Bruce Allen at E@H; we are very fortunate to be running projects for such people).

As to your machine's problem: It is not that the WU is failing to upload correctly, but that the result disagrees with the other two results, and so you get no credit for it. See the detail for the most recent "validate error," WU #1373063. The more recent results are marked "pending" and "success," but you won't get credit for them either if they aren't confirmed by 2 other boxes. The most likely prediction based on the previous results is, sad to say, that the newer ones won't be validated either. This is the same kind of error that the BA had with his home computer; we never found a solution for it.

Hopefully azazul's advice of rebooting will work; if not, and you ask me to, I'll post over at Cruncher's Corner; maybe they know something. When I searched there for the BA, the only thing I found that seemed remotely plausible was overheating in the FPU, which causes the calculation to produce an erroneous result without actually aborting it. Let us know how temperatures are. I hope this works out for you! And I agree that the ultimate solution to any DC problem is just to buy more processing power!
|
|||
|
Hmmm, sadly this has occurred for 80% of the work units on this machine since the 16th of June, all with a validate error ("The result was reported but could not be validated, typically because the output files were lost on the server."), yet there were 2 successful units completed this morning.
Thanks for the tips. The machine gets booted twice a day on schedule, and the temperature of both CPUs sits at around 44 Celsius, which is well within nominal range. 99% of work units prior to June 16th are tagged successful. The only thing changed on this computer (http://einstein.phys.uwm.edu/results.php?hostid=257202) since that day may have been the firewall, though I think this was later. If this continues, I will jump on the EH forum and try to hunt down why this may be happening all of a sudden.
__________________
http://boincwapstats.sourceforge.net.../style:2/p.png |
|
|||
|
Quote:
Quote:
__________________
"To excel in physics is to embrace doubt while walking the winding road to clarity." - Brian Greene |
|
||||
|
[quote="Klausnh"]
Try here from Bernd Machenschalk Quote:
![]() Quote:
Remember, "success" does not mean "validated." There's no way your BOINC client can know in advance whether the WUs are correct, ie, will later be validated by two other boxes. Unfortunate term, "success." Maybe they coulda used "Tentative Success?" Good luck over there, there's a lot of knowledge. In the BA's case, it was more important for the BA to be the BA than to keep working this problem, so he stopped BOINCing on that box. ![]() |
|
|||
|
Will have to wait and see. I know 100% for a fact this cannot be temperature related, as there are about 5 temperature probes in this box at all times. They all have max, min and mean recordings for every day and will alarm even if ambient is out of nominal range.
I am convinced this is firewall related, as all data passes via an active armour and gets checked before being passed off to the firewall system for filtering. I am certainly not dropping the firewall for days to test this. So if this doesn't get fixed, I guess EH will be removed from proxima. It was crunching fine and now it's not. The only difference on the system since these validation errors started occurring is the firewall.
__________________
http://boincwapstats.sourceforge.net.../style:2/p.png |
|
||||
|
To me that sounds like corrupt files, bad memory, or a slightly damaged (but still mostly functional) hard drive. Maybe uninstall BOINC and Einstein@home, defrag, and then reinstall them. If that doesn't work, do a complete system wipe and reinstall everything. If that doesn't work, swap out the memory, and if it still doesn't work, then the HD. Then if none of that works you can beat me to death with a wiffle bat for telling you to do all that unnecessary work when it was actually a faulty power supply.
__________________
http://boincwapstats.sourceforge.net.../style:2/p.png |
|
|||
|
Quote:
|
|
|||
|
Quote:
Error is Quote:
Quote:
If you want, I can post your problem at Einstein@home message boards and get an explanation before you remove EH. I'm really curious and like to understand what the problem is.
__________________
"To excel in physics is to embrace doubt while walking the winding road to clarity." - Brian Greene |
|
||||
|
Quote:
I meant to say that this was the only thing I could find over at CC that I could see might possibly bear on the problem. :-? It is now clear that this is not the problem. Sorry.

But here's what I'm not seeing about the firewall being the culprit. As I understand it, the sequence goes:
1) Firewall full on.
2) WU crunches away, completes, and local client sees computation as success. All without any attempt to open a port.
3) Client tries to upload the WU, is told by firewall that the port is blocked, client backs off, rinse, repeat.
So far everything is normal, firewall and client behaving impeccably, just as one would expect on dial up, say.
4) Firewall is manually turned off.
5) Update command is issued manually.
6) Client uploads the WU.
7) Firewall turned back on.
8) Much later, during verification, WU is found invalid.

Note first that "invalid" does not necessarily mean "corrupt," although it could; and maybe Gopher's ideas may help. But "invalid" is more likely to mean, IMO, that the WU calculated "42.1" for the answer, when the other 3 computers agreed on "42" (to oversimplify). But my main question is, how could it be the firewall that caused the invalidity/corruption, when the FW was turned off when the transfer took place?

In going through this sequence, which I hope is fair, I guess it's possible that somehow in the very act, at step 3), of trying to upload the WU, the firewall corrupted the WU. Maybe the active armor, which I don't understand. Again, I sure could be wrong, but this is the only step where I can see a possible interaction between FW and client.

If this is the case, you can maybe confirm or refute it this way. (Forgive me if you have it set this way already.) In general preferences, change "Confirm before connecting to Internet?" to "Yes." Then step 3 never happens. You will have to answer no to a lot of prompts to connect, but this is just an experiment. When you are ready to connect, start at step 4. Then with the FW off, answer yes when the client asks to connect. Mark down the number of the WU ID that gets sent in this way to see if it gets verified eventually. The WU will show up almost immediately in your list as "Pending." But actual verification will take some days. Meantime, you may as well stop running E@H on the box, unless you want to try some other remedy.

So then, if that test WU is verified, the firewall was corrupting the results, and so, unless you want to change firewalls etc., the simplest thing is to uninstall BOINC on that machine. It is supposed to be fun after all, not worth running into a brick wall, etc. But if that test WU also comes back "Invalid," it seems to me you can rule out the FW as the problem, because it could never have touched that WU: no requests were ever made to it to open ports, no data was presented to it while the WU was on your box, and the FW was disabled when you finally did send it. So all you gain, in this second case, is one less possible culprit; you still might well decide it's not worth looking for others. We would all support you if you decide to uninstall at this point, possibly reserving the possibility of trying the same setup when orbit@home comes on line, or even running a few seti WUs to see if the problem is with the einstein app, which, of course, it certainly might be.

I'm sorry if I've said obvious things; I can only speak at my own low level of technical understanding; none of it is meant to be condescending, or anything, only helpful.
|
||||
|
Quote:
I understand this a bit differently: You can refer to the WU of Minbari's I was talking about above. 257202 is Minbari's box. When the first result is uploaded, there is (obviously!) no way to know whether the result it has calculated is correct. (Otherwise, no project!) When the second one comes in, there is likewise no way of knowing which, if any, is correct. Now the third one comes in, and in the case above, doesn't agree with Minbari, but does agree with the second box. This looks like bad news for Minbari, but the protocol requires 3 confirmations (the quorum), so the WU is sent to a 4th box, 249347. (Actually, Minbari's WU was the third to be received, but the argument is the same.) The 4th result comes back and turns out to agree with the other 2, and so those 3 computers are marked valid, and get credit, while Minbari is out of luck. The fifth computer would have been needed if the 4th one had not agreed with the other 2. So eventually 3 boxes do agree in almost all cases; the only way all results would be marked invalid is if they keep sending WUs and all of them keep disagreeing among themselves. At some point the algorithm will just discard the WU. But more likely is Minbari's situation of 3 out of 4 agreeing.
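The quorum idea described above can be sketched in a few lines of Python. This is only an illustration of the "3 must agree" rule as explained in this post, not the actual BOINC validator: the function name `check_quorum`, the exact-equality comparison, and the host IDs 1001 and 1002 are all assumptions made for the example (257202 and 249347 are the boxes mentioned above, and "42" versus "42.1" is the same oversimplification used earlier in the thread).

```python
from collections import Counter

def check_quorum(results, quorum=3):
    """Decide whether enough results agree to call a work unit validated.

    results: list of (host_id, value) pairs in the order they arrived.
    Returns None while no value has `quorum` matching results (the server
    would then send the WU to another host). Otherwise returns
    (agreed_value, valid_hosts, invalid_hosts).
    """
    counts = Counter(value for _, value in results)
    if not counts:
        return None
    value, n = counts.most_common(1)[0]
    if n < quorum:
        return None  # still waiting for more agreeing results
    valid = [host for host, v in results if v == value]
    invalid = [host for host, v in results if v != value]
    return value, valid, invalid

# Three results in, but only two agree: no quorum yet.
first_three = [(1001, 42.0), (1002, 42.0), (257202, 42.1)]
print(check_quorum(first_three))  # None

# A fourth box agrees with the other two, so the odd one out gets no credit.
print(check_quorum(first_three + [(249347, 42.0)]))
# (42.0, [1001, 1002, 249347], [257202])
```

A real validator would presumably compare results within some tolerance rather than by exact equality; that detail is not covered in this thread.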
|
|||
|
Quote:
__________________
"To excel in physics is to embrace doubt while walking the winding road to clarity." - Brian Greene |
|
||||
|
A warm welcome to our newest team member Walter Williams =D>. He is member #75 for Einstein@Home, joining with 300 points already, and #21 for orbit@home. Let us know if there's anything you need.
**********
I can't resist posting another clip from a post by Pasquale Tricarico about the science of orbit@home. The whole thread---still only a five post read!---is here. Quote:
This BABB thread will soon, if it hasn't already, get to the length where it is unreasonable to read it before posting to it. I've said before that this is fine with me; there has never been a post by anyone on any of the team threads even obliquely suggesting you should "search first." But the speed with which topics move down the queue means that items like the scientific posts are not seen by so many people. So in future, I at least will be posting such things to threads like "orbit@home science," at the BABB section of mickal555's site, with just links here. I know this will be a great relief to many. Anyway, you are cordially invited down there every couple of weeks or so to see what's new; and of course you are more than welcome to add to and/or comment on such posts. ![]() A similar invite applies to this poll down there about whether the orbit@home team will hurt the Einstein@Home team. Opinions on all such things are still welcome in this thread of course, but some may find the multiple thread format at the Shack more congenial. ![]() |
|
||||
|
Quote:
Quote:
It is clear from the WU detail that in fact a consensus has been reached, and Minbari's result is not part of it, and didn't get credit, so the wording you mention is just plain wrong. Maybe the software has a bug where it doesn't update that text when the final quorum is reached?
Apologies to Minbari; I believe I've not been capitalizing his nick in places. I'll try to do better.
|
|||
Oh no, what you suggest is a perfectly plausible idea, Ken; any suggestion or possible angle to look at is a good one as far as I am concerned. I just thought I would make it clear that this is probably not the cause, so as to be able to continue the process of elimination. I see there are two WUs that have succeeded in the last day, which makes this ever more elusive. The HDDs (SATA II) are in perfect order, drive temps report 38 Celsius, and they only ever get loaded with the occasional paging, so fragmentation is not an issue. You have all been most helpful indeed; I will of course try to carry on this spirit where possible.
__________________
http://boincwapstats.sourceforge.net.../style:2/p.png |
|
|||
|
Quote:
__________________
"To excel in physics is to embrace doubt while walking the winding road to clarity." - Brian Greene |
|
||||
|
Quote:
(If interested, you can follow the saga here.) It does seem to exonerate the firewall from corrupting data though? A good thing? :-?

I had meant to ask previously if there were ever any unexpected error messages in the BOINC manager when it tried to upload WUs against the firewall? There is a logging program for messages on the project sites, which might be helpful to track down any differences between a successful upload and one which will eventually be invalid. Immediate link to download is here: boinclogger.zip.

One other exotic possibility I thought of is an actual error in the FPU itself---not heat related. It's rare, but microchips can sometimes fail. And maybe even intermittently? A cursory google shows there might be tests for such things, and this site has a scary first paragraph, but I didn't locate a real easy-to-use FPU test for a modern desktop machine. I would think though that predictor@home would use mostly integer arithmetic, since it would seem to be a combinatorial task? We are all hoping that the errors have vanished, never to return, but if it does start spitting them out again, seeing if predictor@home gives the same errors could maybe be a quick and dirty way to isolate the FPU as a factor. The problem is sure made all the tougher by having to wait a week to know about success or failure.

IOW, all I have to say is, Glad It's Not Me, Minbari. Good luck and best wishes.
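For what it's worth, the idea of hunting for an intermittent floating-point fault can be sketched very roughly as follows: run the same deterministic floating-point workload several times and compare the results bit for bit. This is only a toy consistency check under stated assumptions, not a real FPU diagnostic. The function names and the particular arithmetic in `workload` are made up for the example, and a clean pass proves very little, since a rare fault may simply not occur during the run.

```python
import math
import struct

def workload(n=200_000):
    """Deterministic floating-point loop; the exact formula is arbitrary."""
    acc = 0.0
    for i in range(1, n + 1):
        acc += math.sin(i) * math.sqrt(i) / (1.0 + math.cos(i) ** 2)
    return acc

def bit_pattern(x):
    """Hex string of the IEEE 754 double, so comparison is exact."""
    return struct.pack(">d", x).hex()

def consistency_check(runs=5, n=200_000):
    """Return (all_runs_identical, set_of_distinct_bit_patterns)."""
    seen = {bit_pattern(workload(n)) for _ in range(runs)}
    return len(seen) == 1, seen

if __name__ == "__main__":
    ok, patterns = consistency_check()
    print("consistent" if ok else "MISMATCH between runs: %s" % sorted(patterns))
```

On a healthy machine every run should give the same bit pattern; more than one pattern would point at hardware (CPU, memory, power) rather than at the project application.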
|
||||
|
Quote:
Really, those result pages are very confusing. I know back in the day, that confusion contributed a little to the, um, misunderstanding I had with them over the Will I Ever Get Credit For This WU issue. Anything that makes these pages give consistent signals would be a big help to the whole community. |
|
|||
|
Quote:
Quote:
__________________
"To excel in physics is to embrace doubt while walking the winding road to clarity." - Brian Greene |
|
||||
|
Klaus,
Thanks for doing that...good work! I'm very envious of your concision: 2 posts totalling 3 lines and they see the problem clearly. I woulda taken 4 paragraphs just to state the problem. You are the official BABB Envoy to the Projects now, my friend!
|
|||
|
Thanks Ken. You're right in that it is becoming a curly problem. As far as tests go, this machine went through extensive tests during the build phase, i.e. specview and sciencemark tests for 3 days, and produced zero errors with astonishing speed results, so I am inclined to look past the FPU as being at fault (good suggestion though).
Also, as you can see from the list of results, proxima worked on EH flawlessly until the 16th of June, as stated a page or 2 ago. I will see how things progress over the next week. I don't have much time to contribute to fixing this problem, so there are no guarantees that any future Opterons in the farm can be added later. Thanks for all your help again, it's always appreciated.
__________________
http://boincwapstats.sourceforge.net.../style:2/p.png |
|
||||
|
Maybe you could ask the BA to create a forum for BOINC on these boards. There is always something BOINCish to talk about, and more projects are created all the time.
__________________
http://boincwapstats.sourceforge.net.../style:2/p.png |
|
||||
|
Quote:
![]() But even if the BA would go for it, I think it's a bit out of character with the other sections here? Plus opening it up for other worthwhile topics to ask for similar space? Other opinions? Still BAD BOINC does have a certain ring to it, I'll admit. Reminds me of my youth. |
|
||||
|
I've been getting errors the last half hour while trying to post here, so here's a link to a Scotsons Shack post about the new S4 data set we are now starting to crunch.
Aha, working now. **************** Also, BABB just moved into 21st place in RAC. Good going All! =D> =D> |
|
|||
|
Sheesh, turned the firewall off and results are getting through, but now I am getting spurious errors relating to other things.
The one that gets my goat is that EH will not send any more units to me, as it is now saying my daily quota has been reached LOL. My daily quota is 2 when this machine can push out a max of 8; I don't understand this silly scheduler any more. If any machine should get errors it is the AMD Athlon 2700+ that has run all day at 69 Celsius for 6 months without missing a beat, so go figure. Anyway, I will persist for a little longer. The firewall was nvidia's ForceWare with armour guard, which I have now changed over. Einstein errors follow:
===================
28/06/2005 3:12:56 PM||Remote control not allowed; using loopback address
28/06/2005 3:12:56 PM|Einstein@Home|Resuming computation for result H1_0407.0__0407.1_0.1_T23_Fin1_0 using einstein version 4.79
28/06/2005 3:12:56 PM|Einstein@Home|Deferring communication with project for 19 hours, 45 minutes, and 38 seconds
28/06/2005 3:12:56 PM||Insufficient work; requesting more
28/06/2005 3:12:56 PM|orbit@home|Deferring communication with project for 22 hours, 29 minutes, and 40 seconds
28/06/2005 3:13:22 PM||request_reschedule_cpus: project op
28/06/2005 3:13:22 PM|Einstein@Home|Sending scheduler request to http://einstein.phys.uwm.edu/EinsteinAtHome_cgi/cgi
28/06/2005 3:13:22 PM|Einstein@Home|Requesting 8640 seconds of work, returning 1 results
28/06/2005 3:13:24 PM|Einstein@Home|Scheduler request to http://einstein.phys.uwm.edu/EinsteinAtHome_cgi/cgi succeeded
28/06/2005 3:13:24 PM|Einstein@Home|Message from server: No work sent
28/06/2005 3:13:24 PM|Einstein@Home|Message from server: (reached daily quota of 2 results)
__________________
http://boincwapstats.sourceforge.net.../style:2/p.png |
|
|||
|
I just rebooted the system and it still says I have reached the daily quota limit of 2 requests; this is madness. So I finish the last unit in about 30 minutes and have to wait 18 hours till I receive another 2 units to work on. Keh, this hasn't happened to me before on any of the machines I have had set up and running for the last half a year.
The dual-proc Opteron is pushing out less than a Celeron at the moment. I know it ain't the machine, so there must be an issue at EH with Opterons in the last weeks or something. I have no control of EH scheduling, so my hands are tied until EH does something to fix this mess. Also, I noted that an update to the latest BOINC causes a WU in progress to terminate prematurely if one clicks to display the graphic. I did this just before and noted the error; LOL, the unit only had 5% to go to completion and now it's trashed because of the silly graphic. Sheesh, what a headache this is turning out to be. I'm starting to wonder if it's worth pursuing this any further.
===
28/06/2005 3:29:28 PM|Einstein@Home|Sending scheduler request to http://einstein.phys.uwm.edu/EinsteinAtHome_cgi/cgi
28/06/2005 3:29:28 PM|Einstein@Home|Requesting 8640 seconds of work, returning 0 results
28/06/2005 3:29:30 PM|Einstein@Home|Scheduler request to http://einstein.phys.uwm.edu/EinsteinAtHome_cgi/cgi succeeded
28/06/2005 3:29:30 PM|Einstein@Home|Message from server: No work sent
28/06/2005 3:29:30 PM|Einstein@Home|Message from server: (reached daily quota of 2 results)
28/06/2005 3:29:30 PM|Einstein@Home|No work from project
28/06/2005 3:29:31 PM|Einstein@Home|Deferring communication with project for 18 hours, 49 minutes, and 40 seconds
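For anyone who wants to pull the interesting bits out of message logs like the ones above, here is a small Python sketch. It assumes the lines have the `timestamp|project|message` shape seen in these posts and that the quota and deferral messages are worded exactly as quoted here; the function names are made up for the example, and other client versions may well word things differently.

```python
import re
from datetime import timedelta

# Patterns taken from the wording of the log lines quoted in this thread.
QUOTA_RE = re.compile(r"reached daily quota of (\d+) results")
DEFER_RE = re.compile(
    r"Deferring communication with project for "
    r"(\d+) hours, (\d+) minutes, and (\d+) seconds"
)

def parse_line(line):
    """Split one log line into (timestamp, project, message)."""
    timestamp, project, message = line.split("|", 2)
    return timestamp.strip(), project.strip(), message.strip()

def summarize(lines):
    """Collect the last reported quota and deferral time per project."""
    summary = {}
    for line in lines:
        if line.count("|") < 2:
            continue  # not a log line in the expected shape
        _, project, message = parse_line(line)
        entry = summary.setdefault(project, {"quota": None, "deferral": None})
        quota = QUOTA_RE.search(message)
        if quota:
            entry["quota"] = int(quota.group(1))
        defer = DEFER_RE.search(message)
        if defer:
            hours, minutes, seconds = (int(g) for g in defer.groups())
            entry["deferral"] = timedelta(hours=hours, minutes=minutes, seconds=seconds)
    return summary

log = [
    "28/06/2005 3:29:30 PM|Einstein@Home|Message from server: (reached daily quota of 2 results)",
    "28/06/2005 3:29:31 PM|Einstein@Home|Deferring communication with project for 18 hours, 49 minutes, and 40 seconds",
]
print(summarize(log))
```

Lines with an empty project field (the ones with `||`) end up under the key `""`, which keeps client-wide messages separate from project messages.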
__________________
http://boincwapstats.sourceforge.net.../style:2/p.png |
|
||||
|
Hi Minbari,
You have been bitten by the "Ghost WU" bug; fortunately not serious, AFAIK. If you look at the results for computer here, you'll notice some completely spurious WUs were all said to be downloaded at around 3:45. These would never have appeared on your "Work" tab, and they report a "client error" in downloading. But the scheduler thinks it has downloaded them, and so reports you've exceeded your daily quota.

As I remember the discussion at CC now, this usually resolves itself when the next day rolls round and fresh real WUs are downloaded. I don't recall ATM whether midnight is reckoned in GMT or local time. If this doesn't work, I believe the solution is to nuke the master file containing the WUs. Wouldn't swear to this, too tired ATM to check it, will try tomorrow to be sure: You would shut down BM and delete (or rename for safety) the file h1_0407.0 from the einstein_4.79 directory. Then restarting the client should download a new master file which would not have the ghost WU bug.

The file h1_0407.0 is one of the new S4 series, as you can see from clicking on any of the "Result IDs," the far left column. So it's possible that this is a new problem with them, but it looks like ghost WUs to me. And there is no question that the scheduler, to put it kindly, is not totally bug free.

Will check again tomorrow, especially if you still aren't getting work; there's nothing worse than having a box all fired up with nothing to do. Once o@H comes on, this will be easier for us running both: it's unlikely both projects will be down or having errors at the same time. Then we will just have to worry about the scheduler negotiating the share properly. Hope this helps for now; best I can do.
|
|||
|
Thanks for the feedback Ken, I will sit and wait for the rollover before I try anything else.
I'm having some bad luck with EH on Proxima the last few weeks. Still the show must go on. =D>
__________________
http://boincwapstats.sourceforge.net.../style:2/p.png |
|
|||
|
Problem with new h1 (lower case) WUs.
Quote:
__________________
"To excel in physics is to embrace doubt while walking the winding road to clarity." - Brian Greene |
|
||||
|
Klaus,
Thanks so much for this. From that thread: Quote:
There is no need to abort any running WUs first.
1) Exit the BOINCmanager, and the tray icon if it's still visible.
2) Delete any "h1_xxxx" files from your C:\program files\BOINC\projects\einstein.phys.uwm.edu\ folder. Lower case "h".
3) Restart BOINCmanager. Any h1_xxxx WUs which were previously running will be marked 100%, "client error," and ready to report: their computation is automatically aborted by the deletion of the master file in step 2). Which is what you want.
4) Manually "update" from the projects tab. You may not need to do this, since, upon finding no master file, the client may download one or two small WUs and resume crunching. Download of a new master file will then happen in accord with your "Connect to ..." cache settings. But at least one manual update is a good idea to be sure everything is running smoothly. Wait a minute to do it to avoid any "deferring for 59 sec" messages.
(This will, or should, also solve Minbari's "Ghost WU" problem as well, though they are unrelated.)

Repeating, there is no reason to continue with the h1_xxxx files, as all such WUs will either fail locally or upon verification. Once this is done, you will of course have a few "client errors" when you view your computers under Your Account. It's a good idea to check back there in a day or two to be sure that the new WUs are getting credit.

Here's the latest quote from the E@H frontpage: Quote:
Since these WUs have now been cancelled, I think that the problem will self-correct when your currently running WUs complete: Client will try to fetch more WUs from the master file, be informed that they are cancelled, and eventually download a new master file. If this is not what happens on your machine, that is, if the client persists in running more h1_xxxx WUs, please let me know, and I'll send a team email in a few hours recommending the above steps. Apologies for this snafu.
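If anyone would rather script the file cleanup from the manual steps above than do it by hand, something along these lines might work. It is only a sketch under assumptions: it renames the lower-case `h1_` files instead of deleting them (the "rename for safety" option mentioned earlier in the thread), the function name and the `disabled_` prefix are made up for the example, and it assumes BOINC has already been shut down. It has not been tried against a real BOINC install.

```python
from pathlib import Path

def quarantine_h1_files(project_dir, prefix="disabled_", dry_run=True):
    """Rename files whose names start with lower-case 'h1_' out of the way.

    Returns a list of (old_name, new_name) pairs. With dry_run=True
    (the default) nothing is touched; the pairs only show what would happen.
    """
    renamed = []
    for path in sorted(Path(project_dir).iterdir()):
        # str.startswith is case-sensitive, so upper-case 'H1_' files are
        # left alone even on a case-insensitive file system like Windows.
        if path.is_file() and path.name.startswith("h1_"):
            target = path.with_name(prefix + path.name)
            if not dry_run:
                path.rename(target)
            renamed.append((path.name, target.name))
    return renamed

# Example use, with the folder from step 2 (exit BOINC first):
# folder = r"C:\program files\BOINC\projects\einstein.phys.uwm.edu"
# print(quarantine_h1_files(folder))                  # preview only
# print(quarantine_h1_files(folder, dry_run=False))   # actually rename
```

Running it once with the default `dry_run=True` shows what would be renamed before anything is changed; after that, steps 3 and 4 still have to be done by hand.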