"Validate Error" rates

Floyd1
Joined: 29 Jun 14
Posts: 14
Credit: 590463278
RAC: 0
Topic 211383

I have a pair of machines crunching for E@H on their GPUs. Both are running Windows 10 and have been doing so for quite some time.

One uses a Radeon 380X and happily crunches away with virtually no errors reported.

The other uses the built-in Intel iGPU which also appears to be happy, but it also runs a Radeon 290X which has recently seen a large uplift in the number of tasks reported as "Validate Error".

I am unable to spot any differences in the content of the STDERR files related to successful or failed tasks and would really like to understand where things are going wrong.

I have tried several different versions of the GPU driver (including the one being used successfully on my other machine) and a couple of different versions of BOINC Manager. On previous occasions, I have tried running multiple concurrent tasks, but the ratio of "bad" to "good" results rocketed, overwhelming any advantage from the extra output. (The 380X GPU seems happy to run dual tasks without error.)

The GPU has historically suffered from thermal throttling, but I have manually ramped up the cooling so GPU temps never hit the hardware ceiling.

The AMD task has a CPU core reserved, while the Intel iGPU has another CPU core shared between the two concurrent tasks, leaving a pair of CPU cores to crunch on a different project.

BOINC Manager is 7.6.33 while the Radeon OpenCL software version is 22.19.662.4.

The machine is on E@H as:-

https://einsteinathome.org/host/11741374

Can anybody shed some light on potential reasons for the flood of errors, or offer clues on how to troubleshoot?

archae86
Joined: 6 Dec 05
Posts: 3146
Credit: 7092794931
RAC: 1373766

Floyd1 wrote:
The GPU has historically suffered from thermal throttling, but I have manually ramped up the cooling so GPU temps never hit the hardware ceiling.

First off, I'd caution others commenting that the Original Poster really does mean "Validate Error", and not just the far more common "Completed, marked as invalid".

I recently had one of these pop up on one of my GPUs, not having seen one in many months. I was running overclocking tests at the time, and believe the error was a symptom of my having gone a click too far in speed.

To my extremely limited degree of understanding, a Validate Error is logged when the two tasks required for a quorum are both available at the server, but get some initial sanity check before they are actually compared. With a Validate Error something is found wrong that flunks the sanity check, so no comparison is needed. This also means there is zero chance that the real problem was with the quorum partner, which happens regularly with marked as invalid cases.

So my suggestion should not surprise you: find a way to dial down the memory and core clock rates on the card temporarily. I suggest by an appreciable amount, say 10%. Your error rate is high enough that I'd expect you to be able to see an improvement within a day or so if excess speed is indeed contributing to your problem.
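
If it helps to put a number on "improvement", here is a rough Python sketch for comparing the Validate Error rate before and after a clock change. The function name and the outcome lists are made up for illustration (hypothetical tallies typed in by hand from the host's task list, not real data from this host), and the outcome strings are assumed to match what the task page shows.

```python
from collections import Counter

def validate_error_rate(outcomes, label="Validate error"):
    """Return the fraction of tasks in `outcomes` whose status equals `label`."""
    counts = Counter(outcomes)
    total = sum(counts.values())
    if total == 0:
        return 0.0  # no tasks reported yet, so no rate to speak of
    return counts[label] / total

# Hypothetical tallies: one list per day, one entry per reported task.
before = ["Validate error"] * 6 + ["Completed and validated"] * 14
after = ["Validate error"] * 1 + ["Completed and validated"] * 19

print(f"before clock change: {validate_error_rate(before):.0%}")
print(f"after clock change:  {validate_error_rate(after):.0%}")
```

With only a day's worth of tasks the sample is small, so a modest difference between the two rates would not prove much either way; a large drop would be more convincing.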

While you did not mention this, I noticed that you logged an "Error while computing" on November 29.  On my card this is also something that happens at excess clock rate (either core clock or memory clock, on my most recent tests).  In normal running it should never happen at all.

As my cards are Nvidia Pascal series, I won't make any strong claims that my experience maps well to yours. I have no certain advice as to what applications or other means are available to slow the clock rates on your card. But as you run Windows 10, perhaps the MSI Afterburner that I like to use for test fiddling may turn out to offer control and monitoring for your card.
