Hyperthreading and Task number Impact Observations |
| Author | Message |
|---|---|
|
Over the next couple of days I plan to generate some observations on the execution time and aggregate throughput impact of switching my Westmere E5620 between hyperthreaded and not, and of varying the number of simultaneously executing Einstein HF 3.06 tasks. | |
| ID: 108579 | | |
|
I'm planning to post results by doing a screen capture of a bit of Excel spreadsheet, posting to my Photobucket account, and linking the image here. | |
| ID: 108590 | | |
|
archae86 wrote: "But I think some of the results may surprise people--for example those who may expect a single Einstein task running HT to run about as fast as the single nHT case" While I can't speak for other regular posters, a result here surprised me. The single task execution time running hyperthreaded was so close to that running nHT that I cannot confidently assert the difference was not just WU to WU natural variation. I had expected a large disadvantage for the HT case. The observed result is of course what one would like and naively expect--with nothing to do on the "other half" of a core running HT you would of course want no context switching or other overhead to occur. But on my previous main machine with HT some years ago, I formed the strong impression that single task execution was considerably slower with HT enabled than not. I assumed that was still true for Nehalem and have a dedicated BIOS setting group aimed to support my audio processing which is nHT because I thought my (largely single-threaded) audio processing would go faster (I don't run BOINC when I'm doing audio). Possibly I was mistaken then, or possibly the Intel HT implementation in the Nehalem architecture is dramatically superior to that in the Gallatin (Northwood-derived with big cache) machine I used to own. Considering how unfortunate some other aspects of the whole Willamette-descended set of designs were, this would not surprise me. So it is even more crucial to assure that performance data are taken with appropriate simultaneous workload than I thought. To the problems of conflict for RAM and cache resources, one must add the fact that an underloaded HT host can actually provide dramatically more computation per charged CPU second than a loaded one. ____________ | |
| ID: 108610 | | |
|
Could not the difference be that your Gallatin was a single-core processor, so everything else running on the computer (including the OS itself and OS background tasks) would require a context switch. | |
| ID: 108613 | | |
|
Richard Haselgrove wrote: Could not the difference be that your Gallatin was a single-core processor, so everything else running on the computer (including the OS itself and OS background tasks) would require a context switch. But the same system running nHT still needed to process interrupts and make context switches for those same things. And a big part of the claim for HT is that it supports a sort of context switch between the two threads sharing a core at any given moment that is immensely faster than a standard context switch. So at least some should have gone faster, and I don't see why there would be a penalty for all the others unless the scheme somehow forced frequent thread to thread switches even though there was no work for the other thread. Something like that is what I've assumed, not from inside information but from the behavior I thought I'd seen. "a clever operating system could keep one 100% utilisation task running on one core without context switches" I've noticed that Windows 7 on my application load seems far more inclined to leave a process on a core for a while than Windows XP Pro. Not sure Windows 7 qualifies as clever in this respect. ____________ | |
| ID: 108616 | | |
The single task execution time running hyperthreaded was so close to that running nHT that I cannot confidently assert the difference was not just WU to WU natural variation. I had expected a large disadvantage for the HT case. Based on the work we did a couple of years ago ( Ready Reckoner et al ), if still valid, then one could easily get variation of the order of a third ( average variation, measured extremes were from as low as ~15% to as high as ~45% ) of the run time due to stepping through phase space at a given frequency ( sinusoids etc ). That's clearly of the order of +/- HT effect we expect anyway ..... [ specifically, if true, this implies nHT processing getting lucky with a 'short' WU with correspondingly disadvantaged HT processing working on a 'long' WU, to explain your finding ] Cheers, Mike. ____________ "I have made this letter longer than usual, because I lack the time to make it short." - Blaise Pascal | |
| ID: 108619 | | |
|
Running (up to) N/2 tasks* on machine with N HT CPUs yields pretty much same | |
| ID: 108621 | | |
What's surprising about it? For many, not a lot. For others, a credulity issue .... :-) Cheers, Mike. ____________ "I have made this letter longer than usual, because I lack the time to make it short." - Blaise Pascal | |
| ID: 108622 | | |
|
Mike Hewson wrote: "if still valid, then one could easily get variation of the order of a third of the run time due to stepping through phase space at a given frequency ( sinusoids etc )" My impression is that the proportionate execution time variability of the current work load is much less than it was in that era. Here is a histogram of 144 recent results of 3.04 app work done on that same host running HT 8 tasks, 3.42 GHz, but suffering excess variation from handling my daily personal computing workload. [image: run-time histogram] I agree that the "difference" I saw between nHT 1 task and HT 1 task is likely within random variation, but I don't think the current random variation is anything at all close to the third of the run time you mention. ____________ | |
| ID: 108623 | | |
Fair enough. I thought that might well be so, as the variation 'back then' related to sinusoidal function 'look-ups' and like issues, which has undergone optimisation ( or become less relevant ) since. The phase space is right ascension and declination, effectively considered as orthogonal co-ordinates, but to un-Doppler a signal you still need to resolve components to detector/Earth frame ie. trigonometry. At least that's how I remember it. :-) Cheers, Mike. ( edit ) Nice curve too. To a first attempt you'd model that as normally distributed : normalising_const * exp[-((ordinate - mean_measure)/spread_measure)^2] If so then you have an underlying random variable with no especial 'preference' related to the task at hand*. Asynchronous ( with respect to WU processing ) interruptions would explain that nicely ..... ( edit ) Mean is ~ 22460, standard deviation is ~ 148. Average absolute residual ( of actual WU's per run-time bracket ) from Gaussian prediction is ~ 3.1 or around 10% of the peak. Close enough ... certainly believable for that sample size. ( edit ) This means that I am saying that the WU 'interruptions' account for around +/- 2% of their run-times ( 3 standard deviations/average ). So this is way less than the 'sequence number' effect studied in days of old. But you could have guessed that. I couldn't remember any more Excel-Fu ..... :-) ( edit ) Actually yet another reason for intrinsic WU variation not affecting your group of 144, is that at ~ 1500 Hz : each frequency is going to have well over 1000 WU's to plow through. [ We found earlier that the number of work-units per sequence-number-cycle at a specific frequency went quadratically with frequency, as you have to plod through the phase space more finely ]. 
With E@H's use of locality scheduling there is a mighty tendency for a given host ( especially a fast one ) to be given near consecutive sequence numbers, thus your example of 144 WU's may not sample much of any ( if existing ) sinusoidal variation in run-times from that cause. In fact looking at your Xeon's first 20 tasks in the current 'In progress' list ( a page full ) that's exactly what's happening. So I'll shut up now, having demonstrated that this is definitely not relevant to the HT exploration here. :-) :-) ( edit ) * Sorry, quiet night-shift! I didn't explain that if it was predominantly sequence-number related you'd get a low-skewed ( to the left ) concave-up curve, and not a convex-down symmetric bell shape, as most WU's would cluster at the sinusoid trough ( shorter run-times ). If there is any skew asymmetry in your data it is definitely to the right. ____________ "I have made this letter longer than usual, because I lack the time to make it short." - Blaise Pascal | |
| ID: 108624 | | |
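The "+/- 2%" figure in the post above can be rechecked with a few lines of Python. This is only a sketch: the mean and standard deviation are the values quoted in the post, the function names are made up for illustration, and the bell curve is written in the textbook form exp(-(x - mean)^2 / (2 * stdev^2)), which matches the post's formula if its spread measure is taken to be stdev * sqrt(2).

```python
import math

MEAN_S = 22460.0   # mean run time quoted in the post, in seconds
STDEV_S = 148.0    # standard deviation quoted in the post, in seconds

def gaussian_shape(x, mean=MEAN_S, stdev=STDEV_S):
    """Un-normalised bell curve; equals 1.0 at the mean."""
    return math.exp(-((x - mean) ** 2) / (2.0 * stdev ** 2))

def three_sigma_fraction(mean=MEAN_S, stdev=STDEV_S):
    """Three standard deviations expressed as a fraction of the mean."""
    return 3.0 * stdev / mean

if __name__ == "__main__":
    print(f"3 sigma / mean = {three_sigma_fraction():.2%}")
    print(f"relative height one sigma out = {gaussian_shape(MEAN_S + STDEV_S):.3f}")
```

The ratio comes out just under 2%, in line with the estimate given in the post.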
|
I've been surprised yet again. The most recent addition to the result table in the second post of this thread covers the case of 4 tasks running with hyperthreading turned on. | |
| ID: 108653 | | |
(...) the shortfall (of HT_4 --ed.) to nHT_4 is quite large. I'm blaming the OS here. I once did similar experiment _but_ set CPU affinities so no two sibling HT cores would be used (Linux). Got on-par results. Just FYI. | |
| ID: 108654 | | |
|
tear wrote: I'm blaming the OS here. I once did similar experiment _but_ set CPU affinities so no two sibling HT cores would be used (Linux). Got on-par results. Just FYI. Now that is an interesting thought. Allow me to express in highly verbose form my understanding of what you have said so tersely. In hyperthreading the OS is presented with a set of apparently equivalent CPUs. But in the current form pairs of them use the same physical hardware. So, at least in the Nehalem generation, there would be a great advantage to assigning next execution of a thread to a virtual CPU which was not only idle itself, but whose "sibling" as you call it--the other virtual CPU in fact using the same core--was currently idle, rather than to a virtual CPU whose sibling was already using a full core resource. I'm not at all sure the hardware communicates to the software anything about which virtual CPUs share hardware in what ways. That may have seemed a needless complication or a departure from the purity of apparent equivalence to those making the original design decisions. That is an interesting and plausible suggestion. I believe I could repeat my HT_4 experiment using Process Explorer to force 4 tasks to distinct cores. I'm interested enough to consider trying the experiment soon. However at the practical level of suggesting system configuration for users, that result seems unlikely to help much. Possibly a third-party add-on could repeatedly set affinities for new BOINC tasks to even or odd numbered virtual CPUs on HT systems restricted to half the maximum number of CPUs or less, but those systems would still execute less BOINC work at poorer BOINC power cost efficiency than unrestricted systems. So there seems not likely to be a big market for the feature. ____________ | |
| ID: 108655 | | |
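The placement rule described in the post above (prefer a virtual CPU whose sibling is also idle) can be written down as a toy selection function. This is a sketch of the idea only, not how any real scheduler is implemented; `pick_cpu`, its arguments, and the pairing in `PAIRS` are all assumptions made for illustration.

```python
def pick_cpu(busy, sibling_pairs):
    """Return an idle logical CPU, preferring one whose sibling is also idle.

    busy          -- set of logical CPU numbers currently running a task
    sibling_pairs -- list of (cpu_a, cpu_b) tuples that share a physical core
    Returns None when every logical CPU is busy.
    """
    fallback = None
    for a, b in sibling_pairs:
        idle = [c for c in (a, b) if c not in busy]
        if len(idle) == 2:
            return idle[0]        # whole core is free: best choice
        if idle and fallback is None:
            fallback = idle[0]    # core is half busy: use only if nothing better
    return fallback

# Hypothetical pairing; the real layout differs between machines and tools.
PAIRS = [(0, 1), (2, 3), (4, 5), (6, 7)]

if __name__ == "__main__":
    print(pick_cpu({0, 3}, PAIRS))
```

With tasks already on CPUs 0 and 3, the function skips the half-busy cores and returns a CPU on a fully idle core. Only once every core has one busy thread does it fall back to a sibling of a busy CPU.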
I'm blaming the OS here. I once did similar experiment _but_ set CPU affinities so no two sibling HT cores would be used (Linux). Got on-par results. Just FYI. The Linux scheduler is HT-aware and so optimally balances out the loading and tries to avoid core thrashing and subsequent cache trashing. No 'forcing CPU affinity' required. It's already included! You should get optimal throughput by utilising fully loaded HT (for an Intel HT CPU). Happy fast crunchin', Martin ____________ Powered by Mandriva Linux A user friendly OS! See the Boinc HELP Wiki | |
| ID: 108657 | | |
I'm blaming the OS here. I once did similar experiment _but_ set CPU affinities so no two sibling HT cores would be used (Linux). Got on-par results. Just FYI. I've seen plenty of bad scheduling with HT support in the scheduler (CONFIG_SCHED_SMT). Though I admit, theory does appear nice. You should get optimal throughput by utilising fully loaded HT (for an Intel HT CPU). No disagreement here :) | |
| ID: 108663 | | |
(edit: after I wrote this paragraph I noticed that Mike Hewson had added considerable updates to his original comments on my histogram. An appropriate modification to my claim here is to say that I think that the current overall systematic variation may be far less than the old days, but that in any case the restricted set of results actually being compared here, being all from frequency 1373.90, and spanning a sequence number range only from 1000 to 1022 probably contributed very little measurement noise stemming from systematic execution work variation to the reported comparisons) Exactly right. The close frequencies and sequence numbers mean the skygrid right ascension and declination are real close. My guess ( admittedly based on old parameter estimates ) is that the sequence numbers have about a cycle of 400 work units before returning to similar runtimes, for around that frequency value. [ For those not familiar with this aspect of the discussion : at each assumed signal frequency the entire sky is examined in small areas ( one per work unit ) with more, and thus individually smaller, areas required for higher frequencies. Because the Earth is rotating around its own axis, and it is also orbiting the Sun, a signal channel from each interferometer needs to be 'de-Dopplered' accordingly for each and every choice of distant sky grid element ( tiny area on a construct called the 'celestial sphere' ). Ultimately a signal is effectively expressed in what it would be like if it were heard at a place called the solar system 'barycenter'. There is another line of adjustment according to estimates of putative source movements too. The part of the algorithm that steps through the skygrid has to acknowledge some trigonometry to resolve a signal's components to the directions along which a particular interferometer's arms happen to lie at a given instant. In addition not all skygrid areas are equal which is a consequence of spherical geometry not being 'flat'. 
In any case the work unit's runtime used to be very dependent on skygrid position, with a marked sinusoidal variation above an amount that was constant regardless of sky position. The algorithm starts stepping from, I think, the equator, but it could have been a pole as I can't remember which, and wraps around the sphere with a 'stagger' reminiscent of winding yarn around a ball. The number of steps to return for another wrap around is this cycle length of approximately 400 that I'm referring to. At lower frequencies than we are doing now, around 3 such cycles were required to cover the entire sky grid. There was also another effect 'rippling' the sinusoidal runtime vs. sequence number curve, probably ( well that was my view ) due to conversion of co-ordinates from an Earth based equatorial view to the Earth's orbital plane or ecliptic. The Earth's axis is rotated with respect to the ecliptic, which is why we have seasons etc. In any case method changes have made all this rather less relevant now ..... but it used to be a huge issue in comparing runtimes and relative (in)efficiencies ] Cheers, Mike. ____________ "I have made this letter longer than usual, because I lack the time to make it short." - Blaise Pascal | |
| ID: 108664 | | |
<snipped interpretation> (nb, yes, that's my message) I'm not at all sure the hardware communicates to the software anything about which virtual CPUs share hardware in what ways. If I were to nitpick I'd say "Hardware enables software to retrieve physical layout"... ;) sorry. That is an interesting and plausible suggestion. I believe I could repeat my HT_4 experiment using Process Explorer to force 4 tasks to distinct cores. I'm interested enough to consider trying the experiment soon. As long as you're able to identify (or determine) HT CPU "pairs". I wouldn't know how to do that in Windows. However at the practical level of suggesting system configuration for users, that result seems unlikely to help much. Possibly a third-party add-on could repeatedly set affinities for new BOINC tasks to even or odd numbered virtual CPUs on HT systems restricted to half the maximum number of CPUs or less, but those systems would still execute less BOINC work at poorer BOINC power cost efficiency than unrestricted systems. So there seems not likely to be a big market for the feature. Yes... use cases, use cases, use cases, use cases (to paraphrase Steve Ballmer). I can't see one (use case, not Steve -- ed.) either. | |
| ID: 108665 | | |
... As long as you're able to identify (or determine) HT CPU "pairs". I wouldn't know how to do that in Windows. Is the Windows scheduler HT-aware yet?... Aside: Also note that for some systems, the Intel CPUs can become memory bandwidth limited for some tasks. For those cases, you can get better performance by NOT using all the cores, or use a mix of boinc tasks so as to not hit the limits for CPU cache and memory accesses. That was especially true for the later multi-cores using the old Intel FSB. Has that now been eased with the more recent CPUs that no longer use a 'northbridge' for RAM access? Happy fast crunchin', Martin ____________ Powered by Mandriva Linux A user friendly OS! See the Boinc HELP Wiki | |
| ID: 108676 | | |
|
I can vaguely remember that MS put quite some effort into making Server 2008R2 more power efficient (I think there was a review on Anandtech about this). They achieved quite an improvement over the previous versions. And as far as I remember the optimizations include NUMA-awareness and HT-awareness in the scheduler. It may not be perfect (which software is?), but if it wasn't there I'd expect the HT_4 result to be even worse, maybe right in the middle between nHT_4 and HT_8 (without a proper calculation of probabilities). | |
| ID: 108677 | | |
In the "set affinity" interface for Process Explorer it designates CPU 0 through CPU 7 on my E5620, and 0 through 3 on my Q6600. "That is an interesting and plausible suggestion. I believe I could repeat my HT_4 experiment using Process Explorer to force 4 tasks to distinct cores. I'm interested enough to consider trying the experiment soon." From some previous work I formed an impression that (0,1), (2,3), (4,5), and (6,7) were core-sharing pairs on my E5620, though I'm not highly confident. At least part and perhaps all of my impression made use of reported core-to-core temperature changes in response to task shifts. An additional difficulty is that at least some temperature-reporting apps don't use CPU identification compatible with that used in this affinity interface. After I saw tear's note yesterday, I made a sloppy trial run in which I used suspensions to limit execution to four Einstein 3.06 HF tasks, and used this affinity mechanism to restrict each to a distinct one of the four presumed pairs. It was sloppy in that I failed to monitor things closely enough to avoid some minutes in which fewer than four tasks were running, but my initial impression is fairly strongly that a large improvement over the non-affinity modified case was demonstrated. Long ago I did affinity experiments for a Q6600 with a full SETI/Einstein task load demonstrating no improvement. That, of course, was quite a different issue than this. It is not the un-needed switching of tasks from CPU to CPU that is the primary harm here, but un-needed sharing of a physical core when an idle core is available. ____________ | |
| ID: 108684 | | |
|
I found the results for the nHT_4 << HT_4 to be inconsistent with experiments I've run in the past and like your initial reaction, surprising. | |
| ID: 108703 | | |
From some previous work I formed an impression that (0,1), (2,3), (4,5), and (6,7) were core-sharing pairs on my E5620, though I'm not highly confident. Interesting. On my i7-860 Linux box, it is (0,4) (1,5) (2,6) (3,7). This thread is very interesting - thanks for doing all these tests. I would be very interested to see a similar comparison done under Linux but, sadly, I don't have the time to do it myself... | |
| ID: 108704 | | |
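On Linux the sibling layout mentioned in the post above does not have to be guessed: the kernel exposes it under sysfs in files named `thread_siblings_list`. The sketch below reads those files and groups logical CPUs that share a core. The helper names are invented for illustration, and the function simply returns an empty list on a system without that sysfs tree (for example Windows).

```python
from pathlib import Path

def parse_cpu_list(text):
    """Parse a kernel CPU list such as '0,4' or '0-1' into a sorted tuple."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return tuple(sorted(cpus))

def sibling_groups(sysfs_root="/sys/devices/system/cpu"):
    """Return the distinct groups of logical CPUs sharing a core (Linux only)."""
    groups = set()
    pattern = "cpu[0-9]*/topology/thread_siblings_list"
    for path in Path(sysfs_root).glob(pattern):
        groups.add(parse_cpu_list(path.read_text()))
    return sorted(groups)

if __name__ == "__main__":
    print(sibling_groups())
```

On a box with the layout reported in the post one would expect groups like (0, 4), (1, 5), (2, 6), (3, 7), but the numbering depends on the machine and kernel, so it is worth checking rather than assuming.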
|
As a probe, I tried a new case of HT enabled but only 4 tasks (compared to 8 possible), but with one task restricted (by Process Explorer affinity setting) to what Process Explorer running under Windows 7 construed to be CPUs (2,3) while the other three tasks were all allowed to roam among CPUs (0,1,4,5,6,7). | |
| ID: 108705 | | |
|
![]() - ALF - "Find out what you don't do well ..... then don't do it!" :) | |
| ID: 108715 | | |
|
Some interesting observations and good discussion. ... as tear got an observation on Linux, it is at least hinted that some Linux distributions suffer to at least some degree the same form of sub-optimization. For all you might ever have wanted to know about the introduction of the Intel version of Hyper-Threading... (Note that this is a rather old idea harking long ago back to the days of the Cyber supercomputers and possibly before...) Linux: HyperThreading-Aware Scheduler ... August 28, 2002 - 12:59pm * Linux news Ingo Molnar, author of the O(1) scheduler [earlier story] and the original preemptive kernel patch, has provided a patch to make the O(1) scheduler fully aware of HyperThreading. Ingo explains: ... Linux: NUMA Awareness Added To Scheduler ... January 22, 2003 - 3:22am * Linux news After several earlier attempts [story], NUMA awareness has been merged into the 2.5 development kernel's scheduler. Martin Bligh submitted the patches, explaining: ... Hyper-Threading support in Linux kernel 2.5.x Linux kernel 2.4.x was made aware of HT since the release of 2.4.17. The kernel 2.4.17 knows about the logical processor, and it treats a Hyper-Threaded processor as two physical processors. However, the scheduler used in the stock kernel 2.4.x is still considered naive for not being able to distinguish the resource contention problem between two logical processors versus two separate physical processors. Ingo Molnar has pointed out scenarios in which the current scheduler gets things wrong... The solution is to change the way the run queues work. The 2.5 scheduler maintains one run queue per processor and attempts to avoid moving tasks between queues. The change is to have one run queue per physical processor that is able to feed tasks into all of the virtual processors. Throw in a smarter sense... Rather interesting for the various scenarios... 
As mentioned, a complication observed elsewhere is when the multiple CPU cores become resource limited for memory access (or even cache access). Happy fast crunchin', Martin ____________ Powered by Mandriva Linux A user friendly OS! See the Boinc HELP Wiki | |
| ID: 108727 | | |
If you had the wish to: 1. run with HT enabled and 2. limit the number of BOINC tasks to the number of physical cores (thus losing appreciable Einstein throughput compared to allowing use of all of the apparent CPUs) 3. get more Einstein output than one gets allowing unlucky task assignment to sibling CPUs. If my cursory understanding of Process Lasso from looking at your reference, my current belief on sibling pair numbering for Process Explorer applies to Process Lasso, and a new thought I had a couple of minutes ago are all correct, then one could, I think, do this: Use Process Lasso to assign CPU affinity of 0,2,4,6 (or any other list of four that includes only one of each sibling pair) to the "worker" exe for all BOINC applications that you run (needs to be the same list for all this class of apps). One would of course also wish to use BOINC to restrict the number of running BOINC processes to four or to 50%. On mixed fleets with both HT and nHT multi-core hosts, doing this from account preferences would use up some venue dimension--if unacceptable one might use host preference over-ride. This should assure that no BOINC execution task ever shares a physical core with another. On a lightly loaded system of Nehalem-generation architecture, I'd expect, based on what we have observed so far, such a system to get very close to the nHT BOINC throughput. A possible reason to consider this might be that such a system might be found to be more responsive to non-BOINC tasks than the 4-task nHT alternative, and quite likely more responsive than the (admittedly higher throughput) 8-task variant. Einstein current GC work has Working Set size reported about 260,000 kbytes by Process Explorer. Folks with modest-memory systems, or needing to run memory-hungry non Einstein apps (say PhotoShop...) or running BOINC projects yet more memory hungry might find this configuration attractive. 
I'm not sure we have quite met the usefulness objection to this line of inquiry we lightheartedly entertained early in the thread, but I do think we are getting closer. ____________ | |
| ID: 108730 | | |
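The "one CPU from each sibling pair" recipe in the post above boils down to building a CPU list and, for tools that want one, the equivalent bit mask. The following is a minimal sketch of that bookkeeping only. It does not call Process Lasso, Process Explorer, or BOINC, and the pairing in `ASSUMED_PAIRS` is the poster's tentative guess, not a verified layout.

```python
def one_per_core(sibling_pairs):
    """Pick the first logical CPU from each sibling pair."""
    return [pair[0] for pair in sibling_pairs]

def affinity_mask(cpus):
    """Bit mask with bit n set for each logical CPU n in cpus."""
    mask = 0
    for cpu in cpus:
        mask |= 1 << cpu
    return mask

# Assumed pairing for the E5620 under Windows; treat as hypothetical.
ASSUMED_PAIRS = [(0, 1), (2, 3), (4, 5), (6, 7)]

if __name__ == "__main__":
    cpus = one_per_core(ASSUMED_PAIRS)
    print(cpus, hex(affinity_mask(cpus)))
```

For the assumed pairing this gives the list 0, 2, 4, 6. If the pairs were instead (0,4), (1,5), (2,6), (3,7), the same function would give 0, 1, 2, 3, which is why the pairing has to be known before any affinity is set.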
|
I'm wondering: does MS know the "unlucky assignment" apparently does happen this often? They should be scratching their heads already.. | |
| ID: 108731 | | |
|
archae86: have you checked your windows power management settings? I ask because any setting below max performance could have the scheduler intentionally pairing WU's up at times in order to idle cores and drop power levels. | |
| ID: 108732 | | |
"CPU affinity of 0,2,4,6 (or any other list of four that includes only one of each sibling pair)" While I mentioned this thought in conjunction with another poster's mention of Process Lasso, I just tried this recipe solo. The result was successful, and has been added to the image displayed in the second post in this thread. Possibly this is what tear meant in referring to affinity settings avoiding sibling conflict. The resulting (low) CPU time--actually I've entered the average of four results sharing this condition--seems to endorse this particular setting for this purpose. DanNeely: my power management setting currently says "Turn off the display": 10 minutes, "Put the computer to sleep": never. Not sure how that comports with your concerns. ____________ | |
| ID: 108735 | | |
CPU affinity of 0,2,4,6 (or any other list of four that includes only one of each sibling pair) While I mentioned this thought in conjunction with another poster's mention of Process Lasso, I just tried this recipe solo. The result was successful, and has been added to the image displayed in the second post in this thread. Possibly this is what tear meant in referring to affinity settings avoiding sibling conflict. The resulting (low) CPU time--actually I've entered the average of four results sharing this condition--seems to endorse this particular setting for this purpose. To be clear on this, we'd be looking at the figures for nHT vs HT, both @ 4 tasks? Cheers, Mike. ____________ "I have made this letter longer than usual, because I lack the time to make it short." - Blaise Pascal | |
| ID: 108738 | | |
From that dialog, click change advanced settings, scroll down to Processor Power Management, and take a look at the values for Minimum and Maximum processor speed (unless you locked your multiplier in the BIOS and disabled the power management features that let it throttle down). IF your power plan is based off of Balanced or Power Saver instead of maximum, the minimum value will be 5% leaving windows free to throttle your CPU as it sees fit. I thought there were also settings relating to standing down cores as well, but unless they're subsumed in the cpu speed setting I can't find them. ____________ ![]() | |
| ID: 108740 | | |
|
DanNeely wrote: "take a look at the values for Minimum and Maximum processor speed" 100%, 100%. "(unless you locked your multiplier in the BIOS" Yes, it is locked. ____________ | |
| ID: 108741 | | |
|
Mike Hewson wrote: "To be clear on this, we'd be looking at the figures for nHT vs HT, both @ 4 tasks?" It may require a click on the page reload button (I think Einstein does not set special page expiry times), but the image of a portion of a spreadsheet in the second post should now include an "affinity" column not previously present. And, yes, I'm speaking of comparing several entries with 4 in the tasks columns and both HT and nHT in the first column. ____________ | |
| ID: 108742 | | |
It may require a click on the page reload button (I think Einstein does not set special page expiry times), but the image of a portion of a spreadsheet in the second post should now include an "affinity" column not previously present. Done that, can't see any 'affinity' .... :-( Cheers, Mike. ____________ "I have made this letter longer than usual, because I lack the time to make it short." - Blaise Pascal | |
| ID: 108743 | | |
It may require a click on the page reload button (I think Einstein does not set special page expiry times), but the image of a portion of a spreadsheet in the second post should now include an "affinity" column not previously present. Same, until I cleared my cache and did a refresh, now it's there :) edit: aside, might want to trim that image by one column to prevent side scrolling :) ____________ seeing without seeing is something the blind learn to do, and seeing beyond vision can be a gift. | |
| ID: 108744 | | |
Odd, I see a new column after Thpt/Watt/HT_8 with the column label of "Affinity Restrictions". The Excel table fragment screen capture currently has 12 rows. How many rows do you see (including the header row)? "It may require a click on the page reload button (I think Einstein does not set special page expiry times), but the image of a portion of a spreadsheet in the second post should now include an "affinity" column not previously present." Maybe there is a bit of network kit imposing caching between your browser and Photobucket? That could be a reason not to use this update method. If you say this is still a problem I shall post the new capture to a new post with a new name. ____________ | |
| ID: 108745 | | |
|
Got it, a FireFox thingy. Cleared the cache, now I can see! ;-) | |
| ID: 108746 | | |
"edit: aside, might want to trim that image by one column to prevent side scrolling :)" Gee, it was only 887 pixels wide. I've made an edit and it is down to 787 wide now. ____________ | |
| ID: 108748 | | |
|
Well, this is why we have scroll bars. :-) :-) | |
| ID: 108749 | | |
"edit: aside, might want to trim that image by one column to prevent side scrolling :)" "Gee, it was only 887 pixels wide. I've made an edit and it is down to 787 wide now." Err, sorry. I just had flashbacks of folks way back when complaining about wide signature images and was trying to head that off :| Didn't mean to offend. ____________ seeing without seeing is something the blind learn to do, and seeing beyond vision can be a gift. | |
| ID: 108781 | | |
ExtraTerrestrial Apes: The problem appears to be in the windows scheduler either not being able to detect that the boinc tasks are long running 100% load items to keep them separate, or that it bumps them deliberately because they're low priority tasks in favor of giving exclusive core use to a higher priority item that requested CPU time. Nr. 1: possible.. but that seems like a really stupid mistake, as it should be obvious that they are essentially "long running 100% load items". Nr. 2: I would expect this to happen occasionally, but in this case I'd be really surprised by the magnitude of the effect. In the nHT_4 case each task took 14800s, whereas in the full HT_8 config each one took 22300s. In the run in question the 3 tasks with presumably "unlucky assignment" needed 16900s. So I think it is safe to say that these tasks spent approximately 72% of the runtime alone on a core and 28% of the time on a shared core. If this was intentional, as some other single task seemed more important, then this one would have been run 3*0.28 = 84% of the entire runtime. That means Archae would observe an average non-BOINC CPU load of about 10% (for this statement it does not matter if this was one or many important tasks). Since Archae took care not to have excessive background tasks running and since on my Win 7 machines I do not normally observe such high background activity I don't think Nr. 2 is a likely explanation for the observed runtimes. Nr. 3: what if the scheduler wasn't HT-aware? If it assigned tasks completely randomly? I just quickly tried to count the possibilities for scheduling 3 tasks over 3 HT cores and arrived at 24 lucky possibilities and 20 unlucky ones. If we assume the same probability for each one that would mean 20/(24+20) = 45% unlucky assignments. That's not exactly the 28% from the previous paragraph and I'd be surprised if I didn't make some mistake here. But in my opinion it's somewhat close.. 
and should be much further away if the scheduler worked remotely the way I'd imagine. In either case this is a Microsoft problem, and not something Intel could do anything about themselves. Yes, but it's Intel's products which are suffering due to this, i.e. are getting worse benchmark scores and worse real world performance. That's why I think it would be in their best interest to look into this and, if confirmed, ask MS politely but firmly to change their scheduler. MrS ____________ Scanning for our furry friends since Jan 2002 | |
| ID: 108799 | | |
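The 72% / 28% split in MrS's post above is a linear interpolation between the two measured extremes. A minimal Python sketch of that arithmetic, assuming (as the post does) that runtime scales linearly with the fraction of time a task shares a physical core. The function name is only illustrative, not something from the thread.

```python
def shared_fraction(t_alone, t_shared, t_observed):
    """Fraction of runtime spent on a shared core, by linear interpolation."""
    return (t_observed - t_alone) / (t_shared - t_alone)

# Runtimes in seconds as quoted in the post above.
f = shared_fraction(t_alone=14800, t_shared=22300, t_observed=16900)
print(f"shared: {f:.0%}, alone: {1 - f:.0%}")

# Hypothesis Nr. 2: a competing higher-priority task caused the sharing.
# It would then have kept about 3 * f of one logical CPU busy, which is
# this share of the host's 8 logical CPUs:
competing_load = 3 * f / 8
print(f"implied non-BOINC load: {competing_load:.1%}")
```

The second figure is where the "about 10%" average non-BOINC load in the post comes from.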
Nr. 3: what if the scheduler wasn't HT-aware? If it assigned tasks completely random? I just quickly tried to count the possibilities for scheduling 3 tasks over 3 HT cores and arrived at 24 lucky possibilities and 20 unlucky ones. If we assume the same probability for each one that would mean 20/(24+20) = 45% unlucky assignments. That's not exactly the 28% from the previous paragraph and I'd be surprised if I didn't make some mistake here. But in my opinion it's somewhat close.. and should be much further away if the scheduler worked remotely the way I'd imagine. Ah. Three physical cores ( 1, 2, 3 ) each twice virtualised ( a, b ) with three distinct processes ( A, B, C )?
Phys Virt
1 | a | A A A A A A A A A A . . . . . . . . .
  | b | B B B B . . . . . . A A A A A A . . .
2 | a | C . . . B B B . . . B B B . . . A A .
  | b | . C . . C . . B B . C . . B B . B B A
3 | a | . . C . . C . C . B . C . C . B C . B
  | b | . . . C . . C . C C . . C . C C . C C
________________________________________________________________________
  |   | . . . . . Y Y Y Y . . Y Y Y Y . . . .
Where Y indicates 'good' combinations that don't compete for a physical core ( and '.' marks an unoccupied virtual core ). Have I expressed your desired scenario correctly? You just permute from ABC to ACB, BAC, BCA, CAB, CBA to get the others. But the fraction of good ones is still the same. The total good is 8 * 6 = 48, but out of 19 * 6 = 114 in all. Thus 8/19 = 48/114 ~ 0.42105 or 42% lucky and thus 48% unlucky ( assumes random assignment of task to virtual core ). Cheers, Mike. ( edit ) I suppose we're gonna want a 4 HT core by 4 tasks matrix .... as Arnie says : I'll be back. :-) ____________ "I have made this letter longer than usual, because I lack the time to make it short." - Blaise Pascal | |
| ID: 108803 | | |
Thus 8/19 = 48/114 ~ 0.42105 or 42% lucky and thus 48% unlucky ( assumes random assignment of task to virtual core ). Excellent explanation! The funny part is that you got the hard part right and the easy final arithmetic wrong ;) It's 100-42 = 58% unlucky. The "Answer to the Ultimate Question of Life, the Universe, and Everything" (42) got in the way :) http://en.wikipedia.org/wiki/42_(number) http://en.wikipedia.org/wiki/Answer_to_the_Ultimate_Question_of_Life,_the_Universe,_and_Everything#The_number_42 ____________ - ALF - "Find out what you don't do well ..... then don't do it!" :) | |
| ID: 108808 | | |
|
Typo? No .... err sales tax? No .... err brain fade? yes .... :-) | |
| ID: 108809 | | |
|
OK here's, I believe, the cases for 4 tasks on 4HT cores :
Phys Virt
1 | a | A A A A A A A A A A . . . . . . . . . .
  | b | B B B B . . . . . . A A A A A A . . . .
2 | a | C . . . B B B . . . B B B . . . A A A .
  | b | . C . . C . . B B . C . . B B . B B . A
3 | a | . . C . . C . C . B . C . C . B C . B B
  | b | . . . C . . C . C C . . C . C C . C C C
___________________________________________________________________________
  |   | . . . . . Y Y Y Y . . Y Y Y Y . . . * .
( '.' marks an unoccupied virtual core ) Thus 8/20 = 0.4 or 40% lucky and thus 60% unlucky. Sorry ...... ____________ "I have made this letter longer than usual, because I lack the time to make it short." - Blaise Pascal | |
| ID: 108811 | | |
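The hand-drawn case tables above can be cross-checked by brute force. A minimal Python sketch, assuming tasks are interchangeable and that logical CPU i sits on physical core i // 2 (a numbering chosen only for this illustration). It counts placements by how many physical cores end up holding two tasks.

```python
from itertools import combinations

def classify_placements(n_cores, n_tasks):
    """Count task placements by the number of doubly occupied physical cores."""
    counts = {}
    # Each placement is a set of distinct logical CPUs, one per task.
    for placement in combinations(range(2 * n_cores), n_tasks):
        cores = [cpu // 2 for cpu in placement]  # assumed CPU-to-core mapping
        doubled = len(cores) - len(set(cores))   # cores holding two tasks
        counts[doubled] = counts.get(doubled, 0) + 1
    return counts

print(classify_placements(3, 3))  # 3 tasks on 3 HT cores
print(classify_placements(4, 4))  # 4 tasks on 4 HT cores
```

Key 0 is the 'good' count. For 3 tasks on 3 cores it gives 8 good out of 20, and for 4 tasks on 4 cores it gives 16 good, 48 with one shared core and 6 with two shared cores out of 70, in line with the corrected figures in the thread.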
So that's 22.86% ( 16/70 ) are good, 8.57% ( 6/70 ) are worst, and 68.57% ( 48/70 ) are mediocre [ yup, that adds to 100! :-) ] I spent some time getting relative performance estimates for the assignments Mike here calls Good, Mediocre, and Worst. For this purpose I used a new range of frequencies/seq, having exhausted my stock of the previous, but actually think these are still very close to the previous in work content. However I shifted the reported CPU time to elapsed, rather than reported task, particularly because I observed some anomalous behavior in my test condition: Some of the time, Windows would not activate all of the currently executing Einstein tasks, even though the affinity prescriptions left open a virtual CPU. One practical impact was a much bigger discrepancy between elapsed time and reported CPU time than usual. Paradoxically, the most restrictive case of allowing a task to run on one and only one virtual CPU, which one would expect to suffer from the occasional case of waiting for a higher priority system task which happened to get assigned that CPU during one of the many times per second that the Einstein task is out on a context switch (waiting for disk, or ...). Suffer it doubtless does, but on my rather artificial test cases, sometimes the most restrictive assignment was far more productive than the most free one (specifically, in the case of running four Einstein tasks on two physical cores--thus four CPUs--it was much more productive to hard assign each of the four tasks, than to allow all four to range freely). Here are some comparison numbers. I'll leave it to others to estimate whether these--matched with Mike's proportions, suggest that Windows is doing a bit better or a bit worse than one might predict on random assignment. 
[image: comparison spreadsheet] The top populated HT_4 row corresponds to Mike's Good (.7622), the next two rows both represent Mike's Bad, but I suspect the first of the two (.5692) is more representative of actual occurrence, and the throughput composite line (.6596) is my estimate of Mike's Mediocre case. The relative performance numbers in this paragraph are all estimates of aggregate system throughput compared to a HT_8 case on the same work. One should perhaps mention that the designers of the Windows scheme probably did not consider maximizing system aggregate throughput of persistent very low priority tasks as high on their desired outcome list. ____________ | |
| ID: 109021 | | |
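One way to take up the comparison left open above is to weight the three relative throughput figures by Mike's proportions. A minimal Python sketch, assuming every placement of 4 tasks on 4 HT cores is equally likely and using .5692 for the worst case as archae86 suggests. The dictionary names are illustrative only.

```python
from fractions import Fraction

# Share of random placements in each class (4 tasks on 4 HT cores, from the thread).
weights = {
    "good": Fraction(16, 70),
    "mediocre": Fraction(48, 70),
    "worst": Fraction(6, 70),
}
# Aggregate throughput relative to HT_8, as estimated in the post above.
throughput = {"good": 0.7622, "mediocre": 0.6596, "worst": 0.5692}

expected = sum(float(weights[k]) * throughput[k] for k in weights)
print(f"expected relative throughput under random assignment: {expected:.4f}")
```

A measured free-range HT_4 figure above this value would suggest Windows does better than random assignment, and one below it would suggest worse.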
|
To be clear for other readers : by elapsed time is meant as per a clock on the wall from the task beginning until completed, whereas CPU time is the total of accumulated time slices devoted to the task. The difference would be how long the task is waiting to be executed on a CPU. Think of a CPU, physical or virtual, as a contended resource which requires allocation - this is an OS dependent function that switches tasks to CPUs on a priority basis. Tasks thus compete amongst themselves for CPU time, including the tasks we "don't see" like a heap of mundane OS stuff. But our tasks need those mundane ones to perform ( disk access say ) and thus may be 'blocked' in proceeding while awaiting their completion. | |
| ID: 109023 | | |
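The distinction Mike draws above can be seen directly from inside a program. A minimal Python sketch, using the standard library's wall clock and process CPU clock. The `time.sleep` call is only a stand-in for any period in which the task is not running on a CPU, and the function names are made up for this example.

```python
import time

def measure(work):
    """Return (elapsed_seconds, cpu_seconds) for one call of work()."""
    wall_start = time.perf_counter()   # clock on the wall
    cpu_start = time.process_time()    # CPU time charged to this process
    work()
    return time.perf_counter() - wall_start, time.process_time() - cpu_start

def mostly_waiting():
    sum(i * i for i in range(200_000))  # a little real computation
    time.sleep(0.3)                     # stand-in for time spent off the CPU

elapsed, cpu = measure(mostly_waiting)
print(f"elapsed {elapsed:.2f} s, CPU {cpu:.2f} s")
```

The gap between the two numbers is the time the work spent waiting rather than executing, which is the quantity that grew so much in the constrained test cases.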
|
Mike Hewson wrote: by elapsed time is meant as per a clock on the wall from the task beginning until completed, whereas CPU time is the total of accumulated time slices devoted to the task. Just so. And most of the time you want to quote CPU time for various comparisons, as it is generally less subject to variation from the "otherwise" state of the system than is elapsed time. But for my tests I believe I did a pretty good job of avoiding nearly all the appreciable time consumers, so that for most cases either would serve--save in this toxic mis-allocation, where an available task theoretically in execution nevertheless fails to get assigned to an available CPU at substantial likelihood over an extended period of time. On a completely unrelated note--I appear to have killed my Westmere host late this afternoon. I was debugging the problem of only getting 4G out of 6G of RAM, and had just completed the last step in satisfying Corsair's RMA requirements by testing each RAM module of the set separately with memtest86+. It failed even to boot with the offending module, so could be deemed to have failed the test. But something happened in transitioning back to a known good RAM configuration (for one thing, I think I failed to turn off the power supply before shuffling RAM modules, a rookie mistake for sure), and as of now the system gives no signs of life at all save for consuming 20 watts from the wall. No fan spins, no beep codes, no Mobo Dr Debug digits displayed, no sounds for hard drive or CD drive--in fact no detectable response to pressing the front panel "power" button at all--not even in power consumption which remains steady at 20W (before this death, the behavior was that on turning on the real power switch on the supply, it would go up to about 20W over a couple of seconds, stay there for a couple of seconds, then descend to about 5W until the front panel button was pressed). Yes I have exercised the ClrCMOS jumper. 
At the moment I suspect death of the motherboard or of the power supply, though there are some other possibilities. I plan to sleep on it, and tomorrow disconnect very nearly everything (eventually including the CPU). ____________ | |
| ID: 109030 | | |
On a completely unrelated note--I appear to have killed my Westmere host late this afternoon. I was debugging the problem of only getting 4G out of 6G of RAM, and had just completed the last step in satisfying Corsair's RMA requirements by testing each RAM module of the set separately to memtest86+. It failed even to boot with the offending module, so could be deemed to have failed the test. But something happened in transitioning back to a known good RAM configuration (for one thing, I think I failed to turn off the power supply before shuffling RAM modules, a rookie mistake for sure), and as of now the system gives no signs of life at all save for consuming 20 watts from the wall. No fan spins, no beep codes, no Mobo Dr Debug digits displayed, no sounds for hard drive or CD drive--in fact no detectable response to pressing the front panel "power" button at all--not even in power consumption which remains steady at 20W (before this death, the behavior was that on turning on the real power switch on the supply. it would go up to about 20W over a couple of seconds, stay there for a couple of seconds, then descend to about 5W until the front panel button was pressed). Yes I have exercised the ClrCMOS jumper. At the moment I suspect death of the motherboard or of the power supply, though there are some other possibilities. I plan to sleep on it, and tomorrow disconnect very nearly everything (eventually including the CPU). Oh. With a bit of luck it'll turn out to be something cheaper and simpler like the power supply not supplying a trickle current to the board ( so that it knows when the power switch has been toggled via the mobo input pins ) for full switch on. Swap in a known good PSU and see what happens ... I've seen this before and found/claimed the PSU capacitors at fault ( age plus paper'n'paste and not solid-state ). 
So your 20W could represent a 'short' across the capacitors, dropping the voltage of outputs ( including the PSU's own fans ), and of course ripple control, thus little happens on the mobo. Look for eburnation of the capacitor connections to the printed circuit board. I really like Corsairs myself. I'll look at the recent Westmere data and see if I can soundly deduce anything. Cheers, Mike. ____________ "I have made this letter longer than usual, because I lack the time to make it short." - Blaise Pascal | |
| ID: 109042 | | |
|
Mike Hewson wrote: Oh. With a bit of luck it'll turn out to be something cheaper and simpler like the power supply not supplying a trickle current to the board ( so that it knows when the power switch has been toggled via the mobo input pins ) for full switch on. Swap in a known good PSU and see what happens ... I disconnected the suspect supply and did rudimentary testing. With a 575 ohm resistor across Vsb to meet minimum current requirement, I saw 5.1V on Vsb. When I shorted /PS_ON to ground, I saw voltages on all main supplies too close to correct to give this stone cold dead behavior even though I was not providing any load to them. Then I attached a different supply to the motherboard 8 pin and 24 pin ATX connectors, and got the same behavior, save only that the power consumed was 10W instead of 20W. I failed to mention this before, but when healthy, the system draw when "off" (meaning standby) before was something like 2W. I suspect something that failed on the motherboard is putting a heavy load on Vsb. If the motherboard itself is not at fault, I think the most likely thing is that I fried the CPU during mishandling the RAM in a way that happens to present an intolerable load to motherboard or supply. So once I have the HSF and CPU off, I'll probably do a last trial to see if the mobo shows a little life (debug LED at least) in that state. ____________ | |
| ID: 109052 | | |
I disconnected the suspect supply and did rudimentary testing. With a 575 ohm resistor across Vsb to meet minimum current requirement, I saw 5.1V on Vsb. When I shorted /PS_ON to ground, I saw voltages on all main supplies too close to correct to give this stone cold dead behavior even though I was not providing any load to them. Then I attached a different supply to the motherboard 8 pin and 24 pin ATX connectors, and got the same behavior, save only that the power consumed was 10W instead of 20W. Darn. Let's hope it's the mobo. [aside] I lost a CPU once because of a cheap case. When screwing in the case cover screws, the soft metal in the edges of the screw holes made little shavings. One fell across the CPU pins at one edge ( pre CPU heat-sink days, and pre air-in-a-can days ) causing a dead short on boot up and dead CPU. What are the odds on that? Some days it can be like 'for want of a nail, a horseshoe was lost ....' :-) [/aside] Cheers, Mike. ____________ "I have made this letter longer than usual, because I lack the time to make it short." - Blaise Pascal | |
| ID: 109053 | | |
|
I am surprised, happy, and very puzzled. This morning I pulled the motherboard out of the case and removed HSF and CPU. I figured if the stone-cold dead symptom with excess power consumption continued, I had a bad motherboard, and if not, I had a fried CPU. It was a well-behaved 2W peak descending to 1W (not the steady 10W I had seen on this supply) and the smart button LEDs lit up, so my diagnosis was CPU, but before spending almost $400 US for a replacement I put the "bad" one back in the socket. All was still well !?!? | |
| ID: 109078 | | |
|
Well, this is all good. Luck is a fortune! Some element has held onto charge and biased something inappropriately, now discharged. | |
| ID: 109085 | | |
|
I'll try a guess at longer term performance with the Westmere, taking account of the settings/scenario : | |
| ID: 109090 | | |
I am surprised, happy, and very puzzled. This morning I pulled the motherboard out of the case and removed HSF and CPU. I figured if the stone-cold dead symptom with excess power consumption continued, I had a bad motherboard, and if not, I had a fried CPU. It was a well-behaved 2W peak descending to 1W (not the steady 10W I had seen on this supply) and the smart button LEDs lit up, so my diagnosis was CPU, but before spending almost $400 US for a replacement I put the "bad" one back in the socket. All was still well !?!? I have seen this before in pc's, I have always thought it was some capacitor holding its charge and only after sitting, thus losing its charge, do things go back to normal. After telling people to try rebooting this is one of my secret fixes when providing over the phone pc tech help. I always tell them to wait about 15 to 30 minutes before restarting the pc, and it really does seem to work sometimes. It has saved me many a trip only to find things 'just working' when I do make the trip to their homes. I always tell them 'the pc is scared and knows I am there and will fix it' so it just works before I have to whip it into shape, we all laugh and I walk away wondering! ps I have enjoyed this thread and your testing of the different ways to crunch and which is best, please don't stop! | |
| ID: 109091 | | |
It has saved me many a trip only to find things 'just working' when I do make the trip to their homes. I always tell them 'the pc is scared and knows I am there and will fix it' so it just works before I have to whip it into shape, we all laugh and I walk away wondering! Cherish this effect! In medicine we say that 'a good doctor times his treatment to coincide with recovery!' :-) :-) My other favorite is 'neither kill nor cure, if you seek repeat business!' ;0 :-) Cheers, Mike. ____________ "I have made this letter longer than usual, because I lack the time to make it short." - Blaise Pascal | |
| ID: 109093 | | |
Cherish this effect! In medicine we say that 'a good doctor times his treatment to coincide with recovery!' :-) :-) A-ha... Isn't that subverting the natural immune response to give a Pavlovian reinforcement to have you called out to merely administer a placebo?... My other favorite is 'neither kill nor cure, if you seek repeat business!' ;0 :-) Ouch! That also sounds like certain dubious business practices as is foisted in IT/computers to maintain a never-ending upgrade cycle... How to distinguish the good from the game? Cheers, Martin ____________ Powered by Mandriva Linux A user friendly OS! See the Boinc HELP Wiki | |
| ID: 109108 | | |
|
Interesting analysis. | |
| ID: 109109 | | |
I'm surprised at the 12% penalty for having WUs roam around the cores... Is Windows scheduling really that bad? That 12% adds up to an awful lot of poisoned cache. Or is it more a case of the low priority tasks for the roaming WUs being interrupted more frequently by other tasks even when other cores are idle? (Again, a quirk of poor scheduling?) No, No, and No. The issue is not thrashing of any kind, but rather that for significant periods of time a task is not active. I thought I made this point clear in my notes, but obviously I failed, as both you and Mike seem to have a different notion. The situation is quite artificial, in that affinity constraints to pools of a subset of all CPUs are placed on a task. So this behavior has no obvious relevance to typical working system behaviors. No such effect is seen where no affinity constraint is supplied, and sufficient Einstein work is allowed to execute to populate all cores (i.e. 8 active Einstein tasks on my system). I hope that would put to rest the mistaken references to poisoned cache, excess context switches, RAM bandwidth, and so on. Clearly the 8/8 task situation has worse inherent constraint from each of these than does the case where 4 tasks are constrained to 4 CPUs on two physical cores. ____________ | |
| ID: 109111 | | |
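For anyone wanting to set up similar constrained cases: the Windows API expresses a process affinity constraint as a bitmask with one bit per logical CPU. A minimal Python sketch of building such masks, assuming logical CPUs 2k and 2k+1 are the two HT siblings of physical core k. That numbering is an assumption made for this sketch, the thread does not say how archae86's host enumerates them, and both helper names are hypothetical.

```python
def affinity_mask(logical_cpus):
    """Bitmask with one bit set per allowed logical CPU."""
    mask = 0
    for cpu in logical_cpus:
        mask |= 1 << cpu
    return mask

def cpus_of_cores(physical_cores):
    """Logical CPUs of the given physical cores (assumed sibling numbering)."""
    return [2 * core + sibling for core in physical_cores for sibling in (0, 1)]

# "Free range" within a pool: every task may use any of the 4 CPUs on cores 0 and 1.
pool = cpus_of_cores([0, 1])
print("pool mask:", hex(affinity_mask(pool)))

# "Hard assigned": each task is pinned to exactly one logical CPU of the pool.
for task, cpu in enumerate(pool):
    print(f"task {task}: mask {hex(affinity_mask([cpu]))}")
```

The two loops correspond to the two cases compared above: four tasks sharing one pool mask versus four tasks with one single-bit mask each.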
I'm surprised at the 12% penalty for having WUs roam around the cores... What is the case that the task becomes not active? The situation is quite artificial, in that affinity constraints to pools of a subset of all CPUs are placed on a task. So this behavior has no obvious relevance to typical working system behaviors. Are you suggesting that the affinity restrictions will push multiple tasks onto just one CPU?... No such effect is seen where no affinity constraint is supplied, and sufficient Einstein work is allowed to execute to populate all cores (i.e. 8 active Einstein tasks on my system). Which is what we expect to be the optimum usage and that does indeed appear to be the case from the numbers. The interesting bits are the numbers from the artificial cases to try to work out what the effects are, and their significance. I hope that would put to rest the mistaken references to poisoned cache, excess context switches, RAM bandwidth, and so on. Clearly the 8/8 task situation has worse inherent constraint from each of these than does the case where 4 tasks are constrained 4 CPUs on two physical cores. There are examples of some systems where due to RAM bandwidth constraints and CPU cache usage, you may well get higher throughput by running tasks on only 6 or 7 out of 8 virtual cores... This came up in previous s@h or e@h threads. It comes back to an old argument that certain mixes of Boinc tasks can be beneficial for maximum throughput, and some combinations can be detrimental... It's all a question of what system bottlenecks get hit. We usually tune the system to keep the most expensive resource (the CPU) fully busy. (However, on my recent systems, the CPU, GPU, and RAM have all been about equally priced...) Happy fast crunchin', Martin ____________ Powered by Mandriva Linux A user friendly OS! See the Boinc HELP Wiki | |
| ID: 109113 | | |
The issue is not thrashing of any kind, but rather that for significant periods of time a task is not active. I thought I made this point clear in my notes, but obviously I failed, as both you and Mike seem to have a different notion. Oh, I see now. Quite right. My bad, and even after you went to the trouble of color highlighting! :O] I was thinking about HT too much. Potential execution time 'lost' by a task could be when it is not allocated a slice at all, even if it appears/seems it reasonably could have been. My 'unseen mechanisms' are imaginary. This is an OS issue, so the scheduling algorithm is the proper focus. Ooooh. [ See my earlier conclusions with (B) vs (D) - 'you lose more time if there is more choices of chairs' when not hard assigning ] Since your machine is rigged to be on the rather light side of task load ( compared to 'typical' use ) - what's about the number you see on the 'Processes' tab of Task Manager? Cheers, Mike. ____________ "I have made this letter longer than usual, because I lack the time to make it short." - Blaise Pascal | |
| ID: 109114 | | |
Since your machine is rigged to be on the rather light side of task load ( compared to 'typical' use ) - what's about the number you see on the 'Processes' tab of Task Manager? I construe your inquiry to apply to the condition for my tests, as to which I asserted that I shut down some things but not all (think I mentioned leaving my antivirus running, for example). Just now I exited BOINC and shut down the same things I was shutting down for the tests, then I looked at Task Manager's Process count and saw 41. Then I started boincmgr, and waited a little while for it to spawn things, and saw 61 as the TM Process count. That would be boincmgr, boinc, 8 Einstein executables, plus 8 instances of Console Window Host and two more I failed to spot. The non-BOINC stuff showing in general has very low CPU use and pretty low context switch delta counts, but it is not nothing. ____________ | |
| ID: 109118 | | |
|
I've been collecting data on the new Intel i7-2600K for quite a while and thought these particular S5 measurements fit well here in this thread on hyper-threading. Yes, I realize that the S5 run just finished, but these results are applicable to the S6 run also. In fact, I was just finishing verifying a few data points when I ran out of my final S5 work units. Only S5 gravity wave jobs were run during this test collection, no BRP jobs. | |
| ID: 112461 | | |
I've been collecting data .... work has a direct relationship to clock speed. What a brilliant set of observations! Thank you very kindly for collecting, analysing and presenting that here. :-) :-) Yes, the trends are clear. Let it be our benchmark for HT thinking. By eye I can see the relation to clock speed ( all else held same ) could be modelled as linear to close fit. The 'knee' at 4 HT cores is vivid. Indeed the RAC benefit per extra/added core ( the slope of the curves, ~ 2000 initially ) halves thereafter to ~ 1000. Which is near as stuff all to 2:1 ..... so there's the swapping overhead arising. Again, thanks for the work on that! :-) Cheers, Mike. ____________ "I have made this letter longer than usual, because I lack the time to make it short." - Blaise Pascal | |
| ID: 112462 | | |
I've been collecting data .... Indeed, very nice clear results. Thanks for sharing. By eye I can see the relation to clock speed ( all else held same ) could be modelled as linear to close fit. That suggests a nicely balanced system, or a system where the memory bandwidth nicely exceeds that needed by the CPU for these tasks. That is: The CPU processing is the limiting factor. There's enough fast enough memory to let the CPU run at 100% utilisation for the CPU critical paths. The 'knee' at 4 HT cores is vivid. Indeed the RAC benefit per extra/added core ( the slope of the curves, ~ 2000 initially ) halves thereafter to ~ 1000. Which is near as stuff all to 2:1 ..... so there's the swapping overhead arising. ... I don't interpret that as swapping unless you really mean 'process thread interleaving'. Intel's "Hyper-threading" uses additional state registers/logic to allow two process threads share the same one pool of processing units for a (physical) CPU core. You are certainly getting a useful increase in throughput with the HT. Happy fast crunchin', Martin ____________ Powered by Mandriva Linux A user friendly OS! See the Boinc HELP Wiki | |
| ID: 112465 | | |
..... unless you really mean 'process thread interleaving' .... No I don't especially. Unless/until we have any information as to how ( or indeed if ) his Linux machine's process scheduler handles affinity, I'll leave it as a generic 'swap' concept. See earlier discussions in this thread. Cheers, Mike. ____________ "I have made this letter longer than usual, because I lack the time to make it short." - Blaise Pascal | |
| ID: 112486 | | |
so there's the swapping overhead arising. You probably already know everything I'm going to say now, but this wording leads to misunderstanding, even if you meant the right thing. If HT is used 2 tasks are being run on one core at the same time. That is not only at the same time for the observing user (as running 2 threads with 50% time share each would look like), but also at the same time for the CPU, clock for clock if you will. This is totally independent of OS scheduling and everything people normally associate with "swapping". Speed per task does drop upon HT use because, although both threads have individual registers and such (the "core components" of the core), they have to share caches and, most importantly, execution units. That's the whole point of HT: making better use of the execution units for relatively little more die space. What you seemed to talk about is OS scheduling, where the scheduler reassigns tasks to specific cores at typically ~1 ms intervals (Windows). Which is an eternity compared to the CPU clock ;) MrS ____________ Scanning for our furry friends since Jan 2002 | |
| ID: 112512 | | |
so there's the swapping overhead arising. Absolutely correct, but moot. Since we don't know what his machine's actual scheduling behaviour is, it's an assumption that it's the same across the graph ie. is affinity maintained? Recall that E@H workunits are given low priority by default and hence will be readily displaced by system calls etc especially/inevitably if all physical cores are busy. So there will be a substantial part of the latter right side of the graph ( 4 and over virtual cores busy ) involving actual OS task swaps in addition to HT behaviours. Again, see earlier discussion ... Cheers, Mike. ( edit ) Sorry, the other thing I've not mentioned here is the probable 'pure' HT overhead. I've seen various estimates but nowhere near enough penalty to give that 2:1 ratio ( change at the knee ) in the benefit per added virtual core. I think the HT throughput ( opinions differ ) for 2 units on a core was given at about 1.7 at worst? Mind you my 2:1 estimate was by eye ..... ( edit ) Printing out the graphic and drawing lines I find that : the ratio of the slope of the lines for less than 4 jobs to the slope of the lines for more than 4 jobs ( ie. before and after the knee, and for each given processor speed ) are all about 2.7:1 plus/minus ~ 0.05. So I guess the question is what are the best and worst estimates for HT throughput - the 'pure' HT part - with the remainder being OS task swaps ? ( edit ) Moreover if one does draw a line from the knee to the 8 job point, you'll find the graph dips slightly below it at around 6 jobs and then returns .... a short mild upwards concavity. This is repeated for all curves. So there's something ( my guess is non-HT related ie. scheduling behaviour ) kicking in there. I'll post a modified graphic explaining what I mean by all this when I get a chance ..... 
:-) ( edit ) I hope this is sufficient : ![]() NB : I did the calculations for the other intermediate clock speeds and got very close to the above ratio pre/post knee ~ 2.69 Plus I've assumed that since the measure is daily RAC then wall clock time ( as opposed to CPU time ) - "averaged the time required to complete a single S5 work unit" - is the relevant scale. ____________ "I have made this letter longer than usual, because I lack the time to make it short." - Blaise Pascal | |
| ID: 112515 | | |
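The question of how a knee slope ratio relates to 'pure' HT throughput can be put in numbers with a very simple model. A minimal Python sketch, assuming each job up to the knee gets a physical core to itself and adds s to RAC, while each job past the knee becomes the second thread on a core and lifts that core's output from s to f times s, with no scheduling or memory losses at all. Both function names and the model itself are illustrative, not taken from the measurements.

```python
def ht_factor_from_slope_ratio(ratio):
    """Two-thread throughput factor f implied by a pre/post knee slope ratio."""
    # Slope before the knee is s, after it (f - 1) * s, so ratio = 1 / (f - 1).
    return 1 + 1 / ratio

def slope_ratio_from_ht_factor(factor):
    """Pre/post knee slope ratio implied by a two-thread throughput factor f."""
    return 1 / (factor - 1)

for ratio in (2.0, 2.7):
    print(f"slope ratio {ratio}:1 -> HT factor {ht_factor_from_slope_ratio(ratio):.2f}")

print(f"HT factor 1.7 -> slope ratio {slope_ratio_from_ht_factor(1.7):.2f}:1")
```

Under this model a 2:1 ratio corresponds to a factor of 1.5, and the gap between the factor implied by a measured ratio and an assumed pure HT factor such as 1.7 is what would be left to explain by scheduling or other effects.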
|
Further thoughts ( based on prior statements/assumptions ) : why the dip below linear at around 6 jobs? The slope of the curve is : the rate of change of RAC with change of virtual core number. That means there is a slight "penalty" for going from 4 to 6 which is "recovered" by going from 6 to 8 ( within the expected entire pattern of somewhat less than 2:1 throughput from HT at over 4 virtual cores ). Shouldn't task swaps per se forced by higher thread occupancy of the CPUs give a concave down aspect to the 'thigh' part of the curve? Meaning that at say 5 GW jobs there'll probably be more physical cores ( 3 ) only occupied by a single GW job, than at 7 GW jobs with fewer physical cores ( 1 ) occupied by a single GW job ie. non-GW work ( system stuff say ) is more likely to bump a GW job off a given physical core with 7 jobs than 5 .... unless I'm viewing an artifact of the data presentation. | |
| ID: 112519 | | |
|
one thing to add.. | |
| ID: 112526 | | |
since even on a 64-bit linux we are running a 32-bit app (correct me if i'm wrong), this leads to only 8 of the 16 SSE2 registers of a core being usable. I was not even aware of this particular distinction. Speaking hypothetically, I suppose that an application variant which used more registers might be expected to generate less data memory traffic. As one of the opportunities for HT benefit is clearly finding something useful to do while waiting for a memory read, that would seem to suggest possibly less HT benefit on the hypothetical variant. But memory references able to be supplanted by registers are, I should think, highly likely to be filled from cache, not RAM, and usually a fast level of the cache. In practice I doubt the speculated effect is either substantial or consistent. Over at SETI, it appears that the Lunatics tuned applications include distinct x64 and x32 Windows variants for both Astropulse and Multibeam. Do you know whether anyone has done work to compare the actual execution performance to see what benefit their x64 version provides compared to x32 when both are running in a 64-bit OS? I think there has actually been less recent careful HT assessment there than here, and certainly don't recall spotting any x64 vs. x32 HT detail. But such an answer would necessarily be highly application-specific. I don't think either of the current SETI analyses much resembles any of the Einstein analyses computationally (if Bikeman or others know I'm wrong here, please correct me), and the considerable history and infrastructure of the Lunatics effort may mean that available tuning benefits have been more thoroughly explored there. ____________ | |
| ID: 112532 | | |
since even on a 64-bit linux we are running a 32-bit app (correct me if i'm wrong), this leads to only 8 of the 16 SSE2 registers of a core being usable. that's only part one of the story - part 2 is, that in theory the core can process twice the number of calculations in a single operation. if the code can be and is vectorized to use all 16 registers. Over at SETI, it appears that the Lunatics tuned applications include distinct x64 and x32 Windows variants for both Astropulse and Multibeam. Do you know whether anyone has done work to compare the actual execution performance to see what benefit their x64 version provides compared to x32 when both are running in a 64-bit OS? nope - i do not care about YETI.. ;) | |
| ID: 112534 | | |
I was not even aware of this particular distinction. Speaking hypothetically, I suppose that an application variant which used more registers might be expected to generate less data memory traffic. That would seem to push back in the opposite direction if it were true. Not denying the performance benefit possibly available to the portion of the code doing that, but the opportunity for HT benefit would seem to go up with more closely spaced data reads from memory, which this would do. But I doubt it is true. Do you seriously think that even Nehalem comes equipped with enough distinct SSE floating point units ever to keep 16 registers in use in real-world code? Looking at this Nehalem block diagram I see two available ADD SSE units and one available MUL/DIV SSE unit. So thinking in terms of two-operand instructions that could support six, once in a while, but how on earth do you imagine getting to sixteen? ____________ | |
| ID: 112538 | | |
Looking at this Nehalem block diagram I see two available ADD SSE units and one available MUL/DIV SSE unit. So thinking in terms of two-operand instructions that could support six, once in a while, but how on earth do you imagine getting to sixteen? this is MAYBE another thing on nehalems and bulldozers. but talking about cores capable of SSE2 in 64bit mode, and this is ( cough ) P4 and ATHLON64! the difference between "native mode" and 64-bit mode started back then.. http://en.wikipedia.org/wiki/SSE2 that's just another reason why you'll find "AMD64" everywhere. getting back to a certain app - it really depends: heavy use of SSEx instructions? code which can be vectorized? processor architecture? bottom line is: unless you really give it a try, you'll never know, but if you do not do it, you can as well still believe in earth being a flat thing. in boincworld it's very rare that a real 64-bit app is not faster - you might want to check the numbers on http://wuprop.boinc-af.org/results/delai.py | |
| ID: 112539 | | |
Over at SETI, it appears that the Lunatics tuned applications include distinct x64 and x32 Windows variants for both Astropulse and Multibeam. Do you know whether anyone has done work to compare the actual execution performance to see what benefit their x64 version provides compared to x32 when both are running in a 64-bit OS? I think there has actually been less recent careful HT assessment there than here, and certainly don't recall spotting any x64 vs. x32 HT detail. There are only 64bit Windows apps for CPU Multibeam, there are no Windows 64bit Astropulse apps, and no Windows 64bit Cuda apps, Claggy | |
| ID: 112541 | | |
There are only 64bit Windows apps for CPU Multibeam, there are no Windows 64bit Astropulse apps, and no Windows 64bit Cuda apps, Thanks for the correction. I carelessly relied on a heading in their download area reading "AstroPulse for Windows - x64 & x32 Bit Windows AstroPulse apps for SSE & SSE3", which is similar language to that used for the Multibeam entry. Maybe the intended meaning is that the applications will run in those environments. ____________ | |
| ID: 112544 | | |
There are only 64bit Windows apps for CPU Multibeam, there are no Windows 64bit Astropulse apps, and no Windows 64bit Cuda apps, Thanks for the correction, I carelessly relied on a heading in their download area reading There are just different installers aimed at 32bit or 64bit BOINC, the 64bit installer adds more <app_version> entries to try and make sure no one loses any work when installing the Lunatics apps, but the only 64bit app in it is the AK_V8 MB app, Claggy | |
| ID: 112552 | | |
|
I've been tied up the last week and I see a few questions have come up on the HT results I collected.
- No special techniques were used for setting processor affinity
- I used a standard 64 bit ubuntu load with 32 bit compatibility libs, no other OS tuning
- The 32 bit S5 SSE2 application was utilized, version 1.07
- This machine is dedicated to running E@H, so no other side loads
- It took many weeks to collect measurements for all the data points, so different frequencies were used
- I assumed slight variations in measurement points were from the different frequencies and data sets, as discussed in the beginning of this thread
- Thanks Mike for the suggestion to compare slope ratios as a way to answer the question how much benefit from HT
- My Calculations are in reasonable agreement with Mike's 2.69 calculated slope ratio
| |
| ID: 112574 | | |
|
There seemed to be interest in this data, so I'll post the hyper-threading S5 data I collected on the i7-980 for comparison. | |
| ID: 112575 | | |
My Calculations are in reasonable agreement with Mike's 2.69 calculated slope ratio Being lines fitted by ( my ) eye on a piece of paper : I definitely think I'm over exact at quoting 2 decimal places. Probably better to quote to only one, say call it 2.7 .... :-) Anyway they're all around 2.5 to 3.0 hence my impression is that OS swap overhead is at least comparable to HT effects at high core loads. OK. If so, then to test : did someone mention 'process lasso' or somesuch as an appropriate Windoze based utility to achieve affinity control? And return detailed timings for that matter, or does some other utility do that better? Suggestions? This machine of mine if divested of BRP work could be an ideal test rig methinks, I could measure actual core times vs wall clock times on a per virtual core basis? Thus times per core not devoted to GW tasks of interest, derive fractional overheads etc. Thanks again for collecting and presenting that! We're always looking to study such behaviours and perhaps get a hint or two on optimising. :-) Cheers, Mike. ____________ "I have made this letter longer than usual, because I lack the time to make it short." - Blaise Pascal | |
| ID: 112577 | | |
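The core-time versus wall-clock comparison proposed above can be put into a few lines. This is only a sketch under an assumption: "fractional overhead" is read here as the share of wall-clock time that was not charged to the task as CPU time. The function name and the sample figures are hypothetical, not measurements from this thread.

```python
def overhead_fraction(cpu_seconds, wall_seconds):
    """Share of elapsed (wall-clock) time not charged to the task as CPU time.

    Assumes a single-threaded task pinned to one virtual core, so that
    cpu_seconds should not exceed wall_seconds.
    """
    if wall_seconds <= 0:
        raise ValueError("wall_seconds must be positive")
    return 1.0 - cpu_seconds / wall_seconds


# Made-up example: a task charged 90 s of CPU over 100 s of wall clock.
print(f"{overhead_fraction(90.0, 100.0):.1%} of wall time not spent on the task")
```

Collecting this per virtual core, with and without other tasks running, would give one number per core to compare against the slope ratios discussed earlier.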
|
just stumbled over that one: | |
| ID: 112795 | | |
just stumbled over that one: It is indeed! Thanks for digging that up. :-) Those bandwidth curves look eerily familiar. I'll look closer at the article and comment further if appropriate. FWIW : I have stopped the GPU work within and have tooled up that machine I mentioned earlier with Process Lasso. I am currently benchmarking the virtual cores when used alone for bucket WU's ie. only one WU at a time on the entire machine and I am proceeding through each core ( 5 of 8 done ). I thought I would at first examine the ( entirely reasonable ) assumption that all cores are equivalent one to the next, so that if there is some asymmetry in the hardware it'll come out and I can account for that when I move to testing the ideas I mentioned earlier. They are doing fine thus far with average times overlapping well within each others' one-standard-deviation widths. I'll publish the full spread sheet of data when complete. I'm doing 12 WU's per core, tossing out the highest and the lowest, and using the remaining 10 as 'typical' for statistics. I'm also after the idea of a 'fiducial occasion' or 'single virtual core run-time' for this machine as it is presently configured on bucket WU's, and thus have already collected very many to aggregate to form that. I've also had to admonish my offspring for daring to touch it meantime .... :-) Cheers, Mike. ____________ "I have made this letter longer than usual, because I lack the time to make it short." - Blaise Pascal | |
| ID: 112815 | | |
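The per-core procedure described above (12 WUs per core, discard the highest and the lowest, treat the remaining 10 as typical) is easy to script. A minimal sketch follows; the function names are hypothetical and the run times are invented purely to show the shape of the calculation. The overlap test is one plausible reading of "within each other's one-standard-deviation widths".

```python
from statistics import mean, stdev


def trimmed_stats(times):
    """Drop the single highest and single lowest run time, return (mean, stdev)."""
    if len(times) < 4:
        raise ValueError("need at least 4 run times to trim and still get a stdev")
    kept = sorted(times)[1:-1]
    return mean(kept), stdev(kept)


def overlap_within_one_sigma(a, b):
    """True if the intervals mean +/- 1 stdev of two cores overlap."""
    (mean_a, sd_a), (mean_b, sd_b) = a, b
    return abs(mean_a - mean_b) <= sd_a + sd_b


# Invented run times in seconds, 12 per virtual core (NOT measured data).
core0 = [14210, 14195, 14302, 14188, 14250, 14231, 14199, 14275, 14220, 14240, 14900, 13950]
core1 = [14225, 14180, 14290, 14205, 14260, 14215, 14190, 14280, 14235, 14245, 14310, 14010]

s0, s1 = trimmed_stats(core0), trimmed_stats(core1)
print("core 0:", s0)
print("core 1:", s1)
print("overlap within one sigma:", overlap_within_one_sigma(s0, s1))
```

Trimming one value from each end is a simple guard against a stray outlier, such as a run disturbed by someone touching the machine.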
well - actually it's a pretty much ancient thing. P4-era you know, and things might have changed a lot since then. another thing which comes to mind here: how are cores mapped? if i look at my i5m i see this:
Coreinfo v2.11 - Dump information on system CPU and memory topology
Copyright (C) 2008-2010 Mark Russinovich
Sysinternals - www.sysinternals.com
Logical to Physical Processor Map:
*-*- Physical Processor 0 (Hyperthreaded)
-*-* Physical Processor 1 (Hyperthreaded)
Logical Processor to Socket Map:
**** Socket 0
Logical Processor to NUMA Node Map:
**** NUMA Node 0
Logical Processor to Cache Map:
*-*- Data Cache 0, Level 1, 32 KB, Assoc 8, LineSize 64
*-*- Instruction Cache 0, Level 1, 32 KB, Assoc 4, LineSize 64
*-*- Unified Cache 0, Level 2, 256 KB, Assoc 8, LineSize 64
-*-* Data Cache 1, Level 1, 32 KB, Assoc 8, LineSize 64
-*-* Instruction Cache 1, Level 1, 32 KB, Assoc 4, LineSize 64
-*-* Unified Cache 1, Level 2, 256 KB, Assoc 8, LineSize 64
**** Unified Cache 2, Level 3, 3 MB, Assoc 12, LineSize 64
so all this testing will need to pick the right cores. probably the real freak-out will come, if someone shows up with a quad-socket Xeon E7-4800. ;) | |
| ID: 112816 | | |
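To pick "the right cores" from a map like the one above, the star/dash masks can be turned into lists of logical processor numbers. A small sketch, assuming the leftmost character of each mask stands for logical processor 0 (the usual reading of Coreinfo output, but worth confirming on your own machine); the helper names are hypothetical.

```python
def logical_cpus(mask):
    """Indices of the logical processors marked '*' in a Coreinfo-style mask."""
    return [i for i, ch in enumerate(mask) if ch == "*"]


def sibling_groups(lines):
    """Map each 'Physical Processor N' line to the logical CPUs that share it."""
    groups = {}
    for line in lines:
        mask, _, label = line.strip().partition(" ")
        if label.startswith("Physical Processor"):
            core = int(label.split()[2])
            groups[core] = logical_cpus(mask)
    return groups


# The two mapping lines quoted in the post above.
coreinfo_map = [
    "*-*- Physical Processor 0 (Hyperthreaded)",
    "-*-* Physical Processor 1 (Hyperthreaded)",
]

for core, cpus in sibling_groups(coreinfo_map).items():
    print(f"physical core {core}: logical CPUs {cpus}")
```

With that mapping in hand, a "one task per physical core" test would pin tasks to one logical CPU from each group, while a "both HT halves busy" test would pin two tasks to the same group.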
I am currently benchmarking the virtual cores when used alone for bucket WU's ie. only one WU at a time on the entire machine and I am proceeding through each core ( 5 of 8 done ). While I think it rather likely that the virtual CPUs are in fact equivalent, I was surprised to find on my own host that at least one of the many background programs had an affinity. If you're ambitious you might check, for example by using Process Explorer and right clicking one process at a time and checking under "set affinity…". Remembering I had seen something in the past I did a little of this just now and noticed that Speedfan is set to run only on CPU number five (of eight on this Westmere host). ____________ | |
| ID: 112824 | | |
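When affinities are read or set by script rather than through the Process Explorer dialog, they usually appear as a bitmask. A minimal sketch, assuming the common convention that bit n of the mask corresponds to logical CPU n; the helper names are hypothetical, and treating "CPU number five" as zero-based index 5 is also an assumption.

```python
def cpus_from_mask(mask):
    """List the logical CPU indices enabled in an affinity bitmask."""
    return [i for i in range(mask.bit_length()) if (mask >> i) & 1]


def mask_from_cpus(cpus):
    """Build an affinity bitmask from a collection of logical CPU indices."""
    mask = 0
    for cpu in cpus:
        mask |= 1 << cpu
    return mask


# A process bound to a single CPU, index 5 of 0..7 (assumed zero-based).
single_cpu_mask = mask_from_cpus([5])
print(hex(single_cpu_mask), cpus_from_mask(single_cpu_mask))

# A process free to run anywhere on an 8-thread host.
print(cpus_from_mask(0xFF))
```

Any background process whose mask is narrower than the full set is a candidate for skewing per-core timings on the CPUs it is bound to.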
I am currently benchmarking the virtual cores when used alone for bucket WU's ie. only one WU at a time on the entire machine and I am proceeding through each core ( 5 of 8 done ). While I think it rather likely that the virtual CPUs are in fact equivalent, I was surprised to find on my own host that at least one of the many background programs had an affinity. Thank you indeed! I was basically thinking of possible mild hardware disparity, but yes there may well be OS bindings. This had not occurred to me. An excellent idea and I will check that. :-) Cheers, Mike. ____________ "I have made this letter longer than usual, because I lack the time to make it short." - Blaise Pascal | |
| ID: 112826 | | |
Thank you indeed! I was basically thinking of possible mild hardware disparity, but yes there may well be OS bindings. This had not occurred to me. An excellent idea and I will check that. :-) now that you get rolling.. ;) get the whole sysinternals suite - things to check here: process-explorer, PSSTART (because it can use the pretty ancient API), COREinfo to tell you what's really under the hood... i have been using mark's tools since long before the battleship of lawyers showed up and forced them to sign for redmond. | |
| ID: 112829 | | |
Thank you indeed! I was basically thinking of possible mild hardware disparity, but yes there may well be OS bindings. This had not occurred to me. An excellent idea and I will check that. :-) Terrific ideas. I will indeed get that to drill down and more cleanly separate the 'pure HT' aspect I seek from the rest. :-) Aside : I can use LogMeIn - pro version with LogMeIn Ignition - which basically acts as a neat layer over Windows Remote Desktop. It creates a secure VPN connection ( long & very random key ) and thence allows full remote control, file sharing, FTP, clandestine monitoring ( so any user on the target won't obviously know ), chat even, plus other stuff. So here I am in Germany fiddling/tweaking my HT profiling experiments on the DownUnda machine - with my only trouble being the laptop screen here doesn't match the bigger desktop there. So I have to pop my spectacles on. :-) Cheers, Mike. ____________ "I have made this letter longer than usual, because I lack the time to make it short." - Blaise Pascal | |
| ID: 112832 | | |