Hyperthreading and Task number Impact Observations


Message boards : Cruncher's Corner : Hyperthreading and Task number Impact Observations

archae86
Send message
Joined: 6 Dec 05
Posts: 1485
Credit: 273,618,622
RAC: 497,069
Message 108579 - Posted: 15 Dec 2010, 16:39:07 UTC

Over the next couple of days I plan to generate some observations on the execution time and aggregate throughput impact of switching my Westmere E5620 between hyperthreaded and not, and of varying the number of simultaneously executing Einstein HF 3.06 tasks.

Westmere is mostly a 32nm Xeon flavor of the classic Nehalem 4-core design, but with a 12 MB L3 cache. For these tests I'll leave it running as it has been lately: a moderate overclock to 3.42 GHz, with 4 Gbyte of RAM running at default settings (and one more Gbyte plugged in but not currently recognized by the BIOS--oops).

My current intentions--subject to revision if I have a better thought or get better advice here:

1. One "measurement task" for each condition will be started only after the remaining tasks for the condition are up and running; all measurement tasks will be chosen from the same HF frequency, 1373.90.

2. I'll turn off most overhead tasks of the more frivolous sort, and not use the system for personal work during the timed runs, but leave running my Kaspersky AV (which does hurt a bit during that nasty slow startup phase, but I consider essential).

The test conditions I definitely plan to log are:
HT 8
HT 1
nHT 4
nHT 1

I might fill in some of the intermediate task count points, say perhaps HT 4 and HT 6, but probably not all of them.

For each condition I think I'll show CPU time for the task, CPU time for the task relative to the HT_8 case, and implied system throughput relative to the HT_8 case. Also system input power.
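The "implied system throughput relative to HT_8" arithmetic can be sketched in a few lines. This is a hypothetical helper of my own naming, assuming every task in a condition takes the same CPU time and one task runs per allowed (virtual) CPU; the 22,500 s HT_8 reference figure is the one quoted later in this thread.

```python
# Sketch of the "throughput relative to HT_8" metric described above.
# Assumption: aggregate throughput ~ n_tasks / cpu_time when all tasks in a
# condition take the same CPU time and finish together.
def relative_throughput(cpu_time_s, n_tasks, ref_time_s=22500.0, ref_tasks=8):
    """Condition throughput divided by the HT_8 reference throughput."""
    return (n_tasks / cpu_time_s) / (ref_tasks / ref_time_s)

# Example: the nHT single-task case at 13,483 s delivers about 21% of the
# HT_8 aggregate throughput.
print(round(relative_throughput(13483.0, 1), 3))   # 0.209
```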


Why bother? On the negative side, since these results are rather strongly influenced by the particular CPU design, by the application being run, by some other system design and configuration details, and by the other code executing on the system, they won't generalize very far.

But I think some of the results may surprise people--for example those who may expect a single Einstein task running HT to run about as fast as the single nHT case, or people who expect a full doubling of aggregate throughput with HT application.

My target is not mostly the regular posters here, for whom few of my results will be surprising, but mostly others who drop by who may be less well informed.

I don't think thread starters own threads, but I'd be perfectly pleased if others with useful observation data saw fit to add to this thread.

While my timing in starting this thread and project was influenced by this other thread I don't regard this as an answer to or continuation of that one.
____________

archae86
Send message
Joined: 6 Dec 05
Posts: 1485
Credit: 273,618,622
RAC: 497,069
Message 108590 - Posted: 15 Dec 2010, 23:50:26 UTC - in response to Message 108579.
Last modified: 15 Dec 2010, 23:53:15 UTC

I'm planning to post results by doing a screen capture of a bit of Excel spreadsheet, posting to my Photobucket account, and linking the image here.

Translation--the image below should change over time as I make new observations or correct old ones. The divide-by-zero errors will mostly disappear once I've observed the primary reference of hyperthreaded eight parallel Einstein tasks.



In general I'll make any further comments in later posts, but for this one I'll observe that the non-hyperthreaded single task case, at 13,483 seconds, is rather a lot below the recent typical values for this host (at hyperthreaded 8 tasks) of near 22,500 CPU seconds at the same clock rate, RAM parameters, and other operating parameters.
____________

archae86
Send message
Joined: 6 Dec 05
Posts: 1485
Credit: 273,618,622
RAC: 497,069
Message 108610 - Posted: 16 Dec 2010, 21:44:47 UTC - in response to Message 108579.

archae86 wrote:
But I think some of the results may surprise people--for example those who may expect a single Einstein task running HT to run about as fast as the single nHT case
<snip>
My target is not mostly the regular posters here, for whom few of my results will be surprising, but mostly others who drop by who may be less well informed

While I can't speak for other regular posters, a result here surprised me.

The single task execution time running hyperthreaded was so close to that running nHT that I cannot confidently assert the difference was not just WU to WU natural variation. I had expected a large disadvantage for the HT case.

The observed result is of course what one would like and naively expect--with nothing to do on the "other half" of a core running HT, you would of course want no context switching or other overhead to occur. But on my previous main machine with HT, some years ago, I formed the strong impression that single task execution was considerably slower with HT enabled than not. I assumed that was still true for Nehalem, and I keep a dedicated BIOS setting group, which is nHT, aimed at supporting my audio processing, because I thought my (largely single-threaded) audio work would go faster (I don't run BOINC when I'm doing audio).

Possibly I was mistaken then, or possibly the Intel HT implementation in the Nehalem architecture is dramatically superior to that in the Gallatin (Northwood-derived, with a big cache) machine I used to own. Considering how unfortunate some other aspects of the whole Willamette-descended set of designs were, that would not surprise me.

So it is even more crucial than I thought to ensure that performance data are taken with an appropriate simultaneous workload. To the problems of conflict for RAM and cache resources, one must add the fact that an underloaded HT host can actually provide dramatically more computation per charged CPU second than a loaded one.

____________
Richard Haselgrove
Send message
Joined: 10 Dec 05
Posts: 1659
Credit: 54,807,104
RAC: 44,195
Message 108613 - Posted: 16 Dec 2010, 22:14:11 UTC - in response to Message 108610.

Could not the difference be that your Gallatin was a single-core processor, so everything else running on the computer (including the OS itself and OS background tasks) would require a context switch?

But with four physical cores available in your Westmere, and with it being only lightly loaded, a clever operating system could keep one 100% utilisation task running on one core without context switches, and distribute the housekeeping tasks around the other three cores as necessary: it could even be running as effectively a seven-core computer, with 3xHT handling the non-computationally-intensive tasks, and 1xnHT for the busy one?

archae86
Send message
Joined: 6 Dec 05
Posts: 1485
Credit: 273,618,622
RAC: 497,069
Message 108616 - Posted: 16 Dec 2010, 22:39:41 UTC - in response to Message 108613.

Richard Haselgrove wrote:
Could not the difference be that your Gallatin was a single-core processor, so everything else running on the computer (including the OS itself and OS background tasks) would require a context switch.

But the same system running nHT still needed to process interrupts and make context switches for those same things. And a big part of the claim for HT is that it supports a sort of context switch between the two threads sharing a core at any given moment that is immensely faster than a standard context switch. So at least some things should have gone faster, and I don't see why there would be a penalty for all the others unless the scheme somehow forced frequent thread-to-thread switches even though there was no work for the other thread. Something like that is what I've assumed, based not on inside information but on the behavior I thought I'd seen.

a clever operating system could keep one 100% utilisation task running on one core without context switches
I've noticed that Windows 7 on my application load seems far more inclined to leave a process on a core for a while than Windows XP Pro. Not sure Windows 7 qualifies as clever in this respect.

____________
Profile Mike Hewson
Volunteer moderator
Avatar
Send message
Joined: 1 Dec 05
Posts: 4617
Credit: 39,759,240
RAC: 14,881
Message 108619 - Posted: 16 Dec 2010, 23:09:27 UTC - in response to Message 108610.
Last modified: 16 Dec 2010, 23:20:33 UTC

The single task execution time running hyperthreaded was so close to that running nHT that I cannot confidently assert the difference was not just WU to WU natural variation. I had expected a large disadvantage for the HT case.

Based on the work we did a couple of years ago ( Ready Reckoner et al ), if still valid, one could easily get variation of the order of a third of the run time ( average variation; measured extremes ranged from as low as ~15% to as high as ~45% ) due to stepping through phase space at a given frequency ( sinusoids etc ). That's clearly of the order of the +/- HT effect we expect anyway .....

[ specifically, if true, this implies nHT processing getting lucky with a 'short' WU with correspondingly disadvantaged HT processing working on a 'long' WU, to explain your finding ]

Cheers, Mike.
____________
"I have made this letter longer than usual, because I lack the time to make it short." - Blaise Pascal
tear
Send message
Joined: 12 Sep 10
Posts: 9
Credit: 9,914,974
RAC: 0
Message 108621 - Posted: 16 Dec 2010, 23:22:46 UTC - in response to Message 108619.

Running (up to) N/2 tasks* on a machine with N HT CPUs yields pretty much the same performance as running the same number of tasks with HT disabled**. What's surprising about it?

*) task, as in "non-MPI CPU/memory intensive task"

**) on condition that no two tasks share sibling HT CPUs
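tear's condition amounts to a placement rule: never put two tasks on logical CPUs that map to the same physical core. A minimal sketch of that rule, where the (0,1)/(2,3)/... pairing is an assumption (the real pairing comes from the OS topology information) and the function name is my own:

```python
# Sketch of "no two tasks share sibling HT CPUs": place at most one task
# per physical core by taking one logical CPU from each sibling pair.
def one_per_core(n_tasks, sibling_pairs):
    """Return one logical CPU id from each pair until n_tasks are placed."""
    if n_tasks > len(sibling_pairs):
        raise ValueError("more tasks than physical cores")
    return [pair[0] for pair in sibling_pairs[:n_tasks]]

pairs = [(0, 1), (2, 3), (4, 5), (6, 7)]   # assumed E5620 sibling layout
print(one_per_core(4, pairs))              # [0, 2, 4, 6]
```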

Profile Mike Hewson
Volunteer moderator
Avatar
Send message
Joined: 1 Dec 05
Posts: 4617
Credit: 39,759,240
RAC: 14,881
Message 108622 - Posted: 16 Dec 2010, 23:31:37 UTC - in response to Message 108621.

What's surprising about it?

For many, not a lot. For others, a credulity issue .... :-)

Cheers, Mike.

____________
"I have made this letter longer than usual, because I lack the time to make it short." - Blaise Pascal
archae86
Send message
Joined: 6 Dec 05
Posts: 1485
Credit: 273,618,622
RAC: 497,069
Message 108623 - Posted: 17 Dec 2010, 0:03:30 UTC - in response to Message 108619.

Mike Hewson wrote:
if still valid, then one could easily get variation of the order of a third of the run time due to stepping through phase space at a given frequency ( sinusoids etc )
My impression is that the proportionate execution time variability of the current work load is much less than it was in that era.

Here is a histogram of 144 recent results of 3.04 app work done on that same host running HT 8 tasks at 3.42 GHz, but suffering excess variation from handling my daily personal computing workload.



I agree that the "difference" I saw between nHT 1 task and HT 1 task is likely within random variation, but I don't think the current random variation is anything at all close to the third of the run time you mention.

____________
Profile Mike Hewson
Volunteer moderator
Avatar
Send message
Joined: 1 Dec 05
Posts: 4617
Credit: 39,759,240
RAC: 14,881
Message 108624 - Posted: 17 Dec 2010, 0:28:00 UTC - in response to Message 108623.
Last modified: 17 Dec 2010, 10:21:15 UTC

Mike Hewson wrote:
if still valid, then one could easily get variation of the order of a third of the run time due to stepping through phase space at a given frequency ( sinusoids etc )
My impression is that the proportionate execution time variability of the current work load is much less than it was in that era.

Here is a histogram of recent 144 results of 3.04 ap work done on that same host running HT 8 tasks, 3.42 GHz, but suffering excess variation from handling my daily personal computing workload.



I agree that the "difference" I saw between nHT 1 task and HT 1 task is likely within random variation, but I don't think the current random variation is anything at all close to the third of the run time you mention.

Fair enough. I thought that might well be so, as the variation 'back then' related to sinusoidal function 'look-ups' and like issues, which have since undergone optimisation ( or become less relevant ). The phase space is right ascension and declination, effectively treated as orthogonal co-ordinates, but to un-Doppler a signal you still need to resolve components to the detector/Earth frame, i.e. trigonometry. At least that's how I remember it. :-)

Cheers, Mike.

( edit ) Nice curve too. To a first attempt you'd model that as normally distributed :

normalising_const * exp[-((ordinate - mean_measure)/spread_measure)^2]

If so then you have an underlying random variable with no especial 'preference' related to the task at hand*. Asynchronous ( with respect to WU processing ) interruptions would explain that nicely .....

( edit ) Mean is ~ 22460, standard deviation is ~ 148. Average absolute residual ( of actual WU's per run-time bracket ) from Gaussian prediction is ~ 3.1 or around 10% of the peak. Close enough ... certainly believable for that sample size.

( edit ) This means that I am saying that the WU 'interruptions' account for around +/- 2% of their run-times ( 3 standard deviations/average ). So this is way less than the 'sequence number' effect studied in days of old. But you could have guessed that. I couldn't remember any more Excel-Fu ..... :-)
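The +/- 2% arithmetic above follows directly from the fitted numbers. A quick sketch, using the conventional normal density shape (which carries a factor of 2 in the exponent that the shorthand formula earlier omits; the fitted sigma of ~148 implies that convention):

```python
import math

# Mike's fitted figures for the 144-task histogram (from the post above).
mean, sigma = 22460.0, 148.0

# Conventional Gaussian shape, peak normalised to 1 at the mean.
def gauss_shape(t):
    return math.exp(-((t - mean) / sigma) ** 2 / 2.0)

# The "+/- 2% of run-time" claim: three standard deviations over the mean.
print(round(100 * 3 * sigma / mean, 2))   # ~1.98 percent
```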

( edit ) Actually, yet another reason for intrinsic WU variation not affecting your group of 144 is that at ~ 1500 Hz each frequency is going to have well over 1000 WU's to plow through. [ We found earlier that the number of work-units per sequence-number-cycle at a specific frequency went quadratically with frequency, as you have to plod through the phase space more finely. ] With E@H's use of locality scheduling there is a mighty tendency for a given host ( especially a fast one ) to be given near-consecutive sequence numbers, thus your example of 144 WU's may not sample much of any ( if existing ) sinusoidal variation in run-times from that cause. In fact, looking at your Xeon's first 20 tasks in the current 'In progress' list ( a page full ), that's exactly what's happening. So I'll shut up now, having demonstrated that this is definitely not relevant to the HT exploration here. :-) :-)

( edit ) * Sorry, quiet night-shift! I didn't explain that if it was predominantly sequence-number related you'd get a low-skewed ( to the left ) concave-up curve, and not a convex-down symmetric bell shape, as most WU's would cluster at the sinusoid trough ( shorter run-times ). If there is any skew asymmetry in your data it is definitely to the right.
____________
"I have made this letter longer than usual, because I lack the time to make it short." - Blaise Pascal
archae86
Send message
Joined: 6 Dec 05
Posts: 1485
Credit: 273,618,622
RAC: 497,069
Message 108653 - Posted: 17 Dec 2010, 22:51:04 UTC
Last modified: 17 Dec 2010, 23:03:04 UTC

I've been surprised yet again. The most recent addition to the result table in the second post of this thread covers the case of 4 tasks running with hyperthreading turned on.

Recalling that the single task HT case surprised me by having a tiny (and possibly noise rather than truth) deficit to the nHT single task case, one might have guessed that this case would very closely approximate the 4 task nHT result.

It does not. While the HT_4 case gives considerably lower CPU times than does the HT_8 case, the shortfall to nHT_4 is quite large--by far more than RAM or cache conflict might be expected to produce. Importantly, it is large compared to the plausible random noise stemming from task-to-task execution difficulty and system loading variation.

So a BOINC user seeking to "keep threads free" from BOINC by setting a maximum number of CPUs below the (virtual) number available, but keeping HT turned on in the hope that in quiet times, with nothing else going on, the system will perform about as well on BOINC as the same system running the same number of tasks nHT, seems on the Nehalem architecture to lose very little for one task, but quite a lot for 4. With my poor prediction record so far, I probably should not guess how this would go for six, but my guess is that the loss from perfection will continue to grow on the larger core count Nehalems unless they get a specific logic upgrade aimed at this problem.

One other point, and a new picture of data embedded here: for the HT_4 case, I was able to get two other results to run in nearly identical conditions to the intended test subject. They enjoyed the same reduction of normal load from background tasks and foreground usage, the same environment of HT with 3 companion 3.06 tasks, and in fact ran in parallel with the primary measured task for all save about three minutes of offset. I deliberately chose tasks of differing frequency and sequence, hoping to raise the chance of catching systematic execution variation. What I actually got was very, very close matching.



Combining the sort of Big Picture variation from the histogram I posted a few posts back with the bottom-up, better controlled (but much smaller data set) evidence here, I think the case is pretty well made that task-to-task systematic CPU time variation may usually be quite low for 3.06 HF work in the near neighborhood of frequency 1373. If Bikeman or anyone else can add some understanding or data on current Einstein result execution time systematic variation I'd be pleased.

(edit: after I wrote this paragraph I noticed that Mike Hewson had added considerable updates to his original comments on my histogram. An appropriate modification to my claim here is to say that I think that the current overall systematic variation may be far less than the old days, but that in any case the restricted set of results actually being compared here, being all from frequency 1373.90, and spanning a sequence number range only from 1000 to 1022 probably contributed very little measurement noise stemming from systematic execution work variation to the reported comparisons)

But for this little study I think the available evidence supports the following for a 4-core current generation Nehalem type CPU running near my system's operating point and running Einstein Global Correlations S5 HF search #1 v3.06 (S5GCESSE2):

1. With the system allowed to run all the parallel tasks it can, enabling HT raises total throughput appreciably, with the nHT system giving only a little over 75% as much BOINC throughput.

2. For the extreme case of restricting BOINC to a single task, it appears that for a system otherwise very, very lightly loaded there is little disadvantage to running HT--most likely between 1 and 2% loss of BOINC throughput.

3. Probably the loss incurred by running a restricted number of tasks HT instead of nHT grows with number of tasks. For the 4 task case this loss is substantial with the HT system giving only about 87% as much BOINC throughput as the nHT case.

While power consumption increases with number of tasks running, the overall "greenness" for the fixed clock rate fixed voltage case considered here consistently improves with more tasks run and with higher BOINC throughput. If one limits tasks to no more than the number of physical cores present, then both throughput and power efficiency will be best running with hyperthreading disabled.
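The throughput and "greenness" comparisons above reduce to two small formulas. In this sketch the CPU times echo figures from this thread (HT_8 ~22,500 s; nHT_1 13,483 s), but the wattage used in the example is a placeholder, not archae86's measured system input power:

```python
# Sketch of the throughput-vs-power-efficiency ("greenness") comparison.
def tasks_per_day(n_tasks, cpu_time_s):
    """Aggregate tasks completed per day, assuming all tasks finish together."""
    return n_tasks * 86400.0 / cpu_time_s

def tasks_per_kwh(n_tasks, cpu_time_s, watts):
    """Tasks completed per kWh of system input energy."""
    kwh_per_batch = watts * cpu_time_s / 3.6e6   # J -> kWh for one batch
    return n_tasks / kwh_per_batch

# HT_8 at ~22,500 s per task; 200 W is an illustrative placeholder.
print(round(tasks_per_day(8, 22500.0), 2))          # 30.72
print(round(tasks_per_kwh(8, 22500.0, 200.0), 2))   # 6.4
```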

While I'd expect the general trends to hold across a fair range of clock rates and for both the two channel and three-channel RAM variants, it is important to note that while my host is equipped with three channels, the BIOS reported only two to be recognized and active during these tests. I expect that any three-channel variant fully populated and working properly will suffer less degradation in execution times with increasing number of tasks than seen here.
____________

tear
Send message
Joined: 12 Sep 10
Posts: 9
Credit: 9,914,974
RAC: 0
Message 108654 - Posted: 17 Dec 2010, 23:11:06 UTC - in response to Message 108653.

(...) the shortfall (of HT_4 --ed.) to nHT_4 is quite large.


I'm blaming the OS here. I once did a similar experiment _but_ set CPU affinities so no two sibling HT cores would be used (Linux). Got on-par results. Just FYI.
archae86
Send message
Joined: 6 Dec 05
Posts: 1485
Credit: 273,618,622
RAC: 497,069
Message 108655 - Posted: 17 Dec 2010, 23:46:30 UTC - in response to Message 108654.

tear wrote:
I'm blaming the OS here. I once did similar experiment _but_ set CPU affinities so no two sibling HT cores would be used (Linux). Got on-par results. Just FYI.

Now that is an interesting thought. Allow me to express in highly verbose form my understanding of what you have said so tersely.

In hyperthreading the OS is presented with a set of apparently equivalent CPUs. But in the current form, pairs of them use the same physical hardware. So, at least in the Nehalem generation, there would be a great advantage to assigning the next execution of a thread to a virtual CPU which was not only idle itself, but whose "sibling" as you call it--the other virtual CPU in fact using the same core--was also currently idle, rather than to a virtual CPU whose sibling was already using a full core's resources.

I'm not at all sure the hardware communicates to the software anything about which virtual CPUs share hardware in what ways. That may have seemed a needless complication or a departure from the purity of apparent equivalence to those making the original design decisions.

That is an interesting and plausible suggestion. I believe I could repeat my HT_4 experiment using Process Explorer to force 4 tasks to distinct cores. I'm interested enough to consider trying the experiment soon. However at the practical level of suggesting system configuration for users, that result seems unlikely to help much. Possibly a third-party add-on could repeatedly set affinities for new BOINC tasks to even or odd numbered virtual CPUs on HT systems restricted to half the maximum number of CPUs or less, but those systems would still execute less BOINC work at poorer BOINC power cost efficiency than unrestricted systems. So there seems unlikely to be a big market for the feature.
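For what it's worth, the affinity that Process Explorer sets is just a bitmask (bit n set = virtual CPU n allowed, the representation Windows' SetProcessAffinityMask takes). A hypothetical add-on of the kind described would compute even/odd masks like this; the (0,1)(2,3)(4,5)(6,7) sibling pairing assumed here is the presumed one discussed later in the thread:

```python
# Build a CPU affinity bitmask: bit n set means virtual CPU n is allowed.
def affinity_mask(cpus):
    mask = 0
    for c in cpus:
        mask |= 1 << c
    return mask

# One task per assumed sibling pair on an 8-virtual-CPU host:
print(hex(affinity_mask([0, 2, 4, 6])))   # 0x55  (even virtual CPUs)
print(hex(affinity_mask([1, 3, 5, 7])))   # 0xaa  (odd virtual CPUs)
```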

____________
ML1
Send message
Joined: 20 Feb 05
Posts: 321
Credit: 24,444,932
RAC: 1,593
Message 108657 - Posted: 18 Dec 2010, 0:52:40 UTC - in response to Message 108654.

I'm blaming the OS here. I once did similar experiment _but_ set CPU affinities so no two sibling HT cores would be used (Linux). Got on-par results. Just FYI.

The Linux scheduler is HT-aware and so optimally balances out the loading and tries to avoid core thrashing and subsequent cache thrashing.

No 'forcing CPU affinity' required. It's already included!

You should get optimal throughput by utilising fully loaded HT (for an Intel HT CPU).
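The topology the scheduler uses is also visible to user space: on Linux, /sys/devices/system/cpu/cpuN/topology/thread_siblings_list holds a short list like "0,4" or "0-1" naming each CPU's HT siblings. A sketch of parsing that format (the file path is real sysfs; the parser is my own):

```python
# Parse the Linux sysfs CPU-list format used by thread_siblings_list,
# e.g. "0-1" or "0,4". On a real host, read the string from
# /sys/devices/system/cpu/cpuN/topology/thread_siblings_list.
def parse_cpu_list(text):
    cpus = []
    for part in text.strip().split(","):
        if "-" in part:
            lo, hi = part.split("-")
            cpus.extend(range(int(lo), int(hi) + 1))
        else:
            cpus.append(int(part))
    return cpus

print(parse_cpu_list("0-1"))   # [0, 1]
print(parse_cpu_list("0,4"))   # [0, 4]
```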


Happy fast crunchin',
Martin

____________
Powered by Mandriva Linux A user friendly OS!
See the Boinc HELP Wiki
tear
Send message
Joined: 12 Sep 10
Posts: 9
Credit: 9,914,974
RAC: 0
Message 108663 - Posted: 18 Dec 2010, 4:47:31 UTC - in response to Message 108657.

I'm blaming the OS here. I once did similar experiment _but_ set CPU affinities so no two sibling HT cores would be used (Linux). Got on-par results. Just FYI.

The Linux scheduler is HT-aware and so optimally balances out the loading and tries to avoid core thrashing and subsequent cache trashing.

No 'forcing CPU affinity' required. It's already included!

I've seen plenty of bad scheduling with HT support in the scheduler (CONFIG_SCHED_SMT). Though I admit, the theory does appear nice.

You should get optimal throughput by utilising fully loaded HT (for an Intel HT CPU).

No disagreement here :)

Profile Mike Hewson
Volunteer moderator
Avatar
Send message
Joined: 1 Dec 05
Posts: 4617
Credit: 39,759,240
RAC: 14,881
Message 108664 - Posted: 18 Dec 2010, 5:01:58 UTC - in response to Message 108653.
Last modified: 18 Dec 2010, 5:09:07 UTC

(edit: after I wrote this paragraph I noticed that Mike Hewson had added considerable updates to his original comments on my histogram. An appropriate modification to my claim here is to say that I think that the current overall systematic variation may be far less than the old days, but that in any case the restricted set of results actually being compared here, being all from frequency 1373.90, and spanning a sequence number range only from 1000 to 1022 probably contributed very little measurement noise stemming from systematic execution work variation to the reported comparisons)

Exactly right. The close frequencies and sequence numbers mean the skygrid right ascension and declination are real close. My guess ( admittedly based on old parameter estimates ) is that the sequence numbers have a cycle of about 400 work units before returning to similar runtimes, for around that frequency value.

[ For those not familiar with this aspect of the discussion : at each assumed signal frequency the entire sky is examined in small areas ( one per work unit ), with more, and thus individually smaller, areas required for higher frequencies. Because the Earth is rotating around its own axis, and is also orbiting the Sun, a signal channel from each interferometer needs to be 'de-Dopplered' accordingly for each and every choice of distant sky grid element ( a tiny area on a construct called the 'celestial sphere' ). Ultimately a signal is effectively expressed as what it would be like if it were heard at a place called the solar system 'barycenter'. There is another line of adjustment according to estimates of putative source movements too.

The part of the algorithm that steps through the skygrid has to acknowledge some trigonometry to resolve a signal's components to the directions along which a particular interferometer's arms happen to lie at a given instant. In addition, not all skygrid areas are equal, which is a consequence of spherical geometry not being 'flat'. In any case the work unit's runtime used to be very dependent on skygrid position, with a marked sinusoidal variation above an amount that was constant regardless of sky position. The algorithm starts stepping from, I think, the equator ( but it could have been a pole, as I can't remember which ) and wraps around the sphere with a 'stagger' reminiscent of winding yarn around a ball. The number of steps to return for another wrap-around is this cycle length of approximately 400 that I'm referring to. At lower frequencies than we are doing now, around 3 such cycles were required to cover the entire sky grid.

There was also another effect 'rippling' the sinusoidal runtime vs. sequence number curve, probably ( well, that was my view ) due to conversion of co-ordinates from an Earth-based equatorial view to the Earth's orbital plane or ecliptic. The Earth's axis is tilted with respect to the ecliptic, which is why we have seasons etc. In any case, method changes have made all this rather less relevant now ..... but it used to be a huge issue in comparing runtimes and relative (in)efficiencies ]
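The old runtime-vs-sequence-number pattern described above can be put in toy-model form: a constant base cost plus a sinusoidal term cycling every ~400 work units. All the numbers below are illustrative, not fitted to any real data:

```python
import math

# Toy model of the old sinusoidal runtime variation with sequence number.
# base, amp, and cycle are illustrative placeholders only.
def model_runtime(seq, base=15000.0, amp=2500.0, cycle=400):
    """Runtime (s) at sequence number seq under the toy sinusoidal model."""
    return base + amp * math.sin(2 * math.pi * seq / cycle)

print(model_runtime(0))     # 15000.0 (at a zero crossing of the sinusoid)
print(model_runtime(100))   # 17500.0 (a quarter cycle later, at the crest)
```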

Cheers, Mike.
____________
"I have made this letter longer than usual, because I lack the time to make it short." - Blaise Pascal
tear
Send message
Joined: 12 Sep 10
Posts: 9
Credit: 9,914,974
RAC: 0
Message 108665 - Posted: 18 Dec 2010, 5:08:53 UTC - in response to Message 108655.
Last modified: 18 Dec 2010, 5:12:22 UTC

tear wrote:
I'm blaming the OS here. I once did similar experiment _but_ set CPU affinities so no two sibling HT cores would be used (Linux). Got on-par results. Just FYI.

Now that is an interesting thought. Allow me to express in highly verbose form my understanding of what you have said so tersely.

<snipped interpretation>
(nb, yes, that's my message)

I'm not at all sure the hardware communicates to the software anything about which virtual CPUs share hardware in what ways.

If I were to nitpick I'd say "Hardware enables software to retrieve physical layout"... ;) sorry.

That is an interesting and plausible suggestion. I believe I could repeat my HT_4 experiment using Process Explorer to force 4 tasks to distinct cores. I'm interested enough to consider trying the experiment soon.

As long as you're able to identify (or determine) HT CPU "pairs". I wouldn't know how to do that in Windows.

However at the practical level of suggesting system configuration for users, that result seems unlikely to help much. Possibly a third-party add-on could repeatedly set affinities for new BOINC tasks to even or odd numbered virtual CPUs on HT systems restricted to half the maximum number of CPUs or less, but those systems would still execute less BOINC work at poorer BOINC power cost efficiency than unrestricted systems. So there seems not likely to be a big market for the feature.

Yes... use cases, use cases, use cases, use cases (to paraphrase Steve
Ballmer). I can't see one (use case, not Steve -- ed.) either.
ML1
Send message
Joined: 20 Feb 05
Posts: 321
Credit: 24,444,932
RAC: 1,593
Message 108676 - Posted: 18 Dec 2010, 11:40:50 UTC - in response to Message 108665.

... As long as you're able to identify (or determine) HT CPU "pairs". I wouldn't know how to do that in Windows.

Is the Windows scheduler HT-aware yet?...

Aside: Also note that for some systems, the Intel CPUs can become memory bandwidth limited for some tasks. In those cases, you can get better performance by NOT using all the cores, or by using a mix of BOINC tasks, so as not to hit the limits for CPU cache and memory accesses.

That was especially true for the later multi-cores using the old Intel FSB. Has that now been eased with the more recent CPUs that no longer use a 'northbridge' for RAM access?


Happy fast crunchin',
Martin


____________
Powered by Mandriva Linux A user friendly OS!
See the Boinc HELP Wiki
ExtraTerrestrial Apes
Avatar
Send message
Joined: 10 Nov 04
Posts: 678
Credit: 38,778,944
RAC: 7,234
Message 108677 - Posted: 18 Dec 2010, 12:12:29 UTC

I can vaguely remember that MS put quite some effort into making Server 2008R2 more power efficient (I think there was a review on Anandtech about this). They achieved quite an improvement over the previous versions. And as far as I remember, the optimizations include NUMA-awareness and HT-awareness in the scheduler. It may not be perfect (which software is?), but if it weren't there I'd expect the HT_4 result to be even worse, maybe right in the middle between nHT_4 and HT_8 (without a proper calculation of probabilities).

MrS
____________
Scanning for our furry friends since Jan 2002

archae86
Send message
Joined: 6 Dec 05
Posts: 1485
Credit: 273,618,622
RAC: 497,069
Message 108684 - Posted: 18 Dec 2010, 12:59:09 UTC - in response to Message 108665.

That is an interesting and plausible suggestion. I believe I could repeat my HT_4 experiment using Process Explorer to force 4 tasks to distinct cores. I'm interested enough to consider trying the experiment soon.

As long as you're able to identify (or determine) HT CPU "pairs". I wouldn't
know how to do that in Windows.
In the "set affinity" interface for Process Explorer it designates CPU 0 through CPU 7 on my E5620, and 0 through 3 on my Q6600.

From some previous work I formed an impression that (0,1), (2,3), (4,5), and (6,7) were core-sharing pairs on my E5620, though I'm not highly confident. At least part, and perhaps all, of that impression came from reported core-to-core temperature changes in response to task shifts. An additional difficulty is that at least some temperature-reporting apps don't use CPU identification compatible with that used in this affinity interface.

After I saw tear's note yesterday, I made a sloppy trial run in which I used suspensions to limit execution to four Einstein 3.06 HF tasks, and used this affinity mechanism to restrict each to a distinct one of the four presumed pairs. It was sloppy in that I failed to monitor things closely enough to avoid some minutes in which fewer than four tasks were running, but my initial impression is fairly strongly that a large improvement over the non-affinity-modified case was demonstrated. Long ago I did affinity experiments on a Q6600 with a full SETI/Einstein task load, demonstrating no improvement; that, of course, was quite a different issue from this one. It is not the un-needed switching of tasks from CPU to CPU that is the primary harm here, but the un-needed sharing of a physical core when an idle core is available.
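The harm described above, two tasks sharing a physical core while another core sits idle, can be put in toy-model form. The ~0.65 per-thread factor for a shared core is an assumption loosely consistent with the ~1.3x HT_8-over-nHT_4 throughput gain reported earlier in this thread, and the (2k, 2k+1) sibling pairing is the presumed one:

```python
from collections import Counter

# Toy model: a task alone on a physical core runs at speed 1.0; two tasks
# sharing a core each run at an assumed ~0.65, so a shared core delivers
# ~1.3 cores' worth of work. Virtual CPUs (2k, 2k+1) are presumed siblings.
def aggregate_speed(placements, shared_factor=0.65):
    """Total work rate for tasks placed on the given virtual CPU ids."""
    per_core = Counter(cpu // 2 for cpu in placements)
    return sum(n * (1.0 if n == 1 else shared_factor)
               for n in per_core.values())

print(aggregate_speed([0, 2, 4, 6]))   # 4.0 - one task per physical core
print(aggregate_speed([0, 1, 2, 3]))   # 2.6 - two cores shared, two idle
```

This is of course a worst-case packing; a scheduler that only sometimes co-locates siblings would land between the two figures, which is roughly where the measured HT_4 result sits.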
____________

This material is based upon work supported by the National Science Foundation (NSF) under Grants PHY-1104902, PHY-1104617 and PHY-1105572 and by the Max Planck Gesellschaft (MPG). Any opinions, findings, and conclusions or recommendations expressed in this material are those of the investigators and do not necessarily reflect the views of the NSF or the MPG.

Copyright © 2016 Bruce Allen