Hyperthreading and Task number Impact Observations



Message boards : Cruncher's Corner : Hyperthreading and Task number Impact Observations

archae86
Send message
Joined: Dec 6 05
Posts: 1065
Credit: 112,197,382
RAC: 98,840
Message 108579 - Posted 15 Dec 2010 16:39:07 UTC

    Over the next couple of days I plan to generate some observations on the execution time and aggregate throughput impact of switching my Westmere E5620 between hyperthreaded and not, and of varying the number of simultaneously executing Einstein HF 3.06 tasks.

    Westmere is mostly a 32nm Xeon flavor of the classic Nehalem 4-core design, but with a 12 MB L3 cache. For these tests I'll leave it running as it has been lately, with a moderate overclock of 3.42 GHz and 4 Gbyte of RAM running at default settings (plus one more Gbyte plugged in that is not currently recognized by the BIOS--oops).

    My current intentions--subject to revision if I have a better thought or get better advice here:

    1. The single "measurement task" for each condition will be started only after the remaining tasks for that condition are up and running; all measurement tasks will be chosen from the same HF frequency, 1373.90.

    2. I'll turn off most overhead tasks of the more frivolous sort, and not use the system for personal work during the timed runs, but I will leave my Kaspersky AV running (it does hurt a bit during its nasty slow startup phase, but I consider it essential).

    The test conditions I definitely plan to log are:
    HT 8
    HT 1
    nHT 4
    nHT 1

    I might fill in some of the intermediate task count points, say perhaps HT 4 and HT 6, but probably not all of them.

    For each condition I think I'll show CPU time for the task, CPU time for the task relative to the HT_8 case, and implied system throughput relative to the HT_8 case. Also system input power.


    Why bother? On the negative side, since these results are rather strongly influenced by the particular CPU design, by the application being run, by some other system design and configuration details, and by the other code executing on the system, they won't generalize very far.

    But I think some of the results may surprise people--for example those who may expect a single Einstein task running HT to run about as fast as the single nHT case, or people who expect a full doubling of aggregate throughput with HT application.

    My target is not mostly the regular posters here, for whom few of my results will be surprising, but mostly others who drop by who may be less well informed.

    I don't think thread starters own threads, but I'd be perfectly pleased if others with useful observation data saw fit to add to this thread.

    While my timing in starting this thread and project was influenced by this other thread, I don't regard this as an answer to, or continuation of, that one.
    ____________

    archae86
    Send message
    Joined: Dec 6 05
    Posts: 1065
    Credit: 112,197,382
    RAC: 98,840
    Message 108590 - Posted 15 Dec 2010 23:50:26 UTC - in response to Message 108579.

      Last modified: 15 Dec 2010 23:53:15 UTC

      I'm planning to post results by doing a screen capture of a bit of Excel spreadsheet, posting to my Photobucket account, and linking the image here.

      Translation--the image below should change over time as I measure new observations or correct old ones. The divide by zero errors will mostly disappear once I've observed the primary reference of hyperthreaded eight parallel Einstein tasks.



      In general I'll make any further comments in later posts, but for this one I'll observe that the non-hyperthreaded single-task case, at 13483 seconds, is rather a lot below the recent typical value of about 22,500 CPU seconds (at hyperthreaded 8 tasks) for this host at the same clock rate, RAM parameters, and other operating parameters.
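Per-task time alone does not tell the throughput story, since the HT_8 case keeps eight tasks in flight at once. A minimal arithmetic sketch, using only the two figures quoted above (13483 s and roughly 22,500 s), shows how much total work the loaded HT box does despite its slower per-task times:

```python
# Illustrative throughput arithmetic using the two figures quoted above.
nht1_time = 13483.0   # CPU seconds per task, non-HT, single task
ht8_time = 22500.0    # typical CPU seconds per task, HT, 8 parallel tasks

nht1_throughput = 1 / nht1_time   # tasks completed per unit time
ht8_throughput = 8 / ht8_time     # 8 tasks in flight simultaneously

ratio = nht1_throughput / ht8_throughput
print(f"nHT_1 delivers {ratio:.0%} of HT_8 system throughput")  # → about 21%
```

So the single nHT task finishes in about 60% of the per-task time, but the fully loaded HT system still does nearly five times as much aggregate work.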
      ____________

      archae86
      Send message
      Joined: Dec 6 05
      Posts: 1065
      Credit: 112,197,382
      RAC: 98,840
      Message 108610 - Posted 16 Dec 2010 21:44:47 UTC - in response to Message 108579.

        archae86 wrote:
        But I think some of the results may surprise people--for example those who may expect a single Einstein task running HT to run about as fast as the single nHT case
        <snip>
        My target is not mostly the regular posters here, for whom few of my results will be surprising, but mostly others who drop by who may be less well informed

        While I can't speak for other regular posters, a result here surprised me.

        The single task execution time running hyperthreaded was so close to that running nHT that I cannot confidently assert the difference was not just WU to WU natural variation. I had expected a large disadvantage for the HT case.

        The observed result is of course what one would like and naively expect--with nothing to do on the "other half" of a core running HT, you would of course want no context switching or other overhead to occur. But on my previous main machine with HT some years ago, I formed the strong impression that single-task execution was considerably slower with HT enabled than not. I assumed that was still true for Nehalem, and I have a dedicated BIOS settings group for my audio processing that runs nHT because I thought my (largely single-threaded) audio processing would go faster (I don't run BOINC when I'm doing audio).

        Possibly I was mistaken then, or possibly the Intel HT implementation in the Nehalem architecture is dramatically superior to that in the Gallatin (Northwood-derived, with big cache) machine I used to own. Considering how unfortunate some other aspects of the whole Willamette-descended set of designs were, that would not surprise me.

        So it is even more crucial than I thought to ensure that performance data are taken with an appropriate simultaneous workload. To the problems of conflict for RAM and cache resources, one must add the fact that an underloaded HT host can actually provide dramatically more computation per charged CPU second than a loaded one.

        ____________

        Richard Haselgrove
        Send message
        Joined: Dec 10 05
        Posts: 1354
        Credit: 30,397,872
        RAC: 20,070
        Message 108613 - Posted 16 Dec 2010 22:14:11 UTC - in response to Message 108610.

          Could not the difference be that your Gallatin was a single-core processor, so that everything else running on the computer (including the OS itself and OS background tasks) would require a context switch?

          But with four physical cores available in your Westmere, and with it being only lightly loaded, a clever operating system could keep one 100% utilisation task running on one core without context switches, and distribute the housekeeping tasks around the other three cores as necessary: it could even be running as effectively a seven-core computer, with 3xHT handling the non-computationally-intensive tasks, and 1xnHT for the busy one?

          archae86
          Send message
          Joined: Dec 6 05
          Posts: 1065
          Credit: 112,197,382
          RAC: 98,840
          Message 108616 - Posted 16 Dec 2010 22:39:41 UTC - in response to Message 108613.

            Richard Haselgrove wrote:
            Could not the difference be that your Gallatin was a single-core processor, so that everything else running on the computer (including the OS itself and OS background tasks) would require a context switch?

            But the same system running nHT still needed to process interrupts and make context switches for those same things. And a big part of the claim for HT is that it supports a sort of context switch between the two threads sharing a core at any given moment that is immensely faster than a standard context switch. So at least some things should have gone faster, and I don't see why there would be a penalty for all the others unless the scheme somehow forced frequent thread-to-thread switches even when there was no work for the other thread. Something like that is what I've assumed, based not on inside information, but on the behavior I thought I'd seen.

            a clever operating system could keep one 100% utilisation task running on one core without context switches
            I've noticed that Windows 7 on my application load seems far more inclined to leave a process on a core for a while than Windows XP Pro. Not sure Windows 7 qualifies as clever in this respect.

            ____________

            Profile Mike Hewson
            Forum moderator
            Avatar
            Send message
            Joined: Dec 1 05
            Posts: 3592
            Credit: 28,563,336
            RAC: 12,021
            Message 108619 - Posted 16 Dec 2010 23:09:27 UTC - in response to Message 108610.

              Last modified: 16 Dec 2010 23:20:33 UTC

              The single task execution time running hyperthreaded was so close to that running nHT that I cannot confidently assert the difference was not just WU to WU natural variation. I had expected a large disadvantage for the HT case.

              Based on the work we did a couple of years ago ( Ready Reckoner et al ), if still valid, one could easily get variation on the order of a third of the run time ( average variation; measured extremes ranged from as low as ~15% to as high as ~45% ) due to stepping through phase space at a given frequency ( sinusoids etc ). That's clearly of the order of the +/- HT effect we expect anyway .....

              [ specifically, if true, this implies nHT processing getting lucky with a 'short' WU with correspondingly disadvantaged HT processing working on a 'long' WU, to explain your finding ]

              Cheers, Mike.
              ____________
              "I have made this letter longer than usual, because I lack the time to make it short." - Blaise Pascal

              tear
              Send message
              Joined: Sep 12 10
              Posts: 9
              Credit: 9,914,974
              RAC: 0
              Message 108621 - Posted 16 Dec 2010 23:22:46 UTC - in response to Message 108619.

                Running (up to) N/2 tasks* on a machine with N HT CPUs yields pretty much the same performance as running the same number of tasks with HT disabled**. What's surprising about it?

                *) task, as in "non-MPI CPU/memory intensive task"

                **) on condition that no two tasks share sibling HT CPUs
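tear's second footnote depends on knowing which logical CPUs are siblings. On Linux (tear's platform), the kernel exposes this per logical CPU in sysfs; the sketch below is hedged in that the file path and the example CPU numbering are assumptions for illustration, not anything measured in this thread:

```python
# Sketch, assuming Linux: each logical CPU N reports its HT sibling set in
# /sys/devices/system/cpu/cpuN/topology/thread_siblings_list, e.g. "0,4"
# meaning logical CPUs 0 and 4 share one physical core.

def sibling_pairs(siblings_lists):
    """Collapse per-CPU sibling strings like '0,4' (or the range form
    '0-1') into unique, sorted tuples of logical CPU numbers."""
    pairs = set()
    for s in siblings_lists:
        cpus = tuple(sorted(int(x) for x in s.replace('-', ',').split(',')))
        pairs.add(cpus)
    return sorted(pairs)

# Hypothetical 4-core/8-thread layout, as often reported on Nehalem parts:
example = ["0,4", "1,5", "2,6", "3,7", "0,4", "1,5", "2,6", "3,7"]
print(sibling_pairs(example))  # → [(0, 4), (1, 5), (2, 6), (3, 7)]
```

Scheduling no two tasks onto CPUs from the same tuple satisfies the condition in footnote **.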

                Profile Mike Hewson
                Forum moderator
                Avatar
                Send message
                Joined: Dec 1 05
                Posts: 3592
                Credit: 28,563,336
                RAC: 12,021
                Message 108622 - Posted 16 Dec 2010 23:31:37 UTC - in response to Message 108621.

                  What's surprising about it?

                  For many, not a lot. For others, a credulity issue .... :-)

                  Cheers, Mike.

                  ____________
                  "I have made this letter longer than usual, because I lack the time to make it short." - Blaise Pascal

                  archae86
                  Send message
                  Joined: Dec 6 05
                  Posts: 1065
                  Credit: 112,197,382
                  RAC: 98,840
                  Message 108623 - Posted 17 Dec 2010 0:03:30 UTC - in response to Message 108619.

                    Mike Hewson wrote:
                    if still valid, then one could easily get variation of the order of a third of the run time due to stepping through phase space at a given frequency ( sinusoids etc )
                    My impression is that the proportionate execution time variability of the current work load is much less than it was in that era.

                    Here is a histogram of 144 recent results of 3.04 ap work done on that same host running HT 8 tasks at 3.42 GHz, but suffering excess variation from handling my daily personal computing workload.



                    I agree that the "difference" I saw between nHT 1 task and HT 1 task is likely within random variation, but I don't think the current random variation is anything at all close to the third of the run time you mention.

                    ____________

                    Profile Mike Hewson
                    Forum moderator
                    Avatar
                    Send message
                    Joined: Dec 1 05
                    Posts: 3592
                    Credit: 28,563,336
                    RAC: 12,021
                    Message 108624 - Posted 17 Dec 2010 0:28:00 UTC - in response to Message 108623.

                      Last modified: 17 Dec 2010 10:21:15 UTC

                      Mike Hewson wrote:
                      if still valid, then one could easily get variation of the order of a third of the run time due to stepping through phase space at a given frequency ( sinusoids etc )
                      My impression is that the proportionate execution time variability of the current work load is much less than it was in that era.

                      Here is a histogram of recent 144 results of 3.04 ap work done on that same host running HT 8 tasks, 3.42 GHz, but suffering excess variation from handling my daily personal computing workload.



                      I agree that the "difference" I saw between nHT 1 task and HT 1 task is likely within random variation, but I don't think the current random variation is anything at all close to the third of the run time you mention.

                      Fair enough. I thought that might well be so, as the variation 'back then' related to sinusoidal function 'look-ups' and similar issues, which have undergone optimisation ( or become less relevant ) since. The phase space is right ascension and declination, effectively considered as orthogonal co-ordinates, but to un-Doppler a signal you still need to resolve components to the detector/Earth frame, i.e. trigonometry. At least that's how I remember it. :-)

                      Cheers, Mike.

                      ( edit ) Nice curve too. To a first attempt you'd model that as normally distributed :

                      normalising_const * exp[-((ordinate - mean_measure)/spread_measure)^2]

                      If so then you have an underlying random variable with no especial 'preference' related to the task at hand*. Asynchronous ( with respect to WU processing ) interruptions would explain that nicely .....

                      ( edit ) Mean is ~ 22460, standard deviation is ~ 148. Average absolute residual ( of actual WU's per run-time bracket ) from Gaussian prediction is ~ 3.1 or around 10% of the peak. Close enough ... certainly believable for that sample size.

                      ( edit ) This means that I am saying that the WU 'interruptions' account for around +/- 2% of their run-times ( 3 standard deviations/average ). So this is way less than the 'sequence number' effect studied in days of old. But you could have guessed that. I couldn't remember any more Excel-Fu ..... :-)

                      ( edit ) Actually yet another reason for intrinsic WU variation not affecting your group of 144, is that at ~ 1500 Hz : each frequency is going to have well over 1000 WU's to plow through. [ We found earlier that the number of work-units per sequence-number-cycle at a specific frequency went quadratically with frequency, as you have to plod through the phase space more finely ]. With E@H's use of locality scheduling there is a mighty tendency for a given host ( especially a fast one ) to be given near consecutive sequence numbers, thus your example of 144 WU's may not sample much of any ( if existing ) sinusoidal variation in run-times from that cause. In fact looking at your Xeon's first 20 tasks in the current 'In progress' list ( a page full ) that's exactly what's happening. So I'll shut up now, having demonstrated that this is definitely not relevant to the HT exploration here. :-) :-)

                      ( edit ) * Sorry, quiet night-shift! I didn't explain that if it was predominantly sequence-number related you'd get a low-skewed ( to the left ) concave-up curve, and not a convex-down symmetric bell shape, as most WU's would cluster at the sinusoid trough ( shorter run-times ). If there is any skew asymmetry in your data it is definitely to the right.
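Mike's fit can be sanity-checked in a few lines of Python. This is only a sketch using the numbers he quotes above (mean ~22460 s, standard deviation ~148 s); note that the textbook Gaussian carries a factor of 2 in the exponent's denominator which the informal expression quoted earlier omits:

```python
import math

mean, sigma = 22460.0, 148.0   # the fitted values quoted above

def predicted(t, peak):
    """Gaussian bin-count model: peak * exp(-((t - mean)/sigma)^2 / 2)."""
    return peak * math.exp(-0.5 * ((t - mean) / sigma) ** 2)

# One standard deviation from the mean, the predicted count falls to
# exp(-1/2) of the peak value:
print(round(predicted(mean + sigma, 100.0), 1))   # → 60.7

# The "+/- 2%" remark: three standard deviations as a fraction of the mean.
print(f"{3 * sigma / mean:.1%}")   # → 2.0%
```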
                      ____________
                      "I have made this letter longer than usual, because I lack the time to make it short." - Blaise Pascal

                      archae86
                      Send message
                      Joined: Dec 6 05
                      Posts: 1065
                      Credit: 112,197,382
                      RAC: 98,840
                      Message 108653 - Posted 17 Dec 2010 22:51:04 UTC

                        Last modified: 17 Dec 2010 23:03:04 UTC

                        I've been surprised yet again. The most recent addition to the result table in the second post of this thread covers the case of 4 tasks running with hyperthreading turned on.

                        Recalling that the single task HT case surprised me by having a tiny (and possibly noise rather than truth) deficit to nHT single task, one might have guessed that this case would very closely approximate the 4 task nHT result.

                        It does not--while the HT_4 case gives considerably lower CPU times than does the HT_8, by far more than RAM or cache conflict might be expected to produce, the shortfall to nHT_4 is quite large. Importantly it is large compared to the plausible random noise stemming from task-to-task execution difficulty and system loading variation.

                        So a BOINC user seeking to "keep threads free" from BOINC by setting a maximum number of CPUs below the (virtual) number available, while keeping HT turned on, hoping that in quiet times with nothing else going on the system will perform about as well on BOINC as the same system running the same number of tasks nHT, seems on the Nehalem architecture to lose very little for one task, but quite a lot for 4. With my poor prediction record so far, I probably should not guess how this would go for six, but my guess is that the loss from perfection will continue to grow on the larger core-count Nehalems unless they get a specific logic upgrade aimed at this problem.

                        One other point, with a new picture of data embedded here: For the HT_4 case, I was able to get two other results to run in nearly identical conditions to the intended test subject. They enjoyed the same reduction of normal load from background tasks and foreground usage, the same environment of HT with 3 companion 3.06 tasks, and in fact ran in parallel with the primary measured task for all save about three minutes of offset. I deliberately chose tasks of differing frequency and sequence, hoping to raise the chance of catching systematic execution variation. What I actually got was very, very close matching.



                        Combining the sort of Big Picture variation from the histogram I posted a few posts back with the bottom-up, better-controlled (but much smaller data set) evidence here, I think the case is pretty well made that task-to-task systematic CPU time variation may usually be quite low for 3.06 HF work in the near neighborhood of 1373 frequency. If Bikeman or anyone else can add some understanding or data on current Einstein result execution-time systematic variation, I'd be pleased.

                        (edit: after I wrote this paragraph I noticed that Mike Hewson had added considerable updates to his original comments on my histogram. An appropriate modification to my claim here is to say that I think that the current overall systematic variation may be far less than the old days, but that in any case the restricted set of results actually being compared here, being all from frequency 1373.90, and spanning a sequence number range only from 1000 to 1022 probably contributed very little measurement noise stemming from systematic execution work variation to the reported comparisons)

                        But for this little study I think the available evidence supports the following for a 4-core current generation Nehalem type CPU running near my system's operating point and running Einstein Global Correlations S5 HF search #1 v3.06 (S5GCESSE2):

                        1. With the system allowed to run all the parallel tasks it can, enabling HT raises total throughput appreciably, with the nHT system giving only a little over 75% as much BOINC throughput.

                        2. For the extreme case of restricting BOINC to a single task, it appears that for a system otherwise very, very lightly loaded there is little disadvantage to running HT--most likely between 1 and 2% loss of BOINC throughput.

                        3. Probably the loss incurred by running a restricted number of tasks HT instead of nHT grows with number of tasks. For the 4 task case this loss is substantial with the HT system giving only about 87% as much BOINC throughput as the nHT case.

                        While power consumption increases with number of tasks running, the overall "greenness" for the fixed clock rate fixed voltage case considered here consistently improves with more tasks run and with higher BOINC throughput. If one limits tasks to no more than the number of physical cores present, then both throughput and power efficiency will be best running with hyperthreading disabled.

                        While I'd expect the general trends to hold across a fair range of clock rates and for both the two channel and three-channel RAM variants, it is important to note that while my host is equipped with three channels, the BIOS reported only two to be recognized and active during these tests. I expect that any three-channel variant fully populated and working properly will suffer less degradation in execution times with increasing number of tasks than seen here.
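The percentages in points 1 through 3 all come from the same relative-throughput formula, which a short Python sketch makes explicit. The nHT_4 per-task time below is a placeholder chosen only to reproduce a figure consistent with point 1, not a measured value; only the ~22,500 s HT_8 figure appears earlier in the thread:

```python
def relative_throughput(n_a, t_a, n_b, t_b):
    """Throughput of condition A (n_a parallel tasks at t_a seconds each)
    as a fraction of condition B's throughput."""
    return (n_a / t_a) / (n_b / t_b)

# Hypothetical per-task times (seconds); the nHT_4 value is illustrative.
rel = relative_throughput(4, 14800.0, 8, 22500.0)
print(f"nHT_4 vs HT_8: {rel:.0%}")   # → 76%
```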
                        ____________

                        tear
                        Send message
                        Joined: Sep 12 10
                        Posts: 9
                        Credit: 9,914,974
                        RAC: 0
                        Message 108654 - Posted 17 Dec 2010 23:11:06 UTC - in response to Message 108653.

                          (...) the shortfall (of HT_4 --ed.) to nHT_4 is quite large.


                          I'm blaming the OS here. I once did a similar experiment _but_ set CPU affinities so no two sibling HT cores would be used (Linux). Got on-par results. Just FYI.

                          archae86
                          Send message
                          Joined: Dec 6 05
                          Posts: 1065
                          Credit: 112,197,382
                          RAC: 98,840
                          Message 108655 - Posted 17 Dec 2010 23:46:30 UTC - in response to Message 108654.

                            tear wrote:
                            I'm blaming the OS here. I once did similar experiment _but_ set CPU affinities so no two sibling HT cores would be used (Linux). Got on-par results. Just FYI.

                            Now that is an interesting thought. Allow me to express in highly verbose form my understanding of what you have said so tersely.

                            In hyperthreading the OS is presented with a set of apparently equivalent CPUs. But in the current form, pairs of them use the same physical hardware. So, at least in the Nehalem generation, there would be a great advantage to assigning the next execution of a thread to a virtual CPU which was not only idle itself, but whose "sibling" as you call it--the other virtual CPU in fact using the same core--was also currently idle, rather than to a virtual CPU whose sibling was already using a full core's resources.

                            I'm not at all sure the hardware communicates to the software anything about which virtual CPUs share hardware in what ways. That may have seemed a needless complication or a departure from the purity of apparent equivalence to those making the original design decisions.

                            That is an interesting and plausible suggestion. I believe I could repeat my HT_4 experiment using Process Explorer to force 4 tasks to distinct cores. I'm interested enough to consider trying the experiment soon. However at the practical level of suggesting system configuration for users, that result seems unlikely to help much. Possibly a third-party add-on could repeatedly set affinities for new BOINC tasks to even or odd numbered virtual CPUs on HT systems restricted to half the maximum number of CPUs or less, but those systems would still execute less BOINC work at poorer BOINC power cost efficiency than unrestricted systems. So there seems not likely to be a big market for the feature.
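A minimal sketch of the affinity idea, assuming Linux (tear's platform) and a hypothetical Nehalem-style numbering where logical CPUs 0-3 and 4-7 are the two thread sets of the four physical cores. The `os.sched_setaffinity` call is Linux-only and left commented out; on Windows, Process Explorer or the SetProcessAffinityMask API would be the analogue:

```python
import os  # os.sched_setaffinity is Linux-only

def pick_one_per_core(pairs):
    """Given HT sibling pairs, keep the lower-numbered CPU of each pair,
    so no two chosen CPUs share a physical core."""
    return {min(p) for p in pairs}

# Assumed topology for a 4-core/8-thread part (not read from hardware):
pairs = [(0, 4), (1, 5), (2, 6), (3, 7)]
cpus = pick_one_per_core(pairs)
print(sorted(cpus))   # → [0, 1, 2, 3]

# On Linux one could then pin the current process (pid 0 = self):
# os.sched_setaffinity(0, cpus)
```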

                            ____________

                            ML1
                            Send message
                            Joined: Feb 20 05
                            Posts: 320
                            Credit: 18,426,418
                            RAC: 11,363
                            Message 108657 - Posted 18 Dec 2010 0:52:40 UTC - in response to Message 108654.

                              I'm blaming the OS here. I once did similar experiment _but_ set CPU affinities so no two sibling HT cores would be used (Linux). Got on-par results. Just FYI.

                              The Linux scheduler is HT-aware and so optimally balances out the loading and tries to avoid core thrashing and subsequent cache thrashing.

                              No 'forcing CPU affinity' required. It's already included!

                              You should get optimal throughput by utilising fully loaded HT (for an Intel HT CPU).


                              Happy fast crunchin',
                              Martin

                              ____________
                              Powered by Mandriva Linux -- A user-friendly OS!
                              See the Boinc HELP Wiki

                              tear
                              Send message
                              Joined: Sep 12 10
                              Posts: 9
                              Credit: 9,914,974
                              RAC: 0
                              Message 108663 - Posted 18 Dec 2010 4:47:31 UTC - in response to Message 108657.

                                I'm blaming the OS here. I once did similar experiment _but_ set CPU affinities so no two sibling HT cores would be used (Linux). Got on-par results. Just FYI.

                                The Linux scheduler is HT-aware and so optimally balances out the loading and tries to avoid core thrashing and subsequent cache thrashing.

                                No 'forcing CPU affinity' required. It's already included!

                                I've seen plenty of bad scheduling with HT support in the scheduler
                                (CONFIG_SCHED_SMT). Though I admit, theory does appear nice.

                                You should get optimal throughput by utilising fully loaded HT (for an Intel HT CPU).

                                No disagreement here :)

                                Profile Mike Hewson
                                Forum moderator
                                Avatar
                                Send message
                                Joined: Dec 1 05
                                Posts: 3592
                                Credit: 28,563,336
                                RAC: 12,021
                                Message 108664 - Posted 18 Dec 2010 5:01:58 UTC - in response to Message 108653.

                                  Last modified: 18 Dec 2010 5:09:07 UTC

                                  (edit: after I wrote this paragraph I noticed that Mike Hewson had added considerable updates to his original comments on my histogram. An appropriate modification to my claim here is to say that I think that the current overall systematic variation may be far less than the old days, but that in any case the restricted set of results actually being compared here, being all from frequency 1373.90, and spanning a sequence number range only from 1000 to 1022 probably contributed very little measurement noise stemming from systematic execution work variation to the reported comparisons)

                                  Exactly right. The close frequencies and sequence numbers mean the skygrid right ascension and declination are real close. My guess ( admittedly based on old parameter estimates ) is that the sequence numbers have a cycle of about 400 work units before returning to similar run-times, for around that frequency value.

                                  [ For those not familiar with this aspect of the discussion : at each assumed signal frequency the entire sky is examined in small areas ( one per work unit ), with more, and thus individually smaller, areas required for higher frequencies. Because the Earth rotates around its own axis and also orbits the Sun, a signal channel from each interferometer needs to be 'de-Dopplered' accordingly for each and every choice of distant sky grid element ( a tiny area on a construct called the 'celestial sphere' ). Ultimately a signal is effectively expressed as what it would be like if it were heard at a place called the solar system 'barycenter'. There is another line of adjustment according to estimates of putative source movements too.

                                  The part of the algorithm that steps through the skygrid has to acknowledge some trigonometry to resolve a signal's components to the directions along which a particular interferometer's arms happen to lie at a given instant. In addition, not all skygrid areas are equal, which is a consequence of spherical geometry not being 'flat'. In any case the work unit's runtime used to be very dependent on skygrid position, with a marked sinusoidal variation above an amount that was constant regardless of sky position.

                                  The algorithm starts stepping from, I think, the equator ( though it could have been a pole, as I can't remember which ) and wraps around the sphere with a 'stagger' reminiscent of winding yarn around a ball. The number of steps to return for another wrap-around is this cycle length of approximately 400 that I'm referring to. At lower frequencies than we are doing now, around 3 such cycles were required to cover the entire sky grid. There was also another effect 'rippling' the sinusoidal runtime vs. sequence number curve, probably ( well, that was my view ) due to conversion of co-ordinates from an Earth-based equatorial view to the Earth's orbital plane or ecliptic. The Earth's axis is tilted with respect to the ecliptic, which is why we have seasons etc. In any case, method changes have made all this rather less relevant now ..... but it used to be a huge issue in comparing runtimes and relative (in)efficiencies ]

                                  Cheers, Mike.
                                  ____________
                                  "I have made this letter longer than usual, because I lack the time to make it short." - Blaise Pascal

                                  tear
                                  Send message
                                  Joined: Sep 12 10
                                  Posts: 9
                                  Credit: 9,914,974
                                  RAC: 0
                                  Message 108665 - Posted 18 Dec 2010 5:08:53 UTC - in response to Message 108655.

                                    Last modified: 18 Dec 2010 5:12:22 UTC

                                    tear wrote:
                                    I'm blaming the OS here. I once did similar experiment _but_ set CPU affinities so no two sibling HT cores would be used (Linux). Got on-par results. Just FYI.

                                    Now that is an interesting thought. Allow me to express in highly verbose form my understanding of what you have said so tersely.

                                    <snipped interpretation>
                                    (nb, yes, that's my message)

                                    I'm not at all sure the hardware communicates to the software anything about which virtual CPUs share hardware in what ways.

                                    If I were to nitpick I'd say "Hardware enables software to retrieve physical layout"... ;) sorry.

                                    That is an interesting and plausible suggestion. I believe I could repeat my HT_4 experiment using Process Explorer to force 4 tasks to distinct cores. I'm interested enough to consider trying the experiment soon.

                                    As long as you're able to identify (or determine) HT CPU "pairs". I wouldn't
                                    know how to do that in Windows.

                                    However at the practical level of suggesting system configuration for users, that result seems unlikely to help much. Possibly a third-party add-on could repeatedly set affinities for new BOINC tasks to even or odd numbered virtual CPUs on HT systems restricted to half the maximum number of CPUs or less, but those systems would still execute less BOINC work at poorer BOINC power cost efficiency than unrestricted systems. So there seems not likely to be a big market for the feature.

                                    Yes... use cases, use cases, use cases, use cases (to paraphrase Steve
                                    Ballmer). I can't see one (use case, not Steve -- ed.) either.
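                                    [ On the question of how software retrieves the physical layout: on Linux the kernel publishes sibling groups under /sys/devices/system/cpu/cpuN/topology/thread_siblings_list (on Windows one would go through the GetLogicalProcessorInformation API instead). The sketch below is a minimal parser for that sysfs format; the sysfs path is standard, but the helper names are ours. ]

```python
from pathlib import Path

def parse_siblings(listing):
    """Parse a thread_siblings_list string such as '0,4' or '0-1'
    into a sorted tuple of CPU numbers."""
    cpus = set()
    for part in listing.strip().split(","):
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return tuple(sorted(cpus))

def sibling_pairs(root="/sys/devices/system/cpu"):
    """Read HT sibling groups from sysfs (Linux only)."""
    pairs = set()
    for topo in Path(root).glob("cpu[0-9]*/topology/thread_siblings_list"):
        pairs.add(parse_siblings(topo.read_text()))
    return sorted(pairs)

# Both numbering layouts seen later in this thread parse the same way:
print(parse_siblings("0,4"))   # (0, 4)
print(parse_siblings("2-3"))   # (2, 3)
```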

                                    ML1
                                    Send message
                                    Joined: Feb 20 05
                                    Posts: 320
                                    Credit: 18,426,418
                                    RAC: 11,363
                                    Message 108676 - Posted 18 Dec 2010 11:40:50 UTC - in response to Message 108665.

                                      ... As long as you're able to identify (or determine) HT CPU "pairs". I wouldn't know how to do that in Windows.

                                      Is the Windows scheduler HT-aware yet?...

                                      Aside: Also note that for some systems, the Intel CPUs can become memory bandwidth limited for some tasks. For those cases, you can get better performance by NOT using all the cores, or by running a mix of BOINC tasks so as not to hit the limits of CPU cache and memory access.

                                      That was especially true for the later multi-cores using the old Intel FSB. Has that now been eased with the more recent CPUs that no longer use a 'northbridge' for RAM access?


                                      Happy fast crunchin',
                                      Martin


                                      ____________
                                      Powered by Mandriva Linux A user friendly OS!
                                      See the Boinc HELP Wiki

                                      ExtraTerrestrial Apes
                                      Avatar
                                      Send message
                                      Joined: Nov 10 04
                                      Posts: 464
                                      Credit: 32,328,849
                                      RAC: 32,896
                                      Message 108677 - Posted 18 Dec 2010 12:12:29 UTC

                                        I can vaguely remember that MS put quite some effort into making Server 2008R2 more power efficient (I think there was a review on Anandtech about this). They achieved quite an improvement over the previous versions. And as far as I remember the optimizations include NUMA-awareness and HT-awareness in the scheduler. It may not be perfect (which software is?), but if it wasn't there I'd expect the HT_4 result to be even worse, maybe right in the middle between nHT_4 and HT_8 (without a proper calculation of probabilities).

                                        MrS
                                        ____________
                                        Scanning for our furry friends since Jan 2002

                                        archae86
                                        Send message
                                        Joined: Dec 6 05
                                        Posts: 1065
                                        Credit: 112,197,382
                                        RAC: 98,840
                                        Message 108684 - Posted 18 Dec 2010 12:59:09 UTC - in response to Message 108665.

                                          That is an interesting and plausible suggestion. I believe I could repeat my HT_4 experiment using Process Explorer to force 4 tasks to distinct cores. I'm interested enough to consider trying the experiment soon.

                                          As long as you're able to identify (or determine) HT CPU "pairs". I wouldn't
                                          know how to do that in Windows.
                                          In the "set affinity" interface for Process Explorer it designates CPU 0 through CPU 7 on my E5620, and 0 through 3 on my Q6600.

                                          From some previous work I formed an impression that (0,1), (2,3), (4,5), and (6,7) were core-sharing pairs on my E5620, though I'm not highly confident. At least part and perhaps all of my impression made use of reported core-to-core temperature changes in response to task shifts. An additional difficulty is that at least some temperature-reporting apps don't use CPU identification compatible with that used in this affinity interface.

                                          After I saw tear's note yesterday, I made a sloppy trial run in which I used suspensions to limit execution to four Einstein 3.06 HF tasks, and used this affinity mechanism to restrict each to a distinct one of the four presumed pairs. It was sloppy in that I failed to monitor things closely enough to avoid some minutes in which fewer than four tasks were running, but my initial impression is fairly strong that a large improvement over the non-affinity-modified case was demonstrated. Long ago I did affinity experiments on a Q6600 with a full SETI/Einstein task load, demonstrating no improvement. That, of course, was quite a different issue than this one. It is not the unneeded switching of tasks from CPU to CPU that is the primary harm here, but the unneeded sharing of a physical core when an idle core is available.
                                          ____________

                                          Robert
                                          Send message
                                          Joined: Nov 5 05
                                          Posts: 34
                                          Credit: 205,806,050
                                          RAC: 160,846
                                          Message 108703 - Posted 19 Dec 2010 2:38:38 UTC

                                            I found the nHT_4 << HT_4 result to be inconsistent with experiments I've run in the past and, like your initial reaction, surprising.

                                            I just ran this same experiment on a Core i7-920 (OC to 3.7GHz), a Nehalem quad core with hyperthreading and 3 x 2GB of RAM under Windows 7. This is of course the older 45 nm process versus the Westmere's 32 nm process, but as you point out, they are essentially the same architecture.

                                            My results:

                                            nHT_4 = 13,500 seconds
                                            HT_4 = 13,560 seconds

                                            Maybe someone else can run this experiment and provide an additional data point.
                                            ____________

                                            telegd
                                            Avatar
                                            Send message
                                            Joined: Apr 17 07
                                            Posts: 91
                                            Credit: 8,780,918
                                            RAC: 26,221
                                            Message 108704 - Posted 19 Dec 2010 4:20:37 UTC - in response to Message 108684.

                                              From some previous work I formed an impression that (0,1), (2,3), (4,5), and (6,7) were core-sharing pairs on my E5620, though I'm not highly confident.

                                              Interesting. On my i7-860 Linux box, it is (0,4) (1,5) (2,6) (3,7).

                                              This thread is very interesting - thanks for doing all these tests. I would be very interested to see a similar comparison done under Linux but, sadly, I don't have the time to do it myself...


                                              archae86
                                              Send message
                                              Joined: Dec 6 05
                                              Posts: 1065
                                              Credit: 112,197,382
                                              RAC: 98,840
                                              Message 108705 - Posted 19 Dec 2010 5:32:47 UTC

                                                As a probe, I tried a new case of HT enabled but only 4 tasks (compared to 8 possible), but with one task restricted (by Process Explorer affinity setting) to what Process Explorer running under Windows 7 construed to be CPUs (2,3) while the other three tasks were all allowed to roam among CPUs (0,1,4,5,6,7).

                                                On the "unlucky assignment" hypothesis, one would expect the task restricted to a single sibling pair, from which the three other Einstein tasks were excluded, to have spent nearly all of its time with sole use of a full core. Thus it would be expected to finish a good deal sooner than the three other tasks, which would by bad luck spend part of their time sharing a real physical core with another Einstein task while a real physical core sat idle.

                                                To first order this prophecy seems borne out by the observed result.

                                                The task restricted to what Process Explorer called CPUs (2,3) required 14,449.05 seconds of CPU time to complete.

                                                The other three required 16,245.74, 16,267.50, and 16,258.21 seconds. They are quite tightly matched compared to the large difference from the (2,3)-restricted task.

                                                For those keeping track of frequency and sequence as possible indicia of varying inherent computation work, those were:

                                                for the favored single-core WU:
                                                freq 1373.05 seq 1015
                                                for the disfavored three condemned to suffer temporary sharing of an actual physical core with each other:
                                                freq 1373.10 seq 1009
                                                freq 1373.90 seq 999
                                                freq 1373.05 seq 1014

                                                It seems clear to me that for my system with Windows 7 and other conditions, the OS is fairly likely to assign a "durable" task to the "other half" of a real physical core fairly often, and when that happens in an underworked situation, a real physical CPU is often left idle when a more ideal task assignment could have gotten more throughput.

                                                All of this is what my reading of tear's post suggested. Attention Windows haters: as tear got an observation on Linux, it is at least hinted that some Linux distributions suffer to at least some degree the same form of sub-optimization.

                                                [political observation]I don't much like Windows or Bill Gates myself, save for his second life in the Foundation, as to which at least the vaccine work, and quite a bit else seems well founded [/political observation]
                                                ____________
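                                                [ Quick arithmetic on the CPU times reported above, to put a number on the "unlucky assignment" penalty for the three disfavored tasks relative to the favored one. ]

```python
# CPU times in seconds, as reported in the post above.
favored = 14449.05
disfavored = [16245.74, 16267.50, 16258.21]

mean_disfavored = sum(disfavored) / len(disfavored)
penalty = mean_disfavored / favored - 1.0
print(f"mean disfavored: {mean_disfavored:.2f} s")   # 16257.15 s
print(f"penalty vs favored task: {penalty:.1%}")     # roughly 12.5%
```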

                                                BilBg
                                                Avatar
                                                Send message
                                                Joined: May 27 07
                                                Posts: 56
                                                Credit: 23,998
                                                RAC: 0
                                                Message 108715 - Posted 19 Dec 2010 14:42:31 UTC - in response to Message 108705.


                                                  I'm not sure but can Process Lasso assign CPU affinities automatically to use preferably the real cores in HT case? (I don't have such CPU to test):
                                                  http://www.bitsum.com/prolasso.php

                                                  " ProBalance
                                                  Balance process priorities (or CPU affinities) ...

                                                  Automated Process Control
                                                  Set default priorities and CPU affinities ...

                                                  Multi-Core Optimization
                                                  Through default CPU affinities and ProBalance affinity adjustments, you can optimize your multi-core processor to make the most efficient use of your CPUs (cores)
                                                  "

                                                  http://www.bitsum.com/docs/pl/how_does_lasso_work.htm


                                                  ____________



                                                  - ALF - "Find out what you don't do well ..... then don't do it!" :)

                                                  ML1
                                                  Send message
                                                  Joined: Feb 20 05
                                                  Posts: 320
                                                  Credit: 18,426,418
                                                  RAC: 11,363
                                                  Message 108727 - Posted 19 Dec 2010 18:08:26 UTC - in response to Message 108705.

                                                    Some interesting observations and good discussion.

                                                    Still... Beware the aspect of memory bandwidth contention confusing the issue of CPU thread allocation and overall performance...

                                                    ... as tear got an observation on Linux, it is at least hinted that some Linux distributions suffer to at least some degree the same form of sub-optimization.


                                                    For all you might ever have wanted to know about the introduction of the Intel version of Hyper-Threading... (Note that this is a rather old idea, harking back to the days of the Cyber supercomputers and possibly before...)


                                                    Linux: HyperThreading-Aware Scheduler

                                                    ... August 28, 2002 - 12:59pm

                                                    * Linux news

                                                    Ingo Molnar, author of the O(1) scheduler [earlier story] and the original preemptive kernel patch, has provided a patch to make the O(1) scheduler fully aware of HyperThreading. Ingo explains: ...



                                                    Linux: NUMA Awareness Added To Scheduler

                                                    ... January 22, 2003 - 3:22am

                                                    * Linux news

                                                    After several earlier attempts [story], NUMA awareness has been merged into the 2.5 development kernel's scheduler. Martin Bligh submitted the patches, explaining: ...



                                                    Hyper-Threading support in Linux kernel 2.5.x

                                                    Linux kernel 2.4.x was made aware of HT since the release of 2.4.17. The kernel 2.4.17 knows about the logical processor, and it treats a Hyper-Threaded processor as two physical processors. However, the scheduler used in the stock kernel 2.4.x is still considered naive for not being able to distinguish the resource contention problem between two logical processors versus two separate physical processors.

                                                    Ingo Molnar has pointed out scenarios in which the current scheduler gets things wrong...

                                                    The solution is to change the way the run queues work. The 2.5 scheduler maintains one run queue per processor and attempts to avoid moving tasks between queues. The change is to have one run queue per physical processor that is able to feed tasks into all of the virtual processors. Throw in a smarter sense...
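                                                    [ A toy sketch of the placement preference the quoted 2.5-kernel change describes: fill idle physical cores first, and only double up on HT siblings once every physical core already carries a task. A real scheduler is dynamic and far more subtle; the function name and structure here are ours, purely for illustration. ]

```python
def assign(tasks, sibling_pairs):
    """Toy HT-aware placement: one task per physical core first,
    then the remaining sibling CPUs.  Returns (task, cpu) pairs."""
    # First pass: the first sibling of each pair (one per physical core).
    free_primary = [pair[0] for pair in sibling_pairs]
    # Second pass: the remaining siblings, used only once primaries run out.
    free_secondary = [cpu for pair in sibling_pairs for cpu in pair[1:]]
    return list(zip(tasks, free_primary + free_secondary))

pairs = [(0, 1), (2, 3), (4, 5), (6, 7)]
# Four tasks land on four distinct physical cores -- no sibling sharing.
print(assign(["t1", "t2", "t3", "t4"], pairs))
# [('t1', 0), ('t2', 2), ('t3', 4), ('t4', 6)]
```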



                                                    Rather interesting for the various scenarios...

                                                    As mentioned, a complication observed elsewhere is when the multiple CPU cores become resource limited for memory access (or even cache access).

                                                    Happy fast crunchin',
                                                    Martin



                                                    ____________
                                                    Powered by Mandriva Linux A user friendly OS!
                                                    See the Boinc HELP Wiki

                                                    archae86
                                                    Send message
                                                    Joined: Dec 6 05
                                                    Posts: 1065
                                                    Credit: 112,197,382
                                                    RAC: 98,840
                                                    Message 108730 - Posted 19 Dec 2010 20:31:12 UTC - in response to Message 108715.


                                                      I'm not sure but can Process Lasso assign CPU affinities automatically to use preferably the real cores in HT case? (I don't have such CPU to test):


                                                      Suppose you wished to:
                                                      1. run with HT enabled, and
                                                      2. limit the number of BOINC tasks to the number of physical cores (thus losing appreciable Einstein throughput compared to allowing use of all of the apparent CPUs), yet
                                                      3. get more Einstein output than one gets by allowing unlucky task assignment to sibling CPUs.

                                                      If my cursory understanding of Process Lasso from looking at your reference, my current belief that the sibling pair numbering for Process Explorer applies to Process Lasso, and a new thought I had a couple of minutes ago are all correct,

                                                      then one could, I think, do this:

                                                      Use Process Lasso to assign CPU affinity of 0,2,4,6 (or any other list of four that includes only one of each sibling pair) to the "worker" exe for all BOINC applications that you run (it needs to be the same list for all apps of this class). One would of course also wish to use BOINC to restrict the number of running BOINC processes to four or to 50%. On mixed fleets with both HT and nHT multi-core hosts, doing this from account preferences would use up a venue dimension--if that is unacceptable one might use a host preference override.

                                                      This should assure that no BOINC execution task ever shares a physical core with another. On a lightly loaded system of Nehalem-generation architecture, I'd expect based on what we have observed so far, such a system to get very close to the nHT BOINC throughput.

                                                      A possible reason to consider this might be that such a system might be found to be more responsive to non-BOINC tasks than the 4-task nHT alternative, and quite likely more responsive than the (admittedly higher throughput) 8-task variant.

                                                      Current Einstein GC work has a Working Set size reported at about 260,000 kbytes by Process Explorer. Folks with modest-memory systems, or needing to run memory-hungry non-Einstein apps (say Photoshop...), or running BOINC projects that are even more memory-hungry, might find this configuration attractive.

                                                      I'm not sure we have quite met the usefulness objection to this line of inquiry we lightheartedly entertained early in the thread, but I do think we are getting closer.
                                                      ____________
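                                                      [ A sketch of the 0,2,4,6 recipe in code: pick one CPU from each sibling pair, then (on Linux only, via the standard-library os.sched_setaffinity) pin the current process to that mask. On Windows one would use Process Lasso or Process Explorer as discussed above. The helper name is ours; the sibling pairs shown are the two layouts reported in this thread. ]

```python
import os

def one_cpu_per_core(sibling_pairs):
    """Pick the first CPU from each sibling pair -- e.g. 0,2,4,6 for
    pairs (0,1)(2,3)(4,5)(6,7), matching the recipe above.  For the
    (0,4)(1,5)(2,6)(3,7) layout it yields 0,1,2,3, which is equally
    one CPU per physical core."""
    return sorted(pair[0] for pair in sibling_pairs)

mask = one_cpu_per_core([(0, 1), (2, 3), (4, 5), (6, 7)])
print(mask)  # [0, 2, 4, 6]

# Apply to the current process (Linux only; no-op elsewhere).
if hasattr(os, "sched_setaffinity"):
    try:
        os.sched_setaffinity(0, mask)
    except OSError:
        pass  # host has fewer CPUs than the mask assumes
```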

                                                      ExtraTerrestrial Apes
                                                      Avatar
                                                      Send message
                                                      Joined: Nov 10 04
                                                      Posts: 464
                                                      Credit: 32,328,849
                                                      RAC: 32,896
                                                      Message 108731 - Posted 19 Dec 2010 22:31:04 UTC - in response to Message 108705.

                                                        Last modified: 19 Dec 2010 22:43:50 UTC

                                                        I'm wondering: does MS know the "unlucky assignment" apparently does happen this often? They should be scratching their heads already..

                                                        Edit: now that I think about it.. this should really upset Intel. Any regular (i.e. non-BOINC) software which uses more than 1 core is likely to use less than 8 cores. And that means it will be unnecessarily slowed down by "unlucky assignment".

                                                        MrS
                                                        ____________
                                                        Scanning for our furry friends since Jan 2002

                                                        DanNeely
                                                        Send message
                                                        Joined: Sep 4 05
                                                        Posts: 1075
                                                        Credit: 71,664,405
                                                        RAC: 80,755
                                                        Message 108732 - Posted 20 Dec 2010 0:14:55 UTC

                                                          archae86: have you checked your windows power management settings? I ask because any setting below max performance could have the scheduler intentionally pairing WU's up at times in order to idle cores and drop power levels.

                                                          ExtraTerrestrial Apes: The problem appears to be in the Windows scheduler either not being able to detect that the BOINC tasks are long-running 100%-load items so as to keep them separate, or bumping them deliberately, because they're low-priority tasks, in favor of giving exclusive core use to a higher-priority item that requested CPU time. In either case this is a Microsoft problem, and not something Intel could do anything about themselves.

                                                          It might be possible to differentiate between the two scheduler failures by changing the priority of the science apps from low to high. IF the problem is my second guess, this should discourage the scheduler from cramming 2 tasks onto a single core to give something else exclusive access.


                                                          ____________

                                                          archae86
                                                          Send message
                                                          Joined: Dec 6 05
                                                          Posts: 1065
                                                          Credit: 112,197,382
                                                          RAC: 98,840
                                                          Message 108735 - Posted 20 Dec 2010 1:03:12 UTC - in response to Message 108730.

                                                            CPU affinity of 0,2,4,6 (or any other list of four that includes only one of each sibling pair)
                                                            While I mentioned this thought in conjunction with another poster's mention of Process Lasso, I just tried this recipe solo. The result was successful, and has been added to the image displayed in the second post in this thread. Possibly this is what tear meant in referring to affinity settings avoiding sibling conflict. The resulting (low) CPU time--actually I've entered the average of four results sharing this condition--seems to endorse this particular setting for this purpose.

                                                            DanNeely: my power management setting currently says "Turn off the display": 10 minutes, "Put the computer to sleep": never.

                                                            not sure how that comports with your concerns.

                                                            ____________

                                                            Profile Mike Hewson
                                                            Forum moderator
                                                            Avatar
                                                            Send message
                                                            Joined: Dec 1 05
                                                            Posts: 3592
                                                            Credit: 28,563,336
                                                            RAC: 12,021
                                                            Message 108738 - Posted 20 Dec 2010 1:48:36 UTC - in response to Message 108735.

                                                              CPU affinity of 0,2,4,6 (or any other list of four that includes only one of each sibling pair)
                                                              While I mentioned this thought in conjunction with another poster's mention of Process Lasso, I just tried this recipe solo. The result was successful, and has been added to the image displayed in the second post in this thread. Possibly this is what tear meant in referring to affinity settings avoiding sibling conflict. The resulting (low) CPU time--actually I've entered the average of four results sharing this condition--seems to endorse this particular setting for this purpose.

                                                              To be clear on this, we'd be looking at the figures for nHT vs HT, both @ 4 tasks?

                                                              Cheers, Mike.

                                                              ____________
                                                              "I have made this letter longer than usual, because I lack the time to make it short." - Blaise Pascal

                                                              DanNeely
                                                              Send message
                                                              Joined: Sep 4 05
                                                              Posts: 1075
                                                              Credit: 71,664,405
                                                              RAC: 80,755
                                                              Message 108740 - Posted 20 Dec 2010 2:16:05 UTC - in response to Message 108735.


                                                                DanNeely: my power management setting currently says "Turn off the display": 10 minutes, "Put the computer to sleep": never.

                                                                not sure how that comports with your concerns.


                                                                From that dialog, click change advanced settings, scroll down to Processor Power Management, and take a look at the values for Minimum and Maximum processor speed (unless you locked your multiplier in the BIOS and disabled the power management features that let it throttle down). IF your power plan is based off of Balanced or Power Saver instead of maximum, the minimum value will be 5% leaving windows free to throttle your CPU as it sees fit. I thought there were also settings relating to standing down cores as well, but unless they're subsumed in the cpu speed setting I can't find them.
                                                                ____________

                                                                archae86
                                                                Send message
                                                                Joined: Dec 6 05
                                                                Posts: 1065
                                                                Credit: 112,197,382
                                                                RAC: 98,840
                                                                Message 108741 - Posted 20 Dec 2010 2:24:39 UTC - in response to Message 108740.

                                                                  Last modified: 20 Dec 2010 2:28:48 UTC

                                                                  DanNeely wrote:
                                                                  take a look at the values for Minimum and Maximum processor speed
                                                                  100%, 100%
                                                                  (unless you locked your multiplier in the BIOS
                                                                  Yes it is locked
                                                                  ____________

                                                                  archae86
                                                                  Send message
                                                                  Joined: Dec 6 05
                                                                  Posts: 1065
                                                                  Credit: 112,197,382
                                                                  RAC: 98,840
                                                                  Message 108742 - Posted 20 Dec 2010 2:28:27 UTC - in response to Message 108738.

                                                                    Mike Hewson wrote:
                                                                    To be clear on this, we'd be looking at the figures for nHT vs HT, both @ 4 tasks?

                                                                    Cheers, Mike.
It may require a click on the page-reload button (I think Einstein does not set special page-expiry times), but the image of a portion of a spreadsheet in the second post should now include an "affinity" column not previously present.

                                                                    And, yes, I'm speaking of comparing several entries with 4 in the tasks columns and both HT and nHT in the first column.

                                                                    ____________

                                                                    Profile Mike Hewson
                                                                    Forum moderator
                                                                    Avatar
                                                                    Send message
                                                                    Joined: Dec 1 05
                                                                    Posts: 3592
                                                                    Credit: 28,563,336
                                                                    RAC: 12,021
                                                                    Message 108743 - Posted 20 Dec 2010 2:54:17 UTC - in response to Message 108742.

                                                                      It may require a click on the page reload button (I think Einstein does not set specially page expiry times), but the image of a portion of a spreadsheet in the second post should now include an "affinity" column not previously present.

                                                                      Done that, can't see any 'affinity' .... :-(

                                                                      Cheers, Mike.

                                                                      ____________
                                                                      "I have made this letter longer than usual, because I lack the time to make it short." - Blaise Pascal

                                                                      Profile paul milton
                                                                      Avatar
                                                                      Send message
                                                                      Joined: Sep 16 05
                                                                      Posts: 330
                                                                      Credit: 9,238,780
                                                                      RAC: 9,835
                                                                      Message 108744 - Posted 20 Dec 2010 2:56:13 UTC - in response to Message 108743.

                                                                        Last modified: 20 Dec 2010 2:57:11 UTC

                                                                        It may require a click on the page reload button (I think Einstein does not set specially page expiry times), but the image of a portion of a spreadsheet in the second post should now include an "affinity" column not previously present.

                                                                        Done that, can't see any 'affinity' .... :-(

                                                                        Cheers, Mike.


same, until I cleared my cache and did a refresh, now it's there :)

edit: aside, might wanna trim that image by one column to prevent side scrolling :)
                                                                        ____________
                                                                        seeing without seeing is something the blind learn to do, and seeing beyond vision can be a gift.

                                                                        archae86
                                                                        Send message
                                                                        Joined: Dec 6 05
                                                                        Posts: 1065
                                                                        Credit: 112,197,382
                                                                        RAC: 98,840
                                                                        Message 108745 - Posted 20 Dec 2010 3:00:54 UTC - in response to Message 108743.

                                                                          It may require a click on the page reload button (I think Einstein does not set specially page expiry times), but the image of a portion of a spreadsheet in the second post should now include an "affinity" column not previously present.

                                                                          Done that, can't see any 'affinity' .... :-(

                                                                          Cheers, Mike.
Odd, I see a new column after Thpt/Watt/HT_8 with the column label "Affinity Restrictions". The Excel table fragment screen capture currently has 12 rows. How many rows do you see (including the header row)?

Maybe there is a bit of network kit imposing caching between your browser and Photobucket? That could be a reason not to use this update method. If you say this is still a problem, I shall post the new capture in a new post under a new name.

                                                                          ____________

                                                                          Profile Mike Hewson
                                                                          Forum moderator
                                                                          Avatar
                                                                          Send message
                                                                          Joined: Dec 1 05
                                                                          Posts: 3592
                                                                          Credit: 28,563,336
                                                                          RAC: 12,021
                                                                          Message 108746 - Posted 20 Dec 2010 3:04:25 UTC

                                                                            Last modified: 20 Dec 2010 3:18:15 UTC

                                                                            Got it, a FireFox thingy. Cleared the cache, now I can see! ;-)

Yup, a clear-cut result there. In the context of the setup, roaming/loose affinity carries a not inconsiderable price of ~10% ... and with restriction, the result lands near the nHT case.

                                                                            Cheers, Mike.
                                                                            ____________
                                                                            "I have made this letter longer than usual, because I lack the time to make it short." - Blaise Pascal

                                                                            archae86
                                                                            Send message
                                                                            Joined: Dec 6 05
                                                                            Posts: 1065
                                                                            Credit: 112,197,382
                                                                            RAC: 98,840
                                                                            Message 108748 - Posted 20 Dec 2010 3:20:30 UTC - in response to Message 108744.

                                                                              edit: aside, might wonna trim that image by one colum to prevent side scrolling :)
                                                                              Gee, it was only 887 pixels wide. I've made an edit and it is down to 787 wide now.

                                                                              ____________

                                                                              Profile Mike Hewson
                                                                              Forum moderator
                                                                              Avatar
                                                                              Send message
                                                                              Joined: Dec 1 05
                                                                              Posts: 3592
                                                                              Credit: 28,563,336
                                                                              RAC: 12,021
                                                                              Message 108749 - Posted 20 Dec 2010 3:26:09 UTC

                                                                                Well, this is why we have scroll bars. :-) :-)

                                                                                Cheers, Mike.
                                                                                ____________
                                                                                "I have made this letter longer than usual, because I lack the time to make it short." - Blaise Pascal

                                                                                Profile paul milton
                                                                                Avatar
                                                                                Send message
                                                                                Joined: Sep 16 05
                                                                                Posts: 330
                                                                                Credit: 9,238,780
                                                                                RAC: 9,835
                                                                                Message 108781 - Posted 20 Dec 2010 15:22:01 UTC - in response to Message 108748.

                                                                                  edit: aside, might wonna trim that image by one colum to prevent side scrolling :)
                                                                                  Gee, it was only 887 pixels wide. I've made an edit and it is down to 787 wide now.



err, sorry. i just had flashbacks of folks way back when complaining about wide signature images and was trying to head that off :| didn't mean to offend.
                                                                                  ____________
                                                                                  seeing without seeing is something the blind learn to do, and seeing beyond vision can be a gift.

                                                                                  ExtraTerrestrial Apes
                                                                                  Avatar
                                                                                  Send message
                                                                                  Joined: Nov 10 04
                                                                                  Posts: 464
                                                                                  Credit: 32,328,849
                                                                                  RAC: 32,896
                                                                                  Message 108799 - Posted 20 Dec 2010 22:11:20 UTC - in response to Message 108732.

                                                                                    ExtraTerrestrial Apes: The problem appears to be in the windows scheduler either not being able to detect that the boinc tasks are long running 100% load items to keep them separate, or that it bumps them deliberately because they're low priority tasks in favor of giving exclusive core use to a higher priority item that requested CPU time.


Nr. 1: possible, but that seems like a really stupid mistake, as it should be obvious that they are essentially "long running 100% load items".

                                                                                    Nr. 2: I would expect this to happen occasionally, but in this case I'd be really surprised by the magnitude of the effect. In the nHT_4 case each task took 14800s, whereas in the full HT_8 config each one took 22300s. In the run in question the 3 tasks with presumably "unlucky assignment" needed 16900s. So I think it is safe to say that these tasks spent approximately 72% of the runtime alone on a core and 28% of the time on a shared core.
If this was intentional, because some other single task seemed more important, then that task would have run for 3 * 0.28 = 84% of the entire runtime. That means Archae would have observed an average non-BOINC CPU load of about 10% (for this statement it does not matter whether this was one important task or many).

                                                                                    Since Archae took care not to have excessive background tasks running and since on my Win 7 machines I do not normally observe such high background activity I don't think Nr. 2 is a likely explanation for the observed runtimes.
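The alone-versus-shared split estimated in Nr. 2 can be reproduced with a little arithmetic. A minimal sketch, using the runtimes quoted above; the linear-mix model is an assumption of mine, not something stated in the post:

```python
# Hypothetical reconstruction of the Nr. 2 estimate. The runtimes are those
# quoted in the post; the linear-mix model is an assumption.
t_alone  = 14800.0   # nHT_4: each task has a physical core to itself
t_shared = 22300.0   # HT_8: every task shares a physical core
t_seen   = 16900.0   # observed runtime of the "unlucky" tasks

# Solve t_seen = f * t_alone + (1 - f) * t_shared for f
f_alone = (t_shared - t_seen) / (t_shared - t_alone)
print(f"{f_alone:.0%} alone, {1 - f_alone:.0%} shared")  # 72% alone, 28% shared
```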

Nr. 3: what if the scheduler isn't HT-aware? What if it assigned tasks completely at random? I quickly tried to count the possibilities for scheduling 3 tasks over 3 HT cores and arrived at 24 lucky possibilities and 20 unlucky ones. If we assume the same probability for each, that would mean 20/(24+20) = 45% unlucky assignments. That's not exactly the 28% from the previous paragraph, and I'd be surprised if I didn't make some mistake here. But in my opinion it's somewhat close.. and it should be much further away if the scheduler worked remotely the way I'd imagine.

                                                                                    In either case this is a Microsoft problem, and not something Intel could do anything about themselves.


                                                                                    Yes, but it's Intel's products which are suffering due to this, i.e. are getting worse benchmark scores and worse real world performance. That's why I think it would be in their best interest to look into this and, if confirmed, ask MS politely but firmly to change their scheduler.

                                                                                    MrS
                                                                                    ____________
                                                                                    Scanning for our furry friends since Jan 2002

                                                                                    Profile Mike Hewson
                                                                                    Forum moderator
                                                                                    Avatar
                                                                                    Send message
                                                                                    Joined: Dec 1 05
                                                                                    Posts: 3592
                                                                                    Credit: 28,563,336
                                                                                    RAC: 12,021
                                                                                    Message 108803 - Posted 21 Dec 2010 0:06:04 UTC - in response to Message 108799.

                                                                                      Last modified: 21 Dec 2010 0:27:27 UTC

                                                                                      Nr. 3: what if the scheduler wasn't HT-aware? If it assigned tasks completely random? I just quickly tried to count the possibilities for scheduling 3 tasks over 3 HT cores and arrived at 24 lucky possibilities and 20 unlucky ones. If we assume the same probability for each one that would mean 20/(24+20) = 45% unlucky assignments. That's not exactly the 28% from the previous paragraph and I'd be surprised if I didn't make some mistake here. But in my opinion it's somewhat close.. and should be much further away if the scheduler worked remotely the way I'd imagine.

                                                                                      Ah. Three physical cores ( 1, 2, 3 ) each twice virtualised ( a, b ) with three distinct processes ( A, B, C )?
Phys Virt
 1    a | A A A A A A A A A A
      b | B B B B A A A A A A
 2    a | C B B B B B B A A
      b | C C B B C B B B B A
 3    a | C C C B C C B C B
      b | C C C C C C C C C
      _______________________
          Y Y Y Y Y Y Y Y

                                                                                      Where Y indicates 'good' combinations that don't compete for a physical core. Have I expressed your desired scenario correctly?

                                                                                      You just permute from ABC to ACB, BAC, BCA, CAB, CBA to get the others. But the fraction of good ones is still the same. The total good is 8 * 6 = 48, but out of 19 * 6 = 114 in all. Thus 8/19 = 48/114 ~ 0.42105 or 42% lucky and thus 48% unlucky ( assumes random assignment of task to virtual core ).

                                                                                      Cheers, Mike.

                                                                                      ( edit ) I suppose we're gonna want a 4 HT core by 4 tasks matrix .... as Arnie says : I'll be back. :-)
                                                                                      ____________
                                                                                      "I have made this letter longer than usual, because I lack the time to make it short." - Blaise Pascal

                                                                                      BilBg
                                                                                      Avatar
                                                                                      Send message
                                                                                      Joined: May 27 07
                                                                                      Posts: 56
                                                                                      Credit: 23,998
                                                                                      RAC: 0
                                                                                      Message 108808 - Posted 21 Dec 2010 1:23:12 UTC - in response to Message 108803.

                                                                                        Last modified: 21 Dec 2010 1:34:34 UTC

                                                                                        Thus 8/19 = 48/114 ~ 0.42105 or 42% lucky and thus 48% unlucky ( assumes random assignment of task to virtual core ).


                                                                                        Excellent explanation!

The funny part is that you got the hard part right and the easy final arithmetic wrong ;)

                                                                                        It's 100-42 = 58% unlucky


                                                                                        The "Answer to the Ultimate Question of Life, the Universe, and Everything" (42) get in the way :)
                                                                                        http://en.wikipedia.org/wiki/42_(number)
                                                                                        http://en.wikipedia.org/wiki/Answer_to_the_Ultimate_Question_of_Life,_the_Universe,_and_Everything#The_number_42


                                                                                        ____________


                                                                                        - ALF - "Find out what you don't do well ..... then don't do it!" :)

                                                                                        Profile Mike Hewson
                                                                                        Forum moderator
                                                                                        Avatar
                                                                                        Send message
                                                                                        Joined: Dec 1 05
                                                                                        Posts: 3592
                                                                                        Credit: 28,563,336
                                                                                        RAC: 12,021
                                                                                        Message 108809 - Posted 21 Dec 2010 1:44:04 UTC - in response to Message 108808.

                                                                                          Typo? No .... err sales tax? No .... err brain fade? yes .... :-)

                                                                                          As they say in Anchorman ( The Legend of Ron Burgundy ).... They've done studies, you know. 60% of the time it works, every time. ...

                                                                                          Cheers, Mike.


                                                                                          ____________
                                                                                          "I have made this letter longer than usual, because I lack the time to make it short." - Blaise Pascal

                                                                                          Profile Mike Hewson
                                                                                          Forum moderator
                                                                                          Avatar
                                                                                          Send message
                                                                                          Joined: Dec 1 05
                                                                                          Posts: 3592
                                                                                          Credit: 28,563,336
                                                                                          RAC: 12,021
                                                                                          Message 108811 - Posted 21 Dec 2010 4:26:58 UTC

                                                                                            Last modified: 21 Dec 2010 9:30:12 UTC

OK, here's, I believe, the cases for 4 tasks on 4 HT cores :



As you can see, out of 70 cases in all, those marked with a 'Y' are 'good' ( 16 ). The remainder are not equivalent though: some, marked with a caret '^', are where two physical cores are contending and two cores are idle ( 6 ). Those not specially marked have one physical core contending, two active but not contending, and one idle ( 70 - 16 - 6 = 48 ).

                                                                                            [ Same comment as before regarding the ( 24 ) permutations of A, B, C and D amongst themselves. ]

                                                                                            So that's 22.86% ( 16/70 ) are good, 8.57% ( 6/70 ) are worst, and 68.57% ( 48/70 ) are mediocre [ yup, that adds to 100! :-) ]

                                                                                            Cheers, Mike.

( edit ) Ya, 70 total cases is right: 8!/((8 - 4)! * 4!) = (8 * 7 * 6 * 5)/(4 * 3 * 2 * 1) = 7 * 2 * 5 = 70.

                                                                                            ( edit ) Whoops, missed a case ( marked with '*' ) for 3 tasks on 3 HT cores :

Phys Virt
 1    a | A A A A A A A A A A
      b | B B B B A A A A A A
 2    a | C B B B B B B A A A
      b | C C B B C B B B B A
 3    a | C C C B C C B C B B
      b | C C C C C C C C C C
      _________________________
          Y Y Y Y Y Y Y Y     *

                                                                                            Thus 8/20 = 0.4 or 40% lucky and thus 60% unlucky. Sorry ......
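Mike's hand counts for both configurations can be checked by brute force. A quick sketch, assuming tasks land on distinct logical CPUs with uniform probability and that logical CPUs come in pairs per physical core:

```python
from itertools import combinations
from collections import Counter

def contention(slots):
    """Number of physical cores hosting two tasks; logical CPU i sits on core i // 2."""
    cores = [s // 2 for s in slots]
    return len(cores) - len(set(cores))

# 3 tasks on 3 HT cores (6 logical CPUs): 20 placements, 8 contention-free
three = Counter(contention(s) for s in combinations(range(6), 3))
print(three[0], sum(three.values()))   # 8 lucky out of 20 -> 40% lucky

# 4 tasks on 4 HT cores (8 logical CPUs): 70 placements in all
four = Counter(contention(s) for s in combinations(range(8), 4))
print(four[0], four[1], four[2])       # 16 good, 48 mediocre, 6 worst
```

This agrees with the corrected 8/20 = 40% figure above, and with the 16 / 48 / 6 split out of 70 from the 4-task post.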
                                                                                            ____________
                                                                                            "I have made this letter longer than usual, because I lack the time to make it short." - Blaise Pascal

                                                                                            archae86
                                                                                            Send message
                                                                                            Joined: Dec 6 05
                                                                                            Posts: 1065
                                                                                            Credit: 112,197,382
                                                                                            RAC: 98,840
                                                                                            Message 109021 - Posted 28 Dec 2010 20:01:54 UTC - in response to Message 108811.

                                                                                              So that's 22.86% ( 16/70 ) are good, 8.57% ( 6/70 ) are worst, and 68.57% ( 48/70 ) are mediocre [ yup, that adds to 100! :-) ]

                                                                                              Cheers, Mike.

                                                                                              I spent some time getting relative performance estimates for the assignments Mike here calls Good, Mediocre, and Worst.

For this purpose I used a new range of frequencies/seq, having exhausted my stock of the previous ones, though I think these are still very close to the previous range in work content. However, I shifted from reported CPU time to elapsed time, particularly because I observed some anomalous behavior in my test condition:

Some of the time, Windows would not activate all of the currently executing Einstein tasks, even though the affinity prescriptions left a virtual CPU open. One practical impact was a much bigger discrepancy than usual between elapsed time and reported CPU time.

Paradoxically, one would expect the most restrictive case--allowing a task to run on one and only one virtual CPU--to suffer from occasionally having to wait for a higher priority system task that happened to be assigned that CPU during one of the many times per second that the Einstein task is out on a context switch (waiting for disk, or ...). Suffer it doubtless does, but in my rather artificial test cases the most restrictive assignment was sometimes far more productive than the most free one (specifically, when running four Einstein tasks on two physical cores--thus four virtual CPUs--it was much more productive to hard-assign each of the four tasks than to let all four range freely).
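For anyone wanting to repeat the hard-assignment experiment, pinning a process to a single logical CPU looks roughly like this. A sketch only: `os.sched_setaffinity` is Linux-specific, and on the Windows systems discussed here the analogous call is `SetProcessAffinityMask` (exposed, for example, through psutil's `cpu_affinity()`):

```python
import os

# Pin the current process to one logical CPU, mirroring the "hard assign"
# experiment above. Linux-only; pid 0 means the calling process.
original = os.sched_getaffinity(0)   # remember the unrestricted CPU set
os.sched_setaffinity(0, {0})         # restrict to logical CPU 0
assert os.sched_getaffinity(0) == {0}
os.sched_setaffinity(0, original)    # undo the restriction
```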

Here are some comparison numbers. I'll leave it to others to estimate whether these, matched with Mike's proportions, suggest that Windows is doing a bit better or a bit worse than one might predict from random assignment.



The top populated HT_4 row corresponds to Mike's Good (.7622), the next two rows both represent Mike's Worst, though I suspect the first of the two (.5692) is more representative of actual occurrence, and the throughput composite line (.6596) is my estimate of Mike's Mediocre case. The relative performance numbers in this paragraph are all estimates of aggregate system throughput compared to an HT_8 case on the same work.

One should perhaps mention that the designers of the Windows scheduling scheme probably did not rank maximizing aggregate system throughput of persistent, very low priority tasks high on their list of desired outcomes.
                                                                                              ____________

                                                                                              Profile Mike Hewson
                                                                                              Forum moderator
                                                                                              Avatar
                                                                                              Send message
                                                                                              Joined: Dec 1 05
                                                                                              Posts: 3592
                                                                                              Credit: 28,563,336
                                                                                              RAC: 12,021
                                                                                              Message 109023 - Posted 28 Dec 2010 22:18:29 UTC

                                                                                                Last modified: 28 Dec 2010 22:22:43 UTC

                                                                                                To be clear for other readers : by elapsed time is meant time as per a clock on the wall, from the task beginning until completed, whereas CPU time is the total of accumulated time slices devoted to the task. The difference is how long the task waits to be executed on a CPU. Think of a CPU, physical or virtual, as a contended resource which requires allocation - an OS dependent function switches tasks onto CPUs on a priority basis. Tasks thus compete amongst themselves for CPU time, including the tasks we "don't see" like a heap of mundane OS stuff. But our tasks need those mundane ones to perform ( disk access say ) and thus may be 'blocked' from proceeding while awaiting their completion.
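The elapsed-vs-CPU-time distinction Mike describes can be shown in a few lines: a task that mostly waits (here a sleep stands in for blocking on disk) accumulates wall-clock time but almost no CPU time.

```python
# Minimal sketch of elapsed ("clock on the wall") time vs CPU time.
# A blocked task accumulates elapsed time but not CPU time slices.
import time

def waiting_task():
    start_wall = time.perf_counter()  # wall clock
    start_cpu = time.process_time()   # CPU time slices for this process
    time.sleep(0.5)                   # blocked: no CPU slices accumulate
    elapsed = time.perf_counter() - start_wall
    cpu = time.process_time() - start_cpu
    return elapsed, cpu

if __name__ == "__main__":
    elapsed, cpu = waiting_task()
    print(f"elapsed {elapsed:.2f}s, CPU {cpu:.2f}s")  # CPU << elapsed
```

In the "toxic mis-allocation" cases discussed above the gap runs the other way round from normal expectations: a task nominally in execution sits unscheduled, so elapsed time balloons while CPU time does not.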

                                                                                                Cheers, Mike.
                                                                                                ____________
                                                                                                "I have made this letter longer than usual, because I lack the time to make it short." - Blaise Pascal

                                                                                                archae86
                                                                                                Send message
                                                                                                Joined: Dec 6 05
                                                                                                Posts: 1065
                                                                                                Credit: 112,197,382
                                                                                                RAC: 98,840
                                                                                                Message 109030 - Posted 29 Dec 2010 1:41:17 UTC - in response to Message 109023.

                                                                                                  Last modified: 29 Dec 2010 1:43:52 UTC

                                                                                                  Mike Hewson wrote:
                                                                                                  by elapsed time is meant as per a clock on the wall from the task beginning until completed, whereas CPU time is the total of accumulated time slices devoted to the task.

                                                                                                  Just so. And most of the time you want to quote CPU time for various comparisons, as it is generally less subject to variation from the "otherwise" state of the system than is elapsed time.

                                                                                                  But for my tests I believe I did a pretty good job of avoiding nearly all the appreciable time consumers, so that for most cases either would serve--save in this toxic mis-allocation case, where a task theoretically in execution nevertheless fails, with substantial likelihood over an extended period of time, to get assigned to an available CPU.

                                                                                                  On a completely unrelated note--I appear to have killed my Westmere host late this afternoon. I was debugging the problem of only getting 4G out of 6G of RAM, and had just completed the last step in satisfying Corsair's RMA requirements by testing each RAM module of the set separately to memtest86+. It failed even to boot with the offending module, so could be deemed to have failed the test. But something happened in transitioning back to a known good RAM configuration (for one thing, I think I failed to turn off the power supply before shuffling RAM modules, a rookie mistake for sure), and as of now the system gives no signs of life at all save for consuming 20 watts from the wall. No fan spins, no beep codes, no Mobo Dr Debug digits displayed, no sounds from hard drive or CD drive--in fact no detectable response to pressing the front panel "power" button at all--not even in power consumption, which remains steady at 20W (before this death, the behavior was that on turning on the real power switch on the supply, it would go up to about 20W over a couple of seconds, stay there for a couple of seconds, then descend to about 5W until the front panel button was pressed). Yes I have exercised the ClrCMOS jumper. At the moment I suspect death of the motherboard or of the power supply, though there are some other possibilities. I plan to sleep on it, and tomorrow disconnect very nearly everything (eventually including the CPU).
                                                                                                  ____________

                                                                                                  Profile Mike Hewson
                                                                                                  Forum moderator
                                                                                                  Avatar
                                                                                                  Send message
                                                                                                  Joined: Dec 1 05
                                                                                                  Posts: 3592
                                                                                                  Credit: 28,563,336
                                                                                                  RAC: 12,021
                                                                                                  Message 109042 - Posted 29 Dec 2010 6:20:38 UTC - in response to Message 109030.

                                                                                                    Last modified: 29 Dec 2010 6:31:15 UTC

                                                                                                    archae86 wrote:
                                                                                                    On a completely unrelated note--I appear to have killed my Westmere host late this afternoon. I was debugging the problem of only getting 4G out of 6G of RAM, and had just completed the last step in satisfying Corsair's RMA requirements by testing each RAM module of the set separately to memtest86+. It failed even to boot with the offending module, so could be deemed to have failed the test. But something happened in transitioning back to a known good RAM configuration (for one thing, I think I failed to turn off the power supply before shuffling RAM modules, a rookie mistake for sure), and as of now the system gives no signs of life at all save for consuming 20 watts from the wall. No fan spins, no beep codes, no Mobo Dr Debug digits displayed, no sounds from hard drive or CD drive--in fact no detectable response to pressing the front panel "power" button at all--not even in power consumption, which remains steady at 20W (before this death, the behavior was that on turning on the real power switch on the supply, it would go up to about 20W over a couple of seconds, stay there for a couple of seconds, then descend to about 5W until the front panel button was pressed). Yes I have exercised the ClrCMOS jumper. At the moment I suspect death of the motherboard or of the power supply, though there are some other possibilities. I plan to sleep on it, and tomorrow disconnect very nearly everything (eventually including the CPU).

                                                                                                    Oh. With a bit of luck it'll turn out to be something cheaper and simpler like the power supply not supplying a trickle current to the board ( so that it knows when the power switch has been toggled via the mobo input pins ) for full switch on. Swap in a known good PSU and see what happens ... I've seen this before and found/claimed the PSU capacitors at fault ( age plus paper'n'paste and not solid-state ). So your 20W could represent a 'short' across the capacitors, dropping the voltage of outputs ( including the PSU's own fans ), and of course ripple control, thus little happens on the mobo. Look for eburnation of the capacitor connections to the printed circuit board. I really like Corsairs myself.

                                                                                                    I'll look at the recent Westmere data and see if I can soundly deduce anything.

                                                                                                    Cheers, Mike.
                                                                                                    ____________
                                                                                                    "I have made this letter longer than usual, because I lack the time to make it short." - Blaise Pascal

                                                                                                    archae86
                                                                                                    Send message
                                                                                                    Joined: Dec 6 05
                                                                                                    Posts: 1065
                                                                                                    Credit: 112,197,382
                                                                                                    RAC: 98,840
                                                                                                    Message 109052 - Posted 29 Dec 2010 20:14:34 UTC - in response to Message 109042.

                                                                                                      Mike Hewson wrote:
                                                                                                      Oh. With a bit of luck it'll turn out to be something cheaper and simpler like the power supply not supplying a trickle current to the board ( so that it knows when the power switch has been toggled via the mobo input pins ) for full switch on. Swap in a known good PSU and see what happens ...

                                                                                                      I disconnected the suspect supply and did rudimentary testing. With a 575 ohm resistor across Vsb to meet the minimum current requirement, I saw 5.1V on Vsb. When I shorted /PS_ON to ground, I saw voltages on all main supplies too close to correct to explain this stone cold dead behavior, even though I was not providing any load to them. Then I attached a different supply to the motherboard 8 pin and 24 pin ATX connectors, and got the same behavior, save only that the power consumed was 10W instead of 20W.

                                                                                                      I failed to mention this before, but when healthy, the system draw when "off" (meaning standby) was something like 2W. I suspect something that failed on the motherboard is putting a heavy load on Vsb. If the motherboard itself is not at fault, I think the most likely thing is that I fried the CPU while mishandling the RAM, in a way that happens to present an intolerable load to the motherboard or supply. So once I have the HSF and CPU off, I'll probably do a last trial to see if the mobo shows a little life (debug LED at least) in that state.

                                                                                                      ____________

                                                                                                      Profile Mike Hewson
                                                                                                      Forum moderator
                                                                                                      Avatar
                                                                                                      Send message
                                                                                                      Joined: Dec 1 05
                                                                                                      Posts: 3592
                                                                                                      Credit: 28,563,336
                                                                                                      RAC: 12,021
                                                                                                      Message 109053 - Posted 29 Dec 2010 22:15:32 UTC - in response to Message 109052.

                                                                                                        archae86 wrote:
                                                                                                        I disconnected the suspect supply and did rudimentary testing. With a 575 ohm resistor across Vsb to meet the minimum current requirement, I saw 5.1V on Vsb. When I shorted /PS_ON to ground, I saw voltages on all main supplies too close to correct to explain this stone cold dead behavior, even though I was not providing any load to them. Then I attached a different supply to the motherboard 8 pin and 24 pin ATX connectors, and got the same behavior, save only that the power consumed was 10W instead of 20W.

                                                                                                        I failed to mention this before, but when healthy, the system draw when "off" (meaning standby) was something like 2W. I suspect something that failed on the motherboard is putting a heavy load on Vsb. If the motherboard itself is not at fault, I think the most likely thing is that I fried the CPU while mishandling the RAM, in a way that happens to present an intolerable load to the motherboard or supply. So once I have the HSF and CPU off, I'll probably do a last trial to see if the mobo shows a little life (debug LED at least) in that state.

                                                                                                        Darn. Let's hope it's the mobo.

                                                                                                        [aside]
                                                                                                        I lost a CPU once because of a cheap case. When screwing in the case cover screws, the soft metal in the edges of the screw holes made little shavings. One fell across the CPU pins at one edge ( pre CPU heat-sink days, and pre air-in-a-can days ) causing a dead short on boot up and dead CPU. What are the odds on that? Some days it can be like 'for want of a nail, a horseshoe was lost ....' :-)
                                                                                                        [/aside]

                                                                                                        Cheers, Mike.
                                                                                                        ____________
                                                                                                        "I have made this letter longer than usual, because I lack the time to make it short." - Blaise Pascal

                                                                                                        archae86
                                                                                                        Send message
                                                                                                        Joined: Dec 6 05
                                                                                                        Posts: 1065
                                                                                                        Credit: 112,197,382
                                                                                                        RAC: 98,840
                                                                                                        Message 109078 - Posted 30 Dec 2010 23:51:31 UTC

                                                                                                          I am surprised, happy, and very puzzled. This morning I pulled the motherboard out of the case and removed HSF and CPU. I figured if the stone-cold dead symptom with excess power consumption continued, I had a bad motherboard, and if not, I had a fried CPU. It was a well-behaved 2W peak descending to 1W (not the steady 10W I had seen on this supply) and the smart button LEDs lit up, so my diagnosis was CPU, but before spending almost $400 US for a replacement I put the "bad" one back in the socket. All was still well !?!?

                                                                                                          So I very, very slowly reconnected things, taking the time to connect power after making almost every connection (including each cable to the case). As I added things, the peak initial power consumption fitfully rose, eventually to 5W, and at some point the smart button LEDs began just to make a momentary flash instead of staying on, but aside from these behavior changes all went well. I now have all internals re-connected, the case buttoned up, and the host is on the internet and processing Einstein and a little SETI at the previous 3.4 GHz. I've not plugged in any USB devices, but it seems fully functional.

                                                                                                          I'm very puzzled as to what was wrong, and how it got fixed. My two lead candidates:

                                                                                                          1. I fat-fingered some connection to an unacceptable state, and that only when I finally pulled off all the case cables could things return to right.

                                                                                                          2. My stupid error of doing the RAM change with power on the box put some piece of internal state into an unacceptable condition, which was not remedied by repeated reboots, a CMOS clear, or genuine power disconnections, but decayed away during a night disconnected from power.

                                                                                                          Sorry to divert this thread from performance content. As I have a new set of three RAM sticks on order for delivery next week, I think I shall do a few trials and document the execution impact of going from 1 to 2 to 3 channels of RAM. Or maybe I'll see the wisdom of leaving well enough alone.
                                                                                                          ____________

                                                                                                          Profile Mike Hewson
                                                                                                          Forum moderator
                                                                                                          Avatar
                                                                                                          Send message
                                                                                                          Joined: Dec 1 05
                                                                                                          Posts: 3592
                                                                                                          Credit: 28,563,336
                                                                                                          RAC: 12,021
                                                                                                          Message 109085 - Posted 31 Dec 2010 8:26:19 UTC

                                                                                                            Last modified: 31 Dec 2010 9:11:39 UTC

                                                                                                            Well, this is all good. Luck is a fortune! Some element has held onto charge and biased something inappropriately, now discharged.

                                                                                                            Now what do the latest Westmere figures say to us?

                                                                                                            (A) Comparing the first two rows and dividing :

                                                                                                            0.5692/0.7622 ~ 0.7468

                                                                                                            This means a WU that was alone on a physical core loses ~25% of its speed if it has to share with another WU. But we are still ahead in total WU throughput per physical core - a more or less expected level of benefit from hyperthreading. Fair enough, but not quite the right description, as when a WU was the only one on a physical core there might still have been some other non-WU task coming or going to share it, compared to the case where you knew you were definitely always sharing a core with another WU. But having said that, let's assume this may happen whenever the other virtual slot on a physical core isn't servicing a WU task, and say no more about it ( expecting a random spread of non-WU tasks across whatever free virtual cores are about ) - even on a system where non-WU tasks have been well weeded.
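The per-core throughput arithmetic in (A) can be sketched directly from the quoted relative-speed figures: each WU sharing a core runs slower, but two of them together still beat one WU running alone.

```python
# Hyperthreading trade-off from the quoted relative-speed figures:
# a WU sharing a physical core runs at ~75% of its solo speed, but
# two WUs per core raise the core's total throughput.
solo_speed = 0.7622    # relative speed, one WU alone on a physical core
shared_speed = 0.5692  # relative speed, two WUs sharing a physical core

per_wu_penalty = shared_speed / solo_speed            # ~0.747
core_throughput_gain = 2 * shared_speed / solo_speed  # ~1.49x

print(f"per-WU speed when sharing: {per_wu_penalty:.3f} of solo")
print(f"per-core throughput vs solo: {core_throughput_gain:.2f}x")
```

So the ~25% per-WU penalty is the price of a roughly 1.5x gain in per-core throughput, which is the expected order of benefit from hyperthreading on this workload.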

                                                                                                            (B) Now divide row three by row two :

                                                                                                            0.4975/0.5692 ~ 0.874

                                                                                                            Thus we see that each of four WU's sharing two physical cores loses ~12% of speed if we allow them to shift chairs rather than telling them to sit still in allocated seats. If you have four people on four chairs then swaps are probably done in pairs? Do we swap tasks on the same physical core in pairs ( if such a statement has meaning! ), or do we swap tasks across physical cores? However that's done, the average over all such unseen mechanisms is 12%. Hence I agree, and thus have simply restated your earlier comment on the benefit of hard assignment.

                                                                                                            (C) Move to the fourth line. Now we have two WU's sharing a physical core, with two other WU's definitely on different physical cores. What ought we expect? There's a mild surprise. To achieve this case we take the situation indicated by the first line ( a physical core to each of 4 WU's ) and force two WU's to share a physical core ( leaving a remaining physical core quite un-occupied by WU's at all ). Now the time per physical core isn't much different, in fact slightly faster ( 14373 vs 14606 )! But let's not get too excited, and just say it's the same. An obvious idea is that each WU is bound/limited by something other than the presence of another WU on the same physical core. I like this answer because it is quite consistent with the CPU being the fastest chip in the machine, having to wait on occasions for slower devices ( generally orders of magnitude slower, with possible contention for that device plus gate/buffer delays and longer distances too ... ). So, say, if a WU has to wait for a hard disk then it really makes no odds whether it waits alone or in company. Indeed on the face of it there's at least a one in two chance that such system tasks get assigned to that totally WU-free physical core. And unless I'm mistaken all the WU's are contending for the same disk! Yup, I like WU contention for the same disk as the rate limiting step ... because that will definitely be independent of any WU's core context.

                                                                                                            (D) Now I'm not entirely sure what configuration line five describes, '2+1+1 configuration--these two tasks confined to one core', and thus how that differs from (C) above ( even assuming I captured that scenario correctly ). Please advise. My punt is that the two WU's having a physical core each to themselves ( the '1+1' ) means they can roam - having three physical cores to choose from - which also implies the WU-free physical core is effectively being shifted ( much like a semiconductor lattice hole ). If that is indeed the correct view of this case then we repeat the lesson of hard assignment; specifically the penalty is :

                                                                                                            0.2723/0.3873 ~ 0.7031

                                                                                                            or about 30% for allowing them to stuff about amongst the available chairs. It's even worse than that [ compare (B) above ] : you lose more time if there are more choices of chairs!!
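Recomputing both seat-shuffling penalties from the quoted figures gives a compact side-by-side check (the 0.3873 and 0.2723 pair follows Mike's reading of line five):

```python
# Re-check of the seat-shuffling penalties from the quoted
# relative-throughput figures (hard-assigned vs free-roaming WUs).
hard_assigned = 0.5692   # four WUs pinned across two physical cores
free_roaming = 0.4975    # the same four WUs allowed to migrate

penalty_two_cores = free_roaming / hard_assigned  # ~0.874, case (B)

# Case (D), per Mike's reading: two WUs pinned to one core (0.3873)
# vs the same pair free to roam over three physical cores (0.2723).
penalty_three_cores = 0.2723 / 0.3873             # ~0.703

print(f"two-core shuffle penalty: {penalty_two_cores:.3f}")
print(f"three-core shuffle penalty: {penalty_three_cores:.3f}")
```

The ratios bear out the point made above: the roaming penalty grows with the number of cores the scheduler may shuffle the tasks across.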

                                                                                                            None of which connects especially with my earlier 4 over 4 analysis, as that was referring to likelihood of overall random assignment of tasks to cores by the OS, which we have avoided here by design.

                                                                                                            As usual please point out if I've missed some aspect. And a safe New Year to one and all. :-) :-)

                                                                                                            Cheers, Mike.
                                                                                                            ____________
                                                                                                            "I have made this letter longer than usual, because I lack the time to make it short." - Blaise Pascal

                                                                                                            Profile Mike Hewson
                                                                                                            Forum moderator
                                                                                                            Avatar
                                                                                                            Send message
                                                                                                            Joined: Dec 1 05
                                                                                                            Posts: 3592
                                                                                                            Credit: 28,563,336
                                                                                                            RAC: 12,021
                                                                                                            Message 109090 - Posted 31 Dec 2010 10:42:11 UTC

                                                                                                              Last modified: 31 Dec 2010 11:25:35 UTC

                                                                                                              I'll try a guess at longer term performance with the Westmere, taking account of the settings/scenario :

                                                                                                              - 4 WU's and 4 physical cores.

                                                                                                              - Windows randomly assigning virtual CPU cores to WU's ( NB which will have equal priority )

                                                                                                              - averaged over a suitable long period or number of WU's. ( say ~ 50 WU's )

                                                                                                              - using times as per Pete's latest Westmere table.

                                                                                                              Good is allocated at 22.86% of instances at 14606 seconds

                                                                                                              Mediocre is allocated at 68.57% of instances at 20444 seconds

                                                                                                              Worst is allocated at 8.57% of instances at 22380 seconds

                                                                                                              That is, a weighted mean :

                                                                                                              = 0.2286 * 14606 + 0.6857 * 20444 + 0.0857 * 22380

                                                                                                              ~ 19275

                                                                                                              meaning the number of seconds per work unit ( elapsed time ) given those assumptions.
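The weighted mean above can be reproduced in a few lines, using the scenario probabilities from Mike's earlier random-assignment analysis and the times from the latest Westmere table:

```python
# Expected elapsed seconds per WU under random virtual-CPU assignment:
# probability of each scenario times its measured elapsed time.
scenarios = {
    "good":     (0.2286, 14606),  # each WU alone on a physical core
    "mediocre": (0.6857, 20444),  # mixed sharing
    "worst":    (0.0857, 22380),  # maximal sharing
}

expected = sum(p * t for p, t in scenarios.values())
print(f"expected seconds per WU: {expected:.0f}")  # ~19275
```

Note the probabilities sum to 1 (within rounding), so the result is a true weighted mean of the three measured times.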

                                                                                                              Cheers, Mike.
                                                                                                              ____________
                                                                                                              "I have made this letter longer than usual, because I lack the time to make it short." - Blaise Pascal

                                                                                                              mikey
                                                                                                              Avatar
                                                                                                              Send message
                                                                                                              Joined: Jan 22 05
                                                                                                              Posts: 1493
                                                                                                              Credit: 56,390,938
                                                                                                              RAC: 217,044
                                                                                                              Message 109091 - Posted 31 Dec 2010 11:13:45 UTC - in response to Message 109078.

                                                                                                                I am surprised, happy, and very puzzled. This morning I pulled the motherboard out of the case and removed HSF and CPU. I figured if the stone-cold dead symptom with excess power consumption continued, I had a bad motherboard, and if not, I had a fried CPU. It was a well-behaved 2W peak descending to 1W (not the steady 10W I had seen on this supply) and the smart button LEDs lit up, so my diagnosis was CPU, but before spending almost $400 US for a replacement I put the "bad" one back in the socket. All was still well !?!?

                                                                                                                So I very, very slowly reconnected things, taking the time to connect power after making almost every connection (including each cable to the case). As I added things, the peak initial power consumption fitfully rose, eventually to 5W, and at some point the smart button LEDs began to give just a momentary flash instead of staying on, but aside from these behavior changes all went well. I now have all internals re-connected, the case buttoned up, and the host is on the internet and processing Einstein and a little SETI at the previous 3.4 GHz. I've not plugged in any USB devices, but it seems fully functional.

                                                                                                                I'm very puzzled as to what was wrong, and how it got fixed. My two lead candidates:

                                                                                                                1. I fat-fingered some connection into an unacceptable state, and only when I finally pulled off all the case cables could things return to normal.

                                                                                                                2. My stupid error of doing the RAM change with power on the box put some piece of internal state into an unacceptable condition, which was not remedied by repeated reboots, a CMOS clear, or genuine power disconnections, but decayed away during a night spent disconnected from power.

                                                                                                                Sorry to divert this thread from performance content. As I have a new set of three RAM sticks on order for delivery next week, I think I shall do a few trials and document the execution impact of going from 1 to 2 to 3 channels of RAM. Or maybe I'll see the wisdom of leaving well enough alone.


                                                                                                                I have seen this before in PCs. I have always thought it was some capacitor holding its charge, and only after sitting, thus losing its charge, do things go back to normal. After telling people to try rebooting, this is one of my secret fixes when providing over-the-phone PC tech help. I always tell them to wait about 15 to 30 minutes before restarting the PC, and it really does seem to work sometimes. It has saved me many a trip, only to find things 'just working' when I do make the trip to their homes. I always tell them 'the pc is scared and knows I am there and will fix it' so it just works before I have to whip it into shape; we all laugh and I walk away wondering!

                                                                                                                ps I have enjoyed this thread and your testing of the different ways to crunch and which is best, please don't stop!

                                                                                                                Profile Mike Hewson
                                                                                                                Forum moderator
                                                                                                                Avatar
                                                                                                                Send message
                                                                                                                Joined: Dec 1 05
                                                                                                                Posts: 3592
                                                                                                                Credit: 28,563,336
                                                                                                                RAC: 12,021
                                                                                                                Message 109093 - Posted 31 Dec 2010 11:31:49 UTC - in response to Message 109091.

                                                                                                                  Last modified: 31 Dec 2010 11:46:52 UTC

                                                                                                                  It has saved me many a trip only to find things 'just working' when I do make the trip to their homes. I always tell them 'the pc is scared and knows I am there and will fix it' so it just works before I have to whip it into shape, we all laugh and I walk away wondering!

                                                                                                                  Cherish this effect! In medicine we say that 'a good doctor times his treatment to coincide with recovery!' :-) :-)

                                                                                                                  My other favorite is 'neither kill nor cure, if you seek repeat business!' ;0 :-)

                                                                                                                  Cheers, Mike.
                                                                                                                  ____________
                                                                                                                  "I have made this letter longer than usual, because I lack the time to make it short." - Blaise Pascal

                                                                                                                  ML1
                                                                                                                  Send message
                                                                                                                  Joined: Feb 20 05
                                                                                                                  Posts: 320
                                                                                                                  Credit: 18,426,418
                                                                                                                  RAC: 11,363
                                                                                                                  Message 109108 - Posted 1 Jan 2011 17:09:58 UTC - in response to Message 109093.

                                                                                                                    Last modified: 1 Jan 2011 17:10:17 UTC

                                                                                                                    Cherish this effect! In medicine we say that 'a good doctor times his treatment to coincide with recovery!' :-) :-)

                                                                                                                    A-ha... Isn't that subverting the natural immune response to give a Pavlovian reinforcement to have you called out to merely administer a placebo?...

                                                                                                                    My other favorite is 'neither kill nor cure, if you seek repeat business!' ;0 :-)

                                                                                                                    Ouch! That also sounds like certain dubious business practices foisted on IT/computers to maintain a never-ending upgrade cycle...


                                                                                                                    How to distinguish the good from the game?

                                                                                                                    Cheers,
                                                                                                                    Martin
                                                                                                                    ____________
                                                                                                                    Powered by Mandriva Linux A user friendly OS!
                                                                                                                    See the Boinc HELP Wiki

                                                                                                                    ML1
                                                                                                                    Send message
                                                                                                                    Joined: Feb 20 05
                                                                                                                    Posts: 320
                                                                                                                    Credit: 18,426,418
                                                                                                                    RAC: 11,363
                                                                                                                    Message 109109 - Posted 1 Jan 2011 17:18:07 UTC - in response to Message 109085.

                                                                                                                      Interesting analysis.

                                                                                                                      I'm surprised at the 12% penalty for having WUs roam around the cores... Is Windows scheduling really that bad? That 12% adds up to an awful lot of poisoned cache. Or is it more a case of the low priority tasks for the roaming WUs being interrupted more frequently by other tasks even when other cores are idle? (Again, a quirk of poor scheduling?)

                                                                                                                      The 'other rate limiting feature' outside of the CPU may well be system RAM bandwidth limits being more significant when the CPU cache cannot be used as effectively as for the best cases.


                                                                                                                      Some good sleuthing there.

                                                                                                                      What would be interesting for comparison is to do an identical set of tests but using the latest Linux and then Apple Mac on the same hardware (all 64-bit).

                                                                                                                      Happy crunchin',
                                                                                                                      Martin

                                                                                                                      ____________
                                                                                                                      Powered by Mandriva Linux A user friendly OS!
                                                                                                                      See the Boinc HELP Wiki

                                                                                                                      archae86
                                                                                                                      Send message
                                                                                                                      Joined: Dec 6 05
                                                                                                                      Posts: 1065
                                                                                                                      Credit: 112,197,382
                                                                                                                      RAC: 98,840
                                                                                                                      Message 109111 - Posted 1 Jan 2011 18:14:22 UTC - in response to Message 109109.

                                                                                                                        I'm surprised at the 12% penalty for having WUs roam around the cores... Is Windows scheduling really that bad? That 12% adds up to an awful lot of poisoned cache. Or is it more a case of the low priority tasks for the roaming WUs being interrupted more frequently by other tasks even when other cores are idle? (Again, a quirk of poor scheduling?)

                                                                                                                        No, No, and No.

                                                                                                                        The issue is not thrashing of any kind, but rather that for significant periods of time a task is not active. I thought I made this point clear in my notes, but obviously I failed, as both you and Mike seem to have a different notion.

                                                                                                                        The situation is quite artificial, in that tasks are given affinity constraints restricting them to pools comprising a subset of all CPUs. So this behavior has no obvious relevance to typical working system behaviors.

                                                                                                                        No such effect is seen where no affinity constraint is supplied and sufficient Einstein work is allowed to execute to populate all cores (i.e. 8 active Einstein tasks on my system). I hope that puts to rest the mistaken references to poisoned cache, excess context switches, RAM bandwidth, and so on. Clearly the 8/8 task situation has a worse inherent constraint from each of these than does the case where 4 tasks are constrained to 4 CPUs on two physical cores.
                                                                                                                        ____________

                                                                                                                        ML1
                                                                                                                        Send message
                                                                                                                        Joined: Feb 20 05
                                                                                                                        Posts: 320
                                                                                                                        Credit: 18,426,418
                                                                                                                        RAC: 11,363
                                                                                                                        Message 109113 - Posted 1 Jan 2011 20:03:11 UTC - in response to Message 109111.

                                                                                                                          Last modified: 1 Jan 2011 20:05:49 UTC

                                                                                                                          I'm surprised at the 12% penalty for having WUs roam around the cores...

                                                                                                                          ... Or is it more a case of the low priority tasks for the roaming WUs being interrupted more frequently by other tasks even when other cores are idle?...

                                                                                                                          No, No, and No.

                                                                                                                          The issue is not thrashing of any kind, but rather that for significant periods of time a task is not active. ...

                                                                                                                          In what case does the task become inactive?


                                                                                                                          The situation is quite artificial, in that affinity constraints to pools of a subset of all CPUs are placed on a task. So this behavior has no obvious relevance to typical working system behaviors.

                                                                                                                          Are you suggesting that the affinity restrictions will push multiple tasks onto just one CPU?...


                                                                                                                          No such effect is seen where no affinity constraint is supplied, and sufficient Einstein work is allowed to execute to populate all cores (i.e. 8 active Einstein tasks on my system).

                                                                                                                          Which is what we expect to be the optimum usage and that does indeed appear to be the case from the numbers.

                                                                                                                          The interesting bits are the numbers from the artificial cases to try to work out what the effects are, and their significance.


                                                                                                                          I hope that would put to rest the mistaken references to poisoned cache, excess context switches, RAM bandwidth, and so on. Clearly the 8/8 task situation has worse inherent constraint from each of these than does the case where 4 tasks are constrained 4 CPUs on two physical cores.

                                                                                                                          There are examples of some systems where due to RAM bandwidth constraints and CPU cache usage, you may well get higher throughput by running tasks on only 6 or 7 out of 8 virtual cores... This came up in previous s@h or e@h threads.

                                                                                                                          It comes back to an old argument that certain mixes of Boinc tasks can be beneficial for maximum throughput, and some combinations can be detrimental... It's all a question of what system bottlenecks get hit. We usually tune the system to keep the most expensive resource (the CPU) fully busy.

                                                                                                                          (However, on my recent systems, the CPU, GPU, and RAM have all been about equally priced...)


                                                                                                                          Happy fast crunchin',
                                                                                                                          Martin
                                                                                                                          ____________
                                                                                                                          Powered by Mandriva Linux A user friendly OS!
                                                                                                                          See the Boinc HELP Wiki

                                                                                                                          Profile Mike Hewson
                                                                                                                          Forum moderator
                                                                                                                          Avatar
                                                                                                                          Send message
                                                                                                                          Joined: Dec 1 05
                                                                                                                          Posts: 3592
                                                                                                                          Credit: 28,563,336
                                                                                                                          RAC: 12,021
                                                                                                                          Message 109114 - Posted 1 Jan 2011 22:15:17 UTC - in response to Message 109111.

                                                                                                                            Last modified: 1 Jan 2011 22:31:40 UTC

                                                                                                                            The issue is not thrashing of any kind, but rather that for significant periods of time a task is not active. I thought I made this point clear in my notes, but obviously I failed, as both you and Mike seem to have a different notion.

                                                                                                                            Oh, I see now. Quite right. My bad, and even after you went to the trouble of color highlighting! :O]

                                                                                                                            I was thinking about HT too much. Potential execution time 'lost' by a task could be when it is not allocated a slice at all, even if it seems it reasonably could have been. My 'unseen mechanisms' are imaginary. This is an OS issue, so the scheduling algorithm is the proper focus. Ooooh.

                                                                                                                            [ See my earlier conclusions with (B) vs (D) - 'you lose more time if there are more choices of chairs' when not hard assigning ]

                                                                                                                            Since your machine is rigged to be on the rather light side of task load ( compared to 'typical' use ) - what about the number you see on the 'Processes' tab of Task Manager?

                                                                                                                            Cheers, Mike.
                                                                                                                            ____________
                                                                                                                            "I have made this letter longer than usual, because I lack the time to make it short." - Blaise Pascal

                                                                                                                            archae86
                                                                                                                            Send message
                                                                                                                            Joined: Dec 6 05
                                                                                                                            Posts: 1065
                                                                                                                            Credit: 112,197,382
                                                                                                                            RAC: 98,840
                                                                                                                            Message 109118 - Posted 2 Jan 2011 0:18:21 UTC - in response to Message 109114.

                                                                                                                              Since your machine is rigged to be on the rather light side of task load ( compared to 'typical' use ) - what's about the number you see on the 'Processes' tab of Task Manager?

                                                                                                                              I construe your inquiry to apply to the conditions for my tests, for which I asserted that I shut down some things but not all (I think I mentioned leaving my antivirus running, for example).

                                                                                                                              Just now I exited BOINC and shut down the same things I was shutting down for the tests, then looked at Task Manager's Process count and saw 41. Then I started boincmgr, waited a little while for it to spawn things, and saw 61 as the TM Process count. That would be boincmgr, boinc, 8 Einstein executables, plus 8 instances of Console Window Host and two more I failed to spot.

                                                                                                                              The non-BOINC stuff showing in general has very low CPU use and pretty low context switch delta counts, but it is not nothing.

                                                                                                                              ____________

                                                                                                                              Robert
                                                                                                                              Send message
                                                                                                                              Joined: Nov 5 05
                                                                                                                              Posts: 34
                                                                                                                              Credit: 205,806,050
                                                                                                                              RAC: 160,846
                                                                                                                              Message 112461 - Posted 3 Jun 2011 0:37:27 UTC

                                                                                                                                I've been collecting data on the new Intel i7-2600K for quite a while and thought these particular S5 measurements fit well here in this thread on hyper-threading. Yes, I realize that the S5 run just finished, but these results are applicable to the S6 run also. In fact, I was finishing verifying a few data points just as I ran out of my final S5 work units. Only S5 gravity wave jobs were run during this test collection, no BRP jobs.

                                                                                                                                A couple of notes on hardware details: I used an i7-2600K processor clocked at 5 different frequencies, paired with 2 x 4GB DDR3-1600 memory modules. Hyper-threading was enabled at all times; SpeedStep and Turbo modes were disabled to ensure a consistent frequency for each core. Ubuntu was the operating system. For each point along the curve, I collected data and averaged the time required to complete a single S5 work unit. Any end cases with vastly different times were thrown out. Daily RAC for each point on the curve [0..8] was estimated by the formula below for S5 jobs.

                                                                                                                                N = Number of simultaneous Threads = [0..8]
                                                                                                                                RAC = 251 credits * N * (seconds in day / average single work unit time for N)
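                                                                                                                                As a sketch, Robert's estimate can be written as a small helper. The 251 credits per work unit and the 86400 seconds in a day come straight from the formula above; the task count and timing in the usage line are purely illustrative:

```python
SECONDS_PER_DAY = 86400
CREDITS_PER_WU = 251  # credit awarded per S5 work unit, as quoted above

def estimated_rac(n_threads, avg_wu_seconds):
    """Estimated daily RAC with n_threads simultaneous S5 tasks,
    each taking avg_wu_seconds (elapsed time) on average."""
    return CREDITS_PER_WU * n_threads * (SECONDS_PER_DAY / avg_wu_seconds)

# Illustrative numbers only: 8 threads at a hypothetical 20000 s per task.
print(round(estimated_rac(8, 20000)))  # -> 8675
```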



                                                                                                                                Interesting observations; you can see the clear transition point where hyper-threading kicks in at 5 threads. And work has a direct relationship to clock speed.
                                                                                                                                ____________

                                                                                                                                Profile Mike Hewson
                                                                                                                                Forum moderator
                                                                                                                                Avatar
                                                                                                                                Send message
                                                                                                                                Joined: Dec 1 05
                                                                                                                                Posts: 3592
                                                                                                                                Credit: 28,563,336
                                                                                                                                RAC: 12,021
                                                                                                                                Message 112462 - Posted 3 Jun 2011 1:30:14 UTC - in response to Message 112461.

                                                                                                                                  Last modified: 3 Jun 2011 2:02:38 UTC

                                                                                                                                  I've been collecting data .... work has a direct relationship to clock speed.

                                                                                                                                  What a brilliant set of observations! Thank you very kindly for collecting, analysing and presenting that here. :-) :-)

                                                                                                                                  Yes, the trends are clear. Let it be our benchmark for HT thinking.

By eye I can see the relation to clock speed ( all else held the same ) could be modelled as linear to a close fit.

                                                                                                                                  The 'knee' at 4 HT cores is vivid. Indeed the RAC benefit per extra/added core ( the slope of the curves, ~ 2000 initially ) halves thereafter to ~ 1000. Which is near as stuff all to 2:1 ..... so there's the swapping overhead arising.
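The slope comparison is easy to check numerically. A small sketch using made-up RAC points shaped like the plotted curves ( stand-ins, not the poster's actual data ):

```python
def slope(points):
    """Least-squares slope of (n_threads, rac) pairs."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    num = sum((x - mx) * (y - my) for x, y in points)
    den = sum((x - mx) ** 2 for x, _ in points)
    return num / den

# Hypothetical readings at one clock speed, linear on each side of the knee:
before_knee = [(1, 2000), (2, 4000), (3, 6000), (4, 8000)]
after_knee = [(4, 8000), (5, 8750), (6, 9500), (7, 10250), (8, 11000)]
print(slope(before_knee), slope(after_knee))  # 2000.0 750.0
```

Pinning the slopes down this way avoids the by-eye uncertainty in the pre/post-knee ratio.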

                                                                                                                                  Again, thanks for the work on that! :-)

                                                                                                                                  Cheers, Mike.
                                                                                                                                  ____________
                                                                                                                                  "I have made this letter longer than usual, because I lack the time to make it short." - Blaise Pascal

                                                                                                                                  ML1
                                                                                                                                  Joined: Feb 20 05
                                                                                                                                  Posts: 320
                                                                                                                                  Credit: 18,426,418
                                                                                                                                  RAC: 11,363
                                                                                                                                  Message 112465 - Posted 3 Jun 2011 12:45:58 UTC - in response to Message 112462.

                                                                                                                                    Last modified: 3 Jun 2011 12:47:04 UTC

                                                                                                                                    I've been collecting data ....

                                                                                                                                    ... Yes, the trends are clear. Let it be our benchmark for HT thinking.

                                                                                                                                    Indeed, very nice clear results. Thanks for sharing.


By eye I can see the relation to clock speed ( all else held the same ) could be modelled as linear to a close fit.

That suggests a nicely balanced system, or one where memory bandwidth comfortably exceeds what the CPU needs for these tasks. That is: CPU processing is the limiting factor, and the memory is fast enough to keep the CPU at 100% utilisation through its critical paths.


                                                                                                                                    The 'knee' at 4 HT cores is vivid. Indeed the RAC benefit per extra/added core ( the slope of the curves, ~ 2000 initially ) halves thereafter to ~ 1000. Which is near as stuff all to 2:1 ..... so there's the swapping overhead arising. ...

I don't interpret that as swapping unless you really mean 'process thread interleaving'. Intel's "Hyper-threading" uses additional state registers/logic to allow two process threads to share the same single pool of processing units on a (physical) CPU core.

                                                                                                                                    You are certainly getting a useful increase in throughput with the HT.


                                                                                                                                    Happy fast crunchin',
                                                                                                                                    Martin
                                                                                                                                    ____________
Powered by Mandriva Linux - A user friendly OS!
                                                                                                                                    See the Boinc HELP Wiki

Mike Hewson
Forum moderator
                                                                                                                                    Message 112486 - Posted 3 Jun 2011 23:44:22 UTC - in response to Message 112465.

                                                                                                                                      ..... unless you really mean 'process thread interleaving' ....

                                                                                                                                      No I don't especially. Unless/until we have any information as to how ( or indeed if ) his Linux machine's process scheduler handles affinity, I'll leave it as a generic 'swap' concept. See earlier discussions in this thread.

                                                                                                                                      Cheers, Mike.

                                                                                                                                      ____________

                                                                                                                                      ExtraTerrestrial Apes
                                                                                                                                      Joined: Nov 10 04
                                                                                                                                      Posts: 464
                                                                                                                                      Credit: 32,328,849
                                                                                                                                      RAC: 32,896
                                                                                                                                      Message 112512 - Posted 5 Jun 2011 15:39:34 UTC - in response to Message 112462.

                                                                                                                                        so there's the swapping overhead arising.


                                                                                                                                        You probably already know everything I'm going to say now, but this wording leads to misunderstanding, even if you meant the right thing.

If HT is used, two tasks run on one core at the same time. That is not only at the same time for the observing user (which running two threads at a 50% time share each would also look like), but at the same time for the CPU, clock for clock if you will. This is totally independent of OS scheduling and everything people normally associate with "swapping".

Speed per task does drop with HT because, although both threads have individual registers and such (the "core components" of the core), they have to share caches and, most importantly, execution units. That's the whole point of HT: making better use of the execution units for relatively little extra die space.

What you seem to be talking about is OS scheduling, where the scheduler reassigns tasks to specific cores at typically ~1 ms intervals (Windows). Which is an eternity compared to the CPU clock ;)
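That "eternity" is easy to quantify; a one-line calculation using the 3.42 GHz overclock mentioned at the start of the thread:

```python
clock_hz = 3.42e9   # the thread-starter's E5620, overclocked to 3.42 GHz
quantum_s = 1e-3    # typical Windows rescheduling interval, per the post
cycles_per_quantum = clock_hz * quantum_s
print(cycles_per_quantum)  # several million clock cycles between reschedules
```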

                                                                                                                                        MrS
                                                                                                                                        ____________
                                                                                                                                        Scanning for our furry friends since Jan 2002

Mike Hewson
Forum moderator
                                                                                                                                        Message 112515 - Posted 5 Jun 2011 22:40:13 UTC - in response to Message 112512.

                                                                                                                                          Last modified: 6 Jun 2011 2:54:59 UTC

                                                                                                                                          so there's the swapping overhead arising.


                                                                                                                                          You probably already know everything I'm going to say now, but this wording leads to misunderstanding, even if you meant the right thing.

If HT is used, two tasks run on one core at the same time. That is not only at the same time for the observing user (which running two threads at a 50% time share each would also look like), but at the same time for the CPU, clock for clock if you will. This is totally independent of OS scheduling and everything people normally associate with "swapping".

Speed per task does drop with HT because, although both threads have individual registers and such (the "core components" of the core), they have to share caches and, most importantly, execution units. That's the whole point of HT: making better use of the execution units for relatively little extra die space.

What you seem to be talking about is OS scheduling, where the scheduler reassigns tasks to specific cores at typically ~1 ms intervals (Windows). Which is an eternity compared to the CPU clock ;)

                                                                                                                                          MrS

Absolutely correct, but moot. Since we don't know what his machine's actual scheduling behaviour is, it's an assumption that it's the same across the graph, ie. is affinity maintained? Recall that E@H workunits are given low priority by default and hence will be readily displaced by system calls etc, especially/inevitably if all physical cores are busy. So a substantial part of the right side of the graph ( 4 and more virtual cores busy ) will involve actual OS task swaps in addition to HT behaviours. Again, see earlier discussion ...

                                                                                                                                          Cheers, Mike.

                                                                                                                                          ( edit ) Sorry, the other thing I've not mentioned here is the probable 'pure' HT overhead. I've seen various estimates but nowhere near enough penalty to give that 2:1 ratio ( change at the knee ) in the benefit per added virtual core. I think the HT throughput ( opinions differ ) for 2 units on a core was given at about 1.7 at worst? Mind you my 2:1 estimate was by eye .....

                                                                                                                                          ( edit ) Printing out the graphic and drawing lines I find that : the ratio of the slope of the lines for less than 4 jobs to the slope of the lines for more than 4 jobs ( ie. before and after the knee, and for each given processor speed ) are all about 2.7:1 plus/minus ~ 0.05. So I guess the question is what are the best and worst estimates for HT throughput - the 'pure' HT part - with the remainder being OS task swaps ?
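That decomposition can be sanity-checked with simple arithmetic. The model below assumes each job up to 4 gets a whole physical core ( +1.0 units of throughput per added job ), while each job beyond 4 adds only the second hyper-thread's share; the 1.7 figure is the commonly quoted HT throughput, not a measurement from this thread:

```python
def expected_ratio(ht_throughput):
    """Pre-knee : post-knee RAC slope ratio implied by a given
    two-threads-per-core throughput factor."""
    return 1.0 / (ht_throughput - 1.0)

def implied_ht(ratio):
    """Invert: the HT throughput factor a measured slope ratio implies."""
    return 1.0 + 1.0 / ratio

print(expected_ratio(1.7))  # ~1.43, well short of the measured 2.7
print(implied_ht(2.7))      # ~1.37 effective throughput per doubled core
```

So a pure-HT factor of 1.7 would predict a slope ratio of only ~1.43:1; the observed ~2.7:1 implies an effective factor near 1.37, leaving a gap to be explained by something else, such as OS task swaps.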

                                                                                                                                          ( edit ) Moreover if one does draw a line from the knee to the 8 job point, you'll find the graph dips slightly below it at around 6 jobs and then returns .... a short mild upwards concavity. This is repeated for all curves. So there's something ( my guess is non-HT related ie. scheduling behaviour ) kicking in there. I'll post a modified graphic explaining what I mean by all this when I get a chance ..... :-)

                                                                                                                                          ( edit ) I hope this is sufficient :


                                                                                                                                          NB : I did the calculations for the other intermediate clock speeds and got very close to the above ratio pre/post knee ~ 2.69

                                                                                                                                          Plus I've assumed that since the measure is daily RAC then wall clock time ( as opposed to CPU time ) - "averaged the time required to complete a single S5 work unit" - is the relevant scale.
                                                                                                                                          ____________

Mike Hewson
Forum moderator
                                                                                                                                          Message 112519 - Posted 6 Jun 2011 4:29:02 UTC

                                                                                                                                            Last modified: 6 Jun 2011 4:35:05 UTC

Further thoughts ( based on prior statements/assumptions ) : why the dip below linear at around 6 jobs? The slope of the curve is the rate of change of RAC with virtual core number, so there is a slight "penalty" for going from 4 to 6 which is "recovered" by going from 6 to 8 ( within the expected overall pattern of somewhat less than 2:1 throughput from HT above 4 virtual cores ). Shouldn't task swaps per se, forced by higher thread occupancy of the CPUs, give a concave-down aspect to the 'thigh' part of the curve? That is, at say 5 GW jobs there will probably be more physical cores ( 3 ) occupied by only a single GW job than at 7 GW jobs ( 1 such core ), so non-GW work ( system stuff, say ) is more likely to bump a GW job off a given physical core with 7 jobs than with 5 .... unless I'm viewing an artifact of the data presentation.
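The occupancy argument can be counted out explicitly. A sketch assuming the scheduler fills empty physical cores before doubling up ( an assumption; real affinity behaviour is exactly what is in question here ):

```python
def occupancy(jobs, cores=4):
    """Assuming jobs fill physical cores first, then double up via HT,
    return (cores holding one job, cores holding two jobs)."""
    doubled = max(0, jobs - cores)
    single = min(jobs, cores) - doubled
    return single, doubled

for jobs in range(4, 9):
    print(jobs, occupancy(jobs))  # 5 -> (3, 1), 7 -> (1, 3)
```

At 5 jobs three cores hold a lone GW job; at 7 jobs only one does, so background work is more likely to displace a GW job from a busy core in the 7-job case.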

                                                                                                                                            Of course to decide matters firmly, what we need is a repeat of all the prior work with process affinities nailed down [ plus recording elapsed ( run ) time vs core ( CPU ) time ] Volunteers ? :-) :-) :-)

                                                                                                                                            Cheers, Mike.
                                                                                                                                            ____________

                                                                                                                                            FrankHagen
                                                                                                                                            Joined: Feb 13 08
                                                                                                                                            Posts: 102
                                                                                                                                            Credit: 63,762
                                                                                                                                            RAC: 0
                                                                                                                                            Message 112526 - Posted 6 Jun 2011 14:32:58 UTC - in response to Message 112519.

                                                                                                                                              one thing to add..

since even on 64-bit linux we are running a 32-bit app (correct me if i'm wrong), only 8 of the 16 SSE2 registers of a core are usable.
on the other hand, exactly this may lead to better performance with HT.

                                                                                                                                              but as long as there is no real 64-bit app which makes full use of SSE2 capabilities, we won't know.



                                                                                                                                              archae86
                                                                                                                                              Joined: Dec 6 05
                                                                                                                                              Posts: 1065
                                                                                                                                              Credit: 112,197,382
                                                                                                                                              RAC: 98,840
                                                                                                                                              Message 112532 - Posted 7 Jun 2011 16:33:45 UTC - in response to Message 112526.

                                                                                                                                                since even on a 64-bit linux we are running a 32-bit app (correct me if i'm wrong), this leads to only 8 of the 16 SSE2 registers of a core being usable.
                                                                                                                                                on the other hand exactly this may lead to a better performance of HT.

                                                                                                                                                I was not even aware of this particular distinction. Speaking hypothetically, I suppose that an application variant which used more registers might be expected to generate less data memory traffic. As one of the opportunities for HT benefit is clearly finding something useful to do while waiting for a memory read, that would seem to suggest possibly less HT benefit on the hypothetical variant.

                                                                                                                                                But memory references able to be supplanted by registers are, I should think, highly likely to be filled from cache, not RAM, and usually a fast level of the cache.

                                                                                                                                                In practice I doubt the speculated effect is either substantial or consistent.

                                                                                                                                                Over at SETI, it appears that the Lunatics tuned applications include distinct x64 and x32 Windows variants for both Astropulse and Multibeam. Do you know whether anyone has done work to compare the actual execution performance to see what benefit their x64 version provides compared to x32 when both are running in a 64-bit OS? I think there has actually been less recent careful HT assessment there than here, and certainly don't recall spotting any x64 vs. x32 HT detail.

                                                                                                                                                But such an answer would necessarily be highly application-specific. I don't think either of the current SETI analyses much resembles any of the Einstein analyses computationally (if Bikeman or others know I'm wrong here, please correct me), and the considerable history and infrastructure of the Lunatics effort may mean that available tuning benefits have been more thoroughly explored there.

                                                                                                                                                ____________

                                                                                                                                                FrankHagen
                                                                                                                                                Message 112534 - Posted 7 Jun 2011 17:01:59 UTC - in response to Message 112532.

                                                                                                                                                  since even on a 64-bit linux we are running a 32-bit app (correct me if i'm wrong), this leads to only 8 of the 16 SSE2 registers of a core being usable.
                                                                                                                                                  on the other hand exactly this may lead to a better performance of HT.

                                                                                                                                                  I was not even aware of this particular distinction. Speaking hypothetically, I suppose that an application variant which used more registers might be expected to generate less data memory traffic.


that's only part one of the story - part two is that, in theory, the core can process twice the number of calculations in a single operation, if the code can be ( and is ) vectorized to use all 16 registers.

                                                                                                                                                  Over at SETI, it appears that the Lunatics tuned applications include distinct x64 and x32 Windows variants for both Astropulse and Multibeam. Do you know whether anyone has done work to compare the actual execution performance to see what benefit their x64 version provides compared to x32 when both are running in a 64-bit OS?


                                                                                                                                                  nope - i do not care about YETI.. ;)

                                                                                                                                                  archae86
                                                                                                                                                  Send message
                                                                                                                                                  Joined: Dec 6 05
                                                                                                                                                  Posts: 1065
                                                                                                                                                  Credit: 112,197,382
                                                                                                                                                  RAC: 98,840
                                                                                                                                                  Message 112538 - Posted 7 Jun 2011 19:49:52 UTC - in response to Message 112534.

                                                                                                                                                    I was not even aware of this particular distinction. Speaking hypothetically, I suppose that an application variant which used more registers might be expected to generate less data memory traffic.


That's only part one of the story. Part two is that, in theory, the core can process twice the number of calculations in a single operation, if the code can be (and is) vectorized to use all 16 registers.

                                                                                                                                                    That would seem to push back in the opposite direction if it were true. Not denying the performance benefit possibly available to the portion of the code doing that, but the opportunity for HT benefit would seem to go up with more closely spaced data reads from memory, which this would do.

                                                                                                                                                    But I doubt it is true. Do you seriously think that even Nehalem comes equipped with enough distinct SSE floating point units ever to keep 16 registers in use in real-world code?

                                                                                                                                                    Looking at this Nehalem block diagram I see two available ADD SSE units and one available MUL/DIV SSE unit. So thinking in terms of two-operand instructions that could support six, once in a while, but how on earth do you imagine getting to sixteen?
                                                                                                                                                    ____________

                                                                                                                                                    FrankHagen
                                                                                                                                                    Send message
                                                                                                                                                    Joined: Feb 13 08
                                                                                                                                                    Posts: 102
                                                                                                                                                    Credit: 63,762
                                                                                                                                                    RAC: 0
                                                                                                                                                    Message 112539 - Posted 7 Jun 2011 20:14:34 UTC - in response to Message 112538.

                                                                                                                                                      Looking at this Nehalem block diagram I see two available ADD SSE units and one available MUL/DIV SSE unit. So thinking in terms of two-operand instructions that could support six, once in a while, but how on earth do you imagine getting to sixteen?


This is MAYBE another thing on Nehalems and Bulldozers. But we are talking about cores capable of SSE2 in 64-bit mode, and that means ( cough ) the P4 and Athlon 64!

The difference between "native mode" and 64-bit mode started back then..

                                                                                                                                                      http://en.wikipedia.org/wiki/SSE2

That's just another reason why you'll find "AMD64" everywhere.

Getting back to a certain app - it really depends:

heavy use of SSEx instructions?
code which can be vectorized?
processor architecture?

Bottom line: unless you really give it a try, you'll never know - but if you don't, you might as well go on believing the earth is flat.


In the BOINC world it's very rare that a real 64-bit app is not faster - you might want to check the numbers at http://wuprop.boinc-af.org/results/delai.py

                                                                                                                                                      Claggy
                                                                                                                                                      Send message
                                                                                                                                                      Joined: Dec 29 06
                                                                                                                                                      Posts: 437
                                                                                                                                                      Credit: 970,749
                                                                                                                                                      RAC: 260
                                                                                                                                                      Message 112541 - Posted 7 Jun 2011 20:52:53 UTC - in response to Message 112532.

                                                                                                                                                        Last modified: 7 Jun 2011 20:54:44 UTC

                                                                                                                                                        Over at SETI, it appears that the Lunatics tuned applications include distinct x64 and x32 Windows variants for both Astropulse and Multibeam. Do you know whether anyone has done work to compare the actual execution performance to see what benefit their x64 version provides compared to x32 when both are running in a 64-bit OS? I think there has actually been less recent careful HT assessment there than here, and certainly don't recall spotting any x64 vs. x32 HT detail.

The only 64-bit Windows apps are for CPU Multibeam; there are no Windows 64-bit Astropulse apps and no Windows 64-bit CUDA apps,

                                                                                                                                                        Claggy

                                                                                                                                                        archae86
                                                                                                                                                        Send message
                                                                                                                                                        Joined: Dec 6 05
                                                                                                                                                        Posts: 1065
                                                                                                                                                        Credit: 112,197,382
                                                                                                                                                        RAC: 98,840
                                                                                                                                                        Message 112544 - Posted 8 Jun 2011 1:31:14 UTC - in response to Message 112541.

                                                                                                                                                          There are only 64bit Windows apps for CPU Multibeam, there are no Windows 64bit Astropulse apps, and no Windows 64bit Cuda apps,

                                                                                                                                                          Claggy
Thanks for the correction; I carelessly relied on a heading in their download area reading
AstroPulse for Windows - x64 & x32 Bit Windows AstroPulse apps for SSE & SSE3.
which is similar language to that used for the Multibeam entry. Maybe the intended meaning is that the applications will run in those environments.

                                                                                                                                                          ____________

                                                                                                                                                          Claggy
                                                                                                                                                          Send message
                                                                                                                                                          Joined: Dec 29 06
                                                                                                                                                          Posts: 437
                                                                                                                                                          Credit: 970,749
                                                                                                                                                          RAC: 260
                                                                                                                                                          Message 112552 - Posted 8 Jun 2011 20:01:48 UTC - in response to Message 112544.

                                                                                                                                                            Last modified: 8 Jun 2011 20:02:57 UTC

                                                                                                                                                            There are only 64bit Windows apps for CPU Multibeam, there are no Windows 64bit Astropulse apps, and no Windows 64bit Cuda apps,

                                                                                                                                                            Claggy
                                                                                                                                                            Thanks for the correction, I carelessly relied on a heading in their download area reading
                                                                                                                                                            AstroPulse for Windows - x64 & x32 Bit Windows AstroPulse apps for SSE & SSE3.
                                                                                                                                                            which is similar language to that used for the Multibeam entry. Maybe the intended meaning is that the applications will run in those environments.

There are just different installers aimed at 32-bit or 64-bit BOINC installations; the 64-bit installer adds more <app_version> entries to try to make sure no one loses any work when installing the Lunatics apps.
But the only 64-bit app in it is the AK_V8 MB app,

                                                                                                                                                            Claggy

                                                                                                                                                            Robert
                                                                                                                                                            Send message
                                                                                                                                                            Joined: Nov 5 05
                                                                                                                                                            Posts: 34
                                                                                                                                                            Credit: 205,806,050
                                                                                                                                                            RAC: 160,846
                                                                                                                                                            Message 112574 - Posted 12 Jun 2011 0:47:46 UTC

                                                                                                                                                              I've been tied up the last week and I see a few questions have come up on the HT results I collected.


                                                                                                                                                                - No special techniques were used for setting processor affinity
                                                                                                                                                                - I used a standard 64 bit ubuntu load with 32 bit compatibility libs, no other OS tuning
                                                                                                                                                                - The 32 bit S5 SSE2 application was utilized, version 1.07
                                                                                                                                                                - This machine is dedicated to running E@H, so no other side loads
                                                                                                                                                                - It took many weeks to collect measurements for all the data points, so different frequencies were used
                                                                                                                                                                - I assumed slight variations in measurement points were from the different frequencies and data sets, as discussed in the beginning of this thread
- Thanks Mike for the suggestion to compare slope ratios as a way to answer the question of how much benefit HT provides
- My calculations are in reasonable agreement with Mike's calculated slope ratio of 2.69


                                                                                                                                                              ____________

                                                                                                                                                              Robert
                                                                                                                                                              Send message
                                                                                                                                                              Joined: Nov 5 05
                                                                                                                                                              Posts: 34
                                                                                                                                                              Credit: 205,806,050
                                                                                                                                                              RAC: 160,846
                                                                                                                                                              Message 112575 - Posted 12 Jun 2011 0:52:50 UTC

                                                                                                                                                                There seemed to be interest in this data, so I'll post the hyper-threading S5 data I collected on the i7-980 for comparison.

                                                                                                                                                                The same collection conditions apply.



Only two frequency curves were collected for the 980, so I'm not willing to deduce whether the 980's slope ratios are the same or different; they are close. At 3.0 GHz the 980's slope ratio is close to the 2600K's slope ratio, as shown by Mike's earlier calculations.
                                                                                                                                                                ____________

                                                                                                                                                                Profile Mike Hewson
                                                                                                                                                                Forum moderator
                                                                                                                                                                Avatar
                                                                                                                                                                Send message
                                                                                                                                                                Joined: Dec 1 05
                                                                                                                                                                Posts: 3592
                                                                                                                                                                Credit: 28,563,336
                                                                                                                                                                RAC: 12,021
                                                                                                                                                                Message 112577 - Posted 12 Jun 2011 2:31:59 UTC - in response to Message 112574.

                                                                                                                                                                  Last modified: 12 Jun 2011 2:33:12 UTC

                                                                                                                                                                  My Calculations are in reasonable agreement with Mike's 2.69 calculated slope ratio

These being lines fitted by ( my ) eye on a piece of paper, I definitely think I'm being over-exact in quoting 2 decimal places. Probably better to quote only one - say, call it 2.7 .... :-)

Anyway they're all around 2.5 to 3.0, hence my impression is that OS swap overhead is at least comparable to HT effects at high core loads. OK. If so, then to test: did someone mention 'Process Lasso' or somesuch as an appropriate Windoze-based utility to achieve affinity control? And does it return detailed timings, for that matter, or does some other utility do that better? Suggestions?

This machine of mine, if divested of BRP work, could be an ideal test rig methinks: I could measure actual core times vs. wall-clock times on a per-virtual-core basis, and thus get times per core not devoted to the GW tasks of interest, derive fractional overheads, etc.

                                                                                                                                                                  Thanks again for collecting and presenting that! We're always looking to study such behaviours and perhaps get a hint or two on optimising. :-)

                                                                                                                                                                  Cheers, Mike.
                                                                                                                                                                  ____________
                                                                                                                                                                  "I have made this letter longer than usual, because I lack the time to make it short." - Blaise Pascal
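The by-eye line fitting discussed above could be made reproducible with an ordinary least-squares fit before taking the ratio of the two slopes. A minimal C sketch, assuming the data points are simple (x, y) arrays such as (task count, throughput); the function name is invented for the example:

```c
#include <assert.h>

/* Least-squares slope of y against x over n points,
 * as a repeatable alternative to fitting lines by eye. */
double ls_slope(const double *x, const double *y, int n)
{
    double sx = 0, sy = 0, sxx = 0, sxy = 0;
    for (int i = 0; i < n; i++) {
        sx  += x[i];
        sy  += y[i];
        sxx += x[i] * x[i];
        sxy += x[i] * y[i];
    }
    return (n * sxy - sx * sy) / (n * sxx - sx * sx);
}
```

The slope ratio is then just `ls_slope` applied to the two data sets and divided, with no eyeballing involved.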

                                                                                                                                                                  FrankHagen
                                                                                                                                                                  Send message
                                                                                                                                                                  Joined: Feb 13 08
                                                                                                                                                                  Posts: 102
                                                                                                                                                                  Credit: 63,762
                                                                                                                                                                  RAC: 0
                                                                                                                                                                  Message 112795 - Posted 27 Jun 2011 12:47:59 UTC

                                                                                                                                                                    just stumbled over that one:

                                                                                                                                                                    http://software.intel.com/en-us/articles/improved-linux-smp-scaling-user-directed-processor-affinity/

                                                                                                                                                                    interesting..
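For anyone wanting to try the user-directed affinity that the article describes, on Linux the standard mechanism is the sched_setaffinity(2) system call. A minimal sketch (the wrapper name pin_to_cpu is my own, not from the article):

```c
#define _GNU_SOURCE        /* needed for cpu_set_t macros on glibc */
#include <sched.h>
#include <assert.h>

/* Pin the calling process to a single logical CPU (Linux only).
 * Returns 0 on success, -1 on failure. */
int pin_to_cpu(int cpu_id)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu_id, &set);
    return sched_setaffinity(0, sizeof(set), &set); /* 0 = this process */
}
```

The same effect from a shell, without any code, is `taskset -c 0 ./app`.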

                                                                                                                                                                    Profile Mike Hewson
                                                                                                                                                                    Forum moderator
                                                                                                                                                                    Avatar
                                                                                                                                                                    Send message
                                                                                                                                                                    Joined: Dec 1 05
                                                                                                                                                                    Posts: 3592
                                                                                                                                                                    Credit: 28,563,336
                                                                                                                                                                    RAC: 12,021
                                                                                                                                                                    Message 112815 - Posted 28 Jun 2011 15:22:23 UTC - in response to Message 112795.

                                                                                                                                                                      just stumbled over that one:

                                                                                                                                                                      http://software.intel.com/en-us/articles/improved-linux-smp-scaling-user-directed-processor-affinity/

                                                                                                                                                                      interesting..

                                                                                                                                                                      It is indeed! Thanks for digging that up. :-)

                                                                                                                                                                      Those bandwidth curves look eerily familiar. I'll look closer at the article and comment further if appropriate.

FWIW: I have stopped the GPU work on that machine I mentioned earlier and have tooled it up with Process Lasso. I am currently benchmarking the virtual cores when used alone for bucket WUs - i.e. only one WU at a time on the entire machine - proceeding through each core ( 5 of 8 done ). I thought I would first examine the ( entirely reasonable ) assumption that all cores are equivalent to one another, so that if there is some asymmetry in the hardware it'll come out and I can account for it when I move to testing the ideas I mentioned earlier. They are doing fine thus far, with average times overlapping well within each other's one-standard-deviation widths.

I'll publish the full spreadsheet of data when complete. I'm doing 12 WUs per core, tossing out the highest and the lowest, and using the remaining 10 as 'typical' for statistics. I'm also after the idea of a 'fiducial occasion', or 'single virtual core run-time', for this machine as presently configured on bucket WUs, and thus have already collected very many results to aggregate into that. I've also had to admonish my offspring for daring to touch it meantime .... :-)
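The trimming scheme just described - 12 runs per core, discard the single highest and lowest, keep 10 - amounts to a trimmed mean. A small C sketch of that statistic (names invented for the example):

```c
#include <stdlib.h>
#include <assert.h>

/* Comparison callback for qsort on doubles. */
static int cmp_d(const void *a, const void *b)
{
    double d = *(const double *)a - *(const double *)b;
    return (d > 0) - (d < 0);
}

/* Sort n run-times, discard the single lowest and single highest,
 * and return the mean of the remaining n - 2 values.
 * Note: sorts the input array in place; n must be at least 3. */
double trimmed_mean(double *t, int n)
{
    double sum = 0.0;
    qsort(t, n, sizeof(double), cmp_d);
    for (int i = 1; i < n - 1; i++)
        sum += t[i];
    return sum / (n - 2);
}
```

Discarding the extremes this way makes the per-core averages robust against the occasional run perturbed by startup effects or a stray background task.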

                                                                                                                                                                      Cheers, Mike.
                                                                                                                                                                      ____________
                                                                                                                                                                      "I have made this letter longer than usual, because I lack the time to make it short." - Blaise Pascal

                                                                                                                                                                      FrankHagen
                                                                                                                                                                      Send message
                                                                                                                                                                      Joined: Feb 13 08
                                                                                                                                                                      Posts: 102
                                                                                                                                                                      Credit: 63,762
                                                                                                                                                                      RAC: 0
                                                                                                                                                                      Message 112816 - Posted 28 Jun 2011 15:42:58 UTC - in response to Message 112815.

                                                                                                                                                                        Last modified: 28 Jun 2011 16:27:09 UTC


                                                                                                                                                                        It is indeed! Thanks for digging that up. :-)

                                                                                                                                                                        Those bandwidth curves look eerily familiar. I'll look closer at the article and comment further if appropriate.


well, actually it's a pretty ancient thing: P4-era, you know, and things might have changed a lot since then.

Another thing that comes to mind here: how are the cores mapped?

If I look at my i5m I see this:

                                                                                                                                                                        Coreinfo v2.11 - Dump information on system CPU and memory topology
                                                                                                                                                                        Copyright (C) 2008-2010 Mark Russinovich
                                                                                                                                                                        Sysinternals - www.sysinternals.com

                                                                                                                                                                        Logical to Physical Processor Map:
                                                                                                                                                                        *-*- Physical Processor 0 (Hyperthreaded)
                                                                                                                                                                        -*-* Physical Processor 1 (Hyperthreaded)

                                                                                                                                                                        Logical Processor to Socket Map:
                                                                                                                                                                        **** Socket 0

                                                                                                                                                                        Logical Processor to NUMA Node Map:
                                                                                                                                                                        **** NUMA Node 0

                                                                                                                                                                        Logical Processor to Cache Map:
                                                                                                                                                                        *-*- Data Cache 0, Level 1, 32 KB, Assoc 8, LineSize 64
                                                                                                                                                                        *-*- Instruction Cache 0, Level 1, 32 KB, Assoc 4, LineSize 64
                                                                                                                                                                        *-*- Unified Cache 0, Level 2, 256 KB, Assoc 8, LineSize 64
                                                                                                                                                                        -*-* Data Cache 1, Level 1, 32 KB, Assoc 8, LineSize 64
                                                                                                                                                                        -*-* Instruction Cache 1, Level 1, 32 KB, Assoc 4, LineSize 64
                                                                                                                                                                        -*-* Unified Cache 1, Level 2, 256 KB, Assoc 8, LineSize 64
                                                                                                                                                                        **** Unified Cache 2, Level 3, 3 MB, Assoc 12, LineSize 64


So all this testing will need to pick the right cores.
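For what it's worth, the `*` positions in those Coreinfo rows can be decoded mechanically into a logical-to-physical map; a quick sketch (the row strings are just the i5 map quoted above):

```python
def parse_core_map(rows):
    """Each Coreinfo-style row like '*-*-' marks, by character position,
    which logical CPUs sit on one physical core."""
    mapping = {}
    for phys, row in enumerate(rows):
        for logical, ch in enumerate(row):
            if ch == '*':
                mapping[logical] = phys
    return mapping

# The i5 dump above: logicals 0 and 2 share core 0, logicals 1 and 3 share core 1.
print(parse_core_map(['*-*-', '-*-*']))
```

So on that machine, pinning a pair of tasks to logicals 0 and 1 uses two separate physical cores, while 0 and 2 would contend for the same one.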

Probably the real freak-out will come if someone shows up with a quad-socket Xeon E7-4800. ;)

                                                                                                                                                                        archae86
                                                                                                                                                                        Send message
                                                                                                                                                                        Joined: Dec 6 05
                                                                                                                                                                        Posts: 1065
                                                                                                                                                                        Credit: 112,197,382
                                                                                                                                                                        RAC: 98,840
                                                                                                                                                                        Message 112824 - Posted 29 Jun 2011 13:24:09 UTC - in response to Message 112815.

                                                                                                                                                                          I am currently benchmarking the virtual cores when used alone for bucket WU's ie. only one WU at a time on the entire machine and I am proceeding through each core ( 5 of 8 done ).
                                                                                                                                                                          While I think it rather likely that the virtual CPUs are in fact equivalent, I was surprised to find on my own host that at least one of the many background programs had an affinity.

If you're ambitious you might check, for example by using Process Explorer, right-clicking one process at a time, and checking under "set affinity…". Remembering I had seen something in the past, I did a little of this just now and noticed that SpeedFan is set to run only on CPU number five (of eight on this Westmere host).
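For reference, the affinity those tools display is just a bitmask over logical CPUs, so it is easy to decode; the 0x20 value below is my assumption for what "CPU five only" would look like as a mask:

```python
def mask_to_cpus(mask):
    """Decode a Windows-style processor affinity bitmask into a list of
    logical CPU numbers (bit i set means logical CPU i is allowed)."""
    return [i for i in range(mask.bit_length()) if mask >> i & 1]

print(mask_to_cpus(0x20))  # a process pinned to logical CPU 5 only
print(mask_to_cpus(0xFF))  # free to run on all eight logical CPUs
```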

                                                                                                                                                                          ____________

                                                                                                                                                                          Profile Mike Hewson
                                                                                                                                                                          Forum moderator
                                                                                                                                                                          Avatar
                                                                                                                                                                          Send message
                                                                                                                                                                          Joined: Dec 1 05
                                                                                                                                                                          Posts: 3592
                                                                                                                                                                          Credit: 28,563,336
                                                                                                                                                                          RAC: 12,021
                                                                                                                                                                          Message 112826 - Posted 29 Jun 2011 17:02:10 UTC - in response to Message 112824.

                                                                                                                                                                            I am currently benchmarking the virtual cores when used alone for bucket WU's ie. only one WU at a time on the entire machine and I am proceeding through each core ( 5 of 8 done ).
                                                                                                                                                                            While I think it rather likely that the virtual CPUs are in fact equivalent, I was surprised to find on my own host that at least one of the many background programs had an affinity.

If you're ambitious you might check, for example by using Process Explorer, right-clicking one process at a time, and checking under "set affinity…". Remembering I had seen something in the past, I did a little of this just now and noticed that SpeedFan is set to run only on CPU number five (of eight on this Westmere host).

                                                                                                                                                                            Thank you indeed! I was basically thinking of possible mild hardware disparity, but yes there may well be OS bindings. This had not occurred to me. An excellent idea and I will check that. :-)

                                                                                                                                                                            Cheers, Mike.

                                                                                                                                                                            ____________
                                                                                                                                                                            "I have made this letter longer than usual, because I lack the time to make it short." - Blaise Pascal

                                                                                                                                                                            FrankHagen
                                                                                                                                                                            Send message
                                                                                                                                                                            Joined: Feb 13 08
                                                                                                                                                                            Posts: 102
                                                                                                                                                                            Credit: 63,762
                                                                                                                                                                            RAC: 0
                                                                                                                                                                            Message 112829 - Posted 29 Jun 2011 18:55:09 UTC - in response to Message 112826.

                                                                                                                                                                              Last modified: 29 Jun 2011 18:55:53 UTC

                                                                                                                                                                              Thank you indeed! I was basically thinking of possible mild hardware disparity, but yes there may well be OS bindings. This had not occurred to me. An excellent idea and I will check that. :-)


Now that you're getting rolling... ;)

Get the whole Sysinternals suite. Things to check here: Process Explorer, PSSTART (because it can use the pretty ancient API), and Coreinfo to tell you what's really under the hood...

I had been using Mark's tools for a long time before the battleship of lawyers showed up and forced them to sign for Redmond.

                                                                                                                                                                              Profile Mike Hewson
                                                                                                                                                                              Forum moderator
                                                                                                                                                                              Avatar
                                                                                                                                                                              Send message
                                                                                                                                                                              Joined: Dec 1 05
                                                                                                                                                                              Posts: 3592
                                                                                                                                                                              Credit: 28,563,336
                                                                                                                                                                              RAC: 12,021
                                                                                                                                                                              Message 112832 - Posted 30 Jun 2011 3:24:38 UTC - in response to Message 112829.

                                                                                                                                                                                Thank you indeed! I was basically thinking of possible mild hardware disparity, but yes there may well be OS bindings. This had not occurred to me. An excellent idea and I will check that. :-)


Now that you're getting rolling... ;)

Get the whole Sysinternals suite. Things to check here: Process Explorer, PSSTART (because it can use the pretty ancient API), and Coreinfo to tell you what's really under the hood...

I had been using Mark's tools for a long time before the battleship of lawyers showed up and forced them to sign for Redmond.

                                                                                                                                                                                Terrific ideas. I will indeed get that to drill down and more cleanly separate the 'pure HT' aspect I seek from the rest. :-)

Aside: I can use LogMeIn (Pro version, with LogMeIn Ignition), which basically acts as a neat layer over Windows Remote Desktop. It creates a secure VPN connection (long and very random key) and thence allows full remote control, file sharing, FTP, clandestine monitoring (so any user on the target won't obviously know), chat even, plus other stuff. So here I am in Germany, fiddling/tweaking my HT profiling experiments on the DownUnda machine, my only trouble being that the laptop screen here doesn't match the bigger desktop there. So I have to pop my spectacles on. :-)

                                                                                                                                                                                Cheers, Mike.
                                                                                                                                                                                ____________
                                                                                                                                                                                "I have made this letter longer than usual, because I lack the time to make it short." - Blaise Pascal

                                                                                                                                                                                This material is based upon work supported by the National Science Foundation (NSF) under Grants PHY-1104902, PHY-1104617 and PHY-1105572 and by the Max Planck Gesellschaft (MPG). Any opinions, findings, and conclusions or recommendations expressed in this material are those of the investigators and do not necessarily reflect the views of the NSF or the MPG.

                                                                                                                                                                                Copyright © 2014 Bruce Allen