Memory channels provisioned vs. Einstein performance E5620



Message boards : Cruncher's Corner : Memory channels provisioned vs. Einstein performance E5620

archae86
Joined: Dec 6 05
Posts: 1065
Credit: 112,178,638
RAC: 97,934
Message 109266 - Posted 7 Jan 2011 0:21:17 UTC

    I've often wondered whether the 3-channel Nehalem-family parts are heavily over-supplied with RAM bandwidth for Einstein application performance purposes. The fact that Intel provides many variants with just two channels in its more consumer-oriented socket suggests as much.

    I recently had to invade my Westmere (E5620) host because one RAM stick had failed, and took the time to run comparison tests of 1, 2, and 3 populated channels, hyperthreaded and non hyperthreaded.

    The results will not be broadly applicable, as RAM requirements vary widely by application, Nehalem-family cache sizes vary, and the component stock specs and user overclocking practices move around the relative performance of CPU and RAM channels quite a bit. Still, some may find some interest here.

    For all these tests, the CPU was running at a moderate overclock of 3.42 GHz with the multiplier at 19. For all these tests the BIOS was left to set up the RAM as it wished, presumably based on SPD information and the clock rate implied by the 100/180 clock settings.

    For those who speak CPU-Z, here is the CPU state:


    and here is the RAM state (3-channel populated case shown):


    The memory sticks were all Corsair Platinum series XMS3 DDR3 parts, shipping as 3-packs under part number TR3X6G1333C9, for which CPU-Z reads the SPD information as:


    And here are the mean execution times in CPU seconds for a full set (4 for nHT, 8 for HT) of current Einstein Global Correlations S5 HF search #1 v3.06 (S5GCESSE2) tasks.



    The RelProd line indicates system productivity in each configuration as a fraction of the highest performing (hyperthreaded 3-channel) configuration.

    It must be kept in mind that I've used an appreciable overclock on the CPU, and none at all on the RAM. Those who buy premium RAM and slave away to find clock counts to save may get faster RAM relative to CPU than this, and those who run everything dead stock will have slower RAM relative to CPU than this.

    1. But for this case and this app, in hyperthreaded mode the one-channel case is severely starved, and even the nHT case is significantly impaired.

    2. Going from two to three channels in this configuration helps the nHT case very little, and the HT case only moderately.

    3. HT always helps, but it helps a lot more when the configuration is not memory-starved (the HTBen line is a direct productivity comparison of HT vs. nHT at the same channel count).
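For readers who want to rerun the arithmetic, the RelProd and HTBen figures can be derived from mean task times as sketched below. The times in this sketch are hypothetical placeholders, not the measured values from the table above:

```python
# Sketch: deriving relative productivity (RelProd) and the HT benefit
# (HTBen) from mean per-task CPU times. The times below are hypothetical
# placeholders, NOT the measured values from the thread.

# mean CPU seconds per task, keyed by (hyperthreaded?, channels populated)
mean_time = {
    (False, 1): 15000.0, (False, 2): 12000.0, (False, 3): 11800.0,
    (True, 1): 26000.0, (True, 2): 18000.0, (True, 3): 17000.0,
}

def productivity(ht, channels):
    """Tasks per second of wall time: concurrent tasks / mean task time."""
    tasks = 8 if ht else 4          # 8 concurrent tasks with HT, 4 without
    return tasks / mean_time[(ht, channels)]

best = productivity(True, 3)        # highest-performing configuration
for ht in (False, True):
    for ch in (1, 2, 3):
        rel = productivity(ht, ch) / best
        print(f"{'HT ' if ht else 'nHT'} {ch}-ch: RelProd = {rel:.3f}")

# HTBen: productivity of HT vs. nHT at the same channel count
for ch in (1, 2, 3):
    print(f"{ch}-ch HTBen = {productivity(True, ch) / productivity(False, ch):.3f}")
```

With real measurements substituted for the placeholders, the printed RelProd and HTBen lines correspond directly to the rows described in the post.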

    Conscious of our old problem with high result-to-result variation in execution requirements, I made a reasonably serious effort to match the freq/seq characteristics of the six result sets compared here. I've shown the Stdev for each set, mostly to document that by and large the differences between test cases are large compared to the possibly random timing variations present, and also that generally the timing variations observed within my samples were very small. True, the Frequency range was very small (1264.00 to 1264.25), but the seq range was wider (3 to 462) and I just did not see evidence of major systematic variation. I think the high stdev for the hyperthreaded one-channel case is another symptom of the severe memory famine of that configuration, not evidence that my matching efforts failed and somehow stocked that case with massively more inherent result effort variation.
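The within-set scatter check described here boils down to comparing each set's standard deviation against the between-configuration difference in means; a minimal sketch with hypothetical task times:

```python
# Sketch: checking that between-configuration differences dwarf the
# within-set timing scatter. The task times below are hypothetical,
# not the measured Einstein results.
from statistics import mean, stdev

set_a = [11950.0, 12010.0, 12080.0, 11960.0]   # e.g. one configuration
set_b = [11790.0, 11820.0, 11760.0, 11830.0]   # e.g. another configuration

diff = abs(mean(set_a) - mean(set_b))          # between-set difference
noise = max(stdev(set_a), stdev(set_b))        # worst within-set scatter
print(f"mean difference {diff:.0f} s vs. within-set stdev {noise:.0f} s")
print("difference dominates scatter" if diff > 3 * noise else "inconclusive")
```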

    I have no doubt that effort expended on getting lower RAM latency through tighter memory timings would benefit _all_ of these cases (even when you are not waiting for your brother tasks because of bandwidth constraints, you must wait out the latency time for every jump for which the target is not in some form of cache, and for every similarly challenged data memory access). But I have a low appetite for and little experience in twisting the tail on DDR3 RAM clocks, so don't plan to try. I would, however, be happy to watch another contributor taking a look at that.


    ____________

    DanNeely
    Joined: Sep 4 05
    Posts: 1075
    Credit: 71,651,502
    RAC: 80,295
    Message 109268 - Posted 7 Jan 2011 0:41:50 UTC

      I'm mildly surprised that you saw any difference from the 3rd channel. When LGA1156 came out, the Intel engineer who gave Anand the tech dump said that the 3rd channel in LGA1366 was for hex-core support, and that outside of synthetic benchmarks two channels would be sufficient to keep a quad-core from bottlenecking. The Einstein apps must really be hammering the memory controllers in order to see that effect.
      ____________

      archae86
      Joined: Dec 6 05
      Posts: 1065
      Credit: 112,178,638
      RAC: 97,934
      Message 109270 - Posted 7 Jan 2011 3:35:19 UTC - in response to Message 109268.

        The Einstein apps must really be hammering the memory controllers in order to see that effect.

        I imagine the Intel engineer was presuming that neither the CPU clock nor the RAM timings would be overclocked. He also might quite likely decline to label what I saw as bottlenecking, reserving that term for a more severe level. In pushing up the CPU clock and not the RAM, I've definitely pushed further into the RAM-congestion side of the envelope.

        That said, I'll wager there exist apps far more RAM-intensive than Einstein, though they may not be ones likely to make up much of most plausible workloads.

        Separately, I failed to mention an important RAM configuration detail. Those who know that Einstein work of this type has about a 250 Mbyte working set may figure that my single-channel case, run hyperthreaded, would have gone to serious swapping as 2G of Einstein and something like 1G of Windows 7 tried to fit into 2G of physical RAM. However, I actually placed two 2G modules on the single channel in service, so the 1-channel and 2-channel cases had the same RAM capacity. True, the 3-channel case had two more gig, but I doubt it found any use for them that had an appreciable effect on execution times.
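The capacity arithmetic can be sketched as below; the ~250 MB working set per task is from this thread, while the ~1 GB OS footprint is a rough assumption for illustration:

```python
# Back-of-envelope check that the single-channel (2 x 2 GB on one channel)
# configuration avoids swapping. The ~250 MB working set per task is from
# the post; the ~1 GB Windows 7 resident footprint is an assumed estimate.
MB = 1
GB = 1024 * MB

task_working_set = 250 * MB    # per-task working set (from the thread)
os_footprint = 1 * GB          # assumed rough OS resident size
physical_ram = 4 * GB          # two 2 GB modules on a single channel

def fits(n_tasks):
    """Return (fits?, total MB needed) for n_tasks concurrent tasks."""
    needed = n_tasks * task_working_set + os_footprint
    return needed <= physical_ram, needed

ok, needed = fits(8)           # hyperthreaded: 8 concurrent tasks
print(f"8 tasks need {needed / GB:.2f} GB -> fits in 4 GB: {ok}")
```

Eight tasks plus the assumed OS footprint come to about 3 GB, comfortably inside the 4 GB that was actually installed, which supports the post's conclusion that no swapping occurred.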

        ____________

        ML1
        Joined: Feb 20 05
        Posts: 320
        Credit: 18,426,418
        RAC: 11,363
        Message 109281 - Posted 7 Jan 2011 11:24:06 UTC - in response to Message 109266.

          Last modified: 7 Jan 2011 11:24:55 UTC



          Very good test there, thanks.

          To put some e@h performance percentages on there, you get:

          Single channel -> double channel: +38% (HT), +17% (nHT)
          Double channel -> triple channel: +05% (HT), +01% (nHT)

          vs. how much does the extra channel cost?...



          Happy fast crunchin',
          Martin
          ____________
          Powered by Mandriva Linux A user friendly OS!
          See the Boinc HELP Wiki

          DanNeely
          Joined: Sep 4 05
          Posts: 1075
          Credit: 71,651,502
          RAC: 80,295
          Message 109317 - Posted 8 Jan 2011 21:55:32 UTC

            LGA 1156 includes quad cores that will turbo to 3.33 GHz on DDR3-1333, so the fact that you didn't clock your RAM to 1600 MHz probably isn't a significant factor.
            ____________

            ExtraTerrestrial Apes
            Joined: Nov 10 04
            Posts: 464
            Credit: 32,327,219
            RAC: 32,997
            Message 109348 - Posted 9 Jan 2011 14:55:47 UTC

              Thanks for the tests!

              And I agree: you've probably got a higher than average CPU clock (3.4 GHz with all cores loaded), and your DDR3-1333 memory is actually running at an effective 1080 MHz (2 x 540 MHz as reported by CPU-Z). Most stock configurations aimed at high performance would rather run 1333 or 1600, so they'll be less bandwidth-starved.
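For reference, the theoretical peak bandwidth implied by those clocks follows directly from the DDR3 definition: each channel is 64 bits wide and transfers twice per I/O clock cycle. A sketch:

```python
# Theoretical peak DDR3 bandwidth from the CPU-Z-reported memory clock.
# CPU-Z shows the I/O clock (540 MHz in this thread); DDR transfers twice
# per cycle, and each channel is 64 bits (8 bytes) wide.
def peak_bandwidth_gbs(io_clock_mhz, channels):
    transfers_per_sec = io_clock_mhz * 1e6 * 2   # double data rate
    bytes_per_transfer = 8                        # 64-bit channel width
    return transfers_per_sec * bytes_per_transfer * channels / 1e9

for ch in (1, 2, 3):
    print(f"{ch} channel(s) @ DDR3-1080: {peak_bandwidth_gbs(540, ch):.2f} GB/s")
```

This is the theoretical ceiling only; sustained bandwidth under a real workload will be lower, but the 1-vs-2-vs-3 channel ratios carry over.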

              MrS
              ____________
              Scanning for our furry friends since Jan 2002

              Robert
              Joined: Nov 5 05
              Posts: 34
              Credit: 205,795,358
              RAC: 160,972
              Message 109498 - Posted 13 Jan 2011 23:48:41 UTC

                For reference about the effects of fully populating all 6 memory slots, I ran a test like this a year ago against the applications that were active then.

                The system was an i7-980 (hexacore), OC = 3.6 GHz with HT on (so 12 jobs running at the same time), using an X58 board. Memory configurations were 3 x 2GB DDR3-1866 @ 1.5 volts (HT_3) and 6 x 2GB DDR3-1866 @ 1.5 volts (call it HT_6).

                Result for gravity wave jobs: HT_3 = HT_6

                I saw no difference in the speed of the jobs, but power increased by 17 watts (250 W vs. 233 W) for the HT_6 case. That is quite a power penalty in my mind.
                ____________

                DanNeely
                Joined: Sep 4 05
                Posts: 1075
                Credit: 71,651,502
                RAC: 80,295
                Message 109505 - Posted 14 Jan 2011 2:25:55 UTC - in response to Message 109498.

                  For reference about the effects of fully populating all 6 memory slots, I ran a test like this a year ago against the applications that were active then.

                  The system was an i7-980 (hexacore), OC = 3.6 GHz with HT on (so 12 jobs running at the same time), using an X58 board. Memory configurations were 3 x 2GB DDR3-1866 @ 1.5 volts (HT_3) and 6 x 2GB DDR3-1866 @ 1.5 volts (call it HT_6).

                  Result for gravity wave jobs: HT_3 = HT_6

                  I saw no difference in the speed of the jobs, but power increased by 17 watts (250 W vs. 233 W) for the HT_6 case. That is quite a power penalty in my mind.


                  The second set of memory slots is just there to connect a second DIMM to each channel. Your power usage went up because you were powering more chips.
                  ____________

                  archae86
                  Joined: Dec 6 05
                  Posts: 1065
                  Credit: 112,178,638
                  RAC: 97,934
                  Message 109509 - Posted 14 Jan 2011 6:16:46 UTC - in response to Message 109498.

                    Robert wrote:
                    Result for gravity wave jobs: HT_3 = HT_6

                    I saw no difference in the speed of the jobs, but power increased by 17 watts (250 w - 233 w) for the HT_6 case. That is quite a power penalty in my mind.

                    Your case is a useful illustration of a basic relationship: When you don't have enough memory and are swapping, just about nothing beats the cost/performance value of adding memory. When you do have enough memory, adding memory does nothing but add cost, failure rate, and power dissipation.
                    ____________

                    tullio
                    Joined: Jan 22 05
                    Posts: 1843
                    Credit: 516,461
                    RAC: 492
                    Message 109512 - Posted 14 Jan 2011 8:29:30 UTC

                      I have 8 GB of RAM on my Linux PAE box. I am running 6 BOINC projects, including 2 virtual machines via VirtualBox. Application data uses 26% of RAM, disk caching 58%; 10% is free, plus some disk buffers.
                      Tullio
                      ____________

                      ExtraTerrestrial Apes
                      Joined: Nov 10 04
                      Posts: 464
                      Credit: 32,327,219
                      RAC: 32,997
                      Message 109545 - Posted 14 Jan 2011 21:58:41 UTC - in response to Message 109509.

                        Your case is a useful illustration of a basic relationship: When you don't have enough memory and are swapping, just about nothing beats the cost/performance value of adding memory. When you do have enough memory, adding memory does nothing but add cost, failure rate, and power dissipation.


                        Totally agreed. When people say "more RAM makes your computer faster", then I like to reply "More RAM doesn't make it faster, it keeps it from getting slower". That has changed a bit due to SuperFetch, i.e. not only caching recently used files but also predicting which ones I'll usually need next, but generally I still stand by it.

                        MrS
                        ____________
                        Scanning for our furry friends since Jan 2002

                        Robert
                        Joined: Nov 5 05
                        Posts: 34
                        Credit: 205,795,358
                        RAC: 160,972
                        Message 109852 - Posted 25 Jan 2011 2:25:06 UTC

                          Here are some numbers to compare the new Sandy Bridge architecture with this thread's baseline Nehalem architecture. The CPU is an i7-2600K at a stock 3.4 GHz. Memory is 1.5 volt DDR3-1333 in single (1 x 4GB) or dual (2 x 4GB) channel configuration; current Sandy Bridge parts are limited to two memory channels.

                          Note that since the memory stick capacity is 4 GB, the single-channel case used 1 DIMM, which is currently enough memory for the 8-thread case plus the OS. I'm using the same definitions as originally defined by the OP, restated below.

                          And here are the mean execution times in CPU seconds for a full set (4 for nHT, 8 for HT) of current Einstein Global Correlations S5 HF search #1 v1.07 [ubuntu 10.10] (S5GCESSE2) tasks.

                          HT_1 = 24,350 sec
                          HT_2 = 17,400 sec

                          nHT_1 = 14,100 sec
                          nHT_2 = 11,740 sec
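Converting those mean times into system throughput (concurrent tasks divided by mean seconds per task, using the OP's definitions of 8 tasks for HT and 4 for nHT) makes the channel scaling easier to compare; a sketch using the numbers above:

```python
# Throughput implied by the Sandy Bridge results in this post:
# tasks per hour = concurrent tasks / mean CPU seconds per task * 3600.
mean_time = {                  # (hyperthreaded?, channels) -> mean seconds
    (True, 1): 24350, (True, 2): 17400,
    (False, 1): 14100, (False, 2): 11740,
}

def tasks_per_hour(ht, channels):
    tasks = 8 if ht else 4     # 8 concurrent tasks with HT, 4 without
    return tasks * 3600 / mean_time[(ht, channels)]

for (ht, ch) in sorted(mean_time):
    label = "HT" if ht else "nHT"
    print(f"{label}_{ch}: {tasks_per_hour(ht, ch):.3f} tasks/hour")

# Gain from adding a second channel, HT and nHT:
for ht in (False, True):
    gain = tasks_per_hour(ht, 2) / tasks_per_hour(ht, 1) - 1
    print(f"{'HT' if ht else 'nHT'} 1->2 channels: +{gain * 100:.0f}%")
```

On these numbers the second channel is worth roughly +40% with HT and +20% without, a similar pattern to the single-to-double channel gains seen on the Westmere host earlier in the thread.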
                          ____________





                          This material is based upon work supported by the National Science Foundation (NSF) under Grants PHY-1104902, PHY-1104617 and PHY-1105572 and by the Max Planck Gesellschaft (MPG). Any opinions, findings, and conclusions or recommendations expressed in this material are those of the investigators and do not necessarily reflect the views of the NSF or the MPG.

                          Copyright © 2014 Bruce Allen