Subject: Processors | November 20, 2015 - 11:21 PM | Scott Michaud
Tagged: xeon, Intel, FPGA
UPDATE (Nov 26th, 3:30pm ET): A few readers have mentioned that FPGAs take much less than hours to reprogram. I even received an email last night that claims FPGAs can be reprogrammed in "well under a second." This differs from the sources I read when I was looking into their OpenCL capabilities (for potential evolutions of projects) back in ~2013. That said, multiple sources, including one who claims to have personal experience with FPGAs, say that reprogramming does not take hours. Also, I've never used an FPGA myself -- again, I was just researching them to see where some GPU-based projects could go.
Designing integrated circuits, as I've said a few times, is basically a game. You have a blank canvas that you can etch complexity into. The amount of “complexity” depends on your fabrication process, how big your chip is, the intended power, and so forth. Performance depends on how you use the complexity to compute actual tasks. If you know something special about your workload, you can optimize your circuit to do more with less. CPUs are designed to do basically anything, while GPUs assume similar tasks can be run together. If you will only ever run a single program, you can even bake some or all of its source code into hardware called an “application-specific integrated circuit” (ASIC), which is often used for video decoding, rasterizing geometry, and so forth.
This is an old Atom back when Intel was partnered with Altera for custom chips.
FPGAs are circuits that can be baked into a specific application, but can also be reprogrammed later. Changing tasks requires a significant amount of time (sometimes hours) but it is easier than reconfiguring an ASIC, which involves removing it from your system, throwing it in the trash, and printing a new one. FPGAs are not quite as efficient as a dedicated ASIC, but they are about as close as you can get without translating the actual source code directly into a circuit.
Intel, after purchasing FPGA manufacturer Altera, will integrate Altera's technology into Xeons in Q1 2016. This will be useful to offload specific tasks that dominate a server's total workload. According to PC World, they will be integrated as a two-chip package, where both the CPU and FPGA can access the same cache. I'm not sure what form of heterogeneous memory architecture Intel is using, but this would be a great example of a part that could benefit from in-place acceleration. You could imagine a simple function being baked into the FPGA to, I don't know, process large videos in very specific ways without expensive copies.
Again, this is not a consumer product, and may never be. Reprogramming an FPGA can take hours, and I can't think of too many situations where consumers will trade off hours of time to switch tasks with high performance. Then again, it just takes one person to think of a great application for it to take off.
Subject: Processors | May 28, 2015 - 01:45 AM | Scott Michaud
Tagged: xeon, Skylake, Intel, Cannonlake, avx-512
AVX-512 is an instruction set that expands the CPU registers from 256-bit to 512-bit. It comes with a core specification, AVX-512 Foundation, and several extensions that can be added where it makes sense. For instance, AVX-512 Exponential and Reciprocal Instructions (ERI) help solve transcendental problems, which occur in geometry and are useful for GPU-style architectures. As such, it appears in Knights Landing but not anywhere else.
Image Credit: Bits and Chips
Today's rumor is that Skylake, the successor to Broadwell, will not include any AVX-512 support in its consumer parts. According to the lineup, Xeons based on Skylake will support AVX-512 Foundation, Conflict Detection Instructions, Vector Length Extensions, Byte and Word Instructions, and Doubleword and Quadword Instructions. Fused Multiply and Add for 52-bit Integers and Vector Byte Manipulation Instructions will not arrive until Cannonlake shrinks everything down to 10nm.
The main advantage of larger registers is speed. When you can fit 512 bits of data in a single register and operate upon it at once, you are able to do several linked calculations together. AVX-512 has the capability to operate on sixteen 32-bit values at the same time, which is sixteen times the compute throughput of doing just one at a time... if all sixteen undergo the same operation. This is especially useful for games, media, and other vector-based workloads (like science).
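The "sixteen at a time" figure is just register width divided by element width. As a rough sketch of that arithmetic (the function names are made up for illustration, and the 128-bit and 256-bit widths for SSE and AVX are added here for comparison, not taken from the rumor):

```python
# Illustrative arithmetic only: how many values fit in one vector register,
# and how many vector instructions a given amount of work would take.
REGISTER_BITS = {"SSE": 128, "AVX/AVX2": 256, "AVX-512": 512}


def lanes(register_bits: int, element_bits: int) -> int:
    """Number of elements of a given width that fit in one register."""
    return register_bits // element_bits


def vector_ops_needed(n_values: int, register_bits: int, element_bits: int) -> int:
    """Vector instructions needed to apply one operation to n_values elements."""
    per_op = lanes(register_bits, element_bits)
    return -(-n_values // per_op)  # ceiling division


for name, bits in REGISTER_BITS.items():
    # Print the lane count for 32-bit values and the instruction count
    # for one million of them.
    print(name, lanes(bits, 32), vector_ops_needed(1_000_000, bits, 32))
```

This is the best case. It only holds when every lane really does undergo the same operation, which is the caveat in the paragraph above.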
This also makes me question whether the entire Cannonlake product stack will support AVX-512. While vectorization is a cheap way to get performance for suitable workloads, it does take up a large number of transistors (wider memory, extra instructions, etc.). Hopefully Intel will be able to afford the cost with the next die shrink.
Subject: Processors | May 7, 2015 - 11:36 PM | Scott Michaud
Tagged: Intel, xeon, xeon e7 v3, xeon e7
On May 5th, Intel officially announced their new E7 v3 lineup of Xeon processors. This replaces the Xeon E7 v2 processors, which were based on Ivy Bridge-EX, with the newer Haswell-EX architecture. Interestingly, WCCFTech has Broadwell-EX listed next, even though the desktop is expected to mostly skip Broadwell and jump to Skylake in high-performance roles.
The largest model is the E7-8890 v3, which contains eighteen cores fed by a total of 45 MB of L3 cache. Despite the high core count, the E7-8890 v3 has its base frequency set at 2.5 GHz to yield a TDP of 165W. The E7-8891 v3 (165W) and the E7-8893 v3 (140W) drop the core count to ten and four, but raise the base frequency to 2.8 GHz and 3.2 GHz, respectively. The E7-8880L v3 is a low power version, relatively speaking, which also contains eighteen cores that are clocked at 2.0 GHz. This drops its TDP to 115W while still maintaining 45 MB of L3 cache.
Image Credit: WCCFTech
The product stack trickles down from there, but not much further. Just twelve processors are listed in the Xeon E7 segment, which Intel points out in the WCCFTech slides is a significant reduction in SKUs. This suggests that they believe their previous line had too many options for enterprise customers. When dealing with prices in the range of $1,223 - $7,174 USD for bulk orders, it makes sense to offer a little choice to slightly up-sell potential buyers, but too many choices can defeat that purpose. Also, it was a bit humorous to see such an engineering-focused company highlight a reduction of SKUs with a bubble point like it was a technological feature. Not bad, actually quite good as I mentioned above, just a bit funny.
The Xeon E7 v3 is listed as now available, with SKUs ranging from $1223 - $7174 USD.
Subject: Editorial, Processors | March 13, 2015 - 12:29 AM | Tim Verry
Tagged: Xeon D, xeon, servers, opinion, microserver, Intel
Intel dealt a blow to AMD and ARM this week with the introduction of the Xeon Processor D Product Family of low power server SoCs. The new Xeon D chips use Intel’s latest 14nm process and top out at 45W. The chips are aimed at low power high density servers for general web hosting, storage clusters, web caches, and networking hardware.
Currently, Intel has announced two Xeon D chips, the Xeon D-1540 and Xeon D-1520. Both chips are comprised of two dies inside a single package. The main die uses a 14nm process and holds the CPU cores, L3 cache, DDR3 and DDR4 memory controllers, networking controller, PCI-E 3.0, and USB 3.0 while a secondary die using a larger (but easier to implement) manufacturing process hosts the higher latency I/O that would traditionally sit on the southbridge including SATA, PCI-E 2.0, and USB 2.0.
In all, a fairly typical SoC setup from Intel. The specifics are where things get interesting, however. At the top end, Xeon D offers eight Broadwell-based CPU cores (with Hyper-Threading for 16 total threads) clocked at 2.0 GHz base and 2.5 GHz max all-core Turbo (2.6 GHz on a single core). The cores are slightly more efficient than Haswell, especially in this low power setup. The eight cores can tap into 12MB of L3 cache as well as up to 128GB of registered ECC memory (or 64GB unbuffered and/or SODIMMs) in DDR3 1600 MHz or DDR4 2133 MHz flavors. Xeon D also features 24 PCI-E 3.0 lanes (which can be broken up into as many as six PCI-E 3.0 x4 links or used in a x16+x8 configuration, among others), eight PCI-E 2.0 lanes, two 10GbE connections, six SATA III 6.0 Gbps channels, four USB 3.0 ports, and four USB 2.0 ports.
All of this hardware is rolled into a part with a 45W TDP. Needless to say, this is a new level of efficiency for Xeons! Intel chose to compare the new chips to its Atom C2000 “Avoton” (Silvermont-based) SoCs which were also aimed at low power servers and related devices. According to the company, Xeon D offers up to 3.4-times the performance and 1.7-times the performance-per-watt of the top end Atom C2750 processor. Keeping in mind that Xeon D uses approximately twice the power of Atom C2000, it is still looking good for Intel since you are getting more than twice the performance and a more power efficient part. Further, while the TDPs are much higher, Intel has packed Xeon D with a slew of power management technology including Integrated Voltage Regulation (IVR), an energy efficient turbo mode that will analyze whether increased frequencies actually help get work done faster (and if not will reduce turbo to allow extra power to be used elsewhere on the chip or to simply reduce wasted energy), and optional “hardware power management” that allows the processor itself to determine the appropriate power and sleep states independently from the OS.
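The "approximately twice the power" figure falls straight out of Intel's two ratios, since performance-per-watt is performance divided by power. A quick check, using only the numbers quoted above (the function name is made up for illustration):

```python
def implied_power_ratio(perf_ratio: float, perf_per_watt_ratio: float) -> float:
    """Relative power draw implied by a performance ratio and a perf-per-watt ratio.

    perf_per_watt = perf / power, so power = perf / perf_per_watt.
    """
    return perf_ratio / perf_per_watt_ratio


# Intel's claims for Xeon D relative to the Atom C2750 (baseline = 1.0).
xeon_d_vs_atom = implied_power_ratio(perf_ratio=3.4, perf_per_watt_ratio=1.7)
print(round(xeon_d_vs_atom, 2))
```

Both inputs are "up to" marketing figures, so the result is only as good as the workloads Intel picked to measure.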
Being server parts, Xeon D supports ECC, PCI-E Non-Transparent Bridging, memory and PCI-E Checksums, and corrected (errata-free) TSX instructions.
Ars Technica notes that Xeon D is strictly single socket and that Intel has reserved multi-socket servers for its higher end and more expensive Xeons (Haswell-EP). Where does the “high density” I mentioned come from then? Well, by cramming as many Xeon D SoCs on small motherboards with their own RAM and IO into rack mounted cases as possible, of course! It is hard to say just how many Xeon Ds will fit in a 1U, 2U, or even 4U rack mounted system without seeing the associated motherboards and networking hardware needed. Xeon D should fare better than Avoton in this case, since we are looking at higher bandwidth networking links and more PCI-E lanes. However, AMD, with SeaMicro’s Freedom Fabric and its head start on low power x86 and ARM-based Opteron chip research, as well as other ARM-based companies like AppliedMicro (X-Gene), will have a slight density advantage (though the Intel chips will be faster per chip).
Which brings me to my final point. Xeon D truly appears like a shot across both ARM and AMD’s bow. It seems like Intel is not content with its dominant position in the overall server market and is putting its weight into a move to take over the low power server market as well, a niche that ARM and AMD in particular have been actively pursuing. Intel is not quite down to the low power levels that AMD and other ARM-based companies are at, but by bringing Xeon down to 45W (with Atom-based solutions going upwards performance wise), the Intel juggernaut is closing in and I’m interested to see how it all plays out.
Right now, ARM still has the TDP and customization advantage (where customers can create custom chips and cores to suit their exact needs) and AMD will be able to leverage its GPU expertise by including processor graphics for a leg up on highly multi-threaded GPGPU workloads. On the other hand, Intel has the better manufacturing process and engineering budget. Xeon D seems to be the first step towards going after a market that they have in the past not really focused on.
With Intel pushing its weight around, where will that leave the little guys that I have been rooting for in this low power high density server space?
Subject: General Tech | October 27, 2014 - 04:35 PM | Jeremy Hellstrom
Tagged: Haswell-EX, Haswell-EP4S, Intel, server, xeon, Broadwell-DE, Skylake
Intel's release schedules have been slowing down; unfortunately, that is due in large part to the fact that the only competition they face in certain market segments is themselves. For high end servers it looks like we won't see Haswell-EX or EP4S until the second quarter of next year and Skylake chips for entry level servers until after the third quarter. Intel does have to fight for their share of the SoC and low powered chip market. DigiTimes reports the Broadwell-DE family and the C2750 and C2350 should be here in the second quarter, which gives AMD and ARM a chance to gain market share against Intel's current offerings. Along with the arrival of the new chips we will also see older models from Itanium, Xeon, Xeon Phi and Atom be discontinued; some may be gone before the end of the year. You have already heard the bad news about Broadwell-E.
"Intel's next-generation server processors for 2015 including new Haswell-EX (Xeon E7 v3 series) and -EP4S (Xeon E5-4600 v3 series), are scheduled to be released in the second quarter of 2015, giving clients more time to transition to the new platform, according to industry sources."
Here is some more Tech News from around the web:
- iOS 8.1 @ The Inquirer
- How to Get Open Source Android @ Linux.com
- Mozilla to make Firefox OS a tasty filling for a Raspberry Pi @ The Inquirer
- Pesky POS poison won't Backoff @ The Register
- Cisco patches three-year-old remote code-execution hole @ The Register
- Netgear Nighthawk R7000 AC1900 @ Kitguru
- Tech ARP 2014 Mega Giveaway Contest
- WIN a 1TB monster Samsung EVO 840 SSD @ The Register
Subject: General Tech, Motherboards, Processors | September 20, 2014 - 10:51 PM | Scott Michaud
Tagged: xeon, Haswell-EP, ddr4, ddr3, Intel
Well this is interesting and, while not new, is news to me.
The upper-tier Haswell processors ushered DDR4 into the desktops for enthusiasts and servers, but DIMMs are quite expensive and incompatible with the DDR3 sticks that your organization might have been stocking up on. Despite the memory controller being placed on the processor, ASRock has a few motherboards which claim DDR3 support. ASRock, responding to Anandtech's inquiry, confirmed that this is not an error and Intel will launch three SKUs, one eight-core, one ten-core, and one twelve-core, with a DDR3-supporting memory controller.
The three models are:
|Model||E5-2629 v3||E5-2649 v3||E5-2669 v3|
|Cores (Threads)||8 (16)||10 (20)||12 (24)|
|Clock Rate||2.4 GHz||2.3 GHz||2.3 GHz|
The processors themselves might not be cheap or easily attainable, though. There are rumors that Intel will require customers to purchase a minimum quantity. It might not be worth buying these processors unless you have a significant server farm (or similar situation).
Server and Workstation Upgrades
Today, on the eve of the Intel Developer Forum, the company is taking the wraps off its new server and workstation class high performance processors, Xeon E5-2600 v3. Known previously by the code name Haswell-EP, the release marks the entry of the latest microarchitecture from Intel to multi-socket infrastructure. Though we don't have hardware today to offer you in-house benchmarks quite yet, the details Intel shared with me last month in Oregon are simply stunning.
Starting with the E5-2600 v3 processor overview, there are more changes in this product transition than we saw in the move from Sandy Bridge-EP to Ivy Bridge-EP. First and foremost, the v3 Xeons will be available in core counts as high as 18, with HyperThreading allowing for 36 accessible threads in a single CPU socket. A new socket, LGA2011-v3 or R3, allows the Xeon platforms to run a quad-channel DDR4 memory system, very similar to the upgrade we saw with the Haswell-E Core i7-5960X processor we reviewed just last week.
The move to a Haswell-based microarchitecture also means that the Xeon line of processors is getting AVX 2.0, known also as Haswell New Instructions, allowing for 2x the FLOPS per clock per core. It also introduces some interesting changes to Turbo Mode and power delivery we'll discuss in a bit.
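One way to see where a "2x the FLOPS per clock" figure can come from is to count floating-point operations per cycle. The sketch below is back-of-the-envelope arithmetic, not Intel's published breakdown. It assumes 256-bit registers, 32-bit floats, two vector execution units per core in both generations, and that a fused multiply-add counts as two operations per lane. Those unit counts and the function name are assumptions added here, not figures from the briefing.

```python
def peak_flops_per_clock(vector_bits: int, element_bits: int,
                         units: int, flops_per_lane: int) -> int:
    """Peak floating-point operations per clock for one core.

    lanes per unit * number of vector units * operations each lane does
    per instruction (1 for a plain add or multiply, 2 for a fused multiply-add).
    """
    lanes = vector_bits // element_bits
    return lanes * units * flops_per_lane


# Assumption: the previous generation issues one 256-bit multiply and one
# 256-bit add per clock (two units, one operation per lane each).
previous_gen = peak_flops_per_clock(256, 32, units=2, flops_per_lane=1)

# Assumption: Haswell issues two 256-bit fused multiply-adds per clock
# (two units, two operations per lane each).
haswell_ep = peak_flops_per_clock(256, 32, units=2, flops_per_lane=2)

print(previous_gen, haswell_ep, haswell_ep / previous_gen)
```

Under those assumptions the doubling comes entirely from fusing the multiply and the add, so code that cannot be expressed as multiply-adds would not see it.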
Maybe the most interesting architectural change to the Haswell-EP design is per core P-states, allowing each of the up to 18 cores running on a single Xeon processor to run at independent voltages and clocks. This is something that the consumer variants of Haswell do not currently support - every core is tied to the same P-state. It turns out that when you have up to 18 cores on a single die, this ability is crucial to supporting maximum performance on a wide array of compute workloads and to maintain power efficiency. This is also the first processor to allow independent uncore frequency scaling, giving Intel the ability to improve performance with available headroom even if the CPU cores aren't the bottleneck.
Subject: Motherboards | August 30, 2014 - 05:10 PM | Tim Verry
Tagged: xeon, x99 ws, X99, workstation, socket 2011-3, Intel, Haswell-E, asrock
Alongside the X99 Extreme6 for enthusiasts, ASRock has launched the X99 WS (E-ATX) motherboard for professional workstations and servers. The board uses a black PCB, blue aluminum heatsinks on the power phase and PCH areas, and gold colored caps. The E-ATX motherboard uses short screws and a thinner CPU socket backplate to allow it to fit into 1U server cases. You have ample PCI-E slots, DDR4 (and ECC RDIMMs), USB 3.0, Thunderbolt 2 (with an AIC), eSATA, M.2, and SATA III for connectivity. Power is regulated by a 12 power phase "Digi Power" design with high quality Ultra Dual-N MOSFETs, 60A chokes, and 12K Nichicon capacitors.
The X99 WS also uses larger heatsinks, especially around the CPU area. The board features an Intel LGA 2011-3 socket that will accept Haswell-E or Xeon E5-1600/2600 v3 processors up to 18 cores and 160W TDPs. Eight memory slots surround the CPU socket and support a maximum of 128GB of DDR4 3200+ memory. Storage is handled by 10 SATA III 6Gbps ports and a single "Ultra M.2" slot that supports SATA III or PCI-E x4 controller equipped drives.
This workstation board takes advantage of the Extended ATX form factor to cram in additional PCI-E 3.0 slots versus the shorter X99 Extreme6. A PCI-E 2.0 x16 slot sits directly below the CPU socket followed by the (mentioned above) M.2 slot and five PCI-E 3.0 x16 slots. The five PCI-E 3.0 slots allow for 4-Way CrossFireX and 4-Way SLI multi-GPU setups. To take full advantage of this board, you will want to pair it with the higher end Haswell-E parts (5930K, 5960X, et al) with the full 40 lanes rather than the entry level 5820K with its limited 28 PCI-E lanes.
Using internal headers, the X99 WS supports a TPM module, two COM ports, two USB 3.0 ports, four USB 2.0 ports, and a Thunderbolt port (via an extra add-in-card). It features two CPU fan and three chassis fan headers as well.
The X99 WS features the following ports on the rear IO panel.
- 1 x PS/2
- 4 x USB 2.0
- 1 x eSATA
- 4 x USB 3.0
- 2 x Gigabit LAN
  - 1 x Intel I217LM
  - 1 x Intel I210AT
- 7.1 Channel Audio (Realtek ALC1150)
  - 5 x Analog audio outputs
  - 1 x Optical audio output
The X99 WS is available now with a three year manufacturer warranty for $323.99. The price alone pushes this board well into the professional market, but gamers can get most of the way to the WS feature-wise with the X99 Extreme6. The WS goes up against boards like the ASUS X99-Deluxe and MSI X99S XPOWER AC.
Subject: General Tech, Graphics Cards, Processors | July 19, 2014 - 07:05 AM | Scott Michaud
Tagged: Xeon Phi, xeon, Intel, avx-512, avx
It is difficult to know what is actually new information in this Intel blog post, but it is interesting nonetheless. Its topic is the AVX-512 extension to x86, designed for Xeon and Xeon Phi processors and co-processors. Basically, last year, Intel announced "Foundation", the minimum support level for AVX-512, as well as Conflict Detection, Exponential and Reciprocal, and Prefetch, which are optional. That earlier blog post was very much focused on Xeon Phi, but it acknowledged that the instructions will make their way to standard, CPU-like Xeons at around the same time.
This year's blog post brings in a bit more information, especially for common Xeons. While all AVX-512-supporting processors (and co-processors) will support "AVX-512 Foundation", the instruction set extensions are a bit more scattered.
|Extension||Xeon||Xeon Phi (processor)||Xeon Phi (co-processor)|
|Conflict Detection Instructions||Yes||Yes||Yes|
|Exponential and Reciprocal Instructions||No||Yes||Yes|
|Byte and Word Instructions||Yes||No||No|
|Doubleword and Quadword Instructions||Yes||No||No|
|Vector Length Extensions||Yes||No||No|
Source: Intel AVX-512 Blog Post (and my understanding thereof).
So why do we care? Simply put: speed. Vectorization, the purpose of AVX-512, has similar benefits to multiple cores. It is not as flexible as having multiple, unique, independent cores, but it is easier to implement (and works just fine with having multiple cores, too). For an example: imagine that you have to multiply two colors together. The direct way to do it is multiply red with red, green with green, blue with blue, and alpha with alpha. AMD's 3DNow! and, later, Intel's SSE included instructions to multiply two, four-component vectors together. This reduces four similar instructions into a single operation between wider registers.
Smart compilers (and programmers, although that is becoming less common as compilers are pretty good, especially when they are not fighting developers) are able to pack seemingly unrelated data together, too, if they undergo similar instructions. AVX-512 allows for sixteen 32-bit pieces of data to be worked on at the same time. If your pixel only has four, single-precision RGBA data values, but you are looping through 2 million pixels, do four pixels at a time (16 components).
For the record, I basically just described "SIMD" (single instruction, multiple data) as a whole.
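The color example can be sketched in a few lines. This is plain Python standing in for vector hardware, so it shows the data layout rather than any real speed-up. `simd_multiply` and `multiply_images` are made-up names, and the list comprehension is only a stand-in for a single vector multiply instruction.

```python
LANES = 16  # sixteen 32-bit values per 512-bit register


def simd_multiply(a, b):
    """Stand-in for one vector multiply: the same operation on every lane."""
    assert len(a) == len(b) <= LANES
    return [x * y for x, y in zip(a, b)]


def multiply_images(pixels_a, pixels_b):
    """Multiply two lists of RGBA pixels component-wise, four pixels per 'instruction'.

    Returns the resulting pixels and the number of simulated vector operations.
    """
    # Flatten (r, g, b, a) tuples so that four pixels fill sixteen lanes.
    flat_a = [component for pixel in pixels_a for component in pixel]
    flat_b = [component for pixel in pixels_b for component in pixel]

    out = []
    vector_ops = 0
    for i in range(0, len(flat_a), LANES):
        out.extend(simd_multiply(flat_a[i:i + LANES], flat_b[i:i + LANES]))
        vector_ops += 1

    # Regroup the flat result into RGBA tuples.
    pixels = [tuple(out[i:i + 4]) for i in range(0, len(out), 4)]
    return pixels, vector_ops


image = [(1.0, 0.5, 0.25, 1.0)] * 8
tint = [(0.5, 0.5, 0.5, 1.0)] * 8
result, ops = multiply_images(image, tint)
print(result[0], ops)
```

Scaled up to the 2 million pixels mentioned above, that is 8 million component multiplies packed into 500,000 sixteen-wide operations.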
This theory is part of how GPUs became so powerful at certain tasks. They are capable of pushing a lot of data because they can exploit similarities. If your task is full of similar problems, they can just churn through tonnes of data. CPUs have been doing these tricks, too, just without compromising what they do well.
Subject: General Tech, Graphics Cards, Processors | July 2, 2014 - 07:55 AM | Scott Michaud
Tagged: Intel, Xeon Phi, xeon, silvermont, 14nm
Anandtech has just published a large editorial detailing Intel's Knights Landing. Mostly, it is stuff that we already knew from previous announcements and leaks, such as one by VR-Zone from last November (which we reported on). Officially, few details were given back then, except that it would be available as either a PCIe-based add-in board or as a socketed, bootable, x86-compatible processor based on the Silvermont architecture. Its many cores, threads, and 512 bit registers are each pretty weak, compared to Haswell, for instance, but combine to about 3 TFLOPs of double precision performance.
Not enough graphs. Could use another 256...
The best way to imagine it is running a PC with a modern, Silvermont-based Atom processor -- only with up to 288 processors listed in your Task Manager (72 actual cores with quad HyperThreading).
The main limitation of GPUs (and similar coprocessors), however, is memory bandwidth. GDDR5 is often the main bottleneck of compute performance and just about the first thing to be optimized. To compensate, Intel is packaging up-to 16GB of memory (stacked DRAM) on the chip, itself. This RAM is based on "Hybrid Memory Cube" (HMC), developed by Micron Technology, and supported by the Hybrid Memory Cube Consortium (HMCC). While the actual memory used in Knights Landing is derived from HMC, it uses a proprietary interface that is customized for Knights Landing. Its bandwidth is rated at around 500GB/s. For comparison, the NVIDIA GeForce Titan Black has 336.4GB/s of memory bandwidth.
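To put those figures side by side, here is some quick arithmetic on the numbers quoted in this post. "About 3 TFLOPs" and "around 500GB/s" are approximate, so treat the results as ballpark values; the variable names are made up for illustration.

```python
# Figures as quoted in the article (approximate).
CORES = 72
THREADS_PER_CORE = 4
PEAK_DP_FLOPS = 3e12              # double-precision operations per second
HMC_BANDWIDTH = 500e9             # bytes per second, on-package memory
TITAN_BLACK_BANDWIDTH = 336.4e9   # bytes per second, GDDR5

# Entries you would see in Task Manager.
logical_processors = CORES * THREADS_PER_CORE

# How much more bandwidth the on-package memory offers than the GPU.
bandwidth_advantage = HMC_BANDWIDTH / TITAN_BLACK_BANDWIDTH

# Bytes of memory traffic available per floating-point operation at peak.
bytes_per_flop = HMC_BANDWIDTH / PEAK_DP_FLOPS

# A double is 8 bytes, so this is how many operations each loaded value
# must feed to keep the cores busy at peak.
flops_per_double_loaded = 8 / bytes_per_flop

print(logical_processors)
print(round(bandwidth_advantage, 2))
print(round(bytes_per_flop, 3))
print(round(flops_per_double_loaded))
```

Even with roughly one and a half times the Titan Black's bandwidth, each value pulled from memory would need to be reused for dozens of operations to reach peak throughput, which is why memory is the first thing to get optimized.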
Intel and Micron have worked together in the past. In 2006, the two companies formed "IM Flash" to produce the NAND flash for Intel and Crucial SSDs. Crucial is Micron's consumer-facing brand.
So the vision for Knights Landing seems to be the bridge between CPU-like architectures and GPU-like ones. For compute tasks, GPUs edge out CPUs by crunching through bundles of similar tasks at the same time, across many (hundreds of, thousands of) computing units. The difference with (at least socketed) Xeon Phi processors is that, unlike most GPUs, Intel does not rely upon APIs, such as OpenCL, and drivers to translate a handful of functions into bundles of GPU-specific machine language. Instead, especially if the Xeon Phi is your system's main processor, it will run standard, x86-based software. The software will just run slowly, unless it is capable of vectorizing itself and splitting across multiple threads. Obviously, OpenCL (and other APIs) would make this parallelization easy, by their host/kernel design, but it is apparently not required.
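"Splitting across multiple threads" mostly means carving the work into independent slices and combining the partial results. A minimal sketch of that decomposition follows. The function names are made up for illustration, and because CPython's global interpreter lock serializes pure-Python arithmetic, this shows the structure of the split rather than a real speed-up.

```python
from concurrent.futures import ThreadPoolExecutor


def chunk_bounds(n_items, n_workers):
    """Split range(n_items) into n_workers contiguous, near-equal slices."""
    base, extra = divmod(n_items, n_workers)
    bounds, start = [], 0
    for worker in range(n_workers):
        # The first `extra` workers each take one leftover item.
        stop = start + base + (1 if worker < extra else 0)
        bounds.append((start, stop))
        start = stop
    return bounds


def sum_of_squares(data, start, stop):
    """The per-thread work: an independent partial result over one slice."""
    return sum(x * x for x in data[start:stop])


def parallel_sum_of_squares(data, n_workers):
    """Fan the slices out to a thread pool, then combine the partial sums."""
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        futures = [
            pool.submit(sum_of_squares, data, start, stop)
            for start, stop in chunk_bounds(len(data), n_workers)
        ]
        return sum(future.result() for future in futures)


data = list(range(1000))
print(parallel_sum_of_squares(data, n_workers=8))
```

The hard part in real software is rarely the fan-out itself. It is finding work whose slices do not depend on each other, which is the question the rest of this post keeps coming back to.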
It is a cool way that Intel arrives at the same goal, based on their background. Especially when you mix-and-match Xeons and Xeon Phis on the same computer, it is a push toward heterogeneous computing -- with a lot of specialized threads backing up a handful of strong ones. I just wonder if providing a more-direct method of programming will really help developers finally adopt massively parallel coding practices.
I mean, without even considering GPU compute, how efficient is most software at splitting into even two threads? Four threads? Eight threads? Can this help drive heterogeneous development? Or will this product simply try to appeal to those who are already considering it?