SMP (Symmetric Multiprocessing)
In Flynn's taxonomy, SMP machines belong to the SM-MIMD (shared-memory MIMD) class. Most multiprocessor systems today use the SMP architecture.
Description
SMP systems allow any processor to work on any task regardless of where in memory the data for that task is stored; with proper operating system support, an SMP system can easily move tasks between processors and thereby balance the load efficiently.
Different SMP systems connect the processors to the shared memory in different ways. The simplest and cheapest approach is a shared system bus. In this case only one processor can access memory at any given moment, which places a substantial limit on the number of processors such systems can support: the more processors there are, the greater the load on the shared bus, and the longer each processor has to wait for the bus to become free before it can access memory. Overall performance of such a system drops off very quickly as the processor count grows, so these systems usually have no more than 2-4 processors. Typical examples of SMP machines with this interconnect are entry-level multiprocessor servers.
The second way to connect the processors is through a crossbar switch. With this interconnect, the shared memory is divided into banks, each bank has its own bus, and every processor is connected to all of these buses and can therefore reach any of the banks. The circuitry is more complex, but it allows the processors to access shared memory simultaneously, which makes it possible to grow the system to 8-16 processors without a noticeable drop in overall performance. Multiprocessor RS/6000 workstations are an example of SMP machines connected this way.
Advantages and disadvantages
SMP is the simplest and most cost-effective way to scale a computing system: you just add processors. Programming is also straightforward: work is split across threads, which exchange data through shared variables in memory using the usual synchronization mechanisms.
SMP is widely used in science, industry, and business, where software is written specifically for multithreaded execution. At the same time, most consumer products, such as word processors and computer games, are written in a way that cannot exploit the strengths of SMP systems. In the case of games this is often because optimizing a program for SMP would cost performance on uniprocessor systems, which until recently made up the bulk of the market. (Modern multi-core processors are simply another hardware implementation of SMP.) Because of the nature of the different programming methods, getting maximum performance would require separate designs for single-processor (single-core) systems and for SMP systems. Even so, programs written for uniprocessor systems do gain a small performance benefit when run on an SMP system, because hardware interrupts, which normally suspend the running program while the kernel handles them, can be processed on an idle processor instead. In most applications the effect shows up not so much as a performance gain as a feeling that the program runs more smoothly. In some applications, notably compilers and some distributed computing projects, performance improves almost in direct proportion to the number of additional processors.
The failure of a single processor causes the whole system to malfunction and requires a reboot of the entire system to disable the faulty processor.
Limit on the number of processors
As the number of processors grows, the demand on memory bus bandwidth grows noticeably. This places a limit on the number of processors in the SMP architecture; modern SMP systems work efficiently with up to about 16 processors.
The cache coherence problem
Every modern processor has multi-level cache memory for faster access to data and instructions held in main memory, which is slower than the processor. In a multiprocessor system the per-processor caches reduce the load on the shared bus or on the crossbar switch, which benefits overall system performance considerably. But because each processor has its own private cache, there is a danger that one processor's cache ends up holding a value of a variable that differs from the value in main memory and in another processor's cache. Imagine that one processor changes a variable in its cache while another processor requests that variable from main memory: the second processor will get the stale value. Or the I/O subsystem writes a new value of a variable to main memory while a processor's cache still holds the old one. Resolving this problem is the job of the cache coherence protocol, which must keep the caches of all processors and main memory consistent ("coherent") without sacrificing overall performance.
Operating system support
SMP support must be built into the operating system; otherwise the additional processors sit idle and the system behaves as a uniprocessor one. Most modern operating systems support symmetric multiprocessing, though to varying degrees.
Multiprocessor support was added to Linux in kernel version 2.0 and substantially improved in version 2.6. Windows NT was designed with multiprocessor support from the start, but the Windows 9x line never supported SMP; consumer Windows gained it only with Windows XP, which was built on Windows 2000 and thus on the Windows NT line.
Alternatives
SMP is only one way to build a multiprocessor machine. Another concept is NUMA, which gives processors separate banks of memory. This lets the processors access memory in parallel and can significantly increase memory bandwidth when the data is tied to a specific process (and therefore to a specific processor). On the other hand, NUMA raises the cost of moving data between processors, which makes load balancing more expensive. The benefits of NUMA are limited to particular workloads, mainly servers where data is often strongly associated with specific tasks or users.
Other concepts are asymmetric multiprocessing (ASMP), in which dedicated specialized processors are used for specific tasks, and cluster multiprocessing (Beowulf), in which not all memory is available to all processors. Asymmetric designs are rarely used (although the high-performance 3D chipsets in modern graphics cards can be viewed as a form of asymmetric multiprocessing), whereas cluster systems are widely used for building very large supercomputers.
Understanding the SMP concept in Linux
I have recently started working on SMP programming, trying to understand the concepts and experiment with examples on Linux. When I searched Google for this, I came across the book below:
UNIX Systems for Modern Architectures: Symmetric Multiprocessing and Caching for Kernel Programmers
The book is really good and delivers what it promises, but I am confused, or do not quite understand, whether the same concepts apply to Linux as well: for example, do virtually addressed caches exist on Linux?
Basically, I am looking for advice on how useful this book would be if I were working exclusively in a Linux environment.
1 answer
Of course, I am assuming that you want to program user-space applications (and not kernel modules).
SMP Linux
This document gives a brief overview of how to use SMP Linux systems for parallel processing. The most up-to-date information on SMP Linux is probably available via the SMP Linux project mailing list; send email to majordomo@vger.rutgers.edu with the text subscribe linux-smp to join the list.
The next question is how much high-level support is available for writing and executing shared memory parallel programs under SMP Linux. Through early 1996, there wasn’t much. Things have changed. For example, there is now a very complete POSIX threads library.
Although performance may be lower than for native shared-memory mechanisms, an SMP Linux system also can use most parallel processing software that was originally developed for a workstation cluster using socket communication. Sockets (see section 3.3) work within an SMP Linux system, and even for multiple SMPs networked as a cluster. However, sockets imply a lot of unnecessary overhead for an SMP. Much of that overhead is within the kernel or interrupt handlers; this worsens the problem because SMP Linux generally allows only one processor to be in the kernel at a time and the interrupt controller is set so that only the boot processor can process interrupts. Despite this, typical SMP communication hardware is so much better than most cluster networks that cluster software will often run better on an SMP than on the cluster for which it was designed.
The remainder of this section discusses SMP hardware, reviews the basic Linux mechanisms for sharing memory across the processes of a parallel program, makes a few observations about atomicity, volatility, locks, and cache lines, and finally gives some pointers to other shared memory parallel processing resources.
SMP Linux supports most Intel MPS version 1.1 or 1.4 compliant machines with up to sixteen 486DX, Pentium, Pentium MMX, Pentium Pro, or Pentium II processors. Unsupported IA32 processors include the Intel 386 and Intel 486SX/SLC processors (the lack of floating point hardware interferes with the SMP mechanisms) and AMD and Cyrix processors (they require different SMP support chips that do not seem to be available at this writing). The only non-MPS, non-IA32 systems supported by SMP Linux are Sun4m multiprocessor SPARC machines.
It is important to understand that the performance of MPS-compliant systems can vary widely. As expected, one cause for performance differences is processor speed: faster clock speeds tend to yield faster systems, and a Pentium Pro processor is faster than a Pentium. However, MPS does not really specify how hardware implements shared memory, but only how that implementation must function from a software point of view; this means that performance is also a function of how the shared memory implementation interacts with the characteristics of SMP Linux and your particular programs.
The primary way in which systems that comply with MPS differ is in how they implement access to physically shared memory.
Does each processor have its own L2 cache?
Many relatively inexpensive systems are organized so that two Pentium processors share a single L2 cache. The bad news is that this causes contention for the cache, seriously degrading performance when running multiple independent sequential programs. The good news is that many parallel programs might actually benefit from the shared cache: if both processors want to access the same line from shared memory, only one of them has to fetch it into the cache, and contention for the bus is averted. The lack of processor affinity also causes less damage with a shared L2 cache. Thus, for parallel programs, it isn't really clear that sharing L2 cache is as harmful as one might expect.
Experience with our dual Pentium shared 256K cache system shows quite a wide range of performance depending on the level of kernel activity required. At worst, we see only about 1.2x speedup. However, we also have seen up to 2.1x speedup, which suggests that compute-intensive SPMD-style code really does profit from the "shared fetch" effect.
Bus configuration?
The first thing to say is that most modern systems connect the processors to one or more PCI buses that in turn are "bridged" to one or more ISA/EISA buses. These bridges add latency, and both EISA and ISA generally offer lower bandwidth than PCI (ISA being the lowest), so disk drives, video cards, and other high-performance devices generally should be connected via a PCI bus interface.
Although an MPS system can achieve good speed-up for many compute-intensive parallel programs even if there is only one PCI bus, I/O operations occur at no better than uniprocessor performance, and probably a little worse due to bus contention from the processors. Thus, if you are looking to speed up I/O, make sure that you get an MPS system with multiple independent PCI buses and I/O controllers (e.g., multiple SCSI chains). You will need to be careful to make sure SMP Linux supports what you get. Also keep in mind that the current SMP Linux essentially allows only one processor in the kernel at any time, so you should choose your I/O controllers carefully to pick ones that minimize the kernel time required for each I/O operation. For really high performance, you might even consider doing raw device I/O directly from user processes, without a system call; this isn't necessarily as hard as it sounds, and need not compromise security (see section 3.3 for a description of the basic techniques).
It is important to note that the relationship between bus speed and processor clock rate has become very fuzzy over the past few years. Although most systems now use the same PCI clock rate, it is not uncommon to find a faster processor clock paired with a slower bus clock. The classic example of this was that the Pentium 133 generally used a faster bus than a Pentium 150, with appropriately strange-looking performance on various benchmarks. These effects are amplified in SMP systems; it is even more important to have a faster bus clock.
Memory interleaving and DRAM technologies?
Memory interleaving actually has nothing whatsoever to do with MPS, but you will often see it mentioned for MPS systems because these systems are typically more demanding of memory bandwidth. Basically, two-way or four-way interleaving organizes RAM so that a block access is accomplished using multiple banks of RAM rather than just one. This provides higher memory access bandwidth, particularly for cache line loads and stores.
The waters are a bit muddied about this, however, because EDO DRAM and various other memory technologies tend to improve similar kinds of operations. An excellent overview of DRAM technologies is given in http://www.pcguide.com/ref/ram/tech.htm.
So, for example, is it better to have 2-way interleaved EDO DRAM or non-interleaved SDRAM? That is a very good question with no simple answer, because both interleaving and exotic DRAM technologies tend to be expensive. The same dollar investment in more ordinary memory configurations generally will give you a significantly larger main memory. Even the slowest DRAM is still a heck of a lot faster than using disk-based virtual memory.
Ok, so you have decided that parallel processing on an SMP is a great thing to do; how do you get started? Well, the first step is to learn a little bit about how shared memory communication really works.
It sounds like you simply have one processor store a value into memory and another processor load it; unfortunately, it isn’t quite that simple. For example, the relationship between processes and processors is very blurry; however, if we have no more active processes than there are processors, the terms are roughly interchangeable. The remainder of this section briefly summarizes the key issues that could cause serious problems, if you were not aware of them: the two different models used to determine what is shared, atomicity issues, the concept of volatility, hardware lock instructions, cache line effects, and Linux scheduler issues.
Shared Everything Vs. Shared Something
Which shared memory model should you use? That is mostly a question of religion. A lot of people like the shared everything model because they do not really need to identify which data structures should be shared at the time they are declared; you simply put locks around potentially-conflicting accesses to shared objects to ensure that only one process(or) has access at any moment. Then again, that really isn't all that simple, so many people prefer the relative safety of shared something.
Shared Everything
The nice thing about sharing everything is that you can easily take an existing sequential program and incrementally convert it into a shared everything parallel program. You do not have to first determine which data need to be accessible by other processors.
Put simply, the primary problem with sharing everything is that any action taken by one processor could affect the other processors. This problem surfaces in two ways: data structures that were never meant to be shared, such as library internals and static buffers, can be accessed concurrently and corrupted; and a fault in any one process, such as a stray pointer write, can damage data that every other process in the program depends on.
Neither of these types of problems is common when shared something is used, because only the explicitly-marked data structures are shared. It also is fairly obvious that shared everything only works if all processors are executing the exact same memory image; you cannot use shared everything across multiple different code images (i.e., can use only SPMD, not general MIMD).
The first threads library that supported SMP Linux parallelism was the now somewhat obsolete bb_threads library, ftp://caliban.physics.utoronto.ca/pub/linux/, a very small library that used the Linux clone() call to fork new, independently scheduled, Linux processes all sharing a single address space. SMP Linux machines can run multiple of these «threads» in parallel because each «thread» is a full Linux process; the trade-off is that you do not get the same «light-weight» scheduling control provided by some thread libraries under other operating systems. The library used a bit of C-wrapped assembly code to install a new chunk of memory as each thread’s stack and to provide atomic access functions for an array of locks (mutex objects). Documentation consisted of a README and a short sample program.
More recently, a version of POSIX threads using clone() has been developed. This library, LinuxThreads, is clearly the preferred shared everything library for use under SMP Linux. POSIX threads are well documented, and the LinuxThreads README and LinuxThreads FAQ are very well done. The primary problem now is simply that POSIX threads have a lot of details to get right and LinuxThreads is still a work in progress. There is also the problem that the POSIX thread standard has evolved through the standardization process, so you need to be a bit careful not to program for obsolete early versions of the standard.
Shared Something
Shared something is really "only share what needs to be shared." This approach can work for general MIMD (not just SPMD) provided that care is taken for the shared objects to be allocated at the same places in each processor's memory map. More importantly, shared something makes it easier to predict and tune performance, debug code, etc. The only problems are that you must decide beforehand which objects need to be shared, and that pointers into the shared segment are only meaningful to other processes if the segment sits at the same address in each of them.
Currently, there are two very similar mechanisms that allow groups of Linux processes to have independent memory spaces, all sharing only a relatively small memory segment. Assuming that you didn't foolishly exclude "System V IPC" when you configured your Linux system, Linux supports a very portable mechanism that has generally become known as "System V Shared Memory." The other alternative is a memory mapping facility whose implementation varies widely across different UNIX systems: the mmap() system call. You can, and should, learn about these calls from the manual pages, but a brief overview of each is given in sections 2.5 and 2.6 to help get you started.
Atomicity And Ordering
No matter which of the above two models you use, the result is pretty much the same: you get a pointer to a chunk of read/write memory that is accessible by all processes within your parallel program. Does that mean I can just have my parallel program access shared memory objects as though they were in ordinary local memory? Well, not quite.
Atomicity refers to the concept that an operation on an object is accomplished as an indivisible, uninterruptible sequence. Unfortunately, sharing memory access does not imply that all operations on data in shared memory occur atomically. Unless special precautions are taken, only simple load or store operations that occur within a single bus transaction (i.e., aligned 8, 16, or 32-bit operations, but not misaligned nor 64-bit operations) are atomic. Worse still, "smart" compilers like GCC will often perform optimizations that could eliminate the memory operations needed to ensure that other processors can see what this processor has done. Fortunately, both these problems can be remedied, leaving only the relationship between access efficiency and cache line size for us to worry about.
However, before discussing these issues, it is useful to point out that all of this assumes that memory references for each processor happen in the order in which they were coded. The Pentium does behave this way, but Intel notes that future processors might not. So, for future processors, keep in mind that it may be necessary to surround some shared memory accesses with instructions that cause all pending memory accesses to complete, thus providing memory access ordering. The CPUID instruction apparently is reserved to have this side-effect.
Volatility
To prevent GCC’s optimizer from buffering values of shared memory objects in registers, all objects in shared memory should be declared as having types with the volatile attribute. If this is done, all shared object reads and writes that require just one word access will occur atomically. For example, suppose that p is a pointer to an integer, where both the pointer and the integer it will point at are in shared memory; the ANSI C declaration might be:
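    volatile int * volatile p;   /* a sketch: both the pointer and the int it points to are marked volatile */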
Note that you can cause a volatile access to an ordinary variable by using a type cast that imposes the volatile attribute. For example, the ordinary int i; can be referenced as a volatile by *((volatile int *) &i); thus, you can explicitly invoke the "overhead" of volatility only where it is critical.
Locks
If you thought that ++i; would always work to add one to a variable i in shared memory, you've got a nasty little surprise coming: even if coded as a single instruction, the load and store of the result are separate memory transactions, and other processors could access i between these two transactions. For example, having two processes both perform ++i; might only increment i by one, rather than by two. According to the Intel Pentium "Architecture and Programming Manual," the LOCK prefix can be used to ensure that any of the following instructions is atomic relative to the data memory location it accesses:
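Per Intel's documentation, the instructions in question are the read-modify-write operations ADD, ADC, AND, BTC, BTR, BTS, CMPXCHG, DEC, INC, NEG, NOT, OR, SBB, SUB, XOR, XADD, and XCHG (XCHG with a memory operand is atomic even without an explicit LOCK prefix).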
However, it probably is not a good idea to use all these operations. For example, XADD did not even exist for the 386, so coding it may cause portability problems.
Examples of GCC in-line assembly code using bit operations for locking are given in the source code for the bb_threads library.
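For illustration, a minimal test-and-set spinlock in that spirit might look like the following on IA32; the type and function names are assumptions for this sketch, and the inline assembly is a generic xchg-based lock rather than the bb_threads code itself.

    /* Sketch of a test-and-set spinlock using GCC inline assembly (IA32).
       The names spinlock_t, spin_lock, and spin_unlock are illustrative. */
    typedef volatile int spinlock_t;

    static inline void spin_lock(spinlock_t *lock)
    {
        int old;
        do {
            /* xchg with a memory operand is implicitly LOCKed: atomically
               swap 1 into *lock and get the previous value back in old. */
            __asm__ __volatile__("xchgl %0, %1"
                                 : "=r" (old), "+m" (*lock)
                                 : "0" (1)
                                 : "memory");
        } while (old != 0);          /* non-zero means another CPU holds it */
    }

    static inline void spin_unlock(spinlock_t *lock)
    {
        __asm__ __volatile__("" ::: "memory");   /* compiler barrier */
        *lock = 0;                               /* plain store releases lock */
    }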
It is important to remember, however, that there is a cost associated with making memory transactions atomic. A locking operation carries a fair amount of overhead and may delay memory activity from other processors, whereas ordinary references may use local cache. The best performance results when locking operations are used as infrequently as possible. Further, these IA32 atomic instructions obviously are not portable to other systems.
Cache Line Size
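Shared memory moves between processors in cache-line-sized units, so two processors that repeatedly update unrelated variables which happen to sit in the same cache line will force that line to bounce between their caches (so-called false sharing), with a corresponding performance cost. A common remedy is to keep data that different processors write frequently in separate cache lines, for example by padding per-processor data; a minimal sketch follows, in which the 32-byte line size and the names are illustrative assumptions.

    /* Hypothetical per-processor counters padded so that each occupies its
       own cache line; L1_LINE is an assumed line size for illustration. */
    #define L1_LINE 32

    struct padded_counter {
        volatile int count;
        char pad[L1_LINE - sizeof(int)];   /* keep neighbours in other lines */
    };

    struct padded_counter counters[2];     /* one per processor */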
Linux Scheduler Issues
Although the whole point of using shared memory for parallel processing is to avoid OS overhead, OS overhead can come from things other than communication per se. We have already said that the number of processes that should be constructed is less than or equal to the number of processors in the machine. But how do you decide exactly how many processes to make?
Alternatively, you could boost the priority of the processes in your parallel program using, for example, the renice command or nice() system call. You must be privileged to increase priority. The idea is simply to force the other processes out of processors so that your program can run simultaneously across all processors. This can be accomplished somewhat more explicitly using the prototype version of SMP Linux at http://luz.cs.nmt.edu/rtlinux/, which offers real-time schedulers.
There is one more twist to this. Suppose that you are developing a program on a machine that is heavily used all day, but will be fully available for parallel execution at night. You need to write and test your code for correctness with the full number of processes, even though you know that your daytime test runs will be slow. Well, they will be very slow if you have processes busy waiting for shared memory values to be changed by other processes that are not currently running (on other processors). The same problem occurs if you develop and test your code on a single-processor system.
The bb_threads ("Bare Bones" threads) library, ftp://caliban.physics.utoronto.ca/pub/linux/, is a remarkably simple library that demonstrates use of the Linux clone() call. The gzip tar file is only 7K bytes! Although this library is essentially made obsolete by the LinuxThreads library discussed in section 2.4, bb_threads is still usable, and it is small and simple enough to serve well as an introduction to use of Linux thread support. Certainly, it is far less daunting to read this source code than to browse the source code for LinuxThreads. In summary, the bb_threads library is a good starting point, but is not really suitable for coding large projects.
The basic program structure for using the bb_threads library is:
The following C program uses the algorithm discussed in section 1.3 to compute the approximate value of Pi using two bb_threads threads.
The LinuxThreads library is a fairly complete and solid implementation of "shared everything" as per the POSIX 1003.1c threads standard. Unlike other POSIX threads ports, LinuxThreads uses the same Linux kernel threads facility (clone()) that is used by bb_threads. POSIX compatibility means that it is relatively easy to port quite a few threaded applications from other systems, and various tutorial materials are available. In short, this is definitely the threads package to use under Linux for developing large-scale threaded programs.
The basic program structure for using the LinuxThreads library is:
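In rough outline (a sketch based on the standard POSIX threads API, not a verbatim reproduction of the original list):

1. Include <pthread.h>, and compile and link against the threads library (with LinuxThreads, typically -D_REENTRANT and -lpthread).
2. Declare the shared data as global or static variables, and declare a pthread_mutex_t for each lock, initialized with PTHREAD_MUTEX_INITIALIZER or pthread_mutex_init().
3. Spawn the worker threads with pthread_create(), passing each one a pointer to its argument.
4. Bracket accesses to shared data with pthread_mutex_lock() and pthread_mutex_unlock().
5. Wait for the workers with pthread_join() before using their results.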
An example parallel computation of Pi using LinuxThreads follows. The algorithm of section 1.3 is used and, as for the bb_threads example, two threads execute in parallel.
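A minimal sketch of such a program follows, assuming the usual numerical-integration formulation of the algorithm (Pi as the integral of 4/(1+x*x) over [0,1]); the interval count and the variable names are illustrative rather than taken from the original listing.

    /* Sketch: two POSIX threads computing Pi by numerical integration. */
    #include <pthread.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define INTERVALS 1000000                     /* assumed interval count */

    static double pi = 0.0;                       /* shared result */
    static pthread_mutex_t pi_lock = PTHREAD_MUTEX_INITIALIZER;

    static void *process(void *arg)
    {
        int id = *(int *) arg;                    /* 0 or 1 */
        double width = 1.0 / INTERVALS;
        double localsum = 0.0;
        int i;

        /* Each thread integrates every other interval of 4/(1+x^2). */
        for (i = id; i < INTERVALS; i += 2) {
            double x = (i + 0.5) * width;
            localsum += 4.0 / (1.0 + x * x);
        }
        localsum *= width;

        /* Fold the partial sum into the shared total under the mutex. */
        pthread_mutex_lock(&pi_lock);
        pi += localsum;
        pthread_mutex_unlock(&pi_lock);
        return NULL;
    }

    int main(void)
    {
        pthread_t t0, t1;
        int id0 = 0, id1 = 1;

        if (pthread_create(&t0, NULL, process, &id0) ||
            pthread_create(&t1, NULL, process, &id1)) {
            fprintf(stderr, "pthread_create failed\n");
            return EXIT_FAILURE;
        }
        pthread_join(t0, NULL);
        pthread_join(t1, NULL);

        printf("Estimation of pi is %f\n", pi);
        return EXIT_SUCCESS;   /* compile with: gcc -D_REENTRANT pi.c -lpthread */
    }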
The System V IPC (Inter-Process Communication) support consists of a number of system calls providing message queues, semaphores, and a shared memory mechanism. Of course, these mechanisms were originally intended to be used for multiple processes to communicate within a uniprocessor system. However, that also means they should work for communication between processes under SMP Linux, no matter which processors those processes run on.
Before going into how these calls are used, it is important to understand that although System V IPC calls exist for things like semaphores and message transmission, you probably should not use them. Why not? These functions are generally slow and serialized under SMP Linux. Enough said.
The basic procedure for creating a group of processes sharing access to a shared memory segment is:
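In outline (a sketch using the standard System V calls rather than a verbatim reproduction of the original steps):

1. Create the segment with shmget(IPC_PRIVATE, size, IPC_CREAT | 0666), which returns a segment identifier.
2. Attach it to the address space with shmat(shmid, 0, 0), which returns a pointer to the shared region.
3. Mark the segment for removal with shmctl(shmid, IPC_RMID, 0) so that it disappears automatically once the last process detaches.
4. fork() the worker processes; each child inherits the attached segment at the same address.
5. When a process is finished with the segment, detach it with shmdt().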
Although the above set-up does require a few system calls, once the shared memory segment has been established, any change made by one processor to a value in that memory will automatically be visible to all processes. Most importantly, each communication operation will occur without the overhead of a system call.
An example C program using System V shared memory segments follows. It computes Pi, using the same algorithm given in section 1.3.
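A minimal sketch in that spirit follows, assuming the same integration algorithm; the segment layout, interval count, and the use of GCC's __sync_lock_test_and_set() builtin (which compiles to an atomic exchange, xchg, on IA32) in place of hand-written inline assembly are assumptions of this sketch, not the original listing.

    /* Sketch: parent and one forked child compute Pi in System V shared
       memory, protecting the shared total with an exchange-based spinlock. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>
    #include <sys/ipc.h>
    #include <sys/shm.h>
    #include <sys/wait.h>

    #define INTERVALS 1000000               /* assumed interval count */

    struct shared_area {
        volatile int lock;                  /* 0 = free, 1 = held */
        volatile double pi;                 /* accumulated result */
    };

    static void acquire(volatile int *lock)
    {
        /* Atomically exchange 1 into *lock; spin until the old value was 0. */
        while (__sync_lock_test_and_set(lock, 1) != 0)
            ;
    }

    static void release(volatile int *lock)
    {
        __sync_lock_release(lock);          /* store 0 with release semantics */
    }

    int main(void)
    {
        int shmid = shmget(IPC_PRIVATE, sizeof(struct shared_area),
                           IPC_CREAT | 0666);
        struct shared_area *shared = shmat(shmid, 0, 0);
        double width = 1.0 / INTERVALS, localsum = 0.0;
        int i, id;

        shared->lock = 0;
        shared->pi = 0.0;
        shmctl(shmid, IPC_RMID, 0);         /* remove once everyone detaches */

        id = (fork() == 0) ? 1 : 0;         /* parent is worker 0, child is 1 */

        for (i = id; i < INTERVALS; i += 2) {
            double x = (i + 0.5) * width;
            localsum += 4.0 / (1.0 + x * x);
        }
        localsum *= width;

        acquire(&shared->lock);
        shared->pi += localsum;
        release(&shared->lock);

        if (id == 1) {                      /* child: detach and exit */
            shmdt((void *) shared);
            _exit(0);
        }
        wait(NULL);                         /* parent: wait for the child */
        printf("Estimation of pi is %f\n", shared->pi);
        shmdt((void *) shared);
        return EXIT_SUCCESS;
    }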
In this example, I have used the IA32 atomic exchange instruction to implement locking. For better performance and portability, substitute a synchronization technique that avoids atomic bus-locking instructions (discussed in section 2.2).
When debugging your code, it is useful to remember that the ipcs command will report the status of the System V IPC facilities currently in use.
In essence, the Linux implementation of mmap() is a plug-in replacement for steps 2, 3, and 4 in the System V shared memory scheme outlined in section 2.5. To create an anonymous shared memory segment:
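A hedged sketch of the call (the length, flags, and error handling are illustrative; very old systems may spell MAP_ANONYMOUS as MAP_ANON or require mapping /dev/zero instead):

    #include <sys/mman.h>

    size_t len = 4096;                         /* assumed segment size */
    void *shared = mmap(NULL, len, PROT_READ | PROT_WRITE,
                        MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (shared == MAP_FAILED) {
        /* handle the error */
    }
    /* fork() after this point; the children inherit the shared mapping. */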
The equivalent to the System V shared memory shmdt() call is munmap():
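    munmap(shared, len);    /* sketch: release the region mapped above; the length must match the mapping */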
In my opinion, there is no real benefit in using mmap() instead of the System V shared memory support.