From b066d6894d38a2be3f4eb67e974cfdf1275e9da2 Mon Sep 17 00:00:00 2001 From: Keir Fraser Date: Wed, 21 Apr 2010 08:31:31 +0100 Subject: docs: Add tmem documentation Signed-off-by: Dan Magenheimer --- docs/misc/tmem-internals.html | 798 ++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 798 insertions(+) create mode 100644 docs/misc/tmem-internals.html (limited to 'docs/misc/tmem-internals.html') diff --git a/docs/misc/tmem-internals.html b/docs/misc/tmem-internals.html new file mode 100644 index 0000000000..ccc34c374f --- /dev/null +++ b/docs/misc/tmem-internals.html @@ -0,0 +1,798 @@ +

Transcendent Memory Internals in Xen

+

+by Dan Magenheimer, Oracle Corp.

+

+Draft 0.1 -- Updated: 20100324 +

Overview

+

+This document focuses on the internal implementation of +Transcendent Memory (tmem) on Xen. It assumes +that the reader has a basic knowledge of the terminology, objectives, and +functionality of tmem and also has access to the Xen source code. +It corresponds to the Xen 4.0 release, with +patch added to support page deduplication (V2). +

+The primary responsibilities of the tmem implementation are to: +

+

Source Code Organization

+ +

+The source code in Xen that provides the tmem functionality +is divided up into four files: tmem.c, tmem.h, tmem_xen.c, and tmem_xen.h. +The files tmem.c and tmem.h are intended to +be implementation- (and hypervisor-) independent and the other two files +provide the Xen-specific code. This +division is intended to make it easier to port tmem functionality to other +hypervisors, though at this time porting to other hypervisors has not been +attempted. Together, these four files +total less than 4000 lines of C code. +

+Even ignoring the implementation-specific functionality, the +implementation-independent part of tmem has several dependencies on +library functionality (Xen source filenames in parentheses): +

+

+More information about the specific functionality of these +libraries can easily be found through a search engine, via wikipedia, or in the +Xen or Linux source logs so we will not elaborate further here. + +

Prefixes/Abbreviations/Glossary

+ +

+The tmem code uses several prefixes and abbreviations. +Knowledge of these will improve code readability: +

+

+When used in prose, common tmem operations are indicated +with a different font, such as put +and get. + +

Key Data Structures

+ +

+To manage a huge number of pages, efficient data structures +must be carefully selected. +

+Recall that a tmem-enabled guest OS may create one or more +pools with different attributes. It then +puts and gets +pages to/from this pool, identifying the page +with a handle that consists of a pool_id, an +object_id, and a page_id (sometimes +called an index). +This suggests a few obvious core data +structures: +

+

+The most apparent usage of this multi-layer web of data structures +is "top-down" because, in normal operation, the vast majority of tmem +operations invoked by a client are +puts and gets, which require the various +data structures to be walked starting with the client_t, then +a pool_t, then an obj_t, then a pgd_t. +However, there is another highly frequent tmem operation that is not +visible from a client: memory reclamation. +Since tmem attempts to use all spare memory in the system, it must +frequently free up, or evict, +pages. The eviction algorithm will be +explained in more detail later but, in brief, to free memory, ephemeral pages +are removed from the tail of one of the doubly-linked lists, which means that +all of the data structures associated with that page-to-be-removed must be +updated or eliminated and freed. As a +result, each data structure also contains a back-pointer +to its parent, for example every obj_t +contains a pointer to its containing pool_t. +

+This complex web of interconnected data structures is updated constantly and +thus extremely sensitive to careless code changes which, for example, may +result in unexpected hypervisor crashes or non-obvious memory leaks. +On the other hand, the code is fairly well +modularized so, once understood, it is possible to relatively easily switch out +one kind of data structure for another. +To catch problems as quickly as possible when debug is enabled, most of +the data structures are equipped with sentinelsand many inter-function +assumptions are documented and tested dynamically +with assertions. +While these clutter and lengthen the tmem +code substantially, their presence has proven invaluable on many occasions. +

+For completeness, we should also describe a key data structure in the Xen +implementation-dependent code: the tmh_page_list. For security and +performance reasons, pages that are freed due to tmem operations (such +as get) are not immediately put back into Xen's pool +of free memory (aka the Xen heap). +Tmem pages may contain guest-private data that must be scrubbed before +those memory pages are released for the use of other guests. +But if a page is immediately re-used inside of tmem itself, the entire +page is overwritten with new data, so need not be scrubbed. +Since tmem is usually the most frequent +customer of the Xen heap allocation code, it would be a waste of time to scrub +a page, release it to the Xen heap, and then immediately re-allocate it +again. So, instead, tmem maintains +currently-unused pages of memory on its own free list, tmh_page_list, +and returns the pages to Xen only when non-tmem Xen +heap allocation requests would otherwise fail. + +

Scalablility/Concurrency

+ +

Tmem has been designed to be highly scalable. +Since tmem access is invoked similarly in +many ways to asynchronous disk access, a "big SMP" tmem-aware guest +OS can, and often will, invoke tmem hypercalls simultaneously on many different +physical CPUs. And, of course, multiple +tmem-aware guests may independently and simultaneously invoke tmem +hypercalls. While the normal frequency +of tmem invocations is rarely extremely high, some tmem operations such as data +compression or lookups in a very large tree may take tens of thousands of +cycles or more to complete. Measurements +have shown that normal workloads spend no more than about 0.2% (2% with +compression enabled) of CPU time executing tmem operations. +But those familiar with OS scalability issues +recognize that even this limited execution time can create concurrency problems +in large systems and result in poorly-scalable performance. +

+A good locking strategy is critical to concurrency, but also +must be designed carefully to avoid deadlock and livelock problems. For +debugging purposes, tmem supports a "big kernel lock" which disables +concurrency altogether (enabled in Xen with "tmem_lock", but note +that this functionality is rarely tested and likely has bit-rotted). Infrequent +but invasive tmem hypercalls, such as pool creation or the control operations, +are serialized on a single read-write lock, called tmem_rwlock, +which must be held for writing. All other tmem operations must hold this lock +for reading, so frequent operations such as +put and get flush can execute simultaneously +as long as no invasive operations are occurring. +

+Once a pool has been selected, there is a per-pool +read-write lock (pool_rwlock) which +must be held for writing if any transformative operations might occur within +that pool, such as when an obj_t is +created or destroyed. For the highly +frequent operation of finding an obj_t +within a pool, pool_rwlock must be held for reading. +

+Once an object has been selected, there is a per-object +spinlock (obj_spinlock). +This is a spinlock rather than a read-write +lock because nearly all of the most frequent tmem operations (e.g. +put and get flush) +are transformative, in +that they add or remove a page within the object. +This lock is generally taken whenever an +object lookup occurs and released when the tmem operation is complete. +

+Next, the per-client and global ephemeral lists are +protected by a single global spinlock (eph_lists_spinlock) +and the per-client persistent lists are also protected by a single global +spinlock (pers_list_spinlock). +And to complete the description of +implementation-independent locks, if page deduplication is enabled, all pages +for which the first byte match are contained in one of 256 trees that are +protected by one of 256 corresponding read-write locks +(pcd_tree_rwlocks). +

+In the Xen-specific code (tmem_xen.c), page frames (e.g. struct page_info) +that have been released are kept in a list (tmh_page_list) that +is protected by a spinlock (tmh_page_list_lock). +There is also an "implied" lock +associated with compression, which is likely the most time-consuming operation +in all of tmem (of course, only when compression is enabled): A compression +buffer is allocated one-per-physical-cpu early in Xen boot and a pointer to +this buffer is returned to implementation-independent code and used without a +lock. +

+The proper method to avoid deadlocks is to take and release +locks in a very specific predetermined order. +Unfortunately, since tmem data structures must simultaneously be +accessed "top-down" ( +put and get) +and "bottoms-up" +(memory reclamation), more complex methods must be employed: +A trylockmechanism is used (c.f. tmem_try_to_evict_pgp()), +which takes the lock if it is available but returns immediately (rather than +spinning and waiting) if the lock is not available. +When walking the ephemeral list to identify +pages to free, any page that belongs to an object that is locked is simply +skipped. Further, if the page is the +last page belonging to an object, and the pool read-write lock for the pool the +object belongs to is not available (for writing), that object is skipped. +These constraints modify the LRU algorithm +somewhat, but avoid the potential for deadlock. +

+Unfortunately, a livelock was still discovered in this approach: +When memory is scarce and each client is +putting a large number of pages +for exactly one object (and thus holding the object spinlock for that object), +memory reclamation takes a very long time to determine that it is unable to +free any pages, and so the time to do a +put (which eventually fails) becomes linear to the +number of pages in the object! To avoid +this situation, a workaround was added to always ensure a minimum amount of +memory (1MB) is available before any object lock is taken for the client +invoking tmem (see tmem_ensure_avail_pages()). +Other such livelocks (and perhaps deadlocks) +may be lurking. +

+A last issue related to concurrency is atomicity of counters. +Tmem gathers a large number of +statistics. Some of these counters are +informational only, while some are critical to tmem operation and must be +incremented and decremented atomically to ensure, for example, that the number +of pages in a tree never goes negative if two concurrent tmem operations access +the counter exactly simultaneously. Some +of the atomic counters are used for debugging (in assertions) and perhaps need +not be atomic; fixing these may increase performance slightly by reducing +cache-coherency traffic. Similarly, some +of the non-atomic counters may yield strange results to management tools, such +as showing the total number of successful +puts as being higher than the number of +puts attempted. +These are left as exercises for future tmem implementors. + +

Control and Manageability

+ +

+Tmem has a control interface to, for example, set various +parameters and obtain statistics. All +tmem control operations funnel through do_tmem_control() +and other functions supporting tmem control operations are prefixed +with tmemc_. + +

+During normal operation, even if only one tmem-aware guest +is running, tmem may absorb nearly all free memory in the system for its own +use. Then if a management tool wishes to +create a new guest (or migrate a guest from another system to this one), it may +notice that there is insufficient "free" memory and fail the creation +(or migration). For this reason, tmem +introduces a new tool-visible class of memory -- freeable memory -- +and provides a control interface to access +it. All ephemeral memory and all pages on the tmh_page_list +are freeable. To properly access freeable +memory, a management tool must follow a sequence of steps: +

+

+Extensive tmem statistics are available through tmem's +control interface (see tmemc_list and +the separate source for the "xm tmem-list" command and the +xen-tmem-list-parse tool). To maximize +forward/backward compatibility with future tmem and tools versions, statistical +information is passed via an ASCII interface where each individual counter is +identified by an easily parseable two-letter ASCII sequence. + +

Save/Restore/Migrate

+ +

+Another piece of functionality that has a major impact on +the tmem code is support for save/restore of a tmem client and, highly related, +live migration of a tmem client. +Ephemeral pages, by definition, do not need to be saved or +live-migrated, but persistent pages are part of the state of a running VM and +so must be properly preserved. +

+When a save (or live-migrate) of a tmem-enabled VM is initiated, the first step +is for the tmem client to be frozen (see the manageability section). +Next, tmem API version information is +recorded (to avoid possible incompatibility issues as the tmem spec evolves in +the future). Then, certain high-level +tmem structural information specific to the client is recorded, including +information about the existing pools. +Finally, the contents of all persistent pages are recorded. +

+For live-migration, the process is somewhat more complicated. +Ignoring tmem for a moment, recall that in +live migration, the vast majority of the VM's memory is transferred while the +VM is still fully operational. During +each phase, memory pages belonging to the VM that are changed are marked and +then retransmitted during a later phase. +Eventually only a small amount of memory remains, the VM is paused, the +remaining memory is transmitted, and the VM is unpaused on the target machine. +

+The number of persistent tmem pages may be quite large, +possibly even larger than all the other memory used by the VM; so it is +unacceptable to transmit persistent tmem pages during the "paused" +phase of live migration. But if the VM +is still operational, it may be making calls to tmem: +A frozen tmem client will reject any +put operations, but tmem must +still correctly process flushes +(page and object), including implicit flushes due to duplicate +puts. +Fortunately, these operations can only +invalidate tmem pages, not overwrite tmem pages or create new pages. +So, when a live-migrate has been initiated, +the client is frozen. Then during the +"live" phase, tmem transmits all persistent pages, but also records +the handle of all persistent pages that are invalidated. +Then, during the "paused" phase, +only the handles of invalidated persistent pages are transmitted, resulting in +the invalidation on the target machine of any matching pages that were +previously transmitted during the "live" phase. +

+For restore (and on the target machine of a live migration), +tmem must be capable of reconstructing the internal state of the client from +the saved/migrated data. However, it is +not the client itself that is put'ing +the pages but the management tools conducting the restore/migration. +This slightly complicates tmem by requiring +new API calls and new functions in the implementation, but the code is +structured so that duplication is minimized. +Once all tmem data structures for the client are reconstructed, all +persistent pages are recreated and, in the case of live-migration, all +invalidations have been processed and the client has been thawed, the restored +client can be resumed. +

+Finally, tmem's data structures must be cluttered a bit to +support save/restore/migration. Notably, +a per-pool list of persistent pages must be maintained and, during live +migration, a per-client list of invalidated pages must be logged. +A reader of the code will note that these +lists are overlaid into space-sensitive data structures as a union, which may +be more error-prone but eliminates significant space waste. + +

Miscellaneous Tmem Topics

+ +

+Duplicate puts. +One interesting corner case that +significantly complicates the tmem source code is the possibility +of a duplicate +put, +which occurs when two +puts +are requested with the same handle but with possibly different data. +The tmem API addresses + +put-put-get +coherence explicitly: When a duplicate +put occurs, tmem may react one of two ways: (1) The +put may succeed with the old +data overwritten by the new data, or (2) the +put may be failed with the original data flushed and +neither the old nor the new data accessible. +Tmem may not fail the +put and leave the old data accessible. +

+When tmem has been actively working for an extended period, +system memory may be in short supply and it is possible for a memory allocation +for a page (or even a data structure such as a pgd_t) to fail. Thus, +for a duplicate +put, it may be impossible for tmem to temporarily +simultaneously maintain data structures and data for both the original +put and the duplicate +put. +When the space required for the data is +identical, tmem may be able to overwrite in place the old data with +the new data (option 1). But in some circumstances, such as when data +is being compressed, overwriting is not always possible and option 2 must be +performed. +

+Page deduplication and trailing-zero elimination. +When page deduplication is enabled +("tmem_dedup" option to Xen), ephemeral pages for which the contents +are identical -- whether the pages belong +to the same client or different clients -- utilize the same pageframe of +memory. In Xen environments where +multiple domains have a highly similar workload, this can save a substantial +amount of memory, allowing a much larger number of ephemeral pages to be +used. Tmem page deduplication uses +methods similar to the KSM implementation in Linux [ref], but differences between +the two are sufficiently great that tmem does not directly leverage the +code. In particular, ephemeral pages in +tmem are never dirtied, so need never be copied-on-write. +Like KSM, however, tmem avoids hashing, +instead employing red-black trees +that use the entire page contents as the lookup +key. There may be better ways to implement this. +

+Dedup'ed pages may optionally be compressed +("tmem_compress" and "tmem_dedup" Xen options specified), +to save even more space, at the cost of more time. +Additionally, trailing zero elimination (tze) may be applied to dedup'ed +pages. With tze, pages that contain a +significant number of zeroes at the end of the page are saved without the trailing +zeroes; an all-zero page requires no data to be saved at all. +In certain workloads that utilize a large number +of small files (and for which the last partial page of a file is padded with +zeroes), a significant space savings can be realized without the high cost of +compression/decompression. +

+Both compression and tze significantly complicate memory +allocation. This will be discussed more below. +

+Memory accounting. +Accounting is boring, but poor accounting may +result in some interesting problems. In +the implementation-independent code of tmem, most data structures, page frames, +and partial pages (e.g. for compresssion) are billed to a pool, +and thus to a client. Some infrastructure data structures, such as +pools and clients, are allocated with tmh_alloc_infra(), which does not +require a pool to be specified. Two other +exceptions are page content descriptors (pcd_t) +and sharelists (sharelist_t) which +are explicitly not associated with a pool/client by specifying NULL instead of +a pool_t. +(Note to self: +These should probably just use the tmh_alloc_infra() interface too.) +As we shall see, persistent pool pages and +data structures may need to be handled a bit differently, so the +implementation-independent layer calls a different allocation/free routine for +persistent pages (e.g. tmh_alloc_page_thispool()) +than for ephemeral pages (e.g. tmh_alloc_page()). +

+In the Xen-specific layer, we +disregard the pool_t for ephemeral +pages, as we use the generic Xen heap for all ephemeral pages and data +structures.(Denial-of-service attacks +can be handled in the implementation-independent layer because ephemeral pages +are kept in per-client queues each with a counted length. +See the discussion on weights and caps below.) +However we explicitly bill persistent pages +and data structures against the client/domain that is using them. +(See the calls to the Xen routine alloc_domheap_pages() in tmem_xen.h; of +the first argument is a domain, the pages allocated are billed by Xen to that +domain.)This means that a Xen domain +cannot allocate even a single tmem persistent page when it is currently utilizing +its maximum assigned memory allocation! +This is reasonable for persistent pages because, even though the data is +not directly accessible by the domain, the data is permanently saved until +either the domain flushes it or the domain dies. +

+Note that proper accounting requires (even for ephemeral pools) that the same +pool is referenced when memory is freed as when it was allocated, even if the +ownership of a pool has been moved from one client to another (c.f. shared_pool_reassign()). +The underlying Xen-specific information may +not always enforce this for ephemeral pools, but incorrect alloc/free matching +can cause some difficult-to-find memory leaks and bent pointers. +

+Page deduplication is not possible for persistent pools for +accounting reasons: Imagine a page that is created by persistent pool A, which +belongs to a domain that is currently well under its maximum allocation. +Then the pcd_tis matched by persistent pool B, which is +currently at its maximum. +Then the domain owning pool A is destroyed. +Is B beyond its maximum? +(There may be a clever way around this +problem. Exercise for the reader!) +

+Memory allocation. The implementation-independent layer assumes +there is a good fast general-purpose dynamic memory allocator with bounded +response time and efficient use of memory for a very large number of sub-page +allocations. The old xmalloc memory +allocator in Xen was not a good match for this purpose, so was replaced by the +TLSF allocator. Note that the TLSF +allocator is used only for allocations smaller than a page (and, more +precisely, no larger than tmem_subpage_maxsize()); +full pages are allocated by Xen's normal heap allocator. +

+After the TLSF allocator was integrated into Xen, more work +was required so that each client could allocate memory from a separate +independent pool. (See the call to xmem_pool_create()in +tmh_client_init().) +This allows the data structures allocated for the +purpose of supporting persistent pages to be billed to the same client as the +pages themselves. It also allows partial +(e.g. compressed) pages to be properly billed. +Further, when partial page allocations cause internal fragmentation, +this fragmentation can be isolated per-client. +And, when a domain dies, full pages can be freed, rather than only +partial pages. One other change was +required in the TLSF allocator: In the original version, when a TLSF memory +pool was allocated, the first page of memory was also allocated. +Since, for a persistent pool, this page would +be billed to the client, the allocation of the first page failed if the domain +was started at its maximum memory, and this resulted in a failure to create the +memory pool. To avoid this, the code was +changed to delay the allocation of the first page until first use of the memory +pool. +

+Memory allocation interdependency. +As previously described, +pages of memory must be moveable back and forth between the Xen heap and the +tmem ephemeral lists (and page lists). +When tmem needs a page but doesn't have one, it requests one from the +Xen heap (either indirectly via xmalloc, or directly via Xen's alloc_domheap_pages()). +And when Xen needs a page but doesn't have +one, it requests one from tmem (via a call to tmem_relinquish_pages() in Xen's alloc_heap_pages() in page_alloc.c). +This leads to a potential infinite loop! +To break this loop, a new memory flag (MEMF_tmem) was added to Xen +to flag and disallow the loop. +See tmh_called_from_tmem() +in tmem_relinquish_pages(). +Note that the tmem_relinquish_pages() interface allows for memory requests of +order > 0 (multiple contiguous pages), but the tmem implementation disallows +any requests larger than a single page. +

+LRU page reclamation. +Ephemeral pages generally age in +a queue, and the space associated with the oldest -- or least-recently-used -- page is reclaimed when tmem needs more +memory. But there are a few exceptions +to strict LRU queuing. First is when +removal from a queue is constrained by locks, as previously described above. +Second, when an ephemeral pool is shared, unlike a private ephemeral +pool, a +get +does not imply a +flush +Instead, in a shared pool, a +results in the page being promoted to the front of the queue. +Third, when a page that is deduplicated (i.e. +is referenced by more than one pgp_t) +reaches the end of the LRU queue, it is marked as eviction attempted and promoted to the front of the queue; if it +reaches the end of the queue a second time, eviction occurs. +Note that only the pgp_t is evicted; the actual data is only reclaimed if there is no +other pgp_t pointing to the data. +

+All of these modified- LRU algorithms deserve to be studied +carefully against a broad range of workloads. +

+Internal fragmentation. +When +compression or tze is enabled, allocations between a half-page and a full-page +in size are very common and this places a great deal of pressure on even the +best memory allocator. Additionally, +problems may be caused for memory reclamation: When one tmem ephemeral page is +evicted, only a fragment of a physical page of memory might be reclaimed. +As a result, when compression or tze is +enabled, it may take a very large number of eviction attempts to free up a full +contiguous page of memory and so, to avoid near-infinite loops and livelocks, eviction +must be assumed to be able to fail. +While all memory allocation paths in tmem are resilient to failure, very +complex corner cases may eventually occur. +As a result, compression and tze are disabled by default and should be +used with caution until they have been tested with a much broader set of +workloads.(Note to self: The +code needs work.) +

+Weights and caps. +Because +of the just-discussed LRU-based eviction algorithms, a client that uses tmem at +a very high frequency can quickly swamp tmem so that it provides little benefit +to a client that uses it less frequently. +To reduce the possibility of this denial-of-service, limits can be +specified via management tools that are enforced internally by tmem. +On Xen, the "xm tmem-set" command +can specify "weight=<weight>" or "cap=<cap>" +for any client. If weight is non-zero +for a client and the current percentage of ephemeral pages in use by the client +exceeds its share (as measured by the sum of weights of all clients), the next +page chosen for eviction is selected from the requesting client's ephemeral +queue, instead of the global ephemeral queue that contains pages from all +clients.(See client_over_quota().) +Setting a cap for a client is currently a no-op. +

+Shared pools and authentication. +When tmem was first proposed to the linux kernel mailing list +(LKML), there was concern expressed about security of shared ephemeral +pools. The initial tmem implementation only +required a client to provide a 128-bit UUID to identify a shared pool, and the +linux-side tmem implementation obtained this UUID from the superblock of the +shared filesystem (in ocfs2). It was +pointed out on LKML that the UUID was essentially a security key and any +malicious domain that guessed it would have access to any data from the shared +filesystem that found its way into tmem. +Ocfs2 has only very limited security; it is assumed that anyone who can +access the filesystem bits on the shared disk can mount the filesystem and use +it. But in a virtualized data center, +higher isolation requirements may apply. +As a result, a Xen boot option -- "tmem_shared_auth" -- was +added. The option defaults to disabled, +but when it is enabled, management tools must explicitly authenticate (or may +explicitly deny) shared pool access to any client. +On Xen, this is done with the "xm +tmem-shared-auth" command. +

+32-bit implementation. +There was some effort put into getting tmem working on a 32-bit Xen. +However, the Xen heap is limited in size on +32-bit Xen so tmem did not work very well. +There are still 32-bit ifdefs in some places in the code, but things may +have bit-rotted so using tmem on a 32-bit Xen is not recommended. +

+IA-64 implementation. +The vast majority of the tmem +implementation is architecture-independent. +For tmem to run on Xen/ia64, it is believed that only one or two +routines needs to be written.(See the +#ifdef __ia64__ at cli_mfn_to_va().) + +

Known Issues

+ +

Fragmentation.When tmem +is active, all physically memory becomes fragmented +into individual pages. However, the Xen +memory allocator allows memory to be requested in multi-page contiguous +quantities, called order>0 allocations. +(e.g. 2order so +order==4 is sixteen contiguous pages.) +In some cases, a request for a larger order will fail gracefully if no +matching contiguous allocation is available from Xen. +As of Xen 4.0, however, there are several +critical order>0 allocation requests that do not fail gracefully. +Notably, when a domain is created, and +order==4 structure is required or the domain creation will fail. +And shadow paging requires many order==2 +allocations; if these fail, a PV live-migration may fail. +There are likely other such issues. +

+But, fragmentation can occur even without tmem if any domU does +any extensive ballooning; tmem just accelerates the fragmentation. +So the fragmentation problem must be solved +anyway. The best solution is to disallow +order>0 allocations altogether in Xen -- or at least ensure that any attempt +to allocate order>0 can fail gracefully, e.g. by falling back to a sequence +of single page allocations. However this restriction may require a major rewrite +in some of Xen's most sensitive code. +(Note that order>0 allocations during Xen boot and early in domain0 +launch are safe and, if dom0 does not enable tmem, any order>0 allocation by +dom0 is safe, until the first domU is created.) +

+Until Xen can be rewritten to be fragmentation-safe, a small hack +was added in the Xen page +allocator.(See the comment " +memory is scarce" in alloc_heap_pages().) +Briefly, a portion of memory is pre-reserved +for allocations where order>0 and order<9. +(Domain creation uses 2MB pages, but fails +gracefully, and there are no other known order==9 allocations or order>9 +allocations currently in Xen.) +

+NUMA. Tmem assumes that +all memory pages are equal and any RAM page can store a page of data for any +client. This has potential performance +consequences in any NUMA machine where access to far memory is significantly slower than access to near memory. +On nearly all of today's servers, however, +access times to far memory is still +much faster than access to disk or network-based storage, and tmem's primary performance +advantage comes from the fact that paging and swapping are reduced. +So, the current tmem implementation ignores +NUMA-ness; future tmem design for NUMA machines is an exercise left for the +reader. + +

Bibliography

+ +

+(needs work)