From b066d6894d38a2be3f4eb67e974cfdf1275e9da2 Mon Sep 17 00:00:00 2001
From: Keir Fraser <keir.fraser@citrix.com>
Date: Wed, 21 Apr 2010 08:31:31 +0100
Subject: docs: Add tmem documentation

Signed-off-by: Dan Magenheimer <dan.magenheimer@oracle.com>
---
 docs/misc/tmem-internals.html | 798 ++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 798 insertions(+)
 create mode 100644 docs/misc/tmem-internals.html

(limited to 'docs/misc/tmem-internals.html')
diff --git a/docs/misc/tmem-internals.html b/docs/misc/tmem-internals.html
new file mode 100644
index 0000000000..ccc34c374f
--- /dev/null
+++ b/docs/misc/tmem-internals.html
@@ -0,0 +1,798 @@
+<h1>Transcendent Memory Internals in Xen</h1>
+<P>
+by Dan Magenheimer, Oracle Corp.</p>
+<P>
+Draft 0.1 -- Updated: 20100324
+<h2>Overview</h2>
+<P>
+This document focuses on the internal implementation of
+Transcendent Memory (tmem) on Xen.  It assumes
+that the reader has a basic knowledge of the terminology, objectives, and
+functionality of tmem and also has access to the Xen source code.
+It corresponds to the Xen 4.0 release, with
+patch added to support page deduplication (V2).
+<P>
+The primary responsibilities of the tmem implementation are to:
+<ul>
+<li>manage a potentially huge and extremely dynamic
+number of memory pages from a potentially large number of clients (domains)
+with low memory overhead and proper isolation
+<li>provide quick and efficient access to these
+pages with as much concurrency as possible
+<li>enable efficient reclamation and <i>eviction</i> of pages (e.g. when
+memory is fully utilized)
+<li>optionally, increase page density through compression and/or
+deduplication
+<li>where necessary, properly assign and account for
+memory belonging to guests to avoid malicious and/or accidental unfairness
+and/or denial-of-service
+<li>record utilization statistics and make them available to management tools
+</ul>
+<h2>Source Code Organization</h2>
+
+<P>
+The source code in Xen that provides the tmem functionality
+is divided up into four files: tmem.c, tmem.h, tmem_xen.c, and tmem_xen.h.
+The files tmem.c and tmem.h are intended to
+be implementation- (and hypervisor-) independent and the other two files
+provide the Xen-specific code.  This
+division is intended to make it easier to port tmem functionality to other
+hypervisors, though at this time porting to other hypervisors has not been
+attempted.  Together, these four files
+total less than 4000 lines of C code.
+<P>
+Even ignoring the implementation-specific functionality, the
+implementation-independent part of tmem has several dependencies on
+library functionality (Xen source filenames in parentheses):
+<ul>
+<li>
+a good fast general-purpose dynamic memory
+allocator with bounded response time and efficient use of memory for a very
+large number of sub-page allocations.  To
+achieve this in Xen, the bad old memory allocator was replaced with a
+slightly-modified version of TLSF (xmalloc_tlsf.c), first ported to Linux by
+Nitin Gupta for compcache.
+<li>
+good tree data structure libraries, specifically
+<i>red-black</i> trees (rbtree.c) and <i>radix</i> trees (radix-tree.c).
+Code for these was borrowed for Linux and adapted for tmem and Xen.
+<li>
+good locking and list code.  Both of these existed in Xen and required
+little or no change.
+<li>
+optionally, a good fast lossless compression
+library.  The Xen implementation added to
+support tmem uses LZO1X (lzo.c), also ported for Linux by Nitin Gupta.
+</ul>
+<P>
+More information about the specific functionality of these
+libraries can easily be found through a search engine, via wikipedia, or in the
+Xen or Linux source logs so we will not elaborate further here.
+
+<h2>Prefixes/Abbreviations/Glossary</h2>
+
+<P>
+The tmem code uses several prefixes and abbreviations.
+Knowledge of these will improve code readability:
+<ul>
+<li>
+<i>tmh</i> ==
+transcendent memory host.  Functions or
+data structures that are defined by the implementation-specific code, i.e. the
+Xen host code
+<li>
+<i>tmemc</i>
+== transcendent memory control.
+Functions or data structures that provide management tool functionality,
+rather than core tmem operations.
+<li>
+<i>cli </i>or
+<i>client</i> == client.
+The tmem generic term for a domain or a guest OS.
+</ul>
+<P>
+When used in prose, common tmem operations are indicated
+with a different font, such as <big><kbd>put</kbd></big>
+and <big><kbd>get</kbd></big>.
+
+<h2>Key Data Structures</h2>
+
+<P>
+To manage a huge number of pages, efficient data structures
+must be carefully selected.
+<P>
+Recall that a tmem-enabled guest OS may create one or more
+pools with different attributes.  It then
+<kbd>put</kbd></big>s and <kbd>get</kbd></big>s
+pages to/from this pool, identifying the page
+with a <i>handle</i> that consists of a <i>pool_id</i>, an <i>
+object_id</i>, and a <i>page_id </i>(sometimes
+called an <i>index</i>).
+This suggests a few obvious core data
+structures:
+<ul>
+<li>
+When a guest OS first calls tmem, a <i>client_t</i> is created to contain
+and track all uses of tmem by that guest OS.  Among
+other things, a <i>client_t</i> keeps pointers
+to a fixed number of pools (16 in the current Xen implementation).
+<li>
+When a guest OS requests a new pool, a <i>pool_t</i> is created.
+Some pools are shared and are kept in a
+sharelist (<i>sharelist_t</i>) which points
+to all the clients that are sharing the pool.
+Since an <i>object_id</i> is 64-bits,
+a <i>pool_t</i> must be able to keep track
+of a potentially very large number of objects.
+To do so, it maintains a number of parallel trees (256 in the current
+Xen implementation) and a hash algorithm is applied to the <i>object_id</i>
+to select the correct tree.
+Each tree element points to an object.
+Because an <i>object_id</i> usually represents an <i>inode</i>
+(a unique file number identifier), and <i>inode</i> numbers
+are fairly random, though often &quot;clumpy&quot;, a <i>red-black tree</i>
+is used.
+<li>
+When a guest first
+<kbd>put</kbd></big>s a page to a pool with an as-yet-unused <i>object_id,</i> an
+<i>obj_t</i> is created.  Since a <i
+>page_id</i> is usually an index into a file,
+it is often a small number, but may sometimes be very large (up to
+32-bits).  A <i>radix tree</i> is a good data structure to contain items
+with this kind of index distribution.
+<li>
+When a page is
+<kbd>put</kbd></big>, a page descriptor, or <i>pgp_t</i>, is created, which
+among other things will point to the storage location where the data is kept.
+In the normal case the pointer is to a <i>pfp_t</i>, which is an
+implementation-specific datatype representing a physical pageframe in memory
+(which in Xen is a &quot;struct page_info&quot;).
+When deduplication is enabled, it points to
+yet another data structure, a <i>pcd_</i>t
+(see below).  When compression is enabled
+(and deduplication is not), the pointer points directly to the compressed data.
+For reasons we will see shortly, each <i>pgp_t</i> that represents
+an <i>ephemeral</i> page (that is, a page placed
+in an <i>ephemeral</i> pool) is also placed
+into two doubly-linked linked lists, one containing all ephemeral pages
+<kbd>put</kbd></big> by the same client and one
+containing all ephemeral pages across all clients (&quot;global&quot;).
+<li>
+When deduplication is enabled, multiple <i>pgp_</i>t's may need to point to
+the same data, so another data structure (and level of indirection) is used
+called a page content descriptor, or <i>pcd_t</i>.
+Multiple page descriptors (<i>pgp_t</i>'s) may point to the same <i>pcd_t</i>.
+The <i>pcd_t</i>, in turn, points to either a <i>pfp_t</i>
+(if a full page of data), directly to a
+location in memory (if the page has been compressed or trailing zeroes have
+been eliminated), or even a NULL pointer (if the page contained all zeroes and
+trailing zero elimination is enabled).
+</ul>
+<P>
+The most apparent usage of this multi-layer web of data structures
+is &quot;top-down&quot; because, in normal operation, the vast majority of tmem
+operations invoked by a client are
+<kbd>put</kbd></big>s and <kbd>get</kbd></big>s, which require the various
+data structures to be walked starting with the <i>client_t</i>, then
+a <i>pool_t</i>, then an <i>obj_t</i>, then a <i>pgd_t</i>.
+However, there is another highly frequent tmem operation that is not
+visible from a client: memory reclamation.
+Since tmem attempts to use all spare memory in the system, it must
+frequently free up, or <i>evict</i>,
+pages.  The eviction algorithm will be
+explained in more detail later but, in brief, to free memory, ephemeral pages
+are removed from the tail of one of the doubly-linked lists, which means that
+all of the data structures associated with that page-to-be-removed must be
+updated or eliminated and freed.  As a
+result, each data structure also contains a <i>back-pointer</i>
+to its parent, for example every <i>obj_t</i>
+contains a pointer to its containing <i>pool_t</i>.
+<P>
+This complex web of interconnected data structures is updated constantly and
+thus extremely sensitive to careless code changes which, for example, may
+result in unexpected hypervisor crashes or non-obvious memory leaks.
+On the other hand, the code is fairly well
+modularized so, once understood, it is possible to relatively easily switch out
+one kind of data structure for another.
+To catch problems as quickly as possible when debug is enabled, most of
+the data structures are equipped with <i>sentinels</i>and many inter-function
+assumptions are documented and tested dynamically
+with <i>assertions</i>.
+While these clutter and lengthen the tmem
+code substantially, their presence has proven invaluable on many occasions.
+<P>
+For completeness, we should also describe a key data structure in the Xen
+implementation-dependent code: the <i>tmh_page_list</i>. For security and
+performance reasons, pages that are freed due to tmem operations (such
+as <kbd>get</kbd></big>) are not immediately put back into Xen's pool
+of free memory (aka the Xen <i>heap</i>).
+Tmem pages may contain guest-private data that must be <i>scrubbed</i> before
+those memory pages are released for the use of other guests.
+But if a page is immediately re-used inside of tmem itself, the entire
+page is overwritten with new data, so need not be scrubbed.
+Since tmem is usually the most frequent
+customer of the Xen heap allocation code, it would be a waste of time to scrub
+a page, release it to the Xen heap, and then immediately re-allocate it
+again.  So, instead, tmem maintains
+currently-unused pages of memory on its own free list, <i>tmh_page_list</i>,
+and returns the pages to Xen only when non-tmem Xen
+heap allocation requests would otherwise fail.
+
+<h2>Scalablility/Concurrency</h2>
+
+<P>Tmem has been designed to be highly scalable.
+Since tmem access is invoked similarly in
+many ways to asynchronous disk access, a &quot;big SMP&quot; tmem-aware guest
+OS can, and often will, invoke tmem hypercalls simultaneously on many different
+physical CPUs.  And, of course, multiple
+tmem-aware guests may independently and simultaneously invoke tmem
+hypercalls.  While the normal frequency
+of tmem invocations is rarely extremely high, some tmem operations such as data
+compression or lookups in a very large tree may take tens of thousands of
+cycles or more to complete.  Measurements
+have shown that normal workloads spend no more than about 0.2% (2% with
+compression enabled) of CPU time executing tmem operations.
+But those familiar with OS scalability issues
+recognize that even this limited execution time can create concurrency problems
+in large systems and result in poorly-scalable performance.
+<P>
+A good locking strategy is critical to concurrency, but also
+must be designed carefully to avoid deadlock and <i>livelock</i> problems.  For
+debugging purposes, tmem supports a &quot;big kernel lock&quot; which disables
+concurrency altogether (enabled in Xen with &quot;tmem_lock&quot;, but note
+that this functionality is rarely tested and likely has bit-rotted). Infrequent
+but invasive tmem hypercalls, such as pool creation or the control operations,
+are serialized on a single <i>read-write lock</i>, called tmem_rwlock,
+which must be held for writing.  All other tmem operations must hold this lock
+for reading, so frequent operations such as
+<kbd>put</kbd></big> and <kbd>get</kbd></big> <kbd>flush</kbd></big> can execute simultaneously
+as long as no invasive operations are occurring.
+<P>
+Once a pool has been selected, there is a per-pool
+read-write lock (<i>pool_rwlock</i>) which
+must be held for writing if any transformative operations might occur within
+that pool, such as when an<i> obj_t</i> is
+created or destroyed.  For the highly
+frequent operation of finding an<i> obj_t</i>
+within a pool, pool_rwlock must be held for reading.
+<P>
+Once an object has been selected, there is a per-object
+spinlock (<i>obj_spinlock)</i>.
+This is a spinlock rather than a read-write
+lock because nearly all of the most frequent tmem operations (e.g.
+<kbd>put</kbd></big> and <kbd>get</kbd></big> <kbd>flush</kbd></big>)
+are transformative, in
+that they add or remove a page within the object.
+This lock is generally taken whenever an
+object lookup occurs and released when the tmem operation is complete.
+<P>
+Next, the per-client and global ephemeral lists are
+protected by a single global spinlock (<i>eph_lists_</i>spinlock)
+and the per-client persistent lists are also protected by a single global
+spinlock (<i>pers_list_spinlock</i>).
+And to complete the description of
+implementation-independent locks, if page deduplication is enabled, all pages
+for which the first byte match are contained in one of 256 trees that are
+protected by one of 256 corresponding read-write locks
+(<i>pcd_tree_rwlocks</i>).
+<P>
+In the Xen-specific code (tmem_xen.c), page frames (e.g.  struct page_info)
+that have been released are kept in a list (<i>tmh_page_list</i>) that
+is protected by a spinlock (<i>tmh_page_list_lock</i>).
+There is also an &quot;implied&quot; lock
+associated with compression, which is likely the most time-consuming operation
+in all of tmem (of course, only when compression is enabled): A compression
+buffer is allocated one-per-physical-cpu early in Xen boot and a pointer to
+this buffer is returned to implementation-independent code and used without a
+lock.
+<P>
+The proper method to avoid deadlocks is to take and release
+locks in a very specific predetermined order.
+Unfortunately, since tmem data structures must simultaneously be
+accessed &quot;top-down&quot; (
+<kbd>put</kbd></big> and <kbd>get</kbd></big>)
+and &quot;bottoms-up&quot;
+(memory reclamation), more complex methods must be employed:
+A <i>trylock</i>mechanism is used (c.f. <i>tmem_try_to_evict_pgp()</i>),
+which takes the lock if it is available but returns immediately (rather than
+spinning and waiting) if the lock is not available.
+When walking the ephemeral list to identify
+pages to free, any page that belongs to an object that is locked is simply
+skipped.  Further, if the page is the
+last page belonging to an object, and the pool read-write lock for the pool the
+object belongs to is not available (for writing), that object is skipped.
+These constraints modify the LRU algorithm
+somewhat, but avoid the potential for deadlock.
+<P>
+Unfortunately, a livelock was still discovered in this approach:
+When memory is scarce and each client is
+<kbd>put</kbd></big>ting a large number of pages
+for exactly one object (and thus holding the object spinlock for that object),
+memory reclamation takes a very long time to determine that it is unable to
+free any pages, and so the time to do a
+<kbd>put</kbd></big> (which eventually fails) becomes linear to the
+number of pages in the object!  To avoid
+this situation, a workaround was added to always ensure a minimum amount of
+memory (1MB) is available before any object lock is taken for the client
+invoking tmem (see <i>tmem_ensure_avail_pages()</i>).
+Other such livelocks (and perhaps deadlocks)
+may be lurking.
+<P>
+A last issue related to concurrency is atomicity of counters.
+Tmem gathers a large number of
+statistics.  Some of these counters are
+informational only, while some are critical to tmem operation and must be
+incremented and decremented atomically to ensure, for example, that the number
+of pages in a tree never goes negative if two concurrent tmem operations access
+the counter exactly simultaneously.  Some
+of the atomic counters are used for debugging (in assertions) and perhaps need
+not be atomic; fixing these may increase performance slightly by reducing
+cache-coherency traffic.  Similarly, some
+of the non-atomic counters may yield strange results to management tools, such
+as showing the total number of successful
+<kbd>put</kbd></big>s as being higher than the number of
+<kbd>put</kbd></big>s attempted.
+These are left as exercises for future tmem implementors.
+
+<h2>Control and Manageability</h2>
+
+<P>
+Tmem has a control interface to, for example, set various
+parameters and obtain statistics.  All
+tmem control operations funnel through <i>do_tmem_control()</i>
+and other functions supporting tmem control operations are prefixed
+with <i>tmemc_</i>.
+
+<P>
+During normal operation, even if only one tmem-aware guest
+is running, tmem may absorb nearly all free memory in the system for its own
+use.  Then if a management tool wishes to
+create a new guest (or migrate a guest from another system to this one), it may
+notice that there is insufficient &quot;free&quot; memory and fail the creation
+(or migration).  For this reason, tmem
+introduces a new tool-visible class of memory -- <i>freeable</i> memory --
+and provides a control interface to access
+it.  All ephemeral memory and all pages on the <i>tmh_page_list</i>
+are freeable. To properly access freeable
+memory, a management tool must follow a sequence of steps:
+<ul>
+<li>
+<i>freeze</i>
+tmem:When tmem is frozen, all 
+<kbd>put</kbd></big>s fail, which ensures that no
+additional memory may be absorbed by tmem.
+(See <i>tmemc_freeze_pools()</i>, and
+note that individual clients may be frozen, though this functionality may be
+used only rarely.)
+<li>
+<i>query freeable MB: </i>If all freeable memory were released to the Xen
+heap, this is the amount of memory (in MB) that would be freed.
+See <i>tmh_freeable_pages()</i>.
+<li>
+<i>flush</i>:
+Tmem may be requested to flush, or relinquish, a certain amount of memory, e.g.
+back to the Xen heap.  This amount is
+specified in KB.  See <i
+>tmemc_flush_mem()</i> and <i
+>tmem_relinquish_npages()</i>.
+<li>
+At this point the management tool may allocate
+the memory, e.g. using Xen's published interfaces.
+<li>
+<i>thaw</i>
+tmem: This terminates the freeze, allowing tmem to accept 
+<kbd>put</kbd></big>s again.
+</ul>
+<P>
+Extensive tmem statistics are available through tmem's
+control interface (see <i>tmemc_list </i>and
+the separate source for the &quot;xm tmem-list&quot; command and the
+xen-tmem-list-parse tool).  To maximize
+forward/backward compatibility with future tmem and tools versions, statistical
+information is passed via an ASCII interface where each individual counter is
+identified by an easily parseable two-letter ASCII sequence.
+
+<h2>Save/Restore/Migrate</h2>
+
+<P>
+Another piece of functionality that has a major impact on
+the tmem code is support for save/restore of a tmem client and, highly related,
+live migration of a tmem client.
+Ephemeral pages, by definition, do not need to be saved or
+live-migrated, but persistent pages are part of the state of a running VM and
+so must be properly preserved.
+<P>
+When a save (or live-migrate) of a tmem-enabled VM is initiated, the first step
+is for the tmem client to be frozen (see the manageability section).
+Next, tmem API version information is
+recorded (to avoid possible incompatibility issues as the tmem spec evolves in
+the future).  Then, certain high-level
+tmem structural information specific to the client is recorded, including
+information about the existing pools.
+Finally, the contents of all persistent pages are recorded.
+<P>
+For live-migration, the process is somewhat more complicated.
+Ignoring tmem for a moment, recall that in
+live migration, the vast majority of the VM's memory is transferred while the
+VM is still fully operational.  During
+each phase, memory pages belonging to the VM that are changed are marked and
+then retransmitted during a later phase.
+Eventually only a small amount of memory remains, the VM is paused, the
+remaining memory is transmitted, and the VM is unpaused on the target machine.
+<P>
+The number of persistent tmem pages may be quite large,
+possibly even larger than all the other memory used by the VM; so it is
+unacceptable to transmit persistent tmem pages during the &quot;paused&quot;
+phase of live migration.  But if the VM
+is still operational, it may be making calls to tmem:
+A frozen tmem client will reject any 
+<big><kbd>put</kbd></big> operations, but tmem must
+still correctly process <big><kbd>flush</kbd></big>es
+(page and object), including implicit flushes due to duplicate 
+<big><kbd>put</kbd></big>s.
+Fortunately, these operations can only
+invalidate tmem pages, not overwrite tmem pages or create new pages.
+So, when a live-migrate has been initiated,
+the client is frozen.  Then during the
+&quot;live&quot; phase, tmem transmits all persistent pages, but also records
+the handle of all persistent pages that are invalidated.
+Then, during the &quot;paused&quot; phase,
+only the handles of invalidated persistent pages are transmitted, resulting in
+the invalidation on the target machine of any matching pages that were
+previously transmitted during the &quot;live&quot; phase.
+<P>
+For restore (and on the target machine of a live migration),
+tmem must be capable of reconstructing the internal state of the client from
+the saved/migrated data.  However, it is
+not the client itself that is <big><kbd>put</kbd></big>'ing
+the pages but the management tools conducting the restore/migration.
+This slightly complicates tmem by requiring
+new API calls and new functions in the implementation, but the code is
+structured so that duplication is minimized.
+Once all tmem data structures for the client are reconstructed, all
+persistent pages are recreated and, in the case of live-migration, all
+invalidations have been processed and the client has been thawed, the restored
+client can be resumed.
+<P>
+Finally, tmem's data structures must be cluttered a bit to
+support save/restore/migration.  Notably,
+a per-pool list of persistent pages must be maintained and, during live
+migration, a per-client list of invalidated pages must be logged.
+A reader of the code will note that these
+lists are overlaid into space-sensitive data structures as a union, which may
+be more error-prone but eliminates significant space waste.
+
+<h2>Miscellaneous Tmem Topics</h2>
+
+<P>
+<i><b>Duplicate <big><kbd>puts</kbd></big></b></i>.
+One interesting corner case that
+significantly complicates the tmem source code is the possibility
+of a <i>duplicate</i>
+<big><kbd>put</kbd></big>,
+which occurs when two
+<big><kbd>put</kbd></big>s
+are requested with the same handle but with possibly different data.
+The tmem API addresses
+<i>
+<big><kbd>put</kbd></big>-<big><kbd>put</kbd></big>-<big><kbd>get</kbd></big>
+coherence</i> explicitly: When a duplicate
+<big><kbd>put</kbd></big> occurs, tmem may react one of two ways: (1) The 
+<big><kbd>put</kbd></big> may succeed with the old
+data overwritten by the new data, or (2) the
+<big><kbd>put</kbd></big> may be failed with the original data flushed and
+neither the old nor the new data accessible.
+Tmem may <i>not</i> fail the 
+<big><kbd>put</kbd></big> and leave the old data accessible.
+<P>
+When tmem has been actively working for an extended period,
+system memory may be in short supply and it is possible for a memory allocation
+for a page (or even a data structure such as a <i>pgd_t</i>) to fail. Thus,
+for a duplicate 
+<big><kbd>put</kbd></big>, it may be impossible for tmem to temporarily
+simultaneously maintain data structures and data for both the original 
+<big><kbd>put</kbd></big> and the duplicate 
+<big><kbd>put</kbd></big>.
+When the space required for the data is
+identical, tmem may be able to overwrite <i>in place </i>the old data with
+the new data (option 1).  But in some circumstances, such as when data
+is being compressed, overwriting is not always possible and option 2 must be
+performed.
+<P>
+<i><b>Page deduplication and trailing-zero elimination.</b></i>
+When page deduplication is enabled
+(&quot;tmem_dedup&quot; option to Xen), ephemeral pages for which the contents
+are identical -- whether the pages belong
+to the same client or different clients -- utilize the same pageframe of
+memory.  In Xen environments where
+multiple domains have a highly similar workload, this can save a substantial
+amount of memory, allowing a much larger number of ephemeral pages to be
+used.  Tmem page deduplication uses
+methods similar to the KSM implementation in Linux [ref], but differences between
+the two are sufficiently great that tmem does not directly leverage the
+code.  In particular, ephemeral pages in
+tmem are never dirtied, so need never be <i>copied-on-write</i>.
+Like KSM, however, tmem avoids hashing,
+instead employing <i>red-black trees</i>
+that use the entire page contents as the <i>lookup
+key</i>.  There may be better ways to implement this.
+<P>
+Dedup'ed pages may optionally be compressed
+(&quot;tmem_compress&quot; and &quot;tmem_dedup&quot; Xen options specified),
+to save even more space, at the cost of more time.
+Additionally, <i>trailing zero elimination (tze)</i> may be applied to dedup'ed
+pages.  With tze, pages that contain a
+significant number of zeroes at the end of the page are saved without the trailing
+zeroes; an all-zero page requires no data to be saved at all.
+In certain workloads that utilize a large number
+of small files (and for which the last partial page of a file is padded with
+zeroes), a significant space savings can be realized without the high cost of
+compression/decompression.
+<P>
+Both compression and tze significantly complicate memory
+allocation.  This will be discussed more below.
+<P>
+<b><i>Memory accounting</i>.</b>
+Accounting is boring, but poor accounting may
+result in some interesting problems.  In
+the implementation-independent code of tmem, most data structures, page frames,
+and partial pages (e.g. for compresssion) are <i>billed</i> to a pool,
+and thus to a client.  Some <i>infrastructure</i> data structures, such as
+pools and clients, are allocated with <i>tmh_alloc_infra()</i>, which does not
+require a pool to be specified.  Two other
+exceptions are page content descriptors (<i>pcd_t</i>)
+and sharelists (<i>sharelist_t</i>) which
+are explicitly not associated with a pool/client by specifying NULL instead of
+a <i>pool_t</i>.
+(Note to self:
+These should probably just use the <i>tmh_alloc_infra()</i> interface too.)
+As we shall see, persistent pool pages and
+data structures may need to be handled a bit differently, so the
+implementation-independent layer calls a different allocation/free routine for
+persistent pages (e.g. <i>tmh_alloc_page_thispool()</i>)
+than for ephemeral pages (e.g. <i>tmh_alloc_page()</i>).
+<P>
+In the Xen-specific layer, we
+disregard the <i>pool_t</i> for ephemeral
+pages, as we use the generic Xen heap for all ephemeral pages and data
+structures.(Denial-of-service attacks
+can be handled in the implementation-independent layer because ephemeral pages
+are kept in per-client queues each with a counted length.
+See the discussion on weights and caps below.)
+However we explicitly bill persistent pages
+and data structures against the client/domain that is using them.
+(See the calls to the Xen routine <i>alloc_domheap_pages() </i>in tmem_xen.h; of
+the first argument is a domain, the pages allocated are billed by Xen to that
+domain.)This means that a Xen domain
+cannot allocate even a single tmem persistent page when it is currently utilizing
+its maximum assigned memory allocation!
+This is reasonable for persistent pages because, even though the data is
+not directly accessible by the domain, the data is permanently saved until
+either the domain flushes it or the domain dies.
+<P>
+Note that proper accounting requires (even for ephemeral pools) that the same
+pool is referenced when memory is freed as when it was allocated, even if the
+ownership of a pool has been moved from one client to another (c.f. <i
+>shared_pool_reassign()</i>).
+The underlying Xen-specific information may
+not always enforce this for ephemeral pools, but incorrect alloc/free matching
+can cause some difficult-to-find memory leaks and bent pointers.
+<P>
+Page deduplication is not possible for persistent pools for
+accounting reasons: Imagine a page that is created by persistent pool A, which
+belongs to a domain that is currently well under its maximum allocation.
+Then the <i>pcd_t</i>is matched by persistent pool B, which is
+currently at its maximum.
+Then the domain owning pool A is destroyed.
+Is B beyond its maximum?
+(There may be a clever way around this
+problem.  Exercise for the reader!)
+<P>
+<b><i>Memory allocation.</i></b> The implementation-independent layer assumes
+there is a good fast general-purpose dynamic memory allocator with bounded
+response time and efficient use of memory for a very large number of sub-page
+allocations.  The old xmalloc memory
+allocator in Xen was not a good match for this purpose, so was replaced by the
+TLSF allocator.  Note that the TLSF
+allocator is used only for allocations smaller than a page (and, more
+precisely, no larger than <i>tmem_subpage_maxsize()</i>);
+full pages are allocated by Xen's normal heap allocator.
+<P>
+After the TLSF allocator was integrated into Xen, more work
+was required so that each client could allocate memory from a separate
+independent pool. (See the call to <i>xmem_pool_create()</i>in
+<i>tmh_client_init()</i>.) 
+This allows the data structures allocated for the
+purpose of supporting persistent pages to be billed to the same client as the
+pages themselves.  It also allows partial
+(e.g. compressed) pages to be properly billed.
+Further, when partial page allocations cause internal fragmentation,
+this fragmentation can be isolated per-client.
+And, when a domain dies, full pages can be freed, rather than only
+partial pages. One other change was
+required in the TLSF allocator: In the original version, when a TLSF memory
+pool was allocated, the first page of memory was also allocated.
+Since, for a persistent pool, this page would
+be billed to the client, the allocation of the first page failed if the domain
+was started at its maximum memory, and this resulted in a failure to create the
+memory pool.  To avoid this, the code was
+changed to delay the allocation of the first page until first use of the memory
+pool.
+<P>
+<b><i>Memory allocation interdependency.</i></b>
+As previously described,
+pages of memory must be moveable back and forth between the Xen heap and the
+tmem ephemeral lists (and page lists).
+When tmem needs a page but doesn't have one, it requests one from the
+Xen heap (either indirectly via xmalloc, or directly via Xen's <i
+>alloc_domheap_pages()</i>).
+And when Xen needs a page but doesn't have
+one, it requests one from tmem (via a call to <i
+>tmem_relinquish_pages()</i> in Xen's <i
+>alloc_heap_pages() </i>in page_alloc.c).
+This leads to a potential infinite loop!
+To break this loop, a new memory flag (<i>MEMF_tmem</i>) was added to Xen
+to flag and disallow the loop.
+See <i>tmh_called_from_tmem()</i>
+in <i>tmem_relinquish_pages()</i>.
+Note that the <i
+>tmem_relinquish_pages()</i> interface allows for memory requests of
+order &gt; 0 (multiple contiguous pages), but the tmem implementation disallows
+any requests larger than a single page.
+<P>
+<b><i>LRU page reclamation</i></b>.
+Ephemeral pages generally <i>age </i>in
+a queue, and the space associated with the oldest -- or <i
+>least-recently-used -- </i>page is reclaimed when tmem needs more
+memory.  But there are a few exceptions
+to strict LRU queuing.  First is when
+removal from a queue is constrained by locks, as previously described above.
+Second, when an ephemeral pool is <i>shared,</i> unlike a private ephemeral
+pool, a
+<big><kbd>get</kbd></big>
+does not imply a
+<big><kbd>flush</kbd></big>
+Instead, in a shared pool, a 
+results in the page being promoted to the front of the queue.
+Third, when a page that is deduplicated (i.e.
+is referenced by more than one <i>pgp_</i>t)
+reaches the end of the LRU queue, it is marked as <i
+>eviction attempted</i> and promoted to the front of the queue; if it
+reaches the end of the queue a second time, eviction occurs.
+Note that only the <i
+>pgp_</i>t is evicted; the actual data is only reclaimed if there is no
+other <i>pgp_t </i>pointing to the data.
+<P>
+All of these modified- LRU algorithms deserve to be studied
+carefully against a broad range of workloads.
+<P>
+<b><i>Internal fragmentation</i>.</b>
+When
+compression or tze is enabled, allocations between a half-page and a full-page
+in size are very common and this places a great deal of pressure on even the
+best memory allocator.  Additionally,
+problems may be caused for memory reclamation: When one tmem ephemeral page is
+evicted, only a fragment of a physical page of memory might be reclaimed.
+As a result, when compression or tze is
+enabled, it may take a very large number of eviction attempts to free up a full
+contiguous page of memory and so, to avoid near-infinite loops and livelocks, eviction
+must be assumed to be able to fail.
+While all memory allocation paths in tmem are resilient to failure, very
+complex corner cases may eventually occur.
+As a result, compression and tze are disabled by default and should be
+used with caution until they have been tested with a much broader set of
+workloads.(Note to self: The 
+code needs work.)
+<P>
+<b><i>Weights and caps</i>.</b>
+Because
+of the just-discussed LRU-based eviction algorithms, a client that uses tmem at
+a very high frequency can quickly swamp tmem so that it provides little benefit
+to a client that uses it less frequently.
+To reduce the possibility of this denial-of-service, limits can be
+specified via management tools that are enforced internally by tmem.
+On Xen, the &quot;xm tmem-set&quot; command
+can specify &quot;weight=&lt;weight&gt;&quot; or &quot;cap=&lt;cap&gt;&quot;
+for any client.  If weight is non-zero
+for a client and the current percentage of ephemeral pages in use by the client
+exceeds its share (as measured by the sum of weights of all clients), the next
+page chosen for eviction is selected from the requesting client's ephemeral
+queue, instead of the global ephemeral queue that contains pages from all
+clients.(See <i>client_over_quota().</i>)
+Setting a cap for a client is currently a no-op.
+<P>
+<b><i>Shared pools and authentication.</i></b>
+When tmem was first proposed to the linux kernel mailing list
+(LKML), there was concern expressed about security of shared ephemeral
+pools.  The initial tmem implementation only
+required a client to provide a 128-bit UUID to identify a shared pool, and the
+linux-side tmem implementation obtained this UUID from the superblock of the
+shared filesystem (in ocfs2).  It was
+pointed out on LKML that the UUID was essentially a security key and any
+malicious domain that guessed it would have access to any data from the shared
+filesystem that found its way into tmem.
+Ocfs2 has only very limited security; it is assumed that anyone who can
+access the filesystem bits on the shared disk can mount the filesystem and use
+it.  But in a virtualized data center,
+higher isolation requirements may apply.
+As a result, a Xen boot option -- &quot;tmem_shared_auth&quot; -- was
+added.  The option defaults to disabled,
+but when it is enabled, management tools must explicitly authenticate (or may
+explicitly deny) shared pool access to any client.
+On Xen, this is done with the &quot;xm
+tmem-shared-auth&quot; command.
+<P>
+<b><i>32-bit implementation</i>.</b>
+There was some effort put into getting tmem working on a 32-bit Xen.
+However, the Xen heap is limited in size on
+32-bit Xen so tmem did not work very well.
+There are still 32-bit ifdefs in some places in the code, but things may
+have bit-rotted so using tmem on a 32-bit Xen is not recommended.
+<P>
+<b><i>IA-64 implementation. </i></b>
+The vast majority of the tmem
+implementation is architecture-independent.
+For tmem to run on Xen/ia64, it is believed that only one or two
+routines needs to be written.(See the
+#ifdef __ia64__ at <i>cli_mfn_to_va()</i>.)
+
+<h2>Known Issues</h2>
+
+<p><b><i>Fragmentation.</i></b>When tmem
+is active, all physically memory becomes <i>fragmented</i>
+into individual pages.  However, the Xen
+memory allocator allows memory to be requested in multi-page contiguous
+quantities, called order&gt;0 allocations.
+(e.g. 2<sup>order</sup> so
+order==4 is sixteen contiguous pages.)
+In some cases, a request for a larger order will fail gracefully if no
+matching contiguous allocation is available from Xen.
+As of Xen 4.0, however, there are several
+critical order&gt;0 allocation requests that do not fail gracefully.
+Notably, when a domain is created, and
+order==4 structure is required or the domain creation will fail.
+And shadow paging requires many order==2
+allocations; if these fail, a PV live-migration may fail.
+There are likely other such issues.
+<P>
+But, fragmentation can occur even without tmem if any domU does
+any extensive ballooning; tmem just accelerates the fragmentation.
+So the fragmentation problem must be solved
+anyway.  The best solution is to disallow
+order&gt;0 allocations altogether in Xen -- or at least ensure that any attempt
+to allocate order&gt;0 can fail gracefully, e.g. by falling back to a sequence
+of single page allocations. However this restriction may require a major rewrite
+in some of Xen's most sensitive code.
+(Note that order&gt;0 allocations during Xen boot and early in domain0
+launch are safe and, if dom0 does not enable tmem, any order&gt;0 allocation by
+dom0 is safe, until the first domU is created.)
+<P>
+Until Xen can be rewritten to be <i>fragmentation-safe</i>, a small hack
+was added in the Xen page
+allocator.(See the comment &quot;
+memory is scarce&quot; in <i>alloc_heap_pages()</i>.)
+Briefly, a portion of memory is pre-reserved
+for allocations where order&gt;0 and order&lt;9.
+(Domain creation uses 2MB pages, but fails
+gracefully, and there are no other known order==9 allocations or order&gt;9
+allocations currently in Xen.)
+<P>
+<b><i>NUMA</i></b>.  Tmem assumes that
+all memory pages are equal and any RAM page can store a page of data for any
+client.  This has potential performance
+consequences in any NUMA machine where access to <i
+>far memory</i> is significantly slower than access to <i
+>near memory</i>.
+On nearly all of today's servers, however,
+access times to <i>far memory</i> is still
+much faster than access to disk or network-based storage, and tmem's primary performance
+advantage comes from the fact that paging and swapping are reduced.
+So, the current tmem implementation ignores
+NUMA-ness; future tmem design for NUMA machines is an exercise left for the
+reader.
+
+<h2>Bibliography</h2>
+
+<P>
+(needs work)<b style='mso-bidi-font-weight:>
+<P><a href="http://oss.oracle.com/projects/tmem">http://oss.oracle.com/projects/tmem</a>
-- 
cgit v1.2.3