<HTML>
<HEAD>
<TITLE>Garbage collector scalability</TITLE>
</HEAD>
<BODY>
<H1>Garbage collector scalability</h1>
In its default configuration, the Boehm-Demers-Weiser garbage collector
is not thread-safe.  It can be made thread-safe for a number of environments
by building the collector with the appropriate
<TT>-D</tt><I>XXX</i><TT>_THREADS</tt> compilation
flag.  This has primarily two effects:
<OL>
<LI> It causes the garbage collector to stop all other threads when
it needs to see a consistent memory state.
<LI> It causes the collector to acquire a lock around essentially all
allocation and garbage collection activity.
</ol>
Since a single lock is used for all allocation-related activity, only one
thread can be allocating or collecting at one point.  This inherently
limits performance of multi-threaded applications on multiprocessors.
<P>
On most platforms, the allocator/collector lock is implemented as a
spin lock with exponential back-off.  Longer wait times are implemented
by yielding and/or sleeping.  If a collection is in progress, the pure
spinning stage is skipped.  This has the advantage that uncontested and
thus most uniprocessor lock acquisitions are very cheap.  It has the
disadvantage that the application may sleep for small periods of time
even when there is work to be done.  And threads may be unnecessarily
woken up for short periods.  Nonetheless, this scheme empirically
outperforms native queue-based mutual exclusion implementations in most
cases, sometimes drastically so.
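<P>
To make the scheme concrete, here is a hedged sketch of such a spin-then-yield
lock acquisition; the constants and the helper names (<TT>GC_try_lock</tt>,
<TT>collection_in_progress</tt>) are invented for the illustration and are not
the collector's internal interface:
<PRE>
#include &lt;sched.h&gt;    /* sched_yield() */

extern int GC_try_lock(void);            /* nonzero on success (assumed helper) */
extern int collection_in_progress(void); /* assumed helper */

void acquire_allocator_lock(void)
{
    unsigned pause = 1;
    int i;
    volatile unsigned j;

    /* Pure spinning only pays off for short critical sections, so it is
       skipped while a (long) collection is in progress.                 */
    if (!collection_in_progress()) {
        for (i = 0; i != 10; ++i) {
            if (GC_try_lock()) return;
            for (j = 0; j != pause; ++j) { }  /* busy-wait */
            pause *= 2;                       /* exponential back-off */
        }
    }
    /* Longer waits: yield (or sleep) rather than burn processor cycles. */
    while (!GC_try_lock()) sched_yield();
}
</pre>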
<H2>Options for enhanced scalability</h2>
Version 6.0 of the collector adds two facilities to enhance collector
scalability on multiprocessors.  As of 6.0alpha1, these are supported
only under Linux on X86 and IA64 processors, though ports to other
otherwise supported Pthreads platforms should be straightforward.
They are intended to be used together.
<UL>
<LI>
Building the collector with <TT>-DPARALLEL_MARK</tt> allows the collector to
run the mark phase in parallel in multiple threads, and thus on multiple
processors.  The mark phase typically consumes the large majority of the
collection time.  Thus this largely parallelizes the garbage collector
itself, though not the allocation process.  Currently the marking is
performed by the thread that triggered the collection, together with
<I>N</i>-1 dedicated
threads, where <I>N</i> is the number of processors detected by the collector.
The dedicated threads are created once at initialization time.
<P>
A second effect of this flag is to switch to a more concurrent
implementation of <TT>GC_malloc_many</tt>, so that free lists can be
built, and memory can be cleared, by more than one thread concurrently
(a usage sketch follows this list).
<LI>
Building the collector with <TT>-DTHREAD_LOCAL_ALLOC</tt> adds support for thread
local allocation.  It does not, by itself, cause thread local allocation
to be used.  It simply allows the use of the interface in
<TT>gc_local_alloc.h</tt>.
<P>
Memory returned from thread-local allocators is completely interchangeable
with that returned by the standard allocators.  It may be used by other
threads.  The only difference is that, if the thread allocates enough
memory of a certain kind, it will build a thread-local free list for
objects of that kind, and allocate from that.  This greatly reduces
locking.  The thread-local free lists are refilled using
<TT>GC_malloc_many</tt>.
<P>
An important side effect of this flag is to replace the default
spin-then-sleep lock with a spin-then-queue based implementation.
This <I>reduces performance</i> for the standard allocation functions,
though it usually improves performance when thread-local allocation is
used heavily, since the number of short-duration lock acquisitions
is then greatly reduced.
</ul>
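<P>
As a concrete illustration, here is a hedged sketch (not code from the
collector itself) of how a client might use <TT>GC_malloc_many</tt> and the
<TT>GC_NEXT</tt> macro from <TT>gc.h</tt> to refill a private free list with a
single lock acquisition; the names <TT>my_free_list</tt> and
<TT>my_alloc_16</tt> are invented for the example:
<PRE>
#include "gc.h"

/* In real use this list would be kept per-thread; a single static
   variable is used here only to keep the sketch short.            */
static void *my_free_list = 0;

void *my_alloc_16(void)
{
    void *result;
    if (my_free_list == 0) {
        /* One call returns several 16-byte objects linked through
           their first word, acquiring the allocation lock only once. */
        my_free_list = GC_malloc_many(16);
        if (my_free_list == 0) return 0;   /* out of memory */
    }
    result = my_free_list;
    my_free_list = GC_NEXT(result);   /* unlink the first object */
    GC_NEXT(result) = 0;              /* clear the link word before use */
    return result;
}
</pre>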
<P>
The easiest way to switch an application to thread-local allocation is to
<OL>
<LI> Define the macro <TT>GC_REDIRECT_TO_LOCAL</tt>,
and then include the <TT>gc.h</tt>
header in each client source file.
<LI> Invoke <TT>GC_thr_init()</tt> before any allocation.
<LI> Allocate using <TT>GC_MALLOC</tt>, <TT>GC_MALLOC_ATOMIC</tt>,
and/or <TT>GC_GCJ_MALLOC</tt>.
</ol>
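<P>
A minimal sketch of these three steps, assuming a collector built with
<TT>-DTHREAD_LOCAL_ALLOC</tt> (the surrounding program is invented for the
example):
<PRE>
#define GC_REDIRECT_TO_LOCAL   /* step 1: redirect GC_MALLOC and friends */
#include "gc.h"

int main(void)
{
    void **node;
    char *text;

    GC_thr_init();                          /* step 2: before any allocation */
    node = GC_MALLOC(2 * sizeof(void *));   /* step 3: may contain pointers  */
    text = GC_MALLOC_ATOMIC(64);            /* step 3: pointer-free object   */
    if (node != 0) node[0] = text;
    return 0;
}
</pre>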
<H2>The Parallel Marking Algorithm</h2>
We use an algorithm similar to
<A HREF="http://www.yl.is.s.u-tokyo.ac.jp/gc/">that developed by
Endo, Taura, and Yonezawa</a> at the University of Tokyo.
However, the data structures and implementation are different,
and represent a smaller change to the original collector source,
probably at the expense of extreme scalability.  Some of
the refinements they suggest, <I>e.g.</i> splitting large
objects, were also incorporated into our approach.
<P>
The global mark stack is transformed into a global work queue.
Unlike the usual case, it never shrinks during a mark phase.
The mark threads remove objects from the queue by copying them to a
local mark stack and changing the global descriptor to zero, indicating
that there is no more work to be done for this entry.
This removal is done with no synchronization.  Thus it is possible for more than
one worker to remove the same entry, resulting in some work duplication.
<P>
The global work queue grows only if a marker thread decides to
return some of its local mark stack to the global one.  This
is done if the global queue appears to be running low, or if
the local stack is in danger of overflowing.  It does require
synchronization, but should be relatively rare.
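<P>
The following is a deliberately simplified sketch of the unsynchronized
removal step; the type and function names (<TT>mark_entry</tt>,
<TT>global_queue</tt>, <TT>local_push</tt>, <TT>drain_local_stack</tt>) are
invented for the illustration and are not the collector's identifiers:
<PRE>
struct mark_entry { void *start; unsigned long descr; };

extern struct mark_entry global_queue[];  /* only grows during a mark phase */
extern int global_entries;

extern void local_push(void *start, unsigned long descr);
extern void drain_local_stack(void);

void help_with_marking(void)
{
    int i;
    for (i = 0; i != global_entries; ++i) {
        unsigned long d = global_queue[i].descr;
        if (d == 0) continue;        /* entry already claimed by some marker */
        /* Copy the entry to this thread's local mark stack, then zero the
           global descriptor.  No lock is taken, so two markers may grab
           the same entry; that duplicates work but never loses any.      */
        local_push(global_queue[i].start, d);
        global_queue[i].descr = 0;
        drain_local_stack();
    }
}
</pre>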
<P>
The sequential marking code is reused to process local mark stacks.
Hence the amount of additional code required for parallel marking
is minimal.
<P>
It should be possible to use generational collection in the presence of the
parallel collector, by calling <TT>GC_enable_incremental()</tt>.
This does not result in fully incremental collection, since parallel mark
phases cannot currently be interrupted, and doing so may be too
expensive.
<P>
Gcj-style mark descriptors do not currently mix with the combination
of local allocation and incremental collection.  They should work correctly
with one or the other, but not both.
<P>
The number of marker threads is set on startup to the number of
available processors (or to the value of the <TT>GC_NPROCS</tt>
environment variable).  If only a single processor is detected,
parallel marking is disabled.
<P>
Note that setting <TT>GC_NPROCS</tt> to 1 also causes some lock acquisitions inside
the collector to immediately yield the processor instead of busy waiting
first.  In the case of a multiprocessor and a client with multiple
simultaneously runnable threads, this may have disastrous performance
consequences (e.g. a factor of 10 slowdown).
<H2>Performance</h2>
We conducted some simple experiments with a version of
<A HREF="gc_bench.html">our GC benchmark</a> that was slightly modified to
run multiple concurrent client threads in the same address space.
Each client thread does the same work as the original benchmark, but they share
a heap.
This benchmark involves very little work outside of memory allocation.
This was run with GC 6.0alpha3 on a dual processor Pentium III/500 machine
under Linux 2.2.12.
<P>
Running with a thread-unsafe collector, the benchmark ran in 9
seconds.  With the simple thread-safe collector,
built with <TT>-DLINUX_THREADS</tt>, the execution time
increased to 10.3 seconds, or 23.5 elapsed seconds with two clients.
(The times for the <TT>malloc</tt>/<TT>free</tt> version
with glibc <TT>malloc</tt>
are 10.51 (standard library, pthreads not linked),
20.90 (one thread, pthreads linked),
and 24.55 seconds respectively.  The benchmark favors a
garbage collector, since most objects are small.)
The following table gives execution times for the collector built
with parallel marking and thread-local allocation support
(<TT>-DGC_LINUX_THREADS -DPARALLEL_MARK -DTHREAD_LOCAL_ALLOC</tt>).  We tested
the client using either one or two marker threads, and running
one or two client threads.  Note that the client uses thread local
allocation exclusively.  With <TT>-DTHREAD_LOCAL_ALLOC</tt> the collector
switches to a locking strategy that is better tuned to less frequent
lock acquisition.  The standard allocation primitives thus perform
slightly worse than without <TT>-DTHREAD_LOCAL_ALLOC</tt>, and should be
avoided in time-critical code.
<P>
(The results using <TT>pthread_mutex_lock</tt>
directly for allocation locking would have been worse still, at
least for older versions of linuxthreads.
With <TT>THREAD_LOCAL_ALLOC</tt>, we first repeatedly try to acquire the
lock with <TT>pthread_mutex_trylock()</tt>, busy-waiting between attempts.
After a fixed number of attempts, we use <TT>pthread_mutex_lock()</tt>.)
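<P>
A rough sketch of that spin-then-queue strategy; the retry count and the
pause loop are illustrative values, not the collector's actual tuning:
<PRE>
#include &lt;pthread.h&gt;

static pthread_mutex_t alloc_lock = PTHREAD_MUTEX_INITIALIZER;

static void acquire_alloc_lock(void)
{
    int attempts;
    volatile int spin;

    for (attempts = 0; attempts != 16; ++attempts) {
        if (pthread_mutex_trylock(&alloc_lock) == 0) return;
        for (spin = 0; spin != 100; ++spin) { }   /* brief busy-wait */
    }
    pthread_mutex_lock(&alloc_lock);   /* finally queue behind other waiters */
}
</pre>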
<P>
These measurements do not use incremental collection, nor was prefetching
enabled in the marker.  We used the C version of the benchmark.
All measurements are in elapsed seconds on an unloaded machine.
<P>
<TABLE BORDER ALIGN="CENTER">
<TR><TH>Number of client threads</th><TH>1 marker thread (secs.)</th>
<TH>2 marker threads (secs.)</th></tr>
<TR><TD>1 client</td><TD ALIGN="CENTER">10.45</td><TD ALIGN="CENTER">7.85</td></tr>
<TR><TD>2 clients</td><TD ALIGN="CENTER">19.95</td><TD ALIGN="CENTER">12.3</td></tr>
</table>
<P>
The execution time for the single-threaded case is slightly worse than with
simple locking.  However, even the single-threaded benchmark runs faster than
the thread-unsafe version if a second processor is available.
The execution time for two clients with thread-local allocation is
only 1.4 times the sequential execution time for a single thread in a
thread-unsafe environment, even though it involves twice the client work.
That represents close to a
factor of 2 improvement over the 2 client case with the old collector.
The old collector clearly
still suffered from some contention overhead, in spite of the fact that the
locking scheme had been fairly well tuned.
<P>
Full linear speedup (i.e. the same execution time for 1 client on one
processor as 2 clients on 2 processors)
is probably not achievable on this kind of
hardware even with such a small number of processors,
since the memory system is
a major constraint for the garbage collector,
the processors usually share a single memory bus, and thus
the aggregate memory bandwidth does not increase in
proportion to the number of processors.
<P>
These results are likely to be very sensitive to both hardware and OS
issues.  Preliminary experiments with an older Pentium Pro machine running
an older kernel were far less encouraging.

</body>
</html>