I apologise in advance for the size of this check-in. At Intel we
understand that this is not friendly, and we are working to change our
internal development process so that we can make features available
more frequently and in finer-grained (more functional) chunks.
Unfortunately we haven't got that in place yet, and unpicking this
into multiple separate check-ins would be non-trivial, so please bear
with me on this one. We should be better in the future.

Apologies over, what do we have here?

GCC 4.9 compatibility
---------------------
* We have implemented the new entry points emitted by GCC 4.9 for
functionality that GCC 4.8 already provided, so code compiled with
GCC 4.9 that used to work will continue to do so. However, there are
some other new entry points (associated with task cancellation) which
are not implemented, so user code compiled by GCC 4.9 that uses these
new features will not link against the LLVM runtime. (It remains
unclear how to handle those entry points, since the GCC interface has
potentially unpleasant performance implications for join barriers even
when cancellation is not used.)

--- new parallel entry points ---
New entry points that are not OpenMP 4.0 related. These are fully
implemented (a sketch of the shim pattern follows the list):
      GOMP_parallel_loop_dynamic()
      GOMP_parallel_loop_guided()
      GOMP_parallel_loop_runtime()
      GOMP_parallel_loop_static()
      GOMP_parallel_sections()
      GOMP_parallel()
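
GCC 4.9 fuses the GOMP_parallel_start()/GOMP_parallel_end() pair used
by GCC 4.8 into a single fork-run-join entry point, which is why the
new functions can be expressed in terms of existing functionality. A
minimal sketch of the shim pattern (this is not the actual libiomp5
source):

    // GCC 4.8 era entry points, already supported by the runtime.
    extern "C" void GOMP_parallel_start(void (*fn)(void *), void *data,
                                        unsigned num_threads);
    extern "C" void GOMP_parallel_end(void);

    // GCC 4.9 entry point: fork, run the body on the master too, join.
    extern "C" void GOMP_parallel(void (*fn)(void *), void *data,
                                  unsigned num_threads, unsigned /*flags*/) {
        GOMP_parallel_start(fn, data, num_threads); // fork the team
        fn(data);                                   // master executes the body
        GOMP_parallel_end();                        // join barrier
    }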

--- cancellation entry points ---
Currently these only give a runtime error if OMP_CANCELLATION is true,
because our plain barriers don't check for cancellation while waiting
(a sketch of the stub behaviour follows the list):
        GOMP_barrier_cancel()
        GOMP_cancel()
        GOMP_cancellation_point()
        GOMP_loop_end_cancel()
        GOMP_sections_end_cancel()
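
A sketch of the current stub behaviour, with crude hypothetical
environment parsing; the real code uses the runtime's own environment
handling and the NoGompCancellation message added in this check-in:

    #include <cstdio>
    #include <cstdlib>

    // No-op when cancellation is disabled; otherwise we can only give a
    // runtime error, because the plain barriers never poll for
    // cancellation while waiting.
    extern "C" bool GOMP_cancellation_point(int /*which*/) {
        const char *env = std::getenv("OMP_CANCELLATION");
        if (env == nullptr)   // crude check; real parsing accepts true/1/on
            return false;     // cancellation disabled: nothing to do
        std::fprintf(stderr,
                     "libgomp cancellation is not currently supported\n");
        std::exit(EXIT_FAILURE);
    }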

--- taskgroup entry points ---
These are fully implemented (a usage sketch follows the list):
      GOMP_taskgroup_start()
      GOMP_taskgroup_end()
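
For reference, the user-level construct these two entry points
implement; GCC emits the start/end pair around the structured block:

    // #pragma omp taskgroup -> GOMP_taskgroup_start();
    // end of the block      -> GOMP_taskgroup_end(), which waits for
    //                          all tasks created inside the group.
    void process(int *v, int n) {
        #pragma omp taskgroup
        {
            for (int i = 0; i < n; ++i) {
                #pragma omp task firstprivate(i) shared(v)
                v[i] *= 2;
            }
        }   // all tasks in the group are complete past this point
    }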

--- target entry points ---
These are empty, as they are in libgomp:
     GOMP_target()
     GOMP_target_data()
     GOMP_target_end_data()
     GOMP_target_update()
     GOMP_teams()

Improvements in Barriers and Fork/Join
--------------------------------------
* Barrier and fork/join code is now in its own file (which makes it
easier to understand and modify).
* Wait/release code is now templated and in its own file; suspend/resume code is also templated
* There's a new, hierarchical, barrier, which exploits the
cache-hierarchy of the Intel(r) Xeon Phi(tm) coprocessor to improve
fork/join and barrier performance.
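
The gist of the hierarchical scheme, sketched for the gather phase
only (release and sense reversal elided; this is an illustration, not
the new kmp_barrier.cpp code):

    #include <atomic>

    struct counter_t { std::atomic<int> arrived{0}; };

    // The HW threads of one core meet at a counter resident in their
    // shared cache; only the last to arrive carries the result up a
    // level, so a single thread per core writes the machine-wide
    // counter instead of every thread doing so.
    void barrier_gather(counter_t &core, counter_t &machine,
                        int threads_per_core) {
        if (core.arrived.fetch_add(1) + 1 < threads_per_core)
            return;                    // in real code: spin on release flag
        machine.arrived.fetch_add(1);  // one cross-core write per core
    }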

***BEWARE*** the new source files have *not* been added to the legacy
CMake build system. If you want to use that, fixes will be required.

Statistics Collection Code
--------------------------
* New code has been added to collect application statistics (only if
this is enabled when the library is compiled; by default it is not).
The statistics code itself is generally useful; the lightweight timing
code uses the x86 rdtsc instruction, so it will require changes for
other architectures (a sketch of the timing idea follows this list).
The intent of this code is not for users to tune their codes, but
rather
1) to time code paths inside the runtime, and
2) to gather general properties of OpenMP codes, so as to focus
attention on the OpenMP features that are most used.
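
A sketch of the lightweight timing idea, assuming GCC/Clang-style
intrinsics (the real kmp_stats_timing code may differ in detail):

    #include <cstdint>
    #if defined(__x86_64__) || defined(__i386__)
    #include <x86intrin.h>     // __rdtsc()
    #endif

    // Read the x86 time-stamp counter directly rather than calling the
    // OS; as noted above, other architectures need their own counter.
    static inline uint64_t read_cycles() {
    #if defined(__x86_64__) || defined(__i386__)
        return __rdtsc();
    #else
    #   error "rdtsc-based timing needs a per-architecture replacement"
    #endif
    }

    // Typical use: bracket a runtime code path and accumulate ticks.
    //   uint64_t t0 = read_cycles();
    //   ... code path under measurement ...
    //   elapsed += read_cycles() - t0;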

Nested Hot Teams
----------------
* The runtime now maintains more state to reduce the overhead of
creating and destroying inner parallel teams. This improves the
performance of code that repeatedly uses nested parallelism with the
same resource allocation. Set the new KMP_HOT_TEAMS_MAX_LEVEL
environment variable to a nesting depth to enable this (and, of
course, set OMP_NESTED=true to enable nested parallelism at all). A
usage sketch follows.
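
A usage sketch; the environment settings and thread counts here are
part of the example, not defaults:

    // export OMP_NESTED=true
    // export KMP_HOT_TEAMS_MAX_LEVEL=2
    //
    // With the above set, the inner teams are kept "hot" across
    // iterations (threads and team structures retained) rather than
    // torn down and recreated, since the resource shape is the same
    // every time through the loop.
    void step(int iter);                             // hypothetical work

    void solver(int iterations) {
        for (int i = 0; i < iterations; ++i) {
            #pragma omp parallel num_threads(4)      // level 1
            {
                #pragma omp parallel num_threads(15) // level 2: hot team
                step(i);
            }
        }
    }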

Improved Intel(r) VTune(tm) Amplifier support
---------------------------------------------
* The runtime provides additional information to VTune via the
itt_notify interface, allowing it to display better OpenMP-specific
analyses of load imbalance.

Support for OpenMP Composite Statements
---------------------------------------
* Implemented the new entry points required by some of the OpenMP 4.1
composite statements.

Improved ifdefs
---------------
* More separation of concepts ("Does this platform do X?") from
platforms ("Are we compiling for platform Y?"), which should simplify
future porting. The pattern is illustrated below.
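
An illustration of the pattern, in the style of kmp_os.h (the exact
macro names here are illustrative):

    // Platform question answered once...
    #if defined(__linux__)
    # define KMP_OS_LINUX 1
    #else
    # define KMP_OS_LINUX 0
    #endif

    // ...then a concept macro derived from it, instead of repeating
    // the platform test at every use site.
    #if KMP_OS_LINUX && (defined(__x86_64__) || defined(__i386__))
    # define KMP_USE_FUTEX 1
    #else
    # define KMP_USE_FUTEX 0
    #endif

    #if KMP_USE_FUTEX
    // ... futex-based suspend/resume ...
    #endif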


ScaleMP* contribution
---------------------
Stack padding has been added to improve performance in ScaleMP's
environment, where cross-node coherency is managed at the page level
(see the sketch below).
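
The same motivation drives the new KMP_USE_INTERNODE_ALIGNMENT option
in the CMake changes below. A sketch of the idea (the type name is
illustrative):

    // Where coherency is per 4096-byte page, two hot objects sharing
    // a page ping-pong between nodes, so pad to a page rather than to
    // a cache line.
    #if KMP_USE_INTERNODE_ALIGNMENT
    # define KMP_HOT_ALIGN 4096   // one hot object per coherency page
    #else
    # define KMP_HOT_ALIGN 64     // ordinary cache-line alignment
    #endif

    struct alignas(KMP_HOT_ALIGN) padded_lock_t {
        int lk;   // the compiler pads the struct to KMP_HOT_ALIGN bytes
    };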

Redesign of wait and release code
---------------------------------
The code is simplified and its performance improved; a sketch of the
templated pattern follows.
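
A sketch of the templated wait/release pattern (cf. the new
__kmp_wait_32/__kmp_wait_64/__kmp_wait_oncore exports below); the
names here are placeholders, not the real internals:

    #include <atomic>

    template <typename FlagType>
    struct flag_t {
        std::atomic<FlagType> loc;   // location threads wait on
        FlagType checker;            // value that means "released"
        bool done() const {
            return loc.load(std::memory_order_acquire) == checker;
        }
    };

    template <typename FlagType>
    void wait(flag_t<FlagType> *f) {
        while (!f->done()) { /* real code spins, yields, then sleeps */ }
    }

    template <typename FlagType>
    void release(flag_t<FlagType> *f) {
        f->loc.store(f->checker, std::memory_order_release);
    }

    // The width-specific entry points are then thin instantiations,
    // e.g. a 32-bit flag for __kmp_wait_32 and a 64-bit flag for
    // __kmp_wait_64.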

Bug Fixes
---------
    * Fixes for Windows multiple processor groups.
    * Fix Fortran module build on Linux: offload attribute added.
    * Fix entry names for the distribute-parallel-loop construct to be consistent with the compiler codegen.
    * Fix an inconsistent error message for the KMP_PLACE_THREADS environment variable.

llvm-svn: 219214
Author: Jim Cownie 2014-10-07 16:25:50 +00:00
Parent: f72fa67fc3
Commit: 4cc4bb4c60
121 changed files with 31865 additions and 22060 deletions


@ -9,6 +9,7 @@ beautification by scripts. The fields are: name (N), email (E), web-address
(S).
N: Carlo Bertolli
W: http://ibm.com
D: IBM contributor to PowerPC support in CMake files and elsewhere.
N: Sunita Chandrasekaran
@ -28,6 +29,11 @@ D: Created the runtime.
N: Matthias Muller
D: Contributor to testsuite from OpenUH
N: Tal Nevo
E: tal@scalemp.com
D: ScaleMP contributor to improve runtime performance there.
W: http://scalemp.com
N: Pavel Neytchev
D: Contributor to testsuite from OpenUH


@ -14,7 +14,7 @@ software contained in this directory tree is included below.
University of Illinois/NCSA
Open Source License
Copyright (c) 1997-2013 Intel Corporation
Copyright (c) 1997-2014 Intel Corporation
All rights reserved.
@ -51,7 +51,7 @@ SOFTWARE.
==============================================================================
Copyright (c) 1997-2013 Intel Corporation
Copyright (c) 1997-2014 Intel Corporation
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal


@ -137,8 +137,7 @@ libiomp5 version can be 5 or 4.
OpenMP version can be either 40 or 30.
-Dmic_arch=knc|knf
Intel(R) MIC Architecture. Can be
knf (Knights Ferry) or knc (Knights Corner).
Intel(R) MIC Architecture, can be knf or knc.
This value is ignored if os != mic
-Dmic_os=lin|bsd


@ -238,8 +238,18 @@ set(USE_BUILDPL_RULES false CACHE BOOL "Should the build follow build.pl rules/r
# - these predefined linker flags should work for Windows, Mac, and True Linux for the most popular compilers/linkers
set(USE_PREDEFINED_LINKER_FLAGS true CACHE BOOL "Should the build use the predefined linker flags in CommonFlags.cmake?")
# - On multinode systems, larger alignment is desired to avoid false sharing
set(USE_INTERNODE_ALIGNMENT false CACHE BOOL "Should larger alignment (4096 bytes) be used for some locks and data structures?")
# - libgomp drop-in compatibility
if(${LINUX} AND NOT ${PPC64})
set(USE_VERSION_SYMBOLS true CACHE BOOL "Should version symbols be used? These provide binary compatibility with libgomp.")
else()
set(USE_VERSION_SYMBOLS false CACHE BOOL "Should version symbols be used? These provide binary compatibility with libgomp.")
endif()
# - TSX based locks have __asm code which can be troublesome for some compilers. This feature is also x86 specific.
if({${IA32} OR ${INTEL64})
if(${IA32} OR ${INTEL64})
set(USE_ADAPTIVE_LOCKS true CACHE BOOL "Should TSX-based lock be compiled (adaptive lock in kmp_lock.cpp). These are x86 specific.")
else()
set(USE_ADAPTIVE_LOCKS false CACHE BOOL "Should TSX-based lock be compiled (adaptive lock in kmp_lock.cpp). These are x86 specific.")


@ -37,7 +37,7 @@ omp_root: The path to the top-level directory containing the top-level
current working directory.
omp_os: Operating system. By default, the build will attempt to
detect this. Currently supports "linux", "freebsd", "macos", and
detect this. Currently supports "linux", "freebsd", "macos", and
"windows".
arch: Architecture. By default, the build will attempt to
@ -72,36 +72,44 @@ There is also an experimental CMake build system. This is *not* yet
supported for production use and resulting binaries have not been checked
for compatibility.
On OS X* machines, it is possible to build universal (or fat) libraries which
include both IA-32 architecture and Intel(R) 64 architecture objects in a
single archive; just build the 32 and 32e libraries separately, then invoke
make again with a special argument as follows:
make compiler=clang build_args=fat
Supported RTL Build Configurations
==================================
Supported Architectures: IA-32 architecture, Intel(R) 64, and
Intel(R) Many Integrated Core Architecture
--------------------------------------------
| icc/icl | gcc | clang |
--------------|---------------|--------------------------|
| Linux* OS | Yes(1,5) | Yes(2,4) | Yes(4,6,7) |
| FreeBSD* | No | No | Yes(4,6,7) |
| OS X* | Yes(1,3,4) | No | Yes(4,6,7) |
| Windows* OS | Yes(1,4) | No | No |
----------------------------------------------------------
----------------------------------------------
| icc/icl | gcc | clang |
--------------|---------------|----------------------------|
| Linux* OS | Yes(1,5) | Yes(2,4) | Yes(4,6,7) |
| FreeBSD* | No | No | Yes(4,6,7,8) |
| OS X* | Yes(1,3,4) | No | Yes(4,6,7) |
| Windows* OS | Yes(1,4) | No | No |
------------------------------------------------------------
(1) On IA-32 architecture and Intel(R) 64, icc/icl versions 12.x are
supported (12.1 is recommended).
(2) gcc version 4.6.2 is supported.
(2) GCC* version 4.6.2 is supported.
(3) For icc on OS X*, OS X* version 10.5.8 is supported.
(4) Intel(R) Many Integrated Core Architecture not supported.
(5) On Intel(R) Many Integrated Core Architecture, icc/icl versions 13.0
or later are required.
(6) clang version 3.3 is supported.
(7) clang currently does not offer a software-implemented 128 bit extended
(6) Clang* version 3.3 is supported.
(7) Clang* currently does not offer a software-implemented 128 bit extended
precision type. Thus, all entry points reliant on this type are removed
from the library and cannot be called in the user program. The following
functions are not available:
__kmpc_atomic_cmplx16_*
__kmpc_atomic_float16_*
__kmpc_atomic_*_fp
(8) Community contribution provided AS IS, not tested by Intel.
Front-end Compilers that work with this RTL
===========================================


@ -1,3 +1,14 @@
#
#//===----------------------------------------------------------------------===//
#//
#// The LLVM Compiler Infrastructure
#//
#// This file is dual licensed under the MIT and the University of Illinois Open
#// Source Licenses. See LICENSE.txt for details.
#//
#//===----------------------------------------------------------------------===//
#
###############################################################################
# This file contains additional build rules that correspond to build.pl's rules
# Building libiomp5.dbg is linux only, Windows will build libiomp5md.dll.pdb


@ -1,3 +1,14 @@
#
#//===----------------------------------------------------------------------===//
#//
#// The LLVM Compiler Infrastructure
#//
#// This file is dual licensed under the MIT and the University of Illinois Open
#// Source Licenses. See LICENSE.txt for details.
#//
#//===----------------------------------------------------------------------===//
#
# This file holds Clang (clang/clang++) specific compiler dependent flags
# The flag types are:
# 1) Assembly flags


@ -1,3 +1,14 @@
#
#//===----------------------------------------------------------------------===//
#//
#// The LLVM Compiler Infrastructure
#//
#// This file is dual licensed under the MIT and the University of Illinois Open
#// Source Licenses. See LICENSE.txt for details.
#//
#//===----------------------------------------------------------------------===//
#
# This file holds Clang (clang/clang++) specific compiler dependent flags
# The flag types are:
# 1) C/C++ Compiler flags
@ -19,6 +30,7 @@ function(append_compiler_specific_c_and_cxx_flags input_c_flags input_cxx_flags)
endif()
append_c_and_cxx_flags("-Wno-unused-value") # Don't warn about unused values
append_c_and_cxx_flags("-Wno-switch") # Don't warn about switch statements that don't cover entire range of values
append_c_and_cxx_flags("-Wno-deprecated-register") # Don't warn about using register keyword
set(${input_c_flags} ${${input_c_flags}} "${local_c_flags}" PARENT_SCOPE)
set(${input_cxx_flags} ${${input_cxx_flags}} "${local_cxx_flags}" PARENT_SCOPE)
endfunction()


@ -1,3 +1,14 @@
#
#//===----------------------------------------------------------------------===//
#//
#// The LLVM Compiler Infrastructure
#//
#// This file is dual licensed under the MIT and the University of Illinois Open
#// Source Licenses. See LICENSE.txt for details.
#//
#//===----------------------------------------------------------------------===//
#
# This file holds the common flags independent of compiler
# The flag types are:
# 1) Assembly flags (append_asm_flags_common)
@ -71,22 +82,21 @@ function(append_linker_flags_common input_ld_flags input_ld_flags_libs)
set(local_ld_flags)
set(local_ld_flags_libs)
#################################
# Windows linker flags
if(${WINDOWS})
if(${USE_PREDEFINED_LINKER_FLAGS})
##################
# MAC linker flags
elseif(${MAC})
if(${USE_PREDEFINED_LINKER_FLAGS})
#################################
# Windows linker flags
if(${WINDOWS})
##################
# MAC linker flags
elseif(${MAC})
append_linker_flags("-single_module")
append_linker_flags("-current_version ${version}.0")
append_linker_flags("-compatibility_version ${version}.0")
endif()
#####################################################################################
# Intel(R) Many Integrated Core Architecture (Intel(R) MIC Architecture) linker flags
elseif(${MIC})
if(${USE_PREDEFINED_LINKER_FLAGS})
#####################################################################################
# Intel(R) Many Integrated Core Architecture (Intel(R) MIC Architecture) linker flags
elseif(${MIC})
append_linker_flags("-Wl,-x")
append_linker_flags("-Wl,--warn-shared-textrel") # Warn if the linker adds a DT_TEXTREL to a shared object.
append_linker_flags("-Wl,--as-needed")
@ -98,13 +108,11 @@ function(append_linker_flags_common input_ld_flags input_ld_flags_libs)
if(${STATS_GATHERING})
append_linker_flags_library("-Wl,-lstdc++") # link in standard c++ library (stats-gathering needs it)
endif()
endif()
#########################
# Unix based linker flags
else()
# For now, always include --version-script flag on Unix systems.
append_linker_flags("-Wl,--version-script=${src_dir}/exports_so.txt") # Use exports_so.txt as version script to create versioned symbols for ELF libraries
if(${USE_PREDEFINED_LINKER_FLAGS})
#########################
# Unix based linker flags
else()
# For now, always include --version-script flag on Unix systems.
append_linker_flags("-Wl,--version-script=${src_dir}/exports_so.txt") # Use exports_so.txt as version script to create versioned symbols for ELF libraries
append_linker_flags("-Wl,-z,noexecstack") # Marks the object as not requiring executable stack.
append_linker_flags("-Wl,--as-needed") # Only adds library dependencies as they are needed. (if libiomp5 actually uses a function from the library, then add it)
if(NOT ${STUBS_LIBRARY})
@ -117,8 +125,9 @@ function(append_linker_flags_common input_ld_flags input_ld_flags_libs)
append_linker_flags_library("-Wl,-ldl") # link in libdl (dynamic loader library)
endif()
endif()
endif() # if(${USE_PREDEFINED_LINKER_FLAGS})
endif() # if(${OPERATING_SYSTEM}) ...
endif() # if(${OPERATING_SYSTEM}) ...
endif() # USE_PREDEFINED_LINKER_FLAGS
set(${input_ld_flags} "${${input_ld_flags}}" "${local_ld_flags}" "${USER_LD_FLAGS}" PARENT_SCOPE)
set(${input_ld_flags_libs} "${${input_ld_flags_libs}}" "${local_ld_flags_libs}" "${USER_LD_LIB_FLAGS}" PARENT_SCOPE)


@ -42,6 +42,10 @@ function(append_cpp_flags input_cpp_flags)
endif()
append_definitions("-D INTEL_ITTNOTIFY_PREFIX=__kmp_itt_")
if(${USE_VERSION_SYMBOLS})
append_definitions("-D KMP_USE_VERSION_SYMBOLS")
endif()
#####################
# Windows definitions
if(${WINDOWS})
@ -133,6 +137,11 @@ function(append_cpp_flags input_cpp_flags)
append_definitions("-D KMP_USE_ADAPTIVE_LOCKS=0")
append_definitions("-D KMP_DEBUG_ADAPTIVE_LOCKS=0")
endif()
if(${USE_INTERNODE_ALIGNMENT})
append_definitions("-D KMP_USE_INTERNODE_ALIGNMENT=1")
else()
append_definitions("-D KMP_USE_INTERNODE_ALIGNMENT=0")
endif()
set(${input_cpp_flags} "${${input_cpp_flags}}" "${local_cpp_flags}" "${USER_CPP_FLAGS}" "$ENV{CPPFLAGS}" PARENT_SCOPE)
endfunction()


@ -1,3 +1,14 @@
#
#//===----------------------------------------------------------------------===//
#//
#// The LLVM Compiler Infrastructure
#//
#// This file is dual licensed under the MIT and the University of Illinois Open
#// Source Licenses. See LICENSE.txt for details.
#//
#//===----------------------------------------------------------------------===//
#
# This file holds GNU (gcc/g++) specific compiler dependent flags
# The flag types are:
# 1) Assembly flags


@ -1,3 +1,14 @@
#
#//===----------------------------------------------------------------------===//
#//
#// The LLVM Compiler Infrastructure
#//
#// This file is dual licensed under the MIT and the University of Illinois Open
#// Source Licenses. See LICENSE.txt for details.
#//
#//===----------------------------------------------------------------------===//
#
# This file holds GNU (gcc/g++) specific compiler dependent flags
# The flag types are:
# 2) C/C++ Compiler flags


@ -1,3 +1,14 @@
#
#//===----------------------------------------------------------------------===//
#//
#// The LLVM Compiler Infrastructure
#//
#// This file is dual licensed under the MIT and the University of Illinois Open
#// Source Licenses. See LICENSE.txt for details.
#//
#//===----------------------------------------------------------------------===//
#
# This file holds GNU (gcc/g++) specific compiler dependent flags
# The flag types are:
# 1) Fortran Compiler flags


@ -1,3 +1,14 @@
#
#//===----------------------------------------------------------------------===//
#//
#// The LLVM Compiler Infrastructure
#//
#// This file is dual licensed under the MIT and the University of Illinois Open
#// Source Licenses. See LICENSE.txt for details.
#//
#//===----------------------------------------------------------------------===//
#
# This file holds Intel(R) C Compiler / Intel(R) C++ Compiler / Intel(R) Fortran Compiler (icc/icpc/icl.exe/ifort) dependent flags
# The flag types are:
# 1) Assembly flags


@ -1,3 +1,14 @@
#
#//===----------------------------------------------------------------------===//
#//
#// The LLVM Compiler Infrastructure
#//
#// This file is dual licensed under the MIT and the University of Illinois Open
#// Source Licenses. See LICENSE.txt for details.
#//
#//===----------------------------------------------------------------------===//
#
# This file holds Intel(R) C Compiler / Intel(R) C++ Compiler / Intel(R) Fortran Compiler (icc/icpc/icl.exe/ifort) dependent flags
# The flag types are:
# 2) C/C++ Compiler flags
@ -41,7 +52,6 @@ function(append_compiler_specific_c_and_cxx_flags input_c_flags input_cxx_flags)
endif()
else()
append_c_and_cxx_flags("-Wsign-compare") # warn on sign comparisons
append_c_and_cxx_flags("-Werror") # Changes all warnings to errors.
append_c_and_cxx_flags("-Qoption,cpp,--extended_float_types") # Enabled _Quad type.
append_c_and_cxx_flags("-fno-exceptions") # Exception handling table generation is disabled.
append_c_and_cxx_flags("-x c++") # Compile C files as C++ files


@ -1,3 +1,14 @@
#
#//===----------------------------------------------------------------------===//
#//
#// The LLVM Compiler Infrastructure
#//
#// This file is dual licensed under the MIT and the University of Illinois Open
#// Source Licenses. See LICENSE.txt for details.
#//
#//===----------------------------------------------------------------------===//
#
# This file holds Intel(R) C Compiler / Intel(R) C++ Compiler / Intel(R) Fortran Compiler (icc/icpc/icl.exe/ifort) dependent flags
# The flag types are:
# 1) Fortran Compiler flags
@ -17,12 +28,20 @@ function(append_fortran_compiler_specific_fort_flags input_fort_flags)
append_fort_flags("-GS")
append_fort_flags("-DynamicBase")
append_fort_flags("-Zi")
# On Linux and Windows Intel(R) 64 architecture we need offload attribute
# for all Fortran entries in order to support OpenMP function calls inside device contructs
if(${INTEL64})
append_fort_flags("/Qoffload-attribute-target:mic")
endif()
else()
if(${MIC})
append_fort_flags("-mmic")
endif()
if(NOT ${MAC})
append_fort_flags("-sox")
if(${INTEL64} AND ${LINUX})
append_fort_flags("-offload-attribute-target=mic")
endif()
endif()
endif()
set(${input_fort_flags} ${${input_fort_flags}} "${local_fort_flags}" PARENT_SCOPE)


@ -1,3 +1,14 @@
#
#//===----------------------------------------------------------------------===//
#//
#// The LLVM Compiler Infrastructure
#//
#// This file is dual licensed under the MIT and the University of Illinois Open
#// Source Licenses. See LICENSE.txt for details.
#//
#//===----------------------------------------------------------------------===//
#
# This file holds Microsoft Visual Studio dependent flags
# The flag types are:
# 1) Assembly flags


@ -1,3 +1,14 @@
#
#//===----------------------------------------------------------------------===//
#//
#// The LLVM Compiler Infrastructure
#//
#// This file is dual licensed under the MIT and the University of Illinois Open
#// Source Licenses. See LICENSE.txt for details.
#//
#//===----------------------------------------------------------------------===//
#
# This file holds Microsoft Visual Studio dependent flags
# The flag types are:
# 1) C/C++ Compiler flags


@ -1,3 +1,14 @@
#
#//===----------------------------------------------------------------------===//
#//
#// The LLVM Compiler Infrastructure
#//
#// This file is dual licensed under the MIT and the University of Illinois Open
#// Source Licenses. See LICENSE.txt for details.
#//
#//===----------------------------------------------------------------------===//
#
######################################################
# MICRO TESTS
# The following micro-tests are small tests to perform on
@ -219,15 +230,14 @@ if(${test_deps} AND ${tests})
set(td_exp libc.so.7 libthr.so.3 libunwind.so.5)
elseif(${LINUX})
set(td_exp libdl.so.2,libgcc_s.so.1)
if(NOT ${IA32} AND NOT ${INTEL64})
set(td_exp ${td_exp},libffi.so.6,libffi.so.5)
endif()
if(${IA32})
set(td_exp ${td_exp},libc.so.6,ld-linux.so.2)
elseif(${INTEL64})
set(td_exp ${td_exp},libc.so.6,ld-linux-x86-64.so.2)
elseif(${ARM})
set(td_exp ${td_exp},libc.so.6,ld-linux-armhf.so.3)
set(td_exp ${td_exp},libffi.so.6,libffi.so.5,libc.so.6,ld-linux-armhf.so.3)
elseif(${PPC64})
set(td_exp ${td_exp},libc.so.6,ld64.so.1)
endif()
if(${STD_CPP_LIB})
set(td_exp ${td_exp},libstdc++.so.6)


@ -69,7 +69,8 @@ endfunction()
function(set_cpp_files input_cpp_source_files)
set(local_cpp_source_files "")
if(NOT ${STUBS_LIBRARY})
#append_cpp_source_file("kmp_barrier.cpp")
append_cpp_source_file("kmp_barrier.cpp")
append_cpp_source_file("kmp_wait_release.cpp")
append_cpp_source_file("kmp_affinity.cpp")
append_cpp_source_file("kmp_dispatch.cpp")
append_cpp_source_file("kmp_lock.cpp")
@ -78,10 +79,10 @@ function(set_cpp_files input_cpp_source_files)
append_cpp_source_file("kmp_taskdeps.cpp")
append_cpp_source_file("kmp_cancel.cpp")
endif()
#if(${STATS_GATHERING})
# append_cpp_source_file("kmp_stats.cpp")
# append_cpp_source_file("kmp_stats_timing.cpp")
#endif()
if(${STATS_GATHERING})
append_cpp_source_file("kmp_stats.cpp")
append_cpp_source_file("kmp_stats_timing.cpp")
endif()
endif()
set(${input_cpp_source_files} "${local_cpp_source_files}" PARENT_SCOPE)

[File diff suppressed because it is too large]


@ -1539,7 +1539,7 @@ INCLUDE_FILE_PATTERNS =
# undefined via #undef or recursively expanded use the := operator
# instead of the = operator.
PREDEFINED = OMP_30_ENABLED=1, OMP_40_ENABLED=1
PREDEFINED = OMP_30_ENABLED=1, OMP_40_ENABLED=1, KMP_STATS_ENABLED=1
# If the MACRO_EXPANSION and EXPAND_ONLY_PREDEF tags are set to YES then
# this tag can be used to specify a list of macro names that should be expanded.


@ -208,6 +208,7 @@ are documented in different modules.
- @ref THREADPRIVATE functions to support thread private data, copyin etc
- @ref SYNCHRONIZATION functions to support `omp critical`, `omp barrier`, `omp master`, reductions etc
- @ref ATOMIC_OPS functions to support atomic operations
- @ref STATS_GATHERING macros to support developer profiling of libiomp5
- Documentation on tasking has still to be written...
@section SEC_EXAMPLES Examples
@ -319,8 +320,29 @@ These functions are used for implementing barriers.
@defgroup THREADPRIVATE Thread private data support
These functions support copyin/out and thread private data.
@defgroup STATS_GATHERING Statistics Gathering from OMPTB
These macros support profiling the libiomp5 library. Use --stats=on when building with build.pl to enable
and then use the KMP_* macros to profile (through counts or clock ticks) libiomp5 during execution of an OpenMP program.
@section sec_stats_env_vars Environment Variables
This section describes the environment variables relevent to stats-gathering in libiomp5
@code
KMP_STATS_FILE
@endcode
This environment variable is set to an output filename that will be appended *NOT OVERWRITTEN* if it exists. If this environment variable is undefined, the statistics will be output to stderr
@code
KMP_STATS_THREADS
@endcode
This environment variable indicates to print thread-specific statistics as well as aggregate statistics. Each thread's statistics will be shown as well as the collective sum of all threads. The values "true", "on", "1", "yes" will all indicate to print per thread statistics.
@defgroup TASKING Tasking support
These functions support are used to implement tasking constructs.
These functions support tasking constructs.
@defgroup USER User visible functions
These functions can be called directly by the user, but are runtime library specific, rather than being OpenMP interfaces.
*/


@ -1,6 +1,6 @@
# defs.mk
# $Revision: 42061 $
# $Date: 2013-02-28 16:36:24 -0600 (Thu, 28 Feb 2013) $
# $Revision: 42951 $
# $Date: 2014-01-21 14:41:41 -0600 (Tue, 21 Jan 2014) $
#
#//===----------------------------------------------------------------------===//


@ -161,10 +161,8 @@
# Regular entry points
__kmp_wait_yield_4
__kmp_wait_yield_8
__kmp_wait_sleep
__kmp_fork_call
__kmp_invoke_microtask
__kmp_release
__kmp_launch_monitor
__kmp_launch_worker
__kmp_reap_monitor
@ -192,6 +190,14 @@
_You_must_link_with_Microsoft_OpenMP_library DATA
%endif
__kmp_wait_32
__kmp_wait_64
__kmp_wait_oncore
__kmp_release_32
__kmp_release_64
__kmp_release_oncore
# VT_getthid 1
# vtgthid 2
@ -360,6 +366,18 @@ kmpc_set_defaults 224
__kmpc_cancel 244
__kmpc_cancellationpoint 245
__kmpc_cancel_barrier 246
__kmpc_dist_for_static_init_4 247
__kmpc_dist_for_static_init_4u 248
__kmpc_dist_for_static_init_8 249
__kmpc_dist_for_static_init_8u 250
__kmpc_dist_dispatch_init_4 251
__kmpc_dist_dispatch_init_4u 252
__kmpc_dist_dispatch_init_8 253
__kmpc_dist_dispatch_init_8u 254
__kmpc_team_static_init_4 255
__kmpc_team_static_init_4u 256
__kmpc_team_static_init_8 257
__kmpc_team_static_init_8u 258
%endif # OMP_40
%endif


@ -40,6 +40,8 @@ VERSION {
__kmp_thread_pool;
__kmp_thread_pool_nth;
__kmp_reset_stats;
#if USE_ITT_BUILD
#
# ITT support.
@ -64,8 +66,12 @@ VERSION {
__kmp_launch_worker;
__kmp_reap_monitor;
__kmp_reap_worker;
__kmp_release;
__kmp_wait_sleep;
__kmp_release_32;
__kmp_release_64;
__kmp_release_oncore;
__kmp_wait_32;
__kmp_wait_64;
__kmp_wait_oncore;
__kmp_wait_yield_4;
__kmp_wait_yield_8;


@ -1,7 +1,7 @@
/*
* extractExternal.cpp
* $Revision: 42181 $
* $Date: 2013-03-26 15:04:45 -0500 (Tue, 26 Mar 2013) $
* $Revision: 43084 $
* $Date: 2014-04-15 09:15:14 -0500 (Tue, 15 Apr 2014) $
*/


@ -1,6 +1,6 @@
# en_US.txt #
# $Revision: 42659 $
# $Date: 2013-09-12 09:22:48 -0500 (Thu, 12 Sep 2013) $
# $Revision: 43419 $
# $Date: 2014-08-27 14:59:52 -0500 (Wed, 27 Aug 2014) $
#
#//===----------------------------------------------------------------------===//
@ -40,7 +40,7 @@ Language "English"
Country "USA"
LangId "1033"
Version "2"
Revision "20130911"
Revision "20140827"
@ -290,7 +290,7 @@ ChangeThreadAffMaskError "Cannot change thread affinity mask."
ThreadsMigrate "%1$s: Threads may migrate across %2$d innermost levels of machine"
DecreaseToThreads "%1$s: decrease to %2$d threads"
IncreaseToThreads "%1$s: increase to %2$d threads"
BoundToOSProcSet "%1$s: Internal thread %2$d bound to OS proc set %3$s"
OBSOLETE "%1$s: Internal thread %2$d bound to OS proc set %3$s"
AffCapableUseCpuinfo "%1$s: Affinity capable, using cpuinfo file"
AffUseGlobCpuid "%1$s: Affinity capable, using global cpuid info"
AffCapableUseFlat "%1$s: Affinity capable, using default \"flat\" topology"
@ -395,9 +395,17 @@ AffThrPlaceInvalid "%1$s: invalid value \"%2$s\", valid format is \"nC
AffThrPlaceUnsupported "KMP_PLACE_THREADS ignored: unsupported architecture."
AffThrPlaceManyCores "KMP_PLACE_THREADS ignored: too many cores requested."
SyntaxErrorUsing "%1$s: syntax error, using %2$s."
AdaptiveNotSupported "%1$s: Adaptive locks are not supported; using queuing."
EnvSyntaxError "%1$s: Invalid symbols found. Check the value \"%2$s\"."
EnvSpacesNotAllowed "%1$s: Spaces between digits are not allowed \"%2$s\"."
AdaptiveNotSupported "%1$s: Adaptive locks are not supported; using queuing."
EnvSyntaxError "%1$s: Invalid symbols found. Check the value \"%2$s\"."
EnvSpacesNotAllowed "%1$s: Spaces between digits are not allowed \"%2$s\"."
BoundToOSProcSet "%1$s: pid %2$d thread %3$d bound to OS proc set %4$s"
CnsLoopIncrIllegal "%1$s error: parallel loop increment and condition are inconsistent."
NoGompCancellation "libgomp cancellation is not currently supported."
AffThrPlaceNonUniform "KMP_PLACE_THREADS ignored: non-uniform topology."
AffThrPlaceNonThreeLevel "KMP_PLACE_THREADS ignored: only three-level topology is supported."
AffGranTopGroup "%1$s: granularity=%2$s is not supported with KMP_TOPOLOGY_METHOD=group. Using \"granularity=fine\"."
AffGranGroupType "%1$s: granularity=group is not supported with KMP_AFFINITY=%2$s. Using \"granularity=core\"."
# --------------------------------------------------------------------------------------------------
-*- HINTS -*-


@ -1,7 +1,7 @@
/*
* include/25/iomp.h.var
* $Revision: 42061 $
* $Date: 2013-02-28 16:36:24 -0600 (Thu, 28 Feb 2013) $
* $Revision: 42951 $
* $Date: 2014-01-21 14:41:41 -0600 (Tue, 21 Jan 2014) $
*/


@ -1,6 +1,6 @@
! include/25/iomp_lib.h.var
! $Revision: 42061 $
! $Date: 2013-02-28 16:36:24 -0600 (Thu, 28 Feb 2013) $
! $Revision: 42951 $
! $Date: 2014-01-21 14:41:41 -0600 (Tue, 21 Jan 2014) $
!
!//===----------------------------------------------------------------------===//


@ -1,7 +1,7 @@
/*
* include/25/omp.h.var
* $Revision: 42061 $
* $Date: 2013-02-28 16:36:24 -0600 (Thu, 28 Feb 2013) $
* $Revision: 42951 $
* $Date: 2014-01-21 14:41:41 -0600 (Tue, 21 Jan 2014) $
*/


@ -1,6 +1,6 @@
! include/25/omp_lib.f.var
! $Revision: 42181 $
! $Date: 2013-03-26 15:04:45 -0500 (Tue, 26 Mar 2013) $
! $Revision: 42951 $
! $Date: 2014-01-21 14:41:41 -0600 (Tue, 21 Jan 2014) $
!
!//===----------------------------------------------------------------------===//
@ -314,7 +314,7 @@
!dec$ else
!***
!*** On Windows* OS IA-32 architecture, the Fortran entry points have an
!*** On Windows* OS IA-32 architecture, the Fortran entry points have an
!*** underscore prepended.
!***


@ -1,6 +1,6 @@
! include/25/omp_lib.f90.var
! $Revision: 42061 $
! $Date: 2013-02-28 16:36:24 -0600 (Thu, 28 Feb 2013) $
! $Revision: 42951 $
! $Date: 2014-01-21 14:41:41 -0600 (Tue, 21 Jan 2014) $
!
!//===----------------------------------------------------------------------===//


@ -1,6 +1,6 @@
! include/25/omp_lib.h.var
! $Revision: 42181 $
! $Date: 2013-03-26 15:04:45 -0500 (Tue, 26 Mar 2013) $
! $Revision: 42951 $
! $Date: 2014-01-21 14:41:41 -0600 (Tue, 21 Jan 2014) $
!
!//===----------------------------------------------------------------------===//
@ -301,7 +301,7 @@
!dec$ else
!***
!*** On Windows* OS IA-32 architecture, the Fortran entry points have an
!*** On Windows* OS IA-32 architecture, the Fortran entry points have an
!*** underscore prepended.
!***


@ -1,7 +1,7 @@
/*
* include/30/iomp.h.var
* $Revision: 42061 $
* $Date: 2013-02-28 16:36:24 -0600 (Thu, 28 Feb 2013) $
* $Revision: 42951 $
* $Date: 2014-01-21 14:41:41 -0600 (Tue, 21 Jan 2014) $
*/


@ -1,6 +1,6 @@
! include/30/iomp_lib.h.var
! $Revision: 42061 $
! $Date: 2013-02-28 16:36:24 -0600 (Thu, 28 Feb 2013) $
! $Revision: 42951 $
! $Date: 2014-01-21 14:41:41 -0600 (Tue, 21 Jan 2014) $
!
!//===----------------------------------------------------------------------===//


@ -1,7 +1,7 @@
/*
* include/30/omp.h.var
* $Revision: 42061 $
* $Date: 2013-02-28 16:36:24 -0600 (Thu, 28 Feb 2013) $
* $Revision: 42951 $
* $Date: 2014-01-21 14:41:41 -0600 (Tue, 21 Jan 2014) $
*/


@ -1,6 +1,6 @@
! include/30/omp_lib.f.var
! $Revision: 42181 $
! $Date: 2013-03-26 15:04:45 -0500 (Tue, 26 Mar 2013) $
! $Revision: 42951 $
! $Date: 2014-01-21 14:41:41 -0600 (Tue, 21 Jan 2014) $
!
!//===----------------------------------------------------------------------===//


@ -1,6 +1,6 @@
! include/30/omp_lib.f90.var
! $Revision: 42061 $
! $Date: 2013-02-28 16:36:24 -0600 (Thu, 28 Feb 2013) $
! $Revision: 42951 $
! $Date: 2014-01-21 14:41:41 -0600 (Tue, 21 Jan 2014) $
!
!//===----------------------------------------------------------------------===//


@ -1,6 +1,6 @@
! include/30/omp_lib.h.var
! $Revision: 42181 $
! $Date: 2013-03-26 15:04:45 -0500 (Tue, 26 Mar 2013) $
! $Revision: 42951 $
! $Date: 2014-01-21 14:41:41 -0600 (Tue, 21 Jan 2014) $
!
!//===----------------------------------------------------------------------===//


@ -91,7 +91,7 @@
} kmp_cancel_kind_t;
extern int __KAI_KMPC_CONVENTION kmp_get_cancellation_status(kmp_cancel_kind_t);
# undef __KAI_KMPC_CONVENTION
/* Warning:

[File diff suppressed because it is too large]


@ -1,7 +1,7 @@
/*
* kmp_affinity.cpp -- affinity management
* $Revision: 42810 $
* $Date: 2013-11-07 12:06:33 -0600 (Thu, 07 Nov 2013) $
* $Revision: 43473 $
* $Date: 2014-09-26 15:02:57 -0500 (Fri, 26 Sep 2014) $
*/
@ -19,7 +19,7 @@
#include "kmp_i18n.h"
#include "kmp_io.h"
#include "kmp_str.h"
#include "kmp_wrapper_getpid.h"
#if KMP_AFFINITY_SUPPORTED
@ -49,7 +49,7 @@ __kmp_affinity_print_mask(char *buf, int buf_len, kmp_affin_mask_t *mask)
return buf;
}
sprintf(scan, "{%ld", i);
sprintf(scan, "{%ld", (long)i);
while (*scan != '\0') scan++;
i++;
for (; i < KMP_CPU_SETSIZE; i++) {
@ -66,7 +66,7 @@ __kmp_affinity_print_mask(char *buf, int buf_len, kmp_affin_mask_t *mask)
if (end - scan < 15) {
break;
}
sprintf(scan, ",%-ld", i);
sprintf(scan, ",%-ld", (long)i);
while (*scan != '\0') scan++;
}
if (i < KMP_CPU_SETSIZE) {
@ -89,7 +89,6 @@ __kmp_affinity_entire_machine_mask(kmp_affin_mask_t *mask)
if (__kmp_num_proc_groups > 1) {
int group;
struct GROUP_AFFINITY ga;
KMP_DEBUG_ASSERT(__kmp_GetActiveProcessorCount != NULL);
for (group = 0; group < __kmp_num_proc_groups; group++) {
int i;
@ -315,6 +314,106 @@ __kmp_affinity_cmp_Address_child_num(const void *a, const void *b)
return 0;
}
/** A structure for holding machine-specific hierarchy info to be computed once at init. */
class hierarchy_info {
public:
/** Typical levels are threads/core, cores/package or socket, packages/node, nodes/machine,
etc. We don't want to get specific with nomenclature */
static const kmp_uint32 maxLevels=7;
/** This is specifically the depth of the machine configuration hierarchy, in terms of the
number of levels along the longest path from root to any leaf. It corresponds to the
number of entries in numPerLevel if we exclude all but one trailing 1. */
kmp_uint32 depth;
kmp_uint32 base_depth;
kmp_uint32 base_num_threads;
bool uninitialized;
/** Level 0 corresponds to leaves. numPerLevel[i] is the number of children the parent of a
node at level i has. For example, if we have a machine with 4 packages, 4 cores/package
and 2 HT per core, then numPerLevel = {2, 4, 4, 1, 1}. All empty levels are set to 1. */
kmp_uint32 numPerLevel[maxLevels];
kmp_uint32 skipPerLevel[maxLevels];
void deriveLevels(AddrUnsPair *adr2os, int num_addrs) {
int hier_depth = adr2os[0].first.depth;
int level = 0;
for (int i=hier_depth-1; i>=0; --i) {
int max = -1;
for (int j=0; j<num_addrs; ++j) {
int next = adr2os[j].first.childNums[i];
if (next > max) max = next;
}
numPerLevel[level] = max+1;
++level;
}
}
hierarchy_info() : depth(1), uninitialized(true) {}
void init(AddrUnsPair *adr2os, int num_addrs)
{
uninitialized = false;
for (kmp_uint32 i=0; i<maxLevels; ++i) { // init numPerLevel[*] to 1 item per level
numPerLevel[i] = 1;
skipPerLevel[i] = 1;
}
// Sort table by physical ID
if (adr2os) {
qsort(adr2os, num_addrs, sizeof(*adr2os), __kmp_affinity_cmp_Address_labels);
deriveLevels(adr2os, num_addrs);
}
else {
numPerLevel[0] = 4;
numPerLevel[1] = num_addrs/4;
if (num_addrs%4) numPerLevel[1]++;
}
base_num_threads = num_addrs;
for (int i=maxLevels-1; i>=0; --i) // count non-empty levels to get depth
if (numPerLevel[i] != 1 || depth > 1) // only count one top-level '1'
depth++;
kmp_uint32 branch = 4;
if (numPerLevel[0] == 1) branch = num_addrs/4;
if (branch<4) branch=4;
for (kmp_uint32 d=0; d<depth-1; ++d) { // optimize hierarchy width
while (numPerLevel[d] > branch || (d==0 && numPerLevel[d]>4)) { // max 4 on level 0!
if (numPerLevel[d] & 1) numPerLevel[d]++;
numPerLevel[d] = numPerLevel[d] >> 1;
if (numPerLevel[d+1] == 1) depth++;
numPerLevel[d+1] = numPerLevel[d+1] << 1;
}
if(numPerLevel[0] == 1) {
branch = branch >> 1;
if (branch<4) branch = 4;
}
}
for (kmp_uint32 i=1; i<depth; ++i)
skipPerLevel[i] = numPerLevel[i-1] * skipPerLevel[i-1];
base_depth = depth;
}
};
static hierarchy_info machine_hierarchy;
void __kmp_get_hierarchy(kmp_uint32 nproc, kmp_bstate_t *thr_bar) {
if (machine_hierarchy.uninitialized)
machine_hierarchy.init(NULL, nproc);
if (nproc <= machine_hierarchy.base_num_threads)
machine_hierarchy.depth = machine_hierarchy.base_depth;
KMP_DEBUG_ASSERT(machine_hierarchy.depth > 0);
while (nproc > machine_hierarchy.skipPerLevel[machine_hierarchy.depth-1]) {
machine_hierarchy.depth++;
machine_hierarchy.skipPerLevel[machine_hierarchy.depth-1] = 2*machine_hierarchy.skipPerLevel[machine_hierarchy.depth-2];
}
thr_bar->depth = machine_hierarchy.depth;
thr_bar->base_leaf_kids = (kmp_uint8)machine_hierarchy.numPerLevel[0]-1;
thr_bar->skip_per_level = machine_hierarchy.skipPerLevel;
}
//
// When sorting by labels, __kmp_affinity_assign_child_nums() must first be
@ -1963,7 +2062,7 @@ __kmp_affinity_create_cpuinfo_map(AddrUnsPair **address2os, int *line,
// A newline has signalled the end of the processor record.
// Check that there aren't too many procs specified.
//
if (num_avail == __kmp_xproc) {
if ((int)num_avail == __kmp_xproc) {
CLEANUP_THREAD_INFO;
*msg_id = kmp_i18n_str_TooManyEntries;
return -1;
@ -2587,7 +2686,7 @@ static int nextNewMask;
#define ADD_MASK_OSID(_osId,_osId2Mask,_maxOsId) \
{ \
if (((_osId) > _maxOsId) || \
(! KMP_CPU_ISSET((_osId), KMP_CPU_INDEX(_osId2Mask, (_osId))))) {\
(! KMP_CPU_ISSET((_osId), KMP_CPU_INDEX((_osId2Mask), (_osId))))) { \
if (__kmp_affinity_verbose || (__kmp_affinity_warnings \
&& (__kmp_affinity_type != affinity_none))) { \
KMP_WARNING(AffIgnoreInvalidProcID, _osId); \
@ -3045,14 +3144,15 @@ __kmp_process_place(const char **scan, kmp_affin_mask_t *osId2Mask,
(*setSize)++;
}
*scan = next; // skip num
}
}
else {
KMP_ASSERT2(0, "bad explicit places list");
}
}
static void
//static void
void
__kmp_affinity_process_placelist(kmp_affin_mask_t **out_masks,
unsigned int *out_numMasks, const char *placelist,
kmp_affin_mask_t *osId2Mask, int maxOsId)
@ -3109,71 +3209,41 @@ __kmp_affinity_process_placelist(kmp_affin_mask_t **out_masks,
// valid follow sets are ',' ':' and EOL
//
SKIP_WS(scan);
int stride;
if (*scan == '\0' || *scan == ',') {
int i;
for (i = 0; i < count; i++) {
int j;
if (setSize == 0) {
break;
}
ADD_MASK(tempMask);
setSize = 0;
for (j = __kmp_affin_mask_size * CHAR_BIT - 1; j > 0; j--) {
//
// Use a temp var in case macro is changed to evaluate
// args multiple times.
//
if (KMP_CPU_ISSET(j - 1, tempMask)) {
KMP_CPU_SET(j, tempMask);
setSize++;
}
else {
KMP_CPU_CLR(j, tempMask);
}
}
for (; j >= 0; j--) {
KMP_CPU_CLR(j, tempMask);
}
}
KMP_CPU_ZERO(tempMask);
setSize = 0;
stride = +1;
}
else {
KMP_ASSERT2(*scan == ':', "bad explicit places list");
scan++; // skip ':'
if (*scan == '\0') {
//
// Read stride parameter
//
int sign = +1;
for (;;) {
SKIP_WS(scan);
if (*scan == '+') {
scan++; // skip '+'
continue;
}
if (*scan == '-') {
sign *= -1;
scan++; // skip '-'
continue;
}
break;
}
scan++; // skip ','
continue;
}
KMP_ASSERT2(*scan == ':', "bad explicit places list");
scan++; // skip ':'
//
// Read stride parameter
//
int sign = +1;
for (;;) {
SKIP_WS(scan);
if (*scan == '+') {
scan++; // skip '+'
continue;
}
if (*scan == '-') {
sign *= -1;
scan++; // skip '-'
continue;
}
break;
KMP_ASSERT2((*scan >= '0') && (*scan <= '9'),
"bad explicit places list");
next = scan;
SKIP_DIGITS(next);
stride = __kmp_str_to_int(scan, *next);
KMP_DEBUG_ASSERT(stride >= 0);
scan = next;
stride *= sign;
}
SKIP_WS(scan);
KMP_ASSERT2((*scan >= '0') && (*scan <= '9'),
"bad explicit places list");
next = scan;
SKIP_DIGITS(next);
int stride = __kmp_str_to_int(scan, *next);
KMP_DEBUG_ASSERT(stride >= 0);
scan = next;
stride *= sign;
if (stride > 0) {
int i;
@ -3185,12 +3255,20 @@ __kmp_affinity_process_placelist(kmp_affin_mask_t **out_masks,
ADD_MASK(tempMask);
setSize = 0;
for (j = __kmp_affin_mask_size * CHAR_BIT - 1; j >= stride; j--) {
if (KMP_CPU_ISSET(j - stride, tempMask)) {
KMP_CPU_SET(j, tempMask);
setSize++;
if (! KMP_CPU_ISSET(j - stride, tempMask)) {
KMP_CPU_CLR(j, tempMask);
}
else if ((j > maxOsId) ||
(! KMP_CPU_ISSET(j, KMP_CPU_INDEX(osId2Mask, j)))) {
if (__kmp_affinity_verbose || (__kmp_affinity_warnings
&& (__kmp_affinity_type != affinity_none))) {
KMP_WARNING(AffIgnoreInvalidProcID, j);
}
KMP_CPU_CLR(j, tempMask);
}
else {
KMP_CPU_CLR(j, tempMask);
KMP_CPU_SET(j, tempMask);
setSize++;
}
}
for (; j >= 0; j--) {
@ -3201,23 +3279,31 @@ __kmp_affinity_process_placelist(kmp_affin_mask_t **out_masks,
else {
int i;
for (i = 0; i < count; i++) {
unsigned j;
int j;
if (setSize == 0) {
break;
}
ADD_MASK(tempMask);
setSize = 0;
for (j = 0; j < (__kmp_affin_mask_size * CHAR_BIT) + stride;
for (j = 0; j < ((int)__kmp_affin_mask_size * CHAR_BIT) + stride;
j++) {
if (KMP_CPU_ISSET(j - stride, tempMask)) {
if (! KMP_CPU_ISSET(j - stride, tempMask)) {
KMP_CPU_CLR(j, tempMask);
}
else if ((j > maxOsId) ||
(! KMP_CPU_ISSET(j, KMP_CPU_INDEX(osId2Mask, j)))) {
if (__kmp_affinity_verbose || (__kmp_affinity_warnings
&& (__kmp_affinity_type != affinity_none))) {
KMP_WARNING(AffIgnoreInvalidProcID, j);
}
KMP_CPU_CLR(j, tempMask);
}
else {
KMP_CPU_SET(j, tempMask);
setSize++;
}
else {
KMP_CPU_CLR(j, tempMask);
}
}
for (; j < __kmp_affin_mask_size * CHAR_BIT; j++) {
for (; j < (int)__kmp_affin_mask_size * CHAR_BIT; j++) {
KMP_CPU_CLR(j, tempMask);
}
}
@ -3270,9 +3356,13 @@ __kmp_apply_thread_places(AddrUnsPair **pAddr, int depth)
}
__kmp_place_num_cores = nCoresPerPkg; // use all available cores
}
if ( !__kmp_affinity_uniform_topology() || depth != 3 ) {
KMP_WARNING( AffThrPlaceUnsupported );
return; // don't support non-uniform topology or not-3-level architecture
if ( !__kmp_affinity_uniform_topology() ) {
KMP_WARNING( AffThrPlaceNonUniform );
return; // don't support non-uniform topology
}
if ( depth != 3 ) {
KMP_WARNING( AffThrPlaceNonThreeLevel );
return; // don't support not-3-level topology
}
if ( __kmp_place_num_threads_per_core == 0 ) {
__kmp_place_num_threads_per_core = __kmp_nThreadsPerCore; // use all HW contexts
@ -3400,18 +3490,14 @@ __kmp_aux_affinity_initialize(void)
}
if (depth < 0) {
if ((msg_id != kmp_i18n_null)
&& (__kmp_affinity_verbose || (__kmp_affinity_warnings
&& (__kmp_affinity_type != affinity_none)))) {
# if KMP_MIC
if (__kmp_affinity_verbose) {
if (__kmp_affinity_verbose) {
if (msg_id != kmp_i18n_null) {
KMP_INFORM(AffInfoStrStr, "KMP_AFFINITY", __kmp_i18n_catgets(msg_id),
KMP_I18N_STR(DecodingLegacyAPIC));
}
# else
KMP_WARNING(AffInfoStrStr, "KMP_AFFINITY", __kmp_i18n_catgets(msg_id),
KMP_I18N_STR(DecodingLegacyAPIC));
# endif
else {
KMP_INFORM(AffInfoStr, "KMP_AFFINITY", KMP_I18N_STR(DecodingLegacyAPIC));
}
}
file_name = NULL;
@ -3428,19 +3514,13 @@ __kmp_aux_affinity_initialize(void)
# if KMP_OS_LINUX
if (depth < 0) {
if ((msg_id != kmp_i18n_null)
&& (__kmp_affinity_verbose || (__kmp_affinity_warnings
&& (__kmp_affinity_type != affinity_none)))) {
# if KMP_MIC
if (__kmp_affinity_verbose) {
if (__kmp_affinity_verbose) {
if (msg_id != kmp_i18n_null) {
KMP_INFORM(AffStrParseFilename, "KMP_AFFINITY", __kmp_i18n_catgets(msg_id), "/proc/cpuinfo");
}
# else
KMP_WARNING(AffStrParseFilename, "KMP_AFFINITY", __kmp_i18n_catgets(msg_id), "/proc/cpuinfo");
# endif
}
else if (__kmp_affinity_verbose) {
KMP_INFORM(AffParseFilename, "KMP_AFFINITY", "/proc/cpuinfo");
else {
KMP_INFORM(AffParseFilename, "KMP_AFFINITY", "/proc/cpuinfo");
}
}
FILE *f = fopen("/proc/cpuinfo", "r");
@ -3461,20 +3541,32 @@ __kmp_aux_affinity_initialize(void)
# endif /* KMP_OS_LINUX */
# if KMP_OS_WINDOWS && KMP_ARCH_X86_64
if ((depth < 0) && (__kmp_num_proc_groups > 1)) {
if (__kmp_affinity_verbose) {
KMP_INFORM(AffWindowsProcGroupMap, "KMP_AFFINITY");
}
depth = __kmp_affinity_create_proc_group_map(&address2os, &msg_id);
KMP_ASSERT(depth != 0);
}
# endif /* KMP_OS_WINDOWS && KMP_ARCH_X86_64 */
if (depth < 0) {
if (msg_id != kmp_i18n_null
&& (__kmp_affinity_verbose || (__kmp_affinity_warnings
&& (__kmp_affinity_type != affinity_none)))) {
if (__kmp_affinity_verbose && (msg_id != kmp_i18n_null)) {
if (file_name == NULL) {
KMP_WARNING(UsingFlatOS, __kmp_i18n_catgets(msg_id));
KMP_INFORM(UsingFlatOS, __kmp_i18n_catgets(msg_id));
}
else if (line == 0) {
KMP_WARNING(UsingFlatOSFile, file_name, __kmp_i18n_catgets(msg_id));
KMP_INFORM(UsingFlatOSFile, file_name, __kmp_i18n_catgets(msg_id));
}
else {
KMP_WARNING(UsingFlatOSFileLine, file_name, line, __kmp_i18n_catgets(msg_id));
KMP_INFORM(UsingFlatOSFileLine, file_name, line, __kmp_i18n_catgets(msg_id));
}
}
// FIXME - print msg if msg_id = kmp_i18n_null ???
file_name = "";
depth = __kmp_affinity_create_flat_map(&address2os, &msg_id);
@ -3508,7 +3600,6 @@ __kmp_aux_affinity_initialize(void)
KMP_ASSERT(address2os == NULL);
return;
}
if (depth < 0) {
KMP_ASSERT(msg_id != kmp_i18n_null);
KMP_FATAL(MsgExiting, __kmp_i18n_catgets(msg_id));
@ -3526,7 +3617,6 @@ __kmp_aux_affinity_initialize(void)
KMP_ASSERT(address2os == NULL);
return;
}
if (depth < 0) {
KMP_ASSERT(msg_id != kmp_i18n_null);
KMP_FATAL(MsgExiting, __kmp_i18n_catgets(msg_id));
@ -3597,23 +3687,9 @@ __kmp_aux_affinity_initialize(void)
depth = __kmp_affinity_create_proc_group_map(&address2os, &msg_id);
KMP_ASSERT(depth != 0);
if (depth < 0) {
if ((msg_id != kmp_i18n_null)
&& (__kmp_affinity_verbose || (__kmp_affinity_warnings
&& (__kmp_affinity_type != affinity_none)))) {
KMP_WARNING(UsingFlatOS, __kmp_i18n_catgets(msg_id));
}
depth = __kmp_affinity_create_flat_map(&address2os, &msg_id);
if (depth == 0) {
KMP_ASSERT(__kmp_affinity_type == affinity_none);
KMP_ASSERT(address2os == NULL);
return;
}
// should not fail
KMP_ASSERT(depth > 0);
KMP_ASSERT(address2os != NULL);
KMP_ASSERT(msg_id != kmp_i18n_null);
KMP_FATAL(MsgExiting, __kmp_i18n_catgets(msg_id));
}
}
@ -3658,7 +3734,7 @@ __kmp_aux_affinity_initialize(void)
kmp_affin_mask_t *osId2Mask = __kmp_create_masks(&maxIndex, &numUnique,
address2os, __kmp_avail_proc);
if (__kmp_affinity_gran_levels == 0) {
KMP_DEBUG_ASSERT(numUnique == __kmp_avail_proc);
KMP_DEBUG_ASSERT((int)numUnique == __kmp_avail_proc);
}
//
@ -3852,6 +3928,7 @@ __kmp_aux_affinity_initialize(void)
}
__kmp_free(osId2Mask);
machine_hierarchy.init(address2os, __kmp_avail_proc);
}
@ -3953,7 +4030,7 @@ __kmp_affinity_set_init_mask(int gtid, int isa_root)
}
# endif
KMP_ASSERT(fullMask != NULL);
i = -1;
i = KMP_PLACE_ALL;
mask = fullMask;
}
else {
@ -4020,7 +4097,8 @@ __kmp_affinity_set_init_mask(int gtid, int isa_root)
char buf[KMP_AFFIN_MASK_PRINT_LEN];
__kmp_affinity_print_mask(buf, KMP_AFFIN_MASK_PRINT_LEN,
th->th.th_affin_mask);
KMP_INFORM(BoundToOSProcSet, "KMP_AFFINITY", gtid, buf);
KMP_INFORM(BoundToOSProcSet, "KMP_AFFINITY", (kmp_int32)getpid(), gtid,
buf);
}
# if KMP_OS_WINDOWS
@ -4058,14 +4136,14 @@ __kmp_affinity_set_place(int gtid)
// Check that the new place is within this thread's partition.
//
KMP_DEBUG_ASSERT(th->th.th_affin_mask != NULL);
KMP_DEBUG_ASSERT(th->th.th_new_place >= 0);
KMP_DEBUG_ASSERT((unsigned)th->th.th_new_place <= __kmp_affinity_num_masks);
KMP_ASSERT(th->th.th_new_place >= 0);
KMP_ASSERT((unsigned)th->th.th_new_place <= __kmp_affinity_num_masks);
if (th->th.th_first_place <= th->th.th_last_place) {
KMP_DEBUG_ASSERT((th->th.th_new_place >= th->th.th_first_place)
KMP_ASSERT((th->th.th_new_place >= th->th.th_first_place)
&& (th->th.th_new_place <= th->th.th_last_place));
}
else {
KMP_DEBUG_ASSERT((th->th.th_new_place <= th->th.th_first_place)
KMP_ASSERT((th->th.th_new_place <= th->th.th_first_place)
|| (th->th.th_new_place >= th->th.th_last_place));
}
@ -4082,7 +4160,8 @@ __kmp_affinity_set_place(int gtid)
char buf[KMP_AFFIN_MASK_PRINT_LEN];
__kmp_affinity_print_mask(buf, KMP_AFFIN_MASK_PRINT_LEN,
th->th.th_affin_mask);
KMP_INFORM(BoundToOSProcSet, "OMP_PROC_BIND", gtid, buf);
KMP_INFORM(BoundToOSProcSet, "OMP_PROC_BIND", (kmp_int32)getpid(),
gtid, buf);
}
__kmp_set_system_affinity(th->th.th_affin_mask, TRUE);
}
@ -4153,6 +4232,11 @@ __kmp_aux_set_affinity(void **mask)
th->th.th_new_place = KMP_PLACE_UNDEFINED;
th->th.th_first_place = 0;
th->th.th_last_place = __kmp_affinity_num_masks - 1;
//
// Turn off 4.0 affinity for the current tread at this parallel level.
//
th->th.th_current_task->td_icvs.proc_bind = proc_bind_false;
# endif
return retval;
@ -4207,7 +4291,6 @@ __kmp_aux_get_affinity(void **mask)
}
int
__kmp_aux_set_affinity_mask_proc(int proc, void **mask)
{
@ -4360,7 +4443,8 @@ void __kmp_balanced_affinity( int tid, int nthreads )
if (__kmp_affinity_verbose) {
char buf[KMP_AFFIN_MASK_PRINT_LEN];
__kmp_affinity_print_mask(buf, KMP_AFFIN_MASK_PRINT_LEN, mask);
KMP_INFORM(BoundToOSProcSet, "KMP_AFFINITY", tid, buf);
KMP_INFORM(BoundToOSProcSet, "KMP_AFFINITY", (kmp_int32)getpid(),
tid, buf);
}
__kmp_set_system_affinity( mask, TRUE );
} else { // Non-uniform topology
@ -4535,7 +4619,8 @@ void __kmp_balanced_affinity( int tid, int nthreads )
if (__kmp_affinity_verbose) {
char buf[KMP_AFFIN_MASK_PRINT_LEN];
__kmp_affinity_print_mask(buf, KMP_AFFIN_MASK_PRINT_LEN, mask);
KMP_INFORM(BoundToOSProcSet, "KMP_AFFINITY", tid, buf);
KMP_INFORM(BoundToOSProcSet, "KMP_AFFINITY", (kmp_int32)getpid(),
tid, buf);
}
__kmp_set_system_affinity( mask, TRUE );
}
@ -4543,4 +4628,50 @@ void __kmp_balanced_affinity( int tid, int nthreads )
# endif /* KMP_MIC */
#else
// affinity not supported
kmp_uint32 mac_skipPerLevel[7];
kmp_uint32 mac_depth;
kmp_uint8 mac_leaf_kids;
void __kmp_get_hierarchy(kmp_uint32 nproc, kmp_bstate_t *thr_bar) {
static int first = 1;
if (first) {
const kmp_uint32 maxLevels = 7;
kmp_uint32 numPerLevel[maxLevels];
for (kmp_uint32 i=0; i<maxLevels; ++i) { // init numPerLevel[*] to 1 item per level
numPerLevel[i] = 1;
mac_skipPerLevel[i] = 1;
}
mac_depth = 2;
numPerLevel[0] = nproc;
kmp_uint32 branch = 4;
if (numPerLevel[0] == 1) branch = nproc/4;
if (branch<4) branch=4;
for (kmp_uint32 d=0; d<mac_depth-1; ++d) { // optimize hierarchy width
while (numPerLevel[d] > branch || (d==0 && numPerLevel[d]>4)) { // max 4 on level 0!
if (numPerLevel[d] & 1) numPerLevel[d]++;
numPerLevel[d] = numPerLevel[d] >> 1;
if (numPerLevel[d+1] == 1) mac_depth++;
numPerLevel[d+1] = numPerLevel[d+1] << 1;
}
if(numPerLevel[0] == 1) {
branch = branch >> 1;
if (branch<4) branch = 4;
}
}
for (kmp_uint32 i=1; i<mac_depth; ++i)
mac_skipPerLevel[i] = numPerLevel[i-1] * mac_skipPerLevel[i-1];
mac_leaf_kids = (kmp_uint8)numPerLevel[0]-1;
first=0;
}
thr_bar->depth = mac_depth;
thr_bar->base_leaf_kids = mac_leaf_kids;
thr_bar->skip_per_level = mac_skipPerLevel;
}
#endif // KMP_AFFINITY_SUPPORTED


@ -1,7 +1,7 @@
/*
* kmp_alloc.c -- private/shared dyanmic memory allocation and management
* $Revision: 42810 $
* $Date: 2013-11-07 12:06:33 -0600 (Thu, 07 Nov 2013) $
* $Revision: 43450 $
* $Date: 2014-09-09 10:07:22 -0500 (Tue, 09 Sep 2014) $
*/
@ -1228,7 +1228,7 @@ bpoold( kmp_info_t *th, void *buf, int dumpalloc, int dumpfree)
bufdump( th, (void *) (((char *) b) + sizeof(bhead_t)));
}
} else {
char *lerr = "";
const char *lerr = "";
KMP_DEBUG_ASSERT(bs > 0);
if ((b->ql.blink->ql.flink != b) || (b->ql.flink->ql.blink != b)) {
@ -1772,7 +1772,11 @@ ___kmp_free( void * ptr KMP_SRC_LOC_DECL )
#ifndef LEAK_MEMORY
KE_TRACE( 10, ( " free( %p )\n", descr.ptr_allocated ) );
# ifdef KMP_DEBUG
_free_src_loc( descr.ptr_allocated, _file_, _line_ );
# else
free_src_loc( descr.ptr_allocated KMP_SRC_LOC_PARM );
# endif
#endif
KMP_MB();
@ -1790,7 +1794,7 @@ ___kmp_free( void * ptr KMP_SRC_LOC_DECL )
// Otherwise allocate normally using kmp_thread_malloc.
// AC: How to choose the limit? Just get 16 for now...
static int const __kmp_free_list_limit = 16;
#define KMP_FREE_LIST_LIMIT 16
// Always use 128 bytes for determining buckets for caching memory blocks
#define DCACHE_LINE 128
@ -1932,7 +1936,7 @@ ___kmp_fast_free( kmp_info_t *this_thr, void * ptr KMP_SRC_LOC_DECL )
kmp_mem_descr_t * dsc = (kmp_mem_descr_t *)( (char*)head - sizeof(kmp_mem_descr_t) );
kmp_info_t * q_th = (kmp_info_t *)(dsc->ptr_aligned); // allocating thread, same for all queue nodes
size_t q_sz = dsc->size_allocated + 1; // new size in case we add current task
if ( q_th == alloc_thr && q_sz <= __kmp_free_list_limit ) {
if ( q_th == alloc_thr && q_sz <= KMP_FREE_LIST_LIMIT ) {
// we can add current task to "other" list, no sync needed
*((void **)ptr) = head;
descr->size_allocated = q_sz;


@ -1,7 +1,7 @@
/*
* kmp_atomic.c -- ATOMIC implementation routines
* $Revision: 42810 $
* $Date: 2013-11-07 12:06:33 -0600 (Thu, 07 Nov 2013) $
* $Revision: 43421 $
* $Date: 2014-08-28 08:56:10 -0500 (Thu, 28 Aug 2014) $
*/
@ -690,7 +690,7 @@ RET_TYPE __kmpc_atomic_##TYPE_ID##_##OP_ID( ident_t *id_ref, int gtid, TYPE * lh
#endif /* KMP_GOMP_COMPAT */
#if KMP_MIC
# define KMP_DO_PAUSE _mm_delay_32( 30 )
# define KMP_DO_PAUSE _mm_delay_32( 1 )
#else
# define KMP_DO_PAUSE KMP_CPU_PAUSE()
#endif /* KMP_MIC */
@ -700,14 +700,10 @@ RET_TYPE __kmpc_atomic_##TYPE_ID##_##OP_ID( ident_t *id_ref, int gtid, TYPE * lh
// TYPE - operands' type
// BITS - size in bits, used to distinguish low level calls
// OP - operator
// Note: temp_val introduced in order to force the compiler to read
// *lhs only once (w/o it the compiler reads *lhs twice)
#define OP_CMPXCHG(TYPE,BITS,OP) \
{ \
TYPE KMP_ATOMIC_VOLATILE temp_val; \
TYPE old_value, new_value; \
temp_val = *lhs; \
old_value = temp_val; \
old_value = *(TYPE volatile *)lhs; \
new_value = old_value OP rhs; \
while ( ! KMP_COMPARE_AND_STORE_ACQ##BITS( (kmp_int##BITS *) lhs, \
*VOLATILE_CAST(kmp_int##BITS *) &old_value, \
@ -715,8 +711,7 @@ RET_TYPE __kmpc_atomic_##TYPE_ID##_##OP_ID( ident_t *id_ref, int gtid, TYPE * lh
{ \
KMP_DO_PAUSE; \
\
temp_val = *lhs; \
old_value = temp_val; \
old_value = *(TYPE volatile *)lhs; \
new_value = old_value OP rhs; \
} \
}
@ -765,13 +760,6 @@ ATOMIC_BEGIN(TYPE_ID,OP_ID,TYPE,void) \
KMP_TEST_THEN_ADD##BITS( lhs, OP rhs ); \
}
// -------------------------------------------------------------------------
#define ATOMIC_FLOAT_ADD(TYPE_ID,OP_ID,TYPE,BITS,OP,LCK_ID,MASK,GOMP_FLAG) \
ATOMIC_BEGIN(TYPE_ID,OP_ID,TYPE,void) \
OP_GOMP_CRITICAL(OP##=,GOMP_FLAG) \
/* OP used as a sign for subtraction: (lhs-rhs) --> (lhs+-rhs) */ \
KMP_TEST_THEN_ADD_REAL##BITS( lhs, OP rhs ); \
}
// -------------------------------------------------------------------------
#define ATOMIC_CMPXCHG(TYPE_ID,OP_ID,TYPE,BITS,OP,LCK_ID,MASK,GOMP_FLAG) \
ATOMIC_BEGIN(TYPE_ID,OP_ID,TYPE,void) \
OP_GOMP_CRITICAL(OP##=,GOMP_FLAG) \
@ -803,17 +791,6 @@ ATOMIC_BEGIN(TYPE_ID,OP_ID,TYPE,void) \
} \
}
// -------------------------------------------------------------------------
#define ATOMIC_FLOAT_ADD(TYPE_ID,OP_ID,TYPE,BITS,OP,LCK_ID,MASK,GOMP_FLAG) \
ATOMIC_BEGIN(TYPE_ID,OP_ID,TYPE,void) \
OP_GOMP_CRITICAL(OP##=,GOMP_FLAG) \
if ( ! ( (kmp_uintptr_t) lhs & 0x##MASK) ) { \
OP_CMPXCHG(TYPE,BITS,OP) /* aligned address */ \
} else { \
KMP_CHECK_GTID; \
OP_CRITICAL(OP##=,LCK_ID) /* unaligned address - use critical */ \
} \
}
// -------------------------------------------------------------------------
#define ATOMIC_CMPXCHG(TYPE_ID,OP_ID,TYPE,BITS,OP,LCK_ID,MASK,GOMP_FLAG) \
ATOMIC_BEGIN(TYPE_ID,OP_ID,TYPE,void) \
OP_GOMP_CRITICAL(OP##=,GOMP_FLAG) \
@ -845,25 +822,15 @@ ATOMIC_BEGIN(TYPE_ID,OP_ID,TYPE,void)
ATOMIC_FIXED_ADD( fixed4, add, kmp_int32, 32, +, 4i, 3, 0 ) // __kmpc_atomic_fixed4_add
ATOMIC_FIXED_ADD( fixed4, sub, kmp_int32, 32, -, 4i, 3, 0 ) // __kmpc_atomic_fixed4_sub
#if KMP_MIC
ATOMIC_CMPXCHG( float4, add, kmp_real32, 32, +, 4r, 3, KMP_ARCH_X86 ) // __kmpc_atomic_float4_add
ATOMIC_CMPXCHG( float4, sub, kmp_real32, 32, -, 4r, 3, KMP_ARCH_X86 ) // __kmpc_atomic_float4_sub
#else
ATOMIC_FLOAT_ADD( float4, add, kmp_real32, 32, +, 4r, 3, KMP_ARCH_X86 ) // __kmpc_atomic_float4_add
ATOMIC_FLOAT_ADD( float4, sub, kmp_real32, 32, -, 4r, 3, KMP_ARCH_X86 ) // __kmpc_atomic_float4_sub
#endif // KMP_MIC
// Routines for ATOMIC 8-byte operands addition and subtraction
ATOMIC_FIXED_ADD( fixed8, add, kmp_int64, 64, +, 8i, 7, KMP_ARCH_X86 ) // __kmpc_atomic_fixed8_add
ATOMIC_FIXED_ADD( fixed8, sub, kmp_int64, 64, -, 8i, 7, KMP_ARCH_X86 ) // __kmpc_atomic_fixed8_sub
#if KMP_MIC
ATOMIC_CMPXCHG( float8, add, kmp_real64, 64, +, 8r, 7, KMP_ARCH_X86 ) // __kmpc_atomic_float8_add
ATOMIC_CMPXCHG( float8, sub, kmp_real64, 64, -, 8r, 7, KMP_ARCH_X86 ) // __kmpc_atomic_float8_sub
#else
ATOMIC_FLOAT_ADD( float8, add, kmp_real64, 64, +, 8r, 7, KMP_ARCH_X86 ) // __kmpc_atomic_float8_add
ATOMIC_FLOAT_ADD( float8, sub, kmp_real64, 64, -, 8r, 7, KMP_ARCH_X86 ) // __kmpc_atomic_float8_sub
#endif // KMP_MIC
// ------------------------------------------------------------------------
// Entries definition for integer operands
@ -1867,35 +1834,16 @@ ATOMIC_BEGIN_CPT(TYPE_ID,OP_ID,TYPE,TYPE) \
return old_value; \
}
// -------------------------------------------------------------------------
#define ATOMIC_FLOAT_ADD_CPT(TYPE_ID,OP_ID,TYPE,BITS,OP,GOMP_FLAG) \
ATOMIC_BEGIN_CPT(TYPE_ID,OP_ID,TYPE,TYPE) \
TYPE old_value, new_value; \
OP_GOMP_CRITICAL_CPT(OP,GOMP_FLAG) \
/* OP used as a sign for subtraction: (lhs-rhs) --> (lhs+-rhs) */ \
old_value = KMP_TEST_THEN_ADD_REAL##BITS( lhs, OP rhs ); \
if( flag ) { \
return old_value OP rhs; \
} else \
return old_value; \
}
// -------------------------------------------------------------------------
ATOMIC_FIXED_ADD_CPT( fixed4, add_cpt, kmp_int32, 32, +, 0 ) // __kmpc_atomic_fixed4_add_cpt
ATOMIC_FIXED_ADD_CPT( fixed4, sub_cpt, kmp_int32, 32, -, 0 ) // __kmpc_atomic_fixed4_sub_cpt
ATOMIC_FIXED_ADD_CPT( fixed8, add_cpt, kmp_int64, 64, +, KMP_ARCH_X86 ) // __kmpc_atomic_fixed8_add_cpt
ATOMIC_FIXED_ADD_CPT( fixed8, sub_cpt, kmp_int64, 64, -, KMP_ARCH_X86 ) // __kmpc_atomic_fixed8_sub_cpt
#if KMP_MIC
ATOMIC_CMPXCHG_CPT( float4, add_cpt, kmp_real32, 32, +, KMP_ARCH_X86 ) // __kmpc_atomic_float4_add_cpt
ATOMIC_CMPXCHG_CPT( float4, sub_cpt, kmp_real32, 32, -, KMP_ARCH_X86 ) // __kmpc_atomic_float4_sub_cpt
ATOMIC_CMPXCHG_CPT( float8, add_cpt, kmp_real64, 64, +, KMP_ARCH_X86 ) // __kmpc_atomic_float8_add_cpt
ATOMIC_CMPXCHG_CPT( float8, sub_cpt, kmp_real64, 64, -, KMP_ARCH_X86 ) // __kmpc_atomic_float8_sub_cpt
#else
ATOMIC_FLOAT_ADD_CPT( float4, add_cpt, kmp_real32, 32, +, KMP_ARCH_X86 ) // __kmpc_atomic_float4_add_cpt
ATOMIC_FLOAT_ADD_CPT( float4, sub_cpt, kmp_real32, 32, -, KMP_ARCH_X86 ) // __kmpc_atomic_float4_sub_cpt
ATOMIC_FLOAT_ADD_CPT( float8, add_cpt, kmp_real64, 64, +, KMP_ARCH_X86 ) // __kmpc_atomic_float8_add_cpt
ATOMIC_FLOAT_ADD_CPT( float8, sub_cpt, kmp_real64, 64, -, KMP_ARCH_X86 ) // __kmpc_atomic_float8_sub_cpt
#endif // KMP_MIC
// ------------------------------------------------------------------------
// Entries definition for integer operands


@ -1,7 +1,7 @@
/*
* kmp_atomic.h - ATOMIC header file
* $Revision: 42810 $
* $Date: 2013-11-07 12:06:33 -0600 (Thu, 07 Nov 2013) $
* $Revision: 43191 $
* $Date: 2014-05-27 07:44:11 -0500 (Tue, 27 May 2014) $
*/
@ -33,7 +33,7 @@
#if defined( __cplusplus ) && ( KMP_OS_WINDOWS )
// create shortcuts for c99 complex types
#ifdef _DEBUG
#if (_MSC_VER < 1600) && defined(_DEBUG)
// Workaround for the problem of _DebugHeapTag unresolved external.
// This problem prevented us from using our static debug library for C tests
// compiled with /MDd option (the library itself built with /MTd),

File diff suppressed because it is too large.


@ -1,7 +1,7 @@
/*
* kmp_csupport.c -- kfront linkage support for OpenMP.
* $Revision: 42826 $
* $Date: 2013-11-20 03:39:45 -0600 (Wed, 20 Nov 2013) $
* $Revision: 43473 $
* $Date: 2014-09-26 15:02:57 -0500 (Fri, 26 Sep 2014) $
*/
@ -20,6 +20,7 @@
#include "kmp_i18n.h"
#include "kmp_itt.h"
#include "kmp_error.h"
#include "kmp_stats.h"
#define MAX_MESSAGE 512
@ -35,7 +36,7 @@
* @param flags in for future use (currently ignored)
*
* Initialize the runtime library. This call is optional; if it is not made then
* it will be implicilty called by attempts to use other library functions.
* it will be implicitly called by attempts to use other library functions.
*
*/
void
@ -276,13 +277,18 @@ Do the actual fork and call the microtask in the relevant number of threads.
void
__kmpc_fork_call(ident_t *loc, kmp_int32 argc, kmpc_micro microtask, ...)
{
KMP_STOP_EXPLICIT_TIMER(OMP_serial);
KMP_COUNT_BLOCK(OMP_PARALLEL);
int gtid = __kmp_entry_gtid();
// maybe to save thr_state is enough here
{
va_list ap;
va_start( ap, microtask );
__kmp_fork_call( loc, gtid, TRUE,
#if INCLUDE_SSC_MARKS
SSC_MARK_FORKING();
#endif
__kmp_fork_call( loc, gtid, fork_context_intel,
argc,
VOLATILE_CAST(microtask_t) microtask,
VOLATILE_CAST(launch_t) __kmp_invoke_task_func,
@ -293,10 +299,14 @@ __kmpc_fork_call(ident_t *loc, kmp_int32 argc, kmpc_micro microtask, ...)
ap
#endif
);
#if INCLUDE_SSC_MARKS
SSC_MARK_JOINING();
#endif
__kmp_join_call( loc, gtid );
va_end( ap );
}
KMP_START_EXPLICIT_TIMER(OMP_serial);
}
#if OMP_40_ENABLED
@ -337,17 +347,18 @@ __kmpc_fork_teams(ident_t *loc, kmp_int32 argc, kmpc_micro microtask, ...)
va_start( ap, microtask );
// remember teams entry point and nesting level
this_thr->th.th_team_microtask = microtask;
this_thr->th.th_teams_microtask = microtask;
this_thr->th.th_teams_level = this_thr->th.th_team->t.t_level; // AC: can be >0 on host
// check if __kmpc_push_num_teams called, set default number of teams otherwise
if ( this_thr->th.th_set_nth_teams == 0 ) {
if ( this_thr->th.th_teams_size.nteams == 0 ) {
__kmp_push_num_teams( loc, gtid, 0, 0 );
}
KMP_DEBUG_ASSERT(this_thr->th.th_set_nproc >= 1);
KMP_DEBUG_ASSERT(this_thr->th.th_set_nth_teams >= 1);
KMP_DEBUG_ASSERT(this_thr->th.th_teams_size.nteams >= 1);
KMP_DEBUG_ASSERT(this_thr->th.th_teams_size.nth >= 1);
__kmp_fork_call( loc, gtid, TRUE,
__kmp_fork_call( loc, gtid, fork_context_intel,
argc,
VOLATILE_CAST(microtask_t) __kmp_teams_master,
VOLATILE_CAST(launch_t) __kmp_invoke_teams_master,
@ -358,9 +369,9 @@ __kmpc_fork_teams(ident_t *loc, kmp_int32 argc, kmpc_micro microtask, ...)
#endif
);
__kmp_join_call( loc, gtid );
this_thr->th.th_team_microtask = NULL;
this_thr->th.th_teams_microtask = NULL;
this_thr->th.th_teams_level = 0;
*(kmp_int64*)(&this_thr->th.th_teams_size) = 0L;
va_end( ap );
}
#endif /* OMP_40_ENABLED */
@ -393,252 +404,9 @@ when the condition is false.
void
__kmpc_serialized_parallel(ident_t *loc, kmp_int32 global_tid)
{
kmp_info_t *this_thr;
kmp_team_t *serial_team;
KC_TRACE( 10, ("__kmpc_serialized_parallel: called by T#%d\n", global_tid ) );
/* Skip all this code for autopar serialized loops since it results in
unacceptable overhead */
if( loc != NULL && (loc->flags & KMP_IDENT_AUTOPAR ) )
return;
if( ! TCR_4( __kmp_init_parallel ) )
__kmp_parallel_initialize();
this_thr = __kmp_threads[ global_tid ];
serial_team = this_thr -> th.th_serial_team;
/* utilize the serialized team held by this thread */
KMP_DEBUG_ASSERT( serial_team );
KMP_MB();
#if OMP_30_ENABLED
if ( __kmp_tasking_mode != tskm_immediate_exec ) {
KMP_DEBUG_ASSERT( this_thr -> th.th_task_team == this_thr -> th.th_team -> t.t_task_team );
KMP_DEBUG_ASSERT( serial_team -> t.t_task_team == NULL );
KA_TRACE( 20, ( "__kmpc_serialized_parallel: T#%d pushing task_team %p / team %p, new task_team = NULL\n",
global_tid, this_thr -> th.th_task_team, this_thr -> th.th_team ) );
this_thr -> th.th_task_team = NULL;
}
#endif // OMP_30_ENABLED
#if OMP_40_ENABLED
kmp_proc_bind_t proc_bind = this_thr->th.th_set_proc_bind;
if ( this_thr->th.th_current_task->td_icvs.proc_bind == proc_bind_false ) {
proc_bind = proc_bind_false;
}
else if ( proc_bind == proc_bind_default ) {
//
// No proc_bind clause was specified, so use the current value
// of proc-bind-var for this parallel region.
//
proc_bind = this_thr->th.th_current_task->td_icvs.proc_bind;
}
//
// Reset for next parallel region
//
this_thr->th.th_set_proc_bind = proc_bind_default;
#endif /* OMP_40_ENABLED */
if( this_thr -> th.th_team != serial_team ) {
#if OMP_30_ENABLED
// Nested level will be an index in the nested nthreads array
int level = this_thr->th.th_team->t.t_level;
#endif
if( serial_team -> t.t_serialized ) {
/* this serial team was already used
* TODO increase performance by making these locks more specific */
kmp_team_t *new_team;
int tid = this_thr->th.th_info.ds.ds_tid;
__kmp_acquire_bootstrap_lock( &__kmp_forkjoin_lock );
new_team = __kmp_allocate_team(this_thr->th.th_root, 1, 1,
#if OMP_40_ENABLED
proc_bind,
#endif
#if OMP_30_ENABLED
& this_thr->th.th_current_task->td_icvs,
#else
this_thr->th.th_team->t.t_set_nproc[tid],
this_thr->th.th_team->t.t_set_dynamic[tid],
this_thr->th.th_team->t.t_set_nested[tid],
this_thr->th.th_team->t.t_set_blocktime[tid],
this_thr->th.th_team->t.t_set_bt_intervals[tid],
this_thr->th.th_team->t.t_set_bt_set[tid],
#endif // OMP_30_ENABLED
0);
__kmp_release_bootstrap_lock( &__kmp_forkjoin_lock );
KMP_ASSERT( new_team );
/* setup new serialized team and install it */
new_team -> t.t_threads[0] = this_thr;
new_team -> t.t_parent = this_thr -> th.th_team;
serial_team = new_team;
this_thr -> th.th_serial_team = serial_team;
KF_TRACE( 10, ( "__kmpc_serialized_parallel: T#%d allocated new serial team %p\n",
global_tid, serial_team ) );
/* TODO the above breaks the requirement that if we run out of
* resources, then we can still guarantee that serialized teams
* are ok, since we may need to allocate a new one */
} else {
KF_TRACE( 10, ( "__kmpc_serialized_parallel: T#%d reusing cached serial team %p\n",
global_tid, serial_team ) );
}
/* we have to initialize this serial team */
KMP_DEBUG_ASSERT( serial_team->t.t_threads );
KMP_DEBUG_ASSERT( serial_team->t.t_threads[0] == this_thr );
KMP_DEBUG_ASSERT( this_thr->th.th_team != serial_team );
serial_team -> t.t_ident = loc;
serial_team -> t.t_serialized = 1;
serial_team -> t.t_nproc = 1;
serial_team -> t.t_parent = this_thr->th.th_team;
#if OMP_30_ENABLED
serial_team -> t.t_sched = this_thr->th.th_team->t.t_sched;
#endif // OMP_30_ENABLED
this_thr -> th.th_team = serial_team;
serial_team -> t.t_master_tid = this_thr->th.th_info.ds.ds_tid;
#if OMP_30_ENABLED
KF_TRACE( 10, ( "__kmpc_serialized_parallel: T#d curtask=%p\n",
global_tid, this_thr->th.th_current_task ) );
KMP_ASSERT( this_thr->th.th_current_task->td_flags.executing == 1 );
this_thr->th.th_current_task->td_flags.executing = 0;
__kmp_push_current_task_to_thread( this_thr, serial_team, 0 );
/* TODO: GEH: do the ICVs work for nested serialized teams? Don't we need an implicit task for
each serialized task represented by team->t.t_serialized? */
copy_icvs(
& this_thr->th.th_current_task->td_icvs,
& this_thr->th.th_current_task->td_parent->td_icvs );
// Thread value exists in the nested nthreads array for the next nested level
if ( __kmp_nested_nth.used && ( level + 1 < __kmp_nested_nth.used ) ) {
this_thr->th.th_current_task->td_icvs.nproc = __kmp_nested_nth.nth[ level + 1 ];
}
#if OMP_40_ENABLED
if ( __kmp_nested_proc_bind.used && ( level + 1 < __kmp_nested_proc_bind.used ) ) {
this_thr->th.th_current_task->td_icvs.proc_bind
= __kmp_nested_proc_bind.bind_types[ level + 1 ];
}
#endif /* OMP_40_ENABLED */
#else /* pre-3.0 icv's */
serial_team -> t.t_set_nproc[0] = serial_team->t.t_parent->
t.t_set_nproc[serial_team->
t.t_master_tid];
serial_team -> t.t_set_dynamic[0] = serial_team->t.t_parent->
t.t_set_dynamic[serial_team->
t.t_master_tid];
serial_team -> t.t_set_nested[0] = serial_team->t.t_parent->
t.t_set_nested[serial_team->
t.t_master_tid];
serial_team -> t.t_set_blocktime[0] = serial_team->t.t_parent->
t.t_set_blocktime[serial_team->
t.t_master_tid];
serial_team -> t.t_set_bt_intervals[0] = serial_team->t.t_parent->
t.t_set_bt_intervals[serial_team->
t.t_master_tid];
serial_team -> t.t_set_bt_set[0] = serial_team->t.t_parent->
t.t_set_bt_set[serial_team->
t.t_master_tid];
#endif // OMP_30_ENABLED
this_thr -> th.th_info.ds.ds_tid = 0;
/* set thread cache values */
this_thr -> th.th_team_nproc = 1;
this_thr -> th.th_team_master = this_thr;
this_thr -> th.th_team_serialized = 1;
#if OMP_30_ENABLED
serial_team -> t.t_level = serial_team -> t.t_parent -> t.t_level + 1;
serial_team -> t.t_active_level = serial_team -> t.t_parent -> t.t_active_level;
#endif // OMP_30_ENABLED
#if KMP_ARCH_X86 || KMP_ARCH_X86_64
if ( __kmp_inherit_fp_control ) {
__kmp_store_x87_fpu_control_word( &serial_team->t.t_x87_fpu_control_word );
__kmp_store_mxcsr( &serial_team->t.t_mxcsr );
serial_team->t.t_mxcsr &= KMP_X86_MXCSR_MASK;
serial_team->t.t_fp_control_saved = TRUE;
} else {
serial_team->t.t_fp_control_saved = FALSE;
}
#endif /* KMP_ARCH_X86 || KMP_ARCH_X86_64 */
/* check if we need to allocate dispatch buffers stack */
KMP_DEBUG_ASSERT(serial_team->t.t_dispatch);
if ( !serial_team->t.t_dispatch->th_disp_buffer ) {
serial_team->t.t_dispatch->th_disp_buffer = (dispatch_private_info_t *)
__kmp_allocate( sizeof( dispatch_private_info_t ) );
}
this_thr -> th.th_dispatch = serial_team->t.t_dispatch;
KMP_MB();
} else {
/* this serialized team is already being used,
* that's fine, just add another nested level */
KMP_DEBUG_ASSERT( this_thr->th.th_team == serial_team );
KMP_DEBUG_ASSERT( serial_team -> t.t_threads );
KMP_DEBUG_ASSERT( serial_team -> t.t_threads[0] == this_thr );
++ serial_team -> t.t_serialized;
this_thr -> th.th_team_serialized = serial_team -> t.t_serialized;
#if OMP_30_ENABLED
// Nested level will be an index in the nested nthreads array
int level = this_thr->th.th_team->t.t_level;
// Thread value exists in the nested nthreads array for the next nested level
if ( __kmp_nested_nth.used && ( level + 1 < __kmp_nested_nth.used ) ) {
this_thr->th.th_current_task->td_icvs.nproc = __kmp_nested_nth.nth[ level + 1 ];
}
serial_team -> t.t_level++;
KF_TRACE( 10, ( "__kmpc_serialized_parallel: T#%d increasing nesting level of serial team %p to %d\n",
global_tid, serial_team, serial_team -> t.t_level ) );
#else
KF_TRACE( 10, ( "__kmpc_serialized_parallel: T#%d reusing team %p for nested serialized parallel region\n",
global_tid, serial_team ) );
#endif // OMP_30_ENABLED
/* allocate/push dispatch buffers stack */
KMP_DEBUG_ASSERT(serial_team->t.t_dispatch);
{
dispatch_private_info_t * disp_buffer = (dispatch_private_info_t *)
__kmp_allocate( sizeof( dispatch_private_info_t ) );
disp_buffer->next = serial_team->t.t_dispatch->th_disp_buffer;
serial_team->t.t_dispatch->th_disp_buffer = disp_buffer;
}
this_thr -> th.th_dispatch = serial_team->t.t_dispatch;
KMP_MB();
}
if ( __kmp_env_consistency_check )
__kmp_push_parallel( global_tid, NULL );
// t_level is not available in 2.5 build, so check for OMP_30_ENABLED
#if USE_ITT_BUILD && OMP_30_ENABLED
// Mark the start of the "parallel" region for VTune. Only one frame-notification scheme is used at the moment.
if ( ( __itt_frame_begin_v3_ptr && __kmp_forkjoin_frames && ! __kmp_forkjoin_frames_mode ) || KMP_ITT_DEBUG )
{
__kmp_itt_region_forking( global_tid, 1 );
}
if( ( __kmp_forkjoin_frames_mode == 1 || __kmp_forkjoin_frames_mode == 3 ) && __itt_frame_submit_v3_ptr && __itt_get_timestamp_ptr )
{
#if USE_ITT_NOTIFY
if( this_thr->th.th_team->t.t_level == 1 ) {
this_thr->th.th_frame_time_serialized = __itt_get_timestamp();
}
#endif
}
#endif /* USE_ITT_BUILD */
__kmp_serialized_parallel(loc, global_tid); /* The implementation is now in kmp_runtime.c so that it can share static functions with
* kmp_fork_call since the tasks to be done are similar in each case.
*/
}
/*!
@ -680,26 +448,13 @@ __kmpc_end_serialized_parallel(ident_t *loc, kmp_int32 global_tid)
/* If necessary, pop the internal control stack values and replace the team values */
top = serial_team -> t.t_control_stack_top;
if ( top && top -> serial_nesting_level == serial_team -> t.t_serialized ) {
#if OMP_30_ENABLED
copy_icvs(
&serial_team -> t.t_threads[0] -> th.th_current_task -> td_icvs,
top );
#else
serial_team -> t.t_set_nproc[0] = top -> nproc;
serial_team -> t.t_set_dynamic[0] = top -> dynamic;
serial_team -> t.t_set_nested[0] = top -> nested;
serial_team -> t.t_set_blocktime[0] = top -> blocktime;
serial_team -> t.t_set_bt_intervals[0] = top -> bt_intervals;
serial_team -> t.t_set_bt_set[0] = top -> bt_set;
#endif // OMP_30_ENABLED
copy_icvs( &serial_team -> t.t_threads[0] -> th.th_current_task -> td_icvs, top );
serial_team -> t.t_control_stack_top = top -> next;
__kmp_free(top);
}
#if OMP_30_ENABLED
//if( serial_team -> t.t_serialized > 1 )
serial_team -> t.t_level--;
#endif // OMP_30_ENABLED
/* pop dispatch buffers stack */
KMP_DEBUG_ASSERT(serial_team->t.t_dispatch->th_disp_buffer);
@ -735,7 +490,6 @@ __kmpc_end_serialized_parallel(ident_t *loc, kmp_int32 global_tid)
this_thr -> th.th_dispatch = & this_thr -> th.th_team ->
t.t_dispatch[ serial_team -> t.t_master_tid ];
#if OMP_30_ENABLED
__kmp_pop_current_task_from_thread( this_thr );
KMP_ASSERT( this_thr -> th.th_current_task -> td_flags.executing == 0 );
@ -752,32 +506,37 @@ __kmpc_end_serialized_parallel(ident_t *loc, kmp_int32 global_tid)
KA_TRACE( 20, ( "__kmpc_end_serialized_parallel: T#%d restoring task_team %p / team %p\n",
global_tid, this_thr -> th.th_task_team, this_thr -> th.th_team ) );
}
#endif // OMP_30_ENABLED
}
else {
#if OMP_30_ENABLED
} else {
if ( __kmp_tasking_mode != tskm_immediate_exec ) {
KA_TRACE( 20, ( "__kmpc_end_serialized_parallel: T#%d decreasing nesting depth of serial team %p to %d\n",
global_tid, serial_team, serial_team -> t.t_serialized ) );
}
#endif // OMP_30_ENABLED
}
// t_level is not available in 2.5 build, so check for OMP_30_ENABLED
#if USE_ITT_BUILD && OMP_30_ENABLED
#if USE_ITT_BUILD
kmp_uint64 cur_time = 0;
#if USE_ITT_NOTIFY
if( __itt_get_timestamp_ptr ) {
cur_time = __itt_get_timestamp();
}
#endif /* USE_ITT_NOTIFY */
// Report the barrier
if( ( __kmp_forkjoin_frames_mode == 1 || __kmp_forkjoin_frames_mode == 3 ) && __itt_frame_submit_v3_ptr ) {
if( this_thr->th.th_team->t.t_level == 0 ) {
__kmp_itt_frame_submit( global_tid, this_thr->th.th_frame_time_serialized, cur_time, 0, loc, this_thr->th.th_team_nproc, 0 );
}
}
// Mark the end of the "parallel" region for VTune. Only one frame-notification scheme is used at the moment.
if ( ( __itt_frame_end_v3_ptr && __kmp_forkjoin_frames && ! __kmp_forkjoin_frames_mode ) || KMP_ITT_DEBUG )
{
this_thr->th.th_ident = loc;
__kmp_itt_region_joined( global_tid, 1 );
}
if( ( __kmp_forkjoin_frames_mode == 1 || __kmp_forkjoin_frames_mode == 3 ) && __itt_frame_submit_v3_ptr ) {
if( this_thr->th.th_team->t.t_level == 0 ) {
__kmp_itt_frame_submit( global_tid, this_thr->th.th_frame_time_serialized, __itt_timestamp_none, 0, loc );
}
if ( ( __itt_frame_submit_v3_ptr && __kmp_forkjoin_frames_mode == 3 ) || KMP_ITT_DEBUG )
{
this_thr->th.th_ident = loc;
// Since the barrier frame for a serialized region coincides with the region itself, we use the same begin timestamp as for the barrier.
__kmp_itt_frame_submit( global_tid, serial_team->t.t_region_time, cur_time, 0, loc, this_thr->th.th_team_nproc, 2 );
}
#endif /* USE_ITT_BUILD */
@ -805,55 +564,50 @@ __kmpc_flush(ident_t *loc, ...)
/* need explicit __mf() here since use volatile instead in library */
KMP_MB(); /* Flush all pending memory write invalidates. */
// This is not an OMP 3.0 feature.
// This macro is used here just to keep the change out of 10.1.
// This change will go to the mainline first.
#if OMP_30_ENABLED
#if ( KMP_ARCH_X86 || KMP_ARCH_X86_64 )
#if KMP_MIC
// fence-style instructions do not exist, but lock; xaddl $0,(%rsp) can be used.
// We shouldn't need it, though, since the ABI rules require that
// * If the compiler generates NGO stores it also generates the fence
// * If users hand-code NGO stores they should insert the fence
// therefore no incomplete unordered stores should be visible.
#else
// C74404
// This addresses non-temporal store instructions (sfence needed).
// The clflush instruction is also addressed (mfence needed).
// The non-temporal load movntdqa instruction should probably be addressed as well.
// mfence is an SSE2 instruction. Do not execute it if the CPU is not SSE2.
if ( ! __kmp_cpuinfo.initialized ) {
__kmp_query_cpuid( & __kmp_cpuinfo );
}; // if
if ( ! __kmp_cpuinfo.sse2 ) {
// CPU cannot execute SSE2 instructions.
} else {
#if KMP_COMPILER_ICC || KMP_COMPILER_MSVC
_mm_mfence();
#else
__sync_synchronize();
#endif // KMP_COMPILER_ICC
}; // if
#endif // KMP_MIC
#elif KMP_ARCH_ARM
// Nothing yet
#elif KMP_ARCH_PPC64
// Nothing needed here (we have a real MB above).
#if KMP_OS_CNK
// The flushing thread needs to yield here; this prevents a
// busy-waiting thread from saturating the pipeline. flush is
// often used in loops like this:
// while (!flag) {
// #pragma omp flush(flag)
// }
// and adding the yield here is good for at least a 10x speedup
// when running >2 threads per core (on the NAS LU benchmark).
__kmp_yield(TRUE);
#endif
#if ( KMP_ARCH_X86 || KMP_ARCH_X86_64 )
#if KMP_MIC
// fence-style instructions do not exist, but lock; xaddl $0,(%rsp) can be used.
// We shouldn't need it, though, since the ABI rules require that
// * If the compiler generates NGO stores it also generates the fence
// * If users hand-code NGO stores they should insert the fence
// therefore no incomplete unordered stores should be visible.
#else
#error Unknown or unsupported architecture
// C74404
// This addresses non-temporal store instructions (sfence needed).
// The clflush instruction is also addressed (mfence needed).
// The non-temporal load movntdqa instruction should probably be addressed as well.
// mfence is an SSE2 instruction. Do not execute it if the CPU is not SSE2.
if ( ! __kmp_cpuinfo.initialized ) {
__kmp_query_cpuid( & __kmp_cpuinfo );
}; // if
if ( ! __kmp_cpuinfo.sse2 ) {
// CPU cannot execute SSE2 instructions.
} else {
#if KMP_COMPILER_ICC || KMP_COMPILER_MSVC
_mm_mfence();
#else
__sync_synchronize();
#endif // KMP_COMPILER_ICC
}; // if
#endif // KMP_MIC
#elif KMP_ARCH_ARM
// Nothing yet
#elif KMP_ARCH_PPC64
// Nothing needed here (we have a real MB above).
#if KMP_OS_CNK
// The flushing thread needs to yield here; this prevents a
// busy-waiting thread from saturating the pipeline. flush is
// often used in loops like this:
// while (!flag) {
// #pragma omp flush(flag)
// }
// and adding the yield here is good for at least a 10x speedup
// when running >2 threads per core (on the NAS LU benchmark).
__kmp_yield(TRUE);
#endif
#endif // OMP_30_ENABLED
#else
#error Unknown or unsupported architecture
#endif
}
@ -871,6 +625,8 @@ Execute a barrier.
void
__kmpc_barrier(ident_t *loc, kmp_int32 global_tid)
{
KMP_COUNT_BLOCK(OMP_BARRIER);
KMP_TIME_BLOCK(OMP_barrier);
int explicit_barrier_flag;
KC_TRACE( 10, ("__kmpc_barrier: called T#%d\n", global_tid ) );
@ -906,6 +662,7 @@ __kmpc_barrier(ident_t *loc, kmp_int32 global_tid)
kmp_int32
__kmpc_master(ident_t *loc, kmp_int32 global_tid)
{
KMP_COUNT_BLOCK(OMP_MASTER);
int status = 0;
KC_TRACE( 10, ("__kmpc_master: called T#%d\n", global_tid ) );
@ -1014,11 +771,6 @@ __kmpc_end_ordered( ident_t * loc, kmp_int32 gtid )
__kmp_parallel_dxo( & gtid, & cid, loc );
}
inline void
__kmp_static_yield( int arg ) { // AC: needed in macro __kmp_acquire_user_lock_with_checks
__kmp_yield( arg );
}
static kmp_user_lock_p
__kmp_get_critical_section_ptr( kmp_critical_name * crit, ident_t const * loc, kmp_int32 gtid )
{
@ -1082,6 +834,7 @@ This function blocks until the executing thread can enter the critical section.
*/
void
__kmpc_critical( ident_t * loc, kmp_int32 global_tid, kmp_critical_name * crit ) {
KMP_COUNT_BLOCK(OMP_CRITICAL);
kmp_user_lock_p lck;
@ -1194,6 +947,9 @@ __kmpc_barrier_master(ident_t *loc, kmp_int32 global_tid)
if ( __kmp_env_consistency_check )
__kmp_check_barrier( global_tid, ct_barrier, loc );
#if USE_ITT_NOTIFY
__kmp_threads[global_tid]->th.th_ident = loc;
#endif
status = __kmp_barrier( bs_plain_barrier, global_tid, TRUE, 0, NULL, NULL );
return (status != 0) ? 0 : 1;
@ -1243,6 +999,9 @@ __kmpc_barrier_master_nowait( ident_t * loc, kmp_int32 global_tid )
__kmp_check_barrier( global_tid, ct_barrier, loc );
}
#if USE_ITT_NOTIFY
__kmp_threads[global_tid]->th.th_ident = loc;
#endif
__kmp_barrier( bs_plain_barrier, global_tid, FALSE, 0, NULL, NULL );
ret = __kmpc_master (loc, global_tid);
@ -1280,6 +1039,7 @@ introduce an explicit barrier if it is required.
kmp_int32
__kmpc_single(ident_t *loc, kmp_int32 global_tid)
{
KMP_COUNT_BLOCK(OMP_SINGLE);
kmp_int32 rc = __kmp_enter_single( global_tid, loc, TRUE );
return rc;
}
@ -1353,8 +1113,6 @@ ompc_set_nested( int flag )
set__nested( thread, flag ? TRUE : FALSE );
}
#if OMP_30_ENABLED
void
ompc_set_max_active_levels( int max_active_levels )
{
@ -1384,8 +1142,6 @@ ompc_get_team_size( int level )
return __kmp_get_team_size( __kmp_entry_gtid(), level );
}
#endif // OMP_30_ENABLED
void
kmpc_set_stacksize( int arg )
{
@ -1427,8 +1183,6 @@ kmpc_set_defaults( char const * str )
__kmp_aux_set_defaults( str, strlen( str ) );
}
#ifdef OMP_30_ENABLED
int
kmpc_set_affinity_mask_proc( int proc, void **mask )
{
@ -1468,7 +1222,6 @@ kmpc_get_affinity_mask_proc( int proc, void **mask )
#endif
}
#endif /* OMP_30_ENABLED */
/* -------------------------------------------------------------------------- */
/*!
@ -1533,6 +1286,9 @@ __kmpc_copyprivate( ident_t *loc, kmp_int32 gtid, size_t cpy_size, void *cpy_dat
if (didit) *data_ptr = cpy_data;
/* This barrier is not a barrier region boundary */
#if USE_ITT_NOTIFY
__kmp_threads[gtid]->th.th_ident = loc;
#endif
__kmp_barrier( bs_plain_barrier, gtid, FALSE , 0, NULL, NULL );
if (! didit) (*cpy_func)( cpy_data, *data_ptr );
@ -1540,6 +1296,9 @@ __kmpc_copyprivate( ident_t *loc, kmp_int32 gtid, size_t cpy_size, void *cpy_dat
/* Consider next barrier the user-visible barrier for barrier region boundaries */
/* Nesting checks are already handled by the single construct checks */
#if USE_ITT_NOTIFY
__kmp_threads[gtid]->th.th_ident = loc; // TODO: check if it is needed (e.g. tasks can overwrite the location)
#endif
__kmp_barrier( bs_plain_barrier, gtid, FALSE , 0, NULL, NULL );
}
@ -1722,6 +1481,7 @@ __kmpc_destroy_nest_lock( ident_t * loc, kmp_int32 gtid, void ** user_lock ) {
void
__kmpc_set_lock( ident_t * loc, kmp_int32 gtid, void ** user_lock ) {
KMP_COUNT_BLOCK(OMP_set_lock);
kmp_user_lock_p lck;
if ( ( __kmp_user_lock_kind == lk_tas )
@ -1866,6 +1626,8 @@ __kmpc_unset_nest_lock( ident_t *loc, kmp_int32 gtid, void **user_lock )
int
__kmpc_test_lock( ident_t *loc, kmp_int32 gtid, void **user_lock )
{
KMP_COUNT_BLOCK(OMP_test_lock);
KMP_TIME_BLOCK(OMP_test_lock);
kmp_user_lock_p lck;
int rc;
@ -2028,9 +1790,14 @@ __kmpc_reduce_nowait(
kmp_int32 num_vars, size_t reduce_size, void *reduce_data, void (*reduce_func)(void *lhs_data, void *rhs_data),
kmp_critical_name *lck ) {
KMP_COUNT_BLOCK(REDUCE_nowait);
int retval;
PACKED_REDUCTION_METHOD_T packed_reduction_method;
#if OMP_40_ENABLED
kmp_team_t *team;
kmp_info_t *th;
int teams_swapped = 0, task_state;
#endif
KA_TRACE( 10, ( "__kmpc_reduce_nowait() enter: called T#%d\n", global_tid ) );
// why do we need this initialization here at all?
@ -2045,7 +1812,25 @@ __kmpc_reduce_nowait(
if ( __kmp_env_consistency_check )
__kmp_push_sync( global_tid, ct_reduce, loc, NULL );
// it's better to check an assertion ASSERT( thr_state == THR_WORK_STATE )
#if OMP_40_ENABLED
th = __kmp_thread_from_gtid(global_tid);
if( th->th.th_teams_microtask ) { // AC: check if we are inside the teams construct?
team = th->th.th_team;
if( team->t.t_level == th->th.th_teams_level ) {
// this is reduction at teams construct
KMP_DEBUG_ASSERT(!th->th.th_info.ds.ds_tid); // AC: check that tid == 0
// Let's swap teams temporarily for the reduction barrier
teams_swapped = 1;
th->th.th_info.ds.ds_tid = team->t.t_master_tid;
th->th.th_team = team->t.t_parent;
th->th.th_task_team = th->th.th_team->t.t_task_team;
th->th.th_team_nproc = th->th.th_team->t.t_nproc;
task_state = th->th.th_task_state;
if( th->th.th_task_team )
th->th.th_task_state = th->th.th_task_team->tt.tt_state;
}
}
#endif // OMP_40_ENABLED
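// (Annotation, not part of the diff: the tid/team/task_team/nproc/task_state
// fields saved and overwritten here are restored after the reduction
// completes -- see the matching "Restore thread structure" block added
// near the end of this function.)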
// the packed_reduction_method value will be reused by the __kmp_end_reduce* functions, so it should be kept in a variable
// that is a construct-specific or thread-specific property, not a team-specific one
@ -2091,6 +1876,9 @@ __kmpc_reduce_nowait(
// this barrier should be invisible to a customer and to the thread profiler
// (it's neither a terminating barrier nor customer's code, it's used for an internal purpose)
#if USE_ITT_NOTIFY
__kmp_threads[global_tid]->th.th_ident = loc;
#endif
retval = __kmp_barrier( UNPACK_REDUCTION_BARRIER( packed_reduction_method ), global_tid, FALSE, reduce_size, reduce_data, reduce_func );
retval = ( retval != 0 ) ? ( 0 ) : ( 1 );
@ -2108,7 +1896,16 @@ __kmpc_reduce_nowait(
KMP_ASSERT( 0 ); // "unexpected method"
}
#if OMP_40_ENABLED
if( teams_swapped ) {
// Restore thread structure
th->th.th_info.ds.ds_tid = 0;
th->th.th_team = team;
th->th.th_task_team = team->t.t_task_team;
th->th.th_team_nproc = team->t.t_nproc;
th->th.th_task_state = task_state;
}
#endif
KA_TRACE( 10, ( "__kmpc_reduce_nowait() exit: called T#%d: method %08x, returns %08x\n", global_tid, packed_reduction_method, retval ) );
return retval;
@ -2187,6 +1984,7 @@ __kmpc_reduce(
void (*reduce_func)(void *lhs_data, void *rhs_data),
kmp_critical_name *lck )
{
KMP_COUNT_BLOCK(REDUCE_wait);
int retval;
PACKED_REDUCTION_METHOD_T packed_reduction_method;
@ -2204,8 +2002,6 @@ __kmpc_reduce(
if ( __kmp_env_consistency_check )
__kmp_push_sync( global_tid, ct_reduce, loc, NULL );
// it's better to check an assertion ASSERT( thr_state == THR_WORK_STATE )
packed_reduction_method = __kmp_determine_reduction_method( loc, global_tid, num_vars, reduce_size, reduce_data, reduce_func, lck );
__KMP_SET_REDUCTION_METHOD( global_tid, packed_reduction_method );
@ -2228,6 +2024,9 @@ __kmpc_reduce(
//case tree_reduce_block:
// this barrier should be visible to a customer and to the thread profiler
// (it's a terminating barrier on constructs if NOWAIT not specified)
#if USE_ITT_NOTIFY
__kmp_threads[global_tid]->th.th_ident = loc; // needed for correct notification of frames
#endif
retval = __kmp_barrier( UNPACK_REDUCTION_BARRIER( packed_reduction_method ), global_tid, TRUE, reduce_size, reduce_data, reduce_func );
retval = ( retval != 0 ) ? ( 0 ) : ( 1 );
@ -2277,6 +2076,9 @@ __kmpc_end_reduce( ident_t *loc, kmp_int32 global_tid, kmp_critical_name *lck )
__kmp_end_critical_section_reduce_block( loc, global_tid, lck );
// TODO: implicit barrier: should be exposed
#if USE_ITT_NOTIFY
__kmp_threads[global_tid]->th.th_ident = loc;
#endif
__kmp_barrier( bs_plain_barrier, global_tid, FALSE, 0, NULL, NULL );
} else if( packed_reduction_method == empty_reduce_block ) {
@ -2284,11 +2086,17 @@ __kmpc_end_reduce( ident_t *loc, kmp_int32 global_tid, kmp_critical_name *lck )
// usage: if team size == 1, no synchronization is required ( Intel platforms only )
// TODO: implicit barrier: should be exposed
#if USE_ITT_NOTIFY
__kmp_threads[global_tid]->th.th_ident = loc;
#endif
__kmp_barrier( bs_plain_barrier, global_tid, FALSE, 0, NULL, NULL );
} else if( packed_reduction_method == atomic_reduce_block ) {
// TODO: implicit barrier: should be exposed
#if USE_ITT_NOTIFY
__kmp_threads[global_tid]->th.th_ident = loc;
#endif
__kmp_barrier( bs_plain_barrier, global_tid, FALSE, 0, NULL, NULL );
} else if( TEST_REDUCTION_METHOD( packed_reduction_method, tree_reduce_block ) ) {
@ -2319,23 +2127,15 @@ __kmpc_end_reduce( ident_t *loc, kmp_int32 global_tid, kmp_critical_name *lck )
kmp_uint64
__kmpc_get_taskid() {
#if OMP_30_ENABLED
kmp_int32 gtid;
kmp_info_t * thread;
gtid = __kmp_get_gtid();
if ( gtid < 0 ) {
return 0;
}; // if
thread = __kmp_thread_from_gtid( gtid );
return thread->th.th_current_task->td_task_id;
#else
kmp_int32 gtid;
kmp_info_t * thread;
gtid = __kmp_get_gtid();
if ( gtid < 0 ) {
return 0;
#endif
}; // if
thread = __kmp_thread_from_gtid( gtid );
return thread->th.th_current_task->td_task_id;
} // __kmpc_get_taskid
@ -2343,25 +2143,17 @@ __kmpc_get_taskid() {
kmp_uint64
__kmpc_get_parent_taskid() {
#if OMP_30_ENABLED
kmp_int32 gtid;
kmp_info_t * thread;
kmp_taskdata_t * parent_task;
gtid = __kmp_get_gtid();
if ( gtid < 0 ) {
return 0;
}; // if
thread = __kmp_thread_from_gtid( gtid );
parent_task = thread->th.th_current_task->td_parent;
return ( parent_task == NULL ? 0 : parent_task->td_task_id );
#else
kmp_int32 gtid;
kmp_info_t * thread;
kmp_taskdata_t * parent_task;
gtid = __kmp_get_gtid();
if ( gtid < 0 ) {
return 0;
#endif
}; // if
thread = __kmp_thread_from_gtid( gtid );
parent_task = thread->th.th_current_task->td_parent;
return ( parent_task == NULL ? 0 : parent_task->td_task_id );
} // __kmpc_get_parent_taskid


@ -1,7 +1,7 @@
/*
* kmp_debug.c -- debug utilities for the Guide library
* $Revision: 42150 $
* $Date: 2013-03-15 15:40:38 -0500 (Fri, 15 Mar 2013) $
* $Revision: 42951 $
* $Date: 2014-01-21 14:41:41 -0600 (Tue, 21 Jan 2014) $
*/


@ -1,7 +1,7 @@
/*
* kmp_debug.h -- debug / assertion code for Assure library
* $Revision: 42061 $
* $Date: 2013-02-28 16:36:24 -0600 (Thu, 28 Feb 2013) $
* $Revision: 42951 $
* $Date: 2014-01-21 14:41:41 -0600 (Tue, 21 Jan 2014) $
*/


@ -1,7 +1,7 @@
/*
* kmp_dispatch.cpp: dynamic scheduling - iteration initialization and dispatch.
* $Revision: 42674 $
* $Date: 2013-09-18 11:12:49 -0500 (Wed, 18 Sep 2013) $
* $Revision: 43457 $
* $Date: 2014-09-17 03:57:22 -0500 (Wed, 17 Sep 2014) $
*/
@ -32,6 +32,7 @@
#include "kmp_itt.h"
#include "kmp_str.h"
#include "kmp_error.h"
#include "kmp_stats.h"
#if KMP_OS_WINDOWS && KMP_ARCH_X86
#include <float.h>
#endif
@ -39,6 +40,34 @@
/* ------------------------------------------------------------------------ */
/* ------------------------------------------------------------------------ */
// template for type limits
template< typename T >
struct i_maxmin {
static const T mx;
static const T mn;
};
template<>
struct i_maxmin< int > {
static const int mx = 0x7fffffff;
static const int mn = 0x80000000;
};
template<>
struct i_maxmin< unsigned int > {
static const unsigned int mx = 0xffffffff;
static const unsigned int mn = 0x00000000;
};
template<>
struct i_maxmin< long long > {
static const long long mx = 0x7fffffffffffffffLL;
static const long long mn = 0x8000000000000000LL;
};
template<>
struct i_maxmin< unsigned long long > {
static const unsigned long long mx = 0xffffffffffffffffLL;
static const unsigned long long mn = 0x0000000000000000LL;
};
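The i_maxmin specializations above hard-code the extreme values of each iteration type; __kmp_dist_get_bounds (added later in this file) uses them to clamp an upper bound whose chunk arithmetic overflowed. A minimal sketch of that use, assuming the template above is in scope (clamp_upper is a hypothetical helper, not in the diff); in modern C++, std::numeric_limits<T>::max()/min() would supply the same constants:

--- illustrative sketch (not part of the diff) ---
template <typename T>
T clamp_upper(T computed_upper, T lower) {
    if (computed_upper < lower)       // the addition wrapped past the maximum
        return i_maxmin<T>::mx;       // clamp, as __kmp_dist_get_bounds does
    return computed_upper;
}
--- end sketch ---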
//-------------------------------------------------------------------------
#ifdef KMP_STATIC_STEAL_ENABLED
// replaces dispatch_private_info{32,64} structures and dispatch_private_info{32,64}_t types
@ -148,22 +177,6 @@ struct dispatch_shared_info_template {
/* ------------------------------------------------------------------------ */
/* ------------------------------------------------------------------------ */
static void
__kmp_static_delay( int arg )
{
/* Work around weird code-gen bug that causes assert to trip */
#if KMP_ARCH_X86_64 && KMP_OS_LINUX
#else
KMP_ASSERT( arg >= 0 );
#endif
}
static void
__kmp_static_yield( int arg )
{
__kmp_yield( arg );
}
#undef USE_TEST_LOCKS
// test_then_add template (general template should NOT be used)
@ -294,8 +307,6 @@ __kmp_wait_yield( volatile UT * spinner,
/* if ( TCR_4(__kmp_global.g.g_done) && __kmp_global.g.g_abort)
__kmp_abort_thread(); */
__kmp_static_delay(TRUE);
// if we are oversubscribed,
// or have waited a bit (and KMP_LIBRARY=throughput), then yield
// pause is in the following code
@ -589,6 +600,9 @@ __kmp_dispatch_init(
if ( ! TCR_4( __kmp_init_parallel ) )
__kmp_parallel_initialize();
#if INCLUDE_SSC_MARKS
SSC_MARK_DISPATCH_INIT();
#endif
#ifdef KMP_DEBUG
{
const char * buff;
@ -606,6 +620,9 @@ __kmp_dispatch_init(
active = ! team -> t.t_serialized;
th->th.th_ident = loc;
#if USE_ITT_BUILD
kmp_uint64 cur_chunk = chunk;
#endif
if ( ! active ) {
pr = reinterpret_cast< dispatch_private_info_template< T >* >
( th -> th.th_dispatch -> th_disp_buffer ); /* top of the stack */
@ -640,23 +657,16 @@ __kmp_dispatch_init(
schedule = __kmp_static;
} else {
if ( schedule == kmp_sch_runtime ) {
#if OMP_30_ENABLED
// Use the scheduling specified by OMP_SCHEDULE (or __kmp_sch_default if not specified)
schedule = team -> t.t_sched.r_sched_type;
// Detail the schedule if needed (global controls are differentiated appropriately)
if ( schedule == kmp_sch_guided_chunked ) {
schedule = __kmp_guided;
} else if ( schedule == kmp_sch_static ) {
schedule = __kmp_static;
}
// Use the chunk size specified by OMP_SCHEDULE (or default if not specified)
chunk = team -> t.t_sched.chunk;
#else
kmp_r_sched_t r_sched = __kmp_get_schedule_global();
// Use the scheduling specified by OMP_SCHEDULE and/or KMP_SCHEDULE or default
schedule = r_sched.r_sched_type;
chunk = r_sched.chunk;
#endif
// Use the scheduling specified by OMP_SCHEDULE (or __kmp_sch_default if not specified)
schedule = team -> t.t_sched.r_sched_type;
// Detail the schedule if needed (global controls are differentiated appropriately)
if ( schedule == kmp_sch_guided_chunked ) {
schedule = __kmp_guided;
} else if ( schedule == kmp_sch_static ) {
schedule = __kmp_static;
}
// Use the chunk size specified by OMP_SCHEDULE (or default if not specified)
chunk = team -> t.t_sched.chunk;
#ifdef KMP_DEBUG
{
@ -678,7 +688,6 @@ __kmp_dispatch_init(
}
}
#if OMP_30_ENABLED
if ( schedule == kmp_sch_auto ) {
// mapping and differentiation: in the __kmp_do_serial_initialize()
schedule = __kmp_auto;
@ -694,7 +703,6 @@ __kmp_dispatch_init(
}
#endif
}
#endif // OMP_30_ENABLED
/* guided analytical not safe for too many threads */
if ( team->t.t_nproc > 1<<20 && schedule == kmp_sch_guided_analytical_chunked ) {
@ -848,6 +856,12 @@ __kmp_dispatch_init(
break;
}
}
#if USE_ITT_BUILD
// Calculate chunk for metadata report
if( __itt_metadata_add_ptr && __kmp_forkjoin_frames_mode == 3 ) {
cur_chunk = limit - init + 1;
}
#endif
if ( st == 1 ) {
pr->u.p.lb = lb + init;
pr->u.p.ub = lb + limit;
@ -1101,6 +1115,39 @@ __kmp_dispatch_init(
}; // if
#endif /* USE_ITT_BUILD */
}; // if
#if USE_ITT_BUILD
// Report loop metadata
if( __itt_metadata_add_ptr && __kmp_forkjoin_frames_mode == 3 ) {
kmp_uint32 tid = __kmp_tid_from_gtid( gtid );
if (KMP_MASTER_TID(tid)) {
kmp_uint64 schedtype = 0;
switch ( schedule ) {
case kmp_sch_static_chunked:
case kmp_sch_static_balanced:// Chunk is calculated in the switch above
break;
case kmp_sch_static_greedy:
cur_chunk = pr->u.p.parm1;
break;
case kmp_sch_dynamic_chunked:
schedtype = 1;
break;
case kmp_sch_guided_iterative_chunked:
case kmp_sch_guided_analytical_chunked:
schedtype = 2;
break;
default:
// Should we put this case under "static"?
// case kmp_sch_static_steal:
schedtype = 3;
break;
}
__kmp_itt_metadata_loop(loc, schedtype, tc, cur_chunk);
}
}
#endif /* USE_ITT_BUILD */
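// (Annotation, not part of the diff: the schedtype values passed to
// __kmp_itt_metadata_loop above encode 0 = static, 1 = dynamic chunked,
// 2 = guided, 3 = anything else, e.g. static steal.)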
#ifdef KMP_DEBUG
{
const char * buff;
@ -1302,6 +1349,7 @@ __kmp_dispatch_next(
kmp_info_t * th = __kmp_threads[ gtid ];
kmp_team_t * team = th -> th.th_team;
KMP_DEBUG_ASSERT( p_last && p_lb && p_ub && p_st ); // AC: these cannot be NULL
#ifdef KMP_DEBUG
{
const char * buff;
@ -1323,9 +1371,10 @@ __kmp_dispatch_next(
if ( (status = (pr->u.p.tc != 0)) == 0 ) {
*p_lb = 0;
*p_ub = 0;
if ( p_st != 0 ) {
// if ( p_last != NULL )
// *p_last = 0;
if ( p_st != NULL )
*p_st = 0;
}
if ( __kmp_env_consistency_check ) {
if ( pr->pushed_ws != ct_none ) {
pr->pushed_ws = __kmp_pop_workshare( gtid, pr->pushed_ws, loc );
@ -1346,7 +1395,10 @@ __kmp_dispatch_next(
if ( (status = (init <= trip)) == 0 ) {
*p_lb = 0;
*p_ub = 0;
if ( p_st != 0 ) *p_st = 0;
// if ( p_last != NULL )
// *p_last = 0;
if ( p_st != NULL )
*p_st = 0;
if ( __kmp_env_consistency_check ) {
if ( pr->pushed_ws != ct_none ) {
pr->pushed_ws = __kmp_pop_workshare( gtid, pr->pushed_ws, loc );
@ -1363,12 +1415,10 @@ __kmp_dispatch_next(
pr->u.p.last_upper = pr->u.p.ub;
#endif /* KMP_OS_WINDOWS */
}
if ( p_last ) {
if ( p_last != NULL )
*p_last = last;
}
if ( p_st != 0 ) {
if ( p_st != NULL )
*p_st = incr;
}
if ( incr == 1 ) {
*p_lb = start + init;
*p_ub = start + limit;
@ -1395,19 +1445,15 @@ __kmp_dispatch_next(
} // if
} else {
pr->u.p.tc = 0;
*p_lb = pr->u.p.lb;
*p_ub = pr->u.p.ub;
#if KMP_OS_WINDOWS
pr->u.p.last_upper = *p_ub;
#endif /* KMP_OS_WINDOWS */
if ( p_st != 0 ) {
*p_st = pr->u.p.st;
}
if ( p_last ) {
if ( p_last != NULL )
*p_last = TRUE;
}
if ( p_st != NULL )
*p_st = pr->u.p.st;
} // if
#ifdef KMP_DEBUG
{
@ -1415,12 +1461,15 @@ __kmp_dispatch_next(
// create format specifiers before the debug output
buff = __kmp_str_format(
"__kmp_dispatch_next: T#%%d serialized case: p_lb:%%%s " \
"p_ub:%%%s p_st:%%%s p_last:%%p returning:%%d\n",
"p_ub:%%%s p_st:%%%s p_last:%%p %%d returning:%%d\n",
traits_t< T >::spec, traits_t< T >::spec, traits_t< ST >::spec );
KD_TRACE(10, ( buff, gtid, *p_lb, *p_ub, *p_st, p_last, status) );
KD_TRACE(10, ( buff, gtid, *p_lb, *p_ub, *p_st, p_last, *p_last, status) );
__kmp_str_free( &buff );
}
#endif
#if INCLUDE_SSC_MARKS
SSC_MARK_DISPATCH_NEXT();
#endif
return status;
} else {
kmp_int32 last = 0;
@ -1572,7 +1621,7 @@ __kmp_dispatch_next(
if ( !status ) {
*p_lb = 0;
*p_ub = 0;
if ( p_st != 0 ) *p_st = 0;
if ( p_st != NULL ) *p_st = 0;
} else {
start = pr->u.p.parm2;
init *= chunk;
@ -1582,10 +1631,7 @@ __kmp_dispatch_next(
KMP_DEBUG_ASSERT(init <= trip);
if ( (last = (limit >= trip)) != 0 )
limit = trip;
if ( p_last ) {
*p_last = last;
}
if ( p_st != 0 ) *p_st = incr;
if ( p_st != NULL ) *p_st = incr;
if ( incr == 1 ) {
*p_lb = start + init;
@ -1622,10 +1668,7 @@ __kmp_dispatch_next(
*p_lb = pr->u.p.lb;
*p_ub = pr->u.p.ub;
last = pr->u.p.parm1;
if ( p_last ) {
*p_last = last;
}
if ( p_st )
if ( p_st != NULL )
*p_st = pr->u.p.st;
} else { /* no iterations to do */
pr->u.p.lb = pr->u.p.ub + pr->u.p.st;
@ -1665,10 +1708,7 @@ __kmp_dispatch_next(
if ( (last = (limit >= trip)) != 0 )
limit = trip;
if ( p_last ) {
*p_last = last;
}
if ( p_st != 0 ) *p_st = incr;
if ( p_st != NULL ) *p_st = incr;
pr->u.p.count += team->t.t_nproc;
@ -1713,7 +1753,7 @@ __kmp_dispatch_next(
if ( (status = (init <= trip)) == 0 ) {
*p_lb = 0;
*p_ub = 0;
if ( p_st != 0 ) *p_st = 0;
if ( p_st != NULL ) *p_st = 0;
} else {
start = pr->u.p.lb;
limit = chunk + init - 1;
@ -1721,10 +1761,8 @@ __kmp_dispatch_next(
if ( (last = (limit >= trip)) != 0 )
limit = trip;
if ( p_last ) {
*p_last = last;
}
if ( p_st != 0 ) *p_st = incr;
if ( p_st != NULL ) *p_st = incr;
if ( incr == 1 ) {
*p_lb = start + init;
@ -1801,8 +1839,6 @@ __kmp_dispatch_next(
incr = pr->u.p.st;
if ( p_st != NULL )
*p_st = incr;
if ( p_last != NULL )
*p_last = last;
*p_lb = start + init * incr;
*p_ub = start + limit * incr;
if ( pr->ordered ) {
@ -1906,8 +1942,6 @@ __kmp_dispatch_next(
incr = pr->u.p.st;
if ( p_st != NULL )
*p_st = incr;
if ( p_last != NULL )
*p_last = last;
*p_lb = start + init * incr;
*p_ub = start + limit * incr;
if ( pr->ordered ) {
@ -1951,7 +1985,7 @@ __kmp_dispatch_next(
if ( (status = ((T)index < parm3 && init <= trip)) == 0 ) {
*p_lb = 0;
*p_ub = 0;
if ( p_st != 0 ) *p_st = 0;
if ( p_st != NULL ) *p_st = 0;
} else {
start = pr->u.p.lb;
limit = ( (index+1) * ( 2*parm2 - index*parm4 ) ) / 2 - 1;
@ -1960,10 +1994,7 @@ __kmp_dispatch_next(
if ( (last = (limit >= trip)) != 0 )
limit = trip;
if ( p_last != 0 ) {
*p_last = last;
}
if ( p_st != 0 ) *p_st = incr;
if ( p_st != NULL ) *p_st = incr;
if ( incr == 1 ) {
*p_lb = start + init;
@ -1991,6 +2022,17 @@ __kmp_dispatch_next(
} // if
} // case
break;
default:
{
status = 0; // to avoid complaints on uninitialized variable use
__kmp_msg(
kmp_ms_fatal, // Severity
KMP_MSG( UnknownSchedTypeDetected ), // Primary message
KMP_HNT( GetNewerLibrary ), // Hint
__kmp_msg_null // Variadic argument list terminator
);
}
break;
} // switch
} // if tc == 0;
@ -2010,7 +2052,7 @@ __kmp_dispatch_next(
}
#endif
if ( num_done == team->t.t_nproc-1 ) {
if ( (ST)num_done == team->t.t_nproc-1 ) {
/* NOTE: release this buffer to be reused */
KMP_MB(); /* Flush all pending memory write invalidates. */
@ -2048,6 +2090,8 @@ __kmp_dispatch_next(
pr->u.p.last_upper = pr->u.p.ub;
}
#endif /* KMP_OS_WINDOWS */
if ( p_last != NULL && status != 0 )
*p_last = last;
} // if
#ifdef KMP_DEBUG
@ -2062,9 +2106,129 @@ __kmp_dispatch_next(
__kmp_str_free( &buff );
}
#endif
#if INCLUDE_SSC_MARKS
SSC_MARK_DISPATCH_NEXT();
#endif
return status;
}
template< typename T >
static void
__kmp_dist_get_bounds(
ident_t *loc,
kmp_int32 gtid,
kmp_int32 *plastiter,
T *plower,
T *pupper,
typename traits_t< T >::signed_t incr
) {
KMP_COUNT_BLOCK(OMP_DISTR_FOR_dynamic);
typedef typename traits_t< T >::unsigned_t UT;
typedef typename traits_t< T >::signed_t ST;
register kmp_uint32 team_id;
register kmp_uint32 nteams;
register UT trip_count;
register kmp_team_t *team;
kmp_info_t * th;
KMP_DEBUG_ASSERT( plastiter && plower && pupper );
KE_TRACE( 10, ("__kmpc_dist_get_bounds called (%d)\n", gtid));
#ifdef KMP_DEBUG
{
const char * buff;
// create format specifiers before the debug output
buff = __kmp_str_format( "__kmpc_dist_get_bounds: T#%%d liter=%%d "\
"iter=(%%%s, %%%s, %%%s) signed?<%s>\n",
traits_t< T >::spec, traits_t< T >::spec, traits_t< ST >::spec,
traits_t< T >::spec );
KD_TRACE(100, ( buff, gtid, *plastiter, *plower, *pupper, incr ) );
__kmp_str_free( &buff );
}
#endif
if( __kmp_env_consistency_check ) {
if( incr == 0 ) {
__kmp_error_construct( kmp_i18n_msg_CnsLoopIncrZeroProhibited, ct_pdo, loc );
}
if( incr > 0 ? (*pupper < *plower) : (*plower < *pupper) ) {
// The loop is illegal.
// Some zero-trip loops maintained by compiler, e.g.:
// for(i=10;i<0;++i) // lower >= upper - run-time check
// for(i=0;i>10;--i) // lower <= upper - run-time check
// for(i=0;i>10;++i) // incr > 0 - compile-time check
// for(i=10;i<0;--i) // incr < 0 - compile-time check
// Compiler does not check the following illegal loops:
// for(i=0;i<10;i+=incr) // where incr<0
// for(i=10;i>0;i-=incr) // where incr<0
__kmp_error_construct( kmp_i18n_msg_CnsLoopIncrIllegal, ct_pdo, loc );
}
}
th = __kmp_threads[gtid];
KMP_DEBUG_ASSERT(th->th.th_teams_microtask); // we are in the teams construct
team = th->th.th_team;
#if OMP_40_ENABLED
nteams = th->th.th_teams_size.nteams;
#endif
team_id = team->t.t_master_tid;
KMP_DEBUG_ASSERT(nteams == team->t.t_parent->t.t_nproc);
// compute global trip count
if( incr == 1 ) {
trip_count = *pupper - *plower + 1;
} else if(incr == -1) {
trip_count = *plower - *pupper + 1;
} else {
trip_count = (ST)(*pupper - *plower) / incr + 1; // cast to signed to cover incr<0 case
}
if( trip_count <= nteams ) {
KMP_DEBUG_ASSERT(
__kmp_static == kmp_sch_static_greedy || \
__kmp_static == kmp_sch_static_balanced
); // Unknown static scheduling type.
// only some teams get single iteration, others get nothing
if( team_id < trip_count ) {
*pupper = *plower = *plower + team_id * incr;
} else {
*plower = *pupper + incr; // zero-trip loop
}
if( plastiter != NULL )
*plastiter = ( team_id == trip_count - 1 );
} else {
if( __kmp_static == kmp_sch_static_balanced ) {
register UT chunk = trip_count / nteams;
register UT extras = trip_count % nteams;
*plower += incr * ( team_id * chunk + ( team_id < extras ? team_id : extras ) );
*pupper = *plower + chunk * incr - ( team_id < extras ? 0 : incr );
if( plastiter != NULL )
*plastiter = ( team_id == nteams - 1 );
} else {
register T chunk_inc_count =
( trip_count / nteams + ( ( trip_count % nteams ) ? 1 : 0) ) * incr;
register T upper = *pupper;
KMP_DEBUG_ASSERT( __kmp_static == kmp_sch_static_greedy );
// Unknown static scheduling type.
*plower += team_id * chunk_inc_count;
*pupper = *plower + chunk_inc_count - incr;
// Check/correct bounds if needed
if( incr > 0 ) {
if( *pupper < *plower )
*pupper = i_maxmin< T >::mx;
if( plastiter != NULL )
*plastiter = *plower <= upper && *pupper > upper - incr;
if( *pupper > upper )
*pupper = upper; // tracker C73258
} else {
if( *pupper > *plower )
*pupper = i_maxmin< T >::mn;
if( plastiter != NULL )
*plastiter = *plower >= upper && *pupper < upper - incr;
if( *pupper < upper )
*pupper = upper; // tracker C73258
}
}
}
}
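To make the kmp_sch_static_balanced arithmetic above concrete, here is a small self-contained sketch (not the runtime's code) that reproduces the split for 10 iterations over 4 teams: chunk = 2, extras = 2, so the first two teams take 3 iterations each and the last two take 2:

--- illustrative sketch (not part of the diff) ---
#include <cstdio>

int main() {
    const unsigned trip_count = 10, nteams = 4;
    const int lower0 = 0, incr = 1;
    const unsigned chunk  = trip_count / nteams;   // 2
    const unsigned extras = trip_count % nteams;   // 2
    for (unsigned team_id = 0; team_id < nteams; ++team_id) {
        // Same formulas as the balanced branch above.
        int lo = lower0 + incr * (int)(team_id * chunk
                 + (team_id < extras ? team_id : extras));
        int hi = lo + (int)chunk * incr - (team_id < extras ? 0 : incr);
        printf("team %u: [%d, %d]\n", team_id, lo, hi);
    }
    return 0;   // prints [0,2] [3,5] [6,7] [8,9]
}
--- end sketch ---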
//-----------------------------------------------------------------------------------------
// Dispatch routines
// Transfer call to template< type T >
@ -2091,6 +2255,7 @@ void
__kmpc_dispatch_init_4( ident_t *loc, kmp_int32 gtid, enum sched_type schedule,
kmp_int32 lb, kmp_int32 ub, kmp_int32 st, kmp_int32 chunk )
{
KMP_COUNT_BLOCK(OMP_FOR_dynamic);
KMP_DEBUG_ASSERT( __kmp_init_serial );
__kmp_dispatch_init< kmp_int32 >( loc, gtid, schedule, lb, ub, st, chunk, true );
}
@ -2101,6 +2266,7 @@ void
__kmpc_dispatch_init_4u( ident_t *loc, kmp_int32 gtid, enum sched_type schedule,
kmp_uint32 lb, kmp_uint32 ub, kmp_int32 st, kmp_int32 chunk )
{
KMP_COUNT_BLOCK(OMP_FOR_dynamic);
KMP_DEBUG_ASSERT( __kmp_init_serial );
__kmp_dispatch_init< kmp_uint32 >( loc, gtid, schedule, lb, ub, st, chunk, true );
}
@ -2113,6 +2279,7 @@ __kmpc_dispatch_init_8( ident_t *loc, kmp_int32 gtid, enum sched_type schedule,
kmp_int64 lb, kmp_int64 ub,
kmp_int64 st, kmp_int64 chunk )
{
KMP_COUNT_BLOCK(OMP_FOR_dynamic);
KMP_DEBUG_ASSERT( __kmp_init_serial );
__kmp_dispatch_init< kmp_int64 >( loc, gtid, schedule, lb, ub, st, chunk, true );
}
@ -2125,10 +2292,60 @@ __kmpc_dispatch_init_8u( ident_t *loc, kmp_int32 gtid, enum sched_type schedule,
kmp_uint64 lb, kmp_uint64 ub,
kmp_int64 st, kmp_int64 chunk )
{
KMP_COUNT_BLOCK(OMP_FOR_dynamic);
KMP_DEBUG_ASSERT( __kmp_init_serial );
__kmp_dispatch_init< kmp_uint64 >( loc, gtid, schedule, lb, ub, st, chunk, true );
}
/*!
See @ref __kmpc_dispatch_init_4
These differ from the __kmpc_dispatch_init set of functions in that they are
called for the composite distribute parallel for construct, so the per-team
iteration space must be computed before regular iteration dispatching begins.
These functions are all identical apart from the types of the arguments.
*/
void
__kmpc_dist_dispatch_init_4( ident_t *loc, kmp_int32 gtid, enum sched_type schedule,
kmp_int32 *p_last, kmp_int32 lb, kmp_int32 ub, kmp_int32 st, kmp_int32 chunk )
{
KMP_COUNT_BLOCK(OMP_FOR_dynamic);
KMP_DEBUG_ASSERT( __kmp_init_serial );
__kmp_dist_get_bounds< kmp_int32 >( loc, gtid, p_last, &lb, &ub, st );
__kmp_dispatch_init< kmp_int32 >( loc, gtid, schedule, lb, ub, st, chunk, true );
}
void
__kmpc_dist_dispatch_init_4u( ident_t *loc, kmp_int32 gtid, enum sched_type schedule,
kmp_int32 *p_last, kmp_uint32 lb, kmp_uint32 ub, kmp_int32 st, kmp_int32 chunk )
{
KMP_COUNT_BLOCK(OMP_FOR_dynamic);
KMP_DEBUG_ASSERT( __kmp_init_serial );
__kmp_dist_get_bounds< kmp_uint32 >( loc, gtid, p_last, &lb, &ub, st );
__kmp_dispatch_init< kmp_uint32 >( loc, gtid, schedule, lb, ub, st, chunk, true );
}
void
__kmpc_dist_dispatch_init_8( ident_t *loc, kmp_int32 gtid, enum sched_type schedule,
kmp_int32 *p_last, kmp_int64 lb, kmp_int64 ub, kmp_int64 st, kmp_int64 chunk )
{
KMP_COUNT_BLOCK(OMP_FOR_dynamic);
KMP_DEBUG_ASSERT( __kmp_init_serial );
__kmp_dist_get_bounds< kmp_int64 >( loc, gtid, p_last, &lb, &ub, st );
__kmp_dispatch_init< kmp_int64 >( loc, gtid, schedule, lb, ub, st, chunk, true );
}
void
__kmpc_dist_dispatch_init_8u( ident_t *loc, kmp_int32 gtid, enum sched_type schedule,
kmp_int32 *p_last, kmp_uint64 lb, kmp_uint64 ub, kmp_int64 st, kmp_int64 chunk )
{
KMP_COUNT_BLOCK(OMP_FOR_dynamic);
KMP_DEBUG_ASSERT( __kmp_init_serial );
__kmp_dist_get_bounds< kmp_uint64 >( loc, gtid, p_last, &lb, &ub, st );
__kmp_dispatch_init< kmp_uint64 >( loc, gtid, schedule, lb, ub, st, chunk, true );
}
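For context, a sketch of how a compiler might lower #pragma omp distribute parallel for schedule(dynamic, 4) onto these entry points. N and body() are hypothetical stand-ins, and the real calling convention belongs to the compiler; __kmpc_dispatch_next_4 is the companion routine in this file:

--- illustrative sketch (not part of the diff) ---
kmp_int32 last = 0, lb = 0, ub = N - 1, st = 1;   // N: hypothetical trip count
__kmpc_dist_dispatch_init_4(loc, gtid, kmp_sch_dynamic_chunked,
                            &last, lb, ub, st, /*chunk=*/4);
kmp_int32 p_lb, p_ub, p_st;
while (__kmpc_dispatch_next_4(loc, gtid, &last, &p_lb, &p_ub, &p_st)) {
    for (kmp_int32 i = p_lb; i <= p_ub; i += p_st)
        body(i);   // body(): hypothetical outlined loop body
}
--- end sketch ---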
/*!
@param loc Source code location
@param gtid Global thread id
@ -2284,8 +2501,6 @@ __kmp_wait_yield_4(volatile kmp_uint32 * spinner,
/* if ( TCR_4(__kmp_global.g.g_done) && __kmp_global.g.g_abort)
__kmp_abort_thread(); */
__kmp_static_delay(TRUE);
/* if we have waited a bit, or are oversubscribed, yield */
/* pause is in the following code */
KMP_YIELD( TCR_4(__kmp_nth) > __kmp_avail_proc );
@ -2320,8 +2535,6 @@ __kmp_wait_yield_8( volatile kmp_uint64 * spinner,
/* if ( TCR_4(__kmp_global.g.g_done) && __kmp_global.g.g_abort)
__kmp_abort_thread(); */
__kmp_static_delay(TRUE);
// if we are oversubscribed,
// or have waited a bit (and KMP_LIBRARY=throughput), then yield
// pause is in the following code


@ -1,7 +1,7 @@
/*
* kmp_environment.c -- Handle environment variables OS-independently.
* $Revision: 42263 $
* $Date: 2013-04-04 11:03:19 -0500 (Thu, 04 Apr 2013) $
* $Revision: 43084 $
* $Date: 2014-04-15 09:15:14 -0500 (Tue, 15 Apr 2014) $
*/


@ -1,7 +1,7 @@
/*
* kmp_environment.h -- Handle environment variables OS-independently.
* $Revision: 42061 $
* $Date: 2013-02-28 16:36:24 -0600 (Thu, 28 Feb 2013) $
* $Revision: 42951 $
* $Date: 2014-01-21 14:41:41 -0600 (Tue, 21 Jan 2014) $
*/


@ -1,7 +1,7 @@
/*
* kmp_error.c -- KPTS functions for error checking at runtime
* $Revision: 42061 $
* $Date: 2013-02-28 16:36:24 -0600 (Thu, 28 Feb 2013) $
* $Revision: 42951 $
* $Date: 2014-01-21 14:41:41 -0600 (Tue, 21 Jan 2014) $
*/


@ -1,7 +1,7 @@
/*
* kmp_error.h -- PTS functions for error checking at runtime.
* $Revision: 42061 $
* $Date: 2013-02-28 16:36:24 -0600 (Thu, 28 Feb 2013) $
* $Revision: 42951 $
* $Date: 2014-01-21 14:41:41 -0600 (Tue, 21 Jan 2014) $
*/


@ -1,7 +1,7 @@
/*
* kmp_ftn_cdecl.c -- Fortran __cdecl linkage support for OpenMP.
* $Revision: 42757 $
* $Date: 2013-10-18 08:20:57 -0500 (Fri, 18 Oct 2013) $
* $Revision: 42951 $
* $Date: 2014-01-21 14:41:41 -0600 (Tue, 21 Jan 2014) $
*/


@ -1,7 +1,7 @@
/*
* kmp_ftn_entry.h -- Fortran entry linkage support for OpenMP.
* $Revision: 42798 $
* $Date: 2013-10-30 16:39:54 -0500 (Wed, 30 Oct 2013) $
* $Revision: 43435 $
* $Date: 2014-09-04 15:16:08 -0500 (Thu, 04 Sep 2014) $
*/
@ -217,8 +217,6 @@ FTN_GET_LIBRARY (void)
#endif
}
#if OMP_30_ENABLED
int FTN_STDCALL
FTN_SET_AFFINITY( void **mask )
{
@ -348,8 +346,6 @@ FTN_GET_AFFINITY_MASK_PROC( int KMP_DEREF proc, void **mask )
#endif
}
#endif /* OMP_30_ENABLED */
/* ------------------------------------------------------------------------ */
@ -391,12 +387,8 @@ xexpand(FTN_GET_MAX_THREADS)( void )
}
gtid = __kmp_entry_gtid();
thread = __kmp_threads[ gtid ];
#if OMP_30_ENABLED
//return thread -> th.th_team -> t.t_current_task[ thread->th.th_info.ds.ds_tid ] -> icvs.nproc;
return thread -> th.th_current_task -> td_icvs.nproc;
#else
return thread -> th.th_team -> t.t_set_nproc[ thread->th.th_info.ds.ds_tid ];
#endif
#endif
}
@ -533,7 +525,7 @@ xexpand(FTN_IN_PARALLEL)( void )
#else
kmp_info_t *th = __kmp_entry_thread();
#if OMP_40_ENABLED
if ( th->th.th_team_microtask ) {
if ( th->th.th_teams_microtask ) {
// AC: r_in_parallel does not work inside teams construct
// where real parallel is inactive, but all threads have same root,
// so setting it in one team affects other teams.
@ -546,8 +538,6 @@ xexpand(FTN_IN_PARALLEL)( void )
#endif
}
#if OMP_30_ENABLED
void FTN_STDCALL
xexpand(FTN_SET_SCHEDULE)( kmp_sched_t KMP_DEREF kind, int KMP_DEREF modifier )
{
@ -667,8 +657,6 @@ xexpand(FTN_IN_FINAL)( void )
#endif
}
#endif // OMP_30_ENABLED
#if OMP_40_ENABLED
@ -689,7 +677,7 @@ xexpand(FTN_GET_NUM_TEAMS)( void )
return 1;
#else
kmp_info_t *thr = __kmp_entry_thread();
if ( thr->th.th_team_microtask ) {
if ( thr->th.th_teams_microtask ) {
kmp_team_t *team = thr->th.th_team;
int tlevel = thr->th.th_teams_level;
int ii = team->t.t_level; // the level of the teams construct
@ -728,7 +716,7 @@ xexpand(FTN_GET_TEAM_NUM)( void )
return 0;
#else
kmp_info_t *thr = __kmp_entry_thread();
if ( thr->th.th_team_microtask ) {
if ( thr->th.th_teams_microtask ) {
kmp_team_t *team = thr->th.th_team;
int tlevel = thr->th.th_teams_level; // the level of the teams construct
int ii = team->t.t_level;
@ -1048,19 +1036,19 @@ FTN_GET_CANCELLATION_STATUS(int cancel_kind) {
#endif // OMP_40_ENABLED
// GCC compatibility (versioned symbols)
#if KMP_OS_LINUX
#ifdef KMP_USE_VERSION_SYMBOLS
/*
The following sections create function aliases (dummy symbols) for the omp_* routines.
These aliases will then be versioned according to how libgomp ``versions'' its
symbols (OMP_1.0, OMP_2.0, OMP_3.0, ...) while also retaining the
default version which libiomp5 uses: VERSION (defined in exports_so.txt)
If you want to see the versioned symbols for libgomp.so.1 then just type:
objdump -T /path/to/libgomp.so.1 | grep omp_
Example:
Step 1) Create __kmp_api_omp_set_num_threads_10_alias
        which is an alias of __kmp_api_omp_set_num_threads
Step 2) Set __kmp_api_omp_set_num_threads_10_alias to version: omp_set_num_threads@OMP_1.0
Step 2B) Set __kmp_api_omp_set_num_threads to default version : omp_set_num_threads@@VERSION
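For instance, the pair of steps for omp_set_num_threads corresponds to the
following sketch in gcc alias/symver syntax (illustrative only; the real work
is done by the xaliasify/xversionify macros in kmp_ftn_os.h):
    void __kmp_api_omp_set_num_threads_10_alias(int)
        __attribute__((alias("__kmp_api_omp_set_num_threads")));
    __asm__(".symver __kmp_api_omp_set_num_threads_10_alias,omp_set_num_threads@OMP_1.0");
    __asm__(".symver __kmp_api_omp_set_num_threads,omp_set_num_threads@@VERSION");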
@ -1092,7 +1080,6 @@ xaliasify(FTN_TEST_NEST_LOCK, 10);
xaliasify(FTN_GET_WTICK, 20);
xaliasify(FTN_GET_WTIME, 20);
#if OMP_30_ENABLED
// OMP_3.0 aliases
xaliasify(FTN_SET_SCHEDULE, 30);
xaliasify(FTN_GET_SCHEDULE, 30);
@ -1116,7 +1103,6 @@ xaliasify(FTN_TEST_NEST_LOCK, 30);
// OMP_3.1 aliases
xaliasify(FTN_IN_FINAL, 31);
#endif /* OMP_30_ENABLED */
#if OMP_40_ENABLED
// OMP_4.0 aliases
@ -1160,7 +1146,6 @@ xversionify(FTN_TEST_NEST_LOCK, 10, "OMP_1.0");
xversionify(FTN_GET_WTICK, 20, "OMP_2.0");
xversionify(FTN_GET_WTIME, 20, "OMP_2.0");
#if OMP_30_ENABLED
// OMP_3.0 versioned symbols
xversionify(FTN_SET_SCHEDULE, 30, "OMP_3.0");
xversionify(FTN_GET_SCHEDULE, 30, "OMP_3.0");
@ -1186,7 +1171,6 @@ xversionify(FTN_TEST_NEST_LOCK, 30, "OMP_3.0");
// OMP_3.1 versioned symbol
xversionify(FTN_IN_FINAL, 31, "OMP_3.1");
#endif /* OMP_30_ENABLED */
#if OMP_40_ENABLED
// OMP_4.0 versioned symbols
@ -1204,7 +1188,7 @@ xversionify(FTN_GET_CANCELLATION, 40, "OMP_4.0");
// OMP_5.0 versioned symbols
#endif
#endif /* KMP_OS_LINUX */
#endif // KMP_USE_VERSION_SYMBOLS
#ifdef __cplusplus
} //extern "C"

--- kmp_ftn_extra.c ---

@ -1,7 +1,7 @@
/*
* kmp_ftn_extra.c -- Fortran 'extra' linkage support for OpenMP.
* $Revision: 42757 $
* $Date: 2013-10-18 08:20:57 -0500 (Fri, 18 Oct 2013) $
* $Revision: 42951 $
* $Date: 2014-01-21 14:41:41 -0600 (Tue, 21 Jan 2014) $
*/

--- kmp_ftn_os.h ---

@ -1,7 +1,7 @@
/*
* kmp_ftn_os.h -- KPTS Fortran defines header file.
* $Revision: 42745 $
* $Date: 2013-10-14 17:02:04 -0500 (Mon, 14 Oct 2013) $
* $Revision: 43354 $
* $Date: 2014-07-22 17:15:02 -0500 (Tue, 22 Jul 2014) $
*/
@ -472,14 +472,14 @@
#define KMP_API_NAME_GOMP_TASKGROUP_START GOMP_taskgroup_start
#define KMP_API_NAME_GOMP_TASKGROUP_END GOMP_taskgroup_end
/* Target functions should be taken care of by liboffload */
//#define KMP_API_NAME_GOMP_TARGET GOMP_target
//#define KMP_API_NAME_GOMP_TARGET_DATA GOMP_target_data
//#define KMP_API_NAME_GOMP_TARGET_END_DATA GOMP_target_end_data
//#define KMP_API_NAME_GOMP_TARGET_UPDATE GOMP_target_update
#define KMP_API_NAME_GOMP_TARGET GOMP_target
#define KMP_API_NAME_GOMP_TARGET_DATA GOMP_target_data
#define KMP_API_NAME_GOMP_TARGET_END_DATA GOMP_target_end_data
#define KMP_API_NAME_GOMP_TARGET_UPDATE GOMP_target_update
#define KMP_API_NAME_GOMP_TEAMS GOMP_teams
#if KMP_OS_LINUX && !KMP_OS_CNK && !KMP_ARCH_PPC64
#define xstr(x) str(x)
#ifdef KMP_USE_VERSION_SYMBOLS
#define xstr(x) str(x)
#define str(x) #x
// If Linux, xexpand prepends __kmp_api_ to the real API name
@ -494,7 +494,7 @@
__asm__(".symver " xstr(__kmp_api_##api_name##_##version_num##_alias) "," xstr(api_name) "@" version_str "\n\t"); \
__asm__(".symver " xstr(__kmp_api_##api_name) "," xstr(api_name) "@@" default_ver "\n\t")
#else /* KMP_OS_LINUX */
#else // KMP_USE_VERSION_SYMBOLS
#define xstr(x) /* Nothing */
#define str(x) /* Nothing */
@ -508,7 +508,7 @@
#define xversionify(api_name, version_num, version_str) /* Nothing */
#define versionify(api_name, version_num, version_str, default_ver) /* Nothing */
#endif /* KMP_OS_LINUX */
#endif // KMP_USE_VERSION_SYMBOLS
#endif /* KMP_FTN_OS_H */

--- kmp_ftn_stdcall.c ---

@ -1,7 +1,7 @@
/*
* kmp_ftn_stdcall.c -- Fortran __stdcall linkage support for OpenMP.
* $Revision: 42061 $
* $Date: 2013-02-28 16:36:24 -0600 (Thu, 28 Feb 2013) $
* $Revision: 42951 $
* $Date: 2014-01-21 14:41:41 -0600 (Tue, 21 Jan 2014) $
*/

--- kmp_global.c ---

@ -1,7 +1,7 @@
/*
* kmp_global.c -- KPTS global variables for runtime support library
* $Revision: 42816 $
* $Date: 2013-11-11 15:33:37 -0600 (Mon, 11 Nov 2013) $
* $Revision: 43473 $
* $Date: 2014-09-26 15:02:57 -0500 (Fri, 26 Sep 2014) $
*/
@ -25,6 +25,20 @@ kmp_key_t __kmp_gtid_threadprivate_key;
kmp_cpuinfo_t __kmp_cpuinfo = { 0 }; // Not initialized
#if KMP_STATS_ENABLED
#include "kmp_stats.h"
// lock for modifying the global __kmp_stats_list
kmp_tas_lock_t __kmp_stats_lock = KMP_TAS_LOCK_INITIALIZER(__kmp_stats_lock);
// global list of per thread stats, the head is a sentinel node which accumulates all stats produced before __kmp_create_worker is called.
kmp_stats_list __kmp_stats_list;
// thread local pointer to stats node within list
__thread kmp_stats_list* __kmp_stats_thread_ptr = &__kmp_stats_list;
// gives reference tick for all events (considered the 0 tick)
tsc_tick_count __kmp_stats_start_time;
#endif
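// Note: the head of __kmp_stats_list is a sentinel, so events recorded before
// __kmp_create_worker repoints __kmp_stats_thread_ptr still accumulate somewhere.
// Elsewhere in this check-in the counters are driven by macros, e.g.
//     KMP_COUNT_BLOCK(OMP_FOR_static);   // as used in __kmp_for_static_init below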
/* ----------------------------------------------------- */
/* INITIALIZATION VARIABLES */
@ -53,6 +67,7 @@ unsigned int __kmp_next_wait = KMP_DEFAULT_NEXT_WAIT; /* subsequent number of s
size_t __kmp_stksize = KMP_DEFAULT_STKSIZE;
size_t __kmp_monitor_stksize = 0; // auto adjust
size_t __kmp_stkoffset = KMP_DEFAULT_STKOFFSET;
int __kmp_stkpadding = KMP_MIN_STKPADDING;
size_t __kmp_malloc_pool_incr = KMP_DEFAULT_MALLOC_POOL_INCR;
@ -94,7 +109,7 @@ char const *__kmp_barrier_type_name [ bs_last_barrier ] =
, "reduction"
#endif // KMP_FAST_REDUCTION_BARRIER
};
char const *__kmp_barrier_pattern_name [ bp_last_bar ] = { "linear", "tree", "hyper" };
char const *__kmp_barrier_pattern_name [ bp_last_bar ] = { "linear", "tree", "hyper", "hierarchical" };
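// Assuming the existing barrier-pattern envirables accept the new name added
// to the array above, the hierarchical barrier would be selected with, e.g.:
//     KMP_PLAIN_BARRIER_PATTERN=hierarchical
//     KMP_FORKJOIN_BARRIER_PATTERN=hierarchical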
int __kmp_allThreadsSpecified = 0;
@ -114,16 +129,17 @@ int __kmp_dflt_team_nth_ub = 0;
int __kmp_tp_capacity = 0;
int __kmp_tp_cached = 0;
int __kmp_dflt_nested = FALSE;
#if OMP_30_ENABLED
int __kmp_dflt_max_active_levels = KMP_MAX_ACTIVE_LEVELS_LIMIT; /* max_active_levels limit */
#endif // OMP_30_ENABLED
#if KMP_NESTED_HOT_TEAMS
int __kmp_hot_teams_mode = 0; /* 0 - free extra threads when reduced */
/* 1 - keep extra threads when reduced */
int __kmp_hot_teams_max_level = 1; /* nesting level of hot teams */
#endif
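// These globals back the nested hot-teams support; assuming a KMP_HOT_TEAMS_MODE
// envirable maps onto __kmp_hot_teams_mode the same way KMP_HOT_TEAMS_MAX_LEVEL
// maps onto __kmp_hot_teams_max_level, a typical run would be:
//     OMP_NESTED=true KMP_HOT_TEAMS_MAX_LEVEL=2 KMP_HOT_TEAMS_MODE=1 ./app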
enum library_type __kmp_library = library_none;
enum sched_type __kmp_sched = kmp_sch_default; /* scheduling method for runtime scheduling */
enum sched_type __kmp_static = kmp_sch_static_greedy; /* default static scheduling method */
enum sched_type __kmp_guided = kmp_sch_guided_iterative_chunked; /* default guided scheduling method */
#if OMP_30_ENABLED
enum sched_type __kmp_auto = kmp_sch_guided_analytical_chunked; /* default auto scheduling method */
#endif // OMP_30_ENABLED
int __kmp_dflt_blocktime = KMP_DEFAULT_BLOCKTIME;
int __kmp_monitor_wakeups = KMP_MIN_MONITOR_WAKEUPS;
int __kmp_bt_intervals = KMP_INTERVALS_FROM_BLOCKTIME( KMP_DEFAULT_BLOCKTIME, KMP_MIN_MONITOR_WAKEUPS );
@ -242,7 +258,6 @@ unsigned int __kmp_place_num_threads_per_core = 0;
unsigned int __kmp_place_core_offset = 0;
#endif
#if OMP_30_ENABLED
kmp_tasking_mode_t __kmp_tasking_mode = tskm_task_teams;
/* This check ensures that the compiler is passing the correct data type
@ -255,8 +270,6 @@ KMP_BUILD_ASSERT( sizeof(kmp_tasking_flags_t) == 4 );
kmp_int32 __kmp_task_stealing_constraint = 1; /* Constrain task stealing by default */
#endif /* OMP_30_ENABLED */
#ifdef DEBUG_SUSPEND
int __kmp_suspend_count = 0;
#endif
@ -364,6 +377,29 @@ kmp_global_t __kmp_global = {{ 0 }};
/* ----------------------------------------------- */
/* GLOBAL SYNCHRONIZATION LOCKS */
/* TODO verify the need for these locks and if they need to be global */
#if KMP_USE_INTERNODE_ALIGNMENT
/* Multinode systems have larger cache line granularity which can cause
* false sharing if the alignment is not large enough for these locks */
KMP_ALIGN_CACHE_INTERNODE
kmp_bootstrap_lock_t __kmp_initz_lock = KMP_BOOTSTRAP_LOCK_INITIALIZER( __kmp_initz_lock ); /* Control initializations */
KMP_ALIGN_CACHE_INTERNODE
kmp_bootstrap_lock_t __kmp_forkjoin_lock; /* control fork/join access */
KMP_ALIGN_CACHE_INTERNODE
kmp_bootstrap_lock_t __kmp_exit_lock; /* exit() is not always thread-safe */
KMP_ALIGN_CACHE_INTERNODE
kmp_bootstrap_lock_t __kmp_monitor_lock; /* control monitor thread creation */
KMP_ALIGN_CACHE_INTERNODE
kmp_bootstrap_lock_t __kmp_tp_cached_lock; /* used for the hack to allow threadprivate cache and __kmp_threads expansion to co-exist */
KMP_ALIGN_CACHE_INTERNODE
kmp_lock_t __kmp_global_lock; /* Control OS/global access */
KMP_ALIGN_CACHE_INTERNODE
kmp_queuing_lock_t __kmp_dispatch_lock; /* Control dispatch access */
KMP_ALIGN_CACHE_INTERNODE
kmp_lock_t __kmp_debug_lock; /* Control I/O access for KMP_DEBUG */
#else
KMP_ALIGN_CACHE
kmp_bootstrap_lock_t __kmp_initz_lock = KMP_BOOTSTRAP_LOCK_INITIALIZER( __kmp_initz_lock ); /* Control initializations */
@ -378,6 +414,7 @@ KMP_ALIGN(128)
kmp_queuing_lock_t __kmp_dispatch_lock; /* Control dispatch access */
KMP_ALIGN(128)
kmp_lock_t __kmp_debug_lock; /* Control I/O access for KMP_DEBUG */
#endif
/* ----------------------------------------------- */

--- kmp_gsupport.c ---

@ -1,7 +1,7 @@
/*
* kmp_gsupport.c
* $Revision: 42810 $
* $Date: 2013-11-07 12:06:33 -0600 (Thu, 07 Nov 2013) $
* $Revision: 43473 $
* $Date: 2014-09-26 15:02:57 -0500 (Fri, 26 Sep 2014) $
*/
@ -244,7 +244,7 @@ xexpand(KMP_API_NAME_GOMP_ORDERED_END)(void)
// The parallel construct
//
#ifdef KMP_DEBUG
#ifndef KMP_DEBUG
static
#endif /* KMP_DEBUG */
void
@ -255,7 +255,7 @@ __kmp_GOMP_microtask_wrapper(int *gtid, int *npr, void (*task)(void *),
}
#ifdef KMP_DEBUG
#ifndef KMP_DEBUG
static
#endif /* KMP_DEBUG */
void
@ -276,7 +276,7 @@ __kmp_GOMP_parallel_microtask_wrapper(int *gtid, int *npr,
}
#ifdef KMP_DEBUG
#ifndef KMP_DEBUG
static
#endif /* KMP_DEBUG */
void
@ -287,7 +287,7 @@ __kmp_GOMP_fork_call(ident_t *loc, int gtid, microtask_t wrapper, int argc,...)
va_list ap;
va_start(ap, argc);
rc = __kmp_fork_call(loc, gtid, FALSE, argc, wrapper, __kmp_invoke_task_func,
rc = __kmp_fork_call(loc, gtid, fork_context_gnu, argc, wrapper, __kmp_invoke_task_func,
#if (KMP_ARCH_X86_64 || KMP_ARCH_ARM) && KMP_OS_LINUX
&ap
#else
@ -563,7 +563,7 @@ xexpand(KMP_API_NAME_GOMP_LOOP_END_NOWAIT)(void)
status = KMP_DISPATCH_NEXT_ULL(&loc, gtid, NULL, \
(kmp_uint64 *)p_lb, (kmp_uint64 *)p_ub, (kmp_int64 *)&stride); \
if (status) { \
KMP_DEBUG_ASSERT(stride == str2); \
KMP_DEBUG_ASSERT((long long)stride == str2); \
*p_ub += (str > 0) ? 1 : -1; \
} \
} \
@ -666,9 +666,6 @@ PARALLEL_LOOP_START(xexpand(KMP_API_NAME_GOMP_PARALLEL_LOOP_GUIDED_START), kmp_s
PARALLEL_LOOP_START(xexpand(KMP_API_NAME_GOMP_PARALLEL_LOOP_RUNTIME_START), kmp_sch_runtime)
#if OMP_30_ENABLED
/* */
//
// Tasking constructs
@ -742,9 +739,6 @@ xexpand(KMP_API_NAME_GOMP_TASKWAIT)(void)
}
#endif /* OMP_30_ENABLED */
/* */
//
// Sections worksharing constructs
@ -861,9 +855,268 @@ xexpand(KMP_API_NAME_GOMP_SECTIONS_END_NOWAIT)(void)
void
xexpand(KMP_API_NAME_GOMP_TASKYIELD)(void)
{
KA_TRACE(20, ("GOMP_taskyield: T#%d\n", __kmp_get_gtid()))
return;
}
#if OMP_40_ENABLED // these are new GOMP_4.0 entry points
void
xexpand(KMP_API_NAME_GOMP_PARALLEL)(void (*task)(void *), void *data, unsigned num_threads, unsigned int flags)
{
int gtid = __kmp_entry_gtid();
MKLOC(loc, "GOMP_parallel");
KA_TRACE(20, ("GOMP_parallel: T#%d\n", gtid));
if (__kmpc_ok_to_fork(&loc) && (num_threads != 1)) {
if (num_threads != 0) {
__kmp_push_num_threads(&loc, gtid, num_threads);
}
if(flags != 0) {
__kmp_push_proc_bind(&loc, gtid, (kmp_proc_bind_t)flags);
}
__kmp_GOMP_fork_call(&loc, gtid,
(microtask_t)__kmp_GOMP_microtask_wrapper, 2, task, data);
}
else {
__kmpc_serialized_parallel(&loc, gtid);
}
task(data);
xexpand(KMP_API_NAME_GOMP_PARALLEL_END)();
}
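// For reference, gcc 4.9 lowers a combined parallel region into a single call
// to this entry point; a sketch of the caller side (the outlined function name
// and data layout are compiler-internal, shown only for illustration):
//
//     static void outlined_fn(void *data) { /* region body */ }
//     ...
//     GOMP_parallel(outlined_fn, &shared_data, 4 /* num_threads(4) */, 0 /* flags */);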
void
xexpand(KMP_API_NAME_GOMP_PARALLEL_SECTIONS)(void (*task) (void *), void *data,
unsigned num_threads, unsigned count, unsigned flags)
{
int gtid = __kmp_entry_gtid();
int last = FALSE;
MKLOC(loc, "GOMP_parallel_sections");
KA_TRACE(20, ("GOMP_parallel_sections: T#%d\n", gtid));
if (__kmpc_ok_to_fork(&loc) && (num_threads != 1)) {
if (num_threads != 0) {
__kmp_push_num_threads(&loc, gtid, num_threads);
}
if(flags != 0) {
__kmp_push_proc_bind(&loc, gtid, (kmp_proc_bind_t)flags);
}
__kmp_GOMP_fork_call(&loc, gtid,
(microtask_t)__kmp_GOMP_parallel_microtask_wrapper, 9, task, data,
num_threads, &loc, kmp_nm_dynamic_chunked, (kmp_int)1,
(kmp_int)count, (kmp_int)1, (kmp_int)1);
}
else {
__kmpc_serialized_parallel(&loc, gtid);
}
KMP_DISPATCH_INIT(&loc, gtid, kmp_nm_dynamic_chunked, 1, count, 1, 1, TRUE);
task(data);
xexpand(KMP_API_NAME_GOMP_PARALLEL_END)();
KA_TRACE(20, ("GOMP_parallel_sections exit: T#%d\n", gtid));
}
#define PARALLEL_LOOP(func, schedule) \
void func (void (*task) (void *), void *data, unsigned num_threads, \
long lb, long ub, long str, long chunk_sz, unsigned flags) \
{ \
int gtid = __kmp_entry_gtid(); \
int last = FALSE; \
MKLOC(loc, #func); \
KA_TRACE(20, ( #func ": T#%d, lb 0x%lx, ub 0x%lx, str 0x%lx, chunk_sz 0x%lx\n", \
gtid, lb, ub, str, chunk_sz )); \
\
if (__kmpc_ok_to_fork(&loc) && (num_threads != 1)) { \
if (num_threads != 0) { \
__kmp_push_num_threads(&loc, gtid, num_threads); \
} \
if (flags != 0) { \
__kmp_push_proc_bind(&loc, gtid, (kmp_proc_bind_t)flags); \
} \
__kmp_GOMP_fork_call(&loc, gtid, \
(microtask_t)__kmp_GOMP_parallel_microtask_wrapper, 9, \
task, data, num_threads, &loc, (schedule), lb, \
(str > 0) ? (ub - 1) : (ub + 1), str, chunk_sz); \
} \
else { \
__kmpc_serialized_parallel(&loc, gtid); \
} \
\
KMP_DISPATCH_INIT(&loc, gtid, (schedule), lb, \
(str > 0) ? (ub - 1) : (ub + 1), str, chunk_sz, \
(schedule) != kmp_sch_static); \
task(data); \
xexpand(KMP_API_NAME_GOMP_PARALLEL_END)(); \
\
KA_TRACE(20, ( #func " exit: T#%d\n", gtid)); \
}
PARALLEL_LOOP(xexpand(KMP_API_NAME_GOMP_PARALLEL_LOOP_STATIC), kmp_sch_static)
PARALLEL_LOOP(xexpand(KMP_API_NAME_GOMP_PARALLEL_LOOP_DYNAMIC), kmp_sch_dynamic_chunked)
PARALLEL_LOOP(xexpand(KMP_API_NAME_GOMP_PARALLEL_LOOP_GUIDED), kmp_sch_guided_chunked)
PARALLEL_LOOP(xexpand(KMP_API_NAME_GOMP_PARALLEL_LOOP_RUNTIME), kmp_sch_runtime)
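// The four instantiations above provide the fused gcc 4.9 entry points that
// replace the older GOMP_parallel_loop_*_start + GOMP_parallel_end pairs.
// Sketch of the generated call for 'schedule(static, 8)' (names illustrative):
//
//     GOMP_parallel_loop_static(outlined_loop_fn, &data,
//                               0,     // num_threads: 0 = default
//                               0, n,  // lb, ub: GOMP bounds are [lb, ub)
//                               1, 8,  // str, chunk_sz
//                               0);    // flags (proc_bind)
//
// The (str > 0) ? (ub - 1) : (ub + 1) adjustment in the macro converts GOMP's
// exclusive upper bound into the inclusive bound the runtime expects.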
void
xexpand(KMP_API_NAME_GOMP_TASKGROUP_START)(void)
{
int gtid = __kmp_get_gtid();
MKLOC(loc, "GOMP_taskgroup_start");
KA_TRACE(20, ("GOMP_taskgroup_start: T#%d\n", gtid));
__kmpc_taskgroup(&loc, gtid);
return;
}
void
xexpand(KMP_API_NAME_GOMP_TASKGROUP_END)(void)
{
int gtid = __kmp_get_gtid();
MKLOC(loc, "GOMP_taskgroup_end");
KA_TRACE(20, ("GOMP_taskgroup_end: T#%d\n", gtid));
__kmpc_end_taskgroup(&loc, gtid);
return;
}
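// These are thin shims onto the __kmpc taskgroup machinery, so the lowering
// is direct:
//
//     GOMP_taskgroup_start();
//     /* ... body; GOMP_task() calls create child tasks ... */
//     GOMP_taskgroup_end();    // waits for all tasks created in the group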
#ifndef KMP_DEBUG
static
#endif /* KMP_DEBUG */
kmp_int32 __kmp_gomp_to_iomp_cancellation_kind(int gomp_kind) {
kmp_int32 cncl_kind = 0;
switch(gomp_kind) {
case 1:
cncl_kind = cancel_parallel;
break;
case 2:
cncl_kind = cancel_loop;
break;
case 4:
cncl_kind = cancel_sections;
break;
case 8:
cncl_kind = cancel_taskgroup;
break;
}
return cncl_kind;
}
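// The gomp_kind values above are libgomp's construct bitmask (1 = parallel,
// 2 = loop, 4 = sections, 8 = taskgroup). For example, a sketch of what
// '#pragma omp cancel for if(cond)' becomes on the caller side:
//
//     if (GOMP_cancel(2 /* loop */, cond)) goto loop_exit;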
bool
xexpand(KMP_API_NAME_GOMP_CANCELLATION_POINT)(int which)
{
if(__kmp_omp_cancellation) {
KMP_FATAL(NoGompCancellation);
}
int gtid = __kmp_get_gtid();
MKLOC(loc, "GOMP_cancellation_point");
KA_TRACE(20, ("GOMP_cancellation_point: T#%d\n", gtid));
kmp_int32 cncl_kind = __kmp_gomp_to_iomp_cancellation_kind(which);
return __kmpc_cancellationpoint(&loc, gtid, cncl_kind);
}
bool
xexpand(KMP_API_NAME_GOMP_BARRIER_CANCEL)(void)
{
if(__kmp_omp_cancellation) {
KMP_FATAL(NoGompCancellation);
}
int gtid = __kmp_get_gtid();
MKLOC(loc, "GOMP_barrier_cancel");
KA_TRACE(20, ("GOMP_barrier_cancel: T#%d\n", gtid));
return __kmpc_cancel_barrier(&loc, gtid);
}
bool
xexpand(KMP_API_NAME_GOMP_CANCEL)(int which, bool do_cancel)
{
if(__kmp_omp_cancellation) {
KMP_FATAL(NoGompCancellation);
} else {
return FALSE;
}
int gtid = __kmp_get_gtid();
MKLOC(loc, "GOMP_cancel");
KA_TRACE(20, ("GOMP_cancel: T#%d\n", gtid));
kmp_int32 cncl_kind = __kmp_gomp_to_iomp_cancellation_kind(which);
if(do_cancel == FALSE) {
return xexpand(KMP_API_NAME_GOMP_CANCELLATION_POINT)(which);
} else {
return __kmpc_cancel(&loc, gtid, cncl_kind);
}
}
bool
xexpand(KMP_API_NAME_GOMP_SECTIONS_END_CANCEL)(void)
{
if(__kmp_omp_cancellation) {
KMP_FATAL(NoGompCancellation);
}
int gtid = __kmp_get_gtid();
MKLOC(loc, "GOMP_sections_end_cancel");
KA_TRACE(20, ("GOMP_sections_end_cancel: T#%d\n", gtid));
return __kmpc_cancel_barrier(&loc, gtid);
}
bool
xexpand(KMP_API_NAME_GOMP_LOOP_END_CANCEL)(void)
{
if(__kmp_omp_cancellation) {
KMP_FATAL(NoGompCancellation);
}
int gtid = __kmp_get_gtid();
MKLOC(loc, "GOMP_loop_end_cancel");
KA_TRACE(20, ("GOMP_loop_end_cancel: T#%d\n", gtid));
return __kmpc_cancel_barrier(&loc, gtid);
}
// All target functions are empty as of 2014-05-29
void
xexpand(KMP_API_NAME_GOMP_TARGET)(int device, void (*fn) (void *), const void *openmp_target,
size_t mapnum, void **hostaddrs, size_t *sizes, unsigned char *kinds)
{
return;
}
void
xexpand(KMP_API_NAME_GOMP_TARGET_DATA)(int device, const void *openmp_target, size_t mapnum,
void **hostaddrs, size_t *sizes, unsigned char *kinds)
{
return;
}
void
xexpand(KMP_API_NAME_GOMP_TARGET_END_DATA)(void)
{
return;
}
void
xexpand(KMP_API_NAME_GOMP_TARGET_UPDATE)(int device, const void *openmp_target, size_t mapnum,
void **hostaddrs, size_t *sizes, unsigned char *kinds)
{
return;
}
void
xexpand(KMP_API_NAME_GOMP_TEAMS)(unsigned int num_teams, unsigned int thread_limit)
{
return;
}
#endif // OMP_40_ENABLED
/*
The following sections of code create aliases for the GOMP_* functions,
then create versioned symbols using the assembler directive .symver.
@ -871,7 +1124,7 @@ xexpand(KMP_API_NAME_GOMP_TASKYIELD)(void)
xaliasify and xversionify are defined in kmp_ftn_os.h
*/
#if KMP_OS_LINUX
#ifdef KMP_USE_VERSION_SYMBOLS
// GOMP_1.0 aliases
xaliasify(KMP_API_NAME_GOMP_ATOMIC_END, 10);
@ -917,10 +1170,8 @@ xaliasify(KMP_API_NAME_GOMP_SINGLE_COPY_START, 10);
xaliasify(KMP_API_NAME_GOMP_SINGLE_START, 10);
// GOMP_2.0 aliases
#if OMP_30_ENABLED
xaliasify(KMP_API_NAME_GOMP_TASK, 20);
xaliasify(KMP_API_NAME_GOMP_TASKWAIT, 20);
#endif
xaliasify(KMP_API_NAME_GOMP_LOOP_ULL_DYNAMIC_NEXT, 20);
xaliasify(KMP_API_NAME_GOMP_LOOP_ULL_DYNAMIC_START, 20);
xaliasify(KMP_API_NAME_GOMP_LOOP_ULL_GUIDED_NEXT, 20);
@ -942,9 +1193,27 @@ xaliasify(KMP_API_NAME_GOMP_LOOP_ULL_STATIC_START, 20);
xaliasify(KMP_API_NAME_GOMP_TASKYIELD, 30);
// GOMP_4.0 aliases
/* TODO: add GOMP_4.0 aliases when corresponding
GOMP_* functions are implemented
*/
// The GOMP_parallel* entry points below aren't OpenMP 4.0 related.
#if OMP_40_ENABLED
xaliasify(KMP_API_NAME_GOMP_PARALLEL, 40);
xaliasify(KMP_API_NAME_GOMP_PARALLEL_SECTIONS, 40);
xaliasify(KMP_API_NAME_GOMP_PARALLEL_LOOP_DYNAMIC, 40);
xaliasify(KMP_API_NAME_GOMP_PARALLEL_LOOP_GUIDED, 40);
xaliasify(KMP_API_NAME_GOMP_PARALLEL_LOOP_RUNTIME, 40);
xaliasify(KMP_API_NAME_GOMP_PARALLEL_LOOP_STATIC, 40);
xaliasify(KMP_API_NAME_GOMP_TASKGROUP_START, 40);
xaliasify(KMP_API_NAME_GOMP_TASKGROUP_END, 40);
xaliasify(KMP_API_NAME_GOMP_BARRIER_CANCEL, 40);
xaliasify(KMP_API_NAME_GOMP_CANCEL, 40);
xaliasify(KMP_API_NAME_GOMP_CANCELLATION_POINT, 40);
xaliasify(KMP_API_NAME_GOMP_LOOP_END_CANCEL, 40);
xaliasify(KMP_API_NAME_GOMP_SECTIONS_END_CANCEL, 40);
xaliasify(KMP_API_NAME_GOMP_TARGET, 40);
xaliasify(KMP_API_NAME_GOMP_TARGET_DATA, 40);
xaliasify(KMP_API_NAME_GOMP_TARGET_END_DATA, 40);
xaliasify(KMP_API_NAME_GOMP_TARGET_UPDATE, 40);
xaliasify(KMP_API_NAME_GOMP_TEAMS, 40);
#endif
// GOMP_1.0 versioned symbols
xversionify(KMP_API_NAME_GOMP_ATOMIC_END, 10, "GOMP_1.0");
@ -990,10 +1259,8 @@ xversionify(KMP_API_NAME_GOMP_SINGLE_COPY_START, 10, "GOMP_1.0");
xversionify(KMP_API_NAME_GOMP_SINGLE_START, 10, "GOMP_1.0");
// GOMP_2.0 versioned symbols
#if OMP_30_ENABLED
xversionify(KMP_API_NAME_GOMP_TASK, 20, "GOMP_2.0");
xversionify(KMP_API_NAME_GOMP_TASKWAIT, 20, "GOMP_2.0");
#endif
xversionify(KMP_API_NAME_GOMP_LOOP_ULL_DYNAMIC_NEXT, 20, "GOMP_2.0");
xversionify(KMP_API_NAME_GOMP_LOOP_ULL_DYNAMIC_START, 20, "GOMP_2.0");
xversionify(KMP_API_NAME_GOMP_LOOP_ULL_GUIDED_NEXT, 20, "GOMP_2.0");
@ -1015,11 +1282,28 @@ xversionify(KMP_API_NAME_GOMP_LOOP_ULL_STATIC_START, 20, "GOMP_2.0");
xversionify(KMP_API_NAME_GOMP_TASKYIELD, 30, "GOMP_3.0");
// GOMP_4.0 versioned symbols
/* TODO: add GOMP_4.0 versioned symbols when corresponding
GOMP_* functions are implemented
*/
#if OMP_40_ENABLED
xversionify(KMP_API_NAME_GOMP_PARALLEL, 40, "GOMP_4.0");
xversionify(KMP_API_NAME_GOMP_PARALLEL_SECTIONS, 40, "GOMP_4.0");
xversionify(KMP_API_NAME_GOMP_PARALLEL_LOOP_DYNAMIC, 40, "GOMP_4.0");
xversionify(KMP_API_NAME_GOMP_PARALLEL_LOOP_GUIDED, 40, "GOMP_4.0");
xversionify(KMP_API_NAME_GOMP_PARALLEL_LOOP_RUNTIME, 40, "GOMP_4.0");
xversionify(KMP_API_NAME_GOMP_PARALLEL_LOOP_STATIC, 40, "GOMP_4.0");
xversionify(KMP_API_NAME_GOMP_TASKGROUP_START, 40, "GOMP_4.0");
xversionify(KMP_API_NAME_GOMP_TASKGROUP_END, 40, "GOMP_4.0");
xversionify(KMP_API_NAME_GOMP_BARRIER_CANCEL, 40, "GOMP_4.0");
xversionify(KMP_API_NAME_GOMP_CANCEL, 40, "GOMP_4.0");
xversionify(KMP_API_NAME_GOMP_CANCELLATION_POINT, 40, "GOMP_4.0");
xversionify(KMP_API_NAME_GOMP_LOOP_END_CANCEL, 40, "GOMP_4.0");
xversionify(KMP_API_NAME_GOMP_SECTIONS_END_CANCEL, 40, "GOMP_4.0");
xversionify(KMP_API_NAME_GOMP_TARGET, 40, "GOMP_4.0");
xversionify(KMP_API_NAME_GOMP_TARGET_DATA, 40, "GOMP_4.0");
xversionify(KMP_API_NAME_GOMP_TARGET_END_DATA, 40, "GOMP_4.0");
xversionify(KMP_API_NAME_GOMP_TARGET_UPDATE, 40, "GOMP_4.0");
xversionify(KMP_API_NAME_GOMP_TEAMS, 40, "GOMP_4.0");
#endif
#endif /* KMP_OS_LINUX */
#endif // KMP_USE_VERSION_SYMBOLS
#ifdef __cplusplus
} //extern "C"

--- kmp_i18n.c ---

@ -1,7 +1,7 @@
/*
* kmp_i18n.c
* $Revision: 42810 $
* $Date: 2013-11-07 12:06:33 -0600 (Thu, 07 Nov 2013) $
* $Revision: 43084 $
* $Date: 2014-04-15 09:15:14 -0500 (Tue, 15 Apr 2014) $
*/
@ -815,7 +815,7 @@ sys_error(
// not issue warning if strerror_r() returns `int' instead of expected `char *'.
message = __kmp_str_format( "%s", err_msg );
#else // OS X*, FreeBSD etc.
#else // OS X*, FreeBSD* etc.
// XSI version of strerror_r.

--- kmp_i18n.h ---

@ -1,7 +1,7 @@
/*
* kmp_i18n.h
* $Revision: 42810 $
* $Date: 2013-11-07 12:06:33 -0600 (Thu, 07 Nov 2013) $
* $Revision: 42951 $
* $Date: 2014-01-21 14:41:41 -0600 (Tue, 21 Jan 2014) $
*/

--- kmp_import.c ---

@ -1,7 +1,7 @@
/*
* kmp_import.c
* $Revision: 42286 $
* $Date: 2013-04-18 10:53:26 -0500 (Thu, 18 Apr 2013) $
* $Revision: 42951 $
* $Date: 2014-01-21 14:41:41 -0600 (Tue, 21 Jan 2014) $
*/

--- kmp_io.c ---

@ -1,7 +1,7 @@
/*
* kmp_io.c -- RTL IO
* $Revision: 42150 $
* $Date: 2013-03-15 15:40:38 -0500 (Fri, 15 Mar 2013) $
* $Revision: 43236 $
* $Date: 2014-06-04 16:42:35 -0500 (Wed, 04 Jun 2014) $
*/
@ -171,7 +171,7 @@ __kmp_vprintf( enum kmp_io __kmp_io, char const * format, va_list ap )
int chars = 0;
#ifdef KMP_DEBUG_PIDS
chars = sprintf( db, "pid=%d: ", getpid() );
chars = sprintf( db, "pid=%d: ", (kmp_int32)getpid() );
#endif
chars += vsprintf( db, format, ap );
@ -200,7 +200,8 @@ __kmp_vprintf( enum kmp_io __kmp_io, char const * format, va_list ap )
#if KMP_OS_WINDOWS
DWORD count;
#ifdef KMP_DEBUG_PIDS
__kmp_str_buf_print( &__kmp_console_buf, "pid=%d: ", getpid() );
__kmp_str_buf_print( &__kmp_console_buf, "pid=%d: ",
(kmp_int32)getpid() );
#endif
__kmp_str_buf_vprint( &__kmp_console_buf, format, ap );
WriteFile(
@ -213,7 +214,7 @@ __kmp_vprintf( enum kmp_io __kmp_io, char const * format, va_list ap )
__kmp_str_buf_clear( &__kmp_console_buf );
#else
#ifdef KMP_DEBUG_PIDS
fprintf( __kmp_stderr, "pid=%d: ", getpid() );
fprintf( __kmp_stderr, "pid=%d: ", (kmp_int32)getpid() );
#endif
vfprintf( __kmp_stderr, format, ap );
fflush( __kmp_stderr );

--- kmp_io.h ---

@ -1,7 +1,7 @@
/*
* kmp_io.h -- RTL IO header file.
* $Revision: 42061 $
* $Date: 2013-02-28 16:36:24 -0600 (Thu, 28 Feb 2013) $
* $Revision: 42951 $
* $Date: 2014-01-21 14:41:41 -0600 (Tue, 21 Jan 2014) $
*/

--- kmp_itt.c ---

@ -1,8 +1,8 @@
#if USE_ITT_BUILD
/*
* kmp_itt.c -- ITT Notify interface.
* $Revision: 42489 $
* $Date: 2013-07-08 11:00:09 -0500 (Mon, 08 Jul 2013) $
* $Revision: 43457 $
* $Date: 2014-09-17 03:57:22 -0500 (Wed, 17 Sep 2014) $
*/
@ -25,8 +25,13 @@
#if USE_ITT_NOTIFY
kmp_int32 __kmp_frame_domain_count = 0;
__itt_domain* __kmp_itt_domains[KMP_MAX_FRAME_DOMAINS];
kmp_int32 __kmp_barrier_domain_count;
kmp_int32 __kmp_region_domain_count;
__itt_domain* __kmp_itt_barrier_domains[KMP_MAX_FRAME_DOMAINS];
__itt_domain* __kmp_itt_region_domains[KMP_MAX_FRAME_DOMAINS];
__itt_domain* __kmp_itt_imbalance_domains[KMP_MAX_FRAME_DOMAINS];
kmp_int32 __kmp_itt_region_team_size[KMP_MAX_FRAME_DOMAINS];
__itt_domain * metadata_domain = NULL;
#include "kmp_version.h"
#include "kmp_i18n.h"

--- kmp_itt.h ---

@ -1,8 +1,8 @@
#if USE_ITT_BUILD
/*
* kmp_itt.h -- ITT Notify interface.
* $Revision: 42829 $
* $Date: 2013-11-21 05:44:01 -0600 (Thu, 21 Nov 2013) $
* $Revision: 43457 $
* $Date: 2014-09-17 03:57:22 -0500 (Wed, 17 Sep 2014) $
*/
@ -55,12 +55,20 @@ void __kmp_itt_destroy();
// __kmp_itt_xxxed() function should be called after action.
// --- Parallel region reporting ---
__kmp_inline void __kmp_itt_region_forking( int gtid, int serialized = 0 ); // Master only, before forking threads.
__kmp_inline void __kmp_itt_region_forking( int gtid, int team_size, int barriers, int serialized = 0 ); // Master only, before forking threads.
__kmp_inline void __kmp_itt_region_joined( int gtid, int serialized = 0 ); // Master only, after joining threads.
// (*) Note: A thread may execute tasks after this point, though.
// --- Frame reporting ---
__kmp_inline void __kmp_itt_frame_submit( int gtid, __itt_timestamp begin, __itt_timestamp end, int imbalance, ident_t *loc );
// region = 0 - no regions, region = 1 - parallel, region = 2 - serialized parallel
__kmp_inline void __kmp_itt_frame_submit( int gtid, __itt_timestamp begin, __itt_timestamp end, int imbalance, ident_t *loc, int team_size, int region = 0 );
// --- Metadata reporting ---
// begin/end - begin/end timestamps of a barrier frame, imbalance - aggregated wait time value, reduction - if this is a reduction barrier
__kmp_inline void __kmp_itt_metadata_imbalance( int gtid, kmp_uint64 begin, kmp_uint64 end, kmp_uint64 imbalance, kmp_uint64 reduction );
// sched_type: 0 - static, 1 - dynamic, 2 - guided, 3 - custom (all others); iterations - loop trip count, chunk - chunk size
__kmp_inline void __kmp_itt_metadata_loop( ident_t * loc, kmp_uint64 sched_type, kmp_uint64 iterations, kmp_uint64 chunk );
__kmp_inline void __kmp_itt_metadata_single();
// --- Barrier reporting ---
__kmp_inline void * __kmp_itt_barrier_object( int gtid, int bt, int set_name = 0, int delta = 0 );
@ -135,8 +143,12 @@ __kmp_inline void __kmp_itt_stack_callee_leave(__itt_caller);
#if (INCLUDE_SSC_MARKS && KMP_OS_LINUX && KMP_ARCH_X86_64)
// Portable (at least for gcc and icc) code to insert the necessary instructions
// to set %ebx and execute the unlikely no-op.
# define INSERT_SSC_MARK(tag) \
__asm__ __volatile__ ("movl %0, %%ebx; .byte 0x64, 0x67, 0x90 " ::"i"(tag):"%ebx")
#if defined( __INTEL_COMPILER )
# define INSERT_SSC_MARK(tag) __SSC_MARK(tag)
#else
# define INSERT_SSC_MARK(tag) \
__asm__ __volatile__ ("movl %0, %%ebx; .byte 0x64, 0x67, 0x90 " ::"i"(tag):"%ebx")
#endif
#else
# define INSERT_SSC_MARK(tag) ((void)0)
#endif
@ -150,6 +162,18 @@ __kmp_inline void __kmp_itt_stack_callee_leave(__itt_caller);
#define SSC_MARK_SPIN_START() INSERT_SSC_MARK(0x4376)
#define SSC_MARK_SPIN_END() INSERT_SSC_MARK(0x4377)
// Markers for architecture simulation.
// FORKING : Before the master thread forks.
// JOINING : At the start of the join.
// INVOKING : Before the threads invoke microtasks.
// DISPATCH_INIT: At the start of a dynamically scheduled loop.
// DISPATCH_NEXT: After claiming the next iteration of a dynamically scheduled loop.
#define SSC_MARK_FORKING() INSERT_SSC_MARK(0xd693)
#define SSC_MARK_JOINING() INSERT_SSC_MARK(0xd694)
#define SSC_MARK_INVOKING() INSERT_SSC_MARK(0xd695)
#define SSC_MARK_DISPATCH_INIT() INSERT_SSC_MARK(0xd696)
#define SSC_MARK_DISPATCH_NEXT() INSERT_SSC_MARK(0xd697)
// The object is an address that associates a specific set of the prepare, acquire, release,
// and cancel operations.
@ -227,8 +251,14 @@ __kmp_inline void __kmp_itt_stack_callee_leave(__itt_caller);
const int KMP_MAX_FRAME_DOMAINS = 512; // Maximum number of frame domains to use (maps to
// different OpenMP regions in the user source code).
extern kmp_int32 __kmp_frame_domain_count;
extern __itt_domain* __kmp_itt_domains[KMP_MAX_FRAME_DOMAINS];
extern kmp_int32 __kmp_barrier_domain_count;
extern kmp_int32 __kmp_region_domain_count;
extern __itt_domain* __kmp_itt_barrier_domains[KMP_MAX_FRAME_DOMAINS];
extern __itt_domain* __kmp_itt_region_domains[KMP_MAX_FRAME_DOMAINS];
extern __itt_domain* __kmp_itt_imbalance_domains[KMP_MAX_FRAME_DOMAINS];
extern kmp_int32 __kmp_itt_region_team_size[KMP_MAX_FRAME_DOMAINS];
extern __itt_domain * metadata_domain;
#else
// Null definitions of the synchronization tracing functions.

--- kmp_itt.inl ---

@ -1,8 +1,8 @@
#if USE_ITT_BUILD
/*
* kmp_itt.inl -- Inline functions of ITT Notify.
* $Revision: 42866 $
* $Date: 2013-12-10 15:15:58 -0600 (Tue, 10 Dec 2013) $
* $Revision: 43457 $
* $Date: 2014-09-17 03:57:22 -0500 (Wed, 17 Sep 2014) $
*/
@ -63,6 +63,8 @@
#endif
#endif
static kmp_bootstrap_lock_t metadata_lock = KMP_BOOTSTRAP_LOCK_INITIALIZER( metadata_lock );
/*
------------------------------------------------------------------------------------------------
Parallel region reporting.
@ -89,12 +91,10 @@
// -------------------------------------------------------------------------------------------------
LINKAGE void
__kmp_itt_region_forking( int gtid, int serialized ) {
__kmp_itt_region_forking( int gtid, int team_size, int barriers, int serialized ) {
#if USE_ITT_NOTIFY
kmp_team_t * team = __kmp_team_from_gtid( gtid );
#if OMP_30_ENABLED
if (team->t.t_active_level + serialized > 1)
#endif
{
// The frame notifications are only supported for the outermost teams.
return;
@ -105,40 +105,81 @@ __kmp_itt_region_forking( int gtid, int serialized ) {
// Assume that reserved_2 contains zero initially. Since zero is special
// value here, store the index into domain array increased by 1.
if (loc->reserved_2 == 0) {
if (__kmp_frame_domain_count < KMP_MAX_FRAME_DOMAINS) {
int frm = KMP_TEST_THEN_INC32( & __kmp_frame_domain_count ); // get "old" value
if (__kmp_region_domain_count < KMP_MAX_FRAME_DOMAINS) {
int frm = KMP_TEST_THEN_INC32( & __kmp_region_domain_count ); // get "old" value
if (frm >= KMP_MAX_FRAME_DOMAINS) {
KMP_TEST_THEN_DEC32( & __kmp_frame_domain_count ); // revert the count
KMP_TEST_THEN_DEC32( & __kmp_region_domain_count ); // revert the count
return; // loc->reserved_2 is still 0
}
//if (!KMP_COMPARE_AND_STORE_ACQ32( &loc->reserved_2, 0, frm + 1 )) {
// frm = loc->reserved_2 - 1; // get value saved by other thread for same loc
//} // AC: this block is to replace next unsynchronized line
loc->reserved_2 = frm + 1; // save "new" value
// We need to save indexes for both region and barrier frames. We'll use loc->reserved_2
// field but put region index to the low two bytes and barrier indexes to the high
// two bytes. It is OK because KMP_MAX_FRAME_DOMAINS = 512.
loc->reserved_2 |= (frm + 1); // save "new" value
// Transform compiler-generated region location into the format
// that the tools more or less standardized on:
// "<func>$omp$parallel@[file:]<line>[:<col>]"
const char * buff = NULL;
kmp_str_loc_t str_loc = __kmp_str_loc_init( loc->psource, 1 );
buff = __kmp_str_format("%s$omp$parallel@%s:%d:%d",
str_loc.func, str_loc.file,
buff = __kmp_str_format("%s$omp$parallel:%d@%s:%d:%d",
str_loc.func, team_size, str_loc.file,
str_loc.line, str_loc.col);
__kmp_str_loc_free( &str_loc );
__itt_suppress_push(__itt_suppress_memory_errors);
__kmp_itt_domains[ frm ] = __itt_domain_create( buff );
__kmp_itt_region_domains[ frm ] = __itt_domain_create( buff );
__itt_suppress_pop();
__kmp_str_free( &buff );
__itt_frame_begin_v3(__kmp_itt_domains[ frm ], NULL);
if( barriers ) {
if (__kmp_barrier_domain_count < KMP_MAX_FRAME_DOMAINS) {
int frm = KMP_TEST_THEN_INC32( & __kmp_barrier_domain_count ); // get "old" value
if (frm >= KMP_MAX_FRAME_DOMAINS) {
KMP_TEST_THEN_DEC32( & __kmp_barrier_domain_count ); // revert the count
return; // loc->reserved_2 is still 0
}
const char * buff = NULL;
buff = __kmp_str_format("%s$omp$barrier@%s:%d",
str_loc.func, str_loc.file, str_loc.col);
__itt_suppress_push(__itt_suppress_memory_errors);
__kmp_itt_barrier_domains[ frm ] = __itt_domain_create( buff );
__itt_suppress_pop();
__kmp_str_free( &buff );
// Save the barrier frame index to the high two bytes.
loc->reserved_2 |= (frm + 1) << 16;
}
}
__kmp_str_loc_free( &str_loc );
__itt_frame_begin_v3(__kmp_itt_region_domains[ frm ], NULL);
}
} else { // Region domain exists for this location
// Check if team size was changed. Then create new region domain for this location
int frm = (loc->reserved_2 & 0x0000FFFF) - 1;
if( __kmp_itt_region_team_size[frm] != team_size ) {
const char * buff = NULL;
kmp_str_loc_t str_loc = __kmp_str_loc_init( loc->psource, 1 );
buff = __kmp_str_format("%s$omp$parallel:%d@%s:%d:%d",
str_loc.func, team_size, str_loc.file,
str_loc.line, str_loc.col);
__itt_suppress_push(__itt_suppress_memory_errors);
__kmp_itt_region_domains[ frm ] = __itt_domain_create( buff );
__itt_suppress_pop();
__kmp_str_free( &buff );
__kmp_str_loc_free( &str_loc );
__kmp_itt_region_team_size[frm] = team_size;
__itt_frame_begin_v3(__kmp_itt_region_domains[frm], NULL);
} else { // Team size was not changed. Use existing domain.
__itt_frame_begin_v3(__kmp_itt_region_domains[frm], NULL);
}
} else { // if it is not 0 then it should be <= KMP_MAX_FRAME_DOMAINS
__itt_frame_begin_v3(__kmp_itt_domains[loc->reserved_2 - 1], NULL);
}
KMP_ITT_DEBUG_LOCK();
KMP_ITT_DEBUG_PRINT( "[frm beg] gtid=%d, idx=%d, serialized:%d, loc:%p\n",
gtid, loc->reserved_2 - 1, serialized, loc );
KMP_ITT_DEBUG_PRINT( "[frm beg] gtid=%d, idx=%x, serialized:%d, loc:%p\n",
gtid, loc->reserved_2, serialized, loc );
}
#endif
} // __kmp_itt_region_forking
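// A compact restatement of the reserved_2 packing used above; both indexes are
// stored 1-based so that zero keeps meaning "unset", and KMP_MAX_FRAME_DOMAINS
// (512) fits easily in 16 bits:
//
//     loc->reserved_2 |= (region_index + 1);          // low 16 bits
//     loc->reserved_2 |= (barrier_index + 1) << 16;   // high 16 bits
//     int region_frm  = (loc->reserved_2 & 0x0000FFFF) - 1;  // -1 if unset
//     int barrier_frm = (loc->reserved_2 >> 16) - 1;         // -1 if unset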
@ -146,50 +187,207 @@ __kmp_itt_region_forking( int gtid, int serialized ) {
// -------------------------------------------------------------------------------------------------
LINKAGE void
__kmp_itt_frame_submit( int gtid, __itt_timestamp begin, __itt_timestamp end, int imbalance, ident_t * loc ) {
__kmp_itt_frame_submit( int gtid, __itt_timestamp begin, __itt_timestamp end, int imbalance, ident_t * loc, int team_size, int region ) {
#if USE_ITT_NOTIFY
if( region ) {
kmp_team_t * team = __kmp_team_from_gtid( gtid );
int serialized = ( region == 2 ? 1 : 0 );
if (team->t.t_active_level + serialized > 1)
{
// The frame notifications are only supported for the outermost teams.
return;
}
// Check that the region domain has not been created before. Its index is saved in the low two bytes.
if ((loc->reserved_2 & 0x0000FFFF) == 0) {
if (__kmp_region_domain_count < KMP_MAX_FRAME_DOMAINS) {
int frm = KMP_TEST_THEN_INC32( & __kmp_region_domain_count ); // get "old" value
if (frm >= KMP_MAX_FRAME_DOMAINS) {
KMP_TEST_THEN_DEC32( & __kmp_region_domain_count ); // revert the count
return; // loc->reserved_2 is still 0
}
// We need to save indexes for both region and barrier frames. We'll use loc->reserved_2
// field but put region index to the low two bytes and barrier indexes to the high
// two bytes. It is OK because KMP_MAX_FRAME_DOMAINS = 512.
loc->reserved_2 |= (frm + 1); // save "new" value
// Transform compiler-generated region location into the format
// that the tools more or less standardized on:
// "<func>$omp$parallel:team_size@[file:]<line>[:<col>]"
const char * buff = NULL;
kmp_str_loc_t str_loc = __kmp_str_loc_init( loc->psource, 1 );
buff = __kmp_str_format("%s$omp$parallel:%d@%s:%d:%d",
str_loc.func, team_size, str_loc.file,
str_loc.line, str_loc.col);
__itt_suppress_push(__itt_suppress_memory_errors);
__kmp_itt_region_domains[ frm ] = __itt_domain_create( buff );
__itt_suppress_pop();
__kmp_str_free( &buff );
__kmp_str_loc_free( &str_loc );
__kmp_itt_region_team_size[frm] = team_size;
__itt_frame_submit_v3(__kmp_itt_region_domains[ frm ], NULL, begin, end );
}
} else { // Region domain exists for this location
// Check if team size was changed. Then create new region domain for this location
int frm = (loc->reserved_2 & 0x0000FFFF) - 1;
if( __kmp_itt_region_team_size[frm] != team_size ) {
const char * buff = NULL;
kmp_str_loc_t str_loc = __kmp_str_loc_init( loc->psource, 1 );
buff = __kmp_str_format("%s$omp$parallel:%d@%s:%d:%d",
str_loc.func, team_size, str_loc.file,
str_loc.line, str_loc.col);
__itt_suppress_push(__itt_suppress_memory_errors);
__kmp_itt_region_domains[ frm ] = __itt_domain_create( buff );
__itt_suppress_pop();
__kmp_str_free( &buff );
__kmp_str_loc_free( &str_loc );
__kmp_itt_region_team_size[frm] = team_size;
__itt_frame_submit_v3(__kmp_itt_region_domains[ frm ], NULL, begin, end );
} else { // Team size was not changed. Use existing domain.
__itt_frame_submit_v3(__kmp_itt_region_domains[ frm ], NULL, begin, end );
}
}
KMP_ITT_DEBUG_LOCK();
KMP_ITT_DEBUG_PRINT( "[reg sub] gtid=%d, idx=%x, region:%d, loc:%p, beg:%llu, end:%llu\n",
gtid, loc->reserved_2, region, loc, begin, end );
return;
} else { // called for barrier reporting
if (loc) {
if (loc->reserved_2 == 0) {
if (__kmp_frame_domain_count < KMP_MAX_FRAME_DOMAINS) {
int frm = KMP_TEST_THEN_INC32( & __kmp_frame_domain_count ); // get "old" value
if ((loc->reserved_2 & 0xFFFF0000) == 0) {
if (__kmp_barrier_domain_count < KMP_MAX_FRAME_DOMAINS) {
int frm = KMP_TEST_THEN_INC32( & __kmp_barrier_domain_count ); // get "old" value
if (frm >= KMP_MAX_FRAME_DOMAINS) {
KMP_TEST_THEN_DEC32( & __kmp_frame_domain_count ); // revert the count
KMP_TEST_THEN_DEC32( & __kmp_barrier_domain_count ); // revert the count
return; // loc->reserved_2 is still 0
}
// Should it be synchronized? See the comment in __kmp_itt_region_forking
loc->reserved_2 = frm + 1; // save "new" value
// Save the barrier frame index to the high two bytes.
loc->reserved_2 |= (frm + 1) << 16; // save "new" value
// Transform compiler-generated region location into the format
// that the tools more or less standardized on:
// "<func>$omp$frame@[file:]<line>[:<col>]"
const char * buff = NULL;
kmp_str_loc_t str_loc = __kmp_str_loc_init( loc->psource, 1 );
if( imbalance ) {
buff = __kmp_str_format("%s$omp$barrier-imbalance@%s:%d",
str_loc.func, str_loc.file, str_loc.col);
const char * buff_imb = NULL;
buff_imb = __kmp_str_format("%s$omp$barrier-imbalance:%d@%s:%d",
str_loc.func, team_size, str_loc.file, str_loc.col);
__itt_suppress_push(__itt_suppress_memory_errors);
__kmp_itt_imbalance_domains[ frm ] = __itt_domain_create( buff_imb );
__itt_suppress_pop();
__itt_frame_submit_v3(__kmp_itt_imbalance_domains[ frm ], NULL, begin, end );
__kmp_str_free( &buff_imb );
} else {
const char * buff = NULL;
buff = __kmp_str_format("%s$omp$barrier@%s:%d",
str_loc.func, str_loc.file, str_loc.col);
__itt_suppress_push(__itt_suppress_memory_errors);
__kmp_itt_barrier_domains[ frm ] = __itt_domain_create( buff );
__itt_suppress_pop();
__itt_frame_submit_v3(__kmp_itt_barrier_domains[ frm ], NULL, begin, end );
__kmp_str_free( &buff );
}
__kmp_str_loc_free( &str_loc );
__itt_suppress_push(__itt_suppress_memory_errors);
__kmp_itt_domains[ frm ] = __itt_domain_create( buff );
__itt_suppress_pop();
__kmp_str_free( &buff );
__itt_frame_submit_v3(__kmp_itt_domains[ frm ], NULL, begin, end );
}
} else { // if it is not 0 then it should be <= KMP_MAX_FRAME_DOMAINS
__itt_frame_submit_v3(__kmp_itt_domains[loc->reserved_2 - 1], NULL, begin, end );
if( imbalance ) {
__itt_frame_submit_v3(__kmp_itt_imbalance_domains[ (loc->reserved_2 >> 16) - 1 ], NULL, begin, end );
} else {
__itt_frame_submit_v3(__kmp_itt_barrier_domains[(loc->reserved_2 >> 16) - 1], NULL, begin, end );
}
}
KMP_ITT_DEBUG_LOCK();
KMP_ITT_DEBUG_PRINT( "[frm sub] gtid=%d, idx=%x, loc:%p, beg:%llu, end:%llu\n",
gtid, loc->reserved_2, loc, begin, end );
}
}
#endif
} // __kmp_itt_frame_submit
// -------------------------------------------------------------------------------------------------
LINKAGE void
__kmp_itt_metadata_imbalance( int gtid, kmp_uint64 begin, kmp_uint64 end, kmp_uint64 imbalance, kmp_uint64 reduction ) {
#if USE_ITT_NOTIFY
if( metadata_domain == NULL) {
__kmp_acquire_bootstrap_lock( & metadata_lock );
if( metadata_domain == NULL) {
__itt_suppress_push(__itt_suppress_memory_errors);
metadata_domain = __itt_domain_create( "OMP Metadata" );
__itt_suppress_pop();
}
__kmp_release_bootstrap_lock( & metadata_lock );
}
__itt_string_handle * string_handle = __itt_string_handle_create( "omp_metadata_imbalance");
kmp_uint64 imbalance_data[ 4 ];
imbalance_data[ 0 ] = begin;
imbalance_data[ 1 ] = end;
imbalance_data[ 2 ] = imbalance;
imbalance_data[ 3 ] = reduction;
__itt_metadata_add(metadata_domain, __itt_null, string_handle, __itt_metadata_u64, 4, imbalance_data);
#endif
} // __kmp_itt_metadata_imbalance
// -------------------------------------------------------------------------------------------------
LINKAGE void
__kmp_itt_metadata_loop( ident_t * loc, kmp_uint64 sched_type, kmp_uint64 iterations, kmp_uint64 chunk ) {
#if USE_ITT_NOTIFY
if( metadata_domain == NULL) {
__kmp_acquire_bootstrap_lock( & metadata_lock );
if( metadata_domain == NULL) {
__itt_suppress_push(__itt_suppress_memory_errors);
metadata_domain = __itt_domain_create( "OMP Metadata" );
__itt_suppress_pop();
}
__kmp_release_bootstrap_lock( & metadata_lock );
}
__itt_string_handle * string_handle = __itt_string_handle_create( "omp_metadata_loop");
kmp_str_loc_t str_loc = __kmp_str_loc_init( loc->psource, 1 );
kmp_uint64 loop_data[ 5 ];
loop_data[ 0 ] = str_loc.line;
loop_data[ 1 ] = str_loc.col;
loop_data[ 2 ] = sched_type;
loop_data[ 3 ] = iterations;
loop_data[ 4 ] = chunk;
__kmp_str_loc_free( &str_loc );
__itt_metadata_add(metadata_domain, __itt_null, string_handle, __itt_metadata_u64, 5, loop_data);
#endif
} // __kmp_itt_metadata_loop
// -------------------------------------------------------------------------------------------------
LINKAGE void
__kmp_itt_metadata_single( ) {
#if USE_ITT_NOTIFY
if( metadata_domain == NULL) {
__kmp_acquire_bootstrap_lock( & metadata_lock );
if( metadata_domain == NULL) {
__itt_suppress_push(__itt_suppress_memory_errors);
metadata_domain = __itt_domain_create( "OMP Metadata" );
__itt_suppress_pop();
}
__kmp_release_bootstrap_lock( & metadata_lock );
}
__itt_string_handle * string_handle = __itt_string_handle_create( "omp_metadata_single");
__itt_metadata_add(metadata_domain, __itt_null, string_handle, __itt_metadata_u64, 0, NULL);
#endif
} // __kmp_itt_metadata_single
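// All three metadata routines above create the shared "OMP Metadata" domain
// lazily with the same double-checked pattern; in isolation:
//
//     if (metadata_domain == NULL) {                    // unsynchronized fast path
//         __kmp_acquire_bootstrap_lock(&metadata_lock);
//         if (metadata_domain == NULL)                  // re-check under the lock
//             metadata_domain = __itt_domain_create("OMP Metadata");
//         __kmp_release_bootstrap_lock(&metadata_lock);
//     }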
// -------------------------------------------------------------------------------------------------
LINKAGE void
__kmp_itt_region_starting( int gtid ) {
#if USE_ITT_NOTIFY
@ -210,19 +408,21 @@ LINKAGE void
__kmp_itt_region_joined( int gtid, int serialized ) {
#if USE_ITT_NOTIFY
kmp_team_t * team = __kmp_team_from_gtid( gtid );
#if OMP_30_ENABLED
if (team->t.t_active_level + serialized > 1)
#endif
{
// The frame notifications are only supported for the outermost teams.
return;
}
ident_t * loc = __kmp_thread_from_gtid( gtid )->th.th_ident;
if (loc && loc->reserved_2 && loc->reserved_2 <= KMP_MAX_FRAME_DOMAINS) {
KMP_ITT_DEBUG_LOCK();
__itt_frame_end_v3(__kmp_itt_domains[loc->reserved_2 - 1], NULL);
KMP_ITT_DEBUG_PRINT( "[frm end] gtid=%d, idx=%d, serialized:%d, loc:%p\n",
gtid, loc->reserved_2 - 1, serialized, loc );
if (loc && loc->reserved_2)
{
int frm = (loc->reserved_2 & 0x0000FFFF) - 1;
if(frm < KMP_MAX_FRAME_DOMAINS) {
KMP_ITT_DEBUG_LOCK();
__itt_frame_end_v3(__kmp_itt_region_domains[frm], NULL);
KMP_ITT_DEBUG_PRINT( "[frm end] gtid=%d, idx=%x, serialized:%d, loc:%p\n",
gtid, loc->reserved_2, serialized, loc );
}
}
#endif
} // __kmp_itt_region_joined
@ -409,8 +609,6 @@ __kmp_itt_barrier_finished( int gtid, void * object ) {
#endif
} // __kmp_itt_barrier_finished
#if OMP_30_ENABLED
/*
------------------------------------------------------------------------------------------------
Taskwait reporting.
@ -507,8 +705,6 @@ __kmp_itt_task_finished(
// -------------------------------------------------------------------------------------------------
#endif /* OMP_30_ENABLED */
/*
------------------------------------------------------------------------------------------------
Lock reporting.
@ -757,7 +953,11 @@ __kmp_itt_thread_name( int gtid ) {
if ( __itt_thr_name_set_ptr ) {
kmp_str_buf_t name;
__kmp_str_buf_init( & name );
__kmp_str_buf_print( & name, "OMP Worker Thread #%d", gtid );
if( KMP_MASTER_GTID(gtid) ) {
__kmp_str_buf_print( & name, "OMP Master Thread #%d", gtid );
} else {
__kmp_str_buf_print( & name, "OMP Worker Thread #%d", gtid );
}
KMP_ITT_DEBUG_LOCK();
__itt_thr_name_set( name.str, name.used );
KMP_ITT_DEBUG_PRINT( "[thr nam] name( \"%s\")\n", name.str );

[File diff suppressed because it is too large]

--- kmp_lock.h ---

@ -1,7 +1,7 @@
/*
* kmp_lock.h -- lock header file
* $Revision: 42810 $
* $Date: 2013-11-07 12:06:33 -0600 (Thu, 07 Nov 2013) $
* $Revision: 43473 $
* $Date: 2014-09-26 15:02:57 -0500 (Fri, 26 Sep 2014) $
*/
@ -280,16 +280,16 @@ extern void __kmp_destroy_nested_ticket_lock( kmp_ticket_lock_t *lck );
#if KMP_USE_ADAPTIVE_LOCKS
struct kmp_adaptive_lock;
struct kmp_adaptive_lock_info;
typedef struct kmp_adaptive_lock kmp_adaptive_lock_t;
typedef struct kmp_adaptive_lock_info kmp_adaptive_lock_info_t;
#if KMP_DEBUG_ADAPTIVE_LOCKS
struct kmp_adaptive_lock_statistics {
/* So we can get stats from locks that haven't been destroyed. */
kmp_adaptive_lock_t * next;
kmp_adaptive_lock_t * prev;
kmp_adaptive_lock_info_t * next;
kmp_adaptive_lock_info_t * prev;
/* Other statistics */
kmp_uint32 successfulSpeculations;
@ -307,7 +307,7 @@ extern void __kmp_init_speculative_stats();
#endif // KMP_DEBUG_ADAPTIVE_LOCKS
struct kmp_adaptive_lock
struct kmp_adaptive_lock_info
{
/* Values used for adaptivity.
* Although these are accessed from multiple threads we don't access them atomically,
@ -348,10 +348,6 @@ struct kmp_base_queuing_lock {
kmp_int32 depth_locked; // depth locked, for nested locks only
kmp_lock_flags_t flags; // lock specifics, e.g. critical section lock
#if KMP_USE_ADAPTIVE_LOCKS
KMP_ALIGN(CACHE_LINE)
kmp_adaptive_lock_t adaptive; // Information for the speculative adaptive lock
#endif
};
typedef struct kmp_base_queuing_lock kmp_base_queuing_lock_t;
@ -379,6 +375,30 @@ extern void __kmp_release_nested_queuing_lock( kmp_queuing_lock_t *lck, kmp_int3
extern void __kmp_init_nested_queuing_lock( kmp_queuing_lock_t *lck );
extern void __kmp_destroy_nested_queuing_lock( kmp_queuing_lock_t *lck );
#if KMP_USE_ADAPTIVE_LOCKS
// ----------------------------------------------------------------------------
// Adaptive locks.
// ----------------------------------------------------------------------------
struct kmp_base_adaptive_lock {
kmp_base_queuing_lock qlk;
KMP_ALIGN(CACHE_LINE)
kmp_adaptive_lock_info_t adaptive; // Information for the speculative adaptive lock
};
typedef struct kmp_base_adaptive_lock kmp_base_adaptive_lock_t;
union KMP_ALIGN_CACHE kmp_adaptive_lock {
kmp_base_adaptive_lock_t lk;
kmp_lock_pool_t pool;
double lk_align;
char lk_pad[ KMP_PAD(kmp_base_adaptive_lock_t, CACHE_LINE) ];
};
typedef union kmp_adaptive_lock kmp_adaptive_lock_t;
# define GET_QLK_PTR(l) ((kmp_queuing_lock_t *) & (l)->lk.qlk)
#endif // KMP_USE_ADAPTIVE_LOCKS
// ----------------------------------------------------------------------------
// DRDPA ticket locks.
@ -913,7 +933,26 @@ __kmp_set_user_lock_flags( kmp_user_lock_p lck, kmp_lock_flags_t flags )
//
extern void __kmp_set_user_lock_vptrs( kmp_lock_kind_t user_lock_kind );
//
// Macros for binding user lock functions.
//
#define KMP_BIND_USER_LOCK_TEMPLATE(nest, kind, suffix) { \
__kmp_acquire##nest##user_lock_with_checks_ = ( void (*)( kmp_user_lock_p, kmp_int32 ) ) \
__kmp_acquire##nest##kind##_##suffix; \
__kmp_release##nest##user_lock_with_checks_ = ( void (*)( kmp_user_lock_p, kmp_int32 ) ) \
__kmp_release##nest##kind##_##suffix; \
__kmp_test##nest##user_lock_with_checks_ = ( int (*)( kmp_user_lock_p, kmp_int32 ) ) \
__kmp_test##nest##kind##_##suffix; \
__kmp_init##nest##user_lock_with_checks_ = ( void (*)( kmp_user_lock_p ) ) \
__kmp_init##nest##kind##_##suffix; \
__kmp_destroy##nest##user_lock_with_checks_ = ( void (*)( kmp_user_lock_p ) ) \
__kmp_destroy##nest##kind##_##suffix; \
}
#define KMP_BIND_USER_LOCK(kind) KMP_BIND_USER_LOCK_TEMPLATE(_, kind, lock)
#define KMP_BIND_USER_LOCK_WITH_CHECKS(kind) KMP_BIND_USER_LOCK_TEMPLATE(_, kind, lock_with_checks)
#define KMP_BIND_NESTED_USER_LOCK(kind) KMP_BIND_USER_LOCK_TEMPLATE(_nested_, kind, lock)
#define KMP_BIND_NESTED_USER_LOCK_WITH_CHECKS(kind) KMP_BIND_USER_LOCK_TEMPLATE(_nested_, kind, lock_with_checks)
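// For example, KMP_BIND_USER_LOCK(ticket) expands to assignments of the form
// (trimmed to the first two pointers; test/init/destroy follow the same shape):
//
//     __kmp_acquire_user_lock_with_checks_ =
//         (void (*)(kmp_user_lock_p, kmp_int32))__kmp_acquire_ticket_lock;
//     __kmp_release_user_lock_with_checks_ =
//         (void (*)(kmp_user_lock_p, kmp_int32))__kmp_release_ticket_lock;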
// ----------------------------------------------------------------------------
// User lock table & lock allocation

--- kmp_omp.h ---

@ -1,8 +1,8 @@
/*
* kmp_omp.h -- OpenMP definition for kmp_omp_struct_info_t.
* This is for information about runtime library structures.
* $Revision: 42105 $
* $Date: 2013-03-11 14:51:34 -0500 (Mon, 11 Mar 2013) $
* $Revision: 42951 $
* $Date: 2014-01-21 14:41:41 -0600 (Tue, 21 Jan 2014) $
*/

--- kmp_os.h ---

@ -1,7 +1,7 @@
/*
* kmp_os.h -- KPTS runtime header file.
* $Revision: 42820 $
* $Date: 2013-11-13 16:53:44 -0600 (Wed, 13 Nov 2013) $
* $Revision: 43473 $
* $Date: 2014-09-26 15:02:57 -0500 (Fri, 26 Sep 2014) $
*/
@ -69,7 +69,7 @@
#define KMP_OS_LINUX 0
#define KMP_OS_FREEBSD 0
#define KMP_OS_DARWIN 0
#define KMP_OS_WINDOWS 0
#define KMP_OS_CNK 0
#define KMP_OS_UNIX 0 /* disjunction of KMP_OS_LINUX, KMP_OS_DARWIN etc. */
@ -116,6 +116,12 @@
# define KMP_OS_UNIX 1
#endif
#if (KMP_OS_LINUX || KMP_OS_WINDOWS) && !KMP_OS_CNK && !KMP_ARCH_PPC64
# define KMP_AFFINITY_SUPPORTED 1
#else
# define KMP_AFFINITY_SUPPORTED 0
#endif
#if KMP_OS_WINDOWS
# if defined _M_AMD64
# undef KMP_ARCH_X86_64
@ -356,6 +362,8 @@ typedef double kmp_real64;
extern "C" {
#endif // __cplusplus
#define INTERNODE_CACHE_LINE 4096 /* for multi-node systems */
/* Define the default size of the cache line */
#ifndef CACHE_LINE
#define CACHE_LINE 128 /* cache line size in bytes */
@ -366,16 +374,6 @@ extern "C" {
#endif
#endif /* CACHE_LINE */
/* SGI's cache padding improvements using align decl specs (Ver 19) */
#if !defined KMP_PERF_V19
# define KMP_PERF_V19 KMP_ON
#endif
/* SGI's improvements for inline argv (Ver 106) */
#if !defined KMP_PERF_V106
# define KMP_PERF_V106 KMP_ON
#endif
#define KMP_CACHE_PREFETCH(ADDR) /* nothing */
/* Temporary note: if performance testing of this passes, we can remove
@ -383,10 +381,12 @@ extern "C" {
#if KMP_OS_UNIX && defined(__GNUC__)
# define KMP_DO_ALIGN(bytes) __attribute__((aligned(bytes)))
# define KMP_ALIGN_CACHE __attribute__((aligned(CACHE_LINE)))
# define KMP_ALIGN_CACHE_INTERNODE __attribute__((aligned(INTERNODE_CACHE_LINE)))
# define KMP_ALIGN(bytes) __attribute__((aligned(bytes)))
#else
# define KMP_DO_ALIGN(bytes) __declspec( align(bytes) )
# define KMP_ALIGN_CACHE __declspec( align(CACHE_LINE) )
# define KMP_ALIGN_CACHE_INTERNODE __declspec( align(INTERNODE_CACHE_LINE) )
# define KMP_ALIGN(bytes) __declspec( align(bytes) )
#endif
@ -525,7 +525,7 @@ extern kmp_real64 __kmp_xchg_real64( volatile kmp_real64 *p, kmp_real64 v );
# define KMP_XCHG_REAL64(p, v) __kmp_xchg_real64( (p), (v) );
#elif (KMP_ASM_INTRINS && (KMP_OS_LINUX || KMP_OS_FREEBSD || KMP_OS_DARWIN)) || !(KMP_ARCH_X86 || KMP_ARCH_X86_64)
#elif (KMP_ASM_INTRINS && KMP_OS_UNIX) || !(KMP_ARCH_X86 || KMP_ARCH_X86_64)
/* cast p to correct type so that proper intrinsic will be used */
# define KMP_TEST_THEN_INC32(p) __sync_fetch_and_add( (kmp_int32 *)(p), 1 )
@ -654,17 +654,6 @@ extern kmp_real64 __kmp_xchg_real64( volatile kmp_real64 *p, kmp_real64 v );
#endif /* KMP_ASM_INTRINS */
# if !KMP_MIC
//
// no routines for floating addition on MIC
// no intrinsic support for floating addition on UNIX
//
extern kmp_real32 __kmp_test_then_add_real32 ( volatile kmp_real32 *p, kmp_real32 v );
extern kmp_real64 __kmp_test_then_add_real64 ( volatile kmp_real64 *p, kmp_real64 v );
# define KMP_TEST_THEN_ADD_REAL32(p, v) __kmp_test_then_add_real32( (p), (v) )
# define KMP_TEST_THEN_ADD_REAL64(p, v) __kmp_test_then_add_real64( (p), (v) )
# endif
/* ------------- relaxed consistency memory model stuff ------------------ */

[File diff suppressed because it is too large]

--- kmp_sched.c ---

@ -1,7 +1,7 @@
/*
* kmp_sched.c -- static scheduling -- iteration initialization
* $Revision: 42358 $
* $Date: 2013-05-07 13:43:26 -0500 (Tue, 07 May 2013) $
* $Revision: 43457 $
* $Date: 2014-09-17 03:57:22 -0500 (Wed, 17 Sep 2014) $
*/
@ -28,6 +28,8 @@
#include "kmp_i18n.h"
#include "kmp_str.h"
#include "kmp_error.h"
#include "kmp_stats.h"
#include "kmp_itt.h"
// template for type limits
template< typename T >
@ -79,6 +81,7 @@ __kmp_for_static_init(
typename traits_t< T >::signed_t incr,
typename traits_t< T >::signed_t chunk
) {
KMP_COUNT_BLOCK(OMP_FOR_static);
typedef typename traits_t< T >::unsigned_t UT;
typedef typename traits_t< T >::signed_t ST;
/* this all has to be changed back to TID and such.. */
@ -88,6 +91,7 @@ __kmp_for_static_init(
register UT trip_count;
register kmp_team_t *team;
KMP_DEBUG_ASSERT( plastiter && plower && pupper && pstride );
KE_TRACE( 10, ("__kmpc_for_static_init called (%d)\n", global_tid));
#ifdef KMP_DEBUG
{
@ -108,12 +112,12 @@ __kmp_for_static_init(
__kmp_push_workshare( global_tid, ct_pdo, loc );
if ( incr == 0 ) {
__kmp_error_construct( kmp_i18n_msg_CnsLoopIncrZeroProhibited, ct_pdo, loc );
}
}
/* special handling for zero-trip loops */
if ( incr > 0 ? (*pupper < *plower) : (*plower < *pupper) ) {
*plastiter = FALSE;
if( plastiter != NULL )
*plastiter = FALSE;
/* leave pupper and plower set to entire iteration space */
*pstride = incr; /* value should never be used */
// *plower = *pupper - incr; // let compiler bypass the illegal loop (like for(i=1;i<10;i--)) THIS LINE CAUSED shape2F/h_tests_1.f TO HAVE A FAILURE ON A ZERO-TRIP LOOP (lower=1,\
@ -149,7 +153,8 @@ __kmp_for_static_init(
/* determine if "for" loop is an active worksharing construct */
if ( team -> t.t_serialized ) {
/* serialized parallel, each thread executes whole iteration space */
*plastiter = TRUE;
if( plastiter != NULL )
*plastiter = TRUE;
/* leave pupper and plower set to entire iteration space */
*pstride = (incr > 0) ? (*pupper - *plower + 1) : (-(*plower - *pupper + 1));
@ -169,8 +174,9 @@ __kmp_for_static_init(
}
nth = team->t.t_nproc;
if ( nth == 1 ) {
*plastiter = TRUE;
if( plastiter != NULL )
*plastiter = TRUE;
*pstride = (incr > 0) ? (*pupper - *plower + 1) : (-(*plower - *pupper + 1));
#ifdef KMP_DEBUG
{
const char * buff;
@ -192,12 +198,13 @@ __kmp_for_static_init(
} else if (incr == -1) {
trip_count = *plower - *pupper + 1;
} else {
if ( incr > 1 ) {
if ( incr > 1 ) { // the check is needed for unsigned division when incr < 0
trip_count = (*pupper - *plower) / incr + 1;
} else {
trip_count = (*plower - *pupper) / ( -incr ) + 1;
}
}
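/* A worked check of the trip-count arithmetic above:
       lower=0, upper=9, incr=3   ->  (9-0)/3 + 1 = 4 iterations (0,3,6,9)
       lower=9, upper=0, incr=-3  ->  (9-0)/3 + 1 = 4 iterations (9,6,3,0)
   The incr > 1 / else split keeps both division operands non-negative, which
   matters because trip_count has the unsigned type UT. */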
if ( __kmp_env_consistency_check ) {
/* tripcount overflow? */
if ( trip_count == 0 && *pupper != *plower ) {
@ -219,14 +226,16 @@ __kmp_for_static_init(
} else {
*plower = *pupper + incr;
}
*plastiter = ( tid == trip_count - 1 );
if( plastiter != NULL )
*plastiter = ( tid == trip_count - 1 );
} else {
if ( __kmp_static == kmp_sch_static_balanced ) {
register UT small_chunk = trip_count / nth;
register UT extras = trip_count % nth;
*plower += incr * ( tid * small_chunk + ( tid < extras ? tid : extras ) );
*pupper = *plower + small_chunk * incr - ( tid < extras ? 0 : incr );
*plastiter = ( tid == nth - 1 );
if( plastiter != NULL )
*plastiter = ( tid == nth - 1 );
} else {
register T big_chunk_inc_count = ( trip_count/nth +
( ( trip_count % nth ) ? 1 : 0) ) * incr;
@ -238,16 +247,16 @@ __kmp_for_static_init(
*plower += tid * big_chunk_inc_count;
*pupper = *plower + big_chunk_inc_count - incr;
if ( incr > 0 ) {
if ( *pupper < *plower ) {
if( *pupper < *plower )
*pupper = i_maxmin< T >::mx;
}
*plastiter = *plower <= old_upper && *pupper > old_upper - incr;
if( plastiter != NULL )
*plastiter = *plower <= old_upper && *pupper > old_upper - incr;
if ( *pupper > old_upper ) *pupper = old_upper; // tracker C73258
} else {
if ( *pupper > *plower ) {
if( *pupper > *plower )
*pupper = i_maxmin< T >::mn;
}
*plastiter = *plower >= old_upper && *pupper < old_upper - incr;
if( plastiter != NULL )
*plastiter = *plower >= old_upper && *pupper < old_upper - incr;
if ( *pupper < old_upper ) *pupper = old_upper; // tracker C73258
}
}
@ -256,7 +265,7 @@ __kmp_for_static_init(
}
case kmp_sch_static_chunked:
{
register T span;
register ST span;
if ( chunk < 1 ) {
chunk = 1;
}
@ -264,11 +273,8 @@ __kmp_for_static_init(
*pstride = span * nth;
*plower = *plower + (span * tid);
*pupper = *plower + span - incr;
/* TODO: is the following line a bug? Shouldn't it be plastiter instead of *plastiter ? */
if (*plastiter) { /* only calculate this if it was requested */
kmp_int32 lasttid = ((trip_count - 1) / ( UT )chunk) % nth;
*plastiter = (tid == lasttid);
}
if( plastiter != NULL )
*plastiter = (tid == ((trip_count - 1)/( UT )chunk) % nth);
break;
}
default:
@ -276,6 +282,18 @@ __kmp_for_static_init(
break;
}
#if USE_ITT_BUILD
// Report loop metadata
if ( KMP_MASTER_TID(tid) && __itt_metadata_add_ptr && __kmp_forkjoin_frames_mode == 3 ) {
kmp_uint64 cur_chunk = chunk;
// Calculate chunk in case it was not specified; it is specified for kmp_sch_static_chunked
if ( schedtype == kmp_sch_static ) {
cur_chunk = trip_count / nth + ( ( trip_count % nth ) ? 1 : 0);
}
// 0 - "static" schedule
__kmp_itt_metadata_loop(loc, 0, trip_count, cur_chunk);
}
#endif
#ifdef KMP_DEBUG
{
const char * buff;
@ -291,6 +309,355 @@ __kmp_for_static_init(
return;
}
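The incr > 1 check above is not cosmetic: this template is also instantiated for the unsigned entry points (_4u, _8u), where dividing by a negative increment that has been implicitly converted to unsigned would yield a huge bogus trip count. A minimal standalone sketch of the same guarded computation (illustrative only, not part of this change):

    #include <cassert>
    #include <cstdint>

    // Trip count of the inclusive range [lower, upper] with step incr,
    // where U may be an unsigned type but incr is signed.
    template <typename U, typename S>
    U trip_count(U lower, U upper, S incr) {
        assert(incr != 0);
        if (incr == 1)  return upper - lower + 1;
        if (incr == -1) return lower - upper + 1;
        if (incr > 1)   return (upper - lower) / (U)incr + 1;
        return (lower - upper) / (U)(-incr) + 1;  // negate before converting
    }
    // e.g. trip_count<uint32_t>(1u, 10u, 3)  == 4 (iterations 1,4,7,10)
    //      trip_count<uint32_t>(10u, 1u, -3) == 4 (iterations 10,7,4,1)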
template< typename T >
static void
__kmp_dist_for_static_init(
ident_t *loc,
kmp_int32 gtid,
kmp_int32 schedule,
kmp_int32 *plastiter,
T *plower,
T *pupper,
T *pupperDist,
typename traits_t< T >::signed_t *pstride,
typename traits_t< T >::signed_t incr,
typename traits_t< T >::signed_t chunk
) {
KMP_COUNT_BLOCK(OMP_DISTR_FOR_static);
typedef typename traits_t< T >::unsigned_t UT;
typedef typename traits_t< T >::signed_t ST;
register kmp_uint32 tid;
register kmp_uint32 nth;
register kmp_uint32 team_id;
register kmp_uint32 nteams;
register UT trip_count;
register kmp_team_t *team;
kmp_info_t * th;
KMP_DEBUG_ASSERT( plastiter && plower && pupper && pupperDist && pstride );
KE_TRACE( 10, ("__kmpc_dist_for_static_init called (%d)\n", gtid));
#ifdef KMP_DEBUG
{
const char * buff;
// create format specifiers before the debug output
buff = __kmp_str_format(
"__kmpc_dist_for_static_init: T#%%d schedLoop=%%d liter=%%d "\
"iter=(%%%s, %%%s, %%%s) chunk=%%%s signed?<%s>\n",
traits_t< T >::spec, traits_t< T >::spec, traits_t< ST >::spec,
traits_t< ST >::spec, traits_t< T >::spec );
KD_TRACE(100, ( buff, gtid, schedule, *plastiter,
*plower, *pupper, incr, chunk ) );
__kmp_str_free( &buff );
}
#endif
if( __kmp_env_consistency_check ) {
__kmp_push_workshare( gtid, ct_pdo, loc );
if( incr == 0 ) {
__kmp_error_construct( kmp_i18n_msg_CnsLoopIncrZeroProhibited, ct_pdo, loc );
}
if( incr > 0 ? (*pupper < *plower) : (*plower < *pupper) ) {
// The loop is illegal.
// Some zero-trip loops maintained by compiler, e.g.:
// for(i=10;i<0;++i) // lower >= upper - run-time check
// for(i=0;i>10;--i) // lower <= upper - run-time check
// for(i=0;i>10;++i) // incr > 0 - compile-time check
// for(i=10;i<0;--i) // incr < 0 - compile-time check
// Compiler does not check the following illegal loops:
// for(i=0;i<10;i+=incr) // where incr<0
// for(i=10;i>0;i-=incr) // where incr<0
__kmp_error_construct( kmp_i18n_msg_CnsLoopIncrIllegal, ct_pdo, loc );
}
}
tid = __kmp_tid_from_gtid( gtid );
th = __kmp_threads[gtid];
KMP_DEBUG_ASSERT(th->th.th_teams_microtask); // we are in the teams construct
nth = th->th.th_team_nproc;
team = th->th.th_team;
#if OMP_40_ENABLED
nteams = th->th.th_teams_size.nteams;
#endif
team_id = team->t.t_master_tid;
KMP_DEBUG_ASSERT(nteams == team->t.t_parent->t.t_nproc);
// compute global trip count
if( incr == 1 ) {
trip_count = *pupper - *plower + 1;
} else if(incr == -1) {
trip_count = *plower - *pupper + 1;
} else {
trip_count = (ST)(*pupper - *plower) / incr + 1; // cast to signed to cover incr<0 case
}
*pstride = *pupper - *plower; // just in case (can be unused)
if( trip_count <= nteams ) {
KMP_DEBUG_ASSERT(
__kmp_static == kmp_sch_static_greedy || \
__kmp_static == kmp_sch_static_balanced
); // Unknown static scheduling type.
// only the masters of some teams get a single iteration; other threads get nothing

if( team_id < trip_count && tid == 0 ) {
*pupper = *pupperDist = *plower = *plower + team_id * incr;
} else {
*pupperDist = *pupper;
*plower = *pupper + incr; // compiler should skip loop body
}
if( plastiter != NULL )
*plastiter = ( tid == 0 && team_id == trip_count - 1 );
} else {
// Get the team's chunk first (each team gets at most one chunk)
if( __kmp_static == kmp_sch_static_balanced ) {
register UT chunkD = trip_count / nteams;
register UT extras = trip_count % nteams;
*plower += incr * ( team_id * chunkD + ( team_id < extras ? team_id : extras ) );
*pupperDist = *plower + chunkD * incr - ( team_id < extras ? 0 : incr );
if( plastiter != NULL )
*plastiter = ( team_id == nteams - 1 );
} else {
register T chunk_inc_count =
( trip_count / nteams + ( ( trip_count % nteams ) ? 1 : 0) ) * incr;
register T upper = *pupper;
KMP_DEBUG_ASSERT( __kmp_static == kmp_sch_static_greedy );
// Unknown static scheduling type.
*plower += team_id * chunk_inc_count;
*pupperDist = *plower + chunk_inc_count - incr;
// Check/correct bounds if needed
if( incr > 0 ) {
if( *pupperDist < *plower )
*pupperDist = i_maxmin< T >::mx;
if( plastiter != NULL )
*plastiter = *plower <= upper && *pupperDist > upper - incr;
if( *pupperDist > upper )
*pupperDist = upper; // tracker C73258
if( *plower > *pupperDist ) {
*pupper = *pupperDist; // no iterations available for the team
goto end;
}
} else {
if( *pupperDist > *plower )
*pupperDist = i_maxmin< T >::mn;
if( plastiter != NULL )
*plastiter = *plower >= upper && *pupperDist < upper - incr;
if( *pupperDist < upper )
*pupperDist = upper; // tracker C73258
if( *plower < *pupperDist ) {
*pupper = *pupperDist; // no iterations available for the team
goto end;
}
}
}
// Get the parallel loop chunk now (for thread)
// compute trip count for team's chunk
if( incr == 1 ) {
trip_count = *pupperDist - *plower + 1;
} else if(incr == -1) {
trip_count = *plower - *pupperDist + 1;
} else {
trip_count = (ST)(*pupperDist - *plower) / incr + 1;
}
KMP_DEBUG_ASSERT( trip_count );
switch( schedule ) {
case kmp_sch_static:
{
if( trip_count <= nth ) {
KMP_DEBUG_ASSERT(
__kmp_static == kmp_sch_static_greedy || \
__kmp_static == kmp_sch_static_balanced
); // Unknown static scheduling type.
if( tid < trip_count )
*pupper = *plower = *plower + tid * incr;
else
*plower = *pupper + incr; // no iterations available
if( plastiter != NULL )
if( *plastiter != 0 && !( tid == trip_count - 1 ) )
*plastiter = 0;
} else {
if( __kmp_static == kmp_sch_static_balanced ) {
register UT chunkL = trip_count / nth;
register UT extras = trip_count % nth;
*plower += incr * (tid * chunkL + (tid < extras ? tid : extras));
*pupper = *plower + chunkL * incr - (tid < extras ? 0 : incr);
if( plastiter != NULL )
if( *plastiter != 0 && !( tid == nth - 1 ) )
*plastiter = 0;
} else {
register T chunk_inc_count =
( trip_count / nth + ( ( trip_count % nth ) ? 1 : 0) ) * incr;
register T upper = *pupperDist;
KMP_DEBUG_ASSERT( __kmp_static == kmp_sch_static_greedy );
// Unknown static scheduling type.
*plower += tid * chunk_inc_count;
*pupper = *plower + chunk_inc_count - incr;
if( incr > 0 ) {
if( *pupper < *plower )
*pupper = i_maxmin< T >::mx;
if( plastiter != NULL )
if( *plastiter != 0 && !(*plower <= upper && *pupper > upper - incr) )
*plastiter = 0;
if( *pupper > upper )
*pupper = upper;//tracker C73258
} else {
if( *pupper > *plower )
*pupper = i_maxmin< T >::mn;
if( plastiter != NULL )
if( *plastiter != 0 && !(*plower >= upper && *pupper < upper - incr) )
*plastiter = 0;
if( *pupper < upper )
*pupper = upper;//tracker C73258
}
}
}
break;
}
case kmp_sch_static_chunked:
{
register ST span;
if( chunk < 1 )
chunk = 1;
span = chunk * incr;
*pstride = span * nth;
*plower = *plower + (span * tid);
*pupper = *plower + span - incr;
if( plastiter != NULL )
if( *plastiter != 0 && !(tid == ((trip_count - 1) / ( UT )chunk) % nth) )
*plastiter = 0;
break;
}
default:
KMP_ASSERT2( 0, "__kmpc_dist_for_static_init: unknown loop scheduling type" );
break;
}
}
end:;
#ifdef KMP_DEBUG
{
const char * buff;
// create format specifiers before the debug output
buff = __kmp_str_format(
"__kmpc_dist_for_static_init: last=%%d lo=%%%s up=%%%s upDist=%%%s "\
"stride=%%%s signed?<%s>\n",
traits_t< T >::spec, traits_t< T >::spec, traits_t< T >::spec,
traits_t< ST >::spec, traits_t< T >::spec );
KD_TRACE(100, ( buff, *plastiter, *plower, *pupper, *pupperDist, *pstride ) );
__kmp_str_free( &buff );
}
#endif
KE_TRACE( 10, ("__kmpc_dist_for_static_init: T#%d return\n", gtid ) );
return;
}
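The shape of __kmp_dist_for_static_init is easier to see in isolation: the iteration space is first cut into nteams contiguous pieces, then each team's piece is cut again across its nth threads, using the same chunk/extras arithmetic both times. A simplified sketch of the balanced path for incr == 1 (hypothetical helper, not the runtime's code):

    // Balanced split of the inclusive range [*lo, *hi] (incr == 1) into
    // 'count' parts; part 'id' gets floor(trip/count) iterations, and the
    // remainder is spread one extra iteration each over the lowest ids.
    static void balanced_split(unsigned id, unsigned count, long *lo, long *hi) {
        unsigned long trip   = *hi - *lo + 1;
        unsigned long small  = trip / count;
        unsigned long extras = trip % count;
        *lo += id * small + (id < extras ? id : extras);
        *hi  = *lo + small - 1 + (id < extras ? 1 : 0);
    }

    // Two-level use, mirroring the code above:
    //   balanced_split(team_id, nteams, &lo, &hi);  // distribute chunk
    //   balanced_split(tid,     nth,    &lo, &hi);  // this thread's chunk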
template< typename T >
static void
__kmp_team_static_init(
ident_t *loc,
kmp_int32 gtid,
kmp_int32 *p_last,
T *p_lb,
T *p_ub,
typename traits_t< T >::signed_t *p_st,
typename traits_t< T >::signed_t incr,
typename traits_t< T >::signed_t chunk
) {
// The routine returns the first chunk distributed to the team and the
// stride to be used to compute subsequent chunks.
// The last-iteration flag is set for the team that will execute
// the last iteration of the loop.
// The routine is called for dist_schedule(static,chunk) only.
typedef typename traits_t< T >::unsigned_t UT;
typedef typename traits_t< T >::signed_t ST;
kmp_uint32 team_id;
kmp_uint32 nteams;
UT trip_count;
T lower;
T upper;
ST span;
kmp_team_t *team;
kmp_info_t *th;
KMP_DEBUG_ASSERT( p_last && p_lb && p_ub && p_st );
KE_TRACE( 10, ("__kmp_team_static_init called (%d)\n", gtid));
#ifdef KMP_DEBUG
{
const char * buff;
// create format specifiers before the debug output
buff = __kmp_str_format( "__kmp_team_static_init enter: T#%%d liter=%%d "\
"iter=(%%%s, %%%s, %%%s) chunk %%%s; signed?<%s>\n",
traits_t< T >::spec, traits_t< T >::spec, traits_t< ST >::spec,
traits_t< ST >::spec, traits_t< T >::spec );
KD_TRACE(100, ( buff, gtid, *p_last, *p_lb, *p_ub, *p_st, chunk ) );
__kmp_str_free( &buff );
}
#endif
lower = *p_lb;
upper = *p_ub;
if( __kmp_env_consistency_check ) {
if( incr == 0 ) {
__kmp_error_construct( kmp_i18n_msg_CnsLoopIncrZeroProhibited, ct_pdo, loc );
}
if( incr > 0 ? (upper < lower) : (lower < upper) ) {
// The loop is illegal.
// Some zero-trip loops maintained by compiler, e.g.:
// for(i=10;i<0;++i) // lower >= upper - run-time check
// for(i=0;i>10;--i) // lower <= upper - run-time check
// for(i=0;i>10;++i) // incr > 0 - compile-time check
// for(i=10;i<0;--i) // incr < 0 - compile-time check
// Compiler does not check the following illegal loops:
// for(i=0;i<10;i+=incr) // where incr<0
// for(i=10;i>0;i-=incr) // where incr<0
__kmp_error_construct( kmp_i18n_msg_CnsLoopIncrIllegal, ct_pdo, loc );
}
}
th = __kmp_threads[gtid];
KMP_DEBUG_ASSERT(th->th.th_teams_microtask); // we are in the teams construct
team = th->th.th_team;
#if OMP_40_ENABLED
nteams = th->th.th_teams_size.nteams;
#endif
team_id = team->t.t_master_tid;
KMP_DEBUG_ASSERT(nteams == team->t.t_parent->t.t_nproc);
// compute trip count
if( incr == 1 ) {
trip_count = upper - lower + 1;
} else if(incr == -1) {
trip_count = lower - upper + 1;
} else {
trip_count = (ST)(upper - lower) / incr + 1; // cast to signed to cover incr<0 case
}
if( chunk < 1 )
chunk = 1;
span = chunk * incr;
*p_st = span * nteams;
*p_lb = lower + (span * team_id);
*p_ub = *p_lb + span - incr;
if ( p_last != NULL )
*p_last = (team_id == ((trip_count - 1)/(UT)chunk) % nteams);
// Correct upper bound if needed
if( incr > 0 ) {
if( *p_ub < *p_lb ) // overflow?
*p_ub = i_maxmin< T >::mx;
if( *p_ub > upper )
*p_ub = upper; // tracker C73258
} else { // incr < 0
if( *p_ub > *p_lb )
*p_ub = i_maxmin< T >::mn;
if( *p_ub < upper )
*p_ub = upper; // tracker C73258
}
#ifdef KMP_DEBUG
{
const char * buff;
// create format specifiers before the debug output
buff = __kmp_str_format( "__kmp_team_static_init exit: T#%%d team%%u liter=%%d "\
"iter=(%%%s, %%%s, %%%s) chunk %%%s\n",
traits_t< T >::spec, traits_t< T >::spec, traits_t< ST >::spec,
traits_t< ST >::spec );
KD_TRACE(100, ( buff, gtid, team_id, *p_last, *p_lb, *p_ub, *p_st, chunk ) );
__kmp_str_free( &buff );
}
#endif
}
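Note that __kmp_team_static_init hands back only the team's first chunk plus the stride needed to reach its next one; the caller is expected to loop. A hedged sketch of that consumption pattern for incr == 1 (variable names are illustrative):

    // lb/ub describe this team's first chunk; st jumps nteams chunks ahead,
    // exactly as computed above (*p_st = span * nteams).
    long lb = lower + (long)chunk * team_id;
    long st = (long)chunk * nteams;
    for (; lb <= upper; lb += st) {
        long ub = lb + (long)chunk - 1;
        if (ub > upper) ub = upper;        // clamp the final, partial chunk
        for (long i = lb; i <= ub; ++i)
            /* body(i) */;
    }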
//--------------------------------------------------------------------------------------
extern "C" {
@ -310,7 +677,7 @@ Each of the four functions here are identical apart from the argument types.
The functions compute the upper and lower bounds and stride to be used for the set of iterations
to be executed by the current thread from the statically scheduled loop that is described by the
initial values of the bround, stride, increment and chunk size.
initial values of the bounds, stride, increment and chunk size.
@{
*/
@ -362,5 +729,155 @@ __kmpc_for_static_init_8u( ident_t *loc, kmp_int32 gtid, kmp_int32 schedtype, km
@}
*/
/*!
@ingroup WORK_SHARING
@param loc Source code location
@param gtid Global thread id of this thread
@param scheduleD Scheduling type for the distribute
@param scheduleL Scheduling type for the parallel loop
@param plastiter Pointer to the "last iteration" flag
@param plower Pointer to the lower bound
@param pupper Pointer to the upper bound of loop chunk
@param pupperD Pointer to the upper bound of dist_chunk
@param pstrideD Pointer to the stride for distribute
@param pstrideL Pointer to the stride for parallel loop
@param incr Loop increment
@param chunkD The chunk size for the distribute
@param chunkL The chunk size for the parallel loop
Each of the four functions here is identical apart from the argument types.
The functions compute the upper and lower bounds and strides to be used for the set of iterations
to be executed by the current thread from the statically scheduled loop that is described by the
initial values of the bounds, strides, increment and chunks for parallel loop and distribute
constructs.
@{
*/
void
__kmpc_dist_for_static_init_4(
ident_t *loc, kmp_int32 gtid, kmp_int32 schedule, kmp_int32 *plastiter,
kmp_int32 *plower, kmp_int32 *pupper, kmp_int32 *pupperD,
kmp_int32 *pstride, kmp_int32 incr, kmp_int32 chunk )
{
__kmp_dist_for_static_init< kmp_int32 >(
loc, gtid, schedule, plastiter, plower, pupper, pupperD, pstride, incr, chunk );
}
/*!
See @ref __kmpc_dist_for_static_init_4
*/
void
__kmpc_dist_for_static_init_4u(
ident_t *loc, kmp_int32 gtid, kmp_int32 schedule, kmp_int32 *plastiter,
kmp_uint32 *plower, kmp_uint32 *pupper, kmp_uint32 *pupperD,
kmp_int32 *pstride, kmp_int32 incr, kmp_int32 chunk )
{
__kmp_dist_for_static_init< kmp_uint32 >(
loc, gtid, schedule, plastiter, plower, pupper, pupperD, pstride, incr, chunk );
}
/*!
See @ref __kmpc_dist_for_static_init_4
*/
void
__kmpc_dist_for_static_init_8(
ident_t *loc, kmp_int32 gtid, kmp_int32 schedule, kmp_int32 *plastiter,
kmp_int64 *plower, kmp_int64 *pupper, kmp_int64 *pupperD,
kmp_int64 *pstride, kmp_int64 incr, kmp_int64 chunk )
{
__kmp_dist_for_static_init< kmp_int64 >(
loc, gtid, schedule, plastiter, plower, pupper, pupperD, pstride, incr, chunk );
}
/*!
See @ref __kmpc_dist_for_static_init_4
*/
void
__kmpc_dist_for_static_init_8u(
ident_t *loc, kmp_int32 gtid, kmp_int32 schedule, kmp_int32 *plastiter,
kmp_uint64 *plower, kmp_uint64 *pupper, kmp_uint64 *pupperD,
kmp_int64 *pstride, kmp_int64 incr, kmp_int64 chunk )
{
__kmp_dist_for_static_init< kmp_uint64 >(
loc, gtid, schedule, plastiter, plower, pupper, pupperD, pstride, incr, chunk );
}
/*!
@}
*/
//-----------------------------------------------------------------------------------------
// Auxiliary routines for Distribute Parallel Loop construct implementation
// Transfer call to template< type T >
// __kmp_team_static_init( ident_t *loc, int gtid,
// int *p_last, T *lb, T *ub, ST *st, ST incr, ST chunk )
/*!
@ingroup WORK_SHARING
@{
@param loc Source location
@param gtid Global thread id
@param p_last pointer to last iteration flag
@param p_lb pointer to Lower bound
@param p_ub pointer to Upper bound
@param p_st Step (or increment if you prefer)
@param incr Loop increment
@param chunk The chunk size to block with
The functions compute the upper and lower bounds and stride to be used for the set of iterations
to be executed by the current team from the statically scheduled loop that is described by the
initial values of the bounds, stride, increment and chunk for the distribute construct as part of
the composite distribute parallel loop construct.
These functions are all identical apart from the types of the arguments.
*/
void
__kmpc_team_static_init_4(
ident_t *loc, kmp_int32 gtid, kmp_int32 *p_last,
kmp_int32 *p_lb, kmp_int32 *p_ub, kmp_int32 *p_st, kmp_int32 incr, kmp_int32 chunk )
{
KMP_DEBUG_ASSERT( __kmp_init_serial );
__kmp_team_static_init< kmp_int32 >( loc, gtid, p_last, p_lb, p_ub, p_st, incr, chunk );
}
/*!
See @ref __kmpc_team_static_init_4
*/
void
__kmpc_team_static_init_4u(
ident_t *loc, kmp_int32 gtid, kmp_int32 *p_last,
kmp_uint32 *p_lb, kmp_uint32 *p_ub, kmp_int32 *p_st, kmp_int32 incr, kmp_int32 chunk )
{
KMP_DEBUG_ASSERT( __kmp_init_serial );
__kmp_team_static_init< kmp_uint32 >( loc, gtid, p_last, p_lb, p_ub, p_st, incr, chunk );
}
/*!
See @ref __kmpc_team_static_init_4
*/
void
__kmpc_team_static_init_8(
ident_t *loc, kmp_int32 gtid, kmp_int32 *p_last,
kmp_int64 *p_lb, kmp_int64 *p_ub, kmp_int64 *p_st, kmp_int64 incr, kmp_int64 chunk )
{
KMP_DEBUG_ASSERT( __kmp_init_serial );
__kmp_team_static_init< kmp_int64 >( loc, gtid, p_last, p_lb, p_ub, p_st, incr, chunk );
}
/*!
See @ref __kmpc_team_static_init_4
*/
void
__kmpc_team_static_init_8u(
ident_t *loc, kmp_int32 gtid, kmp_int32 *p_last,
kmp_uint64 *p_lb, kmp_uint64 *p_ub, kmp_int64 *p_st, kmp_int64 incr, kmp_int64 chunk )
{
KMP_DEBUG_ASSERT( __kmp_init_serial );
__kmp_team_static_init< kmp_uint64 >( loc, gtid, p_last, p_lb, p_ub, p_st, incr, chunk );
}
/*!
@}
*/
} // extern "C"
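For reference, an outlined schedule(static) loop over 32-bit iterators would drive the _4 entry point roughly as follows. This is a sketch of the calling convention the wrappers above define, not literal compiler output; loc, gtid and body() are assumed to be in scope:

    kmp_int32 last = 0, lo = 0, hi = 99, stride = 1;
    __kmpc_for_static_init_4(loc, gtid, kmp_sch_static,
                             &last, &lo, &hi, &stride,
                             /* incr = */ 1, /* chunk = */ 1);
    for (kmp_int32 i = lo; i <= hi; ++i)
        body(i);                        // this thread's share of 0..99
    __kmpc_for_static_fini(loc, gtid);  // close the worksharing region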


@ -1,7 +1,7 @@
/*
* kmp_settings.c -- Initialize environment variables
* $Revision: 42816 $
* $Date: 2013-11-11 15:33:37 -0600 (Mon, 11 Nov 2013) $
* $Revision: 43473 $
* $Date: 2014-09-26 15:02:57 -0500 (Fri, 26 Sep 2014) $
*/
@ -534,9 +534,9 @@ __kmp_stg_parse_file(
* out = __kmp_str_format( "%s", buffer );
} // __kmp_stg_parse_file
#ifdef KMP_DEBUG
static char * par_range_to_print = NULL;
#ifdef KMP_DEBUG
static void
__kmp_stg_parse_par_range(
char const * name,
@ -944,6 +944,26 @@ __kmp_stg_print_settings( kmp_str_buf_t * buffer, char const * name, void * data
__kmp_stg_print_bool( buffer, name, __kmp_settings );
} // __kmp_stg_print_settings
// -------------------------------------------------------------------------------------------------
// KMP_STACKPAD
// -------------------------------------------------------------------------------------------------
static void
__kmp_stg_parse_stackpad( char const * name, char const * value, void * data ) {
__kmp_stg_parse_int(
name, // Env var name
value, // Env var value
KMP_MIN_STKPADDING, // Min value
KMP_MAX_STKPADDING, // Max value
& __kmp_stkpadding // Var to initialize
);
} // __kmp_stg_parse_stackpad
static void
__kmp_stg_print_stackpad( kmp_str_buf_t * buffer, char const * name, void * data ) {
__kmp_stg_print_int( buffer, name, __kmp_stkpadding );
} // __kmp_stg_print_stackpad
// -------------------------------------------------------------------------------------------------
// KMP_STACKOFFSET
// -------------------------------------------------------------------------------------------------
@ -1229,7 +1249,6 @@ __kmp_stg_print_num_threads( kmp_str_buf_t * buffer, char const * name, void * d
// OpenMP 3.0: KMP_TASKING, OMP_MAX_ACTIVE_LEVELS,
// -------------------------------------------------------------------------------------------------
#if OMP_30_ENABLED
static void
__kmp_stg_parse_tasking( char const * name, char const * value, void * data ) {
__kmp_stg_parse_int( name, value, 0, (int)tskm_max, (int *)&__kmp_tasking_mode );
@ -1259,7 +1278,41 @@ static void
__kmp_stg_print_max_active_levels( kmp_str_buf_t * buffer, char const * name, void * data ) {
__kmp_stg_print_int( buffer, name, __kmp_dflt_max_active_levels );
} // __kmp_stg_print_max_active_levels
#endif // OMP_30_ENABLED
#if KMP_NESTED_HOT_TEAMS
// -------------------------------------------------------------------------------------------------
// KMP_HOT_TEAMS_MAX_LEVEL, KMP_HOT_TEAMS_MODE
// -------------------------------------------------------------------------------------------------
static void
__kmp_stg_parse_hot_teams_level( char const * name, char const * value, void * data ) {
if ( TCR_4(__kmp_init_parallel) ) {
KMP_WARNING( EnvParallelWarn, name );
return;
} // read value before first parallel only
__kmp_stg_parse_int( name, value, 0, KMP_MAX_ACTIVE_LEVELS_LIMIT, & __kmp_hot_teams_max_level );
} // __kmp_stg_parse_hot_teams_level
static void
__kmp_stg_print_hot_teams_level( kmp_str_buf_t * buffer, char const * name, void * data ) {
__kmp_stg_print_int( buffer, name, __kmp_hot_teams_max_level );
} // __kmp_stg_print_hot_teams_level
static void
__kmp_stg_parse_hot_teams_mode( char const * name, char const * value, void * data ) {
if ( TCR_4(__kmp_init_parallel) ) {
KMP_WARNING( EnvParallelWarn, name );
return;
} // read value before first parallel only
__kmp_stg_parse_int( name, value, 0, KMP_MAX_ACTIVE_LEVELS_LIMIT, & __kmp_hot_teams_mode );
} // __kmp_stg_parse_hot_teams_mode
static void
__kmp_stg_print_hot_teams_mode( kmp_str_buf_t * buffer, char const * name, void * data ) {
__kmp_stg_print_int( buffer, name, __kmp_hot_teams_mode );
} // __kmp_stg_print_hot_teams_mode
#endif // KMP_NESTED_HOT_TEAMS
// -------------------------------------------------------------------------------------------------
// KMP_HANDLE_SIGNALS
@ -1438,12 +1491,10 @@ __kmp_stg_parse_barrier_branch_bit( char const * name, char const * value, void
const char *var;
/* ---------- Barrier branch bit control ------------ */
for ( int i=bs_plain_barrier; i<bs_last_barrier; i++ ) {
var = __kmp_barrier_branch_bit_env_name[ i ];
if ( ( strcmp( var, name) == 0 ) && ( value != 0 ) ) {
char *comma;
char *comma;
comma = (char *) strchr( value, ',' );
__kmp_barrier_gather_branch_bits[ i ] = ( kmp_uint32 ) __kmp_str_to_int( value, ',' );
@ -1455,7 +1506,6 @@ __kmp_stg_parse_barrier_branch_bit( char const * name, char const * value, void
if ( __kmp_barrier_release_branch_bits[ i ] > KMP_MAX_BRANCH_BITS ) {
__kmp_msg( kmp_ms_warning, KMP_MSG( BarrReleaseValueInvalid, name, comma + 1 ), __kmp_msg_null );
__kmp_barrier_release_branch_bits[ i ] = __kmp_barrier_release_bb_dflt;
}
}
@ -2037,11 +2087,6 @@ __kmp_parse_affinity_env( char const * name, char const * value,
# if OMP_40_ENABLED
KMP_DEBUG_ASSERT( ( __kmp_nested_proc_bind.bind_types != NULL )
&& ( __kmp_nested_proc_bind.used > 0 ) );
if ( ( __kmp_affinity_notype != NULL )
&& ( ( __kmp_nested_proc_bind.bind_types[0] == proc_bind_default )
|| ( __kmp_nested_proc_bind.bind_types[0] == proc_bind_intel ) ) ) {
type = TRUE;
}
# endif
while ( *buf != '\0' ) {
@ -2049,29 +2094,53 @@ __kmp_parse_affinity_env( char const * name, char const * value,
if (__kmp_match_str("none", buf, (const char **)&next)) {
set_type( affinity_none );
# if OMP_40_ENABLED
__kmp_nested_proc_bind.bind_types[0] = proc_bind_false;
# endif
buf = next;
} else if (__kmp_match_str("scatter", buf, (const char **)&next)) {
set_type( affinity_scatter );
# if OMP_40_ENABLED
__kmp_nested_proc_bind.bind_types[0] = proc_bind_intel;
# endif
buf = next;
} else if (__kmp_match_str("compact", buf, (const char **)&next)) {
set_type( affinity_compact );
# if OMP_40_ENABLED
__kmp_nested_proc_bind.bind_types[0] = proc_bind_intel;
# endif
buf = next;
} else if (__kmp_match_str("logical", buf, (const char **)&next)) {
set_type( affinity_logical );
# if OMP_40_ENABLED
__kmp_nested_proc_bind.bind_types[0] = proc_bind_intel;
# endif
buf = next;
} else if (__kmp_match_str("physical", buf, (const char **)&next)) {
set_type( affinity_physical );
# if OMP_40_ENABLED
__kmp_nested_proc_bind.bind_types[0] = proc_bind_intel;
# endif
buf = next;
} else if (__kmp_match_str("explicit", buf, (const char **)&next)) {
set_type( affinity_explicit );
# if OMP_40_ENABLED
__kmp_nested_proc_bind.bind_types[0] = proc_bind_intel;
# endif
buf = next;
# if KMP_MIC
} else if (__kmp_match_str("balanced", buf, (const char **)&next)) {
set_type( affinity_balanced );
# if OMP_40_ENABLED
__kmp_nested_proc_bind.bind_types[0] = proc_bind_intel;
# endif
buf = next;
# endif
} else if (__kmp_match_str("disabled", buf, (const char **)&next)) {
set_type( affinity_disabled );
# if OMP_40_ENABLED
__kmp_nested_proc_bind.bind_types[0] = proc_bind_false;
# endif
buf = next;
} else if (__kmp_match_str("verbose", buf, (const char **)&next)) {
set_verbose( TRUE );
@ -2451,6 +2520,9 @@ __kmp_stg_parse_gomp_cpu_affinity( char const * name, char const * value, void *
__kmp_affinity_proclist = temp_proclist;
__kmp_affinity_type = affinity_explicit;
__kmp_affinity_gran = affinity_gran_fine;
# if OMP_40_ENABLED
__kmp_nested_proc_bind.bind_types[0] = proc_bind_intel;
# endif
}
else {
KMP_WARNING( AffSyntaxError, name );
@ -2772,6 +2844,21 @@ __kmp_stg_parse_places( char const * name, char const * value, void * data )
const char *scan = value;
const char *next = scan;
const char *kind = "\"threads\"";
kmp_setting_t **rivals = (kmp_setting_t **) data;
int rc;
rc = __kmp_stg_check_rivals( name, value, rivals );
if ( rc ) {
return;
}
//
// If OMP_PROC_BIND is not specified but OMP_PLACES is,
// then let OMP_PROC_BIND default to true.
//
if ( __kmp_nested_proc_bind.bind_types[0] == proc_bind_default ) {
__kmp_nested_proc_bind.bind_types[0] = proc_bind_true;
}
//__kmp_affinity_num_places = 0;
@ -2805,10 +2892,17 @@ __kmp_stg_parse_places( char const * name, char const * value, void * data )
__kmp_affinity_type = affinity_explicit;
__kmp_affinity_gran = affinity_gran_fine;
__kmp_affinity_dups = FALSE;
if ( __kmp_nested_proc_bind.bind_types[0] == proc_bind_default ) {
__kmp_nested_proc_bind.bind_types[0] = proc_bind_true;
}
}
return;
}
if ( __kmp_nested_proc_bind.bind_types[0] == proc_bind_default ) {
__kmp_nested_proc_bind.bind_types[0] = proc_bind_true;
}
SKIP_WS(scan);
if ( *scan == '\0' ) {
return;
@ -2855,8 +2949,7 @@ __kmp_stg_print_places( kmp_str_buf_t * buffer, char const * name,
}
if ( ( __kmp_nested_proc_bind.used == 0 )
|| ( __kmp_nested_proc_bind.bind_types == NULL )
|| ( __kmp_nested_proc_bind.bind_types[0] == proc_bind_false )
|| ( __kmp_nested_proc_bind.bind_types[0] == proc_bind_intel ) ) {
|| ( __kmp_nested_proc_bind.bind_types[0] == proc_bind_false ) ) {
__kmp_str_buf_print( buffer, ": %s\n", KMP_I18N_STR( NotDefined ) );
}
else if ( __kmp_affinity_type == affinity_explicit ) {
@ -2913,7 +3006,7 @@ __kmp_stg_print_places( kmp_str_buf_t * buffer, char const * name,
# endif /* OMP_40_ENABLED */
# if OMP_30_ENABLED && (! OMP_40_ENABLED)
# if (! OMP_40_ENABLED)
static void
__kmp_stg_parse_proc_bind( char const * name, char const * value, void * data )
@ -2943,7 +3036,7 @@ __kmp_stg_parse_proc_bind( char const * name, char const * value, void * data )
}
} // __kmp_parse_proc_bind
# endif /* if OMP_30_ENABLED && (! OMP_40_ENABLED) */
# endif /* if (! OMP_40_ENABLED) */
static void
@ -3132,11 +3225,7 @@ __kmp_stg_parse_proc_bind( char const * name, char const * value, void * data )
buf = next;
SKIP_WS( buf );
__kmp_nested_proc_bind.used = 1;
//
// "true" currently maps to "spread"
//
__kmp_nested_proc_bind.bind_types[0] = proc_bind_spread;
__kmp_nested_proc_bind.bind_types[0] = proc_bind_true;
}
else {
//
@ -3454,7 +3543,7 @@ __kmp_stg_parse_schedule( char const * name, char const * value, void * data ) {
KMP_WARNING( InvalidClause, name, value );
} else
KMP_WARNING( EmptyClause, name );
} while ( value = semicolon ? semicolon + 1 : NULL );
} while ( (value = semicolon ? semicolon + 1 : NULL) );
}
}; // if
@ -3499,7 +3588,6 @@ __kmp_stg_parse_omp_schedule( char const * name, char const * value, void * data
else if (!__kmp_strcasecmp_with_sentinel("guided", value, ',')) /* GUIDED */
__kmp_sched = kmp_sch_guided_chunked;
// AC: TODO: add AUTO schedule, and probably remove TRAPEZOIDAL (OMP 3.0 does not allow it)
#if OMP_30_ENABLED
else if (!__kmp_strcasecmp_with_sentinel("auto", value, ',')) { /* AUTO */
__kmp_sched = kmp_sch_auto;
if( comma ) {
@ -3507,7 +3595,6 @@ __kmp_stg_parse_omp_schedule( char const * name, char const * value, void * data
comma = NULL;
}
}
#endif // OMP_30_ENABLED
else if (!__kmp_strcasecmp_with_sentinel("trapezoidal", value, ',')) /* TRAPEZOIDAL */
__kmp_sched = kmp_sch_trapezoidal;
else if (!__kmp_strcasecmp_with_sentinel("static", value, ',')) /* STATIC */
@ -4016,7 +4103,7 @@ __kmp_stg_parse_adaptive_lock_props( const char *name, const char *value, void *
break;
}
// Next character is neither a digit nor a comma, OR the number of values > 2 => end of list
if ( ( ( *next < '0' ) || ( *next > '9' ) ) && ( *next !=',') || ( total > 2 ) ) {
if ( ( ( *next < '0' || *next > '9' ) && *next !=',' ) || total > 2 ) {
KMP_WARNING( EnvSyntaxError, name, value );
return;
}
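This rewrite does not change the grouping, since && already binds tighter than ||; it spells the parentheses out so the intent is visible and compilers stop warning about mixed &&/||. The precedence rule in miniature:

    #include <cstdio>
    int main() {
        bool a = false, b = true, c = true;
        // '&&' binds tighter than '||': both tests parse as a || (b && c).
        std::printf("%d\n", a || b && c);    // may draw -Wparentheses
        std::printf("%d\n", a || (b && c));  // explicit, warning-free
        return 0;
    }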
@ -4314,6 +4401,10 @@ __kmp_stg_print_omp_display_env( kmp_str_buf_t * buffer, char const * name, void
static void
__kmp_stg_parse_omp_cancellation( char const * name, char const * value, void * data ) {
if ( TCR_4(__kmp_init_parallel) ) {
KMP_WARNING( EnvParallelWarn, name );
return;
} // read value before first parallel only
__kmp_stg_parse_bool( name, value, & __kmp_omp_cancellation );
} // __kmp_stg_parse_omp_cancellation
@ -4340,6 +4431,7 @@ static kmp_setting_t __kmp_stg_table[] = {
{ "KMP_SETTINGS", __kmp_stg_parse_settings, __kmp_stg_print_settings, NULL, 0, 0 },
{ "KMP_STACKOFFSET", __kmp_stg_parse_stackoffset, __kmp_stg_print_stackoffset, NULL, 0, 0 },
{ "KMP_STACKSIZE", __kmp_stg_parse_stacksize, __kmp_stg_print_stacksize, NULL, 0, 0 },
{ "KMP_STACKPAD", __kmp_stg_parse_stackpad, __kmp_stg_print_stackpad, NULL, 0, 0 },
{ "KMP_VERSION", __kmp_stg_parse_version, __kmp_stg_print_version, NULL, 0, 0 },
{ "KMP_WARNINGS", __kmp_stg_parse_warnings, __kmp_stg_print_warnings, NULL, 0, 0 },
@ -4347,13 +4439,15 @@ static kmp_setting_t __kmp_stg_table[] = {
{ "OMP_NUM_THREADS", __kmp_stg_parse_num_threads, __kmp_stg_print_num_threads, NULL, 0, 0 },
{ "OMP_STACKSIZE", __kmp_stg_parse_stacksize, __kmp_stg_print_stacksize, NULL, 0, 0 },
#if OMP_30_ENABLED
{ "KMP_TASKING", __kmp_stg_parse_tasking, __kmp_stg_print_tasking, NULL, 0, 0 },
{ "KMP_TASK_STEALING_CONSTRAINT", __kmp_stg_parse_task_stealing, __kmp_stg_print_task_stealing, NULL, 0, 0 },
{ "OMP_MAX_ACTIVE_LEVELS", __kmp_stg_parse_max_active_levels, __kmp_stg_print_max_active_levels, NULL, 0, 0 },
{ "OMP_THREAD_LIMIT", __kmp_stg_parse_all_threads, __kmp_stg_print_all_threads, NULL, 0, 0 },
{ "OMP_WAIT_POLICY", __kmp_stg_parse_wait_policy, __kmp_stg_print_wait_policy, NULL, 0, 0 },
#endif // OMP_30_ENABLED
#if KMP_NESTED_HOT_TEAMS
{ "KMP_HOT_TEAMS_MAX_LEVEL", __kmp_stg_parse_hot_teams_level, __kmp_stg_print_hot_teams_level, NULL, 0, 0 },
{ "KMP_HOT_TEAMS_MODE", __kmp_stg_parse_hot_teams_mode, __kmp_stg_print_hot_teams_mode, NULL, 0, 0 },
#endif // KMP_NESTED_HOT_TEAMS
#if KMP_HANDLE_SIGNALS
{ "KMP_HANDLE_SIGNALS", __kmp_stg_parse_handle_signals, __kmp_stg_print_handle_signals, NULL, 0, 0 },
@ -4411,18 +4505,16 @@ static kmp_setting_t __kmp_stg_table[] = {
# ifdef KMP_GOMP_COMPAT
{ "GOMP_CPU_AFFINITY", __kmp_stg_parse_gomp_cpu_affinity, NULL, /* no print */ NULL, 0, 0 },
# endif /* KMP_GOMP_COMPAT */
# if OMP_30_ENABLED
# if OMP_40_ENABLED
# if OMP_40_ENABLED
{ "OMP_PROC_BIND", __kmp_stg_parse_proc_bind, __kmp_stg_print_proc_bind, NULL, 0, 0 },
{ "OMP_PLACES", __kmp_stg_parse_places, __kmp_stg_print_places, NULL, 0, 0 },
# else
# else
{ "OMP_PROC_BIND", __kmp_stg_parse_proc_bind, NULL, /* no print */ NULL, 0, 0 },
# endif /* OMP_40_ENABLED */
# endif /* OMP_30_ENABLED */
# endif /* OMP_40_ENABLED */
{ "KMP_TOPOLOGY_METHOD", __kmp_stg_parse_topology_method, __kmp_stg_print_topology_method, NULL, 0, 0 },
#elif !KMP_AFFINITY_SUPPORTED
#else
//
// KMP_AFFINITY is not supported on OS X*, nor is OMP_PLACES.
@ -4432,8 +4524,6 @@ static kmp_setting_t __kmp_stg_table[] = {
{ "OMP_PROC_BIND", __kmp_stg_parse_proc_bind, __kmp_stg_print_proc_bind, NULL, 0, 0 },
# endif
#else
#error "Unknown or unsupported OS"
#endif // KMP_AFFINITY_SUPPORTED
{ "KMP_INIT_AT_FORK", __kmp_stg_parse_init_at_fork, __kmp_stg_print_init_at_fork, NULL, 0, 0 },
@ -4571,7 +4661,6 @@ __kmp_stg_init( void
}
#if OMP_30_ENABLED
{ // Initialize KMP_LIBRARY and OMP_WAIT_POLICY data.
kmp_setting_t * kmp_library = __kmp_stg_find( "KMP_LIBRARY" ); // 1st priority.
@ -4595,21 +4684,12 @@ __kmp_stg_init( void
}; // if
}
#else
{
kmp_setting_t * kmp_library = __kmp_stg_find( "KMP_LIBRARY" );
static kmp_stg_wp_data_t kmp_data = { 0, NULL };
kmp_library->data = & kmp_data;
}
#endif /* OMP_30_ENABLED */
{ // Initialize KMP_ALL_THREADS, KMP_MAX_THREADS, and OMP_THREAD_LIMIT data.
kmp_setting_t * kmp_all_threads = __kmp_stg_find( "KMP_ALL_THREADS" ); // 1st priority.
kmp_setting_t * kmp_max_threads = __kmp_stg_find( "KMP_MAX_THREADS" ); // 2nd priority.
#if OMP_30_ENABLED
kmp_setting_t * omp_thread_limit = __kmp_stg_find( "OMP_THREAD_LIMIT" ); // 3rd priority.
#endif
// !!! volatile keyword is Intel (R) C Compiler bug CQ49908 workaround.
static kmp_setting_t * volatile rivals[ 4 ];
@ -4617,20 +4697,16 @@ __kmp_stg_init( void
rivals[ i ++ ] = kmp_all_threads;
rivals[ i ++ ] = kmp_max_threads;
#if OMP_30_ENABLED
if ( omp_thread_limit != NULL ) {
rivals[ i ++ ] = omp_thread_limit;
}; // if
#endif
rivals[ i ++ ] = NULL;
kmp_all_threads->data = (void*)& rivals;
kmp_max_threads->data = (void*)& rivals;
#if OMP_30_ENABLED
if ( omp_thread_limit != NULL ) {
omp_thread_limit->data = (void*)& rivals;
}; // if
#endif
}
@ -4645,18 +4721,11 @@ __kmp_stg_init( void
KMP_DEBUG_ASSERT( gomp_cpu_affinity != NULL );
# endif
# if OMP_30_ENABLED
kmp_setting_t * omp_proc_bind = __kmp_stg_find( "OMP_PROC_BIND" ); // 3rd priority.
KMP_DEBUG_ASSERT( omp_proc_bind != NULL );
# endif
# if OMP_40_ENABLED
kmp_setting_t * omp_places = __kmp_stg_find( "OMP_PLACES" ); // 3rd priority.
KMP_DEBUG_ASSERT( omp_places != NULL );
# endif
// !!! volatile keyword is Intel (R) C Compiler bug CQ49908 workaround.
static kmp_setting_t * volatile rivals[ 5 ];
static kmp_setting_t * volatile rivals[ 4 ];
int i = 0;
rivals[ i ++ ] = kmp_affinity;
@ -4666,23 +4735,30 @@ __kmp_stg_init( void
gomp_cpu_affinity->data = (void*)& rivals;
# endif
# if OMP_30_ENABLED
rivals[ i ++ ] = omp_proc_bind;
omp_proc_bind->data = (void*)& rivals;
# endif
rivals[ i ++ ] = NULL;
# if OMP_40_ENABLED
rivals[ i ++ ] = omp_places;
omp_places->data = (void*)& rivals;
static kmp_setting_t * volatile places_rivals[ 4 ];
i = 0;
kmp_setting_t * omp_places = __kmp_stg_find( "OMP_PLACES" ); // 3rd priority.
KMP_DEBUG_ASSERT( omp_places != NULL );
places_rivals[ i ++ ] = kmp_affinity;
# ifdef KMP_GOMP_COMPAT
places_rivals[ i ++ ] = gomp_cpu_affinity;
# endif
places_rivals[ i ++ ] = omp_places;
omp_places->data = (void*)& places_rivals;
places_rivals[ i ++ ] = NULL;
# endif
rivals[ i ++ ] = NULL;
}
#else
// KMP_AFFINITY not supported, so OMP_PROC_BIND has no rivals.
// OMP_PLACES not supported yet.
#endif
#endif // KMP_AFFINITY_SUPPORTED
{ // Initialize KMP_DETERMINISTIC_REDUCTION and KMP_FORCE_REDUCTION data.
@ -4917,8 +4993,33 @@ __kmp_env_initialize( char const * string ) {
&& ( FIND( aff_str, "disabled" ) == NULL ) ) {
__kmp_affinity_notype = __kmp_stg_find( "KMP_AFFINITY" );
}
else {
//
// A new affinity type is specified.
// Reset the affinity flags to their default values,
// in case this is called from kmp_set_defaults().
//
__kmp_affinity_type = affinity_default;
__kmp_affinity_gran = affinity_gran_default;
__kmp_affinity_top_method = affinity_top_method_default;
__kmp_affinity_respect_mask = affinity_respect_mask_default;
}
# undef FIND
#if OMP_40_ENABLED
//
// Also reset the affinity flags if OMP_PROC_BIND is specified.
//
aff_str = __kmp_env_blk_var( & block, "OMP_PROC_BIND" );
if ( aff_str != NULL ) {
__kmp_affinity_type = affinity_default;
__kmp_affinity_gran = affinity_gran_default;
__kmp_affinity_top_method = affinity_top_method_default;
__kmp_affinity_respect_mask = affinity_respect_mask_default;
}
#endif /* OMP_40_ENABLED */
}
#endif /* KMP_AFFINITY_SUPPORTED */
#if OMP_40_ENABLED
@ -4956,9 +5057,15 @@ __kmp_env_initialize( char const * string ) {
else {
KMP_DEBUG_ASSERT( string != NULL); // kmp_set_defaults() was called
KMP_DEBUG_ASSERT( __kmp_user_lock_kind != lk_default );
__kmp_set_user_lock_vptrs( __kmp_user_lock_kind );
// Binds lock functions again to follow the transition between different
// KMP_CONSISTENCY_CHECK values. Calling this again is harmless as long
// as we do not allow lock kind changes after making a call to any
// user lock functions (true).
}
#if KMP_AFFINITY_SUPPORTED
if ( ! TCR_4(__kmp_init_middle) ) {
//
// Determine if the machine/OS is actually capable of supporting
@ -4984,102 +5091,87 @@ __kmp_env_initialize( char const * string ) {
}
# if OMP_40_ENABLED
if ( __kmp_affinity_type == affinity_disabled ) {
__kmp_nested_proc_bind.bind_types[0] = proc_bind_disabled;
}
else if ( __kmp_nested_proc_bind.bind_types[0] == proc_bind_default ) {
else if ( __kmp_nested_proc_bind.bind_types[0] == proc_bind_true ) {
//
// Where supported the default is to use the KMP_AFFINITY
// mechanism. On OS X* etc. it is none.
// OMP_PROC_BIND=true maps to OMP_PROC_BIND=spread.
//
# if KMP_AFFINITY_SUPPORTED
__kmp_nested_proc_bind.bind_types[0] = proc_bind_intel;
# else
__kmp_nested_proc_bind.bind_types[0] = proc_bind_false;
# endif
__kmp_nested_proc_bind.bind_types[0] = proc_bind_spread;
}
//
// If OMP_PROC_BIND was specified (so we are using OpenMP 4.0 affinity)
// but OMP_PLACES was not, then it defaults to the equivalent of
// KMP_AFFINITY=compact,noduplicates,granularity=fine.
//
if ( __kmp_nested_proc_bind.bind_types[0] == proc_bind_intel ) {
if ( ( __kmp_affinity_type == affinity_none )
# if ! KMP_MIC
|| ( __kmp_affinity_type == affinity_default )
# endif
) {
__kmp_nested_proc_bind.bind_types[0] = proc_bind_false;
}
}
else if ( ( __kmp_nested_proc_bind.bind_types[0] != proc_bind_false )
&& ( __kmp_nested_proc_bind.bind_types[0] != proc_bind_disabled ) ) {
if ( __kmp_affinity_type == affinity_default ) {
__kmp_affinity_type = affinity_compact;
__kmp_affinity_dups = FALSE;
}
if ( __kmp_affinity_gran == affinity_gran_default ) {
__kmp_affinity_gran = affinity_gran_fine;
}
}
# endif // OMP_40_ENABLED
# endif /* OMP_40_ENABLED */
if ( KMP_AFFINITY_CAPABLE() ) {
# if KMP_OS_WINDOWS && KMP_ARCH_X86_64
if ( __kmp_num_proc_groups > 1 ) {
//
// Handle the Win 64 group affinity stuff if there are multiple
// processor groups, or if the user requested it, and OMP 4.0
// affinity is not in effect.
//
if ( ( ( __kmp_num_proc_groups > 1 )
&& ( __kmp_affinity_type == affinity_default )
# if OMP_40_ENABLED
&& ( __kmp_nested_proc_bind.bind_types[0] == proc_bind_default ) )
# endif
|| ( __kmp_affinity_top_method == affinity_top_method_group ) ) {
if ( __kmp_affinity_respect_mask == affinity_respect_mask_default ) {
__kmp_affinity_respect_mask = FALSE;
__kmp_affinity_respect_mask = FALSE;
}
if ( ( __kmp_affinity_type == affinity_default )
|| ( __kmp_affinity_type == affinity_none ) ) {
if ( __kmp_affinity_type == affinity_none ) {
if ( __kmp_affinity_verbose || ( __kmp_affinity_warnings
&& ( __kmp_affinity_type != affinity_none ) ) ) {
KMP_WARNING( AffTypeCantUseMultGroups, "none", "compact" );
}
}
if ( __kmp_affinity_type == affinity_default ) {
__kmp_affinity_type = affinity_compact;
if ( __kmp_affinity_top_method == affinity_top_method_default ) {
__kmp_affinity_top_method = affinity_top_method_group;
# if OMP_40_ENABLED
__kmp_nested_proc_bind.bind_types[0] = proc_bind_intel;
# endif
}
if ( __kmp_affinity_top_method == affinity_top_method_default ) {
if ( __kmp_affinity_gran == affinity_gran_default ) {
__kmp_affinity_top_method = affinity_top_method_group;
__kmp_affinity_gran = affinity_gran_group;
}
else if ( __kmp_affinity_gran == affinity_gran_group ) {
__kmp_affinity_top_method = affinity_top_method_group;
}
else {
__kmp_affinity_top_method = affinity_top_method_all;
}
}
else if ( __kmp_affinity_top_method == affinity_top_method_default ) {
__kmp_affinity_top_method = affinity_top_method_all;
}
if ( __kmp_affinity_gran_levels < 0 ) {
if ( __kmp_affinity_top_method == affinity_top_method_group ) {
if ( __kmp_affinity_gran == affinity_gran_default ) {
__kmp_affinity_gran = affinity_gran_group;
}
else if ( __kmp_affinity_gran == affinity_gran_core ) {
if ( __kmp_affinity_verbose || ( __kmp_affinity_warnings
&& ( __kmp_affinity_type != affinity_none ) ) ) {
KMP_WARNING( AffGranCantUseMultGroups, "core", "thread" );
}
__kmp_affinity_gran = affinity_gran_thread;
}
else if ( __kmp_affinity_gran == affinity_gran_package ) {
if ( __kmp_affinity_verbose || ( __kmp_affinity_warnings
&& ( __kmp_affinity_type != affinity_none ) ) ) {
KMP_WARNING( AffGranCantUseMultGroups, "package", "group" );
}
__kmp_affinity_gran = affinity_gran_group;
}
else if ( __kmp_affinity_gran == affinity_gran_node ) {
if ( __kmp_affinity_verbose || ( __kmp_affinity_warnings
&& ( __kmp_affinity_type != affinity_none ) ) ) {
KMP_WARNING( AffGranCantUseMultGroups, "node", "group" );
}
__kmp_affinity_gran = affinity_gran_group;
}
else if ( __kmp_affinity_top_method == affinity_top_method_group ) {
if ( __kmp_affinity_gran == affinity_gran_default ) {
__kmp_affinity_gran = affinity_gran_group;
}
else if ( __kmp_affinity_gran == affinity_gran_default ) {
else if ( ( __kmp_affinity_gran != affinity_gran_group )
&& ( __kmp_affinity_gran != affinity_gran_fine )
&& ( __kmp_affinity_gran != affinity_gran_thread ) ) {
char *str = NULL;
switch ( __kmp_affinity_gran ) {
case affinity_gran_core: str = "core"; break;
case affinity_gran_package: str = "package"; break;
case affinity_gran_node: str = "node"; break;
default: KMP_DEBUG_ASSERT( 0 );
}
KMP_WARNING( AffGranTopGroup, var, str );
__kmp_affinity_gran = affinity_gran_fine;
}
}
else {
if ( __kmp_affinity_gran == affinity_gran_default ) {
__kmp_affinity_gran = affinity_gran_core;
}
else if ( __kmp_affinity_gran == affinity_gran_group ) {
char *str = NULL;
switch ( __kmp_affinity_type ) {
case affinity_physical: str = "physical"; break;
case affinity_logical: str = "logical"; break;
case affinity_compact: str = "compact"; break;
case affinity_scatter: str = "scatter"; break;
case affinity_explicit: str = "explicit"; break;
// No MIC on windows, so no affinity_balanced case
default: KMP_DEBUG_ASSERT( 0 );
}
KMP_WARNING( AffGranGroupType, var, str );
__kmp_affinity_gran = affinity_gran_core;
}
}
@ -5087,27 +5179,52 @@ __kmp_env_initialize( char const * string ) {
else
# endif /* KMP_OS_WINDOWS && KMP_ARCH_X86_64 */
{
if ( __kmp_affinity_respect_mask == affinity_respect_mask_default ) {
__kmp_affinity_respect_mask = TRUE;
# if KMP_OS_WINDOWS && KMP_ARCH_X86_64
if ( __kmp_num_proc_groups > 1 ) {
__kmp_affinity_respect_mask = FALSE;
}
else
# endif /* KMP_OS_WINDOWS && KMP_ARCH_X86_64 */
{
__kmp_affinity_respect_mask = TRUE;
}
}
# if OMP_40_ENABLED
if ( ( __kmp_nested_proc_bind.bind_types[0] != proc_bind_intel )
&& ( __kmp_nested_proc_bind.bind_types[0] != proc_bind_default ) ) {
if ( __kmp_affinity_type == affinity_default ) {
__kmp_affinity_type = affinity_compact;
__kmp_affinity_dups = FALSE;
}
}
else
# endif /* OMP_40_ENABLED */
if ( __kmp_affinity_type == affinity_default ) {
# if KMP_MIC
__kmp_affinity_type = affinity_scatter;
__kmp_affinity_type = affinity_scatter;
# if OMP_40_ENABLED
__kmp_nested_proc_bind.bind_types[0] = proc_bind_intel;
# endif
# else
__kmp_affinity_type = affinity_none;
__kmp_affinity_type = affinity_none;
# if OMP_40_ENABLED
__kmp_nested_proc_bind.bind_types[0] = proc_bind_false;
# endif
# endif
}
if ( ( __kmp_affinity_gran == affinity_gran_default )
&& ( __kmp_affinity_gran_levels < 0 ) ) {
# if KMP_MIC
__kmp_affinity_gran = affinity_gran_fine;
__kmp_affinity_gran = affinity_gran_fine;
# else
__kmp_affinity_gran = affinity_gran_core;
__kmp_affinity_gran = affinity_gran_core;
# endif
}
if ( __kmp_affinity_top_method == affinity_top_method_default ) {
__kmp_affinity_top_method = affinity_top_method_all;
__kmp_affinity_top_method = affinity_top_method_all;
}
}
}
@ -5164,9 +5281,8 @@ __kmp_env_print() {
char const * name = block.vars[ i ].name;
char const * value = block.vars[ i ].value;
if (
strlen( name ) > 4
&&
( strncmp( name, "KMP_", 4 ) == 0 ) || strncmp( name, "OMP_", 4 ) == 0
( strlen( name ) > 4 && strncmp( name, "KMP_", 4 ) == 0 )
|| strncmp( name, "OMP_", 4 ) == 0
#ifdef KMP_GOMP_COMPAT
|| strncmp( name, "GOMP_", 5 ) == 0
#endif // KMP_GOMP_COMPAT


@ -1,7 +1,7 @@
/*
* kmp_settings.h -- Initialize environment variables
* $Revision: 42598 $
* $Date: 2013-08-19 15:40:56 -0500 (Mon, 19 Aug 2013) $
* $Revision: 42951 $
* $Date: 2014-01-21 14:41:41 -0600 (Tue, 21 Jan 2014) $
*/


@ -0,0 +1,615 @@
/** @file kmp_stats.cpp
* Statistics gathering and processing.
*/
//===----------------------------------------------------------------------===//
//
// The LLVM Compiler Infrastructure
//
// This file is dual licensed under the MIT and the University of Illinois Open
// Source Licenses. See LICENSE.txt for details.
//
//===----------------------------------------------------------------------===//
#if KMP_STATS_ENABLED
#include "kmp.h"
#include "kmp_str.h"
#include "kmp_lock.h"
#include "kmp_stats.h"
#include <algorithm>
#include <sstream>
#include <iomanip>
#include <stdlib.h> // for atexit
#define STRINGIZE2(x) #x
#define STRINGIZE(x) STRINGIZE2(x)
#define expandName(name,flags,ignore) {STRINGIZE(name),flags},
statInfo timeStat::timerInfo[] = {
KMP_FOREACH_TIMER(expandName,0)
{0,0}
};
const statInfo counter::counterInfo[] = {
KMP_FOREACH_COUNTER(expandName,0)
{0,0}
};
#undef expandName
#define expandName(ignore1,ignore2,ignore3) {0.0,0.0,0.0},
kmp_stats_output_module::rgb_color kmp_stats_output_module::timerColorInfo[] = {
KMP_FOREACH_TIMER(expandName,0)
{0.0,0.0,0.0}
};
#undef expandName
const kmp_stats_output_module::rgb_color kmp_stats_output_module::globalColorArray[] = {
{1.0, 0.0, 0.0}, // red
{1.0, 0.6, 0.0}, // orange
{1.0, 1.0, 0.0}, // yellow
{0.0, 1.0, 0.0}, // green
{0.0, 0.0, 1.0}, // blue
{0.6, 0.2, 0.8}, // purple
{1.0, 0.0, 1.0}, // magenta
{0.0, 0.4, 0.2}, // dark green
{1.0, 1.0, 0.6}, // light yellow
{0.6, 0.4, 0.6}, // dirty purple
{0.0, 1.0, 1.0}, // cyan
{1.0, 0.4, 0.8}, // pink
{0.5, 0.5, 0.5}, // grey
{0.8, 0.7, 0.5}, // brown
{0.6, 0.6, 1.0}, // light blue
{1.0, 0.7, 0.5}, // peach
{0.8, 0.5, 1.0}, // lavender
{0.6, 0.0, 0.0}, // dark red
{0.7, 0.6, 0.0}, // gold
{0.0, 0.0, 0.0} // black
};
// Ensure that the atexit handler only runs once.
static uint32_t statsPrinted = 0;
// output interface
static kmp_stats_output_module __kmp_stats_global_output;
/* ****************************************************** */
/* ************* statistic member functions ************* */
void statistic::addSample(double sample)
{
double delta = sample - meanVal;
sampleCount = sampleCount + 1;
meanVal = meanVal + delta/sampleCount;
m2 = m2 + delta*(sample - meanVal);
minVal = std::min(minVal, sample);
maxVal = std::max(maxVal, sample);
}
statistic & statistic::operator+= (const statistic & other)
{
if (sampleCount == 0)
{
*this = other;
return *this;
}
uint64_t newSampleCount = sampleCount + other.sampleCount;
double dnsc = double(newSampleCount);
double dsc = double(sampleCount);
double dscBydnsc = dsc/dnsc;
double dosc = double(other.sampleCount);
double delta = other.meanVal - meanVal;
// Try to order these calculations to avoid overflows.
// If this were Fortran, then the compiler would not be able to re-order over brackets.
// In C++ the compiler may be allowed to do that (we certainly hope it doesn't; The C++ Programming Language,
// 2nd edition, suggests it shouldn't, since associativity may only be exploited when the operation
// really is associative, which floating-point addition isn't...).
meanVal = meanVal*dscBydnsc + other.meanVal*(1-dscBydnsc);
m2 = m2 + other.m2 + dscBydnsc*dosc*delta*delta;
minVal = std::min (minVal, other.minVal);
maxVal = std::max (maxVal, other.maxVal);
sampleCount = newSampleCount;
return *this;
}
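addSample() is the classic one-pass (Welford) update, and operator+= is the matching pairwise merge, which is what allows per-thread statistics to be combined at shutdown without revisiting any samples. A standalone check of the merge identity (a sketch; the struct is a stand-in for the statistic class):

    #include <cassert>
    #include <cmath>
    #include <initializer_list>

    struct Stat {
        double n = 0, mean = 0, m2 = 0;
        void add(double x) {              // Welford one-pass update
            n += 1;
            double d = x - mean;
            mean += d / n;
            m2 += d * (x - mean);
        }
        void merge(const Stat &o) {       // pairwise merge (Chan et al.)
            double nn = n + o.n, d = o.mean - mean;
            mean = (n * mean + o.n * o.mean) / nn;
            m2 += o.m2 + d * d * n * o.n / nn;
            n = nn;
        }
    };

    int main() {
        Stat a, b, all;
        for (double x : {1.0, 2.0, 3.0}) { a.add(x); all.add(x); }
        for (double x : {4.0, 5.0})      { b.add(x); all.add(x); }
        a.merge(b);   // combine "per-thread" stats, as operator+= does
        assert(std::fabs(a.mean - all.mean) < 1e-12);
        assert(std::fabs(a.m2   - all.m2)   < 1e-12);
        return 0;
    }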
void statistic::scale(double factor)
{
minVal = minVal*factor;
maxVal = maxVal*factor;
meanVal= meanVal*factor;
m2 = m2*factor*factor;
return;
}
std::string statistic::format(char unit, bool total) const
{
std::string result = formatSI(sampleCount,9,' ');
result = result + std::string(", ") + formatSI(minVal, 9, unit);
result = result + std::string(", ") + formatSI(meanVal, 9, unit);
result = result + std::string(", ") + formatSI(maxVal, 9, unit);
if (total)
result = result + std::string(", ") + formatSI(meanVal*sampleCount, 9, unit);
result = result + std::string(", ") + formatSI(getSD(), 9, unit);
return result;
}
/* ********************************************************** */
/* ************* explicitTimer member functions ************* */
void explicitTimer::start(timer_e timerEnumValue) {
startTime = tsc_tick_count::now();
if(timeStat::logEvent(timerEnumValue)) {
__kmp_stats_thread_ptr->incrementNestValue();
}
return;
}
void explicitTimer::stop(timer_e timerEnumValue) {
if (startTime.getValue() == 0)
return;
tsc_tick_count finishTime = tsc_tick_count::now();
//stat->addSample ((tsc_tick_count::now() - startTime).ticks());
stat->addSample ((finishTime - startTime).ticks());
if(timeStat::logEvent(timerEnumValue)) {
__kmp_stats_thread_ptr->push_event(startTime.getValue() - __kmp_stats_start_time.getValue(), finishTime.getValue() - __kmp_stats_start_time.getValue(), __kmp_stats_thread_ptr->getNestValue(), timerEnumValue);
__kmp_stats_thread_ptr->decrementNestValue();
}
/* We accept the risk that we drop a sample because it really did start at t==0. */
startTime = 0;
return;
}
/* ******************************************************************* */
/* ************* kmp_stats_event_vector member functions ************* */
void kmp_stats_event_vector::deallocate() {
__kmp_free(events);
internal_size = 0;
allocated_size = 0;
events = NULL;
}
// This function is for qsort() which requires the compare function to return
// either a negative number if event1 < event2, a positive number if event1 > event2
// or zero if event1 == event2.
// This sorts by start time (lowest to highest).
int compare_two_events(const void* event1, const void* event2) {
kmp_stats_event* ev1 = (kmp_stats_event*)event1;
kmp_stats_event* ev2 = (kmp_stats_event*)event2;
if(ev1->getStart() < ev2->getStart()) return -1;
else if(ev1->getStart() > ev2->getStart()) return 1;
else return 0;
}
void kmp_stats_event_vector::sort() {
qsort(events, internal_size, sizeof(kmp_stats_event), compare_two_events);
}
/* *********************************************************** */
/* ************* kmp_stats_list member functions ************* */
// returns a pointer to newly created stats node
kmp_stats_list* kmp_stats_list::push_back(int gtid) {
kmp_stats_list* newnode = (kmp_stats_list*)__kmp_allocate(sizeof(kmp_stats_list));
// placement new: construct the object in memory we allocated ourselves (hence __kmp_allocate rather than C++ new)
new (newnode) kmp_stats_list();
newnode->setGtid(gtid);
newnode->prev = this->prev;
newnode->next = this;
newnode->prev->next = newnode;
newnode->next->prev = newnode;
return newnode;
}
void kmp_stats_list::deallocate() {
kmp_stats_list* ptr = this->next;
kmp_stats_list* delptr = this->next;
while(ptr != this) {
delptr = ptr;
ptr=ptr->next;
// placement new means we have to explicitly call destructor.
delptr->_event_vector.deallocate();
delptr->~kmp_stats_list();
__kmp_free(delptr);
}
}
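push_back() and deallocate() pair placement new with an explicit destructor call because the node memory comes from __kmp_allocate rather than operator new. The idiom in isolation (a sketch; malloc/free stand in for the runtime allocators):

    #include <cstdlib>
    #include <new>        // placement new

    struct Node { Node() { /* ... */ } ~Node() { /* ... */ } };

    void lifecycle() {
        void *mem = std::malloc(sizeof(Node));  // cf. __kmp_allocate
        Node *n = new (mem) Node();             // construct in our own memory
        // ... use n ...
        n->~Node();                             // placement new => explicit dtor
        std::free(mem);                         // cf. __kmp_free
    }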
kmp_stats_list::iterator kmp_stats_list::begin() {
kmp_stats_list::iterator it;
it.ptr = this->next;
return it;
}
kmp_stats_list::iterator kmp_stats_list::end() {
kmp_stats_list::iterator it;
it.ptr = this;
return it;
}
int kmp_stats_list::size() {
int retval;
kmp_stats_list::iterator it;
for(retval=0, it=begin(); it!=end(); it++, retval++) {}
return retval;
}
/* ********************************************************************* */
/* ************* kmp_stats_list::iterator member functions ************* */
kmp_stats_list::iterator::iterator() : ptr(NULL) {}
kmp_stats_list::iterator::~iterator() {}
kmp_stats_list::iterator kmp_stats_list::iterator::operator++() {
this->ptr = this->ptr->next;
return *this;
}
kmp_stats_list::iterator kmp_stats_list::iterator::operator++(int dummy) {
this->ptr = this->ptr->next;
return *this;
}
kmp_stats_list::iterator kmp_stats_list::iterator::operator--() {
this->ptr = this->ptr->prev;
return *this;
}
kmp_stats_list::iterator kmp_stats_list::iterator::operator--(int dummy) {
this->ptr = this->ptr->prev;
return *this;
}
bool kmp_stats_list::iterator::operator!=(const kmp_stats_list::iterator & rhs) {
return this->ptr!=rhs.ptr;
}
bool kmp_stats_list::iterator::operator==(const kmp_stats_list::iterator & rhs) {
return this->ptr==rhs.ptr;
}
kmp_stats_list* kmp_stats_list::iterator::operator*() const {
return this->ptr;
}
/* *************************************************************** */
/* ************* kmp_stats_output_module functions ************** */
const char* kmp_stats_output_module::outputFileName = NULL;
const char* kmp_stats_output_module::eventsFileName = NULL;
const char* kmp_stats_output_module::plotFileName = NULL;
int kmp_stats_output_module::printPerThreadFlag = 0;
int kmp_stats_output_module::printPerThreadEventsFlag = 0;
// init() is called very early in execution, from the constructor of __kmp_stats_global_output
void kmp_stats_output_module::init()
{
char * statsFileName = getenv("KMP_STATS_FILE");
eventsFileName = getenv("KMP_STATS_EVENTS_FILE");
plotFileName = getenv("KMP_STATS_PLOT_FILE");
char * threadStats = getenv("KMP_STATS_THREADS");
char * threadEvents = getenv("KMP_STATS_EVENTS");
// set the stats output filenames based on environment variables and defaults
outputFileName = statsFileName;
eventsFileName = eventsFileName ? eventsFileName : "events.dat";
plotFileName = plotFileName ? plotFileName : "events.plt";
// set the flags based on environment variables matching: true, on, 1, .true., .t., yes
printPerThreadFlag = __kmp_str_match_true(threadStats);
printPerThreadEventsFlag = __kmp_str_match_true(threadEvents);
if(printPerThreadEventsFlag) {
// assigns a color to each timer for printing
setupEventColors();
} else {
// clear the event flags so that no events are logged
timeStat::clearEventFlags();
}
return;
}
void kmp_stats_output_module::setupEventColors() {
int i;
int globalColorIndex = 0;
int numGlobalColors = sizeof(globalColorArray) / sizeof(rgb_color);
for(i=0;i<TIMER_LAST;i++) {
if(timeStat::logEvent((timer_e)i)) {
timerColorInfo[i] = globalColorArray[globalColorIndex];
globalColorIndex = (globalColorIndex+1)%numGlobalColors;
}
}
return;
}
void kmp_stats_output_module::printStats(FILE *statsOut, statistic const * theStats, bool areTimers)
{
if (areTimers)
{
// Check whether we have any useful timers: since we don't print zero-value timers,
// we need to avoid printing a header followed by no data.
bool haveTimers = false;
for (int s = 0; s<TIMER_LAST; s++)
{
if (theStats[s].getCount() != 0)
{
haveTimers = true;
break;
}
}
if (!haveTimers)
return;
}
// Print
const char * title = areTimers ? "Timer, SampleCount," : "Counter, ThreadCount,";
fprintf (statsOut, "%s Min, Mean, Max, Total, SD\n", title);
if (areTimers) {
for (int s = 0; s<TIMER_LAST; s++) {
statistic const * stat = &theStats[s];
if (stat->getCount() != 0) {
char tag = timeStat::noUnits(timer_e(s)) ? ' ' : 'T';
fprintf (statsOut, "%-25s, %s\n", timeStat::name(timer_e(s)), stat->format(tag, true).c_str());
}
}
} else { // Counters
for (int s = 0; s<COUNTER_LAST; s++) {
statistic const * stat = &theStats[s];
fprintf (statsOut, "%-25s, %s\n", counter::name(counter_e(s)), stat->format(' ', true).c_str());
}
}
}
void kmp_stats_output_module::printCounters(FILE * statsOut, counter const * theCounters)
{
// We print all the counters even if they are zero.
// That makes it easier to slice them into a spreadsheet if you need to.
fprintf (statsOut, "\nCounter, Count\n");
for (int c = 0; c<COUNTER_LAST; c++) {
counter const * stat = &theCounters[c];
fprintf (statsOut, "%-25s, %s\n", counter::name(counter_e(c)), formatSI(stat->getValue(), 9, ' ').c_str());
}
}
void kmp_stats_output_module::printEvents(FILE* eventsOut, kmp_stats_event_vector* theEvents, int gtid) {
// sort by start time before printing
theEvents->sort();
for (int i = 0; i < theEvents->size(); i++) {
kmp_stats_event ev = theEvents->at(i);
rgb_color color = getEventColor(ev.getTimerName());
fprintf(eventsOut, "%d %lu %lu %1.1f rgb(%1.1f,%1.1f,%1.1f) %s\n",
gtid,
ev.getStart(),
ev.getStop(),
1.2 - (ev.getNestLevel() * 0.2),
color.r, color.g, color.b,
timeStat::name(ev.getTimerName())
);
}
return;
}
void kmp_stats_output_module::windupExplicitTimers()
{
// Wind up any explicit timers. We assume that it's fair at this point to just walk all the explicit timers in all threads
// and say "it's over".
// If the timer wasn't running, this won't record anything anyway.
kmp_stats_list::iterator it;
for(it = __kmp_stats_list.begin(); it != __kmp_stats_list.end(); it++) {
for (int timer=0; timer<EXPLICIT_TIMER_LAST; timer++) {
(*it)->getExplicitTimer(explicit_timer_e(timer))->stop((timer_e)timer);
}
}
}
void kmp_stats_output_module::printPloticusFile() {
int i;
int size = __kmp_stats_list.size();
FILE* plotOut = fopen(plotFileName, "w+");
fprintf(plotOut, "#proc page\n"
" pagesize: 15 10\n"
" scale: 1.0\n\n");
fprintf(plotOut, "#proc getdata\n"
" file: %s\n\n",
eventsFileName);
fprintf(plotOut, "#proc areadef\n"
" title: OpenMP Sampling Timeline\n"
" titledetails: align=center size=16\n"
" rectangle: 1 1 13 9\n"
" xautorange: datafield=2,3\n"
" yautorange: -1 %d\n\n",
size);
fprintf(plotOut, "#proc xaxis\n"
" stubs: inc\n"
" stubdetails: size=12\n"
" label: Time (ticks)\n"
" labeldetails: size=14\n\n");
fprintf(plotOut, "#proc yaxis\n"
" stubs: inc 1\n"
" stubrange: 0 %d\n"
" stubdetails: size=12\n"
" label: Thread #\n"
" labeldetails: size=14\n\n",
size-1);
fprintf(plotOut, "#proc bars\n"
" exactcolorfield: 5\n"
" axis: x\n"
" locfield: 1\n"
" segmentfields: 2 3\n"
" barwidthfield: 4\n\n");
// create legend entries corresponding to the timer color
for(i=0;i<TIMER_LAST;i++) {
if(timeStat::logEvent((timer_e)i)) {
rgb_color c = getEventColor((timer_e)i);
fprintf(plotOut, "#proc legendentry\n"
" sampletype: color\n"
" label: %s\n"
" details: rgb(%1.1f,%1.1f,%1.1f)\n\n",
timeStat::name((timer_e)i),
c.r, c.g, c.b);
}
}
fprintf(plotOut, "#proc legend\n"
" format: down\n"
" location: max max\n\n");
fclose(plotOut);
return;
}
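// [Editor's note] The .plt file written above is a ploticus script; assuming a ploticus
// install whose driver binary is named "pl", the timeline could be rendered with
// something like:
//     pl -png -o events.png events.plt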
void kmp_stats_output_module::outputStats(const char* heading)
{
statistic allStats[TIMER_LAST];
statistic allCounters[COUNTER_LAST];
// stop all the explicit timers for all threads
windupExplicitTimers();
FILE * eventsOut;
FILE * statsOut = outputFileName ? fopen (outputFileName, "a+") : stderr;
if (eventPrintingEnabled()) {
eventsOut = fopen(eventsFileName, "w+");
}
if (!statsOut)
statsOut = stderr;
fprintf(statsOut, "%s\n",heading);
// Accumulate across threads.
kmp_stats_list::iterator it;
for (it = __kmp_stats_list.begin(); it != __kmp_stats_list.end(); it++) {
int t = (*it)->getGtid();
// Output per thread stats if requested.
if (perThreadPrintingEnabled()) {
fprintf (statsOut, "Thread %d\n", t);
printStats(statsOut, (*it)->getTimers(), true);
printCounters(statsOut, (*it)->getCounters());
fprintf(statsOut,"\n");
}
// Output per thread events if requested.
if (eventPrintingEnabled()) {
kmp_stats_event_vector events = (*it)->getEventVector();
printEvents(eventsOut, &events, t);
}
for (int s = 0; s<TIMER_LAST; s++) {
// See if we should ignore this timer when aggregating
if ((timeStat::masterOnly(timer_e(s)) && (t != 0)) || // Timer is only valid on the master and this thread is a worker
(timeStat::workerOnly(timer_e(s)) && (t == 0)) || // Timer is only valid on a worker and this thread is the master
timeStat::synthesized(timer_e(s)) // It's a synthesized stat, so there's no raw data for it.
)
{
continue;
}
statistic * threadStat = (*it)->getTimer(timer_e(s));
allStats[s] += *threadStat;
}
// Special handling for synthesized statistics.
// These just have to be coded specially here for now.
// At present we only have one: the total parallel work done in each thread.
// The variance here makes it easy to see load imbalance over the whole program (though, of course,
// it's possible to have a code with awful load balance in every parallel region but perfect load
// balance over the whole program.)
allStats[TIMER_Total_work].addSample ((*it)->getTimer(TIMER_OMP_work)->getTotal());
// Time waiting for work (synthesized)
if ((t != 0) || !timeStat::workerOnly(timer_e(TIMER_OMP_await_work)))
allStats[TIMER_Total_await_work].addSample ((*it)->getTimer(TIMER_OMP_await_work)->getTotal());
// Time in explicit barriers.
allStats[TIMER_Total_barrier].addSample ((*it)->getTimer(TIMER_OMP_barrier)->getTotal());
for (int c = 0; c<COUNTER_LAST; c++) {
if (counter::masterOnly(counter_e(c)) && t != 0)
continue;
allCounters[c].addSample ((*it)->getCounter(counter_e(c))->getValue());
}
}
if (eventPrintingEnabled()) {
printPloticusFile();
fclose(eventsOut);
}
fprintf (statsOut, "Aggregate for all threads\n");
printStats (statsOut, &allStats[0], true);
fprintf (statsOut, "\n");
printStats (statsOut, &allCounters[0], false);
if (statsOut != stderr)
fclose(statsOut);
}
/* ************************************************** */
/* ************* exported C functions ************** */
// No name mangling for these functions: we want the C files to be able to get at them
extern "C" {
void __kmp_reset_stats()
{
kmp_stats_list::iterator it;
for(it = __kmp_stats_list.begin(); it != __kmp_stats_list.end(); it++) {
timeStat * timers = (*it)->getTimers();
counter * counters = (*it)->getCounters();
explicitTimer * eTimers = (*it)->getExplicitTimers();
for (int t = 0; t<TIMER_LAST; t++)
timers[t].reset();
for (int c = 0; c<COUNTER_LAST; c++)
counters[c].reset();
for (int t=0; t<EXPLICIT_TIMER_LAST; t++)
eTimers[t].reset();
// reset the event vector so all previous events are "erased"
(*it)->resetEventVector();
// May need to restart the explicit timers in thread zero?
}
KMP_START_EXPLICIT_TIMER(OMP_serial);
KMP_START_EXPLICIT_TIMER(OMP_start_end);
}
// This function stops all threads' explicit timers (if they haven't already been stopped), outputs all stats, and then resets them.
void __kmp_output_stats(const char * heading)
{
__kmp_stats_global_output.outputStats(heading);
__kmp_reset_stats();
}
void __kmp_accumulate_stats_at_exit(void)
{
// Only do this once.
if (KMP_XCHG_FIXED32(&statsPrinted, 1) != 0)
return;
__kmp_output_stats("Statistics on exit");
return;
}
void __kmp_stats_init(void)
{
return;
}
} // extern "C"
#endif // KMP_STATS_ENABLED


@ -0,0 +1,706 @@
#ifndef KMP_STATS_H
#define KMP_STATS_H
/** @file kmp_stats.h
* Functions for collecting statistics.
*/
//===----------------------------------------------------------------------===//
//
// The LLVM Compiler Infrastructure
//
// This file is dual licensed under the MIT and the University of Illinois Open
// Source Licenses. See LICENSE.txt for details.
//
//===----------------------------------------------------------------------===//
#if KMP_STATS_ENABLED
/*
* Statistics accumulator.
* Accumulates the number of samples and computes min, max, mean, and standard deviation on the fly.
*
* Online variance calculation algorithm from http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#On-line_algorithm
*/
#include <limits>
#include <math.h>
#include <string>
#include <stdint.h>
#include <new> // placement new
#include "kmp_stats_timing.h"
/*!
* @ingroup STATS_GATHERING
* \brief Flags to describe the statistic (timers or counters).
*
*/
class stats_flags_e {
public:
const static int onlyInMaster = 1<<0; //!< statistic is valid only for master
const static int noUnits = 1<<1; //!< statistic doesn't need units printed next to it in output
const static int synthesized = 1<<2; //!< statistic's value is computed at exit time in the __kmp_output_stats function
const static int notInMaster = 1<<3; //!< statistic is valid for non-master threads
const static int logEvent = 1<<4; //!< statistic can be logged when KMP_STATS_EVENTS is on (valid only for timers)
};
/*!
* \brief Add new counters under KMP_FOREACH_COUNTER() macro in kmp_stats.h
*
* @param macro a user defined macro that takes three arguments - macro(COUNTER_NAME, flags, arg)
* @param arg a user defined argument to send to the user defined macro
*
* \details A counter counts the occurrence of some event.
* Each thread accumulates its own count; at the end of execution the counts are aggregated, treating each thread
* as a separate measurement. (Unless onlyInMaster is set, in which case there's only a single measurement.)
* The min, mean, and max are therefore the values over the threads.
* Adding the counter here and then putting a KMP_COUNT_BLOCK(name) in the code is all you need to do.
* All of the tables and printing are generated from this macro.
* Format is "macro(name, flags, arg)"
*
* @ingroup STATS_GATHERING
*/
#define KMP_FOREACH_COUNTER(macro, arg) \
macro (OMP_PARALLEL, stats_flags_e::onlyInMaster, arg) \
macro (OMP_FOR_static, 0, arg) \
macro (OMP_FOR_dynamic, 0, arg) \
macro (OMP_DISTR_FOR_static, 0, arg) \
macro (OMP_DISTR_FOR_dynamic, 0, arg) \
macro (OMP_BARRIER, 0, arg) \
macro (OMP_CRITICAL,0, arg) \
macro (OMP_SINGLE, 0, arg) \
macro (OMP_MASTER, 0, arg) \
macro (OMP_set_lock, 0, arg) \
macro (OMP_test_lock, 0, arg) \
macro (OMP_test_lock_failure, 0, arg) \
macro (REDUCE_wait, 0, arg) \
macro (REDUCE_nowait, 0, arg) \
macro (LAST,0,arg)
/*!
* \brief Add new timers under KMP_FOREACH_TIMER() macro in kmp_stats.h
*
* @param macro a user defined macro that takes three arguments - macro(TIMER_NAME, flags, arg)
* @param arg a user defined argument to send to the user defined macro
*
* \details A timer collects multiple samples of some count in each thread and then finally aggregates over all the threads.
* The count is normally a time (in ticks), hence the name "timer". (But it can be any value, so we also use this for "number of arguments passed to fork";
* we could equally collect "loop iteration count" if we wanted to.)
* For timers the threads are not significant; it's the individual observations that count, so the statistics are at that level.
* Format is "macro(name, flags, arg)"
*
* @ingroup STATS_GATHERING
*/
#define KMP_FOREACH_TIMER(macro, arg) \
macro (OMP_PARALLEL_args, stats_flags_e::onlyInMaster | stats_flags_e::noUnits, arg) \
macro (FOR_static_iterations, stats_flags_e::onlyInMaster | stats_flags_e::noUnits, arg) \
macro (FOR_dynamic_iterations, stats_flags_e::noUnits, arg) \
macro (OMP_start_end, stats_flags_e::onlyInMaster, arg) \
macro (OMP_serial, stats_flags_e::onlyInMaster, arg) \
macro (OMP_work, 0, arg) \
macro (Total_work, stats_flags_e::synthesized, arg) \
macro (OMP_await_work, stats_flags_e::notInMaster, arg) \
macro (Total_await_work, stats_flags_e::synthesized, arg) \
macro (OMP_barrier, 0, arg) \
macro (Total_barrier, stats_flags_e::synthesized, arg) \
macro (OMP_test_lock, 0, arg) \
macro (FOR_static_scheduling, 0, arg) \
macro (FOR_dynamic_scheduling, 0, arg) \
macro (KMP_fork_call, 0, arg) \
macro (KMP_join_call, 0, arg) \
macro (KMP_fork_barrier, stats_flags_e::logEvent, arg) \
macro (KMP_join_barrier, stats_flags_e::logEvent, arg) \
macro (KMP_barrier, 0, arg) \
macro (KMP_end_split_barrier, 0, arg) \
macro (KMP_wait_sleep, 0, arg) \
macro (KMP_release, 0, arg) \
macro (KMP_hier_gather, 0, arg) \
macro (KMP_hier_release, 0, arg) \
macro (KMP_hyper_gather, stats_flags_e::logEvent, arg) \
macro (KMP_hyper_release, stats_flags_e::logEvent, arg) \
macro (KMP_linear_gather, 0, arg) \
macro (KMP_linear_release, 0, arg) \
macro (KMP_tree_gather, 0, arg) \
macro (KMP_tree_release, 0, arg) \
macro (USER_master_invoke, stats_flags_e::logEvent, arg) \
macro (USER_worker_invoke, stats_flags_e::logEvent, arg) \
macro (USER_resume, stats_flags_e::logEvent, arg) \
macro (USER_suspend, stats_flags_e::logEvent, arg) \
macro (USER_launch_thread_loop, stats_flags_e::logEvent, arg) \
macro (KMP_allocate_team, 0, arg) \
macro (KMP_setup_icv_copy, 0, arg) \
macro (USER_icv_copy, 0, arg) \
macro (LAST,0, arg)
// OMP_PARALLEL_args -- the number of arguments passed to a fork
// FOR_static_iterations -- Number of available parallel chunks of work in a static for
// FOR_dynamic_iterations -- Number of available parallel chunks of work in a dynamic for
// Both adjust for any chunking, so if there were an iteration count of 20 but a chunk size of 10, we'd record 2.
// OMP_serial -- thread zero time executing serial code
// OMP_start_end -- time from when OpenMP is initialized until the stats are printed at exit
// OMP_work -- elapsed time in code dispatched by a fork (measured in the thread)
// Total_work -- a synthesized statistic summarizing how much parallel work each thread executed.
// OMP_barrier -- time at "real" barriers
// Total_barrier -- a synthesized statistic summarizing how much time at real barriers in each thread
// OMP_set_lock -- time in lock setting
// OMP_test_lock -- time in testing a lock
// LOCK_WAIT -- time waiting for a lock
// FOR_static_scheduling -- time spent doing scheduling for a static "for"
// FOR_dynamic_scheduling -- time spent doing scheduling for a dynamic "for"
// KMP_wait_sleep -- time in __kmp_wait_sleep
// KMP_release -- time in __kmp_release
// KMP_fork_barrier -- time in __kmp_fork_barrier
// KMP_join_barrier -- time in __kmp_join_barrier
// KMP_barrier -- time in __kmp_barrier
// KMP_end_split_barrier -- time in __kmp_end_split_barrier
// KMP_setup_icv_copy -- time in __kmp_setup_icv_copy
// KMP_icv_copy -- start/stop timer for any ICV copying
// KMP_linear_gather -- time in __kmp_linear_barrier_gather
// KMP_linear_release -- time in __kmp_linear_barrier_release
// KMP_tree_gather -- time in __kmp_tree_barrier_gather
// KMP_tree_release -- time in __kmp_tree_barrier_release
// KMP_hyper_gather -- time in __kmp_hyper_barrier_gather
// KMP_hyper_release -- time in __kmp_hyper_barrier_release
/*!
* \brief Add new explicit timers under KMP_FOREACH_EXPLICIT_TIMER() macro.
*
* @param macro a user defined macro that takes three arguments - macro(TIMER_NAME, flags, arg)
* @param arg a user defined argument to send to the user defined macro
*
* \warning YOU MUST HAVE THE SAME NAMED TIMER UNDER KMP_FOREACH_TIMER() OR ELSE BAD THINGS WILL HAPPEN!
*
* \details Explicit timers are ones where we need to allocate a timer itself (as well as the accumulated timing statistics).
* We allocate these on a per-thread basis, and explicitly start and stop them.
* Block timers just allocate the timer itself on the stack, and use the destructor to notice block exit; they don't
* need to be defined here.
* The name here should be the same as that of a timer above.
*
* @ingroup STATS_GATHERING
*/
#define KMP_FOREACH_EXPLICIT_TIMER(macro, arg) \
macro(OMP_serial, 0, arg) \
macro(OMP_start_end, 0, arg) \
macro(USER_icv_copy, 0, arg) \
macro(USER_launch_thread_loop, stats_flags_e::logEvent, arg) \
macro(LAST, 0, arg)
#define ENUMERATE(name,ignore,prefix) prefix##name,
enum timer_e {
KMP_FOREACH_TIMER(ENUMERATE, TIMER_)
};
enum explicit_timer_e {
KMP_FOREACH_EXPLICIT_TIMER(ENUMERATE, EXPLICIT_TIMER_)
};
enum counter_e {
KMP_FOREACH_COUNTER(ENUMERATE, COUNTER_)
};
#undef ENUMERATE
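// [Editor's illustration] The X-macro lists above expand into parallel tables as well as
// these enums. A hypothetical expansion for the timer name/flags table (the real table
// lives in kmp_stats.cpp, and INFO_ENTRY is an invented name) would be:
//
//     #define INFO_ENTRY(name, flagsVal, ignore) { #name, flagsVal },
//     statInfo timeStat::timerInfo[] = { KMP_FOREACH_TIMER(INFO_ENTRY, 0) };
//     #undef INFO_ENTRY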
class statistic
{
double minVal;
double maxVal;
double meanVal;
double m2;
uint64_t sampleCount;
public:
statistic() { reset(); }
statistic (statistic const &o): minVal(o.minVal), maxVal(o.maxVal), meanVal(o.meanVal), m2(o.m2), sampleCount(o.sampleCount) {}
double getMin() const { return minVal; }
double getMean() const { return meanVal; }
double getMax() const { return maxVal; }
uint64_t getCount() const { return sampleCount; }
double getSD() const { return sqrt(m2/sampleCount); }
double getTotal() const { return sampleCount*meanVal; }
void reset()
{
minVal = std::numeric_limits<double>::max();
maxVal = -std::numeric_limits<double>::max();
meanVal= 0.0;
m2 = 0.0;
sampleCount = 0;
}
void addSample(double sample);
void scale (double factor);
void scaleDown(double f) { scale (1./f); }
statistic & operator+= (statistic const & other);
std::string format(char unit, bool total=false) const;
};
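// [Editor's sketch] addSample() is defined in kmp_stats.cpp; a minimal version of the
// on-line (Welford) update referenced at the top of this file, written against the
// members above, would be:
//
//     void statistic::addSample(double sample)
//     {
//         minVal = (sample < minVal) ? sample : minVal;
//         maxVal = (sample > maxVal) ? sample : maxVal;
//         sampleCount++;
//         double delta = sample - meanVal;
//         meanVal += delta / sampleCount;
//         m2 += delta * (sample - meanVal);  // second term uses the updated mean
//     }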
struct statInfo
{
const char * name;
uint32_t flags;
};
class timeStat : public statistic
{
static statInfo timerInfo[];
public:
timeStat() : statistic() {}
static const char * name(timer_e e) { return timerInfo[e].name; }
static bool masterOnly (timer_e e) { return timerInfo[e].flags & stats_flags_e::onlyInMaster; }
static bool workerOnly (timer_e e) { return timerInfo[e].flags & stats_flags_e::notInMaster; }
static bool noUnits (timer_e e) { return timerInfo[e].flags & stats_flags_e::noUnits; }
static bool synthesized(timer_e e) { return timerInfo[e].flags & stats_flags_e::synthesized; }
static bool logEvent (timer_e e) { return timerInfo[e].flags & stats_flags_e::logEvent; }
static void clearEventFlags() {
int i;
for(i=0;i<TIMER_LAST;i++) {
timerInfo[i].flags &= (~(stats_flags_e::logEvent));
}
}
};
// Where we need to start and end the timer explicitly, this version can be used.
// Since these timers normally aren't nicely scoped, they don't have a good place to live
// on the stack of the thread, which makes them more work to use.
class explicitTimer
{
timeStat * stat;
tsc_tick_count startTime;
public:
explicitTimer () : stat(0), startTime(0) { }
explicitTimer (timeStat * s) : stat(s), startTime() { }
void setStat (timeStat *s) { stat = s; }
void start(timer_e timerEnumValue);
void stop(timer_e timerEnumValue);
void reset() { startTime = 0; }
};
// Where all you need is to time a block, this is enough.
// (It avoids the need to have an explicit end, leaving the scope suffices.)
class blockTimer : public explicitTimer
{
timer_e timerEnumValue;
public:
blockTimer (timeStat * s, timer_e newTimerEnumValue) : explicitTimer(s), timerEnumValue(newTimerEnumValue) { start(timerEnumValue); }
~blockTimer() { stop(timerEnumValue); }
};
// If all you want is a count, then you can use this...
// The individual per-thread counts will be aggregated into a statistic at program exit.
class counter
{
uint64_t value;
static const statInfo counterInfo[];
public:
counter() : value(0) {}
void increment() { value++; }
uint64_t getValue() const { return value; }
void reset() { value = 0; }
static const char * name(counter_e e) { return counterInfo[e].name; }
static bool masterOnly (counter_e e) { return counterInfo[e].flags & stats_flags_e::onlyInMaster; }
};
/* ****************************************************************
Class to implement an event
There are four components to an event: start time, stop time,
nest_level, and timer_name.
The start and stop time should be obvious (recorded in clock ticks).
The nest_level relates to the bar width in the timeline graph.
The timer_name is used to determine which timer event triggered this event.
The interface to this class is through four read-only operations:
1) getStart() -- returns the start time as 64 bit integer
2) getStop() -- returns the stop time as 64 bit integer
3) getNestLevel() -- returns the nest level of the event
4) getTimerName() -- returns the timer name that triggered event
*MORE ON NEST_LEVEL*
The nest level is used in the bar graph that represents the timeline.
Its main purpose is to show how events are nested inside each other.
For example, say events A, B, and C are recorded. If the timeline
looks like this:
Begin -------------------------------------------------------------> Time
| | | | | |
A B C C B A
start start start end end end
Then A, B, C will have a nest level of 1, 2, 3 respectively.
These values are then used to calculate the bar width, so you can
see that inside A, B has occurred, and inside B, C has occurred.
Currently, this is shown with A's bar width being larger than B's
bar width, and B's bar width being larger than C's bar width.
**************************************************************** */
class kmp_stats_event {
uint64_t start;
uint64_t stop;
int nest_level;
timer_e timer_name;
public:
kmp_stats_event() : start(0), stop(0), nest_level(0), timer_name(TIMER_LAST) {}
kmp_stats_event(uint64_t strt, uint64_t stp, int nst, timer_e nme) : start(strt), stop(stp), nest_level(nst), timer_name(nme) {}
inline uint64_t getStart() const { return start; }
inline uint64_t getStop() const { return stop; }
inline int getNestLevel() const { return nest_level; }
inline timer_e getTimerName() const { return timer_name; }
};
/* ****************************************************************
Class to implement a dynamically expandable array of events
---------------------------------------------------------
| event 1 | event 2 | event 3 | event 4 | ... | event N |
---------------------------------------------------------
An event is pushed onto the back of this array at every
explicitTimer->stop() call. The event records the thread #,
start time, stop time, and nest level related to the bar width.
The event vector starts at size INIT_SIZE and grows (doubles in size)
if needed. An implication of this behavior is that log(N)
reallocations are needed (where N is number of events). If you want
to avoid reallocations, then set INIT_SIZE to a large value.
The interface to this class is through six operations:
1) reset() -- sets the internal_size back to 0 but does not deallocate any memory
2) size() -- returns the number of valid elements in the vector
3) push_back(start, stop, nest, timer_name) -- pushes an event onto
the back of the array
4) deallocate() -- frees all memory associated with the vector
5) sort() -- sorts the vector by start time
6) operator[index] or at(index) -- returns event reference at that index
**************************************************************** */
class kmp_stats_event_vector {
kmp_stats_event* events;
int internal_size;
int allocated_size;
static const int INIT_SIZE = 1024;
public:
kmp_stats_event_vector() {
events = (kmp_stats_event*)__kmp_allocate(sizeof(kmp_stats_event)*INIT_SIZE);
internal_size = 0;
allocated_size = INIT_SIZE;
}
~kmp_stats_event_vector() {}
inline void reset() { internal_size = 0; }
inline int size() const { return internal_size; }
void push_back(uint64_t start_time, uint64_t stop_time, int nest_level, timer_e name) {
int i;
if(internal_size == allocated_size) {
kmp_stats_event* tmp = (kmp_stats_event*)__kmp_allocate(sizeof(kmp_stats_event)*allocated_size*2);
for(i=0;i<internal_size;i++) tmp[i] = events[i];
__kmp_free(events);
events = tmp;
allocated_size*=2;
}
events[internal_size] = kmp_stats_event(start_time, stop_time, nest_level, name);
internal_size++;
return;
}
void deallocate();
void sort();
const kmp_stats_event & operator[](int index) const { return events[index]; }
kmp_stats_event & operator[](int index) { return events[index]; }
const kmp_stats_event & at(int index) const { return events[index]; }
kmp_stats_event & at(int index) { return events[index]; }
};
/* ****************************************************************
Class to implement a doubly-linked, circular, statistics list
|---| ---> |---| ---> |---| ---> |---| ---> ... next
| | | | | | | |
|---| <--- |---| <--- |---| <--- |---| <--- ... prev
Sentinel first second third
Node node node node
The Sentinel Node is the user handle on the list.
The first node corresponds to thread 0's statistics.
The second node corresponds to thread 1's statistics and so on...
Each node has a _timers, _counters, and _explicitTimers array to
hold that thread's statistics. The _explicitTimers
point to the correct _timer and update its statistics at every stop() call.
The explicitTimers' pointers are set up in the constructor.
Each node also has an event vector to hold that thread's timing events.
The event vector expands as necessary and records the start-stop times
for each timer.
The nestLevel variable is for plotting events and is related
to the bar width in the timeline graph.
Every thread will have a __thread local pointer to its node in
the list. The sentinel node is used by the master thread to
store "dummy" statistics before __kmp_create_worker() is called.
**************************************************************** */
class kmp_stats_list {
int gtid;
timeStat _timers[TIMER_LAST+1];
counter _counters[COUNTER_LAST+1];
explicitTimer _explicitTimers[EXPLICIT_TIMER_LAST+1];
int _nestLevel; // one per thread
kmp_stats_event_vector _event_vector;
kmp_stats_list* next;
kmp_stats_list* prev;
public:
kmp_stats_list() : _nestLevel(0), _event_vector(), next(this), prev(this) {
#define doInit(name,ignore1,ignore2) \
getExplicitTimer(EXPLICIT_TIMER_##name)->setStat(getTimer(TIMER_##name));
KMP_FOREACH_EXPLICIT_TIMER(doInit,0);
#undef doInit
}
~kmp_stats_list() { }
inline timeStat * getTimer(timer_e idx) { return &_timers[idx]; }
inline counter * getCounter(counter_e idx) { return &_counters[idx]; }
inline explicitTimer * getExplicitTimer(explicit_timer_e idx) { return &_explicitTimers[idx]; }
inline timeStat * getTimers() { return _timers; }
inline counter * getCounters() { return _counters; }
inline explicitTimer * getExplicitTimers() { return _explicitTimers; }
inline kmp_stats_event_vector & getEventVector() { return _event_vector; }
inline void resetEventVector() { _event_vector.reset(); }
inline void incrementNestValue() { _nestLevel++; }
inline int getNestValue() { return _nestLevel; }
inline void decrementNestValue() { _nestLevel--; }
inline int getGtid() const { return gtid; }
inline void setGtid(int newgtid) { gtid = newgtid; }
kmp_stats_list* push_back(int gtid); // returns newly created list node
inline void push_event(uint64_t start_time, uint64_t stop_time, int nest_level, timer_e name) {
_event_vector.push_back(start_time, stop_time, nest_level, name);
}
void deallocate();
class iterator;
kmp_stats_list::iterator begin();
kmp_stats_list::iterator end();
int size();
class iterator {
kmp_stats_list* ptr;
friend kmp_stats_list::iterator kmp_stats_list::begin();
friend kmp_stats_list::iterator kmp_stats_list::end();
public:
iterator();
~iterator();
iterator operator++();
iterator operator++(int dummy);
iterator operator--();
iterator operator--(int dummy);
bool operator!=(const iterator & rhs);
bool operator==(const iterator & rhs);
kmp_stats_list* operator*() const; // dereference operator
};
};
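// [Editor's sketch] push_back() is defined out of line (and takes __kmp_stats_lock);
// ignoring locking and the __kmp_allocate/placement-new details, appending a per-thread
// node just before the sentinel amounts to:
//
//     kmp_stats_list* kmp_stats_list::push_back(int gtid) {
//         kmp_stats_list* newNode = new kmp_stats_list();
//         newNode->setGtid(gtid);
//         newNode->prev = this->prev;   // old tail
//         newNode->next = this;         // sentinel closes the circle
//         this->prev->next = newNode;
//         this->prev = newNode;
//         return newNode;
//     }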
/* ****************************************************************
Class to encapsulate all output functions and the environment variables
This module holds filenames for various outputs (normal stats, events, plot file),
as well as coloring information for the plot file.
The filenames and flags variables are read from environment variables.
These are read once by the constructor of the global variable __kmp_stats_output
which calls init().
During this init() call, event flags for the timeStat::timerInfo[] global array
are cleared if KMP_STATS_EVENTS is not true (on, 1, yes).
The only interface function that is public is outputStats(heading). This function
should print out everything it needs to, either to files or stderr,
depending on the environment variables described below
ENVIRONMENT VARIABLES:
KMP_STATS_FILE -- if set, all statistics (not events) will be printed to this file,
otherwise, print to stderr
KMP_STATS_THREADS -- if set to "on", per-thread statistics will also be printed, to either
KMP_STATS_FILE or stderr
KMP_STATS_PLOT_FILE -- if set, print the ploticus plot file to this filename,
otherwise, the plot file is sent to "events.plt"
KMP_STATS_EVENTS -- if set to "on", then log events, otherwise, don't log events
KMP_STATS_EVENTS_FILE -- if set, all events are written to this file,
otherwise, output is sent to "events.dat"
**************************************************************** */
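// [Editor's example] Given a library built with KMP_STATS_ENABLED, the environment
// variables documented above could be combined like this (shell syntax):
//     KMP_STATS_FILE=stats.csv KMP_STATS_THREADS=on \
//     KMP_STATS_EVENTS=on KMP_STATS_EVENTS_FILE=run.dat ./a.out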
class kmp_stats_output_module {
public:
struct rgb_color {
float r;
float g;
float b;
};
private:
static const char* outputFileName;
static const char* eventsFileName;
static const char* plotFileName;
static int printPerThreadFlag;
static int printPerThreadEventsFlag;
static const rgb_color globalColorArray[];
static rgb_color timerColorInfo[];
void init();
static void setupEventColors();
static void printPloticusFile();
static void printStats(FILE *statsOut, statistic const * theStats, bool areTimers);
static void printCounters(FILE * statsOut, counter const * theCounters);
static void printEvents(FILE * eventsOut, kmp_stats_event_vector* theEvents, int gtid);
static rgb_color getEventColor(timer_e e) { return timerColorInfo[e]; }
static void windupExplicitTimers();
bool eventPrintingEnabled() { return printPerThreadEventsFlag != 0; }
bool perThreadPrintingEnabled() { return printPerThreadFlag != 0; }
public:
kmp_stats_output_module() { init(); }
void outputStats(const char* heading);
};
#ifdef __cplusplus
extern "C" {
#endif
void __kmp_stats_init();
void __kmp_reset_stats();
void __kmp_output_stats(const char *);
void __kmp_accumulate_stats_at_exit(void);
// thread local pointer to stats node within list
extern __thread kmp_stats_list* __kmp_stats_thread_ptr;
// head to stats list.
extern kmp_stats_list __kmp_stats_list;
// lock for __kmp_stats_list
extern kmp_tas_lock_t __kmp_stats_lock;
// reference start time
extern tsc_tick_count __kmp_stats_start_time;
// interface to output
extern kmp_stats_output_module __kmp_stats_output;
#ifdef __cplusplus
}
#endif
// Simple, standard interfaces that drop out completely if stats aren't enabled
/*!
* \brief Uses specified timer (name) to time code block.
*
* @param name timer name as specified under the KMP_FOREACH_TIMER() macro
*
* \details Use KMP_TIME_BLOCK(name) macro to time a code block. This will record the time taken in the block
* and use the destructor to stop the timer. Convenient!
* With this definition you can't have more than one KMP_TIME_BLOCK in the same code block.
* I don't think that's a problem.
*
* @ingroup STATS_GATHERING
*/
#define KMP_TIME_BLOCK(name) \
blockTimer __BLOCKTIME__(__kmp_stats_thread_ptr->getTimer(TIMER_##name), TIMER_##name)
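// [Editor's usage sketch] e.g., timing a function body with one of the timers declared
// under KMP_FOREACH_TIMER():
//     void example() {                    // hypothetical caller
//         KMP_TIME_BLOCK(KMP_fork_call);
//         /* ... work to be timed ... */
//     }                                   // destructor records the elapsed ticks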
/*!
* \brief Adds value to specified timer (name).
*
* @param name timer name as specified under the KMP_FOREACH_TIMER() macro
* @param value double precision sample value to add to statistics for the timer
*
* \details Use the KMP_COUNT_VALUE(name, value) macro to add a particular value to a timer's statistics.
*
* @ingroup STATS_GATHERING
*/
#define KMP_COUNT_VALUE(name, value) \
__kmp_stats_thread_ptr->getTimer(TIMER_##name)->addSample(value)
/*!
* \brief Increments specified counter (name).
*
* @param name counter name as specified under the KMP_FOREACH_COUNTER() macro
*
* \details Use the KMP_COUNT_BLOCK(name) macro to increment a statistics counter for the executing thread.
*
* @ingroup STATS_GATHERING
*/
#define KMP_COUNT_BLOCK(name) \
__kmp_stats_thread_ptr->getCounter(COUNTER_##name)->increment()
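// [Editor's usage sketch] counting an event and recording a value sample (nArgs is a
// stand-in variable):
//     KMP_COUNT_BLOCK(OMP_PARALLEL);               // bump this thread's counter
//     KMP_COUNT_VALUE(OMP_PARALLEL_args, nArgs);   // add a sample to the "timer" stat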
/*!
* \brief "Starts" an explicit timer which will need a corresponding KMP_STOP_EXPLICIT_TIMER() macro.
*
* @param name explicit timer name as specified under the KMP_FOREACH_EXPLICIT_TIMER() macro
*
* \details Use to start a timer. This will need a corresponding KMP_STOP_EXPLICIT_TIMER()
* macro to stop the timer, unlike the KMP_TIME_BLOCK(name) macro, which stops implicitly at the end
* of the code block. All explicit timers are stopped at library exit time before the final statistics are output.
*
* @ingroup STATS_GATHERING
*/
#define KMP_START_EXPLICIT_TIMER(name) \
__kmp_stats_thread_ptr->getExplicitTimer(EXPLICIT_TIMER_##name)->start(TIMER_##name)
/*!
* \brief "Stops" an explicit timer.
*
* @param name explicit timer name as specified under the KMP_FOREACH_EXPLICIT_TIMER() macro
*
* \details Use KMP_STOP_EXPLICIT_TIMER(name) to stop a timer. When this is done, the time between the last KMP_START_EXPLICIT_TIMER(name)
* and this KMP_STOP_EXPLICIT_TIMER(name) will be added to the timer's stat value. The timer will then be reset.
* After the KMP_STOP_EXPLICIT_TIMER(name) macro is called, another call to KMP_START_EXPLICIT_TIMER(name) will start the timer once again.
*
* @ingroup STATS_GATHERING
*/
#define KMP_STOP_EXPLICIT_TIMER(name) \
__kmp_stats_thread_ptr->getExplicitTimer(EXPLICIT_TIMER_##name)->stop(TIMER_##name)
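// [Editor's usage sketch] explicit timers bracket regions that don't scope nicely:
//     KMP_START_EXPLICIT_TIMER(OMP_serial);
//     /* ... serial portion ... */
//     KMP_STOP_EXPLICIT_TIMER(OMP_serial);   // adds the elapsed time, then resets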
/*!
* \brief Outputs the current thread statistics and resets them.
*
* @param heading_string heading put above the final stats output
*
* \details Explicitly stops all timers and outputs all stats.
* The environment variable `KMP_STATS_FILE=filename` can be used to send the stats to a file instead of stderr.
* The environment variable `KMP_STATS_THREADS=true` can be used to print thread-specific stats as well;
* if it is undefined (not specified in the environment), thread-specific stats won't be printed.
* Note that all statistics are reset when this macro is called.
*
* @ingroup STATS_GATHERING
*/
#define KMP_OUTPUT_STATS(heading_string) \
__kmp_output_stats(heading_string)
/*!
* \brief resets all stats (counters to 0, timers to 0 elapsed ticks)
*
* \details Reset all stats for all threads.
*
* @ingroup STATS_GATHERING
*/
#define KMP_RESET_STATS() __kmp_reset_stats()
#else // KMP_STATS_ENABLED
// Null definitions
#define KMP_TIME_BLOCK(n) ((void)0)
#define KMP_COUNT_VALUE(n,v) ((void)0)
#define KMP_COUNT_BLOCK(n) ((void)0)
#define KMP_START_EXPLICIT_TIMER(n) ((void)0)
#define KMP_STOP_EXPLICIT_TIMER(n) ((void)0)
#define KMP_OUTPUT_STATS(heading_string) ((void)0)
#define KMP_RESET_STATS() ((void)0)
#endif // KMP_STATS_ENABLED
#endif // KMP_STATS_H


@ -0,0 +1,167 @@
/** @file kmp_stats_timing.cpp
* Timing functions
*/
//===----------------------------------------------------------------------===//
//
// The LLVM Compiler Infrastructure
//
// This file is dual licensed under the MIT and the University of Illinois Open
// Source Licenses. See LICENSE.txt for details.
//
//===----------------------------------------------------------------------===//
#include <stdlib.h>
#include <unistd.h>
#include <iostream>
#include <iomanip>
#include <sstream>
#include "kmp_stats_timing.h"
using namespace std;
#if KMP_OS_LINUX
# if KMP_MIC
double tsc_tick_count::tick_time()
{
// pretty bad assumption of 1GHz clock for MIC
return 1/((double)1000*1.e6);
}
# else
# include <string.h>
// Extract the value from the CPUID information
double tsc_tick_count::tick_time()
{
static double result = 0.0;
if (result == 0.0)
{
int cpuinfo[4];
char brand[256];
__cpuid(cpuinfo, 0x80000000);
memset(brand, 0, sizeof(brand));
int ids = cpuinfo[0];
for (unsigned int i=2; i<(ids^0x80000000)+2; i++)
__cpuid(brand+(i-2)*sizeof(cpuinfo), i | 0x80000000);
char * start = &brand[0];
for (;*start == ' '; start++)
;
char * end = brand + strlen(brand) - 3;
uint64_t multiplier;
if (*end == 'M') multiplier = 1000LL*1000LL;
else if (*end == 'G') multiplier = 1000LL*1000LL*1000LL;
else if (*end == 'T') multiplier = 1000LL*1000LL*1000LL*1000LL;
else
{
cout << "Error determining multiplier '" << *end << "'\n";
exit (-1);
}
*end = 0;
while (*end != ' ') end--;
end++;
double freq = strtod(end, &start);
if (freq == 0.0)
{
cout << "Error calculating frequency " << end << "\n";
exit (-1);
}
result = ((double)1.0)/(freq * multiplier);
}
return result;
}
# endif
#endif
static bool useSI = true;
// Return a formatted string after normalising the value into
// engineering style and using a suitable unit prefix (e.g. ms, us, ns).
std::string formatSI(double interval, int width, char unit)
{
std::stringstream os;
if (useSI)
{
// Preserve accuracy for small numbers, since we only multiply and the positive powers
// of ten are precisely representable.
static struct { double scale; char prefix; } ranges[] = {
{1.e12,'f'},
{1.e9, 'p'},
{1.e6, 'n'},
{1.e3, 'u'},
{1.0, 'm'},
{1.e-3,' '},
{1.e-6,'k'},
{1.e-9,'M'},
{1.e-12,'G'},
{1.e-15,'T'},
{1.e-18,'P'},
{1.e-21,'E'},
{1.e-24,'Z'},
{1.e-27,'Y'}
};
if (interval == 0.0)
{
os << std::setw(width-3) << std::right << "0.00" << std::setw(3) << unit;
return os.str();
}
bool negative = false;
if (interval < 0.0)
{
negative = true;
interval = -interval;
}
for (int i=0; i<(int)(sizeof(ranges)/sizeof(ranges[0])); i++)
{
if (interval*ranges[i].scale < 1.e0)
{
interval = interval * 1000.e0 * ranges[i].scale;
os << std::fixed << std::setprecision(2) << std::setw(width-3) << std::right <<
(negative ? -interval : interval) << std::setw(2) << ranges[i].prefix << std::setw(1) << unit;
return os.str();
}
}
}
os << std::setprecision(2) << std::fixed << std::right << std::setw(width-3) << interval << std::setw(3) << unit;
return os.str();
}
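// [Editor's worked example] formatSI(0.000125, 9, 'S') walks the table above until
// 0.000125 * 1.e3 < 1 (the 'u' row), scales the value to 0.000125 * 1000 * 1.e3 = 125.00,
// and so returns "125.00 uS".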
tsc_tick_count::tsc_interval_t computeLastInLastOutInterval(timePair * times, int nTimes)
{
timePair lastTimes = times[0];
tsc_tick_count * startp = lastTimes.get_startp();
tsc_tick_count * endp = lastTimes.get_endp();
for (int i=1; i<nTimes; i++)
{
(*startp) = startp->later(times[i].get_start());
(*endp) = endp->later (times[i].get_end());
}
return lastTimes.duration();
}
std::string timePair::format() const
{
std::ostringstream oss;
oss << start.getValue() << ":" << end.getValue() << " = " << (end-start).getValue();
return oss.str();
}


@ -0,0 +1,104 @@
#ifndef KMP_STATS_TIMING_H
#define KMP_STATS_TIMING_H
/** @file kmp_stats_timing.h
* Access to real time clock and timers.
*/
//===----------------------------------------------------------------------===//
//
// The LLVM Compiler Infrastructure
//
// This file is dual licensed under the MIT and the University of Illinois Open
// Source Licenses. See LICENSE.txt for details.
//
//===----------------------------------------------------------------------===//
#include <stdint.h>
#include <string>
#include <limits>
#include "kmp_os.h"
class tsc_tick_count {
private:
int64_t my_count;
public:
class tsc_interval_t {
int64_t value;
explicit tsc_interval_t(int64_t _value) : value(_value) {}
public:
tsc_interval_t() : value(0) {}; // Construct 0 time duration
double seconds() const; // Return the length of a time interval in seconds
double ticks() const { return double(value); }
int64_t getValue() const { return value; }
friend class tsc_tick_count;
friend tsc_interval_t operator-(
const tsc_tick_count t1, const tsc_tick_count t0);
};
tsc_tick_count() : my_count(static_cast<int64_t>(__rdtsc())) {};
tsc_tick_count(int64_t value) : my_count(value) {};
int64_t getValue() const { return my_count; }
tsc_tick_count later (tsc_tick_count const other) const {
return my_count > other.my_count ? (*this) : other;
}
tsc_tick_count earlier(tsc_tick_count const other) const {
return my_count < other.my_count ? (*this) : other;
}
static double tick_time(); // returns seconds per cycle (period) of clock
static tsc_tick_count now() { return tsc_tick_count(); } // returns the rdtsc register value
friend tsc_tick_count::tsc_interval_t operator-(const tsc_tick_count t1, const tsc_tick_count t0);
};
inline tsc_tick_count::tsc_interval_t operator-(const tsc_tick_count t1, const tsc_tick_count t0)
{
return tsc_tick_count::tsc_interval_t( t1.my_count-t0.my_count );
}
inline double tsc_tick_count::tsc_interval_t::seconds() const
{
return value*tick_time();
}
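// [Editor's usage sketch] measuring a region in ticks and converting to seconds:
//     tsc_tick_count t0;                                       // reads rdtsc now
//     /* ... region to measure ... */
//     tsc_tick_count::tsc_interval_t dt = tsc_tick_count::now() - t0;
//     double secs = dt.seconds();                              // ticks * tick_time()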
extern std::string formatSI(double interval, int width, char unit);
inline std::string formatSeconds(double interval, int width)
{
return formatSI(interval, width, 'S');
}
inline std::string formatTicks(double interval, int width)
{
return formatSI(interval, width, 'T');
}
class timePair
{
tsc_tick_count KMP_ALIGN_CACHE start;
tsc_tick_count end;
public:
timePair() : start(-std::numeric_limits<int64_t>::max()), end(-std::numeric_limits<int64_t>::max()) {}
tsc_tick_count get_start() const { return start; }
tsc_tick_count get_end() const { return end; }
tsc_tick_count * get_startp() { return &start; }
tsc_tick_count * get_endp() { return &end; }
void markStart() { start = tsc_tick_count::now(); }
void markEnd() { end = tsc_tick_count::now(); }
void set_start(tsc_tick_count s) { start = s; }
void set_end (tsc_tick_count e) { end = e; }
tsc_tick_count::tsc_interval_t duration() const { return end-start; }
std::string format() const;
};
extern tsc_tick_count::tsc_interval_t computeLastInLastOutInterval(timePair * times, int nTimes);
#endif // KMP_STATS_TIMING_H


@ -1,7 +1,7 @@
/*
* kmp_str.c -- String manipulation routines.
* $Revision: 42810 $
* $Date: 2013-11-07 12:06:33 -0600 (Thu, 07 Nov 2013) $
* $Revision: 43084 $
* $Date: 2014-04-15 09:15:14 -0500 (Tue, 15 Apr 2014) $
*/


@ -1,7 +1,7 @@
/*
* kmp_str.h -- String manipulation routines.
* $Revision: 42613 $
* $Date: 2013-08-23 13:29:50 -0500 (Fri, 23 Aug 2013) $
* $Revision: 43435 $
* $Date: 2014-09-04 15:16:08 -0500 (Thu, 04 Sep 2014) $
*/


@ -1,7 +1,7 @@
/*
* kmp_stub.c -- stub versions of user-callable OpenMP RT functions.
* $Revision: 42826 $
* $Date: 2013-11-20 03:39:45 -0600 (Wed, 20 Nov 2013) $
* $Revision: 42951 $
* $Date: 2014-01-21 14:41:41 -0600 (Tue, 21 Jan 2014) $
*/
@ -15,13 +15,13 @@
//===----------------------------------------------------------------------===//
#include "kmp_stub.h"
#include <stdlib.h>
#include <limits.h>
#include <errno.h>
#include "kmp_os.h" // KMP_OS_*
#include "omp.h" // Function renamings.
#include "kmp.h" // KMP_DEFAULT_STKSIZE
#include "kmp_stub.h"
#if KMP_OS_WINDOWS
#include <windows.h>
@ -29,20 +29,12 @@
#include <sys/time.h>
#endif
#include "omp.h" // Function renamings.
#include "kmp.h" // KMP_DEFAULT_STKSIZE
#include "kmp_version.h"
// Moved from omp.h
#if OMP_30_ENABLED
#define omp_set_max_active_levels ompc_set_max_active_levels
#define omp_set_schedule ompc_set_schedule
#define omp_get_ancestor_thread_num ompc_get_ancestor_thread_num
#define omp_get_team_size ompc_get_team_size
#endif // OMP_30_ENABLED
#define omp_set_num_threads ompc_set_num_threads
#define omp_set_dynamic ompc_set_dynamic
#define omp_set_nested ompc_set_nested
@ -95,15 +87,13 @@ static size_t __kmps_init() {
void omp_set_num_threads( omp_int_t num_threads ) { i; }
void omp_set_dynamic( omp_int_t dynamic ) { i; __kmps_set_dynamic( dynamic ); }
void omp_set_nested( omp_int_t nested ) { i; __kmps_set_nested( nested ); }
#if OMP_30_ENABLED
void omp_set_max_active_levels( omp_int_t max_active_levels ) { i; }
void omp_set_schedule( omp_sched_t kind, omp_int_t modifier ) { i; __kmps_set_schedule( (kmp_sched_t)kind, modifier ); }
int omp_get_ancestor_thread_num( omp_int_t level ) { i; return ( level ) ? ( -1 ) : ( 0 ); }
int omp_get_team_size( omp_int_t level ) { i; return ( level ) ? ( -1 ) : ( 1 ); }
int kmpc_set_affinity_mask_proc( int proc, void **mask ) { i; return -1; }
int kmpc_unset_affinity_mask_proc( int proc, void **mask ) { i; return -1; }
int kmpc_get_affinity_mask_proc( int proc, void **mask ) { i; return -1; }
#endif // OMP_30_ENABLED
void omp_set_max_active_levels( omp_int_t max_active_levels ) { i; }
void omp_set_schedule( omp_sched_t kind, omp_int_t modifier ) { i; __kmps_set_schedule( (kmp_sched_t)kind, modifier ); }
int omp_get_ancestor_thread_num( omp_int_t level ) { i; return ( level ) ? ( -1 ) : ( 0 ); }
int omp_get_team_size( omp_int_t level ) { i; return ( level ) ? ( -1 ) : ( 1 ); }
int kmpc_set_affinity_mask_proc( int proc, void **mask ) { i; return -1; }
int kmpc_unset_affinity_mask_proc( int proc, void **mask ) { i; return -1; }
int kmpc_get_affinity_mask_proc( int proc, void **mask ) { i; return -1; }
/* kmp API functions */
void kmp_set_stacksize( omp_int_t arg ) { i; __kmps_set_stacksize( arg ); }
@ -178,8 +168,6 @@ int __kmps_get_stacksize( void ) {
return __kmps_stacksize;
} // __kmps_get_stacksize
#if OMP_30_ENABLED
static kmp_sched_t __kmps_sched_kind = kmp_sched_default;
static int __kmps_sched_modifier = 0;
@ -195,8 +183,6 @@ static int __kmps_sched_modifier = 0;
*modifier = __kmps_sched_modifier;
} // __kmps_get_schedule
#endif // OMP_30_ENABLED
#if OMP_40_ENABLED
static kmp_proc_bind_t __kmps_proc_bind = proc_bind_false;


@ -1,7 +1,7 @@
/*
* kmp_stub.h
* $Revision: 42061 $
* $Date: 2013-02-28 16:36:24 -0600 (Thu, 28 Feb 2013) $
* $Revision: 42951 $
* $Date: 2014-01-21 14:41:41 -0600 (Tue, 21 Jan 2014) $
*/
@ -33,7 +33,6 @@ int __kmps_get_nested( void );
void __kmps_set_stacksize( int arg );
int __kmps_get_stacksize();
#if OMP_30_ENABLED
#ifndef KMP_SCHED_TYPE_DEFINED
#define KMP_SCHED_TYPE_DEFINED
typedef enum kmp_sched {
@ -46,11 +45,10 @@ typedef enum kmp_sched {
#endif
void __kmps_set_schedule( kmp_sched_t kind, int modifier );
void __kmps_get_schedule( kmp_sched_t *kind, int *modifier );
#endif // OMP_30_ENABLED
#if OMP_40_ENABLED
void __kmps_set_proc_bind( enum kmp_proc_bind_t arg );
enum kmp_proc_bind_t __kmps_get_proc_bind( void );
void __kmps_set_proc_bind( kmp_proc_bind_t arg );
kmp_proc_bind_t __kmps_get_proc_bind( void );
#endif /* OMP_40_ENABLED */
double __kmps_get_wtime();


@ -19,6 +19,7 @@
#include "kmp.h"
#include "kmp_io.h"
#include "kmp_wait_release.h"
#if OMP_40_ENABLED
@ -88,20 +89,20 @@ static kmp_dephash_t *
__kmp_dephash_create ( kmp_info_t *thread )
{
kmp_dephash_t *h;
kmp_int32 size = kmp_dephash_size * sizeof(kmp_dephash_entry_t) + sizeof(kmp_dephash_t);
#if USE_FAST_MEMORY
h = (kmp_dephash_t *) __kmp_fast_allocate( thread, size );
#else
h = (kmp_dephash_t *) __kmp_thread_malloc( thread, size );
#endif
#ifdef KMP_DEBUG
#ifdef KMP_DEBUG
h->nelements = 0;
#endif
h->buckets = (kmp_dephash_entry **)(h+1);
for ( kmp_int32 i = 0; i < kmp_dephash_size; i++ )
h->buckets[i] = 0;
@ -137,11 +138,11 @@ static kmp_dephash_entry *
__kmp_dephash_find ( kmp_info_t *thread, kmp_dephash_t *h, kmp_intptr_t addr )
{
kmp_int32 bucket = __kmp_dephash_hash(addr);
kmp_dephash_entry_t *entry;
for ( entry = h->buckets[bucket]; entry; entry = entry->next_in_bucket )
if ( entry->addr == addr ) break;
if ( entry == NULL ) {
// create entry. This is only done by one thread so no locking required
#if USE_FAST_MEMORY
@ -212,6 +213,8 @@ static inline kmp_int32
__kmp_process_deps ( kmp_int32 gtid, kmp_depnode_t *node, kmp_dephash_t *hash,
bool dep_barrier,kmp_int32 ndeps, kmp_depend_info_t *dep_list)
{
KA_TRACE(30, ("__kmp_process_deps<%d>: T#%d processing %d depencies : dep_barrier = %d\n", filter, gtid, ndeps, dep_barrier ) );
kmp_info_t *thread = __kmp_threads[ gtid ];
kmp_int32 npredecessors=0;
for ( kmp_int32 i = 0; i < ndeps ; i++ ) {
@ -232,6 +235,8 @@ __kmp_process_deps ( kmp_int32 gtid, kmp_depnode_t *node, kmp_dephash_t *hash,
if ( indep->dn.task ) {
__kmp_track_dependence(indep,node);
indep->dn.successors = __kmp_add_node(thread, indep->dn.successors, node);
KA_TRACE(40,("__kmp_process_deps<%d>: T#%d adding dependence from %p to %p",
filter,gtid, KMP_TASK_TO_TASKDATA(indep->dn.task), KMP_TASK_TO_TASKDATA(node->dn.task)));
npredecessors++;
}
KMP_RELEASE_DEPNODE(gtid,indep);
@ -246,13 +251,16 @@ __kmp_process_deps ( kmp_int32 gtid, kmp_depnode_t *node, kmp_dephash_t *hash,
if ( last_out->dn.task ) {
__kmp_track_dependence(last_out,node);
last_out->dn.successors = __kmp_add_node(thread, last_out->dn.successors, node);
KA_TRACE(40,("__kmp_process_deps<%d>: T#%d adding dependence from %p to %p",
filter,gtid, KMP_TASK_TO_TASKDATA(last_out->dn.task), KMP_TASK_TO_TASKDATA(node->dn.task)));
npredecessors++;
}
KMP_RELEASE_DEPNODE(gtid,last_out);
}
if ( dep_barrier ) {
// if this is a sync point in the serial sequence and previous outputs are guaranteed to be completed after
// if this is a sync point in the serial sequence, then the previous outputs are guaranteed to be completed after
// the execution of this task so the previous output nodes can be cleared.
__kmp_node_deref(thread,last_out);
info->last_out = NULL;
@ -265,6 +273,9 @@ __kmp_process_deps ( kmp_int32 gtid, kmp_depnode_t *node, kmp_dephash_t *hash,
}
}
KA_TRACE(30, ("__kmp_process_deps<%d>: T#%d found %d predecessors\n", filter, gtid, npredecessors ) );
return npredecessors;
}
@ -278,7 +289,10 @@ __kmp_check_deps ( kmp_int32 gtid, kmp_depnode_t *node, kmp_task_t *task, kmp_de
kmp_int32 ndeps_noalias, kmp_depend_info_t *noalias_dep_list )
{
int i;
kmp_taskdata_t * taskdata = KMP_TASK_TO_TASKDATA(task);
KA_TRACE(20, ("__kmp_check_deps: T#%d checking dependencies for task %p : %d possibly aliased dependencies, %d non-aliased depedencies : dep_barrier=%d .\n", gtid, taskdata, ndeps, ndeps_noalias, dep_barrier ) );
// Filter deps in dep_list
// TODO: Different algorithm for large dep_list ( > 10 ? )
for ( i = 0; i < ndeps; i ++ ) {
@ -292,8 +306,8 @@ __kmp_check_deps ( kmp_int32 gtid, kmp_depnode_t *node, kmp_task_t *task, kmp_de
}
// doesn't need to be atomic as no other thread is going to be accessing this node just yet
// npredecessors is set 1 to ensure that none of the releasing tasks queues this task before we have finished processing all the dependencies
node->dn.npredecessors = 1;
// npredecessors is set -1 to ensure that none of the releasing tasks queues this task before we have finished processing all the dependencies
node->dn.npredecessors = -1;
// used to pack all npredecessors additions into a single atomic operation at the end
int npredecessors;
@ -301,12 +315,16 @@ __kmp_check_deps ( kmp_int32 gtid, kmp_depnode_t *node, kmp_task_t *task, kmp_de
npredecessors = __kmp_process_deps<true>(gtid, node, hash, dep_barrier, ndeps, dep_list);
npredecessors += __kmp_process_deps<false>(gtid, node, hash, dep_barrier, ndeps_noalias, noalias_dep_list);
KMP_TEST_THEN_ADD32(&node->dn.npredecessors, npredecessors);
// Remove the fake predecessor and find out if there's any outstanding dependence (some tasks may have finished while we processed the dependences)
node->dn.task = task;
KMP_MB();
npredecessors = KMP_TEST_THEN_DEC32(&node->dn.npredecessors) - 1;
// Account for our initial fake value
npredecessors++;
// Update predecessors and obtain current value to check if there are still any outstanding dependences (some tasks may have finished while we processed the dependences)
npredecessors = KMP_TEST_THEN_ADD32(&node->dn.npredecessors, npredecessors) + npredecessors;
KA_TRACE(20, ("__kmp_check_deps: T#%d found %d predecessors for task %p \n", gtid, npredecessors, taskdata ) );
// beyond this point the task could be queued (and executed) by a releasing task...
return npredecessors > 0 ? true : false;
@ -318,11 +336,15 @@ __kmp_release_deps ( kmp_int32 gtid, kmp_taskdata_t *task )
kmp_info_t *thread = __kmp_threads[ gtid ];
kmp_depnode_t *node = task->td_depnode;
if ( task->td_dephash )
if ( task->td_dephash ) {
KA_TRACE(40, ("__kmp_realease_deps: T#%d freeing dependencies hash of task %p.\n", gtid, task ) );
__kmp_dephash_free(thread,task->td_dephash);
}
if ( !node ) return;
KA_TRACE(20, ("__kmp_realease_deps: T#%d notifying succesors of task %p.\n", gtid, task ) );
KMP_ACQUIRE_DEPNODE(gtid,node);
node->dn.task = NULL; // mark this task as finished, so no new dependencies are generated
KMP_RELEASE_DEPNODE(gtid,node);
@ -335,9 +357,10 @@ __kmp_release_deps ( kmp_int32 gtid, kmp_taskdata_t *task )
// successor task can be NULL for wait_depends or because deps are still being processed
if ( npredecessors == 0 ) {
KMP_MB();
if ( successor->dn.task )
// loc_ref was already stored in successor's task_data
__kmpc_omp_task(NULL,gtid,successor->dn.task);
if ( successor->dn.task ) {
KA_TRACE(20, ("__kmp_realease_deps: T#%d successor %p of %p scheduled for execution.\n", gtid, successor->dn.task, task ) );
__kmp_omp_task(gtid,successor->dn.task,false);
}
}
next = p->next;
@ -350,6 +373,8 @@ __kmp_release_deps ( kmp_int32 gtid, kmp_taskdata_t *task )
}
__kmp_node_deref(thread,node);
KA_TRACE(20, ("__kmp_realease_deps: T#%d all successors of %p notified of completation\n", gtid, task ) );
}
/*!
@ -368,15 +393,20 @@ Schedule a non-thread-switchable task with dependences for execution
*/
kmp_int32
__kmpc_omp_task_with_deps( ident_t *loc_ref, kmp_int32 gtid, kmp_task_t * new_task,
kmp_int32 ndeps, kmp_depend_info_t *dep_list,
kmp_int32 ndeps_noalias, kmp_depend_info_t *noalias_dep_list )
kmp_int32 ndeps, kmp_depend_info_t *dep_list,
kmp_int32 ndeps_noalias, kmp_depend_info_t *noalias_dep_list )
{
kmp_taskdata_t * new_taskdata = KMP_TASK_TO_TASKDATA(new_task);
KA_TRACE(10, ("__kmpc_omp_task_with_deps(enter): T#%d loc=%p task=%p\n",
gtid, loc_ref, new_taskdata ) );
kmp_info_t *thread = __kmp_threads[ gtid ];
kmp_taskdata_t * current_task = thread->th.th_current_task;
bool serial = current_task->td_flags.team_serial || current_task->td_flags.tasking_ser || current_task->td_flags.final;
if ( !serial && ( ndeps > 0 || ndeps_noalias > 0 )) {
if ( !serial && ( ndeps > 0 || ndeps_noalias > 0 )) {
/* if no dependencies have been tracked yet, create the dependence hash */
if ( current_task->td_dephash == NULL )
current_task->td_dephash = __kmp_dephash_create(thread);
@ -388,13 +418,21 @@ __kmpc_omp_task_with_deps( ident_t *loc_ref, kmp_int32 gtid, kmp_task_t * new_ta
#endif
__kmp_init_node(node);
KMP_TASK_TO_TASKDATA(new_task)->td_depnode = node;
new_taskdata->td_depnode = node;
if ( __kmp_check_deps( gtid, node, new_task, current_task->td_dephash, NO_DEP_BARRIER,
ndeps, dep_list, ndeps_noalias,noalias_dep_list ) )
ndeps, dep_list, ndeps_noalias,noalias_dep_list ) ) {
KA_TRACE(10, ("__kmpc_omp_task_with_deps(exit): T#%d task had blocking dependencies: "
"loc=%p task=%p, return: TASK_CURRENT_NOT_QUEUED\n", gtid, loc_ref,
new_taskdata ) );
return TASK_CURRENT_NOT_QUEUED;
}
}
KA_TRACE(10, ("__kmpc_omp_task_with_deps(exit): T#%d task had no blocking dependencies : "
"loc=%p task=%p, transferring to __kmpc_omp_task\n", gtid, loc_ref,
new_taskdata ) );
return __kmpc_omp_task(loc_ref,gtid,new_task);
}
@ -413,35 +451,44 @@ void
__kmpc_omp_wait_deps ( ident_t *loc_ref, kmp_int32 gtid, kmp_int32 ndeps, kmp_depend_info_t *dep_list,
kmp_int32 ndeps_noalias, kmp_depend_info_t *noalias_dep_list )
{
if ( ndeps == 0 && ndeps_noalias == 0 ) return;
KA_TRACE(10, ("__kmpc_omp_wait_deps(enter): T#%d loc=%p\n", gtid, loc_ref) );
if ( ndeps == 0 && ndeps_noalias == 0 ) {
KA_TRACE(10, ("__kmpc_omp_wait_deps(exit): T#%d has no dependencies to wait upon : loc=%p\n", gtid, loc_ref) );
return;
}
kmp_info_t *thread = __kmp_threads[ gtid ];
kmp_taskdata_t * current_task = thread->th.th_current_task;
// dependences are not computed in serial teams
if ( current_task->td_flags.team_serial || current_task->td_flags.tasking_ser || current_task->td_flags.final)
// We can return immediately as:
// - dependences are not computed in serial teams
// - if the dephash is not yet created it means we have nothing to wait for
if ( current_task->td_flags.team_serial || current_task->td_flags.tasking_ser || current_task->td_flags.final || current_task->td_dephash == NULL ) {
KA_TRACE(10, ("__kmpc_omp_wait_deps(exit): T#%d has no blocking dependencies : loc=%p\n", gtid, loc_ref) );
return;
// if the dephash is not yet created it means we have nothing to wait for
if ( current_task->td_dephash == NULL ) return;
}
kmp_depnode_t node;
__kmp_init_node(&node);
if (!__kmp_check_deps( gtid, &node, NULL, current_task->td_dephash, DEP_BARRIER,
ndeps, dep_list, ndeps_noalias, noalias_dep_list ))
ndeps, dep_list, ndeps_noalias, noalias_dep_list )) {
KA_TRACE(10, ("__kmpc_omp_wait_deps(exit): T#%d has no blocking dependencies : loc=%p\n", gtid, loc_ref) );
return;
int thread_finished = FALSE;
while ( node.dn.npredecessors > 0 ) {
__kmp_execute_tasks( thread, gtid, (volatile kmp_uint32 *)&(node.dn.npredecessors),
0, FALSE, &thread_finished,
#if USE_ITT_BUILD
NULL,
#endif
__kmp_task_stealing_constraint );
}
int thread_finished = FALSE;
kmp_flag_32 flag((volatile kmp_uint32 *)&(node.dn.npredecessors), 0U);
while ( node.dn.npredecessors > 0 ) {
flag.execute_tasks(thread, gtid, FALSE, &thread_finished,
#if USE_ITT_BUILD
NULL,
#endif
__kmp_task_stealing_constraint );
}
KA_TRACE(10, ("__kmpc_omp_wait_deps(exit): T#%d finished waiting : loc=%p\n", gtid, loc_ref) );
}
#endif /* OMP_40_ENABLED */


@ -1,7 +1,7 @@
/*
* kmp_tasking.c -- OpenMP 3.0 tasking support.
* $Revision: 42852 $
* $Date: 2013-12-04 10:50:49 -0600 (Wed, 04 Dec 2013) $
* $Revision: 43389 $
* $Date: 2014-08-11 10:54:01 -0500 (Mon, 11 Aug 2014) $
*/
@ -18,9 +18,9 @@
#include "kmp.h"
#include "kmp_i18n.h"
#include "kmp_itt.h"
#include "kmp_wait_release.h"
#if OMP_30_ENABLED
/* ------------------------------------------------------------------------ */
/* ------------------------------------------------------------------------ */
@ -31,26 +31,12 @@ static void __kmp_enable_tasking( kmp_task_team_t *task_team, kmp_info_t *this_t
static void __kmp_alloc_task_deque( kmp_info_t *thread, kmp_thread_data_t *thread_data );
static int __kmp_realloc_task_threads_data( kmp_info_t *thread, kmp_task_team_t *task_team );
#ifndef KMP_DEBUG
# define __kmp_static_delay( arg ) /* nothing to do */
#else
static void
__kmp_static_delay( int arg )
{
/* Work around weird code-gen bug that causes assert to trip */
# if KMP_ARCH_X86_64 && KMP_OS_LINUX
KMP_ASSERT( arg != 0 );
# else
KMP_ASSERT( arg >= 0 );
# endif
}
#endif /* KMP_DEBUG */
static void
__kmp_static_yield( int arg )
{
__kmp_yield( arg );
static inline void __kmp_null_resume_wrapper(int gtid, volatile void *flag) {
switch (((kmp_flag_64 *)flag)->get_type()) {
case flag32: __kmp_resume_32(gtid, NULL); break;
case flag64: __kmp_resume_64(gtid, NULL); break;
case flag_oncore: __kmp_resume_oncore(gtid, NULL); break;
}
}
#ifdef BUILD_TIED_TASK_STACK
@ -605,9 +591,7 @@ __kmp_task_finish( kmp_int32 gtid, kmp_task_t *task, kmp_taskdata_t *resumed_tas
}
#endif /* BUILD_TIED_TASK_STACK */
KMP_DEBUG_ASSERT( taskdata -> td_flags.executing == 1 );
KMP_DEBUG_ASSERT( taskdata -> td_flags.complete == 0 );
taskdata -> td_flags.executing = 0; // suspend the finishing task
taskdata -> td_flags.complete = 1; // mark the task as completed
KMP_DEBUG_ASSERT( taskdata -> td_flags.started == 1 );
KMP_DEBUG_ASSERT( taskdata -> td_flags.freed == 0 );
@ -624,6 +608,12 @@ __kmp_task_finish( kmp_int32 gtid, kmp_task_t *task, kmp_taskdata_t *resumed_tas
#endif
}
// td_flags.executing must be marked as 0 after __kmp_release_deps has been called
// Otherwise, if a task is executed immediately from the release_deps code
// the flag will be reset to 1 again by this same function
KMP_DEBUG_ASSERT( taskdata -> td_flags.executing == 1 );
taskdata -> td_flags.executing = 0; // suspend the finishing task
KA_TRACE(20, ("__kmp_task_finish: T#%d finished task %p, %d incomplete children\n",
gtid, taskdata, children) );
@ -908,7 +898,7 @@ __kmp_task_alloc( ident_t *loc_ref, kmp_int32 gtid, kmp_tasking_flags_t *flags,
taskdata->td_taskgroup = parent_task->td_taskgroup; // task inherits the taskgroup from the parent task
taskdata->td_dephash = NULL;
taskdata->td_depnode = NULL;
#endif
#endif
// Only need to keep track of child task counts if team parallel and tasking not serialized
if ( !( taskdata -> td_flags.team_serial || taskdata -> td_flags.tasking_ser ) ) {
KMP_TEST_THEN_INC32( (kmp_int32 *)(& parent_task->td_incomplete_child_tasks) );
@ -1047,9 +1037,38 @@ __kmpc_omp_task_parts( ident_t *loc_ref, kmp_int32 gtid, kmp_task_t * new_task)
return TASK_CURRENT_NOT_QUEUED;
}
//---------------------------------------------------------------------
// __kmp_omp_task: Schedule a non-thread-switchable task for execution
// gtid: Global Thread ID of encountering thread
// new_task: non-thread-switchable task thunk allocated by __kmp_omp_task_alloc()
// serialize_immediate: if TRUE then if the task is executed immediately its execution will be serialized
// returns:
//
// TASK_CURRENT_NOT_QUEUED (0) if did not suspend and queue current task to be resumed later.
// TASK_CURRENT_QUEUED (1) if suspended and queued the current task to be resumed later.
kmp_int32
__kmp_omp_task( kmp_int32 gtid, kmp_task_t * new_task, bool serialize_immediate )
{
kmp_taskdata_t * new_taskdata = KMP_TASK_TO_TASKDATA(new_task);
/* Should we execute the new task or queue it? For now, let's just always try to
queue it. If the queue fills up, then we'll execute it. */
if ( __kmp_push_task( gtid, new_task ) == TASK_NOT_PUSHED ) // if cannot defer
{ // Execute this task immediately
kmp_taskdata_t * current_task = __kmp_threads[ gtid ] -> th.th_current_task;
if ( serialize_immediate )
new_taskdata -> td_flags.task_serial = 1;
__kmp_invoke_task( gtid, new_task, current_task );
}
return TASK_CURRENT_NOT_QUEUED;
}
//---------------------------------------------------------------------
// __kmpc_omp_task: Schedule a non-thread-switchable task for execution
// __kmpc_omp_task: Wrapper around __kmp_omp_task to schedule a non-thread-switchable task from
// the parent thread only!
// loc_ref: location of original task pragma (ignored)
// gtid: Global Thread ID of encountering thread
// new_task: non-thread-switchable task thunk allocated by __kmp_omp_task_alloc()
@ -1062,28 +1081,18 @@ kmp_int32
__kmpc_omp_task( ident_t *loc_ref, kmp_int32 gtid, kmp_task_t * new_task)
{
kmp_taskdata_t * new_taskdata = KMP_TASK_TO_TASKDATA(new_task);
kmp_int32 rc;
kmp_int32 res;
KA_TRACE(10, ("__kmpc_omp_task(enter): T#%d loc=%p task=%p\n",
gtid, loc_ref, new_taskdata ) );
/* Should we execute the new task or queue it? For now, let's just always try to
queue it. If the queue fills up, then we'll execute it. */
if ( __kmp_push_task( gtid, new_task ) == TASK_NOT_PUSHED ) // if cannot defer
{ // Execute this task immediately
kmp_taskdata_t * current_task = __kmp_threads[ gtid ] -> th.th_current_task;
new_taskdata -> td_flags.task_serial = 1;
__kmp_invoke_task( gtid, new_task, current_task );
}
res = __kmp_omp_task(gtid,new_task,true);
KA_TRACE(10, ("__kmpc_omp_task(exit): T#%d returning TASK_CURRENT_NOT_QUEUED: loc=%p task=%p\n",
gtid, loc_ref, new_taskdata ) );
return TASK_CURRENT_NOT_QUEUED;
return res;
}
//-------------------------------------------------------------------------------------
// __kmpc_omp_taskwait: Wait until all tasks generated by the current task are complete
@ -1117,11 +1126,10 @@ __kmpc_omp_taskwait( ident_t *loc_ref, kmp_int32 gtid )
if ( ! taskdata->td_flags.team_serial ) {
// GEH: if team serialized, avoid reading the volatile variable below.
kmp_flag_32 flag(&(taskdata->td_incomplete_child_tasks), 0U);
while ( TCR_4(taskdata -> td_incomplete_child_tasks) != 0 ) {
__kmp_execute_tasks( thread, gtid, &(taskdata->td_incomplete_child_tasks),
0, FALSE, &thread_finished
USE_ITT_BUILD_ARG(itt_sync_obj),
__kmp_task_stealing_constraint );
flag.execute_tasks(thread, gtid, FALSE, &thread_finished
USE_ITT_BUILD_ARG(itt_sync_obj), __kmp_task_stealing_constraint );
}
}
#if USE_ITT_BUILD
@ -1153,7 +1161,7 @@ __kmpc_omp_taskyield( ident_t *loc_ref, kmp_int32 gtid, int end_part )
KA_TRACE(10, ("__kmpc_omp_taskyield(enter): T#%d loc=%p end_part = %d\n",
gtid, loc_ref, end_part) );
if ( __kmp_tasking_mode != tskm_immediate_exec ) {
if ( __kmp_tasking_mode != tskm_immediate_exec && __kmp_init_parallel ) {
// GEH TODO: shouldn't we have some sort of OMPRAP API calls here to mark begin wait?
thread = __kmp_threads[ gtid ];
@ -1172,11 +1180,14 @@ __kmpc_omp_taskyield( ident_t *loc_ref, kmp_int32 gtid, int end_part )
__kmp_itt_taskwait_starting( gtid, itt_sync_obj );
#endif /* USE_ITT_BUILD */
if ( ! taskdata->td_flags.team_serial ) {
__kmp_execute_tasks( thread, gtid, NULL, 0, FALSE, &thread_finished
USE_ITT_BUILD_ARG(itt_sync_obj),
__kmp_task_stealing_constraint );
kmp_task_team_t * task_team = thread->th.th_task_team;
if (task_team != NULL) {
if (KMP_TASKING_ENABLED(task_team, thread->th.th_task_state)) {
__kmp_execute_tasks_32( thread, gtid, NULL, FALSE, &thread_finished
USE_ITT_BUILD_ARG(itt_sync_obj), __kmp_task_stealing_constraint );
}
}
}
#if USE_ITT_BUILD
if ( itt_sync_obj != NULL )
__kmp_itt_taskwait_finished( gtid, itt_sync_obj );
@ -1236,11 +1247,10 @@ __kmpc_end_taskgroup( ident_t* loc, int gtid )
#endif /* USE_ITT_BUILD */
if ( ! taskdata->td_flags.team_serial ) {
kmp_flag_32 flag(&(taskgroup->count), 0U);
while ( TCR_4(taskgroup->count) != 0 ) {
__kmp_execute_tasks( thread, gtid, &(taskgroup->count),
0, FALSE, &thread_finished
USE_ITT_BUILD_ARG(itt_sync_obj),
__kmp_task_stealing_constraint );
flag.execute_tasks(thread, gtid, FALSE, &thread_finished
USE_ITT_BUILD_ARG(itt_sync_obj), __kmp_task_stealing_constraint );
}
}
@ -1433,7 +1443,7 @@ __kmp_steal_task( kmp_info_t *victim, kmp_int32 gtid, kmp_task_team_t *task_team
__kmp_release_bootstrap_lock( & victim_td -> td.td_deque_lock );
KA_TRACE(10, ("__kmp_steal_task(exit #3): T#%d stole task %p from T#d: task_team=%p "
KA_TRACE(10, ("__kmp_steal_task(exit #3): T#%d stole task %p from T#%d: task_team=%p "
"ntasks=%d head=%u tail=%u\n",
gtid, taskdata, __kmp_gtid_from_thread( victim ), task_team,
victim_td->td.td_deque_ntasks, victim_td->td.td_deque_head,
@ -1445,7 +1455,7 @@ __kmp_steal_task( kmp_info_t *victim, kmp_int32 gtid, kmp_task_team_t *task_team
//-----------------------------------------------------------------------------
// __kmp_execute_tasks: Choose and execute tasks until either the condition
// __kmp_execute_tasks_template: Choose and execute tasks until either the condition
// is satisfied (return true) or there are none left (return false).
// final_spin is TRUE if this is the spin at the release barrier.
// thread_finished indicates whether the thread is finished executing all
@ -1453,16 +1463,10 @@ __kmp_steal_task( kmp_info_t *victim, kmp_int32 gtid, kmp_task_team_t *task_team
// spinner is the location on which to spin.
// spinner == NULL means only execute a single task and return.
// checker is the value to check to terminate the spin.
int
__kmp_execute_tasks( kmp_info_t *thread,
kmp_int32 gtid,
volatile kmp_uint *spinner,
kmp_uint checker,
int final_spin,
int *thread_finished
USE_ITT_BUILD_ARG(void * itt_sync_obj),
kmp_int32 is_constrained )
template <class C>
static inline int __kmp_execute_tasks_template(kmp_info_t *thread, kmp_int32 gtid, C *flag, int final_spin,
int *thread_finished
USE_ITT_BUILD_ARG(void * itt_sync_obj), kmp_int32 is_constrained)
{
kmp_task_team_t * task_team;
kmp_team_t * team;
@ -1478,7 +1482,7 @@ __kmp_execute_tasks( kmp_info_t *thread,
task_team = thread -> th.th_task_team;
KMP_DEBUG_ASSERT( task_team != NULL );
KA_TRACE(15, ("__kmp_execute_tasks(enter): T#%d final_spin=%d *thread_finished=%d\n",
KA_TRACE(15, ("__kmp_execute_tasks_template(enter): T#%d final_spin=%d *thread_finished=%d\n",
gtid, final_spin, *thread_finished) );
threads_data = (kmp_thread_data_t *)TCR_PTR(task_team -> tt.tt_threads_data);
@ -1512,8 +1516,8 @@ __kmp_execute_tasks( kmp_info_t *thread,
// If this thread is in the last spin loop in the barrier, waiting to be
// released, we know that the termination condition will not be satisfied,
// so don't waste any cycles checking it.
if ((spinner == NULL) || ((!final_spin) && (TCR_4(*spinner) == checker))) {
KA_TRACE(15, ("__kmp_execute_tasks(exit #1): T#%d spin condition satisfied\n", gtid) );
if (flag == NULL || (!final_spin && flag->done_check())) {
KA_TRACE(15, ("__kmp_execute_tasks_template(exit #1): T#%d spin condition satisfied\n", gtid) );
return TRUE;
}
KMP_YIELD( __kmp_library == library_throughput ); // Yield before executing next task
@ -1527,7 +1531,7 @@ __kmp_execute_tasks( kmp_info_t *thread,
// result in the termination condition being satisfied.
if (! *thread_finished) {
kmp_uint32 count = KMP_TEST_THEN_DEC32( (kmp_int32 *)unfinished_threads ) - 1;
KA_TRACE(20, ("__kmp_execute_tasks(dec #1): T#%d dec unfinished_threads to %d task_team=%p\n",
KA_TRACE(20, ("__kmp_execute_tasks_template(dec #1): T#%d dec unfinished_threads to %d task_team=%p\n",
gtid, count, task_team) );
*thread_finished = TRUE;
}
@ -1537,8 +1541,8 @@ __kmp_execute_tasks( kmp_info_t *thread,
// thread to pass through the barrier, where it might reset each thread's
// th.th_team field for the next parallel region.
// If we can steal more work, we know that this has not happened yet.
if ((spinner != NULL) && (TCR_4(*spinner) == checker)) {
KA_TRACE(15, ("__kmp_execute_tasks(exit #2): T#%d spin condition satisfied\n", gtid) );
if (flag != NULL && flag->done_check()) {
KA_TRACE(15, ("__kmp_execute_tasks_template(exit #2): T#%d spin condition satisfied\n", gtid) );
return TRUE;
}
}
@ -1569,8 +1573,8 @@ __kmp_execute_tasks( kmp_info_t *thread,
#endif /* USE_ITT_BUILD */
// Check to see if this thread can proceed.
if ((spinner == NULL) || ((!final_spin) && (TCR_4(*spinner) == checker))) {
KA_TRACE(15, ("__kmp_execute_tasks(exit #3): T#%d spin condition satisfied\n",
if (flag == NULL || (!final_spin && flag->done_check())) {
KA_TRACE(15, ("__kmp_execute_tasks_template(exit #3): T#%d spin condition satisfied\n",
gtid) );
return TRUE;
}
@ -1579,7 +1583,7 @@ __kmp_execute_tasks( kmp_info_t *thread,
// If the execution of the stolen task resulted in more tasks being
// placed on our run queue, then restart the whole process.
if (TCR_4(threads_data[ tid ].td.td_deque_ntasks) != 0) {
KA_TRACE(20, ("__kmp_execute_tasks: T#%d stolen task spawned other tasks, restart\n",
KA_TRACE(20, ("__kmp_execute_tasks_template: T#%d stolen task spawned other tasks, restart\n",
gtid) );
goto start;
}
@ -1596,7 +1600,7 @@ __kmp_execute_tasks( kmp_info_t *thread,
// result in the termination condition being satisfied.
if (! *thread_finished) {
kmp_uint32 count = KMP_TEST_THEN_DEC32( (kmp_int32 *)unfinished_threads ) - 1;
KA_TRACE(20, ("__kmp_execute_tasks(dec #2): T#%d dec unfinished_threads to %d "
KA_TRACE(20, ("__kmp_execute_tasks_template(dec #2): T#%d dec unfinished_threads to %d "
"task_team=%p\n", gtid, count, task_team) );
*thread_finished = TRUE;
}
@ -1607,8 +1611,8 @@ __kmp_execute_tasks( kmp_info_t *thread,
// thread to pass through the barrier, where it might reset each thread's
// th.th_team field for the next parallel region.
// If we can steal more work, we know that this has not happened yet.
if ((spinner != NULL) && (TCR_4(*spinner) == checker)) {
KA_TRACE(15, ("__kmp_execute_tasks(exit #4): T#%d spin condition satisfied\n",
if (flag != NULL && flag->done_check()) {
KA_TRACE(15, ("__kmp_execute_tasks_template(exit #4): T#%d spin condition satisfied\n",
gtid) );
return TRUE;
}
@ -1640,8 +1644,7 @@ __kmp_execute_tasks( kmp_info_t *thread,
(__kmp_dflt_blocktime != KMP_MAX_BLOCKTIME) &&
(TCR_PTR(other_thread->th.th_sleep_loc) != NULL))
{
__kmp_resume( __kmp_gtid_from_thread( other_thread ), NULL );
__kmp_null_resume_wrapper(__kmp_gtid_from_thread(other_thread), other_thread->th.th_sleep_loc);
// A sleeping thread should not have any tasks on its queue.
// There is a slight possibility that it resumes, steals a task from
// another thread, which spawns more tasks, all in the time that it takes
@ -1677,8 +1680,8 @@ __kmp_execute_tasks( kmp_info_t *thread,
}
// Check to see if this thread can proceed.
if ((spinner == NULL) || ((!final_spin) && (TCR_4(*spinner) == checker))) {
KA_TRACE(15, ("__kmp_execute_tasks(exit #5): T#%d spin condition satisfied\n",
if (flag == NULL || (!final_spin && flag->done_check())) {
KA_TRACE(15, ("__kmp_execute_tasks_template(exit #5): T#%d spin condition satisfied\n",
gtid) );
return TRUE;
}
@ -1687,7 +1690,7 @@ __kmp_execute_tasks( kmp_info_t *thread,
// If the execution of the stolen task resulted in more tasks being
// placed on our run queue, then restart the whole process.
if (TCR_4(threads_data[ tid ].td.td_deque_ntasks) != 0) {
KA_TRACE(20, ("__kmp_execute_tasks: T#%d stolen task spawned other tasks, restart\n",
KA_TRACE(20, ("__kmp_execute_tasks_template: T#%d stolen task spawned other tasks, restart\n",
gtid) );
goto start;
}
@ -1704,7 +1707,7 @@ __kmp_execute_tasks( kmp_info_t *thread,
// result in the termination condition being satisfied.
if (! *thread_finished) {
kmp_uint32 count = KMP_TEST_THEN_DEC32( (kmp_int32 *)unfinished_threads ) - 1;
KA_TRACE(20, ("__kmp_execute_tasks(dec #3): T#%d dec unfinished_threads to %d; "
KA_TRACE(20, ("__kmp_execute_tasks_template(dec #3): T#%d dec unfinished_threads to %d; "
"task_team=%p\n",
gtid, count, task_team) );
*thread_finished = TRUE;
@ -1716,18 +1719,42 @@ __kmp_execute_tasks( kmp_info_t *thread,
// thread to pass through the barrier, where it might reset each thread's
// th.th_team field for the next parallel region.
// If we can steal more work, we know that this has not happened yet.
if ((spinner != NULL) && (TCR_4(*spinner) == checker)) {
KA_TRACE(15, ("__kmp_execute_tasks(exit #6): T#%d spin condition satisfied\n",
gtid) );
if (flag != NULL && flag->done_check()) {
KA_TRACE(15, ("__kmp_execute_tasks_template(exit #6): T#%d spin condition satisfied\n", gtid) );
return TRUE;
}
}
}
KA_TRACE(15, ("__kmp_execute_tasks(exit #7): T#%d can't find work\n", gtid) );
KA_TRACE(15, ("__kmp_execute_tasks_template(exit #7): T#%d can't find work\n", gtid) );
return FALSE;
}
int __kmp_execute_tasks_32(kmp_info_t *thread, kmp_int32 gtid, kmp_flag_32 *flag, int final_spin,
int *thread_finished
USE_ITT_BUILD_ARG(void * itt_sync_obj), kmp_int32 is_constrained)
{
return __kmp_execute_tasks_template(thread, gtid, flag, final_spin, thread_finished
USE_ITT_BUILD_ARG(itt_sync_obj), is_constrained);
}
int __kmp_execute_tasks_64(kmp_info_t *thread, kmp_int32 gtid, kmp_flag_64 *flag, int final_spin,
int *thread_finished
USE_ITT_BUILD_ARG(void * itt_sync_obj), kmp_int32 is_constrained)
{
return __kmp_execute_tasks_template(thread, gtid, flag, final_spin, thread_finished
USE_ITT_BUILD_ARG(itt_sync_obj), is_constrained);
}
int __kmp_execute_tasks_oncore(kmp_info_t *thread, kmp_int32 gtid, kmp_flag_oncore *flag, int final_spin,
int *thread_finished
USE_ITT_BUILD_ARG(void * itt_sync_obj), kmp_int32 is_constrained)
{
return __kmp_execute_tasks_template(thread, gtid, flag, final_spin, thread_finished
USE_ITT_BUILD_ARG(itt_sync_obj), is_constrained);
}
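// The three wrappers above pin __kmp_execute_tasks_template to a concrete flag
// type; the flag classes in kmp_wait_release.h route their execute_tasks()
// methods back through them. Sketch of the equivalence (ITT disabled, count
// being a hypothetical kmp_uint32):
//     kmp_flag_32 flag( &count, 0U );
//     flag.execute_tasks( thread, gtid, FALSE, &thread_finished, is_constrained );
// behaves identically to
//     __kmp_execute_tasks_32( thread, gtid, &flag, FALSE, &thread_finished, is_constrained );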
//-----------------------------------------------------------------------------
// __kmp_enable_tasking: Allocate task team and resume threads sleeping at the
@ -1770,7 +1797,7 @@ __kmp_enable_tasking( kmp_task_team_t *task_team, kmp_info_t *this_thr )
// tasks and execute them. In extra barrier mode, tasks do not sleep
// at the separate tasking barrier, so this isn't a problem.
for (i = 0; i < nthreads; i++) {
volatile kmp_uint *sleep_loc;
volatile void *sleep_loc;
kmp_info_t *thread = threads_data[i].td.td_thr;
if (i == this_thr->th.th_info.ds.ds_tid) {
@ -1779,17 +1806,16 @@ __kmp_enable_tasking( kmp_task_team_t *task_team, kmp_info_t *this_thr )
// Since we haven't locked the thread's suspend mutex lock at this
// point, there is a small window where a thread might be putting
// itself to sleep, but hasn't set the th_sleep_loc field yet.
// To work around this, __kmp_execute_tasks() periodically checks
// To work around this, __kmp_execute_tasks_template() periodically checks
// to see if other threads are sleeping (using the same random
// mechanism that is used for task stealing) and awakens them if
// they are.
if ( ( sleep_loc = (volatile kmp_uint *)
TCR_PTR( thread -> th.th_sleep_loc) ) != NULL )
if ( ( sleep_loc = TCR_PTR( thread -> th.th_sleep_loc) ) != NULL )
{
KF_TRACE( 50, ( "__kmp_enable_tasking: T#%d waking up thread T#%d\n",
__kmp_gtid_from_thread( this_thr ),
__kmp_gtid_from_thread( thread ) ) );
__kmp_resume( __kmp_gtid_from_thread( thread ), sleep_loc );
__kmp_null_resume_wrapper(__kmp_gtid_from_thread(thread), sleep_loc);
}
else {
KF_TRACE( 50, ( "__kmp_enable_tasking: T#%d don't wake up thread T#%d\n",
@ -1805,7 +1831,7 @@ __kmp_enable_tasking( kmp_task_team_t *task_team, kmp_info_t *this_thr )
/* ------------------------------------------------------------------------ */
/*
/* // TODO: Check the comment consistency
* Utility routines for "task teams". A task team (kmp_task_t) is kind of
* like a shadow of the kmp_team_t data struct, with a different lifetime.
* After a child thread checks into a barrier and calls __kmp_release() from
@ -1839,6 +1865,7 @@ __kmp_enable_tasking( kmp_task_team_t *task_team, kmp_info_t *this_thr )
* barriers, when no explicit tasks were spawned (pushed, actually).
*/
static kmp_task_team_t *__kmp_free_task_teams = NULL; // Free list for task_team data structures
// Lock for task team data structures
static kmp_bootstrap_lock_t __kmp_task_team_lock = KMP_BOOTSTRAP_LOCK_INITIALIZER( __kmp_task_team_lock );
@ -2193,7 +2220,6 @@ __kmp_wait_to_unref_task_teams(void)
thread != NULL;
thread = thread->th.th_next_pool)
{
volatile kmp_uint *sleep_loc;
#if KMP_OS_WINDOWS
DWORD exit_val;
#endif
@ -2218,11 +2244,12 @@ __kmp_wait_to_unref_task_teams(void)
__kmp_gtid_from_thread( thread ) ) );
if ( __kmp_dflt_blocktime != KMP_MAX_BLOCKTIME ) {
volatile void *sleep_loc;
// If the thread is sleeping, awaken it.
if ( ( sleep_loc = (volatile kmp_uint *) TCR_PTR( thread->th.th_sleep_loc) ) != NULL ) {
if ( ( sleep_loc = TCR_PTR( thread->th.th_sleep_loc) ) != NULL ) {
KA_TRACE( 10, ( "__kmp_wait_to_unref_task_team: T#%d waking up thread T#%d\n",
__kmp_gtid_from_thread( thread ), __kmp_gtid_from_thread( thread ) ) );
__kmp_resume( __kmp_gtid_from_thread( thread ), sleep_loc );
__kmp_null_resume_wrapper(__kmp_gtid_from_thread(thread), sleep_loc);
}
}
}
@ -2350,9 +2377,9 @@ __kmp_task_team_wait( kmp_info_t *this_thr,
// contention, only the master thread checks for the
// termination condition.
//
__kmp_wait_sleep( this_thr, &task_team->tt.tt_unfinished_threads, 0, TRUE
USE_ITT_BUILD_ARG(itt_sync_obj)
);
kmp_flag_32 flag(&task_team->tt.tt_unfinished_threads, 0U);
flag.wait(this_thr, TRUE
USE_ITT_BUILD_ARG(itt_sync_obj));
//
// Kill the old task team, so that the worker threads will
@ -2390,8 +2417,9 @@ __kmp_tasking_barrier( kmp_team_t *team, kmp_info_t *thread, int gtid )
#if USE_ITT_BUILD
KMP_FSYNC_SPIN_INIT( spin, (kmp_uint32*) NULL );
#endif /* USE_ITT_BUILD */
while (! __kmp_execute_tasks( thread, gtid, spin, 0, TRUE, &flag
USE_ITT_BUILD_ARG(NULL), 0 ) ) {
kmp_flag_32 spin_flag(spin, 0U);
while (! spin_flag.execute_tasks(thread, gtid, TRUE, &flag
USE_ITT_BUILD_ARG(NULL), 0 ) ) {
#if USE_ITT_BUILD
// TODO: What about itt_sync_obj??
KMP_FSYNC_SPIN_PREPARE( spin );
@ -2409,5 +2437,3 @@ __kmp_tasking_barrier( kmp_team_t *team, kmp_info_t *thread, int gtid )
#endif /* USE_ITT_BUILD */
}
#endif // OMP_30_ENABLED

--- kmp_taskq.c ---

@ -1,7 +1,7 @@
/*
* kmp_taskq.c -- TASKQ support for OpenMP.
* $Revision: 42582 $
* $Date: 2013-08-09 06:30:22 -0500 (Fri, 09 Aug 2013) $
* $Revision: 43389 $
* $Date: 2014-08-11 10:54:01 -0500 (Mon, 11 Aug 2014) $
*/
@ -33,23 +33,6 @@
#define THREAD_ALLOC_FOR_TASKQ
static void
__kmp_static_delay( int arg )
{
/* Work around weird code-gen bug that causes assert to trip */
#if KMP_ARCH_X86_64 && KMP_OS_LINUX
KMP_ASSERT( arg != 0 );
#else
KMP_ASSERT( arg >= 0 );
#endif
}
static void
__kmp_static_yield( int arg )
{
__kmp_yield( arg );
}
static int
in_parallel_context( kmp_team_t *team )
{
@ -790,7 +773,7 @@ __kmp_dequeue_task (kmp_int32 global_tid, kmpc_task_queue_t *queue, int in_paral
* 1. Walk up the task queue tree from the current queue's parent and look
* on the way up (for loop, below).
* 2. Do a depth-first search back down the tree from the root and
* look (find_task_in_descandent_queue()).
* look (find_task_in_descendant_queue()).
*
* Here are the rules for deciding which task to take from a queue
* (__kmp_find_task_in_queue ()):
@ -1608,7 +1591,6 @@ __kmpc_end_taskq(ident_t *loc, kmp_int32 global_tid, kmpc_thunk_t *taskq_thunk)
&& (! __kmp_taskq_has_any_children(queue) )
&& (! (queue->tq_flags & TQF_ALL_TASKS_QUEUED) )
) {
__kmp_static_delay( 1 );
KMP_YIELD_WHEN( TRUE, spins );
}

--- kmp_threadprivate.c ---

@ -1,7 +1,7 @@
/*
* kmp_threadprivate.c -- OpenMP threadprivate support library
* $Revision: 42618 $
* $Date: 2013-08-27 09:15:45 -0500 (Tue, 27 Aug 2013) $
* $Revision: 42951 $
* $Date: 2014-01-21 14:41:41 -0600 (Tue, 21 Jan 2014) $
*/

--- kmp_utility.c ---

@ -1,7 +1,7 @@
/*
* kmp_utility.c -- Utility routines for the OpenMP support library.
* $Revision: 42588 $
* $Date: 2013-08-13 01:26:00 -0500 (Tue, 13 Aug 2013) $
* $Revision: 42951 $
* $Date: 2014-01-21 14:41:41 -0600 (Tue, 21 Jan 2014) $
*/

--- kmp_version.c ---

@ -1,7 +1,7 @@
/*
* kmp_version.c
* $Revision: 42806 $
* $Date: 2013-11-05 16:16:45 -0600 (Tue, 05 Nov 2013) $
* $Revision: 43435 $
* $Date: 2014-09-04 15:16:08 -0500 (Thu, 04 Sep 2014) $
*/
@ -20,7 +20,7 @@
#include "kmp_version.h"
// Replace with snapshot date YYYYMMDD for promotion build.
#define KMP_VERSION_BUILD 00000000
#define KMP_VERSION_BUILD 20140926
// Helper macros to convert value of macro to string literal.
#define _stringer( x ) #x
@ -46,6 +46,8 @@
#define KMP_COMPILER "Intel C++ Compiler 14.0"
#elif __INTEL_COMPILER == 1410
#define KMP_COMPILER "Intel C++ Compiler 14.1"
#elif __INTEL_COMPILER == 1500
#define KMP_COMPILER "Intel C++ Compiler 15.0"
#elif __INTEL_COMPILER == 9999
#define KMP_COMPILER "Intel C++ Compiler mainline"
#endif
@ -54,7 +56,7 @@
#elif KMP_COMPILER_GCC
#define KMP_COMPILER "GCC " stringer( __GNUC__ ) "." stringer( __GNUC_MINOR__ )
#elif KMP_COMPILER_MSVC
#define KMP_COMPILER "MSVC " stringer( __MSC_FULL_VER )
#define KMP_COMPILER "MSVC " stringer( _MSC_FULL_VER )
#endif
#ifndef KMP_COMPILER
#warning "Unknown compiler"
@ -77,7 +79,7 @@
// Finally, define strings.
#define KMP_LIBRARY KMP_LIB_TYPE " library (" KMP_LINK_TYPE ")"
#define KMP_COPYRIGHT "Copyright (C) 1997-2013, Intel Corporation. All Rights Reserved."
#define KMP_COPYRIGHT ""
int const __kmp_version_major = KMP_VERSION_MAJOR;
int const __kmp_version_minor = KMP_VERSION_MINOR;
@ -85,10 +87,8 @@ int const __kmp_version_build = KMP_VERSION_BUILD;
int const __kmp_openmp_version =
#if OMP_40_ENABLED
201307;
#elif OMP_30_ENABLED
201107;
#else
200505;
201107;
#endif
/* Do NOT change the format of this string! Intel(R) Thread Profiler checks for a
@ -128,7 +128,6 @@ __kmp_print_version_1( void )
kmp_str_buf_t buffer;
__kmp_str_buf_init( & buffer );
// Print version strings skipping initial magic.
__kmp_str_buf_print( & buffer, "%s\n", & __kmp_version_copyright[ KMP_VERSION_MAGIC_LEN ] );
__kmp_str_buf_print( & buffer, "%s\n", & __kmp_version_lib_ver[ KMP_VERSION_MAGIC_LEN ] );
__kmp_str_buf_print( & buffer, "%s\n", & __kmp_version_lib_type[ KMP_VERSION_MAGIC_LEN ] );
__kmp_str_buf_print( & buffer, "%s\n", & __kmp_version_link_type[ KMP_VERSION_MAGIC_LEN ] );
@ -164,8 +163,6 @@ __kmp_print_version_1( void )
); // __kmp_str_buf_print
}; // for i
__kmp_str_buf_print( & buffer, "%s\n", & __kmp_version_lock[ KMP_VERSION_MAGIC_LEN ] );
__kmp_str_buf_print( & buffer, "%s\n", & __kmp_version_perf_v19[ KMP_VERSION_MAGIC_LEN ] );
__kmp_str_buf_print( & buffer, "%s\n", & __kmp_version_perf_v106[ KMP_VERSION_MAGIC_LEN ] );
#endif
__kmp_str_buf_print(
& buffer,

--- kmp_version.h ---

@ -1,7 +1,7 @@
/*
* kmp_version.h -- version number for this release
* $Revision: 42181 $
* $Date: 2013-03-26 15:04:45 -0500 (Tue, 26 Mar 2013) $
* $Revision: 42982 $
* $Date: 2014-02-12 10:11:02 -0600 (Wed, 12 Feb 2014) $
*/
@ -55,8 +55,6 @@ extern char const __kmp_version_alt_comp[];
extern char const __kmp_version_omp_api[];
// ??? extern char const __kmp_version_debug[];
extern char const __kmp_version_lock[];
extern char const __kmp_version_perf_v19[];
extern char const __kmp_version_perf_v106[];
extern char const __kmp_version_nested_stats_reporting[];
extern char const __kmp_version_ftnstdcall[];
extern char const __kmp_version_ftncdecl[];

--- kmp_wait_release.cpp (new file) ---

@ -0,0 +1,52 @@
/*
* kmp_wait_release.cpp -- Wait/Release implementation
* $Revision: 43417 $
* $Date: 2014-08-26 14:06:38 -0500 (Tue, 26 Aug 2014) $
*/
//===----------------------------------------------------------------------===//
//
// The LLVM Compiler Infrastructure
//
// This file is dual licensed under the MIT and the University of Illinois Open
// Source Licenses. See LICENSE.txt for details.
//
//===----------------------------------------------------------------------===//
#include "kmp_wait_release.h"
void __kmp_wait_32(kmp_info_t *this_thr, kmp_flag_32 *flag, int final_spin
USE_ITT_BUILD_ARG(void * itt_sync_obj) )
{
__kmp_wait_template(this_thr, flag, final_spin
USE_ITT_BUILD_ARG(itt_sync_obj) );
}
void __kmp_wait_64(kmp_info_t *this_thr, kmp_flag_64 *flag, int final_spin
USE_ITT_BUILD_ARG(void * itt_sync_obj) )
{
__kmp_wait_template(this_thr, flag, final_spin
USE_ITT_BUILD_ARG(itt_sync_obj) );
}
void __kmp_wait_oncore(kmp_info_t *this_thr, kmp_flag_oncore *flag, int final_spin
USE_ITT_BUILD_ARG(void * itt_sync_obj) )
{
__kmp_wait_template(this_thr, flag, final_spin
USE_ITT_BUILD_ARG(itt_sync_obj) );
}
void __kmp_release_32(kmp_flag_32 *flag) {
__kmp_release_template(flag);
}
void __kmp_release_64(kmp_flag_64 *flag) {
__kmp_release_template(flag);
}
void __kmp_release_oncore(kmp_flag_oncore *flag) {
__kmp_release_template(flag);
}
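These out-of-line instances give the rest of the runtime concrete, non-template entry points onto the templated wait/release paths; kmp_flag_oncore::notdone_check() in kmp_wait_release.h, for example, calls __kmp_wait_64() on a plain kmp_flag_64 when a thread has to switch back to waiting on its own flag. A hypothetical direct use (ITT disabled, illustrative names):

      kmp_flag_32 flag( &some_counter, waiting_thr );  // register the waiter
      __kmp_release_32( &flag );                       // same path as flag.release()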

--- kmp_wait_release.h (new file) ---

@ -0,0 +1,496 @@
/*
* kmp_wait_release.h -- Wait/Release implementation
* $Revision: 43417 $
* $Date: 2014-08-26 14:06:38 -0500 (Tue, 26 Aug 2014) $
*/
//===----------------------------------------------------------------------===//
//
// The LLVM Compiler Infrastructure
//
// This file is dual licensed under the MIT and the University of Illinois Open
// Source Licenses. See LICENSE.txt for details.
//
//===----------------------------------------------------------------------===//
#ifndef KMP_WAIT_RELEASE_H
#define KMP_WAIT_RELEASE_H
#include "kmp.h"
#include "kmp_itt.h"
/*!
@defgroup WAIT_RELEASE Wait/Release operations
The definitions and functions here implement the lowest level thread
synchronizations of suspending a thread and awaking it. They are used
to build higher level operations such as barriers and fork/join.
*/
/*!
@ingroup WAIT_RELEASE
@{
*/
/*!
* The flag_type describes the storage used for the flag.
*/
enum flag_type {
flag32, /**< 32 bit flags */
flag64, /**< 64 bit flags */
flag_oncore /**< special 64-bit flag for on-core barrier (hierarchical) */
};
/*!
* Base class for wait/release volatile flag
*/
template <typename P>
class kmp_flag {
volatile P * loc; /**< Pointer to the flag storage that is modified by another thread */
flag_type t; /**< "Type" of the flag in loc */
public:
typedef P flag_t;
kmp_flag(volatile P *p, flag_type ft) : loc(p), t(ft) {}
/*!
* @result the pointer to the actual flag
*/
volatile P * get() { return loc; }
/*!
* @result the flag_type
*/
flag_type get_type() { return t; }
// Derived classes must provide the following:
/*
kmp_info_t * get_waiter(kmp_uint32 i);
kmp_uint32 get_num_waiters();
bool done_check();
bool done_check_val(P old_loc);
bool notdone_check();
P internal_release();
P set_sleeping();
P unset_sleeping();
bool is_sleeping();
bool is_sleeping_val(P old_loc);
*/
};
/* Spin wait loop that first does pause, then yield, then sleep. A thread that calls __kmp_wait_*
must make certain that another thread calls __kmp_release to wake it back up to prevent deadlocks! */
template <class C>
static inline void __kmp_wait_template(kmp_info_t *this_thr, C *flag, int final_spin
USE_ITT_BUILD_ARG(void * itt_sync_obj) )
{
// NOTE: We may not belong to a team at this point.
volatile typename C::flag_t *spin = flag->get();
kmp_uint32 spins;
kmp_uint32 hibernate;
int th_gtid;
int tasks_completed = FALSE;
KMP_FSYNC_SPIN_INIT(spin, NULL);
if (flag->done_check()) {
KMP_FSYNC_SPIN_ACQUIRED(spin);
return;
}
th_gtid = this_thr->th.th_info.ds.ds_gtid;
KA_TRACE(20, ("__kmp_wait_sleep: T#%d waiting for flag(%p)\n", th_gtid, flag));
// Setup for waiting
KMP_INIT_YIELD(spins);
if (__kmp_dflt_blocktime != KMP_MAX_BLOCKTIME) {
// The worker threads cannot rely on the team struct existing at this point.
// Use the bt values cached in the thread struct instead.
#ifdef KMP_ADJUST_BLOCKTIME
if (__kmp_zero_bt && !this_thr->th.th_team_bt_set)
// Force immediate suspend if not set by user and more threads than available procs
hibernate = 0;
else
hibernate = this_thr->th.th_team_bt_intervals;
#else
hibernate = this_thr->th.th_team_bt_intervals;
#endif /* KMP_ADJUST_BLOCKTIME */
/* If the blocktime is nonzero, we want to make sure that we spin wait for the entirety
of the specified #intervals, plus up to one interval more. This increment makes
certain that this thread doesn't go to sleep too soon. */
if (hibernate != 0)
hibernate++;
// Add in the current time value.
hibernate += TCR_4(__kmp_global.g.g_time.dt.t_value);
KF_TRACE(20, ("__kmp_wait_sleep: T#%d now=%d, hibernate=%d, intervals=%d\n",
th_gtid, __kmp_global.g.g_time.dt.t_value, hibernate,
hibernate - __kmp_global.g.g_time.dt.t_value));
}
KMP_MB();
// Main wait spin loop
while (flag->notdone_check()) {
int in_pool;
/* If the task team is NULL, it means one of these things:
1) A newly-created thread is first being released by __kmp_fork_barrier(), and
its task team has not been set up yet.
2) All tasks have been executed to completion, this thread has decremented the task
team's ref ct and possibly deallocated it, and should no longer reference it.
3) Tasking is off for this region. This could be because we are in a serialized region
(perhaps the outer one), or else tasking was manually disabled (KMP_TASKING=0). */
kmp_task_team_t * task_team = NULL;
if (__kmp_tasking_mode != tskm_immediate_exec) {
task_team = this_thr->th.th_task_team;
if (task_team != NULL) {
if (!TCR_SYNC_4(task_team->tt.tt_active)) {
KMP_DEBUG_ASSERT(!KMP_MASTER_TID(this_thr->th.th_info.ds.ds_tid));
__kmp_unref_task_team(task_team, this_thr);
} else if (KMP_TASKING_ENABLED(task_team, this_thr->th.th_task_state)) {
flag->execute_tasks(this_thr, th_gtid, final_spin, &tasks_completed
USE_ITT_BUILD_ARG(itt_sync_obj), 0);
}
} // if
} // if
KMP_FSYNC_SPIN_PREPARE(spin);
if (TCR_4(__kmp_global.g.g_done)) {
if (__kmp_global.g.g_abort)
__kmp_abort_thread();
break;
}
// If we are oversubscribed, or have waited a bit (and KMP_LIBRARY=throughput), then yield
KMP_YIELD(TCR_4(__kmp_nth) > __kmp_avail_proc);
// TODO: Should it be number of cores instead of thread contexts? Like:
// KMP_YIELD(TCR_4(__kmp_nth) > __kmp_ncores);
// Need performance improvement data to make the change...
KMP_YIELD_SPIN(spins);
// Check if this thread was transferred from a team
// to the thread pool (or vice-versa) while spinning.
in_pool = !!TCR_4(this_thr->th.th_in_pool);
if (in_pool != !!this_thr->th.th_active_in_pool) {
if (in_pool) { // Recently transferred from team to pool
KMP_TEST_THEN_INC32((kmp_int32 *)&__kmp_thread_pool_active_nth);
this_thr->th.th_active_in_pool = TRUE;
/* Here, we cannot assert that:
KMP_DEBUG_ASSERT(TCR_4(__kmp_thread_pool_active_nth) <= __kmp_thread_pool_nth);
__kmp_thread_pool_nth is inc/dec'd by the master thread while the fork/join
lock is held, whereas __kmp_thread_pool_active_nth is inc/dec'd asynchronously
by the workers. The two can get out of sync for brief periods of time. */
}
else { // Recently transferred from pool to team
KMP_TEST_THEN_DEC32((kmp_int32 *) &__kmp_thread_pool_active_nth);
KMP_DEBUG_ASSERT(TCR_4(__kmp_thread_pool_active_nth) >= 0);
this_thr->th.th_active_in_pool = FALSE;
}
}
// Don't suspend if KMP_BLOCKTIME is set to "infinite"
if (__kmp_dflt_blocktime == KMP_MAX_BLOCKTIME)
continue;
// Don't suspend if there is a likelihood of new tasks being spawned.
if ((task_team != NULL) && TCR_4(task_team->tt.tt_found_tasks))
continue;
// If we have waited a bit more, fall asleep
if (TCR_4(__kmp_global.g.g_time.dt.t_value) < hibernate)
continue;
KF_TRACE(50, ("__kmp_wait_sleep: T#%d suspend time reached\n", th_gtid));
flag->suspend(th_gtid);
if (TCR_4(__kmp_global.g.g_done)) {
if (__kmp_global.g.g_abort)
__kmp_abort_thread();
break;
}
// TODO: If thread is done with work and times out, disband/free
}
KMP_FSYNC_SPIN_ACQUIRED(spin);
}
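// Worked example of the hibernate arithmetic above, with hypothetical numbers:
// if the cached blocktime th_team_bt_intervals is 5 monitor intervals and the
// monitor tick counter __kmp_global.g.g_time.dt.t_value currently reads 37, then
// hibernate = 5 + 1 + 37 = 43. The +1 guarantees the thread spins for at least
// the full blocktime, and adding t_value converts the count into an absolute
// deadline: once t_value reaches 43, the checks above stop short-circuiting and
// the loop calls flag->suspend(th_gtid).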
/* Release any threads specified as waiting on the flag by releasing the flag and resuming the waiting thread
if indicated by the sleep bit(s). A thread that calls __kmp_wait_template must call this function to wake
up the potentially sleeping thread and prevent deadlocks! */
template <class C>
static inline void __kmp_release_template(C *flag)
{
#ifdef KMP_DEBUG
// FIX ME
kmp_info_t * wait_thr = flag->get_waiter(0);
int target_gtid = wait_thr->th.th_info.ds.ds_gtid;
int gtid = TCR_4(__kmp_init_gtid) ? __kmp_get_gtid() : -1;
#endif
KF_TRACE(20, ("__kmp_release: T#%d releasing T#%d spin(%p)\n", gtid, target_gtid, flag->get()));
KMP_DEBUG_ASSERT(flag->get());
KMP_FSYNC_RELEASING(flag->get());
typename C::flag_t old_spin = flag->internal_release();
KF_TRACE(100, ("__kmp_release: T#%d old spin(%p)=%d, set new spin=%d\n",
gtid, flag->get(), old_spin, *(flag->get())));
if (__kmp_dflt_blocktime != KMP_MAX_BLOCKTIME) {
// Only need to check sleep stuff if infinite block time not set
if (flag->is_sleeping_val(old_spin)) {
for (unsigned int i=0; i<flag->get_num_waiters(); ++i) {
kmp_info_t * waiter = flag->get_waiter(i);
int wait_gtid = waiter->th.th_info.ds.ds_gtid;
// Wake up thread if needed
KF_TRACE(50, ("__kmp_release: T#%d waking up thread T#%d since sleep spin(%p) set\n",
gtid, wait_gtid, flag->get()));
flag->resume(wait_gtid);
}
} else {
KF_TRACE(50, ("__kmp_release: T#%d don't wake up thread T#%d since sleep spin(%p) not set\n",
gtid, target_gtid, flag->get()));
}
}
}
template <typename FlagType>
struct flag_traits {};
template <>
struct flag_traits<kmp_uint32> {
typedef kmp_uint32 flag_t;
static const flag_type t = flag32;
static inline flag_t tcr(flag_t f) { return TCR_4(f); }
static inline flag_t test_then_add4(volatile flag_t *f) { return KMP_TEST_THEN_ADD4_32((volatile kmp_int32 *)f); }
static inline flag_t test_then_or(volatile flag_t *f, flag_t v) { return KMP_TEST_THEN_OR32((volatile kmp_int32 *)f, v); }
static inline flag_t test_then_and(volatile flag_t *f, flag_t v) { return KMP_TEST_THEN_AND32((volatile kmp_int32 *)f, v); }
};
template <>
struct flag_traits<kmp_uint64> {
typedef kmp_uint64 flag_t;
static const flag_type t = flag64;
static inline flag_t tcr(flag_t f) { return TCR_8(f); }
static inline flag_t test_then_add4(volatile flag_t *f) { return KMP_TEST_THEN_ADD4_64((volatile kmp_int64 *)f); }
static inline flag_t test_then_or(volatile flag_t *f, flag_t v) { return KMP_TEST_THEN_OR64((volatile kmp_int64 *)f, v); }
static inline flag_t test_then_and(volatile flag_t *f, flag_t v) { return KMP_TEST_THEN_AND64((volatile kmp_int64 *)f, v); }
};
template <typename FlagType>
class kmp_basic_flag : public kmp_flag<FlagType> {
typedef flag_traits<FlagType> traits_type;
FlagType checker; /**< Value the flag is compared against to check whether it has been released. */
kmp_info_t * waiting_threads[1]; /**< Array of threads sleeping on this flag. */
kmp_uint32 num_waiting_threads; /**< Number of threads sleeping on this flag. */
public:
kmp_basic_flag(volatile FlagType *p) : kmp_flag<FlagType>(p, traits_type::t), num_waiting_threads(0) {}
kmp_basic_flag(volatile FlagType *p, kmp_info_t *thr) : kmp_flag<FlagType>(p, traits_type::t), num_waiting_threads(1) {
waiting_threads[0] = thr;
}
kmp_basic_flag(volatile FlagType *p, FlagType c) : kmp_flag<FlagType>(p, traits_type::t), checker(c), num_waiting_threads(0) {}
/*!
* @param i in index into waiting_threads
* @result the thread that is waiting at index i
*/
kmp_info_t * get_waiter(kmp_uint32 i) {
KMP_DEBUG_ASSERT(i<num_waiting_threads);
return waiting_threads[i];
}
/*!
* @result num_waiting_threads
*/
kmp_uint32 get_num_waiters() { return num_waiting_threads; }
/*!
* @param thr in the thread which is now waiting
*
* Insert a waiting thread at index 0.
*/
void set_waiter(kmp_info_t *thr) {
waiting_threads[0] = thr;
num_waiting_threads = 1;
}
/*!
* @result true if the flag object has been released.
*/
bool done_check() { return traits_type::tcr(*(this->get())) == checker; }
/*!
* @param old_loc in old value of flag
* @result true if the flag's old value indicates it was released.
*/
bool done_check_val(FlagType old_loc) { return old_loc == checker; }
/*!
* @result true if the flag object is not yet released.
* Used in __kmp_wait_template like:
* @code
* while (flag.notdone_check()) { pause(); }
* @endcode
*/
bool notdone_check() { return traits_type::tcr(*(this->get())) != checker; }
/*!
* @result Actual flag value before release was applied.
* Trigger all waiting threads to run by modifying flag to release state.
*/
FlagType internal_release() {
return traits_type::test_then_add4((volatile FlagType *)this->get());
}
/*!
* @result Actual flag value before sleep bit(s) set.
* Notes that there is at least one thread sleeping on the flag by setting sleep bit(s).
*/
FlagType set_sleeping() {
return traits_type::test_then_or((volatile FlagType *)this->get(), KMP_BARRIER_SLEEP_STATE);
}
/*!
* @result Actual flag value before sleep bit(s) cleared.
* Notes that there are no longer threads sleeping on the flag by clearing sleep bit(s).
*/
FlagType unset_sleeping() {
return traits_type::test_then_and((volatile FlagType *)this->get(), ~KMP_BARRIER_SLEEP_STATE);
}
/*!
* @param old_loc in old value of flag
* Test whether there are threads sleeping on the flag's old value in old_loc.
*/
bool is_sleeping_val(FlagType old_loc) { return old_loc & KMP_BARRIER_SLEEP_STATE; }
/*!
* Test whether there are threads sleeping on the flag.
*/
bool is_sleeping() { return is_sleeping_val(*(this->get())); }
};
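// Note on internal_release(): the flag advances in units of 4 (test_then_add4),
// which keeps the two low-order bits free for status, and the returned old value
// tells the releaser whether anyone was asleep. Assuming the sleep bit is bit 0
// (i.e. KMP_BARRIER_SLEEP_STATE == 1), a sketch:
//     old value 0x1 (a waiter set the sleep bit)  ->  new value 0x5
//     is_sleeping_val(0x1) == true, so __kmp_release_template wakes the waiter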
class kmp_flag_32 : public kmp_basic_flag<kmp_uint32> {
public:
kmp_flag_32(volatile kmp_uint32 *p) : kmp_basic_flag<kmp_uint32>(p) {}
kmp_flag_32(volatile kmp_uint32 *p, kmp_info_t *thr) : kmp_basic_flag<kmp_uint32>(p, thr) {}
kmp_flag_32(volatile kmp_uint32 *p, kmp_uint32 c) : kmp_basic_flag<kmp_uint32>(p, c) {}
void suspend(int th_gtid) { __kmp_suspend_32(th_gtid, this); }
void resume(int th_gtid) { __kmp_resume_32(th_gtid, this); }
int execute_tasks(kmp_info_t *this_thr, kmp_int32 gtid, int final_spin, int *thread_finished
USE_ITT_BUILD_ARG(void * itt_sync_obj), kmp_int32 is_constrained) {
return __kmp_execute_tasks_32(this_thr, gtid, this, final_spin, thread_finished
USE_ITT_BUILD_ARG(itt_sync_obj), is_constrained);
}
void wait(kmp_info_t *this_thr, int final_spin
USE_ITT_BUILD_ARG(void * itt_sync_obj)) {
__kmp_wait_template(this_thr, this, final_spin
USE_ITT_BUILD_ARG(itt_sync_obj));
}
void release() { __kmp_release_template(this); }
};
class kmp_flag_64 : public kmp_basic_flag<kmp_uint64> {
public:
kmp_flag_64(volatile kmp_uint64 *p) : kmp_basic_flag<kmp_uint64>(p) {}
kmp_flag_64(volatile kmp_uint64 *p, kmp_info_t *thr) : kmp_basic_flag<kmp_uint64>(p, thr) {}
kmp_flag_64(volatile kmp_uint64 *p, kmp_uint64 c) : kmp_basic_flag<kmp_uint64>(p, c) {}
void suspend(int th_gtid) { __kmp_suspend_64(th_gtid, this); }
void resume(int th_gtid) { __kmp_resume_64(th_gtid, this); }
int execute_tasks(kmp_info_t *this_thr, kmp_int32 gtid, int final_spin, int *thread_finished
USE_ITT_BUILD_ARG(void * itt_sync_obj), kmp_int32 is_constrained) {
return __kmp_execute_tasks_64(this_thr, gtid, this, final_spin, thread_finished
USE_ITT_BUILD_ARG(itt_sync_obj), is_constrained);
}
void wait(kmp_info_t *this_thr, int final_spin
USE_ITT_BUILD_ARG(void * itt_sync_obj)) {
__kmp_wait_template(this_thr, this, final_spin
USE_ITT_BUILD_ARG(itt_sync_obj));
}
void release() { __kmp_release_template(this); }
};
// Hierarchical 64-bit on-core barrier instantiation
class kmp_flag_oncore : public kmp_flag<kmp_uint64> {
kmp_uint64 checker;
kmp_info_t * waiting_threads[1];
kmp_uint32 num_waiting_threads;
kmp_uint32 offset; /**< Portion of flag that is of interest for an operation. */
bool flag_switch; /**< Indicates a switch in flag location. */
enum barrier_type bt; /**< Barrier type. */
kmp_info_t * this_thr; /**< Thread that may be redirected to different flag location. */
#if USE_ITT_BUILD
void *itt_sync_obj; /**< ITT object that must be passed to new flag location. */
#endif
char& byteref(volatile kmp_uint64* loc, size_t offset) { return ((char *)loc)[offset]; }
public:
kmp_flag_oncore(volatile kmp_uint64 *p)
: kmp_flag<kmp_uint64>(p, flag_oncore), num_waiting_threads(0), flag_switch(false) {}
kmp_flag_oncore(volatile kmp_uint64 *p, kmp_uint32 idx)
: kmp_flag<kmp_uint64>(p, flag_oncore), offset(idx), num_waiting_threads(0), flag_switch(false) {}
kmp_flag_oncore(volatile kmp_uint64 *p, kmp_uint64 c, kmp_uint32 idx, enum barrier_type bar_t,
kmp_info_t * thr
#if USE_ITT_BUILD
, void *itt
#endif
)
: kmp_flag<kmp_uint64>(p, flag_oncore), checker(c), offset(idx), bt(bar_t), this_thr(thr)
#if USE_ITT_BUILD
, itt_sync_obj(itt)
#endif
, num_waiting_threads(0), flag_switch(false) {}
kmp_info_t * get_waiter(kmp_uint32 i) {
KMP_DEBUG_ASSERT(i<num_waiting_threads);
return waiting_threads[i];
}
kmp_uint32 get_num_waiters() { return num_waiting_threads; }
void set_waiter(kmp_info_t *thr) {
waiting_threads[0] = thr;
num_waiting_threads = 1;
}
bool done_check_val(kmp_uint64 old_loc) { return byteref(&old_loc,offset) == checker; }
bool done_check() { return done_check_val(*get()); }
bool notdone_check() {
// Calculate flag_switch
if (this_thr->th.th_bar[bt].bb.wait_flag == KMP_BARRIER_SWITCH_TO_OWN_FLAG)
flag_switch = true;
if (byteref(get(),offset) != 1 && !flag_switch)
return true;
else if (flag_switch) {
this_thr->th.th_bar[bt].bb.wait_flag = KMP_BARRIER_SWITCHING;
kmp_flag_64 flag(&this_thr->th.th_bar[bt].bb.b_go, (kmp_uint64)KMP_BARRIER_STATE_BUMP);
__kmp_wait_64(this_thr, &flag, TRUE
#if USE_ITT_BUILD
, itt_sync_obj
#endif
);
}
return false;
}
kmp_uint64 internal_release() {
kmp_uint64 old_val;
if (__kmp_dflt_blocktime == KMP_MAX_BLOCKTIME) {
old_val = *get();
byteref(get(),offset) = 1;
}
else {
kmp_uint64 mask=0;
byteref(&mask,offset) = 1;
old_val = KMP_TEST_THEN_OR64((volatile kmp_int64 *)get(), mask);
}
return old_val;
}
kmp_uint64 set_sleeping() {
return KMP_TEST_THEN_OR64((kmp_int64 volatile *)get(), KMP_BARRIER_SLEEP_STATE);
}
kmp_uint64 unset_sleeping() {
return KMP_TEST_THEN_AND64((kmp_int64 volatile *)get(), ~KMP_BARRIER_SLEEP_STATE);
}
bool is_sleeping_val(kmp_uint64 old_loc) { return old_loc & KMP_BARRIER_SLEEP_STATE; }
bool is_sleeping() { return is_sleeping_val(*get()); }
void wait(kmp_info_t *this_thr, int final_spin
USE_ITT_BUILD_ARG(void * itt_sync_obj)) {
__kmp_wait_template(this_thr, this, final_spin
USE_ITT_BUILD_ARG(itt_sync_obj));
}
void release() { __kmp_release_template(this); }
void suspend(int th_gtid) { __kmp_suspend_oncore(th_gtid, this); }
void resume(int th_gtid) { __kmp_resume_oncore(th_gtid, this); }
int execute_tasks(kmp_info_t *this_thr, kmp_int32 gtid, int final_spin, int *thread_finished
USE_ITT_BUILD_ARG(void * itt_sync_obj), kmp_int32 is_constrained) {
return __kmp_execute_tasks_oncore(this_thr, gtid, this, final_spin, thread_finished
USE_ITT_BUILD_ARG(itt_sync_obj), is_constrained);
}
};
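// Sketch of the byteref/offset packing (assumption: the new hierarchical barrier
// gives each thread on a core one byte of a shared kmp_uint64, so up to eight
// sub-flags share one word). With illustrative names and ITT disabled:
//     volatile kmp_uint64 core_flag = 0;
//     kmp_flag_oncore f( &core_flag, 1ULL, 2, bs_plain_barrier, this_thr );
//     f.internal_release();  // with a finite blocktime, ORs in the mask whose
//                            // byte 2 is 1 (0x10000 on a little-endian target),
//                            // so only the waiter watching byte 2 sees done_check()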
/*!
@}
*/
#endif // KMP_WAIT_RELEASE_H
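Putting these classes together, a typical use pairs one waiting thread with one releasing thread on a shared 64-bit flag. A minimal sketch, assuming the flag starts at 0, an ITT-disabled build (so the USE_ITT_BUILD_ARG parameters drop out), and illustrative variable names; the checker value mirrors the b_go usage in kmp_flag_oncore::notdone_check() above:

      volatile kmp_uint64 go = 0;                          // shared release flag

      // waiting side: pause -> yield -> sleep until go == KMP_BARRIER_STATE_BUMP
      kmp_flag_64 wait_flag( &go, (kmp_uint64)KMP_BARRIER_STATE_BUMP );
      wait_flag.wait( this_thr, FALSE );

      // releasing side: register the waiter, bump the flag via internal_release(),
      // and wake the thread if its sleep bit was set while it was suspended
      kmp_flag_64 rel_flag( &go, waiting_thr );
      rel_flag.release();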

--- kmp_wrapper_getpid.h ---

@ -1,7 +1,7 @@
/*
* kmp_wrapper_getpid.h -- getpid() declaration.
* $Revision: 42181 $
* $Date: 2013-03-26 15:04:45 -0500 (Tue, 26 Mar 2013) $
* $Revision: 42951 $
* $Date: 2014-01-21 14:41:41 -0600 (Tue, 21 Jan 2014) $
*/

--- kmp_wrapper_malloc.h ---

@ -1,8 +1,8 @@
/*
* kmp_wrapper_malloc.h -- Wrappers for memory allocation routines
* (malloc(), free(), and others).
* $Revision: 42181 $
* $Date: 2013-03-26 15:04:45 -0500 (Tue, 26 Mar 2013) $
* $Revision: 43084 $
* $Date: 2014-04-15 09:15:14 -0500 (Tue, 15 Apr 2014) $
*/

--- libiomp.rc.var ---

@ -1,6 +1,6 @@
// libiomp.rc.var
// $Revision: 42219 $
// $Date: 2013-03-29 13:36:05 -0500 (Fri, 29 Mar 2013) $
// $Revision: 42994 $
// $Date: 2014-03-04 02:22:15 -0600 (Tue, 04 Mar 2014) $
//
////===----------------------------------------------------------------------===//
@ -41,8 +41,6 @@ VS_VERSION_INFO VERSIONINFO
// FileDescription and LegalCopyright should be short.
VALUE "FileDescription", "Intel(R) OpenMP* Runtime Library${{ our $MESSAGE_CATALOG; $MESSAGE_CATALOG ? " Message Catalog" : "" }}\0"
VALUE "LegalCopyright", "Copyright (C) 1997-2013, Intel Corporation. All rights reserved.\0"
// Following values may be relatively long.
VALUE "CompanyName", "Intel Corporation\0"
// VALUE "LegalTrademarks", "\0" // Not used for now.

--- makefile.mk ---

@ -1,6 +1,6 @@
# makefile.mk #
# $Revision: 42820 $
# $Date: 2013-11-13 16:53:44 -0600 (Wed, 13 Nov 2013) $
# $Revision: 43473 $
# $Date: 2014-09-26 15:02:57 -0500 (Fri, 26 Sep 2014) $
#
#//===----------------------------------------------------------------------===//
@ -221,6 +221,18 @@ ifeq "$(filter gcc clang,$(c))" ""
endif
endif
# On Linux and Windows Intel64 we need offload attribute for all Fortran entries
# in order to support OpenMP function calls inside Device constructs
ifeq "$(fort)" "ifort"
ifeq "$(os)_$(arch)" "lin_32e"
# TODO: change to -qoffload... when we stop supporting 14.0 compiler (-offload is deprecated)
fort-flags += -offload-attribute-target=mic
endif
ifeq "$(os)_$(arch)" "win_32e"
fort-flags += /Qoffload-attribute-target:mic
endif
endif
ifeq "$(os)" "lrb"
c-flags += -mmic
cxx-flags += -mmic
@ -361,6 +373,7 @@ ifeq "$(os)" "lin"
# to remove dependency on libgcc_s:
ifeq "$(c)" "gcc"
ld-flags-dll += -static-libgcc
# omp_os is non-empty only in the open-source code
ifneq "$(omp_os)" "freebsd"
ld-flags-extra += -Wl,-ldl
endif
@ -417,11 +430,15 @@ ifeq "$(os)" "lrb"
ld-flags += -ldl
endif
endif
# include the c++ library for stats-gathering code
ifeq "$(stats)" "on"
ld-flags-extra += -Wl,-lstdc++
endif
endif
endif
ifeq "$(os)" "mac"
ifeq "$(c)" "icc"
ifeq "$(ld)" "icc"
ld-flags += -no-intel-extensions
endif
ld-flags += -single_module
@ -483,6 +500,13 @@ endif
cpp-flags += -D KMP_ADJUST_BLOCKTIME=1
cpp-flags += -D BUILD_PARALLEL_ORDERED
cpp-flags += -D KMP_ASM_INTRINS
cpp-flags += -D KMP_USE_INTERNODE_ALIGNMENT=0
# Linux and MIC compile with version symbols
ifneq "$(filter lin lrb,$(os))" ""
ifeq "$(filter ppc64,$(arch))" ""
cpp-flags += -D KMP_USE_VERSION_SYMBOLS
endif
endif
ifneq "$(os)" "lrb"
cpp-flags += -D USE_LOAD_BALANCE
endif
@ -506,43 +530,52 @@ else # 5
cpp-flags += -D KMP_GOMP_COMPAT
endif
endif
cpp-flags += -D KMP_NESTED_HOT_TEAMS
ifneq "$(filter 32 32e,$(arch))" ""
cpp-flags += -D KMP_USE_ADAPTIVE_LOCKS=1 -D KMP_DEBUG_ADAPTIVE_LOCKS=0
endif
# is the std c++ library needed? (for stats-gathering, it is)
std_cpp_lib=0
ifneq "$(filter lin lrb,$(os))" ""
ifeq "$(stats)" "on"
cpp-flags += -D KMP_STATS_ENABLED=1
std_cpp_lib=1
else
cpp-flags += -D KMP_STATS_ENABLED=0
endif
else # no mac or windows support for stats-gathering
ifeq "$(stats)" "on"
$(error Statistics-gathering functionality not available on $(os) platform)
endif
cpp-flags += -D KMP_STATS_ENABLED=0
endif
# define compatibility with different OpenMP versions
have_omp_50=0
have_omp_41=0
have_omp_40=0
have_omp_30=0
ifeq "$(OMP_VERSION)" "50"
have_omp_50=1
have_omp_41=1
have_omp_40=1
have_omp_30=1
endif
ifeq "$(OMP_VERSION)" "41"
have_omp_50=0
have_omp_41=1
have_omp_40=1
have_omp_30=1
endif
ifeq "$(OMP_VERSION)" "40"
have_omp_50=0
have_omp_41=0
have_omp_40=1
have_omp_30=1
endif
ifeq "$(OMP_VERSION)" "30"
have_omp_50=0
have_omp_41=0
have_omp_40=0
have_omp_30=1
endif
cpp-flags += -D OMP_50_ENABLED=$(have_omp_50) -D OMP_41_ENABLED=$(have_omp_41)
cpp-flags += -D OMP_40_ENABLED=$(have_omp_40) -D OMP_30_ENABLED=$(have_omp_30)
cpp-flags += -D OMP_50_ENABLED=$(have_omp_50) -D OMP_41_ENABLED=$(have_omp_41) -D OMP_40_ENABLED=$(have_omp_40)
# Using ittnotify is enabled by default.
USE_ITT_NOTIFY = 1
@ -598,8 +631,8 @@ ifneq "$(os)" "win"
z_Linux_asm$(obj) : \
cpp-flags += -D KMP_ARCH_PPC64
else
z_Linux_asm$(obj) : \
cpp-flags += -D KMP_ARCH_X86$(if $(filter 32e,$(arch)),_64)
z_Linux_asm$(obj) : \
cpp-flags += -D KMP_ARCH_X86$(if $(filter 32e,$(arch)),_64)
endif
endif
@ -699,6 +732,8 @@ else # norm or prof
kmp_i18n \
kmp_io \
kmp_runtime \
kmp_wait_release \
kmp_barrier \
kmp_settings \
kmp_str \
kmp_tasking \
@ -715,6 +750,10 @@ ifeq "$(OMP_VERSION)" "40"
lib_cpp_items += kmp_taskdeps
lib_cpp_items += kmp_cancel
endif
ifeq "$(stats)" "on"
lib_cpp_items += kmp_stats
lib_cpp_items += kmp_stats_timing
endif
# OS-specific files.
ifeq "$(os)" "win"
@ -1272,8 +1311,20 @@ ifneq "$(os)" "lrb"
# On Linux* OS and OS X* the test is good enough because GNU compiler knows nothing
# about libirc and Intel compiler private lib directories, but we will grep verbose linker
# output just in case.
tt-c = cc
ifeq "$(os)" "lin" # GCC on OS X* does not recognize -pthread.
# Using clang on OS X* because of discontinued support of GNU compilers.
ifeq "$(os)" "mac"
ifeq "$(std_cpp_lib)" "1"
tt-c = clang++
else
tt-c = clang
endif
else # lin
ifeq "$(std_cpp_lib)" "1"
tt-c = g++
else
tt-c = gcc
endif
# GCC on OS X* does not recognize -pthread.
tt-c-flags += -pthread
endif
tt-c-flags += -o $(tt-exe-file)
@ -1416,6 +1467,10 @@ ifneq "$(filter %-dyna win-%,$(os)-$(LINK_TYPE))" ""
td_exp += libc.so.6
td_exp += ld64.so.1
endif
ifeq "$(std_cpp_lib)" "1"
td_exp += libstdc++.so.6
endif
td_exp += libdl.so.2
td_exp += libgcc_s.so.1
ifeq "$(filter 32 32e 64 ppc64,$(arch))" ""
@ -1428,6 +1483,9 @@ ifneq "$(filter %-dyna win-%,$(os)-$(LINK_TYPE))" ""
endif
ifeq "$(os)" "lrb"
ifeq "$(MIC_OS)" "lin"
ifeq "$(std_cpp_lib)" "1"
td_exp += libstdc++.so.6
endif
ifeq "$(MIC_ARCH)" "knf"
td_exp += "ld-linux-l1om.so.2"
td_exp += libc.so.6
@ -1459,8 +1517,9 @@ ifneq "$(filter %-dyna win-%,$(os)-$(LINK_TYPE))" ""
td_exp += uuid
endif
endif
ifeq "$(omp_os)" "freebsd"
td_exp =
td_exp =
td_exp += libc.so.7
td_exp += libthr.so.3
td_exp += libunwind.so.5

--- rules.mk ---

@ -1,6 +1,6 @@
# rules.mk #
# $Revision: 42423 $
# $Date: 2013-06-07 09:25:21 -0500 (Fri, 07 Jun 2013) $
# $Revision: 42951 $
# $Date: 2014-01-21 14:41:41 -0600 (Tue, 21 Jan 2014) $
#
#//===----------------------------------------------------------------------===//

Some files were not shown because too many files have changed in this diff.