/************************************************************************/
/*                                                                      */
/*    vspline - a set of generic tools for creation and evaluation      */
/*              of uniform b-splines                                    */
/*                                                                      */
/*            Copyright 2015 - 2018 by Kay F. Jahnke                    */
/*                                                                      */
/*    The git repository for this software is at                        */
/*                                                                      */
/*    https://bitbucket.org/kfj/vspline                                 */
/*                                                                      */
/*    Please direct questions, bug reports, and contributions to        */
/*                                                                      */
/*    kfjahnke+vspline@gmail.com                                        */
/*                                                                      */
/*    Permission is hereby granted, free of charge, to any person       */
/*    obtaining a copy of this software and associated documentation    */
/*    files (the "Software"), to deal in the Software without           */
/*    restriction, including without limitation the rights to use,      */
/*    copy, modify, merge, publish, distribute, sublicense, and/or      */
/*    sell copies of the Software, and to permit persons to whom the    */
/*    Software is furnished to do so, subject to the following          */
/*    conditions:                                                       */
/*                                                                      */
/*    The above copyright notice and this permission notice shall be    */
/*    included in all copies or substantial portions of the             */
/*    Software.                                                         */
/*                                                                      */
/*    THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND    */
/*    EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES   */
/*    OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND          */
/*    NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT       */
/*    HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY,      */
/*    WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING      */
/*    FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR     */
/*    OTHER DEALINGS IN THE SOFTWARE.                                   */
/*                                                                      */
/************************************************************************/

/*! \file transform.h
   
    \brief set of generic remap, transform and apply functions
   
    My foremost reason to have efficient B-spline processing is the
    formulation of generic remap-like functions. remap() is a function
    which takes an array of real-valued nD coordinates and an interpolator
    over a source array. Now each of the real-valued coordinates is fed
    into the interpolator in turn, yielding a value, which is placed in
    the output array at the same place the coordinate occupies in the
    coordinate array. To put it concisely, if we have
   
    - c, the coordinate array (or 'warp' array, 'arguments' array)
    - a, the source array (containing 'original' or 'knot point' data)
    - i, the interpolator over a
    - j, a coordinate into both c and t
    - t, the target array (receiving the 'result' of the remap)
   
    remap defines the operation
   
    t[j] = i(c[j]) for all j
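
    As a rough illustration of this definition, here is a minimal,
    self-contained 1D sketch. It does not use vspline: the names
    linear_interpolator and remap_1d are made up for this example, and
    plain linear interpolation stands in for a b-spline evaluator.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// hypothetical stand-in for 'i': linear interpolation over the source array 'a'
struct linear_interpolator
{
  const std::vector < double > & a ;

  double operator() ( double x ) const
  {
    // the sketch expects coordinates in [ 0 , a.size() - 1 ]
    assert ( a.size() > 1 && x >= 0.0 && x <= double ( a.size() - 1 ) ) ;
    std::size_t lo = static_cast < std::size_t > ( x ) ;
    if ( lo + 1 >= a.size() )
      lo = a.size() - 2 ;
    double f = x - double ( lo ) ;
    return ( 1.0 - f ) * a [ lo ] + f * a [ lo + 1 ] ;
  }
} ;

// t[j] = i(c[j]) for all j
std::vector < double > remap_1d ( const linear_interpolator & i ,
                                  const std::vector < double > & c )
{
  std::vector < double > t ( c.size() ) ;
  for ( std::size_t j = 0 ; j < c.size() ; j++ )
    t [ j ] = i ( c [ j ] ) ;
  return t ;
}
```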
   
    Now we widen the concept of remapping to a 'transform' function.
    Instead of limiting the process to the use of an 'interpolator',
    we use an arbitrary unary functor transforming incoming values to
    outgoing values, where the type of the incoming and outgoing values
    is determined by the functor. If the functor actually is an
    interpolator, we have a 'true' remap transforming coordinates
    into values, but this is merely a special case. So here we have:
   
    - c, an array containing input values
    - f, a unary functor converting input to output values
    - j, a coordinate into c and t
    - t, the target array
   
    transform performs the operation
   
    t[j] = f(c[j]) for all j
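
    The same operation can be sketched without vspline or vigra. In the
    sketch below, transform_sketch and to_half are hypothetical names,
    std::vector stands in for the arrays, and the functor merely mimics
    the idea that it determines the input and output types itself.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// hypothetical functor: in_type and out_type may differ
struct to_half
{
  typedef int in_type ;
  typedef double out_type ;

  void eval ( const in_type & in , out_type & out ) const
  {
    out = in / 2.0 ;
  }
} ;

// t[j] = f(c[j]) for all j
template < typename functor_type >
void transform_sketch ( const functor_type & f ,
                        const std::vector < typename functor_type::in_type > & c ,
                        std::vector < typename functor_type::out_type > & t )
{
  assert ( c.size() == t.size() ) ;
  for ( std::size_t j = 0 ; j < c.size() ; j++ )
    f.eval ( c [ j ] , t [ j ] ) ;
}
```

    Only the functor's type is deduced from the call; the element types
    of both arrays follow from the functor's typedefs.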
   
    Remaps/transforms to other-dimensional objects are supported.
    This makes it possible to, for example, remap from a volume to a
    2D image, using a 2D coordinate array containing 3D coordinates
    ('slicing' a volume).
   
    There is also a variant of this transform function in this file,
    which doesn't take an input array. Instead, for every target
    location, the location's discrete coordinates are passed to the
    unary_functor type object. This way, transformation-based remaps
    can be implemented easily: the user code just has to provide a
    suitable functor to yield values for coordinates. This functor
    will internally take the discrete incoming coordinates (into the
    target array) and take it from there, eventually producing values
    of the target array's value_type.

    Here we have:
   
    - f, a unary functor converting discrete coordinates to output values
    - j, a discrete coordinate into t
    - t, the target array
   
    'index-based' transform performs the operation
   
    t[j] = f(j) for all j
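
    A minimal 2D sketch of this variant, again independent of vspline:
    coordinate_2d and index_transform_sketch are names invented for the
    example, and the target is a plain std::vector stored row by row.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// hypothetical 2D discrete coordinate, x (fastest-changing index) first
struct coordinate_2d
{
  int x ;
  int y ;
} ;

// t[j] = f(j) for all j; 't' stores a width * height array row by row
template < typename functor_type >
void index_transform_sketch ( const functor_type & f ,
                              std::vector < int > & t ,
                              int width ,
                              int height )
{
  assert ( width >= 0 && height >= 0 ) ;
  assert ( t.size() == std::size_t ( width ) * std::size_t ( height ) ) ;
  for ( int y = 0 ; y < height ; y++ )
    for ( int x = 0 ; x < width ; x++ )
      t [ std::size_t ( y ) * width + x ] = f ( coordinate_2d { x , y } ) ;
}
```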
   
    This file also has code to evaluate a b-spline at coordinates in a 
    grid, which can be used for scaling, and for separable geometric
    transformations.
   
    Finally there is a function to restore the original data from a
    b-spline to the precision possible with the given data type and
    degree of the spline. This is done with a separable convolution,
    using a unit-stepped sampling of the basis function as the
    convolution kernel along every axis.
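
    To illustrate the convolution along one axis: for a cubic b-spline,
    sampling the basis function at unit steps gives the kernel
    ( 1/6 , 2/3 , 1/6 ). The sketch below applies this kernel to 1D
    coefficients. It is not vspline's implementation: restore_cubic_1d
    is a hypothetical name, and mirroring at the bounds is assumed.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// convolve 1D cubic b-spline coefficients with the kernel ( 1/6 , 2/3 , 1/6 ),
// mirroring at the bounds: c[-1] == c[1] and c[n] == c[n-2]
std::vector < double > restore_cubic_1d ( const std::vector < double > & c )
{
  const std::size_t n = c.size() ;
  assert ( n >= 2 ) ;
  std::vector < double > r ( n ) ;
  for ( std::size_t k = 0 ; k < n ; k++ )
  {
    double left  = ( k == 0 )     ? c [ 1 ]     : c [ k - 1 ] ;
    double right = ( k == n - 1 ) ? c [ n - 2 ] : c [ k + 1 ] ;
    r [ k ] = ( left + 4.0 * c [ k ] + right ) / 6.0 ;
  }
  return r ;
}
```

    For nD data, the same 1D pass would be repeated along every axis.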
   
    Let me reiterate the strategy used to perform the transforms and
    remaps in this file. The approach is functional: A 'processing chain'
    is set up and encoded as a functor providing two evaluation functions:
    one for 'single' data and one for vectorized data. This functor is
    applied to the data by 'wielding' code, which partitions the data
    into several jobs to be performed by individual worker threads,
    invokes the worker threads, and, once in the worker thread, feeds the
    data to the functor in turn, using hardware vectorization if possible.
    So at the user code level, a single call to some 'transform' or
    'remap' routine is issued, passing in arrays of data and functors.
    All the 'wielding' is done automatically, without any need for the
    user to even be aware of it, while still using highly efficient
    vector code with a thread pool. Compared to single-threaded,
    unvectorized operation, this can speed up the operation greatly,
    depending on the presence of several cores and vector units. On
    my system (Haswell 4-core i5 with AVX2) the speedup is about one
    order of magnitude. The only drawback is the production of a
    hardware-specific binary if vectorization is used. Both Vc use
    and multithreading can be easily activated/deactivated by #define
    switches, providing a clean fallback solution to produce code suitable
    for any target, even simple single-core machines with no vector units.
    Vectorization will be used if possible - either explicit vectorization
    (by defining USE_VC) - or autovectorization per default.
    Defining VSPLINE_SINGLETHREAD will disable multithreading.
    The code accessing multithreading and/or Vc use is #ifdeffed, so if
    these features are disabled, their code 'disappears' and the relevant 
    headers are not included, nor do the corresponding libraries have to
    be present.
*/

// TODO: don't multithread or reduce number of jobs for small data sets

#ifndef VSPLINE_TRANSFORM_H
#define VSPLINE_TRANSFORM_H

#include <type_traits>   // std::is_same, used by apply()

#include "multithread.h" // vspline's multithreading code
#include "eval.h"        // evaluation of b-splines
#include "poles.h"
#include "convolve.h"

// If user code defines VSPLINE_DEFAULT_PARTITIONER, that's what is used
// as the default partitioner. Otherwise, we use partition_to_stripes
// as default partitioner, which does a good job in most situations
// and never a 'really' bad job - hopefully ;)

#ifndef VSPLINE_DEFAULT_PARTITIONER
#define VSPLINE_DEFAULT_PARTITIONER vspline::partition_to_stripes
#endif

// The bulk of the implementation of vspline's two 'transform' functions
// is now in wielding.h:

#include "wielding.h"

namespace vspline {

/// implementation of two-array transform using wielding::coupled_wield.
///
/// 'array-based' transform takes two template arguments:
///
/// - 'unary_functor_type', which is a class satisfying the interface
///   laid down in unary_functor.h. Typically, this would be a type 
///   inheriting from vspline::unary_functor, but any type will do as
///   long as it provides the required typedefs and the relevant
///   eval() routines.
///
/// - the dimensionality of the input and output array
///
/// this overload of transform takes three parameters:
///
/// - a reference to a const unary_functor_type object providing the
///   functionality needed to generate values from arguments.
///
/// - a reference to a const MultiArrayView holding arguments to feed to
///   the unary functor object. It has to have the same shape as the target
///   array and contain data of the unary_functor's 'in_type'.
///
/// - a reference to a MultiArrayView to use as a target. This is where the
///   resulting data are put, so it has to contain data of unary_functor's
///   'out_type'. It has to have the same shape as the input array.
///
/// transform can be used without template arguments, they will be inferred
/// by ATD (automatic type deduction) from the arguments.

template < typename unary_functor_type ,
           unsigned int dimension >
void transform ( const unary_functor_type & functor ,
                 const vigra::MultiArrayView
                     < dimension ,
                       typename unary_functor_type::in_type
                     > & input ,
                 vigra::MultiArrayView
                     < dimension ,
                       typename unary_functor_type::out_type
                     > & output
               )
{
  // check shape compatibility
  
  if ( output.shape() != input.shape() )
  {
    throw vspline::shape_mismatch
     ( "transform: the shapes of the input and output array do not match" ) ;
  }

  // set up a range covering the whole source/target array

  vspline::shape_type < dimension > begin ;
  vspline::shape_type < dimension > end = output.shape() ;
  vspline::shape_range_type < dimension > range ( begin , end ) ;
  
  // wrap the vspline::unary_functor to be used with wielding code.
  // The wrapper is necessary because the code in wielding.h feeds
  // arguments as TinyVectors, even if the data are 'singular'.
  // The wrapper simply reinterpret_casts any TinyVectors of one
  // element to their corresponding value_type before calling the
  // functor. In other words, to use my own terminology: 'canonical'
  // types are reinterpret_cast to 'synthetic' types.

  typedef wielding::vs_adapter < unary_functor_type > coupled_functor_type ;    
  coupled_functor_type coupled_functor ( functor ) ;
  
  // we'll cast the pointers to the arrays to these types to be
  // compatible with the wrapped functor above.

  typedef typename coupled_functor_type::in_type src_type ;
  typedef typename coupled_functor_type::out_type trg_type ;
  
  typedef vigra::MultiArrayView < dimension , src_type > src_view_type ;
  typedef vigra::MultiArrayView < dimension , trg_type > trg_view_type ;
  
  // now delegate to the wielding code

  wielding::coupled_wield < coupled_functor_type , dimension >
    ( range ,
      vspline::default_njobs ,
      coupled_functor ,
      (src_view_type*)(&input) ,
      (trg_view_type*)(&output) ) ;
}

/// implementation of index-based transform using wielding::index_wield
///
/// this overload of transform() is very similar to the first one, but
/// instead of picking input from an array, it feeds the discrete coordinates
/// of each successive target location to the unary_functor_type object.
///
/// This sounds complicated, but is really quite simple. Let's assume you have
/// a 2X3 output array to fill with data. When this array is passed to transform,
/// the functor will be called with every coordinate pair in turn, and the result
/// the functor produces is written to the output array. So for the example given,
/// with 'ev' being the functor, we have this set of operations:
///
/// output [ ( 0 , 0 ) ] = ev ( ( 0 , 0 ) ) ;
///
/// output [ ( 1 , 0 ) ] = ev ( ( 1 , 0 ) ) ;
///
/// output [ ( 2 , 0 ) ] = ev ( ( 2 , 0 ) ) ;
///
/// output [ ( 0 , 1 ) ] = ev ( ( 0 , 1 ) ) ;
///
/// output [ ( 1 , 1 ) ] = ev ( ( 1 , 1 ) ) ;
///
/// output [ ( 2 , 1 ) ] = ev ( ( 2 , 1 ) ) ;
///
/// this transform overload takes one template argument:
///
/// - 'unary_functor_type', which is a class satisfying the interface laid
///   down in unary_functor.h. This is an object which can provide values
///   given *discrete* coordinates, like class evaluator, but generalized
///   to allow for arbitrary ways of achieving its goal. The unary functor's
///   'in_type' determines the number of dimensions of the coordinates - since
///   they are coordinates into the target array, the functor's input type
///   has to have the same number of dimensions as the target. The functor's
///   'out_type' has to be the same as the data type of the target array, since
///   the target array stores the results of calling the functor.
///
/// this transform overload takes two parameters:
///
/// - a reference to a const unary_functor_type object providing the
///   functionality needed to generate values from discrete coordinates
///
/// - a reference to a MultiArrayView to use as a target. This is where the
///   resulting data are put.
///
/// Please note that vspline follows vigra's coordinate handling convention,
/// which puts the fastest-changing index first. In a 2D image processing
/// context, this is the column index, or the x coordinate. C and C++ instead
/// put this index last when using multidimensional array access code.
///
/// transform can be used without template arguments, they will be inferred
/// by ATD from the arguments.

template < class unary_functor_type >
void transform ( const unary_functor_type & functor ,
                 vigra::MultiArrayView
                        < unary_functor_type::dim_in ,
                          typename unary_functor_type::out_type
                        > & output )
{
  enum { dimension = unary_functor_type::dim_in } ;
  
  // set up a range covering the whole target array

  vspline::shape_type < dimension > begin ;
  vspline::shape_type < dimension > end = output.shape() ;
  vspline::shape_range_type < dimension > range ( begin , end ) ;
  
  // wrap the vspline::unary_functor to be used with wielding code

  typedef wielding::vs_adapter < unary_functor_type > index_functor_type ;    
  index_functor_type index_functor ( functor ) ;
  
  // we'll cast the pointer to the target array to this type to be
  // compatible with the wrapped functor above

  typedef typename index_functor_type::out_type trg_type ;
  typedef vigra::MultiArrayView < dimension , trg_type > trg_view_type ;
  
  // now delegate to the wielding code

  wielding::index_wield < index_functor_type , dimension >
    ( range ,
      vspline::default_njobs ,
      index_functor ,
      (trg_view_type*)(&output) ) ;
}

/// we code 'apply' as a special variant of 'transform' where the output
/// is also used as input, so the effect is to feed the unary functor
/// each 'output' value in turn, let it process it and store the result
/// back to the same location. While this looks like a rather roundabout
/// way of performing an apply, it has the advantage of using the same
/// type of functor (namely one with const input and writable output),
/// rather than a different functor type which modifies its argument
/// in-place. While, at this level, using such a functor looks like a
/// feasible idea, it would require specialized code 'further down the
/// line' when complex functors are built with vspline's functional
/// programming tools: the 'apply-capable' functors would need to read
/// the output values first and write them back after anyway, resulting
/// in the same sequence of loads and stores as we get with the current
/// 'fake apply' implementation.

template < typename unary_functor_type  , // functor to apply
           unsigned int dimension >       // input/output array's dimension
void apply ( const unary_functor_type & ev ,
             vigra::MultiArrayView
                    < dimension ,
                      typename unary_functor_type::out_type >
                    & output )
{
  // make sure the functor's input and output type are the same

  static_assert ( std::is_same < typename unary_functor_type::in_type ,
                                 typename unary_functor_type::out_type > :: value ,
                  "apply: functor's input and output type must be the same" ) ;

  // delegate to transform

  transform ( ev , output , output ) ;
}
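
/* The load/process/store sequence described above can be sketched in
   isolation. This is not vspline code: add_one and apply_sketch are
   hypothetical names, and a std::vector stands in for the MultiArrayView.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// hypothetical functor with identical input and output type
struct add_one
{
  void eval ( const float & in , float & out ) const
  {
    out = in + 1.0f ;
  }
} ;

// 'fake apply': read each value, evaluate, store the result in the same place
template < typename functor_type >
void apply_sketch ( const functor_type & f , std::vector < float > & data )
{
  for ( std::size_t j = 0 ; j < data.size() ; j++ )
  {
    const float in = data [ j ] ;  // load
    float out ;
    f.eval ( in , out ) ;          // process
    data [ j ] = out ;             // store back
  }
}
```

   The functor keeps its const input and writable output; only the
   calling code knows that both refer to the same storage.
*/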

/// a type for a set of boundary condition codes, one per axis

template < unsigned int dimension >
using bcv_type = vigra::TinyVector < vspline::bc_code , dimension > ;

/// Implementation of 'classic' remap, which directly takes an array of
/// values and remaps it, internally creating a b-spline of given order
/// just for the purpose. This is used for one-shot remaps where the spline
/// isn't reused, and specific to b-splines, since the functor used is a
/// b-spline evaluator. The spline defaults to a cubic b-spline with
/// mirroring on the bounds.
///
/// So here we have the 'classic' remap, where the input array holds
/// coordinates and the functor used is actually an interpolator. Since
/// this is merely a special case of using transform(), we delegate to
/// transform() once we have the evaluator.
///
/// The template arguments are chosen to allow the user to call 'remap'
/// without template arguments; the template arguments can be found by ATD
/// by looking at the MultiArrayViews passed in.
///
/// - original_type is the value_type of the array holding the 'original' data over
///   which the interpolation is to be performed
///
/// - result_type is the value_type of the array taking the result of the remap, 
///   namely the values produced by the interpolation. these data must have as
///   many channels as original_type
///
/// - coordinate_type is the type for coordinates at which the interpolation is to
///   be performed. coordinates must have as many components as the input array
///   has dimensions.
///
/// optionally, remap takes a set of boundary condition values and a spline
/// degree, to allow creation of splines for specific use cases beyond the
/// default. I refrain from extending the argument list further; user code with
/// more specific requirements will have to create an evaluator and use 'transform'.
///
/// Note that remap can be called without template arguments, the types will
/// be inferred by ATD from the arguments passed in.

template < typename original_type ,   // data type of original data
           typename result_type ,     // data type for interpolated data
           typename coordinate_type , // data type for coordinates
           unsigned int cf_dimension ,  // dimensionality of original data
           unsigned int trg_dimension , // dimensionality of result array
           int bcv_dimension = cf_dimension > // see below. g++ ATD needs this.
void remap ( const vigra::MultiArrayView
                          < cf_dimension , original_type > & input ,
             const vigra::MultiArrayView
                          < trg_dimension , coordinate_type > & coordinates ,
             vigra::MultiArrayView
                    < trg_dimension , result_type > & output ,
             bcv_type < bcv_dimension > bcv
              = bcv_type < bcv_dimension > ( MIRROR ) ,
             int degree = 3 )
{
  static_assert (    vigra::ExpandElementResult < original_type > :: size
                  == vigra::ExpandElementResult < result_type > :: size ,
                  "input and output data type must have same nr. of channels" ) ;
                  
  static_assert (    cf_dimension
                  == vigra::ExpandElementResult < coordinate_type > :: size ,
                  "coordinate type must have the same dimension as input array" ) ;

  // this is silly, but when specifying bcv_type < cf_dimension >, the code failed
  // to compile with g++. So I use a separate template argument bcv_dimension
  // and static_assert it's the same as cf_dimension. TODO this sucks...
                  
  static_assert (    cf_dimension
                  == bcv_dimension ,
                  "boundary condition specification needs same dimension as input array" ) ;

  // check shape compatibility
  
  if ( output.shape() != coordinates.shape() )
  {
    throw shape_mismatch 
    ( "the shapes of the coordinate array and the output array must match" ) ;
  }

  // get a suitable type for the b-spline's coefficients

  typedef typename vigra::PromoteTraits < original_type , result_type >
                          :: Promote _cf_type ;
                          
  typedef typename vigra::NumericTraits < _cf_type >
                          :: RealPromote cf_type ;
  
  // create the bspline object
  
  typedef typename vspline::bspline < cf_type , cf_dimension > spline_type ;
  spline_type bsp ( input.shape() , degree , bcv ) ;
  
  // prefilter, taking data in 'input' as knot point data
  
  bsp.prefilter ( input ) ;

  // since this is a commodity function, we use a 'safe' evaluator.
  // If maximum performance is needed and the coordinates are known to be
  // in range, user code should create a 'naked' vspline::evaluator and
  // use it with vspline::transform.
  // Note how we pass in 'rc_type', the elementary type of a coordinate.
  // We want to allow the user to pass float or double coordinates.
  
  typedef typename vigra::ExpandElementResult < coordinate_type > :: type rc_type ;
  
  auto ev = vspline::make_safe_evaluator < spline_type , rc_type > ( bsp ) ;
  
  // call transform(), passing in the evaluator,
  // the coordinate array and the target array
  
  transform ( ev , coordinates , output ) ;
}

// next we have code for evaluation of b-splines over grids of coordinates.
// This code lends itself to some optimizations, since part of the weight
// generation used in the evaluation process is redundant, and by
// precalculating all redundant values and referring to the precalculated
// values during the evaluation a good deal of time can be saved - provided
// that the data involved are nD.

// TODO: as in separable convolution, it might be profitable here to apply
// weights for one axis to the entire array, then repeat with the other axes
// in turn. storing, modifying and rereading the array may still
// come out faster than the rather expensive DDA needed to produce the
// value with weighting in all dimensions applied at once, as the code
// below does (by simply applying the weights in the innermost eval
// routine which class evaluator offers). The potential gain ought to increase with
// the dimensionality of the data.

// for evaluation over grids of coordinates, we use a vigra::TinyVector
// of 1D MultiArrayViews holding the component coordinates for each
// axis.
// When default-constructed, this object holds default-constructed
// MultiArrayViews, which, when assigned another MultiArrayView,
// will hold another view over the data, rather than copying them.
// initially I was using a small array of pointers for the purpose,
// but that is potentially unsafe and does not allow passing strided
// data.

template < unsigned int dimension , typename rc_ele_type = float >
using grid_spec =
vigra::TinyVector
  < vigra::MultiArrayView < 1 , rc_ele_type > ,
    dimension
  > ;

namespace detail // workhorse code for grid_eval
{
// in grid_weight, for every dimension we have a set of spline_order
// weights for every position in this dimension. in grid_ofs, we have the
// partial offset for this dimension for every position. these partial
// offsets are the product of the index for this dimension at the position
// and the stride for this dimension, so that the sum of the partial
// offsets for all dimensions yields the offset into the coefficient array
// to the window of coefficients where the weights are to be applied.
// First we have code for 'level' > 0. _grid_eval uses a recursive
// descent through the dimensions, starting with the highest one and
// working its way down to 'level 0', the x axis. For level > 0,
// the vectorized and unvectorized code is the same:
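
/* The offset arithmetic described above, sketched for the 2D case. The
   names partial_offsets and cumulated_offset are invented for this
   illustration, and std::vector replaces the raw pointers used below.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// partial offsets for one axis: index at each grid position times the stride
std::vector < int > partial_offsets ( const std::vector < int > & index ,
                                      int stride )
{
  std::vector < int > ofs ( index.size() ) ;
  for ( std::size_t k = 0 ; k < index.size() ; k++ )
    ofs [ k ] = index [ k ] * stride ;
  return ofs ;
}

// offset into the coefficient array for grid position ( x , y ):
// the sum of the partial offsets for both axes
int cumulated_offset ( const std::vector < int > & ofs_x ,
                       const std::vector < int > & ofs_y ,
                       std::size_t x ,
                       std::size_t y )
{
  assert ( x < ofs_x.size() && y < ofs_y.size() ) ;
  return ofs_x [ x ] + ofs_y [ y ] ;
}
```

   With strides ( 1 , 10 ), indexes { 2 , 5 } along x and { 0 , 3 } along y,
   the grid position ( 1 , 1 ) maps to offset 5 + 30 = 35.
*/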

template < typename evaluator_type ,
           int level ,
           size_t _vsize = 0 >
struct _grid_eval
{
  // glean the data type for results and for MultiArrayView of
  // results - this is where the result of the operation goes

  typedef typename evaluator_type::trg_type trg_type ;
  typedef vigra::MultiArrayView < level + 1 , trg_type > target_view_type ;
  
  // get the type of a 'weight', a factor to apply to a coefficient.
  // we obtain this from the evaluator's 'inner_type', which has the
  // actual evaluation code, while class evaluator merely interfaces
  // to it.

  typedef typename evaluator_type::inner_type iev_type ;
  typedef typename iev_type::math_ele_type weight_type ;
  
  void operator() ( int initial_ofs ,
                    vigra::MultiArrayView < 2 , weight_type > & weight ,
                    weight_type** const & grid_weight ,
                    const int & spline_order ,
                    int ** const & grid_ofs ,
                    const evaluator_type & itp ,
                    target_view_type & result )
  {
    // iterating along the axis 'level', we fix a coordinate 'c'
    // to every possible value in turn

    for ( int c = 0 ; c < result.shape ( level ) ; c++ )
    {
      
      // we pick the set of weights corresponding to the axis 'level'
      // from 'grid_weight' for this level
  
      for ( int e = 0 ; e < spline_order ; e++ )
      {
        weight [ vigra::Shape2 ( e , level ) ]
          = grid_weight [ level ] [ spline_order * c + e ] ;
      }
      
      // cum_ofs, the cumulated offset, is the sum of the partial
      // offsets for all levels. here we add the contribution picked
      // from 'grid_ofs' for this level, at coordinate c

      int cum_ofs = initial_ofs + grid_ofs [ level ] [ c ] ;
      
      // for the recursive descent, we create a subdimensional slice
      // of 'result', fixing the coordinate for axis 'level'  at c:
      
      auto region = result.bindAt ( level , c ) ;
      
      // now we call _grid_eval recursively for the next-lower level,
      // passing on the arguments, which have received the current level's
      // additions, and the slice we've created in 'region'.
      
      _grid_eval < evaluator_type ,
                   level - 1 ,
                   _vsize >()
        ( cum_ofs ,
          weight ,
          grid_weight ,
          spline_order ,
          grid_ofs ,
          itp ,
          region ) ;
    }
  }
} ;

/// At level 0 the recursion ends. 'result' is now a 1D MultiArrayView,
/// which is easy to process. Here we perform the actual evaluation.
/// With template argument _vsize unfixed, we have the vector code;
/// below is a specialization for _vsize == 1 which is unvectorized.

template < typename evaluator_type , size_t _vsize >
struct _grid_eval < evaluator_type , 0 , _vsize >
{
  enum { vsize = evaluator_type::vsize } ;
  enum { channels = evaluator_type::channels } ;
  
  typedef typename evaluator_type::math_ele_type weight_type ;
  typedef typename vspline::vector_traits < weight_type , vsize >
                   :: ele_v math_ele_v ;
  typedef typename evaluator_type::trg_ele_type trg_ele_type ;
  typedef typename evaluator_type::trg_type trg_type ;
  typedef typename vspline::vector_traits < trg_type , vsize >
                   :: nd_ele_v trg_v ;
  typedef vigra::MultiArrayView < 1 , trg_type > target_view_type ;

  typedef typename evaluator_type::inner_type iev_type ;
  typedef typename iev_type::ofs_ele_type ofs_ele_type ;
  typedef typename vspline::vector_traits < ofs_ele_type , vsize >
                   :: ele_v ofs_ele_v ;
  typedef typename iev_type::trg_type trg_syn_type ;
  typedef vigra::MultiArrayView < 1 , trg_syn_type > target_syn_view_type ;
  
  void operator() ( int initial_ofs ,
                    vigra::MultiArrayView < 2 , weight_type > & weight ,
                    weight_type** const & grid_weight ,
                    const int & spline_order ,
                    int ** const & grid_ofs ,
                    const evaluator_type & itp ,
                    target_view_type & _region )
  {
    // we'll be using the evaluator's 'inner evaluator' for the actual
    // evaluation, which operates on 'synthetic' types. Hence this cast:
    
    auto & region = reinterpret_cast < target_syn_view_type & >
                                ( _region ) ;

    // number of vectorized results
    
    int aggregates = region.size() / vsize ;
    
    // have storage ready for vectorized weights
    
    using allocator_t
    = typename vspline::allocator_traits < math_ele_v > :: type ;
    
    vigra::MultiArray < 2 , math_ele_v , allocator_t > vweight ( weight.shape() ) ;
    
    // ditto, for vectorized offsets
    
    ofs_ele_v select ;
    
    // and a buffer for vectorized target data
    
    trg_v vtarget ;

    // initialize the vectorized weights for dimensions > 0
    // These remain constant throughout this routine, since the recursive
    // descent has successively fixed them (with the bindAt operation),
    // and in the course of the recursive descent they were deposited in
    // 'weight', where we now pick them up. Note how 'weight' holds these
    // values as fundamentals (weight_type), and now they are broadcast to
    // a vector type (math_ele_v), containing vsize identical copies.

    for ( int d = 1 ; d < weight.shape(1) ; d++ )
    {
      for ( int o = 0 ; o < spline_order ; o++ )
      {
        vweight [ vigra::Shape2 ( o , d ) ]
          = weight [ vigra::Shape2 ( o , d ) ] ;
      }
    }

    // get a pointer to the target array's data (seen as elementary type)

    trg_ele_type * p_target = (trg_ele_type*) ( region.data() ) ;

    // and the stride, if any, also in terms of the elementary type, from
    // one cluster of target data to the next

    int stride = vsize * channels * region.stride(0) ;

    // calculate scatter indexes for depositing result data

    const auto indexes =   ofs_ele_v::IndexesFromZero()
                         * channels
                         * region.stride(0) ;

    // now the peeling run, processing vectorized data as long as we
    // have full vectors to process

    for ( int a = 0 ; a < aggregates ; a++ )
    {
      // gather the individual weights into the vectorized form.
      // this operation gathers the level-0 weights, which are the only
      // part of the weights which vary throughout this routine, while the
      // higher-level weights have been fixed above.
      
      for ( int o = 0 ; o < spline_order ; o++ )
      {
        vweight[ vigra::Shape2 ( o , 0 ) ].gather
          ( grid_weight [ 0 ] + spline_order * a * vsize ,
            spline_order * ofs_ele_v::IndexesFromZero() + o ) ;
      }
    
      // get a set of vsize offsets from grid_ofs
      
      select.load ( grid_ofs [ 0 ] + a * vsize ) ;
      
      // add cumulated offsets from higher dimensions
      
      select += initial_ofs ;
      
      // now we can call the vectorized eval routine of evaluator's
      // 'inner' object.

      itp.inner.eval ( select , vweight , vtarget ) ;
      
      // finally we scatter the vectorized result to target memory

      for ( int e = 0 ; e < channels ; e++ )
        vtarget[e].scatter ( p_target + e , indexes ) ;

      // and set p_target to the next cluster of target values
      
      p_target += stride ;
    }
    
    // the first position unaffected by the peeling run is here: 
    
    int c0 = aggregates * vsize ;

    // create an iterator into the target array pointing to this position
    
    auto iter = region.begin() + c0 ;
    
    // now we finish off the stragglers, which is essentially the
    // same code as in the unvectorized specialization below.

    for ( int c = c0 ; c < region.shape ( 0 ) ; c++ )
    {
      // pick up the level-0 weights at this coordinate

      for ( int e = 0 ; e < spline_order ; e++ )
      {
        weight [ vigra::Shape2 ( e , 0 )  ]
          = grid_weight [ 0 ] [ spline_order * c + e ] ;
      }

      // add the last summand to the cumulated offset

      int cum_ofs = initial_ofs + grid_ofs [ 0 ] [ c ] ;
      
      // now we have everything together and can evaluate.

      itp.inner.eval ( cum_ofs , weight , *iter ) ;
        
      ++iter ;
    }
  }
} ;
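
/*
The vectorized specialization above follows a 'peel, then finish' pattern:
full groups of vsize elements are handled first, and the leftover elements
are processed one at a time. The following is only a minimal sketch of that
control flow, not vspline code. The names scale_peeled and the fixed vsize
of 4 are assumptions made for illustration, and a plain fixed-size loop
stands in for real SIMD operations.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// hypothetical stand-in for a SIMD width; real code would take this
// value from the evaluator type
constexpr std::size_t vsize = 4 ;

// scale every element of 'data' by 'factor': full groups of vsize
// elements first (the peeling run), then the leftover elements
inline void scale_peeled ( std::vector < float > & data , float factor )
{
  const std::size_t aggregates = data.size() / vsize ; // full groups
  float * p = data.data() ;

  for ( std::size_t a = 0 ; a < aggregates ; a++ )
  {
    // fixed-size inner loop, which a compiler may turn into SIMD code
    for ( std::size_t e = 0 ; e < vsize ; e++ )
      p [ e ] *= factor ;

    p += vsize ; // advance to the next group
  }

  // the first position unaffected by the peeling run
  const std::size_t c0 = aggregates * vsize ;
  assert ( p == data.data() + c0 ) ;

  // the stragglers, one at a time
  for ( std::size_t c = c0 ; c < data.size() ; c++ )
    data [ c ] *= factor ;
}
```

With seven elements and a vsize of 4, the peeling run covers the first four
elements and the straggler loop handles the remaining three.
*/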

/// unvectorized specialization of _grid_eval at level 0. This is,
/// essentially, the vectorized code above minus the peeling run.

template < typename evaluator_type >
struct _grid_eval < evaluator_type , 0 , 1 >
{
  typedef typename evaluator_type::math_ele_type weight_type ;
  typedef typename evaluator_type::trg_ele_type trg_ele_type ;
  typedef typename evaluator_type::trg_type trg_type ;
  typedef vigra::MultiArrayView < 1 , trg_type > target_view_type ;

  typedef typename evaluator_type::inner_type iev_type ;
  
  enum { channels = evaluator_type::channels } ;
  typedef vigra::TinyVector < trg_ele_type , channels > trg_syn_type ;
  typedef vigra::MultiArrayView < 1 , trg_syn_type > target_syn_view_type ;
  
  void operator() ( int initial_ofs ,
                    vigra::MultiArrayView < 2 , weight_type > & weight ,
                    weight_type** const & grid_weight ,
                    const int & spline_order ,
                    int ** const & grid_ofs ,
                    const evaluator_type & itp ,
                    target_view_type & _region )
  {
    auto & region = reinterpret_cast < target_syn_view_type & >
                               ( _region ) ;

    auto iter = region.begin() ;    

    for ( int c = 0 ; c < region.shape ( 0 ) ; c++ )
    {
      for ( int e = 0 ; e < spline_order ; e++ )
      {
        weight [ vigra::Shape2 ( e , 0 )  ]
          = grid_weight [ 0 ] [ spline_order * c + e ] ;
      }
      
      int cum_ofs = initial_ofs + grid_ofs [ 0 ] [ c ] ;
      
      itp.inner.eval ( cum_ofs , weight , *iter ) ;
        
      ++iter ;
    }
  }
} ;

/// Here is the single-threaded code for the grid_eval function.
/// The first argument is a shape range, defining the subset of data
/// to process in a single thread. The remaining arguments forward the
/// arguments of grid_eval, as pointers. The call is effected via
/// 'multithread()', which sets up the partitioning and the distribution
/// to threads from a thread pool.

template < typename evaluator_type > // b-spline evaluator type
void st_grid_eval ( shape_range_type < evaluator_type::dim_in > range ,
                    grid_spec < evaluator_type::dim_in ,
                                typename evaluator_type::rc_ele_type > * p_grid ,
                    const evaluator_type * itp ,
                    vigra::MultiArrayView < evaluator_type::dim_in ,
                                            typename evaluator_type::trg_type >
                      * p_result )
{
  enum { dimension = evaluator_type::dim_in } ;
  
  typedef typename evaluator_type::math_ele_type weight_type ;
  typedef typename evaluator_type::rc_ele_type rc_type ;
  typedef vigra::MultiArrayView < dimension , typename evaluator_type::trg_type > target_type ;
  
  const int spline_order = itp->inner.get_order() ;
  
  // pick the subarray of the 'whole' target array pertaining
  // to this thread's range
  
  auto result = p_result->subarray ( range[0] , range[1] ) ;
  
  // pick the subset of coordinates pertaining to this thread's
  // range
  
  grid_spec < evaluator_type::dim_in ,
              typename evaluator_type::rc_ele_type > r_grid ;
              
  auto & grid ( *p_grid ) ;
  
  // 'subarray' for 1D vigra::MultiArrayView can't take plain 'long',
  // so we have to package the limits of the range in TinyVectors.
  // The limits are indexes, so they are held in an integral type
  // rather than in the coordinate type.

  vigra::TinyVector < std::ptrdiff_t , 1 > _begin ;
  vigra::TinyVector < std::ptrdiff_t , 1 > _end ;
  
  for ( int d = 0 ; d < dimension ; d++ )
  {
    // would like to do this:
    // r_grid[d] = grid[d].subarray ( range[0][d] , range[1][d] ) ; 

    _begin[0] = range[0][d] ;
    _end[0] = range[1][d] ;
    r_grid[d] = grid[d].subarray ( _begin , _end ) ; 
  }

  // set up storage for precalculated weights and offsets

  weight_type * grid_weight [ dimension ] ;
  int * grid_ofs [ dimension ] ;
  
  // get some metrics

  TinyVector < int , dimension > shape ( result.shape() ) ;
  TinyVector < int , dimension > estride ( itp->inner.get_estride() ) ;
  
  // allocate space for the per-axis weights and offsets

  for ( int d = 0 ; d < dimension ; d++ )
  {
    grid_weight[d] = new weight_type [ spline_order * shape [ d ] ] ;
    grid_ofs[d] = new int [ shape [ d ] ] ;
  }
  
  int select ;
  rc_type tune ;
  
  // fill in the weights and offsets, using the interpolator's
  // split() to split the coordinates held in r_grid into integral
  // and fractional part, the interpolator's obtain_weights() method
  // to produce the weight components from the fractional part, and
  // the strides of the coefficient array to convert the integral
  // parts of the coordinates into offsets.

  for ( int d = 0 ; d < dimension ; d++ )
  {
    for ( int c = 0 ; c < shape [ d ] ; c++ )
    {
      itp->inner.split
           ( r_grid [ d ] [ c ] , select , tune ) ;
           
      itp->inner.obtain_weights
           ( grid_weight [ d ] + spline_order * c , d , tune ) ;
           
      grid_ofs [ d ] [ c ] = select * estride [ d ] ;
    }
  }
  
  // allocate storage for a set of singular weights
  
  using allocator_t
  = typename vspline::allocator_traits < weight_type > :: type ;
    
  vigra::MultiArray < 2 , weight_type , allocator_t > weight
       ( vigra::Shape2 ( spline_order , dimension ) ) ;
  
  // now call the recursive workhorse routine

  detail::_grid_eval < evaluator_type , dimension - 1 ,
                       evaluator_type::vsize >()
   ( 0 , weight , grid_weight , spline_order , grid_ofs , *itp , result ) ;

  // clean up

  for ( int d = 0 ; d < dimension ; d++ )
  {
    delete[] grid_weight[d] ;
    delete[] grid_ofs[d] ;
  }
  
}

} ; // end of namespace detail

/// this is the multithreaded version of grid_eval, which sets up the
/// full range over 'result' and calls 'multithread' to do the rest
///
/// grid_eval evaluates a b-spline object at points whose coordinates
/// form a grid: for every axis, the caller passes a set of coordinates
/// holding (at least) as many values as the target array is long along
/// that axis. The coordinate used at a given position in the target is
/// put together by picking, for each axis, the value at the corresponding
/// index. The resulting coordinate matrix (which remains implicit) is
/// the grid made from the per-axis coordinates. Note how these
/// coordinates needn't be evenly spaced, or in any specific order.
/// Evenly spaced coordinates would hold the potential for even further
/// optimization, specifically when decimation or up/downsampling are
/// needed. So far I haven't coded for these special cases, and since the
/// bulk of the processing time here is used up for memory access (to load
/// the relevant coefficients), and for the arithmetic to apply the weights
/// to the coefficients, the extra performance gain would be moderate.
///
/// If we have two dimensions and x coordinates x0, x1 and x2, and y
/// coordinates y0 and y1, the resulting implicit coordinate matrix is
///
/// (x0,y0) (x1,y0) (x2,y0)
///
/// (x0,y1) (x1,y1) (x2,y1)
///
/// Since the offsets and weights needed to perform an interpolation
/// only depend on the coordinates, this highly redundant coordinate array
/// can be processed more efficiently by precalculating the offset component
/// and weight component for all axes and then simply permuting them to
/// obtain the result. Especially for higher-degree and higher-dimensional
/// splines this saves quite some time, since the generation of weights
/// is computationally expensive.
///
/// grid_eval is useful for generating a scaled representation of the original
/// data, but when scaling down, aliasing will occur and the data should be
/// low-pass-filtered adequately before processing.
///
/// Note that this code is specific to b-spline evaluators and relies
/// on evaluator_type offering several b-spline specific methods which
/// are not present in other interpolators, like split() and
/// obtain_weights(). Since the weight generation for b-splines can
/// be done separately for each axis and is a computationally intensive
/// task, precalculating these per-axis weights makes sense. In the
/// general case (other unary functors), all that is left to do is
/// the permutation of the per-axis coordinates, so little is gained,
/// and a transform where the indices are used to pick up
/// the coordinates can be written easily: have a unary_functor taking
/// discrete coordinates, 'loaded' with the per-axis coordinates, and an
/// eval routine using the picked coordinates. This scheme is implemented
/// further below, in class grid_eval_functor and gen_grid_eval() using it.

template < typename evaluator_type >
void grid_eval ( grid_spec < evaluator_type::dim_in ,
                             typename evaluator_type::in_ele_type > & grid ,
                 const evaluator_type & itp ,
                 vigra::MultiArrayView < evaluator_type::dim_in ,
                                         typename evaluator_type::trg_type >
                   & result )
{
  enum { dimension = evaluator_type::dim_in } ;
  
  // make sure the grid specification has enough coordinates

  for ( int d = 0 ; d < dimension ; d++ )
    assert ( grid[d].size() >= result.shape ( d ) ) ;

  shape_range_type < dimension >
    range ( shape_type < dimension > () , result.shape() ) ;
  
  multithread ( detail::st_grid_eval < evaluator_type > ,
                VSPLINE_DEFAULT_PARTITIONER < dimension > ,
                ncores * 8 ,
                range ,
                &grid ,
                &itp ,
                &result ) ;
}
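
/*
The precalculation scheme used by grid_eval can be illustrated without any
vspline types. The sketch below is a simplified stand-in, not the code used
above: it works in 2D only and uses plain linear interpolation (two weights
per axis) instead of b-spline weights. The names axis_entry, precalculate
and grid_eval_2d are assumptions made for this example. What it shares with
grid_eval is the structure: offsets and weights are computed once per axis
coordinate and then combined for every grid position.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// per-axis precalculation for linear interpolation: 'lower' is the index
// of the lower of two neighbouring samples, w0 and w1 are the weights
// for the lower and the upper sample
struct axis_entry { std::size_t lower ; double w0 , w1 ; } ;

inline std::vector < axis_entry >
precalculate ( const std::vector < double > & coordinates ,
               std::size_t extent )
{
  assert ( extent >= 2 ) ;
  std::vector < axis_entry > result ;
  for ( double x : coordinates )
  {
    assert ( x >= 0.0 && x <= double ( extent - 1 ) ) ;
    std::size_t lower = std::size_t ( std::floor ( x ) ) ;
    if ( lower > extent - 2 )
      lower = extent - 2 ; // keep lower + 1 inside the array
    const double t = x - double ( lower ) ;
    result.push_back ( { lower , 1.0 - t , t } ) ;
  }
  return result ;
}

// evaluate the row-major array 'data' (width columns, height rows)
// at all combinations of the x coordinates gx and y coordinates gy
inline std::vector < double >
grid_eval_2d ( const std::vector < double > & data ,
               std::size_t width , std::size_t height ,
               const std::vector < double > & gx ,
               const std::vector < double > & gy )
{
  assert ( data.size() == width * height ) ;
  const auto ax = precalculate ( gx , width ) ;  // once per x coordinate
  const auto ay = precalculate ( gy , height ) ; // once per y coordinate
  std::vector < double > result ;
  result.reserve ( gx.size() * gy.size() ) ;
  for ( const auto & y : ay )
  {
    const double * r0 = data.data() + y.lower * width ; // lower row
    const double * r1 = r0 + width ;                    // upper row
    for ( const auto & x : ax )
    {
      const double v0 = x.w0 * r0 [ x.lower ] + x.w1 * r0 [ x.lower + 1 ] ;
      const double v1 = x.w0 * r1 [ x.lower ] + x.w1 * r1 [ x.lower + 1 ] ;
      result.push_back ( y.w0 * v0 + y.w1 * v1 ) ;
    }
  }
  return result ;
}
```

For a grid of N x M positions, precalculate runs N + M times in total,
whereas evaluating every position independently would split coordinates
and compute weights 2 * N * M times.
*/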

/// generalized grid evaluation. While grid_eval above specifically uses
/// b-spline evaluation, saving time by precalculating weights and offsets,
/// generalized grid evaluation can use any vspline::unary_functor on the
/// grid positions. If this functor happens to be a b-spline evaluator, the
/// result will be the same as the result obtained from using grid_eval,
/// but the calculation takes longer.
///
/// The implementation is simple: we wrap the 'inner' functor providing
/// evaluation at a grid location in an outer functor 'grid_eval_functor',
/// which receives discrete coordinates, picks the corresponding grid
/// coordinates and delegates to the inner functor to obtain the result.
/// The outer functor is then used with an index-based transform to fill
/// the target array.
///
/// This is a good example of the use of functional programming in vspline,
/// as it demonstrates wrapping of one functor in another and the use of
/// the combined functor with vspline::transform. It's also nice to have,
/// since it offers a straightforward equivalent implementation of grid_eval,
/// which can be used to double-check that grid_eval functions correctly,
/// as we do in scope_test.

template < typename _inner_type ,
           typename ic_type = int >
struct grid_eval_functor
: public vspline::unary_functor
         < typename vspline::canonical_type
                    < ic_type , _inner_type::dim_in > ,
           typename _inner_type::out_type ,
           _inner_type::vsize >
{
  typedef _inner_type inner_type ;
  
  enum { vsize = inner_type::vsize } ;
  enum { dimension = inner_type::dim_in } ;
  
  typedef typename vspline::canonical_type
                    < ic_type , dimension > in_type ;
                    
  typedef typename inner_type::out_type out_type ;
  
  typedef typename inner_type::in_ele_type rc_type ;
  
  typedef typename vspline::unary_functor
          < in_type , out_type , vsize > base_type ;
  
  static_assert ( std::is_integral < ic_type > :: value ,
                  "grid_eval_functor: must use integral coordinates" ) ;
  
  const inner_type inner ;
  
  typedef grid_spec < inner_type::dim_in ,
                      typename inner_type::in_ele_type > grid_spec_t ;

  grid_spec_t grid ;

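  // the initializers are listed in the order in which the members
  // are declared ('inner' first, then 'grid')

  grid_eval_functor ( grid_spec_t & _grid ,
                      const inner_type & _inner )
  : inner ( _inner ) ,
    grid ( _grid )
  { } ;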
  
  void eval ( const in_type & c , out_type & result ) const
  {
    typename inner_type::in_type cc ;
    
    // for uniform access, we use reinterpretations of the coordinates
    // as nD types, even if they are only 1D. This is only used to
    // fill in 'cc', the coordinate to be fed to 'inner'.

    typedef typename base_type::in_nd_ele_type nd_ic_type ;
    typedef typename inner_type::in_nd_ele_type nd_rc_type ;
  
    const nd_ic_type & nd_c ( reinterpret_cast < const nd_ic_type & > ( c ) ) ;
    nd_rc_type & nd_cc ( reinterpret_cast < nd_rc_type & > ( cc ) ) ;
    
    for ( int d = 0 ; d < dimension ; d++ )
      nd_cc [ d ] = grid [ d ] [ nd_c[d] ] ;
    
    inner.eval ( cc , result ) ;
  }
  
  template < typename = std::enable_if < ( vsize > 1 ) > >
  void eval ( const typename base_type::in_v & c ,
              typename base_type::out_v & result ) const
  {
    typename inner_type::in_v cc ;
    
    typedef typename base_type::in_nd_ele_v nd_ic_v ;
    typedef typename inner_type::in_nd_ele_v nd_rc_v ;

    const nd_ic_v & nd_c ( reinterpret_cast < const nd_ic_v & > ( c ) ) ;
    nd_rc_v & nd_cc ( reinterpret_cast < nd_rc_v & > ( cc ) ) ;
    
    // TODO: we might optimize in two ways:
    // if the grid data are contiguous, we can issue a gather,
    // and if the coordinates above dimension 0 are equal for all e,
    // we can assign a scalar to nd_cc[d] for d > 0.

    for ( int d = 0 ; d < dimension ; d++ )
      for ( int e = 0 ; e < vsize ; e++ )
        nd_cc[d][e] = grid[d][ nd_c[d][e] ] ;
    
    inner.eval ( cc , result ) ;
  }
} ;

/// generalized grid evaluation. The production of result values from
/// input values is done by an instance of grid_eval_functor, see above.
/// The template argument, ev_type, has to be a functor (usually this
/// will be a vspline::unary_functor). If the functor's in_type has
/// dim_in components, the grid specification must also hold dim_in
/// sets of coordinates, since ev's input is put together by picking
/// a value from each of these sets. The result obviously has to have
/// as many dimensions.

template < typename ev_type >
void gen_grid_eval ( grid_spec < ev_type::dim_in ,
                                 typename ev_type::in_ele_type > & grid ,
                     const ev_type & ev ,
                     vigra::MultiArrayView < ev_type::dim_in ,
                                             typename ev_type::out_type >
                       & result )
{
  // make sure the grid specification has enough coordinates
  
  for ( int d = 0 ; d < ev_type::dim_in ; d++ )
    assert ( grid[d].size() >= result.shape ( d ) ) ;

  // set up the grid evaluation functor and use it with 'transform'
  grid_eval_functor < ev_type > gev ( grid , ev ) ;
  vspline::transform ( gev , result ) ;
}
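
/*
The wrapping scheme used by grid_eval_functor and gen_grid_eval can be
reduced to a few lines of plain C++. The sketch below is an illustration
only and does not use vspline::unary_functor or vspline::transform: the
names grid_picker and fill_by_index are assumptions, the code is fixed
to 2D, and a simple nested loop stands in for the index-based transform.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// outer functor: receives discrete 2D indices, picks the corresponding
// per-axis coordinates and delegates to the wrapped inner functor
template < typename inner_type >
struct grid_picker
{
  std::array < std::vector < double > , 2 > grid ; // per-axis coordinates
  inner_type inner ;                               // wrapped functor

  grid_picker ( std::array < std::vector < double > , 2 > _grid ,
                inner_type _inner )
  : grid ( std::move ( _grid ) ) ,
    inner ( std::move ( _inner ) )
  { }

  double operator() ( std::size_t ix , std::size_t iy ) const
  {
    assert ( ix < grid[0].size() && iy < grid[1].size() ) ;
    return inner ( grid[0][ix] , grid[1][iy] ) ;
  }
} ;

// stand-in for an index-based transform: call f with every discrete
// coordinate and store the results in row-major order
template < typename functor_type >
std::vector < double >
fill_by_index ( const functor_type & f ,
                std::size_t width , std::size_t height )
{
  std::vector < double > result ( width * height ) ;
  for ( std::size_t iy = 0 ; iy < height ; iy++ )
    for ( std::size_t ix = 0 ; ix < width ; ix++ )
      result [ iy * width + ix ] = f ( ix , iy ) ;
  return result ;
}
```

Any callable taking two doubles can serve as the inner functor here, which
mirrors how gen_grid_eval accepts functors other than b-spline evaluators.
*/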

/// deprecated previous version taking the grid specification as
/// a pointer to pointers. These will go with the 0.4.x series

template < typename ev_type >
void grid_eval ( typename ev_type::in_ele_type ** const p_grid_spec ,
                 const ev_type & ev ,
                 vigra::MultiArrayView < ev_type::dim_in ,
                                         typename ev_type::out_type >
                   & result )
{
  typedef typename ev_type::in_ele_type rc_type ;
  
  vspline::grid_spec < ev_type::dim_in ,
                       rc_type > grid_spec ;

  for ( int i = 0 ; i < ev_type::dim_in ; i++ )
  {
    vigra::TinyVector < std::ptrdiff_t , 1 > sz ( result.shape(i) ) ;
    
    grid_spec[i] = vigra::MultiArrayView < 1 , rc_type >
                     ( sz , p_grid_spec[i] ) ;
  }
  
  grid_eval ( grid_spec , ev , result ) ;
}

/// deprecated previous version taking the grid specification as
/// a pointer to pointers. These will go with the 0.4.x series

template < typename ev_type >
void gen_grid_eval ( typename ev_type::in_ele_type ** const p_grid_spec ,
                     const ev_type & ev ,
                     vigra::MultiArrayView < ev_type::dim_in ,
                                             typename ev_type::out_type >
                       & result )
{
  typedef typename ev_type::in_ele_type rc_type ;
  
  vspline::grid_spec < ev_type::dim_in ,
                       rc_type > grid_spec ;

  for ( int i = 0 ; i < ev_type::dim_in ; i++ )
  {
    vigra::TinyVector < std::ptrdiff_t , 1 > sz ( result.shape(i) ) ;
    
    grid_spec[i] = vigra::MultiArrayView < 1 , rc_type >
                     ( sz , p_grid_spec[i] ) ;
  }
  
  gen_grid_eval ( grid_spec , ev , result ) ;
}

/// restore restores the original data from the b-spline coefficients.
/// This is done efficiently using a separable convolution; the kernel
/// is simply a unit-spaced sampling of the basis function.
/// Since the filter uses internal buffering, using this routine
/// in-place is safe, meaning that 'target' may be bspl.core itself.
/// math_ele_type, the data type for performing the actual maths on the
/// buffered data, and the type the data are converted to when they
/// are placed into the buffer, can be chosen, but I could not detect
/// any real benefits from using anything but the default (see below).
///
/// an alternative way to restore is running an index-based
/// transform with an evaluator for the spline. This is much
/// less efficient, but the effect is the same:
///
///   auto ev = vspline::make_evaluator ( bspl ) ;
///   vspline::transform ( ev , target ) ;
///
/// Note that vsize, the vectorization width, can be passed explicitly.
/// If Vc is in use and math_ele_type can be used with hardware
/// vectorization, the arithmetic will be done with Vc::SimdArrays
/// of the given size. Otherwise 'goading' will be used: the data are
/// presented in TinyVectors of vsize math_ele_type, hoping that the
/// compiler may autovectorize the operation.
///
/// 'math_ele_type', the type used for arithmetic inside the filter,
/// defaults to the vigra RealPromote type of value_type's elementary.
/// This ensures appropriate treatment of integral-valued splines.

// TODO hardcoded default vsize

template < unsigned int dimension ,
           typename value_type ,
           typename math_ele_type
             = typename vigra::NumericTraits
                        < typename vigra::ExpandElementResult
                                   < value_type > :: type 
                        > :: RealPromote ,
           size_t vsize = vspline::vector_traits<math_ele_type>::size >
void restore
  ( const vspline::bspline < value_type , dimension > & bspl ,
    vigra::MultiArrayView < dimension , value_type > & target )
{
  if ( target.shape() != bspl.core.shape() )
    throw shape_mismatch
     ( "restore: spline's core shape and target array shape must match" ) ;

  if ( bspl.spline_degree < 2 )
  {
    // we can handle the degree 0 and 1 cases very efficiently,
    // since we needn't apply a filter at all. This is an
    // optimization, the filter code would still perform
    // correctly without it.

    if ( (void*) ( bspl.core.data() ) != (void*) ( target.data() ) )
    {
      // operation is not in-place, copy data to target
      target = bspl.core ;
    }
    return ;
  }
  
  // first assemble the arguments for the filter

  int degree = bspl.spline_degree ;
  int headroom = degree / 2 ;
  int ksize = headroom * 2 + 1 ;
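  // the kernel is held in a std::vector, since ksize is only known at
  // run time and standard C++ has no variable-length arrays

  std::vector < xlf_type > kernel_storage ( ksize ) ;
  xlf_type * kernel = kernel_storage.data() ;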
  
  // pick the precomputed basis function values for the kernel.
  // Note how the values in precomputed_basis_function_values
  // (see poles.h) are provided at half-unit steps, hence the
  // index acrobatics.

  for ( int k = - headroom ; k <= headroom ; k++ )
  {
    int pick = 2 * std::abs ( k ) ;
    kernel [ k + headroom ]
    = vspline_constants
      ::precomputed_basis_function_values [ degree ]
        [ pick ] ;
  }
  
  // the arguments have to be passed one per axis. While most
  // arguments are the same throughout, the boundary conditions
  // may be different for each axis.

  std::vector < vspline::fir_filter_specs > vspecs ;
  
  for ( int axis = 0 ; axis < dimension ; axis++ )
  {
    vspecs.push_back 
      ( vspline::fir_filter_specs
        ( bspl.bcv [ axis ] , ksize , headroom , kernel ) ) ;
  }
 
  // KFJ 2018-05-08 with the automatic use of vectorization the
  // distinction whether math_ele_type is 'vectorizable' or not
  // is no longer needed: simdized_type will be a Vc::SimdArray
  // if possible, a vspline::simd_tv otherwise.
  
  typedef typename vspline::convolve
                            < vspline::simdized_type ,
                              math_ele_type ,
                              vsize
                            > filter_type ;
                            
  // now we have the filter's type, create an instance and
  // use it to effect the restoration of the original data.
  
  vspline::filter
  < value_type , value_type , dimension , filter_type >
  ( bspl.core , target , vspecs ) ;
}
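
/*
The kernel construction in restore, including the index arithmetic for the
half-unit table, can be tried out in isolation. The sketch below is a
simplified 1D illustration under stated assumptions, not the filter code
used above: it is fixed to degree 3, the table half_unit_values is a
hand-written stand-in for the precomputed basis function values, the names
make_kernel and restore_1d are hypothetical, and out-of-range taps are
mirrored on the first and last sample.

```cpp
#include <cassert>
#include <cstdlib>
#include <vector>

// cubic b-spline basis function sampled at 0, 0.5, 1 and 1.5
// (hypothetical table standing in for the precomputed values)
static const double half_unit_values [ 4 ]
  = { 2.0 / 3.0 , 23.0 / 48.0 , 1.0 / 6.0 , 1.0 / 48.0 } ;

// build the unit-spaced reconstruction kernel for degree 3
inline std::vector < double > make_kernel()
{
  const int degree = 3 ;
  const int headroom = degree / 2 ;    // 1
  const int ksize = headroom * 2 + 1 ; // 3
  std::vector < double > kernel ( ksize ) ;

  // the table has half-unit steps, so unit step k is found at 2 * |k|
  for ( int k = - headroom ; k <= headroom ; k++ )
    kernel [ k + headroom ] = half_unit_values [ 2 * std::abs ( k ) ] ;

  return kernel ;
}

// convolve 1D coefficients with the kernel to obtain the data values
inline std::vector < double >
restore_1d ( const std::vector < double > & coeffs )
{
  const std::vector < double > kernel = make_kernel() ;
  const int headroom = int ( kernel.size() ) / 2 ;
  const int n = int ( coeffs.size() ) ;
  assert ( n >= 2 ) ;
  std::vector < double > result ( n , 0.0 ) ;

  for ( int i = 0 ; i < n ; i++ )
  {
    for ( int k = - headroom ; k <= headroom ; k++ )
    {
      int j = i + k ;
      if ( j < 0 ) j = - j ;                // mirror on the first sample
      if ( j >= n ) j = 2 * ( n - 1 ) - j ; // mirror on the last sample
      result [ i ] += kernel [ k + headroom ] * coeffs [ j ] ;
    }
  }
  return result ;
}
```

For degree 3 the kernel comes out as 1/6, 2/3, 1/6. Its values sum to one,
so constant coefficients are reproduced unchanged.
*/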

/// overload of 'restore' writing the result of the operation back to
/// the array which is passed in. This looks like an in-place operation,
/// but the data are in fact moved to a buffer stripe by stripe, then
/// the arithmetic is done on the buffer and finally the buffer is
/// written back. This is repeated for each dimension of the array.

template < unsigned int dimension ,
           typename value_type ,
           typename math_ele_type
             = typename vigra::NumericTraits
                        < typename vigra::ExpandElementResult
                                   < value_type > :: type 
                        > :: RealPromote ,
           size_t vsize = vspline::vector_traits<math_ele_type>::size >
void restore
  ( vspline::bspline < value_type , dimension > & bspl )
{
  restore < dimension , value_type , math_ele_type , vsize >
    ( bspl , bspl.core ) ;
}

} ; // end of namespace vspline

#endif // VSPLINE_TRANSFORM_H
