Consolidating memory lookups

Currently, four memory buffers are used to store information.  These are `paramIndices`, `cudaArray`, `functorConstants`, and `normalisationFactors`.  When a PDF needs to fetch a variable, the device-side function will look into the paramIndices to find the appropriate index within cudaArray.  This has poor memory coherence.  A proof of concept has been realized that will improve performance by having more coherent memory accesses, and removing second memory lookup.

This proof of concept has two major parts.  The first major part is updating the framework, such that the indices are configured before calling the PDF.  The second major part is updating all PDFs with the new index model.  

This current model will extend from the original proof of concept, which had placed all possible parameters into a single array, and allowed a generic PDF to space-leap through a PDF for random access.  This version will now utilize a container structure that will be used for containing four separate arrays, with four indices that each device function will update.  The device function pointers need to be changed such that we 'increment' an index to get the next function to be used.  Parameters need to be incremented based on the number of values used within a device function.  If a PDF uses two contants, then it will simply increment variables, norms, obs without performing a memory access.  The prototype follows.

**Parameter Container**:
```cpp
struct ParameterContainer
{
  fptype *parameters;  //[NumParams for device Function][Variables..]
  fptype *constants;  //[NumCons for device Function][Variables..]
  fptype *observables;  //[NumObs for device Function][Variables..]
  fptype *normalisations;  //[NumNorms for device Function][Variables..]
  int varIdx;
  int consIdx;
  int obsIdx;
  int normsIdx;
  int funcIdx;

  void incrementIndex (const int funcs, const int parameters, const int cons, const int obs, const int norms)
  {
    funcIdx += funcs + 1;
    parameterIdx += params + 1;
    constantIdx += cons + 1;
    observableIdx += obs + 1;
    normalIdx += norms + 1;
  }

  void incrementIndex()
  {
    funcIdx ++;
    parameterIdx += parameters[parameterIdx] + 1;
    constantIdx += constants[constantIdx] + 1;
    observableIdx += observables[observableIdx] + 1;
    normalIdx += normalisations[normalIdx] + 1;
  }
}; 
```

These parameter indices are populated by every PDF within the fit.  In order to increment the index for a given function, we simple add/increment based on the device PDF.  This container will be passed and modified for every device function.

**Functions**:
The PDF tree needs to re-write the function pointers into the `device_function` list in order of device pointers that are visited.  Essentially, convert a tree into an array, and each function that will be called is incremented.  This can be extended to handle divergent conditions based on events (ie a device function will call function a/b depending on event).  This will need to have a 'hardcoded' value to index off of, which can be looked up.

One item to ensure is the proper casting to a void* type for the NLL functions.

These changes will allow for coherent memory accesses as each device function is evaluated, since each function will evaluate and fetch variables linearly in memory.  Also this will remove the extra memory lookup that occurs for every variable/constant/normal/observable, which will greatly improve performance.

**Potential memory overheads**:
Some functions do not use all of these parameters in the parameter index list.  Any type of consolidation will greatly benefit this parameter list.  For instance, each ResonancePDF, DalitzPlotPdf, etc use the same mother/daughter masses.  These could be made constant, initially configured, and done.  Otherwise we duplicate `4*4*17` (4 byte size, 4 values, 17 device functions) which can be consolidated to 4*4 bytes in constant memory.

Memory consolidation has the adverse affect that we won't be able to perform divergent function lookups, as we won't be able to predict how far the function can go.

Here is an example of how the changes will be done.  One thing to note is that the NLL is currently handled without a function call due to conversions, should just take a reinterpret_cast.
```cpp
__device__ fptype MetricTaker::operator () (thrust::tuple<int, fptype*, fptype*, int> t) const {
  ParamIndex idx; //pass this datastructure around

  // Calculate event offset for this thread.
  int eventIndex = thrust::get<0>(t);
  int eventSize  = thrust::get<3>(t);
  const fptype* eventAddress = thrust::get<1>(t) + (eventIndex * abs(eventSize));

  fptype normal = idx.normalisations[idx.normalIdx + 1];
  idx.normalIdx++;

  // Pass Idx, each PDF updates this datastructure.
  fptype ret = callFunction(eventAddress, paramAddress, &idx);

  //This is WIP, since each calc function is done differently with multiple arguments.
  ret = (*(reinterpret_cast<device_metric_ptr>(device_function_table[pc.funcIdx])))(ret, obs, norm);
  return ret;
}
```

Each function will take a pointer to the event they are working on, the parameter address, and an integer for the function id and the parameter id.  Notice both start at 0, and will be manipulated based on the function.

Here is an **example** usage:
```cpp
__device__ fptype device_DalitzPlot (const fptype* __restrict evt, ParamIndex *idx) {
  int numVars = *idx->variables[idx->varIdx];
  int numObs = *idx->observables[idx->observableIdx];

  fptype m12 = evt[0];
  fptype m13 = evt[1];
  int evtNum = int (floor(0.5 + evt[2]));

  int numCons = *idx->constants[idx->consIdx];
  int numNorms = *idx->normalisations[idx->normsIdx];

  if (!inDalitz(m12, m13, *idx->constants[1], *idx->constants[2], *idx->constants[3], *idx->constants[4]))
    return 0;

  thrust::complex<fptype> totalAmp(0, 0);

  unsigned int numResonances = params[*indices + consIdx + 0];

  for (int i = 0; i < numResonances; ++i) {
    thrust::complex<fptype> amp = thrust::complex<fptype>(
        *idx->parameters[idx->paramIdx + i*2],
        *idx->parameters[idx->paramIdx + i*2 + 1]);

    thrust::complex<fptype> me = RO_CACHE(cResonances[i][evtNum]);
    totalAmp += amp * me;
  }

  fptype ret = norm2(totalAmp);

  pc.incrementIndex (1, numVars, numObs, numConstants, numNormals);

  //for the total number of resonances, we need to skip based on the ResonancePdfs.
  for (int i = 0; i < numResonances; i++)
    pc.incrementIndex ();

  fptype eff = callFunction(evt, idx);
  ret *= eff;

  return ret;
}
```

One other important piece of information is propagating variable updates from Minuit to the PDF list.  Using recursion, two passes properly distribute updated values.  First is a call to `updateVariable`, which will update a specified variable with a new value.  This is recursively called on all components for each variable.  Next, we need to regenerate our host_parameters buffer, and copy this to device memory.  This is done with another recursive call to `updateParameters`.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Consolidating memory lookups #71

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Consolidating memory lookups #71

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions