QG Index Creation and Build Failure #180

@arpitagarwal-meesho

Hello,

I created an index of 113M embeddings (768 dimensions) with the following prf:

AccuracyTable
BatchSizeForCreation    200
BuildTimeLimit  0
DatabaseType    Memory
Dimension       768
DistanceType    Cosine
DynamicEdgeSizeBase     30
DynamicEdgeSizeRate     20
EdgeSizeForCreation     10
EdgeSizeForSearch       40
EdgeSizeLimitForCreation        5
EpsilonForCreation      0.1
GraphType       ANNG
IncomingEdge    80
IncrimentalEdgeSizeLimitForTruncation   0
IndexType       GraphAndTree
ObjectAlignment False
ObjectType      Float-4
OutgoingEdge    10
PathAdjustmentInterval  0
PrefetchOffset  1
PrefetchSize    3072
SeedSize        10
SeedType        None
ThreadPoolSize  32
TruncationThreadPoolSize        8

The search works fine on this index.

However, when I attempt to create and build a QG index, I run into multiple issues.

  1. QG Index Creation
    I created a QG index for this index by executing the following command:
    qbg create-qg -d 768 -D C -E 10 -S 40 -i t -o f -p 32 -N 384 -c 16 -C sqsu8 -B 2 -b 200 -M l -L s -e 0.1 -v /path_to_index

Issue:
The generated /path_to_index/qg/prf file contains values that do not match the arguments I passed.

Here is the generated qg/prf file:

BatchSize       1000
CentroidCreationMode    1
DataSize        0
DataType        1
Dimension       768
DistanceType    1
GenuineDataType 1
GenuineDimension        768
GlobalCentroidLimit     1
GlobalRange     0
LocalCentroidCreationMode       1
LocalCentroidLimit      16
LocalClusterDataType    2
LocalCodebookState      1
LocalDivisionNo 384
LocalIDByteSize 2
LocalRange      0
LocalSampleCoefficient  100
MaxMagnitude    -1
QuantizerType   0
RefinementDataType      99
ScalarQuantizationClippingRate  0.01
ScalarQuantizationNoOfSamples   0
ScalarQuantizationOffset        0
ScalarQuantizationScale 0
SingleLocalCodebook     0
ThreadSize      24

Q. Why are the values in qg/prf different from the ones I passed? Is there an internal default overriding them?
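To make the mismatch concrete, here is a minimal Python sketch that parses a prf file and reports the keys whose values differ from what was passed. The pairing of command-line flags to prf keys in `EXPECTED` is only my guess (for example, `-b` to `BatchSize` and `-p` to `ThreadSize`); some of these flags may configure the graph index rather than the quantizer, in which case those differences would be expected. The helper names `parse_prf` and `diff_prf` are hypothetical and not part of NGT.

```python
def parse_prf(text):
    """Parse 'Key<whitespace>Value' lines into a dict; a key with no value maps to ''."""
    props = {}
    for line in text.splitlines():
        parts = line.split(None, 1)
        if parts:
            props[parts[0]] = parts[1].strip() if len(parts) > 1 else ""
    return props


def diff_prf(expected, actual):
    """Return {key: (expected_value, actual_value)} for every key that differs."""
    return {
        key: (want, actual.get(key))
        for key, want in expected.items()
        if actual.get(key) != want
    }


# Excerpt of the generated qg/prf shown above.
QG_PRF = """\
BatchSize       1000
Dimension       768
LocalCentroidLimit      16
LocalDivisionNo 384
ThreadSize      24
"""

# ASSUMPTION: guessed flag-to-key pairing (-d, -N, -c, -b, -p); not confirmed by NGT docs.
EXPECTED = {
    "Dimension": "768",
    "LocalDivisionNo": "384",
    "LocalCentroidLimit": "16",
    "BatchSize": "200",
    "ThreadSize": "32",
}

mismatches = diff_prf(EXPECTED, parse_prf(QG_PRF))
for key, (want, got) in mismatches.items():
    print(f"{key}: passed {want}, prf has {got}")
```

Under that guessed pairing, only `BatchSize` and `ThreadSize` differ; `Dimension`, `LocalDivisionNo`, and `LocalCentroidLimit` match the arguments.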

  2. QG Index Build Failure
    After creation, I ran the following command to build the QG index:
    qbg build-qg -E 128 -v /path_to_index

which fails (or gets stuck).

Observations:

  • CPU usage reaches 100% while it is processing the index objects (the machine is an n2-highmem-128)
  • Intermediate logs:
append: Data loading time=2.255e-05 (sec) 0.02255 (msec)
# of objects=16
Index creation time=0.00132462 (sec) 1.32462 (msec)
qbg: loading the rotation...
QuantizationCodebook::buildIndex
QuantizationCodebook::buildIndex # of the centroids=1
load() done
codebook index size=1
  • Tail logs for this command:
# of processed objects=105000000, time=21.5226 (m), vm size=104.58 G/104.58 G
# of processed objects=106000000, time=21.7245 (m), vm size=104.58 G/104.58 G
# of processed objects=107000000, time=21.9246 (m), vm size=104.58 G/104.58 G
# of processed objects=108000000, time=22.1269 (m), vm size=104.58 G/104.58 G
# of processed objects=109000000, time=22.3308 (m), vm size=104.58 G/104.58 G
# of processed objects=110000000, time=22.5344 (m), vm size=104.58 G/104.58 G
# of processed objects=111000000, time=22.7377 (m), vm size=104.58 G/104.58 G
# of processed objects=112000000, time=22.941 (m), vm size=104.58 G/104.58 G
# of processed objects=113000000, time=23.1424 (m), vm size=104.58 G/104.58 G
cp: cannot stat '/path_to_index/qg/ws/hkc_3c': No such file or directory

After the cp error is logged, CPU and memory usage both drop to 0 and the index build appears stuck (or has failed).
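For reference, the tail log can be summarized with a short Python sketch like the one below. It extracts the progress lines, computes the average processing rate between the first and last of them, and collects any `cp:` error lines. The regular expression is based only on the log format shown above, and the `summarize` helper is hypothetical.

```python
import re

# Matches lines such as:
# "# of processed objects=105000000, time=21.5226 (m), vm size=104.58 G/104.58 G"
PROGRESS = re.compile(r"# of processed objects=(\d+), time=([\d.]+) \(m\)")


def summarize(log_text):
    """Return the last object count, the average objects/minute, and cp error lines."""
    points, errors = [], []
    for line in log_text.splitlines():
        match = PROGRESS.search(line)
        if match:
            points.append((int(match.group(1)), float(match.group(2))))
        elif line.lstrip().startswith("cp:"):
            errors.append(line.strip())
    rate = None
    if len(points) >= 2:
        (n0, t0), (n1, t1) = points[0], points[-1]
        if t1 > t0:
            rate = (n1 - n0) / (t1 - t0)
    last_count = points[-1][0] if points else None
    return {"last_count": last_count, "objects_per_minute": rate, "errors": errors}


# First and last progress lines plus the error line from the tail log above.
TAIL_LOG = """\
# of processed objects=105000000, time=21.5226 (m), vm size=104.58 G/104.58 G
# of processed objects=113000000, time=23.1424 (m), vm size=104.58 G/104.58 G
cp: cannot stat '/path_to_index/qg/ws/hkc_3c': No such file or directory
"""

summary = summarize(TAIL_LOG)
print(summary["last_count"], summary["errors"])
print(f"{summary['objects_per_minute']:.0f} objects/minute")
```

On this excerpt the rate works out to a little under 5 million objects per minute, with a constant vm size, so the object pass itself seems to progress steadily and the problem shows up only at the step that logs the cp error.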

Questions

  1. Why does the QG index creation (create-qg) generate a qg/prf file with mismatched values?
  • Are there any internal defaults that override command-line arguments?
  • Is there a way to verify which parameters were actually used?
  2. Why does the build-qg process fail after cp: cannot stat '/path_to_index/qg/ws/hkc_3c'?
  • What is this missing file, and why is it required?
  • Is it possible that a previous step failed, leading to missing files?
  • Is there any known issue related to QG index builds for very large datasets?
  3. How can I resolve this issue and successfully build the QG index for 113M embeddings?
  • Are there additional configuration settings I should check?
  • Are there memory/CPU constraints I should be aware of?
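
As a rough reference for the memory question, the sketch below estimates the size of the raw vectors alone. It assumes about 113,000,000 objects (the last count in the log), 768 dimensions, 4 bytes per component for Float-4, and, as an unconfirmed assumption, 1 byte per component for the `sqsu8` setting. Graph edges, codebooks, and any other overhead are not included.

```python
N_OBJECTS = 113_000_000  # approximate; taken from the last progress line in the log
DIMENSION = 768


def raw_bytes(n_objects, dimension, bytes_per_component):
    """Size of the vectors alone, with no graph or index overhead."""
    return n_objects * dimension * bytes_per_component


float32_bytes = raw_bytes(N_OBJECTS, DIMENSION, 4)  # Float-4 objects
uint8_bytes = raw_bytes(N_OBJECTS, DIMENSION, 1)    # ASSUMPTION: sqsu8 uses 1 byte/component

print(f"float32 vectors: {float32_bytes / 1e9:.1f} GB")  # 347.1 GB
print(f"uint8 vectors:   {uint8_bytes / 1e9:.1f} GB")    # 86.8 GB
```

The vm size of 104.58 G reported during the build is in the same range as the 1-byte estimate, but I do not know whether the log reports GB or GiB, so this is only suggestive.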
