Benchmark the effect of merging query and key matrices in transformers

For certain architectures (such as GPT-J and LLaMA), it may be possible to replace the Query $Q$ and Key $K$ matrices with a single matrix, saving one of the seven or eight matrix multiplications in the transformer. I don't see an obvious way to do this for GPT-NeoX and OPT.

Take a standard benchmark and run the model before and after merging the Query and Key matrices.
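One possible shape for the before/after comparison is sketched below. It only measures wall-clock time; quality metrics from the benchmark would have to be compared separately. The names `median_runtime`, `compare`, `run_baseline` and `run_merged` are hypothetical: the two callables are assumed to wrap a forward pass (or a full benchmark run) of the original and the merged model.

```python
import statistics
import time


def median_runtime(fn, repeats=5):
    """Call fn() `repeats` times and return the median wall-clock time in seconds."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


def compare(run_baseline, run_merged, repeats=5):
    """Time both variants and report the ratio baseline / merged.

    run_baseline and run_merged are placeholders (assumptions) for whatever
    runs the model with separate Q and K, and with the merged matrix.
    """
    baseline = median_runtime(run_baseline, repeats)
    merged = median_runtime(run_merged, repeats)
    # Guard against a zero reading on very fast callables.
    speedup = baseline / merged if merged > 0 else float("inf")
    return {"baseline_s": baseline, "merged_s": merged, "speedup": speedup}
```

A `speedup` above 1 would mean the merged variant ran faster in this measurement; a few warm-up calls before timing are probably worthwhile on real hardware.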



---------- Following are the details: (How to write latex in GitHub?)----------
`.T()` denotes transpose

Consider the input representation $X = \lbrace x_1, \dots, x_i, \dots, x_j, \dots, x_n \rbrace$.
qi = MatMul(Q, xi)
kj = MatMul(K, xj)

score_i,j = MatMul(qi.T(),  kj)
= MatMul(   MatMul(Q, xi).T(),   MatMul(K, xj)   )
= MatMul(   MatMul(xi.T(), Q.T()),   MatMul(K, xj)   )
= MatrixChainMul(xi.T(), Q.T(), K, xj)

let QKMerge = MatMul(Q.T(), K)

score_i,j = MatrixChainMul(xi.T(), QKMerge, xj)
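The identity above can be checked numerically. The following is a minimal sketch in plain Python (no tensor library, so it is only meant for tiny matrices); the helper names `transpose`, `matmul`, `matvec`, `dot`, `merge_qk`, `score_separate` and `score_merged` are made up for this illustration. It covers only the plain dot-product score derived above, for a single pair of projection matrices.

```python
import random


def transpose(M):
    """Return M.T() for a matrix stored as a list of rows."""
    return [list(col) for col in zip(*M)]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def matvec(M, v):
    """MatMul(M, v) for a matrix M and a vector v."""
    return [dot(row, v) for row in M]


def matmul(A, B):
    """MatMul(A, B) for two matrices."""
    Bt = transpose(B)
    return [[dot(row, col) for col in Bt] for row in A]


def merge_qk(Q, K):
    """QKMerge = MatMul(Q.T(), K), computed once ahead of time."""
    return matmul(transpose(Q), K)


def score_separate(Q, K, xi, xj):
    """score_i,j = MatMul(qi.T(), kj) with qi = Q xi and kj = K xj."""
    return dot(matvec(Q, xi), matvec(K, xj))


def score_merged(QKMerge, xi, xj):
    """score_i,j = MatrixChainMul(xi.T(), QKMerge, xj)."""
    return dot(xi, matvec(QKMerge, xj))


if __name__ == "__main__":
    random.seed(0)
    d_head, d_model = 4, 8  # illustrative sizes, not taken from any real model
    Q = [[random.gauss(0, 1) for _ in range(d_model)] for _ in range(d_head)]
    K = [[random.gauss(0, 1) for _ in range(d_model)] for _ in range(d_head)]
    xi = [random.gauss(0, 1) for _ in range(d_model)]
    xj = [random.gauss(0, 1) for _ in range(d_model)]
    difference = score_separate(Q, K, xi, xj) - score_merged(merge_qk(Q, K), xi, xj)
    print(abs(difference) < 1e-9)
```

The two scores should agree up to floating-point rounding, since the merged form only regroups the same chain of multiplications.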




