
to_pickable/from_pickable may be obsolete or could be simplified #440

@skirpichev

Quote from libmpf.py:

We don't pickle tuples directly for the following reasons:
  1: pickle uses str() for ints, which is inefficient when they are large
  2: pickle doesn't work for gmpy mpzs
Both problems are solved by using hex()
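
For context, the helpers the comment refers to look roughly like this (a paraphrase based on the quoted comment, not the exact libmpf.py source): an mpf is a (sign, man, exp, bc) tuple, and the mantissa is round-tripped through a hex string.

# Rough sketch of the current hex-based scheme (illustrative paraphrase).
def to_pickable(x):
    sign, man, exp, bc = x
    return sign, hex(man), exp, bc      # mantissa -> hex string

def from_pickable(x):
    sign, man, exp, bc = x
    return sign, int(man, 16), exp, bc  # hex string -> integer mantissa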

It seems gmpy2 now supports pickle, so reason 2 is gone. Reason 1 also doesn't seem to hold anymore; consider this benchmark:

$ cat bench.py 
import pickle
import gmpy2

# One million small ints, dumped as int, str, hex string and gmpy2.mpz.
with open('ai.dat', "bw") as f:
    for a in range(1000000):
        pickle.dump(a, f)

with open('as.dat', "bw") as f:
    for a in range(1000000):
        pickle.dump(str(a), f)

with open('ah.dat', "bw") as f:
    for a in range(1000000):
        pickle.dump(hex(a)[2:], f)

with open('ag.dat', "bw") as f:
    for a in range(1000000):
        pickle.dump(gmpy2.mpz(a), f)

# Twenty ~100-bit ints, dumped the same four ways.
big, step = 10**31, 20  # 10e30 is a float, so int(10e30) isn't exact

with open('bi.dat', "bw") as f:
    for a in range(big, big+step):
        pickle.dump(a, f)

with open('bs.dat', "bw") as f:
    for a in range(big, big+step):
        pickle.dump(str(a), f)

with open('bh.dat', "bw") as f:
    for a in range(big, big+step):
        pickle.dump(hex(a)[2:], f)

with open('bg.dat', "bw") as f:
    for a in range(big, big+step):
        pickle.dump(gmpy2.mpz(a), f)
$ python bench.py
$ ls -l *.dat
-rw-r--r-- 1 sk sk  43M Feb 16 13:50 ag.dat
-rw-r--r-- 1 sk sk  15M Feb 16 13:50 ah.dat
-rw-r--r-- 1 sk sk 7.6M Feb 16 13:50 ai.dat
-rw-r--r-- 1 sk sk  16M Feb 16 13:50 as.dat
-rw-r--r-- 1 sk sk 1.1K Feb 16 13:50 bg.dat
-rw-r--r-- 1 sk sk  720 Feb 16 13:50 bh.dat
-rw-r--r-- 1 sk sk  360 Feb 16 13:50 bi.dat
-rw-r--r-- 1 sk sk  820 Feb 16 13:50 bs.dat

So the most size-efficient dump uses plain ints, and as a first step I suggest replacing hex() with int(). That way, Sage's case will no longer be a special case.
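
A minimal sketch of that first step, keeping the same tuple layout (again a paraphrase, not a patch; MPZ here stands for whatever backend integer type libmpf is configured with):

def to_pickable(x):
    sign, man, exp, bc = x
    return sign, int(man), exp, bc  # plain int pickles compactly; also converts gmpy2.mpz

def from_pickable(x):
    sign, man, exp, bc = x
    return sign, MPZ(man), exp, bc  # back to the configured integer type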

But in the long term, I think it would be better to drop all special pickle workarounds. The huge dumps for pickled mpz's may need some investigation (perhaps there is a speed/size tradeoff?); at first sight it looks like a bug to me. Even now, though, pickling plain mpz's would already beat the str/hex representation for large inputs, since the sizes for int and mpz appear to be asymptotically the same (and int beats str/hex anyway):

$ cat bench2.py
import sys
import pickle
import gmpy2

# Twenty ints of roughly 2**N magnitude, N taken from the command line.
big, step = 1 << int(sys.argv[1]), 20

with open('i.dat', "bw") as f:
    for a in range(big, big+step):
        pickle.dump(a, f)

with open('g.dat', "bw") as f:
    for a in range(big, big+step):
        pickle.dump(gmpy2.mpz(a), f)
$ python bench2.py 100; ls -l *.dat
-rw-r--r-- 1 sk sk 1.1K Feb 16 14:12 g.dat
-rw-r--r-- 1 sk sk  360 Feb 16 14:12 i.dat
$ python bench2.py 1000; ls -l *.dat
-rw-r--r-- 1 sk sk 3.3K Feb 16 14:12 g.dat
-rw-r--r-- 1 sk sk 2.6K Feb 16 14:12 i.dat
$ python bench2.py 10000000; ls -l *.dat
-rw-r--r-- 1 sk sk 24M Feb 16 14:12 g.dat
-rw-r--r-- 1 sk sk 24M Feb 16 14:12 i.dat

(Tested on CPython 3.7.1 with gmpy2 2.1.0a4.)
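
And pickling mpz's does round-trip cleanly with these versions, so reason 2 from the libmpf comment is indeed gone; a quick check:

import pickle
import gmpy2

x = gmpy2.mpz(10)**30
y = pickle.loads(pickle.dumps(x))
assert y == x and type(y) is type(x)  # value and type survive the round trip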
