Skip to content

Conversation

simonw
Copy link
Owner

@simonw simonw commented Jan 6, 2022

Refs #356

Still TODO:

  • Get --lines working, with tests
  • Get --text working, with tests
  • Get regular JSON import working with --convert with tests
  • Get --lines working with --convert with tests
  • Get --text working with --convert with tests
  • Get --csv and --tsv import working with --convert with tests
  • Get --nl working with --convert with tests
  • Documentation for all of the above

@simonw simonw added the enhancement New feature or request label Jan 6, 2022
@simonw
Copy link
Owner Author

simonw commented Jan 6, 2022

So far I've just implemented the new help:

% sqlite-utils insert --help
Usage: sqlite-utils insert [OPTIONS] PATH TABLE FILE

  Insert records from FILE into a table, creating the table if it does not
  already exist.

  By default the input is expected to be a JSON array of objects. Or:

  - Use --nl for newline-delimited JSON objects
  - Use --csv or --tsv for comma-separated or tab-separated input
  - Use --lines to write each incoming line to a column called "line"
  - Use --all to write the entire input to a column called "all"

  You can also use --convert to pass a fragment of Python code that will be
  used to convert each input.

  Your Python code will be passed a "row" variable representing the imported
  row, and can return a modified row.

  If you are using --lines your code will be passed a "line" variable, and for
  --all an "all" variable.

Options:
  --pk TEXT                 Columns to use as the primary key, e.g. id
  --flatten                 Flatten nested JSON objects, so {"a": {"b": 1}}
                            becomes {"a_b": 1}
  --nl                      Expect newline-delimited JSON
  -c, --csv                 Expect CSV input
  --tsv                     Expect TSV input
  --lines                   Treat each line as a single value called 'line'
  --all                     Treat input as a single value called 'all'
  --convert TEXT            Python code to convert each item
  --import TEXT             Python modules to import
  --delimiter TEXT          Delimiter to use for CSV files
  --quotechar TEXT          Quote character to use for CSV/TSV
  --sniff                   Detect delimiter and quote character
  --no-headers              CSV file has no header row
  --batch-size INTEGER      Commit every X records
  --alter                   Alter existing table to add any missing columns
  --not-null TEXT           Columns that should be created as NOT NULL
  --default <TEXT TEXT>...  Default value that should be set for a column
  --encoding TEXT           Character encoding for input, defaults to utf-8
  -d, --detect-types        Detect types for columns in CSV/TSV data
  --load-extension TEXT     SQLite extensions to load
  --silent                  Do not show progress bar
  --ignore                  Ignore records if pk already exists
  --replace                 Replace records if pk already exists
  --truncate                Truncate table before inserting records, if table
                            already exists
  -h, --help                Show this message and exit.

@codecov
Copy link

codecov bot commented Jan 6, 2022

Codecov Report

Merging #361 (b7f0b88) into main (f3fd861) will decrease coverage by 0.05%.
The diff coverage is 92.85%.

Impacted file tree graph

@@            Coverage Diff             @@
##             main     #361      +/-   ##
==========================================
- Coverage   96.49%   96.44%   -0.06%     
==========================================
  Files           5        5              
  Lines        2283     2306      +23     
==========================================
+ Hits         2203     2224      +21     
- Misses         80       82       +2     
Impacted Files Coverage Δ
sqlite_utils/cli.py 95.49% <92.00%> (-0.11%) ⬇️
sqlite_utils/utils.py 94.23% <100.00%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update f3fd861...b7f0b88. Read the comment docs.

@simonw
Copy link
Owner Author

simonw commented Jan 6, 2022

I'm going to refactor all of the tests for sqlite-utils insert into a new test_cli_insert.py module.

@simonw
Copy link
Owner Author

simonw commented Jan 6, 2022

Documentation: https://github.com/simonw/sqlite-utils/blob/33223856ff7fe746b7b77750fbe5b218531d0545/docs/cli.rst#inserting-unstructured-data-with---lines-and---all - I went with a single section titled "Inserting unstructured data with --lines and --all"

@simonw
Copy link
Owner Author

simonw commented Jan 6, 2022

I'm having second thoughts about this bit:

Your Python code will be passed a "row" variable representing the imported row, and can return a modified row.

If you are using --lines your code will be passed a "line" variable, and for --all an "all" variable.

The code in question is this:

# If single line and no 'return', add the return
if "\n" not in code and not code.strip().startswith("return "):
code = "return {}".format(code)
new_code = ["def fn(value):"]
for line in code.split("\n"):
new_code.append(" {}".format(line))
code_o = compile("\n".join(new_code), "<string>", "exec")

Do I really want to add the complexity of supporting different variable names there? I think always using value might be better.

Except... value made sense for the existing sqlite-utils convert command where you are running a conversion function against the value for the column in the current row - is it confusing if applied to lines or documents or all?

@simonw
Copy link
Owner Author

simonw commented Jan 6, 2022

Test code that just worked for me:

sqlite-utils insert /tmp/blah.db blah /tmp/log.log --convert '
bits = line.split()
return dict([("b_{}".format(i), bit) for i, bit in enumerate(bits)])' --lines

@simonw
Copy link
Owner Author

simonw commented Jan 6, 2022

And here's a demo of --convert used with --all - I added a custom error message for if the user's --convert code doesn't return a dict.

% sqlite-utils insert /tmp/all.db blah /tmp/log.log --convert 'all.upper()' --all         
Error: Records returned by your --convert function must be dicts
% sqlite-utils insert /tmp/all.db blah /tmp/log.log --convert '{"all": all.upper()}' --all
% sqlite-utils dump /tmp/all.db                                                           
BEGIN TRANSACTION;
CREATE TABLE [blah] (
   [all] TEXT
);
INSERT INTO "blah" VALUES('INFO:     127.0.0.1:60581 - "GET / HTTP/1.1" 200 OK
INFO:     127.0.0.1:60581 - "GET /FOO/-/STATIC/APP.CSS?CEAD5A HTTP/1.1" 200 OK
INFO:     127.0.0.1:60581 - "GET /FAVICON.ICO HTTP/1.1" 200 OK
INFO:     127.0.0.1:60581 - "GET /FOO/TIDDLYWIKI HTTP/1.1" 200 OK
INFO:     127.0.0.1:60581 - "GET /FOO/-/STATIC/APP.CSS?CEAD5A HTTP/1.1" 200 OK
INFO:     127.0.0.1:60584 - "GET /FOO/-/STATIC/SQL-FORMATTER-2.3.3.MIN.JS HTTP/1.1" 200 OK
INFO:     127.0.0.1:60586 - "GET /FOO/-/STATIC/CODEMIRROR-5.57.0.MIN.JS HTTP/1.1" 200 OK
INFO:     127.0.0.1:60585 - "GET /FOO/-/STATIC/CODEMIRROR-5.57.0.MIN.CSS HTTP/1.1" 200 OK
INFO:     127.0.0.1:60588 - "GET /FOO/-/STATIC/CODEMIRROR-5.57.0-SQL.MIN.JS HTTP/1.1" 200 OK
INFO:     127.0.0.1:60587 - "GET /FOO/-/STATIC/CM-RESIZE-1.0.1.MIN.JS HTTP/1.1" 200 OK
INFO:     127.0.0.1:60586 - "GET /FOO/TIDDLYWIKI/TIDDLERS HTTP/1.1" 200 OK
INFO:     127.0.0.1:60586 - "GET /FOO/-/STATIC/APP.CSS?CEAD5A HTTP/1.1" 200 OK
INFO:     127.0.0.1:60584 - "GET /FOO/-/STATIC/TABLE.JS HTTP/1.1" 200 OK
');
COMMIT;

@simonw
Copy link
Owner Author

simonw commented Jan 6, 2022

I added a custom error message for if the user's --convert code doesn't return a dict.

That turned out to be a bad idea because it meant exhausting the iterator early for the check - before we got to the .insert_all() code that breaks the iterator up into chunks. I tried fixing that with itertools.tee() to run the generator twice but that's grossly memory-inefficient for large imports.

@simonw
Copy link
Owner Author

simonw commented Jan 6, 2022

Here's the traceback if your --convert function doesn't return a dict right now:

% sqlite-utils insert /tmp/all.db blah /tmp/log.log --convert 'all.upper()' --all         

Traceback (most recent call last):
  File "/Users/simon/.local/share/virtualenvs/sqlite-utils-C4Ilevlm/bin/sqlite-utils", line 33, in <module>
    sys.exit(load_entry_point('sqlite-utils', 'console_scripts', 'sqlite-utils')())
  File "/Users/simon/.local/share/virtualenvs/sqlite-utils-C4Ilevlm/lib/python3.8/site-packages/click/core.py", line 1137, in __call__
    return self.main(*args, **kwargs)
  File "/Users/simon/.local/share/virtualenvs/sqlite-utils-C4Ilevlm/lib/python3.8/site-packages/click/core.py", line 1062, in main
    rv = self.invoke(ctx)
  File "/Users/simon/.local/share/virtualenvs/sqlite-utils-C4Ilevlm/lib/python3.8/site-packages/click/core.py", line 1668, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/Users/simon/.local/share/virtualenvs/sqlite-utils-C4Ilevlm/lib/python3.8/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/Users/simon/.local/share/virtualenvs/sqlite-utils-C4Ilevlm/lib/python3.8/site-packages/click/core.py", line 763, in invoke
    return __callback(*args, **kwargs)
  File "/Users/simon/Dropbox/Development/sqlite-utils/sqlite_utils/cli.py", line 949, in insert
    insert_upsert_implementation(
  File "/Users/simon/Dropbox/Development/sqlite-utils/sqlite_utils/cli.py", line 834, in insert_upsert_implementation
    db[table].insert_all(
  File "/Users/simon/Dropbox/Development/sqlite-utils/sqlite_utils/db.py", line 2602, in insert_all
    first_record = next(records)
  File "/Users/simon/Dropbox/Development/sqlite-utils/sqlite_utils/db.py", line 3044, in fix_square_braces
    for record in records:
  File "/Users/simon/Dropbox/Development/sqlite-utils/sqlite_utils/cli.py", line 831, in <genexpr>
    docs = (decode_base64_values(doc) for doc in docs)
  File "/Users/simon/Dropbox/Development/sqlite-utils/sqlite_utils/utils.py", line 86, in decode_base64_values
    to_fix = [
  File "/Users/simon/Dropbox/Development/sqlite-utils/sqlite_utils/utils.py", line 89, in <listcomp>
    if isinstance(doc[k], dict)
TypeError: string indices must be integers

I can live with that for the moment.

@simonw
Copy link
Owner Author

simonw commented Jan 6, 2022

all is a bad variable name because it clashes with the Python all() built-in function.

@simonw
Copy link
Owner Author

simonw commented Jan 6, 2022

I'm going to rename --all to --text:

  • Use --text to write the entire input to a column called "text"

To avoid that clash with Python's all() function.

@simonw
Copy link
Owner Author

simonw commented Jan 6, 2022

Just need documentation for --convert now against the various different types of input.

@simonw
Copy link
Owner Author

simonw commented Jan 6, 2022

For --text the conversion function should be allowed to return an iterable instead of a dictionary, in which case it will be treated as the full list of records to be inserted.

@simonw
Copy link
Owner Author

simonw commented Jan 6, 2022

Got that working:

% echo 'This is cool' | sqlite-utils insert words.db words - --text --convert '({"word": w} for w in text.split())'
% sqlite-utils dump words.db                                                                                       
BEGIN TRANSACTION;
CREATE TABLE [words] (
   [word] TEXT
);
INSERT INTO "words" VALUES('This');
INSERT INTO "words" VALUES('is');
INSERT INTO "words" VALUES('cool');
COMMIT;

@simonw
Copy link
Owner Author

simonw commented Jan 6, 2022

This is all documented. I'm going to rebase-merge it to keep the individual commits.

@simonw simonw merged commit 413f8ed into main Jan 6, 2022
@simonw simonw deleted the insert-convert branch January 6, 2022 06:24
@simonw simonw changed the title --lines and --all and --convert and --import --lines and --text and --convert and --import Jan 6, 2022
@simonw simonw mentioned this pull request Jan 9, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant