Feature Request: Add HAS_BOM option in csv reader #6727

douenergy · 2023-03-15T09:59:11Z


"The BOM (byte order mark) is a particular usage of the special Unicode character, U+FEFF BYTE ORDER MARK, whose appearance as a magic number at the start of a text stream can signal several things to a program reading the text"

people.csv

FROM 'people.csv'; -- works fine

SELECT id FROM 'people.csv'; -- Error: Binder Error: Referenced column "id" not found in FROM clause!

The problem is that there is a "BOM" character before id.
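To make the failure concrete, here is a minimal Python sketch using only the standard library. It assumes the header's first field carries a U+FEFF character inside the quotes; the sample string is made up for illustration and is not the actual people.csv.

```python
import csv
import io

# Assumed sample: a BOM character (U+FEFF) sits inside the first quoted header field.
raw = '"\ufeffid","name"\n"1","Alice"\n"2","Bob"\n'

# Parse only the header row.
header = next(csv.reader(io.StringIO(raw)))

# The first column name looks like "id" when printed, but it is not equal to "id".
print(repr(header[0]))
print(header[0] == "id")                   # False: the hidden character breaks the lookup
print(header[0].lstrip("\ufeff") == "id")  # True once U+FEFF is stripped
```

The column is effectively named U+FEFF followed by "id", which is why a lookup by the plain name id fails.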

pandas can work around this by specifying engine='python':

import pandas as pd 
people = pd.read_csv('./people.csv', engine='python')
print(people['id'])

https://en.wikipedia.org/wiki/Byte_order_mark

douenergy · 2023-03-15T10:05:52Z

Author

#3014 does not seem to be working.


Mytherin · 2023-03-15T11:44:32Z

Maintainer

The BOM in this file is not actually valid: the BOM has to be the first three bytes of the file, but here the first byte of the file is the quote instead. If we modify the file so that the BOM is actually at the start, this works as expected:

"id","name"
"1","Alice"
"2","Bob"

SELECT id FROM '~/Downloads/people.csv';
┌───────┐
│  id   │
│ int64 │
├───────┤
│     1 │
│     2 │
└───────┘
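To check where the BOM sits in a file, and to move it to the start as described above, a minimal Python sketch could look like this. The helper names bom_offset and move_bom_to_start are hypothetical, and the sample bytes are an assumption modelled on the description (quote first, then the BOM); this is a pre-processing idea, not how DuckDB handles the BOM internally.

```python
import codecs


def bom_offset(data: bytes) -> int:
    """Return the byte offset of the first UTF-8 BOM (EF BB BF), or -1 if absent."""
    return data.find(codecs.BOM_UTF8)


def move_bom_to_start(data: bytes) -> bytes:
    """Hypothetical helper: relocate the first UTF-8 BOM to offset 0."""
    if bom_offset(data) <= 0:
        return data  # no BOM at all, or the BOM is already at the start
    # Remove the first BOM occurrence and prepend it to the data.
    return codecs.BOM_UTF8 + data.replace(codecs.BOM_UTF8, b"", 1)


# Assumed sample: the opening quote comes first, followed by the BOM.
broken = b'"\xef\xbb\xbfid","name"\n"1","Alice"\n'
fixed = move_bom_to_start(broken)

print(bom_offset(broken))  # 1: the BOM follows the opening quote
print(bom_offset(fixed))   # 0: the BOM is now the first three bytes
```

For a real file, the same functions can be applied to the result of open(path, 'rb').read() and the output written back with open(path, 'wb').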

You can also use normalize_names to drop special characters from the column names:

SELECT id FROM read_csv_auto('~/Downloads/people.csv', normalize_names=True);
┌───────┐
│  id   │
│ int64 │
├───────┤
│     1 │
│     2 │
└───────┘


douenergy · 2023-03-15T12:55:57Z

Author

Thanks! normalize_names works perfectly.


jimdickson007 · 2024-09-14T06:08:09Z


I think how DuckDB (read_csv) handles the BOM should be explicitly documented.
I searched for a long time before I stumbled on normalize_names.
Actually, it appears to drop the BOM even with the default normalize_names = false.

