CSV dictionaries cannot contain commas #1785

@elonzh


Describe the bug

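When a CSV file is used as a dictionary, loading fails if any entry contains a comma.

Judging from the code, HanLP simply splits each line on commas and does not handle CSV escaping.

Under CSV escaping rules, a column value that contains a double quote (") or a comma (,) is enclosed in double quotes.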
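To make the escaping rule concrete, here is a minimal sketch of a quote-aware line splitter. The class name CsvLineSplitter and its split method are hypothetical and are not part of HanLP. The sketch assumes one record per line, so quoted fields that span several lines are not handled.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of HanLP: splits one CSV line into fields
// while honoring double-quoted fields.
public class CsvLineSplitter {

    public static List<String> split(String line) {
        List<String> fields = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (inQuotes) {
                if (c == '"') {
                    if (i + 1 < line.length() && line.charAt(i + 1) == '"') {
                        current.append('"'); // a doubled quote is a literal quote
                        i++;
                    } else {
                        inQuotes = false;    // closing quote of the field
                    }
                } else {
                    current.append(c);       // commas inside quotes are kept
                }
            } else if (c == '"') {
                inQuotes = true;             // opening quote of a field
            } else if (c == ',') {
                fields.add(current.toString()); // unquoted comma ends the field
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        fields.add(current.toString());      // last field on the line
        return fields;
    }

    public static void main(String[] args) {
        // The quoted entry from this report stays a single field.
        System.out.println(split("\"3L: Language, Linguistics, Literature\"").size());
        // An unquoted line with commas is still split as usual.
        System.out.println(split("word,n,1").size());
    }
}
```

With a splitter along these lines, the quoted entry from the reproduction file would be read as one word instead of three columns.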

Code to reproduce the issue

Save the following text directly as a csv file and load it as a dictionary.

19th century music
20 century British history
21st Century Music
21st century science & technology
2D Materials
3 Biotech
3D Printing and Additive Manufacturing
3D Printing in Medicine
3D Research
"3L: Language, Linguistics, Literature"

Describe the current behavior

Exception in thread "main" java.lang.NumberFormatException: For input string: " Linguistics"
	at java.base/java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
	at java.base/java.lang.Integer.parseInt(Integer.java:638)
	at java.base/java.lang.Integer.parseInt(Integer.java:770)
	at com.hankcs.hanlp.corpus.io.IOUtil.loadDictionary(IOUtil.java:794)
	at com.hankcs.hanlp.corpus.io.IOUtil.loadDictionary(IOUtil.java:752)
	at com.hankcs.hanlp.seg.Other.DoubleArrayTrieSegment.<init>(DoubleArrayTrieSegment.java:68)
	at org.grobid.core.lexicon.DictSegmenterKt.main(DictSegmenter.kt:6)
	at org.grobid.core.lexicon.DictSegmenterKt.main(DictSegmenter.kt)
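
The " Linguistics" string in the exception can be reproduced without HanLP. The sketch below uses only the JDK. NaiveSplitDemo and its methods are hypothetical names, and which column HanLP actually parses as an integer is an assumption. The trace only shows that Integer.parseInt was called from IOUtil.loadDictionary with that input.

```java
// Hypothetical demo, not HanLP code: shows what a plain comma split does
// to the quoted entry and how one fragment fails integer parsing.
public class NaiveSplitDemo {

    /** Splits on every comma, ignoring CSV quoting. */
    public static String[] naiveSplit(String line) {
        return line.split(",");
    }

    /** Returns the NumberFormatException message for the token, or null if it parses. */
    public static String parseFailure(String token) {
        try {
            Integer.parseInt(token);
            return null;
        } catch (NumberFormatException e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) {
        String[] parts = naiveSplit("\"3L: Language, Linguistics, Literature\"");
        // The single quoted entry is broken into three fragments.
        for (String part : parts) {
            System.out.println("[" + part + "]");
        }
        // The middle fragment keeps its leading space and is not a number.
        System.out.println(parseFailure(parts[1]));
    }
}
```

The middle fragment, with its leading space, matches the input string reported in the exception above.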

Expected behavior

The dictionary loads normally.

System information

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu 22.04.1 LTS
  • HanLP version: com.hankcs:hanlp:portable-1.8.3
  • I've completed this form and searched the web for solutions.
