Closed
Description
Describe the bug
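When a CSV file is used as a dictionary, loading fails if some entries contain commas.
Judging from the code, HanLP simply splits each line on commas and does not handle CSV escaping.
In CSV, a field that contains a " or , character is escaped by wrapping that field in double quotes ("").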
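For illustration, a quote-aware line splitter could look like the sketch below. `CsvLineSplitter` and its `split` method are hypothetical names, not part of HanLP, and the sketch assumes each record sits on a single line (quoted fields that span several lines are not handled).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of HanLP: splits one CSV line into fields,
// honoring double-quoted fields and treating "" inside quotes as a literal quote.
final class CsvLineSplitter {
    static List<String> split(String line) {
        List<String> fields = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (inQuotes) {
                if (c == '"') {
                    if (i + 1 < line.length() && line.charAt(i + 1) == '"') {
                        current.append('"'); // escaped quote inside a quoted field
                        i++;                 // skip the second quote of the pair
                    } else {
                        inQuotes = false;    // closing quote of the field
                    }
                } else {
                    current.append(c);       // commas are plain text while quoted
                }
            } else if (c == '"') {
                inQuotes = true;             // opening quote of a field
            } else if (c == ',') {
                fields.add(current.toString()); // unquoted comma ends the field
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        fields.add(current.toString());      // last field has no trailing comma
        return fields;
    }

    public static void main(String[] args) {
        List<String> fields = split("\"3L: Language, Linguistics, Literature\"");
        System.out.println(fields.size());   // prints 1
        System.out.println(fields.get(0));   // prints 3L: Language, Linguistics, Literature
    }
}
```

With a splitter like this, the quoted entry from the reproduction file stays a single field instead of being broken into three.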
Code to reproduce the issue
Save the following text directly as a csv file and load it as a dictionary.
19th century music
20 century British history
21st Century Music
21st century science & technology
2D Materials
3 Biotech
3D Printing and Additive Manufacturing
3D Printing in Medicine
3D Research
"3L: Language, Linguistics, Literature"
Describe the current behavior
Exception in thread "main" java.lang.NumberFormatException: For input string: " Linguistics"
at java.base/java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
at java.base/java.lang.Integer.parseInt(Integer.java:638)
at java.base/java.lang.Integer.parseInt(Integer.java:770)
at com.hankcs.hanlp.corpus.io.IOUtil.loadDictionary(IOUtil.java:794)
at com.hankcs.hanlp.corpus.io.IOUtil.loadDictionary(IOUtil.java:752)
at com.hankcs.hanlp.seg.Other.DoubleArrayTrieSegment.<init>(DoubleArrayTrieSegment.java:68)
at org.grobid.core.lexicon.DictSegmenterKt.main(DictSegmenter.kt:6)
at org.grobid.core.lexicon.DictSegmenterKt.main(DictSegmenter.kt)
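The same failure mode can be reproduced without HanLP. The sketch below is an illustration only: `NaiveSplitDemo` is a hypothetical class, and the choice to parse the second column as an integer is an assumption made to match the input string reported in the trace, not a copy of `IOUtil.loadDictionary`.

```java
// Illustration only: mimics the failure seen in the stack trace,
// it is not HanLP's actual loading code.
final class NaiveSplitDemo {
    // Splits the line on every comma (ignoring CSV quoting) and tries to parse
    // the second piece as an integer. Returns the exception message on failure,
    // or null if there is no second piece or it parses cleanly.
    static String parseSecondColumn(String line) {
        String[] parts = line.split(",");
        if (parts.length < 2) {
            return null;
        }
        try {
            Integer.parseInt(parts[1]);
            return null;
        } catch (NumberFormatException e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) {
        // The quoted entry is cut at its inner commas, so " Linguistics"
        // ends up where a number is expected.
        System.out.println(parseSecondColumn("\"3L: Language, Linguistics, Literature\""));
    }
}
```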
Expected behavior
The dictionary loads normally.
System information
- OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu 22.04.1 LTS
- HanLP version: com.hankcs:hanlp:portable-1.8.3
- I've completed this form and searched the web for solutions.