CharTable normalization is wrong for some characters #1615

@tiandiweizun


Describe the bug

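  1. An earlier issue reported that CharTable unreasonably converts "幺" to "么". That was fixed in the portable version, but the downloaded 1.7.5 zip package still has the problem. I later found that the MD5 of its CharTable.txt.bin does not match.
  2. The following characters are also normalized incorrectly. The first column is the original character and the second column is the normalized result. A character in parentheses is a suggested replacement for the current normalization result.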
    猛 勐
    蜺 霓
    脊 嵴
    骼 胳
    拾 十
    劈 噼
    溜 熘
    呱 哌
    怵 憷
    糸 纟(丝)
    乾 干
    艸 艹(草)
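
The MD5 mismatch in item 1 can be checked with a few lines of standard-library Java. This is only a minimal sketch: the class name `Md5Check` is made up for illustration, and both file paths are passed on the command line because the actual locations of the two CharTable.txt.bin copies depend on the local setup.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper, not part of HanLP.
final class Md5Check {
    /** Returns the lowercase hex MD5 digest of the given bytes. */
    public static String md5Hex(byte[] data) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            StringBuilder sb = new StringBuilder();
            for (byte b : md.digest(data)) {
                sb.append(String.format("%02x", b));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            // MD5 is required on every Java platform, so this should not happen.
            throw new IllegalStateException(e);
        }
    }

    /** Returns the lowercase hex MD5 digest of a file's contents. */
    public static String md5Hex(Path file) throws IOException {
        return md5Hex(Files.readAllBytes(file));
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 2) {
            System.err.println("usage: Md5Check <first CharTable.txt.bin> <second CharTable.txt.bin>");
            return;
        }
        String first = md5Hex(Paths.get(args[0]));
        String second = md5Hex(Paths.get(args[1]));
        System.out.println(first + "  " + args[0]);
        System.out.println(second + "  " + args[1]);
        System.out.println(first.equals(second) ? "MD5 matches" : "MD5 differs");
    }
}
```

Differing digests only show that the two files are not byte-identical. They do not show which mappings changed.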

Code to reproduce the issue
public void testCharTable() {
    // Key: input character; value: the normalization result I would expect.
    Map<String, String> normalizationBadCase = new HashMap<>();
    normalizationBadCase.put("猛", "猛");
    normalizationBadCase.put("蜺", "蜺");
    normalizationBadCase.put("脊", "脊");
    normalizationBadCase.put("骼", "骼");
    normalizationBadCase.put("拾", "拾");
    normalizationBadCase.put("劈", "劈");
    normalizationBadCase.put("溜", "溜");
    normalizationBadCase.put("呱", "呱");
    normalizationBadCase.put("怵", "怵");
    normalizationBadCase.put("糸", "丝");
    normalizationBadCase.put("乾", "乾");
    normalizationBadCase.put("艸", "草");
    for (Map.Entry<String, String> entry : normalizationBadCase.entrySet()) {
        assert CharTable.convert(entry.getKey()).equals(entry.getValue());
    }
}
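
Java `assert` statements are skipped unless the JVM is started with `-ea`, so the test above can pass silently. A variant that collects and prints every mismatch may be easier to act on. The sketch below is one possible shape: `NormalizationReport` and `mismatches` are hypothetical names, and the normalizer is passed in as a function so the example runs without HanLP. With HanLP on the classpath, `CharTable::convert` could presumably be passed instead of the stand-in.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.UnaryOperator;

// Hypothetical helper, not part of HanLP.
final class NormalizationReport {
    /**
     * Returns input -> actual output for every entry whose actual
     * normalization result differs from the expected one.
     */
    public static Map<String, String> mismatches(Map<String, String> expected,
                                                 UnaryOperator<String> normalizer) {
        Map<String, String> bad = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : expected.entrySet()) {
            String actual = normalizer.apply(e.getKey());
            if (!e.getValue().equals(actual)) {
                bad.put(e.getKey(), actual);
            }
        }
        return bad;
    }

    public static void main(String[] args) {
        Map<String, String> expected = new LinkedHashMap<>();
        expected.put("猛", "猛");
        expected.put("糸", "丝");
        // Stand-in normalizer that mimics two of the reported mappings.
        // It is an assumption for illustration only, not HanLP's real table.
        UnaryOperator<String> standIn = s -> s.replace('猛', '勐').replace('糸', '纟');
        mismatches(expected, standIn)
                .forEach((in, out) -> System.out.println(in + " -> " + out));
    }
}
```

Returning the mismatches as a map keeps the check independent of any test framework and of the `-ea` flag.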

System information

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): win10
  • Python version:
  • HanLP version: 1.8.0
  • I've completed this form and searched the web for solutions.
