Crash when there are parentheses among tokens #65

@hristost

The code below:

import nltk
import supar

sent = 'Supar (in particular, the tree binarization) crashes when parentheses are present in input.'

parser = supar.Parser.load('crf-con-en')

tokens = [nltk.word_tokenize(sent)]
parsed = parser.predict(tokens).sentences

print(parsed)

crashes when there are parentheses in the input. The backtrace is as follows:

2021-04-12 23:08:45 INFO Loading the data
Traceback (most recent call last):                           
  File "paren.py", line 9, in <module>
    parsed = parser.predict(tokens).sentences
  File "/usr/local/lib/python3.8/site-packages/supar/parsers/crf_constituency.py", line 129, in predict
    return super().predict(**Config().update(locals()))
  File "/usr/local/lib/python3.8/site-packages/supar/parsers/parser.py", line 129, in predict
    dataset = Dataset(self.transform, data)
  File "/usr/local/lib/python3.8/site-packages/supar/utils/data.py", line 38, in __init__
    self.sentences = transform.load(data, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/supar/utils/transform.py", line 656, in load
    sentences.append(TreeSentence(self, tree))
  File "/usr/local/lib/python3.8/site-packages/supar/utils/transform.py", line 681, in __init__
    Tree.factorize(Tree.binarize(tree)[0])]
  File "/usr/local/lib/python3.8/site-packages/supar/utils/transform.py", line 520, in binarize
    if not isinstance(child[0], nltk.Tree):
  File "/usr/local/lib/python3.8/site-packages/nltk-3.5-py3.8.egg/nltk/tree.py", line 162, in __getitem__
    return list.__getitem__(self, index)
IndexError: list index out of range

This seems related to #64 and #59. I tried editing the code so that it does not index into empty lists, but I ended up either discarding tokens or triggering an error inside nltk.Tree.collapse_unary.

When I replaced the parentheses with square brackets ("Supar [the parser] crashes when parentheses are present in input."), parsing worked but produced a strange result, with "the parser" moved inside the verb phrase:

(TOP
 (S
  (NP
   (_ Supar))
  (VP
   (_
    [)
    (SBAR
     (S
      (NP
       (NP
        (_ the)
        (_ parser))
       (_]))
     (VP
      (_ crashes)
      (SBAR
       (WHADVP
        (_ when))
       (S
        (NP
         (_ parentheses))
        (VP
         (_ are)
         (ADJP
          (_ present)
          (PP
           (_ in)
           (NP
            (_ input)))))))))))
  (_.)))

Has the model been trained on text that uses parentheses? Or are we expected to strip out text inside parentheses and parse it separately?
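
One workaround I have not verified against SuPar would be to escape bracket tokens the way the Penn Treebank does (-LRB-, -RRB- and so on) before calling predict, then map them back in the output. Whether crf-con-en was trained on escaped brackets is an assumption on my part, and the helper names below are hypothetical, not part of SuPar or NLTK. A minimal sketch:

```python
# Penn Treebank-style bracket escapes.
# Assumption: the pretrained model expects these forms; this is not confirmed.
PTB_ESCAPES = {
    "(": "-LRB-", ")": "-RRB-",
    "[": "-LSB-", "]": "-RSB-",
    "{": "-LCB-", "}": "-RCB-",
}
PTB_UNESCAPES = {escaped: raw for raw, escaped in PTB_ESCAPES.items()}


def escape_brackets(tokens):
    """Replace bracket tokens with their PTB escape; leave other tokens alone."""
    return [PTB_ESCAPES.get(token, token) for token in tokens]


def unescape_brackets(tokens):
    """Inverse of escape_brackets, for restoring the original tokens."""
    return [PTB_UNESCAPES.get(token, token) for token in tokens]


tokens = ["Supar", "(", "the", "parser", ")", "crashes", "."]
escaped = escape_brackets(tokens)
print(escaped)
```

The escaped list would then be passed in place of the raw tokens, e.g. parser.predict([escaped]), so that no token consists of a bare parenthesis when the tree is built.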
