Crash when there are parentheses among tokens

The code below:

```python
import nltk
import supar

sent = 'Supar (in particular, the tree binarization) crashes when parentheses are present in input.'

parser = supar.Parser.load('crf-con-en')

tokens = [nltk.word_tokenize(sent)]
parsed = parser.predict(tokens).sentences

print(parsed)

```

crashes when there are parentheses in the input. The backtrace is as follows:
```
2021-04-12 23:08:45 INFO Loading the data
Traceback (most recent call last):
  File "paren.py", line 9, in <module>
    parsed = parser.predict(tokens).sentences
  File "/usr/local/lib/python3.8/site-packages/supar/parsers/crf_constituency.py", line 129, in predict
    return super().predict(**Config().update(locals()))
  File "/usr/local/lib/python3.8/site-packages/supar/parsers/parser.py", line 129, in predict
    dataset = Dataset(self.transform, data)
  File "/usr/local/lib/python3.8/site-packages/supar/utils/data.py", line 38, in __init__
    self.sentences = transform.load(data, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/supar/utils/transform.py", line 656, in load
    sentences.append(TreeSentence(self, tree))
  File "/usr/local/lib/python3.8/site-packages/supar/utils/transform.py", line 681, in __init__
    Tree.factorize(Tree.binarize(tree)[0])]
  File "/usr/local/lib/python3.8/site-packages/supar/utils/transform.py", line 520, in binarize
    if not isinstance(child[0], nltk.Tree):
  File "/usr/local/lib/python3.8/site-packages/nltk-3.5-py3.8.egg/nltk/tree.py", line 162, in __getitem__
    return list.__getitem__(self, index)
IndexError: list index out of range
```

This seems related to #64 and #59. I tried editing the code to avoid indexing into empty lists, but I ended up either discarding tokens or triggering an error inside `nltk.Tree.collapse_unary`.

When I replaced the parentheses with square brackets (`Supar [the parser] crashes when parentheses are present in input.`), parsing worked but produced a weird result: "the parser" was moved inside the verb phrase:

```
(TOP
 (S
  (NP
   (_ Supar))
  (VP
   (_
    [)
    (SBAR
     (S
      (NP
       (NP
	(_ the)
	(_ parser))
       (_]))
     (VP
      (_ crashes)
      (SBAR
       (WHADVP
	(_ when))
       (S
	(NP
	 (_ parentheses))
	(VP
	 (_ are)
	 (ADJP
	  (_ present)
	  (PP
	   (_ in)
	   (NP
	    (_ input)))))))))))
  (_.)))
```

Has the model been trained on text that uses parentheses? Or are we expected to strip out text inside parentheses and parse it separately?
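
One workaround I can think of, as a sketch only: Penn Treebank-style data conventionally writes brackets as escape tokens (`-LRB-`, `-RRB-`, and so on), so escaping the tokens before prediction might avoid handing literal parentheses to the tree-building code. The helper names below (`escape_brackets`, `unescape_brackets`) are hypothetical and not part of supar or nltk. I am also assuming, without having verified it, that the model was trained on data using this convention.

```python
# Hypothetical helpers (not part of supar or nltk).
# The mapping follows the Penn Treebank bracket-escaping convention.
PTB_ESCAPES = {
    "(": "-LRB-", ")": "-RRB-",
    "[": "-LSB-", "]": "-RSB-",
    "{": "-LCB-", "}": "-RCB-",
}
PTB_UNESCAPES = {escaped: raw for raw, escaped in PTB_ESCAPES.items()}


def escape_brackets(tokens):
    """Replace bracket tokens with their PTB escape tokens."""
    return [PTB_ESCAPES.get(token, token) for token in tokens]


def unescape_brackets(tokens):
    """Map PTB escape tokens back to the original bracket characters."""
    return [PTB_UNESCAPES.get(token, token) for token in tokens]


tokens = ["Supar", "(", "the", "parser", ")", "crashes", "."]
escaped = escape_brackets(tokens)
print(escaped)
# ['Supar', '-LRB-', 'the', 'parser', '-RRB-', 'crashes', '.']
```

With the snippet from the top of this issue, that would mean calling `parser.predict([escape_brackets(nltk.word_tokenize(sent))])` and mapping the leaves back afterwards. I have not confirmed that this avoids the crash or gives a sensible parse.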
