Closed
Description
The code below:
import nltk
import supar
sent = 'Supar (in particular, the tree binarization) crashes when parentheses are present in input.'
parser = supar.Parser.load('crf-con-en')
tokens = [nltk.word_tokenize(sent)]
parsed = parser.predict(tokens).sentences
print(parsed)
crashes when there are parentheses in the input. Backtrace as follows:
2021-04-12 23:08:45 INFO Loading the data
Traceback (most recent call last):
  File "paren.py", line 9, in <module>
    parsed = parser.predict(tokens).sentences
  File "/usr/local/lib/python3.8/site-packages/supar/parsers/crf_constituency.py", line 129, in predict
    return super().predict(**Config().update(locals()))
  File "/usr/local/lib/python3.8/site-packages/supar/parsers/parser.py", line 129, in predict
    dataset = Dataset(self.transform, data)
  File "/usr/local/lib/python3.8/site-packages/supar/utils/data.py", line 38, in __init__
    self.sentences = transform.load(data, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/supar/utils/transform.py", line 656, in load
    sentences.append(TreeSentence(self, tree))
  File "/usr/local/lib/python3.8/site-packages/supar/utils/transform.py", line 681, in __init__
    Tree.factorize(Tree.binarize(tree)[0])]
  File "/usr/local/lib/python3.8/site-packages/supar/utils/transform.py", line 520, in binarize
    if not isinstance(child[0], nltk.Tree):
  File "/usr/local/lib/python3.8/site-packages/nltk-3.5-py3.8.egg/nltk/tree.py", line 162, in __getitem__
    return list.__getitem__(self, index)
IndexError: list index out of range
This seems related to #64 and #59. I tried editing the code to not index into empty arrays, but I ended up either discarding tokens or triggering an error inside nltk.Tree.collapse_unary.
When I replaced the parentheses with square brackets ("Supar [the parser] crashes when parentheses are present in input."), parsing worked but produced a weird result: "the parser" moved inside the verb phrase:
(TOP
  (S
    (NP (_ Supar))
    (VP
      (_ [)
      (SBAR
        (S
          (NP (NP (_ the) (_ parser)) (_ ]))
          (VP
            (_ crashes)
            (SBAR
              (WHADVP (_ when))
              (S
                (NP (_ parentheses))
                (VP
                  (_ are)
                  (ADJP
                    (_ present)
                    (PP (_ in) (NP (_ input)))))))))))
    (_ .)))
Has the model been trained on text that uses parentheses? Or are we expected to strip out text inside parentheses and parse it separately?
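One workaround that might be worth trying, sketched here under an assumption I have not verified: if the model was trained on Penn Treebank-style data, parentheses would appear there as the escaped tokens -LRB- and -RRB- rather than as literal characters. The helper names below (`escape_brackets`, `unescape_brackets`) are hypothetical and not part of supar or nltk; they only rewrite the token list before and after parsing.

```python
# Hypothetical helpers, not part of supar or nltk.
# Assumption: the parser expects Penn Treebank-style bracket tokens.
PTB_BRACKETS = {"(": "-LRB-", ")": "-RRB-"}
PTB_BRACKETS_INV = {escaped: raw for raw, escaped in PTB_BRACKETS.items()}


def escape_brackets(tokens):
    """Replace literal parentheses with their PTB escape tokens."""
    return [PTB_BRACKETS.get(token, token) for token in tokens]


def unescape_brackets(tokens):
    """Map PTB escape tokens back to literal parentheses."""
    return [PTB_BRACKETS_INV.get(token, token) for token in tokens]


tokens = ["Supar", "(", "the", "parser", ")", "crashes", "."]
escaped = escape_brackets(tokens)
print(escaped)
print(unescape_brackets(escaped) == tokens)
```

With this in place, the call in the script above would become `parser.predict([escape_brackets(nltk.word_tokenize(sent))])`. That keeps literal parentheses out of the token list; whether it also gives sensible parses depends on the training data, which is the open question here.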