-
Notifications
You must be signed in to change notification settings - Fork 248
Termite spectral sort #295
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Termite spectral sort #295
Conversation
Hi @tbsexton , thank you tons for the PR! This is an old part of the code base that I've not given thought to in years; as such, it's going to take me a bit to dig back into it. Apologies in advance for the belated review. 😌 |
No worries! I noticed some of the data interfaces were a bit out of sync with the other structures in the package. I'm not tied to dataframes, but it seemed to match well with standard viz practice in other places (i.e. seaborn, holoviews, etc.) and honestly made it easier to work with. Happy to discuss other ways we might implement the spectral sort (i.e. not a wrapper function)...this was just the first pass at something that wasn't going to cause backward-incompatibilities for you! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi again! I have a few comments and suggestions, but the core functionality seems good (such as I could follow). And the example plot you included looks nice. Thanks very much for submitting it.
Digging into this code was basically an archeological dig — I really, really have to give viz
some love. Thank you for the constructive reminder. :)
""" | ||
Make a "termite" plot for assessing topic models using a tabular layout | ||
to promote comparison of terms both within and across topics. | ||
Args: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: please add a line of whitespace between each of the docstring's sections
textacy/viz/termite.py
Outdated
elif len(highlight_topics) > 6: | ||
raise ValueError("no more than 6 topics may be highlighted at once") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this a strict technical requirement, or just an aesthetic preference?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is actually a hold-over from your function (see line 138).
Its presumably because you hard-code the available colors, so additional complementary colors for highlighting is unavailable.
I honestly would avoid this alltogether and adopt a seaborn
-esque approach: the function accepts a list of highlighted features/columns (+rows?), and a colormap/iterator that gets applied to them. Let the user worry about which those are, as long as we apply a decent default.
textacy/viz/termite.py
Outdated
# # substract minimum of sim mat in order to keep sim mat nonnegative | ||
# # UNNECESSARY | ||
# topic_term_weights_sim = ( | ||
# topic_term_weights_sim - topic_term_weights_sim.min() | ||
# ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what's all this about?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
code was refactored into the pipe
statement above. Forgot to remove!
component_filter = (components.loc[ | ||
components | ||
.agg(rank_terms_by, axis=1) | ||
.sort_values(ascending=False) | ||
.iloc[:n_terms] | ||
.index | ||
]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
does this work if components
is a sparse matrix, as described in the docstring?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
AH I had meant to write a check to see if a sparse mat was passed, but didn't want to reverse engineer the expected matrices from your function.
I highly recommend interfacing with plots by exploiting pandas. To be clear, pandas sparse dtypes will work. And if a csr mat is passed (and we check for shape compatibility) we can load it up into a dataframe automatically.
Ill see if some of the tests I write can cover the bases for columnar dtypes (see the new String ant Sparse types pandas provides in their docs).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
for hc in highlight_cols: | ||
if max_tick_val > 0 and values_mat[i, hc] == max_tick_val: | ||
ticklabel.set_color(highlight_colors[hc][1]) | ||
|
||
ax.get_xaxis().set_ticks_position("top") | ||
_ = ax.set_xticks(range(n_cols)) | ||
xticklabels = ax.set_xticklabels( | ||
col_labels, fontsize=14, color="gray", rotation=30, ha="left" | ||
col_labels, fontsize=14, color="gray", rotation=-60, ha="right" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
YES
Oh, one other question: How much work would it be to write a couple basic tests? It would be great to know if I accidentally break something when I start mucking around in this part of the code base again... |
Co-Authored-By: Burton DeWilde <burtdewilde@gmail.com>
textacy/viz/termite.py
Outdated
dx = 10/72.; dy = 0/72. | ||
dx = 10 / 72 | ||
dy = 0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@bdewilde this is for the angled text. If matplotlib only receives a rotation command, the rotation "anchor" is the corner of the textbox, which includes whitespace. That means, without shifting, the axis tickmarks will be misaligned with the last letters.
Notice in the example image the tickmarks are directly below the final letters.
@bdewilde re:tests I should be able to. We might have to iterate a bit to scope the testing correctly, and I don't have loads of free time atm. But will get around to an initial pass! |
preferred over if block Co-Authored-By: Burton DeWilde <burtdewilde@gmail.com>
Hi @tbsexton , I'm going to merge this in. I've been neglecting |
my pleasure! I'll think a bit about possible ways to clean up some viz stuff.... honestly given the scope of textacy, integrating more with something model-based like holoviews could make more advanced viz with modularity easier to design around in the long term. |
Adds a data-frame-based function to create a "Termite Plot" using spectral seriation as outlined in the original paper.
Description
Spectral seriation added as a term-ordering technique, matching the paper by Chuang et al. Doing this required some aggregation and filtering logic that was much more succinct via Pandas.
New function
termite_df_plot
is essentially a wrapper around the originaldraw_termite_plot
that offloads a lot of the logic for aggregation and sorting to Pandas, since it requires a dataframe as input.Seriation
Assuming "seriation" is passed as the term-sorting option, the "magic" is to use the feidler vector as an ordering, directly on the top-ranked terms (as determined by the rank_terms_by option). Given a filtered doc-topic
component_filter
dataframe, this looks like:This is an excerpt from the new function.
Aesthetics
Minor changes to the original
draw_termite_plot
set defaults that are amenable to two-column academic papers (e.g. avoiding overhang with column labels leaning left instead of right).Motivation and Context
This was written as part of a paper to provide topic model visualizations to an engineering community. Textacy seemed to be one of the only modern libraries to provide easy access to termite plots (which have proven incredibly useful to explain topic models), but the current implementation did not include the key seriation technique that really makes them powerful.
Was waiting to finish the manuscript acceptance process before submitting this code.
How Has This Been Tested?
In production of this paper, the code was tested across multiple iterations for figures.
I do not see a
test_viz.py
, so the pytest suite was skipped. Almost no original functionality was altered.Screenshots (if appropriate):
Example output using seriation.
termite.pdf
Types of changes
Checklist: