rtbs-dev
Contributor

Adds a data-frame-based function to create a "Termite Plot" using spectral seriation as outlined in the original paper.

Description

Spectral seriation added as a term-ordering technique, matching the paper by Chuang et al. Doing this required some aggregation and filtering logic that was much more succinct via Pandas.

New function termite_df_plot is essentially a wrapper around the original draw_termite_plot that offloads a lot of the logic for aggregation and sorting to Pandas, since it requires a dataframe as input.

Seriation

Assuming "seriation" is passed as the term-sorting option, the "magic" is to use the Fiedler vector (the eigenvector for the second-smallest eigenvalue of the graph Laplacian) as an ordering, directly on the top-ranked terms (as determined by the rank_terms_by option). Given a filtered doc-topic component_filter dataframe, this looks like:

import numpy as np

# calculate the similarity matrix, shifted so that all entries are nonnegative
similarity = (
    (component_filter @ component_filter.T)  # parenthesized so .pipe applies to the product
    .pipe(lambda df: df - df.min().min())
).values
# compute the graph Laplacian and its 2nd eigenvector (the Fiedler vector)
L = np.diag(similarity.sum(axis=1)) - similarity
D, V = np.linalg.eigh(L)
# sort eigenpairs with one shared index (eigh already returns ascending order)
order = np.argsort(D)
D, V = D[order], V[:, order]
fiedler = V[:, 1]

# get permutation corresponding to sorting the 2nd eigenvector
component_filter = component_filter.reindex(
    index=[
        component_filter.index[i]
        for i in np.argsort(fiedler)
    ],
)

This is an excerpt from the new function.
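
For readers who want to try the idea in isolation, here is a minimal, self-contained sketch of the same seriation step on a toy term-by-topic dataframe. The function name seriate_terms and the toy weights are illustrative assumptions, not part of the PR; the sign of an eigenvector is arbitrary, so the resulting order may come out reversed.

```python
import numpy as np
import pandas as pd


def seriate_terms(components: pd.DataFrame) -> pd.DataFrame:
    """Reorder rows (terms) by the Fiedler vector of their similarity graph.

    Hypothetical helper for illustration; not the function added in the PR.
    """
    values = components.to_numpy(dtype=float)
    # term-term similarity, shifted so that every entry is nonnegative
    sim = values @ values.T
    sim = sim - sim.min()
    # unnormalized graph Laplacian: degree matrix minus similarity
    laplacian = np.diag(sim.sum(axis=1)) - sim
    # eigh returns eigenvalues in ascending order for symmetric input,
    # so column 1 is the eigenvector of the second-smallest eigenvalue
    _, vecs = np.linalg.eigh(laplacian)
    fiedler = vecs[:, 1]
    return components.iloc[np.argsort(fiedler)]


# toy data (assumed): two terms load on each topic, listed interleaved
toy = pd.DataFrame(
    [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [0.1, 0.9]],
    index=["a", "b", "c", "d"],
    columns=["topic_0", "topic_1"],
)
ordered = seriate_terms(toy)
print(list(ordered.index))
```

After sorting, terms that load on the same topic end up next to each other, which is the property that makes the termite plot easier to read.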

Aesthetics

Minor changes to the original draw_termite_plot set defaults that are amenable to two-column academic papers (e.g. avoiding overhang with column labels leaning left instead of right).

Motivation and Context

This was written as part of a paper to provide topic model visualizations to an engineering community. Textacy seemed to be one of the only modern libraries to provide easy access to termite plots (which have proven incredibly useful to explain topic models), but the current implementation did not include the key seriation technique that really makes them powerful.

Was waiting to finish the manuscript acceptance process before submitting this code.

How Has This Been Tested?

In production of this paper, the code was tested across multiple iterations for figures.

I do not see a test_viz.py, so the pytest suite was skipped. Almost no original functionality was altered.

Screenshots (if appropriate):

Example output using seriation.

termite.pdf

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)

Checklist:

  • My code follows the code style of this project.
  • My change requires a change to the documentation, and I have updated it accordingly.

@bdewilde
Collaborator

bdewilde commented Apr 2, 2020

Hi @tbsexton, thank you tons for the PR! This is an old part of the code base that I've not given thought to in years; as such, it's going to take me a bit to dig back into it. Apologies in advance for the belated review. 😌

@rtbs-dev
Contributor Author

rtbs-dev commented Apr 2, 2020

No worries! I noticed some of the data interfaces were a bit out of sync with the other structures in the package. I'm not tied to dataframes, but it seemed to match well with standard viz practice in other places (e.g. seaborn, holoviews) and honestly made it easier to work with.

Happy to discuss other ways we might implement the spectral sort (i.e. not a wrapper function). This was just the first pass at something that wasn't going to cause backward incompatibilities for you!

Collaborator

@bdewilde bdewilde left a comment

Hi again! I have a few comments and suggestions, but the core functionality seems good (such as I could follow). And the example plot you included looks nice. Thanks very much for submitting it.

Digging into this code was basically an archeological dig — I really, really have to give viz some love. Thank you for the constructive reminder. :)

"""
Make a "termite" plot for assessing topic models using a tabular layout
to promote comparison of terms both within and across topics.
Args:
Collaborator

nit: please add a line of whitespace between each of the docstring's sections

Comment on lines 268 to 269
elif len(highlight_topics) > 6:
    raise ValueError("no more than 6 topics may be highlighted at once")
Collaborator

is this a strict technical requirement, or just an aesthetic preference?

Contributor Author

this is actually a hold-over from your function (see line 138).

It's presumably because you hard-code the available colors, so additional complementary colors for highlighting are unavailable.

I honestly would avoid this altogether and adopt a seaborn-esque approach: the function accepts a list of highlighted features/columns (+rows?), and a colormap/iterator that gets applied to them. Let the user worry about which those are, as long as we apply a decent default.
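
To make the suggestion concrete, here is a rough sketch of what such an interface could look like. Everything in it is hypothetical: assign_highlight_colors and DEFAULT_PALETTE are illustrative names, and the hex values are placeholders rather than the colors textacy hard-codes.

```python
from itertools import cycle
from typing import Dict, Hashable, Iterable, Optional, Sequence

# placeholder default palette (assumed); not textacy's actual color list
DEFAULT_PALETTE = ("#1f77b4", "#ff7f0e", "#2ca02c", "#d62728")


def assign_highlight_colors(
    highlight_cols: Sequence[Hashable],
    palette: Optional[Iterable[str]] = None,
) -> Dict[Hashable, str]:
    """Map each highlighted column to a color, cycling if the palette runs out."""
    colors = cycle(DEFAULT_PALETTE if palette is None else palette)
    # zip stops at the end of highlight_cols, so any number of columns works
    return dict(zip(highlight_cols, colors))


# default palette, with more columns than colors
five = assign_highlight_colors(["t0", "t1", "t2", "t3", "t4"])
# user-supplied palette
custom = assign_highlight_colors(["t0", "t1"], palette=["black", "red"])
```

Because the palette is cycled, there is no need for a hard limit on the number of highlighted topics; whether repeated colors are acceptable is left to the caller.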

Comment on lines 299 to 303
# # substract minimum of sim mat in order to keep sim mat nonnegative
# # UNNECESSARY
# topic_term_weights_sim = (
# topic_term_weights_sim - topic_term_weights_sim.min()
# )
Collaborator

what's all this about?

Contributor Author

code was refactored into the pipe statement above. Forgot to remove!

Comment on lines +280 to +286
component_filter = (components.loc[
    components
    .agg(rank_terms_by, axis=1)
    .sort_values(ascending=False)
    .iloc[:n_terms]
    .index
])
Collaborator

does this work if components is a sparse matrix, as described in the docstring?

Contributor Author

AH I had meant to write a check to see if a sparse mat was passed, but didn't want to reverse engineer the expected matrices from your function.

I highly recommend interfacing with plots by exploiting pandas. To be clear, pandas sparse dtypes will work. And if a csr mat is passed (and we check for shape compatibility) we can load it up into a dataframe automatically.

I'll see if some of the tests I write can cover the bases for columnar dtypes (see the new String and Sparse types pandas provides in their docs).
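
As a rough illustration of the check described here, the sketch below accepts a dataframe, a dense array, or a scipy sparse matrix and returns a term-by-topic dataframe. The helper name to_components_frame and its terms/topic_labels parameters are assumptions made for this example, not names from the PR; pd.DataFrame.sparse.from_spmatrix is the pandas constructor for scipy sparse input.

```python
import pandas as pd
from scipy import sparse


def to_components_frame(components, terms, topic_labels):
    """Return a term-by-topic DataFrame from a DataFrame, array, or sparse matrix.

    Hypothetical helper for illustration only.
    """
    if isinstance(components, pd.DataFrame):
        return components
    # check shape compatibility before attaching labels
    expected = (len(terms), len(topic_labels))
    if components.shape != expected:
        raise ValueError(f"expected shape {expected}, got {components.shape}")
    if sparse.issparse(components):
        # keeps the data in pandas sparse columns instead of densifying
        return pd.DataFrame.sparse.from_spmatrix(
            components, index=terms, columns=topic_labels
        )
    return pd.DataFrame(components, index=terms, columns=topic_labels)


mat = sparse.csr_matrix([[1.0, 0.0], [0.0, 2.0], [0.5, 0.0]])
frame = to_components_frame(mat, ["x", "y", "z"], ["t0", "t1"])
```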

Contributor Author

     for hc in highlight_cols:
         if max_tick_val > 0 and values_mat[i, hc] == max_tick_val:
             ticklabel.set_color(highlight_colors[hc][1])

 ax.get_xaxis().set_ticks_position("top")
 _ = ax.set_xticks(range(n_cols))
 xticklabels = ax.set_xticklabels(
-    col_labels, fontsize=14, color="gray", rotation=30, ha="left"
+    col_labels, fontsize=14, color="gray", rotation=-60, ha="right"
Collaborator

YES

@bdewilde
Collaborator

bdewilde commented Apr 9, 2020

Oh, one other question: How much work would it be to write a couple basic tests? It would be great to know if I accidentally break something when I start mucking around in this part of the code base again...

Co-Authored-By: Burton DeWilde <burtdewilde@gmail.com>
Comment on lines 168 to 169
dx = 10/72.; dy = 0/72.
dx = 10 / 72
dy = 0
Contributor Author

@bdewilde this is for the angled text. If matplotlib only receives a rotation command, the rotation "anchor" is the corner of the textbox, which includes whitespace. That means, without shifting, the axis tickmarks will be misaligned with the last letters.

Notice in the example image the tickmarks are directly below the final letters.
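
For context, one common way to apply such a shift in matplotlib is to add a ScaledTranslation (measured in inches via fig.dpi_scale_trans) to each tick label's transform. The sketch below assumes that approach and reuses the dx and dy values from the lines under review; the figure and labels are toy placeholders, and it may not match exactly how the PR wires the offset in.

```python
import matplotlib

matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
from matplotlib.transforms import ScaledTranslation

fig, ax = plt.subplots()
ax.xaxis.set_ticks_position("top")
ax.set_xticks(range(3))
labels = ax.set_xticklabels(
    ["topic 0", "topic 1", "topic 2"], rotation=-60, ha="right"
)

# shift every label by dx inches horizontally (72 points per inch)
dx = 10 / 72
dy = 0
offset = ScaledTranslation(dx, dy, fig.dpi_scale_trans)
for label in labels:
    label.set_transform(label.get_transform() + offset)
```

Because the offset is expressed in inches rather than data units, the alignment holds regardless of the axis limits or the figure's DPI.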

@rtbs-dev
Contributor Author

@bdewilde re:tests I should be able to. We might have to iterate a bit to scope the testing correctly, and I don't have loads of free time atm. But will get around to an initial pass!

rtbs-dev and others added 4 commits April 10, 2020 11:18
@bdewilde
Collaborator

Hi @tbsexton, I'm going to merge this in. I've been neglecting textacy while waiting on some advances in thinc and spacy, but I should probably stop dragging my feet. :) Thanks again for the PR!

@bdewilde bdewilde merged commit 5fa7b2b into chartbeat-labs:master Jul 15, 2020
@rtbs-dev
Contributor Author

rtbs-dev commented Jul 16, 2020

my pleasure! I'll think a bit about possible ways to clean up some viz stuff.... honestly given the scope of textacy, integrating more with something model-based like holoviews could make more advanced viz with modularity easier to design around in the long term.

@rtbs-dev rtbs-dev deleted the termite-spectral-sort branch July 16, 2020 13:36