Skip to content

Conversation

edemaine
Copy link
Member

@edemaine edemaine commented Dec 9, 2024

What is the previous behavior before this PR?

When rendering MathML directly to the DOM (as opposed to a string), such as on the katex.org front-page demo, consecutive digits and text characters get rendered as multiple text children of an <mn> or <mtext> elements:

image

What is the new behavior after this PR?

When rendering MathML directly to the DOM, we combine consecutive text children into one:

image

Fixes #3995 which reports that this causes screen-reading issues.


Merging of consecutive <mtext> and <mn> elements is already done here:

KaTeX/src/buildMathML.js

Lines 157 to 167 in 2d1fec9

// Concatenate adjacent <mtext>s
if (group.type === 'mtext' && lastGroup.type === 'mtext'
&& group.getAttribute('mathvariant') ===
lastGroup.getAttribute('mathvariant')) {
lastGroup.children.push(...group.children);
continue;
// Concatenate adjacent <mn>s
} else if (group.type === 'mn' && lastGroup.type === 'mn') {
lastGroup.children.push(...group.children);
continue;
// Concatenate <mn>...</mn> followed by <mi>.</mi>

And when rendering to a string, this was enough. Indeed, we don't have a great way of testing this, without a virtual DOM setup. But the inspect screenshots above confirm that this PR fixes the issue.

@ronkok
Copy link
Collaborator

ronkok commented Dec 10, 2024

The code written above will split a number like "4,200.30" into two elements, split at the comma. It will also mangle an expression like 24^3, writing the "4" and the "3" inside a <msup>, but leaving the leading "2" outside the <msup>.

In Temml, I struggled a bit and ended up with a more complex solution. My approach modified only the buildMathML.js file. After line 152, I inserted a function call:

    consolidateNumbers(expression);

Elsewhere in the same file, I added this code:

const numberRegEx = /^[0-9]$/
const isDotOrComma = (node, followingNode) => {
  return ((node.type === "textord" && node.text === ".") ||
    (node.type === "atom" && node.text === ",")) &&
    // Don't consolidate if there is a space after the comma.
    node.loc && followingNode.loc && node.loc.end === followingNode.loc.start
}
const consolidateNumbers = expression => {
  // Consolidate adjacent numbers. We want to return <mn>1,506.3</mn>,
  // not <mn>1</mn><mo>,</mo><mn>5</mn><mn>0</mn><mn>6</mn><mi>.</mi><mn>3</mn>
  if (expression.length < 2) { return }
  const nums = [];
  let inNum = false
  // Find adjacent numerals
  for (let i = 0; i < expression.length; i++) {
    const node = expression[i];
    if (node.type === "textord" && numberRegEx.test(node.text)) {
      if (!inNum) { nums.push({ start: i }) }
      inNum = true
    } else {
      if (inNum) { nums[nums.length - 1].end = i - 1 }
      inNum = false
    }
  }
  if (inNum) { nums[nums.length - 1].end = expression.length - 1 }

  // Determine if numeral groups are separated by a comma or dot.
  for (let i = nums.length - 1; i > 0; i--) {
    if (nums[i - 1].end === nums[i].start - 2 &&
      isDotOrComma(expression[nums[i].start - 1], expression[nums[i].start])) {
      // Merge the two groups.
      nums[i - 1].end = nums[i].end
      nums.splice(i, 1)
    }
  }

  // Consolidate the number nodes
  for (let i = nums.length - 1; i >= 0; i--) {
    for (let j = nums[i].start + 1; j <= nums[i].end; j++) {
      expression[nums[i].start].text += expression[j].text
    }
    expression.splice(nums[i].start + 1, nums[i].end - nums[i].start)
    // Check if the <mn> is followed by a numeric base in a supsub, e.g. the "3" in 123^4
    // If so, merge the first <mn> into the base.
    if (expression.length > nums[i].start + 1) {
      const nextTerm = expression[nums[i].start + 1];
      if (nextTerm.type === "supsub" && nextTerm.base && nextTerm.base.type === "textord" &&
          numberRegEx.test(nextTerm.base.text)) {
        nextTerm.base.text = expression[nums[i].start].text + nextTerm.base.text
        expression.splice(nums[i].start, 1)
      }
    }
  }
}

It's quite possible that parts of both solutions could be combined to form a solution that is less verbose than the Temml version, but still is correct.

@mbourne
Copy link

mbourne commented Dec 11, 2024

@ronkok Wondering if your solution caters for European number style, with periods and commas reversed, so 1.234,5, or with spaces, like 10 000,45?

@edemaine
Copy link
Member Author

I haven't grokked your code yet, but I'm a little confused about the goal. In LaTeX you need to write 1{,}000.00. If you write 1,000 you get rendering that looks like a list of two numbers. Is a braced comma what you mean by a comma?

These heuristics feel a little messy, but it seems they're necessary for going from presentation to semantics.

I also don't think they're especially related to this PR. Do you agree that the change in this PR is useful, independent of how the nodes get combined?

@ronkok
Copy link
Collaborator

ronkok commented Dec 11, 2024

@mbourne The European number style was a major motivation for how I wrote this code.

@edemaine I take your point about TeX requiring a 1{,}23 to get a separator without a space. I think Knuth made a mistake. A majority of the world uses a comma for their decimal separator. It should be a first-class option, not relegated to a workaround with extra braces. Having taken your point, I intend to modify Temml so that 1{,}23 will be written to a single <mn> element.

@ronkok
Copy link
Collaborator

ronkok commented Dec 11, 2024

These heuristics feel a little messy

I agree. I'd like to find code that is less verbose. I haven't found it yet.

Do you agree that the change in this PR is useful

It is certainly better than the status quo. I would suggest at least examining the following atom to see if it a msup or msub element. This PR, as written will not recognize the "24" in 24^2 as a single number. That seems like a pretty big semantic bust.

@edemaine edemaine changed the title fix: MathML combines multidigit numbers and multicharacter text when outputting to DOM fix: MathML combines multidigit numbers with sup/subscript, comma separators, and multicharacter text when outputting to DOM Dec 16, 2024
@edemaine
Copy link
Member Author

@ronkok I've taken a stab at handling ., {,}, and sup/subscripts. Let me know what you think! (You can try it out live in the preview.)

@ronkok
Copy link
Collaborator

ronkok commented Dec 16, 2024

Good stuff! I haven't looked at the code yet, but the preview shows me that you have:

  • handled a dot as a decimal separator well.
  • handled {,} as a decimal separator well.
  • handled a space as a thousands separator well.
  • handled a following sup or sub well.
  • handled a plain comma as I expected you to handle it. That's okay. I would have handled it differently, but I understand your reasons. You show more fidelity to TeX than I do, and given your goals, your code works as it should.

I hate to move the goal posts at this late date, but I have just now realized there is another problem. In the expression x^2.2, the "2.2" is not treated as a single <mn> element. The code needs to check after a sup or sub for a following number.

@edemaine
Copy link
Member Author

Thanks for reviewing!

  • I would have handled it differently, but I understand your reasons. You show more fidelity to TeX than I do, and given your goals, your code works as it should.

Yep, exactly.

I hate to move the goal posts at this late date, but I have just now realized there is another problem. In the expression x^2.2, the "2.2" is not treated as a single <mn> element.

I believe that that is actually correct: here's how it renders in HTML.

image

So the MathML correctly reflects the same.

x^{2.2} correctly produces a single <mn>2.2</mn> for the exponent.

@ronkok
Copy link
Collaborator

ronkok commented Dec 16, 2024

Ah yes, you're correct. Which means that all the behavior I see from this code is good.

I don't plan to do a detailed review of code style, but I think it looks good in general.

@edemaine edemaine merged commit 7d79e22 into main Dec 17, 2024
8 checks passed
@edemaine edemaine deleted the screen-reading-fix branch December 17, 2024 15:07
KaTeX-bot added a commit that referenced this pull request Dec 17, 2024
## [0.16.17](v0.16.16...v0.16.17) (2024-12-17)

### Bug Fixes

* MathML combines multidigit numbers with sup/subscript, comma separators, and multicharacter text when outputting to DOM ([#3999](#3999)) ([7d79e22](7d79e22)), closes [#3995](#3995)
@KaTeX-bot
Copy link
Member

🎉 This PR is included in version 0.16.17 🎉

The release is available on:

Your semantic-release bot 📦🚀

DennisRasey pushed a commit to DennisRasey/forgejo that referenced this pull request Jan 28, 2025
This PR contains the following updates:

| Package | Type | Update | Change |
|---|---|---|---|
| [katex](https://katex.org) ([source](https://github.com/KaTeX/KaTeX)) | dependencies | patch | [`0.16.10` -> `0.16.21`](https://renovatebot.com/diffs/npm/katex/0.16.10/0.16.21) |

---

### KaTeX \htmlData does not validate attribute names
[CVE-2025-23207](https://nvd.nist.gov/vuln/detail/CVE-2025-23207) / [GHSA-cg87-wmx4-v546](GHSA-cg87-wmx4-v546)

<details>
<summary>More information</summary>

#### Details
##### Impact
KaTeX users who render untrusted mathematical expressions with `renderToString` could encounter malicious input using `\htmlData` that runs arbitrary JavaScript, or generate invalid HTML.

##### Patches
Upgrade to KaTeX v0.16.21 to remove this vulnerability.

##### Workarounds
- Avoid use of or turn off the `trust` option, or set it to forbid `\htmlData` commands.
- Forbid inputs containing the substring `"\\htmlData"`.
- Sanitize HTML output from KaTeX.

##### Details
`\htmlData` did not validate its attribute name argument, allowing it to generate invalid or malicious HTML that runs scripts.

##### For more information
If you have any questions or comments about this advisory:

- Open an issue or security advisory in the [KaTeX repository](https://github.com/KaTeX/KaTeX/)
- Email us at [katex-security@mit.edu](mailto:katex-security@mit.edu)

#### Severity
- CVSS Score: 6.3 / 10 (Medium)
- Vector String: `CVSS:3.1/AV:N/AC:L/PR:L/UI:N/S:U/C:L/I:L/A:L`

#### References
- [https://github.com/KaTeX/KaTeX/security/advisories/GHSA-cg87-wmx4-v546](https://github.com/KaTeX/KaTeX/security/advisories/GHSA-cg87-wmx4-v546)
- [https://nvd.nist.gov/vuln/detail/CVE-2025-23207](https://nvd.nist.gov/vuln/detail/CVE-2025-23207)
- [https://github.com/KaTeX/KaTeX/commit/ff289955e81aab89086eef09254cbf88573d415c](https://github.com/KaTeX/KaTeX/commit/ff289955e81aab89086eef09254cbf88573d415c)
- [https://github.com/KaTeX/KaTeX](https://github.com/KaTeX/KaTeX)

This data is provided by [OSV](https://osv.dev/vulnerability/GHSA-cg87-wmx4-v546) and the [GitHub Advisory Database](https://github.com/github/advisory-database) ([CC-BY 4.0](https://github.com/github/advisory-database/blob/main/LICENSE.md)).
</details>

---

### Release Notes

<details>
<summary>KaTeX/KaTeX (katex)</summary>

### [`v0.16.21`](https://github.com/KaTeX/KaTeX/blob/HEAD/CHANGELOG.md#01621-2025-01-17)

[Compare Source](KaTeX/KaTeX@v0.16.20...v0.16.21)

##### Bug Fixes

-   escape \htmlData attribute name ([57914ad](KaTeX/KaTeX@57914ad))

### [`v0.16.20`](https://github.com/KaTeX/KaTeX/blob/HEAD/CHANGELOG.md#01620-2025-01-12)

[Compare Source](KaTeX/KaTeX@v0.16.19...v0.16.20)

##### Bug Fixes

-   \providecommand does not overwrite existing macro ([#&#8203;4000](KaTeX/KaTeX#4000)) ([6d30fe4](KaTeX/KaTeX@6d30fe4)), closes [#&#8203;3928](KaTeX/KaTeX#3928)

### [`v0.16.19`](https://github.com/KaTeX/KaTeX/blob/HEAD/CHANGELOG.md#01619-2024-12-29)

[Compare Source](KaTeX/KaTeX@v0.16.18...v0.16.19)

##### Bug Fixes

-   **types:** improve `strict` function type ([#&#8203;4009](KaTeX/KaTeX#4009)) ([4228b4e](KaTeX/KaTeX@4228b4e))

### [`v0.16.18`](https://github.com/KaTeX/KaTeX/blob/HEAD/CHANGELOG.md#01618-2024-12-18)

[Compare Source](KaTeX/KaTeX@v0.16.17...v0.16.18)

##### Bug Fixes

-   Actually publish TypeScript type definitions ([#&#8203;4008](KaTeX/KaTeX#4008)) ([629b873](KaTeX/KaTeX@629b873))

### [`v0.16.17`](https://github.com/KaTeX/KaTeX/blob/HEAD/CHANGELOG.md#01617-2024-12-17)

[Compare Source](KaTeX/KaTeX@v0.16.16...v0.16.17)

##### Bug Fixes

-   MathML combines multidigit numbers with sup/subscript, comma separators, and multicharacter text when outputting to DOM ([#&#8203;3999](KaTeX/KaTeX#3999)) ([7d79e22](KaTeX/KaTeX@7d79e22)), closes [#&#8203;3995](KaTeX/KaTeX#3995)

### [`v0.16.16`](https://github.com/KaTeX/KaTeX/blob/HEAD/CHANGELOG.md#01616-2024-12-17)

[Compare Source](KaTeX/KaTeX@v0.16.15...v0.16.16)

##### Features

-   ESM exports, TypeScript types ([#&#8203;3992](KaTeX/KaTeX#3992)) ([ea9c173](KaTeX/KaTeX@ea9c173))

### [`v0.16.15`](https://github.com/KaTeX/KaTeX/blob/HEAD/CHANGELOG.md#01615-2024-12-09)

[Compare Source](KaTeX/KaTeX@v0.16.14...v0.16.15)

##### Features

-   italic sans-serif in math mode via `\mathsfit` command ([#&#8203;3998](KaTeX/KaTeX#3998)) ([2218901](KaTeX/KaTeX@2218901))

### [`v0.16.14`](https://github.com/KaTeX/KaTeX/blob/HEAD/CHANGELOG.md#01614-2024-12-08)

[Compare Source](KaTeX/KaTeX@v0.16.13...v0.16.14)

##### Features

-   \dddot and \ddddot support ([#&#8203;3834](KaTeX/KaTeX#3834)) ([bda35cd](KaTeX/KaTeX@bda35cd)), closes [#&#8203;2744](KaTeX/KaTeX#2744)

### [`v0.16.13`](https://github.com/KaTeX/KaTeX/blob/HEAD/CHANGELOG.md#01613-2024-12-08)

[Compare Source](KaTeX/KaTeX@v0.16.12...v0.16.13)

##### Bug Fixes

-   `\vdots` and `\rule` support in text mode ([#&#8203;3997](KaTeX/KaTeX#3997)) ([0e08352](KaTeX/KaTeX@0e08352)), closes [#&#8203;3990](KaTeX/KaTeX#3990)

### [`v0.16.12`](https://github.com/KaTeX/KaTeX/blob/HEAD/CHANGELOG.md#01612-2024-12-08)

[Compare Source](KaTeX/KaTeX@v0.16.11...v0.16.12)

##### Features

-   **css:** configurable margin for display math ([#&#8203;3638](KaTeX/KaTeX#3638)) ([3405001](KaTeX/KaTeX@3405001))

### [`v0.16.11`](https://github.com/KaTeX/KaTeX/blob/HEAD/CHANGELOG.md#01611-2024-07-02)

[Compare Source](KaTeX/KaTeX@v0.16.10...v0.16.11)

##### Features

-   add \emph ([#&#8203;3963](KaTeX/KaTeX#3963)) ([9f34da4](KaTeX/KaTeX@9f34da4)), closes [#&#8203;3566](KaTeX/KaTeX#3566)

</details>

---

### Configuration

📅 **Schedule**: Branch creation - "" (UTC), Automerge - "* 0-3 * * *" (UTC).

🚦 **Automerge**: Disabled by config. Please merge this manually once you are satisfied.

♻ **Rebasing**: Whenever PR becomes conflicted, or you tick the rebase/retry checkbox.

🔕 **Ignore**: Close this PR and you won't be reminded about this update again.

---

 - [ ] <!-- rebase-check -->If you want to rebase/retry this PR, check this box

---

This PR has been generated by [Renovate Bot](https://github.com/renovatebot/renovate).
<!--renovate-debug:eyJjcmVhdGVkSW5WZXIiOiIzOS4xMzYuMCIsInVwZGF0ZWRJblZlciI6IjM5LjEzNi4wIiwidGFyZ2V0QnJhbmNoIjoidjcuMC9mb3JnZWpvIiwibGFiZWxzIjpbImRlcGVuZGVuY3ktdXBncmFkZSIsInRlc3Qvbm90LW5lZWRlZCJdfQ==-->

Reviewed-on: https://codeberg.org/forgejo/forgejo/pulls/6693
Reviewed-by: Gusted <gusted@noreply.codeberg.org>
Co-authored-by: Renovate Bot <forgejo-renovate-action@forgejo.org>
Co-committed-by: Renovate Bot <forgejo-renovate-action@forgejo.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Multidigit numbers aren't read out by the voice over screen reader
4 participants