@phiresky phiresky commented Mar 30, 2023

This (kinda dirty) PR implements outputting a pandoc JSON AST from the CLI instead of PDF.

That means it indirectly allows outputting all kinds of formats such as Markdown, HTML, ePub, docx.

The pandoc AST is not infinitely powerful so this conversion has a fair bit of information loss, but in exchange it gives access to many different output formats without specific implementations.
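
For readers unfamiliar with the format, here is a minimal sketch of what a pandoc JSON AST looks like, built by hand in Python. It is not the code from this PR (which lives inside typst). The helper names are made up for illustration, and the `pandoc-api-version` value is an assumption that has to be compatible with the installed pandoc.

```python
import json


def text_to_inlines(text):
    """Split plain text into pandoc Str inlines separated by Space inlines."""
    inlines = []
    for i, word in enumerate(text.split()):
        if i > 0:
            inlines.append({"t": "Space"})
        inlines.append({"t": "Str", "c": word})
    return inlines


def header(level, text, ident=""):
    # Header content is [level, attr, inlines]; attr is [id, classes, key-value pairs].
    return {"t": "Header", "c": [level, [ident, [], []], text_to_inlines(text)]}


def para(text):
    # Para content is simply a list of inlines.
    return {"t": "Para", "c": text_to_inlines(text)}


doc = {
    # Assumed value: pandoc refuses JSON whose API version is incompatible with its own.
    "pandoc-api-version": [1, 23],
    "meta": {},
    "blocks": [
        header(1, "Test1", "test1"),
        header(2, "Test2", "test2"),
        para("hello world"),
    ],
}

print(json.dumps(doc, indent=2))
```

Saved to a file, a document like this can be passed to pandoc with `-f json` and converted to any format pandoc can write.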

For example this typst document:

#set page(width: 10cm, height: auto)
#set heading(numbering: "1.")

= Test1
== Test2

#v(3mm)
#align(center)[
  #set par(leading: 3mm)
  #text(1.2em)[*3. Übungsblatt Computerorientierte Mathematik II*] \
  *Abgabe: 03.05.2019* (bis 10:10 Uhr in MA 001) \
  *Alle Antworten sind zu beweisen.*
]

*1. Aufgabe* #h(1fr) (1 + 1 + 2 Punkte)

Ein _Binärbaum_ ist ein Wurzelbaum, in dem jeder Knoten ≤ 2 Kinder hat.
Die Tiefe eines Knotens _v_ ist die Länge des eindeutigen Weges von der Wurzel
zu _v_.

#align(center, image("/graph.png", width: 75%))


#table(
  columns: 4,
  [], [*Q1*], [*Q2*], [*Q3*],
  [Revenue:], [1000 €], [2000 €], [3000 €],
  [Expenses:], [500 €], [1000 €], [1500 €],
  [Profit:], [500 €], [1000 €], [1500 €],
)

- hello
- world
- foo


```rust
fn fun() {}
```

run via these commands:

typst test.typ --output-format pandoc-json test.pandoc.json
pandoc test.pandoc.json -o test.md

results in this:

markdown

# Test1

## Test2

::: {align="center"}
**3. Übungsblatt Computerorientierte Mathematik II** \
**Abgabe: 03.05.2019** (bis 10:10 Uhr in MA 001) \
**Alle Antworten sind zu beweisen.**
:::

**1. Aufgabe** (1 + 1 + 2 Punkte)

Ein *Binärbaum* ist ein Wurzelbaum, in dem jeder Knoten ≤ 2 Kinder hat.
Die Tiefe eines Knotens *v*
ist die Länge des eindeutigen Weges von der Wurzel zu *v*.

::: {align="center"}
![](/home/phire/data/dev/2023/typst/graph.png)
:::

  ----------- -------- -------- --------
              **Q1**   **Q2**   **Q3**
  Revenue:    1000 €   2000 €   3000 €
  Expenses:   500 €    1000 €   1500 €
  Profit:     500 €    1000 €   1500 €
  ----------- -------- -------- --------

-   hello

-   world

-   foo

``` rust
fn fun() {}
```

The exact output format can be controlled within pandoc; for example, to make it output standard CommonMark (without fenced divs etc.), use pandoc -t commonmark
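
As a rough sketch, the same intermediate file can be handed to pandoc with different writers from a script. The helper below only assembles the command line; `-f`, `-t` and `-o` are standard pandoc options, while the function names and file names are made up for illustration.

```python
import subprocess


def pandoc_command(json_path, out_path, writer=None):
    """Build the argv for converting a pandoc JSON AST file to another format."""
    cmd = ["pandoc", "-f", "json", json_path, "-o", out_path]
    if writer is not None:
        # e.g. "commonmark" for plain CommonMark without pandoc's Markdown extensions
        cmd += ["-t", writer]
    return cmd


def convert(json_path, out_path, writer=None):
    # Needs pandoc on PATH; raises CalledProcessError if pandoc exits with an error.
    subprocess.run(pandoc_command(json_path, out_path, writer), check=True)


print(pandoc_command("test.pandoc.json", "test.md", writer="commonmark"))
```

Without an explicit writer, pandoc picks the output format from the extension of the output file, so `convert("test.pandoc.json", "test.docx")` would be enough for docx.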

HTML

<h1>Test1</h1>

<h2>Test2</h2>

<div data-align="center">
<strong>3. Übungsblatt Computerorientierte Mathematik II</strong> <br />
<strong>Abgabe: 03.05.2019</strong> (bis 10:10 Uhr in MA 001) <br />
<strong>Alle Antworten sind zu beweisen.</strong>
</div>
<p><strong>1. Aufgabe</strong> (1 + 1 + 2 Punkte)</p>
<p>Ein <em>Binärbaum</em> ist ein Wurzelbaum, in dem jeder Knoten ≤ 2
Kinder hat. Die Tiefe eines Knotens <em>v</em> ist die Länge des
eindeutigen Weges von der Wurzel zu <em>v</em>.</p>
<div data-align="center">
<img src="/home/phire/data/dev/2023/typst/graph.png" />
</div>
<table>
<tbody>
<tr class="odd">
<td></td>
<td><strong>Q1</strong></td>
<td><strong>Q2</strong></td>
<td><strong>Q3</strong></td>
</tr>
<tr class="even">
<td>Revenue:</td>
<td>1000 €</td>
<td>2000 €</td>
<td>3000 €</td>
</tr>
<tr class="odd">
<td>Expenses:</td>
<td>500 €</td>
<td>1000 €</td>
<td>1500 €</td>
</tr>
<tr class="even">
<td>Profit:</td>
<td>500 €</td>
<td>1000 €</td>
<td>1500 €</td>
</tr>
</tbody>
</table>
<ul>
<li>hello</li>
</ul>

<ul>
<li>world</li>
</ul>

<ul>
<li>foo</li>
</ul>
<div class="sourceCode" id="cb1"><pre
class="sourceCode rust"><code class="sourceCode rust"><span id="cb1-1"><a href="#cb1-1" aria-hidden="true" tabindex="-1"></a><span class="kw">fn</span> fun() <span class="op">{}</span></span></code></pre></div>

docx

screenshot:
image

Issues

fixable:

  • many attributes are not transferred even though they probably have an equivalent (e.g. image width)
  • the code might be messier than needed
  • lists are not numbered correctly, since the ListBuilder in typst runs during the layout stage, which is skipped here
  • equations are not supported, since this would require either converting them to LaTeX math (which is not trivial) or rendering them
  • columns could kinda be supported via pandoc, but not for every format
  • drawings are not supported; they could be converted to vector images and included that way, I guess
  • grid layout is not supported; that could kinda be fixed by using tables or by deconstructing the grid in some order
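
To illustrate the first point, here is a sketch of how an image width could be carried over. In the pandoc AST an Image inline has an attribute list, and pandoc gives `width` and `height` attributes special treatment in several writers. The `image` helper is hypothetical, and whether Typst's relative `75%` maps cleanly onto pandoc's percentage widths is an assumption.

```python
def image(url, width=None, alt=""):
    """Build a pandoc Image inline: [attr, alt inlines, [url, title]]."""
    keyvals = []
    if width is not None:
        # attr is [id, classes, key-value pairs]; width goes in as a key-value pair
        keyvals.append(["width", width])
    alt_inlines = [{"t": "Str", "c": alt}] if alt else []
    return {"t": "Image", "c": [["", [], keyvals], alt_inlines, [url, ""]]}


node = image("graph.png", width="75%")
print(node)
```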

probably unfixable:

  • lines, horizontal and vertical alignment, more exact styling instructions
  • everything regarding pages, page breaks, page sizes (obviously)
  • general loss of control over how exactly the output looks

@Enivex
Collaborator

Enivex commented Mar 30, 2023

Naive question, but wouldn't this be better suited as a reader in pandoc? It already has a typst writer.

@alerque
Contributor

alerque commented Mar 30, 2023

@Enivex A fair question from a user perspective, but having looked at how this is accomplished I can say no, this cannot be done from the Pandoc end. Pandoc could implement a reader that would convert the input syntax from one form to another, but it would not evaluate it. This method basically uses the typst internals to iterate through the document and evaluate/expand/run everything, then at the last second, rather than outputting shapes to the PDF, it dumps strings to a JSON object. The exported result will not be equivalent to the input; it will be a closer equivalent of the output.

Rather clever actually, and something that has to be done from inside Typst (if at all), not from an outside reader.

@phiresky
Author

phiresky commented Mar 30, 2023

As I implemented it, it's not exactly "at the last second", but after the evaluation and before the layouting phase (after step 2 in ARCHITECTURE.md). The layouting phase is what defines the exact page breaks and locations within the pages, which doesn't really make sense for digitally viewed documents. I've found two things from the layouting phase that would make sense to have in this output: the ListBuilder and the SmartQuotes handling. Not sure why those are part of layouting. The ListBuilder is especially weird: you can explicitly declare a #list() with items, but if you use the list syntax, the #list() element never appears in the Content; it's just ListItems.
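
One possible workaround on the pandoc side is to post-process the JSON AST and merge runs of adjacent lists. This is only a sketch: it assumes the exporter emits one single-item BulletList per ListItem (which is what the separate `<ul>` elements in the HTML output above suggest), and it only looks at one level of blocks, not at lists nested inside divs or other containers.

```python
def merge_adjacent_lists(blocks):
    """Merge runs of consecutive BulletList blocks into a single BulletList."""
    merged = []
    for block in blocks:
        prev = merged[-1] if merged else None
        both_lists = (
            prev is not None
            and prev["t"] == "BulletList"
            and block["t"] == "BulletList"
        )
        if both_lists:
            # A BulletList's content is a list of items; each item is a list of blocks.
            prev["c"].extend(block["c"])
        elif block["t"] == "BulletList":
            # Copy the item list so the input blocks are not mutated by later merges.
            merged.append({"t": "BulletList", "c": list(block["c"])})
        else:
            merged.append(block)
    return merged
```

It would be applied to the `blocks` array of the exported document before handing the JSON to pandoc.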

Other than that I'd say @alerque is pretty correct: if pandoc wanted to have an integrated reader for typst, it would have to either reimplement all of typst (including the language interpreter) in Haskell, or it would work on a syntax level and ignore any custom functions (as the LaTeX reader in pandoc does).

@Leedehai
Contributor

Leedehai commented Mar 31, 2023

Thanks for the work! From a product perspective, however, I think HTML (and other formats) eventually generated this way does not have the same look as the PDF, and this will put Typst at an inferior position in comparison with KaTeX and MathJax, because the math equations/expressions rendered by the latter two look almost exactly the same as the PDF from LaTeX (of course, KaTeX and MathJax are not strictly the same thing as LaTeX, but they suffice for the most part).

The public impression of this inferiority could stick, negatively impacting the goal of being an alternative to LaTeX. I think if Typst wants to do HTML or any other formats, it'd be well-advised to do them "the right way", instead of having a not-there-yet-but-kinda-works interim solution.

@claudiomattera

Thanks for this work!
This is really a big advantage over LaTeX: finally we can produce documents readable on "modern books" (eBook readers :D ).

Not to mention the ability to support all other Pandoc output formats.

equations are not supported, since this would require either converting them to LaTeX math (which is not trivial) or rendering them

Would MathML be an option here?
Pandoc writes maths as MathML in HTML (or at least it is one of the options), but I am not sure if it can also parse it.

@phiresky
Author

phiresky commented Mar 31, 2023

this will put Typst at an inferior position in comparison with KaTeX and MathJax
The public impression of this inferiority could stick
@Leedehai

I'm not sure what exactly you mean by "inferior output", but when outputting HTML, pandoc actually supports rendering math as Unicode, via KaTeX or MathJax, or as MathML. What's missing in this PR is parsing the math expressions at all, but that's not really related to outputting the pandoc AST instead of HTML directly: you'd still need to figure out what to render them to regardless.

Even if you want to render them as PNG or SVG, you could still do that with the pandoc intermediate format. The advantage of outputting that intermediate AST is just that it makes many output formats available for the same effort that supporting HTML directly would take.
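
To make that concrete: in the pandoc AST, math is a Math inline that carries its source as a TeX string, and the HTML writer decides how to present it depending on options such as `--mathjax`, `--katex` or `--mathml`. A minimal sketch of such a node follows; the `math` helper is made up for illustration, and producing the TeX string from Typst math is exactly the part that is missing.

```python
def math(tex, display=False):
    """Build a pandoc Math inline: [math type, TeX source string]."""
    kind = "DisplayMath" if display else "InlineMath"
    return {"t": "Math", "c": [{"t": kind}, tex]}


# A paragraph containing a single display equation.
equation = {
    "t": "Para",
    "c": [math(r"\sum_{i=1}^{n} i = \frac{n(n+1)}{2}", display=True)],
}
print(equation)
```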

I actually wrote my master's thesis in markdown with pandoc (plus my blog, many other papers, presentations, ...).

Here's a screenshot from the HTML version (which uses MathJax for the math)

image

and here's a screenshot from the PDF output via LaTeX

image

@Leedehai
Contributor

Leedehai commented Mar 31, 2023

Hi @phiresky, thanks for the response.

I'm not sure what exactly you mean by "inferior output"

By "inferior" I meant this from my original post:

I think HTML (and other formats) eventually generated this way does not have the same look as the PDF, and this will put Typst at an inferior position in comparison with KaTeX and MathJax.

My concern boils down to: given any modestly complex math equation, is this Typst+Pandoc route able to generate an output that can be, with appropriate configs, eventually rendered into HTML that is of the same quality as the PDF?

By "same quality as the PDF", I have KaTeX and MathJax in mind as examples (ignoring edge cases). If the answer to the above question is yes, i.e. Typst+Pandoc+(appropriate configs) is on par with KaTeX/MathJax, then my concern is resolved, and please forgive my unawareness of Pandoc's prowess :)

Example: I wish a pretty equation like [1] in PDF could look just like [1] in HTML (with appropriate configs), instead of something like [2] at best.
[1] Screenshot 2023-03-31 at 10 26 48
[2] Screenshot 2023-03-31 at 10 23 39

@laurmaedje
Member

I've posted some thoughts in jgm/pandoc#8740.

@bdarcus bdarcus mentioned this pull request Apr 1, 2023
@laurmaedje
Member

The plan of action is to rework the styling implementation a bit, which will make it easier to create a good native HTML export that can then be used as input to pandoc for generating things like docx. In contrast to the approach used in this PR, this will also correctly export lists and things affected by show rules. Still, thanks for your work on this!
