converting file formats

converting HTML to Markdown

Note

  • foo.html represents an input HTML file.
  • bar.md represents an output text file formatted with Pandoc Markdown.

Use pandoc -f html-native_divs-native_spans -t markdown-escaped_line_breaks-fenced_divs-header_attributes-fenced_code_attributes-inline_code_attributes-bracketed_spans-smart-grid_tables-multiline_tables-simple_tables --atx-headers --wrap=none "foo.html" -o "bar.md" to convert an HTML file to a Pandoc Markdown-formatted text file.
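
The long format strings in this command are easier to audit when they are assembled from a list of extensions. The following is a minimal sketch under stated assumptions, not part of the original command: build_format, from_format, and to_format are hypothetical names, and the extension lists are copied from the command above. Recent pandoc releases are understood to replace --atx-headers with --markdown-headings=atx, so that flag may need adjusting for the installed version.

```shell
#!/bin/sh
# build_format is a hypothetical helper: starting from a base format name,
# it appends "-extension" for every extension that should be disabled.
build_format() {
    fmt=$1
    shift
    for ext in "$@"; do
        fmt="${fmt}-${ext}"
    done
    printf '%s\n' "$fmt"
}

# Same extensions as in the command above, one list per format.
from_format=$(build_format html native_divs native_spans)
to_format=$(build_format markdown \
    escaped_line_breaks fenced_divs header_attributes \
    fenced_code_attributes inline_code_attributes bracketed_spans \
    smart grid_tables multiline_tables simple_tables)

# Run the conversion only when pandoc and the input file are available.
if command -v pandoc >/dev/null 2>&1 && [ -f "foo.html" ]; then
    pandoc -f "$from_format" -t "$to_format" --atx-headers --wrap=none "foo.html" -o "bar.md"
fi
```

Adding or removing an extension then means editing one word in a list rather than a long hyphenated string.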

explanation

Note

This is an incomplete explanation.

Attention

Even when -native_spans is used for HTML input, some <span> HTML elements are still converted to bracketed spans. If -bracketed_spans is also used for Markdown output, those elements are instead left as <span> HTML elements rather than being converted to Markdown.

Attention

Disabling the link_attributes extension with -link_attributes, which otherwise lets Pandoc Markdown assign attribute lists to hyperlinks, prevents hyperlinks that have attributes from being converted to Markdown-style links.

Attention

Even when +fenced_code_blocks is used for Pandoc Markdown output, indented code blocks are still produced instead of fenced code blocks.

prior work

The method of producing ATX-style headings for all heading levels was introduced to me by an answer on Stack Overflow by shawnhcorey.

converting Markdown to DOCX

Note

  • foo.md represents an input text file formatted with Pandoc Markdown.
  • bar.docx represents an output Word DOCX file.

Use pandoc -f markdown -t docx "foo.md" --reference-doc="baz.docx" --lua-filter=~/lua/pagebreak.lua -o "bar.docx" to convert a Pandoc Markdown-formatted text file to a Word DOCX file, using baz.docx as a style reference and pagebreak.lua to produce page breaks.

converting Markdown to HTML

Note

  • foo.md represents an input text file formatted with Pandoc Markdown.
  • bar.html represents an output HTML file.

Use pandoc -f markdown -t html "foo.md" -o "bar.html" to convert a Pandoc Markdown-formatted text file to an HTML file.

converting Markdown to plain text

Note

  • foo.md represents an input text file formatted with Pandoc Markdown.
  • bar.txt represents an output plain text file.

Use pandoc -f markdown -t plain --wrap=none "foo.md" | sed -e 's/—/-/g' -e "s/’/'/g" -e 's/\xC2\xA0/ /g' - | cat -s - | sponge "bar.txt" to convert a Pandoc Markdown-formatted text file to a plain text file.

explanation

Note

This is an incomplete explanation.
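
As a partial illustration of the pipeline above: the sed stage replaces em dashes with hyphens, right single quotation marks with ASCII apostrophes, and no-break spaces with ordinary spaces, and cat -s squeezes runs of blank lines into a single blank line. The sketch below applies the same stages to a sample string instead of pandoc output. clean_text and nbsp are hypothetical names, and the no-break space is generated with printf rather than the \xC2\xA0 escape so that the example does not depend on GNU sed.

```shell
#!/bin/sh
# The no-break space (U+00A0) as UTF-8 bytes, built with portable octal escapes.
nbsp=$(printf '\302\240')

# clean_text is a hypothetical name for the sed and cat stages of the pipeline.
clean_text() {
    sed -e 's/—/-/g' -e "s/’/'/g" -e "s/${nbsp}/ /g" | cat -s
}

# Sample input: a curly apostrophe, an em dash, a no-break space,
# and three consecutive blank lines.
printf 'it’s here—now\nsplit%sword\n\n\n\nend\n' "$nbsp" | clean_text
```

The sample should come out with an ASCII apostrophe and hyphen on the first line, an ordinary space on the second, and one blank line before "end". In the original command, sponge then writes the result to bar.txt once the whole pipeline has finished.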

converting PDF to text

Note

  • foo.pdf represents an input PDF file.
  • bar.txt represents an output text file.

Use pdftotext -layout -nopgbrk foo.pdf bar.txt to convert a PDF file to a text file.

explanation

  • The -layout option attempts to preserve the layout of the PDF when converting.
  • The -nopgbrk option disables the insertion of form feed characters to indicate page breaks.
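
The same options can be applied to several files at once. This is a minimal sketch, assuming the PDFs sit in the current directory; txt_name is a hypothetical helper that derives each output name from the input name.

```shell
#!/bin/sh
# txt_name is a hypothetical helper: it maps "report.pdf" to "report.txt".
txt_name() {
    printf '%s\n' "${1%.pdf}.txt"
}

# Convert every PDF in the current directory, if pdftotext is installed.
if command -v pdftotext >/dev/null 2>&1; then
    for pdf in ./*.pdf; do
        # Skip the unexpanded pattern when no PDF files exist.
        [ -f "$pdf" ] || continue
        pdftotext -layout -nopgbrk "$pdf" "$(txt_name "$pdf")"
    done
fi
```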

converting plain text to synthesized-speech-FLAC

converting plain text to synthesized-speech-FLAC using eSpeak NG

Note

  • foo.txt represents an input plain text file.
  • bar.flac represents an output FLAC file.

Use espeak-ng -f foo.txt --stdout | sox --no-clobber - bar.flac to convert a plain text file to a synthesized-speech-FLAC file.

explanation

Note

This is an incomplete explanation.

  • The --no-clobber option prevents SoX from producing a FLAC output file if a file with the same name already exists.
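
The --no-clobber check happens inside SoX, after the shell has already started espeak-ng. A shell-level check can skip the whole pipeline up front. The following is a minimal sketch, not part of the original command; output_is_free is a hypothetical helper name.

```shell
#!/bin/sh
# output_is_free is a hypothetical helper: it succeeds only when nothing
# exists yet at the given path.
output_is_free() {
    [ ! -e "$1" ]
}

# Synthesize only when the output name is unused and the tools and input exist.
if output_is_free "bar.flac" \
    && command -v espeak-ng >/dev/null 2>&1 \
    && command -v sox >/dev/null 2>&1 \
    && [ -f "foo.txt" ]; then
    espeak-ng -f "foo.txt" --stdout | sox --no-clobber - "bar.flac"
fi
```

Keeping --no-clobber in the SoX call still guards against a file that appears between the check and the conversion.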

prior work

The -f and --stdout options were introduced to me by an answer on Stack Overflow by user76204.

converting plain text to synthesized-speech-FLAC using the Festival Speech Synthesis System

Note

  • foo.txt represents an input plain text file.
  • bar.flac represents an output FLAC file.

Use text2wave -otype aiff foo.txt | sox --no-clobber - bar.flac to convert a plain text file to a synthesized-speech-FLAC file.

Attention

text2wave does not seem to handle contractions correctly, reading out each individual character if an apostrophe is encountered in the middle of a word. A workaround is to omit apostrophes (') from the plain text input file, eliminating any contractions that rely on apostrophes.
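
The workaround can be scripted so that the original file stays untouched. This is a minimal sketch, assuming only ASCII apostrophes need removing; strip_apostrophes and the intermediate file foo.noapos.txt are hypothetical names.

```shell
#!/bin/sh
# strip_apostrophes is a hypothetical helper: it deletes every ASCII
# apostrophe from standard input.
strip_apostrophes() {
    tr -d "'"
}

# Synthesize from an apostrophe-free copy when the tools and input exist.
if command -v text2wave >/dev/null 2>&1 \
    && command -v sox >/dev/null 2>&1 \
    && [ -f "foo.txt" ]; then
    strip_apostrophes < "foo.txt" > "foo.noapos.txt"
    text2wave -otype aiff "foo.noapos.txt" | sox --no-clobber - "bar.flac"
fi
```

With this filter, "don't" becomes "dont", which text2wave reads as one word rather than spelling it out.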

explanation

Note

This is an incomplete explanation.

  • The -otype aiff option produces synthesized speech in AIFF format.
  • The --no-clobber option prevents SoX from producing a FLAC output file if a file with the same name already exists.

converting plain text to synthesized-speech-OGG using the Festival Speech Synthesis System

Note

  • foo.txt represents an input plain text file.
  • bar.ogg represents an output Vorbis-Ogg file.

Use text2wave -otype aiff foo.txt | sox --no-clobber - -C -1 bar.ogg to convert a plain text file to a synthesized-speech-Vorbis-Ogg file.

Attention

text2wave does not seem to handle contractions correctly, reading out each individual character if an apostrophe is encountered in the middle of a word. A workaround is to omit apostrophes (') from the plain text input file, eliminating any contractions that rely on apostrophes.

explanation

Note

This is an incomplete explanation.

  • The -otype aiff option produces synthesized speech in AIFF format.
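  • The --no-clobber option prevents SoX from producing a Vorbis-Ogg output file if a file with the same name already exists.
  • The -C -1 option sets the Vorbis encoding quality to -1, the lowest quality and highest compression that SoX offers for this format.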

licensing

No rights reserved: CC0 1.0.