converting file formats
converting HTML to Markdown
Note
- foo.html represents an input HTML file.
- bar.md represents an output text file formatted with Pandoc Markdown.
Use pandoc -f html-native_divs-native_spans -t markdown-escaped_line_breaks-fenced_divs-header_attributes-fenced_code_attributes-inline_code_attributes-bracketed_spans-smart-grid_tables-multiline_tables-simple_tables --atx-headers --wrap=none "foo.html" -o "bar.md"
to convert an HTML file to a Pandoc Markdown-formatted text file.
explanation
Note
This is an incomplete explanation.
- The --atx-headers option produces ATX-style headings for all heading levels, overriding the default behavior of producing Setext-style Markdown headings for levels 1 and 2. In pandoc 2.11.2 and later, this option is deprecated in favor of --markdown-headings=atx.
- The --wrap=none option disables text wrapping.
- Pandoc Markdown output
  - -bracketed_spans disables the Pandoc Markdown extension for bracketed spans.
  - -escaped_line_breaks disables the Pandoc Markdown extension for backslash-escaped line breaks.
  - -fenced_code_attributes disables the Pandoc Markdown extension for assigning attribute lists to fenced code blocks.
  - -header_attributes disables the Pandoc Markdown extension for assigning attribute lists to headings.
  - -inline_code_attributes disables the Pandoc Markdown extension for assigning attribute lists to inline code.
  - -smart disables the extension for interpreting ASCII characters as curly quotes, em-dashes, en-dashes, and ellipses, and for inserting nonbreaking spaces. Backslash-escaped double quotes (\") are also no longer produced.
  - -grid_tables-multiline_tables-simple_tables disables the Pandoc Markdown extensions for grid tables, multiline tables, and simple tables, leaving only pipe tables.
- HTML input
  - -native_divs disables the raw HTML extension for preserving native <div> HTML elements.
  - -native_spans disables the raw HTML extension for preserving some native <span> HTML elements. Some <span> HTML elements are still preserved as bracketed spans (see the admonition below).
Attention
Even when -native_spans is used for HTML input, some <span> HTML elements are preserved as bracketed spans, unless -bracketed_spans is used for Markdown output, in which case the <span> HTML elements are preserved without being converted to Markdown format.
Attention
Using -link_attributes to disable the Pandoc Markdown extension for assigning attribute lists to hyperlinks prevents hyperlinks with attributes from being converted to Markdown-style links.
Attention
Even when +fenced_code_blocks is used for Pandoc Markdown output, indented code blocks are still produced instead of fenced code blocks.
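The long -f and -t arguments are easier to audit when each disabled extension is listed separately. The sketch below assembles the same two format strings in POSIX shell. The variable names and the html_to_md function are illustrative helpers, not part of pandoc.

```shell
# Sketch: build the pandoc format strings from readable lists.
# html_off, md_off, from_format, to_format and html_to_md are hypothetical names.

# Extensions to disable on the HTML reader.
html_off="native_divs native_spans"

# Extensions to disable on the Pandoc Markdown writer.
md_off="escaped_line_breaks fenced_divs header_attributes
fenced_code_attributes inline_code_attributes bracketed_spans smart
grid_tables multiline_tables simple_tables"

# Append "-name" for every extension in each list.
from_format="html"
for ext in $html_off; do from_format="$from_format-$ext"; done

to_format="markdown"
for ext in $md_off; do to_format="$to_format-$ext"; done

# Convert the HTML file $1 to the Markdown file $2 (requires pandoc).
html_to_md() {
    pandoc -f "$from_format" -t "$to_format" --atx-headers --wrap=none "$1" -o "$2"
}

# Show the assembled strings so they can be checked before converting.
echo "$from_format"
echo "$to_format"
```

With pandoc installed, html_to_md "foo.html" "bar.md" would then run the conversion described above.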
prior work
The method of producing ATX-style headings for all heading levels was introduced to me by an answer on Stack Overflow by shawnhcorey.
converting Markdown to DOCX
Note
- foo.md represents an input text file formatted with Pandoc Markdown.
- bar.docx represents an output Word DOCX file.
- baz.docx represents a style reference for the output Word DOCX file.
Use pandoc -f markdown -t docx "foo.md" --reference-doc="baz.docx" --lua-filter="$HOME/lua/pagebreak.lua" -o "bar.docx"
to convert a Pandoc Markdown-formatted text file to a Word DOCX file, using baz.docx as a style reference and pagebreak.lua to produce page breaks.
converting Markdown to HTML
Note
- foo.md represents an input text file formatted with Pandoc Markdown.
- bar.html represents an output HTML file.
Use pandoc -f markdown -t html "foo.md" -o "bar.html"
to convert a Pandoc Markdown-formatted text file to an HTML file.
converting Markdown to plain text
Note
- foo.md represents an input text file formatted with Pandoc Markdown.
- bar.txt represents an output plain text file.
Use pandoc -f markdown -t plain --wrap=none "foo.md" | sed -e 's/—/-/g' -e "s/’/'/g" -e 's/\xC2\xA0/ /g' - | cat -s - | sponge "bar.txt"
to convert a Pandoc Markdown-formatted text file to a plain text file.
explanation
Note
This is an incomplete explanation.
- The cat -s command produces a single blank line in place of multiple adjacent blank lines.
- The sed -e 's/—/-/g' -e "s/’/'/g" -e 's/\xC2\xA0/ /g' command does the following:
  - 's/—/-/g' replaces any em dashes (—) with hyphen-minuses (-).
  - "s/’/'/g" replaces any right single quotation marks (’) with apostrophes (').
  - 's/\xC2\xA0/ /g' replaces any non-breaking spaces (U+00A0) with ordinary spaces, matching the two bytes 0xC2 0xA0 that encode the character in UTF-8.
- The --wrap=none option disables text wrapping.
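The post-processing stage can be tried on its own, without pandoc. The sketch below feeds sample text through the same sed and cat -s commands. It assumes GNU sed (for the \xHH escapes), and clean_plain_text is a hypothetical helper name.

```shell
# Sketch: the sed and cat -s stage of the pipeline, applied to sample text.
# Assumes GNU sed; clean_plain_text is a hypothetical helper name.
clean_plain_text() {
    sed -e 's/—/-/g' -e "s/’/'/g" -e 's/\xC2\xA0/ /g' | cat -s
}

# Sample input built from octal UTF-8 bytes:
#   \342\200\224 = em dash (U+2014)
#   \342\200\231 = right single quotation mark (U+2019)
#   \302\240     = non-breaking space (U+00A0)
# The three consecutive empty lines should collapse into one.
printf 'a\342\200\224b\n\n\n\nc\342\200\231d\302\240e\n' | clean_plain_text
```

The sample should come out as the line a-b, one blank line, and the line c'd e.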
converting PDF to text
Note
- foo.pdf represents an input PDF file.
- bar.txt represents an output text file.
Use pdftotext -layout -nopgbrk foo.pdf bar.txt
to convert a PDF file to a text file.
explanation
- The -layout option attempts to preserve the layout of the PDF when converting.
- The -nopgbrk option disables the insertion of form feed characters to indicate page breaks.
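The same options can be applied to a whole directory. This is a minimal sketch, assuming every input file name ends in .pdf; txt_name_for and convert_all_pdfs are hypothetical helper names.

```shell
# Sketch: convert every PDF in the current directory with the options above.
# txt_name_for and convert_all_pdfs are hypothetical helper names.

# Map an input name such as foo.pdf to the output name foo.txt.
txt_name_for() {
    printf '%s\n' "${1%.pdf}.txt"
}

# Run pdftotext on each PDF (requires pdftotext to be installed).
convert_all_pdfs() {
    for pdf in ./*.pdf; do
        [ -e "$pdf" ] || continue   # skip the unexpanded pattern when no PDFs exist
        pdftotext -layout -nopgbrk "$pdf" "$(txt_name_for "$pdf")"
    done
}

# Show the name mapping for one example file.
txt_name_for "foo.pdf"
```

Calling convert_all_pdfs in a directory of PDF files would write one text file next to each input.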
converting plain text to synthesized-speech-FLAC
converting plain text to synthesized-speech-FLAC using eSpeak NG
Note
- foo.txt represents an input plain text file.
- bar.flac represents an output FLAC file.
Use espeak-ng -f foo.txt --stdout | sox --no-clobber - bar.flac
to convert a plain text file to a synthesized-speech-FLAC file.
explanation
Note
This is an incomplete explanation.
- The --no-clobber option prevents SoX from producing a FLAC output file if a file with the same name already exists.
prior work
The -f and --stdout options were introduced to me by an answer on Stack Overflow by user76204.
converting plain text to synthesized-speech-FLAC using the Festival Speech Synthesis System
Note
- foo.txt represents an input plain text file.
- bar.flac represents an output FLAC file.
Use text2wave -otype aiff foo.txt | sox --no-clobber - bar.flac
to convert a plain text file to a synthesized-speech-FLAC file.
Attention
text2wave does not seem to handle contractions correctly, reading out each individual character if an apostrophe is encountered in the middle of a word. A workaround is to omit apostrophes (') from the plain text input file, eliminating any contractions that rely on apostrophes.
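The workaround can be applied to a copy of the input rather than by editing the original file. This is a minimal sketch, assuming only the ASCII apostrophe needs removing; strip_apostrophes and the intermediate file name foo.noapos.txt are hypothetical.

```shell
# Delete every ASCII apostrophe (') from standard input.
# strip_apostrophes is a hypothetical helper name.
strip_apostrophes() {
    tr -d "'"
}

# Example: contractions lose their apostrophes before synthesis.
printf "Don't stop, it's fine.\n" | strip_apostrophes

# Sketch of the full conversion (requires Festival's text2wave and SoX):
#   strip_apostrophes < foo.txt > foo.noapos.txt
#   text2wave -otype aiff foo.noapos.txt | sox --no-clobber - bar.flac
```

Writing the stripped text to a separate file leaves foo.txt untouched, so the original punctuation is preserved for other uses.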
explanation
Note
This is an incomplete explanation.
- The -otype aiff option produces synthesized speech in AIFF format.
- The --no-clobber option prevents SoX from producing a FLAC output file if a file with the same name already exists.
converting plain text to synthesized-speech-OGG using the Festival Speech Synthesis System
Use text2wave -otype aiff foo.txt | sox --no-clobber - -C -1 bar.ogg
to convert a plain text file to a synthesized-speech-Vorbis-Ogg file.
Attention
text2wave does not seem to handle contractions correctly, reading out each individual character if an apostrophe is encountered in the middle of a word. A workaround is to omit apostrophes (') from the plain text input file, eliminating any contractions that rely on apostrophes.
explanation
Note
This is an incomplete explanation.
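- The -otype aiff option produces synthesized speech in AIFF format.
- The --no-clobber option prevents SoX from producing a Vorbis-Ogg output file if a file with the same name already exists.
- The -C -1 option sets the Vorbis compression factor to -1, which SoX documents as the highest-compression, lowest-quality setting for this format.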
licensing
No rights reserved: CC0 1.0.