Replacing WordPress
Is it a story, or is it technical documentation? It is probably both. I only know it started with a specification of some kind and went on to become a solution. Now I try to compile the specification and the implementation notes into the description of a journey: my journey to Python and to my new web presence, and what I learned on the way.
The code in this article is still hot off the needle: written in haste and not yet polished.
Motivation
Part One
I think WordPress is a good platform to create a web presence. It is just a pain in the ass if you start to care about privacy. You start by removing all the included tracker features contacting Google Analytics, you continue with the removal of "social" media features, because you are not sure whether these contact the respective platforms even without the social media button being pressed, and you remove all those links loading resources from foreign servers, because the respective resource servers might be able to identify you and also the page you visited.
After this isolation of your web presence, you are halfway sure that the privacy of your visitors is safe. Then you might receive a notification that you should update WordPress or one of the plugins. ...
If you do not update, your web presence may become insecure for your visitors. If you update, you have to review everything again. Is your server isolation still in effect, or did the update reinstate some of the removed side-communications?
Part Two
I decided to learn Python and needed a project for this. I stumbled upon the article "Gitblog - the software that powers my blog" 1, and I liked the idea of publishing via
$ git push
I'd guess most developers agree with this. Posting articles just the same way as you push new code to a git server - developers are bound to like this.
The Gitblog solution is based on Java, and nothing is wrong with that. I used Java in software development for more than 20 years, with a noticeable break during which I mainly used ABAP, before coming back to Java again.
But since I wanted to learn Python anyhow, this was the ideal project to get a start with that language.
Project Duration
From start to go-live it took slightly less than two months. The first Wiki entry dates back to December 31st, 2021, but this was nothing more than a note to myself. On January 16th the real drafting of the specification started, and hardly anything from that first specification survived. But the most important requirements stayed stable and are met by the solution.
- No JavaScript
  - There is one article containing JavaScript, because the article contains a quiz. But articles are content, not the publishing solution.
- All plain HTML+CSS static content
- State of the art semantic HTML
- Search (the exception from the static content)
  - Currently YaCy integration
  - Planned: a Python-based search
The website, with all previously published articles migrated, went live on March 14th. And migration was a really heavy topic. Most articles were originally written in one of my Wikis and then published in WordPress, but some of the very short video, audio or article recommendations were not written in the Wiki. Some articles got corrections maintained directly in WordPress instead of a correction in the Wiki with republishing afterwards. For some of those corrections I just made a comment in WordPress to make readers aware of the mistake.
In short: inconsistency in previous publishing made migration a major effort.
Other inconsistencies were:
- Articles maintained in the Wiki had chapter headlines partly starting at header level 2 and partly at header level 3.
- Quotations were only preceded by "Zitat:" and concluded with "Zitat Ende", and not even everywhere, since I started this only when I began the audio recordings of my articles - to make sure I do not forget to mention the start and the end of a quotation in the recording.
- Quotations were not enclosed in <blockquote> tags.
To provide state of the art semantic HTML I had to copy-edit every article during the migration. The good news is: the new setup will drive and support me to publish in a more consistent manner.
Considering this major migration effort, I'm pretty proud the project took "only" two months, especially since it was the Python learning project.
As often happens along the way, I learned much more than just Python. I learned new things about PDF generation, fonts, git, regex, HTML, CSS, vim, the IDE Spyder, the web server nginx and even more.
Requirement Specification
The requirement specification was subject to changes. As it often happens, this was mainly because it not only described requirements, but also already made assumptions about technical details of the solution.
A funny fact: I spent years explaining to my own customers that it is important not to write requirement specifications with a technical solution in mind. The requirement specification should focus strictly on non-technical scenario descriptions. The rationale behind this: very often a customer would ask to eliminate work efforts caused by previously implemented workarounds. The workaround is viewed as a tool by the customer, and following the customer's suggestion leads to the implementation of yet another workaround. Very often you get a much better overall solution if you also sunset existing workarounds, which is difficult, because they were so helpful in the past.
My previous publishing scenario
I use one of my MediaWiki instances to collect information, and I also use this wiki to create articles based on that information. This part of the publishing scenario stays in place. I considered changing this as well and writing articles in the editor vim in the future, but I decided to keep information collection and article compilation together in one place.
To get an article published, together with its audio recording, I used the HTML export option of a PDF export extension.
I then used the editor vim with 3 regex statements to strip the header and the footer from that export, and, if necessary, also the references to categories, which would otherwise establish links pointing into the void when displayed in WordPress.
The remaining HTML was then pasted into one HTML input field in the Create Post UI of WordPress. Thus the page internal links in the table of contents and to and from the reference section of the page stayed functional.
If pictures were included, I uploaded these first to WordPress and used these uploaded pictures already in the wiki. That way the links to the pictures stayed as they were in the later WordPress version of the article.
All in all not too cumbersome a process, but with room for improvement. Especially when corrections were required, it was much too easy to apply the correction directly in WordPress instead of doing the correction in the Wiki and republishing it. And this leads to problems in the long run. For some time I thought about some automation of text deployment to WordPress to mitigate this. But those thoughts are now obviously obsolete.
How do I want to do it in the future?
This chapter is from my early specification notes. I tried to figure out what I really want.
This is not really easy to tell. I'm still struggling to settle on one opinion about this topic.
I'd like to edit my pages with MediaWiki markup or, as an alternative, with Markdown. From the implementation side it would be simplest to keep the editing process as it is today and only change the publishing.
The publishing and the result as shown in Gitblog is quite to my taste. However, this solution is based on Java, and I think 20+ years of Java is enough. I'd like to base my own solution on Python. Not because it is so much better than Java, which might or might not be the case, but because I decided to learn Python down to its depths and such a project is a perfect opportunity.
This does not mean that I need to write everything from scratch; there are already a lot of modules in existence to build upon.
On the other hand, I'd like to be able to write my articles completely offline, just using vim as markup editor. But is this a realistic scenario? Am I not researching every detail online anyhow during the authoring? There are so many things you have read about and are quite sure about, but you need a source as a reference when you write them into an article. Will I ever really do authoring offline?
But why not both options?
During commit a pre-commit handler can check the mime type and do one thing if it is an HTML fragment to be placed into an empty HTML page template, and another thing if it is a Markdown file with the extension .md.
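Just to make that idea concrete, here is a minimal sketch of such a dispatch by file type in Python; the function and the messages are purely illustrative, not part of any implementation:

from pathlib import Path

def process_committed_file(path):
    """Dispatch a committed file by its extension (illustrative sketch only)."""
    path = Path(path)
    if path.suffix == ".html":
        # An HTML fragment would be placed into an empty HTML page template.
        print(f"wrap {path.name} into the page template")
    elif path.suffix == ".md":
        # A markdown file would be converted, e.g. with pandoc.
        print(f"convert {path.name} from markdown to HTML")

process_committed_file("author/example.md")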
Looking at the GitLab Docs 2, it is undeniable that powerful versions of Markdown exist. However, installing GitLab also means installing a big bunch of software, which is not really smaller than WordPress. In the search for a small, minimalistic solution, GitLab is probably out of scope.
To be honest, I'm not sure I ended up with fewer installations than GitLab. But as you see in this text, I initially expected that I would need to meddle with the HTML from the MediaWiki, as I did before. Fortunately, specifications are a moving target, something we developers otherwise like to complain about.
How shall it look in the future
This chapter is from my early specification notes as well. It was less off the mark than the previous chapter.
For a start it should look like before, just without those things no longer required. E.g. a logon is no longer required, since I push and merge new articles to the server, instead of logging in and using an authoring front-end.
It looks different, but not too much different
Search
Initially the search will use my YaCy instance. I have to look how well this integrates.
Yes, YaCy is integrated. But I consider the current search integration as improvable.
Side Pane?
Is a side pane required any longer? Probably not, I'm not sure.
No side pane any more.
Header Collapse or not?
Can I collapse the header with the site navigation during scroll down and make it available when the user starts to scroll up? I mean without JavaScript, only with CSS? I will see.
This point got no priority at all. Nice idea probably, but in the end I didn't care.
Small Header
The header will become smaller, since I will shorten the main site name to "Idee" with a smaller "der eigenen Erkenntnis", and I will write it in small caps to hide the problem of the uppercase "E" in "Erkenntnis" not fitting exactly to the lowercase "e" at the end of "Idee". And the header text will move on top of the header picture. Forget it. OK, the header got smaller, but that's it.
Article PDF
Every article will get a PDF download button. The PDF is not necessarily optimized for print and offline reading, but it is nonetheless a good idea to simplify access to the references in the reference section via QR codes for the respective links.
The article PDF is implemented, along with a possibility to suppress its generation for low-value content, e.g. if the "article" is just a recommendation note.
References in the PDF are rendered as in the online version, but additionally show the HTTP address as text. QR code creation for every reference does not take place.
Article Archive
WordPress shows an archive drop-down. That needs scripting and dynamic population. An alternative would be the generation of one archive page, which allows a drill-down to the year, which allows a drill-down to the month. Such pages can stay unchanged once the respective month or year has passed, as long as I do not change the portal part. The usability is most probably no worse than a drop-down, which at some point gets a bit messy to scroll on small screens.
Implemented as described. Only, in the chosen implementation a portal change does not require a regeneration of pages, which is a huge improvement compared with the initial specification.
Sizing Pictures?
Should I size pictures during commit? Should I sample audio files into a number of different qualities? A lot of options are open now, with the development of my own page factory.
Today the question mark in the title can be answered with "No".
Picture based article selection
I could create a picture gallery to select articles by picture. But then I should probably create a picture for every article... Not really; in rare cases I also do not create audio. I wouldn't force myself into picture creation where the picture does not add value.
The original text hints at it already: nothing in this regard has happened.
Semantic Web
Articles will have state of the art HTML5 article structure. This needs some intelligent logic when it comes to the correct use of tags like the cite tag. I probably need to think about the Markdown and MediaWiki representation of the HTML cite tag to make this one work nicely.
I obviously meant quotations. The markup representation for quotations is the respective HTML tag <blockquote>. The mentioned <cite> tag could probably come into use in my articles as well, in the references, but this is probably not a good idea. Possibly I'll introduce this later.
However, a lot of the semantics is simple. The article content resides in the article tag. The article tag contains a header tag, whose headline and media are descriptive for the complete article, like the QR code of the URL, the PDF file and the audio file. Video is not planned.
The HTML head meta tags for articles and the og meta tags bring a lot of invisible semantics to the page.
A lot of options. In the end I will strip this text down to those things which made it into the product.
It's all in, apart from the cite tag, which was an error in the specification.
Citation
That would be an interactive page function.
- Reuse citations I made, in various citation formats.
  - Click a function link at the footnote; the citation gets shown and a citation format can be selected.
  - The result can be used via the copy-paste buffer.
- Cite statements made by me, in various citation formats.
  - Mark a text passage in my article and a citation function link gets shown.
  - Then as above.
Yes, this function needs JavaScript, which would be a drawback from the plain-HTML philosophy I started with. Since I myself cite a lot of other publications and know the effort it takes to create citations as I need them, I think such a function is worth scripting.
Plain HTML is not a religion. It is rather: avoid scripting where it is not required. OK, it is a nice idea to make it simple for others to cite me, but it is not implemented and probably never will be.
JavaScript
JavaScript can be disabled in the browser without any impact for the casual user. Only "extended features", such as the citation feature, may rely on JavaScript. Features not available due to disabled JavaScript stay invisible to the user.
I translate myself for myself :-). Function-links are written into the page via JavaScript. If JavaScript is not enabled, the page will not contain any malfunctioning function links.
So far I did not need to follow this specification, since the implementation doesn't use JavaScript anywhere. But the specification stays valid.
Sitemaps
The "Sitemaps XML format " 3 description explains the concept and the XML document structure of sitemaps.
RSS
Updates have a reason; most probably additional or corrected information went into the article. Announcing such changes is imperative for an information provider, and the RSS.xml is the place for this.
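To illustrate what such an announcement boils down to, here is a minimal sketch of a single RSS item built with xml.etree; the element choices and the example values are mine, not the later feed implementation:

import xml.etree.ElementTree as ET

def rss_item(title, link, pub_date, description):
    """Build a single RSS <item> element (illustrative sketch)."""
    item = ET.Element("item")
    ET.SubElement(item, "title").text = title
    ET.SubElement(item, "link").text = link
    ET.SubElement(item, "guid").text = link
    ET.SubElement(item, "pubDate").text = pub_date      # RFC 822 date string
    ET.SubElement(item, "description").text = description
    return item

item = rss_item("Replacing WordPress",
                "https://idee.frank-siebert.de/article/replacing-wordpress.html",
                "Tue, 15 Mar 2022 12:00:00 +0100",
                "Update: corrected information went into the article.")
print(ET.tostring(item, encoding="unicode"))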
Monthly Archive
As sitemaps can be structured by one or more sitemap index files, it does make sense to use this to structure the sitemaps by "yyyy-MM", getting one sitemap per month.
Thus it's probably simple to create a monthly page of articles, to be selected by the user in an archive overview created from the sitemap index.
The first sentence describes how the implementation was done later. But the second sentence was too optimistic. The sitemap, if not pepped up with extensions, does not contain enough information to create monthly archive pages from it.
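For illustration, a sitemap index grouping one sitemap per month could be generated roughly like this; the file names and the URL layout are assumptions, not the actual generator code:

import xml.etree.ElementTree as ET

def sitemap_index(months, site="https://idee.frank-siebert.de"):
    """Build a sitemap index with one sitemap per month ("yyyy-MM") - a sketch."""
    index = ET.Element("sitemapindex",
                       xmlns="http://www.sitemaps.org/schemas/sitemap/0.9")
    for month in months:                      # e.g. ["2022-02", "2022-03"]
        entry = ET.SubElement(index, "sitemap")
        ET.SubElement(entry, "loc").text = f"{site}/sitemap/sitemap-{month}.xml"
    return ET.tostring(index, encoding="unicode")

print(sitemap_index(["2022-02", "2022-03"]))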
HTML5 Article (with prepared location for portal injection)
This is the article HTML draft. It contains a line with div id="main", which I planned to be the place where I inject the portal part via Python. That div turned out to be unnecessary; instead, a comment is now placed after the body tag and before the main tag as an include instruction. The include is performed by the nginx web server.
<!DOCTYPE html>
<html lang="de-DE" xml:lang="de-DE" xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta charset="utf-8"/>
<meta content="pandoc, fs-commit-msg-hook 1.0" name="generator"/>
<meta content="width=device-width, initial-scale=1.0, user-scalable=yes"
name="viewport"/>
<meta content="2022-01-19T16:05:43" property="article:modified_time"/>
<meta content="2020-10-15 09:49:27" property="article:published_time"/>
<meta content="Frank Siebert" property="article:author"/>
<meta content="Idee" property="og:site_name"/>
<meta content="de-DE" property="og:locale"/>
<meta content="The Article Title" property="og:title"/>
<link href="../website/css/fs.css" rel="stylesheet"/>
<title>
The Article Title</title>
<style>
<!-- styles by pandoc -->
code{white-space: pre-wrap;}
span.smallcaps{font-variant: small-caps;}
span.underline{text-decoration: underline;}
div.column{display: inline-block; vertical-align: top; width: 50%;}
div.hanging-indent{margin-left: 1.5em; text-indent: -1.5em;}
ul.task-list{list-style: none;}
</style>
</head>
<body>
<div id="main"> <!-- Portal injection parent -->
<main>
<article>
<header>
<h1>
The Article Title</h1>
<div>
<time datetime="yyyy-MM-dd hh:mm:ss" pubdate="true">
yyyy-MM-dd</time>
<address>
Author Name</address>
<!-- probably PDF download link location -->
</div>
<!-- probably audio player location -->
</header>
<!-- article content (paragraphs, toc, headlines (< h1), images, footnotes) -->
</article>
</main>
</div>
</body>
</html>
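Just to illustrate the approach that replaced the div: with server side includes switched on in nginx (ssi on;), an include comment can be written into the generated page, for example with Beautiful Soup. This is a sketch under assumptions; the portal path is invented here:

from bs4 import BeautifulSoup, Comment

def add_portal_include(html):
    """Insert an nginx SSI include comment right after the body tag (sketch)."""
    soup = BeautifulSoup(html, "html.parser")
    soup.body.insert(0, Comment('# include virtual="/portal/portal.html" '))
    # The comment renders as <!--# include virtual="/portal/portal.html" -->,
    # which nginx resolves when it serves the page with SSI enabled.
    return str(soup)

print(add_portal_include("<html><body><main>article</main></body></html>"))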
Scenario
This is still specification before the development started. It seems to be repetition, but it is not, because it contains a decision not made before. But it also contains things which should not be written into a scenario. If you wear the hats of the customer, the architect and the developer all in one person, then there is no second person taking care that the specification is written correctly.
Obviously I'm not completely sure about the scenario. But if it changes over time, then that's a common thing often seen also in other projects.
Decision: Material collection and writing happens in a MediaWiki. The export may happen with the current export tool, or it might happen with a Python based Wiki-page parser 4
Aspiration: I want to have useful meta tags generated during the commit. The author information, the date of publishing, the date of update and a documentation of changes should be automatically processed into HTML meta information and into a standardized representation in the visible text. This is important to ensure that corrections are processed in a transparent, reader-friendly manner.
For the sake of usability, a commit should not lead to an automatic release of the text. The commit is for the draft version only. This basically means that I will work in branches and the final publishing is done with a merge into the master branch.
Forget this branching explanation. Yes, commits generate HTML for review, but branches are not necessary, and therefore there is also no merge. The final publishing is done via the git push command, as explained earlier.
During the merge into the master branch on the server:
- The HTML is processed to contain a header, a style sheet, meta information and change markers if the document is not being merged for the first time.
- The page is fed into a search engine for indexing
- The page is fed into an rss feed generator to provide a new entry in the rss feed.
- The page is fed into a sitemap generator to provide an updated sitemap
In the end everything is done during commit, with the exception of search engine indexing, which can be done by the YaCy-Search Engine only after publishing.
Search Index
And even more specification, if you like to call it such. Probably it is more an investigation of options regarding search.
For Python some search index implementations exist. There is one do-it-yourself example by Bart de Goede 5; at the opposite end of the spectrum we find Gensim 6, which can probably do much more than just indexing; there is a module named Whoosh 7; and there is rank-bm25 8, which implements multiple variants of the BM25 search algorithm.
I tend to base my search on the latter module, and I'm curious how well this will work.
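A minimal sketch of what a search based on rank-bm25 could look like, assuming simple whitespace tokenization of already extracted article texts; this is not the final search implementation:

from rank_bm25 import BM25Okapi

articles = {
    "replacing-wordpress": "replacing wordpress with a python based static website generator",
    "some-other-article": "wordpress privacy tracker social media plugins",
}
corpus = [text.split() for text in articles.values()]
bm25 = BM25Okapi(corpus)

query = "python static website".split()
scores = bm25.get_scores(query)                 # one relevance score per article
best = max(zip(articles, scores), key=lambda pair: pair[1])
print(best)                                     # (article stem, score)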
Search Index Related Learning Material
- "Improvements to BM25 and Language Models Examined" 9
- "What is the difference between Okapi bm25 and NMSLIB?" 10
Implementation
With the chapter "Toolchain" the implementation started. The chapters are sorted by initial implementation sequence.
Toolchain
MediaWiki-Tools git ~/projects/wikitools/
This is the git used to implement the tools to access the MediaWiki instances. The default instance used is my private sammel-wiki, but there is no reason why I should not also access my installations-wiki to create postings from it. Well, except for the language, probably, since my blog is in German.
The language problem is solved with the creation of two sites, one in German and one in English.
The wikitools project git existed already, hosting the code for a program "reference.py", which scrapes web pages to create a reference tag stored in a newly created reference wiki page for the scraped website.
Related to the WordPress replacement project is the new tool "export.py", which extracts a wiki page with expanded templates as a MediaWiki markup file. The output of this tool is placed into a configured directory, which is, how convenient, the authoring directory of the authoring git.
Authoring git ~/projects/idee
Authoring takes place in the folder ./author/ , for a start via MediaWiki files. This means I can also use my MediaWiki instances for authoring and afterwards use the export.py from my wikitools to save the article as "authoring source" into this folder.
During git commit the commit-msg hook implementation checks for committed ./author/*.mediawiki files to be processed.
TODO: Consider options to structure this in
./author/yyyy/MM/
folders.
DONE: The result is NO.
The respective mediawiki files are processed into plain HTML by the method pandocmw(), pandoc being the conversion tool used.
Processing results are stored in:
- ./plain/ - the plain html files
- ./website/image/ - the image files
The plain HTML files are further processed into PDF files by the method pandoc-html-pdf(), pandoc again being the conversion tool used; a rough sketch of both pandoc calls follows after the list below.
Processing results are stored in:
- ./website/pdf/ - the pdf files
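The sketch announced above: both conversions could be driven from Python via subprocess roughly like this. The exact pandoc options, the target paths and the PDF engine are assumptions, not the code behind pandocmw() and the PDF method:

import subprocess
from pathlib import Path

def mediawiki_to_html(src):
    """Convert an ./author/*.mediawiki file to plain HTML in ./plain/ (sketch)."""
    out = Path("plain") / (Path(src).stem + ".html")
    subprocess.run(["pandoc", "-f", "mediawiki", "-t", "html",
                    "-o", str(out), str(src)], check=True)
    return out

def html_to_pdf(src):
    """Convert a plain HTML file to a PDF in ./website/pdf/ (sketch)."""
    out = Path("website/pdf") / (Path(src).stem + ".pdf")
    subprocess.run(["pandoc", str(src), "-o", str(out),
                    "--pdf-engine=weasyprint"], check=True)
    return out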
Created HTML and PDF files open automatically in Firefox for review.
At this point of the processing the Pictures as well as PDF files are supposed to be final, at least after some possible round trips of review and correction.
TODO: check the creation of an asset list for each authored article, to prevent the deployment of the article without the corresponding pictures and audios.
DONE: The result is NO. The risk of this happening is minimal, and it is also corrected very quickly if it should happen.
TODO: one option to simplify the commit of all required assets is the creation of an asset list for the committed article. Probably this can be in the form of a prepared commit message for these assets.
DONE: The result is NO. All git commands can be issued in the root of the git repository, making sure everything is included. Just "git add .", "git commit", that is simple enough.
During git commit the commit-msg hook implementation checks for committed ./plain/*.html files to be processed.
The respective plain html pages are processed by the method injectportal() into webpages of a website via:
- the injection of the portal into the page
- the placement of the PDF access link, if a pdf with the same name exists
- the placement of the HTML5 audio player, if an audio with the same name exists
The processing results are stored in:
- ./website/article/ - webpages containing articles
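A simplified sketch of this kind of processing, using Beautiful Soup, which the generator's docstring names; the element choices, file locations and relative paths are assumptions, not the real injectportal():

from pathlib import Path
from bs4 import BeautifulSoup

def build_article_page(plain_html_path):
    """Turn a ./plain/*.html file into a website article page (sketch)."""
    stem = Path(plain_html_path).stem
    soup = BeautifulSoup(Path(plain_html_path).read_text(encoding="utf-8"),
                         "html.parser")
    header = soup.find("header")

    pdf = Path("website/pdf") / f"{stem}.pdf"
    if header and pdf.exists():                  # PDF access link
        link = soup.new_tag("a", href=f"../pdf/{pdf.name}")
        link.string = "PDF"
        header.append(link)

    audio_file = Path("website/audio") / f"{stem}.mp3"
    if header and audio_file.exists():           # HTML5 audio player
        header.append(soup.new_tag("audio", controls="",
                                   src=f"../audio/{audio_file.name}"))

    out = Path("website/article") / f"{stem}.html"
    out.write_text(str(soup), encoding="utf-8")
    return out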
Then the website is updated via:
- the update of the sitemap.xml
- the update of the feed.xml
- the update of the index.html featuring the latest post as first entry.
Processing results are stored in:
- ./website/ - Entry point of the web representation
./website/ contains all website related content.
Resulting in the structure:
website
├── article
├── css
├── media
└── pdf
The privacy statement and similar administrative overhead will be deployed as articles and linked as special pages in the portal.
TODO: Think further about URL compatibility with the current WordPress site.
Option: I have exported data from the mysql table, which enables the creation of a redirect list.
DONE: Compatibility is a must and it is ensured. It is important not only for the redirect, but also to preserve the correct dates of the articles. The stem of the article page serves as URN to access the article data in the publishing list.
This git is a client git, connected to the server git. Deployment to the server is done via git merge .
Authoring git - Configuration
The configuration of the git gets stored and versioned in the git repository. The path to the configuration is ./config/ . The implemented hooks are part of the configuration and are stored in ./config/hooks/ , the preexisting examples are stored in ./config/hooks/samples/ .
./config/gitconfig
#!/bin/bash
# configure the wiki
# We develop hooks and want version control for that
git config --local core.hooksPath ./config/hooks
# We want easy reading of German äüö in the file names
git config --local core.quotepath off
# We provide some variable override options in a modified template
git config --local commit.template ./config/commit-message
# We process the files committed and need absolute file paths from $GIT_DIR
# written into the commit-message
git config --local status.relativePaths false
The configuration settings are applied with the above shown bash script.
Folder Structure
To give you a complete overview of the final git folder structure, here it is:
frank@Asimov:~/projects/idee$ tree -d
.
├── author
├── bash
├── config
│   └── hooks
│       └── samples
├── generator
│   └── __pycache__
├── nginx
├── plain
├── test
└── website
    ├── archive
    ├── article
    ├── audio
    ├── css
    ├── env
    │   └── bootstrap
    │       └── css
    ├── files
    ├── image
    ├── js
    ├── pdf
    ├── portal
    ├── qrcode
    └── sitemap

25 directories
The folder website/env/bootstrap/css contains the CSS to format the YaCy search result page. In the current implementation I refrained from somehow merging the portal header into that page. Probably I could convince YaCy to return the results as XML and render a portal page via XSLT. There is room for improvement.
The folder /generator contains the Python part of the project. If you consider software development and content development as two different projects, then the git contains the development project as a nested project.
To get a re-usable software product from this, it is required to separate the projects into separate git repositories. But for initial development the combined repository saved a lot of time.
Server git: /home/git/idee.git
This is the server git, and as such it has no work directory. When the pushed changes have been merged, a hook needs to take care of writing the content into the web server directory.
Not all content needs to be processed, only the content belonging to the website as documented above.
The implementation uses a simple fetch by a client git located in the web-server directory.
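A rough illustration of that mechanism, assuming a post-receive hook on the bare server repository and a clone of the website content under the web server root; the paths and the exact git commands are assumptions:

#!/usr/bin/python3
"""post-receive hook sketch: let the web-server clone pick up pushed content."""
import subprocess

WEBROOT_CLONE = "/var/www/idee"   # assumed location of the client git

# Fetch the pushed commits and fast-forward the checked-out branch.
subprocess.run(["git", "-C", WEBROOT_CLONE, "fetch", "origin"], check=True)
subprocess.run(["git", "-C", WEBROOT_CLONE, "merge", "--ff-only", "origin/master"],
               check=True)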
Export
There are tools you can install in your MediaWiki instance to support the export to PDF and to HTML. However, these require you to change the wiki installation, and the result might not be tailored to your needs.
To publish articles I decided to write my own export tool, extracting a single page containing the composed article, with all included templates expanded. MediaWiki ships with the special page Special:Export, which uses the same API function as my implementation. The API call was already implemented in the module mwclient.py, but to get it to work with long pages I had to change the respective GET request into a POST request. I informed the developers via their GitHub issue tracker with the message "expandtemplates should use "post" instead of "get"" (Issue #272, mwclient/mwclient) 11.
The current implementation of export.py is by no means beautified. I just began with Python and I'm pretty much unaware of established coding conventions. As I progress with Python, things will get nicer over time.
export.py
wikitools git repository
I let a lot of comments survive, which also document some of the wrong ideas I had. E.g. during the migration I used the pandoc feature to download the images from their web location on my WordPress installation. Pandoc creates its own filenames for the pictures via SHA1 hash during this process.
Afterwards I thought I should change the export function to enable pandoc to also download images from the Wiki. But the wiki requires authentication, and overall the identifier for the image might be Image:, File:, Bild: or Datei:, and in additional languages you might get additional alternatives. I'm not sure whether pandoc addresses all of those correctly, but at least that's SEP 12 as long as I do not meddle in that soup myself.
Now nothing is done in the export, the media download in pandoc is deactivated, and the image path is adjusted right after HTML creation, at a point where I already meddled with that path anyhow.
#!/usr/bin/python3
"""
Export MediaWiki Pages with expanded templates and page includes.
@author: Frank Siebert
@website: https://idee.frank-siebert.de
@license: https://creativecommons.org/publicdomain/zero/1.0/deed.en
@date: 2022-03-15
"""
import sys
import os
import getopt
import termios
import fcntl
import subprocess
import time
import re
from pathlib import Path
import configparser
from termcolor import colored
from mwclient import Site
from mwclient.errors import LoginError
HELPTEXT = 'Usage: export.py [-w \'wiki\'] \'Page_Name\'\n'\
    '\n'\
    '-w \'wiki\'   Name the wiki to be used, using the section\n'\
    '            name in the configuration file.\n'\
    '\n'\
    'Page_Name   The page to be exported. In case of spaces either\n'\
    '            surrounded by \' or with _ instead of spaces.\n'


def askforkeypress(prompt, keylist, onerror):
    """
    Ask and wait for user input.

    Parameters
    ----------
    prompt: Str
        The prompt shown to the user to ask for input.
    keylist: List
        A list of characters as possible input keys.
    onerror: Object
        An object to return to the caller in case of an error.
    """
    fileno = sys.stdin.fileno()
    oldterm = termios.tcgetattr(fileno)
    newattr = termios.tcgetattr(fileno)
    newattr[3] = newattr[3] & ~termios.ICANON & ~termios.ECHO
    termios.tcsetattr(fileno, termios.TCSANOW, newattr)

    oldflags = fcntl.fcntl(fileno, fcntl.F_GETFL)
    fcntl.fcntl(fileno, fcntl.F_SETFL, oldflags | os.O_NONBLOCK)

    # stay in the same line to wait for input
    print(prompt, end=' ', flush=True)

    char = None
    try:
        while char not in keylist:
            try:
                char = sys.stdin.read(1)
                time.sleep(.1)
            except IOError:
                char = onerror
            except ValueError:
                char = onerror
        print(char)
    finally:
        termios.tcsetattr(fileno, termios.TCSAFLUSH, oldterm)
        fcntl.fcntl(fileno, fcntl.F_SETFL, oldflags)
    return char
def export():
    """
    Export the page as mediawiki markup.

    Uses the API used by Special:Export including templates
    https://wiki.frank-siebert.de/script-inst/index.php?title=Special:Export

    For long pages the use of POST is important. I changed the library function
    in this regard. A respective fix awaits its merge into the mwclient module.

    Request Type: POST
    Request Parameters:
    catname=&pages=Replacing+Wordpress&curonly=1&templates=1&wpDownload=
    1&wpEditToken=%2B%5C&title=Special%3AExport
    """
    print(pagename)

    # Login
    host = config[CFGSECTION]['Host']
    scriptpath = config[CFGSECTION]['ScriptPath']
    user = config[CFGSECTION]['User']
    password = config[CFGSECTION]['Passwort']
    exportdir = config[CFGSECTION]['ExportDirectory']
    references = config[CFGSECTION]['References']

    site = Site(host, path=scriptpath)
    try:
        site.login(user, password)
        site.get_token("login")
    except LoginError:
        print("login failed")

    for result in site.search(pagename, what='title'):
        print(result)

        page = site.pages.get(pagename)
        if page:
            print(page)
            # expand = page.templates().count > 0
            # might fail if client.py is updated, because
            # def expandtemplates(self, text, title=None, generatexml=False)
            # should use post instead of get
            wikitext = page.text(section=None,
                                 expandtemplates=True,
                                 cache=True,
                                 slot='main')
            # load page until it is no longer changing
            # stop waiting after 5 seconds
            webpage = wikitext
            time.sleep(.5)
            i = 10
            while len(webpage) != len(wikitext) and i > 0:
                webpage = wikitext
                time.sleep(.5)
                i -= 1
            # make clear, which one is not used from now on
            webpage = None

            # Later we might need a <reference/> tag to identify the
            # location where footnotes shall be placed.
            # Lets check non-existence of the tag and existence of the
            # reference section.
            # Insert the tag now, if it is not present at the expected place.
            # In vim the regex :%s:^=.*Fußnoten.*=\n<references.+/> finds
            # the german footnotes with the reference tag in the next line.

            # strip down to headline text only
            ref_cfg = references.strip().strip('=').strip()

            # I tried repr(pattern) to get the raw string, but got it with
            # ' at the start and end
            # repr(pattern)[1:-1] would strip them off, but this is easier
            # to read and to write
            # requires re.M
            r_ref_cfg_pattern = r"" + "\n=.*{}.*=.*".format(ref_cfg)
            # a negative lookahead for the reference tag
            r_nreference = r"" + "(?!\n.*<references.*/>.*$)"
            # ok if footnote header found without reference tag
            r_okpattern = r"" + r_ref_cfg_pattern + r_nreference

            okcheck = re.compile(r_okpattern)
            exists = okcheck.search(wikitext)
            if exists:
                print(colored('\nWARNING:', 'yellow'),
                      "<references/> tag is not in the expected location")
                print("Automatic insertion of the tag takes place.")
                wikitext = okcheck.sub(exists.group(0) +
                                       "\n<references/>", wikitext)

            # replace category references
            categorypattern1 = re.compile(r"(.*)",
                                          flags=re.MULTILINE)
            wikitext = categorypattern1.sub(r"\1", wikitext)

            # replace category references, take care not to touch Images
            categorypattern2 = re.compile(r"\[\[[K|C]ategor.*\|(.*)\]\]",
                                          flags=re.MULTILINE)
            wikitext = categorypattern2.sub(r"\1", wikitext)
            # A mediawiki extension enables me to embed images from my
            # idee domain into articles I write in the wiki just by pasting
            # the URL into an otherwise empty paragraph.
            # https://something....filename.ext
            # This, who wonders, is not recognized by pandoc.
            # I need to beautify these image links.
            # only if the line starts with http
            # transitional code for the migration
            imagepattern = re.compile(r"" + "^http(.*)png", re.MULTILINE)
            wikitext = imagepattern.sub(r"[[Image:http\1png|No Caption]]",
                                        wikitext)
            imagepattern = re.compile(r"" + "^http(.*)jpg", re.MULTILINE)
            wikitext = imagepattern.sub(r"[[Image:http\1jpg|No Caption]]",
                                        wikitext)

            # Above imagepattern replacements take care for external
            # links, as I used them in the past to upload images to WordPress
            # instead to the MediaWiki, when I intended to use them in
            # articles.
            #
            # The new scenario is the upload to the MediaWiki and to create
            # image entries with Capture Text and probably Size Information.
            # For those the image file name needs to be expanded with the
            # URL to retrieve the images from the Wiki.
            #
            # The exported image information looks as follows:
            # [[Image:Imagename.png|NNxNNpx|Capture Text]]
            # Or:
            # [[Image:Imagename.png|Capture Text]]
            #
            # The text "Imagename.png" needs to be expanded into:
            # $host/$script-path/index.php?title=File:Imagename.png
            #
            # Instead of Image the returned WikiText may contain Bild or File
            # or Datei as Keyword.

            # This became a NOOP, because changing the image path here
            # to help pandoc to download them does not make any sense.
            # Probably an image export function would make sense here.
            # mstr1 = r"\[\[(Image|Bild|Datei|File):"
            # mstr2 = r"([^h][^t][^t][^p].*p[n|j]g)|.*\]\]"
            # mstr = mstr1 + mstr2
            # imagepattern = re.compile(mstr, flags=re.MULTILINE)
            # replstr = r"" + site + r"index.php?title=" + r"\2"
            # wikitext = imagepattern.sub(replstr, wikitext)

            # Write the result to disk.
            # TODO: Enable also environment variables to determine HOME.
            # Using ~ is nice, but quite OS dependent
            if exportdir[0] == '~':
                wikifile = Path.home() / exportdir.strip('~').strip('/') \
                    / "{0}.{1}".format(pagename, "mediawiki")
            else:
                wikifile = Path(exportdir) \
                    / "{0}.{1}".format(pagename, "mediawiki")
            wikifile = wikifile.resolve()

            with open(wikifile, 'w') as outfile:
                print(wikitext, file=outfile)
                outfile.flush()
                outfile.close()
            print('\nThe mediawiki file was exported to:\n' + outfile.name)

            prompt = '\nDo you want to review the file? yes/no (y/n):'
            pressed = askforkeypress(prompt=prompt,
                                     keylist=['y', 'Y', 'n', 'N'],
                                     onerror='n')
            if pressed in ['y', 'Y']:
                subprocess.run(["vim", wikifile])

            # TODO: Consider to execute the commit into the wiki git
            # htmltext = subprocess.run(["pandoc", "-f", "mediawiki", \
            #     "-t", "html"], input=bytearray(wikitext.encode()), \
            #     capture_output=True)
            # print(htmltext.stdout.decode("utf-8"))
        break
    return
if __name__ == "__main__":
    # Check command line arguments, provide help and call the functions
    CREATEFLAG = False

    try:
        opts, args = getopt.getopt(sys.argv[1:], "hw:", ["help", "wiki"])
    except getopt.GetoptError:
        print(HELPTEXT)
        sys.exit(2)

    # Defaults
    CFGSECTION = 'Default'

    for opt, arg in opts:
        if opt in {"-h", "--help"}:
            print(HELPTEXT)
            sys.exit()
        if opt in {"-w", "--wiki"}:
            CFGSECTION = arg

    arg_names = ['pagename']
    args = dict(zip(arg_names, args))

    # print(args)
    # Kept as inspiration for future
    # ------------------------------
    # Arg_list = collections.namedtuple('Arg_list', arg_names)
    # args = Arg_list(*(args.get(arg, None) for arg in arg_names))
    pagename = args.get('pagename')
    if not pagename:
        print(colored('\nERROR:', 'red'), 'Page_Name parameter is missing.')
        print(HELPTEXT)
        sys.exit(2)

    config = configparser.ConfigParser()
    configpath = Path.home() / '.config' / 'wikitools' / 'wikitools.cfg'
    config.read(configpath)

    sections = config.sections()
    if CFGSECTION not in sections:
        print(colored('\nERROR:', 'red'), 'Configuration is missing.')
        print(HELPTEXT)
        sys.exit(3)

    export()
    sys.exit(0)
~/.config/wikitools/wikitools.cfg
The code reads a configuration, which enables me to post the code without the risk of exposing my user and password for the wiki instances I use.
One default wiki can be configured, and as many additional wikis as you like. With the command line parameter -w you can address the configuration section you want to use in the current export.
The configuration is shared between multiple wikitools.
#
# The configurations location has to be
#
# UserHome/.config/wikitools/
#
# If the command line names no wiki section
# the Default section is used.
#
# The command line option -w
# with a parameter can be used to name a wiki,
# for which a configuration section exists.
#
# access to a key from python
# config[cfgsection]['key']
[Default]
DefaultCategory = Your Category
Host = Your wiki host name
ScriptPath = Your wiki script path
User = Your wiki user name
Passwort = Your wiki password
References = == Fußnoten ==
ExportDirectory = ~/projects/idee/author
[yourwiki]
DefaultCategory = Your Category
Host = Your wiki host name
ScriptPath = Your wiki script path
User = Your wiki user name
Passwort = Your wiki password
References = == Footnotes ==
ExportDirectory = ~/projects/idee/author
WordPress Migration
Meta data and, if I decide to use this, also the content of WordPress articles and pages are available in the MariaDB database wp_idee.
In the context of the project it made sense to grant remote access to the MariaDB from the local area network. This is well described in the official documentation 13 .
root@sol:~# cd /etc/mysql/mariadb.conf.d/
root@sol:/etc/mysql/mariadb.conf.d# ls
50-client.cnf 50-mysql-clients.cnf 50-mysqld_safe.cnf 50-server.cnf
root@sol:/etc/mysql/mariadb.conf.d# vim 50-server.cnf
In the file 50-server.cnf, comment out the bind-address 127.0.0.1, making the server bind to the network card's addresses. In my case there is just one of these.
[...]
# Instead of skip-networking the default is now to listen only on
# localhost which is more compatible and is not less secure.
# bind-address = 127.0.0.1
[...]
Restart the SQL server.
root@sol:/etc/mysql/mariadb.conf.d# systemctl restart mysql
Log in to the SQL server as root to manage users.
root@sol:/etc/mysql/mariadb.conf.d# mysql -p
MariaDB [(none)]> SELECT User, Host FROM mysql.user;
+-----------+-----------+
| User | Host |
+-----------+-----------+
| ninja | localhost |
| root | localhost |
| wiki | localhost |
| wordpress | localhost |
+-----------+-----------+
4 rows in set (0.000 sec)
MariaDB [(none)]> CREATE USER wpremote@'10.19.67.%' IDENTIFIED BY 'password-of-new-user';
Query OK, 0 rows affected (0.001 sec)
MariaDB [(none)]> SELECT User, Host FROM mysql.user;
+-----------+------------+
| User | Host |
+-----------+------------+
| wpremote | 10.19.67.% |
| ninja | localhost |
| root | localhost |
| wiki | localhost |
| wordpress | localhost |
+-----------+------------+
5 rows in set (0.000 sec)
MariaDB [(none)]> GRANT ALL PRIVILEGES ON wp_idee.* TO 'wpremote'@'10.19.67.%' WITH GRANT OPTION;
Query OK, 0 rows affected (0.017 sec)
Remote Access with the new user.
frank@Asimov:~$ mysql -u wpremote -h sol -p wp_idee
Enter password:
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A
Welcome to the MariaDB monitor. Commands end with ; or \g.
Your MariaDB connection id is 205
Server version: 10.3.31-MariaDB-0+deb10u1 Debian 10
Copyright (c) 2000, 2018, Oracle, MariaDB Corporation Ab and others.
Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.
MariaDB [wp_idee]>
Table wp_posts
MariaDB [wp_idee]> describe wp_posts;
+-----------------------+---------------------+------+-----+---------------------+
| Field | Type | Null | Key | Default |...
+-----------------------+---------------------+------+-----+---------------------+
| ID | bigint(20) unsigned | NO | PRI | NULL |
| post_author | bigint(20) unsigned | NO | MUL | 0 |
| post_date | datetime | NO | | 0000-00-00 00:00:00 |
| post_date_gmt | datetime | NO | | 0000-00-00 00:00:00 |
| post_content | longtext | NO | | NULL |
| post_title | text | NO | | NULL |
| post_excerpt | text | NO | | NULL |
| post_status | varchar(20) | NO | | publish |
| comment_status | varchar(20) | NO | | open |
| ping_status | varchar(20) | NO | | open |
| post_password | varchar(255) | NO | | |
| post_name | varchar(200) | NO | MUL | |
| to_ping | text | NO | | NULL |
| pinged | text | NO | | NULL |
| post_modified | datetime | NO | | 0000-00-00 00:00:00 |
| post_modified_gmt | datetime | NO | | 0000-00-00 00:00:00 |
| post_content_filtered | longtext | NO | | NULL |
| post_parent | bigint(20) unsigned | NO | MUL | 0 |
| guid | varchar(255) | NO | | |
| menu_order | int(11) | NO | | 0 |
| post_type | varchar(20) | NO | MUL | post |
| post_mime_type | varchar(100) | NO | | |
| comment_count | bigint(20) | NO | | 0 |
+-----------------------+---------------------+------+-----+---------------------+
23 rows in set (0.003 sec)
Post Types
MariaDB [wp_idee]> select distinct post_type from wp_posts;
+---------------+
| post_type |
+---------------+
| attachment |
| nav_menu_item |
| page |
| post |
| revision |
+---------------+
5 rows in set (0.003 sec)
The post types of interest are most probably only page and post. I'm not sure about revision, but we can take a look at what's tagged as post type "revision".
The query "select post_title from wp_posts where post_type='revision';" returns the same titles repeatedly; most probably the post_modified information differs between those rows.
The query "select post_title from wp_posts where post_type='page';" shows only 2 pages, which are already processed into one for the new solution.
The query "select post_title from wp_posts where post_type='post';" shows every title only once, i would hope with the initial post information only. I'll not only hope, but check this before I continue.
The query "select post_name from wp_posts where post_type='post';" shows the pages unique identifiers, the last element of the pages URL. This information is very helpful to ensure that the migrated pages are named exactly the same in the new solution and are found via one redirect instruction in nginx. The redirect is important, since I will not have the date encoded in the URL, as I had it in wordpress.
Post Mime Types
Post mime types help to select the PDFs, audio files, zip files and spreadsheets which had been embedded into posts in WordPress. The HTML posts themselves have the mime type "" in this DB, which saves space, but is irritating.
Missing Information
One piece of information is missing: the language or locale of the posted content. In the new solution I'll use de-DE and en-US as locales and as language information. I did not really create a lot of English pages, but this might change, and the few I already have shall be presented correctly.
WordPress Meta Data Export
Create an export query for the HTML pages posted in WordPress. Include
- a site column with default value "Idee",
- a locale column with default value "de-DE",
- an author column with default value "Frank Siebert"
to be changed manually for the few existing English posts into
- site = "Concept" (Concept of new cognition elicitation personally thinking) replacing Idee (Idee der eigenen Erkenntnis). That's the best recursive acronym translation I found.
- locale = "en-US"
Get the earliest post date into one column and the latest post date into another column of the same row. Have one row per post.
The export query is created in a bash script, whose output can be piped into a local tab delimited file. The last line of this file needs to be deleted, since it contains an automatically saved draft, and .. (see next chapter).
wpmeta bash script:
#!/bin/bash
sql="select \
min(post_date) over (partition by post_name) as post_date, \
max(post_modified) over (partition by post_name), \
max(comment_count) over (partition by post_name), \
'Idee' as site, \
'de-DE' as locale, \
'Frank Siebert' as author, \
post_name, \
post_title \
from wp_posts \
where (post_mime_type='' and post_type='post') \
order by post_date asc;"
mysql wp_idee -u wpremote -h sol -p'not-exposed-pwd' \
--default-character-set=utf8 -N -e "$sql" > ../config/migrationlist.csv
Using bash for the query and persisting the result by directing the output into a file, the data in the file becomes tab-delimited.
Special-Case for nginx redirect
Querying the database revealed: the post "Social Distancing und Lockdown" has, for unknown reasons, the URL "/2021/01/26/255/". This will not be the case in the new solution; I'll not have a 255.html there, so I'll either create a special case in the redirect for this article or rather ignore this case altogether.
In the result of the final export query a corrected post_name needs to be maintained to migrate this article correctly.
Option in Consideration
For all articles written in the Wiki and published afterwards, export.py from the wikitools will do perfect service, I hope.
But things change over time, and quite a number of articles were written in WordPress when I had no MediaWiki in place. Also some last-minute typo corrections, I know this, were maintained directly in WordPress after publishing.
And the article "Das SARI-Rätsel" contains an interactive JavaScript part, which required some authoring in HTML. This last point might not stay a single happenstance; as a developer I might more often find a reason to extend a page functionally, or even to write the complete page directly in HTML as the source format.
My concept does support such deviations from the standard scenario. But this is not the point here. The point is: I need a wp-export tool to generate MediaWiki files from the current WordPress pages for the migration. That's the only way to ensure that the migrated pages contain exactly what they are supposed to contain. With this I can also incorporate comments I posted after publishing as update notes.
I need to leave a note about the last paragraph. For the reasons already mentioned above I made a highly manual migration. Overall content quality improved considerably during migration, even if I missed one or two typos.
A migration file created via SQL from the wp_posts table is stored as ./config/migration.csv (with spaces instead of colons), containing date, time and title columns.
wp-export.py
This was never implemented. I place this tool in the directory ./tools/ of the authoring git and create an alias wpe for the single-file export, and probably I'll also have a wpbe command as WordPress batch export.
Manage Meta Data
I now have migration data available in a tab-delimited CSV, and I need to manage publishing meta data to make sure the correct meta data is shown in every web page, in the sitemaps and in the RSS feed.
I started to implement this as a specialized dictionary implementation, but I hesitate to proceed in this direction. The Python modules pandas and agate seem to offer fascinating power for working with CSV files, and I have to investigate both in much more detail in the future.
My choice fell on the module agate (documentation: "agate 1.6.3" 14) for this implementation, since it is the smaller implementation and because I do not need any extensive statistical horsepower for the stored meta data. I use the module in version 1.6.1, as currently provided by Debian Stable.
I have to revise my decision. Agate makes reading the CSV easy and provides a powerful table object, making the data accessible very nicely. The documentation states that it always returns copies of the data structure, which I like very much, but it fails to show options to update data in the table object, which I need to do without getting new instances of the meta data table next to the singleton created for that purpose.
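For comparison, a small sketch of the same kind of access with pandas; the column names follow the wpmeta query above, and the update shown is only illustrative, not the real meta data manager:

import pandas as pd

COLUMNS = ["post_date", "post_modified", "comment_count",
           "site", "locale", "author", "post_name", "post_title"]

# The wpmeta script produces a tab-delimited file without a header line.
meta = pd.read_csv("config/migrationlist.csv", sep="\t", names=COLUMNS)

# Update a single article in place, keyed by its post_name (the URL stem).
mask = meta["post_name"] == "replacing-wordpress"
meta.loc[mask, "post_modified"] = "2022-03-15 12:00:00"

meta.to_csv("config/migrationlist.csv", sep="\t", header=False, index=False)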
Installing python3-pandas
frank@Asimov:~$ sudo apt-get install python-pandas-doc python3-pandas
[sudo] password for frank:
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
The following additional packages will be installed:
libblosc1 libclang-cpp9 libffi-dev liblbfgsb0 libllvm9 libncurses-dev libpfm4
libtinfo-dev libz3-dev llvm-9 llvm-9-dev llvm-9-runtime llvm-9-tools
numba-doc python-odf-doc python-odf-tools python-tables-data
python3-bottleneck python3-et-xmlfile python3-iniconfig python3-jdcal
python3-llvmlite python3-numba python3-numexpr python3-odf python3-openpyxl
python3-pandas-lib python3-py python3-pytest python3-scipy python3-tables
python3-tables-lib python3-xlwt
Suggested packages:
ncurses-doc llvm-9-doc python-bottleneck-doc llvmlite-doc nvidia-cuda-toolkit
python3-statsmodels python-scipy-doc python3-netcdf4 python-tables-doc
vitables python3-xlrd python-xlrt-doc
The following NEW packages will be installed:
libblosc1 libclang-cpp9 libffi-dev liblbfgsb0 libllvm9 libncurses-dev libpfm4
libtinfo-dev libz3-dev llvm-9 llvm-9-dev llvm-9-runtime llvm-9-tools numba-doc
python-odf-doc python-odf-tools python-pandas-doc python-tables-data
python3-bottleneck python3-et-xmlfile python3-iniconfig python3-jdcal
python3-llvmlite python3-numba python3-numexpr python3-odf python3-openpyxl
python3-pandas python3-pandas-lib python3-py python3-pytest python3-scipy
python3-tables python3-tables-lib python3-xlwt
0 upgraded, 35 newly installed, 0 to remove and 1 not upgraded.
Need to get 84.2 MB of archives.
After this operation, 496 MB of additional disk space will be used.
Do you want to continue? [Y/n] y
A lot of suggestions next to a lot of required packages. If I did not intend to go deeper into data analysis with Python, I would stay with my initial dict-based implementation instead.
python3-pandas web references
- "pandas documentation" 15
- "Add new rows and columns to Pandas dataframe" 16 With the most helpful explanation to insert via loc (or update via iloc) rows with:
len(df.index)]=list(data[0].values()) df.loc[
- "Pandas Tutorial" 17
Bioconductor
The search "derive from dataframe" had one result on startpage.com. This search result does not help me to find out, what to take special care for when I derive my own class from the dataframe class, but it relates strongly to a lot of articles I posted on my site.
- "Getting Started with Bioconductor 3.7" 18
A quick check reveals - it is available in Debian just at my fingertips.
frank@Asimov:~$ sudo apt-cache search bioconductor
bio-tradis - analyse the output from TraDIS analyses of genomic sequences
libtfbs-perl - scanning DNA sequence with a position weight matrix
q2-dada2 - QIIME 2 plugin to work with adapters in sequence data
r-bioc-affy - BioConductor methods for Affymetrix Oligonucleotide Arrays
r-bioc-affyio - BioConductor tools for parsing Affymetrix data files
...
r-bioc-variantannotation - BioConductor annotation of genetic variants
r-bioc-xvector - BioConductor representation and manipulation of external
sequences
r-bioc-zlibbioc - (Virtual) zlibbioc Bioconductor package
r-cran-ape - GNU R package for Analyses of Phylogenetics and Evolution
r-cran-biocmanager - access the Bioconductor project package repository
I suppose I'll never have enough leisure time to dig into everything I'm interested in. And it is written in R, a programming language I would like to learn as well.
I have to remind myself from time to time that I'm doing this to learn, not to prove I can do this implementation in a very short time. More reading, less coding; take your time and it will take less time.
MediaWiki to Plain HTML Conversion
You are right if you assume that I started writing Python code much earlier. But we are now finally at the point where the first parts of the final Python implementation can be used in the final setup.
I spare you every evolutionary step of the Python code development. All following code is in its most recent state.
Commit-msg Message Hook
Originally I wrote a commit-msg hook directly as an executable Python program. But the message hook shown next is a bash script.
~/projects/idee/config/hooks/commit-msg
#!/bin/bash
/usr/bin/python3 generator/commitmsg.py $1
The parameter $1 is the path to the commit message file, for which the solution uses a modified template.
Commit Message Template
The commit message template provides the possibility to define some meta data to be taken into the content and to steer the different content creation options.
~/projects/idee/config/commit-message
# Overwrite values if neccesary, based on https://ogp.me/
# pdf:draft=false
# og:locale=de-DE
# og:site_name=Idee
# article:author=Frank Siebert
#
Note: There is a significant empty line as the first line.
Probably I will eliminate either the og:site_name or the og:locale line in future. I'm not yet sure.
Main Program: commit-msg.py
You might have seen it above: this is the Python program called by the message hook bash script. The commit message file is passed on as a parameter.
commit-msg.py registers workers at a dispatcher class; each worker takes care of specific commit message entries, identified by a regex match.
~/projects/idee/generator/commit-msg.py
"""Website Generator - "pandoc, fs-commit-msg-hook 1.0".
@author: Frank Siebert
@license: https://creativecommons.org/publicdomain/zero/1.0/deed.en
@date: 2022-03-15
https://wiki.frank-siebert.de/inst/Replacing_Wordpress
https://idee.frank-siebert.de/article/replacing-wordpress.html
Website Generator uses Beautiful Soup, Pandoc and GIT to manage
authoring in *.mediawiki files and to convert those into:
* plain html pages, one per mediawiki file as article
* PDF files, one per mediawiki file for article download
* Web site portal pages by injecting the portal into the plain html pages
Website Generator generates as additional portal assets:
* sitemap.xml
* feed.xml
* ...
Website Generator works with Python 3 and up. It works better if lxml
and/or html5lib is installed, as Beautiful Soup states it runs better then.
"""
# Systen Imports
import sys
import getopt
from termcolor import colored
from gitmsgdispatcher import GitMsgDispatcher
from mwworker import MwWorker
from pdfworker import PdfWorker
from plainworker import PlainWorker
# Ask for a key press and return the pressed key, if it is part of the keylist.
# In case of errors return the value of onerror, to enable the caller to
# decide on the most convinient way to preceed.
if __name__ == "__main__":
    HELPTEXT = 'Usage: commit-msg \'message-file\'\n'\
               '\n'\
               'message-file The commit message file with the list of files\n'\
               '             to be processed.\n'
    try:
        opts, args = getopt.getopt(sys.argv[1:], "h:", ["help"])
    except getopt.GetoptError:
        print(HELPTEXT)
        sys.exit(2)
    for opt, arg in opts:
        if opt in {"-h", "--help"}:
            print(HELPTEXT)
            sys.exit()

    arg_names = ['message-file']
    args = dict(zip(arg_names, args))
    # print(args)

    # Kept as inspiration for future
    # ------------------------------
    # Arg_list = collections.namedtuple('Arg_list', arg_names)
    # args = Arg_list(*(args.get(arg, None) for arg in arg_names))

    messagefile = args.get('message-file')
    if not messagefile:
        print(colored('\nERROR:', 'red'), 'message-file parameter is missing.')
        print(HELPTEXT)
        sys.exit(2)

    mwworker = MwWorker(r".*(new file|modified).*author[/].*\.mediawiki")
    plainworker = PlainWorker(r".*(new file|modified).*plain[/].*\.html")
    pdfworker = PdfWorker(r"" + PdfWorker.pdfworkitem)

    disp = GitMsgDispatcher(messagefile, [mwworker, plainworker, pdfworker])

    sys.exit(0)
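To try the program outside of an actual commit, it can be called by hand with any saved message file; .git/COMMIT_EDITMSG from the last commit is a convenient input. This is only a sketch of a manual test run, using the same call as the hook script:
cd ~/projects/idee
/usr/bin/python3 generator/commitmsg.py .git/COMMIT_EDITMSG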
Message Dispatcher: gitmsgdispatcher.py
The message dispatcher dispatches work-items from the git message to registered workers. Those workers can then place new work-items for later workers, which pick up the work where the earlier ones stopped.
~/projects/idee/generator/gitmsgdispatcher.py
"""
GitMessageDispatcher with MsgWorker base class.
@author: Frank Siebert
@license: https://creativecommons.org/publicdomain/zero/1.0/deed.en
@date: 2022-03-15
Instantiate the GitMessageDispatcher with the message
and with specialized workers. The workers provide a
pattern matching the lines they claim for work.
"""
import re
from pathlib import Path
from pubmetadata import PubMetaData
from sitemap import SiteMap
from archive import Archive
from rssbuilder import RSSBuilder
from idxbuilder import IDXBuilder
class GitMsgDispatcher:
"""
Dispatch the lines of the git message to registered workers.
Parameters
----------
gitmessagepath : Path
Path as type str or type Path pointing to the git message.
msgworkers : List of MsgWorker
The list of message workers is used as worker queue. Workers first
in the queue get their workitems first.
Workers can return their work result to be picked up by
later workers.
The ParameterValueWorker always runs first. Do not add
the ParameterValueWorker to the provided list of workers,
or it will run twice.
The ParameterValueWorker takes care to provide the parameter
values provided by the message in place for all workers.
When all workers finished, the sitemap is updated, the RSS
feed is updated and the index page is updated.
Returns
-------
GitMsgDispatcher.
"""
def __init__(self, gitmessagepath, msgworkers):
"""
Dispatch the lines of the git message to registered workers.
Parameters
----------
gitmessagepath : Path
DESCRIPTION.
*msgworkers : List of MsgWorker
DESCRIPTION.
Returns
-------
GitMsgDispatcher.
"""
self.gitmessagepath = gitmessagepath
"""
Extract the relevant part of the message
"""
self.worklist = []
self.parameters = ParameterValueWorker()
with open(self.gitmessagepath, 'r') as infile:
# TODO: Do Better. Latest when the git server joins the game
# Message section of most interest: "Changes to be committed"
# But we are also interested in the parameter values
# we placed earlier into the file.
# The Start helps us to find the End.
= re.compile(r"^# Changes to be committed:")
start # Next uppercase entry starts another message section
= re.compile(r"^# [A-Z]")
end
= None
started for line in infile:
if self.parameters.pattern.match(line):
self.worklist.append(line)
if not started and start.match(line):
= True
started elif started and end.match(line):
= False
started if started is True:
self.worklist.append(line)
if started is False:
break
infile.close()
self.msgworkers = msgworkers
self.msgworkers.insert(0, self.parameters)
self.dispatch()
# Save changed publishing meta data, if any.
if PubMetaData.instance:
PubMetaData.instance.save()# Generate Sitemaps (bilingual)
SiteMap().update()# Generate Archive
Archive().update()# Generate RSS feed (bilingual)
RSSBuilder().update()# Generate Index Pages (English and German Version)
IDXBuilder().update()
def dispatch(self):
"""
Dispatch the git message lines to registered message workers.
Returns
-------
None.
"""
        for worker in self.msgworkers:
            print("Dispatching work to: {}".format(type(worker)))
            for item in self.worklist:
                if type(item) == str:
                    if worker.pattern.match(item):
                        worker.work(self, item)
                if type(item) == dict:
                    worker_match = item[MsgWorker.task_worker_match]
                    if worker.pattern.match(worker_match):
                        worker.work(self, item)
class MsgWorker:
"""
Create a worker for lines matching the pattern.
Parameters
----------
pattern : Pattern
A regex pattern matching the lines the worker takes care for.
Returns
-------
MsgWorker.
"""
# Cases to work against
= re.compile(r"^#.*(modified:|new file:)")
CREATEPAT = re.compile(r"^#.*renamed:")
RENAMEPAT = re.compile(r"^#.*deleted:")
DELETEPAT
# Task types
= "workermatch"
task_worker_match = "tasktype"
task_type = "create"
task_create = "rename"
task_rename = "delete"
task_delete
def __init__(self, pattern):
if isinstance(pattern, str):
self.pattern = re.compile(pattern)
elif isinstance(pattern, re.Pattern):
self.pattern = pattern
self.dispatcher = None # initialized in work method
self.item = None # initialized in work method
self.inpath = None # initialized in work method, if any
self.delpath = None # initialized in work methid, if any
self.outpath = None # initialized in work method, if any
def get_pattern(self):
"""
Get the pattern for matching list items.
Returns
-------
Pattern
A regex pattern matching the lines the worker takes care for.
"""
return self.pattern
def work(self, dispatcher, item):
"""
Overwrite this method to implement if required.
Call the super().work() method in your new method, to get
- self.dispatcher initialized
- the inpath initialized (if any), to the file to be processed
- the delpath initialized (if any)
- the delete() or the process() method or both called, whatever applies
Parameters
----------
dispatcher : GitMsgDispatcher
The dispatcher, which assigned the work item
item : str or dict
One matching line from the git message.
Or a complex workitem added by earlier workers.
Returns
-------
None.
"""
self.dispatcher = dispatcher
self.item = item
# For some workers a dictionary is passed as item
if isinstance(item, str):
            filename = item[14:].strip()
            self.inpath = Path(filename)
            if self.RENAMEPAT.match(item):
                # part filename in new and old
                # filename = line[14:].strip()
                # self.inpath = Path(filename)
                self.delpath = None  # Needs to be assigned now
                self.rename()
            if self.CREATEPAT.match(item):
                self.process()
            if self.DELETEPAT.match(item):
                self.delpath = self.inpath  # clear deletion request
                self.delete()
        if isinstance(item, dict):
            self.inpath = None
            self.delpath = None
            if item[self.task_type] == self.task_rename:
                self.rename()
            if item[self.task_type] == self.task_create:
                self.process()
            if item[self.task_type] == self.task_delete:
                self.delete()
def delete(self):
"""
Overwrite this method to implement the actual work to be done.
The method is called by super.work(), if the message line is
about a rename or a deletion.
Since renames might go along with additional content change, deletion
and re-processing take place both in that case.
The path to the resource named in the message is available via
self.delpath
Depending on the type of content, more than just deleting the
file might be required.
Parameters
----------
None
Returns
-------
None.
"""
def process(self):
"""
Overwrite this method to implement the actual work to be done.
The method is called by super.work(), if the message line is
about a rename or a new file.
Since renames might go along with additional content change, deletion
and re-processing take place both in that case.
The path to the resource named in the message is available via
self.delpath
Parameters
----------
None
Returns
-------
None.
"""
def rename(self):
"""
Rename by delete and process.
Since we do not know, whether next to the rename additional
changes were applied, deletion and recreation is safest.
"""
self.delete()
self.process()
class ParameterValueWorker(MsgWorker):
"""
The ParameterValueWorker reads parameter value pairs.
Example of a line with parameter value pair:
# article:author=Firstname Lastname
These parameters in the git message allow the injection
of values for metadata, which would be otherwise not available.
Other workers can access the values dictionary via:
dispatcher.parameters.values
Parameters
----------
super: MsgWorker
The ParameterValueWorker is derived from the MsgWorker.
Returns
-------
ParameterValueWorker.
"""
def __init__(self, pattern=r"^#.*="):
super().__init__(pattern)
self.values = {}
def work(self, dispatcher, item):
"""."""
super().work(dispatcher, item)
if item.count('=') == 1:
            lineparts = item.rpartition('=')
            self.values.update(
                {lineparts[0].strip('#').strip():
                 lineparts[2].strip()}
            )
if __name__ == "__main__":
pass
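To see how the dispatcher and the worker base class fit together, here is a minimal sketch of a hypothetical additional worker. The class name, the pattern and the message file are invented for illustration; note that the dispatcher also runs the site-wide update steps at the end, so this is meant to show the wiring, not to be a standalone tool.
from gitmsgdispatcher import GitMsgDispatcher
from gitmsgdispatcher import MsgWorker


class AudioWorker(MsgWorker):
    """Hypothetical worker reacting to committed audio files."""

    def process(self):
        # self.inpath was set by MsgWorker.work() from the message line
        print("would process audio file:", self.inpath)


if __name__ == "__main__":
    worker = AudioWorker(r".*(new file|modified).*audio[/].*\.mp3")
    # the dispatcher reads the message file and hands matching lines
    # to the registered workers, ParameterValueWorker always runs first
    GitMsgDispatcher(".git/COMMIT_EDITMSG", [worker])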
MediaWiki Worker: mwworker.py
This worker converts the MediaWiki file into the plain HTML version, which is used for copy-edit reading, a task I combine with the audio recording.
I found out that I spot typos best when I read the text aloud. Creating the audio is therefore a good opportunity for me to improve the quality of the written text.
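The heavy lifting of the conversion is delegated to Pandoc. Stripped of all post-processing, the call the worker makes corresponds roughly to the following command line; the file names are invented for illustration:
pandoc -s --toc --toc-depth=5 -f mediawiki -t html author/Example.mediawiki > plain/example.html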
~/projects/idee/generator/mwworker.py
"""
MwWorker is derived from the MsgWorker base class.
@author: Frank Siebert
@license: https://creativecommons.org/publicdomain/zero/1.0/deed.en
@date: 2022-03-15
The MwWorker takes care of *.mediawiki files
in the author directory, if changes are committed
for them.
"""
import re
import subprocess
import sys
from bs4 import BeautifulSoup
from bs4 import Comment
from bs4.builder._htmlparser import HTMLParserTreeBuilder
from gitmsgdispatcher import MsgWorker
from gitmsgconstants import GitMsgConstants as gmc
from pdfworker import PdfWorker
from pubmetadata import PubMetaData
from pubmetadata import pageurn
class MwWorker(MsgWorker):
"""
The MwWorker takes care of *.mediawiki files in the author/ directory.
Example of a line taken care for
# modified: author/PDF-Icon.mediawiki
The line has to be from the section git message section:
# Changes to be committed:
The main output is an HTML created from the mediawiki file,
which is plain (without portal part) and stored in the
folder GITROOT/plain/
A minor output, a PDF, might be requested via the message line:
# pdf:draft=true
The respective PDF is created from HTML and stored in the folder
GITROOT/website/pdf/
Parameters
----------
super: MsgWorker
The MwWorker is derived from the MsgWorker.
Returns
-------
MwWorker.
"""
def __init__(self, pattern):
super().__init__(pattern)
self.values = {}
@staticmethod
def __make_url_migration__(soup):
r"""
Migrate the wordpress url pattern to the new one.
Articles: r"idee.frank.siebert.de.\d{4}.\d{2}.\d{2}"
Parameters
----------
soup : BeautifulSoup
HTML represented by BeautifulSoup top level object
Returns
-------
soup. r"https://idee.frank-siebert/"
"""
= r"(https://idee\.frank-siebert\.de)"
site_r = r"([/]\d{4}[/]\d{2}[/]\d{2}[/])" # '/yyyy/MM/dd'
date_r = r"[/][a][r][t][i][c][l][e][/]"
article_r
# Links to own articles will be addressed by relative path,
# In article migration we point to pages in the same location.
= re.compile(site_r + date_r)
repattern = soup.find_all("a", attrs={"href": repattern})
tags
for tag in tags:
# in case page internal id was addressed
= tag["href"].split("#")
url 0] = "./" + repattern.sub("", url[0].rstrip("/"))\
url[+ ".html"
= '#'.join(url)
new_url = new_url.lower() # change camel case to lower case
new_url "href": new_url})
tag.attrs.update({
# References to own articles in the new portal
# shall be relative as well.
= re.compile(site_r+article_r)
reart = soup.find_all(re.compile(r"^a$"), attrs={"href": reart})
tags for tag in tags:
= "./" + reart.sub("", tag["href"])
new_url = new_url.lower()
new_url "href": new_url})
tag.attrs.update({return soup
def process(self):
"""
Process the mediawiki files into plain html files.
Returns
-------
None.
"""
# The file name is the title
        title = self.inpath.stem

        # inject meta information from commit message
        # Creates the single instance of Publishing Dictionary
        PubMetaData(self.dispatcher.parameters.values)
        article_data = PubMetaData.instance.get_new_revision(
            title
        )
# compose the output path
self.outpath = gmc.plainpath / pageurn(title)
self.outpath = self.outpath.with_suffix(".html")
self.outpath.resolve()
# To enable --toc, the parameter -s (standalone) needs to be set.
# This parameter leads to the generation of an html header with
# some meta tags.
# <!DOCTYPE html>
# <html lang="" xml:lang="" xmlns="http://www.w3.org/1999/xhtml">
# <head>
# <meta charset="utf-8"/>
# <meta content="pandoc" name="generator"/>
# <meta content="width=device-width, initial-scale=1.0,
# user-scalable=yes" name="viewport"/>
# <title>
# Verstehen
# </title>
# <style>
# code{white-space: pre-wrap;}
# span.smallcaps{font-variant: small-caps;}
# span.underline{text-decoration: underline;}
# div.column{display: inline-block; vertical-align:
# top; width: 50%;}
# div.hanging-indent{margin-left: 1.5em; text-indent: -1.5em;}
# ul.task-list{list-style: none;}
# </style>
# </head>
# <body>
# </body>
# </html>
# The TOC is created as <nav id="TOC"> tag,
# and it is not placed at the __TOC__
# location specified in the mediawiki page.
# Also __NOTOC__ is not honored.
# Own meta data lines need be injected and
# the toc needs to be moved to the correct location if specified,
# or removed, if specified.
        imgdir = gmc.imagepath
        imgdir.resolve()
        htmltext = subprocess.run(["pandoc",
                                   # extract media to the folder
                                   # disabled after migration
                                   # "--extract-media={}".format(imgdir),
                                   # standalone (full html)
                                   "-s",
                                   # create table of content
                                   "--toc",
                                   "--toc-depth=5",
                                   # mediawiki markup as input format
                                   "-f", "mediawiki",
                                   # html as output format
                                   "-t", "html",
                                   # input file
                                   "-i", self.inpath
                                   # don't use stdout, return the result
                                   ], capture_output=True)

        html_doc = htmltext.stdout.decode("utf-8")
        builder = HTMLParserTreeBuilder()
        soup = BeautifulSoup(html_doc, builder=builder)
# stupid but not avoidable:
# pandoc does not know where we store the plain html.
# therefore it cannot set the links to medias correctly.
# we have to give a helping hand
        # We could tell pandoc to use another working directory to get
        # the paths correct. TODO: change this when folders move again
        tags = soup.find_all("img")
        if tags:
            for tag in tags:
                tag.attrs.update({"src": "../website/image/" + tag["src"]})
                # since we are already here, provide a cheap
                # picture maximization via href to target _blank
                newtag = soup.new_tag("a")
                tag.insert_after(newtag)
                newtag.insert(0, tag)
                href = tag["src"]
                # Special exception for licence icons
                creative_commons = re.compile(
                    r".*CC-Icon.png")
                href = creative_commons.sub(
                    "creative-commons-cc0-1-0-universal.html",
                    href)
                creative_commons_0 = re.compile(
                    r".*CC0-Icon.png")
                href = creative_commons_0.sub(
                    "creative-commons-cc0-1-0-universal.html",
                    href)
                newtag.attrs.update({"href": href, "target": "_blank"})
# inject language information
= soup.find("html")
tag "lang": article_data[PubMetaData.locale]})
tag.attrs.update({"xml:lang": article_data[PubMetaData.locale]})
tag.attrs.update({
# inject stylesheet link
# <link rel="stylesheet" href="../website/css/fs.css"/>
= soup.find("head")
tag = soup.new_tag("link")
newtag
newtag.attrs.update("rel": "stylesheet", "href": "../website/css/fs.css"})
{6, newtag)
tag.insert(
for key in article_data.keys():
if (key.startswith('og:') or key.startswith('article:')):
= soup.new_tag("meta")
newtag "property": key,
newtag.attrs.update({"content": article_data[key]})
6, newtag)
tag.insert(
# my own invention: article:urn
= soup.new_tag("meta")
newtag "property": PubMetaData.urn,
newtag.attrs.update({"content": article_data.name})
6, newtag)
tag.insert(
# http://www.gnuterrypratchett.com/
= soup.new_tag("meta")
newtag "http-equiv": "X-Clacks-Overhead",
newtag.attrs.update({"content": "Terry Pratchett"})
# inject the generator meta information.
# one exists already
= soup.find("meta", attrs={"name": "generator"})
tag "name": "generator", "content": gmc.generator})
tag.attrs.update({
        # WikiLinks [https://webpage https//webpage]
        # lead to nested anchor tags.
        # The resulting page works in Firefox, but it is not valid HTML.
        # We use soup for the correction.
        tags = soup.find_all("a")
        for tag in tags:
            nested_a = tag.find("a")
            if nested_a:
                atext = "" + nested_a.text
                nested_a.decompose()
                tag.append(atext)

        # use a better symbol for backreferences
        tags = soup.find_all("a", text='↩︎')
        for tag in tags:
            tag.clear()
            tag.append('↑')

        # Move the TOC to the correct location
        toc = soup.find("nav", id='TOC')
        tag = soup.find("p", text='__TOC__')
        if tag:
            tag.replace_with(toc)
        else:
            tag = soup.find("p", text='__NOTOC__')
            if tag:
                tag.decompose()
                toc.decompose()

        # Footnotes are not placed at the location
        # of the <references/> tag.
        # Footnotes are generated as section
        # <section class="footnotes" role="doc-endnotes">
        # Search the section and use it to replace References.
        footnotes = soup.find("section", class_="footnotes")
        if footnotes:
            tag = soup.find("references")
            if tag:
                tag.replace_with(footnotes)
            else:
                print("Provide a reference tag as footnote target location.")
                sys.exit(1)
# Category-Links get a title "wikilink"
# Add those anchors a class "category" to hide them until
# I decide to use them.
# But "Kategorie:Artikel" gets removed. These are all articles.
= soup.find("a", href="Kategorie:Artikel")
tag if tag:
tag.decompose()
= soup.find_all("a", title="wikilink")
tags for tag in tags:
"class": "category"})
tag.attrs.update({
= r"https://idee\.frank-siebert\.de"
site_r = r"[/]\d{4}[/]\d{2}[/]\d{2}[/]" # '/yyyy/MM/dd'
date_r = r"[/][a][r][t][i][c][l][e][/]"
article_r
# Links to own articles will be addressed by relative path,
# In article migration we point to pages in the same location.
= re.compile(site_r + date_r)
repattern = soup.find_all("a", href=repattern)
tags
for tag in tags:
# in case page internal id was addressed
= tag["href"].split("#")
url 0] = "./" + repattern.sub("", url[0].rstrip("/")) + ".html"
url[= '#'.join(url)
new_url = new_url.lower() # change camel case to lower case
new_url "href": new_url})
tag.attrs.update({
# Links to other resources will be also addressed by relative path,
# Those resources need to be addressed by ../
= re.compile(site_r)
repattern = soup.find_all("a", href=repattern)
tags
for tag in tags:
= repattern.sub("..", tag["href"])
new_url = new_url.lower() # change camel case to lower case
new_url "href": new_url})
tag.attrs.update({
# References to own articles in the new portal
# shall be relative as well.
= re.compile(site_r+article_r)
reart = soup.find_all(re.compile(r"^a$"), attrs={"href": reart})
tags for tag in tags:
= "./" + reart.sub("", tag["href"])
new_url = new_url.lower()
new_url "href": new_url})
tag.attrs.update({
# its about articles, one article a page.
# For later site function injection, we need a
# container around the main content.
# After reading https://html.spec.whatwg.org/dev/sections.html
# I go for this structure:
# <body>
# <header">
# </header> Injected by SSI module in nginx
# <main> as semantic element for the main content
# <article> as semantic element for the article
# <header> an article header
# <h1>
# <div>
# <time pubdate="true" datetime=
# "2022-01-19T13:03:08">
# 2022-01-19
# </time>
# <address>Author Name</address>
= soup.find("body")
body = soup.new_tag("body") # temporary container
newbody
# SSI header injection is a function of the language
if article_data[PubMetaData.locale].startswith("de"):
= Comment('# include file="/portal/idee-header.html" ')
newtag else:
= Comment('# include file="/portal/concept-header.html" ')
newtag 0, newtag)
newbody.insert(
= soup.new_tag("main")
newtag 1, newtag)
newbody.insert(
= newtag
tag = soup.new_tag("article")
article 0, article)
tag.insert(
# previous body content becomes article content
# the new body replaces the old
= body.contents.copy()
article.contents
body.replace_with(newbody)
# inject article header information about
# title, creation date and author
= article
tag = soup.new_tag("header")
newtag 1, newtag)
tag.insert(= newtag
tag = soup.new_tag("h1")
newtag
newtag.append(title)0, newtag)
tag.insert(= soup.new_tag("div")
newtag 1, newtag)
tag.insert(= newtag
tag = soup.new_tag("time")
newtag 10])
newtag.append(article_data[PubMetaData.pubdate][:"datetime":
newtag.attrs.update({10][:19]})
article_data[PubMetaData.pubdate][:# probably deprecated by itemprop alternative
"pubdate": "true"})
newtag.attrs.update({0, newtag)
tag.insert(= soup.new_tag("address")
newtag "article:author"))
newtag.append(article_data.get(1, newtag)
tag.insert(
        html_doc = soup.prettify()

        with open(self.outpath, 'w') as outfile:
            print(html_doc, file=outfile)
            outfile.flush()
            outfile.close()
        print('wrote file {0}'.format(self.outpath))
        subprocess.run(["firefox", self.outpath], capture_output=False)

        # Placing a worklist item for the PdfWorker
        if article_data[PubMetaData.pdfdraft] == "true":
            self.dispatcher.worklist.append(
                PdfWorker.make_pdf_worklist_item(
                    article_data.name,
                    html_doc,
                    gmc.plainpath,
                    MsgWorker.task_create,
                    draft=True
                )
            )
def delete(self):
"""
Delete the generated HTML.
Resources used by the HTML need additional care.
If the delete was triggered by rename, no resources have to be deleted.
If it was triggered by a delete, a check is required,
whether the resources are used by other pages as well.
But resources are placed anyhow in the final website location.
They must not be deleted by the MwWorker.
"""
if __name__ == "__main__":
from gitmsgdispatcher import GitMsgDispatcher
print("Running Test-Cases")
= MwWorker(r".*(new file|modified).*author[/].*\.mediawiki")
mwworker = PdfWorker(r"" + PdfWorker.pdfworkitem)
pdfworker
# MESSAGEFILE = "test/PDF-Icon-TestCase-1"
# MESSAGEFILE = "test/mw_new_testcase"
# MESSAGEFILE = "test/WordPress-testcase-1"
# MESSAGEFILE = "test/ich-denke-TestCase-1"
# MESSAGEFILE = "test/FragenSieIhrenArzt-TestCase1"
# MESSAGEFILE = "test/PandemieBeenden-TestCase-1"
# MESSAGEFILE = "test/LegalTribune-TestCase-1"
= "test/TwoArticles-TestCase-1"
MESSAGEFILE = GitMsgDispatcher(MESSAGEFILE, [mwworker, pdfworker]) disp
Publishing Meta Data Management: pubmetadata.py
I have already written about the meta data export from WordPress and about the choice of Python module to manage the meta data in a CSV file.
During migration it is of vital importance to identify the correct meta data entry, so that the correct publishing date is shown in the article. And later we want to keep track of the original publishing date as well, if we perform updates.
The code above already shows that this is done in the PubMetaData class. Because the page's URN is the identifier in the stored publishing meta data, the Python file with the class PubMetaData also contains the function pageurn(pagename), which computes the Uniform Resource Name from the MediaWiki file name, which in turn equals the article title used in the MediaWiki.
While mwworker.py does not trigger a meta data save, it is vital for the migration, and in the new scenario also for article updates, that the mwworker uses existing meta data for the processed article if there is any.
~/projects/idee/generator/pubmetadata.py
"""
The PubMetaData manages the meta information about the publishings.
@author: Frank Siebert
@license: https://creativecommons.org/publicdomain/zero/1.0/deed.en
@date: 2022-03-15
This includes the migrated data from WordPress as well as publishing data
created by new publishings with the new page generator.
"""
import datetime
import pandas as pd
from gitmsgconstants import GitMsgConstants as gmc
def pageurn(pagename):
"""
Create a browser friendly urn from the pagename.
German special characters are replaced by readable two-character
alternatives, and spaces in the filename are replaced with '-'.
Parameters
----------
pagename : String
The pagename, which is also the title of the article.
All characters can appear, but we do not want all of them in the resulting URL.
Returns
-------
:String
Alternative URL friendly name.
"""
    urn = pagename.lower().strip() \
        .replace(' ', '-').replace('/', '-') \
        .replace('ß', 'ss').replace('ä', 'ae') \
        .replace('ö', 'oe').replace('ü', 'ue') \
        .replace('&', 'and').replace('\\', '-') \
        .replace('?', '').replace(':', '') \
        .replace('.', '-').replace(',', '') \
        .replace("(", "").replace(")", "") \
        .replace("\"", "").replace("!", "-") \
        .replace("„", "").replace("“", "") \
        .replace("#", "").replace("%", "") \
        .replace("'", "")

    # remove stacked hyphens
    while "--" in urn:
        urn = urn.replace("--", "-")

    urn = urn.rstrip('-')

    return urn
class PubMetaData():
"""
The PubMetaData manages the meta information about the publishings.
Parameters
----------
None.
Returns
-------
None.
"""
    instance = None

    # used as column names as well as meta tag names
    # article:urn is my own invention, who cares? It serves as unique index.
    urn = "article:urn"
    author = "article:author"
    pubdate = "article:published_time"
    revdate = "article:modified_time"    # Updatedate sounds stupid
    commentcount = "comments:count"      # of some interest during migration.
    title = "og:title"
    site = "og:site_name"
    locale = "og:locale"

    # not used in persistence
    pdfdraft = "pdf:draft"
    # not used now in persistence
    deletion = "deleted_time"
class _PubMetaData():
def __len__(self):
return len(self._storage)
def __init__(self, disp_msgparam):
"""
Initialize only one publishing dictionary.
Returns
-------
None.
"""
# The msg parameters from the message dispatcher
self._msgparam = disp_msgparam
# Registers for updates and deletions
self._updates = []
self._deletions = []
if not gmc.publishingdatapath.exists():
self._read_migration_list()
else:
self._read()
def _read_migration_list(self):
"""
Read the migration list.
The migrationlist.csv is one of two trusted sources
for the correct publishing date.
The second one is the pubmetadata.csv.
This method moves the migration list entries to pubmetadata.
As soon as the pubmetadata has been saved once,
this method is no longer required.
The data structure aligns to the planned pubmetadata
data structure.
The urn is the stem part of the url the page will finally have.
It serves as index in the pandas dataframe, which translates
into the name of the respective Series holding one article's data.
article:published_time 2022-02-15T14:41:13.367917
article:modified_time 2022-02-15T14:41:13.367917
comments:count 0
og:site_name Idee
og:locale de-DE
article:author Frank Siebert
og:title Creative Commons CC0 1.0 Universal
pdf:draft true
Name: creative-commons-cc0-1-0-universal, dtype: object
Returns
-------
None.
"""
            self._storage = pd.read_csv(gmc.migrationlistpath,
                                        delimiter='\t',
                                        index_col=PubMetaData.urn)
def get_new_revision(self, title=None, urn=None):
"""
Provide publishing dictionary data for the title.
A message worker may use this method to get information
about the current publishing in work.
To make this useful, the meta information from the current
git message is incorporated into the article entry, if the meta
information is not already in from previous publishings, bringing
all metadata required into one place.
If the worker succeeds and his work was not DRAFT publishing, the
worker may provide the article_data to get it saved via update().
If the workers task was the deletion of the publishing, the
worker may provide the article_data to get the deletion
information saved via deletion().
Parameters
----------
title:
The title of the article, whose data has to be updated. If
provided, it is used to compute the urn of the article.
urn:
The unique resource name of the article, whose data has to be
updated.
Returns
-------
dict:
The titles data dictionary with revised data entries.
"""
            nowdate = datetime.datetime.now().isoformat()

            if not urn and not title:
                return None
            if not urn:
                urn = pageurn(title)

            if urn in self._storage.index:  # .to_list():
                article_data = self._storage.loc[urn]
                # Working copy
                article_data = article_data.copy()
            else:
                for index, article_data in self._storage.iterrows():
                    article_data = pd.Series(
                        data={
                            PubMetaData.title: title,
                            PubMetaData.pubdate: nowdate,
                            PubMetaData.commentcount: 0,
                            PubMetaData.site: None,
                            PubMetaData.locale: None,
                            PubMetaData.author: None
                        },
                        index=article_data.index,
                        dtype=article_data.dtype,
                        name=urn)
                    article_data = article_data.copy()
                    article_data.update({"Name": urn})
                    break

            # Set the revision date
            article_data.update({PubMetaData.revdate: nowdate})

            # Iterate the message parameter keys and add parameters and their
            # value, if data for this key is not present in the titles
            # data series.
            # This also adds a key, if it is not part of the pubmetadata.csv.
            for key in self._msgparam.keys():
                if not article_data.get(key):
                    article_data.loc[key] = self._msgparam[key]
            return article_data
def update(self, series):
self._updates.append(series)
def delete(self, series):
self._deletions.append(series)
def save(self):
"""
Save the publishing dict data.
Incorporates updates and new entries,
and removes entries deleted (Implementation
pending, probably I decide to extend the
data structure with a deleted column).
Deletions never took place till now,
might take a while till its implemented.
Returns
-------
None.
"""
            for article_data in self._updates:
                urn = article_data.name
                self._storage.loc[urn] = article_data
            for article_data in self._deletions:
                pass  # implementation pending
            self._storage.to_csv(gmc.publishingdatapath,
                                 sep=';', quotechar='"')
def _read(self):
"""
Read the publishing dict data from previous publishings.
Returns
-------
None.
"""
            self._storage = pd.read_csv(gmc.publishingdatapath,
                                        delimiter=';',
                                        index_col=PubMetaData.urn)
def __init__(self, disp_msgparam):
if not PubMetaData.instance:
            PubMetaData.instance = PubMetaData._PubMetaData(disp_msgparam)
def __getattr__(self, name):
"""
Get attribute value by name.
Parameters
----------
name : str
Name of the attribute.
Returns
-------
TYPE
Value of the attribute.
"""
return getattr(self.instance, name)
def __len__(self):
return len(PubMetaData.instance)
if __name__ == "__main__":
pass
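A quick illustration of what pageurn() produces; the second title is made up, only the replacement rules are taken from the function above:
from pubmetadata import pageurn

print(pageurn("Creative Commons CC0 1.0 Universal"))
# creative-commons-cc0-1-0-universal
print(pageurn("Fragen & Antworten, Teil 1!"))
# fragen-and-antworten-teil-1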
Constants
That's a rather strange decision. Why would someone create a class to store constants?
I see these values less as real constants; they are more likely to become, at least partly, configuration entries once I decide to separate the generator code into a software package usable for more than one content project.
The existence of this class and its current content is a strong signal of the unfinished nature of the project. It's just ready for first use, nothing more.
~/projects/idee/generator/gitmsgconstants.py
"""
GitMsgConstants provides project wide constants.
@author: Frank Siebert
@license: https://creativecommons.org/publicdomain/zero/1.0/deed.en
@date: 2022-03-15
No instance is required. It could leverage a config file in the future.
"""
from pathlib import Path
class GitMsgConstants():
"""
Dispatch the lines of the git message to registered workers.
Parameters
----------
gitmessagepath : Path
Path as type str or type Path pointing to the git message.
msgworkers : List of MsgWorker
The list of message workers is used as worker queue. Workers first
in the queue get their workitems first.
Workers can return their work result to be picked up by
later workers.
Returns
-------
GitMsgConstants.
"""
= "pandoc, fs-commit-msg-hook 1.0"
generator = "https://idee.frank-siebert.de"
website = "3cd97bab8bb20288768b35fd72979ec3bbf4b2a8.png"
pdfimage
= Path("plain")
plainpath = Path("config")
confpath = Path("website")
sitepath = sitepath / "article"
articlepath = sitepath / "audio"
audiopath = sitepath / "css" / "fs.css"
csspath = sitepath / "portal" / "header.html"
headerpath = sitepath / "image"
imagepath = sitepath / "pdf"
pdfpath = sitepath / "qrcode"
qrpath
= confpath / "migrationlist.csv"
migrationlistpath = sitepath / "pubmetadata.csv"
publishingdatapath
= "pdf:draft"
pdfdraft = "og:locale"
locale
= sitepath / "archive"
archivepath = archivepath / Path("idee-archive.html")
idee_archive = archivepath / Path("concept-archive.html")
concept_archive = sitepath / Path("sitemap.xml")
sitemap = sitepath / Path("idee-map.xml")
idee_map = sitepath / Path("concept-map.xml")
concept_map = sitepath / "sitemap"
sitemappath = sitepath / "portal" / "monthly-map.xml"
map_template = sitepath / "portal" / "monthly-archive.html"
archive_template
= sitepath / Path("idee-rss.xml")
idee_rss = sitepath / Path("concept-rss.xml")
concept_rss
= sitepath / Path("idee-index.html")
idee_index = sitepath / Path("concept-index.html")
concept_index
if __name__ == "__main__":
pass
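The docstring already hints at a config file. A possible direction, sketched only: overlay the class attributes with values from an INI file. The file name, section and keys here are assumptions, not part of the project:
import configparser
from pathlib import Path

from gitmsgconstants import GitMsgConstants as gmc

config = configparser.ConfigParser()
config.read("generator.ini")        # hypothetical file name
if "site" in config:                # hypothetical section name
    gmc.website = config["site"].get("website", gmc.website)
    gmc.sitepath = Path(config["site"].get("sitepath", str(gmc.sitepath)))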
HTML Formatting: fs.css
If we generate HTML, we also want it to look nice. The CSS is a critical part of getting a nice looking result.
~/projects/idee/website/css/fs.css
/* ***************************************************************************
* Frank Siebert's CSS
+
* Licence: CC0
* httpx://frank-siebert.de/article/creative-commons-cc0-1-0-universal.html
* ***************************************************************************/
:root {
/* kind of blue */
--theme-color: #006080;
/* black on white */
--theme-text-color: #000000;
/* white background */
--theme-background-color: #ffffff;
/* for minor meta information */
--theme-meta-color: #999999;
/* Arial and Helvetica exist on my Computer */
/* --theme-font-family: Arial, Helvetica, Verdana, Tahoma, sans-serif; */
--theme-font-family: Liberation Sans, sans-serif;
/* One theme font only, based on the theme font-family */
--theme-font: 16px/1.4 Liberation Sans, sans-serif;
/* Improve readability */
--theme-letter-spacing: 0.05em;
}
html {padding: 0px 5px 0px 0px;
margin: 0;
border: 0;
font: var(--theme-font);
letter-spacing: var(--theme-letter-spacing);
background-color: lightgray;
}
body {width: 100%;
height: 100%;
min-width: 280px;
max-width:1200px;
padding: 0 0 0 0;
margin-top: 0;
margin-bottom: 0;
margin-left:auto;
margin-right:auto;
border-right: 1px solid var(--theme-color);
border-left: 1px solid var(--theme-color);
color: var(--theme-text-color);
background-color: var(--theme-background-color);
font-size: 1em;
word-wrap: break-word;
}
/* **************************************************************************
* keep the two body elements in sync
* **************************************************************************/
div.row,
body,
body header,
body main { min-height: 100px;
padding: 5px;
background-color: var(--theme-background-color);
background-repeat: no-repeat;
background-position: top center;
background-size: auto;
}
body header nav { padding: 0 0 0 0;
/* background: #ddcc99; */
}
/* **************************************************************************
* The tag <figure> comes with build in padding,
* but we have to have the same for the article.
*
 * These styles keep the respective block elements horizontally aligned.
*
* ==== MEDIA SCREEN Variants ====
* **************************************************************************/
@media screen and (min-width: 641px) {
body, /* yacy search */
header div div,
header figure,
header nav,
header hr,
/* main>h3 is used in the archive.html*/
main,
main article>h3 {
display: block;
margin: 1em 3em 1em 3em;
/* border-style: dotted;
 * border-width: 2px; */
}
main>h1 {
display: block;
margin: 0.6em 1.8em 0.6em 1.8em;
}
.searchinput {
max-width: 600px;
}
}
@media screen and (max-width: 640px) {
body, /* yacy search */
header div div,
header figure,
header nav,
header hr,
/* main>h3 is used in the archive.html*/
main,
main article>h3 {
display: block;
margin: 1em 1em 1em 0;
}
main>h1 {
display: block;
margin: 0.6em 0.6em 0.6em 0.2em;
}
.searchinput {
max-width: 260px;
}
}
/* **************************************************************************
* ==== END OF MEDIA SCREEN Variants ====
* **************************************************************************/
/* the main content is the article */
article { display: block;
}
/* **************************************************************************
/* ==== all about headlines ====
* **************************************************************************/
/* I do not think that I will put a tags under the headlines.
 * h1 a, h2 a, h3 a, h4 a, h5 a, h6 a { text-decoration: none; } */
h1, h2, h3, h4, h5, h6
{ line-height: 1.1;
margin: 0;
padding: 1em 0 0.5em 0;
color: var(--theme-color);
font-family: var(--theme-font-family);
font-weight: bold;
}

h1 { font-size: 1.8em; }
h2 { font-size: 1.6em; }
h3 { font-size: 1.4em; }
h4 { font-size: 1.2em; }
h5, h6 { font-size: 1em; }
/* Newspaper Style First Letter of First Paragraph Upper-Case */
article>p:first-of-type::first-letter,
hr+p::first-letter,
h2+p::first-letter,
h3+p::first-letter,
h4+p::first-letter {
font-family: serif;
font-size: 1.8em;
font-weight: bold;
}
/* **************************************************************************
* ==== Article Header ====
* - h1 headline
* - address information
* - page qr-code
* - licence information
* - audio player
* **************************************************************************/
article header {min-height: 0
}
article header h1 {padding: 0 0 0.2em 0;
}
article header div {color: var(--theme-meta-color);
font-size: 0.8em;
padding: 0 0 1em 0;
}
/* The browser decided, that address gets rendered italic,
* but we do not want this */
article header time,
article header address {padding-right: 20px;
display: inline;
font: var(--theme-font);
font-size:inherit
}
/* **************************************************************************
* ==== Article Block Elements
* **************************************************************************/
p {margin: 0;
font-size: 1em;
padding: 0 0 1em 0;
}
p:last-child
{padding-bottom: 0;
}
table th {background: #ddd;
border-right: 1px solid #fff;
padding: 10px 20px;
}
table tr th:last-child {
border-right: 1px solid #ddd;
}
table td {padding: 5px 20px;
border: 1px solid #ddd;
}
/* **************************************************************************
* ==== Figures in the header and in the article ====
* **************************************************************************/
figure img { width: 100%; height: auto; }
figure audio { width: 50%; height: auto; min-height:2em;}
header figure figcaption { font: var(--theme-font); font-size: 1em;
color: var(--theme-color); font-weight: bold}
article figure { margin: 10px }
figure figcaption { font: var(--theme-font); font-size: 0.8em;
color: var(--theme-color); font-style: italic; padding: 2px;}
article header div figure { display: Inline; }
article header div figure img { width: 50px; }
article header div figure figcaption { display: Inline; width: 150px }
article header div figure audio { margin: .5em .5em .5em .5em; }
/* **************************************************************************
* ==== Navigation in the header ====
* **************************************************************************/
header>nav>a {
font-size: 1.2em;
padding: 0 0.5em 0 0;
display: inline-grid;
grid-template-columns: 30px auto auto auto;
}
header>nav>a>img {
width: 24px;
vertical-align: sub;
}
header>nav>form {
display: inline;
padding: 0 0.5em 0 0;
margin: 0 0 0 0;
}
header>nav>form>input{
font: var(--theme-font);
letter-spacing: var(--theme-letter-spacing);
font-size: 1em;
vertical-align: super;
padding: 0 0 0 0;
margin: 0 0 0 0;
border-color: var(--theme-color);
}
/* context break is meta information */
hr {height:1px;
border-width:0;
background-color: var(--theme-meta-color);
}
/* **************************************************************************
* inline HTML TAGS
* **************************************************************************/
pre {background: #f5f5f5;
border: 1px solid #ddd;
padding: 10px;
text-shadow: 1px 1px rgba(255, 255, 255, 0.4);
font-size: 0.8em;
line-height: 1.25;
margin: 0 0 1em 0;
overflow: auto;
}
sup, sub {
font-size: 0.75em;
height: 0;
line-height: 0;
position: relative;
vertical-align: baseline;
}
sup {bottom: 1ex;
}
sub {top: 1ex;
}
small { font-size: 0.75em
}
/* **************************************************************************
* ==== Navigation and their targets ====
* **************************************************************************/
*:target {
border-bottom: 0.3em solid var(--theme-color);
}
a { text-decoration: none;
font: var(--theme-font);
font-size: 1em;
font-weight: bold;
color: var(--theme-color);
border-width: 0 0 0 0;
border-style: none;
}
a:link { color: var(--theme-color); }
a:visited { color: var(--theme-text-color); }

/* figure:has(a:focus), */ /* Wait for CSS 4 */
a:focus,
a:hover /* ,
a:active */ {
color: var(--theme-background-color);
background-color: var(--theme-color);
outline: none;
}
figure a:focus,
figure a:hover {
color: var(--theme-background-color);
background-color: var(--theme-color);
outline: none;
border: none;
}
header>div>a:focus,
header>div>a:hover {
background-color: var(--theme-background-color);
color: var(--theme-color);
outline: none;
border: none;
}
a.category { visibility: hidden; }
/* **************************************************************************
* ==== YaCy Search ====
* **************************************************************************/
p.urlinfo :nth-child(2),
p.urlinfo :nth-child(3),
p.urlinfo :nth-child(4),
p.urlinfo :nth-child(5),
.favicon,
.navbar,
.starter-template,
.hidden,
.urlactions,
.input-group-btn,
.sidebar,
#datehistogram,
#api {
display: none;
}
div {min-height: 10px;
margin: 0 0 0 0;
padding: 0 0 0 0;
}
span#resNav ul li {
display: inline;
font-size: 1.4em;
}
.searchinput {
font: var(--theme-font);
letter-spacing: var(--theme-letter-spacing);
font-size: 1em;
border-color: var(--theme-color);
outline: 5px solid var(--theme-meta-color);
}
.linktitle,
.pagination {
font-size: 1.4em;
border-top: 2px solid var(--theme-meta-color);
}
/* **************************************************************************
* ==== syntaxhighlight ====
 * CSS as created in the html style-element by WeasyPrint for syntaxhighlight
* Changes for the print version need to be applied in fspdf.css
* Changes for the browser version need to be applied at the end of this file.
* **************************************************************************/
code{white-space: pre-wrap;}
span.smallcaps{font-variant: small-caps;}
span.underline{text-decoration: underline;}
div.column{display: inline-block; vertical-align: top; width: 50%;}
div.hanging-indent{margin-left: 1.5em; text-indent: -1.5em;}
ul.task-list{list-style: none;}
pre > code.sourceCode { white-space: pre; position: relative; }
pre > code.sourceCode > span { display: inline-block; line-height: 1.25; }
pre > code.sourceCode > span:empty { height: 1.2em; }
code.sourceCode > span { color: inherit; text-decoration: inherit; }
div.sourceCode { margin: 1em 0; }
pre.sourceCode { margin: 0; }
@media screen {
div.sourceCode { overflow: auto; }
}
@media print {
pre > code.sourceCode { white-space: pre-wrap; }
pre > code.sourceCode > span { text-indent: -5em; padding-left: 5em; }
}
pre.numberSource code
  { counter-reset: source-line 0; }
pre.numberSource code > span
  { position: relative; left: -4em; counter-increment: source-line; }
pre.numberSource code > span > a:first-child::before
  { content: counter(source-line);
    position: relative; left: -1em; text-align: right; vertical-align: baseline;
    border: none; display: inline-block;
    -webkit-touch-callout: none; -webkit-user-select: none;
    -khtml-user-select: none; -moz-user-select: none;
    -ms-user-select: none; user-select: none;
    padding: 0 4px; width: 4em;
    color: #aaaaaa;
  }
pre.numberSource { margin-left: 3em; border-left: 1px solid #aaaaaa;
  padding-left: 4px; }
div.sourceCode
  {   }
@media screen {
pre > code.sourceCode > span > a:first-child::before {
  text-decoration: underline; }
}
code span.al { color: #ff0000; font-weight: bold; } /* Alert */
code span.an { color: #60a0b0; font-weight: bold; font-style: italic;
} /* Annotation */
code span.at { color: #7d9029; } /* Attribute */
code span.bn { color: #40a070; } /* BaseN */
code span.bu { } /* BuiltIn */
code span.cf { color: #007020; font-weight: bold; } /* ControlFlow */
code span.ch { color: #4070a0; } /* Char */
code span.cn { color: #880000; } /* Constant */
code span.co { color: #60a0b0; font-style: italic; } /* Comment */
code span.cv { color: #60a0b0; font-weight: bold; font-style: italic;
} /* CommentVar */
code span.do { color: #ba2121; font-style: italic; } /* Documentation */
code span.dt { color: #902000; } /* DataType */
code span.dv { color: #40a070; } /* DecVal */
code span.er { color: #ff0000; font-weight: bold; } /* Error */
code span.ex { } /* Extension */
code span.fl { color: #40a070; } /* Float */
code span.fu { color: #06287e; } /* Function */
code span.im { } /* Import */
code span.in { color: #60a0b0; font-weight: bold; font-style: italic;
} /* Information */
code span.kw { color: #007020; font-weight: bold; } /* Keyword */
code span.op { color: #666666; } /* Operator */
code span.ot { color: #007020; } /* Other */
code span.pp { color: #bc7a00; } /* Preprocessor */
code span.sc { color: #4070a0; } /* SpecialChar */
code span.ss { color: #bb6688; } /* SpecialString */
code span.st { color: #4070a0; } /* String */
code span.va { color: #19177c; } /* Variable */
code span.vs { color: #4070a0; } /* VerbatimString */
code span.wa { color: #60a0b0; font-weight: bold; font-style: italic;
} /* Warning */
/* **************************************************************************
* ==== syntaxhighlight ====
* Own Part
* **************************************************************************/
pre.sourceCode {
width: 80ch; /* classic terminal width for code sections */
}
MediaWiki to HTML Recapitulation
At this point it is possible to copy a MediaWiki title and paste it into the command line:
we 'MediaWiki title'
Halt! We are missing something here. The command we is unknown to your system. But you can get rid of this problem by placing the following line into the file
~/.bash_aliases
alias we='~/projects/wikitools/src/export.py'
This simplifies your life a lot, since you only need to remember that wiki export is written as we on the command line.
The export, provided the default wiki is configured correctly in your configuration file, will create the file 'MediaWiki title.mediawiki' in the directory ~/projects/idee/author/ .
It does not matter in which working directory you are, when you invoke this command.
You can also use
~/projects/idee$ git add .
~/projects/idee$ git commit
This will add your new MediaWiki file to the commit list and start the commit. It will trigger the invocation of the MwWorker and you will get an HTML file named 'mediawiki-title.html' placed into the directory ~/projects/idee/plain/ and opened in Firefox.
Well, at this stage you might need to comment out some parts of the code, because it references parts that are not yet implemented.
HTML to PDF Conversion
I was pretty sure that I would be able to convert the plain HTML into a portal page. Therefore PDF generation got the higher priority.
Logically, I started this PDF generation using Pandoc. Quite a big part of the later chapter [Migration#Migration] reports the various problems I ran into and how I managed to solve them. I keep these parts in the documentation, since they might help one person or another to solve the same problems.
In the end I found that Pandoc offers no possibility to get working links in the footnote section that point back to the footnote number in the text.
This was too much functional loss for me to accept. I ended up using WeasyPrint. A lot of the code that is now commented out was required to make the Pandoc results look OK.
For WeasyPrint I needed to create an extra CSS file, but the result looks good, at least for my taste.
The WeasyPrint installation description comes further down in this document, at the point where it happened in my project, nearly at the end.
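Reduced to its core, the WeasyPrint call that replaced the Pandoc PDF route looks like this. The paths are examples; the API calls are the same ones used in the worker below:
from weasyprint import HTML
from weasyprint import CSS

# example input and output paths, not the worker's real ones
html_doc = open("plain/example.html", encoding="utf-8").read()
HTML(string=html_doc, base_url="plain").write_pdf(
    target="website/pdf/example.pdf",
    stylesheets=[CSS(filename="website/css/fspdf.css")])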
The PDFWorker
~/projects/idee/generator/pdfworker.py
"""
PdfWorker is derived from the MsgWorker base class.
@author: Frank Siebert
@license: https://creativecommons.org/publicdomain/zero/1.0/deed.en
@date: 2022-03-15
The PdfWorker takes care of a worklist item placed
by an earlier worker.
"""
import re
from pathlib import Path
from bs4 import BeautifulSoup
from bs4.builder._htmlparser import HTMLParserTreeBuilder
from weasyprint import HTML
from weasyprint import CSS
from gitmsgdispatcher import MsgWorker
from gitmsgconstants import GitMsgConstants as gmc
# from pubmetadata import pageurn
class PdfWorker(MsgWorker):
"""
The PdfWorker takes care of a worklist item placed by an earlier worker.
The class method makePdfWorklistItem() can be used to create a work item,
which can be placed into the worklist.
The respective PDF is created from HTML and stored in the folder
GITROOT/website/pdf/
Parameters
----------
super: MsgWorker
The PdfWorker is derived from the MsgWorker.
Returns
-------
PdfWorker.
"""
# keys
= "pdfworkitem"
pdfworkitem = "urn"
urn = "title"
title = "workpath"
workpath = "html_doc"
html_doc = "draft"
draft
def __init__(self, pattern):
super().__init__(pattern)
self.values = {}
def process(self):
"""
Create a PDF file for the HTML.
Parameters
----------
html_doc: Type String of HTML
workpath: Type Path, Folder of html file location (planned or factual)
Converts the HTML provided as a string containing an article with updated
publishing date into PDF.
It might be a draft for a new plain html or it might be the
publishing version with updated publishing date but still
without portal injection.
This makes no difference for the processing result.
Returns
-------
None.
Implementation Notes
--------------------
The PDF generation fails, if pictures in tables are embedded inside
of a figure tag. To address this, we have to open the html file,
look for figures inside of tables, and remove the figure without
removing the figures content.
Then we need to save the result in a temporary file and tell
pandoc the correct workdirectory for the successful resolutiin
of relative pathes in href and src entries in the html.
"""
        html_doc = self.item[PdfWorker.html_doc]
        workpath = self.item[PdfWorker.workpath]
        draft = self.item[PdfWorker.draft]

        builder = HTMLParserTreeBuilder()
        soup = BeautifulSoup(html_doc, builder=builder)

        title = soup.find("title")

        self.outpath = gmc.pdfpath / (self.item[PdfWorker.urn] + ".pdf")
        self.outpath = self.outpath.resolve()
        workpath = workpath.resolve()

        if draft:
            newtitle = title.text.strip() + " - DRAFT"
            title.clear()
            title.append(newtitle)
# First we need to remove some things.
# The article header
= soup.find("article")
tag = tag.find("header")
header
# tags = header.find_all("figcaption")
# for tag in tags:
# tag.decompose()
= header.find_all("figure")
tags if len(tags) == 3:
2].decompose() # remove audio
tags[if len(tags) > 1:
1].decompose() # remove PDF Icon in the PDF Version
tags[# if len(tags) > 0:
# # size the qrcode picture
# tag = tags[0].find("img")
# tag.attrs.update({
# "height": "80px",
# "width": "80px"
# })
# tags[0].unwrap()
# figures in tables do not work in pandoc
# tables = soup.find_all("table")
# for table in tables:
# tags = table.find_all("figcaption")
# for tag in tags:
# tag.unwrap()
# tags = table.find_all("figure")
# for tag in tags:
# tag.unwrap()
# tables = soup.find_all("table")
# for table in tables:
# figs = table.find_all("figure")
# for fig in figs:
# figcap = fig.find("figcaption")
# if figcap:
# figcap.unwrap()
# fig.unwrap()
# headers = soup.find_all("header")
# for header in headers:
# figs = header.find_all("figure")
# for fig in figs:
# figcap = fig.find("figcaption")
# if figcap:
# figcap.unwrap()
# fig.unwrap()
# We need to change relative paths to own articles into absolute
# paths.
= re.compile(r"^\.\/")
rhref = soup.find_all("a", href=rhref)
anchors for anchor in anchors:
= rhref.sub("https://idee.frank-siebert.de/article/",
url "href"])
anchor["href": url})
anchor.attrs.update({
# On paper we need complete written URLs
= re.compile(r"^http.*")
rhref = soup.find("section", class_="footnotes")
tag
if tag:
= tag.find_all("a", href=rhref)
anchors for anchor in anchors:
= anchor["href"]
url "br"))
anchor.parent.append(soup.new_tag(
anchor.parent.append(url)
= Path(r"/home/frank/projects/idee/website/css/fspdf.css")
csspath
csspath.resolve()# if csspath.exists():
# print("css exists")
= soup.prettify()
html_doc
= HTML(string=html_doc, base_url=str(workpath))
weasy_html =self.outpath,
weasy_html.write_pdf(target=[CSS(filename=str(csspath))]
stylesheets
)
# subprocess.run(["pandoc",
# # mediawiki markup as input format
# "-f", "html",
# # html as output forma
# "-t", "pdf",
# # input file
# # "-i", inpath,
# # output file
# "-o", self.outpath,
# # "--pdf-engine=xelatex",
# "--pdf-engine=weasyprint",
# "--variable=mainfont:Liberation Sans",
# "--variable=sansfont:Liberation Sans",
# "--variable=monofont:Liberation Mono",
# "--css", csspath,
# # "--variable=mainfont:DejaVu Serif",
# # "--variable=sansfont:DejaVu Sans",
# # "--variable=monofont:DejaVu Sans Mono",
# # "--variable=geometry:a4paper",
# # "--variable=geometry:margin=2.5cm",
# # "--variable=linkcolor:blue"
# ],
# capture_output=False,
# # the correct workdirectory to find the images
# cwd=workpath,
# # html string as stdin
# input=html_doc.encode("utf-8"))
# print('wrote file {0}'.format(self.outpath))
# subprocess.run(["firefox", pdfpath], capture_output=False)
def delete(self):
"""
Delete the generated HTML.
Resources used by the HTML need additional care.
If the delete was triggered by rename, no resources have to be deleted.
If it was triggered by a delete, a check is required,
whether the resources are used by other pages as well.
But resources are placed anyhow in the final website location.
They must not be deleted by the MwWorker.
"""
@staticmethod
    def make_pdf_worklist_item(urn, html_doc, workpath, task_type,
                               draft=False):
        """
Create a worklist item for the PdfWorker.
Parameters
----------
title : str
Title of the article
urn: str
The unique resource name, also stem of the related files
html_doc : str
The generated HTML to transform.
workpath : TYPE
Where to work to have the relative links right.
task_type : str, optional
One of MsgWorker.task_*
draft : TYPE, optional
Flag whether this is a PDF draft work item. The default is False.
Returns
-------
None.
"""
return {
MsgWorker.task_worker_match: PdfWorker.pdfworkitem,
PdfWorker.urn: urn,
PdfWorker.html_doc: html_doc,
PdfWorker.workpath: workpath,
MsgWorker.task_type: task_type,
PdfWorker.draft: draft
}
if __name__ == "__main__":
pass
The PDF Style Sheet
~/projects/idee/website/css/fspdf.css
/* ***************************************************************************
* Frank Siebert's PDF CSS
+
* Licence: CC0
* httpx://frank-siebert.de/article/creative-commons-cc0-1-0-universal.html
* ***************************************************************************/
html {font-family: Liberation Sans, sans-serif !important;
font: 12px/1.4 Liberation Sans, sans-serif !important;
background-color: #ffffff !important;
}
@page {
size: A4; /* Change from the default size of A4 */
margin: 1.5cm; /* Set margin on each page */
@top-right {
content: counter(page);
color: #006080;
font-size: 1.2em;
}
@top-left {
content: string(pageheader);
color: #006080;
font-size: 1.2em;
}
}
header h1 {string-set: pageheader content();
}
article header div figure img { width: 150px !important; }
/* **************************************************************************
* ==== syntaxhighlight ====
* **************************************************************************/
/* Allow only intentional line breaks in source code */
pre > code.sourceCode > span {
 white-space: nowrap !important;
}
pre.sourceCode {
width: 80ch !important; /* classic terminal width for code sections */
}
HTML to PDF Recapitulation
At this point of the implementation the PdfWorker related parts no longer need to be commented out. And you can request the generation of a draft PDF in the commit message when you commit a new MediaWiki file.
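For a commit that should also produce a draft PDF, the edited commit message (the template plus the section git adds) looks roughly like this; the subject line and the file name are invented for illustration:
Example article published

# Overwrite values if necessary, based on https://ogp.me/
# pdf:draft=true
# og:locale=de-DE
# og:site_name=Idee
# article:author=Frank Siebert
#
# Changes to be committed:
#	new file:   author/Example-Article.mediawiki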
Note that this is only a chapter in a much longer description.
The portal page contains:
- A QRCode pointing to its own URL
-
A PDF
- For low content pages PDF generation can be suppressed.
-
License Information
- For low content pages License Information can be suppressed.
- Audio controls, if an audio was created.
- The portal header
The audio is not generated; it needs to be recorded and saved in the folder ~/projects/idee/website/audio/ with the same filename as computed for the plain HTML file, but with the extension mpg.
Note to myself: Consider allowing ogg as an alternative extension.
The license information is given by license icons linking to an article text about the license. Apart from the code to place the icon and to link to the article, this part is mainly content.
The only missing pieces are the qrcode generator, which exists as a ready-to-use Python module, and the portal header, which still has to be integrated.
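The QR code part really is just a few lines with the qrcode module; a sketch of the idea, with an example URL and output path:
import qrcode

# a QR code pointing to an article URL (example values)
url = "https://idee.frank-siebert.de/article/replacing-wordpress.html"
img = qrcode.make(url)
img.save("website/qrcode/replacing-wordpress.png")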
The original plan was to copy the portal HTML fragment into the article HTML file, because HTML itself does not support includes, not even with a same-origin policy. Luckily I discovered that the web server nginx supports such includes on the server side. The respective include instruction is already present in the plain HTML version.
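For the include to be processed, SSI has to be switched on in nginx for the locations serving the articles. A minimal sketch of such a configuration; the document root and location are assumptions, not my real server setup:
server {
    server_name idee.frank-siebert.de;
    root /var/www/idee;            # assumed document root

    location /article/ {
        ssi on;   # lets nginx process <!--# include file="..." --> comments
    }
}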
Portal Header
The Portal Header is an HTML fragment file.
~/projects/idee/website/portal/idee-portal.html
<header>
<figure>
<a href="/idee-index.html" alt="Home" tabindex="1">
<img src="../image/bookpress.jpg" alt="Idee der eigenen Erkenntnis"
srcset="../image/bookpress.jpg 1600w,
../image/bookpress-300x43.jpg 300w,
../image/bookpress-768x110.jpg 768w,
../image/bookpress-1024x147.jpg 1024w,
../image/bookpress-1568x225.jpg 1568w"
sizes="(max-width: 1600px) 100vw, 1600px" width="1600" height="auto"/>
</a>
<figcaption>
Idee der eigenen Erkenntnis</figcaption>
</figure>
<nav>
<form action="../yacysearch.html" accept-charset="UTF-8" method="get">
<input type="text" name="query" placeholder="Suche.." maxlength="80"
autocomplete="off" tabindex="2"/>
<input type="hidden" name="verify" value="cacheonly" />
<input type="hidden" name="maximumRecords" value="10" />
<input type="hidden" name="meanCount" value="5" />
<input type="hidden" name="resource" value="local" />
<input type="hidden" name="urlmaskfilter" value=".*" />
<input type="hidden" name="prefermaskfilter" value="" />
<input type="hidden" name="display" value="2" />
<input type="hidden" name="nav" value="all" />
<input type="submit" name="Enter" value="Search" title="Suche"
alt="Suche" hidden>
</form>
<a href="../idee-rss.xml" tabindex="3">
<img src="../image/RSS.png" alt="RSS-Feed" width="1em"/>
RSS</a>
<a href="../article/rechtliches.html" rel="nofollow"
alt="Impressum, Urheberrecht und Datenschutz" tabindex="4">
<img src="../image/Legal.png" alt="RSS-Feed" width="1em"/>
Rechtliches</a>
<a href="../archive/idee-archive.html"
alt="Archiv" tabindex="5">
<img src="../image/Archive.png" alt="Archiv" width="1em"/>
Archiv</a>
</nav>
<hr/>
<script src="../js/header.js" type="text/javascript" defer></script>
</header>
Portal Page Generation: The PlainWorker
~/projects/idee/generator/plainworker.py
"""
PlainWorker is derived from the MsgWorker base class.
@author: Frank Siebert
@website: https://idee.frank-siebert.de
@license: https://creativecommons.org/publicdomain/zero/1.0/deed.en
@date: 2022-03-15
The PlainWorker takes care of *.mediawiki files
in the author directory, if changes are committed
for them.
"""
import re
import subprocess
import qrcode
from bs4 import BeautifulSoup
from bs4.builder._htmlparser import HTMLParserTreeBuilder
from gitmsgdispatcher import GitMsgDispatcher
from gitmsgdispatcher import MsgWorker
from gitmsgconstants import GitMsgConstants as gmc
from pdfworker import PdfWorker
from pubmetadata import PubMetaData
class PlainWorker(MsgWorker):
"""
The PlainWorker takes care of *.mediawiki files in the author/ directory.
Example of a line taken care for
# modified: author/PDF-Icon.mediawiki
The line has to be from the section git message section:
# Changes to be committed:
The main output is an HTML created from the mediawiki file,
which is plain (without portal part) and stored in the
folder GITROOT/plain/
A minor output, a PDF, might be requested via the message line:
# pdf:draft=true
The respective PDF is created from HTML and stored in the folder
GITROOT/website/pdf/
Parameters
----------
super: MsgWorker
The PlainWorker is derived from the MsgWorker.
Returns
-------
PlainWorker.
"""
    portal_header_fragment = None
    licence = "./creative-commons-cc0-1-0-universal.html"
    ccimg = "../image/CC-Icon.png"
    cc0img = "../image/CC0-Icon.png"
def __init__(self, pattern):
super().__init__(pattern)
self.values = {}
@staticmethod
def __make_qrcode__(stem):
"""
Create a qrcode for the page, whose stem name is provided.
The created qrcode is saved in the site's qrcode directory.
We create a QR Code for each article, containing its URL
Parameters
----------
stem : String
Returns
-------
None.
"""
        docurl = gmc.website + "/article/" + stem + ".html"
        image = qrcode.make(data=docurl)
        qrpath = gmc.qrpath / stem
        qrpath = qrpath.with_suffix(".png")
        qrpath.resolve()
        image.save(qrpath)
        print('wrote file {0}'.format(qrpath))
@staticmethod
def __make_portal_page__(soup, urn, create_pdf):
"""
Inject the portal into prepared HTML.
Function:
The tag <header> in the context of <body>
is replaced with the portal header.
Parameters
----------
        soup : BeautifulSoup, required
            HTML page as BeautifulSoup object.
urn : Str
Unique Resource Identifier also used as stem in related files
Returns
-------
soup.
"""
        # include the favicon just behind the css link
        csslink = soup.find("link")
        newtag = soup.new_tag("link")
        newtag.attrs.update({"rel": "icon",
                             "href": r"../image/favicon.ico",
                             "type": "image/x-icon"
                             })
        csslink.insert_after(newtag)

        # inject article artefacts
        tag = soup.find("article")
        tag = tag.find("header")
        headermedia = soup.new_tag("div")
        tag.append(headermedia)

        # Move Article -> Div Artefacts to Article -> Header -> Div
        tag = soup.find('article')
        tag = tag.find("div")
        if tag:
            headermedia.replace_with(tag)
            headermedia = tag

        tag = soup.find("article")
        tag = tag.find("div")

        if create_pdf:
            newtag = soup.new_tag("figure")
            headermedia.insert(1, newtag)
            tag = newtag
            newtag = soup.new_tag("a")
            tag.append(newtag)
            newtag.attrs.update({"accesskey": "p",
                                 # "download": "",
                                 "href": r"../pdf/" + urn + ".pdf",
                                 "target": "_blank",
                                 "type": "application/pdf"
                                 })
            # Inject the PDF Icon
            tag = newtag
            newtag = soup.new_tag("img")
            tag.append(newtag)
            newtag.attrs.update({"src": "../image/" + gmc.pdfimage})

        # Inject the Audio Player, if an audio does exist
        audio = gmc.audiopath / (urn + ".mp3")
        audio.resolve()
        if audio.exists():
            audio = r"../audio/" + urn + ".mp3"
            newtag = soup.new_tag("figure")
            headermedia.append(newtag)
            tag = newtag
            newtag = soup.new_tag("audio")
            tag.append(newtag)
            newtag.attrs.update({"accesskey": "a",
                                 "type": "audio/mp3",
                                 "preload": "none",
                                 "controls": "true",
                                 "src": audio})

        # Finally, no more additions expected,
        # We give every anchor a tabindex
        # 5 (or less) Tabindexes are in the portal header
        index = 6
        tags = soup.find_all(re.compile(r"^a$|^audio$|^input$"))
        for tag in tags:
            tag.attrs.update({"tabindex": index})
            index += 1

        return soup
def process(self):
"""
Process the plain HTML files into article HTML files.
Returns
-------
None.
"""
        # inject meta information from commit message
        # Creates the single instance of PubMetaData
        PubMetaData(self.dispatcher.parameters.values)

        # compose the output path
        self.outpath = gmc.articlepath / self.inpath.stem
        self.outpath = self.outpath.with_suffix(".html")
        self.outpath.resolve()

        # The plain html contains a publishing date.
        # But this might be the date the plain html was created,
        # and not the real publishing date, if no previous publishing
        # took place.
        # We need to read the plain html and use the title to search
        # for a publishing date of previous publishings.
        # If we do not find a previous publishing date, we need
        # to change the publishing date entries to the current date.
        with open(self.inpath, 'r') as infile:
            html_doc = infile.read()
            infile.flush()
            infile.close()

        builder = HTMLParserTreeBuilder()
        soup = BeautifulSoup(html_doc, builder=builder)

        # Own magic words:
        # __NOPDF__ Do not create PDF
        # __NOLIC__ Place no own CC0 license information
        # Nothing but whitespaces and magic word in one line
        create_pdf = True
        tag = soup.find("p", string=re.compile(r'^\s*__NOPDF__\s*$'))
        if tag:
            create_pdf = False
            tag.decompose()

        show_lic = True
        tag = soup.find("p", string=re.compile(r'^\s*__NOLIC__\s*$'))
        if tag:
            show_lic = False
            tag.decompose()

        title = soup.find("title").text.strip()

        article_data = PubMetaData.instance.get_new_revision(
            title=title,
            urn=self.inpath.stem  # takes preference before title
        )

        tag = soup.find("meta", attrs={"property": PubMetaData.pubdate})
        tag.attrs.update({"property": PubMetaData.pubdate,
                          "content": article_data[PubMetaData.pubdate]})

        tag = soup.find("time")
        tag.clear()
        tag.append(article_data[PubMetaData.pubdate][:10])
        tag.attrs.update({"datetime": article_data[PubMetaData.pubdate][:19]})
        # probably deprecated by itemprop alternative
        tag.attrs.update({"pubdate": "true"})

        tag = soup.find(
            "meta", attrs={"property": PubMetaData.revdate})
        if not tag:
            # inject the modified_time as meta tag
            head = soup.find("head")
            tag = soup.new_tag("meta")
            head.insert(6, tag)
        tag.attrs.update({"property": PubMetaData.revdate,
                          "content": article_data[PubMetaData.revdate]})

        # take care for links
        # For a start we know, that "../website/" becomes "../".
        tags = soup.find_all(re.compile("link|a"),
                             attrs={"href": re.compile(r"../website/")})
        for tag in tags:
            shref = tag["href"]
            shref = shref.replace("../website/", "../")
            tag.attrs.update({"href": shref})

        tags = soup.find_all("img",
                             attrs={"src": re.compile(r"../website/")})
        for tag in tags:
            shref = tag["src"]
            shref = shref.replace("../website/", "../")
            tag.attrs.update({"src": shref})

        # Insert header div for article artefacts
        # Embed it into the article.
        tag = soup.find("article")
        headerdiv = soup.new_tag("div")
        tag.insert(1, headerdiv)

        # Create QR code for the document and the site.
        # Embed it into header div.
        self.__make_qrcode__(self.inpath.stem)
        qruri = "../qrcode/" + self.inpath.stem + ".png"
        newtag = soup.new_tag("figure")
        headerdiv.append(newtag)
        tag = newtag
        newtag = soup.new_tag("figcaption")
        # Decided in the end to get rid of text for the QR Code
        # newtag.append(soup.new_string("URL"))
        tag.insert(0, newtag)

        newtag = soup.new_tag("a")
        newtag.attrs.update({"href": qruri})
        tag.insert(0, newtag)
        tag = newtag
        newtag = soup.new_tag("img")
        newtag.attrs.update({"width": "150px", "height": "150px"})
        newtag.attrs.update({"src": qruri})
        newtag.attrs.update({"alt": "QR Code"})
        tag.insert(0, newtag)

        if show_lic:
            newtag = soup.new_tag("a")
            headerdiv.append(newtag)
            newtag.attrs.update({"href": PlainWorker.licence})
            tag = newtag
            newtag = soup.new_tag("img")
            # The following scaling is for the PDF
            # In the browser the CSS overwrites this scaling:
            newtag.attrs.update({"width": "28px", "height": "28px"})
            newtag.attrs.update({"src": PlainWorker.ccimg})
            newtag.attrs.update({"alt": "Creative Commons"})
            tag.insert(0, newtag)

            newtag = soup.new_tag("a")
            headerdiv.append(newtag)
            newtag.attrs.update({"href": PlainWorker.licence})
            tag = newtag
            newtag = soup.new_tag("img")
            # The following scaling is for the PDF
            # In the browser the CSS overwrites this scaling:
            newtag.attrs.update({"width": "28px", "height": "28px"})
            newtag.attrs.update({"src": PlainWorker.cc0img})
            newtag.attrs.update({"alt": "Zero"})
            tag.insert(0, newtag)

        # Make a portal page from the html
        soup = self.__make_portal_page__(soup, self.inpath.stem, create_pdf)
        html_doc = soup.prettify()

        # Save the article.
        with open(self.outpath, 'w') as outfile:
            print(html_doc, file=outfile)
            outfile.flush()
            outfile.close()
        print('wrote file {0}'.format(self.outpath))

        subprocess.run(["firefox", self.outpath], capture_output=False)

        # Flag a metadata update
        PubMetaData.instance.update(article_data)

        if create_pdf:
            # Placing a worklist item for the PdfWorker
            self.dispatcher.worklist.append(
                PdfWorker.make_pdf_worklist_item(
                    article_data.name,
                    html_doc,
                    gmc.articlepath,
                    MsgWorker.task_create,
                    draft=False
                )
            )
def delete(self):
"""
Delete the generated HTML.
Resources used by the HTML need additional care.
If the delete was triggered by rename, no resources have to be deleted.
If it was triggered by a delete, a check is required,
whether the resources are used by other pages as well.
But resources are placed anyhow in the final website location.
They must not be deleted by the PlainWorker.
"""
if __name__ == "__main__":
from mwworker import MwWorker
print("Running Test-Cases")
= MwWorker(r".*modified.*author[/].*\.mediawiki")
mwworker = PlainWorker(r".*[modified|new file].*plain[/].*\.html")
plainworker = PdfWorker(r"" + PdfWorker.pdfworkitem)
pdfworker
# MESSAGEFILE = "test/PDF-Icon-TestCase-2"
# MESSAGEFILE = "test/cc-plain-testcase"
# MESSAGEFILE = "test/englands-gesamttodesraten-TestCase-2"
# MESSAGEFILE = "test/endlich-TestCase-2"
# MESSAGEFILE = "test/ich-denke-TestCase-2"
# MESSAGEFILE = "test/astravacz-TestCase-2"
= "test/allesaufdentisch-TestCase-2"
MESSAGEFILE = GitMsgDispatcher(MESSAGEFILE, [mwworker, plainworker, pdfworker]) disp
Portal Page Conversion Recapitulation
With this code part included we can create the final article HTML. We can also view it in the browser with its QRCode, PDF, license information and audio. Thanks to relative paths everything works in the locally viewed HTML file. But to see it as a portal page, we need to set up nginx to perform the include.
To trigger this conversion, another git add and git commit sequence is required. This makes sense, since the scenario treats the plain HTML version as the base for copy-editing and audio recording.
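For illustration, the second stage of the publishing workflow could look like this (a hedged sketch; the file name is hypothetical):
frank@Asimov:~/projects/idee$ git add plain/example-article.html
frank@Asimov:~/projects/idee$ git commit
frank@Asimov:~/projects/idee$ git push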
Idee Website Server Setup
User and Group git
A user named git is used and the server git repository resides in /home/git/idee.git/.
Create git
The following command creates an empty git repository without a working directory (--bare), which is supposed to be shared between multiple users (--share=group).
git@sol:~$ git init --bare --share=group idee.git
Git initializes the folders with the group permission flag set (setgid), so the group is inherited down the directory tree.
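To verify this (a hedged check, not part of the original setup), one can list any directories that are missing the setgid bit; an empty result means every directory inherits the group:
git@sol:~$ find idee.git -type d ! -perm -g+s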
Push from client git
Since I started without a server git, I needed to connect my client git with the server. I did this by editing the config file in the client's .git/ directory, providing information about the remote "origin".
.git/config
[core]
repositoryformatversion = 0
filemode = true
bare = false
logallrefupdates = true
hooksPath = ./config/hooks
quotepath = off
[remote "origin"]
url = ssh://git@sol/home/git/idee.git
fetch = +refs/heads/*:refs/remotes/origin/*
[branch "master"]
remote = origin
merge = refs/heads/master
[commit]
template = ./config/commit-message
[status]
relativePaths = false
As can be seen, I use the user git for ssh access.
initial push
frank @Asimov:~/projects/idee$ git push
Enter passphrase for key '/home/frank/.ssh/id_rsa':
Enumerating objects: 1156, done.
Counting objects: 100% (1156/1156), done.
Delta compression using up to 4 threads
Compressing objects: 100% (496/496), done.
Writing objects: 100% (1156/1156), 28.32 MiB | 7.89 MiB/s, done.
Total 1156 (delta 675), reused 1064 (delta 616), pack-reused 0
remote: Resolving deltas: 100% (675/675), done.
To ssh://sol/home/git/idee.git
* [new branch] master -> master
/home/git/idee.git/hooks/post-receive
#!/bin/bash
#
# The hook "post-receive" takes care for the
# deployment after all pushed files where
# successfully stored.
#
# The deployment is implemented as pull
# from a client git on the servers wwww folder.
# prevent message: "fatal: Not a git repository: '.'"
unset $(git rev-parse --local-env-vars)
cd /var/www/idee/
git pull
I found the solution for the error message at "Git Hook Pull After Push - remote: fatal: Not a git repository: '.' · Joe Januszkiewicz" 19
/var/www/idee
I create the server-side client git also as a shared git, making sure that www-data has sufficient rights to read everything as a member of the group git.
git@sol:/var/www$ git init --share=group idee
Initialized empty shared Git repository in /mnt/data/www/idee/.git/
git@sol:/var/www/idee/.git$ git remote add origin /home/git/idee.git
The branch master was set in the config file with a text editor.
[core]
repositoryformatversion = 0
filemode = true
bare = false
logallrefupdates = true
sharedrepository = 1
[receive]
denyNonFastforwards = true
[remote "origin"]
url = /home/git/idee.git
fetch = +refs/heads/*:refs/remotes/origin/*
[branch "master"]
remote = origin
merge = refs/heads/master
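The [branch "master"] entries could also have been written by git itself instead of a text editor (a hedged equivalent of the manual edit):
git@sol:/var/www/idee$ git config branch.master.remote origin
git@sol:/var/www/idee$ git config branch.master.merge refs/heads/master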
Testing the pull
git@sol:/var/www/idee$ git pull
git@sol:/var/www/idee$ ls -la
total 32
drwxrwxr-x 8 www-data www-data 4096 Feb 3 20:11 .
drwxr-xr-x 10 root root 4096 Jan 12 23:33 ..
drwxr-xr-x 2 git git 4096 Feb 3 20:11 author
drwxr-xr-x 3 git git 4096 Feb 3 20:11 config
drwxr-xr-x 2 git git 4096 Feb 3 20:11 generator
drwxrwsr-x 8 git git 4096 Feb 3 20:11 .git
drwxr-xr-x 2 git git 4096 Feb 3 20:11 plain
drwxr-xr-x 11 git git 4096 Feb 3 20:11 website
Since the pull runs as the same user and the remote location is actually local, no password is requested, and nothing needs to be set up to answer a password prompt.
Providing www-data with group permission
root @sol:/home/git/idee.git/hooks# adduser www-data git
Adding user `www-data' to group `git' ...
Adding user www-data to group git
Done.
Creating a nginx site
The following server definition for nginx uses http instead of https. That's not a problem, since it is only used for testing and migration in the local network.
/etc/nginx/sites-available/idee_88
# Idee Server Configuration
#
server {
listen 88 default_server;
listen [::]:88 default_server;
disable_symlinks off;
root /var/www/idee/website;
# Add index.php to the list if you are using PHP
index index.html index.htm index.nginx-debian.html;
server_name _;
location / {
# First attempt to serve request as file, then
# as directory, then fall back to displaying a 404.
try_files $uri $uri.html $uri/ =404;
}
location /yacysearch.html {
set $myquery '';
set $other '';
if ($args ~* query=([^&]*)(.*)){
set $myquery $1;
set $other $2;
}
if ($myquery !~* (site(%3a|:)idee\.frank-siebert\.de)) {
set $args query=$myquery+site:idee.frank-siebert.de$other;
}
proxy_pass https://yacy.frank-siebert.de/yacysearch.html;
}
}
This configuration also does the heavy lifting for the YaCy search integration. The main effort was the part which enforces that the site: filter is passed on to YaCy, restricting search results to my own web pages.
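To illustrate the effect of the two if-blocks (a hypothetical query; parameter order may differ in practice):
# Request coming from the search form:
#   /yacysearch.html?query=erkenntnis&verify=cacheonly&resource=local
# Request forwarded to YaCy after the site filter has been appended:
#   /yacysearch.html?query=erkenntnis+site:idee.frank-siebert.de&verify=cacheonly&resource=local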
Enabling the new site
root @sol:/etc/nginx/sites-enabled# ln -s ../sites-available/idee_88 .
root @sol:/etc/nginx/sites-enabled# nginx -t
nginx: the configuration file /etc/nginx/nginx.conf syntax is ok
nginx: configuration file /etc/nginx/nginx.conf test is successful
root @sol:/etc/nginx/sites-enabled# nginx -s reload
Test-URL
http://sol:88/article/verstehen.html
The server works and the YaCy search works as well, but naturally the links still point to the WordPress instance. A redirect from the old to the new URL pattern is required, and the migration of the content is still pending.
But sitemap, RSS and index page are the next most important parts to be implemented.
Test run on this article
Doing a test run on this article, while it is obviously still work in progress, reveals that it renders nicely; even the source code sections are very pretty, without my investing time to make them look nice.
Every source code line is a reference
That is really nice for a number of use cases.
TODO: I have to take care that these source code references do not each get a tabindex, or blind people will start to hate me.
Source code in the PDF
Source code in the PDF gets colored very nicely. DONE: I have to take care that the source code does not flow out of the page.
After refactoring the program export.py , where I took care to restrict the code to 80 characters per line, the PDF print of this program stays inside the page borders.
Sitemap Implementation
The "Sitemaps XML format " 20 description explains the concept and the XML document structure of sitemaps. It seems to be quite simple, if I just write down some code to create the respective xml-elements and to persist the document afterwards.
Sometimes knowledge makes everything a bit more complicated. I know that I should validate the resulting XML against its schema, and, committed to high quality, I started to dig into question, how this validation has to be set up on a Linux system to work, lets say, first of all in vim.
This theme turns out to be quite complex, and it is independent enough to get its own article: Validating XML in vim .
Fortunately the setup done for the validation in vim will provide also everything required for a validation without vim.
Python3-lxml
The module lxml is required to create XML via BeautifulSoup.
frank @Asimov:~$ sudo apt-get install python3-lxml
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
python3-lxml is already the newest version (4.6.3+dfsg-0.1+deb11u1).
python3-lxml set to manually installed.
0 upgraded, 0 newly installed, 0 to remove and 1 not upgraded.
Requirements Draft
Sitemaps will be split into monthly maps. Content will be listed in the month of its first publishing. If content from earlier months needs an update (e.g. when I migrate the content), the respective older sitemaps are updated accordingly.
I'll not implement the hreflang link stuff, since I do not expect much overlap between English and German content. However, since I plan to use two different site names, "Concept" in English, "Idee" in German, I think I should have two different sitemap trees.
My sitemap tree will start with one sitemap.xml referencing idee-map.xml and concept-map.xml, which in turn reference down to idee-yyyy-MM.xml and concept-yyyy-MM.xml files. Since the sitemap specification itself does not provide language information, Google may figure out the page languages by content itself.
Since the content I provide on my German site is heavily suppressed by Google anyhow, I do not really care to optimize much to ease Google's life.
Solution Specification
Every sitemap update involves 3 sitemap files: the monthly file, the site file and the top file. The information about the required sitemap changes is found in PubMetaData.instance._updates and PubMetaData.instance._deletions.
<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<sitemap>
<loc>https://idee.frank-siebert.de/idee-map.xml</loc>
<lastmod>2021-03-31T18:23:17+00:00</lastmod>
</sitemap>
<sitemap>
<loc>https://idee.frank-siebert.de/concept-map.xml</loc>
<lastmod>2005-01-01</lastmod>
</sitemap>
</sitemapindex>
The modification of the sitemaps starts at the leaves of the sitemap tree, which is easily possible since the respective map can be found via the article:modified_time information and the og:site_name information. og:site_name is either "idee" or "concept".
The monthly sitemaps are stored in a dedicated folder named sitemaps to keep the root directory clean.
A class SiteMap applies all the changes. The timestamps for all changes done in the 3 top-level sitemaps during one publishing commit will always be the same.
The sitemap.xml, idee-map.xml and concept-map.xml are created in the root directory and pre-created via text editor to provide the general structure.
A monthly-map.xml template is created in a text editor and provided in the portal folder next to the other already existing templates. It simply contains the top level element and the xmlns information.
Implementation Result
~/projects/idee/generator/sitemap.py
"""
Update the sitemap of the website.
@author: Frank Siebert
@license: https://creativecommons.org/publicdomain/zero/1.0/deed.en
@date: 2022-03-15
@author: Frank Siebert
"""
import re
import datetime
from pubmetadata import PubMetaData
from gitmsgconstants import GitMsgConstants as gmc
from bs4 import BeautifulSoup
from bs4.builder._lxml import LXMLTreeBuilderForXML
= "urlset"
URLSET_TAG = "url"
URL_TAG = "loc"
LOC_TAG = "lastmod"
LASTMOD_TAG # Not used:
# CHANGEFREQ_TAG = "changefreq"
# PRIORITY_TAG = "priority"
= "sitemapindex"
INDEX_TAG = "sitemap"
SIDEMAP_TAG
class SiteMap():
"""Manage all changees in the sitemaps."""
def __init__(self):
"""
Initialize changelists.
Returns
-------
None.
"""
# map information for German page changes on site "Idee".
self.de_list = []
# map information for English page changes on site "Concept".
self.en_list = []
# The time of the update
self._nowdate = datetime.datetime.now().isoformat()
def update(self):
"""
Iterate over changes and update respective sitemaps.
Add the respective sitemaps to their respective change list.
The information about the changed html pages comes from
PubMetaData.instance._updates and
PubMetaData.instance._deletions .
Returns
-------
None.
"""
        for article_data in PubMetaData.instance._updates:
            creation_month = article_data[PubMetaData.pubdate][0:7]
            site = article_data[PubMetaData.site]
            sitemap_path = site.lower() + "-" + creation_month + ".xml"
            sitemap_path = gmc.sitemappath / sitemap_path

            if site == "Idee":
                if sitemap_path not in self.de_list:
                    self.de_list.append(sitemap_path)
            else:
                if sitemap_path not in self.en_list:
                    self.en_list.append(sitemap_path)

            if article_data.name != "rechtliches" \
                    and article_data.name != "legal":
                self._update(sitemap_path, article_data)

        for article_data in PubMetaData.instance._deletions:
            # TODO
            pass

        self._update_de()
        self._update_en()
        self._update_main()
def _update_de(self):
"""Update idee-map.xml."""
if len(self.de_list) == 0:
return
        with open(gmc.idee_map, 'r') as sitemap_file:
            xml_doc = sitemap_file.read()
            sitemap_file.flush()
            sitemap_file.close()

        builder = LXMLTreeBuilderForXML
        soup = BeautifulSoup(xml_doc, builder=builder, features='xml')

        for sitemap_path in self.de_list:
            url = gmc.website + "/" + sitemap_path.name
            tag = soup.find(LOC_TAG, text=re.compile(r"" + url))

            if not tag:
                tag = soup.find(INDEX_TAG)
                new_tag = soup.new_tag(SIDEMAP_TAG)
                tag.append(new_tag)
                tag = new_tag
                new_tag = soup.new_tag(LOC_TAG)
                new_tag.string = url
                tag.append(new_tag)
                new_tag = soup.new_tag(LASTMOD_TAG)
                tag.append(new_tag)
            else:
                tag = tag.parent

            # tag holds now the correct SIDEMAP_TAG.
            # Either it had been found or created.
            # All used child tags exist also.
            tag = tag.find(LASTMOD_TAG)
            tag.string = self._nowdate

        xml_doc = soup.prettify()

        with open(gmc.idee_map, 'w') as sitemap_file:
            print(xml_doc, file=sitemap_file)
            sitemap_file.flush()
            sitemap_file.close()
def _update_en(self):
"""Update concept-map.xml."""
if len(self.en_list) == 0:
return
        with open(gmc.concept_map, 'r') as sitemap_file:
            xml_doc = sitemap_file.read()
            sitemap_file.flush()
            sitemap_file.close()

        builder = LXMLTreeBuilderForXML
        soup = BeautifulSoup(xml_doc, builder=builder, features='xml')

        for sitemap_path in self.en_list:
            url = gmc.website + "/" + sitemap_path.name
            tag = soup.find(LOC_TAG, text=re.compile(r"" + url))

            if not tag:
                tag = soup.find(INDEX_TAG)
                new_tag = soup.new_tag(SIDEMAP_TAG)
                tag.append(new_tag)
                tag = new_tag
                new_tag = soup.new_tag(LOC_TAG)
                new_tag.string = url
                tag.append(new_tag)
                new_tag = soup.new_tag(LASTMOD_TAG)
                tag.append(new_tag)
            else:
                tag = tag.parent

            # tag holds now the correct SIDEMAP_TAG.
            # Either it had been found or created.
            # All used child tags exist also.
            tag = tag.find(LASTMOD_TAG)
            tag.string = self._nowdate

        xml_doc = soup.prettify()

        with open(gmc.concept_map, 'w') as sitemap_file:
            print(xml_doc, file=sitemap_file)
            sitemap_file.flush()
            sitemap_file.close()
def _update_main(self):
"""Update sitemap.xml."""
if len(self.de_list) == 0 and len(self.en_list) == 0:
return
        with open(gmc.sitemap, 'r') as sitemap_file:
            xml_doc = sitemap_file.read()
            sitemap_file.flush()
            sitemap_file.close()

        builder = LXMLTreeBuilderForXML
        soup = BeautifulSoup(xml_doc, builder=builder, features='xml')

        if len(self.de_list) > 0:
            url = gmc.website + "/" + gmc.idee_map.name
            tag = soup.find(LOC_TAG, text=re.compile(r"" + url))
            # We know in this case, that the tag exists
            tag = tag.parent
            tag = tag.find(LASTMOD_TAG)
            tag.string = self._nowdate

        if len(self.en_list) > 0:
            url = gmc.website + "/" + gmc.concept_map.name
            tag = soup.find(LOC_TAG, text=re.compile(r"" + url))
            # We know in this case, that the tag exists
            tag = tag.parent
            tag = tag.find(LASTMOD_TAG)
            tag.string = self._nowdate

        xml_doc = soup.prettify()

        with open(gmc.sitemap, 'w') as sitemap_file:
            print(xml_doc, file=sitemap_file)
            sitemap_file.flush()
            sitemap_file.close()
@staticmethod
def _update(sitemap_path, article_data):
        sitemap_path.resolve()
        if sitemap_path.exists():
            with open(sitemap_path, 'r') as sitemap_file:
                xml_doc = sitemap_file.read()
                sitemap_file.flush()
                sitemap_file.close()
        else:
            gmc.map_template.resolve()
            with open(gmc.map_template, 'r') as sitemap_file:
                xml_doc = sitemap_file.read()
                sitemap_file.flush()
                sitemap_file.close()

        builder = LXMLTreeBuilderForXML
        soup = BeautifulSoup(xml_doc, builder=builder, features='xml')

        article_url = gmc.website + "/"\
            + "article" + "/" + article_data.name + ".html"
        tag = soup.find(LOC_TAG, text=re.compile(r"" + article_url))
        if not tag:
            tag = soup.find(URLSET_TAG)
            new_tag = soup.new_tag(URL_TAG)
            tag.append(new_tag)
            tag = new_tag
            new_tag = soup.new_tag(LOC_TAG)
            new_tag.string = article_url
            tag.append(new_tag)
            new_tag = soup.new_tag(LASTMOD_TAG)
            tag.append(new_tag)
        else:
            tag = tag.parent

        # tag holds now the correct URL_TAG.
        # Either it had been found or created.
        # All used child tags exist also.
        tag = tag.find(LASTMOD_TAG)
        tag.string = article_data[PubMetaData.revdate]

        xml_doc = soup.prettify()

        with open(sitemap_path, 'w') as sitemap_file:
            print(xml_doc, file=sitemap_file)
            sitemap_file.flush()
            sitemap_file.close()
RSS - Really Simple Syndication
The RSS will be based on the standard described by the "Feed Validation Service" 21 and "RSS 2.0 Specification" 22 .
<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
xmlns:content="http://purl.org/rss/1.0/modules/content/"
xmlns:atom="http://www.w3.org/2005/Atom"
>
<channel>
<title>Idee der eigenen Erkenntnis</title>
<atom:link href="https://idee.frank-siebert.de/idee-rss.xml" rel="self"
type="application/rss+xml" />
<link>https://idee.frank-siebert.de</link>
<description>Idee</description>
<lastBuildDate>Tue, 11 Jan 2022 07:54:24 +0000</lastBuildDate>
<language>de-DE</language>
<generator>pandoc, fs-commit-msg-hook 1.0</generator>
<image>
<url>https://idee.frank-siebert.de/image/favicon-256x256-150x150.png</url>
<title>Idee der eigenen Erkenntnis</title>
<link>https://idee.frank-siebert.de</link>
<width>32</width>
<height>32</height>
</image>
<item>
<title>Best Article Ever Written</title>
<link>
https://idee.frank-siebert.de/article/best-article-ever-written.html</link>
<pubDate>Tue, 11 Jan 2022 07:50:11 +0000</pubDate>
<category><![CDATA[Uncategorized]]></category>
<guid isPermaLink="false">
https://idee.frank-siebert.de/article/best-article-ever-written.html-2022-01-11T07:50:11</guid>
<description>
<![CDATA[First 406 characters of the article, followed by ...]]>
</description>
<content:encoded><![CDATA[<article>......</article>]]></content:encoded>
<enclosure
url="https://idee.frank-siebert.de/audio/best-article-ever-written.mp3"
length="9090090" type="audio/mpeg" />
</item>
</channel>
</rss>
The article content will be embedded completely into the RSS, enclosed in a CDATA section and encoded in UTF-8. To be able to include the complete content, the extension "RDF Site Summary 1.0 Modules: Content" 23 with the namespace declaration xmlns:content="http://purl.org/rss/1.0/modules/content/" needs to be used.
The RSS file will reference its own web location via atom:link, which requires the namespace entry xmlns:atom="http://www.w3.org/2005/Atom" for the atom:link line shown above.
To make sure existing feed consumers are served the RSS feed without any need to change their subscribed link, the nginx configuration needs a location /feed/ that redirects to the RSS file.
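A minimal location block for this redirect could look like the following sketch (assuming the German feed as the target; the actual configuration may differ):
location /feed/ {
    return 301 /idee-rss.xml;
}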
Since the RSS consumers will most likely use the channel's title to present the feed items, and this title is the site title, two different RSS xml files are required, one for the site Idee and one for the site Concept. An additional reason to create two RSS xml files is the language information, which can be provided only once in the language tag of the channel.
The specification, according to the post "Multiple channels in a single RSS xml - is it ever appropriate?" 25 does not allow more than one channel in one RSS xml file.
The question left: What's the unit of the length attribute in the enclosure tag?
I found that WordPress provides the size of the file in bytes as value for this attribute, which was also the most probable answer to this question.
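For illustration, the byte count can simply be read from the file system (a minimal sketch; the file path is hypothetical):
from pathlib import Path

# File size in bytes, used as the value of the enclosure length attribute
length = Path("website/audio/example-article.mp3").stat().st_size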
RSS feeds to create
- idee-rss.xml
  - title: Idee der eigenen Erkenntnis
  - link: https://idee.frank-siebert.de
  - description: Idee
  - language: de (or de-DE)
- concept-rss.xml
  - title: Concept of new cognition elicitation personally thinking
  - link: https://idee.frank-siebert.de
  - description: Concept
  - language: en (or en-US)
Article Updates
If articles are updated after publishing, RSS does not provide any option to inform about the date of revision. The best idea of how such an update could be communicated to consumers is described in the post "RSS update single item" 26 .
The idea is to change the guid of the item to signal that the item contains changed content. The answer was not marked as correct, but it was the only answer provided.
The implementation choice is to use the link and timestamp of the update as combined guid string.
In GPodder, the GUID change sometimes resulted in duplicate entries shown for one article, which is not the intended result. However, GPodder recognized changes in the content without any additional signaling. At least that's the current impression.
Number of RSS items
The RSS files will contain the last 10 articles, the last created/updated one first. Since I plan to migrate articles in the sequence of their original publishing, I'll come out of the migration with my latest articles automatically being featured in the RSS feed, with the only difference that I will have two feeds in the new solution.
Templates
The RSS feed implementation will start off with two templates, one for the English and one for the German version, in the folder portal, containing only the channel information; the items are to be added.
After the initial feed creation the templates are no longer required, but I'll keep them anyhow. Presumably the implementation will be very similar to the sitemap implementation.
Implementation
The implementation of the RSS feed generator turned out to be much more cumbersome than expected. Python's module BeautifulSoup gives you the alternative to use the LXMLTreeBuilderForXML, which will nicely write CDATA sections, but will remove them and HTML-encode their content (you know, &gt; and such) when it reads the XML.
The alternative HTMLParserTreeBuilder works nicely for XML as well, as long as all XML tags are lower-case. But since this was not mentioned anywhere I looked for solutions to the first problem, I had to find out this second problem by myself.
Just by luck I found out, before trying it in an implementation, that using the lxml package without BeautifulSoup would not solve problem number one.
After careful reading I based my third implementation on the module xml.dom.minidom. This is a pretty low-level implementation requiring some more lines of code, but it provides the required control over the CDATA sections and does not overwrite my implementation decision when it reads the XML.
It just has the annoying habit of adding empty lines containing only whitespace with its method toprettyxml(). Every time you read and save, it will add an additional line between otherwise untouched lines of the XML. But this is at least easily solved by two regex pattern substitutions, without any risk of mistakenly altering the HTML fragments embedded via CDATA.
The following code shows the current implementation. The result has been tested with GPodder, Liferea and Tidings, where GPodder cares only for items with a media reference in the enclosure tag, while Tidings and Liferea show items regardless of the presence of an enclosure.
~/projects/idee/generator/rssbuilder.py
"""
Update the RSS feed of the website.
@author: Frank Siebert
@license: https://creativecommons.org/publicdomain/zero/1.0/deed.en
@date: 2022-03-15
All links provided relative to the /article/ folder
@author: Frank Siebert
"""
import re
import datetime
import xml.dom.minidom
from bs4 import BeautifulSoup
from bs4.builder._htmlparser import HTMLParserTreeBuilder
from pubmetadata import PubMetaData
from pubmetadata import pageurn
from gitmsgconstants import GitMsgConstants as gmc
= "channel"
CHANNEL_TAG = "lastBuildDate"
LASTBUILD_TAG = "item"
ITEM_TAG = "title"
TITLE_TAG = "link"
LINK_TAG = "pubDate"
PUBDATE_TAG = "guid"
GUID_TAG = "description"
DESCRIPTION_TAG = "content:encoded"
CONTENT_TAG = "enclosure"
ENCLOSURE_TAG = "audio"
AUDIO_TAG
# for the testing on server sol
# HOST = "http://sol:88/"
# for the website
= gmc.website + "/"
HOST
# Number of items to included into the RSS feed
= 15
ITEM_COUNT
def by_pub_date(article_data):
"""
Return the publishing date as sort criteria.
Parameters
----------
    article_data : Series
        The article data entry.
Returns
-------
TYPE
Date as Str
"""
return article_data[PubMetaData.pubdate]
class RSSBuilder():
"""Manage all changees in the sitemaps."""
def __init__(self):
"""
Initialize changelists.
The information about the changed html pages comes from
PubMetaData.instance._updates and
PubMetaData.instance._deletions .
Returns
-------
None.
"""
# information for German page changes on site "Idee".
self.de_list = []
# information for English page changes on site "Concept".
self.en_list = []
# The time of the update
self._nowdate = datetime.datetime.now().isoformat()
# soup of currently processed RSS xml
self._rss_xml = None
# soup tag of currently processed article
self._article_tag = None
for article_data in PubMetaData.instance._updates:
if article_data[PubMetaData.site] == "Idee" \
and article_data.name != "rechtliches":
self.de_list.append(article_data)
else:
if article_data.name != "legal":
self.en_list.append(article_data)
for article_data in PubMetaData.instance._deletions:
# TODO
pass
# Default sort is ascending, oldest posts first in list
self.de_list.sort(key=by_pub_date)
self.en_list.sort(key=by_pub_date)
def update(self):
"""
Iterate over changes and update respective rss files.
Returns
-------
None.
"""
# Update idee-rss.xml.
if len(self.de_list) > 0:
self._update(self.de_list, gmc.idee_rss)
# Update concept-rss.xml.
if len(self.en_list) > 0:
self._update(self.en_list, gmc.concept_rss)
def _read_article_tag(self, article_data):
"""
Read the article tag of the processed article.
The article tag gets assigned to self._article_tag
Returns
-------
None.
"""
        articlepath = gmc.articlepath / article_data.name
        articlepath = articlepath.with_suffix(".html")
        articlepath.resolve()

        with open(articlepath, 'r') as infile:
            html_doc = infile.read()
            infile.flush()
            infile.close()

        builder = HTMLParserTreeBuilder()
        soup = BeautifulSoup(html_doc, builder=builder)
        self._article_tag = soup.find("article")

        # RSS is downloaded, there is no use case for relative links
        # even if RSS consumers theoretically could compute them
        # to absolute links
        # "../" becomes "https://idee.frank-siebert.de/"
        tags = self._article_tag.find_all(re.compile(r".*"), attrs={
            "href": re.compile(r"^\.\./")})
        for tag in tags:
            href = tag.attrs["href"]
            href = href.replace("../", HOST)
            tag.attrs.update({"href": href})

        tags = self._article_tag.find_all(re.compile(r".*"), attrs={
            "src": re.compile(r"^\.\./")})
        for tag in tags:
            href = tag.attrs["src"]
            href = href.replace("../", HOST)
            tag.attrs.update({"src": href})

        # "./" becomes "https://idee.frank-siebert.de/article/"
        tags = self._article_tag.find_all("a", attrs={
            "href": re.compile(r"^\./")})
        for tag in tags:
            href = tag.attrs["href"]
            href = href.replace("./", HOST + "article/")
            tag.attrs.update({"href": href})

        self._article_tag.prettify()
def _article_cleanup(self):
"""
        Remove some things from the article's BeautifulSoup model.
Remove those things, which are not rendered nicely in the
RSS feed consumer, or which are simply dysfunctional there.
Changes are applied to the currently processed article
referenced by self._article_tag
Consumers tested: GPodder, Liferea, Tidings
Returns
-------
None.
"""
        # peel out sections
        sections = self._article_tag.find_all("section")
        for section in sections:
            section.unwrap()

        # fallback to more common tags
        tag = self._article_tag.find("header")
        tag.name = "div"
        self._article_tag.name = "div"

        # Remove toc
        nav = self._article_tag.find("nav")
        if nav:
            nav.decompose()

        # Remove footnote-back anchors.
        tags = self._article_tag.find_all("a", class_="footnote-back")
        for tag in tags:
            tag.decompose()

        # Remove footnote-ref anchors, preserve the footnote.
        tags = self._article_tag.find_all("a", class_="footnote-ref")
        for tag in tags:
            suptag = tag.find("sup")
            # make footnotes more visible
            suptag.string.replace_with("(" + suptag.text + ")")
            tag.unwrap()

        # Remove category anchors
        tags = self._article_tag.find_all("a", class_="category")
        for tag in tags:
            tag.decompose()

        # Remove attributes from images preventing them
        # from being shown in gpodder
        images = self._article_tag.find_all("img")
        for img in images:
            img.attrs = {"src": img.attrs["src"]}

        # Remove id attributes or some tags might not
        # render nicely
        idtags = self._article_tag.find_all(re.compile(r".*"), attrs={
            "id": True})
        for tag in idtags:
            tag.attrs.pop("id")

        # Remove role attributes or some tags might not
        # render nicely
        idtags = self._article_tag.find_all(re.compile(r".*"), attrs={
            "role": True})
        for tag in idtags:
            tag.attrs.pop("role")

        # Remove tabindex attributes not working anyhow in gpodder
        idtags = self._article_tag.find_all(re.compile(r".*"), attrs={
            "tabindex": True})
        for tag in idtags:
            tag.attrs.pop("tabindex")
def _get_item_tag(self, channel_tag, url, article_data):
"""
Find the item tag based on the url information.
Parameters
----------
channel_tag : xml.dom.minidom.Tag
The <channel> tag from the minidom document model.
url : Str
The url of the article, whose item tag is to be returned.
article_data : Dict
Data dictionary of the currently processed article.
Returns
-------
item_tag : xml.dom.minidom.Tag
The pre-existing or created <item> tag for the currently
processed article.
"""
        item_tag = None
        tag = None
        links = channel_tag.getElementsByTagName(LINK_TAG)

        for link in links:
            savedurl = None
            if len(link.childNodes) > 0:
                savedurl = link.childNodes[0].data.strip()
            if url == savedurl:
                tag = link
                break

        if tag:
            item_tag = tag.parentNode
        else:
            item_tag = self._rss_xml.createElement(ITEM_TAG)

            new_tag = self._rss_xml.createElement(TITLE_TAG)
            nodetext = article_data[PubMetaData.title]
            textnode = self._rss_xml.createTextNode(nodetext)
            new_tag.appendChild(textnode)
            item_tag.appendChild(new_tag)

            new_tag = self._rss_xml.createElement(LINK_TAG)
            nodetext = url
            textnode = self._rss_xml.createTextNode(nodetext)
            new_tag.appendChild(textnode)
            item_tag.appendChild(new_tag)

            new_tag = self._rss_xml.createElement(PUBDATE_TAG)
            pubdatetime = datetime.datetime.fromisoformat(
                article_data[PubMetaData.pubdate])
            # running your computer on an english locale
            # is helpful for the next line.
            nodetext = pubdatetime.strftime(
                "%a, %d %b %Y %H:%M:%S +0000")
            textnode = self._rss_xml.createTextNode(nodetext)
            new_tag.appendChild(textnode)
            item_tag.appendChild(new_tag)

            new_tag = self._rss_xml.createElement(GUID_TAG)
            new_tag.setAttribute("isPermaLink", "false")
            item_tag.appendChild(new_tag)

            new_tag = self._rss_xml.createElement(DESCRIPTION_TAG)
            item_tag.appendChild(new_tag)

            new_tag = self._rss_xml.createElement(CONTENT_TAG)
            item_tag.appendChild(new_tag)

            # Processing oldest first, and inserting the items always
            # before the first childNode, we get newest first in the XML.
            # To become specification compliant, we finalize by moving
            # all item tags to the end of the channel tag later.
            channel_tag.insertBefore(item_tag,
                                     channel_tag.childNodes[0])

        return item_tag
def _finalize_channel(self, channel_tag):
"""
Move the items behind the other channel tags.
Take care that the number of items does not exceed ITEM_COUNT.
Update the lastBuildDate.
Parameters
----------
channel_tag : xml.dom.minidom.Tag
The <channel> tag from the minidom document model.
Returns
-------
None.
"""
        tags = channel_tag.getElementsByTagName(ITEM_TAG)
        item_count = 0
        for tag in tags:
            if item_count < ITEM_COUNT:
                channel_tag.appendChild(tag)
                item_count += 1
            else:
                channel_tag.removeChild(tag)

        # change last build date
        # running your computer on an english locale
        # is helpful for this.
        tag = channel_tag.getElementsByTagName(LASTBUILD_TAG)[0]
        pubdatetime = datetime.datetime.fromisoformat(self._nowdate)
        nodetext = pubdatetime.strftime(
            "%a, %d %b %Y %H:%M:%S +0000")
        tag.childNodes[0].nodeValue = nodetext
@staticmethod
def _remove_empty_lines(xml_doc):
"""Remove empty lines with and without whitespaces."""
        pattern = re.compile(r"^\s*$", re.MULTILINE)
        xml_doc = pattern.sub("", xml_doc)
        pattern = re.compile(r"\n\n", re.MULTILINE)
        xml_doc = pattern.sub("\n", xml_doc)
        return xml_doc
def _update(self, article_list, rss_path):
"""
Update the RSS file based on the list of changed or added articles.
Parameters
----------
article_list : List
The list of article_data entries of changed or added articles.
Oldest posts are first in the list.
rss_path : Path
The Path to the RSS file.
Returns
-------
None.
"""
        with open(rss_path, 'r') as rss_file:
            self._rss_xml = xml.dom.minidom.parse(rss_file)

        channel_tag = self._rss_xml.getElementsByTagName(CHANNEL_TAG)[0]

        for article_data in article_list:
            self._read_article_tag(article_data)
            url = HOST + "article" +\
                "/" + article_data.name + ".html"
            item_tag = self._get_item_tag(channel_tag, url, article_data)

            tag = item_tag.getElementsByTagName(GUID_TAG)[0]
            # Changing the guid on update creates problems with some
            # consumers
            nodetext = url  # + "-" + self._nowdate
            if not tag.hasChildNodes():
                textnode = self._rss_xml.createTextNode(nodetext)
                tag.appendChild(textnode)
            else:
                tag.childNodes[0].nodeValue = nodetext

            tag = item_tag.getElementsByTagName(DESCRIPTION_TAG)[0]
            nodetext = " ".join(
                self._article_tag.find("p").text.split())[:406] + " ..."
            if not tag.hasChildNodes():
                textnode = self._rss_xml.createCDATASection(nodetext)
                tag.appendChild(textnode)
            else:
                tag.childNodes[0].nodeValue = nodetext

            # save the audio uri before the removal
            # of the header tag
            url = None
            tag = self._article_tag.find(AUDIO_TAG)
            if tag:
                url = tag.attrs["src"]

            self._article_cleanup()

            tag = item_tag.getElementsByTagName(CONTENT_TAG)[0]
            if tag.hasChildNodes():
                tag.removeChild(tag.childNodes[0])
            nodetext = self._article_tag.prettify()
            nodetext = " ".join(nodetext.split())
            textnode = self._rss_xml.createCDATASection(nodetext)
            tag.appendChild(textnode)

            # An update might add or update the audio
            tags = item_tag.getElementsByTagName(ENCLOSURE_TAG)
            tag = None
            if url and len(tags) == 0:
                tag = self._rss_xml.createElement(ENCLOSURE_TAG)
                item_tag.appendChild(tag)
            elif len(tags) > 0 and not url:
                item_tag.removeChild(tags[0])

            # Update enclosure tag
            if tag:
                audio = gmc.audiopath / (pageurn(
                    article_data[PubMetaData.title]) + ".mp3")
                filelength = 0
                audio.resolve()
                if audio.exists():
                    filelength = audio.stat().st_size
                tag.setAttribute("url", url)
                tag.setAttribute("length", "{}".format(filelength))
                tag.setAttribute("type", "audio/mpeg")

        self._finalize_channel(channel_tag)

        xml_doc = self._rss_xml.toprettyxml(indent=" ", encoding="utf-8")
        xml_doc = self._remove_empty_lines(xml_doc.decode("utf-8"))

        with open(rss_path, 'w') as rss_file:
            print(xml_doc, file=rss_file)
            rss_file.flush()
            rss_file.close()
Error Search
The initial code worked nicely, but I didn't get my audio episodes shown in my favorite podcast catcher GPodder on my Sailfish OS device. I found out that GPodder contains a lot of Python as well, and that the module used to parse the RSS feed is named podcastparser.
Since I didn't see the cause of the error with my blinded eyes, I ended up investigating it with the following test code.
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
Created on Tue Feb 22 13:02:51 2022
@author: frank
"""
import podcastparser
import urllib.request

feedurl = 'http://sol:88/idee-rss.xml'

parsed = podcastparser.parse(feedurl, urllib.request.urlopen(feedurl))

# parsed is a dict
import pprint
pprint.pprint(parsed)
Via this excursion I found out the following things:
- The reason for the error was me using a uri attribute instead of a url attribute in the enclosure tag.
- This podcastparser supports relative links in the RSS file, so most probably others will support this as well.
- The test cases in their git repository indicate that CDATA sections should work nicely.
Relative Links in RSS
The RSS Advisory Board declares its opinion that relative links should be supported. The discussion documented on that page also proposes how it should be done, which seems to fit with the podcastparser.py implementation 27 .
The proposal boils down to the notion that the channel's link element should provide the location to which the links are relative. If required, this can be overridden with the attribute xml:base.
Since the link element of the channel should point to the HTML of the channel's index (or entry) page, I felt more comfortable with the dedicated xml:base attribute.
Making everything else in the channel elements relative, the RSS Template for my German channel should look like this:
<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
xml:base="http://sol:88/"
xmlns:content="http://purl.org/rss/1.0/modules/content/"
xmlns:atom="http://www.w3.org/2005/Atom"
>
<channel>
<title>Idee der eigenen Erkenntnis</title>
<atom:link href="idee-rss.xml" rel="self"
type="application/rss+xml" />
<link>idee.html</link>
<description>Idee</description>
<lastBuildDate>Tue, 11 Jan 2022 07:54:24 +0000</lastBuildDate>
<language>de</language>
<generator>pandoc, fs-commit-msg-hook 1.0</generator>
<image>
<url>image/favicon-256x256-150x150.png</url>
<title>Idee der eigenen Erkenntnis</title>
<link>idee.html</link>
<width>64</width>
<height>64</height>
</image>
</channel>
</rss>
Where the current xml:base value is for the testing period only.
Thinking further about this, all my article pages are in the folder article. If I use that folder as the base, all relative links used in the content part should resolve nicely.
The channel part then should look as follows:
<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
xml:base="http://sol:88/article/"
xmlns:content="http://purl.org/rss/1.0/modules/content/"
xmlns:atom="http://www.w3.org/2005/Atom"
>
<channel>
<title>Idee der eigenen Erkenntnis</title>
<atom:link href="../idee-rss.xml" rel="self"
type="application/rss+xml" />
<link>../idee.html</link>
<description>Idee</description>
<lastBuildDate>Tue, 11 Jan 2022 07:54:24 +0000</lastBuildDate>
<language>de</language>
<generator>pandoc, fs-commit-msg-hook 1.0</generator>
<image>
<url>../image/favicon-256x256-150x150.png</url>
<title>Idee der eigenen Erkenntnis</title>
<link>../idee.html</link>
<width>64</width>
<height>64</height>
</image>
</channel>
</rss>
Nice and good thoughts, but it doesn't work as hoped. You can use relative links in the channel tags, and it works fine as far as I tested it. But you cannot rely on that for links in the content tag: how the consumer resolves these links, or whether it bothers to try at all, is something you may not rely on. To be fair, the specification is really unspecific in this respect.
According to the official specification even CDATA sections would not work, as it states that all content needs to HTML-escape all special characters. Using a CDATA section instead is much more convenient and turns out, luckily, to be supported by the feed consumers. But CDATA by definition means "Character Data" not to be parsed (Character Data to be parsed would be PCDATA).
Implementors can now argue that parsing and processing relative links must not be done for the CDATA section in the context of xml:base, and that would be correct. But they could also argue that CDATA is not to be parsed or processed in any way, just to be displayed, and that would be correct as well.
I had a hard time getting my article images shown using relative links. In the end I found that the images were not shown not because of issues with relative links, but because of tag attributes like alt and title. Also, headline tags are not rendered as headlines in GPodder if, e.g., the h2 tag features an id attribute.
I extended the code to perform a number of attribute removals and some tag replacements, which removed my issues. I did this with a code version that used fully qualified links, and I did not go back to give the relative links one more try. Fully qualified links are anyhow the least likely to cause problems with any feed reader.
Include the Portal Fragment
I go back to an early discussion. Obviously I failed to find any HTML means to separate the content of the article from the content of the portal, and it does not look as if something like an html-include will become part of HTML and be supported by browsers.
But it turns out that it can be done by the web server, using one of its extension modules. Some examples exist where the functions add_before_body and add_after_body from "Module ngx_http_addition_module" 28 are used to inject a header and a footer.
The article "nginx: Mitigating the BREACH Vulnerability with Perl and SSI or Addition or Substitution Modules — Wild Wild Wolf" 29 is not really about this topic, but it does show that using these two functions we would end up with invalid HTML. Not a big problem, if it works and if this is everything you do care about.
The same article shows that the "Module ngx_http_ssi_module" 30 does exactly what's required to perform such an include on the server side.
You could now argue that this is a step back from the goal to be completely plain HTML only. But it is, that's my argument, close enough to the feature I would have hoped to see included in the HTML standard. I'm willing to accept that the feature is now provided by the web server.
For the implementation this means that I have to go a few steps back and modify the code which turns my plain HTML into portal HTML. This part will no longer include the header itself, but only a comment line with the processing instruction to include the header.
Since the nginx site configuration now becomes an essential part of the implementation, I'll move it into the git repository as well.
Relocate nginx Site Configuration
In this first step the content of the site configuration stays the same. It is just copied via copy+paste from the file sol:/etc/nginx/sites-available/idee_88 into the new file in the git repository.
The file name and the used port will change, when I go live.
frank @Asimov:~/projects/idee$ mkdir nginx
frank @Asimov:~/projects/idee$ cd nginx/
frank @Asimov:~/projects/idee/nginx$ vim idee_88
frank @Asimov:~/projects/idee/nginx$ git add .
frank @Asimov:~/projects/idee/nginx$ git commit
frank @Asimov:~/projects/idee/nginx$ git push
Enter passphrase for key '/home/frank/.ssh/id_rsa':
Enumerating objects: 5, done.
Counting objects: 100% (5/5), done.
Delta compression using up to 4 threads
Compressing objects: 100% (3/3), done.
Writing objects: 100% (4/4), 750 bytes | 750.00 KiB/s, done.
Total 4 (delta 1), reused 0 (delta 0), pack-reused 0
remote: From /home/git/idee
remote: 8316cca..eabae10 master -> origin/master
remote: Updating 8316cca..eabae10
remote: Fast-forward
remote: nginx/idee_88 | 36 ++++++++++++++++++++++++++++++++++++
remote: 1 file changed, 36 insertions(+)
remote: create mode 100644 nginx/idee_88
To ssh://sol/home/git/idee.git
8316cca..eabae10 master -> master
root @sol:/etc/nginx/sites-available# rm idee_88
root @sol:/etc/nginx/sites-available# ln -s /var/www/idee/nginx/idee_88
root @sol:/etc/nginx/sites-available# nginx -t
nginx: the configuration file /etc/nginx/nginx.conf syntax is ok
nginx: configuration file /etc/nginx/nginx.conf test is successful
root @sol:/etc/nginx/sites-available# nginx -s reload
SSI Installation
SSI stands for Server Side Includes.
The required SSI module is included in the nginx-extras package. But it turns out to also be part of the nginx-full package, which I already have installed.
root @sol:/etc/nginx/sites-available# apt-cache show nginx-full
[...]
OPTIONAL HTTP MODULES: Addition, Auth Request, Charset, WebDAV, GeoIP, Gunzip,
Gzip, Gzip Precompression, Headers, HTTP/2, Image Filter, Index, Log, Real IP,
Slice, SSI, SSL, Stream, SSL Preread, Stub Status, Substitution, Thread Pool,
Upstream, User ID, XSLT.
[...]
No further installation required.
SSI Configuration
The nginx site configuration needs one additional line to activate SSI for the location.
location / {
ssi on;
# First attempt to serve request as file, then
# as directory, then fall back to displaying a 404.
try_files $uri $uri.html $uri/ =404;
}
Include Instruction in the HTML
<html lang="de-DE" xml:lang="de-DE" xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta charset="utf-8"/>
[...]</head>
<body>
<!--# include file="portal/idee_header.html" -->
<main>
<article>
[...]</article>
</main>
</body>
</html>
For English pages the include will reference the file portal/concept_header.html. The documentation does not explain the reference directory for the include path. Is it simply the webroot, or is it the location of the HTML document? As you can see, I guessed it is the webroot, which is also simpler for my implementation.
First tests ahead of the implementation show that the assumption is not correct. The above sample leads to error 404 (the included header page was not found). Defining it relative to the article is correct, but considered unsafe:
2022/03/02 11:07:32 [error] 13045#13045: *396623 unsafe URI "/article/../portal/idee_header.html" was detected while sending response to client, client: 10.19.67.21, server: _, request: "GET /article/endlich.html HTTP/1.1", host: "sol:88"
Since there was only one choice left, that one turned out to work.
[...]<body>
<!--# include file="/portal/idee-header.html" -->
<main>
[...]
You might also notice that I decided to rename the header file to use a hyphen instead of an underscore. This was just for consistency in my file names. Note that this is only a chapter in a much longer description.
~/projects/idee/generator/archive.py
"""
Update the archive of the webseite.
@author: Frank Siebert
@license: https://creativecommons.org/publicdomain/zero/1.0/deed.en
@date: 2022-03-15
"""
import re
import datetime
from bs4 import BeautifulSoup
from bs4 import Comment
from bs4.builder._htmlparser import HTMLParserTreeBuilder
from pubmetadata import PubMetaData
from gitmsgconstants import GitMsgConstants as gmc
class Archive():
"""Manage all changees in the archive."""
def __init__(self):
"""
Initialize changelists.
Returns
-------
None.
"""
# map information for German page changes on site "Idee".
self.de_list = []
# map information for English page changes on site "Concept".
self.en_list = []
# The time of the update
self._nowdate = datetime.datetime.now().isoformat()
def update(self):
"""
Iterate over changes and update respective archive pages.
Add the archive pages to their respective change list.
The information about the changed html pages comes from
PubMetaData.instance._updates and
PubMetaData.instance._deletions .
Returns
-------
None.
"""
for article_data in PubMetaData.instance._updates:
= article_data[PubMetaData.pubdate][0:7]
creation_month = article_data[PubMetaData.site]
site = site.lower() + "-" + creation_month + ".html"
archive_path = gmc.archivepath / archive_path
archive_path
if site == "Idee":
if archive_path not in self.de_list:
self.de_list.append(archive_path)
else:
if archive_path not in self.en_list:
self.en_list.append(archive_path)
if article_data.name != "rechtliches" \
and article_data.name != "legal":
= self._update(archive_path, article_data)
soup = soup.prettify()
html_doc
with open(archive_path, 'w') as archive_file:
print(html_doc, file=archive_file)
archive_file.flush()
archive_file.close()
for article_data in PubMetaData.instance._deletions:
# TODO
pass
self._update_de()
self._update_en()
def _update_de(self):
"""Update idee-archive.html."""
if len(self.de_list) == 0:
return
with open(gmc.idee_archive, 'r') as archive_file:
= archive_file.read()
html_doc
archive_file.flush()
archive_file.close()
= HTMLParserTreeBuilder
builder = BeautifulSoup(html_doc, builder=builder)
soup
for archive_path in self.de_list:
= './' + archive_path.name
url = soup.find("a", href=re.compile(r"" + url))
tag
if not tag:
= soup.find("main")
tag = soup.new_tag("h3")
new_tag 0, new_tag)
tag.insert(= new_tag
tag = soup.new_tag("a")
new_tag "href": url})
new_tag.attrs.update({= archive_path.name
new_tag.string
tag.append(new_tag)
= soup.prettify()
html_doc
with open(gmc.idee_archive, 'w') as archive_file:
print(html_doc, file=archive_file)
archive_file.flush()
archive_file.close()
def _update_en(self):
"""Update concept-archive.html."""
if len(self.en_list) == 0:
return
with open(gmc.concept_archive, 'r') as archive_file:
= archive_file.read()
html_doc
archive_file.flush()
archive_file.close()
= HTMLParserTreeBuilder
builder = BeautifulSoup(html_doc, builder=builder)
soup
for archive_path in self.en_list:
= './' + archive_path.name
url = soup.find("a", href=re.compile(r"" + url))
tag
if not tag:
= soup.find("main")
tag = soup.new_tag("h3")
new_tag 0, new_tag)
tag.insert(= new_tag
tag = soup.new_tag("a")
new_tag "href": url})
new_tag.attrs.update({= archive_path.name
new_tag.string
tag.append(new_tag)
= soup.prettify()
html_doc
with open(gmc.concept_archive, 'w') as archive_file:
print(html_doc, file=archive_file)
archive_file.flush()
archive_file.close()
@staticmethod
def _get_abstract(article_data):
"""
Read the abstract of the processed article.
The abstract consists of the first 406 characters of the first
<p> tag, or less, if the respective string is shorter.
Returns
-------
Str.
"""
= gmc.articlepath / article_data.name
articlepath = articlepath.with_suffix(".html")
articlepath
articlepath.resolve()
with open(articlepath, 'r') as infile:
= infile.read()
html_doc
infile.flush()
infile.close()
= HTMLParserTreeBuilder()
builder = BeautifulSoup(html_doc, builder=builder)
soup
= soup.find("p")
tag return " ".join(tag.text.split())[0:406]
@staticmethod
def _update(archive_path, article_data, article_loc="../article/"):
= None
is_new
archive_path.resolve()if archive_path.exists():
with open(archive_path, 'r') as archive_file:
= archive_file.read()
html_doc
archive_file.flush()
archive_file.close()= False
is_new else:
gmc.archive_template.resolve()with open(gmc.archive_template, 'r') as archive_file:
= archive_file.read()
html_doc
archive_file.flush()
archive_file.close()= True
is_new
= HTMLParserTreeBuilder
builder = BeautifulSoup(html_doc, builder=builder)
soup
if is_new:
= soup.find("body")
tag # SSI header injection is a function of the language
if article_data[PubMetaData.locale].startswith("de"):
= Comment('# include file="/portal/idee-header.html" ')
new_tag = "de"
language = "Idee"
site_name = "Archiv"
title_prefix else:
= Comment(
new_tag '# include file="/portal/concept-header.html" ')
= "en"
language = "Concept"
site_name = "Archive"
title_prefix 0, new_tag)
tag.insert(
= soup.find("html")
tag "lang": language, "xml:lang": language})
tag.attrs.update({= soup.find("meta", property="og:site_name")
tag "Content": site_name})
tag.attrs.update({
= soup.find("title")
tag = " ".join([title_prefix,
tag.string 0:7]])
article_data[PubMetaData.pubdate][
= soup.find("h1")
tag = " ".join([title_prefix,
tag.string 0:7]])
article_data[PubMetaData.pubdate][
= article_loc + article_data.name + ".html"
article_url
= soup.find("a", href=article_url)
tag if not tag:
= soup.find("h1")
tag
= soup.new_tag("article")
new_tag if tag: # true in archive, false in index page
tag.insert_after(new_tag)else:
= soup.find("main")
tag 0, new_tag)
tag.insert(= new_tag
tag
= soup.new_tag("header")
new_tag
tag.append(new_tag)= new_tag
tag
= soup.new_tag("h2")
new_tag
tag.append(new_tag)= new_tag
tag
= soup.new_tag("a")
new_tag "href": article_url,
new_tag.attrs.update({"alt": article_data[PubMetaData.title]})
= article_data[PubMetaData.title]
new_tag.string
tag.append(new_tag)= tag.parent # header
tag
= soup.new_tag("div")
new_tag
tag.append(new_tag)= new_tag
tag
= soup.new_tag("time")
new_tag "datetime":
new_tag.attrs.update({19],
article_data[PubMetaData.pubdate][:"pubdate": "true"})
= article_data[PubMetaData.pubdate][:10]
new_tag.string
tag.append(new_tag)
= soup.new_tag("address")
new_tag = article_data[PubMetaData.author]
new_tag.string
tag.append(new_tag)= tag.parent.parent # article
tag
= soup.new_tag("p")
new_tag
tag.append(new_tag)= new_tag
tag
= soup.new_tag("a")
new_tag "href": article_url,
new_tag.attrs.update({"alt": article_data[PubMetaData.title]})
= "..."
new_tag.string "placeholder") # for the article abstract
tag.append(
tag.append(new_tag)= tag.parent # article
tag
= soup.new_tag("hr")
new_tag
tag.append(new_tag)else:
= tag.parent.parent.parent # article
tag
# tag holds now the article tag.
# Either it had been found or created.
# All used child tags exist also.
# Write or update the article abstract
= tag.find("p")
tag = tag.find("a")
tag
tag.previousSibling.replace_with(Archive._get_abstract(article_data))
# We give every anchor a tabindex
# 5 Tabindexes are in the portal header
= 6
index = soup.find_all(re.compile(r"^a$|^audio$|^input$"))
tags for tag in tags:
"tabindex": index})
tag.attrs.update({+= 1
index
return soup
Migration
Migration, which I had hoped would be quick, needs to be done manually. Not only do I have to supervise the result step by step, I also put my own comments below articles to update or amend them, and these now need to be incorporated into the article text.
And, as will be seen, slight adjustments to the wiki text need to be made in some cases to get the desired result.
Migration issue: double-byte unicode characters break PDF generation
The standard pandoc installation does not support double-byte unicode characters, as it uses LaTeX for the PDF generation.
In my case this happened with the code point U+03BA for the Greek character κ. Not knowing when and why the PDF generation will break next time is not an option. And it's not possible to fix the issue just by removing the character, since it is surely used for a reason.
The stackoverflow discussion "Pandoc and foreign characters" 31 explains that the problem can be solved by specifying a different PDF engine via --pdf-engine=xelatex.
However, this is only part of the answer, since this engine first needs to be installed, and since a font needs to be chosen which contains the character.
The engine can be installed from the debian repository by:
frank @Asimov:~/projects/idee$ sudo apt-get install texlive-xetex
A search for fonts supporting the character can be done with:
frank @Asimov:~/projects/idee$ fc-list ':charset=03BA'
This list is quite long and, if you think about it, helpful only in the most exotic cases. My best guess for a suitable font to render everything I use in my wiki pages in PDF would be the font used by my web browser.
Was it Firefox or was it Chromium? In one of my browsers I found the default to be DejaVu Sans. How can the font be specified? That can be done via command line parameters.
Indeed I found a number of pages describing how this can be done, but in the end none of them worked as expected. Only the "Pandoc User’s Guide" 32 helped in the end.
-V KEY[=VAL], --variable=KEY[:VAL]
Set the template variable KEY to the value VAL when rendering the document in standalone mode. If no VAL is specified, the key will be given the value true.
mainfont, sansfont, monofont, mathfont, CJKmainfont
font families for use with xelatex or lualatex: take the name of any system font, using the fontspec package. CJKmainfont uses the xecjk package.
These two pieces of information combined explained to me when to use the ":" symbol and when to use the "=" symbol, which, for whatever reason, was not done correctly in the examples I found, or probably I failed to understand them correctly.
The working code to call pandoc from Python, naming the fonts to use:
"pandoc",
subprocess.run([# mediawiki markup as input format
"-f", "html",
# html as output forma
"-t", "pdf",
# input file
# "-i", inpath,
# output file
"-o", self.outpath,
"--pdf-engine=xelatex",
"--variable=mainfont:DejaVu Serif",
"--variable=sansfont:DejaVu Sans",
"--variable=monofont:DejaVu Sans Mono",
"--variable=geometry:a4paper",
"--variable=geometry:margin=2.5cm",
"--variable=linkcolor:blue"
\
],=False,\
capture_output# the correct workdirectory to find the images
=workpath,\
cwd# html string as stdin
input=html_doc.encode("utf-8"))
A resource worth visiting for further beautification: "Customizing pandoc to generate beautiful pdf and epub from markdown" 33
For a start I'm happy if the PDF is generated correctly, but I'm sure I'll revisit the topic to get from good results to perfect results.
Missing Character
Greek characters were no problem, and for CJK (Chinese, Japanese, Korean) fonts a separate variable can be set. But now I got problems with Hebrew characters, and what would it look like if Arabic characters were required?
Funnily enough, having the characters nicely rendered in the web page doesn't tell you anything about your success during PDF creation.
[WARNING] Missing character: There is no א (U+05D0)
in font DejaVu Serif/OT:script=latn;language=d
[WARNING] Missing character: There is no ָ (U+05B8)
in font DejaVu Serif/OT:script=latn;language=d
[WARNING] Missing character: There is no ד (U+05D3)
in font DejaVu Serif/OT:script=latn;language=d
[WARNING] Missing character: There is no ָ (U+05B8)
in font DejaVu Serif/OT:script=latn;language=d
[WARNING] Missing character: There is no ם (U+05DD)
in font DejaVu Serif/OT:script=latn;language=d
[WARNING] Missing character: There is no א (U+05D0)
in font DejaVu Serif/OT:script=latn;language=d
[WARNING] Missing character: There is no ֲ (U+05B2)
in font DejaVu Serif/OT:script=latn;language=d
[WARNING] Missing character: There is no ד (U+05D3)
in font DejaVu Serif/OT:script=latn;language=d
[WARNING] Missing character: There is no ָ (U+05B8)
in font DejaVu Serif/OT:script=latn;language=d
[WARNING] Missing character: There is no מ (U+05DE)
in font DejaVu Serif/OT:script=latn;language=d
[WARNING] Missing character: There is no ָ (U+05B8)
in font DejaVu Serif/OT:script=latn;language=d
[WARNING] Missing character: There is no ה (U+05D4)
in font DejaVu Serif/OT:script=latn;language=d
The command fc-list does not show any installed font for these character codes, but the browser does show them. This means that the browser gets its fonts from somewhere else, if it needs them.
Curious what font would be reported by the browser, I used the inspection tool and got the answer "Liberation Sans", which convinced me to change the fonts to be used for the PDF generation to Liberation Fonts.
This is one of the fonts installed by default on Debian. And guess what, it worked! Probably I did something wrong with the fc-list command. I think the font looks better balanced in the PDF; it is definitely a good change, not only for that article.
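To avoid stumbling into the next missing glyph only at PDF generation time, the coverage check can also be scripted. This is only a small sketch of mine around the fc-list call shown above, not part of the generator; the function name is my own.

import subprocess


def fonts_for_codepoint(codepoint):
    """Return the font families fc-list reports for a unicode code point.

    codepoint is given as a hex string, e.g. "05D0" for the Hebrew Alef.
    """
    result = subprocess.run(
        ["fc-list", ":charset=" + codepoint, "family"],
        capture_output=True, text=True)
    # fc-list prints one family (or comma separated list of names) per line
    return sorted({line.strip() for line in result.stdout.splitlines()
                   if line.strip()})


# example: which installed fonts claim to cover the Hebrew Alef?
for family in fonts_for_codepoint("05D0"):
    print(family)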
Tables flow out of the PDF page
There are a lot of web pages out there about this topic, and all of them, at least those I found, are about how to change the markdown to prevent this from happening.
The two basic solutions are:
- scaling the table down together with the font size
- use the markdown for multiline tables
Well, since I do not use markdown but MediaWiki markup to write the articles, and create HTML first and then PDF from the HTML, I had a hard time figuring out the solution. Processing things in multiple steps, I could probably find a solution by editing intermediate results, but that is cumbersome and not desirable.
I even started to question my solution. Shouldn't I use markdown for the PDF as well as for the HTML generation? Should I throw big parts of my implementation away and start over again?
However, reading about the solutions helped in the end. How can I convince Pandoc to make a multiline table from my markup? I need to enforce a multiline header cell in my MediaWiki markup.
Note the
<br/>
in the third column:
{| class="wikitable" style="text-align:left;" cellpadding="2px"
! Hersteller
! Impfstoff
! Primary<br/>Completion
! Completion
|-
| BioNTech / Pfizer
| BNT162b2
| 2021-11-30
| 2021-11-30
|}
Relative URLs to own articles do not work in PDF
Nothing to wonder about, but I stumbled upon it nonetheless. There is no way around it: before generating the PDF I have to revert the relative URLs back into absolute URLs pointing to my web site.
DONE
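For illustration, such a rewrite can be done in a few lines with BeautifulSoup. The site root constant and the helper name below are my own assumptions for this sketch, not code taken from the generator.

from urllib.parse import urljoin

from bs4 import BeautifulSoup

# assumption for this sketch, not a constant from the generator
SITE_ROOT = "https://idee.frank-siebert.de/article/"


def absolutize_links(html_doc):
    """Turn relative hrefs and srcs into absolute URLs before PDF creation."""
    soup = BeautifulSoup(html_doc, "html.parser")
    for tag in soup.find_all(["a", "img"]):
        attr = "href" if tag.name == "a" else "src"
        value = tag.get(attr)
        # keep in-document anchors and already absolute references untouched
        if not value or value.startswith(("#", "http://", "https://", "mailto:")):
            continue
        tag[attr] = urljoin(SITE_ROOT, value)
    return str(soup)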
Half Way Migrated - Checkpoint
At the point where I had already migrated all articles up to May 2021, I have 8 audio articles in my RSS feed. On my way I had to take care of a number of bugs, e.g. in the code where the urn of the article is created based on the title. It is a critical detail that the urn matches the current WordPress article URL stem, or I will not be able to process automatic redirects from the old URL to the new URL without an extensive matching list.
I also learned what I have to take care of regarding the article title in MediaWiki. E.g. in some titles I have used quotes. If I use the standard quote ("), then I get a problem with the article filename on disk. When writing the file, the quote character is correctly escaped. But the attempt to read the file in Python leads to a path where the escape character is escaped again, resulting in a file not found error. I now change these titles to use the quote characters („) and (“) instead.
I'm also editing the articles to use < ref > tags for references to my own articles as well, and to put quotes into < blockquote > tags.
Also the filenames of the audio files are now different than before, now using the urn of the article as the stem of the audio filename.
I wouldn't need to care, but I'd like to have all audio articles on my phone in their new representation in my gPodder app. This is not critical for other consumers, it is something I just want to have. Other consumers most probably have no problem with the changed appearance of the articles, as long as their URL for the feed consumption does not stop working.
But for me this special requirement (my wish) leads to the conclusion that I either need to allow a very big RSS feed at the start, or I have to prepare the further article migration, implement the index page generation, and follow the go-live with a rapid migration.
Whichever way I decide, I have to implement the index page generation sooner rather than later, because the go-live is near. Not that a fixed day exists for it, but the progress indicates that it cannot be too far away.
TODOs I must not forget:
- Prevent the search engine indexing of my legal page (in German and English)
  - Prevent the legal page from appearing in the sitemap (Done)
  - Prevent the legal page from appearing in the RSS feed (Done)
  - Prevent the legal page from appearing in the archive (Done)
  - nofollow information at the anchor in the headers (Done)
  - disallow the English and German legal pages explicitly in the robots.txt (Done)
- /feed/ redirect
- Take care that source code references do not each get a tabindex
- Find out how to make backward references from the footnotes back to their text work in the PDF (Done)
  - Include readable http links in the PDF to provide useful footnotes also when printed (footnote section only)
Enabling Backlinks in PDF
I found a list of options to investigate in the post: "How to convert HTML to PDF using pandoc?" 34
wkhtmltopdf
frank @Asimov:~/projects/idee$ sudo apt-cache search wkhtmltopdf
[sudo] password for frank:
python3-django-wkhtmltopdf - Django module with views for HTML to PDF
conversions (Python 3)
pandoc - general markup converter
python3-pdfkit - Python wrapper for wkhtmltopdf to convert HTML to PDF
(Python 3)
wkhtmltopdf - Command line utilities to convert html to pdf or image using
WebKit
frank @Asimov:~/projects/idee$ sudo apt-get install wkhtmltopdf
frank @Asimov:~/projects/idee/plain$ wkhtmltopdf --enable-local-file-access \
--enable-external-links --enable-internal-links --keep-relative-links \
astrazeneca-vaxzevria-verunreinigungen-thromozytopenie-thrombose.html \
astrazeneca-vaxzevria-verunreinigungen-thromozytopenie-thrombose.pdf
The switch --enable-external-links, is not support using unpatched qt, and will be ignored.
The switch --enable-internal-links, is not support using unpatched qt, and will be ignored.
The switch --keep-relative-links, is not support using unpatched qt, and will be ignored.
Loading page (1/2)
Printing pages (2/2)
Done
I'm not yet willing to install a patched qt for this purpose, not even being sure about the result. Because of this, no links at all work in the PDF. The PDF shows the HTML exactly as it is rendered in the browser, and that is not really what I want either.
But I'll keep this in mind. It might be useful in other use cases.
frank @Asimov:~/projects/idee/plain$ sudo apt-get purge wkhtmltopdf
WeasyPrint
The quite impressive list of packages to be installed for WeasyPrint made me think twice about pressing yes. It even made me read the documentation first: "WeasyPrint" 35
I learned from this documentation that CSS 2 already contains style elements for paged media layout. 36
From the reading I get the impression that it does everything required to lay out the HTML nicely for PDF and to enable all links to work.
And from the post which made me aware of this tool I already know that it can be named as the PDF engine for pandoc. The question has to be asked, of course, whether this makes sense: calling a program written in Haskell to call a program written in Python, when I'm already in a Python program.
However, I'll try exactly that setup for a start, and probably later I'll kick Pandoc out of the PDF generation and use WeasyPrint directly via its API, if it works nicely.
I guess if I go that route, I'll develop a second CSS for page layout details and to override some CSS formatting that is used in the HTML but does not look nice in the PDF.
frank @Asimov:~/projects/idee/plain$ sudo apt-get install weasyprint
[sudo] password for frank:
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
The following additional packages will be installed:
libblkid-dev libbrotli-dev libcairo-script-interpreter2 libcairo2-dev
libdatrie-dev libfontconfig-dev libfontconfig1-dev libfreetype-dev
libfreetype6-dev libfribidi-dev libglib2.0-dev libglib2.0-dev-bin
libgraphite2-dev libharfbuzz-dev libharfbuzz-gobject0 libice-dev libmount-dev
libpango1.0-dev libpcre2-32-0 libpcre2-dev libpcre2-posix2 libpixman-1-dev
libpng-dev libpng-tools libpthread-stubs0-dev libselinux1-dev libsepol1-dev
libsm-dev libthai-dev libx11-dev libxau-dev libxcb-render0-dev libxcb-shm0-dev
libxcb1-dev libxdmcp-dev libxext-dev libxft-dev libxrender-dev pango1.0-tools
python-tinycss2-common python3-cairocffi python3-cairosvg python3-cffi
python3-cssselect2 python3-pycparser python3-pyphen python3-tinycss2
python3-xcffib uuid-dev x11proto-dev x11proto-xext-dev xorg-sgml-doctools
xtrans-dev
Suggested packages:
libcairo2-doc libdatrie-doc freetype2-doc libgirepository1.0-dev libglib2.0-doc
libgraphite2-utils libice-doc libpango1.0-doc libsm-doc libthai-doc libx11-doc
libxcb-doc libxext-doc python-cairocffi-doc python-cssselect2-doc
python-tinycss2-doc
The following NEW packages will be installed:
libblkid-dev libbrotli-dev libcairo-script-interpreter2 libcairo2-dev
libdatrie-dev libfontconfig-dev libfontconfig1-dev libfreetype-dev
libfreetype6-dev libfribidi-dev libglib2.0-dev libglib2.0-dev-bin
libgraphite2-dev libharfbuzz-dev libharfbuzz-gobject0 libice-dev libmount-dev
libpango1.0-dev libpcre2-32-0 libpcre2-dev libpcre2-posix2 libpixman-1-dev
libpng-dev libpng-tools libpthread-stubs0-dev libselinux1-dev libsepol1-dev
libsm-dev libthai-dev libx11-dev libxau-dev libxcb-render0-dev libxcb-shm0-dev
libxcb1-dev libxdmcp-dev libxext-dev libxft-dev libxrender-dev pango1.0-tools
python-tinycss2-common python3-cairocffi python3-cairosvg python3-cffi
python3-cssselect2 python3-pycparser python3-pyphen python3-tinycss2
python3-xcffib uuid-dev weasyprint x11proto-dev x11proto-xext-dev
xorg-sgml-doctools xtrans-dev
0 upgraded, 54 newly installed, 0 to remove and 1 not upgraded.
Need to get 13.3 MB of archives.
After this operation, 47.1 MB of additional disk space will be used.
Do you want to continue? [Y/n]
Naming weasyprint instead of xelatex as the pdf-engine works instantly. The font settings from the CSS are not used, and the headline color is also not applied as defined in the CSS. Probably the CSS is not found at all.
The CSS is found, however, when the program is called from the command line, making the headlines use the color defined in the CSS. Font settings are still ignored, but this time with a warning message informing about it.
frank @Asimov:~/projects/idee/website/article$ weasyprint -f pdf \
astrazeneca-vaxzevria-verunreinigungen-thromozytopenie-thrombose.html \
astrazeneca-vaxzevria-verunreinigungen-thromozytopenie-thrombose.pdf
WARNING: Ignored `font: var(--theme-font)` at 29:2, invalid value.
WARNING: Ignored `border-right: 1px solid var(--theme-color)` at 44:2, invalid
value.
WARNING: Ignored `border-left: 1px solid var(--theme-color)` at 45:2, invalid
value.
WARNING: Expected a media type, got screen/**/and/**/(min-width: 641px)
WARNING: Invalid media type " screen and (min-width: 641px) " the whole @media
rule was ignored at 83:1.
WARNING: Expected a media type, got screen/**/and/**/(max-width: 640px)
WARNING: Invalid media type " screen and (max-width: 640px) " the whole @media
rule was ignored at 105:1.
WARNING: Ignored `font: var(--theme-font)` at 197:2, invalid value.
WARNING: Ignored `font: var(--theme-font)` at 236:29, invalid value.
WARNING: Ignored `font: var(--theme-font)` at 239:21, invalid value.
WARNING: Ignored `display: inline-grid` at 254:2, invalid value.
WARNING: Ignored `grid-template-columns: 30px auto auto auto` at 255:2, unknown
property.
WARNING: Ignored `font: var(--theme-font)` at 270:2, invalid value.
WARNING: Ignored `text-shadow: 1px 1px rgba(255, 255, 255, 0.4)` at 294:2,
unknown property.
WARNING: Ignored `border-bottom: 0.3em solid var(--theme-color)` at 327:2,
invalid value.
WARNING: Ignored `font: var(--theme-font)` at 332:2, invalid value.
WARNING: Ignored `font: var(--theme-font)` at 402:2, invalid value.
WARNING: Ignored `outline: 5px solid var(--theme-meta-color)` at 406:2, invalid
value.
WARNING: Ignored `border-top: 2px solid var(--theme-meta-color)` at 412:2,
invalid value.
Links pointing backward inside the document work as they should. Obviously I'll now take a look at a CSS optimization for the PDF generation before I proceed with my migration.
fspdf.css
Creating a completely new CSS for the PDF generation is not helpful, since this might introduce a lot of double maintenance if the style is changed in the future. But a separate CSS to override just some specific things is quite simple.
See the earlier chapter "The PDF Style Sheet".
This little initial CSS also reveals that the removal of figures around images is no longer required. On the contrary, these figures are now an important means to lay out the images as we need them. However, anchor tags inside the figures around the image do nothing: opening the image in the web browser by clicking it does not work. But I see this as a minor issue, since every document created will carry a QR code with the URL of the article for those who wish to use the web version of the article.
I was able to add a header line with the article's title and a page number at the top of the page. Over time the layout of the page might change to get perfect results, but for now good is good enough.
pdfworker.py
The following code shows just the essential parts of the new code. A lot more lines have been removed, e.g. the pandoc system call and the removal of figures from tables or from the article header.
from weasyprint import HTML
from weasyprint import CSS

[...]

csspath = Path(r"/home/frank/projects/idee/website/css/fspdf.css")
csspath.resolve()

html_doc = soup.prettify()

weasy_html = HTML(string=html_doc, base_url=str(workpath))
weasy_html.write_pdf(target=self.outpath,
                     stylesheets=[CSS(filename=str(csspath))]
                     )

[...]
WeasyPrint Bug?
I'm perfectly satisfied with the PDF generated by WeasyPrint, but only now, after generating quite a lot of PDF documents, I discovered that German special characters (ÄäÖöÜüß) in headlines lead to dysfunctional links in the table of contents.
The TOC used is not a real PDF TOC; it is the TOC generated for the HTML, and it should work in the PDF just as it does in the HTML.
The HTML is generated by Pandoc, and until now I did not meddle with the id and href names generated for internal navigation. Indeed I like it very much that Pandoc does not escape the German umlauts in those.
At some point in the near future I need to investigate this issue more closely. Does PDF allow full UTF-8 in references? Where is the bug in the WeasyPrint implementation? Would the correction be in the escaping of special characters or in the enablement of UTF-8?
And then, when the issue is solved, I'll have to trigger re-creation of the PDFs.
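One direction I might try, noted here only as a sketch and not as the actual fix: transliterate the umlauts in the Pandoc-generated ids and in the matching fragment hrefs before the HTML is handed to WeasyPrint, so the internal references contain only ASCII. The mapping and the function name are assumptions of mine.

from bs4 import BeautifulSoup

# naive transliteration of the characters that broke the TOC links;
# this mapping is an assumption for the sketch, not the final solution
UMLAUT_MAP = str.maketrans({
    "ä": "ae", "ö": "oe", "ü": "ue",
    "Ä": "Ae", "Ö": "Oe", "Ü": "Ue", "ß": "ss",
})


def ascii_safe_anchors(html_doc):
    """Rewrite ids and matching '#...' hrefs so they contain only ASCII."""
    soup = BeautifulSoup(html_doc, "html.parser")
    for tag in soup.find_all(id=True):
        tag["id"] = tag["id"].translate(UMLAUT_MAP)
    for tag in soup.find_all("a", href=True):
        if tag["href"].startswith("#"):
            tag["href"] = "#" + tag["href"][1:].translate(UMLAUT_MAP)
    return str(soup)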
Knowledge Resources about CSS Paged Media
Index Page Implementation
The subdomain idee.frank-siebert.de will serve two index pages, one in German and one in English.
- idee.html - main index page in German
- concept.html - English index page
There will not be many English articles, as far as I can foresee. That's the main reason not to give those articles their own subdomain. And most probably the English articles will not be translations of German articles.
I'm undecided on the question whether the search should be restricted to the language of the current site. For a start I'll not implement such a restriction.
Under these circumstances it seems to make no sense to have a language switch somewhere on the site, because then visitors would assume that they can switch the language of the current article, which will not be the case. And for sure I'll refrain from faking a multi-language page via google translate, just to be able to show a language switch button.
Index Page Content
The index page content for the respective language will be generated from the RSS file created for that language. The language specific portal header will be injected.
I based the index page generation on the archive generation. Hot needle implementation and a lot to refactor to get it nice, but it works.
This implementation allows defining a separate item count for the RSS feed and the index page.
The Index Builder
~/projects/idee/generator/idxbuilder.py
"""
Update the index pages of the webseite.
@author: Frank Siebert
@license: https://creativecommons.org/publicdomain/zero/1.0/deed.en
@date: 2022-03-15
All links provided relative to the /article/ folder
@author: Frank Siebert
"""
import datetime
from pubmetadata import PubMetaData
from gitmsgconstants import GitMsgConstants as gmc
from archive import Archive
# Number of items to included into the RSS feed
= 15
ITEM_COUNT
def by_pub_date(article_data):
"""
Return the publishing date as sort criteria.
Parameters
----------
e : Series
article_data.
Returns
-------
TYPE
Date as Str
"""
return article_data[PubMetaData.pubdate]
class IDXBuilder():
"""Manage all changees in the index page."""
def __init__(self):
"""
Initialize changelists.
The information about the changed html pages comes from
PubMetaData.instance._updates and
PubMetaData.instance._deletions .
Returns
-------
None.
"""
# information for German page changes on site "Idee".
self.de_list = []
# information for English page changes on site "Concept".
self.en_list = []
# The time of the update
self._nowdate = datetime.datetime.now().isoformat()
# soup of currently processed Index html
for article_data in PubMetaData.instance._updates:
if article_data[PubMetaData.site] == "Idee" \
and article_data.name != "rechtliches":
self.de_list.append(article_data)
else:
if article_data.name != "legal":
self.en_list.append(article_data)
for article_data in PubMetaData.instance._deletions:
# TODO
pass
# Default sort is ascending, oldest posts first in list
self.de_list.sort(key=by_pub_date)
self.en_list.sort(key=by_pub_date)
def update(self):
"""
Iterate over changes and update respective index pages.
The information about the changed html pages comes from
PubMetaData.instance._updates and
PubMetaData.instance._deletions .
Returns
-------
None.
"""
for article_data in self.de_list:
= Archive._update(gmc.idee_index, article_data,
soup ="./article/")
article_loc= IDXBuilder._limit_entries(soup)
soup = soup.prettify()
html_doc
with open(gmc.idee_index, 'w') as index_file:
print(html_doc, file=index_file)
index_file.flush()
index_file.close()
for article_data in self.en_list:
= Archive._update(gmc.concept_index, article_data,
soup ="./article/")
article_loc= IDXBuilder._limit_entries(soup)
soup
= soup.prettify()
html_doc
with open(gmc.concept_index, 'w') as index_file:
print(html_doc, file=index_file)
index_file.flush()
index_file.close()
@staticmethod
def _limit_entries(soup):
= soup.find_all("article")
tags = 0
count for tag in tags:
if count > ITEM_COUNT:
tag.decompose()else:
+= 1
count return soup
Since the template for archive pages is used for the index page, it is necessary to remove the h1 tag with the text "Archive" after initial creation.
That's a one-time intervention, and I did not see any need to implement something to avoid it. As can easily be seen, the index pages are created with the Archive._update() function. Probably not implemented very elegantly, but effective reuse.
There is obviously room for improvement.
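For the record, the one-time intervention itself is tiny; a sketch with BeautifulSoup, where the file path is just an example of mine and not a constant from the generator.

from bs4 import BeautifulSoup

# one-time cleanup after the index page was created from the archive template
INDEX_FILE = "website/idee.html"  # example path, adjust as needed

with open(INDEX_FILE, "r") as index_file:
    soup = BeautifulSoup(index_file.read(), "html.parser")

# remove the h1 tag inherited from the archive template
h1_tag = soup.find("h1")
if h1_tag:
    h1_tag.decompose()

with open(INDEX_FILE, "w") as index_file:
    print(soup.prettify(), file=index_file)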
Own Magic Words
I introduced my own so-called magic words to control the production of the PDF or the display of the CC license information.
- __NOPDF__ prevents the PDF creation and the placement of the PDF Icon.
- __NOLIC__ prevents the placement of the License Icons
The rationale is quite simple. If I post just a simple video, audio or reading recommendation, it does not make any sense to place license information for a non-existing own intellectual work.
Indeed it only raises the risk that consumers misunderstand the license information as being applicable to the recommended content.
The magic words are ignored in the MediaWiki and processed by Pandoc into content placed in < p > tags. The plainworker.py queries their existence and changes the output accordingly.
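How such a query can look is sketched below. This is not the plainworker.py code itself, only a minimal illustration of mine, assuming the magic words end up as the sole text of their own < p > tags after the Pandoc run.

from bs4 import BeautifulSoup

MAGIC_WORDS = ("__NOPDF__", "__NOLIC__")


def read_magic_words(soup):
    """Collect magic words from <p> tags and remove those tags from the soup."""
    found = set()
    for tag in soup.find_all("p"):
        text = tag.get_text(strip=True)
        if text in MAGIC_WORDS:
            found.add(text)
            tag.decompose()
    return found


# usage sketch
soup = BeautifulSoup("<p>__NOPDF__</p><p>Some article text.</p>", "html.parser")
flags = read_magic_words(soup)
create_pdf = "__NOPDF__" not in flags
show_license = "__NOLIC__" not in flags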
German Quotation Marks
As I found out that I cannot use "normal" quotation marks in titles, I learned today how to enter the German quotation marks via the German keyboard in front of me.
Which year is it? 2022. When did I start working in the IT business? I think it was called EDV in Germany in those times, „elektronische Datenverarbeitung“ (electronic data processing). It was in December 1988.
It took a bit more than 33 years to learn how to enter the German quotation marks. Time to note it down, or I'll probably forget it again.
- „ [AltGr]+[Fn]+v
- “ [AltGr]+[Fn]+b
Final Recapitulation
The documentation shown follows the implementation sequence, while avoiding showing the code evolution in detail. This is probably not the best possible sequence for a documentation, but I tried to combine it with the implementation story.
The code itself contains opportunities for improvement. I would not consider the code shown here to be best practice for any purpose.
However, the code is stable enough to go live with the solution on my own site, and I already did. This is the first article, apart from the legal page and the page about the PDF logo, which gets published natively on this site.
It is a very long article, and I hope the formatting of the new article HTML keeps it readable in spite of its length.
I learned a lot from this project, and I hope the description is helpful for someone.
Footnotes
- Gitblog - the software that powers my blog , 2020-05-07 ↑
- GitLab Flavored Markdown ↑
- sitemaps.org ; www.sitemaps.org ↑
- Parsing a Wikipedia page's content with python ↑
- Building a full-text search engine in 150 lines of Python code ; Bart de Goede; bart.degoe.de; 2021-03-24 ↑
- Gensim ; WikiPedia ↑
- Whoosh - How to search ; https://whoosh.readthedocs.io/en/latest/searching.html ↑
- rank-bm25 0.2.1 ; pypi.org; 2020-06-04 ↑
- Improvements to BM25 and Language Models Examined ; Andrew Trotman, Antti Puurula, Blake Burgess; Association for Computing Machinery; DOI: https://doi.org/10.1145/2682862.2682863 ; PDF ; 2014-11-26 ↑
- What is the difference between Okapi bm25 and NMSLIB? ; Data Science Stack Exchange; 2021-03-01 ↑
- expandtemplates should use "post" instead of "get" · Issue #272 · mwclient/mwclient ; github.com ↑
- Somebody elses problem - Wikipedia ; en.wikipedia.org ↑
- Configuring MariaDB for Remote Client Access ; mariadb.com ↑
- agate 1.6.3 ; agate.readthedocs.io ↑
- pandas documentation ; pandas.pydata.org ↑
- Add new rows and columns to Pandas dataframe ; kanoki; 2019-08-03 ↑
- Pandas Tutorial ; www.w3schools.com ↑
- Getting Started with Bioconductor 3.7 ; bioconductor.org ↑
- Git Hook Pull After Push - remote: fatal: Not a git repository: '.' · Joe Januszkiewicz ; Joe Januszkiewicz; 2014-04-03 ↑
- sitemaps.org ; www.sitemaps.org ↑
- Feed Validation Service ; validator.w3.org ↑
- RSS 2.0 Specification ; www.rssboard.org ↑
- RDF Site Summary 1.0 Modules: Content ; web.resource.org ↑
- The Atom Syndication Format ; M. Nottingham, R. Sayre; www.rfc-editor.org; DOI: https://doi.org/10.17487/RFC4287 ; December 2005 ↑
- Multiple channels in a single RSS xml - is it ever appropriate? ; aoeu; Stack Overflow; 2010-10-18 ↑
- RSS update single item ; lou; Stack Overflow; 2013-03-18 ↑
- RSS Advisory Board - Relative links ; www.rssboard.org ↑
- Module ngx_http_addition_module ; nginx.org ↑
- nginx: Mitigating the BREACH Vulnerability with Perl and SSI or Addition or Substitution Modules — Wild Wild Wolf ; wwa; Wild Wild Wolf; 2018-09-04 ↑
- Module ngx_http_ssi_module ; nginx.org ↑
- Pandoc and foreign characters ; Mike Thomsen; Stack Overflow; 2013-09-05 ↑
- Pandoc User’s Guide ; pandoc.org ↑
- Customizing pandoc to generate beautiful pdf and epub from markdown ; learnbyexample.github.io ↑
- How to convert HTML to PDF using pandoc? ; Chris Stryczynski; Stack Overflow; 2017-06-08 ↑
- WeasyPrint ; doc.courtbouillon.org ↑
- Going Further ; doc.courtbouillon.org ↑
- Revisiting HTML To PDF Conversion with CSS Paged Media ; carlos; The Publishing Project; 2021-11-15 ↑
- CSS Paged Media Module Level 3 ; www.w3.org; 2018-10-18 ↑