Script Deep Dive: generate-archive.ps1

Archive · Tag Cloud · Latest Posts · Search Index

What This Script Does

generate-archive.ps1 is the first stop in the build pipeline. It scans all blog content files, extracts their metadata, and generates four outputs in one pass: the latest posts component, the article archive page, the tag cloud page, and the search index. The latest posts component feeds the sidebar, the archive and tag cloud are standalone pages, and the search index feeds the CGI search engine.

It must run before build.ps1, because build.ps1 reads the latest-posts.html it generates and injects it into the sidebar. If the order is reversed, build.ps1 picks up the file from the previous run and visitors see a stale list. rebuild-all.ps1 enforces this order, so it doesn't need to be remembered manually.

Metadata Scanning

The first thing the script does is traverse the blog directory, extracting four pieces of metadata from HTML comments in each .html file:

Date (date) — compatible with both 2026-04-28 and 2026-04-28 23:30 formats
Title (title) — the display name of the article
Tags (tags) — comma-separated keywords used for the tag cloud
Draft flag (draft: true) — if present, the entire article is skipped and excluded from all outputs

All article information is stored in a PowerShell object array, sorted by date in descending order. This step forms the data foundation for the entire script; all four subsequent outputs are derived from this array.
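
The scanning step can be sketched as follows. This is a Python illustration of the logic, not the script's PowerShell. The comment layout (one `<!-- key: value -->` comment per field) and the names parse_meta and scan are assumptions; the only thing known for sure is that the metadata lives in HTML comments.

```python
import re

# Assumed layout: one "<!-- key: value -->" comment per metadata field.
META_RE = re.compile(r"<!--\s*(date|title|tags|draft)\s*:\s*(.*?)\s*-->", re.IGNORECASE)

def parse_meta(html):
    """Return a metadata dict for one article, or None if it is a draft."""
    fields = {key.lower(): value for key, value in META_RE.findall(html)}
    if fields.get("draft", "").lower() == "true":
        return None  # drafts are excluded from every output
    return {
        "date": fields.get("date", ""),
        "title": fields.get("title", ""),
        # Split the comma-separated tag list and drop empty entries.
        "tags": [t.strip() for t in fields.get("tags", "").split(",") if t.strip()],
    }

def scan(pages):
    """pages: iterable of HTML strings. Returns non-draft articles, newest first."""
    articles = [meta for meta in map(parse_meta, pages) if meta is not None]
    # "YYYY-MM-DD" and "YYYY-MM-DD HH:MM" both order correctly as plain strings.
    return sorted(articles, key=lambda a: a["date"], reverse=True)
```

Because both date formats start with YYYY-MM-DD, a plain string sort is enough here and no date parsing is needed.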

Output A: Latest Posts Component

Takes the top 5 articles and generates src/components/latest-posts.html. This file is injected by build.ps1 into the right sidebar of every page, so the "Latest Posts" list on the site is always up to date.

There's a noteworthy detail here: title truncation uses pixel width estimation rather than character count. Chinese characters are estimated at 11px and ASCII characters at 6px; a title whose estimated width exceeds 110px is truncated to 10 characters. Character-count truncation works fine for pure Chinese, but mixed Chinese-English titles can vary wildly: five ASCII letters such as "AAAAA" and five Chinese characters are both 5 characters long, yet by these estimates they differ by nearly a factor of two in pixel width (30px versus 55px). CSS text-overflow would normally handle this, but IE5.5 doesn't support it, so manual width calculation was the only option.
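
A minimal Python sketch of the width estimate, using the numbers above. The function names are hypothetical, and two details are assumptions: every non-ASCII character is counted at the Chinese width, and no ellipsis is appended after truncation.

```python
CJK_PX, ASCII_PX = 11, 6        # per-character estimates from the text
MAX_PX, KEEP_CHARS = 110, 10    # width limit and truncation length

def estimated_width(title):
    """Estimate pixel width: ASCII at 6px, every other character at 11px (assumption)."""
    return sum(ASCII_PX if ord(ch) < 128 else CJK_PX for ch in title)

def truncate_title(title):
    """Cut titles estimated wider than 110px down to their first 10 characters."""
    if estimated_width(title) > MAX_PX:
        return title[:KEEP_CHARS]
    return title
```

With these estimates an 18-letter ASCII title (108px) is left alone, while an 11-character Chinese title (121px) is cut to 10 characters.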

The output file is encoded as UTF-8 without a BOM. It gets injected into pages that already start with a BOM, so a second BOM would end up in the middle of the page and show up as garbled characters at the injection point.
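
The difference is three bytes at the start of the file. The Python sketch below shows them; write_fragment is a hypothetical helper, not something taken from the script.

```python
import codecs

html = "<b>Latest Posts</b>"

with_bom = html.encode("utf-8-sig")   # prepends the 3-byte BOM EF BB BF
without_bom = html.encode("utf-8")    # same bytes, no BOM

def write_fragment(path, text):
    """Write a fragment as UTF-8 without a BOM so it can be spliced into a page."""
    with open(path, "w", encoding="utf-8", newline="") as f:
        f.write(text)
```

A fragment written this way can be concatenated into a host page without leaving BOM bytes in the middle of the document.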

Output B: Article Archive Page

The core of the archive page is year-month grouping. Dates are normalized to YYYY-MM-DD format, and the first 7 characters give YYYY-MM. PowerShell's Group-Object groups the articles by that key in one shot, and the groups are then sorted by month in descending order so the newest is on top.
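
The same grouping, sketched in Python rather than with Group-Object. The function name and the dict shape of an article are assumptions made for the example.

```python
from itertools import groupby

def group_by_month(articles):
    """Group articles by "YYYY-MM", newest month first, newest article first."""
    ordered = sorted(articles, key=lambda a: a["date"], reverse=True)
    # After sorting, articles from the same month are adjacent, so groupby
    # on the first 7 characters of the date yields one group per month.
    return [
        (month, list(items))
        for month, items in groupby(ordered, key=lambda a: a["date"][:7])
    ]
```

Each (month, articles) pair then maps to one heading plus one table on the archive page.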

Each month is rendered as a heading ("April 2026") + a two-column table: date on the left (yellow, 130px wide) and article title on the right (cyan links). This page has no sidebar — it uses a two-column layout with blog articles on the left and other pages (download center, FAQ, etc.) on the right, centered in a 980px table.

The archive page outputs to dist/archive.html (Chinese) or dist/en/archive.html (English).

Output C: Tag Cloud Page

The tag cloud is the most complex of the four outputs. It needs a reverse index from tags to articles: the script iterates over every article's tags and uses a hash table to map each tag to the list of articles that carry it.

Tag font sizes are tiered by article count: 3+ articles get large size (size="4"), 2 articles get medium (size="3"), 1 article gets small (size="2"). Three levels are sufficient for differentiation without needing more sophisticated popularity algorithms.

Below the tag cloud is the article list for each tag, using <a name="tagname"> as anchor points — clicking a link in the tag cloud jumps to the corresponding section. The tag cloud also has independent Chinese and English versions.
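
The reverse index and the three size tiers can be sketched in Python as below. The function names are hypothetical, and the exact markup in cloud_link is an assumption; the source only specifies the size values and the <a name="tagname"> anchors.

```python
from collections import defaultdict

def build_tag_index(articles):
    """Map each tag to the list of articles that carry it."""
    index = defaultdict(list)
    for article in articles:
        for tag in article["tags"]:
            index[tag].append(article)
    return dict(index)

def font_size(count):
    """Three tiers: 3+ articles -> 4, 2 articles -> 3, 1 article -> 2."""
    if count >= 3:
        return 4
    return 3 if count == 2 else 2

def cloud_link(tag, count):
    """One tag-cloud entry linking to the <a name="..."> anchor further down (assumed markup)."""
    return '<a href="#{0}"><font size="{1}">{0}</font></a>'.format(tag, font_size(count))
```

A tag used by three articles would come out as `<a href="#ps"><font size="4">ps</font></a>` under this assumed markup.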

Output D: Search Index

The search index is a plain text file, data/build/search_index.txt, which cgi-bin/search.py reads for full-text search. The format is one record block per article with TITLE, DATE, LINK, and TEXT fields, and blocks are delimited by ---.

The text extraction process is straightforward: read HTML → strip all tags with regex → replace → compress consecutive whitespace to single spaces. No NLP, no word segmentation, just pure string matching. For a personal blog with a few dozen articles, this approach is entirely sufficient.
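
A rough Python sketch of the extraction and of one record block. The "KEY: value" line layout and both function names are assumptions; the source only names the four fields and the --- delimiter. The intermediate replace step is left out because its target isn't stated.

```python
import re

def html_to_text(html):
    """Strip tags with a regex, then collapse whitespace runs to single spaces."""
    text = re.sub(r"<[^>]+>", " ", html)
    return re.sub(r"\s+", " ", text).strip()

def index_record(title, date, link, html):
    """Build one record block; the line layout here is an assumption."""
    return "\n".join([
        "TITLE: " + title,
        "DATE: " + date,
        "LINK: " + link,
        "TEXT: " + html_to_text(html),
        "---",
    ])
```

Replacing each tag with a space rather than an empty string keeps words in neighbouring elements from running together before the whitespace is collapsed.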

The search is completely JavaScript-free — the search box is an HTML form that submits to CGI, the server performs full-text matching, and returns a results page with highlighting. The entire process works the same way search engines did in 2002.

Bilingual Support

The script uses the -Lang parameter to control language (default "zh"). In English mode, blog content is read from blog/en/, interface text switches automatically ("Latest Posts" instead of the Chinese equivalent), and output goes to the dist/en/ subdirectory. Chinese and English archives, tags, and search indexes are completely independent and never mixed.

Design Philosophy

If this script were split into four separate scripts, the code would be more modular, but each full-site build would require scanning the blog directory four times. The design of one scan, four outputs reduces I/O. On a site with only a few dozen articles, the difference may be just a few hundred milliseconds, but it reflects an engineering habit of "thinking through the data flow ahead of time" — knowing where data comes from, where it goes, and how many transformations it passes through in between.

PowerShell's Group-Object, Sort-Object, Select-Object, and other pipeline commands serve a role similar to SQL's GROUP BY, ORDER BY, and LIMIT — except they operate on in-memory object arrays rather than database tables.
