Script Deep Dive: generate-sitemap.ps1
Sitemap Protocol 0.9 · changefreq · priority
What This Script Does
generate-sitemap.ps1 is the final step in the build pipeline. It generates an XML file conforming to the Sitemap Protocol 0.9 standard, telling search engines like Google and Bing which pages exist on this site, when each page was last updated, and which pages are more important.
A sitemap doesn't guarantee indexing — it's merely a suggestion. Search engines have the right to ignore certain pages and to crawl pages not listed in the sitemap. But providing an accurate, up-to-date sitemap can significantly speed up the discovery of new content, especially for new websites and infrequently updated pages.
The Four Fields of the Sitemap Protocol
Each <url> element contains four fields (one required, three optional); a sample entry follows the list:
<loc> (required) — the full URL of the page, including protocol and domain
<lastmod> — last modified date, in YYYY-MM-DD format. Search engines use it to prioritize crawling recently updated pages
<changefreq> — update frequency hint (daily / weekly / monthly), helping crawlers allocate their crawl budget
<priority> — relative priority (0.0-1.0). Homepage is 1.0, blog posts are 0.7, tool pages are 0.3-0.5
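For illustration, a single entry built with a PowerShell here-string might look like the snippet below; the path, date, and values are placeholders, not the script's actual output:

```powershell
# Hypothetical illustration of one <url> entry; the path, date, and values
# are placeholders rather than anything the real script emits.
$entry = @"
  <url>
    <loc>https://www.dragonrster.cn/blog/example-post.html</loc>
    <lastmod>2026-01-15</lastmod>
    <changefreq>monthly</changefreq>
    <priority>0.7</priority>
  </url>
"@
```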
Honestly, modern search engines pay less attention to changefreq and priority — they have their own algorithms for assessing page importance. But adding these two fields costs nothing, and it's better to have them than not.
What Pages Are Scanned
The current version scans the following sources, generating approximately 40 URLs; a sketch of the scan follows the list:
Homepage (Chinese and English) — hardcoded, since the homepage has no corresponding content file
All blog posts (Chinese and English) — scanned from src/content/blog/ and blog/en/, automatically skipping drafts
All standalone pages (Chinese and English) — scanned from src/content/page/ and page/en/
Static pages — archive.html, tags.html, stats.html
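A minimal sketch of that scan, assuming the sources are Markdown files and that drafts are marked by a draft: true line in the front matter (the real script's field names and filters may differ):

```powershell
# Sketch: collect candidate source files from the content directories and
# skip drafts. Assumes Markdown sources and a "draft: true" front-matter flag;
# these are assumptions, not details confirmed by the script itself.
$contentDirs = @(
    'src/content/blog', 'src/content/blog/en',
    'src/content/page', 'src/content/page/en'
)
$sources = foreach ($dir in $contentDirs) {
    Get-ChildItem -Path $dir -Filter '*.md' -File -ErrorAction SilentlyContinue |
        Where-Object {
            # Keep only non-draft files.
            -not (Select-String -Path $_.FullName -Pattern '^draft:\s*true' -Quiet)
        }
}
```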
lastmod uses the source file's LastWriteTime, which is more accurate than the date in metadata — it reflects when the content was actually last modified, rather than the article's byline date.
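Deriving that date from the file system is a one-liner. A sketch, continuing the scan above:

```powershell
# <lastmod> comes from the file system, not from front-matter metadata,
# formatted as YYYY-MM-DD per the Sitemap protocol.
foreach ($file in $sources) {
    $lastmod = $file.LastWriteTime.ToString('yyyy-MM-dd')
    "<lastmod>$lastmod</lastmod>"   # goes into this page's <url> entry
}
```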
Why Static Generation Instead of Dynamic
Some websites use CGI or PHP to dynamically generate sitemaps (scanning the database on each request). This site chose static generation. The reasons are straightforward:
With a small number of articles, generating a static XML in 0.1 seconds during build is perfectly acceptable
Static files can be cached by the web server without consuming CGI processes
Search engine crawlers may visit the sitemap daily; dynamic generation would cause unnecessary server load
Consistent with the site's "static-first" philosophy — whenever possible, pre-generate rather than compute at runtime
This is a consistent design philosophy: static first, CGI as a last resort.
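Concretely, "pre-generate" means collecting the per-page <url> fragments, wrapping them in the Sitemap 0.9 envelope, and writing one file at build time. A sketch, assuming a $urlEntries collection assembled from the scan above:

```powershell
# Sketch: wrap the collected <url> entries in the standard Sitemap 0.9
# envelope and write a single static file during the build.
$sitemap = @"
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
$($urlEntries -join "`n")
</urlset>
"@
Set-Content -Path 'sitemap.xml' -Value $sitemap -Encoding UTF8   # project root, not dist/
```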
Output Location and Auto-Discovery
The sitemap is written to the project root (not dist/) because sitemap.xml must, by convention, live in the website's root directory. Search engine crawlers look for /robots.txt and /sitemap.xml early when crawling a site.
This site's robots.txt also includes a Sitemap: https://www.dragonrster.cn/sitemap.xml line, providing a second discovery path. Belt and suspenders.
Search Engine Discovery Flow
A complete search engine discovery flow goes like this: the crawler first reads robots.txt to check crawl restrictions, then reads sitemap.xml to get every URL and its last modification time, crawls pages starting with the most recently modified (per lastmod), parses the HTML to extract titles and body text, and finally adds the pages to its index.
In 2026, a 90s-style website with no JavaScript and pure table-based layout can still be found by Google, thanks in no small part to the sitemap and solid SEO metadata.