Guestbook Architecture Notes

Plain text storage + CGI processing

Overall Flow

The complete path after a guestbook submission is as follows:


[Visitor fills form]  →  [CGI script receives]  →  [Writes to guestbook.txt]
       ↓
[Triggers rebuild-all]  →  [build.ps1 reads data]  →  [Injects into sidebar HTML]
       ↓
[Deploy new pages]  →  [Visitor sees message]

Messages do not appear immediately after submission; they are compiled into the static pages during the next build. (I think real-time updates are needed.)^[2]

1. Front-end Form

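The form is written directly in sidebar-left.html, in the guestbook area of the left sidebar. It is a plain form in the HTML 3.2 style (the placeholder attribute is newer and is simply ignored by older browsers):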


<form action="/cgi-bin/guestbook.py" method="post">
    Name:    <input type="text" name="name">
    Email:   <input type="text" name="email" placeholder="Optional">
    Content: <textarea name="content"></textarea>
    [x] Show IP  <input type="checkbox" name="show_ip" value="yes">
    <input type="submit" value="Send Message">
</form>

The form has four fields: name (required), email (optional; if provided, the displayed name becomes a mailto link), content (required), and show_ip (controls whether the IP is displayed publicly). There is no CAPTCHA, and none is planned for now.
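Nothing in the form markup enforces the required/optional split, so the check presumably happens on the server. A minimal sketch, assuming a hypothetical helper named validate_fields that receives the parsed form as a dict; the name, the error messages, and the loose email check are illustrative and not taken from guestbook.py:

```python
def validate_fields(form):
    """Return a list of problems; an empty list means the submission is acceptable.

    Hypothetical helper: the real guestbook.py may validate differently.
    """
    errors = []
    # name and content are required; whitespace-only values count as missing
    for field in ("name", "content"):
        if not form.get(field, "").strip():
            errors.append(f"{field} is required")
    # email is optional, so it is only checked when something was entered
    email = form.get("email", "").strip()
    if email and "@" not in email:
        errors.append("email does not look like an address")
    return errors


print(validate_fields({"name": "xintai", "content": "hello"}))  # []
```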

2. CGI Backend

On submission, the form is posted to /cgi-bin/guestbook.py, a CGI^[1] script that processes it.

Input Handling


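import os
import sys
from urllib.parse import parse_qs

def parse_form_data():
    method = os.environ.get('REQUEST_METHOD', 'GET')
    if method == 'GET':
        # GET: fields arrive URL-encoded in QUERY_STRING
        qs = os.environ.get('QUERY_STRING', '')
        form_data = parse_qs(qs)
    else:
        # POST: CONTENT_LENGTH is a byte count (and may be empty), so read bytes and decode
        length = int(os.environ.get('CONTENT_LENGTH') or 0)
        body = sys.stdin.buffer.read(length).decode('utf-8')
        form_data = parse_qs(body)
    # Keep only the first value of each field
    return {k: v[0] for k, v in form_data.items()}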

CGI input is very primitive: a GET request's data arrives in the QUERY_STRING environment variable, and a POST request's body arrives on stdin. The script parses the URL-encoded form data, then sanitizes it:


import html

def sanitize(text):
    text = text.replace('\n', ' ').replace('\r', ' ')  # Replace newlines with spaces
    text = text.replace('|', ' ')                       # Replace the field delimiter
    return html.escape(text)                            # Escape HTML special characters

Removing | is necessary because it is the data file's field delimiter; removing newlines keeps each message on a single line; HTML escaping prevents XSS attacks.
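To make the effect concrete, the sketch below repeats the sanitize function with its import and runs it on a hostile input. The sample string is made up for illustration.

```python
import html


def sanitize(text):
    text = text.replace('\n', ' ').replace('\r', ' ')  # newlines would split a record
    text = text.replace('|', ' ')                       # '|' would add a bogus field
    return html.escape(text)                            # '<', '>', '&' and quotes become entities


sample = 'Bob|<b>hi</b>\nsecond line'
print(sanitize(sample))
# Bob &lt;b&gt;hi&lt;/b&gt; second line
```

Because escaping happens before the text is stored, the data file already holds HTML-safe text.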

3. Client IP Detection

Because the deployment sits behind a reverse proxy, directly reading REMOTE_ADDR would only return the proxy server's IP. So web_server.py resolves the real IP on every request and passes it to the CGI via environment variables:


# In CustomHandler._inject_real_ip()
cf_ip   = headers.get('CF-Connecting-IP')       # Cloudflare
xff     = headers.get('X-Forwarded-For')         # Standard proxy header
real_ip = headers.get('X-Real-IP')               # Commonly used by Nginx
if cf_ip:
    real_client_ip = cf_ip
elif xff:
    real_client_ip = xff.split(',')[0].strip()   # Take the first one
elif real_ip:
    real_client_ip = real_ip
else:
    real_client_ip = client_ip                    # Direct connection fallback
os.environ["REAL_CLIENT_IP"] = real_client_ip     # Inject into CGI environment

Priority: CF-Connecting-IP (Cloudflare) > X-Forwarded-For > X-Real-IP > direct connection. The CGI script then reads the resolved address with os.environ.get("REAL_CLIENT_IP").
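The same priority order can be written as a pure function, which makes it easy to try out without a running server. This is a sketch only: resolve_client_ip is a hypothetical name, the headers argument is a plain dict standing in for the handler's header object, and the addresses come from the reserved documentation ranges.

```python
def resolve_client_ip(headers, client_ip):
    """Pick the client IP: CF-Connecting-IP, then X-Forwarded-For, then X-Real-IP.

    headers:   any mapping with .get() (a dict here, as an assumption)
    client_ip: the socket peer address, used when no proxy header is present
    """
    cf_ip = headers.get('CF-Connecting-IP')
    xff = headers.get('X-Forwarded-For')
    real_ip = headers.get('X-Real-IP')
    if cf_ip:
        return cf_ip.strip()
    if xff:
        # X-Forwarded-For is a comma-separated chain; the first entry is the client
        return xff.split(',')[0].strip()
    if real_ip:
        return real_ip.strip()
    return client_ip


print(resolve_client_ip({'X-Forwarded-For': '203.0.113.7, 198.51.100.2'}, '127.0.0.1'))
# 203.0.113.7
```

These headers are only as trustworthy as the proxy that sets them, so the order matters mainly when every request really does pass through that proxy.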

4. Data Storage

Messages are stored in data/runtime/guestbook.txt, one per line, fields delimited by |:


name|email|content|ip|time|show_ip    ← New format (6 fields)
name|content|ip|time|show_ip          ← Old format (5 fields, backward compatible)

Examples:


DragonRSTER|dragonrster@foxmail.com|Hey, email support is now available|hidden|2026-04-26 18:52:27|no
xintai||This message was sent from win98|180.154.121.226|2026-04-24 23:33:41|yes

If the user chooses not to display their IP, the IP field is written as hidden rather than the actual address. The file is essentially a pipe-delimited CSV variant in plain text, viewable with any tool. Currently there are over 30 messages, with the earliest dating back to 2020 (a relic from the previous blog).
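As an illustration of the write side, the sketch below builds one line in the six-field format. format_record is a hypothetical helper, not code from guestbook.py; it assumes its text arguments were already sanitized, and the timestamp format is inferred from the examples above.

```python
from datetime import datetime


def format_record(name, email, content, ip, show_ip, now=None):
    """Build one guestbook.txt line: name|email|content|ip|time|show_ip.

    Assumes the text arguments contain no '|' or newline (already sanitized).
    """
    now = now or datetime.now()
    stamp = now.strftime('%Y-%m-%d %H:%M:%S')
    stored_ip = ip if show_ip else 'hidden'   # never store the address when hidden
    flag = 'yes' if show_ip else 'no'
    return '|'.join([name, email, content, stored_ip, stamp, flag])


line = format_record('xintai', '', 'This message was sent from win98',
                     '180.154.121.226', True, datetime(2026, 4, 24, 23, 33, 41))
print(line)
# xintai||This message was sent from win98|180.154.121.226|2026-04-24 23:33:41|yes
```

Appending that line plus a newline to data/runtime/guestbook.txt would complete the write.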

5. Build-time Injection

Every time rebuild-all.ps1 runs, build.ps1 performs the following:


# Read guestbook.txt
$lines = Get-Content $guestbookFile -Encoding UTF8
# Take the last 20 messages, reverse order (newest on top)
# @(...) keeps the result an array even when there are zero messages or one
$lastLines = @($lines | Select-Object -Last 20)
[array]::Reverse($lastLines)
# Generate HTML for each message
foreach ($line in $lastLines) {
    $parts = $line -split '\|'
    # Compatible with both old and new formats...
    # Generates: name (with mailto) + content + IP (optional) + timestamp
}
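# Inject at the placeholder in sidebar-left.html
# .Replace() substitutes literally; -replace would interpret "$" sequences
# inside the message HTML as regex group references
$sidebarLeft = $sidebarLeft.Replace("<!-- GUESTBOOK_MESSAGES -->", $messagesHtml)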

Message content is compiled directly into HTML and written at the <!-- GUESTBOOK_MESSAGES --> placeholder in sidebar-left.html. Display logic:

If email is provided → username renders as an <a href="mailto:..."> link
If show_ip is yes → IP address is displayed below the message (small gray text)
All messages are wrapped in a scrollable container, showing at most 20 entries
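The display rules above are implemented in build.ps1; the Python sketch below only mirrors them so the logic can be read in one place. The function names, tags, and class names are hypothetical, and the values are assumed to be HTML-escaped already (that happens at submission time), so they are not escaped a second time.

```python
def newest_first(lines, limit=20):
    """Keep the last `limit` lines and put the newest on top."""
    return list(reversed(lines[-limit:]))


def render_message(name, email, content, ip, time, show_ip):
    """Return an HTML fragment for one message (markup is illustrative)."""
    # With an email, the name becomes a mailto link
    who = f'<a href="mailto:{email}">{name}</a>' if email else name
    parts = [f'<p><b>{who}</b>: {content}</p>']
    if show_ip:
        # IP is shown only when the visitor opted in
        parts.append(f'<small class="gb-ip">{ip}</small>')
    parts.append(f'<small class="gb-time">{time}</small>')
    return '\n'.join(parts)


print(newest_first(['first', 'second', 'third'], limit=2))
# ['third', 'second']
```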

guestbook.txt gained the email field later, so build.ps1 accepts both the old and the new format and tells them apart by field count as it reads each line.
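Detection by field count can be sketched in a few lines. The real logic lives in build.ps1 (PowerShell); this Python version, with the hypothetical name parse_record, only illustrates the idea and relies on sanitization having removed any stray | from the fields.

```python
def parse_record(line):
    """Split one guestbook.txt line into a dict, accepting both layouts.

    Returns None for lines with an unexpected field count.
    """
    parts = line.rstrip('\r\n').split('|')
    if len(parts) == 6:      # new format: name|email|content|ip|time|show_ip
        name, email, content, ip, time, show_ip = parts
    elif len(parts) == 5:    # old format: no email field
        name, content, ip, time, show_ip = parts
        email = ''
    else:
        return None
    return {
        'name': name, 'email': email, 'content': content,
        'ip': ip, 'time': time, 'show_ip': show_ip == 'yes',
    }


record = parse_record('xintai||This message was sent from win98|180.154.121.226|2026-04-24 23:33:41|yes')
print(record['name'], record['show_ip'])
# xintai True
```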

6. Server Side

web_server.py subclasses CGIHTTPRequestHandler from Python's standard library (http.server) and adds several layers of custom logic on top of the standard CGI support:

Path mapping: / → index.html, /blog-* → dist/, /assets/* → dist/assets/
Security: /data/, /scripts/, /src/ return 403 directly
CGI directory: /cgi-bin/ uses standard CGI processing
Logging: each request writes to data/logs/YYYY-MM-DD.log, old logs are automatically gzip-compressed
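The routing rules in this list can be summarized as one decision function. This is a sketch under assumptions, not the code in web_server.py: classify is a hypothetical name, /blog-* is assumed to map to a file of the same name under dist/, and a real handler would also need to normalize ".." segments before the prefix checks.

```python
BLOCKED_PREFIXES = ('/data/', '/scripts/', '/src/')


def classify(path):
    """Return (action, target) for a request path.

    action is 'forbidden' (answer 403), 'cgi', or 'file'.
    """
    path = path.split('?', 1)[0]           # ignore the query string
    if path.startswith(BLOCKED_PREFIXES):
        return ('forbidden', None)
    if path.startswith('/cgi-bin/'):
        return ('cgi', path)
    if path == '/':
        return ('file', 'index.html')
    if path.startswith('/blog-') or path.startswith('/assets/'):
        return ('file', 'dist' + path)     # assumption: same name under dist/
    return ('file', path.lstrip('/'))


print(classify('/data/runtime/guestbook.txt'))  # ('forbidden', None)
print(classify('/assets/style.css'))            # ('file', 'dist/assets/style.css')
```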

Startup:


python web_server.py
# Listens on 0.0.0.0:81 (port 81 is the default)

7. Some Defensive Measures

Although the guestbook relies entirely on manual moderation, some basic restrictions are in place:

Field sanitization: removes | and newlines to prevent data format injection
HTML escaping: html.escape() handles all user input to prevent XSS
IP privacy: users can choose not to disclose their IP, in which case hidden is stored instead of the actual address
Directory protection: /data/, /scripts/, /src/ are blocked at the HTTP level with 403
robots.txt: prohibits crawlers from accessing /cgi-bin/ and /data/
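The robots.txt rule can be checked with the standard library's urllib.robotparser. The file contents below are an assumption (the actual robots.txt is not shown here); only the two disallowed paths come from the list above, and ExampleBot is a made-up crawler name.

```python
from urllib.robotparser import RobotFileParser

# Assumed contents; the site's real robots.txt may contain more rules
ROBOTS_TXT = """\
User-agent: *
Disallow: /cgi-bin/
Disallow: /data/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

print(parser.can_fetch('ExampleBot', '/cgi-bin/guestbook.py'))  # False
print(parser.can_fetch('ExampleBot', '/index.html'))            # True
```

robots.txt is advisory only, which is why the 403 rules on the server remain the actual barrier for /data/.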

[1]	CGI (Common Gateway Interface) dates from 1993 and was proposed by Rob McCool at NCSA. It was the earliest standard for dynamic content on the Web. Although it spawns a process for every request, it is perfectly adequate for low-traffic sites and requires no framework dependencies.
[2]	This has since been changed to automatic background compilation after message submission.
