fix: transliterate non-Latin titles in URL slugs #1526
Pure-Arabic (and other non-Latin) titles previously got stripped by slugify and collapsed to the "topic" fallback, so every Arabic question landed at /questions/<id>/topic. Mirror the existing convertChinese pre-step using go-unidecode so titles in Arabic, Cyrillic, Hebrew, Thai etc. produce a readable ASCII slug. Latin-only and Chinese-only inputs short-circuit and remain byte-identical to the previous output. Gated by a package-level atomic flag (default on) exposed via SetTransliterateNonLatin so an admin toggle can be wired up in a follow-up PR without re-plumbing call sites.
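A minimal sketch of the gating described above, assuming the flag is a `sync/atomic` `atomic.Bool` initialised to true at package load. `SetTransliterateNonLatin` and `IsTransliterateNonLatinEnabled` are the names from this PR; `urlTitle` and the injected `transliterate` function are hypothetical stand-ins so the example stays standard-library only, and the real code may be structured differently.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// transliterateNonLatin is an illustrative package-level flag.
// The zero value of atomic.Bool is false, so the default is set in init.
var transliterateNonLatin atomic.Bool

func init() {
	transliterateNonLatin.Store(true) // default on
}

// SetTransliterateNonLatin flips the flag; safe to call from any goroutine.
func SetTransliterateNonLatin(enabled bool) {
	transliterateNonLatin.Store(enabled)
}

// IsTransliterateNonLatinEnabled reports the current setting.
func IsTransliterateNonLatinEnabled() bool {
	return transliterateNonLatin.Load()
}

// urlTitle is a hypothetical stand-in for the slug pipeline: it applies the
// injected transliterate step only while the flag is on.
func urlTitle(title string, transliterate func(string) string) string {
	if IsTransliterateNonLatinEnabled() {
		title = transliterate(title)
	}
	return title
}

func main() {
	// fake marks its input so the effect of the toggle is visible.
	fake := func(s string) string { return "ascii(" + s + ")" }

	fmt.Println(urlTitle("title", fake))
	SetTransliterateNonLatin(false)
	fmt.Println(urlTitle("title", fake))
}
```

Because the flag is read inside the function, call sites keep their existing signature; an admin setting only needs to call the setter on save and at startup.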
I think this PR needs a clearer scope statement, because in its current form it does more than fix Arabic-only titles. By adding
So this is not only an Arabic fix. It changes the behavior for Thai, Japanese, Korean, Hebrew, Cyrillic, and other scripts as well. I think the PR description and tests should reflect that broader impact explicitly.
The second concern is about transliteration quality. What this PR introduces is a generic ASCII approximation, not linguistically correct multi-language romanization. That may be acceptable as a pragmatic fallback to avoid collapsing to topic.
If the goal of this PR is “avoid empty/topic fallback for non-Latin titles by generating a usable ASCII slug”, then I think that should be stated much more explicitly in the PR description and test coverage.
Reviewer pointed out the fix changes slug generation for many non-Latin scripts, not just Arabic. Pin the actual behavior across Thai, Japanese hiragana, Korean, Hebrew, and Cyrillic so the test surface matches the real scope of the change. Also pin the pre-existing Japanese-kanji-via-pinyin path so reviewers can see it is unchanged by this PR.
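To make that scope concrete, here is a rough sketch of which inputs would reach the transliteration step. It assumes detection is based on the standard library's Unicode script tables, which only approximates the block-range check this PR describes. `needsTransliteration` is a hypothetical name, not a function from the PR.

```go
package main

import (
	"fmt"
	"unicode"
)

// needsTransliteration reports whether s contains at least one letter that is
// neither Latin nor Han. Non-letters (digits, punctuation, emoji, combining
// marks) are ignored, and Han is skipped on the assumption that a separate
// pinyin step already handles it.
func needsTransliteration(s string) bool {
	for _, r := range s {
		if !unicode.IsLetter(r) {
			continue
		}
		if unicode.Is(unicode.Latin, r) || unicode.Is(unicode.Han, r) {
			continue
		}
		return true
	}
	return false
}

func main() {
	samples := []string{
		"hello world", // Latin
		"日本語",         // Han (kanji)
		"こんにちは",       // Hiragana
		"안녕하세요",       // Hangul
		"สวัสดี",      // Thai
		"שלום",        // Hebrew
		"Привет",      // Cyrillic
		"كيف حالك",    // Arabic
		"🎉🎉",          // emoji only
	}
	for _, s := range samples {
		fmt.Printf("%q -> %v\n", s, needsTransliteration(s))
	}
}
```

Under these assumptions only the Latin, kanji, and emoji samples skip the extra step. Every other script takes the new path, which is why each one is pinned in the tests.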
Summary
Pure-Arabic (and other non-Latin) titles previously got stripped by slugify.Slugify and collapsed to the literal "topic" fallback, so on a live site every Arabic question ended up at /questions/<id>/topic.
Mirrors the existing convertChinese pre-step pattern in pkg/htmltext/htmltext.go using github.com/mozillazg/go-unidecode (same author as go-pinyin already in the repo, to minimise new-dep friction).
The fix
UrlTitle() now runs convertNonLatin after convertChinese. The detector skips ASCII, Latin-1 Supplement, Latin Extended, and CJK (which is handled by the existing pinyin step), so emoji / punctuation / symbols still flow into clearEmoji + slugify unchanged. It pays the unidecode cost only when non-Latin letters are present.
Example:
كيف حالك → kyf-hlk (was topic).
Live deployment / real-world verification
This patch has been running in production on ask.namasoft.com (an Apache Answer instance we operate) since deployment, built directly from this branch via docker compose build. The site has Arabic-language questions, so the fix exercises the affected code path on every page load.
Sample question URL on the deployed instance:
Click the link and you'll see the slug is the transliterated Arabic title rather than topic. No data migration was needed since url_title is computed on every request from Title and never persisted (see "Why this is safe to ship" below).
Admin-configurable
The transliteration is gated by a package-level atomic.Bool (default on, since the current behavior is objectively broken for affected users):
htmltext.SetTransliterateNonLatin(enabled bool)
htmltext.IsTransliterateNonLatinEnabled() bool
This is deliberately the minimum surface needed to satisfy "the setting must be readable from UrlTitle()". A follow-up PR can add the admin UI section (Non-Latin Languages Handling) that calls SetTransliterateNonLatin on save and on startup, without having to re-plumb every htmltext.UrlTitle call site through context.Context.
Why this is safe to ship
url_title is not a persisted column. It's not on the Question entity in internal/entity/question_entity.go, no migration has ever added/dropped it, and every call site (internal/service/content/question_service.go, revision_service.go, vote_service.go, etc.) recomputes it from Title at response-build time via htmltext.UrlTitle(...).
Test coverage
pkg/htmltext/htmltext_test.go:
TestUrlTitleTable (table-driven): empty, pure Latin (unchanged), pure Chinese (unchanged; pins existing pinyin behavior), pure Arabic, mixed Latin+Arabic, emoji-only (still collapses to topic as before), very long Arabic (exercises cutLongTitle's 150-byte cap and UTF-8 boundary safety).
TestUrlTitleTransliterationToggle: with the toggle off, Arabic collapses to topic (pre-fix behavior); with it on, transliterates.
TestUrlTitle left untouched.
Test plan for reviewers:
go test ./pkg/htmltext/... (all pass locally)
topic
main (covered by table tests)
Out of scope (intentionally)
Non-Latin Languages Handling admin page + SiteType + service / controller / migration in a follow-up if maintainers want it.
"topic" empty-result fallback.
convertChinese pre-step pattern instead.
Issues / discussion
I didn't find an existing upstream issue covering this — happy to be pointed at one if there is.
🤖 Generated with Claude Code