Raise minimum required WP version to WP 5.2 (and PHP 5.6) #99
Merged
Includes:
* Adding `Requires PHP` headers.
* Updating the `Requires at least` (WP) headers.
* Stop detection of PHP compatibility issues for PHP < 5.6.
* CI: Stop testing on WP < 5.2 and PHP < 5.6.

Includes tidying up the file docblock in the `wordpress-importer.php` file.

Closes #95

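
For illustration, the two headers mentioned above are read from the docblock of a plugin's main file. This is a minimal sketch that assumes the version numbers from this PR's title. The plugin name line is a placeholder, and the real header in `wordpress-importer.php` contains more fields than shown here.

```php
<?php
/*
 * Plugin Name: Example Plugin (placeholder, not the real header)
 * Requires at least: 5.2
 * Requires PHP: 5.6
 */
```
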
With WP 5.2 being the new minimum supported WP version, the polyfill for the `map_deep()` function is no longer needed. Similarly, the bowing out for the term meta import (on WP < 4.4) is also no longer needed.
This was referenced Feb 15, 2021
adamziel added a commit that referenced this pull request Sep 12, 2025
Adds a URL rewriting feature to the WP_Importer class. It is **disabled by default**. When enabled, it rewrites the imported content to replace the original site absolute URLs with the new site absolute URLs. The code understands structured data and handles scenarios a regular string replace cannot.

Solves item 3 from WordPress/php-toolkit#138

Let's discuss enabling URL rewriting for the users in a follow-up PR.

## Example of what this PR can do

Say we're importing a WXR file and:

1. We are importing to `https://science.wordpress.org`
2. WXR has base URL set to `https://🚀-science.com/science`
3. The WXR file contains the following post:

```html
<!-- wp:paragraph -->
<p>
    <!-- Inline URLs are migrated -->
    🚀-science.com/science has the best scientific articles on the internet! We're also available via the punycode URL:

    <!-- No problem handling HTML-encoded punycode URLs with urlencoded characters in the path -->
    https://xn---science-7f85g.com/%73%63ience/.

    <!-- Correctly ignores similar–but–different URLs -->
    This isn't migrated: https://🚀-science.comcast/science <br>
    Or this: super-🚀-science.com/science
</p>
<!-- /wp:paragraph -->

<!-- Block attributes are migrated without any issue -->
<!-- wp:image {"src": "https:\/\/\ud83d\ude80-\u0073\u0063ience.com/%73%63ience/wp-content/image.png"} -->
<!-- As are URI HTML attributes -->
<img src="https://xn---science-7f85g.com/science/wp-content/image.png">
<!-- /wp:image -->

<!-- Classes are not migrated. -->
<span class="https://🚀-science.com/science"></span>
```

Then, the post content will be imported as:

```html
<!-- wp:paragraph -->
<p>
    <!-- Inline URLs are migrated -->
    science.wordpress.org has the best scientific articles on the internet! We're also available via the punycode URL:

    <!-- No problem handling HTML-encoded punycode URLs with urlencoded characters in the path -->
    https://science.wordpress.org/.

    <!-- Correctly ignores similar–but–different URLs -->
    This isn't migrated: https://🚀-science.comcast/science <br>
    Or this: super-🚀-science.com/science
</p>
<!-- /wp:paragraph -->

<!-- Block attributes are migrated without any issue -->
<!-- wp:image {"src":"https:\/\/science.wordpress.org\/wp-content\/image.png"} -->
<!-- As are URI HTML attributes -->
<img src="https://science.wordpress.org/wp-content/image.png">
<!-- /wp:image -->

<!-- Class names are not migrated. -->
<span class="https://🚀-science.com/science"></span>
```

And that's it! No need to run `wp search-replace`, go through any posts it may have missed etc. etc.

## The problem with traditional URL rewriting methods

Traditional methods of URL replacement in WordPress, such as using the `wp search-replace` CLI command, come with several limitations that can lead to various issues. These problems stem from the simplistic nature of these methods, which treat the content as plain text without understanding the context or structure of the document. The primary pitfalls include:

### Inconsistent Replacements

Traditional URL replacement methods rely on straightforward string matching and replacement techniques. While this approach can be effective for simple cases, it often leads to inconsistent replacements in more complex scenarios. For example:

* **Substring Matching**: If you need to replace https://science.com with a new URL, the tool might inadvertently replace instances where science.com appears as part of a larger URL like https://science.comcast.net, leading to incorrect and broken URLs.
* **Case Sensitivity Issues**: These methods might not handle different cases (e.g., Science.com vs. science.com) consistently, resulting in partial or missed replacements.

### Lack of Context

The traditional methods treat the entire content as raw text and lack an understanding of the document’s structure.
This can cause several issues:

* **HTML attributes**: The search-replace operation does not distinguish between URLs in HTML tag attributes (like `href` or `src`) and URLs that may appear in plain text, comments, or scripts. For instance, altering `<div id="https://science.com">` to `<div id="https://newsite.com">` might affect JavaScript or CSS, leading to unintended behaviors.
* **Structured Data**: Data serialized in formats like JSON, where URLs are part of a more complex structure, might go unreplaced at best or get malformed at worst. A URL found in text needs to be escaped differently than one found in an `<a href>` attribute or inside block markup.

Here are a few examples:

```html
<!-- wp:image {"src": "https:\/\/\ud83d\ude80-\u0073\u0063ience.com/%73%63ience/wp-content/image.png"} -->
<img src="https://xn---science-7f85g.com/science/wp-content/image.png">
<!-- /wp:image -->

https://xn---science-7f85g.com/%73%63ience/.
```

### Punycode, URL Encoding

The URL syntax described in the [WHATWG URL standard](https://url.spec.whatwg.org/) isn't trivial. There are special rules for encoding unicode characters, and they're different in paths and query strings. Here are just two:

* **Punycode**: Internationalized domain names (IDNs) often use Punycode encoding (e.g., https://xn--fsq.com for https://🚀science.com). Simple search-and-replace methods do not account for these encoded values, potentially missing them or corrupting the URL.
* **Encoded URLs**: URLs often contain encoded characters like `%20` for spaces, making direct matching tricky. A naive replacement might fail to recognize or properly handle these encodings, leading to incomplete or erroneous replacements.

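
Two of the pitfalls described above can be reproduced with plain `str_replace()`. This is a minimal sketch, not code from this PR. The old and new site URLs are example values taken from the prose, and the variable names are made up for illustration.

```php
<?php
// Hypothetical inputs: the old and new site URLs are examples only.
$old = 'https://science.com';
$new = 'https://newsite.com';

// Pitfall 1: substring matching corrupts a URL on a different domain.
$text  = 'See https://science.comcast.net/plans';
$naive = str_replace( $old, $new, $text );
// $naive is now 'See https://newsite.comcast.net/plans'.

// Pitfall 2: a percent-encoded path is missed entirely.
$encoded   = 'https://science.com/%73%63ience/';
$unchanged = str_replace( 'https://science.com/science', $new, $encoded );
// $unchanged is identical to $encoded, although the path decodes to "/science/".
$decoded_path = rawurldecode( '/%73%63ience/' );
```

In the first case the string match has no notion of where the host name ends. In the second case the match fails because the comparison happens on the encoded bytes rather than on the parsed URL.
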
The same URL may be expressed in a lot of different ways, for example:

```html
🚀-science.com/science
🚀-science.com/%73%63ience
https://xn---science-7f85g.com/science
```

### Other edge cases

In real-world use cases, URLs can take various forms and structures that challenge traditional search-replace methods:

* **Variants and Subdomains**: URLs can have different subdomains, paths, or query parameters. A method targeting `https://science.com` might miss `https://blog.science.com` or `https://science.com/path?query=1`. A person doing the migration might want to either preserve or replace the latter two.
* **Even more contextual awareness**: A URL found inside a `<script>` tag might need to be migrated or might need to be left alone. Ditto for URLs found in HTML attributes such as `class`.

## The solution proposed in this PR

**All layers of structured data are parsed as structured data**. This PR ships a subset of the [php-toolkit](https://github.com/wordpress/php-toolkit) repository, including a URL parser and a block markup parser. This complements the HTML parser, XML parser, and UTF-8 parser already shipped with the wordpress-importer plugin. There are no naive string replacements involved.

What kinds of URLs can we handle in practice? Here are a few isolated examples:

### Inline text

```html
<!-- wp:paragraph --><p>🚀-science.com/science</p><!-- /wp:paragraph -->
```

Gets rewritten as:

```html
<!-- wp:paragraph --><p>science.wordpress.org</p><!-- /wp:paragraph -->
```

Note that a mere `index.html` would not get picked up as a domain name. This PR consults the [public suffix list](https://publicsuffix.org/list/) to avoid such false-positives.

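
The idea behind the public suffix check can be sketched as follows. This is not how the PR implements it. The function name `looks_like_domain` is hypothetical, and the three-entry array is only a stand-in for the real public suffix list, which is much larger and also contains multi-label suffixes.

```php
<?php
// Hypothetical, tiny stand-in for the public suffix list.
$known_suffixes = array( 'com', 'org', 'net' );

/**
 * Returns true when the text before the first slash ends in a known suffix.
 * Illustrative only: it ignores ports, schemes, and multi-label suffixes.
 */
function looks_like_domain( $candidate, array $known_suffixes ) {
	// Keep only the part before the first "/" and lowercase its ASCII letters.
	$parts = explode( '/', $candidate, 2 );
	$host  = strtolower( $parts[0] );

	// A host without a dot has no suffix to check.
	$dot = strrpos( $host, '.' );
	if ( false === $dot ) {
		return false;
	}

	$suffix = substr( $host, $dot + 1 );
	return in_array( $suffix, $known_suffixes, true );
}

$is_file_name = looks_like_domain( 'index.html', $known_suffixes );
$is_domain    = looks_like_domain( '🚀-science.com/science', $known_suffixes );
```

With this stand-in list, `index.html` is rejected because `html` is not a known suffix, while `🚀-science.com/science` is accepted because its host ends in `com`.
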
### Punycode and HTML entities in text

Since HTML is parsed as HTML and URLs are parsed as URLs, this PR recognizes the domain encoded in this snippet:

```html
<!-- wp:paragraph --><p>https://xn---science-7f85g.com/%73%63ience/</p><!-- /wp:paragraph -->
```

And rewrites it as:

```html
<!-- wp:paragraph --><p>https://science.wordpress.org/</p><!-- /wp:paragraph -->
```

### Similar-looking domains

Here are two scenarios where a naive search and replace would corrupt the data, but which this PR handles gracefully:

```html
<!-- Correctly ignores similar–but–different URLs -->
This isn't migrated: https://🚀-science.comcast/science <br>
Or this: super-🚀-science.com/science
```

Gets rewritten as:

```html
<!-- Correctly ignores similar–but–different URLs -->
This isn't migrated: https://🚀-science.comcast/science <br>
Or this: super-🚀-science.com/science
```

No changes were made, since neither domain matched the original base site URL.

#### Block attributes

```html
<!-- wp:image {"src": "https:\/\/\ud83d\ude80-\u0073\u0063ience.com/%73%63ience/wp-content/image.png"} -->
<img src="https://xn---science-7f85g.com/science/wp-content/image.png">
<!-- /wp:image -->
```

Gets rewritten as:

```html
<!-- wp:image {"src":"https:\/\/science.wordpress.org\/wp-content\/image.png"} -->
<img src="https://science.wordpress.org/wp-content/image.png">
<!-- /wp:image -->
```

#### Non-URL attributes

```html
<!-- Classes are not migrated. -->
<span class="https://🚀-science.com/science"></span>
```

Gets rewritten as:

```html
<!-- Classes are not migrated. -->
<span class="https://🚀-science.com/science"></span>
```

## Implementation

This PR ships:

* **A subset of [WordPress/php-toolkit](https://github.com/wordpress/php-toolkit)**. It's the minimum subset required to parse all the data layers involved (Block markup, HTML, URL).
* **Vendor libraries in `vendor-patched` directories.** They're all patched to work with PHP 7.2+ (via rector and manual work). Being vendor libraries, they're excluded from phpcs rules.

With those prerequisites in place, the `WP_Import` class runs this code on every imported post:

```php
// Map the base URL declared in the WXR file to the URL of the importing site.
$url_mapping = array(
	$this->base_url_parsed->toString() => $this->site_url_parsed,
);
$postdata['post_content'] = wp_rewrite_urls(
	array(
		'block_markup' => $postdata['post_content'],
		'url-mapping'  => $url_mapping,
	)
);
$postdata['post_excerpt'] = wp_rewrite_urls(
	array(
		'block_markup' => $postdata['post_excerpt'],
		'url-mapping'  => $url_mapping,
	)
);
```

### Limitations

#### URL rewriting is only available for WP 6.7+

This PR relies on the `WP_HTML_Tag_Processor::set_modifiable_text` method introduced in WordPress 6.7. It does not attempt to rewrite URLs in WordPress 6.6 or older.

I'm interested in submitting a follow-up PR to extend support to those older WordPress versions by shipping a namespaced version of the `WP_HTML_Tag_Processor`.

## Open questions

- [x] **Should URL rewriting be opt-in?** On one hand, I'm worried about existing workflows and tests that assume the URLs don't change and attempt to replace them with `wp search-replace`. On the other hand, making this feature opt-in means most users won't benefit from it. There is a middle ground where it starts as an opt-in switch and, say, 3 or 6 months from now we publish a new major release where URL rewriting is enabled by default.

  ^ I've made URL rewriting disabled by default. Let's discuss enabling it in a separate PR.

### Further reading

For even more context, see:

* The [Make post](https://make.wordpress.org/playground/2024/11/06/using-playground-for-data-liberation-site-synchronization-and-building-streaming-parsers/)
* The [GitHub discussion](WordPress/data-liberation#74) cover a lot. Let me summarize the most important points:
Raise minimum required WP version to WP 5.2 (and PHP 5.6)

Includes:
* Adding `Requires PHP` headers.
* Updating the `Requires at least` (WP) headers.
* Stop detection of PHP compatibility issues for PHP < 5.6.
* CI: Stop testing on WP < 5.2 and PHP < 5.6.

Note: the PHP 7.4 and 8.0 builds against WP 5.2 were running into trouble due to WP itself not being fully compatible, so the CI matrix has been adjusted to only test against those PHP versions on WP versions which claim to be compatible with those PHP versions.

Includes tidying up the file docblock in the `wordpress-importer.php` file.

Closes #95

Clean up: remove work-arounds and polyfills

With WP 5.2 being the new minimum supported WP version, the polyfill for the `map_deep()` function is no longer needed.

Similarly, the bowing out for the term meta import (on WP < 4.4) is also no longer needed.