Raise minimum required WP version to WP 5.2 (and PHP 5.6) by jrfnl · Pull Request #99 · WordPress/wordpress-importer

jrfnl · 2021-02-15T19:33:55Z

Raise minimum required WP version to WP 5.2 (and PHP 5.6)

Includes:

* Adding `Requires PHP` headers.
* Updating the `Requires at least` (WP) headers.
* Stop detection of PHP compatibility issues for PHP < 5.6.
* CI: Stop testing on WP < 5.2 and PHP < 5.6.
  Note: the PHP 7.4 and 8.0 builds against WP 5.2 were running into trouble due to WP itself not being fully compatible, so the CI matrix has been adjusted to only test against those PHP versions on WP versions which claim to be compatible with those PHP versions.

Includes tidying up the file docblock in the `wordpress-importer.php` file.
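For readers less familiar with these headers, here is a minimal sketch of where `Requires at least` and `Requires PHP` sit in a plugin's main file docblock. Only the two version values come from this PR. The other fields are left out, and the exact layout of the real `wordpress-importer.php` docblock is an assumption.

```php
<?php
/**
 * Plugin Name:       WordPress Importer
 * Requires at least: 5.2
 * Requires PHP:      5.6
 */
// Sketch only: a real plugin header carries more fields (description,
// version, author, ...), all omitted here.
```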

Closes #95

Clean up: remove work-arounds and polyfills

With WP 5.2 being the new minimum supported WP version, the polyfill for the `map_deep()` function is no longer needed.

Similarly, the bowing out for the term meta import (on WP < 4.4) is also no longer needed.
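As an illustration of the kind of work-around being removed, a polyfill of this sort is usually wrapped in a `function_exists()` guard so it only kicks in on WordPress versions that lack the function. The sketch below shows that general shape outside of WordPress. It is not the code deleted in this PR, and the sample data is made up.

```php
<?php
// Guarded polyfill pattern: define map_deep() only when it is missing.
// With a minimum of WP 5.2 the function is always present, so a guard
// like this becomes dead code.
if ( ! function_exists( 'map_deep' ) ) {
	function map_deep( $value, $callback ) {
		if ( is_array( $value ) ) {
			foreach ( $value as $index => $item ) {
				$value[ $index ] = map_deep( $item, $callback );
			}
		} elseif ( is_object( $value ) ) {
			foreach ( get_object_vars( $value ) as $name => $item ) {
				$value->$name = map_deep( $item, $callback );
			}
		} else {
			// Scalars (the leaves) are passed through the callback.
			$value = call_user_func( $callback, $value );
		}
		return $value;
	}
}

// Hypothetical sample data: trim every leaf value of a nested array.
$trimmed = map_deep( array( 'a' => ' x ', 'b' => array( ' y ' ) ), 'trim' );
var_dump( $trimmed );
```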

ocean90

Let's do this.

Adds a URL rewriting feature to the WP_Importer class. It is **disabled by default**. When enabled, it rewrites the imported content to replace the original site absolute URLs with the new site absolute URLs. The code understands structured data and handles scenarios a regular string replace cannot.

Solves item 3 from WordPress/php-toolkit#138

Let's discuss enabling URL rewriting for the users in a follow-up PR.

## Example of what this PR can do

Say we're importing a WXR file and:

1. We are importing to `https://science.wordpress.org`
2. WXR has base URL set to `https://🚀-science.com/science`
3. The WXR file contains the following post:

```html
🚀-science.com/science has the best scientific articles on the internet! We're also available via the punycode URL:

https://xn---science-7f85g.com/%73%63ience/.

This isn't migrated: https://🚀-science.comcast/science
Or this: super-🚀-science.com/science

<img src="https://xn---science-7f85g.com/science/wp-content/image.png">
```

Then, the post content will be imported as:

```html
science.wordpress.org has the best scientific articles on the internet! We're also available via the punycode URL:

https://science.wordpress.org/.

This isn't migrated: https://🚀-science.comcast/science
Or this: super-🚀-science.com/science

<img src="https://science.wordpress.org/wp-content/image.png">
```

And that's it! No need to run `wp search-replace`, go through any posts it may have missed etc. etc.

## The problem with traditional URL rewriting methods

Traditional methods of URL replacement in WordPress, such as using the `wp search-replace` CLI command, come with several limitations that can lead to various issues. These problems stem from the simplistic nature of these methods, which treat the content as plain text without understanding the context or structure of the document.
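To make the plain-text limitation concrete, here is a minimal sketch that uses PHP's `str_replace()` as a stand-in for a text-level search and replace. The sample markup and the `science.com` / `newsite.com` values are illustrative only and do not come from the plugin.

```php
<?php
// Illustrative values; not taken from the plugin.
$old_base = 'https://science.com';
$new_base = 'https://newsite.com';

$content = '<a href="https://science.com/about">About</a> '
	. '<a href="https://science.comcast.net/deals">Deals</a>';

// A plain-text replacement has no notion of where a hostname ends,
// so the unrelated comcast.net link is rewritten as well.
$rewritten = str_replace( $old_base, $new_base, $content );

echo $rewritten, PHP_EOL;
```

The second link now points at `https://newsite.comcast.net/deals`, a host that was never part of the migration.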
The primary pitfalls include:

### Inconsistent Replacements

Traditional URL replacement methods rely on straightforward string matching and replacement techniques. While this approach can be effective for simple cases, it often leads to inconsistent replacements in more complex scenarios. For example:

* **Substring Matching**: If you need to replace https://science.com with a new URL, the tool might inadvertently replace instances where science.com appears as part of a larger URL like https://science.comcast.net, leading to incorrect and broken URLs.
* **Case Sensitivity Issues**: These methods might not handle different cases (e.g., Science.com vs. science.com) consistently, resulting in partial or missed replacements.

### Lack of Context

The traditional methods treat the entire content as raw text and lack an understanding of the document’s structure. This can cause several issues:

* **HTML attributes**: The search-replace operation does not distinguish between URLs in HTML tag attributes (like `href` or `src`) and URLs that may appear in plain text, comments, or scripts. For instance, altering `<div id="https://science.com">` to `<div id="https://newsite.com">` might affect JavaScript or CSS, leading to unintended behaviors.
* **Structured Data**: Data serialized in formats like JSON, where URLs are part of a more complex structure, might go unreplaced at best or get malformed at worst.

A URL found in text needs to be escaped differently than one found in an `<a href>` attribute or inside block markup. Here are a few examples:

```html
<img src="https://xn---science-7f85g.com/science/wp-content/image.png">

https://xn---science-7f85g.com/%73%63ience/.
```

### Punycode, URL Encoding

The URL syntax described in the [WHATWG URL standard](https://url.spec.whatwg.org/) isn't trivial. There are special rules for encoding unicode characters, and they're different in paths and query strings.
Here are just two:

* **Punycode**: Internationalized domain names (IDNs) often use Punycode encoding (e.g., https://xn--fsq.com for https://🚀science.com). Simple search-and-replace methods do not account for these encoded values, potentially missing them or corrupting the URL.
* **Encoded URLs**: URLs often contain encoded characters like `%20` for spaces, making direct matching tricky. A naive replacement might fail to recognize or properly handle these encodings, leading to incomplete or erroneous replacements.

The same URL may be expressed in a lot of different ways, for example:

```html
🚀-science.com/science
🚀-science.com/%73%63ience
https://xn---science-7f85g.com/science
```

### Other Edge Cases

In real-world use cases, URLs can take various forms and structures that challenge traditional search-replace methods:

* **Variants and Subdomains**: URLs can have different subdomains, paths, or query parameters. A method targeting `https://science.com` might miss `https://blog.science.com` or `https://science.com/path?query=1`. A person doing the migration might want to either preserve or replace the latter two.
* **Even more contextual awareness**: A URL found inside a `<script>` tag might need to be migrated or might need to be left alone. Ditto for URLs found in HTML attributes such as `class`.

## The solution proposed in this PR

**All layers of structured data are parsed as structured data**.

This PR ships a subset of the [php-toolkit](https://github.com/wordpress/php-toolkit) repository, including a URL parser and a block markup parser. This complements the HTML parser, XML parser, and UTF-8 parser already shipped with the wordpress-importer plugin. There are no naive string replacements involved.

What kinds of URLs can we handle in practice? Here are a few isolated examples:

### Inline text

```html
🚀-science.com/science
```

Gets rewritten as:

```html
science.wordpress.org
```

Note that a mere `index.html` would not get picked up as a domain name.
This PR consults the [public suffix list](https://publicsuffix.org/list/) to avoid such false positives.

### Punycode and HTML entities in text

Since HTML is parsed as HTML and URLs are parsed as URLs, this PR recognizes the domain encoded in this snippet:

```html
https://xn---science-7f85g.com/%73%63ience/
```

And rewrites it as:

```html
https://science.wordpress.org/
```

### Similar-looking domains

Here are two scenarios where a naive search and replace would corrupt the data but this PR handles gracefully:

```html
This isn't migrated: https://🚀-science.comcast/science
Or this: super-🚀-science.com/science
```

Gets rewritten as:

```html
This isn't migrated: https://🚀-science.comcast/science
Or this: super-🚀-science.com/science
```

No changes were made, since neither domain matched the original base site URL.

#### Block attributes

```html
<img src="https://xn---science-7f85g.com/science/wp-content/image.png">
```

Gets rewritten as:

```html
<img src="https://science.wordpress.org/wp-content/image.png">
```

#### Non-URL attributes

```html
```

Gets rewritten as:

```html
```

## Implementation

This PR ships:

* **A subset of [WordPress/php-toolkit](https://github.com/wordpress/php-toolkit)**. It's the minimum subset required to parse all the data layers involved (Block markup, HTML, URL).
* **Vendor libraries in `vendor-patched` directories.** They're all patched to work with PHP 7.2+ (via rector and manual work). Being vendor libraries, they're excluded from phpcs rules.
With those pre-requisites in place, the `WP_Import` class runs that code on every imported post:

```php
$url_mapping = array(
	$this->base_url_parsed->toString() => $this->site_url_parsed,
);
$postdata['post_content'] = wp_rewrite_urls(
	array(
		'block_markup' => $postdata['post_content'],
		'url-mapping'  => $url_mapping,
	)
);
$postdata['post_excerpt'] = wp_rewrite_urls(
	array(
		'block_markup' => $postdata['post_excerpt'],
		'url-mapping'  => $url_mapping,
	)
);
```

### Limitations

#### URL rewriting is only available for WP 6.7+

This PR relies on the `WP_HTML_Tag_Processor::set_modifiable_text` method introduced in WordPress 6.7. It does not attempt to rewrite URLs in WordPress 6.6 or older. I'm interested in submitting a follow-up PR to extend support to those older WordPress versions by shipping a namespaced version of the `WP_HTML_Tag_Processor`.

## Open questions

- [x] **Should URL rewriting be opt-in?** On one hand, I'm worried about existing workflows and tests that assume the URLs don't change and attempt to replace them with `wp search-replace`. On the other hand, making this feature opt-in means most users won't benefit from it. There is a middle ground where it starts as an opt-in switch and, say, 3 or 6 months from now we publish a new major release where URL rewriting is enabled by default.

  ^ I've made URL rewriting disabled by default. Let's discuss enabling it in a separate PR.

### Further reading

For even more context, see:

* The [Make post](https://make.wordpress.org/playground/2024/11/06/using-playground-for-data-liberation-site-synchronization-and-building-streaming-parsers/)
* The [GitHub discussion](WordPress/data-liberation#74) cover a lot. Let me summarize the most important points:
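Referring back to the substring-matching and case-sensitivity pitfalls described earlier, a much simpler host-aware rewrite can be sketched with PHP's built-in `parse_url()`. This is not the implementation shipped in the PR, which parses block markup, HTML, and URLs with php-toolkit. The helper name `rewrite_urls_by_host()` and the sample hosts are hypothetical. The sketch ignores Punycode, percent-encoding, and bare domains without a scheme.

```php
<?php
/**
 * Hypothetical helper: swap the host of every http(s) URL in $text whose
 * host equals $old_host (compared case-insensitively). Other URLs are kept.
 */
function rewrite_urls_by_host( $text, $old_host, $new_host ) {
	return preg_replace_callback(
		'~https?://[^\s"\'<>]+~i',
		function ( $match ) use ( $old_host, $new_host ) {
			$url  = $match[0];
			$host = parse_url( $url, PHP_URL_HOST );
			if ( ! is_string( $host ) || 0 !== strcasecmp( $host, $old_host ) ) {
				return $url; // Different (or unparsable) host: leave untouched.
			}
			$offset = strpos( $url, '://' ) + 3;
			if ( substr( $url, $offset, strlen( $host ) ) !== $host ) {
				return $url; // E.g. user:pass@host, out of scope for this sketch.
			}
			// Replace only the host segment and keep path and query as they were.
			return substr( $url, 0, $offset ) . $new_host . substr( $url, $offset + strlen( $host ) );
		},
		$text
	);
}

$text = 'Migrated: https://Science.com/about Not migrated: https://science.comcast.net/deals';
echo rewrite_urls_by_host( $text, 'science.com', 'newsite.com' ), PHP_EOL;
```

Comparing whole parsed hosts, rather than searching for a substring, is what keeps `science.comcast.net` untouched here while still matching `Science.com`.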

jrfnl added 2 commits February 15, 2021 20:29

Clean up: remove work-arounds and polyfills

09b77e5

With WP 5.2 being the new minimum supported WP version, the polyfill for the `map_deep()` function is no longer needed. Similarly, the bowing out for the term meta import (on WP < 4.4) is also no longer needed.

jrfnl added the [Type] Task label Feb 15, 2021

jrfnl added this to the 0.8.0 milestone Feb 15, 2021

This was referenced Feb 15, 2021

Tests: improve test quality #92

Merged

Drop support for WP < 5.2 / PHP < 5.6 #95

Closed

jrfnl requested a review from dd32 February 15, 2021 19:36

ocean90 approved these changes Feb 15, 2021

ocean90 merged commit 0240a95 into master Feb 15, 2021

ocean90 deleted the feature/drop-support-for-wp-lt-52 branch February 15, 2021 19:48

danielbachhuber mentioned this pull request Dec 5, 2022

Update tests using WordPress Importer to require WordPress 5.2 wp-cli/export-command#102

Merged
