
Raise minimum required WP version to WP 5.2 (and PHP 5.6) #99

Merged
ocean90 merged 2 commits into master from feature/drop-support-for-wp-lt-52 on Feb 15, 2021


@jrfnl jrfnl commented Feb 15, 2021

Raise minimum required WP version to WP 5.2 ( and PHP 5.6)

Includes:

  • Adding Requires PHP headers.
  • Updating the Requires at least (WP) headers.
  • Stop detection of PHP compatibility issues for PHP < 5.6.
  • CI: Stop testing on WP < 5.2 and PHP < 5.6.
    Note: the PHP 7.4 and 8.0 builds against WP 5.2 were running into trouble due to WP itself not being fully compatible, so the CI matrix has been adjusted to only test against those PHP versions on WP versions which claim to be compatible with those PHP versions.

Includes tidying up the file docblock in the wordpress-importer.php file.
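For readers unfamiliar with these headers, here is a minimal sketch of how the two version requirements might look in a plugin file docblock. The version values come from this PR's description; the surrounding layout and the `Plugin Name` line are assumptions for illustration, not a copy of the actual `wordpress-importer.php` docblock.

```php
<?php
/**
 * Illustrative plugin header (assumed layout, other fields omitted).
 *
 * Plugin Name:       WordPress Importer
 * Requires at least: 5.2
 * Requires PHP:      5.6
 */
```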

Closes #95

Clean up: remove work-arounds and polyfills

With WP 5.2 being the new minimum supported WP version, the polyfill for the map_deep() function is no longer needed.

Similarly, the bowing out for the term meta import (on WP < 4.4) is also no longer needed.
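As a rough illustration of what kind of code becomes unnecessary, the sketch below shows the usual `function_exists()` guard pattern for such a polyfill. It is not the code removed by this PR; the function body is an assumption that follows the documented behavior of `map_deep()` (apply a callback to every non-iterable value in a nested array or object).

```php
<?php
// Illustrative only: a guarded polyfill that defines map_deep() when it is missing.
if ( ! function_exists( 'map_deep' ) ) {
	/**
	 * Applies $callback to every non-iterable value inside $value.
	 */
	function map_deep( $value, $callback ) {
		if ( is_array( $value ) ) {
			foreach ( $value as $index => $item ) {
				$value[ $index ] = map_deep( $item, $callback );
			}
		} elseif ( is_object( $value ) ) {
			foreach ( get_object_vars( $value ) as $name => $item ) {
				$value->$name = map_deep( $item, $callback );
			}
		} else {
			$value = call_user_func( $callback, $value );
		}
		return $value;
	}
}

// Trim every string in a nested array.
$trimmed = map_deep( array( 'a' => ' x ', 'b' => array( ' y ' ) ), 'trim' );
```

Once the minimum supported WP version already ships the function, the guard never fires and the whole block is dead code.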

@jrfnl jrfnl added this to the 0.8.0 milestone Feb 15, 2021
@jrfnl jrfnl requested a review from dd32 February 15, 2021 19:36

@ocean90 ocean90 left a comment


Let's do this.

@ocean90 ocean90 merged commit 0240a95 into master Feb 15, 2021
@ocean90 ocean90 deleted the feature/drop-support-for-wp-lt-52 branch February 15, 2021 19:48
adamziel added a commit that referenced this pull request Sep 12, 2025
Adds a URL rewriting feature to the WP_Importer class. It is **disabled by default**. When enabled, it rewrites the imported content to replace the original site absolute URLs with the new site absolute URLs.

The code understands structured data and handles scenarios a regular string replace cannot.

Solves item 3 from WordPress/php-toolkit#138

Let's discuss enabling URL rewriting for the users in a follow-up PR.

## Example of what this PR can do

Say we're importing a WXR file and:

1. We are importing to `https://science.wordpress.org`
2. WXR has base URL set to `https://🚀-science.com/science`
3. The WXR file contains the following post:

```html
<!-- wp:paragraph -->
<p>
	<!-- Inline URLs are migrated -->
	🚀-science.com/science has the best scientific articles on the internet! We're also
	available via the punycode URL:
	
	<!-- No problem handling HTML-encoded punycode URLs with urlencoded characters in the path -->
	&#104;ttps://xn---&#115;&#99;ience-7f85g.com/%73%63ience/.
	
	<!-- Correctly ignores similar–but–different URLs -->
	This isn't migrated: https://🚀-science.comcast/science <br>
	Or this: super-🚀-science.com/science
</p>
<!-- /wp:paragraph -->

<!-- Block attributes are migrated without any issue -->
<!-- wp:image {"src": "https:\/\/\ud83d\ude80-\u0073\u0063ience.com/%73%63ience/wp-content/image.png"} -->
<!-- As are URI HTML attributes -->
<img src="&#104;ttps://xn---&#115;&#99;ience-7f85g.com/science/wp-content/image.png">
<!-- /wp:image -->

<!-- Classes are not migrated. -->
<span class="https://🚀-science.com/science"></span>
```

Then, the post content will be imported as:

```html
<!-- wp:paragraph -->
<p>
	<!-- Inline URLs are migrated -->
	science.wordpress.org has the best scientific articles on the internet! We're also
	available via the punycode URL:
	
	<!-- No problem handling HTML-encoded punycode URLs with urlencoded characters in the path -->
	https://science.wordpress.org/.
	
	<!-- Correctly ignores similar–but–different URLs -->
	This isn't migrated: https://🚀-science.comcast/science <br>
	Or this: super-🚀-science.com/science
</p>
<!-- /wp:paragraph -->

<!-- Block attributes are migrated without any issue -->
<!-- wp:image {"src":"https:\/\/science.wordpress.org\/wp-content\/image.png"} -->
<!-- As are URI HTML attributes -->
<img src="https://science.wordpress.org/wp-content/image.png">
<!-- /wp:image -->

<!-- Class names are not migrated. -->
<span class="https://🚀-science.com/science"></span>
```

And that's it! No need to run `wp search-replace` and then go through any posts it may have missed.

## The problem with traditional URL rewriting methods

Traditional methods of URL replacement in WordPress, such as using the `wp search-replace` CLI command, come with several limitations that can lead to various issues. These problems stem from the simplistic nature of these methods, which treat the content as plain text without understanding the context or structure of the document. The primary pitfalls include:

### Inconsistent Replacements

Traditional URL replacement methods rely on straightforward string matching and replacement techniques. While this approach can be effective for simple cases, it often leads to inconsistent replacements in more complex scenarios. For example:

* **Substring Matching**: If you need to replace https://science.com with a new URL, the tool might inadvertently replace instances where science.com appears as part of a larger URL like https://science.comcast.net, leading to incorrect and broken URLs.
* **Case Sensitivity Issues**: These methods might not handle different cases (e.g., Science.com vs. science.com) consistently, resulting in partial or missed replacements.
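Both pitfalls can be reproduced with plain PHP. The snippet below is a small sketch using `str_replace()`; the sample sentence and the `https://newsite.com` target are made up for illustration.

```php
<?php
$content = 'See https://science.com, https://science.comcast.net and https://Science.com.';

// Plain substring replacement, with no notion of where a host name ends.
$naive = str_replace( 'https://science.com', 'https://newsite.com', $content );

echo $naive, "\n";
// The unrelated comcast.net URL was rewritten, and the capitalized variant was missed.
```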

### Lack of Context

The traditional methods treat the entire content as raw text and lack an understanding of the document’s structure. This can cause several issues:

* **HTML attributes**: The search-replace operation does not distinguish between URLs in HTML tag attributes (like `href` or `src`) and URLs that may appear in plain text, comments, or scripts. For instance, altering `<div id="https://science.com">` to `<div id="https://newsite.com">` might affect JavaScript or CSS, leading to unintended behaviors.
* **Structured Data**: Data serialized in formats like JSON, where URLs are part of a more complex structure, might go unreplaced at best or get malformed at worst. A URL found in text needs to be escaped differently than one found in an `<a href>` attribute or inside block markup.

Here are a few examples:

```html
<!-- wp:image {"src": "https:\/\/\ud83d\ude80-\u0073\u0063ience.com/%73%63ience/wp-content/image.png"} -->
<img src="&#104;ttps://xn---&#115;&#99;ience-7f85g.com/science/wp-content/image.png">
<!-- /wp:image -->
&#104;ttps://xn---&#115;&#99;ience-7f85g.com/%73%63ience/.
```
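To make the JSON case concrete, here is a minimal sketch in plain PHP. It assumes a simplified block attribute string and a made-up `https://newsite.com` target, and it uses a basic prefix check purely to show why decoding has to happen before matching. It is not how this PR implements the rewrite.

```php
<?php
// Block attributes are JSON, so "/" is commonly stored as "\/".
$attrs_json = '{"src":"https:\/\/science.com\/wp-content\/image.png"}';

// The plain-text needle never occurs in the escaped form, so nothing is replaced.
$naive = str_replace( 'https://science.com', 'https://newsite.com', $attrs_json );

// Decoding first exposes the real value; re-encoding restores valid JSON escaping.
$attrs = json_decode( $attrs_json, true );
$old   = 'https://science.com/';
$new   = 'https://newsite.com/';
if ( 0 === strpos( $attrs['src'], $old ) ) {
	$attrs['src'] = $new . substr( $attrs['src'], strlen( $old ) );
}
$rewritten = json_encode( $attrs );
```

Here `$naive` is identical to the input, while `$rewritten` contains the new URL with its slashes escaped again.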

### Punycode, URL Encoding

The URL syntax described in the [WHATWG URL standard](https://url.spec.whatwg.org/) isn't trivial. There are special rules for encoding Unicode characters, and they're different in paths and query strings. Here are just two:

* **Punycode**: Internationalized domain names (IDNs) often use Punycode encoding (e.g., https://xn--fsq.com for https://🚀science.com). Simple search-and-replace methods do not account for these encoded values, potentially missing them or corrupting the URL.
* **Encoded URLs**: URLs often contain encoded characters like `%20` for spaces, making direct matching tricky. A naive replacement might fail to recognize or properly handle these encodings, leading to incomplete or erroneous replacements.

The same URL may be expressed in many different ways, for example:

```html
🚀-science.com/science
🚀-science.com/%73%63ience
https://xn---science-7f85g.com/science
```
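The layering can be peeled apart step by step with PHP built-ins. The sketch below decodes the HTML character references and the percent-encoded path of the example used throughout this description. It deliberately stops before the Punycode layer, which would need something like the intl extension's `idn_to_utf8()`, and `parse_url()` is only a stand-in here for a spec-compliant URL parser.

```php
<?php
$raw = '&#104;ttps://xn---&#115;&#99;ience-7f85g.com/%73%63ience/';

// Layer 1: HTML character references.
$url = html_entity_decode( $raw, ENT_QUOTES | ENT_HTML5, 'UTF-8' );

// Layer 2: percent-encoding in the path.
$parts = parse_url( $url );
$path  = rawurldecode( $parts['path'] );

echo $parts['host'], $path, "\n";
```

Only after both steps does the path read `/science/`, which is what a matcher would need to compare against the base URL.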

### Other Edge Cases

In real-world use cases, URLs can take various forms and structures that challenge traditional search-replace methods:

* **Variants and Subdomains**: URLs can have different subdomains, paths, or query parameters. A method targeting `https://science.com` might miss `https://blog.science.com` or `https://science.com/path?query=1`. A person doing the migration might want to either preserve or replace the latter two.
* **Even more contextual awareness**: A URL found inside a `<script>` tag might need to be migrated or might need to be left alone. Ditto for URLs found in HTML attributes such as `class`.

## The solution proposed in this PR

**All layers of structured data are parsed as structured data**. This PR ships a subset of the [php-toolkit](https://github.com/wordpress/php-toolkit) repository, including a URL parser and a block markup parser. This complements the HTML parser, XML parser, and UTF-8 parser already shipped with the wordpress-importer plugin. There are no naive string replacements involved.

What kinds of URLs can we handle in practice? Here are a few isolated examples:

### Inline text

```html
<!-- wp:paragraph --><p>🚀-science.com/science</p><!-- /wp:paragraph -->
```

Gets rewritten as:

```html
<!-- wp:paragraph --><p>science.wordpress.org</p><!-- /wp:paragraph -->
```

Note that a mere `index.html` would not get picked up as a domain name. This PR consults the [public suffix list](https://publicsuffix.org/list/) to avoid such false-positives.
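The idea behind the suffix check can be sketched in a few lines. Everything below is hypothetical: the helper name, and especially the three-entry suffix array, which stands in for the real public suffix list (far larger, and including multi-label suffixes this sketch ignores).

```php
<?php
// Hypothetical three-entry stand-in for the public suffix list.
$known_suffixes = array( 'com', 'org', 'net' );

/**
 * Returns true when the text after the last dot is a known suffix.
 */
function looks_like_domain( $token, $known_suffixes ) {
	$dot = strrpos( $token, '.' );
	if ( false === $dot || 0 === $dot ) {
		return false;
	}
	$suffix = strtolower( substr( $token, $dot + 1 ) );
	return in_array( $suffix, $known_suffixes, true );
}
```

With this list, `looks_like_domain( 'index.html', $known_suffixes )` is false because `html` is not a known suffix, while `🚀-science.com` passes.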

### Punycode and HTML entities in text

Since HTML is parsed as HTML and URLs are parsed as URLs, this PR recognizes the domain encoded in this snippet:

```html
<!-- wp:paragraph --><p>&#104;ttps://xn---&#115;&#99;ience-7f85g.com/%73%63ience/</p><!-- /wp:paragraph -->
```

And rewrites it as:

```html
<!-- wp:paragraph --><p>https://science.wordpress.org/</p><!-- /wp:paragraph -->
```

### Similar-looking domains 

Here are two scenarios where a naive search and replace would corrupt the data but that this PR handles gracefully:

```html
	<!-- Correctly ignores similar–but–different URLs -->
	This isn't migrated: https://🚀-science.comcast/science <br>
	Or this: super-🚀-science.com/science
```

Gets rewritten as:

```html
	<!-- Correctly ignores similar–but–different URLs -->
	This isn't migrated: https://🚀-science.comcast/science <br>
	Or this: super-🚀-science.com/science
```

No changes were made, since neither domain matched the original base site URL.
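The difference from substring matching is that the host is compared as a whole. Below is a minimal sketch of that idea with a hypothetical `same_host()` helper. It uses PHP's `parse_url()` and ASCII host names for simplicity, whereas this PR relies on a dedicated URL parser.

```php
<?php
/**
 * Returns true only when the URL's host equals $base_host (case-insensitively).
 */
function same_host( $candidate_url, $base_host ) {
	$host = parse_url( $candidate_url, PHP_URL_HOST );
	return is_string( $host ) && 0 === strcasecmp( $host, $base_host );
}
```

`same_host( 'https://science.comcast.net/science', 'science.com' )` and `same_host( 'https://super-science.com/science', 'science.com' )` are both false, so neither URL would be touched.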

### Block attributes

```html
<!-- wp:image {"src": "https:\/\/\ud83d\ude80-\u0073\u0063ience.com/%73%63ience/wp-content/image.png"} -->
    <img src="&#104;ttps://xn---&#115;&#99;ience-7f85g.com/science/wp-content/image.png">
<!-- /wp:image -->
```

Gets rewritten as:

```html
<!-- wp:image {"src":"https:\/\/science.wordpress.org\/wp-content\/image.png"} -->
<img src="https://science.wordpress.org/wp-content/image.png">
<!-- /wp:image -->
```

### Non-URL attributes

```html
<!-- Classes are not migrated. -->
<span class="https://🚀-science.com/science"></span>
```

Gets rewritten as:

```html
<!-- Classes are not migrated. -->
<span class="https://🚀-science.com/science"></span>
```
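One way to picture this behavior is an allowlist of attributes that are known to carry URLs. The sketch below is hypothetical and deliberately incomplete: the helper name and the two-entry map are assumptions for illustration, not the list used by this PR.

```php
<?php
// Hypothetical, deliberately incomplete allowlist of URL-bearing attributes.
$url_attributes = array(
	'a'   => array( 'href' ),
	'img' => array( 'src' ),
);

/**
 * Returns true when $attribute on $tag is expected to hold a URL.
 */
function is_url_attribute( $tag, $attribute, $url_attributes ) {
	$tag = strtolower( $tag );
	return isset( $url_attributes[ $tag ] )
		&& in_array( strtolower( $attribute ), $url_attributes[ $tag ], true );
}
```

Under this sketch `img`/`src` qualifies for rewriting while `span`/`class` does not, even when the class value happens to look like a URL.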

## Implementation

This PR ships:

* **A subset of [WordPress/php-toolkit](https://github.com/wordpress/php-toolkit)**. It's the minimum subset required to parse all the data layers involved (Block markup, HTML, URL).
* **Vendor libraries in `vendor-patched` directories.** They're all patched to work with PHP 7.2+ (via rector and manual work). Being vendor libraries, they're excluded from phpcs rules.

With those prerequisites in place, the `WP_Import` class runs the following code on every imported post:

```php
$url_mapping              = array(
	$this->base_url_parsed->toString() => $this->site_url_parsed,
);
$postdata['post_content'] = wp_rewrite_urls(
	array(
		'block_markup' => $postdata['post_content'],
		'url-mapping'  => $url_mapping,
	)
);
$postdata['post_excerpt'] = wp_rewrite_urls(
	array(
		'block_markup' => $postdata['post_excerpt'],
		'url-mapping'  => $url_mapping,
	)
);
```

### Limitations

#### URL rewriting is only available for WP 6.7+

This PR relies on the `WP_HTML_Tag_Processor::set_modifiable_text` method introduced in WordPress 6.7. It does not attempt to rewrite URLs in WordPress 6.6 or older. I'm interested in submitting a follow-up PR to extend support to those older WordPress versions by shipping a namespaced version of the `WP_HTML_Tag_Processor`.
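A version gate like this can be expressed as feature detection rather than a version comparison. The following is a sketch only: the helper name is hypothetical and the PR may perform the check differently, but `method_exists()` is a PHP built-in and the class and method names are the ones mentioned above.

```php
<?php
// Hypothetical helper name; the check itself uses PHP's built-in method_exists().
function importer_can_rewrite_urls() {
	// False when the class is missing or predates set_modifiable_text().
	return method_exists( 'WP_HTML_Tag_Processor', 'set_modifiable_text' );
}
```

Outside of WordPress, or on an older version without the method, the helper returns false and URL rewriting would simply be skipped.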

## Open questions

- [x] **Should URL rewriting be opt-in?** On one hand, I'm worried about existing workflows and tests that assume the URLs don't change and attempt to replace them with `wp search-replace`. On the other hand, making this feature opt-in means most users won't benefit from it. There is a middle ground where it starts as an opt-in switch and, say, 3 or 6 months from now we publish a new major release where URL rewriting is enabled by default.

^ I've made URL rewriting disabled by default. Let's discuss enabling it in a separate PR.

### Further reading

For even more context, see:

* The [Make post](https://make.wordpress.org/playground/2024/11/06/using-playground-for-data-liberation-site-synchronization-and-building-streaming-parsers/)
* The [GitHub discussion](WordPress/data-liberation#74) cover a lot. Let me summarize the most important points:

Development

Successfully merging this pull request may close these issues.

Drop support for WP < 5.2 / PHP < 5.6