
Provide an MVP implementation of a session middleware#193

Merged
Gallaecio merged 98 commits into scrapy-plugins:main from Gallaecio:session-middleware
Jun 19, 2024

Conversation

Gallaecio (Contributor) commented Apr 17, 2024

To do:

  • Make existing tests pass.
    • Restore compatibility with Scrapy 2.0.1+.
  • Perform a minimal test with a real spider.
  • Add more session links to the docs once client-managed session documentation has been published.
  • Find out why there seem to be extra sessions created.
  • Add a default check for session initialization based on the outcome of the setLocation action.
  • Stop the spider if setLocation is not available for a given session pool.
  • Get Zyte API to fail with a proper status code for expired sessions.
  • Add a stat to track session expirations.
  • Pass tests (make web-poet-related deps optional).
  • Update the docs to cover everything implemented so far.
  • Provide retry policies that do not retry temporary/permanent download errors, and document the need to set one of them when enabling sessions. Maybe override that automatically in the add-on for the default policies.
  • In the documentation about checking sessions, cover how it may be necessary, when using automatic extraction, to also request a body field, browserHtml or httpResponseBody, and how to do that when using providers.
  • Add reference docs for @session_config.
  • Allow disabling session assignment per request.
  • Provide complete test coverage.


codecov Bot commented Apr 18, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 97.56%. Comparing base (a2284c8) to head (c3ed86f).
Report is 6 commits behind head on main.

Current head c3ed86f differs from pull request most recent head c01d368

Please upload reports for the commit c01d368 to get more accurate results.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #193      +/-   ##
==========================================
- Coverage   98.45%   97.56%   -0.90%     
==========================================
  Files          13       14       +1     
  Lines        1102     1476     +374     
  Branches        0      309     +309     
==========================================
+ Hits         1085     1440     +355     
+ Misses         17       15       -2     
- Partials        0       21      +21     
Files Coverage Δ
scrapy_zyte_api/__init__.py 100.00% <100.00%> (ø)
scrapy_zyte_api/_middlewares.py 97.65% <100.00%> (ø)
scrapy_zyte_api/_session.py 100.00% <100.00%> (ø)
scrapy_zyte_api/addon.py 98.07% <100.00%> (-1.93%) ⬇️
scrapy_zyte_api/utils.py 100.00% <100.00%> (ø)

... and 6 files with indirect coverage changes

Gallaecio (Contributor, Author) commented:

I have created a project based on https://github.com/zytedata/zyte-spider-templates-project and added the following spider to it:

from logging import getLogger

from scrapy import Request
from scrapy.exceptions import NotSupported
from scrapy.http.response import Response
from tenacity import RetryCallState, stop_after_attempt
from tenacity.stop import stop_base
from zyte_api import RequestError, RetryFactory

logger = getLogger(__name__)


class custom_throttling_stop(stop_base):

    def __call__(self, retry_state: "RetryCallState") -> bool:
        assert retry_state.outcome, "Unexpected empty outcome"
        exc = retry_state.outcome.exception()
        assert exc, "Unexpected empty exception"
        return (
            isinstance(exc, RequestError)
            and exc.status == 429
            and exc.parsed.data["title"] == "Session has expired"
        )


class CustomRetryFactory(RetryFactory):
    # Do not retry 520, let Scrapy deal with them (i.e. retry them with a
    # different session).
    temporary_download_error_stop = stop_after_attempt(1)

    # Work around a temporary issue: Zyte API reports expired sessions as
    # 429 errors, which should not be retried with the same session.
    throttling_stop = custom_throttling_stop()


SESSION_RETRY_POLICY = CustomRetryFactory().build()


class _SessionChecker:

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler)

    def __init__(self, crawler):
        params = crawler.settings["ZYTE_API_SESSION_PARAMS"]
        self.zip_code = params["actions"][0]["address"]["postalCode"]

    def check_session(self, request: Request, response: Response) -> bool:
        try:
            zip_code = response.css(".delivery-text + a > span > span::text").get()
        except NotSupported:  # Empty response.
            logger.debug(f"Empty response {response}.")
            return False
        if not zip_code:
            logger.debug(f"No ZIP code found in {response}.")
            return False
        if zip_code == self.zip_code:
            logger.debug(f"Found expected ZIP code {zip_code!r} in {response}.")
            return True
        logger.debug(
            f"Found unexpected ZIP code {zip_code!r} in {response} (expected "
            f"{self.zip_code!r})."
        )
        return False

from zyte_spider_templates import EcommerceSpider


class SessionEcommerceSpider(EcommerceSpider):
    name = "session_ecommerce"

    @classmethod
    def update_settings(cls, settings):
        super().update_settings(settings)
        settings["ZYTE_API_AUTOMAP_PARAMS"] = {"browserHtml": True}

        # DEBUG
        settings["ZYTE_API_LOG_REQUESTS"] = True
        settings["ZYTE_API_LOG_REQUESTS_TRUNCATE"] = 0

        # Settings needed for the session stuff.
        settings["ZYTE_API_SESSION_CHECKER"] = _SessionChecker
        settings["ZYTE_API_SESSION_PARAMS"] = {
            "browserHtml": True,
            "actions": [{"action": "setLocation", "address": {"postalCode": "94124"}}],
        }
        settings["COOKIES_ENABLED"] = False  # Sessions handle cookies.
        settings["ZYTE_API_RETRY_POLICY"] = SESSION_RETRY_POLICY  # Don’t retry bans.
        # Cannot validate extraction-only responses.
        settings["ZYTE_API_PROVIDER_PARAMS"] = {"browserHtml": True}

And executed it as follows:

scrapy crawl session_ecommerce -a "url=https://ecommerce.example/product-list" -a crawl_strategy=pagination_only

It seems to work well enough, although the retry limit can be exceeded, so recommending a higher RETRY_TIMES value in the docs might be worthwhile.
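For illustration, such a recommendation could look like this in a project's settings.py (the value below is an arbitrary example, not a tested recommendation):

```python
# settings.py

# Scrapy's built-in retry middleware gives up on a request after
# RETRY_TIMES retries (default: 2). A higher budget gives requests that
# hit an expired or banned session more chances to be retried with a
# fresh one.
RETRY_TIMES = 5  # arbitrary example value
```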


proway2 commented May 16, 2024

@Gallaecio Do you have a real job or stats from one? How many times did it get an invalid response? For what reasons were responses invalidated?


Gallaecio commented May 16, 2024

@Gallaecio Do you have a real job or stats from one? How many times did it get an invalid response? For what reasons were responses invalidated?

For a short crawl I performed just now:

'scrapy-zyte-api/processed': 21
'scrapy-zyte-api/status_codes/200': 18
'scrapy-zyte-api/status_codes/521': 3
'zyte-api-session/checks/passed': 18
'zyte-api-session/sessions': 11

That is, sessions were created 11 times: the default is to create 8, and 3 of those got a 521 and had to be replaced. Once a valid session was created, all (7) usages succeeded. Of course, it is a rather small sample.

In any case, I will now try to build some of @VMRuiz's ideas for a better location API into this PR (including per-domain, web-poet-like configurations).

Gallaecio (Contributor, Author) commented:

0638df7 is based on the ideas shared by @VMRuiz elsewhere.

It enables a location-specific approach with web-poet-based overrides:

from logging import getLogger
from typing import Any

from pydantic import BaseModel, Field
from pydantic.types import Json
from scrapy import Request, Spider
from scrapy.crawler import Crawler
from scrapy.exceptions import NotSupported
from scrapy.http.response import Response
from scrapy_spider_metadata import Args
from scrapy_zyte_api import SessionConfig, session_config
from tenacity import RetryCallState, stop_after_attempt
from tenacity.stop import stop_base
from zyte_api import RequestError, RetryFactory
from zyte_spider_templates import EcommerceSpider
from zyte_spider_templates.spiders.base import ARG_SETTING_PRIORITY
from zyte_spider_templates.spiders.ecommerce import EcommerceSpiderParams

logger = getLogger(__name__)


class custom_throttling_stop(stop_base):

    def __call__(self, retry_state: "RetryCallState") -> bool:
        assert retry_state.outcome, "Unexpected empty outcome"
        exc = retry_state.outcome.exception()
        assert exc, "Unexpected empty exception"
        return (
            isinstance(exc, RequestError)
            and exc.status == 429
            and exc.parsed.data["title"] == "Session has expired"
        )


class CustomRetryFactory(RetryFactory):
    # Do not retry 520, let Scrapy deal with them (i.e. retry them with a
    # different session).
    temporary_download_error_stop = stop_after_attempt(1)

    # Work around a temporary issue: Zyte API reports expired sessions as
    # 429 errors, which should not be retried with the same session.
    throttling_stop = custom_throttling_stop()


SESSION_RETRY_POLICY = CustomRetryFactory().build()


@session_config("ecommerce.example")
class EcommerceExampleLocationSessionConfig(SessionConfig):

    def check(self, response: Response, request: Request) -> bool:
        try:
            zip_code = response.css(".delivery-text + a > span > span::text").get()
        except NotSupported:  # Empty response.
            logger.debug(f"Empty response {response}.")
            return False
        if not zip_code:
            logger.debug(f"No ZIP code found in {response}.")
            return False
        expected_zip_code = self.location(request)["postalCode"]
        if zip_code == expected_zip_code:
            logger.debug(f"Found expected ZIP code {zip_code!r} in {response}.")
            return True
        logger.debug(
            f"Found unexpected ZIP code {zip_code!r} in {response} (expected "
            f"{expected_zip_code!r})."
        )
        return False


class LocationParam(BaseModel):
    location: Json[Any] = Field(default_factory=dict)


class LocationSpiderParams(LocationParam, EcommerceSpiderParams):
    pass


class LocationEcommerceSpider(EcommerceSpider, Args[LocationSpiderParams]):
    name = "location_ecommerce"

    @classmethod
    def from_crawler(cls, crawler: Crawler, *args, **kwargs) -> Spider:
        spider = super().from_crawler(crawler, *args, **kwargs)
        if spider.args.location:
            spider.settings.set(
                "ZYTE_API_SESSION_ENABLED",
                True,
                priority=ARG_SETTING_PRIORITY,
            )
            spider.settings.set(
                "ZYTE_API_SESSION_LOCATION",
                spider.args.location,
                priority=ARG_SETTING_PRIORITY,
            )
        return spider

    @classmethod
    def update_settings(cls, settings):
        super().update_settings(settings)
        settings["ZYTE_API_AUTOMAP_PARAMS"] = {"browserHtml": True}

        # DEBUG
        settings["ZYTE_API_LOG_REQUESTS"] = True
        settings["ZYTE_API_LOG_REQUESTS_TRUNCATE"] = 0

        # Settings needed for the session stuff.
        settings["ZYTE_API_SESSION_ENABLED"] = True
        settings["COOKIES_ENABLED"] = False  # Sessions handle cookies.
        settings["ZYTE_API_RETRY_POLICY"] = SESSION_RETRY_POLICY  # Don’t retry bans.
        # Cannot validate extraction-only responses.
        settings["ZYTE_API_PROVIDER_PARAMS"] = {"browserHtml": True}

But it also supports a non-location-specific approach, as well as a non-poet-like definition of session initialization parameters and a check function:

from logging import getLogger

from scrapy import Request
from scrapy.exceptions import NotSupported
from scrapy.http.response import Response
from tenacity import RetryCallState, stop_after_attempt
from tenacity.stop import stop_base
from zyte_api import RequestError, RetryFactory

logger = getLogger(__name__)


class custom_throttling_stop(stop_base):

    def __call__(self, retry_state: "RetryCallState") -> bool:
        assert retry_state.outcome, "Unexpected empty outcome"
        exc = retry_state.outcome.exception()
        assert exc, "Unexpected empty exception"
        return (
            isinstance(exc, RequestError)
            and exc.status == 429
            and exc.parsed.data["title"] == "Session has expired"
        )


class CustomRetryFactory(RetryFactory):
    # Do not retry 520, let Scrapy deal with them (i.e. retry them with a
    # different session).
    temporary_download_error_stop = stop_after_attempt(1)

    # Work around a temporary issue: Zyte API reports expired sessions as
    # 429 errors, which should not be retried with the same session.
    throttling_stop = custom_throttling_stop()


SESSION_RETRY_POLICY = CustomRetryFactory().build()


class _SessionChecker:

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler)

    def __init__(self, crawler):
        params = crawler.settings["ZYTE_API_SESSION_PARAMS"]
        self.zip_code = params["actions"][0]["address"]["postalCode"]

    def check(self, response: Response, request: Request) -> bool:
        try:
            zip_code = response.css(".delivery-text + a > span > span::text").get()
        except NotSupported:  # Empty response.
            logger.debug(f"Empty response {response}.")
            return False
        if not zip_code:
            logger.debug(f"No ZIP code found in {response}.")
            return False
        if zip_code == self.zip_code:
            logger.debug(f"Found expected ZIP code {zip_code!r} in {response}.")
            return True
        logger.debug(
            f"Found unexpected ZIP code {zip_code!r} in {response} (expected "
            f"{self.zip_code!r})."
        )
        return False

from zyte_spider_templates import EcommerceSpider


class SessionEcommerceSpider(EcommerceSpider):
    name = "session_ecommerce"

    @classmethod
    def update_settings(cls, settings):
        super().update_settings(settings)
        settings["ZYTE_API_AUTOMAP_PARAMS"] = {"browserHtml": True}

        # DEBUG
        settings["ZYTE_API_LOG_REQUESTS"] = True
        settings["ZYTE_API_LOG_REQUESTS_TRUNCATE"] = 0

        # Settings needed for the session stuff.
        settings["ZYTE_API_SESSION_ENABLED"] = True
        settings["ZYTE_API_SESSION_CHECKER"] = _SessionChecker
        settings["ZYTE_API_SESSION_PARAMS"] = {
            "browserHtml": True,
            "actions": [{"action": "setLocation", "address": {"postalCode": "94124"}}],
        }
        settings["COOKIES_ENABLED"] = False  # Sessions handle cookies.
        settings["ZYTE_API_RETRY_POLICY"] = SESSION_RETRY_POLICY  # Don’t retry bans.
        # Cannot validate extraction-only responses.
        settings["ZYTE_API_PROVIDER_PARAMS"] = {"browserHtml": True}

Also, multiple session pools are now supported, and by default each domain has its own pool. The stats now show check passes and failures separately, depending on whether they happened during session initialization or during regular response checks, and they are split per session pool:

 'zyte-api-session/checks/ecommerce.example/passed': 7,
 'zyte-api-session/init/ecommerce.example/failed': 2,
 'zyte-api-session/init/ecommerce.example/passed': 11}
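As a side note, these flat stat keys are easy to regroup per pool. A minimal sketch, assuming the stats are available as a plain dict and that keys follow the zyte-api-session/&lt;phase&gt;/&lt;pool&gt;/&lt;outcome&gt; pattern shown above (`summarize_session_stats` is a hypothetical helper, not part of the plugin):

```python
from collections import defaultdict


def summarize_session_stats(stats: dict) -> dict:
    """Group zyte-api-session/* stat keys by pool and by phase/outcome."""
    summary: dict = defaultdict(dict)
    for key, value in stats.items():
        parts = key.split("/")
        # Expected shape: zyte-api-session/<phase>/<pool>/<outcome>
        if parts[0] != "zyte-api-session" or len(parts) != 4:
            continue
        _, phase, pool, outcome = parts
        summary[pool][f"{phase}/{outcome}"] = value
    return dict(summary)


stats = {
    "zyte-api-session/checks/ecommerce.example/passed": 7,
    "zyte-api-session/init/ecommerce.example/failed": 2,
    "zyte-api-session/init/ecommerce.example/passed": 11,
}
print(summarize_session_stats(stats))
```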

Now I need to figure out why the number of passes for session initializations is higher than the number of regular check passes. It feels like more sessions are being initialized than necessary.

I also want to implement a basic default check for setLocation based on the action outcome, i.e. if the action failed, consider the initialization failed by default. And maybe try to find out whether setLocation is unsupported on the target website, and stop the spider in that scenario.
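A rough sketch of what that default check could look like, assuming the decoded Zyte API response lists action results under an `actions` key and that a failed action carries an `error` field (both field names and the `set_location_succeeded` helper are assumptions for this sketch, not a confirmed contract):

```python
def set_location_succeeded(raw_api_response: dict) -> bool:
    """Return False if any setLocation action reported an error.

    ``raw_api_response`` stands for the decoded Zyte API response body;
    the ``actions``/``error`` field names are assumptions here.
    """
    for action in raw_api_response.get("actions", []):
        if action.get("action") == "setLocation" and action.get("error"):
            return False
    return True


# Hypothetical action results:
assert set_location_succeeded(
    {"actions": [{"action": "setLocation", "elapsedTime": 1.2}]}
)
assert not set_location_succeeded(
    {"actions": [{"action": "setLocation", "error": "Action not supported"}]}
)
```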

@Gallaecio Gallaecio marked this pull request as ready for review June 5, 2024 14:03
@Gallaecio Gallaecio requested review from BurnzZ, kmike and wRAR June 5, 2024 14:04
@Gallaecio Gallaecio merged commit f582076 into scrapy-plugins:main Jun 19, 2024