
Provide an MVP implementation of a session middleware#193

Merged
Gallaecio merged 98 commits into scrapy-plugins:main from Gallaecio:session-middleware
Jun 19, 2024

Conversation

Gallaecio (Contributor) commented Apr 17, 2024

To do:

  • Make existing tests pass.
    • Restore compatibility with Scrapy 2.0.1+.
  • Perform a minimal test with a real spider.
  • Add more session links to the docs once client-managed session documentation has been published.
  • Find out why there seem to be extra sessions created.
  • Add a default check for session initialization based on the outcome of the setLocation action.
  • Stop the spider if setLocation is not available for a given session pool.
  • Get Zyte API to fail with a proper status code for expired sessions.
  • Add a stat to track session expirations.
  • Pass tests (make web-poet-related deps optional).
  • Update the docs to cover everything implemented so far.
  • Provide retry policies that do not retry temporary/permanent download errors, and document the need to set one of them when enabling sessions. Maybe override that automatically in the add-on for the default policies.
  • In the documentation about checking sessions, cover how it may be necessary, when using automatic extraction, to also request a body field, browserHtml or httpResponseBody, and how to do that when using providers.
  • Add reference docs for @session_config.
  • Allow disabling session assignment per request.
  • Provide complete test coverage.


codecov Bot commented Apr 18, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 97.56%. Comparing base (a2284c8) to head (c3ed86f).
Report is 6 commits behind head on main.

Current head c3ed86f differs from pull request most recent head c01d368

Please upload reports for the commit c01d368 to get more accurate results.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #193      +/-   ##
==========================================
- Coverage   98.45%   97.56%   -0.90%     
==========================================
  Files          13       14       +1     
  Lines        1102     1476     +374     
  Branches        0      309     +309     
==========================================
+ Hits         1085     1440     +355     
+ Misses         17       15       -2     
- Partials        0       21      +21     
Files Coverage Δ
scrapy_zyte_api/__init__.py 100.00% <100.00%> (ø)
scrapy_zyte_api/_middlewares.py 97.65% <100.00%> (ø)
scrapy_zyte_api/_session.py 100.00% <100.00%> (ø)
scrapy_zyte_api/addon.py 98.07% <100.00%> (-1.93%) ⬇️
scrapy_zyte_api/utils.py 100.00% <100.00%> (ø)

... and 6 files with indirect coverage changes

Gallaecio (Contributor, Author) commented:

I have created a project based on https://github.com/zytedata/zyte-spider-templates-project and added the following spider to it:

from logging import getLogger

from scrapy import Request
from scrapy.exceptions import NotSupported
from scrapy.http.response import Response
from tenacity import RetryCallState, stop_after_attempt
from tenacity.stop import stop_base
from zyte_api import RequestError, RetryFactory

logger = getLogger(__name__)


class custom_throttling_stop(stop_base):

    def __call__(self, retry_state: "RetryCallState") -> bool:
        assert retry_state.outcome, "Unexpected empty outcome"
        exc = retry_state.outcome.exception()
        assert exc, "Unexpected empty exception"
        return (
            isinstance(exc, RequestError)
            and exc.status == 429
            and exc.parsed.data["title"] == "Session has expired"
        )


class CustomRetryFactory(RetryFactory):
    # Do not retry 520, let Scrapy deal with them (i.e. retry them with a
    # different session).
    temporary_download_error_stop = stop_after_attempt(1)

    # Work around a temporary issue: Zyte API reports expired sessions as
    # 429 errors, which should not be retried with the same session.
    throttling_stop = custom_throttling_stop()


SESSION_RETRY_POLICY = CustomRetryFactory().build()


class _SessionChecker:

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler)

    def __init__(self, crawler):
        params = crawler.settings["ZYTE_API_SESSION_PARAMS"]
        self.zip_code = params["actions"][0]["address"]["postalCode"]

    def check_session(self, request: Request, response: Response) -> bool:
        try:
            zip_code = response.css(".delivery-text + a > span > span::text").get()
        except NotSupported:  # Empty response.
            logger.debug(f"Empty response {response}.")
            return False
        if not zip_code:
            logger.debug(f"No ZIP code found in {response}.")
            return False
        if zip_code == self.zip_code:
            logger.debug(f"Found expected ZIP code {zip_code!r} in {response}.")
            return True
        logger.debug(
            f"Found unexpected ZIP code {zip_code!r} in {response} (expected "
            f"{self.zip_code!r})."
        )
        return False

from zyte_spider_templates import EcommerceSpider


class SessionEcommerceSpider(EcommerceSpider):
    name = "session_ecommerce"

    @classmethod
    def update_settings(cls, settings):
        super().update_settings(settings)
        settings["ZYTE_API_AUTOMAP_PARAMS"] = {"browserHtml": True}

        # DEBUG
        settings["ZYTE_API_LOG_REQUESTS"] = True
        settings["ZYTE_API_LOG_REQUESTS_TRUNCATE"] = 0

        # Settings needed for the session stuff.
        settings["ZYTE_API_SESSION_CHECKER"] = _SessionChecker
        settings["ZYTE_API_SESSION_PARAMS"] = {
            "browserHtml": True,
            "actions": [{"action": "setLocation", "address": {"postalCode": "94124"}}],
        }
        settings["COOKIES_ENABLED"] = False  # Sessions handle cookies.
        settings["ZYTE_API_RETRY_POLICY"] = SESSION_RETRY_POLICY  # Don’t retry bans.
        # Cannot validate extraction-only responses.
        settings["ZYTE_API_PROVIDER_PARAMS"] = {"browserHtml": True}

And executed it as follows:

scrapy crawl session_ecommerce -a "url=https://ecommerce.example/product-list" -a crawl_strategy=pagination_only

It seems to work well enough, although the retry limit can be exceeded, so recommending a higher RETRY_TIMES value in the docs might be worthwhile.
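For illustration, such a recommendation could look like this in a project's settings.py (the value below is an arbitrary example, not a tested recommendation):

```python
# settings.py

# Scrapy's built-in retry middleware gives up on a request after
# RETRY_TIMES retries (default: 2). A higher budget gives requests that
# hit an expired or banned session more chances to be retried with a
# fresh one.
RETRY_TIMES = 5  # arbitrary example value
```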


proway2 commented May 16, 2024

@Gallaecio Do you have a real job or stats from one? How many times did it get an invalid response? For what reasons were responses invalidated?


Gallaecio commented May 16, 2024

@Gallaecio Do you have a real job or stats from one? How many times did it get an invalid response? For what reasons were responses invalidated?

For a short crawl I performed just now:

'scrapy-zyte-api/processed': 21
'scrapy-zyte-api/status_codes/200': 18
'scrapy-zyte-api/status_codes/521': 3
'zyte-api-session/checks/passed': 18
'zyte-api-session/sessions': 11

That is, sessions were created 11 times: the default is to create 8, and 3 of those got a 521 and had to be replaced. Once a valid session was created, all (7) usages succeeded. Of course, it is a rather small sample.

In any case, I will now try to build some of @VMRuiz's ideas for a better location API into this PR (including per-domain, web-poet-like configurations).

Gallaecio (Contributor, Author) commented:

0638df7 is based on the ideas shared by @VMRuiz elsewhere.

It enables a location-specific approach with web-poet-based overrides:

from logging import getLogger
from typing import Any

from pydantic import BaseModel, Field
from pydantic.types import Json
from scrapy import Request, Spider
from scrapy.crawler import Crawler
from scrapy.exceptions import NotSupported
from scrapy.http.response import Response
from scrapy_spider_metadata import Args
from scrapy_zyte_api import SessionConfig, session_config
from tenacity import RetryCallState, stop_after_attempt
from tenacity.stop import stop_base
from zyte_api import RequestError, RetryFactory
from zyte_spider_templates import EcommerceSpider
from zyte_spider_templates.spiders.base import ARG_SETTING_PRIORITY
from zyte_spider_templates.spiders.ecommerce import EcommerceSpiderParams

logger = getLogger(__name__)


class custom_throttling_stop(stop_base):

    def __call__(self, retry_state: "RetryCallState") -> bool:
        assert retry_state.outcome, "Unexpected empty outcome"
        exc = retry_state.outcome.exception()
        assert exc, "Unexpected empty exception"
        return (
            isinstance(exc, RequestError)
            and exc.status == 429
            and exc.parsed.data["title"] == "Session has expired"
        )


class CustomRetryFactory(RetryFactory):
    # Do not retry 520, let Scrapy deal with them (i.e. retry them with a
    # different session).
    temporary_download_error_stop = stop_after_attempt(1)

    # Work around a temporary issue: Zyte API reports expired sessions as
    # 429 errors, which should not be retried with the same session.
    throttling_stop = custom_throttling_stop()


SESSION_RETRY_POLICY = CustomRetryFactory().build()


@session_config("ecommerce.example")
class EcommerceExampleLocationSessionConfig(SessionConfig):

    def check(self, response: Response, request: Request) -> bool:
        try:
            zip_code = response.css(".delivery-text + a > span > span::text").get()
        except NotSupported:  # Empty response.
            logger.debug(f"Empty response {response}.")
            return False
        if not zip_code:
            logger.debug(f"No ZIP code found in {response}.")
            return False
        expected_zip_code = self.location(request)["postalCode"]
        if zip_code == expected_zip_code:
            logger.debug(f"Found expected ZIP code {zip_code!r} in {response}.")
            return True
        logger.debug(
            f"Found unexpected ZIP code {zip_code!r} in {response} (expected "
            f"{expected_zip_code!r})."
        )
        return False


class LocationParam(BaseModel):
    location: Json[Any] = Field(default_factory=dict)


class LocationSpiderParams(LocationParam, EcommerceSpiderParams):
    pass


class LocationEcommerceSpider(EcommerceSpider, Args[LocationSpiderParams]):
    name = "location_ecommerce"

    @classmethod
    def from_crawler(cls, crawler: Crawler, *args, **kwargs) -> Spider:
        spider = super().from_crawler(crawler, *args, **kwargs)
        if spider.args.location:
            spider.settings.set(
                "ZYTE_API_SESSION_ENABLED",
                True,
                priority=ARG_SETTING_PRIORITY,
            )
            spider.settings.set(
                "ZYTE_API_SESSION_LOCATION",
                spider.args.location,
                priority=ARG_SETTING_PRIORITY,
            )
        return spider

    @classmethod
    def update_settings(cls, settings):
        super().update_settings(settings)
        settings["ZYTE_API_AUTOMAP_PARAMS"] = {"browserHtml": True}

        # DEBUG
        settings["ZYTE_API_LOG_REQUESTS"] = True
        settings["ZYTE_API_LOG_REQUESTS_TRUNCATE"] = 0

        # Settings needed for the session stuff.
        settings["ZYTE_API_SESSION_ENABLED"] = True
        settings["COOKIES_ENABLED"] = False  # Sessions handle cookies.
        settings["ZYTE_API_RETRY_POLICY"] = SESSION_RETRY_POLICY  # Don’t retry bans.
        # Cannot validate extraction-only responses.
        settings["ZYTE_API_PROVIDER_PARAMS"] = {"browserHtml": True}

But it also supports a non-location-specific approach, as well as a non-poet-like definition of session initialization parameters and a check function:

from logging import getLogger

from scrapy import Request
from scrapy.exceptions import NotSupported
from scrapy.http.response import Response
from tenacity import RetryCallState, stop_after_attempt
from tenacity.stop import stop_base
from zyte_api import RequestError, RetryFactory

logger = getLogger(__name__)


class custom_throttling_stop(stop_base):

    def __call__(self, retry_state: "RetryCallState") -> bool:
        assert retry_state.outcome, "Unexpected empty outcome"
        exc = retry_state.outcome.exception()
        assert exc, "Unexpected empty exception"
        return (
            isinstance(exc, RequestError)
            and exc.status == 429
            and exc.parsed.data["title"] == "Session has expired"
        )


class CustomRetryFactory(RetryFactory):
    # Do not retry 520, let Scrapy deal with them (i.e. retry them with a
    # different session).
    temporary_download_error_stop = stop_after_attempt(1)

    # Work around a temporary issue: Zyte API reports expired sessions as
    # 429 errors, which should not be retried with the same session.
    throttling_stop = custom_throttling_stop()


SESSION_RETRY_POLICY = CustomRetryFactory().build()


class _SessionChecker:

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler)

    def __init__(self, crawler):
        params = crawler.settings["ZYTE_API_SESSION_PARAMS"]
        self.zip_code = params["actions"][0]["address"]["postalCode"]

    def check(self, response: Response, request: Request) -> bool:
        try:
            zip_code = response.css(".delivery-text + a > span > span::text").get()
        except NotSupported:  # Empty response.
            logger.debug(f"Empty response {response}.")
            return False
        if not zip_code:
            logger.debug(f"No ZIP code found in {response}.")
            return False
        if zip_code == self.zip_code:
            logger.debug(f"Found expected ZIP code {zip_code!r} in {response}.")
            return True
        logger.debug(
            f"Found unexpected ZIP code {zip_code!r} in {response} (expected "
            f"{self.zip_code!r})."
        )
        return False

from zyte_spider_templates import EcommerceSpider


class SessionEcommerceSpider(EcommerceSpider):
    name = "session_ecommerce"

    @classmethod
    def update_settings(cls, settings):
        super().update_settings(settings)
        settings["ZYTE_API_AUTOMAP_PARAMS"] = {"browserHtml": True}

        # DEBUG
        settings["ZYTE_API_LOG_REQUESTS"] = True
        settings["ZYTE_API_LOG_REQUESTS_TRUNCATE"] = 0

        # Settings needed for the session stuff.
        settings["ZYTE_API_SESSION_ENABLED"] = True
        settings["ZYTE_API_SESSION_CHECKER"] = _SessionChecker
        settings["ZYTE_API_SESSION_PARAMS"] = {
            "browserHtml": True,
            "actions": [{"action": "setLocation", "address": {"postalCode": "94124"}}],
        }
        settings["COOKIES_ENABLED"] = False  # Sessions handle cookies.
        settings["ZYTE_API_RETRY_POLICY"] = SESSION_RETRY_POLICY  # Don’t retry bans.
        # Cannot validate extraction-only responses.
        settings["ZYTE_API_PROVIDER_PARAMS"] = {"browserHtml": True}

Also, multiple session pools are now supported, and by default each domain has its own pool. The stats now show check passes and failures separately, depending on whether they happened during session initialization or during regular response checks, and they are split per session pool:

 'zyte-api-session/checks/ecommerce.example/passed': 7,
 'zyte-api-session/init/ecommerce.example/failed': 2,
 'zyte-api-session/init/ecommerce.example/passed': 11}
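As a side note, these flat stat keys are easy to regroup per pool. A minimal sketch, assuming the stats are available as a plain dict and that keys follow the zyte-api-session/&lt;phase&gt;/&lt;pool&gt;/&lt;outcome&gt; pattern shown above (`summarize_session_stats` is a hypothetical helper, not part of the plugin):

```python
from collections import defaultdict


def summarize_session_stats(stats: dict) -> dict:
    """Group zyte-api-session/* stat keys by pool and by phase/outcome."""
    summary: dict = defaultdict(dict)
    for key, value in stats.items():
        parts = key.split("/")
        # Expected shape: zyte-api-session/<phase>/<pool>/<outcome>
        if parts[0] != "zyte-api-session" or len(parts) != 4:
            continue
        _, phase, pool, outcome = parts
        summary[pool][f"{phase}/{outcome}"] = value
    return dict(summary)


stats = {
    "zyte-api-session/checks/ecommerce.example/passed": 7,
    "zyte-api-session/init/ecommerce.example/failed": 2,
    "zyte-api-session/init/ecommerce.example/passed": 11,
}
print(summarize_session_stats(stats))
```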

Now I need to figure out why the number of passes for session initializations is higher than the number of regular check passes. It feels like more sessions are being initialized than necessary.

I also want to implement a basic default check for setLocation based on the action outcome, i.e. if the action failed, consider the initialization failed by default. And maybe try to find out whether setLocation is unsupported on the target website, and stop the spider in that scenario.
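A rough sketch of what that default check could look like, assuming the decoded Zyte API response lists action results under an `actions` key and that a failed action carries an `error` field (both field names and the `set_location_succeeded` helper are assumptions for this sketch, not a confirmed contract):

```python
def set_location_succeeded(raw_api_response: dict) -> bool:
    """Return False if any setLocation action reported an error.

    ``raw_api_response`` stands for the decoded Zyte API response body;
    the ``actions``/``error`` field names are assumptions here.
    """
    for action in raw_api_response.get("actions", []):
        if action.get("action") == "setLocation" and action.get("error"):
            return False
    return True


# Hypothetical action results:
assert set_location_succeeded(
    {"actions": [{"action": "setLocation", "elapsedTime": 1.2}]}
)
assert not set_location_succeeded(
    {"actions": [{"action": "setLocation", "error": "Action not supported"}]}
)
```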

@Gallaecio Gallaecio marked this pull request as ready for review June 5, 2024 14:03
@Gallaecio Gallaecio requested review from BurnzZ, kmike and wRAR June 5, 2024 14:04
@Gallaecio Gallaecio merged commit f582076 into scrapy-plugins:main Jun 19, 2024