This document explains how to run the tests for the semantic_corpus project, which uses live APIs and real downloads.
These tests don't require network access and run quickly:
pytest tests/test_corpus_core.py tests/test_metadata_processing.py -m "not live_api"
These tests interact with real repositories and download actual papers:
# Run all live API tests
pytest -m "live_api"
# Run only Europe PMC tests
pytest tests/test_repository_interface.py::TestEuropePMCRepository -m "live_api"
# Run only arXiv tests
pytest tests/test_repository_interface.py::TestArxivRepository -m "live_api"
These tests demonstrate complete workflows with real data:
pytest tests/test_integration_live.py
These tests verify the command-line interface with real repositories:
pytest tests/test_cli.py -m "live_api"
# Run all tests including live API tests
pytest
# Run with verbose output
pytest -v
# Run with coverage
pytest --cov=semantic_corpus
# Skip live API tests for faster execution
pytest -m "not live_api"
- SEMANTIC_CORPUS_TEST_TIMEOUT: Set timeout for API calls (default: 30 seconds)
- SEMANTIC_CORPUS_TEST_LIMIT: Set maximum number of papers to download (default: 3)
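As a minimal sketch of how a test helper might read the two environment variables above, the snippet below falls back to the documented defaults (30 seconds, 3 papers) when a variable is unset or invalid. The function names `read_int_setting` and `load_test_settings` are hypothetical and are not taken from the semantic_corpus code base; the project may parse these variables differently.

```python
import os

# Defaults as documented above.
DEFAULT_TIMEOUT_SECONDS = 30
DEFAULT_PAPER_LIMIT = 3


def read_int_setting(name, default, environ=None):
    """Return a positive integer from the environment, or the default.

    Hypothetical helper: unset, empty, non-numeric, or non-positive
    values all fall back to the default.
    """
    environ = os.environ if environ is None else environ
    raw = environ.get(name, "").strip()
    if not raw:
        return default
    try:
        value = int(raw)
    except ValueError:
        return default
    return value if value > 0 else default


def load_test_settings(environ=None):
    """Collect both test settings into one dictionary (assumed layout)."""
    return {
        "timeout": read_int_setting(
            "SEMANTIC_CORPUS_TEST_TIMEOUT", DEFAULT_TIMEOUT_SECONDS, environ
        ),
        "limit": read_int_setting(
            "SEMANTIC_CORPUS_TEST_LIMIT", DEFAULT_PAPER_LIMIT, environ
        ),
    }
```

On a POSIX shell, the variables can be set for a single run by prefixing the command, for example `SEMANTIC_CORPUS_TEST_TIMEOUT=60 SEMANTIC_CORPUS_TEST_LIMIT=2 pytest -m "live_api"`.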
- @pytest.mark.live_api: Tests that use live APIs
- @pytest.mark.network: Tests that require network access
- @pytest.mark.integration: Integration tests
- @pytest.mark.slow: Tests that take longer to run
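Custom markers like these are usually registered so that pytest does not warn about unknown marks. The sketch below shows one common way to do that from a `conftest.py`, using pytest's `pytest_configure` hook and `config.addinivalue_line`. Whether semantic_corpus registers its markers here or in a configuration file is an assumption; the `MARKERS` dictionary is illustrative only.

```python
# conftest.py (assumed location; the project may register markers elsewhere)

# Marker names and descriptions taken from the list above.
MARKERS = {
    "live_api": "tests that use live APIs",
    "network": "tests that require network access",
    "integration": "integration tests",
    "slow": "tests that take longer to run",
}


def pytest_configure(config):
    """Register each custom marker with pytest at startup."""
    for name, description in MARKERS.items():
        config.addinivalue_line("markers", f"{name}: {description}")
```

Once registered, markers can be combined in a `-m` expression, for example `pytest -m "integration and not slow"`.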
- Search Tests: Query real repositories (Europe PMC, arXiv) for papers
- Metadata Tests: Retrieve actual paper metadata
- Download Tests: Download real papers (XML, PDF) to temporary directories
- Full Workflow: Search → Download → Add to Corpus → Search in Corpus
- Statistics: Test corpus statistics with real downloaded papers
- Error Handling: Test error scenarios with live APIs
- Search Command: Test semantic_corpus search with real repositories
- Download Command: Test semantic_corpus download with real papers
- Config File: Test configuration file support
- ✅ Successfully search for papers using real queries
- ✅ Download actual papers (XML/PDF files)
- ✅ Verify downloaded files exist and have content
- ✅ Add real papers to corpus and retrieve them
- ✅ Perform searches within the corpus
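The "files exist and have content" check in the list above can be sketched as a small helper. This is an illustration under stated assumptions, not the project's actual assertion code: the name `assert_downloaded` and the `min_bytes` parameter are hypothetical.

```python
from pathlib import Path


def assert_downloaded(path, min_bytes=1):
    """Assert that a downloaded file exists and is not empty.

    Hypothetical helper: returns the file size in bytes so a test
    can also log or compare it.
    """
    path = Path(path)
    assert path.is_file(), f"expected a downloaded file at {path}"
    size = path.stat().st_size
    assert size >= min_bytes, f"{path} is only {size} bytes"
    return size
```

In a test, a helper like this would typically be called on each path returned by a download step, with the files written under pytest's `tmp_path` fixture so nothing persists between runs.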
- Query: "climate change" (reliable, returns results)
- Limit: 2-3 papers per test (fast, sufficient for testing)
- Formats: XML for Europe PMC, PDF for arXiv
- Categories: cs.AI for arXiv (reliable results)
If tests fail due to network issues:
# Run without live API tests
pytest -m "not live_api"
If you hit rate limits:
# Stop at the first failure and use short tracebacks (these flags do not add delays between requests)
pytest --tb=short -x
Install required packages:
pip install -e ".[dev]"
Successful live API tests will show:
- Real paper titles and abstracts
- Downloaded file paths and sizes
- Corpus statistics with actual data
- Search results from real repositories
This demonstrates that the semantic_corpus system works with real scientific data and can be used for actual research workflows.