UN-2807 [FEAT] Add user_data variable support for Prompt Studio#1544
Conversation
Summary by CodeRabbit
WalkthroughThreads an optional Changes
Sequence Diagram(s)sequenceDiagram
autonumber
participant Client
participant API as API v2
participant DH as DeploymentHelper
participant WH as WorkflowHelper
participant Q as Async Queue
participant FT as FileExecutionTasks
participant SC as SourceConnector
participant EFH as ExecutionFileHandler
Client->>API: POST /execute { files, user_data? }
API->>DH: execute_workflow(..., user_data)
DH->>WH: execute_workflow_async(..., user_data)
WH-->>Q: enqueue task with user_data
Q-->>FT: run batch with FileData(user_data)
FT->>SC: add_file_to_volume(..., user_data)
SC->>EFH: add_metadata_to_volume(..., user_data)
EFH-->>SC: write metadata (USER_DATA if provided)
SC-->>FT: ok
FT-->>Client: results via API callback/response
sequenceDiagram
autonumber
participant Client
participant PS as Prompt Service
participant VRS as VariableReplacementService
participant VRH as VariableReplacementHelper
Client->>PS: Request answer_prompt { ..., user_data? }
PS->>VRS: replace_variables_in_prompt(..., user_data)
VRS->>VRS: detect variable types (USER_DATA first)
alt USER_DATA variable
VRS->>VRH: replace_user_data_variable(prompt, "{{user_data.path}}", user_data)
VRH-->>VRS: prompt text with substitution
else DYNAMIC/STATIC
VRS->>VRH: existing replacement paths
end
VRS-->>PS: replaced prompt
PS-->>Client: response
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Pre-merge checks and finishing touches❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✨ Finishing touches
🧪 Generate unit tests
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 6
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (3)
prompt-service/src/unstract/prompt_service/controllers/answer_prompt.py (1)
656-659: Bug: possible UnboundLocalError for challenge_llm (and context)If
enable_challengeis true but no challenge plugin is available (or fails early),challenge_llmis undefined infinally. Same forcontextwhen retrieval didn’t run. Initialize both.Apply this diff:
@@ - for output in prompts: # type:ignore + challenge_llm = None + context: list[str] = [] + for output in prompts: # type:ignore @@ - try: + try: answer = "NA" @@ - answer, context = RetrievalService.perform_retrieval( + answer, context = RetrievalService.perform_retrieval( @@ - finally: - challenge_metrics = ( - {f"{challenge_llm.get_usage_reason()}_llm": challenge_llm.get_metrics()} - if enable_challenge and challenge_llm - else {} - ) + finally: + challenge_metrics = {} + if enable_challenge and challenge_llm is not None: + challenge_metrics = { + f"{challenge_llm.get_usage_reason()}_llm": challenge_llm.get_metrics() + }backend/api_v2/api_deployment_views.py (1)
78-91: Critical: file-history/execution dedupe ignores user_data — cached API results can be returned for different user_dataVerified: EXECUTION_EXCLUDED_PARAMS includes "user_data" (backend/workflow_manager/workflow_v2/workflow_helper.py) and file-history keys/lookups are content-only (file_execution_tasks.py calls FileHistoryHelper.get_file_history(..., cache_key=content_hash) and file_history_helper.create_file_history persists cache_key=file_hash.file_hash).
Action: either include a stable hash of user_data in the file-history/cache key (combine content-hash + hash(json.dumps(user_data, sort_keys=True))) or disable caching when user_data is present.
Update points: backend/workflow_manager/workflow_v2/workflow_helper.py (EXECUTION_EXCLUDED_PARAMS / filtered kwargs), backend/workflow_manager/workflow_v2/file_execution_tasks.py (get_file_history call), backend/workflow_manager/workflow_v2/file_history_helper.py (create_file_history / persisted cache_key).backend/api_v2/serializers.py (1)
229-259: Harden user_data validation and prevent cache leakage when user_data is present
- Apply the serializer hardening from your diff: import json, add MAX_USER_DATA_BYTES (e.g. 65536), require dict, ensure JSON-serializable and reject oversized payloads — backend/api_v2/serializers.py (validate_user_data).
- Keep the validate() change that sets data["use_file_history"] = False when user_data is present — this prevents reuse of FileHistory-based results (FileHistoryHelper / file_history records).
- Critical: api-results caching is independent of FileHistory. ResultCacheUtils uses key "api_results:{workflow_id}:{execution_id}" (backend/workflow_manager/endpoint_v2/result_cache_utils.py) and DOES NOT include user_data — this can leak results if execution_id is reused. Fix (pick one):
- Skip reading/writing api_results when workflow_execution.user_data is present, or
- Include a stable hash of user_data in the api_results cache key.
Minimal, targeted change: wrap the call to ResultCacheUtils.update_api_results/get_api_results in backend/workflow_manager/workflow_v2/file_execution_tasks.py::_build_final_result (where update_api_results is invoked ~lines 1133–1137) with a guard that skips caching when workflow_execution.user_data exists.
🧹 Nitpick comments (20)
backend/prompt_studio/prompt_studio_registry_v2/prompt_studio_registry_helper.py (2)
123-131: Prefer parameterized logging; drop trailing space and include traceback for debuggability.Use logger’s lazy formatting and capture the stack trace from the except block. Current f-string also has a trailing space after the exception message.
Apply:
- logger.warning( - f"Error while fetching for prompt registry ID {prompt_registry_id}: {e} " - ) + logger.warning( + "Error fetching tool for prompt registry ID %s", + prompt_registry_id, + exc_info=True, + )
215-220: Chain the original IntegrityError and use logger.exception with context.Preserve the root cause via exception chaining and let logging capture the traceback. Also add tool_id for quick correlation.
Apply:
- logger.error( - f"Integrity Error - Error occurred while exporting custom tool : {error}" - ) - raise ToolSaveError + logger.exception( + "IntegrityError while exporting custom tool %s", + custom_tool.tool_id, + ) + raise ToolSaveError from errorbackend/workflow_manager/workflow_v2/execution.py (1)
54-56: Code style improvement: Use class name for static method calls.Calling static methods using the class name is more common and can be called as either a class method or an instance method. The change from
self.convert_tool_instance_model_to_data_class()toWorkflowExecutionServiceHelper.convert_tool_instance_model_to_data_class()improves code clarity by making it explicit that this is a static method call.backend/workflow_manager/workflow_v2/models/execution.py (1)
121-125: Prefer keeping feature PRs focused (cosmetic change unrelated to user_data).This drive-by formatting tweak is harmless, but consider moving non-functional comment cleanups to a separate “lint/comments” PR to reduce noise in feature diffs.
tools/structure/src/constants.py (1)
83-83: USER_DATA constant addition — OK; consider de-duplicating nearby keys.
Addition is consistent with the platform keys. Noting this file contains duplicate constants (e.g., NAME, OUTPUTS, TOOL_ID defined twice). Consider consolidating to avoid future confusion.tools/structure/src/main.py (1)
223-224: Normalize user_data to a dict to avoid None/invalid types reaching downstream.
If the exec metadata contains user_data=None or a non-object, downstream substitution may break. Normalize defensively.Apply this diff:
- user_data = self.get_exec_metadata.get(SettingsKeys.USER_DATA, {}) - payload["user_data"] = user_data + user_data = self.get_exec_metadata.get(SettingsKeys.USER_DATA) + if not isinstance(user_data, dict): + user_data = {} # Normalize None/non-dicts + payload["user_data"] = user_dataOptional: enforce a size cap at the API layer; tool-side normalization here is sufficient.
backend/tool_instance_v2/views.py (1)
121-123: Minor: centralize error message in the exception classRaising
DuplicateDatawith a composed string is fine, but consider moving the message into the exception to satisfy TRY003 and keep messages DRY.backend/api_v2/api_deployment_views.py (1)
73-74: Bound user_data size before propagationEven with serializer validation, consider guarding here (or in serializer) with a conservative max payload size to prevent oversized prompt expansions and metadata bloat.
backend/api_v2/serializers.py (1)
213-215: Doc clarity: show nested example and limitationsAdd a nested example (e.g.,
{{user_data.address.city}}) and clarify that array indexing isn’t supported in dot-notation.prompt-service/src/unstract/prompt_service/controllers/answer_prompt.py (1)
56-57: Guard user_data before useValidate type and cap size here (in addition to API layer) to prevent oversized expansions if this endpoint is called directly.
Apply this minimal guard:
@@ - user_data: dict[str, Any] = payload.get(PSKeys.USER_DATA, {}) + user_data_raw = payload.get(PSKeys.USER_DATA, {}) + if not isinstance(user_data_raw, dict): + user_data: dict[str, Any] = {} + else: + try: + import json + b = len(json.dumps(user_data_raw, separators=(",", ":"), ensure_ascii=False).encode("utf-8")) + user_data = user_data_raw if b <= 65536 else {} + except Exception: + user_data = {}backend/workflow_manager/endpoint_v2/source.py (1)
924-925: Add user_data param: looks good; please document it in the docstring.Signature change is fine. Add a brief docstring entry so downstream readers know it’s persisted to METADATA.json.
Args: workflow_file_execution: WorkflowFileExecution model tags (list[str]): Tag names associated with the workflow execution. file_hash: FileHash model llm_profile_id (str, optional): LLM profile ID for overriding tool settings. + user_data (dict[str, Any] | None): Per-request user metadata to persist in METADATA.json.backend/workflow_manager/workflow_v2/file_execution_tasks.py (1)
754-755: Forwarding user_data into volume metadata: LGTM; consider size guard upstream.Looks correct. As a future hardening, consider enforcing a max size for user_data to avoid bloating per-file METADATA.json.
prompt-service/src/unstract/prompt_service/constants.py (1)
171-176: Remove unused noqa and keep the URL regex multiline.Ruff flags the E501 noqa as unused; safe to drop.
DYNAMIC_VARIABLE_URL_REGEX = ( r"(?i)\b((?:https?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'\".,<>?«»" - "'']))" - ) # noqa: E501 + "'']))" + )Optional: if USER_DATA detection runs on pre-tokenized variables, you may tighten the pattern with start/end anchors to avoid accidental matches in free text.
unstract/workflow-execution/src/unstract/workflow_execution/execution_file_handler.py (2)
105-106: Signature addition: LGTM; update docstring to mention user_data.The param is present but undocumented in the Parameters section.
Parameters: input_file_path (str): The path of the input file. file_execution_id (str): Unique execution id for the file. source_hash (str): The hash value of the source/input file. tags (list[str]): Tag names associated with the workflow execution. llm_profile_id (str, optional): LLM profile ID for overriding tool settings. + user_data (dict[str, Any] | None): User-provided JSON metadata to store under MetaDataKey.USER_DATA.
145-148: Make user_data write robust to non-JSON types (defensive).Serializer paths should provide JSON, but a defensive guard prevents unexpected TypeError during json_dump.
- # Add user_data to metadata if provided - if user_data: - content[MetaDataKey.USER_DATA] = user_data + # Add user_data to metadata if provided (validate JSON-serializability) + if user_data is not None: + try: + json.dumps(user_data) # validate + content[MetaDataKey.USER_DATA] = user_data + except TypeError: + logger.warning( + "user_data contains non-JSON-serializable types; coercing with default=str" + ) + content[MetaDataKey.USER_DATA] = json.loads( + json.dumps(user_data, default=str) + )prompt-service/src/unstract/prompt_service/services/variable_replacement.py (2)
37-37: Consider using explicitT | Nonetype annotationsThe static analysis hints suggest using explicit
T | Noneinstead of implicitOptional. While functionally equivalent, the modern type annotation style improves code clarity.Apply this diff to use explicit union types:
- user_data: dict[str, Any] = None, + user_data: dict[str, Any] | None = None,- prompt_text: str, variable_map: dict[str, Any], user_data: dict[str, Any] = None + prompt_text: str, variable_map: dict[str, Any], user_data: dict[str, Any] | None = NoneAlso applies to: 101-101
124-130: Consider handling the case whereuser_dataisNonefor USER_DATA variablesWhen a USER_DATA variable is detected but
user_dataisNone, the variable remains unreplaced in the prompt. This could lead to prompts with visible placeholder variables like{{user_data.name}}.Consider logging a warning or replacing with an empty string when USER_DATA variables are found but no user_data is provided:
elif variable_type == VariableType.USER_DATA: if user_data: prompt_text = VariableReplacementHelper.replace_user_data_variable( prompt=prompt_text, variable=variable_data, user_data=user_data, ) + else: + app.logger.warning( + f"USER_DATA variable '{variable}' found but no user_data provided. " + "Variable will remain unreplaced." + )prompt-service/src/unstract/prompt_service/helpers/variable_replacement.py (2)
64-75: Potential for regex compilation optimizationThe regex patterns are compiled on every call to
identify_variable_type. Since these patterns are constant, consider compiling them once at module level or usinglru_cachefor better performance, especially when processing prompts with many variables.Consider caching the compiled patterns:
+from functools import lru_cache + +@lru_cache(maxsize=2) +def _get_compiled_patterns(): + return { + 'user_data': re.compile(VariableConstants.USER_DATA_VARIABLE_REGEX), + 'dynamic': re.compile(VariableConstants.DYNAMIC_VARIABLE_URL_REGEX) + } + @staticmethod def identify_variable_type(variable: str) -> VariableType: variable_type: VariableType - # Check for user_data variable type first - user_data_pattern = re.compile(VariableConstants.USER_DATA_VARIABLE_REGEX) - if re.findall(user_data_pattern, variable): + patterns = _get_compiled_patterns() + + # Check for user_data variable type first + if patterns['user_data'].findall(variable): variable_type = VariableType.USER_DATA else: # Check for dynamic variable type - dynamic_pattern = re.compile(VariableConstants.DYNAMIC_VARIABLE_URL_REGEX) - if re.findall(dynamic_pattern, variable): + if patterns['dynamic'].findall(variable): variable_type = VariableType.DYNAMIC else: variable_type = VariableType.STATIC return variable_type
132-140: Enhance error handling for nested path traversalThe current implementation catches
KeyErrorandTypeErrorbut doesn't distinguish between missing keys and attempting to access properties on non-dict values (e.g., trying to accessuser_data.name.citywhennameis a string).Consider providing more specific error messages:
# Navigate through the nested dictionary try: value = user_data for part in path_parts: - value = value[part] - except (KeyError, TypeError): + if not isinstance(value, dict): + app.logger.warning( + f"Cannot traverse path {path_str}: '{part}' is not accessible on non-dict type {type(value).__name__}" + ) + return prompt + value = value[part] + except KeyError as e: app.logger.warning( - f"Path {path_str} not found in user_data. Unable to replace variable {variable}" + f"Key '{e.args[0]}' not found in path {path_str}. Unable to replace variable {variable}" ) return promptbackend/workflow_manager/workflow_v2/workflow_helper.py (1)
689-702: user_data validated as JSON object in API — add size/schema limits if requiredbackend/api_v2/serializers.py defines user_data = JSONField(required=False, allow_null=True) and a validate_user_data that ensures the value is a dict. No explicit size limits or JSON schema validation were found; add payload size limits or schema validation here if you need to constrain shape/size.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
Cache: Disabled due to Reviews > Disable Cache setting
Knowledge base: Disabled due to Reviews -> Disable Knowledge Base setting
📒 Files selected for processing (24)
backend/api_v2/api_deployment_views.py(2 hunks)backend/api_v2/constants.py(1 hunks)backend/api_v2/deployment_helper.py(2 hunks)backend/api_v2/serializers.py(3 hunks)backend/prompt_studio/prompt_studio_registry_v2/prompt_studio_registry_helper.py(2 hunks)backend/pyproject.toml(1 hunks)backend/tool_instance_v2/views.py(1 hunks)backend/workflow_manager/endpoint_v2/source.py(2 hunks)backend/workflow_manager/workflow_v2/dto.py(1 hunks)backend/workflow_manager/workflow_v2/execution.py(1 hunks)backend/workflow_manager/workflow_v2/file_execution_tasks.py(2 hunks)backend/workflow_manager/workflow_v2/models/execution.py(1 hunks)backend/workflow_manager/workflow_v2/workflow_helper.py(8 hunks)platform-service/pyproject.toml(1 hunks)prompt-service/pyproject.toml(1 hunks)prompt-service/src/unstract/prompt_service/constants.py(3 hunks)prompt-service/src/unstract/prompt_service/controllers/answer_prompt.py(2 hunks)prompt-service/src/unstract/prompt_service/helpers/variable_replacement.py(2 hunks)prompt-service/src/unstract/prompt_service/services/variable_replacement.py(4 hunks)tools/structure/requirements.txt(1 hunks)tools/structure/src/constants.py(1 hunks)tools/structure/src/main.py(1 hunks)unstract/workflow-execution/src/unstract/workflow_execution/constants.py(1 hunks)unstract/workflow-execution/src/unstract/workflow_execution/execution_file_handler.py(2 hunks)
🧰 Additional context used
🪛 Ruff (0.12.2)
backend/api_v2/serializers.py
256-256: Avoid specifying long messages outside the exception class
(TRY003)
prompt-service/src/unstract/prompt_service/services/variable_replacement.py
37-37: PEP 484 prohibits implicit Optional
Convert to T | None
(RUF013)
101-101: PEP 484 prohibits implicit Optional
Convert to T | None
(RUF013)
prompt-service/src/unstract/prompt_service/constants.py
174-174: Unused noqa directive (non-enabled: E501)
Remove unused noqa directive
(RUF100)
🔇 Additional comments (11)
backend/workflow_manager/workflow_v2/models/execution.py (1)
121-125: LGTM — confirm no migration createdComment-only change; no functional impact. Verification here failed: manage.py not found. Run from the repository root and paste output:
python manage.py makemigrations --check --dry-run
(Or run via your environment: e.g.
poetry run python manage.py ...,pipenv run python ..., ordocker-compose run --rm web python ....)prompt-service/pyproject.toml (1)
20-20: Azure extra bumped to 0.77.2 — OK.
Matches the cross-repo version; no issues spotted here.backend/pyproject.toml (1)
40-40: Backend SDK to 0.77.2 — OK.
Version parity maintained with other services.platform-service/pyproject.toml (1)
18-18: Platform-service SDK to 0.77.2 — OK.
Multi-cloud extras preserved; looks good.backend/workflow_manager/endpoint_v2/source.py (1)
966-967: Pass-through to metadata: LGTM.The user_data forwarding is correct and preserves backward compatibility with llm_profile_id.
backend/api_v2/deployment_helper.py (1)
238-239: Async propagation: LGTM.user_data is correctly forwarded to execute_workflow_async.
prompt-service/src/unstract/prompt_service/constants.py (1)
70-71: New USER_DATA constants/enums: LGTM.Also applies to: 151-152
prompt-service/src/unstract/prompt_service/helpers/variable_replacement.py (1)
102-152: Good implementation of dot-notation traversalThe implementation correctly handles nested dictionary access using dot notation (e.g.,
user_data.address.city) and properly integrates with existing helper methods for value formatting and replacement. The error handling ensures the prompt remains unchanged when paths are invalid.backend/workflow_manager/workflow_v2/workflow_helper.py (3)
67-68: Constant naming conventionThe constant name
EXECUTION_EXCLUDED_PARAMSfollows Python naming conventions correctly. Good addition of"user_data"to the exclusion list to maintain separation between workflow execution parameters and user-provided data.
154-154: Consistent user_data propagation throughout the workflowThe
user_dataparameter is correctly threaded through all the workflow execution paths:
- From initial entry points (
process_input_files,run_workflow,execute_workflow_async)- Through FileData objects for batch processing
- To the Celery task execution
- Finally to the prompt service
This ensures user context is available throughout the processing pipeline.
Also applies to: 206-206, 273-273, 325-325, 444-444, 484-484, 702-702
669-679: Clean parameter filtering for workflow executionGood implementation of filtering out excluded parameters before passing to
create_workflow_execution. This maintains backward compatibility while preventing user_data from being persisted in execution records.
There was a problem hiding this comment.
LGTM on the API deployment side of things. But, there are some changes needed for the user_data support from Prompt Studio, during the prompt development phase. Please add it as a TODO to handle in subsequent PRs. It is critical, because only then this support would be from the Prompt Studio side.
True, We need to introduce this in Prompt Studio too. I'll add a TODO or ticket to keep track of this. |
…ack/unstract into feature/user-data-variable-support
|
|
There was a problem hiding this comment.
Actionable comments posted: 0
🧹 Nitpick comments (3)
prompt-service/src/unstract/prompt_service/services/variable_replacement.py (2)
37-37: Fix type hint to follow modern Python standards.The preferred way to denote an optional argument in Python 3.10 and up is
Type | Nonerather than implicit optional types.Apply this diff to use the modern type hint syntax:
- user_data: dict[str, Any] = None, + user_data: dict[str, Any] | None = None,
101-101: Fix type hint to follow modern Python standards.The preferred way to denote an optional argument in Python 3.10 and up is
Type | Nonerather than implicit optional types.Apply this diff to use the modern type hint syntax:
- prompt_text: str, variable_map: dict[str, Any], user_data: dict[str, Any] = None + prompt_text: str, variable_map: dict[str, Any], user_data: dict[str, Any] | None = Noneprompt-service/src/unstract/prompt_service/helpers/variable_replacement.py (1)
138-139: Consider using logging.exception for better error context.The static analysis tool suggests using
logging.exceptioninstead oflogging.errorwhen handling exceptions, as it automatically includes the exception traceback.Apply this diff to improve error logging:
- app.logger.error(error_msg) - raise ValueError(error_msg) from e + app.logger.exception(error_msg) + raise ValueError(error_msg) from e
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
Cache: Disabled due to Reviews > Disable Cache setting
Knowledge base: Disabled due to Reviews -> Disable Knowledge Base setting
⛔ Files ignored due to path filters (3)
backend/uv.lockis excluded by!**/*.lockplatform-service/uv.lockis excluded by!**/*.lockprompt-service/uv.lockis excluded by!**/*.lock
📒 Files selected for processing (3)
backend/api_v2/deployment_helper.py(3 hunks)prompt-service/src/unstract/prompt_service/helpers/variable_replacement.py(2 hunks)prompt-service/src/unstract/prompt_service/services/variable_replacement.py(4 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- backend/api_v2/deployment_helper.py
🧰 Additional context used
🪛 Ruff (0.12.2)
prompt-service/src/unstract/prompt_service/helpers/variable_replacement.py
138-138: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
prompt-service/src/unstract/prompt_service/services/variable_replacement.py
37-37: PEP 484 prohibits implicit Optional
Convert to T | None
(RUF013)
101-101: PEP 484 prohibits implicit Optional
Convert to T | None
(RUF013)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: build
🔇 Additional comments (4)
prompt-service/src/unstract/prompt_service/services/variable_replacement.py (2)
67-81: LGTM! User data parameter correctly threaded through both code paths.The implementation correctly propagates
user_datathrough both the normal execution path (whenvariable_mapexists) and the fallback path (for structured tools and APIs without a variable map). This ensures consistent behavior regardless of the execution context.
117-129: LGTM! Clean implementation of user data variable handling.The logic correctly:
- Converts the previous if-elif chain to handle the new USER_DATA variable type
- Only processes user data variables when
user_datais provided (guards against None)- Maintains the existing flow for STATIC and DYNAMIC variables
The elif structure ensures only one variable type is processed per iteration, which is appropriate behavior.
prompt-service/src/unstract/prompt_service/helpers/variable_replacement.py (2)
64-75: LGTM! Proper variable type identification priority.The reordered logic correctly prioritizes USER_DATA detection first, then DYNAMIC, then defaults to STATIC. This ordering ensures that user data variables with dot notation (e.g.,
{{user_data.name}}) are properly identified before being misclassified as dynamic variables.
102-152: LGTM! Robust user data variable replacement implementation.The implementation provides comprehensive error handling and follows the established patterns:
- Validates user_data presence and variable format
- Implements proper dot-notation traversal for nested JSON access
- Uses existing helper methods (
handle_json_and_str_types,replace_generic_string_value)- Provides clear error messages for debugging
There was a problem hiding this comment.
Actionable comments posted: 0
🧹 Nitpick comments (1)
backend/sample.env (1)
81-84: Consider pinning by digest to harden supply chain.Tags can be retagged; prefer content-addressable refs (image@sha256:...) in STRUCTURE_TOOL_IMAGE_URL and keep TAG only for display. Non-blocker.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
Cache: Disabled due to Reviews > Disable Cache setting
Knowledge base: Disabled due to Reviews -> Disable Knowledge Base setting
📒 Files selected for processing (2)
backend/sample.env(1 hunks)tools/structure/src/config/properties.json(1 hunks)
✅ Files skipped from review due to trivial changes (1)
- tools/structure/src/config/properties.json
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: build
🔇 Additional comments (1)
backend/sample.env (1)
81-84: Keep STRUCTURE_TOOL_IMAGE_URL and STRUCTURE_TOOL_IMAGE_TAG in lockstep — manual verification requiredAutomated checks here failed (rg returned line-prefixed values and a follow-up script errored "conditional binary operator expected"). Confirm backend/sample.env has identical tag values and that the image/tag exists and matches any properties.json .toolVersion.
Verify locally with these commands:
Extract & compare tags:
tag_url=$(sed -nE 's/^STRUCTURE_TOOL_IMAGE_URL\s*=\s*".:([0-9]+.[0-9]+.[0-9]+)"./\1/p' backend/sample.env)
tag_var=$(sed -nE 's/^STRUCTURE_TOOL_IMAGE_TAG\s*=\s*"([0-9]+.[0-9]+.[0-9]+)".*/\1/p' backend/sample.env)
echo "url:$tag_url var:$tag_var"; [ "$tag_url" = "$tag_var" ] && echo "MATCH" || echo "MISMATCH: $tag_url vs $tag_var"Check properties.json (if present):
find . -name properties.json -exec sh -c 'jq -r .toolVersion {} 2>/dev/null || echo "{}: no toolVersion"' ;Confirm Docker Hub tag exists:
curl -sf -o /dev/null "https://hub.docker.com/v2/repositories/unstract/tool-structure/tags/${tag_var:-0.0.87}/" && echo "docker tag exists" || echo "docker tag missing"If any step reports a mismatch or missing tag, update backend/sample.env (or properties.json) so both sources match and point to a published Docker tag.
…rnal APIs (#1494) * UN-2470 [MISC] Remove Django dependency from Celery workers This commit introduces a new worker architecture that decouples Celery workers from Django where possible, enabling support for gevent/eventlet pool types and reducing worker startup overhead. Key changes: - Created separate worker modules (api-deployment, callback, file_processing, general) - Added internal API endpoints for worker communication - Implemented Django-free task execution where appropriate - Added shared utilities and client facades - Updated container configurations for new worker architecture 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <noreply@anthropic.com> * Fix pre-commit issues: file permissions and ruff errors Setup the docker for new workers - Add executable permissions to worker entrypoint files - Fix import order in namespace package __init__.py - Remove unused variable api_status in general worker - Address ruff E402 and F841 errors 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <noreply@anthropic.com> * refactoreed, Dockerfiles,fixes * flexibility on celery run commands * added debug logs * handled filehistory for API * cleanup * cleanup * cloud plugin structure * minor changes in import plugin * added notification and logger workers under new worker module * add docker compatibility for new workers * handled docker issues * log consumer worker fixes * added scheduler worker * minor env changes * cleanup the logs * minor changes in logs * resolved scheduler worker issues * cleanup and refactor * ensuring backward compatibbility to existing wokers * added configuration internal apis and cache utils * optimization * Fix API client singleton pattern to share HTTP sessions - Fix flawed singleton implementation that was trying to share BaseAPIClient instances - Now properly shares HTTP sessions between specialized clients - Eliminates 6x BaseAPIClient initialization by reusing the same underlying session - Should reduce API deployment orchestration time by ~135ms (from 6 clients to 1 session) - Added debug logging to verify singleton pattern activation * cleanup and structuring * cleanup in callback * file system connectors issue * celery env values changes * optional gossip * variables for sync, mingle and gossip * Fix for file type check * Task pipeline issue resolving * api deployement failed response handled * Task pipline fixes * updated file history cleanup with active file execution * pipline status update and workflow ui page execution * cleanup and resolvinf conflicts * remove unstract-core from conenctoprs * Commit uv.lock changes * uv locks updates * resolve migration issues * defer connector-metadtda * Fix connector migration for production scale - Add encryption key handling with defer() to prevent decryption failures - Add final cleanup step to fix duplicate connector names - Optimize for large datasets with batch processing and bulk operations - Ensure unique constraint in migration 0004 can be created successfully 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <noreply@anthropic.com> * hitl fixes * minor fixes on hitl * api_hub related changes * dockerfile fixes * api client cache fixes with actual response class * fix: tags and llm_profile_id * optimized clear cache * cleanup * enhanced logs * added more handling on is file dir and added loggers * cleanup the runplatform script * internal apis are excempting from csrf * sonal cloud issues * sona-cloud issues * resolving sonar cloud issues * resolving sonar cloud issues * Delta: added Batch size fix in workers * comments addressed * celery configurational changes for new workers * fiixes in callback regaurding the pipline type check * change internal url registry logic * gitignore changes * gitignore changes * addressng pr cmmnets and cleanup the codes * adding missed profiles for v2 * sonal cloud blocker issues resolved * imlement otel * Commit uv.lock changes * handle execution time and some cleanup * adding user_data in metadata Pr: #1544 * scheduler backward compatibitlity * replace user_data with custom_data * Commit uv.lock changes * celery worker command issue resolved * enhance package imports in connectors by changing to lazy imports * Update runner.py by removing the otel from it Update runner.py by removing the otel from it Signed-off-by: ali <117142933+muhammad-ali-e@users.noreply.github.com> * added delta changes * handle erro to destination db * resolve tool instances id validation and hitl queu name in API * handled direct execution from workflow page to worker and logs * handle cost logs * Update health.py Signed-off-by: Ritwik G <100672805+ritwik-g@users.noreply.github.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * minor log changes * introducing log consumer scheduler to bulk create, and socket .emit from worker for ws * Commit uv.lock changes * time limit or timeout celery config cleanup * implemented redis client class in worker * pipline status enum mismatch * notification worker fixes * resolve uv lock conflicts * workflow log fixes * ws channel name issue resolved. and handling redis down in status tracker, and removing redis keys * default TTL changed for unified logs * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by: ali <117142933+muhammad-ali-e@users.noreply.github.com> Signed-off-by: Ritwik G <100672805+ritwik-g@users.noreply.github.com> Co-authored-by: Claude <noreply@anthropic.com> Co-authored-by: Ritwik G <100672805+ritwik-g@users.noreply.github.com> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
…rnal APIs (#1494) * UN-2470 [MISC] Remove Django dependency from Celery workers This commit introduces a new worker architecture that decouples Celery workers from Django where possible, enabling support for gevent/eventlet pool types and reducing worker startup overhead. Key changes: - Created separate worker modules (api-deployment, callback, file_processing, general) - Added internal API endpoints for worker communication - Implemented Django-free task execution where appropriate - Added shared utilities and client facades - Updated container configurations for new worker architecture 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <noreply@anthropic.com> * Fix pre-commit issues: file permissions and ruff errors Setup the docker for new workers - Add executable permissions to worker entrypoint files - Fix import order in namespace package __init__.py - Remove unused variable api_status in general worker - Address ruff E402 and F841 errors 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <noreply@anthropic.com> * refactoreed, Dockerfiles,fixes * flexibility on celery run commands * added debug logs * handled filehistory for API * cleanup * cleanup * cloud plugin structure * minor changes in import plugin * added notification and logger workers under new worker module * add docker compatibility for new workers * handled docker issues * log consumer worker fixes * added scheduler worker * minor env changes * cleanup the logs * minor changes in logs * resolved scheduler worker issues * cleanup and refactor * ensuring backward compatibbility to existing wokers * added configuration internal apis and cache utils * optimization * Fix API client singleton pattern to share HTTP sessions - Fix flawed singleton implementation that was trying to share BaseAPIClient instances - Now properly shares HTTP sessions between specialized clients - Eliminates 6x BaseAPIClient initialization by reusing the same underlying session - Should reduce API deployment orchestration time by ~135ms (from 6 clients to 1 session) - Added debug logging to verify singleton pattern activation * cleanup and structuring * cleanup in callback * file system connectors issue * celery env values changes * optional gossip * variables for sync, mingle and gossip * Fix for file type check * Task pipeline issue resolving * api deployement failed response handled * Task pipline fixes * updated file history cleanup with active file execution * pipline status update and workflow ui page execution * cleanup and resolvinf conflicts * remove unstract-core from conenctoprs * Commit uv.lock changes * uv locks updates * resolve migration issues * defer connector-metadtda * Fix connector migration for production scale - Add encryption key handling with defer() to prevent decryption failures - Add final cleanup step to fix duplicate connector names - Optimize for large datasets with batch processing and bulk operations - Ensure unique constraint in migration 0004 can be created successfully 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <noreply@anthropic.com> * hitl fixes * minor fixes on hitl * api_hub related changes * dockerfile fixes * api client cache fixes with actual response class * fix: tags and llm_profile_id * optimized clear cache * cleanup * enhanced logs * added more handling on is file dir and added loggers * cleanup the runplatform script * internal apis are excempting from csrf * sonal cloud issues * sona-cloud issues * resolving sonar cloud issues * resolving sonar cloud issues * Delta: added Batch size fix in workers * comments addressed * celery configurational changes for new workers * fiixes in callback regaurding the pipline type check * change internal url registry logic * gitignore changes * gitignore changes * addressng pr cmmnets and cleanup the codes * adding missed profiles for v2 * sonal cloud blocker issues resolved * imlement otel * Commit uv.lock changes * handle execution time and some cleanup * adding user_data in metadata Pr: #1544 * scheduler backward compatibitlity * replace user_data with custom_data * Commit uv.lock changes * celery worker command issue resolved * enhance package imports in connectors by changing to lazy imports * Update runner.py by removing the otel from it Update runner.py by removing the otel from it Signed-off-by: ali <117142933+muhammad-ali-e@users.noreply.github.com> * added delta changes * handle erro to destination db * resolve tool instances id validation and hitl queu name in API * handled direct execution from workflow page to worker and logs * handle cost logs * Update health.py Signed-off-by: Ritwik G <100672805+ritwik-g@users.noreply.github.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * minor log changes * introducing log consumer scheduler to bulk create, and socket .emit from worker for ws * Commit uv.lock changes * time limit or timeout celery config cleanup * implemented redis client class in worker * pipline status enum mismatch * notification worker fixes * resolve uv lock conflicts * workflow log fixes * ws channel name issue resolved. and handling redis down in status tracker, and removing redis keys * default TTL changed for unified logs * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by: ali <117142933+muhammad-ali-e@users.noreply.github.com> Signed-off-by: Ritwik G <100672805+ritwik-g@users.noreply.github.com> Co-authored-by: Claude <noreply@anthropic.com> Co-authored-by: Ritwik G <100672805+ritwik-g@users.noreply.github.com> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>



What
Why
How
Can this PR break any existing features. If yes, please list possible items. If no, please explain why. (PS: Admins do not merge the PR without this section filled)
No, this PR should not break any existing features because:
Database Migrations
Env Config
Relevant Docs
Related Issues or PRs
Dependencies Versions
Notes on Testing
Screenshots
N/A - Backend API feature
Checklist
I have read and understood the Contribution Guidelines.