[RHAIENG-528] Docling KFP support for accelerator devices #42
base: main
Conversation
Walkthrough

Adds an accelerator device parameter (default "auto") to the Standard and VLM converters, propagates it through pipeline wiring and compiled YAMLs, updates documentation and READMEs, expands local test harnesses, adds HTTP download retry logic, and updates .gitignore.

Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
Pre-merge checks: ❌ Failed checks (1 warning) · ✅ Passed checks (2 passed)
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (6)
kubeflow-pipelines/docling-vlm/vlm_components.py (2)
1-1: Fix the formatting issue flagged by GitHub Actions.

The Ruff formatter has flagged this file for reformatting. Run the following to fix:

```bash
ruff format kubeflow-pipelines/docling-vlm/vlm_components.py
```
27-40: Update the docstring to document the accelerator_device parameter.

The `accelerator_device` parameter is missing from the Args section of the docstring. Apply this diff:

```diff
     Args:
         input_path: Path to the input directory containing PDF files.
         artifacts_path: Path to the directory containing Docling models.
         output_path: Path to the output directory for converted JSON and Markdown files.
         pdf_filenames: List of PDF file names to process.
         num_threads: Number of threads to use per document processing.
         timeout_per_document: Timeout per document processing.
         image_export_mode: Mode to export images.
         remote_model_enabled: Whether or not to use a remote model.
         remote_model_secret_mount_path: Path to the remote model secret mount path.
+        accelerator_device: Device to use for acceleration (auto, cpu, cuda, gpu, mps).
```

kubeflow-pipelines/docling-vlm/vlm_convert_pipeline_compiled.yaml (1)
104-156: The accelerator_device parameter is not wired through the for-loop-1 component.

The `accelerator_device` parameter is defined in the `comp-docling-convert-vlm` inputDefinitions (lines 46-49) but is not passed through the `for-loop-1` component to the actual `docling-convert-vlm` task. This means users cannot configure the accelerator device at runtime; it will always use the default "auto" value. Apply this diff to wire the parameter:

```diff
 docling-convert-vlm:
   cachingOptions: {}
   componentRef:
     name: comp-docling-convert-vlm
   inputs:
     artifacts:
       artifacts_path:
         componentInputArtifact: pipelinechannel--download-docling-models-output_path
       input_path:
         componentInputArtifact: pipelinechannel--import-pdfs-output_path
     parameters:
+      accelerator_device:
+        componentInputParameter: pipelinechannel--docling_accelerator_device
       image_export_mode:
         componentInputParameter: pipelinechannel--docling_image_export_mode
```

Also add the corresponding input definition in the for-loop-1 component:

```diff
 inputDefinitions:
   artifacts:
     pipelinechannel--download-docling-models-output_path:
       artifactType:
         schemaTitle: system.Artifact
         schemaVersion: 0.0.1
     pipelinechannel--import-pdfs-output_path:
       artifactType:
         schemaTitle: system.Artifact
         schemaVersion: 0.0.1
   parameters:
+    pipelinechannel--docling_accelerator_device:
+      parameterType: STRING
     pipelinechannel--create-pdf-splits-Output:
       parameterType: LIST
```

And add it to the root pipeline's for-loop-1 task inputs:

```diff
 for-loop-1:
   componentRef:
     name: comp-for-loop-1
   dependentTasks:
   - create-pdf-splits
   - download-docling-models
   - import-pdfs
   inputs:
     artifacts:
       pipelinechannel--download-docling-models-output_path:
         taskOutputArtifact:
           outputArtifactKey: output_path
           producerTask: download-docling-models
       pipelinechannel--import-pdfs-output_path:
         taskOutputArtifact:
           outputArtifactKey: output_path
           producerTask: import-pdfs
     parameters:
+      pipelinechannel--docling_accelerator_device:
+        componentInputParameter: docling_accelerator_device
       pipelinechannel--create-pdf-splits-Output:
         taskOutputParameter:
           outputParameterKey: Output
           producerTask: create-pdf-splits
```

Also add the root-level input definition:

```diff
 root:
   dag:
     tasks:
       ...
   inputDefinitions:
     parameters:
+      docling_accelerator_device:
+        defaultValue: auto
+        isOptional: true
+        parameterType: STRING
       docling_image_export_mode:
         defaultValue: embedded
```

kubeflow-pipelines/docling-standard/standard_components.py (2)
1-1: Fix the formatting issue flagged by GitHub Actions.

The Ruff formatter has flagged this file for reformatting. Run the following to fix:

```bash
ruff format kubeflow-pipelines/docling-standard/standard_components.py
```
35-56: Update the docstring to document the accelerator_device parameter.

The `accelerator_device` parameter is missing from the Args section of the docstring. Apply this diff:

```diff
     Args:
         input_path: Path to the input directory containing PDF files.
         artifacts_path: Path to the directory containing Docling models.
         output_path: Path to the output directory for converted JSON and Markdown files.
         pdf_filenames: List of PDF file names to process.
         pdf_backend: Backend to use for PDF processing.
         image_export_mode: Mode to export images.
         table_mode: Mode to detect tables.
         num_threads: Number of threads to use per document processing.
         timeout_per_document: Timeout per document processing.
         ocr: Whether or not to use OCR if needed.
         force_ocr: Whether or not to force OCR.
         ocr_engine: Engine to use for OCR.
         allow_external_plugins: Whether or not to allow external plugins.
         enrich_code: Whether or not to enrich code.
         enrich_formula: Whether or not to enrich formula.
         enrich_picture_classes: Whether or not to enrich picture classes.
         enrich_picture_description: Whether or not to enrich picture description.
+        accelerator_device: Device to use for acceleration (auto, cpu, cuda, gpu, mps).
```

kubeflow-pipelines/docling-standard/standard_convert_pipeline_compiled.yaml (1)
153-237: The accelerator_device parameter is not wired through the for-loop-1 component.

The `accelerator_device` parameter is defined in the `comp-docling-convert-standard` inputDefinitions (lines 55-58) but is not passed through the `for-loop-1` component to the actual `docling-convert-standard` task. This means users cannot configure the accelerator device at runtime; it will always use the default "auto" value. Apply this diff to wire the parameter:

```diff
 docling-convert-standard:
   cachingOptions: {}
   componentRef:
     name: comp-docling-convert-standard
   inputs:
     artifacts:
       artifacts_path:
         componentInputArtifact: pipelinechannel--download-docling-models-output_path
       input_path:
         componentInputArtifact: pipelinechannel--import-pdfs-output_path
     parameters:
+      accelerator_device:
+        componentInputParameter: pipelinechannel--docling_accelerator_device
       allow_external_plugins:
         componentInputParameter: pipelinechannel--docling_allow_external_plugins
```

Also add the corresponding input definition in the for-loop-1 component:

```diff
 inputDefinitions:
   artifacts:
     pipelinechannel--download-docling-models-output_path:
       artifactType:
         schemaTitle: system.Artifact
         schemaVersion: 0.0.1
     pipelinechannel--import-pdfs-output_path:
       artifactType:
         schemaTitle: system.Artifact
         schemaVersion: 0.0.1
   parameters:
+    pipelinechannel--docling_accelerator_device:
+      parameterType: STRING
     pipelinechannel--create-pdf-splits-Output:
       parameterType: LIST
```

And add it to the root pipeline's for-loop-1 task inputs:

```diff
 for-loop-1:
   componentRef:
     name: comp-for-loop-1
   dependentTasks:
   - create-pdf-splits
   - download-docling-models
   - import-pdfs
   inputs:
     artifacts:
       pipelinechannel--download-docling-models-output_path:
         taskOutputArtifact:
           outputArtifactKey: output_path
           producerTask: download-docling-models
       pipelinechannel--import-pdfs-output_path:
         taskOutputArtifact:
           outputArtifactKey: output_path
           producerTask: import-pdfs
     parameters:
+      pipelinechannel--docling_accelerator_device:
+        componentInputParameter: docling_accelerator_device
       pipelinechannel--create-pdf-splits-Output:
         taskOutputParameter:
           outputParameterKey: Output
           producerTask: create-pdf-splits
```

Also add the root-level input definition:

```diff
 root:
   dag:
     tasks:
       ...
   inputDefinitions:
     parameters:
+      docling_accelerator_device:
+        defaultValue: auto
+        isOptional: true
+        parameterType: STRING
       docling_allow_external_plugins:
         defaultValue: false
```
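Since both compiled YAMLs are generated from Python pipeline definitions, the same wiring is usually easier to apply at the source level and then re-compile. The following is a minimal sketch assuming KFP v2's `dsl` API; the component and parameter names follow this PR, but the component bodies and split values are placeholders, not the repo's actual code:

```python
from kfp import compiler, dsl


@dsl.component(base_image="python:3.11")
def create_pdf_splits() -> list:
    # Stand-in for the real splitting component
    return [["a.pdf"], ["b.pdf"]]


@dsl.component(base_image="python:3.11")
def docling_convert_standard(
    pdf_split: list,
    accelerator_device: str = "auto",  # the new knob from this PR
):
    print(f"converting {pdf_split} on {accelerator_device}")


@dsl.pipeline(name="docling-standard-convert")
def standard_convert_pipeline(
    docling_accelerator_device: str = "auto",  # exposed at the pipeline root
):
    splits = create_pdf_splits()
    with dsl.ParallelFor(items=splits.output) as split:
        docling_convert_standard(
            pdf_split=split,
            # This argument is what the compiler turns into the
            # pipelinechannel--docling_accelerator_device wiring shown above.
            accelerator_device=docling_accelerator_device,
        )


compiler.Compiler().compile(standard_convert_pipeline, "pipeline.yaml")
```

Re-compiling after a change like this regenerates all four YAML locations (root input, for-loop channel, task input, header comment) in one step, which avoids hand-editing the compiled spec.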
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (7)
- kubeflow-pipelines/README.md (3 hunks)
- kubeflow-pipelines/docling-standard/README.md (1 hunks)
- kubeflow-pipelines/docling-standard/standard_components.py (3 hunks)
- kubeflow-pipelines/docling-standard/standard_convert_pipeline_compiled.yaml (4 hunks)
- kubeflow-pipelines/docling-vlm/README.md (1 hunks)
- kubeflow-pipelines/docling-vlm/vlm_components.py (3 hunks)
- kubeflow-pipelines/docling-vlm/vlm_convert_pipeline_compiled.yaml (4 hunks)
🧰 Additional context used
🪛 GitHub Actions: Python Formatting Validation
kubeflow-pipelines/docling-standard/standard_components.py
[error] 1-1: Ruff formatting would modify this file (would reformat).
kubeflow-pipelines/docling-vlm/vlm_components.py
[error] 1-1: Ruff formatting would modify this file (would reformat).
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: compile (docling-standard, kubeflow-pipelines/docling-standard, python standard_convert_pipeline....
- GitHub Check: compile (docling-vlm, kubeflow-pipelines/docling-vlm, python vlm_convert_pipeline.py, vlm_convert...
```python
device_map = {
    "auto": AcceleratorDevice.AUTO,
    "cpu": AcceleratorDevice.CPU,
    "cuda": AcceleratorDevice.CUDA,
    "gpu": AcceleratorDevice.CUDA,  # alias
    "mps": AcceleratorDevice.MPS,
}
dev = accelerator_device.lower()
if dev not in device_map:
    raise ValueError(f"Invalid accelerator_device: {accelerator_device}. Must be one of {sorted(device_map.keys())}")
```
🛠️ Refactor suggestion | 🟠 Major
Consider extracting the device_map to a shared constant.
This device_map dictionary is identical to the one in kubeflow-pipelines/docling-vlm/vlm_components.py (lines 75-81). This duplication violates the DRY principle.
See the recommendation in the VLM components file to extract this to a shared module in common/device_mapping.py.
🤖 Prompt for AI Agents
In kubeflow-pipelines/docling-standard/standard_components.py around lines 127
to 136, the device_map dict is duplicated (same as in
kubeflow-pipelines/docling-vlm/vlm_components.py) — extract this mapping to a
shared module (e.g., kubeflow-pipelines/common/device_mapping.py) and replace
the local literal with an import; move the device_map constant into that module,
export it (or a getter), update both files to import device_map (or
get_device_map()) and then use it unchanged so the mapping is maintained in one
place.
```python
device_map = {
    "auto": AcceleratorDevice.AUTO,
    "cpu": AcceleratorDevice.CPU,
    "cuda": AcceleratorDevice.CUDA,
    "gpu": AcceleratorDevice.CUDA,  # alias
    "mps": AcceleratorDevice.MPS,
}
dev = accelerator_device.lower()
if dev not in device_map:
    raise ValueError(
        f"Invalid accelerator_device: {accelerator_device}. Must be one of {sorted(device_map.keys())}"
    )
```
🛠️ Refactor suggestion | 🟠 Major
Consider extracting the device_map to a shared constant.
The device_map dictionary is duplicated in both kubeflow-pipelines/docling-vlm/vlm_components.py and kubeflow-pipelines/docling-standard/standard_components.py (lines 127-136). This violates the DRY principle and creates a maintenance burden if the mapping needs to be updated.
Consider extracting this to a shared module (e.g., common/constants.py or common/device_mapping.py):
```python
# common/device_mapping.py
from docling.datamodel.accelerator_options import AcceleratorDevice

ACCELERATOR_DEVICE_MAP = {
    "auto": AcceleratorDevice.AUTO,
    "cpu": AcceleratorDevice.CPU,
    "cuda": AcceleratorDevice.CUDA,
    "gpu": AcceleratorDevice.CUDA,  # alias
    "mps": AcceleratorDevice.MPS,
}
```

Then import and use it in both component files:

```python
from common.device_mapping import ACCELERATOR_DEVICE_MAP
```

🤖 Prompt for AI Agents
In kubeflow-pipelines/docling-vlm/vlm_components.py around lines 75 to 86, the
local device_map is duplicated across components causing DRY/maintenance issues;
extract the mapping into a shared module (e.g., common/device_mapping.py or
common/constants.py) that exports a constant (ACCELERATOR_DEVICE_MAP) using
AcceleratorDevice values, then replace the local mapping with an import from
that shared module and use the imported constant in the validation logic; update
the other file (kubeflow-pipelines/docling-standard/standard_components.py) to
import and use the same constant and run tests/flake to ensure imports and names
are correct.
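For completeness, here is a short sketch of what the call sites could look like once the constant is shared. The `resolve_accelerator_device` helper name is illustrative and not part of the PR; the validation logic mirrors the snippet quoted above:

```python
from common.device_mapping import ACCELERATOR_DEVICE_MAP


def resolve_accelerator_device(accelerator_device: str):
    """Normalize and validate a user-supplied device string."""
    dev = accelerator_device.strip().lower()
    if dev not in ACCELERATOR_DEVICE_MAP:
        raise ValueError(
            f"Invalid accelerator_device: {accelerator_device}. "
            f"Must be one of {sorted(ACCELERATOR_DEVICE_MAP)}"
        )
    return ACCELERATOR_DEVICE_MAP[dev]
```

Both components could then call this one helper, so new devices or aliases need to be added in exactly one place.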
4200278 → 5bbcc5f
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
kubeflow-pipelines/docling-standard/standard_components.py (1)
33-56: Add documentation for the accelerator_device parameter.

The new `accelerator_device` parameter is missing from the function's docstring. Users need to understand the purpose, valid values, and default behavior. Add the parameter documentation to the Args section:

```diff
         enrich_picture_classes: Whether or not to enrich picture classes.
         enrich_picture_description: Whether or not to enrich picture description.
+        accelerator_device: Device to use for acceleration (auto, cpu, cuda, gpu, mps). Defaults to "auto".
     """
```

kubeflow-pipelines/docling-vlm/vlm_components.py (1)
25-40: Add documentation for the accelerator_device parameter.

The new `accelerator_device` parameter is missing from the function's docstring. Users need to understand the purpose, valid values, and default behavior. Add the parameter documentation to the Args section:

```diff
         remote_model_enabled: Whether or not to use a remote model.
         remote_model_secret_mount_path: Path to the remote model secret mount path.
+        accelerator_device: Device to use for acceleration (auto, cpu, cuda, gpu, mps). Defaults to "auto".
     """
```
♻️ Duplicate comments (3)
kubeflow-pipelines/docling-standard/standard_components.py (1)
127-138: Device mapping duplication already flagged in previous review.

The duplication of `device_map` across component files has been identified in the previous review comments. Please refer to the earlier recommendation to extract this to a shared module.

kubeflow-pipelines/docling-vlm/vlm_components.py (1)
75-86: Device mapping duplication already flagged in previous review.

The duplication of `device_map` across component files has been identified in the previous review comments. Please refer to the earlier recommendation to extract this to a shared module.

kubeflow-pipelines/docling-vlm/vlm_convert_pipeline_compiled.yaml (1)
251-261: Missing documentation reflects source file issue.

The docstring in this compiled YAML is missing documentation for `accelerator_device` because the source file (vlm_components.py) is missing it. Please address this by updating the source file's docstring as flagged in the earlier review comment.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (7)
- kubeflow-pipelines/README.md (3 hunks)
- kubeflow-pipelines/docling-standard/README.md (1 hunks)
- kubeflow-pipelines/docling-standard/standard_components.py (3 hunks)
- kubeflow-pipelines/docling-standard/standard_convert_pipeline_compiled.yaml (4 hunks)
- kubeflow-pipelines/docling-vlm/README.md (1 hunks)
- kubeflow-pipelines/docling-vlm/vlm_components.py (3 hunks)
- kubeflow-pipelines/docling-vlm/vlm_convert_pipeline_compiled.yaml (4 hunks)
🚧 Files skipped from review as they are similar to previous changes (4)
- kubeflow-pipelines/README.md
- kubeflow-pipelines/docling-standard/README.md
- kubeflow-pipelines/docling-standard/standard_convert_pipeline_compiled.yaml
- kubeflow-pipelines/docling-vlm/README.md
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: compile (docling-standard, kubeflow-pipelines/docling-standard, python standard_convert_pipeline....
- GitHub Check: compile (docling-vlm, kubeflow-pipelines/docling-vlm, python vlm_convert_pipeline.py, vlm_convert...
🔇 Additional comments (1)
kubeflow-pipelines/docling-vlm/vlm_convert_pipeline_compiled.yaml (1)
280-285: LGTM! The device mapping and validation logic in the compiled YAML correctly reflects the source implementation from vlm_components.py.
5bbcc5f → cb254e0
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
kubeflow-pipelines/docling-standard/standard_convert_pipeline_compiled.yaml (2)
4-21: Update pipeline header comments and wire accelerator_device through the full DAG.

The pipeline header (lines 4-21) documents all other parameters but omits `accelerator_device`. Additionally, this parameter must be added to:

- Pipeline header (lines 4-21): Add `accelerator_device: str [Default: auto]`
- Root inputDefinitions (lines 717-785): Add parameter definition matching the component
- comp-for-loop-1 inputDefinitions (lines 207-237): Add parameter definition
- for-loop-1 task inputs (lines 166-194): Wire parameter via `componentInputParameter`

Example of the missing wiring in the for-loop task (should be added around line 167):

```yaml
accelerator_device:
  componentInputParameter: pipelinechannel--docling_accelerator_device
```

Apply these changes to complete the parameter threading.
Also applies to: 55-58, 166-194
154-194: Add accelerator_device to the pipeline parameter chain.

The component defines `accelerator_device` (line 55) with a default of "auto", but it is not wired through the pipeline DAG:

- Not included in the docling-convert-standard task's parameter inputs (lines 167–194)
- Not defined in `comp-for-loop-1.inputDefinitions.parameters` (lines 197–237)
- Not defined in `root.inputDefinitions.parameters` (lines 710–785)
- Not documented in the pipeline header comments (lines 4–21)

This prevents pipeline users from overriding the default `accelerator_device` value. Either add this parameter to all four locations for full configurability, or document if this is intentionally fixed per pipeline.
♻️ Duplicate comments (1)
kubeflow-pipelines/docling-vlm/vlm_convert_pipeline_compiled.yaml (1)
46-49: The accelerator_device parameter must be wired through the pipeline root and for-loop-1 task.

The `accelerator_device` parameter is defined in the component (lines 46–49) with a default value of "auto", but it is not exposed at the pipeline root level (root.inputDefinitions) nor passed through the `for-loop-1` task to the `docling-convert-vlm` component. This prevents runtime configuration; the parameter will always use its default value. Other similar optional parameters like `image_export_mode`, `num_threads`, and `timeout_per_document` follow the correct pattern and are exposed at the root and wired through the task.

Required changes:

- Add `docling_accelerator_device` to `root.inputDefinitions` (lines 610–643) following the pattern of other `docling_*` parameters:

```yaml
docling_accelerator_device:
  defaultValue: auto
  isOptional: true
  parameterType: STRING
```

- Add the parameter to the `for-loop-1` task's inputs (lines 575–586):

```yaml
pipelinechannel--docling_accelerator_device:
  componentInputParameter: docling_accelerator_device
```

- Wire it into the `docling-convert-vlm` component parameters within the `for-loop-1` task (around line 130):

```yaml
accelerator_device:
  componentInputParameter: pipelinechannel--docling_accelerator_device
```

Also update the pipeline header comment (lines 4–12) to include `docling_accelerator_device: str [Default: 'auto']` in the Inputs section.

Also applies to: 107-132, 610-643
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (7)
- kubeflow-pipelines/README.md (3 hunks)
- kubeflow-pipelines/docling-standard/README.md (1 hunks)
- kubeflow-pipelines/docling-standard/standard_components.py (3 hunks)
- kubeflow-pipelines/docling-standard/standard_convert_pipeline_compiled.yaml (4 hunks)
- kubeflow-pipelines/docling-vlm/README.md (1 hunks)
- kubeflow-pipelines/docling-vlm/vlm_components.py (3 hunks)
- kubeflow-pipelines/docling-vlm/vlm_convert_pipeline_compiled.yaml (4 hunks)
🚧 Files skipped from review as they are similar to previous changes (4)
- kubeflow-pipelines/docling-vlm/vlm_components.py
- kubeflow-pipelines/README.md
- kubeflow-pipelines/docling-standard/standard_components.py
- kubeflow-pipelines/docling-vlm/README.md
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: compile (docling-vlm, kubeflow-pipelines/docling-vlm, python vlm_convert_pipeline.py, vlm_convert...
- GitHub Check: compile (docling-standard, kubeflow-pipelines/docling-standard, python standard_convert_pipeline....
🔇 Additional comments (2)
kubeflow-pipelines/docling-standard/standard_convert_pipeline_compiled.yaml (1)
388-411: Device mapping and validation are well-implemented.

The device_map correctly defines all supported accelerators with "gpu" as an alias for "cuda", case-insensitive input handling, and clear error messaging on invalid input.
kubeflow-pipelines/docling-standard/README.md (1)
13-13: Documentation is clear and accurate.

The new `docling_accelerator_device` parameter is well-documented with supported values and default. Placement at the start of the configuration options list is logical and improves discoverability.
@Mergifyio refresh

✅ Pull request refreshed
cb254e0 → bd03127
Actionable comments posted: 4
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
.github/workflows/validate-python.yml (1)
12-12: Fix path pattern typo: "workflow" → "workflows" (plural).

Lines 12 and 20 reference `.github/workflow/validate-python.yml` but the file is located at `.github/workflows/validate-python.yml`. The singular "workflow" won't match; this breaks the self-referential trigger and the workflow won't re-run when it changes. Apply this diff to fix both occurrences:

```diff
 paths:
   - "**/*.py"
   - ".pre-commit-config.yaml"
   - "Makefile"
   - "pyproject.toml"
-  - ".github/workflow/validate-python.yml"
+  - ".github/workflows/validate-python.yml"
```

Also applies to: 20-20
♻️ Duplicate comments (4)
kubeflow-pipelines/docling-standard/standard_components.py (1)
127-138: Avoid duplicating device_map across standard and VLM components.

The `device_map` here (including the `gpu` alias) is identical in spirit to the one used in the VLM components. Maintaining these in two places will be error-prone if you ever add/remove devices or aliases.

As suggested in the prior review, consider extracting this mapping into a shared module (e.g., kubeflow-pipelines/common/device_mapping.py) and importing it in both components:

```diff
-    device_map = {
-        "auto": AcceleratorDevice.AUTO,
-        "cpu": AcceleratorDevice.CPU,
-        "cuda": AcceleratorDevice.CUDA,
-        "gpu": AcceleratorDevice.CUDA,  # alias
-        "mps": AcceleratorDevice.MPS,
-    }
+    from common.device_mapping import DEVICE_MAP as device_map
@@
-    pipeline_options.accelerator_options = AcceleratorOptions(
-        num_threads=num_threads, device=device_map[dev]
-    )
+    pipeline_options.accelerator_options = AcceleratorOptions(
+        num_threads=num_threads, device=device_map[dev]
+    )
```

Also applies to: 167-169
kubeflow-pipelines/docling-vlm/vlm_components.py (1)
25-26: Document accelerator_device and consider centralizing device_map.

The runtime behavior looks good (normalization, alias support, validation, and passing `device_map[dev]` into `AcceleratorOptions`). Two follow-ups:

- The docstring's Args block doesn't mention `accelerator_device`, which makes the new parameter discoverability poor. Consider adding a brief line:

```diff
-        remote_model_secret_mount_path: Path to the remote model secret mount path.
+        remote_model_secret_mount_path: Path to the remote model secret mount path.
+        accelerator_device: Device to use for acceleration ('auto', 'cpu', 'cuda', 'gpu', 'mps').
```

- The `device_map` dictionary is now duplicated across the VLM and standard components (and appears again in the compiled YAML). To avoid drift when adding new accelerators or aliases, it would be cleaner to move this mapping into a shared module (e.g., common/device_mapping.py) and import it here.

Also applies to: 75-86, 179-180
kubeflow-pipelines/docling-vlm/vlm_convert_pipeline_compiled.yaml (2)
46-49: accelerator_device still isn't configurable at the pipeline level.

`accelerator_device` is defined on `comp-docling-convert-vlm` (lines 46–49) and used inside `docling_convert_vlm`, but it isn't:

- Exposed as a root pipeline parameter, and
- Passed through the `docling-convert-vlm` task in `comp-for-loop-1`.

As a result, runs always use the default "auto", and users cannot select CPU/GPU/MPS via pipeline inputs, which undercuts the goal of "KFP support for accelerator devices". To align with how other `docling_*` parameters are wired:

- Add a root parameter, e.g. `docling_accelerator_device` (default "auto", optional, STRING) under `root.inputDefinitions.parameters`.
- Add a `pipelinechannel--docling_accelerator_device` entry under `comp-for-loop-1.inputDefinitions.parameters`.
- In `comp-for-loop-1.dag.tasks.docling-convert-vlm.inputs.parameters`, add:

```yaml
accelerator_device:
  componentInputParameter: pipelinechannel--docling_accelerator_device
```

- Under `root.dag.tasks.for-loop-1.inputs.parameters`, map:

```yaml
pipelinechannel--docling_accelerator_device:
  componentInputParameter: docling_accelerator_device
```

Also applies to: 107-131, 575-587, 610-644
251-261: Docstring still omits accelerator_device.

`accelerator_device` is in the function signature, but the Args section doesn't mention it, which makes the parameter opaque when reading the compiled spec. Consider extending the docstring as follows:

```diff
-        remote_model_secret_mount_path: Path to the remote model secret mount path.
+        remote_model_secret_mount_path: Path to the remote model secret mount path.
+        accelerator_device: Device to use for acceleration ('auto', 'cpu', 'cuda', 'gpu', 'mps').
```
🧹 Nitpick comments (8)
.github/mergify.yml (1)
45-69: Consider guarding backport rule against breaking-change labeled PRs.

The new backport rule is a useful addition for maintaining stable branches. However, it currently creates a backport PR even if the original PR is labeled with `breaking-change`, which could result in unintended backports of incompatible changes. Consider adding a condition to exclude breaking-change labeled PRs from automatic backporting:

```diff
 - name: Create backport PRs for labeled changes
   conditions:
     - merged
     - base=main
     - label=backport
+    - "label!=breaking-change"
   actions:
```

Alternatively, if all PRs with the `backport` label are intentionally stable changes regardless of the breaking-change label, this can be skipped. Please confirm whether the backport rule should automatically backport PRs labeled with both `backport` and `breaking-change`.

tests/conftest.py (1)
11-19: LGTM with minor note on defensive filtering.

The function correctly discovers notebooks using glob with recursive search. The `Path(f).exists()` filter on line 17 is technically redundant since `glob.glob` only returns existing files, but it serves as defensive programming and doesn't impact functionality. If you prefer to streamline the code slightly, you can remove the existence check:

```diff
-    # Convert to Path objects and filter out any non-existent files
-    notebook_paths = [Path(f) for f in notebook_files if Path(f).exists()]
+    # Convert to Path objects
+    notebook_paths = [Path(f) for f in notebook_files]
```

README.md (1)
7-27: Consider aligning "workbenches directory" wording with the new structure.

The paragraph mentioning "The workbenches directory…" no longer matches the tree below, which highlights `custom-workbench-image`. If `workbenches` was renamed or superseded, consider updating that sentence to reference the current directory name(s) to avoid confusion.

custom-workbench-image/dp-entrypoint.sh (1)
5-15: Paths look good; consider quoting $SRC_DIR in cp for robustness.

The new `SRC_DIR`/`DEST_DIR` values are consistent with the repo rename. To make the copy command safer against unexpected characters, you can quote the variable:

```diff
-cp -r $SRC_DIR /opt/app-root/src/
+cp -r "$SRC_DIR" /opt/app-root/src/
```

custom-workbench-image/README.md (1)
33-51: Docs match the new image setup; mirror the cp quoting improvement.

The README snippets correctly reflect the renamed repository (data-processing), updated TMP/SRC paths, and `dp-entrypoint.sh` usage. To keep the example aligned with the script best practice, you can also quote `SRC_DIR` here:

```diff
-cp -r $SRC_DIR /opt/app-root/src/
+cp -r "$SRC_DIR" /opt/app-root/src/
```

Also applies to: 61-78, 136-146
kubeflow-pipelines/README.md (1)
24-25: Accelerator-device docs look good; consider clarifying parameter naming across layers.

The new defaults and the "Configuring accelerator device" section accurately describe the supported values (auto, cpu, cuda, gpu, mps) and the `gpu` → `cuda` alias, and the updated URLs/paths are consistent with the data-processing repo.

To avoid confusion between the KFP parameter name (`docling_accelerator_device` in the docs) and the internal component argument (`accelerator_device` in code), consider adding a short note that the pipeline's `docling_accelerator_device` runtime parameter is passed through to the underlying components as `accelerator_device`:

```diff
-#### 7) Configuring accelerator device
+#### 7) Configuring accelerator device
 @@
-- The parameter is case-insensitive and supports friendly aliases (`gpu` maps to `cuda`).
+- The parameter is case-insensitive and supports friendly aliases (`gpu` maps to `cuda`).
+- Internally, this pipeline parameter is forwarded to the underlying Docling components as the `accelerator_device` argument.
```

Also applies to: 59-60, 63-64, 69-75, 96-98, 161-171
kubeflow-pipelines/docling-standard/standard_components.py (1)
33-56: Document the new accelerator_device parameter in the docstring.

The new `accelerator_device` knob is implemented correctly but isn't mentioned in the Args section, which can surprise readers of this component. Consider extending the docstring like this:

```diff
         enrich_picture_classes: Whether or not to enrich picture classes.
         enrich_picture_description: Whether or not to enrich picture description.
+        accelerator_device: Device to use for acceleration. One of "auto", "cpu", "cuda", "gpu", or "mps".
```

kubeflow-pipelines/docling-standard/standard_convert_pipeline_compiled.yaml (1)
55-58: Wire accelerator_device through the standard pipeline so it's configurable.

Right now `accelerator_device` is declared on `comp-docling-convert-standard` (lines 55–58) and used in the embedded function, but it's never passed in the `docling-convert-standard` task within `comp-for-loop-1` nor exposed as a root-level pipeline input. That means runs will always use the default "auto" and callers cannot choose CPU/GPU/MPS at pipeline invocation time.

For consistency with the other `docling_*` knobs:

- Add a root parameter, e.g. `docling_accelerator_device` (default "auto", optional, STRING).
- Plumb it into `comp-for-loop-1.inputDefinitions.parameters` as `pipelinechannel--docling_accelerator_device`.
- Pass it into the `docling-convert-standard` task's `accelerator_device` input via a `componentInputParameter`.

This keeps the compiled spec aligned with the stated goal of making accelerator selection configurable in KFP.
Also applies to: 156-195, 717-786
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (29)
- .github/dependabot.yml (1 hunks)
- .github/mergify.yml (2 hunks)
- .github/workflows/compile-kfp.yml (1 hunks)
- .github/workflows/execute-all-notebooks.yml (1 hunks)
- .github/workflows/execute-notebooks.yml (1 hunks)
- .github/workflows/validate-and-execute-notebooks.yml (0 hunks)
- .github/workflows/validate-notebooks.yml (1 hunks)
- .github/workflows/validate-python.yml (1 hunks)
- .gitignore (1 hunks)
- Makefile (2 hunks)
- README.md (3 hunks)
- custom-workbench-image/Containerfile (1 hunks)
- custom-workbench-image/README.md (4 hunks)
- custom-workbench-image/dp-entrypoint.sh (1 hunks)
- kubeflow-pipelines/README.md (5 hunks)
- kubeflow-pipelines/docling-standard/README.md (2 hunks)
- kubeflow-pipelines/docling-standard/standard_components.py (3 hunks)
- kubeflow-pipelines/docling-standard/standard_convert_pipeline_compiled.yaml (4 hunks)
- kubeflow-pipelines/docling-vlm/README.md (2 hunks)
- kubeflow-pipelines/docling-vlm/vlm_components.py (3 hunks)
- kubeflow-pipelines/docling-vlm/vlm_convert_pipeline_compiled.yaml (4 hunks)
- notebooks/README.md (1 hunks)
- pyproject.toml (1 hunks)
- requirements-dev.txt (1 hunks)
- scripts/subset_selection/README.md (1 hunks)
- tests/.gitignore (1 hunks)
- tests/README.md (1 hunks)
- tests/conftest.py (1 hunks)
- tests/test_notebook_parameters.py (1 hunks)
💤 Files with no reviewable changes (1)
- .github/workflows/validate-and-execute-notebooks.yml
✅ Files skipped from review due to trivial changes (4)
- notebooks/README.md
- scripts/subset_selection/README.md
- tests/README.md
- tests/.gitignore
🧰 Additional context used
🧬 Code graph analysis (1)
tests/test_notebook_parameters.py (1)
tests/conftest.py (1)
get_notebook_files(11-19)
🪛 actionlint (1.7.9)
.github/workflows/execute-notebooks.yml
45-45: shellcheck reported issue in this script: SC2046:warning:8:44: Quote this to prevent word splitting
(shellcheck)
🪛 checkmake (0.2.2)
Makefile
[warning] 1-1: Missing required phony target "all"
(minphony)
[warning] 1-1: Missing required phony target "clean"
(minphony)
[warning] 1-1: Missing required phony target "test"
(minphony)
🪛 LanguageTool
README.md
[grammar] ~38-~38: Use a hyphen to join words.
Context: ...pelines. ### Notebooks Data processing related jupyter notebooks a...
(QB_NEW_EN_HYPHEN)
🪛 Pylint (4.0.4)
tests/test_notebook_parameters.py
[refactor] 16-16: Too few public methods (1/2)
(R0903)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
- GitHub Check: execute_tests (notebooks/use-cases/document-conversion-standard.ipynb, https://raw.githubusercont...
- GitHub Check: compile (docling-standard, kubeflow-pipelines/docling-standard, python standard_convert_pipeline....
- GitHub Check: compile (docling-vlm, kubeflow-pipelines/docling-vlm, python vlm_convert_pipeline.py, vlm_convert...
- GitHub Check: Summary
🔇 Additional comments (19)
.gitignore (1)
6-6: Good addition for broader venv directory support.

Adding "venv" (without leading dot) alongside the existing ".venv" entry ensures that virtual environment directories from different tools and conventions are properly ignored. This improves the developer experience across varied Python development workflows.
.github/mergify.yml (1)
5-12: Primary merge rule conditions look solid.

The auto-merge conditions are well-defined with appropriate safeguards (approval requirement, no requested changes, exclusion of do-not-merge/breaking-change labels, conflict detection).
pyproject.toml (1)
29-32: Pytest configuration is well-structured.

Standard configuration aligns with the new test suite infrastructure.
.github/workflows/validate-python.yml (1)
30-30: GitHub Actions versions updated appropriately.

Bumping to actions/checkout@v5 and actions/setup-python@v6 is reasonable and non-breaking.
Also applies to: 35-35
.github/workflows/compile-kfp.yml (1)
40-40: GitHub Actions versions updated consistently.

Aligns with other workflow modernization in the PR; no functional changes to pipeline compilation logic.
Also applies to: 43-43
.github/dependabot.yml (1)
1-23: Dependabot configuration is well-designed.

Sensible rules: direct dependencies only, major updates ignored, clear commit prefixes for categorization.
.github/workflows/execute-notebooks.yml (1)
1-55: Notebook execution workflow is well-structured.

Setup, installation, and execution steps follow a logical sequence. After fixing the shell quoting issue above, this workflow provides good CI coverage for notebook execution.
Please verify that the base64-encoded parameters are correctly decoded and applied within the notebook execution context (papermill handles the `-b` flag parameter injection).
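One way to sanity-check that injection locally is via papermill's Python API, which performs the same parameter binding as the CLI's `-b` (base64-encoded YAML) flag. The notebook path follows the CI check names elsewhere in this PR; the parameter value is a placeholder:

```python
import papermill as pm

# Executes the notebook with a 'files' parameter injected into the cell
# tagged 'parameters', just as the CLI does with -b <base64-yaml>.
pm.execute_notebook(
    "notebooks/use-cases/document-conversion-standard.ipynb",
    "document-conversion-standard.tmp.ipynb",
    parameters={"files": ["https://example.com/sample.pdf"]},
)
```

If the injected values show up in the executed copy's parameters cell, the base64 round-trip in the workflow is behaving as expected.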
.github/workflows/execute-all-notebooks.yml (1)

1-100: EC2 runner orchestration workflow is well-designed.

Proper OIDC setup, role assumption, cross-job dependencies, and cleanup (always runs). Resource tagging and AZ configuration show operational rigor. The execute-all-notebooks job is currently a placeholder; ensure the actual notebook execution logic is added before this workflow is used in production.
Confirm that the execute-all-notebooks job will be populated with notebook execution logic in a follow-up PR (currently only a setup step is present).
requirements-dev.txt (1)
3-5: Development dependencies align with new test infrastructure.

pytest, nbformat, and papermill support the notebook parameter validation tests and notebook execution workflows added in this PR.
Makefile (1)
1-1: Makefile target for notebook parameter tests is well-integrated.

Target follows existing patterns and is properly declared in .PHONY. Enables CI workflow to validate notebook parameter tagging.
Also applies to: 28-31
tests/conftest.py (1)
22-25: LGTM! The fixture is correctly implemented and provides a clean interface for tests to access discovered notebook files.
.github/workflows/validate-notebooks.yml (1)
26-43: LGTM! The workflow steps are well-structured and appropriate for notebook validation. Python setup with pip caching, dependency installation, and kernel configuration are all correctly implemented.
tests/test_notebook_parameters.py (3)
16-53: LGTM! The `NotebookParametersValidator` class is well-implemented. The validation logic correctly:

- Reads notebooks with `NO_CONVERT` to preserve structure
- Validates notebook format with `nbformat.validate()`
- Checks for code cells tagged with 'parameters'
- Provides descriptive error messages

Regarding the Pylint hint (R0903): This warning is a false positive. The class serves as a reasonable organizational unit for notebook validation logic, and having a single public method is appropriate for this focused responsibility.
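For readers who want the shape of that check without opening the test file, here is a minimal sketch of the validation steps listed above; the helper name is illustrative and the real code lives in the `NotebookParametersValidator` class:

```python
import nbformat


def has_parameters_cell(notebook_path: str) -> bool:
    """Return True if the notebook contains a code cell tagged 'parameters'."""
    nb = nbformat.read(notebook_path, as_version=nbformat.NO_CONVERT)
    nbformat.validate(nb)  # raises if the notebook format is invalid
    return any(
        cell.cell_type == "code"
        and "parameters" in cell.get("metadata", {}).get("tags", [])
        for cell in nb.cells
    )
```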
56-70: LGTM! The parameterized test correctly validates that each notebook contains a parameters cell. The assertion message is helpful, clearly explaining what's missing and why it's needed for papermill execution.
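The parametrized test it praises boils down to this pattern (a sketch reusing the `has_parameters_cell` helper above and the `get_notebook_files` discovery helper from tests/conftest.py; the repo's actual test uses its validator class):

```python
import pytest


@pytest.mark.parametrize("notebook_path", get_notebook_files(), ids=str)
def test_notebook_has_parameters_cell(notebook_path):
    assert has_parameters_cell(str(notebook_path)), (
        f"{notebook_path} has no code cell tagged 'parameters'; "
        "papermill needs one to inject runtime parameters"
    )
```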
73-107: LGTM! The validator self-test is thorough and well-structured. It correctly tests both negative and positive cases by creating minimal notebook structures. The use of `try/finally` blocks ensures proper cleanup of temporary files even if assertions fail.

kubeflow-pipelines/docling-vlm/README.md (1)
1-14: Accelerator device docs look aligned; please double-check parameter wiring.

The new `docling_accelerator_device` option, supported values, and default match the accelerator semantics used elsewhere in the repo, and the updated URLs/paths for data-processing look correct. Please just verify that the KFP parameter name used in:
- VLM components,
- the pipeline definition,
- the compiled YAML, and
- this README
is consistent (including the `gpu` → `cuda` alias) so users don't see mismatched names between UI and docs.

Also applies to: 27-30
custom-workbench-image/Containerfile (1)
51-67: Repo/entrypoint path updates are consistent.

The switch to cloning data-processing, using `dp-entrypoint.sh`, and updating `NOTEBOOK_ROOT_DIR` to /opt/app-root/src/data-processing/notebooks/use-cases is internally consistent with the entrypoint script and documentation.

kubeflow-pipelines/docling-standard/README.md (1)
1-14: Standard pipeline README updates align with accelerator support and repo rename.

The new `docling_accelerator_device` option (values and default) and the updated clone/import paths correctly reflect the accelerator support and the data-processing repository layout.

Also applies to: 35-38
kubeflow-pipelines/docling-vlm/vlm_convert_pipeline_compiled.yaml (1)
280-285: Accelerator mapping and validation look solid.

The `device_map` with aliases ("gpu" → CUDA) plus lowercase normalization and explicit ValueError on invalid keys is a clean pattern, and wiring `device_map[dev]` into `AcceleratorOptions(device=...)` keeps the spec aligned with the Docling enums.

Also applies to: 333-334
```bash
echo "Executing notebook '$NOTEBOOK' with file '$FILE'..."
papermill $NOTEBOOK $NOTEBOOK.tmp.ipynb -b $(echo -n "files: [\"$FILE\"]" | base64 -w 0)
```
Quote the command substitution to prevent word splitting.
Shellcheck (SC2046) flags the unquoted base64 expansion. While unlikely to cause issues in practice, quote the output to adhere to shell best practices and silence the static analysis warning.
Apply this diff to fix line 53:
```diff
-papermill $NOTEBOOK $NOTEBOOK.tmp.ipynb -b $(echo -n "files: [\"$FILE\"]" | base64 -w 0)
+papermill $NOTEBOOK $NOTEBOOK.tmp.ipynb -b "$(echo -n "files: [\"$FILE\"]" | base64 -w 0)"
```
🤖 Prompt for AI Agents
.github/workflows/execute-notebooks.yml around line 53: the unquoted command
substitution for base64 can be word-split by the shell and triggers ShellCheck
SC2046; fix it by quoting the expansion so the -b argument receives a single
string — change the call to pass the output of the echo|base64 pipeline wrapped
in double quotes (i.e. quote the $(...) substitution) so the entire JSON/base64
payload is treated as one argument.
```yaml
validate_tests:
  runs-on: ubuntu-latest
  strategy:
    matrix:
      # Set the notebooks to validate, wildcards are allowed
      notebooks_to_validate: ["notebooks/**/*.ipynb"]
```
Remove unused matrix configuration.
The matrix strategy defines notebooks_to_validate but this value is never referenced in any of the workflow steps. The Make targets (format-notebooks-check and test-notebook-parameters) operate on all notebooks discovered by the test configuration, making this matrix unnecessary.
Apply this diff to remove the unused matrix:
```diff
 jobs:
   validate_tests:
     runs-on: ubuntu-latest
-    strategy:
-      matrix:
-        # Set the notebooks to validate, wildcards are allowed
-        notebooks_to_validate: ["notebooks/**/*.ipynb"]
     steps:
       - uses: actions/checkout@v5
```
- uses: actions/checkout@v5Alternatively, if you intend to use the matrix to selectively test notebooks, you'll need to pass ${{ matrix.notebooks_to_validate }} to the Make targets or test commands.
🤖 Prompt for AI Agents
.github/workflows/validate-notebooks.yml around lines 19 to 24: the workflow
defines a matrix key notebooks_to_validate that is never used; either remove the
unused matrix block entirely (delete the strategy.matrix.notebooks_to_validate
lines) so the job runs without an unused variable, or update the workflow steps
that call the Make targets to pass the matrix value (e.g., use ${{
matrix.notebooks_to_validate }} as an argument to the format-notebooks-check and
test-notebook-parameters targets) so the matrix actually controls which
notebooks are validated.
```markdown
### Notebooks

Data processing related [jupyter notebooks](notebooks) are broken down into [use-cases](notebooks/use-cases) and [tutorials](notebooks/tutorials).

### Custom Workbench Image

Open Data Hub has the ability for users to add and run [custom workbench images](https://docs.redhat.com/en/documentation/red_hat_openshift_ai_self-managed/2.24/html/managing_openshift_ai/creating-custom-workbench-images).

A sample `Containerfile` and instructions to create a custom workbench image are in [custom-workbench-image](custom-workbench-image).

### Scripts

Curated scripts related to data processing live in this directory.

For example the [subset selection](scripts/subset_selection) scripts allows users to identify representative samples from large training datasets.
```
Tighten wording for notebooks/scripts sentences.
Minor grammar/clarity fixes here will read better and address the LanguageTool warning:
- Use a hyphenated compound and capitalize “Jupyter”.
- Fix subject–verb agreement and add a small comma for readability.
```diff
-Data processing related [jupyter notebooks](notebooks) are broken down into [use-cases](notebooks/use-cases) and [tutorials](notebooks/tutorials).
+Data-processing-related [Jupyter notebooks](notebooks) are broken down into [use-cases](notebooks/use-cases) and [tutorials](notebooks/tutorials).
@@
-For example the [subset selection](scripts/subset_selection) scripts allows users to identify representative samples from large training datasets.
+For example, the [subset selection](scripts/subset_selection) scripts allow users to identify representative samples from large training datasets.
```
+For example, the [subset selection](scripts/subset_selection) scripts allow users to identify representative samples from large training datasets.📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| ### Notebooks | |
| Data processing related [jupyter notebooks](notebooks) are broken down into [use-cases](notebooks/use-cases) and [tutorials](notebooks/tutorials). | |
| ### Custom Workbench Image | |
| Open Data Hub has the ability for users to add and run [custom workbench images](https://docs.redhat.com/en/documentation/red_hat_openshift_ai_self-managed/2.24/html/managing_openshift_ai/creating-custom-workbench-images). | |
| A sample `Containerfile` and instructions to create a custom workbench image are in [custom-workbench-image](custom-workbench-image). | |
| ### Scripts | |
| Curated scripts related to data processing live in this directory. | |
| For example the [subset selection](scripts/subset_selection) scripts allows users to identify representative samples from large training datasets. | |
| ### Notebooks | |
| Data-processing-related [Jupyter notebooks](notebooks) are broken down into [use-cases](notebooks/use-cases) and [tutorials](notebooks/tutorials). | |
| ### Custom Workbench Image | |
| Open Data Hub has the ability for users to add and run [custom workbench images](https://docs.redhat.com/en/documentation/red_hat_openshift_ai_self-managed/2.24/html/managing_openshift_ai/creating-custom-workbench-images). | |
| A sample `Containerfile` and instructions to create a custom workbench image are in [custom-workbench-image](custom-workbench-image). | |
| ### Scripts | |
| Curated scripts related to data processing live in this directory. | |
| For example, the [subset selection](scripts/subset_selection) scripts allow users to identify representative samples from large training datasets. |
🧰 Tools
🪛 LanguageTool
[grammar] ~38-~38: Use a hyphen to join words.
Context: ...pelines. ### Notebooks Data processing related jupyter notebooks a...
(QB_NEW_EN_HYPHEN)
🤖 Prompt for AI Agents
In README.md around lines 36 to 51, tighten wording: change "jupyter notebooks"
to "Jupyter notebooks" and hyphenate the compound as "Data-processing-related
Jupyter notebooks"; add a comma after "For example,"; hyphenate
"subset-selection" and fix subject–verb agreement by changing "scripts allows"
to "scripts allow". Ensure these small edits preserve links and casing.
🔀 Merge Conflict Detected

This pull request has merge conflicts that must be resolved before it can be merged.
bd03127 → 0249d81
Actionable comments posted: 0
♻️ Duplicate comments (3)
kubeflow-pipelines/docling-standard/standard_components.py (1)
127-133: Reuse a shared device_map to avoid drift between components.

The `device_map` here is (per earlier review) identical to the one in the VLM components; keeping these literals duplicated makes it easy for the mappings to diverge over time. Extract this mapping once (for example into kubeflow-pipelines/common/device_mapping.py) and import it in both standard_components.py and vlm_components.py, then re-compile the pipelines so the compiled specs pick up the shared definition.

Please double-check against the current `docling.datamodel.accelerator_options.AcceleratorDevice` API to ensure the set of supported devices is still correct.

Also applies to: 168-168
kubeflow-pipelines/docling-standard/standard_convert_pipeline_compiled.yaml (2)
330-337: Add accelerator_device to the embedded component docstring.

The embedded `docling_convert_standard` function signature includes `accelerator_device`, but the Args section still omits it (same as in the source Python file and earlier review). Add a brief line to keep the compiled spec self-documenting, for example:

```diff
-        enrich_picture_description: Whether or not to enrich picture description.
+        enrich_picture_description: Whether or not to enrich picture description.
+        accelerator_device: Device to use for acceleration ('auto', 'cpu', 'cuda'/'gpu', 'mps').
```

Ideally, update the docstring in standard_components.py and re-compile so this stays in sync.

Also applies to: 338-353
388-399: Same device_map DRY concern applies to the compiled component body.

The `device_map` and `AcceleratorOptions(device=device_map[dev])` logic here mirrors standard_components.py (and the VLM components). Once you extract `device_map` into a shared module and import it in the source component, re-compiling will propagate that refactor into this embedded code as well, eliminating the copy-pasted mapping. No manual edits to this YAML should be needed after you refactor the Python source.

After refactoring the Python source to use a shared mapping, please re-compile and confirm the generated YAML no longer hard-codes a divergent `device_map`.

Also applies to: 411-411
🧹 Nitpick comments (2)
kubeflow-pipelines/docling-standard/standard_components.py (1)
33-33: Document accelerator_device and normalize input before lookup.

The new `accelerator_device` parameter is good, but it isn't mentioned in the docstring and whitespace isn't stripped before validation:

- Add an `Args:` entry documenting accepted values ("auto", "cpu", "cuda"/"gpu", "mps").
- Consider trimming whitespace before `.lower()` to be more forgiving of user input:

```diff
-    accelerator_device: str = "auto",
+    accelerator_device: str = "auto",
@@
-    dev = accelerator_device.lower()
+    dev = accelerator_device.strip().lower()
```

Also, re-compiling the pipeline after updating this docstring will automatically fix the embedded component docstring in the compiled YAML.

Also applies to: 134-138
kubeflow-pipelines/docling-standard/standard_convert_pipeline_compiled.yaml (1)
55-58: Clarify and (optionally) expose accelerator_device at the pipeline level.

The `accelerator_device` parameter is defined for `comp-docling-convert-standard` but:

- It has no `description`, so UIs won't explain allowable values.
- It isn't threaded from `root.inputDefinitions` through `comp-for-loop-1` into this component, so this pipeline can't override the default "auto".

If you want pipeline consumers to choose the device, consider:

- Adding a root parameter like `docling_accelerator_device` and documenting it in the top `# Inputs` section.
- Wiring it through `comp-for-loop-1.inputs.parameters` into `comp-docling-convert-standard.inputs.parameters.accelerator_device`.

If the intent is to keep this as an internal knob, at least add a short `description` here mirroring the Python docstring.

Please validate the compiled spec in your KFP environment to ensure `accelerator_device` shows up and is settable as you intend.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (7)
- kubeflow-pipelines/README.md (3 hunks)
- kubeflow-pipelines/docling-standard/README.md (1 hunks)
- kubeflow-pipelines/docling-standard/standard_components.py (3 hunks)
- kubeflow-pipelines/docling-standard/standard_convert_pipeline_compiled.yaml (4 hunks)
- kubeflow-pipelines/docling-vlm/README.md (1 hunks)
- kubeflow-pipelines/docling-vlm/vlm_components.py (3 hunks)
- kubeflow-pipelines/docling-vlm/vlm_convert_pipeline_compiled.yaml (4 hunks)
🚧 Files skipped from review as they are similar to previous changes (5)
- kubeflow-pipelines/docling-standard/README.md
- kubeflow-pipelines/docling-vlm/README.md
- kubeflow-pipelines/docling-vlm/vlm_components.py
- kubeflow-pipelines/README.md
- kubeflow-pipelines/docling-vlm/vlm_convert_pipeline_compiled.yaml
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
- GitHub Check: compile (docling-standard, kubeflow-pipelines/docling-standard, python standard_convert_pipeline....
- GitHub Check: compile (docling-vlm, kubeflow-pipelines/docling-vlm, python vlm_convert_pipeline.py, vlm_convert...
- GitHub Check: Summary
Hi Rishabh, I tried running this on a RHOAI cluster and did not see the configurable parameter for the accelerator device. How can the user adjust this option without it being exposed? Also, we have minimal test coverage for pipelines in our CI; that's something I want to get to sometime soon. To prepare for that work, I would like to understand your validation strategy.
0249d81 → a9d9cd4
Actionable comments posted: 4
♻️ Duplicate comments (1)
kubeflow-pipelines/docling-vlm/vlm_convert_pipeline_compiled.yaml (1)
256-266: Docstring Args section still missing accelerator_device documentation.

The function signature includes `accelerator_device: str = "auto"` (line 256), but the Args section (lines 258-266) does not document it. A past review flagged this, but the compiled YAML still shows the parameter undocumented.
🧹 Nitpick comments (10)
kubeflow-pipelines/common/components.py (1)
31-31: HTTP retry logic is solid; consider small robustness tweaks.

The new retry loop looks correct and is a good fit for flaky HTTP downloads:

- Importing `time` inside the function matches the existing pattern for `os`/`Path`.
- Per-file retry semantics are clear: up to `max_retries` attempts, `break` on first success, and a RuntimeError chained from the last RequestException on repeated failure.
- Using `requests.get(..., stream=True, timeout=30)` with `resp.raise_for_status()` and chunked writes is appropriate for potentially large PDFs.

Two minor, non-blocking refinements you might consider:

Clean up partial files on total failure: If an exception happens after the file has been opened/written (e.g., mid-stream), the last attempt may leave a truncated `dest` on disk before raising RuntimeError. You could defensively remove `dest` when `attempt == max_retries` in the except block, e.g.:

```python
if attempt == max_retries:
    try:
        dest.unlink(missing_ok=True)  # Python 3.8+: adjust if older
    except OSError:
        pass
    raise RuntimeError(
        f"Failed to download {url} after {max_retries} attempts"
    ) from exc
```

Not strictly necessary in a failing component, but it keeps the artifact directory cleaner.

Optional configurability of retry parameters: If you anticipate tuning for different environments, `max_retries`/`retry_delay_sec` could be module-level constants or read from env vars. For now, the hard-coded defaults are reasonable and readable.

Overall, the change improves resilience without altering the S3 path or public interface.
Also applies to: 129-131, 136-161
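A rough sketch of the env-var option; the variable names below are purely illustrative, not existing settings:

```python
import os

# Tunable retry knobs with safe defaults; the env var names are assumptions.
MAX_RETRIES = int(os.getenv("IMPORT_PDFS_MAX_RETRIES", "3"))
RETRY_DELAY_SEC = float(os.getenv("IMPORT_PDFS_RETRY_DELAY_SEC", "5"))
```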
kubeflow-pipelines/docling-standard/standard_components.py (1)
`33-56`: **Add `accelerator_device` to the docstring.**

The new `accelerator_device` parameter is missing from the docstring. Consider adding documentation for it.

```diff
     enrich_picture_classes: Whether or not to enrich picture classes.
     enrich_picture_description: Whether or not to enrich picture description.
+    accelerator_device: Device to use for acceleration (auto, cpu, cuda, gpu, mps).
     """
```

kubeflow-pipelines/docling-vlm/local_run.py (2)
`92-143`: **`validate_output` is defined but never called.**

This function performs useful output validation but is not invoked anywhere in the test flow. Either integrate it into `run_test_scenario` after successful runs, or remove it to avoid dead code.

To integrate validation, add a call in the success path of `run_test_scenario`:

```python
# After successful pipeline run, validate output
# output_dir would need to be determined from the local runner's output location
validation = validate_output(output_dir, test_name)
if not validation["passed"]:
    return {"test_name": test_name, "status": "FAIL", "reason": "Validation failed", ...}
```
`189-215`: **Remove unnecessary `else` after `return`.**

Per static analysis, the `else` block after `return` at line 204 is unnecessary.

```diff
         return {
             "test_name": test_name,
             "status": "PASS",
             "reason": "Failed as expected - validation working",
             "error": error_msg[:200],
             "elapsed_time": elapsed,
         }
-    else:
-        print(f"❌ TEST FAILED: {test_name} - Wrong error type")
-        print(f"Expected error containing: '{expected_error}'")
-        print(f"Got: {error_msg[:200]}")
-        return {
-            "test_name": test_name,
-            "status": "FAIL",
-            "reason": f"Wrong error type",
-            "error": error_msg,
-            "elapsed_time": elapsed,
-        }
+    print(f"❌ TEST FAILED: {test_name} - Wrong error type")
+    print(f"Expected error containing: '{expected_error}'")
+    print(f"Got: {error_msg[:200]}")
+    return {
+        "test_name": test_name,
+        "status": "FAIL",
+        "reason": f"Wrong error type",
+        "error": error_msg,
+        "elapsed_time": elapsed,
+    }
```

kubeflow-pipelines/docling-standard/standard_convert_pipeline_compiled.yaml (2)
`56-59`: **Add a description for the `accelerator_device` parameter.**

The `accelerator_device` parameter definition is missing a `description` field, unlike other parameters in this component (e.g., `allow_external_plugins`, `enrich_code`). This would improve discoverability in the pipeline UI.

```diff
     accelerator_device:
       defaultValue: auto
+      description: Device to use for acceleration ('auto', 'cpu', 'cuda', 'gpu', 'mps').
       isOptional: true
       parameterType: STRING
```
`725-728`: **Consider adding a description at the pipeline root level as well.**

For consistency with other `docling_*` pipeline parameters, a description would help users understand the parameter's purpose in the pipeline UI.

kubeflow-pipelines/docling-vlm/vlm_convert_pipeline_compiled.yaml (1)
`47-50`: **Add a description for the `accelerator_device` parameter.**

Similar to the standard pipeline, this parameter lacks a description field. Adding one would improve discoverability:

```diff
     accelerator_device:
       defaultValue: auto
+      description: Device to use for acceleration ('auto', 'cpu', 'cuda', 'gpu', 'mps').
       isOptional: true
       parameterType: STRING
```

kubeflow-pipelines/docling-standard/local_run.py (3)
`147-205`: **`validate_output` function is defined but never called.**

This function validates output quality/format but is not used in the test harness. Consider either:
- Integrating it into successful test scenarios in `run_test_scenario`
- Removing it if not needed

If intended for future use, adding a TODO comment would clarify intent.

Would you like me to generate code that integrates this validation into the test flow?
`256-278`: **Consider removing unnecessary `else` after `return` (optional).**

Per static analysis hint R1705, the `else` block at line 268 is unnecessary since line 267 returns. This is a minor style improvement:

```diff
         return {
             "test_name": test_name,
             "status": "PASS",
             "reason": "Failed as expected - validation working",
             "error": error_msg[:200],
             "elapsed_time": elapsed,
         }
-    else:
-        print(f"❌ TEST FAILED: {test_name} - Wrong error type")
-        print(f"Expected error containing: '{expected_error}'")
-        print(f"Got: {error_msg[:200]}")
-        return {
-            "test_name": test_name,
-            "status": "FAIL",
-            "reason": f"Wrong error type",
-            "error": error_msg,
-            "elapsed_time": elapsed,
-        }
+    print(f"❌ TEST FAILED: {test_name} - Wrong error type")
+    print(f"Expected error containing: '{expected_error}'")
+    print(f"Got: {error_msg[:200]}")
+    return {
+        "test_name": test_name,
+        "status": "FAIL",
+        "reason": f"Wrong error type",
+        "error": error_msg,
+        "elapsed_time": elapsed,
+    }
```
`322-322`: **`PASS_WITH_WARNINGS` status is handled but never produced.**

The summary logic counts and displays the `PASS_WITH_WARNINGS` status (lines 322, 328, 336-337), but no code path in `run_test_scenario` ever returns this status. Either implement warning scenarios or remove the dead handling code.

Also applies to: 336-337
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (13)
- `.gitignore` (1 hunks)
- `kubeflow-pipelines/README.md` (3 hunks)
- `kubeflow-pipelines/common/components.py` (2 hunks)
- `kubeflow-pipelines/docling-standard/README.md` (1 hunks)
- `kubeflow-pipelines/docling-standard/local_run.py` (3 hunks)
- `kubeflow-pipelines/docling-standard/standard_components.py` (3 hunks)
- `kubeflow-pipelines/docling-standard/standard_convert_pipeline.py` (2 hunks)
- `kubeflow-pipelines/docling-standard/standard_convert_pipeline_compiled.yaml` (9 hunks)
- `kubeflow-pipelines/docling-vlm/README.md` (1 hunks)
- `kubeflow-pipelines/docling-vlm/local_run.py` (4 hunks)
- `kubeflow-pipelines/docling-vlm/vlm_components.py` (3 hunks)
- `kubeflow-pipelines/docling-vlm/vlm_convert_pipeline.py` (2 hunks)
- `kubeflow-pipelines/docling-vlm/vlm_convert_pipeline_compiled.yaml` (9 hunks)
✅ Files skipped from review due to trivial changes (1)
- .gitignore
🚧 Files skipped from review as they are similar to previous changes (2)
- kubeflow-pipelines/docling-vlm/vlm_components.py
- kubeflow-pipelines/README.md
🧰 Additional context used
🧬 Code graph analysis (1)
kubeflow-pipelines/docling-standard/local_run.py (1)
kubeflow-pipelines/docling-vlm/local_run.py (4)
`convert_pipeline_test` (58-89), `validate_output` (92-143), `run_test_scenario` (146-223), `main` (226-274)
🪛 GitHub Actions: Python Formatting Validation
kubeflow-pipelines/docling-vlm/local_run.py
[error] 1-1: Ruff format check would reformat this file. Run 'ruff format' to fix code style issues.
kubeflow-pipelines/docling-standard/local_run.py
[error] 1-1: Ruff format check would reformat this file. Run 'ruff format' to fix code style issues.
🪛 Pylint (4.0.4)
kubeflow-pipelines/docling-vlm/local_run.py
[refactor] 193-215: Unnecessary "else" after "return", remove the "else" and de-indent the code inside it
(R1705)
kubeflow-pipelines/docling-standard/local_run.py
[refactor] 105-105: Too many positional arguments (8/5)
(R0917)
[refactor] 256-278: Unnecessary "else" after "return", remove the "else" and de-indent the code inside it
(R1705)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
- GitHub Check: compile (docling-standard, kubeflow-pipelines/docling-standard, python standard_convert_pipeline....
- GitHub Check: compile (docling-vlm, kubeflow-pipelines/docling-vlm, python vlm_convert_pipeline.py, vlm_convert...
- GitHub Check: Summary
🔇 Additional comments (9)
kubeflow-pipelines/docling-vlm/README.md (1)
`13-13`: **LGTM!**

The documentation accurately describes the new `docling_accelerator_device` parameter with correct device options and a default value that aligns with the implementation.

kubeflow-pipelines/docling-standard/README.md (1)

`13-13`: **LGTM!**

Documentation correctly describes the accelerator device configuration options, consistent with the VLM pipeline README and the implementation.

kubeflow-pipelines/docling-standard/standard_convert_pipeline.py (1)

`37-37`: **LGTM!**

The `docling_accelerator_device` parameter is properly added to the pipeline signature with a sensible default and correctly wired to the component call.

Also applies to: 85-85

kubeflow-pipelines/docling-vlm/vlm_convert_pipeline.py (1)

`28-28`: **LGTM!**

The accelerator device parameter is consistently implemented in the VLM pipeline, matching the pattern used in the standard pipeline.

Also applies to: 69-69

kubeflow-pipelines/docling-standard/standard_components.py (2)

`127-138`: **Validation logic is sound.**

The device mapping and validation are correctly implemented with case-insensitive matching, a helpful "gpu" alias for "cuda", and a clear error message listing valid options.

Note: The `device_map` duplication with `vlm_components.py` was already flagged in a previous review for extraction to a shared module.

`167-169`: **LGTM!**

The `AcceleratorOptions` correctly uses the validated device from the map.

kubeflow-pipelines/docling-vlm/local_run.py (1)

`18-54`: **Good test coverage for accelerator device validation.**

The test configurations cover baseline CPU, auto-detection, and failure scenarios for invalid device and image mode inputs. The commented-out GPU/MPS tests are a reasonable approach for environment-specific testing.

kubeflow-pipelines/docling-vlm/vlm_convert_pipeline_compiled.yaml (1)

`285-295`: **LGTM on device validation logic.**

The `device_map` validation is correctly implemented and consistent with the standard pipeline. The case-insensitive comparison via `.lower()` and the clear error message with allowed values are good practices.

kubeflow-pipelines/docling-standard/local_run.py (1)

`17-61`: **Good test coverage for accelerator device configurations.**

`TEST_CONFIGS` provides a reasonable baseline for testing CPU and auto device modes, along with variations in backends and table modes. The commented-out GPU/MPS tests are appropriate for environments without those accelerators.
```python
import sys
from pathlib import Path
from typing import List
import json
import time
```
**Run `ruff format` to fix formatting issues.**

The pipeline is failing due to formatting violations. Run `ruff format kubeflow-pipelines/docling-standard/local_run.py` to resolve.
🧰 Tools
🪛 GitHub Actions: Python Formatting Validation
[error] 1-1: Ruff format check would reformat this file. Run 'ruff format' to fix code style issues.
🤖 Prompt for AI Agents
In kubeflow-pipelines/docling-standard/local_run.py around lines 1 to 5, the
file has formatting violations; run the automatic formatter to fix them by
executing `ruff format kubeflow-pipelines/docling-standard/local_run.py` (or run
`ruff format` for the repo) and then stage the changes—this will reformat
imports and whitespace to satisfy the linter.
Force-pushed from a9d9cd4 to 17924e1 (Compare)
Actionable comments posted: 0
♻️ Duplicate comments (1)
kubeflow-pipelines/docling-vlm/vlm_components.py (1)
`75-86`: **Address the `device_map` duplication.**

The `device_map` dictionary is duplicated in both `vlm_components.py` and `standard_components.py`, violating DRY principles and creating a maintenance burden if the mapping needs updating.

Consider extracting this to a shared module (e.g., `common/device_mapping.py`) that both components can import.
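A minimal sketch of such a module; the module path and function name are hypothetical, and the docling import path may vary by version:

```python
# common/device_mapping.py (hypothetical): single source of truth for aliases.
from docling.datamodel.pipeline_options import AcceleratorDevice  # path may differ

_DEVICE_MAP = {
    "auto": AcceleratorDevice.AUTO,
    "cpu": AcceleratorDevice.CPU,
    "gpu": AcceleratorDevice.CUDA,
}


def resolve_accelerator_device(name: str) -> AcceleratorDevice:
    """Map a user-facing device string to a Docling AcceleratorDevice."""
    dev = name.strip().lower()
    if dev not in _DEVICE_MAP:
        raise ValueError(
            f"Invalid accelerator_device: {name}. Must be one of {sorted(_DEVICE_MAP)}"
        )
    return _DEVICE_MAP[dev]
```

One caveat: KFP components are compiled as self-contained functions, so a shared helper would need to live in the base image or be inlined at build time.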
🧹 Nitpick comments (3)
kubeflow-pipelines/docling-vlm/local_run.py (1)
`151-225`: **LGTM: Test scenario execution logic is sound.**

The `run_test_scenario` function correctly:
- Filters metadata keys before passing config to the pipeline
- Handles both expected success and expected failure cases
- Provides detailed error reporting with truncated messages

Optionally, the `else` at line 207 could be removed per Pylint R1705, but the current structure is clear and readable.

kubeflow-pipelines/docling-standard/local_run.py (1)
`215-292`: **LGTM: Test execution logic handles all scenarios correctly.**

The `run_test_scenario` function properly filters metadata keys and handles both expected successes and failures with detailed error matching.

Optionally, the `else` at line 274 could be removed per Pylint R1705, but readability is fine as-is.

kubeflow-pipelines/common/components.py (1)
`129-161`: **Retry loop is solid; consider minor tweaks for configurability and error handling.**

The bounded retry loop around the HTTP download is well structured and should significantly reduce flakiness from transient network issues. A couple of optional refinements you might consider:
- Make `max_retries` and `retry_delay_sec` configurable (e.g., via function args or env vars) so operators can tune behavior for slow or unreliable networks without code changes.
- Differentiate between retriable and non-retriable HTTP errors (e.g., avoid retrying clear client errors like 4xx, while still retrying timeouts and 5xx), to avoid unnecessary repeated attempts for permanent failures.
- If HTTP-based imports tend to be flaky, you might also want to mirror a similar retry mechanism on the S3 path for consistency.
These are purely optional; the current implementation is correct and already a good improvement in robustness.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (13)
- `.gitignore` (1 hunks)
- `kubeflow-pipelines/README.md` (3 hunks)
- `kubeflow-pipelines/common/components.py` (2 hunks)
- `kubeflow-pipelines/docling-standard/README.md` (1 hunks)
- `kubeflow-pipelines/docling-standard/local_run.py` (3 hunks)
- `kubeflow-pipelines/docling-standard/standard_components.py` (3 hunks)
- `kubeflow-pipelines/docling-standard/standard_convert_pipeline.py` (2 hunks)
- `kubeflow-pipelines/docling-standard/standard_convert_pipeline_compiled.yaml` (9 hunks)
- `kubeflow-pipelines/docling-vlm/README.md` (1 hunks)
- `kubeflow-pipelines/docling-vlm/local_run.py` (4 hunks)
- `kubeflow-pipelines/docling-vlm/vlm_components.py` (3 hunks)
- `kubeflow-pipelines/docling-vlm/vlm_convert_pipeline.py` (2 hunks)
- `kubeflow-pipelines/docling-vlm/vlm_convert_pipeline_compiled.yaml` (9 hunks)
🚧 Files skipped from review as they are similar to previous changes (5)
- .gitignore
- kubeflow-pipelines/docling-standard/README.md
- kubeflow-pipelines/docling-standard/standard_components.py
- kubeflow-pipelines/docling-vlm/vlm_convert_pipeline.py
- kubeflow-pipelines/README.md
🧰 Additional context used
🧬 Code graph analysis (2)
kubeflow-pipelines/docling-vlm/local_run.py (1)
kubeflow-pipelines/docling-standard/local_run.py (4)
`convert_pipeline_test` (106-145), `validate_output` (148-212), `run_test_scenario` (215-292), `main` (295-361)
kubeflow-pipelines/docling-standard/local_run.py (1)
kubeflow-pipelines/docling-vlm/local_run.py (4)
`convert_pipeline_test` (59-90), `validate_output` (93-148), `run_test_scenario` (151-225), `main` (228-278)
🪛 Pylint (4.0.4)
kubeflow-pipelines/docling-vlm/local_run.py
[refactor] 195-217: Unnecessary "else" after "return", remove the "else" and de-indent the code inside it
(R1705)
kubeflow-pipelines/docling-standard/local_run.py
[refactor] 106-106: Too many positional arguments (8/5)
(R0917)
[refactor] 262-284: Unnecessary "else" after "return", remove the "else" and de-indent the code inside it
(R1705)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
- GitHub Check: compile (docling-vlm, kubeflow-pipelines/docling-vlm, python vlm_convert_pipeline.py, vlm_convert...
- GitHub Check: compile (docling-standard, kubeflow-pipelines/docling-standard, python standard_convert_pipeline....
- GitHub Check: Summary
🔇 Additional comments (19)
kubeflow-pipelines/docling-vlm/vlm_components.py (2)
`25-25`: **LGTM: Parameter addition is clean.**

The `accelerator_device` parameter with default value "auto" follows the existing parameter patterns and integrates well with the function signature.

`178-180`: **LGTM: Dynamic device selection implemented correctly.**

The `AcceleratorOptions` now uses the validated device from `device_map[dev]`, enabling runtime device configuration as intended.

kubeflow-pipelines/docling-vlm/vlm_convert_pipeline_compiled.yaml (2)

`47-50`: **LGTM: Component input definition is correct.**

The `accelerator_device` parameter is properly defined in the component's `inputDefinitions` with an appropriate default value and type.

`119-120`: **LGTM: Pipeline wiring is complete and correct.**

The `docling_accelerator_device` parameter is correctly wired through all pipeline levels:
- Root pipeline input (lines 619-622)
- For-loop task input (lines 584-585)
- Pipeline channel declaration (lines 151-152)
- Component parameter binding (lines 119-120)
Also applies to: 151-152, 584-585, 619-622
kubeflow-pipelines/docling-vlm/local_run.py (5)
`19-41`: **LGTM: Test configurations are well-structured.**

`TEST_CONFIGS` provides good coverage with baseline CPU and auto-detect scenarios. GPU and MPS tests are appropriately commented out for environments without specialized hardware.

`43-55`: **LGTM: Failure scenarios provide good negative test coverage.**

`FAILURE_SCENARIOS` appropriately tests validation logic for invalid device and image mode inputs.

`59-90`: **LGTM: Pipeline test function correctly forwards parameters.**

The `convert_pipeline_test` function properly accepts and forwards all docling configuration parameters, including the new `accelerator_device`, to the VLM converter.

`93-148`: **LGTM: Output validation is comprehensive.**

The `validate_output` function thoroughly checks both JSON and Markdown outputs with appropriate error handling and structured results.

`228-279`: **LGTM: Test runner orchestration is well-implemented.**

The `main` function:
- Properly initializes the Docker runner
- Executes both normal and failure test phases
- Uses `config.get("should_fail", True)` (line 251), which correctly avoids mutating the `FAILURE_SCENARIOS` dictionary
- Provides clear test summary output
kubeflow-pipelines/docling-vlm/README.md (1)
`13-13`: **LGTM: Documentation clearly describes the new parameter.**

The `docling_accelerator_device` parameter is well-documented, with all valid values and the default clearly stated.

kubeflow-pipelines/docling-standard/standard_convert_pipeline.py (1)

`37-37`: **LGTM: Accelerator device parameter correctly added and forwarded.**

The `docling_accelerator_device` parameter is added to the pipeline signature (line 37) and properly forwarded to `docling_convert_standard` (line 85).

Also applies to: 85-85
kubeflow-pipelines/docling-standard/standard_convert_pipeline_compiled.yaml (2)
`56-59`: **LGTM: Pipeline wiring is complete and consistent.**

The `docling_accelerator_device` parameter is correctly wired through all pipeline levels, following the same pattern as the VLM pipeline.

Also applies to: 168-169, 215-216, 672-673, 725-728
`393-398`: **LGTM: Device mapping and validation logic is correctly embedded.**

The compiled YAML includes the complete `device_map` validation logic (lines 393-398) and proper usage in `AcceleratorOptions` (line 416).
Also applies to: 416-416
kubeflow-pipelines/docling-standard/local_run.py (5)
`18-62`: **LGTM: Comprehensive test configurations.**

`TEST_CONFIGS` provides excellent coverage with scenarios testing different backends (pypdfium2), table modes (fast), and device options. GPU/MPS tests are appropriately commented out.

`64-102`: **LGTM: Thorough failure scenario coverage.**

`FAILURE_SCENARIOS` comprehensively tests validation for all major parameters: device, backend, table mode, image mode, and OCR engine.

`106-145`: **LGTM: Pipeline test function with comprehensive parameter support.**

The `convert_pipeline_test` function correctly accepts and forwards all standard pipeline parameters, including accelerator_device, backend, table mode, OCR settings, etc.

`148-212`: **LGTM: Robust output validation with pass/fail logic.**

The `validate_output` function includes comprehensive checks and aggregates them into an overall pass/fail result (lines 204-210), which is an improvement over the VLM version.

`295-361`: **LGTM: Well-structured test runner with clear reporting.**

The `main` function:
- Correctly uses `config.get("should_fail", True)` (line 318) to avoid mutation
- Provides emoji-rich output for better readability
- Includes detailed error preview with truncation (lines 351-356)
- Exits with an appropriate status code
kubeflow-pipelines/common/components.py (1)
`31-31`: **Local `time` import for retry delay is appropriate.**

Using a local `time` import here is consistent with the other stdlib imports in this component and keeps the dependency surface minimal at module import time.
Force-pushed from 17924e1 to 384f15f (Compare)
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
kubeflow-pipelines/docling-vlm/vlm_components.py (1)
`25-40`: **Add `accelerator_device` to the docstring Args section.**

The new `accelerator_device` parameter is missing from the docstring. This omission makes the parameter's purpose and valid values unclear to users.

```diff
     remote_model_secret_mount_path: Path to the remote model secret mount path.
+    accelerator_device: Device for acceleration ('auto', 'cpu', 'cuda', 'gpu', 'mps').
     """
```

kubeflow-pipelines/docling-standard/standard_components.py (1)
`33-56`: **Add `accelerator_device` to the docstring Args section.**

The new `accelerator_device` parameter is missing from the Args documentation.

```diff
     enrich_picture_description: Whether or not to enrich picture description.
+    accelerator_device: Device for acceleration ('auto', 'cpu', 'cuda', 'gpu', 'mps').
     """
```
♻️ Duplicate comments (3)
kubeflow-pipelines/docling-vlm/vlm_components.py (1)
`75-86`: **Device mapping and validation logic is correct.**

The implementation properly handles case-insensitive input, includes a helpful "gpu" alias for "cuda", and provides clear error messages. The `device_map` duplication with `standard_components.py` was previously flagged and can be addressed as a follow-up refactor.

kubeflow-pipelines/docling-standard/standard_components.py (1)
`127-138`: **Device mapping and validation is correct; DRY refactor noted.**

The implementation correctly handles case-insensitive device matching and validation. The duplication with `vlm_components.py` was already flagged in a previous review for extraction to a shared module.

kubeflow-pipelines/docling-standard/local_run.py (1)
`339-358`: **Status emoji logic handles non-existent `PASS_WITH_WARNINGS` status.**

Lines 342-343 handle "PASS_WITH_WARNINGS", which is never produced. This is consistent with the counting issue noted above.
🧹 Nitpick comments (4)
kubeflow-pipelines/docling-vlm/local_run.py (2)
`93-148`: **`validate_output` function is defined but never called.**

The `validate_output` function validates output artifacts but is not invoked anywhere in the test harness. If output validation is intended, call it after successful pipeline runs in `run_test_scenario`.

```diff
     print(f"✅ TEST PASSED: {test_name} - Completed in {elapsed:.2f}s")
+    # Optionally validate outputs here:
+    # validation = validate_output(output_dir, test_name)
     return {
         "test_name": test_name,
         "status": "PASS",
         "elapsed_time": elapsed,
     }
```
`191-217`: **Consider removing unnecessary `else` after `return`.**

Per static analysis, the `else` block at line 207 can be de-indented since the preceding `if` block returns.

```diff
     if "FAILURE" in error_msg or expected_error.lower() in error_msg.lower():
         print(f"✅ TEST PASSED: {test_name} - Failed as expected")
         print(f"Pipeline correctly rejected invalid parameter")
         if expected_error:
             print(f"Expected error type: '{expected_error}'")
         return {
             "test_name": test_name,
             "status": "PASS",
             "reason": "Failed as expected - validation working",
             "error": error_msg[:200],
             "elapsed_time": elapsed,
         }
-    else:
-        print(f"❌ TEST FAILED: {test_name} - Wrong error type")
-        print(f"Expected error containing: '{expected_error}'")
-        print(f"Got: {error_msg[:200]}")
-        return {
-            "test_name": test_name,
-            "status": "FAIL",
-            "reason": f"Wrong error type",
-            "error": error_msg,
-            "elapsed_time": elapsed,
-        }
+    print(f"❌ TEST FAILED: {test_name} - Wrong error type")
+    print(f"Expected error containing: '{expected_error}'")
+    print(f"Got: {error_msg[:200]}")
+    return {
+        "test_name": test_name,
+        "status": "FAIL",
+        "reason": f"Wrong error type",
+        "error": error_msg,
+        "elapsed_time": elapsed,
+    }
```
elseblock at line 207 can be de-indented since the precedingifblock returns.if "FAILURE" in error_msg or expected_error.lower() in error_msg.lower(): print(f"✅ TEST PASSED: {test_name} - Failed as expected") print(f"Pipeline correctly rejected invalid parameter") if expected_error: print(f"Expected error type: '{expected_error}'") return { "test_name": test_name, "status": "PASS", "reason": "Failed as expected - validation working", "error": error_msg[:200], "elapsed_time": elapsed, } - else: - print(f"❌ TEST FAILED: {test_name} - Wrong error type") - print(f"Expected error containing: '{expected_error}'") - print(f"Got: {error_msg[:200]}") - return { - "test_name": test_name, - "status": "FAIL", - "reason": f"Wrong error type", - "error": error_msg, - "elapsed_time": elapsed, - } + print(f"❌ TEST FAILED: {test_name} - Wrong error type") + print(f"Expected error containing: '{expected_error}'") + print(f"Got: {error_msg[:200]}") + return { + "test_name": test_name, + "status": "FAIL", + "reason": f"Wrong error type", + "error": error_msg, + "elapsed_time": elapsed, + }kubeflow-pipelines/docling-standard/local_run.py (2)
`148-212`: **`validate_output` function is defined but never called.**

Similar to the VLM test harness, this function is not invoked. Consider calling it after successful pipeline runs to validate output quality.
`262-284`: **Consider removing unnecessary `else` after `return`.**

Per static analysis, the `else` block can be de-indented since the preceding `if` block returns.

```diff
     if "FAILURE" in error_msg or expected_error.lower() in error_msg.lower():
         ...
         return {
             ...
         }
-    else:
-        print(f"❌ TEST FAILED: {test_name} - Wrong error type")
-        ...
-        return {
-            ...
-        }
+    print(f"❌ TEST FAILED: {test_name} - Wrong error type")
+    ...
+    return {
+        ...
+    }
```
elseblock can be de-indented since the precedingifblock returns.if "FAILURE" in error_msg or expected_error.lower() in error_msg.lower(): ... return { ... } - else: - print(f"❌ TEST FAILED: {test_name} - Wrong error type") - ... - return { - ... - } + print(f"❌ TEST FAILED: {test_name} - Wrong error type") + ... + return { + ... + }
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (13)
- `.gitignore` (1 hunks)
- `kubeflow-pipelines/README.md` (3 hunks)
- `kubeflow-pipelines/common/components.py` (2 hunks)
- `kubeflow-pipelines/docling-standard/README.md` (1 hunks)
- `kubeflow-pipelines/docling-standard/local_run.py` (3 hunks)
- `kubeflow-pipelines/docling-standard/standard_components.py` (3 hunks)
- `kubeflow-pipelines/docling-standard/standard_convert_pipeline.py` (2 hunks)
- `kubeflow-pipelines/docling-standard/standard_convert_pipeline_compiled.yaml` (11 hunks)
- `kubeflow-pipelines/docling-vlm/README.md` (1 hunks)
- `kubeflow-pipelines/docling-vlm/local_run.py` (4 hunks)
- `kubeflow-pipelines/docling-vlm/vlm_components.py` (3 hunks)
- `kubeflow-pipelines/docling-vlm/vlm_convert_pipeline.py` (2 hunks)
- `kubeflow-pipelines/docling-vlm/vlm_convert_pipeline_compiled.yaml` (11 hunks)
🚧 Files skipped from review as they are similar to previous changes (7)
- kubeflow-pipelines/docling-standard/standard_convert_pipeline.py
- .gitignore
- kubeflow-pipelines/common/components.py
- kubeflow-pipelines/README.md
- kubeflow-pipelines/docling-standard/README.md
- kubeflow-pipelines/docling-vlm/README.md
- kubeflow-pipelines/docling-vlm/vlm_convert_pipeline.py
🧰 Additional context used
🧬 Code graph analysis (1)
kubeflow-pipelines/docling-standard/local_run.py (1)
kubeflow-pipelines/docling-vlm/local_run.py (4)
`convert_pipeline_test` (59-90), `validate_output` (93-148), `run_test_scenario` (151-225), `main` (228-278)
🪛 Pylint (4.0.4)
kubeflow-pipelines/docling-standard/local_run.py
[refactor] 106-106: Too many positional arguments (8/5)
(R0917)
[refactor] 262-284: Unnecessary "else" after "return", remove the "else" and de-indent the code inside it
(R1705)
kubeflow-pipelines/docling-vlm/local_run.py
[refactor] 195-217: Unnecessary "else" after "return", remove the "else" and de-indent the code inside it
(R1705)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
- GitHub Check: compile (docling-standard, kubeflow-pipelines/docling-standard, python standard_convert_pipeline....
- GitHub Check: compile (docling-vlm, kubeflow-pipelines/docling-vlm, python vlm_convert_pipeline.py, vlm_convert...
- GitHub Check: Summary
🔇 Additional comments (10)
kubeflow-pipelines/docling-vlm/vlm_components.py (1)
`178-180`: **LGTM!**

The accelerator device is correctly passed to `AcceleratorOptions`.

kubeflow-pipelines/docling-standard/standard_components.py (1)

`167-169`: **LGTM!**

The accelerator device is correctly passed to `AcceleratorOptions`.

kubeflow-pipelines/docling-standard/standard_convert_pipeline_compiled.yaml (3)

`56-59`: **LGTM!**

The `accelerator_device` parameter is correctly defined with default value "auto", marked as optional, and typed as STRING.

`168-169`: **Pipeline wiring is correctly configured.**

The `accelerator_device` parameter is properly wired from the pipeline channel through the for-loop task to the component.

Also applies to: 215-216

`739-742`: **Root input definition is correct.**

The `docling_accelerator_device` parameter is properly exposed at the root level with appropriate defaults, making it configurable at runtime.

kubeflow-pipelines/docling-vlm/local_run.py (1)

`228-279`: **Test harness looks well-structured.**

The `main()` function properly initializes the Docker runner, runs both normal and failure scenarios, and provides clear summary output. The mutation issue with `config.pop()` was addressed in a previous commit.

kubeflow-pipelines/docling-vlm/vlm_convert_pipeline_compiled.yaml (2)

`47-50`: **LGTM!**

The `accelerator_device` parameter is correctly defined at the component level and properly wired through the for-loop task via pipeline channels.

Also applies to: 119-120, 151-152

`598-599`: **Root-level wiring is correctly configured.**

The `docling_accelerator_device` parameter is properly exposed at the root level and wired to the for-loop, making it configurable at runtime. This addresses the previous review comment about the missing wiring.

Also applies to: 633-636

kubeflow-pipelines/docling-standard/local_run.py (2)

`105-145`: **Pipeline test function is well-structured.**

The `convert_pipeline_test` function correctly wires all the new parameters through to the component. The "too many positional arguments" static analysis hint is acceptable here, as this is a pipeline definition with many configuration options.

`317-320`: **Mutation issue correctly fixed.**

The code uses `config.get("should_fail", True)` instead of `config.pop()`, preserving the original `FAILURE_SCENARIOS` dictionary across runs.
```python
    passed = sum(1 for r in results if r["status"] == "PASS")
    warnings = sum(1 for r in results if r["status"] == "PASS_WITH_WARNINGS")
    failed = sum(1 for r in results if r["status"] == "FAIL")
```
**`PASS_WITH_WARNINGS` status is counted but never produced.**

The code counts results with status "PASS_WITH_WARNINGS" (line 328), but `run_test_scenario` only returns "PASS" or "FAIL". Either remove the warnings counting or add logic to produce this status.

```diff
     passed = sum(1 for r in results if r["status"] == "PASS")
-    warnings = sum(1 for r in results if r["status"] == "PASS_WITH_WARNINGS")
     failed = sum(1 for r in results if r["status"] == "FAIL")
     ...
     print(f"✅ Passed: {passed}")
-    print(f"⚠️ Warnings: {warnings}")
     print(f"❌ Failed: {failed}")
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```python
    passed = sum(1 for r in results if r["status"] == "PASS")
    failed = sum(1 for r in results if r["status"] == "FAIL")
```
🤖 Prompt for AI Agents
In kubeflow-pipelines/docling-standard/local_run.py around lines 327-329, the
code sums statuses for "PASS_WITH_WARNINGS" even though run_test_scenario only
returns "PASS" or "FAIL"; remove the unused warnings counting or change
run_test_scenario to return "PASS_WITH_WARNINGS" where appropriate. To fix,
either delete the warnings variable and any downstream usage of it (and adjust
reporting to only use passed/failed), or update run_test_scenario to detect
warning conditions and return "PASS_WITH_WARNINGS" consistently so the counting
is meaningful.
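If the status is kept, a rough sketch of how `run_test_scenario` could produce it by reusing `validate_output`; the `output_dir` plumbing here is an assumption, not existing code:

```python
from pathlib import Path


def summarize_run(test_name: str, output_dir: Path) -> dict:
    """Hypothetical helper: fold validate_output() checks into a test status."""
    validation = validate_output(output_dir, test_name)  # defined in local_run.py
    checks = validation["checks"]
    if validation["passed"]:
        status = "PASS"
    elif checks.get("json_exists") or checks.get("md_exists"):
        # Partial output was produced: warn instead of hard-failing
        status = "PASS_WITH_WARNINGS"
    else:
        status = "FAIL"
    return {"test_name": test_name, "status": status, "checks": checks}
```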
- Add accelerator_device parameter to docling_convert_standard and docling_convert_vlm components
- Update README documentation
- Add validation and user-friendly error messages
- Recompile pipeline YAML files
- Add tests and validation to local_runner; update components with retry logic
- Remove repeated runner initialization
- Remove mps and cuda from accelerator devices

Signed-off-by: roburishabh <[email protected]>
Force-pushed from 384f15f to ca67afc (Compare)
Actionable comments posted: 4
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
kubeflow-pipelines/docling-standard/standard_components.py (1)
`35-56`: **Document `accelerator_device` in the docstring Args.**

Add an Args entry for `accelerator_device` so the component is self-describing.
♻️ Duplicate comments (4)
kubeflow-pipelines/docling-standard/standard_components.py (1)
`127-136`: **(Dup) Consider extracting `device_map` into a shared module to avoid drift.**

Same DRY concern as previously noted (standard vs VLM components).
kubeflow-pipelines/docling-vlm/vlm_convert_pipeline_compiled.yaml (1)
`256-266`: **(Dup) Embedded component docstring still doesn't document `accelerator_device`.**

Add `accelerator_device` to the Args list so the generated YAML remains self-describing.

kubeflow-pipelines/docling-standard/standard_convert_pipeline_compiled.yaml (1)

`343-358`: **(Dup) Embedded component docstring still doesn't document `accelerator_device`.**

Keep the generated YAML self-describing by adding `accelerator_device` to Args.

kubeflow-pipelines/docling-standard/local_run.py (1)

`320-337`: **(Dup) `PASS_WITH_WARNINGS` is counted but never produced.**

Either remove the warnings counting/printing, or return that status when validation is partially successful.
🧹 Nitpick comments (5)
kubeflow-pipelines/common/components.py (2)
`129-130`: **Consider exponential backoff for retry delays.**

The fixed 5-second delay works, but exponential backoff (e.g., 2^attempt seconds) is more resilient for network failures and reduces load on struggling servers.

Apply this diff for exponential backoff:

```diff
-    max_retries = 3
-    retry_delay_sec = 5
+    max_retries = 3
```

Then update the sleep call:

```diff
-                time.sleep(retry_delay_sec)
+                delay = min(5 * (2 ** (attempt - 1)), 30)  # Exponential backoff capped at 30s
+                time.sleep(delay)
```

And adjust the log message:

```diff
                 print(
                     f"import-test-pdfs: error downloading {url}: {exc}. "
-                    f"Retrying in {retry_delay_sec} seconds...",
+                    f"Retrying in {delay} seconds...",
                     flush=True,
                 )
```

`136-161`: **Consider discriminating between transient and permanent HTTP errors.**

The current implementation retries all `RequestException` types, including permanent failures like 404 Not Found or 403 Forbidden. While not harmful, this wastes time on errors that won't resolve with retries.

Apply this diff to skip retrying permanent errors:

```diff
             except requests.exceptions.RequestException as exc:  # type: ignore[attr-defined]
                 # Network / HTTP errors are the main reason this component can be flaky
+                # Don't retry permanent HTTP errors
+                if hasattr(exc, 'response') and exc.response is not None:
+                    if exc.response.status_code in (400, 401, 403, 404, 410):
+                        raise RuntimeError(
+                            f"Failed to download {url}: {exc} (non-retryable error)"
+                        ) from exc
                 if attempt == max_retries:
                     raise RuntimeError(
                         f"Failed to download {url} after {max_retries} attempts"
                     ) from exc
```

kubeflow-pipelines/docling-vlm/local_run.py (1)

`182-213`: **Drop the unnecessary `else` after `return` in error handling.**

Matches pylint R1705 and reduces indentation.

kubeflow-pipelines/docling-vlm/vlm_convert_pipeline_compiled.yaml (1)

`515-535`: **Make HTTP download retry safer for nested paths; consider backoff/jitter.**

If a filename contains `/`, `dest.open("wb")` can fail because parent dirs aren't created. Consider `dest.parent.mkdir(parents=True, exist_ok=True)` before writing; also consider exponential backoff with jitter to reduce thundering herd.

kubeflow-pipelines/docling-standard/standard_convert_pipeline_compiled.yaml (1)

`601-622`: **Make HTTP download retry safer for nested paths; consider backoff/jitter.**

Same as the VLM compiled YAML: create `dest.parent` before writing, and consider exponential backoff.
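Putting those two suggestions together, a standalone sketch of the download loop with parent-dir creation and jittered exponential backoff; the helper name and constants are illustrative, not the component's actual code:

```python
import random
import time
from pathlib import Path

import requests


def download_with_retry(url: str, dest: Path, max_retries: int = 3) -> None:
    dest.parent.mkdir(parents=True, exist_ok=True)  # handles nested filenames
    for attempt in range(1, max_retries + 1):
        try:
            resp = requests.get(url, stream=True, timeout=30)
            resp.raise_for_status()
            with dest.open("wb") as f:
                for chunk in resp.iter_content(chunk_size=8192):
                    f.write(chunk)
            return
        except requests.exceptions.RequestException as exc:
            if attempt == max_retries:
                raise RuntimeError(
                    f"Failed to download {url} after {max_retries} attempts"
                ) from exc
            # Exponential backoff capped at 30s, plus jitter against thundering herd
            time.sleep(min(5 * 2 ** (attempt - 1), 30) + random.uniform(0, 1))
```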
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (13)
- `.gitignore` (1 hunks)
- `kubeflow-pipelines/README.md` (3 hunks)
- `kubeflow-pipelines/common/components.py` (2 hunks)
- `kubeflow-pipelines/docling-standard/README.md` (1 hunks)
- `kubeflow-pipelines/docling-standard/local_run.py` (3 hunks)
- `kubeflow-pipelines/docling-standard/standard_components.py` (3 hunks)
- `kubeflow-pipelines/docling-standard/standard_convert_pipeline.py` (2 hunks)
- `kubeflow-pipelines/docling-standard/standard_convert_pipeline_compiled.yaml` (11 hunks)
- `kubeflow-pipelines/docling-vlm/README.md` (1 hunks)
- `kubeflow-pipelines/docling-vlm/local_run.py` (4 hunks)
- `kubeflow-pipelines/docling-vlm/vlm_components.py` (3 hunks)
- `kubeflow-pipelines/docling-vlm/vlm_convert_pipeline.py` (2 hunks)
- `kubeflow-pipelines/docling-vlm/vlm_convert_pipeline_compiled.yaml` (11 hunks)
🚧 Files skipped from review as they are similar to previous changes (6)
- .gitignore
- kubeflow-pipelines/docling-standard/standard_convert_pipeline.py
- kubeflow-pipelines/docling-standard/README.md
- kubeflow-pipelines/docling-vlm/vlm_convert_pipeline.py
- kubeflow-pipelines/docling-vlm/README.md
- kubeflow-pipelines/docling-vlm/vlm_components.py
🧰 Additional context used
🧬 Code graph analysis (2)
kubeflow-pipelines/docling-standard/local_run.py (1)
kubeflow-pipelines/docling-vlm/local_run.py (4)
`convert_pipeline_test` (54-85), `validate_output` (88-143), `run_test_scenario` (146-220), `main` (223-273)
kubeflow-pipelines/docling-vlm/local_run.py (1)
kubeflow-pipelines/docling-standard/local_run.py (4)
`convert_pipeline_test` (99-138), `validate_output` (141-205), `run_test_scenario` (208-285), `main` (288-354)
🪛 Pylint (4.0.4)
kubeflow-pipelines/docling-standard/local_run.py
[refactor] 99-99: Too many positional arguments (8/5)
(R0917)
[refactor] 255-277: Unnecessary "else" after "return", remove the "else" and de-indent the code inside it
(R1705)
kubeflow-pipelines/docling-vlm/local_run.py
[refactor] 190-212: Unnecessary "else" after "return", remove the "else" and de-indent the code inside it
(R1705)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
- GitHub Check: compile (docling-standard, kubeflow-pipelines/docling-standard, python standard_convert_pipeline....
- GitHub Check: compile (docling-vlm, kubeflow-pipelines/docling-vlm, python vlm_convert_pipeline.py, vlm_convert...
- GitHub Check: Summary
🔇 Additional comments (2)
kubeflow-pipelines/common/components.py (2)
`31-31`: **LGTM!**

The `time` import is necessary for the retry delay mechanism and follows the existing import pattern with the appropriate pylint disable comment.
`129-161`: **Verify that this change aligns with the PR scope.**

The retry logic added here appears unrelated to the stated PR objective of adding `accelerator_device` parameter support. While the retry mechanism is a useful improvement for download reliability, mixing unrelated changes in a PR can:
- Complicate reviews and testing
- Make it harder to track which change caused issues
- Delay merging if one change needs rework
Consider whether this retry logic should be in a separate PR focused on download reliability improvements.
```python
def validate_output(output_dir: Path, test_name: str) -> dict:
    """Validate the output quality and format"""
    validation_results = {
        "test_name": test_name,
        "passed": True,
        "checks": {},
    }

    # Check for JSON output
    json_files = list(output_dir.glob("*.json"))
    validation_results["checks"]["json_exists"] = len(json_files) > 0

    # Check for Markdown output
    md_files = list(output_dir.glob("*.md"))
    validation_results["checks"]["md_exists"] = len(md_files) > 0

    if json_files:
        json_path = json_files[0]
        try:
            with open(json_path) as f:
                data = json.load(f)

            # Required fields validation
            validation_results["checks"]["has_name"] = "name" in data
            validation_results["checks"]["has_pages"] = "pages" in data

            if "pages" in data:
                validation_results["checks"]["pages_count"] = len(data["pages"])
                validation_results["checks"]["has_content"] = len(data["pages"]) > 0

                # Check page structure
                if data["pages"]:
                    first_page = data["pages"][0]
                    validation_results["checks"]["page_has_number"] = (
                        "page_number" in first_page
                    )
                    validation_results["checks"]["page_has_size"] = "size" in first_page
        except Exception as e:
            validation_results["checks"]["json_parsing_error"] = str(e)
            validation_results["passed"] = False

    if md_files:
        md_path = md_files[0]
        try:
            content = md_path.read_text()
            validation_results["checks"]["md_length"] = len(content)
            validation_results["checks"]["md_not_empty"] = len(content) > 100
            validation_results["checks"]["md_has_headers"] = "#" in content
            validation_results["checks"]["md_no_error"] = not content.startswith(
                "Error"
            )
        except Exception as e:
            validation_results["checks"]["md_parsing_error"] = str(e)
            validation_results["passed"] = False

    # Overall pass/fail
    validation_results["passed"] = all(
        [
            validation_results["checks"].get("json_exists", False),
            validation_results["checks"].get("md_exists", False),
            validation_results["checks"].get("has_content", False),
        ]
    )

    return validation_results
```
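For context, an illustrative call showing what those checks yield; the directory path below is just a placeholder:

```python
from pathlib import Path

results = validate_output(Path("outputs/cpu_baseline"), "cpu_baseline")
print(results["passed"])   # overall gate: json_exists, md_exists, has_content
print(results["checks"])   # per-check details for debugging
```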
🧩 Analysis chain
🏁 Script executed:
# First, let's get the full file to understand the context
wc -l kubeflow-pipelines/docling-standard/local_run.pyRepository: opendatahub-io/data-processing
Length of output: 204
🏁 Script executed:
# Search for all usages of validate_output function
rg "validate_output" kubeflow-pipelines/docling-standard/local_run.pyRepository: opendatahub-io/data-processing
Length of output: 136
🏁 Script executed:
# Read the run_test_scenario function (lines 208-246 area)
sed -n '208,250p' kubeflow-pipelines/docling-standard/local_run.pyRepository: opendatahub-io/data-processing
Length of output: 1399
🏁 Script executed:
# Read the main function area (lines 320-337)
sed -n '310,340p' kubeflow-pipelines/docling-standard/local_run.pyRepository: opendatahub-io/data-processing
Length of output: 1135
**Test harness reports PASS without validating outputs: `validate_output()` is defined but never called.**

The `validate_output()` function (lines 141-206) computes `validation_results["passed"]` with checks for JSON structure, Markdown content, and page data, but `run_test_scenario()` never invokes it. Instead, after `convert_pipeline_test()` succeeds, the code returns `{"status": "PASS", ...}` without validation (line 245). Additionally, `main()` counts the `PASS_WITH_WARNINGS` status in the summary, but `run_test_scenario()` never produces this status, only "PASS" or "FAIL". Consider calling `validate_output(output_dir, test_name)` after pipeline success and returning FAIL (or PASS_WITH_WARNINGS) when checks fail, to match the validation logic.
```python
    accelerator_device: str = "auto",
):
```
🧩 Analysis chain
🏁 Script executed:
cd kubeflow-pipelines/docling-standard && wc -l standard_components.pyRepository: opendatahub-io/data-processing
Length of output: 204
🏁 Script executed:
cd kubeflow-pipelines/docling-standard && sed -n '25,45p' standard_components.pyRepository: opendatahub-io/data-processing
Length of output: 941
🏁 Script executed:
cd kubeflow-pipelines/docling-standard && sed -n '120,145p' standard_components.pyRepository: opendatahub-io/data-processing
Length of output: 1016
🏁 Script executed:
cd kubeflow-pipelines/docling-standard && sed -n '160,175p' standard_components.pyRepository: opendatahub-io/data-processing
Length of output: 765
🏁 Script executed:
cd kubeflow-pipelines && fd -type f -name "vlm_components.py" | head -5Repository: opendatahub-io/data-processing
Length of output: 244
🏁 Script executed:
cd kubeflow-pipelines/docling-standard && sed -n '35,60p' standard_components.pyRepository: opendatahub-io/data-processing
Length of output: 1497
🏁 Script executed:
rg -l "vlm_components" kubeflow-pipelines/Repository: opendatahub-io/data-processing
Length of output: 172
🏁 Script executed:
rg "class AcceleratorDevice" kubeflow-pipelines/Repository: opendatahub-io/data-processing
Length of output: 56
🏁 Script executed:
cd kubeflow-pipelines/docling-vlm && grep -n "device_map\|accelerator_device" vlm_convert_pipeline.py | head -20Repository: opendatahub-io/data-processing
Length of output: 184
🏁 Script executed:
cd kubeflow-pipelines/docling-vlm && sed -n '1,50p' vlm_convert_pipeline.pyRepository: opendatahub-io/data-processing
Length of output: 1700
🏁 Script executed:
rg "AcceleratorDevice\." kubeflow-pipelines/ -A 2 -B 2Repository: opendatahub-io/data-processing
Length of output: 3199
🏁 Script executed:
cd kubeflow-pipelines/docling-vlm && sed -n '1,100p' vlm_components.py | grep -A 40 '"""'Repository: opendatahub-io/data-processing
Length of output: 2373
🏁 Script executed:
cd kubeflow-pipelines/docling-vlm && sed -n '20,80p' vlm_components.pyRepository: opendatahub-io/data-processing
Length of output: 2616
🏁 Script executed:
rg "MPS|mps" kubeflow-pipelines/ --type pyRepository: opendatahub-io/data-processing
Length of output: 56
**Harden accelerator device normalization, document the parameter, and add aliases (cuda, mps).**

The current `dev = accelerator_device.lower()` without `.strip()` will reject whitespace-padded values (e.g., `" GPU "`), lacks an empty-string guard, and doesn't support the stated aliases like `cuda` or `mps`. Additionally, the `accelerator_device` parameter is missing from the docstring.

Apply the suggested patch to both `kubeflow-pipelines/docling-standard/standard_components.py` (lines 127-137) and `kubeflow-pipelines/docling-vlm/vlm_components.py` (identical code, same issues):
```diff
-    device_map = {
-        "auto": AcceleratorDevice.AUTO,
-        "cpu": AcceleratorDevice.CPU,
-        "gpu": AcceleratorDevice.CUDA,
-    }
-    dev = accelerator_device.lower()
+    if not accelerator_device or not accelerator_device.strip():
+        raise ValueError("accelerator_device must be a non-empty string")
+
+    device_map = {
+        "auto": AcceleratorDevice.AUTO,
+        "cpu": AcceleratorDevice.CPU,
+        "cuda": AcceleratorDevice.CUDA,
+        "gpu": AcceleratorDevice.CUDA,
+    }
+    if hasattr(AcceleratorDevice, "MPS"):
+        device_map["mps"] = getattr(AcceleratorDevice, "MPS")
+
+    dev = accelerator_device.strip().lower()
     if dev not in device_map:
         raise ValueError(
             f"Invalid accelerator_device: {accelerator_device}. Must be one of {sorted(device_map.keys())}"
         )
```

Also add `accelerator_device: Accelerator device to use (e.g. "auto", "cpu", "gpu", "cuda", "mps").` to the Args section of both docstrings.
Committable suggestion skipped: line range outside the PR's diff.
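A couple of illustrative tests for the hardened normalization, assuming the logic is factored into a `resolve_accelerator_device()` helper like the hypothetical shared module sketched earlier in this review:

```python
import pytest

# Hypothetical import; assumes the shared helper module sketched earlier.
from common.device_mapping import resolve_accelerator_device


def test_accepts_whitespace_padded_gpu():
    dev = resolve_accelerator_device("  GPU ")
    assert dev.name == "CUDA"  # "gpu" aliases to the CUDA enum member


def test_rejects_empty_string():
    with pytest.raises(ValueError):
        resolve_accelerator_device("   ")
```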
```python
def validate_output(output_dir: Path, test_name: str) -> dict:
    """Validate the output quality and format"""
    validation_results = {
        "test_name": test_name,
        "passed": True,
        "checks": {},
    }

    # Check for JSON output
    json_files = list(output_dir.glob("*.json"))
    validation_results["checks"]["json_exists"] = len(json_files) > 0

    # Check for Markdown output
    md_files = list(output_dir.glob("*.md"))
    validation_results["checks"]["md_exists"] = len(md_files) > 0

    if json_files:
        json_path = json_files[0]
        try:
            with open(json_path) as f:
                data = json.load(f)

            # Required fields validation
            validation_results["checks"]["has_name"] = "name" in data
            validation_results["checks"]["has_pages"] = "pages" in data

            if "pages" in data:
                validation_results["checks"]["pages_count"] = len(data["pages"])
                validation_results["checks"]["has_content"] = len(data["pages"]) > 0

                # Check page structure
                if data["pages"]:
                    first_page = data["pages"][0]
                    validation_results["checks"]["page_has_number"] = (
                        "page_number" in first_page
                    )
                    validation_results["checks"]["page_has_size"] = "size" in first_page
        except Exception as e:
            validation_results["checks"]["json_parsing"] = f"Failed: {e}"
            validation_results["passed"] = False

    if md_files:
        md_path = md_files[0]
        try:
            content = md_path.read_text()
            validation_results["checks"]["md_length"] = len(content)
            validation_results["checks"]["md_not_empty"] = len(content) > 100
            validation_results["checks"]["md_has_headers"] = "#" in content
            validation_results["checks"]["md_no_error"] = not content.startswith(
                "Error"
            )
        except Exception as e:
            validation_results["checks"]["md_parsing"] = f"Failed: {e}"
            validation_results["passed"] = False

    return validation_results
```
🧩 Analysis chain
🏁 Script executed:
# First, let's examine the full local_run.py file structure
find . -name "local_run.py" -type f | head -20Repository: opendatahub-io/data-processing
Length of output: 170
🏁 Script executed:
# Get file size to determine how to read it
wc -l kubeflow-pipelines/docling-vlm/local_run.py 2>/dev/null || find . -name "local_run.py" | xargs wc -lRepository: opendatahub-io/data-processing
Length of output: 121
🏁 Script executed:
# Search for validate_output function calls in the file
rg "validate_output" --context 5Repository: opendatahub-io/data-processing
Length of output: 1751
🏁 Script executed:
# Read the entire local_run.py file to see the full context
cat -n kubeflow-pipelines/docling-vlm/local_run.pyRepository: opendatahub-io/data-processing
Length of output: 11398
🏁 Script executed:
# Search more thoroughly for any invocation of validate_output using different patterns
rg "validate_output\s*\(" kubeflow-pipelines/docling-vlm/Repository: opendatahub-io/data-processing
Length of output: 180
**Test harness reports PASS without validating outputs: `validate_output()` is unused.**

The `validate_output()` function (lines 88-143) is defined but never invoked. Currently, `run_test_scenario()` reports "PASS" only when the pipeline executes without throwing an exception; no output artifacts are checked. To make tests meaningful, capture the `docling_convert_vlm` output artifact, expose it as a pipeline output, and call `validate_output()` on that path, failing the test when validation checks are unsatisfied.
🤖 Prompt for AI Agents
In kubeflow-pipelines/docling-vlm/local_run.py around lines 88 to 144,
validate_output() is defined but never used; update run_test_scenario() to
capture the docling_convert_vlm output artifact path (ensure the pipeline step
exposes that artifact as a file/directory output), call
validate_output(output_path, test_name) after the pipeline run, inspect the
returned dict and mark the test as failed if validation_results["passed"] is
False (print/log the validation_results and exit non‑zero or raise an
exception); also ensure run_test_scenario propagates pipeline artifact locations
so validate_output receives the actual file system Path to the produced JSON/MD
outputs.
- Standard pipeline defaults include `pdf_backend=dlparse_v4`, `image_export_mode=embedded`, `table_mode=accurate`, `num_threads=4`, `timeout_per_document=300`, `ocr=True`, `force_ocr=False`, `ocr_engine=tesseract`, `accelerator_device=auto`.
- VLM pipeline defaults include `num_threads=4`, `timeout_per_document=300`, `image_export_mode=embedded`, `remote_model_enabled=False`, and `accelerator_device=auto`.
🧩 Analysis chain
🏁 Script executed:

```bash
# First, let's locate and examine the README file
find . -name "README.md" -path "*/kubeflow-pipelines*" | head -5
```

Repository: opendatahub-io/data-processing
Length of output: 204

🏁 Script executed:

```bash
# Also search for README.md files to understand structure
git ls-files | grep -i readme | head -10
```

Repository: opendatahub-io/data-processing
Length of output: 303

🏁 Script executed:

```bash
# Search for parameter definitions related to accelerator_device or docling_accelerator_device
rg "accelerator_device|docling_accelerator_device" -i --max-count 50 | head -100
```

Repository: opendatahub-io/data-processing
Length of output: 7268

🏁 Script executed:

```bash
# Look for pipeline defaults and configuration
rg "pdf_backend|dlparse_v4|image_export_mode" --max-count 30
```

Repository: opendatahub-io/data-processing
Length of output: 13807

🏁 Script executed:

```bash
# Read the README.md file with context around lines 96-98 and 161-170
cat -n kubeflow-pipelines/README.md | sed -n '90,105p'
```

Repository: opendatahub-io/data-processing
Length of output: 1581

🏁 Script executed:

```bash
# Read context around lines 161-170
cat -n kubeflow-pipelines/README.md | sed -n '155,175p'
```

Repository: opendatahub-io/data-processing
Length of output: 1445

🏁 Script executed:

```bash
# Check the actual device_map in the component code
grep -A 10 "device_map" kubeflow-pipelines/docling-standard/standard_components.py
```

Repository: opendatahub-io/data-processing
Length of output: 1153

🏁 Script executed:

```bash
# Also check VLM components
grep -A 10 "device_map" kubeflow-pipelines/docling-vlm/vlm_components.py
```

Repository: opendatahub-io/data-processing
Length of output: 968

🏁 Script executed:

```bash
# Check what values are validated
rg "device_map\[|Invalid accelerator_device" -A 5 -B 5
```

Repository: opendatahub-io/data-processing
Length of output: 13158

🏁 Script executed:

```bash
# Look for any PR objectives or issues mentioning cuda/mps aliases
git log --oneline -20 | head -20
```

Repository: opendatahub-io/data-processing
Length of output: 517

🏁 Script executed:

```bash
# Search for any comments or TODOs about cuda/mps support
rg "cuda|mps|gpu|accelerator" -i -B 2 -A 2 | grep -E "(TODO|FIXME|PR|objective)" -i
```

Repository: opendatahub-io/data-processing
Length of output: 6590
**Lines 96-97: Use the correct parameter name `docling_accelerator_device` in the defaults list, not `accelerator_device`.**

The "Defaults" section lists `accelerator_device=auto`, but users will configure this in the KFP UI using the actual parameter name `docling_accelerator_device`. Update both the Standard and VLM defaults to reflect the correct name (e.g., `docling_accelerator_device=auto`).
🤖 Prompt for AI Agents
In kubeflow-pipelines/README.md around lines 96 to 98, the defaults list uses
the wrong parameter name `accelerator_device=auto`; update both Standard and VLM
defaults to use the correct KFP parameter name `docling_accelerator_device=auto`
so the UI config matches documentation. Replace `accelerator_device=auto` with
`docling_accelerator_device=auto` in both default lines and verify spelling and
punctuation remain consistent with surrounding list formatting.
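For reference, one way a user might set the correctly named parameter when launching a run; the endpoint and package file below are placeholders:

```python
from kfp import Client

client = Client(host="https://<your-kfp-endpoint>")  # placeholder endpoint
client.create_run_from_pipeline_package(
    "standard_convert_pipeline_compiled.yaml",
    arguments={"docling_accelerator_device": "gpu"},
)
```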
- `docling_accelerator_device`: Device to use for acceleration. Supported values:
  - `auto` (default): Automatically detect and use the best available device
  - `cpu`: Force CPU-only processing
  - `gpu`: Use GPU with CUDA
@alimaredia do we care about AMD ROCm?
I don't know, I may be wrong, but right now NVIDIA dominates the AI/ML GPU market. Docling's support uses `AcceleratorDevice.CUDA`, which is NVIDIA-focused, and most ML containers are built for CUDA, not ROCm. If we added ROCm, we would need separate testing, images, and validation. Also, I think Kubeflow/OpenShift deployments primarily use the NVIDIA GPU operator. So maybe we should not care about AMD ROCm for now. @alimaredia wdyt?
Talked about this in standup. Upstream Docling just uses CUDA, so we can stick to CUDA; if that changes, we will update.
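Not part of this PR's diff, but for context, a minimal sketch of how a CUDA device selection typically feeds Docling's converter, following the upstream accelerator-options example (import paths assume docling 2.x; the `num_threads` value is illustrative):

```python
from docling.datamodel.accelerator_options import AcceleratorDevice, AcceleratorOptions
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.document_converter import DocumentConverter, PdfFormatOption

# "gpu" maps to CUDA only; ROCm is out of scope per the discussion above
accelerator_options = AcceleratorOptions(num_threads=4, device=AcceleratorDevice.CUDA)

pipeline_options = PdfPipelineOptions()
pipeline_options.accelerator_options = accelerator_options

converter = DocumentConverter(
    format_options={InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)}
)
```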
    # Uncomment if you have GPU
    # "gpu_test": {
    #     "docling_accelerator_device": "gpu",
    #     "docling_pdf_backend": "dlparse_v4",
    #     "docling_table_mode": "accurate",
    #     "docling_num_threads": 4,
    #     "docling_ocr": True,
    # },
Why does this need to be commented out?
Right now we don't have a GPU to test with; once we start testing this on a GPU, we can uncomment it and test it then.
Could you run this on an AWS EC2 instance?
Yes, we can use any AWS EC2 instance with NVIDIA GPUs to test it.
Can you test with a GPU on an AWS instance?
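As a side note, the harness could enable this block automatically instead of by hand. A minimal sketch, assuming a hypothetical `gpu_available()` helper and a scenario dict shaped like the one in this diff (the harness's real structure may differ):

```python
import shutil
import subprocess

def gpu_available() -> bool:
    """Best-effort check: nvidia-smi is on PATH and exits cleanly."""
    if shutil.which("nvidia-smi") is None:
        return False
    try:
        return subprocess.run(["nvidia-smi"], capture_output=True, timeout=10).returncode == 0
    except (OSError, subprocess.TimeoutExpired):
        return False

test_scenarios = {}  # hypothetical container for the harness's scenarios
if gpu_available():
    test_scenarios["gpu_test"] = {
        "docling_accelerator_device": "gpu",
        "docling_pdf_backend": "dlparse_v4",
        "docling_table_mode": "accurate",
        "docling_num_threads": 4,
        "docling_ocr": True,
    }
```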
    device_map = {
        "auto": AcceleratorDevice.AUTO,
        "cpu": AcceleratorDevice.CPU,
        "gpu": AcceleratorDevice.CUDA,
With Fabiano's image, where torch is installed CPU-only, would `gpu` work?
So, I tried this:
(.venv) roburishabh@rishasin-mac data-processing % docker pull quay.io/fabianofranz/docling-ubi9:2.54.0
2.54.0: Pulling from fabianofranz/docling-ubi9
Digest: sha256:0c050ec6ecfb3cc44f1f446994de0b655b48ba9c19f713facc473a2c458f9d96
Status: Image is up to date for quay.io/fabianofranz/docling-ubi9:2.54.0
quay.io/fabianofranz/docling-ubi9:2.54.0
-------------------------------------------
(.venv) roburishabh@rishasin-mac data-processing % docker run --rm quay.io/fabianofranz/docling-ubi9:2.54.0 \
python -c "import torch; print(f'CUDA available: {torch.cuda.is_available()}'); print(f'CUDA built: {torch.version.cuda}')"
WARNING: The requested image's platform (linux/amd64) does not match the detected host platform (linux/arm64/v8) and no specific platform was requested
CUDA available: False
CUDA built: None
The container image quay.io/fabianofranz/docling-ubi9:2.54.0 ships CPU-only PyTorch, so the `gpu` option is currently unusable with this image. To enable GPU support, someone would need to build a new container image with CUDA-enabled PyTorch (torch+cu118 or similar) and the CUDA runtime libraries. I think @alimaredia is building that image. WDYT?
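Until a CUDA-enabled image exists, a defensive fallback could demote `gpu` to CPU at runtime. A minimal sketch, not code from this PR: `resolve_device` is a hypothetical helper that reuses the `device_map` from the diff above and the same `torch.cuda.is_available()` check shown in the test output:

```python
import torch
from docling.datamodel.accelerator_options import AcceleratorDevice  # docling 2.x path assumed

def resolve_device(requested: str) -> AcceleratorDevice:
    """Map the pipeline string to a Docling device, demoting 'gpu' to CPU
    when the installed torch build has no usable CUDA runtime."""
    device_map = {
        "auto": AcceleratorDevice.AUTO,
        "cpu": AcceleratorDevice.CPU,
        "gpu": AcceleratorDevice.CUDA,
    }
    if requested not in device_map:
        raise ValueError(f"Invalid accelerator_device: {requested!r}")
    if requested == "gpu" and not torch.cuda.is_available():
        # CPU-only torch builds (like the image tested above) report False here
        return AcceleratorDevice.CPU
    return device_map[requested]
```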