Conversation


RobuRishabh (Contributor) commented Nov 4, 2025

Description

  • Added accelerator_device parameter to docling_convert_standard component
  • Added accelerator_device parameter to docling_convert_vlm component
  • Supports device aliases: auto, cpu, gpu (see the submission sketch below)
  • Updated README documentation for all three files
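
A minimal sketch of selecting the device at submission time, assuming a reachable KFP endpoint and that the root-level docling_accelerator_device input is wired through to the components (the host URL and run name are placeholders):

import kfp

client = kfp.Client(host="https://<your-kfp-endpoint>")
client.create_run_from_pipeline_package(
    "kubeflow-pipelines/docling-standard/standard_convert_pipeline_compiled.yaml",
    arguments={"docling_accelerator_device": "gpu"},  # case-insensitive; "gpu" aliases cuda
    run_name="docling-standard-gpu",
)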

How Has This Been Tested?

Merge criteria:

  • The commits are squashed in a cohesive manner and have meaningful messages.
  • Testing instructions have been added in the PR body (for PRs involving changes that are not immediately obvious).
  • The developer has manually tested the changes and verified that they work.

Summary by CodeRabbit

  • New Features

    • Added an accelerator-device option to Standard and VLM pipelines (auto, cpu, gpu; default: auto) and exposed it as a pipeline input.
  • Bug Fixes

    • Improved PDF download reliability with retry/backoff for non-S3 downloads.
  • Documentation

    • READMEs updated with accelerator-device docs, supported values, aliases, and examples.
  • Tests

    • Added local test harness with scenario-driven configurations and failure scenarios.
  • Chores

    • Added local_outputs/ to .gitignore.


RobuRishabh requested a review from a team as a code owner on November 4, 2025 19:46

coderabbitai bot commented Nov 4, 2025

Walkthrough

Adds an accelerator device parameter (default "auto") to Standard and VLM converters, propagates it through pipeline wiring and compiled YAMLs, updates documentation and READMEs, expands local test harnesses, adds HTTP download retry logic, and updates .gitignore.

Changes

  • Documentation — kubeflow-pipelines/README.md, kubeflow-pipelines/docling-standard/README.md, kubeflow-pipelines/docling-vlm/README.md
    Add docling_accelerator_device docs, list supported values (auto, cpu, gpu), note case‑insensitive handling and the gpu alias for CUDA, and update examples and default parameters.
  • Standard — components & pipeline — kubeflow-pipelines/docling-standard/standard_components.py, kubeflow-pipelines/docling-standard/standard_convert_pipeline.py, kubeflow-pipelines/docling-standard/standard_convert_pipeline_compiled.yaml
    Add an accelerator_device: str = "auto" parameter; introduce device_map, normalize/validate input, raise ValueError on invalid keys, use the mapped AcceleratorDevice in AcceleratorOptions, and propagate the parameter through pipeline wiring and the compiled YAML.
  • VLM — components & pipeline — kubeflow-pipelines/docling-vlm/vlm_components.py, kubeflow-pipelines/docling-vlm/vlm_convert_pipeline.py, kubeflow-pipelines/docling-vlm/vlm_convert_pipeline_compiled.yaml
    Add an accelerator_device: str = "auto" parameter; add device_map, normalize/validate input, raise on invalid values, use the mapped AcceleratorDevice in AcceleratorOptions, and wire the parameter through pipeline inputs and the compiled YAML.
  • Local test harnesses — kubeflow-pipelines/docling-standard/local_run.py, kubeflow-pipelines/docling-vlm/local_run.py
    Replace the simple local runner with a test harness: TEST_CONFIGS, FAILURE_SCENARIOS, convert_pipeline_test(...), validate_output(...), run_test_scenario(...), and a main orchestration. Tests pass docling_accelerator_device and other options to the converters and summarize results.
  • Common utilities — kubeflow-pipelines/common/components.py
    Add a retry loop for non‑S3 HTTP PDF downloads (max_retries, retry_delay_sec) with per‑attempt logging and sleep backoff, raising on final failure; the S3 path is unchanged. A rough sketch follows this table.
  • Misc — .gitignore
    Added local_outputs/ to ignored paths.
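
As a rough illustration of the retry behavior described for common/components.py (max_retries and retry_delay_sec come from the walkthrough; the download_pdf name and the requests-based body are assumptions, not the PR's exact code):

import logging
import time

import requests

logger = logging.getLogger(__name__)

def download_pdf(url: str, dest: str, max_retries: int = 3, retry_delay_sec: float = 2.0) -> None:
    """Fetch one PDF over HTTP(S) with per-attempt logging and sleep backoff."""
    for attempt in range(1, max_retries + 1):
        try:
            resp = requests.get(url, timeout=30)
            resp.raise_for_status()
            with open(dest, "wb") as f:
                f.write(resp.content)
            return
        except requests.RequestException as exc:
            logger.warning("Download attempt %d/%d for %s failed: %s", attempt, max_retries, url, exc)
            if attempt == max_retries:
                raise  # raise on final failure, as the summary describes
            time.sleep(retry_delay_sec * attempt)  # growing sleep between attempts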

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

  • Areas to focus review on:
    • docling-standard/local_run.py, docling-vlm/local_run.py — test harness logic, expected-failure handling, output validation, and Docker/local runner interactions (see the harness sketch after this list).
    • standard_components.py, vlm_components.py — device_map correctness, enum imports/usage, validation/error messages, and ordering of early validation relative to other logic.
    • Compiled pipeline YAMLs (*_compiled.yaml) — ensure inputDefinitions, pipelinechannel--docling_accelerator_device wiring, and for-loop/task inputs correctly match runtime signatures.
    • common/components.py — retry/backoff behavior, exception chaining, and logging clarity.
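
A hypothetical skeleton of the scenario-driven harness mentioned in the first focus area (TEST_CONFIGS, FAILURE_SCENARIOS, convert_pipeline_test, and run_test_scenario are named in the walkthrough; the bodies here are illustrative stubs):

VALID_DEVICES = {"auto", "cpu", "cuda", "gpu", "mps"}

def convert_pipeline_test(docling_accelerator_device: str = "auto", **options) -> None:
    """Stand-in for the real converter invocation used by local_run.py."""
    if docling_accelerator_device.lower() not in VALID_DEVICES:
        raise ValueError(f"Invalid accelerator_device: {docling_accelerator_device}")

TEST_CONFIGS = {
    "default_auto": {},
    "gpu_alias": {"docling_accelerator_device": "gpu"},
}

FAILURE_SCENARIOS = {
    "unknown_device": {"docling_accelerator_device": "tpu"},
}

def run_test_scenario(name: str, config: dict, expect_failure: bool) -> bool:
    try:
        convert_pipeline_test(**config)
    except ValueError:
        return expect_failure  # an expected-failure scenario passes only when it raises
    return not expect_failure

if __name__ == "__main__":
    results = {name: run_test_scenario(name, cfg, False) for name, cfg in TEST_CONFIGS.items()}
    results.update({name: run_test_scenario(name, cfg, True) for name, cfg in FAILURE_SCENARIOS.items()})
    print(results)  # pass/fail summary per scenario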

Poem

I nibble through code with curious paws,
Choosing "auto", "cpu", or "gpu" because—
I map the devices and check each name,
Hop, validate, then print the game. 🐇✨

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)

  • Docstring Coverage: ⚠️ Warning. Docstring coverage is 70.00%, below the required threshold of 80.00%. Resolution: run @coderabbitai generate docstrings to improve docstring coverage.

✅ Passed checks (2 passed)

  • Title check: ✅ Passed. The PR title accurately describes the main change: adding accelerator device support to Kubeflow Pipelines for Docling components. It is specific, concise, and directly related to the changeset.
  • Description Check: ✅ Passed. Check skipped because CodeRabbit’s high-level summary is enabled.


coderabbitai bot left a comment


Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (6)
kubeflow-pipelines/docling-vlm/vlm_components.py (2)

1-1: Fix the formatting issue flagged by GitHub Actions.

The Ruff formatter has flagged this file for reformatting.

Run the following to fix:

#!/bin/bash
ruff format kubeflow-pipelines/docling-vlm/vlm_components.py

27-40: Update the docstring to document the accelerator_device parameter.

The accelerator_device parameter is missing from the Args section of the docstring.

Apply this diff:

     Args:
         input_path: Path to the input directory containing PDF files.
         artifacts_path: Path to the directory containing Docling models.
         output_path: Path to the output directory for converted JSON and Markdown files.
         pdf_filenames: List of PDF file names to process.
         num_threads: Number of threads to use per document processing.
         timeout_per_document: Timeout per document processing.
         image_export_mode: Mode to export images.
         remote_model_enabled: Whether or not to use a remote model.
         remote_model_secret_mount_path: Path to the remote model secret mount path.
+        accelerator_device: Device to use for acceleration (auto, cpu, cuda, gpu, mps).
kubeflow-pipelines/docling-vlm/vlm_convert_pipeline_compiled.yaml (1)

104-156: The accelerator_device parameter is not wired through the for-loop-1 component.

The accelerator_device parameter is defined in the comp-docling-convert-vlm inputDefinitions (lines 46-49) but is not passed through the for-loop-1 component to the actual docling-convert-vlm task. This means users cannot configure the accelerator device at runtime—it will always use the default "auto" value.

Apply this diff to wire the parameter:

       docling-convert-vlm:
         cachingOptions: {}
         componentRef:
           name: comp-docling-convert-vlm
         inputs:
           artifacts:
             artifacts_path:
               componentInputArtifact: pipelinechannel--download-docling-models-output_path
             input_path:
               componentInputArtifact: pipelinechannel--import-pdfs-output_path
           parameters:
+            accelerator_device:
+              componentInputParameter: pipelinechannel--docling_accelerator_device
             image_export_mode:
               componentInputParameter: pipelinechannel--docling_image_export_mode

Also add the corresponding input definition in the for-loop-1 component:

   inputDefinitions:
     artifacts:
       pipelinechannel--download-docling-models-output_path:
         artifactType:
           schemaTitle: system.Artifact
           schemaVersion: 0.0.1
       pipelinechannel--import-pdfs-output_path:
         artifactType:
           schemaTitle: system.Artifact
           schemaVersion: 0.0.1
     parameters:
+      pipelinechannel--docling_accelerator_device:
+        parameterType: STRING
       pipelinechannel--create-pdf-splits-Output:
         parameterType: LIST

And add it to the root pipeline's for-loop-1 task inputs:

     for-loop-1:
       componentRef:
         name: comp-for-loop-1
       dependentTasks:
       - create-pdf-splits
       - download-docling-models
       - import-pdfs
       inputs:
         artifacts:
           pipelinechannel--download-docling-models-output_path:
             taskOutputArtifact:
               outputArtifactKey: output_path
               producerTask: download-docling-models
           pipelinechannel--import-pdfs-output_path:
             taskOutputArtifact:
               outputArtifactKey: output_path
               producerTask: import-pdfs
         parameters:
+          pipelinechannel--docling_accelerator_device:
+            componentInputParameter: docling_accelerator_device
           pipelinechannel--create-pdf-splits-Output:
             taskOutputParameter:
               outputParameterKey: Output
               producerTask: create-pdf-splits

Also add the root-level input definition:

 root:
   dag:
     tasks:
       ...
   inputDefinitions:
     parameters:
+      docling_accelerator_device:
+        defaultValue: auto
+        isOptional: true
+        parameterType: STRING
       docling_image_export_mode:
         defaultValue: embedded
kubeflow-pipelines/docling-standard/standard_components.py (2)

1-1: Fix the formatting issue flagged by GitHub Actions.

The Ruff formatter has flagged this file for reformatting.

Run the following to fix:

#!/bin/bash
ruff format kubeflow-pipelines/docling-standard/standard_components.py

35-56: Update the docstring to document the accelerator_device parameter.

The accelerator_device parameter is missing from the Args section of the docstring.

Apply this diff:

     Args:
         input_path: Path to the input directory containing PDF files.
         artifacts_path: Path to the directory containing Docling models.
         output_path: Path to the output directory for converted JSON and Markdown files.
         pdf_filenames: List of PDF file names to process.
         pdf_backend: Backend to use for PDF processing.
         image_export_mode: Mode to export images.
         table_mode: Mode to detect tables.
         num_threads: Number of threads to use per document processing.
         timeout_per_document: Timeout per document processing.
         ocr: Whether or not to use OCR if needed.
         force_ocr: Whether or not to force OCR.
         ocr_engine: Engine to use for OCR.
         allow_external_plugins: Whether or not to allow external plugins.
         enrich_code: Whether or not to enrich code.
         enrich_formula: Whether or not to enrich formula.
         enrich_picture_classes: Whether or not to enrich picture classes.
         enrich_picture_description: Whether or not to enrich picture description.
+        accelerator_device: Device to use for acceleration (auto, cpu, cuda, gpu, mps).
kubeflow-pipelines/docling-standard/standard_convert_pipeline_compiled.yaml (1)

153-237: The accelerator_device parameter is not wired through the for-loop-1 component.

The accelerator_device parameter is defined in the comp-docling-convert-standard inputDefinitions (lines 55-58) but is not passed through the for-loop-1 component to the actual docling-convert-standard task. This means users cannot configure the accelerator device at runtime—it will always use the default "auto" value.

Apply this diff to wire the parameter:

       docling-convert-standard:
         cachingOptions: {}
         componentRef:
           name: comp-docling-convert-standard
         inputs:
           artifacts:
             artifacts_path:
               componentInputArtifact: pipelinechannel--download-docling-models-output_path
             input_path:
               componentInputArtifact: pipelinechannel--import-pdfs-output_path
           parameters:
+            accelerator_device:
+              componentInputParameter: pipelinechannel--docling_accelerator_device
             allow_external_plugins:
               componentInputParameter: pipelinechannel--docling_allow_external_plugins

Also add the corresponding input definition in the for-loop-1 component:

   inputDefinitions:
     artifacts:
       pipelinechannel--download-docling-models-output_path:
         artifactType:
           schemaTitle: system.Artifact
           schemaVersion: 0.0.1
       pipelinechannel--import-pdfs-output_path:
         artifactType:
           schemaTitle: system.Artifact
           schemaVersion: 0.0.1
     parameters:
+      pipelinechannel--docling_accelerator_device:
+        parameterType: STRING
       pipelinechannel--create-pdf-splits-Output:
         parameterType: LIST

And add it to the root pipeline's for-loop-1 task inputs:

     for-loop-1:
       componentRef:
         name: comp-for-loop-1
       dependentTasks:
       - create-pdf-splits
       - download-docling-models
       - import-pdfs
       inputs:
         artifacts:
           pipelinechannel--download-docling-models-output_path:
             taskOutputArtifact:
               outputArtifactKey: output_path
               producerTask: download-docling-models
           pipelinechannel--import-pdfs-output_path:
             taskOutputArtifact:
               outputArtifactKey: output_path
               producerTask: import-pdfs
         parameters:
+          pipelinechannel--docling_accelerator_device:
+            componentInputParameter: docling_accelerator_device
           pipelinechannel--create-pdf-splits-Output:
             taskOutputParameter:
               outputParameterKey: Output
               producerTask: create-pdf-splits

Also add the root-level input definition:

 root:
   dag:
     tasks:
       ...
   inputDefinitions:
     parameters:
+      docling_accelerator_device:
+        defaultValue: auto
+        isOptional: true
+        parameterType: STRING
       docling_allow_external_plugins:
         defaultValue: false
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 277fcac and 4200278.

📒 Files selected for processing (7)
  • kubeflow-pipelines/README.md (3 hunks)
  • kubeflow-pipelines/docling-standard/README.md (1 hunks)
  • kubeflow-pipelines/docling-standard/standard_components.py (3 hunks)
  • kubeflow-pipelines/docling-standard/standard_convert_pipeline_compiled.yaml (4 hunks)
  • kubeflow-pipelines/docling-vlm/README.md (1 hunks)
  • kubeflow-pipelines/docling-vlm/vlm_components.py (3 hunks)
  • kubeflow-pipelines/docling-vlm/vlm_convert_pipeline_compiled.yaml (4 hunks)
🧰 Additional context used
🪛 GitHub Actions: Python Formatting Validation
kubeflow-pipelines/docling-standard/standard_components.py

[error] 1-1: Ruff formatting would modify this file (would reformat).

kubeflow-pipelines/docling-vlm/vlm_components.py

[error] 1-1: Ruff formatting would modify this file (would reformat).

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: compile (docling-standard, kubeflow-pipelines/docling-standard, python standard_convert_pipeline....
  • GitHub Check: compile (docling-vlm, kubeflow-pipelines/docling-vlm, python vlm_convert_pipeline.py, vlm_convert...

Comment on lines 127 to 136
device_map = {
"auto": AcceleratorDevice.AUTO,
"cpu": AcceleratorDevice.CPU,
"cuda": AcceleratorDevice.CUDA,
"gpu": AcceleratorDevice.CUDA, # alias
"mps": AcceleratorDevice.MPS,
}
dev = accelerator_device.lower()
if dev not in device_map:
raise ValueError(f"Invalid accelerator_device: {accelerator_device}. Must be one of {sorted(device_map.keys())}")

🛠️ Refactor suggestion | 🟠 Major

Consider extracting the device_map to a shared constant.

This device_map dictionary is identical to the one in kubeflow-pipelines/docling-vlm/vlm_components.py (lines 75-81). This duplication violates the DRY principle.

See the recommendation in the VLM components file to extract this to a shared module in common/device_mapping.py.

🤖 Prompt for AI Agents
In kubeflow-pipelines/docling-standard/standard_components.py around lines 127
to 136, the device_map dict is duplicated (same as in
kubeflow-pipelines/docling-vlm/vlm_components.py) — extract this mapping to a
shared module (e.g., kubeflow-pipelines/common/device_mapping.py) and replace
the local literal with an import; move the device_map constant into that module,
export it (or a getter), update both files to import device_map (or
get_device_map()) and then use it unchanged so the mapping is maintained in one
place.

Comment on lines 75 to 84
device_map = {
"auto": AcceleratorDevice.AUTO,
"cpu": AcceleratorDevice.CPU,
"cuda": AcceleratorDevice.CUDA,
"gpu": AcceleratorDevice.CUDA, # alias
"mps": AcceleratorDevice.MPS,
}
dev = accelerator_device.lower()
if dev not in device_map:
raise ValueError(
f"Invalid accelerator_device: {accelerator_device}. Must be one of {sorted(device_map.keys())}"
)

🛠️ Refactor suggestion | 🟠 Major

Consider extracting the device_map to a shared constant.

The device_map dictionary is duplicated in both kubeflow-pipelines/docling-vlm/vlm_components.py and kubeflow-pipelines/docling-standard/standard_components.py (lines 127-136). This violates the DRY principle and creates a maintenance burden if the mapping needs to be updated.

Consider extracting this to a shared module (e.g., common/constants.py or common/device_mapping.py):

# common/device_mapping.py
from docling.datamodel.accelerator_options import AcceleratorDevice

ACCELERATOR_DEVICE_MAP = {
    "auto": AcceleratorDevice.AUTO,
    "cpu":  AcceleratorDevice.CPU,
    "cuda": AcceleratorDevice.CUDA,
    "gpu":  AcceleratorDevice.CUDA,  # alias
    "mps":  AcceleratorDevice.MPS,
}

Then import and use it in both component files:

from common.device_mapping import ACCELERATOR_DEVICE_MAP
🤖 Prompt for AI Agents
In kubeflow-pipelines/docling-vlm/vlm_components.py around lines 75 to 86, the
local device_map is duplicated across components causing DRY/maintenance issues;
extract the mapping into a shared module (e.g., common/device_mapping.py or
common/constants.py) that exports a constant (ACCELERATOR_DEVICE_MAP) using
AcceleratorDevice values, then replace the local mapping with an import from
that shared module and use the imported constant in the validation logic; update
the other file (kubeflow-pipelines/docling-standard/standard_components.py) to
import and use the same constant and run tests/flake to ensure imports and names
are correct.

RobuRishabh force-pushed the Add_accelerator_device_to_KFP branch from 4200278 to 5bbcc5f on November 4, 2025 19:54

coderabbitai bot left a comment


Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
kubeflow-pipelines/docling-standard/standard_components.py (1)

33-56: Add documentation for the accelerator_device parameter.

The new accelerator_device parameter is missing from the function's docstring. Users need to understand the purpose, valid values, and default behavior.

Add the parameter documentation to the Args section:

         enrich_picture_classes: Whether or not to enrich picture classes.
         enrich_picture_description: Whether or not to enrich picture description.
+        accelerator_device: Device to use for acceleration (auto, cpu, cuda, gpu, mps). Defaults to "auto".
     """
kubeflow-pipelines/docling-vlm/vlm_components.py (1)

25-40: Add documentation for the accelerator_device parameter.

The new accelerator_device parameter is missing from the function's docstring. Users need to understand the purpose, valid values, and default behavior.

Add the parameter documentation to the Args section:

         remote_model_enabled: Whether or not to use a remote model.
         remote_model_secret_mount_path: Path to the remote model secret mount path.
+        accelerator_device: Device to use for acceleration (auto, cpu, cuda, gpu, mps). Defaults to "auto".
     """
♻️ Duplicate comments (3)
kubeflow-pipelines/docling-standard/standard_components.py (1)

127-138: Device mapping duplication already flagged in previous review.

The duplication of device_map across component files has been identified in the previous review comments. Please refer to the earlier recommendation to extract this to a shared module.

kubeflow-pipelines/docling-vlm/vlm_components.py (1)

75-86: Device mapping duplication already flagged in previous review.

The duplication of device_map across component files has been identified in the previous review comments. Please refer to the earlier recommendation to extract this to a shared module.

kubeflow-pipelines/docling-vlm/vlm_convert_pipeline_compiled.yaml (1)

251-261: Missing documentation reflects source file issue.

The docstring in this compiled YAML is missing documentation for accelerator_device because the source file (vlm_components.py) is missing it. Please address this by updating the source file's docstring as flagged in the earlier review comment.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 4200278 and 5bbcc5f.

📒 Files selected for processing (7)
  • kubeflow-pipelines/README.md (3 hunks)
  • kubeflow-pipelines/docling-standard/README.md (1 hunks)
  • kubeflow-pipelines/docling-standard/standard_components.py (3 hunks)
  • kubeflow-pipelines/docling-standard/standard_convert_pipeline_compiled.yaml (4 hunks)
  • kubeflow-pipelines/docling-vlm/README.md (1 hunks)
  • kubeflow-pipelines/docling-vlm/vlm_components.py (3 hunks)
  • kubeflow-pipelines/docling-vlm/vlm_convert_pipeline_compiled.yaml (4 hunks)
🚧 Files skipped from review as they are similar to previous changes (4)
  • kubeflow-pipelines/README.md
  • kubeflow-pipelines/docling-standard/README.md
  • kubeflow-pipelines/docling-standard/standard_convert_pipeline_compiled.yaml
  • kubeflow-pipelines/docling-vlm/README.md
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: compile (docling-standard, kubeflow-pipelines/docling-standard, python standard_convert_pipeline....
  • GitHub Check: compile (docling-vlm, kubeflow-pipelines/docling-vlm, python vlm_convert_pipeline.py, vlm_convert...
🔇 Additional comments (1)
kubeflow-pipelines/docling-vlm/vlm_convert_pipeline_compiled.yaml (1)

280-285: LGTM!

The device mapping and validation logic in the compiled YAML correctly reflects the source implementation from vlm_components.py.

RobuRishabh force-pushed the Add_accelerator_device_to_KFP branch from 5bbcc5f to cb254e0 on November 4, 2025 20:05

coderabbitai bot left a comment


Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
kubeflow-pipelines/docling-standard/standard_convert_pipeline_compiled.yaml (2)

4-21: Update pipeline header comments and wire accelerator_device through the full DAG.

The pipeline header (lines 4-21) documents all other parameters but omits accelerator_device. Additionally, this parameter must be added to:

  1. Pipeline header (lines 4-21): Add accelerator_device: str [Default: auto]
  2. Root inputDefinitions (lines 717-785): Add parameter definition matching the component
  3. comp-for-loop-1 inputDefinitions (lines 207-237): Add parameter definition
  4. for-loop-1 task inputs (lines 166-194): Wire parameter via componentInputParameter

Example of the missing wiring in for-loop task (should be added around line 167):

accelerator_device:
  componentInputParameter: pipelinechannel--docling_accelerator_device

Apply these changes to complete the parameter threading.

Also applies to: 55-58, 166-194


154-194: Add accelerator_device to the pipeline parameter chain.

The component defines accelerator_device (line 55) with a default of "auto", but it is not wired through the pipeline DAG:

  • Not included in the docling-convert-standard task's parameter inputs (lines 167–194)
  • Not defined in comp-for-loop-1.inputDefinitions.parameters (lines 197–237)
  • Not defined in root.inputDefinitions.parameters (lines 710–785)
  • Not documented in the pipeline header comments (lines 4–21)

This prevents pipeline users from overriding the default accelerator_device value. Either add this parameter to all four locations for full configurability, or document if this is intentionally fixed per pipeline.

♻️ Duplicate comments (1)
kubeflow-pipelines/docling-vlm/vlm_convert_pipeline_compiled.yaml (1)

46-49: accelerator_device parameter must be wired through the pipeline root and for-loop-1 task.

The accelerator_device parameter is defined in the component (lines 46–49) with a default value of "auto", but it is not exposed at the pipeline root level (root.inputDefinitions) nor passed through the for-loop-1 task to the docling-convert-vlm component. This prevents runtime configuration—the parameter will always use its default value. Other similar optional parameters like image_export_mode, num_threads, and timeout_per_document follow the correct pattern and are exposed at the root and wired through the task.

Required changes:

  1. Add docling_accelerator_device to root.inputDefinitions (lines 610–643) following the pattern of other docling_* parameters:
      docling_accelerator_device:
        defaultValue: auto
        isOptional: true
        parameterType: STRING
  2. Add the parameter to the for-loop-1 task's inputs (lines 575–586):
            pipelinechannel--docling_accelerator_device:
              componentInputParameter: docling_accelerator_device
  3. Wire it into the docling-convert-vlm component parameters within the for-loop-1 task (around line 130):
              accelerator_device:
                componentInputParameter: pipelinechannel--docling_accelerator_device

Also update the pipeline header comment (lines 4–12) to include docling_accelerator_device: str [Default: 'auto'] in the Inputs section.

Also applies to: 107-132, 610-643

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 5bbcc5f and cb254e0.

📒 Files selected for processing (7)
  • kubeflow-pipelines/README.md (3 hunks)
  • kubeflow-pipelines/docling-standard/README.md (1 hunks)
  • kubeflow-pipelines/docling-standard/standard_components.py (3 hunks)
  • kubeflow-pipelines/docling-standard/standard_convert_pipeline_compiled.yaml (4 hunks)
  • kubeflow-pipelines/docling-vlm/README.md (1 hunks)
  • kubeflow-pipelines/docling-vlm/vlm_components.py (3 hunks)
  • kubeflow-pipelines/docling-vlm/vlm_convert_pipeline_compiled.yaml (4 hunks)
🚧 Files skipped from review as they are similar to previous changes (4)
  • kubeflow-pipelines/docling-vlm/vlm_components.py
  • kubeflow-pipelines/README.md
  • kubeflow-pipelines/docling-standard/standard_components.py
  • kubeflow-pipelines/docling-vlm/README.md
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: compile (docling-vlm, kubeflow-pipelines/docling-vlm, python vlm_convert_pipeline.py, vlm_convert...
  • GitHub Check: compile (docling-standard, kubeflow-pipelines/docling-standard, python standard_convert_pipeline....
🔇 Additional comments (2)
kubeflow-pipelines/docling-standard/standard_convert_pipeline_compiled.yaml (1)

388-411: Device mapping and validation are well-implemented.

The device_map correctly defines all supported accelerators with "gpu" as an alias for "cuda", case-insensitive input handling, and clear error messaging on invalid input.

kubeflow-pipelines/docling-standard/README.md (1)

13-13: Documentation is clear and accurate.

The new docling_accelerator_device parameter is well-documented with supported values and default. Placement at the start of the configuration options list is logical and improves discoverability.

@shruthis4 (Contributor)

@Mergifyio refresh


mergify bot commented Nov 5, 2025

refresh

✅ Pull request refreshed

RobuRishabh force-pushed the Add_accelerator_device_to_KFP branch from cb254e0 to bd03127 on December 5, 2025 20:16

coderabbitai bot left a comment


Actionable comments posted: 4

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
.github/workflows/validate-python.yml (1)

12-12: Fix path pattern typo: "workflow" → "workflows" (plural).

Lines 12 and 20 reference .github/workflow/validate-python.yml but the file is located at .github/workflows/validate-python.yml. The singular "workflow" won't match; this breaks the self-referential trigger and the workflow won't re-run when it changes.

Apply this diff to fix both occurrences:

    paths:
      - "**/*.py"
      - ".pre-commit-config.yaml"
      - "Makefile"
      - "pyproject.toml"
-     - ".github/workflow/validate-python.yml"
+     - ".github/workflows/validate-python.yml"

Also applies to: 20-20

♻️ Duplicate comments (4)
kubeflow-pipelines/docling-standard/standard_components.py (1)

127-138: Avoid duplicating device_map across standard and VLM components.

The device_map here (including the gpu alias) is identical in spirit to the one used in the VLM components. Maintaining these in two places will be error-prone if you ever add/remove devices or aliases.

As suggested in the prior review, consider extracting this mapping into a shared module (e.g., kubeflow-pipelines/common/device_mapping.py) and importing it in both components:

-    device_map = {
-        "auto": AcceleratorDevice.AUTO,
-        "cpu": AcceleratorDevice.CPU,
-        "cuda": AcceleratorDevice.CUDA,
-        "gpu": AcceleratorDevice.CUDA,  # alias
-        "mps": AcceleratorDevice.MPS,
-    }
+    from common.device_mapping import DEVICE_MAP as device_map
@@
-    pipeline_options.accelerator_options = AcceleratorOptions(
-        num_threads=num_threads, device=device_map[dev]
-    )
+    pipeline_options.accelerator_options = AcceleratorOptions(
+        num_threads=num_threads, device=device_map[dev]
+    )

Also applies to: 167-169

kubeflow-pipelines/docling-vlm/vlm_components.py (1)

25-26: Document accelerator_device and consider centralizing device_map.

The runtime behavior looks good (normalization, alias support, validation, and passing device_map[dev] into AcceleratorOptions). Two follow‑ups:

  • The docstring’s Args block doesn’t mention accelerator_device, which makes the new parameter hard to discover. Consider adding a brief line:
-        remote_model_secret_mount_path: Path to the remote model secret mount path.
+        remote_model_secret_mount_path: Path to the remote model secret mount path.
+        accelerator_device: Device to use for acceleration ('auto', 'cpu', 'cuda', 'gpu', 'mps').
  • The device_map dictionary is now duplicated across the VLM and standard components (and appears again in the compiled YAML). To avoid drift when adding new accelerators or aliases, it would be cleaner to move this mapping into a shared module (e.g., common/device_mapping.py) and import it here.

Also applies to: 75-86, 179-180

kubeflow-pipelines/docling-vlm/vlm_convert_pipeline_compiled.yaml (2)

46-49: accelerator_device still isn’t configurable at the pipeline level.

accelerator_device is defined on comp-docling-convert-vlm (lines 46–49) and used inside docling_convert_vlm, but it isn’t:

  • Exposed as a root pipeline parameter, and
  • Passed through the docling-convert-vlm task in comp-for-loop-1.

As a result, runs always use the default "auto", and users cannot select CPU/GPU/MPS via pipeline inputs, which undercuts the goal of “KFP support for accelerator devices”.

To align with how other docling_* parameters are wired:

  • Add a root parameter, e.g. docling_accelerator_device (default "auto", optional, STRING) under root.inputDefinitions.parameters.
  • Add a pipelinechannel--docling_accelerator_device entry under comp-for-loop-1.inputDefinitions.parameters.
  • In comp-for-loop-1.dag.tasks.docling-convert-vlm.inputs.parameters, add:
              accelerator_device:
                componentInputParameter: pipelinechannel--docling_accelerator_device
  • Under root.dag.tasks.for-loop-1.inputs.parameters, map:
            pipelinechannel--docling_accelerator_device:
              componentInputParameter: docling_accelerator_device

Also applies to: 107-131, 575-587, 610-644


251-261: Docstring still omits accelerator_device.

accelerator_device is in the function signature, but the Args section doesn’t mention it, which makes the parameter opaque when reading the compiled spec.

Consider extending the docstring as follows:

-        remote_model_secret_mount_path: Path to the remote model secret mount path.
+        remote_model_secret_mount_path: Path to the remote model secret mount path.
+        accelerator_device: Device to use for acceleration ('auto', 'cpu', 'cuda', 'gpu', 'mps').
🧹 Nitpick comments (8)
.github/mergify.yml (1)

45-69: Consider guarding backport rule against breaking-change labeled PRs.

The new backport rule is a useful addition for maintaining stable branches. However, it currently creates a backport PR even if the original PR is labeled with breaking-change, which could result in unintended backports of incompatible changes.

Consider adding a condition to exclude breaking-change labeled PRs from automatic backporting:

  - name: Create backport PRs for labeled changes
    conditions:
      - merged
      - base=main
      - label=backport
+     - "label!=breaking-change"
    actions:

Alternatively, if all PRs with the backport label are intentionally stable changes regardless of the breaking-change label, this can be skipped.

Please confirm whether the backport rule should automatically backport PRs labeled with both backport and breaking-change.

tests/conftest.py (1)

11-19: LGTM with minor note on defensive filtering.

The function correctly discovers notebooks using glob with recursive search. The Path(f).exists() filter on line 17 is technically redundant since glob.glob only returns existing files, but it serves as defensive programming and doesn't impact functionality.

If you prefer to streamline the code slightly, you can remove the existence check:

-    # Convert to Path objects and filter out any non-existent files
-    notebook_paths = [Path(f) for f in notebook_files if Path(f).exists()]
+    # Convert to Path objects
+    notebook_paths = [Path(f) for f in notebook_files]
README.md (1)

7-27: Consider aligning “workbenches directory” wording with the new structure.

The paragraph mentioning “The workbenches directory…” no longer matches the tree below, which highlights custom-workbench-image. If workbenches was renamed or superseded, consider updating that sentence to reference the current directory name(s) to avoid confusion.

custom-workbench-image/dp-entrypoint.sh (1)

5-15: Paths look good; consider quoting $SRC_DIR in cp for robustness.

The new SRC_DIR/DEST_DIR values are consistent with the repo rename. To make the copy command safer against unexpected characters, you can quote the variable:

-  cp -r $SRC_DIR /opt/app-root/src/
+  cp -r "$SRC_DIR" /opt/app-root/src/
custom-workbench-image/README.md (1)

33-51: Docs match the new image setup; mirror the cp quoting improvement.

The README snippets correctly reflect the renamed repository (data-processing), updated TMP/SRC paths, and dp-entrypoint.sh usage.

To keep the example aligned with the script best practice, you can also quote SRC_DIR here:

-  cp -r $SRC_DIR /opt/app-root/src/
+  cp -r "$SRC_DIR" /opt/app-root/src/

Also applies to: 61-78, 136-146

kubeflow-pipelines/README.md (1)

24-25: Accelerator-device docs look good; consider clarifying parameter naming across layers.

The new defaults and the “Configuring accelerator device” section accurately describe the supported values (auto, cpu, cuda, gpu, mps) and the gpu → cuda alias, and the updated URLs/paths are consistent with the data-processing repo.

To avoid confusion between the KFP parameter name (docling_accelerator_device in the docs) and the internal component argument (accelerator_device in code), consider adding a short note that the pipeline’s docling_accelerator_device runtime parameter is passed through to the underlying components as accelerator_device:

-#### 7) Configuring accelerator device
+#### 7) Configuring accelerator device
@@
-- The parameter is case-insensitive and supports friendly aliases (`gpu` maps to `cuda`).
+- The parameter is case-insensitive and supports friendly aliases (`gpu` maps to `cuda`).
+- Internally, this pipeline parameter is forwarded to the underlying Docling components as the `accelerator_device` argument.

Also applies to: 59-60, 63-64, 69-75, 96-98, 161-171

kubeflow-pipelines/docling-standard/standard_components.py (1)

33-56: Document the new accelerator_device parameter in the docstring.

The new accelerator_device knob is implemented correctly but isn’t mentioned in the Args section, which can surprise readers of this component.

Consider extending the docstring like this:

-        enrich_picture_classes: Whether or not to enrich picture classes.
-        enrich_picture_description: Whether or not to enrich picture description.
+        enrich_picture_classes: Whether or not to enrich picture classes.
+        enrich_picture_description: Whether or not to enrich picture description.
+        accelerator_device: Device to use for acceleration. One of "auto", "cpu", "cuda", "gpu", or "mps".
kubeflow-pipelines/docling-standard/standard_convert_pipeline_compiled.yaml (1)

55-58: Wire accelerator_device through the standard pipeline so it’s configurable.

Right now accelerator_device is declared on comp-docling-convert-standard (lines 55–58) and used in the embedded function, but it’s never passed in the docling-convert-standard task within comp-for-loop-1 nor exposed as a root‑level pipeline input. That means runs will always use the default "auto" and callers cannot choose CPU/GPU/MPS at pipeline invocation time.

For consistency with the other docling_* knobs:

  • Add a root parameter, e.g. docling_accelerator_device (default "auto", optional, STRING).
  • Plumb it into comp-for-loop-1.inputDefinitions.parameters as pipelinechannel--docling_accelerator_device.
  • Pass it into the docling-convert-standard task’s accelerator_device input via a componentInputParameter.

This keeps the compiled spec aligned with the stated goal of making accelerator selection configurable in KFP.

Also applies to: 156-195, 717-786

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between cb254e0 and bd03127.

📒 Files selected for processing (29)
  • .github/dependabot.yml (1 hunks)
  • .github/mergify.yml (2 hunks)
  • .github/workflows/compile-kfp.yml (1 hunks)
  • .github/workflows/execute-all-notebooks.yml (1 hunks)
  • .github/workflows/execute-notebooks.yml (1 hunks)
  • .github/workflows/validate-and-execute-notebooks.yml (0 hunks)
  • .github/workflows/validate-notebooks.yml (1 hunks)
  • .github/workflows/validate-python.yml (1 hunks)
  • .gitignore (1 hunks)
  • Makefile (2 hunks)
  • README.md (3 hunks)
  • custom-workbench-image/Containerfile (1 hunks)
  • custom-workbench-image/README.md (4 hunks)
  • custom-workbench-image/dp-entrypoint.sh (1 hunks)
  • kubeflow-pipelines/README.md (5 hunks)
  • kubeflow-pipelines/docling-standard/README.md (2 hunks)
  • kubeflow-pipelines/docling-standard/standard_components.py (3 hunks)
  • kubeflow-pipelines/docling-standard/standard_convert_pipeline_compiled.yaml (4 hunks)
  • kubeflow-pipelines/docling-vlm/README.md (2 hunks)
  • kubeflow-pipelines/docling-vlm/vlm_components.py (3 hunks)
  • kubeflow-pipelines/docling-vlm/vlm_convert_pipeline_compiled.yaml (4 hunks)
  • notebooks/README.md (1 hunks)
  • pyproject.toml (1 hunks)
  • requirements-dev.txt (1 hunks)
  • scripts/subset_selection/README.md (1 hunks)
  • tests/.gitignore (1 hunks)
  • tests/README.md (1 hunks)
  • tests/conftest.py (1 hunks)
  • tests/test_notebook_parameters.py (1 hunks)
💤 Files with no reviewable changes (1)
  • .github/workflows/validate-and-execute-notebooks.yml
✅ Files skipped from review due to trivial changes (4)
  • notebooks/README.md
  • scripts/subset_selection/README.md
  • tests/README.md
  • tests/.gitignore
🧰 Additional context used
🧬 Code graph analysis (1)
tests/test_notebook_parameters.py (1)
tests/conftest.py (1)
  • get_notebook_files (11-19)
🪛 actionlint (1.7.9)
.github/workflows/execute-notebooks.yml

45-45: shellcheck reported issue in this script: SC2046:warning:8:44: Quote this to prevent word splitting

(shellcheck)

🪛 checkmake (0.2.2)
Makefile

[warning] 1-1: Missing required phony target "all"

(minphony)


[warning] 1-1: Missing required phony target "clean"

(minphony)


[warning] 1-1: Missing required phony target "test"

(minphony)

🪛 LanguageTool
README.md

[grammar] ~38-~38: Use a hyphen to join words.
Context: ...pelines. ### Notebooks Data processing related jupyter notebooks a...

(QB_NEW_EN_HYPHEN)

🪛 Pylint (4.0.4)
tests/test_notebook_parameters.py

[refactor] 16-16: Too few public methods (1/2)

(R0903)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
  • GitHub Check: execute_tests (notebooks/use-cases/document-conversion-standard.ipynb, https://raw.githubusercont...
  • GitHub Check: compile (docling-standard, kubeflow-pipelines/docling-standard, python standard_convert_pipeline....
  • GitHub Check: compile (docling-vlm, kubeflow-pipelines/docling-vlm, python vlm_convert_pipeline.py, vlm_convert...
  • GitHub Check: Summary
🔇 Additional comments (19)
.gitignore (1)

6-6: Good addition for broader venv directory support.

Adding "venv" (without leading dot) alongside the existing ".venv" entry ensures that virtual environment directories from different tools and conventions are properly ignored. This improves the developer experience across varied Python development workflows.

.github/mergify.yml (1)

5-12: Primary merge rule conditions look solid.

The auto-merge conditions are well-defined with appropriate safeguards (approval requirement, no requested changes, exclusion of do-not-merge/breaking-change labels, conflict detection).

pyproject.toml (1)

29-32: Pytest configuration is well-structured.

Standard configuration aligns with the new test suite infrastructure.

.github/workflows/validate-python.yml (1)

30-30: GitHub Actions versions updated appropriately.

Bumping to actions/checkout@v5 and actions/setup-python@v6 is reasonable and non-breaking.

Also applies to: 35-35

.github/workflows/compile-kfp.yml (1)

40-40: GitHub Actions versions updated consistently.

Aligns with other workflow modernization in the PR; no functional changes to pipeline compilation logic.

Also applies to: 43-43

.github/dependabot.yml (1)

1-23: Dependabot configuration is well-designed.

Sensible rules: direct dependencies only, major updates ignored, clear commit prefixes for categorization.

.github/workflows/execute-notebooks.yml (1)

1-55: Notebook execution workflow is well-structured.

Setup, installation, and execution steps follow a logical sequence. After fixing the shell quoting issue above, this workflow provides good CI coverage for notebook execution.

Please verify that the base64-encoded parameters are correctly decoded and applied within the notebook execution context (papermill handles the -b flag parameter injection).

.github/workflows/execute-all-notebooks.yml (1)

1-100: EC2 runner orchestration workflow is well-designed.

Proper OIDC setup, role assumption, cross-job dependencies, and cleanup (always runs). Resource tagging and AZ configuration show operational rigor. The execute-all-notebooks job is currently a placeholder; ensure the actual notebook execution logic is added before this workflow is used in production.

Confirm that the execute-all-notebooks job will be populated with notebook execution logic in a follow-up PR (currently only a setup step is present).

requirements-dev.txt (1)

3-5: Development dependencies align with new test infrastructure.

pytest, nbformat, and papermill support the notebook parameter validation tests and notebook execution workflows added in this PR.

Makefile (1)

1-1: Makefile target for notebook parameter tests is well-integrated.

Target follows existing patterns and is properly declared in .PHONY. Enables CI workflow to validate notebook parameter tagging.

Also applies to: 28-31

tests/conftest.py (1)

22-25: LGTM!

The fixture is correctly implemented and provides a clean interface for tests to access discovered notebook files.

.github/workflows/validate-notebooks.yml (1)

26-43: LGTM!

The workflow steps are well-structured and appropriate for notebook validation. Python setup with pip caching, dependency installation, and kernel configuration are all correctly implemented.

tests/test_notebook_parameters.py (3)

16-53: LGTM!

The NotebookParametersValidator class is well-implemented. The validation logic correctly:

  • Reads notebooks with NO_CONVERT to preserve structure
  • Validates notebook format with nbformat.validate()
  • Checks for code cells tagged with 'parameters'
  • Provides descriptive error messages

Regarding the Pylint hint (R0903): This warning is a false positive. The class serves as a reasonable organizational unit for notebook validation logic, and having a single public method is appropriate for this focused responsibility.
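
A minimal sketch of the validation approach described above, using nbformat's standard API (the class name matches the review; the exact body in tests/test_notebook_parameters.py may differ):

import nbformat

class NotebookParametersValidator:
    """Checks that a notebook is well-formed and has a 'parameters'-tagged code cell."""

    def validate(self, path: str) -> None:
        nb = nbformat.read(path, as_version=nbformat.NO_CONVERT)  # preserve structure
        nbformat.validate(nb)  # raises if the notebook format is invalid
        has_parameters_cell = any(
            cell.cell_type == "code"
            and "parameters" in cell.get("metadata", {}).get("tags", [])
            for cell in nb.cells
        )
        if not has_parameters_cell:
            raise AssertionError(
                f"{path}: no code cell tagged 'parameters'; papermill needs one to inject values"
            )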


56-70: LGTM!

The parameterized test correctly validates that each notebook contains a parameters cell. The assertion message is helpful, clearly explaining what's missing and why it's needed for papermill execution.


73-107: LGTM!

The validator self-test is thorough and well-structured. It correctly tests both negative and positive cases by creating minimal notebook structures. The use of try/finally blocks ensures proper cleanup of temporary files even if assertions fail.

kubeflow-pipelines/docling-vlm/README.md (1)

1-14: Accelerator device docs look aligned; please double-check parameter wiring.

The new docling_accelerator_device option, supported values, and default match the accelerator semantics used elsewhere in the repo, and the updated URLs/paths for data-processing look correct.

Please just verify that the KFP parameter name used in:

  • VLM components,
  • the pipeline definition,
  • the compiled YAML, and
  • this README

is consistent (including the gpu → cuda alias) so users don’t see mismatched names between UI and docs.

Also applies to: 27-30

custom-workbench-image/Containerfile (1)

51-67: Repo/entrypoint path updates are consistent.

The switch to cloning data-processing, using dp-entrypoint.sh, and updating NOTEBOOK_ROOT_DIR to /opt/app-root/src/data-processing/notebooks/use-cases is internally consistent with the entrypoint script and documentation.

kubeflow-pipelines/docling-standard/README.md (1)

1-14: Standard pipeline README updates align with accelerator support and repo rename.

The new docling_accelerator_device option (values and default) and the updated clone/import paths correctly reflect the accelerator support and the data-processing repository layout.

Also applies to: 35-38

kubeflow-pipelines/docling-vlm/vlm_convert_pipeline_compiled.yaml (1)

280-285: Accelerator mapping and validation look solid.

The device_map with aliases ("gpu" → CUDA) plus lowercase normalization and explicit ValueError on invalid keys is a clean pattern, and wiring device_map[dev] into AcceleratorOptions(device=...) keeps the spec aligned with the Docling enums.

Also applies to: 333-334

echo "Executing notebook '$NOTEBOOK' with file '$FILE'..."
papermill $NOTEBOOK $NOTEBOOK.tmp.ipynb -b $(echo -n "files: [\"$FILE\"]" | base64 -w 0)

⚠️ Potential issue | 🟠 Major

Quote the command substitution to prevent word splitting.

Shellcheck (SC2046) flags the unquoted base64 expansion. While unlikely to cause issues in practice, quote the output to adhere to shell best practices and silence the static analysis warning.

Apply this diff to fix line 53:

-          papermill $NOTEBOOK $NOTEBOOK.tmp.ipynb -b $(echo -n "files: [\"$FILE\"]" | base64 -w 0)
+          papermill $NOTEBOOK $NOTEBOOK.tmp.ipynb -b "$(echo -n "files: [\"$FILE\"]" | base64 -w 0)"
🤖 Prompt for AI Agents
.github/workflows/execute-notebooks.yml around line 53: the unquoted command
substitution for base64 can be word-split by the shell and triggers ShellCheck
SC2046; fix it by quoting the expansion so the -b argument receives a single
string — change the call to pass the output of the echo|base64 pipeline wrapped
in double quotes (i.e. quote the $(...) substitution) so the entire JSON/base64
payload is treated as one argument.

Comment on lines 19 to 24
validate_tests:
runs-on: ubuntu-latest
strategy:
matrix:
# Set the notebooks to validate, wildcards are allowed
notebooks_to_validate: ["notebooks/**/*.ipynb"]

⚠️ Potential issue | 🟡 Minor

Remove unused matrix configuration.

The matrix strategy defines notebooks_to_validate but this value is never referenced in any of the workflow steps. The Make targets (format-notebooks-check and test-notebook-parameters) operate on all notebooks discovered by the test configuration, making this matrix unnecessary.

Apply this diff to remove the unused matrix:

 jobs:
   validate_tests:
     runs-on: ubuntu-latest
-    strategy:
-      matrix:
-        # Set the notebooks to validate, wildcards are allowed
-        notebooks_to_validate: ["notebooks/**/*.ipynb"]
     steps:
       - uses: actions/checkout@v5

Alternatively, if you intend to use the matrix to selectively test notebooks, you'll need to pass ${{ matrix.notebooks_to_validate }} to the Make targets or test commands.

🤖 Prompt for AI Agents
.github/workflows/validate-notebooks.yml around lines 19 to 24: the workflow
defines a matrix key notebooks_to_validate that is never used; either remove the
unused matrix block entirely (delete the strategy.matrix.notebooks_to_validate
lines) so the job runs without an unused variable, or update the workflow steps
that call the Make targets to pass the matrix value (e.g., use ${{
matrix.notebooks_to_validate }} as an argument to the format-notebooks-check and
test-notebook-parameters targets) so the matrix actually controls which
notebooks are validated.

Comment on lines 36 to 51
### Notebooks

Data processing related [jupyter notebooks](notebooks) are broken down into [use-cases](notebooks/use-cases) and [tutorials](notebooks/tutorials).

### Custom Workbench Image

Open Data Hub has the ability for users to add and run [custom workbench images](https://docs.redhat.com/en/documentation/red_hat_openshift_ai_self-managed/2.24/html/managing_openshift_ai/creating-custom-workbench-images).

A sample `Containerfile` and instructions to create a custom workbench image are in [custom-workbench-image](custom-workbench-image).

### Scripts

Curated scripts related to data processing live in this directory.

For example the [subset selection](scripts/subset_selection) scripts allows users to identify representative samples from large training datasets.


⚠️ Potential issue | 🟡 Minor

Tighten wording for notebooks/scripts sentences.

Minor grammar/clarity fixes here will read better and address the LanguageTool warning:

  • Use a hyphenated compound and capitalize “Jupyter”.
  • Fix subject–verb agreement and add a small comma for readability.
-Data processing related [jupyter notebooks](notebooks) are broken down into [use-cases](notebooks/use-cases) and [tutorials](notebooks/tutorials). 
+Data-processing-related [Jupyter notebooks](notebooks) are broken down into [use-cases](notebooks/use-cases) and [tutorials](notebooks/tutorials).
@@
-For example the [subset selection](scripts/subset_selection) scripts allows users to identify representative samples from large training datasets.
+For example, the [subset selection](scripts/subset_selection) scripts allow users to identify representative samples from large training datasets.
🧰 Tools
🪛 LanguageTool

[grammar] ~38-~38: Use a hyphen to join words.
Context: ...pelines. ### Notebooks Data processing related jupyter notebooks a...

(QB_NEW_EN_HYPHEN)

🤖 Prompt for AI Agents
In README.md around lines 36 to 51, tighten wording: change "jupyter notebooks"
to "Jupyter notebooks" and hyphenate the compound as "Data-processing-related
Jupyter notebooks"; add a comma after "For example,"; hyphenate
"subset-selection" and fix subject–verb agreement by changing "scripts allows"
to "scripts allow". Ensure these small edits preserve links and casing.

@mergify

mergify bot commented Dec 8, 2025

🔀 Merge Conflict Detected

This pull request has merge conflicts that must be resolved before it can be merged.
@RobuRishabh please rebase your branch.

Syncing a Fork Guide

@RobuRishabh force-pushed the Add_accelerator_device_to_KFP branch from bd03127 to 0249d81 on December 8, 2025 at 20:22
@coderabbitai bot left a comment

Actionable comments posted: 0

♻️ Duplicate comments (3)
kubeflow-pipelines/docling-standard/standard_components.py (1)

127-133: Reuse a shared device_map to avoid drift between components

The device_map here is (per earlier review) identical to the one in the VLM components; keeping these literals duplicated makes it easy for the mappings to diverge over time.

Extract this mapping once (for example into kubeflow-pipelines/common/device_mapping.py) and import it in both standard_components.py and vlm_components.py, then re‑compile the pipelines so the compiled specs pick up the shared definition.

Please double‑check against the current docling.datamodel.accelerator_options.AcceleratorDevice API to ensure the set of supported devices is still correct.

Also applies to: 168-168
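
For concreteness, a minimal sketch of what the shared module could look like; the module path, constant, and helper name follow the suggestion above and do not exist in the repo yet:

# kubeflow-pipelines/common/device_mapping.py (proposed, not yet in the repo)
from docling.datamodel.accelerator_options import AcceleratorDevice

DEVICE_MAP = {
    "auto": AcceleratorDevice.AUTO,
    "cpu": AcceleratorDevice.CPU,
    "gpu": AcceleratorDevice.CUDA,  # "gpu" is accepted as an alias for CUDA
}


def resolve_device(accelerator_device: str) -> AcceleratorDevice:
    """Normalize user input and map it to a docling AcceleratorDevice."""
    dev = accelerator_device.strip().lower()
    if dev not in DEVICE_MAP:
        raise ValueError(
            f"Invalid accelerator_device '{accelerator_device}'. "
            f"Allowed values: {sorted(DEVICE_MAP)}"
        )
    return DEVICE_MAP[dev]

Both components would then call resolve_device(accelerator_device) instead of carrying their own literal map.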

kubeflow-pipelines/docling-standard/standard_convert_pipeline_compiled.yaml (2)

330-337: Add accelerator_device to the embedded component docstring

The embedded docling_convert_standard function signature includes accelerator_device, but the Args: section still omits it (same as in the source Python file and earlier review).

Add a brief line to keep the compiled spec self‑documenting, for example:

-        enrich_picture_description: Whether or not to enrich picture description.
+        enrich_picture_description: Whether or not to enrich picture description.
+        accelerator_device: Device to use for acceleration ('auto', 'cpu', 'cuda'/'gpu', 'mps').

Ideally, update the docstring in standard_components.py and re‑compile so this stays in sync.

Also applies to: 338-353


388-399: Same device_map DRY concern applies to the compiled component body

The device_map and AcceleratorOptions(device=device_map[dev]) logic here mirrors standard_components.py (and the VLM components). Once you extract device_map into a shared module and import it in the source component, re‑compiling will propagate that refactor into this embedded code as well, eliminating the copy‑pasted mapping.

No manual edits to this YAML should be needed after you refactor the Python source.

After refactoring the Python source to use a shared mapping, please re‑compile and confirm the generated YAML no longer hard‑codes a divergent device_map.

Also applies to: 411-411

🧹 Nitpick comments (2)
kubeflow-pipelines/docling-standard/standard_components.py (1)

33-33: Document accelerator_device and normalize input before lookup

The new accelerator_device parameter is good, but it isn’t mentioned in the docstring and whitespace isn’t stripped before validation:

  • Add an Args: entry documenting accepted values ("auto", "cpu", "cuda"/"gpu", "mps").
  • Consider trimming whitespace before .lower() to be more forgiving of user input:
-    dev = accelerator_device.lower()
+    dev = accelerator_device.strip().lower()

Also re‑compiling the pipeline after updating this docstring will automatically fix the embedded component docstring in the compiled YAML.

Also applies to: 134-138

kubeflow-pipelines/docling-standard/standard_convert_pipeline_compiled.yaml (1)

55-58: Clarify and (optionally) expose accelerator_device at the pipeline level

The accelerator_device parameter is defined for comp-docling-convert-standard but:

  • It has no description, so UIs won’t explain allowable values.
  • It isn’t threaded from root.inputDefinitions through comp-for-loop-1 into this component, so this pipeline can’t override the default "auto".

If you want pipeline consumers to choose the device, consider the following (see the sketch after this comment):

  • Adding a root parameter like docling_accelerator_device and documenting it in the top # Inputs section.
  • Wiring it through comp-for-loop-1.inputs.parameters into comp-docling-convert-standard.inputs.parameters.accelerator_device.

If the intent is to keep this as an internal knob, at least add a short description here mirroring the Python docstring.

Please validate the compiled spec in your KFP environment to ensure accelerator_device shows up and is settable as you intend.
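
To illustrate the wiring, a stripped-down sketch of the Python side (component and pipeline bodies reduced to the bare minimum; compiling a pipeline shaped like this is what produces the root input and per-task threading described above):

from kfp import dsl


@dsl.component(base_image="python:3.11")
def docling_convert_standard(accelerator_device: str = "auto") -> str:
    # The real component maps the alias onto docling's AcceleratorDevice here.
    return accelerator_device


@dsl.pipeline(name="docling-standard-convert")
def standard_convert_pipeline(docling_accelerator_device: str = "auto"):
    # Threading the root parameter down lets users override "auto" at submit time.
    docling_convert_standard(accelerator_device=docling_accelerator_device)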

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between bd03127 and 0249d81.

📒 Files selected for processing (7)
  • kubeflow-pipelines/README.md (3 hunks)
  • kubeflow-pipelines/docling-standard/README.md (1 hunks)
  • kubeflow-pipelines/docling-standard/standard_components.py (3 hunks)
  • kubeflow-pipelines/docling-standard/standard_convert_pipeline_compiled.yaml (4 hunks)
  • kubeflow-pipelines/docling-vlm/README.md (1 hunks)
  • kubeflow-pipelines/docling-vlm/vlm_components.py (3 hunks)
  • kubeflow-pipelines/docling-vlm/vlm_convert_pipeline_compiled.yaml (4 hunks)
🚧 Files skipped from review as they are similar to previous changes (5)
  • kubeflow-pipelines/docling-standard/README.md
  • kubeflow-pipelines/docling-vlm/README.md
  • kubeflow-pipelines/docling-vlm/vlm_components.py
  • kubeflow-pipelines/README.md
  • kubeflow-pipelines/docling-vlm/vlm_convert_pipeline_compiled.yaml
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: compile (docling-standard, kubeflow-pipelines/docling-standard, python standard_convert_pipeline....
  • GitHub Check: compile (docling-vlm, kubeflow-pipelines/docling-vlm, python vlm_convert_pipeline.py, vlm_convert...
  • GitHub Check: Summary

@RobuRishabh requested review from alinaryan and removed the request for fabianofranz on December 8, 2025 at 20:52
@shruthis4
Contributor

Hi Rishabh,

I tried running this on an RHOAI cluster and did not see a configurable parameter for the accelerator device. How can the user adjust this option if it isn't exposed?
Here is the pipeline run: pipeline run

Also, we have minimal test coverage for pipelines in our CI; that's something I want to get to soon. To prepare for that work, I would like to understand your validation strategy:
- What input docs did you use?
- Which pipeline parameters did you vary?
- How did you verify the output quality/format?
- Any failure scenarios you tried or would like to try out?

@RobuRishabh force-pushed the Add_accelerator_device_to_KFP branch from 0249d81 to a9d9cd4 on December 11, 2025 at 17:04
@coderabbitai bot left a comment

Actionable comments posted: 4

♻️ Duplicate comments (1)
kubeflow-pipelines/docling-vlm/vlm_convert_pipeline_compiled.yaml (1)

256-266: Docstring Args section still missing accelerator_device documentation.

The function signature includes accelerator_device: str = "auto" (line 256), but the Args section (lines 258-266) does not document it. A past review flagged this, but the compiled YAML still shows the parameter undocumented.

🧹 Nitpick comments (10)
kubeflow-pipelines/common/components.py (1)

31-31: HTTP retry logic is solid; consider small robustness tweaks

The new retry loop looks correct and is a good fit for flaky HTTP downloads:

  • Importing time inside the function matches the existing pattern for os/Path.
  • Per-file retry semantics are clear: up to max_retries attempts, break on first success, and a RuntimeError chained from the last RequestException on repeated failure.
  • Using requests.get(..., stream=True, timeout=30) with resp.raise_for_status() and chunked writes is appropriate for potentially large PDFs.

Two minor, non-blocking refinements you might consider:

  • Clean up partial files on total failure: If an exception happens after the file has been opened/written (e.g., mid‑stream), the last attempt may leave a truncated dest on disk before raising RuntimeError. You could defensively remove dest when attempt == max_retries in the except block, e.g.:

    if attempt == max_retries:
        try:
            dest.unlink(missing_ok=True)  # Python 3.8+: adjust if older
        except OSError:
            pass
        raise RuntimeError(
            f"Failed to download {url} after {max_retries} attempts"
        ) from exc

    Not strictly necessary in a failing component, but it keeps the artifact directory cleaner.

  • Optional configurability of retry parameters: If you anticipate tuning for different environments, max_retries / retry_delay_sec could be module‑level constants or read from env vars. For now, the hard‑coded defaults are reasonable and readable.

Overall, the change improves resilience without altering the S3 path or public interface.

Also applies to: 129-131, 136-161
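
For reference, a self-contained sketch of the retry shape described above (the function name and parameter defaults are illustrative, not the component's actual interface):

import time
from pathlib import Path

import requests


def download_pdf(url: str, dest: Path, max_retries: int = 3, retry_delay_sec: int = 5) -> None:
    """Download url to dest, retrying transient HTTP failures."""
    for attempt in range(1, max_retries + 1):
        try:
            resp = requests.get(url, stream=True, timeout=30)
            resp.raise_for_status()
            with dest.open("wb") as f:
                for chunk in resp.iter_content(chunk_size=8192):
                    f.write(chunk)
            return  # success; stop retrying
        except requests.exceptions.RequestException as exc:
            if attempt == max_retries:
                raise RuntimeError(
                    f"Failed to download {url} after {max_retries} attempts"
                ) from exc
            time.sleep(retry_delay_sec)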

kubeflow-pipelines/docling-standard/standard_components.py (1)

33-56: Add accelerator_device to the docstring.

The new accelerator_device parameter is missing from the docstring. Consider adding documentation for it.

         enrich_picture_classes: Whether or not to enrich picture classes.
         enrich_picture_description: Whether or not to enrich picture description.
+        accelerator_device: Device to use for acceleration (auto, cpu, cuda, gpu, mps).
     """
kubeflow-pipelines/docling-vlm/local_run.py (2)

92-143: validate_output is defined but never called.

This function performs useful output validation but is not invoked anywhere in the test flow. Either integrate it into run_test_scenario after successful runs, or remove it to avoid dead code.

To integrate validation, add a call in the success path of run_test_scenario:

# After successful pipeline run, validate output
# output_dir would need to be determined from the local runner's output location
validation = validate_output(output_dir, test_name)
if not validation["passed"]:
    return {"test_name": test_name, "status": "FAIL", "reason": "Validation failed", ...}

189-215: Remove unnecessary else after return.

Per static analysis, the else block after return at line 204 is unnecessary.

                 return {
                     "test_name": test_name,
                     "status": "PASS",
                     "reason": "Failed as expected - validation working",
                     "error": error_msg[:200],
                     "elapsed_time": elapsed,
                 }
-            else:
-                print(f"❌ TEST FAILED: {test_name} - Wrong error type")
-                print(f"Expected error containing: '{expected_error}'")
-                print(f"Got: {error_msg[:200]}")
-                return {
-                    "test_name": test_name,
-                    "status": "FAIL",
-                    "reason": f"Wrong error type",
-                    "error": error_msg,
-                    "elapsed_time": elapsed,
-                }
+            print(f"❌ TEST FAILED: {test_name} - Wrong error type")
+            print(f"Expected error containing: '{expected_error}'")
+            print(f"Got: {error_msg[:200]}")
+            return {
+                "test_name": test_name,
+                "status": "FAIL",
+                "reason": f"Wrong error type",
+                "error": error_msg,
+                "elapsed_time": elapsed,
+            }
kubeflow-pipelines/docling-standard/standard_convert_pipeline_compiled.yaml (2)

56-59: Add a description for the accelerator_device parameter.

The accelerator_device parameter definition is missing a description field, unlike other parameters in this component (e.g., allow_external_plugins, enrich_code). This would improve discoverability in the pipeline UI.

         accelerator_device:
           defaultValue: auto
+          description: Device to use for acceleration ('auto', 'cpu', 'cuda', 'gpu', 'mps').
           isOptional: true
           parameterType: STRING

725-728: Consider adding a description at the pipeline root level as well.

For consistency with other docling_* pipeline parameters, a description would help users understand the parameter's purpose in the pipeline UI.

kubeflow-pipelines/docling-vlm/vlm_convert_pipeline_compiled.yaml (1)

47-50: Add a description for the accelerator_device parameter.

Similar to the standard pipeline, this parameter lacks a description field. Adding one would improve discoverability:

         accelerator_device:
           defaultValue: auto
+          description: Device to use for acceleration ('auto', 'cpu', 'cuda', 'gpu', 'mps').
           isOptional: true
           parameterType: STRING
kubeflow-pipelines/docling-standard/local_run.py (3)

147-205: validate_output function is defined but never called.

This function validates output quality/format but is not used in the test harness. Consider either:

  1. Integrating it into successful test scenarios in run_test_scenario
  2. Removing it if not needed

If intended for future use, adding a TODO comment would clarify intent.

Would you like me to generate code that integrates this validation into the test flow?


256-278: Consider removing unnecessary else after return (optional).

Per static analysis hint R1705, the else block at line 268 is unnecessary since line 267 returns. This is a minor style improvement:

                 return {
                     "test_name": test_name,
                     "status": "PASS",
                     "reason": "Failed as expected - validation working",
                     "error": error_msg[:200],
                     "elapsed_time": elapsed,
                 }
-            else:
-                print(f"❌ TEST FAILED: {test_name} - Wrong error type")
-                print(f"Expected error containing: '{expected_error}'")
-                print(f"Got: {error_msg[:200]}")
-                return {
-                    "test_name": test_name,
-                    "status": "FAIL",
-                    "reason": f"Wrong error type",
-                    "error": error_msg,
-                    "elapsed_time": elapsed,
-                }
+            print(f"❌ TEST FAILED: {test_name} - Wrong error type")
+            print(f"Expected error containing: '{expected_error}'")
+            print(f"Got: {error_msg[:200]}")
+            return {
+                "test_name": test_name,
+                "status": "FAIL",
+                "reason": f"Wrong error type",
+                "error": error_msg,
+                "elapsed_time": elapsed,
+            }

322-322: PASS_WITH_WARNINGS status is handled but never produced.

The summary logic counts and displays PASS_WITH_WARNINGS status (lines 322, 328, 336-337), but no code path in run_test_scenario ever returns this status. Either implement warning scenarios or remove the dead handling code.

Also applies to: 336-337

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 0249d81 and a9d9cd4.

📒 Files selected for processing (13)
  • .gitignore (1 hunks)
  • kubeflow-pipelines/README.md (3 hunks)
  • kubeflow-pipelines/common/components.py (2 hunks)
  • kubeflow-pipelines/docling-standard/README.md (1 hunks)
  • kubeflow-pipelines/docling-standard/local_run.py (3 hunks)
  • kubeflow-pipelines/docling-standard/standard_components.py (3 hunks)
  • kubeflow-pipelines/docling-standard/standard_convert_pipeline.py (2 hunks)
  • kubeflow-pipelines/docling-standard/standard_convert_pipeline_compiled.yaml (9 hunks)
  • kubeflow-pipelines/docling-vlm/README.md (1 hunks)
  • kubeflow-pipelines/docling-vlm/local_run.py (4 hunks)
  • kubeflow-pipelines/docling-vlm/vlm_components.py (3 hunks)
  • kubeflow-pipelines/docling-vlm/vlm_convert_pipeline.py (2 hunks)
  • kubeflow-pipelines/docling-vlm/vlm_convert_pipeline_compiled.yaml (9 hunks)
✅ Files skipped from review due to trivial changes (1)
  • .gitignore
🚧 Files skipped from review as they are similar to previous changes (2)
  • kubeflow-pipelines/docling-vlm/vlm_components.py
  • kubeflow-pipelines/README.md
🧰 Additional context used
🧬 Code graph analysis (1)
kubeflow-pipelines/docling-standard/local_run.py (1)
kubeflow-pipelines/docling-vlm/local_run.py (4)
  • convert_pipeline_test (58-89)
  • validate_output (92-143)
  • run_test_scenario (146-223)
  • main (226-274)
🪛 GitHub Actions: Python Formatting Validation
kubeflow-pipelines/docling-vlm/local_run.py

[error] 1-1: Ruff format check would reformat this file. Run 'ruff format' to fix code style issues.

kubeflow-pipelines/docling-standard/local_run.py

[error] 1-1: Ruff format check would reformat this file. Run 'ruff format' to fix code style issues.

🪛 Pylint (4.0.4)
kubeflow-pipelines/docling-vlm/local_run.py

[refactor] 193-215: Unnecessary "else" after "return", remove the "else" and de-indent the code inside it

(R1705)

kubeflow-pipelines/docling-standard/local_run.py

[refactor] 105-105: Too many positional arguments (8/5)

(R0917)


[refactor] 256-278: Unnecessary "else" after "return", remove the "else" and de-indent the code inside it

(R1705)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: compile (docling-standard, kubeflow-pipelines/docling-standard, python standard_convert_pipeline....
  • GitHub Check: compile (docling-vlm, kubeflow-pipelines/docling-vlm, python vlm_convert_pipeline.py, vlm_convert...
  • GitHub Check: Summary
🔇 Additional comments (9)
kubeflow-pipelines/docling-vlm/README.md (1)

13-13: LGTM!

The documentation accurately describes the new docling_accelerator_device parameter with correct device options and default value that align with the implementation.

kubeflow-pipelines/docling-standard/README.md (1)

13-13: LGTM!

Documentation correctly describes the accelerator device configuration options, consistent with the VLM pipeline README and the implementation.

kubeflow-pipelines/docling-standard/standard_convert_pipeline.py (1)

37-37: LGTM!

The docling_accelerator_device parameter is properly added to the pipeline signature with a sensible default and correctly wired to the component call.

Also applies to: 85-85

kubeflow-pipelines/docling-vlm/vlm_convert_pipeline.py (1)

28-28: LGTM!

The accelerator device parameter is consistently implemented in the VLM pipeline, matching the pattern used in the standard pipeline.

Also applies to: 69-69

kubeflow-pipelines/docling-standard/standard_components.py (2)

127-138: Validation logic is sound.

The device mapping and validation are correctly implemented with case-insensitive matching, a helpful "gpu" alias for "cuda", and a clear error message listing valid options.

Note: The device_map duplication with vlm_components.py was already flagged in a previous review for extraction to a shared module.


167-169: LGTM!

The AcceleratorOptions correctly uses the validated device from the map.

kubeflow-pipelines/docling-vlm/local_run.py (1)

18-54: Good test coverage for accelerator device validation.

The test configurations cover baseline CPU, auto-detection, and failure scenarios for invalid device and image mode inputs. The commented-out GPU/MPS tests are a reasonable approach for environment-specific testing.

kubeflow-pipelines/docling-vlm/vlm_convert_pipeline_compiled.yaml (1)

285-295: LGTM on device validation logic.

The device_map validation is correctly implemented and consistent with the standard pipeline. The case-insensitive comparison via .lower() and clear error message with allowed values are good practices.

kubeflow-pipelines/docling-standard/local_run.py (1)

17-61: Good test coverage for accelerator device configurations.

The TEST_CONFIGS provides a reasonable baseline for testing CPU and auto device modes, along with variations in backends and table modes. The commented-out GPU/MPS tests are appropriate for environments without those accelerators.

Comment on lines +1 to +5
import sys
from pathlib import Path
from typing import List
import json
import time

⚠️ Potential issue | 🟡 Minor

Run ruff format to fix formatting issues.

The pipeline is failing due to formatting violations. Run ruff format kubeflow-pipelines/docling-standard/local_run.py to resolve.

🧰 Tools
🪛 GitHub Actions: Python Formatting Validation

[error] 1-1: Ruff format check would reformat this file. Run 'ruff format' to fix code style issues.

🤖 Prompt for AI Agents
In kubeflow-pipelines/docling-standard/local_run.py around lines 1 to 5, the
file has formatting violations; run the automatic formatter to fix them by
executing `ruff format kubeflow-pipelines/docling-standard/local_run.py` (or run
`ruff format` for the repo) and then stage the changes—this will reformat
imports and whitespace to satisfy the linter.

@RobuRishabh force-pushed the Add_accelerator_device_to_KFP branch from a9d9cd4 to 17924e1 on December 11, 2025 at 17:28
@coderabbitai bot left a comment

Actionable comments posted: 0

♻️ Duplicate comments (1)
kubeflow-pipelines/docling-vlm/vlm_components.py (1)

75-86: Address the device_map duplication.

The device_map dictionary is duplicated in both vlm_components.py and standard_components.py, violating DRY principles and creating a maintenance burden if the mapping needs updating.

Consider extracting this to a shared module (e.g., common/device_mapping.py) that both components can import.

🧹 Nitpick comments (3)
kubeflow-pipelines/docling-vlm/local_run.py (1)

151-225: LGTM: Test scenario execution logic is sound.

The run_test_scenario function correctly:

  • Filters metadata keys before passing config to the pipeline
  • Handles both expected success and expected failure cases
  • Provides detailed error reporting with truncated messages

Optionally, the else at line 207 could be removed per Pylint R1705, but the current structure is clear and readable.

kubeflow-pipelines/docling-standard/local_run.py (1)

215-292: LGTM: Test execution logic handles all scenarios correctly.

The run_test_scenario function properly filters metadata keys and handles both expected successes and failures with detailed error matching.

Optionally, the else at line 274 could be removed per Pylint R1705, but readability is fine as-is.

kubeflow-pipelines/common/components.py (1)

129-161: Retry loop is solid; consider minor tweaks for configurability and error handling

The bounded retry loop around the HTTP download is well structured and should significantly reduce flakiness from transient network issues. A couple of optional refinements you might consider:

  • Make max_retries and retry_delay_sec configurable (e.g., via function args or env vars) so operators can tune behavior for slow or unreliable networks without code changes (see the sketch after this list).
  • Differentiate between retriable and non-retriable HTTP errors (e.g., avoid retrying clear client errors like 4xx, while still retrying timeouts and 5xx), to avoid unnecessary repeated attempts for permanent failures.
  • If HTTP-based imports tend to be flaky, you might also want to mirror a similar retry mechanism on the S3 path for consistency.

These are purely optional; the current implementation is correct and already a good improvement in robustness.
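
A minimal sketch of the env-var option from the first bullet (the variable names are hypothetical):

import os

# Hypothetical environment knobs; fall back to the current hard-coded defaults.
max_retries = int(os.environ.get("PDF_DOWNLOAD_MAX_RETRIES", "3"))
retry_delay_sec = float(os.environ.get("PDF_DOWNLOAD_RETRY_DELAY_SEC", "5"))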

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between a9d9cd4 and 17924e1.

📒 Files selected for processing (13)
  • .gitignore (1 hunks)
  • kubeflow-pipelines/README.md (3 hunks)
  • kubeflow-pipelines/common/components.py (2 hunks)
  • kubeflow-pipelines/docling-standard/README.md (1 hunks)
  • kubeflow-pipelines/docling-standard/local_run.py (3 hunks)
  • kubeflow-pipelines/docling-standard/standard_components.py (3 hunks)
  • kubeflow-pipelines/docling-standard/standard_convert_pipeline.py (2 hunks)
  • kubeflow-pipelines/docling-standard/standard_convert_pipeline_compiled.yaml (9 hunks)
  • kubeflow-pipelines/docling-vlm/README.md (1 hunks)
  • kubeflow-pipelines/docling-vlm/local_run.py (4 hunks)
  • kubeflow-pipelines/docling-vlm/vlm_components.py (3 hunks)
  • kubeflow-pipelines/docling-vlm/vlm_convert_pipeline.py (2 hunks)
  • kubeflow-pipelines/docling-vlm/vlm_convert_pipeline_compiled.yaml (9 hunks)
🚧 Files skipped from review as they are similar to previous changes (5)
  • .gitignore
  • kubeflow-pipelines/docling-standard/README.md
  • kubeflow-pipelines/docling-standard/standard_components.py
  • kubeflow-pipelines/docling-vlm/vlm_convert_pipeline.py
  • kubeflow-pipelines/README.md
🧰 Additional context used
🧬 Code graph analysis (2)
kubeflow-pipelines/docling-vlm/local_run.py (1)
kubeflow-pipelines/docling-standard/local_run.py (4)
  • convert_pipeline_test (106-145)
  • validate_output (148-212)
  • run_test_scenario (215-292)
  • main (295-361)
kubeflow-pipelines/docling-standard/local_run.py (1)
kubeflow-pipelines/docling-vlm/local_run.py (4)
  • convert_pipeline_test (59-90)
  • validate_output (93-148)
  • run_test_scenario (151-225)
  • main (228-278)
🪛 Pylint (4.0.4)
kubeflow-pipelines/docling-vlm/local_run.py

[refactor] 195-217: Unnecessary "else" after "return", remove the "else" and de-indent the code inside it

(R1705)

kubeflow-pipelines/docling-standard/local_run.py

[refactor] 106-106: Too many positional arguments (8/5)

(R0917)


[refactor] 262-284: Unnecessary "else" after "return", remove the "else" and de-indent the code inside it

(R1705)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: compile (docling-vlm, kubeflow-pipelines/docling-vlm, python vlm_convert_pipeline.py, vlm_convert...
  • GitHub Check: compile (docling-standard, kubeflow-pipelines/docling-standard, python standard_convert_pipeline....
  • GitHub Check: Summary
🔇 Additional comments (19)
kubeflow-pipelines/docling-vlm/vlm_components.py (2)

25-25: LGTM: Parameter addition is clean.

The accelerator_device parameter with default value "auto" follows the existing parameter patterns and integrates well with the function signature.


178-180: LGTM: Dynamic device selection implemented correctly.

The AcceleratorOptions now uses the validated device from device_map[dev], enabling runtime device configuration as intended.

kubeflow-pipelines/docling-vlm/vlm_convert_pipeline_compiled.yaml (2)

47-50: LGTM: Component input definition is correct.

The accelerator_device parameter is properly defined in the component's inputDefinitions with appropriate default value and type.


119-120: LGTM: Pipeline wiring is complete and correct.

The docling_accelerator_device parameter is correctly wired through all pipeline levels:

  • Root pipeline input (lines 619-622)
  • For-loop task input (lines 584-585)
  • Pipeline channel declaration (lines 151-152)
  • Component parameter binding (lines 119-120)

Also applies to: 151-152, 584-585, 619-622

kubeflow-pipelines/docling-vlm/local_run.py (5)

19-41: LGTM: Test configurations are well-structured.

The TEST_CONFIGS provides good coverage with baseline CPU and auto-detect scenarios. GPU and MPS tests are appropriately commented out for environments without specialized hardware.


43-55: LGTM: Failure scenarios provide good negative test coverage.

The FAILURE_SCENARIOS appropriately test validation logic for invalid device and image mode inputs.


59-90: LGTM: Pipeline test function correctly forwards parameters.

The convert_pipeline_test function properly accepts and forwards all docling configuration parameters, including the new accelerator_device, to the VLM converter.


93-148: LGTM: Output validation is comprehensive.

The validate_output function thoroughly checks both JSON and Markdown outputs with appropriate error handling and structured results.


228-279: LGTM: Test runner orchestration is well-implemented.

The main function:

  • Properly initializes the Docker runner
  • Executes both normal and failure test phases
  • Uses config.get("should_fail", True) (line 251) which correctly avoids mutating the FAILURE_SCENARIOS dictionary
  • Provides clear test summary output
kubeflow-pipelines/docling-vlm/README.md (1)

13-13: LGTM: Documentation clearly describes the new parameter.

The docling_accelerator_device parameter is well-documented with all valid values and the default clearly stated.

kubeflow-pipelines/docling-standard/standard_convert_pipeline.py (1)

37-37: LGTM: Accelerator device parameter correctly added and forwarded.

The docling_accelerator_device parameter is added to the pipeline signature (line 37) and properly forwarded to docling_convert_standard (line 85).

Also applies to: 85-85

kubeflow-pipelines/docling-standard/standard_convert_pipeline_compiled.yaml (2)

56-59: LGTM: Pipeline wiring is complete and consistent.

The docling_accelerator_device parameter is correctly wired through all pipeline levels, following the same pattern as the VLM pipeline.

Also applies to: 168-169, 215-216, 672-673, 725-728


393-398: LGTM: Device mapping and validation logic is correctly embedded.

The compiled YAML includes the complete device_map validation logic (lines 393-398) and proper usage in AcceleratorOptions (line 416).

Also applies to: 416-416

kubeflow-pipelines/docling-standard/local_run.py (5)

18-62: LGTM: Comprehensive test configurations.

The TEST_CONFIGS provides excellent coverage with scenarios testing different backends (pypdfium2), table modes (fast), and device options. GPU/MPS tests are appropriately commented out.


64-102: LGTM: Thorough failure scenario coverage.

The FAILURE_SCENARIOS comprehensively test validation for all major parameters: device, backend, table mode, image mode, and OCR engine.


106-145: LGTM: Pipeline test function with comprehensive parameter support.

The convert_pipeline_test function correctly accepts and forwards all standard pipeline parameters including accelerator_device, backend, table mode, OCR settings, etc.


148-212: LGTM: Robust output validation with pass/fail logic.

The validate_output function includes comprehensive checks and aggregates them into an overall pass/fail result (lines 204-210), which is an improvement over the VLM version.


295-361: LGTM: Well-structured test runner with clear reporting.

The main function:

  • Correctly uses config.get("should_fail", True) (line 318) to avoid mutation
  • Provides emoji-rich output for better readability
  • Includes detailed error preview with truncation (lines 351-356)
  • Exits with appropriate status code
kubeflow-pipelines/common/components.py (1)

31-31: Local time import for retry delay is appropriate

Using a local time import here is consistent with the other stdlib imports in this component and keeps the dependency surface minimal at module import time.

@RobuRishabh force-pushed the Add_accelerator_device_to_KFP branch from 17924e1 to 384f15f on December 11, 2025 at 18:03
@coderabbitai bot left a comment

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
kubeflow-pipelines/docling-vlm/vlm_components.py (1)

25-40: Add accelerator_device to the docstring Args section.

The new accelerator_device parameter is missing from the docstring. This omission makes the parameter's purpose and valid values unclear to users.

         remote_model_secret_mount_path: Path to the remote model secret mount path.
+        accelerator_device: Device for acceleration ('auto', 'cpu', 'cuda', 'gpu', 'mps').
     """
kubeflow-pipelines/docling-standard/standard_components.py (1)

33-56: Add accelerator_device to the docstring Args section.

The new accelerator_device parameter is missing from the Args documentation.

         enrich_picture_description: Whether or not to enrich picture description.
+        accelerator_device: Device for acceleration ('auto', 'cpu', 'cuda', 'gpu', 'mps').
     """
♻️ Duplicate comments (3)
kubeflow-pipelines/docling-vlm/vlm_components.py (1)

75-86: Device mapping and validation logic is correct.

The implementation properly handles case-insensitive input, includes a helpful "gpu" alias for "cuda", and provides clear error messages. The device_map duplication with standard_components.py was previously flagged and can be addressed as a follow-up refactor.

kubeflow-pipelines/docling-standard/standard_components.py (1)

127-138: Device mapping and validation is correct; DRY refactor noted.

The implementation correctly handles case-insensitive device matching and validation. The duplication with vlm_components.py was already flagged in a previous review for extraction to a shared module.

kubeflow-pipelines/docling-standard/local_run.py (1)

339-358: Status emoji logic handles non-existent PASS_WITH_WARNINGS status.

Lines 342-343 handle "PASS_WITH_WARNINGS" which is never produced. This is consistent with the counting issue noted above.

🧹 Nitpick comments (4)
kubeflow-pipelines/docling-vlm/local_run.py (2)

93-148: validate_output function is defined but never called.

The validate_output function validates output artifacts but is not invoked anywhere in the test harness. If output validation is intended, call it after successful pipeline runs in run_test_scenario.

         print(f"✅ TEST PASSED: {test_name} - Completed in {elapsed:.2f}s")
+        # Optionally validate outputs here:
+        # validation = validate_output(output_dir, test_name)
         return {
             "test_name": test_name,
             "status": "PASS",
             "elapsed_time": elapsed,
         }

191-217: Consider removing unnecessary else after return.

Per static analysis, the else block at line 207 can be de-indented since the preceding if block returns.

             if "FAILURE" in error_msg or expected_error.lower() in error_msg.lower():
                 print(f"✅ TEST PASSED: {test_name} - Failed as expected")
                 print(f"Pipeline correctly rejected invalid parameter")
                 if expected_error:
                     print(f"Expected error type: '{expected_error}'")
                 return {
                     "test_name": test_name,
                     "status": "PASS",
                     "reason": "Failed as expected - validation working",
                     "error": error_msg[:200],
                     "elapsed_time": elapsed,
                 }
-            else:
-                print(f"❌ TEST FAILED: {test_name} - Wrong error type")
-                print(f"Expected error containing: '{expected_error}'")
-                print(f"Got: {error_msg[:200]}")
-                return {
-                    "test_name": test_name,
-                    "status": "FAIL",
-                    "reason": f"Wrong error type",
-                    "error": error_msg,
-                    "elapsed_time": elapsed,
-                }
+            print(f"❌ TEST FAILED: {test_name} - Wrong error type")
+            print(f"Expected error containing: '{expected_error}'")
+            print(f"Got: {error_msg[:200]}")
+            return {
+                "test_name": test_name,
+                "status": "FAIL",
+                "reason": f"Wrong error type",
+                "error": error_msg,
+                "elapsed_time": elapsed,
+            }
kubeflow-pipelines/docling-standard/local_run.py (2)

148-212: validate_output function is defined but never called.

Similar to the VLM test harness, this function is not invoked. Consider calling it after successful pipeline runs to validate output quality.


262-284: Consider removing unnecessary else after return.

Per static analysis, the else block can be de-indented since the preceding if block returns.

             if "FAILURE" in error_msg or expected_error.lower() in error_msg.lower():
                 ...
                 return {
                     ...
                 }
-            else:
-                print(f"❌ TEST FAILED: {test_name} - Wrong error type")
-                ...
-                return {
-                    ...
-                }
+            print(f"❌ TEST FAILED: {test_name} - Wrong error type")
+            ...
+            return {
+                ...
+            }
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 17924e1 and 384f15f.

📒 Files selected for processing (13)
  • .gitignore (1 hunks)
  • kubeflow-pipelines/README.md (3 hunks)
  • kubeflow-pipelines/common/components.py (2 hunks)
  • kubeflow-pipelines/docling-standard/README.md (1 hunks)
  • kubeflow-pipelines/docling-standard/local_run.py (3 hunks)
  • kubeflow-pipelines/docling-standard/standard_components.py (3 hunks)
  • kubeflow-pipelines/docling-standard/standard_convert_pipeline.py (2 hunks)
  • kubeflow-pipelines/docling-standard/standard_convert_pipeline_compiled.yaml (11 hunks)
  • kubeflow-pipelines/docling-vlm/README.md (1 hunks)
  • kubeflow-pipelines/docling-vlm/local_run.py (4 hunks)
  • kubeflow-pipelines/docling-vlm/vlm_components.py (3 hunks)
  • kubeflow-pipelines/docling-vlm/vlm_convert_pipeline.py (2 hunks)
  • kubeflow-pipelines/docling-vlm/vlm_convert_pipeline_compiled.yaml (11 hunks)
🚧 Files skipped from review as they are similar to previous changes (7)
  • kubeflow-pipelines/docling-standard/standard_convert_pipeline.py
  • .gitignore
  • kubeflow-pipelines/common/components.py
  • kubeflow-pipelines/README.md
  • kubeflow-pipelines/docling-standard/README.md
  • kubeflow-pipelines/docling-vlm/README.md
  • kubeflow-pipelines/docling-vlm/vlm_convert_pipeline.py
🧰 Additional context used
🧬 Code graph analysis (1)
kubeflow-pipelines/docling-standard/local_run.py (1)
kubeflow-pipelines/docling-vlm/local_run.py (4)
  • convert_pipeline_test (59-90)
  • validate_output (93-148)
  • run_test_scenario (151-225)
  • main (228-278)
🪛 Pylint (4.0.4)
kubeflow-pipelines/docling-standard/local_run.py

[refactor] 106-106: Too many positional arguments (8/5)

(R0917)


[refactor] 262-284: Unnecessary "else" after "return", remove the "else" and de-indent the code inside it

(R1705)

kubeflow-pipelines/docling-vlm/local_run.py

[refactor] 195-217: Unnecessary "else" after "return", remove the "else" and de-indent the code inside it

(R1705)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: compile (docling-standard, kubeflow-pipelines/docling-standard, python standard_convert_pipeline....
  • GitHub Check: compile (docling-vlm, kubeflow-pipelines/docling-vlm, python vlm_convert_pipeline.py, vlm_convert...
  • GitHub Check: Summary
🔇 Additional comments (10)
kubeflow-pipelines/docling-vlm/vlm_components.py (1)

178-180: LGTM!

The accelerator device is correctly passed to AcceleratorOptions.

kubeflow-pipelines/docling-standard/standard_components.py (1)

167-169: LGTM!

The accelerator device is correctly passed to AcceleratorOptions.

kubeflow-pipelines/docling-standard/standard_convert_pipeline_compiled.yaml (3)

56-59: LGTM!

The accelerator_device parameter is correctly defined with default value "auto", marked as optional, and typed as STRING.


168-169: Pipeline wiring is correctly configured.

The accelerator_device parameter is properly wired from the pipeline channel through the for-loop task to the component.

Also applies to: 215-216


739-742: Root input definition is correct.

The docling_accelerator_device parameter is properly exposed at the root level with appropriate defaults, making it configurable at runtime.

kubeflow-pipelines/docling-vlm/local_run.py (1)

228-279: Test harness looks well-structured.

The main() function properly initializes the Docker runner, runs both normal and failure scenarios, and provides clear summary output. The mutation issue with config.pop() was addressed in a previous commit.

kubeflow-pipelines/docling-vlm/vlm_convert_pipeline_compiled.yaml (2)

47-50: LGTM!

The accelerator_device parameter is correctly defined at the component level and properly wired through the for-loop task via pipeline channels.

Also applies to: 119-120, 151-152


598-599: Root-level wiring is correctly configured.

The docling_accelerator_device parameter is properly exposed at the root level and wired to the for-loop, making it configurable at runtime. This addresses the previous review comment about the missing wiring.

Also applies to: 633-636

kubeflow-pipelines/docling-standard/local_run.py (2)

105-145: Pipeline test function is well-structured.

The convert_pipeline_test function correctly wires all the new parameters through to the component. The "too many positional arguments" static analysis hint is acceptable here as this is a pipeline definition with many configuration options.


317-320: Mutation issue correctly fixed.

The code uses config.get("should_fail", True) instead of config.pop(), preserving the original FAILURE_SCENARIOS dictionary across runs.

Comment on lines +327 to +329

    passed = sum(1 for r in results if r["status"] == "PASS")
    warnings = sum(1 for r in results if r["status"] == "PASS_WITH_WARNINGS")
    failed = sum(1 for r in results if r["status"] == "FAIL")

⚠️ Potential issue | 🟡 Minor

PASS_WITH_WARNINGS status is counted but never produced.

The code counts results with status "PASS_WITH_WARNINGS" (line 328), but run_test_scenario only returns "PASS" or "FAIL". Either remove the warnings counting or add logic to produce this status.

     passed = sum(1 for r in results if r["status"] == "PASS")
-    warnings = sum(1 for r in results if r["status"] == "PASS_WITH_WARNINGS")
     failed = sum(1 for r in results if r["status"] == "FAIL")
     ...
     print(f"✅ Passed: {passed}")
-    print(f"⚠️  Warnings: {warnings}")
     print(f"❌ Failed: {failed}")
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
     passed = sum(1 for r in results if r["status"] == "PASS")
-    warnings = sum(1 for r in results if r["status"] == "PASS_WITH_WARNINGS")
     failed = sum(1 for r in results if r["status"] == "FAIL")
🤖 Prompt for AI Agents
In kubeflow-pipelines/docling-standard/local_run.py around lines 327-329, the
code sums statuses for "PASS_WITH_WARNINGS" even though run_test_scenario only
returns "PASS" or "FAIL"; remove the unused warnings counting or change
run_test_scenario to return "PASS_WITH_WARNINGS" where appropriate. To fix,
either delete the warnings variable and any downstream usage of it (and adjust
reporting to only use passed/failed), or update run_test_scenario to detect
warning conditions and return "PASS_WITH_WARNINGS" consistently so the counting
is meaningful.
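
One way to make the third status meaningful, as a hedged sketch: a small helper (name hypothetical) that run_test_scenario could apply to the validate_output result after a successful run:

def classify_result(validation: dict) -> str:
    """Map a validate_output result onto the three summary statuses."""
    if validation["passed"]:
        return "PASS"
    if validation["checks"].get("json_exists") or validation["checks"].get("md_exists"):
        # Partial output was produced; surface it as a warning, not a hard failure.
        return "PASS_WITH_WARNINGS"
    return "FAIL"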

- Add accelerator_device parameter to the docling_convert_standard and docling_convert_vlm components
- Update README documentation
- Add validation and user-friendly error messages
- Recompile pipeline YAML files
- Add tests and validation to the local runners and update components with retry logic
- Remove repeated runner initialization
- Remove mps and cuda from accelerator devices

Signed-off-by: roburishabh <[email protected]>
@RobuRishabh force-pushed the Add_accelerator_device_to_KFP branch from 384f15f to ca67afc on December 12, 2025 at 00:36
@coderabbitai bot left a comment

Actionable comments posted: 4

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
kubeflow-pipelines/docling-standard/standard_components.py (1)

35-56: Document accelerator_device in the docstring Args.

Add an Args entry for accelerator_device so the component is self-describing.

♻️ Duplicate comments (4)
kubeflow-pipelines/docling-standard/standard_components.py (1)

127-136: (Dup) Consider extracting device_map into a shared module to avoid drift.

Same DRY concern as previously noted (standard vs VLM components).

kubeflow-pipelines/docling-vlm/vlm_convert_pipeline_compiled.yaml (1)

256-266: (Dup) Embedded component docstring still doesn’t document accelerator_device.

Add accelerator_device to the Args list so the generated YAML remains self-describing.

kubeflow-pipelines/docling-standard/standard_convert_pipeline_compiled.yaml (1)

343-358: (Dup) Embedded component docstring still doesn’t document accelerator_device.

Keep the generated YAML self-describing by adding accelerator_device to Args.

kubeflow-pipelines/docling-standard/local_run.py (1)

320-337: (Dup) PASS_WITH_WARNINGS is counted but never produced.

Either remove the warnings counting/printing, or return that status when validation is partially successful.

🧹 Nitpick comments (5)
kubeflow-pipelines/common/components.py (2)

129-130: Consider exponential backoff for retry delays.

The fixed 5-second delay works, but exponential backoff (e.g., 2^attempt seconds) is more resilient for network failures and reduces load on struggling servers.

Apply this diff for exponential backoff:

-        max_retries = 3
-        retry_delay_sec = 5
+        max_retries = 3

Then update the sleep call:

-                    time.sleep(retry_delay_sec)
+                    delay = min(5 * (2 ** (attempt - 1)), 30)  # Exponential backoff capped at 30s
+                    time.sleep(delay)

And adjust the log message:

                     print(
                         f"import-test-pdfs: error downloading {url}: {exc}. "
-                        f"Retrying in {retry_delay_sec} seconds...",
+                        f"Retrying in {delay} seconds...",
                         flush=True,
                     )
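
Putting those two hunks together, a tiny helper (name illustrative) that yields the 5s/10s/20s progression capped at 30s:

def backoff_delay(attempt: int, base: float = 5.0, cap: float = 30.0) -> float:
    """Exponential backoff: base * 2^(attempt - 1), capped at `cap` seconds."""
    return min(base * 2 ** (attempt - 1), cap)


# attempt 1 -> 5.0s, attempt 2 -> 10.0s, attempt 3 -> 20.0s, attempt 4 -> 30.0s (capped)
delays = [backoff_delay(a) for a in range(1, 5)]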

136-161: Consider discriminating between transient and permanent HTTP errors.

The current implementation retries all RequestException types, including permanent failures like 404 Not Found or 403 Forbidden. While not harmful, this wastes time on errors that won't resolve with retries.

Apply this diff to skip retrying permanent errors:

                 except requests.exceptions.RequestException as exc:  # type: ignore[attr-defined]
                     # Network / HTTP errors are the main reason this component can be flaky
+                    # Don't retry permanent HTTP errors
+                    if hasattr(exc, 'response') and exc.response is not None:
+                        if exc.response.status_code in (400, 401, 403, 404, 410):
+                            raise RuntimeError(
+                                f"Failed to download {url}: {exc} (non-retryable error)"
+                            ) from exc
                     if attempt == max_retries:
                         raise RuntimeError(
                             f"Failed to download {url} after {max_retries} attempts"
                         ) from exc
kubeflow-pipelines/docling-vlm/local_run.py (1)

182-213: Drop the unnecessary else after return in error handling.

Matches pylint R1705 and reduces indentation.

kubeflow-pipelines/docling-vlm/vlm_convert_pipeline_compiled.yaml (1)

515-535: Make HTTP download retry safer for nested paths + consider backoff/jitter.

If a filename contains /, dest.open("wb") can fail because parent dirs aren’t created. Consider dest.parent.mkdir(parents=True, exist_ok=True) before writing; also consider exponential backoff w/ jitter to reduce thundering herd.

kubeflow-pipelines/docling-standard/standard_convert_pipeline_compiled.yaml (1)

601-622: Make HTTP download retry safer for nested paths + consider backoff/jitter.

Same as VLM compiled YAML: create dest.parent before writing, and consider exponential backoff.
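
A compact sketch of both hardening steps (paths and values are illustrative):

import random
from pathlib import Path

# Tolerate '/' in downloaded filenames by creating parent directories first.
dest = Path("/tmp/outputs/nested/name/doc.pdf")
dest.parent.mkdir(parents=True, exist_ok=True)

# Exponential backoff with jitter to avoid synchronized retries.
for attempt in range(1, 4):
    delay = min(5 * 2 ** (attempt - 1), 30) * random.uniform(0.5, 1.5)
    print(f"attempt {attempt}: would sleep {delay:.1f}s before retrying")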

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 384f15f and ca67afc.

📒 Files selected for processing (13)
  • .gitignore (1 hunks)
  • kubeflow-pipelines/README.md (3 hunks)
  • kubeflow-pipelines/common/components.py (2 hunks)
  • kubeflow-pipelines/docling-standard/README.md (1 hunks)
  • kubeflow-pipelines/docling-standard/local_run.py (3 hunks)
  • kubeflow-pipelines/docling-standard/standard_components.py (3 hunks)
  • kubeflow-pipelines/docling-standard/standard_convert_pipeline.py (2 hunks)
  • kubeflow-pipelines/docling-standard/standard_convert_pipeline_compiled.yaml (11 hunks)
  • kubeflow-pipelines/docling-vlm/README.md (1 hunks)
  • kubeflow-pipelines/docling-vlm/local_run.py (4 hunks)
  • kubeflow-pipelines/docling-vlm/vlm_components.py (3 hunks)
  • kubeflow-pipelines/docling-vlm/vlm_convert_pipeline.py (2 hunks)
  • kubeflow-pipelines/docling-vlm/vlm_convert_pipeline_compiled.yaml (11 hunks)
🚧 Files skipped from review as they are similar to previous changes (6)
  • .gitignore
  • kubeflow-pipelines/docling-standard/standard_convert_pipeline.py
  • kubeflow-pipelines/docling-standard/README.md
  • kubeflow-pipelines/docling-vlm/vlm_convert_pipeline.py
  • kubeflow-pipelines/docling-vlm/README.md
  • kubeflow-pipelines/docling-vlm/vlm_components.py
🧰 Additional context used
🧬 Code graph analysis (2)
kubeflow-pipelines/docling-standard/local_run.py (1)
kubeflow-pipelines/docling-vlm/local_run.py (4)
  • convert_pipeline_test (54-85)
  • validate_output (88-143)
  • run_test_scenario (146-220)
  • main (223-273)
kubeflow-pipelines/docling-vlm/local_run.py (1)
kubeflow-pipelines/docling-standard/local_run.py (4)
  • convert_pipeline_test (99-138)
  • validate_output (141-205)
  • run_test_scenario (208-285)
  • main (288-354)
🪛 Pylint (4.0.4)
kubeflow-pipelines/docling-standard/local_run.py

[refactor] 99-99: Too many positional arguments (8/5)

(R0917)


[refactor] 255-277: Unnecessary "else" after "return", remove the "else" and de-indent the code inside it

(R1705)

kubeflow-pipelines/docling-vlm/local_run.py

[refactor] 190-212: Unnecessary "else" after "return", remove the "else" and de-indent the code inside it

(R1705)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: compile (docling-standard, kubeflow-pipelines/docling-standard, python standard_convert_pipeline....
  • GitHub Check: compile (docling-vlm, kubeflow-pipelines/docling-vlm, python vlm_convert_pipeline.py, vlm_convert...
  • GitHub Check: Summary
🔇 Additional comments (2)
kubeflow-pipelines/common/components.py (2)

31-31: LGTM!

The time import is necessary for the retry delay mechanism and follows the existing import pattern with the appropriate pylint disable comment.


129-161: Verify that this change aligns with the PR scope.

The retry logic added here appears unrelated to the stated PR objective of adding accelerator_device parameter support. While the retry mechanism is a useful improvement for download reliability, mixing unrelated changes in a PR can:

  • Complicate reviews and testing
  • Make it harder to track which change caused issues
  • Delay merging if one change needs rework

Consider whether this retry logic should be in a separate PR focused on download reliability improvements.

Comment on lines +141 to +206
def validate_output(output_dir: Path, test_name: str) -> dict:
    """Validate the output quality and format"""
    validation_results = {
        "test_name": test_name,
        "passed": True,
        "checks": {},
    }

    # Check for JSON output
    json_files = list(output_dir.glob("*.json"))
    validation_results["checks"]["json_exists"] = len(json_files) > 0

    # Check for Markdown output
    md_files = list(output_dir.glob("*.md"))
    validation_results["checks"]["md_exists"] = len(md_files) > 0

    if json_files:
        json_path = json_files[0]
        try:
            with open(json_path) as f:
                data = json.load(f)

            # Required fields validation
            validation_results["checks"]["has_name"] = "name" in data
            validation_results["checks"]["has_pages"] = "pages" in data

            if "pages" in data:
                validation_results["checks"]["pages_count"] = len(data["pages"])
                validation_results["checks"]["has_content"] = len(data["pages"]) > 0

                # Check page structure
                if data["pages"]:
                    first_page = data["pages"][0]
                    validation_results["checks"]["page_has_number"] = (
                        "page_number" in first_page
                    )
                    validation_results["checks"]["page_has_size"] = "size" in first_page
        except Exception as e:
            validation_results["checks"]["json_parsing_error"] = str(e)
            validation_results["passed"] = False

    if md_files:
        md_path = md_files[0]
        try:
            content = md_path.read_text()
            validation_results["checks"]["md_length"] = len(content)
            validation_results["checks"]["md_not_empty"] = len(content) > 100
            validation_results["checks"]["md_has_headers"] = "#" in content
            validation_results["checks"]["md_no_error"] = not content.startswith(
                "Error"
            )
        except Exception as e:
            validation_results["checks"]["md_parsing_error"] = str(e)
            validation_results["passed"] = False

    # Overall pass/fail
    validation_results["passed"] = all(
        [
            validation_results["checks"].get("json_exists", False),
            validation_results["checks"].get("md_exists", False),
            validation_results["checks"].get("has_content", False),
        ]
    )

    return validation_results


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# First, let's get the full file to understand the context
wc -l kubeflow-pipelines/docling-standard/local_run.py

Repository: opendatahub-io/data-processing

Length of output: 204


🏁 Script executed:

# Search for all usages of validate_output function
rg "validate_output" kubeflow-pipelines/docling-standard/local_run.py

Repository: opendatahub-io/data-processing

Length of output: 136


🏁 Script executed:

# Read the run_test_scenario function (lines 208-246 area)
sed -n '208,250p' kubeflow-pipelines/docling-standard/local_run.py

Repository: opendatahub-io/data-processing

Length of output: 1399


🏁 Script executed:

# Read the main function area (lines 320-337)
sed -n '310,340p' kubeflow-pipelines/docling-standard/local_run.py

Repository: opendatahub-io/data-processing

Length of output: 1135


Test harness reports PASS without validating outputs—validate_output() is defined but never called.

The validate_output() function (lines 141-206) computes validation_results["passed"] with checks for JSON structure, Markdown content, and page data, but run_test_scenario() never invokes it. Instead, after convert_pipeline_test() succeeds, the code returns {"status": "PASS", ...} without validation (line 245). Additionally, main() counts PASS_WITH_WARNINGS status in the summary, but run_test_scenario() never produces this status—only "PASS" or "FAIL". Consider calling validate_output(output_dir, test_name) after pipeline success and returning FAIL (or PASS_WITH_WARNINGS) when checks fail to match the validation logic.

Comment on lines +33 to +34

    accelerator_device: str = "auto",
):

⚠️ Potential issue | 🟠 Major

Harden accelerator device normalization, document the parameter, and add aliases (`cuda`, `mps`).

The current `dev = accelerator_device.lower()` without `.strip()` will reject whitespace-padded values (e.g., `" GPU "`), lacks an empty-string guard, and doesn't support the stated aliases `cuda` and `mps`. Additionally, the `accelerator_device` parameter is missing from the docstring.

Apply the suggested patch to both kubeflow-pipelines/docling-standard/standard_components.py (lines 127–137) and kubeflow-pipelines/docling-vlm/vlm_components.py (identical code, same issues):

-    device_map = {
-        "auto": AcceleratorDevice.AUTO,
-        "cpu": AcceleratorDevice.CPU,
-        "gpu": AcceleratorDevice.CUDA,
-    }
-    dev = accelerator_device.lower()
+    if not accelerator_device or not accelerator_device.strip():
+        raise ValueError("accelerator_device must be a non-empty string")
+
+    device_map = {
+        "auto": AcceleratorDevice.AUTO,
+        "cpu": AcceleratorDevice.CPU,
+        "cuda": AcceleratorDevice.CUDA,
+        "gpu": AcceleratorDevice.CUDA,
+    }
+    if hasattr(AcceleratorDevice, "MPS"):
+        device_map["mps"] = getattr(AcceleratorDevice, "MPS")
+
+    dev = accelerator_device.strip().lower()
     if dev not in device_map:
         raise ValueError(
             f"Invalid accelerator_device: {accelerator_device}. Must be one of {sorted(device_map.keys())}"
         )

Also add `accelerator_device: Accelerator device to use (e.g. "auto", "cpu", "gpu", "cuda", "mps").` to the Args section of both docstrings.

Committable suggestion skipped: line range outside the PR's diff.
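
As a self-contained illustration of the suggested normalization, a sketch using a stand-in enum so it runs without docling installed (the real AcceleratorDevice comes from docling, and an mps member only exists in versions that support it):

```python
from enum import Enum


class AcceleratorDevice(Enum):
    # Stand-in for docling's AcceleratorDevice enum, so this sketch runs
    # without docling installed.
    AUTO = "auto"
    CPU = "cpu"
    CUDA = "cuda"
    MPS = "mps"


def normalize_accelerator_device(value: str) -> AcceleratorDevice:
    """Map a user-facing device string to an AcceleratorDevice member."""
    if not value or not value.strip():
        raise ValueError("accelerator_device must be a non-empty string")
    device_map = {
        "auto": AcceleratorDevice.AUTO,
        "cpu": AcceleratorDevice.CPU,
        "cuda": AcceleratorDevice.CUDA,
        "gpu": AcceleratorDevice.CUDA,  # user-facing alias for CUDA
        "mps": AcceleratorDevice.MPS,
    }
    dev = value.strip().lower()
    if dev not in device_map:
        raise ValueError(
            f"Invalid accelerator_device: {value}. "
            f"Must be one of {sorted(device_map)}"
        )
    return device_map[dev]


# Whitespace and case are tolerated; unknown values raise.
assert normalize_accelerator_device(" GPU ") is AcceleratorDevice.CUDA
assert normalize_accelerator_device("auto") is AcceleratorDevice.AUTO
```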

Comment on lines +88 to +144
def validate_output(output_dir: Path, test_name: str) -> dict:
"""Validate the output quality and format"""
validation_results = {
"test_name": test_name,
"passed": True,
"checks": {},
}

# Check for JSON output
json_files = list(output_dir.glob("*.json"))
validation_results["checks"]["json_exists"] = len(json_files) > 0

# Check for Markdown output
md_files = list(output_dir.glob("*.md"))
validation_results["checks"]["md_exists"] = len(md_files) > 0

if json_files:
json_path = json_files[0]
try:
with open(json_path) as f:
data = json.load(f)

# Required fields validation
validation_results["checks"]["has_name"] = "name" in data
validation_results["checks"]["has_pages"] = "pages" in data

if "pages" in data:
validation_results["checks"]["pages_count"] = len(data["pages"])
validation_results["checks"]["has_content"] = len(data["pages"]) > 0

# Check page structure
if data["pages"]:
first_page = data["pages"][0]
validation_results["checks"]["page_has_number"] = (
"page_number" in first_page
)
validation_results["checks"]["page_has_size"] = "size" in first_page
except Exception as e:
validation_results["checks"]["json_parsing"] = f"Failed: {e}"
validation_results["passed"] = False

if md_files:
md_path = md_files[0]
try:
content = md_path.read_text()
validation_results["checks"]["md_length"] = len(content)
validation_results["checks"]["md_not_empty"] = len(content) > 100
validation_results["checks"]["md_has_headers"] = "#" in content
validation_results["checks"]["md_no_error"] = not content.startswith(
"Error"
)
except Exception as e:
validation_results["checks"]["md_parsing"] = f"Failed: {e}"
validation_results["passed"] = False

return validation_results

⚠️ Potential issue | 🟠 Major

Test harness reports PASS without validating outputs—validate_output() is unused.

The validate_output() function (lines 88-143) is defined but never invoked. Currently, run_test_scenario() reports "PASS" only when the pipeline executes without throwing an exception; no output artifacts are checked. To make tests meaningful, capture the docling_convert_vlm output artifact, expose it as a pipeline output, and call validate_output() on that path, failing the test when validation checks are unsatisfied.

🤖 Prompt for AI Agents
In kubeflow-pipelines/docling-vlm/local_run.py around lines 88 to 144,
validate_output() is defined but never used; update run_test_scenario() to
capture the docling_convert_vlm output artifact path (ensure the pipeline step
exposes that artifact as a file/directory output), call
validate_output(output_path, test_name) after the pipeline run, inspect the
returned dict and mark the test as failed if validation_results["passed"] is
False (print/log the validation_results and exit non‑zero or raise an
exception); also ensure run_test_scenario propagates pipeline artifact locations
so validate_output receives the actual file system Path to the produced JSON/MD
outputs.
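
A sketch of the matching failure propagation at the end of main(), so a failed validation fails the whole harness run (key names like test_name are assumptions, not from the PR):

```python
import sys


def finish(results: list) -> None:
    """Exit non-zero when any scenario failed validation (sketch).

    `results` is assumed to be the list of dicts returned by
    run_test_scenario(); key names are illustrative.
    """
    failed = [r for r in results if r.get("status") == "FAIL"]
    for r in failed:
        checks = r.get("validation", {}).get("checks")
        print(f"FAILED {r.get('test_name')}: {checks}")
    sys.exit(1 if failed else 0)
```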

Comment on lines +96 to 98
- Standard pipeline defaults include `pdf_backend=dlparse_v4`, `image_export_mode=embedded`, `table_mode=accurate`, `num_threads=4`, `timeout_per_document=300`, `ocr=True`, `force_ocr=False`, `ocr_engine=tesseract`, `accelerator_device=auto`.
- VLM pipeline defaults include `num_threads=4`, `timeout_per_document=300`, `image_export_mode=embedded`, `remote_model_enabled=False`, and `accelerator_device=auto`.

⚠️ Potential issue | 🟠 Major

Lines 96–97: Use correct parameter name docling_accelerator_device in the defaults list, not accelerator_device.

The "Defaults" section lists accelerator_device=auto, but users will configure this in the KFP UI using the actual parameter name docling_accelerator_device. Update both the Standard and VLM defaults to reflect the correct name (e.g., docling_accelerator_device=auto).

🤖 Prompt for AI Agents
In kubeflow-pipelines/README.md around lines 96 to 98, the defaults list uses
the wrong parameter name `accelerator_device=auto`; update both Standard and VLM
defaults to use the correct KFP parameter name `docling_accelerator_device=auto`
so the UI config matches documentation. Replace `accelerator_device=auto` with
`docling_accelerator_device=auto` in both default lines and verify spelling and
punctuation remain consistent with surrounding list formatting.

- `docling_accelerator_device`: Device to use for acceleration. Supported values:
- `auto` (default): Automatically detect and use the best available device
- `cpu`: Force CPU-only processing
- `gpu`: Use GPU with CUDA
Contributor

@alimaredia do we care about AMD ROCm?

Contributor Author

I don't know, I may be wrong, but right now NVIDIA dominates the AI/ML GPU market. Docling uses AcceleratorDevice.CUDA, which is NVIDIA-focused, and most ML containers are built for CUDA, not ROCm. If we added ROCm, we would need separate testing, images, and validation. Also, Kubeflow/OpenShift are primarily deployed with the NVIDIA GPU operator. So maybe we should not care about AMD ROCm for now. @alimaredia wdyt?

Contributor

Talked about this in standup: upstream Docling just uses CUDA, so we can stick to CUDA. If that changes, we will update.

Comment on lines +47 to +54
# Uncomment if you have GPU
# "gpu_test": {
# "docling_accelerator_device": "gpu",
# "docling_pdf_backend": "dlparse_v4",
# "docling_table_mode": "accurate",
# "docling_num_threads": 4,
# "docling_ocr": True,
# },
Contributor

Why does this need to be commented out?

Contributor Author

Right now we don't have a GPU to test with; once we can test on a GPU, we can uncomment this and run it then.

Contributor

Could you run this on an AWS EC2 instance?

Contributor Author

Yes, we can use any AWS EC2 instance with NVIDIA GPUs to test it.

Contributor

Can you test with a GPU on an AWS instance?

device_map = {
"auto": AcceleratorDevice.AUTO,
"cpu": AcceleratorDevice.CPU,
"gpu": AcceleratorDevice.CUDA,
Contributor

With Fabiano's image, where torch is installed for CPU only, would gpu work?

Contributor Author

So, I tried this:

(.venv) roburishabh@rishasin-mac data-processing % docker pull quay.io/fabianofranz/docling-ubi9:2.54.0
2.54.0: Pulling from fabianofranz/docling-ubi9
Digest: sha256:0c050ec6ecfb3cc44f1f446994de0b655b48ba9c19f713facc473a2c458f9d96
Status: Image is up to date for quay.io/fabianofranz/docling-ubi9:2.54.0
quay.io/fabianofranz/docling-ubi9:2.54.0

-------------------------------------------
(.venv) roburishabh@rishasin-mac data-processing % docker run --rm quay.io/fabianofranz/docling-ubi9:2.54.0 \
  python -c "import torch; print(f'CUDA available: {torch.cuda.is_available()}'); print(f'CUDA built: {torch.version.cuda}')"
WARNING: The requested image's platform (linux/amd64) does not match the detected host platform (linux/arm64/v8) and no specific platform was requested
CUDA available: False
CUDA built: None

The container image quay.io/fabianofranz/docling-ubi9:2.54.0 ships CPU-only PyTorch, so the gpu option is currently unusable with this image. To enable GPU support, someone would need to build a new container image with:

  • CUDA-enabled PyTorch (torch+cu118 or similar) and CUDA runtime libraries. I think @alimaredia is building that image. WDYT?
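
One way to surface that failure mode explicitly rather than rely on a silent CPU fallback: a fail-fast guard in the component before building AcceleratorOptions (a sketch, assuming torch is importable in the image; not part of this PR):

```python
import torch


def check_device_available(dev: str) -> None:
    """Fail fast when 'gpu' is requested on a CPU-only PyTorch build."""
    if dev == "gpu" and not torch.cuda.is_available():
        raise RuntimeError(
            "accelerator_device='gpu' was requested, but this container's "
            "PyTorch build has no CUDA support. Use a CUDA-enabled image "
            "or set the device to 'cpu' or 'auto'."
        )
```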

