
[BUG] When I used the ImageNet Training Script to train my model, an unknown error occurred. #2365

@stone-cloud

Description

Describe the bug
When I used the ImageNet Training Script to train my model, the following error occurred. However, the same model file works fine when I train segmentation models with it. This issue has troubled me for a long time and I haven't found a detailed solution.

Traceback (most recent call last):
  File "/data/guanwei/LYH/cvpr/poolformer/train.py", line 1214, in <module>
    main()
  File "/data/guanwei/LYH/cvpr/poolformer/train.py", line 871, in main
    train_metrics = train_one_epoch(
  File "/data/guanwei/LYH/cvpr/poolformer/train.py", line 1063, in train_one_epoch
    loss = _forward()
  File "/data/guanwei/LYH/cvpr/poolformer/train.py", line 1031, in _forward
    loss = loss_fn(output, target)
  File "/home/ubuntu/anaconda3/envs/cvpr/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/ubuntu/anaconda3/envs/cvpr/lib/python3.10/site-packages/timm/loss/cross_entropy.py", line 21, in forward
    logprobs = F.log_softmax(x, dim=-1)
  File "/home/ubuntu/anaconda3/envs/cvpr/lib/python3.10/site-packages/torch/nn/functional.py", line 1930, in log_softmax
    ret = input.log_softmax(dim)
AttributeError: 'list' object has no attribute 'log_softmax'
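
For context, the traceback shows that `loss_fn` (timm's `LabelSmoothingCrossEntropy` in `timm/loss/cross_entropy.py`) receives a Python `list` instead of a logits tensor, which suggests the model's `forward()` returns multiple outputs. A minimal sketch of the failure mode (batch size and class count are assumptions for illustration):

```python
# Minimal sketch of the failure mode, assuming forward() returns a list of
# tensors instead of a single (B, num_classes) logits tensor.
import torch
from timm.loss import LabelSmoothingCrossEntropy

loss_fn = LabelSmoothingCrossEntropy()
target = torch.randint(0, 1000, (4,))       # 4 samples, 1000 classes (assumed)

logits = torch.randn(4, 1000)
print(loss_fn(logits, target).item())       # works: a single (B, num_classes) tensor

multi_out = [torch.randn(4, 1000), torch.randn(4, 1000)]
loss_fn(multi_out, target)                  # raises AttributeError: 'list' object has no attribute 'log_softmax'
```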

To Reproduce

Steps to reproduce the behavior:
My process:

  1. Create a new environment: conda create --name seg python==3.10
    
  2. Install PyTorch 1.13.1
    
  3. Install timm
    
  4. After registering my model (a registration sketch follows this list), run train.py via distributed_train.sh:

     distributed_train.sh:

     ```bash
     #!/bin/bash
     NUM_PROC=$1
     shift
     python3 -m torch.distributed.launch --nproc_per_node=$NUM_PROC train.py "$@"
     ```

     Launch command:

     ```bash
     MODEL=MaxFocusNet_b0
     DROP_PATH=0.1
     DATA_PATH=/data/Datasets/ImageNet-1K/raw/ImageNet-1K
     CUDA_VISIBLE_DEVICES=0 bash distributed_train.sh 1 $DATA_PATH \
       --model $MODEL -b 128 --lr 1e-3 --drop-path $DROP_PATH --model-ema --amp
     ```
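
A minimal, hedged sketch of the registration in step 4 (the model name `MaxFocusNet_b0` comes from the launch command; the constructor, layers, and keyword arguments are placeholders, not the actual model). The key point for timm's train.py is that the classification `forward()` returns a single `(B, num_classes)` tensor:

```python
# Hedged sketch of registering a custom model with timm; the architecture is a
# placeholder, only the registration pattern and the single-tensor output matter.
import torch.nn as nn
from timm.models import register_model

class MaxFocusNet(nn.Module):                       # placeholder architecture
    def __init__(self, num_classes=1000, **kwargs):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=3, stride=2, padding=1),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
        )
        self.head = nn.Linear(64, num_classes)

    def forward(self, x):
        # train.py's loss functions expect one (B, num_classes) logits tensor,
        # not a list/tuple of intermediate outputs.
        return self.head(self.backbone(x))

@register_model
def MaxFocusNet_b0(pretrained=False, **kwargs):
    return MaxFocusNet(**kwargs)
```

With the model registered this way, `timm.create_model('MaxFocusNet_b0')` (and therefore `--model MaxFocusNet_b0` in train.py) resolves by the decorated function's name.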

Expected behavior
This is strange; I couldn't find a solution to the same problem anywhere. I suspect that distributed training has split the data across processes, but I don't understand why the outputs haven't been merged.
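
As a hedged note on that suspicion: DDP does not gather outputs across processes before the loss; each rank computes the loss on its own mini-batch, so a `list` at this point usually means the model itself returns several tensors (as segmentation-style models with auxiliary heads often do). One possible workaround, assuming the first element really holds the classification logits, is to unwrap the output before it reaches the loss:

```python
# Hedged workaround sketch (an assumption, not a confirmed fix): wrap the model
# so that only the classification logits reach the loss when forward() returns
# a list or tuple of tensors.
import torch.nn as nn

class LogitsOnly(nn.Module):
    def __init__(self, model):
        super().__init__()
        self.model = model

    def forward(self, x):
        out = self.model(x)
        if isinstance(out, (list, tuple)):
            out = out[0]    # assume the first tensor is the (B, num_classes) logits
        return out
```

Such a wrapper would be applied right after the model is created in train.py, before it is handed to DDP/EMA; whether the first element is actually the logits depends on the custom model.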

Screenshots
(screenshot of the same error, not reproduced here)

Desktop (please complete the following information):

  • python3.10
  • Linux node1067 5.15.0-119-generic #129-Ubuntu SMP Fri Aug 2 19:25:20 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
  • NVIDIA-SMI 550.54.15
  • PyTorch Version: 1.13.1+cu117
  • torchvision Version: 0.14.1+cu117
  • timm==1.0.12
  • numpy==1.26.4
  • CUDA Version: 11.7
  • nvcc: NVIDIA (R) Cuda compiler driver
    Copyright (c) 2005-2022 NVIDIA Corporation
    Built on Wed_Jun__8_16:49:14_PDT_2022
    Cuda compilation tools, release 11.7, V11.7.99
    Build cuda_11.7.r11.7/compiler.31442593_0
  • 8 x RTX4090
