
[BUG] When I used the ImageNet Training Script to train my model, an unknown error occurred. #2365

@stone-cloud

Description

Describe the bug
When I used the ImageNet Training Script to train my model, the following error occurred. However, the same model file works fine when I train segmentation models with it. This issue has troubled me for a long time and I haven't found a detailed solution.

Traceback (most recent call last):
  File "/data/guanwei/LYH/cvpr/poolformer/train.py", line 1214, in <module>
    main()
  File "/data/guanwei/LYH/cvpr/poolformer/train.py", line 871, in main
    train_metrics = train_one_epoch(
  File "/data/guanwei/LYH/cvpr/poolformer/train.py", line 1063, in train_one_epoch
    loss = _forward()
  File "/data/guanwei/LYH/cvpr/poolformer/train.py", line 1031, in _forward
    loss = loss_fn(output, target)
  File "/home/ubuntu/anaconda3/envs/cvpr/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/ubuntu/anaconda3/envs/cvpr/lib/python3.10/site-packages/timm/loss/cross_entropy.py", line 21, in forward
    logprobs = F.log_softmax(x, dim=-1)
  File "/home/ubuntu/anaconda3/envs/cvpr/lib/python3.10/site-packages/torch/nn/functional.py", line 1930, in log_softmax
    ret = input.log_softmax(dim)
AttributeError: 'list' object has no attribute 'log_softmax'
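
For context, the traceback shows that `loss_fn` (timm's `LabelSmoothingCrossEntropy` in `timm/loss/cross_entropy.py`) receives a Python `list` instead of a logits tensor, which suggests the model's `forward()` returns multiple outputs. A minimal sketch of the failure mode (batch size and class count are assumptions for illustration):

```python
# Minimal sketch of the failure mode, assuming forward() returns a list of
# tensors instead of a single (B, num_classes) logits tensor.
import torch
from timm.loss import LabelSmoothingCrossEntropy

loss_fn = LabelSmoothingCrossEntropy()
target = torch.randint(0, 1000, (4,))       # 4 samples, 1000 classes (assumed)

logits = torch.randn(4, 1000)
print(loss_fn(logits, target).item())       # works: a single (B, num_classes) tensor

multi_out = [torch.randn(4, 1000), torch.randn(4, 1000)]
loss_fn(multi_out, target)                  # raises AttributeError: 'list' object has no attribute 'log_softmax'
```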

To Reproduce

Steps to reproduce the behavior:
My process:

  1. Create a new environment: conda create --name seg python==3.10
    
  2. Install PyTorch 1.13.1
    
  3. Install timm
    
  4. After registering my model (a registration sketch follows this list), run train.py via distributed_train.sh:

     distributed_train.sh:

     ```bash
     #!/bin/bash
     NUM_PROC=$1
     shift
     python3 -m torch.distributed.launch --nproc_per_node=$NUM_PROC train.py "$@"
     ```

     Launch command:

     ```bash
     MODEL=MaxFocusNet_b0
     DROP_PATH=0.1
     DATA_PATH=/data/Datasets/ImageNet-1K/raw/ImageNet-1K
     CUDA_VISIBLE_DEVICES=0 bash distributed_train.sh 1 $DATA_PATH \
       --model $MODEL -b 128 --lr 1e-3 --drop-path $DROP_PATH --model-ema --amp
     ```
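
A minimal, hedged sketch of the registration in step 4 (the model name `MaxFocusNet_b0` comes from the launch command; the constructor, layers, and keyword arguments are placeholders, not the actual model). The key point for timm's train.py is that the classification `forward()` returns a single `(B, num_classes)` tensor:

```python
# Hedged sketch of registering a custom model with timm; the architecture is a
# placeholder, only the registration pattern and the single-tensor output matter.
import torch.nn as nn
from timm.models import register_model

class MaxFocusNet(nn.Module):                       # placeholder architecture
    def __init__(self, num_classes=1000, **kwargs):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=3, stride=2, padding=1),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
        )
        self.head = nn.Linear(64, num_classes)

    def forward(self, x):
        # train.py's loss functions expect one (B, num_classes) logits tensor,
        # not a list/tuple of intermediate outputs.
        return self.head(self.backbone(x))

@register_model
def MaxFocusNet_b0(pretrained=False, **kwargs):
    return MaxFocusNet(**kwargs)
```

With the model registered this way, `timm.create_model('MaxFocusNet_b0')` (and therefore `--model MaxFocusNet_b0` in train.py) resolves by the decorated function's name.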

Expected behavior
This is strange; I couldn't find a solution to the same problem anywhere. I suspect that distributed training has split the data across processes, but I don't understand why the outputs haven't been merged.
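
As a hedged note on that suspicion: DDP does not gather outputs across processes before the loss; each rank computes the loss on its own mini-batch, so a `list` at this point usually means the model itself returns several tensors (as segmentation-style models with auxiliary heads often do). One possible workaround, assuming the first element really holds the classification logits, is to unwrap the output before it reaches the loss:

```python
# Hedged workaround sketch (an assumption, not a confirmed fix): wrap the model
# so that only the classification logits reach the loss when forward() returns
# a list or tuple of tensors.
import torch.nn as nn

class LogitsOnly(nn.Module):
    def __init__(self, model):
        super().__init__()
        self.model = model

    def forward(self, x):
        out = self.model(x)
        if isinstance(out, (list, tuple)):
            out = out[0]    # assume the first tensor is the (B, num_classes) logits
        return out
```

Such a wrapper would be applied right after the model is created in train.py, before it is handed to DDP/EMA; whether the first element is actually the logits depends on the custom model.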

Screenshots
(screenshot of the same error, not reproduced here)

Desktop (please complete the following information):

  • python3.10
  • Linux node1067 5.15.0-119-generic #129-Ubuntu SMP Fri Aug 2 19:25:20 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
  • NVIDIA-SMI 550.54.15
  • PyTorch Version: 1.13.1+cu117
  • torchvision Version: 0.14.1+cu117
  • timm==1.0.12
  • numpy==1.26.4
  • CUDA Version: 11.7
  • nvcc: NVIDIA (R) Cuda compiler driver
    Copyright (c) 2005-2022 NVIDIA Corporation
    Built on Wed_Jun__8_16:49:14_PDT_2022
    Cuda compilation tools, release 11.7, V11.7.99
    Build cuda_11.7.r11.7/compiler.31442593_0
  • 8 x RTX4090
