Closed
Labels
bug: Something isn't working
Description
Describe the bug
When I use the ImageNet training script to train my model, the error below occurs. The same model file trains fine with other (segmentation) training scripts. This issue has troubled me for a long time and I haven't found a detailed solution.
Traceback (most recent call last):
File "/data/guanwei/LYH/cvpr/poolformer/train.py", line 1214, in <module>
main()
File "/data/guanwei/LYH/cvpr/poolformer/train.py", line 871, in main
train_metrics = train_one_epoch(
File "/data/guanwei/LYH/cvpr/poolformer/train.py", line 1063, in train_one_epoch
loss = _forward()
File "/data/guanwei/LYH/cvpr/poolformer/train.py", line 1031, in _forward
loss = loss_fn(output, target)
File "/home/ubuntu/anaconda3/envs/cvpr/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/home/ubuntu/anaconda3/envs/cvpr/lib/python3.10/site-packages/timm/loss/cross_entropy.py", line 21, in forward
logprobs = F.log_softmax(x, dim=-1)
File "/home/ubuntu/anaconda3/envs/cvpr/lib/python3.10/site-packages/torch/nn/functional.py", line 1930, in log_softmax
ret = input.log_softmax(dim)
AttributeError: 'list' object has no attribute 'log_softmax'
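For context: `F.log_softmax` simply calls `input.log_softmax(dim)`, so passing it a Python list of tensors instead of a single tensor raises exactly this error. A minimal reproduction:

```python
import torch
import torch.nn.functional as F

logits = torch.randn(2, 10)
F.log_softmax(logits, dim=-1)   # fine: input is a tensor

outputs = [logits, logits]      # e.g. a forward() that returns several heads
F.log_softmax(outputs, dim=-1)  # AttributeError: 'list' object has no attribute 'log_softmax'
```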
To Reproduce
Steps to reproduce the behavior:
My process:

- Create a new environment: `conda create --name seg python==3.10`
- Install PyTorch 1.13.1
- Install timm
- After registering my model, run train.py through `distributed_train.sh`:

```bash
#!/bin/bash
NUM_PROC=$1
shift
python3 -m torch.distributed.launch --nproc_per_node=$NUM_PROC train.py "$@"
```

Run command:

```bash
MODEL=MaxFocusNet_b0
DROP_PATH=0.1
DATA_PATH=/data/Datasets/ImageNet-1K/raw/ImageNet-1K
CUDA_VISIBLE_DEVICES=0 bash distributed_train.sh 1 $DATA_PATH \
    --model $MODEL -b 128 --lr 1e-3 --drop-path $DROP_PATH --model-ema --amp
```
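Before launching full training, a quick sanity check of what the model's forward actually returns can localize the problem. A minimal sketch, assuming the custom model is registered with timm under the name used above:

```python
import torch
from timm import create_model

# Assumption: 'MaxFocusNet_b0' has been registered via timm's @register_model.
model = create_model('MaxFocusNet_b0')
model.eval()

with torch.no_grad():
    out = model(torch.randn(1, 3, 224, 224))

# timm's ImageNet train.py expects a single logits tensor here;
# a list or tuple reproduces the AttributeError in the traceback above.
print(type(out))
```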
Expected behavior
This is strange, and I couldn't find anyone else reporting the same problem. I suspect that distributed training has split the data across processes, but I don't understand why the outputs were not merged back into a single tensor.
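For what it's worth, the traceback shows the loss received a Python list rather than a sharded tensor, which typically happens when a model's forward returns multiple outputs (e.g. auxiliary heads from a segmentation-style design). A minimal sketch of a workaround under that assumption; `MainOutputOnly` is a hypothetical wrapper, not part of timm:

```python
import torch.nn as nn

class MainOutputOnly(nn.Module):
    """Hypothetical wrapper: if the wrapped model returns a list/tuple,
    keep only the first element so the classification loss sees a single
    logits tensor."""

    def __init__(self, model):
        super().__init__()
        self.model = model

    def forward(self, x):
        out = self.model(x)
        if isinstance(out, (list, tuple)):
            out = out[0]  # assumption: the main classification logits come first
        return out
```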
Desktop (please complete the following information):
- Python 3.10
- Linux node1067 5.15.0-119-generic #129-Ubuntu SMP Fri Aug 2 19:25:20 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
- NVIDIA-SMI 550.54.15
- PyTorch Version: 1.13.1+cu117
- torchvision Version: 0.14.1+cu117
- timm==1.0.12
- numpy==1.26.4
- CUDA Version: 11.7
- nvcc: NVIDIA (R) Cuda compiler driver
  Copyright (c) 2005-2022 NVIDIA Corporation
  Built on Wed_Jun__8_16:49:14_PDT_2022
  Cuda compilation tools, release 11.7, V11.7.99
  Build cuda_11.7.r11.7/compiler.31442593_0
- GPU: 8 x RTX 4090