
[_Derived_]RecvAsync is cancelled – LSTM

Hey,

TensorFlow broke in my conda environment and I can't seem to get it working again. I'm having different issues getting tensorflow-gpu==2.3.0 and 2.4.1 working.

GTX 1070 GPU drivers:

-CUDA 11.0.3

-CUDNN 8.0.5.77

installed with $conda install cudatoolkit=11.0 cudnn=8.0 -c=conda-forge

-Python 3.8.8

Tensorflow 2.4.1:

tensorflow            2.3.0   mkl_py38h8557ec7_0
tensorflow-base       2.3.0   eigen_py38h75a453f_0
tensorflow-estimator  2.4.0   pyh9656e83_0          conda-forge
tensorflow-gpu        2.3.0   he13fc11_0
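The listing above shows 2.3.0 builds next to a 2.4.0 estimator, so it may be worth checking which TensorFlow distributions the active interpreter actually sees. The following is only a sketch using the standard library's importlib.metadata (Python 3.8+). The helper names tensorflow_packages and minor_versions are made up for illustration, and the check assumes that every installed package carries normal distribution metadata, which may not hold for every conda build.

```python
from importlib import metadata


def tensorflow_packages():
    """Return {distribution name: version} for installed packages named tensorflow*."""
    found = {}
    for dist in metadata.distributions():
        name = dist.metadata["Name"]
        # Skip distributions with missing metadata instead of failing on them.
        if name and name.lower().startswith("tensorflow"):
            found[name] = dist.version
    return found


def minor_versions(packages):
    """Collect the distinct major.minor versions, e.g. {'2.3', '2.4'}."""
    return {".".join(version.split(".")[:2]) for version in packages.values()}


if __name__ == "__main__":
    pkgs = tensorflow_packages()
    for name, version in sorted(pkgs.items()):
        print(f"{name:30s} {version}")
    if len(minor_versions(pkgs)) > 1:
        # More than one major.minor line usually means conda and pip installs overlap.
        print("Mixed TensorFlow versions found:", sorted(minor_versions(pkgs)))
```

If this reports more than one major.minor version, the conda and pip installs are probably layered on top of each other, and a fresh environment with a single install route would be a cleaner starting point.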

installed with pip install --upgrade tensorflow-gpu==2.4.1

I have set all the environment variables correctly. Checking with print(tf.config.list_physical_devices('GPU')) gives: [PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]
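For anyone wanting to double-check the environment variables, here is a rough sketch that looks for the CUDA libraries on PATH. The three DLL names are the ones TensorFlow reports opening in the training log in this post. The function name find_on_path and its path parameter are hypothetical, and the sketch only inspects PATH, which is an assumption about how the libraries are located on this Windows setup.

```python
import os
from pathlib import Path

# DLL names taken from the "Successfully opened dynamic library" log lines.
EXPECTED_DLLS = ("cublas64_11.dll", "cublasLt64_11.dll", "cudnn64_8.dll")


def find_on_path(filenames, path=None):
    """Map each filename to every PATH directory that contains it."""
    raw = os.environ.get("PATH", "") if path is None else path
    hits = {name: [] for name in filenames}
    for entry in raw.split(os.pathsep):
        if not entry:
            continue  # ignore empty PATH segments
        directory = Path(entry)
        for name in filenames:
            if (directory / name).is_file():
                hits[name].append(str(directory))
    return hits


if __name__ == "__main__":
    for name, dirs in find_on_path(EXPECTED_DLLS).items():
        print(name, "->", dirs or "not found on PATH")
```

A library that shows up in more than one directory (for example a system-wide CUDA install plus the conda environment) would hint that a different copy than expected could be loaded first.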

So TensorFlow seems to be installed and recognises my GPU. I've been working on an LSTM model. When training with model.fit(), it completes 5 epochs and then fails partway through the 6th with this error:

Epoch 1/50
2021-02-27 14:50:38.552734: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:116] None of the MLIR optimization passes are enabled (registered 2)
2021-02-27 14:50:38.882403: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cublas64_11.dll
2021-02-27 14:50:39.546250: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cublasLt64_11.dll
2021-02-27 14:50:39.794953: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cudnn64_8.dll
37/37 [==============================] - 7s 55ms/step - loss: 7.0684 - accuracy: 0.1270
Epoch 2/50
37/37 [==============================] - 2s 54ms/step - loss: 4.8889 - accuracy: 0.1828
Epoch 3/50
37/37 [==============================] - 2s 54ms/step - loss: 4.7884 - accuracy: 0.1666
Epoch 4/50
37/37 [==============================] - 2s 54ms/step - loss: 4.6866 - accuracy: 0.1480
Epoch 5/50
37/37 [==============================] - 2s 55ms/step - loss: 4.5179 - accuracy: 0.1630
Epoch 6/50
17/37 [============>.................] - ETA: 1s - loss: 4.2505 - accuracy: 0.1484
2021-02-27 14:50:55.955000: E tensorflow/stream_executor/dnn.cc:616] CUDNN_STATUS_INTERNAL_ERROR in tensorflow/stream_executor/cuda/cuda_dnn.cc(2004): 'cudnnRNNBackwardWeights( cudnn.handle(), rnn_desc.handle(), model_dims.max_seq_length, input_desc.handles(), input_data.opaque(), input_h_desc.handle(), input_h_data.opaque(), output_desc.handles(), output_data.opaque(), workspace.opaque(), workspace.size(), rnn_desc.params_handle(), params_backprop_data->opaque(), reserve_space_data->opaque(), reserve_space_data->size())'
2021-02-27 14:50:55.955194: W tensorflow/core/framework/op_kernel.cc:1763] OP_REQUIRES failed at cudnn_rnn_ops.cc:1926 : Internal: Failed to call ThenRnnBackward with model config: [rnn_mode, rnn_input_mode, rnn_direction_mode]: 2, 0, 0 , [num_layers, input_size, num_units, dir_count, max_seq_length, batch_size, cell_num_units]: [1, 256, 256, 1, 100, 64, 256]
2021-02-27 14:50:55,957 : MainThread : INFO : Saving model history to model_history.csv
2021-02-27 14:50:55,961 : MainThread : INFO : Saving model to D:projectproject_enginefftest_checkpointsbatch_0synthetic
Traceback (most recent call last):
  File "runTrain.py", line 65, in <module>
    model.train()
  ... ... ...
  File "D:\project\project_engine\runTrain.py", line 201, in train_rnn
    model.fit(dataset, epochs=store.epochs, callbacks=_callbacks)
  File "C:\Users\Me\anaconda3\envs\tf_gpu\lib\site-packages\tensorflow\python\keras\engine\training.py", line 1100, in fit
    tmp_logs = self.train_function(iterator)
  File "C:\Users\Me\anaconda3\envs\tf_gpu\lib\site-packages\tensorflow\python\eager\def_function.py", line 828, in __call__
    result = self._call(*args, **kwds)
  File "C:\Users\Me\anaconda3\envs\tf_gpu\lib\site-packages\tensorflow\python\eager\def_function.py", line 855, in _call
    return self._stateless_fn(*args, **kwds)  # pylint: disable=not-callable
  File "C:\Users\Me\anaconda3\envs\tf_gpu\lib\site-packages\tensorflow\python\eager\function.py", line 2942, in __call__
    return graph_function._call_flat(
  File "C:\Users\Me\anaconda3\envs\tf_gpu\lib\site-packages\tensorflow\python\eager\function.py", line 1918, in _call_flat
    return self._build_call_outputs(self._inference_function.call(
  File "C:\Users\Me\anaconda3\envs\tf_gpu\lib\site-packages\tensorflow\python\eager\function.py", line 555, in call
    outputs = execute.execute(
  File "C:\Users\Me\anaconda3\envs\tf_gpu\lib\site-packages\tensorflow\python\eager\execute.py", line 59, in quick_execute
    tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
tensorflow.python.framework.errors_impl.CancelledError: [_Derived_]RecvAsync is cancelled.
  [[{{node gradient_tape/sequential/embedding/embedding_lookup/Reshape/_20}}]] [Op:__inference_train_function_4800]
Function call stack:
train_function

TensorFlow forum threads about similar errors point to memory or driver problems, but I don't think that is the case here, because then the model wouldn't start training at all. I also know the code is fine, because I trained with the same code without any issue in an old environment I was using 2 months ago. It also runs fine in a CPU-only TensorFlow environment.
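One memory-related setting that is cheap to rule in or out is GPU memory growth, since a CUDNN_STATUS_INTERNAL_ERROR can appear partway through training rather than only at startup. The sketch below is an experiment to try, not a confirmed fix for this error. It uses tf.config.experimental.set_memory_growth, and the wrapper name enable_memory_growth is hypothetical. It assumes the call happens at the very top of the training script, before TensorFlow has initialised the GPU.

```python
def enable_memory_growth():
    """Ask TensorFlow to allocate GPU memory on demand instead of all at once.

    Returns the names of the GPUs that were configured, or an empty list
    if TensorFlow is not importable or no GPU is visible.
    """
    try:
        import tensorflow as tf
    except ImportError:
        return []

    gpus = tf.config.list_physical_devices("GPU")
    for gpu in gpus:
        # Raises RuntimeError if the GPU has already been initialised,
        # so call this before building any model or dataset.
        tf.config.experimental.set_memory_growth(gpu, True)
    return [gpu.name for gpu in gpus]


if __name__ == "__main__":
    print("Memory growth enabled for:", enable_memory_growth())
```

If training then gets past the point where it previously failed, memory pressure becomes a more likely explanation. If it fails at the same step, the mixed package versions or the cuDNN build are better candidates.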

Does anyone have any suggestions on how to fix this?

Tensorflow 2.3.0:

Secondly, I can't even try another version of TensorFlow GPU in a different environment.

conda install -c anaconda tensorflow-gpu 

TensorFlow GPU installs successfully but doesn't run on my GPU, for the reasons stated here: https://www.reddit.com/r/tensorflow/comments/jtwcth/how_to_enable_tensorflow_code_to_run_on_a_gpu/gp0b3mf/

I've now lost 2 days and a lot of will to live. Any help with either issue would be massively appreciated.

submitted by /u/nuusain
