[python] Failed loading english.pickle with nltk.data.load

When trying to load the punkt tokenizer...

import nltk.data
tokenizer = nltk.data.load('nltk:tokenizers/punkt/english.pickle')

...a LookupError was raised:

> LookupError: 
>     *********************************************************************   
> Resource 'tokenizers/punkt/english.pickle' not found.  Please use the NLTK Downloader to obtain the resource: nltk.download().   Searched in:
>         - 'C:\\Users\\Martinos/nltk_data'
>         - 'C:\\nltk_data'
>         - 'D:\\nltk_data'
>         - 'E:\\nltk_data'
>         - 'E:\\Python26\\nltk_data'
>         - 'E:\\Python26\\lib\\nltk_data'
>         - 'C:\\Users\\Martinos\\AppData\\Roaming\\nltk_data'
>     **********************************************************************

Tags: python, jenkins, nltk

Answers:


A simple nltk.download() will not solve this issue. I tried the following and it worked for me:

In the nltk_data folder, create a tokenizers folder and copy your punkt folder into it.

This will work; the folder structure needs to be nltk_data/tokenizers/punkt.
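
A minimal sketch of that fix, assuming you have already downloaded and unzipped the punkt folder somewhere locally (the source path below is a placeholder):

import os
import shutil

import nltk

# NLTK searches every directory in nltk.data.path; use the first one.
target = os.path.join(nltk.data.path[0], 'tokenizers')
os.makedirs(target, exist_ok=True)

# Placeholder: wherever you unzipped the downloaded punkt folder.
shutil.copytree('/path/to/punkt', os.path.join(target, 'punkt'))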


I had a similar issue when using an assigned folder for multiple downloads, and I had to append the data path manually.

A single download can be achieved as follows (this works):

import os as _os
from nltk.corpus import stopwords
from nltk import download as nltk_download

# get_project_root_path() is the author's own helper returning the project root
nltk_download('stopwords', download_dir=_os.path.join(get_project_root_path(), 'temp'), raise_on_error=True)

stop_words: list = stopwords.words('english')

This code works, meaning that nltk remembers the download path passed in the download function. On the other hand, if I download a subsequent package I get a similar error to the one described in the question:

Multiple downloads raise an error:

import os as _os

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

from nltk import download as nltk_download

nltk_download(['stopwords', 'punkt'], download_dir=_os.path.join(get_project_root_path(), 'temp'), raise_on_error=True)

print(stopwords.words('english'))
print(word_tokenize("I am trying to find the download path 99."))


Error:

Resource punkt not found. Please use the NLTK Downloader to obtain the resource:

import nltk
nltk.download('punkt')

Now if I append the nltk data path with my download path, it works:

import os as _os

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

from nltk import download as nltk_download
from nltk.data import path as nltk_path


nltk_path.append(_os.path.join(get_project_root_path(), 'temp'))


nltk_download(['stopwords', 'punkt'], download_dir=_os.path.join(get_project_root_path(), 'temp'), raise_on_error=True)

print(stopwords.words('english'))
print(word_tokenize("I am trying to find the download path 99."))

This works... I am not sure why it works in one case but not the other, but the error message seems to imply that nltk doesn't check the download folder the second time. NB: using Windows 8.1 / Python 3.7 / nltk 3.5.
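
If in doubt, you can inspect the search path nltk actually consults; a quick sketch:

import nltk

# Directories nltk will search for resources, in order.
print(nltk.data.path)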


This is what worked for me just now:

# Do this in a separate python interpreter session, since you only have to do it once
import nltk
nltk.download('punkt')

# Do this in your ipython notebook or analysis script
from nltk.tokenize import word_tokenize

sentences = [
    "Mr. Green killed Colonel Mustard in the study with the candlestick. Mr. Green is not a very nice fellow.",
    "Professor Plum has a green plant in his study.",
    "Miss Scarlett watered Professor Plum's green plant while he was away from his office last week."
]

sentences_tokenized = []
for s in sentences:
    sentences_tokenized.append(word_tokenize(s))

sentences_tokenized is a list of lists of tokens:

[['Mr.', 'Green', 'killed', 'Colonel', 'Mustard', 'in', 'the', 'study', 'with', 'the', 'candlestick', '.', 'Mr.', 'Green', 'is', 'not', 'a', 'very', 'nice', 'fellow', '.'],
['Professor', 'Plum', 'has', 'a', 'green', 'plant', 'in', 'his', 'study', '.'],
['Miss', 'Scarlett', 'watered', 'Professor', 'Plum', "'s", 'green', 'plant', 'while', 'he', 'was', 'away', 'from', 'his', 'office', 'last', 'week', '.']]

The sentences were taken from the example IPython notebook accompanying the book "Mining the Social Web, 2nd Edition".


This works for me:

>>> import nltk
>>> nltk.download()

On Windows this also opens the NLTK Downloader GUI.


nltk ships pre-trained tokenizer models. nltk.download('punkt') fetches the model from nltk's predefined web sources and stores it under your nltk_data directory; nltk.data.load('nltk:tokenizers/punkt/english.pickle') then loads it from disk:

E.g. 1: tokenizer = nltk.data.load('nltk:tokenizers/punkt/english.pickle')

E.g. 2: nltk.download('punkt')

If you call nltk.download in your code, make sure you have an internet connection and no firewall is blocking it.

I would like to share an alternative way to resolve the above issue, with a deeper understanding of what is going on.

Please follow these steps to tokenize English text with nltk.

Step 1: Download the "english.pickle" model from the web.

Go to "http://www.nltk.org/nltk_data/" and click "download" next to "107. Punkt Tokenizer Models".

Step 2: Extract the downloaded "punkt.zip" file, find the "english.pickle" file inside it, and place it on the C: drive.

Step 3: Copy and paste the following code and execute it.

from nltk.data import load
from nltk.tokenize.treebank import TreebankWordTokenizer

sentences = [
    "Mr. Green killed Colonel Mustard in the study with the candlestick. Mr. Green is not a very nice fellow.",
    "Professor Plum has a green plant in his study.",
    "Miss Scarlett watered Professor Plum's green plant while he was away from his office last week."
]

tokenizer = load('file:C:/english.pickle')
treebank_word_tokenize = TreebankWordTokenizer().tokenize

wordToken = []
for sent in sentences:
    subSentToken = []
    # Split each input string into sentences, then word-tokenize each sentence.
    for subSent in tokenizer.tokenize(sent):
        subSentToken.extend(treebank_word_tokenize(subSent))
    wordToken.append(subSentToken)

for token in wordToken:
    print(token)

Let me know if you face any problems.


On Jenkins this can be fixed by adding the following line of code to the Virtualenv Builder under the Build tab:

python -m nltk.downloader punkt

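If the Jenkins user's home directory is not writable, you can point the downloader at another directory instead; a minimal sketch, where the target path is an assumption and must also be added to nltk's search path:

import nltk

# Download punkt into a Jenkins-writable directory (path is an assumption)...
nltk.download('punkt', download_dir='/var/lib/jenkins/nltk_data')
# ...and make sure nltk searches that directory at runtime.
nltk.data.path.append('/var/lib/jenkins/nltk_data')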


The main reason you see that error is that nltk could not find the punkt package. Due to the size of the nltk suite, not all available packages are downloaded by default when you install it.

You can download punkt package like this.

import nltk
nltk.download('punkt')

from nltk import word_tokenize,sent_tokenize

If you do not pass any argument to the download function, it downloads all packages, i.e. chunkers, grammars, misc, sentiment, taggers, corpora, help, models, stemmers, and tokenizers:

nltk.download()

The above function saves packages to a specific directory. You can find that directory location in the comments here: https://github.com/nltk/nltk/blob/67ad86524d42a3a86b1f5983868fd2990b59f1ba/nltk/downloader.py#L1051
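
To see where the downloader will write by default on your machine, one option is this small sketch:

import nltk

# The directory nltk.download() uses when no download_dir is given.
print(nltk.downloader.Downloader().default_download_dir())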


Check that you have all the required NLTK data packages installed.


I came across this problem when trying to do POS tagging in nltk. I fixed it by making a new directory named "taggers", alongside the corpora directory, and copying max_pos_tagger into it.
Hope it works for you too. Best of luck!
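
The downloader can usually fetch tagger models for you instead of copying folders by hand; a sketch, assuming the averaged perceptron tagger (the model nltk currently ships) is acceptable in place of the one named above:

import nltk

# Fetch the tagger model and punkt, then tag a tokenized sentence.
nltk.download('averaged_perceptron_tagger')
nltk.download('punkt')

from nltk import pos_tag, word_tokenize
print(pos_tag(word_tokenize("NLTK needs its tagger data downloaded first.")))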


You just need to go to the Python console and type:

import nltk

press Enter, and then type:

nltk.download()

An interface will then appear. Just find the download button and press it. It will install all the required items, which takes some time. Once it finishes, try again; your problem should be solved.


In Spyder, go to your active shell and download nltk using the two commands below:

import nltk
nltk.download()

You should then see the NLTK Downloader window open. Go to the 'Models' tab in this window, click on 'punkt', and download it.


I had this same problem. Go into a python shell and type:

>>> import nltk
>>> nltk.download()

Then an installation window appears. Go to the 'Models' tab and select 'punkt' under the 'Identifier' column. Click Download and it will install the necessary files; it should then work!


In Python 3.6 the suggestion appears right in the traceback, which is quite helpful. So pay attention to the error you get; most of the time the answer is within the problem itself ;).

And then, as suggested by other folks here, you can install the data on the fly, either from the Python terminal or with a command like python -c "import nltk; nltk.download('wordnet')". You only need to run that command once; it saves the data locally in your home directory.


From the bash command line, run:

$ python -c "import nltk; nltk.download('punkt')"

A plain nltk.download() will not solve this issue. I tried the following and it worked for me:

In the '...AppData\Roaming\nltk_data\tokenizers' folder, extract the downloaded punkt.zip in place.
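
A sketch of doing that extraction from Python rather than by hand, assuming punkt.zip has already been downloaded (both paths are placeholders):

import os
import zipfile

# Placeholders: adjust to your download location and user profile.
zip_path = os.path.expanduser('~/Downloads/punkt.zip')
target = os.path.expanduser('~/AppData/Roaming/nltk_data/tokenizers')

os.makedirs(target, exist_ok=True)
with zipfile.ZipFile(zip_path) as zf:
    zf.extractall(target)  # the archive contains a top-level punkt/ folder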


The punkt tokenizer data is quite large at over 35 MB; this can be a big deal if, like me, you are running nltk in an environment such as Lambda that has limited resources.

If you only need one, or perhaps a few, language tokenizers, you can drastically reduce the size of the data by including only those languages' .pickle files.

If you only need to support English, your nltk data size can be reduced to 407 KB (for the Python 3 version).

Steps

  1. Download the nltk punkt data: https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/packages/tokenizers/punkt.zip
  2. Somewhere in your environment create the folders nltk_data/tokenizers/punkt; if using Python 3, add another folder PY3 so that your new directory structure looks like nltk_data/tokenizers/punkt/PY3. In my case I created these folders at the root of my project.
  3. Extract the zip and move the .pickle files for the languages you want to support into the punkt folder you just created. Note: Python 3 users should use the pickles from the PY3 folder.
  4. Now you just need to add your nltk_data folder to the search paths, assuming your data is not in one of the predefined search paths. You can do this with the environment variable NLTK_DATA='path/to/your/nltk_data', or add a custom path at runtime in Python:
from nltk import data
data.path += ['/path/to/your/nltk_data']

NOTE: If you don't need to load the data at runtime or bundle it with your code, it is best to create your nltk_data folders at one of the built-in locations that nltk searches.
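
A quick check that the trimmed-down data is actually found, assuming the nltk_data folder sits at the project root as described (the relative path is an assumption):

from nltk import data
from nltk.tokenize import sent_tokenize

# Point nltk at the bundled data before the first tokenizer call.
data.path += ['./nltk_data']
print(sent_tokenize("It works. Both sentences were found."))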

