[python] How to remove any URL within a string in Python

I want to remove all URLs inside a string (replace them with "") I searched around but couldn't really find what I want.

Example:

text1
text2
http://url.com/bla1/blah1/
text3
text4
http://url.com/bla2/blah2/
text5
text6
http://url.com/bla3/blah3/

I want the result to be:

text1
text2
text3
text4
text5
text6

This question is related to python regex url replace python-2.7

The answer is


I know this has already been answered and its stupid late but I think this should be here. This is a regex that matches any kind of url.

[^ ]+\.[^ ]+

It can be used like

re.sub('[^ ]+\.[^ ]+','',sentence)

The following regular expression in Python works well for detecting URL(s) in the text:

source_text = '''
text1
text2
http://url.com/bla1/blah1/
text3
text4
http://url.com/bla2/blah2/
text5
text6    '''

import re
url_reg  = r'[a-z]*[:.]+\S+'
result   = re.sub(url_reg, '', source_text)
print(result)

Output:

text1
text2

text3
text4

text5
text6

import re
s = '''
text1
text2
http://url.com/bla1/blah1/
text3
text4
http://url.com/bla2/blah2/
text5
text6
http://url.com/bla3/blah3/'''
g = re.findall(r'(text\d+)',s)
print ('list',g)
for i in g:
    print (i)

Out

list ['text1', 'text2', 'text3', 'text4', 'text5', 'text6']
text1
text2
text3
text4
text5
text6    ?

This solution caters for http, https and the other normal url type special characters :

import re
def remove_urls (vTEXT):
    vTEXT = re.sub(r'(https|http)?:\/\/(\w|\.|\/|\?|\=|\&|\%)*\b', '', vTEXT, flags=re.MULTILINE)
    return(vTEXT)


print( remove_urls("this is a test https://sdfs.sdfsdf.com/sdfsdf/sdfsdf/sd/sdfsdfs?bob=%20tree&jef=man lets see this too https://sdfsdf.fdf.com/sdf/f end"))

the shortest way

re.sub(r'http\S+', '', stringliteral)

What you really want to do is to remove any string that starts with either http:// or https:// plus any combination of non white space characters. Here is how I would solve it. My solution is very similar to that of @tolgayilmaz

#Define the text from which you want to replace the url with "".
text ='''The link to this post is https://stackoverflow.com/questions/11331982/how-to-remove-any-url-within-a-string-in-python'''

import re
#Either use:
re.sub('http://\S+|https://\S+', '', text)
#OR 
re.sub('http[s]?://\S+', '', text)

And the result of running either code above is

>>> 'The link to this post is '

I prefer the second one because it is more readable.


Removal of HTTP links/URLs mixed up in any text:

import re
re.sub(r'''(?i)\b((?:https?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))''', " ", text)

I wasn't able to find any that handled my particular situation, which was removing urls in the middle of tweets that also have whitespaces in the middle of urls so I made my own:

(https?:\/\/)(\s)*(www\.)?(\s)*((\w|\s)+\.)*([\w\-\s]+\/)*([\w\-]+)((\?)?[\w\s]*=\s*[\w\%&]*)*

here's an explanation:
(https?:\/\/) matches http:// or https://
(\s)* optional whitespaces
(www\.)? optionally matches www.
(\s)* optionally matches whitespaces
((\w|\s)+\.)* matches 0 or more of one or more word characters followed by a period
([\w\-\s]+\/)* matches 0 or more of one or more words(or a dash or a space) followed by '\'
([\w\-]+) any remaining path at the end of the url followed by an optional ending
((\?)?[\w\s]*=\s*[\w\%&]*)* matches ending query params (even with white spaces,etc)

test this out here:https://regex101.com/r/NmVGOo/8


This worked for me:

import re
thestring = "text1\ntext2\nhttp://url.com/bla1/blah1/\ntext3\ntext4\nhttp://url.com/bla2/blah2/\ntext5\ntext6"

URLless_string = re.sub(r'\w+:\/{2}[\d\w-]+(\.[\d\w-]+)*(?:(?:\/[^\s/]*))*', '', thestring)
print URLless_string

Result:

text1
text2

text3
text4

text5
text6

You could also look at it from the other way around...

from urlparse import urlparse
[el for el in ['text1', 'FTP://somewhere.com', 'text2', 'http://blah.com:8080/foo/bar#header'] if not urlparse(el).scheme]

In order to remove any URL within a string in Python, you can use this RegEx function :

import re

def remove_URL(text):
    """Remove URLs from a text string"""
    return re.sub(r"http\S+", "", text)

Examples related to python

programming a servo thru a barometer Is there a way to view two blocks of code from the same file simultaneously in Sublime Text? python variable NameError Why my regexp for hyphenated words doesn't work? Comparing a variable with a string python not working when redirecting from bash script is it possible to add colors to python output? Get Public URL for File - Google Cloud Storage - App Engine (Python) Real time face detection OpenCV, Python xlrd.biffh.XLRDError: Excel xlsx file; not supported Could not load dynamic library 'cudart64_101.dll' on tensorflow CPU-only installation

Examples related to regex

Why my regexp for hyphenated words doesn't work? grep's at sign caught as whitespace Preg_match backtrack error regex match any single character (one character only) re.sub erroring with "Expected string or bytes-like object" Only numbers. Input number in React Visual Studio Code Search and Replace with Regular Expressions Strip / trim all strings of a dataframe return string with first match Regex How to capture multiple repeated groups?

Examples related to url

What is the difference between URL parameters and query strings? Allow Access-Control-Allow-Origin header using HTML5 fetch API File URL "Not allowed to load local resource" in the Internet Browser Slack URL to open a channel from browser Getting absolute URLs using ASP.NET Core How do I load an HTTP URL with App Transport Security enabled in iOS 9? Adding form action in html in laravel React-router urls don't work when refreshing or writing manually URL for public Amazon S3 bucket How can I append a query parameter to an existing URL?

Examples related to replace

How do I find and replace all occurrences (in all files) in Visual Studio Code? How to find and replace with regex in excel How to replace text in a column of a Pandas dataframe? How to replace negative numbers in Pandas Data Frame by zero Replacing few values in a pandas dataframe column with another value How to replace multiple patterns at once with sed? Using tr to replace newline with space replace special characters in a string python Replace None with NaN in pandas dataframe Batch script to find and replace a string in text file within a minute for files up to 12 MB

Examples related to python-2.7

Numpy, multiply array with scalar Not able to install Python packages [SSL: TLSV1_ALERT_PROTOCOL_VERSION] How to create a new text file using Python Could not find a version that satisfies the requirement tensorflow Python: Pandas pd.read_excel giving ImportError: Install xlrd >= 0.9.0 for Excel support Display/Print one column from a DataFrame of Series in Pandas How to calculate 1st and 3rd quartiles? How can I read pdf in python? How to completely uninstall python 2.7.13 on Ubuntu 16.04 Check key exist in python dict