Python dir notes

Use os.listdir (or os.scandir in Python >= 3.5) to list the items in a single directory; use os.walk (which itself uses os.listdir, or os.scandir in >= 3.5) to list the items in an entire directory tree.

Using os.walk on just a single directory is less efficient than calling the function it uses internally to do that job.

I’m pretty sure in every version where scandir is more efficient, listdir and walk use it internally.

The directory path uses a forward slash, but I didn’t say what operating system I was using. Windows uses backward slashes to denote subdirectories, while Mac OS X and Linux use forward slashes. But in Python, forward slashes always Just Work, even on Windows.

Python 101: How to traverse a directory
January 26, 2016 | Python

Using os.walk

Using os.walk takes a bit of practice to get it right. Here’s an example that just prints out all your file names in the path that you passed to it:

import os

def pywalker(path):
    for root, dirs, files in os.walk(path):
        for file_ in files:
            print(os.path.join(root, file_))

if __name__ == '__main__':
    pywalker('/path/to/some/folder')

By joining the root and the file_ elements, you end up with the full path to the file. If you want to check when the file was created or last modified, you would use os.stat. I've used this in the past to create a cleanup script, for example.
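As a minimal sketch of that idea (the file name and the 30-day cutoff here are made up for illustration, and a temp directory is used so the snippet runs anywhere), an os.stat-based age check might look like:

```python
import os
import tempfile
import time

# Sketch of a cleanup-style check: is a file older than some cutoff?
# The directory, file name, and 30-day cutoff are illustrative only.
tmp = tempfile.mkdtemp()
target = os.path.join(tmp, 'example.txt')
with open(target, 'w') as f:
    f.write('hello')

info = os.stat(target)
age_seconds = time.time() - info.st_mtime   # seconds since last modification
is_stale = age_seconds > 30 * 24 * 60 * 60  # older than ~30 days?
print(is_stale)
```

A real cleanup script would walk a tree with os.walk and delete or archive the stale paths instead of printing.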

If all you want is a listing of the folders and files in the specified path, then you're looking for os.listdir. Most of the time, though, I need to drill down to the lowest sub-folder, so listdir isn't good enough and I need os.walk instead.
Using os.scandir() Directly

Python 3.5 recently added os.scandir(), which is a new directory iteration function. You can read about it in PEP 471. In Python 3.5, os.walk is implemented using os.scandir “which makes it 3 to 5 times faster on POSIX systems and 7 to 20 times faster on Windows systems” according to the Python 3.5 announcement.
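If you want to see what a walk costs on your own machine, here is a rough timing sketch (results vary wildly by OS and filesystem, so treat any number as illustrative only):

```python
import os
import timeit

# Rough benchmark sketch: walk the current directory tree a few times
# and report the total time. On 3.5+, os.walk uses os.scandir underneath.
def walk_tree(path='.'):
    count = 0
    for root, dirs, files in os.walk(path):
        count += len(files)
    return count

elapsed = timeit.timeit(lambda: walk_tree('.'), number=5)
print('5 walks took %.3f seconds' % elapsed)
```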

Let’s try out a simple example using os.scandir directly.

import os

folders = []
files = []

for entry in os.scandir('/'):
    if entry.is_dir():
        folders.append(entry.path)
    elif entry.is_file():
        files.append(entry.path)

print('Folders:')
print(folders)

Scandir returns an iterator of DirEntry objects, which are lightweight and have handy methods that can tell you about the paths you're iterating over. In the example above, we check whether each entry is a file or a directory and append it to the appropriate list. You can also get a stat object via the DirEntry's stat() method, which is pretty neat!
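For example, here is a small sketch of pulling file sizes out of DirEntry.stat(), using a throwaway temp directory so it runs anywhere:

```python
import os
import tempfile

# Minimal sketch: DirEntry.stat() exposes the same fields as os.stat(),
# and on some platforms it can avoid an extra system call per entry.
tmp = tempfile.mkdtemp()
with open(os.path.join(tmp, 'demo.bin'), 'wb') as f:
    f.write(b'12345')

for entry in os.scandir(tmp):
    if entry.is_file():
        print(entry.name, entry.stat().st_size)
```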
Wrapping Up

Now you know how to traverse a directory structure in Python. If you'd like the speed improvements that os.scandir provides in a version of Python older than 3.5, you can get the scandir package from PyPI.
Which is much more efficient, glob.glob() or os.listdir()? (self.learnpython)


Is there a significant difference in processing speed between the two functions, especially when scanning a directory for files with a particular extension?

Here are the two snippets showing how I use them both:

os.chdir("./")
for file in glob.glob("*.pdf"):
    convert_file_to_image(file)

And

for file in os.listdir("./"):
    if file.endswith(".pdf"):
        convert_file_to_image(file)

I believe the listdir() version creates more overhead, since the check for the file extension happens inside the loop for every file, whereas glob() accepts the pattern as an argument before the loop starts. However, I need to call chdir() before it starts.

[–]Brian (1 year ago)

There's going to be almost no difference. Under the covers, glob is essentially doing the same thing as your listdir code (i.e. calling listdir and then checking whether the paths match a particular pattern). In fact, it's actually doing a bit more work since it's more general (i.e. it handles more types of patterns than just "ends with"), so if anything it's going to be the slower one.
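To make that concrete, here is a rough sketch (not glob's actual source) of what a non-recursive pattern boils down to: a listdir call plus an fnmatch filter.

```python
import fnmatch
import os
import tempfile

def simple_glob(path, pattern):
    # Roughly what glob does for a single directory level:
    # list the entries, then keep the ones matching the pattern.
    return sorted(name for name in os.listdir(path)
                  if fnmatch.fnmatch(name, pattern))

# Demonstrate against a throwaway directory.
tmp = tempfile.mkdtemp()
for name in ('a.pdf', 'b.pdf', 'notes.txt'):
    open(os.path.join(tmp, name), 'w').close()

print(simple_glob(tmp, '*.pdf'))  # → ['a.pdf', 'b.pdf']
```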

However, the difference is going to be pretty much undetectable unless you're looking at directories with millions of files (and even then it's going to be pretty negligible). Computers are fast, and this kind of check takes a tiny amount of time; the time to actually access the disk is always going to completely dwarf it. Trying to optimise it is like trying to reduce the time it takes a train door to open in an effort to cut down a 10-minute journey. Sure, maybe you could cut half a second off with a faster door, but it's not going to be noticeable next to the travel time.

however I need to specify a chdir() before it starts

Not at all – you can include the path as part of your pattern, i.e. glob.glob("path/to/file/*.pdf").

[–]sanshinron (1 year ago)

Use pathlib. It was introduced into the standard library for a reason. It's also based on scandir, so it will be among the most efficient options.
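The comment above doesn't show code, so here is a small sketch of the pathlib approach, built against a temp directory so it is self-contained:

```python
import tempfile
from pathlib import Path

# Sketch: Path.glob matches one level; Path.rglob matches recursively.
tmp = Path(tempfile.mkdtemp())
(tmp / 'sub').mkdir()
(tmp / 'top.pdf').touch()
(tmp / 'sub' / 'nested.pdf').touch()

top_level = sorted(p.name for p in tmp.glob('*.pdf'))
everywhere = sorted(p.name for p in tmp.rglob('*.pdf'))
print(top_level)   # → ['top.pdf']
print(everywhere)  # → ['nested.pdf', 'top.pdf']
```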

[–]Rhomboid (1 year ago)

however I need to specify a chdir() before it starts.

You can give a path to glob.glob(), there’s no need to change the working directory. (Also, changing the working directory to . is completely pointless and doesn’t do anything.)

Any minor differences between the two are going to be a wash. What you really want to use is os.scandir(), added in 3.5. It’s significantly faster than anything else.

11.7. glob — Unix style pathname pattern expansion — Python 3.6.5rc1 documentation

Source code: Lib/glob.py

The glob module finds all the pathnames matching a specified pattern according to the rules used by the Unix shell, although results are returned in arbitrary order. No tilde expansion is done, but *, ?, and character ranges expressed with [] will be correctly matched. This is done by using the os.scandir() and fnmatch.fnmatch() functions in concert, and not by actually invoking a subshell. Note that unlike fnmatch.fnmatch(), glob treats filenames beginning with a dot (.) as special cases. (For tilde and shell variable expansion, use os.path.expanduser() and os.path.expandvars().)

For a literal match, wrap the meta-characters in brackets. For example, '[?]' matches the character '?'.

See also

The pathlib module offers high-level path objects.

glob.glob(pathname, *, recursive=False)

Return a possibly-empty list of path names that match pathname, which must be a string containing a path specification. pathname can be either absolute (like /usr/src/Python-1.5/Makefile) or relative (like ../../Tools/*/*.gif), and can contain shell-style wildcards. Broken symlinks are included in the results (as in the shell).

If recursive is true, the pattern “**” will match any files and zero or more directories and subdirectories. If the pattern is followed by an os.sep, only directories and subdirectories match.

Note

Using the “**” pattern in large directory trees may consume an inordinate amount of time.

Changed in version 3.5: Support for recursive globs using “**”.

glob.iglob(pathname, *, recursive=False)

Return an iterator which yields the same values as glob() without actually storing them all simultaneously.

glob.escape(pathname)

Escape all special characters ('?', '*' and '['). This is useful if you want to match an arbitrary literal string that may have special characters in it. Special characters in drive/UNC sharepoints are not escaped, e.g. on Windows escape('//?/c:/Quo vadis?.txt') returns '//?/c:/Quo vadis[?].txt'.

New in version 3.4.

For example, consider a directory containing the following files: 1.gif, 2.txt, card.gif and a subdirectory sub which contains only the file 3.txt. glob() will produce the following results. Notice how any leading components of the path are preserved.
>>> import glob
>>> glob.glob('./[0-9].*')
['./1.gif', './2.txt']
>>> glob.glob('*.gif')
['1.gif', 'card.gif']
>>> glob.glob('?.gif')
['1.gif']
>>> glob.glob('**/*.txt', recursive=True)
['2.txt', 'sub/3.txt']
>>> glob.glob('./**/', recursive=True)
['./', './sub/']

If the directory contains files starting with . they won’t be matched by default. For example, consider a directory containing card.gif and .card.gif:
>>> import glob
>>> glob.glob('*.gif')
['card.gif']
>>> glob.glob('.c*')
['.card.gif']

See also

Module fnmatch
Shell-style filename (not path) expansion

glob – Filename pattern matching
Purpose: Use Unix shell rules to find filenames matching a pattern.
Available In: 1.4

Even though the glob API is very simple, the module packs a lot of power. It is useful in any situation where your program needs to look for a list of files on the filesystem with names matching a pattern. If you need a list of filenames that all have a certain extension, prefix, or any common string in the middle, use glob instead of writing code to scan the directory contents yourself.

The pattern rules for glob are not regular expressions. Instead, they follow standard Unix path expansion rules. There are only a few special characters: two different wild-cards, and character ranges are supported. The patterns rules are applied to segments of the filename (stopping at the path separator, /). Paths in the pattern can be relative or absolute. Shell variable names and tilde (~) are not expanded.
Example Data

The examples below assume the following test files are present in the current working directory:

$ python glob_maketestdata.py

dir
dir/file.txt
dir/file1.txt
dir/file2.txt
dir/filea.txt
dir/fileb.txt
dir/subdir
dir/subdir/subfile.txt

Note

Use glob_maketestdata.py in the sample code to create these files if you want to run the examples.
Wildcards

An asterisk (*) matches zero or more characters in a segment of a name. For example, dir/*.

import glob

for name in glob.glob('dir/*'):
    print name

The pattern matches every pathname (file or directory) in the directory dir, without recursing further into subdirectories.

$ python glob_asterisk.py

dir/file.txt
dir/file1.txt
dir/file2.txt
dir/filea.txt
dir/fileb.txt
dir/subdir

To list files in a subdirectory, you must include the subdirectory in the pattern:

import glob

print 'Named explicitly:'
for name in glob.glob('dir/subdir/*'):
    print '\t', name

print 'Named with wildcard:'
for name in glob.glob('dir/*/*'):
    print '\t', name

The first case above lists the subdirectory name explicitly, while the second case depends on a wildcard to find the directory.

$ python glob_subdir.py

Named explicitly:
dir/subdir/subfile.txt
Named with wildcard:
dir/subdir/subfile.txt

The results, in this case, are the same. If there were another subdirectory, the wildcard would match both subdirectories and include the filenames from both.
Single Character Wildcard

The other wildcard character supported is the question mark (?). It matches any single character in that position in the name. For example,

import glob

for name in glob.glob('dir/file?.txt'):
    print name

This matches all of the filenames that begin with "file", have one more character of any type, and then end with ".txt".

$ python glob_question.py

dir/file1.txt
dir/file2.txt
dir/filea.txt
dir/fileb.txt

Character Ranges

When you need to match a specific character, use a character range instead of a question mark. For example, to find all of the files which have a digit in the name before the extension:

import glob

for name in glob.glob('dir/*[0-9].*'):
    print name

The character range [0-9] matches any single digit. The range is ordered based on the character code for each letter/digit, and the dash indicates an unbroken range of sequential characters. The same range value could be written [0123456789].

$ python glob_charrange.py

dir/file1.txt
dir/file2.txt

See also

glob
The standard library documentation for this module.
Pattern Matching Notation
An explanation of globbing from The Open Group’s Shell Command Language specification.
fnmatch
Filename matching implementation.
File Access
Other tools for working with files.

How can I search sub-folders using glob.glob module in Python? – Stack Overflow

I want to open a series of subfolders in a folder and find some text files and print some lines of the text files. I am using this:

configfiles = glob.glob('C:/Users/sam/Desktop/file1/*.txt')

But this cannot access the subfolders as well. Does anyone know how I can use the same command to access subfolders as well?
edited Mar 17 ’17 at 0:19
asked Feb 10 ’13 at 13:27

related to: Use a Glob() to find files recursively in Python – samkhan13 Jun 10 ’13 at 18:56

9 Answers

In Python 3.5 and newer use the new recursive **/ functionality:

configfiles = glob.glob('C:/Users/sam/Desktop/file1/**/*.txt', recursive=True)

When recursive is set, ** followed by a path separator matches 0 or more subdirectories.

In earlier Python versions, glob.glob() cannot list files in subdirectories recursively.

In that case I’d use os.walk() combined with fnmatch.filter() instead:

import os
import fnmatch

path = 'C:/Users/sam/Desktop/file1'

configfiles = [os.path.join(dirpath, f)
               for dirpath, dirnames, files in os.walk(path)
               for f in fnmatch.filter(files, '*.txt')]

This will walk your directories recursively and return all absolute pathnames to matching .txt files. In this specific case, fnmatch.filter() may be overkill; you could also use an .endswith() test:

import os

path = 'C:/Users/sam/Desktop/file1'

configfiles = [os.path.join(dirpath, f)
               for dirpath, dirnames, files in os.walk(path)
               for f in files if f.endswith('.txt')]

edited Nov 5 ’17 at 11:46
answered Feb 10 ’13 at 13:31

I can see glob.glob('/path to directory/*/*.txt') working for me. This is basically using the Unix shell rule. – Surya May 15 '16 at 21:09
@User123: that doesn't list directories recursively. You are listing all text files one level deep, but not in further subdirectories or even directly in path to directory. – Martijn Pieters♦ May 15 '16 at 21:21

The glob2 package supports wild cards and is reasonably fast

code = '''
import glob2
glob2.glob("files/*/**")
'''
timeit.timeit(code, number=1)

On my laptop it takes approximately 2 seconds to match >60,000 file paths.
edited Aug 15 ’14 at 16:37
answered Mar 13 ’14 at 19:10

To find files in immediate subdirectories:

configfiles = glob.glob(r'C:\Users\sam\Desktop\*\*.txt')

For a recursive version that traverses all subdirectories, you can use ** and pass recursive=True since Python 3.5:

configfiles = glob.glob(r'C:\Users\sam\Desktop\**\*.txt', recursive=True)

Both function calls return lists. You could use glob.iglob() to return paths one by one. Or use pathlib:

from pathlib import Path

path = Path(r'C:\Users\sam\Desktop')
txt_files_only_subdirs = path.glob('*/*.txt')
txt_files_all_recursively = path.rglob('*.txt')  # including the current dir

Both methods return iterators (you can get paths one by one).
edited Mar 6 ’17 at 10:30
answered Feb 10 ’13 at 13:47

Yes, I understood that; but I didn’t expect glob() to support patterns in directories either. – Martijn Pieters♦ Feb 10 ’13 at 13:57
Comment deleted, I see now that it gave the wrong impression; besides, the patch includes a documentation update for the ** recursion case. But for ** to work, you have to set the recursive=True switch, btw. – Martijn Pieters♦ Feb 10 '13 at 14:53

You can use Formic with Python 2.6

import formic
fileset = formic.FileSet(include="**/*.txt", directory="C:/Users/sam/Desktop/")

Disclosure – I am the author of this package.
answered Feb 12 ’13 at 1:12

Works well. Is a good solution. – JayJay123 Sep 26 ’17 at 23:46

Here is an adapted version that enables glob.glob-like functionality without using glob2.

import os
import fnmatch

def find_files(directory, pattern='*'):
    if not os.path.exists(directory):
        raise ValueError("Directory not found {}".format(directory))

    matches = []
    for root, dirnames, filenames in os.walk(directory):
        for filename in filenames:
            full_path = os.path.join(root, filename)
            if fnmatch.filter([full_path], pattern):
                matches.append(os.path.join(root, filename))
    return matches

So if you have the following dir structure

tests/files
├── a0
│ ├── a0.txt
│ ├── a0.yaml
│ └── b0
│ ├── b0.yaml
│ └── b00.yaml
└── a1

You can do something like this

files = utils.find_files('tests/files', '**/b0/b*.yaml')
> ['tests/files/a0/b0/b0.yaml', 'tests/files/a0/b0/b00.yaml']

This is essentially an fnmatch pattern match against the whole path, rather than against the base filename only.
answered Mar 26 ’15 at 2:20

configfiles = glob.glob('C:/Users/sam/Desktop/**/*.txt')

This doesn't work for all cases; instead use glob2:

configfiles = glob2.glob('C:/Users/sam/Desktop/**/*.txt')

edited May 26 ’16 at 10:01
answered May 26 ’16 at 9:31

If you can install glob2 package…

import glob2
filenames = glob2.glob("C:\\top_directory\\**\\*.ext")  # where ext is a specific file extension
folders = glob2.glob("C:\\top_directory\\**\\")

All filenames and folders:

all_ff = glob2.glob("C:\\top_directory\\**\\**")

answered Jul 26 ’16 at 15:05

As pointed out by Martijn, glob can only do this through the ** operator introduced in Python 3.5. Since the OP explicitly asked for the glob module, the following will return a lazily evaluated iterator that behaves similarly:

import os, glob, itertools

configfiles = itertools.chain.from_iterable(
    glob.iglob(os.path.join(root, '*.txt'))
    for root, dirs, files in os.walk('C:/Users/sam/Desktop/file1/'))

Note that with this approach you can only iterate over configfiles once, though. If you require a real list of configfiles that can be used in multiple operations, you would have to create it explicitly by using list(configfiles).
answered Dec 5 ’15 at 23:45

If you’re running Python 3.4+, you can use the pathlib module. The Path.glob() method supports the ** pattern, which means “this directory and all subdirectories, recursively”. It returns a generator yielding Path objects for all matching files.

from pathlib import Path

configfiles = Path("C:/Users/sam/Desktop/file1/").glob("**/*.txt")

edited Oct 8 ’17 at 20:57
answered Jun 29 ’17 at 23:07
path – Use a Glob() to find files recursively in Python? – Stack Overflow

This is what I have:

glob(os.path.join('src', '*.c'))

but I want to search the subfolders of src. Something like this would work:

glob(os.path.join('src', '*.c'))
glob(os.path.join('src', '*', '*.c'))
glob(os.path.join('src', '*', '*', '*.c'))
glob(os.path.join('src', '*', '*', '*', '*.c'))

But this is obviously limited and clunky.
edited Oct 3 ’16 at 14:36
asked Feb 2 ’10 at 18:19
21 Answers

Python 3.5+

Starting with Python version 3.5, the glob module supports the "**" directive (which is parsed only if you pass the recursive flag):

import glob

for filename in glob.iglob('src/**/*.c', recursive=True):
    print(filename)

If you need a list, just use glob.glob instead of glob.iglob.

For cases where you need to match files beginning with a dot (.), such as hidden files on Unix-based systems or files in the current directory, use the os.walk solution below.

Python 2.2 to 3.4

For older Python versions, starting with Python 2.2, use os.walk to recursively walk a directory and fnmatch.filter to match against a simple expression:

import fnmatch
import os

matches = []
for root, dirnames, filenames in os.walk('src'):
    for filename in fnmatch.filter(filenames, '*.c'):
        matches.append(os.path.join(root, filename))

Python 2.1 and earlier

For even older Python versions, use glob.glob against each filename instead of fnmatch.filter.
edited Dec 12 ’17 at 20:26
answered Feb 2 ’10 at 18:26

For Python older than 2.2 there is os.path.walk(), which is a little more fiddly to use than os.walk(). – John La Rooy Feb 2 '10 at 19:34
@gnibbler I know that is an old comment, but my comment is just to let people know that os.path.walk() is deprecated and has been removed in Python 3. – Pedro Cunha Jan 18 '13 at 16:14
Why not .endswith('.c')? I think that will be faster than fnmatch in this scenario. – DevC Mar 18 '14 at 6:53
@DevC that might work in the specific case asked in this question, but it's easy to imagine someone who wants to use it with queries such as 'a*.c' etc, so I think it's worth keeping the current somewhat slow answer. – Johan Dahlin May 19 '14 at 19:29
I concur that this is the gold standard. For some reason, I wasn't able to import glob as root. In a normal user prompt it worked, but not in a root prompt (Fedora 20, Python 2.7). So fnmatch and this answer are a gift here. – DarkForce Mar 17 '15 at 11:02


Similar to other solutions, but using fnmatch.fnmatch instead of glob, since os.walk already listed the filenames:

import os, fnmatch

def find_files(directory, pattern):
    for root, dirs, files in os.walk(directory):
        for basename in files:
            if fnmatch.fnmatch(basename, pattern):
                filename = os.path.join(root, basename)
                yield filename

for filename in find_files('src', '*.c'):
    print 'Found C source:', filename

Also, using a generator allows you to process each file as it is found, instead of finding all the files and then processing them.
edited Nov 23 ’11 at 11:00
answered Feb 2 ’10 at 18:44

1
because 1-liners are fun: reduce(lambda x, y: x+y, map(lambda (r,_,x):map(lambda f: r+’/’+f, filter(lambda f: fnmatch.fnmatch(f, pattern), x)), os.walk(‘src/webapp/test_scripts’))) – njzk2 Aug 1 ’16 at 18:07

I’ve modified the glob module to support ** for recursive globbing, e.g:

>>> import glob2
>>> all_header_files = glob2.glob('src/**/*.c')

https://github.com/miracle2k/python-glob2/

Useful when you want to provide your users with the ability to use the ** syntax, and thus os.walk() alone is not good enough.
edited Jul 14 ’15 at 19:40
answered Jun 26 ’11 at 14:14

Can we make this stop after it finds the first match? Maybe make it possible to use it as a generator rather than having it return a list of every possible result? Also, is this a DFS or a BFS? I'd much prefer a BFS, I think, so that files which are near the root are found first. +1 for making this module and providing it on GitHub/pip. – ArtOfWarfare Aug 5 '14 at 18:24
The ** syntax was added to the official glob module in Python 3.5. – ArtOfWarfare Jan 26 '15 at 19:13
Ended up using this in Python 2.7. Works like a charm. – Sudipta Basak Apr 19 '16 at 22:35
@ArtOfWarfare Alright, fine. This is still useful for < 3.5. – cᴏʟᴅsᴘᴇᴇᴅ Jul 3 '17 at 14:42
Success, thank you! Env: Python 2.7, Anaconda. – partida Jan 8 at 3:01

Starting with Python 3.4, one can use the glob() method of one of the Path classes in the new pathlib module, which supports ** wildcards. For example:

from pathlib import Path

for file_path in Path('src').glob('**/*.c'):
    print(file_path)  # do whatever you need with these files

Update: Starting with Python 3.5, the same syntax is also supported by glob.glob().
edited Aug 19 '15 at 9:13
answered Nov 11 '14 at 16:08

This should be the default behaviour of the glob module. Thank you. – jgomo3 Jan 5 '15 at 22:42
Indeed, and it will be in Python 3.5. It was supposed to already be so in Python 3.4, but was omitted by mistake. – taleinat Feb 24 '15 at 17:39
Beautiful solution, this should be the accepted answer. Thank you. – sleepycal May 1 '15 at 11:31
This syntax is now supported by glob.glob() as of Python 3.5. – taleinat Aug 4 '15 at 15:20
Thanks for this, I incorporated these changes into my answer. – Johan Dahlin Nov 12 '15 at 13:26

import os
import fnmatch

def recursive_glob(treeroot, pattern):
    results = []
    for base, dirs, files in os.walk(treeroot):
        goodfiles = fnmatch.filter(files, pattern)
        results.extend(os.path.join(base, f) for f in goodfiles)
    return results

fnmatch gives you exactly the same patterns as glob, so this is really an excellent replacement for glob.glob with very close semantics. An iterative version (e.g. a generator), IOW a replacement for glob.iglob, is a trivial adaptation (just yield the intermediate results as you go, instead of extending a single results list to return at the end).
edited Dec 9 '14 at 10:02
answered Feb 2 '10 at 18:39

What do you think about using recursive_glob(pattern, treeroot='.') as I suggested in my edit? This way, it can be called for example as recursive_glob('*.txt') and intuitively match the syntax of glob. – Chris Redford Jan 4 '15 at 21:07
@ChrisRedford, I see it as a pretty minor issue either way. As it stands now, it matches the "files then pattern" argument order of fnmatch.filter, which is roughly as useful as the possibility of matching single-argument glob.glob. – Alex Martelli Jan 4 '15 at 21:43

You'll want to use os.walk to collect filenames that match your criteria. For example:

import os

cfiles = []
for root, dirs, files in os.walk('src'):
    for file in files:
        if file.endswith('.c'):
            cfiles.append(os.path.join(root, file))

answered Feb 2 '10 at 18:24

Here's a solution with nested list comprehensions, os.walk and simple suffix matching instead of glob:

import os

cfiles = [os.path.join(root, filename)
          for root, dirnames, filenames in os.walk('src')
          for filename in filenames
          if filename.endswith('.c')]

It can be compressed to a one-liner:

import os; cfiles = [os.path.join(r, f) for r, d, fs in os.walk('src') for f in fs if f.endswith('.c')]

or generalized as a function:

import os

def recursive_glob(rootdir='.', suffix=''):
    return [os.path.join(looproot, filename)
            for looproot, _, filenames in os.walk(rootdir)
            for filename in filenames
            if filename.endswith(suffix)]

cfiles = recursive_glob('src', '.c')

If you do need full glob-style patterns, you can follow Alex's and Bruno's example and use fnmatch:

import fnmatch
import os

def recursive_glob(rootdir='.', pattern='*'):
    return [os.path.join(looproot, filename)
            for looproot, _, filenames in os.walk(rootdir)
            for filename in filenames
            if fnmatch.fnmatch(filename, pattern)]

cfiles = recursive_glob('src', '*.c')

edited Sep 10 '14 at 17:38
answered Nov 2 '11 at 8:10

Johan and Bruno provide excellent solutions for the minimal requirement as stated. I have just released Formic, which implements Ant FileSets and Globs and can handle this and more complicated scenarios. An implementation of your requirement is:

import formic

fileset = formic.FileSet(include="/src/**/*.c")
for file_name in fileset.qualified_files():
    print file_name

edited May 15 '12 at 9:14
answered May 15 '12 at 8:53

Formic appears to be abandoned?! And it does not support Python 3 (bitbucket.org/aviser/formic/issue/12/support-python-3) – blueyed Sep 4 '14 at 3:53

Based on other answers, this is my current working implementation, which retrieves nested XML files in a root directory:

files = []
for root, dirnames, filenames in os.walk(myDir):
    files.extend(glob.glob(root + "/*.xml"))

I'm really having fun with Python 🙂
answered Jul 28 '12 at 22:09

Recently I had to recover my pictures with the extension .jpg. I ran photorec and recovered 4579 directories with 2.2 million files in them, having a tremendous variety of extensions. With the script below I was able to select 50133 files having the .jpg extension within minutes:

#!/usr/bin/env python2.7
import glob
import shutil
import os

src_dir = "/home/mustafa/Masaüstü/yedek"
dst_dir = "/home/mustafa/Genel/media"
for mediafile in glob.iglob(os.path.join(src_dir, "*", "*.jpg")):  # "*" is for subdirectory
    shutil.copy(mediafile, dst_dir)

edited Jan 5 '13 at 10:59
answered Jan 5 '13 at 10:36

Another way to do it using just the glob module. Just seed the rglob method with a starting base directory and a pattern to match and it will return a list of matching file names.

import glob
import os

def _getDirs(base):
    return [x for x in glob.iglob(os.path.join(base, '*')) if os.path.isdir(x)]

def rglob(base, pattern):
    results = []
    results.extend(glob.glob(os.path.join(base, pattern)))
    dirs = _getDirs(base)
    if len(dirs):
        for d in dirs:
            results.extend(rglob(os.path.join(base, d), pattern))
    return results

edited Dec 28 '11 at 18:07
answered Sep 13 '11 at 22:59

Just made this. It will print files and directories in a hierarchical way, but I didn't use fnmatch or walk:

#!/usr/bin/python
import os, glob, sys

def dirlist(path, c=1):
    for i in glob.glob(os.path.join(path, "*")):
        if os.path.isfile(i):
            filepath, filename = os.path.split(i)
            print '----' * c + filename
        elif os.path.isdir(i):
            dirname = os.path.basename(i)
            print '----' * c + dirname
            c += 1
            dirlist(i, c)
            c -= 1

path = os.path.normpath(sys.argv[1])
print(os.path.basename(path))
dirlist(path)

answered Jul 27 '13 at 18:12

In addition to the suggested answers, you can do this with some lazy generation and list comprehension magic:

import os, glob, itertools

results = itertools.chain.from_iterable(
    glob.iglob(os.path.join(root, '*.c'))
    for root, dirs, files in os.walk('src'))

for f in results:
    print(f)

Besides fitting in one line and avoiding unnecessary lists in memory, this also has the nice side effect that you can use it in a way similar to the ** operator, e.g., you could use os.path.join(root, 'some/path/*.c') in order to get all .c files in all subdirectories of src that have this structure.
answered Dec 5 '15 at 17:42

Simplified version of Johan Dahlin's answer, without fnmatch:

import os

matches = []
for root, dirnames, filenames in os.walk('src'):
    matches += [os.path.join(root, f) for f in filenames if f[-2:] == '.c']

answered Jun 3 '13 at 1:29

Or with a list comprehension:

>>> base = r"c:\User\xtofl"
>>> binfiles = [os.path.join(base, f)
...             for base, _, files in os.walk(root)
...             for f in files if f.endswith(".jpg")]

answered Jun 24 ’13 at 10:41

This one accepts either an fnmatch pattern or a compiled regular expression:

import fnmatch, os

def filepaths(directory, pattern):
    for root, dirs, files in os.walk(directory):
        for basename in files:
            try:
                matched = pattern.match(basename)
            except AttributeError:
                matched = fnmatch.fnmatch(basename, pattern)
            if matched:
                yield os.path.join(root, basename)

# usage
if __name__ == '__main__':
    from pprint import pprint as pp
    import re
    path = r'/Users/hipertracker/app/myapp'
    pp([x for x in filepaths(path, re.compile(r'.*\.py$'))])
    pp([x for x in filepaths(path, '*.py')])

answered Aug 2 ’13 at 16:01

Here is my solution using list comprehension to search for multiple file extensions recursively in a directory and all subdirectories:

import os, glob

def _globrec(path, *exts):
    """Glob recursively a directory and all subdirectories for multiple file extensions.

    Note: Glob is case-insensitive, i.e. for '\*.jpg' you will get files ending
    with .jpg and .JPG

    Parameters
    ----------
    path : str
        A directory name
    exts : tuple
        File extensions to glob for

    Returns
    -------
    files : list
        list of files matching extensions in exts in path and subfolders

    """
    dirs = [a[0] for a in os.walk(path)]
    f_filter = [d + e for d in dirs for e in exts]
    return [f for files in [glob.iglob(files) for files in f_filter] for f in files]

my_pictures = _globrec(r'C:\Temp', '\*.jpg', '\*.bmp', '\*.png', '\*.gif')
for f in my_pictures:
    print f

answered Aug 18 ’14 at 17:50

import sys, os, glob

dir_list = ["c:\\books\\heap"]

while len(dir_list) > 0:
    cur_dir = dir_list[0]
    del dir_list[0]
    list_of_files = glob.glob(cur_dir + '\\*')
    for book in list_of_files:
        if os.path.isfile(book):
            print(book)
        else:
            dir_list.append(book)

answered Jan 27 ’14 at 19:03

I modified the top answer in this posting and recently created this script, which will loop through all files in a given directory (searchdir) and the sub-directories under it, and print the filename, root directory, modified/creation dates, and size.

Hope this helps someone… and they can walk the directory and get fileinfo.

import time
import fnmatch
import os

def fileinfo(file):
    filename = os.path.basename(file)
    rootdir = os.path.dirname(file)
    lastmod = time.ctime(os.path.getmtime(file))
    creation = time.ctime(os.path.getctime(file))
    filesize = os.path.getsize(file)

    print "%s**\t%s\t%s\t%s\t%s" % (rootdir, filename, lastmod, creation, filesize)

searchdir = r'D:\Your\Directory\Root'
matches = []

for root, dirnames, filenames in os.walk(searchdir):
    ## for filename in fnmatch.filter(filenames, '*.c'):
    for filename in filenames:
        ## matches.append(os.path.join(root, filename))
        ## print matches
        fileinfo(os.path.join(root, filename))

answered Nov 15 ’14 at 13:39

Here is a solution that will match the pattern against the full path and not just the base filename.

It uses fnmatch.translate to convert a glob-style pattern into a regular expression, which is then matched against the full path of each file found while walking the directory.

re.IGNORECASE is optional, but desirable on Windows since the file system itself is not case-sensitive. (I didn’t bother compiling the regex because docs indicate it should be cached internally.)

import fnmatch
import os
import re

def findfiles(dir, pattern):
    patternregex = fnmatch.translate(pattern)
    for root, dirs, files in os.walk(dir):
        for basename in files:
            filename = os.path.join(root, basename)
            if re.search(patternregex, filename, re.IGNORECASE):
                yield filename

answered Jun 30 ’15 at 15:39

I needed a solution for Python 2.x that works fast on large directories.
I ended up with this:

import subprocess
foundfiles = subprocess.check_output("ls src/*.c src/**/*.c", shell=True)
for foundfile in foundfiles.splitlines():
    print foundfile

Note that you might need some exception handling in case ls doesn’t find any matching file.
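A hedged sketch of that exception handling (the paths are illustrative): `subprocess.check_output` raises `CalledProcessError` when `ls` exits non-zero, i.e. when nothing matches, so you can treat that case as "no files":

```python
import subprocess

def list_c_sources(cmd="ls src/*.c"):
    # ls exits non-zero when no file matches, which makes check_output
    # raise CalledProcessError; treat that as an empty result.
    try:
        return subprocess.check_output(
            cmd, shell=True, stderr=subprocess.DEVNULL
        ).splitlines()
    except subprocess.CalledProcessError:
        return []

for name in list_c_sources():
    print(name)
```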
answered Jun 23 ’17 at 10:20

I just realized that ls src/**/*.c only works if globstar option is enabled (shopt -s globstar) – see this answer for details. – Roman Jun 27 ’17 at 13:44

os.listdir(path='.')

Return a list containing the names of the entries in the directory given by path. The list is in arbitrary order, and does not include the special entries '.' and '..' even if they are present in the directory.

path may be a path-like object. If path is of type bytes (directly or indirectly through the PathLike interface), the filenames returned will also be of type bytes; in all other circumstances, they will be of type str.

This function can also support specifying a file descriptor; the file descriptor must refer to a directory.

Note

To encode str filenames to bytes, use fsencode().
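A quick illustration of the str/bytes behaviour described above, using a throwaway temp directory:

```python
import os
import tempfile

# Create a temp directory with one known file in it.
d = tempfile.mkdtemp()
open(os.path.join(d, "example.txt"), "w").close()

str_names = os.listdir(d)                 # str path -> str names
bytes_names = os.listdir(os.fsencode(d))  # bytes path -> bytes names
print(str_names, bytes_names)
```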

See also

The scandir() function returns directory entries along with file attribute information, giving better performance for many common use cases.

Changed in version 3.2: The path parameter became optional.

New in version 3.3: Added support for specifying an open file descriptor for path.

Changed in version 3.6: Accepts a path-like object.
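For comparison, a minimal os.scandir() sketch: each DirEntry exposes the name, full path, and type/stat checks that can usually be answered from information cached during the directory scan:

```python
import os
import tempfile

# Throwaway directory with one 5-byte file.
d = tempfile.mkdtemp()
with open(os.path.join(d, "data.txt"), "w") as fh:
    fh.write("hello")

with os.scandir(d) as it:
    for entry in it:
        # is_file() typically avoids an extra stat() call
        print(entry.name, entry.is_file(), entry.stat().st_size)
```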

os.scandir() vs os.walk(), any benefit/difference of directly using scandir? + filter files by extension (self.learnpython)

submitted 6 months ago by kwhali

Although os.walk() now uses os.scandir(), which is cited as giving performance improvements, is there still an advantage to using os.scandir() directly? The announcement mentions that it returns an iterator, which improves memory efficiency, and that entry.is_file() avoids an additional system call. Are those two things not also true of os.walk()?

https://docs.python.org/3/whatsnew/3.5.html#pep-471-os-scandir-function-a-better-and-faster-directory-iterator

The question above is general rather than about anything specific.

My current use case is a batch script that needs to traverse a resources directory containing multiple directories of images with a sequence number (eg resources/A/image.0001.png, resources/B/image.0001.png).

I'm walking the dirs to get a list of files, then using zip (with all lists as a *dirs arg) -> map -> list to get a list of lists containing the images for each frame.

I feel I should also ensure the files put into the initial lists filter out non-image extensions in case there are stray files (no idea whether things like thumbs.db or the macOS equivalent might get picked up if something generates them; we have both OSes and a NAS). Other than that, I figure as long as the number of files from each directory (after filtering) is the same, all the frames should in theory match up properly (ie. frames[0] == [A/image.0001.png, B/image.0001.png, ...]).

I've read that Python is meant to aim for one obvious way of doing things, but when it comes to filtering file extensions there seems to be a variety of approaches: regex, glob, endswith(), pathlib, etc.

[–]yawpitch 5 points 6 months ago (2 children)

Ok, answer to your first question: scandir does essentially the same thing as listdir (though not exactly the same thing), it is not recursive and does not do the same thing as walk. You are correct that walk now uses scandir internally, which makes it much faster than previous versions that used listdir.

Short answer: you might want to try a tool built specifically for file sequences (for instance fileseq, which, full disclosure, I'm a contributor on), but it's not entirely clear what you're trying to do. Also I'm honestly not sure how well it will work on 3, if at all, because VFX is a few years away from moving off 2, and that's where it gets all its use.

There is generally one obvious right way, but with file paths it can be a bit of a job of figuring out what your actual task is.

Generally:

if you need to know the contents of a directory you know the path to, os.scandir or os.listdir (subtle differences, check the docs).

if you need to walk a directory tree to list contents recursively, or find a specific directory or file you know only the name of, then os.walk.

if you need to find files that match a specific pattern, but have wildcard components, fnmatch.fnmatch if you are comparing paths you already know, glob.glob if you're searching the filesystem for those paths.
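A small sketch of those three cases against a throwaway tree (paths are illustrative):

```python
import fnmatch
import glob
import os
import tempfile

# Build a tiny tree: root/a.py and root/sub/b.py
root = tempfile.mkdtemp()
os.makedirs(os.path.join(root, "sub"))
open(os.path.join(root, "a.py"), "w").close()
open(os.path.join(root, "sub", "b.py"), "w").close()

# 1. contents of a directory you know the path to
top_level = sorted(os.listdir(root))

# 2. recursive listing of the whole tree
walked = sorted(os.path.join(r, f) for r, _, fs in os.walk(root) for f in fs)

# 3. wildcard patterns: fnmatch for names you already have,
#    glob for searching the filesystem
is_py = fnmatch.fnmatch("a.py", "*.py")
globbed = sorted(glob.glob(os.path.join(root, "*.py")))
print(top_level, len(walked), is_py, globbed)
```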

[–]kwhali[S] 1 point 6 months ago* (1 child)

I'm automating building a substance graph (Substance Designer, if you're familiar with it) based on a variable number of image inputs that should be available in a predictable-ish layout: a resources directory containing multiple directories, each of which should contain the exact same set of directories with the same names.

Those directory names will be along the lines of diffuse, normals, specular etc, each containing a sequence of images. The filenames, while consistent within a directory, don't appear to be consistent across directories (eg filenames in diffuse vs normals are quite different beyond the sequential numbers before their image extension). Some directories are empty; it's expected that if a directory is empty, all other directories with that name are likely empty as well and won't be processed. Directory depth shouldn't go any further than described, and any other directories can be ignored.

This is the structure/data I'm being given to work with, from a vendor's software I believe. For this task I need a Python script that identifies which folders to process and transposes the related images into a list of frames, eg file structure:

resources/a/diffuse/image.0001.png
resources/a/diffuse/image.0002.png
resources/a/normals/
resources/a/specular/simage.0001.png
resources/a/specular/simage.0002.png
resources/b/diffuse/image.0001.png
resources/b/diffuse/image.0002.png
resources/b/normals/
resources/b/specular/simage.0001.png
resources/b/specular/simage.0002.png

result:

images = {
    'diffuse': [
        ['resources/a/diffuse/image.0001.png', 'resources/b/diffuse/image.0001.png'],
        ['resources/a/diffuse/image.0002.png', 'resources/b/diffuse/image.0002.png']
    ],
    'specular': [
        ['resources/a/specular/simage.0001.png', 'resources/b/specular/simage.0001.png'],
        ['resources/a/specular/simage.0002.png', 'resources/b/specular/simage.0002.png']
    ]
}

I've got this working now, though it's a bit of a mess. I'm using os.scandir() directly and switched from os.path to pathlib (it seems better for supporting the multiple platforms I use). I check whether a directory contains any image files, otherwise I return any directories (it should only be one or the other); if there are none, the directory is empty and I skip it. I also sort the lists alphabetically, since scandir returning results in some other order has tripped me up, causing my frames to contain incorrect paths.

checking for image files:

dir_contents = list(os.scandir(dir_name))  # list to avoid a generator, as I use this var a few times

img_exts = ['png', 'tif', 'jpg']
def isImg(entry):
    p = pathlib.Path(entry.path)
    return entry.is_file() and any(p.suffix[1:].lower().strip() == ext for ext in img_exts)

# any() will short-circuit the iteration and return True if it comes across any accepted image
contains_img_files = any(isImg(entry) for entry in dir_contents)

I'd like to do some sort of validation check that the directories all have the same number of frames/images, as they should. Since the amount is unknown, I'd need to parse the directories and keep some sort of reference? Initially I thought of using set() with len() on the file lists, but that only tells me whether there is more than one value; I can't really tell what the right amount is if I don't keep a count, I guess?
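One way to sketch that validation (the directory names and file lists here are hypothetical): count the files per directory and flag any directory that deviates from the majority count:

```python
from collections import Counter

# Hypothetical per-directory file lists
frame_lists = {
    "a/diffuse": ["image.0001.png", "image.0002.png"],
    "b/diffuse": ["image.0001.png", "image.0002.png"],
    "a/specular": ["simage.0001.png"],  # one frame short
}

counts = {d: len(files) for d, files in frame_lists.items()}
expected = Counter(counts.values()).most_common(1)[0][0]  # majority count
mismatched = [d for d, n in counts.items() if n != expected]
print(expected, mismatched)
```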

if at all, because VFX is a few years away from moving off 2

I thought some VFX software like Maya was on 3 these days? They’re using PyQt5 and PySide2 for Qt UIs now aren’t they?

[–]yawpitch 1 point 6 months ago (0 children)

PyQt5 and PySide2 both work in 2, and all the major DCC app vendors (Autodesk, Foundry, etc) follow the current VFX Reference Platform which has just recently defined a schedule for moving to 3.

I'd suggest you look at glob.glob:

from glob import glob
for f in sorted(glob("resources/?/*/*.png")):
    print(f)

You could, for instance, build a dict like this:

from collections import defaultdict
from glob import glob
from os.path import split
data = defaultdict(list)
for f in sorted(glob("resources/?/*/*.png")):
    dirname, basename = split(f)
    data[dirname].append(basename)

And compare directories from there.

[–]destiny_functional 1 point 6 months ago (0 children)

in 3 you might want to use the pathlib module, specifically pathlib.Path.iterdir()
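A minimal sketch of that (iterdir is a method on Path, and rglob is its recursive, pattern-filtered sibling):

```python
import pathlib
import tempfile

# Throwaway tree: d/notes.txt and d/sub/more.txt
d = pathlib.Path(tempfile.mkdtemp())
(d / "notes.txt").touch()
(d / "sub").mkdir()
(d / "sub" / "more.txt").touch()

# single directory, yields Path objects
flat = sorted(p.name for p in d.iterdir())
# recursive, filtered by a glob pattern
deep = sorted(p.name for p in d.rglob("*.txt"))
print(flat, deep)
```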
How do I use os.scandir() to return DirEntry objects recursively on a directory tree?

Python 3.5’s os.scandir(path) function returns lightweight DirEntry objects that are very helpful with information about files. However, it only works for the immediate path handed to it. Is there a way to wrap it in a recursive function so that it visits all subdirectories beneath the given path?
edited Oct 14 ’15 at 20:41
asked Oct 14 ’15 at 20:31

Take a look at os.walk(). It may be a bit more heavy-handed than you're looking for, but it should be simpler than creating your own solution. – skrrgwasme Oct 14 '15 at 20:35

1 Answer

You can scan recursively using os.walk(), or if you need DirEntry objects or more control, write a recursive function like scantree() below:

try:
    from os import scandir
except ImportError:
    from scandir import scandir  # use scandir PyPI module on Python < 3.5

def scantree(path):
    """Recursively yield DirEntry objects for given directory."""
    for entry in scandir(path):
        if entry.is_dir(follow_symlinks=False):
            yield from scantree(entry.path)  # see below for Python 2.x
        else:
            yield entry

if __name__ == '__main__':
    import sys
    for entry in scantree(sys.argv[1] if len(sys.argv) > 1 else '.'):
        print(entry.path)

Notes:

There are a few more examples in PEP 471 and in the os.scandir() docs.
You can also add various logic in the for loop to skip directories or files starting with ‘.’ and that kind of thing.
You typically want follow_symlinks=False on the is_dir() calls in recursive functions like this, to avoid symlink loops.

On Python 2.x, replace the yield from line with:

for entry in scantree(entry.path):
    yield entry

edited Oct 14 ’15 at 20:50

Given that os.scandir only exists in Python 3.5, the Python 2 fallback code probably isn’t needed. 🙂 Edit: Ah, you wrote it to import the PyPI module if os.scandir didn’t exist, and I’m guessing the PyPI module is available for 2.7? – ShadowRanger Oct 14 ’15 at 20:46
@ShadowRanger Well, true, but this way it’ll work for Python < 3.5 (including Python 2.x) using my scandir module. 🙂 – Ben Hoyt Oct 14 ’15 at 20:48
@ShadowRanger I’ve added a code comment to clarify. – Ben Hoyt Oct 14 ’15 at 20:51