If I were to name my top ten modules from the standard library,
pathlib would be high on that list. It could even make it to the top three.
Manipulating paths was always a tricky problem if your code was supposed to work on different operating systems. If you accidentally hardcoded the
./some/nested/folders path in your Python package, Windows users would complain that your code doesn't work on their computers. And the other way around – a hardcoded
some\\nested\\folder path wouldn't work on a Mac or a Linux machine.
Even if you figured out how to make paths work on different operating systems, the functions you can use with file paths are a bit scattered around different modules. Sure, most of them live in the
os.path module. But if you want to search for filenames matching a pattern, you must use the
glob() function from the
glob module. For moving files around, there is
os.rename but also
shutil.move (which actually calls
os.rename unless the destination is on a different disk). When searching for all the places in the code where files are moved, you must remember to check both functions. Unless, you know, someone used the third option:
os.replace. Then you have to check all three.
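To make the difference concrete, here is a sketch of the three ways to move a file (file names are made up for illustration):

```python
import os
import shutil

# Create a file to move around
with open("old.txt", "w") as f:
    f.write("data")

os.rename("old.txt", "tmp.txt")      # raises on Windows if the target exists
shutil.move("tmp.txt", "moved.txt")  # falls back to copy+delete if the target is on another disk
os.replace("moved.txt", "new.txt")   # silently overwrites the target on every platform

print(os.path.exists("new.txt"))  # True
os.remove("new.txt")  # clean up
```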
Luckily, thanks to PEP-428, since version 3.4 of CPython, we have a wonderful tool that makes working with paths much easier. Just look at this piece of code:
from pathlib import Path
p = Path('/')
q = p / 'some' / 'nested' / 'folder'
q.resolve() # PosixPath('/some/nested/folder')
Overloading the division operator is a bit unusual, but it's so smart and perfectly suitable for path manipulation that I find this code simply beautiful.
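Under the hood, this works through Python's standard operator-overloading protocol: pathlib defines a `__truediv__` method on its path classes. A toy class can mimic the idea (this is an illustration, not pathlib's actual implementation):

```python
class ToyPath:
    def __init__(self, *parts):
        self.parts = parts

    def __truediv__(self, other):
        # "/" on a ToyPath appends a new path component
        return ToyPath(*self.parts, other)

    def __str__(self):
        return "/".join(self.parts) or "/"

p = ToyPath("") / "some" / "nested" / "folder"
print(p)  # /some/nested/folder
```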
About the "Writing Faster Python" series
"Writing Faster Python" is a series of short articles discussing how to solve some common problems with different code structures. I run some benchmarks, discuss the difference between each code snippet, and finish with some personal recommendations.
Are those recommendations going to make your code much faster? Not really.
Is knowing those small differences going to make you a slightly better Python programmer? Hopefully!
You can read more about some assumptions I made, the benchmarking setup, and answers to some common questions in the Introduction article. And you can find most of the code examples in this repository.
The Path object makes working with paths easier in a couple of other ways:
- It normalizes paths to platform defaults. Path('some/path') gives you some\path on Windows, and PosixPath('some/path') on my Macbook.
- It ignores extraneous "." path separators, so Path('./some/./path') is equivalent to Path('some/path').
- The Path object also tries to be smart about the front slashes. If you use too many (Path('//////some/path')), it removes the redundant ones on Linux or Mac, and returns PosixPath('/some/path').
- It unifies the API for various file manipulation operations that previously required using different Python modules. You no longer need the glob module to search for files matching a pattern, and you don't need the os module to get the names of their directories. All this functionality can now be found in the pathlib module (although, of course, you can still use the glob and os modules if you prefer).
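As a quick sketch of that unified API (the method names are real pathlib methods; the file name is made up):

```python
from pathlib import Path

script = Path("some/nested/folder/example.py")

# What used to require os.path functions:
print(script.name)    # example.py  (os.path.basename)
print(script.suffix)  # .py  (os.path.splitext)
print(script.parent)  # some/nested/folder on POSIX  (os.path.dirname)

# What used to require the glob module:
python_files = Path(".").glob("*.py")  # an iterator of Path objects
```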
But is it faster?
So yeah, all sunshine and rainbows, but we are here to answer one fundamental question: is pathlib faster than os.path?
Before I try to run the benchmarks, my guess is that it's not.
Path() is an object-oriented approach to path manipulation. Instantiating an object probably takes longer than calling, for example,
os.path.join (which simply spits out a string).
But even if it's slower, I would be curious by how much. Besides, who knows, maybe my gut feeling is wrong?
This time, I'm using a different approach to benchmarking because there is no one standard way to use
pathlib. Sure, we can use it to create a path to a file, but we can also use it to print the current directory, list files with names matching a given pattern, or even quickly write text to a file.
I'm going to run a series of benchmarks for different tasks and see how much faster (or slower) it is to use
pathlib instead of other functions.
First, let's benchmark probably the most common use case: joining directory names to create a full path to a file.
import os
from pathlib import Path

def os_path_join():
    return os.path.join("/", "some", "nested", "path", "to", "a", "file.txt")

def pathlib_join():
    return Path("/") / "some" / "nested" / "path" / "to" / "a" / "file.txt"
$ python -m timeit -s "from pathlib_benchmarks import os_path_join" "os_path_join()"
200000 loops, best of 5: 1.22 usec per loop
$ python -m timeit -s "from pathlib_benchmarks import pathlib_join" "pathlib_join()"
50000 loops, best of 5: 5.74 usec per loop
In a scenario where I initialize a Path() instance and then append multiple folders using the / operator, Path can be over four times as slow as os.path.join (5.74/1.22 ≈ 4.70). And no matter if I create a path from 2 or 20 folders, Path is always around four or five times as slow as os.path.join:
def os_path_join_short():
    return os.path.join("/", "file.txt")

def pathlib_join_short():
    return Path("/") / "file.txt"

def os_path_join_long():
    return os.path.join("/", "an", "even", "longer", "path", "to", "some",
        "nested", "folder", "of", "a", "nested", "and", "nested", "and",
        "nested", "and", "nested", "path", "to", "file.txt")

def pathlib_join_long():
    return (Path("/") / "an" / "even" / "longer" / "path" / "to" / "some" / "nested"
        / "folder" / "of" / "a" / "nested" / "and" / "nested" / "and" / "nested"
        / "and" / "nested" / "path" / "to" / "file.txt")
$ python -m timeit -s "from pathlib_benchmarks import os_path_join_short" "os_path_join_short()"
1000000 loops, best of 5: 345 nsec per loop
$ python -m timeit -s "from pathlib_benchmarks import pathlib_join_short" "pathlib_join_short()"
200000 loops, best of 5: 1.69 usec per loop
$ python -m timeit -s "from pathlib_benchmarks import os_path_join_long" "os_path_join_long()"
100000 loops, best of 5: 3.57 usec per loop
$ python -m timeit -s "from pathlib_benchmarks import pathlib_join_long" "pathlib_join_long()"
20000 loops, best of 5: 17.3 usec per loop
Using an existing Path object
What if it's the
Path("/") creation that takes a lot of time and the concatenation of folders' names is actually fast? To check this, I will extract
Path("/") to a global variable outside of the benchmarked function. Then, I can either reference the global variable directly, or pass it as a parameter to the benchmarked function. No matter which solution I choose, they both take a similar amount of time.
ROOT = Path("/")

def pathlib_join_existing_object(root=ROOT):
    return root / "some" / "nested" / "path" / "to" / "a" / "file.txt"
$ python -m timeit -s "from pathlib_benchmarks import pathlib_join_existing_object" "pathlib_join_existing_object()"
50000 loops, best of 5: 4.85 usec per loop
pathlib_join_existing_object() is slightly faster than pathlib_join (featured in the initial benchmarks), but still much slower than using os.path.join.
As @randallpittman pointed out in the comments, it seems that it's actually the concatenation of paths that makes
Path slower in my benchmarks. If I pass all the paths directly as parameters, then it gets faster. Take a look at those two scenarios and their benchmarks:
def pathlib_multiple_args():
    return Path("/", "some", "nested", "path", "to", "a", "file.txt")

def pathlib_full_path():
    return Path("/some/nested/path/to/a/file.txt")
$ python -m timeit -s "from pathlib_benchmarks import pathlib_multiple_args" "pathlib_multiple_args()"
100000 loops, best of 5: 2.21 usec per loop
$ python -m timeit -s "from pathlib_benchmarks import pathlib_full_path" "pathlib_full_path()"
200000 loops, best of 5: 1.4 usec per loop
Both pathlib_multiple_args and pathlib_full_path are now much faster. In fact, pathlib_full_path is only around 15% slower than os_path_join (1.4/1.22 ≈ 1.15).
Starting from the home folder
One more test - what if we don't want to start from the root folder but from the home folder of the current user? Both modules have functions that return the home folder, so let's combine them with some additional folders and benchmark that:
def os_path_join_home():
    return os.path.join(os.path.expanduser("~"), "some", "nested", "path", "to", "a", "file.txt")

def pathlib_join_home():
    return Path.home() / "some" / "nested" / "path" / "to" / "a" / "file.txt"
$ python -m timeit -s "from pathlib_benchmarks import os_path_join_home" "os_path_join_home()"
100000 loops, best of 5: 2.12 usec per loop
$ python -m timeit -s "from pathlib_benchmarks import pathlib_join_home" "pathlib_join_home()"
50000 loops, best of 5: 8.01 usec per loop
The difference is smaller (8.01/2.12≈3.78), but the
os module still wins this round.
1:0 for the os module.
Let's test some other common operations on file paths.
Is it a file?
Time for a second round of benchmarks. Let's compare the performance of functions that check if the object under a given path is a file (and not a directory):
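The two helpers under test are not shown in the original snippet; presumably they are thin wrappers like these (the names match the benchmark commands below):

```python
import os
from pathlib import Path

def os_isfile(path):
    # True if path points to an existing regular file
    return os.path.isfile(path)

def pathlib_is_file(path):
    # Same check, but builds a Path object first
    return Path(path).is_file()
```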
And to make my benchmarks more complete, I will look for a file that exists but also for one that doesn't:
# First, a file that exists
$ python -m timeit -s "from pathlib_benchmarks import os_isfile" "os_isfile('pathlib_benchmarks.py')"
100000 loops, best of 5: 2.28 usec per loop
$ python -m timeit -s "from pathlib_benchmarks import pathlib_is_file" "pathlib_is_file('pathlib_benchmarks.py')"
50000 loops, best of 5: 4.12 usec per loop
# And a file that doesn't
$ python -m timeit -s "from pathlib_benchmarks import os_isfile" "os_isfile('non-existing-file')"
200000 loops, best of 5: 1.02 usec per loop
$ python -m timeit -s "from pathlib_benchmarks import pathlib_is_file" "pathlib_is_file('non-existing-file')"
100000 loops, best of 5: 2.82 usec per loop
In both scenarios
os.path is still faster, although the difference is smaller than in the first set of benchmarks.
Path.is_file is around twice as slow when the file exists (4.12/2.28≈1.81) and around three times as slow when it doesn't exist (2.82/1.02≈2.76).
Get the current directory
How about getting the current directory?
$ python -m timeit -s "import os" "os.getcwd()"
50000 loops, best of 5: 6.75 usec per loop
$ python -m timeit -s "from pathlib import Path" "Path.cwd()"
50000 loops, best of 5: 8.54 usec per loop
os.getcwd() is faster by around 30% this time (8.54/6.75≈1.27).
Find all the files matching a pattern
Let's try something more complex. This time, I want to recursively find all the Python files (that is, files with the ".py" extension).
If I really need to stick with the
os module, I could write something like this:
def os_walk_files():
    python_files = []
    for root, dirs, files in os.walk("."):
        for filename in files:
            if filename.endswith(".py"):
                python_files.append(os.path.join(root, filename))
    return python_files
But it's much easier to use the
glob module instead. That way we just need one line of code:
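That one-liner is missing from the snippet; presumably it looks something like this (the function name matches the benchmark commands below):

```python
import glob

def glob_find_files():
    # recursive=True makes "**" match any number of nested directories
    return glob.glob("**/*.py", recursive=True)
```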
pathlib comes with a similar function called rglob(). But there are two important distinctions between this function and glob.glob():
- Path().rglob() returns Path objects, while glob.glob() returns strings. I assume we are ok with Path objects because they work fine for opening files indicated by the file paths or for printing those paths. I don't see a reason to convert them to inferior strings (inferior in terms of what we can do with them). If you really need strings, remember that you must additionally call str() on each Path object.
- Both os_walk_files() and glob_find_files() return lists, but Path().rglob() returns a generator. To make the results of all the examples as similar as possible to each other, I will convert this generator to a list (which will slow down my benchmarks). If I don't do this, Path.rglob will have an unfair advantage, as creating a generator is much faster than building a list. But in general, if you want to iterate over those files, there is no point in converting a generator to a list first. Moreover, if the list of files is huge, a generator will be much more memory-efficient.
Here is the
pathlib version of a function to find all the Python files:
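The pathlib version is missing from the snippet; presumably it is something like this (converted to a list for the reasons explained above):

```python
from pathlib import Path

def path_find_files():
    # rglob("*.py") is equivalent to glob("**/*.py") and yields Path objects
    return list(Path(".").rglob("*.py"))
```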
Let's run the benchmarks:
$ python -m timeit -s "from pathlib_benchmarks import os_walk_files" "os_walk_files()"
5000 loops, best of 5: 80.6 usec per loop
$ python -m timeit -s "from pathlib_benchmarks import glob_find_files" "glob_find_files()"
2000 loops, best of 5: 152 usec per loop
$ python -m timeit -s "from pathlib_benchmarks import path_find_files" "path_find_files()"
2000 loops, best of 5: 156 usec per loop
The most verbose version that includes two loops and an
if statement still turns out to be almost twice as fast as using the
glob (152/80.6≈1.89) or
pathlib (156/80.6 ≈1.94) modules.
That puts our benchmarking score at I-have-lost-track-a-long-time-ago to 0 for the os module.
Quickly write to a file
Another interesting feature of
pathlib is that you can quickly write some text or bytes to a file.
Below is a comparison of
Path().write_text() and the classic
with open() context manager. We open a file (or create it, if it doesn't exist) in write mode and replace the previous content with some simple text:
def classic_write():
    with open("a_file.txt", "w") as f:
        f.write("some text")
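The pathlib counterpart is not shown in the snippet; presumably it is a one-liner like this (the exact text written is my assumption, matching the benchmark name below):

```python
from pathlib import Path

def pathlib_write():
    # write_text opens the file in write mode and replaces its content
    Path("a_file.txt").write_text("some text")
```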
$ python -m timeit -s "from pathlib_benchmarks import classic_write" "classic_write()"
5000 loops, best of 5: 55.3 usec per loop
$ python -m timeit -s "from pathlib_benchmarks import pathlib_write" "pathlib_write()"
5000 loops, best of 5: 55.8 usec per loop
They both take the same amount of time (no matter if the
a_file.txt already exists or not). No wonder -
write_text() is actually just a nice little wrapper around the
with open code.
If you're curious, there is also a wrapper for reading the content from a file. The wrapper is called
read_text() and has a similar performance as its
with open(<file>, 'r') equivalent.
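A quick sketch of the two equivalent ways to read a file (the file name and content are made up):

```python
from pathlib import Path

Path("a_file.txt").write_text("hello")

# The classic way
with open("a_file.txt", "r") as f:
    content = f.read()

# The pathlib wrapper does the same in one line
print(content == Path("a_file.txt").read_text())  # True
```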
The list of various tasks we can perform with
pathlib can go on for much longer. Creating, deleting, reading, writing, finding, moving, copying, splitting, and whatever other operation you want to perform on a file path or a file itself -
pathlib probably has a function for that. Sure, os.path or some other module can do those things faster. But unless file manipulation is the main bottleneck in a program (which I really doubt is a problem for anyone anymore, with large-memory VMs being easily accessible in the cloud), I much prefer to use pathlib.
It's nice to finally have a single module with all the functionality related to paths and files. And I love this object-oriented approach to file paths. It makes writing scripts for filesystem manipulation much more fun, making Python an even better replacement for bash scripts.
You can find all the code examples from this article in my blog-resources repository.
If you want to learn more about all the cool things you can do with the
pathlib module, I can recommend these two articles:
- Should You Use Python pathlib or os?
- Don't Use Python OS Library Any More When Pathlib Can Do (sorry for the paywall, just open this page in an incognito mode)
I have absolutely nothing against bash or bash scripts. That's still my go-to tool if I need to glue together a few shell commands. But if you're not a bash expert (and neither are your colleagues) and you need a script that will run once per year (or even better – one that restores the database in case of an emergency, because there is nothing better than debugging a bash script when your production is on fire), do yourself a favor and write it in Python. Your future self will thank you when debugging this script five years later. ↩︎