Read and write files

PDBTools can read and write PDB and mmCIF files. The relevant functions are:

PDBTools.read_pdb - Function
read_pdb(pdbfile::String, selection::String)
read_pdb(pdbfile::String; only::Function = all)

read_pdb(pdbdata::IOBuffer, selection::String)
read_pdb(pdbdata::IOBuffer; only::Function = all)

Reads a PDB file and stores the data in a vector of type Atom.

If a selection is provided, only the atoms matching the selection will be read. For example, resname ALA will select all the atoms in the residue ALA.

If the only function keyword is provided, only the atoms for which only(atom) is true will be read.

Examples

julia> protein = read_pdb("../test/structure.pdb")
   Array{Atoms,1} with 62026 atoms with fields:
   index name resname chain   resnum  residue        x        y        z  beta occup model segname index_pdb
       1    N     ALA     A        1        1   -9.229  -14.861   -5.481  0.00  1.00     1    PROT         1
       2  HT1     ALA     A        1        1  -10.048  -15.427   -5.569  0.00  0.00     1    PROT         2
                                                       ⋮ 
   62025   H1    TIP3     C     9339    19638   13.218   -3.647  -34.453  0.00  1.00     1    WAT2     62025
   62026   H2    TIP3     C     9339    19638   12.618   -4.977  -34.303  0.00  1.00     1    WAT2     62026

julia> ALA = read_pdb("../test/structure.pdb","resname ALA")
   Array{Atoms,1} with 72 atoms with fields:
   index name resname chain   resnum  residue        x        y        z  beta occup model segname index_pdb
       1    N     ALA     A        1        1   -9.229  -14.861   -5.481  0.00  1.00     1    PROT         1
       2  HT1     ALA     A        1        1  -10.048  -15.427   -5.569  0.00  0.00     1    PROT         2
                                                       ⋮ 
    1339    C     ALA     A       95       95   14.815   -3.057   -5.633  0.00  1.00     1    PROT      1339
    1340    O     ALA     A       95       95   14.862   -2.204   -6.518  0.00  1.00     1    PROT      1340

julia> ALA = read_pdb("../test/structure.pdb", only = atom -> atom.resname == "ALA")
   Array{Atoms,1} with 72 atoms with fields:
   index name resname chain   resnum  residue        x        y        z  beta occup model segname index_pdb
       1    N     ALA     A        1        1   -9.229  -14.861   -5.481  0.00  1.00     1    PROT         1
       2  HT1     ALA     A        1        1  -10.048  -15.427   -5.569  0.00  0.00     1    PROT         2
                                                       ⋮ 
    1339    C     ALA     A       95       95   14.815   -3.057   -5.633  0.00  1.00     1    PROT      1339
    1340    O     ALA     A       95       95   14.862   -2.204   -6.518  0.00  1.00     1    PROT      1340
PDBTools.read_mmcif - Function
read_mmcif(mmCIF_file::String, selection::String)
read_mmcif(mmCIF_file::String; only::Function = all)

read_mmcif(mmCIF_data::IOBuffer, selection::String)
read_mmcif(mmCIF_data::IOBuffer; only::Function = all)

Reads an mmCIF file and stores the data in a vector of type Atom.

If a selection is provided, only the atoms matching the selection will be read. For example, resname ALA will select all the atoms in the residue ALA.

If the only function keyword is provided, only the atoms for which only(atom) is true will be returned.

Examples

julia> ats = read_mmcif(PDBTools.SMALLCIF)
   Array{Atoms,1} with 7 atoms with fields:
   index name resname chain   resnum  residue        x        y        z occup  beta model segname index_pdb
       1    N     VAL     A        1        1    6.204   16.869    4.854  1.00 49.05     1                 1
       2   CA     VAL     A        1        1    6.913   17.759    4.607  1.00 43.14     1                 2
       3    C     VAL     A        1        1    8.504   17.378    4.797  1.00 24.80     1                 3
       5   CB     VAL     A        1        1    6.369   19.044    5.810  1.00 72.12     1                 5
       6  CG1     VAL     A        1        1    7.009   20.127    5.418  1.00 61.79     1                 6
       7  CG2     VAL     A        1        1    5.246   18.533    5.681  1.00 80.12     1                 7

julia> ats = read_mmcif(PDBTools.SMALLCIF, "index < 3")
   Array{Atoms,1} with 2 atoms with fields:
   index name resname chain   resnum  residue        x        y        z occup  beta model segname index_pdb
       1    N     VAL     A        1        1    6.204   16.869    4.854  1.00 49.05     1                 1
       2   CA     VAL     A        1        1    6.913   17.759    4.607  1.00 43.14     1                 2

julia> ats = read_mmcif(PDBTools.SMALLCIF; only = at -> name(at) == "CA")
   Array{Atoms,1} with 1 atoms with fields:
   index name resname chain   resnum  residue        x        y        z occup  beta model segname index_pdb
       2   CA     VAL     A        1        1    6.913   17.759    4.607  1.00 43.14     1                 2
Note

The following examples illustrate the read_pdb function. The read_mmcif function, which reads mmCIF (PDBx) files, is used in the same way.

Read a PDB file

To read a PDB file and return a vector of atoms of type Atom, do:

atoms = read_pdb("file.pdb")

Atom is the data structure that holds the atom index, name, residue, coordinates, etc. For example, after reading a file (as shown below), each element of the resulting vector has the following structure:

julia> printatom(atoms[1])
   index name resname chain   resnum  residue        x        y        z  beta occup model segname index_pdb
       1    N     ALA     A        1        1   -9.229  -14.861   -5.481  0.00  1.00     1    PROT         1

The data in the Atom structure is organized as indicated in the following documentation:

PDBTools.Atom - Type
Atom::DataType

Structure that contains the atom properties. It is mutable, so its fields can be modified.

Fields:

mutable struct Atom{CustomType}
    index::Int32 # The sequential index of the atoms in the file
    index_pdb::Int32 # The index as written in the PDB file (might be anything)
    name::String7 # Atom name
    resname::String7 # Residue name
    chain::String3 # Chain identifier
    resnum::Int32 # Number of residue as written in PDB file
    residue::Int32 # Sequential residue (molecule) number in file
    x::Float32 # x coordinate
    y::Float32 # y coordinate
    z::Float32 # z coordinate
    beta::Float32 # temperature factor
    occup::Float32 # occupancy
    model::Int32 # model number
    segname::String7 # Segment name (cols 73:76)
    pdb_element::String3 # Element symbol string (cols 77:78)
    charge::Float32 # Charge (cols: 79:80)
    custom::CustomType # Custom fields
end

Example

julia> using PDBTools

julia> atoms = read_pdb(PDBTools.SMALLPDB)
   Array{Atoms,1} with 35 atoms with fields:
   index name resname chain   resnum  residue        x        y        z occup  beta model segname index_pdb
       1    N     ALA     A        1        1   -9.229  -14.861   -5.481  0.00  0.00     1    PROT         1
       2 1HT1     ALA     A        1        1  -10.048  -15.427   -5.569  0.00  0.00     1    PROT         2
       3  HT2     ALA     A        1        1   -9.488  -13.913   -5.295  0.00  0.00     1    PROT         3
                                                       ⋮
      33  OD2     ASP     A        3        3   -6.974  -11.289   -9.300  1.00  0.00     1    PROT        33
      34    C     ASP     A        3        3   -2.626  -10.480   -7.749  1.00  0.00     1    PROT        34
      35    O     ASP     A        3        3   -1.940  -10.014   -8.658  1.00  0.00     1    PROT        35

julia> resname(atoms[1])
"ALA"

julia> chain(atoms[1])
"A"

julia> element(atoms[1])
"N"

julia> mass(atoms[1])
14.0067

julia> position(atoms[1])
3-element StaticArraysCore.SVector{3, Float32} with indices SOneTo(3):
  -9.229
 -14.861
  -5.481

The pdb_element and charge fields, which are frequently left empty in PDB files, are not printed. The direct access to the fields is considered part of the interface.
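As a minimal sketch of what this means in practice, the getter functions shown above and direct field access can be used interchangeably. The snippet assumes the small example structure PDBTools.SMALLPDB used in the example above:

```julia
using PDBTools

atoms = read_pdb(PDBTools.SMALLPDB)
at = atoms[1]

# Getter function and direct field access give the same residue name
println(resname(at), " ", at.resname)

# Coordinates are available as plain fields as well
println((at.x, at.y, at.z))
```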

Custom fields can be set on Atom construction with the custom keyword argument. The Atom structure will then be parameterized with the type of custom.

Example

julia> using PDBTools

julia> atom = Atom(index = 0; custom=Dict(:c => "c", :index => 1));

julia> typeof(atom)
Atom{Dict{Symbol, Any}}

julia> atom.custom
Dict{Symbol, Any} with 2 entries:
  :index => 1
  :c     => "c"

julia> atom.custom[:c]
"c"
Tip

For all these reading and writing functions, a final argument can be provided to read or write a subset of the atoms, following the selection syntax described in the Selection section. For example:

protein = read_pdb("file.pdb","protein")

or

arginines = read_pdb("file.pdb","resname ARG")

Alternatively, to filter the atoms with a Julia function (for example, an anonymous function), pass it through the only keyword:

arginines = read_pdb("file.pdb", only = atom -> atom.resname == "ARG")

The same is valid for the write function, below.
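A short sketch of a selection applied on writing, assuming the example structure PDBTools.SMALLPDB shipped with the package. The temporary file name comes from Julia's tempname() and is only illustrative:

```julia
using PDBTools

# Read the small example structure shipped with PDBTools
atoms = read_pdb(PDBTools.SMALLPDB)

# Write only the alpha carbons to a temporary file
ca_file = tempname() * ".pdb"
write_pdb(ca_file, atoms, "name CA")

# Reading the file back returns only the atoms that were written
ca_atoms = read_pdb(ca_file)
println(length(ca_atoms), " of ", length(atoms), " atoms written")
```

The selection string is the same one that read_pdb accepts, so a filter can be applied at either end of the workflow.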

Retrieve from Protein Data Bank

Use the wget function to retrieve the atom data directly from the PDB database, optionally filtering the atoms with a selection:

julia> atoms = wget("1LBD","name CA")
   index name resname chain   resnum  residue        x        y        z  beta occup model segname index_pdb
       2   CA     SER     A      225        1   46.080   83.165   70.327 68.73  1.00     1       -         2
       8   CA     ALA     A      226        2   43.020   80.825   70.455 63.69  1.00     1       -         8
      13   CA     ASN     A      227        3   41.052   82.178   67.504 53.45  1.00     1       -        13
                                                       ⋮
    1847   CA     GLN     A      460      236  -22.650   79.082   50.023 71.46  1.00     1       -      1847
    1856   CA     MET     A      461      237  -25.561   77.191   51.710 78.41  1.00     1       -      1856
    1864   CA     THR     A      462      238  -26.915   73.645   51.198 82.96  1.00     1       -      1864
PDBTools.wget - Function
wget(PDBid, [selection]; format="mmCIF")

Retrieves a structure file from the Protein Data Bank. Selections may be applied.

The optional format argument can be either "mmCIF" or "PDB". The default is "mmCIF". To download the data of large structures, it is recommended to use the "mmCIF" format.

Example

julia> protein = wget("1LBD","chain A")
   Array{Atoms,1} with 1870 atoms with fields:
   index name resname chain   resnum  residue        x        y        z  beta occup model segname index_pdb
       1    N     SER     A      225        1   45.228   84.358   70.638 67.05  1.00     1       -         1
       2   CA     SER     A      225        1   46.080   83.165   70.327 68.73  1.00     1       -         2
       3    C     SER     A      225        1   45.257   81.872   70.236 67.90  1.00     1       -         3
                                                       ⋮ 
    1868  OG1     THR     A      462      238  -27.462   74.325   48.885 79.98  1.00     1       -      1868
    1869  CG2     THR     A      462      238  -27.063   71.965   49.222 78.62  1.00     1       -      1869
    1870  OXT     THR     A      462      238  -25.379   71.816   51.613 84.35  1.00     1       -      1870

Edit a PDB file

The Atom structure is mutable, meaning that the fields can be edited. For example:

julia> atoms = read_pdb("file.pdb")
   Array{PDBTools.Atom,1} with 62026 atoms with fields:
   index name resname chain   resnum  residue        x        y        z  beta occup model segname index_pdb
       1    N     ALA     A        1        1   -9.229  -14.861   -5.481  0.00  1.00     1    PROT         1
       2  HT1     ALA     A        1        1  -10.048  -15.427   -5.569  0.00  0.00     1    PROT         2
       3  HT2     ALA     A        1        1   -9.488  -13.913   -5.295  0.00  0.00     1    PROT         3

julia> atoms[1].segname = "ABCD"
"ABCD"

julia> printatom(atoms[1])
   index name resname chain   resnum  residue        x        y        z  beta occup model segname index_pdb
       1    N     ALA     A        1        1   -9.229  -14.861   -5.481  0.00  1.00     1    ABCD         1
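The same field assignment can be applied to many atoms in a loop. The following is a minimal sketch using the example structure PDBTools.SMALLPDB. The segment name "SEGA" is an arbitrary value chosen for illustration:

```julia
using PDBTools

atoms = read_pdb(PDBTools.SMALLPDB)

# Rename the segment of every atom that belongs to chain A
for at in atoms
    if chain(at) == "A"
        at.segname = "SEGA"
    end
end

printatom(atoms[1])
```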

Additionally, with the edit! function, you can directly edit or view the data in a vector of Atoms in your preferred text editor.

julia> edit!(atoms)

This will open a text editor. Here, we modified the data in the resname field of the first atom to ABC. Saving and closing the file will update the atoms array:

julia> printatom(atoms[1])
   index name resname chain   resnum  residue        x        y        z  beta occup model segname index_pdb
       1    N     ABC     A        1        1   -9.229  -14.861   -5.481  0.00  1.00     1    PROT         1
PDBTools.edit! - Function
edit!(atoms::AbstractVector{<:Atom})

Opens a temporary PDB file in which the fields of the vector of atoms can be edited.


Write a PDB file

To write a PDB file use the write_pdb function, as:

write_pdb("file.pdb", atoms)

where atoms is a vector of Atom structures.

PDBTools.write_pdb - Function
write_pdb(filename::String, atoms::AbstractVector{<:Atom}, selection; header=:auto, footer=:auto)

Write a PDB file with the atoms in atoms to filename. The selection argument is a string that can be used to select a subset of the atoms in atoms. For example, write_pdb("test.pdb", atoms, "name CA").

The header and footer arguments can be used to add a header and footer to the PDB file. If header is :auto, then a header will be added with the number of atoms in atoms. If footer is :auto, then a footer will be added with the "END" keyword. Either can be set to nothing if no header or footer is desired.
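A minimal sketch of suppressing both the header and the footer. It assumes that the keywords can be combined with the two-argument form write_pdb(filename, atoms) shown earlier. The temporary file name is illustrative:

```julia
using PDBTools

atoms = read_pdb(PDBTools.SMALLPDB)

# Suppress both the automatic header and the "END" footer
bare_file = tempname() * ".pdb"
write_pdb(bare_file, atoms; header=nothing, footer=nothing)

# The atom records are unaffected: the file reads back with the same number of atoms
atoms_back = read_pdb(bare_file)
println(length(atoms_back) == length(atoms))
```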

PDBTools.write_mmcif - Function
write_mmcif(filename, atoms::AbstractVector{<:Atom}, [selection])

Write an mmCIF file with the atoms in atoms to filename. The optional selection argument is a string that can be used to select a subset of the atoms in atoms. For example, write_mmcif("test.cif", atoms, "name CA").


Read from string buffer

In some cases the data of a PDB file may be available as a string rather than as a regular file, for example when reading the output of a zipped file. In these cases, the array of atoms can be obtained by reading the string buffer directly, for example:

julia> pdbdata = read(pdb_file, String); # returns a string with the PDB data, to exemplify

julia> atoms = read_pdb(IOBuffer(pdbdata), "protein and name CA")
   Array{Atoms,1} with 104 atoms with fields:
   index name resname chain   resnum  residue        x        y        z occup  beta model segname index_pdb
       5   CA     ALA     A        1        1   -8.483  -14.912   -6.726  1.00  0.00     1    PROT         5
      15   CA     CYS     A        2        2   -5.113  -13.737   -5.466  1.00  0.00     1    PROT        15
      26   CA     ASP     A        3        3   -3.903  -11.262   -8.062  1.00  0.00     1    PROT        26
                                                       ⋮ 
    1425   CA     GLU     A      102      102    4.414   -4.302   -7.734  1.00  0.00     1    PROT      1425
    1440   CA     CYS     A      103      103    4.134   -7.811   -6.344  1.00  0.00     1    PROT      1440
    1454   CA     THR     A      104      104    3.244  -10.715   -8.603  1.00  0.00     1    PROT      1454
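The read_mmcif function accepts an IOBuffer in the same way. As a minimal sketch, the example file PDBTools.SMALLCIF is loaded into a string here only to stand in for data obtained from another source:

```julia
using PDBTools

# Load the raw text of the small mmCIF example shipped with PDBTools
cifdata = read(PDBTools.SMALLCIF, String)

# Parse the atoms directly from the in-memory buffer, keeping only alpha carbons
ca_atoms = read_mmcif(IOBuffer(cifdata), "name CA")
println(length(ca_atoms), " atom(s) read from the buffer")
```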