Read and write files

PDBTools can read and write PDB and mmCIF files. The relevant functions are:

PDBTools.read_pdb — Function

read_pdb(pdbfile::String, selection::String)
read_pdb(pdbfile::String, selection_function::Function = all)

read_pdb(pdbdata::IOBuffer, selection::String)
read_pdb(pdbdata::IOBuffer, selection_function::Function = all)

Reads a PDB file and stores the data in a vector of type Atom.

If a selection is provided, only the atoms matching the selection will be read. For example, resname ALA will select all the atoms in the residue ALA.

If a selection function is provided, only the atoms for which selection_function(atom) is true will be read.

Examples

julia> using PDBTools

julia> protein = read_pdb(PDBTools.TESTPDB)
   Vector{Atom{Nothing}} with 62026 atoms with fields:
   index name resname chain   resnum  residue        x        y        z occup  beta model segname index_pdb
       1    N     ALA     A        1        1   -9.229  -14.861   -5.481  0.00  0.00     1    PROT         1
       2  HT1     ALA     A        1        1  -10.048  -15.427   -5.569  0.00  0.00     1    PROT         2
⋮
   62025   H1    TIP3     C     9339    19638   13.218   -3.647  -34.453  1.00  0.00     1    WAT2     62025
   62026   H2    TIP3     C     9339    19638   12.618   -4.977  -34.303  1.00  0.00     1    WAT2     62026

julia> ALA = read_pdb(PDBTools.TESTPDB,"resname ALA")
   Vector{Atom{Nothing}} with 72 atoms with fields:
   index name resname chain   resnum  residue        x        y        z occup  beta model segname index_pdb
       1    N     ALA     A        1        1   -9.229  -14.861   -5.481  0.00  0.00     1    PROT         1
       2  HT1     ALA     A        1        1  -10.048  -15.427   -5.569  0.00  0.00     1    PROT         2
⋮
    1339    C     ALA     A       95       95   14.815   -3.057   -5.633  1.00  0.00     1    PROT      1339
    1340    O     ALA     A       95       95   14.862   -2.204   -6.518  1.00  0.00     1    PROT      1340

julia> ALA = read_pdb(PDBTools.TESTPDB, atom -> atom.resname == "ALA")
   Vector{Atom{Nothing}} with 72 atoms with fields:
   index name resname chain   resnum  residue        x        y        z occup  beta model segname index_pdb
       1    N     ALA     A        1        1   -9.229  -14.861   -5.481  0.00  0.00     1    PROT         1
       2  HT1     ALA     A        1        1  -10.048  -15.427   -5.569  0.00  0.00     1    PROT         2
⋮
    1339    C     ALA     A       95       95   14.815   -3.057   -5.633  1.00  0.00     1    PROT      1339
    1340    O     ALA     A       95       95   14.862   -2.204   -6.518  1.00  0.00     1    PROT      1340


PDBTools.read_mmcif — Function

read_mmcif(mmCIF_file::String, selection::String; field_assignment)
read_mmcif(mmCIF_file::String, selection_function::Function = all; field_assignment)

read_mmcif(mmCIF_data::IOBuffer, selection::String; field_assignment)
read_mmcif(mmCIF_data::IOBuffer, selection_function::Function = all; field_assignment)

Reads an mmCIF file and stores the data in a vector of type Atom.

All arguments except the file name are optional.

If a selection is provided, only the atoms matching the selection will be read. For example, resname ALA will select all the atoms in the residue ALA.

If a selection function is provided, only the atoms for which selection_function(atom) is true will be returned.

The field_assignment keyword is nothing (default) or a Dict{String,Symbol} and can be used to specify which fields in the mmCIF file should be read into the Atom type. For example field_assignment = Dict("type_symbol" => :name) will read the _atom_site.type_symbol field in the mmCIF file into the name field of the Atom type.

The default assignment follows the standard mmCIF convention:

Dict{String,Symbol}(
    "id" => :index_pdb,
    "Cartn_x" => :x,
    "Cartn_y" => :y,
    "Cartn_z" => :z,
    "occupancy" => :occup,
    "B_iso_or_equiv" => :beta,
    "pdbx_formal_charge" => :charge,
    "pdbx_PDB_model_num" => :model,
    "label_atom_id" => :name,
    "label_comp_id" => :resname,
    "label_asym_id" => :chain,
    "auth_seq_id" => :resnum,
    "type_symbol" => :pdb_element,
)

Source: https://mmcif.wwpdb.org/docs/tutorials/content/atomic-description.html

Examples

julia> using PDBTools

julia> ats = read_mmcif(PDBTools.TESTCIF)
   Vector{Atom{Nothing}} with 76 atoms with fields:
   index name resname chain   resnum  residue        x        y        z occup  beta model segname index_pdb
       1    N     GLY     A        1        1   -4.564   25.503   24.113  1.00 24.33     1                 1
       2   CA     GLY     A        1        1   -4.990   26.813   24.706  1.00 24.29     1                 2
⋮
      75    O     HOH     Q       63       15   -3.585   34.725   20.903  1.00 19.82     1              2980
      76    O     HOH     Q       64       16   -4.799   40.689   37.419  1.00 20.13     1              2981

julia> ats = read_mmcif(PDBTools.TESTCIF, "index < 3")
   Vector{Atom{Nothing}} with 2 atoms with fields:
   index name resname chain   resnum  residue        x        y        z occup  beta model segname index_pdb
       1    N     GLY     A        1        1   -4.564   25.503   24.113  1.00 24.33     1                 1
       2   CA     GLY     A        1        1   -4.990   26.813   24.706  1.00 24.29     1                 2

julia> ats = read_mmcif(PDBTools.TESTCIF, at -> name(at) == "CA")
   Vector{Atom{Nothing}} with 11 atoms with fields:
   index name resname chain   resnum  residue        x        y        z occup  beta model segname index_pdb
       2   CA     GLY     A        1        1   -4.990   26.813   24.706  1.00 24.29     1                 2
       6   CA     GLN     A        2        2   -4.738   30.402   23.484  1.00 23.74     1                 6
⋮
      70   CA      CA     G     1003       10  -24.170   27.201   64.364  1.00 27.40     1              2967
      71   CA      CA     H     1004       11  -10.624   32.854   69.292  1.00 29.53     1              2968


Note

The following examples use the read_pdb function. The read_mmcif function is used in the same way to read mmCIF (PDBx) files.

Read a PDB/mmCIF file

To read a PDB file and return a vector of atoms of type Atom, do:

using PDBTools
atoms = read_pdb(PDBTools.test_dir*"/structure.pdb")

   Vector{Atom{Nothing}} with 62026 atoms with fields:
   index name resname chain   resnum  residue        x        y        z occup  beta model segname index_pdb
       1    N     ALA     A        1        1   -9.229  -14.861   -5.481  0.00  0.00     1    PROT         1
       2  HT1     ALA     A        1        1  -10.048  -15.427   -5.569  0.00  0.00     1    PROT         2
⋮
   62025   H1    TIP3     C     9339    19638   13.218   -3.647  -34.453  1.00  0.00     1    WAT2     62025
   62026   H2    TIP3     C     9339    19638   12.618   -4.977  -34.303  1.00  0.00     1    WAT2     62026

Atom{Nothing} is the default structure of data containing the atom index, name, residue, coordinates, etc. The Nothing refers to the content of the custom atom fields.

The data in the Atom structure is organized as indicated in the following documentation:

PDBTools.Atom — Type

Atom::DataType

Structure that contains the atom properties. It is mutable, so its fields can be modified.

Fields:

mutable struct Atom{CustomType}
    index::Int32 # The sequential index of the atoms in the file
    index_pdb::Int32 # The index as written in the PDB file (might be anything)
    name::String7 # Atom name
    resname::String7 # Residue name
    chain::String7 # Chain identifier
    resnum::Int32 # Number of residue as written in PDB file
    residue::Int32 # Sequential residue (molecule) number in file
    x::Float32 # x coordinate
    y::Float32 # y coordinate
    z::Float32 # z coordinate
    beta::Float32 # temperature factor
    occup::Float32 # occupancy
    model::Int32 # model number
    segname::String7 # Segment name (cols 73:76)
    pdb_element::String3 # Element symbol string (cols 77:78)
    charge::Float32 # Charge (cols: 79:80)
    custom::CustomType # Custom fields
    flag::Int8 # Flag for internal use
end

Example

julia> using PDBTools

julia> atoms = read_pdb(PDBTools.SMALLPDB)
   Vector{Atom{Nothing}} with 35 atoms with fields:
   index name resname chain   resnum  residue        x        y        z occup  beta model segname index_pdb
       1    N     ALA     A        1        1   -9.229  -14.861   -5.481  0.00  0.00     1    PROT         1
       2 1HT1     ALA     A        1        1  -10.048  -15.427   -5.569  0.00  0.00     1    PROT         2
⋮
      34    C     ASP     A        3        3   -2.626  -10.480   -7.749  1.00  0.00     1    PROT        34
      35    O     ASP     A        3        3   -1.940  -10.014   -8.658  1.00  0.00     1    PROT        35

julia> resname(atoms[1])
"ALA"

julia> chain(atoms[1])
"A"

julia> element(atoms[1])
"N"

julia> mass(atoms[1])
14.0067f0

julia> position(atoms[1])
3-element StaticArraysCore.SVector{3, Float32} with indices SOneTo(3):
  -9.229
 -14.861
  -5.481

The pdb_element and charge fields, which are frequently left empty in PDB files, are not printed. The direct access to the fields is considered part of the interface.
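As a minimal sketch of what direct field access looks like in practice, the snippet below reads the bundled PDBTools.SMALLPDB file used above and compares field access with the getter functions. The variable names first_atom and xs are illustrative only.

```julia
using PDBTools

atoms = read_pdb(PDBTools.SMALLPDB)
first_atom = atoms[1]

# Direct field access and the getter functions expose the same data
same_resname = first_atom.resname == resname(first_atom)
same_name = first_atom.name == name(first_atom)

# A field can be collected over the whole vector with a comprehension
xs = [atom.x for atom in atoms]

# Fields that are not shown in the printed table remain accessible
elem = first_atom.pdb_element
chg = first_atom.charge
```

Because the fields are part of the interface, the same syntax can be used on the left-hand side of an assignment to modify an atom.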

Custom fields can be set on Atom construction with the custom keyword argument. The Atom structure will then be parameterized with the type of custom.

Example

julia> using PDBTools

julia> atom = Atom(index = 0; custom=Dict(:c => "c", :index => 1));

julia> typeof(atom)
Atom{Dict{Symbol, Any}}

julia> atom.custom
Dict{Symbol, Any} with 2 entries:
  :index => 1
  :c     => "c"

julia> atom.custom[:c]
"c"


Tip

For all these reading and writing functions, a final argument can be provided to read or write a subset of the atoms, following the selection syntax described in the Selection section. For example:

protein = read_pdb("file.pdb","protein")

arginines = read_pdb("file.pdb","resname ARG")

Instead of the selection strings, a Julia function can be provided, for greater flexibility:

arginines = read_pdb("file.pdb", atom -> atom.resname == "ARG")

The same is valid for the write function, below.
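The flexibility of a selection function can be sketched with a predicate that combines two conditions. The example uses the bundled PDBTools.SMALLPDB file instead of "file.pdb" so that it is self-contained, and the name is_ala_ca is hypothetical.

```julia
using PDBTools

# A named predicate that combines two conditions on the Atom fields
is_ala_ca(atom) = atom.resname == "ALA" && atom.name == "CA"

# Only the atoms for which the predicate returns true are read
ala_ca = read_pdb(PDBTools.SMALLPDB, is_ala_ca)

# The equivalent selection written as a selection string
ala_ca_str = read_pdb(PDBTools.SMALLPDB, "resname ALA and name CA")
```

Any Julia function that takes an atom and returns a Bool can be used in place of is_ala_ca.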

Write a PDB/mmCIF file

To write a PDB file use the write_pdb function, as:

write_pdb("file.pdb", atoms)

where atoms is a vector of Atom structures.
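A minimal round-trip sketch, assuming the bundled PDBTools.SMALLPDB test file and a temporary output path generated with tempname() (both chosen here only for illustration):

```julia
using PDBTools

atoms = read_pdb(PDBTools.SMALLPDB)

# Write only the alpha carbons to a temporary file
outfile = tempname() * ".pdb"
write_pdb(outfile, atoms, "name CA")

# Read the file back to check what was written
ca_atoms = read_pdb(outfile)
```

Reading the file back is a simple way to confirm that the selection wrote the intended subset of atoms.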

PDBTools.write_pdb — Function

write_pdb(filename::String, atoms::AbstractVector{<:Atom}, [selection]; header=:auto, footer=:auto, append=false)

Write a PDB file with the atoms in atoms to filename. The selection argument is a string or function that can be used to select a subset of the atoms in atoms. For example, write_pdb("test.pdb", atoms, "name CA").

Arguments

filename::String: The name of the file to write.
atoms::AbstractVector{<:Atom}: The atoms to write to the file.

Optional positional argument

selection: A selection string or function to select a subset of the atoms in atoms.

Keyword arguments

header::Union{String, Nothing}=:auto: The header to add to the PDB file. If :auto, a header will be added with the number of atoms in atoms.
footer::Union{String, Nothing}=:auto: The footer to add to the PDB file. If :auto, a footer will be added with the "END" keyword.
append::Bool=false: If true, the atoms will be appended to the file instead of overwriting it.
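The keyword arguments can be combined to build a file in several steps. The following is only a sketch based on the signature above: it assumes that passing nothing suppresses the header or footer, as the Union{String, Nothing} annotation suggests, and it uses a temporary path from tempname() for illustration.

```julia
using PDBTools

atoms = read_pdb(PDBTools.SMALLPDB)
outfile = tempname() * ".pdb"

# First call: create the file with the automatic header and no footer
write_pdb(outfile, atoms, "resname ALA"; footer=nothing)

# Second call: append more atoms, without a header, and close with the automatic footer
write_pdb(outfile, atoms, "resname CYS"; header=nothing, append=true)

# Read the combined file back for inspection
combined = read_pdb(outfile)
```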


PDBTools.write_mmcif — Function

write_mmcif(filename, atoms::AbstractVector{<:Atom}, [selection]; field_assignment=nothing)

Write an mmCIF file with the atoms in atoms to filename. The optional selection argument is a string or function that can be used to select a subset of the atoms in atoms. For example, write_mmcif("test.cif", atoms, "name CA").

The optional field_assignment argument is a dictionary that can be used to assign custom fields to the mmCIF file.


The field_assignment keyword, explained in the field assignment section, can also be used in calls to write_mmcif.
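Writing mmCIF files follows the same pattern as write_pdb. The sketch below assumes the bundled PDBTools.TESTCIF file and a temporary output path from tempname(), and uses the default field assignment:

```julia
using PDBTools

atoms = read_mmcif(PDBTools.TESTCIF)

# Write the atoms named "CA" to a temporary mmCIF file
outfile = tempname() * ".cif"
write_mmcif(outfile, atoms, "name CA")

# Read the file back and compare with the same selection applied in memory
ca_atoms = read_mmcif(outfile)
n_expected = count(at -> name(at) == "CA", atoms)
```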

Get structure from the Protein Data Bank

PDBTools.wget — Function

wget(PDBid, [selection]; format="mmCIF")

Retrieves a structure file from the Protein Data Bank. Selections may be applied.

The optional format argument can be either "mmCIF" or "PDB". The default is "mmCIF". To download the data of large structures, it is recommended to use the "mmCIF" format.

Example

julia> using PDBTools

julia> protein = wget("1LBD","chain A")
   Vector{Atom{Nothing}} with 1870 atoms with fields:
   index name resname chain   resnum  residue        x        y        z occup  beta model segname index_pdb
       1    N     SER     A      225        1   45.228   84.358   70.638  1.00 67.05     1                 1
       2   CA     SER     A      225        1   46.080   83.165   70.327  1.00 68.73     1                 2
⋮
    1869  CG2     THR     A      462      238  -27.063   71.965   49.222  1.00 78.62     1              1869
    1870  OXT     THR     A      462      238  -25.379   71.816   51.613  1.00 84.35     1              1870


Use the wget function to retrieve the atom data directly from the PDB database, optionally filtering the atoms with a selection:

atoms = wget("1LBD","name CA")

   Vector{Atom{Nothing}} with 238 atoms with fields:
   index name resname chain   resnum  residue        x        y        z occup  beta model segname index_pdb
       2   CA     SER     A      225        1   46.080   83.165   70.327  1.00 68.73     1                 2
       8   CA     ALA     A      226        2   43.020   80.825   70.455  1.00 63.69     1                 8
⋮
    1856   CA     MET     A      461      237  -25.561   77.191   51.710  1.00 78.41     1              1856
    1864   CA     THR     A      462      238  -26.915   73.645   51.198  1.00 82.96     1              1864

Atom field assignment in mmCIF files

By default, the assignment of the _atom_site fields of the mmCIF format to the fields of the Atom data structure follows the standard mmCIF convention:

Dict{String,Symbol}(
    "id" => :index_pdb,
    "Cartn_x" => :x,
    "Cartn_y" => :y,
    "Cartn_z" => :z,
    "occupancy" => :occup,
    "B_iso_or_equiv" => :beta,
    "pdbx_formal_charge" => :charge,
    "pdbx_PDB_model_num" => :model,
    "label_atom_id" => :name,
    "label_comp_id" => :resname,
    "label_asym_id" => :chain,
    "auth_seq_id" => :resnum,
    "type_symbol" => :pdb_element,
)

This assignment can be customized with the field_assignment keyword of the read_mmcif function. As a baseline, reading a file with the default assignment stores the atom names (the label_atom_id field) in the name field of the Atom data structure:

atoms = read_mmcif(PDBTools.test_dir*"/small.cif", "index <= 5");
name.(atoms)

5-element Vector{InlineStrings.String7}:
 "N"
 "CA"
 "C"
 "O"
 "N"

If, however, we attribute the name field to the type_symbol mmCIF field, which contains the element symbols, we get:

atoms = read_mmcif(PDBTools.TESTCIF, "index <= 5";
   field_assignment=Dict("type_symbol" => :name)
)
name.(atoms)

5-element Vector{InlineStrings.String7}:
 "N"
 "C"
 "C"
 "O"
 "N"

The custom entries set in the field_assignment keyword overwrite the default assignments for entries that share keys or fields. For instance, in the example above, the label_atom_id field, which is assigned to :name by default, is no longer read.

Read from string buffer

In some cases the data of a PDB file is available as a string rather than as a regular file, for example when it is obtained by decompressing a zipped file. In these cases, the array of atoms can be obtained by reading the string buffer directly.

To illustrate, the following call to read returns the unparsed contents of a PDB file as a string:

pdbdata = read(PDBTools.test_dir*"/small.pdb", String);

"HEADER    PDBTools.jl - 35 atoms                  04-Apr-24\nATOM      1  N   ALA A   1      -9.229 -14.861  -5.481  0.00  0.00      PROT N\nATOM      2 1HT1 ALA A   1     -10.048 -15.427  -5.569  0.00  0.00      PROT H\nATOM      3  HT2 ALA A   1      -9.488 -13.913  -5.295  0.00  0.00      PROT H\nATOM      4  HT3 ALA A   1      -8.652 -15.208  -4.741  0.00  0.00      PROT H\nATOM      5  CA  ALA A   1      -8" ⋯ 2009 bytes ⋯ "    PROT H\nATOM     31  CG  ASP A   3      -5.867 -10.850  -9.684  1.00  0.00      PROT C\nATOM     32  OD1 ASP A   3      -5.451 -10.837 -10.863  1.00  0.00      PROT O\nATOM     33  OD2 ASP A   3      -6.974 -11.289  -9.300  1.00  0.00      PROT O\nATOM     34  C   ASP A   3      -2.626 -10.480  -7.749  1.00  0.00      PROT C\nATOM     35  O   ASP A   3      -1.940 -10.014  -8.658  1.00  0.00      PROT O\nEND\n"

This string can be passed to the read_pdb function wrapped in an IOBuffer:

atoms = read_pdb(IOBuffer(pdbdata), "protein and name CA")

   Vector{Atom{Nothing}} with 3 atoms with fields:
   index name resname chain   resnum  residue        x        y        z occup  beta model segname index_pdb
       5   CA     ALA     A        1        1   -8.483  -14.912   -6.726  1.00  0.00     1    PROT         5
      15   CA     CYS     A        2        2   -5.113  -13.737   -5.466  1.00  0.00     1    PROT        15
      26   CA     ASP     A        3        3   -3.903  -11.262   -8.062  1.00  0.00     1    PROT        26
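The signatures of read_mmcif listed earlier also accept an IOBuffer, so the same approach should apply to mmCIF data held in memory. A minimal sketch, assuming the bundled PDBTools.TESTCIF file as the source of the string:

```julia
using PDBTools

# Load the raw mmCIF text into a string, without parsing it
cifdata = read(PDBTools.TESTCIF, String)

# Parse the string through an IOBuffer, applying a selection
ca_atoms = read_mmcif(IOBuffer(cifdata), "name CA")
```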

Edit a Vector{<:Atom} object

The Atom structure is mutable, meaning that the fields can be edited. For example:

julia> using PDBTools

julia> atoms = read_pdb(PDBTools.TESTPDB)
   Vector{Atom{Nothing}} with 62026 atoms with fields:
   index name resname chain   resnum  residue        x        y        z occup  beta model segname index_pdb
       1    N     ALA     A        1        1   -9.229  -14.861   -5.481  0.00  0.00     1    PROT         1
       2  HT1     ALA     A        1        1  -10.048  -15.427   -5.569  0.00  0.00     1    PROT         2
⋮
   62025   H1    TIP3     C     9339    19638   13.218   -3.647  -34.453  1.00  0.00     1    WAT2     62025
   62026   H2    TIP3     C     9339    19638   12.618   -4.977  -34.303  1.00  0.00     1    WAT2     62026

julia> atoms[1].segname = "ABCD"
"ABCD"

julia> printatom(atoms[1])
   index name resname chain   resnum  residue        x        y        z occup  beta model segname index_pdb
       1    N     ALA     A        1        1   -9.229  -14.861   -5.481  0.00  0.00     1    ABCD         1
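The same field assignment can be applied to many atoms in a loop. The following sketch uses the smaller bundled PDBTools.SMALLPDB file, and the segment name "SEGA" is an arbitrary value chosen for illustration:

```julia
using PDBTools

atoms = read_pdb(PDBTools.SMALLPDB)

# Assign a new segment name to every atom of the alanine residues
for atom in atoms
    if atom.resname == "ALA"
        atom.segname = "SEGA"
    end
end

# Count the atoms that were modified
n_modified = count(atom -> atom.segname == "SEGA", atoms)
```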

Additionally, with the edit! function, you can directly edit or view the data in a vector of Atoms in your preferred text editor.

julia> edit!(atoms)

This will open a text editor. Here, we modified the data in the resname field of the first atom to ABC. Saving and closing the file will update the atoms array:

julia> printatom(atoms[1])
   index name resname chain   resnum  residue        x        y        z  beta occup model segname index_pdb
       1    N     ABC     A        1        1   -9.229  -14.861   -5.481  0.00  1.00     1    PROT         1

PDBTools.edit! — Function

edit!(atoms::AbstractVector{<:Atom})

Opens a temporary PDB file in which the fields of the vector of atoms can be edited.
