By: Siddharth Mehta | Last Updated: 2019-05-16 | Comments | Functions - System
String manipulation is one of the most fundamental data manipulation items used in almost every data processing exercise. Generally, string manipulation objectives are formatting, trimming, padding, replacing, casting and similar functions. To apply these functions, often one needs to select specific parts of the string, and then apply these transformations. In SQL Server, there are three languages that are often used in conjunction with each other – T-SQL, R and Python. All these programming / query languages offer string selection using substring or equivalent function / operators. In this tip we will learn how to use the substring function for string selection in all these three languages.
The substring function or equivalent operators in languages like T-SQL, R and Python enables selecting portions of a string and we will show how this can be done using T-SQL, R and Python when using SQL Server.
SQL Server T-SQL Substring Function
First, we will start with the substring function in the T-SQL language. The syntax of substring function is shown below.
SUBSTRING ( expression , start , length )
The expression means the actual string, field name or a variable of character data type. The start parameter is the starting position and length is the total length of the string to be selected.
In the below examples, the first example selects the part of the staring that starts at position 1 with a length of 5 characters from the index starting position. This means that from the string "hello world", "h" is the first character in the string and five characters from "h" would end at "o". So, the output would be the string "hello". The second example starts the selection from position -2, and the length parameter is 5, which will have the equivalent of starting at position 1 with a length of 3 and the output would be "he" as seen below.
We will now try some examples of using the substring function in the WHERE clause of a SELECT query. We can use the output from sys.tables as shown below. Consider that we want to select all the table names that have the word "Pro" in it.
We can use the substring function in the where clause and use the substring function as part of the filter criteria. In this substring function, one way to include the criteria is to mention the expression as the "name" field, start position as 1 and length as 3 as we know the word "Pro" is 3 characters long. As the criteria must be Boolean in the WHERE clause, we will use the substring function as stated and try to match it with the word "Pro". Execute the below query and the output would be all the table names that starts with the word "Pro".
The above result is only partially correct, as we get all the table names that starts with "Pro", but there are also possibilities where the word "Pro" would be in the middle or the end of the string. This brings up a situation where we need to find the start position of the word "Pro" dynamically. We can use the charindex function for this purpose. This function takes the string to search for as the first argument, the field / variable / value in which to search the string for as the next argument, and the start position as the third optional argument. Modify the SELECT query as shown below, and this time you will find all the table names having the word "Pro" in it.
Substring Function in R
Now let’s understand how substring works in the R programming language. R offers substring and substr functions that have equivalent functionality of the substring function in T-SQL. You can execute R code in T-SQL using the sp_execute_external_script stored procedure.
In the below examples, the first example demonstrates the use of the substr function which works exactly as the substring function in T-SQL. You can also use substring function, which takes the string, start position and end position as the arguments. You can use ranges for the start position and end position parameter.
In the second example, we are specifying the range of 1 to 5 for start position as well as the end position, and the result would be as shown below. You would get every letter as the output for the first five characters of the string, as the parameters would be executed as start position 1 and end position 1, start position 2 and end position 2 and so on. So, the output length is always 1 character.
If we slightly modify the code, and keep the start position constant and the end position as a range, the output would be in the order of increasing length as shown in the results below.
Substring Function in Python
Now let’s look at how to perform substring equivalent functionality in Python programming language.
String in python is treated as an array of Unicode characters. Python does not have a datatype equivalent of characters, as strings are treated just as arrays. Arrays in Python are accessed using ordinal positions surrounded by square brackets. So, let’s understand how to perform array operations on string literals.
In the first line of code, we are creating a string literal by assigning it a string value. In the next line of code, we are accessing the first character of the string literal by accessing the array element at position 1, which is the first letter of the string. If we use negative position, for example -4, then it would start the selection from the right side of the string. If you analyze the result for the third line of the code, you can see how negative ordinal position works.
Let’s say that we want to use it like a substring function where we want to specify the start and end position, in that case we can provide a range for the array element. In the fourth line of code we are providing the range of 1 to -4, means the selection would start from the first element and end at the fourth element from last. Analyze the result now with this logic, and you would understand how the selection is extracted from the original string.
Let’s consider another example, where we want to select every other 3rd character in a string. You can use the double colon operator in the array and mention the intervals as shown in the last line of the code, which is 3 in our case and the output would be a string with a selection of every third character as shown below.
In this way, you can perform string selection operations in different languages using the substring function or its equivalent with nearly similar syntax and / or parameters for string selection.
- Consider comparing the performance of substring functionality in different languages, so you can select the right language for performing substring operations on large volumes of data.
Last Updated: 2019-05-16
About the author
View all my tips